AI Adventures in Azure: Uploading data to the VM

There are many ways to transfer data from local storage to the virtual machine. Azure provides Blob storage for unstructured data, managed through the user’s storage account, as well as specific storage options for files and tables. There is also the option to use Azure Data Lake storage. These are all useful for storing large datasets and integrating them into processing pipelines within Azure.

However, in this post I will cover some simpler options for transferring smaller files (for example scripts, small images or datasets) onto the VM itself, just to make the essential data available for code development there. There are two main options: uploading to third-party cloud storage, or sharing folders through the remote desktop connection.

1) Upload data to a third party cloud storage account:

This could be an Azure storage account, Google Drive, OneDrive, Dropbox or similar, or an FTP site. Upload from the local computer, then start up and log into the VM and download the files directly to the VM hard drive. This is quite clunky and time-consuming compared to a direct transfer.
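As a concrete example, once logged in to the VM a file shared via a public or pre-signed link can be pulled straight down from the terminal (just a sketch; the URL and filename here are placeholders for a real share link):

# make a landing directory on the VM and download the file from the share link
mkdir -p ~/data
wget -O ~/data/test_image.tif "https://example.com/share/test_image.tif"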

2) Share files using the remote desktop connection:

In the X2Go client there is an option to set session preferences. Clicking this brings up a menu with a tab named “shared folders”. Add the local folders you want to share and check the boxes for “mount automatically”. These folders are then available to the VM, and files can be copied and pasted between the local and remote machines.
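Assuming the connection is made with X2Go (as described in the earlier post on remote desktop access), the folder sharing is done over sshfs, so a quick way to confirm that a share is connected, and to find out where it has been mounted on the VM, is to list the sshfs mounts from a terminal on the VM (a sketch; the exact mount location depends on the X2Go setup):

# list sshfs mounts on the VM to see where the shared folders appear
mount | grep -i sshfs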

Other, Azure-optimised data transfer and storage options will be covered in a later post!


AI Adventures in Azure: Choosing VM Size

The main purpose of a VM is to accelerate scripts compared to running them locally on a laptop or desktop by outsourcing the computation to a more powerful remote computer. There is an overwhelming number of options for Azure VM sizes, each of which is optimised for a particular purpose, so to get the best performance for a specific application it’s important to choose the right one. I started with no clue which VM would be right for me. I’m using the VM to apply scikit-learn algorithms to large images obtained from drones and satellites, which is memory hungry but “embarrassingly parallel” (meaning it is easy to separate the computation into chunks and distribute them across several individual cores).

I started off by prioritising access to lots of cores, thinking that distributing widely would be the best way to accelerate my code, so I initially opted for the NC24 series VM, which has 24 available cores. However, the NC24 was not noticeably faster than running the code locally on my laptop, which has 8 available cores. Since the NC24 is relatively expensive, and benchmark tests showed no noticeable speed-up from 8 to 24 cores, I switched to a more affordable NC6. This did not slow the script down at all relative to the NC24, suggesting that the number of cores is not what limits the speed of my script. To be sure, I briefly allocated a 64-core VM and ran the benchmark script again. There was clearly no need to pay extra for more cores, so the NC6 became my main VM for a while.

However, the experiments with the NC series VMs showed that there was no real benefit to paying for VM access relative to running locally on my laptop, at least in terms of benchmark script completion time, so I explored some compute-optimised options instead. The F16s-v2 worked nicely and was cheap compared to the NC series; however, it suffered from memory overload when running the larger benchmark scripts. This led to a switch to a memory-optimised E20s-v3 VM (20 vCPUs, 160 GB RAM, 32,000 max IOPS). This VM outperforms my laptop and the other VM sizes I have tested for my particular image processing application.

So far, I am very happy with the performance of the E20s-v3 VM and will stick with it for a while, although I am interested in the announcement of the new Lsv2 series.

It was obviously extremely useful to have a benchmark script and image to compare the VMs. On an image-by-image basis the acceleration has a minor impact, but it will become more important as I start to scale to automated processing of large numbers of images.
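For anyone wanting to run a similar comparison, nothing sophisticated is needed; running something along these lines from a terminal on each candidate VM size gives the headline numbers (a sketch; benchmark.py stands in for whichever script and test image you use):

nproc                     # number of cores visible to the VM
free -h                   # total and available RAM
time python benchmark.py  # wall-clock time for the benchmark run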

All the VMs were running an Ubuntu 16.04 LTS Data Science Virtual Machine image, and the benchmarking used an identical Python script run in PyCharm.

AI Adventures in Azure: Accessing the VM via terminal or remote desktop

Accessing the Data Science Virtual Machine

Once the virtual machine is set up and started (by clicking “start” on the appropriate VM in the Azure portal) there are several ways to interface with it. The first is via the terminal (I am running Ubuntu 16.04 on both my local machine and the virtual machine). To connect to the virtual machine from the terminal, we can use secure shell (SSH). This requires a pair of keys that are used for encryption and decryption and keep the connection between the local and virtual machines secure. These keys are unique to your system and need to be generated, which can be done from the command line.

Generating ssh keys:

Option 1 is to use the terminal on your local machine. In Ubuntu, the following command will generate an RSA key pair (RSA is a method of encryption named after Rivest, Shamir and Adleman, who first proposed it) with a length of 2048 bits:

ssh-keygen -t rsa -b 2048

Alternatively, the Azure command-line interface (Azure CLI) can be used. The Azure CLI is a command-line tool that can be installed and run from the existing terminal, or run in a web browser via the Azure Cloud Shell, and is used to manage Azure resources, including VMs, with an Azure-specific syntax. To create a VM together with an SSH key pair using the Azure CLI (the other required parameters, such as the VM image, are omitted here):

az vm create --name VMname --resource-group RGname --generate-ssh-keys

Regardless of the method used to generate them, ssh key pairs are stored by default in

~/.ssh

and to view the public key the following bash command can be used:

cat ~/.ssh/id_rsa.pub

The public key displayed by this command is the value to provide when configuring the VM, so it is worth keeping it somewhere handy; the private key (~/.ssh/id_rsa) stays on the local machine and should be kept secure. Together, the keys enable access to the VM through the command line (local terminal or Azure CLI). Alternatively, the virtual machine can be configured with a desktop that can be accessed using a remote desktop client; this requires some further VM configuration, described in the next section.
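Before moving on to the desktop setup, it is worth noting what the command-line connection looks like once the keys are in place (a sketch; azureuser and the IP address are placeholders for your own VM’s admin username and public IP):

# connect to the VM over SSH using the private key generated above
ssh -i ~/.ssh/id_rsa azureuser@<VM-public-IP>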


To set up remote desktop

The ssh keys created earlier can be used to access the VM through the terminal, and the terminal can then be used to install a desktop GUI on the VM. I chose the lightweight GUI LXDE to run on my Ubuntu VM. To install LXDE use the command:

sudo apt-get install lxde -y

To install the remote desktop support for LXDE:

sudo apt-get install xrdp -y

Then start XRDP running on the VM:

sudo /etc/init.d/xrdp start
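To have xrdp come back up automatically whenever the VM is restarted, it can also be enabled as a service (this assumes the systemd-based Ubuntu 16.04 image used here):

# register xrdp to start automatically at boot
sudo systemctl enable xrdp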

Then the VM needs to be configured to allow remote desktop connections. This can be done via the Azure portal (portal.azure.com): log in with your Azure username and password, start the VM by clicking “start” on the dashboard, then navigate to the inbound security rules:

resource group > network security > inbound security rules > add >

A list of configuration options is then available; they should be updated to the following settings:

Source: Any

Source port ranges: *

Destination: Any

Destination port ranges: 3389

Protocol: TCP

Action: Allow
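The same rule can also be added from the Azure CLI rather than the portal; something along these lines should work, assuming a reasonably recent CLI version (NSGname is a placeholder for the network security group attached to the VM):

# open TCP port 3389 for remote desktop traffic on the VM's network security group
az network nsg rule create --resource-group RGname --nsg-name NSGname \
  --name AllowRDP --priority 1000 --direction Inbound --access Allow \
  --protocol Tcp --destination-port-ranges 3389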


Finally, a remote desktop client is required on the local machine. I chose the X2Go client, which is available from the Ubuntu software centre or can be installed from the terminal using apt-get (see the command below). Once the remote desktop client is installed, the system is ready for remote access to the VM using a desktop GUI.
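For the terminal route, the install is a one-liner (assuming the x2goclient package is available in the Ubuntu repositories enabled on the local machine; otherwise the X2Go project provides its own repository):

# install the X2Go client on the local Ubuntu machine
sudo apt-get install x2goclient -y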

Remote Access to VM using Desktop GUI:

  1. The VM must first be started; this can be done via the Azure portal after logging in with the usual Azure credentials (username and password) and clicking “start” on the dashboard. Copy the VM IP address to the clipboard.
  2. Open X2Go Client and configure a new session:
    1. Host = VM IP address
    2. Login = Azure login name
    3. SSH port: 22
    4. Session Type = XFCE
  3. These credentials can be saved under a named session so logging in subsequently just requires clicking on the session icon in X2Go (although the IP address for the VM is dynamic by default, so it will need updating each time; a command for looking up the current IP is shown just after this list).
  4. An LXDE GUI will open!
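Since the public IP address can change between sessions, a quick way to look up the current value before connecting is via the Azure CLI (a sketch; VMname and RGname are the names used when the VM was created):

# print the VM's current public IP address
az vm show --resource-group RGname --name VMname -d --query publicIps --output tsv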


Remember that closing the remote desktop does not stop the Azure VM – the VM must be stopped by clicking “stop” on the dashboard on the Azure portal.

AI Adventures in Azure

A lot of my work at the moment requires quite computationally heavy geospatial analysis that stretches the processing capabilities of my laptop. I invested in a pretty powerful machine – i7-7700 processor, 32GB RAM – and sped things up by spreading the load across cores and threads, but it can still be locked up for hours when processing very large datasets. For this reason, I have started exploring cloud computing. My platform of choice is Microsoft Azure. Being new to Azure and cloud computing in general, I thought it would be helpful for me to keep notes of my learning as I climb onboard, and also thought it could be useful to make the notes public for others who might be following the same path.

I’ll be blogging these notes as “Adventures in Azure”. I’m predominantly a Linux user and the notes will focus on Linux virtual machines on Azure. My programming will almost all be in Python. The end-goal is to be proficient with machine learning applied to remote sensing image analysis in the cloud.


I’m certain I will find fugly ways to do things and I will be grateful for any suggestions for refinements!


1. Setting Up Linux Data Science Virtual Machine

I’m not going to write up notes for this as it was so easy! I created an Azure account with a Microsoft email address, then chose a virtual machine image preloaded with the essentials – Ubuntu, Anaconda (Python 2.7 and 3.5), JupyterHub, PyCharm, TensorFlow and NVIDIA drivers – amongst a range of other useful software designed specifically for data science. Microsoft call it the “Data Science Virtual Machine” and the setup instructions are simple to follow. I opted for a standard NC6 (which has 6 vCPUs and 56 GB memory) as this is a significant step up in terms of processing power from my local machine, but comes at an affordable hourly rate.

Once the virtual machine is established, there is still a fair amount of configuring to do before using it for geospatial projects. The next post will contain info about ways to work with Python on the virtual machine.

Machine Learning: An unexplored horizon for Polar science

I recently published an article in Open Access Government about the potential for machine learning technologies to revolutionise Polar science, with a focus on optical remote sensing data from drones and satellites. You can read it online or download it from OAGov_Oct18


Preparing the polar observation drone for data collection in Svalbard – machine learning technologies are ideal for extracting value from this dataset (ph. Marc Latzel/Rolex)

Upernavik Field Work 2018

2018 saw the Black & Bloom postdocs exploring a new field site in the north western sector of the Greenland Ice Sheet. After two seasons working in the south west near Kangerlussuaq, the team migrated north to investigate dark ice where the melt seasons are shorter and the temperatures lower.

Beautiful Upernavik, viewed from the airport (ph. J Cook)

We soon learned that there were additional challenges to working up here beyond the colder weather. Upernavik itself is on a small island in an archipelago near where the ice sheet flows and calves into the sea. While this produces spectacular icebergs, it also means access to the ice sheet is possible only by helicopter. The same helicopter serves local communities elsewhere in the archipelago with food, transport and other essential services. While we were in Upernavik, a huge iceberg floated into the harbour at nearby Innaarsuit, threatening the town with a potential iceberg-induced tsunami. The maritime Arctic weather also played havoc with the flight schedules, and resupplying local communities (rightly) took priority over science charters.

Iceberg near the harbour in Upernavik (ph. J Cook)

These factors combined to prevent us from leaving Upernavik for 3.5 weeks. It seemed like we would never make it onto the ice. However, we finally got a weather window that coincided with helicopter and pilot availability. With the difficulty of getting onto the ice weighing on our minds, we had to consider the risk of similar difficulties getting back out, so we repacked to ensure we had several weeks of emergency supplies and would not be flying into a potential search and rescue disaster.

Once on the ice, we quickly built a camp and started recording measurements straight away. The albedo measurements and paired drone flights went very smoothly, using methods refined over the past two seasons. However, we only saw exposed glacier ice for 1.5 days; continuous snowfall kept it buried for the rest of the season.

Air Greenland’s Bell 212 sling loading our field kit (ph. J Cook)

Overall it was an interesting site, and the important thing is that we can confirm that the algal bloom we studied in the south west is also present in the northern part of the ice sheet, is composed of the same species, and also darkens the ice. We have sampled the mineral dusts too, to see how they compare with those at the more southern site.