AI Adventures in Azure: Blob storage

My AI for Earth project is quite storage intensive, so I have been learning about ways to move data off the local disk and into the cloud while still maintaining on-the-fly access to crucial files on my virtual or local machine. My classification problem started off requiring just a few GB of Sentinel-2 images, but the challenge now is to use Azure to scale up to observing the entire western coast of the Greenland Ice Sheet through time, so the amount of storage required has increased dramatically and it is no longer feasible to store everything locally. The best solution I have found so far is Azure blob storage.

Blobs are repositories of unstructured data held in the cloud that can be accessed using a simple Python API, either via the command line or in-script. I have been using the sentinelsat Python API to batch download Sentinel-2 images for a specific range of tiles and dates, processing them from the L1C product into the L2A product using ESA Sen2Cor, uploading the L2A product to blob storage and then deleting the files stored locally. On my virtual machine I have enough disk space to store one month’s worth of imagery for one tile (bearing in mind that the space required is much greater than that of the final product, since the zipped and unzipped L1C and the processed L2A product all exist for a while). This means the script can download a full month’s images before sending them to the blob store, flushing the local hard disk and beginning to download the images for the subsequent month.
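
The month-by-month cycle described above could be organised roughly as follows. This is a sketch rather than the script used in the project: `month_windows` and `process_tile` are names invented here, and the four callables passed to `process_tile` are placeholders standing in for the sentinelsat download, the Sen2Cor call, the blob upload and the local clean-up.

```python
import calendar
from datetime import date


def month_windows(start, end):
    """Yield (first_day, last_day) for every calendar month touched by start..end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        # Move to the next month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def process_tile(tile, start, end, download, to_l2a, upload, clear_local):
    """Run the download -> L2A -> upload -> delete cycle one month at a time.

    All four callables are hypothetical placeholders:
      download(tile, first, last) -> list of local L1C paths
      to_l2a(l1c_path)            -> local L2A path
      upload(l2a_path)            -> sends one product to blob storage
      clear_local()               -> frees the local disk
    """
    for first, last in month_windows(start, end):
        l1c_paths = download(tile, first, last)
        l2a_paths = [to_l2a(path) for path in l1c_paths]
        for path in l2a_paths:
            upload(path)
        # Only flush the disk once the whole month is safely uploaded.
        clear_local()
```

Keeping the clean-up as the last step of each month means the local disk never has to hold more than one month of imagery for one tile.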

One difficulty I have come across is maintaining file structures when sending entire folders to blob storage programmatically. This is trivial when using the Azure Storage Explorer because the file structure is automatically maintained simply by dragging and dropping the folder into the appropriate blob container. However, the Python API does not upload an entire folder in one call: files are uploaded one at a time, and if only the filename is given as the blob name they land in the container without their parent folders. To preserve the structure programmatically, virtual folders need to be used, meaning that the folder and file paths are both provided in the call to the blob service rather than the filename alone. This requires iterating through a list of folders, then iterating through the files in each folder in a sub-loop, each time appending the filename to the folder name and using the full path as the blob store destination.
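
The folder-then-file loop can be sketched as below. This is an illustration under assumptions, not the solution posted on the forum: `blob_names_for_folders` and `upload_folders` are names made up here, `os.walk` is used so that nested sub-folders are also covered, and the actual upload is left to a caller-supplied `upload_file(local_path, blob_name)` function.

```python
import os


def blob_names_for_folders(folders):
    """Return (local_path, blob_name) pairs that keep each file under its folder.

    Blob containers are flat; a "/" inside the blob name is what gets
    displayed as a virtual folder.
    """
    pairs = []
    for folder in folders:
        # The top-level folder name becomes the first virtual folder.
        folder_name = os.path.basename(os.path.normpath(folder))
        for root, _dirs, files in os.walk(folder):
            for filename in sorted(files):
                local_path = os.path.join(root, filename)
                relative = os.path.relpath(local_path, folder)
                # Always join with "/" so the blob name is the same on any OS.
                blob_name = "/".join([folder_name] + relative.split(os.sep))
                pairs.append((local_path, blob_name))
    return pairs


def upload_folders(folders, upload_file):
    """Call upload_file(local_path, blob_name) for every file in every folder."""
    for local_path, blob_name in blob_names_for_folders(folders):
        upload_file(local_path, blob_name)
```

With the current azure-storage-blob package, `upload_file` could open the file and pass it to `ContainerClient.upload_blob(name=blob_name, data=handle)`; the older SDK exposed `BlockBlobService.create_blob_from_path(container_name, blob_name, file_path)`. Which of these applies depends on the installed SDK version, so both should be checked against the documentation.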

I posted my solution to this on the Azure community forum.


Heliguy Blog: Drones for Climate

UK drone company Heliguy recently ran a blog article about my work with drones in the Arctic including on my Microsoft/National Geographic AI for Earth grant.

Drones have been increasingly important in my work on Arctic climate change, especially in mapping melting over glacier surfaces and as a way to link ground measurements with satellite remote sensing. I have recently passed the UK CAA Permissions for Commercial Operations assessments, so please reach out with projects and collaboration ideas related to drone photography or remote sensing.

Image taken from a quadcopter while mapping the ice surface, near point 660, Greenland Ice Sheet.





AI Adventures in Azure: Ice Surface Classifiers

For this post I will introduce what I am actually trying to achieve with the AI for Earth grant and how it will help us to understand glacier and ice sheet dynamics in a warming world.

The Earth is heating up – that’s a problem for the parts of it made of ice. Over a billion people rely directly upon glacier-fed water for drinking, washing, farming or hydropower. The sea level rise resulting from the melting of glaciers and ice sheets is one of the primary species-level existential risks we face as humans in the 21st century, threatening lives, homes, infrastructures, economies, jobs, cultures and traditions. It has been projected that $14 trillion could be wiped off the global economy annually by 2100 due to sea level rise. The major contributing factors are thermal expansion of the oceans and melting of glaciers and ice sheets, which in turn is primarily controlled by the ice albedo, or reflectivity. However, our understanding of albedo for glaciers and ice sheets is still fairly basic. Our models make drastic assumptions about how the albedo of glaciers behaves: some assign a constant value to it, some assume it varies as a simple function of exposure time in the summer, and the more sophisticated models use radiative transfer but on the assumption that the ice behaves in the same way as snow (i.e. it can be adequately represented as a collection of tiny spheres). Our remote sensing products also struggle to resolve the complexity of the ice surface and fail to detect the albedo-reducing processes operating there, for example the accumulation of particles and growth of algae on the ice surface, and the changing structure of the ice itself. This limits our ability to observe the ice surface changing over time and to attribute melting to specific processes, which would enable us to make better predictions of melting – and therefore sea level rise – into the future.

Aerial view of a field camp on the Greenland Ice Sheet in July 2016. The incredible complexity of this environment is clear – there are areas of bright ice, standing water, melt streams, biological aggregates known as cryoconites and areas of intense contamination with biological growth, mineral dust and soots – none of which is resolved by our current models or remote sensing but all of which affect the rate of glacier melting.

I hope to contribute to tackling this problem with AI for Earth. My idea is to use a form of machine learning known as supervised classification to map ice surfaces from drone images and then at the scale of entire glaciers and ice sheets using multispectral data from the European Space Agency’s Sentinel-2 satellite. The training data will come from spectral measurements made on the ice surface that match the wavelengths of the UAV and Sentinel sensors. I’ll be writing the necessary code in Python and processing the imagery in the cloud using Microsoft Azure, with the aim of gaining new insights into glacier and ice sheet melting and developing an accessible API to host on the AI for Earth API hub. I have been working on this problem for a while and the code (in active development) is being regularly updated on my Github repository. A publication is currently under review.
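
To make the idea of supervised classification concrete, here is a deliberately tiny sketch: labelled spectra are used to train a classifier, which then assigns a surface class to each unlabelled pixel. It uses a plain nearest-centroid rule written in pure Python, which is a stand-in chosen for brevity and not the classifier used in this project. The function names are invented here and the reflectance values in the usage example are made up purely for illustration.

```python
import math


def train_centroids(spectra, labels):
    """Average the training spectra of each class into one centroid per class."""
    sums, counts = {}, {}
    for spectrum, label in zip(spectra, labels):
        if label not in sums:
            sums[label] = [0.0] * len(spectrum)
            counts[label] = 0
        sums[label] = [total + value for total, value in zip(sums[label], spectrum)]
        counts[label] += 1
    return {
        label: [total / counts[label] for total in sums[label]]
        for label in sums
    }


def classify_pixel(pixel, centroids):
    """Return the label whose centroid is closest (Euclidean distance) to the pixel."""
    return min(centroids, key=lambda label: math.dist(pixel, centroids[label]))


# Made-up two-band reflectances, for illustration only (not field data).
training_spectra = [[0.80, 0.70], [0.60, 0.50], [0.10, 0.05]]
training_labels = ["bright_ice", "bright_ice", "water"]

centroids = train_centroids(training_spectra, training_labels)
image_pixels = [[0.75, 0.65], [0.12, 0.10]]
surface_map = [classify_pixel(pixel, centroids) for pixel in image_pixels]
```

In practice each pixel would carry one value per sensor band, and the classifier would be one of the more capable algorithms available in scikit-learn.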

I have already posted about my Azure setup and some ways to start programming in Python on Azure virtual machines, and from here on in the posts will be more about coding specifically for this project.

National Geographic Explorers Festival London

A few weeks ago I had the pleasure of presenting at the National Geographic Explorers Festival in London. This was an amazing opportunity to meet the inspirational Explorers and listen to them talk about AI solutions to conservation problems around the world. In the afternoon I spoke about my work on machine learning and remote sensing for monitoring glacier and ice sheet melting, and then participated in a panel discussion about the challenges of applying AI to environmental problems. The event was livestreamed and is now archived here (my part starts at 1:48).

The work I presented is supported by Microsoft and National Geographic through their AI for Earth scheme.


AI Adventures in Azure: Ways to Program in Python on the DSVM

Having introduced the set up and configuration of a new virtual machine and the ways to interact with it, I will now show some ways to use it to start programming in Python. This post will assume that the VM is allocated and that the user is accessing the VM using a remote desktop client.

1. Using the terminal

I am running an Ubuntu virtual machine, so the command line interface is referred to as the terminal. The language used to make commands is (usually) “bash”. Since the package manager Anaconda is already installed on the data science VM, it is very easy to start building environments and running Python code in the terminal. Here is an example where I’m creating a new environment called “AzurePythonEnv” that includes some popular packages:

>> conda create -n AzurePythonEnv python=3.6 numpy matplotlib scikit-learn pandas

Now this environment can be activated any time via the terminal:

>> source activate AzurePythonEnv

Now, with the environment activated, python code can be typed directly into the terminal, or scripts can be written as text files (e.g. using the pre-installed text editors Atom or Vim) and called from the terminal:

>> python /data/home/tothepoles/Desktop/script.txt
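
As an example of a script that could be called this way, the short sketch below checks that the packages requested in the `conda create` command above can be found in the active environment. The script itself and the name `missing_modules` are illustrative assumptions; the only package detail relied on is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util
import sys

# Modules expected from the example conda environment.
# scikit-learn is imported as "sklearn".
EXPECTED_MODULES = ["numpy", "matplotlib", "sklearn", "pandas"]


def missing_modules(module_names):
    """Return the names that cannot be found in the active environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_modules(EXPECTED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```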


2. Using an IDE

The data science VM includes several IDEs that can be used for developing Python code. My preferred option at the moment is PyCharm, but Visual Studio Code is also excellent and I can envisage using this as my primary IDE later on. IDEs are available under Applications > Development in the desktop toolbar or accessible via the command line. IDEs for other languages, including R-Studio, are also pre-installed on the Linux DSVM. Simply open the preferred IDE and start programming. In PyCharm the bottom frame in the default view can be toggled between the terminal and the Python console. This means new packages can be installed into your environment, and new environments created and removed, from within the IDE, along with all the other functions associated with the command line. The basic workflow for programming in the IDE is to start a new project, link it to your chosen development environment, write scripts in the editor window and then run them (optionally running them in the console so that variables and datasets remain accessible after the script has finished running).

Development in the PyCharm IDE

3. Using Jupyter Notebooks

Jupyter notebooks are applications that allow active code to be run in a web browser, with the outputs displayed interactively within the same window. They are a great way to make code accessible to other users. The code is written nearly identically to a normal Python script except that it is divided into individual executable cells. Jupyter notebooks can be run in the cloud using Azure notebooks, making it easy to access Azure data storage, configure custom environments, deploy scripts and present them as an accessible resource hosted in the cloud. I will be writing more about this later as I develop my own APIs on Azure. For now, the Azure Notebook documentation is here. On the DSVM, JupyterLab and Jupyter Notebooks are preinstalled and accessed simply by typing the command:

>> jupyter notebook
A Jupyter notebook running in a web browser

AI Adventures in Azure: Uploading data to the VM

There are many ways to transfer data from local storage to the virtual machine. Azure provides Blob storage for unstructured data managed through the user’s storage account as well as specific storage options for files and tables. There is also the option to use Data Lakes. These are all useful for storing large datasets and integrating into processing pipelines within Azure.

However, in this post I will talk about some simpler options for transferring smaller files, for example scripts or smaller images and datasets onto the VM itself, just to make the essential datasets available for code development on the VM. There are two main options – one is to upload to third party cloud storage, and the other is sharing folders through the remote desktop connection.

1) Upload data to a third party cloud storage account:

This could be an Azure store, Gdrive, OneDrive, Dropbox or similar, or an ftp site. Upload from the local computer, then start up and log into the VM and download directly to the VM hard drive. This is quite clunky and time consuming compared to a direct transfer.

2) Share files using the remote desktop connection:

In XTerm there is an option to set preferences. Clicking this brings up a menu with a tab named “shared folders”. Select the folders to share and check the boxes for “mount automatically”. These folders are then available to the VM, and files can be copied and pasted between the local and remote machines.

Other, Azure-optimised data transfer and storage options will be covered in a later post!

AI Adventures in Azure: Choosing VM Size

The main purpose of a VM is to accelerate scripts compared to running locally on a laptop or desktop by outsourcing the computation to a more powerful remote computer. There is an overwhelming number of options for Azure VM sizes, each of which is optimised for a particular purpose, so to get the best performance for a specific application it’s important to choose the right VM. I started with no clue which VM would be right for me. I’m using the VM to apply scikit-learn algorithms to large images obtained from drones and satellites, which is memory hungry but “embarrassingly parallel” (meaning it is easy to separate the computation into chunks and distribute the computation across several individual cores).
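
A minimal sketch of what “embarrassingly parallel” means in practice is given below: the pixels are split into chunks, each chunk is processed independently, and the results are stitched back together in order. The function names are invented for this example, and `label_chunk` is a placeholder threshold standing in for a real classifier call.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for index in range(n_chunks):
        # The first `remainder` chunks take one extra item each.
        stop = start + size + (1 if index < remainder else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def label_chunk(chunk):
    """Placeholder per-chunk work: threshold each pixel value at 0.5."""
    return [1 if value > 0.5 else 0 for value in chunk]


def map_in_parallel(work, items, n_workers, executor_class=ProcessPoolExecutor):
    """Apply work() to each chunk on its own worker and flatten the results."""
    chunks = split_into_chunks(items, n_workers)
    with executor_class(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(work, chunks)
        return [value for chunk_result in results for value in chunk_result]
```

A call such as `map_in_parallel(label_chunk, pixels, n_workers=8)` would spread the work over eight processes. On platforms that start workers by spawning a fresh interpreter, the call needs to sit under an `if __name__ == "__main__":` guard.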

I started off by prioritising access to lots of cores, thinking that distributing widely would be the best way to accelerate my code, so I initially opted for the NC24 series VM which has 24 available cores. However, the NC24 was not noticeably faster than running the code locally on my laptop, which has 8 available cores. Since the NC24 is relatively expensive, and benchmark tests showed no noticeable speed up from 8 to 24 cores, I switched to a more affordable NC6. This did not slow down the script at all relative to the NC24, suggesting that the number of cores is not limiting the speed of my script. To be sure, I briefly allocated a 64 core VM and ran the benchmark script again. Again, there was clearly no need to pay extra for more cores, so the NC6 became my main VM for a while.

However, the experiments with the NC series VMs showed that there was no real benefit to paying for VM access relative to running locally on my laptop, at least in terms of benchmark script completion time, so I explored some compute-optimised options instead. The F16s-v2 worked nicely and was cheap compared to the NC series, but it suffered from memory overload when running the larger benchmark scripts. This led to a switchover to a memory-optimised E20s-v3 VM (20 vCPUs, 160 GB RAM, 32000 max IOPS). This VM outperforms my laptop and the other VM sizes I’ve tested for my particular image processing application.

So far, I am very happy with the performance of the E20s-v3 VM and will stick with it for a while, although I am interested in the newly announced Lsv2 series.

It was obviously extremely useful to have a benchmark script and image to compare the VMs. On an image-by-image basis the acceleration has a minor impact, but it will become more important as I start to scale to automated processing of large numbers of images.

All the VMs were running an Ubuntu 16.04 LTS Data Science machine image and the benchmarking used an identical Python script run using PyCharm.
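
For anyone wanting to run a similar comparison, a simple timing harness along these lines would do. It is a generic sketch, not the benchmark script used for the tests above; `benchmark` and `summarise` are names invented here, and `task` is whatever zero-argument function wraps the processing of the benchmark image.

```python
import statistics
import time


def benchmark(task, repeats=3):
    """Run task() several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Reduce a list of durations to its best and median values."""
    return {"best": min(durations), "median": statistics.median(durations)}
```

Running the same `task` with the same input on each VM size and comparing the summaries keeps the comparison like-for-like.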