My AI for Earth project is quite data intensive, so I have been learning about ways to move data storage off the local disk and into the cloud while still maintaining on-the-fly access to crucial files from my virtual or local machine. My classification problem started off requiring just a few GB of Sentinel-2 images, but the challenge now is to use Azure to scale up to observing the entire western coast of the Greenland Ice Sheet over time, so the amount of storage required has increased dramatically and it is no longer feasible to store everything locally. The best solution I have found so far is Azure blob storage.
Blobs are repositories of unstructured data held in the cloud that can be accessed using a simple Python API, either from the command line or in-script. I have been using the sentinelsat Python API to batch download Sentinel-2 images for a specific range of tiles and dates, processing them from the L1C product into the L2A product using ESA's Sen2Cor, uploading the L2A product to blob storage and then deleting the files stored locally. My virtual machine has enough disk space to store one month's worth of imagery for one tile (bearing in mind that the space required is much greater than that of the final product, since the zipped and unzipped L1C and the processed L2A product all exist for a while), meaning the script can download a full month's images before sending them to the blob store, flushing the local hard disk and beginning to download the images for the subsequent month.
One difficulty I have come across is maintaining file structures when sending entire folders to blob storage programmatically. This is trivial in the Azure Storage Explorer, where the file structure is maintained automatically simply by dragging and dropping the folder into the appropriate blob container. However, the Python API does not allow direct upload of an entire folder to a blob container; files can only be uploaded individually, without being arranged into parent folders. To preserve the structure programmatically, virtual folders need to be invoked: the folder and file paths are provided together in the call to the blob service, rather than the filename alone. This means iterating through a list of folders, then iterating through the files in each folder in a sub-loop, each time appending the filename to the folder name and using the full path as the blob store destination.
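As a minimal sketch of that loop (assuming the azure-storage-blob SDK; the container client setup and directory names are hypothetical, not my exact pipeline code), the virtual folder path is simply the file's path relative to the folder being uploaded:

```python
import os

def iter_blob_names(local_dir):
    """Yield (local_path, blob_name) pairs for every file under local_dir,
    using the path relative to local_dir as the blob's virtual folder path."""
    for root, _dirs, files in os.walk(local_dir):
        for fname in files:
            local_path = os.path.join(root, fname)
            # forward slashes act as the virtual folder separators in blob names
            blob_name = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            yield local_path, blob_name

def upload_folder(container_client, local_dir):
    """Upload a whole folder to a blob container, preserving its structure."""
    for local_path, blob_name in iter_blob_names(local_dir):
        with open(local_path, "rb") as f:
            container_client.upload_blob(name=blob_name, data=f, overwrite=True)
```

Here `container_client` would be, for example, a `ContainerClient` created via `ContainerClient.from_connection_string(...)`; all of the structural work is done by `iter_blob_names`.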
For this post I will introduce what I am actually trying to achieve with the AI for Earth grant and how it will help us to understand glacier and ice sheet dynamics in a warming world.
The Earth is heating up – that's a problem for the parts of it made of ice. Over a billion people rely directly upon glacier-fed water for drinking, washing, farming or hydropower. The sea level rise resulting from the melting of glaciers and ice sheets is one of the primary existential risks we face as a species in the 21st century, threatening lives, homes, infrastructure, economies, jobs, cultures and traditions. It has been projected that $14 trillion could be wiped off the global economy annually by 2100 due to sea level rise. The major contributing factors are thermal expansion of the oceans and melting of glaciers and ice sheets, the latter of which is primarily controlled by the ice albedo, or reflectivity. However, our understanding of albedo for glaciers and ice sheets is still fairly basic. Our models make drastic assumptions about how the albedo of glaciers behaves: some assign it a constant value, some assume it varies as a simple function of exposure time in the summer, and the more sophisticated models use radiative transfer but assume that ice behaves in the same way as snow (i.e. that it can be adequately represented as a collection of tiny spheres). Our remote sensing products also struggle to resolve the complexity of the ice surface and fail to detect the albedo-reducing processes operating there, for example the accumulation of particles and growth of algae on the ice surface, and the changing structure of the ice itself. This limits our ability to observe the ice surface changing over time and to attribute melting to specific processes, which would enable us to make better predictions of melting – and therefore sea level rise – into the future.
I hope to contribute to tackling this problem with AI for Earth. My idea is to use a form of machine learning known as supervised classification to map ice surfaces, first from drone images and then at the scale of entire glaciers and ice sheets using multispectral data from the European Space Agency's Sentinel-2 satellite. The training data will come from spectral measurements made on the ice surface that match the wavelengths of the UAV and Sentinel sensors. I'll be writing the necessary code in Python and processing the imagery in the cloud using Microsoft Azure, with the aim of gaining new insights into glacier and ice sheet melting and developing an accessible API to host on the AI for Earth API hub. I have been working on this problem for a while and the code (in active development) is regularly updated on my GitHub repository. A publication is currently under review.
I have already posted about my Azure setup and some ways to start programming in Python on Azure virtual machines, and from here on in the posts will be more about coding specifically for this project.
Having introduced the setup and configuration of a new virtual machine and the ways to interact with it, I will now show some ways to use it to start programming in Python. This post assumes that the VM is allocated and that the user is accessing it through a remote desktop client.
1. Using the terminal
I am running an Ubuntu virtual machine, so the command line interface is referred to as the terminal. The language used to make commands is (usually) “bash”. Since the package manager Anaconda is already installed on the data science VM, it is very easy to start building environments and running Python code in the terminal. Here is an example where I’m creating a new environment called “AzurePythonEnv” that includes some popular packages:
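The original screenshot of the command hasn't survived here, but a representative conda command (the package list is an assumption, not the original) looks like this:

```shell
conda create -n AzurePythonEnv numpy pandas matplotlib scikit-learn
```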
Now this environment can be activated any time via the terminal:
>> source activate AzurePythonEnv
Now, with the environment activated, Python code can be run directly in the terminal, or scripts can be written as text files (e.g. using the pre-installed text editors Atom or Vim) and called from the terminal:
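For example, a script saved as a text file (the filename here is hypothetical) is run like so:

```shell
python classify_ice.py
```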
2. Using an IDE

The data science VM includes several IDEs that can be used for developing Python code. My preferred option at the moment is PyCharm, but Visual Studio Code is also excellent and I can envisage using this as my primary IDE later on. IDEs are available under Applications > Development in the desktop toolbar or accessible via the command line. IDEs for other languages are also pre-installed on the Linux DSVM, including R-Studio. Simply open the preferred IDE and start programming. In PyCharm the bottom frame in the default view can be toggled between the terminal and the Python console. This means new packages can be installed into your environment, and new environments created and removed, from within the IDE, along with all the other functions associated with the command line. The basic workflow for programming in the IDE is to start a new project, link it to your chosen development environment, write scripts in the editor window and then run them (optionally in the console, so that variables and datasets remain accessible after the script has finished running).
3. Using Jupyter Notebooks
Jupyter notebooks are applications that allow live code to be run in a web browser, with the outputs displayed interactively in the same window. They are a great way to make code accessible to other users. The code is written almost identically to a normal Python script, except that it is divided into individual executable cells. Jupyter notebooks can be run in the cloud using Azure Notebooks, making it easy to access Azure data storage, configure custom environments, deploy scripts and present the result as an accessible resource hosted in the cloud. I will be writing more about this later as I develop my own APIs on Azure. For now, the Azure Notebooks documentation is here. On the DSVM, JupyterLab and Jupyter Notebooks are preinstalled and launched with a single terminal command.
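The command itself was lost from the original post; on a typical install the notebook server is launched with one of the following:

```shell
jupyter notebook   # classic notebook interface
jupyter lab        # JupyterLab interface
```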
There are many ways to transfer data from local storage to the virtual machine. Azure provides Blob storage for unstructured data managed through the user’s storage account as well as specific storage options for files and tables. There is also the option to use Data Lakes. These are all useful for storing large datasets and integrating into processing pipelines within Azure.
However, in this post I will talk about some simpler options for transferring smaller files, for example scripts or smaller images and datasets onto the VM itself, just to make the essential datasets available for code development on the VM. There are two main options – one is to upload to third party cloud storage, and the other is sharing folders through the remote desktop connection.
1) Upload data to a third party cloud storage account:
This could be an Azure store, Gdrive, OneDrive, Dropbox or similar, or an ftp site. Upload from the local computer, then start up and log into the VM and download directly to the VM hard drive. This is quite clunky and time consuming compared to a direct transfer.
2) Share files using the remote desktop connection:
In XTerm there is an option to set preferences. Clicking this brings up a menu with a tab named "shared folders". Select the folders to share and check the boxes for "mount automatically". These folders then become available on the VM, and files can be copied and pasted between the local and remote machines.
Other, Azure-optimised data transfer and storage options will be covered in a later post!
The main purpose of a VM is to accelerate scripts compared to running locally on a laptop or desktop by outsourcing the computation to a more powerful remote computer. There is an overwhelming number of options for Azure VM sizes, each of which is optimised for a particular purpose, so to get the best performance for a specific application it’s important to choose the right VM. I started with no clue which VM would be right for me. I’m using the VM to apply scikit-learn algorithms to large images obtained from drones and satellites, which is memory hungry but “embarrassingly parallel” (meaning it is easy to separate the computation into chunks and distribute the computation across several individual cores).
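As a toy sketch of what "embarrassingly parallel" means here (the threshold classifier is a hypothetical stand-in for a scikit-learn predict call, and a thread pool stands in for the process pool a real multicore job would use): the image is split into chunks, each chunk is classified independently, and the results are reassembled in order.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_chunk(pixels):
    # stand-in for clf.predict(): label each pixel with a simple threshold rule
    return [1 if p > 0.5 else 0 for p in pixels]

def classify_image(pixels, n_workers=4):
    # split the flattened image into roughly equal chunks, one per worker
    size = max(1, len(pixels) // n_workers)
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    # classify each chunk independently - no chunk depends on any other
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = ex.map(classify_chunk, chunks)
    # reassemble in order; chunk boundaries do not affect the result
    return [label for chunk in results for label in chunk]
```

Because each pixel is classified independently, distributing the chunks changes the runtime but never the answer, which is why adding cores is an obvious (but, as it turned out, not always effective) way to accelerate this workload.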
I started off by prioritising access to lots of cores, thinking that distributing widely would be the best way to accelerate my code, so I initially opted for the NC24 series VM, which has 24 available cores. However, the NC24 was not noticeably faster than running the code locally on my laptop, which has 8 available cores. Since the NC24 is relatively expensive, and benchmark tests showed no noticeable speed-up from 8 to 24 cores, I switched to a more affordable NC6. This did not slow down the script at all relative to the NC24, suggesting that the number of cores was not limiting the speed of my script. To be sure, I briefly allocated a 64-core VM and ran the benchmark script again. There was clearly no need to pay extra for more cores, so the NC6 became my main VM for a while.
However, the experiments with the NC series VMs showed that there was no real benefit to paying for VM access relative to running locally on my laptop, at least in terms of benchmark script completion time, so I explored some compute-optimised options instead. The F16s-v2 worked nicely and was cheap compared to the NC series; however, it suffered from memory overload when running the larger benchmark scripts. This led to a switch to a memory-optimised E20s-v3 VM (20 vCPUs, 160GB RAM, 32,000 max IOPS). This VM outperforms my laptop and the other VM sizes I've tested for my particular image processing application.
So far, I am very happy with the performance of the E20s-v3 VM and will stick with it for a while, although I am interested by the announcement of the new Lsv2 series.
It was obviously extremely useful to have a benchmark script and image to compare the VMs. On an image-by-image basis the acceleration has a minor impact, but it will become more important as I start to scale to automated processing of large numbers of images.
All the VMs were running an Ubuntu 16.04 LTS Data Science machine image and the benchmarking used an identical Python script run using PyCharm.
A lot of my work at the moment requires quite computationally heavy geospatial analysis that stretches the processing capabilities of my laptop. I invested in a pretty powerful machine – an i7-7700 processor and 32GB RAM – and sped things up by spreading the load across cores and threads, but it can still be locked up for hours when processing very large datasets. For this reason, I have started exploring cloud computing. My platform of choice is Microsoft Azure. Being new to Azure and cloud computing in general, I thought it would be helpful to keep notes of my learning as I climb onboard, and also that it could be useful to make the notes public for others who might be following the same path.
I’ll be blogging these notes as “Adventures in Azure”. I’m predominantly a Linux user and the notes will focus on Linux virtual machines on Azure. My programming will almost all be in Python. The end-goal is to be proficient with machine learning applied to remote sensing image analysis in the cloud.
I’m certain I will find fugly ways to do things and I will be grateful for any suggestions for refinements!
1. Setting Up Linux Data Science Virtual Machine
I’m not going to write up notes for this as it was so easy! I created an Azure account with a Microsoft email address, then chose a virtual machine image preloaded with the essentials – Ubuntu, Anaconda (2.7 and 3.5), JupyterHub, PyCharm, TensorFlow and NVIDIA drivers – amongst a range of other useful software designed specifically for data science. Microsoft call it the “Data Science Virtual Machine”; the link is here and the instructions are simple to follow. I opted for a standard NC6 (which has 6 vCPUs and 56GB memory) as this is a significant step up in terms of processing power from my local machine, but comes at an affordable hourly rate.
Once the virtual machine is established, there is still a fair amount of configuring to do before using it for geospatial projects. The next post will contain info about ways to work with Python on the virtual machine.