My AI for Earth project is quite memory intensive so I have been learning about ways to take the data storage off the local disk and into the cloud, while still maintaining on the fly access to crucial files on my virtual or local machine. My classification problem started off requiring just a few GB of Sentinel-2 images, but now the challenge is to use Azure to scale to observing the entire western coast of the Greenland Ice Sheet and over time, so the amount of storage required has increased dramatically and it is no longer feasible to store everything locally. The best solution I have found so far is to use Azure blob storage.
Blobs are repositories of unstructured data held in the cloud that can usefully be accessed using a simple Python API via the command line or in-script. I have been using the sentinelsat Python API to batch download Sentinel-2 images for a specific range of tiles and dates, processing them from the L1C product into the L2A product using ESA Sen2Cor, uploading the L2A product to blob storage and then deleting the files stored locally. On my virtual machine I have enough disk space to store one month’s worth of imagery for one tile (bearing in mind the space required is much greater than that of the final product since the zipped and unzipped L1C and the processed L2A product will exist for a while), meaning the script can download a full month’s images before sending to the blob store, flushing the local hard disk and beginning to download the images for the subsequent month.
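The month-by-month batching described above can be sketched roughly as follows. This is only an illustration, not the script used in the project: `month_windows` is a hypothetical helper name, the start date is arbitrary, and the download, Sen2Cor, upload and clean-up steps are only indicated by a comment.

```python
import calendar
from datetime import date


def month_windows(start_year, start_month, n_months):
    """Yield (first_day, last_day) date pairs for consecutive calendar months."""
    year, month = start_year, start_month
    for _ in range(n_months):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        yield date(year, month, 1), date(year, month, last_day)
        # Move on to the next month, wrapping December into January
        month += 1
        if month > 12:
            month = 1
            year += 1


windows = list(month_windows(2017, 11, 3))
for first, last in windows:
    # Placeholder for the real work on each batch: download the L1C products,
    # run Sen2Cor, upload the L2A products to blob storage, delete local files.
    print(first.isoformat(), "to", last.isoformat())
```

Each pair can then be used as the date range for one batch, so that only one month of imagery ever sits on the local disk.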
One difficulty I have come across is maintaining file structures when sending entire folders to blob storage programmatically. This is trivial when using the Azure Storage Explorer because the file structure is automatically maintained simply by dragging and dropping the folder into the appropriate blob container. However, the Python API does not allow direct upload of an entire folder to a blob container: files have to be uploaded one at a time and, by default, are not arranged into parent folders. To preserve the folder structure programmatically, virtual folders need to be invoked. To do this, the folder and file paths are both provided in the call to the blob service, rather than the filename alone. This requires iterating through a list of folders, then iterating through the files in each folder in a sub-loop, each time appending the filename to the folder name and using the full path as the blob store destination.
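A minimal sketch of the nested loop is given below. It only builds the destination names and does not talk to Azure at all: `blob_names_for_folder` is a hypothetical helper, the folder and file names in the demo are made up, and the actual upload call is left as a comment because it depends on which version of the Azure storage SDK is installed.

```python
import os
import tempfile


def blob_names_for_folder(local_root):
    """Return (local_path, blob_name) pairs that preserve the folder layout."""
    pairs = []
    parent = os.path.dirname(os.path.abspath(local_root))
    # Outer loop: every folder under local_root; inner loop: files in that folder
    for folder, _subfolders, files in os.walk(local_root):
        for filename in sorted(files):
            local_path = os.path.join(folder, filename)
            # Path relative to the parent of the top folder, so its name is kept
            relative = os.path.relpath(os.path.abspath(local_path), parent)
            # Forward slashes in the blob name act as virtual folder separators
            blob_name = relative.replace(os.sep, "/")
            pairs.append((local_path, blob_name))
    return pairs


# Demo on a throwaway folder tree standing in for one downloaded product
with tempfile.TemporaryDirectory() as tmp:
    top = os.path.join(tmp, "tile_A")
    os.makedirs(os.path.join(top, "GRANULE"))
    for rel in ("meta.xml", os.path.join("GRANULE", "img.jp2")):
        with open(os.path.join(top, rel), "w") as f:
            f.write("placeholder")
    pairs = blob_names_for_folder(top)

names = [blob for _local, blob in pairs]
# Each (local_path, blob_name) pair would be passed to the SDK's upload call here
print(names)
```

Because the folder path is part of each blob name, the container shows the same folder tree as the local disk even though every file is uploaded individually.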
For this post I will introduce what I am actually trying to achieve with the AI for Earth grant and how it will help us to understand glacier and ice sheet dynamics in a warming world.
The Earth is heating up – that’s a problem for the parts of it made of ice. Over a billion people rely directly upon glacier-fed water for drinking, washing, farming or hydropower. The sea level rise resulting from the melting of glaciers and ice sheets is one of the primary species-level existential risks we face as humans in the 21st century, threatening lives, homes, infrastructures, economies, jobs, cultures and traditions. It has been projected that $14 trillion could be wiped off the global economy annually by 2100 due to sea level rise. The major contributing factors are thermal expansion of the oceans and melting of glaciers and ice sheets, which in turn is primarily controlled by the ice albedo, or reflectivity. However, our understanding of albedo for glaciers and ice sheets is still fairly basic. Our models make drastic assumptions about how the albedo of glaciers behaves: some assign a constant value to it, some assume it varies as a simple function of exposure time in the summer, and the more sophisticated models use radiative transfer but on the assumption that the ice behaves in the same way as snow (i.e. it can be adequately represented as a collection of tiny spheres). Our remote sensing products also struggle to resolve the complexity of the ice surface and fail to detect the albedo-reducing processes operating there, for example the accumulation of particles and growth of algae on the ice surface, and the changing structure of the ice itself. This limits our ability to observe the ice surface changing over time and to attribute melting to specific processes that would enable us to make better predictions of melting – and therefore sea level rise – into the future.
I hope to contribute to tackling this problem with AI for Earth. My idea is to use a form of machine learning known as supervised classification to map ice surfaces from drone images and then at the scale of entire glaciers and ice sheets using multispectral data from the European Space Agency’s Sentinel-2 satellite. The training data will come from spectral measurements made on the ice surface that match the wavelengths of the UAV and Sentinel sensors. I’ll be writing the necessary code in Python and processing the imagery in the cloud using Microsoft Azure, with the aim of gaining new insights into glacier and ice sheet melting and developing an accessible API to host on the AI for Earth API hub. I have been working on this problem for a while and the code (in active development) is being regularly updated on my GitHub repository. A publication is currently under review.
I have already posted about my Azure setup and some ways to start programming in Python on Azure virtual machines, and from here on in the posts will be more about coding specifically for this project.
Having introduced the set up and configuration of a new virtual machine and the ways to interact with it, I will now show some ways to use it to start programming in Python. This post will assume that the VM is allocated and that the user is accessing the VM using a remote desktop client.
1. Using the terminal
I am running an Ubuntu virtual machine, so the command line interface is referred to as the terminal. The language used to make commands is (usually) “bash”. Since the package manager Anaconda is already installed on the data science VM, it is very easy to start building environments and running Python code in the terminal. Here is an example where I’m creating a new environment called “AzurePythonEnv” that includes some popular packages:
Now this environment can be activated any time via the terminal:
>> source activate AzurePythonEnv
Now, with the environment activated, python code can be typed directly into the terminal, or scripts can be written as text files (e.g. using the pre-installed text editors Atom or Vim) and called from the terminal:
2. Using an IDE
The data science VM includes several IDEs that can be used for developing Python code. My preferred option at the moment is PyCharm, but Visual Studio Code is also excellent and I can envisage using this as my primary IDE later on. IDEs are available under Applications > Development in the desktop toolbar or accessible via the command line. IDEs for other languages are also pre-installed on the Linux DSVM including R-Studio. Simply open the preferred IDE and start programming. In PyCharm the bottom frame in the default view can be toggled between the terminal and the Python console. This means new packages can be installed into your environment and new environments created and removed from within the IDE, along with all the other functions associated with the command line. The basic workflow for programming in the IDE is to start a new project, link it to your chosen development environment, write scripts in the editor window then run them (optionally running them in the console so that variables and datasets remain accessible after the script has finished running).
3. Using Jupyter Notebooks
Jupyter notebooks are applications that allow active code to be run in a web browser, and the outputs displayed interactively within the same window. They are a great way to make code accessible to other users. The code is written nearly identically to a normal Python script except that it is divided into individual executable cells. Jupyter notebooks can be run in the cloud using Azure notebooks, making it easy to access Azure data storage, configure custom environments, deploy scripts and present them as an accessible resource hosted in the cloud. I will be writing more about this later as I develop my own APIs on Azure. For now, the Azure Notebook documentation is here. On the DSVM JupyterLab and Jupyter Notebooks are preinstalled and accessed simply by typing the command
There are many ways to transfer data from local storage to the virtual machine. Azure provides Blob storage for unstructured data managed through the user’s storage account as well as specific storage options for files and tables. There is also the option to use Data Lakes. These are all useful for storing large datasets and integrating into processing pipelines within Azure.
However, in this post I will talk about some simpler options for transferring smaller files, for example scripts or smaller images and datasets onto the VM itself, just to make the essential datasets available for code development on the VM. There are two main options – one is to upload to third party cloud storage, and the other is sharing folders through the remote desktop connection.
1) Upload data to a third party cloud storage account:
This could be an Azure store, Gdrive, OneDrive, Dropbox or similar, or an ftp site. Upload from the local computer, then start up and log into the VM and download directly to the VM hard drive. This is quite clunky and time consuming compared to a direct transfer.
2) Share files using the remote desktop connection:
In XTerm there is an option to set preferences. Clicking this brings up a menu with a tab named “shared folders”. Select these folders and check the boxes for “mount automatically”. These folders are then available to the VM, and files can be copied and pasted between the local and remote machines.
Other, Azure-optimised data transfer and storage options will be covered in a later post!
I recently published an article in Open Access Government about the potential for machine learning technologies to revolutionise Polar science, with focus on optical remote sensing data from drones and satellites. You can read it online or download it from OAGov_Oct18
Several journals now request data and/or code to be made openly available in a permanent repository accessible via a digital object identifier (doi), which is – in my opinion – generally a really good thing. However, there are associated challenges. First, because the expectation that code and data are made openly available is quite new (still nowhere near ubiquitous), many authors do not know of an appropriate workflow for managing and publishing their code. If code and data have been developed on a local machine, there is work involved in making sure the same code works when transferred to another computer where paths, dependencies and software setup may differ, and in providing documentation. Neglecting this is usually no barrier to publication, so there has traditionally been little incentive to put time and effort into it. Many have made great efforts to provide code to others via ftp sites, personal webpages or over email by request. However, this relies on those researchers maintaining their sites and responding to requests.
I thought I would share some of my experiences with curating and publishing research code using Git, because actually it is really easy and feeds back into better code development too. The ethical and pragmatic arguments in favour of adopting a proper version control system and publishing open code are clear – it enables collaborative coding, it is safer, more tractable and transparent. However, the workflow isn’t always easy to decipher to begin with. Hopefully this post will help a few people to get off the ground…
Version control is a way to manage code in active development. It is a way to avoid having hundreds of files with names like “model_code_for _TC_paper_v0134_test.py” in a folder on a computer, and a way to avoid confusion copying between machines and users. The basic idea is that the user has an online (‘remote’) repository that acts as a master where the up-to-date code is held, along with a historical log of previous versions. This remote repository is cloned on the user’s machine (‘local’ repository). The user then works on code in their local repository and the version control software (VCS) syncs the two. This can happen with many local repositories all linked to one remote repository, either to enable one user to sync across different machines or to have many users working on the same code.
Changes made to code in a local repository are called ‘modifications’. If the user is happy with the modifications, they can be ‘staged’. Staging adds a flag to the modified code, telling the VCS that the code should be considered as a new version to eventually add to the remote repository. Once the user has staged some code, the changes must be ‘committed’. Committing is saving the staged modifications safely in the local repository. Since the local repository is synced to the remote repository by the VCS, I think of making a commit as “committing to update the remote repository later”. Each time the user ‘commits’ they also submit a ‘commit message’ which details the modifications and the reasons they were made. Importantly, a commit is only a local change. Staging and committing modifications can be done offline – to actually send the changes to the remote repository the user ‘pushes’ it.
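The states described above can be pictured with a small toy model. This is not Git and not how Git is implemented; `ToyRepo` and its methods are hypothetical names used purely to show that a commit only changes the local repository until it is pushed. In real Git the corresponding commands are `git add`, `git commit -m "message"` and `git push`.

```python
class ToyRepo:
    """Toy model of the modify -> stage -> commit -> push cycle (not real Git)."""

    def __init__(self):
        self.modified = set()   # files changed but not yet staged
        self.staged = set()     # files flagged for the next commit
        self.local_log = []     # commits saved in the local repository
        self.remote_log = []    # commits that have reached the remote repository

    def modify(self, filename):
        self.modified.add(filename)

    def stage(self, filename):
        self.modified.remove(filename)
        self.staged.add(filename)

    def commit(self, message):
        if not self.staged:
            raise ValueError("nothing staged to commit")
        # A commit records the message and the staged files, locally only
        self.local_log.append((message, sorted(self.staged)))
        self.staged.clear()

    def push(self):
        # Send any local commits that the remote has not seen yet
        self.remote_log.extend(self.local_log[len(self.remote_log):])


repo = ToyRepo()
repo.modify("classifier.py")
repo.stage("classifier.py")
repo.commit("Add random forest classifier")
commits_before_push = len(repo.remote_log)  # still 0: the commit is local only
repo.push()
print(commits_before_push, len(repo.remote_log))
```

The modify, stage and commit steps never touch `remote_log`, which is why they can all be done offline.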
Sometimes the user might want to try out a new idea or change without endangering the main code. This can be achieved by ‘branching’ the repository. This creates a new workflow that is joined to the main ‘master’ code but kept separate so the master code is not updated by commits to the new branch. These branches can later be ‘merged’ back onto the master branch if the experiments on the branch were successful.
These simple operations keep code easy to manage and tractable. Many people can work on a piece of code, see changes made by others and, assuming the group is pushing to the remote repository regularly, be confident they are working on the latest version. New users can ‘clone’ the existing remote repository, meaning they create a local version and can then push changes up into the main code from their own machine. If a local repository is lagging behind the remote repository, local changes cannot be pushed until the user pulls the changes down from the remote repository, then pushes their new commits. This enables the VCS and the users to keep track of changes.
To make the code useable for others outside of a research group, a good README should be included in the repository, which is a clear and comprehensive explanation of the concept behind the code, the choices made in developing it and a clear description of how to use and modify it. This is also where any permissions or restrictions on usage should be communicated, and any citation or author contact information. Data accompanying the code can also be pushed to the remote repository to ensure that when someone clones it, they receive everything they need to use the code.
One great thing about Git is that almost all operations are local – if you are unable to connect to the internet you can still work with version control in Git, including making commits, and then push the changes up to the remote repository later. This is one of many reasons why Git is the most popular VCS. The name Git refers to the tool used to manage changes to code, whereas GitHub is an online hosting service for Git repositories. With Git, versions are saved as snapshots of the repository at the time of a commit. In contrast, many other VCSs log changes to files.
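The snapshot idea can be illustrated with a simplified sketch. It is a toy, not Git's real storage format: `store` and `snapshot` are hypothetical helpers, the file contents are made up, and real Git hashes a header plus the content rather than the content alone. The point it shows is that every snapshot lists every file, while identical content is stored only once.

```python
import hashlib

objects = {}  # content hash -> file content, shared between snapshots


def store(content):
    """Store file content under its hash and return the hash."""
    key = hashlib.sha1(content.encode("utf-8")).hexdigest()
    objects[key] = content
    return key


def snapshot(files):
    """Record the whole working tree as {filename: content hash}."""
    return {name: store(text) for name, text in files.items()}


# Two commits: only model.py changes between them
first = snapshot({"model.py": "albedo = 0.5", "README": "Ice model"})
second = snapshot({"model.py": "albedo = 0.4", "README": "Ice model"})

# Each snapshot has an entry for both files, but the unchanged README
# points at the same stored object in both snapshots
print(len(first), len(second), len(objects))
```

A change-logging VCS would instead record only the difference between the two versions of `model.py` and rebuild the files by replaying those differences.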
There are many other nuances and features that are very useful for collaborative research coding, but these basic concepts are sufficient for getting up and running. It is also worth mentioning BitBucket – many research groups use this platform instead of GitHub because repositories can be kept private without subscribing to a payment plan, whereas, at the time of writing, GitHub repositories are public unless paid for.
To publish code, a version of the entire repository should be made immutable and separate from the active repository, so that readers and reviewers can always see the precise code that was used to support a particular paper. This is achieved by minting a doi (digital object identifier) for a repository that exists in GitHub. This requires exporting to a service such as Zenodo.
Zenodo will make a copy of the repository and mint a doi for it. This doi can then be provided to a journal and will always link to that snapshot of the repository. This means the users can continue to push changes and branch the original repository, confident that the published version is safe and available. This is a great way to make research code transparent and permanent, and it means other users can access and use it, and the authors can forget about managing files for old papers on their machines and hard drives and providing their code and data over email ‘by request’. It also means the authors are not responsible for maintaining a repository indefinitely post-publication, as all the relevant code is safely stored at the doi, even if the repository is closed down.
Here are some notes on installing Ubuntu alongside Windows on a fresh Lenovo T470p with Windows 10 preinstalled. It took a bit of trial and error for me so hopefully these notes will help someone trying to do the same.
1. Download Ubuntu ISO
The Ubuntu ISO image for your system architecture is available here: https://www.ubuntu.com/download/desktop. Download to your PC. It needs to be put onto a CD or USB that can be booted from, requiring some software. I used Universal USB Installer https://www.pendrivelinux.com/universal-usb-installer-easy-as-1-2-3.
2. Create bootable USB
Find an empty USB drive with enough space (>2GB). Open Universal USB Installer, select the downloaded Ubuntu ISO image and the destination drive (the USB), tick the UUI formatting option and click ‘Create’.
3. Prepare partition
In Windows, find the disk manager (>diskmgmt.msc in the Windows command line) and select the C: drive. Right click and select ‘shrink volume’. Reduce the size of the volume by the desired amount. I left Windows with 80 GB of space, leaving 420 GB for Ubuntu. Once this is done, a new partition will be visible, labelled ‘unallocated’. This is where Ubuntu will sit eventually, so check you have allocated enough space.
4. Restart laptop and access boot menu
With the bootable USB containing the Ubuntu ISO inserted, restart the laptop and hold down F12 (star icon) to access the boot menu. The boot menu shows options of drives to boot from, with the top one being Windows Boot Manager. Select the UUI USB option. A ‘live’ boot of Ubuntu will run from the USB stick.
5. Install Ubuntu
From inside the live Ubuntu, the installer should auto-run. If not, there is a desktop icon for the installer that you can select. The install wizard is pretty self explanatory. I opted not to install any third party software, but otherwise maintained all the defaults. Select a username and password and choose a timezone, then click through to the end.
The final option on the installer is to restart. You have no choice but to do this, so do it. For me, the system booted straight into Windows. I tried to rectify this by accessing the boot menu again using F12 (star). Although Ubuntu was visible and was the priority boot, selecting it just hung the system and I was forced to either boot Windows or Ubuntu from the USB rather than the full install. This is because the BIOS setting defaults to UEFI only, which is protected by Windows’s Secure Boot setting.
7. Restart into BIOS
To access the settings, press F1 during startup. Navigate to the ‘security’ tab and find the option to disable secure boot. Then navigate to the ‘startup’ tab and find the option for ‘UEFI/Legacy BIOS’. Change the setting from ‘UEFI only’ to ‘Both’. Save and exit.
Now on restarting the laptop, it will boot straight into Ubuntu by default, with Windows accessible in its small partition by selecting the Windows Boot Manager from the boot screen, accessed by holding F12 during startup.
9. Test and go!
So far, Ubuntu 16.04 LTS has run very well ‘out of the box’ on the Lenovo T470p, with no major hardware issues encountered. The Wifi is working fine. I’ve heard the fingerprint scanner might not work great, but I’m not really interested in that anyway.