My AI for Earth project is quite memory intensive so I have been learning about ways to take the data storage off the local disk and into the cloud, while still maintaining on the fly access to crucial files on my virtual or local machine. My classification problem started off requiring just a few GB of Sentinel-2 images, but now the challenge is to use Azure to scale to observing the entire western coast of the Greenland Ice Sheet and over time, so the amount of storage required has increased dramatically and it is no longer feasible to store everything locally. The best solution I have found so far is to use Azure blob storage.
Blobs are repositories of unstructured data held in the cloud that can usefully be accessed using a simple Python API via the command line or in-script. I have been using the sentinelsat python API to batch download Sentinel-2 images for a specific range of tiles and dates, processing them from the L1C product into the L2A product using ESA Sen2Cor, uploading the L2A product to blob storage and then deleting the files stored locally. On my virtual machine I have enough memory to store one month’s worth of imagery for one tile (bearing in mind the memory required is much greater than that of the final product since the zipped and unzipped L1C and the processed L2A product will exist for a while), meaning the script can download a full month’s images before sending to the blob store, flushing the local hard disk and beginning to download the mages for the subsequent month.
One difficulty I have come across is maintaining file structures when sending entire folders to blob storage programmatically. This is trivial when using the Azure Storage Explorer because the file structure is automatically maintained simply by dragging and dropping the folder into the appropriate blob container. However, the Python API does not allow direct upload of an entire folder to a blob container unless the individual files are uploaded individually without being arranged into parent folders. To achieve this programmatically, virtual folders need to be invoked. To do this, the folder and file paths are both provided in the call to the blob service, rather than the filename alone. This requires iterating through a list of folders, then iterating through the files in each folder in a sub-loop, each time appending the filename to the folder name and using the full path as the blob store destination.
I posted my solution to this on the Azure community forum.