Transferring data between Google Drive and Google Cloud Storage using Google Colab


Google Colab is great for small experiments with Python and machine learning. But accessing data can be tricky, especially large data such as image, audio, or video files. The easiest approach is to store the data in your Google Drive and access it from Colab, but Google Drive tends to produce timeouts when a folder contains a large number of files.

More robust and scalable is Google Cloud Storage, where you can also more easily share the data with colleagues. But unfortunately there is no native way to transfer data from Google Drive to Google Cloud Storage without having to download and upload it again. However, with Google Colab we can transfer files quite easily.

Mounting Google Drive in Colab

from google.colab import drive
drive.mount('/content/drive')
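Before kicking off a long copy, it can be worth checking programmatically that the mount point actually exists. A minimal sketch (the helper name is mine):

```python
import os

# After drive.mount('/content/drive'), Drive contents appear under this path
DRIVE_ROOT = "/content/drive/My Drive"

def drive_mounted(root: str = DRIVE_ROOT) -> bool:
    """Return True if the Drive mount point exists, i.e. drive.mount succeeded."""
    return os.path.isdir(root)
```

Outside Colab (or before mounting) this simply returns False, so it doubles as a guard at the top of a copy cell.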

Setting up Google Cloud Storage bucket

Newly created project in the Resource Manager

After the project is created (you also need billing enabled, as the storage will cost you a few cents per month), open the navigation menu and select Storage (quite far down the menu). Next, you need to create a bucket for the data.

Empty project with no bucket

The name of the bucket must be globally unique, not just within your account but across all accounts. Just be creative ;-). The console also estimates the cost of the bucket, which is around 0.60 EUR per month for 10 GB with 10,000 uploads and 1,000,000 downloads per month.
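If you prefer to stay in the notebook, the bucket can also be created with gsutil instead of the web console. A sketch with a placeholder name and location (remember the name must be globally unique):

```shell
# Create a bucket in the EU multi-region (replace the placeholder name)
gsutil mb -l EU gs://your-globally-unique-bucket-name/

# Verify that the bucket now shows up
gsutil ls
```

In Colab, prefix each command with `!` to run it from a notebook cell, exactly as with the gcloud command below.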

Connecting to GCS bucket

from google.colab import auth
auth.authenticate_user()
project_id = 'nifty-depth-246308'
!gcloud config set project {project_id}
!gsutil ls

This connects to your project and lists all buckets. Next, you can copy data to or from GCS using the gsutil cp command. Note that the content of your Google Drive is not directly under /content/drive, but in the subfolder My Drive. If you copy more than a few files, use the -m option for gsutil, which enables parallel copying and speeds up the process significantly.
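Because "My Drive" contains a space, the path has to be escaped before it is spliced into a shell command (the backslash in the command below does this by hand). When building the command string in Python instead, shlex.quote handles it:

```python
import shlex

# The Drive path contains a space, so it must be quoted for the shell,
# otherwise gsutil would see two separate arguments
src = "/content/drive/My Drive/Data"
quoted = shlex.quote(src)
print(quoted)  # -> '/content/drive/My Drive/Data' (single-quoted, safe for the shell)
```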

bucket_name = 'medium_demo_bucket_190710'
!gsutil -m cp -r /content/drive/My\ Drive/Data/* gs://{bucket_name}/

That’s it. Now the process is running and you can check from time to time if it’s completed. I created a Colab notebook with the example code given here: https://colab.research.google.com/drive/1Xc8E8mKC4MBvQ6Sw6akd_X5Z1cmHSNca

Further information can be found in the Colab documentation and the gsutil documentation.

For future projects, just authenticate the Colab notebook and transfer the files from the bucket to the local file system. Then you can run all experiments on the local copy.
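That last step can be scripted as well. A sketch that only builds the gsutil command (the function name and paths are mine), so it can be handed to subprocess.run inside the notebook:

```python
import subprocess

def gsutil_download_cmd(bucket: str, prefix: str, dest: str) -> list:
    """Build the gsutil command that copies a bucket folder to the local file system."""
    return ["gsutil", "-m", "cp", "-r", f"gs://{bucket}/{prefix}", dest]

# In Colab one would then run (hypothetical bucket name and paths):
# subprocess.run(
#     gsutil_download_cmd("medium_demo_bucket_190710", "Data", "/content/data"),
#     check=True,
# )
```

Passing the command as a list avoids shell quoting issues entirely, since no shell is involved.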

Machine learning and neuroscience | Coding python (and recently JS and Kotlin) | Building apps you love