Transferring data between Google Drive and Google Cloud Storage using Google Colab


Google Colab is great for small experiments with Python and machine learning. But accessing data can be tricky, especially if you need large data such as image, audio, or video files. The easiest approach is to store the data in your Google Drive and access it from Colab, but Google Drive tends to time out when a single folder contains a large number of files.

Google Cloud Storage (GCS) is more robust and scalable, and it also makes it easier to share the data with colleagues. Unfortunately, there is no native way to transfer data from Google Drive to Google Cloud Storage without downloading and re-uploading it. With Google Colab, however, we can transfer files quite easily.

Mounting your own Google Drive is fairly easy. Just import the drive tools and run the mount command. You will be asked to authenticate using a token that you create via the Google Auth API. After you paste the token, your Drive is mounted at the given path.

from google.colab import drive
drive.mount('/content/drive')

Next we need to create a Google Cloud project. Go to the Resource Manager and create a new project.

Newly created project in the Resource Manager

After the project is created (you also need billing enabled, as the storage will cost you a few cents per month), click the navigation menu in the upper left corner and select Storage (quite far down the menu). Next, you need to create a bucket for the data.

Empty project with no bucket

The name of the bucket must be globally unique, that is, unique not just within your account but across all accounts. Just be creative ;-). There you can also estimate the cost of the bucket, which is around 0.60 EUR per month for 10 GB with 10,000 uploads and 1,000,000 downloads per month.

Once your bucket is set up, you can connect Colab to GCS using the Google Auth API and gsutil. First, authenticate yourself the same way you did for Google Drive; then set your project ID before you can access your bucket(s). The project ID is shown in the Resource Manager and in the URL when you manage your buckets.

from google.colab import auth
auth.authenticate_user()

project_id = 'nifty-depth-246308'
!gcloud config set project {project_id}
!gsutil ls

This will connect to your project and list all buckets. Next, you can copy data to or from GCS using the gsutil cp command. Note that the content of your Google Drive is not directly under /content/drive, but in the subfolder My Drive. If you copy more than a few files, use the -m option for gsutil: it enables parallel (multi-threaded) copying and speeds up the process significantly.

bucket_name = 'medium_demo_bucket_190710'
!gsutil -m cp -r /content/drive/My\ Drive/Data/* gs://{bucket_name}/

That’s it. The copy is now running, and you can check from time to time whether it has completed. I created a Colab notebook with the example code given here:
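One way to check on progress is to compare the size of the bucket with the size of the source folder in Drive. This is a sketch that assumes the bucket name and Drive path used above; it runs only inside a Colab session with Drive mounted and the project configured:

```python
# Summarized, human-readable size of everything copied to the bucket so far
bucket_name = 'medium_demo_bucket_190710'
!gsutil du -s -h gs://{bucket_name}/

# Size of the source folder in Drive, for comparison
!du -sh /content/drive/My\ Drive/Data
```

When the two numbers roughly match, the transfer is done.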

Further information can be found in the Colab documentation here and the gsutil documentation here.

For future projects, just authenticate the Colab notebook and transfer the files from the bucket to the local file system. Then you can run all experiments on the local copy.
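Under the same assumptions as above (the bucket name is just an example, and the local path /content/data is an arbitrary choice), the reverse transfer in a fresh Colab session might look like this:

```python
from google.colab import auth

# Authenticate this session, then pull the bucket contents to the Colab VM's local disk
auth.authenticate_user()

bucket_name = 'medium_demo_bucket_190710'
!mkdir -p /content/data
!gsutil -m cp -r gs://{bucket_name}/* /content/data/
```

Reading from the VM's local disk is much faster than reading files from a mounted Drive, which is exactly why this last copy pays off for training runs.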

Machine learning and neuroscience | Coding python (and recently JS and Kotlin) | Building apps you love
