Reproducible machine learning with docker, jupyterlab and fastai/pytorch
In machine learning applications reproducibility is a key criterion, but random initialization and sampling can make it difficult. I show you a docker setup with jupyterlab and fastai/pytorch that automatically initializes the random seed on every startup, helping you to reproduce your results.
Why is reproducibility so important?
Reproducibility, the property that an experiment can be run again and will produce exactly the same results, is important not only from a scientific point of view. You want to be able to see whether changes in your parameters are the reason you get better results, and not the random initialization point of your network or the random split of your data into test and training sets. This allows you to consistently run experiments across machines or with colleagues and compare your results. You can achieve all of this manually by initializing the random number generator in every script, but people tend to forget things, so why not set up your environment so that it takes care of reproducibility for you?
What do we want from our environment?
Our environment should fulfill the following criteria:
- Use the latest library versions and document which versions are installed
- Be able to return to a previous state without destroying the setup
- Use identical versions on multiple machines
- Use all available computational power
How can we solve this?
- Docker based setup
- python3 with latest libraries and cuda enabled
- Fixed random seed initialization
- date-based versioning
Prerequisites
- Computer with nvidia drivers installed (here: Ubuntu 18.04 with a GTX 1080 Ti and the nvidia 440 driver series; it should work on Windows/Mac, but I have no experience with that)
- docker v19+ installed
- nvidia docker runtime installed (https://github.com/NVIDIA/nvidia-docker)
- jupyter notebook locally installed (to create password hash)
Docker image build file
The first and most complex part is the setup of the docker image. I will guide you through the Dockerfile section by section; the line numbers refer to the complete Dockerfile in the repository linked at the end of this section.
Line 1 defines the image we build our image on, in this case the cuda 10.1 base image. Depending on your driver version you might have to use a different version. See https://docs.nvidia.com/deploy/cuda-compatibility/index.html for details on compatibility.
Lines 3–9 build the linux base by installing the necessary packages (such as git).
Lines 11–22 install conda into /opt/conda and add the binaries to the PATH.
Lines 24–29 install all packages that come from the conda repository. First we update the base installation, then we create an environment called torch with python 3.7, and then we install pytorch, fastai, and auxiliary modules like pandas, jupyter, scikit, etc. If you want to add additional packages I would recommend adding them to the end of the Dockerfile, so docker can reuse the cached layers of the previous commands instead of rerunning them when building a new image.
Lines 31–38 are commented out. This is fastai’s recommended way to replace libjpeg and pillow with optimized versions. However, the build process failed several times and the libraries can easily become incompatible. Play around with it if you like.
Lines 40–46 activate the conda environment and use pip to install additional packages that are either not available through conda or only available in outdated versions, for example opencv or albumentations. They also install segmentation_models from github, as the pip packages of this repository are not always up to date.
Lines 48–61 configure the jupyter notebook such that it starts as root user with the root directory set to /opt/notebooks, a quit button to shut down the server from the website, no authentication, and bash shell access.
Lines 63–68 define the random seed that is set when starting an ipython kernel or jupyter notebook. This does not work for normal python scripts! There you need to set the random seed manually.
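For plain python scripts, where the kernel startup hook does not run, the manual seeding could look roughly like the sketch below. The function name seed_everything and the default seed 42 are my own assumptions and are not taken from the Dockerfile; numpy and torch are only seeded if they are installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> int:
    """Seed the random number generators that are available in this process.

    Hypothetical helper: the name and the default seed are assumptions.
    """
    # Python's built-in generator.
    random.seed(seed)
    # Only affects hash randomization of python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed, nothing to seed

    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN behaviour.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch not installed, nothing to seed

    return seed


if __name__ == "__main__":
    seed_everything(42)
    print(random.random())
```

Call the function once at the very top of a script, before any data split or model is created. Even with fixed seeds, some GPU operations can remain non-deterministic, so treat this as a starting point rather than a guarantee.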
Lines 70–77 create the cache directories where pytorch and fastai store the downloaded models. We will mount a persistent volume to this directory later so the cache can be reused by each container. This saves time and download volume. The python configuration is also stored in the home directory inside the image, so you can inspect the package versions later. It is important to set the HOME variable to /root/ and to allow access for all users to /root/ and /opt/. We will run the notebook with the user ID of the user starting the docker process, so that we have access to the notebooks and files we mount later. However, this means that we don’t know the ID when building the image, so we have to make the python environment accessible to any user. Every user ID needs access in order to read the configuration and install additional packages.
The chmod command in line 77 sadly blows up the image from ~8 GB to 18 GB due to the nature of docker. I’m currently working on a solution to avoid this, but for the time being this is the only way to have the uid mapped and still be able to install further packages as this user.
Line 80 exposes the jupyter notebook port 8888 to the outside of the container.
Line 83 is executed if the image is run without any command. It runs a torch test script that displays information about GPUs and libraries.
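To give an idea of what such a test script can report, here is a minimal sketch. It is not the script from the repository; the function name environment_report and the choice of fields are my own assumptions. It falls back gracefully when torch is not installed.

```python
import platform


def environment_report() -> dict:
    """Collect basic information about python, torch and the visible GPUs.

    Hypothetical helper, not the test script shipped in the image.
    """
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_available": False,
        "gpus": [],
    }
    try:
        import torch
    except ImportError:
        return report  # torch missing: report python only

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this inside a container started with the --gpus option should list only the GPUs that were passed through to that container.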
The complete Dockerfile and all auxiliary files are available in this repository: https://github.com/phillies/pytorch_cuda_docker
Build the image
To build the image run
docker build -t cudadev:2019-11 .
This creates the docker image and names it cudadev with the version tag 2019-11, so we know it was built in November 2019. If you plan to update packages more often, just add the day as well. As long as you don’t delete images this gives you a nice date-based versioning of your python environments.
Depending on your internet connection this can take anywhere from 5 minutes to much longer, as it downloads around 2 GB of packages. Feel free to get some coffee ;-).
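If you build images regularly, the date-based tag can be generated instead of typed by hand. The following is a small sketch; the helper name image_tag and its parameters are my own invention, only the image name cudadev and the year-month format come from the build command above.

```python
from datetime import date
from typing import Optional


def image_tag(name: str = "cudadev",
              day: Optional[date] = None,
              with_day: bool = False) -> str:
    """Return an image name with a date-based version tag.

    Hypothetical helper: uses today's date unless `day` is given.
    """
    day = day or date.today()
    fmt = "%Y-%m-%d" if with_day else "%Y-%m"
    return f"{name}:{day.strftime(fmt)}"


if __name__ == "__main__":
    # Prints the tag for the current month, ready for `docker build -t`.
    print(image_tag())
```

For example, image_tag(day=date(2019, 11, 5)) returns cudadev:2019-11, and with with_day=True it returns cudadev:2019-11-05.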
After the docker image has been built we can start a container. Here is my recommended command to start a container that stays open as an interactive console and removes itself after termination:
docker run
-ti # run as interactive console
--rm # remove the container after termination
--name cudalab1 # the name of the container
-p 8810:8888 # publish port 8888 from the inside to 8810 on host
--user $(id -u):$(id -g) # make the notebook run with your user id
# mount <local dir> to /opt/notebooks in container
--mount type=bind,source=<local dir>,target=/opt/notebooks
# mount a persistent cache to the container to save download time
--mount type=volume,source=cachevol1,target=/opt/cache
--shm-size 8G # shared memory size
--cpus=2 # number of CPUs to use
--gpus '"device=0"' # which GPU devices to use
cudadev:2019-11 # name and tag of the image to run
jupyter-lab # the command to be executed, can be a python script
Adjust the values for shared memory and CPUs according to your needs and machine. For GPUs the parameter can be either '"device=0,1,2"' for one or more GPUs or all for all GPUs. My recommendation is to use one GPU per container. That way you don’t need to adjust your code, because you can always run on the first GPU. The name and the host port must be unique for each container. Here is the run command in a more convenient format for copy & pasting:
docker run -ti --rm --name cudalab1 -p 8810:8888 --user $(id -u):$(id -g) --mount type=bind,source=<local dir>,target=/opt/notebooks --mount type=volume,source=cachevol1,target=/opt/cache --shm-size 8G --cpus=2 --gpus '"device=0"' cudadev:2019-11 jupyter-lab
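Since the name, the host port and the GPU have to differ for every container, it can help to assemble the command programmatically. The sketch below builds the same argument list as the command above; the function docker_run_command and its parameters are hypothetical, and os.getuid/os.getgid assume a Linux or Mac host.

```python
import os
import shlex
from typing import List


def docker_run_command(name: str, host_port: int, notebook_dir: str,
                       cache_volume: str, gpu: str = "0",
                       image: str = "cudadev:2019-11",
                       command: str = "jupyter-lab",
                       shm_size: str = "8G", cpus: int = 2) -> List[str]:
    """Build the docker run argument list for one container (hypothetical helper)."""
    return [
        "docker", "run", "-ti", "--rm",
        "--name", name,
        "-p", f"{host_port}:8888",
        # Same effect as --user $(id -u):$(id -g) in the shell.
        "--user", f"{os.getuid()}:{os.getgid()}",
        "--mount", f"type=bind,source={notebook_dir},target=/opt/notebooks",
        "--mount", f"type=volume,source={cache_volume},target=/opt/cache",
        "--shm-size", shm_size,
        f"--cpus={cpus}",
        # The inner double quotes are part of the value docker expects.
        "--gpus", f'"device={gpu}"',
        image, command,
    ]


if __name__ == "__main__":
    # One container per GPU, each with its own name, port and cache volume.
    for gpu in range(2):
        cmd = docker_run_command(f"cudalab{gpu + 1}", 8810 + gpu,
                                 "/path/to/notebooks", f"cachevol{gpu + 1}",
                                 gpu=str(gpu))
        print(shlex.join(cmd))  # requires python 3.8+
```

The printed lines can be pasted into a terminal, or the list can be passed to subprocess.run directly. The path /path/to/notebooks is a placeholder for your own notebook directory.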
Now you can start one jupyter notebook per GPU or run python scripts directly in the docker container and store the results in /opt/notebooks. Remember that the --rm flag removes the container and its content after termination, so all information that is not stored in /opt/notebooks will be lost! To prevent this, run the container without the --rm flag. Then you can restart the stopped container with docker start <containername>.
To install additional packages you can start a bash console from the jupyter lab launcher or run a command like !pip install pytest from the notebook. However, due to the user ID change, installing packages with apt does not work. If you need additional packages installed or updated with apt, open a second terminal and run docker exec -u root <containername> apt update and then docker exec -u root <containername> apt upgrade -y to update the packages as root in the running container.
I hope this rather lengthy tutorial helps you. If you have questions or see room for improvement let me know in the comments!
Edit: a previous version had the issue that if you create a user in the Dockerfile, its uid does not necessarily match the uid of the user running the docker container. Thus I removed the overhead of creating a user and added the uid mapping.