Docker
This section is divided into two main parts:
- Using Docker images
- Building custom Docker images
Refer to the Docker commands overview for a handy checklist or quick reference.
Register on Docker Hub at docker.com and follow the Docker Desktop guide to get started. We recommend going through this tutorial before diving in.
Docker enables developers to build, share, run, and verify applications seamlessly across different environments, eliminating the need for complex environment configuration and management. Before diving into hands-on activities, ensure you understand these three key concepts:
- Image: A Docker image comprises a filesystem, default environment variables, a default command to execute, and metadata about the image’s creation and configuration.
- Container: A container is a running instance of a Docker image. It represents an isolated process that operates independently from other processes on the system, using the image as its environment.
- Registry: A registry is a repository for storing Docker images. Docker Hub is a public registry where many images can be found, but each local Docker daemon also maintains its own local store to manage and keep images on your machine.
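As a minimal sketch of how the three concepts fit together (this uses the public hello-world image from Docker Hub, which is not used elsewhere in this section):

```bash
# Registry -> local image store: download an image from Docker Hub
docker pull hello-world
# Image: list the images stored locally
docker image ls
# Container: start a container, i.e. a running instance of the image
docker run hello-world
# Containers (running or stopped) are listed separately from images
docker container ls -a
```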
Using Docker images
First, we’ll start by using basic commands with existing Docker images from Docker Hub. This will help you become familiar with Docker’s functionality before we move on to creating our own custom images. While people commonly use shorthand Docker commands like `docker pull` and `docker run`, the full versions (e.g., `docker image pull`, `docker container run`) provide clearer insights into their underlying concepts and specific functions.
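For example, these shorthand and full forms are equivalent (a quick illustration using the Debian image from Docker Hub):

```bash
docker pull debian:stable        # shorthand
docker image pull debian:stable  # full form
docker ps -a                     # shorthand
docker container ls -a           # full form
```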
Docker Hub
- Search for images: `docker search <name>` (e.g. `docker search debian`)
- Download an image: `docker pull <image-name>`
Local Docker Daemon
- Display all Docker images currently stored on your local Docker daemon: `docker images` (alias for `docker image ls`)
- Inspect a Docker image: `docker inspect <image_name>` (alias for `docker image inspect`)
- Run a command (cmd) in a container: `docker run <image_name> cmd` (alias for `docker container run <image_name> cmd`)
- Start an interactive bash shell: `docker run -it <image_name> bash`. Add other flags like:
  - `-p 8888:8888` to access your interactive shell through `localhost:8888` on your host.
  - `--rm` to automatically delete the container once it stops, keeping your system clean (including its filesystem changes, logs and metadata). If you don’t use this flag, the stopped container and information about its processes will be kept. Check all containers in your Docker daemon with `docker container ls -a`.
  - `--user=$(id -u):$(id -g)`, useful if you are sharing volumes and need appropriate permissions on the host to manipulate files.
- Share the current directory with the container: `docker run -it --volume=$(pwd):/directory/in/container image_name bash`; the output of `pwd` will be mounted to /directory/in/container (e.g. data, shared, etc.)
- Manage your containers using `docker pause` or `docker stop`
- Tag images with a new name: `docker image tag image_name:tag new_name:tag`
- View the logs of a container: `docker logs <container_id>`
- Remove images and clean up your hard drive: `docker rmi <image_name>`
- Remove containers: `docker container rm <container_name>`. Alternatively, remove all stopped containers: `docker container prune`
All Docker images have a digest, which is the `sha256` hash of the image. It uniquely identifies a Docker image and is great for reproducibility.
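For instance, you can list local images together with their digests, or extract a single image’s digest (an illustrative sketch; `--format` uses Go templates):

```bash
# show the sha256 digest next to each locally stored image
docker images --digests
# print only the digest of a given image
# (RepoDigests is empty for images that were built locally and never pushed)
docker image inspect --format '{{index .RepoDigests 0}}' debian:stable
```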
In this exercise, we will use a Debian stable image (sha256-540ebf19fb0bbc243e1314edac26b9fe7445e9c203357f27968711a45ea9f1d4) as an example for pulling and inspecting Docker images. This image offers a reliable, minimal base with essential tools, including fundamental utilities like bash
for shell scripting and apt
for package management. It’s an excellent starting point for developing and testing pipelines or configuring server environments, providing a stable foundation that can be customized with additional packages as needed.
1. Get a container image

Pull the Docker image (using the tag stable; otherwise, latest will be pulled by default):

```bash
docker pull debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912
```
2. Run commands in a container

List the content of the container:

```bash
docker run -it --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 ls
```

Check whether there is `python` or `perl` in this container:

```bash
docker run --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 which -a python perl
```
3. Use Docker interactively

Enter the container interactively with a Bash shell:

```bash
docker run -it --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 /bin/bash
```

Now, collect info about the packages installed in the environment.

Interactive Docker container

```bash
hostname
whoami
ls -la ~/
python
echo "Hello world!" > ~/myfile.txt
ls -la ~/
```
4. Exit and check the container

Exit the container:

```bash
exit
```

Now, rerun the commands from step 3 under “Interactive Docker container”. Does the output look any different?

You will notice that you are still root, but the hostname has changed and the file you had created has disappeared.
5. Inspect the Docker image

```bash
docker image inspect debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912
```
6. Inspect Docker image details

Identify the date of creation, the name of the field holding the digest of the image, and the command run by default when entering this container.
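If you prefer not to scan the full JSON output, `--format` can extract individual fields; this sketch assumes the same digest as above and reveals part of the answer, so try the plain inspect output first:

```bash
IMAGE=debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912
docker image inspect --format 'Created: {{.Created}}' "$IMAGE"
docker image inspect --format 'Digest:  {{index .RepoDigests 0}}' "$IMAGE"
docker image inspect --format 'Command: {{.Config.Cmd}}' "$IMAGE"
```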
7. Remove the image

```bash
docker rmi debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912
```
When you exit a container, any changes made are lost because each new container starts in a clean state. Containers are isolated environments that don’t retain memory of past interactions, allowing you to experiment without affecting your system. This immutability ensures that your program will run under the same conditions in the future, even months later. So, how can you retrieve the results of computations performed inside a container?
To preserve your work, it is crucial to use Docker’s volume-mounting feature. When running a Docker container with the `-v` (volume) option, it’s best to execute the command from a temporary directory (e.g. /tmp/) or a project-specific directory rather than your home directory. This practice helps maintain a clean and isolated environment within the container, while ensuring your results are saved in a designated location outside the container’s ephemeral filesystem.
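As a small sketch of this practice (the /tmp/docker-results path and the /data mount point are illustrative):

```bash
# work from a project-specific directory rather than your home directory
mkdir -p /tmp/docker-results && cd /tmp/docker-results
# /data inside the container is backed by the current host directory
docker run --rm --volume=$(pwd):/data debian:stable \
    bash -c 'echo "Hello from the container" > /data/result.txt'
# the file is still on the host after the container is gone
cat result.txt
```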
Named volumes
Now, we will be using named volumes or shared directories. When you make changes in these directories within the container, the updates are immediately reflected in the corresponding directory on the host machine. This setup enables seamless synchronization between the container and the host file system.
```bash
# -v / --volume: mount only your project-specific dir into the container
docker run -it --volume=<project_dir>:/path/in/container <image_name>

# Named volumes are managed by Docker and are isolated from your host file system
docker volume create <my_project_volume>
docker run -it --volume=<my_project_volume>:/path/in/container <image_name>
```
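For reference, a named volume can be created, inspected and reused across runs; the volume name and mount point below are illustrative:

```bash
docker volume create my_project_volume
docker volume ls                          # list volumes managed by Docker
docker volume inspect my_project_volume   # shows where Docker stores it on the host
# data written to /data persists in the volume across containers
docker run --rm --volume=my_project_volume:/data debian:stable \
    bash -c 'echo "kept between runs" > /data/note.txt'
docker run --rm --volume=my_project_volume:/data debian:stable cat /data/note.txt
docker volume rm my_project_volume        # remove the volume when it is no longer needed
```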
Mounting your current working directory (${PWD}) into the container at e.g. /home/rstudio
can compromise the benefits of container isolation. If you run the command from your home directory, any local packages or configurations (such as R or Python packages installed with install.packages
or pip
) would be accessible inside the container. This inadvertently includes your personal setup and packages, potentially undermining the container’s intended clean and isolated environment. To maintain effective isolation, it’s better to use a temporary or project-specific directory for mounting.
Alternatively, you can add a volume to a project by modifying the `compose.yaml` (also named `docker-compose.yml`) file. There are two types of volumes:
- Service-level volumes: specify how volumes are mounted inside the container. In this case, the dataset volume (defined at the top level) will be mounted to /path/in/container/ (e.g. a data, results or logs directory) inside the myApp container.
- Top-level volumes: volumes shared across multiple services in the compose.yaml file. The dataset volume can be referenced by any service and will be created if it doesn’t exist. If the host path is not specified, Docker will automatically create and manage the volume in a default location on the host machine. However, if you need the volume to be located in a specific directory, you can specify the host path directly (option 2).
compose.yaml

```yaml
services:
  # Service-level volumes
  myApp:
    # ...
    volumes:
      - dataset:/path/in/container/
      # Option 2
      # - /my/local/path:/path/in/container/

# Top-level volume
volumes:
  dataset:
```
In this case, the volume named dataset will be mounted to the /path/in/container/ directory inside the container running the myApp service.
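With such a compose.yaml in place, the services and their volumes are typically managed with the docker compose commands below (run them from the directory containing compose.yaml):

```bash
docker compose up -d     # create and start the services (and the named volume, if it does not exist yet)
docker compose ps        # list the running services
docker compose down      # stop and remove the containers (named volumes are kept)
docker compose down -v   # also remove the named volumes declared in compose.yaml
```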
Let’s not forget to track changes to container images for reproducibility by using version control. Store your images in a shared storage area, such as with Git or Git Annex, to manage versions and facilitate collaboration.
Transfer and backup Docker images
Saving a Docker image as a tar file is useful for transferring the image to another system or operating system without requiring access to the original Docker registry. The tar file contains several key components:
- Metadata: JSON files essential to reconstruct the image, including:
  - layer information, where each layer is associated with metadata that includes the command used to create that layer, e.g.:

    Dockerfile
    ```dockerfile
    FROM ubuntu:20.04                                 # Layer 1 - base image
    RUN apt-get update && apt-get install -y python3  # Layer 2 - the result of running the command
    COPY . /app                                       # Layer 3 - add application files to the image
    ```

  - tags: pointers to specific image digests or versions
  - history of the image and the instructions from the Dockerfile that were used to build it
  - manifest, which ties together the layers and the overall image structure
- Filesystem: the actual content, files and directories that make up the image.
```bash
# save to tar file
docker image save --output=image.tar image_name
# load tar file
docker image load --input=image.tar
```
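If you want to check what the archive contains before transferring it, standard command-line tools work on the tar file (illustrative):

```bash
tar -tf image.tar | head    # list the layers, manifest and metadata files inside the archive
du -h image.tar             # check the archive size before copying it to another system
```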
In some situations, you may need to access the filesystem content of a Docker image for purposes such as debugging, backup, or repurposing. To do this, you should create a container from the image and then export the container’s root filesystem to a tar file. Similarly, you can create a new Docker image from this tar file if needed.
```bash
docker container create --name=temp_container image_name
docker container export --output=image.tar temp_container
docker container rm temp_container
# create a new image from the tar file
docker image import image.tar image_name
```
Building Docker images
We’re now ready to build custom Docker images. We’ll start with a minimal pre-built image, add the necessary software, and then upload the final image to Docker Hub.
A common practice is to start with a minimal image from Docker Hub. There are many base images you can use to start building your customized Docker image, and the key difference between them lies in what they already include. For example:
Debian: This is a very minimal base image, including only the most essential utilities and libraries required to run Debian, such as the core Linux utilities and the package manager (apt). It offers a high degree of customization, giving you full control over what you install. However, you’ll need to manually add all the necessary software, making the setup process longer. This image is relatively small in size, making it a good starting point if you want to build your environment from scratch.
Data Science-Specific Images: These images come pre-configured with a wide range of tools pre-installed, reducing the amount of customization needed. This allows you to get started much faster, though it also means less flexibility. These images tend to be larger in size due to the pre-installed software. For example:
- tensorflow/tensorflow: This image includes TensorFlow and often other machine learning libraries, making it ideal for deep learning projects.
- jupyter/scipy-notebook: This image is around 2-3 GB and includes Python, Jupyter Notebook, and libraries like NumPy, Pandas, Matplotlib, and more, making it a comprehensive option for data science tasks.
- r-base: This image provides a base for R environments, useful for data analysis and statistical computing.
- rocker/rstudio: This image includes RStudio and a base R environment, making it perfect for those working in R for statistical computing and data analysis.
Once you have chosen your base image, use a Dockerfile to modify its components, specifying commands for software installation and configuration. The Dockerfile acts as a recipe, providing a list of steps to build the image.
Dockerfile

```dockerfile
# deploy docker container
FROM <node|debian|python|jupyter-base>

# Info and rights of the app
LABEL software="App_name - sandbox" \
      maintainer="<author.address@sund.ku.dk>" \
      version="YYYY.MM.DD"

# root: needed to install and modify the container
USER 0

# run bash commands (e.g. creating directories, installing packages or software)
RUN mkdir -p /app

# install packages & dependencies, clean up tmp files created by apt
# and remove the node_modules directory (Node.js packages & dependencies)
RUN apt update \
    && apt install -y jupyter-notebook python3-matplotlib python3-pandas python3-numpy \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf node_modules

# set working directory (this directory should act as the main application directory)
WORKDIR /app

# copy files to-from directories
COPY /from/path/... /to/path/...

# set environment variables (e.g. for Conda)
ENV <key>=<value>

# switch to a non-root user instead of the default root user
USER 11042
```
In this example, the `RUN` command updates the package list and installs Jupyter along with all necessary dependencies. It’s common practice to remove unnecessary dependencies afterward, as shown in the example above.
Use key-value pair syntax to add the following labels to the container (you can verify them after the build, as shown below):
- software = name of the app
- author = maintainer or author of the app
- version = version of the app
- license = app licence, e.g., “MIT”
- description = very short description of the app
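After the image is built, you can check that the labels were applied; for example, reusing the tag format from later in this section:

```bash
docker image inspect --format '{{ json .Config.Labels }}' albarema/sandbox_app:v1.0
```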
After preparing your Dockerfile, use the `docker build` command to create a new image based on those instructions. Docker’s isolation ensures that any changes made within a container are lost when you exit it. Nevertheless, you can use `docker commit` to save these changes by creating a new image from the updated container. This new image will retain your modifications and can be used to launch new containers.
```bash
docker build -t <account/app:version> <directory_docker>
```
For `docker build`, the `-t` flag specifies both the repository and the version of the Docker image. The tag consists of three different elements (e.g. `albarema/sandbox_app:v1.0`):
- Repository name where the image will be stored, i.e. your account in the Docker registry (e.g. albarema)
- Name of the image (e.g. sandbox_app)
- Version label for the image (e.g. v1.0, test, etc.)
The repository name can be deferred until the application is ready for publication on Docker Hub. You can also modify the tag at a later time.
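For example, you might build under a local name first and add the account prefix only when you are ready to publish (the names reuse the albarema/sandbox_app example above):

```bash
# initial build with a local name only
docker build -t sandbox_app:v1.0 .
# later, add the repository (account) prefix before publishing to Docker Hub
docker image tag sandbox_app:v1.0 albarema/sandbox_app:v1.0
```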
- Create a Dockerfile in a project-specific dir (e.g.: sandbox-debian-jupyter). We will add a command to clean up the package cache after installation to reduce the image size.
```dockerfile
FROM debian:stable

LABEL maintainer="Name Surname <abd123@ku.dk>"

# Update package list and install necessary packages
RUN apt update \
    && apt install -y jupyter-notebook \
       python3-matplotlib \
       python3-pandas \
       python3-numpy \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*  # cleanup tmp files created by apt

# You may consider adding a working directory
# WORKDIR /notebooks
# and a command to start Jupyter Notebook
# CMD ["jupyter-notebook", "--ip=0.0.0.0", "--allow-root"]
```
- Build the Docker image using, for example:

```bash
docker build -t sandbox-debian-jupyter:1.0 sandbox-debian-jupyter
```
Testing the custom image
Let’s verify that the custom image functions as expected by running the following command:
Terminal

```bash
docker run --rm -p 8888:8888 --volume=$(pwd):/root sandbox-debian-jupyter:1.0 jupyter-notebook
```
Jupyter typically refuses to run as root or accept network connections by default. To address this, you need to either add `--ip=0.0.0.0 --allow-root` when starting Jupyter in the command above, or uncomment the last line in the Dockerfile above (`CMD ["jupyter-notebook", "--ip=0.0.0.0", "--allow-root"]`). Alternatively, you can run the container with the flag `--user=$(id -u):$(id -g)` to ensure that files created in the container have matching user and group ownership with those on the host machine, preventing permission issues. However, this restricts the container from performing root-level operations. For broader usability and security, it is advisable to create a non-root user (e.g. jovyan) within the Docker image by adding user setup commands to the Dockerfile (see Dockerfile2 below). This approach makes the image more user-friendly and avoids file ownership conflicts.
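The non-root-user approach is shown in Dockerfile2 below; for reference, the run-time workaround from the first option would look roughly like this, reusing the image tag from above:

```bash
docker run --rm -p 8888:8888 --volume=$(pwd):/root sandbox-debian-jupyter:1.0 \
    jupyter-notebook --ip=0.0.0.0 --allow-root
```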
Dockerfile2

```dockerfile
##-------- ADD CONTENT FROM Dockerfile HERE --------

# Creating a group & user
RUN addgroup --gid 1000 user && \
    adduser --uid 1000 --gid 1000 --gecos "" --disabled-password jovyan

# Setting active user
USER jovyan

# Setting working directory
WORKDIR /home/jovyan

# OPT: command to open jupyter automatically
CMD ["jupyter-notebook", "--ip=0.0.0.0"]
```
- Use the `--rm` flag to automatically remove the container once it stops running, to avoid cluttering your system with stopped containers.
- Use `--volume` to mount data into the container (e.g. your working directory into `/root`).
- Use the `--file` (`-f`) flag to test different Dockerfile versions (default: “PATH/Dockerfile”):

```bash
docker build -t sandbox-debian-jupyter:2.0 sandbox-debian-jupyter -f sandbox-debian-jupyter/Dockerfile2
```
Now that we have fixed that problem, we will test A. using a port to launch a Jupyter Notebook (or RStudio server) and B. starting a bash shell interactively.
```bash
# Assuming the Dockerfile includes CMD ["jupyter-notebook", "--ip=0.0.0.0"]
# Option A. Start jupyter-notebook on the server
docker run --rm -p 8888:8888 --volume=$(pwd):/home/jovyan sandbox-debian-jupyter:2.0

# Option B. Start an interactive shell instead
docker run -it --rm --volume=$(pwd):/home/jovyan sandbox-debian-jupyter:2.0 /bin/bash
```
The `-p` option in Docker allows services (e.g. Jupyter Notebooks) running inside the container to be accessible from outside (through `localhost:1234` on your local machine).
```bash
-p host_port:container_port   # port mapping between the host machine and the container

-p 8787:8787   # connect port 8787 on the host machine to port 8787 inside the container.
               # This setup allows you to access RStudio, which is running inside the container,
               # by navigating to http://localhost:8787 on your host machine

-p 8888:8888   # similarly, this setup enables you to access JupyterLab, which is running inside
               # the container, by going to http://localhost:8888 on your host machine
```
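You can confirm which ports a running container publishes with docker port; a small sketch, assuming the sandbox-debian-jupyter:2.0 image from above (whose CMD starts Jupyter) and an illustrative container name:

```bash
docker run -d --rm --name myjupyter -p 8888:8888 sandbox-debian-jupyter:2.0
docker port myjupyter    # e.g. 8888/tcp -> 0.0.0.0:8888
docker stop myjupyter    # stop (and, because of --rm, remove) the container
```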
Create a new app from scratch
When working with containers, you usually need to create a Dockerfile to define your image and a compose.yaml file that defines how to run it. As an alternative to starting with a base image and modifying it, you can use the following command:

```bash
docker init
```

This utility will walk you through creating the following files with sensible defaults for your project:
- .dockerignore
- Dockerfile
- compose.yaml
- README.Docker.md
Publish your Docker image on Docker Hub
Publishing your Docker image on Docker Hub is straightforward. You just need a Docker Hub account—preferably linked to your GitHub account if you have one. For detailed instructions, refer to the documentation. The process involves only a few commands.
```bash
# login
docker login # <username> <pw>
# optional: retag the image with your Docker Hub account name
# docker tag sandbox-debian-jupyter:1.0 <username>/sandbox-debian-jupyter:1.0
# push image
docker push <username>/sandbox-debian-jupyter:1.0
```
In Docker, the file system of an image is built using several layers, or overlays. Each layer represents a set of changes or additions made to the image. When you update software or packages within a Docker container, a new layer is created with only the new or changed content, rather than modifying the existing layers.
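You can list these layers, together with the instruction that produced each one, with docker history; for example, using the image built above:

```bash
docker history sandbox-debian-jupyter:2.0
```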
You are now ready to share the link to your Docker image with your colleagues, ensuring that everyone uses the exact same environment.
Sources
Scientific articles:
- Alser, Mohammed, et al. “Packaging and containerization of computational methods.” Nature Protocols (2024): 1-11.