Docker

Content

This section is divided into two main parts:

  • Using Docker images
  • Building custom Docker images

Refer to the Docker commands overview for a handy checklist or quick reference.

Requirements

Register on Docker Hub at hub.docker.com and follow the Docker Desktop guide to get started. We recommend going through that tutorial before diving in.

Docker enables developers to build, share, run, and verify applications seamlessly across different environments, eliminating the need for complex environment configuration and management. Before diving into hands-on activities, make sure you understand three key concepts: images, containers, and registries (such as Docker Hub).

Using Docker images

First, we’ll start by using basic commands with existing Docker images from Docker Hub. This will help you become familiar with Docker’s functionality before we move on to creating our own custom images. While people commonly use shorthand Docker commands like docker pull and docker run, the full versions (e.g., docker image pull, docker container run) make explicit whether you are operating on an image or a container.
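
For example, the shorthand and full forms below are equivalent; the full form spells out which object type each command manages:

# shorthand and full form are equivalent
docker pull debian:stable       # same as: docker image pull debian:stable
docker run debian:stable ls     # same as: docker container run debian:stable ls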

Docker Hub

  • Search for images docker search <name> (e.g. docker search debian)
  • Download an image docker pull <image-name>

Local Docker Daemon

  • Display all docker images currently stored on your local Docker daemon docker images (alias for docker image ls)
  • Inspect docker image docker inspect <image_name> (alias for docker image inspect)
  • Run a command (cmd) in a container docker run <image_name> cmd (alias for docker container run <image_name> cmd)
  • Start an interactive bash shell docker run -it <image_name> bash. Add other flags like:
    • -p 8888:8888 to make services running inside the container reachable through ‘localhost:8888’ on your host.
    • --rm to automatically delete the container once it stops, keeping your system clean (including its filesystem changes, logs and metadata). If you don’t use this flag, the stopped container is kept, along with information about its processes; check all containers in your Docker daemon with docker container ls -a
    • --user=$(id -u):$(id -g) useful if you are sharing volumes and need appropriate permissions on the host to manipulate files.
  • Share the current directory with the container docker run -it --volume=$(pwd):/directory/in/container image_name bash; the output of pwd will be mounted at /directory/in/container (e.g. data, shared, etc.)
  • Manage your containers using docker pause or docker stop
  • Tag images with a new name docker image tag image_name:tag new_name:tag
  • View a container’s logs docker logs <container_id>
  • Remove images and clean up your hard drive docker rmi <image_name>
  • Remove containers docker container rm <container_name>. Alternatively, remove all dead containers: docker container prune
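
Putting a few of these together, a typical image/container lifecycle looks like this (a sketch using the debian image):

docker pull debian:stable                   # download the image
docker run --name=test debian:stable ls     # run a command; the stopped container 'test' is kept
docker logs test                            # inspect the container's output
docker container rm test                    # clean up the container
docker rmi debian:stable                    # remove the image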

All Docker images have a digest, which is the sha256 hash of the image. It uniquely identifies a Docker image, which is great for reproducibility.
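
To list the digests of the images stored locally, add the --digests flag; you can then pull or run an image pinned to a specific digest:

docker images --digests
# e.g.: docker pull debian@sha256:<digest>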

Exercise

In this exercise, we will use a Debian stable image (sha256-540ebf19fb0bbc243e1314edac26b9fe7445e9c203357f27968711a45ea9f1d4) as an example for pulling and inspecting Docker images. This image offers a reliable, minimal base with essential tools, including fundamental utilities like bash for shell scripting and apt for package management. It’s an excellent starting point for developing and testing pipelines or configuring server environments, providing a stable foundation that can be customized with additional packages as needed.

1. Get a container image

  • Pull the Docker image (using the stable tag; otherwise, latest would be pulled by default)

    docker pull debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912

2. Run commands in a container

  • List the content of the container

    docker run -it --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 ls
  • Check if there is python or perl in this container:

    docker run --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 which -a python perl

3. Use docker interactively

  • Enter the container interactively with a Bash shell

    docker run -it --rm debian@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912 /bin/bash
  • Now, collect info about the packages installed in the environment.

    Interactive Docker container
    hostname
    whoami
    ls -la ~/
    python  
    echo "Hello world!" > ~/myfile.txt
    ls -la ~/

4. Exit and check the container

  • Exit the container

    exit

Now, rerun commands from step 3 under “Interactive Docker container”. Does the output look any different?

You will notice that you are still root, but the hostname has changed and the file you created earlier has disappeared.

5. Inspect the docker image

docker image inspect debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912

6. Inspect Docker image details

Identify the creation date, the name of the field containing the image digest, and the command run by default when entering this container.
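
One way to extract these fields directly is the --format option of docker image inspect, which accepts Go templates (a sketch; point it at the image you pulled):

docker image inspect --format 'Created: {{.Created}}' debian:stable
docker image inspect --format 'Digest: {{index .RepoDigests 0}}' debian:stable
docker image inspect --format 'Default command: {{.Config.Cmd}}' debian:stable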

7. Remove the image

docker rmi debian:stable@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912

When you exit a container, any changes made are lost because each new container starts in a clean state. Containers are isolated environments that don’t retain memory of past interactions, allowing you to experiment without affecting your system. This immutability ensures that your program will run under the same conditions in the future, even months later. So, how can you retrieve the results of computations performed inside a container?

To preserve your work, it is crucial to use Docker’s volume mounting feature. When running a Docker container with the -v (volume) option, it’s best to execute the command from the temporary directory (e.g. /tmp/ ) or a project-specific directory rather than your home directory. This practice helps maintain a clean and isolated environment within the container, while ensuring your results are saved in a designated location outside the container’s ephemeral filesystem.
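
As a minimal sketch (the /tmp/myproject path is just an example), results written to the mounted directory survive after the container exits:

mkdir -p /tmp/myproject && cd /tmp/myproject
docker run --rm --volume=$(pwd):/data debian:stable bash -c 'echo "Results" > /data/out.txt'
cat out.txt  # the file persists on the host after the container is gone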

Named volumes

Now, we will be using named volumes or shared directories. When you make changes in these directories within the container, the updates are immediately reflected in the corresponding directory on the host machine. This setup enables seamless synchronization between the container and the host file system.

# -v / --volume: mounting only your project-specific dir
docker run -it --volume=<project_dir>:/path/in/container <image_name>
# Named volumes: managed by Docker and isolated from your host file system
docker volume create <my_project_volume>
docker run -it --volume=<my_project_volume>:/path/in/container <image_name>
Why not mount your home directory?

Mounting your current working directory (${PWD}) into the container at e.g. /home/rstudio can compromise the benefits of container isolation. If you run the command from your home directory, any local packages or configurations (such as R or Python packages installed with install.packages or pip) would be accessible inside the container. This inadvertently includes your personal setup and packages, potentially undermining the container’s intended clean and isolated environment. To maintain effective isolation, it’s better to use a temporary or project-specific directory for mounting.

Alternatively, you can add a volume to a project by modifying the compose.yaml (also named docker-compose.yml) file. There are two types of volume definitions:

  • Service-level volumes: specify how volumes are mounted inside the container. In this case, the dataset volume (defined at the top level) will be mounted to /path/in/container/ (e.g. a data, results or logs directory) inside the myApp container.
  • Top-level volumes: volumes shared across multiple services in the compose.yaml file. The dataset volume can be referenced by any service and will be created if it doesn’t exist. If the host path is not specified, Docker will automatically create and manage the volume in a default location on the host machine. However, if you need the volume to be located in a specific directory, you can specify the host path directly (option 2).
compose.yaml
services:
  # Service-level volumes
  myApp:
    # ...
    volumes:
      - dataset:/path/in/container/
      # Option 2: bind mount a specific host path instead
      # - /my/local/path:/path/in/container/

# Top-level volumes
volumes:
  dataset:

In this case, the volume named dataset will be mounted to the /path/in/container/ directory inside the container running the myApp service.

Let’s not forget to track changes to container images for reproducibility by using version control. Store your images in a shared storage area managed with Git or Git Annex to handle versions and facilitate collaboration.

Transfer and backup Docker images

Saving a Docker image as a tar file is useful for transferring the image to another system or operating system without requiring access to the original Docker registry. The tar file contains several key components:

  • metadata: JSON files essential to reconstruct the image
    • layer information, with each layer associated with metadata that includes the commands used to create the layer.

      Dockerfile
      # Layer 1 - base image
      FROM ubuntu:20.04
      # Layer 2 - the result of running the command
      RUN apt-get update && apt-get install -y python3
      # Layer 3 - add application files to the image
      COPY . /app
    • tags: pointers to specific image digests or versions.

    • history of the image and the instructions from the Dockerfile that were used to build it.

    • manifest, which ties together the layers and the overall image structure.

  • Filesystem: the actual content, files and directories that make up the image.
# save to tar file 
docker image save --output=image.tar image_name 
# load tar file
docker image load --input=image.tar  
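
To see these components for yourself, list the contents of a saved tar file (the exact layout varies with the Docker version):

tar -tf image.tar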

In some situations, you may need to access the filesystem content of a Docker image for purposes such as debugging, backup, or repurposing. To do this, you should create a container from the image and then export the container’s root filesystem to a tar file. Similarly, you can create a new Docker image from this tar file if needed.

docker container create --name=temp_container image_name
docker container export --output=image.tar temp_container
docker container rm temp_container
# new image importing tar file 
docker image import image.tar image_name

Building Docker images

We’re now ready to build custom Docker images. We’ll start with a minimal pre-built image, add the necessary software, and then upload the final image to Docker Hub.

A common practice is to start with a minimal image from Docker Hub. There are many base images you can use to start building your customized Docker image, and the key difference between them lies in what they already include. For example:

  • Debian: This is a very minimal base image, including only the most essential utilities and libraries required to run Debian, such as the core Linux utilities and the package manager (apt). It offers a high degree of customization, giving you full control over what you install. However, you’ll need to manually add all the necessary software, making the setup process longer. This image is relatively small in size, making it a good starting point if you want to build your environment from scratch.

  • Data Science-Specific Images: These images come pre-configured with a wide range of tools pre-installed, reducing the amount of customization needed. This allows you to get started much faster, though it also means less flexibility. These images tend to be larger in size due to the pre-installed software. For example:

    • tensorflow/tensorflow: This image includes TensorFlow and often other machine learning libraries, making it ideal for deep learning projects.
    • jupyter/scipy-notebook: This image is around 2-3 GB and includes Python, Jupyter Notebook, and libraries like NumPy, Pandas, Matplotlib, and more, making it a comprehensive option for data science tasks.
    • r-base: This image provides a base for R environments, useful for data analysis and statistical computing.
    • rocker/rstudio: This image includes RStudio and a base R environment, making it perfect for those working in R for statistical computing and data analysis.

Once you have chosen your base image, use a Dockerfile to modify its components, specifying commands for software installation and configuration. The Dockerfile acts as a recipe, providing a list of steps to build the image.

Dockerfile
# deploy docker container
FROM <node|debian|python|jupyter-base>

# Info and rights of the app
LABEL software="App_name - sandbox" \
    maintainer="<author.address@sund.ku.dk>"  \
    version="YYYY.MM.DD" 

# root: needed to install and modify the container 
USER 0 

# run shell commands (e.g. creating directories, installing packages or software)
RUN mkdir -p /app/data  # placeholder path: create any directories your app needs

# install packages & dependencies, clean up tmp files created by apt,
# and remove node_modules (Node.js packages & dependencies) if present
RUN apt update \
    && apt install -y jupyter-notebook python3-matplotlib python3-pandas python3-numpy \
    && rm -fr node_modules \
    && rm -rf /var/lib/apt/lists/*

# set working directory (this directory should act as the main application directory)
WORKDIR /app

# copy files to-from directories
COPY /from/path/... /to/path/...

# set environment variables (e.g. for Conda); placeholder key-value pair shown
ENV MY_VAR=value

# switch to a non-root user (e.g. UID 1000) instead of the default root user
USER 1000

In this example, the RUN command updates the package list and installs Jupyter along with all necessary dependencies. It’s common practice to clean up temporary files and caches afterward, as shown in the example above.

Label

Use key-value pair syntax to add the following labels to the container:

  • software = name of the app
  • author = maintainer or author of the app
  • version = version of the app
  • license = the app’s license, e.g., “MIT”
  • description = very short description of the app
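
A minimal sketch of these labels in a Dockerfile (all values are placeholders):

LABEL software="myApp" \
      author="Name Surname" \
      version="1.0.0" \
      license="MIT" \
      description="Sandbox app for Jupyter-based analyses"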

After preparing your Dockerfile, use the docker build command to create a new image based on those instructions. Docker’s isolation ensures that any changes made within a container are lost when you exit it. Nevertheless, you can use docker commit to save these changes by creating a new image from the updated container. This new image will retain your modifications and can be used to launch new containers.

docker build -t <account/app:version> <directory_docker>
Example of a tag name

For the docker build -t flag, the format of the tag is used to specify both the repository and the version of the Docker image. It consists of three different elements (e.g.: albarema/sandbox_app:v1.0):

  • Repository name where the image will be stored, account in Docker registry (e.g. albarema)
  • Name of the image (e.g. sandbox_app)
  • Version label for the image (e.g. v1.0, test, etc.)

The repository name can be deferred until the application is ready for publication on Docker Hub. You can also modify the tag at a later time.
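
As a sketch of the docker commit workflow mentioned above (the container ID and tag are placeholders):

# find the ID of the container you modified
docker container ls -a
# save its current state as a new image
docker commit <container_id> account/app:snapshot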

Exercise
  1. Create a Dockerfile in a project-specific dir (e.g.: sandbox-debian-jupyter). We will add a command to clean up the package cache after installation, which reduces the image size.
FROM debian:stable 

LABEL maintainer="Name Surname <abd123@ku.dk>"

# Update package list and install necessary packages
RUN apt update \
    && apt install -y jupyter-notebook \
                      python3-matplotlib \
                      python3-pandas \
                      python3-numpy \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* # cleanup tmp files created by apt

# You may consider adding a working directory
# WORKDIR /notebooks

# and a command to start Jupyter Notebook
# CMD ["jupyter-notebook", "--ip=0.0.0.0", "--allow-root"]
  2. Build the Docker image using, for example, docker build -t sandbox-debian-jupyter:1.0 sandbox-debian-jupyter

Testing the custom image

Let’s verify that the custom image works as expected by running the following command:

Terminal
docker run --rm -p 8888:8888 --volume=$(pwd):/root sandbox-debian-jupyter:1.0 jupyter-notebook

Jupyter typically refuses to run as root or accept network connections by default. To address this, you need to either add --ip=0.0.0.0 --allow-root to the command above when starting Jupyter, or uncomment the last line in the Dockerfile above (CMD ["jupyter-notebook", "--ip=0.0.0.0", "--allow-root"]). Alternatively, you can run the container with the flag --user=$(id -u):$(id -g) to ensure that files created in the container have matching user and group ownership with those on the host machine, preventing permission issues. However, this restricts the container from performing root-level operations. For broader usability and security, it is advisable to create a non-root user (e.g. jovyan) within the Docker image by adding user setup commands to the Dockerfile (see below). This approach makes the image more user-friendly and avoids file ownership conflicts.

Dockerfile2

##-------- ADD CONTENT FROM Dockerfile HERE --------

# Creating a group & user
RUN addgroup --gid 1000 jovyan && \
    adduser --uid 1000 --gid 1000 --gecos "" --disabled-password jovyan
# Setting active user 
USER jovyan
# setting working directory 
WORKDIR /home/jovyan

# OPT: command to open jupyter automatically
CMD ["jupyter-notebook", "--ip=0.0.0.0"]
Tip
  • Use the --rm flag to automatically remove the container once it stops, to avoid cluttering your system with stopped containers.
  • Use --volume to mount data into the container, for example, your working directory (at /root, or /home/jovyan for the non-root image above)
  • Use the --file (-f) flag to test different Dockerfile versions (default: “PATH/Dockerfile”)
docker build -t sandbox-debian-jupyter:2.0 sandbox-debian-jupyter -f sandbox-debian-jupyter/Dockerfile2

Now that we have fixed that problem, we will test (A) using a port to launch a Jupyter Notebook (or RStudio Server) and (B) starting a Bash shell interactively.

# Assuming the Dockerfile includes CMD ["jupyter-notebook", "--ip=0.0.0.0"]

# Option A. Start jupyter-notebook or on the server 
docker run --rm -p 8888:8888 --volume=$(pwd):/home/jovyan sandbox-debian-jupyter:2.0 

# Option B. Start an interactive shell instead
docker run -it --rm --volume=$(pwd):/home/jovyan sandbox-debian-jupyter:2.0 /bin/bash
Which port to use?

The -p option in Docker allows services (e.g.: Jupyter Notebooks) running inside the container to be accessible from outside (through localhost:<host_port> on your local machine).

  • -p host_port:container_port # port mappings between the host machine and the container
  • -p 8787:8787 # connect port 8787 on the host machine to port 8787 inside the container. This setup allows you to access RStudio, which is running inside the container, by navigating to http://localhost:8787 on your host machine
  • -p 8888:8888 # similarly, this setup enables you to access JupyterLab, which is running inside the container, by going to http://localhost:8888 on your host machine.
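
As a quick sanity check of port mapping (a sketch using the official nginx image, which listens on port 80 inside the container):

docker run --rm -d -p 8080:80 --name=web nginx
curl http://localhost:8080   # the nginx welcome page, served from the container
docker stop web              # --rm removes the container once it stops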

Create a new app from scratch

When working with containers, you usually need to create a Dockerfile to define your image and a compose.yaml file that defines how to run it. As an alternative to starting with a base image and modifying it yourself, you can use the following command:

docker init

This utility will walk you through creating the following files with sensible defaults for your project:

- .dockerignore
- Dockerfile
- compose.yaml
- README.Docker.md

Publish your Docker image on Docker Hub

Publishing your Docker image on Docker Hub is straightforward. You just need a Docker Hub account—preferably linked to your GitHub account if you have one. For detailed instructions, refer to the documentation. The process involves only a few commands.

# login
docker login # <username> <pw>
# tag the image with your Docker Hub account
docker tag sandbox-debian-jupyter:1.0 <account>/sandbox-debian-jupyter:1.0
# push image
docker push <account>/sandbox-debian-jupyter:1.0

In Docker, the file system of an image is built using several layers, or overlays. Each layer represents a set of changes or additions made to the image. When you update software or packages within a Docker container, a new layer is created with only the new or changed content, rather than modifying the existing layers.
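
You can inspect these layers, and the commands that created them, with docker history:

docker history sandbox-debian-jupyter:1.0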

You are now ready to share the link to your Docker image with your colleagues, ensuring that everyone uses the exact same environment.

Sources

Scientific articles:

  • Alser, Mohammed, et al. “Packaging and containerization of computational methods.” Nature Protocols (2024): 1-11.