RDM for containers

Modified

November 14, 2024

Now that you’re familiar with containers, it’s time to focus on making them reproducible and ensuring good Research Data Management (RDM) practices.

The approach introduced in the Docker lesson has a significant drawback: it does not guarantee a reproducible environment, because the build depends on external servers and services that are updated frequently. If you lose your Docker image, you might not be able to rebuild it, or even know precisely what was in it. To guard against this, save the output of the commands below alongside your Dockerfile. This information will be crucial if you need to rebuild the image.

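# Retrieve information on when each layer of the image was built (raw timestamps and sizes):
docker image history albarema/sandbox-debian-jupyter:1.0 --human=false
# List the installed packages and their versions (--rm removes the temporary container afterwards):
docker run --rm albarema/sandbox-debian-jupyter:1.0 dpkg --list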
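
The two commands above can be wrapped in a small script so that their output is stored in files next to the Dockerfile. The following is a minimal sketch, not part of the original lesson: the function names `provenance_prefix` and `record_provenance` and the output file names are hypothetical, and `record_provenance` assumes Docker is installed and the image is available locally.

```shell
#!/bin/sh
# Hypothetical helper: turn an image reference into a file-name-safe prefix,
# e.g. "user/name:1.0" becomes "user_name_1.0".
provenance_prefix() {
    printf '%s' "$1" | tr '/:@' '___'
}

# Hypothetical helper: save the build history and the package list of an image.
# $1: image reference, $2: output directory (defaults to the current directory).
# Requires a working Docker installation and a locally available image.
record_provenance() {
    image="$1"
    outdir="${2:-.}"
    prefix="$outdir/$(provenance_prefix "$image")"
    # Layer history with raw timestamps and untruncated commands
    docker image history --human=false --no-trunc "$image" > "$prefix.history.txt"
    # Installed Debian packages and their versions
    docker run --rm "$image" dpkg --list > "$prefix.packages.txt"
}
```

Calling `record_provenance albarema/sandbox-debian-jupyter:1.0 .` from the directory that holds the Dockerfile would then write two text files that can be committed together with it.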

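How do we improve reproducibility?

Pin the base image to an immutable digest rather than a tag that can be moved, and fix the package sources to a dated snapshot so that later package installations resolve to the same versions.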

Dockerfile
FROM debian:stable-20240812@sha256:2171c87301e5ea528cee5c2308a4589f8f2ac819d4d434b679b7cb1dc8de6912
# OR: set the snapshot date for the package sources (see https://snapshot.debian.org/)
ARG SNAPSHOT_DATE=20240812T000000Z
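
One way the snapshot date might be used is to generate an apt source line that points at the matching archive on snapshot.debian.org. The sketch below is an illustration under stated assumptions, not the lesson's own method: the function name `snapshot_sources` is hypothetical, the suite name `bookworm` is assumed to match the base image, and only the `main` component of the main Debian archive is covered.

```shell
#!/bin/sh
# Hypothetical helper: print an apt source line for a snapshot.debian.org timestamp.
# $1: snapshot timestamp, e.g. 20240812T000000Z
# $2: suite name (assumed to be "bookworm"; adjust to your base image)
snapshot_sources() {
    snapshot_date="$1"
    suite="${2:-bookworm}"
    # check-valid-until=no: Release files in old snapshots are past their validity date
    printf 'deb [check-valid-until=no] http://snapshot.debian.org/archive/debian/%s/ %s main\n' \
        "$snapshot_date" "$suite"
}
```

Running `snapshot_sources 20240812T000000Z > sources.list` on the host produces a file that could be copied into the image as `/etc/apt/sources.list` before any `apt-get` step. The default source definitions shipped with the base image would likely need to be removed or overridden for the snapshot to take effect.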

Sources

  • Content adapted from Reproducible Research II: Practices and tools for managing computations and data by members of France Université Numérique.
  • RDM - data analysis, Elixir Europe