FAIR environments

Modified

November 14, 2024

Understanding the importance of computational environments is crucial for ensuring the consistency and reliability of research outcomes. These environments can vary significantly between systems, including differences in operating systems, installed software, and software package versions. When a research project is transferred to a different computer or platform, analyses may fail to run or yield inconsistent results, particularly if the software depends on specific configurations or libraries. Dependencies can evolve over time or lack proper documentation, creating hidden variations between setups. Consequently, merely knowing a software version may not guarantee consistent performance across different environments, highlighting the need for robust management strategies.

To address these challenges, project and package managers offer valuable solutions for organizing software in isolated environments. For research to be reproducible, the original computational environment must also be recorded so others can replicate it accurately. This involves making your code easy to install and run by others, documenting the setup process thoroughly, and carefully managing and sharing your software environment.

Warning

If you’re developing your own tools/pipelines, here is the hard truth:

  • If your code isn’t easy to install, no one will bother using it.
  • If you don’t document how to run it, people will be left guessing—and probably give up.
  • Lastly, if you don’t carefully control and share your software environment, even you might struggle to get the same results later on.

There are several methods to achieve this, most notably package managers and containers.

Package managers are easy to use and share across different systems, and they are lightweight and efficient, with fast start-up times. Containers provide an even greater level of isolation by packaging an application together with its dependencies and the operating-system libraries it needs, making it possible to deploy applications across various machines without additional installations beyond the container runtime itself. Unlike virtual machines, which also virtualize hardware, containers share the kernel of the host while still ensuring that applications function uniformly regardless of the underlying infrastructure. This approach enhances reproducibility and streamlines deployment, making containers a powerful tool for researchers seeking consistent results in diverse computing environments.

Recording and sharing the computational environment is essential for ensuring reproducibility and transparency. Below, we will explore two tools that can help with this: mamba, a package manager, and Docker, a container system. We will explain the differences between them and provide guidance on choosing the right tool for your specific scenario.
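
As a minimal sketch of what recording an environment can mean in practice, the snippet below writes the operating system and the versions of a few tools to a text file. The function name record_environment, the output file name environment_report.txt, and the list of tools are illustrative assumptions, not part of mamba or Docker.

```shell
#!/usr/bin/env bash
# Sketch: write a small, human-readable report of the current computational environment.
# The function name, file name, and tool list are illustrative assumptions.

record_environment() {
    local out="$1"
    local tool
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "os: $(uname -s) $(uname -r)"
        for tool in python3 mamba conda singularity docker; do
            if command -v "$tool" >/dev/null 2>&1; then
                # Keep only the first line of the version output
                echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
            else
                echo "$tool: not installed"
            fi
        done
    } > "$out"
}

record_environment environment_report.txt
cat environment_report.txt
```

A report like this does not replace a package manager or a container, but it makes hidden differences between two machines visible at a glance.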

Package managers

When coding, it’s crucial to ensure that projects are developed under consistent software conditions. The packages and libraries (dependencies) used during development should remain unchanged throughout the project to prevent issues like variations in output formats or conflicts from new algorithm implementations, which can be difficult to trace. Environment and package managers allow users to create isolated frameworks (environments) where specific packages can be installed without affecting other software outside the environment. For even greater isolation, containers can be used (see the related section on this page).

Link | Description
Conda | A widely used and user-friendly environment manager
Getting started with conda | Official guide to setting up and using conda
Conda cheat sheet | Quick reference for conda usage
YARN | An alternative to conda

Mamba is a reimplementation of the Conda package manager in C++. While our focus will be on Mamba, it’s important to note that it maintains compatibility with Conda by using the same command-line parser, package installation and uninstallation code, and transaction verification routines.

Mamba uses software installation specifications that are maintained by extensive communities of developers, organized into channels, which serve as software repositories. For example, the “bioconda” channel specializes in bioinformatics tools, while “conda-forge” covers a broad range of data science packages.

Mamba vs. conda

As previously mentioned, mamba is a newer and faster implementation. The two commands can be used interchangeably for most tasks. If you use Conda, you should still complete the exercises, as you’ll gain experience with both tools. For more information on their ecosystem and advantages, see the Mamba documentation.

Mamba allows you to create different software environments, where multiple package versions can co-exist on your system.
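
To illustrate how versions can co-exist, here is a hedged sketch that sets up two environments pinning different numpy versions. The environment names numpy-old and numpy-new, the older version pin, and the run helper are assumptions made for this example. By default the commands are only printed, so nothing is installed unless you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: two environments that pin different versions of the same package.
# Environment names and the older version pin are illustrative assumptions.
# By default the commands are only printed; set DRY_RUN=0 to execute them.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Each environment gets its own copy of numpy at the requested version
run mamba create --yes --name numpy-old --channel conda-forge numpy=1.24
run mamba create --yes --name numpy-new --channel conda-forge numpy=1.26.4

# Ask each environment which numpy version it sees
run mamba run -n numpy-old python -c 'import numpy; print(numpy.__version__)'
run mamba run -n numpy-new python -c 'import numpy; print(numpy.__version__)'
```

Because each environment lives in its own directory, installing or upgrading numpy in one of them leaves the other untouched.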

Build your mamba environment

Follow the mamba instructions to install it. Let’s also add the bioconda and conda-forge channels, which will come in very handy.

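# Each --add puts the channel at the top of the list,
# so the channel added last (conda-forge) gets the highest priority
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge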

Now you are set to create your first environment. Follow these steps:

  1. Create a new environment named myenv
  2. Install the following packages in myenv: bowtie2, numpy=1.26.4, matplotlib=3.8.3
  3. Check the environments available
  4. Load/activate the environment
  5. Check which python executable is being used and that bowtie2 is installed.
  6. Deactivate the environment

Here are some of the commands you need for the exercise.

# use conda or mamba commands
mamba create -n <ENV-NAME>
mamba install --channel <CHANNEL-NAME> --name <ENV-NAME> <SOFTWARE-NAME>
mamba env list
# mamba init
mamba activate <ENV-NAME>
mamba deactivate

  1. The syntax to create a new environment is: mamba create --name myenv
  2. Example “bowtie2”: Go to anaconda.org and search for “bowtie2” to confirm it is available through Mamba and which software channel it is provided from. You will find that it is available via the “bioconda” channel: https://anaconda.org/bioconda/bowtie2. The syntax to install packages is: mamba install --channel <CHANNEL-NAME> --name <ENV-NAME> <SOFTWARE-NAME>

mamba install --name myenv --channel bioconda bowtie2=2.5.3 "matplotlib=3.8.3" "numpy=1.26.4"

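The command above installs all three packages at once. The same syntax also works if you prefer to install the other packages one at a time.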

  3. To see all available environments, run mamba env list. A “*” marks the environment that is currently active.
  4. Load the environment with mamba activate myenv.
  5. which python should print the executable of the active environment (a path similar to /home/mambaforge/envs/myenv/bin/python). bowtie2 --help should print the help text, confirming that bowtie2 is installed.
  6. Deactivate the environment with mamba deactivate (or conda deactivate).
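
Once the environment works, it can be described in a file that others use to rebuild it. The sketch below writes such a file by hand for the exercise environment. The file name environment.yml follows the usual conda convention, and the channel order is our assumption. The commands in the trailing comments are standard conda/mamba subcommands but are not executed here.

```shell
#!/usr/bin/env bash
# Sketch: describe the exercise environment in a shareable file.
# The channel order shown here is an assumption for this example.

cat > environment.yml <<'EOF'
name: myenv
channels:
  - conda-forge
  - bioconda
dependencies:
  - bowtie2=2.5.3
  - numpy=1.26.4
  - matplotlib=3.8.3
EOF

cat environment.yml

# Someone else could then recreate the environment with:
#   mamba env create --file environment.yml
# An existing environment can be written to such a file with:
#   conda env export --name myenv > environment.yml
```

Keeping this file under version control next to your analysis code gives collaborators a single starting point for reproducing your setup.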

If you have different environments set up for various projects, you can switch between them or run commands directly within a specific environment using:

mamba run -n <ENV-NAME> python myscript.py

Loading mamba environments in shell scripts

If you need to activate an environment in a shell script that will be submitted to SLURM, you must first source Mamba’s configuration file. For instance, to load the myenv environment we created, the script would include the following code:

# Always add these two commands to your scripts
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# then you can activate the environment
mamba activate myenv

When jobs are submitted to SLURM, they run in a non-interactive shell where Mamba isn’t automatically set up. By running the source command, you ensure that Mamba’s activate function is available. It’s important to remember that even if the environment is loaded on the login node, the scripts will execute on a different machine (one of the compute nodes). Therefore, always include the command to load the Mamba environment in your SLURM submission scripts.
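
Putting these pieces together, a complete submission script might look like the sketch below. The job name, resource requests, and file names are placeholders that we chose for illustration, so adapt them to your cluster. The snippet only writes the script and checks its syntax; it does not submit anything.

```shell
#!/usr/bin/env bash
# Sketch: write a SLURM submission script that activates myenv on the compute node.
# Job name, resource requests, and file names are illustrative assumptions.

cat > run_myenv.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=bowtie2-help
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00
#SBATCH --output=bowtie2-help_%j.log

# Make the activate function available in the non-interactive shell
eval "$(conda shell.bash hook)"
source "$CONDA_PREFIX/etc/profile.d/mamba.sh"

# Activate the environment on the compute node, then run the tool
mamba activate myenv
bowtie2 --help
EOF

# Check the generated script for syntax errors without executing it
bash -n run_myenv.sbatch && echo "run_myenv.sbatch looks syntactically valid"

# Submit it on a cluster with: sbatch run_myenv.sbatch
```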

Base environment

The base environment is the primary environment and contains the conda package manager itself. It is activated by default unless configured otherwise.

Your command prompt will show which env is activated within parentheses.

(base) [username@node-01 ~]$

To speed up package installation, we will update conda to use the libmamba solver (see the conda-libmamba-solver documentation), which resolves dependencies considerably faster, particularly when several packages are installed at once.

# Installation 
conda install -n base --yes conda-libmamba-solver
# configuration 
conda config --set solver libmamba

Warning

Avoid modifications to the base environment. This is the only instance where you should perform installations within the base environment, as any further changes could jeopardize the integrity of the conda installation.

A common practice is to disable the automatic activation of the base environment. Doing so helps create a more organized and efficient workflow (e.g. shorter shell start-up time), reduces potential errors (e.g. unintended modifications), and gives you better control over your development environment.

conda config --set auto_activate_base false
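
The conda config commands on this page all write to your user configuration file (usually ~/.condarc). As a rough sketch, and assuming no other settings were present beforehand, the file would end up looking like the example below. It is written to a separate file named condarc.example (a name chosen for this illustration) so that your real configuration is left untouched.

```shell
#!/usr/bin/env bash
# Sketch: expected content of ~/.condarc after the configuration commands on this page.
# Written to an example file so that the real configuration is left untouched.

cat > condarc.example <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
solver: libmamba
auto_activate_base: false
EOF

cat condarc.example

# With conda installed, the live values can be inspected with:
#   conda config --show channels solver auto_activate_base
```

Note that conda-forge appears first because it was added last, which gives it the highest priority.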

Containers

Essentially, a container is a self-contained, lightweight package that includes everything needed to run a specific application—such as the operating system, libraries, and the application code itself. Containers operate independently of the host system, which allows them to run the same software across various environments without any conflicts or interference. This isolation ensures that researchers can consistently execute their code on different systems and platforms, without worrying about dependency issues or conflicts with other software on the host machine.

Link | Description
Docker | A widespread open-source container platform, popular in both research and industry
Docker course | A course on using Docker, freely hosted on YouTube
Docker curriculum | Beginner introduction to Docker
Docker basics | Introductory tutorials to Docker from the official documentation page
Singularity | Another containerization tool. It lets you decide to what degree a container interacts with the host system
Singularity tutorial | A well-made Singularity tutorial for HPC users
Singularity video tutorial | A video tutorial on Singularity
Reproducibility by containerization | A video on reproducibility with Singularity containers

Docker vs. Singularity

The most significant difference is the permission level required to run them. By default, the Docker daemon runs as root and processes inside a Docker container run as root too, which can give them broad access to the host system. While this can be useful in certain situations, it also poses security risks, especially in multi-user environments. In contrast, Singularity containers run as the non-root user who launched them, enhancing security and preventing unauthorized access to the host system.

  • Docker is ideal for building and distributing software across different operating systems
  • Singularity is designed for HPC environments and offers high performance without needing root access
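
Which of the two you can use often depends on what is installed on the machine you are working on. The helper below is a small sketch for checking this. The function name detect_container_runtime and the preference order (Singularity first, then Docker) are our own assumptions, not an established convention.

```shell
#!/usr/bin/env bash
# Sketch: report which container runtime is available on the current machine.
# The function name and the preference order are illustrative assumptions.

detect_container_runtime() {
    local runtime
    # Check Singularity first, since it does not need root access on shared HPC systems
    for runtime in singularity docker; do
        if command -v "$runtime" >/dev/null 2>&1; then
            echo "$runtime"
            return 0
        fi
    done
    echo "none"
    return 1
}

runtime="$(detect_container_runtime)" || true
echo "Container runtime found: $runtime"
```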

In the following sections, we’ll cover how to retrieve environment information, utilize containers, and automate environment setup to improve reproducibility.

Singularity on a remote server

While you can build your own Singularity images, many popular software packages already have pre-built images available from public repositories. The two repositories you’ll most likely use or hear about are:

Installation
# You will only need to vagrant init once 
export VM=sylabs/singularity-3.0-ubuntu-bionic64 && \
    vagrant init $VM && \
    vagrant up && \
    vagrant ssh

Tip
  • We recommend using the pre-installed version provided by your system administrators if you’re working on a shared system. If you’re working on your own computer, you can install the necessary software using Mamba.
  • The two repositories might host different versions of the same software, so it’s worth checking both to find the version you need.
  • To download a software container from public repositories, use the singularity pull command.
  • To execute a command within the software container, use the singularity run command.
  • Good practice: create a directory to save all Singularity images together. .sif is the standard extension for the images.

Exercise

Download a singularity image from one of the two repositories listed above (choose a software like bcftools, bedtools, bowtie2, seqkit…) and run the --help command. This command displays the help documentation of the program, verifying that our image is functioning correctly and includes the intended software.

# create a directory for our singularity images
mkdir images

# download the image
singularity pull images/bowtie2-2.5.4.sif https://depot.galaxyproject.org/singularity/bowtie2%3A2.5.4--he20e202_2

# run the image: singularity run <PATH-TO-IMAGE> <YOUR COMMANDS>
singularity run images/bowtie2-2.5.4.sif bowtie2 --help
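
If you download several tools this way, it can help to build the URL and the local file name from the tool name and tag. The sketch below does this for the depot used in the example above. The function names depot_url and image_path are hypothetical helpers, and the URL pattern is inferred from that single example, so treat it as an assumption. The final line only prints the pull command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: build the download URL and local file name for an image in the Galaxy depot.
# Function names are hypothetical; the URL pattern mirrors the bowtie2 example above.

depot_url() {
    local name="$1" tag="$2"
    # The colon between name and tag is percent-encoded as %3A in the URL
    echo "https://depot.galaxyproject.org/singularity/${name}%3A${tag}"
}

image_path() {
    local name="$1" tag="$2"
    # Keep only the version part of the tag (the text before the first "--")
    echo "images/${name}-${tag%%--*}.sif"
}

url="$(depot_url bowtie2 2.5.4--he20e202_2)"
sif="$(image_path bowtie2 2.5.4--he20e202_2)"

# Print the command rather than running it, so nothing is downloaded here
echo "singularity pull $sif $url"
```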
