FAIR environments


Computational environments vary significantly from one system to another, encompassing differences in the operating system, installed software, and software package versions. If a research project is moved to a different computer or platform, the analysis might not run or might produce inconsistent results, especially if the software relies on specific configurations or dependencies. Dependencies are the libraries and tools a piece of software relies on; they can change over time or be poorly documented, leading to hidden variations between setups. Knowing a software version alone is often not enough to guarantee consistent results across different environments.

Warning

If you’re developing your own tools/pipelines, here is the hard truth:

  • If your code isn’t easy to install, no one will bother using it.
  • If you don’t document how to run it, people will be left guessing—and probably give up.
  • Lastly, if you don’t carefully control and share your software environment, even you might struggle to get the same results later on.

For research to be reproducible, the original computational environment must be recorded so others can replicate it accurately. This involves making your code easy for others to install and run, documenting the setup process thoroughly, and carefully managing and sharing your software environment. There are several methods to achieve this:

Virtual machines (VMs) simulate a complete operating system, including virtualized CPU, RAM, and storage. They are highly isolated, providing a separate environment for each VM. While VMs offer excellent isolation, they can be slow due to the overhead of simulating hardware, and their large images are cumbersome to share. VMs also require data transfer over the network to interact with the host system, which can be inefficient.

Containers, by contrast, achieve isolation through three Linux kernel features (a minimal sketch follows this list):

  • Chroot (filesystem isolation): changes the root directory for a process, isolating its filesystem view from the rest of the system so that the process can only access files within a specified directory tree.
  • Namespaces (process isolation): isolate processes from each other by creating separate namespaces for resources such as process IDs, network interfaces, and filesystems. Processes within the same namespace can interact with each other, but they are isolated from processes in other namespaces.
  • Control groups (resource management): manage and limit the system resources that processes can use, which ensures that they cannot exceed their allocated resources and helps maintain overall system performance and stability.
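The commands below are a minimal, illustrative sketch of these mechanisms on a Linux machine; the /srv/jail directory and the 512M limit are made-up examples, and most of the commands require root privileges:

# chroot: confine a shell's filesystem view to a directory tree
# (/srv/jail is a hypothetical directory that must contain its own /bin/sh)
sudo chroot /srv/jail /bin/sh

# namespaces: start a shell in its own PID namespace with a fresh /proc,
# so tools like ps only see the processes inside it
sudo unshare --pid --fork --mount-proc /bin/bash

# control groups (cgroup v2): cap the memory available to a group of processes
sudo mkdir /sys/fs/cgroup/demo
echo 512M | sudo tee /sys/fs/cgroup/demo/memory.max
echo $$   | sudo tee /sys/fs/cgroup/demo/cgroup.procs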

While package managers are very easy to use and share across different systems, and are lightweight and efficient with fast start-up times, containers provide full environment isolation (including the operating system), which ensures consistent behavior across different systems.

Recording and sharing the computational environment is essential for ensuring reproducibility and transparency. Below, we will explore two tools that can help with this: mamba, a package manager, and Docker, a container system. We will explain the differences between them and provide guidance on choosing the right tool for your specific scenario.

Package managers

Mamba is a reimplementation of the Conda package manager in C++. While our focus will be on Mamba, it’s important to note that it maintains compatibility with Conda by using the same command-line parser, package installation and uninstallation code, and transaction verification routines.

Mamba uses software installation specifications that are maintained by extensive communities of developers, organized into channels, which serve as software repositories. For example, the “bioconda” channel specializes in bioinformatics tools, while “conda-forge” covers a broad range of data science packages.
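As a quick illustration (assuming mamba's search subcommand, which mirrors conda search), you can query a specific channel for a package before installing it:

# list the bowtie2 builds available from the bioconda channel
mamba search --channel bioconda bowtie2

# broader data-science packages typically come from conda-forge
mamba search --channel conda-forge matplotlib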

Mamba vs. conda

As previously mentioned, mamba is a newer and faster implementation. The two commands can be used interchangeably for most tasks. If you use Conda, you should still complete the exercises, as you’ll gain experience with both tools. You can find more information on their ecosystem and advantages here.

Mamba allows you to create different software environments, so that multiple package versions can co-exist on your system.
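As a quick, hypothetical illustration (the environment names and versions are made up), two environments can hold different versions of the same package side by side:

# two isolated environments with different numpy versions
mamba create -n analysis-2021 numpy=1.21
mamba create -n analysis-2024 numpy=1.26

# each environment resolves its own interpreter and packages, independently of the other
mamba run -n analysis-2021 python -c "import numpy; print(numpy.__version__)"
mamba run -n analysis-2024 python -c "import numpy; print(numpy.__version__)"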

Build your mamba environment

Follow the mamba installation instructions to install it. Let’s also add the bioconda and conda-forge channels, which will come in very handy:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
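To double-check the result, you can list the configured channels; since --add prepends, the channel added last (conda-forge) ends up with the highest priority:

# show the channels now configured, highest priority first
conda config --show channels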

Now you are set to create your first environment. Follow these steps:

  1. Create a new environment named myenv
  2. Install the following packages in myenv: bowtie2, numpy=1.26.4, matplotlib=3.8.3
  3. Check the environments available
  4. Load/activate the environment
  5. Check which python executable is being used and that bowtie2 is installed.
  6. Deactivate the environment

Here are some of the commands you need for the exercise.

mamba create -n <ENV-NAME>                                                 # create a new, empty environment
mamba install --channel <CHANNEL-NAME> --name <ENV-NAME> <SOFTWARE-NAME>  # install a package into an environment
mamba env list                                                             # list all available environments
# mamba init                                                               # run once if activation is not yet set up
mamba activate <ENV-NAME>                                                  # activate the environment
mamba deactivate                                                           # deactivate the current environment
  1. The syntax to create a new environment is: mamba create --name myenv
  2. Example “bowtie2”: Go to anaconda.org and search for “bowtie2” to confirm it is available through Mamba and which software channel it is provided from. You will find that it is available via the “bioconda” channel: https://anaconda.org/bioconda/bowtie2. The syntax to install packages is: mamba install --channel <CHANNEL-NAME> --name <ENV-NAME> <SOFTWARE-NAME>

mamba install --name myenv --channel bioconda bowtie2=2.5.3 "matplotlib=3.8.3" "numpy=1.26.4"

Do the same search for the other packages.

  3. To see all available environments, run mamba env list. A “*” marks the one that is currently activated.
  4. Load the environment with mamba activate myenv.
  5. which python should print the python of the active environment (a path similar to /home/mambaforge/envs/myenv/bin/python), and bowtie2 --help should display the tool’s help page.
  6. Deactivate it with mamba deactivate.
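Once the environment works, you can record it in a file and recreate it elsewhere; for example:

# write the environment specification (packages, versions, channels) to a file
mamba env export --name myenv > environment.yml

# anyone (including future you) can rebuild the same environment from that file
mamba env create --file environment.yml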

If you have different environments set up for various projects, you can switch between them or run commands directly within a specific environment using:

mamba run -n <ENV-NAME> python myscript.py
Loading mamba environments in shell scripts

If you need to activate an environment in a shell script that will be submitted to SLURM, you must first source Mamba’s configuration file. For instance, to load the myenv environment we created, the script would include the following code:

# Always add these two commands to your scripts
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# then you can activate the environment
mamba activate myenv

When jobs are submitted to SLURM, they run in a non-interactive shell where Mamba isn’t automatically set up. By running the source command, you ensure that Mamba’s activate function is available. It’s important to remember that even if the environment is loaded on the login node, the scripts will execute on a different machine (one of the compute nodes). Therefore, always include the command to load the Mamba environment in your SLURM submission scripts.
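Putting this together, a minimal submission script could look like the sketch below; the job name and resource requests are placeholders that you should adapt to your own analysis:

#!/bin/bash
#SBATCH --job-name=myenv-job       # placeholder job name
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00

# make mamba's activate function available in this non-interactive shell
eval "$(conda shell.bash hook)"
source $CONDA_PREFIX/etc/profile.d/mamba.sh

# activate the environment and run the analysis
mamba activate myenv
bowtie2 --version

Submit it as usual, e.g. with sbatch myscript.sh.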

Containers

Essentially, a container is a self-contained, lightweight package that includes everything needed to run a specific application—such as the operating system, libraries, and the application code itself. Containers operate independently of the host system, which allows them to run the same software across various environments without any conflicts or interference. This isolation ensures that researchers can consistently execute their code on different systems and platforms, without worrying about dependency issues or conflicts with other software on the host machine.

Docker vs. Singularity

The most significant difference is at the permission level required to run them. Docker containers operate as root by default, giving them full access to the host system. While this can be useful in certain situations, it also poses security risks, especially in multi-user environments. In contrast, Singularity containers run as non-root users by default, enhancing security and preventing unauthorized access to the host system.

  • Docker is ideal for building and distributing software across different operating systems
  • Singularity is designed for HPC environments and offers high performance without needing root access
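To make the difference concrete, the sketch below runs the same bowtie2 container with both tools; the image tag is an assumption based on the BioContainers naming scheme, and Singularity converts the Docker image on the fly:

# Docker: typically needs root (or membership of the "docker" group)
sudo docker run --rm quay.io/biocontainers/bowtie2:2.5.4--he20e202_2 bowtie2 --version

# Singularity: runs as the invoking user, no root required
singularity exec docker://quay.io/biocontainers/bowtie2:2.5.4--he20e202_2 bowtie2 --version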

In the following sections, we’ll cover how to retrieve environment information, utilize containers, and automate environment setup to improve reproducibility.

Singularity on a remote server

While you can build your own Singularity images, many popular software packages already have pre-built images available from public repositories. The two repositories you’ll most likely use or hear about are the Galaxy Project depot (https://depot.galaxyproject.org/singularity/) and the BioContainers registry (https://quay.io/organization/biocontainers).

Installation

If Singularity is not already available on your system, one option is to run it inside a Vagrant virtual machine (this assumes Vagrant and a VM provider such as VirtualBox are installed):

# You only need to run vagrant init once
export VM=sylabs/singularity-3.0-ubuntu-bionic64 && \
    vagrant init $VM && \
    vagrant up && \
    vagrant ssh
Tip
  • We recommend using the pre-installed version provided by your system administrators if you’re working on a shared system. If you’re working on your own computer, you can install the necessary software using Mamba.
  • They might host different versions of the same software, so it’s worth checking both to find the version you need.
  • To download a software container from public repositories, use the singularity pull command.
  • To execute a command within the software container, use the singularity run command.
  • Good practice: create a directory to save all singularity images together. .sif is the standard extension for the images.
Exercise

Download a Singularity image from one of the two repositories listed above (choose a tool such as bcftools, bedtools, bowtie2, seqkit…) and run its --help command. This displays the help documentation of the program, verifying that the image is functioning correctly and includes the intended software.

# create a directory for our singularity images
mkdir images

# download the image
singularity pull images/bowtie2-2.5.4.sif https://depot.galaxyproject.org/singularity/bowtie2%3A2.5.4--he20e202_2

# run the image: singularity run <PATH-TO-IMAGE> <YOUR COMMANDS>
singularity run images/bowtie2-2.5.4.sif bowtie2 --help

Sources

Find pre-built Singularity images in the two repositories listed above (for example, the Galaxy Project depot used in the exercise).

Other training resources:

  • The Turing Way - reproducible research
  • HPC intro by Cambridge
  • Highly recommended: Reproducible Research II: Practices and Tools for Managing Computations and Data, by members of France Université Numérique