FAIR computational pipelines
Introduction
Data analysis typically involves the use of different tools, algorithms, and scripts. It often requires multiple steps to transform, filter, aggregate, and visualize data. The process can be time-consuming because each tool may demand specific inputs and parameter settings. As analyses become more complex, reproducible and scalable automated workflow management becomes increasingly important, since automation is the key to reproducibility.
Why avoid using a shell script?
Bash scripts have been widely used for automation in the past and can handle many tasks effectively. Typically, running a bash script requires just one command, which executes all the steps in the script. However, a significant drawback is that it re-runs all steps every time. This can be problematic in certain situations.
Consider scenarios where re-running all steps can be an issue (minimum 2).
Here are some examples:
- Changed input files: only some parts of the pipeline are affected by the change, yet every step is re-run.
- Code bugs: fixing an incorrect path or a typo in your code forces a full re-run.
- Software updates: a newer version of a tool is released.
- Parameter updates: testing or updating the parameters of a single software tool.
- Script modifications: if only the plotting section of your script is updated, re-running the entire pipeline wastes significant time and resources.
Are notebooks better than bash scripts?
Notebooks are a step forward in addressing this issue because they allow you to run individual cells separately. However, this process is manual: you need to ensure that the cells are executed in the correct order. In practice, it is easy to make mistakes, so reproducibility is only guaranteed if you run all the cells systematically from top to bottom, which is time-consuming and requires careful management to avoid errors.
What are the advantages and disadvantages of notebooks in the following situations?
- If your pipeline consists of 100 commands
- If your pipeline has only 4 steps, but each takes several weeks of computational time
- Benchmarking and testing parameters of new software
Main disadvantage: reproducibility! If you are running cells by hand, the analysis will be hard to reproduce.
If your pipeline consists of 100 commands:
- Advantages:
  - you can check intermediate outputs and execute the pipeline incrementally.
  - user-friendly interface for visualizing and debugging.
- Disadvantages:
  - less efficient than a bash script for a large number of commands.
  - can become cumbersome and slow for computationally intensive code blocks.
  - no built-in automation and limited support for version control.
If your pipeline has only 4 steps, but each takes several weeks of computational time:
- Advantages:
  - simple interface for quick execution and visualization.
  - fewer steps are easier to manage and understand.
  - great for prototyping and testing small workflows.
- Disadvantages:
  - automation is less straightforward.
Benchmarking and testing parameters of new software:
- Advantages:
  - facilitates experimentation and shows results instantly.
  - easy documentation.
  - enables step-by-step execution and modification of parameters.
- Disadvantages:
  - not efficient for extensive benchmarking.
  - tracking and managing multiple parameters can become complex.
Workflows crown the stack, making bash scripts and notebooks less attractive
Workflows offer a third option, which is very attractive for computations that are too big to be handled conveniently by scripts or notebooks. A workflow manager is a program that decides which tasks are run, and when. Workflow management encompasses parallelization, resumption, logging, and data provenance.
A single command should trigger all necessary steps and ensure they run in the correct sequence (simple and easy!). Workflows are divided into separate tasks, with each task reading an input and writing an output. A given task needs to be re-run only when its code or input data have changed (a toy sketch of this idea follows the list below). Using workflow managers, you ensure:
- automation
- convenience
- portability
- reproducibility
- scalability
- readability
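To make this concrete, here is a toy Python sketch of the "only re-run what changed" idea. It is not how Snakemake or Nextflow are implemented internally, and the file and script names (data/iris.csv, process_iris.py, results/iris_clean.csv) are hypothetical:

```python
# toy_rerun_check.py - a toy illustration of the core idea behind workflow managers:
# a task is re-run only when its output is missing or older than its code/input data.
# All file and script names are hypothetical.
import os
import subprocess


def needs_rerun(inputs, output):
    """Return True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)


# Re-run the processing step only if the data or the code changed since the last run.
if needs_rerun(["data/iris.csv", "process_iris.py"], "results/iris_clean.csv"):
    subprocess.run(["python", "process_iris.py"], check=True)
else:
    print("results/iris_clean.csv is up to date, skipping")
```

A workflow manager applies this kind of check, together with dependency tracking, logging, and scheduling, automatically for every task in the pipeline.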
Popular workflow management systems such as Snakemake, Nextflow, and Galaxy can be scaled effortlessly across server, cluster, and cloud environments without altering the workflow definition. They also allow you to specify the necessary software, ensuring the workflows can be deployed in any setting. It is important to select the workflow manager that best fits each research project. We therefore provide two sections, one on Snakemake and one on Nextflow, so you can choose the one that best suits your needs.
Resource management
A key service offered by workflow managers is resource management, which is essential for executing large computations. They handle parallel execution, running tasks simultaneously when they do not depend on each other, and can execute a single parallelized task on multiple nodes of a computing cluster. Workflow managers also handle data storage, managing local files, shared filesystems (storage accessible from all nodes of a cluster), and cloud storage. Additionally, they can manage software environments by interfacing with package managers and running tasks in dedicated containers.
During this module, you will learn about:
- Syntax: understand the syntax of two workflow languages.
- Defining steps: how to define a step in each language (a rule in Snakemake, a process in Nextflow), including specifying inputs, outputs, and execution statements.
- Generalizing steps: explore how to generalise steps and create chains of dependencies across multiple steps using wildcards (Snakemake) or parameters and channel operators (Nextflow).
- Advanced Customisation: gain knowledge of advanced pipeline customisation using configuration files and custom-made functions.
- Scaling workflows: understand how to scale workflows to compute servers and clusters while adapting to hardware-specific constraints.

With multiple CPUs available, you can leverage parallel execution for groups of tasks that are independent of each other, that is, tasks that do not rely on the outputs of other tasks in the same group. Since data is transferred between tasks through files, it is easy to see how tasks depend on each other (dependencies).
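As a toy illustration of the scheduling that a workflow manager automates for you, the sketch below runs two independent tasks in parallel and only afterwards runs a task that depends on both of their output files; the script names (summaryStats_species.py, aggregate_summaries.py) are hypothetical:

```python
# toy_parallel.py - a toy illustration of running independent tasks in parallel.
# All script and file names are hypothetical.
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Two independent tasks: neither reads the other's output, so they can run at the same time.
independent_tasks = [
    ["python", "summaryStats_species.py", "setosa"],
    ["python", "summaryStats_species.py", "versicolor"],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Each task writes its own output file, e.g. summary_<species>.csv.
    futures = [pool.submit(subprocess.run, cmd, check=True) for cmd in independent_tasks]
    for future in futures:
        future.result()  # wait for both tasks and raise an error if one of them failed

# A dependent task: it reads the files written above, so it must run after them.
subprocess.run(["python", "aggregate_summaries.py"], check=True)
```

A workflow manager infers this ordering from the input and output files declared by each task, so you never have to encode it by hand.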
To put this into practice, you will start by scaling up a data analysis from one dataset to a large number of datasets and incorporating additional analysis steps at the aggregate level. However, before you can scale up, the first step is to convert a notebook into a shell script that chains several computational tasks (Exercise 3). This script will serve as an intermediate stage before you move on to Snakemake and Nextflow, and completing it will help you understand how tasks are defined and split in these tools.
In a workflow, a task is executed in its entirety or not at all. Long-running tasks can undermine one of the primary advantages of workflows, which is to execute only the necessary computations. Conversely, there is overhead associated with starting a task, reading input files, and writing output files. If the computational work performed by a task is too minimal, this overhead can become disproportionately large. With this in mind, we should also assess the effects of data input/output (I/O) on disk storage and code complexity. When a task involves minimal computation but is heavily dominated by I/O operations, it becomes difficult to understand and modify. Furthermore, a task that requires large amounts of disk storage for its execution can lead to significant costs.
Consider each task as a Python script containing the code from one or more cells of a Jupyter Notebook. The key aspect of these tasks is that all data exchange occurs via files. Ideally, a task should be neither too lengthy nor too brief.
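For example, a single task could be as small as the following sketch, which reads one file and writes one file; the paths (data/raw.csv, results/clean.csv) are assumptions made for illustration:

```python
# clean_data.py - a minimal sketch of a single task: read one file, write one file.
# The file names are hypothetical.
import pandas as pd

df = pd.read_csv("data/raw.csv")             # the task's input arrives as a file...
df = df.dropna()                             # ...one well-defined unit of work happens here...
df.to_csv("results/clean.csv", index=False)  # ...and the result leaves as a file
```

Because the script touches nothing but its declared input and output files, it can safely be re-run, reordered, or parallelized by a workflow manager.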
We will use the classic Iris dataset for this exercise. Convert this Jupyter notebook to a shell script.
- Reuse as much as possible from the code above but remove code for displaying tables or plotting figures.
- Keep only comments that refer specifically to the code; create a README.md for the description of the dataset and the objective of the project.
- Save the Python code to a file (e.g. process_iris.py).
- Write a shell script that runs the Python script once for one of the species (e.g. "setosa", "versicolor", or "virginica").
Here is one possible approach. You may choose to split the notebook into more or fewer tasks as needed. Remember to make your script executable by running chmod +x run_iris_analysis.sh so that you can run it with ./run_iris_analysis.sh.
The tasks are split by functionality. Data manipulation, preprocessing, and plotting are handled in one script, as these steps are typically performed together. The summary statistics, however, might need to be generated multiple times or for different species, so we created a separate script specifically for summarizing the data. This separation allows us to run the summary script as needed, once, multiple times, or for different species, while keeping the data processing and plotting tasks contained in their own script.
#!/bin/bash
# Run the data processing and plotting steps once.
python process_iris.py
# Compute summary statistics for a single species.
SPECIES_NAME="setosa"
python summaryStats_species.py "$SPECIES_NAME"
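The notebook code is not reproduced here, but a minimal sketch of what summaryStats_species.py could look like is shown below; it assumes the Iris data is stored in data/iris.csv with a species column, so adapt the paths and column names to your own notebook:

```python
# summaryStats_species.py - a minimal sketch with an assumed file layout.
# Usage: python summaryStats_species.py <species>, e.g. python summaryStats_species.py setosa
import sys

import pandas as pd

species = sys.argv[1]                     # species name passed in by the shell script
df = pd.read_csv("data/iris.csv")         # assumed location of the Iris data
subset = df[df["species"] == species]     # keep only the requested species
summary = subset.describe()               # per-column summary statistics
summary.to_csv(f"summary_{species}.csv")  # write the result so later steps can reuse it
print(f"Wrote summary_{species}.csv ({len(subset)} rows summarized)")
```

Because the species name is passed as a command-line argument, the same script can later be reused for "versicolor" and "virginica" when you scale the analysis up with Snakemake or Nextflow.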
Good practices
If you develop your own software, make sure you follow the FAIR principles. We highly endorse the following FAIR recommendations.
- Remember to create portable code and use relative paths to ensure transferability between users.
- Use git repositories to save your projects and pipelines.
- Register and publish your scientific computational workflow on WorkflowHub.
Sources
RDM best practices for computations
Parts of the content are inspired by Reproducible Research II: Practices and Tools for Managing Computations and Data by members of France Université Numérique. Enroll here.
Scientific articles
- Wratten, Laura, Andreas Wilm, and Jonathan Göke. “Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers.” Nature Methods, 2021.
- Köster, Johannes, and Sven Rahmann. “Snakemake - a scalable bioinformatics workflow engine.” Bioinformatics, 2012.
- Köster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis.” PhD thesis, TU Dortmund, 2014.