flowchart LR A0(["Start"]) --->|data.txt| A["Rename"] A --->|samples.txt| B(["End"])
Learn to create smooth pipelines and manage resources tinyurl.com/pipelinesGDK
Health Data Science sandbox, BiRC, AU
Molecular Biology and Genetics, AU
GenomeDK, Health, AU
2025-10-23
These slides are both a presentation and a small reference manual
We have hands-on as well, follow the slides from tinyurl.com/pipelinesGDK
Official reference documentation: genome.au.dk and gwf.app
Most important message before starting any workshop: RTFM - Read The Field Manual!
Please remember to fill out the feedback form at the end of the slides
Practical help:
Samuele (BiRC, MBG) - samuele@birc.au.dk
Drop-in hours:
General mail for assistance
support@genome.au.dk
Know the basics of using the cluster
Be able to edit documents on the cluster
Have conda or pixi installed
If possible, have a project on the cluster
flowchart LR A0(["Start"]) --->|data.txt| A["Rename"] A --->|samples.txt| B(["End"])
Workflows and Workflow Management Systems
A workflow is a series of calculations and data manipulations which have to be performed in a specific sequence.
A workflow management system organizes the workflow steps through defined dependencies, can assign different computing resources to each step, keeps a log of the workflow, and interacts with a cluster’s queueing system.
flowchart LR A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET] A --->|samples.txt<br>OUTPUT| B(["End"]);
A TARGET is a specific step in a workflow
Each target has a SPECIFICATION which describes what to do with the input files to produce the output files
The specification is usually a command line which can be executed in a terminal
Each target has INPUT and OUTPUT file(s)
flowchart LR A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET] A --->|samples.txt<br>OUTPUT-INPUT| B[GZip<br>TARGET] B --->|samples.txt.gz<br>OUTPUT| C(["End"]);
flowchart LR A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET<br>cores=4<br>memory=24g<br>walltime=00:01:00] A --->|samples.txt<br>OUTPUT-INPUT| B[GZip<br>TARGET<br>cores=1<br>memory=8g<br>walltime=00:10:00] B --->|samples.txt.gz<br>OUTPUT| C(["End"]);
Each target of the workflow can be assigned resources (cores, memory, and a walltime in dd-hr:mn:sc format), which are used to run the target on a cluster
gantt
dateFormat HH:mm
axisFormat %H:%M
title Example HPC Queue Durations vs. Resources
section Small Job (1 core, 4GB, 1h)
Queue wait: active, 00:00, 0:10
Job start: active, 00:10, 1:00
section Medium Job (4 cores, 16GB, 2h)
Queue wait: active, 00:00, 0:45
Job start: active, 00:45, 2:00
section Large Job (16 cores, 64GB, 4h)
Queue wait: active, 00:00, 2:00
Job start: active, 02:00, 4:00
There are many workflow management systems available, e.g. Snakemake, Nextflow, Cromwell, Gwf, Airflow, Luigi, …
The best known in bioinformatics and in production environments are Snakemake and Nextflow.
gwf is a lightweight and easy-to-adopt workflow manager. It requires only some basic Python - you can learn it along the way, starting from examples. Some features:
The whole workflow is written in a Python script. You first instantiate the Workflow object, usually like this:
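A minimal sketch following the gwf documentation (the file is conventionally named workflow.py):

# workflow.py
from gwf import Workflow

gwf = Workflow()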
Now we create generic templates which will be applied to the specific targets.
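Here is a hedged sketch of a Rename template matching the graph shown later; the mv command is hypothetical, but the structure (inputs, outputs, options, spec returned as an AnonymousTarget) follows the pattern from the gwf documentation:

from gwf import AnonymousTarget

def rename(input_file, output_file):
    # hypothetical template: rename input_file to output_file
    inputs = [input_file]
    outputs = [output_file]
    options = {"cores": 4, "memory": "24g", "walltime": "00:10:00"}
    spec = f"mv {input_file} {output_file}"
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)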
Note
The return statement provides all the information about the template when it is applied to a target.
Let’s look again at the corresponding workflow graph:
flowchart LR A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET<br>cores=4<br>memory=24g<br>walltime=00:10:00] A --->|samples.txt<br>OUTPUT-INPUT| B[Zip<br>TARGET<br>cores=1<br>memory=8g<br>walltime=00:01:00] B --->|samples.zip<br>OUTPUT| C(["End"]);
Using templates is easy with gwf. You can use the target_from_template method to create a target from a template.
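For example, applying the hypothetical rename template sketched earlier (target and file names are illustrative):

gwf.target_from_template(
    "RenameData",
    rename(input_file="data.txt", output_file="samples.txt"),
)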
Note
Each target has a unique name, so that you will be able to inspect the workflow and its status.
A more complex workflow
We will run this workflow and add some new targets to it
flowchart LR A0(["Start"]) -->|"data.fq"| A["split"] A -->|part001.fq| B["table"] A -->|part002.fq| C["table"] A -->|part....fq| D["table"] A -->|part010.fq| E["table"] B -->|table001.tsv| F["merge"] C -->|table002.tsv| F D -->|table....tsv| F E -->|table010.tsv| F F -->|table.tsv| G(["End"]);
Prepare everything for the exercise: create a new folder, then download the data and the workflow file
Create a conda environment for seqkit and one for the gwf workflow software. Download the seqkit container as well.
conda config --add channels gwforg
#conda env pipelineEnv for gwf
conda create -y -n pipelineEnv gwf=2.1.1
#add package for resource usage/check
conda install -y -n pipelineEnv -c micknudsen gwf-utilization
#conda env seqKitEnv for seqkit
conda create -y -n seqkitEnv seqkit
#Container download
singularity pull seqkit_2.10.0 https://depot.galaxyproject.org/singularity/seqkit:2.10.0--h9ee0642_0
Now look at the status of your workflow. You should recognize all the steps (targets). They are marked shouldrun because their outputs and/or inputs do not exist yet. Remember to activate the environment for gwf.
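For example (assuming the gwf environment name from the setup above):

conda activate pipelineEnv
gwf -f workflow.py status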
Tip
You do not need the option -f workflow.py if your workflow file has the name workflow.py, which is the default gwf looks for.
Now, you might also want to look at what a specific target looks like when the workflow is built.
You will be able to see the actual inputs, outputs, and the targets it depends on or that depend on it. For the split target, for example:
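gwf info split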
{
"split": {
"options": {
"cores": 1,
"memory": "4g",
"walltime": "05:00:00"
},
"inputs": [
"data.fq"
],
"outputs": [
"gwf_splitted/part001.fq",
"gwf_splitted/part002.fq",
"gwf_splitted/part003.fq",
"gwf_splitted/part004.fq",
"gwf_splitted/part005.fq",
"gwf_splitted/part006.fq",
"gwf_splitted/part007.fq",
"gwf_splitted/part008.fq",
"gwf_splitted/part009.fq",
"gwf_splitted/part010.fq"
],
"spec": "\n seqkit split2 -O gwf_splitted --by-part 10 --by-part-prefix part data.fq\n ",
"dependencies": [],
"dependents": [
"table_6",
"table_8",
"table_3",
"table_0",
"table_1",
"table_4",
"table_5",
"table_9",
"table_7",
"table_2"
]
}
}
Now, you can run specific targets. Let’s specify some names to test out our workflow.
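For example (target names from this workflow; adjust to the ones you want to test):

gwf run split table_0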
Tip
You can run the entire workflow with gwf run when you are sure of your targets working correctly with the right resources.
Check the status: the two targets will be submitted; split has to run first, and its dependent table_0 will run when the file part001.fq is generated! We use watch in front of the command to update its view every two seconds (use Ctrl+C to exit from it).
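watch gwf status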
At some point, you will see the running status (for a few seconds) and completed status.
Exercise break
How many resources did split and table_0 use? Run the utilization command:
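gwf utilization   # subcommand added by the gwf-utilization plugin installed earlier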
The table shows we underutilized the resources. Now open workflow.py and change the requested resources for the split and table_ steps. Then run the targets again:
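gwf run split table_0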
Check again resource usage when the status is completed. Did it get better?
Exercise break
Now, you will change the executor for the table template. Your task is to (a sketch of the result follows this list):
open the workflow.py file
below importing the Conda module (line 2), add a new line with
from gwf.executors import Singularity
Now, define a new executor. Below the line where you define conda_env = Conda("seqkitEnv"), use a similar syntax and write sing = Singularity("seqkit_2.10.0"), where you provide the container file as argument.
At the end of the table template, use the new executor sing instead of conda_env.
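A minimal sketch of how the modified pieces might look. The body of table (the seqkit command and the resource options) is an assumption; only the executor wiring comes from the steps above:

from gwf.executors import Conda, Singularity

conda_env = Conda("seqkitEnv")
sing = Singularity("seqkit_2.10.0")  # the container file pulled earlier

def table(part_file, output_file):
    # hypothetical body: tabulate one fastq part with seqkit
    inputs = [part_file]
    outputs = [output_file]
    options = {"cores": 1, "memory": "4g", "walltime": "00:10:00"}
    spec = f"seqkit fx2tab {part_file} > {output_file}"
    # sing replaces conda_env as the executor
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options,
                           spec=spec, executor=sing)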
Did you do it right? If yes, then you should be able to run the combine target:
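gwf run combine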
and see its status become completed after some time. All output files should be created in your folder! If not, something is wrong. Ask for help, or look at the solution file, if you prefer.
Note
Because combine depends on all table_ targets, it will submit all those targets as well, which need to run first.
Exercise break
OK, now we want to extend the workflow and do quality control on the part###.fq files.
flowchart LR A0(["Start"]) -->|data.fq| A["split"] A -->|part001.fq| B["table"] A -->|part002.fq| C["table"] A -->|part....fq| D["table"] A -->|part010.fq| E["table"] B -->|table001.tsv| F["merge"] C -->|table002.tsv| F D -->|table....tsv| F E -->|table010.tsv| F F -->|table.tsv| G(["End"]) A -->|"part[001-010].fq"| H["qc"] H -->|multiqc_report.html| I(["End"])
You need to (a sketch of the finished template and target follows this list):
create a new conda environment qcEnv where you install the two packages fastqc and multiqc=1.29
define a new executor qc_env based on Conda("qcEnv")
create a new template def qc(data_folder)
this will need all ten gwf_splitted/part###.fq files as input files (you can copy the output file list of the split template, using a variable {data_folder} instead of the explicit folder name!)
as output you want a file called ["reports/multiqc_report.html"] (default name for the generated report)
as bash commands you need to run fastqc and then multiqc (see the sketch after this list)
remember to set the correct executor at the end of the template inside return
now you need to create one single target from the template, call it qc. You only need to give as input the name of the folder with the fq files
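A possible sketch of the qc template and target. The exact fastqc/multiqc command lines and the resource options are assumptions; compare with the solution file:

qc_env = Conda("qcEnv")

def qc(data_folder):
    # all ten part files as inputs (mirrors the split template's output list)
    inputs = [f"{data_folder}/part{i:03d}.fq" for i in range(1, 11)]
    outputs = ["reports/multiqc_report.html"]
    options = {"cores": 1, "memory": "4g", "walltime": "01:00:00"}  # assumed resources
    spec = f"""
    mkdir -p reports
    fastqc -o reports {data_folder}/part*.fq
    multiqc -o reports reports
    """
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options,
                           spec=spec, executor=qc_env)

gwf.target_from_template("qc", qc(data_folder="gwf_splitted"))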
Tip
Some useful tips when developing a workflow:
- Before running any targets, use gwf info qc to check dependencies.
- Copy previous similar templates and modify them where needed, instead of writing each template from scratch.
When you are sure you are done, then use gwf run qc. Its status should be completed if it runs successfully.
Ask for help, or look at the solution file, if you prefer.
Exercise break
Note
Good practices:
- Check resource usage with gwf utilization (needs a plugin, see earlier exercises)
- Check dependencies with gwf info | less -S
Please fill out this form :)
A lot of things we could not cover
use the official documentation!
ask for help, use the drop-in hours (ABC cafe), or drop me a mail
Slides updated over time, use as a reference
