```mermaid
flowchart LR
  A0(["Start"]) --->|data.txt| A["Rename"]
  A --->|samples.txt| B(["End"])
```
Learn to create smooth pipelines and manage resources tinyurl.com/pipelinesGDK
Health Data Science sandbox, BiRC, AU
GenomeDK, Health, AU
2025-06-02
These slides are both a presentation and a small reference manual
90% of slides are you doing stuff - open your terminals and slides
Official reference documentation: genome.au.dk and gwf.app
Most important message before starting any workshop: RTFM - Read The Field Manual!
Practical help:
Samuele (BiRC, MBG) - samuele@birc.au.dk
Drop-in hours:
General mail for assistance: support@genome.au.dk
Know the basics of using the cluster
Be able to edit documents on the cluster
Have conda or pixi installed
If possible, a project on the cluster
Workflows and Workflow Management Systems
A workflow is a series of calculations and data manipulations which have to be performed in a specific sequence.
A workflow management system organizes the workflow steps through defined dependencies, can assign different computing resources to each step, keeps a log of the workflow, and interacts with a cluster’s queueing system.
```mermaid
flowchart LR
  A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET]
  A --->|samples.txt<br>OUTPUT| B(["End"]);
```
A TARGET is a specific step in a workflow
Each target has a SPECIFICATION which describes what to do with the input files to produce the output files
The specification is usually a command line which can be executed in a terminal
Each target has INPUT and OUTPUT file(s)
```mermaid
flowchart LR
  A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET]
  A --->|samples.txt<br>OUTPUT-INPUT| B[GZip<br>TARGET]
  B --->|samples.txt.gz<br>OUTPUT| C(["End"]);
```
```mermaid
flowchart LR
  A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET<br>cores=4<br>memory=24g<br>walltime=00:01:00]
  A --->|samples.txt<br>OUTPUT-INPUT| B[GZip<br>TARGET<br>cores=1<br>memory=8g<br>walltime=00:10:00]
  B --->|samples.txt.gz<br>OUTPUT| C(["End"]);
```
Each target of the workflow can also define OPTIONS - cores, memory, and walltime (format dd-hh:mm:ss) - which are used to run the target on a cluster.
```mermaid
gantt
  dateFormat HH:mm
  axisFormat %H:%M
  title Example HPC Queue Durations vs. Resources
  section Small Job (1 core, 4GB, 1h)
  Queue wait: active, 00:00, 0:10
  Job start: active, 00:10, 1:00
  section Medium Job (4 cores, 16GB, 2h)
  Queue wait: active, 00:00, 0:45
  Job start: active, 00:45, 2:00
  section Large Job (16 cores, 64GB, 4h)
  Queue wait: active, 00:00, 2:00
  Job start: active, 02:00, 4:00
```
A lightweight and easy-to-adopt workflow manager, requiring only some basic Python - you can learn it along the way, starting from examples. Some features:
The whole workflow is written in a Python script. You first instantiate the Workflow object, usually like this:
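With gwf installed, the first lines of workflow.py typically read:

```python
from gwf import Workflow

# the Workflow object collects all targets defined in this file
gwf = Workflow()
```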
Now we create generic templates which will be applied to the specific targets.
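For example, a rename template (a sketch - the resource values follow the diagram above; the try/except stub only exists so the sketch can be inspected on a machine without gwf installed):

```python
try:
    from gwf import AnonymousTarget
except ImportError:
    # minimal stand-in so the sketch runs where gwf is not installed
    from collections import namedtuple
    AnonymousTarget = namedtuple("AnonymousTarget", "inputs outputs options spec")

def renameFile(inputName, outputName):
    inputs = [inputName]
    outputs = [outputName]
    options = {"cores": 4, "memory": "24g", "walltime": "00:01:00"}
    # the spec is the shell command executed on the cluster
    spec = f"""
    mv {inputName} {outputName}
    """
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)
```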
```python
from gwf import AnonymousTarget

def zipFile(inputName):
    inputs = [inputName]
    outputs = [f"{inputName}.gz"]
    options = {"cores": 1, "memory": "4g", "walltime": "00:10:00"}
    # -k keeps the original file next to the compressed one
    spec = f"""
    gzip -k {inputName}
    """
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)
```
Note
The return statement provides all the info about the template when it is applied to a target.
Let’s look again at the corresponding workflow graph:
```mermaid
flowchart LR
  A0(["Start"]) --->|data.txt<br>INPUT| A[Rename<br>TARGET<br>cores=4<br>memory=24g<br>walltime=00:01:00]
  A --->|samples.txt<br>OUTPUT-INPUT| B[GZip<br>TARGET<br>cores=1<br>memory=4g<br>walltime=00:10:00]
  B --->|samples.txt.gz<br>OUTPUT| C(["End"]);
```
Using templates is easy with gwf. You can use the target_from_template method to create a target from a template.
```python
target_rename = gwf.target_from_template("target_rename",
    renameFile(inputName="data.txt", outputName="samples.txt"))

target_gzip = gwf.target_from_template("target_gzip",
    zipFile(inputName="samples.txt"))
```
Note
Each target has a unique name, so that you will be able to inspect the workflow and its status.
A more complex workflow
We will run this workflow and add some new targets to it
```mermaid
flowchart LR
  A0(["Start"]) -->|"data.fq"| A["split"]
  A -->|part001.fq| B["table"]
  A -->|part002.fq| C["table"]
  A -->|part....fq| D["table"]
  A -->|part010.fq| E["table"]
  B -->|table001.tsv| F["merge"]
  C -->|table002.tsv| F
  D -->|table....tsv| F
  E -->|table010.tsv| F
  F -->|table.tsv| G(["End"]);
```
Prepare everything for the exercise: create a new folder, then download data and workflow file
Create a conda environment for seqkit and one for the gwf workflow software. Download the seqkit container as well.
```shell
conda config --add channels gwforg
# conda env pipelineEnv for gwf
conda create -y -n pipelineEnv gwf=2.1.1
# add package for resource usage/check
conda install -y -n pipelineEnv -c micknudsen gwf-utilization
# conda env seqkitEnv for seqkit
conda create -y -n seqkitEnv seqkit
# container download
singularity pull seqkit_2.10.0 https://depot.galaxyproject.org/singularity/seqkit:2.10.0--h9ee0642_0
```
Now look at the status of your workflow. You should recognize all the steps (targets). They are marked shouldrun, because the outputs and/or inputs do not exist yet. Remember to activate the environment for gwf.
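For example (environment name taken from the setup above; run this on the cluster):

```shell
conda activate pipelineEnv
gwf -f workflow.py status
```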
Tip
You do not need the option -f workflow.py if your workflow file is named workflow.py, which is the default gwf looks for.
Now, you might also want to look at how a specific target looks when the workflow is built, e.g. with gwf info split. You will be able to see the actual inputs, outputs, and the targets it depends on or that depend on it:
```json
{
  "split": {
    "options": {
      "cores": 1,
      "memory": "4g",
      "walltime": "05:00:00"
    },
    "inputs": [
      "data.fq"
    ],
    "outputs": [
      "gwf_splitted/part001.fq",
      "gwf_splitted/part002.fq",
      "gwf_splitted/part003.fq",
      "gwf_splitted/part004.fq",
      "gwf_splitted/part005.fq",
      "gwf_splitted/part006.fq",
      "gwf_splitted/part007.fq",
      "gwf_splitted/part008.fq",
      "gwf_splitted/part009.fq",
      "gwf_splitted/part010.fq"
    ],
    "spec": "\n    seqkit split2 -O gwf_splitted --by-part 10 --by-part-prefix part data.fq\n    ",
    "dependencies": [],
    "dependents": [
      "table_6",
      "table_8",
      "table_3",
      "table_0",
      "table_1",
      "table_4",
      "table_5",
      "table_9",
      "table_7",
      "table_2"
    ]
  }
}
```
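As an aside, the ten numbered paths above are the kind of list you would generate with an f-string list comprehension rather than typing them out by hand - a sketch:

```python
# generate gwf_splitted/part001.fq ... part010.fq; :03d zero-pads to three digits
outputs = [f"gwf_splitted/part{i:03d}.fq" for i in range(1, 11)]
```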
Now, you can run specific targets. Let’s specify some names to test out our workflow.
Tip
You can run the entire workflow with gwf run once you are sure your targets work correctly with the right resources.
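To try a single target first (target names as shown by gwf status; a sketch, to be run on the cluster):

```shell
gwf run table_0
watch -n 2 gwf status
```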
Check the status: the two targets will be submitted; split has to run first, and its dependent table_0 will run when the file part001.fq is generated! We put watch in front of the command to update its view every two seconds (use Ctrl+C to exit from it).
At some point, you will see the running status (for a few seconds) and then the completed status.
Exercise break
How many resources did split and table_0 use? Run the utilization command:
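The command comes from the gwf-utilization plugin installed earlier:

```shell
gwf utilization
```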
The table shows we underutilized the resources. Now open workflow.py and change the resource requests for the split and table_ steps. Then, run the target:
Check the resource usage again when the status is completed. Did it get better?
Exercise break
Now, you will change the executor for the template table. Your task is to:
- open the workflow.py file
- below the line importing the Conda module (line 2), add a new line with from gwf.executors import Singularity
- define a new executor: below the line where you define conda_env = Conda("seqkitEnv"), use a similar syntax and write sing = Singularity("seqkit_2.10.0"), where you provide the container file as argument
- at the end of the table template, use the new executor sing instead of conda_env
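Put together, the three edits look roughly like this (a sketch; the surrounding workflow.py lines and the exact placement of the executor in the return are assumptions - follow the pattern already used in your file):

```python
from gwf.executors import Conda, Singularity  # Singularity import added below Conda

conda_env = Conda("seqkitEnv")        # existing executor
sing = Singularity("seqkit_2.10.0")   # new executor, built from the pulled container

# ...then, at the end of the table template, hand the new executor to the
# returned target, e.g.:
# return AnonymousTarget(inputs=inputs, outputs=outputs, options=options,
#                        spec=spec, executor=sing)  # was: executor=conda_env
```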
Did you do it right? If yes, then you should be able to run the combine target:
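Assuming the target name combine from the text:

```shell
gwf run combine
watch -n 2 gwf status
```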
and see its status become completed after some time. All output files should be created in your folder! If not, something is wrong. Ask for help, or look at the solution file, if you prefer.
Note
Because combine depends on all table_ targets, it will submit all those targets as well, since they need to run first.
Exercise break
Ok, now we want to extend the workflow and do quality control on the part###.fq files.
```mermaid
flowchart LR
  A0(["Start"]) -->|data.fq| A["split"]
  A -->|part001.fq| B["table"]
  A -->|part002.fq| C["table"]
  A -->|part....fq| D["table"]
  A -->|part010.fq| E["table"]
  B -->|table001.tsv| F["merge"]
  C -->|table002.tsv| F
  D -->|table....tsv| F
  E -->|table010.tsv| F
  F -->|table.tsv| G(["End"])
  A -->|"part[001-010].fq"| H["qc"]
  H -->|multiqc_report.html| I(["End"])
```
You need to:
- create a new conda environment qcEnv where you install the two packages fastqc multiqc=1.29
- define a new executor qc_env based on Conda("qcEnv")
- create a new template def qc(data_folder):
  - it will need all ten gwf_splitted/part###.fq files as input files (you can copy the output file list of the split template, using a variable {data_folder} instead of the explicit folder name!)
  - as output you want a file called ["reports/multiqc_report.html"] (the default name for the generated report)
  - as bash commands you need:
  - remember to set the correct executor at the end of the template inside return
- finally, create one single target from the template and call it qc. You only need to give as input the name of the folder with the fq files
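One possible shape for the template (a sketch, not the solution file - the resource values and the exact fastqc/multiqc invocation are assumptions; the try/except stub only lets the sketch run where gwf is not installed):

```python
try:
    from gwf import AnonymousTarget
except ImportError:
    # minimal stand-in so the sketch runs where gwf is not installed
    from collections import namedtuple
    AnonymousTarget = namedtuple("AnonymousTarget", "inputs outputs options spec")

def qc(data_folder):
    # all ten split files as inputs, built from {data_folder} instead of a hard-coded path
    inputs = [f"{data_folder}/part{i:03d}.fq" for i in range(1, 11)]
    outputs = ["reports/multiqc_report.html"]
    options = {"cores": 1, "memory": "8g", "walltime": "01:00:00"}
    spec = f"""
    mkdir -p reports
    fastqc -o reports {data_folder}/part*.fq
    multiqc -o reports reports
    """
    # in the real workflow.py, also hand the qc_env executor to the returned
    # target, following the same pattern as the other templates
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)
```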
Tip
Some useful tips when developing a workflow:
- Before running any targets, use gwf info qc to check dependencies.
- Copy previous similar templates and modify them where needed, instead of writing each template from scratch.
When you are sure you are done, use gwf run qc. Its status should be completed if it runs successfully.
Ask for help, or look at the solution file, if you prefer.
Exercise break
Note
Good practices:
- gwf utilization to check resource usage (needs a plugin, see earlier exercises)
- gwf info | less -S to check dependencies

Please fill out this form :)
A lot of things we could not cover: use the official documentation!
Ask for help, use the drop-in hours (ABC cafe), or drop me a mail.
Slides are updated over time - use them as a reference.