Introduction to workflows with gwf

https://hds-sandbox.github.io/GDKworkshops

Samuele Soraggi

Health Data Science sandbox, BiRC

Dan Søndergaard

GenomeDK, Health

2024-04-09

Short on Workflows and gwf

Workflow

Workflow and Workflow Management System

A workflow is a series of computations and data manipulations that must be performed in a specific order. A workflow management system organizes the steps through declared dependencies, can assign different computing resources to each step, keeps a log of the workflow, and interacts with the cluster's queueing system.
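
As a toy sketch (plain Python, not gwf syntax), you can picture a workflow as a graph in which each step lists its dependencies; a management system repeatedly launches the steps whose dependencies are done. The step names below are illustrative only:

```python
# Toy dependency graph: each step lists the steps whose outputs it needs
workflow = {
    'fastqc': [],                    # quality check, no dependencies
    'trim':   ['fastqc'],            # runs after fastqc
    'align':  ['trim'],              # runs after trimming
    'report': ['fastqc', 'align'],   # needs both earlier steps
}

def runnable(done):
    """Steps not yet run whose dependencies have all completed."""
    return [step for step, deps in workflow.items()
            if step not in done and all(d in done for d in deps)]
```

A management system would call something like `runnable` in a loop, submitting each ready step to the cluster as it appears.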

A few terms

A target is a single step of the workflow: its input files, its output files, and the code that produces the outputs. Each target is submitted to a queueing manager so that it enters the queue and is executed when resources allow.

Gwf

A lightweight and easy-to-adopt workflow manager. It only requires knowing, or learning, some Python. Some useful features:

  • Defines target dependencies automatically using the input and output files of targets
  • Only submits targets when their output files are not up to date
  • Fire-and-forget: once you submit targets, you need not do more
  • Has a few, simple and useful commands for following workflow execution and cleaning up intermediate data files

Check out the gwf webpage for updates, examples and more detailed documentation.

Exercise setup

Data and inspiration come from this SW Carpentry bioinformatics tutorial, which does not use pipelines.

Download examples and data

Examples and reference file

curl -L -o gwf_workshop.zip https://github.com/hds-sandbox/GDKworkshops/archive/main.zip \
    && unzip gwf_workshop.zip \
    && mv GDKworkshops-main/Examples/gwf gwf_workshop \
    && rm -rf GDKworkshops-main gwf_workshop.zip

Download fastq files

mkdir -p gwf_workshop/data
cd gwf_workshop/data/

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/004/SRR2589044/SRR2589044_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz

cd ..

Final file structure (run the tree command)

Install gwf

Use conda to create a new environment from the yml file:

conda env create -n gwf_workshop -f Environments/gwf_workshop.yml
conda activate gwf_workshop

You can see the package list with

cat Environments/gwf_workshop.yml

The environment's full list of dependencies and versions can be seen with

conda list

A single target Workflow

The workflow file can be printed out with

less -S WF01_report.py

Exit using q

You can easily spot all the elements of a workflow graph:

from gwf import Workflow

gwf = Workflow(defaults={'account': 'your_project'})

## A target doing FastQC on a file
gwf.target('FastQC', 
           cores=1,
           memory='8gb',
           walltime='00:05:00', 
           inputs=['data/SRR2584863_1.fastq.gz'], 
           outputs=['data/SRR2584863_1_fastqc.html', 
                    'data/SRR2584863_1_fastqc.zip']) << """ 
fastqc data/SRR2584863_1.fastq.gz
"""

Status and submission

We can see the status of each target: whether gwf does not see up-to-date output files (shouldrun), whether the output is in place (completed), whether the target is currently executing (running), or whether it failed (failed).

gwf -f WF01_report.py status

will show

 FastQC        shouldrun

You can submit the targets with

gwf -f WF01_report.py run

which reports

Submitting target FastQC

Look again at the status. You can follow it in real time using watch (you will lose the nice colour formatting):

watch gwf -f WF01_report.py status

You might see it going from submitted to running and completed:

 FastQC        completed

Targets' stdout and stderr

What are the commands in a target outputting? You can see stdout and stderr in the log of a target:

gwf -f WF01_report.py logs FastQC

and

gwf -f WF01_report.py logs -e FastQC

Parallel targets

Define independent targets which can run at the same time in the workflow

You can use Python to automatically parse the file names in a folder and create the corresponding targets:

import os
import glob

# find the .fastq.gz files in the data folder
pattern = os.path.join('data', '*.fastq.gz')
fastq_files = glob.glob(pattern)
# keep only the file name, without the folder path
fastq_files = [os.path.basename(file) for file in fastq_files]


for FASTQ in fastq_files:
    #name without extensions
    FASTQ_BASENAME = FASTQ.split(".")[0] 
    #target with file-dependent name and inputs/outputs
    gwf.target( f'FastQC_{FASTQ_BASENAME}', 
           cores=1,
           memory='8gb',
           walltime='00:05:00', 
           inputs=[f'data/{FASTQ}'], 
           outputs=[f'data/{FASTQ_BASENAME}_fastqc.html', 
                    f'data/{FASTQ_BASENAME}_fastqc.zip']) << """ 
    fastqc data/{FASTQ}
    """.format(FASTQ=FASTQ)

Useful gwf checks and debug

You can see the status together with specific messages explaining each target's state and what causes it.

gwf -f WF02_report_parallel.py -v debug status

At the end of the debug messages, you will see that one FastQC is already complete from the first example.

 FastQC_SRR2584866_1 shouldrun
 FastQC_SRR2589044_1 shouldrun
 FastQC_SRR2584863_1 completed
 FastQC_SRR2584863_2 shouldrun
 FastQC_SRR2589044_2 shouldrun
 FastQC_SRR2584866_2 shouldrun

You can explicitly print some targets of your choice and their dependencies. For example, all the targets related to the first element of the fastq pairs:

gwf -f WF02_report_parallel.py info FastQC*_1

 

Giving no names prints all the targets

Try to run the whole workflow and follow its status with watch.

gwf -f WF02_report_parallel.py run

and

watch gwf -f WF02_report_parallel.py status

 

Targets will run independently and might run at the same time.

See specifically what gwf states about your targets: why should a target run, and why not?

gwf -f WF02_report_parallel.py -v debug status

 

Tip

When developing a workflow, take the time to run status and info whenever you add new targets.

Multiple target dependencies

A single target (MultiQC) is dependent on the execution of multiple targets (FastQC of the data files)

#names without path and extensions
fastq_basenames = [ FASTQ.split(".")[0] for FASTQ in fastq_files ] 

gwf.target( 'MultiQC', #name of the target
           cores=1,
           memory='8gb',
           walltime='00:05:00', 
           inputs= {'ZIP': [f'data/{FASTQ_BASENAME}_fastqc.zip' for FASTQ_BASENAME in fastq_basenames],
                    'REPORTS': [f'data/{FASTQ_BASENAME}_fastqc.html' for FASTQ_BASENAME in fastq_basenames] }, 
           outputs=['results/multiqc_output/multiqc_report.html'],
           protect=['results/multiqc_output/multiqc_report.html']) << """ 
    mkdir -p results/multiqc_output
    multiqc --outdir results/multiqc_output \
            data/
    """

Protected outputs and cleaning

The workflow parameter

protect=['results/multiqc_output/multiqc_report.html']

prevents the specified file from being deleted when using clean. The clean command removes all intermediate outputs of a workflow (files that are not essential for the final result and that can be recreated by rerunning the workflow).

Endpoints’ files are protected

Files generated by endpoints (targets that no other target depends on, i.e. at the end of the workflow) are not deleted. clean --all removes every file in the workflow.

Status with filters

We know all the FastQC targets are completed. First, run the workflow

gwf -f WF03_multi_report.py run

 

Now look only at the endpoints to avoid a long list of FastQC targets:

gwf -f WF03_multi_report.py status --endpoints

gives for example

 MultiQC     running

unless it is still submitted or already completed.

Are you waiting for many endpoints? Look only at the completed ones with the status filter:

 

gwf -f WF03_multi_report.py status --endpoints -s completed

 

(You can also filter on the statuses running, shouldrun or failed).

Exercise 1

Parallel targets dependent on MultiQC produce trimmed fastq files.

Tasks-questions

Consider the workflow file WF04_trimming.py:

 

  1. apply gwf status to watch the endpoints of the workflow
  2. Which is the file dependency common to all trim targets?
  3. run the workflow and…
  4. …apply watch gwf status and follow only the running targets
  5. Look at the stderr printout of one trimming target

Code

import numpy as np

# names without pair number and extension
fastq_pairs = np.unique([ FASTQ.split('_')[0] for FASTQ in fastq_basenames ])

for FASTQ_PAIR in fastq_pairs:
    gwf.target( f"Trimmomatic_{FASTQ_PAIR}", #name of the target
           cores=8,
           memory='8gb',
           walltime='00:10:00', 
           inputs= {'ADAPTERS': ['fasta/NexteraPE-PE.fa'],
                    'FASTQS': [f'data/{FASTQ_PAIR}_1.fastq.gz',
                               f'data/{FASTQ_PAIR}_2.fastq.gz'],
                    'MULTIQC': ['results/multiqc_output/multiqc_report.html']}, 
           outputs=[f'data_trim/{FASTQ_PAIR}_1.trim.fastq.gz',
                    f'data_trim/{FASTQ_PAIR}_1un.trim.fastq.gz',
                    f'data_trim/{FASTQ_PAIR}_2.trim.fastq.gz',
                    f'data_trim/{FASTQ_PAIR}_2un.trim.fastq.gz']) << """
    mkdir -p data_trim
    trimmomatic PE -threads 8 \
                data/{FASTQ_PAIR}_1.fastq.gz data/{FASTQ_PAIR}_2.fastq.gz \
                data_trim/{FASTQ_PAIR}_1.trim.fastq.gz data_trim/{FASTQ_PAIR}_1un.trim.fastq.gz \
                data_trim/{FASTQ_PAIR}_2.trim.fastq.gz data_trim/{FASTQ_PAIR}_2un.trim.fastq.gz \
                SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:{ADAPTERFILE}:2:40:15
    """.format(FASTQ_PAIR=FASTQ_PAIR, ADAPTERFILE='fasta/NexteraPE-PE.fa')

Answers

1. apply gwf status to watch the endpoints of the workflow

 

gwf -f WF04_trimming.py status --endpoints

gives

 Trimmomatic_SRR2584863 shouldrun
 Trimmomatic_SRR2584866 shouldrun
 Trimmomatic_SRR2589044 shouldrun

2. Which is the file dependency common to all trim targets?

 

Each endpoint needs an adapter fasta file in addition to the raw data files as input. You can look at the file dependencies using

gwf -f WF04_trimming.py info Trimmomatic*

3. run the workflow 4. use watch gwf status and follow only the running targets

 

After running the targets, for example with

gwf -f WF04_trimming.py run Trimmomatic*

you can follow the running targets with

watch gwf -f WF04_trimming.py status -s running

If the targets are currently running (rather than shouldrun, completed or failed), you should see something like this,

 Trimmomatic_SRR2584863 running
 Trimmomatic_SRR2584866 running
 Trimmomatic_SRR2589044 running

after which the targets will be completed (and disappear from your view with the status filter).

5. Look at the stderr printout of one trimming target

 

You can see the stderr of a trimming job, for example, with

gwf -f WF04_trimming.py logs -e Trimmomatic_SRR2589044

Exercise 2

Parallel targets (BWA), each dependent on a pair of trimming targets, produce alignments of parts of the whole genome using an indexing of the reference (Index reference). Those alignments are then merged (Merge), and a depth histogram is also produced for each subgenome (Histogram).

Tasks-questions

Consider the workflow file WF05_align_merge_plot.py:

 

  1. use status to see which are the endpoint targets, then find their target dependencies
  2. do you need to protect the final merged alignment in the target BAM_merge?
  3. can the targets BAM_merge and Histogram run at the same time?
  4. One pair of fastq files is twice as large as the other two. If, instead of aligning the pairs in parallel targets, you aligned them one at a time in a single target, how much more time would it take? (assuming parallel targets start running at the same time)
  5. Run the workflow and watch the status of the targets. You should see the running targets following the order of the dependencies. You should also find the pdf plot in the results folder.

Code

for FASTQ_PAIR in fastq_pairs:
    gwf.target( f"BWA_mem_{FASTQ_PAIR}", #name of the target
           cores=8,
           memory='16gb',
           walltime='00:15:00', 
           inputs= {'INDEXING': [ 'fasta/ecoli_rel606.fasta.'+ extension for extension in ['amb','ann','bwt','pac','sa'] ],
                    'FASTQS': [f'data_trim/{FASTQ_PAIR}_1.trim.fastq.gz',
                               f'data_trim/{FASTQ_PAIR}_2.trim.fastq.gz']}, 
           outputs=[f'results/alignments/{FASTQ_PAIR}.sort.bam',
                    f'results/depths/{FASTQ_PAIR}.depth','results/depths/hellosamtools.txt' ]) << """
    mkdir -p results/alignments
    bwa mem -t 8 fasta/ecoli_rel606.fasta \
            data_trim/{FASTQ_PAIR}_1.trim.fastq.gz \
            data_trim/{FASTQ_PAIR}_2.trim.fastq.gz > \
            results/alignments/{FASTQ_PAIR}.sam
    samtools view -S -b results/alignments/{FASTQ_PAIR}.sam > results/alignments/{FASTQ_PAIR}.bam
    samtools sort -o results/alignments/{FASTQ_PAIR}.sort.bam results/alignments/{FASTQ_PAIR}.bam
    ### print depths in text file
    mkdir -p results/depths
    samtools index results/alignments/{FASTQ_PAIR}.sort.bam
    samtools depth results/alignments/{FASTQ_PAIR}.sort.bam > results/depths/{FASTQ_PAIR}.depth && touch results/depths/hellosamtools.txt
    """.format(FASTQ_PAIR=FASTQ_PAIR)

gwf.target( "BAM_merge", #name of the target
           cores=1,
           memory='8gb',
           walltime='00:10:00', 
           inputs= [f'results/alignments/{FASTQ_PAIR}.sort.bam' for FASTQ_PAIR in fastq_pairs], 
           outputs=['results/alignments/merged.bam']) << """
    samtools merge -f results/alignments/merged.bam results/alignments/*.sort.bam
    samtools sort results/alignments/merged.bam -o results/alignments/merged.final.bam
    """

gwf.target( "Histogram", #name of the target
           cores=1,
           memory='8gb',
           walltime='00:10:00', 
           inputs= [f'results/depths/{FASTQ_PAIR}.depth' for FASTQ_PAIR in fastq_pairs], 
           outputs=[ 'results/depths/histogram.pdf' ]) << """
    Rscript plotscript.R
    """

Answers

1. use status to see which are the endpoint targets, then find their target dependencies

 

There are two endpoints: one for merging aligned files and one for plotting a histogram. Use

gwf -f WF05_align_merge_plot.py status --endpoints

to obtain

 BAM_merge     shouldrun
 Histogram     shouldrun

Use the info command to see their file dependencies:

gwf -f WF05_align_merge_plot.py info Histogram BAM_merge

Look at the output to find that both Histogram and BAM_merge depend on the following targets

"BWA_mem_SRR2589044",
"BWA_mem_SRR2584866",
"BWA_mem_SRR2584863"

Note though that they depend on different files from those targets!

2. do you need to protect the final merged alignment in the target BAM_merge?

 

BAM_merge is an endpoint target, so its files do not need to be protected in case you want to do a cleanup with gwf clean. You can clean all files anyway with gwf clean --all.

3. can the targets BAM_merge and Histogram run at the same time?

 

Yes, as soon as the raw data files have been aligned separately.

4. One pair of fastq files is twice as large as the other two. If, instead of aligning the pairs in parallel targets, you aligned them one at a time in a single target, how much more time would it take? (assuming parallel targets start running at the same time)

 

If you ran all the alignments in a single target, it would take twice as long as running them in parallel targets.

When you run parallel targets, the time needed to finish the alignments is the time needed for the largest pair (which is twice the size of the others).

When you run the alignments sequentially, the total time is the sum of all the alignment times.
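
The arithmetic, with hypothetical per-pair alignment times:

```python
# Hypothetical alignment times: one pair is twice as large as the other two
times = [2.0, 1.0, 1.0]  # in hours, say

parallel_time = max(times)    # all targets start together; the largest pair dominates
sequential_time = sum(times)  # one target aligning all pairs back to back
```

So the sequential version takes the sum (4 hours) against the maximum (2 hours): twice as long.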

Clean up after yourself

Use the tree command to look at how many files you created!

 

What is the size of each folder and their total size?

du -h

We have 8.3GB of data

17K     ./__pycache__
961M    ./data_trim
5.8G    ./results/alignments
13K     ./results/multiqc_output/multiqc_data
17K     ./results/multiqc_output
286M    ./results/depths
6.1G    ./results
1.0K    ./Environments
39K     ./.gwf/logs
40K     ./.gwf
1.2G    ./data
13M     ./fasta
0       ./.empty
8.3G    .

Try to use

gwf -f WF05_align_merge_plot.py clean

accept the warning message and note how, even for a small exercise, you are going to clean up a lot.

Now you have a slimmer folder, which contains the workflows, raw data, a merged quality report, aligned data and the depth histogram in .pdf format.

 

All the other files can be recreated at any moment if necessary.

 

Tip

Remember: Computation cost << Storage cost

Last tips

Using a separate conda environment in a target

You might have a conda environment with a specific version of the gwf package, and another environment with the relevant analysis tools, called for example tools_environment.

To activate that environment, add two lines to the bash code executed by the target:

    eval "$(conda shell.bash hook)" 
    conda activate tools_environment

For example:

from gwf import Workflow

gwf = Workflow(defaults={'account': 'your_project'})

## A target doing FastQC on a file
gwf.target('FastQC', 
           cores=1,
           memory='8gb',
           walltime='00:05:00', 
           inputs=['data/SRR2584863_1.fastq.gz'], 
           outputs=['data/SRR2584863_1_fastqc.html', 
                    'data/SRR2584863_1_fastqc.zip']) << """ 
eval "$(conda shell.bash hook)" 
conda activate tools_environment
fastqc data/SRR2584863_1.fastq.gz
"""

Use templates

You can define templates in your workflow file, or in a separate file called, for example, templates.py. A template for the FastQC target could be

from gwf import AnonymousTarget

def fastqc(input="", cores=1, memory='8gb', walltime='00:05:00'):
    """A template for fastqc"""
    inputs = [f'{input}.fastq.gz']
    outputs = [f'{input}_fastqc.zip', f'{input}_fastqc.html']
    options = {
        'cores': cores,
        'memory': memory,
        'walltime': walltime
    }

    spec = '''
    fastqc {input}
    '''.format(input=inputs[0])

    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec)

The workflow file now imports the template and uses it to define a target:

from gwf import Workflow
from templates import fastqc

gwf = Workflow(defaults={'account': 'your_project'})

## A target doing FastQC on a file

gwf.target_from_template(
    name='fastqc_from_template',
    template=fastqc(input='data/SRR2584863_1'))

Add new files after running a workflow

If your workflow reads the available data files automatically and creates the corresponding targets, then adding new data files will update the workflow with new targets to be executed.

 

Example: adding new fastq.gz files to the data folder of this workshop will update the workflow, adding new steps and turning others from completed back to shouldrun.
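
A minimal sketch of why this works: the workflow file globs the data folder every time gwf reads it, so a newly added file immediately becomes a new target. A temporary folder and an invented accession number are used here for illustration:

```python
import glob
import os
import tempfile

data_dir = tempfile.mkdtemp()  # stands in for the workflow's data/ folder

def fastq_targets():
    """Re-scan the folder, as the workflow file does on every gwf invocation."""
    files = sorted(glob.glob(os.path.join(data_dir, '*.fastq.gz')))
    return [f'FastQC_{os.path.basename(f).split(".")[0]}' for f in files]

before = fastq_targets()  # empty folder: no targets yet

# Simulate dropping a new (hypothetical) fastq file into the folder
open(os.path.join(data_dir, 'SRR0000001_1.fastq.gz'), 'w').close()

after = fastq_targets()   # the new file now shows up as a target
```

The next gwf status or run would therefore pick up the new target automatically, without editing the workflow file.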