Advanced GenomeDK

More-than-basic things to do and tips for GDK, tinyurl.com/advGDK

Samuele Soraggi

Health Data Science sandbox, BiRC, AU

Manuel Peral-Vázquez

Molecular Biology and Genetics, AU

Dan Søndergaard

GenomeDK, Health, AU

2025-04-11

Some background

  • These slides are both a presentation and a small reference manual

  • 90% of slides are you doing stuff - open your terminals and slides

  • Official reference documentation: genome.au.dk

  • Some advanced things require a bit of practice/frustration, but I hope to reduce it

  • Most important message before starting any workshop: RTFM - Read The Field Manual! Though:

    • Some manuals are really useless, but the ones for UNIX tools are pretty good
    • Unusual options might be buried somewhere or badly explained

When you need to ask for help

  • Practical help:

    Samuele (BiRC, MBG) - samuele@birc.au.dk

  • Drop in hours:

    • Bioinformatics Cafe: https://abc.au.dk, abcafe@au.dk
    • Samuele (BiRC, MBG) - samuele@birc.au.dk - we just set up a meeting/zoom
  • General mail for assistance

    support@genome.au.dk

Program

  • 10:00-10:30:
    • Workshop Introduction
    • Some everyday things for easier life
      • and your mental sanity
  • 10:30-11:00:
    • rsync copy and backups
    • [Optional] Managing multiple terminals on tmux
    • cake
  • 11:10-12:00:
    • Web applications, ports and tunnels
    • Containers (Docker, singularity)
  • 12:45-14:00:
    • Your first pipeline with gwf, conda and containers
      • more cake
      • resource management
      • adding targets to a pipeline

Get the slides

Webpage: https://hds-sandbox.github.io/GDKworkshops/

Slides will always be up to date in this webpage

Some useful things for an easier life

  • Configuration file(s)
  • System variables
  • Safety settings
  • Shortcuts

Configuration files

Your ~/.bashrc file contains settings for your user. Those are bash commands which run at every login.

 

It is common practice for many software tools to have a configuration file in your home directory, often starting with ., which makes it a hidden file.

 

Examples:

  • .tmux.conf for tmux
  • .emacs for emacs
  • .gitconfig for git
  • .condarc for conda

Plus other things like your command history on the terminal.
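
To see which hidden configuration files already live in your home, you can list them explicitly:

ls -ad ~/.*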

Exercise I: singularity settings

Let’s add a useful setting that runs at each login. We need a temporary folder for downloaded singularity containers. The default is your home (folder ~/.singularity), which would fill up with cache material in no time.

 

Edit the file ~/.bashrc (use nano ~/.bashrc or any editor you want). Add those lines:

mkdir -p -m 700 /tmp/$USER
export SINGULARITY_TMPDIR=/tmp/$USER
export SINGULARITY_CACHEDIR=/tmp/$USER

 

The -m 700 option for the mkdir command also ensures you are the only one who can see the temporary files. Useful if you use a container with a password or sensitive info, so no one else can access it (/tmp/ is a public folder)!
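
After saving, the settings apply at your next login; you can also apply and check them right away:

source ~/.bashrc
echo $SINGULARITY_CACHEDIR
ls -ld /tmp/$USER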

Exercise II - aliases

Now, there are many repetitive things we do every day. For example:

  • remove files only after a confirmation prompt, so we can double-check
  • cd ../ and cd ../../ and … and cd ../../../../../../../

and every time it is just annoying to waste precious time. Why not create some aliases for all those tedious commands? Choose the aliases you prefer from the list below and add them to your .bashrc file:

## Safe file handling
alias rmi='rm -i'
alias cpi='cp -i'
alias mvi='mv -i'

## Upwards navigation in the File system
alias ..='cd ..'
alias ...='cd ../..'
alias ....='cd ../../..'
alias .....='cd ../../../..'

## List views
alias ll='ls -laht' #detailed
alias l='ls -aC' #compressed
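
After adding them, reload your .bashrc and try one out, for example:

source ~/.bashrc
ll    #same as ls -laht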

Exercise III - functions

Note

These are just inspirations, you can create any alias and function to avoid repetitive/long commands. Find all repetitive stupid stuff you use and wrap it up into the .bashrc file!

You can also create functions including multiple commands: for example making a directory and then cd into it.

## make and cd
mkcd() {
    mkdir -p "$1" && cd "$1"
    echo created "$1"
    }
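
For example (a throwaway path, just for testing):

mkcd test/deep/folder
pwd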

Why not make one for git clone that downloads only the latest commit and checks out only the folders you choose from the repository?

# Git clone with depth 1 and choice of folders
#   arg 1: username/repository
#   arg 2: folders and files in quotes '', space separated
#   arg 3: download folder name (optional, default:repo)
#   arg 4: branch (optional, default:main)
# Examples:
#   ghdir github/gitignore 'community Global' test01 main
#   ghdir github/gitignore 'community Global' 
ghdir() {
        echo Downloading from $1 in folder $3
        echo Selecting $2
        if [ -z "$4" ]; then
          BRANCH="-b main"
        else
          BRANCH="-b $4"
        fi
        git clone --no-checkout $BRANCH --filter=blob:none --depth 1 https://github.com/$1.git $3
        if [ -z "$3" ]; then
          folder=$(echo "$1" | cut -d'/' -f2)
          cd "$folder"
        else
          cd "$3"
        fi
        git sparse-checkout init --cone
        git sparse-checkout set $2
        git checkout
    }

Synchronization, multiple terminals

  • How to copy using rsync
  • Use rsync to create backups and versioning
  • Create and navigate multiple sessions with tmux
  • Launch parallel background downloads with tmux

Transfer and sync with rsync

rsync is a very versatile tool for

  • transferring from a remote to a local host (and vice versa)
  • copying from local to local (e.g. data backups/sync)
  • transferring only the files which have changed since the last copy (incremental copy)

Warning

rsync cannot make a transfer between two remote hosts, e.g. running from your PC to transfer data between GenomeDK and Computerome.

rsync cannot download from web URLs

There are lots of options you can find in the manual (covering them all would require a dedicated workshop)

rsync manual

Exercise

Log into GenomeDK. Create, anywhere you prefer, a folder called advancedGDK containing rsync/data

mkdir -p advancedGDK/rsync/data
cd advancedGDK/rsync

Create 100 files with extensions fastq and log in the data folder

touch data/file{1..100}.fastq data/file{1..100}.log

Local-to-local copy

Note

The syntax of rsync is pretty simple:

rsync OPTIONS ORIGIN(s) DESTINATION

 

An archive (incremental) copy can be done with the option -a. You can add a progress bar with -P. You can exclude files: here we want only the ones with the fastq extension. Run the command

rsync -aP --exclude="*.log" data backup

This will copy all the fastq files in backup/data. You can check with ls.

Warning

Using data will copy the entire folder, while data/ will copy only its content! This is common to many other UNIX tools.

Change the first ten fastq files by appending some text:

for i in {1..10}; do echo hello >> data/file$i.fastq; done

Now, we not only want to do an incremental copy of those files with rsync, but also keep their previous versions. We create a folder to back them up, named with date and time (you will find it inside your backup directory):

rsync -aP --exclude="*.log" \
      --backup \
      --backup-dir=versioning_$(date +%F_%T)/ \
      data \
      backup
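
If you are unsure what rsync will do, you can preview the transfer first with the -n (--dry-run) option, which lists the files without copying anything:

rsync -aPn --exclude="*.log" data backup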

Tip

If you create a folder called backup in your project folder, you can use versioning to store your analysis and results with incremental changes.

Exercise finished

Transfer between local and remote

In the same way, you can transfer and back up data between your local host (your PC, or GenomeDK) and another remote host (another cluster). You need Linux or macOS on the local host. For example, to get the same fastq files onto your computer:

rsync -aP --exclude="*.log" USERNAME@login.genome.au.dk:PATH_TO/advancedGDK/data PATH_TO/backup

The opposite can be done uploading data from your computer. For example:

rsync -aP --exclude="*.log" PATH_TO/data USERNAME@login.genome.au.dk:PATH_TO/backup

 

Avoid a typical error

To transfer from GenomeDK to your computer, and vice versa, you need to run the commands above from your local computer, and NOT while you are logged into GenomeDK!

[Optional] Manage multiple terminals with tmux

tmux is a server application managing multiple terminal sessions. It can

  • start a server with multiple sessions
  • hold, in each session, one or more windows with multiple terminals (panes)
  • run each terminal simultaneously, to be accessed (attached) or exited from (detached)
  • keep the tmux server running even without a logged-in user

Exercise

tmux is keyboard-driven software, but you can also set it up to switch windows and panes with the mouse. Simply run this command (only once) to enable mouse usage:

echo "set -g mouse" >> ~/.tmux.conf

Warning

Using the mouse can create problems in some terminal programs, where copy-paste starts acting weird, e.g. on Mac computers and in MobaXterm on Windows. If you have a bad experience, remove the mouse setting from the file ~/.tmux.conf

 

You can start a tmux session anywhere. It is easier to navigate sessions if you give them a name. For example, start a session called example1:

tmux new -s example1

The command will set you into the session automatically. The window looks something like below:

Now, you are in session example1 and have one window, which you are using now. You can split the window into multiple terminals (panes). Try both of these key combinations:

 

Ctrl + b    %

Ctrl + b    "

 

Or right-click and hold with the mouse to choose the split.

Split the window horizontally and vertically, so that 3 terminals are running. You can select any of them with the mouse (left-click).

Try to select a pane and resize it: while keeping Ctrl + b pressed, use the arrow keys to change the size.

Now, you have your three panes running in a window.

Create a new window with Ctrl + b c, or right-click on the window bar and create a new window.

 

You should see another window added in the bottom window bar. Again, switch between windows with your mouse (left-click!)

In the new window, let’s look at which tmux sessions and windows are open. Run

tmux ls

 

The output will tell you that session example1 is in use (attached) and has 2 windows. Something like this:

example1: 2 windows (created Wed Apr  2 16:12:54 2025) (attached)
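
A few commands you will use all the time to move between sessions (using the session name from this exercise):

tmux attach -t example1        #attach to an existing session
tmux kill-session -t example1  #remove a session you no longer need
#detach from inside a session with Ctrl + b  d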

Launching separate downloads at the same time

Start a new session without attaching to it (-d option), and call it downloads:

tmux new-session -d -s downloads

Verify the session is there with tmux ls.

Warning

If you want to create a new session and attach to it, you first need to detach from the current session with Ctrl + b d.

Create a text file listing a few example files for this workshop to download.

curl -s https://api.github.com/repos/hds-sandbox/GDKworkshops/contents/Examples/rsync | jq -r '.[] | .download_url' > downloads.txt
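
You can quickly check that the list looks right:

wc -l downloads.txt
head -n 3 downloads.txt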

Now, the script below launches each URL from the list in the downloads session, each in a new window. The window closes when its download finishes. If fewer than K downloads are active, a new one starts, until the list is exhausted! You can run this and close your terminal: the downloads keep going, and the window names are printed so you can keep an eye on the current downloads. Try it out and use it whenever you have a massive number of files to download.

mkdir -p downloaded
K=2  # Maximum number of concurrent downloads
while read -r url; do
    # Wait until the number of active tmux windows in the "downloads" session is less than K
    while [ "$(tmux list-windows -t downloads | wc -l)" -ge "$((K+1))" ]; do     
        sleep 1
    done

    # Extract the filename from the URL
    filename=$(basename "$url")

    # Start a new tmux window for the download
    tmux new-window -t downloads -n "$filename" "wget -c $url -O downloaded/$filename && tmux kill-window"
    tmux list-windows -t downloads -F "#{window_name}"   
done < downloads.txt
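
While it runs (even after closing and reopening your terminal), you can attach to the session to keep an eye on progress, and check the downloaded files:

tmux attach -t downloads
ls -lh downloaded/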

Exercise finished

Web applications, ports, tunneling

Web applications

  Why do we use web applications for graphical interfaces?

  - all the graphics heavy-lifting is done by the browser
  - before, X11 forwarding was the way to do graphics from a remote host
  - problem: X11 sends the whole graphics over the network, which is slow

  We have written a bit about that on one of our ABC tutorials

A web application on GenomeDK:

  • starts a server process on the cluster
  • This server listens for incoming requests on a specific port
  • The server sends and receives data over the network via the port.

 

The local user:

  • creates a tunnel, i.e. an ssh connection mapping a local port to the remote port used by the server process

What the port forwarding looks like from the local user (your PC) to the remote node of the cluster. The purple command has to be launched on the local computer, once the server is running on the remote host. Source: KU Leuven.

Which port to use

  • Each server process on a machine needs a unique port (p2 on previous figure) to avoid conflicts.

  • Ports are shared between all users on GenomeDK, so you should only use the port corresponding to your user ID, which you can see with

    echo $UID

  • You will see all this in the next exercise

better safe than sorry

Only launch web applications that have tokens (a random code in the URL for the browser) or a password. Otherwise, anyone on your same node of the cluster can in theory get into your server process and see your program and data!

Exercise: jupyterlab web server

If you DO NOT have the conda package manager, you can quickly install it from the box below, otherwise move to the next slide!

Install conda

Run the installation script to install the software. There are some informative messages during the installation. You might need to say yes a few times

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
chmod +x miniforge.sh
bash miniforge.sh -b
~/miniforge3/bin/conda init bash
source ~/.bashrc

When you are done, you need a little bit of configuration. You can run the following commands to configure conda (run them only once, they are not needed again):

conda config --append channels conda-forge
conda config --append channels bioconda
conda config --set channel_priority strict
conda config --set auto_activate_base false
source ~/.bashrc

Finally, install a package to accelerate conda. This is optional but recommended:

conda install -n base --yes conda-libmamba-solver
conda config --set solver libmamba

Now you are ready to move on!

Create a new conda environment:

conda create -y -n GDKjupyter jupyterlab numpy pandas

 

Start a job with just a few resources. Run it inside a tmux session, so it will keep running until the end even if you disconnect!

srun --mem=4g --cores=1 --time=4:00:0 --pty /bin/bash

 

Now activate your environment and run jupyterlab. The node name (remote host) and your user ID (for the port number) are given as variables to the web server simply using $(hostname) and $UID:

conda activate GDKjupyter
jupyter-lab --port=$UID --ip=$(hostname)

You will see some messages, including an address containing your node name and your user ID. Below it is the URL to use in your browser, which always starts with 127.0.0.1. It will look like this, though in your case the URL might be longer and contain a random token (mine is password protected instead):

 

Write down the node name and user ID you got from JupyterLab.

We now need a tunnel from the local host (your PC)! Write in a new terminal (not logged into GenomeDK) a command like the following (matching the example figure above), substituting the correct values:

ssh -L6835:s21n34:6835 USERNAME@login.genome.au.dk

 

Your tunnel is now open. The web application can be accessed from your browser at the address given by the server process on GenomeDK, which is 127.0.0.1:PORT_NUMBER (use your own port number).

Exercise finished

Tip

The same logic applies to all other web applications. They will have similar options to define the remote node and port.

Usually the host node option changes name between ip, host, address, or hostname.

 

Note also that you will always use the address 127.0.0.1:PORT_NUMBER in your browser. You might as well save it in your browser’s bookmarks.
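
As a generic pattern, with placeholders to substitute for your own node and port, the tunnel command and the browser address are:

ssh -LPORT:NODE:PORT USERNAME@login.genome.au.dk    #run on your LOCAL computer
#then browse to http://127.0.0.1:PORT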

Containers

What is a container

Container = Standalone and portable software package including

  • code
  • runtime
  • libraries
  • system tools
  • operating system-level dependencies

 

Deployment of a container on different HPCs and PCs is reproducible. A virtual environment (conda, mamba, pixi, cargo, npm, …) depends instead on the host system and is not necessarily fully reproducible across two different machines.

Container vs virtual env vs VM

A virtual environment isolates dependencies for a specific programming language within the same operating system. It does not include the operating system or system-level dependencies, so it depends on the hosting system.

  • A virtual machine (VM) virtualizes an entire operating system, including the kernel, and runs on a hypervisor (assigns resources to the VM).

Scope of containers

  • Containers are usually thought of as packaging a specific application so that it can run anywhere (i.e. it is portable).

  • E.g. annoying bioinformatics software requiring specific libraries or long installations.

  • Many containers are done with Docker, but this is not installed on GenomeDK

  • GenomeDK has Singularity, which can also run programs installed in Docker containers.

  • Very practical to pull and use containers in workflows.

Where to find containers?

Typical repositories with pre-built containers are:

  • Biocontainers: community-driven initiative to containerize bioinformatics software.

  • DockerHub registry: Public hub for Docker images, often including official containers from software developers

  • Quay: Same philosophy as DockerHub.

Once you find a container on these websites, simply use (and adapt, if needed) the provided code to pull it locally.

Exercise I: a simple bioinformatics container

We use biocontainers to pull containers and recreate a little bulkRNA alignment. First create and attach to a new tmux session called biocontainers.

 

Make a folder called containers101 in the advancedGDK directory.

mkdir -p advancedGDK/containers101
cd advancedGDK/containers101

 

Find a command-line bioinformatics tool you know on the Biocontainers registry and open it to see a detailed description - if you have no good ideas, look for the minimap2 aligner. You should see documentation illustrating how to run the software in various ways - see the next slide for a screenshot.

Example: Biocontainers page for the minimap2 splice-aware aligner. Click on the image to enlarge.


The page will suggest running the container immediately with singularity. I suggest downloading it first (pull instead of run) and then running it.

In the case of minimap2:

singularity pull minimap2.sif https://depot.galaxyproject.org/singularity/minimap2:2.28--h577a1d6_4

which creates a container file in your current directory called minimap2.sif (a nicer name than the default minimap2:2.28--h577a1d6_4).

 

You can either run the container interactively, which opens a command line inside it where you can execute the program, or pass the command directly in a non-interactive way (useful for pipelines). For example:

This opens the command line inside the container, where you can then run minimap2:

singularity run minimap2.sif

This will immediately run minimap2 from the container - useful for a pipeline where you do not need any interaction with the command line.

singularity run minimap2.sif minimap2
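
As a quick check that the container works, you can ask the tool for its version:

singularity run minimap2.sif minimap2 --version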

Exercise II: options for singularity

Some containers are very easy to run. Others need a lot of extra options, such as binding specific folders to a path, providing certificates, or initializing environment variables. A small example below:

Environment variables

The option --env can be used to define environment variables, either needed by your software, or useful to write the code to be executed.

 singularity pull samtools.sif https://depot.galaxyproject.org/singularity/samtools:1.21--h96c455f_1

 singularity run --env REF_PATH=$(pwd) \
                 --env BAM_PREFIX=test01 \
                 samtools.sif \
                 samtools view \
                 -h https://github.com/roryk/tiny-test-data/raw/refs/heads/master/wgs/mt.sorted.bam \
                 -O bam > test01.bam

Oops, it does not work! Can you think of why? Hint: note that the address for the data download begins with https.

Certificates and binding

Why the previous slide didn’t work!!!

HTTPS is a protocol for secure communication over the internet. When you access a URL starting with https, your system checks the server’s SSL/TLS certificate to ensure it is issued by a trusted Certificate Authority (CA). Why? To prevent man-in-the-middle (MITM) attacks, where an attacker eavesdrops on the data transfer.

The option --bind can be used to make your folders available in a specific path for the container. Here, we bind two folders containing the certificates for GDK. Those paths are quite standard across UNIX systems.

singularity run --env REF_PATH=$(pwd) \
                --env BAM_PREFIX=test01 \
                --bind /etc/ssl/certs:/etc/ssl/certs \
                --bind /etc/pki/ca-trust:/etc/pki/ca-trust \
                samtools.sif \
                samtools view \
                -h https://github.com/roryk/tiny-test-data/raw/refs/heads/master/wgs/mt.sorted.bam \
                -O bam > test01.bam  #the redirection is expanded by your local shell, where BAM_PREFIX is not set

Exercise finished

A few other options can be used with singularity. When you need them really depends on the application:

  • --fakeroot: Allows running the container with root privileges in a user namespace. Useful for containers that require root access without needing actual root privileges.

  • --writable: Enables writing to the container image. This is useful for making changes to the container, but it requires the container to be writable. Use --writable-tmpfs for temporary changes that are not persistent, in a non-writable container.

  • --contain: Isolates the container from the host system by limiting access to the host’s filesystem.

  • --no-home: Prevents the container from automatically binding the user’s home directory. Avoids exposing your home directory to the container.

  • --cleanenv: Clears all environment variables except those explicitly set for the container.

  • --nv: Enables NVIDIA GPU support by binding the necessary libraries and drivers. Equivalent for AMD GPUs is --rocm (not the case on GDK).

  • --pwd: Sets the working directory inside the container.
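
As an illustration (a sketch, not something you need for this exercise), a few of these options can be combined, reusing the samtools.sif container from before:

#clean environment, no home mounted, current folder bound to /data and used as working directory
singularity run --cleanenv \
                --no-home \
                --bind $(pwd):/data \
                --pwd /data \
                samtools.sif \
                samtools --version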

Your first pipeline

  • An example pipeline
  • parallelizing
  • templates
  • environments and containers
  • resource usage

Split and merge

A classical bioinformatics pipeline has split and merge steps. We will work with this one:

flowchart LR
  A0{data.fq} --> A["Split(data.fq)"]
  A --> B["table(part000.fq)"]
  A --> C["table(part002.fq)"]
  A --> D["table(part....fq)"]
  A --> E["table(part009.fq)"]
  B --> F["merge(table_[000-009].tsv)"]
  C --> F
  D --> F
  E --> F
  F --> G{table.tsv}

 

where split and table (which generates base counts and statistics in a table) are done through seqkit.
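
For reference, the splitting step itself boils down to a single seqkit command (the same one you will later find in the spec of the split target):

seqkit split2 -O gwf_splitted --by-part 10 --by-part-prefix part data.fq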

Software: gwf

There are many popular workflow managers (Snakemake, Nextflow), which require learning their own workflow language. We use gwf instead:

  • Developed at AU (Dan at GenomeDK) and used also at MOMA, AUH, …
    • Easy to find support
  • In python, no need for new language
    • You can use all python functions to build your workflow!
  • Easy to structure a workflow and check resource usage
    • Reusable templates
    • Very pedagogical
  • Conda, Pixi, Container usage out-of-the-box

 

Once you understand workflows with gwf, it is a matter of learning how to use other workflow languages, if you wish.

Workflow objects in gwf

The whole workflow is contained in a Workflow object. This is initiated usually like this:

 

#import libraries
from gwf import *

#initiate a workflow object
gwf = Workflow( )

 

Using project resources

Now you have an empty workflow, which will use the few free resources each user has available. Resources from a project you are part of can be used instead by initializing the workflow with

gwf = Workflow( defaults={"account": "PROJECT_NAME"} )

Templates in gwf

Now you can populate your Workflow object with templates, which are simple python functions. You can create executors for your templates (environments and containers, which will run the programs you need)!

The template is a blueprint for a specific block of the workflow.

For example, the 10 parallel table blocks can be described by one reusable template:

#create a software executor from conda environment `seqkitEnv`
from gwf.executors import Conda
conda_env = Conda("seqkitEnv")

def table(infile): #Name of the template

  inputs  = [infile]  #input file(s)
  outputs = [f"{infile}.fx2tab.tsv"] #name of the output(s)
  options = {"cores": 1, "memory": "1g", "walltime": "02:00:00"} #resources to use

  #the bash commands to run when using the template
  spec = f"""
  seqkit fx2tab -n -l -C A -C C -C G -C T -C N -g -G -s {infile} > {outputs[0]}
  """

  #here you provide all specifications defined in the template. 
  #This is usually always as below, 
  #apart from the executor, which you decide yourself
  return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec, executor=conda_env)
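
The split step can be written as a similar template. Below is a minimal sketch reconstructed from the gwf info output shown later in the exercise; the actual template is already provided in the workflow.py file you will download:

def split(infile, parts): #template splitting infile into `parts` chunks
  inputs  = [infile]
  outputs = [f"gwf_splitted/part{i:03d}.fq" for i in range(1, parts + 1)]
  options = {"cores": 1, "memory": "4g", "walltime": "05:00:00"}

  spec = f"""
  seqkit split2 -O gwf_splitted --by-part {parts} --by-part-prefix part {infile}
  """

  return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec, executor=conda_env)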

Apply the template

The split and table templates are then applied to the relevant files:

parts=10 #how many parts to split into

# Apply the split template to data.fq to split in 10 parts. Call this step of the workflow "split"
# T1 collects info of the workflow step (e.g. output names)
T1 = gwf.target_from_template("split", split(infile="data.fq", parts=parts))

# Tabulate each part of the splitting and save the name of the tables in
# seqkit_output_files. We apply the template table to each input file of the
# for loop. The step of the workflow is called table_1, table_2, ..., table_10
seqkit_output_files = []
for i, infile in enumerate(T1.outputs):
   T2 = gwf.target_from_template(f"table_{i}", table(infile  = infile))
   seqkit_output_files.append(T2.outputs[0])

We try out exactly those templates in the following exercise.

Exercise I: workflow with conda containers

Prepare everything for the exercise: create a new folder, get data and workflow file

 mkdir -p advancedGDK/pipeline
 cd advancedGDK/pipeline

 wget https://github.com/hds-sandbox/GDKworkshops/raw/refs/heads/main/Examples/smallGwf/data.fq -O data.fq
 wget https://github.com/hds-sandbox/GDKworkshops/raw/refs/heads/main/Examples/smallGwf/workflow.py -O workflow.py

Create a conda environment for seqkit and one for the gwf workflow software. Download the seqkit container as well.

 

Tip

If packages cannot be found, you need to add these default channels as well

conda config --add channels bioconda
conda config --add channels conda-forge

 

conda config --add channels gwforg
conda create -y -n pipelineEnv gwf=2.1.1
#package for resource usage/check
conda install -y -n pipelineEnv -c micknudsen gwf-utilization
#env for seqkit
conda create -y -n seqkitEnv seqkit
#Container
singularity pull seqkit_2.10.0.sif https://depot.galaxyproject.org/singularity/seqkit:2.10.0--h9ee0642_0

Now look at the status of your workflow. You should recognize all the steps (targets). Those are marked shouldrun, because the outputs and/or inputs do not exist yet. Remember to activate the environment for gwf.

 

conda activate pipelineEnv

gwf -f workflow.py status

 

Tip

You do not need -f workflow.py if your workflow file has the name workflow.py, which is the default for gwf.

Now, you might also want to look at what a specific target looks like when the workflow is built

gwf info split

You will be able to see the actual inputs, outputs, the targets it depends on, and the targets depending on it:

{
    "split": {
        "options": {
            "cores": 1,
            "memory": "4g",
            "walltime": "05:00:00"
        },
        "inputs": [
            "data.fq"
        ],
        "outputs": [
            "gwf_splitted/part001.fq",
            "gwf_splitted/part002.fq",
            "gwf_splitted/part003.fq",
            "gwf_splitted/part004.fq",
            "gwf_splitted/part005.fq",
            "gwf_splitted/part006.fq",
            "gwf_splitted/part007.fq",
            "gwf_splitted/part008.fq",
            "gwf_splitted/part009.fq",
            "gwf_splitted/part010.fq"
        ],
        "spec": "\n    seqkit split2 -O gwf_splitted --by-part 10 --by-part-prefix part data.fq\n    ",
        "dependencies": [],
        "dependents": [
            "table_6",
            "table_8",
            "table_3",
            "table_0",
            "table_1",
            "table_4",
            "table_5",
            "table_9",
            "table_7",
            "table_2"
        ]
    }
}

Tip

Do you want to see the whole workflow in a text editor? Use the less viewer,

gwf info | less

or output into a text file:

gwf info > workflow_explicit.txt

 

Now, you can run specific targets. Let’s specify some names to test out our workflow.

gwf run split table_0

Tip

You can run the entire workflow with gwf run when you are sure of your targets working correctly with the right resources.

Check the status: the two targets will be submitted; split has to run first, and its dependent table_0 will run when the file part001.fq is generated! We use watch in front of the command to update its view every two seconds.

watch gwf status

 

At some point, you will see the running status (for a few seconds) and completed status.

Exercise break

Tip

When the status of a target is failed, or you want to verify messages on the command line, you can visualize the standard output and standard error messages of your target with the commands below (example with the split target):

 

gwf logs split 
gwf logs -e split

Resize the workflow resources and add a step

How many resources did split and table_0 use? Run the utilization command:

gwf utilization

 

The table shows we underutilized the resources. Now open workflow.py and change your resource usage for the split and table_ steps. Then, run the target:

gwf run table_1

 

Check again resource usage when the status is completed. Did it get better?

Exercise break

Exercise II: singularity in workflows

Now, you will change the executor for the template table. Your task is to:

  • open the workflow.py file

  • below importing the Conda module (line 2), add a new line with

    from gwf.executors import Singularity

  • Now, define a new executor. Below the line where you define conda_env = Conda("seqkitEnv"), use a similar syntax for Singularity, where you provide the container file as argument (see the sketch after this list).

  • At the end of the table template, use the new executor instead of conda_env.
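
A minimal sketch of those changes (assuming you named the pulled container seqkit_2.10.0.sif, as in the earlier pull command):

#at the top of workflow.py, next to the Conda import
from gwf.executors import Singularity

#next to the line conda_env = Conda("seqkitEnv")
seqkit_container = Singularity("seqkit_2.10.0.sif")

#at the end of the table template, swap the executor
return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec, executor=seqkit_container)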

 

Did you do it right? If yes, then you should be able to run the combine target:

gwf run combine

and see its status become completed after some time, and all output files should be created in your folder! If not, something is wrong. Ask for help, or look at the solution file, if you prefer.

Note

Because combine depends on all table_ targets, it will submit all those targets as well, which need to run first.

Exercise break

Exercise III: Your own target!

Ok, now we want to extend the workflow and do quality control on the part###.fq files.

flowchart LR
  A0{data.fq} --> A["Split(data.fq)"]
  A --> B["table(part000.fq)"]
  A --> C["table(part002.fq)"]
  A --> D["table(part....fq)"]
  A --> E["table(part009.fq)"]
  B --> F["merge(table_[000-009].tsv)"]
  C --> F
  D --> F
  E --> F
  F --> G{table.tsv}
  B --> H["qc(part[000-009].fq)"]
  C --> H
  D --> H
  E --> H
  H --> I{"multiqc_report.html"}

You need to:

  • create a new environment called qcEnv where you install the two packages fastqc and multiqc.
  • add a new executor called qc_env based on Conda("qcEnv").
  • create a new template which starts with def qc(data_folder)
    • this will need all ten gwf_splitted/part###.fq files as input (you can copy the output file list of the split template, where you use a variable {data_folder} instead of the explicit folder name!)

    • as output you want a single file called reports/multiqc_report.html (the default name for the generated report)

    • as bash commands you need:

      mkdir -p reports
      fastqc -o reports gwf_splitted/*.fq
      multiqc -o reports reports/
    • remember to set the correct executor at the end of the template

    • now you need to create one single target from the template, call it qc. You only need to give as input the name of the folder with the fq files

Tip

  • Before running any targets, always use gwf info qc to check dependencies.
  • Copy previous similar templates and modify them where needed, instead of writing each template from scratch

When you are sure you are done, then use gwf run qc. Its status should be completed if it runs successfully.

Ask for help, or look at the solution file, if you prefer.
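
If you want a starting point, here is a minimal sketch of the template and target (names and resource values are only suggestions; adapt them and compare with the solution file):

#executor based on the new environment
qc_env = Conda("qcEnv")

def qc(data_folder): #quality control on all the split fastq files
  inputs  = [f"{data_folder}/part{i:03d}.fq" for i in range(1, 11)]
  outputs = ["reports/multiqc_report.html"]
  options = {"cores": 1, "memory": "2g", "walltime": "01:00:00"}

  spec = f"""
  mkdir -p reports
  fastqc -o reports {data_folder}/*.fq
  multiqc -o reports reports/
  """

  return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec, executor=qc_env)

#one single target from the template
gwf.target_from_template("qc", qc(data_folder="gwf_splitted"))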

Exercise break

Note

Good practices:

  • do not make tiny and numerous jobs, try to put together things with similar resource usage
  • start testing one of many parallel elements of a workflow
    • determine resource usage (gwf utilization - needs a plugin, see exercises after)
    • then adjust all elements accordingly
  • verify often your code is correct
    • gwf info | less -S to check dependencies
    • remember you are using python, check some of your code directly in it!
  • when done, you are reproducible if you share, together with your datasets:
    • container pull commands
    • conda environment(s) package list
    • workflow files

Closing the workshop

Please fill out this form :)

  • A lot of things we could not cover

  • use the official documentation!

  • ask for help, use drop in hours (ABC cafe)

  • try out stuff and google yourself out of small problems

  • Slides updated over time, use as a reference

  • Next workshop all about pipelines