More-than-basic things to do and tips for GDK, tinyurl.com/advGDK
Health Data Science sandbox, BiRC, AU
Molecular Biology and Genetics, AU
GenomeDK, Health, AU
2025-04-11
These slides are both a presentation and a small reference manual
90% of slides are you doing stuff - open your terminals and slides
Official reference documentation: genome.au.dk
Some advanced things require a bit of practice/frustration, but I hope to reduce it
Most important message before starting any workshop: RTFM - Read The Field Manual!
Practical help:
Samuele (BiRC, MBG) - samuele@birc.au.dk
Drop in hours:
General mail for assistance
support@genome.au.dk
rsync: copy and backups
tmux
gwf, conda and containers
Webpage: https://hds-sandbox.github.io/GDKworkshops/
The slides will always be kept up to date on this webpage
Your ~/.bashrc file contains settings for your user. These are bash commands which run at every login.
Common practice for many programs is to have a configuration file in your home directory, often starting with a dot (.), which makes it a hidden file.
Examples:
.tmux.conf for tmux
.emacs for emacs
.gitconfig for git
.condarc for conda
Plus other things like your command history on the terminal.
Let’s make a useful setting run at each login. We will need a temporary folder for the singularity containers which are downloaded. The default is your home directory (the folder ~/.singularity), which will fill up with cache material in no time.
Edit the file ~/.bashrc
(use nano ~/.bashrc
or any editor you want). Add those lines:
mkdir -p -m 700 /tmp/$USER
export SINGULARITY_TMPDIR=/tmp/$USER
export SINGULARITY_CACHEDIR=/tmp/$USER
The -m 700 option for the mkdir command also ensures that you are the only one who can see the temporary files. This is useful if you use a container with a password or sensitive info, so no one else can access it (/tmp/ is a public folder)!
Now, there are many repetitive things we do every day. For example:
cd ../
and cd ../../
and … and cd ../../../../../../../
and every time it is just annoying to waste precious time. Why not create some aliases for all those deplorable commands? Choose the aliases you prefer from the list below and add them to your .bashrc
file:
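For example, a few possible aliases (just a sketch - pick and adapt the ones you like; the sq alias assumes you check your jobs with Slurm):

# hypothetical example aliases for ~/.bashrc
alias ..='cd ..'
alias ...='cd ../..'
alias ....='cd ../../..'
alias ll='ls -lah'
alias sq='squeue -u $USER'   # assumes the Slurm scheduler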
Note
These are just inspirations, you can create any alias and function to avoid repetitive/long commands. Find all the repetitive stuff you type and wrap it up in the .bashrc
file!
You can also create functions including multiple commands: for example, making a directory and then cd-ing into it.
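A minimal sketch of such a function (the name mkcd is just an example):

# make a directory (and its parents) and cd into it
mkcd() {
    mkdir -p "$1" && cd "$1"
}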
Why not make one for git clone that downloads only the latest commit and checks out only specific folders of the repository?
# Git clone with depth 1 and choice of folders
# arg 1: username/repository
# arg 2: folders and files in quotes '', space separated
# arg 3: download folder name (optional, default: repository name)
# arg 4: branch (optional, default: main)
# Examples:
# ghdir github/gitignore 'community Global' test01 main
# ghdir github/gitignore 'community Global'
ghdir() {
    echo "Downloading from $1 in folder $3"
    echo "Selecting $2"
    if [ -z "$4" ]; then
        BRANCH="-b main"
    else
        BRANCH="-b $4"
    fi
    git clone --no-checkout $BRANCH --filter=blob:none --depth 1 https://github.com/$1.git $3
    if [ -z "$3" ]; then
        folder=$(echo "$1" | cut -d'/' -f2)
        cd "$folder"
    else
        cd "$3"
    fi
    git sparse-checkout init --cone
    git sparse-checkout set $2
    git checkout
}
rsync to create backups and versioning
tmux to manage terminal sessions

rsync

rsync is a very versatile tool for copying and synchronizing files and folders, both locally and between your computer and a remote host.
Warning
rsync cannot make a transfer between two remote hosts, e.g. running it from your PC to transfer data between GenomeDK and Computerome.
rsync cannot download from web URLs.
There are lots of options you can find in the manual (they would require a workshop of their own).
Log into GenomeDK. Create, anywhere you prefer, a folder called advancedGDK containing the subfolder rsync/data.
Create 100 files with extensions fastq and log in the data folder.
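One way to do it (a sketch; the file names are arbitrary):

mkdir -p advancedGDK/rsync/data
cd advancedGDK/rsync
# create 100 empty fastq and log files
for i in $(seq -w 1 100); do
    touch data/file_${i}.fastq data/file_${i}.log
done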
Note
The syntax of rsync
is pretty simple:
rsync OPTIONS ORIGIN(s) DESTINATION
An archive (incremental) copy can be done with the option -a. You can add a progress bar with -P. You can exclude files: here we want only the ones with the fastq extension. Run the command:
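A command along these lines should work (a sketch, assuming you are inside the rsync folder; excluding the log files is one way to keep only the fastq files):

mkdir -p backup
rsync -aP --exclude='*.log' data backup/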
This will copy all the fastq files into backup/data. You can check with ls.
Warning
Using data will copy the entire folder, while data/ will copy only its content! This is common to many other UNIX tools.
Change the first ten fastq files with some text:
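For example (a sketch, assuming the file names from the earlier step):

# append a line to the first ten fastq files
for i in $(seq -w 1 100 | head -n 10); do
    echo "some new text" >> data/file_${i}.fastq
done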
Now, we do not only want to make an incremental copy of those files with rsync, but also keep their previous versions. We create a folder to back those up, naming it with date and time (you will find it in your backup directory):
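A sketch using rsync's --backup option (the folder name is arbitrary; it is created inside the destination):

rsync -aP --exclude='*.log' \
      --backup --backup-dir=old_$(date +%Y%m%d_%H%M%S) \
      data backup/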
Tip
If you create a folder called backup
in your project folder, you can use versioning to store your analysis and results with incremental changes.
Exercise finished
You can in the same way transfer and back up data between your local host (your PC, or GenomeDK) and another remote host (another cluster). You need Linux or Mac on the local host. For example, to get the same fastq files onto your computer:
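Something like this, run from your own machine (USERNAME, the remote path and the local destination are placeholders; GenomeDK's login address is login.genome.au.dk):

rsync -aP --exclude='*.log' \
      USERNAME@login.genome.au.dk:advancedGDK/rsync/data ./local_backup/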
The opposite can be done by uploading data from your computer. For example:
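Again from your local machine (here ./myresults is a hypothetical local folder):

rsync -aP ./myresults USERNAME@login.genome.au.dk:advancedGDK/rsync/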
Avoid a typical error
To transfer from GenomeDK to your computer, and vice versa, you need to use the commands above from your local computer, and NOT while you are logged into GenomeDK!
tmux

tmux is a server application managing multiple terminal sessions. It can keep sessions (and the programs running inside them) alive when you disconnect, and it can split each window into multiple panes.
tmux is keyboard-driven software, but you can also set it up to change windows and panes with the mouse. Simply run this command (only once) to enable mouse usage:
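# append the mouse option to your tmux configuration file
echo 'set -g mouse on' >> ~/.tmux.conf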
Warning
Using the mouse can create problems in some terminal programs, where copy-paste starts acting weird, e.g. on Mac computers and on Windows' MobaXterm software. In case you have a bad experience, remove the mouse setup from the file ~/.tmux.conf
You can start a tmux session anywhere. It is easier to navigate sessions if you give them a name. For example, start a session called example1:
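# create and attach to a session named example1
tmux new -s example1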
The command will set you into the session automatically. The window looks something like below:
Now, you are in session example1 and have one window, which you are using now. You can split the window into multiple panes. Try both these key combinations:
Ctrl + b %
Ctrl + b "
Or right-click and hold with the mouse to choose the split.
Split the window horizontally and vertically, running 3 terminals. You can select any of them with the mouse (left-click).
Try to select a pane and resize it: while keeping Ctrl + b pressed, use the arrows to change the size.
Now, you have your three panes running in a window.
Create a new window with Ctrl + b c
. Or keep right-clicked on the window bar and create a new window.
You should see another window added in the bottom window bar. Again, switch between windows with your mouse (left-click!)
In the new window, let’s look at which tmux sessions and windows are open. Run
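tmux ls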
The output will tell you that session example1
is in use (attached) and has 2 windows. Something like this:
example1: 2 windows (created Wed Apr 2 16:12:54 2025) (attached)
Start a new session without attaching to it (-d option), and call it downloads:
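tmux new -s downloads -d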
verify the session is there with tmux ls
.
Warning
If you want to create a new session and attach to it, you first need to detach from the current session with Ctrl + b d.
Create a text file listing a few example files to be downloaded for this workshop.
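For example (a sketch; put one URL per line and add more URLs of your choice - the one below is the test file used later in these slides):

cat > downloads.txt << 'EOF'
https://github.com/roryk/tiny-test-data/raw/refs/heads/master/wgs/mt.sorted.bam
EOF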
Now, the script below launches each URL from the list as a download in a new window of the downloads session. The new window closes after the download. If there are fewer than K downloads active, a new one starts, until the list is finished! You can use this and close your terminal: the downloads will keep going, and the window names will be shown so you can keep an eye on the current downloads. Try it out and use it whenever you have a massive number of files to download.
mkdir -p downloaded
K=2  # Maximum number of concurrent downloads

while read -r url; do
    # Wait until the number of active tmux windows in the "downloads" session is less than K
    # (the session always has one base window, hence the K+1)
    while [ "$(tmux list-windows -t downloads | wc -l)" -ge "$((K+1))" ]; do
        sleep 1
    done
    # Extract the filename from the URL
    filename=$(basename "$url")
    # Start a new tmux window for the download; the window closes itself when wget finishes
    tmux new-window -t downloads -n "$filename" "wget -c $url -O downloaded/$filename && tmux kill-window"
    tmux list-windows -t downloads -F "#{window_name}"
done < downloads.txt
Exercise finished
Why do we use web applications for graphical interfaces?
- all the graphics heavy lifting is done by the browser
- before, X11 forwarding was the way to do remote graphics
- problem: X11 sends the whole graphical output over the network, which is slow
We have written a bit about that in one of our ABC tutorials
A web application on GenomeDK: a server process running on a node of the cluster, listening on a port.
The local user: an ssh connection mapping a local port to the remote port used by the server process.
How the port forwarding looks from the local user (your PC) to the remote node of the cluster. The purple command has to be launched on the local computer, once the server is running on the remote host. Source: KU Leuven.
Each server process on a machine needs a unique port (p2 on previous figure) to avoid conflicts.
Ports are shared between users on GenomeDK, so you should only use the port corresponding to your user number, which you can see with
echo $UID
You will see all this in the next exercise
better safe than sorry
Launch web applications which have a token (a random code in the URL for the browser) or a password: in theory, anyone on your same node of the cluster could otherwise get into your server process and see your program and data!
If you DO NOT have the conda
package manager, you can quickly install it from the box below, otherwise move to the next slide!
Install conda
Run the installation script to install the software. There are some informative messages during the installation. You might need to say yes
a few times
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
chmod +x miniforge.sh
bash miniforge.sh -b
~/miniforge3/bin/conda init bash
source ~/.bashrc
When you are done, you need a little bit of configuration. Run the following commands to configure conda (run them only once, they are not needed again):
conda config --append channels conda-forge
conda config --append channels bioconda
conda config --set channel_priority strict
conda config --set auto_activate_base false
source ~/.bashrc
Finally, install a package to accelerate conda. This is optional but recommended:
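The original command is not shown here; one possibility (an assumption on my side) is the faster libmamba solver:

conda install -y -n base conda-libmamba-solver
conda config --set solver libmamba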
Now you are ready to move on!
Create a new conda
environment:
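For example (the environment name and package are an assumption; you need jupyterlab for the next steps):

conda create -y -n jupyterEnv jupyterlab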
Start a job with just a few resources. This will always keep running until the end since it is inside a tmux session!
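A sketch of such a job request (assuming Slurm's srun; replace the account with one of your projects):

srun --account=YOUR_PROJECT --mem=4g -c 1 --time=2:00:00 --pty bash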
Now activate your environment and run jupyterlab. The node name (remote host) and your user ID (for the port number) are given as variables to the web server simply using $(hostname)
and $UID:
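Something along these lines (the environment name matches the assumption above):

conda activate jupyterEnv
jupyter-lab --ip=$(hostname) --port=$UID --no-browser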
You will see some messages and should recognize an address containing your node name and your user ID. Below it is the URL you can use in your browser, which always starts with 127.0.0.1. It will look like this, but in your case it might have a longer URL with a random token (mine is instead password protected):
Write down the node name and user ID you got from jupyterlab.
We now need a tunnel from the local host (your PC)! Write in a new terminal (not logged into GenomeDK) a command like the following (matching the example figure above), substituting the correct values:
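For example (replace NODE_NAME, PORT_NUMBER and USERNAME with your own values):

ssh -L PORT_NUMBER:NODE_NAME:PORT_NUMBER USERNAME@login.genome.au.dk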
Your tunnel is now open. The web application can be accessed from your browser at the address given by the server process on GenomeDK, which is 127.0.0.1:PORT_NUMBER (put your correct port number).
Exercise finished
Tip
The same logic applies to all other web applications. They will have similar options to define the remote node and port.
Usually the host node option changes name between ip, host, address, or hostname.
Note also that you will always have to use the address 127.0.0.1:PORT_NUMBER in your browser. You might as well save it as a favorite address in your browser's bookmarks.
What is a container
Container = standalone and portable software package including the application, its dependencies, and the required system-level libraries.
Deployment of a container on different HPCs and PCs is reproducible. A virtual environment (conda, mamba, pixi, cargo, npm, …) depends instead on the host system and is not necessarily fully reproducible on two different machines.
A virtual environment isolates dependencies for a specific programming language within the same operating system. It does not include the operating system or system-level dependencies, so it depends on the hosting system.
Containers are usually thought of as packaging a specific application which can run anywhere (in other words, which is portable).
E.g. annoying bioinformatics software requiring specific libraries or long installations.
Many containers are built with Docker, but Docker is not installed on GenomeDK.
GenomeDK has Singularity, which can also run programs installed in Docker containers.
Very practical to pull and use containers in workflows.
Typical repositories with pre-built containers are:
Biocontainers: community-driven initiative to containerize bioinformatics software.
DockerHub registry: Public hub for Docker images, often including official containers from software developers
Quay: same philosophy as DockerHub.
Once you find a container on one of these websites, simply use (and adapt if needed) the provided code to pull it locally.
We use biocontainers to pull containers and recreate a little bulkRNA alignment. First create and attach to a new tmux
session called biocontainers
.
Make a folder called containers101
in the advancedGDK
directory.
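One way to do both steps (assuming advancedGDK is in your current directory):

tmux new -s biocontainers
mkdir -p advancedGDK/containers101
cd advancedGDK/containers101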
Find a command-line bioinformatics software you know on the Biocontainer Registry and open its page to see a detailed description - if you have no good ideas, look for the minimap2 aligner. You should see documentation illustrating how to run the software in various ways (see the next slide for a screenshot).
The page will suggest running the container immediately with singularity. I would suggest downloading it first (pull instead of run) and then running it. In the case of minimap2:
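The pull command, reconstructed from the default name mentioned below (the depot URL follows the same pattern as the samtools example later in these slides):

singularity pull minimap2.sif https://depot.galaxyproject.org/singularity/minimap2:2.28--h577a1d6_4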
which creates a container file in your current directory, and calls it minimap2.sif
(better name than the default minimap2:2.28--h577a1d6_4
).
You can run the container, which opens a CLI inside it where you can execute the program. A non-interactive way (useful for pipelines) is to write the command directly. For example:
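A non-interactive call through the container could look like this (a sketch):

singularity exec minimap2.sif minimap2 --version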
Some containers are very easy to run. Others need a lot of extra options, such as binding specific folders to a path, providing certificates, or initializing environment variables. A small example below.
The option --env can be used to define environment variables, either needed by your software or useful when writing the command to be executed.
singularity pull samtools.sif https://depot.galaxyproject.org/singularity/samtools:1.21--h96c455f_1
singularity run --env REF_PATH=$(pwd) \
    --env BAM_PREFIX=test01 \
    samtools.sif \
    samtools view \
    -h https://github.com/roryk/tiny-test-data/raw/refs/heads/master/wgs/mt.sorted.bam \
    -O bam > test01.bam
Oops, it does not work! Can you think of why? Hint: note that the address for the data download begins with https.
Why the previous slide didn’t work!!!
HTTPS is a protocol for secure communication over the internet. When you access a URL starting with https, your system checks the server's SSL/TLS certificate to ensure it is issued by a trusted Certificate Authority (CA). Why? To prevent man-in-the-middle (MITM) attacks, where an attacker eavesdrops on the data transfer.
The option --bind
can be used to make your folders available in a specific path for the container. Here, we bind two folders containing the certificates for GDK. Those paths are quite standard across UNIX systems.
singularity run --env REF_PATH=$(pwd) \
    --env BAM_PREFIX=test01 \
    --bind /etc/ssl/certs:/etc/ssl/certs \
    --bind /etc/pki/ca-trust:/etc/pki/ca-trust \
    samtools.sif \
    samtools view \
    -h https://github.com/roryk/tiny-test-data/raw/refs/heads/master/wgs/mt.sorted.bam \
    -O bam > test01.bam  # the redirection is expanded by your own shell, where BAM_PREFIX is not set, so the file name is given explicitly
Exercise finished
A few other options which can be used with singularity. When you need them really depends on the application:
--fakeroot
: Allows running the container with root privileges in a user namespace. Useful for containers that require root access without needing actual root privileges.
--writable
: Enables writing to the container image. This is useful for making changes to the container, but it requires the container to be writable. Use --writable-tmpfs to allow temporary changes that are not persisted in a non-writable container.
--contain
: Isolates the container from the host system by limiting access to the host’s filesystem.
--no-home
: Prevents the container from automatically binding the user’s home directory. Avoids exposing your home directory to the container.
--cleanenv
: Clears all environment variables except those explicitly set for the container.
--nv
: Enables NVIDIA GPU support by binding the necessary libraries and drivers. Equivalent for AMD GPUs is --rocm
(not the case on GDK).
--pwd
: Sets the working directory inside the container.
A classical bioinformatics pipeline has splits and merge steps. We will work with this:
flowchart LR
    A0{data.fq} --> A["Split(data.fq)"]
    A --> B["table(part000.fq)"]
    A --> C["table(part002.fq)"]
    A --> D["table(part....fq)"]
    A --> E["table(part009.fq)"]
    B --> F["merge(table_[000-009].tsv)"]
    C --> F
    D --> F
    E --> F
    F --> G{table.tsv}
where split and table (generate base counts and statistics in a table) are done with seqkit.
There are many popular workflow tools (Snakemake, Nextflow), which require learning their workflow language. We use gwf instead:
Once you understand workflows with gwf
, it is a matter of learning how to use other workflow languages, if you wish.
The whole workflow is contained in a Workflow
object. This is initiated usually like this:
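A minimal sketch of this initialization:

from gwf import Workflow, AnonymousTarget

gwf = Workflow()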
Now you can populate your Workflow
object with templates
, which are simple python functions. You can create executors for your templates (environments and containers, which will run the programs you need)!
The template is a blueprint for a specific block of the workflow.
For example, the 10 parallel table blocks can be described by one reusable template:
# create a software executor from the conda environment `seqkitEnv`
from gwf.executors import Conda

conda_env = Conda("seqkitEnv")

def table(infile):  # name of the template
    inputs = [infile]  # input file(s)
    outputs = [f"{infile}.fx2tab.tsv"]  # name of the output(s)
    options = {"cores": 1, "memory": "1g", "walltime": "02:00:00"}  # resources to use
    # the bash commands to run when using the template
    spec = f"""
    seqkit fx2tab -n -l -C A -C C -C G -C T -C N -g -G -s {infile} > {outputs[0]}
    """
    # here you provide all specifications defined in the template.
    # This is usually always as below,
    # apart from the executor, which you decide yourself
    return AnonymousTarget(inputs=inputs, outputs=outputs, options=options, spec=spec, executor=conda_env)
The split
and table
templates are then applied to the relevant files:
parts = 10  # how many parts to split into

# Apply the split template to data.fq to split it in 10 parts. Call this step of the workflow "split".
# T1 collects info of the workflow step (e.g. output names)
T1 = gwf.target_from_template("split", split(infile="data.fq", parts=parts))

# Tabulate each part of the splitting and save the names of the tables in
# seqkit_output_files. We apply the template table to each input file of the
# for loop. The steps of the workflow are called table_0, table_1, ..., table_9
seqkit_output_files = []
for i, infile in enumerate(T1.outputs):
    T2 = gwf.target_from_template(f"table_{i}", table(infile=infile))
    seqkit_output_files.append(T2.outputs[0])
We will try out exactly those templates in the following exercise.
Prepare everything for the exercise: create a new folder, get data and workflow file
Create a conda environment for seqkit
and one for the gwf
workflow software. Download the seqkit
container as well.
conda config --add channels gwforg
conda create -y -n pipelineEnv gwf=2.1.1
#package for resource usage/check
conda install -y -n pipelineEnv -c micknudsen gwf-utilization
#env for seqkit
conda create -y -n seqkitEnv seqkit
#Container
singularity pull seqkit_2.10.0 https://depot.galaxyproject.org/singularity/seqkit:2.10.0--h9ee0642_0
Now look at the status of your workflow. You should recognize all the steps (targets). They are marked shouldrun, because the outputs and/or inputs do not exist yet. Remember to activate the environment for gwf.
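For example:

conda activate pipelineEnv
gwf -f workflow.py status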
Tip
You do not need -f workflow.py
if your workflow file has the name workflow.py
, which is the default for gwf
.
Now, you might also want to look at what a specific target looks like once the workflow is built:
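gwf info split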
You will be able to see the actual inputs, outputs, and the other targets it depends on or that depend on it:
{
    "split": {
        "options": {
            "cores": 1,
            "memory": "4g",
            "walltime": "05:00:00"
        },
        "inputs": [
            "data.fq"
        ],
        "outputs": [
            "gwf_splitted/part001.fq",
            "gwf_splitted/part002.fq",
            "gwf_splitted/part003.fq",
            "gwf_splitted/part004.fq",
            "gwf_splitted/part005.fq",
            "gwf_splitted/part006.fq",
            "gwf_splitted/part007.fq",
            "gwf_splitted/part008.fq",
            "gwf_splitted/part009.fq",
            "gwf_splitted/part010.fq"
        ],
        "spec": "\n seqkit split2 -O gwf_splitted --by-part 10 --by-part-prefix part data.fq\n ",
        "dependencies": [],
        "dependents": [
            "table_6",
            "table_8",
            "table_3",
            "table_0",
            "table_1",
            "table_4",
            "table_5",
            "table_9",
            "table_7",
            "table_2"
        ]
    }
}
Now, you can run specific targets. Let’s specify some names to test out our workflow.
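For example, the two targets discussed below (a sketch):

gwf run split table_0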
Tip
You can run the entire workflow with gwf run
when you are sure of your targets working correctly with the right resources.
Check the status: the two targets will be submitted; split has to run first, and table_0, which depends on it, will run when the file part001.fq is generated! We use watch in front of the command to update its view every two seconds.
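watch gwf status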
At some point, you will see the running
status (for a few seconds) and completed
status.
Exercise break
How many resources did split
and table_0
use? Run the utilization command:
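gwf utilization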
The table shows we underutilized the resources. Now open workflow.py
and change your resource usage for the split
and table_
steps. Then, run the target:
Check again resource usage when the status is completed
. Did it get better?
Exercise break
Now, you will change the executor for the template table
. Your task is to:
open the workflow.py
file
below the line importing the Conda executor (line 2 of the file), add a new line with
from gwf.executors import Singularity
Now, define a new executor. Below the line where you define conda_env = Conda("seqkitEnv")
, use a similar syntax for Singularity
, where you provide the container file as argument.
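A sketch of what this could look like (the variable name is arbitrary; the image file name matches the earlier pull command):

from gwf.executors import Singularity   # the line added in the previous step

singularity_env = Singularity("seqkit_2.10.0")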
At the end of the table template, use the new executor instead of conda_env.
Did you do it right? If yes, then you should be able to run the combine
target:
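gwf run combine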
and see its status become completed after some time; all output files should be created in your folder! If not, something is wrong. Ask for help, or look at the solution file, if you prefer.
Note
Because combine
depends on all table_
targets, it will submit all those targets as well, which need to run first.
Exercise break
Ok, now we want to extend the workflow and do quality control on the part_###.fq
files.
flowchart LR
    A0{data.fq} --> A["Split(data.fq)"]
    A --> B["table(part000.fq)"]
    A --> C["table(part002.fq)"]
    A --> D["table(part....fq)"]
    A --> E["table(part009.fq)"]
    B --> F["merge(table_[000-009].tsv)"]
    C --> F
    D --> F
    E --> F
    F --> G{table.tsv}
    B --> H["qc(part[000-009].fq)"]
    C --> H
    D --> H
    E --> H
    H --> I{"multiqc_report.html"}
You need to:
- create a conda environment qcEnv where you install the two packages fastqc and multiqc
- create an executor qc_env based on Conda("qcEnv")
- create a new template def qc(data_folder):
  - it will need all ten gwf_splitted/part_###.fq files as input files (you can copy the output file list of the split template, where you use a variable {data_folder} instead of the explicit folder name!)
  - as output you want a file called reports/multiqc_report.html (the default name for the generated report)
  - as bash commands you need to run fastqc and then multiqc on the split files (see the sketch after this list)
  - remember to set the correct executor at the end of the template
- now you need to create one single target from the template, call it qc. You only need to give as input the name of the folder with the fq files.
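A possible sketch of those bash commands (an assumption; {data_folder} is the template variable, and multiqc writes multiqc_report.html by default):

mkdir -p reports
fastqc -o reports {data_folder}/part*.fq
multiqc -o reports reports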
Tip
Use gwf info qc to check dependencies. When you are sure you are done, then use gwf run qc. Its status should be completed if it runs successfully.
Ask for help, or look at the solution file, if you prefer.
Exercise break
Note
Good practices:
- check resource usage with gwf utilization (needs a plugin, see the exercises)
- use gwf info | less -S to check dependencies

Please fill out this form :)
A lot of things we could not cover
use the official documentation!
ask for help, use the drop-in hours (ABC cafe)
try out stuff and google yourself out of small problems
Slides updated over time, use as a reference
Next workshop all about pipelines