An introduction to the GDK system and basic commands https://hds-sandbox.github.io/GDKworkshops
Health Data Science sandbox, BiRC
GenomeDK, Health
2025-04-01
These slides are both a presentation and a small reference manual
We will try out some commands during the workshop
Official reference documentation: genome.au.dk
Practical help:
Samuele (BiRC, MBG) - samuele@birc.au.dk
Drop in hours:
General mail for assistance
support@genome.au.dk
10:00-11:00: What is GenomeDK, File System, virtual environments
11:00-12:00: Exercise: access interface, new environment, transfer data, interactive job
12:45-13:15: queueing system and jobs, estimate resource usage
13:15-14:00: Send out your first job with SLURM, estimate resource usage
Webpage: https://hds-sandbox.github.io/GDKworkshops/
Slides will always be up to date on this webpage
The basic software
Customizable
Learn your way around the basics of the GenomeDK cluster.
GenomeDK is a computing cluster, i.e. a set of interconnected computers (nodes). GenomeDK has:
Creating an account happens through this form at genome.au.dk
Logging into GenomeDK happens through the command ssh USERNAME@login.genome.au.dk
When first logged in, set up two-factor authentication by
showing a QR code with the command
scanning it with your phone’s Authenticator app.
It is nice to avoid writing the password at every access. If you are on the cluster, exit from it to go back to your local computer
Now we set up a public-key authentication. We generate a key pair (public and private):
Always press Enter and do not insert any password when asked.
and create a folder on the cluster called .ssh to contain the public key
and finally send the public key to the cluster, into the file authorized_keys
After this, the public key stored on GenomeDK will be matched against your local private key every time you log in, without you needing to type a password.
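The key setup described above can be sketched as follows. This is a minimal sketch, not necessarily the workshop's exact commands: the key type ed25519, the scratch directory and the file name id_ed25519 are assumptions, and USERNAME is a placeholder. The key pair is generated in a scratch directory so the example cannot overwrite existing keys, and the two commands that contact the cluster are shown as comments because they need a GenomeDK account.

```bash
# Generate a key pair with an empty passphrase (-N ""), i.e. "no password when asked".
# A scratch directory is used here (assumption) to avoid touching ~/.ssh.
keydir="$(mktemp -d)"
if command -v ssh-keygen >/dev/null 2>&1; then
    ssh-keygen -q -t ed25519 -N "" -f "$keydir/id_ed25519"
    ls "$keydir"    # id_ed25519 (private key) and id_ed25519.pub (public key)
fi

# With a real account, create the .ssh folder on the cluster and append the public key:
#   ssh USERNAME@login.genome.au.dk 'mkdir -p ~/.ssh'
#   cat "$keydir/id_ed25519.pub" | ssh USERNAME@login.genome.au.dk 'cat >> ~/.ssh/authorized_keys'
```

For real use, drop the -f option so the key lands in the default location inside ~/.ssh, which ssh checks automatically at login.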
Folders and files follow a tree-like structure
/ is the root folder of the filesystem - nothing is above that
home and faststorage are two of the folders in the root
projects are in /faststorage/project and linked to your home
Log in: ssh USERNAME@login.genome.au.dk
Note
Run a command = Type a command + Enter
Run pwd. You should see your home folder: /home/USERNAME
/home/USERNAME is an example of a path.
pwd shows your current folder (WD, Working Directory)
Run ls . to show the content of your WD (the dot .)
Run mkdir -p GDKintro to create a GDKintro folder
Run echo "hello" > ./GDKintro/file.txt to write hello in a file
Use ls ./GDKintro to see if the text file is there.
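Run in sequence, the steps of this exercise look as follows. As a precaution that is not part of the exercise, the sketch works inside a scratch directory created with mktemp, so it can be tried without touching existing files; on GenomeDK you would simply run the commands from your home folder.

```bash
# Work in a scratch directory (assumption, to keep the example free of side effects)
cd "$(mktemp -d)"

pwd                                   # print the working directory (WD)
mkdir -p GDKintro                     # create the folder; -p gives no error if it already exists
echo "hello" > ./GDKintro/file.txt    # write hello into a new file
ls ./GDKintro                         # list the folder content: file.txt
cat ./GDKintro/file.txt               # print the file content: hello
```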
Relative and absolute paths
/home/USERNAME starts from the root /. It is an absolute path.
./GDKintro starts from the WD. It is a relative path.
Look at the file system tree and answer the following questions:
After logging in, you will find yourself in your private home folder, denoted by ~ or equivalently /home/username. Your prompt follows the format [username@node current_folder].
Warning
We now set the WD into GDKintro and remove all text files in it. Then we download a zipped fastq file, unzip it, and print a preview!
rm *.txt removes all files ending with .txt. The symbol * is a wildcard for the file name
Forever away
There is no trash bin - removed files are lost forever, with no exceptions
head prints the first lines of a text file
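The steps described above (remove text files with a wildcard, unzip, preview) can be tried safely on a toy example. The sketch below is an illustration under assumptions: it uses a scratch directory and a tiny hand-made file called toy.fastq.gz rather than the workshop dataset, whose download step is not reproduced here.

```bash
cd "$(mktemp -d)"     # scratch directory, so nothing real is deleted

# Create two text files and a tiny gzipped FASTQ file (hypothetical data)
echo "a" > notes.txt
echo "b" > todo.txt
printf '@read1\nACGTN\n+\nIIIII\n@read2\nTTGCA\n+\nHHHHH\n' | gzip -c > toy.fastq.gz

rm *.txt              # the wildcard * matches notes.txt and todo.txt
gunzip toy.fastq.gz   # replaces toy.fastq.gz with toy.fastq
head -n 4 toy.fastq   # preview the first record (4 lines)
ls                    # only toy.fastq is left
```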
Useful utility 1: less file reader
less is perfect for exploring (big) text files: you can scroll with the arrows and quit by pressing q. Try it on data.fastq.
The very first sequence you see should be
@HISEQ_HU01:89:H7YRLADXX:1:1101:1116:2123 1:N:0:ATCACG
TCTGTGTAAATTACCCAGCCTCACGTATTCCTTTAGAGCAATGCAAAACAGACTAGACAAAAGGCTTTTAAAAGTCTAATCTGAGATTCCTGACCAAATGT
+
CCCFFFFFHHHHHJJJJJJJJJJJJHIJJJJJJJJJIJJJJJJJJJJJJJJJJJJJHIJGHJIJJIJJJJJHHHHHHHFFFFFFFEDDEEEEDDDDDDDDD
Challenge yourself
Search online (or with less --help) how to look for a specific word in a file with less. Then view the data with less, and try to find out whether there is any sequence of ten adjacent Ns (that is, ten missing nucleotides). Then, answer the question below
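As a non-interactive cross-check of what you find with less, grep can count the lines containing a run of ten N characters. The sketch below uses a hypothetical toy file, not the workshop data; on the real file you would replace toy.fastq with data.fastq.

```bash
cd "$(mktemp -d)"
# Toy FASTQ: the first read has ten adjacent Ns, the second only three (hypothetical data)
printf '@r1\nACGTNNNNNNNNNNACGT\n+\nIIIIIIIIIIIIIIIIII\n@r2\nACGTNNNACGT\n+\nIIIIIIIIIII\n' > toy.fastq

# -c counts matching lines; -E enables the {10} repetition syntax
n_lines=$(grep -c -E 'N{10}' toy.fastq)
echo "$n_lines"       # number of lines with at least ten adjacent Ns

# -n prints the line number in front of each matching line
grep -n -E 'N{10}' toy.fastq
```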
Useful utility 2: nano text editor
nano opens, edits and saves text files. Very useful for changes on the fly.
Try nano data.fastq. Change a base in the first sequence,
then press Ctrl+O to save (give it a new file name newData.fastq and press Enter)
press Ctrl+X to exit. If you use ls you can see the newly saved file.
No preinstalled software on GenomeDK
You install and manage your software and its dependencies inside virtual environments
Each project needs specific software versions, dependent on each other, for reproducibility - without interfering with other projects.
Definition
A virtual environment keeps project-specific software and its dependencies separated
A package manager is software that can retrieve, download, install and upgrade packages easily and reliably
How virtual envs work: packages at different versions are kept separated into folders, together with all system files needed to make them work.
Conda is both a virtual environment manager and a package manager.
Pixi is a newer virtual environment and package manager.
A package manager puts together the dependency trees of requested packages to find compatible versions of all dependencies.
Figure: A package’s dependency tree with required versions on the edges
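To make the idea concrete, here is a toy version of the problem a package manager solves: two packages constrain the version of a shared dependency, and the newest version satisfying both is picked. Everything in it (package names, version list, constraints) is hypothetical, and real solvers such as those behind conda and pixi handle whole dependency trees rather than a single range. The sketch also assumes GNU sort, which provides the -V version ordering.

```bash
# Hypothetical versions of a shared dependency, listed oldest to newest
available="1.0 1.2 1.5 2.0 2.3"

# version_ge A B: succeeds if version A >= version B (relies on GNU sort -V)
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical constraints: pkgA needs the dependency >= 1.2, pkgB needs it < 2.0
chosen=""
for v in $available; do
    if version_ge "$v" "1.2" && ! version_ge "$v" "2.0"; then
        chosen="$v"     # later (newer) matches overwrite earlier ones
    fi
done
echo "chosen version: $chosen"    # prints: chosen version: 1.5
```

If no version satisfied both constraints, chosen would stay empty, which corresponds to the "cannot solve environment" errors a package manager reports for incompatible requests.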
To install a specific package in your environment, search for it on anaconda.org:
Figure: search DESeq2 for R
Channels
Packages are archived in channels. conda-forge and bioconda include most of the packages for bioinformatics and data science.
conda-forge packages are often the most up-to-date.
First of all, we open the desktop interface to GenomeDK at desktop.genome.au.dk. Choose the Front end for the login.
The desktop session will be operative even if you close and reopen your browser afterwards!
The terminal will work as if you logged into the front-end (the desktop is already logged into the front-end node). You can also use the browser!
clipboard into the browser
If you copy text locally and want to paste it in the GDK desktop, you need to transfer it to the clipboard.
Click on SHOW CLIPBOARD and paste your text. Now it is available in the desktop interface!
Open the terminal and run the command below to install pixi:
After that, make the system recognize pixi
Change your WD to the one we created earlier, where we have the file data.fastq
Initiate a new pixi environment in the folder:
Use the file browser and open the GDKintro folder
You can see some new files. pixi.toml contains the information pixi will use to create your environment.
Open pixi.toml with the text editor, and make sure you have the two channels conda-forge and bioconda. If not, modify the file so the channel list is like below.
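A channel list containing both channels looks like the line below. This is a sketch of a single entry, assuming the rest of the generated pixi.toml is left untouched; the line sits in the table at the top of the file that also holds the project name.

```toml
channels = ["conda-forge", "bioconda"]
```

The order matters when a package exists in both channels: the channel listed first is searched with higher priority.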
Now get back to the terminal and install some packages. This is done easily.
The terminal will look like this at the end
Now open the pixi.toml file. You should see all the installed packages with related information.
Exercise Cont’d
Be sure your WD is the folder GDKintro. Then run
Open the file environment.yml. It looks very similar to pixi.toml and can be used by conda to recreate your environment.
Let’s zip those files into one:
Data can be downloaded/uploaded in two ways:
from the command line of a local computer
using an interactive interface (Filezilla)
How do we download the environment files to our computer? Open a terminal on your computer and run this command:
scp needs your login and the absolute path to the file. We also give the download destination as the WD on the local computer (.)
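The anatomy of such an scp command is sketched below. The command is only assembled and printed, not executed, since running it needs a GenomeDK account. USERNAME is a placeholder, and the remote path assumes the zip file was created in the GDKintro folder in your home.

```bash
user="USERNAME"                                        # placeholder for your login
remote_file="/home/${user}/GDKintro/environment.zip"   # absolute path on the cluster (assumed location)
destination="."                                        # the WD on the local computer

# General shape: scp <login>@<host>:<absolute path of remote file> <local destination>
scp_command="scp ${user}@login.genome.au.dk:${remote_file} ${destination}"
echo "$scp_command"
```

Copy the printed line into a terminal on your local computer, with your own username, to start the download.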
You can transfer data with interactive software, such as Filezilla, which has an easy interface. Download Filezilla.
When done, open Filezilla and use the following information on the login bar:
Host: login.genome.au.dk
Username and password: your GenomeDK username and password
Port: 22
Press Quick Connect. As a result, you will establish a secure connection to GenomeDK. On the left-side browser you can see your local folders and files. On the right side, the folders and files on GenomeDK starting from your home.
Download the environment.zip file. You need to right-click on it and choose Download
You can do exactly the same to upload files from your local computer!
What is a project
Projects are contained in /faststorage/project/
and linked in your home, and are simple folders with some perks:
Common sense in project creation
Do not request separate projects bulkRNA_mouse, bulkRNA_human, bulkRNA_apes with the same invited users
Request one project bulkRNA_studies with subfolders bulkRNA_mouse, bulkRNA_human, bulkRNA_apes.
Request a project (after login on GDK) with the command
After GDK approval, a project folder with the desired name appears in ~ and /faststorage/project. You should be able to set the WD into that folder:
or
Only the creator (owner) can see the project folder. You (and only you) can add a user
or remove them
Users can also be promoted to have administrative rights in the project
or demoted from those rights
You can see the overall monthly resource usage of your projects with
Example output:
More detailed usage: by users on a selected project
You can see how many resources your projects are using with
Example output:
project        user     period  billing hours  storage (TB)  backup (TB)  storage files  backup files
ngssummer2024  sarasj   2024-7  77.98          0.02          0.00         528            0
ngssummer2024  sarasj   2024-8  0.00           0.02          0.00         528            0
ngssummer2024  savvasc  2024-7  223.21         0.02          0.00         564            0
ngssummer2024  savvasc  2024-8  0.00           0.02          0.00         564            0
ngssummer2024  simonnn  2024-7  173.29         0.01          0.00         579            0
ngssummer2024  simonnn  2024-8  0.00           0.01          0.00         579            0
Accounting Tips
Use grep to isolate specific users and/or months:
Example:
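As an illustration, the rows of the example output shown earlier can be filtered as in the sketch below. The rows are saved to a scratch file so the example runs anywhere; on GenomeDK you would instead pipe the output of the usage command straight into grep. The final awk step is an optional extra, not something the workshop text prescribes.

```bash
# Save the example accounting rows into a scratch file
usage_file="$(mktemp)"
cat > "$usage_file" <<'EOF'
ngssummer2024 sarasj 2024-7 77.98 0.02 0.00 528 0
ngssummer2024 sarasj 2024-8 0.00 0.02 0.00 528 0
ngssummer2024 savvasc 2024-7 223.21 0.02 0.00 564 0
ngssummer2024 savvasc 2024-8 0.00 0.02 0.00 564 0
ngssummer2024 simonnn 2024-7 173.29 0.01 0.00 579 0
ngssummer2024 simonnn 2024-8 0.00 0.01 0.00 579 0
EOF

# All rows of one user
grep "savvasc" "$usage_file"

# Rows of one user in one month (two filters chained with a pipe)
user_month=$(grep "savvasc" "$usage_file" | grep "2024-7")
echo "$user_month"

# Total billing hours (4th column) over all users in 2024-7
total=$(awk '$3 == "2024-7" { sum += $4 } END { printf "%.2f", sum }' "$usage_file")
echo "$total"     # 474.48
```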
Private files or folders
Have a coherent folder structure - your future self will thank you.
Example of structure, which backs up raw data and analysis
If your project has many users, a good structure can be
MUST-KNOWs for a GDK project
Backup cost >>> Storage cost >> Computation cost
Running programs on a computing cluster happens through jobs.
Learn how to get hold of computing resources to run your programs.
A job is a computational task executed on requested HPC resources (computing nodes), which are handled by the queueing system (SLURM).
The command gnodes will tell you if there is heavy usage across the computing nodes
Usage of computing nodes. Each node has a name (e.g. cn-1001). The symbols for each node mean running a program (0), assigned to a user (_) and available (.)
If you want to venture more into checking the queueing status, Moi has done a great interactive script in R Shiny for that.
Front-end nodes are limited in memory and power, and should only be used for basic operations such as
starting a new project
small folders and files management
small software installations
data transfer
and in general you should not use them to run computations, because that might slow down all other users on the front-end.
Useful to run a non-repetitive task interactively
Examples:
splitting by chromosome that one bam file you just got
open Rstudio and Jupyterlab
compress/decompress multiple files, maybe in parallel
Once you exit from the job, anything running in it will stop.
You can also run an interactive job on GenomeDK desktop. Go back to it and use the terminal to go into the GDKintro folder:
Now run an interactive job. Use 8g of RAM, 2 cores, and a time limit of 01:00:00 (one hour). Choose the account using the name of one of your projects.
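A request matching these numbers could look like the sketch below. It is assembled and printed rather than executed, because srun only exists on the cluster. The flags are standard SLURM options, but whether the workshop used exactly these is an assumption, and my_project is a placeholder for one of your project names.

```bash
account="my_project"      # placeholder: one of your GenomeDK projects
memory="8g"               # RAM for the job
cores=2                   # number of CPU cores
walltime="01:00:00"       # hours:minutes:seconds

# --pty bash opens an interactive shell on the assigned compute node
srun_command="srun --account=${account} --mem=${memory} --cpus-per-task=${cores} --time=${walltime} --pty bash"
echo "$srun_command"
```

Typing exit inside the interactive shell ends the job and releases the resources.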
You will have to wait in the queue. When you get the resources, the node in use is shown in the prompt. Below, for example, the node is s21n32.
[USERNAME@s21n32 ~]$
Now, run rstudio or jupyterlab (your choice!) from the pixi environment:
The packages available in Rstudio and Jupyterlab are the ones installed in your environment. More on this will be in our Advanced GenomeDK workshop.
Useful to run a program non-interactively, usually for a longer time and without interaction from the user. A batch script contains the requested resources and the commands to execute.
Create in Rstudio or Jupyterlab a file called align.sh (in the folder GDKintro) like below:
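A batch script of this kind could look like the sketch below, which writes align.sh with a here-document (an alternative to typing it in an editor) and checks its syntax without submitting it. The account name and core count are chosen to match the job information shown further down (my_project, 8 cores); the memory and time limits are arbitrary assumptions, and the alignment command itself is left as a placeholder comment because it depends on the packages and reference genome you install.

```bash
cd "$(mktemp -d)"     # scratch directory; on GenomeDK use the GDKintro folder instead

cat > align.sh <<'EOF'
#!/bin/bash
#SBATCH --account=my_project
#SBATCH --cpus-per-task=8
#SBATCH --mem=16g
#SBATCH --time=02:00:00

# Everything below runs on the compute node assigned by SLURM
echo "Job started on $(hostname)"
echo "Cores available: ${SLURM_CPUS_PER_TASK:-unknown}"

# Placeholder: put the alignment command here
EOF

bash -n align.sh && echo "syntax OK"    # parse the script without running it
grep -c '^#SBATCH' align.sh             # number of resource requests: 4
```

The #SBATCH lines are comments for bash but are read by SLURM as resource requests, which is why they must come before the first command of the script.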
In the terminal, you need to install two new packages
and download a reference genome
Send the script to the queueing system using the terminal:
Interrogate SLURM about the specific job with the provided number. For example
>Name : align.sh
>User : samuele
>Account : my_project
>Partition : short
>Nodes : s21n43
>Cores : 8
>GPUs : 0
>State : RUNNING
>...
or about all the queued jobs
>JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>33735928 short align.sh samuele R 1:12 1 s21n43
If you change your mind and want to cancel a job:
Try to run a job with a smaller dataset as a test. Or run one of many jobs of the same type. While the job is running
use squeue --me and look at the node id
log into that node from the front-end terminal:
use htop -u <username> to see what is running and how much memory and CPU it uses
Please fill out this form :)
A lot of things we could not cover
use the official documentation!
ask for help, use drop in hours
try out stuff and google yourself out of small problems
Slides updated over time, use as a reference
Future workshops about advanced usage and pipelines