Use GenomeDK

An introduction to the GDK system and basic commands https://hds-sandbox.github.io/GDKworkshops

Samuele Soraggi

Health Data Science sandbox, BiRC

Dan Søndergaard

GenomeDK, Health

2024-09-18

GenomeDK’s ABC

Learn your way around the basics of the GenomeDK cluster.

Infrastructure

GenomeDK is a computing cluster, i.e. a set of interconnected computers (nodes). GenomeDK has:

  • computing nodes used for running programs (~15000 cores)
  • storage nodes storing data in many hard drives (~23 PiB)
  • a network making nodes talk to each other
  • a frontend node from which you can send your programs to a node to be executed
  • a queueing system, called slurm, which prioritizes and schedules the users’ programs

Access

  • Creating an account happens through this form at genome.au.dk

  • Logging into GenomeDK happens through the command

    [local]$  ssh USERNAME@login.genome.au.dk
  • When first logged in, set up the 2-factor authentication by

    • showing a QR-code with the command

      [fe-open-01]$  gdk-auth-show-qr
    • scanning it with your phone’s Authenticator app.

Access without password

It is convenient to avoid typing the password at every login. If you are on the cluster, exit from it to go back to your local computer

exit

 

Now we generate an SSH key pair (using the ed25519 algorithm).

ssh-keygen -t ed25519

Press Enter at every prompt and leave the passphrase empty when asked.
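
You can check that the key pair has been created; with the default settings you should see (at least) these two files:

ls ~/.ssh/
> id_ed25519  id_ed25519.pub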

Wait a second! What is an SSH key pair?

  1. it’s a pair of long codes, called the public and the private key
  2. the two keys are mathematically related
  3. the user holds only the private key
  4. the cluster holds only the public key
  5. the user asks to be authenticated to GenomeDK (AuthReq)
  6. the user signs the request with the private key (SignReq)
  7. GenomeDK verifies the request through a match with the public key (VerifyReq)
  8. if the verification succeeds, access is granted

+------------+         +--------+
|    User    |         |GenomeDK|
|            |         |        |
|  PrivKey   | ------> | PubKey |
|(id_ed25519)| AuthReq |(auth)  |
+------------+         +--------+
      |                     ^
      v                     |
   SignReq               VerifyReq
      |                     |
      v                     v
   Access Granted if Verified

We create a folder on the cluster called .ssh to hold the public key we just created

ssh <USERNAME>@login.genome.au.dk mkdir -p .ssh

 

and finally append the public key to the file authorized_keys on the cluster

cat ~/.ssh/id_ed25519.pub | ssh username@login.genome.au.dk 'cat >> .ssh/authorized_keys'

 

This is the last time you will be asked for your password when logging in from your computer.
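
Optionally, you can also create an alias for the connection, so that a short name is enough to log in. This is a minimal sketch appended to the local file ~/.ssh/config; the alias name gdk is an arbitrary choice:

cat >> ~/.ssh/config <<'EOF'
Host gdk
    HostName login.genome.au.dk
    User USERNAME
EOF

Afterwards, ssh gdk logs you in directly.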

File System (FS) on GenomeDK

Directory structure and how to navigate it

How is the FS organized

Folders and files follow a tree-like structure

  • / is the root of the filesystem - nothing is above that
  • home and faststorage are two root folders
  • projects are in faststorage and linked to your home
  • you can reach files and folders with a path
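
For example, assuming you are a member of a project called my_project (a hypothetical name), you can reveal where the link in your home actually points:

readlink -f ~/my_project

which prints the absolute path of the project folder under faststorage.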

Home directory ~

 

Log into the cluster

[local] ssh username@login.genome.au.dk

 

Tip

Use the up arrow key in the terminal to go back through the commands you used previously, and press Enter when you find the login command

Every time you log in, you will find yourself in your private home folder. This is denoted by ~ or, equivalently, /home/username/. Your prompt will show something like this:

 

[samuele@fe-open-02 ~]

 

which follows the format

 

[username@node current_folder]

The folder in which you are located is called the working directory (WD). Use the following command to see its path starting from the root:

 

pwd

 

Every command you execute refers to your WD. Execute

 

ls

 

and you will see the list of files in your WD.

Try to create an empty file now with

 

touch emptyFile.txt

 

and create a folder, which will be inside your WD:

 

mkdir myFolder

 

If you run the ls command again, the new file and folder will show up in your WD.

How do you see the directory tree of the WD? Try

tree -L 2 .

which shows you the tree with only 2 sublevels of depth.

 

Note

. denotes the WD, and is the default when you do not specify it. Retry the command above using .. (one directory above in the file system) instead of .

We want to get a file from the internet to the folder myFolder. We can use wget:

 

wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz \
     -O ./myFolder/data.fastq.gz

 

Note

-O is the option to give a path and name for the downloaded file.
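
To check that the download worked, list the folder and peek at the first read of the compressed file (a fastq record spans four lines):

ls -lh myFolder/
zcat myFolder/data.fastq.gz | head -4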

Most commands have built-in help describing their syntax and options. For example, you can use wget --help or man wget.

Absolute and relative path

The path to a file/folder can be:

  • absolute: starts from the root
  • relative: starts from your WD

 

To look inside myFolder, we can write both


ls myFolder/

and


ls ~/myFolder/

 

Note

We have used ~, which is the short form of /home/username.

Changing the WD can be useful, for example to avoid writing long relative paths.

To set the WD inside myFolder use


cd myFolder

and verify with pwd the new working directory path.
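
A few cd shortcuts are worth remembering:

cd ..   # go one level up in the file system
cd -    # go back to the previous WD
cd      # go to your home folder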

Working with Files

Moving, Downloading, Manipulating files on GenomeDK

Home on the cluster

See the content of the current folder with

ls

or to see more details

ls -lh

 

Warning

Do not fill up your home folder with data: it has a limited storage quota (100 GB).

Project management

 

It is easy to end up creating files everywhere in your project folders: data, analysis files, results and the like.

 

Organizing your folders rationally is the best way to find your way around, especially when getting back to your analysis after a long time.

Creation

You need a project from which you can run your programs. Request a project with the command

gdk-project-request -g <project_name>

This creates a folder with the desired name. You should be able to go into that folder:

cd <project_name>

 

You can see how many resources your projects are using with

gdk-project-usage

Users management

Only the creator (owner) can see the project folder. You can add a user

gdk-project-add-user -g <project name> -u <username>

or remove it

gdk-project-remove-user -g <project name> -u <username>

 

More about user management in the documentation

Folders management

It is important to

  • have a coherent folder structure
  • backup only important things (raw data, analysis scripts)

 

Remember: Storage cost >> Computation cost

Here is an example of a structure which backs up raw data and analysis scripts.

You can do it with a script:

wget https://raw.githubusercontent.com/hds-sandbox/GDKworkshops/5bbfc11e3796d5f4f1af39aecd6858721aca1612/Scripts/populate_project.sh
bash populate_project.sh

If your project has many users, a good structure can be

 

mkdir -p Backup/Data Workspaces/username1 Workspaces/username2
ln -s Backup/Data/ Data
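
These commands produce a layout like the following (as shown by tree):

.
├── Backup
│   └── Data
├── Data -> Backup/Data/
└── Workspaces
    ├── username1
    └── username2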

Each user can then go into their folder inside the project and run the script to populate it

 

cd Workspaces/username1
wget https://raw.githubusercontent.com/hds-sandbox/GDKworkshops/be7315365c152ecd75e94c1f56e1578062c2c096/Scripts/populate_project.sh
chmod +x populate_project.sh
./populate_project.sh

Downloads and Copies

 

In your daily life on a cluster, you will need to download files and exchange them with online archives and your local PC.

Warning

Downloads should always happen on the front-end nodes, never on a compute node, when working on GenomeDK

Download with wget

wget is a utility for command-line downloads. It is already installed on GenomeDK and works with the http, https and ftp protocols.

 

Example:

wget -O ./output.png \
     -c \
     -b \
     https://example.com/image.png

downloads a png file and saves it as output.png (-O), resumes from where it stopped if the download was interrupted earlier (-c), and runs in the background (-b).

wget has many options, but what is shown in the example above covers most use cases. You can list them all with

wget --help

 

You may also find this cheatsheet useful for remembering how to do most things with wget. At this page you can also find some concrete wget examples.

SCP transfer

SCP (Secure Copy Protocol) can transfer files securely

  • between a LOCAL and a REMOTE host
  • between TWO REMOTE hosts

 

You can use it to transfer files between your PC and GenomeDK, but also between GenomeDK and another computing cluster (for example, downloading data from a collaborator that resides on a different remote computing system).

To copy a file to GenomeDK from your local computer:

scp /home/my_laptop/Documents/file.txt \
    username@login.genome.au.dk:/home/username/my_project/

 

The inverse operation just changes the order of the sender and receiver:

scp username@login.genome.au.dk:/home/username/my_project/file.txt \
    /home/my_laptop/Documents/

If you want to copy an entire folder, use the option -r (recursive copy). The previous examples become

scp -r /home/my_laptop/Documents/folder \
       username@login.genome.au.dk:/home/username/my_project/

 

and

scp -r username@login.genome.au.dk:/home/username/my_project/folder \
       /home/my_laptop/Documents/

 

A few more options are available and you can see them with the command scp --help.

Rsync transfer

Unlike scp, rsync can synchronize files and folders between two locations: it copies only the changes in the data, not all of it every time.

 

Copying a file or a folder between your computer and GenomeDK works exactly as with scp. For example

rsync --progress -r \
      username@login.genome.au.dk:/home/username/my_project/folder \
      /home/my_laptop/Documents/

where we add the --progress option to show transfer progress.

An interrupted synchronization can be resumed. To allow resuming a partial transfer, the previous command needs the additional option --partial (which keeps partially transferred files instead of deleting them):

 

rsync --partial --progress -r \
      username@login.genome.au.dk:/home/username/my_project/folder \
      /home/my_laptop/Documents/

 

After an interruption, just rerun the exact same command to resume the synchronization.

If you have large files, the option -z (compression) reduces the amount of data (and the time) to transfer.
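
These options are often combined: -P is a shorthand for --partial --progress, and -a (archive mode) implies -r while also preserving timestamps and permissions. A typical transfer then looks like

rsync -azP \
      username@login.genome.au.dk:/home/username/my_project/folder \
      /home/my_laptop/Documents/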

Interactive transfer

You can also transfer files with interactive software such as Filezilla, which has an easy interface. Download Filezilla.

 

When done, open Filezilla and use the following information on the login bar:

  • Host: login.genome.au.dk
  • Username, Password: your GenomeDK username and password
  • Port: 22

Press Quickconnect to establish a secure connection to GenomeDK. The left-side browser shows your local folders and files; the right side shows the folders and files on GenomeDK, starting from your home.

If you right-click on any local file or folder, you can upload it immediately, or add it to the transfer queue. The file will end up in the selected folder of the right-side browser.

The download process works similarly using the right-side browser and choosing the destination folder on the left-side browser.

If you have created a queue, this will be shown at the bottom of the window as a list. You can inspect destination folders from there and choose other options such as transfer priority.

To process the queue, press CTRL + P, use Transfer --> Process Queue, or press the corresponding button on the toolbar.

Package/Environment management

 

Properly managing your software and its dependencies is fundamental for reproducibility

Virtual environments

Each project needs specific, mutually dependent software versions for reproducibility - without interfering with other projects.

 

Definition

A virtual environment keeps project-specific software and its dependencies separate from other projects

 

A package manager is a piece of software that can retrieve, download, install and upgrade packages easily and reliably

Conda

 

Conda is both a virtual environment manager and a package manager.

  • easy to use and understand
  • can handle quite big environments
  • environments are easily shareable
  • a large archive (Anaconda) of packages
  • active community of people archiving their packages on Anaconda

Installation

 

Just download and execute the installer by

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
bash miniforge.sh -b
~/miniforge3/bin/conda init bash

The -b option performs a silent installation into ~/miniforge3, without any prompts. When it is done, reload your shell configuration with

source ~/.bashrc

and doublecheck that conda works:

conda info

Conda Configuration

You can add some default channels in which conda will look for packages. Here are some typical ones

conda config --append channels bioconda
conda config --append channels genomedk
conda config --append channels r
conda config --append channels conda-forge

We tell conda to respect the channel order specified above (strict priority), and to avoid activating the base environment (where conda is installed) at login.

conda config --set channel_priority strict
conda config --set auto_activate_base false

Base environment

The base environment is the one containing conda itself. The active environment is shown in your prompt, but you will no longer see base there after disabling its activation at login.

(base) [samuele@fe-open-02 ~]$

We equip conda with the libmamba solver - a lot faster when installing many packages at once.

conda install -n base --yes conda-libmamba-solver
conda config --set solver libmamba

Don’t touch the Base

This is the only time you should install in the base environment! You might otherwise ruin the conda installation.

Look at the settings in your conda installation. They are saved in the file ~/.condarc

cat ~/.condarc
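
Given the configuration commands above, the output should look roughly like this (the exact key order may differ):

channel_priority: strict
auto_activate_base: false
solver: libmamba
channels:
  - bioconda
  - genomedk
  - r
  - conda-forge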

Create an environment

 

An empty environment called test_1:

conda create -n test_1

 

You can list all the environments available:

conda env list
> # conda environments:
> #
> base      *  /home/samuele/miniforge3
> test_1       /home/samuele/miniforge3/envs/test_1

Activate and deactivate

 

To use an environment

conda activate test_1

Deactivation happens by

conda deactivate
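
You can always double-check which environment is active, either from the prompt or from an environment variable that conda sets:

echo $CONDA_DEFAULT_ENV
> test_1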

Manage an environment

Package installation

Conda combines the dependency trees of the requested packages to find a set of mutually compatible dependency versions.

Figure: A package’s dependency tree with required versions on the edges

To install a specific package in your environment, search it on anaconda.org:

Figure: search DeSeq2 for R

Figure: suggested commands to install the package

Repositories

Packages are archived in repositories (channels). Typical ones are bioconda, conda-forge, r and anaconda.

conda-forge packages are often more up to date, but occasionally show compatibility problems with other packages.

Install a couple of packages in the activated environment - you can always specify a version restriction to each package:

conda activate test_1
conda install "bioconda::bioconductor-deseq2<=1.42.0" conda-forge::r-tidyr=1.3.1

Note

Installing just two packages triggers more than a hundred installations! Those are all the dependencies arising from combining the dependency trees.

 

Look for the package tidyr in your active environment:

conda list | grep tidyr

Installation from a list of packages

You can export all the packages you have installed over time in your environment:

conda env export --from-history > environment.yml

which looks like

name: test_1
channels:
 - bioconda
 - conda-forge
 - defaults
 - r
dependencies:
 - bioconda::bioconductor-deseq2
 - conda-forge::r-tidyr

The same command without --from-history will create a very long file with ALL dependencies:

name: test_1
channels:
  - bioconda
  - conda-forge
  - defaults
  - r
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - _r-mutex=1.0.1=anacondar_1
  - argcomplete=3.2.2=pyhd8ed1ab_0
  - binutils_impl_linux-64=2.40=hf600244_0

This is guaranteed to work only on the specific system where you created the environment!

You can use the yml file to create an environment:

 

conda env create -p test_1_from_file -f ./environment.yml

 

Environment files are very useful when you want to share environments with others, especially when the package list is long.

Good practice: You want to install a lot of packages in an environment? Clone it first! If you break something, you still have the old copy.

conda create -p test_1_cloned --clone test_1

 

If installations in the cloned environment go fine, then you can remove it

conda env remove -p test_1_cloned

and repeat the installations on the original one.

Running a Job

 

Running programs on a computing cluster happens through jobs.

 

Learn how to get hold of computing resources to run your programs.

What is a job on a HPC

A computational task executed on requested HPC resources (computing nodes), which are handled by the queueing system (SLURM).

The command gnodes will tell you if there is heavy usage across the computing nodes

Usage of the computing nodes. Each node has a name (e.g. cn-1001). The symbols for each node mean: running a program (O), assigned to a user (_), and available (.)

If you want to venture more into checking the queueing status, Moi has done a great interactive script in R Shiny for that.

Front-end nodes are limited in memory and power, and should only be used for basic operations such as

  • starting a new project

  • small folder and file management

  • small software installations

and in general you should not use them to run computations. This might slow down other users.

Interactive jobs

Useful to run a non-repetitive task interactively

Examples:

  • splitting that one bam file you just got by chromosome

  • open Python/R and do some statistics

  • compress/decompress multiple files, maybe in parallel

Once you exit the job, anything running in it will stop.

To run an interactive job simply use the command

[fe-open-01]$ srun --mem=<GB_of_RAM>g -c <nr_cores> --time=<days-hrs:mins:secs>  --account=<project_name> --pty /bin/bash

For example

[fe-open-01]$ srun --mem=32g -c 2 --time=6:0:0  --account=<project_name> --pty /bin/bash

The queueing system makes you wait, depending on the resources you ask for and on how busy the nodes are. When a node is assigned to you, the resources are available and the node name appears in the prompt.

[<username>@s21n32 ~]$
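
Inside the job, you can double-check the allocated resources through environment variables that SLURM sets. For the example request above, you would see something like (memory is reported in MB):

echo $SLURM_CPUS_PER_TASK
> 2
echo $SLURM_MEM_PER_NODE
> 32768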

Batch script (sbatch)

Useful to run a program non-interactively, usually for longer than an interactive session. A batch script contains

  • the desired resources
  • the sequence of commands to be executed

and

  • has a filename without spaces (forget spaces from now on)
  • starts with #!/bin/bash, which declares the language (bash) the commands are written in

Example

A file called align.sh such that:

#!/bin/bash
#SBATCH --account my_project
#SBATCH --cpus-per-task 8
#SBATCH --mem 16g
#SBATCH --time 04:00:00

# activate the environment
eval "$(conda shell.bash hook)"
conda activate ./bam_tools
# index the reference file
bwa-mem2 index reference/chr2.fa
# align the data and sort the alignments
bwa-mem2 mem -t 8 reference/chr2.fa \
             genomes/S_Korean-2.region.fq.gz \
        | samtools sort \
            -@ 7 \
            -n \
            -O BAM \
        > alignment/S_Korean-2.sorted.bam

exit 0

Send the script to the queueing system:

sbatch align.sh
Submitted batch job 33735298

 

Interrogate SLURM about the specific job

jobinfo 33735298
>Name                : align.sh
>User                : samuele
>Account             : my_project
>Partition           : short
>Nodes               : s21n43
>Cores               : 8
>GPUs                : 0
>State               : RUNNING
>...

or about all the queued jobs

squeue -u <username>
>JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>33735298     short align.sh  samuele  R       1:12      1 s21n43

 

If you change your mind and want to cancel a job:

scancel 33735298

To observe the latest output of the job in real time, you can keep refreshing the last lines of its log file:

watch tail align.sh-33735298.out

 

To look at the whole file, you can run at any time

less -S align.sh-33735298.out

This can be useful for debugging, for example when a command gives an error and the job is interrupted.

Other ways of running jobs

Beyond sbatch, you can use dedicated tools for workflows, which are

  • modular and composable: sequences of commands can be applied in various contexts and composed together in the desired order
  • scalable and parallel: many sequences of operations can be handled in parallel or with interdependencies
  • flexible: repetitive operations can be automated across multiple applications

Some workflow tools:

 

Gwf uses a simple Python syntax, instead of a dedicated language, to write workflows.

 

You need to know some basic Python to use Gwf, but it is worth the effort.
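
As a rough sketch of day-to-day usage (assuming gwf is installed in an active conda environment and your targets are defined in a workflow.py file in the current folder):

gwf run      # submit the targets whose outputs are missing or outdated
gwf status   # show the state of each target
gwf logs <target_name>   # inspect the log of one target (placeholder name)

Each target declares its inputs, outputs and requested resources, and gwf submits it to SLURM only when its outputs need to be (re)created.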