An introduction to the GDK system and basic commands https://hds-sandbox.github.io/GDKworkshops
Health Data Science sandbox, BiRC
GenomeDK, Health
2024-09-18
Learn your way around the basics of the GenomeDK
cluster.
GenomeDK
is a computing cluster, i.e. a set of interconnected computers (nodes). GenomeDK
has:
Creating an account happens through this form at genome.au.dk
Logging into GenomeDK happens through the ssh command.
When first logged in, set up the 2-factor authentication by
showing a QR code with the dedicated command and
scanning it with your phone's Authenticator app.
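In practice the two steps could look like this (replace username with your own; the QR command name is taken from the GenomeDK documentation, so verify it there):

```shell
# On your local computer: log into the cluster
ssh username@login.genome.au.dk

# On the cluster: show the QR code for 2-factor setup
# (command name as given in the GenomeDK docs)
gdk-auth-show-qr
```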
It is nice to avoid typing the password at every access. If you are on the cluster, exit from it to go back to your local computer
Now we generate an RSA key.
Always press Enter and do not insert any password when asked.
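The key generation, run on your local computer, could look like this:

```shell
# Generate an RSA key pair; press Enter at every prompt (no passphrase).
ssh-keygen -t rsa
# The keys end up in ~/.ssh/: id_rsa (private, never share it)
# and id_rsa.pub (public, safe to copy to the cluster).
```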
Wait a second! What is an RSA key?
+--------+ +--------+
| User | |GenomeDK|
| | | |
| PrivKey| ------> | PubKey |
|(id_rsa)| AuthReq |(auth) |
+--------+ +--------+
| ^
v |
SignReq VerifyReq
| |
v v
Access Granted if Verified
We create a folder on the cluster called .ssh
to contain the RSA key we created
and finally send the RSA public key to the cluster, into the file authorized_keys
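These steps can be done in one go with ssh-copy-id, or manually as described above (run on your local computer; username is a placeholder):

```shell
# One step: creates ~/.ssh on the cluster if needed and appends the public key
ssh-copy-id -i ~/.ssh/id_rsa.pub username@login.genome.au.dk

# Manual alternative, matching the steps above
ssh username@login.genome.au.dk "mkdir -p ~/.ssh"
cat ~/.ssh/id_rsa.pub | ssh username@login.genome.au.dk "cat >> ~/.ssh/authorized_keys"
```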
This will be the last time you are asked for your password when logging in from your computer.
Directory structure and how to navigate it
Folders and files follow a tree-like structure
/
is the root of the filesystem - nothing is above it
home
and faststorage
are two root folders
your project folders reside in faststorage
and are linked to your home
Log into the cluster
Tip
Use the up arrow key in the terminal to find the commands you used previously, and press Enter when you find the login command
Every time you log in, you will find yourself in your private home folder. This is denoted by ~
or equivalently /home/username/
. Your prompt will show something like this:
which follows the format
The folder in which you are located is called the working directory (WD). Use the following command to see its path, starting from the root:
Every command you execute refers to your WD. Execute
and you will see the list of files in your WD.
Try to create an empty file now with
and create a folder, which will be inside your WD:
If you use the ls
command again, the new file and folder will show up in the WD.
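The commands mentioned above could be, all together (file name is illustrative):

```shell
pwd                 # print the path of the WD, starting from the root
ls                  # list the files in the WD
touch myFile.txt    # create an empty file
mkdir myFolder      # create a folder inside the WD
ls                  # myFile.txt and myFolder now show up
```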
How do you see the directory tree of the WD? Try
which shows you the tree with only 2 sublevels of depth.
Note
.
denotes the WD, and is the default when you do not specify it. Retry the command above using ..
(one directory above in the file system) instead of .
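Assuming the tree utility is available on the cluster, the two variants could be:

```shell
tree -L 2 .    # tree of the WD, two sublevels deep
tree -L 2 ..   # the same, starting one directory above
```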
We want to get a file from the internet to the folder myFolder
. We can use wget
:
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz\
-O ./myFolder/data.fastq.gz
Note
-O
is the option to give a path and name for the downloaded file.
Most commands have a help function to show their syntax and options. For example, you can use wget --help
or man wget
.
The path to a file/folder can be:
To look inside myFolder
, we can both write
and
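The two ways of listing the folder's content could be:

```shell
ls ~/myFolder    # absolute path, using ~ as a shortform for /home/username
ls ./myFolder    # relative path, starting from the WD
```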
Note
We have used ~
which is the shortform for /home/username.
Changing WD can be useful, for example to avoid writing long relative paths.
To set the WD inside myFolder
use
and verify with pwd
the new working directory path.
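For example:

```shell
cd myFolder    # set the WD to myFolder (relative path)
pwd            # e.g. /home/username/myFolder
```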
Moving, Downloading, Manipulating files on GenomeDK
See the content of the current folder with
or to see more details
Warning
do not fill up your home with data. It has a limited amount of storage (a quota of 100GB).
It is easy to start creating files everywhere in your project folders: data, analysis files, results, and the like.
Managing your folders rationally is the best way to find your way around, especially when getting back to your analysis after a long time.
You need a project from which you can run your programs. Request a project with the command
This creates a folder with the desired name. You should be able to go into that folder:
You can see how many resources your projects are using with
Only the creator (owner) can see the project folder. You can add a user
or remove one
More about user management in the documentation
It is important to keep your project folder organized.
Remember: Storage cost >> Computation cost
Example of structure, which backs up raw data and analysis
You can do it with a script:
If your project has many users, a good structure can be
Each user can go into their own folder inside the project and run the script to populate the subfolders
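A sketch of such a script (folder names are only illustrative, pick ones that fit your project):

```shell
#!/bin/bash
# Create a standard folder structure inside the project folder,
# separating raw data, results and scripts, with a backed-up copy
# of the raw data and scripts.
mkdir -p raw_data results scripts docs
mkdir -p backup/raw_data backup/scripts
```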
In your daily life on a cluster you are going to need downloads and exchange of files with online archives and your local PC.
Warning
Downloads should always happen on the front-end
nodes, and never on a compute node, when working on GenomeDK
wget
is a utility for command-line-based downloads. It is already installed on GenomeDK
and works with http
, https
, ftp
protocols.
Example:
downloads a png
file and saves it as output.png
(option -O
), runs the download in the background (-b
), and if the download was interrupted earlier, resumes it from where it stopped (-c
).
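Putting the options together, the example command could be (the URL is illustrative):

```shell
# -O names the output file, -b runs the download in the background,
# -c continues a previously interrupted download
wget -b -c -O output.png https://example.com/picture.png
```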
wget
has many options you can use, but the example above shows what you need most of the time. You can see them all with the command
Also, you may find this cheatsheet useful for remembering the commands for most things you can think of when downloading files with wget. The same page also has some concrete examples for wget
.
SCP
(Secure Copy Protocol) can transfer files securely
You can use it to transfer files from your PC to GenomeDK and vice versa, but also between GenomeDK and another computing cluster (for example, downloading data from a collaborator that resides on a different remote computing system).
To copy a file to GenomeDK from your local computer:
The inverse operation just changes the order of the sender and receiver:
If you want to copy an entire folder, use the option -r (recursive copy). The previous examples become
and
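The four variants could look like this (file and folder names are illustrative; run the commands on your local computer):

```shell
# local -> GenomeDK
scp myFile.txt username@login.genome.au.dk:/home/username/
# GenomeDK -> local (into the current folder)
scp username@login.genome.au.dk:/home/username/myFile.txt .

# entire folders: add -r
scp -r myFolder username@login.genome.au.dk:/home/username/
scp -r username@login.genome.au.dk:/home/username/myFolder .
```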
A few more options are available and you can see them with the command scp --help
.
Differently from scp
, you can use rsync
to synchronize files and folders between two locations. It copies only the changes in the data, and not all of it, every time.
Copying a file or a folder between your computer and GenomeDK
works exactly as in scp
. For example
rsync --progress -r \
username@login.genome.au.dk:/home/username/my_project/folder \
/home/my_laptop/Documents/
where we add an option to show a progress bar
An interrupted synchronization can be resumed. To allow future resumption of a partial transfer, the previous command needs the additional option --partial
(which keeps partially transferred files instead of deleting them):
rsync --partial --progress -r \
username@login.genome.au.dk:/home/username/my_project/folder \
/home/my_laptop/Documents/
After an interruption, just rerun the exact same command to resume the synchronization.
If you have large files, the option -z
(compression) reduces the amount of data (and the time) to transfer.
You can also transfer files with an interactive program, such as Filezilla
, which has an easy interface. Download Filezilla.
When done, open Filezilla
and use the following information in the login bar:
Host: login.genome.au.dk
Username/Password: your GenomeDK username and password
Port: 22
Press on Quick Connect
. As a result, you will establish a secure connection to GenomeDK
. On the left-side browser you can see your local folders and files. On the right-side, the folders and files on GenomeDK
starting from your home
.
If you right-click on any local file or folder, you can upload it immediately, or add it to the transfer queue. The file will end up in the selected folder of the right-side browser.
The download process works similarly using the right-side browser and choosing the destination folder on the left-side browser.
If you have created a queue, this will be shown at the bottom of the window as a list. You can inspect destination folders from there and choose other options such as transfer priority.
To start a queue, use CTRL + P
, Transfer --> Process Queue
or press the button on the toolbar.
Properly managing your software and its dependencies is fundamental for reproducibility
Each project needs specific software versions, dependent on each other, for reproducibility - without interfering with other projects.
Definition
A virtual environment keeps project-specific software and its dependencies separated
A package manager is a program that can retrieve, download, install, and upgrade packages easily and reliably
Conda is both a virtual environment and a package manager.
Just download and execute the installer by
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
chmod +x miniforge.sh
bash miniforge.sh -b
./miniforge3/bin/conda init bash
After a few ENTER
s and YES
’s you should get the installation done. Run
and double-check that conda
works:
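For example:

```shell
source ~/.bashrc    # reload the shell configuration so conda is available
conda --version     # prints the installed conda version
```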
You can add some default channels in which to find archived packages. Here are some typical ones
conda config --append channels bioconda
conda config --append channels genomedk
conda config --append channels r
conda config --append channels conda-forge
We tell conda
to look into channels in the order specified above. We also avoid opening the base
environment (where conda
is installed) at login.
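The setting that disables opening the base environment at login is:

```shell
conda config --set auto_activate_base false
```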
base
is the environment containing conda itself. The current environment is shown in your prompt, but you will not see it anymore after disabling its automatic activation at login.
We update Conda with libmamba solver
- a lot faster at installing many packages at once.
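The update could be done like this (commands as in the conda documentation; these are the only installs that belong in base, as warned below):

```shell
conda update -n base conda                  # update conda itself
conda install -n base conda-libmamba-solver # install the libmamba solver
conda config --set solver libmamba          # make it the default solver
```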
Don’t touch the Base
This is the only time you should install in the base
environment! You might otherwise ruin the conda installation.
Look at the settings in your conda installation. They are saved in the file ~/.condarc
An empty environment called test_1
:
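For example:

```shell
conda create -n test_1    # create an empty environment named test_1
```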
You can list all the environments available:
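For example:

```shell
conda env list    # the active environment is marked with *
```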
> # conda environments:
> #
> base * /home/samuele/miniconda3
> test_1 /home/samuele/miniconda3/envs/test_1
To use an environment
Deactivation happens by
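The two commands:

```shell
conda activate test_1    # switch into the environment
conda deactivate         # leave it, back to the previous environment
```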
Conda puts together the dependency trees of the requested packages to find a set of mutually compatible dependency versions.
To install a specific package in your environment, search it on anaconda.org:
Repositories
packages are archived in repositories. Typical ones are bioconda
, conda-forge
, r
, anaconda
.
conda-forge
packages are often more up-to-date, but occasionally show compatibility problems with other packages.
Install a couple of packages in the activated environment - you can always specify a version restriction to each package:
conda activate test_1
conda install "bioconda::bioconductor-deseq2<=1.42.0" conda-forge::r-tidyr=1.3.1
Note
To install two packages, more than a hundred installations are needed! Those are all dependencies arising from the comparison of the dependency trees.
Look for the package tidyr
in your active environment:
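For example:

```shell
conda list tidyr    # or: conda list | grep tidyr
```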
You can export all the packages you have installed over time in your environment:
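The export command could be:

```shell
conda env export --from-history > environment.yml
```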
The resulting file lists only the packages you explicitly requested.
The same command without --from-history
will create a very long file with ALL dependencies:
name: test_1
channels:
- bioconda
- conda-forge
- defaults
- r
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- _r-mutex=1.0.1=anacondar_1
- argcomplete=3.2.2=pyhd8ed1ab_0
- binutils_impl_linux-64=2.40=hf600244_0
This is guaranteed to work only on the specific system where you created the environment!
You can use the yml
file to create an environment:
conda env create -p test_1_from_file -f ./environment.yml
Environment files are very useful when you want to share environments with others, especially when the package list is long.
Good practice: do you want to install a lot of packages in an environment? Clone it first! If you break something, you still have the old copy.
If installations in the cloned environment go fine, then you can remove it
and repeat the installations on the original one.
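For example (environment names are illustrative):

```shell
# clone before risky installs
conda create -n test_1_clone --clone test_1
# ...install and test in test_1_clone...
# if everything went fine, remove the clone
conda env remove -n test_1_clone
```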
Conda cheat sheet with all the things you can do to manage environments
Anaconda where you can search for packages
Running programs on a computing cluster happens through jobs.
Learn how to get hold of computing resources to run your programs.
A job is a computational task executed on requested HPC resources (computing nodes), which are handled by the queueing system (SLURM).
The command gnodes
will tell you if there is heavy usage across the computing nodes
If you want to venture further into checking the queueing status, Moi has made a great interactive script in R Shiny for that.
Front-end nodes are limited in memory and power, and should only be used for basic operations such as
starting a new project
small folder and file management
small software installations
and in general you should not use them to run computations. This might slow down other users.
Useful to run a non-repetitive task interactively
Examples:
splitting that one bam
file you just got by chromosome
open python
/R
and do some statistics
compress/decompress multiple files, maybe in parallel
Once you exit from the job, anything running into it will stop.
To run an interactive job simply use the command
[fe-open-01]$ srun --mem=<GB_of_RAM>g -c <nr_cores> --time=<days-hrs:mins:secs> --account=<project_name> --pty /bin/bash
For example
[fe-open-01]$ srun --mem=32g -c 2 --time=6:0:0 --account=<project_name> --pty /bin/bash
The queueing system makes you wait based on the resources you ask and how busy the nodes are. When you get assigned a node, the resources are available. The node name is shown in the prompt.
[<username>@s21n32 ~]$
Useful to run a program non-interactively, usually for longer than a short interactive session. A batch script contains the resource requests (#SBATCH options) and the commands to execute. Its first line is
#!/bin/bash
so that the system knows in which language ('bash') the commands are written. A file called align.sh
could look like this:
#!/bin/bash
#SBATCH --account my_project
#SBATCH --cpus-per-task=8
#SBATCH --mem 16g
#SBATCH --time 04:00:00
#activate environment
eval "$(conda shell.bash hook)"
conda activate ./bam_tools
#index the reference file
bwa-mem2 index reference/chr2.fa
#align data
bwa-mem2 mem -t 8 reference/chr2.fa \
genomes/S_Korean-2.region.fq.gz \
| samtools sort \
-@ 7 \
-n \
-O BAM \
> alignment/S_Korean-2.sorted.bam
exit 0
Send the script to the queueing system:
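For example:

```shell
sbatch align.sh
```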
Submitted batch job 33735298
Interrogate SLURM about the specific job
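for example with the jobinfo utility (name taken from the GenomeDK documentation; the job ID comes from the sbatch output):

```shell
jobinfo 33735298
```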
>Name : align.sh
>User : samuele
>Account : my_project
>Partition : short
>Nodes : s21n43
>Cores : 8
>GPUs : 0
>State : RUNNING
>...
or about all the queued jobs
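For example:

```shell
squeue -u username    # jobs of a given user; on recent SLURM: squeue --me
```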
>JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>33735928 short align.sh samuele R 1:12 1 s21n43
If you change your mind and want to cancel a job:
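For example:

```shell
scancel 33735298    # job ID from the sbatch submission
```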
To observe in real time the latest output of the command in the job, you can refresh the last lines of the log file for the specific job:
To look at the whole file, you can run at any time
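Assuming SLURM's default log name slurm-&lt;jobid&gt;.out, the two commands could be:

```shell
tail -f slurm-33735298.out    # follow the latest lines in real time (Ctrl+C to stop)
less slurm-33735298.out       # browse the whole log file at any time
```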
This can be useful for debugging, when for example a command gives an error and the job interrupts.
Beyond sbatch
, you can use other tools to build workflows.
Some workflow tools:
Gwf
has an easy python
syntax instead of its own language to write workflows.
You need to know some basic python
to use Gwf
, but it is worth the effort.