An introduction to the GDK system and basic commands https://hds-sandbox.github.io/GDKworkshops
Health Data Science sandbox, BiRC
GenomeDK, Health
Core Binf facility, Biomed
2024-10-11
These slides are both a presentation and a small reference manual
We will try out the commands during the workshop
Official reference documentation: genome.au.dk
Sandbox: Samuele (BiRC, MBG) - samuele at birc.au.dk
Core facility: Per&Co (Biomed) - per.q at biomed.au.dk
Bioinformatics Cafe: https://abc.au.dk, abc at au.dk
11:00-12:00: GenomeDK basics, file system, file manipulation
13:00-14:30: Projects, data transfer, virtual environments with conda
14:30-15:00: Introduction to jobs with slurm
Learn your way around the basics of the GenomeDK cluster.
GenomeDK is a computing cluster, i.e. a set of interconnected computers (nodes).
Creating an account happens through this form at genome.au.dk
Logging into GenomeDK happens through the ssh command.
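A minimal sketch of the login, with username as a placeholder for your own GenomeDK username:

```shell
ssh username@login.genome.au.dk
```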
When first logged in, set up the 2-factor authentication by showing a QR-code with the dedicated command and scanning it with your phone’s Authenticator app.
It is nice to avoid writing the password at every access. If you are on the cluster, exit from it to go back to your local computer
Now we set up a public-key authentication. We generate a key pair (public and private):
Always press Enter and do not insert any password when asked.
Then we create a folder on the cluster called .ssh to contain the public key, and finally send the public key to the cluster, into the file authorized_keys.
After this, every time you log in your local private key is matched against the public key stored on GenomeDK, without you needing to write a password.
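A condensed sketch of the whole setup. Here ssh-copy-id is a standard OpenSSH shortcut that creates ~/.ssh on the cluster and appends your public key to authorized_keys in one step (username is a placeholder):

```shell
# run on your LOCAL machine, not on the cluster
ssh-keygen -t ed25519                      # press Enter at every prompt, no password
ssh-copy-id username@login.genome.au.dk    # installs the public key into ~/.ssh/authorized_keys
```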
Folders and files follow a tree-like structure
/ is the root folder of the filesystem - nothing is above that
home and faststorage are two of the folders in the root
faststorage contains the project folders, which are linked in your home
you can reach files and folders with a path
Examples:
/home/username/coolProject/code/analysis.R
/faststorage/project
Paths starting from the root are called absolute
Look at the file system tree and answer the following questions:
First of all, log into the cluster
Tip
Use the arrow up key on the terminal to find the commands you used previously, and press enter when you find the login command
After login, you will find yourself in your private home folder, denoted by ~ or equivalently /home/username/. Your prompt will look like this:
which follows the format [username@node current_folder].
Warning
The folder in which you are located is called working directory (WD). The WD is usually shown in your prompt. Use the following command to see the WD path starting from the root:
Every command you run refers to your WD. Execute
and you will see the list of files and folders in your WD.
Try to create an empty file now with
and create a folder, which will be inside the WD:
If you use again the ls
command, the new file and folder will show in the WD.
How do you see the directory tree of the WD? Try tree
with only 2 sublevels of depth:
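The commands from this section collected in one runnable sketch (file1.txt and myFolder are example names; the -p flag just avoids an error if the folder already exists):

```shell
pwd                # print the working directory (WD) path
ls                 # list files and folders in the WD
touch file1.txt    # create an empty file
mkdir -p myFolder  # create a folder inside the WD
if command -v tree >/dev/null; then
  tree -L 2 .      # directory tree of the WD, two sublevels deep
fi
```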
Note
.
denotes the WD, and is the default when you do not specify it. Retry the command above using ..
(one directory above in the file system) instead of .
We want to get a file from the internet to the folder myFolder
. We can use wget
:
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz\
-O ./myFolder/data.fastq.gz
Options and manuals
-O
is the option to give a path and name for the downloaded file.
Most commands have a help function to know the syntax and options. For example you can use wget --help
and man wget
.
Reminder about the path types
The path to a file/folder can be absolute (starting from the root /) or relative (starting from the WD).
To look inside myFolder, we can write either its relative or its absolute path:
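A runnable sketch (the mkdir -p line just makes it work anywhere; on the cluster the absolute path would be /home/username/myFolder):

```shell
mkdir -p myFolder      # ensure the folder exists for this sketch
ls ./myFolder          # relative path, starting from the WD
ls "$PWD"/myFolder     # absolute path equivalent
```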
The WD is everywhere
The working directory is a very useful concept, not limited to Linux/GenomeDK, but used very widely in computer applications.
For example, when you work in R
or Python
, there is a default WD which you can change.
Changing WD can be useful. To set the WD inside myFolder
use
and verify with pwd
the new working directory path.
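A minimal sketch (the mkdir -p line only guards against the folder missing):

```shell
mkdir -p myFolder   # in case it does not exist yet
cd myFolder         # set the WD inside myFolder
pwd                 # verify: the path now ends in /myFolder
```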
Moving, downloading, manipulating and other basic operations on files.
Many files you use in bioinformatics are nothing more than text files written in a specific manner. This specific way of arranging the text gives you many of the file formats you encounter when doing bioinformatics.
Note
Some file formats are encoded differently than with plain ASCII text, and cannot usually be seen with a text editor.
Let’s get ready. Be sure you are in myFolder (use pwd) - otherwise use cd to move there.
Now, you can decompress the file data.fastq.gz
, which is in gz
compressed format:
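A self-contained sketch of compressing and decompressing. The printf line only creates a stand-in file so you can run this anywhere; on the cluster you already have the real data.fastq.gz:

```shell
# stand-in file (on the cluster, skip this line and use the downloaded data.fastq.gz)
printf '@read1\nACGT\n+\nIIII\n' > data.fastq
gzip data.fastq         # compress: data.fastq -> data.fastq.gz
gunzip data.fastq.gz    # decompress: data.fastq.gz -> data.fastq
```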
Tip
For compressing a file into gz
format, you can use gzip
. For compressing and decompressing in zip
format, you have also the commands zip
and unzip
.
less is perfect for reading text files: you can scroll with the arrows, and quit by pressing q. Try less data.fastq.
The very first sequence you see should be
@HISEQ_HU01:89:H7YRLADXX:1:1101:1116:2123 1:N:0:ATCACG
TCTGTGTAAATTACCCAGCCTCACGTATTCCTTTAGAGCAATGCAAAACAGACTAGACAAAAGGCTTTTAAAAGTCTA
ATCTGAGATTCCTGACCAAATGT
+
CCCFFFFFHHHHHJJJJJJJJJJJJHIJJJJJJJJJIJJJJJJJJJJJJJJJJJJJHIJGHJIJJIJJJJJHHHHHHH
FFFFFFFEDDEEEEDDDDDDDDD
The first line is metadata, the second is the sequence, the third is a separator line (the symbol +), and the fourth contains the quality scores (encoded by letters as in this table).
Exercise
Search online (or with less --help) how to look for a specific word in a file with less. Then visualize the data with less, and try to find whether there is any sequence of ten adjacent Ns (that is, ten missing nucleotides). Then, answer the question below
How many lines are there in your file? The command wc
can show that to you:
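A runnable sketch; the printf line creates a stand-in file with two records (8 lines), while on the cluster you would run wc on the real data.fastq:

```shell
# stand-in file (skip on the cluster)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > data.fastq
wc -l data.fastq    # number of lines; divide by 4 for the number of sequences
```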
The file has 100000 lines, or 25000 sequences (each sequence is defined by 4 lines).
Tip
wc has many functionalities. As always, look at the manual or examples to see how to use it in many other ways.
cp
can copy one or more files - we use it on our data:
mv
moves a file into another folder - here we “move” it within our WD, which simply changes its filename:
Use now ls -lah and you will see two files of identical size but different modification dates.
Well, we changed our mind and do not want a copy of our data. Remove it with
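The copy/rename/remove sequence as one runnable sketch (the touch line creates a stand-in file; data_copy.fastq and copiedData.fastq are example names):

```shell
touch data.fastq                      # stand-in file for the sketch
cp data.fastq data_copy.fastq         # copy the data
mv data_copy.fastq copiedData.fastq   # "moving" within the WD simply renames it
ls -lah                               # two files with identical size
rm copiedData.fastq                   # removed forever: there is no trash bin
```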
Forever away
There is no trash bin - removed files are lost forever - with no exception
Write something on a file using >: the command head -n 4 data.fastq > smallFile.fastq prints out the first four lines of the data into smallFile.fastq.
Warning
Using again >
will overwrite the file!
Print it out on the screen with cat smallFile.fastq.
Avoid overwriting by appending with >>: the command tail -n 4 data.fastq >> smallFile.fastq appends the last 4 lines of the data to smallFile.fastq. Check again using cat or wc -l.
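The redirection commands above as one runnable sketch (the printf line creates a stand-in data.fastq with three records, i.e. 12 lines):

```shell
# stand-in file (skip on the cluster)
printf '@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' > data.fastq
head -n 4 data.fastq > smallFile.fastq    # > (over)writes the first record to the file
cat smallFile.fastq                       # print the file on the screen
tail -n 4 data.fastq >> smallFile.fastq   # >> appends the last record instead
wc -l smallFile.fastq                     # now 8 lines
```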
You can create small pipelines directly on the shell with the symbol |: the output of a command is sent to the next command when you have | in between. For example, grep NNNNN data.fastq finds the pattern NNNNN in the data.
How to find it in the first hundred sequences only? Easy! We pipe head into grep:
That pipe produced a small output on screen - but outputs can be huge! We can count the matching sequences by piping again into wc.
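A runnable sketch of the pipes (the printf line creates a stand-in data.fastq; 400 lines correspond to the first 100 sequences):

```shell
# stand-in file: the first read contains NNNNN (skip on the cluster)
printf '@r1\nACNNNNNGT\n+\nIIIIIIIII\n@r2\nACGTACGTT\n+\nIIIIIIIII\n' > data.fastq
grep NNNNN data.fastq                          # lines containing the pattern
head -n 400 data.fastq | grep NNNNN            # search only the first 100 sequences
head -n 400 data.fastq | grep NNNNN | wc -l    # count the matching lines
```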
ls : List files and directories in the current directory.
ls -l : List in long format (detailed information).
ls -a : List all files, including hidden ones (starting with .).
ls -lh : List with human-readable file sizes (e.g., KB, MB).
ls -R : Recursively list files in directories and subdirectories.
cp source_file destination : Copy a file to a destination.
cp file1 file2 dir/ : Copy multiple files to a directory.
cp -r dir1 dir2 : Recursively copy a directory and its contents.
mv source_file destination : Move a file to a new location or rename it.
mv file1 file2 dir/ : Move multiple files to a directory.
mv oldname newname : Rename a file or directory.
rm file : Remove a file.
rm -f file : Force remove a file (suppress confirmation).
rm -r dir : Recursively remove a directory and its contents.
rm -rf dir : Forcefully and recursively remove a directory and its contents (use with caution).
mkdir dir_name : Create a new directory.
mkdir -p parent_dir/child_dir : Create a directory with parent directories as needed.
chmod 644 file : Set read/write for owner, and read-only for group and others.
chmod 755 file : Set read/write/execute for owner, and read/execute for group and others.
chmod +x file : Add execute permission to a file.
chmod -R 755 dir : Recursively change permissions for a directory and its contents.
chown user file : Change the ownership of a file.
chown user:group file : Change the owner and group of a file.
chown -R user:group dir : Recursively change ownership of a directory and its contents.
file filename : Display the type of a file.
stat filename : Show detailed information about a file (size, permissions, timestamps).
du -sh file/dir : Display the disk usage of a file or directory (in human-readable format).
touch filename : Create an empty file or update the timestamp of an existing file.
cat filename : View the contents of a file.
less filename : View the contents of a file, with navigation.
head -n 10 filename : View the first 10 lines of a file.
tail -n 10 filename : View the last 10 lines of a file.
ln file link_name : Create a hard link.
ln -s target link_name : Create a symbolic (soft) link.
Navigation keys in less:
Space or f : Scroll forward by one page.
Down Arrow or j : Scroll down by one line.
d : Scroll down by half a page.
b : Scroll backward by one page.
Up Arrow or k : Scroll up by one line.
u : Scroll up by half a page.
G : Go to the end of the file.
g : Go to the beginning of the file.
numberG or number% : Go to a specific line or percentage in the file.
/pattern : Search forward for a pattern (use n to move to the next match).
?pattern : Search backward for a pattern (use N to move to the previous match).
n : Repeat the last search in the same direction.
N : Repeat the last search in the opposite direction.
-N or --LINE-NUMBERS : Show line numbers (must start less with this option).
m<letter> : Mark the current position with a letter.
'<letter> : Return to the marked position.
Other commands in less:
q : Exit less.
Right Arrow or → : Scroll right.
Left Arrow or ← : Scroll left.
:e filename : Open another file while inside less.
:n : Go to the next file (if multiple files were opened).
:p : Go to the previous file.
F : Continuously view a file as it grows (like tail -f).
= : Show the current file name, line number, and percentage through the file.
h : Display help with all available commands.
-N : While in a less session, use this to toggle line number display.
We meet at 13:00 in 1533-103
What is a project
Projects are contained in /faststorage/project/
, and are simple folders with some perks:
Common-sense in project creation
Avoid creating several projects like bulkRNA_mouse, bulkRNA_human, bulkRNA_apes with the same invited users.
Prefer a single project bulkRNA_studies with subfolders bulkRNA_mouse, bulkRNA_human, bulkRNA_apes.
Request a project (after login on GDK) with the command
After GDK approval, a project folder with the desired name appears in ~
and /faststorage/project
. You should be able to set the WD into that folder:
or
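Either of these should work (project_name is a placeholder for the name you requested):

```shell
cd ~/<project_name>                      # via the link in your home
cd /faststorage/project/<project_name>   # via the absolute path
```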
Only the creator (owner) can see the project folder. You (and only you) can add a user
or remove it
Users can also be promoted to have administrative rights in the project
or demoted from those rights
You can see the global monthly resource usage of your projects with
Example output:
More detailed usage, split by user on a selected project, can be seen with
Example output:
project user period billing hours storage (TB) backup (TB) storage files backup files
ngssummer2024 sarasj 2024-7 77.98 0.02 0.00 528 0
ngssummer2024 sarasj 2024-8 0.00 0.02 0.00 528 0
ngssummer2024 savvasc 2024-7 223.21 0.02 0.00 564 0
ngssummer2024 savvasc 2024-8 0.00 0.02 0.00 564 0
ngssummer2024 simonnn 2024-7 173.29 0.01 0.00 579 0
ngssummer2024 simonnn 2024-8 0.00 0.01 0.00 579 0
Accounting Tips
Pipe the output into grep to isolate specific users and/or months:
Example:
Have a coherent folder structure - your future self will thank you.
You can do it with a script, which you download and execute with the command below:
If your project has many users, a good structure can be
Do that with these commands
and make each user run the script in their own folder
MUST-KNOWs for a GDK project
Backup cost >>> Storage cost >> Computation cost
Data transfer amongst the web, GDK and your PC is an everyday action which you can easily perform.
Warning
Downloads should always happen on the front-end nodes, and never on a compute node, when working on GenomeDK
wget
is a utility for command-line-based downloads. It is already installed on GenomeDK
and works with http
, https
, ftp
protocols.
Example:
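A sketch of the command being described (the URL is a made-up placeholder):

```shell
wget -c -b -O output.png https://example.com/picture.png
```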
downloads a png file and saves it as output.png (option -O), downloads in the background (-b), and if the download was interrupted earlier, resumes it from where it stopped (-c).
wget has many options you can use, but what is shown in the example above is what you need most of the time. You can see them all with wget --help.
You may also find this cheatsheet useful for remembering the most common wget commands; the same page also has some concrete examples for wget.
SCP
(Secure Copy Protocol) can transfer files securely
You can use it to transfer files between your PC and GenomeDK, but also between GenomeDK and another computing cluster (for example, downloading data from a collaborator which resides on a different remote computing system).
To copy a file to GenomeDK from your local computer:
The inverse operation just changes the order of the sender and receiver:
If you want to copy an entire folder, use the option -r
(recursive copy). The previous examples become
and
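Summarizing the scp variants above in one sketch (username and the paths are placeholders; run these from your local machine):

```shell
# local -> cluster
scp ./data.fastq.gz username@login.genome.au.dk:~/myFolder/
# cluster -> local
scp username@login.genome.au.dk:~/myFolder/data.fastq.gz .
# entire folders need -r
scp -r ./myFolder username@login.genome.au.dk:~/
```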
A few more options are available and you can see them with the command scp --help
.
You can also transfer files with an interactive program, such as Filezilla, which has an easy interface. Download Filezilla.
When done, open Filezilla
and use the following information on the login bar:
Host: login.genome.au.dk
Username and password: your GenomeDK credentials
Port: 22
Press on Quick Connect
. As a result, you will establish a secure connection to GenomeDK
. On the left-side browser you can see your local folders and files. On the right-side, the folders and files on GenomeDK
starting from your home
.
If you right-click on any local file or folder, you can upload it immediately, or add it to the transfer queue. The file will end up in the selected folder of the right-side browser.
The download process works similarly using the right-side browser and choosing the destination folder on the left-side browser.
If you have created a queue, this will be shown at the bottom of the window as a list. You can inspect destination folders from there and choose other options such as transfer priority.
To start a queue, use CTRL + P
, Transfer --> Process Queue
or press the button on the toolbar.
Properly managing your software and its dependencies is fundamental for reproducibility
Each project needs specific software versions, dependent on each other, for reproducibility - without interfering with other projects.
Definition
A virtual environment keeps project-specific software and its dependencies separated
A package manager is a software that can retrieve, download, install, upgrade packages easily and reliably
Conda is both a virtual environment and a package manager.
Just download and execute the installer by
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
chmod +x miniforge.sh
bash miniforge.sh -b
./miniforge3/bin/conda init bash
After a few ENTER
s and YES
’s you should get the installation done. Run
and doublecheck that conda
works:
You can add some default channels where to find archived packages. Here are some typical ones
conda config --append channels bioconda
conda config --append channels genomedk
conda config --append channels r
conda config --append channels conda-forge
We tell conda
to look into channels in the order specified above. We also avoid opening the base
environment (where conda
is installed) at login.
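The login behaviour is controlled by a standard conda setting; a minimal sketch:

```shell
conda config --set auto_activate_base false   # do not activate base at login
```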
base
is the environment containing conda itself. The current environment is in your prompt in round brackets.
We update Conda with libmamba solver
- a lot faster in installing many packages at once.
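A sketch of the update, assuming the standard conda-libmamba-solver package:

```shell
conda update -n base conda                     # bring conda itself up to date
conda install -n base conda-libmamba-solver    # install the faster solver
conda config --set solver libmamba             # make it the default solver
```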
Don’t touch the Base
This is the only time you should install in the base
environment! You might otherwise ruin the conda installation.
An empty environment called test_1
:
You can list all the environments available:
> # conda environments:
> #
> base * /home/samuele/miniconda3
> test_1 /home/samuele/miniconda3/envs/test_1
Note
An environment is in reality a folder, which contains all installed packages and other configurations and utilities
To use an environment we activate it:
From now on, all installed software and packages will be available. (test_1) is now shown in your prompt.
Deactivation happens by
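The environment lifecycle from this section, as a sketch:

```shell
conda create -y -n test_1   # create an empty environment called test_1
conda env list              # list all environments
conda activate test_1       # activate it; (test_1) appears in the prompt
conda deactivate            # leave it again
```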
Conda puts together the dependency trees of requested packages to find all compatible dependencies versions.
To install a specific package in your environment, search it on anaconda.org:
Repositories
packages are archived in repositories. Typical ones are bioconda
, conda-forge
, r
, anaconda
.
conda-forge packages are often more up-to-date, but occasionally show compatibility problems with other packages.
Install a couple of packages in the activated environment - you can always specify a version restriction to each package:
conda activate test_1
conda install "bioconda::bioconductor-deseq2<=1.42.0" conda-forge::r-tidyr=1.3.1
Note
To install two packages, more than a hundred packages get installed! Those are all dependencies arising from the comparison of dependency trees.
Look for the package tidyr
in your active environment:
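A minimal sketch:

```shell
conda list tidyr    # show version, build and channel of tidyr, if installed
```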
You can export all the packages you have installed over time in your environment:
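A sketch of the export (environment.yml is a conventional file name):

```shell
conda env export --from-history > environment.yml   # only the packages you explicitly asked for
```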
The same command without --from-history creates a very long file with ALL dependencies, which looks like:
name: test_1
channels:
- bioconda
- conda-forge
- defaults
- r
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- _r-mutex=1.0.1=anacondar_1
- argcomplete=3.2.2=pyhd8ed1ab_0
...
This is guaranteed to work only on a system with same OS and architecture as GenomeDK (Linux and x86)!
You can use the yml
file to create an environment:
conda env create -p test_1_from_file -f ./environment.yml
Environment files are very useful when you want to share environments with others, especially when the package list is long.
Good practice: You want to install a lot of packages in an environment? Clone it first! If you break something, you still have the old copy.
If installations in the cloned environment go fine, then you can remove it
and repeat the installations on the original one.
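The clone-then-clean-up workflow as a sketch (test_1_clone is a hypothetical name):

```shell
conda create -n test_1_clone --clone test_1   # safe copy before heavy installs
conda env remove -n test_1_clone              # delete the clone once you are confident
```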
Conda cheat sheet with all the things you can do to manage environments
Anaconda where you can search for packages
Running programs on a computing cluster happens through jobs.
Learn how to get hold of computing resources to run your programs.
A computational task executed on requested HPC resources (computing nodes), which are handled by the queueing system (SLURM).
The command gnodes
will tell you if there is heavy usage across the computing nodes
If you want to venture more into checking the queueing status, Moi has done a great interactive script in R Shiny for that.
Front-end nodes are limited in memory and power, and should only be for basic operations such as
starting a new project
small folders and files management
small software installations
data transfer
and in general you should not use them to run computations. This might slow down all other users on the front-end.
Useful to run a non-repetitive task interactively
Examples:
splitting that one bam file you just got by chromosome
open python
/R
and do some statistics
compress/decompress multiple files, maybe in parallel
Once you exit from the job, anything running into it will stop.
To run an interactive job simply use the command
[fe-open-01]$ srun --mem=<GB_of_RAM>g -c <nr_cores> --time=<days-hrs:mins:secs> --account=<project_name> --pty /bin/bash
For example
[fe-open-01]$ srun --mem=32g -c 2 --time=6:0:0 --account=<project_name> --pty /bin/bash
The queueing system makes you wait based on the resources you ask and how busy the nodes are. When you get assigned a node, the resources are available. The node name is shown in the prompt.
[<username>@s21n32 ~]$
Useful to run a program non-interactively, usually for a longer time and without interaction from the user. A batch script contains both the requested resources (#SBATCH lines) and the commands to run.
A file called align.sh
such that:
#!/bin/bash
#SBATCH --account my_project
#SBATCH --cpus-per-task=8
#SBATCH --mem 16g
#SBATCH --time 04:00:00
#activate environment
eval "$(conda shell.bash hook)"
conda activate bam_tools
#index the reference file
bwa-mem2 index reference/chr2.fa
#align data
bwa-mem2 mem -t 8 reference/chr2.fa \
genomes/S_Korean-2.region.fq.gz \
| samtools sort \
-@ 7 \
-n \
-O BAM \
> alignment/S_Korean-2.sorted.bam
exit 0
Send the script to the queueing system:
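With the standard SLURM command:

```shell
sbatch align.sh
```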
Submitted batch job 33735298
Interrogate SLURM about the specific job
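On GenomeDK this is done with the jobinfo utility; the job ID is the one printed by sbatch:

```shell
jobinfo 33735298
```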
>Name : align.sh
>User : samuele
>Account : my_project
>Partition : short
>Nodes : s21n43
>Cores : 8
>GPUs : 0
>State : RUNNING
>...
or about all the queued jobs
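with SLURM's squeue, here restricted to your own jobs:

```shell
squeue --me
```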
>JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>33735928 short align.sh samuele R 1:12 1 s21n43
If you change your mind and want to cancel a job:
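use scancel with the job ID:

```shell
scancel 33735298
```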
Tip
To observe in real time the output of the job, refresh the last lines of the log file for that job:
To look at the whole log (not in real time), run at any time
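A sketch, assuming the default SLURM log file name (slurm-<jobid>.out):

```shell
tail -f slurm-33735298.out    # follow the log in real time (Ctrl+C to stop)
less slurm-33735298.out       # read the whole log at any time
```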
Checking the log files can be useful for debugging, when for example a command gives an error and the job interrupts before its end.
Try to run a job with a smaller dataset as a test. While it is running
use squeue --me
and look at the node id
log into that node from the front-end:
use htop -u <username>
to see what is running and how much memory and CPU it uses
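The monitoring steps above as commands (the node name is an example taken from the output shown earlier; username is a placeholder):

```shell
squeue --me          # note the node in the NODELIST column, e.g. s21n43
ssh s21n43           # log into that node from the front-end
htop -u <username>   # live view of your processes' CPU and memory usage
```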
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
8249 newhall 20 0 153m 4488 472 R 47 0.0 0:02.78 6 gol
8250 newhall 20 0 153m 4488 472 R 47 0.0 0:02.76 2 gol
8236 newhall 20 0 153m 4488 472 R 46 0.0 0:02.77 6 gol
8237 newhall 20 0 153m 4488 472 S 46 0.0 0:02.77 4 gol
8243 newhall 20 0 153m 4488 472 R 46 0.0 0:02.76 1 gol
8239 newhall 20 0 153m 4488 472 S 46 0.0 0:02.76 7 gol
8240 newhall 20 0 153m 4488 472 R 46 0.0 0:02.76 5 gol
8244 newhall 20 0 153m 4488 472 R 46 0.0 0:02.72 2 gol
8251 newhall 20 0 153m 4488 472 R 46 0.0 0:02.78 4 gol
Beyond sbatch, you can use dedicated workflow tools. Some workflow tools:
Gwf has an easy Python syntax, instead of its own language, to write workflows.
Learning a workflow language takes some time commitment, but it is worth the effort.
There is more in these slides than what we went through.
They are updated over time - use them as a reference.
It is impossible to cover everything at once; we will also run an advanced/pipeline workshop.
Come to our Cafe and/or ask us.
Documentation on genome.au.dk
tmux for persistent terminal sessions
awk for advanced text file manipulation
rsync for synchronization of data