Use GenomeDK

An introduction to the GDK system and basic commands https://hds-sandbox.github.io/GDKworkshops

Samuele Soraggi

Health Data Science sandbox, BiRC

Dan Søndergaard

GenomeDK, Health

Per, Dimi, Jacob, Manuel, Marie

Core Binf facility, Biomed

2024-10-11

Some background

  • These slides are both a presentation and a small reference manual

  • We will try out the commands during the workshop

    • keep both slides and terminal open
  • Official reference documentation: genome.au.dk

Who we are

 

  • Sandbox: Samuele (BiRC, MBG) - samuele at birc.au.dk

  • Core facility: Per&Co (Biomed) - per.q at biomed.au.dk

  • Bioinformatics Cafe: https://abc.au.dk, abc at au.dk

Program

  • 11:00-12:00: GenomeDK’s basics, File System, File manipulation

  • 13:00-14:30: Projects, data transfer, virtual environments with conda

  • 14:30-15:00: Introduction to jobs with slurm commands

Get the slides

Webpage: https://hds-sandbox.github.io/GDKworkshops/

You can also open today’s slides at this short URL https://tinyurl.com/useGDK

  • Copy code from the button on the rightmost edge of the code box

  • The quizzes can be answered as you go through the slides

Software for command line

The basic software

PowerShell for Windows

Terminal for macOS and Linux

Customizable alternatives

Terminator for Linux - iTerm2 for macOS

Tabby for Linux, MacOS, Windows

GenomeDK’s ABC

Learn your way around the basics of the GenomeDK cluster.

Infrastructure

GenomeDK is a computing cluster, i.e. a set of interconnected computers (nodes). GenomeDK has:

  • computing nodes used for running programs (~15000 cores)
  • storage nodes storing data in many hard drives (~23 PiB)
  • a network making nodes talk to each other
  • a frontend node from which you can send your programs to a node to be executed
  • a queueing system called slurm to prioritize the users’ programs

Access

  • Creating an account happens through this form at genome.au.dk

  • Logging into GenomeDK happens through the command

    [local]$  ssh username@login.genome.au.dk
  • When first logged in, setup the 2-factor authentication by

    • showing a QR-code with the command

      [gdk]$  gdk-auth-show-qr
    • scanning it with your phone’s Authenticator app.

Access without password

It is nice to avoid typing the password at every access. If you are on the cluster, exit from it to go back to your local computer

[gdk]$  exit

 

Now we set up a public-key authentication. We generate a key pair (public and private):

[local]$  ssh-keygen -t ed25519

Press Enter at every prompt and do not set a passphrase when asked.

and create a folder on the cluster called .ssh to contain the public key

[local]$  ssh username@login.genome.au.dk mkdir -p .ssh

 

and finally send the public key to the cluster, into the file authorized_keys

[local]$  cat ~/.ssh/id_ed25519.pub | ssh username@login.genome.au.dk 'cat >> .ssh/authorized_keys'

 

After this, your local private key will be matched against the public key stored on GenomeDK every time you log in, so you will not need to type your password.
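Optionally, you can also add a host alias to your local ~/.ssh/config so you do not have to type the full address each time. This is just a sketch: the alias gdk is a made-up name, and username must be replaced with your GenomeDK username.

# file ~/.ssh/config on your LOCAL machine
Host gdk
    HostName login.genome.au.dk
    User username
    IdentityFile ~/.ssh/id_ed25519

Afterwards, ssh gdk is enough to log in.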

File System (FS) on GenomeDK

  • Directory structure
  • Absolute and Relative Paths
  • Important folders
  • Navigate the FS on the command line

How the FS is organized

Folders and files follow a tree-like structure

  • / is the root folder of the filesystem - nothing is above that
  • the FS is shared across all machines and available to all users
  • home and faststorage are two of the folders in the root
  • projects are in faststorage and linked to your home

  • you can reach files and folders with a path

  • Examples:

    /home/username/coolProject/code/analysis.R
    
    /faststorage/project
  • Paths starting from the root are called absolute

Look at the file system tree and answer the following questions:

Home directory ~ and relative paths

 

First of all, log into the cluster

[local]$  ssh username@login.genome.au.dk

 

Tip

Use the up-arrow key in the terminal to find the commands you used previously, and press Enter when you find the login command

After logging in, you will find yourself in your private home folder, denoted by ~ or equivalently /home/username/. Your prompt will look like this:

 

[username@node ~] 

which follows the format [username@node current_folder].

 

Warning

  • Do not fill up your home folder with data. It has a limited amount of storage (a quota of 100GB).
  • Your home folder is private: only you can access it

Working directory

The folder in which you are currently located is called the working directory (WD). The WD is usually shown in your prompt. Use the following command to see the WD path starting from the root:

pwd

 

Every command you run refers to your WD. Execute

ls

 

and you will see the list of files and folders in your WD.

Populating with files and folders

Try to create an empty file now with

touch emptyFile.txt

 

and create a folder, which will be inside the WD:

mkdir myFolder

 

If you run the ls command again, the new file and folder will show up in the WD.

How do you see the directory tree of the WD? Try tree with a depth of only 2 levels:

tree -L 2 .

 

Note

. denotes the WD, and is the default when you do not specify it. Retry the command above using .. (one directory above in the file system) instead of .

We want to get a file from the internet to the folder myFolder. We can use wget:

 

wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz\
     -O ./myFolder/data.fastq.gz

 

Options and manuals

-O is the option to give a path and name for the downloaded file.

Most commands have a help option and a manual page describing the syntax and options. For example, you can use wget --help or man wget.

A bit more on absolute and relative paths

Reminder about the path types

The path to a file/folder can be:

  • absolute: starts from the root
  • relative: starts from the WD (the WD acts as the root)

 

To look inside myFolder, we can write either the relative path

ls myFolder/

or the absolute path (~ expands to /home/username):

ls ~/myFolder/

The WD is everywhere

The working directory is a very useful concept, not limited to Linux/GenomeDK, but used very widely in computer applications.

For example, when you work in R or Python, there is a default WD which you can change.

Changing WD

Changing WD can be useful. To set the WD inside myFolder use


cd myFolder

 

and verify with pwd the new working directory path.

Working with Files

Moving, downloading, manipulating and other basic operations on files.

File formats

Many files you use in bioinformatics are nothing more than text files written in a specific manner. This specific way of arranging the text defines many of the file formats you encounter when doing bioinformatics.

 

Note

Some file formats are not encoded as plain ASCII text and usually cannot be read with a text editor.

A text file is human-readable with any text reader or editor, and is composed of only ASCII characters, such as in the fastq file format

A binary file contains characters other than ASCII. For example, the bam file format is binary and can be read with the samtools software.
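As a small illustration (assuming samtools is available, for example installed later via conda; file.bam is a placeholder name), a bam file can be turned into human-readable text like this:

# print the first alignment records of a binary bam file as text
samtools view file.bam | head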

Let’s get ready. Be sure you are in myFolder (check with pwd), otherwise use

cd ~/myFolder

 

Now, you can decompress the file data.fastq.gz, which is in gz compressed format:

gunzip data.fastq.gz

Tip

For compressing a file into gz format, you can use gzip. For compressing and decompressing in zip format, you also have the commands zip and unzip.
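A minimal sketch of the syntax (the file names are placeholders - no need to run these now):

gzip myfile.txt                  # compress, creates myfile.txt.gz
gunzip myfile.txt.gz             # decompress it again
zip archive.zip myfile.txt       # put a file into a zip archive
unzip archive.zip                # extract it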

Less for reading files

less is perfect for reading text files: you can scroll with the arrows, and quit by pressing q. Try


less data.fastq

  The very first sequence you see should be

@HISEQ_HU01:89:H7YRLADXX:1:1101:1116:2123 1:N:0:ATCACG
TCTGTGTAAATTACCCAGCCTCACGTATTCCTTTAGAGCAATGCAAAACAGACTAGACAAAAGGCTTTTAAAAGTCTA
ATCTGAGATTCCTGACCAAATGT
+
CCCFFFFFHHHHHJJJJJJJJJJJJHIJJJJJJJJJIJJJJJJJJJJJJJJJJJJJHIJGHJIJJIJJJJJHHHHHHH
FFFFFFFEDDEEEEDDDDDDDDD

The first line is metadata, the second is the sequence, then you have a separator line (containing only the symbol +), and finally the quality scores (encoded by letters as in this table).

Exercise

Search online (or with less --help) for how to look for a specific word in a file with less. Then visualize the data with less, and try to find out if there is any sequence of ten adjacent Ns (that is, ten missing nucleotides). Then, answer the question below

Counting

How many lines are there in your file? The command wc can show that to you:

wc -l data.fastq

 

The file has 100000 lines, or 25000 sequences (each sequence is defined by 4 lines).

 

Tip

wc has many functionalities. As always, look at the manual or examples to see how you can use it in many other ways.
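For example, a quick sketch of other common wc options:

wc -w data.fastq        # count words
wc -c data.fastq        # count bytes
wc -l -w -c data.fastq  # lines, words and bytes at once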

Copy and Move

cp can copy one or more files - we use it on our data:

cp data.fastq dataCopy.fastq

 

mv moves a file into another folder - here we move it within our WD, which simply renames it:

mv data.fastq ./dataOriginal.fastq

Now use ls -lah and you will see two files with identical sizes and different timestamps.

 

Well, we changed our mind and do not want a copy of our data. Remove it with

rm dataCopy.fastq

Forever away

There is no trash bin - removed files are lost forever - with no exception

Writing to a file

Write something to a file using >:

head -4 dataOriginal.fastq > smallFile.fastq

 

writes the first four lines of the data into smallFile.fastq.

 

Warning

Using > again will overwrite the file!

Print it out on the screen:

cat smallFile.fastq

 

Avoid overwriting by appending with >>:

tail -4 dataOriginal.fastq >> smallFile.fastq

appends the last 4 lines of the data to smallFile.fastq. Check again using cat or wc -l.

Piping

You can create small pipelines directly in the shell with the symbol |. The output of one command is sent as input to the next command when you put | in between. For example,

grep NNNNN dataOriginal.fastq

finds the pattern NNNNN in the data.

How do we find it in the first hundred sequences only? Easy! We pipe head into grep:

head -400 dataOriginal.fastq | grep NNNNN

That pipe produced only a small output on the screen - but outputs can be huge! We can count the number of matching sequences by piping again into wc:

head -400 dataOriginal.fastq | grep NNNNN | wc -l

 

Compendium for file manipulation

List Files and Directories

  • ls: List files and directories in the current directory.
  • ls -l: List in long format (detailed information).
  • ls -a: List all files, including hidden ones (starting with .).
  • ls -lh: List with human-readable file sizes (e.g., KB, MB).
  • ls -R: Recursively list files in directories and subdirectories.

Copy Files and Directories

  • cp source_file destination: Copy a file to a destination.
  • cp file1 file2 dir/: Copy multiple files to a directory.
  • cp -r dir1 dir2: Recursively copy a directory and its contents.

Move (or Rename) Files and Directories

  • mv source_file destination: Move a file to a new location or rename it.
  • mv file1 file2 dir/: Move multiple files to a directory.
  • mv oldname newname: Rename a file or directory.

Remove Files and Directories

  • rm file: Remove a file.
  • rm -f file: Force remove a file (suppress confirmation).
  • rm -r dir: Recursively remove a directory and its contents.
  • rm -rf dir: Forcefully and recursively remove a directory and its contents (use with caution).

Create Directories

  • mkdir dir_name: Create a new directory.
  • mkdir -p parent_dir/child_dir: Create a directory with parent directories as needed.

Change File Permissions

  • chmod 644 file: Set read/write for owner, and read-only for group and others.
  • chmod 755 file: Set read/write/execute for owner, and read/execute for group and others.
  • chmod +x file: Add execute permission to a file.
  • chmod -R 755 dir: Recursively change permissions for a directory and its contents.

Change File Ownership

  • chown user file: Change the ownership of a file.
  • chown user:group file: Change the owner and group of a file.
  • chown -R user:group dir: Recursively change ownership of a directory and its contents.

File Information

  • file filename: Display the type of a file.
  • stat filename: Show detailed information about a file (size, permissions, timestamps).
  • du -sh file/dir: Display the disk usage of a file or directory (in human-readable format).

Create and View Files

  • touch filename: Create an empty file or update the timestamp of an existing file.
  • cat filename: View the contents of a file.
  • less filename: View the contents of a file, with navigation.
  • head -n 10 filename: View the first 10 lines of a file.
  • tail -n 10 filename: View the last 10 lines of a file.
  • ln file link_name: Create a hard link.
  • ln -s target link_name: Create a symbolic (soft) link.

Compendium for less

Basic Navigation

  • Move Forward:
    • Space or f: Scroll forward by one page.
    • Down Arrow or j: Scroll down by one line.
    • d: Scroll down by half a page.
  • Move Backward:
    • b: Scroll backward by one page.
    • Up Arrow or k: Scroll up by one line.
    • u: Scroll up by half a page.
  • Go to Specific Line or Position:
    • G: Go to the end of the file.
    • g: Go to the beginning of the file.
    • numberG or number%: Go to a specific line or percentage in the file.

Searching

  • Search Forward:
    • /pattern: Search forward for a pattern (use n to move to the next match).
  • Search Backward:
    • ?pattern: Search backward for a pattern (use N to move to the previous match).
  • Repeat Last Search:
    • n: Repeat the last search in the same direction.
    • N: Repeat the last search in the opposite direction.

Display Line Numbers

  • Show Line Numbers:
    • -N or --LINE-NUMBERS: Show line numbers (must start less with this option).

Marking Positions

  • Set a Mark:
    • m<letter>: Mark the current position with a letter.
  • Jump to a Mark:
    • '<letter>: Return to the marked position.

Exiting

  • Quit less:
    • q: Exit less.

Scrolling Long Lines

  • Move Left and Right (For Long Lines):
    • Right Arrow: Scroll right.
    • Left Arrow: Scroll left.

File Manipulation

  • Open Another File:
    • :e filename: Open another file while inside less.
  • View Multiple Files:
    • :n: Go to the next file (if multiple files were opened).
    • :p: Go to the previous file.

Miscellaneous

  • Follow File in Real Time:
    • F: Continuously view a file as it grows (like tail -f).
  • Show Current Filename:
    • =: Show the current file name, line number, and percentage through the file.
  • Help Menu:
    • h: Display help with all available commands.

View Line Numbers Temporarily (without restarting less)

  • Toggle Line Numbers:
    • -N: While in a session, use this to toggle line number display.

Last thing before lunch:

We meet at 13:00 in 1533-103


Project management

  • What GDK projects are
  • How to track resource usage
  • How to organize a project

GDK projects

What is a project

Projects are contained in /faststorage/project/, and are simple folders with some perks:

  • you have to request their creation from the GDK administrators
  • access is limited to you and the users you invite
  • CPU, GPU, storage and backup usage are registered under the project for each user
  • you can keep track of per-project and -user resource usage

Example of a project managed by the user you, with two invited users: you has requested the creation of coolProject, manages the project, and has invited two users to it.

Common-sense in project creation

  • Do not request many different small projects; make larger, comprehensive ones instead
    • No-go example: 3 projects bulkRNA_mouse, bulkRNA_human, bulkRNA_apes with the same invited users
    • Good example: one project bulkRNA_studies with subfolders bulkRNA_mouse, bulkRNA_human, bulkRNA_apes.
  • Why? Projects cannot be deleted, so they keep accumulating

Creation

Request a project (after login on GDK) with the command

gdk-project-request -g <project_name>

 

After GDK approval, a project folder with the desired name appears in ~ and /faststorage/project. You should be able to set the WD into that folder:

cd <project_name>

or

cd ~/<project_name>

User management

Only the creator (owner) can see the project folder. You (and only you) can add a user

gdk-project-add-user -g <project name> -u <username>

 

or remove it

gdk-project-remove-user -g <project name> -u <username>

Users can also be promoted to have administrative rights in the project

gdk-project-promote-user -g <project name> -u <username>

 

or demoted from those rights

gdk-project-demote-user -g <project name> -u <username>

Accounting

You can see the overall monthly resource usage of your projects with

gdk-project-usage

 

Example output:

project               period  billing hours  storage (TB)  backup (TB)  storage files  backup files
HDSSandbox            2024-8          44.58          0.09         0.00           6024             0
HDSSandbox            2024-9          25.38          0.09         0.00           6025             0
ngssummer2024         2024-6           6.73          0.00         0.00              0             0
ngssummer2024         2024-7        7547.48          0.72         0.00          27479             0

More detailed usage: by users on a selected project  

You can break down the resource usage by user for a selected project with

gdk-project-usage -u -p <project-name>

 

Example output:

project        user               period  billing hours  storage (TB)  backup (TB)  storage files  backup files
ngssummer2024  sarasj             2024-7          77.98          0.02         0.00            528             0
ngssummer2024  sarasj             2024-8           0.00          0.02         0.00            528             0
ngssummer2024  savvasc            2024-7         223.21          0.02         0.00            564             0
ngssummer2024  savvasc            2024-8           0.00          0.02         0.00            564             0
ngssummer2024  simonnn            2024-7         173.29          0.01         0.00            579             0
ngssummer2024  simonnn            2024-8           0.00          0.01         0.00            579             0

Accounting Tips

  • You can pipe the accounting output into grep to isolate specific users and/or months:
gdk-project-usage -u -p <project-name> | grep <username> | grep <period>

 

  • all the accounting outputs can be saved into a file, which you can later open, for example, as an Excel sheet.

Example:

gdk-project-usage > accountingGDK.csv

Private files or folders

If you have a folder or a file in the project which you do not want to share, use

chmod -R go-rwx <file or folder>

 

which you can revert using

chmod -R go+rwx <file or folder>
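To check the resulting permissions, you can for example run (the placeholder is the same file or folder as above):

ls -ld <file or folder>    # a fully private folder shows permissions like drwx------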

Folder management

Have a coherent folder structure - your future self will thank you.

An example of a structure which backs up raw data and analysis:

You can create it with a script, which the command below downloads and executes:

curl https://raw.githubusercontent.com/hds-sandbox/GDKworkshops/main/Scripts/populate_project.sh | bash

If your project has many users, a good structure can be

 

Do that with these commands

mkdir -p Backup/Data Workspaces/username1 Workspaces/username2
ln -s Backup/Data/ Data

and have each user run the script in their own folder:

cd Workspaces/username1
curl https://raw.githubusercontent.com/hds-sandbox/GDKworkshops/main/Scripts/populate_project.sh | bash

MUST-KNOWs for a GDK project

  • remove unused intermediate files
    • unused and forgotten objects fill up storage
  • back up only the established truth of your analysis
    • in other words, the very initial data of your analysis and the scripts
  • outputs made of many files should be removed or zipped together into a single archive (see the sketch below)
    • otherwise GDK indexes all of them: slow!!!

 

Backup cost >>> Storage cost >> Computation cost
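A minimal sketch of packing a folder full of small output files into a single archive (results/ is a placeholder name):

tar -czf results.tar.gz results/   # pack and compress the whole folder into one file
rm -r results/                     # remove the original only after checking the archive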

Downloads and Copies

  • Downloads from Internet to GDK
  • Uploads from a local PC to GDK
  • Downloads from GDK to a local PC
  • Transfer data between GDK and another cluster
  • Graphical interface for download/upload with GDK

Data transfer among the web, GDK and your PC is an everyday action which you can easily perform.

Warning

Downloads should always happen on the front-end nodes, and never on a compute node, when working on GenomeDK

Download with wget

wget is a utility for command-line-based downloads. It is already installed on GenomeDK and works with http, https, ftp protocols.

 

Example:

wget -O ./output.png \
     -c \
     -b \
     https://example.com/image.png

downloads a png file and saves it as output.png (option -O), downloads in the background (-b), and if the download was interrupted earlier, resumes it from where it stopped (-c).

wget has many options, but what is shown in the example above is what you need most of the time. You can see them all with the command

wget --help

 

You may also find this cheatsheet useful for remembering the commands for most of the things you might want to do when downloading files with wget. At this page there are also some concrete wget examples.
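For instance, a small sketch of downloading several files listed in a plain-text file (urls.txt is a hypothetical file with one URL per line):

# resumable background download of every URL listed in urls.txt
wget -c -b -i urls.txt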

SCP transfer

SCP (Secure Copy Protocol) can transfer files securely

  • between a LOCAL and a REMOTE host (your PC and GDK)
  • between TWO REMOTE hosts (GDK and another cluster)

 

You can use it to transfer files from your PC to GenomeDK and vice versa, but also between GenomeDK and another computing cluster (for example, to download data from a collaborator which resides on a different remote computing system).

To copy a file to GenomeDK from your local computer:

scp /home/my_laptop/Documents/file.txt \
    username@login.genome.au.dk:/home/username/my_project/

 

The inverse operation just changes the order of the sender and receiver:

scp username@login.genome.au.dk:/home/username/my_project/file.txt \
    /home/my_laptop/Documents/

If you want to copy an entire folder, use the option -r (recursive copy). The previous examples become

scp -r /home/my_laptop/Documents/folder \
       username@login.genome.au.dk:/home/username/my_project/

 

and

scp -r username@login.genome.au.dk:/home/username/my_project/folder \
       /home/my_laptop/Documents/

 

A few more options are available and you can see them with the command scp --help.

Interactive transfer

You can also transfer files with interactive software, such as FileZilla, which has an easy-to-use interface. Download FileZilla.

 

When done, open Filezilla and use the following information on the login bar:

  • Host: login.genome.au.dk
  • Username, Password: your GenomeDK username and password
  • Port: 22

Press on Quick Connect. As a result, you will establish a secure connection to GenomeDK. On the left-side browser you can see your local folders and files. On the right-side, the folders and files on GenomeDK starting from your home.

If you right-click on any local file or folder, you can upload it immediately, or add it to the transfer queue. The file will end up in the selected folder of the right-side browser.

The download process works similarly using the right-side browser and choosing the destination folder on the left-side browser.

If you have created a queue, this will be shown at the bottom of the window as a list. You can inspect destination folders from there and choose other options such as transfer priority.

To start the queue, use CTRL + P, the menu Transfer --> Process Queue, or press the corresponding button on the toolbar.

Package/Environment management

 

Properly managing your software and its dependencies is fundamental for reproducibility

Virtual environments

For reproducibility, each project needs specific software versions that depend on each other - without interfering with other projects.

 

Definition

A virtual environment keeps project-specific software and its dependencies separated

 

A package manager is software that can retrieve, download, install and upgrade packages easily and reliably

Conda

 

Conda is both a virtual environment manager and a package manager.

  • easy to use and understand
  • can handle quite big environments
  • environments are easily shareable
  • a large archive (Anaconda) of packages
  • active community of people archiving their packages on Anaconda

Installation

Just download and execute the installer by

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O miniforge.sh
chmod +x miniforge.sh
bash miniforge.sh -b
./miniforge3/bin/conda init bash

 

The -b option runs the installer in batch mode, so it should complete without prompts. When it is done, run

source ~/.bashrc

and double-check that conda works:

conda info

Conda Configuration

You can add some default channels where archived packages can be found. Here are some typical ones:

conda config --append channels bioconda
conda config --append channels genomedk
conda config --append channels r
conda config --append channels conda-forge

We tell conda to look through the channels strictly in the order specified above. We also avoid activating the base environment (where conda is installed) at login.

conda config --set channel_priority strict
conda config --set auto_activate_base false
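To verify the configuration you can, for example, list the configured channels:

conda config --show channels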

Base environment

base is the environment containing conda itself. The current environment is shown in your prompt in round brackets.

(environment) [samuele@fe-open-02 ~]$

We equip conda with the libmamba solver - a lot faster when installing many packages at once.

conda install -n base --yes conda-libmamba-solver
conda config --set solver libmamba

Don’t touch the Base

This is the only time you should install in the base environment! You might otherwise ruin the conda installation.

Create an environment

An empty environment called test_1:

conda create -n test_1

 

You can list all the environments available:

conda env list
> # conda environments:
> #
> base      *  /home/samuele/miniconda3
> test_1       /home/samuele/miniconda3/envs/test_1

Note

An environment is in reality a folder, which contains all installed packages and other configurations and utilities
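For example, you can peek into the environment’s folder (the exact path depends on where conda was installed - check the output of conda env list above):

ls ~/miniforge3/envs/test_1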

Activate and deactivate

To use an environment we activate it:

conda activate test_1

From now on, all installed software and packages will be available. (test_1) is now shown in your prompt.

 

Deactivation happens by

conda deactivate

Manage an environment

Package installation

Conda puts together the dependency trees of the requested packages to find a set of compatible dependency versions.

Figure: A package’s dependency tree with required versions on the edges

To install a specific package in your environment, search for it on anaconda.org:

Figure: search DeSeq2 for R

Figure: suggested commands to install the package

Repositories

Packages are archived in repositories (channels). Typical ones are bioconda, conda-forge, r, anaconda.

conda-forge packages are often more up-to-date, but occasionally show compatibility problems with other packages.
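Before installing, you can also check which versions of a package are available in a given channel, for example:

conda search -c bioconda bioconductor-deseq2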

Install a couple of packages in the activated environment - you can always add a version restriction to each package:

conda activate test_1
conda install "bioconda::bioconductor-deseq2<=1.42.0" conda-forge::r-tidyr=1.3.1  # quotes stop the shell from reading <= as a redirection

Note

To install these two packages, more than a hundred packages get installed! Those are all dependencies arising from combining the dependency trees.

 

Look for the package tidyr in your active environment:

conda list | grep tidyr

Installation from a list of packages

You can export all the packages you have installed over time in your environment:

conda env export --from-history > environment.yml

which looks like

name: test_1
channels:
 - bioconda
 - conda-forge
 - defaults
 - r
dependencies:
 - bioconda::bioconductor-deseq2
 - conda-forge::r-tidyr

The same command without --from-history will create a very long file with ALL dependencies:

name: test_1
channels:
  - bioconda
  - conda-forge
  - defaults
  - r
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - _r-mutex=1.0.1=anacondar_1
  - argcomplete=3.2.2=pyhd8ed1ab_0
  ...

This is guaranteed to work only on a system with the same OS and architecture as GenomeDK (Linux and x86)!

You can use the yml file to create an environment:

 

conda env create -p test_1_from_file -f ./environment.yml

 

Environment files are very useful when you want to share environments with others, especially when the package list is long.

Good practice: You want to install a lot of packages in an environment? Clone it first! If you break something, you still have the old copy.

conda create -p test_1_cloned --clone test_1

 

If installations in the cloned environment go fine, then you can remove it

conda env remove -p test_1_cloned

and repeat the installations on the original one.

Running a Job

 

Running programs on a computing cluster happens through jobs.

 

Learn how to get hold of computing resources to run your programs.

What is a job on an HPC

A computational task executed on requested HPC resources (computing nodes), which are handled by the queueing system (SLURM).

The command gnodes will tell you if there is heavy usage across the computing nodes

Usage of computing nodes. Each node has a name (e.g. cn-1001). The symbols for each node mean: running a program (O), assigned to a user (_), and available (.)
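You can run it at any time from the front end:

gnodes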

If you want to venture more into checking the queueing status, Moi has done a great interactive script in R Shiny for that.

Front-end nodes are limited in memory and power, and should only be used for basic operations such as

  • starting a new project

  • small file and folder management

  • small software installations

  • data transfer

and in general you should not use them to run computations, since this might slow down all other users on the front-end.

Interactive jobs

Useful to run a non-repetitive task interactively

Examples:

  • splitting that one bam file you just got by chromosome

  • open python/R and do some statistics

  • compress/decompress multiple files, maybe in parallel

Once you exit from the job, anything running in it will stop.

To run an interactive job simply use the command

[fe-open-01]$ srun --mem=<GB_of_RAM>g -c <nr_cores> --time=<days-hrs:mins:secs>  --account=<project_name> --pty /bin/bash

For example

[fe-open-01]$ srun --mem=32g -c 2 --time=6:0:0  --account=<project_name> --pty /bin/bash

The queueing system makes you wait based on the resources you ask for and on how busy the nodes are. When you are assigned a node, the resources are available and the node name is shown in the prompt.

[<username>@s21n32 ~]$
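When you are done, leave the interactive session (and free its resources) by exiting the shell:

exit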

Batch script (sbatch)

Useful to run a program non-interactively, usually for a longer time and without interaction from the user. A batch script contains

  • the desired resources
  • the sequence of commands to be executed

and

  • has a filename without spaces (forget spaces from now on)

Example

A file called align.sh with the following content:

#!/bin/bash
#SBATCH --account my_project
#SBATCH --cpus-per-task 8
#SBATCH --mem 16g
#SBATCH --time 04:00:00

#activate environment
eval "$(conda shell.bash hook)"
conda activate bam_tools
#index the reference file
bwa-mem2 index reference/chr2.fa
#align data
bwa-mem2 mem -t 8 reference/chr2.fa \
             genomes/S_Korean-2.region.fq.gz \
        | samtools sort \
            -@ 7 \
            -n \
            -O BAM \
        > alignment/S_Korean-2.sorted.bam

exit 0

Send the script to the queueing system:

sbatch align.sh
Submitted batch job 33735928

 

Interrogate SLURM about the specific job

jobinfo 33735928
>Name                : align.sh
>User                : samuele
>Account             : my_project
>Partition           : short
>Nodes               : s21n43
>Cores               : 8
>GPUs                : 0
>State               : RUNNING
>...

or about all the queued jobs

squeue -u <username>
or
squeue --me
>JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>33735928     short align.sh  samuele  R       1:12      1 s21n43

 

If you change your mind and want to cancel a job:

scancel 33735928

Tip

To observe in real time the output of the job, refresh the last lines of the log file for that job:

watch tail align.sh-33735928.out

 

To look at the whole log (not in real time), run at any time

less -S align.sh-33735928.out

Checking the log files can be useful for debugging, for example when a command gives an error and the job stops before its end.
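For example, using grep (seen earlier) you can quickly scan the log of the job above for error messages:

grep -i error align.sh-33735928.out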

Closing the workshop

  • There is more in these slides than what we went through

  • Updated over time, use as a reference

  • Impossible to cover everything at once. We will also run advanced/pipeline workshops

  • Come to our Cafe and/or ask

  • Documentation on genome.au.dk

A taste of the next workshops

  • virtual terminals with tmux
  • git setup
  • advanced functionalities
    • awk for advanced text file manipulation
    • rsync for synchronization of data
  • browser-based applications
  • launch containers
  • gwf pipelines

Your input for topics and evaluation