HPC Lab

Snakemake

Test your understanding of what you have covered so far.

A. General knowledge

Exercise I - General knowledge
G.1. What role does a workflow manager play in computational research?
G.2. What is the primary drawback of using shell scripts for automating computations?
G.3. What are the key features of a workflow manager in computational research? (Several possible solutions)

G.4. Workflow managers can run different tasks concurrently if there are no dependencies between them (True or False)

G.5. A workflow manager can execute a single parallelized task on multiple nodes in a computing cluster (True or False)

B. Snakemake

In this section, we will be working with a tabulated metadata file, samples_1kgp_test, which contains information about the samples from the 1000 Genomes Project. Each row represents a sample, and the columns contain various attributes for that sample (ID, population, super population, and sex). The file is designed to be simple and small, making it easy to work with and manipulate across tasks. However, the specifics of the data aren’t the primary focus.
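To get a feel for a tabulated metadata file like this one, a few standard shell tools are enough. The sketch below builds a tiny stand-in file first, so it runs anywhere; the file name toy_samples.tsv, the column names, and the rows are all made up for illustration and are not taken from the workshop data.

```shell
# Toy stand-in for the metadata file. Column names and rows are invented
# for illustration; the real file may use different headers and values.
printf 'id\tpopulation\tsuper_population\tsex\n' > toy_samples.tsv
printf 'S1\tGBR\tEUR\tmale\n' >> toy_samples.tsv
printf 'S2\tCHB\tEAS\tfemale\n' >> toy_samples.tsv
printf 'S3\tFIN\tEUR\tfemale\n' >> toy_samples.tsv

# Look at the header and the first rows.
head -n 3 toy_samples.tsv

# Count samples per super population (column 3), skipping the header line.
tail -n +2 toy_samples.tsv | cut -f3 | sort | uniq -c
```

Pointing the same commands at the real metadata file should show its actual columns and group sizes.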

First, mount the following two drives, select an initialization file, and ask for 2 CPUs so we can run things in parallel:

  • pipesOut/smk: save your results/files here (create the new subdir).
  • hpclab-workshop: contains input files and scripts. You have read-only access to this directory (no write permissions). The Snakemake file is located at: /work/HPCLab_workshop/pipes/rules/process_1kgp.smk.
  • Additional parameters - Initialization. Use a setup.sh script that provides the Snakemake software. You can also use ours: shared/hpclab-workshop/pipes/setup.sh

Next, activate the snakemake environment.

conda activate snakemake 

Finally, navigate to your working directory.

  1. Use the conda or pixi environment you have already created to run the exercises.

  2. All computations must be carried out through the queue.

Important: Do not run anything on the login node

Always start an interactive SLURM session before running any Apptainer commands. Do not run Apptainer on the login node.

srun --account DeiC-KU-L65 -t 00:00:45 --pty bash

Alternatively, you can submit a SLURM batch job instead of using an interactive session.

mybash.sh
#!/bin/bash
#SBATCH --account my_project
#SBATCH -c 1
#SBATCH --mem 1g

# COMMANDS HERE

  3. The necessary files can be found in smk-exercises. The Snakemake file is located at: rules/process_1kgp.smk.

  4. Exit the interactive job when you are done with the exercise!
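As a rough sketch of the batch route, the template above could be filled in as follows. The account name my_project is the placeholder from the template, and the Snakemake call assumes the snakefile sits at rules/process_1kgp.smk relative to where the job runs. How the snakemake environment becomes available inside the job depends on your cluster, so that part is left as a comment.

```shell
# Write the batch script to disk; nothing is submitted here.
cat > mybash.sh <<'EOF'
#!/bin/bash
#SBATCH --account my_project
#SBATCH -c 1
#SBATCH --mem 1g

# Assumption: the snakemake environment is already available in the job
# (how to activate it depends on your cluster setup).
snakemake -s rules/process_1kgp.smk -c1
EOF

# Check the script for syntax errors without running it.
bash -n mybash.sh && echo "mybash.sh parses cleanly"

# Submit it once the account name is filled in:
# sbatch mybash.sh
```

Checking the syntax with bash -n before submitting avoids waiting in the queue only to hit a typo.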

To run the exercises locally, download the Snakefile and the data required for this exercise using the links below. Then activate the environment in which the software is installed.

Create a folder named data in your working directory and move the input data (samples_1kgp.tsv) inside it. The Snakemake pipeline looks for the input in this location and will fail otherwise. If you prefer, you can instead update the pipeline to point to a different relative path.
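The folder setup described above might look like this. It is a minimal sketch that assumes samples_1kgp.tsv was downloaded into the current working directory; adjust the source path if your browser saved it elsewhere.

```shell
# Create the folder the pipeline expects (no error if it already exists).
mkdir -p data

# Assumption: samples_1kgp.tsv was downloaded into the current directory.
if [ -f samples_1kgp.tsv ]; then
    mv samples_1kgp.tsv data/
fi

# Confirm the input is where the pipeline will look for it.
ls -l data/
```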

We strongly recommend keeping the Snakemake documentation open for reference and guidance.

Exercise II - Exploring rule invocation in Snakemake

In this exercise, we will explore how rules are invoked in a Snakemake workflow.

  1. Navigate to your working directory and create a new directory, smk, to save the input for these exercises.

  2. Open the snakefile, named process_1kgp.smk, and try to understand every single line. If you request Snakemake to generate the file results/all_female.txt, what tasks will be executed and in what sequence?

  3. Dry-run the workflow: Check the number of jobs that will be executed.

    Q.1. How many jobs will Snakemake run?

  4. Run the workflow from the directory smk (the one you just created on your personal drive). Use the long flags --snakefile </path/to/snakefile>.smk --cores 1, or the abbreviated form -s </path/to/snakefile>.smk -c 1.

  5. Please verify that the output has been successfully generated and saved in your working directory (navigate through the project).

    Q.2. Has Snakemake created a subdirectory that didn’t previously exist? What is its name?

    Q.3. How many files with the extension *.tsv can you find in that subdirectory?

  6. Dry-run the workflow again (from smk).

    Q.4. Would Snakemake run any jobs based on the results of the dry-run?

  7. Remove files starting with E in your results folder (“EAS.tsv” and “EUR.tsv”) and all_female.txt. Then, dry-run once more.

    Q.5. How many jobs will Snakemake run?

  8. Under your working directory, create a folder named rules and copy the snakefile (process_1kgp.smk) to that folder so you can edit it. Then open the copy and remove lines 13-15. How can you now run the workflow so that it generates all_male.txt instead, using only the command line?

    process_1kgp.smk
    13 rule all:
    14    input:
    15       expand("results/all_{sex}.txt", sex=["female"])

    Q.6. Tip: what is missing at the end of this command (i.e. what should be added to ensure all_male.txt is generated)? snakemake -s process_1kgp.smk -c1

  9. Let’s add a new rule that concatenates the two files you have generated (all_female.txt and all_male.txt) and saves the result to concatenated.txt. Remember, all files should be saved in the results subdir. Hint: cat file1.txt file2.txt > output.txt

  10. Run the pipeline with your own version of the process_1kgp.smk file.
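While working through the steps above, it can help to keep a record of what Snakemake plans to do before anything runs. The sketch below lists the rules and saves a dry run to a file; it assumes your editable copy lives at rules/process_1kgp.smk, and the log name dryrun.log is an arbitrary choice. The flags used (--list, -n, -p) are standard Snakemake options, but check snakemake --help for your installed version.

```shell
# Assumption: this is where your editable copy of the snakefile lives.
SNAKEFILE="rules/process_1kgp.smk"

if command -v snakemake >/dev/null 2>&1 && [ -f "$SNAKEFILE" ]; then
    # List the rules defined in the snakefile.
    snakemake -s "$SNAKEFILE" --list

    # Dry run (-n) that also prints the shell commands (-p); keep a copy in a log.
    snakemake -s "$SNAKEFILE" -n -p | tee dryrun.log
else
    # Nothing to inspect yet: record why, so the log is never silently missing.
    echo "snakemake or $SNAKEFILE not found; activate your environment and check the path" | tee dryrun.log
fi
```

Comparing the log from before and after deleting output files makes it easy to see which jobs Snakemake considers outdated.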

Solution
# 1. Create subdir
mkdir smk
cd smk
# 2. Tasks will be executed in this order: preprocess (1), split_by_superpop (5), and combine (1).
# 3. Dry run 
snakemake -s <PATH/TO/rules/process_1kgp.smk> -n 
# 4. Run the workflow 
snakemake -s <PATH/TO/rules/process_1kgp.smk> -c1 
# 5. Verify output 
ls results/*
# 6. Dry run 
snakemake -s <PATH/TO/rules/process_1kgp.smk> -n 
# 7. Remove file(s) starting with E and the all_female.txt
rm results/E*.tsv results/all_female.txt
# 8. Make a copy of the snakefile and remove the lines 
mkdir rules 
cp <PATH/TO/rules/process_1kgp.smk> rules/
# 8. Rerun, passing the desired output file as the target at the end of the command
snakemake -s rules/process_1kgp.smk -c1 results/all_male.txt 
# 9. Create rule 
rule concat: 
   input: 
      "results/all_female.txt",
      "results/all_male.txt"
   output:
      "results/concatenated.txt"
   shell:
      "cat {input} > {output}"
# 10. Run again 
snakemake -s rules/process_1kgp.smk -c1 results/concatenated.txt 
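One quick way to sanity-check a concatenation like the one in the concat rule is to compare line counts. The sketch below uses a made-up toy_results folder with invented contents, so it runs without the pipeline; with real output you would point the same commands at results/ instead.

```shell
# Toy stand-ins for results/all_female.txt and results/all_male.txt (made-up content).
mkdir -p toy_results
printf 'S2\nS3\n' > toy_results/all_female.txt
printf 'S1\n' > toy_results/all_male.txt

# Same command the concat rule runs, with {input} and {output} filled in.
cat toy_results/all_female.txt toy_results/all_male.txt > toy_results/concatenated.txt

# The line count of the result should equal the sum of the inputs.
expected=$(cat toy_results/all_female.txt toy_results/all_male.txt | wc -l)
actual=$(wc -l < toy_results/concatenated.txt)
[ "$expected" -eq "$actual" ] && echo "line counts match"
```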

Copyright

CC-BY-SA 4.0 license