Snakemake
Put your learning to the test with what you’ve covered so far.
A. General knowledge
G.4. Workflow managers can run different tasks concurrently if there are no dependencies between them (True or False)
G.5. A workflow manager can execute a single parallelized task on multiple nodes in a computing cluster (True or False)
B. Snakemake
In this section, we will be working with a tabulated metadata file, samples_1kgp_test, which contains information about the samples from the 1000 Genomes Project. Each row represents a sample, and the columns contain various attributes for that sample (ID, population, super population, and sex). The file is designed to be simple and small, making it easy to work with and manipulate across tasks. However, the specifics of the data aren’t the primary focus.
First, mount the following two drives, select an initialization file, and ask for 2 CPUs so we can run things in parallel:
pipesOut/smk: save your results/files here (create the new subdir).
hpclab-workshop: contains input files and scripts. You have read-only access to this directory (no write permissions). The snakemake file is located at: /work/HPCLab_workshop/pipes/rules/process_1kgp.smk.
Additional parameters - Initialization: use a setup.sh script that provides the snakemake software. You can also use ours,
shared/hpclab-workshop/pipes/setup.sh
Next, activate the snakemake environment.
conda activate snakemake
Finally, navigate to your working directory.
Use the conda or pixi environment you have already created to run the exercises.
All computations must be carried out through the queue.
Always start an interactive SLURM session before running any Apptainer commands. Do not run Apptainer on the login node.
srun --account DeiC-KU-L65 -t 00:00:45 --pty bash
Alternatively, you can submit a SLURM batch job instead of using an interactive session.
mybash.sh
#!/bin/bash
#SBATCH --account my_project
#SBATCH -c 1
#SBATCH --mem 1g
# COMMANDS HERE
The necessary files can be found in smk-exercises. The snakemake file is located at: rules/process_1kgp.smk.
Exit the interactive job when you are done with the exercise!
Download the Snakefile and data required for this exercise using the links below to run the exercises locally. Activate the environment where you have the software installed.
Create a folder named data in your working directory and move the input data (samples_1kgp.tsv) inside it. The Snakemake pipeline looks for the input in this location and will fail otherwise. If you prefer, you can instead update the pipeline to point to a different relative path.
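The folder setup described above can be sketched in a few shell commands. This is only an illustration: it runs in a scratch directory and uses an empty placeholder in place of the real samples_1kgp.tsv, on the assumption that the downloaded file sits in your working directory.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder standing in for the downloaded samples_1kgp.tsv (assumption:
# the real file is in your working directory after the download).
touch samples_1kgp.tsv

# Create the data folder and move the input file into it.
mkdir -p data
mv samples_1kgp.tsv data/

# Confirm the file is where the pipeline looks for it.
ls data
```

In your own working directory, only the mkdir and mv lines are needed.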
We strongly recommend keeping the Snakemake documentation open for reference and guidance.
In this exercise, we will explore how rules are invoked in a Snakemake workflow.
Navigate to your working directory and create a new one, smk, to save the input for these exercises.
Open the snakefile, named process_1kgp.smk, and try to understand every single line. If you request Snakemake to generate the file results/all_female.txt, what tasks will be executed and in what sequence?
Dry-run the workflow: check the number of jobs that will be executed.
Q.1. How many jobs will Snakemake run?
Run the workflow from the directory smk (the one you just created on your personal drive). Use the long flags --snakefile </path/to/snakefile>.smk --cores 1, or the abbreviated format -s </path/to/snakefile>.smk -c 1.
Please verify that the output has been successfully generated and saved in your working directory (navigate through the project).
Q.2. Has Snakemake created a subdirectory that didn’t previously exist? What is its name?
Q.3. How many files with the extension *.tsv can you find in that subdirectory?
Dry-run the workflow again (from smk).
Q.4. Would Snakemake run any jobs based on the results of the dry-run?
Remove the files starting with E in your results folder ("EAS.tsv" and "EUR.tsv") and all_female.txt. Then, dry-run once more.
Q.5. How many jobs will Snakemake run?
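The removal step relies on shell globbing: results/E*.tsv matches only the files whose names begin with E. Below is a minimal sketch in a scratch directory, with empty placeholder files standing in for the real outputs. Only EAS.tsv and EUR.tsv are confirmed by the text; the names AFR.tsv, AMR.tsv and SAS.tsv are an assumption based on the 1000 Genomes super-population codes.

```shell
# Scratch directory with placeholder outputs (file names partly assumed).
demo=$(mktemp -d)
mkdir -p "$demo/results"
cd "$demo"
touch results/AFR.tsv results/AMR.tsv results/EAS.tsv results/EUR.tsv results/SAS.tsv
touch results/all_female.txt

# The glob E*.tsv expands to EAS.tsv and EUR.tsv only.
rm results/E*.tsv results/all_female.txt

# List what is left, on a single line.
remaining=$(ls results | tr '\n' ' ')
echo "$remaining"
```

Snakemake plans jobs for requested outputs that are missing or out of date, so the following dry-run should list only the jobs needed to recreate the deleted files.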
Under your working directory, create a folder named rules and copy the snakefile (process_1kgp.smk) to that folder so you can edit it! Then, open the file and remove lines 13-15. How else can you run the workflow so that it generates all_male.txt instead, using only the command line?
process_1kgp.smk
13 rule all:
14     input:
15         expand("results/all_{sex}.txt", sex=["female"])
Q.6. Tip: what is missing at the end of the command (i.e. what should be added to ensure all_male.txt is generated)?
snakemake -s process_1kgp.smk -c1
Let’s add a new rule that concatenates the two files you have generated (all_female.txt and all_male.txt) and saves the result into concatenated.txt. Remember, all files should be saved in the results subdir. Hint: cat file1.txt file2.txt > output.txt
Run the pipeline with your own version of the process_1kgp.smk file.
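The hint works because cat writes its input files to standard output in the order given, and > redirects that stream into a new file. Here is a small sketch with throwaway files; file1.txt, file2.txt and their contents are placeholders, not workshop data.

```shell
# Scratch directory with two small placeholder files.
demo=$(mktemp -d)
cd "$demo"
printf 'a\nb\n' > file1.txt
printf 'c\n' > file2.txt

# Concatenate both files, in order, into output.txt.
cat file1.txt file2.txt > output.txt

# Show the combined content: the lines of file1.txt, then those of file2.txt.
cat output.txt
```

In a Snakemake shell directive, {input} with several files expands to a space-separated list, so a command of the form "cat {input} > {output}" follows the same pattern.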
- Tasks will be executed in this order: preprocess (1), split_by_superpop (5), and combine (1).
# 2. Create subdir
mkdir smk
cd smk
# 3. Dry run
snakemake -s <PATH/TO/rules/process_1kgp.smk> -n
# 4. Run the workflow
snakemake -s <PATH/TO/rules/process_1kgp.smk> -c1
# 5. Verify output
ls results/*
# 6. Dry run
snakemake -s <PATH/TO/rules/process_1kgp.smk> -n
# 7. Remove file(s) starting with E and the all_female.txt
rm results/E*.tsv results/all_female.txt
# 8. Make a copy of the snakefile and remove the lines
mkdir rules
cp <PATH/TO/rules/process_1kgp.smk> rules/
# 8. Rerun, passing the target file (<name_output>) at the end of the command
snakemake -s rules/process_1kgp.smk -c1 results/all_male.txt
# 9. Create rule (add it to rules/process_1kgp.smk)
rule concat:
    input:
        "results/all_female.txt",
        "results/all_male.txt"
    output:
        "results/concatenated.txt"
    shell:
        "cat {input} > {output}"
# 10. Run again
snakemake -s rules/process_1kgp.smk -c1 results/concatenated.txt 

