Day 1 - Part 2
Put your learning to the test with what you’ve covered so far.
A. General knowledge
G.4. Workflow managers can run different tasks concurrently if there are no dependencies between them (True or False)
G.5. A workflow manager can execute a single parallelized task on multiple nodes in a computing cluster (True or False)
B. Snakemake
In this section, we will work with a tabular metadata file, samples_1kgp_test, which contains information about the samples from the 1000 Genomes Project. Each row represents a sample, and the columns contain attributes for that sample (ID, population, super population, and sex). The file is designed to be small and simple, making it easy to work with and manipulate across tasks; the specifics of the data aren’t the primary focus.
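To make the file layout concrete, here is a minimal Python sketch of reading a tab-separated metadata table and grouping its rows by super population. The column names (id, population, super_population, sex) and the three toy rows are assumptions made up for illustration; check the header of samples_1kgp_test for the real names.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the metadata file. Rows and column names are invented
# for illustration and are NOT taken from samples_1kgp_test.
TOY_TSV = (
    "id\tpopulation\tsuper_population\tsex\n"
    "sampleA\tGBR\tEUR\tfemale\n"
    "sampleB\tCHB\tEAS\tmale\n"
    "sampleC\tFIN\tEUR\tmale\n"
)


def group_by_column(handle, column):
    """Group the rows of a tab-separated file by the value found in `column`."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row[column]].append(row)
    return dict(groups)


by_superpop = group_by_column(io.StringIO(TOY_TSV), "super_population")
for superpop, rows in sorted(by_superpop.items()):
    print(superpop, [r["id"] for r in rows])
```

To try it on the real file, replace io.StringIO(TOY_TSV) with an open file handle and adjust the column name to match the actual header.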
First, mount the following two drives, use the setup.sh initialization file, and request 2 CPUs so we can run things in parallel:
- YourNameSurname#xxxx: save your results and files here.
- hpclab-workshop: contains the input files and scripts. This directory is read-only (you have no write permissions).
Next, activate the snakemake environment.
conda deactivate
# make sure no env is active!
conda activate snakemake
We strongly recommend keeping the Snakemake documentation open for reference and guidance.
In this exercise, we will explore how rules are invoked in a Snakemake workflow. The snakefile is located at /work/HPCLab_workshop/rules/process_1kgp.smk. Now follow these steps and answer the questions:
1. Open the snakefile, named process_1kgp.smk, and try to understand every single line. If you ask Snakemake to generate the file results/all_female.txt, what tasks will be executed and in what sequence?
2. Open a terminal and navigate to your personal drive (cd /work/YourNameSurname#xxxx). Create a project directory called, for example, hpclab and make it your working directory. You should save all the results here!
3. Dry-run the workflow and check the number of jobs that will be executed.
Q.1. How many jobs will Snakemake run?
4. Run the workflow from the directory hpclab (the one you just created on your personal drive). Use the long-form flags --snakefile </path/to/snakefile>.smk --cores 1, or the abbreviated form -s </path/to/snakefile>.smk -c 1.
5. Verify that the output has been successfully generated and saved in your working directory (navigate through the project).
Q.2. Has Snakemake created a subdirectory that didn’t previously exist? What is its name?
Q.3. How many files with the extension *.tsv can you find in that subdirectory?
6. Dry-run the workflow again (from hpclab).
Q.4. Would Snakemake run any jobs based on the results of the dry-run?
7. Remove the files starting with E in your results folder (“EAS.tsv” and “EUR.tsv”) as well as all_female.txt. Then, dry-run once more.
Q.5. How many jobs will Snakemake run?
8. Under your working directory, create a folder named rules and copy the snakefile (process_1kgp.smk) to that folder so you can edit it! Then, open the file and remove lines 13-15, listed here. Without that rule, how can you run the workflow so that it generates all_male.txt instead, using only the command line?
process_1kgp.smk
13 rule all:
14     input:
15         expand("results/all_{sex}.txt", sex=["female"])
Q.6. Tip: what is missing at the end of the following command (i.e. what should be added to ensure all_male.txt is generated)?
snakemake -s process_1kgp.smk -c1
9. Let’s add a new rule that concatenates the two files you have generated (all_female.txt and all_male.txt) and saves the result into concatenated.txt. Remember, all files should be saved into the results subdir. Hint: cat file1.txt file2.txt > output.txt
10. Run the pipeline with your own version of the process_1kgp.smk file.
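The rule all in the snakefile builds its target list with expand("results/all_{sex}.txt", sex=[...]). As a rough mental model, the helper fills the pattern with every combination of the supplied wildcard values. The function below, expand_sketch, is a hypothetical pure-Python approximation written for this exercise; it is not Snakemake's own implementation and ignores its extra options.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    Hypothetical stand-in that mimics the basic behaviour of Snakemake's
    expand() for simple patterns; it is not the library's implementation.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# One wildcard with two values yields two target paths.
targets = expand_sketch("results/all_{sex}.txt", sex=["female", "male"])
print(targets)
```

With a single value, as in the original rule all, the list contains only results/all_female.txt, which is why that is the only final target Snakemake builds by default.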
- Tasks will be executed in this order: preprocess (1), split_by_superpop (5), and combine (1).
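This order follows from the dependencies between jobs: a job can only start once every job producing its inputs has finished. The sketch below reproduces the ordering with Python's graphlib module (Python 3.9 or later). The dependency structure is an assumption inferred from the rule names and job counts above, and the bracketed job labels are hypothetical; the five labels are the 1000 Genomes super population codes.

```python
from graphlib import TopologicalSorter

# The five 1000 Genomes super populations; job labels are illustrative only.
SUPERPOPS = ["AFR", "AMR", "EAS", "EUR", "SAS"]

# Map each job to the set of jobs it depends on (assumed structure:
# preprocess -> one split job per super population -> combine).
dependencies = {"preprocess": set()}
for sp in SUPERPOPS:
    dependencies[f"split_by_superpop[{sp}]"] = {"preprocess"}
dependencies["combine"] = {f"split_by_superpop[{sp}]" for sp in SUPERPOPS}

# static_order() yields each job only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The five split jobs have no dependencies on each other, so their relative order is arbitrary; with more than one core they could run concurrently.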
# 2. Create subdir
cd /work/AlbaRefoyoMartínez#0753/
mkdir hpclab
cd hpclab
# 3. dry run
snakemake -s /work/HPCLab_workshop/rules/process_1kgp.smk -n
# 4. run the workflow
snakemake -s /work/HPCLab_workshop/rules/process_1kgp.smk -c1
# 5. verify output
ls results/*
# 6. dry run
snakemake -s /work/HPCLab_workshop/rules/process_1kgp.smk -n
# 7. remove file(s) starting with E and the all_female.txt
rm results/E*.tsv results/all_female.txt
# 8. make a copy of the snakefile
mkdir rules
cp /work/HPCLab_workshop/rules/process_1kgp.smk rules/
# then open rules/process_1kgp.smk in an editor and remove lines 13-15
# 8. (Q.6) rerun, appending the target file <name_output> to the command
snakemake -s rules/process_1kgp.smk -c1 results/all_male.txt
# 9. create rule (add it to rules/process_1kgp.smk)
rule concat:
    input:
        "results/all_female.txt",
        "results/all_male.txt"
    output:
        "results/concatenated.txt"
    shell:
        "cat {input} > {output}"
# 10. Run again
snakemake -s rules/process_1kgp.smk -c1 results/concatenated.txt
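The dry runs in steps 6 and 7 behave differently because Snakemake only schedules jobs whose outputs need to be (re)created. The following is a deliberately simplified Python model of that decision: it checks only whether output files exist, whereas Snakemake also considers other triggers such as file modification times. The function jobs_to_run, the file names, and the job labels are all hypothetical and only loosely modelled on the exercise.

```python
def jobs_to_run(target, producers, existing):
    """Return the jobs needed to build `target`, dependencies first.

    `producers` maps an output file to (job_name, list_of_input_files).
    In this simplified model a job is needed if its output is missing
    or if any job producing one of its inputs has to run.
    """
    needed = []

    def visit(path):
        if path not in producers:  # a source file: nothing to run
            return False
        job, inputs = producers[path]
        # Visit every input (no short-circuit) so all upstream jobs are found.
        upstream = [visit(p) for p in inputs]
        must_run = path not in existing or any(upstream)
        if must_run and job not in needed:
            needed.append(job)
        return must_run

    visit(target)
    return needed


# Hypothetical layout with two super populations only.
producers = {
    "results/clean.tsv": ("preprocess", ["samples.tsv"]),
    "results/EAS.tsv": ("split[EAS]", ["results/clean.tsv"]),
    "results/EUR.tsv": ("split[EUR]", ["results/clean.tsv"]),
    "results/all_female.txt": ("combine", ["results/EAS.tsv", "results/EUR.tsv"]),
}

# The split outputs and the final file were deleted; the preprocessed file remains.
existing = {"samples.tsv", "results/clean.tsv"}
plan = jobs_to_run("results/all_female.txt", producers, existing)
print(plan)
```

In this toy setup the preprocess job is skipped because its output is still present, while the deleted files and everything downstream of them are rebuilt. Compare this reasoning with what the real dry run reports.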