Day 1 - Part 2

Put your learning to the test with what you’ve covered so far.

B. Snakemake

In this section, we will be working with a tabulated metadata file, samples_1kgp_test, which contains information about the samples from the 1000 Genomes Project. Each row represents a sample, and the columns contain various attributes for that sample (ID, population, super population, and sex). The file is designed to be simple and small, making it easy to work with and manipulate across tasks. However, the specifics of the data aren’t the primary focus.

How to run snakemake on UCloud

First, mount the following two drives, use the setup.sh initialization file, and ask for 2 CPUs so we can run things in parallel:

YourNameSurname#xxxx: save your results/files here.
hpclab-workshop: contains input files and scripts. You can read-only access from this directory (no write permissions).

Next, activate snakemake environment.

conda deactivate 
# make sure no env is active!
conda activate snakemake

Are you running the exercises locally?

Download the Snakefile and data required for this exercise using the links below if you are running things locally.

We strongly recommend keeping the Snakemake documentation open for reference and guidance.

II - Exploring rule invocation in Snakemake

In this exercise, we will explore how rules are invoked in a Snakemake workflow. The snakemake file is located at: /work/HPCLab_workshop/rules/process_1kgp.smk. Now follow these steps and answer the questions:

Open the snakefile, named process_1kgp.smk, and try to understand every single line. If you request Snakemake to generate the file results/all_female.txt, what tasks will be executed and in what sequence?
Open a terminal and navigate to your personal drive cd /work/YourNameSurname#xxxx. Create a project directory called, for example, hpclab and make it your working directory. You should save all the results here!
Dry-run the workflow: Check the number of jobs that will be executed.

Q.1. How many jobs will Snakemake run?
Run the workflow from the directory hpclab (the one you just created on your personal drive). Use the name flag --snakefile </path/to/snakefile>.smk --cores 1, or the abbreviated format -s </path/to/snakefile>.smk -c 1.
Please verify that the output has been successfully generated and saved in your working directory (navigate through the project).

Q.2. Has Snakemake created a subdirectory that didn’t previously exist? What is its name?

Q.3. How many files with the extension *.tsv can you find in that subdirectory?
Dry-run the workflow again (from hpclab).

Q.4. Would Snakemake run any jobs based on the results of the dry-run?

No. There is nothing to be done. All requested files are present and up to date Yes. Reasons: input files updated by another job (all, combine, split_by_superpop) and output files have to be generated (combine, preprocess, split_by_superpop)
Remove files starting with E in your results folder (“EAS.tsv” and “EUR.tsv”) and all_female.txt. Then, dry-run once more.

Q.5. How many jobs will Snakemake run?
Under your working directory, create a folder named rules and copy the snakefile (process_1kgp.smk) to that folder so you can edit it! Then, open the file and remove lines 13-15. How else can you run the workflow but generate instead all_male.txt using only the command line?
```
process_1kgp.smk
```
```
13 rule all:
14    input:
15       expand("results/all_{sex}.txt", sex=["female"])
```
Q.6. Tip: what is missing at the end of the command (e.g. what should be added to ensure all_male.txt is generated)? snakemake -s process_1kgp.smk -c1
Let’s add a new rule that concatenates the two files you have generated (all_female.txt and all_male.txt) and saves them into concatenated.txt. Remember, all files should be saved into the results subdir. Hint: cat file1.txt file2.txt > output.txt
Run the pipeline with your own version of the process_1kgp.smk file.

Solution

Tasks will be executed in this order: preprocess (1), split_by_superpop (5), and combine (1).

# 2. Create subdir 
cd /work/AlbaRefoyoMartínez#0753/
mkdir hpclab
cd hpclab
# 3. Dry run 
snakemake -s /work/HPCLab_workshop/rules/process_1kgp.smk -n 
# 4. Run the workflow 
snakemake -s HPCLab_workshop/rules/process_1kgp.smk -c1 
# 5. Verify output 
ls results/*
# 6. Dry run 
snakemake -s /work/HPCLab_workshop/rules/process_1kgp.smk -n 
# 7. Remove file(s) starting with E and the all_female.txt
rm results/E*.tsv results/all_female.txt
# 8. Make a copy of the snakefile and remove the lines 
mkdir rules 
cp /work/HPCLab_workshop/rules/process_1kgp.smk rules/
# 8. S5. rerun again with the <name_output>
snakemake -s rules/process_1kgp.smk -c1 results/all_male.txt 
# 9. Create rule 
rule concat: 
   input: 
      "results/all_female.txt",
      "results/all_male.txt"
   output:
      "results/concatenated.txt"
   shell:
      "cat {input} > {output}"
# 10. Run again 
snakemake -s rules/process_1kgp.smk -c1 results/concatenated.txt

Copyright

CC-BY-SA 4.0 license

A. General knowledge

B. Snakemake

Copyright