Snakemake

Modified: September 16, 2024

In this section, we will guide you through transitioning from bash scripts or notebooks to workflows. This approach is particularly useful as computations become more complex.

Basics

Snakemake is a text-based workflow management tool that uses a Python-based language with domain-specific syntax. Originally designed for scientific workflows in bioinformatics, Snakemake shares operational principles with make, which inspired it. Importantly, workflows in Snakemake are structured around data stored in files, and the tool supports parallel computing and cloud storage and can manage computational environments.

The workflow is decomposed into rules that define how to obtain output files from input files. Snakemake infers dependencies and determines the execution order automatically, offering significantly more flexibility than its predecessor, make.

  1. Semantics: define rules; each rule can use a different program (e.g. bash to run a shell command, Python, R, GNU core utilities)
Snakefile
rule dw_metadata:
  input: "path2/filename.tsv"
  output: "data/filename.tsv"
  shell:
    "wget -O {output} {input}"

rule split_superpop:
  input: "data/filename.tsv"
  output: "data/{superpop}.tsv"
  shell:
    "python process_pops.py {input} {output}"

rule avg_gender:
  input: "data/{superpop}.tsv"
  output: "data/{superpop}_{gender}.png"
  shell:
    "python statsPlot_gender.py {input} {output}"

The only mandatory flag in Snakemake is the number of cores (-c) or jobs (-j), which indicates the number of processes that can run in parallel. To run the workflow in Snakemake, you can either:

  1. Request Snakemake to generate a specific file (using the exact filename as defined in the rule’s output)
# Call the last output (of the pipeline or the one you want to generate)
snakemake -c2 data/EUR_female.png
  2. Specify the name of a rule or task. You can do this when wildcards are not present
# Call the last rule
snakemake -c1 dw_metadata
  3. Alternatively, and in very common practice, define what you want to run inside the Snakefile itself
rule all:
  input:
    expand("data/{superpop}_{gender}.png", superpop=["EUR"], gender=["female", "male"])
  2. Scaling up - rule generalisation using wildcards (e.g. from one to several datasets). You can refer to input and output files by index or by name (see the sketch after this list)

  3. Dependencies are determined top-down

For a given target, a rule that can be applied to create it is determined (a job). For the input files of that rule, this is repeated recursively. If no target is specified, Snakemake tries to apply the first rule in the Snakefile.

  4. Rule all: a target rule that collects the results
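
As an illustration of referring by name or by index, here is a minimal sketch (the rule name, the second input and the output path are hypothetical): named inputs are available as {input.table}, positional ones as {input[0]}, and wildcard values as {wildcards.superpop}.

rule example_refs:
  input:
    table="data/{superpop}.tsv",   # referenced by name as {input.table}
    meta="data/filename.tsv"       # referenced by index as {input[1]}
  output: "results/{superpop}_refs.txt"
  shell:
    # {wildcards.superpop} accesses the wildcard value by name
    "echo 'processing {wildcards.superpop}: {input.table} {input[1]}' > {output}"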

Job execution

A job is executed if and only if:

  • the output file is a target and does not exist
  • the output file is needed by another executed job and does not exist
  • an input file is newer than the output file
  • an input file will be updated by another job (e.g. changes in rules)
  • execution is forced (--forceall)
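
To see these triggers in action, a quick sketch using the example files above (the flags also appear in the command line section below):

# touching an input makes every job downstream of it out of date
touch data/filename.tsv
# dry-run (-n) with reasons (-r) shows which jobs would re-run and why
snakemake -n -r
# force re-execution of the whole workflow regardless of timestamps
snakemake -c4 --forceall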

You can plot the DAG (directed acyclic graph) of the jobs

Useful command line interface

# dry-run (-n), print shell commands (-p)
snakemake -n -p
# Snakefile with a different name or in another location
snakemake --snakefile path/to/file.smk
# dry-run (-n), print execution reason for each job
snakemake -n -r
# Visualise DAG of jobs using Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg

Defining resources

rule myrule:
  resources: mem_mb=100  # 100 MB memory allocation
  threads: 4
  shell:
    "command {threads}"

Let’s say the rule myrule defined above needs 4 threads. If we execute the workflow with 8 cores as follows:

snakemake --cores 8

This means that 2 ‘myrule’ jobs will be executed in parallel.

Jobs are scheduled to maximize parallelization, with high-priority jobs scheduled first, all while satisfying resource constraints. This means:

If we allocate 100MB for the execution of ‘myrule’ and we call snakemake as follows:

snakemake --resources mem_mb=100 --cores 8

Only one ‘myrule’ job can be executed at a time (you do not provide enough memory for 2). The memory resource is useful for memory-demanding jobs, to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time your full workflow will take. We highly recommend doing so: take a subset of your dataset and give it a go! Log files will come in very handy for the resource estimation. Of course, the execution of jobs also depends on the resources actually available (e.g. free CPU cores).
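
Snakemake can help with this estimation through a benchmark directive, which records the run time and memory usage of each job in a tab-separated file; a minimal sketch with illustrative paths (the log directive shown next complements it by capturing the command’s own output):

rule myrule:
  input: "data/input.tsv"
  output: "results/output.tsv"
  benchmark: "benchmarks/myrule.tsv"  # run time and memory usage are written here
  threads: 4
  shell:
    "command {threads} {input} {output}"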

rule myrule:
  log: "logs/myrule.log"
  threads: X
  shell:
    # redirect stdout and stderr of the command to the log file
    "command {threads} > {log} 2>&1"

Log files need to define the same wildcards as the output files; otherwise, you will get an error.
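
A sketch of a rule where the same {superpop} wildcard appears in the output, the log and the benchmark file (paths and the script name are illustrative):

rule stats_superpop:
  input: "data/{superpop}.tsv"
  output: "results/{superpop}_stats.tsv"
  log: "logs/stats_{superpop}.log"              # same wildcard as the output
  benchmark: "benchmarks/stats_{superpop}.tsv"  # benchmark paths follow the same rule
  shell:
    "python compute_stats.py {input} {output} > {log} 2>&1"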

Config files

You can also define values for wildcards or parameters in a config file. This is recommended when the pipeline will be used several times at different points in time, to avoid unwanted modifications to the workflow itself; parameterization is key for such cases.
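
A minimal sketch, assuming a config.yml with the (illustrative) keys superpops and genders:

# config.yml
superpops: ["EUR", "AFR"]
genders: ["female", "male"]

# Snakefile
configfile: "config.yml"

rule all:
  input:
    expand("data/{superpop}_{gender}.png",
           superpop=config["superpops"], gender=config["genders"])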

Cluster execution

When working on cluster systems, you can execute the workflow through the cluster’s submission command (e.g. qsub):

snakemake --cluster qsub 
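
In practice you usually also pass per-job resources to the scheduler and cap the number of queued jobs; a sketch assuming an SGE-style qsub (the exact resource flags depend on your scheduler; {threads} and {resources.mem_mb} are filled in per job by Snakemake):

snakemake --cluster "qsub -pe smp {threads} -l h_vmem={resources.mem_mb}M" --jobs 50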

Additional advanced features

  • modularization
  • handling temporary and protected files: very important for intermediate files that fill up storage, are not used in the long run, and can be deleted once the final output is generated. Snakemake does this automatically if you mark them as such in your pipeline (see the sketch after this list)
  • HTML5 reports
  • rule parameters
  • tracking tool versions and code changes: forces rerunning of older jobs when code and software are modified/updated
  • data provenance information per file
  • Python API for embedding Snakemake in other tools
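
For the temporary and protected files mentioned above, a minimal sketch (file names are illustrative): temp() marks an intermediate file for deletion once no remaining job needs it, and protected() write-protects a result that is expensive to recompute.

rule intermediate_step:
  input: "data/raw.tsv"
  output: temp("data/intermediate.tsv")   # deleted automatically when no longer needed
  shell:
    "python clean_data.py {input} {output}"

rule final_step:
  input: "data/intermediate.tsv"
  output: protected("results/final.tsv")  # write-protected once produced
  shell:
    "python summarise.py {input} {output}"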

Create an isolated environment to install dependencies

Basic file structure

| - config.yml
| - requirements.txt (commonly also named environment.txt)
| - rules/
|   | - myrules.smk
| - scripts/
|   | - script1.py
| - Snakefile
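
With this layout, the top-level Snakefile typically just wires the pieces together; a sketch (the final target is illustrative and would be produced by rules in myrules.smk):

# Snakefile
configfile: "config.yml"       # parameters and wildcard values
include: "rules/myrules.smk"   # rules defined in a separate module

rule all:
  input: "results/final_report.html"  # illustrative final target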

Create a conda environment, one per project!

# create env
conda create -n myworkflow --file requirements.txt
# activate environment
source activate myworkflow
# then execute snakemake
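
Besides one environment per project, Snakemake can also manage a conda environment per rule (this is what “can manage computational environments” refers to in the Basics section); a sketch, with envs/stats.yaml being an illustrative environment definition:

rule avg_gender:
  input: "data/{superpop}.tsv"
  output: "data/{superpop}_{gender}.png"
  conda: "envs/stats.yaml"   # environment with the tools this rule needs
  shell:
    "python statsPlot_gender.py {input} {output}"

Run Snakemake with --use-conda (or the equivalent software-deployment option in newer releases) so that the environment is created and activated automatically for this rule.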

Sources