Snakemake
In this section, we will guide you through transitioning from bash scripts or notebooks to workflows. This approach is particularly useful as computations become more complex.
Basics
Snakemake is a text-based workflow management tool that uses a Python-based language with domain-specific syntax. Originally designed for scientific workflows in bioinformatics, Snakemake shares operational principles with make, by which it was inspired. Importantly, workflows in Snakemake are structured around data stored in files, and the tool supports parallel computing and cloud storage and can manage computational environments.
The workflow is decomposed into rules that define how to obtain output files from input files. Snakemake infers dependencies and determines the execution order automatically, offering significantly more flexibility than its predecessor, make.
Semantics
A workflow is defined through rules, each corresponding to a different task in the pipeline. Each rule can use a different program (e.g. bash to run a shell command, Python, R, GNU core utilities).
Snakefile
rule dw_metadata:
    input: "path2/filename.tsv"
    output: "data/filename.tsv"
    shell:
        "wget -O {output} {input}"

rule split_superpop:
    input: "data/filename.tsv"
    output: "data/{superpop}.tsv"
    shell:
        "python process_pops.py {input} {output}"

rule avg_gender:
    input: "data/{superpop}.tsv"
    output: "data/{superpop}_{gender}.png"
    shell:
        "python statsPlot_gender.py {input} {output}"
The only mandatory flag in Snakemake is the number of cores (-c) or jobs (-j), which indicates the number of processes that can run in parallel. To run the workflow in Snakemake, you can either:
- Request Snakemake to generate a specific file (using the exact filename as defined in the rule’s output)
# Ask for the final output of the pipeline (or any intermediate file you want to generate)
snakemake -c2 data/EUR_female.png
- Specify the name of a rule or task. This is only possible when the rule does not use wildcards
# Call a rule by its name
snakemake -c1 dw_metadata
- Alternatively, and most commonly, define the targets you want to produce inside the Snakefile itself, in a target rule (conventionally called all)
rule all:
    input:
        expand("data/{superpop}_{gender}.png", superpop=["EUR"], gender=["female", "male"])
Scaling up: rules are generalised using wildcards (e.g. going from one to several datasets). Inside a rule, input and output files can be referred to by index or by name, as sketched below.
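As a sketch of both options (the rule and file names below are illustrative, not part of the example pipeline), input files can be given names and then referenced either by that name or by their index:
rule merge_superpop:
    input:
        metadata="data/filename.tsv",    # referenced as {input.metadata} or {input[0]}
        samples="data/{superpop}.tsv"    # referenced as {input.samples} or {input[1]}
    output:
        "results/{superpop}_merged.tsv"  # the {superpop} wildcard is filled in from the requested file
    shell:
        "paste {input.metadata} {input[1]} > {output}"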
Dependencies are determined top-down:
- For a given target, a rule that can be applied to create it is determined (this is called a job)
- The same is then done recursively for the input files of that rule
- If no target is specified, Snakemake tries to apply the first rule in the Snakefile
Rule all: a target rule that collects the final results
Job execution
A job is executed if and only if:
- the output file is a target and does not exist
- the output file is needed by another job that will be executed and does not exist
- the input file is newer than the output file
- the input file will be updated by another job (e.g. changes in rules)
- execution is forced (e.g. with --forceall)
You can plot the DAG (directed acyclic graph) of the jobs
Useful command-line options
# dry-run (-n), print shell commands (-p)
snakemake -n -p
# Snakefile with a different name or in another location
snakemake --snakefile path/to/file.smk
# dry-run (-n), print execution reason for each job
snakemake -n -r
# Visualise DAG of jobs using Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Defining resources
rule myrule:
    resources: mem_mb=100   # 100 MB of memory allocated to this rule
    threads: 4
    shell:
        "command {threads}"
Let’s say our rule myrule needs 4 threads. If we execute the workflow with 8 cores as follows:
snakemake --cores 8
then two myrule jobs will be executed in parallel.
Jobs are scheduled to maximize parallelization, with high-priority jobs scheduled first, all while satisfying the resource constraints. This means:
Now suppose we allocate 100 MB for the execution of myrule and call Snakemake as follows:
snakemake --resources mem_mb=100 --cores 8
In that case, only one myrule job can run at a time (the memory provided is not enough for two). The memory resource is useful for memory-demanding jobs, to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time the full workflow will take. We highly recommend doing so: take a subset of your dataset and give it a go! Log files will come in very handy for this resource estimation. Of course, the execution of jobs also depends on the resources that are actually available (e.g. free CPU cores).
rule myrule:
    log: "logs/myrule.log"
    threads: 4
    shell:
        "command {threads} 2> {log}"   # redirect stderr to the log file
Log files need to define the same wildcards as the output files; otherwise, you will get an error.
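For the resource estimation mentioned above, Snakemake also offers a benchmark directive that writes the runtime and memory usage of a job to a TSV file; a minimal sketch (rule name, paths and command are illustrative):
rule myrule:
    output: "results/myfile.txt"
    log: "logs/myrule.log"
    benchmark: "benchmarks/myrule.tsv"   # wall-clock time and memory usage end up here
    shell:
        "command > {output} 2> {log}"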
Config files
You can also define values for wildcards or parameters in a config file (YAML or JSON). This is recommended when the pipeline may be run several times at different points in time, to avoid unwanted modifications to the workflow; parameterization is key in such cases.
It can be loaded with: configfile: "path/to/config.yaml"
The given path is interpreted relative to the working directory, not relative to the location of the snakefile that contains the statement.
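For example, a config.yaml could hold the superpopulations and genders used earlier, and the Snakefile then reads them through the config dictionary (keys and values here are illustrative):
# config.yaml
superpops: ["EUR", "AFR"]
genders: ["female", "male"]

# Snakefile
configfile: "config.yaml"

rule all:
    input:
        expand("data/{superpop}_{gender}.png",
               superpop=config["superpops"],
               gender=config["genders"])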
Cluster execution
When working on cluster systems, you can execute the workflow by passing a submission command such as qsub:
snakemake --cluster qsub
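The submission command can also contain per-job placeholders such as {threads} or {resources.mem_mb}, and --jobs caps how many cluster jobs are submitted at once; a sketch for an SGE-style scheduler (the qsub options are illustrative and depend on your cluster setup):
snakemake --cluster "qsub -pe smp {threads} -l h_vmem={resources.mem_mb}M" --jobs 50
Note that recent Snakemake releases configure cluster execution through profiles and executor plugins rather than the --cluster flag.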
Additional advanced features
- modularization
- handling temporary and protected files: very important for intermediate files that fill up storage, are not needed in the long run, and can be deleted once the final output is generated. Snakemake does this automatically if you mark such files in your pipeline (see the sketch after this list)
- HTML5 reports
- rule parameters
- tracking tool versions and code changes: will force rerunning older jobs when code and software are modified/updated.
- data provenance information per file
- python API for embedding snakemake in other tools
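A minimal sketch of temporary and protected outputs (rule and file names are illustrative): files wrapped in temp() are deleted as soon as no pending job needs them, and files wrapped in protected() are write-protected so they cannot be overwritten by accident.
rule sort_data:
    input: "data/filename.tsv"
    output: temp("data/filename.sorted.tsv")    # removed automatically once downstream jobs are done
    shell: "sort {input} > {output}"

rule final_report:
    input: "data/filename.sorted.tsv"
    output: protected("results/report.tsv")     # write-protected after the job finishes
    shell: "cp {input} {output}"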
Create an isolated environment to install dependencies
Basic file structure
| - config.yml
| - requirements.txt (commonly also named environment.txt)
| - rules/
| | - myrules.smk
| - scripts/
| | - script1.py
| - Snakefile
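With such a layout, the top-level Snakefile usually just loads the config and includes the rule files (a sketch matching the file names above; the final target is illustrative):
# Snakefile
configfile: "config.yml"
include: "rules/myrules.smk"   # rule definitions live in rules/, helper scripts in scripts/

rule all:
    input: "data/EUR_female.png"   # final target(s) of the workflow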
Create conda environment, one per project!
# create env
conda create -n myworkflow --file requirements.txt
# activate environment
source activate myworkflow
# then execute snakemake
Best Practices recommended by Snakemake
- Snakemake (v5.11+) includes a linter to check code quality, helping ensure best practices, readability, and reproducibility. Use the linter before publishing or seeking help. Run it with:
snakemake --lint
- Apply snakefmt to format workflows prior to publication.
- Always include test data when publishing on GitHub and configure GitHub Actions for continuous testing. Predefined actions for testing and linting (here) and formatting (here) are available.
- To improve discoverability, follow Snakemake’s recommended workflow structure.
- Use config files and, if needed, tabular configuration for metadata and experiment information.
- Use command-line arguments for runtime settings like threads, resources, or output folders, such as --set-threads, --set-resources, --set-default-resources, and --directory. This makes workflows more readable, scalable, and portable (see the example after this list).
- Keep filenames short and consistent, but informative. Avoid mixing too many special characters (e.g. decide whether to use _ or - as a separator and use it consistently throughout the workflow).
- Separate Python code such as helper functions from the rules (e.g. in a workflow/rules/common.smk file). This helps non-experts understand the workflow without having to dig into unnecessary internal details.
- Avoid lambda expressions inside of rules.
- Where possible, use Snakemake wrappers to simplify recurring tasks.
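As an example of such runtime overrides (the rule name myrule and the values are illustrative; check snakemake --help for the exact syntax in your version):
# override threads and memory of a single rule, and run in a different working directory
snakemake --cores 8 --set-threads myrule=4 --set-resources myrule:mem_mb=2000 --directory path/to/workdir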
Sources
- Snakemake tutorial
- Snakemake tutorial slides by Johannes Köster
- faircookbook workflows
- Formatter snakefmt
- Snakemake workflow catalog
- Rules for inclusion in Snakemake workflow catalog
- Snakemake github template
Recommended reading:
- Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
- Köster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014.