Snakemake
In this section, we will guide you through transitioning from bash scripts or notebooks to workflows. This approach is particularly useful as computations become more complex.
Basics
Snakemake is a text-based workflow management tool that uses a Python-based language with domain-specific syntax. Originally designed for scientific workflows in bioinformatics, Snakemake shares operational principles with make, by which it was inspired. Importantly, workflows in Snakemake are structured around data stored in files, and the tool supports parallel computing and cloud storage and can manage computational environments.
The workflow is decomposed into rules that define how to obtain output files from input files. Snakemake infers dependencies and determines the execution order automatically, offering significantly more flexibility than its predecessor, make.
Semantics
A workflow is defined through rules, each corresponding to a different task in the pipeline. Each rule can use a different program (e.g. bash to run a shell command, Python, R, GNU core utilities).
Snakefile
rule dw_metadata:
    input: "path2/filename.tsv"
    output: "data/filename.tsv"
    shell:
        "wget -O {output} {input}"

rule split_superpop:
    input: "data/filename.tsv"
    output: "data/{superpop}.tsv"
    shell:
        "python process_pops.py {input} {output}"

rule avg_gender:
    input: "data/{superpop}.tsv"
    output: "data/{superpop}_{gender}.png"
    shell:
        "python statsPlot_gender.py {input} {output}"
The only mandatory flag in Snakemake is the number of cores (-c) or jobs (-j), which indicates the number of processes that can run in parallel. To run the workflow in Snakemake, you can either:
- Request Snakemake to generate a specific file (using the exact filename as defined in the rule’s output)
# Ask for the final output of the pipeline (or any intermediate file you want to generate)
snakemake -c2 data/EUR_female.png
- Specify the name of a rule or task. This is only possible when the rule does not use wildcards
# Call a rule by its name
snakemake -c1 dw_metadata
- Alternatively, and most commonly, define the targets you want to produce inside the Snakefile itself, in a target rule (conventionally called all)
rule all:
    input:
        expand("data/{superpop}_{gender}.png", superpop=["EUR"], gender=["female", "male"])
Scaling up: rules are generalised using wildcards (e.g. going from one to several datasets). Inside a rule, input and output files can be referred to by index or by name, as sketched below.
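As a sketch of both options (the rule and file names below are illustrative, not part of the example pipeline), input files can be given names and then referenced either by that name or by their index:
rule merge_superpop:
    input:
        metadata="data/filename.tsv",    # referenced as {input.metadata} or {input[0]}
        samples="data/{superpop}.tsv"    # referenced as {input.samples} or {input[1]}
    output:
        "results/{superpop}_merged.tsv"  # the {superpop} wildcard is filled in from the requested file
    shell:
        "paste {input.metadata} {input[1]} > {output}"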
Dependencies are determined top-down:
- For a given target, a rule that can be applied to create it is determined (this is called a job)
- The same is then done recursively for the input files of that rule
- If no target is specified, Snakemake tries to apply the first rule in the Snakefile
Rule all: a target rule that collects the final results
Job execution
A job is executed if and only if:
- the output file is a target and does not exist
- the output file is needed by another job that will be executed and does not exist
- the input file is newer than the output file
- the input file will be updated by another job (e.g. changes in rules)
- execution is forced (e.g. with --forceall)
You can plot the DAG (directed acyclic graph) of the jobs
Useful command-line options
# dry-run (-n), print shell commands (-p)
snakemake -n -p
# Snakefile with a different name or in another location
snakemake --snakefile path/to/file.smk
# dry-run (-n), print execution reason for each job
snakemake -n -r
# Visualise DAG of jobs using Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Defining resources
rule myrule:
    resources: mem_mb=100   # 100 MB of memory allocated to this rule
    threads: 4
    shell:
        "command {threads}"
Let’s say our rule myrule needs 4 threads. If we execute the workflow with 8 cores as follows:
snakemake --cores 8
then two myrule jobs will be executed in parallel.
Jobs are scheduled to maximize parallelization, with high-priority jobs scheduled first, all while satisfying the resource constraints. This means:
Now suppose we allocate 100 MB for the execution of myrule and call Snakemake as follows:
snakemake --resources mem_mb=100 --cores 8
In that case, only one myrule job can run at a time (the memory provided is not enough for two). The memory resource is useful for memory-demanding jobs, to avoid running out of memory. You will need to benchmark your pipeline to estimate how much memory and time the full workflow will take. We highly recommend doing so: take a subset of your dataset and give it a go! Log files will come in very handy for this resource estimation. Of course, the execution of jobs also depends on the resources that are actually available (e.g. free CPU cores).
rule myrule:
    log: "logs/myrule.log"
    threads: 4
    shell:
        "command {threads} 2> {log}"   # redirect stderr to the log file
Log files need to define the same wildcards as the output files; otherwise, you will get an error.
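For the resource estimation mentioned above, Snakemake also offers a benchmark directive that writes the runtime and memory usage of a job to a TSV file; a minimal sketch (rule name, paths and command are illustrative):
rule myrule:
    output: "results/myfile.txt"
    log: "logs/myrule.log"
    benchmark: "benchmarks/myrule.tsv"   # wall-clock time and memory usage end up here
    shell:
        "command > {output} 2> {log}"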
Config files
You can also define values for wildcards or parameters in a config file (YAML or JSON). This is recommended when the pipeline may be run several times at different points in time, to avoid unwanted modifications to the workflow; parameterization is key in such cases.
It can be loaded with: configfile: "path/to/config.yaml"
The given path is interpreted relative to the working directory, not relative to the location of the snakefile that contains the statement.
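For example, a config.yaml could hold the superpopulations and genders used earlier, and the Snakefile then reads them through the config dictionary (keys and values here are illustrative):
# config.yaml
superpops: ["EUR", "AFR"]
genders: ["female", "male"]

# Snakefile
configfile: "config.yaml"

rule all:
    input:
        expand("data/{superpop}_{gender}.png",
               superpop=config["superpops"],
               gender=config["genders"])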
Cluster execution
When working on cluster systems, you can execute the workflow by passing a submission command such as qsub:
snakemake --cluster qsub
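The submission command can also contain per-job placeholders such as {threads} or {resources.mem_mb}, and --jobs caps how many cluster jobs are submitted at once; a sketch for an SGE-style scheduler (the qsub options are illustrative and depend on your cluster setup):
snakemake --cluster "qsub -pe smp {threads} -l h_vmem={resources.mem_mb}M" --jobs 50
Note that recent Snakemake releases configure cluster execution through profiles and executor plugins rather than the --cluster flag.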
Additional advanced features
- modularization
- handling temporary and protected files: very important for intermediate files that fill up storage, are not needed in the long run, and can be deleted once the final output is generated. Snakemake does this automatically if you mark such files in your pipeline (see the sketch after this list)
- HTML5 reports
- rule parameters
- tracking tool versions and code changes: will force rerunning older jobs when code and software are modified/updated.
- data provenance information per file
- python API for embedding snakemake in other tools
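A minimal sketch of temporary and protected outputs (rule and file names are illustrative): files wrapped in temp() are deleted as soon as no pending job needs them, and files wrapped in protected() are write-protected so they cannot be overwritten by accident.
rule sort_data:
    input: "data/filename.tsv"
    output: temp("data/filename.sorted.tsv")    # removed automatically once downstream jobs are done
    shell: "sort {input} > {output}"

rule final_report:
    input: "data/filename.sorted.tsv"
    output: protected("results/report.tsv")     # write-protected after the job finishes
    shell: "cp {input} {output}"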
Create an isolated environment to install dependencies
Basic file structure
| - config.yml
| - requirements.txt (commonly also named environment.txt)
| - rules/
| | - myrules.smk
| - scripts/
| | - script1.py
| - Snakefile
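With such a layout, the top-level Snakefile usually just loads the config and includes the rule files (a sketch matching the file names above; the final target is illustrative):
# Snakefile
configfile: "config.yml"
include: "rules/myrules.smk"   # rule definitions live in rules/, helper scripts in scripts/

rule all:
    input: "data/EUR_female.png"   # final target(s) of the workflow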
Create conda environment, one per project!
# create env
conda create -n myworkflow --file requirements.txt
# activate environment
source activate myworkflow
# then execute snakemake
Best Practices recommended by Snakemake
- Snakemake (v5.11+) includes a linter to check code quality, helping ensure best practices, readability, and reproducibility. Use the linter before publishing or seeking help. Run it with:
snakemake --lint
- Apply snakefmt to format workflows prior to publication.
- Always include test data when publishing on GitHub and configure GitHub Actions for continuous testing. Predefined actions for testing and linting (here) and formatting (here) are available.
- To improve discoverability, follow Snakemake’s recommended workflow structure.
- Use config files and, if needed, tabular configuration for metadata and experiment information.
- Use command-line arguments for runtime settings like threads, resources, or output folders, such as --set-threads, --set-resources, --set-default-resources, and --directory. This makes workflows more readable, scalable, and portable (see the example after this list).
- Keep filenames short and consistent, but informative. Avoid mixing too many special characters (e.g. decide whether to use _ or - as a separator and use it consistently throughout the workflow).
- Separate Python code such as helper functions from the rules (e.g. in a workflow/rules/common.smk file). This helps non-experts understand the workflow without having to dig into unnecessary internal details.
- Avoid lambda expressions inside of rules.
- Where possible, use Snakemake wrappers to simplify recurring tasks.
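As an example of such runtime overrides (the rule name myrule and the values are illustrative; check snakemake --help for the exact syntax in your version):
# override threads and memory of a single rule, and run in a different working directory
snakemake --cores 8 --set-threads myrule=4 --set-resources myrule:mem_mb=2000 --directory path/to/workdir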
Sources
- Snakemake tutorial
- Snakemake tutorial slides by Johannes Köster
- faircookbook workflows
- Formatter snakefmt
- Snakemake workflow catalog
- Rules for inclusion in Snakemake workflow catalog
- Snakemake github template
Recommended reading:
- Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
- Köster, Johannes. “Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis”, PhD thesis, TU Dortmund 2014.