Follow these steps to run the exercises on UCloud. First, mount the following two drives, use the setup.sh initialization file, and ask for 2 CPUs so we can run things in parallel:

YourNameSurname#xxxx: save your results/files here.
hpclab-workshop: contains input files and scripts. You can read-only access from this directory (no write permissions).

Next, activate snakemake environment.

# Make sure no other env is active!
conda activate snakemake

Finally, navigate to your personal drive and use the project directory you created yesterday, hpclab, to create a new subdirectory fastmixture and save here all files from this exercise (Snakefile, config.yml, output file, etc.).

You can choose to run your tasks either locally or on an accessible HPC. Refer to the Welcome page for the software you need to install. Next, create a new environment using the YAML file, activate the conda environment, and download the data.

Day 2 - Part 4

Create workflows that are both reproducible and easy to distribute. Use the config.yml file to define all variable values, and rely on separate environment YAML files for tasks that use software outside of the provided Conda environment.

II - Admixture analysis pipeline

In this exercise, you will run the software fastmixture as a Snakemake pipeline by converting the following bash script. Pay attention to the prefix for input and output files and the specified number of cores.

File paths for UCloud users:
- Rscript: /work/HPCLab_workshop/scripts/plotAdmixture.R
- Input PLINK files: /work/HPCLab_workshop/data/plink_sim/sim.small.*

fastmixture is not pre-installed in the snakemake environment, and we don’t want to include it for the purpose of this exercise. You will have to git clone the GitHub repository. There are two options on how to proceed:

Use the provided environment.yml file to create a new environment, which you are going to name fastmixture. Then, use the full path to the fastmixture env in the workflow. Is this the best approach? The workflow will rely on this environment being present on a new system where the workflow is executed. Referring to an existing environment can, however, be useful during development, e.g., when a certain software package is developed in parallel to a workflow that uses it.
Provide the path to environment.yml within the workflow. Snakemake will create the environment for you, which takes some time, but will only do this once. This is recommended and preferred for reproducibility reasons.

Either way, use the conda directive!

Have a look at the fastmixture.sh script below. How many different wildcards would you need? Recommended: Specify the values of the wildcard variables in a separate config file.

fastmixture.sh

### Run fastmixture for K=2,3 and three different seeds
FPREFIX="sim.small"
THREADS=4 

for k in {2,3}
do
    for s in {1..3}
    do
        fastmixture --bfile $FPREFIX --K $k --threads $THREADS --seed $s --out $FPREFIX
    done
done
# Saves three files for each run (.Q, .P and .log)

### Plot results and save (using R)
for k in {2,3}
do
    for s in {1..3}
    do
        Rscript plotAdmixture.R $FPREFIX.K${k}.s${s}.Q
    done
done

In fastmixture, the main arguments used in this exercise are:

--K: Specifies the number of ancestral components, representing the sources in the mixture model.
--seed: Sets the random seed to ensure reproducibility of the analysis across different runs.
--bfile: prefix for PLINK files (.bed, .bim, .fam)
--out: Prefix output name

You are set to write your own Snakemake pipeline. Need more suggestions?

Add a log directive to every rule. It is best practice to store all log files in a subdirectory logs/, prefixed by the rule or tool name.
Add a benchmark directive as well. Applying the same best practice.
Memory estimated based on the size of the file
After running the workflow for the set of K and s values shown in the bash script. Use the command line to provide a new seed value: --config PARAM=<value>
Make use of the multiext function to define a set of outputs or inputs that only differ by their extension

How to run this workflow? See the picture below:

snakemake -c1 --sdm conda

Hint

First, create a config.yaml file to specify the values for the wildcards k and s, as well as other parameters (such as the number of threads and the prefix).

Then, create the Snakemake Workflow, defining two rules for running fastmixture and plotting with R.

fastmixture is not pre-installed, check the instructions fastmixture GitHub. Need more tips?

# Git clone into your working dir
git clone https://github.com/Rosemeis/fastmixture.git

# OPTION 1: Use the env.yml (recommended)
# You just need to locate the file and use the path to the environment file under the conda directive
rule fast: 
  input: ...
  output: ...
  conda: /work/<NameSurname#xxx>/fastmixture/environment.yml

# OPTION 2: Create the env and use it (useful if you and your colleague are using the same HPC and won't get the pipeline published)
cd /work/<NameSurname#xxx>/fastmixture
conda env create -f environment.yml --prefix /work/<NameSurname#xxx>/envs/fastmixture
# Then use the env path 
rule fast: 
  input: ...
  output: ...
  conda: /work/<NameSurname#xxx>/envs/fastmixture

Make use of useful directives:

rule ...:
  output: "results/{wd2}.tsv"
  conda: "/path/to/env.yml"
  log: "logs/{wd1}.log"
  benchmark: "logs/{wd1}.txt"
  shell: "somecomamnd {output}"

My solution

This is one possible solution. How does yours look?

config.yml

PLINKPATH: "/work/HPCLab_workshop/data/plink_sim"
FPREFIX: "sim.small"
THREADS_FAST: 2  # Threads for fastmixture
THREADS_R: 1            # Threads for R plotting
K_VALUES: [2,3]        # Values of K
SEEDS:
  - 1
  - 2
  - 3    # Seed values
PATH_SCRIPT: "/work/HPCLab_workshop/scripts"

Snakefile

#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = "Alba Refoyo Martinez"
__copyright__ = "Copyright 2024, University of Copenhagen"
__email__ = "gsd818@@ku.dk"
__license__ = "MIT"
configfile: "config.yml"

rule all:
    input:
        expand("{FPREFIX}.K{k}.s{s}.png", FPREFIX=config["FPREFIX"], k=config["K_VALUES"], s=config["SEEDS"])

rule fastmixture:
    output:
        q="{prefix}.K{k}.s{s}.Q",
        p="{prefix}.K{k}.s{s}.P",
        log="{prefix}.K{k}.s{s}.log"
    params:
        prefix=config["FPREFIX"],
        pathIn=config["PLINKPATH"]
    threads: config["THREADS_FAST"]
    log:
        "logs/fastmixture_{prefix}.K{k}.s{s}.log"
    benchmark:
        "benchmarks/fastmixture_{prefix}.K{k}.s{s}.txt"
    conda: 
        "/work/AlbaRefoyoMartínez#0753/fastmixture/environment.yml"
    shell:
        """
        fastmixture --bfile {params.pathIn}/{params.prefix} --K {wildcards.k} --threads {threads} --seed {wildcards.s} --out {params.prefix} 2> {log}
        """

rule plot_results:
    input:
        q="{prefix}.K{k}.s{s}.Q"
    output:
        plot="{prefix}.K{k}.s{s}.png"
    params:
        plot_scripts=config["PATH_SCRIPT"]
    shell:
        """
        Rscript {params.plot_scripts}/plotAdmixture.R {input.q}
        """

Bonus exercise: functions and checkpoints

You’ll now modify the pipeline to dynamically select the best K from the fastmixture results using a checkpoint and a Python function.

Why the checkpoint?

Let’s assume you don’t know in advance how many files the pipeline will produce or what their names will be, because they depend on parameters like K values that are defined in the config file or runtime. This is handy when we have a Python function inspecting the output directory to decide on what to use.

checkpoint fastmixture:
    output:
        directory("results/fastmixture")
    params:
        ...
    shell:
        ...

What will the function do?

Search for all *.log files.
Extract the log-likelihood value for each.
Select the file corresponding to the best K (e.g., highest likelihood).

The rule for plotting

You will make use of this function in a downstream rule. The last rule, e.g., bestK_plot, will only plot the admixture proportions for the best result.

Hint

def get_q_file(wildcards): …

rule bestK_plot: input: q=lambda wildcards: get_q_file(wildcards)

Copyright

CC-BY-SA 4.0 license