Follow these steps to run the exercises on UCloud. First, mount the following two drives, use the setup.sh initialization file, and ask for 2 CPUs so we can run things in parallel:

  • YourNameSurname#xxxx: save your results/files here.
  • hpclab-workshop: this contains input files and scripts. You can read-only access from this directory (no write permissions).

Next, activate snakemake environment.

# make sure no other env is active!
conda activate snakemake 

Finally, navigate to your personal drive and use the project directory you created yesterday hpclab to create a new subdir fastmixture and save here all files from this exercise (Snakefile, config.yml, output file, etc.).

You can choose to run your tasks either locally or on an accessible HPC. Refer to the Welcome page for the software you need to install. Next, create a new environment using the YAML file, activate the conda environment, and download the data.

Day 2 - Part 4

Create workflows that are both reproducible and easy to distribute. Use the config.yml file to define all variable values, and rely on separate environment YAML files for tasks that use software outside of the provided Conda environment.

II - Admixture analysis pipeline

In this exercise, you will run the software fastmixture as a Snakemake pipeline by converting the following bash script. Pay attention to the prefix for input and output files and the specified number of cores.

  • File paths for UCloud users:
    • Rscript: /work/HPCLab_workshop/scripts/plotAdmixture.R
    • Input PLINK files: /work/HPCLab_workshop/data/plink_sim/sim.small.*

fastmixture is not pre-installed in the snakemake environment and we don’t want to include it for the purpose of this exercise. You will have to git clone the GitHub repository. There are two options on how to proceed:

  1. Use the provided environment.yml file to create a new environment, which you are going to name fastmixture. Then, use the full path to the fastmixture env in the workflow. Is this the best approach? The workflow will rely on this environment being present on a new system where the workflow is executed. Referring to an existing environment can however be useful during development, e.g. when a certain software package is developed in parallel to a workflow that uses it.
  2. Provide the path to environment.yml within the workflow. Snakemake will create the env for you which takes some time but will only do this once. This is recommended and preferred for reproducibility reasons.

Either way, use the conda directive!

Have a look at the fastmixture.sh script below. How many different wildcards would you need? Recommended: Specify the values of the wildcard variables in a separate config file.

fastmixture.sh
### Run fastmixture for K=2,3 and three different seeds
FPREFIX="sim.small"
THREADS=4 

for k in {2,3}
do
    for s in {1..3}
    do
        fastmixture --bfile $FPREFIX --K $k --threads $THREADS --seed $s --out $FPREFIX
    done
done
# Saves three files for each run (.Q, .P and .log)

### Plot results and save (using R)
for k in {2,3}
do
    for s in {1..3}
    do
        Rscript plotAdmixture.R $FPREFIX.K${k}.s${s}.Q
    done
done

In fastmixture, the main arguments used in this exercise are:

  • --K: Specifies the number of ancestral components, representing the sources in the mixture model.
  • --seed: Sets the random seed to ensure reproducibility of the analysis across different runs.
  • --bfile: prefix for PLINK files (.bed, .bim, .fam)
  • --out: Prefix output name

You are set to write your own snakemake pipeline. Need more suggestions?

  • Add a log directive to every rule. It is best practice to store all log files in a subdirectory logs/, prefixed by the rule or tool name.
  • Add a benchmark directive as well. Applying the same best practice.
  • Memory estimated based on the size of the file
  • After running the workflow for the set of K and s values shown in the bash script. Use the command line to provide a new seed value: --config PARAM=<value>
  • Make use of the multiext function to define a set of output or input that only differ by their extension

How to run this workflow? See the picture below:

snakemake -c1 --sdm conda

First, create a config.yaml file to specify the values for the wildcards k and s, as well as other parameters (such as the number of threads, and the prefix).

Then, create the Snakemake Workflow, defining two rules for running fastmixture and plotting with R.

fastmixture is not pre-installed, check the instructions fastmixture GitHub. Need more tips?

# Git clone on your working dir
git clone https://github.com/Rosemeis/fastmixture.git

# OPTION 1: Use the env.yml (recommended)
# You just need to locate the file and use the the path to the environment file under the conda directive
rule fast: 
  input: ...
  output: ...
  conda: /work/<NameSurname#xxx>/fastmixture/environment.yml

# OPTION 2: Create the env and use it (useful if you and your colleague are using the same HPC and won't get the pipeline published)
cd /work/<NameSurname#xxx>/fastmixture
conda env create -f environment.yml --prefix /work/<NameSurname#xxx>/envs/fastmixture
# Then use the env path 
rule fast: 
  input: ...
  output: ...
  conda: /work/<NameSurname#xxx>/envs/fastmixture

Make use of useful directives:

rule ...:
  output: "results/{wd2}.tsv"
  conda: "/path/to/env.yml"
  log: "logs/{wd1}.log"
  benchmark: "logs/{wd1}.txt"
  shell: "somecomamnd {output}"

This is one possible solution, how does yours look?

config.yml
PLINKPATH: "/work/HPCLab_workshop/data/plink_sim"
FPREFIX: "sim.small"
THREADS_FAST: 2  # Threads for fastmixture
THREADS_R: 1            # Threads for R plotting
K_VALUES: [2,3]        # Values of K
SEEDS:
  - 1
  - 2
  - 3    # Seed values
PATH_SCRIPT: "/work/HPCLab_workshop/scripts"
Snakefile
#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = "Alba Refoyo Martinez"
__copyright__ = "Copyright 2024, University of Copenhagen"
__email__ = "gsd818@@ku.dk"
__license__ = "MIT"
configfile: "config.yml"

rule all:
    input:
        expand("{FPREFIX}.K{k}.s{s}.png", FPREFIX=config["FPREFIX"], k=config["K_VALUES"], s=config["SEEDS"])

rule fastmixture:
    output:
        q="{prefix}.K{k}.s{s}.Q",
        p="{prefix}.K{k}.s{s}.P",
        log="{prefix}.K{k}.s{s}.log"
    params:
        prefix=config["FPREFIX"],
        pathIn=config["PLINKPATH"]
    threads: config["THREADS_FAST"]
    log:
        "logs/fastmixture_{prefix}.K{k}.s{s}.log"
    benchmark:
        "benchmarks/fastmixture_{prefix}.K{k}.s{s}.txt"
    conda: 
        "/work/AlbaRefoyoMartínez#0753/fastmixture/environment.yml"
    shell:
        """
        fastmixture --bfile {params.pathIn}/{params.prefix} --K {wildcards.k} --threads {threads} --seed {wildcards.s} --out {params.prefix} 2> {log}
        """

rule plot_results:
    input:
        q="{prefix}.K{k}.s{s}.Q"
    output:
        plot="{prefix}.K{k}.s{s}.png"
    params:
        plot_scripts=config["PATH_SCRIPT"]
    shell:
        """
        Rscript {params.plot_scripts}/plotAdmixture.R {input.q}
        """