Follow these steps to run the exercises on UCloud. First, mount the following two drives, use the setup.sh
initialization file, and ask for 2 CPUs so we can run things in parallel:
YourNameSurname#xxxx
: save your results/files here.hpclab-workshop
: this contains input files and scripts. You can read-only access from this directory (no write permissions).
Next, activate snakemake
environment.
# make sure no other env is active!
conda activate snakemake
Finally, navigate to your personal drive and use the project directory you created yesterday hpclab
to create a new subdir fastmixture
and save here all files from this exercise (Snakefile, config.yml, output file, etc.).
You can choose to run your tasks either locally or on an accessible HPC. Refer to the Welcome page for the software you need to install. Next, create a new environment using the YAML file, activate the conda environment, and download the data.
Day 2 - Part 4
Create workflows that are both reproducible and easy to distribute. Use the config.yml
file to define all variable values, and rely on separate environment YAML files for tasks that use software outside of the provided Conda environment.
In this exercise, you will run the software fastmixture
as a Snakemake pipeline by converting the following bash script. Pay attention to the prefix for input and output files and the specified number of cores.
- File paths for UCloud users:
- Rscript:
/work/HPCLab_workshop/scripts/plotAdmixture.R
- Input PLINK files:
/work/HPCLab_workshop/data/plink_sim/sim.small.*
- Rscript:
fastmixture
is not pre-installed in the snakemake
environment and we don’t want to include it for the purpose of this exercise. You will have to git clone the GitHub repository. There are two options on how to proceed:
- Use the provided
environment.yml
file to create a new environment, which you are going to namefastmixture
. Then, use the full path to thefastmixture
env in the workflow. Is this the best approach? The workflow will rely on this environment being present on a new system where the workflow is executed. Referring to an existing environment can however be useful during development, e.g. when a certain software package is developed in parallel to a workflow that uses it. - Provide the path to
environment.yml
within the workflow. Snakemake will create the env for you which takes some time but will only do this once. This is recommended and preferred for reproducibility reasons.
Either way, use the conda
directive!
Have a look at the fastmixture.sh
script below. How many different wildcards would you need? Recommended: Specify the values of the wildcard variables in a separate config file.
fastmixture.sh
### Run fastmixture for K=2,3 and three different seeds
FPREFIX="sim.small"
THREADS=4
for k in {2,3}
do
for s in {1..3}
do
fastmixture --bfile $FPREFIX --K $k --threads $THREADS --seed $s --out $FPREFIX
done
done
# Saves three files for each run (.Q, .P and .log)
### Plot results and save (using R)
for k in {2,3}
do
for s in {1..3}
do
Rscript plotAdmixture.R $FPREFIX.K${k}.s${s}.Q
done
done
In fastmixture
, the main arguments used in this exercise are:
--K
: Specifies the number of ancestral components, representing the sources in the mixture model.--seed
: Sets the random seed to ensure reproducibility of the analysis across different runs.--bfile
: prefix for PLINK files (.bed, .bim, .fam)--out
: Prefix output name
You are set to write your own snakemake pipeline. Need more suggestions?
- Add a
log
directive to every rule. It is best practice to store all log files in a subdirectorylogs/
, prefixed by the rule or tool name. - Add a
benchmark
directive as well. Applying the same best practice. - Memory estimated based on the size of the file
- After running the workflow for the set of
K
ands
values shown in the bash script. Use the command line to provide a new seed value:--config PARAM=<value>
- Make use of the
multiext
function to define a set of output or input that only differ by their extension
How to run this workflow? See the picture below:
snakemake -c1 --sdm conda
First, create a config.yaml file to specify the values for the wildcards k
and s
, as well as other parameters (such as the number of threads, and the prefix).
Then, create the Snakemake Workflow, defining two rules for running fastmixture
and plotting with R.
fastmixture
is not pre-installed, check the instructions fastmixture
GitHub. Need more tips?
# Git clone on your working dir
git clone https://github.com/Rosemeis/fastmixture.git
# OPTION 1: Use the env.yml (recommended)
# You just need to locate the file and use the the path to the environment file under the conda directive
rule fast:
input: ...
output: ...
conda: /work/<NameSurname#xxx>/fastmixture/environment.yml
# OPTION 2: Create the env and use it (useful if you and your colleague are using the same HPC and won't get the pipeline published)
cd /work/<NameSurname#xxx>/fastmixture
conda env create -f environment.yml --prefix /work/<NameSurname#xxx>/envs/fastmixture
# Then use the env path
rule fast:
input: ...
output: ...
conda: /work/<NameSurname#xxx>/envs/fastmixture
Make use of useful directives:
rule ...:"results/{wd2}.tsv"
output: "/path/to/env.yml"
conda: "logs/{wd1}.log"
log: "logs/{wd1}.txt"
benchmark: "somecomamnd {output}" shell:
This is one possible solution, how does yours look?
config.yml
PLINKPATH: "/work/HPCLab_workshop/data/plink_sim"
FPREFIX: "sim.small"
THREADS_FAST: 2 # Threads for fastmixture
THREADS_R: 1 # Threads for R plotting
K_VALUES: [2,3] # Values of K
SEEDS:
- 1
- 2
- 3 # Seed values
PATH_SCRIPT: "/work/HPCLab_workshop/scripts"
Snakefile
#!/usr/bin/env python
# -*- coding: utf-8 -*-
= "Alba Refoyo Martinez"
__author__ = "Copyright 2024, University of Copenhagen"
__copyright__ = "gsd818@@ku.dk"
__email__ = "MIT"
__license__ "config.yml"
configfile:
all:
rule input:
"{FPREFIX}.K{k}.s{s}.png", FPREFIX=config["FPREFIX"], k=config["K_VALUES"], s=config["SEEDS"])
expand(
rule fastmixture:
output:="{prefix}.K{k}.s{s}.Q",
q="{prefix}.K{k}.s{s}.P",
p="{prefix}.K{k}.s{s}.log"
log
params:=config["FPREFIX"],
prefix=config["PLINKPATH"]
pathIn"THREADS_FAST"]
threads: config[
log:"logs/fastmixture_{prefix}.K{k}.s{s}.log"
benchmark:"benchmarks/fastmixture_{prefix}.K{k}.s{s}.txt"
conda: "/work/AlbaRefoyoMartínez#0753/fastmixture/environment.yml"
shell:"""
fastmixture --bfile {params.pathIn}/{params.prefix} --K {wildcards.k} --threads {threads} --seed {wildcards.s} --out {params.prefix} 2> {log}
"""
rule plot_results:input:
="{prefix}.K{k}.s{s}.Q"
q
output:="{prefix}.K{k}.s{s}.png"
plot
params:=config["PATH_SCRIPT"]
plot_scripts
shell:"""
Rscript {params.plot_scripts}/plotAdmixture.R {input.q}
"""