Follow these steps to run the exercises on UCloud.
You will use the Nextflow app, version 24.10.5. First, mount the two drives below and request 2 CPUs so we can run tasks in parallel:
- YourNameSurname#xxxx: save your results/files here.
- /shared/HPCLab_workshop: contains input files and scripts. You have read-only access to this directory (no write permissions).
Then, select two optional parameters:
- Interactive mode (true): to get access to a terminal and run the commands yourself.
- Initialization: use the script /shared/HPCLab_workshop/hpc-pipes/setup.sh.
Finally, navigate to your personal drive and use the project directory you created yesterday to save all the output files from the exercises!
Alternatively, you can run the exercises locally or on an HPC system you have access to. Refer to the Welcome page for the software you need to install. Next, download the data.
Day 2 - Part 5
B. Nextflow
Let’s build your first Nextflow pipeline from scratch. What elements do we need?
my_pipeline/
├── <workflow_script>.nf # Nextflow script (e.g., main.nf)
├── <config_file>.config # Configuration file for pipeline parameters and environment settings (e.g., nextflow.config)
└── input.txt # Input data
If you need help, try to find answers to your questions by consulting the Nextflow documentation and Nextflow Training, as they are great resources with many examples. This is how you would approach it for any project.
Two Nextflow examples for inspiration:
Preparation step
Directory setup: Assuming you've followed the setup instructions and launched the Nextflow app by selecting Open terminal, create a new subdirectory inside hpclab (for example, nf-pipes) to store your pipeline and its outputs. Make sure it is set as your working directory (e.g., cd hpclab/nf-pipes).
File creation: you will need to create two files for this exercise:
Open a new file named step1.nf and include the following line of code. The script below also shows how to comment out lines:
step1.nf
#!/usr/bin/env nextflow

// Single-line comments

/*
This is a multi-line comment example
*/
In this task, you'll develop a simple Nextflow pipeline based on the first version of the step1_wf.smk Snakemake file you worked with (Day 2, Part 3.I). This pipeline preprocesses a sample metadata file using a Python script.
This version is intentionally kept simple: it’s not yet generalized and doesn’t make use of modular pipeline features. We’ll start here and gradually build toward a more flexible and reproducible workflow. Here’s what the original Snakemake pipeline looked like:
step1_wf.smk
rule preprocess:
    input:
        "/work/HPCLab_workshop/data/samples_EB.tsv"
    output:
        "results/samples_EB_cleaned.tsv"
    shell:
        """
        python scripts/preprocess.py {input} {output}
        """
Instructions
In your Nextflow version:
Open the Nextflow script you just created, step1.nf, and make sure it mirrors the logic of the Snakemake version.
Set the HOME environment variable in your terminal to your project directory (e.g., export HOME=/work/<YourNameSurname#xxxx>/hpclab/nf-pipes).
Define pipeline parameters at the top of the script. These variables help make the workflow more reusable later on:
- Use params.input_dir to define the location of the input files, pointing to the shared input data directory (/shared/HPCLab_workshop).
- Use params.output_dir to define where the results should be saved. Use your project directory for outputs (e.g., ${HOME}/results).
Write a single process, named PREPROCESS, to run the Python script:
- Use input: to read samples_EB.tsv
- Use output: to declare samples_EB_cleaned.tsv
- Use publishDir to copy the output file to your results directory.
publishDir makes your output files easier to locate by copying them to a defined location. It also ensures outputs are not left buried inside Nextflow's work directories.
Use a hardcoded path to the Python script for now: /work/<YourPersonalDrive>/hpclab/scripts/preprocess.py
Use the conda profile to manage the software environment for your pipeline. Among other options, you can include the conda environment within a process itself, specify it from the command line using the -with-conda flag, or define it in a config file. We recommend specifying it via the command line (-with-conda) and incorporating the environment within the process in the next exercise, once the pipeline is working.
Take a look inside the work directory and explore some of the files (e.g., .command.sh and .command.log). Did Nextflow execute the command you expected?
The structure of the pipeline should look like:
step1.nf
process STEP_NAME {
    publishDir OUTPUT_LOCATION, mode: 'copy'

    input:
    path some_input

    output:
    path some_output

    script:
    """
    some_command $some_input $some_output
    """
}

workflow {
    Channel.fromPath()
    STEP_NAME()
}
The file publishing method can be one of the following values:
- ‘copy’: Copies the output files into the publish directory.
- ‘copyNoFollow’: Copies the output files into the publish directory without following symlinks, i.e., it copies the links themselves.
- ‘link’: Creates a hard link in the publish directory for each output file.
- ‘move’: Moves the output files into the publish directory. Note: this is only supposed to be used for a terminal process i.e. a process whose output is not consumed by any other downstream process.
- ‘rellink’: Creates a relative symbolic link in the publish directory for each output file.
- ‘symlink’: Creates an absolute symbolic link in the publish directory for each output file (default).
You’ll also need a Nextflow config file. While simple pipelines may work with just default settings, more complex pipelines typically define one or more config files to separate concerns, such as:
- Resource definitions (e.g., cpus, memory, time)
- Execution profiles (e.g., local, slurm) and container environments (e.g., conda, singularity)
- Pipeline parameters (e.g., input/output directories, filtering options)
For now, let’s just define the resources we want this pipeline to use:
nextflow.config
process {
    cpus = 1
    memory = '4 GB'
    time = '1h'
}
Need help with the config file? Check the documentation below:
Nextflow searches for configuration files in several locations, and it’s important to understand the order in which they are applied. For more details, refer to the official documentation: https://www.nextflow.io/docs/latest/config.html.
A Nextflow configuration file can include any combination of assignments, blocks, and includes.
Blocks can contain multiple configuration options. Below, you’ll see the difference between dot syntax and block syntax.
nextflow.config
executor.retry.maxAttempt = 2
executor {
    retry.maxAttempt = 2
}
If you are running into errors, make sure to go through the hints below first, as they provide useful tips and can serve as inspiration. There isn't just one solution, so the hints can guide you toward different approaches to solving the problem.
Also, don't forget to look in the work directory and review the log files (like .command.sh and .command.log). They're often very informative and can help you identify and fix issues in your script before giving up.
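If you're not sure where those files live, the commands below (run from your pipeline directory) will locate them; the hashed subdirectory names are placeholders for whatever Nextflow generated on your run:
Terminal
# list the hidden task scripts and logs created under the work directory
find work -name ".command.sh" -o -name ".command.log"
# inspect one of them, replacing the hashed path with one printed above
cat work/<hash>/<hash>/.command.sh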
It is very important to set HOME to a directory where you are working - your project directory. This is where Nextflow will look for the pipeline and configuration files, among other things.
How to run this workflow?
Example 1
nextflow run step1.nf -with-conda /work/HPCLab_workshop/miniconda3/envs/snakemake
Need help with the pipeline?
In the workflow block, we need to set up a channel to feed the input to the PREPROCESS process; then we can call the process itself to run on the contents of that channel.
step1.nf
/*
 * Pipeline parameters
 */

// Primary input
params.input_dir = "path/to/input_dir"
params.output_dir = "path/to/output_dir"
params.read_samples = "path/to/input"

process PREPROCESS {
    publishDir OUTPUT_LOCATION, mode: 'copy'

    input:
    path some_input

    output:
    path some_output

    script:
    """
    some_command $some_input $some_output
    """
}

workflow {
    // Create input channel from CLI parameter
    reads_ch = Channel.fromPath(params.read_samples)

    // Connect channel to process
    PREPROCESS (reads_ch)
}
Alternatively, you can pipe the input channel directly into the process in a single line.
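For example, instead of assigning the channel to a variable first, the workflow block could be written as:
workflow {
    // feed the input file(s) straight into the process
    Channel.fromPath(params.read_samples) | PREPROCESS
}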
This is one possible solution. How does yours look?
- Nextflow script
step1.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"

// Primary input
params.read_samples = "${params.input_dir}/samples_EB.tsv"

process PREPROCESS {
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    path input_samples

    output:
    path "samples_EB_cleaned.tsv"

    script:
    """
    python /work/<YourNameSurname>/hpclab/scripts/preprocess.py $input_samples samples_EB_cleaned.tsv
    """
}

workflow {
    Channel.fromPath(params.read_samples)
        | PREPROCESS
}
- Command-line
Option 1
nextflow run step1.nf -with-conda /work/HPCLab_workshop/miniconda3/envs/snakemake
Making the Pipeline Reusable, Reproducible, and Flexible
We want to avoid using full (absolute) paths in our pipelines, as they are not reproducibility-friendly and can break when the project is moved to a different system or user environment. Instead, we use relative paths, which ensure the workflow remains portable.
We also want to keep the workflow flexible and reusable, so it can easily handle multiple datasets or samples. Nextflow supports string interpolation, which allows you to dynamically construct file paths or commands using variables defined in the workflow.
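For example (a tiny illustration; it assumes params.input_dir is defined as in the earlier script), double-quoted strings let you embed parameters and variables directly:
// build a file path from a parameter and a local variable
def dataset = "EB"
def sample_path = "${params.input_dir}/samples_${dataset}.tsv"
println sample_path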
Now it’s time to test additional functionalities and extend your pipeline.
1. Dynamic output names and paths
We'll now expand the pipeline to support multiple datasets in a reusable way. In Snakemake, you likely used a wildcard such as {dataset} to dynamically build filenames. Here's a simplified version of that:
step1_flex.smk
# step1_flex.smk
DATASETS = ["EB", "1kgp"]

rule all:
    input:
        expand("results/samples_{dataset}_cleaned.tsv", dataset=DATASETS)

rule preprocess:
    input:
        "/work/HPCLab_workshop/data/samples_{dataset}.tsv"
    output:
        tsv="results/samples_{dataset}_cleaned.tsv"
    shell:
        """
        python scripts/preprocess.py {input} {output}
        """
To do the same in Nextflow, we introduce a dataset variable and use a channel to loop over all datasets, just like the wildcard in Snakemake. The processed output file is dynamically named using the value of dataset.
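A minimal sketch of that idea (the full solution appears further below; .view() is only there to print what flows through the channel):
workflow {
    // one channel element per dataset name, mapped to its input file path
    Channel.of('EB', '1kgp')
        .map { dataset -> file("${params.input_dir}/samples_${dataset}.tsv") }
        .view()
}
The dataset value itself can also be passed to the process (as a val input or as part of a tuple) so the output can be named samples_${dataset}_cleaned.tsv.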
Take a look at this pipeline, where they use the cohort name in a similar way to how you’ll define the dataset: Link.
2. Standardizing script management
To stay organized, create a bin subfolder inside your pipeline directory. Move your script files into this folder and make them executable (e.g., chmod +x bin/myscript.py). Nextflow automatically adds the bin folder to the system PATH, so you can simply refer to script names in your workflow without full paths.
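For example, assuming your scripts currently sit in a scripts/ folder next to your pipeline and each begins with a shebang line (e.g., #!/usr/bin/env python) so it can be executed directly:
Terminal
mkdir -p bin
mv scripts/preprocess.py scripts/filter_year.py bin/
chmod +x bin/*.py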
3. Configuring conda envs using profiles
In your Nextflow pipeline, specify the Conda environment you want to use for a specific process by including the conda directive. Next, create or update a configuration file (e.g., profiles.config) to enable the Conda profile (using conda.enabled = true).
It is highly recommended to check the hint below.
Click for Nextflow documentation on conda.
You can use the conda directive in a process to point to a specific environment, like this:
process PREPROCESS {
    conda 'path/to/conda/envs/myenv'
    ...
}
You will also need to enable the conda profile using a config file:
profiles.config
profiles {
    conda {
        conda.enabled = true
        conda.channels = ['bioconda','conda-forge']
    }
}
How can you run the pipeline now?
Terminal
nextflow run step1.nf -c profiles.config -profile conda
If you expect to have only one profile (e.g., the pipeline only ever runs with conda, which is not very common for HPC setups), you can set the conda options at the top level of the config file instead of defining a profile:
profile_conda.config
conda.enabled = true
conda.channels = ['bioconda','conda-forge']
In this case, you don’t need to specify -profile conda
when running the pipeline:
Terminal
nextflow run step1.nf -c profile_conda.config
step1_flex.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"

process PREPROCESS {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    val dataset
    path input_samples

    output:
    path "samples_${dataset}_cleaned.tsv"

    script:
    """
    preprocess.py $input_samples samples_${dataset}_cleaned.tsv
    """
}

workflow {
    ch_dataset = Channel.of('EB', '1kgp')
    ch_input = ch_dataset.map { dt -> file("${params.input_dir}/samples_${dt}.tsv") }
    PREPROCESS(ch_dataset, ch_input)
}
You will now add a new task, FILTER_YEAR, to your Nextflow pipeline. This task will be responsible for filtering datasets based on the year information (or other criteria). The idea is to use the errorStrategy = 'ignore' setting in the config file, allowing the pipeline to skip filtering when it's not applicable, such as when working with the 1kgp dataset (where this information is not available).
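In a config file, that could look like this (it reappears in the full solution below):
process {
    withName: FILTER_YEAR {
        errorStrategy = 'ignore'
    }
}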
The filter_year.py script needs a value for its cutoff flag. It's better to define the cutoff as a parameter (e.g., params.cutoff = 2000), which can then be modified via the config file or the command line. This approach provides flexibility to adjust the value without changing the pipeline code. As part of this exercise, run the pipeline twice, first using the default cutoff and then with a different one (e.g., 100).
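For example, with params.cutoff = 2000 defined in your script, the value can be overridden at run time with a double-dash pipeline parameter (the script name below is a placeholder for your own file):
Terminal
nextflow run <your_pipeline>.nf -c profiles.config --cutoff 100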
Below is a basic template of how the FILTER_YEAR
task might look:
allsteps_wf.smk
DATASETS = ["EB", "1kgp"]

rule all:
    input:
        expand("results/samples_{dataset}_cleaned.tsv", dataset=DATASETS)

rule preprocess:
    input:
        "/work/HPCLab_workshop/data/samples_{dataset}.tsv"
    output:
        tsv="results/samples_{dataset}_cleaned.tsv"
    shell:
        """
        python scripts/preprocess.py {input} {output}
        """

rule filter_year:
    input:
        meta="results/samples_{dataset}_cleaned.tsv"
    output:
        fi="results/samples_{dataset}_filtered.tsv"
    params:
        cutoff=2000
    shell:
        """
        python scripts/filter_year.py -i {input.meta} -o {output.fi} -y {params.cutoff}
        """
- Nextflow script
allsteps_flex.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"
params.cutoff = 2000

process PREPROCESS {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    tuple val(dataset), path(input_samples)

    output:
    tuple val(dataset), path("samples_${dataset}_cleaned.tsv")

    script:
    """
    preprocess.py $input_samples samples_${dataset}_cleaned.tsv
    """
}

process FILTER_YEAR {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'
    publishDir "${params.output_dir}", mode: 'copy'

    input:
    tuple val(dataset), path(input_cleaned)

    output:
    path "samples_${dataset}_filtered.tsv"

    script:
    """
    filter_year.py -i $input_cleaned \
        -o samples_${dataset}_filtered.tsv \
        -y ${params.cutoff}
    """
}

workflow {
    Channel
        .of('EB', '1kgp')
        .map { data -> tuple(data, file("${params.input_dir}/samples_${data}.tsv")) } | PREPROCESS | FILTER_YEAR
}
- Config file
profiles.config
conda.enabled = true
conda.channels = ['bioconda','conda-forge']

process {
    withName: FILTER_YEAR {
        errorStrategy = 'ignore'
    }
}
- Nextflow command
Terminal
nextflow run allsteps_flex.nf -c profiles.config
- You can test different thresholds for the year cutoff from the command line:
Terminal
nextflow run allsteps_flex.nf -c profiles.config --cutoff 1000 -resume