Follow these steps to run the exercises on UCloud.

You will use the Nextflow app, version 24.10.5. First, mount the two drives below, and ask for 2 CPUs so we can run things in parallel:

  • YourNameSurname#xxxx: save your results/files here.
  • /shared/HPCLab_workshop: contains input files and scripts. You have read-only access to this directory (no write permissions).

Then, select two optional parameters:

  • Interactive mode (true): to get access to a terminal and run the commands yourself.
  • Initialization: use the script /shared/HPCLab_workshop/hpc-pipes/setup.sh.

Finally, navigate to your personal drive and use the project directory you created yesterday to save all the output files from the exercises!

You can choose to run your tasks either locally or on an accessible HPC. Refer to the Welcome page for the software you need to install. Next, download the data.

Day 2 - Part 5

B. Nextflow

Let’s build your first Nextflow pipeline from scratch. What elements do we need?

my_pipeline/
├── <workflow_script>.nf # Nextflow script (e.g., main.nf)
├── <config_file>.config # Configuration file for pipeline parameters and environment settings (e.g., nextflow.config)
└── input.txt            # Input data
Tip

If you need help, try to find answers to your questions by consulting the Nextflow documentation and Nextflow Training, as they are great resources with many examples. This is how you would approach it for any project.

A few Nextflow examples for inspiration:

  • BLAST pipeline
  • RNA-seq pipeline
  • ML pipeline
Part I. Build your first Nextflow pipeline
Preparation step
  • Directory setup: Assuming you’ve followed the setup instructions and launched the Nextflow app by selecting Open terminal, create a new subdirectory inside hpclab (for example, nf-pipes) to store your pipeline and resulting outputs. Make sure it’s set as your working directory (e.g. cd hpclab/nf-pipes).

  • File creation: you will need to create two files for this exercise, the Nextflow script step1.nf and (further down) the config file nextflow.config:

    • Open a new file named step1.nf and include the following lines of code. The snippet below also shows how to write single-line and multi-line comments:

      step1.nf
      #!/usr/bin/env nextflow
      
      // Single-line comments
      
      /*
      This is a multi-line 
      comment example 
      */

You’ll now develop a simple Nextflow pipeline based on the first version of the step1_wf.smk Snakemake file you worked with (Day 2, Part 3.I). This pipeline preprocesses a sample metadata file using a Python script.

This version is intentionally kept simple: it’s not yet generalized and doesn’t make use of modular pipeline features. We’ll start here and gradually build toward a more flexible and reproducible workflow. Here’s what the original Snakemake pipeline looked like:

step1_wf.smk
rule preprocess:
  input:
    "/work/HPCLab_workshop/data/samples_EB.tsv"
  output:
    "results/samples_EB_cleaned.tsv"
  shell:
    """
    python scripts/preprocess.py {input} {output}
    """
Instructions

In your Nextflow version:

  • Open the Nextflow script you just created, step1.nf, and make sure it mirrors the logic of the Snakemake version.

  • Set the HOME environment variable in your terminal to your project directory (e.g., export HOME=/work/<YourNameSurname#xxxx>/hpclab/nf-pipes).

  • Define pipeline parameters at the top of the script. These variables help make the workflow more reusable later on:

    • Use params.input_dir to define the location of the input files, pointing to the shared input data directory (/shared/HPCLab_workshop).
    • Use params.output_dir to define where the results should be saved. Use your project directory for outputs (e.g., ${HOME}/results).
  • Write a single process, named PREPROCESS, to run the Python script:

    • Use input: to read samples_EB.tsv
    • Use output: to declare samples_EB_cleaned.tsv
    • Use publishDir to copy the output file to your results directory. publishDir makes your output files easier to locate by copying them to a defined location, so they are not buried deep inside Nextflow’s work directory.
  • Use a hardcoded path to the Python script for now: /work/<YourPersonalDrive>/hpclab/scripts/preprocess.py

  • Use the conda profile to manage the software environment for your pipeline. Among other options, you can include the conda environment within a process itself, specify it from the command line using the -with-conda flag, or define it in a config file. We recommend specifying it via the command line (-with-conda) for now, and moving the environment into the process in the next exercise, once the pipeline is working.

  • Take a look inside the work directory and explore some of the files (e.g. .command.sh and .command.log). Did Nextflow execute the command you expected?

  • The structure of the pipeline should look like:

    step1.nf
      process STEP_NAME {
        publishDir OUTPUT_LOCATION, mode: 'copy'
    
        input:
            path some_input
    
        output:
            path some_output
    
        script:
        """
        some_command $some_input $some_output
        """
      }
    
      workflow {
          Channel.fromPath() 
          STEP_NAME()
      }
Why publishDir and mode copy?

The mode option sets the file publishing method. It can be one of the following values:

  • ‘copy’: Copies the output files into the publish directory.
  • ‘copyNoFollow’: Copies the output files into the publish directory without following symlinks, i.e. it copies the links themselves.
  • ‘link’: Creates a hard link in the publish directory for each output file.
  • ‘move’: Moves the output files into the publish directory. Note: this is only supposed to be used for a terminal process i.e. a process whose output is not consumed by any other downstream process.
  • ‘rellink’: Creates a relative symbolic link in the publish directory for each output file.
  • ‘symlink’: Creates an absolute symbolic link in the publish directory for each output file (default).
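
For instance, a minimal sketch of a process that publishes its output as a real copy rather than the default symlink (the process and file names here are illustrative, not part of the exercise):

process EXAMPLE {
    // copy the real file into 'results/' instead of symlinking it
    publishDir 'results', mode: 'copy'

    output:
        path "hello.txt"

    script:
    """
    echo "hello" > hello.txt
    """
}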

You’ll also need a Nextflow config file. While simple pipelines may work with just default settings, more complex pipelines typically define one or more config files to separate concerns, such as:

  • Resource definitions (e.g., cpus, memory, time)
  • Execution profiles (e.g., local, slurm) and container environments (e.g., conda, singularity)
  • Pipeline parameters (e.g., input/output directories, filtering options)

For now, let’s just define the resources we want this pipeline to use:

nextflow.config
process {
    cpus = 1
    memory = '4 GB'
    time = '1h'
}
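
If you later want to cover the other concerns listed above in the same place, the config can also define execution profiles. A minimal sketch, assuming a Slurm scheduler were available (the workshop setup may differ):

nextflow.config
profiles {
    // run tasks on the local machine (the default executor)
    local {
        process.executor = 'local'
    }
    // submit each task as a Slurm job instead
    slurm {
        process.executor = 'slurm'
    }
}

You would then select one at runtime with -profile local or -profile slurm.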

Need help with the config file? Check the documentation below:

Config files

Nextflow searches for configuration files in several locations, and it’s important to understand the order in which they are applied. For more details, refer to the official documentation: https://www.nextflow.io/docs/latest/config.html.

A Nextflow configuration file can include any combination of assignments, blocks, and includes.

Blocks can contain multiple configuration options. Below, you’ll see the difference between dot syntax and block syntax.

nextflow.config
executor.retry.maxAttempt = 2

executor {
    retry.maxAttempt = 2
}
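
The includes mentioned above let you split settings across several files and pull them together from a main config; a minimal sketch (the file names are illustrative):

nextflow.config
// paths are resolved relative to the config file doing the including
includeConfig 'resources.config'
includeConfig 'profiles.config'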

If you are running into errors, make sure to go through the hint below first, as it provides useful tips and can serve as inspiration. There isn’t just one solution, so the hint can guide you toward different approaches to solving the problem.

Also, don’t forget to look in the work directory and review the log files (like .command.sh and .command.log). They’re often very informative and can help you identify and fix issues in your script before giving up.

Hint

It is very important to set HOME to a directory where you are working - your project directory. This is where Nextflow will look for the pipeline and configuration files, among other things.

How to run this workflow?

Example 1
nextflow run step1.nf -with-conda /work/HPCLab_workshop/miniconda3/envs/snakemake

Need help with the pipeline?

In the workflow block, we need to set up a channel to feed the input to the PREPROCESS process; then we can call the process itself to run on the contents of that channel.

step1.nf
/*
 * Pipeline parameters
 */

params.input_dir = "path/to/input_dir"
params.output_dir = "path/to/output_dir"

// Primary input
params.read_samples = "path/to/input_file"

process PREPROCESS {

  publishDir OUTPUT_LOCATION, mode: 'copy'

  input:
      path some_input

  output:
      path some_output

  script:
  """
  some_command $some_input $some_output
  """
}

workflow {
  // Create input channel from CLI parameter
  reads_ch = Channel.fromPath(params.read_samples)

  // Connect channel to process
  PREPROCESS (reads_ch)
}

Alternatively, you can pipe the input channel directly into the process in a single line.
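
For example, the two statements from the workflow block above collapse into one piped line:

workflow {
    // feed the input files straight into the process
    Channel.fromPath(params.read_samples) | PREPROCESS
}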

My solution

This is one possible solution. How does yours look?

  1. Nextflow script
step1.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"

// Primary input
params.read_samples = "${params.input_dir}/samples_EB.tsv"

process PREPROCESS {

    publishDir "${params.output_dir}", mode: 'copy'

    input:
        path input_samples
    
    output:
        path "samples_EB_cleaned.tsv"

    script:
    """
    python /work/<YourNameSurname>/hpclab/scripts/preprocess.py $input_samples samples_EB_cleaned.tsv
    """
}

workflow {
    Channel.fromPath(params.read_samples)
        | PREPROCESS
}
  2. Command-line
Option 1
nextflow run step1.nf -with-conda /work/HPCLab_workshop/miniconda3/envs/snakemake

Making the Pipeline Reusable, Reproducible, and Flexible

We want to avoid hardcoded absolute paths in our pipelines: they hurt reproducibility and break as soon as the project is moved to a different system or user environment. Instead, we use relative paths, which keep the workflow portable.

We also want to keep the workflow flexible and reusable, so it can easily handle multiple datasets or samples. Nextflow supports string interpolation, which allows you to dynamically construct file paths or commands using variables defined in the workflow.
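
As a sketch of both ideas, parameters can be anchored to Nextflow’s built-in projectDir variable (the directory containing the workflow script) and file paths assembled by interpolation instead of being hardcoded. The directory names below are illustrative; the workshop solution uses ${HOME} instead:

// portable defaults built from the pipeline's own location
params.input_dir  = "${projectDir}/data"
params.output_dir = "${projectDir}/results"

// string interpolation: the file path is assembled from a parameter
params.read_samples = "${params.input_dir}/samples_EB.tsv"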

Part II. Making the pipeline reproducible

Now it’s time to test additional functionalities and extend your pipeline.

1. Dynamic output names and paths

We’ll now expand the pipeline to support multiple datasets in a reusable way. In Snakemake, you likely used a wildcard such as {dataset} to dynamically build filenames. Here’s a simplified version of that:

step1_flex.smk
# step1_flex.smk

DATASETS = ["EB", "1kgp"]

rule all: 
    input:
        expand("results/samples_{dataset}_cleaned.tsv", dataset=DATASETS)

rule preprocess:
    input:
        "/work/HPCLab_workshop/data/samples_{dataset}.tsv"
    output:
        tsv="results/samples_{dataset}_cleaned.tsv"
    shell:
        """
        python scripts/preprocess.py {input} {output}
        """

To do the same in Nextflow, we introduce a dataset variable and use a channel to loop over all datasets, just like the wildcard in Snakemake. The processed output file is dynamically named using the value of dataset.
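
A minimal sketch of just that part (the complete version appears in the solution below):

// one channel element per dataset, like the {dataset} wildcard values
ch_dataset = Channel.of('EB', '1kgp')

// inside the process, the output name is then built from that value:
//   output:
//       path "samples_${dataset}_cleaned.tsv"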

Example: Dynamic output names and paths

Take a look at this pipeline, where they use the cohort name in a similar way to how you’ll define the dataset: Link.

2. Standardizing script management

To stay organized, create a bin subfolder inside your pipeline directory. Move your script files into this folder and make them executable (e.g., chmod +x bin/myscript.py). Nextflow automatically adds the bin folder to the system PATH, so you can simply refer to script names in your workflow without full paths.
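
A minimal sketch of the commands, assuming your scripts currently live in a scripts/ folder and start with a #!/usr/bin/env python shebang line (required for running them by name):

Terminal
mkdir -p bin
mv scripts/preprocess.py bin/
chmod +x bin/preprocess.py

A process can then call the script without any path, e.g. preprocess.py $input_samples samples_cleaned.tsv in its script block.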

Bin directory and templates
  • Read more about the bin directory here
  • Another way to use scripts in Nextflow is through templates. You can read more about this here.

3. Configuring conda envs using profiles

In your Nextflow pipeline, specify the Conda environment you want to use for a specific process by including the conda directive. Next, create or update a configuration file (e.g., profiles.config) to enable the Conda profile (using conda.enabled = true).

It is highly recommended to check the hint below.

Conda environments

Click for Nextflow documentation on conda.

You can use the conda directive in a process to point to a specific environment, like this:

process PREPROCESS {
    conda 'path/to/conda/envs/myenv'
    ...
}

You will also need to enable the conda profile using a config file:

profiles.config
profiles {
    conda {
        conda.enabled = true
        conda.channels = ['bioconda','conda-forge']
    }
}

How can you run the pipeline now?

Terminal
nextflow run step1.nf -c profiles.config -profile conda

If you are expecting to only have one profile (e.g. pipeline only runs with conda, which is not very common for HPC setups), then your config file can look like this:

profile_conda.config
profiles {
    conda.enabled = true
    conda.channels = ['bioconda','conda-forge']
}

In this case, you don’t need to specify -profile conda when running the pipeline:

Terminal
nextflow run step1.nf -c profile_conda.config 
My Solution II
step1_flex.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"

process PREPROCESS {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'

    publishDir "${params.output_dir}", mode: 'copy'

    input:
        val dataset
        path input_samples
    
    output:
        path "samples_${dataset}_cleaned.tsv"

    script:
    """
    python preprocess.py $input_samples samples_${dataset}_cleaned.tsv
    """
}

workflow {
    ch_dataset = Channel.of('EB', '1kgp')
    ch_input   = ch_dataset.map { dt -> file("${params.input_dir}/samples_${dt}.tsv") }

    PREPROCESS(ch_dataset, ch_input)
}
Bonus exercise: extending the pipeline

You will now add a new task, FILTER_YEAR, to your Nextflow pipeline. This task will be responsible for filtering datasets based on the year information (or other criteria). The idea is to use the errorStrategy = 'ignore' setting in the config file, allowing the pipeline to skip filtering when it’s not applicable, such as when working with the 1kgp dataset (as this info is not available).

The filter_year.py script needs a value for its year-cutoff flag (-y). It’s better to define this cutoff as a parameter (e.g., params.cutoff = 2000), which can then be modified through the config file or the command line. This approach gives you the flexibility to adjust the value without changing the pipeline code. As part of this exercise, run the pipeline twice, first using the default cutoff and then with a different one (e.g. 100).
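
A minimal sketch of both ways to change the value (the numbers are illustrative):

profiles.config
// overrides the default defined in the script (params.cutoff = 2000)
params.cutoff = 1800

Terminal
nextflow run allsteps_flex.nf -c profiles.config --cutoff 100

Values given on the command line take precedence over those set in config files, which in turn override the defaults defined in the script.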

Below is the Snakemake version of the extended workflow; use its filter_year rule as a template for your FILTER_YEAR process:

allsteps_wf.smk
DATASETS = ["EB", "1kgp"]

rule all: 
    input:
        expand("results/samples_{dataset}_cleaned.tsv", dataset=DATASETS)

rule preprocess:
    input:
        "/work/HPCLab_workshop/data/samples_{dataset}.tsv"
    output:
        tsv="results/samples_{dataset}_cleaned.tsv"
    shell:
        """
        python scripts/preprocess.py {input} {output}
        """

rule filter_year:
    input:
        meta="results/samples_{dataset}_cleaned.tsv"
    output:
        fi="results/samples_{dataset}_filtered.tsv"
    params: 
        cutoff=2000
    shell:
        """
        python scripts/filter_year.py -i {input.meta} -o {output.fi} -y {params.cutoff}
        """
My solution III
  1. Nextflow script
allsteps_flex.nf
#!/usr/bin/env nextflow

// Pipeline parameters
params.input_dir = "${HOME}/data"
params.output_dir = "${HOME}/results"
params.cutoff = 2000


process PREPROCESS {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'

    publishDir "${params.output_dir}", mode: 'copy'

    input:
        tuple val(dataset), path(input_samples)
    
    output:
        tuple val(dataset), path("samples_${dataset}_cleaned.tsv")

    script:
    """
    preprocess.py $input_samples samples_${dataset}_cleaned.tsv
    """
}

process FILTER_YEAR {
    conda '/work/HPCLab_workshop/miniconda3/envs/snakemake'

    publishDir "${params.output_dir}", mode: 'copy'

    input: 
        tuple val(dataset), path(input_cleaned)
    output:
        path "samples_${dataset}_filtered.tsv"
    
    script:
    """
    filter_year.py -i $input_cleaned \
        -o samples_${dataset}_filtered.tsv \
        -y ${params.cutoff}
    """
}


workflow {
    Channel
        .of('EB', '1kgp')
        .map { data -> tuple(data, file("${params.input_dir}/samples_${data}.tsv")) } | PREPROCESS | FILTER_YEAR
}
  2. Config file
profiles.config
profiles {
    conda.enabled = true
    conda.channels = ['bioconda','conda-forge']
}

process {
  withName: FILTER_YEAR {
    errorStrategy = 'ignore'
  }
}
  3. Nextflow command
Terminal
nextflow run allsteps_flex.nf -c profiles.config
  4. You can test different thresholds for the year cutoff using the command line:
Terminal
nextflow run allsteps_flex.nf -c profiles.config --cutoff 1000 -resume

Copyright

CC-BY-SA 4.0 license