Queueing systems
Job submission with SLURM
sbatch → submit a shell script to the queue
squeue → see all the jobs in the queue
squeue -u USERNAME → see your jobs only
scancel JOBID → cancel the job with the specified ID
scancel -u USERNAME → cancel all your jobs
seff JOBID → get efficiency information about your job
Submitting a Batch Job with SLURM
In this exercise, you will prepare sequencing data, create a software environment, write a SLURM batch script, and submit an alignment job to the cluster queueing system.
- Create a new subdirectory called batchLaunch inside your advancedGDK folder and move (cd) into it:
mkdir -p ~/advancedGDK/batchLaunch
cd ~/advancedGDK/batchLaunch
- Download the Input Data (FASTQ files):
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz \
-O ./data.fastq.gz
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R2_001.fastq.gz \
-O ./data2.fastq.gz
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L002_R1_001.fastq.gz \
-O ./data3.fastq.gz
wget https://github.com/hartwigmedical/testdata/raw/master/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L002_R2_001.fastq.gz \
-O ./data4.fastq.gz
Uncompress the FASTQ files:
gunzip data*.fastq.gz
Download the reference genome:
wget http://genomedata.org/rnaseq-tutorial/fasta/GRCh38/chr22_with_ERCC92.fa \
-O ref.fasta
- Create a Conda environment containing bwa-mem2 and samtools:
conda create -c conda-forge -c bioconda -n batchLaunch bwa-mem2 samtools
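Optionally, check that the environment works before writing the batch script (a quick interactive sanity check; bwa-mem2 version and samtools --version simply print version information):
```{.bash}
conda activate batchLaunch
bwa-mem2 version     # prints the bwa-mem2 version if the install succeeded
samtools --version   # prints the samtools version
conda deactivate
```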
- Create a Batch script called align.sh using a text editor (e.g. nano) and add the following content to the file:
align.sh
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
#SBATCH --time=04:00:00
# Initialise conda
source ~/miniconda3/etc/profile.d/conda.sh
# Activate environment
conda activate batchLaunch
# Index reference
bwa-mem2 index ref.fasta
# Align reads and sort BAM file
bwa-mem2 mem -t 4 ref.fasta \
data.fastq \
| samtools sort \
-@ 3 \
-n \
-O BAM \
> data.bam
exit 0
- Submit the batch script to SLURM:
```{.bash}
sbatch align.sh
```
Monitor the Job (check the status)
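For example (USERNAME is your cluster user name and 123456 stands for the job id printed by sbatch):
```{.bash}
squeue -u USERNAME   # list your queued and running jobs (PD = pending, R = running)
seff 123456          # once the job has finished, report CPU and memory efficiency
```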
Cancel a Job (Optional)
If needed, you can cancel a running or queued job using: scancel JOBID
Submitting a batch array
In this exercise, you will run the same alignment operation on multiple FASTQ files in parallel using a SLURM job array.
- Create directories for output files and log files, then generate a list containing all FASTQ files:
mkdir -p results logs
# List of FASTQ files
ls *.fastq > fastq_list.txt
- Open a text editor and create a new batch script called align_array.sh.
align_array.sh
#!/bin/bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=8g
#SBATCH --time=01:00:00
#SBATCH --array=1-4%2
#SBATCH --job-name=alignArray
#SBATCH --output=logs/align_%A_%a.out
#SBATCH --error=logs/align_%A_%a.err
set -euo pipefail
source ~/miniconda3/etc/profile.d/conda.sh
conda activate batchLaunch
mapfile -t fastqs < fastq_list.txt
fq="${fastqs[$((SLURM_ARRAY_TASK_ID-1))]}"
sample=$(basename "$fq" .fastq)
bwa-mem2 mem -t "$SLURM_CPUS_PER_TASK" ref.fasta "$fq" \
| samtools sort -@ 1 -O BAM \
> "results/${sample}.bam"
exit 0
Make your pipelines easier to debug by using this command to avoid silent errors and incomplete outputs:
set -euo pipefail
-e # Exit immediately if any command returns a non-zero exit status.
-u # Treat the use of undefined variables as an error.
-o pipefail # Make a pipeline fail if any command in the pipeline fails, not just the last one.
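A minimal illustration of what -o pipefail changes, to run in a fresh shell (false here just stands in for any failing command in a pipeline):
```{.bash}
false | sort     # without pipefail: the pipeline's status is that of sort, which succeeded
echo $?          # prints 0, so the failure of false went unnoticed
set -o pipefail
false | sort     # with pipefail: the failure of false is propagated
echo $?          # prints 1
```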
- Submit and monitor the array job:
```{.bash}
sbatch align_array.sh
```
- Check that outputs and logs are separated by array task id (%a) and parent job id (%A). Open one of the log files!
```{.bash}
ls results/
ls logs/
```
With --array=1-4%2, SLURM creates 4 tasks and runs at most 2 concurrently.
Each task gets its own SLURM_ARRAY_TASK_ID, used to pick one input file from fastq_list.txt.
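A small sketch of how this selection works; here the task id is set by hand purely for illustration, and the listed file names assume the four FASTQ files downloaded earlier:
```{.bash}
# fastq_list.txt should contain one file name per line, e.g.:
#   data.fastq
#   data2.fastq
#   data3.fastq
#   data4.fastq
SLURM_ARRAY_TASK_ID=3                          # set by SLURM inside a real array task
mapfile -t fastqs < fastq_list.txt             # read the list into a bash array
echo "${fastqs[$((SLURM_ARRAY_TASK_ID-1))]}"   # prints data3.fastq (bash arrays are 0-based)
```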
Chained Jobs with Dependencies
In this exercise, you will create a simple 3-step pipeline using SLURM job dependencies. The indexing step should only start if the alignment step completes successfully.
Create a second batch script called index_array.sh:
index_array.sh
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=4g
#SBATCH --time=00:30:00
#SBATCH --array=1-4
#SBATCH --job-name=indexArray
#SBATCH --output=logs/index_%A_%a.out
#SBATCH --error=logs/index_%A_%a.err
set -euo pipefail
# Initialise conda
source ~/miniconda3/etc/profile.d/conda.sh
# Activate environment
conda activate batchLaunch
# Create an array containing BAM files
bams=(results/*.bam)
# Select the BAM file corresponding to the current array task
bam="${bams[$((SLURM_ARRAY_TASK_ID-1))]}"
# Index BAM file
samtools index "$bam"
exit 0
After creating the script, submit the indexing job so that it only starts after the alignment array job has completed successfully. Use the --dependency=afterok:<jobid> option with sbatch.
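For example, assuming the alignment array has not been submitted yet, the two scripts can be chained like this (--parsable makes sbatch print only the job id; the full three-step chain is shown at the end of the exercise):
```{.bash}
ALIGN_ID=$(sbatch --parsable align_array.sh)             # submit alignment and capture its job id
sbatch --dependency=afterok:${ALIGN_ID} index_array.sh   # indexing starts only if alignment succeeds
```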
Create a final reporting job called report_job.sh. This job will generate a simple summary file listing all BAM and BAI files produced by the pipeline.
report_job.sh
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=1g
#SBATCH --time=00:10:00
#SBATCH --job-name=arrayReport
#SBATCH --output=logs/report_%j.out
set -euo pipefail
echo "BAM files" > results/summary.txt
ls -1 results/*.bam >> results/summary.txt
echo "" >> results/summary.txt
echo "BAI files" >> results/summary.txt
ls -1 results/*.bam.bai >> results/summary.txt
exit 0
Submit all jobs so that:
- indexing starts only after alignment succeeds
- reporting starts only after indexing succeeds
ALIGN_ID=$(sbatch --parsable align_array.sh)
INDEX_ID=$(sbatch --parsable \
--dependency=afterok:${ALIGN_ID} \
index_array.sh)
REPORT_ID=$(sbatch --parsable \
--dependency=afterok:${INDEX_ID} \
report_job.sh)
echo "ALIGN=${ALIGN_ID} INDEX=${INDEX_ID} REPORT=${REPORT_ID}"
squeue --me -n alignArray,indexArray,arrayReport
After all jobs have completed, inspect the generated summary file:
cat results/summary.txt
With the afterok option, a job will only start if the specified parent job finishes successfully.
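To verify afterwards how each stage ended, you can query SLURM's accounting records, assuming sacct is enabled on your cluster (the job ids come from the submission step above):
```{.bash}
sacct -j ${ALIGN_ID},${INDEX_ID},${REPORT_ID} \
      --format=JobID,JobName,State,ExitCode
```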