Raw data alignment

Author

Samuele Soraggi, Manuel Peral-Vázquez, Stig U Andersen, Mikkel H Schierup

Modified

June 19, 2026

NoteTutorial description

This tutorial will cover the steps for performing the alignment of raw RNA- and HiFi-sequencing data. You will need to use the software IGV on your computer to visualize some of the output files, which can be easily downloaded once they are produced. At the end of this tutorial you will be able to:

  • perform and discuss quality control on raw data in fastq format using FastQC and MultiQC
  • align HiFi and RNA sequencing data with dedicated tools such as MiniMap2 and STAR
  • analyze the quality the alignment with qualimap

The output of this notebook will be used for the Variant calling analysis and the bulk RNA-sequencing analysis.

Biological background

White clover (Trifolium repens) is a common plant found in fields across the world. It has an unusual origin story: it’s the result of two different plant species “merging” their genetic material. At some point during evolution (during the last ice age), these two diploid (2n) parent species—T. occidentale and T. pallescens—naturally hybridized and created white clover, which is in result allotetraploid (4n) (see figure below).

Kernel Choice
  • Normal plants have two copies of each gene: one from maternal origin, one from paternal origin (these plants are diploid, 2n)
  • White clover, however has four copies of each gene. That is, two copies inherited from T. occidentale and two from T. pallescens (called an allotetraploid, 4n)
  • These four copies are in the same nucleus, you can think of it as two complete genomes side-by-side.

Normally, white clover is an obligate outcrosser species—that means that plants swap pollen with other plants instead of fertilizing themselves. However, a special self-compatible (it can self-fertilize) line was used for sequencing its genome (Griffiths et al. 2019). This line is designated as S10 in our data (this is the 10th self-fertilized generation). In addition, we also have data from a wild clover variety (ecotype) called Tienshan (Ti), from mountains in China. This variety is adapted to alpine conditions, making it genetically different from the S10 line.

We have sequences (DNA “reads”) from both clover varieties, and we need to align them to the white clover reference genome. However, our reference genome is tricky: it contains sequences from both parent species (we call them contig 1 and contig 2). Therefore, when we align short DNA reads to the reference, some reads might match equally well to both contig 1 and contig 2 (they’re similar but different species).

We’ll use quality control tools to: (1) Align reads to the complete reference (both contigs together); (2) Align reads to each subgenome separately (contig 1 alone, contig 2 alone); and (3) Compare the results to see how this affects our data quality and interpretation

Quality control and mapping

Quality Control

We run FastQC on the PacBio Hifi reads and on two of the Illumina RNA-seq libraries. FastQC does quality control of the raw sequence data, providing an overview of the data which can help identify if there are any problems that should be addressed before further analysis. You can find the report for each file into the folder results/fastqc_output/. The output is in HTML format and can be opened in any browser or in jupyterlab. It is however not easy to compare the various libraries by opening separate reports. To aggregate all the results, we apply the MultiQC software to the reports’ folder. The output of MultiQC is in the directory results/multiqc_output/fastqc_data.

%%bash
#run fastqc
mkdir -p results/fastqc_output
fastqc -q -o results/fastqc_output ../Data/Clover_Data/*.fastq  > /dev/null 2>&1
echo "Done \n Files: $(ls ../Data/Clover_Data/*.fastq)"
Done \n Files: ../Data/Clover_Data/Hifi_reads_white_clover.fastq
../Data/Clover_Data/S10_1_1.R1.fastq
../Data/Clover_Data/S10_1_1.R2.fastq
../Data/Clover_Data/S10_1_2.R1.fastq
../Data/Clover_Data/S10_1_2.R2.fastq
../Data/Clover_Data/S10_1_3.R1.fastq
../Data/Clover_Data/S10_1_3.R2.fastq
../Data/Clover_Data/S10_2_1.R1.fastq
../Data/Clover_Data/S10_2_1.R2.fastq
../Data/Clover_Data/S10_2_2.R1.fastq
../Data/Clover_Data/S10_2_2.R2.fastq
../Data/Clover_Data/S10_2_3.R1.fastq
../Data/Clover_Data/S10_2_3.R2.fastq
../Data/Clover_Data/TI_1_1.R1.fastq
../Data/Clover_Data/TI_1_1.R2.fastq
../Data/Clover_Data/TI_1_2.R1.fastq
../Data/Clover_Data/TI_1_2.R2.fastq
../Data/Clover_Data/TI_1_3.R1.fastq
../Data/Clover_Data/TI_1_3.R2.fastq
../Data/Clover_Data/TI_2_1.R1.fastq
../Data/Clover_Data/TI_2_1.R2.fastq
../Data/Clover_Data/TI_2_2.R1.fastq
../Data/Clover_Data/TI_2_2.R2.fastq
../Data/Clover_Data/TI_2_3.R1.fastq
../Data/Clover_Data/TI_2_3.R2.fastq

Note: fastqc prints a lot of output conisting of a simple confirmation of execution without error, even when using the option -q, which means quiet. Therefore we added > /dev/null 2>&1 to the command to mute the printing of that output.

%%bash
#run multiqc
mkdir -p results/multiqc_output/fastqc_data
multiqc --outdir results/multiqc_output/fastqc_data results/fastqc_output
/// ]8;id=9572713;https://multiqc.info\MultiQC]8;;\ 🔍 v1.35



       file_search | Search path: /faststorage/project/ngssummer2024/2026/samuele/results/fastqc_output

         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 50/50  ults/fastqc_output/S10_2_2.R2_fastqc.htmlhtmltml

            fastqc | Found 25 reports

     write_results | Existing reports found, adding suffix to filenames. Use '--force' to overwrite.

     write_results | Data        : results/multiqc_output/fastqc_data/multiqc_data_1_2_3

     write_results | Report      : results/multiqc_output/fastqc_data/multiqc_report_3.html

           multiqc | MultiQC complete
TipQuestions

Visualize the Webpage generated by MultiQC.

Hint: You can find a Help button that offers additional information about the plots for each panel. Focus on the following panels: “Per base sequence quality”, “Per sequence quality scores”…. (“Per base sequence content” always gives a FAIL for RNA-seq data).

Look at the sequence quality scores:

  • Q1. is there a marked difference between HiFi and illumina data?
  • Q2. What do you notice with respect to the sequence quality scores? And are there any other quality issues worth noting?
  • Q3. Is there anything off in the GC content? Why?
  • Q4. Think about why “Per base sequence content” always fails. Can you imagine how the nucleotide content is biased because of priming bias during cDNA synthesis and expressed transcripts in the sample?

Hifi data mapping — long DNA reads

We map the PacBio Hifi reads (Hifi_reads_white_clover.fastq) to the white clover reference sequence (Contig1&2) using minimap2.

To demonstrate how mapping algorithm choice can heavily affect results, we deliberately run the alignment twice with two different preset options (-x flag):

  • map-hifi: optimized for long reads (PacBio HiFi data)
  • sr: optimized for short reads (Illumina data)

The map-hifi setting expects long reads and employ algorithms that handle large insertions, deletions, and other structural variations more effectively. The scoring and alignment thresholds are adjusted to account for the longer sequence context. The sr setting xpect shorter reads and optimize for quick, efficient alignment of these short sequences. The focus is on minimizing mismatches and handling the dense packing of short reads.

The map-hifi setting is designed for long reads and is more lenient with gaps and mismatches typical of long-read sequencing. The sr setting assumes short reads and is stricter about matches. By comparing both, you can see how choosing the wrong algorithm/option can affect your results—an important lesson in bioinformatics. know your options.

Next, we create reports of the mapping results by running QualiMap on the two obtained SAM files.

We first need to index the reference fasta files using samtools faidx. This produces files in .fai format containing informations about length of the reference sequence, offset for the quality scores, name of the reference sequence. Click here for a detailed overview of fai format.

%%bash
#copy the reference data in the folder reference_data, so that you can write the indexing files
mkdir -p reference_data
cp ../Data/Clover_Data/DNA_Contig1_2.fasta ../Data/Clover_Data/DNA_Contig1.fasta ../Data/Clover_Data/DNA_Contig2.fasta reference_data
%%bash
samtools faidx reference_data/DNA_Contig1_2.fasta
samtools faidx reference_data/DNA_Contig1.fasta
samtools faidx reference_data/DNA_Contig2.fasta

we create an output folder for the HIFI alignment, and run minimap2 with the settings explained before.

%%bash 
mkdir -p results/HIFI_alignment/
minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sam \
                            reference_data/DNA_Contig1_2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq 

minimap2 -a -x sr -o results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sam \
                            reference_data/DNA_Contig1_2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::mm_idx_gen::0.043*0.93] collected minimizers
[M::mm_idx_gen::0.071*1.22] sorted minimizers
[M::main::0.072*1.20] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.075*1.19] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.077*1.18] distinct minimizers: 165056 (78.87% are singletons); average occurrences: 1.271; average spacing: 9.959; total length: 2089554
[M::worker_pipeline::7.785*2.61] mapped 4395 sequences
[M::main] Version: 2.31-r1302
[M::main] CMD: minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sam reference_data/DNA_Contig1_2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 7.815 sec; CPU: 20.384 sec; Peak RSS: 1.347 GB
[M::mm_idx_gen::0.058*0.90] collected minimizers
[M::mm_idx_gen::0.091*1.08] sorted minimizers
[M::main::0.092*1.07] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.092*1.07] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.095*1.07] distinct minimizers: 278081 (79.60% are singletons); average occurrences: 1.257; average spacing: 5.977; total length: 2089554
[M::worker_pipeline::4.029*2.58] mapped 3075 sequences
[M::worker_pipeline::5.701*2.49] mapped 1320 sequences
[M::main] Version: 2.31-r1302
[M::main] CMD: minimap2 -a -x sr -o results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sam reference_data/DNA_Contig1_2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 5.719 sec; CPU: 14.218 sec; Peak RSS: 0.182 GB

samtools sort is used to sort the alignment with left-to-right coordinates. The output is in .bam format, with .sam files in input (Note that you could have gotten .bam files from minimap2 with a specific option).

%%bash
samtools sort results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sam \
                -o results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sort.bam

samtools sort results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sam \
                -o results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sort.bam

samtools index creates the index for the bam file, stored in .bai format. The index file lets programs access any position into the aligned data without reading the whole file, which would take too much time.

%%bash
samtools index results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sort.bam
samtools index results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sort.bam
TipQuestions - IGV

Now you can inspect the alignment files in IGV.

  • First, you need the references fasta sequence in ../Data/Clover_Data/DNA_Contig1_2.fasta, ../Data/Clover_Data/DNA_Contig1.fasta, ../Data/Clover_Data/DNA_Contig2.fasta imported into IGV. Remember: this is done with the menu Genomes --> Load Genome from file and by selecting the relevant fasta file. Then, choose the reference you need from the drop-down menu (see figure below). You will not yet see much, but you can choose one of the two subgenomes (contig 1 or 2) and double click on a chromosome position to inspect the reference sequence. The next step will visualize the mapped files on IGV.

 

  • Choose the reference DNA_Contig1_2.fasta and load the two .bam files obtained with the two different alignment settings (results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sort.bam, results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sort.bam) to show them against the reference. This is done by choosing the .bam files from the File --> Load from file menu.

    • Q5. What do you notice about them? What is characteristic about the alignment sr?
    • Q6. Can you link these characteristics to the quality report seen before?
    • Q7. How did the alignment forced itself to align the reads for sr settings? Do you see a lot of mismatches and indels in the sr alignment?

Run quality control on both files

%%bash
unset DISPLAY

qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_1_2_maphifi.sort.bam \
                 -outdir results/qualimap_output/PacBio_clover_alignment_1_2_maphifi

qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_1_2_sr.sort.bam \
                 -outdir results/qualimap_output/PacBio_clover_alignment_1_2_sr
Java memory size is set to 1200M
Launching application...

QualiMap v.2.3
Built on 2023-05-19 16:57

Selected tool: bamqc
Available memory (Mb): 35
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 401
Chunk of reads size: 1000
Number of threads: 4
Processed 50 out of 401 windows...
Processed 100 out of 401 windows...
Processed 150 out of 401 windows...
Processed 200 out of 401 windows...
Processed 250 out of 401 windows...
Processed 300 out of 401 windows...
Processed 350 out of 401 windows...
Processed 400 out of 401 windows...
Total processed windows:401
Number of reads: 4395
Number of valid reads: 4810
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4395
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 5
Computing descriptors...
numberOfMappedBases: 70053032
referenceSize: 2089554
numberOfSequencedBases: 69900791
numberOfAs: 23571707
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 5
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished
Java memory size is set to 1200M
Launching application...

QualiMap v.2.3
Built on 2023-05-19 16:57

Selected tool: bamqc
Available memory (Mb): 35
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 401
Chunk of reads size: 1000
Number of threads: 4
Processed 50 out of 401 windows...
Processed 100 out of 401 windows...
Processed 150 out of 401 windows...
Processed 200 out of 401 windows...
Processed 250 out of 401 windows...
Processed 300 out of 401 windows...
Processed 350 out of 401 windows...
Processed 400 out of 401 windows...
Total processed windows:401
Number of reads: 4395
Number of valid reads: 9636
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4395
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 4
Computing descriptors...
numberOfMappedBases: 37073578
referenceSize: 2089554
numberOfSequencedBases: 37017351
numberOfAs: 12429824
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 4
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished

For easier comparison, we can again collapse the two reports into a single one using MultiQC, in the same way we did for putting together the other reports from fastQC.

%%bash

#run multiqc
multiqc --outdir results/qualimap_output results/qualimap_output
/// ]8;id=937117;https://multiqc.info\MultiQC]8;;\ 🔍 v1.35



       file_search | Search path: /faststorage/project/ngssummer2024/2026/samuele/results/qualimap_output

         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 92/92   

          qualimap | Found 2 BamQC reports

     write_results | Data        : results/qualimap_output/multiqc_data

     write_results | Report      : results/qualimap_output/multiqc_report.html

           multiqc | MultiQC complete
TipQuestions

Now you can visualize the report generated, which is in results/qualimap_output/multiqc_report.html

  • Q8. How do the average coverage of each sample correspond to what you have seen in the alignments using IGV?
  • Q9. How comes that the GC content remains the same across the different alignments?

Next, we map the white clover PacBio Hifi reads to contig1 and contig2 separately, using the setting you selected at the previous step (let’s say map-hifi was chosen, but you are free to change this setting in the commands). As the two contigs represent the two white clover subgenomes, this mapping will allow you to see the two subgenome haplotypes and call subgenome SNPs.

%%bash 
minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_1.sam \
                            reference_data/DNA_Contig1.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::mm_idx_gen::0.027*0.82] collected minimizers
[M::mm_idx_gen::0.035*0.84] sorted minimizers
[M::main::0.036*0.82] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.038*0.83] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.039*0.83] distinct minimizers: 91166 (91.66% are singletons); average occurrences: 1.139; average spacing: 9.939; total length: 1031631
[M::worker_pipeline::25.186*2.81] mapped 4395 sequences
[M::main] Version: 2.31-r1302
[M::main] CMD: minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_1.sam reference_data/DNA_Contig1.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 25.190 sec; CPU: 70.676 sec; Peak RSS: 1.495 GB
%%bash 
minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_2.sam \
                            reference_data/DNA_Contig2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::mm_idx_gen::0.035*0.70] collected minimizers
[M::mm_idx_gen::0.049*0.80] sorted minimizers
[M::main::0.049*0.80] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.051*0.80] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.052*0.81] distinct minimizers: 97892 (93.26% are singletons); average occurrences: 1.083; average spacing: 9.980; total length: 1057923
[M::worker_pipeline::25.597*2.81] mapped 4395 sequences
[M::main] Version: 2.31-r1302
[M::main] CMD: minimap2 -a -x map-hifi -o results/HIFI_alignment/PacBio_clover_alignment_2.sam reference_data/DNA_Contig2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 25.604 sec; CPU: 72.041 sec; Peak RSS: 1.380 GB

Sort the bam files and create their index using samtools

%%bash
samtools sort results/HIFI_alignment/PacBio_clover_alignment_1.sam -o results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam
samtools sort results/HIFI_alignment/PacBio_clover_alignment_2.sam -o results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam
%%bash
samtools index results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam
samtools index results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam
TipQuestions - IGV

Now you can inspect the alignment files in IGV.

  • Remember, you need the references fasta sequence in ../Data/Clover_Data/DNA_Contig1_2.fasta, ../Data/Clover_Data/DNA_Contig1.fasta, ../Data/Clover_Data/DNA_Contig2.fasta imported into IGV. Remember: this is done with the menu Genomes --> Load Genome from file and by selecting the relevant fasta file. Then, choose the reference you need from the drop-down menu (see figure below).

 

  • Load the two bam files obtained with the map-hifi settings for contig1 and contig2 separately (results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam and results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam). Look first at those two files:

    • Q10. Which differences do you observe?
    • Q11. Have a look at their polymorphic regions in IGV. Are they true polymorphisms?

 

  • Look at the alignment to both contigs:

    • Q12. Why do you see fluctuations in coverage and large regions without any apparent subgenome SNPs?
    • Q13. What are the major differences between the stats for the reads mapped to Contigs1&2 versus contig1 and contig2?
    • Q14. What is your interpretation of the differences?

RNA-seq mapping — short RNA reads

In the ../Data folder you will find 24 RNA-seq libraries: 12 from the S10 line and 12 from Tienshan. Each library is paired-end, which means that each library is split into two files—one for forward reads (R1) and one for reverse reads (R2). For example: S10_1_1.R1.fastq and S10_1_1.R2.fastq.

We align each library separately, then merge the 12 S10 alignments into one sample and the 12 Tienshan alignments into another. This gives us two final samples for comparison.

Before aligning reads, we need to prepare the reference genome for STAR. This involves two steps:

  1. Build a genome index using STAR --runMode genomeGenerate. This creates a searchable index of the reference genome, allowing STAR to align reads quickly without scanning the whole sequence every time. STAR createas a file in SA format which contains an indexed collection of suffixes, which makes the navigation of suffix trees very fast.

  2. Convert gene annotations from gff to gtf format using gffread. STAR needs annotations in GTF format to count gene-level expression (how many reads map to each gene). When the gtf file is created, try to open it as a text file to find out its content. Compare with this guide.

STAR is a very complex tool with many options, so it is always useful to have a reference manual (even LLM agents can get STAR wrong because of changing options in new versions and specific combinations of options in specific scenarios).

%%bash
gffread -T -o reference_data/white_clover_genes.gtf ../Data/Clover_Data/white_clover_genes.gff
%%bash
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir results/STAR_output/indexing_contigs_1_2 \
--genomeFastaFiles reference_data/DNA_Contig1_2.fasta \
--sjdbGTFfile reference_data/white_clover_genes.gtf 
    STAR --runThreadN 8 --runMode genomeGenerate --genomeDir results/STAR_output/indexing_contigs_1_2 --genomeFastaFiles reference_data/DNA_Contig1_2.fasta --sjdbGTFfile reference_data/white_clover_genes.gtf
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 08:37:22 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into results/STAR_output/indexing_contigs_1_2/Log.out. Will keep ./Log.out
Jun 18 08:37:22 ... starting to generate Genome files
Jun 18 08:37:22 ..... processing annotations GTF
!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=2089554, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 9
Jun 18 08:37:22 ... starting to sort Suffix Array. This may take a long time...
Jun 18 08:37:22 ... sorting Suffix Array chunks and saving them to disk...
Jun 18 08:37:22 ... loading chunks from disk, packing SA...
Jun 18 08:37:22 ... finished generating suffix array
Jun 18 08:37:22 ... generating Suffix Array index
Jun 18 08:37:26 ... completed Suffix Array index
Jun 18 08:37:26 ..... inserting junctions into the genome indices
Jun 18 08:37:32 ... writing Genome to disk ...
Jun 18 08:37:32 ... writing Suffix Array to disk ...
Jun 18 08:37:32 ... writing SAindex to disk
Jun 18 08:37:33 ..... finished successfully

We got a warning saying

!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=2089554, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 9

meaning we need shorter strings of bases (9 bases instead of 14) to be indexed, as our reference genome is very short, and too long strings would cause many alignment errors. So we rerun the command with the suggested option (down below).

%%bash
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir results/STAR_output/indexing_contigs_1_2 \
--genomeFastaFiles reference_data/DNA_Contig1_2.fasta \
--sjdbGTFfile reference_data/white_clover_genes.gtf \
--genomeSAindexNbases 9
    STAR --runThreadN 8 --runMode genomeGenerate --genomeDir results/STAR_output/indexing_contigs_1_2 --genomeFastaFiles reference_data/DNA_Contig1_2.fasta --sjdbGTFfile reference_data/white_clover_genes.gtf --genomeSAindexNbases 9
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 08:42:06 ..... started STAR run
!!!!! WARNING: Could not move Log.out file from ./Log.out into results/STAR_output/indexing_contigs_1_2/Log.out. Will keep ./Log.out
Jun 18 08:42:06 ... starting to generate Genome files
Jun 18 08:42:06 ..... processing annotations GTF
Jun 18 08:42:07 ... starting to sort Suffix Array. This may take a long time...
Jun 18 08:42:07 ... sorting Suffix Array chunks and saving them to disk...
Jun 18 08:42:07 ... loading chunks from disk, packing SA...
Jun 18 08:42:07 ... finished generating suffix array
Jun 18 08:42:07 ... generating Suffix Array index
Jun 18 08:42:07 ... completed Suffix Array index
Jun 18 08:42:07 ..... inserting junctions into the genome indices
Jun 18 08:42:07 ... writing Genome to disk ...
Jun 18 08:42:07 ... writing Suffix Array to disk ...
Jun 18 08:42:07 ... writing SAindex to disk
Jun 18 08:42:07 ..... finished successfully

We use again STAR to align every single library for S10. We extract the library name of each file and run STAR through each pair of files.

TipTask T1

In the paper (Kuo et al. 2024) the size of introns (expressed as log10 of the base pairs) is given for various legumes (included T.repens) by the plot below.

Using --alignIntronMax in the code below, set the maximum intron size as an upper bound used by the aligner when trying to find out intronic locations in the alignment. Remove the hashtag symbol to render the change effective in the code.

%%bash
for i in `ls ../Data/Clover_Data/S10*.R1.fastq`
do

PREFIXNAME=`basename $i .R1.fastq`
echo "###############################################"
echo "##### ALIGNING PAIRED-END READS "$PREFIXNAME
echo "###############################################"
STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ \
--runThreadN 8 \
--runMode alignReads \
--readFilesIn ../Data/Clover_Data/$PREFIXNAME.R1.fastq ../Data/Clover_Data/$PREFIXNAME.R2.fastq \
--outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/$PREFIXNAME \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--quantMode GeneCounts \
#--alignIntronMax WRITE_MAX_INTRON_SIZE

done
###############################################
##### ALIGNING PAIRED-END READS S10_1_1
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_1.R1.fastq ../Data/Clover_Data/S10_1_1.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:01:13 ..... started STAR run
Jun 18 09:01:13 ..... loading genome
Jun 18 09:01:13 ..... started mapping
Jun 18 09:01:37 ..... finished mapping
Jun 18 09:01:37 ..... started sorting BAM
Jun 18 09:01:38 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_1_2
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_2.R1.fastq ../Data/Clover_Data/S10_1_2.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:01:38 ..... started STAR run
Jun 18 09:01:38 ..... loading genome
Jun 18 09:01:38 ..... started mapping
Jun 18 09:02:01 ..... finished mapping
Jun 18 09:02:01 ..... started sorting BAM
Jun 18 09:02:02 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_1_3
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_3.R1.fastq ../Data/Clover_Data/S10_1_3.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:02:02 ..... started STAR run
Jun 18 09:02:02 ..... loading genome
Jun 18 09:02:02 ..... started mapping
Jun 18 09:02:27 ..... finished mapping
Jun 18 09:02:27 ..... started sorting BAM
Jun 18 09:02:28 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_1
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_1.R1.fastq ../Data/Clover_Data/S10_2_1.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:02:28 ..... started STAR run
Jun 18 09:02:28 ..... loading genome
Jun 18 09:02:28 ..... started mapping
Jun 18 09:02:50 ..... finished mapping
Jun 18 09:02:50 ..... started sorting BAM
Jun 18 09:02:50 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_2
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_2.R1.fastq ../Data/Clover_Data/S10_2_2.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:02:50 ..... started STAR run
Jun 18 09:02:50 ..... loading genome
Jun 18 09:02:50 ..... started mapping
Jun 18 09:03:10 ..... finished mapping
Jun 18 09:03:10 ..... started sorting BAM
Jun 18 09:03:11 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_3
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_3.R1.fastq ../Data/Clover_Data/S10_2_3.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:03:11 ..... started STAR run
Jun 18 09:03:11 ..... loading genome
Jun 18 09:03:11 ..... started mapping
Jun 18 09:03:32 ..... finished mapping
Jun 18 09:03:33 ..... started sorting BAM
Jun 18 09:03:33 ..... finished successfully

Do the same alignment for Tienshan libraries

%%bash
for i in `ls ../Data/Clover_Data/TI*.R1.fastq`
do

PREFIXNAME=`basename $i .R1.fastq`
echo "###############################################"
echo "##### ALIGNING PAIRED-END READS "$PREFIXNAME
echo "###############################################"
STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ \
--runThreadN 8 \
--readFilesIn ../Data/Clover_Data/$PREFIXNAME.R1.fastq ../Data/Clover_Data/$PREFIXNAME.R2.fastq \
--outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/$PREFIXNAME \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--quantMode GeneCounts \
#--alignIntronMax WRITE_MAX_INTRON_SIZE 

done
###############################################
##### ALIGNING PAIRED-END READS TI_1_1
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_1.R1.fastq ../Data/Clover_Data/TI_1_1.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:03:33 ..... started STAR run
Jun 18 09:03:33 ..... loading genome
Jun 18 09:03:33 ..... started mapping
Jun 18 09:03:59 ..... finished mapping
Jun 18 09:03:59 ..... started sorting BAM
Jun 18 09:03:59 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_1_2
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_2.R1.fastq ../Data/Clover_Data/TI_1_2.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:03:59 ..... started STAR run
Jun 18 09:03:59 ..... loading genome
Jun 18 09:03:59 ..... started mapping
Jun 18 09:04:24 ..... finished mapping
Jun 18 09:04:24 ..... started sorting BAM
Jun 18 09:04:24 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_1_3
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_3.R1.fastq ../Data/Clover_Data/TI_1_3.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:04:24 ..... started STAR run
Jun 18 09:04:24 ..... loading genome
Jun 18 09:04:24 ..... started mapping
Jun 18 09:04:55 ..... finished mapping
Jun 18 09:04:55 ..... started sorting BAM
Jun 18 09:04:55 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_1
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_1.R1.fastq ../Data/Clover_Data/TI_2_1.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:04:56 ..... started STAR run
Jun 18 09:04:56 ..... loading genome
Jun 18 09:04:56 ..... started mapping
Jun 18 09:05:24 ..... finished mapping
Jun 18 09:05:24 ..... started sorting BAM
Jun 18 09:05:24 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_2
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_2.R1.fastq ../Data/Clover_Data/TI_2_2.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:05:24 ..... started STAR run
Jun 18 09:05:24 ..... loading genome
Jun 18 09:05:25 ..... started mapping
Jun 18 09:05:55 ..... finished mapping
Jun 18 09:05:55 ..... started sorting BAM
Jun 18 09:05:55 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_3
###############################################
    STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_3.R1.fastq ../Data/Clover_Data/TI_2_3.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 18 09:05:56 ..... started STAR run
Jun 18 09:05:56 ..... loading genome
Jun 18 09:05:56 ..... started mapping
Jun 18 09:06:28 ..... finished mapping
Jun 18 09:06:28 ..... started sorting BAM
Jun 18 09:06:28 ..... finished successfully

Run quality control on each aligned library with MultiQC. In this way there will be a whole report to compare S10 files and Tienshan files.

%%bash
multiqc --outdir results/multiqc_output/TI_STAR_align_1_2 \
            results/STAR_output/TI_align_contigs_1_2/
/// ]8;id=6553969;https://multiqc.info\MultiQC]8;;\ 🔍 v1.35



       file_search | Search path: /faststorage/project/ngssummer2024/2026/samuele/results/STAR_output/TI_align_contigs_1_2

         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 36/36   

              star | Found 6 reports and 6 gene count files

     write_results | Data        : results/multiqc_output/TI_STAR_align_1_2/multiqc_data

     write_results | Report      : results/multiqc_output/TI_STAR_align_1_2/multiqc_report.html

           multiqc | MultiQC complete
%%bash
multiqc --outdir results/multiqc_output/S10_STAR_align_1_2 \
            results/STAR_output/S10_align_contigs_1_2/
/// ]8;id=16222650;https://multiqc.info\MultiQC]8;;\ 🔍 v1.35



       file_search | Search path: /faststorage/project/ngssummer2024/2026/samuele/results/STAR_output/S10_align_contigs_1_2

         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 36/36   

              star | Found 6 reports and 6 gene count files

     write_results | Data        : results/multiqc_output/S10_STAR_align_1_2/multiqc_data

     write_results | Report      : results/multiqc_output/S10_STAR_align_1_2/multiqc_report.html

           multiqc | MultiQC complete
TipTask T2

Verify the alignments are of good quality by looking at the report statistics

We merge the outputs of each group of aligned libraries. Here is the files for the Tienshan.

%%bash
ls -lh  results/STAR_output/TI_align_contigs_1_2/TI_*.sortedByCoord.out.bam 
-rw-rw-r-- 1 root root 6.6M Jun 18 09:03 results/STAR_output/TI_align_contigs_1_2/TI_1_1Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 root root 6.0M Jun 18 09:04 results/STAR_output/TI_align_contigs_1_2/TI_1_2Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 root root 7.1M Jun 18 09:04 results/STAR_output/TI_align_contigs_1_2/TI_1_3Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 root root 9.6M Jun 18 09:05 results/STAR_output/TI_align_contigs_1_2/TI_2_1Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 root root 7.2M Jun 18 09:05 results/STAR_output/TI_align_contigs_1_2/TI_2_2Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 root root 8.2M Jun 18 09:06 results/STAR_output/TI_align_contigs_1_2/TI_2_3Aligned.sortedByCoord.out.bam

Apply samtools merge to combine the individual per-library BAM files

%%bash
mkdir -p results/STAR_output/TI_align_contigs_1_2_merge/
samtools merge -f results/STAR_output/TI_align_contigs_1_2_merge/TI.sorted.bam results/STAR_output/TI_align_contigs_1_2/TI_*.sortedByCoord.out.bam 
%%bash
mkdir -p results/STAR_output/S10_align_contigs_1_2_merge/
samtools merge -f results/STAR_output/S10_align_contigs_1_2_merge/S10.sorted.bam results/STAR_output/S10_align_contigs_1_2/S10_*.sortedByCoord.out.bam 

Index both merging outputs. A file in format bam.bai will appear in their respective folders.

%%bash
samtools index results/STAR_output/S10_align_contigs_1_2_merge/S10.sorted.bam
%%bash
samtools index results/STAR_output/TI_align_contigs_1_2_merge/TI.sorted.bam
TipQuestions - IGV

Look at one of the two merged files just created using IGV and the reference for contigs 1 and 2

  • Q15. How does the visualization differ from the one used for DNA data?
  • Q16. Have you noticed some junctions are long and overlap existing junctions? Why do you think that happens?
  • Load the gene track file (.gtf) from the folder reference_data. Right-click anywhere in IGV and choose “Sashimi plot”. This will show junctions and gene tracks, with numbers telling how many reads support the junction. Zoom into a very crowded section of the plot.
  • Q17. Click anywhere and set a threshold for minimum junction coverage, for example 20. What happens? Is it cleaner?
  • Q18. What can this diagnostic be useful for when aligning data like ours?
NoteWrapping up

In this exercise, you learned to align various types of data after performing quality control for raw data. We looked at some of the options for the aligners and at how to use some of the basic samtools manipulation programs. The outputs from the RNA alignments will be used for the VCF file analysis in the next notebook, and the RNA alignments will be use for the bulk RNA data analysis.

References

Griffiths, Andrew G., Roger Moraga, Marni Tausen, et al. 2019. “Breaking Free: The Genomics of Allopolyploidy-Facilitated Niche Expansion in White Clover.” The Plant Cell 31 (7): 1466–87. https://doi.org/10.1105/tpc.18.00606.
Kuo, Wen-Hsi, Sara J. Wright, Linda L. Small, and Kenneth M. Olsen. 2024. “De Novo Genome Assembly of White Clover (Trifolium Repens l.) Reveals the Role of Copy Number Variation in Rapid Environmental Adaptation.” BMC Biology 22 (August): 165. https://doi.org/10.1186/s12915-024-01962-6.