Raw data alignment

This tutorial will cover the steps for performing the alignment of raw RNA- and HiFi-sequencing data. You will need to use the software IGV on your computer to visualize some of the output files, which can be easily downloaded once they are produced. At the end of this tutorial you will be able to:

perform and discuss quality control on raw data in fastq format using FastQC and MultiQC
align HiFi and RNA sequencing data with dedicated tools such as minimap2 and STAR
analyze the quality of the alignment with QualiMap


The output of this notebook will be used for the Variant calling analysis and the bulk RNA-sequencing analysis. If you do not want to run this notebook, you can alternatively use the free interactive tool Galaxy to perform the alignment steps. We have uploaded the data on Galaxy, and the manual to perform the exercise is found at the course webpage.

The present tutorial, like the rest of the course material, is available at our open-source GitHub repository.

A few introductory points to run this notebook:

To use this notebook, use the NGS (python) kernel that contains the packages. Choose it by selecting Kernel -> Change Kernel in the menu on top of the window.

In this notebook you will use only bash commands, as you would do in the command line (this is why you read %%bash at the beginning of each piece of code). Those commands can be replicated in the command line, but we integrated them in a notebook to make the tutorial easier to follow. Bash commands can also be marked with an ! sign at the beginning of the line.
On some computers, you might see the result of the commands only once they are done running. This means you will wait some time while the computer is crunching, and only afterwards will you see the result of the command you have executed.
You can run the code in each cell by clicking on the run cell button, or by pressing Shift + Enter. When the code is done running, a small green check sign will appear on the left side.
You need to run the cells in sequential order: please do not run a cell until the one above has finished running, and do not skip any cells.
Each cell contains a short description of the code and the output you should get. Please try not to focus on understanding the code for each command in too much detail, but rather on the output.
You can create new code cells by pressing + in the Menu bar above.


Biological background

White clover (Trifolium repens) is an allotetraploid. It is a relatively young, outcrossing species, which originated during the most recent glaciation around 20,000 years ago by hybridisation of two diploid species, T. occidentale and T. pallescens (see figure below).

This means that it contains genomes originating from two different species within the same nucleus. Normally, white clover is an outbreeding species, but a self-compatible line was used for sequencing the white clover genome (Griffiths et al, 2019). This line will be designated as S10 in the data, indicating that this is the 10th self-fertilized generation. In addition, we have data from a wild clover accession (ecotype) called Tienshan (Ti), which was collected in the Chinese mountains and is adapted to alpine conditions.

We will align the data to the white clover reference genome containing both the T. occidentale and T. pallescens subgenomes (called contig 1 and contig 2 in the data). We will also align the data to each subgenome separately, and use the quality control tools to see how the results differ.

Quality control and mapping

Quality Control

We run FastQC on the PacBio HiFi reads and on two of the Illumina RNA-seq libraries. FastQC performs quality control of the raw sequence data, providing an overview that can help identify problems to address before further analysis. You can find the report for each file in the folder results/fastqc_output/. The output is in HTML format and can be opened in any browser or in JupyterLab. It is, however, not easy to compare the various libraries by opening separate reports. To aggregate all the results, we apply MultiQC to the folder containing the reports. The output of MultiQC is in the directory results/multiqc_output/fastqc_data.

%%bash
#run fastqc
mkdir -p results/fastqc_output
fastqc -q -o results/fastqc_output ../Data/Clover_Data/*.fastq  > /dev/null 2>&1

Note: fastqc prints a lot of output consisting of simple confirmations of execution without error, even when using the option -q (quiet). Therefore we added > /dev/null 2>&1 to the command, which discards both the standard output and the standard error of the program.
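
The effect of this redirection can be tried out without FastQC. The sketch below uses a hypothetical stand-in function, noisy_tool (not part of the course material), that writes one line to standard output and one to standard error. It also illustrates that the order of the two redirections matters.

```bash
# noisy_tool is a hypothetical stand-in for a program that prints to both streams
noisy_tool() {
    echo "progress message on stdout"
    echo "status message on stderr" >&2
}

# "> /dev/null" discards standard output; "2>&1" then sends standard error
# to the same place, so nothing is left to capture
silenced=$(noisy_tool > /dev/null 2>&1)
status=$?

# With the order swapped, standard error is first pointed at the original
# standard output (captured here) and only afterwards is stdout discarded
leaked=$(noisy_tool 2>&1 > /dev/null)

echo "silenced: '${silenced}' (exit status ${status})"
echo "leaked:   '${leaked}'"
```

The exit status is still available after silencing, so a failed run can be detected even though nothing is printed.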

%%bash
#run multiqc
multiqc --outdir results/multiqc_output/fastqc_data results/fastqc_output

  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /work/SamueleSoraggi/Notebooks/results/fastqc_output
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 50/50  .html
|            fastqc | Found 25 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : results/multiqc_output/fastqc_data/multiqc_report.html
|           multiqc | Data        : results/multiqc_output/fastqc_data/multiqc_data
|           multiqc | MultiQC complete

Questions

Open the web page generated by MultiQC.

Hint: You can find a Help button that offers additional information about the plots for each panel. Focus on panels such as “Per base sequence quality” and “Per sequence quality scores” (“Per base sequence content” usually gives a FAIL for RNA-seq data).

What do you notice with respect to the sequence quality scores?
Are there any other quality issues worth noting?
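
When interpreting the quality panels, it may help to recall how quality scores are stored in a fastq file: each base has one character in the quality line, and its Phred score is the character's ASCII code minus an offset. The sketch below assumes the common Phred+33 encoding and uses a made-up quality string; phred_scores is a hypothetical helper, not part of FastQC.

```bash
# Decode a fastq quality string into Phred scores (Phred+33 encoding assumed)
phred_scores() {
    printf '%s\n' "$1" | awk '
        # Build a lookup table from printable characters to their ASCII codes
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            out = ""
            for (j = 1; j <= length($0); j++) {
                sep = (j > 1) ? " " : ""
                out = out sep (ord[substr($0, j, 1)] - 33)
            }
            print out
        }'
}

# Made-up quality string for a 5-base read
scores=$(phred_scores 'II5+!')
echo "$scores"   # prints: 40 40 20 10 0
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means about 1 error in 100 bases and Q40 about 1 in 10,000.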

HiFi data mapping

We map the PacBio HiFi reads (Hifi_reads_white_clover.fastq) to the white clover reference sequence (Contig1&2) using minimap2. We run two mapping rounds, using two different preset options (-x in the command) for the technology:

PacBio/Oxford Nanopore read to reference mapping: map-pb
Long assembly to reference mapping, with divergence below 20%: asm20

Next, we create reports of the mapping results by running QualiMap on the two resulting alignments.

We first need to index the reference fasta files using samtools faidx. This produces files in .fai format containing, for each reference sequence, its name, its length, the byte offset of its first base in the file, and the line layout (bases and bytes per line). Click here for a detailed overview.
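
To make the content of such an index more concrete, the sketch below computes the name and length of each sequence in a small made-up fasta file with awk. These correspond to the first two columns of a .fai file; the remaining columns (offset and line layout) are not computed here. The toy sequences and the helper name fasta_lengths are assumptions for illustration only.

```bash
# Write a toy fasta file (made-up sequences, not the clover reference)
toy_fasta=$(mktemp)
cat > "$toy_fasta" <<'EOF'
>contig1 toy sequence
ACGTACGTAC
GTACGT
>contig2
TTGGCCAA
EOF

# Print "name<TAB>length" for each sequence in a fasta file
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { len += length($0) }      # add up the bases on each sequence line
        END { if (name != "") print name "\t" len }
    ' "$1"
}

lengths=$(fasta_lengths "$toy_fasta")
echo "$lengths"
rm -f "$toy_fasta"
```

After running samtools faidx on the real reference, the first two columns of the resulting .fai file can be compared with this kind of output.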

%%bash
#copy the reference data in the folder reference_data, so that you can write the indexing files
mkdir -p reference_data
cp ../Data/Clover_Data/DNA_Contig1_2.fasta ../Data/Clover_Data/DNA_Contig1.fasta ../Data/Clover_Data/DNA_Contig2.fasta reference_data

%%bash
samtools faidx reference_data/DNA_Contig1_2.fasta
samtools faidx reference_data/DNA_Contig1.fasta
samtools faidx reference_data/DNA_Contig2.fasta

We create an output folder for the HiFi alignment and run minimap2 with the settings explained above.

%%bash 
mkdir -p results/HIFI_alignment/
minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sam \
                            reference_data/DNA_Contig1_2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq 

minimap2 -a -x asm20 -o results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sam \
                            reference_data/DNA_Contig1_2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq

[M::mm_idx_gen::0.075*1.04] collected minimizers
[M::mm_idx_gen::0.099*1.52] sorted minimizers
[M::main::0.099*1.52] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.114*1.45] mid_occ = 11
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 2
[M::mm_idx_stat::0.118*1.43] distinct minimizers: 203943 (79.05% are singletons); average occurrences: 1.273; average spacing: 8.047; total length: 2089554
[M::worker_pipeline::11.859*2.89] mapped 4395 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sam reference_data/DNA_Contig1_2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 11.873 sec; CPU: 34.246 sec; Peak RSS: 1.186 GB
[M::mm_idx_gen::0.084*1.04] collected minimizers
[M::mm_idx_gen::0.121*1.62] sorted minimizers
[M::main::0.121*1.62] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.138*1.54] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.143*1.52] distinct minimizers: 298340 (78.21% are singletons); average occurrences: 1.277; average spacing: 5.484; total length: 2089554
[M::worker_pipeline::18.366*2.88] mapped 4395 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 -o results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sam reference_data/DNA_Contig1_2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 18.394 sec; CPU: 52.932 sec; Peak RSS: 1.692 GB

samtools sort sorts the alignments by their leftmost coordinate on the reference. It takes the .sam files as input and writes the output in .bam format (note that you could also have piped the minimap2 output directly into samtools sort, avoiding the intermediate .sam files).
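
What coordinate sorting means can be illustrated on a few made-up alignment records with standard shell tools. This is only a sketch of the ordering: it is not how samtools works internally, and samtools orders reference sequences as listed in the file header, whereas the sort command below orders them alphabetically (which gives the same result for this toy example).

```bash
# Write a toy SAM file: two header lines and four made-up records
toy_sam=$(mktemp)
printf '@SQ\tSN:contig1\tLN:2000\n@SQ\tSN:contig2\tLN:2000\n' > "$toy_sam"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    readA 0  contig2 500  60 4M '*' 0 0 ACGT IIII \
    readB 16 contig1 1200 60 4M '*' 0 0 ACGT IIII \
    readC 0  contig1 30   60 4M '*' 0 0 ACGT IIII \
    readD 0  contig2 45   60 4M '*' 0 0 ACGT IIII >> "$toy_sam"

# Order records by reference name (column 3), then by position (column 4),
# and show only read name, reference name and position
coordinate_order() {
    tab=$(printf '\t')
    grep -v '^@' "$1" | LC_ALL=C sort -t "$tab" -k3,3 -k4,4n | cut -f1,3,4
}

sorted=$(coordinate_order "$toy_sam")
echo "$sorted"
rm -f "$toy_sam"
```

Reads that were written in arbitrary order (readA to readD) come out grouped by contig and ordered by position, which is the layout that indexing and viewers such as IGV rely on.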

%%bash
samtools sort results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sam \
                -o results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sort.bam

samtools sort results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sam \
                -o results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sort.bam

samtools index creates the index for the bam file, stored in .bai format. The index file lets programs access any position in the aligned data without reading the whole file, which would take too much time.

%%bash
samtools index results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sort.bam
samtools index results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sort.bam

Run quality control on both files with QualiMap.

%%bash
qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_1_2_mappb.sort.bam \
                 -outdir results/qualimap_output/PacBio_clover_alignment_1_2_mappb

qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_1_2_asm20.sort.bam \
                 -outdir results/qualimap_output/PacBio_clover_alignment_1_2_asm20

Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
Built on 2019-11-11 14:05

Selected tool: bamqc
Available memory (Mb): 33
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 401
Chunk of reads size: 1000
Number of threads: 8
Processed 50 out of 401 windows...
Processed 100 out of 401 windows...
Processed 150 out of 401 windows...
Processed 200 out of 401 windows...
Processed 250 out of 401 windows...
Processed 300 out of 401 windows...
Processed 350 out of 401 windows...
Processed 400 out of 401 windows...
Total processed windows:401
Number of reads: 4395
Number of valid reads: 4696
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4395
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 12
Computing descriptors...
numberOfMappedBases: 70274383
referenceSize: 2089554
numberOfSequencedBases: 70039397
numberOfAs: 23621784
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 12
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished
Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
Built on 2019-11-11 14:05

Selected tool: bamqc
Available memory (Mb): 33
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 401
Chunk of reads size: 1000
Number of threads: 8
Processed 50 out of 401 windows...
Processed 100 out of 401 windows...
Processed 150 out of 401 windows...
Processed 200 out of 401 windows...
Processed 250 out of 401 windows...
Processed 300 out of 401 windows...
Processed 350 out of 401 windows...
Processed 400 out of 401 windows...
Total processed windows:401
Number of reads: 4395
Number of valid reads: 4747
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4395
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 12
Computing descriptors...
numberOfMappedBases: 69947563
referenceSize: 2089554
numberOfSequencedBases: 69812688
numberOfAs: 23541478
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 12
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished

For easier comparison, we can again combine the two reports into a single one using MultiQC, in the same way we did for the FastQC reports.

%%bash

#run multiqc
multiqc --outdir results/qualimap_output results/qualimap_output

  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /work/SamueleSoraggi/Notebooks/results/qualimap_output
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 93/93  nt.t…
|          qualimap | Found 2 BamQC reports
|           multiqc | Compressing plot data
|           multiqc | Report      : results/qualimap_output/multiqc_report.html
|           multiqc | Data        : results/qualimap_output/multiqc_data
|           multiqc | MultiQC complete

Now you can visualize the report generated, which is in results/qualimap_output/multiqc_report.html.
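
As a rough cross-check of the coverage shown in the report, the average depth can be approximated by dividing the number of mapped bases by the reference size, both of which are printed in the QualiMap log above (numberOfMappedBases and referenceSize). This is a back-of-the-envelope estimate under that assumption; the mean coverage reported by QualiMap may be computed slightly differently. The helper name mean_depth is hypothetical.

```bash
# Approximate mean depth = mapped bases / reference size, with one decimal
mean_depth() {
    awk -v bases="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", bases / ref }'
}

# Values copied from the two QualiMap logs above (map-pb and asm20 runs)
depth_mappb=$(mean_depth 70274383 2089554)
depth_asm20=$(mean_depth 69947563 2089554)

echo "map-pb: ${depth_mappb}x"   # prints: map-pb: 33.6x
echo "asm20:  ${depth_asm20}x"   # prints: asm20:  33.5x
```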

Next, we map the white clover PacBio HiFi reads to contig1 and contig2 separately, using the setting you selected at the previous step (here we assume map-pb was chosen, but you are free to change this setting in the commands). As the two contigs represent the two white clover subgenomes, this mapping will allow you to see the two subgenome haplotypes and call subgenome SNPs.

%%bash 
minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_1.sam \
                            reference_data/DNA_Contig1.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq

[M::mm_idx_gen::0.050*0.96] collected minimizers
[M::mm_idx_gen::0.071*1.55] sorted minimizers
[M::main::0.071*1.55] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.079*1.50] mid_occ = 10
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1
[M::mm_idx_stat::0.082*1.48] distinct minimizers: 112547 (91.45% are singletons); average occurrences: 1.141; average spacing: 8.031; total length: 1031631
[M::worker_pipeline::27.024*2.90] mapped 4395 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_1.sam reference_data/DNA_Contig1.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 27.035 sec; CPU: 78.431 sec; Peak RSS: 2.258 GB

%%bash 
minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_2.sam \
                            reference_data/DNA_Contig2.fasta \
                            ../Data/Clover_Data/Hifi_reads_white_clover.fastq

[M::mm_idx_gen::0.048*1.04] collected minimizers
[M::mm_idx_gen::0.069*1.61] sorted minimizers
[M::main::0.069*1.61] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.076*1.55] mid_occ = 10
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1
[M::mm_idx_stat::0.080*1.53] distinct minimizers: 120722 (93.18% are singletons); average occurrences: 1.087; average spacing: 8.062; total length: 1057923
[M::worker_pipeline::27.150*2.94] mapped 4395 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-pb -o results/HIFI_alignment/PacBio_clover_alignment_2.sam reference_data/DNA_Contig2.fasta ../Data/Clover_Data/Hifi_reads_white_clover.fastq
[M::main] Real time: 27.162 sec; CPU: 79.819 sec; Peak RSS: 2.058 GB
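
A SAM file can contain more than one alignment record per read, because an aligner may report secondary and supplementary alignments in addition to the primary one. As a sketch on made-up records (not the course data), the record types can be tallied from the FLAG column, where bit 0x4 marks unmapped reads, bit 0x100 secondary alignments and bit 0x800 supplementary alignments. The helper name count_alignment_types is hypothetical.

```bash
# Write a toy SAM file with six made-up records covering the different flag types
toy_sam=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    read1 0    contig1 10  60 4M '*' 0 0 ACGT IIII \
    read2 16   contig1 50  60 4M '*' 0 0 ACGT IIII \
    read2 256  contig1 900 0  4M '*' 0 0 ACGT IIII \
    read3 2048 contig1 300 60 4M '*' 0 0 ACGT IIII \
    read3 2064 contig1 700 60 4M '*' 0 0 ACGT IIII \
    read4 4    '*'     0   0  '*' '*' 0 0 ACGT IIII > "$toy_sam"

# Tally records by type; a bit is tested with integer division and modulo
count_alignment_types() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        {
            flag = $2
            if (int(flag / 4) % 2)         unmapped++        # bit 0x4
            else if (int(flag / 256) % 2)  secondary++       # bit 0x100
            else if (int(flag / 2048) % 2) supplementary++   # bit 0x800
            else                           primary++
        }
        END {
            printf "primary=%d secondary=%d supplementary=%d unmapped=%d\n",
                   primary, secondary, supplementary, unmapped
        }' "$1"
}

tally=$(count_alignment_types "$toy_sam")
echo "$tally"   # prints: primary=2 secondary=1 supplementary=2 unmapped=1
rm -f "$toy_sam"
```

Running the same function on one of the .sam files produced above would show how many records of each type the mapping generated.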

Sort the alignments into .bam files and create their index using samtools.

%%bash
samtools sort results/HIFI_alignment/PacBio_clover_alignment_1.sam -o results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam
samtools sort results/HIFI_alignment/PacBio_clover_alignment_2.sam -o results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam

%%bash
samtools index results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam
samtools index results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam

Perform quality control with QualiMap.

%%bash
mkdir -p results/qualimap_output
qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam -outdir results/qualimap_output/PacBio_clover_alignment_1

Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
Built on 2019-11-11 14:05

Selected tool: bamqc
Available memory (Mb): 33
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 400
Chunk of reads size: 1000
Number of threads: 8
Processed 50 out of 400 windows...
Processed 100 out of 400 windows...
Processed 150 out of 400 windows...
Processed 200 out of 400 windows...
Processed 250 out of 400 windows...
Processed 300 out of 400 windows...
Processed 350 out of 400 windows...
Processed 400 out of 400 windows...
Total processed windows:400
Number of reads: 4395
Number of valid reads: 9070
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4356
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 10
Computing descriptors...
numberOfMappedBases: 59058058
referenceSize: 1031631
numberOfSequencedBases: 54917933
numberOfAs: 18508479
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 11
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished

%%bash
qualimap bamqc -bam results/HIFI_alignment/PacBio_clover_alignment_2.sort.bam -outdir results/qualimap_output/PacBio_clover_alignment_2

Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
Built on 2019-11-11 14:05

Selected tool: bamqc
Available memory (Mb): 33
Max memory (Mb): 1258
Starting bam qc....
Loading sam header...
Loading locator...
Loading reference...
Number of windows: 400, effective number of windows: 400
Chunk of reads size: 1000
Number of threads: 8
Processed 50 out of 400 windows...
Processed 100 out of 400 windows...
Processed 150 out of 400 windows...
Processed 200 out of 400 windows...
Processed 250 out of 400 windows...
Processed 300 out of 400 windows...
Processed 350 out of 400 windows...
Processed 400 out of 400 windows...
Total processed windows:400
Number of reads: 4395
Number of valid reads: 9462
Number of correct strand reads:0

Inside of regions...
Num mapped reads: 4394
Num mapped first of pair: 0
Num mapped second of pair: 0
Num singletons: 0
Time taken to analyze reads: 13
Computing descriptors...
numberOfMappedBases: 59388453
referenceSize: 1057923
numberOfSequencedBases: 55867300
numberOfAs: 18822184
Computing per chromosome statistics...
Computing histograms...
Overall analysis time: 13
end of bam qc
Computing report...
Writing HTML report...
HTML report created successfully

Finished

%%bash

#run multiqc
multiqc --outdir results/qualimap_output results/qualimap_output

  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /work/SamueleSoraggi/Notebooks/results/qualimap_output
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 187/187  .t…
|          qualimap | Found 4 BamQC reports
|           multiqc | Compressing plot data
|           multiqc | Previous MultiQC output found! Adjusting filenames..
|           multiqc | Use -f or --force to overwrite existing reports instead
|           multiqc | Report      : results/qualimap_output/multiqc_report_1.html
|           multiqc | Data        : results/qualimap_output/multiqc_data_1
|           multiqc | MultiQC complete

Task: IGV visualization and Questions

Now you can inspect the alignment files in IGV.

First, you will need to download the reference fasta sequence in ../Data/Clover_Data/DNA_Contig1_2.fasta and import it into IGV. You can do the same for the files DNA_Contig1.fasta and DNA_Contig2.fasta, which you might need later. In IGV, this is done from the menu Genomes --> Load Genome from File, selecting the relevant fasta file. Then, choose the reference you need from the drop-down menu (see figure below). You will not yet see much, but you can choose one of the two subgenomes (contig 1 or 2) and double-click on a chromosome position to inspect the reference sequence. The next step is to visualize the mapped files in IGV.

Each alignment can be viewed in IGV against the reference file of choice. To load an alignment file, first download it together with its index file in .bai format. For example, you need to download both results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam and results/HIFI_alignment/PacBio_clover_alignment_1.sort.bam.bai to see this alignment (you need to open only the .bam file with IGV). If you open more files, their alignments will be distributed in the IGV interface, and you can change the size of each visualization yourself (below shown with only one opened alignment).

Now compare in IGV the two bam files PacBio_clover_alignment_1.sort.bam and PacBio_clover_alignment_2.sort.bam.

What do you observe when comparing the two BAM files?
Have a look at the polymorphic regions in IGV. Are they true polymorphisms?

Add the third alignment, PacBio_clover_alignment_1_2_mappb.sort.bam, to the visualization in IGV.

Why do you see fluctuations in coverage and large regions without any apparent subgenome SNPs?
What are the major differences between the stats for the reads mapped to Contigs1&2 versus contig1 and contig2? What is your interpretation of the differences?

RNA-seq mapping

In the ../Data folder you will find 24 RNA-seq libraries, 12 S10 libraries and 12 Tienshan libraries. Each library is paired-end, which is denoted by R1 and R2 at the end of two files having the same name, such as S10_1_1.R1.fastq and S10_1_1.R2.fastq. We will align each library separately and then merge the alignments to create two final samples for S10 and Tienshan.
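
Before aligning paired-end data, it can be useful to check that every R1 file has its R2 partner. The sketch below does this on a temporary folder with made-up empty files that follow the naming scheme described above; check_pairs is a hypothetical helper, and the folder is not the course data folder.

```bash
# Create a toy folder where one library lacks its R2 file
toy_dir=$(mktemp -d)
touch "$toy_dir/S10_1_1.R1.fastq" "$toy_dir/S10_1_1.R2.fastq" \
      "$toy_dir/S10_1_2.R1.fastq"

# For each R1 file in a folder, report whether the matching R2 file exists
check_pairs() {
    for r1 in "$1"/*.R1.fastq; do
        [ -e "$r1" ] || continue                  # no R1 files at all
        prefix=$(basename "$r1" .R1.fastq)        # library name without suffix
        if [ -e "$1/$prefix.R2.fastq" ]; then
            echo "$prefix paired"
        else
            echo "$prefix missing_R2"
        fi
    done
}

pairs=$(check_pairs "$toy_dir")
echo "$pairs"
rm -r "$toy_dir"
```

Calling check_pairs on ../Data/Clover_Data would list the library names found there, which are the same prefixes used to pair the files during alignment.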

First, we need to create a genome index for the reference fasta file of contigs 1 and 2. This is done with STAR, using the option --runMode genomeGenerate. We also need to convert the gene annotation from gff to gtf format with gffread to allow counting gene transcripts. STAR is a very complex tool with many options, so it is always useful to have its reference manual at hand.

%%bash
gffread -T -o reference_data/white_clover_genes.gtf ../Data/Clover_Data/white_clover_genes.gff

%%bash
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir results/STAR_output/indexing_contigs_1_2 \
--genomeFastaFiles reference_data/DNA_Contig1_2.fasta \
--sjdbGTFfile reference_data/white_clover_genes.gtf

	STAR --runThreadN 8 --runMode genomeGenerate --genomeDir results/STAR_output/indexing_contigs_1_2 --genomeFastaFiles reference_data/DNA_Contig1_2.fasta --sjdbGTFfile reference_data/white_clover_genes.gtf
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:23:03 ..... started STAR run
Jun 19 10:23:03 ... starting to generate Genome files
Jun 19 10:23:03 ..... processing annotations GTF

!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=2089554, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 9

Jun 19 10:23:03 ... starting to sort Suffix Array. This may take a long time...
Jun 19 10:23:03 ... sorting Suffix Array chunks and saving them to disk...
Jun 19 10:23:04 ... loading chunks from disk, packing SA...
Jun 19 10:23:04 ... finished generating suffix array
Jun 19 10:23:04 ... generating Suffix Array index
Jun 19 10:23:08 ... completed Suffix Array index
Jun 19 10:23:08 ..... inserting junctions into the genome indices
Jun 19 10:23:16 ... writing Genome to disk ...
Jun 19 10:23:16 ... writing Suffix Array to disk ...
Jun 19 10:23:17 ... writing SAindex to disk
Jun 19 10:23:18 ..... finished successfully

We got a warning saying

!!!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=2089554, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 9

meaning that shorter strings of bases (9 instead of 14) should be used for the suffix array index, as our reference genome is very short: according to the warning, the default value is too large for this genome size and may cause a segmentation fault at the mapping step. So we rerun the command with the suggested option.

%%bash
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir results/STAR_output/indexing_contigs_1_2 \
--genomeFastaFiles reference_data/DNA_Contig1_2.fasta \
--sjdbGTFfile reference_data/white_clover_genes.gtf \
--genomeSAindexNbases 9

	STAR --runThreadN 8 --runMode genomeGenerate --genomeDir results/STAR_output/indexing_contigs_1_2 --genomeFastaFiles reference_data/DNA_Contig1_2.fasta --sjdbGTFfile reference_data/white_clover_genes.gtf --genomeSAindexNbases 9
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:23:18 ..... started STAR run
Jun 19 10:23:18 ... starting to generate Genome files
Jun 19 10:23:18 ..... processing annotations GTF
Jun 19 10:23:20 ... starting to sort Suffix Array. This may take a long time...
Jun 19 10:23:20 ... sorting Suffix Array chunks and saving them to disk...
Jun 19 10:23:21 ... loading chunks from disk, packing SA...
Jun 19 10:23:21 ... finished generating suffix array
Jun 19 10:23:21 ... generating Suffix Array index
Jun 19 10:23:21 ... completed Suffix Array index
Jun 19 10:23:22 ..... inserting junctions into the genome indices
Jun 19 10:23:23 ... writing Genome to disk ...
Jun 19 10:23:23 ... writing Suffix Array to disk ...
Jun 19 10:23:23 ... writing SAindex to disk
Jun 19 10:23:24 ..... finished successfully
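
The value 9 suggested by the warning can be reproduced with the rule given in the STAR manual for small genomes, which scales --genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1). The sketch below evaluates this rule with awk and rounds down to an integer; the helper name recommended_sa_index_nbases is hypothetical.

```bash
# Evaluate min(14, log2(genome length)/2 - 1), rounded down
recommended_sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        n = int(log(len) / log(2) / 2 - 1)   # log2 via natural logarithms
        if (n > 14) n = 14
        print n
    }'
}

# Genome size reported by STAR for contigs 1 and 2
idx_clover=$(recommended_sa_index_nbases 2089554)
echo "$idx_clover"   # prints: 9
```

For a genome of a few billion bases the same rule gives 14, which is the default value mentioned in the warning.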

We use STAR again to align each S10 library. We extract the library name from each file name and run STAR on each pair of files. Note that plant introns are very rarely longer than 5000 bp and that you are mapping to two homoeologous contigs that show high similarity, especially in genic regions. We therefore set the maximum intron size to 5000 using --alignIntronMax 5000.

%%bash
# Make sure the output folder exists (harmless if it is already there)
mkdir -p results/STAR_output/S10_align_contigs_1_2
# Loop over the R1 file of every S10 library
for i in ../Data/Clover_Data/S10*.R1.fastq
do

# Library name, e.g. S10_1_1 for S10_1_1.R1.fastq
PREFIXNAME=$(basename "$i" .R1.fastq)
echo "###############################################"
echo "##### ALIGNING PAIRED-END READS $PREFIXNAME"
echo "###############################################"
# Align the R1/R2 pair, write a coordinate-sorted BAM file and count reads per gene
STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ \
--runThreadN 8 \
--runMode alignReads \
--readFilesIn "../Data/Clover_Data/$PREFIXNAME.R1.fastq" "../Data/Clover_Data/$PREFIXNAME.R2.fastq" \
--outFileNamePrefix "results/STAR_output/S10_align_contigs_1_2/$PREFIXNAME" \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--quantMode GeneCounts \
--alignIntronMax 5000

done

###############################################
##### ALIGNING PAIRED-END READS S10_1_1
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_1.R1.fastq ../Data/Clover_Data/S10_1_1.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:23:24 ..... started STAR run
Jun 19 10:23:24 ..... loading genome
Jun 19 10:23:24 ..... started mapping
Jun 19 10:23:35 ..... finished mapping
Jun 19 10:23:35 ..... started sorting BAM
Jun 19 10:23:37 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_1_2
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_2.R1.fastq ../Data/Clover_Data/S10_1_2.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:23:38 ..... started STAR run
Jun 19 10:23:38 ..... loading genome
Jun 19 10:23:38 ..... started mapping
Jun 19 10:23:48 ..... finished mapping
Jun 19 10:23:48 ..... started sorting BAM
Jun 19 10:23:51 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_1_3
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_1_3.R1.fastq ../Data/Clover_Data/S10_1_3.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_1_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:23:52 ..... started STAR run
Jun 19 10:23:52 ..... loading genome
Jun 19 10:23:52 ..... started mapping
Jun 19 10:24:03 ..... finished mapping
Jun 19 10:24:03 ..... started sorting BAM
Jun 19 10:24:06 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_1
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_1.R1.fastq ../Data/Clover_Data/S10_2_1.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:24:06 ..... started STAR run
Jun 19 10:24:06 ..... loading genome
Jun 19 10:24:06 ..... started mapping
Jun 19 10:24:16 ..... finished mapping
Jun 19 10:24:16 ..... started sorting BAM
Jun 19 10:24:18 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_2
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_2.R1.fastq ../Data/Clover_Data/S10_2_2.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:24:20 ..... started STAR run
Jun 19 10:24:20 ..... loading genome
Jun 19 10:24:20 ..... started mapping
Jun 19 10:24:31 ..... finished mapping
Jun 19 10:24:31 ..... started sorting BAM
Jun 19 10:24:33 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS S10_2_3
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --runMode alignReads --readFilesIn ../Data/Clover_Data/S10_2_3.R1.fastq ../Data/Clover_Data/S10_2_3.R2.fastq --outFileNamePrefix results/STAR_output/S10_align_contigs_1_2/S10_2_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:24:33 ..... started STAR run
Jun 19 10:24:33 ..... loading genome
Jun 19 10:24:33 ..... started mapping
Jun 19 10:24:45 ..... finished mapping
Jun 19 10:24:45 ..... started sorting BAM
Jun 19 10:24:48 ..... finished successfully
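
Because the alignment was run with --quantMode GeneCounts, STAR also writes a table ending in ReadsPerGene.out.tab for each library. Its first rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) are summary categories, and the remaining rows contain one gene per line, with the unstranded counts in the second column. The following sketch, which is not part of the original notebook, summarises such a table with awk. The function name summarise_counts is hypothetical, and using the unstranded column is an assumption: for a stranded library, column 3 or 4 would be the one to look at.

```bash
# Summarise one STAR gene-count table (column 2 = unstranded counts).
# Rows whose ID starts with "N_" are STAR's summary categories, not genes.
summarise_counts() {
    awk -F '\t' '
        $1 ~ /^N_/ { printf "%s\t%d\n", $1, $2; next }
        { assigned += $2; if ($2 > 0) expressed++ }
        END { printf "assigned_to_genes\t%d\ngenes_with_reads\t%d\n", assigned, expressed }
    ' "$1"
}

# Apply it to every S10 library aligned above
for f in results/STAR_output/S10_align_contigs_1_2/*ReadsPerGene.out.tab; do
    [ -e "$f" ] || continue   # nothing to do if no table exists yet
    echo "== $f"
    summarise_counts "$f"
done
```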

Do the same alignment for the Tienshan (TI) libraries.

In [21]:

            
%%bash
# Make sure the output folder exists (harmless if it is already there)
mkdir -p results/STAR_output/TI_align_contigs_1_2
# Loop over the R1 file of every Tienshan library
for i in ../Data/Clover_Data/TI*.R1.fastq
do

# Library name, e.g. TI_1_1 for TI_1_1.R1.fastq
PREFIXNAME=$(basename "$i" .R1.fastq)
echo "###############################################"
echo "##### ALIGNING PAIRED-END READS $PREFIXNAME"
echo "###############################################"
# --runMode is omitted here: alignReads is the default mode of STAR
STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ \
--runThreadN 8 \
--readFilesIn "../Data/Clover_Data/$PREFIXNAME.R1.fastq" "../Data/Clover_Data/$PREFIXNAME.R2.fastq" \
--outFileNamePrefix "results/STAR_output/TI_align_contigs_1_2/$PREFIXNAME" \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--quantMode GeneCounts \
--alignIntronMax 5000

done

###############################################
##### ALIGNING PAIRED-END READS TI_1_1
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_1.R1.fastq ../Data/Clover_Data/TI_1_1.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:24:50 ..... started STAR run
Jun 19 10:24:50 ..... loading genome
Jun 19 10:24:50 ..... started mapping
Jun 19 10:25:03 ..... finished mapping
Jun 19 10:25:04 ..... started sorting BAM
Jun 19 10:25:18 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_1_2
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_2.R1.fastq ../Data/Clover_Data/TI_1_2.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:25:18 ..... started STAR run
Jun 19 10:25:18 ..... loading genome
Jun 19 10:25:18 ..... started mapping
Jun 19 10:25:30 ..... finished mapping
Jun 19 10:25:30 ..... started sorting BAM
Jun 19 10:25:33 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_1_3
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_1_3.R1.fastq ../Data/Clover_Data/TI_1_3.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_1_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:25:36 ..... started STAR run
Jun 19 10:25:36 ..... loading genome
Jun 19 10:25:36 ..... started mapping
Jun 19 10:25:50 ..... finished mapping
Jun 19 10:25:50 ..... started sorting BAM
Jun 19 10:25:51 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_1
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_1.R1.fastq ../Data/Clover_Data/TI_2_1.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_1 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:25:51 ..... started STAR run
Jun 19 10:25:51 ..... loading genome
Jun 19 10:25:52 ..... started mapping
Jun 19 10:26:06 ..... finished mapping
Jun 19 10:26:06 ..... started sorting BAM
Jun 19 10:26:09 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_2
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_2.R1.fastq ../Data/Clover_Data/TI_2_2.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_2 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:26:09 ..... started STAR run
Jun 19 10:26:09 ..... loading genome
Jun 19 10:26:09 ..... started mapping
Jun 19 10:26:23 ..... finished mapping
Jun 19 10:26:23 ..... started sorting BAM
Jun 19 10:26:31 ..... finished successfully
###############################################
##### ALIGNING PAIRED-END READS TI_2_3
###############################################
	STAR --genomeDir results/STAR_output/indexing_contigs_1_2/ --runThreadN 8 --readFilesIn ../Data/Clover_Data/TI_2_3.R1.fastq ../Data/Clover_Data/TI_2_3.R2.fastq --outFileNamePrefix results/STAR_output/TI_align_contigs_1_2/TI_2_3 --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode GeneCounts --alignIntronMax 5000
	STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jun 19 10:26:31 ..... started STAR run
Jun 19 10:26:31 ..... loading genome
Jun 19 10:26:31 ..... started mapping
Jun 19 10:26:46 ..... finished mapping
Jun 19 10:26:46 ..... started sorting BAM
Jun 19 10:26:49 ..... finished successfully

Run quality control on the aligned libraries with MultiQC. This produces one report for the Tienshan libraries and one for the S10 libraries, so that the alignments can be compared within each group and between the two groups.

In [22]:

            
%%bash
multiqc --outdir results/multiqc_output/TI_STAR_align_1_2 \
            results/STAR_output/TI_align_contigs_1_2/

  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /work/SamueleSoraggi/Notebooks/results/STAR_output/TI_align_contigs_1_2
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 36/36  og.o…
|              star | Found 6 reports and 6 gene count files
|           multiqc | Compressing plot data
|           multiqc | Report      : results/multiqc_output/TI_STAR_align_1_2/multiqc_report.html
|           multiqc | Data        : results/multiqc_output/TI_STAR_align_1_2/multiqc_data
|           multiqc | MultiQC complete

In [23]:

            
%%bash
multiqc --outdir results/multiqc_output/S10_STAR_align_1_2 \
            results/STAR_output/S10_align_contigs_1_2/

  /// MultiQC 🔍 | v1.14

|           multiqc | Search path : /work/SamueleSoraggi/Notebooks/results/STAR_output/S10_align_contigs_1_2
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 36/36  ut.t…
|              star | Found 6 reports and 6 gene count files
|           multiqc | Compressing plot data
|           multiqc | Report      : results/multiqc_output/S10_STAR_align_1_2/multiqc_report.html
|           multiqc | Data        : results/multiqc_output/S10_STAR_align_1_2/multiqc_data
|           multiqc | MultiQC complete
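
The STAR section of a MultiQC report is built from the Log.final.out file that STAR writes for each library. To glance at one of those numbers without opening the HTML report, something like the sketch below could be used. It is only an illustration: unique_rate is a hypothetical helper, and it assumes the usual "label | value" layout of Log.final.out, including a line labelled "Uniquely mapped reads %".

```bash
# Print "library <tab> uniquely mapped %" for every STAR final log in a folder.
# Assumes the usual "label | value" layout of Log.final.out.
unique_rate() {
    local log
    for log in "$1"/*Log.final.out; do
        [ -e "$log" ] || continue   # no log files in this folder
        awk -F '|' -v lib="$(basename "$log" Log.final.out)" '
            /Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print lib "\t" $2 }
        ' "$log"
    done
}

unique_rate results/STAR_output/S10_align_contigs_1_2
unique_rate results/STAR_output/TI_align_contigs_1_2
```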

We now merge the aligned libraries of each group into a single file. Here is what the BAM files look like for Tienshan.

In [24]:

            
!ls -lh  results/STAR_output/TI_align_contigs_1_2/TI_*.sortedByCoord.out.bam

-rw-r--r--. 1 ucloud users 6.6M Jun 19 10:25 results/STAR_output/TI_align_contigs_1_2/TI_1_1Aligned.sortedByCoord.out.bam
-rw-r--r--. 1 ucloud users 6.0M Jun 19 10:25 results/STAR_output/TI_align_contigs_1_2/TI_1_2Aligned.sortedByCoord.out.bam
-rw-r--r--. 1 ucloud users 7.1M Jun 19 10:25 results/STAR_output/TI_align_contigs_1_2/TI_1_3Aligned.sortedByCoord.out.bam
-rw-r--r--. 1 ucloud users 9.6M Jun 19 10:26 results/STAR_output/TI_align_contigs_1_2/TI_2_1Aligned.sortedByCoord.out.bam
-rw-r--r--. 1 ucloud users 7.2M Jun 19 10:26 results/STAR_output/TI_align_contigs_1_2/TI_2_2Aligned.sortedByCoord.out.bam
-rw-r--r--. 1 ucloud users 8.2M Jun 19 10:26 results/STAR_output/TI_align_contigs_1_2/TI_2_3Aligned.sortedByCoord.out.bam

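Apply samtools merge to combine the sorted BAM files of each group into one file. The -f option overwrites the output file if it already exists.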

In [25]:

            
%%bash
mkdir -p results/STAR_output/TI_align_contigs_1_2_merge/
samtools merge -f results/STAR_output/TI_align_contigs_1_2_merge/TI.sorted.bam results/STAR_output/TI_align_contigs_1_2/TI_*.sortedByCoord.out.bam

In [26]:

            
%%bash
mkdir -p results/STAR_output/S10_align_contigs_1_2_merge/
samtools merge -f results/STAR_output/S10_align_contigs_1_2_merge/S10.sorted.bam results/STAR_output/S10_align_contigs_1_2/S10_*.sortedByCoord.out.bam

Index both merged files. An index file with the extension .bam.bai will appear next to each merged BAM file.

In [27]:

            
%%bash
samtools index results/STAR_output/S10_align_contigs_1_2_merge/S10.sorted.bam

In [28]:

            
%%bash
samtools index results/STAR_output/TI_align_contigs_1_2_merge/TI.sorted.bam
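
A quick way to confirm that both index files are in place is to test for a non-empty .bai file next to each merged BAM file. This is a minimal sketch rather than part of the original notebook, and check_index is a hypothetical helper name.

```bash
# Report whether a BAM file has a non-empty index (<file>.bai) next to it.
check_index() {
    if [ -s "$1.bai" ]; then
        echo "indexed: $1"
    else
        echo "NOT indexed: $1"
    fi
}

check_index results/STAR_output/S10_align_contigs_1_2_merge/S10.sorted.bam
check_index results/STAR_output/TI_align_contigs_1_2_merge/TI.sorted.bam
```

When the merged alignments are downloaded for viewing in IGV, each .bam.bai file should be kept in the same folder as its BAM file.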

Wrapping up 🎉 🎉 🎉

In this exercise, you learnt to align various types of data after performing quality control on the raw data. We looked at some of the options of the aligners and at how to use some of the basic samtools programs. The outputs from the HiFi alignments will be used for the variant calling (VCF) analysis in the next notebook, and the RNA alignments will be used for the bulk RNA data analysis.