Single Cell Analysis Tutorial

Samuele Soraggi

This tutorial will show analysis and integration of single cell RNA-seq data from wells of Faba bean (Vicia faba). The tutorial is organized in three main sections:

The first part of the tutorial is focused on preprocessing the data, which means primarily filtering and normalizing it.
The second part of the tutorial is focused on integrating multiple datasets, then identifying cell types and find populations of cells.
The third part of the tutorial applies tipycal gene analysis finding groups of genes co-expressed in the data

The tutorial follow the phylosophy of the best practices explained in Luecken and Theis (2019) and Heumos et al. (2023). The last part is based pulling cells transcripts together with different granularities to improve the statistical power of calculations based on their gene expression (as in Morabito et al. (2023)).

Learning outcomes

At the end of this tutorial you will be able to use R to

Filter your data selecting specific criteria
Preprocess your data for advanced analysis
Merge and integrate datasets and perform cross-data analysis
Identify potential cell types by markers and exploiting other datasets
Perform and elaborate differential and conserved gene expression
Infer gene modules from your data and isolate significant ones related to a cell type
Visualize gene modules and extract their gene ontology to draw biological conclusions

Biological introduction of Vicia Faba

Vicia faba is a cool-season grain legume in the family Fabaceae, cultivated as an important source of protein for food and feed. Like other legumes, it forms nitrogen-fixing symbioses with rhizobia, which makes it biologically interesting both as a crop and as a system for studying plant development and plant-microbe interactions.

A notable feature of Vicia faba is its very large genome, which is about 13 Gb and therefore much larger than the genomes of common plant model species such as Arabidopsis thaliana. This large genome is one reason why transcriptomic approaches are especially valuable for resolving cell types and gene expression programs in faba bean tissues.

UMI-based single cell data from wells

The single-cell data used in this tutorial come from a plate-based workflow inspired by CEL-Seq2, adapted with a personalized laboratory protocol for Vicia faba tissues.

In general, individual cells are isolated into separate wells of a multiwell plate. In each well, mRNA is captured and reverse-transcribed using barcoded primers that contain both a cell barcode and a UMI. This enables pooling material across wells while still tracing each read back to its original cell and transcript molecule.

After reverse transcription, cDNA from all wells is pooled, amplified, and prepared for Illumina sequencing. During analysis, reads are demultiplexed by cell barcode, UMIs are used to collapse PCR duplicates, and gene-by-cell count matrices are generated for downstream quality control and cell-type analysis.

Each biological sample corresponds to one plate with 380 profiled cells.

The raw data in practice

Let’s look at a specific read and its UMI and cell barcode. The data is organized in paired-end reads (written on fastq files), where the first fastq file contains reads in the following format

@EXAMPLE_0001/1 length=150
ACGTTCGAGCTTAACTGATCCGATGGTACCTTAGGCTAACCGTTGACCTAGTTCGATGCTAGGATCCGTTAACGGTCAAGTTCCGATGCTAACGTTAGCTAGGCTAACCGTTAG
+EXAMPLE_0001/1 length=150
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In this workflow, the first 8 bases of read 1 (here ACGTTCGA) are the UMI and the next 6 bases (here GCTTAA) are the cell barcode. The remaining 136 bases of read 1 are cDNA sequence and are used for alignment together with read 2. The last line contains quality scores.

The associated second fastq file is also 150 bp and contains the paired read, for example:

@EXAMPLE_0001/2 length=150
TTGACCGTAGCTTACCGGATTCGATGCAACTTAGGCTTACCGATGGTCAAGTTACCGGATCCTTAGGCAACTTGATCGGTTACCGATGCTTAGGCAACTTGATCCGGTAACCTGA
+EXAMPLE_0001/2 length=150
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Read 2 is the paired cDNA fragment from the same molecule. Because part of read 1 is occupied by UMI+barcode, only the remaining cDNA portion of read 1 should be aligned jointly with read 2. During alignment, it is important to allow for potentially large gaps (for example due to introns or long insert structures), otherwise valid paired reads may be missed.

Alignment and expression matrix

The data is aligned with STAR, a software allowing for spliced alignment of RNA-seq reads. The output of the alignment is a gene-by-cell count matrix, where each entry represents the number of UMIs (or reads) assigned to a particular gene in a particular cell. This matrix is the starting point for all downstream analyses, including quality control, normalization, dimensionality reduction