Single cell analysis workflow

Author

Samuele Soraggi, Manuel Peral-Vázquez, Stig U Andersen, Mikkel H Schierup

Modified

June 19, 2026

NoteTutorial description

This tutorial will cover the basic steps of single cell analysis from preprocessing to the final results production. at the end of this tutorial you will be able to use python to

  • Filter your data selecting specific criteria
  • Preprocess your data for advanced analysis
  • Identify potential cell types
  • Perform differential gene expression
  • Visualize the basic differentiation dynamics of your data
  • Merge datasets and do cross-data analysis

Biological background

We will start by analyzing a dataset coming from various sections of human testicular tissue. The testis is a complex organ composed of multiple cell types: germ cells at different stages of maturation and several somatic cell types supporting testicular structure and spermatogenesis (development of cells into spermatozoa); Sertoli cells, peritubular cells, Leydig cells, and other interstitial cells, as outlined in the figure below. Characterizing the various cell types is important to understand which genes and processes are relevant at different levels of maturations of cells into spermatozoa.

Tubule scheme
Figure: section of a tubule of the human testis. The human testis are surrounded by long tubules in which cells
start to develop, beginning from the walls of the tubules towards the center. At the center of the tubule,
spermatozoa will access to the epididymis to reach full maturation.

After characterizing the spermatogenic process, we will perform comparative analysis of our dataset to testicular samples from men affected by azoospermia (reduced or absent froduction of spermatozoa). Infertility is a growing problem, especially in the Western world, where approximately 10–15% of couples are infertile. In about half of the infertile couples, the cause involves a male-factor (Agarwal et al. 2015; Barratt et al. 2017). One of the most severe forms of male infertility is azoospermia (from greek azo, without life) where no spermatozoa can be detected in the ejaculate, which renders biological fatherhood difficult. Azoospermia is found in approximately 10–15% of infertile men (Jarow et al. 1989; Olesen et al. 2017) and the aetiology is thought to be primarily genetic.

Common to the various azoospermic conditions is the lack or distuption of gene expression patterns. It makes therefore sense to detect genes expressed more in the healthy dataset against the azoospermic one. We can also investigate gene enrichment databases to get a clearer picture of what the genes of interest are relevant to.

Tubule Azoospermia
Figure: Examples of testicular histology and the composition of testicular cell types that can be observed among men with various types of azoospermia. Degenerated ghost tubules (#) are tubules where an abnormally large central channel is present, but no cells are developing from the walls of the tubule. Other tubules show Sertoli-cell-only (SCO) pattern (*) and large clusters of Leydig cells, meaning they only have nurse cells, but no developing germ cells. Tubules with germ cell neoplasia in situ (GCNIS) do not contain any normal germ cells (&). GCNIS cells are the precursor cells of testicular germ cell cancer and are found more frequently among men with azoospermia than among men with good semen quality. From Soraggi et al 2020.

UMI-based single cell data from microdroplets

The dataset we are using in this tutorial is based on a microdroplet-based method from 10X chromium. From today’s lecture we remember that a microdroplet single cell sequencing protocol works as follow:

  • cells and beads are co-encapsulated in gel/oil droplets, aiming for one cell and one bead per occupied droplet
Annotated Data
Figure: Isolation of cells and beads into microdroplets.
</figure>
  • captured transcripts are tagged by the bead with a cell barcode and a unique molecular identifier (UMI); note that capture efficiency is typically 10–20%, so most transcripts in a cell are not recovered
  • 3’ reverse transcription of captured mRNA into cDNA is then performed in preparation for PCR amplification
  • the cDNA is amplified through PCR cycles
    Annotated Data
    Figure: steps for the microdroplet-based single cell RNA sequencing after isolation.

The raw data in practice

Let’s look at a specific read and its UMI and cell barcode. The data is organized in paired-end reads (written on fastq files), where the first fastq file contains reads in the following format

@SRR8363305.1 1 length=26
NTGAAGTGTTAAGACAAGCGTGAACT
+SRR8363305.1 1 length=26
#AAFFJJJJJJJJJJJJJJJJFJJJJ

Here, the first 16 characters NTGAAGTGTTAAGACA represent the cell barcode, while the last 10 characters AGCGTGAACT are the transcript UMI tag. The last line represents the quality scores of the 26 characters of barcode+UMI.

The associated second fastq file contains reads of 98nt as the following

@SRR8363305.1 1 length=98
NCTAAAGATCACACTAAGGCAACTCATGGAGGGGTCTTCAAAGACCTTGCAAGAAGTACTAACTATGGAGTATCGGCTAAGTCAANCNTGTATGAGAT
+SRR8363305.1 1 length=98
#A<77AFJJFAAAJJJ7-7-<7FJ-7----<77--7FAAA--<JFFF-7--7<<-F77---FF---7-7A-777777A-<-7---#-#A-7-7--7--

The 98nt-long string of characters in the second line is a partial sequence of the cDNA transcript. Specifically, the 10X chromium protocol is biased towards the 3’ end of transcripts, because the bead oligos capture mRNA at its poly-A tail; sequencing therefore covers predominantly the 3’ portion of each transcript. The last line contains the corresponding quality scores.

Alignment and expression matrix

Once the data is sequenced, it is possible to align the reads to the genome. This is done with splice-aware aligners (also called gap aligners, e.g. STAR or HISAT2), which can handle reads that span exon–exon junctions when aligning to the genome. We will skip the alignment step because it is quite trivial (it requires a pipeline implemented by 10X if you are using 10X data, otherwise you can use one of the available pipelines available on NextFlow), and because it would require too much time and memory for the scope of a one-day tutorial. Instead, we start from the transcript count matrix that results as the output from the genome alignment.

Data analysis

Warning
  • Run the code cells one at a time, and wait that the running cell is node before starting the next one. This notebook has a tendency to cause memory problems when you execute too many cells at once
  • If at some point you cannot see any plot in this notebook, please create a code cell with the command %matplotlib inline, and run it. We use various plotting libraries which mess up some internal settings once in a while.

Prepare packages and data necessary to run this python notebook. We will use scanpy as the main analysis tool for the analysis. Scanpy has a comprehensive manual webpage that includes many different tutorials you can use for further practicing. Scanpy is used in the discussion paper and the tutorial paper of this course. An alternative and well-established tool for R users is Seurat. However, scanpy is mainatined and updated by a wider community with many of the latest developed tools.

Note: it can take few minutes to get all the package loaded. Do not mind red-coloured warnings.

import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import anndata as ad
import bbknn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import os
import gc
import plotly.express as px
import re

%matplotlib inline
%run ../Scripts/pythonScripts.py

Load the dataset.

There are many different possible formats the data can be loaded from. Each format has a dedicated reading command in scanpy, for example read_h5ad, read_10X, read_csv,…. In our case, we have a file already in h5ad format. This format is very convenient to store data with large matrices and their annotations, and is often used to store the scRNAseq expression data after alignment and demultiplexing.

adata = sc.read_h5ad('../Data/scrna_data/rawDataScanpy.h5ad')

The data is opened and an Annotated data object is created. This object contains:

  • The data matrix adata.X of size \(N\_cells \times N\_genes\). The cells are called observations (obs) and the genes variables (var).
  • Vectors of cells-related variables in the dataframe adata.obs
  • Vectors of genes-related variables in the dataframe adata.var
  • Matrices of size \(N\_cells \times N\_genes\) in adata.layers
  • Matrices where each line is cell-related in adata.obsm
  • Matrices where each line is gene-related in adata.varm
  • Anything else that must be saved is in adata.uns
Annotated Data
Figure: Structure of an annotated data object. In green the stack of expression matrix it can contains, where X is the one currently used for analysis.
</figure>

The data has 62751 cells and 33694 genes

adata.shape
(62751, 33694)

We calculate quality measures to fill the object adata with some information about cells and genes

sc.preprocessing.calculate_qc_metrics(adata, inplace=True)

We can see that now adata contains many observations (obs) and variables (var). Those can be used for filtering and analysis purpose, as well as they might be needed by some scanpy tools

adata
AnnData object with n_obs × n_vars = 62751 × 33694
    obs: 'batch', 'super_batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

adata.obs is a dataframe, i.e. a table with indexes on the rows (cell barcodes) and column names (the observation types). One can select a specific observation type by indexing it in the table. We use .head() to show only the first few lines of the dataframe.

adata.obs.head()
batch super_batch n_genes_by_counts log1p_n_genes_by_counts total_counts log1p_total_counts pct_counts_in_top_50_genes pct_counts_in_top_100_genes pct_counts_in_top_200_genes pct_counts_in_top_500_genes n_counts
index
AAACCTGAGCCGGTAA-1-0 Sohni1_und SohniUnd 61 4.127134 511.0 6.238325 97.847358 100.000000 100.000000 100.000000 511.0
AAACCTGAGCGATTCT-1-0 Sohni1_und SohniUnd 2127 7.662938 5938.0 8.689296 34.725497 45.368811 56.113169 69.434153 5938.0
AAACCTGAGCGTTTAC-1-0 Sohni1_und SohniUnd 3768 8.234565 8952.0 9.099744 16.979446 24.005809 33.031725 48.514298 8952.0
AAACCTGAGGACAGAA-1-0 Sohni1_und SohniUnd 1588 7.370860 4329.0 8.373322 35.458535 48.972049 60.198660 74.867175 4329.0
AAACCTGAGTCATGCT-1-0 Sohni1_und SohniUnd 618 6.428105 962.0 6.870053 35.654886 46.049896 56.548857 87.733888 962.0
adata.obs['batch'] #sample label - the data contains 15 separate samples
index
AAACCTGAGCCGGTAA-1-0     Sohni1_und
AAACCTGAGCGATTCT-1-0     Sohni1_und
AAACCTGAGCGTTTAC-1-0     Sohni1_und
AAACCTGAGGACAGAA-1-0     Sohni1_und
AAACCTGAGTCATGCT-1-0     Sohni1_und
                            ...    
TTTGTCACACAGACTT-1-14      Her8_Spc
TTTGTCACAGAGTGTG-1-14      Her8_Spc
TTTGTCAGTTCGGCAC-1-14      Her8_Spc
TTTGTCATCAAACCAC-1-14      Her8_Spc
TTTGTCATCTTCAACT-1-14      Her8_Spc
Name: batch, Length: 62751, dtype: category
Categories (15, object): ['Sohni1_und', 'Sohni2_und', 'Sohni1_I', 'Sohni2_I', ..., 'Her5', 'Her6', 'Her7_Spt', 'Her8_Spc']

adata.var works similarly, but now each row is referred to a gene

adata.var.head()
n_cells_by_counts mean_counts log1p_mean_counts pct_dropout_by_counts total_counts log1p_total_counts
index
RP11-34P13.3 113 0.001865 0.001863 99.819923 117.0 4.770685
FAM138A 0 0.000000 0.000000 100.000000 0.0 0.000000
OR4F5 1 0.000016 0.000016 99.998406 1.0 0.693147
RP11-34P13.7 635 0.010805 0.010747 98.988064 678.0 6.520621
RP11-34P13.8 12 0.000191 0.000191 99.980877 12.0 2.564949
adata.var['n_cells_by_counts'] #nr of cells showing transcripts of a gene
index
RP11-34P13.3     113
FAM138A            0
OR4F5              1
RP11-34P13.7     635
RP11-34P13.8      12
                ... 
AC233755.2        13
AC233755.1         3
AC240274.1      9434
AC213203.1        15
FAM231B            0
Name: n_cells_by_counts, Length: 33694, dtype: int64

Preprocessing

We preprocess the dataset by filtering cells and genes according to various quality measures and removing doublets. Note that we are working with all the samples at once. It is more correct to filter one sample at a time, and then merge them together prior to normalization, but we are keeping the samples merged for simplicity, and because the various samples are technically quite homogeneous.

Quality Filtering

Using the prefix MT- in the gene names we calculate the percentage of mithocondrial genes in each cell, and store this value as an observation in adata.obs. Cells with high MT percentage are often broken cells that spilled out mithocondrial content (in this case they will often have low gene and transcript counts), cells captured together with residuals of broken cells (more unlikely if a good job in the sequencing lab has been done) or empty droplets containing only ambient RNA.

MT = ['MT-' in i for i in adata.var_names] #a vector with True and False to find MT genes
perc_mito = np.sum( adata[:,MT].X, 1 ).A1 / np.sum( adata.X, 1 ).A1
adata.obs['perc_mito'] = perc_mito.copy() # saves fraction of cell's total transcripts coming from mitochondrial genes.

One can identify cells to be filtered out by looking at the relation between number of transcripts (horizontal axis) and number of genes per cell (vertical axis), coloured by percent of MT genes. We can see that high percentages of mitocondrial genes are present for cells that have less than 1000 detected genes (vertical axis).

sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='perc_mito', 
              title='Transcript vs detected genes coloured by mitochondrial content')

Another useful visualization is the distribution of each quality feature of the data. We look at the amount of transcripts per droplet zooming into the interval (0,20000) transcripts to find a lower threshold (representing which droplets are empty). Usually, there is a peak with low quality droplets on the left side of the histogram, or a descending tail, which you consider as the threshold area. In our case we can choose 2000 as threshold (will be used later in the code). Hover on the plots with the mouse to see the value of each bar of the histogram.

fig = px.histogram(adata[adata.obs['total_counts']<20000].obs, x='total_counts', nbins=100,
                  title='distribution of total transcripts per droplet for <20000 transcripts')
fig.show()

For the upper threshold of the number of transcripts, we can choose 40000

fig = px.histogram(adata.obs, x='total_counts', nbins=100, 
                   title='distribution of total transcripts per droplet')
fig.show()

TipTask T1

Now look at the histogram for the number of detected genes per droplet and choose two thresholds in the same way as above. Keep those number in mind.

fig = px.histogram(adata.obs, x='n_genes_by_counts', nbins=100, title='distribution of detected genes per droplet')
fig.show()

TipTask T2

Cells with too much mitochondrial content might be broken cells spilling out MT content, or ambient noise captured into the droplet. Choose a maximum allowed value for mitochondrial content. Standard values for such threshold are between 0.05 and 0.2.

fig = px.histogram(adata.obs, x='perc_mito', nbins=100, title='distribution of mitochondrial content per droplet')
fig.show()

TipTask T3

Change the three thresholds below with the values you chose - the ones here are just examples. MIN_GENES and MAX_GENES are the lower and upper threshold for the number of detected genes, and MAX_MITO is the thresholds for the max mitochondrial content allowed.

MAX_GENES = 8000
MIN_GENES = 800
MAX_MITO = .05
sc.preprocessing.filter_cells(adata, max_genes=MAX_GENES)
sc.preprocessing.filter_cells(adata, min_genes=MIN_GENES)
sc.preprocessing.filter_cells(adata, max_counts=40000)
adata = adata[adata.obs['perc_mito']<MAX_MITO].copy()

It is good practice to also remove those genes found in too few cells (for example in 10 or less cells). Any cell type clustering 10 or less cells will be undetected in the data, but in any case it would be irrelevant to have such tiny clusters, since statistical analysis on those would be unreliable.

sc.preprocessing.filter_genes(adata, min_cells=10)
print('There are now', adata.shape[0], 'cells and', adata.shape[1],'genes after filtering')
There are now 28286 cells and 28842 genes after filtering

Doublets removal

Another important step consists in filtering out multiplets. We will use the package scrublet (Wolock et al, 2019), that simulates doublets from the data and compare the simulations to the real data to find any doublet-like cells in it.
Scrublet
Figure: schematics of the scrublet algorithm, from the related paper.
</figure>
Note

Multiplets are in the almost totality of the cases doublets, because triplets and higher multiplets are extremely rare. We will thus talk only about doublets instead of multiplets. Read this more technical blog post for deeper explanations about this fact.

As a rule of thumb, you can have a look at the table below to see what is the expected amount of doublets rate for different amounts of cells loaded in a single cell 10X experiment. In our case, each sample ranges somewhere between 3000 and 5000 cells, meaning there were somewhere between 8000 and 10000 loaded cells in each experiment (assuming efficiency of cell capture between 50% and 70%), so one could use 6-8% as a guess. scrublet wants a rough estimate to start from, so we provide such number.

Doublets
Figure: doublet rates for various setting in a single cell experiment with 10X technology.

Below, we run the detection and obtain a score between 0 and 1 for each doublet. We will use that score for filtering

import scrublet
scrub = scrublet.Scrublet(adata.X, expected_doublet_rate=0.08, random_state=123)
doublet_score, _ = scrub.scrub_doublets(verbose=True)
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.35
Detected doublet rate = 1.3%
Estimated detectable doublet fraction = 62.3%
Overall doublet rate:
    Expected   = 8.0%
    Estimated  = 2.1%
Elapsed time: 45.3 seconds
TipQuestion
  • Q1. Look at the estimated rate of doublets estimated in the output message of scrublet. Does it match with the expectation from the table above knowing we have 3000-5000 cells per sample?

We now plot the doublet scores assigned to each cell by the algorithm. We can see that most cells have a low score (the score is a value between 0 and 1, where 1 is a theoretically perfect doublet). Datasets with many doublets show a more bimodal distribution (look for example at this jupyter notebook from the scrublet tutorial), while here we just have a light tail beyond 0.1. Therefore we will filter out the cells above this threshold

fig = px.histogram(adata.obs, x=doublet_score, title='Distribution of doublet scores per cell')
fig.show()

adata = adata[ doublet_score < .1, ].copy()

Data Normalization

Biologically similar cells are not necessarily directly comparable in a dataset because of different technical biases - amongst many the different percentage of captured transcripts (capture efficiency), the presence of technical replicates, the presence of noisy transcripts. The capture efficiency can be influenced by many factors, i.e. the different transcript tags leading to different capture efficiency, the type of protocol used in the laboratory, the amount of PCR performed on different transcripts. Biological biases might as well alter the transcript proportion in a cell, for example in case of different points in the cell cycles altering the expression of specific genes.

To avoid these differences, a normalization approach is needed. Normalization is one of the main topics of scRNAseq data preprocessing, and many advanced techniques take into account the statistical distribution of counts and the presence of technical/biological features of interest (Lytal et al, 2020).

The most standard approach is the TPM (Transcripts Per Million) normalization. Here, the transcripts in each cell are rescaled by a factor such that each cell has the same number of transcripts. After TPM rescaling, the data is usually logarithmized, so that a transcript \(x\) becomes \(log(x+1)\). Logarithmization is known to help reducing the technical bias induced by the amount of transcripts in each cell. Finally, the data is standardized with mean 0 and variance 1. This is necessary since the PCA assumes implicitly that datapoints are normally distributed.

As a rule of thumb, TPM works fine but might introduce biases in the data, mostly due to technical differences that are not always removed by normalization. It is possible to use more advanced methods for considering technical and biological covariates as part of a statistical model for the transcripts. One of the current state-of-the-art method is scTransform (Hafemeister and Satija, 2019). This is currently implemented in R.

We use the standard TPM normalization and evaluate the result looking at the data projections later on - our datasets are of very good quality, so there will be no problems with the simple TPM normalization. We select also the most variable genes to avoid considering genes that have very constant expression across all the dataset, and are therefore not informative. The most variable genes are used to create the PCA projection of the data, and we can use them also in other parts of the analysis.

We now normalize, label the most variable genes and scale the data

# save raw data matrix
adata.layers['raw_counts'] = adata.X.copy()
# TPM normalization
sc.pp.normalize_total(adata)
# matrix logarithmization
sc.pp.log1p(adata)
# most variable genes
sc.pp.highly_variable_genes(adata)
#scale
sc.pp.scale(adata)
adata.layers['scaled_counts'] = adata.X.copy()

Dimensionality reduction

With the term dimensionality reduction, we intend the projection of each data point (cell) \(x=[x_1, x_2, \dots, x_{N_{genes}}]\) into a data point \(y\) in \(D\) dimensions, so that \(y=[y_1, y_2, \dots, y_D]\), where \(D << N_{genes}\).

Dimensionality reduction is meaningful for single cell data analysis, since we know that the genes are expressed in modules of co-expression, meaning that the behaviour of many co-expressed genes can be “compressed” into a single pattern. Clustering and other heavy algorithms will use this compressed version of the data.

PCA

PCA is one of the most used dimensionality reduction methods. It projects the data by identifying the “axae of maximum variation”. In other words, it highlights the strongest effects making cells different from each other. PCA has proved to be a reliable method for single cell data, especially to establish a starting point for other analysis. The PCA is simple to run:

sc.pp.pca(adata)

You can visualize the ratio of variances of each subsequent pair of principal components, where you can see which number of dimensions is best to consider for further applications. Low variance ratios illustrate that along their principal components only little information is represented, probably mostly backgound noise of the dataset. Here we can for example say that 15 dimensions mostly explain the meaningful biological variation in the data. We will use the option n_pcs=15 in some of the functions of scanpy that are based on the PCA.

%matplotlib inline
sc.plotting.pca_variance_ratio(adata)

You can plot any dimensionality reduction and colour cells by elements of adata.obs or by gene expression. Below we plot the PCA coloured by different samples composing the dataset, the total number of transcripts and the expression of gene SYCP1. Note how the various samples overlap into the PCA plot. In some cases you can see that the PCA plot is composed of chunks separated by sample, but fortunately this is not our situation.

In such case more advanced techniques for batch correction are needed, so that the samples are taken into account. Scanpy has the function sc.pp.combat to correct for batch effects, but there are many other possibilities such as the package mnnCorrect.

plt.rcParams['figure.figsize'] = (6,6) #reduce figure size
sc.pl.pca(adata, color=['batch','total_counts','SYCP1'])

UMAP projection

UMAP (McInnes et al, 2018) is another projection algorithm based on topological theory. This projection technique has become one of the most used and appreciated, and is structured in a way that calculations are fast also on big datasets. Explore this “paper with code” to learn about UMAP and its extension applied to neural networks. UMAP has also a manual page with other examples in python.

Warninginterpreting UMAP

UMAP (and similar techniques) are doing some complex work behind the scenes. Notably, they “deform” the space which contains the data to calculate the projection, and this happens through operations which are not of geometrical nature. The result is that you get a very nice plot to look at to understand cluster structure, but distances are not always comparable to the original ones. This means that two clusters close to each other might be more different than two farther apart, so it is useful to double check if closeness on UMAP corresponds to closeness in biology, by performing more formal analysis (like cluster assignment).

We calculate first the neighborhood of each cell (the distance between the most similar cells). We use the package bbknn to take samples into account and reduce any remaining difference between samples, even though we can see a good overlapping in the PCA plot. In case you haven’t multiple samples, you can use the function sc.pp.neighbors. Note that we impose the use of 15 PCA components.

bbknn.bbknn(adata, n_pcs=15)

We calculate and plot the UMAP projection. We calculate three components, so we can also produce a 3d plot.

TipTask T4

The parameters a and b can be used to strech and compress the projection. Usually they are decided automatically, but you can play with values around 1 to find the plot that is fanciest or clearest to you. Simply rerun the three commands below as much as you want.

sc.tools.umap(adata, random_state=12345, n_components=3, a=.9, b=.9)

The plot is made for different pairs of dimensions (1,2 - 1,3 - 2,3) and coloured by a single gene expressed in spermatids. You can observe how there are some smaller clusters separated from a main block of cells looking like a long “snake”. Note also that clusters that seem close to each other, might actually end up being far away when rotating our point of view.

sc.plotting.umap(adata, color=['TNP2'], components=['1,2','1,3','2,3'])

sc.plotting.umap(adata, color=['TNP2'], projection='3d', components=['1,2,3'] )

Clusters Identification

Cell types can be identified and then assigned to clusters calculated by one of the many available clustering techniques. When identifying a cluster, we can assign that to a likely cell type by visualizing known markers on a projection plot, e.g. an UMAP plot. Similarly, one can score cells for a set of markers. Finally, if an atlas of cell types is available, one can use it for a supervised assignment of cell types (see for example the R package scmap) and the ingest tool from scanpy.

Be aware that clusters do not necessarily match perfectly cell types. After all we are pretty complex organisms and cells change in a continuous way. Hard clusters are thoughwidely accepted, and are intended as a tool to understand the underlying biology of the data.

Leiden clustering algorithm

Scanpy contains the leiden algorithm (Traag et al, 2019), making clusters whose points are well connected and at the same time mostly disconnected from other clusters. Other approaches, e.g. k-means, can be performed on the data or on its PCA/tSNE/UMAP projection. The leiden algorithm is however considered to be one of the best techniques at the moment to find communities of similar datapoints, and we will therefore use that. Read more about the leiden algorithm here.

leiden
Figure: refinement steps used by the leiden algorithm to find communities of well-connected points.
</figure>

We calculate some clusterings at various resolution, and choose the one matching best the clusters we have visualized from the markers’ score. The amount of clusters depends on the chosen resolution. Later in this tutorial, we will integrate this dataset with another one, showing how to cluster the new data using ours as a “reference” clustering.

#leiden clustering at various resolutions
sc.tools.leiden(adata, resolution=1, random_state=12345, key_added='leiden_R1')
sc.tools.leiden(adata, resolution=0.5, random_state=12345, key_added='leiden_R.5')
sc.tools.leiden(adata, resolution=0.25, random_state=12345, key_added='leiden_R.25')
sc.tools.leiden(adata, resolution=0.1, random_state=12345, key_added='leiden_R.1')
TipTask T5

Choose below which clustering you prefere. It should match as much as possible the clusters you could visualize through the scores plots on UMAP, or have more clusters than that (you can always name two clusters the same), Write the clustering name in the variable CHOSEN_CLUSTERING after the plots, where you have an example already written in it.

sc.plotting.umap(adata, color=['leiden_R1','leiden_R.5','leiden_R.25', 'leiden_R.1'], legend_loc='on data', ncols=2)

CHOSEN_CLUSTERING = 'leiden_R.5'

Differential Gene expression

It is not always easy to assign clusters from marker scores. For example, cluster 12 at resolution 0.25 is not immediately recognizable as peritubular myoid cells (the Myoid cluster). Sometimes it is good to double check which genes are most expressed in a cluster compared to all the others before assigning cell types. Here we look at the 50 most expressed genes in each cluster from the Leiden algorithm.

Differential gene expression is also the way of making a sound argument in a publication or report. Showing fancy UMAP plots coloured by gene expression is sadly not enough :) A standard way to do it, is to test if each gene has mean expression higher in a cluster than in all the others. Scanpy does that by using a t-test or the wilcoxon test. While the t-test is somewhat less powerful, the wilcoxon test is not best behaved with very sparse data, so we simply apply the default t-test to the dataset. Note that, for the standard t-test in scanpy, you need to use logarithmized data.

More advanced techniques that model the statistical properties of the data, such as its sparsity, are available for differential expression. See for example the package MAST (Finak et al, 2015) and the following review on differential expression methods: (Squair et al, 2021).

#Perform the test on logarithmized data
adata.X = adata.layers['raw_counts'] #raw data
sc.pp.normalize_total(adata) #TPM normalization
sc.pp.log1p(adata) #logarithm
sc.tl.rank_genes_groups(adata, groupby=CHOSEN_CLUSTERING, n_genes=50) #Top 50 diff.expressed genes in each cluster
adata.X = adata.layers['scaled_counts'] #Set again the scaled data as standard data matrix
WARNING: adata.X seems to be already log-transformed.

Organize result in a table. Each column has the cluster numbers with _N, _P, _L representing respectively the gene Names, their P-values and the Log-fold change of each gene expression compared to the other clusters. Many clusters show markers typical of spermatogenesis processes (MYL9, TEX-genes, …)

result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
X = pd.DataFrame(
    {group + '_' + key[:1].upper(): result[key][group]
    for group in groups for key in ['names', 'pvals_adj','logfoldchanges']})
X.head() #print only first five lines
0_N 0_P 0_L 1_N 1_P 1_L 2_N 2_P 2_L 3_N ... 12_L 13_N 13_P 13_L 14_N 14_P 14_L 15_N 15_P 15_L
0 ACTA2 0.0 8.279956 HMGA1 0.0 5.305916 B2M 0.0 5.296463 DCN ... 5.242647 B2M 0.0 4.939682 CD74 0.0 7.522440 SMC3 1.464649e-248 4.686258
1 IGFBP7 0.0 5.464073 ZNF428 0.0 4.582889 GNG11 0.0 6.104707 IGFBP7 ... 3.766889 TMSB10 0.0 4.535407 HLA-DRA 0.0 8.359200 WBSCR22 3.479532e-245 4.242674
2 PTGDS 0.0 7.328448 RPS12 0.0 3.290441 HLA-B 0.0 4.736052 PTGDS ... 2.512227 TMSB4X 0.0 4.319129 TMSB4X 0.0 4.720968 SYCP3 1.536166e-230 5.716512
3 CALD1 0.0 4.448563 RPSA 0.0 3.078222 HLA-E 0.0 5.278060 VIM ... 2.573015 HLA-B 0.0 4.532363 B2M 0.0 4.394653 HSP90AA1 4.960966e-264 2.685838
4 TPM2 0.0 5.584510 DNAJB6 0.0 3.448990 TMSB4X 0.0 4.875499 CD63 ... 3.612036 IFI27 0.0 5.533888 FTL 0.0 4.282821 TEX30 2.468216e-217 3.748611

5 rows × 48 columns

The table can also be exported into csv format in the folder results_scrna. It can be opened using jupyterlab as well Excel if you download it and set the comma character , as separator. You can also open them from the jupyterlab file manager.

%%bash

mkdir -p results_scrna
X.to_csv('./results_scrna/DiffExpression_NoAnnotation.csv', sep=',', index=None)

Cluster assignment

Defining the most likely cell type present in a cluster is sometimes a very complicated task, especially in big datasets where the cells change from one type to another in a very continuous fashion, making the “hard clustering” not a completely realistic model.

However, for many applications, a hard clustering associated with a careful markers analysis is a well-accepted technique in the scientific community, and is used in basically the totality of the state-of-the-art publications. To avoid the subjectivity of assigning clusters by hand, we get the best score for each cell type in each of the clusters. To do that, we average the markers scores calculated before with an ad-hoc function we wrote

adata.obs['spermatogenesis_types'] = clustersByScores(adata, markers_scores, leidenClusters=adata.obs[CHOSEN_CLUSTERING])

Let’s see how things look like! It seems we have a pretty uniform cluster naming!

plt.rcParams['figure.figsize'] = (12,12)
sc.pl.umap( adata, color=['spermatogenesis_types'], 
           legend_loc='on data', 
           legend_fontsize=16,
           frameon=False,
           size=60,
           add_outline=True,
           ncols=1,
           components=['1,2','2,3','1,3'] 
           )

Warning

Some cell types for which we had markers might be missing in the plot. They might not have had enough high scoring to be assigned, or we had not enough markers, and need more biological insight. We continue with whatever clustering we have obtained, which should contains all or almost all types we need.

Gene enrichment analysis (GEA)

TipTask T6

GEA consists in testing a set of genes to see if it overlaps significantly with lists of genes related to biological terms contained in a database (Maleki et al, 2020). There are all sorts of databases related to biological pathways, Tissue types, Transcription factors coexpression, Illnesses, … that contain a priori information about sets of genes. Usually those come from reviewed publications.

We will try to find enrichment of genes for pathways, ontologies or diseases from the differentially expressed genes of a cluster, and we will create a “consensus ontology”. We will use Enrichr (Chen et al, 2013), that has a web interface of easy use. In this exercise, you will - choose one of the identified clusters and write it in CHOSEN_CLUSTER - printout a file with a list of differentially expressed genes from the chosen cluster - copy the whole list printout - paste the list in the Enrichr website - find up to 5 terms you consider to be related to spermatogenesis - write them into mentimeter at https://menti.com

There are a lot of databases shown in the results from Enrichr, separated in various tabs, so many terms will not be necessarily related to spermatogenesis, or they will be related to basic cellular processes common to many tissues.

Remember the available clusters:

print( list( adata.obs['spermatogenesis_types'].cat.categories ) ) 
['Elong.Spt', 'Endothelial', 'Leydig', 'Macroph', 'Myoid', 'Round.Spt', 'SpermatocytesI', 'SpermatocytesII', 'SpermatogoniaA', 'SpermatogoniaB']

Choose the cluster name (write it between the two quotes)

CHOSEN_CLUSTER = 'Myoid'

Run differential expression

#Perform the test on logarithmized data
adata.X = adata.layers['raw_counts']
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.tl.rank_genes_groups(adata, groupby='spermatogenesis_types', n_genes=50)
#Use again the scaled data as standard
adata.X = adata.layers['scaled_counts']
WARNING: adata.X seems to be already log-transformed.

Create the table and print the gene names. Highlight and copy them, so they are ready to be pasted.

result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
X = pd.DataFrame(
    {group + '_' + key[:1].upper(): result[key][group]
    for group in groups for key in ['names', 'pvals_adj','logfoldchanges']})
for i in X[f'{CHOSEN_CLUSTER}_N'].values:
    print(i)
ACTA2
TPM2
MYL9
MYH11
CALD1
PTGDS
IGFBP7
APOE
FHL2
LGALS1
MYL6
TMSB4X
TIMP1
TIMP3
SPARCL1
SMOC2
AEBP1
CPE
SPARC
ITM2B
CD63
CST3
DPEP1
TCEAL4
NGFRAP1
ACTB
NR2F2
RPL10
ENG
MALAT1
OSR2
TPM1
DCN
TAGLN
EEF1A1
VIM
C11orf96
NUPR1
FTL
B2M
TSHZ2
RPL13A
NEAT1
RPS4X
FTH1
RPL10A
RPS8
MYLK
RPL21
EID1

Now paste the genes into the Enrichr website, then find possible interesting enriched terms.

End of task


Differentiation dynamics

Knowing the beginning of a differentiation process for the cells of an organ, we can model how the development proceeds. This is done by the pseudotimes, based on the statistical concept of diffusion maps (Ronald Coifman and Stephane Lafon, 2006, Angerer et al, 2016): given the cells at the beginning of the differentiation, pseudotimes are a special distance-based measure computed in diffusion map space - imagine walking along the differentiation trajectory starting from SpermatogoniaA, measuring how far you have travelled. The pseudotime is then visualized on the UMAP, but note that it is not computed from UMAP distances. Here we subset the dataset to consider only spermatogenic cells excluding somatic cells.

subdata = adata[ [i not in ['Endothelial','Macroph','Myoid','Smooth_Muscle','Leydig'] 
                  for i in adata.obs['spermatogenesis_types']] ].copy()
subdata.uns['iroot'] = np.flatnonzero(subdata.obs['spermatogenesis_types'] == 'SpermatogoniaA')[1]
sc.tl.dpt(subdata, n_dcs=2)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
plt.rcParams['figure.figsize'] = (8,8)
sc.pl.umap( subdata, color=['dpt_pseudotime'],
           legend_loc='right margin', 
           legend_fontsize=16,
           frameon=False,
           size=60,
           add_outline=True,
           ncols=1,
           cmap='Blues'
           )

Note how some clusters have a small variance in pseudotime. This is because those clusters show a lower variability in gene expressions than in cell types where a wide range of differentiation events happen.

sc.pl.violin(subdata, keys='dpt_pseudotime', groupby='spermatogenesis_types', rotation=90,
             order= ['SpermatogoniaA','SpermatogoniaB','SpermatocytesI','SpermatocytesII','Round.Spt','Elong.Spt'] )

Copy pseudotimes in the main object, leaving somatic cells at a negative value as a default, since they have not a calculated pseudotime

adata.obs['dpt_pseudotime'] = np.repeat(-1, adata.shape[0])
adata.obs.loc[subdata.obs_names, 'dpt_pseudotime'] = subdata.obs['dpt_pseudotime']

remove subdata and use the gc garbage collector to free up memory

#remove subdata as it is no longer used
del subdata
gc.collect() 
6615

Comparisons across different datasets

In this last part of the analysis we import a dataset of testis tissues from infertile men (affected by cryptozoospermia). The data has already been preprocessed in advance. We will first annotate the new dataset using our spermatogenesis data as a reference, so that cluster names and pseudotimes are assigned accordingly. Then we will compare gene expressions using statistical tests.

Reference-based annotation

We read the new data

crypto = sc.read_h5ad('../Data/scrna_data/crypto_azoospermia.h5ad')
#use only genes present in both datasets
var_names = adata.var_names.intersection(crypto.var_names)
adata = adata[:, var_names]
crypto = crypto[:, var_names]

The tool ingest included in scanpy allows the annotation of a dataset starting from a reference. Look here for further examples about the use of ingest. We choose the clustering and the pseudotimes as observations to transfer to the new data. It takes some time to do the annotation.

sc.tl.ingest(crypto, adata, obs=['spermatogenesis_types','dpt_pseudotime'])

Keep the same cluster colours in the new data

adata
View of AnnData object with n_obs × n_vars = 27396 × 12312
    obs: 'batch', 'super_batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts', 'perc_mito', 'n_genes', 'SpermatogoniaA_score', 'SpermatogoniaB_score', 'SpermatocytesI_score', 'SpermatocytesII_score', 'Round.Spt_score', 'Elong.Spt_score', 'Sertoli_score', 'Macroph_score', 'Leydig_score', 'Endothelial_score', 'Myoid_score', 'Smooth_Muscle_score', 'leiden_R1', 'leiden_R.5', 'leiden_R.25', 'leiden_R.1', 'spermatogenesis_types', 'dpt_pseudotime'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'batch_colors', 'neighbors', 'umap', 'leiden_R1', 'leiden_R.5', 'leiden_R.25', 'leiden_R.1', 'leiden_R1_colors', 'leiden_R.5_colors', 'leiden_R.25_colors', 'leiden_R.1_colors', 'rank_genes_groups', 'spermatogenesis_types_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'raw_counts', 'scaled_counts'
    obsp: 'distances', 'connectivities'
crypto.uns['spermatogenesis_types_colors'] = adata.uns['spermatogenesis_types_colors']  # fix colors

Make pseudotimes as a numeric array

crypto.obs['dpt_pseudotime'] = np.array(crypto.obs['dpt_pseudotime'])

Ingest adapts also the UMAP plot of the new data to the reference one. Note how pseudotimes are off for some cell, especially for the Elongated Spermatids cluster. This might be due to reduced spermatogenic functionality of cells in infertile men.

sc.pl.umap(crypto, color=['spermatogenesis_types', 'dpt_pseudotime'], wspace=0.5,
          title=['atlas-based clustering of azoospermic data', 'atlas-based pseudotimes of azoospermic data'])

Now we merge the two datasets together. Note that the expression matrix will not be integrated. Ingest is used only for reference-based annotation and UMAP alignment, but it cannot integrate the expression matrices removing biases (such as sample bias). For this there are other tools such as scVI-tools, CCA and rPCA and others. However, a good and computationally lighter alternative is to keep the matrices as they are and model the differences when applying tools on them, as we will do with differential expression testing.

merged = ad.concat([adata, crypto], keys=['Healthy', 'Crypto'], label='condition', index_unique='-')
merged.uns['spermatogenesis_types_colors'] = adata.uns['spermatogenesis_types_colors']  # fix category colors

The UMAP shows a nice overlapping of the datasets and cluster names, apart from a few cell types that are off

sc.pl.umap(merged, color=['condition','spermatogenesis_types','dpt_pseudotime'], wspace=0.3, s=30)

As already noticed before, pseudotimes are quite different between the two conditions

fig, (ax1) = plt.subplots(1, 1, figsize=(12,8), gridspec_kw={'wspace':0.5})
ax = sns.violinplot(x="spermatogenesis_types", y="dpt_pseudotime", hue="condition", scale="width", palette="Set2", split=True, ax=ax1,                    
                    data=merged[merged.obs['dpt_pseudotime']>=0].obs[ ['condition', 'dpt_pseudotime', 'spermatogenesis_types'] ], 
                    order=['SpermatogoniaA','SpermatogoniaB','SpermatocytesI', 'SpermatocytesII','Round.Spt','Elong.Spt'])

Free some memory up

del adata, crypto
gc.collect() 
91319

You can save the data like this

merged.write('merged.h5ad')

Cross-dataset differential expression

We will perform the differential expression testing between the two conditions. Just to give a very quick overview, you can find three ways in which differential gene expression is done with scRNA-seq data:

  • using tests that are analogous to the ones for bulkRNA data. This has been the standard for some years, but generates a lot of false positives, because each cell is used as a bulk sample with very low amount of transcripts.
  • using statistical models that take into accounts technical biases in the data (e.g. D3E and MAST) and should be more reliable than standard bulkRNA methods. Squair et al, 2021 show how those methods are much less effective than previously thought, probably because we still consider each cell as a bulk sample in the test, and the technical differences between cells are just too many to model them.
  • using groups of cells (in our case grouped by sample, condition and cell type) as a single bulk sample by summing the cells’ transcriptomes. Squair et al, 2021 show how already standard bulkRNA methods applied on the pseudobulk data are very effective and reduce false positives. Right now, this is probably the state-of-the-art way to proceed. Look also at (this other course) for an application of the pseudobulk technique.

We will apply the latter method. To do this we create pseudobulk samples by the sum of the transcriptomes from multiple cells in the data. In other words, we are doing nothing much different than the bulk RNA analysis done in Notebook 3 of this course. We will code only in python, but in future, when working on your own data, you can use any bulkRNA analysis tool with the pseudobulk matrix.

After creating the pseudobulk matrix, we use a standard t-test for differential gene expression. More advanced tools like pydeseq2 have features for including factors in the analysis.

merged = sc.read('merged.h5ad') #read the data
merged.X
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 343039572 stored elements and shape (29543, 12312)>

We remove mitochondrial and ribosomal genes to focus only on genes of interest.

MT_RP = [('MT-' not in i)and(re.match('^RP[A-Za-z]', i) is None) for i in merged.var_names] #a vector with True and False to find mitocondrial/ribosomal genes
merged = merged[:, MT_RP].copy()

We select the raw count matrix for the analysis, and subset the dataset into a new one including the cluster of interest. First, let’s also look at how many cells there are in each cluster of the healthy and azoospermic sample:

merged.X = merged.layers['raw_counts'].copy()
merged[merged.obs['condition']=='Healthy'].obs['spermatogenesis_types'].value_counts() #healthy sample
spermatogenesis_types
SpermatogoniaA     5043
Myoid              4405
SpermatocytesII    3651
Endothelial        3416
Elong.Spt          3400
Round.Spt          2866
Leydig             2680
SpermatogoniaB     1007
Macroph             596
SpermatocytesI      332
Name: count, dtype: int64

We can see there are many cells in each cell type, but we must also check that each condition has enough. Note that the azoospermia sample is quite little and have few cells in some clusters:

merged[merged.obs['condition']=='Crypto'].obs['spermatogenesis_types'].value_counts()
spermatogenesis_types
SpermatogoniaA     664
Leydig             458
Round.Spt          331
SpermatocytesII    237
Myoid              186
Endothelial        147
Macroph             45
SpermatogoniaB      44
Elong.Spt           28
SpermatocytesI       7
Name: count, dtype: int64

We renormalize the data and take some significant genes. Less variable genes will not be useful for differential expression analysis.

sc.pp.normalize_total(merged)
sc.pp.log1p(merged)
sc.pp.highly_variable_genes(merged, n_top_genes=5000)

Now we subset by most significant genes. Afterwards, we use a script to generate the bulk matrix for both conditions. This script always uses the raw counts matrix even if we normalized the data above to find the most variable genes.

%run ../Scripts/pythonScripts.py
<Figure size 800x800 with 0 Axes>
matrix_bulk, clusters, conditions = pseudobulk_matrix(adata=merged, 
                                          batch_key='batch', 
                                          condition_key='condition', 
                                          cluster_key='spermatogenesis_types')

The bulk matrix looks like this. It shows the total expression of each gene in every cluster, sample and condition. Those are represented in the bulk sample name, together with a number. Our script actually creates 10 pseudobulk samples per each group of cells, by randomly subsampling them. In this way we still keep some of the single cell variability. This is why each pseudobulk sample has a number. For example: Healthy_Guo1_Elong_Spermatids_2 is the pseudobulk sample number 2 from cells in the healthy condition, sample name Guo1 and cluster Elongated spermatids.

matrix_bulk.head()
Healthy_Guo1_Elong.Spt_0 Healthy_Guo1_Elong.Spt_1 Healthy_Guo1_Elong.Spt_2 Healthy_Guo1_Elong.Spt_3 Healthy_Guo1_Elong.Spt_4 Healthy_Guo1_Elong.Spt_5 Healthy_Guo1_Elong.Spt_6 Healthy_Guo1_Elong.Spt_7 Healthy_Guo1_Elong.Spt_8 Healthy_Guo1_Elong.Spt_9 ... Healthy_Sohni2_und_SpermatogoniaB_0 Healthy_Sohni2_und_SpermatogoniaB_1 Healthy_Sohni2_und_SpermatogoniaB_2 Healthy_Sohni2_und_SpermatogoniaB_3 Healthy_Sohni2_und_SpermatogoniaB_4 Healthy_Sohni2_und_SpermatogoniaB_5 Healthy_Sohni2_und_SpermatogoniaB_6 Healthy_Sohni2_und_SpermatogoniaB_7 Healthy_Sohni2_und_SpermatogoniaB_8 Healthy_Sohni2_und_SpermatogoniaB_9
FAM41C 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 ... 0.000000 0.000000 0.000000 1.058002 0.000000 0.000000 1.069217 0.000000 0.000000 0.000000
SAMD11 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 ... 0.314359 0.654008 0.480988 0.777766 1.059563 0.000000 0.000000 0.398048 0.544388 0.000000
NOC2L 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 ... 6.223267 4.058583 4.633202 5.251925 4.440653 4.436545 2.341810 2.899758 5.324596 3.836604
KLHL17 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 ... 0.455224 0.000000 0.000000 0.162646 0.000000 0.000000 1.419866 0.000000 0.000000 0.000000
ISG15 0.0 0.0 0.0 0.466368 0.641645 0.0 0.0 1.29108 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

5 rows × 1390 columns

For practical purpose we exchange rows and columns by rotating the matrix.

matrix_bulk = matrix_bulk.T
matrix_bulk.head()
FAM41C SAMD11 NOC2L KLHL17 ISG15 C1orf159 SDF4 UBE2J2 SCNN1D ACAP3 ... MCM3AP YBEY C21orf58 PCNT DIP2A PRMT2 AC136616.1 AC007325.4 AC023491.2 AC240274.1
Healthy_Guo1_Elong.Spt_0 0.0 0.0 0.0 0.0 0.000000 2.545403 0.000000 3.691806 0.000000 0.000000 ... 0.000000 2.768188 0.000000 0.000000 1.271720 0.000000 0.000000 8.418765 0.0 0.0
Healthy_Guo1_Elong.Spt_1 0.0 0.0 0.0 0.0 0.000000 0.723980 0.000000 5.138862 0.000000 0.862926 ... 0.000000 2.008637 0.919861 0.000000 0.000000 1.206255 0.000000 8.023553 0.0 0.0
Healthy_Guo1_Elong.Spt_2 0.0 0.0 0.0 0.0 0.000000 0.857198 0.622431 7.140317 0.000000 0.000000 ... 0.000000 2.164976 1.302259 0.000000 0.000000 0.000000 0.857198 8.220791 0.0 0.0
Healthy_Guo1_Elong.Spt_3 0.0 0.0 0.0 0.0 0.466368 2.461339 0.000000 5.575853 0.000000 0.000000 ... 0.000000 4.105031 0.562777 0.378161 0.884709 0.935201 0.000000 7.046681 0.0 0.0
Healthy_Guo1_Elong.Spt_4 0.0 0.0 0.0 0.0 0.641645 1.674945 0.000000 6.038050 0.434032 0.862926 ... 0.541363 3.484651 1.280955 0.641645 0.000000 0.434032 0.434032 6.337600 0.0 0.0

5 rows × 12169 columns

We create an annotated data object to be able to do statistical testing.

pbulk = ad.AnnData(matrix_bulk)
pbulk
AnnData object with n_obs × n_vars = 1390 × 12169

We just remove those genes lowly expressed in the pseudobulk data

sc.pp.calculate_qc_metrics(pbulk, inplace=True)
sc.pp.filter_genes(pbulk, min_cells=5)

We create some metadata

pbulk.obs['condition'] = conditions
pbulk.obs['spermatogenesis_types'] = clusters 

and normalize depth of each pseudobulk replicate and logarithmize, similarly to what was done in Notebook 3 of the course

sc.pp.normalize_total(pbulk)
sc.pp.log1p(pbulk)

Out of curiosity, why not plotting a PCA of the pseudobulk data? We can see how it looks similar to the single cell data, with a decent overlapping of the conditions, meaning the pseudobulk process kept the differences between types, and also the data structure overall.

sc.pp.pca(pbulk)
sc.pl.pca(pbulk, color=['condition','spermatogenesis_types'], s=50)

We can finally run the statistical test by condition. We will do a list with ALL genes, so we can explore it at any time afterwards. We will use the condition Healthy as reference, so the results will give how changes are in Crypto against Healthy

sc.tl.rank_genes_groups(pbulk, groupby="condition", reference='Healthy', n_genes=pbulk.shape[1])

We use a script to extract the results in a matrix. We will also calculate some other values under the hood and add them in the table.

DEG_matrix = pseudobulk_extract_DEG(pbulk=pbulk, adata=merged)
---Extracting results
WARNING: adata.X seems to be already log-transformed.
densified
---done

What does the resulting matrix contain? Apart from genes names, we can see in how many bulk samples those are expressed (in percentage for healthy and crypto cells), the fold-change and log-fold-change, the pvalues and adjusted pvalues. Note: those results are for Crypto vs Healthy. For values with Healthy as a reference, just change the sign to all fold-changes and log-fold-changes, and you have changed your reference!

DEG_matrix.head()
Crypto_NAMES Crypto_PVALS Crypto_PVALS_ADJ Crypto_LOGFOLDCHANGES Healthy_PCT Crypto_PCT Crypto_FOLDCHANGES Crypto_LOGPVALS_ADJ Crypto_LOGPVALS
0 EEF1G 7.718793e-218 4.696500e-214 7.676920 3.898379 97.345133 204.636536 50.0 50.0
1 MRPS24 3.667054e-186 1.115610e-182 6.484880 1.741130 80.763857 89.566071 50.0 50.0
2 GABARAP 2.251046e-175 5.478596e-172 5.280738 16.933129 97.438286 38.874126 50.0 50.0
3 H3F3B 5.131351e-158 7.805427e-155 2.113990 96.605344 99.906847 4.328868 50.0 50.0
4 UBE2V1 6.872677e-165 1.393893e-161 4.207666 5.303694 65.160689 18.477098 50.0 50.0

We just want to make a volcano plot of the result table. We consider as significant genes with p-value below 0.001 (log p-value of 3) and a log-fold-change below -2 and above 2 for this plot. The dot size is the percentege of cells expressing each gene.

Depending on your choices along this exercise, you will find slightly different sets of UPregulated and DOWNregulated genes. In some of our tests we found amongst others CXCL2 (upregulated only when azoospermia is together with inflamation) and ZNRF3 (involved in disorder of sex development). Upregulated are for example CMTM1 (involved in fertility, whose knockout has no effect on it), SPATS1 (involved in sperm development) and DLGAP2 (associated to male infertility)

TipTask T7

Find one or two UP and DOWN regulated genes and find out if they are involved in infertility, impaired spermatogenesis, sperm development, …. Optional: open the final table we saved into a table editor like Excel, find all upregulated genes according to log-fold change and p-value. Then use the Enrichr website to see if you find biological terms related to infertility.

pseudobulk_volcano(DEG_matrix, logfold_threshold=2, logpval_threshold=3, plot_size=(800,800))

Finally, you can save the table.

DEG_matrix.to_csv('./pseudobulk_markers.csv', sep='\t', index=None)
NoteWrapping up

You executed the whole analysis to the very end of this notebook!

It has been a pretty long one, but I hope you learnt new things and you will benefit from them in your future scRNA-seq analysis. Remember that the scanpy webpage has a lot of tutorials and very good documentation for all the functions implemented. The universe of available tools for single cell RNA data is growing constantly and is difficult to navigate - I suggest you to try tools that seem interesting, are usable without too many dificulties, and then stick only to those that are really relevant. Fortunately the community revolving around scRNA-seq data analysis is huge and rather helpful and open-source. It is mostly a matter of googling yourself to relevant forums and github repositories to find useful answers to doubts, available tools, and guides.