Tutorial description

This tutorial covers the basic steps of single-cell analysis, from preprocessing to the production of the final results. At the end of this tutorial you will be able to use Python to:

Filter your data selecting specific criteria
Preprocess your data for advanced analysis
Identify potential cell types
Perform differential gene expression
Visualize the basic differentiation dynamics of your data
Merge datasets and do cross-data analysis

The present tutorial, like the rest of the course material, is available at our open-source GitHub repository and will be kept up to date for as long as the course keeps running.

To use this notebook, use the NGS (python) kernel that contains the packages. Choose it by selecting Kernel -> Change Kernel in the menu on top of the window.

A few introductory points to run this notebook

In this notebook you will use only Python commands
On some computers, you might see the result of a command only once it is done running. This means you will wait some time while the computer is crunching, and only afterwards will you see the result of the command you have executed
You can run the code in each cell by clicking on the run cell button, or by pressing Shift + Enter. When the code is done running, a small green check sign will appear on the left side
You need to run the cells in sequential order. Please do not run a cell until the one above has finished running, and do not skip any cells
Each cell contains a short description of the code and the output you should get. Please try not to focus on understanding the code for each command in too much detail, but rather try to focus on the output
You can create new code cells by pressing + in the Menu bar above.

Biological background

We will start by analyzing a dataset coming from various sections of human testicular tissue. The testis is a complex organ composed of multiple cell types: germ cells at different stages of maturation, and several somatic cell types that support testicular structure and spermatogenesis (the development of cells into spermatozoa), namely Sertoli cells, peritubular cells, Leydig cells, and other interstitial cells, as outlined in the figure below. Characterizing the various cell types is important for understanding which genes and processes are relevant at different stages of the maturation of cells into spermatozoa.

Tubule scheme — Figure: section of a tubule of the human testis. The human testis contains long tubules in which cells start to develop, beginning from the walls of the tubules and moving towards the center. From the center of the tubule, spermatozoa move on to the epididymis to reach full maturation.

After characterizing the spermatogenic process, we will perform a comparative analysis of our dataset against testicular samples from men affected by azoospermia (reduced or absent production of spermatozoa). Infertility is a growing problem, especially in the Western world, where approximately 10–15% of couples are infertile. In about half of the infertile couples, the cause involves a male factor (Agarwal et al. 2015; Barratt et al. 2017). One of the most severe forms of male infertility is azoospermia (from Greek azo, without life), where no spermatozoa can be detected in the ejaculate, which renders biological fatherhood difficult. Azoospermia is found in approximately 10–15% of infertile men (Jarow et al. 1989; Olesen et al. 2017) and the aetiology is thought to be primarily genetic.

Common to the various azoospermic conditions is the lack or disruption of gene expression patterns. It therefore makes sense to look for genes that are expressed more in the healthy dataset than in the azoospermic one. We can also query gene enrichment databases to get a clearer picture of what the genes of interest are involved in.

Tubule Azoospermia — Figure: Examples of testicular histology and the composition of testicular cell types that can be observed among men with various types of azoospermia. Degenerated ghost tubules (#) are tubules where an abnormally large central channel is present, but no cells are developing from the walls of the tubule. Other tubules show Sertoli-cell-only (SCO) pattern (*) and large clusters of Leydig cells, meaning they only have nurse cells, but no developing germ cells. Tubules with germ cell neoplasia in situ (GCNIS) do not contain any normal germ cells (&). GCNIS cells are the precursor cells of testicular germ cell cancer and are found more frequently among men with azoospermia than among men with good semen quality. From Soraggi et al 2020.

UMI-based single cell data from microdroplets

The dataset we are using in this tutorial is based on a microdroplet-based method from 10X Chromium. From today’s lecture we remember that a microdroplet single cell sequencing protocol works as follows:

each cell is isolated together with a barcode bead in a gel/oil droplet

Annotated Data — Figure: Isolation of cells and beads into microdroplets.

Data analysis

Prepare the packages and data necessary to run this Python notebook. We will use scanpy as the main analysis tool. Scanpy has a comprehensive manual webpage that includes many different tutorials you can use for further practice. Scanpy is used in the discussion paper and the tutorial paper of this course. An alternative and well-established tool for R users is Seurat. However, scanpy is maintained and updated by a wider community and integrates many of the most recently developed tools.

Note: it can take a few minutes to load all the packages. Do not mind the red-coloured warnings.

import warnings
warnings.filterwarnings("ignore")

import scanpy as sc
import bbknn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import os
import gc
import plotly.express as px
import re
import doubletdetection

Some of the commands used in the course are functions we implemented to make the code of this course easier to read. Mostly, those are commands requiring lines of code that would not add anything to your learning curve (management of plots, trivial calculations, some management of the notebook layout). However, you are free to look at the code in the file Scripts/pythonScripts.py and to reuse our code in your own work (citing our course).

%run ../Scripts/pythonScripts.py

Load the dataset.

The data can be loaded from many different formats. Each format has a dedicated reading command in scanpy, for example read_h5ad, read_10x_mtx, read_csv, and so on. In our case, we have a file already in h5ad format. This format is very convenient for storing large matrices together with their annotations, and is often used to store the scRNAseq expression data after alignment and demultiplexing.

adata = sc.read_h5ad('../Data/scrna_data/rawDataScanpy.h5ad')

The data is opened and an Annotated data object is created. This object contains:

The data matrix adata.X of size \(N_{cells} \times N_{genes}\). The cells are called observations (obs) and the genes variables (var).
Vectors of cell-related variables in the dataframe adata.obs
Vectors of gene-related variables in the dataframe adata.var
Matrices of size \(N_{cells} \times N_{genes}\) in adata.layers
Matrices where each row is cell-related in adata.obsm
Matrices where each row is gene-related in adata.varm
Anything else that must be saved is in adata.uns

Figure: Structure of an annotated data object. In green, the stack of expression matrices it can contain, where X is the one currently used for analysis.

The data has 62751 cells and 33694 genes

adata.shape

(62751, 33694)

If you are running this tutorial on your own laptop and your computer crashes, the full dataset is probably using too much memory, and you might need to subsample your data. You can subsample the data to include, for example, only 5000 cells using the command below (remove the # so that the code can be executed). The results should not differ much from the tutorial with the whole dataset, but you might have to tune some parameters along the way (clustering and the UMAP projection in particular will look different).

#sc.pp.subsample(adata, n_obs=5000, random_state=12345, copy=False)
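
Conceptually, subsampling means drawing a random set of cells without replacement and keeping only their rows. The NumPy sketch below illustrates that idea on a made-up matrix; it is not scanpy's internal code, and the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Made-up count matrix: 20 cells x 6 genes
n_cells, n_genes = 20, 6
counts = rng.poisson(1.0, size=(n_cells, n_genes))

# Draw 5 distinct cell indices at random and keep only those rows
n_keep = 5
kept_rows = np.sort(rng.choice(n_cells, size=n_keep, replace=False))
subsampled = counts[kept_rows, :]

print(kept_rows)
print(subsampled.shape)
```

Fixing the random seed, as random_state does in the command above, makes the same cells come out every time the code is run.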

We calculate quality measures to fill the object adata with some information about cells and genes

sc.preprocessing.calculate_qc_metrics(adata, inplace=True)

We can see that adata now contains many observations (obs) and variables (var). These can be used for filtering and analysis, and some scanpy tools may also need them

adata

AnnData object with n_obs × n_vars = 62751 × 33694
    obs: 'batch', 'super_batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
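
To make the meaning of some of these columns concrete, the sketch below recomputes a few of them by hand on a tiny made-up sparse matrix. It is an illustration of the definitions, not scanpy's implementation.

```python
import numpy as np
from scipy import sparse

# Made-up counts: 3 cells (rows) x 4 genes (columns), stored in sparse format
counts = sparse.csr_matrix(np.array([[0, 2, 1, 0],
                                     [5, 0, 0, 1],
                                     [0, 0, 3, 0]]))

# Per-cell measures (these go into obs)
n_genes_by_counts = np.asarray((counts > 0).sum(axis=1)).ravel()  # genes detected in each cell
total_counts = np.asarray(counts.sum(axis=1)).ravel()             # transcripts in each cell

# Per-gene measures (these go into var)
n_cells_by_counts = np.asarray((counts > 0).sum(axis=0)).ravel()  # cells in which each gene is detected
mean_counts = np.asarray(counts.mean(axis=0)).ravel()             # average count of each gene
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / counts.shape[0])  # % of cells with zero counts

print(n_genes_by_counts, total_counts)
print(n_cells_by_counts, pct_dropout_by_counts)
```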

adata.obs is a dataframe, i.e. a table with indexes on the rows (cell barcodes) and column names (the observation types). One can select a specific observation type by indexing it in the table. We use .head() to show only the first few lines of the dataframe.

adata.obs.head()

	batch	super_batch	n_genes_by_counts	log1p_n_genes_by_counts	total_counts	log1p_total_counts	pct_counts_in_top_50_genes	pct_counts_in_top_100_genes	pct_counts_in_top_200_genes	pct_counts_in_top_500_genes	n_counts
index
AAACCTGAGCCGGTAA-1-0	Sohni1_und	SohniUnd	61	4.127134	511.0	6.238325	97.847358	100.000000	100.000000	100.000000	511.0
AAACCTGAGCGATTCT-1-0	Sohni1_und	SohniUnd	2127	7.662938	5938.0	8.689296	34.725497	45.368811	56.113169	69.434153	5938.0
AAACCTGAGCGTTTAC-1-0	Sohni1_und	SohniUnd	3768	8.234565	8952.0	9.099744	16.979446	24.005809	33.031725	48.514298	8952.0
AAACCTGAGGACAGAA-1-0	Sohni1_und	SohniUnd	1588	7.370860	4329.0	8.373322	35.458535	48.972049	60.198660	74.867175	4329.0
AAACCTGAGTCATGCT-1-0	Sohni1_und	SohniUnd	618	6.428105	962.0	6.870053	35.654886	46.049896	56.548857	87.733888	962.0

adata.obs['batch'] #sample label - the data contains 15 separate samples

index
AAACCTGAGCCGGTAA-1-0     Sohni1_und
AAACCTGAGCGATTCT-1-0     Sohni1_und
AAACCTGAGCGTTTAC-1-0     Sohni1_und
AAACCTGAGGACAGAA-1-0     Sohni1_und
AAACCTGAGTCATGCT-1-0     Sohni1_und
                            ...    
TTTGTCACACAGACTT-1-14      Her8_Spc
TTTGTCACAGAGTGTG-1-14      Her8_Spc
TTTGTCAGTTCGGCAC-1-14      Her8_Spc
TTTGTCATCAAACCAC-1-14      Her8_Spc
TTTGTCATCTTCAACT-1-14      Her8_Spc
Name: batch, Length: 62751, dtype: category
Categories (15, object): ['Sohni1_und', 'Sohni2_und', 'Sohni1_I', 'Sohni2_I', ..., 'Her5', 'Her6', 'Her7_Spt', 'Her8_Spc']
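
Because adata.obs is a regular pandas dataframe, the usual pandas tools can summarise it per sample. The sketch below uses a small made-up table with the same kind of columns; the sample labels s1, s2, s3 and the counts are invented.

```python
import pandas as pd

# Made-up observation table with a categorical sample label
obs = pd.DataFrame(
    {
        "batch": pd.Categorical(["s1", "s1", "s2", "s3", "s3", "s3"]),
        "total_counts": [511.0, 5938.0, 8952.0, 4329.0, 962.0, 1200.0],
    },
    index=[f"cell_{i}" for i in range(6)],
)

# Number of cells in each sample
cells_per_sample = obs["batch"].value_counts()

# Median number of transcripts per cell within each sample
median_counts = obs.groupby("batch", observed=True)["total_counts"].median()

print(cells_per_sample)
print(median_counts)
```

The same two commands can be run on adata.obs to check how many cells each of the 15 samples contributes.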

adata.var works similarly, but now each row refers to a gene

adata.var.head()

	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts
index
RP11-34P13.3	113	0.001865	0.001863	99.819923	117.0	4.770685
FAM138A	0	0.000000	0.000000	100.000000	0.0	0.000000
OR4F5	1	0.000016	0.000016	99.998406	1.0	0.693147
RP11-34P13.7	635	0.010805	0.010747	98.988064	678.0	6.520621
RP11-34P13.8	12	0.000191	0.000191	99.980877	12.0	2.564949

adata.var['n_cells_by_counts'] #nr of cells showing transcripts of a gene

index
RP11-34P13.3     113
FAM138A            0
OR4F5              1
RP11-34P13.7     635
RP11-34P13.8      12
                ... 
AC233755.2        13
AC233755.1         3
AC240274.1      9434
AC213203.1        15
FAM231B            0
Name: n_cells_by_counts, Length: 33694, dtype: int64

Preprocessing

We preprocess the dataset by filtering cells and genes according to various quality measures and removing doublets. Note that we are working with all the samples at once. It is more correct to filter one sample at a time, and then merge them together prior to normalization, but we are keeping the samples merged for simplicity, and because the various samples are technically quite homogeneous.

Quality Filtering

Using the prefix MT- in the gene names, we calculate the percentage of mitochondrial genes in each cell and store this value as an observation in adata.obs. Cells with a high MT percentage are often broken cells that spilled out their mitochondrial content (in this case they will often have low gene and transcript counts), cells captured together with residuals of broken cells (less likely if the sequencing lab has done a good job), or empty droplets containing only ambient RNA.

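# Boolean vector marking the MT genes (gene names starting with the prefix MT-)
MT = adata.var_names.str.startswith('MT-')
# Fraction (between 0 and 1) of the transcripts of each cell that come from MT genes
perc_mito = np.sum( adata[:,MT].X, 1 ).A1 / np.sum( adata.X, 1 ).A1
adata.obs['perc_mito'] = perc_mito.copy()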
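
The same calculation can be followed step by step on a tiny made-up matrix. Gene names and counts below are invented for illustration. Note that the result is a fraction between 0 and 1, so a threshold of 20% corresponds to the value 0.2.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up example: 3 cells x 4 genes, two of which are mitochondrial
gene_names = pd.Index(["MT-CO1", "ACTB", "MT-ND4", "PRM1"])
counts = sparse.csr_matrix(np.array([[8, 1, 1, 0],
                                     [1, 5, 0, 4],
                                     [0, 3, 1, 6]]))

# Columns whose gene name starts with the prefix MT-
is_mito = gene_names.str.startswith("MT-")
mito_cols = np.flatnonzero(is_mito)

# Transcripts from MT genes divided by all transcripts, cell by cell
mito_counts = np.asarray(counts[:, mito_cols].sum(axis=1)).ravel()
total = np.asarray(counts.sum(axis=1)).ravel()
frac_mito = mito_counts / total

print(frac_mito)
```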

One can identify cells to be filtered out by looking at the relation between the number of transcripts (horizontal axis) and the number of detected genes per cell (vertical axis), coloured by the percentage of MT genes. We can see that high percentages of mitochondrial genes occur in cells that have fewer than 1000 detected genes (vertical axis).

sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='perc_mito', 
              title='Transcript vs detected genes coloured by mitochondrial content')

We can zoom into the plot by selecting cells with fewer than 3000 genes

sc.pl.scatter(adata[adata.obs['n_genes_by_counts']<3000], x='total_counts', y='n_genes_by_counts', color='perc_mito',
             title='Transcript vs detected genes coloured by mitochondrial content\nfor <3000 genes')

Another useful visualization is the distribution of each quality feature of the data. We look at the number of transcripts per cell, zooming into the interval (0,20000) transcripts to find a lower threshold. Usually, there is a peak of low-quality cells on the left side of the histogram, or a descending tail. The threshold should cut off such a peak (or tail). In our case we can select 2000 as the threshold. Hover over the plots with the mouse to see the value of each bar of the histogram.

fig = px.histogram(adata[adata.obs['total_counts']<20000].obs, x='total_counts', nbins=100,
                  title='distribution of total transcripts per cell for <20000 transcripts')
fig.show()

For the upper threshold of the number of transcripts, we can choose 40000

fig = px.histogram(adata.obs, x='total_counts', nbins=100, 
                   title='distribution of total transcripts per cell')
fig.show()

Regarding the number of detected genes, a lower threshold could be around 800 genes. An upper threshold can be 8000 genes, to remove the tail on the right side of the histogram

fig = px.histogram(adata.obs, x='n_genes_by_counts', nbins=100, title='distribution of detected genes per cell')
fig.show()

Cells with too much mitochondrial content might be broken cells spilling out MT content, or ambient noise captured into the droplet. Standard values of the threshold are between 5% and 20%. We select 20%.

fig = px.histogram(adata.obs, x='perc_mito', nbins=100, title='distribution of mitochondrial content per cell')
fig.show()
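
Before applying any filter, it can help to count how many cells each threshold would remove. The sketch below does this with plain pandas on a small made-up table, using the threshold values discussed above (2000 and 40000 transcripts, 800 and 8000 genes, 20% mitochondrial content). The table and the names of the checks are invented for illustration.

```python
import pandas as pd

# Made-up quality measures for five cells
obs = pd.DataFrame(
    {
        "total_counts": [511, 5938, 8952, 45000, 962],
        "n_genes_by_counts": [61, 2127, 3768, 8500, 618],
        "perc_mito": [0.45, 0.03, 0.05, 0.02, 0.25],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3", "cell_4"],
)

# One boolean column per threshold: True means the cell passes that check
checks = pd.DataFrame(
    {
        "min_counts": obs["total_counts"] >= 2000,
        "max_counts": obs["total_counts"] <= 40000,
        "min_genes": obs["n_genes_by_counts"] >= 800,
        "max_genes": obs["n_genes_by_counts"] <= 8000,
        "max_mito": obs["perc_mito"] < 0.2,
    }
)

# Number of cells failing each check, and cells passing all of them
failed_per_check = (~checks).sum()
keep = checks.all(axis=1)

print(failed_per_check)
print("cells kept:", int(keep.sum()), "out of", len(obs))
```

Replacing the made-up table with adata.obs gives the same overview for the real data, which is a quick way to see whether one threshold is responsible for most of the removed cells.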

Finally, we look at the percentage of the transcripts of each cell that comes from each gene. We plot the genes showing the highest percentages in a barplot. We can see that MALAT1 accounts for up to 60% of the transcripts in some cells. This can be an indicator of cells of too low quality. Other highly expressed genes are mitochondrial, and the corresponding cells will already be filtered out by the mitochondrial threshold. PRM1, PRM2, PTGDS are typical of spermatogonial processes, and we do not consider those as unusual.

The expression matrix is in compressed format (a so-called sparse matrix), but from now on we will need only the uncompressed matrix. We made a little function to decompress the matrix (array_and_densify).

adata.X = array_and_densify(adata.X)

densified
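
The difference between the compressed and the uncompressed format can be seen on a small made-up matrix. The code of array_and_densify lives in the course scripts and is not shown in this notebook, so the function densify below is only a hypothetical stand-in that does the conversion with SciPy.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Made-up count matrix where most entries are zero, as in single cell data
dense = rng.poisson(0.05, size=(1000, 500)).astype(np.float32)
compressed = sparse.csr_matrix(dense)


def densify(matrix):
    """Hypothetical stand-in: return a dense NumPy array from a sparse or dense input."""
    if sparse.issparse(matrix):
        return matrix.toarray()
    return np.asarray(matrix)


restored = densify(compressed)

# Memory used by the sparse representation (values, column indices, row pointers)
sparse_bytes = compressed.data.nbytes + compressed.indices.nbytes + compressed.indptr.nbytes
dense_bytes = restored.nbytes

print("sparse:", sparse_bytes, "bytes; dense:", dense_bytes, "bytes")
```

The dense matrix takes more memory because every zero is stored explicitly, which is one reason why the full dataset can be heavy on a laptop.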

%matplotlib inline

fig, ax = plt.subplots(1,1)
ax.set_title('Top genes in terms of percentage of transcripts explained in each cell')
# Draw the plot on the prepared axis; scanpy displays it directly
sc.pl.highest_expr_genes(adata, n_top=20, ax=ax)
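
The quantity shown in this kind of plot can be sketched by hand: for every cell, each gene's count is divided by the total count of the cell, and genes are then ranked by their average percentage across cells. The example below uses made-up counts and gene labels, and reflects our understanding of the plot rather than scanpy's own code.

```python
import numpy as np
import pandas as pd

genes = ["MALAT1", "PRM1", "ACTB", "MT-CO1"]

# Made-up counts: 3 cells x 4 genes
counts = np.array([[60, 10, 20, 10],
                   [10, 70, 10, 10],
                   [30, 30, 30, 10]], dtype=float)

# Percentage of the transcripts of each cell coming from each gene
pct = 100 * counts / counts.sum(axis=1, keepdims=True)

# Rank genes by their mean percentage; the maximum shows the most extreme cell
summary = pd.DataFrame(
    {"mean_pct": pct.mean(axis=0), "max_pct": pct.max(axis=0)},
    index=genes,
).sort_values("mean_pct", ascending=False)

print(summary)
```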

We save the fraction of the transcripts of each cell that comes from MALAT1 and select a threshold for this value. We choose 10% (that is, 0.1) as the threshold to cut out the upper tail.

perc_malat = np.sum( adata[:,'MALAT1'].X, 1 ) / np.sum( adata.X, 1 )
adata.obs['perc_MALAT1'] = perc_malat.copy()

fig = px.histogram(adata.obs, x='perc_MALAT1', nbins=100, title='Distribution of the amount of MALAT1 transcripts in each cell')
fig.show()

Note also how cells with a high amount of MALAT1 expression are usually cells of low quality, containing a low number of transcripts (position the mouse on some of the dots to see the values). This means that many of the cells with a high content of MALAT1 will also be filtered out when removing cells with a low number of transcripts. This is compatible with the fact that MALAT1 can indicate dead cells that underwent apoptosis.

px.scatter(data_frame=adata.obs, x='total_counts', y='perc_MALAT1', 
           title='Relationship between amount of MALAT1 gene and transcripts per cell')

We use the following commands to implement some of the thresholds discussed in the plots above

sc.preprocessing.filter_cells(adata, max_genes=8000)

sc.preprocessing.filter_cells(adata, min_genes=800)

sc.preprocessing.filter_cells(adata, max_counts=40000)

adata = adata[adata.obs['perc_mito']<0.2].copy()

adata = adata[adata.obs['perc_MALAT1']<0.1].copy()

It is good practice to also remove genes found in too few cells (for example, in fewer than 10 cells). A cell type made up of so few cells would go undetected in the data anyway, and such tiny clusters would be of little use, since statistical analysis on them would be unreliable.

sc.preprocessing.filter_genes(adata, min_cells=10)

print('There are now', adata.shape[0], 'cells and', adata.shape[1],'genes after filtering')

There are now 49243 cells and 29830 genes after filtering
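
The gene filter can be reproduced by hand on a small made-up matrix: count in how many cells each gene is detected and keep the genes that reach the minimum. This is a sketch of the idea with invented gene names and a minimum of 2 cells, chosen only to fit the toy example.

```python
import numpy as np

gene_names = np.array(["gene_1", "gene_2", "gene_3", "gene_4"])

# Made-up counts: 5 cells x 4 genes
counts = np.array([[1, 0, 0, 3],
                   [2, 0, 0, 0],
                   [0, 0, 1, 4],
                   [5, 0, 0, 1],
                   [1, 0, 0, 2]])

min_cells = 2  # toy value; the tutorial uses 10 on the real data

# Number of cells in which each gene has at least one transcript
n_cells_per_gene = (counts > 0).sum(axis=0)

# Keep only the genes detected in at least min_cells cells
keep_genes = n_cells_per_gene >= min_cells
filtered = counts[:, keep_genes]

print(dict(zip(gene_names, n_cells_per_gene)))
print("genes kept:", gene_names[keep_genes])
```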

Doublets removal

Another important step consists of filtering out multiplets. We will use the package scrublet (Wolock et al, 2019), which simulates doublets from the data and compares the simulations to the real data to find any doublet-like cells in it.
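
The simulation step can be illustrated with a few lines of NumPy: an artificial doublet is obtained by adding together the counts of two randomly chosen cells. This is a deliberately simplified sketch on made-up data, not the package's implementation, and it leaves out the scoring step that compares real cells with the simulated doublets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up count matrix: 200 cells x 50 genes
n_cells, n_genes = 200, 50
counts = rng.poisson(2.0, size=(n_cells, n_genes))

# Pick random pairs of cells (drawn with replacement, so a cell can
# occasionally be paired with itself) and sum their counts
n_doublets = 100
first = rng.integers(0, n_cells, size=n_doublets)
second = rng.integers(0, n_cells, size=n_doublets)
simulated_doublets = counts[first] + counts[second]

# Simulated doublets carry roughly twice as many transcripts as single cells
print("mean transcripts per real cell:", counts.sum(axis=1).mean())
print("mean transcripts per simulated doublet:", simulated_doublets.sum(axis=1).mean())
```

Real cells that look like these simulated profiles are the candidates for removal.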

	0_N	0_L	1_N	1_L	2_N	2_L	3_N	...	13_L	14_N	14_L	15_N	15_L	16_N	16_P	16_L
0	LYPLA1	2.635769	ZNF428	3.806565	PTGDS	7.961807	MT-CO2	...	8.849788	ANKRD7	5.520726	TMSB4X	5.302206	MALAT1	0.000000e+00	3.612107
1	SRSF9	2.288302	HMGA1	3.570310	IGFBP7	6.235969	SBNO1	...	6.715557	CMTM2	5.160422	B2M	5.034176	MYL9	2.031513e-297	6.074902
2	SMS	2.876471	RPS12	2.645890	ACTA2	7.899968	MT-CO1	...	5.880414	TEX40	5.001250	TMSB10	4.341215	CALD1	9.825148e-295	4.660460
3	HMGA1	2.629156	RAC3	4.114265	APOE	6.465453	MT-CYB	...	5.738638	ROPN1L	4.393602	TYROBP	10.120045	TMSB4X	0.000000e+00	3.789072
4	PFN1	2.365119	GNB2L1	2.468919	CALD1	4.854297	MT-ND4	...	5.993700	TSACC	4.597932	CD74	7.758448	ADIRF	1.545145e-272	6.255490

	Crypto_NAMES	Crypto_PVALS	Crypto_PVALS_ADJ	Crypto_LOGFOLDCHANGES	Healthy_PCT	Crypto_PCT	Crypto_FOLDCHANGES	Crypto_LOGPVALS_ADJ	Crypto_LOGPVALS
0	EEF1G	4.557527e-188	1.849293e-184	7.720026	3.664627	97.345133	210.843109	50.0	50.0
1	MRPS24	4.462642e-173	1.358094e-169	6.370932	2.181480	80.763857	82.764008	50.0	50.0
2	GABARAP	5.121595e-194	3.117259e-190	5.045707	14.256373	97.438286	33.030037	50.0	50.0
3	H3F3B	2.158483e-139	2.919468e-136	2.287395	97.323417	99.906847	4.881739	50.0	50.0
4	PSMA6	9.807033e-158	1.705443e-154	3.847374	22.273150	89.287378	14.393784	50.0	50.0
