Tutorial description

This tutorial covers the basic steps of single cell analysis, from preprocessing to producing the final results. At the end of this tutorial you will be able to use Python to:

Filter your data selecting specific criteria
Preprocess your data for advanced analysis
Identify potential cell types
Perform differential gene expression
Visualize the basic differentiation dynamics of your data
Merge datasets and do cross-data analysis

Biological background

We will start by analyzing a dataset derived from several sections of human testicular tissue. The testis is a complex organ composed of multiple cell types: germ cells at different stages of maturation, and several somatic cell types that support testicular structure and spermatogenesis (the development of cells into spermatozoa), namely Sertoli cells, peritubular cells, Leydig cells, and other interstitial cells, as outlined in the figure below. Characterizing the various cell types is important for understanding which genes and processes are relevant at the different stages of maturation of cells into spermatozoa.

Tubule scheme - Figure: section of a tubule of the human testis. The human testis contains long tubules in which cells
start to develop, beginning from the walls of the tubules and moving towards the center. From the center of the tubule,
spermatozoa pass into the epididymis to reach full maturation.

After characterizing the spermatogenic process, we will compare our dataset with testicular samples from men affected by azoospermia (reduced or absent production of spermatozoa). Infertility is a growing problem, especially in the Western world, where approximately 10–15% of couples are infertile. In about half of the infertile couples, the cause involves a male factor (Agarwal et al. 2015; Barratt et al. 2017). One of the most severe forms of male infertility is azoospermia (from Greek azo, without life), in which no spermatozoa can be detected in the ejaculate, making biological fatherhood difficult. Azoospermia is found in approximately 10–15% of infertile men (Jarow et al. 1989; Olesen et al. 2017), and the aetiology is thought to be primarily genetic.

Common to the various azoospermic conditions is the lack or disruption of gene expression patterns. It therefore makes sense to look for genes that are expressed more in the healthy dataset than in the azoospermic one. We can also query gene enrichment databases to get a clearer picture of which processes the genes of interest are involved in.

Tubule Azoospermia - Figure: Examples of testicular histology and the composition of testicular cell types that can be observed among men with various types of azoospermia. Degenerated ghost tubules (#) are tubules where an abnormally large central channel is present, but no cells are developing from the walls of the tubule. Other tubules show a Sertoli-cell-only (SCO) pattern (*) and large clusters of Leydig cells, meaning they only have nurse cells, but no developing germ cells. Tubules with germ cell neoplasia in situ (GCNIS) do not contain any normal germ cells (&). GCNIS cells are the precursor cells of testicular germ cell cancer and are found more frequently among men with azoospermia than among men with good semen quality. From Soraggi et al. 2020.

UMI-based single cell data from microdroplets

The dataset we are using in this tutorial was produced with a microdroplet-based method from 10x Chromium. As covered in today's lecture, a microdroplet single cell sequencing protocol works as follows:

cells and beads are co-encapsulated in gel/oil droplets, aiming for one cell and one bead per occupied droplet

Annotated Data — Figure: Isolation of cells and beads into microdroplets.
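To illustrate why UMIs matter for the counts we analyze later, the sketch below collapses reads that share the same cell barcode, gene and UMI into a single molecule, so that amplification duplicates do not inflate the expression values. The barcodes, UMIs and the function name `count_umis` are invented for illustration and are not part of the course material. Real pipelines also correct sequencing errors in barcodes and UMIs, which is left out here.

```python
from collections import defaultdict


def count_umis(reads):
    """Count unique UMIs per (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        # A set keeps each UMI once, so duplicated reads are collapsed
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Hypothetical aligned reads: (cell barcode, UMI, gene)
toy_reads = [
    ("AAACCTGA", "TTGCA", "ACTA2"),
    ("AAACCTGA", "TTGCA", "ACTA2"),  # same barcode, UMI and gene: a duplicate
    ("AAACCTGA", "GGATC", "ACTA2"),
    ("AAACCTGA", "CCTAG", "VIM"),
    ("TTTGGTCA", "TTGCA", "ACTA2"),  # same UMI, but a different cell
]

umi_counts = count_umis(toy_reads)
for (cell_barcode, gene), n_molecules in sorted(umi_counts.items()):
    print(cell_barcode, gene, n_molecules)
```

Each value in `umi_counts` would correspond to one entry of the cells-by-genes expression matrix.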

Data analysis

Warning

Run the code cells one at a time, and wait until the running cell is done before starting the next one. This notebook tends to cause memory problems when too many cells are executed at once.
If at some point you cannot see any plots in this notebook, create a code cell with the command %matplotlib inline and run it. We use various plotting libraries, which occasionally interfere with some internal settings.

Prepare the packages and data necessary to run this Python notebook. We will use scanpy as the main analysis tool. Scanpy has a comprehensive manual webpage that includes many different tutorials you can use for further practice. Scanpy is used in the discussion paper and the tutorial paper of this course. A well-established alternative for R users is Seurat. However, scanpy is maintained and updated by a wider community and integrates many of the most recently developed tools.

Note: it can take a few minutes to load all the packages. You can ignore the red-coloured warnings.

import warnings
warnings.filterwarnings("ignore")
import scanpy as sc
import anndata as ad
import bbknn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import os
import gc
import plotly.express as px
import re

%matplotlib inline

%run ../Scripts/pythonScripts.py

Load the dataset.

The data can be loaded from many different formats. Each format has a dedicated reading command in scanpy, for example read_h5ad, read_10x_mtx, read_csv, and so on. In our case, the file is already in h5ad format. This format is very convenient for storing large matrices together with their annotations, and is often used to store scRNA-seq expression data after alignment and demultiplexing.

adata = sc.read_h5ad('../Data/scrna_data/rawDataScanpy.h5ad')

The data is opened and an annotated data (AnnData) object is created. This object contains:

The data matrix adata.X of size \(N_{cells} \times N_{genes}\). The cells are called observations (obs) and the genes variables (var).
Vectors of cell-related variables in the dataframe adata.obs
Vectors of gene-related variables in the dataframe adata.var
Matrices of size \(N_{cells} \times N_{genes}\) in adata.layers
Matrices where each row corresponds to a cell in adata.obsm
Matrices where each row corresponds to a gene in adata.varm
Anything else that must be saved is in adata.uns

Figure: Structure of an annotated data object. In green, the stack of expression matrices it can contain, where `X` is the one currently used for analysis.

	batch	super_batch	n_genes_by_counts	log1p_n_genes_by_counts	total_counts	log1p_total_counts	pct_counts_in_top_50_genes	pct_counts_in_top_100_genes	pct_counts_in_top_200_genes	pct_counts_in_top_500_genes	n_counts
index
AAACCTGAGCCGGTAA-1-0	Sohni1_und	SohniUnd	61	4.127134	511.0	6.238325	97.847358	100.000000	100.000000	100.000000	511.0
AAACCTGAGCGATTCT-1-0	Sohni1_und	SohniUnd	2127	7.662938	5938.0	8.689296	34.725497	45.368811	56.113169	69.434153	5938.0
AAACCTGAGCGTTTAC-1-0	Sohni1_und	SohniUnd	3768	8.234565	8952.0	9.099744	16.979446	24.005809	33.031725	48.514298	8952.0
AAACCTGAGGACAGAA-1-0	Sohni1_und	SohniUnd	1588	7.370860	4329.0	8.373322	35.458535	48.972049	60.198660	74.867175	4329.0
AAACCTGAGTCATGCT-1-0	Sohni1_und	SohniUnd	618	6.428105	962.0	6.870053	35.654886	46.049896	56.548857	87.733888	962.0

	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts
index
RP11-34P13.3	113	0.001865	0.001863	99.819923	117.0	4.770685
FAM138A	0	0.000000	0.000000	100.000000	0.0	0.000000
OR4F5	1	0.000016	0.000016	99.998406	1.0	0.693147
RP11-34P13.7	635	0.010805	0.010747	98.988064	678.0	6.520621
RP11-34P13.8	12	0.000191	0.000191	99.980877	12.0	2.564949
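The column names in the two tables above match those written by scanpy's `sc.pp.calculate_qc_metrics`. That this function produced them is an assumption, since the tutorial does not say how they were computed. The sketch below reproduces the meaning of the main columns with plain NumPy on an invented 3 x 5 count matrix. As a sanity check against the first table, the first cell has 61 detected genes and a `log1p_n_genes_by_counts` of 4.127134, which is ln(1 + 61).

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 5 genes (columns), invented values
counts = np.array(
    [[0, 2, 1, 0, 7],
     [5, 0, 0, 3, 2],
     [0, 1, 0, 0, 0]]
)

# Per-cell metrics (one value per row)
n_genes_by_counts = (counts > 0).sum(axis=1)    # genes with at least one count
total_counts = counts.sum(axis=1)               # total counts in the cell
log1p_total_counts = np.log1p(total_counts)     # ln(1 + total_counts)

# Percentage of a cell's counts found in its k most expressed genes
# (the tables use k = 50, 100, 200, 500; k = 2 suits this toy matrix)
k = 2
top_k_counts = np.sort(counts, axis=1)[:, -k:].sum(axis=1)
pct_counts_in_top_k = 100 * top_k_counts / total_counts

# Per-gene metrics (one value per column)
n_cells_by_counts = (counts > 0).sum(axis=0)    # cells expressing the gene
mean_counts = counts.mean(axis=0)               # mean over all cells
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / counts.shape[0])

print("n_genes_by_counts:", n_genes_by_counts)
print("pct_counts_in_top_k:", pct_counts_in_top_k)
print("pct_dropout_by_counts:", pct_dropout_by_counts)
print("log1p of 61 genes:", np.log1p(61))
```

A cell where almost all counts fall in the top 50 genes, like the first one in the table, has very low complexity and is a typical candidate for removal during quality filtering.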

	0_N	0_L	1_N	1_L	2_N	2_L	3_N	...	12_L	13_N	13_L	14_N	14_L	15_N	15_P	15_L
0	ACTA2	8.279956	HMGA1	5.305916	B2M	5.296463	DCN	...	5.242647	B2M	4.939682	CD74	7.522440	SMC3	1.464649e-248	4.686258
1	IGFBP7	5.464073	ZNF428	4.582889	GNG11	6.104707	IGFBP7	...	3.766889	TMSB10	4.535407	HLA-DRA	8.359200	WBSCR22	3.479532e-245	4.242674
2	PTGDS	7.328448	RPS12	3.290441	HLA-B	4.736052	PTGDS	...	2.512227	TMSB4X	4.319129	TMSB4X	4.720968	SYCP3	1.536166e-230	5.716512
3	CALD1	4.448563	RPSA	3.078222	HLA-E	5.278060	VIM	...	2.573015	HLA-B	4.532363	B2M	4.394653	HSP90AA1	4.960966e-264	2.685838
4	TPM2	5.584510	DNAJB6	3.448990	TMSB4X	4.875499	CD63	...	3.612036	IFI27	5.533888	FTL	4.282821	TEX30	2.468216e-217	3.748611

	Healthy_Guo1_Elong.Spt_3	Healthy_Guo1_Elong.Spt_4	Healthy_Guo1_Elong.Spt_7	...	Healthy_Sohni2_und_SpermatogoniaB_0	Healthy_Sohni2_und_SpermatogoniaB_1	Healthy_Sohni2_und_SpermatogoniaB_2	Healthy_Sohni2_und_SpermatogoniaB_3	Healthy_Sohni2_und_SpermatogoniaB_4	Healthy_Sohni2_und_SpermatogoniaB_5	Healthy_Sohni2_und_SpermatogoniaB_6	Healthy_Sohni2_und_SpermatogoniaB_7	Healthy_Sohni2_und_SpermatogoniaB_8	Healthy_Sohni2_und_SpermatogoniaB_9
FAM41C	0.000000	0.000000	0.00000	...	0.000000	0.000000	0.000000	1.058002	0.000000	0.000000	1.069217	0.000000	0.000000	0.000000
SAMD11	0.000000	0.000000	0.00000	...	0.314359	0.654008	0.480988	0.777766	1.059563	0.000000	0.000000	0.398048	0.544388	0.000000
NOC2L	0.000000	0.000000	0.00000	...	6.223267	4.058583	4.633202	5.251925	4.440653	4.436545	2.341810	2.899758	5.324596	3.836604
KLHL17	0.000000	0.000000	0.00000	...	0.455224	0.000000	0.000000	0.162646	0.000000	0.000000	1.419866	0.000000	0.000000	0.000000
ISG15	0.466368	0.641645	1.29108	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

	ISG15	C1orf159	SDF4	UBE2J2	SCNN1D	ACAP3	...	MCM3AP	YBEY	C21orf58	PCNT	DIP2A	PRMT2	AC136616.1	AC007325.4
Healthy_Guo1_Elong.Spt_0	0.000000	2.545403	0.000000	3.691806	0.000000	0.000000	...	0.000000	2.768188	0.000000	0.000000	1.271720	0.000000	0.000000	8.418765
Healthy_Guo1_Elong.Spt_1	0.000000	0.723980	0.000000	5.138862	0.000000	0.862926	...	0.000000	2.008637	0.919861	0.000000	0.000000	1.206255	0.000000	8.023553
Healthy_Guo1_Elong.Spt_2	0.000000	0.857198	0.622431	7.140317	0.000000	0.000000	...	0.000000	2.164976	1.302259	0.000000	0.000000	0.000000	0.857198	8.220791
Healthy_Guo1_Elong.Spt_3	0.466368	2.461339	0.000000	5.575853	0.000000	0.000000	...	0.000000	4.105031	0.562777	0.378161	0.884709	0.935201	0.000000	7.046681
Healthy_Guo1_Elong.Spt_4	0.641645	1.674945	0.000000	6.038050	0.434032	0.862926	...	0.541363	3.484651	1.280955	0.641645	0.000000	0.434032	0.434032	6.337600

	Crypto_NAMES	Crypto_PVALS	Crypto_PVALS_ADJ	Crypto_LOGFOLDCHANGES	Healthy_PCT	Crypto_PCT	Crypto_FOLDCHANGES	Crypto_LOGPVALS_ADJ	Crypto_LOGPVALS
0	EEF1G	7.718793e-218	4.696500e-214	7.676920	3.898379	97.345133	204.636536	50.0	50.0
1	MRPS24	3.667054e-186	1.115610e-182	6.484880	1.741130	80.763857	89.566071	50.0	50.0
2	GABARAP	2.251046e-175	5.478596e-172	5.280738	16.933129	97.438286	38.874126	50.0	50.0
3	H3F3B	5.131351e-158	7.805427e-155	2.113990	96.605344	99.906847	4.328868	50.0	50.0
4	UBE2V1	6.872677e-165	1.393893e-161	4.207666	5.303694	65.160689	18.477098	50.0	50.0
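The columns of this table are related to each other in a simple way, which the sketch below checks on three rows copied from it: `Crypto_FOLDCHANGES` equals 2 raised to `Crypto_LOGFOLDCHANGES`. The constant 50.0 in the log p-value columns suggests that the negative log10 p-values are capped at 50. That cap is an inference from the values shown, not something the tutorial states, and the name `LOG_PVAL_CAP` is hypothetical.

```python
import math

# (gene, log2 fold change, adjusted p-value) copied from the table above
rows = [
    ("EEF1G", 7.676920, 4.696500e-214),
    ("H3F3B", 2.113990, 7.805427e-155),
    ("UBE2V1", 4.207666, 1.393893e-161),
]

# Assumption: cap inferred from the constant 50.0 in the log p-value columns
LOG_PVAL_CAP = 50.0

results = {}
for gene, log2_fc, p_adj in rows:
    fold_change = 2 ** log2_fc                      # undo the log2 transform
    capped_log_p = min(-math.log10(p_adj), LOG_PVAL_CAP)
    results[gene] = (fold_change, capped_log_p)
    print(f"{gene}: fold change {fold_change:.3f}, "
          f"capped -log10(p_adj) {capped_log_p}")
```

Capping very small p-values in this way is a common trick to keep volcano plots readable, since a handful of extreme values would otherwise stretch the axis.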