Filtering a low-quality sample¶
Application of a PCA-based technique for filtering
Motivation:
Quality control and filtering are among the most important steps of single cell data analysis. Allowing low-quality cells into your analysis will compromise or mislead your conclusions by adding hundreds of meaningless data points to your workflow. The main sources of low-quality cells are:
- broken cells that have lost some of their transcripts
- cells isolated together with too much ambient RNA
- a missing cell during isolation (e.g. an empty droplet in microfluidic machines)
- multiple cells isolated together (multiplets, usually just two cells - doublets)
A typical way of filtering cells is to set thresholds on various QC measures. Here, we also apply a PCA-based technique that looks at all available QC measures at the same time.
Learning objectives:
- Understand and discuss QC issues and measures from single cell data
- Explore QC graphs and set filtering tools and thresholds
- Understand and apply PCA-based outlier detection
- Analyze the results of QC filters and evaluate necessity for different filtering
Execution time: 40 minutes
Import the packages
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
sample_1 = sc.read_h5ad('../../../Data/notebooks_data/sample_1.h5ad')
sample_1
AnnData object with n_obs × n_vars = 23794 × 36601
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'gene_ids', 'feature_types', 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'spliced', 'unspliced'
Read the knee plot (for 10X data with Cell Ranger)¶
The Cell Ranger aligner used for 10X data provides a report for the user. This contains a lot of useful statistics, including the so-called knee plot, shown below on the right side of the picture. The knee plot is very useful for getting an overview of the data quality. The x axis represents the barcode of each cell (ordered by decreasing number of UMIs), while the y axis shows how many UMIs are present in each cell. The line is blue where data points have been identified as cells of good quality, and it becomes lighter where the data point is less likely to be a cell.
This knee plot clearly represents a low-quality dataset (compare the three typical scenarios below). Moreover, we have a very low proportion of reads in each cell. Filtering this dataset might not leave many cells at the end of the process.
Representation of knee plot scenarios. From the authors of the cellBender package.
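For illustration, we can draw a simple knee plot ourselves from the count matrix. This is a minimal sketch; note that Cell Ranger has already removed most empty droplets from this matrix, so only the cell-containing part of the full curve is visible.
# Rank barcodes by total UMI counts and plot on log-log axes
umis_per_cell = np.asarray(sample_1.X.sum(axis=1)).ravel()
ranked = np.sort(umis_per_cell)[::-1]
plt.loglog(np.arange(1, len(ranked) + 1), ranked)
plt.xlabel('Barcode rank')
plt.ylabel('Total UMI counts')
plt.title('Knee plot (filtered barcodes only)')
plt.show()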
Percentage of mitochondrial genes¶
We calculate the percentage of mitochondrial transcripts in each cell. A high percentage suggests that material from broken cells was captured during cell isolation and then sequenced. The mitochondrial percentage is not calculated by scanpy by default, because it needs an identifier for mitochondrial genes, and there is no standard one. In our case, we look at genes that contain MT- in their ID, and calculate their transcript proportion in each cell. We save the result as an observation in .obs['perc_mito']
# Densify the count matrix for easier calculations
sample_1.X = np.array(sample_1.X.todense()).copy()
# Boolean mask of genes whose name contains 'MT-'
MT = ['MT-' in i for i in sample_1.var_names]
# Proportion of mitochondrial transcripts per cell
perc_mito = np.sum(sample_1[:, MT].X, 1) / np.sum(sample_1.X, 1)
sample_1.obs['perc_mito'] = perc_mito.copy()
# Compute standard QC metrics (total_counts, n_genes_by_counts, ...)
sc.pp.calculate_qc_metrics(sample_1, inplace=True)
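To get a feel for the QC measures we just computed, we can peek at the first few cells (a quick sketch; these column names are the ones created by calculate_qc_metrics, plus our own perc_mito).
# Show a few of the QC columns now stored in .obs
print(sample_1.obs[['total_counts', 'n_genes_by_counts',
                    'perc_mito', 'pct_counts_in_top_50_genes']].head())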
Visualize and evaluate quality measures¶
We can make some plots to look at quality measures combined together.
Counts vs Genes: this is a typical plot, where you look at the total transcripts per cell (x axis) and the detected genes per cell (y axis). Usually, these two measures grow together. Points with a lot of transcripts and genes might be multiplets (multiple cells sequenced together as one), while very few transcripts and genes denote the presence of only ambient RNA or very low-quality sequencing of a cell. Below, the dots are coloured by the percentage of mitochondrial transcripts. Note how a high proportion often occurs in cells with very few transcripts and genes (bottom left corner of the plot).
plt.rcParams['figure.figsize'] = (8,6)
sc.pl.scatter(sample_1, x='total_counts', y='n_genes_by_counts', color='perc_mito',
              title='Nr of transcripts vs Nr detected genes, coloured by mitochondrial content',
              size=30)
sc.pl.scatter(sample_1[sample_1.obs['total_counts']<10000], x='total_counts', y='n_genes_by_counts', color='perc_mito',
              title='Nr of transcripts vs Nr detected genes, coloured by mitochondrial content (Zoom)',
              size=30)
Transcripts and Genes distribution: here we simply look at the distribution of transcripts per cell and detected genes per cell. Note how the distribution is bimodal: this usually denotes a cluster of low-quality cells alongside viable cells. Sometimes filtering out the data points in the left-most modes of these graphs removes a lot of cells from a dataset, but this is quite normal and nothing to be worried about. The right side of the distributions shows a tail with a few cells having a lot of transcripts and genes. It is also good to filter out some of those extreme values - for technical reasons, it will help in achieving a better normalization of the data later on.
ax = sns.distplot(sample_1.obs['total_counts'], bins=50)
ax.set_title('Cells Transcripts distribution')
Text(0.5, 1.0, 'Cells Transcripts distribution')
ax = sns.distplot(sample_1.obs['n_genes_by_counts'], bins=50)
ax.set_title('Distribution of detected genes per cell')
Text(0.5, 1.0, 'Distribution of detected genes per cell')
In this dataset there are a few cells with a high percentage of mitochondrial content. These are precisely 77 if we set 0.1 (that is, 10%) as a threshold. A value between 10% and 20% is the usual standard when filtering single cell datasets.
# subsetting to see how many cells have a percentage of mitochondrial genes above 10%
sample_1[ sample_1.obs['perc_mito']>0.1, : ].shape
(77, 36601)
ax = sns.distplot(sample_1.obs['perc_mito'], bins=50)
ax.set_title('Distribution of mitochondrial content per cell')
Text(0.5, 1.0, 'Distribution of mitochondrial content per cell')
PCA-based filtering¶
Now we calculate a PCA on a set of quality measures, and detect the outliers in the resulting PCA projection. We remove these outliers from the dataset: they are data points with an anomalous combination of quality measures. This technique is rewritten from the R package scran. The script below does the filtering and plots the PCA with the outliers represented by circles.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.ensemble import IsolationForest
plt.rcParams['figure.figsize'] = (16,6)
# write here which quality measures you want to use
df = sample_1.obs[['n_genes_by_counts', 'total_counts',
                   'perc_mito', 'pct_counts_in_top_50_genes']].copy()
# Standardize each measure, then project onto the first two components
df2 = scale(df, axis=0)
pca = PCA(n_components=2)
Y = pca.fit_transform(df2)
# Detect outliers in QC space with an isolation forest (1 = inlier, -1 = outlier)
clf = IsolationForest(random_state=0)
pred = clf.fit_predict(df2)
# Keep only the inliers
sample_1 = sample_1[pred==1].copy()
pred = pd.Categorical(pred)
pred = pred.rename_categories(['Outlier','Cell'])
df['Category'] = pred
sns.scatterplot(x=Y[:,0], y=Y[:,1], hue=df.total_counts, style=df.Category,
                palette="Blues", sizes=(20, 200), hue_norm=(0, 100))
<AxesSubplot:>
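To verify that the detected outliers indeed have anomalous quality measures, we can compare the median QC values per category. This is a quick check reusing the df from above; the numeric_only and observed arguments may need adjusting for your pandas version.
# Median QC measures for predicted outliers vs retained cells
print(df.groupby('Category', observed=True).median(numeric_only=True))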
We have removed around 3000 cells.
sample_1
AnnData object with n_obs × n_vars = 20984 × 36601
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'perc_mito', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'ambiguous', 'spliced', 'unspliced'
Choosing thresholds¶
We set some thresholds by looking at the previous QC plots. These will remove some remaining low-quality cells.
MIN_COUNTS = 5000 #minimum number of transcripts per cell
MAX_COUNTS = 15000 #maximum number of transcripts per cell
MIN_GENES = 2500 #minimum number of genes per cell
MAX_GENES = 6000 #maximum number of genes per cell
MAX_MITO = .1 #mitochondrial percentage threshold
We can do some subsetting to zoom into the plots we made before.
plt.rcParams['figure.figsize'] = (8,6)
sc.pl.scatter(sample_1[ sample_1.obs['total_counts']<MAX_COUNTS ],
              x='total_counts', y='n_genes_by_counts', color='perc_mito',
              title=f'Nr of transcripts vs Nr detected genes, coloured by mitochondrial content\nsubsetting with threshold MAX_COUNTS={MAX_COUNTS}',
              size=30)
sc.pl.scatter(sample_1[ sample_1.obs['n_genes_by_counts'] > MIN_GENES ],
              x='total_counts', y='n_genes_by_counts', color='perc_mito',
              title=f'Nr of transcripts vs Nr detected genes, coloured by mitochondrial content\nsubsetting with threshold MIN_GENES={MIN_GENES}',
              size=30)
The following commands filter using the chosen thresholds. Again, scanpy does not do the mitochondrial QC filtering, so we do that on our own by subsetting the data.
Note for the last two commands: the parameter min_cells of filter_genes removes all genes detected in only 10 cells or fewer - standard values for this parameter are usually between 3 and 10, and do not come from looking at the QC plots. The last command applies the standard value for the mitochondrial content threshold.
sc.preprocessing.filter_cells(sample_1, max_counts=MAX_COUNTS)
sc.preprocessing.filter_cells(sample_1, min_counts=MIN_COUNTS)
sc.preprocessing.filter_cells(sample_1, min_genes=MIN_GENES)
sc.preprocessing.filter_cells(sample_1, max_genes=MAX_GENES)
sc.preprocessing.filter_genes(sample_1, min_cells=10)
sample_1 = sample_1[sample_1.obs['perc_mito']<MAX_MITO].copy()
We can see that only a small number of cells is left.
sample_1
AnnData object with n_obs × n_vars = 243 × 10203
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'perc_mito', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts', 'n_genes'
    var: 'gene_ids', 'feature_types', 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells'
    layers: 'ambiguous', 'spliced', 'unspliced'
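As a rough sanity check on how aggressive the combined filtering has been, we can compute the fraction of the original 23794 cells that survived (a one-line sketch using the numbers shown above).
# Fraction of the original dataset remaining after all QC filters
print(f'Cells retained: {sample_1.n_obs} / 23794 ({sample_1.n_obs / 23794:.2%})')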
Doublet filtering¶
Another important step consists in filtering out multiplets. In almost all cases these are doublets, because triplets and higher-order multiplets are extremely rare. Read this more technical blog post for more explanations about this.
The external tool scrublet simulates doublets by putting together the transcripts of random pairs of cells from the dataset. Then it assigns a score to each cell in the data, based on its similarity to the simulated doublets. An expected_doublet_rate of 0.06 (6%) is quite a typical value for single cell data, but if you have a better estimate from laboratory work, microscope imaging or a specific protocol/sequencing machine, you can tweak the value. random_state is a number determining how the simulations are done: using a specific random state means that you will always simulate the same doublets whenever you run this code. This allows you to reproduce exactly the same results every time, which is a great thing for reproducibility in your own research.
sc.external.pp.scrublet(sample_1,
expected_doublet_rate=0.06,
random_state=12345)
Automatically set threshold at doublet score = 0.24
Detected doublet rate = 7.0%
Estimated detectable doublet fraction = 17.5%
Overall doublet rate:
    Expected = 6.0%
    Estimated = 40.0%
It seems that the doublet rate is likely to be lower than 6%, meaning that in this regard the data has been produced pretty well. We now plot the doublet scores assigned to each cell by the algorithm. We can see that most cells have a low score (the score is a value between 0 and 1). Datasets with many doublets show a more bimodal distribution, while here we just have a light tail beyond 0.1.
sns.distplot(sample_1.obs['doublet_score'])
<AxesSubplot:xlabel='doublet_score', ylabel='Density'>
We can choose 0.1 as the filtering threshold for the few detected doublets, or alternatively use the automatic selection of doublets by the algorithm. We will choose the latter option and use the automatically detected doublets.
sample_1 = sample_1[np.invert(sample_1.obs['predicted_doublet'])].copy()
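For reference, the manual alternative mentioned above would threshold the doublet score directly. This is a sketch: the 0.1 cutoff is just read off the distribution plot, and the subsetting line would replace the automatic filtering, so it is left commented out.
# Manual alternative: keep cells whose doublet score is below 0.1
manual_keep = sample_1.obs['doublet_score'] < 0.1
print(f'{manual_keep.sum()} of {sample_1.n_obs} remaining cells fall below 0.1')
# sample_1 = sample_1[manual_keep].copy()  # uncomment to apply instead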
Evaluation of filtering¶
A quite basic but easy way to look at the results of our filtering is to normalize the dataset and plot it on some projections. Here we use a standard normalization technique that consists of:
- TPM normalization: the transcripts of each cell are normalized so that their totals amount to the same value. This makes cells more comparable, independently of how many of their transcripts were retained during cell isolation.
- Logarithmization: the logarithm of the normalized transcripts is calculated. This reduces the variability of transcript values and highlights variations due to biological factors.
- Standardization: each gene is standardized across all cells. This is useful, for example, for projecting the data onto a PCA.
# TPM normalization and storage of the matrix
sc.pp.normalize_per_cell(sample_1)
sample_1.layers['umi_tpm'] = sample_1.X.copy()
# Logarithmization and storage
sc.pp.log1p(sample_1)
sample_1.layers['umi_log'] = sample_1.X.copy()
# Select some of the most meaningful genes to calculate the PCA plot later
# This must be done on logarithmized values
sc.pp.highly_variable_genes(sample_1, n_top_genes=15000)
# save the dataset
sample_1.write('../Data/notebooks_data/sample_1.filt.h5ad')
# standardization and matrix storage
sc.pp.scale(sample_1)
sample_1.layers['umi_gauss'] = sample_1.X.copy()
Now we calculate the PCA projection.
sc.preprocessing.pca(sample_1, svd_solver='arpack', random_state=12345)
We can look at the PCA plot and colour it by some quality measure and gene expression. Even though we filtered a lot, there is still some structure left. But with so few cells, the dataset is not worth using, since the other samples have thousands of cells of better quality.
sc.pl.pca(sample_1, color=['total_counts','SYCP1'])
We plot the variance ratio to see how much variability each PCA component explains. Small changes between consecutive components denote that those components are mostly modeling noise in the data. We can choose a cutoff (for example 15 PCA components) to be used in all algorithms that rely on the PCA to calculate any quantity.
sc.plotting.pca_variance_ratio(sample_1)
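To put a number on that choice, we can print the cumulative explained variance of the first 15 components (a quick sketch; scanpy stores the variance ratios in .uns['pca']).
# Cumulative fraction of variance explained by the first 15 PCA components
var_ratio = sample_1.uns['pca']['variance_ratio']
print(np.cumsum(var_ratio)[:15])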
We project the data using the UMAP algorithm. This is very good at preserving the structure of a dataset in low dimension, if any is present. We first calculate the neighbors of each cell (that is, its most similar cells); those are then used for the UMAP. The neighbors are calculated using the PCA matrix instead of the full data matrix, so we can choose the number of PCA components to use (parameter n_pcs). Many algorithms work on the PCA, so you will see this parameter used again in other places.
sc.pp.neighbors(sample_1, n_pcs=15, random_state=12345)
sc.tools.umap(sample_1, random_state=54321)
The UMAP plot gives a pretty well-structured output for this dataset. We will keep working further with this filtering.
sc.plotting.umap(sample_1, color=['total_counts','SYCP1'])
Wrapping up¶
We have looked through a low-quality dataset and applied PCA-based and threshold-based filtering. The filtering resulted in removing almost all available data points.