Normalize and Integrate¶
Motivation:
Biologically similar cells are not necessarily directly comparable within a dataset because of technical biases, among them the varying percentage of captured transcripts (capture efficiency), the presence of technical replicates, and the presence of noisy transcripts. Capture efficiency can be influenced by many factors, for example the different transcript tags, the type of protocol used in the laboratory, and the amount of PCR amplification applied to different transcripts.
To correct for these differences, a normalization step is needed. Normalization is one of the main topics of scRNA-seq data preprocessing, and many advanced techniques take into account the statistical distribution of the counts and the presence of technical/biological features of interest.
The most standard approach is TPM (Transcripts Per Million) normalization followed by logarithmization and standardization. We applied this technique earlier as a double-check on our data filtering.
As a rule of thumb, TPM+log+standardization is no longer considered a very good normalization technique, especially when you are integrating multiple datasets together. Instead, it is recommended to use more advanced methods that include technical and biological covariates as part of a statistical model for the transcripts. One of the current state-of-the-art methods is scTransform. We will apply it in this notebook.
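For reference, a minimal sketch of what the standard TPM+log+standardization recipe roughly looks like in scanpy is shown below. This is only for comparison and is not part of this notebook's pipeline; the input file name is hypothetical.
import scanpy as sc
adata = sc.read('my_sample.h5ad')             # hypothetical input file
sc.pp.normalize_total(adata, target_sum=1e6)  # per-million scaling of the counts
sc.pp.log1p(adata)                            # logarithmize
sc.pp.scale(adata)                            # standardize each gene (zero mean, unit variance)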
Learning objectives:
- Perform advanced normalization of multiple datasets
- Understand and apply tools for multiple datasets integration
- Evaluate the quality of data integration
Execution time: about 40 minutes if using at least 16 virtual CPUs. Expect longer times on slower hardware, and a RAM usage of roughly 40-100 GB.
*Import packages*
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import anndata as ad
plt.rcParams['figure.figsize']=(6,6) #rescale figures
We will need a couple of packages that are developed only in R. To use them we rely on the package rpy2, which allows Python-R interoperability. Below we load the package and apply some settings (you do not need to study them, but they can be a useful reference for your own future coding).
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
import anndata2ri
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)
# Automatically convert rpy2 outputs to pandas dataframes
pandas2ri.activate()
anndata2ri.activate()
#import os
#os.environ['R_HOME'] = '../../../scrna-environment/lib/R/' #path to your R installation
%load_ext rpy2.ipython
%%R
.libPaths( c( "../../../../sandbox_scRNA_testAndFeedback/scrna-environment/lib/R/library/" ) )
Load the two separate datasets
sample_2 = sc.read('../../Data/notebooks_data/sample_2.filt.h5ad')
sample_3 = sc.read('../../Data/notebooks_data/sample_3.filt.h5ad')
Concatenation of datasets¶
Here we concatenate the two datasets and establish a name for each sample. Each cell will be assigned the correct sample name, which is stored in sample.obs['batch']. We also clean all the variables (quantities calculated on genes), because these would be duplicated from each dataset, while the observations (quantities calculated on cells) are concatenated together without duplication.
batch_names = ['SAM_2','SAM_3'] #choose names for samples
sample = ad.AnnData.concatenate(sample_2, sample_3) #concatenate
sample.rename_categories(key='batch', categories=batch_names) #apply sample names
scv.utils.cleanup(sample, clean='var') #remove duplicated gene quantites
We delete the old separate samples to free memory
del sample_2, sample_3
The new sample looks like this: it has the well-known observations from the previous coding sessions and all the stored matrices. We now have more than 6000 cells.
sample
AnnData object with n_obs × n_vars = 6431 × 22825 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'perc_mito', 'n_counts', 'n_genes', 'doublet_score', 'predicted_doublet', 'batch' layers: 'umi_log', 'umi_raw', 'umi_tpm'
For each sample, we previously filtered away genes detected in fewer than 10 cells. Now that we have two samples, it is good practice to filter away genes detected in fewer than 20 cells, so that we do not keep genes that are present in one sample but essentially absent from the other.
sc.preprocessing.filter_genes(sample, min_cells=20)
This should eliminate a few more genes
sample
AnnData object with n_obs × n_vars = 6431 × 22790 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'perc_mito', 'n_counts', 'n_genes', 'doublet_score', 'predicted_doublet', 'batch' var: 'n_cells' layers: 'umi_log', 'umi_raw', 'umi_tpm'
Normalization¶
We will normalize the dataset using a more advanced tool. While standard normalization seemed to work fine previously, the result is still strongly influenced by technical factors, such as
- how efficiently cell transcripts were captured prior to sequencing
- differences in PCR amplification
We will use a tool called sctransform: it fits a statistical model to the transcript counts and performs a regression to remove technical factors related to the amount of transcripts, stabilizing the variance of the data. In this way we obtain a normalized dataset that is better explained by biological processes rather than technical variables.
At the moment, sctransform is implemented only in R. Using R integrated in Python is fairly easy. First you need to define the variables needed in R: here these are the raw count matrix, the gene names, and the sample identifier.
rawMatrix = np.array( sample.layers['umi_raw'].T.copy())
genes_name = sample.var_names
cells_info = sample.obs[ ["batch"] ].copy()
We now use a so-called cell magic command, that is, a command preceded by the double %% symbol. In our case we write %%R to declare that the cell uses the R language. We import the variables into R using -i for each variable. The following four commands are all executed in R, and they live in an underlying R environment that we do not see explicitly.
%%R -i cells_info -i rawMatrix -i genes_name
library(scater)
cell_df <- DataFrame(data = cells_info)
colnames(rawMatrix) <- rownames(cell_df) #cell names
rownames(rawMatrix) <- genes_name #gene names
del rawMatrix, genes_name, cells_info
Now we set up a few things so that the normalization algorithm uses multiple cores. This is done with the R package future. We declare that we want to use multiple cores and choose how many with the option workers (here set to 32; change it according to your machine). We then assign how much memory can be used in total with the option future.globals.maxSize (here up to 50 gigabytes, where the factor 1024^3 converts gigabytes into bytes; adjust this value according to your chosen machine).
%%R
library(sctransform)
library(future)
future::plan(strategy = 'multicore', workers = 32)
options(future.globals.maxSize = 50 * 1024 ^ 3)
Now we run the normalization. There are a few parameters to assign. In particular, you can consider tweaking n_genes: the normalization is based on this number of most variable genes. The more genes you choose, the more time and memory the normalization requires. A typical value is somewhere between 2000 and 5000.
%%R
vst_out=vst( as.matrix(rawMatrix), #data matrix
cell_attr=cell_df, #dataframe containing batch variable
n_genes=3000, #most variable genes in your data
batch_var='data.batch', #name of the batch variable
method='qpoisson', #type of statistical model. use "poisson" for more precision but much slower execution
show_progress=TRUE, #show progress bars
return_corrected_umi=TRUE) #return corrected umi count matrix
The algorithm returns the normalized data, a matrix of UMI counts with values adjusted according to the statistical model, and the most variable genes. We assign them to variables in R and export them into Python using the -o flag for each variable.
%%R -o new_matrix -o sct_genes -o all_genes -o umi_matrix
new_matrix=vst_out$y #normalized matrix
sct_genes = rownames(vst_out$model_pars) #most variable genes
all_genes = rownames(new_matrix) #vector of all genes to check if any have been filtered out
umi_matrix=vst_out$umi_corrected #umi matrix
We save the most variable genes into our dataset
sct_genes = list(sct_genes)
sample.var['highly_variable'] = [i in sct_genes for i in sample.var_names]
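As a quick optional sanity check, we can count how many genes were flagged as highly variable; this should roughly correspond to the n_genes value chosen for sctransform above.
print(sum(sample.var['highly_variable']), 'genes flagged as highly variable')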
We subset the dataset to the genes returned from R, so that the genes in our object actually match the results (some genes may have been filtered out by the model)
sample = sample[:,list(all_genes)].copy()
We store the normalized matrix and the corrected UMI matrix in .layers
sample.layers['norm_sct'] = np.transpose( new_matrix )
sample.layers['umi_sct'] = np.transpose( umi_matrix )
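As another optional check (a small sketch, not required for the analysis), we can verify that the newly stored layers have the same dimensions as the dataset (cells × genes):
print(sample.layers['norm_sct'].shape, sample.layers['umi_sct'].shape, sample.shape)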
Data integration¶
Standard PCA framework¶
Let's now calculate and plot the PCA using the normalized data. We can see that the two samples do not perfectly overlap in the PCA. This is because PCA does not model in any way the presence of two different samples: even though the batch has been taken into account during normalization, that alone is not enough to integrate the datasets together.
sample.X = sample.layers['norm_sct'].copy() #use normalized data in .X
sc.pp.scale(sample) #standardize
sc.preprocessing.pca(sample, svd_solver='arpack', random_state=12345) #do PCA
sc.pl.pca(sample, color=['batch','total_counts','PRM3'])
One can keep the PCA above and compensate for differences between batches when calculating the neighborhood of each cell (the set of cells most similar to it). This is done by taking the presence of multiple samples into account with the package bbknn (in the filtering notebooks we used the command sc.pp.neighbors instead).
import bbknn as bbknn
bbknn.bbknn(sample)
We can visualize the result with a UMAP plot. We can see how the samples now overlap in the plot.
sc.tools.umap(sample, random_state=54321)
sc.plotting.umap(sample, color=['batch','total_counts','PRM3'])
How well did the integration go? We can check whether, for every point in the UMAP plot, the most similar points are a mix of cells from the various samples, or whether they are dominated by only one of them. The R package kBET performs this test for each datapoint. The test is rejected for a datapoint lying in an area where cells from the different samples are not mixed/overlapping. A perfect integration has a rejection rate around zero (left side of the plot). In practice this does not happen, but a good integration generally stays below a 50% rejection rate (right side of the plot).
data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
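To obtain a single summary number from the test (an optional sketch following the -o export convention used above; kbet_observed is a name we introduce here), we can bring the observed rejection rates back into Python and print their mean:
%%R -o kbet_observed
kbet_observed <- batch.estimate$stats$kBET.observed
print('Mean kBET rejection rate:', np.mean(kbet_observed))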
sample.write('../../Data/notebooks_data/sample_123.filt.norm.h5ad')
sample = sc.read('../../Data/notebooks_data/sample_123.filt.norm.h5ad')
WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.h5ad']. Only considering the two last: ['.norm', '.h5ad'].
Generalized PCA model¶
A better way of doing a PCA for multiple samples is to use a model that takes the various batches into account. We can do this with the package glmpca. The execution time can be somewhat longer (especially if you chose many variable genes in the normalization), but the resulting PCA is usually better, thanks to a statistical model that better reflects the transcript distribution.
We first need to encode the batch variable (via one-hot encoding) so that it is in a format compatible with glmpca
import sklearn.preprocessing
import numpy as np
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(sample.obs['batch'])
batch_onehot = label_binarizer.transform(sample.obs['batch'])
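As a quick optional look at the encoding: with only two samples, LabelBinarizer returns a single column of 0/1 values, one row per cell.
print(batch_onehot.shape)   # (number of cells, 1) with two batches
print(batch_onehot[:5])     # encoding of the first five cells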
We run glmpca. If the algorithm fails, you need to change the parameter penalty of the command glmpca.glmpca: always start from 1 and increase it by a factor of ten if errors occur. If you keep getting errors, you will have to fall back on the standard PCA used before; this happens when some transcripts have a very noisy distribution, for example because of poor data filtering in the early steps of the analysis. In our case the algorithm works with penalty=10. Note in the second line how we subset the dataset to use only the highly variable genes. Also, we use the matrix of corrected UMI counts as the basis for the PCA (third line).
ctl = {"maxIter":30, "eps":1e-3, "optimizeTheta":True}
sample_glmpca = sample[:,sample.var['highly_variable']].copy()
Y = sample_glmpca.layers['umi_sct'].T.todense().copy()
Y = np.asarray(Y)
from glmpca import glmpca
print("calculating")
res = glmpca.glmpca(Y, 15, penalty=10, X=batch_onehot, verbose=True, ctl=ctl)
factors = res["factors"]
sample_glmpca.obsm['X_pca']=factors
calculating
Iteration: 0 | deviance=3.8470E+7
Iteration: 1 | deviance=3.7307E+7
Iteration: 2 | deviance=1.7101E+7
Iteration: 3 | deviance=1.3117E+7
Iteration: 4 | deviance=1.2319E+7
Iteration: 5 | deviance=1.1993E+7
Iteration: 6 | deviance=1.1824E+7
Iteration: 7 | deviance=1.1723E+7
Iteration: 8 | deviance=1.1654E+7
Iteration: 9 | deviance=1.1603E+7
Iteration: 10 | deviance=1.1565E+7
Iteration: 11 | deviance=1.1534E+7
Iteration: 12 | deviance=1.1509E+7
Iteration: 13 | deviance=1.1488E+7
Iteration: 14 | deviance=1.1470E+7
Iteration: 15 | deviance=1.1455E+7
Iteration: 16 | deviance=1.1441E+7
Iteration: 17 | deviance=1.1429E+7
Plot the new PCA for the subsetted dataset. The integration looks better than before.
sc.pl.pca(sample_glmpca, color=['batch'])
Assign the new PCA to the full dataset
sample.obsm['X_pca'] = sample_glmpca.obsm['X_pca'].copy()
Recalculate and plot the UMAP
import bbknn as bbknn
bbknn.bbknn(sample)
sc.tools.umap(sample, random_state=54321)
sc.plotting.umap(sample, color=['batch','total_counts','PRM3'])
sample.write('../../Data/notebooks_data/sample_123.filt.norm.red.h5ad')
Evaluate the integration¶
We can double check whether the integration is still acceptable. At least in terms of overlap, the previous version was slightly better, but in general it is good practice to use the glmpca approach when it does not fail during execution
data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
What would happen without the integration step?
We can evaluate the result of an analysis that skips the proper normalization and integration steps in the PCA and neighborhood calculations.
sample.X = sample.layers['umi_raw'].copy()
sc.pp.normalize_total(sample) #library-size normalization of the raw counts
sc.pp.log1p(sample) #logarithmize
sc.pp.scale(sample) #standardize each gene
sc.pp.pca(sample, svd_solver='arpack', random_state=12345)
sc.pp.neighbors(sample, random_state=12345)
sc.tools.umap(sample, random_state=54321, n_components=2)
The UMAP plot speaks pretty clearly
sc.pl.umap(sample, color=['batch'])
The mixing test naturally shows a high rejection rate
non_integrated_data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i non_integrated_data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( non_integrated_data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
Wrapping up¶
This notebook completes the integration of the datasets. We first used the standard PCA, and then a version of PCA that models the transcript distribution and takes batches into account. Finally, we calculated cell neighborhoods with bbknn, which allows us to further remove the technical differences between batches. We used a statistical test to check that the integration was successful, and compared it with the same test performed on a dataset without integration steps.