Normalize and Integrate¶
Motivation:
Biologically similar cells are not necessarily directly comparable within a dataset because of technical biases, among them the varying percentage of captured transcripts (capture efficiency), the presence of technical replicates, and the presence of noisy transcripts. Capture efficiency can be influenced by many factors, for example the different transcript tags, the type of protocol used in the laboratory, and the amount of PCR amplification applied to different transcripts.
To correct for these differences, a normalization step is needed. Normalization is one of the main topics of scRNA-seq data preprocessing, and many advanced techniques take into account the statistical distribution of the counts and the presence of technical/biological features of interest.
The most standard approach is TPM (Transcripts Per Million) normalization followed by logarithmization and standardization. We applied this technique earlier as a double-check on our data filtering.
As a rule of thumb, TPM+log+standardization is no longer considered a very good normalization technique, especially when you are integrating multiple datasets together. Instead, it is recommended to use more advanced methods that include technical and biological covariates as part of a statistical model for the transcripts. One of the current state-of-the-art methods is scTransform. We will apply it in this notebook.
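For reference, a minimal sketch of what the standard TPM+log+standardization recipe roughly looks like in scanpy is shown below. This is only for comparison and is not part of this notebook's pipeline; the input file name is hypothetical.
import scanpy as sc
adata = sc.read('my_sample.h5ad')             # hypothetical input file
sc.pp.normalize_total(adata, target_sum=1e6)  # per-million scaling of the counts
sc.pp.log1p(adata)                            # logarithmize
sc.pp.scale(adata)                            # standardize each gene (zero mean, unit variance)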
Learning objectives:
- Perform advanced normalization of multiple datasets
- Understand and apply tools for multiple datasets integration
- Evaluate the quality of data integration
Execution time: about 40 minutes if using at least 16 virtual CPUs. Expect longer times on slower hardware, and a RAM usage of roughly 40-100 GB.
*Import packages*
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import anndata as ad
plt.rcParams['figure.figsize']=(6,6) #rescale figures
We will need a couple of packages that are developed only in R. To use them we rely on the package rpy2, which allows Python-R interoperability. Below we load the package and apply some settings (you do not need to study them, but they can be a useful reference for your own future coding).
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
import anndata2ri
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)
# Automatically convert rpy2 outputs to pandas dataframes
pandas2ri.activate()
anndata2ri.activate()
#import os
#os.environ['R_HOME'] = '../../../scrna-environment/lib/R/' #path to your R installation
%load_ext rpy2.ipython
%%R
.libPaths( c( "../../../../sandbox_scRNA_testAndFeedback/scrna-environment/lib/R/library/" ) )
Load the two separate datasets
sample_2 = sc.read('../../Data/notebooks_data/sample_2.filt.h5ad')
sample_3 = sc.read('../../Data/notebooks_data/sample_3.filt.h5ad')
Concatenation of datasets¶
Here we concatenate the two datasets and establish a name for each sample. Each cell will be assigned the correct sample name, which is stored in sample.obs['batch']. We also clean all the variables (quantities calculated on genes), because these would be duplicated from each dataset, while the observations (quantities calculated on cells) are concatenated together without duplication.
batch_names = ['SAM_2','SAM_3'] #choose names for samples
sample = ad.AnnData.concatenate(sample_2, sample_3) #concatenate
sample.rename_categories(key='batch', categories=batch_names) #apply sample names
scv.utils.cleanup(sample, clean='var') #remove duplicated gene quantites
We delete the old separate samples to free memory
del sample_2, sample_3
The new sample looks like this: it has the well-known observations from the previous coding sessions and all the stored matrices. We now have more than 6000 cells.
sample
AnnData object with n_obs × n_vars = 6431 × 22825 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'perc_mito', 'n_counts', 'n_genes', 'doublet_score', 'predicted_doublet', 'batch' layers: 'umi_log', 'umi_raw', 'umi_tpm'
For each sample, we previously filtered away genes detected in fewer than 10 cells. Now that we have two samples, it is good practice to filter away genes detected in fewer than 20 cells, so that we do not keep genes that are present in one sample but essentially absent from the other.
sc.preprocessing.filter_genes(sample, min_cells=20)
This should eliminate a few more genes
sample
AnnData object with n_obs × n_vars = 6431 × 22790 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'perc_mito', 'n_counts', 'n_genes', 'doublet_score', 'predicted_doublet', 'batch' var: 'n_cells' layers: 'umi_log', 'umi_raw', 'umi_tpm'
Normalization¶
We will normalize the dataset using a more advanced tool. While standard normalization seemed to work fine previously, the result is still strongly influenced by technical factors, such as
- how efficiently cell transcripts were captured prior to sequencing
- differences in PCR amplification
We will use a tool called sctransform: it fits a statistical model to the transcript counts and performs a regression to remove technical factors related to the amount of transcripts, stabilizing the variance of the data. In this way we obtain a normalized dataset that is better explained by biological processes rather than technical variables.
At the moment, sctransform is implemented only in R. Using R integrated in Python is fairly easy. First you need to define the variables needed in R: here these are the raw count matrix, the gene names, and the sample identifier.
rawMatrix = np.array( sample.layers['umi_raw'].T.copy())
genes_name = sample.var_names
cells_info = sample.obs[ ["batch"] ].copy()
We now use a so-called cell magic command, that is, a command preceded by the double %% symbol. In our case we write %%R to declare that the cell uses the R language. We import the variables into R using -i for each variable. The following four commands are all executed in R, and they live in an underlying R environment that we do not see explicitly.
%%R -i cells_info -i rawMatrix -i genes_name
library(scater)
cell_df <- DataFrame(data = cells_info)
colnames(rawMatrix) <- rownames(cell_df) #cell names
rownames(rawMatrix) <- genes_name #gene names
del rawMatrix, genes_name, cells_info
Now we set up a few things so that the normalization algorithm uses multiple cores. This is done with the R package future. We declare that we want to use multiple cores and choose how many with the option workers (here set to 32; change it according to your machine). We then assign how much memory can be used in total with the option future.globals.maxSize (here up to 50 gigabytes, where the factor 1024^3 converts gigabytes into bytes; adjust this value according to your chosen machine).
%%R
library(sctransform)
library(future)
future::plan(strategy = 'multicore', workers = 32)
options(future.globals.maxSize = 50 * 1024 ^ 3)
Now we run the normalization. There are a few parameters to assign. In particular, you can consider tweaking n_genes: the normalization is based on this number of most variable genes. The more genes you choose, the more time and memory the normalization requires. A typical value is somewhere between 2000 and 5000.
%%R
vst_out=vst( as.matrix(rawMatrix), #data matrix
cell_attr=cell_df, #dataframe containing batch variable
n_genes=3000, #most variable genes in your data
batch_var='data.batch', #name of the batch variable
method='qpoisson', #type of statistical model. use "poisson" for more precision but much slower execution
show_progress=TRUE, #show progress bars
return_corrected_umi=TRUE) #return corrected umi count matrix
The algorithm returns the normalized data, a matrix of UMI counts with values adjusted according to the statistical model, and the most variable genes. We assign them to variables in R and export them into Python using the -o flag for each variable.
%%R -o new_matrix -o sct_genes -o all_genes -o umi_matrix
new_matrix=vst_out$y #normalized matrix
sct_genes = rownames(vst_out$model_pars) #most variable genes
all_genes = rownames(new_matrix) #vector of all genes to check if any have been filtered out
umi_matrix=vst_out$umi_corrected #umi matrix
We save the most variable genes into our dataset
sct_genes = list(sct_genes)
sample.var['highly_variable'] = [i in sct_genes for i in sample.var_names]
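As a quick optional sanity check, we can count how many genes were flagged as highly variable; this should roughly correspond to the n_genes value chosen for sctransform above.
print(sum(sample.var['highly_variable']), 'genes flagged as highly variable')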
We subset the dataset to the genes returned from R, so that the genes in our object actually match the results (some genes may have been filtered out by the model)
sample = sample[:,list(all_genes)].copy()
We store the normalized matrix and the corrected UMI matrix in .layers
sample.layers['norm_sct'] = np.transpose( new_matrix )
sample.layers['umi_sct'] = np.transpose( umi_matrix )
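As another optional check (a small sketch, not required for the analysis), we can verify that the newly stored layers have the same dimensions as the dataset (cells × genes):
print(sample.layers['norm_sct'].shape, sample.layers['umi_sct'].shape, sample.shape)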
Data integration¶
Standard PCA framework¶
Let's now calculate and plot the PCA using the normalized data. We can see that the two samples do not perfectly overlap in the PCA. This is because PCA does not model in any way the presence of two different samples: even though the batch has been taken into account during normalization, that alone is not enough to integrate the datasets together.
sample.X = sample.layers['norm_sct'].copy() #use normalized data in .X
sc.pp.scale(sample) #standardize
sc.preprocessing.pca(sample, svd_solver='arpack', random_state=12345) #do PCA
sc.pl.pca(sample, color=['batch','total_counts','PRM3'])
One can keep the PCA above and compensate for differences between batches when calculating the neighborhood of each cell (the set of cells most similar to it). This is done by taking the presence of multiple samples into account with the package bbknn (in the filtering notebooks we used the command sc.pp.neighbors instead).
import bbknn as bbknn
bbknn.bbknn(sample)
We can visualize the result with a UMAP plot. We can see how the samples now overlap in the plot.
sc.tools.umap(sample, random_state=54321)
sc.plotting.umap(sample, color=['batch','total_counts','PRM3'])
How well did the integration go? We can check whether, for every point in the UMAP plot, the most similar points are a mix of cells from the various samples, or whether they are dominated by only one of them. The R package kBET performs this test for each datapoint. The test is rejected for a datapoint lying in an area where cells from the different samples are not mixed/overlapping. A perfect integration has a rejection rate around zero (left side of the plot). In practice this does not happen, but a good integration generally stays below a 50% rejection rate (right side of the plot).
data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
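To obtain a single summary number from the test (an optional sketch following the -o export convention used above; kbet_observed is a name we introduce here), we can bring the observed rejection rates back into Python and print their mean:
%%R -o kbet_observed
kbet_observed <- batch.estimate$stats$kBET.observed
print('Mean kBET rejection rate:', np.mean(kbet_observed))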
sample.write('../../Data/notebooks_data/sample_123.filt.norm.h5ad')
sample = sc.read('../../Data/notebooks_data/sample_123.filt.norm.h5ad')
WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.h5ad']. Only considering the two last: ['.norm', '.h5ad'].
Generalized PCA model¶
A better way of doing a PCA for multiple samples is to use a model that takes the various batches into account. We can do this with the package glmpca. The execution time can be somewhat longer (especially if you chose many variable genes in the normalization), but the resulting PCA is usually better, thanks to a statistical model that better reflects the transcript distribution.
We first need to encode the batch variable (via one-hot encoding) so that it is in a format compatible with glmpca
import sklearn.preprocessing
import numpy as np
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(sample.obs['batch'])
batch_onehot = label_binarizer.transform(sample.obs['batch'])
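As a quick optional look at the encoding: with only two samples, LabelBinarizer returns a single column of 0/1 values, one row per cell.
print(batch_onehot.shape)   # (number of cells, 1) with two batches
print(batch_onehot[:5])     # encoding of the first five cells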
We run glmpca. If the algorithm fails, you need to change the parameter penalty of the command glmpca.glmpca: always start from 1 and increase it by a factor of ten if errors occur. If you keep getting errors, you will have to fall back on the standard PCA used before; this happens when some transcripts have a very noisy distribution, for example because of poor data filtering in the early steps of the analysis. In our case the algorithm works with penalty=10. Note in the second line how we subset the dataset to use only the highly variable genes. Also, we use the matrix of corrected UMI counts as the basis for the PCA (third line).
ctl = {"maxIter":30, "eps":1e-3, "optimizeTheta":True}
sample_glmpca = sample[:,sample.var['highly_variable']].copy()
Y = sample_glmpca.layers['umi_sct'].T.todense().copy()
Y = np.asarray(Y)
from glmpca import glmpca
print("calculating")
res = glmpca.glmpca(Y, 15, penalty=10, X=batch_onehot, verbose=True, ctl=ctl)
factors = res["factors"]
sample_glmpca.obsm['X_pca']=factors
calculating
Iteration: 0 | deviance=3.8470E+7
Iteration: 1 | deviance=3.7307E+7
Iteration: 2 | deviance=1.7101E+7
Iteration: 3 | deviance=1.3117E+7
Iteration: 4 | deviance=1.2319E+7
Iteration: 5 | deviance=1.1993E+7
Iteration: 6 | deviance=1.1824E+7
Iteration: 7 | deviance=1.1723E+7
Iteration: 8 | deviance=1.1654E+7
Iteration: 9 | deviance=1.1603E+7
Iteration: 10 | deviance=1.1565E+7
Iteration: 11 | deviance=1.1534E+7
Iteration: 12 | deviance=1.1509E+7
Iteration: 13 | deviance=1.1488E+7
Iteration: 14 | deviance=1.1470E+7
Iteration: 15 | deviance=1.1455E+7
Iteration: 16 | deviance=1.1441E+7
Iteration: 17 | deviance=1.1429E+7
Plot the new PCA for the subsetted dataset. The integration looks better than before.
sc.pl.pca(sample_glmpca, color=['batch'])
Assign the new PCA to the full dataset
sample.obsm['X_pca'] = sample_glmpca.obsm['X_pca'].copy()
Recalculate and plot the UMAP
import bbknn as bbknn
bbknn.bbknn(sample)
sc.tools.umap(sample, random_state=54321)
sc.plotting.umap(sample, color=['batch','total_counts','PRM3'])
sample.write('../../Data/notebooks_data/sample_123.filt.norm.red.h5ad')
Evaluate the integration¶
We can double check whether the integration is still acceptable. At least in terms of overlap, the previous version was slightly better, but in general it is good practice to use the glmpca approach when it does not fail during execution
data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
What would happen without the integration step?
We can evaluate the result of an analysis that skips the proper normalization and integration steps in the PCA and neighborhood calculations.
sample.X = sample.layers['umi_raw'].copy()
sc.pp.normalize_total(sample) #library-size normalization of the raw counts
sc.pp.log1p(sample) #logarithmize
sc.pp.scale(sample) #standardize each gene
sc.pp.pca(sample, svd_solver='arpack', random_state=12345)
sc.pp.neighbors(sample, random_state=12345)
sc.tools.umap(sample, random_state=54321, n_components=2)
The UMAP plot speaks pretty clearly
sc.pl.umap(sample, color=['batch'])
The mixing test naturally shows a high rejection rate
non_integrated_data = np.array( sample.obsm['X_umap'] )
batch = np.array( sample.obs['batch'] )
%%R -i batch -i non_integrated_data
library(kBET)
library(ggplot2)
batch.estimate <- kBET( non_integrated_data, batch, plot=TRUE, k0=10 )
plot.data <- data.frame(class=rep(c('observed', 'expected'),
each=length(batch.estimate$stats$kBET.observed)),
data = c(batch.estimate$stats$kBET.observed,
batch.estimate$stats$kBET.expected))
Wrapping up¶
This notebook completes the integration of the datasets. We first used the standard PCA, and then a version of PCA that models the transcript distribution and takes batches into account. Finally, we calculated cell neighborhoods with bbknn, which allows us to further remove the technical differences between batches. We used a statistical test to check that the integration was successful, and compared it with the same test performed on a dataset without integration steps.