Read the data¶

Learning objectives:¶

Get an overview of the scanpy package and the python language syntax
Learn and explore the data structure containing a single cell dataset
Understand and apply basic interactions with the transcript matrix and the components of a dataset

Execution time: 30-60 minutes

Import the packages¶

We will use scanpy as the main analysis tool for the analysis, where we will also apply some other packages. Scanpy has a comprehensive manual webpage that includes many different tutorial you can use for further practicing. Packages are imported with the command import, and their name is shortened with the command as, so that we can write shorter names in our code

An alternative and well-established tool for R users is Seurat. This is used in the R version of this course.

In [1]:

  Copied!     
 
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scanpy as sc import pandas as pd import scvelo as scv import numpy as np import seaborn as sns import matplotlib.pyplot as plt import sklearn

Commands from scanpy are under different categories: preprocessing (pp), tools (tl), plotting (pl). Each category contains some functions to work on single cell data. Scanpy has also a category called external, where a few external packages have been integrated to work with scanpy. Use the help() command to see what a command does in python

In [2]:

  Copied!     
 
help(sc.preprocessing.calculate_qc_metrics)
help(sc.preprocessing.calculate_qc_metrics)

Help on function calculate_qc_metrics in module scanpy.preprocessing._qc:

calculate_qc_metrics(adata: anndata._core.anndata.AnnData, *, expr_type: str = 'counts', var_type: str = 'genes', qc_vars: Collection[str] = (), percent_top: Union[Collection[int], NoneType] = (50, 100, 200, 500), layer: Union[str, NoneType] = None, use_raw: bool = False, inplace: bool = False, log1p: bool = True, parallel: Union[bool, NoneType] = None) -> Union[Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], NoneType]
    Calculate quality control metrics.
    
    Calculates a number of qc metrics for an AnnData object, see section
    `Returns` for specifics. Largely based on `calculateQCMetrics` from scater
    [McCarthy17]_. Currently is most efficient on a sparse CSR or dense matrix.
    
    Note that this method can take a while to compile on the first call. That
    result is then cached to disk to be used later.
    
    Parameters
    ----------
    adata
        Annotated data matrix.
    expr_type
        Name of kind of values in X.
    var_type
        The kind of thing the variables are.
    qc_vars
        Keys for boolean columns of `.var` which identify variables you could
        want to control for (e.g. "ERCC" or "mito").
    percent_top
        Which proportions of top genes to cover. If empty or `None` don't
        calculate. Values are considered 1-indexed, `percent_top=[50]` finds
        cumulative proportion to the 50th most expressed gene.
    layer
        If provided, use `adata.layers[layer]` for expression values instead
        of `adata.X`.
    use_raw
        If True, use `adata.raw.X` for expression values instead of `adata.X`.
    inplace
        Whether to place calculated metrics in `adata`'s `.obs` and `.var`.
    log1p
        Set to `False` to skip computing `log1p` transformed annotations.
    
    Returns
    -------
    Depending on `inplace` returns calculated metrics
    (as :class:`~pandas.DataFrame`) or updates `adata`'s `obs` and `var`.
    
    Observation level metrics include:
    
    `total_{var_type}_by_{expr_type}`
        E.g. "total_genes_by_counts". Number of genes with positive counts in a cell.
    `total_{expr_type}`
        E.g. "total_counts". Total number of counts for a cell.
    `pct_{expr_type}_in_top_{n}_{var_type}` – for `n` in `percent_top`
        E.g. "pct_counts_in_top_50_genes". Cumulative percentage of counts
        for 50 most expressed genes in a cell.
    `total_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars`
        E.g. "total_counts_mito". Total number of counts for variabes in
        `qc_vars`.
    `pct_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars`
        E.g. "pct_counts_mito". Proportion of total counts for a cell which
        are mitochondrial.
    
    Variable level metrics include:
    
    `total_{expr_type}`
        E.g. "total_counts". Sum of counts for a gene.
    `n_genes_by_{expr_type}`
        E.g. "n_genes_by_counts". The number of genes with at least 1 count in a cell. Calculated for all cells.
    `mean_{expr_type}`
        E.g. "mean_counts". Mean expression over all cells.
    `n_cells_by_{expr_type}`
        E.g. "n_cells_by_counts". Number of cells this expression is
        measured in.
    `pct_dropout_by_{expr_type}`
        E.g. "pct_dropout_by_counts". Percentage of cells this feature does
        not appear in.
    
    Example
    -------
    Calculate qc metrics for visualization.
    
    .. plot::
        :context: close-figs
    
        import scanpy as sc
        import seaborn as sns
    
        pbmc = sc.datasets.pbmc3k()
        pbmc.var["mito"] = pbmc.var_names.str.startswith("MT-")
        sc.pp.calculate_qc_metrics(pbmc, qc_vars=["mito"], inplace=True)
        sns.jointplot(
            data=pbmc.obs,
            x="log1p_total_counts",
            y="log1p_n_genes_by_counts",
            kind="hex",
        )
    
    .. plot::
        :context: close-figs
    
        sns.histplot(pbmc.obs["pct_counts_mito"])

Loading and understanding the dataset structure¶

Data can be loaded from many different possible formats. Each format has a dedicated reading command, for example read_h5ad, read_10X_mtx, read_txt. We are going to use read_10X_mtx to load the output of the 10X software that produces the aligned data.

Note the option cache=True. If you are going to read again the same data, it will be loaded extremely fast, because it has been stored in a convenient format for large datasets (h5ad format)

In [3]:

  Copied!     
 
sample_2 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample2/outs/filtered_feature_bc_matrix/', cache=True)
sample_2 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample2/outs/filtered_feature_bc_matrix/', cache=True)

In [4]:

  Copied!     
 
sample_3 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample3/outs/filtered_feature_bc_matrix/', cache=True)
sample_3 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample3/outs/filtered_feature_bc_matrix/', cache=True)

The datasets sample_2 and sample_3 are now created. They are so-called Annotated datasets. Each annotated dataset contains:

The data matrix X of size $N\_cells \times N\_genes$
Vectors of cells-related quantities in the table .obs(for example, how many transcripts there are in each cell)
Vectors of genes-related quantities in the table .var (for example, in how many cells the each gene is detected)
Matrices of size $N\_cells \times N\_genes$ in .layers (for example, normalized data matrix, imputed data matrix, ....)

We will often call the cells for observations (obs) and the genes for variables (var) when it is practical in relation to the annotated dataset

During the analysis we will encounter other components of the annotated datasets. They will be explained when it is necessary, so you might want to skip this explanation if you want.

Matrices where each line is cell-related in .obsm (for example, the PCA coordinates of each cell)
Matrices where each line is gene-related in .varm (for example, mean of the gene in each cell type)
Anything else useful is in .uns

Above: a representation of the data matrix, variable and observations in an annotated dataset.

Each component of the annotated dataset is called by using a dot. For example, we can see the data matrix by

In [5]:

  Copied!     
 
sample_2.X
sample_2.X

Out[5]:

<8583x36601 sparse matrix of type '<class 'numpy.float32'>'
	with 22578947 stored elements in Compressed Sparse Row format>

The matrix is in compressed format. We can reassign it as a dense matrix, so that we can see what it contains.

In [6]:

  Copied!     
 
sample_2.X = np.array( sample_2.X.todense() )
sample_2.X = np.array( sample_2.X.todense() )

In [7]:

  Copied!     
 
sample_2.X
sample_2.X

Out[7]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [8]:

  Copied!     
 
sample_3.X = np.array( sample_3.X.todense() )
sample_3.X = np.array( sample_3.X.todense() )

In [9]:

  Copied!     
 
sample_3.X
sample_3.X

Out[9]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

When the matrix is no longer compressed, we can calculate some statistics for both cells and genes with the following scanpy command. Note that all scanpy commands follow a similar format. The two commands used below are the same, but in the second we used the short form for the preprocessing category.

In [10]:

  Copied!     
 
sc.preprocessing.calculate_qc_metrics(sample_2, inplace=True)
sc.pp.calculate_qc_metrics(sample_3, inplace=True)
sc.preprocessing.calculate_qc_metrics(sample_2, inplace=True) sc.pp.calculate_qc_metrics(sample_3, inplace=True)

We can see that obs and var now contains a lot of different values whose names, that are mostly self-explicative. For example

n_genes_by_counts is the number of detected genes in each cell
total_counts is the number of transcripts in each cell
mean_counts is the average of counts of each gene across all cells

In [11]:

  Copied!     
 
sample_2
sample_2

Out[11]:

AnnData object with n_obs × n_vars = 8583 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

You can access directly all observations/variables or some of them specifically. Each observation line is named with the cell barcode, while variables have gene names in each line

In [12]:

  Copied!     
 
sample_2.obs
sample_2.obs

Out[12]:

	n_genes_by_counts	log1p_n_genes_by_counts	total_counts	log1p_total_counts	pct_counts_in_top_50_genes	pct_counts_in_top_100_genes	pct_counts_in_top_200_genes	pct_counts_in_top_500_genes
AAACCTGAGATCCTGT-1	473	6.161207	694.0	6.543912	32.853026	46.253602	60.662824	100.000000
AAACCTGAGCCTATGT-1	2016	7.609367	6868.0	8.834774	47.408270	53.567268	61.560862	74.635993
AAACCTGAGCTTTGGT-1	3670	8.208219	9188.0	9.125762	17.599042	25.555072	35.154549	51.273400
AAACCTGAGGATGGTC-1	2175	7.685244	4021.0	8.299535	13.603581	21.139020	32.131311	51.927381
AAACCTGAGTACGACG-1	1622	7.392032	3237.0	8.082711	20.883534	30.552981	42.570281	63.608279
...	...	...	...	...	...	...	...	...
TTTGTCATCACGACTA-1	3234	8.081784	6918.0	8.842027	13.356461	20.381613	29.719572	46.863255
TTTGTCATCAGTTTGG-1	1312	7.180070	2895.0	7.971086	32.918826	44.006908	55.613126	71.951641
TTTGTCATCCAAACTG-1	1009	6.917706	1461.0	7.287560	20.123203	29.295003	42.984257	65.160849
TTTGTCATCCGTCAAA-1	1399	7.244228	2712.0	7.905810	28.945428	40.007375	50.884956	66.851032
TTTGTCATCGTTACAG-1	3294	8.100161	7757.0	8.956480	25.125693	33.518113	42.671136	56.323321

8583 rows × 8 columns

In [13]:

  Copied!     
 
sample_2.obs[ ['total_counts','n_genes_by_counts'] ]
sample_2.obs[ ['total_counts','n_genes_by_counts'] ]

Out[13]:

	total_counts	n_genes_by_counts
AAACCTGAGATCCTGT-1	694.0	473
AAACCTGAGCCTATGT-1	6868.0	2016
AAACCTGAGCTTTGGT-1	9188.0	3670
AAACCTGAGGATGGTC-1	4021.0	2175
AAACCTGAGTACGACG-1	3237.0	1622
...	...	...
TTTGTCATCACGACTA-1	6918.0	3234
TTTGTCATCAGTTTGG-1	2895.0	1312
TTTGTCATCCAAACTG-1	1461.0	1009
TTTGTCATCCGTCAAA-1	2712.0	1399
TTTGTCATCGTTACAG-1	7757.0	3294

8583 rows × 2 columns

In [14]:

  Copied!     
 
sample_2.var
sample_2.var

Out[14]:

	gene_ids	feature_types	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts
MIR1302-2HG	ENSG00000243485	Gene Expression	13	0.001515	0.001513	99.848538	13.0	2.639057
FAM138A	ENSG00000237613	Gene Expression	0	0.000000	0.000000	100.000000	0.0	0.000000
OR4F5	ENSG00000186092	Gene Expression	0	0.000000	0.000000	100.000000	0.0	0.000000
AL627309.1	ENSG00000238009	Gene Expression	40	0.004777	0.004766	99.533962	41.0	3.737670
AL627309.3	ENSG00000239945	Gene Expression	0	0.000000	0.000000	100.000000	0.0	0.000000
...	...	...	...	...	...	...	...	...
AC141272.1	ENSG00000277836	Gene Expression	0	0.000000	0.000000	100.000000	0.0	0.000000
AC023491.2	ENSG00000278633	Gene Expression	587	0.249680	0.222887	93.160899	2143.0	7.670429
AC007325.1	ENSG00000276017	Gene Expression	212	0.040196	0.039409	97.530001	345.0	5.846439
AC007325.4	ENSG00000278817	Gene Expression	1949	0.436910	0.362495	77.292322	3750.0	8.229777
AC007325.2	ENSG00000277196	Gene Expression	2	0.000233	0.000233	99.976698	2.0	1.098612

36601 rows × 8 columns

We store the matrix X to save the raw values. We will be able to see it in layers, independently of how we transform the matrix X

In [15]:

  Copied!     
 
sample_2.layers[ 'umi_raw' ] = sample_2.X.copy()
sample_2.layers[ 'umi_raw' ] = sample_2.X.copy()

In [16]:

  Copied!     
 
sample_3.layers[ 'umi_raw' ] = sample_3.X.copy()
sample_3.layers[ 'umi_raw' ] = sample_3.X.copy()

We can see that the matrix is stored in .layers['umi_raw'], and we can reassign it to .X or use it if needed in some future analysis

In [17]:

  Copied!     
 
sample_2
sample_2

Out[17]:

AnnData object with n_obs × n_vars = 8583 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

In [18]:

  Copied!     
 
sample_2.layers['umi_raw']
sample_2.layers['umi_raw']

Out[18]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

You can always subset a dataset by using a selection of cells and genes, and assign it as a new dataset (or to itself if you want to filter out some cells or genes)

An annotated dataset can be subsetted by cells, for example using a quality measure as the number of transcripts per cell

In [19]:

  Copied!     
 
sample_2_qc = sample_2[ sample_2.obs['total_counts']<10000, : ].copy()
sample_2_qc = sample_2[ sample_2.obs['total_counts']<10000, : ].copy()

In [20]:

  Copied!     
 
sample_2_qc
sample_2_qc

Out[20]:

AnnData object with n_obs × n_vars = 5452 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

In a similar way, you can use values calculated on the genes to subset the data by genes, for example in how many cells each gene is detected

In [21]:

  Copied!     
 
sample_2_qc = sample_2[ :, sample_2.var['n_cells_by_counts']>3 ].copy()
sample_2_qc = sample_2[ :, sample_2.var['n_cells_by_counts']>3 ].copy()

In [22]:

  Copied!     
 
sample_2_qc
sample_2_qc

Out[22]:

AnnData object with n_obs × n_vars = 8583 × 28220
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

Note how sample_2_qc has first a reduced number of cells and then a reduced number of genes.

Remember that you cannot subset at the same time by cells and genes, for example

sample_2[ sample_2.obs['total_counts']<10000, sample_2.var['mean_counts']>1 ]

but those two steps have to be done separately as shown before.

The annotated datasets can be easily saved by using write. The format to be used in the file name is h5ad.

In [23]:

  Copied!     
 
!mkdir -p ../../Data/notebooks_data
!mkdir -p ../../Data/notebooks_data

In [24]:

  Copied!     
 
sample_2.write('../../Data/notebooks_data/sample_2.h5ad')
sample_2.write('../../Data/notebooks_data/sample_2.h5ad')

... storing 'feature_types' as categorical

In [25]:

  Copied!     
 
sample_3.write('../../Data/notebooks_data/sample_3.h5ad')
sample_3.write('../../Data/notebooks_data/sample_3.h5ad')

... storing 'feature_types' as categorical