Read the data¶
Learning objectives:¶
- Get an overview of the
scanpypackage and thepythonlanguage syntax - Learn and explore the data structure containing a single cell dataset
- Understand and apply basic interactions with the transcript matrix and the components of a dataset
Execution time: 30-60 minutes
Import the packages¶
We will use scanpy as the main analysis tool for the analysis, where we will also apply some other packages. Scanpy has a comprehensive manual webpage that includes many different tutorial you can use for further practicing. Packages are imported with the command import, and their name is shortened with the command as, so that we can write shorter names in our code
An alternative and well-established tool for R users is Seurat. This is used in the R version of this course.
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
Commands from scanpy are under different categories: preprocessing (pp), tools (tl), plotting (pl). Each category contains some functions to work on single cell data. Scanpy has also a category called external, where a few external packages have been integrated to work with scanpy. Use the help() command to see what a command does in python
help(sc.preprocessing.calculate_qc_metrics)
Help on function calculate_qc_metrics in module scanpy.preprocessing._qc:
calculate_qc_metrics(adata: anndata._core.anndata.AnnData, *, expr_type: str = 'counts', var_type: str = 'genes', qc_vars: Collection[str] = (), percent_top: Union[Collection[int], NoneType] = (50, 100, 200, 500), layer: Union[str, NoneType] = None, use_raw: bool = False, inplace: bool = False, log1p: bool = True, parallel: Union[bool, NoneType] = None) -> Union[Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], NoneType]
Calculate quality control metrics.
Calculates a number of qc metrics for an AnnData object, see section
`Returns` for specifics. Largely based on `calculateQCMetrics` from scater
[McCarthy17]_. Currently is most efficient on a sparse CSR or dense matrix.
Note that this method can take a while to compile on the first call. That
result is then cached to disk to be used later.
Parameters
----------
adata
Annotated data matrix.
expr_type
Name of kind of values in X.
var_type
The kind of thing the variables are.
qc_vars
Keys for boolean columns of `.var` which identify variables you could
want to control for (e.g. "ERCC" or "mito").
percent_top
Which proportions of top genes to cover. If empty or `None` don't
calculate. Values are considered 1-indexed, `percent_top=[50]` finds
cumulative proportion to the 50th most expressed gene.
layer
If provided, use `adata.layers[layer]` for expression values instead
of `adata.X`.
use_raw
If True, use `adata.raw.X` for expression values instead of `adata.X`.
inplace
Whether to place calculated metrics in `adata`'s `.obs` and `.var`.
log1p
Set to `False` to skip computing `log1p` transformed annotations.
Returns
-------
Depending on `inplace` returns calculated metrics
(as :class:`~pandas.DataFrame`) or updates `adata`'s `obs` and `var`.
Observation level metrics include:
`total_{var_type}_by_{expr_type}`
E.g. "total_genes_by_counts". Number of genes with positive counts in a cell.
`total_{expr_type}`
E.g. "total_counts". Total number of counts for a cell.
`pct_{expr_type}_in_top_{n}_{var_type}` – for `n` in `percent_top`
E.g. "pct_counts_in_top_50_genes". Cumulative percentage of counts
for 50 most expressed genes in a cell.
`total_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars`
E.g. "total_counts_mito". Total number of counts for variabes in
`qc_vars`.
`pct_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars`
E.g. "pct_counts_mito". Proportion of total counts for a cell which
are mitochondrial.
Variable level metrics include:
`total_{expr_type}`
E.g. "total_counts". Sum of counts for a gene.
`n_genes_by_{expr_type}`
E.g. "n_genes_by_counts". The number of genes with at least 1 count in a cell. Calculated for all cells.
`mean_{expr_type}`
E.g. "mean_counts". Mean expression over all cells.
`n_cells_by_{expr_type}`
E.g. "n_cells_by_counts". Number of cells this expression is
measured in.
`pct_dropout_by_{expr_type}`
E.g. "pct_dropout_by_counts". Percentage of cells this feature does
not appear in.
Example
-------
Calculate qc metrics for visualization.
.. plot::
:context: close-figs
import scanpy as sc
import seaborn as sns
pbmc = sc.datasets.pbmc3k()
pbmc.var["mito"] = pbmc.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(pbmc, qc_vars=["mito"], inplace=True)
sns.jointplot(
data=pbmc.obs,
x="log1p_total_counts",
y="log1p_n_genes_by_counts",
kind="hex",
)
.. plot::
:context: close-figs
sns.histplot(pbmc.obs["pct_counts_mito"])
Loading and understanding the dataset structure¶
Data can be loaded from many different possible formats. Each format has a dedicated reading command, for example read_h5ad, read_10X_mtx, read_txt. We are going to use read_10X_mtx to load the output of the 10X software that produces the aligned data.
Note the option cache=True. If you are going to read again the same data, it will be loaded extremely fast, because it has been stored in a convenient format for large datasets (h5ad format)
sample_2 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample2/outs/filtered_feature_bc_matrix/', cache=True)
sample_3 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample3/outs/filtered_feature_bc_matrix/', cache=True)
The datasets sample_2 and sample_3 are now created. They are so-called Annotated datasets. Each annotated dataset contains:
- The data matrix
Xof size $N\_cells \times N\_genes$ - Vectors of cells-related quantities in the table
.obs(for example, how many transcripts there are in each cell) - Vectors of genes-related quantities in the table
.var(for example, in how many cells the each gene is detected) - Matrices of size $N\_cells \times N\_genes$ in
.layers(for example, normalized data matrix, imputed data matrix, ....)
We will often call the cells for observations (obs) and the genes for variables (var) when it is practical in relation to the annotated dataset
During the analysis we will encounter other components of the annotated datasets. They will be explained when it is necessary, so you might want to skip this explanation if you want.
- Matrices where each line is cell-related in
.obsm(for example, the PCA coordinates of each cell) - Matrices where each line is gene-related in
.varm(for example, mean of the gene in each cell type) - Anything else useful is in
.uns
Above: a representation of the data matrix, variable and observations in an annotated dataset.
Each component of the annotated dataset is called by using a dot. For example, we can see the data matrix by
sample_2.X
<8583x36601 sparse matrix of type '<class 'numpy.float32'>' with 22578947 stored elements in Compressed Sparse Row format>
The matrix is in compressed format. We can reassign it as a dense matrix, so that we can see what it contains.
sample_2.X = np.array( sample_2.X.todense() )
sample_2.X
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32) sample_3.X = np.array( sample_3.X.todense() )
sample_3.X
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32) When the matrix is no longer compressed, we can calculate some statistics for both cells and genes with the following scanpy command. Note that all scanpy commands follow a similar format. The two commands used below are the same, but in the second we used the short form for the preprocessing category.
sc.preprocessing.calculate_qc_metrics(sample_2, inplace=True)
sc.pp.calculate_qc_metrics(sample_3, inplace=True)
We can see that obs and var now contains a lot of different values whose names, that are mostly self-explicative. For example
n_genes_by_countsis the number of detected genes in each celltotal_countsis the number of transcripts in each cellmean_countsis the average of counts of each gene across all cells
sample_2
AnnData object with n_obs × n_vars = 8583 × 36601
obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts' You can access directly all observations/variables or some of them specifically. Each observation line is named with the cell barcode, while variables have gene names in each line
sample_2.obs
| n_genes_by_counts | log1p_n_genes_by_counts | total_counts | log1p_total_counts | pct_counts_in_top_50_genes | pct_counts_in_top_100_genes | pct_counts_in_top_200_genes | pct_counts_in_top_500_genes | |
|---|---|---|---|---|---|---|---|---|
| AAACCTGAGATCCTGT-1 | 473 | 6.161207 | 694.0 | 6.543912 | 32.853026 | 46.253602 | 60.662824 | 100.000000 |
| AAACCTGAGCCTATGT-1 | 2016 | 7.609367 | 6868.0 | 8.834774 | 47.408270 | 53.567268 | 61.560862 | 74.635993 |
| AAACCTGAGCTTTGGT-1 | 3670 | 8.208219 | 9188.0 | 9.125762 | 17.599042 | 25.555072 | 35.154549 | 51.273400 |
| AAACCTGAGGATGGTC-1 | 2175 | 7.685244 | 4021.0 | 8.299535 | 13.603581 | 21.139020 | 32.131311 | 51.927381 |
| AAACCTGAGTACGACG-1 | 1622 | 7.392032 | 3237.0 | 8.082711 | 20.883534 | 30.552981 | 42.570281 | 63.608279 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGTCATCACGACTA-1 | 3234 | 8.081784 | 6918.0 | 8.842027 | 13.356461 | 20.381613 | 29.719572 | 46.863255 |
| TTTGTCATCAGTTTGG-1 | 1312 | 7.180070 | 2895.0 | 7.971086 | 32.918826 | 44.006908 | 55.613126 | 71.951641 |
| TTTGTCATCCAAACTG-1 | 1009 | 6.917706 | 1461.0 | 7.287560 | 20.123203 | 29.295003 | 42.984257 | 65.160849 |
| TTTGTCATCCGTCAAA-1 | 1399 | 7.244228 | 2712.0 | 7.905810 | 28.945428 | 40.007375 | 50.884956 | 66.851032 |
| TTTGTCATCGTTACAG-1 | 3294 | 8.100161 | 7757.0 | 8.956480 | 25.125693 | 33.518113 | 42.671136 | 56.323321 |
8583 rows × 8 columns
sample_2.obs[ ['total_counts','n_genes_by_counts'] ]
| total_counts | n_genes_by_counts | |
|---|---|---|
| AAACCTGAGATCCTGT-1 | 694.0 | 473 |
| AAACCTGAGCCTATGT-1 | 6868.0 | 2016 |
| AAACCTGAGCTTTGGT-1 | 9188.0 | 3670 |
| AAACCTGAGGATGGTC-1 | 4021.0 | 2175 |
| AAACCTGAGTACGACG-1 | 3237.0 | 1622 |
| ... | ... | ... |
| TTTGTCATCACGACTA-1 | 6918.0 | 3234 |
| TTTGTCATCAGTTTGG-1 | 2895.0 | 1312 |
| TTTGTCATCCAAACTG-1 | 1461.0 | 1009 |
| TTTGTCATCCGTCAAA-1 | 2712.0 | 1399 |
| TTTGTCATCGTTACAG-1 | 7757.0 | 3294 |
8583 rows × 2 columns
sample_2.var
| gene_ids | feature_types | n_cells_by_counts | mean_counts | log1p_mean_counts | pct_dropout_by_counts | total_counts | log1p_total_counts | |
|---|---|---|---|---|---|---|---|---|
| MIR1302-2HG | ENSG00000243485 | Gene Expression | 13 | 0.001515 | 0.001513 | 99.848538 | 13.0 | 2.639057 |
| FAM138A | ENSG00000237613 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| OR4F5 | ENSG00000186092 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| AL627309.1 | ENSG00000238009 | Gene Expression | 40 | 0.004777 | 0.004766 | 99.533962 | 41.0 | 3.737670 |
| AL627309.3 | ENSG00000239945 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AC141272.1 | ENSG00000277836 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| AC023491.2 | ENSG00000278633 | Gene Expression | 587 | 0.249680 | 0.222887 | 93.160899 | 2143.0 | 7.670429 |
| AC007325.1 | ENSG00000276017 | Gene Expression | 212 | 0.040196 | 0.039409 | 97.530001 | 345.0 | 5.846439 |
| AC007325.4 | ENSG00000278817 | Gene Expression | 1949 | 0.436910 | 0.362495 | 77.292322 | 3750.0 | 8.229777 |
| AC007325.2 | ENSG00000277196 | Gene Expression | 2 | 0.000233 | 0.000233 | 99.976698 | 2.0 | 1.098612 |
36601 rows × 8 columns
We store the matrix X to save the raw values. We will be able to see it in layers, independently of how we transform the matrix X
sample_2.layers[ 'umi_raw' ] = sample_2.X.copy()
sample_3.layers[ 'umi_raw' ] = sample_3.X.copy()
We can see that the matrix is stored in .layers['umi_raw'], and we can reassign it to .X or use it if needed in some future analysis
sample_2
AnnData object with n_obs × n_vars = 8583 × 36601
obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
layers: 'umi_raw' sample_2.layers['umi_raw']
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32) You can always subset a dataset by using a selection of cells and genes, and assign it as a new dataset (or to itself if you want to filter out some cells or genes)
An annotated dataset can be subsetted by cells, for example using a quality measure as the number of transcripts per cell
sample_2_qc = sample_2[ sample_2.obs['total_counts']<10000, : ].copy()
sample_2_qc
AnnData object with n_obs × n_vars = 5452 × 36601
obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
layers: 'umi_raw' In a similar way, you can use values calculated on the genes to subset the data by genes, for example in how many cells each gene is detected
sample_2_qc = sample_2[ :, sample_2.var['n_cells_by_counts']>3 ].copy()
sample_2_qc
AnnData object with n_obs × n_vars = 8583 × 28220
obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
layers: 'umi_raw' Note how sample_2_qc has first a reduced number of cells and then a reduced number of genes.
Remember that you cannot subset at the same time by cells and genes, for example
sample_2[ sample_2.obs['total_counts']<10000, sample_2.var['mean_counts']>1 ]
but those two steps have to be done separately as shown before.
The annotated datasets can be easily saved by using write. The format to be used in the file name is h5ad.
!mkdir -p ../../Data/notebooks_data
sample_2.write('../../Data/notebooks_data/sample_2.h5ad')
... storing 'feature_types' as categorical
sample_3.write('../../Data/notebooks_data/sample_3.h5ad')
... storing 'feature_types' as categorical