Read the data¶
Learning objectives:¶
- Get an overview of the
scanpy
package and thepython
language syntax - Learn and explore the data structure containing a single cell dataset
- Understand and apply basic interactions with the transcript matrix and the components of a dataset
Execution time: 30-60 minutes
Import the packages¶
We will use scanpy
as the main analysis tool for the analysis, where we will also apply some other packages. Scanpy has a comprehensive manual webpage that includes many different tutorial you can use for further practicing. Packages are imported with the command import
, and their name is shortened with the command as
, so that we can write shorter names in our code
An alternative and well-established tool for R
users is Seurat. This is used in the R
version of this course.
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
Commands from scanpy are under different categories: preprocessing (pp), tools (tl), plotting (pl). Each category contains some functions to work on single cell data. Scanpy has also a category called external
, where a few external packages have been integrated to work with scanpy. Use the help()
command to see what a command does in python
help(sc.preprocessing.calculate_qc_metrics)
Help on function calculate_qc_metrics in module scanpy.preprocessing._qc: calculate_qc_metrics(adata: anndata._core.anndata.AnnData, *, expr_type: str = 'counts', var_type: str = 'genes', qc_vars: Collection[str] = (), percent_top: Union[Collection[int], NoneType] = (50, 100, 200, 500), layer: Union[str, NoneType] = None, use_raw: bool = False, inplace: bool = False, log1p: bool = True, parallel: Union[bool, NoneType] = None) -> Union[Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], NoneType] Calculate quality control metrics. Calculates a number of qc metrics for an AnnData object, see section `Returns` for specifics. Largely based on `calculateQCMetrics` from scater [McCarthy17]_. Currently is most efficient on a sparse CSR or dense matrix. Note that this method can take a while to compile on the first call. That result is then cached to disk to be used later. Parameters ---------- adata Annotated data matrix. expr_type Name of kind of values in X. var_type The kind of thing the variables are. qc_vars Keys for boolean columns of `.var` which identify variables you could want to control for (e.g. "ERCC" or "mito"). percent_top Which proportions of top genes to cover. If empty or `None` don't calculate. Values are considered 1-indexed, `percent_top=[50]` finds cumulative proportion to the 50th most expressed gene. layer If provided, use `adata.layers[layer]` for expression values instead of `adata.X`. use_raw If True, use `adata.raw.X` for expression values instead of `adata.X`. inplace Whether to place calculated metrics in `adata`'s `.obs` and `.var`. log1p Set to `False` to skip computing `log1p` transformed annotations. Returns ------- Depending on `inplace` returns calculated metrics (as :class:`~pandas.DataFrame`) or updates `adata`'s `obs` and `var`. Observation level metrics include: `total_{var_type}_by_{expr_type}` E.g. "total_genes_by_counts". Number of genes with positive counts in a cell. `total_{expr_type}` E.g. "total_counts". Total number of counts for a cell. `pct_{expr_type}_in_top_{n}_{var_type}` – for `n` in `percent_top` E.g. "pct_counts_in_top_50_genes". Cumulative percentage of counts for 50 most expressed genes in a cell. `total_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars` E.g. "total_counts_mito". Total number of counts for variabes in `qc_vars`. `pct_{expr_type}_{qc_var}` – for `qc_var` in `qc_vars` E.g. "pct_counts_mito". Proportion of total counts for a cell which are mitochondrial. Variable level metrics include: `total_{expr_type}` E.g. "total_counts". Sum of counts for a gene. `n_genes_by_{expr_type}` E.g. "n_genes_by_counts". The number of genes with at least 1 count in a cell. Calculated for all cells. `mean_{expr_type}` E.g. "mean_counts". Mean expression over all cells. `n_cells_by_{expr_type}` E.g. "n_cells_by_counts". Number of cells this expression is measured in. `pct_dropout_by_{expr_type}` E.g. "pct_dropout_by_counts". Percentage of cells this feature does not appear in. Example ------- Calculate qc metrics for visualization. .. plot:: :context: close-figs import scanpy as sc import seaborn as sns pbmc = sc.datasets.pbmc3k() pbmc.var["mito"] = pbmc.var_names.str.startswith("MT-") sc.pp.calculate_qc_metrics(pbmc, qc_vars=["mito"], inplace=True) sns.jointplot( data=pbmc.obs, x="log1p_total_counts", y="log1p_n_genes_by_counts", kind="hex", ) .. plot:: :context: close-figs sns.histplot(pbmc.obs["pct_counts_mito"])
Loading and understanding the dataset structure¶
Data can be loaded from many different possible formats. Each format has a dedicated reading command, for example read_h5ad
, read_10X_mtx
, read_txt
. We are going to use read_10X_mtx
to load the output of the 10X software that produces the aligned data.
Note the option cache=True
. If you are going to read again the same data, it will be loaded extremely fast, because it has been stored in a convenient format for large datasets (h5ad
format)
sample_2 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample2/outs/filtered_feature_bc_matrix/', cache=True)
sample_3 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample3/outs/filtered_feature_bc_matrix/', cache=True)
The datasets sample_2
and sample_3
are now created. They are so-called Annotated datasets
. Each annotated dataset contains:
- The data matrix
X
of size $N\_cells \times N\_genes$ - Vectors of cells-related quantities in the table
.obs
(for example, how many transcripts there are in each cell) - Vectors of genes-related quantities in the table
.var
(for example, in how many cells the each gene is detected) - Matrices of size $N\_cells \times N\_genes$ in
.layers
(for example, normalized data matrix, imputed data matrix, ....)
We will often call the cells for observations (obs) and the genes for variables (var) when it is practical in relation to the annotated dataset
During the analysis we will encounter other components of the annotated datasets. They will be explained when it is necessary, so you might want to skip this explanation if you want.
- Matrices where each line is cell-related in
.obsm
(for example, the PCA coordinates of each cell) - Matrices where each line is gene-related in
.varm
(for example, mean of the gene in each cell type) - Anything else useful is in
.uns
Above: a representation of the data matrix, variable and observations in an annotated dataset.
Each component of the annotated dataset is called by using a dot
. For example, we can see the data matrix by
sample_2.X
<8583x36601 sparse matrix of type '<class 'numpy.float32'>' with 22578947 stored elements in Compressed Sparse Row format>
The matrix is in compressed format. We can reassign it as a dense matrix, so that we can see what it contains.
sample_2.X = np.array( sample_2.X.todense() )
sample_2.X
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 1., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
sample_3.X = np.array( sample_3.X.todense() )
sample_3.X
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
When the matrix is no longer compressed, we can calculate some statistics for both cells and genes with the following scanpy
command. Note that all scanpy commands follow a similar format. The two commands used below are the same, but in the second we used the short form for the preprocessing
category.
sc.preprocessing.calculate_qc_metrics(sample_2, inplace=True)
sc.pp.calculate_qc_metrics(sample_3, inplace=True)
We can see that obs
and var
now contains a lot of different values whose names, that are mostly self-explicative. For example
n_genes_by_counts
is the number of detected genes in each celltotal_counts
is the number of transcripts in each cellmean_counts
is the average of counts of each gene across all cells
sample_2
AnnData object with n_obs × n_vars = 8583 × 36601 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes' var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
You can access directly all observations/variables or some of them specifically. Each observation line is named with the cell barcode, while variables have gene names in each line
sample_2.obs
n_genes_by_counts | log1p_n_genes_by_counts | total_counts | log1p_total_counts | pct_counts_in_top_50_genes | pct_counts_in_top_100_genes | pct_counts_in_top_200_genes | pct_counts_in_top_500_genes | |
---|---|---|---|---|---|---|---|---|
AAACCTGAGATCCTGT-1 | 473 | 6.161207 | 694.0 | 6.543912 | 32.853026 | 46.253602 | 60.662824 | 100.000000 |
AAACCTGAGCCTATGT-1 | 2016 | 7.609367 | 6868.0 | 8.834774 | 47.408270 | 53.567268 | 61.560862 | 74.635993 |
AAACCTGAGCTTTGGT-1 | 3670 | 8.208219 | 9188.0 | 9.125762 | 17.599042 | 25.555072 | 35.154549 | 51.273400 |
AAACCTGAGGATGGTC-1 | 2175 | 7.685244 | 4021.0 | 8.299535 | 13.603581 | 21.139020 | 32.131311 | 51.927381 |
AAACCTGAGTACGACG-1 | 1622 | 7.392032 | 3237.0 | 8.082711 | 20.883534 | 30.552981 | 42.570281 | 63.608279 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
TTTGTCATCACGACTA-1 | 3234 | 8.081784 | 6918.0 | 8.842027 | 13.356461 | 20.381613 | 29.719572 | 46.863255 |
TTTGTCATCAGTTTGG-1 | 1312 | 7.180070 | 2895.0 | 7.971086 | 32.918826 | 44.006908 | 55.613126 | 71.951641 |
TTTGTCATCCAAACTG-1 | 1009 | 6.917706 | 1461.0 | 7.287560 | 20.123203 | 29.295003 | 42.984257 | 65.160849 |
TTTGTCATCCGTCAAA-1 | 1399 | 7.244228 | 2712.0 | 7.905810 | 28.945428 | 40.007375 | 50.884956 | 66.851032 |
TTTGTCATCGTTACAG-1 | 3294 | 8.100161 | 7757.0 | 8.956480 | 25.125693 | 33.518113 | 42.671136 | 56.323321 |
8583 rows × 8 columns
sample_2.obs[ ['total_counts','n_genes_by_counts'] ]
total_counts | n_genes_by_counts | |
---|---|---|
AAACCTGAGATCCTGT-1 | 694.0 | 473 |
AAACCTGAGCCTATGT-1 | 6868.0 | 2016 |
AAACCTGAGCTTTGGT-1 | 9188.0 | 3670 |
AAACCTGAGGATGGTC-1 | 4021.0 | 2175 |
AAACCTGAGTACGACG-1 | 3237.0 | 1622 |
... | ... | ... |
TTTGTCATCACGACTA-1 | 6918.0 | 3234 |
TTTGTCATCAGTTTGG-1 | 2895.0 | 1312 |
TTTGTCATCCAAACTG-1 | 1461.0 | 1009 |
TTTGTCATCCGTCAAA-1 | 2712.0 | 1399 |
TTTGTCATCGTTACAG-1 | 7757.0 | 3294 |
8583 rows × 2 columns
sample_2.var
gene_ids | feature_types | n_cells_by_counts | mean_counts | log1p_mean_counts | pct_dropout_by_counts | total_counts | log1p_total_counts | |
---|---|---|---|---|---|---|---|---|
MIR1302-2HG | ENSG00000243485 | Gene Expression | 13 | 0.001515 | 0.001513 | 99.848538 | 13.0 | 2.639057 |
FAM138A | ENSG00000237613 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
OR4F5 | ENSG00000186092 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
AL627309.1 | ENSG00000238009 | Gene Expression | 40 | 0.004777 | 0.004766 | 99.533962 | 41.0 | 3.737670 |
AL627309.3 | ENSG00000239945 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
AC141272.1 | ENSG00000277836 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
AC023491.2 | ENSG00000278633 | Gene Expression | 587 | 0.249680 | 0.222887 | 93.160899 | 2143.0 | 7.670429 |
AC007325.1 | ENSG00000276017 | Gene Expression | 212 | 0.040196 | 0.039409 | 97.530001 | 345.0 | 5.846439 |
AC007325.4 | ENSG00000278817 | Gene Expression | 1949 | 0.436910 | 0.362495 | 77.292322 | 3750.0 | 8.229777 |
AC007325.2 | ENSG00000277196 | Gene Expression | 2 | 0.000233 | 0.000233 | 99.976698 | 2.0 | 1.098612 |
36601 rows × 8 columns
We store the matrix X
to save the raw values. We will be able to see it in layers
, independently of how we transform the matrix X
sample_2.layers[ 'umi_raw' ] = sample_2.X.copy()
sample_3.layers[ 'umi_raw' ] = sample_3.X.copy()
We can see that the matrix is stored in .layers['umi_raw']
, and we can reassign it to .X
or use it if needed in some future analysis
sample_2
AnnData object with n_obs × n_vars = 8583 × 36601 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes' var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts' layers: 'umi_raw'
sample_2.layers['umi_raw']
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 1., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
You can always subset a dataset by using a selection of cells and genes, and assign it as a new dataset (or to itself if you want to filter out some cells or genes)
An annotated dataset can be subsetted by cells, for example using a quality measure as the number of transcripts per cell
sample_2_qc = sample_2[ sample_2.obs['total_counts']<10000, : ].copy()
sample_2_qc
AnnData object with n_obs × n_vars = 5452 × 36601 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes' var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts' layers: 'umi_raw'
In a similar way, you can use values calculated on the genes to subset the data by genes, for example in how many cells each gene is detected
sample_2_qc = sample_2[ :, sample_2.var['n_cells_by_counts']>3 ].copy()
sample_2_qc
AnnData object with n_obs × n_vars = 8583 × 28220 obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes' var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts' layers: 'umi_raw'
Note how sample_2_qc
has first a reduced number of cells and then a reduced number of genes.
Remember that you cannot subset at the same time by cells and genes, for example
sample_2[ sample_2.obs['total_counts']<10000, sample_2.var['mean_counts']>1 ]
but those two steps have to be done separately as shown before.
The annotated datasets can be easily saved by using write
. The format to be used in the file name is h5ad
.
!mkdir -p ../../Data/notebooks_data
sample_2.write('../../Data/notebooks_data/sample_2.h5ad')
... storing 'feature_types' as categorical
sample_3.write('../../Data/notebooks_data/sample_3.h5ad')
... storing 'feature_types' as categorical