Samuele Soraggi
September 11, 2024
September 16, 2024
Here you will go through some different ways of projecting single cell data. You start with PCA-based projections, then move to geometry-based projections and diagnostic ways of choosing the method which best matches your data. At the end, you will evaluate the distortion of each projection with the Riemannian metric and create some interactive plots.
Here we install some packages and download a file with scripts, plus the data, from the GitHub repository.
%pip install --quiet scanpy topometry hnswlib pacmap plotly ipywidgets igraph leidenalg scvelo jupyterquiz git+https://github.com/DmitryUlyanov/Multicore-TSNE.git
Note: you may need to restart the kernel to use updated packages.
Load packages
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pacmap
import pandas as pd
import plotly.express as px
import scanpy as sc
import scvelo as scv
import seaborn as sns
import sklearn
import topo as tp
from jupyterquiz import display_quiz
from scipy.cluster import hierarchy as sch
%matplotlib inline
Set up some default figure parameters
Download scripts
Download data from the GitHub repository
The dataset is from human testicular tissue, which is organized into round tubules. Each tubule contains spermatogonia, the cells at the beginning of the differentiation process that leads to elongating spermatids (spermatozoa). See the figure below.
Other somatic cell types assist the differentiation and movement of the cells, and are usually located outside the tubules. Below: a section of testis tissue where cells differentiate inside a tubule, moving towards the center to become spermatozoa.
The dataset is a subsample of 10,000 cells from (Winge et al., 2020), already filtered down to 7765 cells. The full data is downloadable and you can find the exact filtering procedure in our summer school tutorial.
We read the data in h5ad format
This is an annotated data (AnnData) object. You can look at the object's size and content here. We mostly have some cell-related QC measures in `obs`, and the number of cells expressing each gene in `var`.
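A minimal sketch of the loading cell (the path below is a placeholder for the file downloaded from the repository, not its actual name):

```python
# Hypothetical path: use the h5ad file downloaded from the repository
adata = sc.read_h5ad("Data/testis.h5ad")
adata  # echo the object to inspect its content
```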
AnnData object with n_obs × n_vars = 7765 × 27897
obs: 'batch', 'super_batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_counts', 'perc_mito', 'perc_MALAT1', 'n_genes'
var: 'n_cells'
Here we proceed with just a rough normalization in the interest of time (transcripts per million + logarithm + standardization). At the same time we determine the 5000 most variable genes. There is a bit more info on normalization here if you want to read further. Also consider this paper and the scVerse best practices chapter on normalization.
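A minimal sketch of this normalization cell (assumed, since the original cell is not shown):

```python
# Rough normalization: transcripts per million + log1p + standardization,
# plus flagging the 5000 most variable genes
sc.pp.normalize_total(adata, target_sum=1e6)          # transcripts per million
sc.pp.log1p(adata)                                    # log(1 + x)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)  # flag the top 5000 HVGs
sc.pp.scale(adata)                                    # standardize each gene
```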
Now we look at various projection methods.
We start with PCA, the usual projection method everyone begins with after preprocessing, and the starting point of most calculations in scRNA-seq analysis.
The PCA plot is not able to show much structure in the data: it seems we simply have two long clusters of cells.
sc.pp.pca(adata, random_state=123) # calculate PCA
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.pca(adata) # plot PCA
How many components are we able to detect with PCA’s linear separation? It seems 10 principal components contain most of the data’s variance.
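One way to check this (a sketch; the corresponding cell is not shown above) is scanpy's elbow plot of the variance explained by each component:

```python
# Fraction of variance explained per PC; look for the elbow around PC 10
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
```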
Now we want to try out tSNE. tSNE is a graph optimization algorithm, and it is based on optimizing a cost function which balances two forces: an attractive one and a repulsive one.
The attractive force draws points closer to each other, privileging the global structure (e.g. the overall relative position of clusters), while the repulsive force pushes distant points far away, preserving the local structure (e.g. cluster separation). More on tSNE in the original paper and in this guidance paper.
Similarities are calculated with a Gaussian kernel in the high-dimensional data, and with a Student's t distribution in the projection (figure below). A parameter called perplexity determines the width of the Gaussian.
We try tSNE with perplexities 1, 10 and 100 to see how different the resulting tSNEs can be.
First of all, we calculate distances on the first 10 PCA components, which means the tSNE will be based on PCA distances (so a non-linear projection uses a linear projection as its starting point).
Now we run tSNE three times and plot the results. We use only 1000 cells, because it otherwise runs slowly.
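A minimal sketch of the computation cell (assumed, since it is not shown): subsample 1000 cells and run tSNE once per perplexity, storing each embedding under its own key so that `sc.pl.scatter(basis=...)` can find it.

```python
# Subsample 1000 cells: tSNE is slow on the full dataset
adata_subset = sc.pp.subsample(adata, n_obs=1000, copy=True, random_state=123)

# Run tSNE on the first 10 PCs, once per perplexity value
for p in [1, 10, 100]:
    sc.tl.tsne(adata_subset, n_pcs=10, perplexity=p, random_state=123)
    # copy the result to a dedicated key before the next run overwrites it
    adata_subset.obsm[f"X_tsne_P{p}_"] = adata_subset.obsm["X_tsne"]
```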
plt.rcParams["figure.figsize"] = (10, 4)
fig, axs = plt.subplots(1, 3)
perps = [1, 10, 100]
for i in range(3):
    f = sc.pl.scatter(
        adata_subset,
        basis=f"tsne_P{perps[i]}_",
        ax=axs[i],
        show=False,
        title=f"perplexity {perps[i]}",
    )
fig.show()
UMAP is also a graph optimization algorithm like tSNE, and it too ends up balancing attractive and repulsive forces. The weight of those forces is controlled by two parameters, a and b: the low-dimensional similarity between two points at distance d is modeled as 1 / (1 + a·d^(2b)).
Choose the a and b parameters below and test which plot you get. Try as many times as you want before continuing. Usually, values of a and b are close to 1 and are chosen automatically by UMAP, but it is worth tuning them to see the effect. Here the range is restricted to [0.1, 2].
# @title #### DIY: UMAP parameters choice {display-mode: "form"}
a = 1 # @param {type:"slider", min:0.1, max:2, step:0.05}
b = 1 # @param {type:"slider", min:0.1, max:2, step:0.05}
print(f"Chosen parameters {a} and {b}")
print("running UMAP...")
sc.tl.umap(adata, a=a, b=b, random_state=123)
print("Plotting")
plt.rcParams["figure.figsize"] = (4, 4)
fig, axs = plt.subplots(1)
sc.pl.umap(adata, ax=axs, show=True)
Chosen parameters 1 and 1
running UMAP...
Plotting
TopOMetry works by modelling the geometry of the shape formed by the data points in high dimension. This is done through the so-called discrete Laplace-Beltrami Operator (LBO). For reference, there is a paper and a documentation page for the package.
In a few words, the LBO finds a sort of principal components (called eigenfunctions or eigenvectors) of the geometrical shape formed by the data, which pieced together approximate the original dataset (figure below, right).
Usually there are fewer eigenvectors than dimensions in the data, because many of the genes are related by similar expression patterns (figure below, left). The number of eigenvectors is the intrinsic dimensionality of the data, which tells you how many dimensions you need to describe the geometry of the data surface.
The LBO is used in the same way as PCA, that is, its eigenvector matrix is the starting point for other dimensionality reductions. The difference is that anything computed from it will be based on geometry and not on PCA distances.
Below we go through the steps of using such a tool. Remember: no distance preservation is involved in calculating the LBO; it is all about the geometry (curvature and tangent space at each point).
First of all, we subset the data to the highly variable genes.
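This is the standard scanpy idiom (the original cell is not shown):

```python
# Keep only the highly variable genes flagged during normalization
adata = adata[:, adata.var.highly_variable].copy()
```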
Then, we estimate the intrinsic dimensionality of the data. This is done with two indices (FSA and MLE) evaluated on different neighborhood sizes around each cell. Both indices estimate the same quantity: the intrinsic dimensionality around each cell and its neighbors.
The result is a histogram, for which the guideline is to choose the peak. However, tails are also important and can capture extra dynamics in the data, so I usually select the number of dimensions on the right side of the peak (before the tail flattens out).
FSA and MLE are usually quite concordant, and I usually select the highest of the two estimates. Here, for example, I choose 250 dimensions.
Now it is time to calculate the Laplace-Beltrami Operator. On a computer, this is done by creating a matrix where each column is an eigenvector of the operator. Every eigenvector describes a geometrical characteristic of the data surface. TopOMetry offers a choice of methods to calculate the LBO, with projections and clusters based on it. You can choose:

- `kernels`: how to calculate similarities
- `eigenmap_methods`: how to calculate the LBO
- `projections`: which dimensionality reductions to compute starting from the LBO
- `resolution`: the resolution of the LBO-based clustering

Do not worry about the first two (you can find more in the topometry docs). For the projections we choose diffusion maps (`MAP`) and `PaCMAP` (which is superior to UMAP in various ways). For the clustering resolution we choose 0.5.
# Run TopOMetry
# Create a TopOGraph object with the desired parameters
tg = tp.TopOGraph(n_eigs=250, n_jobs=-1, verbosity=1, random_state=42)
# Fit some models to the data
adata = tp.sc.topological_workflow(
adata,
tg,
kernels=["bw_adaptive"],
eigenmap_methods=["LE"],
projections=["MAP", "PaCMAP"],
resolution=0.5,
)
Computing neighborhood graph...
Base kNN graph computed in 6.281542 (sec)
Fitted the bw_adaptive kernel in 0.308347 (sec)
Computing eigenbasis...
Fitted eigenbasis with Laplacian Eigenmaps from the bw_adaptive in 8.976884 (sec)
Building topological graph from eigenbasis...
Computing neighborhood graph...
Computed in 0.216704 (sec)
Fitted the bw_adaptive graph kernel in 0.311466 (sec)
Computed MAP in 10.575032 (sec)
Computed PaCMAP in 47.988012 (sec)
The resulting projections are inside `adata.obsm` and have long names which combine the three parameters chosen above when running TopOMetry.
Note also that we get a clustering in `adata.obs["bw_adaptive from LE with bw_adaptive_leiden"]`! This is a clustering based on the LBO instead of the PCA, calculated together with the projections.
The eigenvector matrix is also there, called `X_LE with bw_adaptive`.
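The cell below (assumed) simply echoes `adata.obsm`, producing the listing that follows:

```python
adata.obsm  # list all stored embeddings
```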
AxisArrays with keys: X_pca, X_umap, X_LE with bw_adaptive, X_MAP of bw_adaptive from LE with bw_adaptive, X_PaCMAP of LE with bw_adaptive
We rename some of the objects to make them easier to use:
adata.obsm["X_topoMAP"] = adata.obsm["X_MAP of bw_adaptive from LE with bw_adaptive"]
adata.obsm["X_topoPaCMAP"] = adata.obsm["X_PaCMAP of LE with bw_adaptive"]
adata.obs["topo_leiden"] = adata.obs["bw_adaptive from LE with bw_adaptive_leiden"]
adata.obsm["X_eigenvectors"] = adata.obsm["X_LE with bw_adaptive"]
Now, let’s look at the LBO-based clustering on the LBO-based PaCMAP projection:
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.embedding(
adata,
basis="topoPaCMAP",
color=["topo_leiden"],
palette="tab20",
ncols=2,
legend_loc="on data",
legend_fontsize=10,
)
Note how the diffusion maps below lose part of the structure, creating a sparse blob of clusters:
sc.pl.embedding(
adata,
basis="topoMAP",
color=["topo_leiden"],
palette="tab20",
ncols=2,
legend_loc="on data",
legend_fontsize=10,
)
This is instead the UMAP based on PCA, which is perhaps the projection most similar to LBO + PaCMAP:
sc.pl.embedding(
adata,
basis="umap",
color=["topo_leiden"],
palette="tab20",
ncols=2,
legend_loc="on data",
legend_fontsize=10,
)
Here I create a list of cell types and their markers, which I then assign using marker scores calculated as in Seurat, using one of the scripts we downloaded (a sketch of this step follows the marker list below).
markers = dict() # make an empty dictionary
### SPERMATOCYTOGENESIS
markers["SpermatogoniaA"] = ["ID4", "HMGA1"]
markers["SpermatogoniaB"] = ["MKI67", "DMRT1", "STRA8"]
markers["SpermatocytesI"] = ["MEIOB", "PRSS50", "SYCP1", "TEX101"]
markers["SpermatocytesII"] = ["PIWIL1", "ACRV1", "SPATA16", "CLGN"]
### SPERMIOGENESIS
markers["Round.Spt"] = ["SPATA9", "SPAM1"] # Round spermatids
markers["Elong.Spt"] = ["PRM1", "PRM2"] # Elongated spermatids
### SOMATIC CELLS
markers["Sertoli"] = ["CTSL", "VIM", "GATA4"]
markers["Macroph"] = ["CD163", "TYROBP"]
markers["Leydig"] = ["CFD"]
markers["Endothelial"] = ["CD34"]
markers["Myoid"] = ["ACTA2"]
markers["Smooth_Muscle"] = ["RGS5"]
We can look at the clustering. See how the cell types enclosed in a tubule are often close together, while the other cell types form separate clusters.
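A possible plotting cell (assumed):

```python
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.embedding(
    adata,
    basis="topoPaCMAP",
    color=["clusters"],
    palette="tab20",
    legend_loc="on data",
    legend_fontsize=10,
)
```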
Now, what about looking at the eigenvectors of the LBO to see what they highlight in our data?
Below, notice that the eigenvector matrix has 250 columns, each of which is an eigenvector. The eigenvectors are already normalized.
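A quick check of the matrix shape (assumed cell):

```python
# 7765 cells by 250 eigenvectors
print(adata.obsm["X_eigenvectors"].shape)
```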
Below there is a question that you can solve by exploring some eigenvectors with the input box after the question:
# @title #### DIY: Choose an LBO eigenvector {display-mode: "form"}
component = 0  # @param {type:"number", min:0, max:249, step:1}
adata.obs[f"Eigenvector_{component}"] = adata.obsm["X_eigenvectors"][:, component]
plt.rcParams["figure.figsize"] = (4, 4)
sc.pl.embedding(
adata,
basis="topoPaCMAP",
color=[f"Eigenvector_{component}", "clusters"],
palette="tab20",
ncols=3,
legend_fontsize=10,
)
plt.rcParams["figure.figsize"] = (12, 4)
print(f"Scores of component {component} by cluster")
f = sc.pl.violin(
adata,
keys=[f"Eigenvector_{component}"],
groupby="clusters",
palette="tab20",
rotation=90,
)
Scores of component 0 by cluster
How do we choose a projection based on the LBO? We can look at the distortion of each projection. As when drawing a map of the globe, projecting single cell data creates distortions, and those can be quantified with the Riemannian metric.
TopOMetry uses the metric to draw ellipses around some of the points to show the distortions. The directions in which an ellipse is wider are the directions of distortion, and the lengths of the ellipse's axes represent the distortion intensity.
Below we plot distortion ellipses on 500 points for all the projections we based on LBO.
# Get the graph Laplacian of our base graph kernel:
L = tg.base_kernel.L
# Define how many ellipses to plot:
n_plot = 500
# Convert the labels to integers:
labels = adata.obs["clusters"].cat.codes
# Name the projections you want to plot:
projections = ["topoMAP", "topoPaCMAP"]
# Plot the Riemann metric
for name in projections:
xy_coord = adata.obsm["X_" + name][:, :2]
tp.pl.plot_riemann_metric(
xy_coord, # The coordinates of the projection
L, # The graph Laplacian
std=0.5, # A scaling factor for the ellipses size
n_plot=n_plot, # How many ellipses to plot
labels=labels,
title=f"Riemannian Metric \n for {name}",
cmap="tab20", # For coloring
random_state=tg.random_state, # For coloring with same colors as before
figsize=(4, 4),  # Size of the figure (NEEDS TO BE SQUARE!)
)
Look at the first plot, with the diffusion maps. The ellipses are pretty large and many of them are wide along both axes, meaning distortions in both directions of the 2D space.
Usually, cells at the edge of the plot have some distortion pointing towards the center. This is expected and means that distances are not as large as they appear in the 2D space (similarly to distances being wrong at the edges of a planar world map).
If all cells, including those at the center of the plot, have large distortions, then the projection is not reliable, because all point coordinates are distorted.
The diffusion map does not seem a very good projection of the data. PaCMAP is definitely more reliable, mostly showing only distortion pointing from the edges towards the center.
So this is the end of the workshop. What are the most important points to get out of this?
You might want to document your notebook on your own webpage and, why not, show some interactive plots. You can do that easily with a library called `plotly express`.
For example, you can plot your PaCMAP with the clusters, displaying the expression of a specific gene when you hover the mouse.
px.scatter(
    x=adata.obsm["X_topoPaCMAP"][:, 0],
    y=adata.obsm["X_topoPaCMAP"][:, 1],
    color=adata.obs.clusters,
    hover_name=pd.Series(adata[:, "PRM1"].X[:, 0]),
    title="Clusters on PaCMAP and PRM1 expression on hovering",
    height=800,
)
You can also use bubble sizes to display positive quantities, such as the eccentricity of the ellipses we visualized before. Higher eccentricity means higher distortion along the axes' directions. You can also add an animation slider, here for example over the different batches.
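The eccentricity columns used below were computed in a cell that is not shown. Here is a minimal sketch of one plausible computation, assuming the ellipses follow the dual Riemannian metric estimate of Perrault-Joncas & Meila; the helper function and the exact eccentricity definition are my own, not part of the TopOMetry API:

```python
import numpy as np

def embedding_eccentricities(Y, L):
    # Dual metric h^{ij} = <grad f_i, grad f_j>, estimated from the graph
    # Laplacian L through 0.5 * (L(f_i f_j) - f_i L(f_j) - f_j L(f_i))
    n = Y.shape[0]
    h = np.empty((n, 2, 2))
    for i in range(2):
        for j in range(2):
            h[:, i, j] = 0.5 * (
                L @ (Y[:, i] * Y[:, j])
                - Y[:, i] * (L @ Y[:, j])
                - Y[:, j] * (L @ Y[:, i])
            )
    # Eigenvalues of each 2x2 metric give the ellipse axes' relative sizes
    lam = np.abs(np.linalg.eigvalsh(h))
    lam_min, lam_max = lam.min(axis=1), lam.max(axis=1)
    # Eccentricity: 0 for a circle, close to 1 for a very elongated ellipse
    return np.sqrt(1 - (lam_min / np.maximum(lam_max, 1e-12)) ** 2)

# L is the graph Laplacian computed earlier (tg.base_kernel.L)
adata.obs["topoMAP_eccentricities"] = embedding_eccentricities(
    adata.obsm["X_topoMAP"][:, :2], L
)
adata.obs["topoPaCMAP_eccentricities"] = embedding_eccentricities(
    adata.obsm["X_topoPaCMAP"][:, :2], L
)
```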
fig = px.scatter(
x=adata.obsm["X_topoMAP"][:, 0],
y=adata.obsm["X_topoMAP"][:, 1],
size=adata.obs.topoMAP_eccentricities,
size_max=10,
color=adata.obs.clusters,
animation_frame=adata.obs.batch,
title="Diffusion Maps colored by cluster, sized by eccentricities with selector for batches",
height=800,
)
fig.show()
The same plot for PaCMAP shows smaller eccentricities, as you would expect.
fig = px.scatter(
x=adata.obsm["X_topoPaCMAP"][:, 0],
y=adata.obsm["X_topoPaCMAP"][:, 1],
size=adata.obs.topoPaCMAP_eccentricities,
size_max=10,
color=adata.obs.clusters,
animation_frame=adata.obs.batch,
title="PaCMAP colored by cluster, sized by eccentricities\nwith selector for batches",
height=800,
)
fig.show()
OK, now you are really done. You can see all the plots above in the web version of this tutorial! Note, however, that those plots will not be shown if you created the notebook with Google Colab: you need to create the plots in JupyterLab for them to work (Google Colab has its own renderer for these plots, and it does not work with any other tool for creating webpages…).