Cross-data analysis

Motivation:

Azoospermia is a condition in which men produce no spermatozoa, or produce semen of too low quality for pregnancy to occur. Several types of azoospermia exist, and they look different at the cellular level, as you can see below.

Examples of testicular histology and the composition of testicular cell types that can be observed among men with non-obstructive azoospermia. a A biopsy from a patient with Klinefelter syndrome (47, XXY) showing degenerated ghost tubules (#), tubules with Sertoli-cell-only (SCO) pattern (*) and large clusters of Leydig cells. b SCO (*) observed in a patient with a complete AZFc deletion. c Tubules with germ cell neoplasia in situ, GCNIS, which do not contain any normal germ cells (&). GCNIS cells are the precursor cells of testicular germ cell cancer and are found more frequently among men with azoospermia than among men with good semen quality (Hoei-Hansen et al. 2003). d Classical Sertoli-cell-only syndrome (SCOS) where no germ cells are present. Only Sertoli cells are found inside the seminiferous tubules marked with an asterisk (*). e SCO (*) with partial hyalinisation of tubules (#). f Spermatocytic arrest (SPA) (§) at the stage of spermatocytes. The bar represents 100 microns and all images are in the same magnification. From (Soraggi et al 2020).

Common to the various azoospermic conditions is the lack or disruption of gene expression patterns. It therefore makes sense to detect genes that are expressed more in the healthy dataset than in the azoospermic one. We can also query gene enrichment databases to get a clearer picture of which biological functions the genes of interest are involved in.

We perform a simple analysis of a dataset of "healthy" cells against a dataset of azoospermic cells: we integrate the data, then apply differential expression and gene enrichment analysis. The azoospermic dataset has already been preprocessed and clustered. The notebooks for the whole preprocessing are included under the section Extra of the course webpage, and can be found in the folder Notebooks/Python/Azoospermia. The original data is also provided, so you can preprocess and cluster the data again on your own.

Learning objectives:

Integrate datasets and detect DE genes in two different health conditions
Evaluate visually the integration results
Perform and interpret gene enrichment analysis

Execution time: 30 minutes

*Import packages*

In [1]:

import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import anndata as ad
import gseapy as gp

plt.rcParams['figure.figsize']=(6,6) #rescale figures

In [2]:

import rpy2.robjects as ro

import rpy2.rinterface_lib.callbacks
import logging

from rpy2.robjects import pandas2ri
import anndata2ri

# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Automatically convert rpy2 outputs to pandas dataframes
pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython

In [3]:

%%R
.libPaths( c( "../../../../sandbox_scRNA_testAndFeedback/scrna-environment/lib/R/library/" ) )

Read the data for the healthy and azoospermic patients

In [4]:

healthy = sc.read('../../Data/notebooks_data/sample_123.filt.norm.red.clst.2.times.h5ad')
azoospermic = sc.read('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/notebooks_data/crypto_123.filt.norm.red.clst.2.times.h5ad')

WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.red', '.clst', '.2', '.times', '.h5ad'].
Only considering the two last: ['.times', '.h5ad'].
WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.red', '.clst', '.2', '.times', '.h5ad'].
Only considering the two last: ['.times', '.h5ad'].
WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.red', '.clst', '.2', '.times', '.h5ad'].
Only considering the two last: ['.times', '.h5ad'].
WARNING: Your filename has more than two extensions: ['.filt', '.norm', '.red', '.clst', '.2', '.times', '.h5ad'].
Only considering the two last: ['.times', '.h5ad'].

Copy the cluster variable of the azoospermic data under the name clusters_som, so that the variables match in the two datasets

In [5]:

azoospermic.obs['clusters_som']=azoospermic.obs['clusters_spc'].copy()

As a reminder, here are the available clusters and the UMAP plots. The clusters match between the two datasets, apart from SpermatogoniaB, whose markers were not observed in azoospermic patients. Note also the differences in pseudotimes.

In [6]:

sc.pl.umap(healthy, color=['clusters_spc','pseudotimes'], 
           legend_loc='on data', title='Healthy patient clustering')

WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.

In [7]:

sc.pl.umap(azoospermic, color=['clusters_spc','pseudotimes'], 
           legend_loc='on data', title='Azoospermic patient clustering')

WARNING: The title list is shorter than the number of panels. Using 'color' value instead for some plots.

In [8]:

healthy.shape

Out[8]:

(6431, 22790)

In [9]:

azoospermic.shape

Out[9]:

(2147, 14018)

Put data together

One possible comparison is a differential gene expression analysis for each cluster found in both datasets. In this way we can find genes expressed in one sample and not in the other. To do this, we first concatenate the datasets and normalize them.

In [10]:

batch_names = ['healthy','azoospermic'] #choose names for samples
sample = ad.AnnData.concatenate(healthy, azoospermic, batch_key='condition') #concatenate
sample.rename_categories(key='condition', categories=batch_names) #apply sample names
scv.utils.cleanup(sample, clean='var') #remove duplicated gene quantities

We normalize the data, combining batch and condition into a single batch variable that distinguishes the samples

In [11]:

sample.obs['batch_condition'] = [f'{i}_{j}' for i,j in zip(sample.obs['batch'],sample.obs['condition'])]

In [12]:

rawMatrix = np.array( sample.layers['umi_raw'].T.copy())
genes_name = sample.var_names
cells_info = sample.obs[ ["batch_condition"] ].copy()

In [13]:

%%R -i cells_info -i rawMatrix -i genes_name
library(scater)
cell_df <- DataFrame(data = cells_info)
colnames(rawMatrix) <- rownames(cell_df) #cell names
rownames(rawMatrix) <- genes_name #gene names

In [14]:

%%R
library(sctransform)
library(future)
future::plan(strategy = 'multicore', workers = 32)
options(future.globals.maxSize = 50 * 1024 ^ 3)

In [15]:

%%R
vst_out=vst( as.matrix(rawMatrix), #data matrix
            cell_attr=cell_df, #dataframe containing batch variable
            n_genes=3000, #number of genes used to estimate the model parameters
            batch_var='data.batch_condition', #name of the batch variable
            method='qpoisson', #type of statistical model. use "poisson" for more precision but much slower execution
            show_progress=TRUE, #show progress bars
            return_corrected_umi=TRUE) #return corrected umi count matrix

  |======================================================================| 100%
  |======================================================================| 100%
  |======================================================================| 100%

In [16]:

%%R -o new_matrix -o sct_genes -o all_genes -o umi_matrix
new_matrix=vst_out$y #normalized matrix
sct_genes = rownames(vst_out$model_pars) #genes used to fit the model (flagged below as highly variable)
all_genes = rownames(new_matrix) #vector of all genes to check if any have been filtered out
umi_matrix=vst_out$umi_corrected #corrected umi count matrix

In [17]:

sct_genes = list(sct_genes)
sample.var['highly_variable'] = [i in sct_genes for i in sample.var_names]

In [18]:

sample = sample[:,list(all_genes)].copy()

In [19]:

sample.layers['norm_sct_condition'] = np.transpose( new_matrix )
sample.layers['umi_sct_condition'] = np.transpose( umi_matrix )

We now have fewer genes than in the healthy data, mainly because the concatenation keeps only genes that are also present in the azoospermic dataset

In [20]:

sample

Out[20]:

AnnData object with n_obs × n_vars = 8578 × 14009
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'perc_mito', 'n_counts', 'n_genes', 'doublet_score', 'predicted_doublet', 'batch', 'leiden', 'clusters', 'clusters_spc', 'pseudotimes', 'clusters_som', 'condition', 'batch_condition'
    var: 'highly_variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'norm_sct', 'umi_log', 'umi_raw', 'umi_sct', 'umi_tpm', 'norm_sct_condition', 'umi_sct_condition'

Differential expression

Here we do a simple differential expression analysis of all healthy cells against all azoospermic cells. The top genes of the healthy sample are mostly related to the development of sperm, especially in the round and elongated spermatid stages, whereas many of the top genes for the azoospermic cells are ribosomal genes.

In [21]:

sample.X = sample.layers['umi_sct_condition'].copy()
sc.pp.log1p(sample)

In [22]:

sc.tl.rank_genes_groups(sample, groupby='condition', key_added='DE_condition', 
                        use_raw=False, n_genes=50, method='wilcoxon')

... storing 'batch' as categorical
... storing 'leiden' as categorical
... storing 'clusters' as categorical
... storing 'clusters_spc' as categorical
... storing 'batch_condition' as categorical

In [23]:

pd.DataFrame(sample.uns['DE_condition']['names'])

Out[23]:

	healthy	azoospermic
0	PRM2	RPS27
1	PRM1	RPS29
2	CRISP2	RPS26
3	TNP1	RANBP1
4	CCDC7	FTH1
5	TRIM36	RPS23
6	ZNF295-AS1	RPL34
7	LINC01921	RPL38
8	ODF2	RPS28
9	MLF1	RPL37
10	C2orf73	RPS2
11	SPATA4	RPLP0
12	DCUN1D1	RPLP1
13	ROPN1	RPL36
14	HMGB4	TOMM7
15	CCDC110	RPL6
16	PFKP	CST3
17	ADAD1	RPL4
18	BRDT	PTMA
19	SSX2IP	ITM2B
20	BAG5	RPL12
21	MPC2	SRSF2
22	NUPR2	NOP53
23	FAM104A	MTDH
24	PGK2	PCBP2
25	PDHA2	RPL10A
26	NME5	RPS9
27	CABYR	RPL18
28	DKKL1	RPS8
29	C11orf71	KDELR2
30	ATAD1	RPS3
31	CAMLG	RPL37A
32	MORN2	TMSB4X
33	CAPZA3	SNHG7
34	TPP2	PRELID1
35	IFT57	UBL5
36	DNAJC5B	MAP1LC3B
37	ROPN1B	SNHG5
38	TMF1	RPL21
39	SMCP	RPS14
40	PIH1D2	RPLP2
41	GSG1	RPL14
42	SPATA7	RPL10
43	FAM81B	MT-CO3
44	MRPL42	RPL18A
45	H2AFJ	MT-ND3
46	CDKN3	RPL36A
47	B4GALT1-AS1	MIF
48	ACTRT2	RPL41
49	ARMC3	SRRM2

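You can again look at log-fold changes and adjusted p-values. In the table below, the suffixes N, P and L stand for gene names, adjusted p-values and log-fold changes, respectively.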

In [24]:

result = sample.uns['DE_condition']
groups = result['names'].dtype.names
X = pd.DataFrame(
    {group + '_' + key[:1].upper(): result[key][group]
    for group in groups for key in ['names', 'pvals_adj','logfoldchanges']})
X

Out[24]:

	healthy_N	healthy_P	healthy_L	azoospermic_N	azoospermic_P	azoospermic_L
0	PRM2	0.000000e+00	2.032233	RPS27	0.000000e+00	1.026861
1	PRM1	8.437521e-278	2.002583	RPS29	0.000000e+00	1.019021
2	CRISP2	1.365587e-275	2.028542	RPS26	5.887639e-283	1.706868
3	TNP1	3.314488e-248	1.960973	RANBP1	2.006989e-258	1.354506
4	CCDC7	3.481192e-218	1.627345	FTH1	7.402147e-247	1.787864
5	TRIM36	9.660027e-212	1.600318	RPS23	1.312549e-243	1.120837
6	ZNF295-AS1	8.585883e-207	2.665506	RPL34	4.200615e-230	0.888604
7	LINC01921	4.328831e-204	2.113969	RPL38	3.827195e-223	0.688914
8	ODF2	1.599192e-201	1.300002	RPS28	1.337696e-220	1.546971
9	MLF1	1.570552e-200	1.216649	RPL37	1.047240e-218	0.832729
10	C2orf73	1.061742e-194	1.577648	RPS2	3.335509e-215	1.854126
11	SPATA4	6.659028e-192	1.445028	RPLP0	1.289418e-214	1.185492
12	DCUN1D1	1.017568e-182	1.625203	RPLP1	4.094501e-187	1.333767
13	ROPN1	1.508635e-181	1.584022	RPL36	3.640184e-184	0.944616
14	HMGB4	3.223076e-180	1.864275	TOMM7	1.270308e-183	0.970103
15	CCDC110	5.220077e-180	1.732849	RPL6	1.002790e-180	1.022597
16	PFKP	1.278087e-178	1.463896	CST3	1.260241e-177	2.394424
17	ADAD1	3.616420e-176	1.324294	RPL4	1.460616e-176	0.796177
18	BRDT	8.874404e-174	1.290164	PTMA	8.979741e-175	1.705152
19	SSX2IP	5.096976e-171	1.476762	ITM2B	1.604503e-173	2.367598
20	BAG5	2.808186e-168	1.357559	RPL12	2.164859e-169	1.274136
21	MPC2	1.214120e-167	0.905591	SRSF2	4.355515e-166	1.596860
22	NUPR2	2.634776e-162	1.640146	NOP53	5.980976e-162	1.646122
23	FAM104A	3.466602e-160	1.044372	MTDH	1.129843e-158	1.224478
24	PGK2	4.973615e-160	1.429321	PCBP2	1.658544e-154	1.404459
25	PDHA2	1.642773e-158	1.552514	RPL10A	4.730326e-153	1.158851
26	NME5	4.617018e-158	1.289523	RPS9	1.458009e-152	0.653831
27	CABYR	4.015829e-157	1.422755	RPL18	1.200189e-151	1.581931
28	DKKL1	7.798017e-156	1.504000	RPS8	2.148500e-147	0.876272
29	C11orf71	2.073641e-155	1.292396	KDELR2	3.930590e-147	1.265155
30	ATAD1	3.476670e-154	1.453236	RPS3	3.491632e-144	0.919098
31	CAMLG	2.757895e-153	0.921729	RPL37A	6.894196e-141	0.469303
32	MORN2	1.001687e-152	1.051656	TMSB4X	1.566276e-139	1.955806
33	CAPZA3	5.032883e-152	1.846487	SNHG7	2.205658e-134	2.393714
34	TPP2	1.209302e-149	1.473941	PRELID1	2.977950e-134	2.087113
35	IFT57	1.874411e-148	1.070314	UBL5	3.269293e-134	1.022076
36	DNAJC5B	1.844754e-145	1.805371	MAP1LC3B	4.815629e-134	1.013808
37	ROPN1B	1.930813e-145	1.505872	SNHG5	6.897227e-134	1.493377
38	TMF1	3.429067e-145	1.395980	RPL21	1.176173e-133	1.789231
39	SMCP	1.581934e-144	1.590584	RPS14	2.252929e-133	0.548712
40	PIH1D2	3.286820e-144	1.328082	RPLP2	4.735469e-131	0.595375
41	GSG1	1.233251e-143	1.588035	RPL14	4.360103e-129	0.965527
42	SPATA7	4.604242e-143	1.577425	RPL10	2.062930e-128	1.895483
43	FAM81B	8.028167e-142	1.791096	MT-CO3	2.657770e-128	1.624999
44	MRPL42	9.043097e-142	1.034540	RPL18A	1.201094e-126	1.703612
45	H2AFJ	1.441976e-140	1.373664	MT-ND3	3.394581e-126	1.560312
46	CDKN3	4.676738e-140	1.375274	RPL36A	1.429154e-125	1.860079
47	B4GALT1-AS1	8.175465e-137	1.812452	MIF	8.080621e-125	1.383539
48	ACTRT2	2.204747e-136	1.581528	RPL41	7.655546e-124	0.854599
49	ARMC3	4.142454e-136	1.658994	SRRM2	1.194687e-123	1.427597

In [25]:

X.to_csv('../../Data/results/diff_expression_condition.csv', header=True, index=False)

Integration plot. We use the standard PCA because it is faster, and rely on bbknn to correct the differences between samples. While we could not resolve the Somatic cells of the healthy data into subtypes, the overlapping UMAP plot now lets us distinguish them into endothelial and myoid cells.

In [26]:

sample.X = sample.layers['norm_sct_condition'].copy() #use normalized data in .X
sc.pp.scale(sample) #standardize
sc.pp.pca(sample, svd_solver='arpack', random_state=12345) #do PCA

In [27]:

import bbknn
bbknn.bbknn(sample, batch_key='batch_condition')
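
The idea behind batch-balanced neighbours can be sketched with plain numpy: instead of taking the k nearest cells overall, each cell takes its k nearest cells from every batch, so that the neighbourhood graph links the samples together. This is a simplified illustration on invented 2D points, not the bbknn implementation; the function name batch_balanced_neighbors is hypothetical.

```python
import numpy as np

def batch_balanced_neighbors(X, batches, k=2):
    """For each cell, return the indices of its k nearest cells in every batch."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    # Pairwise Euclidean distances between all cells
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    neighbors = []
    for i in range(X.shape[0]):
        picked = []
        for b in np.unique(batches):
            idx = np.flatnonzero(batches == b)  # cells belonging to batch b
            order = idx[np.argsort(dist[i, idx], kind="stable")][:k]
            picked.extend(order.tolist())
        neighbors.append(picked)
    return neighbors

# Two well separated toy batches of three cells each
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
batches_toy = ["healthy"] * 3 + ["azoospermic"] * 3

neighbors = batch_balanced_neighbors(X_toy, batches_toy, k=2)
print(neighbors[0])  # two azoospermic and two healthy neighbours for cell 0
```

Even though the two toy batches are far apart, every cell ends up connected to cells of the other batch, which is what pulls the samples together in the UMAP projection.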

In [28]:

sc.tl.umap(sample, random_state=54321)

In [29]:

sc.pl.umap(sample, color=['condition','clusters_spc'], ncols=1)

Below, we average the UMAP coordinates of each cluster in the azoospermic (A) and healthy (H) datasets, and plot those averages, to see whether matching clusters are close to each other or far apart. Notice that Spermatogonia B and Leptotene overlap. This is because in only one of the two datasets we have left some spermatogonia B cells where we could not observe leptotene markers. But we could also have misidentified some Spermatogonia A cells. Somatic cells are off compared to the myoid and endothelial cells, simply because of the different cell identification.

In [30]:

new_names = np.array([ str(i[0]).upper() + '_' + str(j) for i,j in 
             zip(sample.obs['condition'], sample.obs['clusters_spc']) ])

#fix the misspelled cluster name in the azoospermic data
new_names[new_names == 'A_Dyplotene'] = 'A_Diplotene'

np.unique(new_names)

markers = { 'azoospermic':'s', 'healthy':'p' }

plt.rcParams['figure.figsize']=(10,6) #rescale figures
X_umap = sample.obsm['X_umap'].copy()
x = []
y = []
clst = []
condition = []

#need the same category names order to have the same color palette for the clusters
for i in np.unique(new_names):
    boolean = new_names == i #cells of this condition and cluster
    x.append( np.mean(X_umap[boolean,0]) ) #average UMAP coordinates
    y.append( np.mean(X_umap[boolean,1]) )
    clst.append( i.split('_')[1] )
    condition.append( sample.obs['condition'][boolean].iloc[0] )
sns.set_style("white", {'axes.grid' : False})
#x and y are passed by keyword, as recent seaborn versions require
g=sns.scatterplot(x=x, y=y, style=condition, hue=clst, markers=markers, s=1000)
g.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., fontsize= 20, markerscale = 3)
g.set_title('Overlapping of cluster coordinates on UMAP')
g.set(xlabel = 'UMAP_0', ylabel='UMAP_1')

Out[30]:

[Text(0.5, 0, 'UMAP_0'), Text(0, 0.5, 'UMAP_1')]
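
The per-cluster averages computed in the loop above can also be obtained with a pandas groupby. This is a minimal sketch on an invented table; in practice the columns would be filled from the condition, the cluster labels and the UMAP coordinates of the integrated data.

```python
import pandas as pd

# Toy table: one row per cell, with made-up UMAP coordinates
umap_toy = pd.DataFrame({
    "condition": ["healthy", "healthy", "azoospermic", "azoospermic", "healthy"],
    "cluster": ["Leptotene", "Leptotene", "Leptotene", "Leptotene", "Zygotene"],
    "UMAP_0": [1.0, 3.0, 2.0, 4.0, 10.0],
    "UMAP_1": [0.0, 2.0, 1.0, 1.0, 5.0],
})

# Average coordinates for every (condition, cluster) pair
centroids = umap_toy.groupby(["condition", "cluster"])[["UMAP_0", "UMAP_1"]].mean()
print(centroids)
```

The resulting table has one row per condition and cluster, which makes it easy to compute, for example, the distance between the healthy and azoospermic centroid of the same cluster.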

We can look at the percentage of cells in each cluster for the two datasets. These numbers are not very reliable when clusters are assigned by hand from marker genes, because we might assign some subclusters to the wrong type. For example, Spermatogonia B and Leptotene in the healthy data sum up to roughly the percentage of Leptotene cells in the azoospermic data.

In [33]:

healthy.obs['clusters_spc'].value_counts() / healthy.shape[0] * 100

Out[33]:

RoundSpermatids    39.200746
Diplotene          19.406002
SpermatogoniaA     10.620432
ElongSpermatids     9.609703
Zygotene            7.510496
SpermatogoniaB      5.100295
Somatic             4.587156
Pachytene           2.612346
Leptotene           1.352822
Name: clusters_spc, dtype: float64

In [34]:

azoospermic.obs['clusters_spc'].value_counts() / azoospermic.shape[0] * 100

Out[34]:

SpermatogoniaA     19.701910
RoundSpermatids    19.329297
Myoid              14.019562
Zygotene           12.529110
Diplotene           8.337215
ElongSpermatids     6.380997
Dyplotene           6.054960
Leptotene           5.728924
Pachytene           5.635771
Endothelial         2.282254
Name: clusters_spc, dtype: float64
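
The two lists of percentages are easier to compare side by side. Below is a minimal sketch with invented cluster labels, where clusters missing from one dataset are shown as 0; the same pattern could be applied to the clusters_spc columns of the two datasets.

```python
import pandas as pd

# Made-up cluster labels for a handful of cells in each dataset
healthy_labels = pd.Series(["Leptotene", "Zygotene", "Zygotene", "SpermatogoniaB"])
azoospermic_labels = pd.Series(["Leptotene", "Leptotene", "Zygotene", "Myoid"])

# One column of percentages per dataset, aligned on the cluster names
percentages = pd.concat(
    {
        "healthy": healthy_labels.value_counts(normalize=True) * 100,
        "azoospermic": azoospermic_labels.value_counts(normalize=True) * 100,
    },
    axis=1,
).fillna(0.0)  # a cluster absent from one dataset counts as 0%

print(percentages)
```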

In [35]:

sample.write('../../Data/notebooks_data/condition.integrated.h5ad')

Gene enrichment

Let's do an enrichment analysis to see how differentially expressed genes from healthy patients can be interpreted. We use the package gseapy, which lets you choose among many gene enrichment databases. The enrichr function we use is an interface to the Enrichr website, where you can copy-paste a list of genes and visualize the same results as in this python code.

In [36]:

sample = sc.read('../../Data/notebooks_data/condition.integrated.h5ad')

Load the differential expression table and keep only the columns with the gene names

In [37]:

DE_genes = pd.read_csv('../../Data/results/diff_expression_condition.csv')
DE_genes = DE_genes.loc[:, [i.split('_')[1]=='N' for i in DE_genes.columns] ]

Run gene enrichment analysis. Results are in the folders Data/results/enrichment_condition/healthy and Data/results/enrichment_condition/azoospermic of the course material.

In [38]:

enrich_results = dict()
for CONDITION in DE_genes.columns:
    condition_name = CONDITION.split('_')[0] #'healthy_N' becomes 'healthy'
    print('------Enrichment analysis for condition ' + condition_name + '------')
    enrich_results[condition_name] = gp.enrichr(gene_list=DE_genes[CONDITION],
                 gene_sets=[ 'ARCHS4_TFs_Coexp',
                             'Chromosome_Location_hg19',
                             'WikiPathway_2021_Human',
                             'ARCHS4_Tissues',
                             'GO_Molecular_Function_2021',],
                 organism='Human', #set the organism your genes come from
                 description=CONDITION,
                 outdir=f'../../Data/results/enrichment_condition/{condition_name}', #one folder per condition
                 cutoff=0.05 #p-value cutoff for the enrichment test
                )

------Enrichment analysis for condition healthy------

2022-02-23 12:46:10,635 Warning: No enrich terms using library Chromosome_Location_hg19 when cutoff = 0.05
2022-02-23 12:46:18,540 Warning: No enrich terms using library GO_Molecular_Function_2021 when cutoff = 0.05

------Enrichment analysis for condition azoospermic------

2022-02-23 12:46:24,940 Warning: No enrich terms using library Chromosome_Location_hg19 when cutoff = 0.05

Note that we have chosen five databases as an example (option gene_sets), but you can see a list of all available databases below, or by visiting the Enrichr webpage.

In [39]:

gp.get_library_name()

Out[39]:

['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021',
 'Allen_Brain_Atlas_down',
 'Allen_Brain_Atlas_up',
 'Azimuth_Cell_Types_2021',
 'BioCarta_2013',
 'BioCarta_2015',
 'BioCarta_2016',
 'BioPlanet_2019',
 'BioPlex_2017',
 'CCLE_Proteomics_2020',
 'CORUM',
 'COVID-19_Related_Gene_Sets',
 'COVID-19_Related_Gene_Sets_2021',
 'Cancer_Cell_Line_Encyclopedia',
 'CellMarker_Augmented_2021',
 'ChEA_2013',
 'ChEA_2015',
 'ChEA_2016',
 'Chromosome_Location',
 'Chromosome_Location_hg19',
 'ClinVar_2019',
 'DSigDB',
 'Data_Acquisition_Method_Most_Popular_Genes',
 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019',
 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019',
 'Descartes_Cell_Types_and_Tissue_2021',
 'DisGeNET',
 'Disease_Perturbations_from_GEO_down',
 'Disease_Perturbations_from_GEO_up',
 'Disease_Signatures_from_GEO_down_2014',
 'Disease_Signatures_from_GEO_up_2014',
 'DrugMatrix',
 'Drug_Perturbations_from_GEO_2014',
 'Drug_Perturbations_from_GEO_down',
 'Drug_Perturbations_from_GEO_up',
 'ENCODE_Histone_Modifications_2013',
 'ENCODE_Histone_Modifications_2015',
 'ENCODE_TF_ChIP-seq_2014',
 'ENCODE_TF_ChIP-seq_2015',
 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X',
 'ESCAPE',
 'Elsevier_Pathway_Collection',
 'Enrichr_Libraries_Most_Popular_Genes',
 'Enrichr_Submissions_TF-Gene_Coocurrence',
 'Enrichr_Users_Contributed_Lists_2020',
 'Epigenomics_Roadmap_HM_ChIP-seq',
 'GO_Biological_Process_2013',
 'GO_Biological_Process_2015',
 'GO_Biological_Process_2017',
 'GO_Biological_Process_2017b',
 'GO_Biological_Process_2018',
 'GO_Biological_Process_2021',
 'GO_Cellular_Component_2013',
 'GO_Cellular_Component_2015',
 'GO_Cellular_Component_2017',
 'GO_Cellular_Component_2017b',
 'GO_Cellular_Component_2018',
 'GO_Cellular_Component_2021',
 'GO_Molecular_Function_2013',
 'GO_Molecular_Function_2015',
 'GO_Molecular_Function_2017',
 'GO_Molecular_Function_2017b',
 'GO_Molecular_Function_2018',
 'GO_Molecular_Function_2021',
 'GTEx_Aging_Signatures_2021',
 'GTEx_Tissue_Expression_Down',
 'GTEx_Tissue_Expression_Up',
 'GWAS_Catalog_2019',
 'GeneSigDB',
 'Gene_Perturbations_from_GEO_down',
 'Gene_Perturbations_from_GEO_up',
 'Genes_Associated_with_NIH_Grants',
 'Genome_Browser_PWMs',
 'HDSigDB_Human_2021',
 'HDSigDB_Mouse_2021',
 'HMDB_Metabolites',
 'HMS_LINCS_KinomeScan',
 'HomoloGene',
 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression',
 'HumanCyc_2015',
 'HumanCyc_2016',
 'Human_Gene_Atlas',
 'Human_Phenotype_Ontology',
 'InterPro_Domains_2019',
 'Jensen_COMPARTMENTS',
 'Jensen_DISEASES',
 'Jensen_TISSUES',
 'KEA_2013',
 'KEA_2015',
 'KEGG_2013',
 'KEGG_2015',
 'KEGG_2016',
 'KEGG_2019_Human',
 'KEGG_2019_Mouse',
 'KEGG_2021_Human',
 'Kinase_Perturbations_from_GEO_down',
 'Kinase_Perturbations_from_GEO_up',
 'L1000_Kinase_and_GPCR_Perturbations_down',
 'L1000_Kinase_and_GPCR_Perturbations_up',
 'LINCS_L1000_Chem_Pert_down',
 'LINCS_L1000_Chem_Pert_up',
 'LINCS_L1000_Ligand_Perturbations_down',
 'LINCS_L1000_Ligand_Perturbations_up',
 'Ligand_Perturbations_from_GEO_down',
 'Ligand_Perturbations_from_GEO_up',
 'MCF7_Perturbations_from_GEO_down',
 'MCF7_Perturbations_from_GEO_up',
 'MGI_Mammalian_Phenotype_2013',
 'MGI_Mammalian_Phenotype_2017',
 'MGI_Mammalian_Phenotype_Level_3',
 'MGI_Mammalian_Phenotype_Level_4',
 'MGI_Mammalian_Phenotype_Level_4_2019',
 'MGI_Mammalian_Phenotype_Level_4_2021',
 'MSigDB_Computational',
 'MSigDB_Hallmark_2020',
 'MSigDB_Oncogenic_Signatures',
 'Microbe_Perturbations_from_GEO_down',
 'Microbe_Perturbations_from_GEO_up',
 'Mouse_Gene_Atlas',
 'NCI-60_Cancer_Cell_Lines',
 'NCI-Nature_2015',
 'NCI-Nature_2016',
 'NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions',
 'NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions',
 'NIH_Funded_PIs_2017_Human_AutoRIF',
 'NIH_Funded_PIs_2017_Human_GeneRIF',
 'NURSA_Human_Endogenous_Complexome',
 'OMIM_Disease',
 'OMIM_Expanded',
 'Old_CMAP_down',
 'Old_CMAP_up',
 'Orphanet_Augmented_2021',
 'PPI_Hub_Proteins',
 'PanglaoDB_Augmented_2021',
 'Panther_2015',
 'Panther_2016',
 'Pfam_Domains_2019',
 'Pfam_InterPro_Domains',
 'PheWeb_2019',
 'PhenGenI_Association_2021',
 'Phosphatase_Substrates_from_DEPOD',
 'ProteomicsDB_2020',
 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO',
 'RNAseq_Automatic_GEO_Signatures_Human_Down',
 'RNAseq_Automatic_GEO_Signatures_Human_Up',
 'RNAseq_Automatic_GEO_Signatures_Mouse_Down',
 'RNAseq_Automatic_GEO_Signatures_Mouse_Up',
 'Rare_Diseases_AutoRIF_ARCHS4_Predictions',
 'Rare_Diseases_AutoRIF_Gene_Lists',
 'Rare_Diseases_GeneRIF_ARCHS4_Predictions',
 'Rare_Diseases_GeneRIF_Gene_Lists',
 'Reactome_2013',
 'Reactome_2015',
 'Reactome_2016',
 'SILAC_Phosphoproteomics',
 'SubCell_BarCode',
 'SysMyo_Muscle_Gene_Sets',
 'TF-LOF_Expression_from_GEO',
 'TF_Perturbations_Followed_by_Expression',
 'TG_GATES_2020',
 'TRANSFAC_and_JASPAR_PWMs',
 'TRRUST_Transcription_Factors_2019',
 'Table_Mining_of_CRISPR_Studies',
 'TargetScan_microRNA',
 'TargetScan_microRNA_2017',
 'Tissue_Protein_Expression_from_Human_Proteome_Map',
 'Tissue_Protein_Expression_from_ProteomicsDB',
 'Transcription_Factor_PPIs',
 'UK_Biobank_GWAS_v1',
 'Virus-Host_PPI_P-HIPSTer_2020',
 'VirusMINT',
 'Virus_Perturbations_from_GEO_down',
 'Virus_Perturbations_from_GEO_up',
 'WikiPathway_2021_Human',
 'WikiPathways_2013',
 'WikiPathways_2015',
 'WikiPathways_2016',
 'WikiPathways_2019_Human',
 'WikiPathways_2019_Mouse',
 'dbGaP',
 'huMAP',
 'lncHUB_lncRNA_Co-Expression',
 'miRTarBase_2017']

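Rather than browsing the output folders, we can inspect the results directly in the notebook. Each result is a table that lists, for every enrichment term, its p-values and the genes from our list that are present in the database. We start with the enrichment for the healthy samples: first a preview of the table, then only the terms with adjusted p-value < 0.01.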

In [40]:

healthy_table = enrich_results['healthy'].results #get the table

In [41]:

healthy_table.head() #table preview

Out[41]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	ARCHS4_TFs_Coexp	YBX2 human tf ARCHS4 coexpression	16/299	1.298141e-17	3.810044e-15	32.703388	1271.606272	ROPN1B;SMCP;CRISP2;PRM2;CCDC110;PRM1;ODF2;DKKL...
1	ARCHS4_TFs_Coexp	HSF5 human tf ARCHS4 coexpression	16/299	1.298141e-17	3.810044e-15	32.703388	1271.606272	ROPN1B;SMCP;CRISP2;PRM2;PRM1;ODF2;HMGB4;CABYR;...
2	ARCHS4_TFs_Coexp	DHX57 human tf ARCHS4 coexpression	13/299	3.105686e-13	6.076792e-11	24.157248	695.737723	SMCP;CRISP2;PRM2;PRM1;ODF2;HMGB4;CABYR;CAPZA3;...
3	ARCHS4_TFs_Coexp	SOX30 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;ADAD1;...
4	ARCHS4_TFs_Coexp	CUL3 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;CCDC7;...
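
A note on reading the `Combined Score` column: in the preview above, its values are consistent with the odds ratio multiplied by the negative natural logarithm of the unadjusted p-value. The minimal sketch below checks this on two rows copied from the table. The helper name `combined_score` is hypothetical, and the formula is inferred from the displayed numbers, not taken from the library documentation.

```python
import math


def combined_score(p_value, odds_ratio):
    """Odds ratio weighted by -ln(p). Inferred from the table, not an official API."""
    return -math.log(p_value) * odds_ratio


# (term, P-value, Odds Ratio, reported Combined Score), copied from the preview above
rows = [
    ("YBX2", 1.298141e-17, 32.703388, 1271.606272),
    ("DHX57", 3.105686e-13, 24.157248, 695.737723),
]

for term, p_value, odds_ratio, reported in rows:
    # Print the recomputed score next to the value shown in the table
    print(term, round(combined_score(p_value, odds_ratio), 3), reported)
```

Under this reading, a term ranks high on the combined score only when it has both a large effect size (odds ratio) and a small p-value.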

Note the WikiPathway results at the end of the filtered table below: they include genes related to male infertility (when their expression is disrupted) and genes involved in the Cori cycle (essential for spermatogenesis). Sperm and testis are also recognized as the likely tissues of origin of our data. Relevant transcription factors highlighted here include HSF5 (early spermatogenesis), YBX2 (abnormal spermatogenesis when disrupted) and SOX30 (male fertility). Interpreting gene enrichment results of course requires some biological background to judge which terms are useful.

In [42]:

healthy_table[ healthy_table['Adjusted P-value']<0.01 ] #keep terms with adjusted p-value < 0.01

Out[42]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	ARCHS4_TFs_Coexp	YBX2 human tf ARCHS4 coexpression	16/299	1.298141e-17	3.810044e-15	32.703388	1271.606272	ROPN1B;SMCP;CRISP2;PRM2;CCDC110;PRM1;ODF2;DKKL...
1	ARCHS4_TFs_Coexp	HSF5 human tf ARCHS4 coexpression	16/299	1.298141e-17	3.810044e-15	32.703388	1271.606272	ROPN1B;SMCP;CRISP2;PRM2;PRM1;ODF2;HMGB4;CABYR;...
2	ARCHS4_TFs_Coexp	DHX57 human tf ARCHS4 coexpression	13/299	3.105686e-13	6.076792e-11	24.157248	695.737723	SMCP;CRISP2;PRM2;PRM1;ODF2;HMGB4;CABYR;CAPZA3;...
3	ARCHS4_TFs_Coexp	SOX30 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;ADAD1;...
4	ARCHS4_TFs_Coexp	CUL3 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;CCDC7;...
5	ARCHS4_TFs_Coexp	ZDHHC19 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;CCDC7;...
6	ARCHS4_TFs_Coexp	HIST1H1T human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	ROPN1B;PRM2;CCDC110;PRM1;CABYR;ODF2;GSG1;ADAD1...
7	ARCHS4_TFs_Coexp	RFX4 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;TRIM36;PGK2...
8	ARCHS4_TFs_Coexp	ZNF541 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;ADAD1;...
9	ARCHS4_TFs_Coexp	DMRTC2 human tf ARCHS4 coexpression	12/299	7.312013e-12	4.292151e-10	21.635430	554.764935	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ODF2;CCDC7;...
10	ARCHS4_TFs_Coexp	NFYA human tf ARCHS4 coexpression	11/299	1.544050e-10	6.971980e-09	19.255876	435.018007	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;PGK2;TNP1;AC...
11	ARCHS4_TFs_Coexp	ZIM2 human tf ARCHS4 coexpression	11/299	1.544050e-10	6.971980e-09	19.255876	435.018007	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;ADAD1;PGK2;...
12	ARCHS4_TFs_Coexp	ZNF473 human tf ARCHS4 coexpression	11/299	1.544050e-10	6.971980e-09	19.255876	435.018007	SMCP;CRISP2;PRM2;CABYR;ODF2;ADAD1;PGK2;TNP1;AC...
13	ARCHS4_TFs_Coexp	ZNF628 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;NUPR2;PGK2;T...
14	ARCHS4_TFs_Coexp	ADAMTS17 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;PGK2;TNP1;A...
15	ARCHS4_TFs_Coexp	JRKL human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CAPZA3;NUPR2;PGK2;TNP1;A...
16	ARCHS4_TFs_Coexp	HILS1 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;PGK2;TNP1;AC...
17	ARCHS4_TFs_Coexp	ZNF213 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;CABYR;CAPZA3;ODF2;NUPR2;PGK2;...
18	ARCHS4_TFs_Coexp	ZNF513 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;NUPR2;PGK2;T...
19	ARCHS4_TFs_Coexp	EVX2 human tf ARCHS4 coexpression	10/299	2.906607e-09	8.530891e-08	17.007785	334.309785	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;PGK2;TNP1;AC...
20	ARCHS4_TFs_Coexp	ETV2 human tf ARCHS4 coexpression	9/299	4.841854e-08	1.184237e-06	14.881413	250.653339	SMCP;CRISP2;PRM2;PRM1;CAPZA3;PGK2;TNP1;ACTRT2;...
21	ARCHS4_TFs_Coexp	ZC3H10 human tf ARCHS4 coexpression	9/299	4.841854e-08	1.184237e-06	14.881413	250.653339	SMCP;CRISP2;PRM2;PRM1;CABYR;CAPZA3;PGK2;TNP1;A...
22	ARCHS4_TFs_Coexp	ZC3H18 human tf ARCHS4 coexpression	9/299	4.841854e-08	1.184237e-06	14.881413	250.653339	SMCP;CRISP2;PRM2;PRM1;CAPZA3;ODF2;PGK2;TNP1;AC...
23	ARCHS4_TFs_Coexp	EVX1 human tf ARCHS4 coexpression	9/299	4.841854e-08	1.184237e-06	14.881413	250.653339	SMCP;CRISP2;PRM2;PRM1;CAPZA3;PGK2;TNP1;ACTRT2;...
24	ARCHS4_TFs_Coexp	PAX2 human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	SMCP;CRISP2;PRM2;CAPZA3;ODF2;PGK2;TNP1;ACTRT2
25	ARCHS4_TFs_Coexp	MAEL human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	PRM2;PRM1;CABYR;CAPZA3;TNP1;ACTRT2;HMGB4;BRDT
26	ARCHS4_TFs_Coexp	SHOX2 human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	SMCP;CRISP2;PRM2;PRM1;CAPZA3;PGK2;TNP1;ACTRT2
27	ARCHS4_TFs_Coexp	RNF113B human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	ROPN1B;PRM2;CCDC110;GSG1;ADAD1;IFT57;ROPN1;BRDT
28	ARCHS4_TFs_Coexp	HMGB4 human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	SMCP;CRISP2;PRM2;PRM1;CAPZA3;PGK2;TNP1;ACTRT2
29	ARCHS4_TFs_Coexp	FOXO6 human tf ARCHS4 coexpression	8/299	7.073288e-07	1.384007e-05	12.867943	182.232853	SMCP;CRISP2;PRM2;CAPZA3;ODF2;PGK2;TNP1;ACTRT2
30	ARCHS4_TFs_Coexp	ZCCHC6 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	SMCP;CRISP2;PRM2;PRM1;CAPZA3;TNP1;ACTRT2
31	ARCHS4_TFs_Coexp	GTF2F1 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	SMCP;CRISP2;PRM2;ODF2;PGK2;TNP1;ACTRT2
32	ARCHS4_TFs_Coexp	ZNF578 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	PRM2;PRM1;CAPZA3;GSG1;SPATA4;TNP1;HMGB4
33	ARCHS4_TFs_Coexp	ZFAT human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	SMCP;CRISP2;PRM2;PRM1;CAPZA3;TNP1;ACTRT2
34	ARCHS4_TFs_Coexp	ZSCAN20 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	SMCP;CRISP2;PRM2;PRM1;CAPZA3;TNP1;HMGB4
35	ARCHS4_TFs_Coexp	ZNF668 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	PRM2;PRM1;CAPZA3;TNP1;ACTRT2;HMGB4;BRDT
36	ARCHS4_TFs_Coexp	ZNF683 human tf ARCHS4 coexpression	7/299	8.960863e-06	1.421629e-04	10.959382	127.376996	SMCP;CRISP2;PRM2;PRM1;CAPZA3;TNP1;ACTRT2
37	ARCHS4_TFs_Coexp	RFX3 human tf ARCHS4 coexpression	6/299	9.705912e-05	1.356517e-03	9.148464	84.533550	ARMC3;MORN2;TRIM36;NME5;FAM81B;MLF1
38	ARCHS4_TFs_Coexp	RNF138 human tf ARCHS4 coexpression	6/299	9.705912e-05	1.356517e-03	9.148464	84.533550	SMCP;CRISP2;CABYR;BAG5;ADAD1;PGK2
39	ARCHS4_TFs_Coexp	TBX1 human tf ARCHS4 coexpression	6/299	9.705912e-05	1.356517e-03	9.148464	84.533550	SMCP;PRM2;PRM1;CAPZA3;TNP1;ACTRT2
40	ARCHS4_TFs_Coexp	DMRTB1 human tf ARCHS4 coexpression	6/299	9.705912e-05	1.356517e-03	9.148464	84.533550	SMCP;CRISP2;CABYR;ADAD1;PGK2;BRDT
41	ARCHS4_TFs_Coexp	ZNF608 human tf ARCHS4 coexpression	6/299	9.705912e-05	1.356517e-03	9.148464	84.533550	SMCP;CRISP2;PRM2;TRIM36;PGK2;TNP1
606	WikiPathway_2021_Human	Glycolysis and Gluconeogenesis WP534	3/45	1.937400e-04	2.131140e-03	30.255319	258.652525	MPC2;PGK2;PFKP
607	WikiPathway_2021_Human	Male infertility WP4673	4/146	4.835142e-04	2.659328e-03	12.129822	92.604278	CRISP2;PRM2;PRM1;BRDT
608	WikiPathway_2021_Human	Cori Cycle WP1946	2/17	8.132713e-04	2.981995e-03	55.375000	393.962434	PGK2;PFKP
617	ARCHS4_Tissues	SPERM	21/2316	4.714050e-08	4.666910e-06	5.570656	93.977707	ROPN1B;CRISP2;CCDC110;ODF2;DNAJC5B;DKKL1;MLF1;...
618	ARCHS4_Tissues	TESTIS (BULK TISSUE)	19/2316	1.303973e-06	6.454665e-05	4.710309	63.825139	ROPN1B;CRISP2;CCDC110;DKKL1;MLF1;HMGB4;CABYR;C...
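
With dozens of co-expression terms at the top, the pathway and tissue terms are easy to miss. A minimal sketch of how one might narrow the table to a single gene-set library and sort it by adjusted p-value is given here. It assumes only the column names shown above; the helper `significant_terms` and the small `demo_table` (a few rows copied from the output) are hypothetical and stand in for `healthy_table`.

```python
import pandas as pd

# A few rows copied from the filtered table, used as a stand-in for healthy_table
demo_table = pd.DataFrame(
    [
        ("ARCHS4_TFs_Coexp", "YBX2 human tf ARCHS4 coexpression", "16/299", 3.810044e-15),
        ("WikiPathway_2021_Human", "Male infertility WP4673", "4/146", 2.659328e-03),
        ("WikiPathway_2021_Human", "Cori Cycle WP1946", "2/17", 2.981995e-03),
        ("WikiPathway_2021_Human", "Glycolysis and Gluconeogenesis WP534", "3/45", 2.131140e-03),
        ("ARCHS4_Tissues", "SPERM", "21/2316", 4.666910e-06),
    ],
    columns=["Gene_set", "Term", "Overlap", "Adjusted P-value"],
)


def significant_terms(table, gene_set, alpha=0.01):
    """Return the terms of one gene-set library below alpha, most significant first."""
    mask = (table["Gene_set"] == gene_set) & (table["Adjusted P-value"] < alpha)
    return table.loc[mask].sort_values("Adjusted P-value").reset_index(drop=True)


wiki_hits = significant_terms(demo_table, "WikiPathway_2021_Human")
print(wiki_hits[["Term", "Overlap", "Adjusted P-value"]])
```

The same call on the full table, for example `significant_terms(healthy_table, 'ARCHS4_Tissues')`, would list only the tissue terms.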

In the azoospermic patient, most of the enrichment terms are related to ribosomal genes and therefore to processes such as rRNA binding. There is not much additional information to extract from this table.

In [43]:

azoos_table = enrich_results['azoospermic'].results

In [44]:

azoos_table.head() #table preview

Out[44]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
0	ARCHS4_TFs_Coexp	EIF3K human tf ARCHS4 coexpression	25/299	7.600257e-33	4.377748e-30	71.810219	5310.877416	RPL10;RPL12;RPL34;RPLP1;RPLP0;RPL36A;RPL10A;UB...
1	ARCHS4_TFs_Coexp	RFXANK human tf ARCHS4 coexpression	21/299	1.044332e-25	3.007676e-23	51.241875	2947.496732	PRELID1;RPS9;RPL41;RPL10;RPS8;RPL34;RPLP1;RPLP...
2	ARCHS4_TFs_Coexp	FOXB1 human tf ARCHS4 coexpression	18/299	9.769366e-21	1.875718e-18	39.372998	1814.112285	RPL41;RPL21;RPL34;RPLP1;RPLP0;RPL36A;RPL6;UBL5...
3	ARCHS4_TFs_Coexp	POU3F1 human tf ARCHS4 coexpression	17/299	3.725729e-19	5.365050e-17	35.929078	1524.609259	RPL41;RPL21;RPL34;RPLP1;RPL36A;RPL6;UBL5;RPS14...
4	ARCHS4_TFs_Coexp	SOX2 human tf ARCHS4 coexpression	16/299	1.298141e-17	1.495459e-15	32.703388	1271.606272	RPL41;RPL21;RPL34;RPLP1;RPL36A;RPL6;UBL5;RPS14...

In [45]:

azoos_table[ azoos_table['Adjusted P-value']<0.01 ]

Out[45]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Old P-value	Old Adjusted P-value	Odds Ratio	Combined Score	Genes
0	ARCHS4_TFs_Coexp	EIF3K human tf ARCHS4 coexpression	25/299	7.600257e-33	4.377748e-30	0	0	71.810219	5310.877416	RPL10;RPL12;RPL34;RPLP1;RPLP0;RPL36A;RPL10A;UB...
1	ARCHS4_TFs_Coexp	RFXANK human tf ARCHS4 coexpression	21/299	1.044332e-25	3.007676e-23	0	0	51.241875	2947.496732	PRELID1;RPS9;RPL41;RPL10;RPS8;RPL34;RPLP1;RPLP...
2	ARCHS4_TFs_Coexp	FOXB1 human tf ARCHS4 coexpression	18/299	9.769366e-21	1.875718e-18	0	0	39.372998	1814.112285	RPL41;RPL21;RPL34;RPLP1;RPLP0;RPL36A;RPL6;UBL5...
3	ARCHS4_TFs_Coexp	POU3F1 human tf ARCHS4 coexpression	17/299	3.725729e-19	5.365050e-17	0	0	35.929078	1524.609259	RPL41;RPL21;RPL34;RPLP1;RPL36A;RPL6;UBL5;RPS14...
4	ARCHS4_TFs_Coexp	SOX2 human tf ARCHS4 coexpression	16/299	1.298141e-17	1.495459e-15	0	0	32.703388	1271.606272	RPL41;RPL21;RPL34;RPLP1;RPL36A;RPL6;UBL5;RPS14...
...	...	...	...	...	...	...	...	...	...	...
656	GO_Molecular_Function_2021	rRNA binding (GO:0019843)	6/42	8.766374e-10	3.199726e-08	0	0	75.431818	1573.125115	RPS14;RPS9;RPL12;RPLP0;RPS3;NOP53
657	GO_Molecular_Function_2021	mRNA binding (GO:0003729)	7/263	3.869450e-06	9.415662e-05	0	0	12.523438	156.072065	SRRM2;RPS26;RPS14;RPL41;PCBP2;RPS3;RPS2
658	GO_Molecular_Function_2021	large ribosomal subunit rRNA binding (GO:0070180)	2/5	6.095679e-05	1.112461e-03	0	0	277.041667	2688.785055	RPL12;RPLP0
659	GO_Molecular_Function_2021	cadherin binding (GO:0045296)	6/322	1.454970e-04	2.124256e-03	0	0	8.472670	74.859047	RPS26;RANBP1;RPL34;RPL14;RPS2;RPL6
660	GO_Molecular_Function_2021	small ribosomal subunit rRNA binding (GO:0070181)	2/9	2.180466e-04	2.652900e-03	0	0	118.708333	1000.806420	RPS14;RPS3

82 rows × 10 columns
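
To put a number on the observation that ribosomal genes dominate these terms, one could count how many genes in each `Genes` entry look like ribosomal protein genes. The sketch below does this for the five GO terms shown above. The helper `ribosomal_fraction` is hypothetical, and matching on the `RPL`/`RPS` prefixes is a crude heuristic (an assumption of this example) that would also catch unrelated genes such as ribosomal protein S6 kinases.

```python
# Genes column of the GO_Molecular_Function_2021 rows, copied from the table above
go_terms = {
    "rRNA binding (GO:0019843)": "RPS14;RPS9;RPL12;RPLP0;RPS3;NOP53",
    "mRNA binding (GO:0003729)": "SRRM2;RPS26;RPS14;RPL41;PCBP2;RPS3;RPS2",
    "large ribosomal subunit rRNA binding (GO:0070180)": "RPL12;RPLP0",
    "cadherin binding (GO:0045296)": "RPS26;RANBP1;RPL34;RPL14;RPS2;RPL6",
    "small ribosomal subunit rRNA binding (GO:0070181)": "RPS14;RPS3",
}


def ribosomal_fraction(genes, prefixes=("RPL", "RPS")):
    """Fraction of genes in a ';'-separated string whose symbol starts with a given prefix."""
    names = genes.split(";")
    hits = [name for name in names if name.startswith(prefixes)]
    return len(hits) / len(names)


for term, genes in go_terms.items():
    # Report the share of ribosomal-looking genes behind each enrichment term
    print(f"{term}: {ribosomal_fraction(genes):.2f}")
```

On the full table the same helper could be applied with `azoos_table['Genes'].apply(ribosomal_fraction)`. Even a term like cadherin binding turns out to be driven mostly by ribosomal protein genes here, which supports reading these results with caution.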

Wrapping up

We have performed a basic analysis of one dataset against the other, and seen how much relevant information we can extract about what characterizes azoospermic patients, both in terms of specific genes that are no longer expressed and in terms of enrichment terms. Note that gene enrichment can be applied in many other types of analysis; this application is just one showcase.