Use pretrained models of scVI-hub for CELLxGENE

Use pretrained models of scVI-hub for CELLxGENE#

This notebook was contributed by Can Ergen and Martin Kim.

Original article: Scvi-hub: an actionable repository for model-driven single cell analysis

https://www.biorxiv.org/content/10.1101/2024.03.01.582887

The anndata object we’re using here is a subset of the full CELLxGENE census data. Use: s3://cellxgene-contrib-public/models/scvi/2024-02-12/mus_musculus/adata_minified.h5ad

Steps performed:

Loading the minified data from AWS
Setting up minified model with minified data
Visualize the latent space
Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
!pip install --quiet cellxgene-census
!pip install --quiet pybiomart
from scvi_colab import install

install()

import os
import tempfile

import botocore
import cellxgene_census
import numpy as np
import plotnine as p9
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scvi.hub import HubModel

scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)

Last run with scvi-tools version: 1.4.2

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

We download a toy example here (all cells from the spinal cord). To download the full census use backup_url: https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata_minified.h5ad. Expect that the download of the full census takes ~30 minutes with a high bandwidth internet connection.

We share models for mouse at https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/mus_musculus (same folder structure).

adata_path = os.path.join(save_dir.name, "adata-spinal-cord-minified.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata-spinal-cord-minified.h5ad",
)

model = HubModel.pull_from_s3(
    s3_bucket="cellxgene-contrib-public",
    s3_path="models/scvi/2024-02-12/homo_sapiens/modelhub",
    pull_anndata=False,
    config=botocore.config.Config(signature_version=botocore.UNSIGNED),
)

adata, model

HubModel with:
local_dir: /tmp/tmp84uddqqb
model loaded? No
adata loaded? No
large_training_adata loaded? No
metadata:
HubMetadata(scvi_version='1.0.4', anndata_version='0.8.0', model_cls_name='SCVI', training_data_url=None, model_parent_module='scvi.model')
model_card:

-------------------------------------------------------------------------------------------------------------------

license: cc-by-4.0                                                                                                 

library_name: scvi-tools                                                                                           

tags:                                                                                                              

 • biology                                                                                                         
 • genomics                                                                                                        
 • single-cell                                                                                                     
 • model_cls_name:SCVI                                                                                             
 • scvi_version:1.0.4                                                                                              
 • anndata_version:0.8.0                                                                                           

-------------------------------------------------------------------------------------------------------------------

                                                    Description                                                    

SCVI model trained on the homo_sapiens data of the CELLxGENE Discover Census.                                      

                                                 Model properties                                                  

Many model properties are in the model tags. Some more are listed below.                                           

model_init_params:                                                                                                 

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "n_hidden": 128,                                                                                              
                                                                                                                   
     "n_latent": 50,                                                                                               
                                                                                                                   
     "n_layers": 2,                                                                                                
                                                                                                                   
     "dropout_rate": 0.1,                                                                                          
                                                                                                                   
     "dispersion": "gene",                                                                                         
                                                                                                                   
     "gene_likelihood": "nb",                                                                                      
                                                                                                                   
     "latent_distribution": "normal",                                                                              
                                                                                                                   
     "encode_covariates": false                                                                                    
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_setup_anndata_args:                                                                                          

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "layer": null,                                                                                                
                                                                                                                   
     "batch_key": "batch",                                                                                         
                                                                                                                   
     "labels_key": null,                                                                                           
                                                                                                                   
     "size_factor_key": null,                                                                                      
                                                                                                                   
     "categorical_covariate_keys": null,                                                                           
                                                                                                                   
     "continuous_covariate_keys": null                                                                             
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_summary_stats:                                                                                               

|     Summary Stat Key     |   Value  |                                                                            

|--------------------------|----------|                                                                            

|         n_batch          |   6006   |                                                                            

|         n_cells          | 34544157 |                                                                            

| n_extra_categorical_covs |    0     |                                                                            

| n_extra_continuous_covs  |    0     |                                                                            

|         n_labels         |    1     |                                                                            

|          n_vars          |  8000    |                                                                            

model_data_registry:                                                                                               

|   Registry Key    |         scvi-tools Location          |                                                       

|-------------------|--------------------------------------|                                                       

|         X         |               adata.X                |                                                       

|       batch       |       adata.obs['_scvi_batch']       |                                                       

|      labels       |      adata.obs['_scvi_labels']       |                                                       

model_parent_module: scvi.model                                                                                    

data_is_minified: To be added...                                                                                   

                                                   Training data                                                   

This is an optional link to where the training data is stored if it is too large                                   

to host on the huggingface Model hub.                                                                              


Training data url: N/A                                                                                             

                                                   Training code                                                   

This is an optional link to the code used to train the model.                                                      

Training code url: N/A                                                                                             

                                                    References                                                     

To be added...

(AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 )

Setup minified model#

Census was trained on all primary cells. We don’t encode covariates so inference and generating latent codes works without retraining on these batches. We have to subset to all training batches. The setup will be optimized in a future version of scvi-tools.

del adata.uns["_scvi_adata_minify_type"]
model.load_model(adata=adata[adata.obs["is_primary_data"]].copy())
census_model = model.model

INFO     Loading model...                                                                                          
INFO     File /tmp/tmp84uddqqb/model.pt already downloaded

census_model_all = scvi.model.SCVI.load_query_data(adata, census_model)

For spinal cord less than a half of cells is labeled as primary cells, while the other cells are duplicated from this dataset. census_model_all contains all cells while census_model only contains primary cells.

adata.obs["is_primary_data"].value_counts()

is_primary_data
False    67707
True     49756
Name: count, dtype: int64

census_model.adata, census_model_all.adata

(AnnData object with n_obs × n_vars = 49756 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv')

By default scvi-tools loads models as not minified. We set up the model here with minified data, so we minify the model using the respective obsm fields.

census_model.minify_adata(
    use_latent_qzm_key="_scvi_latent_qzm", use_latent_qzv_key="_scvi_latent_qzv"
)

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup                             
INFO     Generating sequential column names                                                                        
INFO     Generating sequential column names

Get the latent space and compute UMAP#

sc.pp.neighbors(census_model.adata, n_neighbors=20, use_rep="_scvi_latent_qzm")
sc.tl.umap(census_model.adata)
sc.pl.umap(census_model.adata, color=["tissue", "cell_type", "assay", "disease"], ncols=1)

../../../_images/c7ec95e67c6089b79e703561290c4eb16fd91d30afaec8016dd4575eea72d4c8.png

Performing Differential Expression in scVI#

While we only have access to the minified data, we can still perform downstream analysis using the generative part of the model.

Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:

Whether the gene’s expression levels are significantly different in the A and B sets of cells.
An effect size that quantifies the strength of the differential expression.

Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.

Selecting cell subpopulations to compare#

# let's take a look at abundances of different cell types
print(
    census_model.adata.obs["cell_type"].value_counts(),
    "\n\n\n\n",
    census_model.adata.obs["tissue"].value_counts(),
)

cell_type
oligodendrocyte                                        20708
neuron                                                  7831
astrocyte                                               4449
oligodendrocyte precursor cell                          3191
microglial cell                                         1864
GABAergic neuron                                        1790
central nervous system macrophage                       1777
erythroid lineage cell                                  1684
capillary endothelial cell                               915
mural cell                                               879
neutrophil                                               708
erythroid progenitor cell                                606
glutamatergic neuron                                     576
leukocyte                                                542
endothelial cell of artery                               330
endothelial cell                                         247
mesenchymal stem cell                                    238
cerebellar granule cell                                  236
macrophage                                               159
ependymal cell                                           147
professional antigen presenting cell                     140
vascular associated smooth muscle cell                   125
cord blood hematopoietic stem cell                        87
pericyte                                                  78
differentiation-committed oligodendrocyte precursor       75
fibroblast                                                69
monocyte                                                  54
dendritic cell                                            52
B cell                                                    35
T cell                                                    33
chondrocyte                                               32
epithelial cell                                           29
renal intercalated cell                                   16
primordial germ cell                                      11
stromal cell                                              11
plasma cell                                                8
Bergmann glial cell                                        7
alternatively activated macrophage                         4
kidney loop of Henle epithelial cell                       3
cell of skeletal muscle                                    3
embryonic stem cell                                        2
choroid plexus epithelial cell                             1
endocrine cell                                             1
enterocyte                                                 1
stratified epithelial cell                                 1
smooth muscle cell                                         1
Name: count, dtype: int64 



 tissue
spinal cord                          30106
cervical spinal cord white matter    19650
Name: count, dtype: int64

scVI provides several options to identify the two populations of interest.

cell_type = "oligodendrocyte"
tissue1 = "spinal cord"
cell_idx1 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue1,
)
print(sum(cell_idx1), "cells from tissue", tissue1)

tissue2 = "cervical spinal cord white matter"
cell_idx2 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue2,
)
print(sum(cell_idx2), "cells of type", tissue2)

# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"

11547 cells from tissue spinal cord
9161 cells of type cervical spinal cord white matter

Running DE analyses#

A simple DE analysis can then be performed using the following command

de_change = census_model.differential_expression(
    idx1=cell_idx1, idx2=cell_idx2, all_stats=False, mode="change"
)

This method returns a pandas DataFrame, where each row corresponds to a gene. The most important columns of this dataframe are the following. proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE; lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1; is_de_fdr_0.05 is True when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target in the differential_expression method.

Volcano plot with p-values#

de_change["log10_pscore"] = np.log10(de_change["proba_not_de"] + 1e-6)
de_change = de_change.join(census_model.adata.var, how="inner")
de_change = de_change.loc[np.max(de_change[["scale1", "scale2"]], axis=1) > 1e-4]
de_change["feature_id"] = de_change.index
de_change.index = de_change["feature_name"]
de_change.head(20)

	proba_de	proba_not_de	bayes_factor	scale1	scale2	pseudocounts	delta	lfc_mean	lfc_median	lfc_std	lfc_min	lfc_max	is_de_fdr_0.05	log10_pscore	soma_joinid	feature_name	feature_length	nnz	n_measured_obs	feature_id
feature_name
SOX2	1.0	0.0	18.420681	3.384034e-06	1.123714e-04	4.288460e-08	0.25	-5.039413	-5.034239	0.500462	-7.230298	-2.478858	True	-6.0	25043	SOX2	2512	3032247	58220612	ENSG00000181449
SOX2-OT	1.0	0.0	18.420681	4.396113e-03	7.087180e-07	4.288460e-08	0.25	12.826218	12.834159	0.823557	10.126595	17.054296	True	-6.0	25044	SOX2-OT	31637	14163978	36686207	ENSG00000242808
RP11-964E11.3	1.0	0.0	18.420681	1.923145e-04	2.241279e-06	4.288460e-08	0.25	6.560994	6.617713	1.492750	0.508492	10.717464	True	-6.0	49923	RP11-964E11.3	2073	1224228	26271998	ENSG00000283098
RP11-237N2.1	1.0	0.0	18.420681	3.558666e-04	4.184131e-07	4.288460e-08	0.25	10.095840	10.152714	1.692653	4.902654	15.073298	True	-6.0	49855	RP11-237N2.1	10986	4066655	21167246	ENSG00000286746
RP11-272B17.3	1.0	0.0	18.420681	7.729452e-04	3.016204e-06	4.288460e-08	0.25	8.106521	8.106206	0.723615	0.603325	11.951858	True	-6.0	49723	RP11-272B17.3	6636	1155875	20970779	ENSG00000286279
RPS10	1.0	0.0	18.420681	1.541836e-04	5.661847e-06	4.288460e-08	0.25	4.840816	4.891184	1.233305	0.625889	8.894837	True	-6.0	22445	RPS10	2302	26915086	58641444	ENSG00000124614
RWDD3	1.0	0.0	18.420681	1.028717e-04	8.987226e-06	4.288460e-08	0.25	3.543085	3.551251	0.334796	2.384393	4.760509	True	-6.0	23058	RWDD3	1927	6622342	58885186	ENSG00000122481
ZC3H11A	1.0	0.0	18.420681	4.557053e-04	1.018896e-04	4.288460e-08	0.25	2.195212	2.148960	0.434324	0.880529	3.938943	True	-6.0	28966	ZC3H11A	12829	12692037	62662199	ENSG00000058673
RPL32P3	1.0	0.0	18.420681	1.700545e-04	1.103668e-07	4.288460e-08	0.25	10.739322	10.720686	0.726264	8.571869	13.156144	True	-6.0	21965	RPL32P3	4504	6764181	24635964	ENSG00000251474
RPL17	1.0	0.0	18.420681	1.745479e-04	5.694760e-06	4.288460e-08	0.25	4.926170	4.978228	1.090603	0.999575	8.551223	True	-6.0	21496	RPL17	2451	28351933	59434543	ENSG00000265681
UBE2V1	1.0	0.0	18.420681	1.389076e-04	8.478121e-06	4.288460e-08	0.25	4.083253	4.081970	0.449361	2.604101	5.697139	True	-6.0	27926	UBE2V1	4818	13729803	56545393	ENSG00000244687
ABHD16A	1.0	0.0	18.420681	1.084239e-04	1.877089e-05	4.288460e-08	0.25	2.600349	2.603707	0.737122	0.374827	4.662211	True	-6.0	29821	ABHD16A	4059	6625528	55538094	ENSG00000204427
C5orf17	1.0	0.0	18.420681	6.683705e-04	1.119344e-04	4.288460e-08	0.25	2.693624	2.658252	0.861144	0.297724	5.964550	True	-6.0	29989	C5orf17	1984	3961683	59850491	ENSG00000248874
ZNF56	1.0	0.0	18.420681	2.485951e-04	1.208375e-07	4.288460e-08	0.25	11.136928	11.129822	0.671358	8.913495	13.393724	True	-6.0	29451	ZNF56	2355	5381008	23563306	ENSG00000267419
PRKCQ-AS1	1.0	0.0	18.420681	1.226647e-04	6.104561e-06	4.288460e-08	0.25	4.233367	4.243305	0.951379	1.049817	7.125087	True	-6.0	19228	PRKCQ-AS1	5834	3056965	40400414	ENSG00000237943
MTRNR2L12	1.0	0.0	18.420681	2.038424e-07	1.365834e-04	4.288460e-08	0.25	-9.679708	-9.755687	2.109375	-16.073889	-2.684278	True	-6.0	31235	MTRNR2L12	1049	13615749	57188872	ENSG00000269028
RP11-326E22.2	1.0	0.0	18.420681	2.041467e-04	5.539313e-07	4.288460e-08	0.25	8.633562	8.685425	1.032294	3.045305	11.544564	True	-6.0	55102	RP11-326E22.2	2597	3155057	12629433	ENSG00000285579
BCOR	1.0	0.0	18.420681	4.525876e-05	1.604009e-04	4.288460e-08	0.25	-1.832544	-1.841138	0.444156	-3.198434	-0.359479	True	-6.0	1872	BCOR	9965	7262911	62379638	ENSG00000183337
RP11-13A2.5	1.0	0.0	18.420681	2.157081e-04	4.321780e-07	4.288460e-08	0.25	9.080471	9.034204	0.743024	6.708925	12.421048	True	-6.0	55800	RP11-13A2.5	3379	3128509	11181954	ENSG00000285837
MRPS31P5	1.0	0.0	18.420681	1.029742e-04	9.368161e-08	4.288460e-08	0.25	10.212408	10.177984	0.640787	8.041052	12.319384	True	-6.0	15273	MRPS31P5	3873	2281179	23426996	ENSG00000243406

gene_annotations = sc.queries.biomart_annotations(
    org="hsapiens",
    attrs=["ensembl_gene_id", "gene_biotype"],
)

gene_annotations.index = gene_annotations["ensembl_gene_id"]
gene_annotation_dict = gene_annotations["gene_biotype"].to_dict()
de_change["Biotype"] = [
    gene_annotation_dict.pop(i, "Unannotated") for i in de_change["feature_id"]
]
de_change["Biotype"].value_counts()

Biotype
protein_coding                        814
lncRNA                                 69
transcribed_unprocessed_pseudogene      7
Unannotated                             4
Mt_rRNA                                 2
transcribed_unitary_pseudogene          1
Name: count, dtype: int64

(
    p9.ggplot(de_change, p9.aes("lfc_mean", "-log10_pscore", color="Biotype"))
    + p9.geom_point(
        de_change.query("Biotype == 'protein_coding'"), alpha=0.5
    )  # Plot other genes with transparence
    + p9.xlim(-10, 10)  # Set x limits
    + p9.ylim(0, 7)  # Set y limits
    + p9.geom_point(de_change.query("Biotype != 'protein_coding'"))
    + p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
)

../../../_images/0f889326b63469fafd9bc5b32b64ad7e635324fa6866616d9059fde94d72b9cb.png

upregulated_genes = de_change.loc[
    de_change["lfc_median"] > 0, ["feature_id", "feature_name"]
].head(4)

Display generated counts from scVI model

census_model.adata[:, upregulated_genes["feature_id"]] = census_model.get_normalized_expression(
    gene_list=list(upregulated_genes["feature_id"]), n_samples=10
)
sc.pl.umap(
    census_model.adata,
    color=upregulated_genes["feature_name"],
    gene_symbols="feature_name",
    cmap="viridis",
)

../../../_images/c192033e6980c467e196b22a82cfe486c3c6de0034e052c142ad468fff681d53.png

Performing differential expression and yield expression from census#

Now we perform DE between each cell type vs all other cells and make a dotplot of the result.

Performing differential expression#

# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1 vs all results, contained in a DataFrame
change_per_cluster_de = census_model.differential_expression(
    adata=census_model.adata[census_model.adata.obs["assay"] == "10x 3' v3"],
    groupby="cell_type",
    all_stats=False,
    mode="change",
)

INFO     Received view of anndata, making copy.                                                                    
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

cell_types = (
    adata.obs["cell_type"]
    .value_counts()
    # .loc[lambda x: (x >= 500) & (x.index != "nan")]
    .loc[lambda x: x.index != "nan"]
    .to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"

change_per_cluster_de = change_per_cluster_de.join(census_model.adata.var, how="inner")
change_per_cluster_de = change_per_cluster_de[
    change_per_cluster_de[["scale1", "scale2"]].max(axis=1) > 1e-4
]
change_per_cluster_de.head(20)

	proba_de	proba_not_de	bayes_factor	scale1	scale2	pseudocounts	delta	lfc_mean	lfc_median	lfc_std	...	lfc_max	is_de_fdr_0.05	comparison	group1	group2	soma_joinid	feature_name	feature_length	nnz	n_measured_obs
feature_id
ENSG00000204929	0.9884	0.0116	4.445082	0.000939	0.000113	3.200916e-07	0.25	4.635637	4.416831	2.389565	...	12.149960	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	33810	AC074391.1	12344	7205936	41750524
ENSG00000287544	0.9840	0.0160	4.119037	0.000313	0.000015	3.200916e-07	0.25	9.663720	8.085569	4.618678	...	19.936712	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	48757	RP11-624A4.2	6317	589676	21095338
ENSG00000151322	0.9824	0.0176	4.022099	0.015489	0.004528	3.200916e-07	0.25	2.478718	1.656276	1.884972	...	9.619658	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	16408	NPAS3	10811	20379621	61713514
ENSG00000288560	0.9784	0.0216	3.813225	0.000164	0.000023	3.200916e-07	0.25	5.707840	4.334204	4.024246	...	17.234283	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	60533	RP1-236J16.2	1303	730034	12430934
ENSG00000233639	0.9780	0.0220	3.794467	0.000868	0.000143	3.200916e-07	0.25	5.063036	3.226929	3.691148	...	16.205431	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	11839	PANTR1	9231	11000299	39610997
ENSG00000236790	0.9774	0.0226	3.766946	0.002097	0.000147	3.200916e-07	0.25	6.707326	7.455571	2.670608	...	12.378292	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	11388	LINC00299	22322	5588941	62289901
ENSG00000261404	0.9762	0.0238	3.713981	0.000547	0.000159	3.200916e-07	0.25	2.333580	1.784798	1.570074	...	7.553150	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	33187	PSMD7-DT	7473	5476988	57001363
ENSG00000080493	0.9758	0.0242	3.696905	0.003838	0.000327	3.200916e-07	0.25	5.944194	6.824965	2.573866	...	10.528049	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	24192	SLC4A4	9153	10506365	62846587
ENSG00000287292	0.9740	0.0260	3.623314	0.001743	0.000187	3.200916e-07	0.25	7.771810	6.196614	4.915103	...	17.857506	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	48727	RP11-563M4.2	4010	7893116	21167246
ENSG00000273118	0.9720	0.0280	3.547151	0.000380	0.000086	3.200916e-07	0.25	4.649741	2.213397	3.780624	...	13.904197	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	45190	AC079610.1	433	4469301	32099560
ENSG00000145476	0.9704	0.0296	3.489934	0.000222	0.000098	3.200916e-07	0.25	1.251280	1.289731	0.482151	...	2.898293	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	4770	CYP4V2	9767	9194672	62664775
ENSG00000091138	0.9700	0.0300	3.476098	0.000005	0.000301	3.200916e-07	0.25	-3.930191	-2.347409	2.798681	...	1.171524	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	24047	SLC26A3	3382	4731736	57602392
ENSG00000083067	0.9694	0.0306	3.455677	0.016102	0.001976	3.200916e-07	0.25	5.801075	7.135654	3.094589	...	11.319915	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	27473	TRPM3	20015	13113971	62015147
ENSG00000234377	0.9662	0.0338	3.352910	0.006446	0.000491	3.200916e-07	0.25	7.251151	7.934378	2.585241	...	11.855108	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	21043	OBI1-AS1	8790	2160307	52308922
ENSG00000231304	0.9634	0.0366	3.270420	0.000430	0.000079	3.200916e-07	0.25	4.959074	2.800905	3.595532	...	13.147403	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	23658	SGO1-AS1	6571	3336199	39961949
ENSG00000164089	0.9616	0.0384	3.220541	0.001599	0.000128	3.200916e-07	0.25	7.381253	8.115472	2.559449	...	13.202562	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	6310	ETNPPL	2942	1046308	59653806
ENSG00000287158	0.9594	0.0406	3.162540	0.000191	0.000016	3.200916e-07	0.25	10.063144	8.814186	4.944439	...	19.293985	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	48527	RP11-412N9.1	3783	213855	19188243
ENSG00000198838	0.9580	0.0420	3.127178	0.005951	0.000640	3.200916e-07	0.25	6.005604	6.995085	2.814702	...	12.153649	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	23071	RYR3	20353	10782416	62456018
ENSG00000251372	0.9564	0.0436	3.088119	0.000488	0.000062	3.200916e-07	0.25	4.700628	5.185833	2.178193	...	10.656058	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	11502	LINC00499	20131	1490946	49149694
ENSG00000198963	0.9538	0.0462	3.027474	0.001032	0.000137	3.200916e-07	0.25	4.358594	4.706037	1.999524	...	10.492231	True	Bergmann glial cell vs Rest	Bergmann glial cell	Rest	21331	RORB	9621	7627352	62735387

20 rows × 21 columns

# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)

Download raw counts for these genes from CELLxGENE census#

We can download the raw expression only for cells and genes of interest from CELLxGENE census. This drastically improves runtime. It is important to use the same census version used for training the model

census = cellxgene_census.open_soma(census_version="2023-12-15")

# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)

adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_coords=marker_genes["soma_joinid"].to_list(),
    obs_coords=census_model.adata.obs.loc[
        census_model.adata.obs["assay"] == "10x 3' v3", "soma_joinid"
    ].to_list(),
    column_names={
        "obs": [
            "soma_joinid",
            "dataset_id",
            "assay",
            "cell_type",
            "disease",
            "donor_id",
            "sex",
            "suspension_type",
            "tissue",
            "raw_sum",
            "nnz",
            "raw_mean_nnz",
            "raw_variance_nnz",
            "n_measured_vars",
        ]
    },
)

Confirm results of downloading cells and dotplot#

Check that census download yields the same cells

adata.var.index = adata.var["feature_id"]

np.all(
    adata.obs["dataset_id"].values
    == census_model.adata.obs.loc[census_model.adata.obs["assay"] == "10x 3' v3", "dataset_id"]
)

np.True_

adata_log = adata[adata.obs.cell_type.isin(cell_types.index.values)].copy()
sc.pp.normalize_total(adata_log)
sc.pp.log1p(adata_log)
sc.pl.dotplot(
    adata_log,
    marker_genes["feature_name"].to_list(),
    groupby="cell_type",
    gene_symbols="feature_name",
    standard_scale="var",
)

../../../_images/69756584c7249230c01025869861bbbb91ba4117b2b127b3d3c5790cc483c153.png