Use pretrained models of scVI-hub for CELLxGENE#

This notebook was contributed by Can Ergen and Martin Kim.

Original article: Scvi-hub: an actionable repository for model-driven single cell analysis

https://www.biorxiv.org/content/10.1101/2024.03.01.582887

The anndata object we’re using here is a subset of the full CELLxGENE census data. Use: s3://cellxgene-contrib-public/models/scvi/2024-02-12/mus_musculus/adata_minified.h5ad

Steps performed:

  1. Loading the minified data from AWS

  2. Setting up minified model with minified data

  3. Visualize the latent space

  4. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
!pip install --quiet cellxgene-census
!pip install --quiet pybiomart
from scvi_colab import install

install()
import os
import tempfile

import botocore
import cellxgene_census
import numpy as np
import plotnine as p9
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scvi.hub import HubModel
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

We download a toy example here (all cells from the spinal cord). To download the full census use backup_url: https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata_minified.h5ad. Expect that the download of the full census takes ~30 minutes with a high bandwidth internet connection.

We share models for mouse at https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/mus_musculus (same folder structure).

adata_path = os.path.join(save_dir.name, "adata-spinal-cord-minified.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata-spinal-cord-minified.h5ad",
)

model = HubModel.pull_from_s3(
    s3_bucket="cellxgene-contrib-public",
    s3_path="models/scvi/2024-02-12/homo_sapiens/modelhub",
    pull_anndata=False,
    config=botocore.config.Config(signature_version=botocore.UNSIGNED),
)

adata, model
HubModel with:
local_dir: /tmp/tmp84uddqqb
model loaded? No
adata loaded? No
large_training_adata loaded? No
metadata:
HubMetadata(scvi_version='1.0.4', anndata_version='0.8.0', model_cls_name='SCVI', training_data_url=None, model_parent_module='scvi.model')
model_card:
-------------------------------------------------------------------------------------------------------------------

license: cc-by-4.0                                                                                                 

library_name: scvi-tools                                                                                           

tags:                                                                                                              

biology                                                                                                         
genomics                                                                                                        
single-cell                                                                                                     
model_cls_name:SCVI                                                                                             
scvi_version:1.0.4                                                                                              
anndata_version:0.8.0                                                                                           

-------------------------------------------------------------------------------------------------------------------

                                                    Description                                                    

SCVI model trained on the homo_sapiens data of the CELLxGENE Discover Census.                                      

                                                 Model properties                                                  

Many model properties are in the model tags. Some more are listed below.                                           

model_init_params:                                                                                                 

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "n_hidden": 128,                                                                                              
                                                                                                                   
     "n_latent": 50,                                                                                               
                                                                                                                   
     "n_layers": 2,                                                                                                
                                                                                                                   
     "dropout_rate": 0.1,                                                                                          
                                                                                                                   
     "dispersion": "gene",                                                                                         
                                                                                                                   
     "gene_likelihood": "nb",                                                                                      
                                                                                                                   
     "latent_distribution": "normal",                                                                              
                                                                                                                   
     "encode_covariates": false                                                                                    
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_setup_anndata_args:                                                                                          

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "layer": null,                                                                                                
                                                                                                                   
     "batch_key": "batch",                                                                                         
                                                                                                                   
     "labels_key": null,                                                                                           
                                                                                                                   
     "size_factor_key": null,                                                                                      
                                                                                                                   
     "categorical_covariate_keys": null,                                                                           
                                                                                                                   
     "continuous_covariate_keys": null                                                                             
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_summary_stats:                                                                                               

|     Summary Stat Key     |   Value  |                                                                            

|--------------------------|----------|                                                                            

|         n_batch          |   6006   |                                                                            

|         n_cells          | 34544157 |                                                                            

| n_extra_categorical_covs |    0     |                                                                            

| n_extra_continuous_covs  |    0     |                                                                            

|         n_labels         |    1     |                                                                            

|          n_vars          |  8000    |                                                                            

model_data_registry:                                                                                               

|   Registry Key    |         scvi-tools Location          |                                                       

|-------------------|--------------------------------------|                                                       

|         X         |               adata.X                |                                                       

|       batch       |       adata.obs['_scvi_batch']       |                                                       

|      labels       |      adata.obs['_scvi_labels']       |                                                       

model_parent_module: scvi.model                                                                                    

data_is_minified: To be added...                                                                                   

                                                   Training data                                                   

This is an optional link to where the training data is stored if it is too large                                   

to host on the huggingface Model hub.                                                                              


Training data url: N/A                                                                                             

                                                   Training code                                                   

This is an optional link to the code used to train the model.                                                      

Training code url: N/A                                                                                             

                                                    References                                                     

To be added...                                                                                                     
(AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 )

Setup minified model#

Census was trained on all primary cells. We don’t encode covariates so inference and generating latent codes works without retraining on these batches. We have to subset to all training batches. The setup will be optimized in a future version of scvi-tools.

del adata.uns["_scvi_adata_minify_type"]
model.load_model(adata=adata[adata.obs["is_primary_data"]].copy())
census_model = model.model
INFO     Loading model...                                                                                          
INFO     File /tmp/tmp84uddqqb/model.pt already downloaded
census_model_all = scvi.model.SCVI.load_query_data(adata, census_model)

For spinal cord less than a half of cells is labeled as primary cells, while the other cells are duplicated from this dataset. census_model_all contains all cells while census_model only contains primary cells.

adata.obs["is_primary_data"].value_counts()
is_primary_data
False    67707
True     49756
Name: count, dtype: int64
census_model.adata, census_model_all.adata
(AnnData object with n_obs × n_vars = 49756 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv')

By default scvi-tools loads models as not minified. We set up the model here with minified data, so we minify the model using the respective obsm fields.

census_model.minify_adata(
    use_latent_qzm_key="_scvi_latent_qzm", use_latent_qzv_key="_scvi_latent_qzv"
)
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup                             
INFO     Generating sequential column names                                                                        
INFO     Generating sequential column names

Get the latent space and compute UMAP#

sc.pp.neighbors(census_model.adata, n_neighbors=20, use_rep="_scvi_latent_qzm")
sc.tl.umap(census_model.adata)
sc.pl.umap(census_model.adata, color=["tissue", "cell_type", "assay", "disease"], ncols=1)

Performing Differential Expression in scVI#

While we only have access to the minified data, we can still perform downstream analysis using the generative part of the model.

Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:

  1. Whether the gene’s expression levels are significantly different in the A and B sets of cells.

  2. An effect size that quantifies the strength of the differential expression.

Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.

Selecting cell subpopulations to compare#

# let's take a look at abundances of different cell types
print(
    census_model.adata.obs["cell_type"].value_counts(),
    "\n\n\n\n",
    census_model.adata.obs["tissue"].value_counts(),
)
cell_type
oligodendrocyte                                        20708
neuron                                                  7831
astrocyte                                               4449
oligodendrocyte precursor cell                          3191
microglial cell                                         1864
GABAergic neuron                                        1790
central nervous system macrophage                       1777
erythroid lineage cell                                  1684
capillary endothelial cell                               915
mural cell                                               879
neutrophil                                               708
erythroid progenitor cell                                606
glutamatergic neuron                                     576
leukocyte                                                542
endothelial cell of artery                               330
endothelial cell                                         247
mesenchymal stem cell                                    238
cerebellar granule cell                                  236
macrophage                                               159
ependymal cell                                           147
professional antigen presenting cell                     140
vascular associated smooth muscle cell                   125
cord blood hematopoietic stem cell                        87
pericyte                                                  78
differentiation-committed oligodendrocyte precursor       75
fibroblast                                                69
monocyte                                                  54
dendritic cell                                            52
B cell                                                    35
T cell                                                    33
chondrocyte                                               32
epithelial cell                                           29
renal intercalated cell                                   16
primordial germ cell                                      11
stromal cell                                              11
plasma cell                                                8
Bergmann glial cell                                        7
alternatively activated macrophage                         4
kidney loop of Henle epithelial cell                       3
cell of skeletal muscle                                    3
embryonic stem cell                                        2
choroid plexus epithelial cell                             1
endocrine cell                                             1
enterocyte                                                 1
stratified epithelial cell                                 1
smooth muscle cell                                         1
Name: count, dtype: int64 



 tissue
spinal cord                          30106
cervical spinal cord white matter    19650
Name: count, dtype: int64

scVI provides several options to identify the two populations of interest.

cell_type = "oligodendrocyte"
tissue1 = "spinal cord"
cell_idx1 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue1,
)
print(sum(cell_idx1), "cells from tissue", tissue1)

tissue2 = "cervical spinal cord white matter"
cell_idx2 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue2,
)
print(sum(cell_idx2), "cells of type", tissue2)

# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"
11547 cells from tissue spinal cord
9161 cells of type cervical spinal cord white matter

Running DE analyses#

A simple DE analysis can then be performed using the following command

de_change = census_model.differential_expression(
    idx1=cell_idx1, idx2=cell_idx2, all_stats=False, mode="change"
)

This method returns a pandas DataFrame, where each row corresponds to a gene. The most important columns of this dataframe are the following. proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE; lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1; is_de_fdr_0.05 is True when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target in the differential_expression method.

Volcano plot with p-values#

de_change["log10_pscore"] = np.log10(de_change["proba_not_de"] + 1e-6)
de_change = de_change.join(census_model.adata.var, how="inner")
de_change = de_change.loc[np.max(de_change[["scale1", "scale2"]], axis=1) > 1e-4]
de_change["feature_id"] = de_change.index
de_change.index = de_change["feature_name"]
de_change.head(20)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std lfc_min lfc_max is_de_fdr_0.05 log10_pscore soma_joinid feature_name feature_length nnz n_measured_obs feature_id
feature_name
SOX2 1.0 0.0 18.420681 3.384034e-06 1.123714e-04 4.288460e-08 0.25 -5.039413 -5.034239 0.500462 -7.230298 -2.478858 True -6.0 25043 SOX2 2512 3032247 58220612 ENSG00000181449
SOX2-OT 1.0 0.0 18.420681 4.396113e-03 7.087180e-07 4.288460e-08 0.25 12.826218 12.834159 0.823557 10.126595 17.054296 True -6.0 25044 SOX2-OT 31637 14163978 36686207 ENSG00000242808
RP11-964E11.3 1.0 0.0 18.420681 1.923145e-04 2.241279e-06 4.288460e-08 0.25 6.560994 6.617713 1.492750 0.508492 10.717464 True -6.0 49923 RP11-964E11.3 2073 1224228 26271998 ENSG00000283098
RP11-237N2.1 1.0 0.0 18.420681 3.558666e-04 4.184131e-07 4.288460e-08 0.25 10.095840 10.152714 1.692653 4.902654 15.073298 True -6.0 49855 RP11-237N2.1 10986 4066655 21167246 ENSG00000286746
RP11-272B17.3 1.0 0.0 18.420681 7.729452e-04 3.016204e-06 4.288460e-08 0.25 8.106521 8.106206 0.723615 0.603325 11.951858 True -6.0 49723 RP11-272B17.3 6636 1155875 20970779 ENSG00000286279
RPS10 1.0 0.0 18.420681 1.541836e-04 5.661847e-06 4.288460e-08 0.25 4.840816 4.891184 1.233305 0.625889 8.894837 True -6.0 22445 RPS10 2302 26915086 58641444 ENSG00000124614
RWDD3 1.0 0.0 18.420681 1.028717e-04 8.987226e-06 4.288460e-08 0.25 3.543085 3.551251 0.334796 2.384393 4.760509 True -6.0 23058 RWDD3 1927 6622342 58885186 ENSG00000122481
ZC3H11A 1.0 0.0 18.420681 4.557053e-04 1.018896e-04 4.288460e-08 0.25 2.195212 2.148960 0.434324 0.880529 3.938943 True -6.0 28966 ZC3H11A 12829 12692037 62662199 ENSG00000058673
RPL32P3 1.0 0.0 18.420681 1.700545e-04 1.103668e-07 4.288460e-08 0.25 10.739322 10.720686 0.726264 8.571869 13.156144 True -6.0 21965 RPL32P3 4504 6764181 24635964 ENSG00000251474
RPL17 1.0 0.0 18.420681 1.745479e-04 5.694760e-06 4.288460e-08 0.25 4.926170 4.978228 1.090603 0.999575 8.551223 True -6.0 21496 RPL17 2451 28351933 59434543 ENSG00000265681
UBE2V1 1.0 0.0 18.420681 1.389076e-04 8.478121e-06 4.288460e-08 0.25 4.083253 4.081970 0.449361 2.604101 5.697139 True -6.0 27926 UBE2V1 4818 13729803 56545393 ENSG00000244687
ABHD16A 1.0 0.0 18.420681 1.084239e-04 1.877089e-05 4.288460e-08 0.25 2.600349 2.603707 0.737122 0.374827 4.662211 True -6.0 29821 ABHD16A 4059 6625528 55538094 ENSG00000204427
C5orf17 1.0 0.0 18.420681 6.683705e-04 1.119344e-04 4.288460e-08 0.25 2.693624 2.658252 0.861144 0.297724 5.964550 True -6.0 29989 C5orf17 1984 3961683 59850491 ENSG00000248874
ZNF56 1.0 0.0 18.420681 2.485951e-04 1.208375e-07 4.288460e-08 0.25 11.136928 11.129822 0.671358 8.913495 13.393724 True -6.0 29451 ZNF56 2355 5381008 23563306 ENSG00000267419
PRKCQ-AS1 1.0 0.0 18.420681 1.226647e-04 6.104561e-06 4.288460e-08 0.25 4.233367 4.243305 0.951379 1.049817 7.125087 True -6.0 19228 PRKCQ-AS1 5834 3056965 40400414 ENSG00000237943
MTRNR2L12 1.0 0.0 18.420681 2.038424e-07 1.365834e-04 4.288460e-08 0.25 -9.679708 -9.755687 2.109375 -16.073889 -2.684278 True -6.0 31235 MTRNR2L12 1049 13615749 57188872 ENSG00000269028
RP11-326E22.2 1.0 0.0 18.420681 2.041467e-04 5.539313e-07 4.288460e-08 0.25 8.633562 8.685425 1.032294 3.045305 11.544564 True -6.0 55102 RP11-326E22.2 2597 3155057 12629433 ENSG00000285579
BCOR 1.0 0.0 18.420681 4.525876e-05 1.604009e-04 4.288460e-08 0.25 -1.832544 -1.841138 0.444156 -3.198434 -0.359479 True -6.0 1872 BCOR 9965 7262911 62379638 ENSG00000183337
RP11-13A2.5 1.0 0.0 18.420681 2.157081e-04 4.321780e-07 4.288460e-08 0.25 9.080471 9.034204 0.743024 6.708925 12.421048 True -6.0 55800 RP11-13A2.5 3379 3128509 11181954 ENSG00000285837
MRPS31P5 1.0 0.0 18.420681 1.029742e-04 9.368161e-08 4.288460e-08 0.25 10.212408 10.177984 0.640787 8.041052 12.319384 True -6.0 15273 MRPS31P5 3873 2281179 23426996 ENSG00000243406
gene_annotations = sc.queries.biomart_annotations(
    org="hsapiens",
    attrs=["ensembl_gene_id", "gene_biotype"],
)
gene_annotations.index = gene_annotations["ensembl_gene_id"]
gene_annotation_dict = gene_annotations["gene_biotype"].to_dict()
de_change["Biotype"] = [
    gene_annotation_dict.pop(i, "Unannotated") for i in de_change["feature_id"]
]
de_change["Biotype"].value_counts()
Biotype
protein_coding                        814
lncRNA                                 69
transcribed_unprocessed_pseudogene      7
Unannotated                             4
Mt_rRNA                                 2
transcribed_unitary_pseudogene          1
Name: count, dtype: int64
(
    p9.ggplot(de_change, p9.aes("lfc_mean", "-log10_pscore", color="Biotype"))
    + p9.geom_point(
        de_change.query("Biotype == 'protein_coding'"), alpha=0.5
    )  # Plot other genes with transparence
    + p9.xlim(-10, 10)  # Set x limits
    + p9.ylim(0, 7)  # Set y limits
    + p9.geom_point(de_change.query("Biotype != 'protein_coding'"))
    + p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
)
upregulated_genes = de_change.loc[
    de_change["lfc_median"] > 0, ["feature_id", "feature_name"]
].head(4)

Display generated counts from scVI model

census_model.adata[:, upregulated_genes["feature_id"]] = census_model.get_normalized_expression(
    gene_list=list(upregulated_genes["feature_id"]), n_samples=10
)
sc.pl.umap(
    census_model.adata,
    color=upregulated_genes["feature_name"],
    gene_symbols="feature_name",
    cmap="viridis",
)

Performing differential expression and yield expression from census#

Now we perform DE between each cell type vs all other cells and make a dotplot of the result.

Performing differential expression#

# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1 vs all results, contained in a DataFrame
change_per_cluster_de = census_model.differential_expression(
    adata=census_model.adata[census_model.adata.obs["assay"] == "10x 3' v3"],
    groupby="cell_type",
    all_stats=False,
    mode="change",
)
INFO     Received view of anndata, making copy.                                                                    
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
cell_types = (
    adata.obs["cell_type"]
    .value_counts()
    # .loc[lambda x: (x >= 500) & (x.index != "nan")]
    .loc[lambda x: x.index != "nan"]
    .to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"
change_per_cluster_de = change_per_cluster_de.join(census_model.adata.var, how="inner")
change_per_cluster_de = change_per_cluster_de[
    change_per_cluster_de[["scale1", "scale2"]].max(axis=1) > 1e-4
]
change_per_cluster_de.head(20)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... lfc_max is_de_fdr_0.05 comparison group1 group2 soma_joinid feature_name feature_length nnz n_measured_obs
feature_id
ENSG00000204929 0.9884 0.0116 4.445082 0.000939 0.000113 3.200916e-07 0.25 4.635637 4.416831 2.389565 ... 12.149960 True Bergmann glial cell vs Rest Bergmann glial cell Rest 33810 AC074391.1 12344 7205936 41750524
ENSG00000287544 0.9840 0.0160 4.119037 0.000313 0.000015 3.200916e-07 0.25 9.663720 8.085569 4.618678 ... 19.936712 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48757 RP11-624A4.2 6317 589676 21095338
ENSG00000151322 0.9824 0.0176 4.022099 0.015489 0.004528 3.200916e-07 0.25 2.478718 1.656276 1.884972 ... 9.619658 True Bergmann glial cell vs Rest Bergmann glial cell Rest 16408 NPAS3 10811 20379621 61713514
ENSG00000288560 0.9784 0.0216 3.813225 0.000164 0.000023 3.200916e-07 0.25 5.707840 4.334204 4.024246 ... 17.234283 True Bergmann glial cell vs Rest Bergmann glial cell Rest 60533 RP1-236J16.2 1303 730034 12430934
ENSG00000233639 0.9780 0.0220 3.794467 0.000868 0.000143 3.200916e-07 0.25 5.063036 3.226929 3.691148 ... 16.205431 True Bergmann glial cell vs Rest Bergmann glial cell Rest 11839 PANTR1 9231 11000299 39610997
ENSG00000236790 0.9774 0.0226 3.766946 0.002097 0.000147 3.200916e-07 0.25 6.707326 7.455571 2.670608 ... 12.378292 True Bergmann glial cell vs Rest Bergmann glial cell Rest 11388 LINC00299 22322 5588941 62289901
ENSG00000261404 0.9762 0.0238 3.713981 0.000547 0.000159 3.200916e-07 0.25 2.333580 1.784798 1.570074 ... 7.553150 True Bergmann glial cell vs Rest Bergmann glial cell Rest 33187 PSMD7-DT 7473 5476988 57001363
ENSG00000080493 0.9758 0.0242 3.696905 0.003838 0.000327 3.200916e-07 0.25 5.944194 6.824965 2.573866 ... 10.528049 True Bergmann glial cell vs Rest Bergmann glial cell Rest 24192 SLC4A4 9153 10506365 62846587
ENSG00000287292 0.9740 0.0260 3.623314 0.001743 0.000187 3.200916e-07 0.25 7.771810 6.196614 4.915103 ... 17.857506 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48727 RP11-563M4.2 4010 7893116 21167246
ENSG00000273118 0.9720 0.0280 3.547151 0.000380 0.000086 3.200916e-07 0.25 4.649741 2.213397 3.780624 ... 13.904197 True Bergmann glial cell vs Rest Bergmann glial cell Rest 45190 AC079610.1 433 4469301 32099560
ENSG00000145476 0.9704 0.0296 3.489934 0.000222 0.000098 3.200916e-07 0.25 1.251280 1.289731 0.482151 ... 2.898293 True Bergmann glial cell vs Rest Bergmann glial cell Rest 4770 CYP4V2 9767 9194672 62664775
ENSG00000091138 0.9700 0.0300 3.476098 0.000005 0.000301 3.200916e-07 0.25 -3.930191 -2.347409 2.798681 ... 1.171524 True Bergmann glial cell vs Rest Bergmann glial cell Rest 24047 SLC26A3 3382 4731736 57602392
ENSG00000083067 0.9694 0.0306 3.455677 0.016102 0.001976 3.200916e-07 0.25 5.801075 7.135654 3.094589 ... 11.319915 True Bergmann glial cell vs Rest Bergmann glial cell Rest 27473 TRPM3 20015 13113971 62015147
ENSG00000234377 0.9662 0.0338 3.352910 0.006446 0.000491 3.200916e-07 0.25 7.251151 7.934378 2.585241 ... 11.855108 True Bergmann glial cell vs Rest Bergmann glial cell Rest 21043 OBI1-AS1 8790 2160307 52308922
ENSG00000231304 0.9634 0.0366 3.270420 0.000430 0.000079 3.200916e-07 0.25 4.959074 2.800905 3.595532 ... 13.147403 True Bergmann glial cell vs Rest Bergmann glial cell Rest 23658 SGO1-AS1 6571 3336199 39961949
ENSG00000164089 0.9616 0.0384 3.220541 0.001599 0.000128 3.200916e-07 0.25 7.381253 8.115472 2.559449 ... 13.202562 True Bergmann glial cell vs Rest Bergmann glial cell Rest 6310 ETNPPL 2942 1046308 59653806
ENSG00000287158 0.9594 0.0406 3.162540 0.000191 0.000016 3.200916e-07 0.25 10.063144 8.814186 4.944439 ... 19.293985 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48527 RP11-412N9.1 3783 213855 19188243
ENSG00000198838 0.9580 0.0420 3.127178 0.005951 0.000640 3.200916e-07 0.25 6.005604 6.995085 2.814702 ... 12.153649 True Bergmann glial cell vs Rest Bergmann glial cell Rest 23071 RYR3 20353 10782416 62456018
ENSG00000251372 0.9564 0.0436 3.088119 0.000488 0.000062 3.200916e-07 0.25 4.700628 5.185833 2.178193 ... 10.656058 True Bergmann glial cell vs Rest Bergmann glial cell Rest 11502 LINC00499 20131 1490946 49149694
ENSG00000198963 0.9538 0.0462 3.027474 0.001032 0.000137 3.200916e-07 0.25 4.358594 4.706037 1.999524 ... 10.492231 True Bergmann glial cell vs Rest Bergmann glial cell Rest 21331 RORB 9621 7627352 62735387

20 rows × 21 columns

# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)

Download raw counts for these genes from CELLxGENE census#

We can download the raw expression only for cells and genes of interest from CELLxGENE census. This drastically improves runtime. It is important to use the same census version used for training the model

census = cellxgene_census.open_soma(census_version="2023-12-15")
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_coords=marker_genes["soma_joinid"].to_list(),
    obs_coords=census_model.adata.obs.loc[
        census_model.adata.obs["assay"] == "10x 3' v3", "soma_joinid"
    ].to_list(),
    column_names={
        "obs": [
            "soma_joinid",
            "dataset_id",
            "assay",
            "cell_type",
            "disease",
            "donor_id",
            "sex",
            "suspension_type",
            "tissue",
            "raw_sum",
            "nnz",
            "raw_mean_nnz",
            "raw_variance_nnz",
            "n_measured_vars",
        ]
    },
)

Confirm results of downloading cells and dotplot#

Check that census download yields the same cells

adata.var.index = adata.var["feature_id"]
np.all(
    adata.obs["dataset_id"].values
    == census_model.adata.obs.loc[census_model.adata.obs["assay"] == "10x 3' v3", "dataset_id"]
)
np.True_
adata_log = adata[adata.obs.cell_type.isin(cell_types.index.values)].copy()
sc.pp.normalize_total(adata_log)
sc.pp.log1p(adata_log)
sc.pl.dotplot(
    adata_log,
    marker_genes["feature_name"].to_list(),
    groupby="cell_type",
    gene_symbols="feature_name",
    standard_scale="var",
)