Use pretrained models of scVI-hub for CELLxGENE#

This notebook was contributed by Can Ergen and Martin Kim.

Original article: Scvi-hub: an actionable repository for model-driven single cell analysis

https://www.biorxiv.org/content/10.1101/2024.03.01.582887

The anndata object we’re using here is a subset of the full CELLxGENE census data. Use: s3://cellxgene-contrib-public/models/scvi/2024-02-12/mus_musculus/adata_minified.h5ad

Steps performed:

  1. Loading the minified data from AWS

  2. Setting up minified model with minified data

  3. Visualize the latent space

  4. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
!pip install --quiet cellxgene-census
!pip install --quiet pybiomart
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.35.24 requires botocore<1.36.0,>=1.35.24, but you have botocore 1.35.23 which is incompatible.
spatialdata 0.2.2 requires fsspec<=2023.6, but you have fsspec 2024.9.0 which is incompatible.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

import os
import tempfile

import botocore
import cellxgene_census
import numpy as np
import plotnine as p9
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scvi.hub import HubModel
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.6

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

We download a toy example here (all cells from the spinal cord). To download the full census use backup_url: https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata_minified.h5ad. Expect that the download of the full census takes ~30 minutes with a high bandwidth internet connection.

We share models for mouse at https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/mus_musculus (same folder structure).

adata_path = os.path.join(save_dir.name, "adata-spinal-cord-minified.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata-spinal-cord-minified.h5ad",
)

model = HubModel.pull_from_s3(
    s3_bucket="cellxgene-contrib-public",
    s3_path="models/scvi/2024-02-12/homo_sapiens/modelhub",
    pull_anndata=False,
    config=botocore.config.Config(signature_version=botocore.UNSIGNED),
)

adata, model
HubModel with:
local_dir: /tmp/tmp3gyd_s7l
model loaded? No
adata loaded? No
large_training_adata loaded? No
metadata:
HubMetadata(scvi_version='1.0.4', anndata_version='0.8.0', model_cls_name='SCVI', training_data_url=None, model_parent_module='scvi.model')
model_card:
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
library_name: scvi-tools                                                                                           

license: cc-by-4.0                                                                                                 

tags:                                                                                                              

biology                                                                                                         
genomics                                                                                                        
single-cell                                                                                                     
model_cls_name:SCVI                                                                                             
scvi_version:1.0.4                                                                                              
anndata_version:0.8.0                                                                                           

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                   Description                                                   ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

SCVI model trained on the homo_sapiens data of the CELLxGENE Discover Census.                                      

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                Model properties                                                 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Many model properties are in the model tags. Some more are listed below.                                           

model_init_params:                                                                                                 

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "n_hidden": 128,                                                                                              
                                                                                                                   
     "n_latent": 50,                                                                                               
                                                                                                                   
     "n_layers": 2,                                                                                                
                                                                                                                   
     "dropout_rate": 0.1,                                                                                          
                                                                                                                   
     "dispersion": "gene",                                                                                         
                                                                                                                   
     "gene_likelihood": "nb",                                                                                      
                                                                                                                   
     "latent_distribution": "normal",                                                                              
                                                                                                                   
     "encode_covariates": false                                                                                    
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_setup_anndata_args:                                                                                          

                                                                                                                   
                                                                                                                   
 {                                                                                                                 
                                                                                                                   
     "layer": null,                                                                                                
                                                                                                                   
     "batch_key": "batch",                                                                                         
                                                                                                                   
     "labels_key": null,                                                                                           
                                                                                                                   
     "size_factor_key": null,                                                                                      
                                                                                                                   
     "categorical_covariate_keys": null,                                                                           
                                                                                                                   
     "continuous_covariate_keys": null                                                                             
                                                                                                                   
 }                                                                                                                 
                                                                                                                   

model_summary_stats:                                                                                               

|     Summary Stat Key     |   Value  |                                                                            

|--------------------------|----------|                                                                            

|         n_batch          |   6006   |                                                                            

|         n_cells          | 34544157 |                                                                            

| n_extra_categorical_covs |    0     |                                                                            

| n_extra_continuous_covs  |    0     |                                                                            

|         n_labels         |    1     |                                                                            

|          n_vars          |  8000    |                                                                            

model_data_registry:                                                                                               

|   Registry Key    |         scvi-tools Location          |                                                       

|-------------------|--------------------------------------|                                                       

|         X         |               adata.X                |                                                       

|       batch       |       adata.obs['_scvi_batch']       |                                                       

|      labels       |      adata.obs['_scvi_labels']       |                                                       

model_parent_module: scvi.model                                                                                    

data_is_minified: To be added...                                                                                   

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                  Training data                                                  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

This is an optional link to where the training data is stored if it is too large                                   

to host on the huggingface Model hub.                                                                              


Training data url: N/A                                                                                             

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                  Training code                                                  ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

This is an optional link to the code used to train the model.                                                      

Training code url: N/A                                                                                             

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                   References                                                    ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

To be added...                                                                                                     
(AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 )

Setup minified model#

Census was trained on all primary cells. We don’t encode covariates so inference and generating latent codes works without retraining on these batches. We have to subset to all training batches. The setup will be optimized in a future version of scvi-tools.

del adata.uns["_scvi_adata_minify_type"]
model.load_model(adata=adata[adata.obs["is_primary_data"]].copy())
census_model = model.model
INFO     Loading model...
INFO     File /tmp/tmp3gyd_s7l/model.pt already downloaded
census_model_all = scvi.model.SCVI.load_query_data(adata, census_model)

For spinal cord less than a half of cells is labeled as primary cells, while the other cells are duplicated from this dataset. census_model_all contains all cells while census_model only contains primary cells.

adata.obs["is_primary_data"].value_counts()
is_primary_data
False    67707
True     49756
Name: count, dtype: int64
census_model.adata, census_model_all.adata
(AnnData object with n_obs × n_vars = 49756 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
 AnnData object with n_obs × n_vars = 117463 × 8000
     obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
     var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
     uns: '_scvi_manager_uuid', '_scvi_uuid'
     obsm: '_scvi_latent_qzm', '_scvi_latent_qzv')

By default scvi-tools loads models as not minified. We set up the model here with minified data, so we minify the model using the respective obsm fields.

census_model.minify_adata(
    use_latent_qzm_key="_scvi_latent_qzm", use_latent_qzv_key="_scvi_latent_qzv"
)
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
INFO     Generating sequential column names
INFO     Generating sequential column names

Get the latent space and compute UMAP#

sc.pp.neighbors(census_model.adata, n_neighbors=20, use_rep="_scvi_latent_qzm")
sc.tl.umap(census_model.adata)
sc.pl.umap(census_model.adata, color=["tissue", "cell_type", "assay", "disease"], ncols=1)

Performing Differential Expression in scVI#

While we only have access to the minified data, we can still perform downstream analysis using the generative part of the model.

Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:

  1. Whether the gene’s expression levels are significantly different in the A and B sets of cells.

  2. An effect size that quantifies the strength of the differential expression.

Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.

Selecting cell subpopulations to compare#

# let's take a look at abundances of different cell types
print(
    census_model.adata.obs["cell_type"].value_counts(),
    "\n\n\n\n",
    census_model.adata.obs["tissue"].value_counts(),
)
cell_type
oligodendrocyte                                        20708
neuron                                                  7831
astrocyte                                               4449
oligodendrocyte precursor cell                          3191
microglial cell                                         1864
GABAergic neuron                                        1790
central nervous system macrophage                       1777
erythroid lineage cell                                  1684
capillary endothelial cell                               915
mural cell                                               879
neutrophil                                               708
erythroid progenitor cell                                606
glutamatergic neuron                                     576
leukocyte                                                542
endothelial cell of artery                               330
endothelial cell                                         247
mesenchymal stem cell                                    238
cerebellar granule cell                                  236
macrophage                                               159
ependymal cell                                           147
professional antigen presenting cell                     140
vascular associated smooth muscle cell                   125
cord blood hematopoietic stem cell                        87
pericyte                                                  78
differentiation-committed oligodendrocyte precursor       75
fibroblast                                                69
monocyte                                                  54
dendritic cell                                            52
B cell                                                    35
T cell                                                    33
chondrocyte                                               32
epithelial cell                                           29
renal intercalated cell                                   16
primordial germ cell                                      11
stromal cell                                              11
plasma cell                                                8
Bergmann glial cell                                        7
alternatively activated macrophage                         4
kidney loop of Henle epithelial cell                       3
cell of skeletal muscle                                    3
embryonic stem cell                                        2
choroid plexus epithelial cell                             1
endocrine cell                                             1
enterocyte                                                 1
stratified epithelial cell                                 1
smooth muscle cell                                         1
Name: count, dtype: int64 



 tissue
spinal cord                          30106
cervical spinal cord white matter    19650
Name: count, dtype: int64

scVI provides several options to identify the two populations of interest.

cell_type = "oligodendrocyte"
tissue1 = "spinal cord"
cell_idx1 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue1,
)
print(sum(cell_idx1), "cells from tissue", tissue1)

tissue2 = "cervical spinal cord white matter"
cell_idx2 = np.logical_and(
    census_model.adata.obs["cell_type"] == cell_type,
    census_model.adata.obs["tissue"] == tissue2,
)
print(sum(cell_idx2), "cells of type", tissue2)

# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"
11547 cells from tissue spinal cord
9161 cells of type cervical spinal cord white matter

Running DE analyses#

A simple DE analysis can then be performed using the following command

de_change = census_model.differential_expression(idx1=cell_idx1, idx2=cell_idx2, all_stats=False)

This method returns a pandas DataFrame, where each row corresponds to a gene. The most important columns of this dataframe are the following. proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE; lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1; is_de_fdr_0.05 is True when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target in the differential_expression method.

Volcano plot with p-values#

de_change["log10_pscore"] = np.log10(de_change["proba_not_de"] + 1e-6)
de_change = de_change.join(census_model.adata.var, how="inner")
de_change = de_change.loc[np.max(de_change[["scale1", "scale2"]], axis=1) > 1e-4]
de_change["feature_id"] = de_change.index
de_change.index = de_change["feature_name"]
de_change.head(20)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std lfc_min lfc_max is_de_fdr_0.05 log10_pscore soma_joinid feature_name feature_length nnz n_measured_obs feature_id
feature_name
PRKCQ-AS1 1.0 0.0 18.420681 1.230528e-04 6.122091e-06 0.0 0.25 4.234055 4.235188 0.961707 0.672802 8.398901 True -6.0 19228 PRKCQ-AS1 5834 3056965 40400414 ENSG00000237943
PLCG2 1.0 0.0 18.420681 1.080190e-05 1.328016e-04 0.0 0.25 -3.573616 -3.575109 0.880594 -6.409345 -0.501044 True -6.0 18601 PLCG2 13375 11105234 62902272 ENSG00000197943
RPL17 1.0 0.0 18.420681 1.755241e-04 5.781009e-06 0.0 0.25 4.930901 4.978129 1.116411 1.120990 8.368684 True -6.0 21496 RPL17 2451 28351933 59434543 ENSG00000265681
RPL32P3 1.0 0.0 18.420681 1.699188e-04 1.112937e-07 0.0 0.25 10.732672 10.726068 0.735848 8.573318 13.043948 True -6.0 21965 RPL32P3 4504 6764181 24635964 ENSG00000251474
RPS10 1.0 0.0 18.420681 1.555399e-04 5.756226e-06 0.0 0.25 4.850702 4.882492 1.267890 0.591713 8.893880 True -6.0 22445 RPS10 2302 26915086 58641444 ENSG00000124614
RWDD3 1.0 0.0 18.420681 1.028030e-04 9.010642e-06 0.0 0.25 3.540111 3.550129 0.340671 2.308668 4.649152 True -6.0 23058 RWDD3 1927 6622342 58885186 ENSG00000122481
SGO1-AS1 1.0 0.0 18.420681 1.176230e-04 1.214340e-06 0.0 0.25 6.705634 6.654238 0.730793 4.400548 9.642642 True -6.0 23658 SGO1-AS1 6571 3336199 39961949 ENSG00000231304
SLC26A3 1.0 0.0 18.420681 1.699776e-05 6.341395e-04 0.0 0.25 -5.116875 -5.147023 0.758414 -7.547303 -2.231958 True -6.0 24047 SLC26A3 3382 4731736 57602392 ENSG00000091138
SOX2 1.0 0.0 18.420681 3.380096e-06 1.133860e-04 0.0 0.25 -5.052306 -5.042466 0.510457 -7.003593 -3.061825 True -6.0 25043 SOX2 2512 3032247 58220612 ENSG00000181449
SOX2-OT 1.0 0.0 18.420681 4.394894e-03 7.167412e-07 0.0 0.25 12.822459 12.835955 0.840095 10.335123 17.573542 True -6.0 25044 SOX2-OT 31637 14163978 36686207 ENSG00000242808
MRPS31P5 1.0 0.0 18.420681 1.028728e-04 9.419149e-08 0.0 0.25 10.207337 10.179218 0.650570 7.974120 13.034983 True -6.0 15273 MRPS31P5 3873 2281179 23426996 ENSG00000243406
STX16 1.0 0.0 18.420681 1.538485e-04 6.493473e-05 0.0 0.25 1.271160 1.253100 0.321001 0.327801 2.536581 True -6.0 25635 STX16 5099 13989471 60324836 ENSG00000124222
UBE2V1 1.0 0.0 18.420681 1.391584e-04 8.504647e-06 0.0 0.25 4.084114 4.086703 0.452431 2.505506 5.501486 True -6.0 27926 UBE2V1 4818 13729803 56545393 ENSG00000244687
ZC3H11A 1.0 0.0 18.420681 4.547413e-04 1.021843e-04 0.0 0.25 2.191399 2.144897 0.447136 0.718927 4.242169 True -6.0 28966 ZC3H11A 12829 12692037 62662199 ENSG00000058673
RP11-13A2.5 1.0 0.0 18.420681 2.150162e-04 4.361469e-07 0.0 0.25 9.070029 9.006927 0.767520 6.387317 13.014623 True -6.0 55800 RP11-13A2.5 3379 3128509 11181954 ENSG00000285837
ZNF56 1.0 0.0 18.420681 2.483926e-04 1.220145e-07 0.0 0.25 11.129900 11.136563 0.686629 9.009073 13.542611 True -6.0 29451 ZNF56 2355 5381008 23563306 ENSG00000267419
MIR100HG 1.0 0.0 18.420681 2.157415e-03 2.144368e-05 0.0 0.25 6.733043 6.718174 0.533213 4.932419 11.507467 True -6.0 13116 MIR100HG 28649 17120360 38431486 ENSG00000255248
RP11-326E22.2 1.0 0.0 18.420681 2.032912e-04 5.628331e-07 0.0 0.25 8.620193 8.649633 1.066291 3.490223 11.772171 True -6.0 55102 RP11-326E22.2 2597 3155057 12629433 ENSG00000285579
MAGI2-AS3 1.0 0.0 18.420681 4.456284e-04 2.454530e-07 0.0 0.25 11.037951 11.028957 0.799092 8.539783 13.758058 True -6.0 12585 MAGI2-AS3 17783 10950979 36720979 ENSG00000234456
MTRNR2L12 1.0 0.0 18.420681 2.134031e-07 1.378079e-04 0.0 0.25 -9.673280 -9.739614 2.136707 -16.651707 -2.575340 True -6.0 31235 MTRNR2L12 1049 13615749 57188872 ENSG00000269028
gene_annotations = sc.queries.biomart_annotations(
    org="hsapiens",
    attrs=["ensembl_gene_id", "gene_biotype"],
)
gene_annotations.index = gene_annotations["ensembl_gene_id"]
gene_annotation_dict = gene_annotations["gene_biotype"].to_dict()
de_change["Biotype"] = [
    gene_annotation_dict.pop(i, "Unannotated") for i in de_change["feature_id"]
]
de_change["Biotype"].value_counts()
Biotype
protein_coding                        814
lncRNA                                 71
transcribed_unprocessed_pseudogene      7
Unannotated                             3
Mt_rRNA                                 2
transcribed_unitary_pseudogene          1
Name: count, dtype: int64
(
    p9.ggplot(de_change, p9.aes("lfc_mean", "-log10_pscore", color="Biotype"))
    + p9.geom_point(
        de_change.query("Biotype == 'protein_coding'"), alpha=0.5
    )  # Plot other genes with transparence
    + p9.xlim(-10, 10)  # Set x limits
    + p9.ylim(0, 7)  # Set y limits
    + p9.geom_point(de_change.query("Biotype != 'protein_coding'"))
    + p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
)
upregulated_genes = de_change.loc[
    de_change["lfc_median"] > 0, ["feature_id", "feature_name"]
].head(4)

Display generated counts from scVI model

census_model.adata[:, upregulated_genes["feature_id"]] = census_model.get_normalized_expression(
    gene_list=list(upregulated_genes["feature_id"]), n_samples=10
)
sc.pl.umap(
    census_model.adata,
    color=upregulated_genes["feature_name"],
    gene_symbols="feature_name",
    cmap="viridis",
)

Performing differential expression and yield expression from census#

Now we perform DE between each cell type vs all other cells and make a dotplot of the result.

Performing differential expression#

# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1 vs all results, contained in a DataFrame
change_per_cluster_de = census_model.differential_expression(
    adata=census_model.adata[census_model.adata.obs["assay"] == "10x 3' v3"],
    groupby="cell_type",
    all_stats=False,
)
INFO     Received view of anndata, making copy.
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
cell_types = (
    adata.obs["cell_type"]
    .value_counts()
    # .loc[lambda x: (x >= 500) & (x.index != "nan")]
    .loc[lambda x: x.index != "nan"]
    .to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"
change_per_cluster_de = change_per_cluster_de.join(census_model.adata.var, how="inner")
change_per_cluster_de = change_per_cluster_de[
    change_per_cluster_de[["scale1", "scale2"]].max(axis=1) > 1e-4
]
change_per_cluster_de.head(20)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... lfc_max is_de_fdr_0.05 comparison group1 group2 soma_joinid feature_name feature_length nnz n_measured_obs
feature_id
ENSG00000287372 0.9966 0.0034 5.680571 0.000352 0.000053 0.0 0.25 10.098774 9.726233 4.152241 ... 17.578503 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48122 RP11-259G3.1 4336 177399 20685604
ENSG00000287544 0.9954 0.0046 5.377086 0.000312 0.000013 0.0 0.25 9.761824 8.134218 4.613431 ... 20.266708 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48757 RP11-624A4.2 6317 589676 21095338
ENSG00000125968 0.9932 0.0068 4.984008 0.000126 0.000137 0.0 0.25 3.457879 4.233216 3.074085 ... 10.953665 True Bergmann glial cell vs Rest Bergmann glial cell Rest 9555 ID1 1233 7330830 62902272
ENSG00000163453 0.9932 0.0068 4.984008 0.000474 0.000444 0.0 0.25 3.567271 4.429197 2.611707 ... 9.524614 True Bergmann glial cell vs Rest Bergmann glial cell Rest 9678 IGFBP7 1930 11183207 62984438
ENSG00000172201 0.9928 0.0072 4.926447 0.000633 0.000193 0.0 0.25 5.217178 5.938439 2.980853 ... 10.261541 True Bergmann glial cell vs Rest Bergmann glial cell Rest 9559 ID4 3873 4530224 61308177
ENSG00000271945 0.9928 0.0072 4.926447 0.000317 0.000081 0.0 0.25 6.255356 5.814308 4.282311 ... 15.188108 True Bergmann glial cell vs Rest Bergmann glial cell Rest 38739 RP11-354K4.2 1901 1442656 47074810
ENSG00000283380 0.9926 0.0074 4.898846 0.000242 0.000033 0.0 0.25 7.731935 7.162708 4.517736 ... 17.631754 True Bergmann glial cell vs Rest Bergmann glial cell Rest 50548 RP11-274G22.1 6221 3389503 25112547
ENSG00000041982 0.9924 0.0076 4.871977 0.001716 0.000522 0.0 0.25 6.446057 7.178648 3.019589 ... 13.512787 True Bergmann glial cell vs Rest Bergmann glial cell Rest 26842 TNC 9639 2703824 62202641
ENSG00000286133 0.9920 0.0080 4.820280 0.000250 0.000028 0.0 0.25 7.234351 6.476529 3.618663 ... 16.751778 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48738 RP11-364P22.4 996 2296885 20269491
ENSG00000182492 0.9918 0.0082 4.795386 0.000027 0.000164 0.0 0.25 3.659565 4.861273 4.252452 ... 10.689773 True Bergmann glial cell vs Rest Bergmann glial cell Rest 1924 BGN 2404 2419369 54108679
ENSG00000286749 0.9914 0.0086 4.747355 0.000057 0.001011 0.0 0.25 2.216055 1.884742 6.936182 ... 18.921701 True Bergmann glial cell vs Rest Bergmann glial cell Rest 48878 RP11-141F7.1 4207 3096053 19526686
ENSG00000167772 0.9912 0.0088 4.724163 0.000350 0.000129 0.0 0.25 4.048803 4.234704 2.688754 ... 10.068689 True Bergmann glial cell vs Rest Bergmann glial cell Rest 823 ANGPTL4 2475 2734520 62832892
ENSG00000117318 0.9904 0.0096 4.636345 0.000198 0.000205 0.0 0.25 2.890789 3.181549 3.075953 ... 10.973293 True Bergmann glial cell vs Rest Bergmann glial cell Rest 30371 ID3 1496 8797503 61326069
ENSG00000185650 0.9900 0.0100 4.595119 0.000810 0.000711 0.0 0.25 3.651941 5.341339 3.379040 ... 8.302951 True Bergmann glial cell vs Rest Bergmann glial cell Rest 29062 ZFP36L1 5160 21557771 62723992
ENSG00000154175 0.9894 0.0106 4.536244 0.000378 0.000123 0.0 0.25 3.347516 3.957913 1.896013 ... 6.751803 True Bergmann glial cell vs Rest Bergmann glial cell Rest 121 ABI3BP 10978 4301085 62072541
ENSG00000114315 0.9892 0.0108 4.517349 0.000174 0.000104 0.0 0.25 4.290458 5.222503 3.255705 ... 11.530048 True Bergmann glial cell vs Rest Bergmann glial cell Rest 8720 HES1 2059 6963292 62902272
ENSG00000080493 0.9890 0.0110 4.498798 0.003841 0.000305 0.0 0.25 6.003542 6.863938 2.503586 ... 11.079055 True Bergmann glial cell vs Rest Bergmann glial cell Rest 24192 SLC4A4 9153 10506365 62846587
ENSG00000231427 0.9890 0.0110 4.498798 0.000011 0.000209 0.0 0.25 -0.748589 0.257591 4.291699 ... 9.724770 True Bergmann glial cell vs Rest Bergmann glial cell Rest 30479 LINC01445 8937 772477 45299808
ENSG00000074410 0.9886 0.0114 4.462676 0.000243 0.000063 0.0 0.25 4.959381 5.567525 2.714884 ... 9.235083 True Bergmann glial cell vs Rest Bergmann glial cell Rest 2512 CA12 6998 2884701 61513620
ENSG00000204929 0.9884 0.0116 4.445082 0.000940 0.000112 0.0 0.25 4.645700 4.434205 2.387054 ... 13.011336 True Bergmann glial cell vs Rest Bergmann glial cell Rest 33810 AC074391.1 12344 7205936 41750524

20 rows × 21 columns

# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)

Download raw counts for these genes from CELLxGENE census#

We can download the raw expression only for cells and genes of interest from CELLxGENE census. This drastically improves runtime. It is important to use the same census version used for training the model

census = cellxgene_census.open_soma(census_version="2023-12-15")
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)[["feature_name", "soma_joinid"]]
    .drop_duplicates()
)
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_coords=marker_genes["soma_joinid"].to_list(),
    obs_coords=census_model.adata.obs.loc[
        census_model.adata.obs["assay"] == "10x 3' v3", "soma_joinid"
    ].to_list(),
    column_names={
        "obs": [
            "soma_joinid",
            "dataset_id",
            "assay",
            "cell_type",
            "disease",
            "donor_id",
            "observation_joinid",
            "sex",
            "suspension_type",
            "tissue",
            "raw_sum",
            "nnz",
            "raw_mean_nnz",
            "raw_variance_nnz",
            "n_measured_vars",
        ]
    },
)
[2024-09-22 12:28:20.596] [tiledbsoma] [Process: 154] [Thread: 2320] [warning] [TileDB-SOMA::ManagedQuery] [obs] Invalid column selected: observation_joinid

Confirm results of downloading cells and dotplot#

Check that census download yields the same cells

adata.var.index = adata.var["feature_id"]
np.all(
    adata.obs["dataset_id"].values
    == census_model.adata.obs.loc[census_model.adata.obs["assay"] == "10x 3' v3", "dataset_id"]
)
True
adata_log = adata[adata.obs.cell_type.isin(cell_types.index.values)].copy()
sc.pp.normalize_total(adata_log)
sc.pp.log1p(adata_log)
sc.pl.dotplot(
    adata_log,
    marker_genes["feature_name"].to_list(),
    groupby="cell_type",
    gene_symbols="feature_name",
    standard_scale="var",
)