Use pretrained models of scVI-hub for CELLxGENE#
This notebook was contributed by Can Ergen and Martin Kim.
Original article:
Scvi-hub: an actionable repository for model-driven single cell analysis
https://www.biorxiv.org/content/10.1101/2024.03.01.582887
The anndata object we’re using here is a subset of the full CELLxGENE census data. Use: s3://cellxgene-contrib-public/models/scvi/2024-02-12/mus_musculus/adata_minified.h5ad
Steps performed:
Loading the minified data from AWS
Setting up minified model with minified data
Visualize the latent space
Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly
This notebook was designed to be run in Google Colab.
Note
Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.
!pip install --quiet scvi-colab
!pip install --quiet cellxgene-census
!pip install --quiet pybiomart
from scvi_colab import install
install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.35.24 requires botocore<1.36.0,>=1.35.24, but you have botocore 1.35.23 which is incompatible.
spatialdata 0.2.2 requires fsspec<=2023.6, but you have fsspec 2024.9.0 which is incompatible.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
import os
import tempfile
import botocore
import cellxgene_census
import numpy as np
import plotnine as p9
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scvi.hub import HubModel
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.6
Note
You can modify save_dir
below to change where the data files for this tutorial are saved.
sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()
%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"
We download a toy example here (all cells from the spinal cord). To download the full census use backup_url: https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata_minified.h5ad. Expect that the download of the full census takes ~30 minutes with a high bandwidth internet connection.
We share models for mouse at https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/mus_musculus (same folder structure).
adata_path = os.path.join(save_dir.name, "adata-spinal-cord-minified.h5ad")
adata = sc.read(
adata_path,
backup_url="https://cellxgene-contrib-public.s3.amazonaws.com/models/scvi/2024-02-12/homo_sapiens/adata-spinal-cord-minified.h5ad",
)
model = HubModel.pull_from_s3(
s3_bucket="cellxgene-contrib-public",
s3_path="models/scvi/2024-02-12/homo_sapiens/modelhub",
pull_anndata=False,
config=botocore.config.Config(signature_version=botocore.UNSIGNED),
)
adata, model
HubModel with:
local_dir: /tmp/tmp3gyd_s7l
model loaded? No
adata loaded? No
large_training_adata loaded? No
metadata:
HubMetadata(scvi_version='1.0.4', anndata_version='0.8.0', model_cls_name='SCVI', training_data_url=None, model_parent_module='scvi.model')
model_card:
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── library_name: scvi-tools license: cc-by-4.0 tags: • biology • genomics • single-cell • model_cls_name:SCVI • scvi_version:1.0.4 • anndata_version:0.8.0 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Description ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ SCVI model trained on the homo_sapiens data of the CELLxGENE Discover Census. ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Model properties ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ Many model properties are in the model tags. Some more are listed below. model_init_params: { "n_hidden": 128, "n_latent": 50, "n_layers": 2, "dropout_rate": 0.1, "dispersion": "gene", "gene_likelihood": "nb", "latent_distribution": "normal", "encode_covariates": false } model_setup_anndata_args: { "layer": null, "batch_key": "batch", "labels_key": null, "size_factor_key": null, "categorical_covariate_keys": null, "continuous_covariate_keys": null } model_summary_stats: | Summary Stat Key | Value | |--------------------------|----------| | n_batch | 6006 | | n_cells | 34544157 | | n_extra_categorical_covs | 0 | | n_extra_continuous_covs | 0 | | n_labels | 1 | | n_vars | 8000 | model_data_registry: | Registry Key | scvi-tools Location | |-------------------|--------------------------------------| | X | adata.X | | batch | adata.obs['_scvi_batch'] | | labels | adata.obs['_scvi_labels'] | model_parent_module: scvi.model data_is_minified: To be added... ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Training data ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ This is an optional link to where the training data is stored if it is too large to host on the huggingface Model hub. Training data url: N/A ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Training code ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ This is an optional link to the code used to train the model. Training code url: N/A ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ References ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ To be added...
(AnnData object with n_obs × n_vars = 117463 × 8000
obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
)
Setup minified model#
Census was trained on all primary cells. We don’t encode covariates so inference and generating latent codes works without retraining on these batches. We have to subset to all training batches. The setup will be optimized in a future version of scvi-tools.
del adata.uns["_scvi_adata_minify_type"]
model.load_model(adata=adata[adata.obs["is_primary_data"]].copy())
census_model = model.model
INFO Loading model...
INFO File /tmp/tmp3gyd_s7l/model.pt already downloaded
census_model_all = scvi.model.SCVI.load_query_data(adata, census_model)
For spinal cord less than a half of cells is labeled as primary cells, while the other cells are duplicated from this dataset. census_model_all contains all cells while census_model only contains primary cells.
adata.obs["is_primary_data"].value_counts()
is_primary_data
False 67707
True 49756
Name: count, dtype: int64
census_model.adata, census_model_all.adata
(AnnData object with n_obs × n_vars = 49756 × 8000
obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
uns: '_scvi_manager_uuid', '_scvi_uuid'
obsm: '_scvi_latent_qzm', '_scvi_latent_qzv',
AnnData object with n_obs × n_vars = 117463 × 8000
obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars', 'batch', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
var: 'soma_joinid', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
uns: '_scvi_manager_uuid', '_scvi_uuid'
obsm: '_scvi_latent_qzm', '_scvi_latent_qzv')
By default scvi-tools loads models as not minified. We set up the model here with minified data, so we minify the model using the respective obsm fields.
census_model.minify_adata(
use_latent_qzm_key="_scvi_latent_qzm", use_latent_qzv_key="_scvi_latent_qzv"
)
INFO Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
INFO Generating sequential column names
INFO Generating sequential column names
Get the latent space and compute UMAP#
Performing Differential Expression in scVI#
While we only have access to the minified data, we can still perform downstream analysis using the generative part of the model.
Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:
Whether the gene’s expression levels are significantly different in the A and B sets of cells.
An effect size that quantifies the strength of the differential expression.
Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.
Selecting cell subpopulations to compare#
# let's take a look at abundances of different cell types
print(
census_model.adata.obs["cell_type"].value_counts(),
"\n\n\n\n",
census_model.adata.obs["tissue"].value_counts(),
)
cell_type
oligodendrocyte 20708
neuron 7831
astrocyte 4449
oligodendrocyte precursor cell 3191
microglial cell 1864
GABAergic neuron 1790
central nervous system macrophage 1777
erythroid lineage cell 1684
capillary endothelial cell 915
mural cell 879
neutrophil 708
erythroid progenitor cell 606
glutamatergic neuron 576
leukocyte 542
endothelial cell of artery 330
endothelial cell 247
mesenchymal stem cell 238
cerebellar granule cell 236
macrophage 159
ependymal cell 147
professional antigen presenting cell 140
vascular associated smooth muscle cell 125
cord blood hematopoietic stem cell 87
pericyte 78
differentiation-committed oligodendrocyte precursor 75
fibroblast 69
monocyte 54
dendritic cell 52
B cell 35
T cell 33
chondrocyte 32
epithelial cell 29
renal intercalated cell 16
primordial germ cell 11
stromal cell 11
plasma cell 8
Bergmann glial cell 7
alternatively activated macrophage 4
kidney loop of Henle epithelial cell 3
cell of skeletal muscle 3
embryonic stem cell 2
choroid plexus epithelial cell 1
endocrine cell 1
enterocyte 1
stratified epithelial cell 1
smooth muscle cell 1
Name: count, dtype: int64
tissue
spinal cord 30106
cervical spinal cord white matter 19650
Name: count, dtype: int64
scVI provides several options to identify the two populations of interest.
cell_type = "oligodendrocyte"
tissue1 = "spinal cord"
cell_idx1 = np.logical_and(
census_model.adata.obs["cell_type"] == cell_type,
census_model.adata.obs["tissue"] == tissue1,
)
print(sum(cell_idx1), "cells from tissue", tissue1)
tissue2 = "cervical spinal cord white matter"
cell_idx2 = np.logical_and(
census_model.adata.obs["cell_type"] == cell_type,
census_model.adata.obs["tissue"] == tissue2,
)
print(sum(cell_idx2), "cells of type", tissue2)
# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"
11547 cells from tissue spinal cord
9161 cells of type cervical spinal cord white matter
Running DE analyses#
A simple DE analysis can then be performed using the following command
de_change = census_model.differential_expression(idx1=cell_idx1, idx2=cell_idx2, all_stats=False)
This method returns a pandas DataFrame, where each row corresponds to a gene.
The most important columns of this dataframe are the following.
proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE;
lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1
;
is_de_fdr_0.05 is True
when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target
in the differential_expression
method.
Volcano plot with p-values#
de_change["log10_pscore"] = np.log10(de_change["proba_not_de"] + 1e-6)
de_change = de_change.join(census_model.adata.var, how="inner")
de_change = de_change.loc[np.max(de_change[["scale1", "scale2"]], axis=1) > 1e-4]
de_change["feature_id"] = de_change.index
de_change.index = de_change["feature_name"]
de_change.head(20)
proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | lfc_min | lfc_max | is_de_fdr_0.05 | log10_pscore | soma_joinid | feature_name | feature_length | nnz | n_measured_obs | feature_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature_name | ||||||||||||||||||||
PRKCQ-AS1 | 1.0 | 0.0 | 18.420681 | 1.230528e-04 | 6.122091e-06 | 0.0 | 0.25 | 4.234055 | 4.235188 | 0.961707 | 0.672802 | 8.398901 | True | -6.0 | 19228 | PRKCQ-AS1 | 5834 | 3056965 | 40400414 | ENSG00000237943 |
PLCG2 | 1.0 | 0.0 | 18.420681 | 1.080190e-05 | 1.328016e-04 | 0.0 | 0.25 | -3.573616 | -3.575109 | 0.880594 | -6.409345 | -0.501044 | True | -6.0 | 18601 | PLCG2 | 13375 | 11105234 | 62902272 | ENSG00000197943 |
RPL17 | 1.0 | 0.0 | 18.420681 | 1.755241e-04 | 5.781009e-06 | 0.0 | 0.25 | 4.930901 | 4.978129 | 1.116411 | 1.120990 | 8.368684 | True | -6.0 | 21496 | RPL17 | 2451 | 28351933 | 59434543 | ENSG00000265681 |
RPL32P3 | 1.0 | 0.0 | 18.420681 | 1.699188e-04 | 1.112937e-07 | 0.0 | 0.25 | 10.732672 | 10.726068 | 0.735848 | 8.573318 | 13.043948 | True | -6.0 | 21965 | RPL32P3 | 4504 | 6764181 | 24635964 | ENSG00000251474 |
RPS10 | 1.0 | 0.0 | 18.420681 | 1.555399e-04 | 5.756226e-06 | 0.0 | 0.25 | 4.850702 | 4.882492 | 1.267890 | 0.591713 | 8.893880 | True | -6.0 | 22445 | RPS10 | 2302 | 26915086 | 58641444 | ENSG00000124614 |
RWDD3 | 1.0 | 0.0 | 18.420681 | 1.028030e-04 | 9.010642e-06 | 0.0 | 0.25 | 3.540111 | 3.550129 | 0.340671 | 2.308668 | 4.649152 | True | -6.0 | 23058 | RWDD3 | 1927 | 6622342 | 58885186 | ENSG00000122481 |
SGO1-AS1 | 1.0 | 0.0 | 18.420681 | 1.176230e-04 | 1.214340e-06 | 0.0 | 0.25 | 6.705634 | 6.654238 | 0.730793 | 4.400548 | 9.642642 | True | -6.0 | 23658 | SGO1-AS1 | 6571 | 3336199 | 39961949 | ENSG00000231304 |
SLC26A3 | 1.0 | 0.0 | 18.420681 | 1.699776e-05 | 6.341395e-04 | 0.0 | 0.25 | -5.116875 | -5.147023 | 0.758414 | -7.547303 | -2.231958 | True | -6.0 | 24047 | SLC26A3 | 3382 | 4731736 | 57602392 | ENSG00000091138 |
SOX2 | 1.0 | 0.0 | 18.420681 | 3.380096e-06 | 1.133860e-04 | 0.0 | 0.25 | -5.052306 | -5.042466 | 0.510457 | -7.003593 | -3.061825 | True | -6.0 | 25043 | SOX2 | 2512 | 3032247 | 58220612 | ENSG00000181449 |
SOX2-OT | 1.0 | 0.0 | 18.420681 | 4.394894e-03 | 7.167412e-07 | 0.0 | 0.25 | 12.822459 | 12.835955 | 0.840095 | 10.335123 | 17.573542 | True | -6.0 | 25044 | SOX2-OT | 31637 | 14163978 | 36686207 | ENSG00000242808 |
MRPS31P5 | 1.0 | 0.0 | 18.420681 | 1.028728e-04 | 9.419149e-08 | 0.0 | 0.25 | 10.207337 | 10.179218 | 0.650570 | 7.974120 | 13.034983 | True | -6.0 | 15273 | MRPS31P5 | 3873 | 2281179 | 23426996 | ENSG00000243406 |
STX16 | 1.0 | 0.0 | 18.420681 | 1.538485e-04 | 6.493473e-05 | 0.0 | 0.25 | 1.271160 | 1.253100 | 0.321001 | 0.327801 | 2.536581 | True | -6.0 | 25635 | STX16 | 5099 | 13989471 | 60324836 | ENSG00000124222 |
UBE2V1 | 1.0 | 0.0 | 18.420681 | 1.391584e-04 | 8.504647e-06 | 0.0 | 0.25 | 4.084114 | 4.086703 | 0.452431 | 2.505506 | 5.501486 | True | -6.0 | 27926 | UBE2V1 | 4818 | 13729803 | 56545393 | ENSG00000244687 |
ZC3H11A | 1.0 | 0.0 | 18.420681 | 4.547413e-04 | 1.021843e-04 | 0.0 | 0.25 | 2.191399 | 2.144897 | 0.447136 | 0.718927 | 4.242169 | True | -6.0 | 28966 | ZC3H11A | 12829 | 12692037 | 62662199 | ENSG00000058673 |
RP11-13A2.5 | 1.0 | 0.0 | 18.420681 | 2.150162e-04 | 4.361469e-07 | 0.0 | 0.25 | 9.070029 | 9.006927 | 0.767520 | 6.387317 | 13.014623 | True | -6.0 | 55800 | RP11-13A2.5 | 3379 | 3128509 | 11181954 | ENSG00000285837 |
ZNF56 | 1.0 | 0.0 | 18.420681 | 2.483926e-04 | 1.220145e-07 | 0.0 | 0.25 | 11.129900 | 11.136563 | 0.686629 | 9.009073 | 13.542611 | True | -6.0 | 29451 | ZNF56 | 2355 | 5381008 | 23563306 | ENSG00000267419 |
MIR100HG | 1.0 | 0.0 | 18.420681 | 2.157415e-03 | 2.144368e-05 | 0.0 | 0.25 | 6.733043 | 6.718174 | 0.533213 | 4.932419 | 11.507467 | True | -6.0 | 13116 | MIR100HG | 28649 | 17120360 | 38431486 | ENSG00000255248 |
RP11-326E22.2 | 1.0 | 0.0 | 18.420681 | 2.032912e-04 | 5.628331e-07 | 0.0 | 0.25 | 8.620193 | 8.649633 | 1.066291 | 3.490223 | 11.772171 | True | -6.0 | 55102 | RP11-326E22.2 | 2597 | 3155057 | 12629433 | ENSG00000285579 |
MAGI2-AS3 | 1.0 | 0.0 | 18.420681 | 4.456284e-04 | 2.454530e-07 | 0.0 | 0.25 | 11.037951 | 11.028957 | 0.799092 | 8.539783 | 13.758058 | True | -6.0 | 12585 | MAGI2-AS3 | 17783 | 10950979 | 36720979 | ENSG00000234456 |
MTRNR2L12 | 1.0 | 0.0 | 18.420681 | 2.134031e-07 | 1.378079e-04 | 0.0 | 0.25 | -9.673280 | -9.739614 | 2.136707 | -16.651707 | -2.575340 | True | -6.0 | 31235 | MTRNR2L12 | 1049 | 13615749 | 57188872 | ENSG00000269028 |
gene_annotations = sc.queries.biomart_annotations(
org="hsapiens",
attrs=["ensembl_gene_id", "gene_biotype"],
)
gene_annotations.index = gene_annotations["ensembl_gene_id"]
gene_annotation_dict = gene_annotations["gene_biotype"].to_dict()
de_change["Biotype"] = [
gene_annotation_dict.pop(i, "Unannotated") for i in de_change["feature_id"]
]
de_change["Biotype"].value_counts()
Biotype
protein_coding 814
lncRNA 71
transcribed_unprocessed_pseudogene 7
Unannotated 3
Mt_rRNA 2
transcribed_unitary_pseudogene 1
Name: count, dtype: int64
(
p9.ggplot(de_change, p9.aes("lfc_mean", "-log10_pscore", color="Biotype"))
+ p9.geom_point(
de_change.query("Biotype == 'protein_coding'"), alpha=0.5
) # Plot other genes with transparence
+ p9.xlim(-10, 10) # Set x limits
+ p9.ylim(0, 7) # Set y limits
+ p9.geom_point(de_change.query("Biotype != 'protein_coding'"))
+ p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
)
upregulated_genes = de_change.loc[
de_change["lfc_median"] > 0, ["feature_id", "feature_name"]
].head(4)
Display generated counts from scVI model
Performing differential expression and yield expression from census#
Now we perform DE between each cell type vs all other cells and make a dotplot of the result.
Performing differential expression#
# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1 vs all results, contained in a DataFrame
change_per_cluster_de = census_model.differential_expression(
adata=census_model.adata[census_model.adata.obs["assay"] == "10x 3' v3"],
groupby="cell_type",
all_stats=False,
)
INFO Received view of anndata, making copy.
INFO Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
cell_types = (
adata.obs["cell_type"]
.value_counts()
# .loc[lambda x: (x >= 500) & (x.index != "nan")]
.loc[lambda x: x.index != "nan"]
.to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"
change_per_cluster_de = change_per_cluster_de.join(census_model.adata.var, how="inner")
change_per_cluster_de = change_per_cluster_de[
change_per_cluster_de[["scale1", "scale2"]].max(axis=1) > 1e-4
]
change_per_cluster_de.head(20)
proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | ... | lfc_max | is_de_fdr_0.05 | comparison | group1 | group2 | soma_joinid | feature_name | feature_length | nnz | n_measured_obs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature_id | |||||||||||||||||||||
ENSG00000287372 | 0.9966 | 0.0034 | 5.680571 | 0.000352 | 0.000053 | 0.0 | 0.25 | 10.098774 | 9.726233 | 4.152241 | ... | 17.578503 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 48122 | RP11-259G3.1 | 4336 | 177399 | 20685604 |
ENSG00000287544 | 0.9954 | 0.0046 | 5.377086 | 0.000312 | 0.000013 | 0.0 | 0.25 | 9.761824 | 8.134218 | 4.613431 | ... | 20.266708 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 48757 | RP11-624A4.2 | 6317 | 589676 | 21095338 |
ENSG00000125968 | 0.9932 | 0.0068 | 4.984008 | 0.000126 | 0.000137 | 0.0 | 0.25 | 3.457879 | 4.233216 | 3.074085 | ... | 10.953665 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 9555 | ID1 | 1233 | 7330830 | 62902272 |
ENSG00000163453 | 0.9932 | 0.0068 | 4.984008 | 0.000474 | 0.000444 | 0.0 | 0.25 | 3.567271 | 4.429197 | 2.611707 | ... | 9.524614 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 9678 | IGFBP7 | 1930 | 11183207 | 62984438 |
ENSG00000172201 | 0.9928 | 0.0072 | 4.926447 | 0.000633 | 0.000193 | 0.0 | 0.25 | 5.217178 | 5.938439 | 2.980853 | ... | 10.261541 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 9559 | ID4 | 3873 | 4530224 | 61308177 |
ENSG00000271945 | 0.9928 | 0.0072 | 4.926447 | 0.000317 | 0.000081 | 0.0 | 0.25 | 6.255356 | 5.814308 | 4.282311 | ... | 15.188108 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 38739 | RP11-354K4.2 | 1901 | 1442656 | 47074810 |
ENSG00000283380 | 0.9926 | 0.0074 | 4.898846 | 0.000242 | 0.000033 | 0.0 | 0.25 | 7.731935 | 7.162708 | 4.517736 | ... | 17.631754 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 50548 | RP11-274G22.1 | 6221 | 3389503 | 25112547 |
ENSG00000041982 | 0.9924 | 0.0076 | 4.871977 | 0.001716 | 0.000522 | 0.0 | 0.25 | 6.446057 | 7.178648 | 3.019589 | ... | 13.512787 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 26842 | TNC | 9639 | 2703824 | 62202641 |
ENSG00000286133 | 0.9920 | 0.0080 | 4.820280 | 0.000250 | 0.000028 | 0.0 | 0.25 | 7.234351 | 6.476529 | 3.618663 | ... | 16.751778 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 48738 | RP11-364P22.4 | 996 | 2296885 | 20269491 |
ENSG00000182492 | 0.9918 | 0.0082 | 4.795386 | 0.000027 | 0.000164 | 0.0 | 0.25 | 3.659565 | 4.861273 | 4.252452 | ... | 10.689773 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 1924 | BGN | 2404 | 2419369 | 54108679 |
ENSG00000286749 | 0.9914 | 0.0086 | 4.747355 | 0.000057 | 0.001011 | 0.0 | 0.25 | 2.216055 | 1.884742 | 6.936182 | ... | 18.921701 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 48878 | RP11-141F7.1 | 4207 | 3096053 | 19526686 |
ENSG00000167772 | 0.9912 | 0.0088 | 4.724163 | 0.000350 | 0.000129 | 0.0 | 0.25 | 4.048803 | 4.234704 | 2.688754 | ... | 10.068689 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 823 | ANGPTL4 | 2475 | 2734520 | 62832892 |
ENSG00000117318 | 0.9904 | 0.0096 | 4.636345 | 0.000198 | 0.000205 | 0.0 | 0.25 | 2.890789 | 3.181549 | 3.075953 | ... | 10.973293 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 30371 | ID3 | 1496 | 8797503 | 61326069 |
ENSG00000185650 | 0.9900 | 0.0100 | 4.595119 | 0.000810 | 0.000711 | 0.0 | 0.25 | 3.651941 | 5.341339 | 3.379040 | ... | 8.302951 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 29062 | ZFP36L1 | 5160 | 21557771 | 62723992 |
ENSG00000154175 | 0.9894 | 0.0106 | 4.536244 | 0.000378 | 0.000123 | 0.0 | 0.25 | 3.347516 | 3.957913 | 1.896013 | ... | 6.751803 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 121 | ABI3BP | 10978 | 4301085 | 62072541 |
ENSG00000114315 | 0.9892 | 0.0108 | 4.517349 | 0.000174 | 0.000104 | 0.0 | 0.25 | 4.290458 | 5.222503 | 3.255705 | ... | 11.530048 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 8720 | HES1 | 2059 | 6963292 | 62902272 |
ENSG00000080493 | 0.9890 | 0.0110 | 4.498798 | 0.003841 | 0.000305 | 0.0 | 0.25 | 6.003542 | 6.863938 | 2.503586 | ... | 11.079055 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 24192 | SLC4A4 | 9153 | 10506365 | 62846587 |
ENSG00000231427 | 0.9890 | 0.0110 | 4.498798 | 0.000011 | 0.000209 | 0.0 | 0.25 | -0.748589 | 0.257591 | 4.291699 | ... | 9.724770 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 30479 | LINC01445 | 8937 | 772477 | 45299808 |
ENSG00000074410 | 0.9886 | 0.0114 | 4.462676 | 0.000243 | 0.000063 | 0.0 | 0.25 | 4.959381 | 5.567525 | 2.714884 | ... | 9.235083 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 2512 | CA12 | 6998 | 2884701 | 61513620 |
ENSG00000204929 | 0.9884 | 0.0116 | 4.445082 | 0.000940 | 0.000112 | 0.0 | 0.25 | 4.645700 | 4.434205 | 2.387054 | ... | 13.011336 | True | Bergmann glial cell vs Rest | Bergmann glial cell | Rest | 33810 | AC074391.1 | 12344 | 7205936 | 41750524 |
20 rows × 21 columns
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
change_per_cluster_de.reset_index()
.loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
.groupby("comparison")
.apply(
lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
) # Select top 5 DE genes per comparison
.reset_index(drop=True)[["feature_name", "soma_joinid"]]
.drop_duplicates()
)
Download raw counts for these genes from CELLxGENE census#
We can download the raw expression only for cells and genes of interest from CELLxGENE census. This drastically improves runtime. It is important to use the same census version used for training the model
census = cellxgene_census.open_soma(census_version="2023-12-15")
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
change_per_cluster_de.reset_index()
.loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
.groupby("comparison")
.apply(
lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
) # Select top 5 DE genes per comparison
.reset_index(drop=True)[["feature_name", "soma_joinid"]]
.drop_duplicates()
)
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_coords=marker_genes["soma_joinid"].to_list(),
obs_coords=census_model.adata.obs.loc[
census_model.adata.obs["assay"] == "10x 3' v3", "soma_joinid"
].to_list(),
column_names={
"obs": [
"soma_joinid",
"dataset_id",
"assay",
"cell_type",
"disease",
"donor_id",
"observation_joinid",
"sex",
"suspension_type",
"tissue",
"raw_sum",
"nnz",
"raw_mean_nnz",
"raw_variance_nnz",
"n_measured_vars",
]
},
)
[2024-09-22 12:28:20.596] [tiledbsoma] [Process: 154] [Thread: 2320] [warning] [TileDB-SOMA::ManagedQuery] [obs] Invalid column selected: observation_joinid
Confirm results of downloading cells and dotplot#
Check that census download yields the same cells
adata.var.index = adata.var["feature_id"]
np.all(
adata.obs["dataset_id"].values
== census_model.adata.obs.loc[census_model.adata.obs["assay"] == "10x 3' v3", "dataset_id"]
)
True