CITE-seq analysis with totalVI#
With totalVI, we can produce a joint latent representation of cells, denoised data for both protein and RNA, integrate datasets, and compute differential expression of RNA and protein. Here we demonstrate this functionality with an integrated analysis of PBMC10k and PBMC5k, datasets of peripheral blood mononuclear cells publicly available from 10X Genomics subset to the 14 shared proteins between them. The same pipeline would generally be used to analyze a single CITE-seq dataset.
If you use totalVI, please consider citing:
Gayoso, A., Steier, Z., Lopez, R., Regier, J., Nazor, K. L., Streets, A., & Yosef, N. (2021). Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods, 18(3), 272-282.
Note
Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.
!pip install --quiet "scipy==1.14"
!pip install --quiet scvi-colab
from scvi_colab import install
install()
error: subprocess-exited-with-error
× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [47 lines of output]
+ meson setup /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3 -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3/meson-python-native-file.ini
The Meson build system
Version: 1.10.1
Source dir: /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1
Build dir: /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3
Build type: native build
Project name: scipy
Project version: 1.14.0
C compiler for the host machine: cc (gcc 13.3.0 "cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
C linker for the host machine: cc ld.bfd 2.42
C++ compiler for the host machine: c++ (gcc 13.3.0 "c++ (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
C++ linker for the host machine: c++ ld.bfd 2.42
Cython compiler for the host machine: cython (cython 3.0.12)
Host machine cpu family: x86_64
Host machine cpu: x86_64
Program python found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13)
Found pkg-config: YES (/usr/bin/pkg-config) 1.8.1
Run-time dependency python found: YES 3.13
Program cython found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/cython)
Compiler for C supports arguments -Wno-unused-but-set-variable: YES
Compiler for C supports arguments -Wno-unused-function: YES
Compiler for C supports arguments -Wno-conversion: YES
Compiler for C supports arguments -Wno-misleading-indentation: YES
Library m found: YES
Fortran compiler for the host machine: gfortran (gcc 13.3.0 "GNU Fortran (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
Fortran linker for the host machine: gfortran ld.bfd 2.42
Compiler for Fortran supports arguments -Wno-conversion: YES
Checking if "-Wl,--version-script" links: YES
Program tools/generate_f2pymod.py found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/tools/generate_f2pymod.py)
Program scipy/_build_utils/tempita.py found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/scipy/_build_utils/tempita.py)
Program pythran found: YES 0.16.1 0.16.1 (/tmp/pip-build-env-ldn1chrx/overlay/bin/pythran)
Found pkg-config: YES (/usr/bin/pkg-config) 1.8.1
Did not find CMake 'cmake'
Found CMake: NO
Run-time dependency xsimd found: NO (tried pkgconfig and cmake)
Run-time dependency threads found: YES
Library npymath found: YES
pybind11-config found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/pybind11-config) 2.12.1
Run-time dependency pybind11 found: YES 2.12.1
Program f2py found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/f2py)
Run-time dependency scipy-openblas found: NO (tried pkgconfig)
Run-time dependency openblas found: NO (tried pkgconfig and cmake)
Run-time dependency openblas found: NO (tried pkgconfig)
../scipy/meson.build:216:9: ERROR: Dependency "OpenBLAS" not found, tried pkgconfig
A full log can be found at /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3/meson-logs/meson-log.txt
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> scipy
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Imports and data loading#
import os
import tempfile
import matplotlib.pyplot as plt
import muon
import numpy as np
import requests
import scanpy as sc
import scvi
import seaborn as sns
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2
Note
You can modify save_dir below to change where the data files for this tutorial are saved.
sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()
%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"
Load mudata#
For this tutorial we use an integrated PBMC10k and PBMC5k dataset available from 10X Genomics. To see the exact preprocessing that was done, or to preprocess your own CITE-seq dataset for use with scvi-tools models, see our preprocessing tutorial.
mdata_path = os.path.join(save_dir.name, "CITE-seq_pbmc_combined_preprocessed.h5mu")
# direct download URL
url = "https://exampledata.scverse.org/scvi-tools/CITE-seq_pbmc_combined_preprocessed.h5mu"
# Download only if file doesn't already exist
if not os.path.exists(mdata_path):
print(f"Downloading MuData file to {mdata_path}...")
r = requests.get(url)
with open(mdata_path, "wb") as f:
f.write(r.content)
# Load the MuData object
mdata = muon.read_h5mu(mdata_path)
Downloading MuData file to /tmp/tmpyrc6gsma/CITE-seq_pbmc_combined_preprocessed.h5mu...
Setup mudata#
Now we run setup_mudata, which is the MuData analog to setup_anndata. The caveat of this workflow is that we need to provide this function which modality of the mdata object contains each piece of data. So for example, the batch information is in mdata.mod["rna"].obs["batch"]. Therefore, in the modalities argument below we specify that the batch_key can be found in the "rna_subset" modality of the MuData object.
Notably, we provide protein_layer=None. This means scvi-tools will pull information from .X from the modality specified in modalities ("protein" in this case). In the case of RNA, we want to use the counts, which we stored in mdata.mod["rna"].layers["counts"].
mdata
MuData object with n_obs × n_vars = 13112 × 37555
3 modalities
rna: 13112 x 33538
obs: 'batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p'
layers: 'counts'
prot: 13112 x 17
rna_subset: 13112 x 4000
obs: 'batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p'
layers: 'counts'# we need to work with dense, not sparse matricies
mdata["prot"].X = mdata["prot"].X.toarray()
mdata["rna_subset"].X = mdata["rna_subset"].X.toarray()
mdata.mod["rna_subset"].layers["counts"] = mdata.mod["rna_subset"].layers["counts"].toarray()
Note: wasn’t sure if I should combine two datasets similar to the current tutorial. Otherwise I assume I don’t specify a batch key if there’s just one dataset?
scvi.model.TOTALVI.setup_mudata(
mdata,
rna_layer="counts",
protein_layer=None,
batch_key="batch",
modalities={
"rna_layer": "rna_subset",
"protein_layer": "prot",
"batch_key": "rna_subset",
},
)
Note
Specify the modality of each argument via the modalities dictionary, which maps layer/key arguments to MuData modalities.
Prepare and run model#
For the rest of this tutorial we will use the model with the external data we loaded.
model = scvi.model.TOTALVI(mdata)
INFO Computing empirical prior initialization for protein background.
mdata
MuData object with n_obs × n_vars = 13112 × 37555
obs: '_scvi_labels'
uns: '_scvi_uuid', '_scvi_manager_uuid'
3 modalities
rna: 13112 x 33538
obs: 'batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p'
layers: 'counts'
prot: 13112 x 17
obs: '_scvi_batch'
rna_subset: 13112 x 4000
obs: 'batch', '_scvi_batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p'
layers: 'counts'model.train(early_stopping=True)
We will plot the loss curves for training and validation using auto alignment for the yaxis
last_val_valid = np.array(model.history["elbo_validation"])[-1]
last_val_train = np.array(model.history["elbo_train"])[-1]
global_min_loss = min(
np.min(model.history["elbo_train"]), np.min(model.history["elbo_validation"])
)
last_max_loss = max(last_val_train, last_val_valid)[0]
global_max_loss = max(
np.max(model.history["elbo_train"]), np.max(model.history["elbo_validation"])
)
# Compute the min and max of both train and validation losses
min_loss = min(min(last_val_train, last_val_valid), global_min_loss)
max_loss = max(max(last_val_train, last_val_valid), global_max_loss)
ylim_min = 0.995 * min_loss # 0.5% below the minimum
ylim_max = min(
global_max_loss, ylim_min + (last_max_loss - ylim_min) * 4
) # keep it under the 25% part of figure
fig, ax = plt.subplots(1, 1)
model.history["elbo_train"].plot(ax=ax, label="train")
model.history["elbo_validation"].plot(ax=ax, label="validation")
if isinstance(ylim_min, (int | float)) and isinstance(ylim_max, (int | float)):
ax.set(title="Negative ELBO over training epochs", ylim=(ylim_min, ylim_max))
else:
ax.set(title="Negative ELBO over training epochs")
ax.legend()
Analyze outputs#
We use Scanpy and muon for clustering and visualization after running totalVI. It’s also possible to save totalVI outputs for an R-based workflow.
rna = mdata.mod["rna_subset"]
protein = mdata.mod["prot"]
# arbitrarily store latent in rna modality
TOTALVI_LATENT_KEY = "X_totalVI"
rna.obsm[TOTALVI_LATENT_KEY] = model.get_latent_representation()
rna_denoised, protein_denoised = model.get_normalized_expression(n_samples=25, return_mean=True)
rna.layers["denoised_rna"] = rna_denoised
protein.layers["denoised_protein"] = protein_denoised
protein.layers["protein_foreground_prob"] = 100 * model.get_protein_foreground_probability(
n_samples=25, return_mean=True
)
parsed_protein_names = [p.split("_")[0] for p in protein.var_names]
protein.var["clean_names"] = parsed_protein_names
mdata.update()
Now we can compute clusters and visualize the latent space.
TOTALVI_CLUSTERS_KEY = "leiden_totalVI"
sc.pp.neighbors(rna, use_rep=TOTALVI_LATENT_KEY)
sc.tl.umap(rna)
sc.tl.leiden(rna, key_added=TOTALVI_CLUSTERS_KEY)
mdata.update()
We can now use muon plotting functions which can pull data from either modality of the MuData object. We will show the umap of the model embeddings with leiden clusters (and batch inegration of the datasets if exists). Following that we will show the denoised protein values and the foreground probability of the 14 protein listed.
mdata
MuData object with n_obs × n_vars = 13112 × 37555
obs: '_scvi_labels'
uns: '_scvi_uuid', '_scvi_manager_uuid'
3 modalities
rna: 13112 x 33538
obs: 'batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p'
layers: 'counts'
prot: 13112 x 17
obs: '_scvi_batch'
var: 'clean_names'
layers: 'denoised_protein', 'protein_foreground_prob'
rna_subset: 13112 x 4000
obs: 'batch', '_scvi_batch', 'leiden_totalVI'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
uns: 'hvg', 'log1p', 'neighbors', 'umap', 'leiden_totalVI'
obsm: 'X_totalVI', 'X_umap'
layers: 'counts', 'denoised_rna'
obsp: 'distances', 'connectivities'muon.pl.embedding(
mdata,
basis="rna_subset:X_umap",
color=[f"rna_subset:{TOTALVI_CLUSTERS_KEY}", "rna_subset:batch"],
frameon=False,
ncols=1,
)
Visualize denoised protein values#
max_prot_to_plot = 14
protein.var_names[:max_prot_to_plot]
Index(['CD3_TotalSeqB', 'CD4_TotalSeqB', 'CD8a_TotalSeqB', 'CD14_TotalSeqB',
'CD15_TotalSeqB', 'CD16_TotalSeqB', 'CD56_TotalSeqB', 'CD19_TotalSeqB',
'CD25_TotalSeqB', 'CD45RA_TotalSeqB', 'CD45RO_TotalSeqB',
'PD-1_TotalSeqB', 'TIGIT_TotalSeqB', 'CD127_TotalSeqB'],
dtype='object')
Visualize probability of foreground#
Here we visualize the probability of foreground for the first 14 proteins in the list and cell (projected on UMAP). Some proteins are easier to disentangle than others. Some proteins end up being “all background”. For example, CD15 does not appear to be captured well, when looking at the denoised values above we see little localization in the monocytes.
Note
While the foreground probability could theoretically be used to identify cell populations, we recommend using the denoised protein expression, which accounts for the foreground/background probability, but preserves the dynamic range of the protein measurements. Consequently, the denoised values are on the same scale as the raw data and it may be desirable to take a transformation like log or square root.
By viewing the foreground probability, we can get a feel for the types of cells in our dataset. For example, it’s very easy to see a population of monocytes based on the CD14 foregroud probability.
Differential expression#
Here we do a one-vs-all DE test, where each cluster is tested against all cells not in that cluster. The results for each of the one-vs-all tests is concatenated into one DataFrame object. Inividual tests can be sliced using the “comparison” column. Genes and proteins are included in the same DataFrame.
Important
We do not recommend using totalVI denoised values in other differential expression tools, as denoised values are a summary of a random quantity. The totalVI DE test takes into account the full uncertainty of the denoised quantities.
de_df = model.differential_expression(
groupby="rna_subset:leiden_totalVI", delta=0.5, batch_correction=True
)
de_df.head(5)
| proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | ... | raw_mean1 | raw_mean2 | non_zeros_proportion1 | non_zeros_proportion2 | raw_normalized_mean1 | raw_normalized_mean2 | is_de_fdr_0.05 | comparison | group1 | group2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FHIT | 0.9752 | 0.0248 | 3.671799 | 0.000669 | 0.000062 | 0.00001 | 0.5 | 4.687300 | 4.976536 | 2.099845 | ... | 0.544369 | 0.061422 | 0.381422 | 0.052481 | 8.070629 | 0.681380 | True | 0 vs Rest | 0 | Rest |
| NOG | 0.9586 | 0.0414 | 3.142193 | 0.000269 | 0.000019 | 0.00001 | 0.5 | 7.715130 | 8.318260 | 3.465424 | ... | 0.220550 | 0.015914 | 0.180592 | 0.014305 | 3.914433 | 0.277270 | True | 0 vs Rest | 0 | Rest |
| LRRN3 | 0.9516 | 0.0484 | 2.978645 | 0.000578 | 0.000050 | 0.00001 | 0.5 | 7.891078 | 8.513194 | 3.836812 | ... | 0.407369 | 0.042825 | 0.275558 | 0.029414 | 5.799955 | 0.660772 | True | 0 vs Rest | 0 | Rest |
| ADTRP | 0.9434 | 0.0566 | 2.813481 | 0.000212 | 0.000017 | 0.00001 | 0.5 | 6.058331 | 6.307013 | 3.337694 | ... | 0.165023 | 0.012338 | 0.140114 | 0.010818 | 2.795105 | 0.243646 | True | 0 vs Rest | 0 | Rest |
| MAP3K8 | 0.9402 | 0.0598 | 2.755087 | 0.000008 | 0.000211 | 0.00001 | 0.5 | -4.629491 | -4.887618 | 2.056406 | ... | 0.003114 | 0.316942 | 0.003114 | 0.213679 | 0.039370 | 1.967212 | True | 0 vs Rest | 0 | Rest |
5 rows × 22 columns
filtered_pro = {}
filtered_rna = {}
cats = rna.obs[TOTALVI_CLUSTERS_KEY].cat.categories
for c in cats:
cid = f"{c} vs Rest"
cell_type_df = de_df.loc[de_df.comparison == cid]
cell_type_df = cell_type_df.sort_values("lfc_median", ascending=False)
cell_type_df = cell_type_df[cell_type_df.lfc_median > 0]
pro_rows = cell_type_df.index.str.contains("TotalSeqB")
data_pro = cell_type_df.iloc[pro_rows]
data_pro = data_pro[data_pro["bayes_factor"] > 0.7]
data_rna = cell_type_df.iloc[~pro_rows]
data_rna = data_rna[data_rna["bayes_factor"] > 3]
data_rna = data_rna[data_rna["non_zeros_proportion1"] > 0.1]
filtered_pro[c] = data_pro.index.tolist()[:3]
filtered_rna[c] = data_rna.index.tolist()[:2]
We can also use general scanpy visualization functions
sc.tl.dendrogram(rna, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)
# This is a bit of a hack to be able to use scanpy dendrogram with the protein data
protein.obs[TOTALVI_CLUSTERS_KEY] = rna.obs[TOTALVI_CLUSTERS_KEY]
protein.obsm[TOTALVI_LATENT_KEY] = rna.obsm[TOTALVI_LATENT_KEY]
sc.tl.dendrogram(protein, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)
Matrix plot displays totalVI denoised protein expression per leiden cluster.
sc.pl.matrixplot(
protein,
protein.var["clean_names"],
groupby=TOTALVI_CLUSTERS_KEY,
gene_symbols="clean_names",
dendrogram=True,
swap_axes=True,
layer="denoised_protein",
cmap="Greens",
standard_scale="var",
)
This is a selection of some of the markers that turned up in the RNA DE test.