CITE-seq analysis with totalVI#

With totalVI, we can produce a joint latent representation of cells, denoised data for both protein and RNA, integrate datasets, and compute differential expression of RNA and protein. Here we demonstrate this functionality with an integrated analysis of PBMC10k and PBMC5k, datasets of peripheral blood mononuclear cells publicly available from 10X Genomics subset to the 14 shared proteins between them. The same pipeline would generally be used to analyze a single CITE-seq dataset.

If you use totalVI, please consider citing:

  • Gayoso, A., Steier, Z., Lopez, R., Regier, J., Nazor, K. L., Streets, A., & Yosef, N. (2021). Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods, 18(3), 272-282.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet "scipy==1.14"
!pip install --quiet scvi-colab
from scvi_colab import install

install()
  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
   exit code: 1
  ╰─> [47 lines of output]
      + meson setup /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3 -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3/meson-python-native-file.ini
      The Meson build system
      Version: 1.10.1
      Source dir: /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1
      Build dir: /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3
      Build type: native build
      Project name: scipy
      Project version: 1.14.0
      C compiler for the host machine: cc (gcc 13.3.0 "cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
      C linker for the host machine: cc ld.bfd 2.42
      C++ compiler for the host machine: c++ (gcc 13.3.0 "c++ (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
      C++ linker for the host machine: c++ ld.bfd 2.42
      Cython compiler for the host machine: cython (cython 3.0.12)
      Host machine cpu family: x86_64
      Host machine cpu: x86_64
      Program python found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13)
      Found pkg-config: YES (/usr/bin/pkg-config) 1.8.1
      Run-time dependency python found: YES 3.13
      Program cython found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/cython)
      Compiler for C supports arguments -Wno-unused-but-set-variable: YES
      Compiler for C supports arguments -Wno-unused-function: YES
      Compiler for C supports arguments -Wno-conversion: YES
      Compiler for C supports arguments -Wno-misleading-indentation: YES
      Library m found: YES
      Fortran compiler for the host machine: gfortran (gcc 13.3.0 "GNU Fortran (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0")
      Fortran linker for the host machine: gfortran ld.bfd 2.42
      Compiler for Fortran supports arguments -Wno-conversion: YES
      Checking if "-Wl,--version-script" links: YES
      Program tools/generate_f2pymod.py found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/tools/generate_f2pymod.py)
      Program scipy/_build_utils/tempita.py found: YES (/opt/anaconda3/envs/scvi_new/bin/python3.13 /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/scipy/_build_utils/tempita.py)
      Program pythran found: YES 0.16.1 0.16.1 (/tmp/pip-build-env-ldn1chrx/overlay/bin/pythran)
      Found pkg-config: YES (/usr/bin/pkg-config) 1.8.1
      Did not find CMake 'cmake'
      Found CMake: NO
      Run-time dependency xsimd found: NO (tried pkgconfig and cmake)
      Run-time dependency threads found: YES
      Library npymath found: YES
      pybind11-config found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/pybind11-config) 2.12.1
      Run-time dependency pybind11 found: YES 2.12.1
      Program f2py found: YES (/tmp/pip-build-env-ldn1chrx/overlay/bin/f2py)
      Run-time dependency scipy-openblas found: NO (tried pkgconfig)
      Run-time dependency openblas found: NO (tried pkgconfig and cmake)
      Run-time dependency openblas found: NO (tried pkgconfig)
      
      ../scipy/meson.build:216:9: ERROR: Dependency "OpenBLAS" not found, tried pkgconfig
      
      A full log can be found at /tmp/pip-install-t15wsvjs/scipy_cb18514c915d4ad08e1cd49b59ae58f1/.mesonpy-yq2pwzr3/meson-logs/meson-log.txt
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> scipy

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Imports and data loading#

import os
import tempfile

import matplotlib.pyplot as plt
import muon
import numpy as np
import requests
import scanpy as sc
import scvi
import seaborn as sns
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

Load mudata#

For this tutorial we use an integrated PBMC10k and PBMC5k dataset available from 10X Genomics. To see the exact preprocessing that was done, or to preprocess your own CITE-seq dataset for use with scvi-tools models, see our preprocessing tutorial.

mdata_path = os.path.join(save_dir.name, "CITE-seq_pbmc_combined_preprocessed.h5mu")

# direct download URL
url = "https://exampledata.scverse.org/scvi-tools/CITE-seq_pbmc_combined_preprocessed.h5mu"

# Download only if file doesn't already exist
if not os.path.exists(mdata_path):
    print(f"Downloading MuData file to {mdata_path}...")
    r = requests.get(url)
    with open(mdata_path, "wb") as f:
        f.write(r.content)

# Load the MuData object
mdata = muon.read_h5mu(mdata_path)
Downloading MuData file to /tmp/tmpyrc6gsma/CITE-seq_pbmc_combined_preprocessed.h5mu...

Setup mudata#

Now we run setup_mudata, which is the MuData analog to setup_anndata. The caveat of this workflow is that we need to provide this function which modality of the mdata object contains each piece of data. So for example, the batch information is in mdata.mod["rna"].obs["batch"]. Therefore, in the modalities argument below we specify that the batch_key can be found in the "rna_subset" modality of the MuData object.

Notably, we provide protein_layer=None. This means scvi-tools will pull information from .X from the modality specified in modalities ("protein" in this case). In the case of RNA, we want to use the counts, which we stored in mdata.mod["rna"].layers["counts"].

mdata
MuData object with n_obs × n_vars = 13112 × 37555
  3 modalities
    rna:	13112 x 33538
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p'
      layers:	'counts'
    prot:	13112 x 17
    rna_subset:	13112 x 4000
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p'
      layers:	'counts'
# we need to work with dense, not sparse matricies
mdata["prot"].X = mdata["prot"].X.toarray()
mdata["rna_subset"].X = mdata["rna_subset"].X.toarray()
mdata.mod["rna_subset"].layers["counts"] = mdata.mod["rna_subset"].layers["counts"].toarray()

Note: wasn’t sure if I should combine two datasets similar to the current tutorial. Otherwise I assume I don’t specify a batch key if there’s just one dataset?

scvi.model.TOTALVI.setup_mudata(
    mdata,
    rna_layer="counts",
    protein_layer=None,
    batch_key="batch",
    modalities={
        "rna_layer": "rna_subset",
        "protein_layer": "prot",
        "batch_key": "rna_subset",
    },
)

Note

Specify the modality of each argument via the modalities dictionary, which maps layer/key arguments to MuData modalities.

Prepare and run model#

For the rest of this tutorial we will use the model with the external data we loaded.

model = scvi.model.TOTALVI(mdata)
INFO     Computing empirical prior initialization for protein background.
mdata
MuData object with n_obs × n_vars = 13112 × 37555
  obs:	'_scvi_labels'
  uns:	'_scvi_uuid', '_scvi_manager_uuid'
  3 modalities
    rna:	13112 x 33538
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p'
      layers:	'counts'
    prot:	13112 x 17
      obs:	'_scvi_batch'
    rna_subset:	13112 x 4000
      obs:	'batch', '_scvi_batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p'
      layers:	'counts'
model.train(early_stopping=True)

We will plot the loss curves for training and validation using auto alignment for the yaxis

last_val_valid = np.array(model.history["elbo_validation"])[-1]
last_val_train = np.array(model.history["elbo_train"])[-1]
global_min_loss = min(
    np.min(model.history["elbo_train"]), np.min(model.history["elbo_validation"])
)
last_max_loss = max(last_val_train, last_val_valid)[0]
global_max_loss = max(
    np.max(model.history["elbo_train"]), np.max(model.history["elbo_validation"])
)
# Compute the min and max of both train and validation losses
min_loss = min(min(last_val_train, last_val_valid), global_min_loss)
max_loss = max(max(last_val_train, last_val_valid), global_max_loss)
ylim_min = 0.995 * min_loss  # 0.5% below the minimum
ylim_max = min(
    global_max_loss, ylim_min + (last_max_loss - ylim_min) * 4
)  # keep it under the 25% part of figure
fig, ax = plt.subplots(1, 1)
model.history["elbo_train"].plot(ax=ax, label="train")
model.history["elbo_validation"].plot(ax=ax, label="validation")
if isinstance(ylim_min, (int | float)) and isinstance(ylim_max, (int | float)):
    ax.set(title="Negative ELBO over training epochs", ylim=(ylim_min, ylim_max))
else:
    ax.set(title="Negative ELBO over training epochs")
ax.legend()
<matplotlib.legend.Legend at 0x7430dfe4efd0>
../../../_images/848892b53d7bbed893bd13e381e97c11d38bb9b2e451296790f7aded4f947959.png

Analyze outputs#

We use Scanpy and muon for clustering and visualization after running totalVI. It’s also possible to save totalVI outputs for an R-based workflow.

rna = mdata.mod["rna_subset"]
protein = mdata.mod["prot"]
# arbitrarily store latent in rna modality
TOTALVI_LATENT_KEY = "X_totalVI"
rna.obsm[TOTALVI_LATENT_KEY] = model.get_latent_representation()
rna_denoised, protein_denoised = model.get_normalized_expression(n_samples=25, return_mean=True)
rna.layers["denoised_rna"] = rna_denoised
protein.layers["denoised_protein"] = protein_denoised

protein.layers["protein_foreground_prob"] = 100 * model.get_protein_foreground_probability(
    n_samples=25, return_mean=True
)
parsed_protein_names = [p.split("_")[0] for p in protein.var_names]
protein.var["clean_names"] = parsed_protein_names
mdata.update()

Now we can compute clusters and visualize the latent space.

TOTALVI_CLUSTERS_KEY = "leiden_totalVI"

sc.pp.neighbors(rna, use_rep=TOTALVI_LATENT_KEY)
sc.tl.umap(rna)
sc.tl.leiden(rna, key_added=TOTALVI_CLUSTERS_KEY)
mdata.update()

We can now use muon plotting functions which can pull data from either modality of the MuData object. We will show the umap of the model embeddings with leiden clusters (and batch inegration of the datasets if exists). Following that we will show the denoised protein values and the foreground probability of the 14 protein listed.

mdata
MuData object with n_obs × n_vars = 13112 × 37555
  obs:	'_scvi_labels'
  uns:	'_scvi_uuid', '_scvi_manager_uuid'
  3 modalities
    rna:	13112 x 33538
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p'
      layers:	'counts'
    prot:	13112 x 17
      obs:	'_scvi_batch'
      var:	'clean_names'
      layers:	'denoised_protein', 'protein_foreground_prob'
    rna_subset:	13112 x 4000
      obs:	'batch', '_scvi_batch', 'leiden_totalVI'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'hvg', 'log1p', 'neighbors', 'umap', 'leiden_totalVI'
      obsm:	'X_totalVI', 'X_umap'
      layers:	'counts', 'denoised_rna'
      obsp:	'distances', 'connectivities'
muon.pl.embedding(
    mdata,
    basis="rna_subset:X_umap",
    color=[f"rna_subset:{TOTALVI_CLUSTERS_KEY}", "rna_subset:batch"],
    frameon=False,
    ncols=1,
)

Visualize denoised protein values#

max_prot_to_plot = 14
protein.var_names[:max_prot_to_plot]
Index(['CD3_TotalSeqB', 'CD4_TotalSeqB', 'CD8a_TotalSeqB', 'CD14_TotalSeqB',
       'CD15_TotalSeqB', 'CD16_TotalSeqB', 'CD56_TotalSeqB', 'CD19_TotalSeqB',
       'CD25_TotalSeqB', 'CD45RA_TotalSeqB', 'CD45RO_TotalSeqB',
       'PD-1_TotalSeqB', 'TIGIT_TotalSeqB', 'CD127_TotalSeqB'],
      dtype='object')
muon.pl.embedding(
    mdata,
    basis="rna_subset:X_umap",
    color=protein.var_names[:max_prot_to_plot],
    frameon=False,
    ncols=3,
    vmax="p99",
    wspace=0.1,
    layer="denoised_protein",
)

Visualize probability of foreground#

Here we visualize the probability of foreground for the first 14 proteins in the list and cell (projected on UMAP). Some proteins are easier to disentangle than others. Some proteins end up being “all background”. For example, CD15 does not appear to be captured well, when looking at the denoised values above we see little localization in the monocytes.

Note

While the foreground probability could theoretically be used to identify cell populations, we recommend using the denoised protein expression, which accounts for the foreground/background probability, but preserves the dynamic range of the protein measurements. Consequently, the denoised values are on the same scale as the raw data and it may be desirable to take a transformation like log or square root.

By viewing the foreground probability, we can get a feel for the types of cells in our dataset. For example, it’s very easy to see a population of monocytes based on the CD14 foregroud probability.

muon.pl.embedding(
    mdata,
    basis="rna_subset:X_umap",
    layer="protein_foreground_prob",
    color=protein.var_names[:max_prot_to_plot],
    frameon=False,
    ncols=3,
    vmax="p99",
    wspace=0.1,
    color_map="cividis",
)

Differential expression#

Here we do a one-vs-all DE test, where each cluster is tested against all cells not in that cluster. The results for each of the one-vs-all tests is concatenated into one DataFrame object. Inividual tests can be sliced using the “comparison” column. Genes and proteins are included in the same DataFrame.

Important

We do not recommend using totalVI denoised values in other differential expression tools, as denoised values are a summary of a random quantity. The totalVI DE test takes into account the full uncertainty of the denoised quantities.

de_df = model.differential_expression(
    groupby="rna_subset:leiden_totalVI", delta=0.5, batch_correction=True
)
de_df.head(5)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
FHIT 0.9752 0.0248 3.671799 0.000669 0.000062 0.00001 0.5 4.687300 4.976536 2.099845 ... 0.544369 0.061422 0.381422 0.052481 8.070629 0.681380 True 0 vs Rest 0 Rest
NOG 0.9586 0.0414 3.142193 0.000269 0.000019 0.00001 0.5 7.715130 8.318260 3.465424 ... 0.220550 0.015914 0.180592 0.014305 3.914433 0.277270 True 0 vs Rest 0 Rest
LRRN3 0.9516 0.0484 2.978645 0.000578 0.000050 0.00001 0.5 7.891078 8.513194 3.836812 ... 0.407369 0.042825 0.275558 0.029414 5.799955 0.660772 True 0 vs Rest 0 Rest
ADTRP 0.9434 0.0566 2.813481 0.000212 0.000017 0.00001 0.5 6.058331 6.307013 3.337694 ... 0.165023 0.012338 0.140114 0.010818 2.795105 0.243646 True 0 vs Rest 0 Rest
MAP3K8 0.9402 0.0598 2.755087 0.000008 0.000211 0.00001 0.5 -4.629491 -4.887618 2.056406 ... 0.003114 0.316942 0.003114 0.213679 0.039370 1.967212 True 0 vs Rest 0 Rest

5 rows × 22 columns

filtered_pro = {}
filtered_rna = {}
cats = rna.obs[TOTALVI_CLUSTERS_KEY].cat.categories
for c in cats:
    cid = f"{c} vs Rest"
    cell_type_df = de_df.loc[de_df.comparison == cid]
    cell_type_df = cell_type_df.sort_values("lfc_median", ascending=False)

    cell_type_df = cell_type_df[cell_type_df.lfc_median > 0]

    pro_rows = cell_type_df.index.str.contains("TotalSeqB")
    data_pro = cell_type_df.iloc[pro_rows]
    data_pro = data_pro[data_pro["bayes_factor"] > 0.7]

    data_rna = cell_type_df.iloc[~pro_rows]
    data_rna = data_rna[data_rna["bayes_factor"] > 3]
    data_rna = data_rna[data_rna["non_zeros_proportion1"] > 0.1]

    filtered_pro[c] = data_pro.index.tolist()[:3]
    filtered_rna[c] = data_rna.index.tolist()[:2]

We can also use general scanpy visualization functions

sc.tl.dendrogram(rna, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)
# This is a bit of a hack to be able to use scanpy dendrogram with the protein data
protein.obs[TOTALVI_CLUSTERS_KEY] = rna.obs[TOTALVI_CLUSTERS_KEY]
protein.obsm[TOTALVI_LATENT_KEY] = rna.obsm[TOTALVI_LATENT_KEY]
sc.tl.dendrogram(protein, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)

Matrix plot displays totalVI denoised protein expression per leiden cluster.

sc.pl.matrixplot(
    protein,
    protein.var["clean_names"],
    groupby=TOTALVI_CLUSTERS_KEY,
    gene_symbols="clean_names",
    dendrogram=True,
    swap_axes=True,
    layer="denoised_protein",
    cmap="Greens",
    standard_scale="var",
)

This is a selection of some of the markers that turned up in the RNA DE test.

sc.pl.umap(
    rna,
    color=[
        TOTALVI_CLUSTERS_KEY,
        "IGHD",
        "FCER1A",
        "SCT",
        "GZMH",
        "NOG",
        "FOXP3",
        "C1QA",
        "SIGLEC1",
        "XCL2",
        "GZMK",
    ],
    legend_loc="on data",
    frameon=False,
    ncols=3,
    layer="denoised_rna",
    wspace=0.2,
)