CITE-seq analysis with totalVI#

With totalVI, we can produce a joint latent representation of cells, denoised data for both protein and RNA, integrate datasets, and compute differential expression of RNA and protein. Here we demonstrate this functionality with an integrated analysis of PBMC10k and PBMC5k, datasets of peripheral blood mononuclear cells publicly available from 10X Genomics subset to the 14 shared proteins between them. The same pipeline would generally be used to analyze a single CITE-seq dataset.

If you use totalVI, please consider citing:

  • Gayoso, A., Steier, Z., Lopez, R., Regier, J., Nazor, K. L., Streets, A., & Yosef, N. (2021). Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods, 18(3), 272-282.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
import tempfile

import anndata as ad
import matplotlib.pyplot as plt
import mudata as md
import muon
import numpy as np
import scanpy as sc
import scvi
import seaborn as sns
import torch

Imports and data loading#

scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.2.2

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

This data that is used in this tutorial is the same from the totalvi manuscript and is PBMCs from 10x Genomics profiled with RNA and protein that was already filtered as described in the totalVI manuscript (low quality cells, doublets, lowly expressed genes, etc.) and in this link from the totalvi reproducibility repo. We do this for both dataests (5K and 10K). Note that this filtering script is old and perhaps not all functionalities work with recent updates, but the general purpose and ideas remain the same. We then use those 2 filtered datasets and merge them so that only shared proteins remains for each. This process can be seen in the _load_pbmcs_10x_cite_seq function here

adata = scvi.data.pbmcs_10x_cite_seq(save_path=save_dir.name)
adata
INFO     Downloading file at /tmp/tmpqctjnerx/pbmc_10k_protein_v3.h5ad
INFO     Downloading file at /tmp/tmpqctjnerx/pbmc_5k_protein_v3.h5ad
AnnData object with n_obs × n_vars = 10849 × 15792
    obs: 'n_genes', 'percent_mito', 'n_counts', 'batch'
    obsm: 'protein_expression'

We run the standard workflow for keeping count and normalized data together:

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
adata.obs_names_make_unique()

Important

In this tutorial we will show totalVI’s compatibility with the MuData format, which is a container for multiple AnnData objects. MuData objects can be read from the outputs of CellRanger using muon.read_10x_h5.

Furthermore, AnnData alone can also be used by storing the protein count data in .obsm, which is how it already is. For the AnnData-only workflow, see the documentation for setup_anndata in scvi.model.TOTALVI.

protein_adata = ad.AnnData(adata.obsm["protein_expression"])
protein_adata.obs_names = adata.obs_names
del adata.obsm["protein_expression"]
mdata = md.MuData({"rna": adata, "protein": protein_adata})
mdata
MuData object with n_obs × n_vars = 10849 × 15806
  2 modalities
    rna:	10849 x 15792
      obs:	'n_genes', 'percent_mito', 'n_counts', 'batch'
      uns:	'log1p'
      layers:	'counts'
    protein:	10849 x 14
sc.pp.highly_variable_genes(
    mdata.mod["rna"],
    n_top_genes=4000,
    flavor="seurat_v3",
    batch_key="batch",
    layer="counts",
)
# Place subsetted counts in a new modality
mdata.mod["rna_subset"] = mdata.mod["rna"][:, mdata.mod["rna"].var["highly_variable"]].copy()
mdata.update()
mdata
MuData object with n_obs × n_vars = 10849 × 19806
  3 modalities
    rna:	10849 x 15792
      obs:	'n_genes', 'percent_mito', 'n_counts', 'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
      uns:	'log1p', 'hvg'
      layers:	'counts'
    protein:	10849 x 14
    rna_subset:	10849 x 4000
      obs:	'n_genes', 'percent_mito', 'n_counts', 'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
      uns:	'log1p', 'hvg'
      layers:	'counts'

Using own data#

If you want to analyze your own data, there are multiple ways to upload the count matrix (in h5 or mtx format):

  • It can be upload directly to the colab session storage, use the Upload to session sotrage button on the upper left corner in the Files tab.

  • You can also upload the data to Google Drive, and mount it using the Mount Drive button in the Files tab (for Colab users only).

  • Download directly from public available data source using command line tools such as curl. This will work not only for Colab.

In this example we are downloading h5 format of single-cell multiomic data generated by Proteintech Genomics directly into the Google Colab session storage. The data is from human resting PBMCs stained with the MultiPro® Human Discovery Panel (HDP) followed by processing using 10x Genomics Flex chemistry with Feature Barcoding Technology.

!curl -LO https://ptgngsdata.s3.us-west-2.amazonaws.com/counts/E44_1_restPBMC_DCpos_filtered_feature_bc_matrix.h5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38.9M  100 38.9M    0     0  2384k      0  0:00:16  0:00:16 --:--:-- 6764k

We then run the same preprocess workflow we show before:

mdata1 = muon.read_10x_h5("E44_1_restPBMC_DCpos_filtered_feature_bc_matrix.h5")
mdata1.mod["rna"].var_names_make_unique()
mdata1.mod["rna"].layers["counts"] = mdata1.mod["rna"].X.copy()
sc.pp.normalize_total(mdata1.mod["rna"])
sc.pp.log1p(mdata1.mod["rna"])

sc.pp.highly_variable_genes(
    mdata1.mod["rna"],
    n_top_genes=4000,
    flavor="seurat_v3",
    layer="counts",
)
# Place subsetted counts in a new modality
mdata1.mod["rna_subset"] = mdata1.mod["rna"][:, mdata1.mod["rna"].var["highly_variable"]].copy()

Becuase of the filtering process we will re create the mdata here

mdata1 = md.MuData(mdata1.mod)
# we need to work with dense and not sparse matrices:
mdata1["prot"].X = mdata1["prot"].X.toarray()
mdata1["rna_subset"].X = mdata1["rna_subset"].X.toarray()
mdata1.mod["rna_subset"].layers["counts"] = mdata1.mod["rna_subset"].layers["counts"].toarray()
mdata1.update()
mdata1
MuData object with n_obs × n_vars = 15530 × 22431
  3 modalities
    rna:	15530 x 18082
      var:	'gene_ids', 'feature_types', 'ENS.GENE', 'IsoCtrl', 'genome', 'map_rna', 'pattern', 'read', 'sequence', 'uniprot', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'log1p', 'hvg'
      layers:	'counts'
    prot:	15530 x 349
      var:	'gene_ids', 'feature_types', 'ENS.GENE', 'IsoCtrl', 'genome', 'map_rna', 'pattern', 'read', 'sequence', 'uniprot'
    rna_subset:	15530 x 4000
      var:	'gene_ids', 'feature_types', 'ENS.GENE', 'IsoCtrl', 'genome', 'map_rna', 'pattern', 'read', 'sequence', 'uniprot', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'log1p', 'hvg'
      layers:	'counts'

Setup mudata#

Now we run setup_mudata, which is the MuData analog to setup_anndata. The caveat of this workflow is that we need to provide this function which modality of the mdata object contains each piece of data. So for example, the batch information is in mdata.mod["rna"].obs["batch"]. Therefore, in the modalities argument below we specify that the batch_key can be found in the "rna_subset" modality of the MuData object.

Notably, we provide protein_layer=None. This means scvi-tools will pull information from .X from the modality specified in modalities ("protein" in this case). In the case of RNA, we want to use the counts, which we stored in mdata.mod["rna"].layers["counts"].

scvi.model.TOTALVI.setup_mudata(
    mdata,
    rna_layer="counts",
    protein_layer=None,
    batch_key="batch",
    modalities={
        "rna_layer": "rna_subset",
        "protein_layer": "protein",
        "batch_key": "rna_subset",
    },
)
scvi.model.TOTALVI.setup_mudata(
    mdata1,
    rna_layer="counts",
    protein_layer=None,
    modalities={
        "rna_layer": "rna_subset",
        "protein_layer": "prot",
        "batch_key": "rna_subset",
    },
)

Note

Specify the modality of each argument via the modalities dictionary, which maps layer/key arguments to MuData modalities.

Prepare and run model#

For the rest of this tutorial we will use the model with the external data we loaded.

model = scvi.model.TOTALVI(mdata1)
INFO     Computing empirical prior initialization for protein background.
model.train()

We will plot the loss curves for training and validation using auto alignment for the yaxis

last_val_valid = np.array(model.history["elbo_validation"])[-1]
last_val_train = np.array(model.history["elbo_train"])[-1]
global_min_loss = min(
    np.min(model.history["elbo_train"]), np.min(model.history["elbo_validation"])
)
last_max_loss = max(last_val_train, last_val_valid)[0]
global_max_loss = max(
    np.max(model.history["elbo_train"]), np.max(model.history["elbo_validation"])
)
# Compute the min and max of both train and validation losses
min_loss = min(min(last_val_train, last_val_valid), global_min_loss)
max_loss = max(max(last_val_train, last_val_valid), global_max_loss)
ylim_min = 0.995 * min_loss  # 0.5% below the minimum
ylim_max = min(
    global_max_loss, ylim_min + (last_max_loss - ylim_min) * 4
)  # keep it under the 25% part of figure
fig, ax = plt.subplots(1, 1)
model.history["elbo_train"].plot(ax=ax, label="train")
model.history["elbo_validation"].plot(ax=ax, label="validation")
ax.set(
    title="Negative ELBO over training epochs", ylim=(ylim_min, ylim_max)
)  # you can still plug in you numbers
ax.legend()
<matplotlib.legend.Legend at 0x7cf098ea2cc0>
../../../_images/18b633f5d92e7eede4d771b921a3c60ba26b6121b5fc5242a05490501d603729.png

Analyze outputs#

We use Scanpy and muon for clustering and visualization after running totalVI. It’s also possible to save totalVI outputs for an R-based workflow.

rna = mdata1.mod["rna_subset"]
protein = mdata1.mod["prot"]
# arbitrarily store latent in rna modality
TOTALVI_LATENT_KEY = "X_totalVI"
rna.obsm[TOTALVI_LATENT_KEY] = model.get_latent_representation()
rna_denoised, protein_denoised = model.get_normalized_expression(n_samples=25, return_mean=True)
rna.layers["denoised_rna"] = rna_denoised
protein.layers["denoised_protein"] = protein_denoised

protein.layers["protein_foreground_prob"] = 100 * model.get_protein_foreground_probability(
    n_samples=25, return_mean=True
)
parsed_protein_names = [p.split("_")[0] for p in protein.var_names]
protein.var["clean_names"] = parsed_protein_names
mdata1.update()

Now we can compute clusters and visualize the latent space.

TOTALVI_CLUSTERS_KEY = "leiden_totalVI"

sc.pp.neighbors(rna, use_rep=TOTALVI_LATENT_KEY)
sc.tl.umap(rna)
sc.tl.leiden(rna, key_added=TOTALVI_CLUSTERS_KEY)
mdata1.update()

We can now use muon plotting functions which can pull data from either modality of the MuData object. We will show the umap of the model embeddings with leiden clusters (and batch inegration of the datasets if exists). Following that we will show the denoised protein values and the foreground probability of the 14 protein listed.

muon.pl.embedding(
    mdata1,
    basis="rna_subset:X_umap",
    color=[f"rna_subset:{TOTALVI_CLUSTERS_KEY}"],
    frameon=False,
    ncols=1,
)

Visualize denoised protein values#

max_prot_to_plot = 14
protein.var_names[:max_prot_to_plot]
Index(['prot:AARS.67909.1', 'prot:ACACA.67373.1', 'prot:ACTN1.66895.1',
       'prot:AHNAK.16637.1', 'prot:AHR.67785.1', 'prot:AIF1.10904.1',
       'prot:AK2.66127.1', 'prot:AKT1.80457.1', 'prot:ALCAM.65223.1',
       'prot:ALDH1A1.15910.1', 'prot:ALDH1A1.60171.1', 'prot:ANXA1.66344.1',
       'prot:ANXA2.66035.1', 'prot:ANXA5.66245.1'],
      dtype='object')
muon.pl.embedding(
    mdata1,
    basis="rna_subset:X_umap",
    color=protein.var_names[:max_prot_to_plot],
    frameon=False,
    ncols=3,
    vmax="p99",
    wspace=0.1,
    layer="denoised_protein",
)

Visualize probability of foreground#

Here we visualize the probability of foreground for the first 14 proteins in the list and cell (projected on UMAP). Some proteins are easier to disentangle than others. Some proteins end up being “all background”. For example, CD15 does not appear to be captured well, when looking at the denoised values above we see little localization in the monocytes.

Note

While the foreground probability could theoretically be used to identify cell populations, we recommend using the denoised protein expression, which accounts for the foreground/background probability, but preserves the dynamic range of the protein measurements. Consequently, the denoised values are on the same scale as the raw data and it may be desirable to take a transformation like log or square root.

By viewing the foreground probability, we can get a feel for the types of cells in our dataset. For example, it’s very easy to see a population of monocytes based on the CD14 foregroud probability.

muon.pl.embedding(
    mdata1,
    basis="rna_subset:X_umap",
    layer="protein_foreground_prob",
    color=protein.var_names[:max_prot_to_plot],
    frameon=False,
    ncols=3,
    vmax="p99",
    wspace=0.1,
    color_map="cividis",
)

Differential expression#

Here we do a one-vs-all DE test, where each cluster is tested against all cells not in that cluster. The results for each of the one-vs-all tests is concatenated into one DataFrame object. Inividual tests can be sliced using the “comparison” column. Genes and proteins are included in the same DataFrame.

Important

We do not recommend using totalVI denoised values in other differential expression tools, as denoised values are a summary of a random quantity. The totalVI DE test takes into account the full uncertainty of the denoised quantities.

de_df = model.differential_expression(
    groupby="rna_subset:leiden_totalVI", delta=0.5, batch_correction=True
)
de_df.head(5)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison group1 group2
HP 0.9184 0.0816 2.420804 2.056065e-05 0.000007 0.0 0.5 2.833113 2.999560 3.219114 ... 0.026329 0.008360 0.022355 0.007324 0.225366 0.073018 False 0 vs Rest 0 Rest
COL5A3 0.9170 0.0830 2.402267 1.304645e-07 0.000033 0.0 0.5 -3.665083 -3.136525 4.787304 ... 0.000000 0.011171 0.000000 0.008582 0.000000 0.234775 False 0 vs Rest 0 Rest
RAMP1 0.9144 0.0856 2.368583 6.441141e-08 0.000002 0.0 0.5 -3.047432 -2.911168 3.192563 ... 0.000000 0.001850 0.000000 0.001480 0.000000 0.023384 False 0 vs Rest 0 Rest
HOXB7 0.9118 0.0882 2.335814 8.704242e-08 0.000006 0.0 0.5 -3.323101 -3.021720 3.949417 ... 0.000000 0.002293 0.000000 0.002071 0.000000 0.044373 False 0 vs Rest 0 Rest
PAK6 0.9116 0.0884 2.333329 3.994765e-07 0.000014 0.0 0.5 -3.142666 -2.940191 3.507081 ... 0.000000 0.007398 0.000000 0.006732 0.000000 0.139727 False 0 vs Rest 0 Rest

5 rows × 22 columns

Now we filter the results such that we retain features above a certain Bayes factor (which here is on the natural log scale) and genes with greater than 10% non-zero entries in the cluster of interest.

filtered_pro = {}
filtered_rna = {}
cats = rna.obs[TOTALVI_CLUSTERS_KEY].cat.categories
for c in cats:
    cid = f"{c} vs Rest"
    cell_type_df = de_df.loc[de_df.comparison == cid]
    cell_type_df = cell_type_df.sort_values("lfc_median", ascending=False)

    cell_type_df = cell_type_df[cell_type_df.lfc_median > 0]

    pro_rows = cell_type_df.index.str.contains("TotalSeqB")
    data_pro = cell_type_df.iloc[pro_rows]
    data_pro = data_pro[data_pro["bayes_factor"] > 0.7]

    data_rna = cell_type_df.iloc[~pro_rows]
    data_rna = data_rna[data_rna["bayes_factor"] > 3]
    data_rna = data_rna[data_rna["non_zeros_proportion1"] > 0.1]

    filtered_pro[c] = data_pro.index.tolist()[:3]
    filtered_rna[c] = data_rna.index.tolist()[:2]

We can also use general scanpy visualization functions

sc.tl.dendrogram(rna, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)
# This is a bit of a hack to be able to use scanpy dendrogram with the protein data
protein.obs[TOTALVI_CLUSTERS_KEY] = rna.obs[TOTALVI_CLUSTERS_KEY]
protein.obsm[TOTALVI_LATENT_KEY] = rna.obsm[TOTALVI_LATENT_KEY]
sc.tl.dendrogram(protein, groupby=TOTALVI_CLUSTERS_KEY, use_rep=TOTALVI_LATENT_KEY)

Matrix plot displays totalVI denoised protein expression per leiden cluster.

sc.pl.matrixplot(
    protein,
    protein.var["clean_names"],
    groupby=TOTALVI_CLUSTERS_KEY,
    gene_symbols="clean_names",
    dendrogram=True,
    swap_axes=True,
    layer="denoised_protein",
    cmap="Greens",
    standard_scale="var",
)

This is a selection of some of the markers that turned up in the RNA DE test.

sc.pl.umap(
    rna,
    color=[
        TOTALVI_CLUSTERS_KEY,
        "IGHD",
        "FCER1A",
        "SCT",
        "GZMH",
        "NOG",
        "FOXP3",
        "C1QA",
        "SIGLEC1",
        "XCL2",
        "GZMK",
    ],
    legend_loc="on data",
    frameon=False,
    ncols=3,
    layer="denoised_rna",
    wspace=0.2,
)