This page was generated from cite_scrna_integration_w_totalVI.ipynb.

Integration of CITE-seq and scRNA-seq data

Here we demonstrate how to integrate CITE-seq and scRNA-seq datasets with totalVI. The same principles here can be used to integrate CITE-seq datasets with different sets of measured proteins.

import sys

# If branch is "stable", install from PyPI; otherwise install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]

Imports and data loading

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9

import scanpy as sc
import scvi

sc.set_figure_params(figsize=(4, 4))

Here we focus on two CITE-seq datasets of peripheral blood mononuclear cells (PBMCs) from 10x Genomics that were used in the totalVI manuscript. We have already filtered these datasets for doublets and low-quality cells and genes.

The quality of totalVI's protein imputation depends in part on how well the datasets mix in the latent space; in other words, it is assumed here that the datasets largely share the same cell subpopulations.
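One quick way to get intuition for "how well the datasets mix" is to ask, for each cell, what fraction of its nearest neighbors in the latent space belong to the other batch. The sketch below uses random coordinates as a stand-in for the real latent space (which would come from model.get_latent_representation()); the batch labels and point counts are invented for illustration.

```python
import numpy as np

# Hypothetical latent coordinates; in practice use model.get_latent_representation()
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
batches = np.array(["PBMC10k"] * 50 + ["PBMC5k"] * 50)

# pairwise Euclidean distances; exclude self-neighbors
dists = np.linalg.norm(latent[:, None, :] - latent[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)

# fraction of each cell's k nearest neighbors that come from the other batch
k = 15
neighbor_idx = np.argsort(dists, axis=1)[:, :k]
other_frac = (batches[neighbor_idx] != batches[:, None]).mean(axis=1)

# for a well-mixed 50/50 split this hovers around 0.5; values near 0
# indicate batch-separated clusters
print(round(other_frac.mean(), 2))
```

Values far below 0.5 for an even split would suggest the batches occupy separate regions of the latent space, and imputation across them should be interpreted with caution.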

adata = scvi.data.pbmcs_10x_cite_seq()
INFO     Downloading file at data/pbmc_10k_protein_v3.h5ad
INFO     Downloading file at data/pbmc_5k_protein_v3.h5ad
# batch 0 corresponds to dataset_10k, batch 1 corresponds to dataset_5k
batch = adata.obs.batch.values.ravel()

Now we hold out the proteins of the 5k dataset by replacing all of its protein values with 0s. We store the original values so we can validate the imputation after training.

held_out_proteins = adata.obsm["protein_expression"][batch == "PBMC5k"].copy()
adata.obsm["protein_expression"].loc[batch == "PBMC5k"] = np.zeros_like(
    adata.obsm["protein_expression"][batch == "PBMC5k"]
)


scvi-tools will automatically detect proteins as missing in a certain batch if the protein has 0 counts for each cell in the batch. In other words, to indicate a protein is missing in a certain batch, please set it to 0 for each cell.
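The detection rule can be illustrated with a toy count matrix (invented numbers, not scvi-tools internals): a protein is flagged as missing in a batch exactly when every cell of that batch has a zero count for it.

```python
import numpy as np

# toy protein-count matrix: 3 cells from batch "A", 2 cells from batch "B"
protein_counts = np.array([
    [5, 0],   # batch A: protein 2 is zero in every A cell -> "missing" in A
    [2, 0],
    [7, 0],
    [1, 3],   # batch B: both proteins observed
    [0, 4],   # a single zero does NOT make a protein missing
])
batches = np.array(["A", "A", "A", "B", "B"])

# per batch, which proteins are all-zero across that batch's cells
missing_per_batch = {
    b: (protein_counts[batches == b] == 0).all(axis=0)
    for b in np.unique(batches)
}
print(missing_per_batch["A"])  # protein 2 flagged missing in batch A only
```

Note that an isolated zero in one cell (protein 1 in the last batch B cell) does not trigger the rule; the entire column within the batch must be zero.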

scvi.model.TOTALVI.setup_anndata(adata, batch_key="batch", protein_expression_obsm_key="protein_expression")
INFO     Using batches from adata.obs["batch"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Using protein expression from adata.obsm['protein_expression']
INFO     Using protein names from columns of adata.obsm['protein_expression']
INFO     Found batches with missing protein expression
INFO     Successfully registered anndata object containing 10849 cells, 4000 vars, 2 batches,
         1 labels, and 14 proteins. Also registered 0 extra categorical covariates and 0
         extra continuous covariates.
INFO     Please do not further modify adata until model is trained.

Prepare and run model

model = scvi.model.TOTALVI(adata)
model.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Epoch 400/400: 100%|██████████| 400/400 [11:00<00:00,  1.65s/it, loss=599, v_num=1]
plt.plot(model.history["elbo_train"], label="train")
plt.plot(model.history["elbo_validation"], label="validation")
plt.title("Negative ELBO over training epochs")
plt.ylim(1100, 1500)
plt.legend()

Analyze outputs

Again, we rely on Scanpy.

adata.obsm["X_totalVI"] = model.get_latent_representation()
adata.obsm["protein_fg_prob"] = model.get_protein_foreground_probability(transform_batch="PBMC10k")

rna, protein = model.get_normalized_expression(transform_batch="PBMC10k", n_samples=25, return_mean=True)


transform_batch is a powerful parameter. Setting it lets one predict the expression of cells as if they came from the specified batch. In this case, we observed protein expression in batch "PBMC10k" (batch categories come from the original adata object), but we have no protein expression in batch "PBMC5k". We would like to take the cells of batch "PBMC5k" and make a counterfactual prediction: what would their protein expression look like had they come from batch "PBMC10k"?
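A caricature of the mechanism: totalVI's decoder conditions on a batch covariate, so decoding the same latent state under a different batch label yields expression "as if" the cell came from that batch. The decoder and offsets below are invented for illustration and bear no relation to totalVI's actual parameterization.

```python
import numpy as np

# hypothetical additive batch effects on the log scale (invented values)
batch_offset = {"PBMC10k": 0.0, "PBMC5k": 1.2}

def decode(z, batch):
    # same latent cell state z, different batch covariate -> shifted expression
    return np.exp(z + batch_offset[batch])

z = np.array([1.0, 2.0])               # latent state of a PBMC5k cell
native = decode(z, "PBMC5k")           # expression under its own batch
counterfactual = decode(z, "PBMC10k")  # "as if" it came from PBMC10k
```

transform_batch plays the role of the second argument here: the cell's latent representation is fixed, and only the batch covariate fed to the decoder changes.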

protein.iloc[:5, :5]
CD3_TotalSeqB CD4_TotalSeqB CD8a_TotalSeqB CD14_TotalSeqB CD15_TotalSeqB
AAACCCAAGATTGTGA-1 13.776643 243.066116 4.119991 952.314941 109.560150
AAACCCACATCGGTTA-1 31.085346 206.181992 2.688894 746.883911 107.579414
AAACCCAGTACCGCGT-1 10.900543 391.018829 4.109083 1290.523926 117.206902
AAACCCAGTATCGAAA-1 0.596661 1.118086 25.140722 0.286133 126.729347
AAACCCAGTCGTCATA-1 1.097599 0.063791 44.169453 0.092724 94.427673


The following is for illustrative purposes. In the code blocks above, we have the denoised protein values for each cell; these values have the expected protein background component removed. However, to compare to the held-out protein values, we must include both the protein foreground and background. We recommend using the denoised values above for downstream tasks.
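The foreground/background decomposition can be caricatured with hypothetical numbers for a single cell and protein (these values are invented; this is not totalVI's exact generative model):

```python
# hypothetical per-cell quantities for one protein
background_mean = 20.0    # ambient / nonspecific binding component
foreground_mean = 300.0   # true protein signal component
fg_prob = 0.9             # prob. the counts are foreground ("protein_fg_prob")

# denoised value: background component removed
denoised = fg_prob * foreground_mean

# value comparable to raw held-out counts: foreground plus expected background
with_background = denoised + (1 - fg_prob) * background_mean

print(denoised, with_background)  # 270.0 272.0
```

The held-out counts were measured with background included, so the fair comparison target is the second quantity, which is what include_protein_background=True returns.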

_, protein_means = model.get_normalized_expression(
    n_samples=25,
    transform_batch="PBMC10k",
    include_protein_background=True,  # keep background so values are comparable to raw counts
    return_mean=True,
)

sc.pp.neighbors(adata, use_rep="X_totalVI")
sc.tl.umap(adata, min_dist=0.4)
sc.tl.leiden(adata, key_added="leiden_totalVI")
# shuffle the plotting order so that neither batch systematically overplots the other
perm_inds = np.random.permutation(len(adata))
sc.pl.umap(
    adata[perm_inds],
    color=["leiden_totalVI", "batch"],
    ncols=1,
)
batch = adata.obs.batch.values.ravel()
combined_protein = np.concatenate(
    [adata.obsm["protein_expression"].values[batch == "PBMC10k"], held_out_proteins],
    axis=0,
)

# cleaner protein names
parsed_protein_names = [p.split("_")[0] for p in adata.obsm["protein_expression"].columns]
for i, p in enumerate(parsed_protein_names):
    adata.obs["{} imputed".format(p)] = protein_means.iloc[:, i]
    adata.obs["{} observed".format(p)] = combined_protein[:, i]
viz_keys = []
for p in parsed_protein_names:
    viz_keys.append(p + " imputed")
    viz_keys.append(p + " observed")
# plot imputed next to observed protein for the held-out 5k cells
sc.pl.umap(
    adata[adata.obs.batch == "PBMC5k"],
    color=viz_keys,
    ncols=2,
)

Imputed vs denoised correlations

from scipy.stats import pearsonr
imputed_pros = protein_means[batch == "PBMC5k"]
held_vs_denoised = pd.DataFrame()
held_vs_denoised["Observed (log)"] = np.log1p(held_out_proteins.values.ravel())
held_vs_denoised["Imputed (log)"] = np.log1p(imputed_pros.to_numpy().ravel())
protein_names_corrs = []
for i, p in enumerate(parsed_protein_names):
    corr = pearsonr(held_out_proteins.values[:, i], imputed_pros.iloc[:, i])[0]
    protein_names_corrs.append("{}: Corr={}".format(p, np.round(corr, 3)))
held_vs_denoised["Protein"] = protein_names_corrs * len(held_out_proteins)
Observed (log) Imputed (log) Protein
0 3.258097 3.551565 CD3: Corr=0.786
1 5.105945 5.946287 CD4: Corr=0.877
2 2.833213 3.391618 CD8a: Corr=0.823
3 6.546785 7.193236 CD14: Corr=0.91
4 2.995732 4.842935 CD15: Corr=0.097

We notice that CD15 has a very low correlation (poor imputation accuracy). Recall that imputation involves a counterfactual query: "what would the protein expression have been for these cells if they had come from the PBMC10k dataset?" Thus, any technical issue with CD15 in the PBMC10k dataset will be reflected in the imputed values. Indeed, CD15 was not captured as well in the PBMC10k dataset as in the PBMC5k dataset.

(p9.ggplot(held_vs_denoised, p9.aes("Observed (log)", "Imputed (log)"))
 + p9.geom_point(size=0.5)
 + p9.facet_wrap("~Protein", scales="free")
 + p9.theme(figure_size=(10, 10), panel_spacing=0.35)
)