totalVI

totalVI [1] (total Variational Inference; Python class TOTALVI) posits a flexible generative model of CITE-seq RNA and protein data that can subsequently be used for many common downstream tasks.
The advantages of totalVI are:

- Comprehensive in capabilities.
- Scalable to very large datasets (>1 million cells).

The limitations of totalVI include:

- Effectively requires a GPU for fast inference.
- Difficult to understand the balance between RNA and protein data in the low-dimensional representation of cells.
Preliminaries
totalVI takes as input a scRNA-seq gene expression matrix of UMI counts \(X\) with \(N\) cells and \(G\) genes, along with a paired matrix of protein abundance (UMI counts) \(Y\), also of \(N\) cells, but with \(T\) proteins. Thus, for each cell, we observe both RNA and protein information. Additionally, a design matrix \(S\) containing \(p\) observed covariates for each of the cells, such as day, donor, etc., is an optional input. While \(S\) can include both categorical and continuous covariates, in the following we assume it contains only one categorical covariate with \(K\) categories, which represents the common case of having multiple batches of data.
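As a concrete illustration of these shapes, here is a toy sketch with made-up sizes (in practice these matrices come from an AnnData object rather than raw NumPy arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

N, G, T, K = 100, 2000, 14, 2  # toy sizes: cells, genes, proteins, batches

# RNA UMI counts X (N x G) and paired protein UMI counts Y (N x T)
X = rng.poisson(1.0, size=(N, G))
Y = rng.poisson(50.0, size=(N, T))

# One categorical covariate (batch), one-hot encoded into the design matrix S (N x K)
batch = rng.integers(0, K, size=N)
S = np.eye(K, dtype=int)[batch]

assert X.shape == (N, G) and Y.shape == (N, T) and S.shape == (N, K)
assert (S.sum(axis=1) == 1).all()  # each cell belongs to exactly one batch
```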
Generative process
We posit each cell’s protein and RNA expression to be generated by the following process:
First, for each cell \(n\),
The prior parameters \(l_\mu\) and \(l_{\sigma^2}\) are computed per batch as the mean and variance of the log library size over cells. The generative process of totalVI uses neural networks:
where \(d\) is the dimension of the latent space (associated with latent variable \(z\)). We also have global parameters \(\theta_g\) and \(\phi_t\), which represent gene- and protein-specific (respectively) overdispersion.
Then for each gene \(g\) in cell \(n\),
where the negative binomial is parameterized by its mean and inverse dispersion. And finally for each protein \(t\) in cell \(n\),
Integrating out \(v_{nt}\) yields a negative binomial mixture conditional distribution for \(y_{nt}\).
Furthermore, \(\beta_{nt}\) represents background protein signal due to ambient antibodies or non-specific antibody binding.
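For reference, the generative process described above can be written out as follows. This is a sketch based on the totalVI publication; the decoder names \(f_\pi\) and \(f_\alpha\) are placeholders by analogy with \(f_\rho\), and the exact parameterization may differ slightly from the rendered equations:

```latex
\begin{align*}
z_n &\sim \mathrm{Normal}(0, I_d) \\
l_n &\sim \mathrm{LogNormal}\left(l_\mu, l_{\sigma^2}\right) \\
\beta_{nt} &\sim \mathrm{LogNormal}(c_t, d_t) \\
\rho_n &= f_\rho(z_n, s_n), \qquad \pi_n = f_\pi(z_n, s_n), \qquad \alpha_n = f_\alpha(z_n, s_n) \\
x_{ng} &\sim \mathrm{NegativeBinomial}\left(l_n \rho_{ng}, \theta_g\right) \\
v_{nt} &\sim \mathrm{Bernoulli}(\pi_{nt}) \\
y_{nt} \mid v_{nt} = 1 &\sim \mathrm{NegativeBinomial}(\beta_{nt}, \phi_t) \\
y_{nt} \mid v_{nt} = 0 &\sim \mathrm{NegativeBinomial}(\alpha_{nt}\beta_{nt}, \phi_t)
\end{align*}
```

Here \(v_{nt} = 1\) corresponds to the background component and \(v_{nt} = 0\) to the foreground component, scaled up by \(\alpha_{nt} \geq 1\).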
The prior parameters \(c_t\) and \(d_t\) are unfortunately called background_pro_alpha and background_pro_log_beta in the code.
They are learned parameters during inference, but are initialized through a procedure that fits a two-component Gaussian mixture model for each cell and records the mean and variance of the component with the smaller mean, aggregating across all cells. This can be disabled by setting empirical_protein_background_prior=False, which then forces a random initialization.
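The initialization procedure can be sketched as follows. This is a simplified stand-alone version, not the scvi-tools implementation: a two-component 1-D Gaussian mixture is fit by EM to a cell's log protein counts, and the smaller-mean (background) component's statistics are recorded:

```python
import numpy as np

def fit_two_component_gmm(x, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture by EM; return (means, variances, weights)."""
    mu = np.quantile(x, [0.25, 0.75])  # initialize the two means at the quartiles
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(w))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0) + 1e-10
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return mu, var, w

def background_prior_stats(log_protein_counts):
    """Return the mean and variance of the smaller-mean (background) component."""
    mu, var, _ = fit_two_component_gmm(log_protein_counts)
    k = np.argmin(mu)
    return mu[k], var[k]
```

In totalVI these per-cell statistics are then aggregated across cells to set the prior; here only the per-cell fit is shown.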
The latent variables, along with their descriptions, are summarized in the following table:

| Latent variable | Description | Code variable (if different) |
| --- | --- | --- |
| \(z_n \in \mathbb{R}^d\) | Low-dimensional representation capturing the joint state of a cell | N/A |
| \(\rho_n \in \Delta^{G-1}\) | Denoised/normalized gene expression | |
| \(\alpha_n \in [1, \infty)^T\) | Foreground scaling factor for proteins; identifies the mixture distribution (see below) | |
| \(\pi_n \in (0, 1)^T\) | Probability of background for each protein | |
| \(l_n \in (0, \infty)\) | Library size for RNA. Here it is modeled as a latent variable, but the recent default for totalVI is to treat library size as observed, equal to the total RNA UMI count of a cell; this behavior is controlled by a model argument. | N/A |
| \(\beta_{nt} \in (0, \infty)\) | Protein background intensity. Used twice to identify the protein mixture model. | |
Inference
totalVI uses variational inference, and specifically auto-encoding variational Bayes (see Variational Inference), to learn both the model parameters (the neural network parameters, dispersion parameters, etc.) and an approximate posterior distribution with the following factorization:
Here \(\eta\) is a set of parameters corresponding to inference neural networks, which we do not describe in detail here, but are described in the totalVI paper. totalVI can also handle missing proteins (i.e., a dataset comprised of multiple batches, where each batch potentially has a different antibody panel, or no protein data at all). We refer the reader to the original publication for these details.
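The factorization referenced above is, schematically, of the following form. This is a sketch; the exact conditioning sets and the treatment of \(l_n\) and \(\beta_{nt}\) are spelled out in the totalVI paper:

```latex
\begin{align*}
q_\eta(z_n, l_n, \beta_n \mid x_n, y_n, s_n)
  = q_\eta(z_n \mid x_n, y_n, s_n)\,
    q_\eta(l_n \mid x_n, y_n, s_n)
    \prod_{t} q_\eta(\beta_{nt} \mid z_n, s_n)
\end{align*}
```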
Tasks
Dimensionality reduction
For dimensionality reduction, we by default return the mean of the approximate posterior \(q_\eta(z_n \mid x_n, y_n,s_n)\). This is achieved using the method:
>>> latent = model.get_latent_representation()
>>> adata.obsm["X_totalvi"] = latent
Users may also return samples from this distribution, as opposed to the mean, by passing the argument give_mean=False.
The latent representation can be used to create a nearest neighbor graph with scanpy with:
>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_totalvi")
>>> adata.obsp["distances"]
Normalization and denoising of RNA and protein expression
In get_normalized_expression(), totalVI returns, for RNA, the expected value of \(l_n\rho_n\) under the approximate posterior, and for proteins, the expected value of \((1 - \pi_{nt})\beta_{nt}\alpha_{nt}\).
For one cell \(n\), in the case of RNA, this can be written as:
where \(l_n'\) is by default set to 1. See the library_size parameter for more details. The expectation is approximated using Monte Carlo, and the number of samples can be passed as an argument in the code:
>>> rna, protein = model.get_normalized_expression(n_samples=10)
By default the mean over these samples is returned, but users may pass return_mean=False to retrieve all the samples.
In the case of proteins, there are a few important options that control what constitutes denoised protein expression.
For example, include_protein_background=True will result in estimating the expectation of \((1 - \pi_{nt})\beta_{nt}\alpha_{nt} + \pi_{nt}\beta_{nt}\). Setting sampling_protein_mixing=True will result in sampling \(v_{nt} \sim \textrm{Bernoulli}(\pi_{nt})\) and replacing \(\pi_{nt}\) with \(v_{nt}\).
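The effect of these options can be illustrated with plain NumPy on a single posterior sample. The values below are hypothetical; in practice \(\pi\), \(\beta\), and \(\alpha\) come from the model's approximate posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
pi = rng.uniform(0.1, 0.9, T)     # probability of background per protein
beta = rng.uniform(5.0, 20.0, T)  # background intensity
alpha = rng.uniform(1.0, 3.0, T)  # foreground scaling factor (>= 1)

# Default: foreground-only denoised expression, (1 - pi) * beta * alpha
denoised = (1 - pi) * beta * alpha

# include_protein_background=True: also add the expected background contribution
with_background = (1 - pi) * beta * alpha + pi * beta

# sampling_protein_mixing=True: replace pi with a Bernoulli draw v ~ Bernoulli(pi)
v = rng.binomial(1, pi)
sampled_mixing = (1 - v) * beta * alpha

assert np.all(with_background >= denoised)  # adding background can only increase the value
```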
Notably, this function also has the transform_batch parameter that allows counterfactual prediction of expression in an unobserved batch. See the Counterfactual prediction guide.
Differential expression
Differential expression analysis is achieved with differential_expression(). totalVI tests differences in the magnitude of \(f_\rho\left( z_n, s_n \right)\) for RNA, and of \((1 - \pi_{nt})\beta_{nt}\alpha_{nt}\) for proteins, with similar options to change this quantity as in the normalized expression function.
More info on the mathematics behind differential expression is in Differential Expression.
Data simulation
Data can be generated from the model using the posterior predictive distribution in posterior_predictive_sample(). This is equivalent to feeding a cell through the model, sampling from the posterior distributions of the latent variables, retrieving the likelihood parameters, and finally, sampling from this distribution.
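The final sampling step can be sketched in NumPy: a negative binomial with mean \(\mu\) and inverse dispersion \(\theta\) is a gamma-Poisson mixture, so given likelihood parameters (hypothetical values below) counts can be drawn as follows:

```python
import numpy as np

def sample_negative_binomial(mu, theta, rng):
    """Sample NB(mean=mu, inverse dispersion=theta) as a gamma-Poisson mixture."""
    rate = rng.gamma(shape=theta, scale=mu / theta)  # E[rate] = mu, Var[rate] = mu**2 / theta
    return rng.poisson(rate)

rng = np.random.default_rng(0)
mu = np.full(100_000, 4.0)  # hypothetical likelihood means, e.g. l_n * rho_ng
theta = 2.0                 # hypothetical inverse dispersion theta_g

x = sample_negative_binomial(mu, theta, rng)
# NB moments to sanity-check against: E[x] = mu, Var[x] = mu + mu**2 / theta
```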
References:

[1] Adam Gayoso*, Zoë Steier*, Romain Lopez, Jeffrey Regier, Kristopher L. Nazor, Aaron Streets, Nir Yosef (2021), "Joint probabilistic modeling of single-cell multi-omic data with totalVI", Nature Methods.