scANVI¶
scANVI 1 (singlecell ANnotation using Variational Inference; Python class SCANVI
) is a semisupervised model for singlecell transcriptomics data.
In a sense, it can be seen as a scVI extension that can leverage the cell type knowledge for a subset of the cells present in the data sets to infer the states of the rest of the cells.
For this reason, scANVI can help annotate a data set of unlabelled cells from manually annotated atlases, e.g., Tabula Sapiens 2.
The advantages of scANVI are:
Comprehensive in capabilities.
Scalable to very large datasets (>1 million cells).
The limitations of scANVI include:
Effectively requires a GPU for fast inference.
Latent space is not interpretable, unlike that of a linear method.
May not scale to very large number of cell types.
Preliminaries¶
scANVI takes as input a scRNAseq gene expression matrix \(X\) with \(N\) cells and \(G\) genes, as well as a vector \(\mathbf{c}\) containing the partially observed cell type annotations. Let \(C\) be the number of observed cell types in the data. Additionally, a design matrix \(S\) containing \(p\) observed covariates, such as day, donor, etc, is an optional input. While \(S\) can include both categorical covariates and continuous covariates, in the following, we assume it contains only one categorical covariate with \(K\) categories, which represents the common case of having multiple batches of data.
Generative process¶
scANVI extends the scVI model by making use of observed cell types \(c_n\) following a graphical model inspired by works on semisupervised VAEs 3.
We assume no knowledge over the distribution of cell types in the data (i.e., uniform probabilities for categorical distribution on \(c_n\)). This modeling choice helps ensure a proper handling of rare cell types in the data. We assume that the withincelltype characterization of the cell follows a Normal distribution, s.t. \(u_n \sim \mathcal{N}(0, I_d)\). The distribution over the random vector \(z_n\) contains learnable parameters in the form of the neural networks \(f_z^\mu\), \(f_z^\sigma\). Qualitatively, \(z_n\) characterizes each cell cellular state as a continuous, lowdimensional random variable, and has the same interpretation as in the scVI model. However, the prior for this variable takes into account the partial celltype information to better structure the latent space.
The rest of the model closely follows scVI. In particular, it represents the library size as a random variable, and gene expression likelihoods as negative binomial distributions parameterized by functions of \(z_n, l_n\), condition to the batch assignments \(s_n\).
In addition to the table in scVI, we have the following in scANVI.
Latent variable 
Description 
Code variable (if different) 

\(c_n \in \Delta^{C1}\) 
Cell type. 

\(z_n \in \mathbb{R}^{d}\) 
Latent cell state 

\(u_n \in \mathbb{R}^{d}\) 
Latent celltype specific state 

Inference¶
scANVI assumes the following factorization for the inference model
We make several observations here. First, each of those variational distributions will be parameterized by neural networks. Second, while \(q_\eta(z_n, x_n)\) and \(q_\eta(u_n \mid c_n, z_n)\) are assumed Gaussian, \(q_\eta(c_n \mid z_n)\) corresponds to a Categorical distribution over cell types. In particular, the variational distribution \(q_\eta(c_n \mid z_n)\) can predict cell types for any cell.
Behind the scenes, scANVI’s classifier uses the mean of a cell’s variational distribution \(q_\eta(z_n \mid x_n)\) for classification.
Training details¶
scANVI optimizes evidence lower bounds (ELBO) on the log evidence. For the sake of clarity, we ignore the library size and batch assignments below. We note that the evidence and hence the ELBO have a different expression for cells with observed and unobserved cell types.
First, assume that we observe both gene expressions \(x_n\) and type assignments \(c_n\). In that case, we bound the log evidence as
We aim to optimize for \(\theta, \eta\) the righthand side of this equation using stochastic gradient descent. Gradient updates for the generative model parameters \(\theta\) are easy to get. In that case, the gradient of the expectation corresponds to the expectation of the gradients.
However, this is not the case when we differentiate for \(\eta\). The reparameterization trick solves this issue and applies to the (Gaussian) distributions associated with \(q_\eta(z_n \mid x_n) ,q_\eta(u_n \mid z_n, c_n)\). In particular, we can write \(\mathcal{L}_S\) as an expectation under noise distributions independent of \(\eta\). For convenience, we will write expectations of the form \(\mathbb{E}_{\epsilon_v}\) to denote expectation under the variational distribution using the reparameterization trick. We refer the reader to 4 for additional insight on the reparameterization trick.
Things get trickier in the unobserved cell type case. In this setup, the ELBO corresponds to the righthand side of
Unfortunately, the reparameterization trick does not apply naturally to \(q_\eta(c_n \mid z_n)\). As an alternative, we observe that
In this form, we can differentiate \(\mathcal{L}_u\) with respect to the inference network parameters, as
In other words, we will need to marginalize \(c_n\) out to circumvent the fact that categorical distributions cannot use the reparameterization trick.
Overall, we optimize \(\mathcal{L} = \mathcal{L}_U + \mathcal{L}_S\) to train the model on both labeled and unlabelled data.
Tasks¶
scANVI can perform all the same tasks as scVI (see scVI). In addition, scANVI can do the following:
Prediction¶
For prediction, scANVI returns \(q_\eta(c_n \mid z_n)\) in the following function:
>>> adata.obs["scanvi_prediction"] = model.predict()
References:
 1
Xu Chenling, Romain Lopez, Edouard Mehlman, Jeffrey Regier, Michael I. Jordan, Nir Yosef (2021), Probabilistic harmonization and annotation of single‐cell transcriptomics data with deep generative models, Molecular systems biology 17.1.
 2
Tabula Sapiens Consortium (2021), The Tabula Sapiens: a single cell transcriptomic atlas of multiple organs from individual human donors, BioRxiv.
 3
Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling (2014), Semisupervised learning with deep generative models, Advances in neural information processing systems.
 4
Diederik P. Kingma, Max Welling (2013) (2014), AutoEncoding Variational Bayes, Arxiv.