MultiVI#
MultiVI [1] (Python class MULTIVI
) multimodal generative model capable of
integrating multiome, scRNA-seq and scATAC-seq data. After training, it can be used for many common downstream tasks,
and also for imputation of a missing modality.
The advantages of multiVI are:
Comprehensive in capabilities. Able to perform DE gene, DA region analysis.
Scalable to very large datasets (>1 million cells).
Once trained with sufficient multimodal data, able to accurately input missing modalities.
The limitations of MultiVI include:
Effectively requires a GPU for fast inference.
Preliminaries#
MultiVI takes as input a multiome single-cell matrix \(X_{mult}\) with \(N\) cells and \(G\) genes and \(M\) genomic regions (peaks), or just a scRNA-seq gene expression matrix \(X_{rna}\) with \(N\) cells and \(G\) genes or just a scATAC-seq accessibility matrix \(X_{acc}\) with \(N\) cells and \(M\) genomic regions (peaks), which can be binary or have count data. Additionally, a design matrix \(S\) containing \(p\) observed covariates, such as day, donor, etc, is an optional input. While \(S\) can include both categorical covariates and continuous covariates, in the following, we assume it contains only one categorical covariate with \(K\) categories, which represents the common case of having multiple batches of data.
Generative process#
MultiVI posits that the observed UMI counts for cell \(n\), gene \(g\), \(x_{ng}\), and regions \(j\), \(y_{nj}\), are generated by the following process:
For RNA-seq, the prior parameters \(\ell_\mu\) and \(\ell_{\sigma^2}\) are computed per batch as the mean and
variance of the log library size over cells. The expression data are generated from a count-based likelihood
distribution, which here, we denote as the \(\mathrm{ObservationModel}\). While by default the
\(\mathrm{ObservationModel}\) is a \(\mathrm{ZeroInflatedNegativeBinomial}\) (ZINB) distribution parameterized
by its mean, inverse dispersion, and non-zero-inflation probability, respectively, users can pass
gene_likelihood = "negative_binomial"
to MultiVI
, for example, to use a simpler
\(\mathrm{NegativeBinomial}\) distribution.
For ATAC-seq, detecting a region as accessible (\(y_{nj} > 0\)) is generated by a Bernoulli random variable which depends on a cell-specific latent variable \(z_n\), which captures biological heterogeneity, and two auxiliary scaling factors \(\ell_n, r_j\), which account for cell-specific and region-specific technical effects, respectively.
The generative process of MultiVI uses neural networks to produce:
which respectively decode the denoised gene expression, non-zero-inflation probability (only if using ZINB) and estimates the probability of accessibility.
The latent variables, along with their description are summarized in the following table:
Latent variable |
Description |
Code variable (if different) |
---|---|---|
\(z_n \in \mathbb{R}^d\) |
Low-dimensional representation capturing the state of a cell. |
N/A |
\(\rho_n \in \Delta^{G-1}\) |
Denoised/normalized gene expression. This is a vector that sums to 1 within a cell, unless size_factor_key is not None in |
|
\(\ell_n \in (0, \infty)\) |
Library size for RNA. |
N/A |
\(\theta_g \in (0, \infty)\) |
Inverse dispersion for negative binomial. This can be set to be gene/batch specific for example (and would thus be \(\theta_{kg}\)), by passing |
N/A |
\(p_r\) |
Accessibility probability estimate |
N/A |
\(\ell_n \in \left[0,1\right]\) |
Cell-wise scaling factor. Learned, but can be set manually with size_factor_key in |
|
\(r_j \in \left[0,1\right]\) |
Region-wise scaling factor |
|
Inference#
MultiVI uses variational inference and specifically auto-encoding variational bayes (see Variational Inference) to learn both the model parameters (the neural network params, dispersion params, etc.) and an approximate posterior distribution with the following factorization:
Here \(\eta\) is a set of parameters corresponding to inference neural networks (encoders), which we do not describe
in detail here, but are described in the MultiVI paper. The underlying class used as the encoder for MultiVI is
Encoder
. \(z_n\) is calculated determinimistically as the average of two latent variables part of
the variational approximation \(z^{acc}_n\) and \(z^{rna}_n\). These two variables are Normal and for that
reason, \(z_n\) is Normal. This formalism permits to handle individual modalities by bypassing the average mechanism
and considering each modality variation approximation.
Tasks#
Here we provide an overview of some of the tasks that MultiVI can perform. Please see scvi.model.MULTIVI
for the full API reference.
Dimensionality reduction#
For dimensionality reduction, the mean of the approximate posterior \(q_\eta(z_n \mid x_n, s_n)\) is returned by default. This is achieved using the method:
>>> latent = model.get_latent_representation()
>>> adata.obsm["X_mvi"] = latent
Users may also return samples from this distribution, as opposed to the mean by passing the argument give_mean=False
.
The latent representation can be used to create a nearest neighbor graph with scanpy with:
>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_mvi")
>>> adata.obsp["distances"]
Normalization/denoising/imputation of expression#
In get_normalized_expression()
MultiVI returns the expected value of \(\rho_n\) under the approximate posterior. For one cell \(n\), this can be written as:
where \(\ell_n'\) is by default set to 1. See the library_size
parameter for more details. The expectation is approximated using Monte Carlo, and the number of samples can be passed as an argument in the code:
>>> model.get_normalized_expression(n_samples=10)
By default the mean over these samples is returned, but users may pass return_mean=False
to retrieve all the samples.
Notably, this function also has the transform_batch
parameter that allows counterfactual prediction of expression in an unobserved batch. See the Counterfactual prediction guide.
It is worth noting that when accessibility data is passed, MultiVI computes imputation of missing gene expression data. When gene expression data is passed, MultiVI computes denoised gene expression data.
Denoising/imputation of accessibility#
In get_accessibility_estimates()
MultiVI returns the expected value of \(y_i\) under the
approximate posterior. For one cell \(i\), this can be written as:
The expectation is approximated by returning the mean of the variation approximation \(z_n\) and then, using this value to decode the accessibility probability estimate \(p_r\). Alternatively, to approximate the variational mean, a number of samples can be passed as an argument in the code:
>>> model.get_accessibility_estimates(n_samples_overall=10)
This value is used to compute the mean of the latent variable over these samples. Notably, this function also has
the transform_batch
parameter that allows counterfactual prediction of accessibility in an unobserved batch.
See the Counterfactual prediction guide.
It is worth noting that when gene expression data is passed, MultiVI computes imputation of missing accessibility data. When accessibility data is passed, MultiVI computes denoised chromatin accessibility.
Differential expression#
Differential expression analysis is achieved with differential_expression()
. MultiVI tests differences in magnitude of \(f_w\left( z_n, s_n \right)\). More info is in Differential Expression.
Differential accessibility#
Differential accessibility analysis is achieved with differential_accessibility()
. MultiVI tests differences in accessibility of \(g_z\left( z_n, s_n \right)\).