MethylVI#

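methylVI [1] (Python class METHYLVI) is a generative model of single-cell bisulfite sequencing (scBS-seq) data that can subsequently be used for common downstream tasks such as dimensionality reduction, transfer learning, estimation of methylation levels, and differential methylation analysis.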

The advantages of methylVI are:

Comprehensive in capabilities.
Scalable to very large datasets (>1 million cells).

The limitations of methylVI include:

Effectively requires a GPU for fast inference.
Latent space is not interpretable, unlike that of a linear method.

Preliminaries#

MethylVI takes as input scBS-seq count matrices representing methylation measurements aggregated over pre-defined regions of interest (e.g., gene bodies, known regulatory regions, etc.). Depending on the system being investigated, such measurements may be separated based on methylation context (e.g., CpG methylation versus non-CpG methylation).

For each context, methylVI accepts two count matrices as input: \(Y^{C}_{mc}\) and \(Y^{C}_{cov}\). Here \(C\) refers to an arbitrary methylation context, and each of these matrices has data from \(N\) cells and \(M\) genomic regions. Each entry in \(Y^{C}_{cov}\) represents the total number of cytosines profiled at a given region in a cell, while each entry in \(Y^{C}_{mc}\) denotes the number of those cytosines that were methylated. Additionally, a vector of categorical covariates \(S\), representing batch, donor, etc., is an optional input to the model.
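As a rough illustration of this input format, the sketch below aggregates per-cytosine calls into the two count matrices for a single context. The input layout (tuples of cell index, region index, and a methylated flag) and the function name `aggregate_calls` are hypothetical choices for this example, not part of methylVI.

```python
def aggregate_calls(calls, n_cells, n_regions):
    """Aggregate per-cytosine calls into (Y_mc, Y_cov) count matrices.

    `calls` is an iterable of (cell, region, is_methylated) tuples; this
    layout is an assumption made for illustration only.
    """
    y_mc = [[0] * n_regions for _ in range(n_cells)]
    y_cov = [[0] * n_regions for _ in range(n_cells)]
    for cell, region, is_methylated in calls:
        # Every profiled cytosine counts toward coverage.
        y_cov[cell][region] += 1
        # Only methylated cytosines count toward the methylated total.
        y_mc[cell][region] += int(is_methylated)
    return y_mc, y_cov


# Toy data: N = 2 cells, M = 3 regions.
calls = [
    (0, 0, True), (0, 0, False), (0, 0, True),
    (0, 2, False),
    (1, 1, True), (1, 1, True),
    (1, 2, False), (1, 2, True),
]
y_mc, y_cov = aggregate_calls(calls, n_cells=2, n_regions=3)
```

By construction, every entry of `y_mc` is at most the matching entry of `y_cov`. A region with zero coverage in a cell carries no methylation information for that cell.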

Generative process#

MethylVI posits that the observed number of methylated cytosines in context \(C\) for cell \(i\) in region \(j\), \(y^{C}_{ij}\), is generated by the following process:

\begin{align} z_{i} &\sim \mathcal{N}(0, I_d) \\ \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\ p^{C}_{ijk} &\sim \text{Beta}(\mu^{C}_{ij}, \gamma^{C}_j) \\ y^{C}_{ijk} &\sim \text{Ber}(p^{C}_{ijk}) \\ y^{C}_{ij} &= \sum_{k}y^{C}_{ijk} \end{align}

In brief, we assume that detection of individual cytosine \(k\) within region \(j\) for cell \(i\) as methylated can be modeled as a Bernoulli random variable. The parameters of these Bernoulli distributions are assumed to be similar for all cytosines \(k\) within a region \(j\), which we model as draws from a Beta distribution with parameters that depend on a cell-specific latent variable \(z_i\) that captures underlying methylation state as well as a batch covariate \(s_i\). The outcomes of these Bernoulli draws are then summed to get our number of methylated cytosines within the given region.

The above hierarchical process can be expressed more compactly as:

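\begin{align} z_{i} &\sim \mathcal{N}(0, I_d) \\ \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\ y^{C}_{ij} &\sim \text{BetaBinomial}(n^{C}_{ij}, \mu^{C}_{ij}, \gamma^{C}_{j}) \end{align}

where \(n^{C}_{ij}\), the corresponding entry of \(Y^{C}_{cov}\), is the total number of cytosines profiled in region \(j\) for cell \(i\).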
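To make the compact form concrete, the following sketch draws counts from a beta-binomial distribution using only the Python standard library. It assumes a mean-dispersion parameterization in which \(\alpha = \mu(1-\gamma)/\gamma\) and \(\beta = (1-\mu)(1-\gamma)/\gamma\); this mapping is an assumption for illustration and may differ from the one used in the methylVI implementation. The function name `sample_beta_binomial` is hypothetical.

```python
import random


def sample_beta_binomial(n, mu, gamma, rng):
    """Draw one count from BetaBinomial(n, mu, gamma).

    Assumed parameterization (illustrative): mean mu in (0, 1) and
    dispersion gamma in (0, 1), mapped to Beta shape parameters below.
    """
    alpha = mu * (1.0 - gamma) / gamma
    beta = (1.0 - mu) * (1.0 - gamma) / gamma
    # A single shared probability, as in the standard beta-binomial construction.
    p = rng.betavariate(alpha, beta)
    # Count methylated cytosines among the n profiled ones.
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(0)
n, mu, gamma = 20, 0.3, 0.1
draws = [sample_beta_binomial(n, mu, gamma, rng) for _ in range(5000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
binomial_var = n * mu * (1.0 - mu)  # variance without overdispersion
```

Under this parameterization the sample mean should be close to \(n\mu\), while the sample variance exceeds the binomial variance \(n\mu(1-\mu)\), with larger \(\gamma\) giving more overdispersion.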

For each methylation context \(C\), the MethylVI generative process uses a single neural network:

\begin{align} f_{\theta^{C}}(z_{i}, s_i) &: \mathbb{R}^{d} \times \{0, 1\}^K \to \left(0,1\right)^M \end{align}

which estimates the methylation level of each region.

The latent variables, along with their description, are summarized in the following table:

Latent variable	Description	Code variable (if different)
\(z_i \in \mathbb{R}^d\)	Low-dimensional representation capturing the state of a cell	`z`
\(\mu_i \in \left(0,1\right)^{M}\)	Per-region methylation level estimates	`mu`
\(\gamma_j \in \left(0,1\right)\)	Region-wise dispersion factor	`d`

Inference#

MethylVI uses variational inference, specifically auto-encoding variational Bayes (see Variational Inference), to learn both the model parameters (the neural network parameters, dispersion parameters, etc.) and an approximate posterior distribution. In particular, we approximate the true posterior distribution with a mean-field variational distribution \(q_{\phi}(z_i \mid y_i, n_i, s_i)\) chosen to be Gaussian with a diagonal covariance matrix. Here \(y_i\) (\(n_i\)) is shorthand for the concatenation of the numbers of methylated (total) cytosines for each region across all contexts, and \(\phi\) denotes the set of learned weights used to infer the parameters of the approximate posterior.
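The two ingredients that auto-encoding variational Bayes needs from a diagonal Gaussian posterior are a reparameterized sample and the KL divergence to the standard normal prior. The sketch below shows both in plain Python; in practice the mean and variance would be produced by the encoder network, and the function names here are hypothetical.

```python
import math
import random


def sample_z(mean, var, rng):
    """Reparameterized draw z = mean + sqrt(var) * eps, with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mean, var))


rng = random.Random(0)
q_mean, q_var = [1.0, 0.0], [1.0, 1.0]  # toy encoder outputs for one cell
z = sample_z(q_mean, q_var, rng)
kl = kl_to_standard_normal(q_mean, q_var)
```

The KL term is zero only when the approximate posterior equals the prior, so it acts as a regularizer alongside the reconstruction term of the training objective.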

Tasks#

Here we provide an overview of some of the tasks that MethylVI can perform. Please see scvi.external.METHYLVI for the full API reference.

Dimensionality reduction#

For dimensionality reduction, the mean of the approximate posterior \(q_\phi(z_i \mid y_i, n_i, s_i)\) is returned by default. This is achieved using the method:

>>> adata.obsm["X_methylvi"] = model.get_latent_representation()

Users may also return samples from this distribution, as opposed to the mean, by passing the argument give_mean=False. The latent representation can be used to create a nearest neighbor graph with scanpy with:

>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_methylvi")
>>> adata.obsp["distances"]

Transfer learning#

A MethylVI model can be pre-trained on reference data and updated with query data using load_query_data(), which then facilitates transfer of metadata like cell type annotations. See the Transfer learning guide for more information.

Estimation of methylation levels#

In get_normalized_methylation() MethylVI returns the expected value of \(\mu_i\) under the approximate posterior. For one cell \(i\), this can be written as:

\begin{align} \mathbb{E}_{q_\phi(z_i \mid y_i, n_i, s_i)}\left[f_{\theta}\left(z_{i}, s_i\right) \right]. \end{align}

As the expectation can be expensive to compute, MethylVI by default uses the mean of \(z_i\) as a point estimate; this behavior can be changed by setting the argument use_z_mean=False.
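The difference between the two estimates can be seen with a toy one-dimensional decoder. The sketch below compares a Monte Carlo estimate of the posterior expectation with the plug-in value at the posterior mean; the sigmoid `toy_decoder` is a stand-in assumed for illustration and is not the methylVI decoder.

```python
import math
import random


def toy_decoder(z):
    """Hypothetical decoder mapping a scalar latent to a level in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
q_mean, q_std = 2.0, 1.0  # toy approximate posterior for one cell

# Plug-in estimate: decode the posterior mean.
plug_in = toy_decoder(q_mean)

# Monte Carlo estimate: average the decoder output over posterior samples.
samples = [rng.gauss(q_mean, q_std) for _ in range(5000)]
monte_carlo = sum(toy_decoder(z) for z in samples) / len(samples)
```

Because the decoder is nonlinear, the two values generally differ; in this toy setting the Monte Carlo average is lower than the plug-in value, and the gap shrinks as the posterior variance decreases.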

Differential methylation#

Differential methylation analysis is achieved with differential_methylation(). MethylVI tests for differences in the methylation levels \(\mu^{C}_{i} = f_{\theta^{C}}\left(z_{i}, s_i\right)\).