MethylANVI#

MethylANVI [1] (Python class METHYLANVI) is a semi-supervised generative model of scBS-seq data. Similar to how scANVI extends scVI, MethylANVI can be treated as an extension of MethylVI that leverages cell type annotations available for a subset of the cells in a dataset to infer the states of the remaining cells.

The advantages of MethylANVI are:

  • Comprehensive in capabilities.

  • Scalable to very large datasets (>1 million cells).

The limitations of MethylANVI include:

  • Effectively requires a GPU for fast inference.

  • Latent space is not interpretable, unlike that of a linear method.

  • May not scale to a very large number of cell types.

Preliminaries#

MethylANVI takes as input scBS-seq count matrices representing methylation measurements aggregated over pre-defined regions of interest (e.g. gene bodies, known regulatory regions, etc.). Depending on the system being investigated, such measurements may be separated based on methylation context (e.g. CpG methylation versus non-CpG methylation). For each context, MethylANVI accepts two count matrices as input: \(Y^{C}_{mc}\) and \(Y^{C}_{cov}\). Here \(C\) refers to an arbitrary methylation context, and each of these matrices has data from \(N\) cells and \(M\) genomic regions. Each entry in \(Y^{C}_{cov}\) represents the total number of cytosines profiled at a given region in a cell, while each entry in \(Y^{C}_{mc}\) denotes the number of methylated cytosines in that region for that cell.
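As a minimal sketch of what a pair of input matrices looks like for a single context, the toy example below uses plain Python lists with made-up values. The helper names `check_methylation_counts` and `methylation_fraction` are hypothetical and are not part of scvi-tools; they only illustrate the constraint that a methylated count can never exceed the corresponding coverage count.

```python
# Toy data: N = 3 cells (rows), M = 4 regions (columns), one context (e.g. CpG).
# All values are invented for illustration.
y_cov = [
    [10, 0, 7, 25],
    [4, 12, 0, 9],
    [15, 3, 8, 1],
]
y_mc = [
    [8, 0, 1, 20],
    [4, 2, 0, 0],
    [9, 3, 5, 1],
]


def check_methylation_counts(y_mc, y_cov):
    """Return True if both matrices share a shape and 0 <= mc <= cov everywhere."""
    if len(y_mc) != len(y_cov):
        return False
    for mc_row, cov_row in zip(y_mc, y_cov):
        if len(mc_row) != len(cov_row):
            return False
        for mc, cov in zip(mc_row, cov_row):
            if not 0 <= mc <= cov:
                return False
    return True


def methylation_fraction(y_mc, y_cov):
    """Per-entry methylated fraction; None where a region has no coverage."""
    return [
        [mc / cov if cov > 0 else None for mc, cov in zip(mc_row, cov_row)]
        for mc_row, cov_row in zip(y_mc, y_cov)
    ]


is_valid = check_methylation_counts(y_mc, y_cov)
fractions = methylation_fraction(y_mc, y_cov)
```

Entries with zero coverage carry no information about methylation, which is why the sketch leaves their fraction undefined rather than reporting zero.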

In addition to methylation measurements, MethylANVI takes as input a vector of partially observed cell-type labels \(\mathbf{l}\), where \(L\) denotes the total number of cell types. A vector of categorical covariates \(S\), representing batch, donor, etc., is an optional additional input to the model.

Generative process#

MethylANVI posits that the observed number of methylated cytosines in context \(C\) for cell \(i\) in region \(j\), \(y^{C}_{ij}\), is generated by the following process:

\begin{align} l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\ u_i &\sim \mathcal{N}(0, I_d) \\ z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\ \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\ p^{C}_{ij} &\sim \text{Beta}(\mu^{C}_{ij}, \gamma^{C}_j) \\ y^{C}_{ijk} &\sim \text{Ber}(p^{C}_{ij}) \\ y^{C}_{ij} &= \sum_{k}y^{C}_{ijk} \end{align}

Equivalently, writing \(n^{C}_{ij}\) for the total number of cytosines profiled in context \(C\) for cell \(i\) in region \(j\) (the corresponding entry of \(Y^{C}_{cov}\)), we can express this process more compactly as

\begin{align} l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\ u_i &\sim \mathcal{N}(0, I_d) \\ z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\ \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\ y^{C}_{ij} &\sim \text{BetaBinomial}(n^{C}_{ij}, \mu^{C}_{ij}, \gamma^{C}_{j}) \end{align}

We assume no prior knowledge of the distribution of cell types in the data (i.e., we place a uniform prior on the cell type labels). Within-cell-type variations \(u_i\) are assumed to follow a fixed standard normal distribution, while the distribution over the cell-type-aware latent variables \(z_i\) depends on the learnable neural networks \(f_z^{\mu}\) and \(f_z^{\sigma}\). The variables \(z_i\) summarize a cell’s state as a low-dimensional vector and have an interpretation similar to that in MethylVI. However, by incorporating cell type labels into the model, MethylANVI may learn a better-structured latent space than MethylVI.
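The hierarchy of latent variables described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the model's implementation: `toy_f_mu` and `toy_f_sigma` are hypothetical closed-form stand-ins for the learnable networks, the covariance is taken to be diagonal, and the output of the second function is treated as a standard deviation.

```python
import random


def sample_latents(num_cell_types, latent_dim, f_mu, f_sigma, rng):
    """Draw (label, u, z) from the prior hierarchy for a single cell."""
    # Uniform categorical prior over the L cell types.
    label = rng.randrange(num_cell_types)
    # Within-cell-type variation: standard normal in d dimensions.
    u = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Cell-type-aware latent: Gaussian whose parameters depend on (u, label).
    mean = f_mu(u, label)
    scale = f_sigma(u, label)  # assumed here to be a standard deviation
    z = [rng.gauss(m, s) for m, s in zip(mean, scale)]
    return label, u, z


def toy_f_mu(u, label):
    # Hypothetical stand-in: shift each coordinate by a cell-type-specific offset.
    return [3.0 * label + 0.5 * u_k for u_k in u]


def toy_f_sigma(u, label):
    # Hypothetical stand-in: constant, small spread around the mean.
    return [0.1 for _ in u]


rng = random.Random(0)
samples = [sample_latents(3, 2, toy_f_mu, toy_f_sigma, rng) for _ in range(500)]
```

With these toy functions, samples that share a label cluster around a label-specific offset, which is the kind of cell-type-driven structure in the latent space that the text refers to.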

The remainder of the model closely follows MethylVI. In particular, observed methylated cytosine counts are assumed to follow a beta-binomial distribution conditioned on a cell’s underlying state \(z_i\) as well as batch covariates \(s_i\).
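To make the beta-binomial observation model concrete, the sketch below draws a methylation probability from a Beta distribution once per cell-region pair and then draws each profiled cytosine as a Bernoulli trial with that shared probability. The mapping from the mean \(\mu\) and dispersion \(\gamma\) to Beta shape parameters used here, \(\alpha = \mu(1-\gamma)/\gamma\) and \(\beta = (1-\mu)(1-\gamma)/\gamma\), is an assumption made for illustration; the text above does not state which parameterization is used. The function names are hypothetical.

```python
import random


def beta_shape_params(mu, gamma):
    """Map (mean, dispersion) to Beta shape parameters (assumed parameterization)."""
    alpha = mu * (1.0 - gamma) / gamma
    beta = (1.0 - mu) * (1.0 - gamma) / gamma
    return alpha, beta


def sample_methylated_count(n, mu, gamma, rng):
    """Draw one methylated-cytosine count out of n profiled cytosines."""
    alpha, beta = beta_shape_params(mu, gamma)
    # One methylation probability shared by all cytosines in this region and cell.
    p = rng.betavariate(alpha, beta)
    # Sum of n Bernoulli(p) draws, i.e. a Binomial(n, p) sample.
    return sum(rng.random() < p for _ in range(n))


rng = random.Random(0)
n, mu, gamma = 20, 0.3, 0.2
draws = [sample_methylated_count(n, mu, gamma, rng) for _ in range(5000)]

mean_fraction = sum(draws) / (len(draws) * n)
mean_count = sum(draws) / len(draws)
variance = sum((d - mean_count) ** 2 for d in draws) / (len(draws) - 1)
binomial_variance = n * mu * (1.0 - mu)  # variance if p were fixed at mu
```

Under this parameterization the average methylated fraction stays close to \(\mu\), while the variance of the counts exceeds that of a plain binomial with the same mean. That overdispersion is what the extra parameter \(\gamma\) is there to capture.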

In addition to the variables defined for MethylVI, we have the following variables for MethylANVI:

| Latent variable | Description | Code variable (if different) |
| --- | --- | --- |
| \(l_i \in \Delta^{L-1}\) | Cell type label | y |
| \(z_i \in \mathbb{R}^d\) | Latent cell state | z_1 |
| \(u_i \in \mathbb{R}^{d}\) | Latent cell-type-specific state | z_2 |

Inference#

MethylANVI posits the following factorized distribution for posterior inference:

\begin{align} q_\phi(z_i, u_i, l_i \mid y_i, n_i, s_i) = q_\phi(z_i \mid y_i, n_i, s_i) \, q_\phi(l_i \mid z_i) \, q_\phi(u_i \mid l_i, z_i) \tag{1} \end{align}

Each of the variational distributions in this factorization is parameterized by a neural network. Here \(q_\phi(z_i \mid y_i, n_i, s_i)\) and \(q_\phi(u_i \mid l_i, z_i)\) are Gaussian distributions, while \(q_\phi(l_i \mid z_i)\) is a Categorical distribution over cell types. Notably, \(q_\phi(l_i \mid z_i)\) can be used after training to predict the cell type of an unlabeled cell. Under the hood, this classification step takes as input the mean of the variational distribution \(q_\phi(z_i \mid y_i, n_i, s_i)\).
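The classification step described above can be illustrated with a toy example. The sketch assumes the classifier is a single linear layer followed by a softmax, which is a deliberate simplification of a neural network. The weights, biases and latent mean are invented, and the function names are hypothetical rather than taken from scvi-tools.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_latent_mean(z_mean, weights, biases):
    """Toy categorical distribution over cell types given a latent mean."""
    logits = [
        sum(w * z for w, z in zip(row, z_mean)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical parameters for L = 3 cell types and a d = 2 latent space.
weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
biases = [0.0, 0.0, 0.0]

# Stand-in for the mean of the Gaussian variational distribution over the latent state.
z_mean = [1.5, -0.5]

probs = classify_from_latent_mean(z_mean, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Feeding the mean of the latent distribution to the classifier, rather than a random sample from it, makes the predicted label deterministic for a given cell.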

Training details#

MethylANVI optimizes two evidence lower bounds (ELBOs) on the log evidence, with the two bounds corresponding to labeled and unlabeled cells. These bounds largely mirror those of scANVI, with appropriate substitutions made to account for scBS-seq observations. We refer the reader to the scANVI documentation for further details.

Tasks#

MethylANVI can perform the same tasks as MethylVI (see MethylVI). In addition, MethylANVI can do the following:

Cell type label prediction#

For cell type label prediction, MethylANVI uses the distribution \(q_{\phi}(l_i \mid z_i)\), which is exposed through the following method:

>>> mdata.obs["methylanvi_prediction"] = model.predict()