# MethylANVI

**MethylANVI** [^ref1] (Python class {class}`~scvi.external.METHYLANVI`) is a semi-supervised generative model of scBS-seq data.
Similar to how scANVI extends scVI, MethylANVI can be treated as an extension of MethylVI that leverages cell type annotations
for a subset of the cells in a dataset to infer the states of the remaining cells.

The advantages of MethylANVI are:

-   Comprehensive in capabilities: performs the same tasks as MethylVI and additionally predicts cell type labels.
-   Scalable to very large datasets (>1 million cells).

The limitations of MethylANVI include:

-   Effectively requires a GPU for fast inference.
-   Latent space is not interpretable, unlike that of a linear method.
-   May not scale to a very large number of cell types.

```{topic} Tutorials:

-   Work in progress.
```

## Preliminaries

MethylANVI takes as input scBS-seq count matrices representing methylation measurements aggregated over pre-defined
regions of interest (e.g. gene bodies, known regulatory regions, etc.). Depending on the system being investigated,
such measurements may be separated based on methylation context (e.g. CpG methylation versus non-CpG methylation).
For each context, MethylANVI accepts two count matrices as input, $Y^{C}_{mc}$ and $Y^{C}_{cov}$. Here $C$ refers to
an arbitrary methylation context, and each of these matrices has data from $N$ cells and $M$ genomic regions.
Each entry in $Y^{C}_{cov}$ represents the _total_ number of cytosines profiled at a given region in a cell, while the
entries in $Y^{C}_{mc}$ denote the number of _methylated_ cytosines in a region for a cell.

In addition to methylation measurements, MethylANVI takes as input a vector of partially observed cell-type labels $\mathbf{l}$,
where $L$ denotes the total number of cell types. Additionally, a vector of categorical covariates $S$, representing batch,
donor, etc., is an optional input to the model.
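To make the input format concrete, the following is a minimal sketch using synthetic data. The matrix sizes, the cell type names, and the `"unknown"` placeholder for unlabeled cells are illustrative assumptions rather than values required by the model. It builds a coverage matrix and a methylated-count matrix for a single context and checks that methylated counts never exceed coverage.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells, n_regions = 5, 4  # N and M; toy sizes chosen for illustration

# Y_cov: total number of cytosines profiled per (cell, region).
y_cov = rng.integers(low=0, high=20, size=(n_cells, n_regions))

# Y_mc: number of methylated cytosines, drawn so that 0 <= Y_mc <= Y_cov.
y_mc = rng.binomial(n=y_cov, p=0.7)

# Partially observed labels; "unknown" is a hypothetical placeholder
# marking cells that have no annotation.
labels = np.array(["neuron", "unknown", "astrocyte", "unknown", "neuron"])

assert y_mc.shape == y_cov.shape == (n_cells, n_regions)
assert np.all(y_mc <= y_cov)

# Empirical methylation level per entry; NaN where a region has no coverage.
with np.errstate(divide="ignore", invalid="ignore"):
    frac = np.where(y_cov > 0, y_mc / y_cov, np.nan)
```

Entries with zero coverage carry no information about methylation, which is why the model works with the pair of count matrices rather than with the ratio alone.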

## Generative process

MethylANVI posits that the observed number of methylated cytosines in context $C$ for cell $i$ in region $j$,
$y^{C}_{ij}$, is generated by the following process:

```{math}
:nowrap: true

\begin{align}
    l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\
    u_i &\sim \mathcal{N}(0, I_d) \\
    z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\
    \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
    p^{C}_{ijk} &\sim \text{Beta}(\mu^{C}_{ij}, \gamma^{C}_j) \\
    y^{C}_{ijk} &\sim \text{Ber}(p^{C}_{ijk}) \\
    y^{C}_{ij} &= \sum_{k} y^{C}_{ijk}
\end{align}
```

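Equivalently, writing $n^{C}_{ij}$ for the total number of cytosines profiled in region $j$ of cell $i$ (the corresponding
entry of $Y^{C}_{cov}$), we can express this process more compactly as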

```{math}
:nowrap: true

\begin{align}
    l_i &\sim \text{Categorical}(1/L, \ldots, 1/L) \\
    u_i &\sim \mathcal{N}(0, I_d) \\
    z_{i} &\sim \mathcal{N}(f_z^{\mu}(u_i, l_i), f_z^{\sigma}(u_i, l_i)) \\
    \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
    y^{C}_{ij} &\sim \text{BetaBinomial}(n^{C}_{ij}, \mu^{C}_{ij}, \gamma^{C}_{j})
\end{align}
```
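As an illustration of the beta-binomial observation model, here is a minimal NumPy sketch that draws counts through the standard two-stage route (a Beta draw followed by a Binomial draw). It assumes a mean/dispersion parameterization in which $\alpha = \mu(1-\gamma)/\gamma$ and $\beta = (1-\mu)(1-\gamma)/\gamma$; this conversion, the helper name `sample_beta_binomial`, and the toy parameter values are assumptions of the sketch and may differ from the library's internals.

```python
import numpy as np


def sample_beta_binomial(n, mu, gamma, rng):
    """Draw y ~ BetaBinomial(n, mu, gamma) via a Beta draw then a Binomial draw.

    Assumes 0 < mu < 1 and 0 < gamma < 1 under a mean/dispersion
    parameterization; the conversion below is an assumption of this sketch.
    """
    alpha = mu * (1.0 - gamma) / gamma
    beta = (1.0 - mu) * (1.0 - gamma) / gamma
    # One latent methylation probability per entry of n.
    p = rng.beta(alpha, beta, size=np.shape(n))
    # Number of methylated cytosines out of n profiled.
    return rng.binomial(n, p)


rng = np.random.default_rng(0)
n = np.full(100_000, 30)  # coverage, held fixed here for simplicity
mu, gamma = 0.8, 0.1      # toy mean and dispersion values

y = sample_beta_binomial(n, mu, gamma, rng)
frac = y / n  # empirical methylation levels
```

Under this parameterization the mean of `frac` stays close to `mu`, while its variance exceeds the plain binomial variance `mu * (1 - mu) / n` by a factor that grows with `gamma`, which is the overdispersion the beta-binomial is meant to capture.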

We assume no prior knowledge on the distribution of cell types in the data (i.e., we place a uniform prior on the
distribution of cell type labels). Within-cell-type variations $u_i$ are assumed to follow a fixed standard normal distribution,
while the distribution over the cell-type-aware latent variables $z_i$ depends on the learnable neural networks $f_z^{\mu}$ and
$f_z^{\sigma}$. The variables $z_i$ summarize a cell's state as a low-dimensional vector, and have a similar interpretation
as with MethylVI. However, by incorporating cell type labels into the model, MethylANVI may learn a better structured
latent space compared to MethylVI.

The remainder of the model closely follows MethylVI. In particular, observed methylated cytosine counts are assumed
to follow a beta-binomial distribution conditioned on a cell's underlying state $z_i$ as well as batch covariates $s_i$.

In addition to the variables defined for {doc}`/user_guide/models/methylvi`, we have the following variables for MethylANVI:

```{eval-rst}
.. list-table::
   :widths: 20 90 15
   :header-rows: 1

   * - Latent variable
     - Description
     - Code variable (if different)
   * - :math:`l_i \in \Delta^{L-1}`
     - Cell type label
     - ``y``
   * - :math:`z_i \in \mathbb{R}^d`
     - Latent cell state
     - ``z_1``
   * - :math:`u_i \in \mathbb{R}^{d}`
     - Latent cell-type-specific state
     - ``z_2``
```

## Inference

MethylANVI posits the following factorized distribution for posterior inference:

```{math}
:nowrap: true

\begin{align}
   q_\phi(z_i, u_i, l_i \mid y_i, n_i, s_i)
   =
   q_\phi(z_i \mid y_i, n_i, s_i)
   q_\phi(l_i \mid z_i)
   q_\phi(u_i \mid l_i, z_i)
\end{align}
```

Each of the individual variational distributions in our factorized expression is parameterized by a neural
network. Here $q_\phi(z_i \mid y_i, n_i, s_i)$ and $q_\phi(u_i \mid l_i, z_i)$ follow Gaussian distributions, while
$q_\phi(l_i \mid z_i)$ represents a Categorical distribution over cell types. Notably, $q_\phi(l_i \mid z_i)$ can be
leveraged post-training to predict cell types for unlabeled cells. For this classification procedure, under the hood
we use as input the mean of the variational distribution $q_\phi(z_i \mid y_i, n_i, s_i)$.
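The classification step described above can be sketched in a few lines of NumPy. This is a toy illustration only: the random `z_mean` stands in for the encoder's posterior mean, and the linear classifier weights `W` and `b` are hypothetical, whereas the actual model uses trained neural networks.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n_cells, n_latent, n_labels = 6, 10, 3  # toy sizes chosen for illustration

# Stand-in for the mean of q(z_i | y_i, n_i, s_i) produced by the encoder.
z_mean = rng.normal(size=(n_cells, n_latent))

# Hypothetical linear classifier; the real model uses a trained network.
W = rng.normal(size=(n_latent, n_labels))
b = np.zeros(n_labels)

# Categorical distribution over cell types, one row per cell.
probs = softmax(z_mean @ W + b)

# Index of the most probable cell type for each cell.
hard_labels = probs.argmax(axis=1)
```

Feeding the posterior mean rather than a random sample of the latent variable makes the predicted label deterministic for a given cell.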

## Training details

MethylANVI optimizes two evidence lower bounds (ELBOs) on the log evidence, with the two bounds corresponding to labeled
and unlabeled cells. These bounds largely mirror those of scANVI, with appropriate substitutions made to account for scBS-seq
observations. We refer the reader to the {doc}`/user_guide/models/scanvi` documentation for further details.

## Tasks

MethylANVI can perform the same tasks as MethylVI (see {doc}`/user_guide/models/methylvi`). In addition, MethylANVI can
do the following:

### Cell type label prediction

For cell type label prediction, MethylANVI uses the classifier distribution $q_{\phi}(l_i \mid z_i)$, which is exposed
through the following method:

```
>>> mdata.obs["methylanvi_prediction"] = model.predict()
```

[^ref1]:
    Ethan Weinberger and Su-In Lee (2021),
    _A deep generative model of single-cell methylomic data_,
    [OpenReview](https://openreview.net/forum?id=Mg2DM0F3AY).
