# MethylVI

**methylVI** [^ref1] (Python class {class}`~scvi.external.METHYLVI`) is a generative model of single-cell bisulfite
sequencing (scBS-seq) data that can subsequently be used for many common downstream tasks.

The advantages of methylVI are:

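-   Comprehensive in capabilities: a single model supports dimensionality reduction, transfer learning, estimation of methylation levels, and differential methylation.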
-   Scalable to very large datasets (>1 million cells).

The limitations of methylVI include:

-   Effectively requires a GPU for fast inference.
-   Latent space is not interpretable, unlike that of a linear method.

```{topic} Tutorials:

-   {doc}`/tutorials/notebooks/scbs/MethylVI_batch`
```

## Preliminaries

MethylVI takes as input scBS-seq count matrices representing methylation measurements aggregated over pre-defined
regions of interest (e.g. gene bodies, known regulatory regions, etc.). Depending on the system being investigated,
such measurements may be separated based on methylation context (e.g. CpG methylation versus non-CpG methylation).

For each context, methylVI accepts two count matrices as input: $Y^{C}_{mc}$ and $Y^{C}_{cov}$. Here $C$ refers to
an arbitrary methylation context, and each of these matrices has data from $N$ cells and $M$ genomic regions.
Each entry in $Y^{C}_{cov}$ represents the _total_ number of cytosines profiled at a given region in a cell, while the
entries in $Y^{C}_{mc}$ denote the number of _methylated_ cytosines in a region for a cell. Additionally, a vector of
categorical covariates $S$, representing batch, donor, etc., is an optional input to the model.
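
As a rough illustration of the expected inputs, the sketch below builds a toy pair of matrices for a single context with NumPy. The sizes, the random values, and the variable names (`y_cov`, `y_mc`, `batch`) are assumptions made for this example only and are not part of the scvi-tools API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_regions = 4, 3  # toy values for N and M

# Y_cov: total number of cytosines profiled per (cell, region).
# Coverage starts at 1 here so that the fraction below is always defined.
y_cov = rng.integers(low=1, high=50, size=(n_cells, n_regions))

# Y_mc: number of methylated cytosines, which can never exceed the coverage.
y_mc = rng.binomial(n=y_cov, p=0.7)

# Optional categorical covariate S (e.g. batch), one label per cell.
batch = np.array(["batch_a", "batch_a", "batch_b", "batch_b"])

# Empirical methylation fraction per cell and region.
frac = y_mc / y_cov
```

Both matrices share the same shape, and every entry of `y_mc` is bounded above by the matching entry of `y_cov`.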

## Generative process

MethylVI posits that the observed number of methylated cytosines in context $C$ for cell $i$ in region $j$,
$y^{C}_{ij}$, is generated by the following process:

```{math}
:nowrap: true

\begin{align}
    z_{i} &\sim \mathcal{N}(0, I_d) \\
    \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
    p^{C}_{ijk} &\sim \text{Beta}(\mu^{C}_{ij}, \gamma^{C}_j) \\
    y^{C}_{ijk} &\sim \text{Ber}(p^{C}_{ijk}) \\
    y^{C}_{ij} &= \sum_{k}y^{C}_{ijk}
\end{align}
```

In brief, we assume that detection of an individual cytosine $k$ within region $j$ for cell $i$ as methylated
can be modeled as a Bernoulli random variable. The parameters of these Bernoulli distributions are
assumed to be similar for all cytosines $k$ within region $j$, which we model as draws from a Beta distribution with
parameters that depend on a cell-specific latent variable $z_i$ that captures underlying methylation state as well
as a batch covariate $s_i$. The outcomes of these Bernoulli draws are then summed to obtain our number of methylated
cytosines within the given region.

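The above hierarchical process can be expressed more compactly as follows, where $n^{C}_{ij}$ denotes the total number
of cytosines profiled in region $j$ for cell $i$ (the corresponding entry of $Y^{C}_{cov}$):

```{math}
:nowrap: true

\begin{align}
    z_{i} &\sim \mathcal{N}(0, I_d) \\
    \mu^{C}_{ij} &= f_{\theta^{C}}(z_{i}, s_i)_j \\
    y^{C}_{ij} &\sim \text{BetaBinomial}(n^{C}_{ij}, \mu^{C}_{ij}, \gamma^{C}_{j})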
\end{align}
```
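
To make the compact form more concrete, here is a minimal NumPy sketch that samples beta-binomial counts by first drawing a region-level methylation probability from a Beta distribution and then drawing a Binomial count given the coverage. The mapping from the mean and dispersion to the usual Beta shape parameters, $\alpha = \mu (1 - \gamma) / \gamma$ and $\beta = (1 - \mu)(1 - \gamma) / \gamma$, is an assumption of this sketch (a common mean-dispersion parameterization), and `sample_beta_binomial` is a hypothetical helper rather than a scvi-tools function.

```python
import numpy as np


def sample_beta_binomial(n, mu, gamma, rng):
    """Draw counts via a Beta-then-Binomial composition.

    n: coverage counts, mu: mean methylation level in (0, 1),
    gamma: dispersion in (0, 1). The (mu, gamma) -> (alpha, beta)
    mapping below is an assumed parameterization for illustration.
    """
    alpha = mu * (1.0 - gamma) / gamma
    beta = (1.0 - mu) * (1.0 - gamma) / gamma
    # One methylation probability per (cell, region) entry.
    p = rng.beta(alpha, beta, size=np.shape(n))
    return rng.binomial(n, p)


rng = np.random.default_rng(0)
n = np.full((20000, 2), 30)    # coverage for 20000 toy cells, 2 regions
mu = np.array([0.2, 0.8])      # per-region mean methylation level
gamma = np.array([0.1, 0.3])   # per-region dispersion
y = sample_beta_binomial(n, mu, gamma, rng)

# The empirical fraction is close to mu, while the count variance
# exceeds what a plain Binomial(n, mu) would give (overdispersion).
empirical_mean = y.mean(axis=0) / 30
binomial_var = 30 * mu * (1.0 - mu)
```

Under this assumed parameterization, larger values of `gamma` produce more overdispersed counts around the same mean.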

For each methylation context $C$, the MethylVI generative process uses a single neural network:

```{math}
:nowrap: true

\begin{align}
    f_{\theta^{C}}(z_{i}, s_i) &: \mathbb{R}^{d} \times \{0, 1\}^K \to \left(0,1\right)^M
\end{align}
```

which estimates the methylation level of each region.
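
The shape of this mapping can be illustrated with a toy decoder written in NumPy. This is only a sketch under assumed choices (one hidden ReLU layer, a sigmoid output, randomly initialized weights, and the hypothetical name `toy_decoder`). It does not reproduce the architecture used in scvi-tools.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def toy_decoder(z, s_onehot, params):
    """Map latent z (N, d) and one-hot covariates (N, K) to levels in (0, 1)^M."""
    w_hidden, b_hidden, w_out, b_out = params
    h = np.concatenate([z, s_onehot], axis=1)      # (N, d + K)
    h = np.maximum(h @ w_hidden + b_hidden, 0.0)   # hidden ReLU layer
    return sigmoid(h @ w_out + b_out)              # squash outputs into (0, 1)


rng = np.random.default_rng(0)
d, K, M, n_hidden, N = 10, 2, 5, 16, 3  # toy sizes, chosen arbitrarily
params = (
    rng.normal(size=(d + K, n_hidden)) * 0.1,
    np.zeros(n_hidden),
    rng.normal(size=(n_hidden, M)) * 0.1,
    np.zeros(M),
)
z = rng.normal(size=(N, d))
s_onehot = np.eye(K)[rng.integers(K, size=N)]  # one-hot batch labels
mu = toy_decoder(z, s_onehot, params)          # shape (N, M)
```

The sigmoid on the output layer is what keeps every estimated methylation level strictly between 0 and 1.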

The latent variables, along with their descriptions, are summarized in the following table:

```{eval-rst}
.. list-table::
   :widths: 20 90 15
   :header-rows: 1

   * - Latent variable
     - Description
     - Code variable (if different)
   * - :math:`z_i \in \mathbb{R}^d`
     - Low-dimensional representation capturing the state of a cell
     - ``z``
   * - :math:`\mu_i \in \left(0,1\right)^{M}`
     - Per-region methylation level estimates
     - ``mu``
   * - :math:`\gamma_j \in \left(0,1\right)`
     - Region-wise dispersion factor
     - ``d``
```

## Inference

MethylVI uses variational inference, specifically auto-encoding variational Bayes
(see {doc}`/user_guide/background/variational_inference`) to learn both the model parameters (the neural network parameters,
dispersion parameters, etc.) and an approximate posterior distribution. In particular, we approximate the true posterior
distribution with a mean-field variational distribution $q_{\phi}(z_i \mid y_i, n_i, s_i)$ chosen to be Gaussian
with a diagonal covariance matrix. Here $y_i$ ($n_i$) is used as a shorthand to denote the concatenation of the numbers
of methylated (total) cytosines for each region in all contexts, and $\phi$ denotes a set of learned weights used to
infer the parameters of our approximate posterior.
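
For intuition, the two ingredients that a diagonal Gaussian posterior contributes to the training objective can be sketched in NumPy: a reparameterized sample of $z_i$ and the closed-form KL divergence to the standard normal prior. The function names and the toy posterior parameters below are assumptions for illustration. In the model itself these parameters would come from the encoder network with weights $\phi$.

```python
import numpy as np


def sample_posterior(q_mean, q_log_var, rng):
    """Reparameterized draw z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(q_mean.shape)
    return q_mean + np.exp(0.5 * q_log_var) * eps


def kl_to_standard_normal(q_mean, q_log_var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(
        np.exp(q_log_var) + q_mean**2 - 1.0 - q_log_var, axis=-1
    )


rng = np.random.default_rng(0)
q_mean = rng.normal(size=(5, 10))            # toy posterior means, 5 cells
q_log_var = rng.normal(size=(5, 10)) * 0.1   # toy posterior log-variances
z = sample_posterior(q_mean, q_log_var, rng)
kl = kl_to_standard_normal(q_mean, q_log_var)  # one value per cell
```

The KL term vanishes only when the approximate posterior equals the prior, that is, when the mean is zero and the variance is one in every dimension.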

## Tasks

Here we provide an overview of some of the tasks that MethylVI can perform. Please see {class}`scvi.external.METHYLVI`
for the full API reference.

### Dimensionality reduction

For dimensionality reduction, the mean of the approximate posterior $q_\phi(z_i \mid y_i, n_i)$ is returned by default.
This is achieved using the method:

```
>>> adata.obsm["X_methylvi"] = model.get_latent_representation()
```

Users may also return samples from this distribution, as opposed to the mean, by passing the argument `give_mean=False`.
The latent representation can be used to create a nearest neighbor graph with scanpy with:

```
>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_methylvi")
>>> adata.obsp["distances"]
```

### Transfer learning

A MethylVI model can be pre-trained on reference data and updated with query data using {meth}`~scvi.external.METHYLVI.load_query_data`, which then facilitates transfer of metadata like cell type annotations. See the {doc}`/user_guide/background/transfer_learning` guide for more information.

### Estimation of methylation levels

In {meth}`~scvi.external.METHYLVI.get_normalized_methylation` MethylVI returns the expected value of $\mu_i$ under the approximate posterior. For one cell $i$, this can be written as:

```{math}
:nowrap: true

\begin{align}
   \mathbb{E}_{q_\phi(z_i \mid y_i, n_i)}\left[f_{\theta}\left(z_{i}, s_i\right) \right].
\end{align}
```

As the expectation can be expensive to compute, MethylVI by default uses the mean of $z_i$ as a point estimate. This behaviour can be changed by setting the argument `use_z_mean=False`.
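
The difference between the two estimators can be illustrated with a small NumPy sketch that compares decoding the posterior mean against averaging decoded posterior samples. The single-layer sigmoid `toy_decoder`, the random weights, and the posterior parameters are all hypothetical stand-ins, and the batch covariate is omitted for brevity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
d, M = 4, 6
weights = rng.normal(size=(d, M))  # hypothetical stand-in for f_theta


def toy_decoder(z):
    return sigmoid(z @ weights)


# Toy approximate posterior for one cell: diagonal Gaussian.
q_mean = rng.normal(size=d)
q_std = np.full(d, 0.5)

# Point estimate: decode the posterior mean once.
mu_point = toy_decoder(q_mean)

# Monte Carlo estimate: decode many posterior samples and average.
z_samples = q_mean + q_std * rng.standard_normal((1000, d))
mu_mc = toy_decoder(z_samples).mean(axis=0)
```

Because the decoder is nonlinear, the two estimates generally differ. The point estimate needs a single decoder pass, whereas the Monte Carlo average needs one pass per sample.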

### Differential methylation

Differential methylation analysis is achieved with {meth}`~scvi.external.METHYLVI.differential_methylation`.
MethylVI tests differences in methylation levels $\mu^{C}_{i} = f_{\theta^{C}}\left(z_{i}, s_i\right)$.

[^ref1]:
    Ethan Weinberger and Su-In Lee (2021),
    _A deep generative model of single-cell methylomic data_,
    [OpenReview](https://openreview.net/forum?id=Mg2DM0F3AY).
