# PeakVI

**peakVI** [^ref1] (Python class {class}`~scvi.model.PEAKVI`) is a generative model of scATAC-seq data that can subsequently
be used for many common downstream tasks.

The advantages of peakVI are:

-   Comprehensive in capabilities.
-   Scalable to very large datasets (>1 million cells).

The limitations of peakVI include:

-   Effectively requires a GPU for fast inference.
-   Latent space is not interpretable, unlike that of a linear method.

```{topic} Tutorials:

-   {doc}`/tutorials/notebooks/api_overview`
-   {doc}`/tutorials/notebooks/PeakVI`
-   {doc}`/tutorials/notebooks/peakvi_in_R`
-   {doc}`/tutorials/notebooks/scarches_scvi_tools`
```

## Preliminaries

PeakVI takes as input a scATAC-seq accessibility matrix $X$ with $N$ cells and $M$ genomic regions (peaks),
which can be binary or have count data.
Additionally, a design matrix $S$ containing $p$ observed covariates, such as batch, donor, etc, is an optional input.
While $S$ can include both categorical covariates and continuous covariates, in the following, we assume it contains only one
categorical covariate with $K$ categories, which represents the common case of having multiple batches of data.

## Generative process

PeakVI posits that the observed value for cell $i$ and region $j$, $x_{ij}$,
is generated by the following process:

```{math}
:nowrap: true

\begin{align}
 z_i &\sim {\mathrm{Normal}}\left( {0,I} \right) \\
 y_{ij} &= g_z^j\left(z_i, s_i\right) \\
 \ell_i &= f_\ell\left(x_i\right)\\
 \left(x_{ij} > 0\right) &\sim \mathrm{Bernoulli}\left(y_{ij} \cdot \ell_i \cdot r_j \right)
\end{align}
```

Briefly, detecting a region as accessible ($x_{ij} > 0$) is generated by a Bernoulli random variable which
depends on a cell-specific latent variable $z_i$, which captures biological heterogeneity, and two auxiliary
scaling factors $\ell_i, r_j$, which account for cell-specific and region-specific technical effects, respectively.

The PeakVI generative process uses a single neural network:

```{math}
:nowrap: true

\begin{align}
        g_z(z_i, s_i) &: \mathbb{R}^{d} \times \{0, 1\}^K \to \left[0,1\right]^M
\end{align}
```

which estimates the probability of accessibility.

The latent variables, along with their description are summarized in the following table:

```{eval-rst}
.. list-table::
   :widths: 20 90 15
   :header-rows: 1

   * - Latent variable
     - Description
     - Code variable (if different)
   * - :math:`z_i \in \mathbb{R}^d`
     - Low-dimensional representation capturing the state of a cell
     - ``z``
   * - :math:`y_i \in \left[0,1\right]^{M}`
     - Acessibility probability estimate
     - ``p``
   * - :math:`\ell_i \in \left[0,1\right]`
     - Cell-wise scaling factor
     - ``d``
   * - :math:`r_j \in \left[0,1\right]`
     - Region-wise scaling factor
     - ``f``
```

## Inference

PeakVI uses variational inference, specifically auto-encoding variational Bayes (see {doc}`/user_guide/background/variational_inference`) to learn both the model parameters (the neural network params, scaling factors, etc.) and an approximate posterior distribution with the following factorization:

```{math}
:nowrap: true

\begin{align}
   q_\eta(z_i, \ell_i \mid x_i) :=
   q_\eta(z_i \mid x_i)q_\eta(\ell_i \mid x_i).
\end{align}
```

Here $\eta$ is a set of parameters corresponding to inference neural networks (encoders), which we do not describe in detail here. The underlying class used as the encoder for PeakVI is {class}`~scvi.nn.Encoder`.

It it important to note that by default, PeakVI only receives the accessibility data as input (i.e., not the observed cell-level covariates).
Empirically, we have not seen much of a difference by having the encoder take as input the concatenation of these items (i.e., $q_\eta(z_i, \ell_i \mid x_i, s_i)$, but users can control it manually by passing
`encode_covariates=True` to {class}`scvi.model.PEAKVI`.

## Tasks

Here we provide an overview of some of the tasks that PeakVI can perform. Please see {class}`scvi.model.PEAKVI` for the full API reference.

### Dimensionality reduction

For dimensionality reduction, the mean of the approximate posterior $q_\eta(z_i \mid x_i)$ is returned by default.
This is achieved using the method:

```
>>> adata.obsm["X_peakvi"] = model.get_latent_representation()
```

Users may also return samples from this distribution, as opposed to the mean, by passing the argument `give_mean=False`.
The latent representation can be used to create a nearest neighbor graph with scanpy with:

```
>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_scvi")
>>> adata.obsp["distances"]
```

### Transfer learning

A PeakVI model can be pre-trained on reference data and updated with query data using {func}`~scvi.model.PEAKVI.load_query_data`, which then facilitates transfer of metadata like cell type annotations. See the {doc}`/user_guide/background/transfer_learning` guide for more information.

### Estimation of accessibility

In {func}`~scvi.model.PEAKVI.get_accessibility_estimates` PeakVI returns the expected value of $y_i$ under the approximate posterior. For one cell $i$, this can be written as:

```{math}
:nowrap: true

\begin{align}
   \mathbb{E}_{q_\eta(z_i \mid x_i)}\left[g_z\left( z_i, s_i \right) \right],
\end{align}
```

As the expectation can be expensive to compute, by default, PeakVI uses the mean of $z_i$ as a point estimate, but this behaviour can be changed by setting `use_z_mean=False` argument.

Notably, this function also has the `transform_batch` parameter that allows counterfactual prediction of accessibility in an unobserved batch. See the {doc}`/user_guide/background/counterfactual_prediction` guide.

### Differential accessibility

Differential accessibility analysis is achieved with {func}`~scvi.model.PEAKVI.differential_accessibility`. PeakVI tests differences in accessibility of $g_z\left( z_i, s_i \right)$.

[^ref1]:
    Tal Ashuach, Daniel A. Reidenbach, Nir Yosef (2021),
    _PeakVI: A Deep Generative Model For Single Cell Chromatin Accessibility Analysis_,
    [BioRxiv](https://www.biorxiv.org/content/10.1101/2021.04.29.442020v1).