# PeakVI
**peakVI** [^ref1] (Python class {class}`~scvi.model.PEAKVI`) is a generative model of scATAC-seq data that can subsequently
be used for many common downstream tasks.
The advantages of peakVI are:
- Comprehensive in capabilities.
- Scalable to very large datasets (>1 million cells).
The limitations of peakVI include:
- Effectively requires a GPU for fast inference.
- Latent space is not interpretable, unlike that of a linear method.
```{topic} Tutorials:
- {doc}`/tutorials/notebooks/quick_start/api_overview`
- {doc}`/tutorials/notebooks/atac/PeakVI`
- {doc}`/tutorials/notebooks/atac/peakvi_in_R`
- {doc}`/tutorials/notebooks/scrna/scarches_scvi_tools`
```
## Preliminaries
PeakVI takes as input a scATAC-seq accessibility matrix $X$ with $N$ cells and $M$ genomic regions (peaks),
which can be binary or have count data.
Additionally, a design matrix $S$ containing $p$ observed covariates, such as batch, donor, etc, is an optional input.
While $S$ can include both categorical covariates and continuous covariates, in the following, we assume it contains only one
categorical covariate with $K$ categories, which represents the common case of having multiple batches of data.
## Generative process
PeakVI posits that the observed value for cell $i$ and region $j$, $x_{ij}$,
is generated by the following process:
```{math}
:nowrap: true
\begin{align}
z_i &\sim {\mathrm{Normal}}\left( {0,I} \right) \\
y_{ij} &= g_z^j\left(z_i, s_i\right) \\
\ell_i &= f_\ell\left(x_i\right)\\
\left(x_{ij} > 0\right) &\sim \mathrm{Bernoulli}\left(y_{ij} \cdot \ell_i \cdot r_j \right)
\end{align}
```
Briefly, detecting a region as accessible ($x_{ij} > 0$) is generated by a Bernoulli random variable which
depends on a cell-specific latent variable $z_i$, which captures biological heterogeneity, and two auxiliary
scaling factors $\ell_i, r_j$, which account for cell-specific and region-specific technical effects, respectively.
The PeakVI generative process uses a single neural network:
```{math}
:nowrap: true
\begin{align}
g_z(z_i, s_i) &: \mathbb{R}^{d} \times \{0, 1\}^K \to \left[0,1\right]^M
\end{align}
```
which estimates the probability of accessibility.
The latent variables, along with their description are summarized in the following table:
```{eval-rst}
.. list-table::
:widths: 20 90 15
:header-rows: 1
* - Latent variable
- Description
- Code variable (if different)
* - :math:`z_i \in \mathbb{R}^d`
- Low-dimensional representation capturing the state of a cell
- ``z``
* - :math:`y_i \in \left[0,1\right]^{M}`
- Acessibility probability estimate
- ``p``
* - :math:`\ell_i \in \left[0,1\right]`
- Cell-wise scaling factor
- ``d``
* - :math:`r_j \in \left[0,1\right]`
- Region-wise scaling factor
- ``f``
```
## Inference
PeakVI uses variational inference, specifically auto-encoding variational Bayes (see {doc}`/user_guide/background/variational_inference`) to learn both the model parameters (the neural network params, scaling factors, etc.) and an approximate posterior distribution with the following factorization:
```{math}
:nowrap: true
\begin{align}
q_\eta(z_i, \ell_i \mid x_i) :=
q_\eta(z_i \mid x_i)q_\eta(\ell_i \mid x_i).
\end{align}
```
Here $\eta$ is a set of parameters corresponding to inference neural networks (encoders), which we do not describe in detail here. The underlying class used as the encoder for PeakVI is {class}`~scvi.nn.Encoder`.
It it important to note that by default, PeakVI only receives the accessibility data as input (i.e., not the observed cell-level covariates).
Empirically, we have not seen much of a difference by having the encoder take as input the concatenation of these items (i.e., $q_\eta(z_i, \ell_i \mid x_i, s_i)$, but users can control it manually by passing
`encode_covariates=True` to {class}`scvi.model.PEAKVI`.
## Tasks
Here we provide an overview of some of the tasks that PeakVI can perform. Please see {class}`scvi.model.PEAKVI` for the full API reference.
### Dimensionality reduction
For dimensionality reduction, the mean of the approximate posterior $q_\eta(z_i \mid x_i)$ is returned by default.
This is achieved using the method:
```
>>> adata.obsm["X_peakvi"] = model.get_latent_representation()
```
Users may also return samples from this distribution, as opposed to the mean, by passing the argument `give_mean=False`.
The latent representation can be used to create a nearest neighbor graph with scanpy with:
```
>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_scvi")
>>> adata.obsp["distances"]
```
### Transfer learning
A PeakVI model can be pre-trained on reference data and updated with query data using {func}`~scvi.model.PEAKVI.load_query_data`, which then facilitates transfer of metadata like cell type annotations. See the {doc}`/user_guide/background/transfer_learning` guide for more information.
### Estimation of accessibility
In {func}`~scvi.model.PEAKVI.get_accessibility_estimates` PeakVI returns the expected value of $y_i$ under the approximate posterior. For one cell $i$, this can be written as:
```{math}
:nowrap: true
\begin{align}
\mathbb{E}_{q_\eta(z_i \mid x_i)}\left[g_z\left( z_i, s_i \right) \right],
\end{align}
```
As the expectation can be expensive to compute, by default, PeakVI uses the mean of $z_i$ as a point estimate, but this behaviour can be changed by setting `use_z_mean=False` argument.
Notably, this function also has the `transform_batch` parameter that allows counterfactual prediction of accessibility in an unobserved batch. See the {doc}`/user_guide/background/counterfactual_prediction` guide.
### Differential accessibility
Differential accessibility analysis is achieved with {func}`~scvi.model.PEAKVI.differential_accessibility`. PeakVI tests differences in accessibility of $g_z\left( z_i, s_i \right)$.
[^ref1]:
Tal Ashuach, Daniel A. Reidenbach, Nir Yosef (2021),
_PeakVI: A Deep Generative Model For Single Cell Chromatin Accessibility Analysis_,
[BioRxiv](https://www.biorxiv.org/content/10.1101/2021.04.29.442020v1).