PeakVI

peakVI 1 (Python class PEAKVI) is a generative model of scATAC-seq data that can subsequently be used for many common downstream tasks.

The advantages of peakVI are:

  • Comprehensive in capabilities.

  • Scalable to very large datasets (>1 million cells).

The limitations of peakVI include:

  • Effectively requires a GPU for fast inference.

  • Latent space is not interpretable, unlike that of a linear method.

Preliminaries

PeakVI takes as input a scATAC-seq accessibility matrix \(X\) with \(N\) cells and \(M\) genomic regions (peaks), which can be binary or have count data. Additionally, a design matrix \(S\) containing \(p\) observed covariates, such as batch, donor, etc, is an optional input. While \(S\) can include both categorical covariates and continuous covariates, in the following, we assume it contains only one categorical covariate with \(K\) categories, which represents the common case of having multiple batches of data.

Generative process

PeakVI posits that the observed value for cell \(i\) and region \(j\), \(x_{ij}\), is generated by the following process:

\begin{align} z_i &\sim {\mathrm{Normal}}\left( {0,I} \right) \\ y_{ij} &= g_z^j\left(z_i, s_i\right) \\ \ell_i &= f_\ell\left(x_i\right)\\ \left(x_{ij} > 0\right) &\sim \mathrm{Bernoulli}\left(y_{ij} \cdot \ell_i \cdot r_j \right) \end{align}

Briefly, detecting a region as accessible (\(x_{ij} > 0\)) is generated by a Bernoulli random variable which depends on a cell-specific latent variable \(z_i\), which captures biological heterogeneity, and two auxiliary scaling factors \(\ell_i, r_j\), which account for cell-specific and region-specific technical effects, respectively.

The PeakVI generative process uses a single neural network:

\begin{align} g_z(z_i, s_i) &: \mathbb{R}^{d} \times \{0, 1\}^K \to \left[0,1\right]^M \end{align}

which estimates the probability of accessibility.

The latent variables, along with their description are summarized in the following table:

Latent variable

Description

Code variable (if different)

\(z_i \in \mathbb{R}^d\)

Low-dimensional representation capturing the state of a cell

z

\(y_i \in \left[0,1\right]^{M}\)

Acessibility probability estimate

p

\(\ell_i \in \left[0,1\right]\)

Cell-wise scaling factor

d

\(r_j \in \left[0,1\right]\)

Region-wise scaling factor

f

Inference

PeakVI uses variational inference, specifically auto-encoding variational Bayes (see Variational Inference) to learn both the model parameters (the neural network params, scaling factors, etc.) and an approximate posterior distribution with the following factorization:

\begin{align} q_\eta(z_i, \ell_i \mid x_i) := q_\eta(z_i \mid x_i)q_\eta(\ell_i \mid x_i). \end{align}

Here \(\eta\) is a set of parameters corresponding to inference neural networks (encoders), which we do not describe in detail here. The underlying class used as the encoder for PeakVI is Encoder.

It it important to note that by default, PeakVI only receives the accessibility data as input (i.e., not the observed cell-level covariates). Empirically, we have not seen much of a difference by having the encoder take as input the concatenation of these items (i.e., \(q_\eta(z_i, \ell_i \mid x_i, s_i)\), but users can control it manually by passing encode_covariates=True to scvi.model.PEAKVI.

Tasks

Here we provide an overview of some of the tasks that PeakVI can perform. Please see scvi.model.PEAKVI for the full API reference.

Dimensionality reduction

For dimensionality reduction, the mean of the approximate posterior \(q_\eta(z_i \mid x_i)\) is returned by default. This is achieved using the method:

>>> adata.obsm["X_peakvi"] = model.get_latent_representation()

Users may also return samples from this distribution, as opposed to the mean, by passing the argument give_mean=False. The latent representation can be used to create a nearest neighbor graph with scanpy with:

>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_scvi")
>>> adata.obsp["distances"]

Transfer learning

A PeakVI model can be pre-trained on reference data and updated with query data using load_query_data(), which then facilitates transfer of metadata like cell type annotations. See the Transfer learning guide for more information.

Estimation of accessibility

In get_accessibility_estimates() PeakVI returns the expected value of \(y_i\) under the approximate posterior. For one cell \(i\), this can be written as:

\begin{align} \mathbb{E}_{q_\eta(z_i \mid x_i)}\left[g_z\left( z_i, s_i \right) \right], \end{align}

As the expectation can be expensive to compute, by default, PeakVI uses the mean of \(z_i\) as a point estimate, but this behaviour can be changed by setting use_z_mean=False argument.

Notably, this function also has the transform_batch parameter that allows counterfactual prediction of accessibility in an unobserved batch. See the Counterfactual prediction guide.

Differential accessibility

Differential accessibility analysis is achieved with differential_accessibility(). PeakVI tests differences in accessibility of \(g_z\left( z_i, s_i \right)\).

References:

1

Tal Ashuach, Daniel A. Reidenbach, Nir Yosef (2021), PeakVI: A Deep Generative Model For Single Cell Chromatin Accessibility Analysis, BioRxiv.