PoissonVI: Analyzing quantitative scATAC-seq fragment counts#

PoissonVI is used for analyzing scATAC-seq data using quantitative fragment counts. This tutorial walks through how to read, set-up and train the model, accessing and visualizing the latent space, and differential accessibility. We use the 5kPBMC sample dataset from 10x but these steps can be easily adjusted for other datasets.

If you use PoissonVI, please consider citing:

  • Martens, L. D., Fischer, D. S., Yépez, V. A., Theis, F. J., & Gagneur, J. (2023). Modeling fragment counts improves single-cell ATAC-seq analysis. Nature Methods.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

import os
import tempfile
from pathlib import Path

import pooch
import scanpy as sc
import scvi
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.0

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(4, 4), frameon=False)
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor" : "w"}
%config InlineBackend.figure_format="retina"

Download and preprocess data#

First we need to download the sample data - we will use the pooch package to do this.

def download_data(save_path: str, fname: str = "atac_pbmc_5k") -> str:
    """Download the data files."""
    data_paths = pooch.retrieve(
        url="https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.tar.gz",
        known_hash="78e536a1508108fa5bd3b411a7484809c011f3403800369b20db05bdbfeb2284",
        fname=fname,
        path=save_path,
        processor=pooch.Untar(),
        progressbar=True,
    )
    return str(Path(data_paths[0]).parent)
data_path = download_data(save_dir.name)

PoissonVI expects as input an AnnData object with a cell-by-region matrix. There are various pipelines that handle preprocessing of scATAC-seq to obtain this matrix from the sequencing data. If the data was generated by 10x genomics, this matrix is among the standard outputs of CellRanger. Other pipelines, like SnapATAC and ArchR, also generate similar matrices.

Important

In each of these approaches, a read count matrix is generated which has more even counts than uneven counts. Due to this, the matrix can not easily be modeled with a standard count distribution. As an alternative, we recommend modeling the fragment counts, which are monotonic decreasing. We suggest a straightforward method to convert read counts to fragment counts: round the read counts to the nearest even count and divide by two- which in most cases is very close to the true fragment counts. If you do not want to use the approximation, counting fragments overlapping a peak can be achieved using methods such as the FeatureMatrix function in Signac, which specifically counts fragments.

In the case of 10x data, scvi has a special reader function scvi.data.read_10x_atac that reads the files and creates an AnnData object, demonstrated below.

Throughout this tutorial, we use sample scATACseq data from 10X of 5K PBMCs.

adata = scvi.data.read_10x_atac(data_path)
adata
AnnData object with n_obs × n_vars = 4585 × 115554
    obs: 'batch_id'
    var: 'chr', 'start', 'end'

A rapid method for discerning whether a count matrix contains read counts or fragment counts involves a simple count of ones and twos within the matrix. If we have a read count matrix, we expect a higher number of even counts, resulting in a higher frequency of twos compared to ones.

(adata.X == 1).sum()
1617289
(adata.X == 2).sum()
25632158

We can see that we have more twos than ones, which means that we have a read count matrix. In this tutorial, we will use the described approximation to convert them to fragment counts using the scvi.data.reads_to_fragments which will store the fragment counts in adata.layers['fragments']

scvi.data.reads_to_fragments(adata)
adata
AnnData object with n_obs × n_vars = 4585 × 115554
    obs: 'batch_id'
    var: 'chr', 'start', 'end'
    layers: 'fragments'

If we test again the number of ones and twos for the fragment count layer, we see that we do have more ones that twos:

(adata.layers["fragments"] == 1).sum()
27249447
(adata.layers["fragments"] == 2).sum()
7233892

We can use Scanpy to further handle, filter, and manipulate the data. In our case, we might want to filter out peaks that are rarely detected, to make the model train faster:

print("# regions before filtering:", adata.shape[-1])

# compute the threshold: 5% of the cells
min_cells = int(adata.shape[0] * 0.05)
# in-place filtering of regions
sc.pp.filter_genes(adata, min_cells=min_cells)

print("# regions after filtering:", adata.shape[-1])
# regions before filtering: 115554
# regions after filtering: 33142

Set up, train, save, and load the model#

We can now set up the AnnData object with PoissonVI, which will ensure everything the model needs is in place for training.

This is also the stage where we can condition the model on additional covariates, which encourages the model to remove the impact of those covariates from the learned latent space. Our sample data is a single batch, so we won’t demonstrate this directly, but it can be done simply by setting the batch_key argument to the annotation to be used as a batch covariate (must be a valid key in adata.obs). As the count layer, we will choose the fragments layer which contains our converted counts.

scvi.external.POISSONVI.setup_anndata(adata, layer="fragments")

We can now create a PoissonVI model object and train it!

Important

The default max_epochs is set to 500, but in practice PoissonVI stops early once the model converges (we quantify convergence with the model’s validation reconstruction loss). This is especially the case for larger datasets, which require fewer training epochs to converge since each epoch lets the model view more data.

This means that the estimated training runtime is usually an overestimate of the actual runtime. For the data used in this tutorial, it typically converges with around half of max_epochs!

model = scvi.external.POISSONVI(adata)
model.train()
Epoch 248/500:  50%|████▉     | 248/500 [01:50<01:51,  2.25it/s, v_num=1, train_loss_step=1.76e+4, train_loss_epoch=1.84e+4]
Monitored metric reconstruction_loss_validation did not improve in the last 50 records. Best score: 18672.771. Signaling Trainer to stop.

Since training a model can take a while, we recommend saving the trained model after training, just in case.

model_dir = os.path.join(save_dir.name, "poissonvi_pbmc")
model.save(model_dir, overwrite=True)

We can then load the model later, which require providing an AnnData object that is structured similarly to the one used for training (or, in most cases, the same one):

model = scvi.external.POISSONVI.load(model_dir, adata=adata)
INFO     File /tmp/tmp_udwzwzp/poissonvi_pbmc/model.pt already downloaded                                          

Visualizing and analyzing the latent space#

We can now use the trained model to visualize, cluster, and analyze the data. We first extract the latent representation from the model, and save it back into our AnnData object:

POISSONVI_LATENT_KEY = "X_poissonvi"

latent = model.get_latent_representation()
adata.obsm[POISSONVI_LATENT_KEY] = latent
latent.shape
(4585, 13)

We can now use Scanpy to cluster and visualize our latent space:

POISSONVI_CLUSTERS_KEY = "clusters_poissonvi"

# compute the k-nearest-neighbor graph that is used in both clustering and umap algorithms
sc.pp.neighbors(adata, use_rep=POISSONVI_LATENT_KEY)
# compute the umap
sc.tl.umap(adata, min_dist=0.2)
# cluster the space (we use a lower resolution to get fewer clusters than the default)
sc.tl.leiden(adata, key_added=POISSONVI_CLUSTERS_KEY, resolution=0.2)
sc.pl.umap(adata, color=POISSONVI_CLUSTERS_KEY)
../../../_images/6a1a71ff9947418df9c5466efb3869e44868b9ca89418d286f1f850aa14bcf08.png

Differential accessibility#

Finally, we can use PoissonVI to identify regions that are differentially accessible. There are many different ways to run this analysis, but the simplest is comparing one cluster against all others, or comparing two clusters to each other. In this case we’ll be looking for marker-regions, so we’ll mostly want a one-sided test (the significant regions will only be the ones preferentially accessible in our target cluster).

Important

If the data includes multiple batches, we encourage setting batch_correction=True so the model will sample from multiple batches when computing the differential signal.

da_peaks = model.differential_accessibility(
    adata, groupby=POISSONVI_CLUSTERS_KEY, group1="3", two_sided=False
)
DE...: 100%|██████████| 1/1 [00:05<00:00,  5.48s/it]

We will filter the results for peaks that are accessible in the cluster by filtering for peaks that are at least present in 5% of the cells in the cluster (emp_prob1 > 0.05).

da_peaks_filt = da_peaks[(da_peaks.emp_prob1 >= 0.05)]
da_peaks_filt.head(10)
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std lfc_min lfc_max emp_prob1 emp_prob2 emp_effect is_de_fdr_0.05 comparison group1 group2
chr11:115373408-115375985 0.9970 0.0030 5.806135 0.000079 0.000018 0.0 0.05 2.518515 2.437192 1.208494 -0.611462 6.126089 0.550095 0.121302 0.428793 True 3 vs Rest 3 Rest
chr16:23493881-23495402 0.9956 0.0044 5.421739 0.000068 0.000003 0.0 0.05 5.564806 6.104351 1.920917 -1.356367 9.319690 0.468809 0.018245 0.450564 True 3 vs Rest 3 Rest
chr22:27052873-27055430 0.9948 0.0052 5.253881 0.000044 0.000017 0.0 0.05 1.413068 1.406695 0.578154 -0.346727 2.762671 0.347826 0.151381 0.196445 True 3 vs Rest 3 Rest
chr10:81076360-81078474 0.9948 0.0052 5.253881 0.000042 0.000017 0.0 0.05 1.494146 1.241558 0.841479 -0.785093 3.902652 0.328922 0.133876 0.195047 True 3 vs Rest 3 Rest
chr3:169994725-169996073 0.9944 0.0056 5.179371 0.000034 0.000004 0.0 0.05 3.686594 3.933604 1.379379 -1.700559 6.616471 0.272212 0.027367 0.244845 True 3 vs Rest 3 Rest
chr17:37928618-37930728 0.9938 0.0062 5.076985 0.000019 0.000005 0.0 0.05 2.728828 2.315487 1.618020 -1.045378 6.670283 0.151229 0.041913 0.109316 True 3 vs Rest 3 Rest
chr12:22562036-22563652 0.9926 0.0074 4.898846 0.000057 0.000005 0.0 0.05 5.017355 5.710606 2.123703 -1.687163 8.243825 0.415879 0.039694 0.376185 True 3 vs Rest 3 Rest
chr16:81751220-81753358 0.9924 0.0076 4.871977 0.000024 0.000008 0.0 0.05 1.749549 1.810368 0.736658 -0.281440 3.999022 0.219282 0.061884 0.157398 True 3 vs Rest 3 Rest
chr10:22934917-22936564 0.9924 0.0076 4.871977 0.000025 0.000006 0.0 0.05 3.250408 3.250577 2.023829 -0.644413 6.487287 0.232514 0.047830 0.184684 True 3 vs Rest 3 Rest
chr19:6530376-6534948 0.9916 0.0084 4.771087 0.000063 0.000031 0.0 0.05 1.082075 1.120963 0.427321 -0.434524 2.111772 0.468809 0.261095 0.207714 True 3 vs Rest 3 Rest

We can visualize the marker peaks of cluster 3:

sc.pl.umap(adata, color=da_peaks_filt.index[:4], layer="fragments", vmax="p99.0")
../../../_images/98f4deb7244d5ec04f1729ce7210c47a85a493e8320958daaa116dc6a5946c3c.png