PeakVI: Analyzing scATACseq data#
PeakVI is used for analyzing scATACseq data. This tutorial walks through how to read, set-up and train the model, accessing and visualizing the latent space, and differential accessibility. We use the 5kPBMC sample dataset from 10X but these steps can be easily adjusted for other datasets.
Note
Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.
!pip install --quiet scvi-colab
from scvi_colab import install
install()
import os
import tempfile
import matplotlib.pyplot as plt
import numpy as np
import scanpy as sc
import scvi
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2
Note
You can modify save_dir below to change where the data files for this tutorial are saved.
sc.set_figure_params(figsize=(4, 4), frameon=False)
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()
%config InlineBackend.print_figure_kwargs={"facecolor" : "w"}
%config InlineBackend.figure_format="retina"
Download and preprocess data#
In this tutorial, we use an already preprocessed dataset of 5k PBMCs from 10X Genomics. To see the exact preprocessing that was done, or to preprocess your own scATAC-seq dataset for use with scvi-tools models, see our preprocessing tutorial.
# download preprocessed dataset
adata_path = os.path.join(save_dir.name, "atac_pbmc_5k_preprocessed.h5ad")
adata = sc.read(
adata_path,
backup_url="https://exampledata.scverse.org/scvi-tools/atac_pbmc_5k_preprocessed.h5ad",
)
adata
AnnData object with n_obs × n_vars = 4585 × 33142
obs: 'batch_id'
var: 'chr', 'start', 'end', 'n_cells'
Set up, train, save, and load the model#
We can now set up the AnnData object with PeakVI, which will ensure everything the model needs is in place for training.
This is also the stage where we can condition the model on additional covariates, which encourages the model to remove the impact of those covariates from the learned latent space. Our sample data is a single batch, so we won’t demonstrate this directly, but it can be done simply by setting the batch_key argument to the annotation to be used as a batch covariate (must be a valid key in adata.obs) .
scvi.model.PEAKVI.setup_anndata(adata)
We can now create a PeakVI model object and train it!
Important
The default max_epochs is set to 500, but in practice PeakVI stops early once the model converges (we quantify convergence with the model’s validation reconstruction loss). This is especially the case for larger datasets, which require fewer training epochs to converge since each epoch lets the model view more data.
This means that the estimated training runtime is usually an overestimate of the actual runtime. For the data used in this tutorial, it typically converges with around half of max_epochs!
model = scvi.model.PEAKVI(adata)
model.train()
Monitored metric reconstruction_loss_validation did not improve in the last 50 records. Best score: 13014.943. Signaling Trainer to stop.
Since training a model can take a while, we recommend saving the trained model after training, just in case.
model_dir = os.path.join(save_dir.name, "peakvi_pbmc")
model.save(model_dir, overwrite=True)
We can then load the model later, which require providing an AnnData object that is structured similarly to the one used for training (or, in most cases, the same one):
model = scvi.model.PEAKVI.load(model_dir, adata=adata)
INFO File /tmp/tmpgpgzg2d5/peakvi_pbmc/model.pt already downloaded
Visualizing and analyzing the latent space#
We can now use the trained model to visualize, cluster, and analyze the data. We first extract the latent representation from the model, and save it back into our AnnData object:
PEAKVI_LATENT_KEY = "X_peakvi"
latent = model.get_latent_representation()
adata.obsm[PEAKVI_LATENT_KEY] = latent
latent.shape
(4585, 13)
We can now use Scanpy to cluster and visualize our latent space:
PEAKVI_CLUSTERS_KEY = "clusters_peakvi"
# compute the k-nearest-neighbor graph that is used in both clustering and umap algorithms
sc.pp.neighbors(adata, use_rep=PEAKVI_LATENT_KEY)
# compute the umap
sc.tl.umap(adata, min_dist=0.2)
# cluster the space (we use a lower resolution to get fewer clusters than the default)
sc.tl.leiden(adata, key_added=PEAKVI_CLUSTERS_KEY, resolution=0.2)
Differential accessibility#
Finally, we can use PeakVI to identify regions that are differentially accessible. There are many different ways to run this analysis, but the simplest is comparing one cluster against all others, or comparing two clusters to each other. In the first case we’ll be looking for marker-regions, so we’ll mostly want a one-sided test (the significant regions will only be the ones preferentially accessible in our target cluster). In the second case we’ll use a two-sided test to find regions that are differentially accessible, regardless of direction.
We demonstrate both of these next, and do this in two different ways: (1) more convenient but less flexible: using an existing factor to group the cells, and then comparing groups. (2) more flexible: using cell indices directly.
Important
If the data includes multiple batches, we encourage setting batch_correction=True so the model will sample from multiple batches when computing the differential signal. We do this below despite the data only having a single batch, as a demonstration.
# (1.1) using a known factor to compare two clusters
da_res11 = model.differential_accessibility(
groupby=PEAKVI_CLUSTERS_KEY, group1="3", group2="0", test_mode="two"
)
# (1.2) using a known factor to compare a cluster against all other clusters
## if we only provide group1, group2 is all other cells by default
da_res12 = model.differential_accessibility(groupby=PEAKVI_CLUSTERS_KEY, group1="3")
# (2.1) using indices to compare two clusters
## we can use boolean masks or integer indices for the `idx1` and `idx2` arguments
da_res21 = model.differential_accessibility(
idx1=adata.obs[PEAKVI_CLUSTERS_KEY] == "3",
idx2=adata.obs[PEAKVI_CLUSTERS_KEY] == "0",
test_mode="two",
)
# (2.2) using indices to compare a cluster against all other clusters
## if we don't provide idx2, it uses all other cells as the contrast
da_res22 = model.differential_accessibility(
idx1=np.where(adata.obs[PEAKVI_CLUSTERS_KEY] == "3"),
)
da_res22.head()
| prob_da | is_da_fdr | bayes_factor | effect_size | emp_effect | est_prob1 | est_prob2 | emp_prob1 | emp_prob2 | |
|---|---|---|---|---|---|---|---|---|---|
| chr12:92566079-92567078 | 1.0 | True | 18.420681 | -0.834624 | -0.433824 | 0.864028 | 0.029404 | 0.448352 | 0.014528 |
| chr6:150943842-150944960 | 1.0 | True | 18.420681 | -0.863446 | -0.508288 | 0.936082 | 0.072636 | 0.549451 | 0.041162 |
| chr6:91085122-91086213 | 1.0 | True | 18.420681 | -0.820112 | -0.472863 | 0.890324 | 0.070212 | 0.512088 | 0.039225 |
| chr10:35667588-35668789 | 1.0 | True | 18.420681 | -0.546625 | -0.260812 | 0.604062 | 0.057437 | 0.290110 | 0.029298 |
| chr3:122294823-122297353 | 1.0 | True | 18.420681 | -0.733786 | -0.417974 | 0.868624 | 0.134839 | 0.492308 | 0.074334 |
Note that da_res11 and da_res21 are equivalent, as are da_res12 and da_res22.
The return value is a pandas DataFrame with the differential results and basic properties of the comparison:
prob_da in our case is the probability of cells from cluster 0 being more than 0.05 (the default minimal effect) more accessible than cells from the rest of the data.
is_da_fdr is a conservative classification (True/False) of whether a region is differential accessible. This is one way to threshold the results.
bayes_factor is a statistical significance score. It doesn’t have a commonly acceptable threshold (e.g 0.05 for p-values), bu we demonstrate below that it’s well calibrated to the effect size.
effect_size is the effect size, calculated as est_prob2 - est_prob1.
emp_effect is the empirical effect size, calculated as emp_prob2 - emp_prob1.
est_prob{1,2} are the estimated probabilities of accessibility in group1 and group2.
emp_prob{1,2} are the empirical probabilities of detection (how many cells in group X was the region detected in).
We can make sure the probability of DA is well calibrated, and look at the regions that are identified as differentially accessible:
plt.scatter(da_res22.effect_size, da_res22.prob_da, s=1)
plt.xlabel("effect size")
plt.ylabel("probability of DA")
plt.show()
da_res22.loc[da_res22.is_da_fdr].sort_values("prob_da", ascending=False).head(10)
| prob_da | is_da_fdr | bayes_factor | effect_size | emp_effect | est_prob1 | est_prob2 | emp_prob1 | emp_prob2 | |
|---|---|---|---|---|---|---|---|---|---|
| chr12:92566079-92567078 | 1.0 | True | 18.420681 | -0.834624 | -0.433824 | 0.864028 | 0.029404 | 0.448352 | 0.014528 |
| chr6:150943842-150944960 | 1.0 | True | 18.420681 | -0.863446 | -0.508288 | 0.936082 | 0.072636 | 0.549451 | 0.041162 |
| chr6:91085122-91086213 | 1.0 | True | 18.420681 | -0.820112 | -0.472863 | 0.890324 | 0.070212 | 0.512088 | 0.039225 |
| chr10:35667588-35668789 | 1.0 | True | 18.420681 | -0.546625 | -0.260812 | 0.604062 | 0.057437 | 0.290110 | 0.029298 |
| chr3:122294823-122297353 | 1.0 | True | 18.420681 | -0.733786 | -0.417974 | 0.868624 | 0.134839 | 0.492308 | 0.074334 |
| chr10:126289468-126291985 | 1.0 | True | 18.420681 | -0.896548 | -0.485640 | 0.934311 | 0.037763 | 0.505495 | 0.019855 |
| chr17:37950320-37951202 | 1.0 | True | 18.420681 | -0.488380 | -0.224697 | 0.559338 | 0.070958 | 0.257143 | 0.032446 |
| chr8:28195531-28197587 | 1.0 | True | 18.420681 | -0.850861 | -0.541814 | 0.939959 | 0.089098 | 0.591209 | 0.049395 |
| chr4:115519248-115520882 | 1.0 | True | 18.420681 | -0.755023 | -0.380611 | 0.790196 | 0.035173 | 0.397802 | 0.017191 |
| chr15:79048532-79050206 | 1.0 | True | 18.420681 | -0.655174 | -0.341050 | 0.691689 | 0.036515 | 0.358242 | 0.017191 |
We can now examine these regions to understand what is happening in the data, using various different annotation and enrichment methods. For instance, chr11:60222766-60223569, one of the regions preferentially accessible in cluster 0, is the promoter region of MS4A1, also known as CD20, a known B-cell surface marker, indicating that cluster 0 are probably B-cells.