scBasset

scBasset#

scBasset [Yuan and Kelley, 2022] (Python class SCBASSET) is a sequence-based model for representation learning of scATAC-seq data. It uses the DNA sequence of each accessible region to learn region embeddings and jointly learns cell embeddings that reconstruct a binary accessibility matrix.

Warning

SCBASSET’s development is still in progress. The current scvi-tools implementation may not fully reproduce the original implementation’s results.

The advantages of scBasset are:

Sequence representations allow for TF motif discovery and other sequence-based analyses.
The learned cell embeddings can be used for visualization, clustering, and batch integration of scATAC-seq data.
The model can score transcription factor activity with a motif injection procedure.

The limitations of scBasset include:

It expects binary accessibility data and DNA sequence encodings for the genomic regions.
The current implementation assumes fixed-length sequence inputs, following the original 1344 bp scBasset setting.
scBasset cannot currently leverage unobserved data and thus cannot currently be used for transfer learning tasks.
The built-in motif library download currently supports the human motif library used by the scBasset paper.

Preliminaries#

scBasset uses a region-by-cell AnnData object. In a standard scATAC-seq AnnData object, cells are observations and regions are variables, so the data are typically transposed before setup:

>>> bdata = adata.transpose()
>>> SCBASSET.setup_anndata(bdata, layer="binary", dna_code_key="dna_code")

The registered matrix should contain binary accessibility values. The dna_code_key argument points to integer-encoded DNA sequences for each region. In the transposed object, these encodings are stored in bdata.obsm, one row per region. If a batch_key is supplied, it is read from bdata.var because cells are variables in this layout.

The tutorial demonstrates creating the required sequence fields with add_dna_sequence(), which stores both raw sequence strings and integer codes.

Overview#

scBasset is not a variational autoencoder. It is a neural network that predicts cell-by-region accessibility from genomic sequence.

The model first converts each DNA sequence into a one-hot representation. A convolutional neural network processes the sequence with stochastic reverse-complement augmentation, stochastic shifts, a stem convolution, a convolutional tower, and a bottleneck dense layer. The output is a low-dimensional embedding for each genomic region.

The model also learns:

a cell embedding matrix,
a cell-specific bias term, and
when a batch key is registered, a batch embedding for each cell’s batch.

The accessibility logits are computed as the matrix product between region embeddings and cell embeddings, plus the cell bias. When batches are registered, the batch embedding is added to the cell embedding before this product.

Inference#

scBasset is trained by minimizing binary cross-entropy between predicted accessibility logits and the observed binary accessibility matrix. The implementation also reports AUROC during training and can add L2 regularization to the cell embedding matrix with l2_reg_cell_embedding, which is useful in the batch-integration tutorial.

Training mini-batches are over regions, not cells. This follows from the region-by-cell input layout and the sequence encoder, which processes a batch of region sequences at a time.

Tasks#

Here we provide an overview of common tasks. Please see SCBASSET for the full API reference.

Cell Representation#

The learned cell embedding is returned by get_latent_representation():

>>> adata.obsm["X_scbasset"] = model.get_latent_representation()

This representation can be used for nearest-neighbor graph construction, visualization, clustering, or integration diagnostics.

Cell Bias#

get_cell_bias() returns the learned cell-specific bias term, which reflects cell-level accessibility propensity in the reconstruction model.

Transcription Factor Activity#

get_tf_activity() estimates transcription factor activity with motif injection. The method compares model-predicted accessibility for sequences with a known motif inserted against dinucleotide-shuffled background sequences, then returns a cell-level activity score for the requested transcription factor.