Preprocessing datasets for analysis with scvi-tools#

In this tutorial, we go over several preprocessing techniques for different types of data used with scvi-tools models. Each section of this tutorial is independent from the other sections, and is relevant to other scvi-tools tutorials which use the same type of dataset. For example, the preprocessing techniques used in the scRNA-seq section of this tutorial are generally used in the scvi-tools scRNA-seq related tutorials. Relevant tutorials are linked in each section.

Dependencies#

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()

Imports and preparing files#

import os
import tempfile
from pathlib import Path

import anndata as ad
import mudata as md
import muon
import numpy as np
import pooch
import scanpy as sc
import scvi
import seaborn as sns
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

You can modify file_name below to the name of the dataset you would like to preprocess. This file will end with .h5ad or .h5 depending on which model you plan to use.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

scRNA-seq#

Relevant scRNA-seq Tutorials:#

The following tutorial uses the exact preprocessed dataset that results from this section:

Atlas-level integration of lung data

The following tutorials may not use the exact dataset, but the preprocessing steps should be very similar to what is covered in this section:

MrVI Quick Start Tutorial

Differential expression on C. elegans data

Annotation with CellAssign

Linearly decoded VAE

Isolating perturbation-induced variations with contrastiveVI

Topic Modeling with Amortized LDA

Identification of zero-inflated genes

Integration of scRNA-seq data with substantial batch effects using sysVI

Seed labeling with scANVI

Benchmarking the scANVI fix

Integration and label transfer with Tabula Muris

Reference mapping with scvi-tools

Querying the Human Lung Cell Atlas

Decipher Quick Start Tutorial

Variational inference for RNA velocity with VeloVI

Preprocessing#

Here we demonstrate preprocessing of an scRNA-seq dataset using cells from the lung atlas integration task from the scIB manuscript.

adata_path = os.path.join(save_dir.name, "lung_atlas.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://exampledata.scverse.org/scvi-tools/lung_atlas.h5ad",
)
adata
AnnData object with n_obs × n_vars = 32472 × 15148
    obs: 'dataset', 'location', 'nGene', 'nUMI', 'patientGroup', 'percent.mito', 'protocol', 'sanger_type', 'size_factors', 'sampling_method', 'batch', 'cell_type', 'donor'
    layers: 'counts'

This dataset was already processed as described in the scIB manuscript. Generally, models in scvi-tools expect data that has been filtered/aggregated in the same fashion as one would do with Scanpy/Seurat.

Another important thing to keep in mind is highly-variable gene selection. While scVI and scANVI both accomodate using all genes in terms of runtime, we usually recommend filtering genes for best integration performance. This will, among other things, remove batch-specific variation due to batch-specific gene expression.

We perform this gene selection using the Scanpy pipeline while keeping the full dimension normalized data in the adata.raw object. We obtain variable genes from each dataset and take their intersections.

This dataset already has counts stored in layers, and adata.X contains log transformed scran normalized expression. If this is not the case for your dataset, you can preserve the raw counts, then normalize and log transform them with

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

Below we perform gene selection while keeping the full dimension normalized data in adata.raw. We obtain variable genes from each dataset and take their intersections.

adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=2000,
    flavor="seurat_v3",
    layer="counts",
    subset=True,
    batch_key="batch",  # Change depending on the batch key for your dataset
)

Important

We see a warning about the data not containing counts. This is due to some of the samples in this dataset containing SoupX-corrected counts. scvi-tools models will run for non-negative real-valued data, but we strongly suggest checking that these possibly non-count values are intended to represent pseudocounts, and not some other normalized data, in which the variance/covariance structure of the data has changed dramatically.

# Save preprocessed data for later use
adata.write_h5ad("lung_atlas_preprocessed.h5ad")

scATAC-seq#

Relevant scATAC-seq tutorials:#

The following tutorial uses the exact preprocessed dataset that results from this section:

PeakVI: Analyzing scATACseq data

The following tutorials may not use the exact dataset, but the preprocessing steps should be very similar to what is covered in this section:

PoissonVI: Analyzing quantitative scATAC-seq fragment counts

ScBasset: Analyzing scATACseq data

scBasset: Batch correction of scATACseq data

Preprocessing#

To demonstrate preprocessing of scATAC-seq data, we use a 5k PBMC sample dataset from 10X. We use the pooch package to download the data.

def download_data(save_path: str, fname: str = "atac_pbmc_5k") -> str:
    """Download the data files."""
    data_paths = pooch.retrieve(
        url="https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_filtered_peak_bc_matrix.tar.gz",
        known_hash="78e536a1508108fa5bd3b411a7484809c011f3403800369b20db05bdbfeb2284",
        fname=fname,
        path=save_path,
        processor=pooch.Untar(),
        progressbar=True,
    )
    return str(Path(data_paths[0]).parent)
data_path = download_data(save_dir.name)

PeakVI expects as input an AnnData object with a cell-by-region matrix. There are various pipelines that handle preprocessing of scATACseq to obtain this matrix from the sequencing data. If the data was generated by 10X genomics, this matrix is among the standard outputs of CellRanger. Other pipelines, like SnapATAC and ArchR, also generate similar matrices.

In the case of 10X data, PeakVI has a special reader function scvi.data.read_10x_atac that reads the files and creates an AnnData object, demonstrated below. For conveniece, we also demonstrate how to initialize an AnnData object from scratch.

Throughout this tutorial, we use sample scATACseq data from 10X of 5K PBMCs.

adata = scvi.data.read_10x_atac(data_path)
adata
AnnData object with n_obs × n_vars = 4585 × 115554
    obs: 'batch_id'
    var: 'chr', 'start', 'end'

We use Scanpy here to filter out peaks that are rarely detected, so that the model trains faster:

print("# regions before filtering:", adata.shape[-1])

# compute the threshold: 5% of the cells
min_cells = int(adata.shape[0] * 0.05)
# in-place filtering of regions
sc.pp.filter_genes(adata, min_cells=min_cells)

print("# regions after filtering:", adata.shape[-1])
# regions before filtering: 115554
# regions after filtering: 33142
adata.write_h5ad("atac_pbmc_5k_preprocessed.h5ad")

CITE-seq#

Relevant CITE-seq Tutorials#

The following tutorial uses the exact preprocessed dataset that results from this section:

CITE-seq analysis with totalVI

The following tutorials may not use the exact dataset, but the preprocessing steps should be very similar to what is covered in this section:

CITE-seq reference mapping with totalVI

Integration of CITE-seq and scRNA-seq data

Preprocessing#

Here we demonstrate preprocessing of CITE-seq data on integrated PBMC10k and PBMC5k datasets available from 10X Genomics. We will subset to the 17 proteins shared between the datasets.

def download_data(save_path: str, fname: str = "CITE-seq_pbmc_10k") -> str:
    """Download the data files."""
    if fname == "CITE-seq_pbmc_10k":
        hash = "md5:26d53ffe08b5f7d3b28df61b592d51fb"
        url = "https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_protein_v3/pbmc_10k_protein_v3_filtered_feature_bc_matrix.tar.gz"
    else:
        hash = "md5:9be3d672b0445b944ca06a507c3d780b"
        url = "https://cf.10xgenomics.com/samples/cell-exp/3.1.0/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz"

    data_paths = pooch.retrieve(
        url=url,
        known_hash=hash,
        fname=fname,
        path=save_path,
        processor=pooch.Untar(),
        progressbar=True,
    )
    return str(Path(data_paths[0]).parent)
data_path1 = download_data(save_dir.name)
data_path2 = download_data(save_dir.name, "CITE-seq_pbmc_5k")
mdata1 = muon.read_10x_mtx(data_path1)
mdata2 = muon.read_10x_mtx(data_path2)
# Create batch annotations
mdata1.mod["rna"].obs["batch"] = "10kpbmc"
mdata2.mod["rna"].obs["batch"] = "5kpbmc"
mdata1
MuData object with n_obs × n_vars = 7865 × 33555
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	7865 x 33538
      obs:	'batch'
      var:	'gene_ids', 'feature_types'
    prot:	7865 x 17
      var:	'gene_ids', 'feature_types'
mdata2
MuData object with n_obs × n_vars = 5247 × 33570
  var:	'gene_ids', 'feature_types'
  2 modalities
    rna:	5247 x 33538
      obs:	'batch'
      var:	'gene_ids', 'feature_types'
    prot:	5247 x 32
      var:	'gene_ids', 'feature_types'
# Make variable names unique
mdata1.mod["rna"].var_names_make_unique()
mdata2.mod["rna"].var_names_make_unique()

# Filter to shared genes and proteins between the two datasets with an inner join
rna_c = ad.concat([mdata1.mod["rna"], mdata2.mod["rna"]])
rna_c.obs_names_make_unique()

prot_c = ad.concat([mdata1.mod["prot"], mdata2.mod["prot"]])
prot_c.obs_names_make_unique()
mdata = md.MuData({"rna": rna_c, "prot": prot_c})
mdata
MuData object with n_obs × n_vars = 13112 × 33555
  2 modalities
    rna:	13112 x 33538
      obs:	'batch'
    prot:	13112 x 17

We make var names unique, store raw counts in layers, and normalize and log transform counts. Then we perform gene selection.

def filter_pbmc(mdata):
    # mdata.mod["rna"].var_names_make_unique()
    mdata.mod["rna"].layers["counts"] = mdata.mod["rna"].X.copy()
    sc.pp.normalize_total(mdata.mod["rna"])
    sc.pp.log1p(mdata.mod["rna"])

    sc.pp.highly_variable_genes(
        mdata.mod["rna"],
        n_top_genes=4000,
        flavor="seurat_v3",
        layer="counts",
    )
    # Place subsetted counts in a new modality
    mdata.mod["rna_subset"] = mdata.mod["rna"][:, mdata.mod["rna"].var["highly_variable"]].copy()
    mdata = md.MuData(mdata.mod)

    return mdata
mdata = filter_pbmc(mdata)
mdata
MuData object with n_obs × n_vars = 13112 × 37555
  3 modalities
    rna:	13112 x 33538
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'log1p', 'hvg'
      layers:	'counts'
    prot:	13112 x 17
    rna_subset:	13112 x 4000
      obs:	'batch'
      var:	'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
      uns:	'log1p', 'hvg'
      layers:	'counts'
mdata.write("CITE-seq_pbmc_combined_preprocessed.h5mu")

Multiome#

Relevant Multiome Tutorials#

The following tutorials may not use the exact dataset, but the preprocessing steps should be very similar to what is covered in this section:

Joint analysis of paired and unpaired multiomic data with MultiVI

Preprocessing#

First we download three datasets from 10X to use in this section of the tutorial: multiome, scRNA-seq, and scATAC-seq. Importantly, MultiVI assumes that there are shared features between the datasets. This is trivial for gene expression datasets, which generally use the same set of genes as features. For ATAC-seq peaks, this is less trivial, and often requires preprocessing steps with other tools to get all datasets to use a shared set of peaks. That can be achieved with tools like SnapATAC, ArchR, and CellRanger in the case of 10X data.

Important

MultiVI requires the datasets to use shared features. scATAC-seq datasets need to be processed to use a shared set of peaks.

Below we download 2 PBMC datasets from 10X:

multiome

scRNA-seq

def load_10x_mudata(url, filename=None, cache_dir=None, hash=None):
    if filename is None:
        filename = url.split("/")[-1]
    path = pooch.retrieve(
        url=url,
        known_hash=hash,
        fname=filename,
        path=cache_dir,
    )
    mdata = muon.read_10x_h5(path)

    return mdata
mdata_multiome = load_10x_mudata(
    url="https://cf.10xgenomics.com/samples/cell-arc/2.0.0/10k_PBMC_Multiome_nextgem_Chromium_X/10k_PBMC_Multiome_nextgem_Chromium_X_filtered_feature_bc_matrix.h5",
    hash="md5:cee000a98c8a05d0456add3ff864cdb0",
    cache_dir=save_dir.name,
)
Added `interval` annotation for features from /tmp/tmpaauhtjnz/10k_PBMC_Multiome_nextgem_Chromium_X_filtered_feature_bc_matrix.h5
mdata_expr = load_10x_mudata(
    url="https://cf.10xgenomics.com/samples/cell-exp/9.0.0/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5",
    hash="md5:be6fbc95481d813c8113b696ca3c3efd",
    cache_dir=save_dir.name,
)
mdata_multiome
MuData object with n_obs × n_vars = 10970 × 148344
  var:	'gene_ids', 'feature_types', 'genome', 'interval'
  2 modalities
    rna:	10970 x 36601
      var:	'gene_ids', 'feature_types', 'genome', 'interval'
    atac:	10970 x 111743
      var:	'gene_ids', 'feature_types', 'genome', 'interval'
mdata_expr
MuData object with n_obs × n_vars = 4782 × 38606
  var:	'gene_ids', 'feature_types', 'genome'
  1 modality
    rna:	4782 x 38606
      var:	'gene_ids', 'feature_types', 'genome'
# Make var names unique
for m in [mdata_expr, mdata_multiome]:
    m.var_names_make_unique()
    m.update()

Because mdata_expr only has one modality, we will add an atac modality, filled with zeros, so that when we concatenate the mudatas, the rna modality will be padded to have the same number of observations as the atac modality. MultiVI requires that all modalities are fully paired.

# Create a zero-filled ATAC AnnData with same variables as multiome ATAC
empty_atac = ad.AnnData(
    X=np.zeros((mdata_expr.n_obs, mdata_multiome.mod["atac"].n_vars)),
    var=mdata_multiome.mod["atac"].var.copy(),
    obs=mdata_expr.obs.copy(),
)

# Add ATAC modality to the RNA-only MuData
mdata_expr.mod["atac"] = empty_atac
mdata = md.concat([mdata_multiome, mdata_expr], label="dataset")
mdata
MuData object with n_obs × n_vars = 15752 × 135155
  obs:	'dataset'
  2 modalities
    atac:	15752 x 111743
      obs:	'dataset'
    rna:	15752 x 23412
      obs:	'dataset'

Below, we filter features for both modalities

print(mdata.mod["rna"].shape)
sc.pp.filter_genes(mdata.mod["rna"], min_cells=int(mdata.mod["rna"].shape[0] * 0.1))
print(mdata.mod["rna"].shape)
(15752, 23412)
(15752, 5947)
print(mdata.mod["atac"].shape)
sc.pp.filter_genes(mdata.mod["atac"], min_cells=int(mdata.mod["atac"].shape[0] * 0.1))
print(mdata.mod["atac"].shape)
(15752, 111743)
(15752, 15678)
mdata
MuData object with n_obs × n_vars = 15752 × 135155
  obs:	'dataset'
  2 modalities
    atac:	15752 x 15678
      obs:	'dataset'
      var:	'n_cells'
    rna:	15752 x 5947
      obs:	'dataset'
      var:	'n_cells'
# save preprocessed data
mdata.write("pbmc_multi_preprocessed.h5mu")

Spatial transciptomics#

Relevant Spatial transciptomics Tutorials#

The following tutorial uses the exact preprocessed dataset that results from this section:

Multi-resolution deconvolution of spatial transcriptomics

The following tutorials may not use the exact dataset, but the preprocessing steps should be very similar to what is covered in this section:

ResolVI to address noise and biases in spatial transcriptomics

Introduction to gimVI

Spatial mapping with Tangram

Stereoscope applied to left ventricule data

Mapping human lymph node cell types to 10X Visium with Cell2location

Preprocessing#

To demonstrate preprocessing for spatial transcriptomics, we use data from a comparative study of murine lymph nodes, comparing wild-type with a stimulation after injection of a mycobacteria. We have at disposal a 10x Visium dataset as well as a matching scRNA-seq dataset from the same tissue.

url1 = "https://github.com/romain-lopez/DestVI-reproducibility/blob/master/lymph_node/deconvolution/ST-LN-compressed.h5ad?raw=true"
url2 = "https://github.com/romain-lopez/DestVI-reproducibility/blob/master/lymph_node/deconvolution/scRNA-LN-compressed.h5ad?raw=true"
out1 = "data/ST-LN-compressed.h5ad"
out2 = "data/scRNA-LN-compressed.h5ad"

First, let’s load the single-cell data. We profiled immune cells from murine lymph nodes with 10x Chromium, as a control / case study to study the immune response to exposure to a mycobacteria (refer to DestVI paper for more info). It contains the raw counts (DestVI always takes raw counts as input).

sc_adata = sc.read(out2, backup_url=url2)
# let us filter some genes
G = 2000
sc.pp.filter_genes(sc_adata, min_counts=10)

sc_adata.layers["counts"] = sc_adata.X.copy()

sc.pp.highly_variable_genes(
    sc_adata, n_top_genes=G, subset=True, layer="counts", flavor="seurat_v3"
)

sc.pp.normalize_total(sc_adata, target_sum=10e4)
sc.pp.log1p(sc_adata)
sc_adata.raw = sc_adata

Load the spatial data

st_adata = sc.read(out1, backup_url=url1)
st_adata.layers["counts"] = st_adata.X.copy()
st_adata.obsm["spatial"] = st_adata.obsm["location"]

sc.pp.normalize_total(st_adata, target_sum=10e4)
sc.pp.log1p(st_adata)
st_adata.raw = st_adata

Here we must ensure that the two datasets have a common gene subset.

# filter genes to be the same on the spatial data
intersect = np.intersect1d(sc_adata.var_names, st_adata.var_names)
st_adata = st_adata[:, intersect].copy()
sc_adata = sc_adata[:, intersect].copy()
G = len(intersect)
st_adata.write_h5ad("st_lymph_node_preprocessed.h5ad")
sc_adata.write_h5ad("sc_lymph_node_preprocessed.h5ad")