Integrating single-cell methylation data from different scBS-seq experiments with methylVI

Integrating single-cell methylation data from different scBS-seq experiments with methylVI#


A common problem when analyzing single-cell omics datasets across multiple experiments is the presence of batch effects (i.e., systematic variations due to differences in sequencing platform). Here we demonstrate how methylVI can be used to integrate data from different single-cell bisulfite sequencing platforms. As an example, we’ll consider single-cell methylomes from the dentate gyrus region of the brain collected in “DNA methylation atlas of the mouse brain at single-cell resolution” (Liu et al., Nature, 2021) using two sequencing protocols: snmC-seq2 and snm-3C-seq.

If you use methylVI in your work, please consider citing

  • Weinberger, E. and Lee, S.I. A deep generative model of single-cell methylomic data. NeurIPS 2023 Generative AI and Biology (GenBio) Workshop.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip

Imports and Data Loading#

import tempfile

import matplotlib.pyplot as plt
import mudata
import numpy as np
import pooch
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scvi.external import METHYLVI
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.2.1

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

This dataset was preprocessed as described in the methylVI manuscript. In particular ALLC files containing methylation reads at individual cytosines were aggregated into gene body methylation features using ALLCools. Due to their distinct regulatory roles, CpG methylation and CpH methylation (i.e., non-CpG methylation) were considered separately. The resulting methylation count features were stored in a MuData object with separate modality fields for each methylation context: mCG (for CpG methylation) and mCH for CpH methylation.

mdata = mudata.read_h5mu(
    pooch.retrieve(
        url="https://figshare.com/ndownloader/files/49632108",
        known_hash="9fc5980fa807151ce983309af1011949fa0b2b746c13c2960f4b4768b0a3172c",
        fname="Liu2021_batch.h5mu",
        path=save_dir.name,
        progressbar=True,
    )
)
mdata.mod
MuData
├─ mCG AnnData (10726 x 2500)
└─ mCH AnnData (10726 x 2500)

Within the modality-specific AnnData objects, the coverage (i.e., number of measured cytosines) within each gene body region for a cell can be found in the cov layer, while the number of methylated cytosines can be found in the mc layer.

mdata["mCG"].layers
Layers with keys: cov, mc

Finally, normalized methylation counts for each region (computed using the add_mc_frac function in ALLCools) can be found in the AnnData objects’ .X fields.

mdata["mCG"].X
array([[ 0.        ,  0.        , -0.34459764, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.62912544,  0.96123438,  0.90394941, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        , -1.88005841, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.87361881, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.10159167,  0.        ,  0.48925986, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.367354  , ...,  0.        ,
         0.        ,  0.        ]])

Now we’ll briefly explore our data. To do so, we’ll follow the ALLCools workflow detailed here to combine information from the two methylation contexts (i.e., CpG and CpH methylation). In particular, we’ll apply principal component analysis (PCA) to each modality, and then aggregate the resulting PCs.

sc.tl.pca(mdata["mCG"])
sc.tl.pca(mdata["mCH"])

ch_pcs = mdata["mCH"].obsm["X_pca"]
cg_pcs = mdata["mCG"].obsm["X_pca"]

# standardize the values of PCs from both modalities
cg_pcs = cg_pcs / cg_pcs.std()
ch_pcs = ch_pcs / ch_pcs.std()

# total_pcs
total_pcs = np.hstack([ch_pcs, cg_pcs])

mdata.obsm["X_pca"] = total_pcs

Now we’ll visualize our data with UMAP. We find that there exist clear batch effects between the two sequencing protocols, with clear separation by protocol (left) in addition to variations due to cell type differences (right).

sc.pp.neighbors(mdata)
sc.tl.umap(mdata)

fig, ax = plt.subplots(1, 2, figsize=(11, 5))

sc.pl.umap(mdata, color="mCG:Platform", ax=ax[0], show=False, title="Sequencing protocol")
sc.pl.umap(mdata, color="mCG:CoarseType", ax=ax[1], show=False, title="Cell type")

plt.subplots_adjust(wspace=0.5)

In the next section, we’ll see how methylVI can alleviate these issues.

Prepare and run model#

Before training our model, we’ll use methylVI’s setup_mudata function to prepare our MuData object for training.

First, we need to tell methylVI which modalities in our MuData object to consider via the methylation_contexts argument. Here we’ll jointly model both CpG and non-CpG methylation features, so we’ll set this argument to a list containing the names of both modalities. Next, methylVI directly models the total coverage and number of methylated cytosines in each region. Thus, for each modality in our MuData object, we need layers containing the coverage in each region (specified by cov_layer) and layers with the number of methylated cytosines (specified by mc_layer). Finally, we’ll provide methylVI with a categorical covariate specifying the sequencing protocol used for each cell.

METHYLVI.setup_mudata(
    mdata,
    mc_layer="mc",
    cov_layer="cov",
    methylation_contexts=["mCG", "mCH"],
    categorical_covariate_keys=["Protocol"],
    modalities={"categorical_covariate_keys": "mCG"},
)

Note

Specify the modality of each argument via the modalities dictionary, which maps layer/key arguments to MuData modalities. In our case, both the mCG and mCH modalities contain the all of the fields specified in the categorical_covariate_keys argument (i.e., Protocol) in their respective .obs, so we arbitrarily choose mCG here

Next, we declare a METHYLVI model object with 20 latent variables as done in the methylVI manuscript, and train our model with early stopping.

model = METHYLVI(
    mdata,
    n_latent=20,
)

model.train(max_epochs=500, early_stopping=True)
INFO     The model has been initialized
Monitored metric elbo_validation did not improve in the last 45 records. Best score: 5524.801. Signaling Trainer to stop.

Now that our model is trained, we’ll obtain latent representations of each cell.

mdata.obsm["methylVI"] = model.get_latent_representation()

Visualizing these representations, we find that cells now mix across protocol (left) while separating by cell type (right).

sc.pp.neighbors(mdata, use_rep="methylVI")
sc.tl.umap(mdata)

fig, ax = plt.subplots(1, 2, figsize=(11, 5))

sc.pl.umap(mdata, color="mCG:Platform", ax=ax[0], show=False, title="Sequencing protocol")
sc.pl.umap(mdata, color="mCG:CoarseType", ax=ax[1], show=False, title="Cell type")

plt.subplots_adjust(wspace=0.5)