Note

This page was generated from harmonization.ipynb. Interactive online version: Colab badge. Some tutorial content may look better in light mode.

Atlas-level integration of lung data#

An important task of single-cell analysis is the integration of several samples, which we can perform with scVI. For integration, scVI treats the data as unlabelled. When our dataset is fully labelled (perhaps in independent studies, or independent analysis pipelines), we can obtain an integration that better preserves biology using scANVI, which incorporates cell type annotation information. Here we demonstrate this functionality with an integrated analysis of cells from the lung atlas integration task from the scIB manuscript. The same pipeline would generally be used to analyze any collection of scRNA-seq datasets.

[ ]:
!pip install --quiet scvi-colab
!pip install --quiet scib-metrics
from scvi_colab import install

install()
[ ]:
import scanpy as sc
import scvi
from rich import print
from scib_metrics.benchmark import Benchmarker
from scvi.model.utils import mde
from scvi_colab import install
[1]:
sc.set_figure_params(figsize=(4, 4))

%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/pytorch_lightning/utilities/seed.py:48: LightningDeprecationWarning: `pytorch_lightning.utilities.seed.seed_everything` has been deprecated in v1.8.0 and will be removed in v1.10.0. Please use `lightning_lite.utilities.seed.seed_everything` instead.
  rank_zero_deprecation(
Global seed set to 0
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/pytorch_lightning/loggers/base.py:24: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.rank_zero_experiment` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.rank_zero_experiment` instead.
  rank_zero_deprecation(
[2]:
adata = sc.read(
    "data/lung_atlas.h5ad",
    backup_url="https://figshare.com/ndownloader/files/24539942",
)

Note that this dataset has the counts already separated in a layer. Here, adata.X contains log transformed scran normalized expression.

[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 32472 × 15148
    obs: 'dataset', 'location', 'nGene', 'nUMI', 'patientGroup', 'percent.mito', 'protocol', 'sanger_type', 'size_factors', 'sampling_method', 'batch', 'cell_type', 'donor'
    layers: 'counts'

Dataset preprocessing#

This dataset was already processed as described in the scIB manuscript. Generally, models in scvi-tools expect data that has been filtered/aggregated in the same fashion as one would do with Scanpy/Seurat.

Another important thing to keep in mind is highly-variable gene selection. While scVI and scANVI both accomodate using all genes in terms of runtime, we usually recommend filtering genes for best integration performance. This will, among other things, remove batch-specific variation due to batch-specific gene expression.

We perform this gene selection using the Scanpy pipeline while keeping the full dimension normalized data in the adata.raw object. We obtain variable genes from each dataset and take their intersections.

[4]:
adata.raw = adata  # keep full dimension safe
sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=2000,
    layer="counts",
    batch_key="batch",
    subset=True,
)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:62: UserWarning: `flavor='seurat_v3'` expects raw count data, but non-integers were found.
  warnings.warn(

Important

We see a warning about the data not containing counts. This is due to some of the samples in this dataset containing SoupX-corrected counts. scvi-tools models will run for non-negative real-valued data, but we strongly suggest checking that these possibly non-count values are intended to represent pseudocounts, and not some other normalized data, in which the variance/covariance structure of the data has changed dramatically.

Integration with scVI#

As a first step, we assume that the data is completely unlabelled and we wish to find common axes of variation between the two datasets. There are many methods available in scanpy for this purpose (BBKNN, Scanorama, etc.). In this notebook we present scVI. To run scVI, we simply need to:

  • Register the AnnData object with the correct key to identify the sample and the layer key with the count data.

  • Create an SCVI model object.

[5]:
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
/home/adam/Documents/software/scvi-tools/scvi/data/fields/_layer_field.py:91: UserWarning: adata.layers[counts] does not contain unnormalized count data. Are you sure this is what you want?
  warnings.warn(

We note that these parameters are non-default; however, they have been verified to generally work well in the integration task.

[6]:
vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30, gene_likelihood="nb")

Now we train scVI. This should take a couple of minutes on a Colab session

[7]:
vae.train()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 246/246: 100%|██████████| 246/246 [06:12<00:00,  1.53s/it, loss=553, v_num=1]
`Trainer.fit` stopped: `max_epochs=246` reached.
Epoch 246/246: 100%|██████████| 246/246 [06:12<00:00,  1.51s/it, loss=553, v_num=1]

Once the training is done, we can evaluate the latent representation of each cell in the dataset and add it to the AnnData object

[8]:
adata.obsm["X_scVI"] = vae.get_latent_representation()

Finally, we can cluster the dataset and visualize it the scVI latent space.

[9]:
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.leiden(adata)

To visualize the scVI’s learned embeddings, we use the pymde package wrapperin scvi-tools. This is an alternative to UMAP that is GPU-accelerated.

[10]:
adata.obsm["X_mde"] = mde(adata.obsm["X_scVI"])
[11]:
sc.pl.embedding(
    adata,
    basis="X_mde",
    color=["batch", "leiden"],
    frameon=False,
    ncols=1,
)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/tutorials_notebooks_harmonization_23_1.png

Because this data has been used for benchmarking, we have access here to curated annotations. We can use those to assess whether the integration worked reasonably well.

[12]:
sc.pl.embedding(adata, basis="X_mde", color=["cell_type"], frameon=False, ncols=1)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/tutorials_notebooks_harmonization_25_1.png

At a quick glance, it looks like the integration worked well. Indeed, the two datasets are relatively mixed in latent space and the cell types cluster together. We see that this dataset is quite complex, where only some batches contain certain cell types.

Below we quantify the performance.

Integration with scANVI#

Previously, we used scVI as we assumed we did not have any cell type annotations available to guide us. Consequently, after the previous analysis, one would have to annotate clusters using differential expression, or by other means.

Now, we assume that all of our data is annotated. This can lead to a more accurate integration result when using scANVI, i.e., our latent data manifold is better suited to downstream tasks like visualization, trajectory inference, or nearest-neighbor-based tasks. scANVI requires:

  • the sample identifier for each cell (as in scVI)

  • the cell type/state for each cell

scANVI can also be used for label transfer and we recommend checking out the other scANVI tutorials to see explore this functionality.

Since we’ve already trained an scVI model on our data, we will use it to initialize scANVI. When initializing scANVI, we provide it the labels_key. As scANVI can also be used for datasets with partially-observed annotations, we need to give it the name of the category that corresponds to unlabeled cells. As we have no unlabeled cells, we can give it any random name that is not the name of an exisiting cell type.

Important

scANVI should be initialized from a scVI model pre-trained on the same exact data.

[13]:
lvae = scvi.model.SCANVI.from_scvi_model(
    vae,
    adata=adata,
    labels_key="cell_type",
    unlabeled_category="Unknown",
)
/home/adam/Documents/software/scvi-tools/scvi/data/fields/_layer_field.py:91: UserWarning: adata.layers[counts] does not contain unnormalized count data. Are you sure this is what you want?
  warnings.warn(
[14]:
lvae.train(max_epochs=20, n_samples_per_label=100)
INFO     Training for 20 epochs.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 20/20: 100%|██████████| 20/20 [01:10<00:00,  3.57s/it, loss=628, v_num=1]
`Trainer.fit` stopped: `max_epochs=20` reached.
Epoch 20/20: 100%|██████████| 20/20 [01:10<00:00,  3.55s/it, loss=628, v_num=1]

Now we can retrieve the latent space

[15]:
adata.obsm["X_scANVI"] = lvae.get_latent_representation(adata)

Again, we may visualize the latent space as well as the inferred labels

[16]:
adata.obsm["X_mde_scanvi"] = mde(adata.obsm["X_scANVI"])
[17]:
sc.pl.embedding(
    adata, basis="X_mde_scanvi", color=["cell_type"], ncols=1, frameon=False
)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../../_images/tutorials_notebooks_harmonization_36_1.png

Compute integration metrics#

Here we use the scib-metrics package, which contains scalable implementations of the metrics used in the scIB benchmarking suite. We can use these metrics to assess the quality of the integration.

We can see that the additional training with label information and scANVI improved the metrics that capture bio conservation (cLISI, Silhouette labels) without sacrificing too much batch correction power (iLISI, Silhouette batch)

[18]:
bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_pca", "X_scVI", "X_scANVI"],
    n_jobs=-1,
)
bm.benchmark()
Computing neighbors: 100%|██████████| 3/3 [00:57<00:00, 19.12s/it]
Embeddings:  67%|██████▋   | 2/3 [01:04<00:30, 30.92s/it]
INFO     Diffusion distance failed. Skip.
Embeddings: 100%|██████████| 3/3 [01:23<00:00, 27.89s/it]
[19]:
bm.plot_results_table(min_max_scale=False)
../../_images/tutorials_notebooks_harmonization_39_0.png
[19]:
<plottable.table.Table at 0x7f2225164be0>
[20]:
df = bm.get_results(min_max_scale=False)
print(df)
              Isolated labels        Leiden NMI        Leiden ARI  \
Embedding
X_pca                 0.50606          0.724169          0.474548
X_scVI               0.540021          0.750638           0.58713
X_scANVI             0.536124          0.771481          0.639793
Metric Type  Bio conservation  Bio conservation  Bio conservation

             Silhouette label             cLISI  Silhouette batch  \
Embedding
X_pca                0.557418               1.0          0.878176
X_scVI               0.535189          0.994187          0.900951
X_scANVI             0.588762          0.998546           0.88585
Metric Type  Bio conservation  Bio conservation  Batch correction

                        iLISI              KBET Graph connectivity  \
Embedding
X_pca                0.003306          0.223596           0.728504
X_scVI               0.111625          0.339272           0.899708
X_scANVI             0.105794          0.349628           0.922797
Metric Type  Batch correction  Batch correction   Batch correction

               PCR comparison Batch correction Bio conservation  \
Embedding
X_pca                     0.0         0.366716         0.652439
X_scVI               0.891112         0.628534         0.681433
X_scANVI             0.809051         0.614624         0.706941
Metric Type  Batch correction  Aggregate score  Aggregate score

                       Total
Embedding
X_pca                0.53815
X_scVI              0.660273
X_scANVI            0.670014
Metric Type  Aggregate score