Seed labeling with scANVI#

In this tutorial, we go through the steps of training scANVI for seed annotation. This is useful for when we have ground truth labels for a few cells and want to annotate unlabelled cells. For more information, please refer to the original scANVI publication.

Plan for this tutorial:

  1. Loading the data

  2. Creating the seed labels: groundtruth for a small fraction of cells

  3. Training the scANVI model: transferring annotation to the whole dataset

  4. Visualizing the latent space and predicted labels

Uncomment the following lines in Google Colab in order to install scvi-tools:

# !pip install --quiet scvi-colab
# from scvi_colab import install

# install()
import tempfile

import numpy as np
import scanpy as sc
import scvi
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.0

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(4, 4), frameon=False)
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor" : "w"}
%config InlineBackend.figure_format="retina"

Data Loading#

For the purposes of this notebook, we will be labeling 4 cell types in a dataset of purified peripheral blood mononuclear cells from 10x Genomics:

  • CD4 Regulatory T cells

  • CD4 Naive T cells

  • CD4 Memory T cells

  • CD8 Naive T cells

adata =,
    subset_datasets=["regulatory_t", "naive_t", "memory_t", "naive_cytotoxic"],
INFO     Downloading file at /tmp/tmpwjgbn4tz/PurifiedPBMCDataset.h5ad                                             
Downloading...: 157054it [00:01, 99568.53it/s]                               

From now on, we assume that cell type information for each cell is unavailable to us, and we seek to retrieve it.

Automatic annotation using seed labels#

In this section we hand curate and select cells which will serve as our ground truth labels.

We start by putting together a list of curated marker genes, from which we aim at identifying our ground truth cell types. These are extracted from the scANVI publication.

gene_subset = [

Then, we assign a score to every cell as a function of its cell type signature. In order to compute these scores, we need to normalize the data. Because this is not the case of scVI or scANVI, we proceed with a copy of the dataset for this step.

normalized = adata.copy()
sc.pp.normalize_total(normalized, target_sum=1e4)

normalized = normalized[:, gene_subset].copy()

Now, here we have two helper functions that will help in scoring the cells, as well as taking the most confident cells with respect to these scores.

def get_score(normalized_adata, gene_set):
    """Returns the score per cell given a dictionary of + and - genes

      anndata dataset that has been log normalized and scaled to mean 0, std 1
      a dictionary with two keys: 'positive' and 'negative'
      each key should contain a list of genes
      for each gene in gene_set['positive'], its expression will be added to the score
      for each gene in gene_set['negative'], its expression will be subtracted from its score

    array of length of n_cells containing the score per cell
    score = np.zeros(normalized_adata.n_obs)
    for gene in gene_set["positive"]:
        expression = np.array(normalized_adata[:, gene].X)
        score += expression.flatten()
    for gene in gene_set["negative"]:
        expression = np.array(normalized_adata[:, gene].X)
        score -= expression.flatten()
    return score

def get_cell_mask(normalized_adata, gene_set):
    """Calculates the score per cell for a list of genes, then returns a mask for
    the cells with the highest 50 scores.

      anndata dataset that has been log normalized and scaled to mean 0, std 1
      a dictionary with two keys: 'positive' and 'negative'
      each key should contain a list of genes
      for each gene in gene_set['positive'], its expression will be added to the score
      for each gene in gene_set['negative'], its expression will be subtracted from its score

    Mask for the cells with the top 50 scores over the entire dataset
    score = get_score(normalized_adata, gene_set)
    cell_idx = score.argsort()[-50:]
    mask = np.zeros(normalized_adata.n_obs)
    mask[cell_idx] = 1
    return mask.astype(bool)

We run those function to identify highly confident cells, that we aim at using as seed labels

# hand curated list of genes for identifying ground truth

cd4_reg_geneset = {
    "positive": ["TNFRSF18", "CTLA4", "FOXP3", "IL2RA"],
    "negative": ["S100A4", "PTPRC", "CD8B"],

cd8_naive_geneset = {"positive": ["CD8B", "CCR7"], "negative": ["CD4"]}

cd4_naive_geneset = {
    "positive": ["CCR7", "CD4"],
    "negative": ["S100A4", "PTPRC", "FOXP3", "IL2RA", "CD69"],

cd4_mem_geneset = {
    "positive": ["S100A4"],
    "negative": ["IL2RA", "FOXP3", "TNFRSF18", "CCR7"],
cd4_reg_mask = get_cell_mask(
cd8_naive_mask = get_cell_mask(
cd4_naive_mask = get_cell_mask(
cd4_mem_mask = get_cell_mask(
seed_labels = np.array(cd4_mem_mask.shape[0] * ["Unknown"])
seed_labels[cd8_naive_mask] = "CD8 Naive T cell"
seed_labels[cd4_naive_mask] = "CD4 Naive T cell"
seed_labels[cd4_mem_mask] = "CD4 Memory T cell"
seed_labels[cd4_reg_mask] = "CD4 Regulatory T cell"

adata.obs["seed_labels"] = seed_labels

We can observe what seed label information we have now

Unknown    42719
CD4 Reg       50
CD4 Nai       50
CD4 Mem       50
CD8 Nai       50
Name: count, dtype: int64

As expected, we use 50 cells for each cell type!

Transfer of annotation with scANVI#

As in the harmonization notebook, we need to register the AnnData object for use in scANVI. Namely, we can ignore the batch parameter because those cells don’t have much batch effect to begin with. However, we will give the seed labels for scANVI to use.

scvi.model.SCVI.setup_anndata(adata, batch_key=None, labels_key="seed_labels")
scvi_model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
Epoch 100/100: 100%|██████████| 100/100 [03:43<00:00,  2.21s/it, v_num=1, train_loss_step=1.75e+3, train_loss_epoch=1.76e+3]Epoch 100/100: 100%|██████████| 100/100 [03:43<00:00,  2.24s/it, v_num=1, train_loss_step=1.75e+3, train_loss_epoch=1.76e+3]

Now we can train scANVI and transfer the labels!

scanvi_model = scvi.model.SCANVI.from_scvi_model(scvi_model, "Unknown")
INFO     Training for 25 epochs.                                                                                   
Epoch 25/25: 100%|██████████| 25/25 [02:02<00:00,  5.00s/it, v_num=1, train_loss_step=1.77e+3, train_loss_epoch=1.74e+3]Epoch 25/25: 100%|██████████| 25/25 [02:02<00:00,  4.91s/it, v_num=1, train_loss_step=1.77e+3, train_loss_epoch=1.74e+3]

Now we can predict the missing cell types, and get the latent space


adata.obsm[SCANVI_LATENT_KEY] = scanvi_model.get_latent_representation(adata)
adata.obs[SCANVI_PREDICTIONS_KEY] = scanvi_model.predict(adata)

Again, we may visualize the latent space as well as the inferred labels

sc.pp.neighbors(adata, use_rep=SCANVI_LATENT_KEY), color=["labels", SCANVI_PREDICTIONS_KEY])

From this, we can see that it is relatively easy for scANVI to separate the CD4 T cells from the CD8 T cells (in latent space, as well as for the classifier). The regulatory CD4 T cells are sometimes missclassified into CD4 Naive, but this stays a minor phenomenon. Also, we expect that better results may be obtained by careful hyperparameter selection for the classifier. Learn about all of this in our documentation.

Clean up#

Uncomment the following line to remove all data files created in this tutorial: