Note

This page was generated from cellassign_tutorial.ipynb. Interactive online version: Colab badge.

Annotation with CellAssign

Assigning single-cell RNA-seq data to known cell types

CellAssign is a probabilistic model that uses prior knowledge of cell-type marker genes to annotate scRNA data into predefined cell types. Unlike other methods for assigning cell types, CellAssign does not require labeled single cell data and only needs to know whether or not each given gene is a marker of each cell type. The original paper and R code are linked below.

Paper: Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling, *Nature Methods 2019*

Code: https://github.com/Irrationone/cellassign

This notebook will demonstrate how to use CellAssign on follicular lymphoma and HGSC scRNA data.

[1]:
import sys

#if branch is stable, will install via pypi, else will install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]
     |████████████████████████████████| 184kB 18.5MB/s
     |████████████████████████████████| 122kB 48.2MB/s
     |████████████████████████████████| 829kB 58.6MB/s
     |████████████████████████████████| 634kB 51.3MB/s
     |████████████████████████████████| 81kB 11.7MB/s
     |████████████████████████████████| 204kB 58.3MB/s
     |████████████████████████████████| 245kB 51.3MB/s
     |████████████████████████████████| 8.7MB 29.6MB/s
     |████████████████████████████████| 2.4MB 41.4MB/s
     |████████████████████████████████| 51kB 7.9MB/s
     |████████████████████████████████| 10.3MB 25.8MB/s
     |████████████████████████████████| 3.1MB 55.3MB/s
     |████████████████████████████████| 276kB 61.3MB/s
     |████████████████████████████████| 112kB 56.2MB/s
     |████████████████████████████████| 829kB 53.2MB/s
     |████████████████████████████████| 51kB 9.1MB/s
     |████████████████████████████████| 112kB 63.8MB/s
     |████████████████████████████████| 1.3MB 45.3MB/s
     |████████████████████████████████| 51kB 8.4MB/s
     |████████████████████████████████| 71kB 11.6MB/s
     |████████████████████████████████| 143kB 61.8MB/s
     |████████████████████████████████| 296kB 59.3MB/s
  Building wheel for loompy (setup.py) ... done
  Building wheel for PyYAML (setup.py) ... done
  Building wheel for future (setup.py) ... done
  Building wheel for numpy-groupies (setup.py) ... done
  Building wheel for sinfo (setup.py) ... done

To demonstrate CellAssign, we use the data from the original publication, which we converted into h5ad format. The data are originally available from here:

https://zenodo.org/record/3372746

[2]:
import gdown
url = 'https://drive.google.com/uc?id=10l6m2KKKioCZnQlRHomheappHh-jTFmx'
output = 'sce_follicular_annotated_final.h5ad'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1Pae7VEcoZbKRvtllGAEWG4SOLWSjjtCO'
output = 'sce_hgsc_annotated_final.h5ad'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1Mk5uPdnPC4IMRnuG5N4uFvypT8hPdJ74'
output = 'HGSC_celltype.csv'
gdown.download(url, output, quiet=False)

url = 'https://drive.google.com/uc?id=1tJSOI9ve0i78WmszMLx2ul8F8tGycBTd'
output = 'FL_celltype.csv'
gdown.download(url, output, quiet=False)
Downloading...
From: https://drive.google.com/uc?id=10l6m2KKKioCZnQlRHomheappHh-jTFmx
To: /content/sce_follicular_annotated_final.h5ad
83.0MB [00:01, 63.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Pae7VEcoZbKRvtllGAEWG4SOLWSjjtCO
To: /content/sce_hgsc_annotated_final.h5ad
110MB [00:01, 92.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Mk5uPdnPC4IMRnuG5N4uFvypT8hPdJ74
To: /content/HGSC_celltype.csv
100%|██████████| 1.16k/1.16k [00:00<00:00, 2.21MB/s]
Downloading...
From: https://drive.google.com/uc?id=1tJSOI9ve0i78WmszMLx2ul8F8tGycBTd
To: /content/FL_celltype.csv
100%|██████████| 446/446 [00:00<00:00, 1.26MB/s]
[2]:
'FL_celltype.csv'
[36]:
import scvi
import scanpy as sc
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

Follicular Lymphoma Data

Load follicular lymphoma data and marker gene matrix (see Supplementary Table 2 from the original paper).

[4]:
adata = sc.read("sce_follicular_annotated_final.h5ad")
adata.var_names_make_unique()
adata.obs_names_make_unique()
/usr/local/lib/python3.7/dist-packages/anndata/_core/anndata.py:119: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[5]:
marker_gene_mat = pd.read_csv('FL_celltype.csv', index_col=0)

Create and fit CellAssign model

The anndata object and cell type marker matrix should contain the same genes, so we index into adata to include only the genes from marker_gene_mat.

[6]:
bdata = adata[:, marker_gene_mat.index].copy()
[7]:
scvi.data.setup_anndata(bdata)
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
WARNING  This dataset has some empty cells, this might fail inference.Data should be filtered
         with `scanpy.pp.filter_cells()`
INFO     Successfully registered anndata object containing 9156 cells, 24 vars, 1 batches, 1
         labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.

Then we setup anndata and initialize a CellAssign model. Here we set the size_factor_key to “size_factor”, which is a column in bdata.obs. A size factor may be defined manually as UMI count and should not be placed on the log scale, as the model will do this manually.

For example,

adata.obs["size_factor"] = adata.X.sum(1)
[8]:
from scvi.external import CellAssign
model = CellAssign(bdata, marker_gene_mat, size_factor_key="size_factor")
model.train()
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Epoch 400/400: 100%|██████████| 400/400 [00:36<00:00, 11.11it/s, loss=19.9, v_num=1]

Inspecting the convergence:

[9]:
model.history["elbo_validation"].plot()
[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe72904e9d0>
../../_images/user_guide_notebooks_cellassign_tutorial_18_1.png

Predict and plot assigned cell types

Predict the soft cell type assignment probability for each cell.

[13]:
predictions = model.predict()
predictions.head()
[13]:
B cells Cytotoxic T cells CD4 T cells Tfh other
0 1.000000e+00 7.236593e-20 6.324431e-16 1.754270e-17 3.476350e-15
1 1.000000e+00 1.092724e-21 7.497330e-18 1.106397e-19 5.981450e-17
2 1.000000e+00 9.595119e-27 9.861930e-23 1.133820e-24 1.078264e-21
3 1.000000e+00 1.139548e-44 8.985884e-38 2.384867e-40 2.626802e-34
4 3.880475e-17 4.667202e-13 9.994294e-01 5.706293e-04 2.487126e-18

We can visualize the probabilities of assignment with a heatmap that returns the probability matrix for each cell and cell type.

[12]:
sns.clustermap(predictions, cmap="viridis")
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
[12]:
<seaborn.matrix.ClusterGrid at 0x7fe728fe5d10>
../../_images/user_guide_notebooks_cellassign_tutorial_23_2.png

We then create a UMAP plot labeled by maximum probability assignments from the CellAssign model. The left plot contains the true cell types and the right plot contains our model’s predictions.

[14]:
bdata.obs["scvi-tools predictions"] = predictions.idxmax(axis=1).values
[39]:
# celltype is the original CellAssign prediction
sc.pl.umap(
    bdata,
    color=["celltype", "scvi-tools predictions"],
    frameon=False,
    ncols=1
)
../../_images/user_guide_notebooks_cellassign_tutorial_26_0.png

Model reproducibility

We see that the scvi-tools implementation highly reproduces the original implementation’s predictions.

[37]:
df = bdata.obs
confusion_matrix = pd.crosstab(
    df["scvi-tools predictions"],
    df["celltype"],
    rownames=["scvi-tools predictions"],
    colnames=["Original predictions"],
)
confusion_matrix /= confusion_matrix.sum(1).ravel().reshape(-1, 1)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    confusion_matrix,
    cmap=sns.diverging_palette(245, 320, s=60, as_cmap=True),
    ax=ax,
    square=True,
    cbar_kws=dict(shrink=0.4, aspect=12),
)
[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe6d3f51390>
../../_images/user_guide_notebooks_cellassign_tutorial_29_1.png

HGSC Data

We can repeat the same process for HGSC data.

[21]:
hgsc_adata = scvi.data.read_h5ad("sce_hgsc_annotated_final.h5ad")
hgsc_adata.var_names_make_unique()
hgsc_adata.obs_names_make_unique()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[22]:
marker_gene_mat_hgsc = pd.read_csv('HGSC_celltype.csv', index_col=0)

Create and fit CellAssign model

[23]:
hgsc_bdata = hgsc_adata[:, marker_gene_mat_hgsc.index].copy()
[24]:
scvi.data.setup_anndata(hgsc_bdata)
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 4848 cells, 41 vars, 1 batches, 1
         labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.
[25]:
from scvi.external import CellAssign
model_hgsc = CellAssign(hgsc_bdata, marker_gene_mat_hgsc, "size_factor")
model_hgsc.train()
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Epoch 400/400: 100%|██████████| 400/400 [00:30<00:00, 13.09it/s, loss=40.9, v_num=1]
[26]:
model.history["elbo_validation"].plot()
[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe7207fc7d0>
../../_images/user_guide_notebooks_cellassign_tutorial_38_1.png

Predict and plot assigned cell types

[27]:
predictions_hgsc = model_hgsc.predict()
[28]:
sns.clustermap(predictions_hgsc, cmap="viridis")
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
[28]:
<seaborn.matrix.ClusterGrid at 0x7fe720698d50>
../../_images/user_guide_notebooks_cellassign_tutorial_41_2.png
[29]:
hgsc_bdata.obs["scvi-tools predictions"] = predictions_hgsc.idxmax(axis=1).values
[30]:
sc.pl.umap(
    hgsc_bdata,
    color=["celltype", "scvi-tools predictions"],
    ncols=1,
    frameon=False,
)
... storing 'scvi-tools predictions' as categorical
../../_images/user_guide_notebooks_cellassign_tutorial_43_1.png

Model reproducibility

[38]:
df = hgsc_bdata.obs
confusion_matrix = pd.crosstab(
    df["scvi-tools predictions"],
    df["celltype"],
    rownames=["scvi-tools predictions"],
    colnames=["Original predictions"],
)
confusion_matrix /= confusion_matrix.sum(1).ravel().reshape(-1, 1)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    confusion_matrix,
    cmap=sns.diverging_palette(245, 320, s=60, as_cmap=True),
    ax=ax,
    square=True,
    cbar_kws=dict(shrink=0.4, aspect=12),
)
[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe6d3e16b10>
../../_images/user_guide_notebooks_cellassign_tutorial_45_1.png