Note

This page was generated from cellassign_tutorial.ipynb. Interactive online version: Colab badge. Some tutorial content may look better in light mode.

Annotation with CellAssign#

Assigning single-cell RNA-seq data to known cell types#

CellAssign is a probabilistic model that uses prior knowledge of cell-type marker genes to annotate scRNA data into predefined cell types. Unlike other methods for assigning cell types, CellAssign does not require labeled single cell data and only needs to know whether or not each given gene is a marker of each cell type. The original paper and R code are linked below.

Paper: Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling, Nature Methods 2019

Code: https://github.com/Irrationone/cellassign

This notebook will demonstrate how to use CellAssign on follicular lymphoma and HGSC scRNA data.

[ ]:
!pip install --quiet scvi-colab
from scvi_colab import install

install()
[1]:
import gdown
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import scvi
import seaborn as sns
from scvi.external import CellAssign

To demonstrate CellAssign, we use the data from the original publication, which we converted into h5ad format. The data are originally available from here:

https://zenodo.org/record/3372746

[9]:
url = "https://drive.google.com/uc?id=10l6m2KKKioCZnQlRHomheappHh-jTFmx"
output = "sce_follicular_annotated_final.h5ad"
gdown.download(url, output, quiet=False)

url = "https://drive.google.com/uc?id=1Pae7VEcoZbKRvtllGAEWG4SOLWSjjtCO"
output = "sce_hgsc_annotated_final.h5ad"
gdown.download(url, output, quiet=False)

url = "https://drive.google.com/uc?id=1Mk5uPdnPC4IMRnuG5N4uFvypT8hPdJ74"
output = "HGSC_celltype.csv"
gdown.download(url, output, quiet=False)

url = "https://drive.google.com/uc?id=1tJSOI9ve0i78WmszMLx2ul8F8tGycBTd"
output = "FL_celltype.csv"
gdown.download(url, output, quiet=False)
Downloading...
From: https://drive.google.com/uc?id=10l6m2KKKioCZnQlRHomheappHh-jTFmx
To: /home/adam/Documents/software/scvi-tutorials/sce_follicular_annotated_final.h5ad
100%|██████████| 83.0M/83.0M [00:00<00:00, 96.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Pae7VEcoZbKRvtllGAEWG4SOLWSjjtCO
To: /home/adam/Documents/software/scvi-tutorials/sce_hgsc_annotated_final.h5ad
100%|██████████| 110M/110M [00:01<00:00, 95.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Mk5uPdnPC4IMRnuG5N4uFvypT8hPdJ74
To: /home/adam/Documents/software/scvi-tutorials/HGSC_celltype.csv
100%|██████████| 1.16k/1.16k [00:00<00:00, 4.07MB/s]
Downloading...
From: https://drive.google.com/uc?id=1tJSOI9ve0i78WmszMLx2ul8F8tGycBTd
To: /home/adam/Documents/software/scvi-tutorials/FL_celltype.csv
100%|██████████| 446/446 [00:00<00:00, 1.60MB/s]
[9]:
'FL_celltype.csv'
[10]:
sc.set_figure_params(figsize=(4, 4))

%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

Follicular Lymphoma Data#

Load follicular lymphoma data and marker gene matrix (see Supplementary Table 2 from the original paper).

[11]:
adata = sc.read("sce_follicular_annotated_final.h5ad")
adata.var_names_make_unique()
adata.obs_names_make_unique()
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[12]:
marker_gene_mat = pd.read_csv("FL_celltype.csv", index_col=0)

Create and fit CellAssign model#

The anndata object and cell type marker matrix should contain the same genes, so we index into adata to include only the genes from marker_gene_mat.

[13]:
bdata = adata[:, marker_gene_mat.index].copy()

Then we setup anndata and initialize a CellAssign model. Here we set the size_factor_key to “size_factor”, which is a column in bdata.obs. A size factor may be defined manually as scaled library size (total UMI count) and should not be placed on the log scale, as the model will do this manually. The library size should be computed before any gene subsetting (in this case, technically, a few notebook cells up).

For example,

lib_size = adata.X.sum(1)
adata.obs["size_factor"] = lib_size / np.mean(lib_size)
[14]:
scvi.external.CellAssign.setup_anndata(bdata, size_factor_key="size_factor")
[15]:
model = CellAssign(bdata, marker_gene_mat)
model.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:432: UserWarning: The number of training samples (9) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 400/400: 100%|██████████| 400/400 [00:22<00:00, 17.59it/s, loss=19.9, v_num=1]

Inspecting the convergence:

[16]:
model.history["elbo_validation"].plot()
[16]:
<AxesSubplot:xlabel='epoch'>
../../_images/tutorials_notebooks_cellassign_tutorial_19_1.png

Predict and plot assigned cell types#

Predict the soft cell type assignment probability for each cell.

[17]:
predictions = model.predict()
predictions.head()
[17]:
B cells Cytotoxic T cells CD4 T cells Tfh other
0 1.000000e+00 1.739431e-19 7.751762e-16 1.869599e-17 4.378877e-15
1 1.000000e+00 2.178871e-21 9.382402e-18 1.186499e-19 7.339101e-17
2 1.000000e+00 1.889330e-26 1.200271e-22 1.176634e-24 1.285317e-21
3 1.000000e+00 3.226794e-44 1.242664e-37 2.663802e-40 3.492028e-34
4 3.411843e-17 7.348493e-13 9.995215e-01 4.784879e-04 2.219958e-18

We can visualize the probabilities of assignment with a heatmap that returns the probability matrix for each cell and cell type.

[18]:
sns.clustermap(predictions, cmap="viridis")
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/seaborn/matrix.py:654: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
[18]:
<seaborn.matrix.ClusterGrid at 0x7fbb9c6fd250>
../../_images/tutorials_notebooks_cellassign_tutorial_24_2.png

We then create a UMAP plot labeled by maximum probability assignments from the CellAssign model. The left plot contains the true cell types and the right plot contains our model’s predictions.

[19]:
bdata.obs["scvi-tools predictions"] = predictions.idxmax(axis=1).values
[20]:
# celltype is the original CellAssign prediction
sc.pl.umap(bdata, color=["celltype", "scvi-tools predictions"], frameon=False, ncols=1)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/anndata/_core/anndata.py:1228: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'scvi-tools predictions' as categorical
../../_images/tutorials_notebooks_cellassign_tutorial_27_1.png

Model reproducibility#

We see that the scvi-tools implementation highly reproduces the original implementation’s predictions.

[21]:
df = bdata.obs
confusion_matrix = pd.crosstab(
    df["scvi-tools predictions"],
    df["celltype"],
    rownames=["scvi-tools predictions"],
    colnames=["Original predictions"],
)
confusion_matrix /= confusion_matrix.sum(1).ravel().reshape(-1, 1)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    confusion_matrix,
    cmap=sns.diverging_palette(245, 320, s=60, as_cmap=True),
    ax=ax,
    square=True,
    cbar_kws=dict(shrink=0.4, aspect=12),
)
[21]:
<AxesSubplot:xlabel='Original predictions', ylabel='scvi-tools predictions'>
../../_images/tutorials_notebooks_cellassign_tutorial_30_1.png

HGSC Data#

We can repeat the same process for HGSC data.

[22]:
hgsc_adata = scvi.data.read_h5ad("sce_hgsc_annotated_final.h5ad")
hgsc_adata.var_names_make_unique()
hgsc_adata.obs_names_make_unique()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[23]:
marker_gene_mat_hgsc = pd.read_csv("HGSC_celltype.csv", index_col=0)

Create and fit CellAssign model#

[24]:
hgsc_bdata = hgsc_adata[:, marker_gene_mat_hgsc.index].copy()
[25]:
scvi.external.CellAssign.setup_anndata(hgsc_bdata, "size_factor")
[26]:
model_hgsc = CellAssign(hgsc_bdata, marker_gene_mat_hgsc)
model_hgsc.train()
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:432: UserWarning: The number of training samples (5) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 400/400: 100%|██████████| 400/400 [00:16<00:00, 23.85it/s, loss=41.5, v_num=1]
[27]:
model_hgsc.history["elbo_validation"].plot()
[27]:
<AxesSubplot:xlabel='epoch'>
../../_images/tutorials_notebooks_cellassign_tutorial_39_1.png

Predict and plot assigned cell types#

[28]:
predictions_hgsc = model_hgsc.predict()
[29]:
sns.clustermap(predictions_hgsc, cmap="viridis")
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/seaborn/matrix.py:654: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
[29]:
<seaborn.matrix.ClusterGrid at 0x7fbb8690a3d0>
../../_images/tutorials_notebooks_cellassign_tutorial_42_2.png
[30]:
hgsc_bdata.obs["scvi-tools predictions"] = predictions_hgsc.idxmax(axis=1).values
[31]:
sc.pl.umap(
    hgsc_bdata,
    color=["celltype", "scvi-tools predictions"],
    ncols=1,
    frameon=False,
)
/home/adam/miniconda3/envs/scvi-tools-dev/lib/python3.9/site-packages/anndata/_core/anndata.py:1228: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
... storing 'scvi-tools predictions' as categorical
../../_images/tutorials_notebooks_cellassign_tutorial_44_1.png

Model reproducibility#

[32]:
df = hgsc_bdata.obs
confusion_matrix = pd.crosstab(
    df["scvi-tools predictions"],
    df["celltype"],
    rownames=["scvi-tools predictions"],
    colnames=["Original predictions"],
)
confusion_matrix /= confusion_matrix.sum(1).ravel().reshape(-1, 1)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    confusion_matrix,
    cmap=sns.diverging_palette(245, 320, s=60, as_cmap=True),
    ax=ax,
    square=True,
    cbar_kws=dict(shrink=0.4, aspect=12),
)
[32]:
<AxesSubplot:xlabel='Original predictions', ylabel='scvi-tools predictions'>
../../_images/tutorials_notebooks_cellassign_tutorial_46_1.png