This page was generated from scVI_DE_worm.ipynb. Interactive online version: Colab badge.

Differential expression on Packer C. elegans data

This notebook was contributed by Eduardo Beltrame [@Munfred]( and edited by Romain Lopez with help from Adam Gayoso.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

The anndata object we provide has 89701 cells and 20222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions

  2. Training the model with batch labels for integration with scVI

  3. Retrieving the scVI latent space and imputed values

  4. Visualize the latent space with an interactive t-SNE plot using Plotly

  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

# If running in Colab, navigate to Runtime -> Change runtime type
# and ensure you're using a Python 3 runtime with GPU hardware accelerator
# Installation of scVI in Colab can take several minutes

import sys
IN_COLAB = "google.colab" in sys.modules

show_plot = True
save_path = "./"

    !pip install --quiet scvi-tools
import scvi
    !pip install openTSNE --quiet
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Control warnings
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import anndata
import as px
import plotly.graph_objects as go
from umap import UMAP

## Download h5ad file with C. elegans data

if os.path.isfile('packer2019.h5ad'):
    print ("Found the data file! No need to download.")
    print ("Downloading data...")
    ! wget

Found the data file! No need to download.
adata ='packer2019.h5ad')

AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'

Take a look at the gene descriptions

The gene descriptions were taken using the WormBase API.

display(adata.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'}))
gene_id gene_name gene_description
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.
cell numi time_point batch size_factor cell_type cell_subtype plot_cell_type raw_embryo_time embryo_time embryo_time_bin raw_embryo_time_bin lineage passed_qc
AAACCTGAGACAATAC-300.1.1 AAACCTGAGACAATAC-300.1.1 1630 300_minutes Waterston_300_minutes 1.023195 Body_wall_muscle BWM_head_row_1 BWM_head_row_1 360 380.0 330-390 330-390 MSxpappp True
AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 2319 300_minutes Waterston_300_minutes 1.458210 nan nan nan 260 220.0 210-270 210-270 MSxapaap True
AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTGCGTGA-300.1.1 3719 300_minutes Waterston_300_minutes 2.338283 nan nan nan 270 230.0 210-270 270-330 nan True
AAACCTGAGTTGAGTA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 4251 300_minutes Waterston_300_minutes 2.659051 Body_wall_muscle BWM_anterior BWM_anterior 260 280.0 270-330 210-270 Dxap True
AAACCTGCAAGACGTG-300.1.1 AAACCTGCAAGACGTG-300.1.1 1003 300_minutes Waterston_300_minutes 0.629610 Ciliated_amphid_neuron AFD AFD 350 350.0 330-390 330-390 ABalpppapav/ABpraaaapav True

Selecting genes and loading data

We use the utility to select genes according to their dropout rate.

This method was described by Andrews & Hemberg in the article M3Drop: dropout-based feature selection for scRNASeq:

This method modifies the adata to add the following fields:

highly_variable                   # boolean true for chosen genes
observed_fraction_zeros        # fraction of observed zeros per gene
expected_fraction_zeros        # expected fraction of observed zeros per gene
prob_zero_enriched_nbatches    # If batch_key is given, this denotes in how many batches genes are detected as zero enriched
prob_zero_enrichment              # Probability of zero enrichment, median across batches in the case of multiple batches
prob_zero_enrichment_rank         # Rank of the gene according to probability of zero enrichment
                                 #batch_key='batch', # estimate a dispersion parameter batchwise
                                 n_top_genes = 4000, # how many genes to select

Sampling from binomial...: 100%|██████████| 30000/30000 [00:02<00:00, 11776.61it/s]
gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16;... False 0.075685 0.002028 0 0.073633 15533.0
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies.... True 0.659680 0.623356 1 0.248400 19553.0
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondriall... True 0.201993 0.046608 1 0.195833 19011.0
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transport... True 0.138081 0.001848 1 0.136900 17834.0
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; ... True 0.468434 0.383116 1 0.286100 19807.0

We can visualize the overdispersion by plotting the mean of zero entries versus the observed fraction of zeroes

fig, ax = plt.subplots(1, 1, figsize=(10,6))

subtitle = f'Overdispersion of genes in {len(adata.obs)} droplets'
df = adata.var
df['mean'] = np.array(adata.X.mean(0))[0]
ax.grid(b=True, which='major', color='gray', linestyle='-')
ax.grid(b=True, which='minor', color='gainsboro', linestyle='-')
ax.scatter(x, y, c=color, cmap='viridis')
selx = df.query('highly_variable')['mean']
sely = df.query('highly_variable')['observed_fraction_zeros']
ax.scatter(selx, sely, c='k', s = 2)
leg = ax.scatter(x, yred, c='red', s = 4, label='Expected fraction of zeros')


ax.set_xlim(0.011, 110)
ax.set_ylim(-0.05, 1.05)

plt.subplots_adjust(wspace=0, hspace=0)
plt.xlabel('Mean UMIs')
plt.ylabel(' Observed fraction of zeros')

plt.legend([leg],['Expected \n fraction \n of zeros'])
im = plt.gca().get_children()[0]
cax = fig.add_axes([0.91,0.15,0.03,0.7])

cbar = fig.colorbar(im, cax)
cbar.set_label('Probability of zero enrichment',rotation=270, fontsize = 12)
### now we restrict our adata to only the selected genes
adata = adata[:,adata.var['highly_variable']]
adata.layers["counts"] = adata.X.copy() # preserve counts, layer="counts", batch_key='batch')
INFO      Using batches from adata.obs["batch"]
INFO      No label_key inputted, assuming all cells have same label
INFO      Using data from adata.layers["counts"]
INFO      Computing library size prior per batch
INFO      Successfully registered anndata object containing 89701 cells, 4000 genes, 7
          batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates
          and 0 extra continuous covariates.
INFO      Please do not further modify adata until model is trained.

Define and train the model

model = scvi.model.SCVI(adata,

            n_epochs = 15,

train_test_results = pd.DataFrame(model.trainer.history).rename(columns={'elbo_train_set':'Train', 'elbo_test_set':'Test'})

ax = train_test_results.plot()
INFO      Training for 15 epochs
INFO      Training without KL warmup
INFO      Training time:  134 s. / 15 epochs
           Train          Test
0   15642.314737  16115.868420
1    2355.471100   2385.473573
2    2204.314944   2226.968748
3    2166.036837   2187.951084
4    2140.251632   2159.819345
5    2124.115926   2143.911156
6    2123.072407   2141.730817
7    2100.825752   2118.840176
8    2094.635836   2112.178665
9    2089.066252   2106.669971
10   2087.258148   2104.877996
11   2082.402573   2099.967509
12   2081.133728   2098.843202
13   2081.475392   2099.820359
14   2086.151489   2104.785134
15   2076.096122   2094.152306

Get the latent space and compute t-SNE

latent = model.get_latent_representation() # get latent

tsne = TSNE(negative_gradient_method='fft',

latent_tsne =

Iteration  100, KL divergence  7.8707, 100 iterations in 8.6912 sec
Iteration  200, KL divergence  6.6073, 100 iterations in 8.9548 sec
Iteration  100, KL divergence  4.6690, 100 iterations in 9.0919 sec
Iteration  200, KL divergence  4.1866, 100 iterations in 9.0414 sec
Iteration  300, KL divergence  3.8909, 100 iterations in 9.0502 sec
Iteration  400, KL divergence  3.6792, 100 iterations in 9.6585 sec
Iteration  500, KL divergence  3.5168, 100 iterations in 10.9217 sec
Iteration  600, KL divergence  3.3861, 100 iterations in 11.3986 sec
Iteration  700, KL divergence  3.2775, 100 iterations in 12.9932 sec
Iteration  800, KL divergence  3.1854, 100 iterations in 13.9640 sec
Iteration  900, KL divergence  3.1061, 100 iterations in 14.7271 sec
Iteration  1000, KL divergence  3.0367, 100 iterations in 16.8525 sec

Using Plotly’s Scattergl we can easily and speedily make interactive plots with 89k cells!

# here's a hack to randomize categorical colors, since plotly can't do that in a straightforward manner
# we take the list of named css colors that it recognizes, and we picked a color based on the code of
# the cluster we are coloring

# we just repeat the list of colors a bunch of times to ensure we always have more colors than clusters
css_colors = css_colors*100

# now define a function to plot any embedding

def plot_embedding(embedding_kind,              # the embedding must be a label in the post_adata.obsm
                   adata=adata,                 # the original adata for taking the cluster labels
                   cluster_feature ='cell_type',
                   xlabel="Dimension 1",
                   ylabel="Dimension 2",
                   plot_title="Embedding on single cell data"):

    # `cluster_feature` should be the name of one of the categorical annotation columns
    # e.g. `cell_type`, `cell_subtype`, `time_point`

    cluster_ids = adata.obs[cluster_feature]
    id_to_cluster_map = dict( zip( adata.obs[cluster_feature], adata.obs[cluster_feature] ) )
    cluster_to_id_map  = dict([[v,k] for k,v in id_to_cluster_map.items()])

    fig = go.Figure()
    for _cluster_id in adata.obs[cluster_feature]

                  x = adata.obsm[embedding_kind][:,0][adata.obs[cluster_feature]]
                , y = adata.obsm[embedding_kind][:,1][adata.obs[cluster_feature]]
                , mode='markers'
                , marker=dict(
                    # we randomize colors by starting at a random place in the list of named css colors
                    , size = 3
                , showlegend=True
                , name=id_to_cluster_map[_cluster_id]
                , hoverinfo=['name']

        "title": {"text": plot_title
                  , 'x':0.5
        , 'xaxis': {'title': {"text": xlabel}}
        , 'yaxis': {'title': {"text": ylabel}}
        , "height": 800
        , "width":1000
    return fig
fig = plot_embedding(embedding_kind='tsne',
               cluster_feature ='cell_type',
               xlabel="t-SNE 1",
               ylabel="t-SNE 2",
               plot_title="Interactive t-SNE on scVI latent space for Packer C. elegans single cell data")
# uncomment this line to save an interactive html plot, in case it is not rendering