Note

This page was generated from scVI_DE_worm.ipynb. Interactive online version: Colab badge.

DE on Packer C. elegans data

This notebook was contributed by Eduardo Beltrame [@Munfred](https://github.com/Munfred) and edited by Romain Lopez with help from Adam Gayoso.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89701 cells and 20222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions

  2. Training the model with batch labels for integration with scVI

  3. Retrieving the scVI latent space and imputed values

  4. Visualize the latent space with an interactive t-SNE plot using Plotly

  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

[16]:
# If running in Colab, navigate to Runtime -> Change runtime type
# and ensure you're using a Python 3 runtime with GPU hardware accelerator
# Installation of scVI in Colab can take several minutes


import sys

#if branch is stable, will install via pypi, else will install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]
[17]:
import scvi
scvi.__version__
[17]:
'0.9.0'
[18]:
if IN_COLAB:
    !pip install openTSNE --quiet
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger


[19]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Control warnings
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import anndata
import plotly.express as px
import plotly.graph_objects as go
from umap import UMAP

[20]:
## Download h5ad file with C. elegans data

if os.path.isfile('packer2019.h5ad'):
    print ("Found the data file! No need to download.")
else:
    print ("Downloading data...")
    ! wget https://github.com/Munfred/wormcells-site/releases/download/packer2019/packer2019.h5ad

Found the data file! No need to download.
[21]:
adata = anndata.read('packer2019.h5ad')
print(adata)
adata.X

AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'
[21]:
<89701x20222 sparse matrix of type '<class 'numpy.float32'>'
        with 82802059 stored elements in Compressed Sparse Column format>

Take a look at the gene descriptions

The gene descriptions were taken using the WormBase API.

[22]:
display(adata.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'}))
gene_id gene_name gene_description
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.
[23]:
adata.obs.head()
[23]:
cell numi time_point batch size_factor cell_type cell_subtype plot_cell_type raw_embryo_time embryo_time embryo_time_bin raw_embryo_time_bin lineage passed_qc
index
AAACCTGAGACAATAC-300.1.1 AAACCTGAGACAATAC-300.1.1 1630 300_minutes Waterston_300_minutes 1.023195 Body_wall_muscle BWM_head_row_1 BWM_head_row_1 360 380.0 330-390 330-390 MSxpappp True
AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 2319 300_minutes Waterston_300_minutes 1.458210 nan nan nan 260 220.0 210-270 210-270 MSxapaap True
AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTGCGTGA-300.1.1 3719 300_minutes Waterston_300_minutes 2.338283 nan nan nan 270 230.0 210-270 270-330 nan True
AAACCTGAGTTGAGTA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 4251 300_minutes Waterston_300_minutes 2.659051 Body_wall_muscle BWM_anterior BWM_anterior 260 280.0 270-330 210-270 Dxap True
AAACCTGCAAGACGTG-300.1.1 AAACCTGCAAGACGTG-300.1.1 1003 300_minutes Waterston_300_minutes 0.629610 Ciliated_amphid_neuron AFD AFD 350 350.0 330-390 330-390 ABalpppapav/ABpraaaapav True

Selecting genes and loading data

We use the utility scvi.data.poisson_gene_selection to select genes according to their dropout rate.

This method was described by Andrews & Hemberg in the article M3Drop: dropout-based feature selection for scRNASeq: https://academic.oup.com/bioinformatics/article/35/16/2865/5258099

This method modifies the adata to add the following fields:

highly_variable                   # boolean true for chosen genes
observed_fraction_zeros        # fraction of observed zeros per gene
expected_fraction_zeros        # expected fraction of observed zeros per gene
prob_zero_enriched_nbatches    # If batch_key is given, this denotes in how many batches genes are detected as zero enriched
prob_zero_enrichment              # Probability of zero enrichment, median across batches in the case of multiple batches
prob_zero_enrichment_rank         # Rank of the gene according to probability of zero enrichment
[24]:
###  SELECT GENES AND PREPARE DATA
scvi.data.poisson_gene_selection(adata,
                                 #batch_key='batch', # estimate a dispersion parameter batchwise
                                 n_top_genes = 4000, # how many genes to select
                                 n_samples=30000
                                 )

Sampling from binomial...: 100%|██████████| 30000/30000 [00:01<00:00, 19870.16it/s]
[25]:
adata.var.head()
[25]:
gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16;... False 0.075685 0.002028 0 0.074667 15576.0
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies.... True 0.659680 0.623356 1 0.247567 19538.0
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondriall... True 0.201993 0.046608 1 0.192433 18974.0
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transport... True 0.138081 0.001848 1 0.140300 17918.0
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; ... True 0.468434 0.383116 1 0.292533 19835.0

We can visualize the overdispersion by plotting the mean of zero entries versus the observed fraction of zeroes

[26]:
fig, ax = plt.subplots(1, 1, figsize=(10,6))

subtitle = f'Overdispersion of genes in {len(adata.obs)} droplets'
df = adata.var
df['mean'] = np.array(adata.X.mean(0))[0]
x=df['mean'],
y=df['observed_fraction_zeros']
yred=df['expected_fraction_zeros']
color=df['prob_zero_enrichment']
ax.grid(b=True, which='major', color='gray', linestyle='-')
ax.grid(b=True, which='minor', color='gainsboro', linestyle='-')
ax.set_axisbelow(True)
ax.scatter(x, y, c=color, cmap='viridis')
selx = df.query('highly_variable')['mean']
sely = df.query('highly_variable')['observed_fraction_zeros']
ax.scatter(selx, sely, c='k', s = 2)
leg = ax.scatter(x, yred, c='red', s = 4, label='Expected fraction of zeros')

ax.set_xscale('log')

ax.set_title(subtitle)
ax.label_outer()
ax.set_xlim(0.011, 110)
ax.set_ylim(-0.05, 1.05)


plt.subplots_adjust(wspace=0, hspace=0)
plt.xlabel('Mean UMIs')
plt.ylabel(' Observed fraction of zeros')

plt.legend([leg],['Expected \n fraction \n of zeros'])
im = plt.gca().get_children()[0]
cax = fig.add_axes([0.91,0.15,0.03,0.7])

cbar = fig.colorbar(im, cax)
cbar.set_label('Probability of zero enrichment',rotation=270, fontsize = 12)
../../_images/user_guide_notebooks_scVI_DE_worm_15_0.png
[27]:
### now we restrict our adata to only the selected genes

adata = adata[:,adata.var['highly_variable']]
adata.layers["counts"] = adata.X.copy().tocsr() # converts to CSR format, preserve counts

scvi.data.setup_anndata(adata, layer="counts", batch_key='batch')
INFO     Using batches from adata.obs["batch"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.layers["counts"]
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 89701 cells, 4000 vars, 7 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.

Define and train the model

[28]:
### DEFINE MODEL
model = scvi.model.SCVI(adata,
                        n_hidden=128,
                        n_layers=2,
                        gene_likelihood='nb',
                        dispersion='gene-batch'
                        )

# MODEL TRAINING
model.train(check_val_every_n_epoch =1,
            use_gpu=True,
            max_epochs = 50,
            plan_kwargs={'lr':1e-3})


display(train_test_results)
train_test_results = model.history['elbo_train']
train_test_results['elbo_validation'] = model.history['elbo_validation']
train_test_results.plot(logy=True)
plt.show()
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Epoch 50/50: 100%|██████████| 50/50 [05:30<00:00,  6.62s/it, loss=2.07e+03, v_num=1]
elbo_train elbo_validation
epoch
0 2323.28 2247.47
1 2212.47 2205.4
2 2185.13 2181.8
3 2165.67 2162.24
4 2151.49 2152.37
5 2140.19 2142.64
6 2131.5 2133.76
7 2124.33 2126.52
8 2118.21 2122.23
9 2112.86 2115.72
10 2108.35 2113.34
11 2104.63 2108.39
12 2101.66 2106.08
13 2099.25 2104.11
14 2096.73 2101.63
15 2094.47 2098.34
16 2092.54 2098.04
17 2091.33 2096.76
18 2089.41 2096.12
19 2088.2 2093.3
20 2086.9 2092.96
21 2085.81 2091.27
22 2084.63 2091.86
23 2083.87 2090.15
24 2083.01 2090.01
../../_images/user_guide_notebooks_scVI_DE_worm_18_3.png

Get the latent space and compute t-SNE

[29]:
latent = model.get_latent_representation() # get latent

[30]:
### MAKE TSNE
tsne = TSNE(negative_gradient_method='fft',
            initialization='random',
            neighbors='approx',
            n_iter=1500,
            callbacks=ErrorLogger(),
            callbacks_every_iters=100,
            n_jobs=-1,
            learning_rate=300)

latent_tsne = tsne.fit(np.squeeze(np.asarray(latent)))
adata.obsm['tsne']=latent_tsne


Iteration  100, KL divergence  7.8501, 100 iterations in 9.5315 sec
Iteration  200, KL divergence  6.6654, 100 iterations in 9.4532 sec
Iteration  100, KL divergence  4.6842, 100 iterations in 9.6015 sec
Iteration  200, KL divergence  4.2028, 100 iterations in 9.5157 sec
Iteration  300, KL divergence  3.9059, 100 iterations in 9.5712 sec
Iteration  400, KL divergence  3.6938, 100 iterations in 9.9184 sec
Iteration  500, KL divergence  3.5305, 100 iterations in 10.7656 sec
Iteration  600, KL divergence  3.3997, 100 iterations in 11.5231 sec
Iteration  700, KL divergence  3.2917, 100 iterations in 12.7622 sec
Iteration  800, KL divergence  3.2001, 100 iterations in 13.6144 sec
Iteration  900, KL divergence  3.1212, 100 iterations in 14.4077 sec
Iteration  1000, KL divergence  3.0526, 100 iterations in 16.5562 sec
Iteration  1100, KL divergence  2.9918, 100 iterations in 17.2580 sec
Iteration  1200, KL divergence  2.9376, 100 iterations in 19.0630 sec
Iteration  1300, KL divergence  2.8891, 100 iterations in 20.8113 sec
Iteration  1400, KL divergence  2.8455, 100 iterations in 22.0511 sec
Iteration  1500, KL divergence  2.8059, 100 iterations in 23.1604 sec

Using Plotly’s Scattergl we can easily and speedily make interactive plots with 89k cells!

[31]:
# here's a hack to randomize categorical colors, since plotly can't do that in a straightforward manner
# we take the list of named css colors that it recognizes, and we picked a color based on the code of
# the cluster we are coloring
css_colors=[
'aliceblue','antiquewhite','aqua','aquamarine','azure','bisque','black','blanchedalmond','blue',
'blueviolet','brown','burlywood','cadetblue','chartreuse','chocolate','coral','cornflowerblue',
'crimson','cyan','darkblue','darkcyan','darkgoldenrod','darkgray','darkgrey','darkgreen','darkkhaki',
'darkmagenta','darkolivegreen','darkorange','darkorchid','darkred','darksalmon','darkseagreen',
'darkslateblue','darkslategray','darkslategrey','darkturquoise','darkviolet','deeppink','deepskyblue',
'dimgray','dimgrey','dodgerblue','firebrick','floralwhite','forestgreen','fuchsia','gainsboro','ghostwhite',
'gold','goldenrod','gray','grey','green','greenyellow','honeydew','hotpink','indianred','indigo',
'ivory','khaki','lavender','lavenderblush','lawngreen','lemonchiffon','lightblue','lightcoral','lightcyan',
'lightgoldenrodyellow','lightgray','lightgrey','lightgreen','lightpink','lightsalmon','lightseagreen',
'lightskyblue','lightslategray','lightslategrey','lightsteelblue','lightyellow','lime','limegreen','linen',
'magenta','maroon','mediumaquamarine','mediumblue','mediumorchid','mediumpurple','mediumseagreen',
'mediumslateblue','mediumspringgreen','mediumturquoise','mediumvioletred','midnightblue','mintcream',
'mistyrose','moccasin','navajowhite','navy','oldlace','olive','olivedrab','orange','orangered','orchid',
'palegoldenrod','palegreen','paleturquoise','palevioletred','papayawhip','peachpuff','peru','pink','plum'
,'powderblue','purple','red','rosybrown','royalblue','saddlebrown','salmon','sandybrown','seagreen',
'seashell','sienna','silver','skyblue','slateblue','slategray','slategrey','snow','springgreen','steelblue',
'tan','teal','thistle','tomato','turquoise','violet','wheat','white','whitesmoke','yellow','yellowgreen']

# we just repeat the list of colors a bunch of times to ensure we always have more colors than clusters
css_colors = css_colors*100

# now define a function to plot any embedding

def plot_embedding(embedding_kind,              # the embedding must be a label in the post_adata.obsm
                   adata=adata,                 # the original adata for taking the cluster labels
                   cluster_feature ='cell_type',
                   xlabel="Dimension 1",
                   ylabel="Dimension 2",
                   plot_title="Embedding on single cell data"):

    # `cluster_feature` should be the name of one of the categorical annotation columns
    # e.g. `cell_type`, `cell_subtype`, `time_point`

    cluster_ids = adata.obs[cluster_feature].cat.codes.unique()
    id_to_cluster_map = dict( zip( adata.obs[cluster_feature].cat.codes, adata.obs[cluster_feature] ) )
    cluster_to_id_map  = dict([[v,k] for k,v in id_to_cluster_map.items()])

    fig = go.Figure()
    for _cluster_id in adata.obs[cluster_feature].cat.codes.unique():

        fig.add_trace(
            go.Scattergl(
                  x = adata.obsm[embedding_kind][:,0][adata.obs[cluster_feature].cat.codes==_cluster_id]
                , y = adata.obsm[embedding_kind][:,1][adata.obs[cluster_feature].cat.codes==_cluster_id]
                , mode='markers'
                , marker=dict(
                    # we randomize colors by starting at a random place in the list of named css colors
                    color=css_colors[_cluster_id+np.random.randint(0,len(np.unique(css_colors)))]
                    , size = 3
                        )
                , showlegend=True
                , name=id_to_cluster_map[_cluster_id]
                , hoverinfo=['name']
                )
            )

    layout={
        "title": {"text": plot_title
                  , 'x':0.5
                 }
        , 'xaxis': {'title': {"text": xlabel}}
        , 'yaxis': {'title': {"text": ylabel}}
        , "height": 800
        , "width":1000
    }
    fig.update_layout(layout)
    fig.update_layout(showlegend=True)
    return fig
[32]:
fig = plot_embedding(embedding_kind='tsne',
               cluster_feature ='cell_type',
               xlabel="t-SNE 1",
               ylabel="t-SNE 2",
               plot_title="Interactive t-SNE on scVI latent space for Packer C. elegans single cell data")
[33]:
# uncomment this line to save an interactive html plot, in case it is not rendering
#fig.write_html('worms_interactive_tsne.html')
fig.show()