Differential expression on Packer C. elegans data¶

This notebook was contributed by Eduardo Beltrame [@Munfred](https://github.com/Munfred) and edited by Romain Lopez with help from Adam Gayoso.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89701 cells and 20222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:¶

Loading the data from anndata containing cell labels and gene descriptions
Training the model with batch labels for integration with scVI
Retrieving the scVI latent space and imputed values
Visualize the latent space with an interactive UMAP plot using Plotly
Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

[ ]:

# If running in Colab, navigate to Runtime -> Change runtime type
# and ensure you're using a Python 3 runtime with GPU hardware accelerator
# Installation of scVI in Colab can take several minutes

[ ]:

import sys
IN_COLAB = "google.colab" in sys.modules

show_plot = True
save_path = "./"

if IN_COLAB:
    !pip install --quiet scvi[notebooks]==0.6.3

[4]:

import scvi
scvi.__version__

[4]:

'0.6.3'

[ ]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Control warnings
import warnings; warnings.simplefilter('ignore')

import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from scvi.dataset import GeneExpressionDataset
from scvi.models import VAE
from scvi.inference import UnsupervisedTrainer
import torch
import anndata

import plotly.express as px
import plotly.graph_objects as go

from umap import UMAP

if IN_COLAB:
    %matplotlib inline

[6]:

## Change the path where the models will be saved
save_path = "./"
vae_file_name = 'worm_vae.pkl'

if os.path.isfile('packer2019.h5ad'):
    print ("Found the data file! No need to download.")
else:
    print ("Downloading data...")
    ! wget https://github.com/Munfred/wormcells-site/releases/download/packer2019/packer2019.h5ad

Found the data file! No need to download.

[ ]:

adata = anndata.read('packer2019.h5ad')

[8]:

adata

[8]:

AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'

[9]:

adata.X

[9]:

<89701x20222 sparse matrix of type '<class 'numpy.float32'>'
        with 82802059 stored elements in Compressed Sparse Column format>

Take a look at the gene descriptions¶

The gene descriptions were taken using the WormBase API.

[10]:

display(adata.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'}))

	gene_id	gene_name	gene_description
index
WBGene00010957	WBGene00010957	nduo-6	Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958	WBGene00010958	ndfl-4	Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959	WBGene00010959	nduo-1	Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960	WBGene00010960	atp-6	Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961	WBGene00010961	nduo-2	Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.

[11]:

adata.obs.head().T

[11]:

index	AAACCTGAGACAATAC-300.1.1	AAACCTGAGGGCTCTC-300.1.1	AAACCTGAGTGCGTGA-300.1.1	AAACCTGAGTTGAGTA-300.1.1	AAACCTGCAAGACGTG-300.1.1
cell	AAACCTGAGACAATAC-300.1.1	AAACCTGAGGGCTCTC-300.1.1	AAACCTGAGTGCGTGA-300.1.1	AAACCTGAGTTGAGTA-300.1.1	AAACCTGCAAGACGTG-300.1.1
numi	1630	2319	3719	4251	1003
time_point	300_minutes	300_minutes	300_minutes	300_minutes	300_minutes
batch	Waterston_300_minutes	Waterston_300_minutes	Waterston_300_minutes	Waterston_300_minutes	Waterston_300_minutes
size_factor	1.02319	1.45821	2.33828	2.65905	0.62961
cell_type	Body_wall_muscle	nan	nan	Body_wall_muscle	Ciliated_amphid_neuron
cell_subtype	BWM_head_row_1	nan	nan	BWM_anterior	AFD
plot_cell_type	BWM_head_row_1	nan	nan	BWM_anterior	AFD
raw_embryo_time	360	260	270	260	350
embryo_time	380	220	230	280	350
embryo_time_bin	330-390	210-270	210-270	270-330	330-390
raw_embryo_time_bin	330-390	210-270	270-330	210-270	330-390
lineage	MSxpappp	MSxapaap	nan	Dxap	ABalpppapav/ABpraaaapav
passed_qc	True	True	True	True	True

Loading data¶

We load the Packer data and use the batch annotations for scVI. Here, each experiment correspond to a batch.

[12]:

gene_dataset = GeneExpressionDataset()

# we provide the `batch_indices` so that scvi can perform batch correction
gene_dataset.populate_from_data(
            adata.X,
            gene_names=adata.var.index.values,
            cell_types=adata.obs['cell_type'].values,
            batch_indices=adata.obs['batch'].cat.codes.values,
            )

# We select the 1000 most variable genes, which is a recommended selection criteria of scvi
# this method in particular is based on the method of Seurat v3
gene_dataset.subsample_genes(1000)
sel_genes = gene_dataset.gene_names

[2020-04-02 03:49:05,753] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-04-02 03:49:05,759] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-04-02 03:49:05,765] INFO - scvi.dataset.dataset | extracting highly variable genes using seurat_v3 flavor

Transforming to str index.

[2020-04-02 03:49:56,429] INFO - scvi.dataset.dataset | Downsampling from 20222 to 1000 genes
[2020-04-02 03:49:56,775] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-04-02 03:49:57,163] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2020-04-02 03:49:57,579] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-04-02 03:49:57,965] INFO - scvi.dataset.dataset | Downsampled from 89701 to 89701 cells

At this point we may make the scVI dataset RNA count matrix dense, as we filtered the genes. This makes inference faster because when the matrix is sparse, the trainer has to make each minibatch dense before it loads it on the model and the GPU is using CUDA.

[13]:

gene_dataset.X = gene_dataset.X.toarray()

[2020-04-02 03:50:00,523] INFO - scvi.dataset.dataset | Computing the library size for the new data

[14]:

adata.obs['cell_type'].values

[14]:

[Body_wall_muscle, nan, nan, Body_wall_muscle, Ciliated_amphid_neuron, ..., Rectal_gland, nan, nan, nan, nan]
Length: 89701
Categories (37, object): [ABarpaaa_lineage, Arcade_cell, Body_wall_muscle, Ciliated_amphid_neuron, ...,
                          hmc_and_homolog, hmc_homolog, hyp1V_and_ant_arc_V, nan]

Training¶

n_epochs: Maximum number of epochs to train the model. If the likelihood change is small than a set threshold training will stop automatically.
lr: learning rate. Set to 0.001 here.
use_cuda: Set to true to use CUDA (GPU required)

[ ]:

# for this dataset 50 epochs is sufficient
# note that smaller datasets require hundreds of epochs
n_epochs = 50
lr = 1e-3
use_cuda = True

We now create the model and the trainer object. We train the model and output model likelihood every epoch. In order to evaluate the likelihood on a test set, we split the datasets (the current code can also so train/validation/test).

If a pre-trained model already exist in the save_path then load the same model rather than re-training it. This is particularly useful for large datasets.

[ ]:

# set the VAE to perform batch correction
vae = VAE(gene_dataset.nb_genes, n_batch=gene_dataset.n_batches)

[ ]:

# Use per minibatch warmup (iters) for large datasets
trainer = UnsupervisedTrainer(
    vae,
    gene_dataset,
    train_size=0.85, # number between 0 and 1, default 0.8
    use_cuda=use_cuda,
    frequency=1,
    n_epochs_kl_warmup=None,
    n_iter_kl_warmup=int(128*5000/400)
)

[18]:

# check if a previously trained model already exists, if yes load it

full_file_save_path = os.path.join(save_path, vae_file_name)

if os.path.isfile(full_file_save_path):
    trainer.model.load_state_dict(torch.load(full_file_save_path))
    trainer.model.eval()
else:
    trainer.train(n_epochs=n_epochs, lr=lr)
    torch.save(trainer.model.state_dict(), full_file_save_path)

[2020-04-02 03:50:07,270] INFO - scvi.inference.inference | KL warmup for 1600 iterations
training: 100%|██████████| 50/50 [08:53<00:00, 10.66s/it]

Plotting the likelihood change across training¶

[19]:

train_test_results = pd.DataFrame(trainer.history).rename(columns={'elbo_train_set':'Train', 'elbo_test_set':'Test'})

train_test_results

[19]:

	Train	Test
0	8881.413828	9174.447856
1	857.742913	865.714189
2	821.685303	829.391246
3	808.042363	816.110580
4	796.994002	804.426280
5	787.069519	795.110373
6	775.612372	782.976124
7	785.185200	792.350435
8	778.276230	786.173274
9	766.440439	773.111641
10	763.470819	770.053581
11	762.309879	769.090608
12	759.699914	766.290829
13	760.565406	767.061953
14	755.770818	762.247230
15	755.289439	761.668258
16	760.560715	767.147357
17	755.285547	761.907396
18	752.282227	758.606450
19	752.231292	758.707572
20	751.001023	757.318010
21	750.882924	757.270630
22	753.712712	760.248563
23	752.402924	759.038684
24	750.010072	756.424613
25	751.233984	757.741389
26	748.450227	754.827124
27	752.164061	758.740302
28	748.442852	754.916773
29	753.610544	760.393845
30	748.497168	755.029036
31	749.499188	756.388857
32	748.662938	755.379153
33	747.743216	754.304865
34	747.289906	753.938626
35	746.901389	753.527914
36	747.174424	753.747798
37	746.882975	753.544082
38	746.582319	753.286055
39	747.826352	754.543043
40	747.763953	754.599980
41	746.903656	753.518663
42	746.093606	752.914513
43	747.682025	754.577081
44	748.042126	754.908751
45	746.043148	752.920929
46	745.559933	752.408477
47	745.672933	752.533768
48	748.517461	755.667539
49	745.237304	752.205676
50	745.312490	752.293802

[20]:

ax = train_test_results.plot()
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
plt.show()

../_images/contributed_tutorials_scVI_DE_worm_27_0.png

Obtaining the posterior object and sample latent space¶

The posterior object contains a model and a gene_dataset, as well as additional arguments that for Pytorch’s DataLoader. It also comes with many methods or utilities querying the model, such as differential expression, imputation and differential analyisis.

To get an ordered output result, we might use .sequential posterior’s method which return another instance of posterior (with shallow copy of all its object references), but where the iteration is in the same ordered as its indices attribute.

[21]:

# This provides a posterior with sequential index sampling
full = trainer.create_posterior()
latent, batch_indices, labels = full.get_latent()
batch_indices = batch_indices.ravel()
latent.shape

[21]:

(89701, 10)

[22]:

# store the latent space in a new anndata object
post_adata = anndata.AnnData(X=gene_dataset.X)
post_adata.obs=adata.obs
post_adata.obsm["latent"] = latent
post_adata

[22]:

AnnData object with n_obs × n_vars = 89701 × 1000
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    obsm: 'latent'

Using Plotly’s `Scattergl` we can easily and speedily make interactive plots with 89k cells!¶

[ ]:

# here's a hack to randomize categorical colors, since plotly can't do that in a straightforward manner
# we take the list of named css colors that it recognizes, and we picked a color based on the code of
# the cluster we are coloring
css_colors=[
'aliceblue','antiquewhite','aqua','aquamarine','azure','bisque','black','blanchedalmond','blue',
'blueviolet','brown','burlywood','cadetblue','chartreuse','chocolate','coral','cornflowerblue',
'crimson','cyan','darkblue','darkcyan','darkgoldenrod','darkgray','darkgrey','darkgreen','darkkhaki',
'darkmagenta','darkolivegreen','darkorange','darkorchid','darkred','darksalmon','darkseagreen',
'darkslateblue','darkslategray','darkslategrey','darkturquoise','darkviolet','deeppink','deepskyblue',
'dimgray','dimgrey','dodgerblue','firebrick','floralwhite','forestgreen','fuchsia','gainsboro','ghostwhite',
'gold','goldenrod','gray','grey','green','greenyellow','honeydew','hotpink','indianred','indigo',
'ivory','khaki','lavender','lavenderblush','lawngreen','lemonchiffon','lightblue','lightcoral','lightcyan',
'lightgoldenrodyellow','lightgray','lightgrey','lightgreen','lightpink','lightsalmon','lightseagreen',
'lightskyblue','lightslategray','lightslategrey','lightsteelblue','lightyellow','lime','limegreen','linen',
'magenta','maroon','mediumaquamarine','mediumblue','mediumorchid','mediumpurple','mediumseagreen',
'mediumslateblue','mediumspringgreen','mediumturquoise','mediumvioletred','midnightblue','mintcream',
'mistyrose','moccasin','navajowhite','navy','oldlace','olive','olivedrab','orange','orangered','orchid',
'palegoldenrod','palegreen','paleturquoise','palevioletred','papayawhip','peachpuff','peru','pink','plum'
,'powderblue','purple','red','rosybrown','royalblue','saddlebrown','salmon','sandybrown','seagreen',
'seashell','sienna','silver','skyblue','slateblue','slategray','slategrey','snow','springgreen','steelblue',
'tan','teal','thistle','tomato','turquoise','violet','wheat','white','whitesmoke','yellow','yellowgreen']

# we just repeat the list of colors a bunch of times to ensure we always have more colors than clusters
css_colors = css_colors*100

# now define a function to plot any embedding

def plot_embedding(embedding_kind,              # the embedding must be a label in the post_adata.obsm
                   adata=adata,                 # the original adata for taking the cluster labels
                   post_adata=post_adata,
                   cluster_feature ='cell_type',
                   xlabel="Dimension 1",
                   ylabel="Dimension 2",
                   plot_title="Embedding on single cell data"):

    # `cluster_feature` should be the name of one of the categorical annotation columns
    # e.g. `cell_type`, `cell_subtype`, `time_point`

    cluster_ids = adata.obs[cluster_feature].cat.codes.unique()
    id_to_cluster_map = dict( zip( adata.obs[cluster_feature].cat.codes, adata.obs[cluster_feature] ) )
    cluster_to_id_map  = dict([[v,k] for k,v in id_to_cluster_map.items()])

    fig = go.Figure()
    for _cluster_id in adata.obs[cluster_feature].cat.codes.unique():

        fig.add_trace(
            go.Scattergl(
                  x = post_adata.obsm[embedding_kind][:,0][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , y = post_adata.obsm[embedding_kind][:,1][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , mode='markers'
                , marker=dict(
                    # we randomize colors by starting at a random place in the list of named css colors
                    color=css_colors[_cluster_id+np.random.randint(0,len(np.unique(css_colors)))]
                    , size = 3
                        )
                , showlegend=True
                , name=id_to_cluster_map[_cluster_id]
                , hoverinfo=['name']
                )
            )

    layout={
        "title": {"text": plot_title
                  , 'x':0.5
                 }
        , 'xaxis': {'title': {"text": xlabel}}
        , 'yaxis': {'title': {"text": ylabel}}
        , "height": 800
        , "width":1000
    }
    fig.update_layout(layout)
    fig.update_layout(showlegend=True)
    return fig

[ ]:

latent_umap = UMAP(spread=2).fit_transform(latent)

[ ]:

post_adata.obsm["UMAP"] = np.array(latent_umap)

[ ]:

fig = plot_embedding(embedding_kind='UMAP',
               cluster_feature ='cell_type',
               xlabel="UMAP 1",
               ylabel="UMAP 2",
               plot_title="UMAP on scVI latent space for Packer C. elegans single cell data")

[27]:

# uncomment this line to save an interactive html plot, in case it is not rendering
#fig.write_html('worms_interactive_tsne.html')
fig.show()

Performing Differential Expression with `vanilla` and `change` modes¶

Note: scVI recently introduced a second way to perform DE, and some functions and documentation are still changing. The old mode is called ``vanilla`` and is performed below. The new mode is called ``change`` and is done at the end of the notebook. In addition to Bayes Factors, the new ``change`` mode allows for calculating p-values, which are more commonly seen in volcano plots.

From the trained VAE model we can sample the gene expression rate for each gene in each cell.
For two populations of interest, we can then randomly sample pairs of cells, one from each population to compare their expression rate for a gene.
In the vanilla mode, DE is measured by logit(p/(1-p)) where p is the probability of a cell from population A having a higher expression than a cell from population B.
We can form the null distribution of the DE values by sampling pairs randomly from the combined population.

Explanation of `Vanilla` DE mode and Bayes Factors¶

Explanation adapted from the scVi `differential_expression_score docstring <https://github.com/YosefLab/scVI/blob/05920d1f85daa362d4fb694e588ab090bc84e207/scvi/inference/posterior.py#L640>`__.

The ``vanilla`` mode follows protocol described in Lopez et al, arXiv:1709.02082.

In this case for a given gene we perform hypothesis testing based on latent variable in the generative model that models the mean of the gene expression.

We are comparing $h_{1g}$, the mean expression of gene $g$ in cell type 1, with $h_{2g}$, the mean expression of $g$ in cell type 2.

The hypotheses are:

\[M^g_1: h_{1g} > h_{2g}\]

\[M^g_2: h_{1g} \leq h_{2g}\]

DE between cell types 1 and 2 for each gene can then be based on the Bayes factors:

\[\text{Natural Log Bayes Factor for gene g in cell types 1 and 2} = \ln ( {BF^g_{12}) = \ln(\frac{ p(M^g_1 | x_1, x_2)}{p(M^g_2 | x_1, x_2)}})\]

Note that the scvi ``differential_expression_score`` returns the *natural logarithm* of the Bayes Factor. This is :math:`ln(BF_{10})` in the table discussed below.

To compute the gene specific Bayes factors using masks idx1 and idx2 we sample the Posterior in the following way:

The posterior is sampled n_samples times for each subpopulation
For computation efficiency (posterior sampling is quite expensive), instead of comparing element-wise the obtained samples, we can permute posterior samples.

Remember that computing the Bayes Factor requires sampling $q(z_A | x_A)$ and $q(z_B | x_B)$

Interpreting Bayes factors¶

To learn more about Bayes factors vs. p-values, see the review On p-Values and Bayes Factors by Leonhard Held and Manuela Ott.

For a shorter overview, see this blog post. A common interpretation table is copied below.

In our notation, $BF_{10}$ is $BF^g_{12}$, $H_0$ is $M^g_1$ and $H_1$ is $M^g_2$

Bayes factor $BF_{10}$	$\ln(BF_{10})$	Interpretation
> 100	> 4.60	Extreme evidence for H1
30 – 100	(3.4, 4.6)	Very strong evidence for H1
10 – 30	(2.3, 3.4)	Strong evidence for H1
3 – 10	(1.1, 2.3)	Moderate evidence for H1
1 – 3	(0 , 1.1)	Anecdotal evidence for H1
1	0	No evidence
1/3 – 1	(-1.1, 0)	Anecdotal evidence for H0
1/3 – 1/10	(-2.30, -1.1)	Moderate evidence for H0
1/10 – 1/30	(-3.4, -2.30)	Strong evidence for H0
1/30 – 1/100	(-4.6, -3.4)	Very strong evidence for H0
< 1/100	< -4.6	Extreme evidence for H0

Selecting cells to compare¶

[28]:

# let's take a look at abundances of different cell types
adata.obs['cell_type'].value_counts()

[28]:

nan                               35052
Body_wall_muscle                  17520
Hypodermis                         7746
Ciliated_amphid_neuron             6090
Ciliated_non_amphid_neuron         4468
Seam_cell                          2766
Pharyngeal_muscle                  2562
Glia                               1857
Intestine                          1732
Pharyngeal_neuron                  1471
Pharyngeal_marginal_cell            911
Coelomocyte                         787
Pharyngeal_gland                    786
GLR                                 768
Intestinal_and_rectal_muscle        568
Germline                            499
Pharyngeal_intestinal_valve         493
Arcade_cell                         434
Z1_Z4                               372
Rectal_cell                         327
M_cell                              315
ABarpaaa_lineage                    273
Rectal_gland                        265
Excretory_cell                      215
Excretory_gland                     205
hmc                                 189
hmc_homolog                         155
T                                   141
hmc_and_homolog                     122
Parent_of_exc_gland_AVK             114
hyp1V_and_ant_arc_V                 112
Excretory_duct_and_pore              91
Parent_of_hyp1V_and_ant_arc_V        75
G2_and_W_blasts                      72
Excretory_cell_parent                62
Parent_of_exc_duct_pore_DB_1_3       61
XXX                                  25
Name: cell_type, dtype: int64

[29]:

# let's pick two cell types
cell_type_1 = 'Ciliated_non_amphid_neuron'
cell_type_2 = 'Intestine'


cell_idx1 = adata.obs['cell_type'] == cell_type_1
print(sum(cell_idx1), 'cells of type', cell_type_1)
cell_idx2 = adata.obs['cell_type'] == cell_type_2
print(sum(cell_idx2), 'cells of type', cell_type_2)

4468 cells of type Ciliated_non_amphid_neuron
1732 cells of type Intestine

Vanilla DE parameters¶

n_samples: the number of times to sample the posterior gene frequencies from the vae model for each gene in each cell.
M_permutation: Number of pairs sampled for comparison.
idx1: boolean array masking subpopulation cells 1. (True where cell is from population)
idx2: boolean array masking subpopulation cells 2. (True where cell is from population)

[ ]:

n_samples = 10000
M_permutation = 10000

[ ]:

de_vanilla = full.differential_expression_score(
    idx1 = cell_idx1,
    idx2 = cell_idx2,
    mode='vanilla', # vanilla is the default
    n_samples=n_samples,
    M_permutation=M_permutation,
)

Print the differential expression results¶

bayes``i``: The bayes factor for cell type 1 having a higher expression than cell type 2
bayes``i``_permuted: estimate Bayes Factors of random populations of the union of the two cell populations
mean``i``: average UMI counts in cell type i
nonz``i``: proportion of non-zero expression in cell type i
norm_mean``i``: average UMI counts in cell type i normalized by cell size
scale``i``: average scVI imputed gene expression scale in cell type i

[32]:

de_vanilla.head()

[32]:

	proba_m1	proba_m2	bayes_factor	scale1	scale2	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2
WBGene00003175	0.9997	0.0003	8.110995	0.006821	0.000027	5.615488	0.124134	0.347135	0.091224	10.227241	0.025440
WBGene00022836	0.9997	0.0003	8.110995	0.001819	0.000184	2.111683	0.521363	0.747986	0.355658	3.143920	0.114598
WBGene00044387	0.9997	0.0003	8.110995	0.004801	0.000024	3.650403	0.081986	0.602507	0.066975	8.174725	0.016240
WBGene00001196	0.9996	0.0004	7.823221	0.002145	0.000278	2.378021	0.897229	0.759624	0.444573	3.800714	0.165707
WBGene00011917	0.9995	0.0005	7.599982	0.001539	0.000204	1.550806	0.243072	0.620636	0.197460	2.158809	0.051089

[33]:

# manipulate the DE results for plotting

# we compute the ratio of the scVI scales to use that as a rough proxy for fold change
de_vanilla['ratio_scale12']=de_vanilla['scale1']/de_vanilla['scale2']
de_vanilla['log_scale_ratio']=np.log2(de_vanilla['ratio_scale12'])

# we take absolute values of the first bayes factor as the one to use on the volcano plot
# bayes1 and bayes2 should be roughtly the same, except with opposite signs
de_vanilla['abs_bayes_factor']=np.abs(de_vanilla['bayes_factor'])
de_vanilla=de_vanilla.join(adata.var, how='inner')
de_vanilla.head()

[33]:

	proba_m1	proba_m2	bayes_factor	scale1	scale2	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2	ratio_scale12	log_scale_ratio	abs_bayes_factor	gene_id	gene_name	gene_description
WBGene00003175	0.9997	0.0003	8.110995	0.006821	0.000027	5.615488	0.124134	0.347135	0.091224	10.227241	0.025440	249.736282	7.964262	8.110995	WBGene00003175	mec-12	Is an ortholog of human TUBA1A (tubulin alpha ...
WBGene00022836	0.9997	0.0003	8.110995	0.001819	0.000184	2.111683	0.521363	0.747986	0.355658	3.143920	0.114598	9.885020	3.305244	8.110995	WBGene00022836	ZK973.11	Is an ortholog of human TMX3. Is predicted to ...
WBGene00044387	0.9997	0.0003	8.110995	0.004801	0.000024	3.650403	0.081986	0.602507	0.066975	8.174725	0.016240	200.889313	7.650257	8.110995	WBGene00044387	C27A2.8	Is enriched in intestine and nervous system ba...
WBGene00001196	0.9996	0.0004	7.823221	0.002145	0.000278	2.378021	0.897229	0.759624	0.444573	3.800714	0.165707	7.715135	2.947691	7.823221	WBGene00001196	egl-30	Is an ortholog of human GNAQ (G protein subuni...
WBGene00011917	0.9995	0.0005	7.599982	0.001539	0.000204	1.550806	0.243072	0.620636	0.197460	2.158809	0.051089	7.532043	2.913041	7.599982	WBGene00011917	T22C1.6	Is an ortholog of human TXLNA (taxilin alpha)....

Because we’re using the vanilla mode, note that this volcano plot shows the ratios of the scVI expression scale of the two tissues vs the absolute value of the natural log bayes factor.

The new change mode allows for calculating log fold change and p-values, which are more commonly seen in volcano plots

We can highlight genes of interest based on simple string matching. For example, the cell below highlights all C. elegans neuropeptides (whose name conveniently all start with nlp, ins or flp). Other genes with be a transparent gray dot.

[ ]:

de_vanilla['gene_color'] = 'rgba(100, 100, 100, 0.25)'
de_vanilla.loc[de_vanilla['gene_name'].str.contains('ins-'), 'gene_color'] = 'rgba(255, 1,0, 1)'
de_vanilla.loc[de_vanilla['gene_name'].str.contains('nlp-'), 'gene_color'] = 'rgba(255, 0,0, 1)'
de_vanilla.loc[de_vanilla['gene_name'].str.contains('flp-'), 'gene_color'] = 'rgba(255, 0,1, 1)'

[35]:

# first we create these variables to customize the hover text in plotly's heatmap
# the text needs to be arranged in a matrix the same shape as the heatmap
# for the gene descriptions text, which can be several sentences, we add a line break after each sentence
de_vanilla['gene_description_html'] = de_vanilla['gene_description'].str.replace('\. ', '.<br>')

fig = go.Figure(
                data=go.Scatter(
                          x=de_vanilla["log_scale_ratio"].round(3)
                        , y=de_vanilla["abs_bayes_factor"].round(3)
                        , mode='markers'
                        , marker=dict(color=de_vanilla['gene_color'])
                        , hoverinfo='text'
                        , text=de_vanilla['gene_description_html']
                        , customdata=de_vanilla.gene_id.values + '<br>Name: ' + de_vanilla.gene_name.values
                        , hovertemplate='%{customdata} <br>' +
                                        '|ln(BF)|: %{y}<br>' +
                                        'Log2 scale ratio: %{x}' +
                                        '<extra>%{text}</extra>'
                        )
                        , layout= {
                                "title": {"text":
                                          "Vanilla differential expression of Packer C. elegans data between <br> <b>" +
                                          str(cell_type_1) + "</b> and <b>" + str(cell_type_2) + " "
                                          , 'x':0.5
                                         }
                                , 'xaxis': {'title': {"text": "Log2 of scVI expression scale"}}
                                , 'yaxis': {'title': {"text": "Absolute value of natural log of Bayes Factor"}}
                        }
               )
# uncomment line below to save the interactive volcano plot as html
# fig.write_html('worms_interactive_volcano_plot_vanilla_DE.html')
fig.show()

Heatmap of top expressed genes for `vanilla` mode¶

Now we perform DE between each cell type vs all other cells and make a heatmap of the result.

First we need to make cell type sumary with numerical codes for each cell type

[36]:

# we need to numerically encode the cell types for passing the cluster identity to scVI
cell_code_to_type = dict( zip( adata.obs['cell_type'].cat.codes, adata.obs['cell_type'] ) )
cell_type_to_code_map = dict([[v,k] for k,v in cell_code_to_type.items()])
# check that we got unique cell type labels
assert len(cell_code_to_type)==len(cell_type_to_code_map)

cell_types_summary=pd.DataFrame(index=adata.obs['cell_type'].value_counts().index)
cell_types_summary['cell_type_code']=cell_types_summary.index.map(cell_type_to_code_map)
cell_types_summary['ncells']=adata.obs['cell_type'].value_counts()
cell_types_summary['cell_type_name']=adata.obs['cell_type'].value_counts().index
cell_types_summary.to_csv('packer_cell_types_summary.csv')
cell_types_summary.head()

[36]:

	cell_type_code	ncells	cell_type_name
nan	36	35052	nan
Body_wall_muscle	2	17520	Body_wall_muscle
Hypodermis	14	7746	Hypodermis
Ciliated_amphid_neuron	3	6090	Ciliated_amphid_neuron
Ciliated_non_amphid_neuron	4	4468	Ciliated_non_amphid_neuron

[37]:

# create a column in the cell data with the cluster id each cell belongs to
adata.obs['cell_type_code'] = adata.obs['cell_type'].cat.codes

# this returns a list of dataframes with DE results (one for each cluster),
# and a list with the corresponding cluster id
vanilla_per_cluster_de, vanilla_cluster_id = full.one_vs_all_degenes(
    cell_labels=adata.obs['cell_type_code'].ravel(),
    mode = 'vanilla', # vanilla is the default mode
    min_cells=1)

[ ]:

# pick the top 10 genes in each cluster
vanilla_top_genes = []
for x in vanilla_per_cluster_de:
    vanilla_top_genes.append(x[:10])
vanilla_top_genes = pd.concat(vanilla_top_genes)
vanilla_top_genes = np.unique(vanilla_top_genes.index)

[39]:

# fetch the expression values for the top 10 genes
vanilla_top_expression = [x.filter(items=vanilla_top_genes, axis=0)['scale1'] for x in vanilla_per_cluster_de]
vanilla_top_expression = pd.concat(vanilla_top_expression, axis=1)
vanilla_top_expression = np.log10(1 + vanilla_top_expression)
cluster_name = [cell_code_to_type[_id] for _id in vanilla_cluster_id]
vanilla_top_expression.columns=cluster_name

# convert into anndata object to tie with more metadata, such as gene names and descriptions
vanilla_top_expression = anndata.AnnData(vanilla_top_expression.T)
vanilla_top_expression.obs = vanilla_top_expression.obs.join(cell_types_summary)
vanilla_top_expression.obs.head()

[39]:

	cell_type_code	ncells	cell_type_name
ABarpaaa_lineage	0	273	ABarpaaa_lineage
Arcade_cell	1	434	Arcade_cell
Body_wall_muscle	2	17520	Body_wall_muscle
Ciliated_amphid_neuron	3	6090	Ciliated_amphid_neuron
Ciliated_non_amphid_neuron	4	4468	Ciliated_non_amphid_neuron

[40]:

#make a copy of the annotated gene metadata with gene ids all lower case to avoid problems when joining dataframes
adata_var_lowcase = adata.var.copy()
adata_var_lowcase.index = adata_var_lowcase.index.str.lower()

#convert top_expression gene ids index to lowercase for joining with metadata
vanilla_top_expression.var.index = vanilla_top_expression.var.index.str.lower()
vanilla_top_expression.var=vanilla_top_expression.var.join(adata_var_lowcase)

vanilla_top_expression.var.index=vanilla_top_expression.var['gene_id']
vanilla_top_expression.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'})

[40]:

	gene_id	gene_name	gene_description
gene_id
WBGene00000064	WBGene00000064	act-2	Is an ortholog of human ACTB (actin beta). Is predicted to have ATP binding activity. Is involved in several processes, including cortical actin cytoskeleton organization; cytoskeleton-dependent cytokinesis; and embryo development. Localizes to the actin filament and cell cortex. Is expressed in body wall musculature; gonad; nervous system; and the hypodermis. Human ortholog(s) of this gene are implicated in several diseases, including artery disease (multiple); intrinsic cardiomyopathy (multiple); and myopathy (multiple).
WBGene00000188	WBGene00000188	arl-3	Is an ortholog of human ARL3 (ADP ribosylation factor like GTPase 3). Is predicted to have GTP binding activity. Is expressed in head neurons and tail neurons.
WBGene00000229	WBGene00000229	atp-2	Is an ortholog of human ATP5F1B (ATP synthase F1 subunit beta). Exhibits protein domain specific binding activity. Is predicted to contribute to ATPase activity and proton-transporting ATP synthase activity, rotational mechanism. Is involved in several processes, including determination of adult lifespan; inductive cell migration; and pharyngeal pumping. Localizes to the mitochondrion and non-motile cilium. Is expressed in HOB; head; ray neuron type B; and tail.
WBGene00000429	WBGene00000429	ceh-2	Is an ortholog of human EMX1 (empty spiracles homeobox 1). Is predicted to have DNA binding activity. Localizes to the nucleus. Is expressed in nervous system; pharynx; somatic nervous system; and vulva.
WBGene00000431	WBGene00000431	ceh-6	Is an ortholog of human POU3F4 (POU class 3 homeobox 4). Is predicted to have DNA-binding transcription factor activity. Is involved in excretion; transdifferentiation; and water homeostasis. Localizes to the cytosol and nucleus. Is expressed in body wall musculature; head neurons; hyp7 syncytium; intestine; and tail neurons. Human ortholog(s) of this gene are implicated in sensorineural hearing loss.

Create interactive heatmap for `vanilla` results¶

In this heatmap we plot the top 10 genes for each of the 37 annotated tissue types.

[41]:

# first we create these variables to customize the hover text in plotly's heatmap
# the text needs to be arranged in a matrix the same shape as the heatmap
# for the gene descriptions text, which can be several sentences, we add a line break after each sentence
vanilla_top_expression.var['gene_description_html'] = vanilla_top_expression.var['gene_description'].str.replace('\. ', '.<br>')
gene_description_text_matrix = np.tile(vanilla_top_expression.var['gene_description_html'].values, (len(vanilla_top_expression.obs['cell_type_name']),1) )
gene_ids_text_matrix = np.tile(vanilla_top_expression.var['gene_id'].values, (len(vanilla_top_expression.obs['cell_type_name']),1))


# now create the heatmap with plotly
fig = go.Figure(
                data=go.Heatmap(
                        z=np.log(vanilla_top_expression.X * 10000), # multiply by 10000 to interpret this as ln(CP10K) scale
                        x=vanilla_top_expression.var['gene_name'],
                        y=vanilla_top_expression.obs['cell_type_name'],
                        hoverinfo='text',
                        text=gene_description_text_matrix,
                        customdata=gene_ids_text_matrix,
                        hovertemplate='%{customdata} <br>Name: %{x}<br>Cell type: %{y}<br>ln(CP10K) %{z}  <extra>%{text}</extra>',
                        ),
                    layout= {
                        "title": {"text": "Vanilla differential expression of Packer C. elegans single cell data"},
                        "height": 800,
                        },
               )

# uncomment line below to save the interactive volcano plot as html
# fig.write_html('worms_interactive_heatmap_vanilla_DE.html')
fig.show()

Now we perform differential expression using the change DE mode introduced in scVI v0.60

The ``change`` mode follows the protocol described in Boyeau et al, bioRxiv 2019. doi: 10.1101/794289

It consists in estimating an effect size random variable (e.g., log fold-change) and performing Bayesian hypothesis testing on this variable.

The new change_fn function computes the effect size variable r based two inputs corresponding to the normalized means in both populations

$M_1: r \in R_0$ (effect size r in region inducing differential expression)

$M_2: r \notin R_0$ (no differential expression)

To characterize the region $R_0$, the user has two choices.

A common case is when the region $[-\delta, \delta]$ does not induce differential expression.

If the user specifies a threshold delta, we suppose that $R_0 = \mathbb{R} \backslash [-\delta, \delta]$

Specify an specific indicator function $ \mathbb{1} : \mathbb{R} \mapsto `{0, 1} :nbsphinx-math:text{ s.t. }` r \in `R\_0 :nbsphinx-math:iff :nbsphinx-math:mathbb{1}`(r) = 1$

Decision-making can then be based on the estimates of $p(M_1 | x_1, x_2)$

[ ]:

de_change = full.differential_expression_score(
    idx1 = cell_idx1, # we use the same cells as chosen before
    idx2 = cell_idx2,
    mode='change', # set to the new change mode
    n_samples=n_samples,
    M_permutation=M_permutation,
)

[43]:

de_change.head()

[43]:

	proba_de	proba_not_de	bayes_factor	scale1	scale2	lfc_mean	lfc_median	lfc_std	lfc_min	lfc_max	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2
WBGene00044387	1.0000	0.0000	18.420681	0.004867	0.000020	7.937036	8.051170	1.842242	-1.833663	13.808327	3.650403	0.081986	0.602507	0.066975	8.174725	0.016240
WBGene00044705	0.9998	0.0002	8.516543	0.000797	0.000007	7.166002	7.212691	1.636343	-2.694234	13.170542	0.006938	0.015589	0.006714	0.015012	0.017730	0.002663
WBGene00009741	0.9998	0.0002	8.516543	0.000082	0.001987	-4.921936	-4.966117	1.226826	-8.780192	1.363821	0.032005	8.352772	0.028424	0.979215	0.066305	1.669146
WBGene00003175	0.9998	0.0002	8.516543	0.006789	0.000024	7.801917	7.984080	2.047122	-1.597371	14.624739	5.615488	0.124134	0.347135	0.091224	10.227241	0.025440
WBGene00000433	0.9997	0.0003	8.110995	0.003091	0.000013	8.755489	8.788426	1.745392	-1.577462	14.714615	0.619517	0.023095	0.087511	0.018476	0.714217	0.011383

[44]:

# manipulate the DE results for plotting

# we use the `mean` entru in de_chage, it is the scVI posterior log2 fold change
de_change['log10_pvalue']=np.log10(de_change['proba_not_de'])

# we take absolute values of the first bayes factor as the one to use on the volcano plot
# bayes1 and bayes2 should be roughtly the same, except with opposite signs
de_change['abs_bayes_factor']=np.abs(de_change['bayes_factor'])
de_change=de_change.join(adata.var, how='inner')
de_change.head()

[44]:

	proba_de	proba_not_de	bayes_factor	scale1	scale2	lfc_mean	lfc_median	lfc_std	lfc_min	lfc_max	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2	log10_pvalue	abs_bayes_factor	gene_id	gene_name	gene_description
WBGene00044387	1.0000	0.0000	18.420681	0.004867	0.000020	7.937036	8.051170	1.842242	-1.833663	13.808327	3.650403	0.081986	0.602507	0.066975	8.174725	0.016240	-inf	18.420681	WBGene00044387	C27A2.8	Is enriched in intestine and nervous system ba...
WBGene00044705	0.9998	0.0002	8.516543	0.000797	0.000007	7.166002	7.212691	1.636343	-2.694234	13.170542	0.006938	0.015589	0.006714	0.015012	0.017730	0.002663	-3.698796	8.516543	WBGene00044705	T22F3.12	Is an ortholog of human PPIL6 (peptidylprolyl ...
WBGene00009741	0.9998	0.0002	8.516543	0.000082	0.001987	-4.921936	-4.966117	1.226826	-8.780192	1.363821	0.032005	8.352772	0.028424	0.979215	0.066305	1.669146	-3.698796	8.516543	WBGene00009741	drr-1	Is enriched in OLL; PVD; intestine; and pharyn...
WBGene00003175	0.9998	0.0002	8.516543	0.006789	0.000024	7.801917	7.984080	2.047122	-1.597371	14.624739	5.615488	0.124134	0.347135	0.091224	10.227241	0.025440	-3.698796	8.516543	WBGene00003175	mec-12	Is an ortholog of human TUBA1A (tubulin alpha ...
WBGene00000433	0.9997	0.0003	8.110995	0.003091	0.000013	8.755489	8.788426	1.745392	-1.577462	14.714615	0.619517	0.023095	0.087511	0.018476	0.714217	0.011383	-3.522705	8.110995	WBGene00000433	ceh-8	Is an ortholog of human DRGX (dorsal root gang...

Volcano plot of change mode DE with p-values¶

In addition to Bayes Factors, the new change mode allows for calculating log fold change and p-values, which are more commonly seen in volcano plots.