Differential expression on C. elegans data#

This notebook was contributed by Eduardo Beltrame @Munfred and edited by Romain Lopez, Adam Gayoso, and Pierre Boyeau.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89,701 cells and 20,222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions

  2. Training the model with batch labels for integration with scVI

  3. Retrieving the scVI latent space and imputed values

  4. Visualize the latent space with an interactive t-SNE plot using Plotly

  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
import scanpy as sc
import scvi
import seaborn as sns
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.6

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"
adata_path = os.path.join(save_dir.name, "packer2019.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://github.com/Munfred/wormcells-site/releases/download/packer2019/packer2019.h5ad",
)
adata
AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'

Take a look at the gene descriptions#

The gene descriptions were taken using the WormBase API.

display(adata.var.head().style.set_properties(subset=["gene_description"], **{"width": "600px"}))
  gene_id gene_name gene_description
index      
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.

Selecting genes and loading data#

We use the utility scvi.data.poisson_gene_selection to select genes according to their dropout rate, which is a simple and scalable approach to select genes.

This method was described by Andrews & Hemberg in the article M3Drop: dropout-based feature selection for scRNASeq : https://academic.oup.com/bioinformatics/article/35/16/2865/5258099

This method modifies the adata to add the following fields:

highly_variable                   # boolean true for chosen genes
observed_fraction_zeros	       # fraction of observed zeros per gene
expected_fraction_zeros	       # expected fraction of observed zeros per gene
prob_zero_enriched_nbatches	   # If batch_key is given, this denotes in how many batches genes are detected as zero enriched
prob_zero_enrichment	          # Probability of zero enrichment, median across batches in the case of multiple batches
prob_zero_enrichment_rank         # Rank of the gene according to probability of zero enrichment

Note

Gene selection is an important step to obtain relevant cell representations with scVI. Generally, selecting the top few thousands top-ranking genes predicted by a gene selection tool suffice to obtain good performance.

Increasing the number of selected genes may be required in some applications, e.g., to increase the number of considered genes for differential expression. Note, however, that this will increase the time required to reach convergence and GPU memory load. It may also require to tune scVI’s model hyperparameters (see the autotune tutorial)

Alternatives to the Poisson gene selection can be used, e.g., via scanpy or seurat.

scvi.data.poisson_gene_selection(adata)
adata.var.head()
gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16;... False 0.075685 0.002028 0 0.0762 15660.0
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies.... True 0.659680 0.623356 1 0.2502 19567.0
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondriall... True 0.201993 0.046608 1 0.1886 18906.0
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transport... True 0.138081 0.001848 1 0.1381 17861.0
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; ... True 0.468434 0.383116 1 0.2860 19801.0
adata = adata[:, adata.var["highly_variable"]]  # focus on selected genes
adata.layers["counts"] = adata.X.copy().tocsr()  # converts to CSR format, preserve counts

scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")  # prepare data for scVI

Define and train the model#

model = scvi.model.SCVI(
    adata, gene_likelihood="nb"
)  # We use Negative Binomial count likelihoods, following Boyeau et al., 2023.
model.train(
    check_val_every_n_epoch=1,
    max_epochs=400,
    early_stopping=True,
    early_stopping_patience=20,
    early_stopping_monitor="elbo_validation",
)
Monitored metric elbo_validation did not improve in the last 20 records. Best score: 2076.785. Signaling Trainer to stop.
# Ensure convergence
train_test_results = model.history["elbo_train"]
train_test_results["elbo_validation"] = model.history["elbo_validation"]
train_test_results.iloc[10:].plot(logy=True)  # exclude first 10 epochs
plt.show()

Get the latent space and compute UMAP#

SCVI_LATENT_KEY = "X_scVI"

latent = model.get_latent_representation()
adata.obsm[SCVI_LATENT_KEY] = latent
sc.pp.neighbors(adata, use_rep=SCVI_LATENT_KEY)
sc.tl.umap(adata)
sc.pl.umap(adata, color="cell_type")

Performing Differential Expression in scVI#

Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:

  1. Whether the gene’s expression levels are significantly different in the A and B sets of cells.

  2. An effect size that quantifies the strength of the differential expression.

Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.

Selecting cell subpopulations to compare#

# let's take a look at abundances of different cell types
adata.obs["cell_type"].value_counts()
cell_type
nan                               35052
Body_wall_muscle                  17520
Hypodermis                         7746
Ciliated_amphid_neuron             6090
Ciliated_non_amphid_neuron         4468
Seam_cell                          2766
Pharyngeal_muscle                  2562
Glia                               1857
Intestine                          1732
Pharyngeal_neuron                  1471
Pharyngeal_marginal_cell            911
Coelomocyte                         787
Pharyngeal_gland                    786
GLR                                 768
Intestinal_and_rectal_muscle        568
Germline                            499
Pharyngeal_intestinal_valve         493
Arcade_cell                         434
Z1_Z4                               372
Rectal_cell                         327
M_cell                              315
ABarpaaa_lineage                    273
Rectal_gland                        265
Excretory_cell                      215
Excretory_gland                     205
hmc                                 189
hmc_homolog                         155
T                                   141
hmc_and_homolog                     122
Parent_of_exc_gland_AVK             114
hyp1V_and_ant_arc_V                 112
Excretory_duct_and_pore              91
Parent_of_hyp1V_and_ant_arc_V        75
G2_and_W_blasts                      72
Excretory_cell_parent                62
Parent_of_exc_duct_pore_DB_1_3       61
XXX                                  25
Name: count, dtype: int64

scVI provides several options to identify the two populations of interest.

cell_type_1 = "Ciliated_non_amphid_neuron"
cell_idx1 = adata.obs["cell_type"] == cell_type_1
print(sum(cell_idx1), "cells of type", cell_type_1)

cell_type_2 = "Intestine"
cell_idx2 = adata.obs["cell_type"] == cell_type_2
print(sum(cell_idx2), "cells of type", cell_type_2)

# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"
4468 cells of type Ciliated_non_amphid_neuron
1732 cells of type Intestine

Basic principle#

DE and log fold-changes#

scVI can natively be used to perform differential expression analyses to compare populations of cells, as described here and there. It achieves this by estimating the posterior distribution of the log fold-change (LFC) between subpopulations \(A\) and \(B\). Specifically, for a given gene g, scVI calculates the LFC as the difference between the logarithm of its expression level in population A, denoted by \(h_g^A\), and the logarithm of its expression level in population B, denoted by \(h_g^B\).

The resulting value, denoted by \(\beta_g\), provides insights into the expression patterns of gene \(g\). Values close to zero indicate that the gene is expressed similarly in both populations, positive values suggest upregulation in population A and negative values indicate downregulation in population A. This information can be used to better understand the biological mechanisms underlying the differences between the two cell populations.

DE testing#

In addition to estimating the LFC, scVI can also detect which genes have significant expression patterns. To tag which genes are differentially expressed, scVI tests the following competing hypotheses \(M_{1, g}: \beta_g \in [-\delta, \delta]\) and \(M_{2, g}: \beta_g \in (-\infty, -\delta) \cup (\delta, \infty)\). Here, \(\delta\) denotes a small LFC threshold, such that \(\beta_g \in [-\delta, \delta]\) is evidence that the gene is equally expressed in the two subpopulations.

Differentially expressed genes are identified by computing the posterior probability of \(M_{2, g}\).

Running DE analyses#

Running and understanding a DE run#

A simple DE analysis can then be performed using the following command

de_change = model.differential_expression(idx1=cell_idx1, idx2=cell_idx2)
de_change
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std lfc_min lfc_max raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05
index
WBGene00003772 1.0000 0.0000 18.420681 0.000226 0.000006 0.0 0.25 5.163298 5.281876 1.198758 -1.250165 9.098156 0.247987 0.032333 0.158684 0.028868 2.492400 0.051523 True
WBGene00011936 1.0000 0.0000 18.420681 0.000020 0.001830 0.0 0.25 -6.679718 -6.750519 1.188723 -10.740919 2.571755 0.018800 12.181263 0.016786 0.986143 0.187631 17.695843 True
WBGene00011121 1.0000 0.0000 18.420681 0.001351 0.000006 0.0 0.25 7.137579 7.532352 2.320725 -2.493482 12.864960 1.442006 0.040416 0.537601 0.038684 14.993924 0.059869 True
WBGene00001564 1.0000 0.0000 18.420681 0.000111 0.002672 0.0 0.25 -4.625587 -4.613111 1.482866 -8.969914 2.663394 0.102955 18.479753 0.092211 0.981524 1.170010 26.443811 True
WBGene00007286 0.9998 0.0002 8.516943 0.001115 0.000006 0.0 0.25 7.580273 7.694222 2.170473 -1.519025 13.688978 1.427018 0.027136 0.495524 0.025404 11.854581 0.041313 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
WBGene00021784 0.5892 0.4108 0.360659 0.000122 0.000121 0.0 0.25 0.035663 0.035655 0.463000 -1.587824 1.548220 0.148837 0.860279 0.127574 0.530023 1.195154 1.191352 False
WBGene00011859 0.5874 0.4126 0.353227 0.000135 0.000117 0.0 0.25 0.206646 0.180473 0.500783 -1.575585 3.873812 0.141899 0.824482 0.122426 0.515589 1.178066 1.153581 False
WBGene00017829 0.5840 0.4160 0.339216 0.000110 0.000100 0.0 0.25 0.130990 0.119615 0.498566 -1.895624 1.927452 0.125112 0.711317 0.103626 0.464203 1.063442 1.014998 False
WBGene00012693 0.5700 0.4300 0.281851 0.000090 0.000083 0.0 0.25 0.157164 0.100334 0.576234 -1.909251 3.791691 0.111684 0.601039 0.094897 0.405889 0.919914 0.803560 False
WBGene00022702 0.5294 0.4706 0.117736 0.000062 0.000062 0.0 0.25 -0.010734 -0.002976 0.488402 -2.254934 2.250859 0.075649 0.457272 0.068039 0.340647 0.639974 0.621126 False

4000 rows × 19 columns

This method returns a pandas DataFrame, where each row corresponds to a gene. The most important columns of this dataframe are the following. proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE; lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1; is_de_fdr_0.05 is True when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target in the differential_expression method.

Main hyperparameters#

Additional parameters can be tuned in specific applications.

  1. weights, which can correspond to ”uniform” or ”importance” specify how normalized gene expressions in the two subpopulations are computed. The ”importance” flavor is specifically designed to provide FDR-calibrated gene sets, but may be overkill if you only aim to rank genes (via their LFC for instance).

  2. filter_outlier_cells filters out outlier cells prior to computing normalized gene expressions. It is important to set this parameter to True when weights='importance'.

  3. delta allows to specify \(\delta\), which is used to detect differentially expressed genes. It can also be set automatically using delta=None

  4. pseudocounts, which is zero by default, but can be set to a small value (e.g., 1e-6) when many detected DE genes are seldom expressed in the compared populations.

  5. batch_correction, which should be set to True to account for batch effects. This only makes sense when idx1 and idx2 denote cells coming from overlapping batches.

The exact function of these parameters is described here.

Volcano plot of change mode DE with p-values#

de_change_uniform = model.differential_expression(
    idx1=cell_idx1,  # we use the same cells as chosen before
    idx2=cell_idx2,
    weights="uniform",
    batch_correction=True,
)

# manipulate the DE results for plotting
de_change_uniform["log10_pscore"] = np.log10(de_change_uniform["proba_not_de"])
de_change_uniform = de_change_uniform.join(adata.var, how="inner")
de_change_uniform.head()
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... log10_pscore gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
index
WBGene00004372 1.0000 0.0000 18.420681 0.001047 0.000006 0.0 0.25 7.527723 7.624940 1.684090 ... -inf WBGene00004372 rig-5 Is an ortholog of human LSAMP (limbic system a... True 0.928574 0.814163 1 0.1740 18665.0
WBGene00004258 1.0000 0.0000 18.420681 0.000026 0.000409 0.0 0.25 -4.006417 -4.047699 0.885588 ... -inf WBGene00004258 pyc-1 Is an ortholog of human PC (pyruvate carboxyla... True 0.930893 0.897955 1 0.0925 16361.0
WBGene00004949 1.0000 0.0000 18.420681 0.000319 0.000009 0.0 0.25 5.401074 5.432315 1.396228 ... -inf WBGene00004949 sox-2 Is an ortholog of human SOX14 (SRY-box transcr... True 0.899867 0.847891 1 0.1398 17906.0
WBGene00013693 0.9998 0.0002 8.516543 0.000011 0.000518 0.0 0.25 -5.675631 -5.665921 1.484586 ... -3.698796 WBGene00013693 Y105E8B.9 Is an ortholog of human GUSB (glucuronidase be... True 0.962698 0.903186 1 0.0963 16530.0
WBGene00011328 0.9998 0.0002 8.516543 0.000384 0.000005 0.0 0.25 6.410315 6.412659 1.846124 ... -3.698796 WBGene00011328 T01D3.3 Is predicted to have metal ion binding activit... True 0.925530 0.848358 1 0.1433 18009.0

5 rows × 29 columns

de_change_importance = model.differential_expression(
    idx1=cell_idx1,  # we use the same cells as chosen before
    idx2=cell_idx2,
    weights="importance",
    filter_outlier_cells=True,
    batch_correction=True,
)

# manipulate the DE results for plotting
de_change_importance["log10_pscore"] = np.log10(de_change_importance["proba_not_de"])
de_change_importance = de_change_importance.join(adata.var, how="inner")
de_comp = pd.concat(
    [
        de_change_importance.assign(flavor="importance"),
        de_change_uniform.assign(flavor="uniform"),
    ]
)
de_comp["gene_type"] = "Other"
de_comp.loc[lambda x: x["gene_name"].str.contains("rpl-"), "gene_type"] = "RPL"
de_comp.loc[lambda x: x["gene_name"].str.contains("ceh-"), "gene_type"] = "CEH"
de_comp.loc[lambda x: x["gene_name"].str.contains("flp-"), "gene_type"] = "FPL"

(
    p9.ggplot(de_comp, p9.aes("lfc_mean", "-log10_pscore", color="gene_type"))
    + p9.geom_point(
        de_comp.query("gene_type == 'Other'"), alpha=0.5
    )  # Plot other genes with transparence
    + p9.geom_point(de_comp.query("gene_type != 'Other'"))
    + p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
    + p9.facet_wrap("flavor")
)

Contrary to the uniform flavor, the importance flavor returns sharper posterior probability scores (posterior probability near 0 or 1, respectively in case of equal expression or differential expression).

Heatmap of top expressed genes#

Now we perform DE between each cell type vs all other cells and make a heatmap of the result.

# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1vsall results, contained in a DataFrame
change_per_cluster_de = model.differential_expression(groupby="cell_type")

We focus on cell-types with at least 500 cells, and which have annotations to facilitate heatmap visualization

cell_types = (
    adata.obs["cell_type"]
    .value_counts()
    # .loc[lambda x: (x >= 500) & (x.index != "nan")]
    .loc[lambda x: x.index != "nan"]
    .to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)["index"]
    .unique()
)
adata_log = adata[adata.obs.cell_type.isin(cell_types.index.values)].copy()
sc.pp.normalize_total(adata_log)
sc.pp.log1p(adata_log)
sc.pl.dotplot(adata_log, marker_genes, groupby="cell_type")