Differential expression on C. elegans data#

This notebook was contributed by Eduardo Beltrame @Munfred and edited by Romain Lopez, Adam Gayoso, and Pierre Boyeau.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89,701 cells and 20,222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions

  2. Training the model with batch labels for integration with scVI

  3. Retrieving the scVI latent space and imputed values

  4. Visualize the latent space with an interactive t-SNE plot using Plotly

  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Uncomment the following lines in Google Colab in order to install scvi-tools:

# !pip install --quiet scvi-colab
# from scvi_colab import install

# install()
import os
import tempfile
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotnine as p9
import scanpy as sc
import scvi
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.0.3

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(4, 4), frameon=False)
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()
warnings.simplefilter("ignore")

%config InlineBackend.print_figure_kwargs={"facecolor" : "w"}
%config InlineBackend.figure_format="retina"
adata_path = os.path.join(save_dir.name, "packer2019.h5ad")

adata = sc.read(
    adata_path,
    backup_url="https://github.com/Munfred/wormcells-site/releases/download/packer2019/packer2019.h5ad",
)
adata
AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'

Take a look at the gene descriptions#

The gene descriptions were taken using the WormBase API.

display(
    adata.var.head().style.set_properties(
        subset=["gene_description"], **{"width": "600px"}
    )
)
  gene_id gene_name gene_description
index      
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.

Selecting genes and loading data#

We use the utility scvi.data.poisson_gene_selection to select genes according to their dropout rate, which is a simple and scalable approach to select genes.

This method was described by Andrews & Hemberg in the article M3Drop: dropout-based feature selection for scRNASeq : https://academic.oup.com/bioinformatics/article/35/16/2865/5258099

This method modifies the adata to add the following fields:

highly_variable                   # boolean true for chosen genes
observed_fraction_zeros	       # fraction of observed zeros per gene
expected_fraction_zeros	       # expected fraction of observed zeros per gene
prob_zero_enriched_nbatches	   # If batch_key is given, this denotes in how many batches genes are detected as zero enriched
prob_zero_enrichment	          # Probability of zero enrichment, median across batches in the case of multiple batches
prob_zero_enrichment_rank         # Rank of the gene according to probability of zero enrichment

Note

Gene selection is an important step to obtain relevant cell representations with scVI. Generally, selecting the top few thousands top-ranking genes predicted by a gene selection tool suffice to obtain good performance.

Increasing the number of selected genes may be required in some applications, e.g., to increase the number of considered genes for differential expression. Note, however, that this will increase the time required to reach convergence and GPU memory load. It may also require to tune scVI’s model hyperparameters (see the autotune tutorial)

Alternatives to the Poisson gene selection can be used, e.g., via scanpy or seurat.

scvi.data.poisson_gene_selection(adata)
adata.var.head()
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 21695.95it/s]
gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16;... False 0.075685 0.002028 0 0.0762 15661.0
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies.... True 0.659680 0.623356 1 0.2502 19567.0
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondriall... True 0.201993 0.046608 1 0.1886 18907.0
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transport... True 0.138081 0.001848 1 0.1381 17859.0
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; ... True 0.468434 0.383116 1 0.2860 19800.0
adata = adata[:, adata.var["highly_variable"]]  # focus on selected genes
adata.layers[
    "counts"
] = adata.X.copy().tocsr()  # converts to CSR format, preserve counts

scvi.model.SCVI.setup_anndata(
    adata, layer="counts", batch_key="batch"
)  # prepare data for scVI

Define and train the model#

model = scvi.model.SCVI(
    adata, gene_likelihood="nb"
)  # We use Negative Binomial count likelihoods, following Boyeau et al., 2023.
model.train(
    check_val_every_n_epoch=1,
    max_epochs=400,
    early_stopping=True,
    early_stopping_patience=20,
    early_stopping_monitor="elbo_validation",
)
Epoch 338/400:  84%|████████▍ | 338/400 [16:57<03:06,  3.01s/it, v_num=1, train_loss_step=2.16e+3, train_loss_epoch=2.06e+3]
Monitored metric elbo_validation did not improve in the last 20 records. Best score: 2077.445. Signaling Trainer to stop.
# Ensure convergence
train_test_results = model.history["elbo_train"]
train_test_results["elbo_validation"] = model.history["elbo_validation"]
train_test_results.iloc[10:].plot(logy=True)  # exclude first 10 epochs
plt.show()
../../../_images/cab4c89dd11f04735a9e2d9aecc7f3c044c187f1600d3f3e8eb898a20cda00c7.png

Get the latent space and compute UMAP#

SCVI_LATENT_KEY = "X_scVI"

latent = model.get_latent_representation()
adata.obsm[SCVI_LATENT_KEY] = latent
sc.pp.neighbors(adata, use_rep=SCVI_LATENT_KEY)
sc.tl.umap(adata)
sc.pl.umap(adata, color="cell_type")
../../../_images/d80c4c447e0bb8a0c525eae9ef279b920c9d16eacdd4ca5351d80a0cb809c8b5.png

Performing Differential Expression in scVI#

Differential expression (DE) analysis is used to quantify the differences in gene expression across subpopulations of genes. If we have two sets of cells \(A\) and \(B\), a DE test is typically used to predict two things for each gene:

  1. Whether the gene’s expression levels are significantly different in the A and B sets of cells.

  2. An effect size that quantifies the strength of the differential expression.

Once trained, scVI can natively perform both of these tasks. Additionally, its differential expression module can account for batch effects and filter DE genes expected to be of little relevance.

Selecting cell subpopulations to compare#

# let's take a look at abundances of different cell types
adata.obs["cell_type"].value_counts()
cell_type
nan                               35052
Body_wall_muscle                  17520
Hypodermis                         7746
Ciliated_amphid_neuron             6090
Ciliated_non_amphid_neuron         4468
Seam_cell                          2766
Pharyngeal_muscle                  2562
Glia                               1857
Intestine                          1732
Pharyngeal_neuron                  1471
Pharyngeal_marginal_cell            911
Coelomocyte                         787
Pharyngeal_gland                    786
GLR                                 768
Intestinal_and_rectal_muscle        568
Germline                            499
Pharyngeal_intestinal_valve         493
Arcade_cell                         434
Z1_Z4                               372
Rectal_cell                         327
M_cell                              315
ABarpaaa_lineage                    273
Rectal_gland                        265
Excretory_cell                      215
Excretory_gland                     205
hmc                                 189
hmc_homolog                         155
T                                   141
hmc_and_homolog                     122
Parent_of_exc_gland_AVK             114
hyp1V_and_ant_arc_V                 112
Excretory_duct_and_pore              91
Parent_of_hyp1V_and_ant_arc_V        75
G2_and_W_blasts                      72
Excretory_cell_parent                62
Parent_of_exc_duct_pore_DB_1_3       61
XXX                                  25
Name: count, dtype: int64

scVI provides several options to identify the two populations of interest.

cell_type_1 = "Ciliated_non_amphid_neuron"
cell_idx1 = adata.obs["cell_type"] == cell_type_1
print(sum(cell_idx1), "cells of type", cell_type_1)

cell_type_2 = "Intestine"
cell_idx2 = adata.obs["cell_type"] == cell_type_2
print(sum(cell_idx2), "cells of type", cell_type_2)

# or equivalently, provide a string of the form "my_celltype_column == 'desired_celltype'"
# cell_idx1 = "cell_type == 'Ciliated_non_amphid_neuron'"
# cell_idx2 = "cell_type == 'Intestine'"
4468 cells of type Ciliated_non_amphid_neuron
1732 cells of type Intestine

Basic principle#

DE and log fold-changes#

scVI can natively be used to perform differential expression analyses to compare populations of cells, as described here and there. It achieves this by estimating the posterior distribution of the log fold-change (LFC) between subpopulations \(A\) and \(B\). Specifically, for a given gene g, scVI calculates the LFC as the difference between the logarithm of its expression level in population A, denoted by \(h_g^A\), and the logarithm of its expression level in population B, denoted by \(h_g^B\).

The resulting value, denoted by \(\beta_g\), provides insights into the expression patterns of gene \(g\). Values close to zero indicate that the gene is expressed similarly in both populations, positive values suggest upregulation in population A and negative values indicate downregulation in population A. This information can be used to better understand the biological mechanisms underlying the differences between the two cell populations.

DE testing#

In addition to estimating the LFC, scVI can also detect which genes have significant expression patterns. To tag which genes are differentially expressed, scVI tests the following competing hypotheses \(M_{1, g}: \beta_g \in [-\delta, \delta]\) and \(M_{2, g}: \beta_g \in (-\infty, -\delta) \cup (\delta, \infty)\). Here, \(\delta\) denotes a small LFC threshold, such that \(\beta_g \in [-\delta, \delta]\) is evidence that the gene is equally expressed in the two subpopulations.

Differentially expressed genes are identified by computing the posterior probability of \(M_{2, g}\).

Running DE analyses#

Running and understanding a DE run#

A simple DE analysis can then be performed using the following command

de_change = model.differential_expression(idx1=cell_idx1, idx2=cell_idx2)
de_change
DE...: 100%|██████████| 1/1 [00:00<00:00,  1.24it/s]
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std lfc_min lfc_max raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05
index
WBGene00007531 1.0000 0.0000 18.420681 0.000698 0.000028 0.0 0.25 4.795135 4.808759 1.011883 -0.962938 7.874225 0.633173 0.180138 0.419651 0.151848 5.617295 0.279113 True
WBGene00007286 1.0000 0.0000 18.420681 0.001188 0.000004 0.0 0.25 8.008764 8.104925 2.057703 -2.602439 14.755898 1.427018 0.027136 0.495524 0.025404 11.854699 0.041317 True
WBGene00019060 0.9998 0.0002 8.516943 0.000020 0.000500 0.0 0.25 -4.760843 -4.810782 1.257221 -8.972113 3.627006 0.015443 3.243659 0.012981 0.798499 0.168669 4.161954 True
WBGene00009741 0.9998 0.0002 8.516943 0.000030 0.001270 0.0 0.25 -5.677362 -5.777770 1.342877 -9.842583 1.260775 0.032005 8.352755 0.028424 0.979215 0.356610 12.143387 True
WBGene00011121 0.9998 0.0002 8.516943 0.001362 0.000006 0.0 0.25 7.046995 7.553209 2.198702 -1.141126 11.610086 1.442006 0.040416 0.537601 0.038684 14.994382 0.059873 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
WBGene00021783 0.5964 0.4036 0.390487 0.000066 0.000071 0.0 0.25 -0.074189 -0.123168 0.566793 -2.223165 3.320147 0.079902 0.545033 0.071844 0.393764 0.644058 0.762247 False
WBGene00019401 0.5920 0.4080 0.372239 0.000349 0.000343 0.0 0.25 0.063715 0.007353 0.486871 -1.737333 2.928462 0.455240 2.608553 0.335721 0.830254 3.878721 3.532027 False
WBGene00022044 0.5514 0.4486 0.206329 0.000173 0.000156 0.0 0.25 0.190751 0.132852 0.554910 -1.281088 3.584072 0.189124 1.164550 0.163608 0.618938 1.599760 1.586375 False
WBGene00011116 0.5432 0.4568 0.173232 0.000290 0.000261 0.0 0.25 0.146163 0.158200 0.382986 -1.415763 1.717790 0.356761 1.845267 0.275067 0.763279 2.979073 2.603784 False
WBGene00011859 0.5008 0.4992 0.003200 0.000118 0.000114 0.0 0.25 0.056651 0.042266 0.477081 -1.702410 2.852151 0.141899 0.824482 0.122426 0.515589 1.178068 1.153656 False

4000 rows × 19 columns

This method returns a pandas DataFrame, where each row corresponds to a gene. The most important columns of this dataframe are the following. proba_de, which captures the posterior probability of \(M_{2, g}\) that the gene is differentially expressed. Values close to one indicate that the gene is DE; lfc_mean and lfc_median, respectively denoting the mean and the median of the posterior distribution of \(\beta_g\). Positive values of the LFC signify that the gene is upregulated in idx1; is_de_fdr_0.05 is True when the gene is tagged DE after FDR correction at target level \(\alpha=0.05\). The target level can be adjusted by specifying fdr_target in the differential_expression method.

Main hyperparameters#

Additional parameters can be tuned in specific applications.

  1. weights, which can correspond to ”uniform” or ”importance” specify how normalized gene expressions in the two subpopulations are computed. The ”importance” flavor is specifically designed to provide FDR-calibrated gene sets, but may be overkill if you only aim to rank genes (via their LFC for instance).

  2. filter_outlier_cells filters out outlier cells prior to computing normalized gene expressions. It is important to set this parameter to True when weights='importance'.

  3. delta allows to specify \(\delta\), which is used to detect differentially expressed genes. It can also be set automatically using delta=None

  4. pseudocounts, which is zero by default, but can be set to a small value (e.g., 1e-6) when many detected DE genes are seldom expressed in the compared populations.

  5. batch_correction, which should be set to True to account for batch effects. This only makes sense when idx1 and idx2 denote cells coming from overlapping batches.

The exact function of these parameters is described here.

Volcano plot of change mode DE with p-values#

de_change_uniform = model.differential_expression(
    idx1=cell_idx1,  # we use the same cells as chosen before
    idx2=cell_idx2,
    weights="uniform",
    batch_correction=True,
)

# manipulate the DE results for plotting
de_change_uniform["log10_pscore"] = np.log10(de_change_uniform["proba_not_de"])
de_change_uniform = de_change_uniform.join(adata.var, how="inner")
de_change_uniform.head()
DE...: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]
proba_de proba_not_de bayes_factor scale1 scale2 pseudocounts delta lfc_mean lfc_median lfc_std ... log10_pscore gene_id gene_name gene_description highly_variable observed_fraction_zeros expected_fraction_zeros prob_zero_enriched_nbatches prob_zero_enrichment prob_zero_enrichment_rank
index
WBGene00009987 0.9998 0.0002 8.516543 0.000599 0.000019 0.0 0.25 4.217998 4.238795 1.678915 ... -3.698796 WBGene00009987 tbcb-1 Is an ortholog of human TBCB (tubulin folding ... True 0.912008 0.892268 1 0.0963 16532.0
WBGene00011121 0.9998 0.0002 8.516543 0.001347 0.000007 0.0 0.25 6.837592 7.433930 2.182874 ... -3.698796 WBGene00011121 R07E5.17 Is enriched in nervous system based on tiling ... True 0.895821 0.814998 1 0.1710 18613.0
WBGene00013716 0.9998 0.0002 8.516543 0.000012 0.000676 0.0 0.25 -6.081238 -6.136255 1.638618 ... -3.698796 WBGene00013716 Y106G6H.1 Is affected by several genes including daf-16;... True 0.973166 0.892223 1 0.0989 16640.0
WBGene00001186 0.9998 0.0002 8.516543 0.000383 0.000008 0.0 0.25 5.868375 5.906182 1.080143 ... -3.698796 WBGene00001186 egl-18 Is predicted to have DNA-binding transcription... True 0.873201 0.839800 1 0.1397 17900.0
WBGene00002025 0.9998 0.0002 8.516543 0.000039 0.000546 0.0 0.25 -4.003997 -4.082103 1.043394 ... -3.698796 WBGene00002025 hsp-60 Is an ortholog of human HSPD1 (heat shock prot... True 0.836000 0.771438 1 0.1881 18897.0

5 rows × 29 columns

de_change_importance = model.differential_expression(
    idx1=cell_idx1,  # we use the same cells as chosen before
    idx2=cell_idx2,
    weights="importance",
    filter_outlier_cells=True,
    batch_correction=True,
)

# manipulate the DE results for plotting
de_change_importance["log10_pscore"] = np.log10(de_change_importance["proba_not_de"])
de_change_importance = de_change_importance.join(adata.var, how="inner")
DE...: 100%|██████████| 1/1 [00:53<00:00, 53.24s/it]
de_comp = pd.concat(
    [
        de_change_importance.assign(flavor="importance"),
        de_change_uniform.assign(flavor="uniform"),
    ]
)
de_comp["gene_type"] = "Other"
de_comp.loc[lambda x: x["gene_name"].str.contains("rpl-"), "gene_type"] = "RPL"
de_comp.loc[lambda x: x["gene_name"].str.contains("ceh-"), "gene_type"] = "CEH"
de_comp.loc[lambda x: x["gene_name"].str.contains("flp-"), "gene_type"] = "FPL"

(
    p9.ggplot(de_comp, p9.aes("lfc_mean", "-log10_pscore", color="gene_type"))
    + p9.geom_point(
        de_comp.query("gene_type == 'Other'"), alpha=0.5
    )  # Plot other genes with transparence
    + p9.geom_point(de_comp.query("gene_type != 'Other'"))
    + p9.labs(x="LFC mean", y="Significance score (higher is more significant)")
    + p9.facet_wrap("flavor")
)
../../../_images/59081b2d4f9207d49927d8b8f5ff9e5dbb798299a73d4086b04d2a7774709132.png
<Figure Size: (640 x 480)>

Contrary to the uniform flavor, the importance flavor returns sharper posterior probability scores (posterior probability near 0 or 1, respectively in case of equal expression or differential expression).

Heatmap of top expressed genes#

Now we perform DE between each cell type vs all other cells and make a heatmap of the result.

# here we do a 1-vs-all DE test, which compares each cell type with all others
# this returns the concatenation of all 1vsall results, contained in a DataFrame
change_per_cluster_de = model.differential_expression(groupby="cell_type")
DE...: 100%|██████████| 37/37 [02:11<00:00,  3.55s/it]

We focus on cell-types with at least 500 cells, and which have annotations to facilitate heatmap visualization

cell_types = (
    adata.obs["cell_type"]
    .value_counts()
    # .loc[lambda x: (x >= 500) & (x.index != "nan")]
    .loc[lambda x: x.index != "nan"]
    .to_frame("n_cells")
)
cell_types.loc[:, "associated_test"] = cell_types.index.astype(str) + " vs Rest"
# This cell extracts list of top 5 upregulated genes for every cell-type
marker_genes = (
    change_per_cluster_de.reset_index()
    .loc[lambda x: x.comparison.isin(cell_types.associated_test.values)]
    .groupby("comparison")
    .apply(
        lambda x: x.sort_values("lfc_mean", ascending=False).iloc[:5]
    )  # Select top 5 DE genes per comparison
    .reset_index(drop=True)["index"]
    .unique()
)
adata_log = adata[adata.obs.cell_type.isin(cell_types.index.values)].copy()
sc.pp.normalize_total(adata_log)
sc.pp.log1p(adata_log)
sc.pl.dotplot(adata_log, marker_genes, groupby="cell_type")
../../../_images/9c53635d052e06450a516c17ce53ddd77b7b0f13105b22edaf81dc867edcf625.png