Identification of zero-inflated genes#

AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.

More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

import tempfile

import numpy as np
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scipy.stats import beta
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.2.0

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

Imports, data loading and preparation#

adata = scvi.data.pbmc_dataset(save_path=save_dir.name)
adata.layers["counts"] = adata.X.copy()
scvi.data.poisson_gene_selection(
    adata,
    n_top_genes=1000,
    batch_key="batch",
    subset=True,
    layer="counts",
)
scvi.model.AUTOZI.setup_anndata(
    adata,
    labels_key="str_labels",
    batch_key="batch",
    layer="counts",
)
INFO     Downloading file at /tmp/tmp81rsicr0/gene_info_pbmc.csv
INFO     Downloading file at /tmp/tmp81rsicr0/pbmc_metadata.pickle
INFO     Downloading file at /tmp/tmp81rsicr0/pbmc8k/filtered_gene_bc_matrices.tar.gz
INFO     Extracting tar file
INFO     Removing extracted data at /tmp/tmp81rsicr0/pbmc8k/filtered_gene_bc_matrices
INFO     Downloading file at /tmp/tmp81rsicr0/pbmc4k/filtered_gene_bc_matrices.tar.gz
INFO     Extracting tar file
INFO     Removing extracted data at /tmp/tmp81rsicr0/pbmc4k/filtered_gene_bc_matrices

Analyze gene-specific ZI#

In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior), \(\beta\) (beta_prior), \(\tau_{\text{dropout}}\) (minimal_dropout). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).

Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None

model = scvi.model.AUTOZI(adata)

We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.

model.train(max_epochs=200, plan_kwargs={"lr": 1e-2})
outputs = model.get_alphas_betas()
alpha_posterior = outputs["alpha_posterior"]
beta_posterior = outputs["beta_posterior"]

Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)

# ZI genes
is_zi_pred = zi_probs > threshold

print("Fraction of predicted ZI genes :", is_zi_pred.mean())
Fraction of predicted ZI genes : 0.842

We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.

mask_sufficient_expression = (np.array(adata.X.mean(axis=0)) > 1.0).reshape(-1)
print("Fraction of genes with avg expression > 1 :", mask_sufficient_expression.mean())
print(
    "Fraction of predicted ZI genes with avg expression > 1 :",
    is_zi_pred[mask_sufficient_expression].mean(),
)
Fraction of genes with avg expression > 1 : 0.104
Fraction of predicted ZI genes with avg expression > 1 : 0.47115384615384615

Analyze gene-cell-type-specific ZI#

One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.

# Model definition
model_genelabel = scvi.model.AUTOZI(adata, dispersion="gene-label", zero_inflation="gene-label")

# Training
model_genelabel.train(max_epochs=200, plan_kwargs={"lr": 1e-2})

# Retrieve posterior distribution parameters
outputs_genelabel = model_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel["alpha_posterior"]
beta_posterior_genelabel = outputs_genelabel["beta_posterior"]
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5, alpha_posterior_genelabel, beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = zi_probs_genelabel > threshold

ct = adata.obs.str_labels.astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats, strict=False):
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:, ind_cell_type]
    print(
        f"Fraction of predicted ZI genes for cell type {cell_type} :",
        is_zi_pred_genelabel_here.mean(),
        "\n",
    )
Fraction of predicted ZI genes for cell type B cells : 0.587 

Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.622 

Fraction of predicted ZI genes for cell type CD4 T cells : 0.613 

Fraction of predicted ZI genes for cell type CD8 T cells : 0.527 

Fraction of predicted ZI genes for cell type Dendritic Cells : 0.587 

Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.539 

Fraction of predicted ZI genes for cell type Megakaryocytes : 0.616 

Fraction of predicted ZI genes for cell type NK cells : 0.524 

Fraction of predicted ZI genes for cell type Other : 0.648
# With avg expressions > 1
for ind_cell_type, cell_type in zip(codes, cats, strict=False):
    mask_sufficient_expression = (
        np.array(adata.X[adata.obs.str_labels.values.reshape(-1) == cell_type, :].mean(axis=0))
        > 1.0
    ).reshape(-1)
    print(
        f"Fraction of genes with avg expression > 1 for cell type {cell_type} :",
        mask_sufficient_expression.mean(),
    )
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression, ind_cell_type]
    print(
        f"Fraction of predicted ZI genes with avg expression > 1 for cell type {cell_type} :",
        is_zi_pred_genelabel_here.mean(),
        "\n",
    )
Fraction of genes with avg expression > 1 for cell type B cells : 0.044
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.2727272727272727 

Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.123
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.24390243902439024 

Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.066
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.21212121212121213 

Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.08
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.2875 

Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.397
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.41309823677581864 

Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.234
Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.34615384615384615 

Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.231
Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.45021645021645024 

Fraction of genes with avg expression > 1 for cell type NK cells : 0.106
Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.20754716981132076 

Fraction of genes with avg expression > 1 for cell type Other : 0.21
Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.42857142857142855