Identification of zero-inflated genes#

AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.

More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

import tempfile

import numpy as np
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scipy.stats import beta
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.0

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"

Imports, data loading and preparation#

adata = scvi.data.pbmc_dataset(save_path=save_dir.name)
adata.layers["counts"] = adata.X.copy()
scvi.data.poisson_gene_selection(
    adata,
    n_top_genes=1000,
    batch_key="batch",
    subset=True,
    layer="counts",
)
scvi.model.AUTOZI.setup_anndata(
    adata,
    labels_key="str_labels",
    batch_key="batch",
    layer="counts",
)
INFO     Downloading file at /tmp/tmpz2r8k26v/gene_info_pbmc.csv                                                   
Downloading...: 909it [00:00, 22863.34it/s]              
INFO     Downloading file at /tmp/tmpz2r8k26v/pbmc_metadata.pickle                                                 
Downloading...: 4001it [00:00, 53094.64it/s]              
INFO     Downloading file at /tmp/tmpz2r8k26v/pbmc8k/filtered_gene_bc_matrices.tar.gz                              
Downloading...: 37559it [00:00, 113542.09it/s]                             
INFO     Extracting tar file                                                                                       
INFO     Removing extracted data at /tmp/tmpz2r8k26v/pbmc8k/filtered_gene_bc_matrices                              
INFO     Downloading file at /tmp/tmpz2r8k26v/pbmc4k/filtered_gene_bc_matrices.tar.gz                              
Downloading...: 100%|██████████| 18424/18424.0 [00:00<00:00, 98338.49it/s] 
INFO     Extracting tar file                                                                                       
INFO     Removing extracted data at /tmp/tmpz2r8k26v/pbmc4k/filtered_gene_bc_matrices                              
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 17726.14it/s]
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 18625.46it/s]

Analyze gene-specific ZI#

In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior), \(\beta\) (beta_prior), \(\tau_{\text{dropout}}\) (minimal_dropout). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).

Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None

model = scvi.model.AUTOZI(adata)

We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.

model.train(max_epochs=200, plan_kwargs={"lr": 1e-2})
Epoch 200/200: 100%|██████████| 200/200 [02:12<00:00,  1.51it/s, v_num=1, train_loss_step=7.22e+6, train_loss_epoch=7.76e+6]
outputs = model.get_alphas_betas()
alpha_posterior = outputs["alpha_posterior"]
beta_posterior = outputs["beta_posterior"]

Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)

# ZI genes
is_zi_pred = zi_probs > threshold

print("Fraction of predicted ZI genes :", is_zi_pred.mean())
Fraction of predicted ZI genes : 0.843

We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.

mask_sufficient_expression = (np.array(adata.X.mean(axis=0)) > 1.0).reshape(-1)
print("Fraction of genes with avg expression > 1 :", mask_sufficient_expression.mean())
print(
    "Fraction of predicted ZI genes with avg expression > 1 :",
    is_zi_pred[mask_sufficient_expression].mean(),
)
Fraction of genes with avg expression > 1 : 0.104
Fraction of predicted ZI genes with avg expression > 1 : 0.49038461538461536

Analyze gene-cell-type-specific ZI#

One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.

# Model definition
model_genelabel = scvi.model.AUTOZI(
    adata, dispersion="gene-label", zero_inflation="gene-label"
)

# Training
model_genelabel.train(max_epochs=200, plan_kwargs={"lr": 1e-2})

# Retrieve posterior distribution parameters
outputs_genelabel = model_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel["alpha_posterior"]
beta_posterior_genelabel = outputs_genelabel["beta_posterior"]
Epoch 200/200: 100%|██████████| 200/200 [02:14<00:00,  1.49it/s, v_num=1, train_loss_step=7.65e+6, train_loss_epoch=7.76e+6]
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5, alpha_posterior_genelabel, beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = zi_probs_genelabel > threshold

ct = adata.obs.str_labels.astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats):
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[:, ind_cell_type]
    print(
        f"Fraction of predicted ZI genes for cell type {cell_type} :",
        is_zi_pred_genelabel_here.mean(),
        "\n",
    )
Fraction of predicted ZI genes for cell type B cells : 0.603 

Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.613 

Fraction of predicted ZI genes for cell type CD4 T cells : 0.584 

Fraction of predicted ZI genes for cell type CD8 T cells : 0.551 

Fraction of predicted ZI genes for cell type Dendritic Cells : 0.608 

Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.566 

Fraction of predicted ZI genes for cell type Megakaryocytes : 0.599 

Fraction of predicted ZI genes for cell type NK cells : 0.51 

Fraction of predicted ZI genes for cell type Other : 0.683 
# With avg expressions > 1
for ind_cell_type, cell_type in zip(codes, cats):
    mask_sufficient_expression = (
        np.array(
            adata.X[adata.obs.str_labels.values.reshape(-1) == cell_type, :].mean(
                axis=0
            )
        )
        > 1.0
    ).reshape(-1)
    print(
        f"Fraction of genes with avg expression > 1 for cell type {cell_type} :",
        mask_sufficient_expression.mean(),
    )
    is_zi_pred_genelabel_here = is_zi_pred_genelabel[
        mask_sufficient_expression, ind_cell_type
    ]
    print(
        f"Fraction of predicted ZI genes with avg expression > 1 for cell type {cell_type} :",
        is_zi_pred_genelabel_here.mean(),
        "\n",
    )
Fraction of genes with avg expression > 1 for cell type B cells : 0.044
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.25 

Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.123
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.21951219512195122 

Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.066
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.22727272727272727 

Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.08
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.275 

Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.397
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.43073047858942065 

Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.234
Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.3247863247863248 

Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.231
Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.4155844155844156 

Fraction of genes with avg expression > 1 for cell type NK cells : 0.106
Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.2641509433962264 

Fraction of genes with avg expression > 1 for cell type Other : 0.21
Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.4857142857142857