Identification of zero-inflated genes#
AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.
More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2
Note
Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.
!pip install --quiet scvi-colab
from scvi_colab import install
install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
import tempfile
import numpy as np
import scanpy as sc
import scvi
import seaborn as sns
import torch
from scipy.stats import beta
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.2.0
Note
You can modify save_dir
below to change where the data files for this tutorial are saved.
sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()
%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"
Imports, data loading and preparation#
adata = scvi.data.pbmc_dataset(save_path=save_dir.name)
adata.layers["counts"] = adata.X.copy()
scvi.data.poisson_gene_selection(
adata,
n_top_genes=1000,
batch_key="batch",
subset=True,
layer="counts",
)
scvi.model.AUTOZI.setup_anndata(
adata,
labels_key="str_labels",
batch_key="batch",
layer="counts",
)
INFO Downloading file at /tmp/tmp81rsicr0/gene_info_pbmc.csv
INFO Downloading file at /tmp/tmp81rsicr0/pbmc_metadata.pickle
INFO Downloading file at /tmp/tmp81rsicr0/pbmc8k/filtered_gene_bc_matrices.tar.gz
INFO Extracting tar file
INFO Removing extracted data at /tmp/tmp81rsicr0/pbmc8k/filtered_gene_bc_matrices
INFO Downloading file at /tmp/tmp81rsicr0/pbmc4k/filtered_gene_bc_matrices.tar.gz
INFO Extracting tar file
INFO Removing extracted data at /tmp/tmp81rsicr0/pbmc4k/filtered_gene_bc_matrices
Analyze gene-specific ZI#
In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE
class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior
), \(\beta\) (beta_prior
), \(\tau_{\text{dropout}}\) (minimal_dropout
). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).
Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None
and beta_prior = None
model = scvi.model.AUTOZI(adata)
We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas
of AutoZIVAE
.
model.train(max_epochs=200, plan_kwargs={"lr": 1e-2})
outputs = model.get_alphas_betas()
alpha_posterior = outputs["alpha_posterior"]
beta_posterior = outputs["beta_posterior"]
Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.
# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5
# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)
# ZI genes
is_zi_pred = zi_probs > threshold
print("Fraction of predicted ZI genes :", is_zi_pred.mean())
Fraction of predicted ZI genes : 0.842
We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.
mask_sufficient_expression = (np.array(adata.X.mean(axis=0)) > 1.0).reshape(-1)
print("Fraction of genes with avg expression > 1 :", mask_sufficient_expression.mean())
print(
"Fraction of predicted ZI genes with avg expression > 1 :",
is_zi_pred[mask_sufficient_expression].mean(),
)
Fraction of genes with avg expression > 1 : 0.104
Fraction of predicted ZI genes with avg expression > 1 : 0.47115384615384615
Analyze gene-cell-type-specific ZI#
One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.
# Model definition
model_genelabel = scvi.model.AUTOZI(adata, dispersion="gene-label", zero_inflation="gene-label")
# Training
model_genelabel.train(max_epochs=200, plan_kwargs={"lr": 1e-2})
# Retrieve posterior distribution parameters
outputs_genelabel = model_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel["alpha_posterior"]
beta_posterior_genelabel = outputs_genelabel["beta_posterior"]
# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5, alpha_posterior_genelabel, beta_posterior_genelabel)
# ZI gene-cell-types
is_zi_pred_genelabel = zi_probs_genelabel > threshold
ct = adata.obs.str_labels.astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats, strict=False):
is_zi_pred_genelabel_here = is_zi_pred_genelabel[:, ind_cell_type]
print(
f"Fraction of predicted ZI genes for cell type {cell_type} :",
is_zi_pred_genelabel_here.mean(),
"\n",
)
Fraction of predicted ZI genes for cell type B cells : 0.587
Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.622
Fraction of predicted ZI genes for cell type CD4 T cells : 0.613
Fraction of predicted ZI genes for cell type CD8 T cells : 0.527
Fraction of predicted ZI genes for cell type Dendritic Cells : 0.587
Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.539
Fraction of predicted ZI genes for cell type Megakaryocytes : 0.616
Fraction of predicted ZI genes for cell type NK cells : 0.524
Fraction of predicted ZI genes for cell type Other : 0.648
# With avg expressions > 1
for ind_cell_type, cell_type in zip(codes, cats, strict=False):
mask_sufficient_expression = (
np.array(adata.X[adata.obs.str_labels.values.reshape(-1) == cell_type, :].mean(axis=0))
> 1.0
).reshape(-1)
print(
f"Fraction of genes with avg expression > 1 for cell type {cell_type} :",
mask_sufficient_expression.mean(),
)
is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression, ind_cell_type]
print(
f"Fraction of predicted ZI genes with avg expression > 1 for cell type {cell_type} :",
is_zi_pred_genelabel_here.mean(),
"\n",
)
Fraction of genes with avg expression > 1 for cell type B cells : 0.044
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.2727272727272727
Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.123
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.24390243902439024
Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.066
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.21212121212121213
Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.08
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.2875
Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.397
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.41309823677581864
Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.234
Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.34615384615384615
Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.231
Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.45021645021645024
Fraction of genes with avg expression > 1 for cell type NK cells : 0.106
Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.20754716981132076
Fraction of genes with avg expression > 1 for cell type Other : 0.21
Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.42857142857142855