Note

This page was generated from AutoZI_tutorial.ipynb. Interactive online version: . Some tutorial content may look better in light mode.

# Identification of zero-inflated genes#

AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene $$g$$, AutoZI notably learns the distribution of a random variable $$\delta_g$$ which denotes the probability that gene $$g$$ is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.

More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2

[1]:

!pip install --quiet scvi-colab
from scvi_colab import install
install()


[2]:

import numpy as np
import pandas as pd
import anndata

import scanpy as sc
import scvi

Global seed set to 0

[3]:

pbmc = scvi.data.pbmc_dataset()
pbmc.layers["counts"] = pbmc.X.copy()
sc.pp.normalize_total(pbmc, target_sum=10e4)
sc.pp.log1p(pbmc)
pbmc.raw = pbmc
scvi.data.poisson_gene_selection(
pbmc,
n_top_genes=1000,
batch_key="batch",
subset=True,
layer="counts",
)
scvi.model.AUTOZI.setup_anndata(
pbmc,
labels_key="str_labels",
batch_key="batch",
layer="counts",
)

INFO     File data/gene_info_pbmc.csv already downloaded
INFO     Extracting tar file
INFO     Removing extracted data at data/pbmc4k/filtered_gene_bc_matrices
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 21548.71it/s]
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 22836.53it/s]


## Analyze gene-specific ZI#

In AutoZI, all $$\delta_g$$’s follow a common $$\text{Beta}(\alpha,\beta)$$ prior distribution where $$\alpha,\beta \in (0,1)$$ and the zero-inflation probability in the ZINB component is bounded below by $$\tau_{\text{dropout}} \in (0,1)$$. AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are $$\alpha$$ (alpha_prior), $$\beta$$ (beta_prior), $$\tau_{\text{dropout}}$$ (minimal_dropout). By default, we set $$\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01$$.

Note : we can learn $$\alpha,\beta$$ in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None

[4]:

vae = scvi.model.AUTOZI(pbmc)


We fit, for each gene $$g$$, an approximate posterior distribution $$q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)$$ (with $$\alpha^g,\beta^g \in (0,1)$$) on which we rely. We retrieve $$\alpha^g,\beta^g$$ for all genes $$g$$ (and $$\alpha,\beta$$, if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.

[5]:

vae.train(max_epochs=200, plan_kwargs = {'lr':1e-2})

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Epoch 200/200: 100%|██████████| 200/200 [02:30<00:00,  1.33it/s, loss=724, v_num=1]

[6]:

outputs = vae.get_alphas_betas()
alpha_posterior = outputs['alpha_posterior']
beta_posterior = outputs['beta_posterior']


Now that we obtained fitted $$\alpha^g,\beta^g$$, different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis $$q(\delta_g < 0.5)$$, but also other metrics such as the mean wrt $$q$$ of $$\delta_g$$ are possible. We focus on the former. We decide that gene $$g$$ is ZI if and only if $$q(\delta_g < 0.5)$$ is greater than a given threshold, say $$0.5$$. We may note that it is equivalent to $$\alpha^g < \beta^g$$. From this we can deduce the fraction of predicted ZI genes in the dataset.

[7]:

from scipy.stats import beta

# Threshold (or Kzinb/Knb+Kzinb in paper)
threshold = 0.5

# q(delta_g < 0.5) probabilities
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)

# ZI genes
is_zi_pred = (zi_probs > threshold)

print('Fraction of predicted ZI genes :', is_zi_pred.mean())

Fraction of predicted ZI genes : 0.446


We noted that predictions were less accurate for genes $$g$$ whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.

[8]:

mask_sufficient_expression = (np.array(pbmc.X.mean(axis=0)) > 1.).reshape(-1)
print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean())
print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())

Fraction of genes with avg expression > 1 : 0.498
Fraction of predicted ZI genes with avg expression > 1 : 0.40963855421686746


## Analyze gene-cell-type-specific ZI#

One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable $$\delta_{gc}$$ for each gene $$g$$ and cell type $$c$$ which denotes the probability that gene $$g$$ is not zero-inflated in cell-type $$c$$. The analysis above can be extended to this new scale.

[9]:

# Model definition
vae_genelabel = scvi.model.AUTOZI(
pbmc,
dispersion='gene-label',
zero_inflation='gene-label'
)

# Training
vae_genelabel.train(max_epochs=200, plan_kwargs = {'lr':1e-2})

# Retrieve posterior distribution parameters
outputs_genelabel = vae_genelabel.get_alphas_betas()
alpha_posterior_genelabel = outputs_genelabel['alpha_posterior']
beta_posterior_genelabel = outputs_genelabel['beta_posterior']

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Epoch 200/200: 100%|██████████| 200/200 [02:32<00:00,  1.31it/s, loss=731, v_num=1]

[10]:

# q(delta_g < 0.5) probabilities
zi_probs_genelabel = beta.cdf(0.5,alpha_posterior_genelabel, beta_posterior_genelabel)

# ZI gene-cell-types
is_zi_pred_genelabel = (zi_probs_genelabel > threshold)

ct = pbmc.obs.str_labels.astype("category")
codes = np.unique(ct.cat.codes)
cats = ct.cat.categories
for ind_cell_type, cell_type in zip(codes, cats):
is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type]
print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type),
is_zi_pred_genelabel_here.mean(),'\n')

Fraction of predicted ZI genes for cell type B cells : 0.46

Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.487

Fraction of predicted ZI genes for cell type CD4 T cells : 0.461

Fraction of predicted ZI genes for cell type CD8 T cells : 0.444

Fraction of predicted ZI genes for cell type Dendritic Cells : 0.457

Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.473

Fraction of predicted ZI genes for cell type Megakaryocytes : 0.437

Fraction of predicted ZI genes for cell type NK cells : 0.472

Fraction of predicted ZI genes for cell type Other : 0.46


[11]:

# With avg expressions > 1
for ind_cell_type, cell_type in zip(codes, cats):
mask_sufficient_expression = (np.array(pbmc.X[pbmc.obs.str_labels.values.reshape(-1) == cell_type,:].mean(axis=0)) > 1.).reshape(-1)
print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type),
print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type),
is_zi_pred_genelabel_here.mean(), '\n')

Fraction of genes with avg expression > 1 for cell type B cells : 0.389
Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.39845758354755784

Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.49
Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.46938775510204084

Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.432
Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.39814814814814814

Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.498
Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.41967871485943775

Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.859
Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.4342258440046566

Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.738
Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.44579945799457993

Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.746
Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.3967828418230563

Fraction of genes with avg expression > 1 for cell type NK cells : 0.575
Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.4643478260869565

Fraction of genes with avg expression > 1 for cell type Other : 0.785
Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.4585987261146497