Note
This page was generated from AutoZI_tutorial.ipynb. Interactive online version: .
AutoZI is a deep generative model adapted from scVI allowing a gene-specific treatment of zero-inflation. For each gene \(g\), AutoZI notably learns the distribution of a random variable \(\delta_g\) which denotes the probability that gene \(g\) is not zero-inflated. In this notebook, we present the use of the model on a PBMC dataset.
More details about AutoZI can be found in : https://www.biorxiv.org/content/10.1101/794875v2
[1]:
import sys #if branch is stable, will install via pypi, else will install from source branch = "stable" IN_COLAB = "google.colab" in sys.modules if IN_COLAB and branch == "stable": !pip install --quiet scvi-tools[tutorials] elif IN_COLAB and branch != "stable": !pip install --quiet --upgrade jsonschema !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]
[2]:
import numpy as np import pandas as pd import anndata import scanpy as sc import scvi
[3]:
pbmc = scvi.data.pbmc_dataset(run_setup_anndata=False) pbmc.layers["counts"] = pbmc.X.copy() sc.pp.normalize_total(pbmc, target_sum=10e4) sc.pp.log1p(pbmc) pbmc.raw = pbmc scvi.data.poisson_gene_selection( pbmc, n_top_genes=1000, batch_key="batch", subset=True, layer="counts", ) scvi.data.setup_anndata( pbmc, labels_key="str_labels", batch_key="batch", layer="counts", )
INFO File data/gene_info_pbmc.csv already downloaded INFO File data/pbmc_metadata.pickle already downloaded INFO File data/pbmc8k/filtered_gene_bc_matrices.tar.gz already downloaded INFO Extracting tar file INFO Removing extracted data at data/pbmc8k/filtered_gene_bc_matrices INFO File data/pbmc4k/filtered_gene_bc_matrices.tar.gz already downloaded INFO Extracting tar file INFO Removing extracted data at data/pbmc4k/filtered_gene_bc_matrices
/home/galen/.pyenv/versions/3.8.3/envs/scvi-dev/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2487: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_unused_categories is deprecated and will be removed in a future version. res = method(*args, **kwargs)
Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 11465.69it/s] Sampling from binomial...: 100%|██████████| 10000/10000 [00:00<00:00, 11546.41it/s] INFO Using batches from adata.obs["batch"] INFO Using labels from adata.obs["str_labels"] INFO Using data from adata.layers["counts"] INFO Computing library size prior per batch INFO Successfully registered anndata object containing 11990 cells, 1000 vars, 2 batches, 9 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates. INFO Please do not further modify adata until model is trained.
In AutoZI, all \(\delta_g\)’s follow a common \(\text{Beta}(\alpha,\beta)\) prior distribution where \(\alpha,\beta \in (0,1)\) and the zero-inflation probability in the ZINB component is bounded below by \(\tau_{\text{dropout}} \in (0,1)\). AutoZI is encoded by the AutoZIVAE class whose inputs, besides the size of the dataset, are \(\alpha\) (alpha_prior), \(\beta\) (beta_prior), \(\tau_{\text{dropout}}\) (minimal_dropout). By default, we set \(\alpha = 0.5, \beta = 0.5, \tau_{\text{dropout}} = 0.01\).
AutoZIVAE
alpha_prior
beta_prior
minimal_dropout
Note : we can learn \(\alpha,\beta\) in an Empirical Bayes fashion, which is possible by setting alpha_prior = None and beta_prior = None
alpha_prior = None
beta_prior = None
[4]:
vae = scvi.model.AUTOZI(pbmc)
We fit, for each gene \(g\), an approximate posterior distribution \(q(\delta_g) = \text{Beta}(\alpha^g,\beta^g)\) (with \(\alpha^g,\beta^g \in (0,1)\)) on which we rely. We retrieve \(\alpha^g,\beta^g\) for all genes \(g\) (and \(\alpha,\beta\), if learned) as numpy arrays using the method get_alphas_betas of AutoZIVAE.
get_alphas_betas
[5]:
vae.train(max_epochs=200, plan_kwargs = {'lr':1e-2})
GPU available: True, used: True TPU available: None, using: 0 TPU cores
Epoch 200/200: 100%|██████████| 200/200 [07:38<00:00, 2.29s/it, loss=726, v_num=1]
[6]:
outputs = vae.get_alphas_betas() alpha_posterior = outputs['alpha_posterior'] beta_posterior = outputs['beta_posterior']
Now that we obtained fitted \(\alpha^g,\beta^g\), different metrics are possible. Bayesian decision theory suggests us the posterior probability of the zero-inflation hypothesis \(q(\delta_g < 0.5)\), but also other metrics such as the mean wrt \(q\) of \(\delta_g\) are possible. We focus on the former. We decide that gene \(g\) is ZI if and only if \(q(\delta_g < 0.5)\) is greater than a given threshold, say \(0.5\). We may note that it is equivalent to \(\alpha^g < \beta^g\). From this we can deduce the fraction of predicted ZI genes in the dataset.
[7]:
from scipy.stats import beta # Threshold (or Kzinb/Knb+Kzinb in paper) threshold = 0.5 # q(delta_g < 0.5) probabilities zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior) # ZI genes is_zi_pred = (zi_probs > threshold) print('Fraction of predicted ZI genes :', is_zi_pred.mean())
Fraction of predicted ZI genes : 0.455
We noted that predictions were less accurate for genes \(g\) whose average expressions - or predicted NB means, equivalently - were low. Indeed, genes assumed not to be ZI were more often predicted as ZI for such low average expressions. A threshold of 1 proved reasonable to separate genes predicted with more or less accuracy. Hence we may want to focus on predictions for genes with average expression above 1.
[8]:
mask_sufficient_expression = (np.array(pbmc.X.mean(axis=0)) > 1.).reshape(-1) print('Fraction of genes with avg expression > 1 :', mask_sufficient_expression.mean()) print('Fraction of predicted ZI genes with avg expression > 1 :', is_zi_pred[mask_sufficient_expression].mean())
Fraction of genes with avg expression > 1 : 0.499 Fraction of predicted ZI genes with avg expression > 1 : 0.43887775551102204
One may argue that zero-inflation should also be treated on the cell-type (or ‘label’) level, in addition to the gene level. AutoZI can be extended by assuming a random variable \(\delta_{gc}\) for each gene \(g\) and cell type \(c\) which denotes the probability that gene \(g\) is not zero-inflated in cell-type \(c\). The analysis above can be extended to this new scale.
[9]:
# Model definition vae_genelabel = scvi.model.AUTOZI( pbmc, dispersion='gene-label', zero_inflation='gene-label' ) # Training vae_genelabel.train(max_epochs=200, plan_kwargs = {'lr':1e-2}) # Retrieve posterior distribution parameters outputs_genelabel = vae_genelabel.get_alphas_betas() alpha_posterior_genelabel = outputs_genelabel['alpha_posterior'] beta_posterior_genelabel = outputs_genelabel['beta_posterior']
Epoch 200/200: 100%|██████████| 200/200 [06:35<00:00, 1.98s/it, loss=732, v_num=1]
[10]:
# q(delta_g < 0.5) probabilities zi_probs_genelabel = beta.cdf(0.5,alpha_posterior_genelabel, beta_posterior_genelabel) # ZI gene-cell-types is_zi_pred_genelabel = (zi_probs_genelabel > threshold) ct = pbmc.obs.str_labels.astype("category") codes = np.unique(ct.cat.codes) cats = ct.cat.categories for ind_cell_type, cell_type in zip(codes, cats): is_zi_pred_genelabel_here = is_zi_pred_genelabel[:,ind_cell_type] print('Fraction of predicted ZI genes for cell type {} :'.format(cell_type), is_zi_pred_genelabel_here.mean(),'\n')
Fraction of predicted ZI genes for cell type B cells : 0.476 Fraction of predicted ZI genes for cell type CD14+ Monocytes : 0.486 Fraction of predicted ZI genes for cell type CD4 T cells : 0.462 Fraction of predicted ZI genes for cell type CD8 T cells : 0.448 Fraction of predicted ZI genes for cell type Dendritic Cells : 0.414 Fraction of predicted ZI genes for cell type FCGR3A+ Monocytes : 0.492 Fraction of predicted ZI genes for cell type Megakaryocytes : 0.451 Fraction of predicted ZI genes for cell type NK cells : 0.495 Fraction of predicted ZI genes for cell type Other : 0.462
[11]:
# With avg expressions > 1 for ind_cell_type, cell_type in zip(codes, cats): mask_sufficient_expression = (np.array(pbmc.X[pbmc.obs.str_labels.values.reshape(-1) == cell_type,:].mean(axis=0)) > 1.).reshape(-1) print('Fraction of genes with avg expression > 1 for cell type {} :'.format(cell_type), mask_sufficient_expression.mean()) is_zi_pred_genelabel_here = is_zi_pred_genelabel[mask_sufficient_expression,ind_cell_type] print('Fraction of predicted ZI genes with avg expression > 1 for cell type {} :'.format(cell_type), is_zi_pred_genelabel_here.mean(), '\n')
Fraction of genes with avg expression > 1 for cell type B cells : 0.39 Fraction of predicted ZI genes with avg expression > 1 for cell type B cells : 0.43846153846153846 Fraction of genes with avg expression > 1 for cell type CD14+ Monocytes : 0.491 Fraction of predicted ZI genes with avg expression > 1 for cell type CD14+ Monocytes : 0.46028513238289204 Fraction of genes with avg expression > 1 for cell type CD4 T cells : 0.433 Fraction of predicted ZI genes with avg expression > 1 for cell type CD4 T cells : 0.4503464203233256 Fraction of genes with avg expression > 1 for cell type CD8 T cells : 0.499 Fraction of predicted ZI genes with avg expression > 1 for cell type CD8 T cells : 0.43687374749499 Fraction of genes with avg expression > 1 for cell type Dendritic Cells : 0.861 Fraction of predicted ZI genes with avg expression > 1 for cell type Dendritic Cells : 0.4065040650406504 Fraction of genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.74 Fraction of predicted ZI genes with avg expression > 1 for cell type FCGR3A+ Monocytes : 0.4648648648648649 Fraction of genes with avg expression > 1 for cell type Megakaryocytes : 0.749 Fraction of predicted ZI genes with avg expression > 1 for cell type Megakaryocytes : 0.4152202937249666 Fraction of genes with avg expression > 1 for cell type NK cells : 0.575 Fraction of predicted ZI genes with avg expression > 1 for cell type NK cells : 0.4504347826086956 Fraction of genes with avg expression > 1 for cell type Other : 0.787 Fraction of predicted ZI genes with avg expression > 1 for cell type Other : 0.45870393900889456
[ ]: