scvi.data.poisson_gene_selection

scvi.data.poisson_gene_selection(adata, layer=None, n_top_genes=4000, use_gpu=True, subset=False, inplace=True, n_samples=10000, batch_key=None, silent=False, minibatch_size=5000)

Rank and select genes based on the enrichment of zero counts.

Enrichment is assessed by comparing the data to a Poisson count model. This is based on M3Drop (https://github.com/tallulandrews/M3Drop). The method accounts for library size internally, so a raw count matrix should be provided.

Instead of a Z-test, enrichment of zeros is quantified by posterior probabilities from a binomial model, computed through sampling.
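The idea above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the scvi-tools implementation: the function name zero_enrichment_sketch is hypothetical, the Poisson mean for a cell and gene is assumed to be the cell's library size times the gene's share of all counts, and the sampling step is assumed to compare single Bernoulli draws from the observed and expected zero fractions.

```python
import numpy as np


def zero_enrichment_sketch(counts, n_samples=1000, seed=0):
    """Illustrative only: per-gene zero enrichment against a Poisson model."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)  # cells x genes, raw counts
    n_genes = counts.shape[1]

    library_size = counts.sum(axis=1)                  # total counts per cell
    gene_fraction = counts.sum(axis=0) / counts.sum()  # share of all counts per gene

    # Assumed Poisson mean per (cell, gene); P(count == 0) = exp(-mean).
    poisson_mean = np.outer(library_size, gene_fraction)
    expected_fraction_zeros = np.exp(-poisson_mean).mean(axis=0)
    observed_fraction_zeros = (counts == 0).mean(axis=0)

    # Assumed sampling scheme: draw 0/1 outcomes from both fractions and
    # record how often the observed draw exceeds the expected draw.
    shape = (n_samples, n_genes)
    observed_draws = rng.binomial(1, observed_fraction_zeros, size=shape)
    expected_draws = rng.binomial(1, expected_fraction_zeros, size=shape)
    prob_zero_enrichment = (observed_draws > expected_draws).mean(axis=0)

    return observed_fraction_zeros, expected_fraction_zeros, prob_zero_enrichment


# Toy data: gene 0 is silenced in most cells, so it has more zeros
# than a Poisson model with the same overall mean would predict.
toy_rng = np.random.default_rng(1)
toy = toy_rng.poisson(2.0, size=(200, 5))
toy[:150, 0] = 0
observed, expected, prob = zero_enrichment_sketch(toy)
```

In this toy matrix, gene 0 should show an observed zero fraction well above its expected one, and therefore the highest estimated probability of zero enrichment.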

Parameters:
  • adata – AnnData object (with sparse X matrix).

  • layer (Optional[str]) – If provided, use adata.layers[layer] for expression values instead of adata.X.

  • n_top_genes (int) – How many variable genes to select.

  • use_gpu (bool) – Whether to use the GPU.

  • subset (bool) – If True, subset adata in place to the highly variable genes; otherwise, only flag the highly variable genes.

  • inplace (bool) – Whether to place calculated metrics in .var or return them.

  • n_samples (int) – The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene.

  • batch_key (str) – Key in adata.obs that contains batch info. If None, do not use batch info. Default: None.

  • silent (bool) – If True, disables the progress bar.

  • minibatch_size (int) – Size of temporary matrix for incremental calculation. Larger is faster but requires more RAM or GPU memory. (The default should be fine unless there are hundreds of millions of cells or millions of genes.)
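A minimal usage sketch of the parameters above, assuming the signature shown on this page (other scvi-tools versions may expose different arguments). The wrapper name select_zero_enriched_genes is hypothetical; it only forwards to scvi.data.poisson_gene_selection with inplace=False so that the metrics come back as a DataFrame and adata is left unchanged.

```python
def select_zero_enriched_genes(adata, n_top_genes=4000, batch_key=None):
    """Return the gene selection metrics for `adata` without modifying it."""
    # Imported lazily so this helper can be defined without scvi-tools installed.
    import scvi

    return scvi.data.poisson_gene_selection(
        adata,
        n_top_genes=n_top_genes,
        use_gpu=False,        # run on CPU
        inplace=False,        # return a DataFrame instead of writing to adata.var
        batch_key=batch_key,  # None means batch info is not used
    )
```

With an AnnData object holding raw counts, `select_zero_enriched_genes(adata, n_top_genes=2000)` would return the metrics table, which can then be filtered on its highly_variable column.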

Returns:

Depending on inplace, either returns the calculated metrics as a DataFrame or updates .var with the following fields:

  • highly_variable (bool) – Boolean indicator of highly variable genes.

  • observed_fraction_zeros – Fraction of observed zeros per gene.

  • expected_fraction_zeros – Expected fraction of zeros per gene under the Poisson model.

  • prob_zero_enrichment (float) – Probability of zero enrichment; the median across batches in the case of multiple batches.

  • prob_zero_enrichment_rank (float) – Rank of the gene according to probability of zero enrichment; the median rank in the case of multiple batches.

  • prob_zero_enriched_nbatches (int) – If batch_key is given, the number of batches in which the gene is detected as zero enriched.

Return type:

Optional[DataFrame]
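The multi-batch fields described under Returns can be illustrated with a small NumPy sketch. It is not taken from the scvi-tools source: the function name aggregate_across_batches is hypothetical, ranks are assumed to increase with probability, and a gene is assumed to count as "detected" in a batch when it falls within that batch's top n_top_genes.

```python
import numpy as np


def aggregate_across_batches(prob_by_batch, n_top_genes):
    """Illustrative only: combine per-batch probabilities into per-gene summaries."""
    prob_by_batch = np.asarray(prob_by_batch, dtype=float)  # batches x genes
    n_genes = prob_by_batch.shape[1]

    # Rank within each batch: 0 = lowest probability, n_genes - 1 = highest.
    rank_by_batch = prob_by_batch.argsort(axis=1).argsort(axis=1)

    return {
        "prob_zero_enrichment": np.median(prob_by_batch, axis=0),
        "prob_zero_enrichment_rank": np.median(rank_by_batch, axis=0),
        # Assumed criterion: the gene is among the top n_top_genes in that batch.
        "prob_zero_enriched_nbatches": (rank_by_batch >= n_genes - n_top_genes).sum(axis=0),
    }


# Three batches, three genes, keeping one gene per batch.
summary = aggregate_across_batches(
    [[0.9, 0.1, 0.5],
     [0.8, 0.2, 0.7],
     [0.3, 0.1, 0.6]],
    n_top_genes=1,
)
```

Here the first gene is the top-ranked gene in two of the three batches and the third gene in one, so their batch counts are 2 and 1, while the medians summarize each gene's probability and rank across batches.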