scvi.data.poisson_gene_selection


scvi.data.poisson_gene_selection(adata, layer=None, n_top_genes=4000, accelerator='auto', device='auto', subset=False, inplace=True, n_samples=10000, batch_key=None, silent=False, minibatch_size=5000)

Rank and select genes based on the enrichment of zero counts.

Enrichment is assessed by comparing the data to a Poisson count model. This is based on M3Drop (tallulandrews/M3Drop). The method accounts for library size internally, so a raw count matrix should be provided.

Instead of a Z-test, enrichment of zeros is quantified by posterior probabilities from a binomial model, computed through sampling.
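The idea described above can be sketched in plain NumPy. This is an illustration only, not the scvi-tools implementation: the function name `zero_enrichment_sketch`, the choice of one binomial trial per cell, and the toy matrix `demo_counts` are all assumptions made for this example, and the library's exact sampling scheme may differ.

```python
import numpy as np


def zero_enrichment_sketch(counts, n_samples=1000, seed=0):
    """Illustrative only; hypothetical helper, not part of scvi-tools."""
    counts = np.asarray(counts, dtype=float)
    n_cells, n_genes = counts.shape
    rng = np.random.default_rng(seed)

    library_size = counts.sum(axis=1)                # total counts per cell
    gene_share = counts.sum(axis=0) / counts.sum()   # each gene's share of all counts

    # Under a Poisson model, P(count == 0) = exp(-mean), where the mean for a
    # cell/gene pair is taken as library size * gene share (an assumption here).
    expected = np.exp(-np.outer(library_size, gene_share)).mean(axis=0)
    observed = (counts == 0).mean(axis=0)

    # Estimate how often a binomial draw at the observed zero rate exceeds a
    # draw at the expected zero rate (assumed scheme: n_cells trials per draw).
    obs_draws = rng.binomial(n_cells, observed, size=(n_samples, n_genes))
    exp_draws = rng.binomial(n_cells, expected, size=(n_samples, n_genes))
    prob = (obs_draws > exp_draws).mean(axis=0)
    return observed, expected, prob


# Toy data: gene 0 is expressed in every cell, gene 1 is zero in half the cells.
demo_counts = np.zeros((200, 2))
demo_counts[:, 0] = 2
demo_counts[1::2, 1] = 10
observed, expected, prob = zero_enrichment_sketch(demo_counts)
```

In this toy matrix the second gene has many more zeros than the Poisson model predicts, so its estimated enrichment probability is high, while the first gene has no zeros at all and scores zero.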

Parameters:
  • adata – AnnData object (with sparse X matrix).

  • layer (Optional[str] (default: None)) – If provided, use adata.layers[layer] for expression values instead of adata.X.

  • n_top_genes (int (default: 4000)) – How many variable genes to select.

  • accelerator (str (default: 'auto')) – Supports passing different accelerator types (“cpu”, “gpu”, “tpu”, “ipu”, “hpu”, “mps”, “auto”) as well as custom accelerator instances.

  • device (Union[int, str] (default: 'auto')) – The device to use. Can be set to a non-negative index (int or str) or “auto” for automatic selection based on the chosen accelerator. If set to “auto” and accelerator is not determined to be “cpu”, then device will be set to the first available device.

  • subset (bool (default: False)) – If True, subset in place to highly variable genes; otherwise, merely flag highly variable genes.

  • inplace (bool (default: True)) – Whether to place calculated metrics in .var or return them.

  • n_samples (int (default: 10000)) – The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene.

  • batch_key (str (default: None)) – Key in adata.obs that contains batch information. If None, batch information is not used.

  • silent (bool (default: False)) – If True, disables the progress bar.

  • minibatch_size (int (default: 5000)) – Size of temporary matrix for incremental calculation. Larger is faster but requires more RAM or GPU memory. (The default should be fine unless there are hundreds of millions of cells or millions of genes.)

Return type:

Optional[DataFrame]

Returns:

Depending on inplace, returns the calculated metrics (DataFrame) or updates .var with the following fields:

  • highly_variable (bool) – Boolean indicator of highly variable genes.

  • observed_fraction_zeros – Fraction of observed zeros per gene.

  • expected_fraction_zeros – Expected fraction of zeros per gene under the Poisson model.

  • prob_zero_enrichment (float) – Probability of zero enrichment; the median across batches in the case of multiple batches.

  • prob_zero_enrichment_rank (float) – Rank of the gene according to probability of zero enrichment; the median rank in the case of multiple batches.

  • prob_zero_enriched_nbatches (int) – If batch_key is given, the number of batches in which the gene is detected as zero enriched.
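
A minimal usage sketch, assuming scvi-tools is installed and adata holds raw counts. The wrapper name `select_zero_enriched_genes` is hypothetical and exists only for this example; the keyword arguments and the highly_variable field are the ones documented above.

```python
def select_zero_enriched_genes(adata, n_top_genes=4000, batch_key=None):
    """Hypothetical wrapper: flag genes in adata.var, then return a subset copy."""
    # Imported lazily so that defining this helper does not require scvi-tools.
    import scvi

    scvi.data.poisson_gene_selection(
        adata,
        n_top_genes=n_top_genes,
        batch_key=batch_key,
        inplace=True,   # write the metrics into adata.var
        subset=False,   # keep all genes; only flag the selected ones
        silent=True,    # disable the progress bar
    )

    # Use the documented boolean indicator to keep only the selected genes.
    selected = adata.var["highly_variable"].to_numpy()
    return adata[:, selected].copy()
```

Keeping subset=False and slicing afterwards leaves the original object intact, so the per-gene metrics for all genes remain available in adata.var for inspection.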