TotalPosterior

class scvi.inference.TotalPosterior(model, gene_dataset, shuffle=False, indices=None, use_cuda=True, data_loader_kwargs={})[source]

Bases: scvi.inference.posterior.Posterior

The functional data unit for totalVI.

A TotalPosterior instance is instantiated with a model and a gene_dataset, and as well as additional arguments that for Pytorch’s DataLoader. A subset of indices can be specified, for purposes such as splitting the data into train/test/validation. Each trainer instance of the TotalTrainer class can therefore have multiple TotalPosterior instances to train a model. A TotalPosterior instance also comes with many methods or utilities for its corresponding data.

Parameters
  • model (TOTALVITOTALVI) – A model instance from class TOTALVI

  • gene_dataset (GeneExpressionDatasetGeneExpressionDataset) – A gene_dataset instance like CbmcDataset() with attribute protein_expression

  • shuffle (boolbool) – Specifies if a RandomSampler or a SequentialSampler should be used

  • indices (ndarray, NoneOptional[ndarray]) – Specifies how the data should be split with regards to train/test or labelled/unlabelled

  • use_cuda (boolbool) – Default: True

  • data_loader_kwargs – Keyword arguments to passed into the DataLoader

Examples

Let us instantiate a trainer, with a gene_dataset and a model

>>> gene_dataset = CbmcDataset()
>>> totalvi = TOTALVI(gene_dataset.nb_genes, len(gene_dataset.protein_names),
... n_batch=gene_dataset.n_batches, use_cuda=True)
>>> trainer = TotalTrainer(vae, gene_dataset)
>>> trainer.train(n_epochs=500)

Methods Summary

compute_elbo(vae, **kwargs)

Computes the ELBO.

compute_marginal_log_likelihood([…])

Computes a biased estimator for log p(x, y), which is the marginal log likelihood.

compute_reconstruction_error(vae, **kwargs)

Computes log p(x/z), which is the reconstruction error.

corrupted()

Corrupts gene counts.

differential_expression_score(idx1, idx2[, …])

Unified method for differential expression inference.

differential_expression_stats([M_sampling])

Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True

elbo()

Returns the Evidence Lower Bound associated to the object.

generate([n_samples, batch_size])

Sample from posterior predictive.

generate_denoised_samples([n_samples, …])

Samples from an adjusted posterior predictive.

generate_feature_correlation_matrix([…])

Wrapper of generate_denoised_samples() to create a gene-protein gene-protein corr matrix

generate_parameters()

Estimates data’s count means, dispersions and dropout logits.

get_latent([sample])

Output posterior z mean or sample, batch index, and label

get_normalized_denoised_expression([…])

Returns the tensors of denoised normalized gene and protein expression

get_protein_background_mean()

get_protein_mean([n_samples, give_mean, …])

Returns the tensors of protein mean (with foreground and background)

get_sample_dropout([n_samples, give_mean])

Zero-inflation mixing component for genes

get_sample_mixing([n_samples, give_mean, …])

Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background)

get_sample_scale([transform_batch, eps, …])

Helper function to provide normalized expression for DE testing.

imputation([n_samples])

Gene imputation

imputation_list([n_samples])

This code is identical to same function in posterior.py

marginal_ll([n_mc_samples])

Estimates the marginal likelihood of the object’s data.

reconstruction_error([mode])

Returns the reconstruction error associated to the object.

uncorrupted()

Uncorrupts gene counts.

Methods Documentation

compute_elbo(vae, **kwargs)[source]

Computes the ELBO.

The ELBO is the reconstruction error + the KL divergences between the variational distributions and the priors. It differs from the marginal log likelihood. Specifically, it is a lower bound on the marginal log likelihood plus a term that is constant with respect to the variational distribution. It still gives good insights on the modeling of the data, and is fast to compute.

Parameters
  • vae (TOTALVITOTALVI) –

  • **kwargs

Returns

compute_marginal_log_likelihood(n_samples_mc=100, batch_size=96)[source]

Computes a biased estimator for log p(x, y), which is the marginal log likelihood.

Despite its bias, the estimator still converges to the real value of log p(x, y) when n_samples_mc (for Monte Carlo) goes to infinity (a fairly high value like 100 should be enough). 5000 is the standard in machine learning publications. Due to the Monte Carlo sampling, this method is not as computationally efficient as computing only the reconstruction loss

Parameters
  • n_samples_mc (intint) – (Default value = 100)

  • batch_size (intint) – (Default value = 96)

Returns

compute_reconstruction_error(vae, **kwargs)[source]

Computes log p(x/z), which is the reconstruction error.

Differs from the marginal log likelihood, but still gives good insights on the modeling of the data, and is fast to compute

This is really a helper function to self.ll, self.ll_protein, etc.

corrupted()[source]

Corrupts gene counts.

differential_expression_score(idx1, idx2, mode='vanilla', batchid1=None, batchid2=None, use_observed_batches=False, n_samples=5000, use_permutation=True, M_permutation=10000, all_stats=True, change_fn=None, m1_domain_fn=None, delta=0.5, cred_interval_lvls=None, **kwargs)[source]

Unified method for differential expression inference.

This function is an extension of the get_bayes_factors method providing additional genes information to the user

Two modes coexist:

  • the “vanilla” mode follows protocol described in [Lopez18]

In this case, we perform hypothesis testing based on the hypotheses

\[M_1: h_1 > h_2 ~\text{and}~ M_2: h_1 \leq h_2\]

DE can then be based on the study of the Bayes factors

\[\log p(M_1 | x_1, x_2) / p(M_2 | x_1, x_2)\]

consists in estimating an effect size random variable (e.g., log fold-change) and performing Bayesian hypothesis testing on this variable. The change_fn function computes the effect size variable r based two inputs corresponding to the normalized means in both populations.

Hypotheses:

\[M_1: r \in R_1 ~\text{(effect size r in region inducing differential expression)}\]
\[M_2: r \notin R_1 ~\text{(no differential expression)}\]

To characterize the region \(R_1\), which induces DE, the user has two choices.

1. A common case is when the region \([-\delta, \delta]\) does not induce differential expression. If the user specifies a threshold delta, we suppose that \(R_1 = \mathbb{R} \setminus [-\delta, \delta]\)

  1. specify an specific indicator function

\[f: \mathbb{R} \mapsto \{0, 1\} ~\text{s.t.}~ r \in R_1 ~\text{iff.}~ f(r) = 1\]

Decision-making can then be based on the estimates of

\[p(M_1 \mid x_1, x_2)\]

Both modes require to sample the normalized means posteriors. To that purpose, we sample the Posterior in the following way:

  1. The posterior is sampled n_samples times for each subpopulation

  2. For computation efficiency (posterior sampling is quite expensive), instead of

    comparing the obtained samples element-wise, we can permute posterior samples. Remember that computing the Bayes Factor requires sampling \(q(z_A \mid x_A)\) and \(q(z_B \mid x_B)\)

Currently, the code covers several batch handling configurations:

1. If use_observed_batches=True, then batch are considered as observations and cells’ normalized means are conditioned on real batch observations

2. If case (cell group 1) and control (cell group 2) are conditioned on the same batch ids. Examples:

>>> set(batchid1) = set(batchid2)

or

>>> batchid1 = batchid2 = None

3. If case and control are conditioned on different batch ids that do not intersect i.e.,

>>> set(batchid1) != set(batchid2)

and

>>> len(set(batchid1).intersection(set(batchid2))) == 0

This function does not cover other cases yet and will warn users in such cases.

Parameters
  • mode (str, NoneOptional[str]) – one of [“vanilla”, “change”]

  • idx1 (List[bool], ndarrayUnion[List[bool], ndarray]) – bool array masking subpopulation cells 1. Should be True where cell is from associated population

  • idx2 (List[bool], ndarrayUnion[List[bool], ndarray]) – bool array masking subpopulation cells 2. Should be True where cell is from associated population

  • batchid1 (List[int], ndarray, NoneUnion[List[int], ndarray, None]) – List of batch ids for which you want to perform DE Analysis for subpopulation 1. By default, all ids are taken into account

  • batchid2 (List[int], ndarray, NoneUnion[List[int], ndarray, None]) – List of batch ids for which you want to perform DE Analysis for subpopulation 2. By default, all ids are taken into account

  • use_observed_batches (bool, NoneOptional[bool]) – Whether normalized means are conditioned on observed batches

  • n_samples (intint) – Number of posterior samples

  • use_permutation (boolbool) – Activates step 2 described above. Simply formulated, pairs obtained from posterior sampling (when calling sample_scale_from_batch) will be randomly permuted so that the number of pairs used to compute Bayes Factors becomes M_permutation.

  • M_permutation (intint) – Number of times we will “mix” posterior samples in step 2. Only makes sense when use_permutation=True

  • change_fn (str, Callable, NoneUnion[str, Callable, None]) – function computing effect size based on both normalized means

  • m1_domain_fn (Callable, NoneOptional[Callable]) – custom indicator function of effect size regions inducing differential expression

  • delta (float, NoneOptional[float]) – specific case of region inducing differential expression. In this case, we suppose that R setminus [-delta, delta] does not induce differential expression (LFC case)

  • cred_interval_lvls (List[float], ndarray, NoneUnion[List[float], ndarray, None]) – List of credible interval levels to compute for the posterior LFC distribution

  • all_stats (boolbool) – whether additional metrics should be provided

  • **kwargs – Other keywords arguments for get_sample_scale

Return type

DataFrameDataFrame

Returns

diff_exp_results The most important columns are:

  • proba_de (probability of being differentially expressed in change mode)

  • bayes_factor (bayes factors in the vanilla mode)

  • scale1 and scale2 (means of the scales in population 1 and 2)

  • When using the change mode, the mean, median, std of the posterior LFC

differential_expression_stats(M_sampling=100)[source]

Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True

Parameters

M_sampling (intint) – number of samples

Returns

type Tuple px_scales, all_labels where (i) px_scales: scales of shape (M_sampling, n_genes) (ii) all_labels: labels of shape (M_sampling, )

elbo()[source]

Returns the Evidence Lower Bound associated to the object.

generate(n_samples=100, batch_size=64)[source]

Sample from posterior predictive. Proteins are concatenated to genes.

Parameters
  • n_samples (intint) – Number of posterior predictive samples

  • batch_size (intint) – mini batch size for loaded data. Lower for less memory usage

Return type

Tuple[ndarray, ndarray]Tuple[ndarray, ndarray]

Returns

x_newtorch.Tensor

tensor with shape (n_cells, n_genes + n_proteins, n_samples)

x_oldtorch.Tensor

tensor with shape (n_cells, n_genes + n_proteins)

generate_denoised_samples(n_samples=25, batch_size=64, rna_size_factor=1, transform_batch=None)[source]

Samples from an adjusted posterior predictive. Proteins are concatenated to genes.

Parameters
  • n_samples (intint) – How may samples per cell

  • batch_size (intint) – Mini-batch size for sampling. Lower means less GPU memory footprint

  • rna_size_factor (intint) – size factor for RNA prior to sampling gamma distribution

  • transform_batch (int, NoneOptional[int]) – int of which batch to condition on for all cells

Returns

generate_feature_correlation_matrix(n_samples=25, batch_size=64, rna_size_factor=1000, transform_batch=None, correlation_mode='pearson', log_transform=False)[source]

Wrapper of generate_denoised_samples() to create a gene-protein gene-protein corr matrix

Parameters
  • n_samples (intint) – How may samples per cell

  • batch_size (intint) – Mini-batch size for sampling. Lower means less GPU memory footprint

  • rna_size_factor (intint) – size factor for RNA prior to sampling gamma distribution

  • transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

  • log_transform (boolbool) – Whether to log transform denoised values prior to correlation calculation

Returns

Correlation matrix

generate_parameters()[source]

Estimates data’s count means, dispersions and dropout logits.

get_latent(sample=False)[source]

Output posterior z mean or sample, batch index, and label

Parameters

sample (boolbool) – z mean or z sample

Return type

Tuple[ndarray, ndarray, ndarray, ndarray]Tuple[ndarray, ndarray, ndarray, ndarray]

Returns

type 4-tuple of latent, batch_indices, labels, library_gene

get_normalized_denoised_expression(n_samples=1, give_mean=True, transform_batch=None, sample_protein_mixing=True)[source]

Returns the tensors of denoised normalized gene and protein expression

Parameters
  • n_samples (intint) – number of samples from posterior distribution

  • sample_protein_mixing (boolbool) – Sample mixing bernoulli, setting background to zero

  • give_mean (boolbool) – bool, whether to return samples along first axis or average over samples

  • transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

Tuple[ndarray, ndarray]Tuple[ndarray, ndarray]

Returns

Denoised genes, denoised proteins

get_protein_background_mean()[source]
get_protein_mean(n_samples=1, give_mean=True, transform_batch=None)[source]

Returns the tensors of protein mean (with foreground and background)

Parameters
  • n_samples (intint) – number of samples from posterior distribution

  • give_mean (boolbool) – bool, whether to return samples along first axis or average over samples

  • transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

ndarrayndarray

Returns

Protein NB Mixture mean

get_sample_dropout(n_samples=1, give_mean=True)[source]

Zero-inflation mixing component for genes

Parameters
  • n_samples (intint) – (Default value = 1)

  • give_mean (boolbool) – (Default value = True)

Returns

get_sample_mixing(n_samples=1, give_mean=True, transform_batch=None)[source]

Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background)

Parameters
  • n_samples (intint) – number of samples from posterior distribution

  • sample_protein_mixing – Sample mixing bernoulli, setting background to zero

  • give_mean (boolbool) – bool, whether to return samples along first axis or average over samples

  • transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

ndarrayndarray

Returns

array of probability background

get_sample_scale(transform_batch=None, eps=0.5, normalize_pro=False, sample_bern=True, include_bg=False)[source]

Helper function to provide normalized expression for DE testing.

For normalized, denoised expression, please use

get_normalized_denoised_expression()

Parameters
  • transform_batch – Int of batch to “transform” all cells into (Default value = None)

  • eps – Prior count to add to protein normalized expression (Default value = 0.5)

  • normalize_pro – bool, whether to make protein expression sum to one in a cell (Default value = False)

  • include_bg – bool, whether to include the background component of expression (Default value = False)

  • sample_bern – (Default value = True)

Return type

ndarrayndarray

Returns

imputation(n_samples=1)[source]

Gene imputation

Parameters

n_samples (intint) – (Default value = 1)

Returns

imputation_list(n_samples=1)[source]

This code is identical to same function in posterior.py

Except, we use the totalVI definition of model.get_sample_rate

Parameters

n_samples (intint) – (Default value = 1)

Returns

marginal_ll(n_mc_samples=1000)[source]

Estimates the marginal likelihood of the object’s data.

Parameters

n_mc_samples – Number of MC estimates to use

Returns

Marginal LL

reconstruction_error(mode='total')[source]

Returns the reconstruction error associated to the object.

uncorrupted()[source]

Uncorrupts gene counts.