TotalPosterior¶

class scvi.inference.TotalPosterior(model, gene_dataset, shuffle=False, indices=None, use_cuda=True, data_loader_kwargs={})[source]¶

Bases: scvi.inference.posterior.Posterior

The functional data unit for totalVI.

A TotalPosterior instance is instantiated with a model and a gene_dataset, and as well as additional arguments that for Pytorch’s DataLoader. A subset of indices can be specified, for purposes such as splitting the data into train/test/validation. Each trainer instance of the TotalTrainer class can therefore have multiple TotalPosterior instances to train a model. A TotalPosterior instance also comes with many methods or utilities for its corresponding data.

Parameters

model (TOTALVITOTALVI) – A model instance from class TOTALVI
gene_dataset (GeneExpressionDatasetGeneExpressionDataset) – A gene_dataset instance like CbmcDataset() with attribute protein_expression
shuffle (boolbool) – Specifies if a RandomSampler or a SequentialSampler should be used
indices (ndarray, NoneOptional[ndarray]) – Specifies how the data should be split with regards to train/test or labelled/unlabelled
use_cuda (boolbool) – Default: True
data_loader_kwargs – Keyword arguments to passed into the DataLoader

Examples

Let us instantiate a trainer, with a gene_dataset and a model

>>> gene_dataset = CbmcDataset()
>>> totalvi = TOTALVI(gene_dataset.nb_genes, len(gene_dataset.protein_names),
... n_batch=gene_dataset.n_batches, use_cuda=True)
>>> trainer = TotalTrainer(vae, gene_dataset)
>>> trainer.train(n_epochs=500)

Methods Summary

`compute_elbo`(vae, **kwargs)	Computes the ELBO.
`compute_marginal_log_likelihood`([…])	Computes a biased estimator for log p(x, y), which is the marginal log likelihood.
`compute_reconstruction_error`(vae, **kwargs)	Computes log p(x/z), which is the reconstruction error.
`corrupted`()	Corrupts gene counts.
`differential_expression_score`(idx1, idx2[, …])	Unified method for differential expression inference.
`differential_expression_stats`([M_sampling])	Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True
`elbo`()	Returns the Evidence Lower Bound associated to the object.
`generate`([n_samples, batch_size])	Sample from posterior predictive.
`generate_denoised_samples`([n_samples, …])	Samples from an adjusted posterior predictive.
`generate_feature_correlation_matrix`([…])	Wrapper of generate_denoised_samples() to create a gene-protein gene-protein corr matrix
`generate_parameters`()	Estimates data’s count means, dispersions and dropout logits.
`get_latent`([sample])	Output posterior z mean or sample, batch index, and label
`get_normalized_denoised_expression`([…])	Returns the tensors of denoised normalized gene and protein expression
`get_protein_background_mean`()
`get_protein_mean`([n_samples, give_mean, …])	Returns the tensors of protein mean (with foreground and background)
`get_sample_dropout`([n_samples, give_mean])	Zero-inflation mixing component for genes
`get_sample_mixing`([n_samples, give_mean, …])	Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background)
`get_sample_scale`([transform_batch, eps, …])	Helper function to provide normalized expression for DE testing.
`imputation`([n_samples])	Gene imputation
`imputation_list`([n_samples])	This code is identical to same function in posterior.py
`marginal_ll`([n_mc_samples])	Estimates the marginal likelihood of the object’s data.
`reconstruction_error`([mode])	Returns the reconstruction error associated to the object.
`uncorrupted`()	Uncorrupts gene counts.

Methods Documentation

compute_elbo(vae, **kwargs)[source]¶

Computes the ELBO.

The ELBO is the reconstruction error + the KL divergences between the variational distributions and the priors. It differs from the marginal log likelihood. Specifically, it is a lower bound on the marginal log likelihood plus a term that is constant with respect to the variational distribution. It still gives good insights on the modeling of the data, and is fast to compute.

Parameters

vae (TOTALVITOTALVI) –
**kwargs –

Returns

compute_marginal_log_likelihood(n_samples_mc=100, batch_size=96)[source]¶

Computes a biased estimator for log p(x, y), which is the marginal log likelihood.

Despite its bias, the estimator still converges to the real value of log p(x, y) when n_samples_mc (for Monte Carlo) goes to infinity (a fairly high value like 100 should be enough). 5000 is the standard in machine learning publications. Due to the Monte Carlo sampling, this method is not as computationally efficient as computing only the reconstruction loss

Parameters

n_samples_mc (intint) – (Default value = 100)
batch_size (intint) – (Default value = 96)

Returns

compute_reconstruction_error(vae, **kwargs)[source]¶

Computes log p(x/z), which is the reconstruction error.

Differs from the marginal log likelihood, but still gives good insights on the modeling of the data, and is fast to compute

This is really a helper function to self.ll, self.ll_protein, etc.

corrupted()[source]¶: Corrupts gene counts.

differential_expression_score(idx1, idx2, mode='vanilla', batchid1=None, batchid2=None, use_observed_batches=False, n_samples=5000, use_permutation=True, M_permutation=10000, all_stats=True, change_fn=None, m1_domain_fn=None, delta=0.5, cred_interval_lvls=None, **kwargs)[source]¶

Unified method for differential expression inference.

This function is an extension of the get_bayes_factors method providing additional genes information to the user

Two modes coexist:

the “vanilla” mode follows protocol described in [Lopez18]

In this case, we perform hypothesis testing based on the hypotheses

\[M_1: h_1 > h_2 ~\text{and}~ M_2: h_1 \leq h_2\]

DE can then be based on the study of the Bayes factors

\[\log p(M_1 | x_1, x_2) / p(M_2 | x_1, x_2)\]

the “change” mode (described in [Boyeau19])

consists in estimating an effect size random variable (e.g., log fold-change) and performing Bayesian hypothesis testing on this variable. The change_fn function computes the effect size variable r based two inputs corresponding to the normalized means in both populations.

Hypotheses:

\[M_1: r \in R_1 ~\text{(effect size r in region inducing differential expression)}\]

\[M_2: r \notin R_1 ~\text{(no differential expression)}\]

To characterize the region \(R_1\), which induces DE, the user has two choices.

1. A common case is when the region \([-\delta, \delta]\) does not induce differential expression. If the user specifies a threshold delta, we suppose that \(R_1 = \mathbb{R} \setminus [-\delta, \delta]\)

specify an specific indicator function

\[f: \mathbb{R} \mapsto \{0, 1\} ~\text{s.t.}~ r \in R_1 ~\text{iff.}~ f(r) = 1\]

Decision-making can then be based on the estimates of

\[p(M_1 \mid x_1, x_2)\]

Both modes require to sample the normalized means posteriors. To that purpose, we sample the Posterior in the following way:

The posterior is sampled n_samples times for each subpopulation
For computation efficiency (posterior sampling is quite expensive), instead of
comparing the obtained samples element-wise, we can permute posterior samples. Remember that computing the Bayes Factor requires sampling \(q(z_A \mid x_A)\) and \(q(z_B \mid x_B)\)

Currently, the code covers several batch handling configurations:

1. If use_observed_batches=True, then batch are considered as observations and cells’ normalized means are conditioned on real batch observations

2. If case (cell group 1) and control (cell group 2) are conditioned on the same batch ids. Examples:

>>> set(batchid1) = set(batchid2)

or

>>> batchid1 = batchid2 = None

3. If case and control are conditioned on different batch ids that do not intersect i.e.,

>>> set(batchid1) != set(batchid2)

and

>>> len(set(batchid1).intersection(set(batchid2))) == 0

This function does not cover other cases yet and will warn users in such cases.

Parameters

mode (str, NoneOptional[str]) – one of [“vanilla”, “change”]
idx1 (List[bool], ndarrayUnion[List[bool], ndarray]) – bool array masking subpopulation cells 1. Should be True where cell is from associated population
idx2 (List[bool], ndarrayUnion[List[bool], ndarray]) – bool array masking subpopulation cells 2. Should be True where cell is from associated population
batchid1 (List[int], ndarray, NoneUnion[List[int], ndarray, None]) – List of batch ids for which you want to perform DE Analysis for subpopulation 1. By default, all ids are taken into account
batchid2 (List[int], ndarray, NoneUnion[List[int], ndarray, None]) – List of batch ids for which you want to perform DE Analysis for subpopulation 2. By default, all ids are taken into account
use_observed_batches (bool, NoneOptional[bool]) – Whether normalized means are conditioned on observed batches
n_samples (intint) – Number of posterior samples
use_permutation (boolbool) – Activates step 2 described above. Simply formulated, pairs obtained from posterior sampling (when calling sample_scale_from_batch) will be randomly permuted so that the number of pairs used to compute Bayes Factors becomes M_permutation.
M_permutation (intint) – Number of times we will “mix” posterior samples in step 2. Only makes sense when use_permutation=True
change_fn (str, Callable, NoneUnion[str, Callable, None]) – function computing effect size based on both normalized means
m1_domain_fn (Callable, NoneOptional[Callable]) – custom indicator function of effect size regions inducing differential expression
delta (float, NoneOptional[float]) – specific case of region inducing differential expression. In this case, we suppose that R setminus [-delta, delta] does not induce differential expression (LFC case)
cred_interval_lvls (List[float], ndarray, NoneUnion[List[float], ndarray, None]) – List of credible interval levels to compute for the posterior LFC distribution
all_stats (boolbool) – whether additional metrics should be provided
**kwargs – Other keywords arguments for get_sample_scale

Return type

DataFrameDataFrame

Returns

diff_exp_results The most important columns are:

proba_de (probability of being differentially expressed in change mode)
bayes_factor (bayes factors in the vanilla mode)
scale1 and scale2 (means of the scales in population 1 and 2)
When using the change mode, the mean, median, std of the posterior LFC

differential_expression_stats(M_sampling=100)[source]¶

Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True

Parameters: M_sampling (intint) – number of samples
Returns: type Tuple px_scales, all_labels where (i) px_scales: scales of shape (M_sampling, n_genes) (ii) all_labels: labels of shape (M_sampling, )

elbo()[source]¶: Returns the Evidence Lower Bound associated to the object.

generate(n_samples=100, batch_size=64)[source]¶

Sample from posterior predictive. Proteins are concatenated to genes.

Parameters

n_samples (intint) – Number of posterior predictive samples
batch_size (intint) – mini batch size for loaded data. Lower for less memory usage

Return type

Tuple[ndarray, ndarray]Tuple[ndarray, ndarray]

Returns

x_newtorch.Tensor: tensor with shape (n_cells, n_genes + n_proteins, n_samples)
x_oldtorch.Tensor: tensor with shape (n_cells, n_genes + n_proteins)

generate_denoised_samples(n_samples=25, batch_size=64, rna_size_factor=1, transform_batch=None)[source]¶

Samples from an adjusted posterior predictive. Proteins are concatenated to genes.

Parameters

n_samples (intint) – How may samples per cell
batch_size (intint) – Mini-batch size for sampling. Lower means less GPU memory footprint
rna_size_factor (intint) – size factor for RNA prior to sampling gamma distribution
transform_batch (int, NoneOptional[int]) – int of which batch to condition on for all cells

Returns

generate_feature_correlation_matrix(n_samples=25, batch_size=64, rna_size_factor=1000, transform_batch=None, correlation_mode='pearson', log_transform=False)[source]¶

Wrapper of generate_denoised_samples() to create a gene-protein gene-protein corr matrix

Parameters

n_samples (intint) – How may samples per cell
batch_size (intint) – Mini-batch size for sampling. Lower means less GPU memory footprint
rna_size_factor (intint) – size factor for RNA prior to sampling gamma distribution
transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.
log_transform (boolbool) – Whether to log transform denoised values prior to correlation calculation

Returns

Correlation matrix

generate_parameters()[source]¶: Estimates data’s count means, dispersions and dropout logits.

get_latent(sample=False)[source]¶

Output posterior z mean or sample, batch index, and label

Parameters: sample (boolbool) – z mean or z sample
Return type: Tuple[ndarray, ndarray, ndarray, ndarray]Tuple[ndarray, ndarray, ndarray, ndarray]
Returns: type 4-tuple of latent, batch_indices, labels, library_gene

get_normalized_denoised_expression(n_samples=1, give_mean=True, transform_batch=None, sample_protein_mixing=True)[source]¶

Returns the tensors of denoised normalized gene and protein expression

Parameters

n_samples (intint) – number of samples from posterior distribution
sample_protein_mixing (boolbool) – Sample mixing bernoulli, setting background to zero
give_mean (boolbool) – bool, whether to return samples along first axis or average over samples
transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

Tuple[ndarray, ndarray]Tuple[ndarray, ndarray]

Returns

Denoised genes, denoised proteins

get_protein_background_mean()[source]¶

get_protein_mean(n_samples=1, give_mean=True, transform_batch=None)[source]¶

Returns the tensors of protein mean (with foreground and background)

Parameters

n_samples (intint) – number of samples from posterior distribution
give_mean (boolbool) – bool, whether to return samples along first axis or average over samples
transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

ndarrayndarray

Returns

Protein NB Mixture mean

get_sample_dropout(n_samples=1, give_mean=True)[source]¶

Zero-inflation mixing component for genes

Parameters

n_samples (intint) – (Default value = 1)
give_mean (boolbool) – (Default value = True)

Returns

get_sample_mixing(n_samples=1, give_mean=True, transform_batch=None)[source]¶

Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background)

Parameters

n_samples (intint) – number of samples from posterior distribution
sample_protein_mixing – Sample mixing bernoulli, setting background to zero
give_mean (boolbool) – bool, whether to return samples along first axis or average over samples
transform_batch (int, List[int], NoneUnion[int, List[int], None]) – Batches to condition on. If transform_batch is: - None, then real observed batch is used - int, then batch transform_batch is used - list of int, then values are averaged over provided batches.

Return type

ndarrayndarray

Returns

array of probability background

get_sample_scale(transform_batch=None, eps=0.5, normalize_pro=False, sample_bern=True, include_bg=False)[source]¶

Helper function to provide normalized expression for DE testing.

For normalized, denoised expression, please use: get_normalized_denoised_expression()

Parameters

transform_batch – Int of batch to “transform” all cells into (Default value = None)
eps – Prior count to add to protein normalized expression (Default value = 0.5)
normalize_pro – bool, whether to make protein expression sum to one in a cell (Default value = False)
include_bg – bool, whether to include the background component of expression (Default value = False)
sample_bern – (Default value = True)

Return type

ndarrayndarray

Returns

imputation(n_samples=1)[source]¶

Gene imputation

Parameters: n_samples (intint) – (Default value = 1)
Returns

imputation_list(n_samples=1)[source]¶

This code is identical to same function in posterior.py

Except, we use the totalVI definition of model.get_sample_rate

Parameters: n_samples (intint) – (Default value = 1)
Returns

marginal_ll(n_mc_samples=1000)[source]¶

Estimates the marginal likelihood of the object’s data.

Parameters: n_mc_samples – Number of MC estimates to use
Returns: Marginal LL

reconstruction_error(mode='total')[source]¶: Returns the reconstruction error associated to the object.

uncorrupted()[source]¶: Uncorrupts gene counts.