TotalPosterior
-
class scvi.inference.TotalPosterior(model, gene_dataset, shuffle=False, indices=None, use_cuda=True, data_loader_kwargs={})
Bases: scvi.inference.posterior.Posterior
The functional data unit for totalVI.
A TotalPosterior instance is instantiated with a model and a gene_dataset, as well as additional arguments for PyTorch’s DataLoader. A subset of indices can be specified, for purposes such as splitting the data into train/test/validation. Each trainer instance of the TotalTrainer class can therefore have multiple TotalPosterior instances to train a model. A TotalPosterior instance also comes with many methods and utilities for its corresponding data.
- Parameters
model (TOTALVI) – A model instance from class TOTALVI
gene_dataset (GeneExpressionDataset) – A gene_dataset instance like CbmcDataset() with attribute protein_expression
shuffle (bool) – Specifies if a RandomSampler or a SequentialSampler should be used
indices (Optional[ndarray]) – Specifies how the data should be split with regards to train/test or labelled/unlabelled
data_loader_kwargs – Keyword arguments to pass into the DataLoader
Examples
Let us instantiate a trainer with a gene_dataset and a model:
>>> from scvi.dataset import CbmcDataset
>>> from scvi.inference import TotalTrainer
>>> from scvi.models import TOTALVI
>>> gene_dataset = CbmcDataset()
>>> totalvi = TOTALVI(gene_dataset.nb_genes, len(gene_dataset.protein_names),
...                   n_batch=gene_dataset.n_batches, use_cuda=True)
>>> trainer = TotalTrainer(totalvi, gene_dataset)
>>> trainer.train(n_epochs=500)
Methods Summary
compute_elbo(vae, **kwargs) – Computes the ELBO.
compute_marginal_log_likelihood([n_samples_mc, batch_size]) – Computes a biased estimator for log p(x, y), which is the marginal log likelihood.
compute_reconstruction_error(vae, **kwargs) – Computes log p(x | z), which is the reconstruction error.
Corrupts gene counts.
differential_expression_score(idx1, idx2[, …]) – Unified method for differential expression inference.
differential_expression_stats([M_sampling]) – Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True.
elbo() – Returns the Evidence Lower Bound associated to the object.
generate([n_samples, batch_size]) – Sample from posterior predictive.
generate_denoised_samples([n_samples, …]) – Samples from an adjusted posterior predictive.
generate_feature_correlation_matrix([n_samples, …]) – Wrapper of generate_denoised_samples() to create a gene-protein correlation matrix.
Estimates data’s count means, dispersions and dropout logits.
get_latent([sample]) – Output posterior z mean or sample, batch index, and label.
get_normalized_denoised_expression([n_samples, …]) – Returns the tensors of denoised normalized gene and protein expression.
get_protein_mean([n_samples, give_mean, …]) – Returns the tensors of protein mean (with foreground and background).
get_sample_dropout([n_samples, give_mean]) – Zero-inflation mixing component for genes.
get_sample_mixing([n_samples, give_mean, …]) – Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background).
get_sample_scale([transform_batch, eps, …]) – Helper function to provide normalized expression for DE testing.
imputation([n_samples]) – Gene imputation.
imputation_list([n_samples]) – This code is identical to the same function in posterior.py.
marginal_ll([n_mc_samples]) – Estimates the marginal likelihood of the object’s data.
reconstruction_error([mode]) – Returns the reconstruction error associated to the object.
Uncorrupts gene counts.
Methods Documentation
-
compute_elbo(vae, **kwargs)
Computes the ELBO.
The ELBO is the reconstruction error + the KL divergences between the variational distributions and the priors. It differs from the marginal log likelihood. Specifically, it is a lower bound on the marginal log likelihood plus a term that is constant with respect to the variational distribution. It still gives good insights on the modeling of the data, and is fast to compute.
- Parameters
vae (TOTALVI) –
**kwargs –
- Returns
-
compute_marginal_log_likelihood(n_samples_mc=100, batch_size=96)
Computes a biased estimator for log p(x, y), which is the marginal log likelihood.
Despite its bias, the estimator still converges to the true value of log p(x, y) as n_samples_mc (the number of Monte Carlo samples) goes to infinity. A fairly high value like 100 should be enough; 5000 is the standard in machine learning publications. Because of the Monte Carlo sampling, this method is not as computationally efficient as computing only the reconstruction loss.
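To make the "biased but consistent" behaviour concrete, here is a minimal sketch of one common estimator with these properties: the log of an average of importance weights. It is not the scvi implementation. It uses a hypothetical one-dimensional toy model (z ~ N(0, 1), x | z ~ N(z, 1)) for which log p(x) is known in closed form, and a deliberately imperfect proposal q(z | x) = N(x / 2, 1). All function names are illustrative assumptions.

```python
import numpy as np


def log_normal(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def estimate_log_marginal(x, n_samples_mc, rng):
    """Importance-sampling estimate of log p(x) for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1), with proposal q(z | x) = N(x / 2, 1)."""
    z = rng.normal(loc=x / 2.0, scale=1.0, size=n_samples_mc)
    # Log importance weights: log p(z) + log p(x | z) - log q(z | x).
    log_w = (
        log_normal(z, 0.0, 1.0)
        + log_normal(x, z, 1.0)
        - log_normal(z, x / 2.0, 1.0)
    )
    # Log of the average weight, computed stably (log-sum-exp trick).
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))


rng = np.random.default_rng(0)
x = 1.3
exact = log_normal(x, 0.0, 2.0)  # for this toy model, p(x) = N(0, 2)
estimate = estimate_log_marginal(x, n_samples_mc=5000, rng=rng)
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

By Jensen's inequality, the log of an averaged weight underestimates log p(x) in expectation for any finite number of samples, and the gap shrinks as n_samples_mc grows.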
-
compute_reconstruction_error(vae, **kwargs)
Computes log p(x | z), which is the reconstruction error.
It differs from the marginal log likelihood, but still gives good insights on the modeling of the data, and is fast to compute.
This is really a helper function to self.ll, self.ll_protein, etc.
-
differential_expression_score(idx1, idx2, mode='vanilla', batchid1=None, batchid2=None, use_observed_batches=False, n_samples=5000, use_permutation=True, M_permutation=10000, all_stats=True, change_fn=None, m1_domain_fn=None, delta=0.5, cred_interval_lvls=None, **kwargs)
Unified method for differential expression inference.
This function is an extension of the get_bayes_factors method that provides additional gene information to the user.
Two modes coexist:
- The “vanilla” mode follows the protocol described in [Lopez18]. In this case, we perform hypothesis testing based on the hypotheses
\[M_1: h_1 > h_2 ~\text{and}~ M_2: h_1 \leq h_2\]
DE can then be based on the study of the Bayes factors
\[\log \frac{p(M_1 \mid x_1, x_2)}{p(M_2 \mid x_1, x_2)}\]
- The “change” mode (described in [Boyeau19]) consists in estimating an effect size random variable (e.g., log fold-change) and performing Bayesian hypothesis testing on this variable. The change_fn function computes the effect size variable r based on two inputs corresponding to the normalized means in both populations.
Hypotheses:
\[M_1: r \in R_1 ~\text{(effect size r in region inducing differential expression)}\]
\[M_2: r \notin R_1 ~\text{(no differential expression)}\]
To characterize the region \(R_1\), which induces DE, the user has two choices:
1. A common case is when the region \([-\delta, \delta]\) does not induce differential expression. If the user specifies a threshold delta, we suppose that \(R_1 = \mathbb{R} \setminus [-\delta, \delta]\).
2. Specify a custom indicator function
\[f: \mathbb{R} \to \{0, 1\} ~\text{s.t.}~ r \in R_1 ~\text{iff}~ f(r) = 1\]
Decision-making can then be based on the estimates of
\[p(M_1 \mid x_1, x_2)\]
Both modes require sampling the posteriors of the normalized means. To that purpose, we sample the Posterior in the following way:
1. The posterior is sampled n_samples times for each subpopulation.
2. For computational efficiency (posterior sampling is quite expensive), instead of comparing the obtained samples element-wise, we can permute posterior samples. Remember that computing the Bayes factor requires sampling \(q(z_A \mid x_A)\) and \(q(z_B \mid x_B)\).
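The vanilla-mode procedure can be sketched with plain NumPy. This is an illustrative approximation under stated assumptions, not the library's code: `scale1` and `scale2` stand for posterior samples of the normalized means in each subpopulation (shape `(n_samples, n_genes)`), the function name `bayes_factor_vanilla` is hypothetical, and the small `eps` guarding the logarithm is an assumption.

```python
import numpy as np


def bayes_factor_vanilla(scale1, scale2, use_permutation=True,
                         m_permutation=10000, rng=None, eps=1e-8):
    """Estimate log p(M1 | x1, x2) / p(M2 | x1, x2) per gene,
    where M1: h1 > h2, from posterior samples of normalized means."""
    rng = np.random.default_rng() if rng is None else rng
    if use_permutation:
        # Instead of comparing samples element-wise, draw random pairs.
        i = rng.integers(0, scale1.shape[0], size=m_permutation)
        j = rng.integers(0, scale2.shape[0], size=m_permutation)
        first, second = scale1[i], scale2[j]
    else:
        # Element-wise comparison needs the same number of samples.
        first, second = scale1, scale2
    p_m1 = np.mean(first > second, axis=0)
    return np.log(p_m1 + eps) - np.log(1.0 - p_m1 + eps)


rng = np.random.default_rng(0)
# Synthetic samples: gene 0 is higher in population 1, gene 1 is unchanged.
scale1 = rng.gamma(shape=[20.0, 5.0], scale=0.1, size=(500, 2))
scale2 = rng.gamma(shape=[5.0, 5.0], scale=0.1, size=(500, 2))
bf = bayes_factor_vanilla(scale1, scale2, rng=rng)
print(bf)
```

A large positive value for a gene favours M1 (higher in population 1), a large negative value favours M2, and values near zero indicate no clear difference.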
Currently, the code covers several batch handling configurations:
1. If use_observed_batches=True, then batches are considered as observations and cells’ normalized means are conditioned on real batch observations.
2. If case (cell group 1) and control (cell group 2) are conditioned on the same batch ids. Examples:
>>> set(batchid1) == set(batchid2)
or
>>> batchid1 = batchid2 = None
3. If case and control are conditioned on different batch ids that do not intersect, i.e.,
>>> set(batchid1) != set(batchid2)
and
>>> len(set(batchid1).intersection(set(batchid2))) == 0
This function does not cover other cases yet and will warn users in such cases.
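The three configurations can be summarized as a small decision function. This is a hypothetical helper written for illustration only; the name `batch_configuration`, its return labels, and the `n_batches` argument (used to expand `None` into "all ids") are assumptions and not part of the documented API.

```python
import warnings


def batch_configuration(batchid1, batchid2, n_batches,
                        use_observed_batches=False):
    """Classify the batch handling case for a DE comparison."""
    if use_observed_batches:
        return "observed"
    # By default (None), all batch ids are taken into account.
    set1 = set(range(n_batches)) if batchid1 is None else set(batchid1)
    set2 = set(range(n_batches)) if batchid2 is None else set(batchid2)
    if set1 == set2:
        return "same"
    if set1.isdisjoint(set2):
        return "disjoint"
    # Partially overlapping batch ids are not one of the covered cases.
    warnings.warn("Batch ids partially overlap; this case is not covered.")
    return "unsupported"


print(batch_configuration(None, None, n_batches=3))      # same
print(batch_configuration([0, 1], [1, 0], n_batches=3))  # same
print(batch_configuration([0], [1, 2], n_batches=3))     # disjoint
```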
- Parameters
mode (Optional[str]) – one of [“vanilla”, “change”]
idx1 (Union[List[bool], ndarray]) – bool array masking subpopulation cells 1. Should be True where cell is from associated population
idx2 (Union[List[bool], ndarray]) – bool array masking subpopulation cells 2. Should be True where cell is from associated population
batchid1 (Union[List[int], ndarray, None]) – List of batch ids for which you want to perform DE analysis for subpopulation 1. By default, all ids are taken into account
batchid2 (Union[List[int], ndarray, None]) – List of batch ids for which you want to perform DE analysis for subpopulation 2. By default, all ids are taken into account
use_observed_batches (Optional[bool]) – Whether normalized means are conditioned on observed batches
use_permutation (bool) – Activates step 2 described above. Simply formulated, pairs obtained from posterior sampling (when calling sample_scale_from_batch) will be randomly permuted so that the number of pairs used to compute Bayes factors becomes M_permutation.
M_permutation (int) – Number of times we will “mix” posterior samples in step 2. Only makes sense when use_permutation=True
change_fn (Union[str, Callable, None]) – function computing effect size based on both normalized means
m1_domain_fn (Optional[Callable]) – custom indicator function of effect size regions inducing differential expression
delta (Optional[float]) – specific case of region inducing differential expression. In this case, we suppose that [-delta, delta] does not induce differential expression (LFC case)
cred_interval_lvls (Union[List[float], ndarray, None]) – List of credible interval levels to compute for the posterior LFC distribution
all_stats (bool) – whether additional metrics should be provided
**kwargs – Other keyword arguments for get_sample_scale
- Return type
- Returns
diff_exp_results. The most important columns are:
proba_de (probability of being differentially expressed in change mode)
bayes_factor (Bayes factors in the vanilla mode)
scale1 and scale2 (means of the scales in population 1 and 2)
When using the change mode, the mean, median, std of the posterior LFC
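As a rough illustration of how the change-mode columns relate to posterior samples, the sketch below computes a log fold-change per sample, the probability that it falls outside \([-\delta, \delta]\), and summary statistics. It is an assumption-laden sketch, not the library's code: the base-2 logarithm, the function name `change_mode_summary`, and every key other than `proba_de` are hypothetical.

```python
import numpy as np


def change_mode_summary(scale1, scale2, delta=0.5, m1_domain_fn=None):
    """Summarize posterior LFC samples; inputs have shape (n_samples, n_genes)."""
    # Effect size r: here a log2 fold-change of the normalized means (assumed).
    lfc = np.log2(scale1) - np.log2(scale2)
    if m1_domain_fn is None:
        # Default region R_1 = R \ [-delta, delta].
        def m1_domain_fn(r):
            return np.abs(r) > delta
    return {
        "proba_de": np.mean(m1_domain_fn(lfc), axis=0),
        "lfc_mean": lfc.mean(axis=0),
        "lfc_median": np.median(lfc, axis=0),
        "lfc_std": lfc.std(axis=0),
    }


rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.02, size=(1000, 2)))
scale1 = np.array([4.0, 1.0]) * noise          # gene 0 is about 4x higher
scale2 = np.ones((1000, 2)) * np.exp(rng.normal(0.0, 0.02, size=(1000, 2)))
summary = change_mode_summary(scale1, scale2, delta=0.5)
print(summary["proba_de"], summary["lfc_mean"])
```

Passing a custom `m1_domain_fn` replaces the default threshold rule, mirroring the second way of characterizing \(R_1\).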
-
differential_expression_stats(M_sampling=100)
Output average over statistics in a symmetric way (a against b), forget the sets if permutation is True
-
generate(n_samples=100, batch_size=64)
Sample from posterior predictive. Proteins are concatenated to genes.
- Parameters
- Return type
- Returns
x_new (torch.Tensor) – tensor with shape (n_cells, n_genes + n_proteins, n_samples)
x_old (torch.Tensor) – tensor with shape (n_cells, n_genes + n_proteins)
-
generate_denoised_samples(n_samples=25, batch_size=64, rna_size_factor=1, transform_batch=None)
Samples from an adjusted posterior predictive. Proteins are concatenated to genes.
-
generate_feature_correlation_matrix(n_samples=25, batch_size=64, rna_size_factor=1000, transform_batch=None, correlation_mode='pearson', log_transform=False)
Wrapper of generate_denoised_samples() to create a gene-protein correlation matrix
- Parameters
batch_size (int) – Mini-batch size for sampling. Lower means less GPU memory footprint
rna_size_factor (int) – size factor for RNA prior to sampling gamma distribution
transform_batch (Union[int, List[int], None]) – Batches to condition on. If transform_batch is:
- None, then real observed batch is used
- int, then batch transform_batch is used
- list of int, then values are averaged over provided batches.
log_transform (bool) – Whether to log transform denoised values prior to correlation calculation
- Returns
Correlation matrix
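A minimal sketch of how a feature correlation matrix could be derived from denoised samples, using NumPy only. It assumes the samples have shape `(n_cells, n_features, n_samples)`, as documented for generate(); the function name, the flattening of cells and samples into one axis, and the `1e-8` offset before the logarithm are assumptions made for illustration. Only the Pearson case is shown.

```python
import numpy as np


def feature_correlation_matrix(denoised, log_transform=False):
    """Pearson correlation between features (genes followed by proteins).

    denoised: array of shape (n_cells, n_features, n_samples).
    """
    n_cells, n_features, n_samples = denoised.shape
    # Treat every (cell, sample) pair as one observation of all features.
    flat = np.moveaxis(denoised, 2, 1).reshape(n_cells * n_samples, n_features)
    if log_transform:
        flat = np.log(flat + 1e-8)
    # rowvar=False: columns are the variables, rows are the observations.
    return np.corrcoef(flat, rowvar=False)


rng = np.random.default_rng(0)
denoised = rng.gamma(2.0, 1.0, size=(50, 3, 4))
# Make feature 2 an exact linear function of feature 0.
denoised[:, 2, :] = 2.0 * denoised[:, 0, :] + 1.0
corr = feature_correlation_matrix(denoised)
print(np.round(corr, 2))
```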
-
get_normalized_denoised_expression(n_samples=1, give_mean=True, transform_batch=None, sample_protein_mixing=True)
Returns the tensors of denoised normalized gene and protein expression
- Parameters
n_samples (int) – number of samples from posterior distribution
sample_protein_mixing (bool) – Sample mixing bernoulli, setting background to zero
give_mean (bool) – bool, whether to return samples along first axis or average over samples
transform_batch (Union[int, List[int], None]) – Batches to condition on. If transform_batch is:
- None, then real observed batch is used
- int, then batch transform_batch is used
- list of int, then values are averaged over provided batches.
- Return type
- Returns
Denoised genes, denoised proteins
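The interplay of transform_batch and give_mean described in the parameters can be sketched as follows. Everything here is hypothetical scaffolding: `sample_expression` and `toy_decode` are illustrative names, and the toy decoder is deterministic, whereas a real model would draw a fresh posterior sample on each call.

```python
import numpy as np


def sample_expression(decode, observed_batch, transform_batch=None,
                      n_samples=1, give_mean=True):
    """decode(batches) -> array (n_cells, n_features) for per-cell batch ids."""
    n_cells = observed_batch.shape[0]
    if transform_batch is None:
        batch_choices = [observed_batch]                      # real observed batch
    elif np.isscalar(transform_batch):
        batch_choices = [np.full(n_cells, transform_batch)]   # one fixed batch
    else:
        batch_choices = [np.full(n_cells, b) for b in transform_batch]

    samples = []
    for _ in range(n_samples):
        per_batch = [decode(batches) for batches in batch_choices]
        samples.append(np.mean(per_batch, axis=0))  # average over provided batches
    samples = np.stack(samples, axis=0)             # samples along the first axis
    return samples.mean(axis=0) if give_mean else samples


def toy_decode(batches):
    # Two features; expression is doubled for cells decoded in batch 1.
    return np.ones((batches.shape[0], 2)) * (1.0 + batches[:, None])


observed = np.array([0, 1, 1])
as_observed = sample_expression(toy_decode, observed)
as_batch_0 = sample_expression(toy_decode, observed, transform_batch=0)
averaged = sample_expression(toy_decode, observed, transform_batch=[0, 1])
stacked = sample_expression(toy_decode, observed, n_samples=4, give_mean=False)
print(as_observed, as_batch_0, averaged, stacked.shape, sep="\n")
```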
-
get_protein_mean(n_samples=1, give_mean=True, transform_batch=None)
Returns the tensors of protein mean (with foreground and background)
- Parameters
n_samples (int) – number of samples from posterior distribution
give_mean (bool) – bool, whether to return samples along first axis or average over samples
transform_batch (Union[int, List[int], None]) – Batches to condition on. If transform_batch is:
- None, then real observed batch is used
- int, then batch transform_batch is used
- list of int, then values are averaged over provided batches.
- Return type
- Returns
Protein NB Mixture mean
-
get_sample_mixing(n_samples=1, give_mean=True, transform_batch=None)
Returns mixing bernoulli parameter for protein negative binomial mixtures (probability background)
- Parameters
n_samples (int) – number of samples from posterior distribution
sample_protein_mixing – Sample mixing bernoulli, setting background to zero
give_mean (bool) – bool, whether to return samples along first axis or average over samples
transform_batch (Union[int, List[int], None]) – Batches to condition on. If transform_batch is:
- None, then real observed batch is used
- int, then batch transform_batch is used
- list of int, then values are averaged over provided batches.
- Return type
- Returns
array of probability background
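To show how the background probability enters a two-component mixture, the sketch below combines hypothetical foreground and background means. The mean of a mixture is the probability-weighted sum of the component means, which is a general fact; the function names and the way the Bernoulli draw zeroes out background are illustrative assumptions, not the library's code.

```python
import numpy as np


def protein_mixture_mean(rate_fore, rate_back, prob_background):
    """Mean of a foreground/background mixture for each protein."""
    return prob_background * rate_back + (1.0 - prob_background) * rate_fore


def foreground_only(rate_fore, prob_background, rng):
    """Sample the mixing Bernoulli and set background entries to zero."""
    is_background = rng.random(prob_background.shape) < prob_background
    return np.where(is_background, 0.0, rate_fore)


rate_fore = np.array([10.0, 100.0])
rate_back = np.array([2.0, 2.0])

mixture_mean = protein_mixture_mean(rate_fore, rate_back, np.array([0.5, 0.0]))
print(mixture_mean)  # [  6. 100.]

rng = np.random.default_rng(0)
denoised = foreground_only(rate_fore, np.array([1.0, 0.0]), rng)
print(denoised)  # [  0. 100.]
```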
-
get_sample_scale(transform_batch=None, eps=0.5, normalize_pro=False, sample_bern=True, include_bg=False)
Helper function to provide normalized expression for DE testing.
For normalized, denoised expression, please use get_normalized_denoised_expression().
- Parameters
transform_batch – Int of batch to “transform” all cells into (Default value = None)
eps – Prior count to add to protein normalized expression (Default value = 0.5)
normalize_pro – bool, whether to make protein expression sum to one in a cell (Default value = False)
include_bg – bool, whether to include the background component of expression (Default value = False)
sample_bern – (Default value = True)
- Return type
- Returns
-
imputation_list(n_samples=1)
This code is identical to the same function in posterior.py, except that we use the totalVI definition of model.get_sample_rate.
-
marginal_ll(n_mc_samples=1000)
Estimates the marginal likelihood of the object’s data.
- Parameters
n_mc_samples – Number of MC estimates to use
- Returns
Marginal LL