CytoVI

CytoVI#

CytoVI [Ingelfinger et al., 2025] (Python class CYTOVI) is a generative model for cytometry data that leverages deep probabilistic latent variable modeling to enable denoising, imputation, integration, and differential analysis across technologies and batches.

The advantages of CytoVI are:

Provides a denoised and batch-corrected low-dimensional cell state representation across antibody-based single cell technologies.
Facilitates the integration of data from different antibody panels and the imputation of non-overlapping markers.
Enables cross-technology integration (e.g., flow cytometry, mass cytometry, and CITE-seq) and modality imputation, facilitating unified analysis across cytometry and transcriptomics.
Allows for the automated identification of disease-associated cell states (label-free differential abundance analysis) and differential protein expression between groups of cells.
Scalable to very large datasets (>20 million cells).

The limitations of CytoVI include:

Requires at least partial feature overlap across datasets for effective integration.
Effectively requires a GPU for fast inference.
Assumes measurements have been corrected for fluorescent spillover and preprocessed according to standard practice.

Preliminaries#

CytoVI can process protein expression matrices from the following technologies:

Flow cytometry (traditional PMT-based cytometers or full spectrum analyzers)
Mass cytometry
CITE-seq (providing RNA and protein expression for the same cells)
Any other antibody-based single cell measurement of protein expression

For each of these technologies CytoVI expects as input:

A transformed (and optionally scaled) protein expression matrix $X \in \mathbb{R}^{N \times P}$, where each row is a single cell among $N$ total cells and each column is a protein marker among $P$ total proteins.
Optionally, experimental covariates that the user wishes to control for such as batch annotations, different technologies, or confounding variables such as donor sex. For simplicity, we will describe the case of a one-hot encoded batch identifier $s_n \in \{1,...,S\}$.
Optionally, cell label annotations can be used to weakly inform the prior of the latent space. We assume a one-hot encoded label identifier $y_n \in\{1,...,Y\}$.
Optionally, sample annotations (e.g., indicating which cells were measured from which patient) to inform differential abundance and expression analyses.

Preprocessing expected: CytoVI expects the input matrix $x_{np}$ to be processed using common preprocessing options for cytometry data in the field. These include, for instance, arcsinh, log1p, biexponential or logicle transformations and are optionally followed by feature-wise scaling (e.g., z-score, min-max or rank-scaled).

Descriptive model#

CytoVI is a latent variable model that assumes each cell $n$ has a $d$-dimensional latent representation $z_n$ capturing its intrinsic state that is decoupled from the variation in $s$. It models the observed protein expression $x_{np}$ for protein $p$ as a function of $z_n$ and its associated batch $s_n$.

We assume a prior on the latent space $z_n$ that can be either:

\[\begin{split} z_n \sim \begin{cases} \mathcal{N}(0, 1), & \text{if isotropic Gaussian prior} \\ \sum_{k=1}^K \pi_k \, \mathcal{N}(\mu_k, \sigma_k^2), & \text{if mixture of Gaussians prior} \end{cases} \end{split}\]

By default, a mixture of Gaussians is used, which allows more expressive modeling and improves integration of heterogeneous data. Optionally, prior weights can be informed by known labels $y$:

\[ \pi_k' = \pi_k + \lambda y \]

where $\lambda$ is a tunable weight (default $\lambda=10$).

Generative process#

The decoder network maps latent variables and batch vectors to a vector of protein expression parameters:

\[ f_x: \mathbb{R}^d \times \{0,1\}^S \to \mathbb{R}^P \]

We assume the observed expression $x_{np}$ follows one of the following distributions:

\[\begin{split} x_{np} \mid z_n, s_n \sim \begin{cases} \mathcal{N}(\mu_{np}, \sigma_{np}^2), & \text{if Gaussian model (default)} \\ \text{Beta}(\alpha_{np}, \beta_{np}), & \text{if Beta model} \end{cases} \end{split}\]

The Gaussian likelihood is suited for arcsinh/log-transformed (and optionally scaled) cytometry data. The Beta likelihood requires a scaling of the data to a $[0, 1]$ range.

Inference#

Since the posterior $p(z_n \mid x_n)$ is intractable, we use variational inference and specifically auto-encoding variational bayes (see Variational Inference) to learn both the model parameters (the neural network params, params for the protein observation model, etc.) and an approximate posterior distribution.

Handling of overlapping antibody panels#

To integrate datasets with different protein panels, CytoVI employs a masking strategy inspired by TOTALVI [Gayoso et al., 2021].

Let $x_n^{(s)} \in \mathbb{R}^{P_s}$ be the observed input vector for cell $n$ in batch $s$. Denote:

$\mathcal{T}_s$ = set of proteins measured in batch $s$
$\mathcal{I} = \bigcap_s \mathcal{T}_s$ = shared proteins across batches
$\mathcal{U} = \bigcup_s \mathcal{T}_s$ = union of all proteins

The binary mask $M_n^{(s)}$ indicates which features were actually observed: $$ M_{np}^{(s)} = \begin{cases} 1 & \text{if } p \in \mathcal{T}_s \\ 0 & \text{otherwise} \end{cases} $$

This binary mask is generated automatically when using CytoVI’s ‘merge_batches’ function:

>>> adata_list = [adata_batch1, adata_batch2]
>>> adata = scvi.external.CytoVI.merge_batches(adata_list)

Consecutively, the encoder uses only the shared features $\mathcal{I}$:

\[ z_n \sim q(z_n \mid x_n^{(\mathcal{I})}) \]

The decoder reconstructs $\mathcal{U}$, yielding a reconstructed protein expression vector $x_n^{\mathcal{(U)}}$ for each cell.

Tasks#

Here we provide an overview of some of the tasks that scVI can perform. Please see scvi.external.CYTOVI for the full API reference.

Dimensionality reduction#

For dimensionality reduction, the mean of the approximate posterior $q_\eta(z_n \mid x_n, s_n)$ is returned by default. This is achieved using the method:

>>> latent = model.get_latent_representation()
>>> adata.obsm["X_CytoVI"] = latent

Users may also return samples from this distribution, as opposed to the mean by passing the argument give_mean=False. The latent representation can be used to create a nearest neighbor graph with scanpy with:

>>> import scanpy as sc
>>> sc.pp.neighbors(adata, use_rep="X_CytoVI")
>>> adata.obsp["distances"]

Transfer learning#

A CytoVI model can be pre-trained on reference data and updated with query data using load_query_data(), which then facilitates transfer of metadata like cell type annotations using impute_categories_from_reference().

>>> model_query = scvi.external.CYTOVI.load_query_data(adata = adata_query, reference_model=model)
>>> model_query.is_trained = True
>>> adata_query.obs['imputed_label'] = model_query.impute_categories_from_reference(adata_reference, cat_key = 'cell_type')

See the Transfer learning guide for more information.

Label-free differential abundance#

CytoVI supports label-free differential abundance estimation via differential_abundance() to detect shifts in cellular composition across sample-level covariates (e.g., disease vs. control), as described in [Boyeau et al., 2025].

This method:

Aggregates the approximate posterior distributions $q(z \mid x_n)$ across all cells within each sample to obtain a sample-level latent density.
Compares the average latent densities across sample groups $A$ and $B$ (e.g., condition vs. control) to compute a log-ratio score $r_{A,B}(z)$.
Identifies enriched or depleted regions of latent space without requiring clustering.

This approach enables cluster-free detection of condition-associated cell states by directly comparing their latent representations across groups.

>>> da_res = model.differential_abundance(adata, groupby='group')

Normalization/denoising/imputation of expression#

In get_normalized_expression() CytoVI returns the expected value of $x_{n}^{(s)}$ under the approximate posterior. For one cell $n$, this can be written as:

\[ \mathbb{E}_{q_\eta(z_n \mid x_n)}\left[f_x\left( z_n, s_n \right) \right] \]

By default, we decode the reconstructed protein expression across all possible batches $S$ and return the mean, yielding a batch-corrected version of the protein expression.

In the case of overlapping antibody panels, CytoVI by default returns protein expression of all markers $x_{n,\mathcal{U}}$, thereby effectively imputing missing proteins.

Differential expression#

Differential expression analysis is achieved with differential_expression(). CytoVI tests differences in magnitude of $f_x\left( z_n, s_n \right)$. More info is in Differential Expression.

If a sample_key is provided, CytoVI by default samples equal numbers of cells for each patient for differential expression computation.

>>> de_res = model.differential_expression(adata, groupby='group')

Data simulation#

Data can be generated from the model using the posterior predictive distribution in posterior_predictive_sample(). This is equivalent to feeding a cell through the model, sampling from the posterior distributions of the latent variables, retrieving the likelihood parameters (of $p(x \mid z, s)$), and finally, sampling from this distribution.

RNA/modality imputation#

CytoVI enables cross-modal imputation by leveraging a shared latent space between datasets with overlapping protein features via impute_rna_from_reference(). For example, transcriptomic profiles from a CITE-seq reference can be imputed into a flow cytometry dataset.

To do this:

A joint CytoVI model is trained on both datasets, aligning them in a shared latent space.
For each query cell (e.g., flow cytometry), its k-nearest neighbors (default: k = 20) are identified in the latent space from the reference (e.g., CITE-seq).
The imputed RNA expression is obtained by averaging the RNA profiles of those neighbors.

This approach allows label-free imputation of unobserved modalities, such as gene expression, in datasets where only protein measurements are available.

>>> adata_imputed_rna = model.impute_rna_from_reference(reference_batch='CITE_seq', adata_rna = adata_rna, layer_key='rna_normalized', return_query_only = True)