Advanced Tutorial: Multi-Panel Integration and Downstream Analysis with CytoVI

Advanced Tutorial: Multi-Panel Integration and Downstream Analysis with CytoVI#

In this tutorial, we demonstrate advanced functionality of CytoVI, a deep generative model for protein expression measurements from technologies such as flow cytometry, mass cytometry, or CITE-seq. Building on the quick start tutorial, we now explore how CytoVI can be used to integrate multiple cytometry panels, impute missing markers, transfer annotations between datasets, and uncover biological differences through differential expression and abundance analysis.

If you are new to CytoVI or unfamiliar with data loading, preprocessing, or training the model, we recommend starting with the quick start tutorial where these fundamental steps are introduced in detail. In this tutorial, we will work with preprocessed and partially annotated data to focus on the advanced use cases of the model.

Specifically, we analyze conventional flow cytometry data of tumor-infiltrating T cells obtained from patients with B-cell non-Hodgkin lymphoma (BNHL). These samples were profiled using two distinct antibody panels, which share a subset of common markers. Using CytoVI, we will integrate both panels into a shared representation space, infer missing marker expression, and perform downstream biological analysis to gain insights into T cell heterogeneity across patients.

Plan for this tutorial:

Load and inspect preprocessed data
Train a CytoVI model that integrates both antibody panels
Visualize the joint latent space and evaluate panel integration
Impute non-overlapping protein markers and assess imputation quality
Automatically annotate immune cell types via label transfer
Quantify differential protein expression across clusters
Detect disease-associated T cell states using label-free differential abundance analysis

# Install from GitHub for now
!pip install --quiet scvi-colab
from scvi_colab import install

install()

import os
import random
import tempfile

import matplotlib.pyplot as plt  # type: ignore
import numpy as np  # type: ignore
import pandas as pd  # type: ignore
import scanpy as sc  # type: ignore
import scvi
import seaborn as sns  # type: ignore
import torch  # type: ignore
from rich import print  # type: ignore
from scipy.stats import mannwhitneyu
from scvi.external import cytovi  # type: ignore
from sklearn.cluster import KMeans  # type: ignore

os.environ["SCIPY_ARRAY_API"] = "1"

sc.set_figure_params(figsize=(4, 4))

scvi.settings.seed = 0
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
print("Last run with scvi-tools version:", scvi.__version__)

Last run with scvi-tools version: 1.4.2

Loading the data#

In this tutorial, we will work with a curated, lightweight subset of flow cytometry data from the BNHL study by Roider et al. 2024 (Nature Cell Biology, https://doi.org/10.1038/s41556-024-01358-2). The dataset includes flow cytometry measurements of T cells from 33 donors across two distinct antibody panels, each profiling 12 protein markers along with morphological features such as forward and side scatter (FSC and SSC). Samples were acquired across four independent experimental batches. For ease of use, the data have been preprocessed to correct for fluorescent spillover, restricted to live single-cell events, and transformed using a hyperbolic arcsin transformation, scaled and subsampled to ~5000 cells per panel. We will access the dataset as preprocessed .h5ad files. For demonstration purposes data from one of the panels comes with cell type annotations.

temp_dir_obj = tempfile.TemporaryDirectory()

adata_p1_path = os.path.join(temp_dir_obj.name, "Roider_et_al_BNHL_panel1.h5ad")
adata_p1 = sc.read(
    adata_p1_path,
    backup_url="https://exampledata.scverse.org/scvi-tools/Roider_et_al_BNHL_panel1.h5ad",
)

adata_p2_path = os.path.join(temp_dir_obj.name, "Roider_et_al_BNHL_panel2.h5ad")
adata_p2 = sc.read(
    adata_p2_path,
    backup_url="https://exampledata.scverse.org/scvi-tools/Roider_et_al_BNHL_panel2.h5ad",
)

adata_p1

AnnData object with n_obs × n_vars = 4983 × 14
    obs: 'sample_id', 'PatientID', 'batch', 'panel', 'Entity', 'cell_type'
    layers: '_nan_mask', 'raw', 'scaled', 'transformed'

As the data has been preprocessed already, we can directly merge the two panels into one anndata object using cytovi.merge_batches(). This will automatically register a nan_layer that will handle the modeling of missing markers under the hood.

adata = cytovi.merge_batches([adata_p1, adata_p2], batch_key="panel_batch")
adata

AnnData object with n_obs × n_vars = 9966 × 18
    obs: 'sample_id', 'PatientID', 'batch', 'panel', 'Entity', 'cell_type', 'panel_batch'
    var: '_batch_0', '_batch_1'
    layers: '_nan_mask', 'raw', 'scaled', 'transformed'

Inspection of the histograms for each marker demonstrates that some proteins, such as CD4, CD3, CD45RA, CD69, were detected in both antibody panels, while four markers were unique to batch one (e.g. CD25, CXCR5) and four additional markers were unique to batch two (e.g. CD244 and TIM-3).

cytovi.plot_histogram(adata, layer_key="scaled", groupby="panel")

../../../_images/6e69ca5187f9d5c5d742fb39635944e0dc4cd053c5791fda3f4fd28749dcb42d.png

Additionally, we see that CD69 demonstrates a batch effect in one of the four batches.

cytovi.plot_biaxial(adata, marker_x="CD69", marker_y="CD4", layer_key="scaled", color="batch")

../../../_images/1c942382302a0953e46d53c112428f8dfc77deb8db24162d085634197b1e718a.png

Training a CytoVI model#

In the next step, we will train a CytoVI model to control for technical variation between the batches. As our dataset consists of multiple patients of different B cell lymphoma diagnoses, we will additionally specify a sample_key, which will be used for downstream application such as performing differential expression across patients or for the identification of disease-associated T cell states. If using CytoVI’s cytovi.merge_batches function to combine both panels, CytoVI will automatically register a nan_layer and handle imputation of missing proteins. In this case only the shared backbone markers are encoded into the latent representation, while the decoder network reconstructs the full protein panel.

cytovi.CYTOVI.setup_anndata(adata, layer="scaled", batch_key="batch", sample_key="PatientID")
model = cytovi.CYTOVI(adata)
model.train(n_epochs_kl_warmup=50)

Monitored metric elbo_validation did not improve in the last 30 records. Best score: -14.847. Signaling Trainer to stop.

model

CytoVI Model with the following params: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, 
protein_likelihood: normal, latent_distribution: normal, 
MoG prior: True, n_labels 1, n_proteins: 18, 
Impute missing markers: True, 
Backbone markers: CD3, CD4, CD45RA, CD69, CD8, FSC-A, FoxP3, Ki67, PD1, SSC-A
Training status: Trained

plt.plot(model.history["elbo_train"], label="Train")
plt.plot(model.history["elbo_validation"], label="Validation")
plt.xlabel("Epochs")
plt.ylabel("ELBO")
plt.legend()
plt.title("Training vs Validation ELBO")
plt.show()

../../../_images/50ea7e7897da2e57bf66de7512cfa4e70c687dc2f9ef205f409b17f436582679.png

Visualize the joint latent space#

Next, we get the latent representation of our cells while controlling for batch and panel variation and visualize the joint latent space using UMAP.

adata.obsm["X_CytoVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_CytoVI", transformer="pynndescent")
sc.tl.umap(adata, min_dist=0.4)
sc.pl.umap(adata, color=["batch", "panel", "Entity"])

../../../_images/5e34fb8e42321b91978a254f5adb889e715428190c85644b4ddb5ea3d2567d3e.png

We can see that this latent representation effectively controlled for batch and panel variation and yielded a cell representation that still maintains the variability between the different disease entities.

Impute non-overlapping protein markers#

Next, we obtain the batch corrected protein expression from the CytoVI model that automatically imputes non-overlapping protein markers. Here, we will sample ten times from the posterior distribution and generate the imputed protein expression ten times in order to assess the imputation uncertainty. We take the mean over these ten samples as our estimate for the imputed protein expression.

imp_expr = model.get_normalized_expression(n_samples=10, return_mean=False)
adata.layers["imputed"] = imp_expr.mean(axis=0).copy()

Inspection of the marker histograms now shows the complete imputed protein expression for all proteins present in the combined set of the two antibody panels. Additionally, we can now query markers that were unique for each of the panels, and display for example the expression of the costimulatory protein CD244 versus the IL-2 high affinity chain CD25.

cytovi.plot_histogram(adata, layer_key="imputed", groupby="panel")

../../../_images/e7f99b3a4b44d5e7677123f6b841f75bfd01ec705c9ed236297cb2edbf6cdf25.png

cytovi.plot_biaxial(adata, marker_x="CD25", marker_y="CD244", layer_key="imputed", color="panel")

../../../_images/4f37d24f38b7d21dd8a8b5f990e75144561c6e234a8363aed6f39be862283c88.png

Next, we compute the coefficient of variation across the ten imputed expression estimates as a measure of uncertainty of the posterior samples and visualize the uncertainty across the cell and feature axes to judge imputation performance.

adata.layers["imputed_cv"] = 100 * imp_expr.std(axis=0).copy() / imp_expr.mean(axis=0).copy()

adata.var["var_imp_uncertainty"] = adata.layers["imputed_cv"].mean(axis=0).copy()
adata.obs["obs_imp_uncertainty"] = adata.layers["imputed_cv"].mean(axis=1).copy()

uncertainty = adata.var["var_imp_uncertainty"].sort_values()

plt.figure(figsize=(6, 5))
plt.barh(uncertainty.index, uncertainty.values, color="steelblue")
plt.xlabel("Imputation Uncertainty")
plt.ylabel("Protein")
plt.title("Imputation Uncertainty per Protein")
plt.tight_layout()
plt.show()

../../../_images/84626d554b41c5f2fe96387bb5f96b32c7dab2c8d552542399853eaec285ea4f.png

This analysis demonstrated that the model was less certain when imputing Ki67 expression (even though it was measured in both of the panels) compared to CD25 or CD244 that were only measured in one of the panels.

sc.pl.umap(adata, color=["obs_imp_uncertainty", "batch"])

../../../_images/a67efe78324f88112849d2eddcf902c1e360c2fd4a86a79cf22abccb0d4f8f50.png

Visualizing the aggregated imputation uncertainty per cell provides an estimate of how reliably missing markers can be imputed. In this example, the uncertainty plot reveals that the model is especially uncertain for cells in regions of CytoVI’s latent space where batch representation is imbalanced. This underscores the importance of imputing markers only when integrating biologically comparable studies. We can also query and visualize the imputation uncertainty of individual markers to judge how much we can trust the imputation results in downstream analyses.

sc.pl.umap(adata, color="Ki67", layer="imputed", title="Imputed Ki67 Expression")
sc.pl.umap(adata, color="Ki67", layer="imputed_cv", title="Uncertainty of Ki67 Imputation")

../../../_images/b9652b38f88d90b1ac78a5c11da6cb3cb21df761f3ee7b8dab9fa982063adef6.png

../../../_images/af44cf27667d09d7f4518a8476c4e35fa2014c7be83e43c5652d78f850666dbd.png

Label transfer#

In our example, cells from one antibody panel were already annotated, whereas data from the second panel lacked annotations. We will demonstrate how the integrated CytoVI latent space can be used to transfer labels—such as cell type annotations—from the annotated panel to the unannotated dataset.

for panel in adata.obs["panel"].drop_duplicates():
    adata_panel = adata[adata.obs["panel"] == panel].copy()
    sc.pl.umap(adata_panel, color="cell_type", title=panel)

../../../_images/868971fdbebedd8f4578e3bc338d8b6d3acd0bfe4fd63940294cc10c7ccd43ae.png

../../../_images/b34fcf2d4fb34cecd3321dbf26209e8a1cfecf7aff6504e211e332d7bbafd077.png

adata_ref = adata[adata.obs["panel"] == "panel1"].copy()
adata.obs["cell_type_predicted"] = model.impute_categories_from_reference(
    adata_reference=adata_ref, cat_key="cell_type"
)

sc.pl.umap(adata, color="cell_type_predicted")

../../../_images/006eec2beaea26c550958b1535da08a7de80a0108d1b0974287c999fb8cfefa5.png

Since both antibody panels were applied to the same patients we expect a similar proportion of cell types across both panels and can use this as a quick sanity check.

adata.obs.groupby(["panel", "cell_type_predicted"]).size().unstack().plot(kind="bar", stacked=True)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

../../../_images/5d87cc1fa4f36051d616a4f1b9f0c167eb52eba07f7827fc6e8dfebdff9d5909.png

We can then visualize the expression profile of the transferred cell type labels across the combined marker space of both antibody panels.

sc.pl.matrixplot(
    adata,
    var_names=adata.var_names,
    groupby="cell_type_predicted",
    layer="imputed",
    dendrogram=True,
    standard_scale="var",
    cmap="mako",
)

WARNING: dendrogram data not found (using key=dendrogram_cell_type_predicted). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

../../../_images/f9bc96af05650da1af5f3379888d49362de121d35d322a4c65979db89eb66557.png

Differential expression#

We can now directly query the generative model to find differentially expressed proteins for each of the transferred cell type labels in relation to all other cells.

de_res = model.differential_expression(groupby="cell_type_predicted", delta=0.1)

de_res

	proba_de	proba_not_de	bayes_factor	scale1	scale2	pseudocounts	delta	lfc_mean	lfc_median	lfc_std	lfc_min	lfc_max	is_de_fdr_0.05	comparison	group1	group2
CD45RA	0.9706	0.0294	3.496919	0.434004	0.213028	1.000000e-10	0.1	1.041761	1.165231	0.673131	-2.073050	2.584914	True	Naive CD4 T vs Rest	Naive CD4 T	Rest
PD1	0.9592	0.0408	3.157417	0.164877	0.382105	1.000000e-10	0.1	-1.112946	-1.232954	0.663755	-2.513757	1.706124	True	Naive CD4 T vs Rest	Naive CD4 T	Rest
CD31	0.9104	0.0896	2.318529	0.312120	0.382095	1.000000e-10	0.1	-0.198945	-0.199453	0.626980	-1.818321	1.523065	False	Naive CD4 T vs Rest	Naive CD4 T	Rest
ICOS	0.9076	0.0924	2.284677	0.217959	0.340132	1.000000e-10	0.1	-0.579738	-0.550900	0.513512	-1.970637	1.384790	False	Naive CD4 T vs Rest	Naive CD4 T	Rest
SSC-A	0.8924	0.1076	2.115494	0.291745	0.356592	1.000000e-10	0.1	-0.318686	-0.294884	0.686055	-2.617624	1.761648	False	Naive CD4 T vs Rest	Naive CD4 T	Rest
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
CD69	0.8228	0.1772	1.535434	0.434981	0.382993	1.000000e-10	0.1	0.193504	0.185164	0.396212	-0.931715	1.393487	False	Ttox EM3 vs Rest	Ttox EM3	Rest
CXCR5	0.7800	0.2200	1.265666	0.358435	0.373278	1.000000e-10	0.1	-0.041337	-0.037766	0.373335	-1.434914	1.236979	False	Ttox EM3 vs Rest	Ttox EM3	Rest
IKZF3	0.7690	0.2310	1.202673	0.660656	0.566145	1.000000e-10	0.1	0.231808	0.219143	0.220237	-0.400252	1.075055	False	Ttox EM3 vs Rest	Ttox EM3	Rest
FSC-A	0.7680	0.2320	1.197052	0.599032	0.522675	1.000000e-10	0.1	0.208974	0.196232	0.271950	-0.626969	1.533572	False	Ttox EM3 vs Rest	Ttox EM3	Rest
CD45RA	0.7396	0.2604	1.043891	0.188851	0.234621	1.000000e-10	0.1	-0.194501	-0.038121	0.583049	-2.589115	1.338408	False	Ttox EM3 vs Rest	Ttox EM3	Rest

234 rows × 16 columns

These results demonstrate that CD45RA is enriched in Naive T helper cells versus all other cells, while PD-1 is decreased, which is in line with our understanding of these marker proteins.

Label-free differential abundance#

Next, we will use CytoVI’s label-free differential abundance analysis to automatically detect positions in latent space that are associated with a covariate of interest. In this case we will apply the model.differential_abundance method to identify T cell states that are associated with the different lymphoma entities (one versus all). This will for each cell and each lymphoma entity yield a differential abundance (DA) score that estimates how strongly the cell is associated with the respective lymphoma entity compared to all other entities. We can then visualize the DA scores on the latent space to identify regions characteristic for disease.

da_res = model.differential_abundance(groupby="Entity")

da_res

	DA_MCL	DA_DLBCL	DA_MZL	DA_rLN	DA_FL
0	0.721177	-1.917385	-0.464626	-1.462440	1.483964
1	0.573540	1.083618	2.072742	-4.232052	-0.516788
2	0.400154	-1.397992	-0.409788	-1.452880	1.708511
3	1.003199	3.260658	2.854763	-1.661751	-2.218349
4	-0.070237	-0.819746	0.305704	-1.453968	1.619961
...	...	...	...	...	...
9961	-0.234638	2.258600	2.350037	-2.985151	-0.401577
9962	1.480638	-1.434464	-1.796121	4.641170	-0.752605
9963	1.025166	-0.003979	-1.792423	4.319269	-1.124134
9964	-0.488432	-0.128273	-0.130234	-5.210672	2.050974
9965	-0.105505	2.035009	1.945004	-3.339054	-0.216951

9966 rows × 5 columns

adata.obs = pd.concat(
    [adata.obs, pd.DataFrame(da_res.values, columns=da_res.columns, index=adata.obs.index)], axis=1
)
sc.pl.umap(adata, color=da_res.columns, cmap="icefire", ncols=3, vmin=-3, vmax=3)

../../../_images/df8433a00cc05b8eeef108f44b813d041e8247ca71878ed97a9ece78266270e1.png

In a next step we want to use these DA scores to determine differentially abundant clusters that are associated with disease. For this we will concatenate the DA scores for each tumor entity to the CytoVI latent space and perform Kmeans clustering.

da_latent = np.hstack((da_res, adata.obsm["X_CytoVI"]))

kmeans = KMeans(n_clusters=13, random_state=0).fit(da_latent)
adata.obs["da_cluster"] = kmeans.labels_.astype("str")

sc.pl.umap(adata, color="da_cluster", title="Differential Abundance Clusters")

../../../_images/e228f67990a4d053fe708d889ede03df3c2d98817473cd78539ddbf4f73a5668.png

Next, we will compute the relative frequency of the DA clusters per patient and test for differential abundance compared to the control group using a Mann-Whitney U test. We will do this for the cluster that showed highest DA scores for follicular lymphoma (FL) patients. Note: in this case it is cluster 0 but it can change depending on the runtime environment of the notebook.

freq_table = adata.obs.groupby(["PatientID", "da_cluster"]).size().unstack(fill_value=0)
freq_table_normalized = freq_table.div(freq_table.sum(axis=1), axis=0) * 100
res = pd.merge(
    freq_table_normalized.reset_index("PatientID"),
    adata.obs[["PatientID", "Entity"]].drop_duplicates(),
    on="PatientID",
)

cluster_oi = "0"
control = "rLN"
group_order = ["rLN", "DLBCL", "MCL", "FL", "MZL"]
control_vals = res.loc[res["Entity"] == control, cluster_oi]

plt.figure(figsize=(6, 4))
sns.violinplot(res, x="Entity", y=cluster_oi, hue="Entity", inner="box", order=group_order)
plt.ylim(0, None)
plt.ylabel(f"Frequency of cluster {cluster_oi} (%)")

y_max = res[cluster_oi].max()
for i, entity in enumerate(group_order):
    if entity != control:
        p = mannwhitneyu(control_vals, res.loc[res["Entity"] == entity, cluster_oi]).pvalue
        if p < 0.05:
            plt.text(i, y_max + 2, f"p={p:.3e}", ha="center", fontsize=8)

plt.tight_layout()
plt.show()

../../../_images/96b7869627acc86009a7ee73072e8c49f7cb895ac136633c1158ce4e60310d78.png

This analysis demonstrates that we indeed retrieved T cell states that were differentially abundant for the lymphoma entity. In a last step we compute the confusion matrix with the predicted cell type labels in order to assign these cells to a canonical T cell lineage.

conf_mtx = sc.metrics.confusion_matrix(adata.obs["cell_type_predicted"], adata.obs["da_cluster"])
conf_clust_oi = conf_mtx[cluster_oi].sort_values()

plt.figure(figsize=(6, 4))
plt.barh(conf_clust_oi.index, conf_clust_oi.values)
plt.xlabel("Cluster Cverlap")
plt.ylabel("Predicted Cell Type Label")
plt.title(f"Cell Type Correspondence of Cluster {cluster_oi}")
plt.tight_layout()
plt.show()

../../../_images/3d7ea6f7e92e61176e1e75657c782fe4037841af5f388eba5a9dc3edab9ea33e.png

This analysis demonstrates that the differentially abundant T cell cluster we identified using our label-free DA analysis appear to be mainly comprised of T follicular helper cells - a subpopulation of T helper cells that stimulate B cell responses and has been associated with Follicular Lymphoma in the original study by Roider et al.