Advanced Tutorial: Multi-Panel Integration and Downstream Analysis with CytoVI#
In this tutorial, we demonstrate advanced functionality of CytoVI, a deep generative model for protein expression measurements from technologies such as flow cytometry, mass cytometry, or CITE-seq. Building on the quick start tutorial, we now explore how CytoVI can be used to integrate multiple cytometry panels, impute missing markers, transfer annotations between datasets, and uncover biological differences through differential expression and abundance analysis.
If you are new to CytoVI or unfamiliar with data loading, preprocessing, or training the model, we recommend starting with the quick start tutorial where these fundamental steps are introduced in detail. In this tutorial, we will work with preprocessed and partially annotated data to focus on the advanced use cases of the model.
Specifically, we analyze conventional flow cytometry data of tumor-infiltrating T cells obtained from patients with B-cell non-Hodgkin lymphoma (BNHL). These samples were profiled using two distinct antibody panels, which share a subset of common markers. Using CytoVI, we will integrate both panels into a shared representation space, infer missing marker expression, and perform downstream biological analysis to gain insights into T cell heterogeneity across patients.
Plan for this tutorial:
Load and inspect preprocessed data
Train a CytoVI model that integrates both antibody panels
Visualize the joint latent space and evaluate panel integration
Impute non-overlapping protein markers and assess imputation quality
Automatically annotate immune cell types via label transfer
Quantify differential protein expression across clusters
Detect disease-associated T cell states using label-free differential abundance analysis
# Install from GitHub for now
!pip install --quiet scvi-colab
from scvi_colab import install
install()
import os
import random
import tempfile
import matplotlib.pyplot as plt # type: ignore
import numpy as np # type: ignore
import pandas as pd # type: ignore
import scanpy as sc # type: ignore
import scvi
import seaborn as sns # type: ignore
import torch # type: ignore
from rich import print # type: ignore
from scipy.stats import mannwhitneyu
from scvi.external import cytovi # type: ignore
from sklearn.cluster import KMeans # type: ignore
os.environ["SCIPY_ARRAY_API"] = "1"
sc.set_figure_params(figsize=(4, 4))
scvi.settings.seed = 0
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.4.2
Loading the data#
In this tutorial, we will work with a curated, lightweight subset of flow cytometry data from the BNHL study by Roider et al. 2024 (Nature Cell Biology, https://doi.org/10.1038/s41556-024-01358-2). The dataset includes flow cytometry measurements of T cells from 33 donors across two distinct antibody panels, each profiling 12 protein markers along with morphological features such as forward and side scatter (FSC and SSC). Samples were acquired across four independent experimental batches. For ease of use, the data have been preprocessed to correct for fluorescent spillover, restricted to live single-cell events, and transformed using a hyperbolic arcsin transformation, scaled and subsampled to ~5000 cells per panel. We will access the dataset as preprocessed .h5ad files. For demonstration purposes data from one of the panels comes with cell type annotations.
temp_dir_obj = tempfile.TemporaryDirectory()
adata_p1_path = os.path.join(temp_dir_obj.name, "Roider_et_al_BNHL_panel1.h5ad")
adata_p1 = sc.read(
adata_p1_path,
backup_url="https://exampledata.scverse.org/scvi-tools/Roider_et_al_BNHL_panel1.h5ad",
)
adata_p2_path = os.path.join(temp_dir_obj.name, "Roider_et_al_BNHL_panel2.h5ad")
adata_p2 = sc.read(
adata_p2_path,
backup_url="https://exampledata.scverse.org/scvi-tools/Roider_et_al_BNHL_panel2.h5ad",
)
adata_p1
AnnData object with n_obs × n_vars = 4983 × 14
obs: 'sample_id', 'PatientID', 'batch', 'panel', 'Entity', 'cell_type'
layers: '_nan_mask', 'raw', 'scaled', 'transformed'
As the data has been preprocessed already, we can directly merge the two panels into one anndata object using cytovi.merge_batches(). This will automatically register a nan_layer that will handle the modeling of missing markers under the hood.
adata = cytovi.merge_batches([adata_p1, adata_p2], batch_key="panel_batch")
adata
AnnData object with n_obs × n_vars = 9966 × 18
obs: 'sample_id', 'PatientID', 'batch', 'panel', 'Entity', 'cell_type', 'panel_batch'
var: '_batch_0', '_batch_1'
layers: '_nan_mask', 'raw', 'scaled', 'transformed'
Inspection of the histograms for each marker demonstrates that some proteins, such as CD4, CD3, CD45RA, CD69, were detected in both antibody panels, while four markers were unique to batch one (e.g. CD25, CXCR5) and four additional markers were unique to batch two (e.g. CD244 and TIM-3).
Additionally, we see that CD69 demonstrates a batch effect in one of the four batches.
Training a CytoVI model#
In the next step, we will train a CytoVI model to control for technical variation between the batches. As our dataset consists of multiple patients of different B cell lymphoma diagnoses, we will additionally specify a sample_key, which will be used for downstream application such as performing differential expression across patients or for the identification of disease-associated T cell states. If using CytoVI’s cytovi.merge_batches function to combine both panels, CytoVI will automatically register a nan_layer and handle imputation of missing proteins. In this case only the shared backbone markers are encoded into the latent representation, while the decoder network reconstructs the full protein panel.
cytovi.CYTOVI.setup_anndata(adata, layer="scaled", batch_key="batch", sample_key="PatientID")
model = cytovi.CYTOVI(adata)
model.train(n_epochs_kl_warmup=50)
Monitored metric elbo_validation did not improve in the last 30 records. Best score: -14.847. Signaling Trainer to stop.
model
CytoVI Model with the following params: n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, protein_likelihood: normal, latent_distribution: normal, MoG prior: True, n_labels 1, n_proteins: 18, Impute missing markers: True, Backbone markers: CD3, CD4, CD45RA, CD69, CD8, FSC-A, FoxP3, Ki67, PD1, SSC-A Training status: Trained
Visualize the joint latent space#
Next, we get the latent representation of our cells while controlling for batch and panel variation and visualize the joint latent space using UMAP.
adata.obsm["X_CytoVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_CytoVI", transformer="pynndescent")
sc.tl.umap(adata, min_dist=0.4)
sc.pl.umap(adata, color=["batch", "panel", "Entity"])
We can see that this latent representation effectively controlled for batch and panel variation and yielded a cell representation that still maintains the variability between the different disease entities.
Impute non-overlapping protein markers#
Next, we obtain the batch corrected protein expression from the CytoVI model that automatically imputes non-overlapping protein markers. Here, we will sample ten times from the posterior distribution and generate the imputed protein expression ten times in order to assess the imputation uncertainty. We take the mean over these ten samples as our estimate for the imputed protein expression.
imp_expr = model.get_normalized_expression(n_samples=10, return_mean=False)
adata.layers["imputed"] = imp_expr.mean(axis=0).copy()
Inspection of the marker histograms now shows the complete imputed protein expression for all proteins present in the combined set of the two antibody panels. Additionally, we can now query markers that were unique for each of the panels, and display for example the expression of the costimulatory protein CD244 versus the IL-2 high affinity chain CD25.
Next, we compute the coefficient of variation across the ten imputed expression estimates as a measure of uncertainty of the posterior samples and visualize the uncertainty across the cell and feature axes to judge imputation performance.
adata.layers["imputed_cv"] = 100 * imp_expr.std(axis=0).copy() / imp_expr.mean(axis=0).copy()
adata.var["var_imp_uncertainty"] = adata.layers["imputed_cv"].mean(axis=0).copy()
adata.obs["obs_imp_uncertainty"] = adata.layers["imputed_cv"].mean(axis=1).copy()
uncertainty = adata.var["var_imp_uncertainty"].sort_values()
plt.figure(figsize=(6, 5))
plt.barh(uncertainty.index, uncertainty.values, color="steelblue")
plt.xlabel("Imputation Uncertainty")
plt.ylabel("Protein")
plt.title("Imputation Uncertainty per Protein")
plt.tight_layout()
plt.show()
This analysis demonstrated that the model was less certain when imputing Ki67 expression (even though it was measured in both of the panels) compared to CD25 or CD244 that were only measured in one of the panels.
Visualizing the aggregated imputation uncertainty per cell provides an estimate of how reliably missing markers can be imputed. In this example, the uncertainty plot reveals that the model is especially uncertain for cells in regions of CytoVI’s latent space where batch representation is imbalanced. This underscores the importance of imputing markers only when integrating biologically comparable studies. We can also query and visualize the imputation uncertainty of individual markers to judge how much we can trust the imputation results in downstream analyses.
Label transfer#
In our example, cells from one antibody panel were already annotated, whereas data from the second panel lacked annotations. We will demonstrate how the integrated CytoVI latent space can be used to transfer labels—such as cell type annotations—from the annotated panel to the unannotated dataset.
for panel in adata.obs["panel"].drop_duplicates():
adata_panel = adata[adata.obs["panel"] == panel].copy()
sc.pl.umap(adata_panel, color="cell_type", title=panel)
adata_ref = adata[adata.obs["panel"] == "panel1"].copy()
adata.obs["cell_type_predicted"] = model.impute_categories_from_reference(
adata_reference=adata_ref, cat_key="cell_type"
)
Since both antibody panels were applied to the same patients we expect a similar proportion of cell types across both panels and can use this as a quick sanity check.
adata.obs.groupby(["panel", "cell_type_predicted"]).size().unstack().plot(kind="bar", stacked=True)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()
We can then visualize the expression profile of the transferred cell type labels across the combined marker space of both antibody panels.
sc.pl.matrixplot(
adata,
var_names=adata.var_names,
groupby="cell_type_predicted",
layer="imputed",
dendrogram=True,
standard_scale="var",
cmap="mako",
)
Differential expression#
We can now directly query the generative model to find differentially expressed proteins for each of the transferred cell type labels in relation to all other cells.
de_res = model.differential_expression(groupby="cell_type_predicted", delta=0.1)
de_res
| proba_de | proba_not_de | bayes_factor | scale1 | scale2 | pseudocounts | delta | lfc_mean | lfc_median | lfc_std | lfc_min | lfc_max | is_de_fdr_0.05 | comparison | group1 | group2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD45RA | 0.9706 | 0.0294 | 3.496919 | 0.434004 | 0.213028 | 1.000000e-10 | 0.1 | 1.041761 | 1.165231 | 0.673131 | -2.073050 | 2.584914 | True | Naive CD4 T vs Rest | Naive CD4 T | Rest |
| PD1 | 0.9592 | 0.0408 | 3.157417 | 0.164877 | 0.382105 | 1.000000e-10 | 0.1 | -1.112946 | -1.232954 | 0.663755 | -2.513757 | 1.706124 | True | Naive CD4 T vs Rest | Naive CD4 T | Rest |
| CD31 | 0.9104 | 0.0896 | 2.318529 | 0.312120 | 0.382095 | 1.000000e-10 | 0.1 | -0.198945 | -0.199453 | 0.626980 | -1.818321 | 1.523065 | False | Naive CD4 T vs Rest | Naive CD4 T | Rest |
| ICOS | 0.9076 | 0.0924 | 2.284677 | 0.217959 | 0.340132 | 1.000000e-10 | 0.1 | -0.579738 | -0.550900 | 0.513512 | -1.970637 | 1.384790 | False | Naive CD4 T vs Rest | Naive CD4 T | Rest |
| SSC-A | 0.8924 | 0.1076 | 2.115494 | 0.291745 | 0.356592 | 1.000000e-10 | 0.1 | -0.318686 | -0.294884 | 0.686055 | -2.617624 | 1.761648 | False | Naive CD4 T vs Rest | Naive CD4 T | Rest |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CD69 | 0.8228 | 0.1772 | 1.535434 | 0.434981 | 0.382993 | 1.000000e-10 | 0.1 | 0.193504 | 0.185164 | 0.396212 | -0.931715 | 1.393487 | False | Ttox EM3 vs Rest | Ttox EM3 | Rest |
| CXCR5 | 0.7800 | 0.2200 | 1.265666 | 0.358435 | 0.373278 | 1.000000e-10 | 0.1 | -0.041337 | -0.037766 | 0.373335 | -1.434914 | 1.236979 | False | Ttox EM3 vs Rest | Ttox EM3 | Rest |
| IKZF3 | 0.7690 | 0.2310 | 1.202673 | 0.660656 | 0.566145 | 1.000000e-10 | 0.1 | 0.231808 | 0.219143 | 0.220237 | -0.400252 | 1.075055 | False | Ttox EM3 vs Rest | Ttox EM3 | Rest |
| FSC-A | 0.7680 | 0.2320 | 1.197052 | 0.599032 | 0.522675 | 1.000000e-10 | 0.1 | 0.208974 | 0.196232 | 0.271950 | -0.626969 | 1.533572 | False | Ttox EM3 vs Rest | Ttox EM3 | Rest |
| CD45RA | 0.7396 | 0.2604 | 1.043891 | 0.188851 | 0.234621 | 1.000000e-10 | 0.1 | -0.194501 | -0.038121 | 0.583049 | -2.589115 | 1.338408 | False | Ttox EM3 vs Rest | Ttox EM3 | Rest |
234 rows × 16 columns
These results demonstrate that CD45RA is enriched in Naive T helper cells versus all other cells, while PD-1 is decreased, which is in line with our understanding of these marker proteins.
Label-free differential abundance#
Next, we will use CytoVI’s label-free differential abundance analysis to automatically detect positions in latent space that are associated with a covariate of interest. In this case we will apply the model.differential_abundance method to identify T cell states that are associated with the different lymphoma entities (one versus all). This will for each cell and each lymphoma entity yield a differential abundance (DA) score that estimates how strongly the cell is associated with the respective lymphoma entity compared to all other entities. We can then visualize the DA scores on the latent space to identify regions characteristic for disease.
da_res = model.differential_abundance(groupby="Entity")
da_res
| DA_MCL | DA_DLBCL | DA_MZL | DA_rLN | DA_FL | |
|---|---|---|---|---|---|
| 0 | 0.721177 | -1.917385 | -0.464626 | -1.462440 | 1.483964 |
| 1 | 0.573540 | 1.083618 | 2.072742 | -4.232052 | -0.516788 |
| 2 | 0.400154 | -1.397992 | -0.409788 | -1.452880 | 1.708511 |
| 3 | 1.003199 | 3.260658 | 2.854763 | -1.661751 | -2.218349 |
| 4 | -0.070237 | -0.819746 | 0.305704 | -1.453968 | 1.619961 |
| ... | ... | ... | ... | ... | ... |
| 9961 | -0.234638 | 2.258600 | 2.350037 | -2.985151 | -0.401577 |
| 9962 | 1.480638 | -1.434464 | -1.796121 | 4.641170 | -0.752605 |
| 9963 | 1.025166 | -0.003979 | -1.792423 | 4.319269 | -1.124134 |
| 9964 | -0.488432 | -0.128273 | -0.130234 | -5.210672 | 2.050974 |
| 9965 | -0.105505 | 2.035009 | 1.945004 | -3.339054 | -0.216951 |
9966 rows × 5 columns
adata.obs = pd.concat(
[adata.obs, pd.DataFrame(da_res.values, columns=da_res.columns, index=adata.obs.index)], axis=1
)
sc.pl.umap(adata, color=da_res.columns, cmap="icefire", ncols=3, vmin=-3, vmax=3)
In a next step we want to use these DA scores to determine differentially abundant clusters that are associated with disease. For this we will concatenate the DA scores for each tumor entity to the CytoVI latent space and perform Kmeans clustering.
da_latent = np.hstack((da_res, adata.obsm["X_CytoVI"]))
kmeans = KMeans(n_clusters=13, random_state=0).fit(da_latent)
adata.obs["da_cluster"] = kmeans.labels_.astype("str")
Next, we will compute the relative frequency of the DA clusters per patient and test for differential abundance compared to the control group using a Mann-Whitney U test. We will do this for the cluster that showed highest DA scores for follicular lymphoma (FL) patients. Note: in this case it is cluster 0 but it can change depending on the runtime environment of the notebook.
freq_table = adata.obs.groupby(["PatientID", "da_cluster"]).size().unstack(fill_value=0)
freq_table_normalized = freq_table.div(freq_table.sum(axis=1), axis=0) * 100
res = pd.merge(
freq_table_normalized.reset_index("PatientID"),
adata.obs[["PatientID", "Entity"]].drop_duplicates(),
on="PatientID",
)
cluster_oi = "0"
control = "rLN"
group_order = ["rLN", "DLBCL", "MCL", "FL", "MZL"]
control_vals = res.loc[res["Entity"] == control, cluster_oi]
plt.figure(figsize=(6, 4))
sns.violinplot(res, x="Entity", y=cluster_oi, hue="Entity", inner="box", order=group_order)
plt.ylim(0, None)
plt.ylabel(f"Frequency of cluster {cluster_oi} (%)")
y_max = res[cluster_oi].max()
for i, entity in enumerate(group_order):
if entity != control:
p = mannwhitneyu(control_vals, res.loc[res["Entity"] == entity, cluster_oi]).pvalue
if p < 0.05:
plt.text(i, y_max + 2, f"p={p:.3e}", ha="center", fontsize=8)
plt.tight_layout()
plt.show()
This analysis demonstrates that we indeed retrieved T cell states that were differentially abundant for the lymphoma entity. In a last step we compute the confusion matrix with the predicted cell type labels in order to assign these cells to a canonical T cell lineage.
conf_mtx = sc.metrics.confusion_matrix(adata.obs["cell_type_predicted"], adata.obs["da_cluster"])
conf_clust_oi = conf_mtx[cluster_oi].sort_values()
plt.figure(figsize=(6, 4))
plt.barh(conf_clust_oi.index, conf_clust_oi.values)
plt.xlabel("Cluster Cverlap")
plt.ylabel("Predicted Cell Type Label")
plt.title(f"Cell Type Correspondence of Cluster {cluster_oi}")
plt.tight_layout()
plt.show()
This analysis demonstrates that the differentially abundant T cell cluster we identified using our label-free DA analysis appear to be mainly comprised of T follicular helper cells - a subpopulation of T helper cells that stimulate B cell responses and has been associated with Follicular Lymphoma in the original study by Roider et al.