SysVI#

SysVI (cross-SYStem Variational Inference, Python class SysVI) is a representation learning model that can remove substantial batch effects.

The advantages of SysVI are:

  • Improved integration: Works on datasets with substantial batch effects (e.g., cross-species or organoid-tissue integration), where other models often fail, and provides a good tradeoff between batch correction and preservation of cell-type and sub-cell-type biological variation.

  • Tunable integration: The integration strength is directly tunable via the cycle-consistency loss weight.

  • Generally applicable: The model operates on approximately normally distributed data (e.g. normalized and log1p-transformed scRNA-seq data), which makes it applicable beyond scRNA-seq alone.

  • Scalable: Can integrate very large datasets if using a GPU.

The limitations of SysVI include:

  • Weak batch effects: For datasets with small batch effects (e.g. multiple subjects from a single laboratory) we recommend using scVI instead, as it has slightly higher biological preservation in this setting. For determining whether a dataset has substantial batch effects, please refer to our paper.

  • Model selection: The best performance is achieved by selecting the best model from multiple runs with a few different cycle-consistency loss weights and random seed initialisations, as explained in the tutorial. However, the defaults we provide generate decent results in many settings.
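For orientation, below is a minimal usage sketch following the standard scvi-tools workflow. It assumes SysVI is available as scvi.external.SysVI and that the system covariate is registered via batch_key; the file name and column name are illustrative, so please consult the SysVI tutorial and API reference for the exact arguments.

```python
import scanpy as sc
import scvi

# Illustrative input: normalized, log1p-transformed expression data with a
# column in .obs marking the system (the covariate with substantial batch effects).
adata = sc.read_h5ad("adata_normalized.h5ad")

# Register the system covariate; passing it as `batch_key` is an assumption here,
# see the SysVI tutorial for the exact setup arguments.
scvi.external.SysVI.setup_anndata(adata, batch_key="system")

model = scvi.external.SysVI(adata)
model.train()

# The integrated representation is the latent embedding of the cells.
adata.obsm["X_sysvi"] = model.get_latent_representation()
```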

Method background#

The model is based on a variational autoencoder (VAE), with the integrated representation corresponding to the latent space embedding of the cells.

Stronger batch correction with cycle-consistency loss#

Vanilla VAEs struggle to achieve strong batch correction without losing substantial biological variation. This issue arises because the VAE loss does not directly penalize the presence of batch-covariate information in the latent space. Instead, conditional VAEs assume that batch-covariate information will be omitted from the limited-capacity latent space because it is injected separately into the decoder, making its presence in the latent space "unnecessary" for reconstruction (Hrovatin and Moinfar, 2023).

To achieve stronger integration than vanilla VAEs, SysVI employs a cycle-consistency loss in the latent space. In particular, the model embeds a cell from one system (a category of the covariate representing the substantial batch effect) into the latent space and then decodes it using another category of the system covariate. In this way it generates a biologically identical cell with a different batch effect. The generated cell is likewise embedded into the latent space, and the distance between the embeddings of the original and the switched-batch cell is computed. The model is trained to minimize this distance.

Cycle consistency loss used to increase batch correction in SysVI.
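Conceptually, the penalty can be sketched as follows. The encoder, decoder, covariate names, and the use of point embeddings (rather than posterior distributions) are simplifications for illustration and do not reflect the exact SysVI implementation.

```python
import torch

def cycle_consistency_loss(x, cov_a, cov_b, encoder, decoder):
    """Sketch of a latent cycle-consistency penalty.

    `encoder`/`decoder` stand in for the conditional VAE networks;
    `cov_a`/`cov_b` are the system covariates of the original and the
    switched system (placeholder names).
    """
    z = encoder(x, cov_a)               # embed the original cell (system A)
    x_switch = decoder(z, cov_b)        # decode it as a system-B cell
    z_cycle = encoder(x_switch, cov_b)  # re-embed the switched-batch cell
    # The model is trained to minimize the distance between the two embeddings.
    return torch.mean((z - z_cycle) ** 2)
```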

Benefits of this approach:

  • As only cells with identical biological background are compared, this method retains good biological preservation even when removing substantial batch effects. This distinguishes it from alternative approaches that compare cells with different biological backgrounds (e.g. via adversarial loss; see Hrovatin and Moinfar (2023) for details).

  • The integration strength can be directly tuned via the cycle-consistency loss weight.

Improved biological preservation via the VampPrior#

Vanilla VAEs employ a standard normal prior to regularize the latent space. However, this prior is very restrictive and can lead to the loss of important biological variation in the latent space.

Instead, we use the VampPrior (Tomczak, 2017), which permits a more expressive latent space. The VampPrior is a multi-modal prior whose mode positions are learned during training.

VampPrior used to increase the preservation of biological variation in SysVI.
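As a rough illustration of the idea (not the SysVI implementation), the VampPrior evaluates the encoder at K learned pseudo-inputs and uses the resulting mixture of variational posteriors as the prior; the encoder interface below is a placeholder.

```python
import torch

def vamp_prior_log_prob(z, pseudo_inputs, encoder):
    """Sketch of the VampPrior density: p(z) = 1/K * sum_k q(z | u_k).

    `pseudo_inputs` are K learned pseudo-inputs u_k and `encoder` is a
    placeholder returning the per-component posterior mean and variance.
    """
    mean, var = encoder(pseudo_inputs)                    # each of shape (K, n_latent)
    components = torch.distributions.Normal(mean, var.sqrt())
    # log q(z | u_k) for every cell (rows) and mixture component (columns).
    log_q = components.log_prob(z.unsqueeze(1)).sum(-1)   # (n_cells, K)
    # Uniform mixture over the K components.
    k = torch.tensor(float(pseudo_inputs.shape[0]))
    return torch.logsumexp(log_q, dim=1) - torch.log(k)
```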

Benefits of this approach:

  • More expressive latent space leads to increased preservation of biological variability.

  • In our experiments, the VampPrior was more robust with respect to the number of modes than the better-known Gaussian mixture prior.

Application flexibility due to using normally distributed inputs#

Many scRNA-seq integration models are specially designed to work with scRNA-seq data, e.g. raw counts that follow a negative binomial distribution. However, this means these models cannot be directly used for other types of data.

We observed that this specialised setup is not strictly required for representation learning: SysVI is designed for approximately normally distributed data, yet performs competitively in comparison to more specialised models on scRNA-seq data. To make scRNA-seq data approximately normally distributed, we preprocess it via size-factor normalization and log1p transformation.
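For scRNA-seq counts, this preprocessing corresponds to the standard scanpy normalization steps; the file name below is illustrative.

```python
import scanpy as sc

adata = sc.read_h5ad("raw_counts.h5ad")  # raw counts in adata.X (illustrative path)
adata.layers["counts"] = adata.X.copy()  # keep the counts for reference

# Size-factor normalization followed by log1p makes the data approximately
# normally distributed, as expected by SysVI.
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
```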

Thus, SysVI could also be applied to other types of approximately normally distributed data. However, we did not specifically test its performance on other data types.

Other tips & tricks for data integration#

Besides describing the benefits of the SysVI model, our paper (Hrovatin and Moinfar, 2023) and talk provide additional advice on scRNA-seq integration that applies beyond SysVI. The two most important insights are:

  • Try to make the integration task as easy for the model as possible. This means that data should be pre-processed in a way that already eliminates some of the batch differences, when possible:

    • Use the intersection of highly variable genes (HVGs) across batches with substantial batch effects (e.g. the systems); see the sketch after this list.

    • Mitigate known technical artefacts, such as ambient gene expression (Hrovatin and Sikkema, 2024).

  • Ensure that the metrics used to evaluate integration are of high-quality:

    • They should be able to capture the key properties required for downstream tasks. For example, the standard cell-type based biological preservation metrics do not assess whether subtler biological differences, such as within-cell-type disease effects, are preserved.

    • Be cautious of potential biases in integration metric scores: the scores may not directly correspond to the desired data property, they may be influenced by other factors, or certain models may be able to trick the metrics.
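For the HVG intersection mentioned above, one possible sketch using scanpy is shown below. It assumes the normalized data from earlier and a `system` column in `adata.obs`; the gene number is illustrative.

```python
import scanpy as sc

# Compute HVGs per system and keep only genes that are highly variable in
# every system, approximating the intersection of HVGs across systems.
sc.pp.highly_variable_genes(adata, n_top_genes=3000, batch_key="system")
n_systems = adata.obs["system"].nunique()
shared_hvgs = adata.var.index[adata.var["highly_variable_nbatches"] == n_systems]
adata = adata[:, shared_hvgs].copy()
```

Since the strict intersection can be small, the number of HVGs selected per system is often set higher than the number of genes one ultimately aims to keep.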