Minification

Minification refers to the process of reducing the amount of content in your dataset in a smart way. This can be useful for several reasons, and there are different ways to do it (we call these minification types). Currently, the only minification type we support replaces the count data with the parameters of the latent posterior distribution, as estimated by a trained model. This tutorial focuses on that type.

There are multiple motivations for minifying the data in this way:

  • The data is more compact, so it takes up less space on disk and in memory.

  • Data transfer (sharing, uploading, downloading) is smoother owing to the smaller data size.

  • By using the latent posterior parameters, we can skip the encoder network and save on computation time.

This works because most post-training routines for scvi-tools models do not in fact require the full counts. Once your model is trained, you essentially only need the model weights and the pre-computed embeddings to carry out analyses. There are certain exceptions to this, but those routines will alert you if you try to call them with a minified dataset.

Minification overview

Moreover, you can actually use the latent posterior and the decoder network to estimate the original counts! This is of course not the exact same thing as using your actual full counts, but we can show that it is a good approximation using posterior predictive metrics (paper link tbd).
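To make the idea concrete, here is a minimal sketch of that estimation for a single cell: draw a latent vector from the stored posterior mean and variance, then pass it through a decoder. The `toy_decoder` function, its weights, and all numbers are hypothetical stand-ins for illustration only; a trained scvi-tools model uses its own decoder network and likelihood.

```python
import math
import random

random.seed(0)

# Hypothetical posterior parameters for one cell (n_latent = 2).
toy_qzm = [0.5, -1.0]  # posterior mean
toy_qzv = [0.04, 0.09]  # posterior variance
toy_library_size = 1000.0  # total counts observed for this cell

# Hypothetical decoder weights mapping 2 latent dimensions to 3 genes.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]


def sample_latent(mean, var):
    """Draw z ~ Normal(mean, var) independently for each latent dimension."""
    return [random.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]


def toy_decoder(z, lib):
    """Softmax over linear scores, scaled to the cell's library size."""
    scores = [sum(w * zi for w, zi in zip(row, z)) for row in toy_weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [lib * e / total for e in exps]


z = sample_latent(toy_qzm, toy_qzv)
expected_counts = toy_decoder(z, toy_library_size)
print(expected_counts)
```

Note that the encoder never appears here: the stored mean and variance are enough to sample a latent vector, which is why the full counts are not needed for this step.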

Let’s now see how to minify a dataset and use the corresponding model.

Note

Running the following cell will install tutorial dependencies on Google Colab only. It will have no effect on environments other than Google Colab.

!pip install --quiet scvi-colab
from scvi_colab import install

install()
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

import os
import tempfile

import scanpy as sc
import scvi
import seaborn as sns
import torch
scvi.settings.seed = 0
print("Last run with scvi-tools version:", scvi.__version__)
Last run with scvi-tools version: 1.1.6

Note

You can modify save_dir below to change where the data files for this tutorial are saved.

sc.set_figure_params(figsize=(6, 6), frameon=False)
sns.set_theme()
torch.set_float32_matmul_precision("high")
save_dir = tempfile.TemporaryDirectory()

%config InlineBackend.print_figure_kwargs={"facecolor": "w"}
%config InlineBackend.figure_format="retina"
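If you would rather keep the files after the session ends, one option is to replace the temporary directory with a persistent one. The sketch below is only a suggestion: the folder name is hypothetical, and the `SimpleNamespace` wrapper is there because later cells read `save_dir.name`, so a plain string would not work as a drop-in replacement.

```python
import os
import tempfile
from types import SimpleNamespace

# Hypothetical persistent location; any writable directory would do.
persistent_dir = os.path.join(tempfile.gettempdir(), "scvi_minification_tutorial")
os.makedirs(persistent_dir, exist_ok=True)

# Mimic the `.name` attribute of TemporaryDirectory so later cells run unchanged.
save_dir = SimpleNamespace(name=persistent_dir)
print(save_dir.name)
```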

Get the data and model

Here we use the data and pre-trained model obtained from running this scvi-tools tutorial.

The dataset used is a subset of the heart cell atlas dataset:
Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., … & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.

Let’s train the model as usual and save the model and data to disk, as we’ll need them later.

adata = scvi.data.heart_cell_atlas_subsampled(save_path=save_dir.name)
INFO     Downloading file at /tmp/tmpyhtb3zgo/hca_subsampled_20k.h5ad
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"],
)
model = scvi.model.SCVI(adata)
model.train(max_epochs=20)
model_path = os.path.join(save_dir.name, "scvi_hca")
model.save(model_path, save_anndata=True, overwrite=True)
model = scvi.model.SCVI.load(model_path)
model
INFO     File /tmp/tmpyhtb3zgo/scvi_hca/model.pt already downloaded
SCVI model with the following parameters: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: False

Note that, as expected, “Model’s adata is minified” is False.

model.adata
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'cell_type_colors', 'hvg', 'log1p'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs'
    layers: 'counts'

Notice that in addition to adata.X, we also have a layer (counts) and a raw attribute.

model.adata.raw
<anndata._core.raw.Raw at 0x7a7222581640>

Let’s also save a reference to model.adata. We’ll see later that this object remains unchanged because minification is not an in-place procedure.

bdata = model.adata
bdata is model.adata  # this should be True because we didn't copy the anndata object
True

Minify

To minify the data, all we need to do is:

  1. get the latent representation and store it in the adata

  2. call model.minify_adata()

qzm, qzv = model.get_latent_representation(give_mean=False, return_dist=True)
model.adata.obsm["X_latent_qzm"] = qzm
model.adata.obsm["X_latent_qzv"] = qzv

model.minify_adata()
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
INFO     Generating sequential column names
INFO     Generating sequential column names
model
SCVI model with the following parameters: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True

As expected, “Model’s adata is minified” is now True. Also, we can check the model’s minified_data_type:

model.minified_data_type
'latent_posterior_parameters'

Let’s check out the data now:

model.adata
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', 'cell_type_colors', 'hvg', 'log1p', '_scvi_adata_minify_type', '_scvi_uuid'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs', 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
    layers: 'counts'

First, let’s check that the original adata was not modified (minification is not in-place):

model.adata is bdata
False
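The pattern behind this result can be sketched with plain Python objects. `ToyModel` and the dictionary below are hypothetical stand-ins, not scvi-tools classes; the point is only that building a new object and rebinding the attribute leaves any earlier reference untouched.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for a model that holds a dataset reference."""

    def __init__(self, data):
        self.data = data

    def minify(self):
        # Build a new, smaller object and rebind the attribute;
        # the original dictionary is never mutated.
        minified = copy.copy(self.data)
        minified["X"] = None
        self.data = minified


toy_original = {"X": [[1, 0], [0, 2]], "obs": ["cell_a", "cell_b"]}
toy_model = ToyModel(toy_original)
toy_ref = toy_model.data  # keep a reference to the original object

toy_model.minify()

print(toy_model.data is toy_ref)  # False
print(toy_ref["X"])  # [[1, 0], [0, 2]]
```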

Next, we see that we still have the same number of obs and vars: 18641 × 1200. This seems strange! Didn’t we say we minified the data? We did. We did so by “emptying” the contents of adata.X, adata.layers["counts"], and adata.raw. Instead, we cached the much smaller latent posterior parameters in adata.obsm["_scvi_latent_qzm"] and adata.obsm["_scvi_latent_qzv"]. Let’s double check that:

model.adata.X
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 0 stored elements and shape (18641, 1200)>
model.adata.layers["counts"]
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 0 stored elements and shape (18641, 1200)>
model.adata.raw is None
True
bdata
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'cell_type_colors', 'hvg', 'log1p'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs', 'X_latent_qzm', 'X_latent_qzv'
    layers: 'counts'

Everything else is the same: all the other metadata is still there.
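Why can a matrix keep its 18641 × 1200 shape and still be almost free to store? The Compressed Sparse Row (CSR) layout only records the nonzero values, their column indices, and one pointer per row. The pure-Python sketch below builds these three arrays for a small toy matrix and for an emptied one; it illustrates the layout only and is not how scipy or scvi-tools implement it.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # Each row contributes one pointer, even if it stores nothing.
        indptr.append(len(data))
    return data, indices, indptr


n_rows, n_cols = 4, 5
toy_counts = [[0, 3, 0, 0, 1], [2, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 1, 4, 0, 0]]
toy_emptied = [[0] * n_cols for _ in range(n_rows)]

full_data, full_indices, full_indptr = to_csr(toy_counts)
empty_data, empty_indices, empty_indptr = to_csr(toy_emptied)

print(len(full_data), len(empty_data))  # 5 0
print(empty_indptr)  # [0, 0, 0, 0, 0]
```

The emptied matrix stores no values at all; only the row pointers remain, which is what “0 stored elements” in the output above reflects.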

But is the data really smaller now? Let’s check:

minified_model_path = os.path.join(save_dir.name, "scvi_hca_minified")
model.save(minified_model_path, save_anndata=True, overwrite=True)
before = os.path.getsize(os.path.join(model_path, "adata.h5ad")) // (1024 * 1024)
after = os.path.getsize(os.path.join(minified_model_path, "adata.h5ad")) // (1024 * 1024)

print(f"AnnData size before minification: {before} MB")
print(f"AnnData size after minification: {after} MB")
AnnData size before minification: 212 MB
AnnData size after minification: 8 MB
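A quick back-of-the-envelope calculation shows where the savings come from. It uses the shapes reported above (18641 cells, 1200 genes, n_latent of 10) and counts dense-equivalent entries only. The counts are stored sparsely, and the saved file also holds a layer, the raw attribute, and metadata, so this ratio is not expected to match the file sizes exactly.

```python
n_obs, n_vars, n_latent = 18641, 1200, 10

# One entry per cell and gene if the counts were stored densely.
dense_count_entries = n_obs * n_vars

# One mean and one variance per cell and latent dimension.
latent_entries = 2 * n_obs * n_latent

print(dense_count_entries)  # 22369200
print(latent_entries)  # 372820
print(dense_count_entries / latent_entries)  # 60.0
```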

We also see a new uns key called _scvi_adata_minify_type. This specifies the type of minification, and it is the same as model.minified_data_type. In fact, this is a quick way to tell if your data is minified. We also expose a utility function to check that quickly.

model.adata.uns["_scvi_adata_minify_type"]
'latent_posterior_parameters'
scvi.data._utils._is_minified(model.adata)
True
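Since the utility used above lives in a private module, a check based on the uns key alone may be handy. The sketch below uses plain dictionaries as stand-ins for `adata.uns`, and the helper name `minify_type_from_uns` is hypothetical. Whether the scvi-tools utility performs exactly this lookup is an assumption, not something this tutorial confirms.

```python
MINIFY_TYPE_KEY = "_scvi_adata_minify_type"


def minify_type_from_uns(uns):
    """Return the recorded minification type, or None if the key is absent."""
    return uns.get(MINIFY_TYPE_KEY)


# Plain dictionaries standing in for `adata.uns` before and after minification.
uns_before = {"_scvi_uuid": "abc", "hvg": {}}
uns_after = {**uns_before, MINIFY_TYPE_KEY: "latent_posterior_parameters"}

print(minify_type_from_uns(uns_before))  # None
print(minify_type_from_uns(uns_after))  # latent_posterior_parameters
```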

Last but not least, you might have noticed that there is a new obs column called _scvi_observed_lib_size. We add the pre-computed per-cell library sizes to this column and use it during inference, because the minified data no longer contains the full counts.
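As a small illustration, assuming that library size means the total number of counts observed in a cell, the per-cell values can be computed as row sums of the count matrix. The toy matrix below is made up, and whether scvi-tools stores the raw sum or a transformed value in this column is not stated in this tutorial.

```python
# Toy count matrix: 3 cells (rows) by 4 genes (columns).
toy_count_matrix = [
    [0, 3, 0, 1],
    [2, 0, 0, 0],
    [5, 1, 4, 0],
]

# One library size per cell: the sum over all genes in that row.
toy_lib_size = [sum(row) for row in toy_count_matrix]
print(toy_lib_size)  # [4, 2, 10]
```

Once the counts are emptied, these sums can no longer be recomputed, which is why they have to be cached beforehand.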

Another claim we made earlier is that analysis functions are faster if you use the minified data. Let’s time how long they take. Here we’ll look at the get_likelihood_parameters method.

model_orig = scvi.model.SCVI.load(model_path)

print("Running `get_likelihood_parameters` without minified data...")
%timeit model_orig.get_likelihood_parameters(n_samples=3, give_mean=True)
INFO     File /tmp/tmpyhtb3zgo/scvi_hca/model.pt already downloaded
Running `get_likelihood_parameters` without minified data...
2.19 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print("Running `get_likelihood_parameters` with minified data...")
%timeit model.get_likelihood_parameters(n_samples=3, give_mean=True)
Running `get_likelihood_parameters` with minified data...
2.28 s ± 76.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this run, the two timings are essentially the same (2.19 s without minified data versus 2.28 s with it), so minification does not produce a measurable speedup on this dataset.
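The `%timeit` magic is only available in IPython and Jupyter. Outside a notebook, a similar measurement can be sketched with the standard library. The `workload` function below is a hypothetical stand-in; in practice you would wrap the model call you want to time.

```python
import statistics
import timeit


def workload():
    """Hypothetical stand-in for a call such as get_likelihood_parameters."""
    return sum(i * i for i in range(10_000))


# 7 repeats of 1 call each, mirroring the "7 runs, 1 loop each" output above.
timings = timeit.repeat(workload, repeat=7, number=1)
print(f"{statistics.mean(timings):.6f} s ± {statistics.stdev(timings):.6f} s")
```

With differences as small as the ones reported above, comparing the spread across repeats is a reasonable way to judge whether a gap is meaningful.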

Save and load

Just like a regular model, you can save the model and its minified data, and load them back in:

model.save(minified_model_path, overwrite=True, save_anndata=True)

# load saved model with saved (minified) adata
loaded_model = scvi.model.SCVI.load(minified_model_path)
loaded_model
INFO     File /tmp/tmpyhtb3zgo/scvi_hca_minified/model.pt already downloaded
SCVI model with the following parameters: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True

Next, let’s load the model with non-minified data.

loaded_model = scvi.model.SCVI.load(model_path, adata=bdata)
loaded_model
INFO     File /tmp/tmpyhtb3zgo/scvi_hca/model.pt already downloaded
SCVI model with the following parameters: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb, 
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: False

So if you want to “undo” the minification procedure, so to speak, you can always load your model with the non-minified data (if you still have it), or with any other non-minified data that is compatible with the model.

Last but not least, let’s see what happens if we try to load a model whose adata was not minified, with a dataset that is minified:

scvi.data._utils._is_minified(model.adata)
True
try:
    scvi.model.SCVI.load(model_path, adata=model.adata)
except KeyError as e:
    print("KeyError: " + str(e))
INFO     File /tmp/tmpyhtb3zgo/scvi_hca/model.pt already downloaded
KeyError: 'state_registry'

As we see, this is not allowed. This is because when you try to load a model with another dataset, we try to validate that dataset against the model’s registry. In this case, the data is not compatible with the model registry because it has attributes pertaining to minification, which this model is not aware of.

Support

Minification is not supported for all models yet. A model supports this functionality if and only if it inherits from the BaseMinifiedModeModelClass class. A model that does not support this:

  • does not have a minify_adata() method

  • cannot be loaded with minified data. If you try to do this, you will see this error: “The MyModel model currently does not support minified data.”

To support minification for your own model, inherit your model class from the BaseMinifiedModeModelClass and your module class from the BaseMinifiedModeModuleClass.
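If you write code that should work with several model classes, you may want to check for support before calling minify_adata(). The sketch below uses a duck-typed check based on the first bullet above; the helper name and the two toy classes are hypothetical. An isinstance check against BaseMinifiedModeModelClass would be the stricter test, but its import path is not given in this tutorial, so it is not shown here.

```python
def supports_minification(model):
    """Duck-typed check: does the object expose a callable minify_adata()?"""
    return callable(getattr(model, "minify_adata", None))


class WithMinify:
    """Hypothetical model class that implements minify_adata()."""

    def minify_adata(self):
        pass


class WithoutMinify:
    """Hypothetical model class without minification support."""


print(supports_minification(WithMinify()))  # True
print(supports_minification(WithoutMinify()))  # False
```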