Note

This page was generated from minification.ipynb.

Minification#

Minification is the process of reducing the amount of content in your dataset in a principled way. It can be useful for several reasons, and there are different ways to go about it (we call these minification types). Currently, the only supported type replaces the count data with the parameters of the latent posterior distribution, as estimated by a trained model. This tutorial focuses on that type of minification.

There are multiple motivations for minifying the data in this way:

  • The data is more compact, so it takes up less space on disk and in memory.

  • Data transfer (sharing, uploading, downloading) is smoother owing to the smaller data size.

  • By using the latent posterior parameters, we can skip the encoder network and save on computation time.

This works because most post-training routines for scvi-tools models do not in fact require the full counts. Once your model is trained, you essentially only need the model weights and the pre-computed embeddings to carry out analyses. There are certain exceptions to this, but those routines will alert you if you try to call them with a minified dataset.

[Figure: Minification overview]

Moreover, you can actually use the latent posterior and the decoder network to estimate the original counts! This is of course not the exact same thing as using your actual full counts, but we can show that it is a good approximation using posterior predictive metrics (paper link tbd).
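To make the idea concrete, here is a minimal NumPy sketch of decoding from cached posterior parameters. It is a toy under stated assumptions, not the scvi-tools implementation: the linear softmax `toy_decode` function, the weight matrix `W`, and all array values are made up for illustration, and `qzv` is assumed to hold variances (so the standard deviation is its square root).

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_latent, n_genes = 5, 10, 8

# Stand-ins for the cached posterior parameters (mean and variance per cell).
qzm = rng.normal(size=(n_cells, n_latent))
qzv = rng.uniform(0.05, 0.5, size=(n_cells, n_latent))

# Stand-in for pre-computed per-cell library sizes.
lib_size = rng.integers(500, 2000, size=(n_cells, 1)).astype(float)

# Toy linear "decoder" weights; a real decoder is a trained neural network.
W = rng.normal(size=(n_latent, n_genes))


def toy_decode(z, lib):
    """Map latent samples to expected counts: softmax over genes, scaled by library size."""
    logits = z @ W
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    proportions = np.exp(logits)
    proportions /= proportions.sum(axis=1, keepdims=True)
    return lib * proportions


# Sample z directly from N(qzm, qzv); no encoder pass over the counts is needed.
z = qzm + np.sqrt(qzv) * rng.standard_normal(qzm.shape)
expected_counts = toy_decode(z, lib_size)
print(expected_counts.shape)
```

The point of the sketch is only the data flow: the latent sample comes from stored parameters rather than from an encoder applied to the counts, and the decoder output is rescaled by a stored library size.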

Let’s now see how to minify a dataset and use the corresponding model.

[1]:
!pip install --quiet scvi-colab
from scvi_colab import install

install()
[1]:
import scanpy as sc

sc.set_figure_params(figsize=(4, 4))

# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'
[2]:
import time

import scanpy as sc
import scvi
Global seed set to 0

Get the data and model#

Here we use the data and pre-trained model obtained from running this scvi-tools tutorial.

The dataset used is a subset of the heart cell atlas dataset:
Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., … & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.

Let’s train the model as usual and save the model and data to disk, as we’ll need them later.

[3]:
adata = scvi.data.heart_cell_atlas_subsampled()
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source",
)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"],
)
model = scvi.model.SCVI(adata)
model.train()
model.save("local/hca/", save_anndata=True)
[36]:
model_path = "local/hca"
model = scvi.model.SCVI.load(model_path)
INFO     File local/hca/model.pt already downloaded
[37]:
model
SCVI Model with the following params:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb,
latent_distribution: normal
Training status: Trained
Model's adata is minified?: False
[37]:

Note that, as expected, “Model’s adata is minified” is False.

[38]:
model.adata
[38]:
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'cell_type_colors', 'hvg', 'log1p'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs'
    layers: 'counts'

Notice that in addition to adata.X, we also have a layer (counts) and a raw attribute.

[39]:
model.adata.raw
[39]:
<anndata._core.raw.Raw at 0x140678400>

Let’s also save a reference to model.adata. We’ll see later that this object remains unchanged because minification is not an in-place procedure.

[41]:
bdata = model.adata
bdata is model.adata  # this should be True because we didn't copy the anndata object
[41]:
True

Minify#

To minify the data, all we need to do is:

  1. get the latent representation and store it in the adata

  2. call model.minify_adata()

[42]:
qzm, qzv = model.get_latent_representation(give_mean=False, return_dist=True)
model.adata.obsm["X_latent_qzm"] = qzm
model.adata.obsm["X_latent_qzv"] = qzv

model.minify_adata()
INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
INFO     Generating sequential column names
INFO     Generating sequential column names
/Users/valehvpa/GitRepos/scvi-tools/scvi/model/utils/_minification.py:31: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  bdata = AnnData(
[43]:
model
SCVI Model with the following params:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb,
latent_distribution: normal
Training status: Trained
Model's adata is minified?: True
[43]:

As expected, “Model’s adata is minified” is now True. Also, we can check the model’s minified_data_type:

[44]:
model.minified_data_type
[44]:
'latent_posterior_parameters'

Let’s check out the data now:

[45]:
model.adata
[45]:
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', 'cell_type_colors', 'hvg', 'log1p', '_scvi_adata_minify_type', '_scvi_uuid'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs', 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
    layers: 'counts'

First, let’s check that the original adata was not modified (minification is not in-place):

[46]:
model.adata is bdata
[46]:
False
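The pattern behind this result can be sketched with plain Python objects. The `ToyData` and `ToyModel` classes below are hypothetical stand-ins, not scvi-tools classes; they only illustrate how a method can build a new data object and rebind the model’s attribute while any earlier reference keeps pointing at the untouched original.

```python
from dataclasses import dataclass, field


@dataclass
class ToyData:
    counts: list
    obsm: dict = field(default_factory=dict)


class ToyModel:
    def __init__(self, data):
        self.data = data

    def minify(self):
        # Build a new object with emptied counts instead of mutating the old one,
        # then rebind the model's attribute to it.
        self.data = ToyData(counts=[], obsm=dict(self.data.obsm))


original = ToyData(counts=[[1, 0, 3], [0, 2, 0]], obsm={"latent": [[0.1], [0.2]]})
toy_model = ToyModel(original)
kept_reference = toy_model.data

toy_model.minify()
print(toy_model.data is kept_reference)  # False: the model now points at a new object
print(kept_reference.counts)             # the original counts are untouched
```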

Next, we see that we still have the same number of obs and vars: 18641 × 1200. This may seem strange: didn’t we say we minified the data? We did. We “emptied” the contents of adata.X, adata.layers["counts"], and adata.raw, and in their place cached the much smaller latent posterior parameters in adata.obsm["_scvi_latent_qzm"] and adata.obsm["_scvi_latent_qzv"]. Let’s double-check that:

[47]:
model.adata.X
[47]:
<18641x1200 sparse matrix of type '<class 'numpy.float32'>'
        with 0 stored elements in Compressed Sparse Row format>
[48]:
model.adata.layers["counts"]
[48]:
<18641x1200 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>
[49]:
model.adata.raw is None
[49]:
True
[50]:
bdata
[50]:
AnnData object with n_obs × n_vars = 18641 × 1200
    obs: 'NRP', 'age_group', 'cell_source', 'cell_type', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'source', 'type', 'version', 'cell_states', 'Used', '_scvi_batch', '_scvi_labels'
    var: 'gene_ids-Harvard-Nuclei', 'feature_types-Harvard-Nuclei', 'gene_ids-Sanger-Nuclei', 'feature_types-Sanger-Nuclei', 'gene_ids-Sanger-Cells', 'feature_types-Sanger-Cells', 'gene_ids-Sanger-CD45', 'feature_types-Sanger-CD45', 'n_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: '_scvi_manager_uuid', '_scvi_uuid', 'cell_type_colors', 'hvg', 'log1p'
    obsm: '_scvi_extra_categorical_covs', '_scvi_extra_continuous_covs', 'X_latent_qzm', 'X_latent_qzv'
    layers: 'counts'

Everything else is the same: all the other metadata is still there.
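To see how a matrix can keep its 18641 × 1200 shape while storing no elements, consider a bare-bones version of the Compressed Sparse Row layout written with Python lists. This is an illustrative sketch only; the `empty_csr` and `row_to_dense` helpers are hypothetical and are not how SciPy or scvi-tools implement sparse matrices.

```python
def empty_csr(n_rows, n_cols):
    """Return the three CSR arrays for an all-zero matrix of the given shape."""
    data = []                    # non-zero values: none
    indices = []                 # column index of each non-zero value: none
    indptr = [0] * (n_rows + 1)  # row i spans data[indptr[i]:indptr[i + 1]]
    return {"shape": (n_rows, n_cols), "data": data, "indices": indices, "indptr": indptr}


def row_to_dense(csr, i):
    """Expand row i of a CSR matrix into a dense list."""
    row = [0] * csr["shape"][1]
    for k in range(csr["indptr"][i], csr["indptr"][i + 1]):
        row[csr["indices"][k]] = csr["data"][k]
    return row


m = empty_csr(18641, 1200)
print(m["shape"], len(m["data"]))
```

The shape is metadata, so it survives; only the row pointer array (one entry per row, plus one) has to be stored.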

But is the data really smaller now? Let’s check:

[51]:
save_path = "local/hca_minified"
model.save(save_path, overwrite=True, save_anndata=True)
[61]:
ls -lh local/hca/adata.h5ad
-rw-r--r--  1 valehvpa  staff   212M Jan 30 18:02 local/hca/adata.h5ad
[60]:
ls -lh local/hca_minified/adata.h5ad
-rw-r--r--  1 valehvpa  staff   8.1M Jan 30 18:05 local/hca_minified/adata.h5ad
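A quick back-of-the-envelope calculation puts these numbers in context. The sketch below uses only figures shown on this page (18641 cells, 1200 genes, n_latent of 10, and the two file sizes). The dense entry count is an upper bound, since count matrices are stored sparsely, and the full file also holds adata.X, the counts layer, and adata.raw, so the two ratios are not expected to match.

```python
n_obs, n_vars, n_latent = 18641, 1200, 10

dense_entries = n_obs * n_vars         # one value per cell and gene
latent_entries = 2 * n_obs * n_latent  # a mean and a variance per latent dimension

print(dense_entries, latent_entries, dense_entries / latent_entries)
# 22369200 372820 60.0

# File sizes as reported by `ls -lh` above, in megabytes.
size_full_mb, size_minified_mb = 212.0, 8.1
file_ratio = round(size_full_mb / size_minified_mb, 1)
print(file_ratio)  # 26.2
```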

We also see a new uns key called _scvi_adata_minify_type. This specifies the type of minification and has the same value as model.minified_data_type. Checking for this key is a quick way to tell whether your data is minified. We also expose a utility function for that check.

[26]:
model.adata.uns["_scvi_adata_minify_type"]
[26]:
'latent_posterior_parameters'
[27]:
scvi.data._utils._is_minified(model.adata)
[27]:
True
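The key-based check described above can be sketched on a plain dictionary standing in for adata.uns. The `looks_minified` helper is a hypothetical name introduced here for illustration; it is not the library’s utility, and its behavior may differ from the real one.

```python
MINIFY_TYPE_KEY = "_scvi_adata_minify_type"


def looks_minified(uns):
    """Return True if the uns-like mapping carries a minification type."""
    return uns.get(MINIFY_TYPE_KEY) is not None


# Plain dicts standing in for adata.uns; the values are placeholders.
uns_minified = {"_scvi_uuid": "placeholder", MINIFY_TYPE_KEY: "latent_posterior_parameters"}
uns_full = {"_scvi_uuid": "placeholder"}

print(looks_minified(uns_minified), looks_minified(uns_full))  # True False
```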

Last but not least, you might have noticed a new obs column called _scvi_observed_lib_size. We store the pre-computed per-cell library sizes in this column and use them during inference, because the minified data no longer contains the full counts.
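As a small illustration of why this column has to be computed ahead of time, the sketch below assumes that a cell’s library size is its total count across genes (an assumption made here for illustration). The toy count values are made up.

```python
# Toy count matrix: 3 cells x 4 genes.
counts = [
    [3, 0, 1, 0],
    [0, 0, 5, 2],
    [1, 1, 1, 1],
]

# Per-cell library size: the sum of counts across genes,
# computed while the counts are still available.
observed_lib_size = [sum(row) for row in counts]

# After minification the counts are gone, but the totals remain usable.
counts = None
print(observed_lib_size)  # [4, 7, 4]
```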

Another claim we made earlier is that analysis functions are faster with minified data. Let’s time how long they take, using the get_likelihood_parameters method as an example.

[65]:
model_orig = scvi.model.SCVI.load("local/hca")

n = 5
start_time = time.time()
for i in range(n):
    model_orig.get_likelihood_parameters(n_samples=3, give_mean=True)
end_time = time.time()
print(
    f"without a minified data `get_likelihood_parameters` takes on average {(end_time - start_time)/n} seconds"
)
INFO     File local/hca/model.pt already downloaded
without a minified data `get_likelihood_parameters` takes on average 3.2357523918151854 seconds
[71]:
n = 5
start_time = time.time()
for i in range(n):
    model.get_likelihood_parameters(n_samples=3, give_mean=True)
end_time = time.time()
print(
    f"with a minified data `get_likelihood_parameters` takes on average {(end_time - start_time)/n} seconds"
)
with a minified data `get_likelihood_parameters` takes on average 3.049869012832642 seconds

The time savings are modest for this dataset (roughly 3.05 versus 3.24 seconds per call), but they are there nonetheless.
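If you want to repeat this kind of comparison on your own data, the timing loop can be wrapped in a small helper. The `average_runtime` function below is a hypothetical convenience, not part of scvi-tools; it uses `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`. The workload shown is a stand-in.

```python
import time


def average_runtime(fn, n=5):
    """Return the mean wall-clock time of n calls to fn, in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n


# Stand-in workload; in the notebook this could be a call such as
# lambda: model.get_likelihood_parameters(n_samples=3, give_mean=True)
elapsed = average_runtime(lambda: sum(range(100_000)), n=5)
print(f"average: {elapsed:.6f} seconds")
```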

Save and load#

Just like a regular model, you can save the model and its minified data, and load them back in:

[72]:
save_path = "local/hca_minified"
model.save(save_path, overwrite=True, save_anndata=True)

# load saved model with saved (minified) adata
loaded_model = scvi.model.SCVI.load(save_path)

print(loaded_model)
print("Data is minified?: " + str(scvi.data._utils._is_minified(loaded_model.adata)))
INFO     File local/hca_minified/model.pt already downloaded
SCVI Model with the following params:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb,
latent_distribution: normal
Training status: Trained
Model's adata is minified?: True

Data is minified?: True

Next, let’s load the model with non-minified data.

[73]:
loaded_model = scvi.model.SCVI.load(save_path, adata=bdata)

print(loaded_model)
print("Data is minified?: " + str(scvi.data._utils._is_minified(loaded_model.adata)))
INFO     File local/hca_minified/model.pt already downloaded
SCVI Model with the following params:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: zinb,
latent_distribution: normal
Training status: Trained
Model's adata is minified?: False

Data is minified?: False

So, to “undo” minification, you can always load your model with the original non-minified data (if you still have it), or with any other non-minified data that is compatible with the model.

Last but not least, let’s see what happens if we try to load a model whose adata was not minified, with a dataset that is minified:

[74]:
scvi.data._utils._is_minified(model.adata)
[74]:
True
[75]:
try:
    scvi.model.SCVI.load("local/hca", adata=model.adata)
except KeyError as e:
    print(e)
INFO     File local/hca/model.pt already downloaded
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [75], line 1
----> 1 scvi.model.SCVI.load("local/hca", adata=model.adata)

File ~/GitRepos/scvi-tools/scvi/model/base/_base_model.py:669, in BaseModelClass.load(cls, dir_path, adata, use_gpu, prefix, backup_url)
    665 # Calling ``setup_anndata`` method with the original arguments passed into
    666 # the saved model. This enables simple backwards compatibility in the case of
    667 # newly introduced fields or parameters.
    668 method_name = registry.get(_SETUP_METHOD_NAME, "setup_anndata")
--> 669 getattr(cls, method_name)(
    670     adata, source_registry=registry, **registry[_SETUP_ARGS_KEY]
    671 )
    673 model = _initialize_model(cls, adata, attr_dict)
    674 model.module.on_load(model)

File ~/GitRepos/scvi-tools/scvi/model/_scvi.py:213, in SCVI.setup_anndata(cls, adata, layer, batch_key, labels_key, size_factor_key, categorical_covariate_keys, continuous_covariate_keys, **kwargs)
    209     anndata_fields += cls._get_fields_for_adata_minification(adata_minify_type)
    210 adata_manager = AnnDataManager(
    211     fields=anndata_fields, setup_method_args=setup_method_args
    212 )
--> 213 adata_manager.register_fields(adata, **kwargs)
    214 cls.register_manager(adata_manager)

File ~/GitRepos/scvi-tools/scvi/data/_manager.py:179, in AnnDataManager.register_fields(self, adata, source_registry, **transfer_kwargs)
    176 self._validate_anndata_object(adata)
    178 for field in self.fields:
--> 179     self._add_field(
    180         field=field,
    181         adata=adata,
    182         source_registry=source_registry,
    183         **transfer_kwargs,
    184     )
    186 # Save arguments for register_fields.
    187 self._source_registry = deepcopy(source_registry)

File ~/GitRepos/scvi-tools/scvi/data/_manager.py:215, in AnnDataManager._add_field(self, field, adata, source_registry, **transfer_kwargs)
    211 if not field.is_empty:
    212     # Transfer case: Source registry is used for validation and/or setup.
    213     if source_registry is not None:
    214         field_registry[_constants._STATE_REGISTRY_KEY] = field.transfer_field(
--> 215             source_registry[_constants._FIELD_REGISTRIES_KEY][
    216                 field.registry_key
    217             ][_constants._STATE_REGISTRY_KEY],
    218             adata,
    219             **transfer_kwargs,
    220         )
    221     else:
    222         field_registry[_constants._STATE_REGISTRY_KEY] = field.register_field(
    223             adata
    224         )

KeyError: 'state_registry'

As we see, this is not allowed. This is because when you try to load a model with another dataset, we try to validate that dataset against the model’s registry. In this case, the data is not compatible with the model registry because it has attributes pertaining to minification, which this model is not aware of.
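The kind of mismatch described here can be mimicked with a toy registry built from plain dictionaries. Everything in this sketch is hypothetical: the field names, the registry layout, and the `validate` function are invented for illustration and do not reproduce the scvi-tools registry or its validation logic.

```python
# Fields the (non-minified) toy model was registered with.
model_registry = {
    "X": {"state_registry": {"n_vars": 1200}},
    "batch": {"state_registry": {"categories": ["a", "b"]}},
}


def validate(dataset_fields, registry):
    """Look up every dataset field in the registry; unknown fields raise KeyError."""
    return {name: registry[name]["state_registry"] for name in dataset_fields}


full_fields = ["X", "batch"]
minified_fields = ["X", "batch", "latent_qzm", "latent_qzv"]

validated = validate(full_fields, model_registry)  # passes

error_field = None
try:
    validate(minified_fields, model_registry)
except KeyError as err:
    # The first field the model's registry does not know about.
    error_field = err.args[0]
    print(f"incompatible field: {error_field}")
```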

Support#

Minification is not supported for all models yet. A model supports this functionality if and only if it inherits from the BaseMinifiedModeModelClass class. A model that does not support this:

  • does not have a minify_adata() method

  • cannot be loaded with minified data. If you try to do this you will see this error: “The MyModel model currently does not support minified data.”

To support minification for your own model, inherit your model class from the BaseMinifiedModeModelClass and your module class from the BaseMinifiedModeModuleClass.
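The inheritance rule above can be illustrated with stand-in classes. The classes and the `supports_minification` helper below are placeholders defined only for this sketch; in real code you would inherit from the scvi-tools base classes named above, whose requirements are not covered here.

```python
class BaseModel:
    """Stand-in for a generic model base class."""


class MinifiedModeBase(BaseModel):
    """Stand-in for BaseMinifiedModeModelClass; provides minify_adata()."""

    def minify_adata(self):
        return "minified"


class SupportedModel(MinifiedModeBase):
    pass


class UnsupportedModel(BaseModel):
    pass


def supports_minification(model_cls):
    """A model class supports minification iff it inherits from the minified-mode base."""
    return issubclass(model_cls, MinifiedModeBase)


print(supports_minification(SupportedModel), supports_minification(UnsupportedModel))
# True False
```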