This page was generated from data_tutorial.ipynb.

Data handling in scvi-tools

In this tutorial we will cover how data is handled in scvi-tools.


  1. Data Registration via setup_anndata() and register_tensor_from_anndata()

  2. Introduction to the scvi_setup_dict

  3. Explanation of data_registry and corresponding fields

  4. Data loading with AnnDataLoader()

import sys

#if branch is stable, will install via pypi, else will install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+$branch#egg=scvi-tools[tutorials]
import scvi
from scvi import _CONSTANTS
import numpy as np

Data Registration

scvi-tools knows what data to load into models via a data registration process handled by setup_anndata() and register_tensor_from_anndata().

The setup process saves the scvi_setup_dict to adata.uns['_scvi']. We will go over the scvi_setup_dict in subsequent sections.

setup_anndata() is used to set up common data fields for our models.

Explanation of parameters for setup_anndata():

  • adata is the input anndata

  • batch_key is the key in adata.obs for batch information. If this is None, will assume that all the data is the same batch.

  • labels_key is the key in adata.obs for label information. If this is None, will assume that all the data has the same label.

  • layer is the key in adata.layers to use for the input data matrix. By default, this is None and the input data matrix will be pulled from adata.X.

  • protein_expression_obsm_key is the key in adata.obsm for protein expression data.

  • protein_names_uns_key is the key in adata.uns for the protein names.

  • categorical_covariate_keys is a list of keys in adata.obs for categorical covariates.

  • continuous_covariate_keys is a list of keys in adata.obs for continuous covariates.

register_tensor_from_anndata() is a function for the generic registration of tensors in the AnnData object. It is used to set up data fields not included in setup_anndata().

Explanation of parameters for register_tensor_from_anndata():

  • adata is the input anndata

  • registry_key is the key to access the data in the dataloader output (More on this in the DataLoader section of this tutorial)

  • adata_attr_name is the AnnData attribute with the data. Can be ['obs', 'obsm', 'var', 'varm', 'uns']

  • adata_key_name is the key in adata.adata_attr_name to access the data

  • is_categorical, if True and adata_attr_name is obs, the data will be integer encoded and saved in adata.obs under the key passed to adata_alternate_key_name

  • adata_alternate_key_name is the key in adata.obs to save the data to if is_categorical is True and adata_attr_name is obs. If None, the saved key will be adata_key_name + '_scvi'

Under the hood:

  • For all categorical data (batch, labels, categorical covariates), scvi will automatically compute a mapping from values to integers. E.g., ['a','b','c','a'] will become [0,1,2,0].

  • For data fields registered with setup_anndata(), scvi will copy the data to a separate field in the anndata.

    • batch_key is copied to adata.obs['_scvi_batch'] with its integer encoding

    • labels_key is copied to adata.obs['_scvi_labels'] with its integer encoding

    • keys in categorical_covariate_keys are concatenated and saved as a pandas DataFrame and stored in adata.obsm['_scvi_extra_categoricals'] with its integer encoding.

    • keys in continuous_covariate_keys are concatenated and saved as a pandas DataFrame and stored in adata.obsm['_scvi_extra_continuous']

    • batch specific log library size mean is computed and stored in adata.obs['_scvi_local_l_mean']

    • batch specific log library size variance is computed and stored in adata.obs['_scvi_local_l_var']

  • For data fields registered with register_tensor_from_anndata():

    • If is_categorical is True and adata_attr_name is obs, data will be encoded as integers and saved to adata.obs with the key in adata_alternate_key_name.

    • If is_categorical is False, data will be loaded as is.
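The integer encoding and the batch-specific library size statistics described above can be sketched in plain Python. This is an illustrative sketch only, not the scvi-tools implementation: the helper names encode_categories and batch_log_library_stats are hypothetical, categories are assumed to be ordered by sorting, and the variance is assumed to be the population variance of the log total counts per cell.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance


def encode_categories(values):
    """Map each distinct value to an integer code (hypothetical helper).

    Returns (mapping, codes). The index of a category in `mapping` is its
    integer code. Sorted category order is an assumption of this sketch.
    """
    mapping = sorted(set(values))
    code_of = {category: code for code, category in enumerate(mapping)}
    return mapping, [code_of[value] for value in values]


def batch_log_library_stats(counts, batch_codes):
    """Mean and variance of log(total counts per cell), grouped by batch.

    `counts` is a list of per-cell count rows and `batch_codes` holds one
    integer batch code per cell. Cells with zero total counts are not
    handled here; population variance is an assumption of this sketch.
    """
    log_sizes = defaultdict(list)
    for row, batch in zip(counts, batch_codes):
        log_sizes[batch].append(math.log(sum(row)))
    return {batch: (fmean(v), pvariance(v)) for batch, v in log_sizes.items()}


mapping, codes = encode_categories(["a", "b", "c", "a"])
print(mapping, codes)

stats = batch_log_library_stats([[1, 1], [2, 2], [4, 4]], [0, 0, 1])
print(stats)
```

Every cell in a batch would then receive that batch's mean and variance, which is why the two statistics can be stored as per-cell columns in adata.obs.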

In the following code, we first format an example AnnData object for scvi-tools, then call setup_anndata() to register all the tensors we want to load into the model during training. For our example AnnData object, we build off the synthetic_iid() dataset, copy X to a layer, and add continuous and categorical covariates to the AnnData.

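# load the synthetic dataset without registering it, so that setup_anndata() can be called explicitly
adata = scvi.data.synthetic_iid(run_setup_anndata=False)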
adata.layers['raw_counts'] = adata.X.copy()
adata.obs['my_categorical_covariate'] = ['A'] * 200 + ['B'] * 200
adata.obs['my_continuous_covariate'] = np.random.randint(0,100,400)
AnnData object with n_obs × n_vars = 400 × 100
    obs: 'batch', 'labels', 'my_categorical_covariate', 'my_continuous_covariate'
    uns: 'protein_names'
    obsm: 'protein_expression'
    layers: 'raw_counts'
scvi.data.setup_anndata(
    adata,
    batch_key='batch',
    labels_key='labels',
    layer='raw_counts',
    protein_expression_obsm_key='protein_expression',
    protein_names_uns_key='protein_names',
    categorical_covariate_keys=['my_categorical_covariate'],
    continuous_covariate_keys=['my_continuous_covariate'],
)
INFO     Using batches from adata.obs["batch"]
INFO     Using labels from adata.obs["labels"]
INFO     Using data from adata.layers["raw_counts"]
INFO     Computing library size prior per batch
INFO     Using protein expression from adata.obsm['protein_expression']
INFO     Using protein names from adata.uns['protein_names']
INFO     Successfully registered anndata object containing 400 cells, 100 vars, 2 batches, 3
         labels, and 100 proteins. Also registered 1 extra categorical covariates and 1 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.

We can view what was registered via the view_anndata_setup() command.

Anndata setup with scvi-tools version 0.0.0.
              Data Summary              
┃             Data              Count ┃
│            Cells               400  │
│             Vars               100  │
│            Labels               3   │
│           Batches               2   │
│           Proteins             100  │
│ Extra Categorical Covariates    1   │
│ Extra Continuous Covariates     1   │
                      SCVI Data Registry                       
┃        Data                  scvi-tools Location           ┃
│         X                 adata.layers['raw_counts']       │
│   batch_indices            adata.obs['_scvi_batch']        │
│    local_l_mean        adata.obs['_scvi_local_l_mean']     │
│    local_l_var          adata.obs['_scvi_local_l_var']     │
│       labels              adata.obs['_scvi_labels']        │
│ protein_expression     adata.obsm['protein_expression']    │
│      cat_covs       adata.obsm['_scvi_extra_categoricals'] │
│     cont_covs        adata.obsm['_scvi_extra_continuous']  │
                     Label Categories                     
┃   Source Location    Categories  scvi-tools Encoding ┃
│ adata.obs['labels']   label_0             0          │
│                       label_1             1          │
│                       label_2             2          │
                    Batch Categories                     
┃  Source Location    Categories  scvi-tools Encoding ┃
│ adata.obs['batch']   batch_0             0          │
│                      batch_1             1          │
                        Extra Categorical Variables                         
┃            Source Location             Categories  scvi-tools Encoding ┃
│ adata.obs['my_categorical_covariate']      A                0          │
│                                            B                1          │
│                                                                        │
            Extra Continuous Variables            
┃           Source Location              Range  ┃
│ adata.obs['my_continuous_covariate']  0 -> 99 │

If there are other tensors in the anndata you need to register, you can use the register_tensor_from_anndata() command.

In the following code we add a new field to our AnnData with the key extra_values. Then we register the tensor with register_tensor_from_anndata() by passing the AnnData (adata=adata), the attribute that holds the key we want to register (adata_attr_name='obs'), the key we wish to register (adata_key_name="extra_values"), and the key used to access the data when it is loaded via the dataloader (registry_key='_extra_values').

key = "extra_values"
adata.obs[key] = np.random.randint(0, 10, 400)
0      2
1      5
2      7
3      2
4      6
395    5
396    7
397    0
398    4
399    4
Name: extra_values, Length: 400, dtype: int64
scvi.data.register_tensor_from_anndata(
    adata=adata,
    adata_attr_name='obs',
    adata_key_name=key,
    registry_key='_extra_values',
    # inferred: the registry shown later stores this field as 'extra_values_scvi' with an integer mapping
    is_categorical=True,
)

Scvi setup dictionary

In this section we enumerate the fields in the scvi setup dictionary. The scvi setup dictionary is accessed via adata.uns['_scvi'].

The following keys in the scvi setup dictionary will always be there:

  • scvi_version keeps track of the version of scvi-tools used to setup the AnnData Object

  • categorical_mappings keeps track of the mappings for the categorical variables (batch and label)

  • data_registry contains the location of data to load. This is what is used by the DataLoaders to iterate over the AnnData

  • summary_stats contains summary statistics

The following keys will be in the scvi setup dictionary if they were provided:

  • protein_names keeps track of the protein names

  • extra_categoricals keeps track of the keys and mappings of the extra categorical covariates

  • extra_continuous_keys keeps track of the keys of the extra continuous covariates.

scvi_setup_dict = adata.uns['_scvi']
scvi_setup_dict.keys()
dict_keys(['scvi_version', 'categorical_mappings', 'protein_names', 'extra_categoricals', 'extra_continuous_keys', 'data_registry', 'summary_stats'])

Here we show the contents of scvi_version, summary_stats, and protein_names. We will go over the data_registry, extra_categoricals, and extra_continuous_keys in the next section.

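# scvi version
print(scvi_setup_dict['scvi_version'])
# summary stats
print(scvi_setup_dict['summary_stats'])
{'n_batch': 2, 'n_cells': 400, 'n_vars': 100, 'n_labels': 3, 'n_proteins': 100, 'n_continuous_covs': 1}
# protein names
print(scvi_setup_dict['protein_names'])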
['0' '1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15'
 '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29'
 '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43'
 '44' '45' '46' '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57'
 '58' '59' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '70' '71'
 '72' '73' '74' '75' '76' '77' '78' '79' '80' '81' '82' '83' '84' '85'
 '86' '87' '88' '89' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99']

Data Registry

Now let's turn our attention to the data_registry, categorical_mappings, extra_categoricals, and extra_continuous_keys.

The data_registry is used by the DataLoaders to load data during the data loop. Each key of the data_registry is the name of a tensor and is used to retrieve the data from the dataloader output.

  • For data registered via setup_anndata(), the keys are set globally via scvi._CONSTANTS.

  • For data registered via register_tensor_from_anndata(), the key is provided via the registry_key parameter.

The value of each key in the data_registry is a dictionary with two keys: attr_name and attr_key.

  • attr_name is the attribute of adata to load data from, e.g. obs, obsm, layers.

  • attr_key is the key of the attribute to access the data

For example, based on the following data_registry, batch information is loaded from adata.obs['_scvi_batch'] and will be accessible via _CONSTANTS.BATCH_KEY.

data_registry = scvi_setup_dict['data_registry']
{'X': {'attr_name': 'layers', 'attr_key': 'raw_counts'},
 'batch_indices': {'attr_name': 'obs', 'attr_key': '_scvi_batch'},
 'local_l_mean': {'attr_name': 'obs', 'attr_key': '_scvi_local_l_mean'},
 'local_l_var': {'attr_name': 'obs', 'attr_key': '_scvi_local_l_var'},
 'labels': {'attr_name': 'obs', 'attr_key': '_scvi_labels'},
 'protein_expression': {'attr_name': 'obsm', 'attr_key': 'protein_expression'},
 'cat_covs': {'attr_name': 'obsm', 'attr_key': '_scvi_extra_categoricals'},
 'cont_covs': {'attr_name': 'obsm', 'attr_key': '_scvi_extra_continuous'},
 '_extra_values': {'attr_name': 'obs', 'attr_key': 'extra_values_scvi'}}
print(_CONSTANTS.X_KEY)                 # key for X values
print(_CONSTANTS.BATCH_KEY)             # key for batch info
print(_CONSTANTS.LOCAL_L_MEAN_KEY)      # key for mean of batch specific log library size
print(_CONSTANTS.LOCAL_L_VAR_KEY)       # key for variance of batch specific log library size
print(_CONSTANTS.LABELS_KEY)            # key for label data
print(_CONSTANTS.PROTEIN_EXP_KEY)       # key for protein data
print(_CONSTANTS.CAT_COVS_KEY)          # key for categorical covariate data
print(_CONSTANTS.CONT_COVS_KEY)         # key for continuous covariate data
from scvi import _CONSTANTS
data_registry[_CONSTANTS.BATCH_KEY]  # where batch information is stored

{'attr_name': 'obs', 'attr_key': '_scvi_batch'}
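The lookup that a data_registry entry describes can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: toy_adata is a hypothetical stand-in whose attributes are plain dicts rather than a real AnnData, and get_from_registry is a hypothetical helper, not an scvi-tools function.

```python
from types import SimpleNamespace

# A stand-in for an AnnData object: each attribute is a plain dict.
# (Hypothetical toy data; a real AnnData holds arrays and DataFrames.)
toy_adata = SimpleNamespace(
    obs={"_scvi_batch": [0, 0, 1], "_scvi_labels": [2, 0, 1]},
    layers={"raw_counts": [[1, 0], [0, 3], [2, 2]]},
)

# Same shape as the registry shown above: registry key -> location.
data_registry = {
    "X": {"attr_name": "layers", "attr_key": "raw_counts"},
    "batch_indices": {"attr_name": "obs", "attr_key": "_scvi_batch"},
    "labels": {"attr_name": "obs", "attr_key": "_scvi_labels"},
}


def get_from_registry(adata, registry, registry_key):
    """Resolve a registry key to the data it points at (hypothetical helper)."""
    location = registry[registry_key]
    container = getattr(adata, location["attr_name"])  # e.g. adata.obs
    return container[location["attr_key"]]             # e.g. adata.obs['_scvi_batch']


batch_values = get_from_registry(toy_adata, data_registry, "batch_indices")
x_values = get_from_registry(toy_adata, data_registry, "X")
print(batch_values)
```

Keeping the location separate from the registry key is what lets the same key (for example the value of _CONSTANTS.X_KEY) point at adata.X in one setup and at a layer in another.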

During the data registration process, we also keep track of the location of the original data as well as the categorical to integer mappings.

In the categorical_mappings dict, the keys are the attr_key of each categorical entry in the data_registry (except for the extra continuous and categorical covariates). Each value is another dictionary with two keys:

  • original_key is the original key passed in by the user to load the data

  • mapping is the categorical to integer mapping of the data. The index of the category is its corresponding integer representation.

{'_scvi_batch': {'original_key': 'batch',
  'mapping': array(['batch_0', 'batch_1'], dtype=object)},
 '_scvi_labels': {'original_key': 'labels',
  'mapping': array(['label_0', 'label_1', 'label_2'], dtype=object)},
 'extra_values_scvi': {'original_key': 'extra_values',
  'mapping': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}}
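Because the index of a category is its integer code, the mapping can be used to translate encoded values back to the original categories. The snippet below is a small sketch of that idea using plain lists in place of NumPy arrays; the decode helper is hypothetical and only mirrors the structure shown above.

```python
# Plain-list version of the categorical_mappings structure shown above.
categorical_mappings = {
    "_scvi_batch": {
        "original_key": "batch",
        "mapping": ["batch_0", "batch_1"],
    },
    "_scvi_labels": {
        "original_key": "labels",
        "mapping": ["label_0", "label_1", "label_2"],
    },
}


def decode(codes, attr_key, mappings=categorical_mappings):
    """Translate integer codes back to categories (hypothetical helper)."""
    mapping = mappings[attr_key]["mapping"]
    # The position of a category in `mapping` is its integer encoding.
    return [mapping[code] for code in codes]


decoded = decode([0, 0, 1, 1, 2], "_scvi_labels")
print(decoded)
```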

If the AnnData was set up with categorical_covariate_keys, there will also be an extra_categoricals dict in the scvi_setup_dict.

There are three keys:

  • mappings, whose value is a dictionary where the key is the original obs key and the value is the categorical mapping

  • keys, the column names of the pandas DataFrame in adata.obsm['_scvi_extra_categoricals']

  • n_cats_per_key contains the number of categories per key

{'mappings': {'my_categorical_covariate': array(['A', 'B'], dtype=object)},
 'keys': ['my_categorical_covariate'],
 'n_cats_per_key': [2]}
0 0
1 0
2 0
3 0
4 0
... ...
395 1
396 1
397 1
398 1
399 1

400 rows × 1 columns
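To make the relationship between the three keys concrete, here is a sketch of how such a dictionary could be assembled from the raw covariate columns. It is an illustration under assumptions, not the library's code: build_extra_categoricals is a hypothetical helper, columns are plain lists, and categories are assumed to be ordered by sorting.

```python
def build_extra_categoricals(columns):
    """Summarise and integer-encode categorical covariates (hypothetical helper).

    `columns` maps each original obs key to its list of values. Returns the
    summary dict (mappings, keys, n_cats_per_key) and the encoded columns.
    """
    mappings = {}
    encoded = {}
    for key, values in columns.items():
        categories = sorted(set(values))  # assumed ordering
        code_of = {category: code for code, category in enumerate(categories)}
        mappings[key] = categories
        encoded[key] = [code_of[value] for value in values]
    summary = {
        "mappings": mappings,
        "keys": list(columns),
        "n_cats_per_key": [len(mappings[key]) for key in columns],
    }
    return summary, encoded


summary, encoded = build_extra_categoricals(
    {"my_categorical_covariate": ["A"] * 200 + ["B"] * 200}
)
print(summary)
```

The encoded column starts with 0 and ends with 1, matching the table above.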

If the AnnData was set up with continuous_covariate_keys, extra_continuous_keys will be a key in the scvi_setup_dict.

This is a list of the columns in adata.obsm['_scvi_extra_continuous'] from which the extra continuous covariates are loaded.

array(['my_continuous_covariate'], dtype=object)
0 6
1 9
2 46
3 8
4 43
... ...
395 49
396 12
397 88
398 8
399 29

400 rows × 1 columns


AnnDataLoader

AnnDataLoader is the base dataloader for scvi-tools. In this section we show how the registered data is loaded by AnnDataLoader.

Parameters of AnnDataLoader:

  • adata: registered AnnData object to load data from

  • shuffle: if True will shuffle the data beforehand

  • indices: can provide a subset of indices to load from (Useful when doing train/test splits)

  • data_and_attributes: a dictionary where the key corresponds to its key in the data_registry and the value is the numpy data type. By default, all data is passed to the model as np.float32.

  • data_loader_kwargs: additional keyword arguments passed on to torch.utils.data.DataLoader

First, we construct an AnnDataLoader and get the first batch. Then we will enumerate all the values in the batch. The variable data_batch contains the first batch of data. It is a dictionary whose values are the tensors registered in the previous section via setup_anndata() and register_tensor_from_anndata().

from scvi.dataloaders._ann_dataloader import AnnDataLoader

# initialize an AnnDataLoader which will iterate over our anndata
adl = AnnDataLoader(adata, shuffle=False, batch_size=10)

# get the first batch of data
data_batch = next(iter(adl))

For tensors set up with setup_anndata(), the keys are from scvi._CONSTANTS. For tensors set up with register_tensor_from_anndata(), the keys are the values passed to registry_key. Notice that the keys in data_batch are the same as the keys in the data_registry. See the previous section for a more detailed explanation.

dict_keys(['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels', 'protein_expression', 'cat_covs', 'cont_covs', '_extra_values'])
{'X': {'attr_name': 'layers', 'attr_key': 'raw_counts'},
 'batch_indices': {'attr_name': 'obs', 'attr_key': '_scvi_batch'},
 'local_l_mean': {'attr_name': 'obs', 'attr_key': '_scvi_local_l_mean'},
 'local_l_var': {'attr_name': 'obs', 'attr_key': '_scvi_local_l_var'},
 'labels': {'attr_name': 'obs', 'attr_key': '_scvi_labels'},
 'protein_expression': {'attr_name': 'obsm', 'attr_key': 'protein_expression'},
 'cat_covs': {'attr_name': 'obsm', 'attr_key': '_scvi_extra_categoricals'},
 'cont_covs': {'attr_name': 'obsm', 'attr_key': '_scvi_extra_continuous'},
 '_extra_values': {'attr_name': 'obs', 'attr_key': 'extra_values_scvi'}}
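The reason the batch keys match the registry keys can be seen in a simplified model of the loading loop. The following is a dependency-free sketch, not the AnnDataLoader implementation: iterate_minibatches is a hypothetical function, toy is a nested dict standing in for an AnnData, and batches are returned as lists rather than tensors.

```python
import random


def iterate_minibatches(adata_like, registry, batch_size,
                        indices=None, shuffle=False, seed=0):
    """Yield one dict per minibatch, keyed like the registry (sketch)."""
    first = next(iter(registry.values()))
    n_obs = len(adata_like[first["attr_name"]][first["attr_key"]])
    order = list(range(n_obs)) if indices is None else list(indices)
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        # Every registry entry contributes one field to the batch.
        yield {
            key: [adata_like[loc["attr_name"]][loc["attr_key"]][i] for i in chunk]
            for key, loc in registry.items()
        }


# Hypothetical toy data with five cells.
toy = {
    "obs": {"_scvi_labels": [0, 0, 1, 1, 2], "extra_values_scvi": [2, 5, 7, 2, 6]},
    "layers": {"raw_counts": [[1, 0], [0, 3], [2, 2], [4, 1], [0, 0]]},
}
registry = {
    "X": {"attr_name": "layers", "attr_key": "raw_counts"},
    "labels": {"attr_name": "obs", "attr_key": "_scvi_labels"},
    "_extra_values": {"attr_name": "obs", "attr_key": "extra_values_scvi"},
}

first_batch = next(iterate_minibatches(toy, registry, batch_size=2))
print(first_batch)
```

With shuffle=False the first batch holds the first batch_size cells in order, which is the property the next paragraphs rely on when comparing the loaded labels with the AnnData.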

If we look at the labels for the first batch from the data loader, they correspond to the labels of the first 10 cells of our AnnData.

0    label_0
1    label_0
2    label_1
3    label_1
4    label_2
5    label_1
6    label_1
7    label_2
8    label_2
9    label_1
Name: labels, dtype: category
Categories (3, object): ['label_0', 'label_1', 'label_2']
# setup_anndata automatically encoded the categorical labels as integers
print(data_batch[_CONSTANTS.X_KEY].shape) #shape is batch_size x n_genes
print(data_batch[_CONSTANTS.BATCH_KEY].shape) #shape is batch_size x 1
torch.Size([10, 100])
torch.Size([10, 1])

For the tensor we registered via register_tensor_from_anndata(), the key to access the data is the value passed to the registry_key argument, which in our case was _extra_values.

0    2
1    5
2    7
3    2
4    6
5    2
6    2
7    3
8    1
9    5
Name: extra_values, dtype: int64

By default, all the data loaded in scvi-tools is np.float32. If you wish to load as a different datatype, you can pass in a dictionary where the key corresponds to a key in the data registry and the value is the datatype.

In the following snippets, we first load the data with the default dtype, then load some continuous data as np.float64 and integer data as np.long.

adl = AnnDataLoader(adata, shuffle=False, batch_size=10)
data_batch = next(iter(adl))

# by default data has the dtype np.float32
dict_keys(['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels', 'protein_expression', 'cat_covs', 'cont_covs', '_extra_values'])

To specify the datatype of each key, we can use the data_and_attributes parameter of AnnDataLoader. Here we make X an np.long and our cont_covs an np.float64, but keep everything else as np.float32.

#the keys of data_and_attributes should correspond to keys in the data registry
data_registry_keys = adata.uns['_scvi']['data_registry'].keys()
print("Data Registry keys:",data_registry_keys)
Data Registry keys: dict_keys(['X', 'batch_indices', 'local_l_mean', 'local_l_var', 'labels', 'protein_expression', 'cat_covs', 'cont_covs', '_extra_values'])
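data_and_attributes = {}
for key in data_registry_keys:
    if key == _CONSTANTS.X_KEY:
        # np.long aliased Python's int in the NumPy version used here (see the output below)
        data_and_attributes[key] = np.long
    elif key == _CONSTANTS.CONT_COVS_KEY:
        data_and_attributes[key] = np.float64
    else:
        # everything else keeps the default dtype
        data_and_attributes[key] = np.float32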
{'X': <class 'int'>, 'batch_indices': <class 'numpy.float32'>, 'local_l_mean': <class 'numpy.float32'>, 'local_l_var': <class 'numpy.float32'>, 'labels': <class 'numpy.float32'>, 'protein_expression': <class 'numpy.float32'>, 'cat_covs': <class 'numpy.float32'>, 'cont_covs': <class 'numpy.float64'>, '_extra_values': <class 'numpy.float32'>}
adl = AnnDataLoader(adata, shuffle=False, batch_size=10, data_and_attributes=data_and_attributes)
data_batch = next(iter(adl))

# each key now has the dtype given in data_and_attributes

Finally, if the data_and_attributes parameter is used, it will only load the keys of the passed in dictionary. For example, if the only key in the dictionary passed in to data_and_attributes is X, the data loader will only load X.

# np.float was an alias for Python's 64-bit float and was removed in NumPy 1.24
data_and_attributes = {_CONSTANTS.X_KEY: np.float64}
adl = AnnDataLoader(
    adata, shuffle=False, batch_size=10, data_and_attributes=data_and_attributes
)
data_batch = next(iter(adl))
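The two effects of data_and_attributes, selecting which keys are loaded and choosing their datatype, can be sketched without any dependencies. This is an illustration under assumptions rather than the loader's actual logic: select_and_cast is a hypothetical helper, the batch is a dict of plain lists, and Python's built-in float and int stand in for the NumPy dtypes.

```python
def select_and_cast(batch, data_and_attributes=None):
    """Keep only the requested keys and cast their values (hypothetical helper).

    With no mapping given, every key is kept and cast to float, mirroring
    the "everything is a float by default" behaviour described above.
    """
    if data_and_attributes is None:
        data_and_attributes = {key: float for key in batch}
    return {
        key: [cast(value) for value in batch[key]]
        for key, cast in data_and_attributes.items()
    }


# Hypothetical one-dimensional toy batch.
batch = {"X": [3.0, 0.0, 1.0], "cont_covs": [6, 9, 46], "labels": [0, 1, 2]}

everything = select_and_cast(batch)
only_x = select_and_cast(batch, {"X": int})
print(only_x)
```

Passing a dictionary with a single key therefore both restricts the output to that key and fixes its type, which is why data_batch in the last example contains only X.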