Note

This page was generated from scvi_hub_upload_and_large_files.ipynb.

Using scvi-hub to upload pretrained scvi-tools models#

In this tutorial, we will see how to use scvi-tools to upload pretrained models onto Hugging Face. We will also see how to handle large training datasets.

If you have not already done so, read our scvi_hub_intro_and_download tutorial first; it is a prerequisite for this one. It introduces Hugging Face (HF) and the scvi-hub, and describes how to use them to download pretrained models from the HF Model Hub.

[ ]:
!pip install --quiet scvi-colab
from scvi_colab import install

install()

Imports#

Let’s start by adding the Python imports we need.

[ ]:
import scanpy as sc
import scvi
from scvi.hub import HubMetadata, HubModel, HubModelCardHelper

sc.set_figure_params(figsize=(4, 4))

# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

Pretrain a demo model#

Let’s pretrain a model on some synthetic data; we’ll upload this model to the scvi-hub later.

[5]:
local_dir = "local/scvi_hub_upload"

adata = scvi.data.synthetic_iid()
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata)
model.train(1)
model.save(local_dir, save_anndata=True, overwrite=True)
/Users/valehvpa/GitRepos/scvi-tools/scvi/data/_built_in_data/_synthetic.py:32: FutureWarning: X.dtype being converted to np.float32 from int64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  adata = AnnData(data)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (mps), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/opt/homebrew/Caskroom/miniconda/base/envs/scvi-hub/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1789: UserWarning: MPS available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='mps', devices=1)`.
  rank_zero_warn(
/opt/homebrew/Caskroom/miniconda/base/envs/scvi-hub/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1892: PossibleUserWarning: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 39.55it/s, loss=333, v_num=1]
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 1/1: 100%|██████████| 1/1 [00:00<00:00, 31.65it/s, loss=333, v_num=1]

Model Card and Metadata#

To upload pretrained models, you’ll need to create an instance of the HubModel class and then simply call its push_to_huggingface_hub method.

As you can see from the API reference, the HubModel init function requires metadata and a Model Card. There are a few ways you can provide these:

  • The metadata can be either an instance of the HubMetadata class that contains the required metadata for this model, or a path to a JSON file on disk where this metadata can be read from.

  • The Model Card can be an instance of the HubModelCardHelper class created for this model, or an instance of the HF Model Card object, or a path to a Markdown file on disk where the model card can be read from.

    • You can also use the HubModelCardHelper class to create a Model Card from the scvi-tools template, then save it on disk and change it as you wish before passing its path into the HubModel class.
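As a rough illustration of the JSON route, the sketch below writes and re-reads a metadata file using only the standard library. The field names are taken from the HubMetadata repr printed later in this tutorial; the file name metadata.json and the assumption that the JSON keys mirror those fields one-to-one are ours, so check the API reference for the exact schema before relying on it.

```python
import json
import tempfile
from pathlib import Path

# Field names copied from the HubMetadata repr shown in this tutorial.
# Assumption: the on-disk JSON uses the same keys.
metadata = {
    "scvi_version": "0.19.0a0",
    "anndata_version": "0.8.0",
    "training_data_url": None,
    "model_parent_module": "scvi.model",
}

# Hypothetical file name, written to a temporary directory for the demo.
metadata_path = Path(tempfile.mkdtemp()) / "metadata.json"
metadata_path.write_text(json.dumps(metadata, indent=4))

# Read the file back to confirm the round trip preserves every field.
loaded = json.loads(metadata_path.read_text())
print(loaded)
```

Keeping the metadata in a file like this makes it easy to version it alongside the model directory.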

Here we’ll see how to create the HubMetadata and a Model Card from the data on disk.

[17]:
hm = HubMetadata.from_dir(local_dir, anndata_version="0.8.0")

hmch = HubModelCardHelper.from_dir(
    local_dir,
    license_info="cc-by-4.0",
    anndata_version="0.8.0",
    data_modalities=["rna", "protein"],
    data_is_annotated=False,
    description="This is a demo model used during upload demo.",
    references="None.",
)
INFO     File local/scvi_hub_upload/model.pt already downloaded
INFO     File local/scvi_hub_upload/model.pt already downloaded
[18]:
print(hmch.model_card.content)
---
license: cc-by-4.0
library_name: scvi-tools
tags:
- model_cls_name:SCVI
- scvi_version:0.19.0a0
- anndata_version:0.8.0
- modality:rna
- modality:protein
- annotated:False
---

# Description

This is a demo model used during upload demo.

# Model properties

Many model properties are in the model tags. Some more are listed below.

**model_init_params**:
```json
{
    "n_hidden": 128,
    "n_latent": 10,
    "n_layers": 1,
    "dropout_rate": 0.1,
    "dispersion": "gene",
    "gene_likelihood": "zinb",
    "latent_distribution": "normal"
}
```

**model_setup_anndata_args**:
```json
{
    "layer": null,
    "batch_key": null,
    "labels_key": null,
    "size_factor_key": null,
    "categorical_covariate_keys": null,
    "continuous_covariate_keys": null
}
```

**model_summary_stats**:
|     Summary Stat Key     | Value |
|--------------------------|-------|
|         n_batch          |   1   |
|         n_cells          |  400  |
| n_extra_categorical_covs |   0   |
| n_extra_continuous_covs  |   0   |
|         n_labels         |   1   |
|          n_vars          |  100  |

**model_data_registry**:
| Registry Key |    scvi-tools Location    |
|--------------|---------------------------|
|      X       |          adata.X          |
|    batch     | adata.obs['_scvi_batch']  |
|    labels    | adata.obs['_scvi_labels'] |

**model_parent_module**: scvi.model

**data_is_latent**: False

# Training data

This is an optional link to where the training data is stored if it is too large
to host on the huggingface Model hub.

<!-- This field is required for models that haven't been minified by converting to latent
mode. See the scvi-tools documentation for more details. -->

Training data url: N/A

# Training code

This is an optional link to the code used to train the model.

Training code url: N/A

# References

None.

Note: To change the card’s content, save the card to disk, edit the Markdown file manually, and then pass its path to HubModel.

hmch.model_card.save("local/my_model_card.md")  # then edit the Markdown file on disk...
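The manual edit can also be scripted. The sketch below is one possible way to fill in the "Training data url:" line of a saved card with plain string handling. The helper set_training_data_url is hypothetical (it is not part of scvi-tools), the URL is a placeholder, and the card text is a shortened stand-in for the file written by hmch.model_card.save.

```python
import tempfile
from pathlib import Path


def set_training_data_url(card_text: str, url: str) -> str:
    """Replace the value on the 'Training data url:' line of a model card."""
    prefix = "Training data url:"
    lines = card_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            lines[i] = f"{prefix} {url}"
            return "\n".join(lines)
    raise ValueError(f"no '{prefix}' line found in the model card")


# Shortened stand-in for a card saved to disk (hypothetical content).
card_path = Path(tempfile.mkdtemp()) / "my_model_card.md"
card_path.write_text(
    "# Training data\n\nTraining data url: N/A\n\n"
    "# Training code\n\nTraining code url: N/A\n"
)

# Placeholder URL, for illustration only.
updated = set_training_data_url(
    card_path.read_text(), "https://example.org/training_data.h5ad"
)
card_path.write_text(updated)
print(updated)
```

The path of the edited file can then be passed to HubModel as the model card.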

Create a HubModel and upload it#

Now we can create the HubModel and push it to the HF Model Hub:

[24]:
hmo = HubModel(local_dir, metadata=hm, model_card=hmch)
hmo
HubModel with:
local_dir: local/scvi_hub_upload
model loaded? No
adata loaded? No
large_training_adata loaded? No
metadata:
HubMetadata(scvi_version='0.19.0a0', anndata_version='0.8.0', training_data_url=None, model_parent_module='scvi.model')
model_card:
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
license: cc-by-4.0

library_name: scvi-tools

tags:

model_cls_name:SCVI
scvi_version:0.19.0a0
anndata_version:0.8.0
modality:rna
modality:protein
annotated:False

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                   Description                                                   ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

This is a demo model used during upload demo.

╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                Model properties                                                 ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Many model properties are in the model tags. Some more are listed below.

model_init_params:

{
    "n_hidden": 128,
    "n_latent": 10,
    "n_layers": 1,
    "dropout_rate": 0.1,
    "dispersion": "gene",
    "gene_likelihood": "zinb",
    "latent_distribution": "normal"
}

model_setup_anndata_args:

{
    "layer": null,
    "batch_key": null,
    "labels_key": null,
    "size_factor_key": null,
    "categorical_covariate_keys": null,
    "continuous_covariate_keys": null
}

model_summary_stats:

|     Summary Stat Key     | Value |
|--------------------------|-------|
|         n_batch          |   1   |
|         n_cells          |  400  |
| n_extra_categorical_covs |   0   |
| n_extra_continuous_covs  |   0   |
|         n_labels         |   1   |
|          n_vars          |  100  |

model_data_registry:

| Registry Key |    scvi-tools Location    |
|--------------|---------------------------|
|      X       |          adata.X          |
|    batch     | adata.obs['_scvi_batch']  |
|    labels    | adata.obs['_scvi_labels'] |

model_parent_module: scvi.model

data_is_latent: False

╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                  Training data                                                  ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

This is an optional link to where the training data is stored if it is too large
to host on the huggingface Model hub.

Training data url: N/A

╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                  Training code                                                  ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

This is an optional link to the code used to train the model.

Training code url: N/A

╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                   References                                                    ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

None.

To upload, you need to call:

hmo.push_to_huggingface_hub(
    repo_name=repo_name, repo_token=repo_token, repo_create=True
)

We won’t do it here but will explain the parameters you need to pass:

  • repo_name: The name/id of your repo.

  • repo_token: The token you need to authenticate yourself to the HF Model Hub. It must have “write” permissions. You can get this from your HF account page. Read this article to find out how.

    • The token can either be passed in as plain text or as the full path to a file on disk where the token is stored.

  • repo_create: Whether you want scvi-tools to create the repo for you. If you want to create the repo yourself on the HF Model Hub or if it already exists, you can set this to False.
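The two accepted forms of repo_token (plain text or a file path) can be illustrated with a small helper. This is only a sketch of the idea, not scvi-tools' own logic: resolve_token is a hypothetical function, and the token string is a placeholder rather than a real credential.

```python
import os
import tempfile


def resolve_token(repo_token: str) -> str:
    """Return the token itself, or the contents of the file it points to."""
    if os.path.isfile(repo_token):
        with open(repo_token) as f:
            return f.read().strip()
    return repo_token


# Write a placeholder token to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hf_placeholder_token\n")
    token_file = f.name

from_file = resolve_token(token_file)  # token given as a file path
from_text = resolve_token("hf_placeholder_token")  # token given as plain text
print(from_file == from_text)
```

Storing the token in a file keeps it out of notebooks and shell history, which is usually the safer of the two options.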

Large training data#

So far, every model we’ve seen has stored its dataset alongside the model on the HF Model Hub. In some cases this is not possible, or not desirable, because the training data is too large. For any file larger than 5GB, you are prompted to store your training data in separate storage and provide its URL to the HubModel. This tells scvi-tools where to pull the data from when loading it (or the model) into memory.
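Before uploading, it can help to check which files in the saved model directory reach the size limit. The sketch below does this with the standard library only. The helper files_over_limit is hypothetical, the demo file names are illustrative, and reading "5GB" as 5 * 1024**3 bytes is an assumption; the exact cutoff applied by the hub may differ.

```python
import os
import tempfile

# Assumption: "5GB" is read as 5 * 1024**3 bytes.
SIZE_LIMIT_BYTES = 5 * 1024**3


def files_over_limit(directory: str, limit: int = SIZE_LIMIT_BYTES) -> list:
    """Return sorted names of files directly in `directory` with size >= `limit`."""
    oversized = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.stat().st_size >= limit:
            oversized.append(entry.name)
    return sorted(oversized)


# Demo with tiny files and a tiny limit, so no large file is needed.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "adata.h5ad"), "wb") as f:
    f.write(b"\0" * 2048)
with open(os.path.join(demo_dir, "model.pt"), "wb") as f:
    f.write(b"\0" * 16)

print(files_over_limit(demo_dir, limit=1024))
```

With the default limit, the same call on a real model directory lists the files that would need to live in separate storage.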

There are three possible scenarios:

  1. Your model is not latent-ified and your training data is naturally <5GB. 👉 In this case, both your model and data will be uploaded to the same HF Model.

  2. Your model is latent-ified and your latent training data is <5GB. 👉 In this case, both your model and data will be uploaded to the same HF Model. Optionally, you can provide a link to your full (i.e., non-latent) training data.

  3. Your model is not latent-ified and your training data is >=5GB. 👉 In this case, only your model will be uploaded to the HF Model. If you want to use your training data, you must provide a link to it (the link must be in the required metadata file, and can also appear in the model card). When needed, scvi-tools will automatically download your training data from the link you registered.
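The three scenarios can be summarized as a small decision function. This is a sketch for orientation only: upload_plan is a hypothetical helper, not part of scvi-tools, and "5GB" is again assumed to mean 5 * 1024**3 bytes. The case of latent data at or above the limit is not described above, so the sketch raises an error for it rather than guessing.

```python
# Assumption: "5GB" is read as 5 * 1024**3 bytes.
SIZE_LIMIT_BYTES = 5 * 1024**3


def upload_plan(is_latent: bool, data_size_bytes: int) -> dict:
    """Map the three scenarios to what is uploaded and whether a data URL is needed.

    `data_size_bytes` is the size of the data saved with the model
    (the latent data if the model is latent-ified).
    """
    if data_size_bytes < SIZE_LIMIT_BYTES:
        # Scenarios 1 and 2: model and data go to the same HF model repo.
        # A link to the full data is optional for a latent-ified model;
        # the text does not call for a link in scenario 1.
        return {
            "upload_data": True,
            "training_data_url": "optional" if is_latent else "none",
        }
    if not is_latent:
        # Scenario 3: only the model is uploaded; a data link is required.
        return {"upload_data": False, "training_data_url": "required"}
    raise ValueError("latent data of 5GB or more is not covered by the scenarios")


print(upload_plan(is_latent=False, data_size_bytes=10 * 1024**2))
print(upload_plan(is_latent=True, data_size_bytes=10 * 1024**2))
print(upload_plan(is_latent=False, data_size_bytes=6 * 1024**3))
```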

We highly recommend latent-ifying your model if possible. Please refer to our Latent Mode tutorial [link TBD] for how to do that.

Note

It is always possible to use a dataset other than your training data: you can set model.adata before saving. However, the convention with scvi-hub is to provide access to the training data (raw or in latent form), so that users can reproduce the results of the model and perform their own analyses on the same data.

Model evaluation#

We recommend that you include some evaluation results in your Model Card. One way to do this is by using our scvi-criticism Python package. It provides a simple API to evaluate the goodness of fit of your model and generate various visualizations. Read more about it in the scvi-criticism documentation.