Note

This page was generated from api_overview.ipynb. Interactive online version: Colab badge.

Introduction to scvi-tools

In this introductory tutorial, we go through the different steps of an scvi-tools workflow.

While we focus on scVI in this tutorial, the API is consistent across all models.

[1]:
import sys

#if branch is stable, will install via pypi, else will install from source
branch = "stable"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and branch == "stable":
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@$branch#egg=scvi-tools[tutorials]
[2]:
import scvi
import scanpy as sc

sc.set_figure_params(figsize=(4, 4))

Loading and preparing data

Let us first load a subsampled version of the heart cell atlas dataset described in Litviňuková et al. (2020). scvi-tools has many “built-in” datasets as well as support for loading arbitrary .csv, .loom, and .h5ad (AnnData) files. Please see our tutorial on data loading for more examples.

  • Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., … & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.

Important

All scvi-tools models require AnnData objects as input.

[3]:
adata = scvi.data.heart_cell_atlas_subsampled()
INFO     File data/hca_subsampled_20k.h5ad already downloaded
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 18641 cells, 26662 vars, 1
         batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates
         and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.

Now we preprocess the data to remove, for example, genes that are very lowly expressed and other outliers. For these tasks we prefer the Scanpy preprocessing module.

[4]:
sc.pp.filter_genes(adata, min_counts=3)

In scRNA-seq analysis, it’s popular to normalize the data. These values are not used by scvi-tools, but given their popularity in other tasks as well as for visualization, we store them in the anndata object separately (via the .raw attribute).

Important

Unless otherwise specific, scvi-tools models require the raw counts.

[5]:
adata.layers["counts"] = adata.X.copy() # preserve counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata # freeze the state in `.raw`

Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-tools model. For best practices of how/when to perform feature selection, please refer to the model-specific tutorial. For scVI, we recommend anywhere from 1,000 to 10,000 HVGs, but it will be context-dependent.

[6]:
sc.pp.highly_variable_genes(
    adata,
    n_top_genes=1200,
    subset=True,
    layer="counts",
    flavor="seurat_v3",
    batch_key="cell_source"
)
/home/adam/.pyenv/versions/scvi-tools-dev/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py:144: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True

Now it’s time to run setup_anndata(), which alerts scvi-tools to the locations of various matrices inside the anndata. It’s important to run this function with the correct arguments so scvi-tools is notified that your dataset has batches, annotations, etc. For example, if batches are registered with scvi-tools, the subsequent model will correct for batch effects. See the full documentation for details.

In this dataset, there is a “cell_source” categorical covariate, and within each “cell_source”, multiple “donors”, “gender” and “age_group”. There are also two continuous covariates we’d like to correct for: “percent_mito” and “percent_ribo”. These covariates can be registered using the categorical_covariate_keys argument. If you only have one categorical covariate, you can also use the batch_key argument instead.

[7]:
scvi.data.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"]
)
INFO     No batch_key inputted, assuming all cells are same batch
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.layers["counts"]
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 18641 cells, 1200 vars, 1 batches,
         1 labels, and 0 proteins. Also registered 2 extra categorical covariates and 2 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.

Warning

If the adata is modified after running setup_anndata, please run setup_anndata again.

Creating and training a model

While we highlight the scVI model here, the API is consistent across all scvi-tools models and is inspired by that of scikit-learn. For a full list of options, see the scvi documentation.

[8]:
model = scvi.model.SCVI(adata)

We can see an overview of the model by printing it.

[9]:
model
SCVI Model with the following params:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene,
gene_likelihood: zinb, latent_distribution: normal
Training status: Not Trained

To print summary of associated AnnData, use: scvi.data.view_anndata_setup(model.adata)
[9]:

[10]:
model.train()
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Epoch 400/400: 100%|██████████| 400/400 [06:26<00:00,  1.04it/s, loss=285, v_num=1]

Saving and loading

Saving consists of saving the model neural network weights, as well as parameters used to initialize the model.

[11]:
# model.save("my_model/")
[12]:
# model = scvi.model.SCVI.load("my_model/", adata, use_gpu=True)

Obtaining model outputs

[13]:
latent = model.get_latent_representation()

It’s often useful to store the outputs of scvi-tools back into the original anndata, as it permits interoperability with Scanpy.

[14]:
adata.obsm["X_scVI"] = latent

The model.get...() functions default to using the anndata that was used to initialize the model. It’s possible to also query a subset of the anndata, or even use a completely independent anndata object as long as the anndata is organized in an equivalent fashion.

[15]:
adata_subset = adata[adata.obs.cell_type == "Fibroblast"]
latent_subset = model.get_latent_representation(adata_subset)
INFO     Received view of anndata, making copy.
[16]:
denoised = model.get_normalized_expression(adata_subset, library_size=1e4)
denoised.iloc[:5, :5]
INFO     Received view of anndata, making copy.
[16]:
ISG15 TNFRSF18 VWA1 HES5 SPSB1
GTCAAGTCATGCCACG-1-HCAHeart7702879 1.868347 0.116484 1.172189 0.139826 3.535994
GAGTCATTCTCCGTGT-1-HCAHeart8287128 0.576529 0.054445 1.005461 0.013783 28.139740
CCTCTGATCGTGACAT-1-HCAHeart7702881 0.823483 0.076762 2.363107 0.002883 2.497471
CGCCATTCATCATCTT-1-H0035_apex 0.108322 0.062542 1.287478 0.008230 7.921440
TCGTAGAGTAGGACTG-1-H0015_septum 0.101305 0.006768 0.861312 0.190125 17.437141

Let’s store the normalized values back in the anndata.

[17]:
adata.layers["scvi_normalized"] = model.get_normalized_expression(
    library_size=10e4
)

Interoperability with Scanpy

Scanpy is a powerful python library for visualization and downstream analysis of scRNA-seq data. We show here how to feed the objects produced by scvi-tools into a scanpy workflow.

[18]:
# use scVI latent space for UMAP generation
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata, min_dist=0.3)
[19]:
sc.pl.umap(
    adata,
    color=["cell_type"],
    frameon=False,
)
../../_images/user_guide_notebooks_api_overview_36_0.png

We can see that scVI was able to correct for nuisance variation due to nuclei/whole cell, age group, and donor, while maintaining separation of cell types.

[20]:
sc.pl.umap(
    adata,
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)
../../_images/user_guide_notebooks_api_overview_38_0.png

The user will note that we imported curated labels from the original publication. Our interface with scanpy makes it easy to cluster the data with scanpy from scVI’s latent space and then reinject them into scVI (e.g., for differential expression).

[21]:
# neighbors were already computed using scVI
sc.tl.leiden(adata, key_added="leiden_scVI", resolution=0.5)
[22]:
sc.pl.umap(
    adata,
    color=["leiden_scVI"],
    frameon=False,
)
../../_images/user_guide_notebooks_api_overview_41_0.png

Differential expression

We can also use many scvi-tools models for differential expression. For further details on the methods underlying these functions as well as additional options, please see TODO.

[23]:
adata.obs.cell_type.head()
[23]:
AACTCCCCACGAGAGT-1-HCAHeart7844001                      Myeloid
ATAACGCAGAGCTGGT-1-HCAHeart7829979    Ventricular_Cardiomyocyte
GTCAAGTCATGCCACG-1-HCAHeart7702879                   Fibroblast
GGTGATTCAAATGAGT-1-HCAHeart8102858                  Endothelial
AGAGAATTCTTAGCAG-1-HCAHeart8102863                  Endothelial
Name: cell_type, dtype: category
Categories (11, object): ['Adipocytes', 'Atrial_Cardiomyocyte', 'Endothelial', 'Fibroblast', ..., 'Neuronal', 'Pericytes', 'Smooth_muscle_cells', 'Ventricular_Cardiomyocyte']

For example, a 1-vs-1 DE test is as simple as:

[24]:
de_df = model.differential_expression(
    groupby="cell_type",
    group1="Endothelial",
    group2="Fibroblast"
)
de_df.head()
DE...: 100%|██████████| 1/1 [00:01<00:00,  1.46s/it]
[24]:
proba_de proba_not_de bayes_factor scale1 scale2 lfc_mean lfc_median lfc_std lfc_min lfc_max raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison
SOX17 0.9992 0.0008 7.130086 0.001604 0.000038 5.806608 5.770064 2.001048 -1.559964 15.031383 0.784371 0.006541 0.307617 0.004497 17.128170 0.185868 True Endothelial vs Fibroblast
COL6A3 0.9990 0.0010 6.906745 0.000176 0.005022 -5.281342 -5.352539 1.524231 -9.799274 1.640718 0.026284 1.228131 0.021903 0.498365 1.195808 54.193195 True Endothelial vs Fibroblast
ABCA8 0.9990 0.0010 6.906745 0.000346 0.018645 -6.695785 -6.894620 2.154335 -12.788260 3.045361 0.036505 3.181145 0.018496 0.768602 1.501560 183.288528 True Endothelial vs Fibroblast
SLC9A3R2 0.9986 0.0014 6.569875 0.012278 0.000218 5.871807 5.956819 1.775999 -7.184036 13.259937 4.451492 0.045380 0.712339 0.034342 111.582703 1.657324 True Endothelial vs Fibroblast
ABCA10 0.9982 0.0018 6.318161 0.000070 0.006311 -8.407307 -8.783027 2.756606 -16.415844 4.522050 0.014359 1.282508 0.007788 0.499591 0.501825 69.613876 True Endothelial vs Fibroblast

We can also do a 1-vs-all DE test, which compares each cell type with the rest of the dataset:

[25]:
de_df = model.differential_expression(
    groupby="cell_type",
)
de_df.head()
DE...: 100%|██████████| 11/11 [00:16<00:00,  1.47s/it]
[25]:
proba_de proba_not_de bayes_factor scale1 scale2 lfc_mean lfc_median lfc_std lfc_min lfc_max raw_mean1 raw_mean2 non_zeros_proportion1 non_zeros_proportion2 raw_normalized_mean1 raw_normalized_mean2 is_de_fdr_0.05 comparison
GPAM 0.9992 0.0008 7.130086 0.025509 0.000208 7.312753 7.323672 2.341895 -0.780353 14.880513 17.372416 0.035791 0.896552 0.031520 280.340485 1.565905 True Adipocytes vs Rest
DGAT2 0.9992 0.0008 7.130086 0.003877 0.000038 7.254048 7.316399 2.514551 -3.225917 16.004608 2.682757 0.005028 0.593103 0.004866 43.613537 0.194096 True Adipocytes vs Rest
ADIPOQ 0.9992 0.0008 7.130086 0.003762 0.000044 8.397635 7.871881 3.795881 -2.219330 23.430929 2.324136 0.003622 0.593103 0.003352 33.361748 0.217763 True Adipocytes vs Rest
PLIN1 0.9988 0.0012 6.724225 0.004689 0.000050 7.763868 7.760591 2.750206 -3.084210 18.667627 2.799999 0.004379 0.806897 0.004325 52.921761 0.196486 True Adipocytes vs Rest
CIDEC 0.9988 0.0012 6.724225 0.002012 0.000027 6.972395 6.969475 2.482747 -1.806476 18.552200 1.137931 0.001406 0.510345 0.001406 21.809771 0.057564 True Adipocytes vs Rest

We now extract top markers for each cluster using the DE results.

[26]:
markers = {}
cats = adata.obs.cell_type.cat.categories
for i, c in enumerate(cats):
    cid = "{} vs Rest".format(c)
    cell_type_df = de_df.loc[de_df.comparison == cid]

    cell_type_df = cell_type_df[cell_type_df.lfc_mean > 0]

    cell_type_df = cell_type_df[cell_type_df["bayes_factor"] > 3]
    cell_type_df = cell_type_df[cell_type_df["non_zeros_proportion1"] > 0.1]

    markers[c] = cell_type_df.index.tolist()[:3]
[27]:
sc.tl.dendrogram(adata, groupby="cell_type", use_rep="X_scVI")
[28]:
sc.pl.dotplot(
    adata,
    markers,
    groupby='cell_type',
    dendrogram=True,
    color_map="Blues",
    swap_axes=True,
    use_raw=True,
    standard_scale="var",
)
../../_images/user_guide_notebooks_api_overview_51_0.png

We can also visualize the scVI normalized gene expression values with the layer option.

[29]:
sc.pl.heatmap(
    adata,
    markers,
    groupby='cell_type',
    layer="scvi_normalized",
    standard_scale="var",
    dendrogram=True,
    figsize=(8, 12)
)
../../_images/user_guide_notebooks_api_overview_53_0.png

Logging information

Verbosity varies in the following way:

  • logger.setLevel(logging.WARNING) will show a progress bar.

  • logger.setLevel(logging.INFO) will show global logs including the number of jobs done.

  • logger.setLevel(logging.DEBUG) will show detailed logs for each training (e.g the parameters tested).

This function’s behaviour can be customized, please refer to its documentation for information about the different parameters available.

In general, you can use scvi.settings.verbosity to set the verbosity of the scvi package. Note that verbosity corresponds to the logging levels of the standard python logging module. By default, that verbosity level is set to INFO (=20). As a reminder the logging levels are:

Level

Numeric value

CRITICAL

50

ERROR

40

WARNING

30

INFO

20

DEBUG

10

NOTSET

0