Note
This page was generated from data_loading.ipynb. Interactive online version: . Some tutorial content may look better in light mode.
Data loading and preparation#
Here we walk through the necessary steps to get your data into ready for scvi-tools.
[1]:
!pip install --quiet scvi-colab
from scvi_colab import install
install()
[2]:
import scvi
import scanpy as sc
Global seed set to 0
/Users/jhong/miniforge3/envs/scvi-env/lib/python3.9/site-packages/docrep/decorators.py:43: SyntaxWarning: 'param_size_factor_key' is not a valid key!
doc = func(self, args[0].__doc__, *args[1:], **kwargs)
Loading data#
scvi-tools supports the AnnData data format, which also underlies Scanpy. AnnData is quite similar to other popular single cell objects like that of Seurat and SingleCellExperiment. In particular, it allows cell-level and feature-level metadata to coexist in the same data structure as the molecular counts.
It’s also now possible to automatically convert these R-based objects to AnnData within a Jupyter notebook. See the following tutorial for more information.
scvi-tools has a number of convenience methods for loading data from .csv
, .loom
, and .h5ad
formats. To load outputs from Cell Ranger, please use Scanpy’s reading functionality.
Let us now download an AnnData object (.h5ad
format) and load it using scvi-tools.
PBMC3k#
[8]:
!wget 'http://falexwolf.de/data/pbmc3k_raw.h5ad'
--2022-02-18 12:12:58-- http://falexwolf.de/data/pbmc3k_raw.h5ad
Resolving falexwolf.de (falexwolf.de)... 85.13.135.70
Connecting to falexwolf.de (falexwolf.de)|85.13.135.70|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://falexwolf.de/data/pbmc3k_raw.h5ad [following]
--2022-02-18 12:12:58-- https://falexwolf.de/data/pbmc3k_raw.h5ad
Connecting to falexwolf.de (falexwolf.de)|85.13.135.70|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://falexwolf.me/data/pbmc3k_raw.h5ad [following]
--2022-02-18 12:12:59-- https://falexwolf.me/data/pbmc3k_raw.h5ad
Resolving falexwolf.me (falexwolf.me)... 75.2.60.5
Connecting to falexwolf.me (falexwolf.me)|75.2.60.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5855727 (5.6M) [application/octet-stream]
Saving to: ‘pbmc3k_raw.h5ad’
pbmc3k_raw.h5ad 100%[===================>] 5.58M 10.6MB/s in 0.5s
2022-02-18 12:13:00 (10.6 MB/s) - ‘pbmc3k_raw.h5ad’ saved [5855727/5855727]
[9]:
pbmc3k = scvi.data.read_h5ad("pbmc3k_raw.h5ad")
[10]:
pbmc3k
[10]:
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
This is a fairly simple object, it just contains the count data and the ENSEMBL ids for the genes.
[11]:
pbmc3k.var.head()
[11]:
gene_ids | |
---|---|
index | |
MIR1302-10 | ENSG00000243485 |
FAM138A | ENSG00000237613 |
OR4F5 | ENSG00000186092 |
RP11-34P13.7 | ENSG00000238009 |
RP11-34P13.8 | ENSG00000239945 |
PBMC5k#
As another example, let’s download a dataset from 10x Genomics. This data was obtained from a CITE-seq experiment, so it also contains protein count data.
[12]:
!wget https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5
--2022-02-18 12:13:01-- https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 104.18.1.173, 104.18.0.173
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.1.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17129253 (16M) [binary/octet-stream]
Saving to: ‘5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5’
5k_pbmc_protein_v3_ 100%[===================>] 16.33M 9.91MB/s in 1.6s
2022-02-18 12:13:03 (9.91 MB/s) - ‘5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5’ saved [17129253/17129253]
[13]:
pbmc5k = sc.read_10x_h5(
"5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5",
gex_only=False
)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
It’s often helpful to give the gene names unique names.
[14]:
pbmc5k.var_names_make_unique()
We can see that adata.X
contains the concatenated gene and protein expression data.
[15]:
pbmc5k.var.feature_types.astype("category").cat.categories
[15]:
Index(['Antibody Capture', 'Gene Expression'], dtype='object')
We can use scvi-tools to organize this object, which places the protein expression in adata.obms["protein_expression]
.
[16]:
scvi.data.organize_cite_seq_10x(pbmc5k)
[17]:
pbmc5k
[17]:
AnnData object with n_obs × n_vars = 5247 × 33538
var: 'gene_ids', 'feature_types', 'genome'
obsm: 'protein_expression'
Concatenate the datasets#
[18]:
adata = pbmc5k.concatenate(pbmc3k)
Notice that the resulting AnnData has a batch key in .obs
.
[19]:
adata.obs.sample(n=5)
[19]:
batch | |
---|---|
GTAGCTAAGTTCATGC-1-0 | 0 |
CACTGAAAGACGGTCA-1-0 | 0 |
AAGGAATCAGTCGCAC-1-0 | 0 |
CCACTTGAGCGTTACT-1-0 | 0 |
GAGGGATGGGAAAT-1-1 | 1 |
Preprocessing the data#
It is common to remove outliers, and even perform feature selection before model fitting. We prefer the Scanpy preprocessing module at this stage.
[20]:
sc.pp.filter_genes(adata, min_counts=3)
sc.pp.filter_cells(adata, min_counts=3)
As it is popular to normalize the data for many methods, we can use Scanpy for this; however, it’s important to keep the count information intact for scvi-tools models.
[21]:
adata.layers["counts"] = adata.X.copy()
Now we can proceed with common normalization methods.
[22]:
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
We can store the normalized values in .raw
to keep them safe in the event the anndata gets subsetted feature-wise.
[23]:
adata.raw = adata
Register the data with scvi-tools#
Now that we have an AnnData object, we need to alert scvi-tools of all the interesting data in our object. For example, now that we have batches in our AnnData, we can alert the models that we’d like to perform batch correction. Also, because we have the count data in a layer, we can use the layer
argument.
Normally, we set up the data right before using a model, thus we would call the setup_anndata
method specific to that model. However, we are not using any particular model here since we are just demonstrating data usage and handling in this tutorial. We will use the SCVI model’s setup_anndata
method here and in what follows for sake of example.
Basic case#
[24]:
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
Notice the info messages notify us that batches were detected in the data. Just to demonstrate what happens if we don’t include this option:
[25]:
scvi.model.SCVI.setup_anndata(adata, layer="counts")
Now integration-based tasks can no longer be performed in subsequent models because there is no knowledge of such information.
CITE-seq case#
As PBMC5k is a CITE-seq dataset, we can use scvi-tools to register the protein expression. Note that totalVI is the only current model that uses the protein expression. The usage of registered items is model specific. As another example, registering the labels in the AnnData object will not affect totalVI or scVI, but is necessary to run scANVI.
We have not preprocessed the pbmc5k
object, which we do recommend. We show how to run setup_anndata
in this case for illustrative purposes.
[26]:
scvi.model.TOTALVI.setup_anndata(pbmc5k, protein_expression_obsm_key="protein_expression")
INFO Using column names from columns of adata.obsm['protein_expression']
Warning
After setup_anndata
has been run, the adata object should not be modified. In other words, the very next step in the workflow is to initialize and train the model of interest (e.g., scVI, totalVI). If you do modify the adata, it’s ok, just run setup_anndata
again – and then reinitialize the model.
Viewing the scvi-tools data setup#
[27]:
model = scvi.model.TOTALVI(pbmc5k)
model.view_anndata_setup(pbmc5k)
INFO Computing empirical prior initialization for protein background.
Anndata setup with scvi-tools version 0.15.0a0.
Summary Statistics ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ ┃ Summary Stat Key ┃ Value ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │ n_cells │ 5247 │ │ n_vars │ 33538 │ │ n_labels │ 1 │ │ n_batch │ 1 │ │ n_extra_categorical_covs │ 0 │ │ n_extra_continuous_covs │ 0 │ │ n_proteins │ 32 │ └──────────────────────────┴───────┘
Data Registry ┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Registry Key ┃ scvi-tools Location ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ X │ adata.X │ │ labels │ adata.obs['_scvi_labels'] │ │ batch │ adata.obs['_scvi_batch'] │ │ proteins │ adata.obsm['protein_expression'] │ └──────────────┴──────────────────────────────────┘
labels State Registry ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ adata.obs['_scvi_labels'] │ 0 │ 0 │ └───────────────────────────┴────────────┴─────────────────────┘
batch State Registry ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ adata.obs['_scvi_batch'] │ 0 │ 0 │ └──────────────────────────┴────────────┴─────────────────────┘