2. Data Loading Tutorial

[1]:
# Below is code that helps us test the notebooks
# when not testing, we make the save_path a directory called data
# in the scVI root directory, feel free to move anywhere
[2]:
def allow_notebook_for_test():
    print("Testing the data loading notebook")

test_mode = False
save_path = "data/"

# Feel free to move this to any convenient location
if not test_mode:
    save_path = "../../data"
[3]:
from scvi.dataset import LoomDataset, CsvDataset, Dataset10X, DownloadableAnnDataset
import urllib.request
import os
from scvi.dataset import BrainLargeDataset, CortexDataset, PbmcDataset, RetinaDataset, HematoDataset, CbmcDataset, BrainSmallDataset, SmfishDataset
[2019-10-01 09:07:08,503] INFO - scvi._settings | Added StreamHandler with custom formatter to 'scvi' logger.

2.1. Generic Datasets

scvi supports dataset loading for the following three generic file formats: * .loom files * .csv files * .h5ad files * datasets processed with Cell Ranger (or from 10x website)

Most of the dataset loading instances implemented in scvi use a positional argument filename and an optional argument save_path (value by default: data/). Files will be downloaded or searched for at the location os.path.join(save_path, filename), make sure this path is valid when you specify the arguments.

scVI can now also handle 10x datasets with CITE-seq protein measurements (shown below).

2.1.1. Loading a .loom file

Any .loom file can be loaded with initializing LoomDataset with filename.

Optional parameters: * save_path: save path (default to be data/) of the file * url: url the dataset if the file needs to be downloaded from the web * new_n_genes: the number of subsampling genes - set it to be False to turn off subsampling * subset_genes: a list of gene names for subsampling

[4]:
# Loading a remote dataset
remote_loom_dataset = LoomDataset("osmFISH_SScortex_mouse_all_cell.loom",
                                  save_path=save_path,
                                  url='http://linnarssonlab.org/osmFISH/osmFISH_SScortex_mouse_all_cells.loom')
[2019-10-01 09:07:10,167] INFO - scvi.dataset.dataset | Downloading file at /Users/adamgayoso/Google Drive/Berkeley/Software/scVI/data/osmFISH_SScortex_mouse_all_cell.loom
[2019-10-01 09:07:10,244] INFO - scvi.dataset.loom | Preprocessing dataset
[2019-10-01 09:07:10,271] WARNING - scvi.dataset.loom | Removing non-expressing cells
[2019-10-01 09:07:10,389] INFO - scvi.dataset.loom | Finished preprocessing dataset
[2019-10-01 09:07:10,392] WARNING - scvi.dataset.dataset | X is a protected attribute and cannot be set with this name in initialize_cell_attribute, changing name to X_cell and setting
[2019-10-01 09:07:10,393] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:07:10,395] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[5]:
# Loading a local dataset
local_loom_dataset = LoomDataset("osmFISH_SScortex_mouse_all_cell.loom",
                                 save_path=save_path)
[2019-10-01 09:07:11,038] INFO - scvi.dataset.loom | Preprocessing dataset
[2019-10-01 09:07:11,070] WARNING - scvi.dataset.loom | Removing non-expressing cells
[2019-10-01 09:07:11,193] INFO - scvi.dataset.loom | Finished preprocessing dataset
[2019-10-01 09:07:11,197] WARNING - scvi.dataset.dataset | X is a protected attribute and cannot be set with this name in initialize_cell_attribute, changing name to X_cell and setting
[2019-10-01 09:07:11,198] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:07:11,200] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]

2.1.2. Loading a .csv file

Any .csv file can be loaded with initializing CsvDataset with filename.

Optional parameters: * save_path: save path (default to be data/) of the file * url: url of the dataset if the file needs to be downloaded from the web * compression: set compression as .gz, .bz2, .zip, or .xz to load a zipped csv file * new_n_genes: the number of subsampling genes - set it to be None to turn off subsampling * subset_genes: a list of gene names for subsampling

Note: CsvDataset currently only supoorts .csv files that are genes by cells.

If the dataset has already been downloaded at the location save_path, it will not be downloaded again.

[6]:
# Loading a remote dataset
remote_csv_dataset = CsvDataset("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz",
                                save_path=save_path,
                                new_n_genes=600,
                                compression='gzip',
                                url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE100866&format=file&file=GSE100866%5FCBMC%5F8K%5F13AB%5F10X%2DRNA%5Fumi%2Ecsv%2Egz")
[2019-10-01 09:07:39,627] INFO - scvi.dataset.dataset | Downloading file at /Users/adamgayoso/Google Drive/Berkeley/Software/scVI/data/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz
[2019-10-01 09:07:41,457] INFO - scvi.dataset.csv | Preprocessing dataset
[2019-10-01 09:09:46,750] INFO - scvi.dataset.csv | Finished preprocessing dataset
[2019-10-01 09:09:50,655] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:09:50,659] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2019-10-01 09:09:51,992] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:09:53,389] INFO - scvi.dataset.dataset | Downsampled from 8617 to 8617 cells
[2019-10-01 09:10:16,073] INFO - scvi.dataset.dataset | Downsampling from 36280 to 600 genes
[2019-10-01 09:10:16,406] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:10:16,453] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2019-10-01 09:10:16,469] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:10:16,477] INFO - scvi.dataset.dataset | Downsampled from 8617 to 8617 cells
[7]:
# Loading a local dataset
local_csv_dataset = CsvDataset("GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz",
                               save_path=save_path,
                               new_n_genes=600,
                               compression='gzip')
[2019-10-01 09:10:16,547] INFO - scvi.dataset.csv | Preprocessing dataset
[2019-10-01 09:11:43,587] INFO - scvi.dataset.csv | Finished preprocessing dataset
[2019-10-01 09:11:45,660] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:11:45,662] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2019-10-01 09:11:46,563] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:11:47,483] INFO - scvi.dataset.dataset | Downsampled from 8617 to 8617 cells
[2019-10-01 09:12:05,327] INFO - scvi.dataset.dataset | Downsampling from 36280 to 600 genes
[2019-10-01 09:12:05,929] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:12:05,976] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2019-10-01 09:12:05,995] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:12:06,005] INFO - scvi.dataset.dataset | Downsampled from 8617 to 8617 cells

2.1.3. Loading a .h5ad file

AnnData objects can be stored in .h5ad format. Any .h5ad file can be loaded with initializing DownloadableAnnDataset with filename.

Optional parameters: * save_path: save path (default to be data/) of the file * url: url the dataset if the file needs to be downloaded from the web

[8]:
# Loading a remote anndata dataset
remote_ann_dataset = DownloadableAnnDataset(
    "TM_droplet_mat.h5ad",
    save_path=save_path,
    url='https://github.com/YosefLab/scVI/blob/master/tests/data/TM_droplet_mat.h5ad?raw=true'
)

# Loading a local anndata dataset (output not shown)
# import anndata
# anndataset = anndata.read(save_path + "TM_droplet_mat.h5ad")
# dataset = AnnDatasetFromAnnData(ad = anndataset)
[2019-10-01 09:12:06,873] INFO - scvi.dataset.dataset | Downloading file at /Users/adamgayoso/Google Drive/Berkeley/Software/scVI/data/TM_droplet_mat.h5ad
[2019-10-01 09:12:06,919] INFO - scvi.dataset.anndataset | Dense size under 1Gb, casting to dense format (np.ndarray).
[2019-10-01 09:12:06,921] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:12:06,923] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2019-10-01 09:12:06,924] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:12:06,926] INFO - scvi.dataset.dataset | Downsampled from 47 to 47 cells

2.1.4. Loading outputs from Cell Ranger (10x Genomics)

scVI can download datasets from the 10x website, or load a local directory with outputs generated by Cell Ranger.

10x has published several datasets on their website. Initialize Dataset10X by passing in the dataset name of one of the following datasets that scvi currently supports. Please check dataset10x.py for a full list of supported datasets. If the dataset has already been downloaded at the location save_path, it will not be downloaded again.

The batch indices after the 10x barcode (e.g., AAAAAA-1) is automatically added to the batch_indices dataset attribute.

Optional parameters: * save_path: save path (default to be data/) of the file * type: set type (default to be filtered) to be filtered or raw to choose one from the two datasets that’s available on 10X * dense: bool of whether to load as a dense matrix (scVI can be faster since it doesn’t have to densify minibatches). We recommend setting this to True (not default). * measurement_names_column column in which to find measurement names in the corresponding .tsv file.

2.1.4.1. Downloading 10x data

pbmc_10k_protein_v3 is the name of a publicly available dataset from 10x.

[9]:
tenX_dataset = Dataset10X("pbmc_10k_protein_v3", save_path=os.path.join(save_path, "10X"), measurement_names_column=1)
[2019-10-01 09:12:06,970] INFO - scvi.dataset.dataset | Downloading file at /Users/adamgayoso/Google Drive/Berkeley/Software/scVI/data/10X/pbmc_10k_protein_v3/filtered_feature_bc_matrix.tar.gz
[2019-10-01 09:12:08,414] INFO - scvi.dataset.dataset10X | Preprocessing dataset
[2019-10-01 09:12:08,417] INFO - scvi.dataset.dataset10X | Extracting tar file
[2019-10-01 09:12:41,192] INFO - scvi.dataset.dataset10X | Finished preprocessing dataset
[2019-10-01 09:12:41,346] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:12:41,348] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2019-10-01 09:12:41,387] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:12:41,484] INFO - scvi.dataset.dataset | Downsampled from 7865 to 7865 cells

2.1.4.2. Loading local 10x data

It is also possible to create a Dataset object from 10X data saved locally. Here scVI assumes that in the directory defined by the save_path there exists another directory with the matrix, features/genes and barcodes files. As long as this directory was directly outputted by Cell Ranger and no other files were added, this function will work. If you are struggling to use this function, you may want to load your data using scanpy and save an AnnData object.

In the example below, inside the save path a directory named filtered_feature_bc_matrix exists containing ONLY the files barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz. The name and compression of these files may depend on the version of Cell Ranger and scVI will adapt accordingly.

[10]:
local_10X_dataset = Dataset10X(
    save_path=os.path.join(save_path, "10X/pbmc_10k_protein_v3/"),
    measurement_names_column=1,
)
[2019-10-01 09:12:41,522] INFO - scvi.dataset.dataset10X | Preprocessing dataset
[2019-10-01 09:13:13,262] INFO - scvi.dataset.dataset10X | Finished preprocessing dataset
[2019-10-01 09:13:13,403] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2019-10-01 09:13:13,404] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2019-10-01 09:13:13,443] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:13:13,557] INFO - scvi.dataset.dataset | Downsampled from 7865 to 7865 cells

2.2. Exploring a dataset object

A GeneExpressionDataset has cell_attributes, gene_attributes, and cell_categorical_attributes.

The pbmc10k_protein_v3 from 10X Genomics also has CITE-seq protein data, which we show how to access.

[11]:
print(tenX_dataset)
GeneExpressionDataset object with n_cells x nb_genes = 7865 x 33538
    gene_attribute_names: 'gene_names'
    cell_attribute_names: 'barcodes', 'local_vars', 'batch_indices', 'labels', 'protein_expression', 'local_means'
    cell_categorical_attribute_names: 'labels', 'batch_indices'
    cell_measurements_col_mappings: {'protein_expression': 'protein_names'}
[12]:
tenX_dataset.gene_names
[12]:
array(['MIR1302-2HG', 'FAM138A', 'OR4F5', ..., 'AC240274.1', 'AC213203.1',
       'FAM231C'], dtype='<U64')
[13]:
tenX_dataset.protein_expression
[13]:
array([[  18,  138,   13, ...,    5,    2,    3],
       [  30,  119,   19, ...,    4,    8,    3],
       [  18,  207,   10, ...,   12,   19,    6],
       ...,
       [  39,  249,   10, ...,    8,   12,    2],
       [ 914, 2240,   16, ...,    3,    7,    1],
       [1885, 2788,   25, ...,    7,    3,    2]], dtype=int64)
[14]:
tenX_dataset.protein_names
[14]:
array(['CD3_TotalSeqB', 'CD4_TotalSeqB', 'CD8a_TotalSeqB',
       'CD14_TotalSeqB', 'CD15_TotalSeqB', 'CD16_TotalSeqB',
       'CD56_TotalSeqB', 'CD19_TotalSeqB', 'CD25_TotalSeqB',
       'CD45RA_TotalSeqB', 'CD45RO_TotalSeqB', 'PD-1_TotalSeqB',
       'TIGIT_TotalSeqB', 'CD127_TotalSeqB', 'IgG2a_control_TotalSeqB',
       'IgG1_control_TotalSeqB', 'IgG2b_control_TotalSeqB'], dtype=object)
[15]:
tenX_dataset.barcodes
[15]:
array(['AAACCCAAGATTGTGA-1', 'AAACCCACATCGGTTA-1', 'AAACCCAGTACCGCGT-1',
       ..., 'TTTGTTGGTTGCGGCT-1', 'TTTGTTGTCGAGTGAG-1',
       'TTTGTTGTCGTTCAGA-1'], dtype='<U64')

2.3. Subsetting a dataset object

At the core, we have two methods GeneExpressionDataset.update_cells(subset) and GeneExpressionDataset.update_genes(subset).

The subset should be defined as an np.ndarray of either int’s with arbitrary shape which values are the indexes of the cells/genes to keep or boolean array used as a mask-like index. When subsetting, all gene and cell attributes are also updated.

These methods can be used directly, but we also have helpful wrappers around them. For example:

[16]:
# Take the top 3000 genes by variance across cells
tenX_dataset.subsample_genes(new_n_genes=3000)
[2019-10-01 09:15:35,060] INFO - scvi.dataset.dataset | Downsampling from 33538 to 3000 genes
[2019-10-01 09:15:35,172] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:15:35,207] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2019-10-01 09:15:35,245] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:15:35,282] INFO - scvi.dataset.dataset | Downsampled from 7865 to 7865 cells
[17]:
# Retain cells with >= 1200 UMI counts
tenX_dataset.filter_cells_by_count(1200)
[2019-10-01 09:16:05,710] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:16:05,746] INFO - scvi.dataset.dataset | Downsampled from 7865 to 7661 cells
[19]:
# Retain genes that start with letter "A"
retain = []
for i, g in enumerate(tenX_dataset.gene_names):
    if g.startswith("A"):
        retain.append(i)
tenX_dataset.subsample_genes(subset_genes=retain)
[2019-10-01 09:19:35,978] INFO - scvi.dataset.dataset | Downsampling from 3000 to 215 genes
[2019-10-01 09:19:36,024] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:19:36,032] WARNING - scvi.dataset.dataset | This dataset has some empty cells, this might fail scVI inference.Data should be filtered with `my_dataset.filter_cells_by_count()
[2019-10-01 09:19:36,049] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2019-10-01 09:19:36,055] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2019-10-01 09:19:36,059] INFO - scvi.dataset.dataset | Downsampled from 7661 to 7660 cells

2.4. Built-In Datasets

We’ve also implemented seven built-in datasets to make it easier to reproduce results from the scVI paper.

  • PBMC: 12,039 human peripheral blood mononuclear cells profiled with 10x;

  • RETINA: 27,499 mouse retinal bipolar neurons, profiled in two batches using the Drop-Seq technology;

  • HEMATO: 4,016 cells from two batches that were profiled using in-drop;

  • CBMC: 8,617 cord blood mononuclear cells profiled using 10x along with, for each cell, 13 well-characterized mononuclear antibodies;

  • BRAIN SMALL: 9,128 mouse brain cells profiled using 10x.

  • BRAIN LARGE: 1.3 million mouse brain cells profiled using 10x;

  • CORTEX: 3,005 mouse Cortex cells profiled using the Smart-seq2 protocol, with the addition of UMI

  • SMFISH: 4,462 mouse Cortex cells profiled using the osmFISH protocol

  • DROPSEQ: 71,639 mouse Cortex cells profiled using the Drop-Seq technology

  • STARMAP: 3,722 mouse Cortex cells profiled using the STARmap technology

We do not show the outputs of running the commands below

2.4.1. Loading STARMAP dataset

StarmapDataset consists of 3722 cells profiled in 3 batches. The cells come with spatial coordinates of their location inside the tissue from which they were extracted and cell type labels retrieved by the authors ofthe original publication.

Reference: X.Wang et al., Science10.1126/science.aat5691 (2018)

2.4.2. Loading DROPSEQ dataset

DropseqDataset consists of 71,639 mouse Cortex cells profiled using the Drop-Seq technology. To facilitate comparison with other methods we use a random filtered set of 15000 cells and then keep only a filtered set of 6000 highly variable genes. Cells have cell type annotaions and even sub-cell type annotations inferred by the authors of the original publication.

Reference: https://www.biorxiv.org/content/biorxiv/early/2018/04/10/299081.full.pdf

2.4.3. Loading SMFISH dataset

SmfishDataset consists of 4,462 mouse cortex cells profiled using the OsmFISH protocol. The cells come with spatial coordinates of their location inside the tissue from which they were extracted and cell type labels retrieved by the authors of the original publication.

Reference: Simone Codeluppi, Lars E Borm, Amit Zeisel, Gioele La Manno, Josina A van Lunteren, Camilla I Svensson, and Sten Linnarsson. Spatial organization of the somatosensory cortex revealed by cyclic smFISH. bioRxiv, 2018.

[20]:
smfish_dataset = SmfishDataset(save_path=save_path)

2.4.4. Loading BRAIN-LARGE dataset

Loading BRAIN-LARGE requires at least 32 GB memory!

BrainLargeDataset consists of 1.3 million mouse brain cells, spanning the cortex, hippocampus and subventricular zone, and profiled with 10x chromium. We use this dataset to demonstrate the scalability of scVI. It can be used to demonstrate the scalability of scVI.

Reference: 10x genomics (2017). URL https://support.10xgenomics.com/single-cell-gene-expression/datasets.

[21]:
brain_large_dataset = BrainLargeDataset(save_path=save_path)

2.4.5. Loading CORTEX dataset

CortexDataset consists of 3,005 mouse cortex cells profiled with the Smart-seq2 protocol, with the addition of UMI. To facilitate com- parison with other methods, we use a filtered set of 558 highly variable genes. The CortexDataset exhibits a clear high-level subpopulation struc- ture, which has been inferred by the authors of the original publication using computational tools and annotated by inspection of specific genes or transcriptional programs. Similar levels of annotation are provided with the PbmcDataset and RetinaDataset.

Reference: Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq. Science 347, 1138–1142 (2015).

[22]:
cortex_dataset = CortexDataset(save_path=save_path, total_genes=558)

2.4.6. Loading PBMC dataset

PbmcDataset consists of 12,039 human peripheral blood mononu- clear cells profiled with 10x.

Reference: Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049 (2017).

[23]:
pbmc_dataset = PbmcDataset(save_path=save_path, save_path_10X=os.path.join(save_path, "10X"))

2.4.7. Loading RETINA dataset

RetinaDataset includes 27,499 mouse retinal bipolar neu- rons, profiled in two batches using the Drop-Seq technology.

Reference: Shekhar, K. et al. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell 166, 1308–1323.e30 (2017).

[24]:
retina_dataset = RetinaDataset(save_path=save_path)

2.4.8. Loading HEMATO dataset

HematoDataset includes 4,016 cells from two batches that were profiled using in-drop. This data provides a snapshot of hematopoietic progenitor cells differentiating into various lineages. We use this dataset as an example for cases where gene expression varies in a continuous fashion (along pseudo-temporal axes) rather than forming discrete subpopulations.

Reference: Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).

[25]:
hemato_dataset = HematoDataset(save_path=os.path.join(save_path, 'HEMATO/'))

2.4.9. Loading CBMC dataset

CbmcDataset includes 8,617 cord blood mononuclear cells pro- filed using 10x along with, for each cell, 13 well-characterized mononuclear antibodies. We used this dataset to analyze how the latent spaces inferred by dimensionality-reduction algorithms summarize protein marker abundance.

Reference: Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14, 865–868 (2017).

[26]:
cbmc_dataset = CbmcDataset(save_path=os.path.join(save_path, "citeSeq/"))

2.4.10. Loading BRAIN-SMALL dataset

BrainSmallDataset consists of 9,128 mouse brain cells profiled using 10x. This dataset is used as a complement to PBMC for our study of zero abundance and quality control metrics correlation with our generative posterior parameters.

Reference:

[27]:
brain_small_dataset = BrainSmallDataset(save_path=save_path, save_path_10X=os.path.join(save_path, "10X"))