GeneExpressionDataset

class scvi.dataset.GeneExpressionDataset

Bases: torch.utils.data.dataset.Dataset

Generic class representing RNA counts and annotation information.

This class is scVI’s base dataset class. It gives access to several standard attributes: counts, number of cells, number of genes, etc. More importantly, it implements gene-based and cell-based filtering methods. It also allows the storage of cell and gene annotation information, as well as mappings from these annotation attributes to unique identifiers. In order to propagate the filtering behaviour correctly through the relevant attributes, they are kept in registries (cell, gene, mappings) which are iterated through upon any filtering operation.

Note that the constructor merely instantiates a GeneExpressionDataset object. It should be used in combination with one of the populating methods:

populate_from_data: to populate using a (nb_cells, nb_genes) matrix.

populate_from_per_batch_array: to populate using a (n_batches, nb_cells, nb_genes) matrix.

populate_from_per_batch_list: to populate using an n_batches-long list of (nb_cells, nb_genes) matrices.

populate_from_datasets: to populate using multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).

Attributes Summary

`X`
`batch_indices`	rtype `ndarray`
`corrupted_X`	Returns the corrupted version of X.
`labels`	rtype `ndarray`
`nb_cells`	rtype `int`
`nb_genes`	rtype `int`
`norm_X`	Returns a normalized version of X.

Methods Summary

`cell_types_to_labels`(cell_types)	Forms an `np.ndarray` of labels in one-to-one correspondence with the specified `cell_types`.
`collate_fn_base`(attributes_and_types, batch)	Given indices and attributes to batch, returns a full batch of `torch.Tensor`.
`collate_fn_builder`([…])	Returns a collate_fn with the requested shape/attributes
`compute_library_size_batch`()	Computes the library size per batch.
`corrupt`([rate, corruption])	Forms a corrupted_X attribute containing a corrupted version of X.
`filter_cell_types`(cell_types)	Performs in-place filtering of cells by keeping cell types in `cell_types`.
`filter_cells_by_attribute`(values_to_keep[, on])	Performs in-place cell filtering based on any cell attribute.
`filter_cells_by_count`([min_count])
`filter_genes_by_attribute`(values_to_keep[, on])	Performs in-place gene filtering based on any gene attribute.
`filter_genes_by_count`([min_count, per_batch])
`genes_to_index`(genes[, on])	Returns the index of a subset of genes, given their `on` attribute in `genes`.
`get_batch_mask_cell_measurement`(attribute_name)	Returns a list with length number of batches where each entry is a mask over present cell measurement columns.
`initialize_cell_attribute`(attribute_name, …)	Sets and registers a cell-wise attribute, e.g. annotation information.
`initialize_cell_measurement`(measurement)	Initializes a cell measurement: sets attributes and updates registers.
`initialize_gene_attribute`(attribute_name, …)	Sets and registers a gene-wise attribute, e.g. annotation information.
`initialize_mapped_attribute`(…)	Sets and registers an attribute mapping, e.g. labels to named cell_types.
`make_gene_names_lower`()
`map_cell_types`(cell_types_dict)	Performs in-place filtering of cells using a cell type mapping.
`merge_cell_types`(cell_types, new_cell_type_name)	Merges some cell types into a new one, and changes the labels accordingly.
`normalize`()
`populate_from_data`(X[, Ys, batch_indices, …])	Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.
`populate_from_datasets`(gene_datasets_list[, …])	Populates the data attribute of a GeneExpressionDataset from multiple `GeneExpressionDataset` objects, merged using the intersection of a gene-wise attribute (`gene_names` by default).
`populate_from_per_batch_list`(Xs[, …])	Populates the data attributes of a GeneExpressionDataset object from an `n_batches`-long list of (nb_cells, nb_genes) matrices.
`populate_from_per_label_list`(Xs[, …])	Populates the data attributes of a GeneExpressionDataset object from an `n_labels`-long list of (nb_cells, nb_genes) matrices.
`raw_counts_properties`(idx1, idx2)	Computes and returns some statistics on the raw counts of two sub-populations.
`register_dataset_version`(version_name)	Registers a version of the dataset, e.g. normalized version.
`remap_categorical_attributes`([…])
`reorder_cell_types`(new_order)	Reorders the cell types in place.
`reorder_genes`(first_genes[, drop_omitted_genes])	Performs an in-place reordering of genes and gene-related attributes.
`subsample_cells`([size])	Wrapper around `update_cells` allowing for automatic (based on sum of counts) subsampling.
`subsample_genes`([new_n_genes, …])	Wrapper around `update_genes` allowing for manual and automatic (based on count variance) subsampling.
`to_anndata`()	Converts the dataset to an anndata.AnnData object.
`update_cells`(subset_cells)	Performs an in-place sub-sampling of cells and cell-related attributes.
`update_genes`(subset_genes)	Performs an in-place sub-sampling of genes and gene-related attributes.

Attributes Documentation

X

batch_indices

Return type: ndarray

corrupted_X

Returns the corrupted version of X.

Return type: Union[csr_matrix, ndarray]

labels

Return type: ndarray

nb_cells

Return type: int

nb_genes

Return type: int

norm_X

Returns a normalized version of X.

Return type: Union[csr_matrix, ndarray]

Methods Documentation

cell_types_to_labels(cell_types)

Forms an np.ndarray of labels in one-to-one correspondence with the specified cell_types.

Return type: ndarray

collate_fn_base(attributes_and_types, batch)

Given indices and attributes to batch, returns a full batch of torch.Tensor.

Return type: Tuple[Tensor, …]

collate_fn_builder(add_attributes_and_types=None, override=False, corrupted=False)

Returns a collate_fn with the requested shape/attributes.

Return type: Callable[[Union[List[int], ndarray]], Tuple[Tensor, …]]

compute_library_size_batch()

Computes the library size per batch.

corrupt(rate=0.1, corruption='uniform')

Forms a corrupted_X attribute containing a corrupted version of X.

Sub-samples rate * self.X.shape[0] * self.X.shape[1] entries and perturbs them according to the corruption method. Namely:

“uniform” multiplies the count by a Bernoulli(0.9)

“binomial” replaces the count with a Binomial(count, 0.2)

A corrupted version of self.X is stored in self.corrupted_X.

Parameters

rate (float) – Rate of corrupted entries.
corruption (str) – Corruption method.
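
The two corruption modes can be illustrated with a small NumPy sketch. This is not scVI's implementation: the helper name `corrupt_counts`, the `seed` argument, and the restriction to dense integer matrices are assumptions made for the example.

```python
import numpy as np

def corrupt_counts(X, rate=0.1, corruption="uniform", seed=0):
    """Return a corrupted copy of a dense count matrix X (sketch, not scVI's code)."""
    rng = np.random.default_rng(seed)
    corrupted = X.copy()
    n_entries = int(rate * X.shape[0] * X.shape[1])
    # Pick the flat positions to perturb, without replacement.
    flat_idx = rng.choice(X.size, size=n_entries, replace=False)
    rows, cols = np.unravel_index(flat_idx, X.shape)
    if corruption == "uniform":
        # Multiply each selected count by a Bernoulli(0.9) draw (0 or 1).
        corrupted[rows, cols] = X[rows, cols] * rng.binomial(1, 0.9, size=n_entries)
    elif corruption == "binomial":
        # Replace each selected count with a Binomial(count, 0.2) draw.
        corrupted[rows, cols] = rng.binomial(X[rows, cols], 0.2)
    else:
        raise ValueError(f"unknown corruption method: {corruption}")
    return corrupted

X = np.arange(1, 21).reshape(4, 5)
X_uniform = corrupt_counts(X, rate=0.5, corruption="uniform")
X_binomial = corrupt_counts(X, rate=0.5, corruption="binomial")
```

In both modes a corrupted count can only stay the same or decrease, and the original matrix is left untouched.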

filter_cell_types(cell_types)

Performs in-place filtering of cells by keeping cell types in cell_types.

Parameters: cell_types (Union[List[str], List[int], ndarray]) – numpy array of type np.int (indices) or np.str (cell type names)

filter_cells_by_attribute(values_to_keep, on='labels')

Performs in-place cell filtering based on any cell attribute.

Uses labels by default.
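
As a rough illustration of attribute-based cell filtering, the snippet below builds a boolean mask from a cell-wise attribute and applies it to both the counts and the attribute. It is a plain NumPy sketch under the assumption that filtering amounts to keeping the cells whose attribute value is in `values_to_keep`; the variable names are hypothetical and it does not call scVI.

```python
import numpy as np

labels = np.array([0, 2, 1, 2, 0])   # one label per cell
X = np.arange(10).reshape(5, 2)      # (nb_cells, nb_genes) toy counts
values_to_keep = [0, 1]

# Keep the cells whose label is among the requested values.
keep = np.isin(labels, values_to_keep)
X_filtered, labels_filtered = X[keep], labels[keep]
```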

filter_cells_by_count(min_count=1)

filter_genes_by_attribute(values_to_keep, on='gene_names')

Performs in-place gene filtering based on any gene attribute.

Uses gene_names by default.

filter_genes_by_count(min_count=1, per_batch=False)

genes_to_index(genes, on=None)

Returns the index of a subset of genes, given their on attribute in genes.

If integers are passed in genes, the function returns genes. If on is None, it defaults to gene_names.
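
The lookup described above can be sketched as follows. The helper `genes_to_index_sketch` is hypothetical and takes the attribute values directly instead of an `on` name; it only mirrors the documented behaviour (integers are returned unchanged, other values are looked up in the attribute).

```python
import numpy as np

def genes_to_index_sketch(genes, gene_names):
    """Map gene identifiers to their positions in gene_names; pass integers through."""
    genes = np.asarray(genes)
    if np.issubdtype(genes.dtype, np.integer):
        # Integers are taken to be indices already and are returned unchanged.
        return genes
    position = {name: i for i, name in enumerate(gene_names)}
    return np.array([position[g] for g in genes], dtype=np.int64)

gene_names = np.array(["GAPDH", "ACTB", "CD3E", "MS4A1"])
idx_from_names = genes_to_index_sketch(["CD3E", "GAPDH"], gene_names)
idx_from_ints = genes_to_index_sketch([3, 1], gene_names)
```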

get_batch_mask_cell_measurement(attribute_name)

Returns a list with length number of batches where each entry is a mask over present cell measurement columns.

Parameters: attribute_name (str) – cell_measurement attribute name
Returns: List of np.ndarray containing, for each batch, a mask of which columns were actually measured in that batch. This is useful when taking the union of a cell measurement over datasets.

initialize_cell_attribute(attribute_name, attribute, categorical=False)

Sets and registers a cell-wise attribute, e.g. annotation information.

initialize_cell_measurement(measurement)

Initializes a cell measurement: sets attributes and updates registers.

initialize_gene_attribute(attribute_name, attribute)

Sets and registers a gene-wise attribute, e.g. annotation information.

initialize_mapped_attribute(source_attribute_name, mapping_name, mapping_values)

Sets and registers an attribute mapping, e.g. labels to named cell_types.

make_gene_names_lower()

map_cell_types(cell_types_dict)

Performs in-place filtering of cells using a cell type mapping.

Cell types in the keys of cell_types_dict are merged and given the name of the associated value.

Parameters: cell_types_dict (Dict[Union[int, str, Tuple[int, …], Tuple[str, …]], str]) – dictionary with tuples of cell types to merge as keys and new cell type names as values.

merge_cell_types(cell_types, new_cell_type_name)

Merges some cell types into a new one, and changes the labels accordingly. The old cell types are not erased but ‘#merged’ is appended to their names.

Parameters

cell_types (Union[Tuple[int, …], Tuple[str, …], List[int], List[str], ndarray]) – Cell types to merge.
new_cell_type_name (str) – Name for the new aggregate cell type.
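
A minimal sketch of the merging behaviour, assuming labels are integer indices into a list of cell type names. The function `merge_cell_types_sketch` is hypothetical, returns new objects instead of modifying a dataset in place, and skips any remapping scVI may perform afterwards.

```python
import numpy as np

def merge_cell_types_sketch(labels, cell_types, to_merge, new_name):
    """Relabel cells of the merged types and rename the old types (illustrative only)."""
    labels = np.asarray(labels).copy()
    cell_types = [str(c) for c in cell_types]
    merged_ids = [cell_types.index(name) for name in to_merge]
    # The new aggregate type gets the next free label; affected cells point to it.
    new_id = len(cell_types)
    labels[np.isin(labels, merged_ids)] = new_id
    # Old names are kept, with a '#merged' suffix, rather than deleted.
    for i in merged_ids:
        cell_types[i] = cell_types[i] + "#merged"
    cell_types.append(new_name)
    return labels, cell_types

labels = np.array([0, 1, 2, 1, 0])
cell_types = ["CD4 T", "CD8 T", "B"]
new_labels, new_cell_types = merge_cell_types_sketch(
    labels, cell_types, ["CD4 T", "CD8 T"], "T"
)
```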

normalize()

populate_from_data(X, Ys=None, batch_indices=None, labels=None, gene_names=None, cell_types=None, cell_attributes_dict=None, gene_attributes_dict=None, remap_attributes=True)

Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.

Parameters

X (Union[ndarray, csr_matrix]) – RNA counts matrix, sparse format supported (e.g. scipy.sparse.csr_matrix).
Ys (Optional[List[CellMeasurement]]) – List of paired count measurements (e.g. CITE-seq protein measurements, spatial coordinates).
batch_indices (Union[List[int], ndarray, csr_matrix, None]) – np.ndarray with shape (nb_cells,). Maps each cell to the batch it originates from. Note that a batch most likely refers to a specific piece of tissue or a specific experimental protocol.
labels (Union[List[int], ndarray, csr_matrix, None]) – np.ndarray with shape (nb_cells,). Cell-wise labels. Can be mapped to cell types using attribute mappings.
gene_names (Union[List[str], ndarray, None]) – List or np.ndarray with length/shape (nb_genes,). Maps each gene to its name.
cell_types (Union[List[str], ndarray, None]) – Maps each integer label in labels to a cell type.
cell_attributes_dict (Optional[Dict[str, Union[List, ndarray]]]) – Maps attribute names to List or np.ndarray values with shape (nb_cells,).
gene_attributes_dict (Optional[Dict[str, Union[List, ndarray]]]) – Maps attribute names to List or np.ndarray values with shape (nb_genes,).
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.

populate_from_datasets(gene_datasets_list, shared_labels=True, mapping_reference_for_sharing=None, cell_measurement_intersection=None)

Populates the data attribute of a GeneExpressionDataset from multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).

Warning: The merging procedure modifies the gene_dataset objects given as inputs.

For gene-wise attributes, only the attributes of the first dataset are kept. For cell-wise attributes, either we “concatenate” or add an “offset” corresponding to the number of already existing categories.

Parameters

gene_datasets_list (List[GeneExpressionDataset]) – GeneExpressionDataset objects to be merged.
shared_labels – whether to share labels through cell_types mapping or not. (Default value = True)
mapping_reference_for_sharing (Optional[Dict[str, Optional[str]]]) – Instructions on how to share cell-wise attributes between datasets. Keys are the attribute names and values are registered mapped attributes. If provided, the mapping is merged across all datasets and then the attribute is remapped using index backtracking between the old and merged mapping. If no mapping is provided, the values are concatenated and an offset is added if the attribute is registered as categorical in the first dataset.
cell_measurement_intersection (Optional[Dict[str, bool]]) – A dictionary with keys being cell measurement attributes and values being True or False. If True, that cell measurement attribute will be intersected across datasets. If False, the union is taken. Defaults to intersection for each cell_measurement.
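
The gene intersection and the categorical “offset” can be pictured with the NumPy sketch below. It is an illustration under simplifying assumptions, not scVI's merging code: each dataset is reduced to a hypothetical `(X, gene_names, batch_indices)` triple, matrices are dense, and only batch indices are offset.

```python
import numpy as np

def merge_on_gene_names(datasets):
    """Merge (X, gene_names, batch_indices) triples on shared gene names (sketch)."""
    # Genes present in every dataset, kept in the order of the first dataset.
    shared = set(datasets[0][1])
    for _, names, _ in datasets[1:]:
        shared &= set(names)
    kept_names = [g for g in datasets[0][1] if g in shared]

    blocks, batches, offset = [], [], 0
    for X, names, batch_indices in datasets:
        column_of = {g: j for j, g in enumerate(names)}
        cols = [column_of[g] for g in kept_names]
        blocks.append(X[:, cols])
        # Offset categorical batch indices by the number of batches seen so far.
        batch_indices = np.asarray(batch_indices)
        batches.append(batch_indices + offset)
        offset += int(batch_indices.max()) + 1
    return np.vstack(blocks), kept_names, np.concatenate(batches)

X1 = np.array([[1, 2, 3], [4, 5, 6]])
X2 = np.array([[7, 8], [9, 10], [11, 12]])
merged_X, merged_genes, merged_batches = merge_on_gene_names([
    (X1, ["A", "B", "C"], [0, 1]),
    (X2, ["C", "A"], [0, 0, 0]),
])
```

Here gene `B` is dropped because the second dataset lacks it, and the single batch of the second dataset becomes batch 2 in the merged result.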

populate_from_per_batch_list(Xs, labels_per_batch=None, gene_names=None, cell_types=None, remap_attributes=True)

Populates the data attributes of a GeneExpressionDataset object from an n_batches-long list of (nb_cells, nb_genes) matrices.

Parameters

Xs (List[Union[csr_matrix, ndarray]]) – RNA counts in the form of a list of np.ndarray with shape (…, nb_genes)
labels_per_batch (Union[ndarray, List[ndarray], None]) – list of cell-wise labels for each batch.
gene_names (Union[List[str], ndarray, None]) – gene names, stored as str.
cell_types (Union[List[str], ndarray, None]) – cell types, stored as str.
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.
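
Conceptually, a per-batch list carries the same information as one stacked matrix plus a batch index per cell. The sketch below shows that correspondence in plain NumPy; it is an assumption about the resulting layout, not a call into scVI.

```python
import numpy as np

# Two batches with the same genes but different numbers of cells.
Xs = [np.ones((3, 4), dtype=int), 2 * np.ones((2, 4), dtype=int)]

# Stack the batches along the cell axis and record each cell's batch of origin.
X = np.concatenate(Xs, axis=0)
batch_indices = np.concatenate([np.full(x.shape[0], i) for i, x in enumerate(Xs)])
```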

populate_from_per_label_list(Xs, batch_indices_per_label=None, gene_names=None, remap_attributes=True)

Populates the data attributes of a GeneExpressionDataset object from an n_labels-long list of (nb_cells, nb_genes) matrices.

Parameters

Xs (List[Union[csr_matrix, ndarray]]) – RNA counts in the form of a list of np.ndarray with shape (…, nb_genes)
batch_indices_per_label (Optional[List[Union[List[int], ndarray]]]) – cell-wise batch indices, for each cell label.
gene_names (Union[List[str], ndarray, None]) – gene names, stored as str.
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.

raw_counts_properties(idx1, idx2)

Computes and returns some statistics on the raw counts of two sub-populations.

Parameters

idx1 (Union[List[int], ndarray]) – subset of indices describing the first population.
idx2 (Union[List[int], ndarray]) – subset of indices describing the second population.

Return type

Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]

Returns

Tuple of np.ndarray containing, by pair (one for each sub-population), mean expression per gene, proportion of non-zero expression per gene, mean of normalized expression.
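
The three pairs of statistics can be sketched in NumPy as follows. Two points are assumptions of this example and are not confirmed by the text above: “normalized” is taken to mean each cell's counts divided by that cell's total count, and the return order places the two populations side by side for each statistic.

```python
import numpy as np

def raw_counts_properties_sketch(X, idx1, idx2):
    """Per-gene statistics for two groups of cells (illustrative, dense X only)."""
    # Assumption: normalization divides each cell's counts by its total count.
    totals = X.sum(axis=1, keepdims=True)
    norm_X = X / np.maximum(totals, 1)

    def stats(idx):
        sub = X[idx]
        return sub.mean(axis=0), (sub != 0).mean(axis=0), norm_X[idx].mean(axis=0)

    mean1, nonz1, norm_mean1 = stats(idx1)
    mean2, nonz2, norm_mean2 = stats(idx2)
    return mean1, mean2, nonz1, nonz2, norm_mean1, norm_mean2

X = np.array([[2, 0], [4, 4], [0, 6], [1, 3]])
mean1, mean2, nonz1, nonz2, norm1, norm2 = raw_counts_properties_sketch(X, [0, 1], [2, 3])
```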

register_dataset_version(version_name)

Registers a version of the dataset, e.g. normalized version.

remap_categorical_attributes(attributes_to_remap=None)

reorder_cell_types(new_order)

Reorders the cell types in place. The cell types provided are placed at the beginning of the cell_types attribute; any existing cell types omitted from new_order are left after them.

reorder_genes(first_genes, drop_omitted_genes=False)

Performs an in-place reordering of genes and gene-related attributes.

Reorders genes according to the first_genes list of gene names. Consequently, modifies in-place the data X and the registered gene attributes.

Parameters

first_genes (Union[List[str], ndarray]) – New ordering of the genes. If drop_omitted_genes is False, genes missing from first_genes are added after them, in the same order as before.
drop_omitted_genes (bool) – Whether to drop the genes omitted from first_genes (True) or keep them (False).
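
The ordering rule can be sketched as below. The helper `reorder_genes_sketch` is hypothetical: it works on a dense matrix and a list of names, and returns reordered copies, whereas the method documented here modifies the dataset in place.

```python
import numpy as np

def reorder_genes_sketch(X, gene_names, first_genes, drop_omitted_genes=False):
    """Return X and gene_names with first_genes moved to the front (sketch)."""
    gene_names = list(gene_names)
    first_idx = [gene_names.index(g) for g in first_genes]
    if drop_omitted_genes:
        order = first_idx
    else:
        chosen = set(first_idx)
        # Omitted genes keep their previous relative order, after first_genes.
        order = first_idx + [j for j in range(len(gene_names)) if j not in chosen]
    return X[:, order], [gene_names[j] for j in order]

X = np.array([[10, 20, 30, 40], [11, 21, 31, 41]])
gene_names = ["A", "B", "C", "D"]
X_kept, names_kept = reorder_genes_sketch(X, gene_names, ["C", "A"])
X_dropped, names_dropped = reorder_genes_sketch(
    X, gene_names, ["C", "A"], drop_omitted_genes=True
)
```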

subsample_cells(size=1.0)

Wrapper around update_cells allowing for automatic (based on sum of counts) subsampling.

If size is a:

(0,1) float: subsample 100*size % of the cells

int: subsample size cells
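
One way to read “based on sum of counts” is that the cells with the largest total counts are retained. The sketch below follows that reading, which is an assumption; `subsample_cells_sketch` is a hypothetical helper operating on a dense matrix.

```python
import numpy as np

def subsample_cells_sketch(X, size=1.0):
    """Keep a number (int) or fraction (float) of cells, ranked by total counts."""
    n_cells = X.shape[0]
    # An int is read as a number of cells, a float as a fraction of the cells.
    n_keep = size if isinstance(size, (int, np.integer)) else int(size * n_cells)
    totals = X.sum(axis=1)
    # Assumption: the cells with the largest total counts are the ones kept.
    keep = np.argsort(totals)[::-1][:n_keep]
    return X[keep], keep

X = np.array([[1, 1], [5, 5], [0, 3], [3, 4]])
X_half, kept_half = subsample_cells_sketch(X, size=0.5)
X_three, kept_three = subsample_cells_sketch(X, size=3)
```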

subsample_genes(new_n_genes=None, new_ratio_genes=None, subset_genes=None, mode='seurat_v3', batch_correction=True, **highly_var_genes_kwargs)

Wrapper around update_genes allowing for manual and automatic (based on count variance) subsampling.

The function either:

Subsamples new_n_genes genes among all genes

Subsamples a proportion of new_ratio_genes of the genes

Subsamples the genes in subset_genes

In the first two cases, a mode of highly variable gene selection is used as specified in the mode argument.

In the case where new_n_genes, new_ratio_genes and subset_genes are all None, this method automatically computes the number of genes to keep (when mode='seurat_v2' or mode='cell_ranger').

In the case where mode=="seurat_v3", an adapted version of the method described in [Stuart19] is used. This method requires new_n_genes or new_ratio_genes to be specified.

In the case where mode=="poisson_zeros", a method based on [Andrews & Hemberg 2019] is used. This method requires new_n_genes or new_ratio_genes to be specified.

Parameters

subset_genes (Union[List[int], List[bool], ndarray, None]) – list of indices or mask of genes to retain
new_n_genes (Optional[int]) – number of genes to retain, the highly variable genes will be kept
new_ratio_genes (Optional[float]) – proportion of genes to retain, the highly variable genes will be kept
mode (Optional[str]) – Either “variance”, “seurat_v2”, “cell_ranger”, “seurat_v3” or “poisson_zeros”
batch_correction (Optional[bool]) – Account for batches when choosing highly variable genes. HVGs are selected in each batch and merged.
highly_var_genes_kwargs – Kwargs to feed to highly_variable_genes when using seurat_v2 or cell_ranger (cf. highly_variable_genes method)
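
For intuition, the simplest selection rule, ranking genes by raw count variance, can be sketched as follows. This is only an approximation of what a “variance” mode might do; it is not the seurat_v2, cell_ranger, seurat_v3 or poisson_zeros procedure, it ignores batches, and `top_variance_genes` is a hypothetical helper.

```python
import numpy as np

def top_variance_genes(X, new_n_genes=None, new_ratio_genes=None):
    """Return sorted indices of the most variable genes of a dense matrix (sketch)."""
    n_genes = X.shape[1]
    if new_n_genes is None:
        if new_ratio_genes is None:
            raise ValueError("specify new_n_genes or new_ratio_genes")
        new_n_genes = int(new_ratio_genes * n_genes)
    variances = X.var(axis=0)
    # Take the new_n_genes largest variances, then restore original column order.
    top = np.argsort(variances)[::-1][:new_n_genes]
    return np.sort(top)

X = np.array([[0, 5, 1, 10], [0, 5, 3, 0], [0, 5, 5, 20]])
subset_genes = top_variance_genes(X, new_n_genes=2)
subset_by_ratio = top_variance_genes(X, new_ratio_genes=0.5)
X_subsampled = X[:, subset_genes]
```

The resulting index array plays the role of the subset_genes argument: it states which columns of X to keep.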

to_anndata()

Converts the dataset to an anndata.AnnData object. The obtained dataset can then be saved/retrieved using the anndata API.

Return type: AnnData

update_cells(subset_cells)

Performs an in-place sub-sampling of cells and cell-related attributes.

Sub-selects cells according to subset_cells sub-index. Consequently, modifies in-place the data X, its versions and the registered cell attributes.

Parameters: subset_cells – Index used for cell sub-sampling. Either an int array with arbitrary shape whose values are the indexes of the cells to keep, or a boolean array used as a mask-like index.

update_genes(subset_genes)

Performs an in-place sub-sampling of genes and gene-related attributes.

Sub-selects genes according to subset_genes sub-index. Consequently, modifies in-place the data X and the registered gene attributes.

Parameters: subset_genes (ndarray) – Index used for gene sub-sampling. Either an int array with arbitrary shape whose values are the indexes of the genes to keep, or a boolean array used as a mask-like index.
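
The two accepted index forms, and the idea of propagating a subset through registered attributes, can be illustrated with a toy class. `TinyDataset` is a hypothetical stand-in written for this example; it is not scVI's class and keeps only a single registered gene attribute.

```python
import numpy as np

class TinyDataset:
    """Toy stand-in (not scVI's class) holding X plus a registry of gene attributes."""

    def __init__(self, X, gene_names):
        self.X = X
        self.gene_names = np.asarray(gene_names)
        # Names of the gene-wise attributes that must follow any gene subsetting.
        self.gene_attribute_names = {"gene_names"}

    def update_genes(self, subset_genes):
        subset_genes = np.asarray(subset_genes)
        # The same index (int positions or boolean mask) is applied to X ...
        self.X = self.X[:, subset_genes]
        # ... and to every registered gene-wise attribute.
        for name in self.gene_attribute_names:
            setattr(self, name, getattr(self, name)[subset_genes])

by_index = TinyDataset(np.arange(12).reshape(3, 4), ["A", "B", "C", "D"])
by_index.update_genes([0, 2])

by_mask = TinyDataset(np.arange(12).reshape(3, 4), ["A", "B", "C", "D"])
by_mask.update_genes([True, False, True, False])
```

Both calls keep genes `A` and `C`, so the two toy datasets end up identical.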