GeneExpressionDataset

class scvi.dataset.GeneExpressionDataset[source]

Bases: torch.utils.data.dataset.Dataset

Generic class representing RNA counts and annotation information.

This class is scVI’s base dataset class. It gives access to several standard attributes: counts, number of cells, number of genes, etc. More importantly, it implements gene-based and cell-based filtering methods. It also allows the storage of cell and gene annotation information, as well as mappings from these annotation attributes to unique identifiers. In order to propagate the filtering behaviour correctly through the relevant attributes, they are kept in registries (cell, gene, mappings) which are iterated through upon any filtering operation.

Note that the constructor merely instantiates the GeneExpressionDataset object. It should be used in combination with one of the populating methods:

  • populate_from_data: to populate using a (nb_cells, nb_genes) matrix.

  • populate_from_per_batch_array: to populate using a (n_batches, nb_cells, nb_genes) matrix.

  • populate_from_per_batch_list: to populate using an n_batches-long list of (nb_cells, nb_genes) matrices.

  • populate_from_datasets: to populate using multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).
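How a per-batch list relates to the single-matrix layout can be sketched in plain NumPy. This is only an illustration of the shapes involved, not scVI's implementation; the variable names and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical batches measured on the same 4 genes.
Xs = [
    rng.poisson(2.0, size=(3, 4)),  # batch 0: 3 cells
    rng.poisson(2.0, size=(5, 4)),  # batch 1: 5 cells
]

# Stack the batches into one (nb_cells, nb_genes) matrix and
# record which batch every cell came from.
X = np.concatenate(Xs, axis=0)
batch_indices = np.concatenate(
    [np.full(x.shape[0], i) for i, x in enumerate(Xs)]
)

print(X.shape, batch_indices)
```

The stacked matrix plus the per-cell batch vector carries the same information as the list, which is why either form can be used to populate a dataset.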

Attributes Summary

X

batch_indices

Return type: ndarray

corrupted_X

Returns the corrupted version of X.

labels

Return type: ndarray

nb_cells

Return type: int

nb_genes

Return type: int

norm_X

Returns a normalized version of X.

Methods Summary

cell_types_to_labels(cell_types)

Forms an np.ndarray of labels in one-to-one correspondence with the specified cell_types.

collate_fn_base(attributes_and_types, batch)

Given indices and attributes to batch, returns a full batch of torch.Tensor.

collate_fn_builder([…])

Returns a collate_fn with the requested shape/attributes

compute_library_size_batch()

Computes the library size per batch.

corrupt([rate, corruption])

Forms a corrupted_X attribute containing a corrupted version of X.

filter_cell_types(cell_types)

Performs in-place filtering of cells by keeping cell types in cell_types.

filter_cells_by_attribute(values_to_keep[, on])

Performs in-place cell filtering based on any cell attribute.

filter_cells_by_count([min_count])

filter_genes_by_attribute(values_to_keep[, on])

Performs in-place gene filtering based on any gene attribute.

filter_genes_by_count([min_count, per_batch])

genes_to_index(genes[, on])

Returns the index of a subset of genes, given their on attribute in genes.

get_batch_mask_cell_measurement(attribute_name)

Returns a list with length equal to the number of batches, where each entry is a mask over the present cell measurement columns.

initialize_cell_attribute(attribute_name, …)

Sets and registers a cell-wise attribute, e.g. annotation information.

initialize_cell_measurement(measurement)

Initializes a cell measurement: sets attributes and updates registers.

initialize_gene_attribute(attribute_name, …)

Sets and registers a gene-wise attribute, e.g. annotation information.

initialize_mapped_attribute(…)

Sets and registers an attribute mapping, e.g. labels to named cell_types.

make_gene_names_lower()

map_cell_types(cell_types_dict)

Performs in-place filtering of cells using a cell type mapping.

merge_cell_types(cell_types, new_cell_type_name)

Merges some cell types into a new one, and changes the labels accordingly.

normalize()

populate_from_data(X[, Ys, batch_indices, …])

Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.

populate_from_datasets(gene_datasets_list[, …])

Populates the data attribute of a GeneExpressionDataset from multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).

populate_from_per_batch_list(Xs[, …])

Populates the data attributes of a GeneExpressionDataset object from an n_batches-long list of (nb_cells, nb_genes) matrices.

populate_from_per_label_list(Xs[, …])

Populates the data attributes of a GeneExpressionDataset object from an n_labels-long list of (nb_cells, nb_genes) matrices.

raw_counts_properties(idx1, idx2)

Computes and returns some statistics on the raw counts of two sub-populations.

register_dataset_version(version_name)

Registers a version of the dataset, e.g. normalized version.

remap_categorical_attributes([…])

reorder_cell_types(new_order)

Reorders the cell types in place.

reorder_genes(first_genes[, drop_omitted_genes])

Performs an in-place reordering of genes and gene-related attributes.

subsample_cells([size])

Wrapper around update_cells allowing for automatic (based on sum of counts) subsampling.

subsample_genes([new_n_genes, …])

Wrapper around update_genes allowing for manual and automatic (based on count variance) subsampling.

to_anndata()

Converts the dataset to an anndata.AnnData object.

update_cells(subset_cells)

Performs an in-place sub-sampling of cells and cell-related attributes.

update_genes(subset_genes)

Performs an in-place sub-sampling of genes and gene-related attributes.

Attributes Documentation

X

batch_indices

Return type

ndarray

corrupted_X

Returns the corrupted version of X.

Return type

Union[csr_matrix, ndarray]

labels

Return type

ndarray

nb_cells

Return type

int

nb_genes

Return type

int

norm_X

Returns a normalized version of X.

Return type

Union[csr_matrix, ndarray]

Methods Documentation

cell_types_to_labels(cell_types)[source]

Forms an np.ndarray of labels in one-to-one correspondence with the specified cell_types.

Return type

ndarray

collate_fn_base(attributes_and_types, batch)[source]

Given indices and attributes to batch, returns a full batch of torch.Tensor.

Return type

Tuple[Tensor, …]

collate_fn_builder(add_attributes_and_types=None, override=False, corrupted=False)[source]

Returns a collate_fn with the requested shape/attributes

Return type

Callable[[Union[List[int], ndarray]], Tuple[Tensor, …]]

compute_library_size_batch()[source]

Computes the library size per batch.

corrupt(rate=0.1, corruption='uniform')[source]

Forms a corrupted_X attribute containing a corrupted version of X.

Sub-samples rate * self.X.shape[0] * self.X.shape[1] entries and perturbs them according to the corruption method. Namely:

  • “uniform” multiplies the count by a Bernoulli(0.9)

  • “binomial” replaces the count with a Binomial(count, 0.2)

A corrupted version of self.X is stored in self.corrupted_X.

Parameters
  • rate (float) – Rate of corrupted entries.

  • corruption (str) – Corruption method.
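The two corruption schemes described above can be sketched in NumPy as follows. This is a minimal illustration, not the library's code: the function name corrupt_counts and the seed argument are hypothetical, and the sketch draws the perturbed entries uniformly over the whole matrix, which may differ from how scVI selects them.

```python
import numpy as np

def corrupt_counts(X, rate=0.1, corruption="uniform", seed=0):
    """Hypothetical helper: return a corrupted copy of a dense count matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    corrupted = X.copy()

    # Number of entries to perturb: rate * nb_cells * nb_genes.
    n_entries = int(rate * X.shape[0] * X.shape[1])
    flat = rng.choice(X.size, size=n_entries, replace=False)
    i, j = np.unravel_index(flat, X.shape)

    if corruption == "uniform":
        # Multiply by a Bernoulli(0.9) draw: the count is zeroed with probability 0.1.
        corrupted[i, j] = X[i, j] * rng.binomial(1, 0.9, size=n_entries)
    elif corruption == "binomial":
        # Replace the count with a Binomial(count, 0.2) draw.
        corrupted[i, j] = rng.binomial(X[i, j], 0.2)
    else:
        raise ValueError(f"unknown corruption method: {corruption}")
    return corrupted

X = np.random.default_rng(1).poisson(5.0, size=(6, 4))
X_uniform = corrupt_counts(X, rate=0.5, corruption="uniform")
X_binomial = corrupt_counts(X, rate=0.5, corruption="binomial")
```

Both schemes can only lower a count, so the corrupted matrix is never larger than X in any entry.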

filter_cell_types(cell_types)[source]

Performs in-place filtering of cells by keeping cell types in cell_types.

Parameters

cell_types (Union[List[str], List[int], ndarray]) – numpy array of type np.int (indices) or np.str (cell-type names)

filter_cells_by_attribute(values_to_keep, on='labels')[source]

Performs in-place cell filtering based on any cell attribute.

Uses labels by default.
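Filtering on an attribute amounts to building a boolean mask over cells and applying it to the counts and to every cell-wise attribute. A minimal NumPy sketch, with toy data and variable names that are assumptions rather than scVI internals:

```python
import numpy as np

X = np.arange(12).reshape(4, 3)   # 4 cells, 3 genes
labels = np.array([0, 1, 2, 1])   # one label per cell
values_to_keep = [1, 2]

# Keep only the cells whose attribute value is in values_to_keep.
keep = np.isin(labels, values_to_keep)
X_kept, labels_kept = X[keep], labels[keep]

print(X_kept.shape, labels_kept)
```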

filter_cells_by_count(min_count=1)[source]

filter_genes_by_attribute(values_to_keep, on='gene_names')[source]

Performs in-place gene filtering based on any gene attribute. Uses gene_names by default.

filter_genes_by_count(min_count=1, per_batch=False)[source]

genes_to_index(genes, on=None)[source]

Returns the index of a subset of genes, given their on attribute in genes.

If integers are passed in genes, the function returns genes. If on is None, it defaults to gene_names.
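The lookup behaviour described above can be sketched as a small standalone function. The name names_to_index is hypothetical and the code is an illustration under the stated behaviour, not the library's implementation.

```python
import numpy as np

def names_to_index(genes, gene_names):
    """Hypothetical helper: map gene identifiers to column indices."""
    genes = np.asarray(genes)
    # Integers are assumed to already be indices and are returned unchanged.
    if np.issubdtype(genes.dtype, np.integer):
        return genes
    position = {name: i for i, name in enumerate(gene_names)}
    return np.array([position[g] for g in genes])

gene_names = np.array(["actb", "gapdh", "cd3e"])
idx = names_to_index(["cd3e", "actb"], gene_names)
print(idx)
```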

get_batch_mask_cell_measurement(attribute_name)[source]

Returns a list with length equal to the number of batches, where each entry is a mask over the present cell measurement columns.

Parameters

attribute_name (str) – cell_measurement attribute name

Returns

List of np.ndarray containing, for each batch, a mask of which columns were actually measured in that batch. This is useful when taking the union of a cell measurement over datasets.

initialize_cell_attribute(attribute_name, attribute, categorical=False)[source]

Sets and registers a cell-wise attribute, e.g. annotation information.

initialize_cell_measurement(measurement)[source]

Initializes a cell measurement: sets attributes and updates registers.

initialize_gene_attribute(attribute_name, attribute)[source]

Sets and registers a gene-wise attribute, e.g. annotation information.

initialize_mapped_attribute(source_attribute_name, mapping_name, mapping_values)[source]

Sets and registers an attribute mapping, e.g. labels to named cell_types.

make_gene_names_lower()[source]

map_cell_types(cell_types_dict)[source]

Performs in-place filtering of cells using a cell type mapping.

Cell types in the keys of cell_types_dict are merged and given the name of the associated value.

Parameters

cell_types_dict (Dict[Union[int, str, Tuple[int, …], Tuple[str, …]], str]) – dictionary with tuples of cell types to merge as keys and new cell type names as values.

merge_cell_types(cell_types, new_cell_type_name)[source]

Merges some cell types into a new one, and changes the labels accordingly. The old cell types are not erased but ‘#merged’ is appended to their names.
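The relabelling described above might look like the following NumPy sketch. The helper name merge_types is hypothetical, and the exact way the library builds the renamed entries is an assumption; only the behaviour stated in the text (new type appended, old names suffixed with ‘#merged’) is illustrated.

```python
import numpy as np

def merge_types(labels, cell_types, to_merge, new_name):
    """Hypothetical helper: merge several cell types into a new one."""
    labels = np.asarray(labels).copy()
    cell_types = list(cell_types)

    merged_idx = [cell_types.index(t) for t in to_merge]
    new_label = len(cell_types)  # the new type is appended at the end

    # Cells of any merged type now point to the new label.
    labels[np.isin(labels, merged_idx)] = new_label

    # Old names are kept but flagged as merged.
    for i in merged_idx:
        cell_types[i] = cell_types[i] + "#merged"
    cell_types.append(new_name)
    return labels, np.array(cell_types)

labels, cell_types = merge_types(
    labels=[0, 1, 2, 1],
    cell_types=["B", "CD4 T", "CD8 T"],
    to_merge=["CD4 T", "CD8 T"],
    new_name="T",
)
print(labels, cell_types)
```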

normalize()[source]

populate_from_data(X, Ys=None, batch_indices=None, labels=None, gene_names=None, cell_types=None, cell_attributes_dict=None, gene_attributes_dict=None, remap_attributes=True)[source]

Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.

populate_from_datasets(gene_datasets_list, shared_labels=True, mapping_reference_for_sharing=None, cell_measurement_intersection=None)[source]

Populates the data attribute of a GeneExpressionDataset from multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).

Warning: The merging procedure modifies the datasets given as inputs.

For gene-wise attributes, only the attributes of the first dataset are kept. For cell-wise attributes, either we “concatenate” or add an “offset” corresponding to the number of already existing categories.

Parameters
  • gene_datasets_list (List[GeneExpressionDataset]) – GeneExpressionDataset objects to be merged.

  • shared_labels – whether to share labels through cell_types mapping or not. (Default value = True)

  • mapping_reference_for_sharing (Optional[Dict[str, Optional[str]]]) – Instructions on how to share cell-wise attributes between datasets. Keys are attribute names and values are registered mapped attributes. If provided, the mapping is merged across all datasets and the attribute is then remapped using index backtracking between the old and merged mappings. If no mapping is provided, the values are concatenated and an offset is added if the attribute is registered as categorical in the first dataset.

  • cell_measurement_intersection (Optional[Dict[str, bool]]) – A dictionary with keys being cell measurement attributes and values being True or False. If True, that cell measurement attribute will be intersected across datasets. If False, the union is taken. Defaults to intersection for each cell measurement.
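The merging rules above (gene intersection, then concatenation of cells with an offset on categorical attributes) can be sketched with two toy datasets in NumPy. This is an illustration only: all arrays are made up, and the sketch shows the plain offset case rather than label sharing through a cell_types mapping.

```python
import numpy as np

# Two hypothetical datasets with partially overlapping genes.
genes_a = np.array(["g1", "g2", "g3"])
genes_b = np.array(["g3", "g1", "g4"])
X_a = np.array([[1, 2, 3], [4, 5, 6]])
X_b = np.array([[7, 8, 9], [10, 11, 12]])
labels_a = np.array([0, 1])
labels_b = np.array([0, 0])

# Intersect on gene names, keeping the order of the first dataset.
in_b = set(genes_b)
shared = [g for g in genes_a if g in in_b]
idx_a = [int(np.flatnonzero(genes_a == g)[0]) for g in shared]
idx_b = [int(np.flatnonzero(genes_b == g)[0]) for g in shared]

# Concatenate cells on the shared genes.
X = np.concatenate([X_a[:, idx_a], X_b[:, idx_b]], axis=0)

# Offset the second dataset's labels by the number of existing categories.
offset = len(np.unique(labels_a))
labels = np.concatenate([labels_a, labels_b + offset])

print(shared, X, labels)
```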

populate_from_per_batch_list(Xs, labels_per_batch=None, gene_names=None, cell_types=None, remap_attributes=True)[source]

Populates the data attributes of a GeneExpressionDataset object from an n_batches-long list of (nb_cells, nb_genes) matrices.

populate_from_per_label_list(Xs, batch_indices_per_label=None, gene_names=None, remap_attributes=True)[source]

Populates the data attributes of a GeneExpressionDataset object from an n_labels-long list of (nb_cells, nb_genes) matrices.

raw_counts_properties(idx1, idx2)[source]

Computes and returns some statistics on the raw counts of two sub-populations.

Return type

Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]

Returns

Tuple of np.ndarray containing, by pair (one for each sub-population), mean expression per gene, proportion of non-zero expression per gene, and mean of normalized expression.
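The three per-gene statistics can be sketched for two sub-populations as follows. The helper name raw_counts_stats is hypothetical, and normalizing each cell by its own total count is an assumption about what "normalized expression" means here; the library may normalize differently.

```python
import numpy as np

def raw_counts_stats(X, idx1, idx2):
    """Hypothetical helper: per-gene statistics for two groups of cells."""
    X = np.asarray(X, dtype=float)

    def stats(sub):
        mean = sub.mean(axis=0)             # mean expression per gene
        nonzero = (sub != 0).mean(axis=0)   # proportion of non-zero entries
        # Assumed normalization: divide each cell by its total count
        # (cells with zero total counts are not handled in this sketch).
        norm_mean = (sub / sub.sum(axis=1, keepdims=True)).mean(axis=0)
        return mean, nonzero, norm_mean

    mean1, nonzero1, norm1 = stats(X[idx1])
    mean2, nonzero2, norm2 = stats(X[idx2])
    # Results are returned by pair, one entry per sub-population.
    return mean1, mean2, nonzero1, nonzero2, norm1, norm2

X = np.array([[1, 0], [3, 0], [0, 2]])
result = raw_counts_stats(X, idx1=[0, 1], idx2=[2])
```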

register_dataset_version(version_name)[source]

Registers a version of the dataset, e.g. normalized version.

remap_categorical_attributes(attributes_to_remap=None)[source]

reorder_cell_types(new_order)[source]

Reorders the cell types in place. The cell types provided are placed at the beginning of the cell_types attribute; any existing cell types omitted from new_order are kept after them.

reorder_genes(first_genes, drop_omitted_genes=False)[source]

Performs an in-place reordering of genes and gene-related attributes.

Reorders genes according to the first_genes list of gene names. Consequently, modifies in place the data X and the registered gene attributes.

Parameters
  • first_genes (Union[List[str], ndarray]) – New ordering of the genes. If drop_omitted_genes is False, genes missing from first_genes are appended after them in their original order.

  • drop_omitted_genes (bool) – If True, genes omitted from first_genes are dropped; if False, they are kept.
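The ordering rule can be sketched as a standalone function that returns new arrays instead of modifying a dataset in place. The helper name reorder is hypothetical; this is an illustration of the documented behaviour, not the library's code.

```python
import numpy as np

def reorder(gene_names, X, first_genes, drop_omitted_genes=False):
    """Hypothetical helper: put first_genes first, then the remaining genes."""
    gene_names = np.asarray(gene_names)
    position = {name: i for i, name in enumerate(gene_names)}
    first_idx = [position[g] for g in first_genes]

    if drop_omitted_genes:
        order = first_idx
    else:
        # Omitted genes keep their original relative order.
        listed = set(first_idx)
        rest = [i for i in range(len(gene_names)) if i not in listed]
        order = first_idx + rest
    return gene_names[order], X[:, order]

gene_names = np.array(["a", "b", "c", "d"])
X = np.arange(8).reshape(2, 4)
names_all, X_all = reorder(gene_names, X, ["c", "a"])
names_drop, X_drop = reorder(gene_names, X, ["c", "a"], drop_omitted_genes=True)
```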

subsample_cells(size=1.0)[source]

Wrapper around update_cells allowing for automatic (based on sum of counts) subsampling.

If size is a:

  • (0,1) float: subsample 100 * size % of the cells

  • int: subsample size cells
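One way to read "based on sum of counts" is that the cells with the largest total counts are kept. The sketch below implements that reading; it is an assumption, and the helper name top_cells_by_counts is hypothetical.

```python
import numpy as np

def top_cells_by_counts(X, size=1.0):
    """Hypothetical helper: indices of the cells with the largest total counts."""
    n_cells = X.shape[0]
    # A float is read as a fraction of the cells, an int as a number of cells.
    n_keep = int(size * n_cells) if isinstance(size, float) else int(size)
    totals = np.asarray(X).sum(axis=1)
    return np.argsort(totals)[::-1][:n_keep]

# Per-cell totals are 3, 10, 1 and 7.
X = np.array([[1, 2], [4, 6], [1, 0], [3, 4]])
half = top_cells_by_counts(X, size=0.5)
three = top_cells_by_counts(X, size=3)
```

The returned index array is the kind of input update_cells expects.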

subsample_genes(new_n_genes=None, new_ratio_genes=None, subset_genes=None, mode='seurat_v3', batch_correction=True, **highly_var_genes_kwargs)[source]

Wrapper around update_genes allowing for manual and automatic (based on count variance) subsampling.

The function either:

  • Subsamples new_n_genes genes among all genes

  • Subsamples a proportion of new_ratio_genes of the genes

  • Subsamples the genes in subset_genes

In the first two cases, a mode of highly variable gene selection is used as specified in the mode argument.

In the case where new_n_genes, new_ratio_genes and subset_genes are all None, this method automatically computes the number of genes to keep (when mode="seurat_v2" or mode="cell_ranger").

In the case where mode="seurat_v3", an adapted version of the method described in [Stuart19] is used. This method requires new_n_genes or new_ratio_genes to be specified.

In the case where mode="poisson_zeros", a method based on [Andrews & Hemberg 2019] is used. This method requires new_n_genes or new_ratio_genes to be specified.

Parameters
  • subset_genes (Union[List[int], List[bool], ndarray, None]) – list of indices or mask of genes to retain

  • new_n_genes (Optional[int]) – number of genes to retain, the highly variable genes will be kept

  • new_ratio_genes (Optional[float]) – proportion of genes to retain, the highly variable genes will be kept

  • mode (Optional[str]) – Either “variance”, “seurat_v2”, “cell_ranger”, “seurat_v3” or “poisson_zeros”

  • batch_correction (Optional[bool]) – Account for batches when choosing highly variable genes. HVGs are selected in each batch and merged.

  • highly_var_genes_kwargs – Kwargs to feed to highly_variable_genes when using seurat_v2 or cell_ranger (cf. highly_variable_genes method)
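As a rough illustration of the simplest selection mode, the sketch below keeps the genes with the highest count variance. It only covers a plain variance ranking, without batch correction, and is not a reproduction of the seurat_v2, cell_ranger, seurat_v3 or poisson_zeros procedures; the helper name top_variable_genes is hypothetical.

```python
import numpy as np

def top_variable_genes(X, new_n_genes=None, new_ratio_genes=None):
    """Hypothetical helper: sorted indices of the most variable genes."""
    n_genes = X.shape[1]
    if new_n_genes is None:
        if new_ratio_genes is None:
            raise ValueError("specify new_n_genes or new_ratio_genes")
        new_n_genes = int(new_ratio_genes * n_genes)

    variances = np.asarray(X, dtype=float).var(axis=0)
    top = np.argsort(variances)[::-1][:new_n_genes]
    return np.sort(top)  # keep the original gene order among the selected genes

# Gene 0 is constant, gene 1 varies the most, gene 2 varies a little.
X = np.array([[1, 0, 2], [1, 10, 4], [1, 20, 6]])
by_number = top_variable_genes(X, new_n_genes=2)
by_ratio = top_variable_genes(X, new_ratio_genes=0.34)
```

The resulting indices play the role of subset_genes.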

to_anndata()[source]

Converts the dataset to an anndata.AnnData object. The obtained dataset can then be saved/retrieved using the anndata API.

Return type

AnnData

update_cells(subset_cells)[source]

Performs an in-place sub-sampling of cells and cell-related attributes.

Sub-selects cells according to subset_cells sub-index. Consequently, modifies in-place the data X, its versions and the registered cell attributes.

Parameters

subset_cells – Index used for cell sub-sampling. Either an int array of arbitrary shape whose values are the indices of the cells to keep, or a boolean array used as a mask-like index.
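The two accepted index forms select the same cells, and the same index has to be applied to every cell-wise attribute. A minimal NumPy sketch, in which a plain dict stands in for the cell attribute registry (an assumption made for illustration; the helper name subset_cells_sketch is hypothetical):

```python
import numpy as np

def subset_cells_sketch(X, cell_attributes, subset):
    """Hypothetical helper: apply one index to X and to each cell attribute."""
    subset = np.asarray(subset)
    new_X = X[subset]
    new_attributes = {
        name: values[subset] for name, values in cell_attributes.items()
    }
    return new_X, new_attributes

X = np.arange(12).reshape(4, 3)
attrs = {
    "labels": np.array([0, 1, 0, 2]),
    "batch_indices": np.array([0, 0, 1, 1]),
}

# Integer indices and a boolean mask describing the same two cells.
X_int, attrs_int = subset_cells_sketch(X, attrs, [0, 2])
X_mask, attrs_mask = subset_cells_sketch(X, attrs, [True, False, True, False])
```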

update_genes(subset_genes)[source]

Performs an in-place sub-sampling of genes and gene-related attributes.

Sub-selects genes according to subset_genes sub-index. Consequently, modifies in-place the data X and the registered gene attributes.

Parameters

subset_genes (ndarray) – Index used for gene sub-sampling. Either an int array of arbitrary shape whose values are the indices of the genes to keep, or a boolean array used as a mask-like index.