GeneExpressionDataset
class scvi.dataset.GeneExpressionDataset
Bases: torch.utils.data.dataset.Dataset
Generic class representing RNA counts and annotation information.
This class is scVI’s base dataset class. It gives access to several standard attributes: counts, number of cells, number of genes, etc. More importantly, it implements gene-based and cell-based filtering methods. It also allows the storage of cell and gene annotation information, as well as mappings from these annotation attributes to unique identifiers. In order to propagate the filtering behaviour correctly through the relevant attributes, they are kept in registries (cell, gene, mappings) which are iterated through upon any filtering operation.
Note that the constructor merely instantiates the GeneExpressionDataset object. It should be used in combination with one of the populating methods:
populate_from_data: to populate using a (nb_cells, nb_genes) matrix.
populate_from_per_batch_array: to populate using a (n_batches, nb_cells, nb_genes) matrix.
populate_from_per_batch_list: to populate using an n_batches-long list of (nb_cells, nb_genes) matrices.
populate_from_datasets: to populate using multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).
Attributes Summary
corrupted_X: Returns the corrupted version of X.
norm_X: Returns a normalized version of X.
Methods Summary
cell_types_to_labels(cell_types): Forms a one-to-one corresponding np.ndarray of labels for the specified cell_types.
collate_fn_base(attributes_and_types, batch): Given indices and attributes to batch, returns a full batch of torch.Tensor.
collate_fn_builder([…]): Returns a collate_fn with the requested shape/attributes.
Computes the library size per batch.
corrupt([rate, corruption]): Forms a corrupted_X attribute containing a corrupted version of X.
filter_cell_types(cell_types): Performs in-place filtering of cells by keeping cell types in cell_types.
filter_cells_by_attribute(values_to_keep[, on]): Performs in-place cell filtering based on any cell attribute.
filter_cells_by_count([min_count])
filter_genes_by_attribute(values_to_keep[, on]): Performs in-place gene filtering based on any gene attribute.
filter_genes_by_count([min_count, per_batch])
genes_to_index(genes[, on]): Returns the index of a subset of genes, given their on attribute in genes.
get_batch_mask_cell_measurement(attribute_name): Returns a list whose length is the number of batches, where each entry is a mask over present cell measurement columns.
initialize_cell_attribute(attribute_name, …): Sets and registers a cell-wise attribute, e.g. annotation information.
initialize_cell_measurement(measurement): Initializes a cell measurement: sets attributes and updates registers.
initialize_gene_attribute(attribute_name, …): Sets and registers a gene-wise attribute, e.g. annotation information.
initialize_mapped_attribute(…): Sets and registers an attribute mapping, e.g. labels to named cell_types.
map_cell_types(cell_types_dict): Performs in-place filtering of cells using a cell type mapping.
merge_cell_types(cell_types, new_cell_type_name): Merges some cell types into a new one, and changes the labels accordingly.
populate_from_data(X[, Ys, batch_indices, …]): Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.
populate_from_datasets(gene_datasets_list[, …]): Populates the data attribute of a GeneExpressionDataset from multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).
populate_from_per_batch_list(Xs[, …]): Populates the data attributes of a GeneExpressionDataset object from an n_batches-long list of (nb_cells, nb_genes) matrices.
populate_from_per_label_list(Xs[, …]): Populates the data attributes of a GeneExpressionDataset object from an n_labels-long list of (nb_cells, nb_genes) matrices.
raw_counts_properties(idx1, idx2): Computes and returns some statistics on the raw counts of two sub-populations.
register_dataset_version(version_name): Registers a version of the dataset, e.g. normalized version.
reorder_cell_types(new_order): Reorders the cell types in place.
reorder_genes(first_genes[, drop_omitted_genes]): Performs an in-place reordering of genes and gene-related attributes.
subsample_cells([size]): Wrapper around update_cells allowing for automatic (based on sum of counts) subsampling.
subsample_genes([new_n_genes, …]): Wrapper around update_genes allowing for manual and automatic (based on count variance) subsampling.
to_anndata(): Converts the dataset to an anndata.AnnData object.
update_cells(subset_cells): Performs an in-place sub-sampling of cells and cell-related attributes.
update_genes(subset_genes): Performs an in-place sub-sampling of genes and gene-related attributes.
Attributes Documentation
X
corrupted_X
Returns the corrupted version of X.
norm_X
Returns a normalized version of X.
Methods Documentation
cell_types_to_labels(cell_types)
Forms a one-to-one corresponding np.ndarray of labels for the specified cell_types.
collate_fn_base(attributes_and_types, batch)
Given indices and attributes to batch, returns a full batch of torch.Tensor.
collate_fn_builder(add_attributes_and_types=None, override=False, corrupted=False)
Returns a collate_fn with the requested shape/attributes.
corrupt(rate=0.1, corruption='uniform')
Forms a corrupted_X attribute containing a corrupted version of X.
Sub-samples rate * self.X.shape[0] * self.X.shape[1] entries and perturbs them according to the corruption method. Namely:
“uniform” multiplies the count by a Bernoulli(0.9)
“binomial” replaces the count with a Binomial(count, 0.2)
A corrupted version of self.X is stored in self.corrupted_X.
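The two corruption schemes described for corrupt can be sketched with NumPy alone. This is an illustration of the description, not scVI's implementation: the helper name corrupt_counts, the seed argument and the dense integer input are assumptions made for the example.

```python
import numpy as np

def corrupt_counts(X, rate=0.1, corruption="uniform", seed=0):
    """Return a corrupted copy of a dense integer count matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    corrupted = X.copy()
    # Number of entries to perturb: rate * n_cells * n_genes.
    n_entries = int(rate * X.shape[0] * X.shape[1])
    flat = rng.choice(X.size, size=n_entries, replace=False)
    rows, cols = np.unravel_index(flat, X.shape)
    if corruption == "uniform":
        # Multiply each selected count by a Bernoulli(0.9) draw (0 or 1).
        corrupted[rows, cols] = X[rows, cols] * rng.binomial(1, 0.9, size=n_entries)
    elif corruption == "binomial":
        # Replace each selected count with a Binomial(count, 0.2) draw.
        corrupted[rows, cols] = rng.binomial(X[rows, cols], 0.2)
    else:
        raise ValueError(f"unknown corruption method: {corruption!r}")
    return corrupted

X = np.random.default_rng(1).poisson(5, size=(20, 10))
corrupted_uniform = corrupt_counts(X, rate=0.1, corruption="uniform")
corrupted_binomial = corrupt_counts(X, rate=0.1, corruption="binomial")
```

Under "uniform" a selected entry is either left unchanged or set to zero; under "binomial" it can only shrink.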
filter_cell_types(cell_types)
Performs in-place filtering of cells by keeping cell types in cell_types.
filter_cells_by_attribute(values_to_keep, on='labels')
Performs in-place cell filtering based on any cell attribute. Uses labels by default.
filter_genes_by_attribute(values_to_keep, on='gene_names')
Performs in-place gene filtering based on any gene attribute. Uses gene_names by default.
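Attribute-based filtering amounts to building a boolean mask from an attribute and applying it along the cell or gene axis. A minimal NumPy sketch follows, assuming a dense matrix. The helper keep_by_attribute is hypothetical and, unlike the class methods, returns new arrays instead of updating registered attributes in place.

```python
import numpy as np

def keep_by_attribute(X, attribute, values_to_keep, axis=0):
    """Keep cells (axis=0) or genes (axis=1) whose attribute value is listed."""
    mask = np.isin(np.asarray(attribute), values_to_keep)
    filtered = X[mask] if axis == 0 else X[:, mask]
    return filtered, mask

X = np.arange(12).reshape(4, 3)          # 4 cells, 3 genes
labels = np.array([0, 1, 0, 2])          # one label per cell
gene_names = np.array(["A", "B", "C"])   # one name per gene

X_cells, cell_mask = keep_by_attribute(X, labels, [0, 2], axis=0)
X_genes, gene_mask = keep_by_attribute(X, gene_names, ["A", "C"], axis=1)
```

In the class itself the same mask would also have to be applied to every registered cell or gene attribute, which is what the registries are for.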
genes_to_index(genes, on=None)
Returns the index of a subset of genes, given their on attribute in genes.
If integers are passed in genes, the function returns genes. If on is None, it defaults to gene_names.
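The lookup behaviour described above (names are resolved to positions, integers pass through unchanged) can be illustrated as follows. The function genes_to_index_sketch is a hypothetical stand-in that takes the gene-wise attribute as an explicit argument, not the method itself.

```python
import numpy as np

def genes_to_index_sketch(genes, gene_attribute):
    """Map attribute values (e.g. gene names) to column indices."""
    genes = np.asarray(genes)
    # Integers are treated as indices already and returned unchanged.
    if np.issubdtype(genes.dtype, np.integer):
        return genes
    position = {value: i for i, value in enumerate(gene_attribute)}
    return np.array([position[g] for g in genes], dtype=np.int64)

gene_names = np.array(["GAPDH", "ACTB", "CD3E", "MS4A1"])
idx_from_names = genes_to_index_sketch(["CD3E", "GAPDH"], gene_names)
idx_from_ints = genes_to_index_sketch([2, 0], gene_names)
```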
get_batch_mask_cell_measurement(attribute_name)
Returns a list whose length is the number of batches, where each entry is a mask over present cell measurement columns.
initialize_cell_attribute(attribute_name, attribute, categorical=False)
Sets and registers a cell-wise attribute, e.g. annotation information.
initialize_cell_measurement(measurement)
Initializes a cell measurement: sets attributes and updates registers.
initialize_gene_attribute(attribute_name, attribute)
Sets and registers a gene-wise attribute, e.g. annotation information.
initialize_mapped_attribute(source_attribute_name, mapping_name, mapping_values)
Sets and registers an attribute mapping, e.g. labels to named cell_types.
map_cell_types(cell_types_dict)
Performs in-place filtering of cells using a cell type mapping.
Cell types in the keys of cell_types_dict are merged and given the name of the associated value.
merge_cell_types(cell_types, new_cell_type_name)
Merges some cell types into a new one, and changes the labels accordingly. The old cell types are not erased but ‘#merged’ is appended to their names.
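To make the ‘#merged’ convention concrete, here is a small sketch of merging on a labels array and a cell_types array. It assumes the new cell type is appended at the end of cell_types and does not renumber labels afterwards; both are assumptions of this example, and merge_cell_types_sketch is a hypothetical name.

```python
import numpy as np

def merge_cell_types_sketch(labels, cell_types, to_merge, new_cell_type_name):
    """Relabel cells of the merged types and mark the old names with '#merged'."""
    labels = np.asarray(labels).copy()
    names = [str(c) for c in cell_types]
    merged_ids = [names.index(c) for c in to_merge]
    new_id = len(names)  # assumption: the new type is appended last
    labels[np.isin(labels, merged_ids)] = new_id
    for i in merged_ids:
        names[i] += "#merged"  # old types are kept, only renamed
    names.append(new_cell_type_name)
    return labels, np.array(names)

labels, cell_types = merge_cell_types_sketch(
    labels=[0, 1, 2, 1],
    cell_types=["B", "CD4 T", "CD8 T"],
    to_merge=["CD4 T", "CD8 T"],
    new_cell_type_name="T",
)
```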
populate_from_data(X, Ys=None, batch_indices=None, labels=None, gene_names=None, cell_types=None, cell_attributes_dict=None, gene_attributes_dict=None, remap_attributes=True)
Populates the data attributes of a GeneExpressionDataset object from a (nb_cells, nb_genes) matrix.
- Parameters
X (Union[ndarray, csr_matrix]) – RNA counts matrix, sparse format supported (e.g. scipy.sparse.csr_matrix).
Ys (Optional[List[CellMeasurement]]) – List of paired count measurements (e.g. CITE-seq protein measurements, spatial coordinates).
batch_indices (Union[List[int], ndarray, csr_matrix, None]) – np.ndarray with shape (nb_cells,). Maps each cell to the batch it originates from. Note that a batch most likely refers to a specific piece of tissue or a specific experimental protocol.
labels (Union[List[int], ndarray, csr_matrix, None]) – np.ndarray with shape (nb_cells,). Cell-wise labels. Can be mapped to cell types using attribute mappings.
gene_names (Union[List[str], ndarray, None]) – List or np.ndarray with length/shape (nb_genes,). Maps each gene to its name.
cell_types (Union[List[str], ndarray, None]) – Maps each integer label in labels to a cell type.
cell_attributes_dict (Optional[Dict[str, Union[List, ndarray]]]) – Dictionary mapping attribute names to a List or np.ndarray with shape (nb_cells,).
gene_attributes_dict (Optional[Dict[str, Union[List, ndarray]]]) – Dictionary mapping attribute names to a List or np.ndarray with shape (nb_genes,).
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.
populate_from_datasets(gene_datasets_list, shared_labels=True, mapping_reference_for_sharing=None, cell_measurement_intersection=None)
Populates the data attribute of a GeneExpressionDataset from multiple GeneExpressionDataset objects, merged using the intersection of a gene-wise attribute (gene_names by default).
Warning: The merging procedure modifies the gene_dataset objects given as inputs.
For gene-wise attributes, only the attributes of the first dataset are kept. For cell-wise attributes, either we “concatenate” or add an “offset” corresponding to the number of already existing categories.
- Parameters
gene_datasets_list (List[GeneExpressionDataset]) – GeneExpressionDataset objects to be merged.
shared_labels – Whether to share labels through the cell_types mapping or not. (Default value = True)
mapping_reference_for_sharing (Optional[Dict[str, Optional[str]]]) – Instructions on how to share cell-wise attributes between datasets. Keys are the attribute names and values are the registered mapped attributes. If provided, the mapping is merged across all datasets and then the attribute is remapped using index backtracking between the old and merged mapping. If no mapping is provided, the values are concatenated and an offset is added if the attribute is registered as categorical in the first dataset.
cell_measurement_intersection (Optional[Dict[str, bool]]) – A dictionary with keys being cell measurement attributes and values being True or False. If True, that cell measurement attribute will be intersected across datasets. If False, the union is taken. Defaults to intersection for each cell_measurement.
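The gene intersection and the “offset” rule for categorical cell-wise attributes can be sketched on plain arrays. The function merge_two is hypothetical, handles only two datasets and batch indices, and orders the shared genes alphabetically (np.intersect1d sorts its output), which may differ from the ordering the method produces.

```python
import numpy as np

def merge_two(X_a, genes_a, batches_a, X_b, genes_b, batches_b):
    """Merge two count matrices on shared gene names, offsetting batch indices."""
    shared = np.intersect1d(genes_a, genes_b)  # sorted, unique shared names
    pos_a = {g: i for i, g in enumerate(genes_a)}
    pos_b = {g: i for i, g in enumerate(genes_b)}
    cols_a = [pos_a[g] for g in shared]
    cols_b = [pos_b[g] for g in shared]
    X = np.concatenate([X_a[:, cols_a], X_b[:, cols_b]], axis=0)
    # Offset the second dataset's batches by the number of existing categories.
    offset = int(np.max(batches_a)) + 1
    batch_indices = np.concatenate([np.asarray(batches_a), np.asarray(batches_b) + offset])
    return X, shared, batch_indices

X_a = np.arange(6).reshape(2, 3)
X_b = np.arange(8).reshape(2, 4) * 10
X, shared, batch_indices = merge_two(
    X_a, ["A", "B", "C"], [0, 0],
    X_b, ["C", "D", "A", "E"], [0, 1],
)
```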
populate_from_per_batch_list(Xs, labels_per_batch=None, gene_names=None, cell_types=None, remap_attributes=True)
Populates the data attributes of a GeneExpressionDataset object from an n_batches-long list of (nb_cells, nb_genes) matrices.
- Parameters
Xs (List[Union[csr_matrix, ndarray]]) – RNA counts in the form of a list of np.ndarray with shape (…, nb_genes).
labels_per_batch (Union[ndarray, List[ndarray], None]) – List of cell-wise labels for each batch.
gene_names (Union[List[str], ndarray, None]) – Gene names, stored as str.
cell_types (Union[List[str], ndarray, None]) – Cell types, stored as str.
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.
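Conceptually, a per-batch list is equivalent to one stacked matrix plus a batch index per cell. The sketch below shows that correspondence for dense arrays; stack_batches is a hypothetical helper, not part of the class.

```python
import numpy as np

def stack_batches(Xs):
    """Stack per-batch matrices and derive one batch index per cell."""
    X = np.concatenate(Xs, axis=0)
    batch_indices = np.concatenate(
        [np.full(len(x), i) for i, x in enumerate(Xs)]
    )
    return X, batch_indices

Xs = [np.ones((2, 3), dtype=int), np.zeros((3, 3), dtype=int)]
X, batch_indices = stack_batches(Xs)
```

The resulting pair has the shapes that populate_from_data documents for X and batch_indices.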
populate_from_per_label_list(Xs, batch_indices_per_label=None, gene_names=None, remap_attributes=True)
Populates the data attributes of a GeneExpressionDataset object from an n_labels-long list of (nb_cells, nb_genes) matrices.
- Parameters
Xs (List[Union[csr_matrix, ndarray]]) – RNA counts in the form of a list of np.ndarray with shape (…, nb_genes).
batch_indices_per_label (Optional[List[Union[List[int], ndarray]]]) – Cell-wise batch indices, for each cell label.
gene_names (Union[List[str], ndarray, None]) – Gene names, stored as str.
remap_attributes (bool) – If set to True (default), the function calls remap_categorical_attributes at the end.
raw_counts_properties(idx1, idx2)
Computes and returns some statistics on the raw counts of two sub-populations.
- Return type
Tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]
- Returns
Tuple of np.ndarray containing, by pair (one for each sub-population), mean expression per gene, proportion of non-zero expression per gene, mean of normalized expression.
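A sketch of the six returned statistics, computed on a dense matrix, might look as follows. The normalization used here (dividing each cell by its total count) is an assumption of this example and may not match what norm_X holds; raw_counts_properties_sketch is a hypothetical name.

```python
import numpy as np

def raw_counts_properties_sketch(X, idx1, idx2):
    """Per-gene statistics for two sub-populations of cells, returned by pair."""
    X = np.asarray(X, dtype=float)
    # Assumed normalization: divide each cell by its total count (guarding empty cells).
    totals = X.sum(axis=1, keepdims=True)
    norm_X = X / np.maximum(totals, 1.0)

    def stats(idx):
        sub = X[idx]
        return sub.mean(axis=0), (sub != 0).mean(axis=0), norm_X[idx].mean(axis=0)

    mean1, nonz1, norm_mean1 = stats(idx1)
    mean2, nonz2, norm_mean2 = stats(idx2)
    return mean1, mean2, nonz1, nonz2, norm_mean1, norm_mean2

X = np.array([[1, 0], [3, 0], [0, 2], [0, 6]])
mean1, mean2, nonz1, nonz2, norm_mean1, norm_mean2 = raw_counts_properties_sketch(
    X, idx1=[0, 1], idx2=[2, 3]
)
```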
register_dataset_version(version_name)
Registers a version of the dataset, e.g. normalized version.
reorder_cell_types(new_order)
Reorders the cell types in place. The cell types provided are placed at the beginning of the cell_types attribute; any existing cell types omitted from new_order are kept after them.
reorder_genes(first_genes, drop_omitted_genes=False)
Performs an in-place reordering of genes and gene-related attributes.
Reorders genes according to the first_genes list of gene names. Consequently, modifies in-place the data X and the registered gene attributes.
- Parameters
first_genes (Union[List[str], ndarray]) – New ordering of the genes. If drop_omitted_genes is False, genes missing from first_genes are added after them, in the same order as before.
drop_omitted_genes (bool) – If True, genes omitted from first_genes are dropped; if False, they are kept.
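The ordering rule (listed genes first, remaining genes afterwards in their previous order unless dropped) can be written out on plain arrays. reorder_genes_sketch is a hypothetical helper that returns new arrays instead of modifying a dataset in place.

```python
import numpy as np

def reorder_genes_sketch(X, gene_names, first_genes, drop_omitted_genes=False):
    """Put first_genes first; keep or drop the remaining genes."""
    gene_names = np.asarray(gene_names)
    position = {g: i for i, g in enumerate(gene_names)}
    first = [position[g] for g in first_genes]
    if drop_omitted_genes:
        order = first
    else:
        listed = set(first)
        # Omitted genes follow, in the order they had before.
        order = first + [i for i in range(len(gene_names)) if i not in listed]
    order = np.array(order, dtype=int)
    return X[:, order], gene_names[order]

X = np.arange(8).reshape(2, 4)
names = ["A", "B", "C", "D"]
X_kept, names_kept = reorder_genes_sketch(X, names, ["C", "A"])
X_dropped, names_dropped = reorder_genes_sketch(X, names, ["C", "A"], drop_omitted_genes=True)
```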
subsample_cells(size=1.0)
Wrapper around update_cells allowing for automatic (based on sum of counts) subsampling.
If size is a:
float in (0, 1): subsample 100 * size % of the cells
int: subsample size cells
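The float/int handling of size can be sketched as below. The text only says the selection is based on the sum of counts; this example assumes the cells with the largest total counts are kept, and top_cells_by_count is a hypothetical name.

```python
import numpy as np

def top_cells_by_count(X, size=1.0):
    """Return indices of the cells to keep, ranked by total count (assumed rule)."""
    n_cells = X.shape[0]
    # A float is read as a fraction of the cells, an int as an absolute number.
    n_keep = int(size * n_cells) if isinstance(size, float) else int(size)
    totals = np.asarray(X.sum(axis=1)).ravel()
    order = np.argsort(totals)[::-1]  # largest totals first
    return order[:n_keep]

X = np.array([[1, 1], [5, 5], [0, 0], [3, 3]])
half = top_cells_by_count(X, size=0.5)
three = top_cells_by_count(X, size=3)
```

The returned index array is the kind of argument update_cells accepts as subset_cells.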
subsample_genes(new_n_genes=None, new_ratio_genes=None, subset_genes=None, mode='seurat_v3', batch_correction=True, **highly_var_genes_kwargs)
Wrapper around update_genes allowing for manual and automatic (based on count variance) subsampling.
The function either:
Subsamples new_n_genes genes among all genes
Subsamples a proportion of new_ratio_genes of the genes
Subsamples the genes in subset_genes
In the first two cases, a mode of highly variable gene selection is used as specified in the mode argument.
In the case where new_n_genes, new_ratio_genes and subset_genes are all None, this method automatically computes the number of genes to keep (when mode='seurat_v2' or mode='cell_ranger').
In the case where mode=="seurat_v3", an adapted version of the method described in [Stuart19] is used. This method requires new_n_genes or new_ratio_genes to be specified.
In the case where mode=="poisson_zeros", a method based on [Andrews & Hemberg 2019] is used. This method requires new_n_genes or new_ratio_genes to be specified.
- Parameters
subset_genes (Union[List[int], List[bool], ndarray, None]) – List of indices or mask of genes to retain.
new_n_genes (Optional[int]) – Number of genes to retain; the highly variable genes will be kept.
new_ratio_genes (Optional[float]) – Proportion of genes to retain; the highly variable genes will be kept.
mode (Optional[str]) – Either “variance”, “seurat_v2”, “cell_ranger”, “seurat_v3” or “poisson_zeros”.
batch_correction (Optional[bool]) – Account for batches when choosing highly variable genes. HVGs are selected in each batch and merged.
highly_var_genes_kwargs – Kwargs to feed to highly_variable_genes when using seurat_v2 or cell_ranger (cf. highly_variable_genes method).
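As an illustration of the simplest case, a plain ranking by per-gene count variance can be sketched as follows. This is not the seurat_v2, cell_ranger, seurat_v3 or poisson_zeros procedure, and it ignores batch_correction; top_variance_genes is a hypothetical helper.

```python
import numpy as np

def top_variance_genes(X, new_n_genes=None, new_ratio_genes=None):
    """Return sorted indices of the genes with the highest count variance."""
    n_genes = X.shape[1]
    if new_n_genes is None:
        if new_ratio_genes is None:
            raise ValueError("specify new_n_genes or new_ratio_genes")
        new_n_genes = int(new_ratio_genes * n_genes)
    variances = np.var(np.asarray(X, dtype=float), axis=0)
    keep = np.argsort(variances)[::-1][:new_n_genes]
    # Sorting keeps the selected genes in their original column order.
    return np.sort(keep)

X = np.array([[0, 1, 5], [0, 1, 0], [0, 3, 10]])
by_number = top_variance_genes(X, new_n_genes=2)
by_ratio = top_variance_genes(X, new_ratio_genes=0.34)
```

The returned indices play the role of subset_genes, i.e. an explicit list of genes to retain.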
to_anndata()
Converts the dataset to an anndata.AnnData object. The obtained dataset can then be saved/retrieved using the anndata API.
update_cells(subset_cells)
Performs an in-place sub-sampling of cells and cell-related attributes.
Sub-selects cells according to the subset_cells sub-index. Consequently, modifies in-place the data X, its versions and the registered cell attributes.
- Parameters
subset_cells – Index used for cell sub-sampling. Either an int array with arbitrary shape whose values are the indices of the cells to keep, or a boolean array used as a mask-like index.
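Both accepted forms of subset_cells select the same cells when they describe the same subset. The sketch below applies a one-dimensional index to a matrix and to a dictionary of cell-wise attributes; update_cells_sketch and the dictionary stand in for the class's cell registry and are assumptions of this example.

```python
import numpy as np

def update_cells_sketch(X, cell_attributes, subset_cells):
    """Apply an integer index or a boolean mask to X and to every cell attribute."""
    subset_cells = np.asarray(subset_cells)
    X_new = X[subset_cells]
    attributes_new = {
        name: np.asarray(values)[subset_cells]
        for name, values in cell_attributes.items()
    }
    return X_new, attributes_new

X = np.arange(6).reshape(3, 2)
attributes = {"labels": [0, 1, 2]}
X_idx, attrs_idx = update_cells_sketch(X, attributes, [0, 2])
X_mask, attrs_mask = update_cells_sketch(X, attributes, [True, False, True])
```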