CellAssign#
CellAssign [1] (Python class CellAssign
) is a simple yet, efficient
approach for annotating scRNA-seq data in the scenario in which cell-type-specific
gene markers are known. The method also allows users to control for nuisance covariates
like batch or donor. The scvi-tools implementation of CellAssign uses stochastic inference,
such that CellAssign will scale to very large datasets.
The advantages of CellAssign are:
Lightweight model that can be fit quickly.
Ability to control for nuisance factors.
The limitations of CellAssign include:
Requirement for a cell types by gene markers binary matrix.
The simple linear model may not handle non-linear batch effects.
Preliminaries#
CellAssign takes as input a scRNA-seq gene expression matrix \(X\) with \(N\) cells and \(G\) genes along with a cell-type marker matrix \(\rho\) which is a binary matrix of \(G\) genes by \(C\) cell types denoting if a gene is a marker of a particular cell type. A size factor \(s\) for each cell may also be provided as input, otherwise it is computed empirically as the total unique molecular identifier count of a cell. Additionally, a design matrix \(D\) containing \(p\) observed covariates, such as day, donor, etc, is an optional input.
Generative process#
CellAssign uses a negative binomial mixture model to make cell-type predictions. The cell-type proportion is drawn from a Dirichlet distribution,
with \(\alpha = 0.01\).
CellAssign then posits that the observed gene expression counts \(x_{ng}\) for cell \(n\) and gene \(g\) are generated by the following process:
Notably, the logarithm of the mean of the negative binomial for cell $n$, gene \(g\), given that it belongs to cell type \(c\) (Equation \(\ref{a}\)) is computed as the sum of (1) the base expression of a gene \(g\), \(\beta_{g0}\), (2) a cell-type-specific overexpression term for a gene, \(\delta_{gc}\), (3) an offset for the size factor, \(\log s_n\), and (4) a linear combination of covariates in the design matrix (weighted by coefficients \(\beta_{gi}\)). A cell-specific discrete latent variable, \(z_n\), represents the cell-type assignment for cell $n$.
Furthermore, the inverse dispersion of the negative binomial, \(\tilde{\phi}_{ngc}\) (Equation \(\ref{b}\)) is computed with a sum of radial basis functions of the mean centered on $v_i$ with parameters $a_i$ and $b$. In total, there are $B$ centers $v_1, v_2, …, v_{10}$ that are set a priori to be equally spaced between 0 and the maximum count value of $X$. Additionally, as in the original implementation we used \(B=10\). The parameters \(a_i\) and \(b\) are further described below.
Inference#
CellAssign uses expectation maximization (EM) to optimize its parameters and provides a cell-type prediction for each cell.
Initialization#
CellAssign initializes parameters as follows: \(\delta_{gc}\) is initialized with a \(\textrm{LogNormal}(0, 1)\) draw, \(\bar{\delta}\) is initialized to 0, \(\sigma^2\) is initialized to 1, \(\pi_i\) is initialized to \(1/C\), \(\beta_{gi}\) is initialized with a \(\mathcal{N}(0, 1)\) draw, and the inverse dispersion parameter \(\log a_i\) is initialized to 0. Additionally, \(b\) is fixed a priori to be
where \(v_1\) and \(v_2\) are the first and second center determined as stated previously.
E step#
The E step consists of computing the expected joint log likelihood with respect to the conditional posterior, \(p(z_n \mid x_n, \delta, \beta, a, \pi)\), for each cell. This conditional posterior is computed in closed form as
which is normalized over all $c$. The expected joint log likelihood over all \(N\) cells is then computed as
Herein lies the major difference between the scvi-tools implementation and the original CellAssign implementation. Notably, in scvi-tools we compute this expected joint log likelihood using a mini-batch of 1,024 cells, using the fact that
for a minibatch of $M<N$ cells, where \(\tau\) is a function describing a permutation of the data indices ${1, 2, …, N}$.
M step#
The parameters of the expected joint log likelihood (\(\pi, \delta, \beta, a, \bar{\delta}, \sigma^2\)) are optimized as in the original implementation [2], using the Adam optimizer [3], except that now an optimization step corresponds to data from one minibatch. Following the original implementation, we also clamped \(\delta > 2\). We also added early stopping with respect to the log likelihood of a held-out validation set.
Tasks#
Cell type prediction#
The primary task of CellAssign is to predict cell types for each cell. This is accomplished by:
>>> model = CellAssign(adata, marker_gene_matrix, size_factor_key='size_factor')
>>> model.train()
>>> predictions = model.predict(adata)
where predictions
stores \(\gamma_{nc}\) for each cell \(n\) and cell type \(c\).
Implementation details#
The logic implementing CellAssign can be found in scvi.external.cellassign.CellAssignModule
.
The implementation uses the same variable names as the math.
The core logic is implemented in
scvi.external.cellassign.CellAssignModule.generative()
. In this method, the E step is taken and the log likelihood \(\log p(X \mid \beta, a, \bar{\delta}, \sigma^2, z_n=c)\) is computed for all cell types.In
scvi.external.cellassign.CellAssignModule.loss()
the full expected log likelihood is computed, as well as the penalities corresponding to the priors on \(\pi\) and \(\delta\).CellAssign uses the standard
TrainingPlan
.