CellAssign#

CellAssign [1] (Python class CellAssign) is a simple yet, efficient approach for annotating scRNA-seq data in the scenario in which cell-type-specific gene markers are known. The method also allows users to control for nuisance covariates like batch or donor. The scvi-tools implementation of CellAssign uses stochastic inference, such that CellAssign will scale to very large datasets.

The advantages of CellAssign are:

Lightweight model that can be fit quickly.
Ability to control for nuisance factors.

The limitations of CellAssign include:

Requirement for a cell types by gene markers binary matrix.
The simple linear model may not handle non-linear batch effects.

Preliminaries#

CellAssign takes as input a scRNA-seq gene expression matrix \(X\) with \(N\) cells and \(G\) genes along with a cell-type marker matrix \(\rho\) which is a binary matrix of \(G\) genes by \(C\) cell types denoting if a gene is a marker of a particular cell type. A size factor \(s\) for each cell may also be provided as input, otherwise it is computed empirically as the total unique molecular identifier count of a cell. Additionally, a design matrix \(D\) containing \(p\) observed covariates, such as day, donor, etc, is an optional input.

Generative process#

CellAssign uses a negative binomial mixture model to make cell-type predictions. The cell-type proportion is drawn from a Dirichlet distribution,

\[ (\pi_1, ..., \pi_C) \sim \textrm{Dirichlet}(\alpha, ..., \alpha), \tag{1} \]

with \(\alpha = 0.01\).

CellAssign then posits that the observed gene expression counts \(x_{ng}\) for cell \(n\) and gene \(g\) are generated by the following process:

\begin{align} \delta_{gc} &\sim \textrm{LogNormal}(\bar{\delta}, \sigma^2) \tag{2} \\ \log \mu _{ngc} &= \log s_n + \delta _{gc}\rho _{gc}+ \beta _{g0}+ \sum_p{\beta _{gp}d_{pn}} \label{a} \tag{3}\\ \tilde \phi _{ngc} &= \sum_{i = 1}^B {a_i \times \exp ({ - b \times ( {\mu _{ngc} - v_i})^2})} \label{b} \tag{4}\\ z_n &\sim \textrm{Discrete}(\pi) \tag{5}\\ x_{ng} \mid z_n = c &\sim \textrm{NegativeBinomial}(\mu_{ngc}, \tilde \phi _{ngc}) \tag{6} \end{align}

Notably, the logarithm of the mean of the negative binomial for cell \(n\), gene \(g\), given that it belongs to cell type \(c\) (Equation \(\ref{a}\)) is computed as the sum of (1) the base expression of a gene \(g\), \(\beta_{g0}\), (2) a cell-type-specific overexpression term for a gene, \(\delta_{gc}\), (3) an offset for the size factor, \(\log s_n\), and (4) a linear combination of covariates in the design matrix (weighted by coefficients \(\beta_{gi}\)). A cell-specific discrete latent variable, \(z_n\), represents the cell-type assignment for cell \(n\).

Furthermore, the inverse dispersion of the negative binomial, \(\tilde{\phi}_{ngc}\) (Equation \(\ref{b}\)) is computed with a sum of radial basis functions of the mean centered on \(v_i\) with parameters \(a_i\) and \(b\). In total, there are \(B\) centers \(v_1, v_2, ..., v_{10}\) that are set a priori to be equally spaced between 0 and the maximum count value of \(X\). Additionally, as in the original implementation we used \(B=10\). The parameters \(a_i\) and \(b\) are further described below.

Inference#

CellAssign uses expectation maximization (EM) to optimize its parameters and provides a cell-type prediction for each cell.

Initialization#

CellAssign initializes parameters as follows: \(\delta_{gc}\) is initialized with a \(\textrm{LogNormal}(0, 1)\) draw, \(\bar{\delta}\) is initialized to 0, \(\sigma^2\) is initialized to 1, \(\pi_i\) is initialized to \(1/C\), \(\beta_{gi}\) is initialized with a \(\mathcal{N}(0, 1)\) draw, and the inverse dispersion parameter \(\log a_i\) is initialized to 0. Additionally, \(b\) is fixed a priori to be

\[ b = \frac{1}{2(v_2 - v_1)^2}, \tag{7} \]

where \(v_1\) and \(v_2\) are the first and second center determined as stated previously.

E step#

The E step consists of computing the expected joint log likelihood with respect to the conditional posterior, \(p(z_n \mid x_n, \delta, \beta, a, \pi)\), for each cell. This conditional posterior is computed in closed form as

\[ \gamma_{nc} := p(z_n = c \mid x_n, \delta, \beta, a, \pi) \propto p(z_n = c \mid \pi)\prod_g p(x_{ng} \mid \mu_{ngc}, \tilde{\phi}_{ngc}), \tag{8} \]

which is normalized over all \(c\). The expected joint log likelihood over all \(N\) cells is then computed as

\begin{align} \begin{split} \mathbb{E}_{z \mid X,\pi,\delta, \beta, a}[\log p(X, \pi, \delta \mid \beta, a, \bar{\delta}, \sigma^2)] % &=\log p(\theta) + \log p(\delta) +\sum_n\sum_{c}\gamma_{nc}\log p(y_{n}|c)\\ &= \sum_{n=1}^{N}\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{ng}|z_n = c) \\ & \qquad + \log p(\pi) + \log p(\delta). \end{split} \tag{9} \end{align}

Herein lies the major difference between the scvi-tools implementation and the original CellAssign implementation. Notably, in scvi-tools we compute this expected joint log likelihood using a mini-batch of 1,024 cells, using the fact that

\[ \sum_{n=1}^{N}\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{ng}|z_n = c) \approx \frac{N}{M}\sum_{m=1}^M\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{\tau(m)g}|z_n = c) \tag{10} \]

for a minibatch of \(M<N\) cells, where \(\tau\) is a function describing a permutation of the data indices \({1, 2, ..., N}\).

M step#

The parameters of the expected joint log likelihood (\(\pi, \delta, \beta, a, \bar{\delta}, \sigma^2\)) are optimized as in the original implementation [2], using the Adam optimizer [3], except that now an optimization step corresponds to data from one minibatch. Following the original implementation, we also clamped \(\delta > 2\). We also added early stopping with respect to the log likelihood of a held-out validation set.

Tasks#

Cell type prediction#

The primary task of CellAssign is to predict cell types for each cell. This is accomplished by:

>>> model = CellAssign(adata, marker_gene_matrix, size_factor_key='size_factor')
>>> model.train()
>>> predictions = model.predict(adata)

where predictions stores \(\gamma_{nc}\) for each cell \(n\) and cell type \(c\).

Implementation details#

The logic implementing CellAssign can be found in scvi.external.cellassign.CellAssignModule. The implementation uses the same variable names as the math.

The core logic is implemented in scvi.external.cellassign.CellAssignModule.generative(). In this method, the E step is taken and the log likelihood \(\log p(X \mid \beta, a, \bar{\delta}, \sigma^2, z_n=c)\) is computed for all cell types.

In scvi.external.cellassign.CellAssignModule.loss() the full expected log likelihood is computed, as well as the penalities corresponding to the priors on \(\pi\) and \(\delta\).

CellAssign uses the standard TrainingPlan.