CellAssign#

CellAssign [1] (Python class CellAssign) is a simple yet, efficient approach for annotating scRNA-seq data in the scenario in which cell-type-specific gene markers are known. The method also allows users to control for nuisance covariates like batch or donor. The scvi-tools implementation of CellAssign uses stochastic inference, such that CellAssign will scale to very large datasets.

The advantages of CellAssign are:

Lightweight model that can be fit quickly.
Ability to control for nuisance factors.

The limitations of CellAssign include:

Requirement for a cell types by gene markers binary matrix.
The simple linear model may not handle non-linear batch effects.

Preliminaries#

CellAssign takes as input a scRNA-seq gene expression matrix $X$ with $N$ cells and $G$ genes along with a cell-type marker matrix $\rho$ which is a binary matrix of $G$ genes by $C$ cell types denoting if a gene is a marker of a particular cell type. A size factor $s$ for each cell may also be provided as input, otherwise it is computed empirically as the total unique molecular identifier count of a cell. Additionally, a design matrix $D$ containing $p$ observed covariates, such as day, donor, etc, is an optional input.

Generative process#

CellAssign uses a negative binomial mixture model to make cell-type predictions. The cell-type proportion is drawn from a Dirichlet distribution,

\[ (\pi_1, ..., \pi_C) \sim \textrm{Dirichlet}(\alpha, ..., \alpha), \tag{1} \]

with $\alpha = 0.01$.

CellAssign then posits that the observed gene expression counts $x_{ng}$ for cell $n$ and gene $g$ are generated by the following process:

\begin{align} \delta_{gc} &\sim \textrm{LogNormal}(\bar{\delta}, \sigma^2) \tag{2} \\ \log \mu _{ngc} &= \log s_n + \delta _{gc}\rho _{gc}+ \beta _{g0}+ \sum_p{\beta _{gp}d_{pn}} \label{a} \tag{3}\\ \tilde \phi _{ngc} &= \sum_{i = 1}^B {a_i \times \exp ({ - b \times ( {\mu _{ngc} - v_i})^2})} \label{b} \tag{4}\\ z_n &\sim \textrm{Discrete}(\pi) \tag{5}\\ x_{ng} \mid z_n = c &\sim \textrm{NegativeBinomial}(\mu_{ngc}, \tilde \phi _{ngc}) \tag{6} \end{align}

Notably, the logarithm of the mean of the negative binomial for cell $n$, gene $g$, given that it belongs to cell type $c$ (Equation $\ref{a}$) is computed as the sum of (1) the base expression of a gene $g$, $\beta_{g0}$, (2) a cell-type-specific overexpression term for a gene, $\delta_{gc}$, (3) an offset for the size factor, $\log s_n$, and (4) a linear combination of covariates in the design matrix (weighted by coefficients $\beta_{gi}$). A cell-specific discrete latent variable, $z_n$, represents the cell-type assignment for cell $n$.

Furthermore, the inverse dispersion of the negative binomial, $\tilde{\phi}_{ngc}$ (Equation $\ref{b}$) is computed with a sum of radial basis functions of the mean centered on $v_i$ with parameters $a_i$ and $b$. In total, there are $B$ centers $v_1, v_2, …, v_{10}$ that are set a priori to be equally spaced between 0 and the maximum count value of $X$. Additionally, as in the original implementation we used $B=10$. The parameters $a_i$ and $b$ are further described below.

Inference#

CellAssign uses expectation maximization (EM) to optimize its parameters and provides a cell-type prediction for each cell.

Initialization#

CellAssign initializes parameters as follows: $\delta_{gc}$ is initialized with a $\textrm{LogNormal}(0, 1)$ draw, $\bar{\delta}$ is initialized to 0, $\sigma^2$ is initialized to 1, $\pi_i$ is initialized to $1/C$, $\beta_{gi}$ is initialized with a $\mathcal{N}(0, 1)$ draw, and the inverse dispersion parameter $\log a_i$ is initialized to 0. Additionally, $b$ is fixed a priori to be

\[ b = \frac{1}{2(v_2 - v_1)^2}, \tag{7} \]

where $v_1$ and $v_2$ are the first and second center determined as stated previously.

E step#

The E step consists of computing the expected joint log likelihood with respect to the conditional posterior, $p(z_n \mid x_n, \delta, \beta, a, \pi)$, for each cell. This conditional posterior is computed in closed form as

\[ \gamma_{nc} := p(z_n = c \mid x_n, \delta, \beta, a, \pi) \propto p(z_n = c \mid \pi)\prod_g p(x_{ng} \mid \mu_{ngc}, \tilde{\phi}_{ngc}), \tag{8} \]

which is normalized over all $c$. The expected joint log likelihood over all $N$ cells is then computed as

\begin{align} \begin{split} \mathbb{E}_{z \mid X,\pi,\delta, \beta, a}[\log p(X, \pi, \delta \mid \beta, a, \bar{\delta}, \sigma^2)] % &=\log p(\theta) + \log p(\delta) +\sum_n\sum_{c}\gamma_{nc}\log p(y_{n}|c)\\ &= \sum_{n=1}^{N}\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{ng}|z_n = c) \\ & \qquad + \log p(\pi) + \log p(\delta). \end{split} \tag{9} \end{align}

Herein lies the major difference between the scvi-tools implementation and the original CellAssign implementation. Notably, in scvi-tools we compute this expected joint log likelihood using a mini-batch of 1,024 cells, using the fact that

\[ \sum_{n=1}^{N}\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{ng}|z_n = c) \approx \frac{N}{M}\sum_{m=1}^M\sum_{c=1}^{C}\gamma_{nc}\sum_{g=1}^{G}\log p(x_{\tau(m)g}|z_n = c) \tag{10} \]

for a minibatch of $M<N$ cells, where $\tau$ is a function describing a permutation of the data indices ${1, 2, …, N}$.

M step#

The parameters of the expected joint log likelihood ($\pi, \delta, \beta, a, \bar{\delta}, \sigma^2$) are optimized as in the original implementation [2], using the Adam optimizer [3], except that now an optimization step corresponds to data from one minibatch. Following the original implementation, we also clamped $\delta > 2$. We also added early stopping with respect to the log likelihood of a held-out validation set.

Tasks#

Cell type prediction#

The primary task of CellAssign is to predict cell types for each cell. This is accomplished by:

>>> model = CellAssign(adata, marker_gene_matrix, size_factor_key='size_factor')
>>> model.train()
>>> predictions = model.predict(adata)

where predictions stores $\gamma_{nc}$ for each cell $n$ and cell type $c$.

Implementation details#

The logic implementing CellAssign can be found in scvi.external.cellassign.CellAssignModule. The implementation uses the same variable names as the math.

The core logic is implemented in scvi.external.cellassign.CellAssignModule.generative(). In this method, the E step is taken and the log likelihood $\log p(X \mid \beta, a, \bar{\delta}, \sigma^2, z_n=c)$ is computed for all cell types.

In scvi.external.cellassign.CellAssignModule.loss() the full expected log likelihood is computed, as well as the penalities corresponding to the priors on $\pi$ and $\delta$.

CellAssign uses the standard TrainingPlan.