Information Constraints on Auto-Encoding Variational Bayes

Romain Lopez, Jeffrey Regier, Michael I. Jordan, Nir Yosef

Introduction

Since the introduction of variational auto-encoders (VAEs) , graphical models whose conditional distribution are specified by deep neural networks have become commonplace. For problems where all that matters is the goodness-of-fit (e.g., marginal log probability of the data), there is little reason to constrain the flexibility/expressiveness of these networks other than possible considerations of overfitting. In other problems, however, some latent representations may be preferable to others—for example, for reasons of interpretability or modularity. Traditionally, such constraints on latent representations have been expressed in the graphical model setting via conditional independence assumptions. But these assumptions are relatively rigid, and with the advent of highly flexible conditional distributions, it has become important to find ways to constrain latent representations that go beyond the rigid conditional independence structures of classical graphical models.

In this paper, we propose a new method for restricting the search space to latent representations with desired independence properties. As in , we approximate the posterior for each observation $X$ with an encoder network that parameterizes $q_{\phi}(Z\mid X)$ . Restricting this search space amounts to constraining the class of variational distributions that we consider. In particular, we aim to constrain the aggregated variational posterior :

Here $p_{\text{data}}(X)$ denotes the empirical distribution. We aim to enforce independence statements of the form $\hat{q}_{\phi}(Z^{i})\upmodels\hat{q}_{\phi}(Z^{j})$ , where $i$ and $j$ are different coordinates of our latent representation.

Unfortunately, because $\hat{q}_{\phi}(Z)$ is a mixture distribution, computing any standard measure of independence is intractable, even in the case of Gaussian terms . In this paper, we circumvent this problem in a novel way. First, we estimate dependency though a kernel-based measure of independence, in particular the Hilbert-Schmidt Information Criterion (HSIC) . Second, by scaling and then subtracting this measure of dependence in the variational lower bound, we get a new variational lower bound on $\log p(X)$ . Maximizing it amounts to maximizing the traditional variational lower bound with a penalty for deviating from the desired independence conditions. We refer to this approach as HSIC-constrained VAE (HCV).

The remainder of the paper is organized as follows. In Section 2, we provide background on VAEs and the HSIC. In Section 3, we precisely define HCV and provide a theoretical analysis. The next three sections each present an application of HVC—one for each task shown in Figure 1. In Section 4, we consider the problem of learning an interpretable latent representation, and we show that HCV compares favorably to $\beta$ -VAE and $\beta$ -TCVAE . In Section 5, we consider the problem of learning an invariant representation, showing both that HCV includes the variational fair auto-encoder (VFAE) as a special case, and that it can improve on the VFAE with respect to its own metrics. In Section 6, we denoise single-cell RNA sequencing data with HCV, and show that our method recovers biological signal better than the current state-of-the-art approach.

Background

In representation learning, we aim to transform a variable $x$ into a representation vector $z$ for which a given downstream task can be performed more efficiently, either computationally or statistically. For example, one may learn a low-dimensional representation that is predictive of a particular label $y$ , as in supervised dictionary learning . More generally, a hierarchical Bayesian model applied to a dataset yields stochastic representations, namely, the sufficient statistics for the model’s posterior distribution. In order to learn representations that respect specific independence statements, we need to bring together two independent lines of research. First, we will present briefly variational auto-encoders and then non-parametric measures of dependence.

We focus on variational auto-encoders which effectively summarize data for many tasks within a Bayesian inference paradigm . Let $\{X,S\}$ denote the set of observed random variables and $Z$ the set of hidden random variables (we will use the notation $z^{i}$ to denote the $i$ -th random variable in the set $Z$ ). Then Bayesian inference aims to maximize the likelihood:

Because the integral is in general intractable, variational inference finds a distribution $q_{\phi}(Z\mid X,S)$ that minimizes a lower bound on the data—the evidence lower bound (ELBO):

In auto-encoding variational Bayes (AEVB), the variational distribution is parametrized by a neural network. In the case of a variational auto-encoder (VAE), both the generative model and the variational approximation have conditional distributions parametrized with neural networks. The difference between the data likelihood and the ELBO is the variational gap:

The original AEVB framework is described in the seminal paper for the case $Z=\{z\},X=\{x\},S=\emptyset$ . The representation $z$ is optimized to “explain” the data $x$ .

AEVB has since been successfully applied and extended. One notable example is the semi-supervised learning case—where $Z=\{z^{1},z^{2}\}$ , $X=\{x\}$ , $y\in X\cup Z$ —which is addressed by the M1 + M2 model . Here, the representation $z_{1}$ both explains the original data and is predictive of the label $y$ . More generally, solving an additional problem is tantamount to adding a node in the underlying graphical model. Finally, the variational distribution can be used to meet different needs: $q_{\phi}(y\mid x)$ is a classifier and $q_{\phi}(z^{1}\mid x)$ summarizes the data.

2 Non-parametric estimates of dependence with kernels

Given this setting, one can embed the distribution $P$ of random variable $u$ into a single point $\mu_{P}$ of the RKHS $\mathcal{H}$ as follows:

If the kernel $k$ is universalA kernel $k$ is universal if $k(x,\cdot)$ is continuous for all $x$ and the RKHS induced by $k$ is dense in $C(\mathcal{X})$ . This is true for the Gaussian kernel $(u,u^{\prime})\mapsto e^{-\gamma||u-u^{\prime}||^{2}}$ when $\gamma>0$ ., then the mean embedding operator $P\mapsto\mu_{P}$ is injective .

We now introduce a kernel-based estimate of distance between two distributions $P$ and $Q$ over the random variable $u$ . This approach will be used by one of our baselines for learning invariant representations. Such a distance, defined via the canonical distance between their $\mathcal{H}$ -embeddings, is called the maximum mean discrepancy and denoted $\text{MMD}(P,Q)$ .

The joint distribution $P(u,v)$ defined over the product space $\mathcal{X}\times\mathcal{Y}$ can be embedded as a point $\mathcal{C}_{uv}$ in the tensor space $\mathcal{H}\otimes\mathcal{K}$ . It can also be interpreted as a linear map $\mathcal{H}\rightarrow\mathcal{K}$ :

The $d$ HSIC generalizes the HSIC to $d$ variables. We present the $d$ HSIC in Appendix A.

Theory for HSIC-Constrained VAE (HCV)

This paper is concerned with intepretability of representations learned via VAEs. Independence between certain components of the representation can aid in interpretability . First, we will explain why AEVB might not be suitable for learning representations that satisfy independence statements. Second, we will present a simple diagnostic in the case where the generative model is fixed. Third, we will introduce HSIC-constrained VAEs (HCV): our method to correct approximate posteriors learned via AEVB in order to recover independent representations.

The goal of learning representation that satisfies certain independence statements can be achieved by adding suitable nodes and edges to the generative distribution graphical model. In particular, marginal independence can be the consequence of an “explaining away” pattern as in Figure 1a for the triplet $\{u,x,v\}$ . If we consider the setting of infinite data and an accurate posterior, we find that independence statements in the generative model are respected in the latent representation:

2 A simple diagnostic in the case of posterior approximation

A theoretical analysis explaining why the empirical aggregated posterior presents some misspecified correlation is not straightforward. The main reason is that the learning of the model parameters $\theta$ along with the variational parameters $\phi$ makes diagnosis hard. As a first line of attack, let us consider the case where we approximate the posterior of a fixed model. Consider learning a posterior $q_{\phi}(Z\mid X,S)$ via naive mean field AEVB. Recent work focuses on decomposing the second term of the ELBO and identifying terms, one of which is the total correlation between hidden variables in the aggregate posterior. This term, in principle, promotes independence. However, the decomposition has numerous interacting terms, which makes exact interpretation difficult. As the generative model is fixed in this setting, optimizing the ELBO is tantamount to minimizing the variational gap, which we propose to decompose as

The last term of this equation quantifies the misspecification of the mean-field assumption. The larger it is, the more the coupling between the hidden variables $Z$ . Since neural networks are flexible, they can be very successful at optimizing this variational gap but at the price of introducing supplemental correlation between $Z$ in the aggregated posterior. We expect this side effect whenever we use neural networks to learn a misspecified variational approximation.

3 Correcting the variational posterior

A few comments are in order regarding this penalty. First, the $d$ HSIC is positive and therefore our objective function is still a lower bound on the log-likelihood. The bound will be looser but the resulting parameters will yield a more suitable representation. This trade-off is adjustable via the parameter $\lambda$ . Second, the $d$ HSIC can be estimated with the same samples used for stochastic variational inference (i.e., sampling from the variational distribution) and for minibatch sampling (i.e., subsampling the dataset). Third, the HSIC penalty is based only on the variational parameters—not the parameters of the generative model.

Case study: Learning interpretable representations

Suppose we want to summarize the data $x$ with two independent components $u$ and $v$ , as shown in Figure 1a. The task is especially important for data exploration since independent representations are often more easily interpreted.

A related problem is finding latent factors $(z^{1},...,z^{d})$ that correspond to real and interpretable variations in the data. Learning independent representations is then a key step towards learning disentangled representations . The $\beta$ -VAE proposes further penalizing the $D_{KL}(q_{\phi}(z\mid x)\;||\;p(z))$ term. It attains significant improvement over state-of-the art methods on real datasets. However, this penalization has been shown to yield poor reconstruction performance . The $\beta$ -TCVAE penalized an approximation of the total correlation (TC), defined as $D_{KL}(\hat{q}_{\phi}(z)\;||\;\prod_{k}\hat{q}_{\phi}(z^{k}))$ , which is a measure of multivariate mutual independence. However, this quantity does not have a closed-form solution and the $\beta$ -TCVAE uses a biased estimator of the TC—a lower bound from Jensen inequality. That bias will be zero only if evaluated on the whole dataset, which is not possible since the estimator has quadratic complexity in the number of samples. However, the bias from the HSIC is of order $\mathcal{O}(\nicefrac{{1}}{{n}})$ ; it is negligible whenever the batch-size is large enough. HSIC therefore appears to be a more suitable method to enforce independence in the latent space.

Results are reported in Figure 2. The VAE baseline (like all the other methods) has an ELBO value worse than the marginal log-likelihood (horizontal bar) since the real posterior is not likely to be in the function class given by naive mean field AEVB. Also, this baseline has a greater dependence in the aggregated posterior $\hat{q}_{\phi}(u,v)$ than in the exact posterior $\hat{p}(u,v)$ (vertical bar) for the two measures of correlation. Second, while correcting the variational posterior, we want the best trade-off between model fit and independence. HCV attains the highest ELBO values despite having the lowest correlation.

Case study: Learning invariant representations

We now consider the particular problem of learning representations for the data that is invariant to a given nuisance variable. As a particular instance of the graphical model in Figure 1b, we embed an image $x$ into a latent vector $z_{1}$ whose distribution is independent of the observed lighting condition $s$ while being predictive of the person identity $y$ (Figure 3). The generative model is defined in Figure 3c and the variational distribution decomposes as $q_{\phi}(z^{1},z^{2}\mid x,s,y)=q_{\phi}(z^{1}\mid x,s)q_{\phi}(z^{2}\mid z^{1},y)$ , as in .

This problem has been studied in for binary or categorical $s$ . For their experiment with a continuous covariate $s$ , they discretize $s$ and use the MMD to match the distributions $\hat{q}_{\phi}(z^{1}\mid s=0)$ and $\hat{q}_{\phi}(z^{1}\mid s=j)$ for all $j$ . Perhaps surprisingly, their penalty turns out to be a special case of our HSIC penalty. (We present a proof of this fact in Appendix D.)

Let the nuisance factor $s$ be a discrete random variable and let $l$ (the kernel for $\mathcal{K}$ ) be a Kronecker delta function $\delta:(s,s^{\prime})\mapsto\mathds{1}_{s=s^{\prime}}$ . Then, the V-statistic corresponding to $\textrm{HSIC}(\hat{q}_{\phi}(z^{1}),p_{\text{data}})$ is a weighted sum of the V-statistics of the MMD between the pairs $\hat{q}_{\phi}(z\mid s=i),\hat{q}_{\phi}(z\mid s=j)$ . The weights are functions of the empirical probabilities for $s$ .

Working with the HSIC rather than an MMD penalty lets us avoid discretizing $s$ . We take into account the whole angular range and not simply the direction of the light. We apply HCV with mean-field AEVB, $Z=\{z^{1},z^{2}\},X=\{x,y\},S=\{s\},\mathcal{Z}_{0}=\{z^{1}\}$ and $\mathcal{S}_{0}=\{s\}$ .

We repeat the experiments from the paper introducing the variational fair auto-encoder (VFAE) , this time comparing the VAE with no covariate $s$ , the VFAE with observed lighting direction groups (five groups), and the HCV with the lighting direction vector (a three-dimensional vector). As a supplemental baseline, we also report results for the unconstrained VAEs. As in , we report 1) the accuracy for classifying the person based on the variational distribution $q_{\phi}(y\mid z^{1},s)$ ; 2) the classification accuracy for the lighting group condition (five-way classification) based on a logistic regression and a random forest classifier on a sample from the variational posterior $q_{\phi}(z^{1}\mid z^{2},y,s)$ for each datapoint; and 3) the average error for predicting the lighting direction with linear regression and a random forest regressor, trained on a sample from the variational posterior $q_{\phi}(z^{1}\mid z^{2},y,s)$ . Error is expressed in degrees. $\lambda$ is optimized via grid search as in .

We report our results in Table 1. As expected, adding information (either the lightning group or the refined lightning direction) always improves the quality of the classifier $q_{\phi}(y\mid z^{1},s)$ . This can be seen by comparing the scores between the vanilla VAE and the unconstrained algorithms. However, by using side information $s$ , the unconstrained models yield a representation less suitable because it is more correlated with the nuisance variables. There is therefore a trade-off between correlation to the nuisance and performance. Our proposed method (HCV) shows greater invariance to lighting direction while accurately predicting people’s identities.

Case study: Learning denoised representations

This section presents a case study of denoising datasets in the setting of an important open scientific problem. The task of denoising consists of representing experimental observations $x$ and nuisance observations $s$ with two independent signals: biological signal $z$ and technical noise $u$ . The difficulty is that $x$ contains both biological signal and noise and is therefore strongly correlated with $s$ (Figure 1c). In particular, we focus on single-cell RNA sequencing (scRNA-seq) data which renders a gene-expression snapshot of an heterogeneous sample of cells. Such data can reveal a cell’s type , if we can cope with a high level of technical noise .

The output of an scRNA-seq experiment is a list of transcripts $(l_{m})_{m\in\mathcal{M}}$ . Each transcript $l_{m}$ is an mRNA molecule enriched with a cell-specific barcode and a unique molecule identifier, as in . Cell-specific barcodes enable the biologist to work at single-cell resolution. Unique molecule identifiers (UMIs) are meant to remove some significant part of the technical bias (e.g., amplification bias) and make it possible to obtain an accurate probabilistic model for these datasets . Transcripts are then aligned to a reference genome with tools such as CellRanger .

The data from the experiment has two parts. First, there is a gene expression matrix $(X_{ng})_{(n,g)\in\mathcal{N}\times\mathcal{G}}$ , where $\mathcal{N}$ designates the set of cells detected in the experiment and $\mathcal{G}$ is the set of genes the transcripts have been aligned with. A particular entry of this matrix indicates the number of times a particular gene has been expressed in a particular cell. Second, we have quality control metrics $(s^{i})_{i\in\mathcal{S}}$ (described in Appendix E) which assess the level of errors and corrections in the alignment process. These metrics cannot be described with a generative model as easily as gene expression data but they nonetheless impact a significant number of tasks in the research area . Another significant portion of these metrics focus on the sampling effects (i.e., the discrepancy in the total number of transcripts captured in each cell) which can be taken into account in a principled way in a graphical model as in .

We visualize these datasets $x$ and $s$ with tSNE in Figure 4. Note that $x$ is correlated with $s$ , especially within each cell type. A common application for scRNA-seq is discovering cell types, which can be be done without correcting for the alignment errors . A second important application is identifying genes that are more expressed in one cell type than in another—this hypothesis testing problem is called differential expression . Not modeling $s$ can induce a dependence on $x$ which hampers hypothesis testing .

Most research efforts in scRNA-seq methodology research focus on using generalized linear models and two-way ANOVA to regress out the effects of quality control metrics. However, this paradigm is incompatible with hypothesis testing. A generative approach, however, would allow marginalizing out the effect of these metrics, which is more aligned with Bayesian principles. Our main contribution is to incorporate these alignment errors into our graphical model to provide a better Bayesian testing procedure. We apply HCV with $Z=\{z,u\},X=\{x,s\},\mathcal{Z}_{0}=\{z,u\}$ . By integrating out $u$ while sampling from the variational posterior, $\int q_{\phi}(x\mid z,u)dp(u)$ , we find a Bayes factor that is not subject to noise. (See Appendix F for a complete presentation of the hypothesis testing framework and the graphical model under consideration).

We considered scRNA-seq data from peripheral blood mononuclear cells (PBMCs) from a healthy donor . Our dataset includes 12,039 cells and 3,346 genes, five quality control metrics from CellRanger and cell-type annotations extracted with Seurat . We preprocessed the data as in . Our ground truth for the hypothesis testing, from microarray studies, is a set of genes that are differentially expressed between human B cells and dendritic cells (n=10 in each group ).

We compare scVI , a state-of-the-art model, with no observed nuisance variables ( $8$ latent dimensions for $z$ ), and our proposed model with observed quality control metrics. We use five latent dimensions for $z$ and three for $u$ . The penalty $\lambda$ is selected through grid search. For each algorithm, we report 1) the coefficient of determination of a linear regression and random forest regressor for the quality metrics predictions based on the latent space, 2) the irreproducible discovery rate (IDR) model between the Bayes factor of the model and the p-values from the micro-array. The mixture weights, reported in , are similar between the original scVI and our modification (and therefore higher than other mainstream differential expression procedures) and saturate the number of significant genes in this experiment ( $\sim$ 23%). We also report the correlation of the reproducible mixture as a second-order quality metric for our gene rankings.

We report our results in Table 2. First, the proposed method efficiently removes much correlation with the nuisance variables $s$ in the latent space $z$ . Second, the proposed method yields a better ranking of the genes when performing Bayesian hypothesis testing. This is shown by a substantially higher correlation coefficient for the IDR, which indicates the obtained ranking better conforms with the micro-array results. Our denoised latent space is therefore extracting information from the data that is less subject to alignment errors and more biologically interpretable.

Discussion

We have presented a flexible framework for correcting independence properties of aggregated variational posteriors learned via naive mean field AEVB. The correction is performed by penalizing the ELBO with the HSIC—a kernel-based measure of dependency—between samples from the variational posterior.

We illustrated how variational posterior misspecification in AEVB could unwillingly promote dependence in the aggregated posterior. Future work should look at other variational approximations and quantify this dependence.

Penalizing the HSIC as we do for each mini-batch implies that no information is learned about distribution $\hat{q}(Z)$ or $\prod_{i}{\hat{q}(z^{i})}$ during training. On one hand, this is positive since we do no have to estimate more parameters, especially if the joint estimation would imply a minimax problem as in . One the other hand, that could be harmful if the HSIC could not be estimated with only a mini-batch. Our experiments show this does not happen in a reasonable set of configurations.

Trading a minimax problem for an estimation problem does not come for free. First, there are some computational considerations. The HSIC is computed in quadratic time but linear time estimators of dependence or random features approximations should be used for non-standard batch sizes. For example, to train on the entire extended Yale B dataset, VAE takes two minutes, VFAE takes ten minutesVFAE is slower because of the discrete operation it has to perform to form the samples for estimating the MMD., and HCV takes three minutes. Second, the problem of choosing the best kernel is known to be difficult . In the experiments, we rely on standard and efficient choices: a Gaussian kernel with median heuristic for the bandwidth. The bandwidth can be chosen analytically in the case of a Gaussian latent variable and done offline in case of an observed nuisance variable. Third, the general formulation of HCV with the $d$ HSIC penalization, as in Equation 9, should be nuanced since the V-statistic relies on a U-statistic of order $2d$ . Standard non-asymptotic bounds as in would exhibit a concentration rate of $\mathcal{O}(\sqrt{\nicefrac{{d}}{{n}}})$ and therefore not scale well for a large number of variables.

We also applied our HCV framework to scRNA-seq data to remove technical noise. The same graphical model can be readily applied to several other problems in the field. For example, we may wish to remove cell cycles that are biologically variable but typically independent of what biologists want to observe. We hope our approach will empower biological analysis with scalable and flexible tools for data interpretation.

NY and RL were supported by grant U19 AI090023 from NIH-NIAID.

References

Appendix A dHSIC

The definition of the cross-covariance operator can be extended to the case of $d$ random variables, simply by adhering to the formal language of tensor spaces. Let $X=(X^{1},...,X^{d})$ be a random vector with each component of dimension $p$ . The components $X^{1},...,X^{d}$ are mutually independent if the joint distribution is equal to the tensor product of the marginal distribution. derives functional formulation, population statistics and V-statistics for the Hilbert-Schmidt norm of the corresponding generalized cross-covariance operator called $d$ HSIC. Notably, under the hypothesis that the canonical kernel from the tensor product $\bigotimes_{k}\mathcal{H}_{k}$ is universal, the $d$ HSIC is null if and only if the components of the random vector are mutually independent. We write here the retained V-statistics that we will use for the experiments and can compute in time $\mathcal{O}(dn^{2}p)$ :

The question that naturally follows is when is the canonical kernel from the tensor product $\bigotimes_{k}\mathcal{H}_{k}$ universal. showed that the characteristic property of the individual kernels is not enough in general. However, in the case of continuous, bounded and translation invariant kernels this is a sufficient condition. In subsequent work, we plan to consider the $d$ HSIC in this setting.

Appendix B Proof of independence in representation

Without loss of generality, we can write $\mathcal{I}$ as independence between two variables $Z_{i}\upmodels Z_{j}$ for some $(i,j)$ . Under infinite data, the empirical distribution $p_{\text{data}}(X,S)$ is close to $p(X,S)$ , the real distribution:

Appendix C Linear Gaussian system for independence

Having drawn these parameters, the generative model is:

The marginal log-likelihood $p(x)$ is tractable:

The complete posterior $p(u,v\mid x)$ is tractable:

Appendix D Proof of equivalence between HSIC and MMD

The proof relies on sum manipulations. First, we carefully write the case where $s$ is binary without loss of generality. Let us assume $M$ samples from the joint $(x,s)$ and let us reorder them such that $s_{0}=...=s_{N}=0$ and $s_{N+1}=...=s_{M}=1$ . In that case,

Above, the term inside the parenthesis is the V-statistic for the MMD between $q(z\mid s=0)$ and $q(z\mid s=1)$ . In the general case of $s$ discrete, we then have a sum of MMD weighted by the values of the empirical $p(s)$ . ∎

Appendix E Nuisance factors presentation for scRNA-seq

$s^{1}$ : proportion of transcripts which confidently mapped to a gene;

$s^{2}$ : proportion of transcripts mapping to the genome, but not to a gene;

$s^{3}$ : proportion of transcripts which did not align;

$s^{4}$ : proportion of transcripts whose UMI sequence was corrected by the alignment procedure;

$s^{5}$ : proportion of transcripts whose barcode sequence was corrected by the alignment procedure.

Appendix F Graphical model for scRNA-seq

Our probabilistic graphical model is a modification of the single-cell Variational Inference model . The main difference is the addition of latent variable $u$ and the node for the sequencing errors $s$ .

$z$ will encode the biological information, $u$ the technical information from the alignment process and $l$ the sampling intensity. Then, we have some intermediate hidden variables useful for testing and model interpretation that we will integrate out for inference:

Physically, $w_{ng}$ represents the average proportion of transcripts aligned with gene $g$ in cell $n$ . $y_{ng}$ represents one outcome of the sampling process. $h_{ng}$ represents some additional control for the zeros that come from alignment. Finally, the observations fall from:

where we constrained the mean $f_{\mu_{s}}^{j}(u_{n})$ to be in the 0-1 range since $s$ is a vector of individual proportions. This model is a very competitive solution for representing single-cell RNA sequencing data.

Applying AEVB with $X=\{x,s\},Z=\{z,l,u,w,y,h\}$ seems doomed to fail since some of the variables are discrete. Fortunately, variables $\{w,y,h\}$ can be integrated out analytically: $p(x\mid l,z,u)$ is a Zero-Inflated Negative Binomial distribution. We then apply AEVB with $X=\{x,s\},Z=\{z,l,u\}$ with the mean-field variational posterior:

We can capitalize on our careful modeling to query the Bayesian model. One particular flavor of Bayesian Hypothesis Testing is called Differential Expression in the biostatistics literature. Given two sets of samples $\{x_{a}\mid a\in A\}$ and $\{x_{b}\mid b\in B\}$ , we would like to test whether a particular gene $g$ is more expressed in population $A$ or in population $B$ .

More formally, for each gene $g$ and a pair of cells $(z_{a},u_{a})$ , $(z_{b},u_{b})$ with observed gene expression $(x_{a},x_{b})$ and quality control metrics $(s_{a},s_{b})$ , we can formulate two models of the world under which one of the following hypotheses is true:

Where the expectation is taken over $u$ to integrate out the technical variation. Evaluating the likelihood ratio test for whether our datapoints $(x_{a},x_{b}),(s_{a},s_{b})$ are more probable under the first hypothesis is equivalent to writing a Bayes factor:

where the posterior of these models can be approximated via the variational distribution:

We can use Monte Carlo integration to compute these integrals because all the measures have low dimension.