Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara, Yutian Chen, Max Welling

Introduction

Markov chain Monte Carlo (MCMC) sampling has been the main workhorse of Bayesian computation since the 1990s. A canonical MCMC algorithm proposes samples from a distribution $q$ and then accepts or rejects these proposals with a certain probability given by the Metropolis-Hastings (MH) formula (Metropolis et al., 1953; Hastings, 1970). For each proposed sample, the MH rule needs to examine the likelihood of all data-items. When the number of data-cases is large this is an awful lot of computation for one bit of information, namely whether to accept or reject a proposal.

In today’s Big Data world, we need to rethink our Bayesian inference algorithms. Standard MCMC methods do not meet the Big Data challenge for the reason described above. Researchers have made some progress in terms of making MCMC more efficient, mostly by focusing on parallelization. Very few question the algorithm itself: is the standard MCMC paradigm really optimally efficient in achieving its goals? We claim it is not.

Any method that includes computation as an essential ingredient should acknowledge that there is a finite amount of time, $T$ , to finish a calculation. An efficient MCMC algorithm should therefore decrease the “error” (properly defined) maximally in the given time $T$ . For MCMC algorithms, there are two contributions to this error: bias and variance. Bias occurs because the chain needs to burn in during which it is sampling from the wrong distribution. Bias usually decreases fast, as evidenced by the fact that practitioners are willing to wait until the bias has (almost) completely vanished after which they discard these “burn-in samples”. The second cause of error is sampling variance, which occurs because of the random nature of the sampling process. The retained samples after burn-in will reduce the variance as $O(1/T)$ .

However, given a finite amount of computational time, it is not at all clear whether the strategy of retaining few unbiased samples and accepting an error dominated by variance is optimal. Perhaps, by decreasing the bias more slowly we could sample faster and thus reduce variance faster? In this paper we illustrate this effect by cutting the computational budget of the MH accept/reject step. To achieve that, we conduct sequential hypothesis tests to decide whether to accept or reject a given sample and find that the majority of these decisions can be made based on a small fraction of the data with high confidence. A related method was used in Singh et al. (2012), where the factors of a graphical model are sub-sampled to compute fixed-width confidence intervals for the log-likelihood in the MH test.

Our “philosophy” runs deeper than the algorithm proposed here. We advocate MCMC algorithms with a “bias-knob”, allowing one to dial down the bias at a rate that optimally balances error due to bias and variance. We only know of one algorithm that would also adhere to this strategy: stochastic gradient Langevin dynamics (Welling and Teh, 2011) and its successor stochastic gradient Fisher scoring (Ahn et al., 2012). In their case the bias-knob was the stepsize. These algorithms do not have an MH step which resulted in occasional samples with extremely low probability. We show that our approximate MH step largely resolves this, still avoiding $O(N)$ computations per iteration.

In the next section we introduce the MH algorithm and discuss its drawbacks. Then in Section 3, we introduce the idea of approximate MCMC methods and the bias variance trade-off involved. We develop approximate MH tests for Bayesian posterior sampling in Section 4 and present a theoretical analysis in Section 5. Finally, we show our experimental results in Section 6 and conclude in Section 7.

The Metropolis-Hastings algorithm

MCMC methods generate samples from a distribution $\mathcal{S}_{0}(\theta)$ by simulating a Markov chain designed to have stationary distribution $\mathcal{S}_{0}(\theta)$ . A Markov chain with a given stationary distribution can be constructed using the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), which uses the following rule for transitioning from the current state $\theta_{t}$ to the next state $\theta_{t+1}$ :

Draw a candidate state $\theta^{\prime}$ from a proposal distribution $q(\theta^{\prime}|\theta_{t})$

Draw $u\sim\text{Uniform}$ . If $u<P_{a}$ set $\theta_{t+1}\leftarrow\theta^{\prime}$ , otherwise set $\theta_{t+1}\leftarrow\theta_{t}$ .

Following this transition rule ensures that the stationary distribution of the Markov chain is $\mathcal{S}_{0}(\theta)$ . The samples from the Markov chain are usually used to estimate the expectation of a function $f(\theta)$ with respect to $\mathcal{S}_{0}(\theta)$ . To do this we collect $T$ samples and approximate the expectation $I=\langle f\rangle_{\mathcal{S}_{0}}$ as $\hat{I}=\frac{1}{T}\sum_{t=1}^{T}f(\theta_{t})$ . Since the stationary distribution of the Markov chain is $\mathcal{S}_{0}$ , $\hat{I}$ is an unbiased estimator of $I$ (if we ignore burn-in).

Approximate MCMC and the Bias-Variance Tradeoff

Ironically, the reason MCMC methods are so slow is that they are designed to be unbiased. If we were to allow a small bias in the stationary distribution, it is possible to design a Markov chain that can be simulated cheaply (Welling and Teh, 2011; Ahn et al., 2012). That is, to estimate $I=\langle f\rangle_{\mathcal{S}_{0}}$ , we can use a Markov chain with stationary distribution $\mathcal{S}_{\epsilon}$ where $\epsilon$ is a parameter that can be used to control the bias in the algorithm. Then $I$ can be estimated as $\hat{I}=\frac{1}{T}\sum_{t=1}^{T}f(\theta_{t})$ , computed using samples from $\mathcal{S}_{\epsilon}$ instead of $\mathcal{S}_{0}$ .

As $\epsilon\to 0$ , $\mathcal{S}_{\epsilon}$ approaches $\mathcal{S}_{0}$ (the distribution of interest) but it becomes expensive to simulate the Markov chain. Therefore, the bias in $\hat{I}$ is low, but the variance is high because we can collect only a small number of samples in a given amount of computational time. As $\epsilon$ moves away from , it becomes cheap to simulate the Markov chain but the difference between $\mathcal{S}_{\epsilon}$ and $\mathcal{S}_{0}$ grows. Therefore, $\hat{I}$ will have higher bias, but lower variance because we can collect a larger number of samples in the same amount of computational time. This is a classical bias-variance trade-off and can be studied using the risk of the estimator.

The optimal setting of $\epsilon$ that minimizes the risk depends on the amount of computational time available. If we have an infinite amount of computational time, we should set $\epsilon$ to 0. Then there is no bias, and the variance can be brought down to by drawing an infinite number of samples. This is the traditional MCMC setting. However, given a finite amount of computational time, this setting may not be optimal. It might be better to tolerate a small amount of bias in the stationary distribution if it allows us to reduce the variance quickly, either by making it cheaper to collect a large number of samples or by mixing faster.

It is interesting to note that two recently proposed algorithms follow this paradigm: Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011) and Stochastic Gradient Fisher Scoring (SGFS) (Ahn et al., 2012). These algorithms are biased because they omit the required Metropolis-Hastings tests. However, in both cases, a knob $\epsilon$ (the step-size of the proposal distribution) is available to control the bias. As $\epsilon\to 0$ , the acceptance probability $P_{a}\to 1$ and the bias from not conducting MH tests disappears. However, when $\epsilon\to 0$ the chain mixes very slowly and the variance increases because the auto-correlation time $\tau\to\infty$ . As $\epsilon$ is increased from , the auto-correlation, and therefore the variance, reduces. But, at the same time, the acceptance probability reduces and the bias from not conducting MH tests increases as well.

In the next section, we will develop another class of approximate MCMC algorithms for the case where the target $\mathcal{S}_{0}$ is a Bayesian posterior distribution given a very large dataset. We achieve this by developing an approximate Metropolis-Hastings test, equipped with a knob for controlling the bias. Moreover, our algorithm has the advantage that it can be used with any proposal distribution. For example, our method allows approximate MCMC methods to be applied to problems where it is impossible to compute gradients (which is necessary to apply SGLD/SGFS). Or, we can even combine our method with SGLD/SGFS, to obtain the best of both worlds.

Approximate Metropolis-Hastings Test for Bayesian Posterior Sampling

An important method in the toolbox of Bayesian inference is posterior sampling. Given a dataset of $N$ independent observations $X_{N}=\left\{x_{1},\dots,x_{N}\right\}$ , which we model using a distribution $p(x;\theta)$ parameterized by $\theta$ , defined on a space $\Theta$ with measure $\Omega$ , and a prior distribution $\rho(\theta)$ , the task is to sample from the posterior distribution $\mathcal{S}_{0}(\theta)\propto\rho(\theta)\prod_{i=1}^{N}p(x_{i};\theta)$ .

If the dataset has a billion datapoints, it becomes very painful to compute $\mathcal{S}_{0}(.)$ in the MH test, which has to be done for each posterior sample we generate. Spending $O(N)$ computation to get just $1$ bit of information, i.e. whether to accept or reject a sample, is likely not the best use of computational resources.

But, if we try to develop accept/reject tests that satisfy detailed balance exactly with respect to the posterior distribution using only sub-samples of data, we will quickly see the no free lunch theorem kicking in. For example, the pseudo marginal MCMC method (Andrieu and Roberts, 2009) and the method developed by Lin et al. (2000) provide a way to conduct exact accept/reject tests using unbiased estimators of the likelihood. However, unbiased estimators of the likelihood that can be computed from mini-batches of data, such as the Poisson estimator (Fearnhead et al., 2008) or the Kennedy-Bhanot estimator (Lin et al., 2000) have very high variance for large datasets. Because of this, once we get a very high estimate of the likelihood, almost all proposed moves are rejected and the algorithm gets stuck.

Thus, we should be willing to tolerate some error in the stationary distribution if we want faster accept/reject tests. If we can offset this small bias by drawing a large number of samples cheaply and reducing the variance faster, we can establish a potentially large reduction in the risk.

We will now show how to develop such approximate tests by reformulating the MH test as a statistical decision problem. It is easy to see that the original MH test (Eqn. 1) is equivalent to the following procedure: Draw $u\sim\text{Uniform}$ and accept the proposal $\theta^{\prime}$ if the average difference $\mu$ in the log-likelihoods of $\theta^{\prime}$ and $\theta_{t}$ is greater than a threshold $\mu_{0}$ , i.e. compute

Then if $\mu>\mu_{0}$ , accept the proposal and set $\theta_{t+1}\leftarrow\theta^{\prime}$ . If $\mu\leq\mu_{0}$ , reject the proposal and set $\theta_{t+1}\leftarrow\theta_{t}$ . This reformulation of the MH test makes it very easy to frame it as a statistical hypothesis test. Given $\mu_{0}$ and a random sample $\left\{l_{i_{1}},\dots,l_{i_{n}}\right\}$ drawn without replacement from the population $\left\{l_{1},\dots,l_{N}\right\}$ , can we decide whether the population mean $\mu$ is greater than or less than the threshold $\mu_{0}$ ? The answer to this depends on the precision in the random sample. If the difference between the sample mean $\bar{l}$ and $\mu_{0}$ is significantly greater than the standard deviation $s$ of $\bar{l}$ , we can make the decision to accept or reject the proposal confidently. If not, we should draw more data to increase the precision of $\bar{l}$ (reduce $s$ ) until we have enough evidence to make a decision.

More formally, we test the hypotheses $H_{1}:\mu>\mu_{0}$ vs $H_{2}:\mu<\mu_{0}$ . To do this, we proceed as follows: We compute the sample mean $\bar{l}$ and the sample standard deviation $s_{l}=\sqrt{(\overline{l^{2}}-(\bar{l})^{2})\frac{n}{n-1}}$ . Then the standard deviation of $\bar{l}$ can be estimated as:

where $\sqrt{1-\frac{n-1}{N-1}}$ , the finite population correction term, is applied because we are drawing the subsample without replacement from a finite-sized population. Then, we compute the test statistic:

If $n$ is large enough for the central limit theorem (CLT) to hold, the test statistic $t$ follows a standard Student-t distribution with $n-1$ degrees of freedom, when $\mu=\mu_{0}$ (see Fig. 7 in supplementary for an empirical verification). Then, we compute $\delta=1-\phi_{n-1}(|t|)$ where $\phi_{n-1}(.)$ is the cdf of the standard Student-t distribution with $n-1$ degrees of freedom. If $\delta<\epsilon$ (a fixed threshold) we can confidently say that $\mu$ is significantly different from $\mu_{0}$ . In this case, if $\bar{l}>\mu_{0}$ , we decide $\mu>\mu_{0}$ , otherwise we decide $\mu<\mu_{0}$ . If $\delta\geq\epsilon$ , we do not have enough evidence to make a decision. In this case, we draw more data to reduce the uncertainty, $s$ , in the sample mean $\bar{l}$ . We keep drawing more data until we have the required confidence (i.e. until $\delta<\epsilon$ ). Note, that this procedure will terminate because when we have used all the available data, i.e. $n=N$ , the standard deviation $s$ is 0, the sample mean $\bar{l}=\mu$ and $\delta=0<\epsilon$ . So, we will make the same decision as the original MH test would make. Pseudo-code for our test is shown in Algorithm 4. Here, we start with a mini-batch of size $m$ for the first test and increase it by $m$ datapoints when required.

The advantage of our method is that often we can make confident decisions with $n<N$ datapoints and save on computation, although we introduce a small bias in the stationary distribution. But, we can use the computational time we save to draw more samples and reduce the variance. The bias-variance trade-off can be controlled by adjusting the knob $\epsilon$ . When $\epsilon$ is high, we make decisions without sufficient evidence and introduce a high bias. As $\epsilon\to 0$ , we make more accurate decisions but are forced to examine more data which results in high variance.

Our algorithm will behave erratically if the CLT does not hold, e.g. with very sparse datasets or datasets with extreme outliers. The CLT assumption can be easily tested empirically before running the algorithm to avoid such pathological situations. The sequential hypothesis testing method can also be used to speed-up Gibbs sampling in densely connected Markov Random Fields. We explore this idea briefly in Section F of the supplementary.

Error Analysis and Test Design

In 5.1, we study the relation between the parameter $\epsilon$ , the error ${\cal E}$ of the complete sequential test, the error $\Delta$ in the acceptance probability and the error in the stationary distribution. In 5.2, we describe how to design an optimal test that minimizes data usage given a bound on the error.

Now, let $P_{a,\epsilon}(\theta,\theta^{\prime})$ be the actual acceptance probability of our algorithm and let $\Delta(\theta,\theta^{\prime})\overset{\textrm{def}}{=}P_{a,\epsilon}(\theta,\theta^{\prime})-P_{a}(\theta,\theta^{\prime})$ be the error in $P_{a,\epsilon}$ . In Section B of the supplementary, we show that for any $(\theta,\theta^{\prime})$ :

Thus, the errors corresponding to different $u$ ’s partly cancel each other. As a result, although $|\Delta(\theta,\theta^{\prime})|$ is upper-bounded by the worst-case error ${\cal E}(0,m,\epsilon)$ of the sequential test, the actual error is usually much smaller. For any given $(\theta,\theta^{\prime})$ , $\Delta$ can be computed easily using 1-dimensional quadrature.

The distance between the posterior distribution $\mathcal{S}_{0}$ and the stationary distribution of our approximate Markov chain $\mathcal{S}_{\epsilon}$ is upper bounded as:

2 Optimal Sequential Test Design

We now briefly describe how to choose the parameters of the algorithm: $\epsilon$ , the error of a single test and $m$ , the mini-batch size. A very simple strategy we recommend is to choose $m\approx 500$ so that the Central Limit Theorem holds and keep $\epsilon$ as small as possible while maintaining a low average data usage. This rule works well in practice and is used in Experiments 6.1 - 6.4.

The more discerning practitioner can design an optimal test that minimizes the data used while keeping the error below a given tolerance. Ideally, we want to do this based on a tolerance on the error in the stationary distribution ${\cal S}_{\epsilon}$ . Unfortunately, this error depends on the contraction parameter, $\eta$ , of the exact transition kernel, which is difficult to compute. A more practical choice is a bound on the error $\Delta$ in the acceptance probability, since the error in ${\cal S}_{\epsilon}$ increases linearly with $\Delta$ . Since $\Delta$ is a function of $(\theta,\theta^{\prime})$ , we can try to control the average value of $\Delta$ over the empirical distribution of $(\theta,\theta^{\prime})$ that would be encountered while simulating the Markov chain. Given a tolerance $\Delta^{*}$ on this average error, we can find the optimal $m$ and $\epsilon$ by solving the following optimization problem (e.g. using grid search) to minimize the average data usage :

But this leads to a very conservative design as the worst case error is usually much higher than the average case error. We illustrate the sequential design in Experiment 6.5. More details and a generalization of this method is given in supplementary Section D.

Experiments

We first test our method using a random walk proposal $q(\theta^{\prime}|\theta_{t})=\mathcal{N}(\theta_{t},\sigma_{RW}^{2})$ . Although the random walk proposal is not efficient, it is very useful for illustrating our algorithm because the proposal does not contain any information about the target distribution, unlike Langevin or Hamiltonian methods. So, the responsibility of converging to the correct distribution lies solely with the MH test. Also since $q$ is symmetric, it does not appear in the MH test and we can use $\mu_{0}=\frac{1}{N}\log\left[u\rho(\theta_{t})/\rho(\theta^{\prime})\right]$ .

The target distribution in this experiment was the posterior for a logistic regression model trained on the MNIST dataset for classifying digits 7 vs 9. The dataset consisted of 12214 datapoints and we reduced the dimensionality from 784 to 50 using PCA. We chose a zero mean spherical Gaussian prior with precision = 10, and set $\sigma_{RW}=0.01$ .

2 Independent Component Analysis

3 Variable selection in Logistic Regression

Now, we apply our MH test to variable selection in a logistic regression model using the reversible jump MCMC algorithm of Green (1995). We use a model that is similar to the Bayesian LASSO model for linear regression described in Chen et al. (2011). Specifically, given $D$ input features, our parameter $\theta=\left\{\beta,\gamma\right\}$ where $\beta$ is a vector of $D$ regression coefficients and $\gamma$ is a $D$ dimensional binary vector that indicates whether a particular feature is included in the model or not. The prior we choose for $\beta$ is $p(\beta_{j}|\gamma,\nu)=\frac{1}{2\nu}\exp\left\{-\frac{|\beta_{j}|}{\nu}\right\}$ if $\gamma_{j}=1$ . If $\gamma_{j}=0$ , $\beta_{j}$ does not appear in the model. Here $\nu$ is a shrinkage parameter that pushes $\beta_{j}$ towards 0, and we choose a prior $p(\nu)\propto 1/\nu$ . We also place a right truncated Poisson prior $p(\gamma|\lambda)\propto\dfrac{\lambda^{k}}{{D\choose k}k!}$ on $\gamma$ to control the size of the model, $k=\sum_{j=1}^{D}\gamma_{j}$ We set $\lambda=10^{-10}$ in this experiment.

Denoting the likelihood of the data by $l_{N}(\beta,\gamma)$ , the posterior distribution after integrating out $\nu$ is $p(\beta,\gamma|X_{N},y_{N},\lambda)\propto l_{N}(\beta,\gamma)\|\beta\|_{1}^{-k}\lambda^{k}B(k,D-k+1)$ where $B(.,.)$ is the beta function. Instead of integrating out $\lambda$ , we use it as a parameter to control the size of the model. We use the same proposal distribution as in Chen et al. (2011) which is a mixture of 3 type of moves that are picked randomly in each iteration: an update move, a birth move and a death move. A detailed description is given in Supplementary Section E.

We applied this to the MiniBooNE dataset from the UCI machine learning repository(Bache and Lichman, 2013). Here the task is to classify electron neutrinos (signal) from muon neutrinos (background). There are 130,065 datapoints (28% in +ve class) with 50 features to which we add a constant feature of 1’s. We randomly split the data into a training (80%) and testing (20%) set. To compute ground truth, we collected T=400K samples using the exact reversible jump algorithm ( $\epsilon=0$ ). Then, we ran the approximate MH algorithm with different values of $\epsilon$ for around 3500 seconds. We plot the risk in predictive mean of test data (estimated from 10 Markov chains) in Fig. 4. Again we see that the lowest risk is obtained with $\epsilon>0$ .

The acceptance rates for the birth/death moves starts off at $\approx$ 20% but dies down to $\approx$ 2% once a good model is found. The acceptance rate for update moves is kept at $\approx$ 50%. The model also suffers from local minima. For the plot in Fig. 4, we started with only one variable and we ended up learning models with around 12 features, giving a classification error $\approx$ 15%. But, if we initialize the sampler with all features included and initialize $\beta$ to the MAP value, we learn models with around 45 features, but with a lower classification error $\approx$ 10%. Both the exact reversible jump algorithm and our approximate version suffer from this problem. We should bear this in mind when interpreting “ground truth”. However, we have observed that when initialized with the same values, we obtain similar results with the approximate algorithm and the exact algorithm (see e.g. Fig. LABEL:fig:rjmcmc_incl in supplementary).

4 Stochastic Gradient Langevin Dynamics

Finally, we apply our method to Stochastic Gradient Langevin Dynamics(Welling and Teh, 2011). In each iteration, we randomly draw a mini-batch $\mathcal{X}_{n}$ of size $n$ , and propose:

The proposed state $\theta^{\prime}$ is always accepted (without conducting any MH test). Since the acceptance probability approaches $1$ as we reduce $\alpha$ , the bias from not conducting the MH test can be kept under control by using $\alpha\approx 0$ . However, we have to use a reasonably large $\alpha$ to keep the mixing rate high. This can be problematic for some distributions, because SGLD relies solely on gradients of the log density and it can be easily thrown off track by large gradients in low density regions, unless $\alpha\approx 0$ .

As an example, consider an L1-regularized linear regression model. Given a dataset $\left\{x_{i},y_{i}\right\}_{i=1}^{N}$ where $x_{i}$ are predictors and $y_{i}$ are targets, we use a Gaussian error model $p(y|x,\theta)\propto\exp\left\{-\frac{\lambda}{2}(y-\theta^{T}x)^{2}\right\}$ and choose a Laplacian prior for the parameters $p(\theta)\propto\exp(-\lambda_{0}\|\theta\|_{1})$ . For pedagogical reasons, we will restrict ourselves to a toy version of the problem where $\theta$ and $x$ are one dimensional. We use a synthetic dataset with $N=10000$ datapoints generated as $y_{i}=0.5x_{i}+\xi$ where $\xi\sim\mathcal{N}(0,1/3)$ . We choose $\lambda=3$ and $\lambda_{0}=4950$ , so that the prior is not washed out by the likelihood. The posterior density and the gradient of the log posterior are shown in figures 5(a) and 5(b) respectively.

An empirical histogram of samples obtained by running SGLD with $\alpha=5\times 10^{-6}$ is shown in Fig. 5(c). The effect of omitting the MH test is quite severe here. When the sampler reaches the mode of the distribution, the Langevin noise occasionally throws it into the valley to the left, where the gradient is very high. This propels the sampler far off to the right, after which it takes a long time to find its way back to the mode. However, if we had used an MH accept-reject test, most of these troublesome jumps into the valley would be rejected because the density in the valley is much lower than that at the mode.

To apply an MH test, note that the SGLD proposal $q(\theta^{\prime}|\theta)$ can be considered a mixture of component kernels $q(\theta^{\prime}|\theta,\mathcal{X}_{n})$ corresponding to different mini-batches. The mixture kernel will satisfy detailed balance with respect to the posterior distribution if the MH test enforces detailed balance between the posterior and each of the component kernels $q(\theta^{\prime}|\theta,\mathcal{X}_{n})$ . Thus, we can use an MH test with $\mu_{0}=\dfrac{1}{N}\log\left[u\dfrac{\rho(\theta_{t})q(\theta^{\prime}|\theta_{t},\mathcal{X}_{n})}{\rho(\theta^{\prime})q(\theta_{t}|\theta^{\prime},\mathcal{X}_{n})}\right]$ .

The result of running SGLD (keeping $\alpha=5\times 10^{-6}$ as before) corrected using our approximate MH test, with $\epsilon=0.5$ , is shown in Fig. 5(d). As expected, the MH test rejects most troublesome jumps into the valley because the density in the valley is much lower than that at the mode. The stationary distribution is almost indistinguishable from the true posterior. Note that when $\epsilon=0.5$ , a decision is always made in the first step (using just $m=500$ datapoints) without querying additional data sequentially.

5 Optimal Design of Sequential Tests

We illustrate the advantages of the optimal test design proposed in Section 5.2 by applying it to the ICA experiment described in Section 6.2. We consider two design methods: the ‘average design’ (Eqn. 7) and the ‘worst-case design’ (Eqn. 8). For the average design, we collected 100 samples of the Markov chain to approximate the expectation of the error over $(\theta,\theta^{\prime})$ . We will call these samples the training set. The worst case design does not need the training set as it does not involve the distribution of $(\theta,\theta^{\prime})$ . We compute the optimal $m$ and $\epsilon$ using grid search, for different values of the target training error, for both designs. We then collect a new set of 100 samples $(\theta,\theta^{\prime})$ and measure the average error and data usage on this test set (Fig. 6).

For the same target error on the training set, the worst-case design gives a conservative parameter setting that achieves a much smaller error on the test set. In contrast, the average design achieves a test error that is almost the same as the target error (Fig. 6(a)). Therefore, it uses much less data than the worst-case design (Fig. 6(b)).

We also analyze the performance in the case where we fix $m=600$ and only change $\epsilon$ . This is a simple heuristic we recommended at the beginning of Section 5.2. Although this usually works well, using the optimal test design ensures the best possible performance. In this experiment, we see that when the error is large, the optimal design uses only half the data (Fig. 6(b)) used by the heuristic and is therefore twice as fast.

Conclusions and Future Work

We have taken a first step towards cutting the computational budget of the Metropolis-Hastings MCMC algorithm, which takes $O(N)$ likelihood evaluations to make the binary decision of accepting or rejecting a proposed sample. In our approach, we compute the probability that a new sample will be accepted based on a subset of the data. We increase the cardinality of the subset until a prescribed confidence level is reached. In the process we create a bias, which is more than compensated for by a reduction in variance due to the fact that we can draw more samples per unit time. Current MCMC procedures do not take these trade-offs into account. In this work we use a fixed decision threshold for accepting or rejecting a sample, but in theory a better algorithm can be obtained by adapting this threshold over time. An adaptive algorithm can tune bias and variance contributions in such a way that at every moment our risk (the sum of squared bias and variance) is as low as possible. We leave these extensions for future work.

Acknowledgments

We thank Alex Ihler, Daniel Gillen, Sungjin Ahn and Babak Shahbaba for their valuable suggestions. This material is based upon work supported by the National Science Foundation under Grant No. 1216045.

Appendix A Distribution of the test statistic

In the sequential test, we first compute the test statistic from a mini-batch of size $m$ . If a decision cannot be made with this statistic, we keep increasing the mini-batch size by $m$ datapoints until we reach a decision. This procedure is guaranteed to terminate as explained in Section 4.

The parameter $\epsilon$ controls the probability of making an error in a single test and not the complete sequential test. As the statistics across multiple tests are correlated with each other, we should first obtain the joint distribution of these statistics in order to estimate the error of the complete sequential test. Let $\bar{l}_{j}$ and $s_{l,j}$ be the sample mean and standard deviation respectively, computed using the first $j$ mini-batches. Notice that when the size of a mini-batch is large enough, e.g. $n>100$ , the central limit theorem applies, and also $s_{l,j}$ is an accurate estimate of the population standard deviation. Additionally, since the degrees of freedom is high, the t-statistic in Eqn. 5 reduces to a $z$ -statistic. Therefore, it is reasonable to make the following assumptions:

The joint distribution of the sequence $(\bar{l}_{1},\bar{l}_{2},\dots)$ follows a multivariate normal distribution.

Fig. 7 shows that when $\mu=\mu_{0}$ the empirical marginal distribution of $t_{j}$ (or $z_{j}$ ) is well fitted by both a standard student-t and a standard normal distribution.

Under these assumptions, we state and prove the following proposition about the joint distribution of the $z$ -statistic ${\bf z}=(z_{1},z_{2},\dots)$ , where $z_{j}\overset{\textrm{def}}{=}(\bar{l}_{j}-\mu_{0})/\sigma_{l}\approx t_{j}$ , from different tests.

Given Assumption 1 and 2, the sequence ${\bf z}$ follows a Gaussian random walk process:

Denote by $x_{j}$ the average of $m$ $l$ ’s in the $j$ -th mini-batch. Taking into account the fact that the $l$ ’s are drawn without replacement, we can compute the mean and covariance of the $x_{j}$ ’s as:

It is trivial to derive the expression for the mean. For the covariance, we first derive the covariance matrix of single data points as

Now, as $x_{j}$ can be written as a linear combination of the elements in $j$ -th mini-batch as $x_{j}=\frac{1}{m}\boldsymbol{1}^{T}\boldsymbol{l}_{j}$ , the expression for covariance in Eqn. 16 follows immediately from:

According to Assumption 1, the joint distribution of $z_{j}$ ’s is Gaussian because $z_{j}$ is a linear combination of $\bar{l}_{j}$ ’s. It is however easier to derive the mean and covariance matrix of $z_{j}$ ’s by considering the vector ${\bf z}$ as a linear function of ${\bf x}$ : ${\bf z}=Q({\bf x}-\mu_{0}\boldsymbol{1})$ with

Fig. 8 shows the mean and $95\%$ confidence interval of the random walk as a function of $\pi$ with a few realizations of the $z$ sequence. Notice that as the proportion of observed data $\pi_{j}$ approaches 1, the mean of $z_{j}$ approaches infinity with a constant variance of 1. This is consistent with the fact that when we observe all the data, we will always make a correct decision.

Applying the individual tests $\delta\gtrless\epsilon\Leftrightarrow|z_{j}|\gtrless\Phi(1-\epsilon)\overset{\textrm{def}}{=}G$ at the $j$ -th mini-batch corresponds to thresholding the absolute value of $z_{j}$ at $\pi_{j}$ with a bound $G$ as shown in Fig. 9. Instead of $m$ and $\epsilon$ , we will use $\pi_{1}=m/N$ and $G$ as the parameters of the sequential test in the supplementary. The probability of incorrectly deciding $\mu<\mu_{0}$ when $\mu\geq\mu_{0}$ over the whole sequential test is computed as:

where $J=\lceil 1/\pi_{1}\rceil$ is the maximum number of tests. Similarly the probability of incorrectly deciding $\mu\geq\mu_{0}$ when $\mu<\mu_{0}$ can be computed similarly by replacing $z_{j}<-G$ with $z_{j}>G$ in Eqn. 21. We can also compute the expected proportion of data that will be used in the sequential test as:

where $j^{\prime}$ denotes the time when the sequential test terminates. Eqn. 21 and 22 can be efficiently approximated together using a dynamic programming algorithm by discretizing the value of $z_{j}$ between $[-G,G]$ . The time complexity of this algorithm is ${\cal O}(L^{2}J)$ where $L$ is the number of discretized values.

Figs. 1 and 10 show respectively that the theoretical value of the error ( ${\cal E}$ ) and the average data usage ( $\bar{\pi}$ ) estimated using our dynamic programming algorithm match the simulated values. Also, note that both error and data usage drop off very fast as $\mu$ moves away from $\mu_{0}$ .

Appendix B Error in One Metropolis-Hastings Step

Appendix C Proof of Theorem 1

We first prove a lemma that will be used for the proof of Theorem 1.

Given two transition kernels, $\mathcal{T}_{0}$ and $\mathcal{T}_{\epsilon}$ , with respective stationary distributions, $\mathcal{S}_{0}$ and $\mathcal{S}_{\epsilon}$ , if $\mathcal{T}_{0}$ satisfies the following contraction condition with a constant $\eta\in[0,1)$ for all probability distributions $P$ :

and the one step error between $\mathcal{T}_{0}$ and $\mathcal{T}_{\epsilon}$ is upper bounded uniformly with a constant $\Delta>0$ as:

then the distance between $\mathcal{S}_{0}$ and $\mathcal{S}_{\epsilon}$ is bounded as:

Consider a Markov chain with transition kernel $\mathcal{T}_{\epsilon}$ initialized from an arbitrary distribution $P$ . Denote the distribution after $t$ steps by $P^{(t)}\overset{\textrm{def}}{=}P\mathcal{T}_{\epsilon}^{t}$ . At every time step, $t\geq 0$ , we apply the transition kernel $\mathcal{T}_{\epsilon}$ on $P^{(t)}$ . According to the one step error bound in Eqn. 26, the distance between $P^{(t+1)}$ and the distribution obtained by applying $\mathcal{T}_{0}$ to $P^{(t)}$ is upper bounded as:

Following the contraction condition of $\mathcal{T}_{0}$ in Eqn. 25, the distance of $P^{(t)}\mathcal{T}_{0}$ from its stationary distribution $\mathcal{S}_{0}$ is less than $P^{(t)}$ as

Now let us use the triangle inequality to combine Eqn. 28 and 29 to obtain an upper bounded for the distance between $P^{(t+1)}$ and $\mathcal{S}_{0}$ :

Let $r<1-\eta$ be any positive constant and consider the ball $\mathcal{B}(\mathcal{S}_{0},\frac{\Delta}{r})\overset{\textrm{def}}{=}\{P:d_{v}(P,\mathcal{S}_{0})<\frac{\Delta}{r}\}$ . When $P^{(t)}$ is outside the ball, we have $\Delta\leq rd_{v}(P^{(t)},S)$ . Plugging this into Eqn. 30, we can obtain a contraction condition for $P^{(t)}$ towards $\mathcal{S}_{0}$ :

So if the initial distribution $P$ is outside the ball, the Markov chain will move monotonically into the ball within a finite number of steps. Let us denote the first time it enters the ball as $t_{r}$ . If the initial distribution is already inside the ball, we simply let $t_{r}=0$ . We then show by induction that $P^{(t)}$ will stay inside the ball for all $t\geq t_{r}$ .

At $t=t_{r}$ , $P^{(t)}\in\mathcal{B}(\mathcal{S}_{0},\frac{\Delta}{r})$ holds by the definition of $t_{r}$ .

Assume $P^{(t)}\in\mathcal{B}(\mathcal{S}_{0},\frac{\Delta}{r})$ for some $t\geq t_{r}$ . Then, following Eqn. 30, we have

Therefore, $P^{(t)}\in\mathcal{B}(\mathcal{S}_{0},\frac{\Delta}{r})$ holds for all $t\geq t_{r}$ . Since $P^{(t)}$ converges to $S_{\epsilon}$ , it follows that:

Taking the limit $r\rightarrow 1-\eta$ , we prove the lemma:

C.2 Proof of Theorem 1

We first derive an upper bound for the one step error of the approximate Metropolis-Hastings algorithm, and then use Lemma 3 to prove Theorem 1. The transition kernel of the exact Metropolis-Hastings algorithm can be written as

where $g_{Q}(\theta^{\prime})\overset{\textrm{def}}{=}\int_{\theta}dP(\theta)q(\theta^{\prime}|\theta)$ is the density that would be obtained by applying one step of Metropolis-Hastings without rejection. So we get an upper bound for the total variation distance as

Apply Lemma 3 with $\Delta=\Delta_{\text{max}}$ and we prove Theorem 1.

Appendix D Optimal Sequential Test Design

It is possible to design optimal tests that minimize the amount of data used while keeping the error below a given tolerance. Ideally, we want to do this based on a tolerance on the error in the stationary distribution ${\cal S}_{\epsilon}$ . Unfortunately, this error depends on the contraction parameter, $\eta$ , of the exact transition kernel, which is difficult to compute. A more practical choice is a bound $\Delta_{\text{max}}$ on the error in the acceptance probability, since the error in ${\cal S}_{\epsilon}$ increases linearly with $\Delta_{\text{max}}$ .

Given $\Delta_{\text{max}}$ , we want to minimize the average data usage $\bar{\pi}$ over the parameters $\epsilon$ (or $G$ ) and/or $m$ (or $\pi_{1}$ ) of the sequential test. Unfortunately, the error is a function of $\mu$ and $\sigma_{l}$ which depend on $\theta$ and $\theta^{\prime}$ , and we cannot afford to change the test design at every iteration.

The expectation w.r.t. $u$ can be computed accurately using one dimensional quadrature. For the expectation w.r.t. $\mu$ and $\sigma_{l}$ , we collect a set of parameter samples $(\theta,\theta^{\prime})$ during burn-in, compute the corresponding $\mu$ and $\sigma_{l}$ for each sample, and use them to empirically estimate the expectation. We can also consider collecting samples periodically and adapting the sequential design over time. Once we obtain a set of samples $\{(\mu,\sigma_{l})\}$ , the optimization is carried out using grid search.

We have been using a constant bound $G$ across all the individual tests. This is known as the Pocock design (Pocock, 1977). A more flexible sequential design can be obtained by allowing $G$ to change as a function of $\pi$ . Wang and Tsiatis (1987) proposed a bound sequence $G_{j}=G_{0}\pi_{j}^{0.5-\alpha}$ where $\alpha\in[0.5,1]$ is a free parameter. When $\alpha=0$ , it reduces to the Pocock design, and when $\alpha=1$ , it reduces to O’Brien-Fleming design (O’Brien and Fleming, 1979). We can adopt this more general form in our optimization problem straightforwardly, and the grid search will now be conducted over three parameters, $\pi_{1}$ , $G_{0}$ , and $\alpha$ .

Appendix E Reversible Jump MCMC

We give a more detailed description of the different transition moves used in experiment 6.3. The update move is the usual MCMC move which involves changing the parameter vector $\beta$ without changing the model $\gamma$ . Specifically, we randomly pick an active component $j:\gamma_{j}=1$ and set $\beta_{j}=\beta_{j}+\eta$ where $\eta\sim\mathcal{N}(0,\sigma_{update})$ . The birth move involves (for $k<D$ ) randomly picking an inactive component $j:\gamma_{j}=0$ and setting $\gamma_{j}=1$ . We also propose a new value for $\beta_{j}\sim\mathcal{N}(0,\sigma_{birth})$ . The birth move is paired with a corresponding death move (for $k>1$ ) which involves randomly picking an active component $j:\gamma_{j}=1$ and setting $\gamma_{j}=0$ . The corresponding $\beta_{j}$ is discarded. The probabilities of picking these moves $p(\gamma\rightarrow\gamma^{\prime})$ is the same as in Chen et al. (2011). The value of $\mu_{0}$ used in the MH test for different moves is given below. 1. Update move:

We used $\sigma_{update}=0.01$ and $\sigma_{birth}=0.1$ in this experiment. As mentioned in the main text, both the exact reversible jump algorithm and our approximate version suffer from local minima. But, when initialized with the same values, we obtain similar results with both algorithms. For example, we plot the marginal posterior probability of including a feature in the model, i.e. $p(\gamma_{j}=1|X_{N},y_{N},\lambda)$ in figure LABEL:fig:rjmcmc_incl.

Appendix F Application to Gibbs Sampling

The same sequential testing method can be applied to the Gibbs sampling algorithm for discrete models. We study a model with binary variables in this paper while the extension to multi-valued variables is also possible. Consider running a Gibbs sampler on a probability distribution over $D$ binary variables $P(X_{1},\dots,X_{D})$ . At every iteration, it updates one variable $X_{i}$ using the following procedure:

where $x_{-i}$ denotes the value of all variables other than the $i^{\text{th}}$ one.

The condition in step 2 is equivalent to checking:

When the joint distribution is expensive to compute but can be represented as a product over multiple terms, $P(X)=\prod_{n=1}^{N}f_{n}(X)$ , we can apply our sequential test to speed up the Gibbs sampling algorithm. In this case the variable $\mu_{0}$ and $\mu$ is given by

Similar to the Metropolis-Hastings algorithm, given an upper bound in the error of the approximate conditional probability

For a Gibbs sampler with a Dobrushin coefficient $\eta\in[0,1)$ (Brémaud, 1999, §7.6.2), the distance between the stationary distribution and that of the approximate Gibbs sampler $S_{\epsilon}$ is upper bounded by

The proof is similar to that of Theorem 1. We first obtain an upper bound for the one step error and then plug it into Lemma 3.

The exact transition kernel of the Gibbs sampler for variable $X_{i}$ can be represented by a matrix $\mathcal{T}_{0,i}$ of size $2^{D}\times 2^{D}$ :

where $1\leq i\leq N,x,y\in\{0,1\}^{D}$ . The approximate transition kernel $\mathcal{T}_{\epsilon,i}$ can be represented similarly as

where $P_{\epsilon}$ is the approximate conditional distribution. Define the approximation error $\Delta\mathcal{T}_{i}(x,y)\overset{\textrm{def}}{=}\mathcal{T}_{\epsilon,i}(x,y)-\mathcal{T}_{0,i}(x,y)$ . We know that $\Delta\mathcal{T}_{i}(x,y)=0$ if $y_{-i}\neq x_{-i}$ and it is upper bounded by $\Delta_{\text{max}}$ from the premise of Theorem 4.

Notice that the total variation distance reduces to a half of the $L_{1}$ distance for discrete distributions. For any distribution $P$ , the one step error is bounded as

For a Gibbs sampling algorithm, we have the contraction condition (Brémaud, 1999, §7.6.2):

Plug $\Delta=\Delta_{\text{max}}$ and $\eta$ into Lemma 3 and we obtain the conclusion. ∎

We illustrate the performance of our approximate Gibbs sampling algorithm on a synthetic Markov Random Field. The model under consideration has $D=100$ binary variables and they are densely connected by potential functions of three variables $\psi_{i,j,k}(X_{i},X_{j},X_{k}),\forall i\neq j\neq k$ . There are $D(D-1)(D-2)/6$ potential functions in total (we assume potential functions with permuted indices in the argument are the same potential function), and every function has $2^{3}=8$ values. The entries in the potential function tables are drawn randomly from a log-normal distribution, $\log\psi_{i,j,k}(X_{i},X_{j},X_{k})\sim\mathcal{N}(0,0.02)$ . To draw a Gibbs sample for one variable $X_{i}$ we have to compute $(D-1)(D-2)/2=4851$ pairs of potential functions as

The approximate methods use a mini-batches of $500$ pairs of potential functions at a time. We compare the exact Gibbs sampling algorithm with approximate versions with $\epsilon\in\{0.01,0.05,0.1,0.15,0.2,0.25\}$ .

To measure the performance in approximating $P(X)$ with samples ${x_{t}}$ , the ideal metric would be a distance between the empirical joint distribution and $P$ . Since it is impossible to store all the $2^{100}$ probabilities, we instead repeatedly draw $M=1600$ subsets of $5$ variables, $\{s_{m}\}_{m=1}^{M},s_{m}\subset\{1,\dots,D\},|s_{m}|=5$ , and compute the average $L_{1}$ distance of the joint distribution on these subsets between the empirical distribution and $P$ :

The true $P$ is estimated by running exact Gibbs chains for a long time. We show the empirical conditional probability obtained by our approximate algorithms (percentage of $X_{i}$ being assigned $1$ ) for different $\epsilon$ in Fig. 14. It tends to underestimate large probabilities and overestimate on the other end. When $\epsilon=0.01$ , the observed maximum error is within $0.01$ .

Fig. 15 shows the error for different $\epsilon$ as a function of the running time. For small $\epsilon$ , we use fewer mini-batches per iteration and thus generate more samples in the same amount of time than the exact Gibbs sampler. So the error decays faster in the beginning. As more samples are collected the variance is reduced. We see that these plots converge towards their bias floor while the exact Gibbs sampler out-performs all the approximate methods at around $1000$ seconds.