Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature

Tom Gunter, Michael A. Osborne, Roman Garnett, Philipp Hennig, Stephen J. Roberts

Introduction

Bayesian approaches to machine learning problems inevitably call for the frequent approximation of computationally intractable integrals of the form

In a big-data setting, where likelihood function evaluations are prohibitively expensive, bq is demonstrably better than Monte Carlo approaches ??). As the cost of the likelihood decreases, however, bq no longer achieves a higher effective sample rate per second, because the computational cost of maintaining the gp model and active sampling becomes relevant, and many Monte Carlo samples may be generated for each new bq sample. Our goal was to develop a cheap and accurate bq model alongside an efficient active sampling scheme, such that even for low cost likelihoods bq would be the scheme of choice. Our contributions extend existing work in two ways:

Square-root gp: Foundational work ????) on bq employed a gp prior directly on the likelihood function, making no attempt to enforce non-negativity a priori. ?) introduced an approximate means of modelling the logarithm of the integrand with a gp. This involved making a first-order approximation to the exponential function, so as to maintain tractability of inference in the integrand model. In this work, we choose another classical transformation to preserve non-negativity—the square-root. By placing a gp prior on the square-root of the integrand, we arrive at a model which both goes some way towards dealing with the high dynamic range of most likelihoods, and enforces non-negativity without the approximations resorted to in ?).

Fast Active Sampling: Whereas most approaches to bq use either a randomised or fixed sampling scheme, ?) targeted the reduction in the expected variance of $Z$ . Here, we sample where the expected posterior variance of the integrand after the quadratic transform is at a maximum. This is a cheap way of balancing exploitation of known probability mass and exploration of the space in order to approximately minimise the entropy of the integral.

We compare our approach, termed warped sequential active Bayesian integration (wsabi), to non-negative integration with standard Monte Carlo techniques on simulated and real examples. Crucially, we make comparisons of error against ground truth given a fixed compute budget.

Bayesian Quadrature

If the gp prior is placed directly on the likelihood in the style of traditional Bayes–Hermite quadrature, the optimal point to add a sample (from an information gain perspective) is dependent only on $x_{d}$ —the locations of the previously sampled points. This means that given a budget of $N$ samples, the most informative set of function evaluations is a design that can be pre-computed, completely uninfluenced by any information gleaned from function values ?). In ?), where the log-likelihood is modelled by a gp, a dependency is introduced between the uncertainty over the function at any point and the function value at that point. This means that the optimal sample placement is now directly influenced by the obtained function values.

Square-Root Bayesian Quadrature

Crucially, likelihoods are non-negative, a fact neglected by traditional Bayes–Hermite quadrature. In ?) the logarithm of the likelihood was modelled, and approximate the posterior of the integral, via a linearisation trick. We choose a different member of the power transform family—the square-root.

The square-root transform halves the dynamic range of the function we model. This helps deal with the large variations in likelihood observed in a typical model, and has the added benefit of extending the autocorrelation range (or the input length-scale) of the gp, yielding improved predictive power when extrapolating away from existing sample points.

2 Moment Matching

3 Quadrature

It is clear that the posterior variance of the likelihood model is now a function of both the expected value of the likelihood at that point, and the distance of that sample location from the rest of $x_{d}$ . This is visualised in Figure 2.

Comparing Figures 1 and 2 we see that conditioned on an identical set of samples, wsabi both achieves a closer fit to the true underlying function, and associates minimal probability mass with negative function values. These are desirable properties when modelling likelihood functions—both arising from the use of the square-root transform.

Active Sampling

2 Uncertainty sampling

in the case of our moment matched approximation, and, under the linearisation approximation,

Figures 5–6 illustrate the result of square-root Bayesian quadrature, conditioned on 15 samples selected sequentially under utility functions (14) and (15) respectively. In both cases the posterior mean has not been scaled by the prior $\pi(x)$ (but the variance has). This is intended to exaggerate the contributions to the mean made by wsabi-m.

A good posterior estimate of the integral has been achieved, and this set of samples is more informative than a grid under the utility function of minimising the integral error. In all active-learning examples a covariance matrix adaptive evolution strategy (cma-es) ?) global optimiser was used to explore the utility function surface before selecting the next sample.

Results

Given this new model and fast active sampling scheme for likelihood surfaces, we now test for speed against standard Monte Carlo techniques on a variety of problems.

We generated 16 likelihoods in four-dimensional space by selecting $K$ normal distributions with $K$ drawn uniformly at random over the integers $5$ – $14$ . The means were drawn uniformly at random over the inner quarter of the domain (by area), and the covariances for each were produced by scaling each axis of an isotropic Gaussian by an integer drawn uniformly at random between $21$ and $29$ . The overall likelihood surface was then given as a mixture of these distributions, with weights given by partitioning the unit interval into $K$ segments drawn uniformly at random—‘stick-breaking’. This procedure was chosen in order to generate ‘lumpy’ surfaces. We budgeted 500 samples for our new method per likelihood, allocating the same amount of time to simple Monte Carlo (smc).

Naturally the computational cost per evaluation of this likelihood is effectively zero, which afforded smc just under 86 000 samples per likelihood on average. wsabi was on average faster to converge to $10^{-3}$ error (Figure 7), and it is visible in Figure 8 that the likelihood of the ground truth is larger under this model than with smc. This concurs with the fact that a tighter bound was achieved.

2 Marginal Likelihood of GP Regression

As an initial exploration into the performance of our approach on real data, we fitted a Gaussian process regression model to the yacht hydrodynamics benchmark dataset ?). This has a six-dimensional input space corresponding to different properties of a boat hull, and a one-dimensional output corresponding to drag coefficient. The dataset has 308 examples, and using a squared exponential ard covariance function a single evaluation of the likelihood takes approximately 0.003 seconds.

Marginalising over the hyperparameters of this model is an eight-dimensional non-analytic integral. Specifically, the hyperparameters were: an output length-scale, six input length-scales, and an output noise variance. We used a zero-mean isotropic Gaussian prior over the hyperparameters in log space with variance of 4. We obtained ground truth through exhaustive smc sampling, and budgeted 1 250 samples for wsabi. The same amount of compute-time was then afforded to smc, ais (which was implemented with a Metropolis–Hastings sampler), and Bayesian Monte Carlo (bmc). smc achieved approximately 375 000 samples in the same amount of time. We ran ais in 10 steps, spaced on a log-scale over the number of iterations, hence the ais plot is more granular than the others (and does not begin at 0). The ‘hottest’ proposal distribution for ais was a Gaussian centered on the prior mean, with variance tuned down from a maximum of the prior variance.

Figure 9 shows the speed with which wsabi converges to a value very near ground truth compared to the rest. ais performs rather disappointingly on this problem, despite our best attempts to tune the proposal distribution to achieve higher acceptance rates.

Although the first datapoint (after 10 000 samples) is the second best performer after wsabi, further compute budget did very little to improve the final ais estimate. bmc is by far the worst performer. This is because it has relatively few samples compared to smc, and those samples were selected completely at random over the domain. It also uses a gp prior directly on the likelihood, which due to the large dynamic range will have a poor predictive performance.

3 Marginal Likelihood of GP Classification

We fitted a Gaussian process classification model to both a one dimensional synthetic dataset, as well as real-world binary classification problem defined on the nodes of a citation network ?). The latter had a four-dimensional input space and 500 examples. We use a probit likelihood model, inferring the function values using a Laplace approximation. Once again we marginalised out the hyperparameters.

4 Synthetic Binary Classification Problem

We generate 500 binary class samples using a 1D input space. The gp classification scheme implemented in Gaussian Processes for Machine Learning Matlab Toolbox (gpml) (?) is employed using the inference and likelihood framework described above. We marginalised over the three-dimensional hyperparameter space of: an output length-scale, an input length-scale and a ‘jitter’ parameter. We again tested against bmc, ais, smc and, additionally, Doubly-Bayesian Quadrature (bbq) (?). Ground truth was found through 100 000 smc samples.

This time the acceptance rate for ais was significantly higher, and it is visibly converging to the ground truth in Figure 10, albeit in a more noisy fashion than the rest. wsabi-l performed particularly well, almost immediately converging to the ground truth, and reaching a tighter bound than smc in the long run. bmc performed well on this particular example, suggesting that the active sampling approach did not buy many gains on this occasion. Despite this, the square-root approaches both converged to a more accurate solution with lower variance than bmc. This suggests that the square-root transform model generates significant added value, even without an active sampling scheme. The computational cost of selecting samples under bbq prevents rapid convergence.

5 Real Binary Classification Problem

For our next experiment, we again used our method to calculate the model evidence of a gp model with a probit likelihood, this time on a real dataset.

The dataset, first described in (?), was a graph from a subset of the CiteSeerx citation network. Papers in the database were grouped based on their venue of publication, and papers from the 48 venues with the most associated publications were retained. The graph was defined by having these papers as its nodes and undirected citation relations as its edges. We designated all papers appearing in nips proceedings as positive observations. To generate Euclidean input vectors, the authors performed “graph principal component analysis” on this network (?); here, we used the first four graph principal components as inputs to a gp classifier. The dataset was subsampled down to a set of 500 examples in order to generate a cheap likelihood, half of which were positive.

Across all our results, it is noticeable that wsabi-m typically performs worse relative to wsabi-l as the dimensionality of the problem increases. This is due to an increased propensity for exploration as compared to wsabi-l. wsabi-l is the fastest method to converge on all test cases, apart from the synthetic mixture model surfaces where wsabi-m performed slightly better (although this was not shown in Figure 7). These results suggest that an active-sampling policy which aggressively exploits areas of probability mass before exploring further afield may be the most appropriate approach to Bayesian quadrature for real likelihoods.

Conclusions

We introduced the first fast Bayesian quadrature scheme, using a novel warped likelihood model and a novel active sampling scheme. Our method, wsabi, demonstrates faster convergence (in wall-clock time) for regression and classification benchmarks than the Monte Carlo state-of-the-art.