Particle learning of Gaussian process models for sequential design and optimization

Robert B. Gramacy, Nicholas G. Polson

Introduction

The Gaussian process (GP) is by now well established as the backbone of many highly flexible and effective nonlinear regression and classification models (e.g., Neal, 1998; Rasmussen and Williams, 2006). One important application for GPs is in the sequential design of computer experiments (Santner et al., 2003) where designs are built up iteratively: choose a new design point $x$ according to some criterion derived from a GP surrogate model fit; update the fit conditional on the new pair $(x,y(x))$; and repeat. The goal is to keep designs small in order to save on expensive simulations of $y(x)$. By “fit” we colloquially mean: samples obtained from the GP posterior via MCMC. While it is possible to choose each new design point via a full utility-based design criterion (e.g., Müller et al., 2004), this can be computationally daunting even for modestly sized designs. More thrifty active learning (AL) criteria such as ALM (MacKay, 1992) and ALC (Cohn, 1996) can be an effective alternative. These were first used with GPs by Seo et al. (2000), and have since been paired with a non-stationary GP to design a rocket booster (Gramacy and Lee, 2009).

Similar AL criteria are available for other sequential design tasks. Optimization by expected improvement (EI, Jones et al., 1998) is one example. Taddy et al. (2009) used an embellished EI with a non-stationary GP model and MCMC inference to determine the optimal robust configuration of a circuit device. In the classification setting, characteristics like the predictive entropy (Joshi et al., 2009) can be used to explore the boundaries between regions of differing class label in order to maximize the information obtained from each new $x$. The thrifty nature of AL and the flexibility of the GP make for a favorable marriage indeed. However, a drawback of batch MCMC-based inference is that it is not tailored to the online nature of sequential design. Except to guide the initialization of a new Markov chain, it is not clear how fits from earlier iterations may be re-used in search of the next $x$. So after the design is augmented with $(x,y(x))$ the MCMC must be restarted and iterated to convergence.

In this paper we propose to use a sequential Monte Carlo (SMC) technique called particle learning (PL) to exploit the analytically tractable (and Rao–Blackwellizable) GP posterior predictive distribution in order to obtain a quick update of the GP fit after each sequential design iteration. We then show how some key AL heuristics may be efficiently calculated from the particle approximation. Taken separately, SMC/PL, GPs, and AL, are by now well established techniques in their own right. Our contribution lies in illustrating how together they can be a potent mixture for sequential design and optimization under uncertainty.

The remainder of the paper is outlined as follows. Section 1.1 describes the basic elements of GP modeling. Section 1.2 reviews SMC and PL, highlighting the strengths of PL in our setting. Section 2 develops a PL implementation for GP regression and classification, with illustrations and comparisons to MCMC. We show how fast updates of particle approximations may be used for AL in optimization and classification in Section 3, and we conclude with a discussion in Section 4. Software implementing our methods, and the specific code for our illustrative examples, is available in the plgp package (Gramacy, 2010) for R on CRAN.

It is possible to use a vague scale-invariant prior ($a,b=0$) for $\sigma^{2}$. In this case, the marginal posterior (1) is proper as long as $N>p+1$. Mixing is generally good for Metropolis–Hastings (MH) sampling as long as $K(\cdot,\cdot)$ is parsimoniously parameterized, $N$ is large, and there is a high signal-to-noise ratio between $X$ and $Y$. Otherwise, the posterior can be multimodal (e.g., Warnes and Ripley, 1987) and hard to sample.

Crucially for our SMC inference via PL [Section 2], and for our AL heuristics [Section 3], the fully marginalized predictive equations for GP regression are available in closed form. Specifically, the distribution of the response $Y(x)$ conditioned on data $D$ and covariance $K(\cdot,\cdot)$, i.e., $p(y(x)|D,K)$, is Student-$t$ with degrees of freedom $\hat{\nu}=N-p-1$,

where $k^{\top}(x)$ is the $N$-vector whose $i^{\mbox{\tiny th}}$ component is $K(x,x_{i})$.

In the classification problem, with data $D=(X,C)$, where $C$ is an $N\times 1$ vector of class labels $c_{i}\in\{1,\dots,M\}$, the GP is used $M$-fold as a prior over a collection of $M\times N$ latent variables $\mathcal{Y}=\{Y_{(m)}\}_{m=1}^{M}$, one set for each class. For a particular class $m$, the generative model (or prior) over the latent variables is MVN with mean $\mu_{(m)}(X)$ and variance $\Sigma_{(m)}(X)$, as in the regression setup. The class labels then determine the likelihood through the latent variables under an independence assumption so that $p(C_{N}|\mathcal{Y})=\prod_{i=1}^{N}p_{i}$, where $p_{i}=p(C(x_{i})=c_{i}|\mathcal{Y}_{i})$. Neal (1998) recommends a softmax specification:

1.2 Sequential Monte Carlo

Sequential Monte Carlo (SMC) is an alternative to MCMC that is designed for online inference in dynamic models. In SMC, particles $\{S_{t}^{(i)}\}_{i=1}^{N}$ containing the sufficient information about all uncertainties given data $z^{t}=(z_{1},\dots,z_{t})$ up to time $t$ are used to approximate the posterior distribution: $\{S_{t}^{(i)}\}_{i=1}^{N}\sim p(S_{t}|z^{t})$ . In Section 2 we describe the sufficient information $S_{t}$ for our GP regression and classification models. The key task in SMC inference is to update the particle approximation from time $t$ to time $t+1$ .

Our preferred SMC updating method is particle learning (PL, e.g., Carvalho et al., 2008) due to the convenient form of the posterior predictive distribution of GP models. The PL update is derived from the following decomposition.
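For reference, a standard way to write such a decomposition in the notation above is sketched here. It follows from Bayes' rule under the assumption that $S_{t}$ is sufficient for $z^{t}$, and it is not necessarily the authors' exact display:

$$
p(S_{t+1}\mid z^{t+1})
=\int p(S_{t+1}\mid S_{t},z_{t+1})\,dP(S_{t}\mid z^{t+1})
\;\propto\;\int p(S_{t+1}\mid S_{t},z_{t+1})\,p(z_{t+1}\mid S_{t})\,dP(S_{t}\mid z^{t}).
$$

The factor $p(z_{t+1}\mid S_{t})$ reweights the current particles, and $p(S_{t+1}\mid S_{t},z_{t+1})$ moves the reweighted particles forward.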

This suggests a two-step update of the particle approximation:

1. Resample the indices $\{i\}_{i=1}^{N}$ with replacement from a multinomial distribution where each index has weight $w_{i}\propto p(z_{t+1}|S_{t}^{(i)})=\int p(z_{t+1}|S_{t+1})p(S_{t+1}|S_{t})\,dS_{t+1}$, thus obtaining new indices $\{\zeta(i)\}_{i=1}^{N}$.

2. Propagate with a draw from $S_{t+1}^{(i)}\sim p(S_{t+1}|S_{t}^{\zeta(i)},z_{t+1})$ to obtain a new collection of particles $\{S_{t+1}^{(i)}\}_{i=1}^{N}\sim p(S_{t+1}|z^{t+1})$.
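The two steps can be sketched generically as follows. This is a minimal illustration, not the plgp implementation: the function names `pl_update`, `predictive` and `propagate` are hypothetical, and the toy model (a static mean `mu` with unit-variance Gaussian observations) is an assumption chosen only so that the example runs on its own.

```python
import math
import random

def pl_update(particles, z_new, predictive, propagate, rng=random):
    """One PL update: resample by predictive weight, then propagate."""
    # Resample: weight each particle by p(z_new | S_t^(i)).
    weights = [predictive(z_new, s) for s in particles]
    if sum(weights) <= 0.0:
        raise ValueError("all predictive weights are zero")
    # choices() draws indices with replacement (multinomial resampling).
    idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    # Propagate: draw S_{t+1}^(i) given the resampled particle and z_new.
    return [propagate(particles[j], z_new, rng) for j in idx]

def normal_pdf(z, mu, sd=1.0):
    return math.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Toy usage (assumed model): static parameter, so propagation is a plain copy.
rng = random.Random(0)
particles = [{"mu": rng.uniform(-5.0, 5.0)} for _ in range(500)]
for z in [2.1, 1.9, 2.0, 2.2, 1.8]:
    particles = pl_update(
        particles, z,
        predictive=lambda z, s: normal_pdf(z, s["mu"]),
        propagate=lambda s, z, rng: dict(s),
        rng=rng,
    )
post_mean = sum(s["mu"] for s in particles) / len(particles)
```

Because the toy propagate step only copies, repeated resampling thins out the distinct particle values, which is the depletion issue that rejuvenation is meant to address.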

The core components of PL are not new to the SMC arsenal. Early examples of related propagation methods include those of Kong et al. (1994), with resampling and the propagation of sufficient statistics by Liu and Chen (1995, 1998), and look-ahead by Pitt and Shephard (1999). Like many SMC algorithms, PL is susceptible to an accumulation of Monte Carlo error with large data sets. However, two aspects of our setup mitigate these concerns to a large extent. Firstly, the over-arching goal of sequential design is to keep data sets as small as possible. GPs scale poorly to large data sets anyway, regardless of the method of inference (SMC, MCMC, etc.), so drastically different approaches are recommended for large-scale sequential design. Secondly, we only use vague priors for parameters which can be analytically integrated out in the posterior predictive (the main workhorse of PL), so that there is no need to sample them. In this way we extend the class of models for which SMC algorithms apply. However, we note that in order to use vague priors we must initialize the particles at some time $t_{0}>0$. Further explanation and development are provided in Section 2.

Particle Learning for Gaussian processes

To implement PL for GPs we need to: identify the sufficient information $S_{t}$; initialize the particles; derive $p(z_{t+1}|S_{t})$ for the resample step; and determine $p(S_{t+1}|S_{t},z_{t+1})$ for the propagate step. We first develop these quantities for GP regression and then extend them to classification. Although GPs are not dynamic models, we will continue to index the data size, which was $N$ in $D_{N}$ in Section 1.1, with $t$ in the SMC framework so that $z^{t}\equiv D_{N}$. We use $N$ for the number of particles. As GPs are nonparametric priors, their sufficient information has size in $\Omega(t)$, i.e., it depends upon the full $z^{t}$. For example, the covariance $\Sigma(X_{t})$ typically requires maintaining $O(t^{2})$ quantities to store the distances between the pairs of rows in $X_{t}$. Therefore $z^{t}$ is tacitly part of the sufficient information $S_{t}$.

Propagate: The propagate step updates each resampled sufficient information $S_{t}^{\zeta(i)}$ to account for $z_{t+1}=(x_{t+1},y_{t+1})$. Since the parameters to $K(\cdot,\cdot)$ are static, i.e., they do not change in $t$, they may be propagated deterministically by copying them from $S_{t}^{\zeta(i)}$ to $S_{t+1}^{(i)}$. We note that, as a matter of efficient bookkeeping, it is the correlation matrix $K_{t+1}$ and its inverse $K_{t+1}^{-1}$ that are required for our PL update, not the values of the parameters directly. The new $K_{t+1}^{(i)}$ is built from $K_{t}^{(i)}$ and $K^{(i)}(x_{t+1},x_{j})$, for $j=1,\dots,t+1$ as
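One way to carry out this bookkeeping is to border $K_{t}$ with the new column of correlations and to update the inverse with the partitioned-inverse identities, which costs $O(t^{2})$ rather than the $O(t^{3})$ of a fresh decomposition. The sketch below is an illustration under that assumption; the function name `extend_inverse` and the list-of-lists matrix representation are hypothetical and are not taken from plgp.

```python
def extend_inverse(K_inv, k, kappa):
    """Inverse of the bordered matrix [[K, k], [k^T, kappa]] given K_inv = K^{-1}.

    K is assumed symmetric positive definite; k holds K(x_new, x_j) for the
    existing inputs and kappa = K(x_new, x_new) (including any nugget).
    """
    t = len(k)
    # u = K^{-1} k, an O(t^2) matrix-vector product.
    u = [sum(K_inv[i][j] * k[j] for j in range(t)) for i in range(t)]
    # Schur complement of K in the bordered matrix.
    s = kappa - sum(k[i] * u[i] for i in range(t))
    if s <= 0.0:
        raise ValueError("bordered matrix is not positive definite")
    # Top-left block K^{-1} + u u^T / s, with the new last column -u / s.
    new = [[K_inv[i][j] + u[i] * u[j] / s for j in range(t)] + [-u[i] / s]
           for i in range(t)]
    # New last row: (-u^T / s, 1 / s).
    new.append([-u[j] / s for j in range(t)] + [1.0 / s])
    return new

# Toy usage: grow a 1 x 1 matrix [[1.1]] to [[1.1, 0.5], [0.5, 1.1]].
K_inv = [[1.0 / 1.1]]
K_inv2 = extend_inverse(K_inv, [0.5], 1.1)
```

The same Schur complement `s` also appears in the predictive variance at the new input, so an implementation can reuse it rather than recompute it.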

Deterministically copying $K(\cdot,\cdot)$ in the propagate step is fast, but it may lead to particle depletion in future resample steps. An alternative is to augment the propagate with a sample from the posterior distribution via MCMC to rejuvenate the particles (e.g., MacEachern et al., 1999; Gilks and Berzuini, 2001). In our regression GP context, just a single MH step for the parameters to $K(\cdot,\cdot)$ using Eq. (1), for each particle, suffices. The particles represent “chains” in equilibrium so it is sensible to tune the MH proposals for likely acceptance by making their variance small, initially, relative to the posterior at the starting time $t=t_{0}$, and then further decreasing it multiplicatively as $t$ increments. Such MH rejuvenations position the propagate step as a local maneuver in the Monte Carlo method, whereas resampling via the predictive is a more global step. Together they can emulate an ensemble method.
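A single rejuvenation move of this kind might look like the following sketch. Several details are assumptions made for illustration only: the parameter is treated as an unconstrained scalar (for example a log-transformed range), the proposal is a symmetric Gaussian random walk, the shrink factor `rho` is hypothetical, and `toy_log_post` stands in for the log of the marginal posterior in Eq. (1).

```python
import math
import random

def proposal_sd(t, t0, sd0, rho=0.95):
    """Proposal s.d. that shrinks multiplicatively as t moves past t0."""
    return sd0 * rho ** (t - t0)

def mh_rejuvenate(phi, log_post, sd, rng=random):
    """One random-walk Metropolis step for an unconstrained scalar phi."""
    prop = phi + rng.gauss(0.0, sd)
    log_ratio = log_post(prop) - log_post(phi)
    # Symmetric proposal: accept with probability min(1, exp(log_ratio)).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return prop
    return phi

# Toy usage: one rejuvenation step per particle at time t = 30.
rng = random.Random(1)
toy_log_post = lambda v: -0.5 * v * v  # hypothetical stand-in for log of Eq. (1)
particles = [rng.gauss(0.0, 1.0) for _ in range(100)]
sd_t = proposal_sd(t=30, t0=20, sd0=0.5)
particles = [mh_rejuvenate(p, toy_log_post, sd_t, rng) for p in particles]
```

A rejected proposal leaves the particle unchanged, so the step never does worse than the deterministic copy; it only adds the cost of one extra posterior evaluation per particle.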

Consider the 1-d synthetic sinusoidal data first used by Higdon (2002),

where $x\in[0,9.6]$, capturing two periods of low fidelity oscillation (the sine term). We observe the response with noise $Y(x)\sim N(y(x),\sigma=0.1)$. At this noise level it is difficult to distinguish the high fidelity oscillations (the cosine term) from the noise without many samples. We used a $T=50$ Latin hypercube design (LHD, e.g., Santner et al., 2003, Section 5.2.2), which is just large enough to begin to detect the high fidelity structure.
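For readers unfamiliar with LHDs, a basic random Latin hypercube can be generated as in the sketch below: each input dimension is cut into $T$ equal strata and exactly one point falls in each stratum. The function name `latin_hypercube` is hypothetical, and this plain construction is only one of several LHD variants; it is not claimed to be the generator used for the experiments.

```python
import random

def latin_hypercube(n, bounds, rng=random):
    """n-point random Latin hypercube; bounds is a list of (low, high) pairs."""
    columns = []
    for low, high in bounds:
        strata = list(range(n))
        rng.shuffle(strata)  # visit the n strata in random order
        width = (high - low) / n
        # One uniformly jittered point inside each stratum.
        columns.append([low + (s + rng.random()) * width for s in strata])
    # Transpose the per-dimension columns into n design points.
    return [tuple(col[i] for col in columns) for i in range(n)]

# A T = 50 design on the interval [0, 9.6] used in the illustration.
design = latin_hypercube(50, [(0.0, 9.6)], rng=random.Random(42))
```

In one dimension the stratification guarantees that the 50 inputs are spread over the whole interval, which a simple uniform sample of the same size does not.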

The left panel of Figure 1 shows the point-wise predictive distribution for each of the 1,000 particles in terms of the mean(s) and central 90% credible interval(s) of the Student- $t$ distributions (3–4) with parameters $\hat{y}_{t}^{(i)}$ , $\hat{\sigma}^{2(i)}_{t}$ and $\hat{\nu}_{t}^{(i)}$ obtained from $S_{t}^{(i)}$ . Their average, the posterior mean predictive surface, is shown on the right. Observe that some particles lead to higher fidelity surfaces (finding the cosine) than others (only finding the sine). Figure 2 shows the samples of the range ( $d$ ) and nugget ( $g$ ) obtained from the particles. Only 200 of the 1,000 are shown to reduce clutter. The clustering pattern of the black diamonds indicates a multimodal posterior.

For contrast we also took 10,000 MCMC samples from the full data posterior, thinning every 10 and saving 1,000. This took about one minute on our workstation, which is faster than the full PL run, but much slower than the individual updates $t\rightarrow t+1$ . The marginal chains for $d$ and $g$ seemed to mix well (not shown) but, as Figure 2 shows [plotting last 200 sample pairs as red squares], the chain nevertheless became stuck in a mode of the posterior, and only explored a portion of the high density region.

For a more numerical comparison we calculated the RMSE of predictive means (obtained via PL and MCMC, as above) to the truth on a random LHD of size 1000. This was repeated 100 times, each with new LHD training (size 50, as above) and test sets. The mean (sd) RMSE was $0.00079\;(0.00069)$ for PL, and $0.00098\;(0.00075)$ for MCMC. Treating the results as paired data, the proportion of repetitions in which PL had a lower RMSE than MCMC was 0.64, which is statistically significant ($p=5.837\times 10^{-5}$) using a standard one-sided $t$-test. In short, the SMC/PL method performs at least as well as the MCMC, with quicker sequential updates. The MCMC could be re-tuned, restarted, and/or run for longer to narrow the RMSE gap, but all of these would come at greater computational expense.

2.2 Classification

Resample: It may be helpful to think of the latent $\mathcal{Y}^{t}$ as playing the role of (hidden) states in a dynamic model. Indeed, their treatment in the PL update is similar. However, note that they do not satisfy any Markov property. The predictive density $p(z_{t+1}|S_{t})$ , which is needed for the resample step, is the probability of the label $c_{t+1}(x_{t+1})$ under the sufficient information $S_{t}$ : $p(c_{t+1}(x_{t+1})|S_{t})$ . This depends upon the $M$ latents $\mathcal{Y}(x_{t+1})$ , which are not part of $S_{t}$ . For an arbitrary $x$ , the law of total probability gives

The second equality comes since, conditional on $\mathcal{Y}(x)$ , the label does not depend on any other quantity (5). The $M$ GP priors are independent, so $p(\mathcal{Y}(x)|S_{t})$ decomposes as

where each component in the product is a Student- $t$ density (3–4).

The $M$ -dimensional integral in Eq. (7) is not analytically tractable, but it is trivial to approximate by Monte Carlo as follows. Simulate many independent collections of samples from each of the $M$ Student- $t$ distributions (8):
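The Monte Carlo approximation can be sketched as follows for a single particle: draw the $M$ latents from their Student-$t$ predictive distributions, push each draw through the softmax, and average. The helper names are hypothetical, the per-class location, scale and degrees of freedom are assumed to be supplied by the particle, and the sign convention inside the softmax (here, larger latent means more probable class) is an assumption that should be matched to Eq. (5).

```python
import math
import random

def student_t_draw(mean, scale, nu, rng=random):
    """Draw from a location-scale Student-t with nu degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu d.o.f.
    return mean + scale * z / math.sqrt(chi2 / nu)

def softmax(latents):
    m = max(latents)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in latents]
    total = sum(e)
    return [v / total for v in e]

def class_probs(means, scales, nus, n_draws=1000, rng=random):
    """Monte Carlo estimate of p(c | S_t), c = 1..M, for one particle."""
    M = len(means)
    acc = [0.0] * M
    for _ in range(n_draws):
        y = [student_t_draw(means[m], scales[m], nus[m], rng) for m in range(M)]
        for m, p in enumerate(softmax(y)):
            acc[m] += p
    return [a / n_draws for a in acc]

# Toy usage with M = 3 classes; the first class has the largest latent mean.
probs = class_probs([3.0, 0.0, 0.0], [0.5, 0.5, 0.5], [10, 10, 10],
                    rng=random.Random(3))
```

The resample weight for a particle is then the entry of `probs` corresponding to the observed label $c_{t+1}$.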

Sampling the latents may proceed via ARS, following Neal (1998). However, as in the regression setup, we prefer a more local move in the PL propagate context to complement the globally-scoped resample step. So instead we follow Broderick and Gramacy (2010) in using 10-fold randomly blocked MH-within-Gibbs sampling. This approach exploits a factorization of the posterior as the product of the class likelihood (5) given the underlying latents and their GP prior (7): (dropping the $\zeta(i)$)

Here, $I$ is an element of a 10-fold (random) partition $\mathcal{I}_{10}$ of the indices $1,\dots,t+1$, where $|I|\leq 10$ and $-I=\mathcal{I}_{10}\backslash I$ is its complement. Extending the predictive equations from Section 2.1, the latter term in Eq. (11) is an $|I|$-dimensional Student-$t$ with $\hat{\nu}_{I}=|\!-\!I|-p-1$,

using the condensed notation $Y_{I}\equiv Y_{(m)}(X_{I})$ , and $|I|\times|I^{\prime}|$ matrix $K_{I,I^{\prime}}\equiv K_{(m)}(X_{I},X_{I^{\prime}})$ , etc. A thus proposed $Y_{(m)}^{\prime}(X_{I})$ may be accepted according to the likelihood ratio since the prior and proposal densities cancel in the MH acceptance ratio. Let $\mathcal{Y}^{\prime}_{I}$ denote the collection of $M\times(t+1)$ latents comprised of $Y_{(m)}^{\prime}(X_{I})$ , $Y_{(-m)}^{t+1}(X_{I})$ , and $\mathcal{Y}^{t+1}(X_{-I})$ . Then the MH acceptance probability is $\min\{1,A\}$ where

Upon acceptance we replace $Y_{(m)}^{t+1}(X_{I})$ with $Y_{(m)}^{\prime}(X_{I})$ , and otherwise do nothing. In this way we loop over $m=1,\dots,M$ and $I\in\mathcal{I}_{10}$ to obtain a set of fully propagated latents.

An Illustration: Consider data generated by converting a real-valued output $y(x)=x_{1}\exp(-x_{1}^{2}-x_{2}^{2})$ into classification labels (Broderick and Gramacy, 2010) by taking the sign of the sum of the eigenvalues of the Hessian of $y(x)$. This gives a two-class process where the class is determined by the direction of concavity at $x$. For our illustration we take $x\in^{2}$, and create a third class from the first class (negative sign) where $x_{1}>0$. We use $M-1=2$ GPs, and take our data set to be $T=125$ input–class pairs from a maximum entropy design (MED, Santner et al., 2003, Section 6.2.1). Our $N=1000$ particles are initialized using 10,000 MCMC rounds at time $t_{0}=17$, thinning every 10. This takes less than 2 minutes in R on our workstation. Then we proceed with 108 PL updates, which takes about four hours. The first few updates take less than a minute, whereas the last few take 7–8 minutes.

For a further comparison of timings on a larger classification problem we duplicated the 10-fold cross validation (CV) experiment of Broderick and Gramacy (2010) on the two-class credit approval data which has $p=47$ covariates for 690 $(x,c)$ pairs. The time required for the final PL update ($t\approx 621$) with $N=1000$ particles, averaged over the 10 CV folds, was 38 minutes. The resulting predictor(s) gave exactly the same misclassification error(s), averaging $14.6\%$ ($4\%$ sd) on the hold out sets, as a similar estimator based on MCMC. However, the authors reported that the MCMC took about $5.5$ hours on average. So even with a modestly large design ($\approx 621$), the Monte Carlo error that might accumulate with the use of vague priors in SMC does not seem to (yet) be an issue in our PL implementation. The time savings are large because far fewer $621\times 621$ covariance matrices must be decomposed in the SMC framework.

Sequential design

Here, we illustrate how the online nature of PL is ideally suited to sequential design by AL. Probably the most straightforward AL algorithms in the regression context are ALM and ALC [see Section 1.1]. But these are well known to approximate space filling MEDs for stationary GP models. So instead we consider the sequential design problem of optimizing a noisy black box function. In the classification context we consider the sequential exploration of classification boundaries.

The situation is more complicated when optimizing a noisy function, or with Bayesian inference via Monte Carlo. A re-definition of $f_{\min,t}$ accounts for the noisy ($g>0$) responses: either as the first order statistic of $Y(X_{t})$ or as the minimum of the predictive mean surface, $\min_{x}\hat{y}_{t}(x)$. Now, each sample (e.g., each particle) from the posterior emits an EI. Using our Student-$t$ predictive equations (3–4) for $S_{t}^{(i)}$, letting $\delta_{t}^{(i)}(x)=f_{\min,t}-\hat{y}^{(i)}_{t}(x)$, we have (following Williams et al., 2000):
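To make the averaging over particles concrete, the sketch below estimates the EI at a single candidate $x$ by simulation. It is a Monte Carlo stand-in, not the closed-form Student-$t$ expression referred to above: for each particle it samples from the predictive distribution and averages $\max\{f_{\min,t}-Y(x),0\}$, then averages those values across particles. The function names and the dictionary layout of each particle's predictive summary are hypothetical.

```python
import math
import random

def student_t_draw(mean, scale, nu, rng=random):
    """Draw from a location-scale Student-t with nu degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu d.o.f.
    return mean + scale * z / math.sqrt(chi2 / nu)

def expected_improvement(particles, f_min, n_draws=2000, rng=random):
    """Particle-averaged Monte Carlo estimate of E[max(f_min - Y(x), 0)]."""
    total = 0.0
    for p in particles:
        # Per-particle EI from draws of that particle's predictive at x.
        acc = 0.0
        for _ in range(n_draws):
            y = student_t_draw(p["mean"], p["scale"], p["nu"], rng)
            acc += max(f_min - y, 0.0)
        total += acc / n_draws
    return total / len(particles)

# Toy usage: two particles that disagree about the surface at x.
rng = random.Random(5)
preds = [{"mean": -0.2, "scale": 0.3, "nu": 20},
         {"mean": 0.4, "scale": 0.3, "nu": 20}]
ei = expected_improvement(preds, f_min=0.0, rng=rng)
```

A closed form is preferable in practice because it avoids the inner simulation loop; the sampling version is shown only because it makes the definition of EI and the role of the particle average explicit.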

A remedy, proposed to ensure convergence in the optimization, involves pairing EI with a deterministic numerical optimizer. Taddy et al. (2009) proposed using a GP/EI based approach (with MCMC) as an oracle in a pattern search optimizer called APPS. This high powered combination offers convergence guarantees, but unfortunately requires a highly customized implementation that precludes its use in our illustrations. Gramacy and Taddy (2009, Section 3) propose a simpler, more widely applicable, variant via the opposite embedding. There are (as yet) no convergence guarantees for this heuristic, but it has been shown to perform well in many examples.

3.2 Online learning of classification boundaries

In Section 2.2 [Figure 3] we saw how the predictive entropy could be useful as an AL heuristic for boundary exploration. Joshi et al. (2009) observed that when $M>2$, the probability of the irrelevant class(es) near the boundary between two classes can influence the entropy, and thus the sequential design based upon it, in undesirable ways. They showed that restricting the entropy calculation to the two highest probabilities (best–versus–second–best [BVSB] entropy) is a better heuristic.
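A small sketch of the BVSB idea follows. It keeps the two largest class probabilities, renormalizes them to sum to one, and computes their entropy; the renormalization is an assumption made here so that the result does not depend on the mass of the remaining classes, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bvsb_entropy(probs):
    """Entropy restricted to the two largest probabilities, renormalized."""
    top = sorted(probs, reverse=True)[:2]
    total = sum(top)
    if total <= 0.0:
        raise ValueError("probabilities must have positive mass")
    return entropy([p / total for p in top])

# Two candidates on the same two-class boundary; the second carries extra
# mass on an irrelevant third class.
on_boundary = [0.5, 0.5, 0.0]
with_third = [0.45, 0.45, 0.10]
full = (entropy(on_boundary), entropy(with_third))
bvsb = (bvsb_entropy(on_boundary), bvsb_entropy(with_third))
```

In this toy case the full entropy ranks the second candidate higher purely because of the third class, while the BVSB version scores both candidates identically. In an AL round the next design point would be the candidate with the largest particle-averaged BVSB entropy.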

Figure 5 shows the sequential design obtained via PL with $N=1000$ particles and the BVSB entropy AL heuristic using a pre-defined set of 300 MED candidate locations. The design was initialized with a $t_{0}=25$ sub-MED (from the 300), and AL was performed at each of the rounds $t=t_{0},\dots,T=125$ on the $300-t$ remaining candidates. This time there are 40 misclassified points, compared to the 76 obtained with a static design [Section 2.2; the same 1,000 MED test set was used]. The running time here is comparable to the static implementation. MCMC gives similar results but takes 4–5 times longer.

Working off-grid, e.g., with a fresh set of LHD candidates in each AL round, is slightly more challenging because the predictive entropy is very greedy. Paradoxically, the highest (BVSB) entropy regions tend to be near the boundaries which have been most thoroughly explored (those already straddled by a high concentration of points), even though the entropy rapidly decreases nearby. One possible remedy involves smoothing the entropy by a distance-based kernel (e.g., $K(\cdot,\cdot)$ from the GP) over the candidate locations. Applying this heuristic leads to results very similar to those reported in Figure 5, and so they are not shown here.

Discussion

We have shown how GP models, for regression and for classification, may be fit via the sequential Monte Carlo (SMC) method of particle learning (PL). We developed the relevant expressions, and provided illustrations on data from both contexts. Although SMC methods are typically applied to time series data, we argued that they are also well suited to scenarios where the data arrive online even when there is no time or dynamic component in the model. Examples include sequential design and optimization, where a significant aspect of the problem is to choose the next input and subsequently update the model fit. In these contexts, MCMC inference has reigned supreme. But MCMC is clearly ill-suited to online data acquisition, as it must be restarted when the new data arrive. We showed that the PL update of a particle approximation is thrifty by contrast, and that adding rejuvenation to the propagate steps mimics the behavior of an ensemble without explicitly maintaining one.

Another advantage of SMC methods is that they are “embarrassingly parallelizable”, since many of the relevant calculations on the particles may proceed independently of one another, up to having a unique computing node for each particle. In contrast, the Markov property of MCMC requires that the inferential steps, to a large extent, proceed in serial. Getting the most mileage out of our SMC/PL approach will require a careful asynchronous implementation. Observe that the posterior predictive distribution, and the propagate step, may be calculated for each particle in parallel. Resampling requires that the particles be synchronized, but this is fast once the particle predictive densities have been evaluated. Our implementation in the plgp package does not exploit this parallelism. However, it does make heavy use of R’s lapply function, which loops over the particles to calculate the predictive, and to propagate. A parallelized lapply, e.g., using snowfall and sfCluster, as described by Knaus et al. (2009), may be a promising way forward.