Portfolio Allocation for Bayesian Optimization

Eric Brochu, Matthew W. Hoffman, Nando de Freitas

Introduction

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. It is applicable in situations where one does not have a closed-form expression for the objective function, but where one can obtain noisy evaluations of this function at sampled values. It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex. Bayesian optimization has two key ingredients. First, it uses the entire sample history to compute a posterior distribution over the unknown objective function. Second, it uses an acquisition function to automatically trade off between exploration and exploitation when selecting the points at which to sample next. As such, Bayesian optimization techniques are some of the most efficient approaches in terms of the number of function evaluations required . In recent years, the machine learning community has increasingly used Bayesian optimization to optimize expensive objective functions. Examples can be found in robot gait design , online path planning , intelligent user interfaces for animation , algorithm configuration , efficient MCMC , sensor placement , and reinforcement learning .

However, the choice of acquisition function is not trivial. Several different methods have been proposed in the literature, none of which work well for all classes of functions. Building on recent developments in the field of online learning and multi-armed bandits , this paper proposes a solution to this problem. The solution is based on a hierarchical hedging approach for managing an adaptive portfolio of acquisition functions.

We review Bayesian optimization and popular acquisition functions in Section 2. In Section 3, we propose the use of various hedging strategies for Bayesian optimization . In Section 4, we present experimental results using standard test functions from the literature of global optimization. The experiments show that the proposed hedging approaches outperform any of the individual acquisition functions. We also provide detailed comparisons among the hedging strategies. Finally, in Section 5 we present a bound on the cumulative regret which helps provide some intuition as to algorithm’s performance.

Bayesian optimization

We define xt\mathbf{x}_{t} as the ttth sample and yt=f(xt)+ϵt,y_{t}=f(\mathbf{x}_{t})+\epsilon_{t}, with ϵtiidN(0,σ2)\epsilon_{t}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,\sigma^{2}), as a noisy observation of the objective function at xt\mathbf{x}_{t}. Other observation models are possible , but we will focus on real, Gaussian observations for ease of presentation.

The objective function is distributed according to a GP prior:

For convenience, and without loss of generality, we assume that the prior mean is the zero function (but see for examples of nonzero means). This leaves us the more interesting question of defining the covariance function. A very popular choice is the squared exponential kernel with a vector of automatic relevance determination (ARD) hyperparameters θ\boldsymbol{\theta} :

where diag(θ)\operatorname{diag}(\boldsymbol{\theta}) is a diagonal matrix with entries θ\boldsymbol{\theta} along the diagonal and zeros elsewhere. The choice of hyperparameters will be discussed in the experimental section, but we note that it is not trivial in this domain because of the paucity of data. For an in depth analysis of this issue we refer the reader to e.g. .

We can sample the GP at tt points by choosing the indices {x1:t}\{\mathbf{x}_{1:t}\} and sampling the values of the function at these indices to produce the data D1:t{\mathcal{D}}_{1:t}. The function values are distributed according to a multivariate Gaussian distribution N(0,K)\mathcal{N}(0,\mathbf{K}), with covariance entries k(xi,xj)k(\mathbf{x}_{i},\mathbf{x}_{j}). Assume that we already have the observations, say from previous iterations, and that we want to use Bayesian optimization to decide what point xt+1\mathbf{x}_{t+1} should be considered next. Let us denote the value of the function at this arbitrary point as ft+1f_{t+1}. Then, by the properties of GPs, f1:tf_{1:t} and ft+1f_{t+1} are jointly Gaussian:

where k=[k(xt+1,x1),k(xt+1,x2),,k(xt+1,xt)]\mathbf{k}=[k(\mathbf{x}_{t+1},\mathbf{x}_{1}),k(\mathbf{x}_{t+1},\mathbf{x}_{2}),\dots,k(\mathbf{x}_{t+1},\mathbf{x}_{t})]. Using the Sherman-Morrison-Woodbury formula, see for a comprehensive treatment, one can easily arrive at an expression for the predictive distribution:

In this sequential decision making setting, the number of query points is relatively small and, consequently, the GP predictions are easy to compute.

2 Acquisition functions

The role of the acquisition function is to guide the search for the optimum. Typically, acquisition functions are defined such that high values correspond to potentially high values of the objective function, whether because the prediction is high, the uncertainty is great, or both. The acquisition function is maximized to select the next point at which to evaluate the objective function. That is, we wish to sample the objective function at argmaxxu(xD)\operatorname*{argmax}_{\mathbf{x}}u(\mathbf{x}|{\mathcal{D}}). This auxiliary maximization problem, where the objective is known and easy to evaluate, can be easily carried out with standard numerical techniques such as multistart or DIRECT . The acquisition function is sometimes called the infill or simply the “utility” function. In the following sections, we will look at the three most popular choices. Figure 1 shows how these give rise to distinct sampling behaviour.

Probability of improvement (PI): The early work of Kushner suggested maximizing the probability of improvement over the incumbent μ+=maxtμ(xt)\mu^{+}=\max_{t}\mu(\mathbf{x}_{t}). The drawback, intuitively, is that this formulation is biased toward exploitation only. To remedy this, practitioners often add a trade-off parameter ξ0\xi\geq 0, so that

where Φ()\Phi(\cdot) is the standard Normal cumulative distribution function (CDF). The exact choice of ξ\xi is left to the user. Kushner recommends using a (unspecified) schedule for ξ\xi, which should start high in order to drive exploration and decrease towards zero as the algorithm progresses. Lizotte, however, found that using such a schedule did not offer improvement over a constant value of ξ\xi on a suite of test functions .

Expected improvement (EI): More recent work has tended to take into account not only the probability of improvement, but the magnitude of the improvement a point can potentially yield. Močkus et al. proposed maximizing the expected improvement with respect to the best function value yet seen, given by the incumbent x+=argmaxxtf(xt)\mathbf{x}^{+}=\operatorname*{argmax}_{\mathbf{x}_{t}}f(\mathbf{x}_{t}). For our Gaussian process posterior, one can easily evaluate this expectation, see , yielding:

where d=μ(x)μ+ξd=\mu(\mathbf{x})-\mu^{+}-\xi and where ϕ()\phi(\cdot) and Φ()\Phi(\cdot) denote the PDF and CDF of the standard Normal distribution respectively. Here ξ\xi is an optional trade-off parameter analogous to the one defined above.

Upper confidence bound (UCB & GP-UCB): Cox and John introduce an algorithm they call “Sequential Design for Optimization”, or SDO. Given a random function model, SDO selects points for evaluation based on a confidence bound consisting of the mean and weighted variance: μ(x)+κσ(x)\mu(\mathbf{x})+\kappa\sigma(\mathbf{x}). As with the other acquisition models, however, the parameter κ\kappa is left to the user. A principled approach to selecting this parameter is proposed by Srinivas et al. . In this work, the authors define the instantaneous regret of the selection algorithm as r(x)=f(x)f(x)r(\mathbf{x})=f(\mathbf{x}^{\star})-f(\mathbf{x}) and attempt to select a sequence of weights κt\kappa_{t} so as to minimize the cumulative regret RT=r(x1)++r(xT)R_{T}=r(\mathbf{x}_{1})+\cdots+r(\mathbf{x}_{T}). Using the upper confidence bound selection criterion with κt=νβt\kappa_{t}=\sqrt{\nu\beta_{t}} and the hyperparameter ν>0\nu>0 Srinivas et al. define

It can be shown that this method has cumulative regret bounded by O(TβTγT)\mathcal{O}(\sqrt{T\beta_{T}\gamma_{T}}) with high probability. Here βT\beta_{T} is a carefully selected learning rate and γT\gamma_{T} is a bound on the information gained by the algorithm at selected points after TT steps. Both of these terms depend upon the particular form of kernel-function used, but for most kernels their product can be shown to be sublinear in TT. We refer the interested reader to the original paper for exact bounds.

The sublinear bound on cumulative regret implies that the method is no-regret, i.e. that limTRT/T=0\lim_{T\to\infty}R_{T}/T=0. This in turn provides a bound on the convergence rate for the optimization process, since the regret at the maximum f(x)maxtf(xt)f(\mathbf{x}^{*})-\max_{t}f(\mathbf{x}_{t}) is upper bounded by the average regret RT/T=f(x)1Tt=1Tf(xt).R_{T}/T=f(\mathbf{x}^{*})-\tfrac{1}{T}\textstyle{\sum_{t=1}^{T}}f(\mathbf{x}_{t}). As we will note later, however, this bound can be quite loose in practice.

Portfolio strategies

There is no choice of acquisition function that can be guaranteed to perform best on an arbitrary, unknown objective. In fact, it may be the case that no single acquisition function will perform the best over an entire optimization — a mixed strategy in which the acquisition function samples from a pool (or portfolio) at each iteration might work better than any single acquisition. This can be treated as a hierarchical multi-armed bandit problem, in which each of the NN arms is an acquisition function, each of which is itself an infinite-armed bandit problem. In this section we propose solving the selection problem using three strategies from the literature, the application of which we believe to be novel.

Hedge is an algorithm which at each time step tt selects an action ii with probability pt(i)p_{t}(i) based on the cumulative rewards (gain) for that action (see Auer et al. ). After selecting an action the algorithm receives reward rtir_{t}^{i} for each action and updates the gain vector. In the Bayesian optimization setting, we can define NN bandits each corresponding to a single acquisition function. Choosing action ii corresponds to sampling from the point nominated by function uiu_{i}, i.e. xti=argmaxxui(xD1:t1)\mathbf{x}_{t}^{i}=\operatorname*{argmax}_{\mathbf{x}}u_{i}(\mathbf{x}|{\mathcal{D}}_{1:t-1}) for i=1,,Ni=1,\ldots,N. Finally, while in the conventional Bayesian optimization setting the objective function is sampled only once per iteration, Hedge is a full information strategy and requires a reward for every action at every time step. We can achieve this by defining the reward at xti\mathbf{x}_{t}^{i} as the expected value of the GP model at xti\mathbf{x}_{t}^{i}. That is, rti=μt(xti)r_{t}^{i}=\mu_{t}(\mathbf{x}^{i}_{t}). We refer to this method as GP-Hedge. Provided that the objective function is smooth, this reward definition is reasonable.

Auer et al. also propose the Exp3 algorithm, a variant of Hedge that applies to the partial information setting. In this setting it is no longer assumed that rewards are observed for all actions. Instead at each iteration a reward is only associated with the particular action selected. The algorithm uses Hedge as a subroutine where rewards observed by Hedge at each iteration are rti/p^t(i)r_{t}^{i}/\hat{p}_{t}(i) for the action selected and zero for all actions. Here p^t(i)\hat{p}_{t}(i) is the probability that Hedge would have selected action ii. The Exp3 algorithm, meanwhile, selects actions from a distribution that is a mixture between p^t(i)\hat{p}_{t}(i) and the uniform distribution. Intuitively this ensures that the algorithm does not miss good actions because the initial rewards were low (i.e. it continues exploring).

Finally, another possible strategy is the NormalHedge algorithm . This method, however, is built to take advantage of situations where the number of bandit arms (acquisition functions) is large, and may not be a good match to problems where NN is relatively small.

The GP-Hedge procedure is shown in Algorithm 2. In practice any of these hedging strategies could be used, however in our experiments we find that Hedge tends to outperform the others. Note that it is necessary to optimize NN acquisition functions at each time step rather than 1. While this might seem expensive, this is unlikely to be a major problem in practice for small NN, as (i) Bayesian optimization is typically employed when sampling the objective is so expensive as to dominate other costs; (ii) it has been shown that fast approximate optimization of uu is usually sufficient ; and (iii) it is straightforward to run the optimizations in parallel on a modern multicore processor.

We will also note that the setting of our problem is somewhere “in between” the full and partial information settings. Consider, for example, the situation that all points sampled by our algorithm are “too distant” in the sense that the kernels evaluated at these points exert negligible influence on each other. In this case, we can see that only information obtained by the sampled point is available, and as a result GP-Hedge will be over-confident in its predictions when using the full-information strategy. However, this behaviour is not observed in practical situations because of smoothness properties, as well as our particular selection of acquisition functions. In the case of adversarial acquisition functions one might instead choose to use the Exp3 variant.

Experiments

To validate the use of GP-Hedge, we tested the optimization performance on a set of test functions with known maxima f(x)f(\mathbf{x}^{\star})Code for the optimization methods and experiments will be made available online.. To see how effective each method is at finding the global maximum, we use the “gap” metric , defined as

where again x+\mathbf{x}^{+} is the incumbent or best function sample found up to time tt. The gap GtG_{t} will therefore be a number between 0 (indicating no improvement over the initial sample) and 1 (if the incumbent is the maximum). Note, while this performance metric is evaluated on the true function values, this information is not available to the optimization methods.

We first tested performance using functions common to the literature on Bayesian optimization: the Branin, Hartman 3, and Hartman 6 functions. All of these are continuous, bounded, and multimodal, with 2, 3, and 6 dimensions respectively. We omit the formulae of the functions for space reasons, but they can be found in .

For each experiment, we optimized 25 times and computed the mean and variance of the gap metric over time. In these experiments we used hyperparameters θ\boldsymbol{\theta} chosen offline so as to maximize the log marginal likelihood of a (sufficiently large) set of sample points; see . We compared the standard acquisition functions using parameters suggested by previous authors, i.e. ξ=0.01\xi=0.01 for EI and PI, δ=0.1\delta=0.1 and ν=0.2\nu=0.2 for GP-UCB . For the GP-Hedge trials, we tested performance under using both 3 acquisition functions and 9 acquisition functions. For the 3-function variant we use the standard acquisition functions with default hyperparameters. The 9-function variant uses these same three as well as 6 additional acquisition functions consisting of: both PI and EI with ξ=0.1\xi=0.1 and ξ=1.0\xi=1.0, GP-UCB with ν=0.1\nu=0.1 and ν=1.0\nu=1.0. While we omit trials of these additional acquisition functions for space reasons, these values are not expected to perform as well as the defaults and our experiments confirmed this hypothesis. However, we are curious to see if adding known suboptimal acquisition functions will help or hinder GP-Hedge in practice.

Results for the gap measure GtG_{t} are shown in Figure 2. While the improvement GP-Hedge offers over the best single acquisition function varies, there is almost no combination of function and time step in which the 9-function GP-Hedge variant is not the best-performing method. The results suggest that the extra acquisition functions assist GP-Hedge in exploring the space in the early stages of the optimization process. Figure 2 also displays, for a single example run, how the the arm probabilities pt(i)p_{t}(i) used by GP-Hedge evolve over time. We have observed that the distribution becomes more stable when the acquisition functions come to a general consensus about the best region to sample. As the optimization progresses, exploitation becomes more rewarding than exploration, resulting in more probability being assigned to methods that tend to exploit. However, note that if the initial portfolio had consisted only of these more exploitative acquisition functions, the likelihood of becoming trapped at suboptimal points would have been higher.

In Figure 3 we compare against the other Hedging strategies introduced in Section 3 under both the gap measure and mean average regret. We also introduce a baseline strategy which utilizes a portfolio uniformly distributed over the same acquisition functions. The results show that mixing across multiple acquisition functions provides significant performance benefits under the gap measure, and as the problems’ difficulty/dimensionality increases we see that GP-Hedge outperforms other mixed strategies. The uniform strategy performs well on the easier test functions, as the individual acquisition functions are reasonable. However, for the hardest problem (Hartman 6) we see that the performance of the naive uniform strategy degrades. NormalHedge performs particularly poorly on this problem. We observed that this algorithm very quickly collapses to an exclusively exploitative portfolio which becomes very conservative in its departures from the incumbent. We again note that this strategy is intended for large values of NN, which may explain this behaviour.

In the case of the regret measure we see that the hedging strategies perform comparable to GP-UCB, a method designed to optimize this measure. We further note that although the average regret can be seen as a lower-bound on the convergence of Bayesian optimization methods, this bound can be loose in practice. Further, in the setting of Bayesian optimization we are typically concerned not with the cumulative regret during optimization, but instead with the regret incurred by the incumbent after optimization is complete. Similar notions of “simple regret” have been studied in .

Based on the performance in these experiments, we use Hedge as the underlying algorithm for GP-Hedge in the remainder of the experiments.

2 Sampled test functions

As there is no generally-agreed-upon set of test functions for Bayesian optimization in higher dimensions, we seek to sample synthetic functions from a known GP prior similar to . For further details on how these functions are sampled see Appendix B. As can be seen in Figure 4, GP-Hedge with N=9N=9 is again the best-performing method, which becomes even more clear as the dimensionality increases. Interestingly, the worst-performing function changes as dimensionality increases. In the 40D experiments, GP-UCB, which generally performed well in other experiments, does quite poorly. Examining the behaviour, it appears that by trying to minimize regret instead of maximizing improvement, GP-UCB favours regions of high variance. However, since a 40D space remains extremely sparsely populated even with hundreds of samples, the vast majority of the space still has high variance, and thus high acquisition value.

3 Control of a particle simulation

We also applied these methods to optimize the behavior of a simulated physical system in which the trajectories of falling particles are controlled via a set of repelling forces. This is a difficult, nonlinear control task whose resulting objective function exhibits fairly isolated regions of high value surrounded by severe plateaus. Briefly, the four-dimensional state-space in this problem consists of a particle’s 2D position and velocity (p,p˙)(p,\dot{p}) with two-dimensional actions consisting of forces which act on the particle. Particles are also affected by gravity and a frictional force resisting movement. The goal is to direct the path of the particle through regions of the state space with high reward r(p)r(p) so as to maximize the total reward accumulated over many time-steps. In our experiments we use a finite, but large, time-horizon HH. In order to control this system we employ a set of “repellers” each of which is located at some position ci=(ai,bi)c_{i}=(a_{i},b_{i}) and has strength wiw_{i} (see the left-most plot of Figure 5). The force on a particle at position pp is a weighted sum of the individual forces from all repellers, each of which is inversely proportional to the distance pcip-c_{i}. For further details we refer the reader to .

Results for this optimization task are shown in Figure 5. As with the previous synthetic examples GP-Hedge outperforms each of its constituent methods. We further note the particularly poor performance of PI on this example, which in part we hypothesize is a result of plateaus in the resulting objective function. In particular PI has trouble exploring after it has “locked on” to a particular mode, a fact that seems exacerbated when there are large regions with very little change in objective.

Convergence behaviour

Properly assessing the convergence behaviour of hedging algorithms of this type is very problematic. The main difficulty lies with the fact that decisions made at iteration tt affect the state of the problem and the resulting rewards at all future iterations. As a result we cannot relate the regret of our algorithm directly to the regret of the best underlying acquisition strategy: had we actually used the best underlying strategy we would have selected completely different points [8, section 7.11].

Regret bounds for the underlying GP-UCB algorithm have been shown . Starting with Auer et al. we also have regret bounds for the hedging strategies used to select between acquisition functions (improved bounds can also be found in ). However, because of the points stated in the previous paragraph, and expounded in more detail in the appendix, we cannot simply combine both regret bounds.

With these caveats in mind we will consider a slightly different algorithmic framework. In particular we will consider rewards at iteration tt given by the mean μt1(xt)\mu_{t-1}(\mathbf{x}_{t}), where this assumption is made merely to simplify the following proof. We will also assume that GP-UCB is included as one of the possible acquisition functions due to its associated convergence results (see Section 2.2). In this scenario we can obtain the following bound on our cumulative regret.

Assume GP-Hedge is used with a collection of acquisition strategies, one of which is GP-UCB with parameters βt\beta_{t}. If we also have a bound γT\gamma_{T} on the information gained at points selected by the algorithm after TT iterations, then with probability at least 1δ1-\delta the cumulative regret is bounded by

We give a full proof of this theorem in the appendix. We will note that this theorem on its own does not guarantee the convergence of the algorithm, i.e. that limTRT/T=0\lim_{T\to\infty}R_{T}/T=0. We can see, however, that our regret is bounded by two sub-linear terms and an additional term which depends on the information gained at points proposed, but not necessarily selected. In some sense this additional term depends on the proximity of points proposed by GP-UCB to points previously selected, the expected distance of which should decrease as the number of iterations increases.

Conclusions and future work

Hedging strategies are a powerful tool in the design of acquisition functions for Bayesian optimization. In this paper we have shown that strategies that adaptively modify a portfolio of acquisition functions often perform substantially better — and almost never worse — than the best-performing individual acquisition function. Our experiments have also shown that full-information strategies are able to outperform partial-information strategies in many situations. However, partial-information strategies can be beneficial in instances of high NN or in situations where the acquisition functions provide very conflicting advice. Evaluating these tradeoffs is an interesting area of future research.

Finally, while the EI and PI acquisition functions can perform well in practice, there currently exist no regret bounds for these approaches. In this work we give a regret bound for our hedging strategy by relating its performance to existing bounds for GP-UCB. Although our bound does not guarantee convergence it does provide some intuition as to the success of hedging methods in practice. Another interesting line of future research involves finding similar bounds for the gap measure.

Acknowledgements

We would like to anonymously thank a number of researchers who provided very helpful comments and criticism on the theoretical and practical aspects of this work.

References

Appendix A Proof of Theorem 1

We will consider a portfolio-based strategy using rewards rt=μt1(xt)r_{t}=\mu_{t-1}(\mathbf{x}_{t}) and selecting between acquisition functions using the Hedge algorithm. In order to discuss this we will need to write the gain over TT steps, in hindsight, that would have been obtained had we used strategy ii,

With probability at least 1δ11-\delta_{1} and for a suitable choice of Hedge parameters, η=8lnk/T\eta=\sqrt{8\ln k/T}, the regret is bounded by

This result is given without proof as it follows directly from [8, Section 4.2] for rewards in the range $$. At the cost of slightly worsening the bound in terms of its multiplicative/additive constants, the following generalizations can also be noted:

For rewards instead in the arbitrary range To obtain rewards bounded within some range [a,b][a,b] we can assume that the additive noise ϵt\epsilon_{t} is truncated above some large absolute value, which guarantees bounded means. [a,b][a,b] the same bound can be shown by referring to [8, Section 2.6].

The choice of η\eta in the above Lemma requires knowledge of the time horizon TT. By referring to [8, Section 2.3] we can remove this restriction using a time-varying term ηt=8lnk/t\eta_{t}=\sqrt{8\ln k/t}.

By referring to [8, Section 6.8] we can also extend this bound to the partial-information strategy Exp3.

For the next two lemmas we will refer the reader to [31, Lemma 5.1 and 5.3] for proof. We point out, however, that these two lemmas only depend on the underlying Gaussian process and as a result can be used separately from the GP-UCB framework.

Assume δ2(0,1)\delta_{2}\in(0,1), a finite sample space A<|A|<\infty, and βt=2log(Aπt/δ2)\beta_{t}=2\log(|A|\pi_{t}/\delta_{2}) where tπt1=1\sum_{t}\pi_{t}^{-1}=1 and πt>0\pi_{t}>0. Then with probability at least 1δ21-\delta_{2} the absolute deviation of the mean is bounded by

In order to simplify this discussion we have assumed that the sample space AA is finite, however this can also be extended to compact spaces [31, Lemma 5.7].

The information gain for points selected by the algorithm can be written as

The following lemma follows the proof of [31, Lemma 5.4], however it can be applied outside the GP-UCB framework. Due to the slightly different conditions we recreate this proof here.

Given points xtx_{t} selected by the algorithm the following bound holds for the sum of variances:

Because βt\beta_{t} is nondecreasing we can write the following inequality

The second inequality holds because the posterior variance is bounded by the prior variance, σt12(x)k(x,x)1\sigma_{t-1}^{2}(\mathbf{x})\leq k(\mathbf{x},\mathbf{x})\leq 1, which allows us to write

By summing over both sides of the original bound and applying the result of Lemma 3 we can see that

The result follows by bounding the information gain by I(y1:T;f1:T)γTI(y_{1:T};f_{1:T})\leq\gamma_{T}, which can be done for many common kernels, including the squared exponential [31, Theorem 5]. ∎

Finally, the next lemma follows directly from [31, Lemma 5.2]. We will note that this lemma depends only on the definition of the GP-UCB acquisition function, and as a result does not require that points at any previous iteration were actually selected via GP-UCB.

We can now combine these results to construct the proof of Theorem 1.

With probability at least 1δ11-\delta_{1} the result of Lemma 1 holds. If we assume that GP-UCB is included as one of the acquisition functions we can write

and by adding t=1Tf(x)\sum_{t=1}^{T}f(\mathbf{x}^{*}) to both sides this inequality can be rewritten as

With probability at least 1δ21-\delta_{2} the bound from Lemma 2 can be applied to the left-hand-side and the result of Lemma 5 can be applied to the right. This allows us to rewrite this inequality as

which means that the regret is bounded by

Finally, this result depends upon Lemmas 1 and 5 holding. By a simple union bound argument we can see that these both hold with probability at least 1δ1δ21-\delta_{1}-\delta_{2}, and by setting δ1=δ2=δ/2\delta_{1}=\delta_{2}=\delta/2 we recover our result. ∎

Appendix B Synthetic test functions

As there is no generally-agreed-upon set of test functions for Bayesian optimization in higher dimensions, we seek to sample synthetic functions from a known GP prior, similar to the strategy of Lizotte . A GP prior is infinite-dimensional, so on a practical level for performing experiments we simulate this by sampling points and using the posterior mean as the synthetic objective test function.

For each trial, we use an ARD kernel with θ\boldsymbol{\theta} drawn uniformly from d^{d}. We then sample 100d100d dd-dimensional points, compute K\mathbf{K} and then draw yN(0,K)\mathbf{y}\sim\mathcal{N}(0,\mathbf{K}). The posterior mean of the resulting predictive posterior distribution μ(x)\mu(\mathbf{x}) (Section 2.1) is used as the test function. However it is possible that for particular values of θ\boldsymbol{\theta} and K\mathbf{K}, large parts of the space will be so far from the samples that they will form plateaus along the prior mean. To reduce this, we evaluate the test function at 500 random locations. If more than 25 of these are 0, we recompute K\mathbf{K} using 200d200d points. This process is repeated, adding 100d100d points each time until the test function passes the plateau test (this is rarely necessary in practice).

Using the response surface μ(x)\mu(\mathbf{x}) as the objective function, we can then approximate the maximum using conventional global optimization techniques to get f(x)f(\mathbf{x}^{\star}), which permits us to use the gap metric for performance.

Note that these sample points are only used to construct the objective, and are not known to the optimization methods.