Input Warping for Bayesian Optimization of Non-stationary Functions
Jasper Snoek, Kevin Swersky, Richard S. Zemel, Ryan P. Adams
Introduction
Bayesian optimization is a strategy for the global optimization of noisy, black-box functions. The goal is to find the minimum of an expensive function of interest as quickly as possible. Bayesian optimization fits a surrogate model that estimates the expensive function, and a proxy optimization is performed on this in order to select promising locations to query. Naturally, the ability of the surrogate to accurately model the underlying function is crucial to the success of the optimization routine. Recent work in machine learning has revisited the idea of Bayesian optimization (e.g., Osborne et al., 2009; Brochu et al., 2010; Srinivas et al., 2010; Hutter et al., 2011; Bergstra et al., 2011; Bull, 2011; Snoek et al., 2012; Hennig and Schuler, 2012) in large part due to advances in the ability to efficiently and accurately model statistical distributions over large classes of real-world functions. Gaussian processes (GPs) (see, e.g., Rasmussen and Williams, 2006) provide a powerful framework to express flexible prior distributions over smooth functions, yielding accurate estimates of the expected value of the function at any given input, but crucially also uncertainty estimates over that value. These are the two main components that enable the exploration and exploitation tradeoff that makes Bayesian optimization so effective.
A major limitation of the most commonly used form of Gaussian process regression is the assumption of stationarity — that the covariance between two outputs is invariant to translations in input space. This assumption simplifies the regression task, but hurts the ability of the Gaussian process to model more realistic non-stationary functions. This presents a challenge for Bayesian optimization, as many problems of interest are inherently non-stationary. For example, when optimizing the hyperparameters of a machine learning algorithm, we might expect the objective function to have a short length scale near the optimum, but have a long length scale far away from the optimum. That is, we would expect bad hyperparameters to yield similar bad performance everywhere (e.g., classifying at random) but expect the generalization performance to be sensitive to small tweaks in good hyperparameter regimes.
We introduce a simple solution that allows Gaussian processes to model a large variety of non-stationary functions that is particularly well suited to Bayesian optimization. We automatically learn a bijective warping of the inputs that removes major non-stationary effects. This is achieved by projecting each dimension of the input through the cumulative distribution function of the Beta distribution, while marginalizing over the shape of the warping. Our approach is computationally efficient, captures a variety of desirable transformations, such as logarithmic, exponential, sigmoidal, etc., and is easily interpretable. In the context of Bayesian optimization, understanding the parameter space is often just as important as achieving the best possible result and our approach lends itself to a straightforward analysis of the non-stationarities in a given problem domain.
We extend this idea to multi-task Bayesian optimization (Swersky et al., 2013) so that multiple tasks can be warped into a jointly stationary space. Thus, tasks can be warped onto one another in order to better take advantage of their shared structure.
In the empirical study that forms the experimental part of this paper, we show that modeling non-stationarity is extremely important and yields significant empirical improvements in the performance of Bayesian optimization. For example, we show that on a recently introduced Bayesian optimization benchmark (Eggensperger et al., 2013), our method outperforms all of the previous state-of-the-art algorithms on the problems with continuous-valued parameters. We further observe that on four different challenging machine learning optimization tasks our method outperforms that of Snoek et al. (2012), consistently converging to a better result in fewer function evaluations. As our methodology involves a transformation of the inputs, this strategy generalizes to a wide variety of models and algorithms. Empirically, modeling non-stationarity is a fundamentally important component of effective Bayesian optimization.
Background and Related Work
or the ARD Matérn kernel advocated for hyperparameter tuning with Bayesian optimization by Snoek et al. (2012):
Such covariance functions are invariant to translations along the input space and thus are stationary.
2 Non-stationary Gaussian Process Regression
Compared to these approaches, our approach is relatively simple, yet as we will demonstrate, flexible enough to capture a wide variety of nonstationary behaviours. Our principal aim is to show that addressing nonstationarity is a critical component of effective Bayesian optimization, and that any advantages gained from using our approach would likely generalize to more elaborate techniques.
3 Multi-Task Gaussian Processes
Many problems involve making predictions over multiple datasets (we will henceforth refer to these prediction problems as tasks). When the datasets share an input domain, and the mappings from inputs to outputs are correlated, then these correlations can be used to share information between different tasks and improve predictive performance. There have been many extensions of Gaussian processes to the multi-task setting, e.g., Goovaerts (1997); Alvarez and Lawrence (2011). However, a basic and surprisingly effective approach is to assume that each task is derived from a single latent function which is transformed to produce each output (Teh et al., 2005; Bonilla et al., 2008).
We infer the elements of directly using the spherical parametrization of a covariance matrix (Osborne, 2010; Pinheiro and Bates, 1996).
4 Bayesian Optimization
Bayesian optimization is a general framework for the global optimization of noisy, expensive, black-box functions (Mockus et al., 1978), see Brochu et al. (2010) or Lizotte (2008) for an in-depth explanation and review. The strategy relies on the use of a relatively cheap probabilistic model that can be queried liberally as a surrogate in order to more effectively evaluate an expensive function of interest. Bayes’ rule is used to derive the posterior estimate of the true function, given observations, and the surrogate is then used to determine, via a proxy optimization over an acquisition function, the next most promising point to query. Using the posterior mean and variance of the probabilistic model, the acquisition function generally expresses a tradeoff between exploitation and exploration. Numerous acquisition functions and combinations thereof have been proposed (e.g., Kushner, 1964; Srinivas et al., 2010; Hoffman et al., 2011).
Here is the cumulative distribution function of a standard normal, and is the density of a standard normal. Note that the method proposed in this paper is independent of the choice of acquisition function and do not affect its analytic properties.
5 Multi-Task Bayesian Optimization
When utilizing machine learning in practice, a single model will often need to be trained on multiple datasets. This can happen when e.g., new data is collected and a model must be retrained. In these scenarios we can think of each dataset as a different task and use multi-task Gaussian processes to predict where to query next. In Krause and Ong (2011), this idea was applied to find peptide sequences that bind to molecules for vaccine design, while in Swersky et al. (2013) it was applied to hyperparameter optimization. In these cases it was shown that sharing information between tasks can be extremely beneficial for Bayesian optimization. Other approaches include Bardenet et al. (2013), which finds a joint latent function over tasks explicitly using a ranking model, and Hutter et al. (2011) which uses a set of auxiliary task features to improve prediction.
Input Warping
where BetaCDF refers to the Beta cumulative distribution function and is the normalization constant. That is, is a vector-valued function in which the th output dimension is a function of the th input dimension, and is specified by the cumulative distribution function of the Beta distribution. Each of these bijective transformations from has a unique shape, determined by parameters and . The Beta CDF has no closed form solution for non-integer values of and , however accurate approximations are implemented in many statistical software packages.
Alternatively, one can think of input warping as applying a particular kind of non-stationary kernel to the original data. Examples of non-stationary functions and their corresponding ideal warping that transforms them into stationary functions are shown in Figure 1.
Our choice of the Beta distribution is motivated by the fact that it is capable of expressing a variety of monotonic warpings, while still being concisely parameterized. In general, there are many other suitable choices.
Rather than assume a single, explicit transformation function, we define a hierarchical Bayesian model by placing a prior over the shape parameters, and , of the bijections and integrating them out. We treat the collection as hyperparameters of the covariance function and use Markov chain Monte Carlo via slice sampling, following the treatment of covariance hyperparameters from Snoek et al. (2012). We use a log-normal distribution, i.e.
to express a prior for a wide family of desirable functions. Figure 2 demonstrates example warping functions arising from sampling transformation parameters from various instantiations of this prior. Note that the geometric mean or median of the zero-mean log-normal distribution for the and corresponds to the identity transform. With this prior the model centers itself on the identity transformation of the input space. In the following empirical analysis we use this formulation with a variance of . A nice property of this approach is that a user can easily specify a prior when they expect a specific form of warping, as we show in Figure 2.
2 Multi-Task Input Warping
When training the same model on different datasets, certain properties, such as the size of the dataset, can have a dramatic effect on the optimal hyperparameter settings. For example, a model trained on a small dataset will likely require more regularization than the same model trained on a larger dataset. In other words, it is possible that one part of the input space on one task can be correlated with a different part of the input space on another task. To account for this, we allow each task to have its own set of warping parameters. Inferring these parameters will effectively try to warp both tasks into a jointly stationary space that is more suitably modeled by a standard multi-task kernel. In this way, large values on one task can map to small values on another, and vice versa.
Empirical Analyses
Our empirical analysis is comprised of three distinct experiments. In the first experiment, we compare to the method of Snoek et al. (2012) in order to demonstrate the effectiveness of input warping. In the second experiment, we compare to other hyperparameter optimization methods using a subset of the benchmark suite found in Eggensperger et al. (2013). Finally, we show how our multi-task extension can further benefit this important setting.
We evaluate the standard Gaussian process expected improvement algorithm (GP EI MCMC) as implemented by Snoek et al. (2012), with and without warping. Following their treatment, we use the Matérn kernel and we marginalize over kernel parameters using slice sampling (Murray and Adams, 2010). We repeat three of the experimentsSee Snoek et al. (2012) for details of these experiments. from Snoek et al. (2012), and perform an experiment involving the tuning of a deep convolutional neural networkWe use the Deepnet package from https://github.com/nitishsrivastava/deepnet on a subset of the popular CIFAR-10 data set (Krizhevsky, 2009). The deep network consists of three convolutional layers and two fully connected layers and we optimize over two learning rates, one for each layer type, six dropout regularization rates, six weight norm constraints, the number of hidden units per layer, a convolutional kernel size and a pooling size for a total of 21 hyperparameters. On the logistic regression problem we also compare to warping the input space a priori using the log-transform (optimizing in log-space).
Results
Figure 3 shows that in all cases, dealing with non-stationary effects via input warpings greatly improves the convergence of the optimization. Of particular note, on the higher-dimensional convolutional network problem (Figure 3d) input warped Bayesian optimization consistently converges to a better solution than Bayesian optimization with a stationary GP.
In Figure 4 we plot examples of some of the inferred warpings. For logistic regression, Figure 4a shows that our method learns different logarithmic-like warpings for three dimensions and no warping for the fourth. Figure 4b shows how the posterior distribution over the learning rate warping evolves, becoming more extreme and more certain, as observations are gathered. Figure 4c shows that on both convolutional and dense layers, the intuition that one should log-transform the learning rates holds. For transformations on weight norm constraints, shown in Figure 4d, the weights connected to the inputs and outputs use a sigmoidal transformation, the convolutional-layer weights use an exponential transformation, and the dense-layer weights use a logarithmic transformation. Effectively, this means that the most variation in the error occurs in the medium, high and low scales respectively for these types of weights. Especially interesting are the wide variety of transformations that are learned for dropout on different layers, shown in Figure 4e. These show that different layers benefit from different dropout rates, which was also confirmed on test set error, and challenges the notion that they should just be set to (Hinton et al., 2012).
It is clear that the learned warpings are non-trivial. In some cases, like with learning rates, they agree with intuition, while for others like dropout they yield surprising results. Given the number of hyperparameters and the variety of transformations, it is highly unlikely that even experts would be able to determine the whole set of appropriate warpings. This highlights the utility of learning them automatically.
2 HPOLib Continuous Benchmarks
In our next set of experiments, we tested our approach on the subset of benchmarks over continuous inputs from the HPOLib benchmark suite (Eggensperger et al., 2013). These benchmarks are designed to assess the strengths and weaknesses of several popular hyperparameter optimization schemes. All of the tested methods perform Bayesian optimization, however the underlying surrogate models differ significantly. The SMAC package (Hutter et al., 2011) uses a random forest, the Hyperopt package (Bergstra et al., 2011) uses the tree Parzen estimator, and the Spearmint package (Snoek et al., 2012) uses a Gaussian process. For our experiments, we augmented the Spearmint package with input warping.
Results
Table 1 shows the results, where all but the warped results are taken from Eggensperger et al. (2013). Overall, input warpings improve the performance of the Gaussian process approach such that it does at least as well as every other method, and in many cases better. Furthermore, the standard deviation also decreases significantly in many instances, meaning that the results are far more reliable. Finally, it is worth noting that the number of function evaluations required to solve the problems is also drastically reduced in many cases.
Interestingly, the random forest approach in SMAC also naturally deals with nonstationarity, albeit in a fundamentally different way, by partitioning the space in a non-uniform manner. There are several possibilities to explain the performance discrepancy. Unlike random forests, Gaussian processes produce a smooth function of the inputs, meaning that EI can be locally optimized via gradient methods, so it is possible that better query points are selected in this way. Alternatively, the random forest is not a well-defined prior on functions and there may be overfitting in the absence of parameter marginalization. Further investigation is merited to tease apart this discrepancy.
3 Multi-Task Warping
In this experiment, we apply multi-task warping to logistic regression and online LDA (Hoffman et al., 2010) in a similar manner to Swersky et al. (2013). In the logistic regression problem, a search over hyperparameters has already been completed on the USPS dataset, which consists of training examples of handwritten digits of size . It was demonstrated that it was possible to use this previous search to speed up the hyperparameter search for logistic regression on the MNIST dataset, which consists of training examples of size .
In the online LDA problem, we assume that a model has been trained on documents and that we would now like to train one on documents. Again, it was shown that it is possible to transfer information over to this task, resulting in more efficient optimization.
Results
In Figure 5 we see that warped multi-task Bayesian optimization (warped MTBO) outperforms multi-task Bayesian optimization (MTBO) without warping, and performs far better than single-task Bayesian optimization (STBO) that does not have the benefit of a prior search. On logistic regression it appears that ordinary MTBO gets stuck in a local minimum, while warped MTBO is able to consistently escape this by the function evaluation.
Conclusion
In this paper we develop a novel formulation to elegantly model non-stationary functions using Gaussian processes that is especially well suited to Bayesian optimization. Our approach uses the cumulative distribution function of the Beta distribution to warp the input space in order to remove the effects of mild input-dependent length scale variations. This approach allows us to automatically infer a variety of warpings in a computationally efficient way. In our empirical analysis we see that an inability to model non-stationary functions is a major weakness when using stationary kernels in the GP Bayesian optimization framework. Our simple approach to learn the form of the non-stationarity significantly outperforms the standard Bayesian optimization routine of Snoek et al. (2012) both in the number of evaluations it takes to converge and the value reached. As an additional bonus, the method finds good solutions more reliably. Our experiments on the continuous subset of the HPOLib benchmark (Eggensperger et al., 2013) shows that input warping performs substantially better than state-of-the-art baselines on these problems.
A key advantage of our approach is that the learned transformations can be analyzed post hoc, and our analysis of a convolutional neural network architecture leads to surprising insights that challenge established doctrine. Post-training analysis is becoming a critical component of neural network development. For example, the winning Imagenet 2013 (Deng et al., 2009) submission (Zeiler and Fergus, 2013) used post hoc analysis to correct for model defects. The development of interpretable Bayesian optimization strategies can provide a unique opportunity to facilitate this kind of interaction. An interesting follow-up would be to determine whether consistent patterns emerge across architectures, datasets and domains.
In Bayesian optimization, properly characterizing uncertainty is just as important as making predictions. GPs are ideally suited to this problem because they offer a good balance between modeling power and computational tractability. In many real world problems, however, the assumptions made by the Gaussian processes are often violated, nullifying many of their benefits. In light of this, many opt to use frequentist models instead, which offer minimax-type guarantees. Our emphasis in this work is to demonstrate that it is possible to stay within the Bayesian framework and thus enjoy its characterization of uncertainty, while still overcoming some of the limitations associated with the conventional GP approach. In future work we intend to experiment with more elaborate models of non-stationarity to see if these yield further improvements.
Acknowledgements
The authors would like to thank Nitish Srivastava for providing help with the Deepnet package. Jasper Snoek is a fellow in the Harvard Center for Research on Computation and Society. During his time at the University of Toronto, Jasper Snoek was supported by a grant from Google. This work was funded by DARPA Young Faculty Award N66001-12-1-4219, an Amazon AWS in Research grant, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR).