Scalable Bayesian Optimization Using Deep Neural Networks

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, Ryan P. Adams

stat.ML

Introduction

Recently, the field of machine learning has seen unprecedented growth due to a new wealth of data, increases in computational power, new algorithms, and a plethora of exciting new applications. As researchers tackle more ambitious problems, the models they use are also becoming more sophisticated. However, the growing complexity of machine learning models inevitably comes with the introduction of additional hyperparameters. These range from design decisions such as the shape of a neural network architecture, to optimization parameters such as learning rates, to regularization hyperparameters such as weight decay. Proper setting of these hyperparameters is critical for performance on difficult problems.

There are many methods for optimizing over hyperparameter settings, ranging from simplistic procedures like grid or random search (Bergstra & Bengio, 2012), to more sophisticated model-based approaches using random forests (Hutter et al., 2011) or Gaussian processes (Snoek et al., 2012). Bayesian optimization is a natural framework for model-based global optimization of noisy, expensive black-box functions. It offers a principled approach to modeling uncertainty, which allows exploration and exploitation to be naturally balanced during the search. Perhaps the most commonly used model for Bayesian optimization is the Gaussian process (GP) due to its simplicity and flexibility in terms of conditioning and inference.

However, a major drawback of GP-based Bayesian optimization is that inference time grows cubically in the number of observations, as it necessitates the inversion of a dense covariance matrix. For problems with a very small number of hyperparameters, this has not been an issue, as the minimum is often discovered before the cubic scaling renders further evaluations prohibitive. As the complexity of machine learning models grows, however, the size of the search space grows as well, along with the number of hyperparameter configurations that need to be evaluated before a solution of sufficient quality is found. Fortunately, as models have grown in complexity, computation has become significantly more accessible and it is now possible to train many models in parallel. A natural solution to the hyperparameter search problem is to therefore combine large-scale parallelism with a scalable Bayesian optimization method. The cubic scaling of the GP, however, has made it infeasible to pursue this approach.

The goal of this work is to develop a method for scaling Bayesian optimization, while still maintaining its desirable flexibility and characterization of uncertainty. To that end, we propose the use of neural networks to learn an adaptive set of basis functions for Bayesian linear regression. We refer to this approach as Deep Networks for Global Optimization (DNGO). Unlike a standard Gaussian process, DNGO scales linearly with the number of function evaluations—which, in the case of hyperparameter optimization, corresponds to the number of models trained—and is amenable to stochastic gradient training. Although it may seem that we are merely moving the problem of setting the hyperparameters of the model being tuned to setting them for the tuner itself, we show that for a suitable set of design choices it is possible to create a robust, scalable, and effective Bayesian optimization system that generalizes across many global optimization problems.

We demonstrate the effectiveness of DNGO on a number of difficult problems, including benchmark problems for Bayesian optimization, convolutional neural networks for object recognition, and multi-modal neural language models for image caption generation. We find hyperparameter settings that achieve competitive with state-of-the-art results of $6.37\%$ and $27.4\%$ on CIFAR-10 and CIFAR-100 respectively, and BLEU scores of $25.1$ and $26.7$ on the Microsoft COCO 2014 dataset using a single model and a 3-model ensemble.

Background and Related Work

Recent innovation has resulted in significant progress in Bayesian optimization, including elegant theoretical results (Srinivas et al., 2010; Bull, 2011; de Freitas et al., 2012), multitask and transfer optimization (Krause & Ong, 2011; Swersky et al., 2013; Bardenet et al., 2013) and the application to diverse tasks such as sensor set selection (Garnett et al., 2010), the tuning of adaptive Monte Carlo (Mahendran et al., 2012) and robotic gait control (Calandra et al., 2014b).

Typically, GPs have been used to construct the distribution over functions used in Bayesian optimization, due to their flexibility, well-calibrated uncertainty, and analytic properties (Jones, 2001; Osborne et al., 2009). Recent work has sought to improve the performance of the GP approach through accommodating higher dimensional problems (Wang et al., 2013; Djolonga et al., 2013), input non-stationarities (Snoek et al., 2014) and initialization through meta-learning (Feurer et al., 2015). Random forests, which scale linearly with the data, have also been used successfully for algorithm configuration by Hutter et al. (2011) with empirical estimates of model uncertainty.

More specifically, Bayesian optimization seeks to solve the minimization problem

Here $\Phi(\cdot)$ is the cumulative distribution function of a standard normal, and $\mathcal{N}(\cdot;0,1)$ is the density of a standard normal. Note that numerous alternate acquisition functions and combinations thereof have been proposed (Kushner, 1964; Srinivas et al., 2010; Hoffman et al., 2011), which could be used without affecting the analytic properties of our approach.

2 Bayesian Neural Networks

The application of Bayesian methods to neural networks has a rich history in machine learning (MacKay, 1992; Hinton & van Camp, 1993; Buntine & Weigend, 1991; Neal, 1995; De Freitas, 2003). The goal of Bayesian neural networks is to uncover the full posterior distribution over the network weights in order to capture uncertainty, to act as a regularizer, and to provide a framework for model comparison. The full posterior is, however, intractable for most forms of neural networks, necessitating expensive approximate inference or Markov chain Monte Carlo simulation. More recently, full or approximate Bayesian inference has been considered for small pieces of the overall architecture. For example, in similar spirit to this work, Lázaro-Gredilla & Figueiras-Vidal (2010); Hinton & Salakhutdinov (2008) and Calandra et al. (2014a) considered inference over just the last layer of a neural network. Alternatively, variational approaches are developed in Kingma & Welling (2014); Rezende et al. (2014) and Mnih & Gregor (2014), where a neural network is used in a variational approximation to the posterior distribution over the latent variables of a directed generative neural network.

Adaptive Basis Regression with Deep Neural Networks

A key limitation of GP-based Bayesian optimization is that the computational cost of the technique scales cubically in the number of observations, limiting the applicability of the approach to objectives that require a relatively small number of observations to optimize. In this work, we aim to replace the GP traditionally used in Bayesian optimization with a model that scales in a less dramatic fashion, but retains most of the GP’s desirable properties such as flexibility and well-calibrated uncertainty. Bayesian neural networks are a natural consideration, not least because of the theoretical relationship between Gaussian processes and infinite Bayesian neural networks (Neal, 1995; Williams, 1996). However, deploying these at a large scale is very computationally expensive.

As such, we take a pragmatic approach and add a Bayesian linear regressor to the last hidden layer of a deep neural network, marginalizing only the output weights of the net while using a point estimate for the remaining parameters. This results in adaptive basis regression, a well-established statistical technique which scales linearly in the number of observations, and cubically in the basis function dimensionality. This allows us to explicitly trade off evaluation time and model capacity. As such, we form the basis using the very flexible and powerful non-linear functions defined by the neural network.

These basis functions are parameterized via the weights and biases of the deep neural network, and these parameters are trained via backpropagation and stochastic gradient descent with momentum. In this training phase, a linear output layer is also fit. This procedure can be viewed as a maximum a posteriori (MAP) estimate of all parameters in the network. Once this “basis function neural network” has been trained, we replace the MAP-parameterized output layer with a Bayesian linear regressor that captures uncertainty in the weights. See Section 3.1.2 for a more elaborate explanation of this choice.

A natural concern with the use of deep networks is that they often require significant effort to tune and tailor to specific problems. One can consider adjusting the architecture and tuning the hyperparameters of the neural network as itself a difficult hyperparameter optimization problem. An additional challenge is that we aim to create an approach that generalizes across optimization problems. We found that design decisions such as the type of activation function used significantly altered the performance of the Bayesian optimization routine. For example, in Figure 2 we see that the commonly used rectified linear (ReLU) function can lead to very poor estimates of uncertainty, which causes the Bayesian optimization routine to explore excessively. Since the bounded tanh function results in smooth functions with realistic variance, we use this nonlinearity in this work; however, if the smoothness assumption needs to be relaxed, a combination of rectified linear functions with a tanh function only on the last layer can also be used in order to bound the basis.

1.2 Marginal likelihood vs MAP estimate

The standard empirical Bayesian approach to adaptive basis regression is to maximize the marginal likelihood with respect to the parameters of the basis (see Equation 8), thus taking the model uncertainty into account. However, in the context of our method, this requires evaluating the gradient of the marginal likelihood, which requires inverting a ${D\times D}$ matrix on each update of stochastic gradient descent. As this makes the optimization of the net significantly slower, we take a pragmatic approach and optimize the basis using a point estimate and apply the Bayesian linear regression layer post-hoc. We found that both approaches gave qualitatively and empirically similar results, and as such we in practice employ the more efficient one.

1.3 Quadratic Prior

The horseshoe is a so-called one-group prior for inducing sparsity and is a somewhat unusual choice for the weights of a regression model. Here we choose it because it 1) has support only on the positive reals, leading to convex functions, and 2) it has a large spike at zero with a heavy tail, resulting in strong shrinkage for small values while preserving large ones. This last effect is important for handling model misspecification as it allows the quadratic effect to disappear and become a simple offset if necessary.

2 Incorporating input space constraints

Many problems of interest have complex, possibly unknown bounds, or exhibit undefined behavior in some regions of the input space. These regions can be characterized as constraints on the search space. Recent work (Gelbart et al., 2014; Snoek, 2013; Gramacy & Lee, 2010) has developed approaches for modeling unknown constraints in GP-based Bayesian optimization by learning a constraint classifier and then discounting expected improvement by the probability of constraint violation.

In this work, to model the constraint surface, we similarly replace the Gaussian process with the adaptive basis model, integrating out the output layer weights:

3 Parallel Bayesian Optimization

Obtaining a closed form expression for the joint acquisition function across multiple inputs is intractable in general (Ginsbourger & Riche, 2010). However, a successful Monte Carlo strategy for parallelizing Bayesian optimization was developed in Snoek et al. (2012). The idea is to marginalize over the possible outcomes of currently running experiments when making a decision about a new experiment to run. Following this strategy, we use the posterior predictive distribution given by Equations 4 and 5 to generate a set of fantasy outcomes for each running experiment which we then use to augment the existing dataset. By averaging over sets of fantasies, we can perform approximate marginalization when computing EI for a candidate point. We note that this same idea works with the constraint network, where instead of computing marginalized EI, we would compute the marginalized probability of violating a constraint.

When this strategy is applied to a GP, the cost of computing EI for a candidate point becomes cubic in the size of the augmented dataset. This restricts both the number of running experiments that can be tolerated, as well as the number of fantasy sets used for marginalization. With DNGO it is possible to scale both of these up to accommodate a much higher degree of parallelism.

Experiments

In the literature, there exist several other methods for model-based optimization. Among these, the most popular variants in machine learning are the random forest-based SMAC procedure (Hutter et al., 2011) and the tree Parzen estimator (TPE) (Bergstra et al., 2011). These are faster to fit than a Gaussian process and scale more gracefully with large datasets, but this comes at the cost of a more heuristic treatment of uncertainty. By contrast, DNGO provides a balance between scalability and the Bayesian marginalization of model parameters and hyperparameters.

To demonstrate the effectiveness of our approach, we compare DNGO to these scalable model-based optimization variants, as well as the input-warped Gaussian process method of Snoek et al. (2014) on the benchmark set of continuous problems from the HPOLib package (Eggensperger et al., 2013). As Table 1 shows, DNGO significantly outperforms SMAC and TPE, and is competitive with the Gaussian process approach. This shows that, despite vast improvements in scalability, DNGO retains the statistical efficiency of the Gaussian process method in terms of the number of evaluations required to find the minimum.

2 Image Caption Generation

In this experiment, we explore the effectiveness of DNGO on a practical and expensive problem where highly parallel evaluation is necessary to make progress in a reasonable amount of time. We consider the task of image caption generation using multi-modal neural language models. Specifically, we optimize the hyperparameters of the log-bilinear model (LBL) from Kiros et al. (2014) to maximize the BLEU score of a validation set from the recently released COCO dataset (Lin et al., 2014). In our experiments, each evaluation of this model took an average of 26.6 hours.

We optimize learning parameters such as learning rate, momentum and batch size; regularization parameters like dropout and weight decay for word and image representations; and architectural parameters such as the context size, whether to use the additive or multiplicative version, the size of the word embeddings and the multi-modal representation size Details are provided in the supplementary material.. The final parameter is the number of factors, which is only relevant for the multiplicative model. This adds an interesting challenge, since it is only relevant for half of the hyperparameter space. This gives a total of 11 hyperparameters. Even though this number seems small, this problem offers a number of challenges which render its optimization quite difficult. For example, in order to not lose any generality, we choose broad box constraints for the hyperparameters; this, however, renders most of the volume of the model space infeasible. In addition, quite a few of the hyperparameters are categorical, which introduces severe non-stationarities in the objective surface.

Nevertheless, one of the advantages of a scalable method is the ability to highly parallelize hyperparameter optimization. In this way, high quality settings can be found after only a few sequential steps. To test DNGO in this scenario, we optimize the log-bilinear model with up to 800 parallel evaluations.

Running between 300 and 800 experiments in parallel (determined by cluster availability), we proposed and evaluated approximately 2500 experiments—the equivalent of over 2700 CPU days—in less than one week. Using the BLEU-4 metric, we optimized the validation set performance and the best LBL model found by DNGO outperforms recently proposed models using LSTM recurrent neural networks (Zaremba et al., 2015; Xu et al., 2015) on the test set. This is remarkable, as the LBL is a relatively simple approach. Ensembling this top model with the second and third best (under the validation metric) LBL models resulted in a test-set BLEU score We have verified that our BLEU score evaluation is consistent across reported results. We used a beam search decoding for our test predictions with the LBL model. of $26.7$ , significantly outperforming the LSTM-based approaches. We noticed that there were distinct multiple local optima in the hyperparameter space, which may explain the dramatic improvement from ensembling a small number of models. We show qualitative examples of generated captions on test images in Figure 3. Further figures showing the BLEU score as a function of the iteration of Bayesian optimization are provided in the supplementary.

3 Deep Convolutional Neural Networks

Finally, we use DNGO on a pair of highly competitive deep learning visual object recognition benchmark problems. We tune the hyperparameters of a deep convolutional neural network on the CIFAR-10 and CIFAR-100 datasets. Our approach is to establish a single, generic architecture, and specialize it to various tasks via individualized hyperparameter tuning. As such, for both datasets, we employed the same generic architecture inspired by the configuration proposed in Springenberg et al. (2014), which was shown to attain strong classification results. This architecture is detailed in Table 5.

We performed the optimization on a cluster of Intel® Xeon Phi™ coprocessors, with 40 jobs running in parallel using a kernel library that has been highly optimized for efficient computation on the Intel® Xeon Phi™ coprocessorAvailable at https://github.com/orippel/micmat. For the optimal hyperparameter configuration found, we ran a final experiment for 350 epochs on the entire training set, and report its result.

Our optimal models for CIFAR-10 and CIFAR-100 achieved test errors of $6.37\%$ and $27.4\%$ respectively. A comparison to published state-of-the-art results (Goodfellow et al., 2013; Wan et al., 2013; Lin et al., 2013; Lee et al., 2014; Springenberg et al., 2014) is given in Table 4. We see that the parallelized automated hyperparameter tuning procedure obtains models that are highly competitive with the state-of-the-art in just a few sequential steps.

A comprehensive overview of the setup, the architecture, the tuning and the optimum configuration can be found in the supplementary material.

Conclusion

In this paper, we introduced deep networks for global optimization, or DNGO, which enables efficient optimization of noisy, expensive black-box functions. While this model maintains desirable properties of the GP such as tractability and principled management of uncertainty, it greatly improves its scalability from cubic to linear as a function of the number of observations. We demonstrate that while this model allows efficient computation, its performance is nevertheless competitive with existing state-of-the-art approaches for Bayesian optimization. We demonstrate empirically that it is especially well suited to massively parallel hyperparameter optimization.

While adaptive basis regression with neural networks provides one approach to the enhancement of scalability, other models may also present promise. One promising line of work, for example by Nickson et al. (2014), is to introduce a similar methodology by instead employing the sparse Gaussian process as the underlying probabilistic model (Snelson & Ghahramani, 2005; Titsias, 2009; Hensman et al., 2013).

Acknowledgements

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We would like to acknowledge the NERSC systems staff, in particular Helen He and Harvey Wasserman, for providing us with generous access to the Babbage Xeon Phi testbed.

The image caption generation computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. We would like to acknowledge the FASRC staff and in particular James Cuff for providing generous access to Odyssey.

Jasper Snoek is a fellow in the Harvard Center for Research on Computation and Society. Kevin Swersky is the recipient of an Ontario Graduate Scholarship (OGS). This work was partially funded by NSF IIS-1421780, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institute for Advanced Research (CIFAR).

References

Supplementary Material

Appendix A Convolutional neural network experiment specifications

In this section we elaborate on the details of the network architecture, training and the meta-optimization. In the following subsections we elaborate on the hyperparametrization scheme. The priors on the hyperparameters as well as their optimal configurations for the two datasets can be found in Table 6.

The model architecture is specified in Table 5.

A.2 Data augmentation

We corrupt each input in a number of ways. Below we describe our parametrization of these corruptions.

We shift the hue, saturation and value fields of each input by global constants ${b_{H}\sim U(-B_{H},B_{H})}$ , ${b_{S}\sim U(-B_{S},B_{S})}$ , ${b_{V}\sim U(-B_{V},B_{V})}$ . Similarly, we globally stretch the saturation and value fields by global constants ${a_{S}\sim U(\frac{1}{1+A_{S}},1+A_{S})}$ , ${a_{V}\sim U(\frac{1}{1+A_{V}},1+A_{V})}$ .

Each input is scaled by some factor ${s\sim U(\frac{1}{1+S},1+S)}$ .

We crop each input to size $27\times 27$ , where the window is chosen randomly and uniformly.

Each input is reflected horizontally with a probability of 0.5.

Each input element is dropped independently and identically with some random probability $D_{0}$ .

A.3 Initialization and training procedure

We initialize the weights of each convolution layer $m$ with i.i.d zero-mean Gaussians with standard deviation $\frac{\sigma}{\sqrt{F_{m}}}$ where $F_{m}$ is the number of parameters per filter for that layer. We chose this parametrization to produce activations whose variances are invariant to filter dimensionality. We use the same standard deviation for all layers but the input, for which we dedicate its own hyperparameter $\sigma_{I}$ as it oftentimes varies in scale from deeper layers in the network.

We train the model using the standard stochastic gradient descent and momentum optimizer. We use minibatch size of 128, and tune the momentum and learning rate, which we parametrize as $1-0.1^{M}$ and $0.1^{L}$ respectively. We anneal the learning rate by a factor of $0.1$ at epochs $130$ and $190$ . We terminate the training after $200$ epochs.

We regularize the weights of all layers with weight decay coefficient $W$ . We apply dropout on the outputs of the max pooling layers, and tune these rates $D_{1},D_{2}$ separately.

A.4 Testing procedure

We evaluate the performance of the learned model by averaging its log-probability predictions on 100 samples drawn from the input corruption distribution, with masks drawn from the unit dropout distribution.

Appendix B Additional figures for image caption generation

In Figures 5(a) and 5(b) we provide some additional visualization of the results from the image caption generation experiments from Section 4.2 to highlight the behavior of the Bayesian optimization routine. Both figures show the validation BLEU-4 Score on MS COCO corresponding to different hyperparameter configurations as evaluated over iterations of the optimization procedure. In Figure 5(a), these are represented as a planar histogram, where the shade of each bin indicates the total count within it. The current best validation score discovered is traced in black. Figure 5(b) shows a scatter plot of the validation score of all the experiments in the order in which they finished. Validation scores of 0 correspond to constraint violations. These figures demonstrate the exploration-versus-exploitation paradigm of Bayesian Optimization, in which the algorithm trades off visiting unexplored parts of the space, and focusing on parts which show promise.

Appendix C Multimodal neural language model hyperparameters

We optimize a total of 11 hyperparameters of the log-bilinear model (LBL). Below we explain what these hyperparameters refer to.

The LBL model has two variants, an additive model where the image features are incorporated via an additive bias term, and a multiplicative that uses a factored weight tensor to control the interaction between modalities.

The goal of the LBL is to predict the next word given a sequence of words. The context size dictates the number of words in this sequence.

These are optimization parameters used during stochastic gradient learning of the LBL model parameters. The optimization over learning rate is carried out in log-space, but the proposed learning rate is exponentiated before being passed to the training procedure.

This controls the size of the joint hidden representation for words and images.

Words are represented by feature embeddings rather than one-hot vectors. This is the dimensionality of the embedding.

A regularization parameter that determines the amount of dropout to be added to the hidden layer.

$\mathcal{L}_{2}$ regularization on the input and output weights respectively. Like the learning rate, these are optimized in log-space as they vary over several orders of magnitude.

The rank of the weight tensor. Only relevant for the multiplicative model.