Mondrian Forests for Large-Scale Regression when Uncertainty Matters

Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh

Introduction

Gaussian process (GP) regression is popular due to its ability to deliver both accurate non-parametric predictions and estimates of uncertainty for those predictions. The dominance of GP regression in applications such as Bayesian optimization, where uncertainty estimates are key to balance exploration and exploitation, is a testament to the quality of GP uncertainty estimates.

Unfortunately, the computational cost of GPs is cubic in the number of data points, making them computationally very expensive for large scale non-parametric regression tasks. (Specifically, we focus on the scenario where the number of data points NN is large, but the number of dimensions DD is modest.) Steady progress has been made over the past decade on scaling GP inference to big data, including some impressive recent work such as .

Ensembles of randomized decision trees, also known as decision forests, are popular for (non-probabilistic) non-parametric regression tasks, often achieving state-of-the-art predictive performance . The most popular decision forest variants are random forests (Breiman-RF) introduced by Breiman and extremely randomized trees (ERT) introduced by Geurts et al. . The computational cost of learning decision forests is typically O(NlogN)\mathcal{O}(N\log N) and the computation across the trees in the forest can be parallelized trivially, making them attractive for large scale regression tasks. While decision forests usually yield good predictions (as measured by, e.g., mean squared error or classification accuracy), the uncertainty estimates of decision forests are not as good as those produced by GPs. For instance, Jitkrittum et al. compare the uncertainty estimates of decision forests and GPs on a simple regression problem where the test distributions are different from the training distribution. As we move away from the training distribution, GP predictions smoothly return to the prior and exhibit higher uncertainty. However, the uncertainty estimates of decision forests are less smooth and do not exhibit this desirable property.

Our goal is to combine the desirable properties of GPs (good uncertainty estimates, probabilistic setup) with those of decision forests (computational speed). To this end, we extend Mondrian forests (MFs), introduced by Lakshminarayanan et al. for classification tasks, to non-parametric regression tasks. Unlike usual decision forests, we use a probabilistic model within each tree to model the labels. Specifically, we use a hierarchical Gaussian prior over the leaf node parameters and compute the posterior parameters efficiently using Gaussian belief propagation . Due to special properties of Mondrian processes, their use as the underlying randomization mechanism results in a desirable uncertainty property: the prediction at a test point shrinks to the prior as the test point moves further away from the observed training data points. We demonstrate that, as a result, MFs yield better uncertainty estimates.

The paper is organized as follows: in Section 2, we briefly review Mondrian forests. We present MFs for regression in Section 3 and discuss inference and prediction in detail. We present experiments in Section 5 that demonstrate that (i) MFs produce better uncertainty estimates than Breiman-RF and ERT when test distribution is different from training distribution, (ii) MFs outperform or achieve comparable performance to large scale approximate GPs in terms of mean squared error (MSE) or negative log predictive density (NLPD), thus making them well suited for large scale regression tasks where uncertainty estimates are important, and (iii) MFs outperform (or perform as well as) decision forests on Bayesian optimization tasks, where predictive uncertainty is important (since it guides the exploration-exploitation tradeoff). Finally, we discuss avenues for future work in Section 6.

Mondrian forests

2 Mondrian trees and Mondrian forests

The expected depth of a Mondrian tree is parametrized by a non-negative lifetime parameter λ>0\lambda>0. Since it is difficult to specify λ\lambda, Lakshminarayanan et al. set λ=\lambda=\infty and stopped splitting a node if all the class labels of the data points within the node were identical. We follow a similar approach for regression: we do not split a node which has less than min_samples_split\mathsf{min\_samples\_split} number of data points.Specifying min_samples_split\mathsf{min\_samples\_split} instead of max-depth is common in decision forests, cf. . Given a bound min_samples_split\mathsf{min\_samples\_split} and training data D1:n\mathcal{D}_{1:n}, the generative process for sampling Mondrian trees is described in Algorithms 1 and 2.

The process is similar to top-down induction of decision trees except for the following key differences: (i) splits in a Mondrian tree are committed only within the range of training data (see Figure 1), and (ii) the split dimensions and locations are chosen independent of the labels and uniformly within BjxB_{j}^{x} (see lines 5, 6 of Algorithm 2). A Mondrian forest consists of MM i.i.d. Mondrian trees Tm=(Tm,δm,ξm,τm)T_{m}=(\mathsf{T}_{m},\bm{\delta}_{m},\bm{\xi}_{m},\bm{\tau}_{m}) for m=1,,Mm=1,\dotsc,M. See for further details.

Mondrian trees can be updated online in an efficient manner and remarkably, the distribution of trees sampled from the online algorithm is identical to the corresponding batch counterpart . We use the batch version of Mondrian forests (Algorithms 1 and 2) in all of our experiments except the Bayesian optimization experiment in section 5.3. Since we do not evaluate the computational advantages of online Mondrian forest, using a batch Mondrian forest in the Bayesian optimization experiment would not affect the reported results. For completeness, we describe the online updates in Algorithms 3 and 4 in the supplementary material.

Model, hierarchical prior, and predictive posterior for labels

In this section, we describe a probabilistic model that will determine the predictive label distribution, pT(yx,D1:N)p_{T}(y|\bm{x},\mathcal{D}_{1:N}), for a tree T=(T,δ,ξ,τ)T=(\mathsf{T},\bm{\delta},\bm{\xi},\bm{\tau}), dataset D1:N\mathcal{D}_{1:N}, and test point x\bm{x}. Let leaf(x)\mathsf{leaf}(\bm{x}) denote the unique leaf node jleaves(T)j\in\mathsf{leaves}(\mathsf{T}) such that xBj\bm{x}\in B_{j}. Like with Mondrian forests for classification, we want the predictive label distribution at x\bm{x} to be a smoothed version of the empirical distribution of labels for points in Bleaf(x)B_{\mathsf{leaf}(\bm{x})} and in BjB_{j^{\prime}} for nearby nodes jj^{\prime}. We will also achieve this smoothing via a hierarchical Bayesian approach: every node is associated with a label distribution, and a prior is chosen under which the label distribution of a node is similar to that of its parent’s. The predictive pT(yx,D1:N)p_{T}(y|\bm{x},\mathcal{D}_{1:N}) is then obtained via marginalization.

As is common in the decision tree literature, we assume the labels within each block are independent of X\bm{X} given the tree structure. Lakshminarayanan et al. used a hierarchy of normalized stable processes (HNSP) prior for classification problems. In this paper, we focus on the case of real-valued labels. Let N(μ,v)\mathcal{N}(\mu,v) denote a Gaussian distribution with mean μ\mu and variance vv. For every jTj\in\mathsf{T}, let μj\mu_{j} be a mean parameter (of a Gaussian distribution over the labels) at node jj, and let μ={μj:jT}\bm{\mu}=\{\mu_{j}:j\in\mathsf{T}\}. We assume the labels within a leaf node are Gaussian distributed:

where σy2\sigma_{y}^{2} is a parameter specifying the variance of the (measurement) noise.

We use the following hierarchical Gaussian prior for μ\bm{\mu}: For hyperparameters μH\mu_{H}, γ1,γ2\gamma_{1},\gamma_{2}, let

where ϕj=γ1σ(γ2τj)γ1σ(γ2τparent(j))\phi_{j}=\gamma_{1}\sigma(\gamma_{2}\tau_{j})-\gamma_{1}\sigma(\gamma_{2}\tau_{\mathsf{parent}(j)}) with the convention that τparent(ϵ)=0\tau_{\mathsf{parent}(\epsilon)}=0, and σ(t)=(1+et)1\sigma(t)=(1+e^{-t})^{-1} denotes the sigmoid function.

Before discussing the details of posterior inference, we provide some justification for the details of this model: Recall that τj\tau_{j} increases as we go down the tree, and so the use of the sigmoid σ()\sigma(\cdot) encodes the prior assumption that children are expected to be more similar to their parents as depth increases. The Gaussian hierarchy is closed under marginalization, i.e.,

where ϕϵ+ϕ0=γ1σ(γ2τϵ)γ1σ(γ20)+γ1σ(γ2τ0)γ1σ(γ2τϵ)=γ1σ(γ2τ0)γ1σ(γ20)\phi_{\epsilon}+\phi_{0}=\gamma_{1}\sigma(\gamma_{2}\tau_{\epsilon})-\gamma_{1}\sigma(\gamma_{2}0)+\gamma_{1}\sigma(\gamma_{2}\tau_{0})-\gamma_{1}\sigma(\gamma_{2}\tau_{\epsilon})=\gamma_{1}\sigma(\gamma_{2}\tau_{0})-\gamma_{1}\sigma(\gamma_{2}0). Therefore, we can introduce intermediate nodes without changing the predictive distribution. In Section 3.3, we show that a test data point can branch off into its own node: the hierarchical prior is critical for smoothing predictions.

Given training data D1:N\mathcal{D}_{1:N}, our goal is to compute the posterior density over μ\bm{\mu}:

The posterior over μ\bm{\mu} will be used during the prediction step described in Section 3.3. Note that the posterior over μ\bm{\mu} is computed independently for each tree, and so can be parallelized trivially.

We perform posterior inference using belief propagation . Since the prior and likelihood are Gaussian, all the messages can be computed analytically and the posterior over μ\bm{\mu} is also Gaussian. Since the hierarchy has a tree structure, the posterior can be computed in time that scales linearly with the number of nodes in the tree, which is typically O(N)\mathcal{O}(N), hence posterior inference is efficient compared to non-tree-structured Gaussian processes whose computational cost is typically O(N3)\mathcal{O}(N^{3}). Message passing in trees is a well-studied problem, and so we refer the reader to [19, Chapter 20] for details.

2 Hyperparameter heuristic

In this section, we briefly describe how we choose the hyperparameters θ={μH,γ1,γ2,σy2}\bm{\theta}=\{\mu_{H},\gamma_{1},\gamma_{2},\sigma_{y}^{2}\}. More details can be found in Appendix B in the supplementary material. For simplicity, we use the same values of these hyperparameters for all the trees; it is possible to optimize these parameters for each tree independently, and would be interesting to evaluate this extra flexibility empirically. Ideally, one might choose hyperparameters by optimizing the marginal likelihood (computed as a byproduct of belief propagation) by, e.g., gradient descent. We use a simpler approach here: we maximize the product of the individual label marginals, assuming a individual label noise, which yields closed-form solutions for μH\mu_{H} and γ1\gamma_{1}. This approach does not yield an estimate for γ2\gamma_{2}, and so, using the fact that τ\tau increases with NN, we pre-process the training data to lie in D^{D} and set γ2=D/(20log2N)\gamma_{2}=D/(20\log_{2}N) based on the reasoning that (i) τ\tau is inversely proportional to DD and (ii) τ\tau increases with tree depth and the tree depth is O(log2N)\mathcal{O}(\log_{2}N) assuming balanced trees.In Lakshminarayanan et al. , it was observed that the average tree depths were 2-3 times log2(N)\log_{2}(N) in practice.

Lakshminarayanan et al. proposed to stop splitting a Mondrian block whenever all the class labels were identical.Technically, the Mondrian tree is paused in the online setting and splitting resumes once a block contains more than one distinct label. However, since we only deal with the batch setting, we stop splitting homogeneous blocks. We adopt a similar strategy here and stop splitting a Mondrian block if the number of samples is fewer than a parameter min_samples_split\mathsf{min\_samples\_split}. It is common in decision forests to require a minimum number of samples in each leaf, for instance Breiman and Geurts et al. recommend setting min_samples_leaf=5\mathsf{min\_samples\_leaf}=5 for regression problems. We set min_samples_split=10\mathsf{min\_samples\_split}=10.

3 Predictive variance computation

The prediction step in a Mondrian regression tree is similar to that in a Mondrian classification tree [16, Appendix B] except that at each node of the tree, we predict a Gaussian posterior over yy rather than predict a posterior over class probabilities. Recall that a prediction from a vanilla decision tree is just the average of the training labels in leaf(x)\mathsf{leaf}(\bm{x}). Unlike decision trees, in a Mondrian tree, a test point x\bm{x} can potentially ‘branch off’ the existing Mondrian tree at any point along the path from root to leaf(x)\mathsf{leaf}(\bm{x}). Hence, the predictive posterior over yy from a given tree TT is a mixture of Gaussians of the form

where wjw_{j} denotes the weight of each component, given by the probability of branching off just before reaching node jj, and mj,vjm_{j},v_{j} respectively denote the predictive mean and variance. The probability of branching off increases as the test point moves further away from the training data at that particular node; hence, the predictions of MFs exhibit higher uncertainty as we move farther from the training data. For completeness, we provide pseudocode for computing (3) in Algorithm 5 of the supplementary material.

If a test data point branches off to create a new node, the predictive mean at that node is the posterior of the parent of the new node; if we did not have a hierarchy and instead assumed the predictions at leaves were i.i.d, then branching would result in a prediction from the prior, which would lead to biased predictions in most applications. The predictive mean and variance for the mixture of Gaussians are

and the prediction of the ensemble is then

The prediction of the ensemble can be thought of as being drawn from a mixture model over MM trees where the trees are weighted equally. The predictive mean and variance of the ensemble can be computed using the formula for mixture of Gaussians similar to (4). Similar strategy has been used in as well.

Related work

The work on large scale Gaussian processes can be broadly split into approaches that optimize inducing variables using variational approximations and approaches that distribute computation by using experts that operate on subsets of the data. We refer to for a recent summary of large scale Gaussian processes. Hensman et al. and Gal et al. use stochastic variational inference to speed up GPs, building on the variational bound developed by Titsias . Deisenroth and Ng present the robust Bayesian committee machine (rBCM), which combines predictions from experts that operate on subsets of data.

Hutter investigated the use of Breiman-RF for Bayesian optimization and used the empirical variance between trees in the forest as a measure of uncertainty. (Hutter et al. proposed a further modification, see Appendix C.) Eslami et al. used a non-standard decision forest implementation where a quadratic regressor is fit at each leaf node, rather than a constant regressor as in popular decision forest implementations. Their uncertainty measure—a sum of the Kullback-Leibler (KL) divergence—is highly specific to their application of accelerating expectation propagation, and so it seems their method is unlikely to be immediately applicable to general non-parametric regression tasks. Indeed, Jitkrittum et al. demonstrate that the uncertainty estimates proposed by are not as good as kernel methods in their application domain when the test distribution is different from the training distribution. As originally defined, Mondrian forests produce uncertainty estimates for categorical labels, but Lakshminarayanan et al. evaluated their performance on (online) prediction (classification accuracy) without any assessment of the uncertainty estimates.

Experiments

In this experiment, we compare uncertainty estimates of MFs to those of popular decision forests. The prediction of MFs is given by (5), from which we can compute the predictive mean and predictive variance.Code available from the authors’ websites. For decision forests, we compute the predictive mean as the average of the predictions from the individual trees and, following Hutter [13, §11.1.3], compute the predictive variance as the variance of the predictions from the individual trees. We use 25 trees and set min_samples_leaf=5\mathsf{min\_samples\_leaf}=5 for decision forests to make them comparable to MFs with min_samples_split=10\mathsf{min\_samples\_split}=10. We used the ERT and Breiman-RF implementation in scikit-learn and set the remaining hyperparameters to their default values.

We use a simplified version of the dataset described in , where the goal is to predict the outgoing message in expectation propagation (EP) from a set of incoming messages. When the predictions are uncertain, the outgoing message will be re-computed (either numerically or using a sampler), hence predictive uncertainty is crucial in this application. Our dataset consists of two-dimensional features (which are derived from the incoming message) and a single target (corresponding to mean of the outgoing message). The scatter plot of the training data features is shown in Fig. 2(a). We evaluate predictive uncertainty on two test distributions, shown in red and blue in Fig. 2(a), which contain data points in unobserved regions of the training data.

The mean squared error of all the methods are comparable, so we focus just on the predictive uncertainty. Figures 2(b), 2(c), and 2(d) display the predictive uncertainty of MF, ERT and Breiman-RF as a function of x1x_{1}. We notice that Breiman-RF’s predictions are over-confident compared to MF and ERT. The predictive variance is quite low even in regions where training data has not been observed. The predictive variance of MF is low in regions where training data has been observed (5<x1<5-5<x_{1}<5) and goes up smoothly as we move farther away from the training data; the red test dataset is more similar to the training data than the blue test data and the predictive uncertainty of MF on the blue dataset is higher than that of the red dataset, as one would expect. ERT is less overconfident than Breiman-RF, however its predictive uncertainty is less smooth compared to that of MF.

2 Comparison to GPs and decision forests on flight delay dataset

In this experiment, we compare decision forest variants to large scale Gaussian processes. Deisenroth and Ng evaluated a variety of large scale Gaussian processes on the flight delay dataset, processed by Hensman et al. , and demonstrate that their method achieves state-of-the-art predictive performance; we evaluate decision forests on the same dataset so that our predictive performance can be directly compared to large scale GPs. The goal is to predict the flight delay from eight attributes, namely, the age of the aircraft (number of years since deployment), distance that needs to be covered, airtime, departure time, arrival time, day of the week, day of the month and month.

Deisenroth and Ng employed the following strategy: train using the first NN data points and use the following 100,000 as test data points. Deisenroth and Ng created three datasets, setting NN to 700K700K, 2M2M (million) and 5M5M respectively. We use the same data splits and train MF, Breiman-RF, ERT on these datasets so that our results are directly comparable. Gal et al. report the performance of Breiman-RF on these datasets, but they restricted the maximum depth of the trees to 2, which hurts the performance of Breiman-RF significantly. They also use a random train/test split, hence our results are not directly comparable to theirs due to the non-stationarity in the dataset. We used 10 trees for each forest to reduce the computational burden.

We evaluate performance by measuring the root mean squared error (RMSE) and negative log predictive density (NLPD). NLPD, defined as the negative logarithm of (5), is a popular measure for measuring predictive uncertainty (cf. [23, section 4.2]). NLPD penalizes over-confident as well as under-confident predictions since it not only accounts for predictive mean but also the predictive variance. RF and ERT do not offer a principled way to compute NLPD. But, as a simple approximation, we computed NLPD for RF and ERT assuming a Gaussian distribution with mean equal to the average of trees’ predictions, variance equal to the variance of trees’ predictions.

Table 1 presents the results. The RMSE and NLPD results for SVI-GP, Dist-VGP and rBCM results were taken from , who report a standard error lower than 0.3 for all of their results. Table 1 in shows that rBCM achieves significantly better performance than other large scale GP approximations; hence we only report the performance of rBCM here. It is important to note that the dataset exhibits non-stationarity: as a result, the performance of decision forests as well as GPs is worse on the larger datasets. (This phenomenon was observed by Gal et al. and Deisenroth and Ng as well.) On the 700K700K and 2M2M dataset, we observe that decision forests achieve significantly lower RMSE than rBCM. MF achieves significantly lower NLPD compared to rBCM, which suggests that its uncertainty estimates are useful for large scale regression tasks. However, all the decision forests, including MFs, achieve poorer RMSE performance than rBCM on the larger 5M5M dataset. We believe that this is due to the non-stationary nature of the data. To test this hypothesis, we shuffled the 5,100,000 data points to create three new training (test) data sets with 5M5M (100K100K) points; all the decision forests achieved a RMSE in the range 31-34 on these shuffled datasets.

MF outperforms rBCM in terms of NLPD on all three datasets. On the 5M dataset, the NLPD of Breiman-RF is similar to that of MF, however Breiman-RF’s uncertainty is not computed in a principled fashion. As an additional measure of uncertainty, we report probability calibration measures (akin to those for binary classification cf. http://scikit-learn.org/stable/modules/calibration.html), also known as reliability diagrams , for MF, Breiman-RF and ERT. First, we compute the z%z\% (e.g. 90%90\%) prediction interval for each test data point based on Gaussian quantiles using predictive mean and variance. Next, we measure what fraction of test observations fall within this prediction interval. For a well-calibrated regressor, the observed fraction should be close to z%z\%. We compute observed fraction for z=10%z=10\% to z=90%z=90\% in increments of 10. We report observed fraction minus ideal fraction since it is easier to interpret—a value of zero implies perfect calibration, a negative value implies over-confidence (a lot more observations lie outside the prediction interval than expected) and a positive value implies under-confidence. The results are shown in Table 2. MF is clearly better calibrated than Breiman-RF and ERT, which seem to be consistently over-confident. Since 5M dataset exhibits non-stationarity, MF appears to be over-confident but still outperforms RF and ERT. Deisenroth and Ng do not report calibration measures and their code is not available publicly, hence we do not report calibration measures for GPs.

3 Scalable Bayesian optimization

Next, we showcase the usefulness of MFs in a Bayesian optimization (BayesOpt) task. We briefly review the Bayesian optimization setup for completeness and refer the interested reader to for further details. Bayesian optimization deals with the problem of identifying the global maximizer (or minimizer) of an unknown (a.k.a. black-box) objective function which is computationally expensive to evaluate.For a concrete example, consider the task of optimizing the hyperparameters of a deep neural network to maximize validation accuracy. Our goal is to identify the maximizer in as few evaluations as possible. Bayesian optimization is a model-based sequential search approach to solve this problem. Specifically, given nn noisy observations, we fit a surrogate model such as a Gaussian process or a decision forest and choose the next location based on an acquisition function such as upper confidence bound (UCB) or expected improvement (EI) . The acquisition function trades off exploration versus exploitation by choosing input locations where the predictive mean from the surrogate model is high (exploitation) or the predictive uncertainty of the surrogate model is high (exploration). Hence, a surrogate model with well-calibrated predictive uncertainty is highly desirable. Moreover, the surrogate model has to be re-fit after every new observation is added; while this is not a significant limitation for a few (e.g. 50) observations and scenarios where the evaluation of the objective function is significantly more expensive than re-fitting the surrogate model, the re-fitting can be computationally expensive if one is interested in scalable Bayesian optimization .

Hutter proposed sequential model-based algorithm configuration (SMAC), which uses Breiman-RF as the surrogate model and the uncertainty between the trees as a heuristic measure of uncertainty.Hutter et al. [14, §4.3.2] proposed a further modification to the variance estimation procedure, where each tree outputs a predictive mean and variance, in the spirit of quantile regression forests . See Appendix C for a discussion on how this relates to MFs. Nickson et al. discuss a scenario where this heuristic produces misleading uncertainty estimates that hinders exploration. It is worth noting that SMAC uses EI as the acquisition function only 50%50\% of the time and uses random search the remaining 50%50\% of the time (which is likely due to the fact that the heuristic predictive uncertainty can collapse to 0). Moreover, SMAC re-fits the surrogate model by running a batch algorithm; the computational complexity of running the batch version NN times is n=1NO(nlogn)=O(N2logN)\sum_{n=1}^{N}\mathcal{O}(n\log n)=\mathcal{O}(N^{2}\log N) .

MFs are desirable for such an application since they can produce principled uncertainty estimates and can be efficiently updated online with computational complexity n=1NO(logn)=O(NlogN)\sum_{n=1}^{N}\mathcal{O}(\log n)=\mathcal{O}(N\log N). Note that the cost of updating the Mondrian tree structure is O(logn)\mathcal{O}(\log n), however exact message passing costs O(n)\mathcal{O}(n). To maintain the O(logn)\mathcal{O}(\log n) cost, we approximate the Gaussian posterior at each node by a Gaussian distribution whose mean and variance are given by the empirical mean and variance of the data points at that node. Adding a new data point involves just updating mean and variance for all the nodes along the path from root to a leaf, hence the overall cost is O(logn)\mathcal{O}(\log n). (See Appendix C.)

We report results on four Bayesian optimization benchmarks used in , consisting of two synthetic functions namely the Branin and Hartmann functions, and two real-world problems, namely optimizing the hyperparameters of online latent Dirichlet allocation (LDA) and structured support vector machine (SVM). LDA and SVM datasets consist of 288 and 1400 grid points respectively; we sampled Branin and Hartmann functions at 250,000 grid points (to avoid implementing a separate optimizer for optimizing over the acquisition function). For SVM and LDA, some dimensions of the grid vary on a non-linear scale (e.g. 100,101,10210^{0},10^{-1},10^{-2}); we log-transformed such dimensions and scaled all dimensions to $sothatallfeaturesareonthesamescale.Weused10trees,setso that all features are on the same scale. We used 10 trees, set\mathsf{min\_samples\_split}=2$ and use UCB as the acquisition functionSpecifically, we set acquisition function = predictive mean + predictive standard deviation. for MFs. We repeat our results 15 times (5 times each with 3 different random grids for Branin and Hartmann) and report mean and standard deviation.

Following Eggensperger et al. , we evaluate a fixed number of evaluations for each benchmark and measure the maximum value observed. The results are shown in Table 3. The SMAC results (using Breiman-RF) were taken from Table 2 of . Both MF and SMAC identify the optimum for LDA-grid. SMAC does not identify the optimum for Branin and Hartmann functions. We observe that MF finds maxima very close to the true maximum on the grid, thereby suggesting that better uncertainty estimates are useful for better exploration-exploitation tradeoff. The computational advantage of MFs might not be significant with few evaluations, but we expect MFs to be computationally advantageous in applications with thousands of observations, e.g., scalable Bayesian optimization and reinforcement learning .

4 Failure modes of our approach

No method is a panacea: here we discuss two failure modes of our approach that would be important to address in future work. First, we expect GPs to perform better than decision forests on extrapolation tasks; a GP with an appropriate kernel (and well-estimated hyperparameter values) can extrapolate beyond the observed range of training data; however, the predictions of decision forests with constant predictors at leaf nodes are confined to the range of minimum and maximum observed yy. If extrapolation is desired, we need complex regressors (that are capable of extrapolation) at leaf nodes of the decision forest. However, this will increase the cost of posterior inference. Second, MFs choose splits independent of the labels; hence irrelevant features can hurt predictive performance ; in the batch setting, one can apply feature selection to filter or down weight the irrelevant features.

Discussion

We developed a novel and scalable methodology for regression based on Mondrian forests that provides both good predictive accuracy as well as sensible estimates of uncertainty. These uncertainty estimates are important in a range of application areas including probabilistic numerics, Bayesian optimization and planning. Using a large-scale regression application on flight delay data, we demonstrate that our proposed regression framework can provide both state-of-the-art RMSE and estimates of uncertainty as compared to recent scalable GP approximations. We demonstrate that Mondrian forests deliver better-calibrated uncertainty estimates than existing decision forests, especially in regions far away from the training data. Since Mondrian forests deliver good uncertainty estimates and can be trained online efficiently, they seem promising for applications such as Bayesian optimization and reinforcement learning.

Acknowledgments

We thank Wittawat Jitkrittum for sharing the dataset used in and helpful discussions. We thank Katharina Eggensperger, Frank Hutter and Ziyu Wang for helpful discussions on Bayesian optimization. BL gratefully acknowledges generous funding from the Gatsby Charitable Foundation. This research was carried out in part while DMR held a Research Fellowship at Emmanuel College, Cambridge, with funding also from a Newton International Fellowship through the Royal Society. YWT’s research leading to these results has received funding from EPSRC (grant EP/K009362/1) and the ERC under the EU’s FP7 Programme (grant agreement no. 617411).

References

Appendix A Pseudocode for online learning and prediction

The online updates are shown in Algorithms 3 and 4. The prediction step is detailed in Algorithm 5.

Appendix B Choosing the hyperparameters

In this appendix, we give more details on how we choose the hyper parameters θ={μH,γ1,γ2,σy2}\bm{\theta}=\{\mu_{H},\gamma_{1},\gamma_{2},\sigma_{y}^{2}\}. For simplicity, we used the same values of these hyperparameters for all the trees; it is possible to optimize these parameters for each tree independently.

We optimize the product of label marginals, integrating out μ\bm{\mu} for each label individually, i.e.,

Since τj=\tau_{j}=\infty at the leaf nodes, we have

If the noise variance is known, σy2\sigma_{y}^{2} can be set to the appropriate value. In our case, the noise variance is unknown; hence, we parametrize σy2\sigma_{y}^{2} as γ1/K\gamma_{1}/K and set K=min(2000,2N)K=\mathsf{min}(2000,2N) to ensure that the noise variance σy2\sigma_{y}^{2} is a non-zero fraction of the total variance γ1/2+γ1/K\gamma_{1}/2+\gamma_{1}/K.

We maximize q(Yθ,T)q(Y|\bm{\theta},T) over μH\mu_{H}, γ1\gamma_{1}, and KK, leading to

Note that we could have instead performed gradient descent on the actual marginal likelihood produced as a byproduct of belief propagation. It would be interesting to investigate this.

The likelihood q(Yθ,T)q(Y|\bm{\theta},T) does not depend on γ2\gamma_{2}, and so we cannot choose γ2\gamma_{2} by optimizing it. We know, however, that τ\tau increases with NN. Moreover, Lakshminarayanan et al. observed that the average tree depths were 2-3 times log2(N)\log_{2}(N) in practice. We therefore pre-process the training data to lie in D^{D} and set γ2=D20log2N\gamma_{2}=\frac{D}{20\log_{2}N} since (i) τ\tau increases with tree depth and the tree depth is O(log2N)\mathcal{O}(\log_{2}N) assuming balanced trees and (ii) τ\tau is inversely proportional to DD. In Appendix C, we describe a fast approximation which does not involve estimation of γ1,γ2\gamma_{1},\gamma_{2}.

Appendix C Fast approximation to message passing and hyperparameter estimation

In Section 5.3, we suggested a fast O(logn)\mathcal{O}(\log n) approximation to exact message passing which costs O(n)\mathcal{O}(n). Under this approximation, the Gaussian posterior at each node is approximated by a Gaussian distribution whose mean and variance are given by the empirical mean and variance of the data points at that node. This approximation is better suited for online applications since adding a new data point involves just updating mean and variance for all the nodes along the path from root to a leaf. Another advantage of this approximation is that we only need to set the noise variance σy2\sigma_{y}^{2} and do not need to set the hyper-parameters {μH,γ1,γ2}\{\mu_{H},\gamma_{1},\gamma_{2}\}.

Since our initial publication, we have learnt that this Gaussian posterior approximation is similar to a random forest modification independently proposed in Hutter et al. [14, §4.3.2]. In , each tree outputs a predictive mean and variance equal to the empirical mean and variance of the labels at the leaf node of the decision tree. However, there is an additional level of smoothing in MFs that is not present in . Specifically, the prediction from a Mondrian tree, described in (3), is a weighted mixture of predictions from nodes along the path from the root to the leaf. Moreover, the weights account for the distance between the test point from the training data, thereby ensuring that the predictions shrink to the prior as we move farther away from the training data.