Dynamic Trees for Learning and Design
Matthew A. Taddy, Robert B. Gramacy, Nicholas G. Polson
Introduction
This article develops sequential regression trees, with implementation details and examples for applications in experiment design, optimization, classification, and on-line learning. Our most basic insight is the characterization of partition trees as a dynamic model state. This allows the model to grow with data, such that each new observation leads to only small changes in the posterior, and has the practical effect of reducing tree space dimension without affecting model flexibility. A Bayesian inferential framework is built around particle learning algorithms for the efficient on-line filtering of tree-states, and examples demonstrate that the proposed approach is able to provide better results compared to commonly used methods at a fraction of the cost.
We consider two generic regression formulations. The first has real-valued response (univariate here, but this is not a general restriction) with unspecified mean function and Gaussian noise, and the second has a categorical response with covariate dependent class probabilities:
The regression tree framework provides a simple yet powerful solution to modeling these relationships in situations where very little is known a priori. For such models, the covariate space is partitioned into a set of hyper-rectangles, and a simple model (e.g., linear mean and constant error variance) is fit within each rectangle. This adheres to a general strategy of having a state-space model (i.e., the partition structure) induce flexible prediction despite a restrictive model formulation conditional on this state. For trees, our strategy allows posterior inference to be marginalized over the parameters of simple models in each partition. Given a particular tree, predictive uncertainty is available in closed form and does not depend upon approximate posterior sampling of model parameters. Uncertainty updating is then a pure filtering problem, requiring posterior sampling for the partition tree alone. Thereby this article establishes a natural sequential framework for the specification and learning of regression trees.
Although partitioning is a rough modeling tool, trees will often be better suited to real-world applications than other “more sophisticated” alternatives. Compared to standard basis function models for nonparametric regression, the dynamic trees proposed herein provide some appealing properties (e.g., flexible response surface, nonstationarity, heteroskedasiticity) at the expense of others (e.g., smoothness, explicit correlation structure) within a framework that is trivial to specify and allows for conditional inference, given the global partition tree-state, to be marginalized over all model parameters. Prediction is very fast, requiring only a search for the rectangle containing a new , and individual realizations yield an easily interpretable decision tree for regression and classification problems. Finally, and perhaps most importantly, the models are straightforward to fit through sequential particle algorithms, and thus offer a major advantage in on-line inference problems.
Consider the setting of sequential experiment design, which serves as one of the motivators for this article. In engineering applications, and especially for computer experiments, the typical model of (1.a) is built from a stationary Gaussian process (GP) prior for with constant additive error variance (see, e.g., Santner et al., , 2003). This GP specification is very flexible, but inference for sample size is due to the need to decompose matrices. Furthermore, full Bayesian inference is usually built around MCMC sampling, which demands computation for a posterior sample of size . Sequential design requires this batch sampling to be restarted whenever new pairs are added into the data, making an already computationally intensive procedure even more expensive by failing to take advantage of the sequential nature of the problem. With lower order inference (i.e., only a search of partitions followed by simple leaf regression model calculations) and an explicit dynamic structure, our dynamic regression trees are less expensive and can be fit in serial.
Furthermore, it is essential that the fitted model is appropriate for the problem of sequential design. In this, the standard choice of a stationary GP will again often fall short. Whether designing for optimization or for general learning, the goal is to search for distinctive areas of the feature space that warrant further exploration (e.g., due to high variance or low expected response). The single most appropriate global specification may be very different from the ideal local model in areas of interest (e.g., along breaks in the response surface or near optima), such that it is crippling to assume stationarity or constant error variance. However, nonstationary GP modeling schemes (e.g. Higdon et al., , 1999) require even more computational effort than their stationary counterparts. In contrast, nonstationary function means and nonconstant variance are fundamental characteristics of tree-based regression.
These considerations are meant to illustrate a standard point: It may be that some flexible functions or are ideal for modeling (1.a-b) in any particular setting, but that a simple and fast approximation – with predictions averaged over posterior uncertainty – is better suited to the design scenario or classification problem at hand. We thus put forth dynamic regression trees as a robust class of models with some key properties: prior specification is scale-free and automatic; inference is marginalized over model parameters given a global partition state; it is possible to sequentially filter uncertainty about the partition state; and the resulting prediction surfaces are appropriate for modeling complicated regression functions.
The remainder of the paper is organized as follows. Section 1.1 provides a survey of partition tree models and 1.2 contains a review of Bayesian inference for trees. Our core methodology is outlined in Section 2, which develops a sequential characterization of uncertainty updating for regression trees. The partition evolution is defined in 2.1, leaf model and prediction details are in 2.2, a particle learning algorithm for posterior simulation is provided in 2.3, and marginal likelihood estimation is discussed in 2.4. Section 3 then describes a set of application examples. First, 3.1 illustrates linear and constant mean regression models on some simple datasets, and compares results to alternatives from the literature. We consider the sequential design of experiments, with optimization search in 3.2 and active learning in 3.3. Finally, 3.4 provides two classification examples. Section 4 contains short closing discussion.
The use of partition trees to represent input-output relationships is a classic nonparametric modeling technique. A decision tree is imposed with switching on input variables and terminal node predictions for the relevant output. Although other schemes are available (e.g., based on Voronoi tessellations), the standard approach relies on a binary recursive partitioning of input variables, as in the classification and regression tree (CART) algorithms of Breiman et al., (1984). This forces axis-aligned partitions, although pre-processing data transformations can be used to alter the partition space. In general, the computational and conceptual simplicity of rectangular partitions will favor them over alternative schemes. As binary recursive partition trees are fundamental to our regression model, we shall outline some notation and details here.
Figure 1 shows two diagrams of local tree structure (partition trees tend to grow upside-down). Graph () illustrates the potential neighborhood of each node , and graph () shows a hypothetical local tree formed through recursive binary partitioning. Left and right children, and , are disjoint subsets such that ; and the parent node, , contains both and its sibling node , where and . Node ancestors are parents which contain a given node, and a node’s depth is equivalent to its number of ancestors. Any of the neighbors in () may be nonexistent for a specific node; for example, root has no parent or sibling. If a node has children it is considered internal, otherwise it is referred to as a leaf node. The set of internal nodes for is denoted , and the set of leaves is .
fnum@subsection1.2 Inference for Tree Models
That is, the tree prior is the probability that internal nodes have split and leaves have not.
In their seminal paper, CGM develop a Metropolis-Hastings MCMC approach for sampling from the posterior distribution of partition trees. This algorithm is able to explore posterior tree space by stochastically proposing incremental modifications to (grow, prune, change, and swap moves) which are accepted according to the appropriate Metropolis-Hastings ratio. Although this approach provides a search of posterior trees, the resulting Markov chain will generally have poor mixing properties. In particular, the proposed tree moves are very local. An improbable sequence of (first) prunes and (then) grows are required to move between parts of tree space that have similar posterior probability, but yet represent drastically different hierarchical split rules. Furthermore, MCMC is a batch algorithm and must be re-run for each new observation, making it inefficient for on-line applications.
The next section reformulates regression trees as a dynamic model. Our novel characterization leads to an entirely new class of models, and we develop a sequential particle inference framework that is able to alleviate many of the difficulties with MCMC for trees.
Dynamic Regression Trees
stay: The tree hierarchy remains unchanged, and .
prune: The tree may be pruned by removing and all of the nodes below and including its sibling . Hence, parent becomes a leaf node in . If is null (contains only a root), then the prune move is invalid since is parentless.
In this way, tree changes are restricted to the neighborhood of illustrated in Figure 1; we will see in Section 2.3 that this feature is key in limiting the cost of posterior simulation.
We place a conditional uniform prior distribution over the above set of candidate split locations. Hence, is uniformly distributed over covariate dimensions with non-empty sets of possible grow points and, given , the split point is assigned a uniform prior on . This distribution is uniform over eligible covariate dimensions, regardless of variable scaling within each dimension. As an aside, since the likelihood is unchanged for all grow moves which result in equivalent new leaf nodes – that is, the model is only identified up to location sets which separate elements of – an alternative prior would restrict split locations to members of (such that every possible grow changes the likelihood). We prefer the former option since it leads to smoother posterior predictive surfaces in our particle approximations.
The formulation in this section has many things in common with the original CGM partitioning approach. Indeed, both involve similar grow and prune moves. However, in CGM these are merely MCMC proposals, whereas in dynamic trees they are embedded directly into the definition of the sequential process. Also, analogues of change and swap are not present in our sequential formulation. Although these could have been incorporated, we felt that limiting computational cost was more desirable. In particular, there are two aspects of our framework – a global particle approach to inference and a filtered posterior that changes only incrementally with each new observation – which allow for the convenience of a smaller set of tree rules.
fnum@subsection2.2 Prediction and Leaf Regression Models
Our dynamic tree model is such that posterior inference is driven by two main quantities: the marginal likelihood for a given tree and the posterior predictive distribution for new data. This section establishes these functions for dynamic trees, and quickly details exact forms for our three simple leaf regression models.
We concentrate on three basic options for leaf regression: the canonical constant and linear mean leaf models, and multinomial leaves for categorical data. From (3) and (4), we see that conditioning on a given tree reduces our necessary posterior functionals to the product of independent leaf likelihoods and a single leaf regression function, respectively. Ease of implementation and general efficiency of our models depends upon an ability to evaluate analytically the integrals in (3) and (4), such that prediction and likelihood are always marginalized over unknown regression model parameters (given a default scale-free prior specification). Fortunately, such quantities are easily available for each leaf, and the following three subsections quickly outline modeling specifics for our three types of regression tree.
Before moving to leaf details, we note that the complexity of leaves is limited only by computational budget and data dimensions. For example, the treed Gaussian process (TGP) approach of Gramacy and Lee, (2008) fits a GP regression model at each leaf node. The tree imposes an independence structure on data covariance, providing an inexpensive nonstationary GP model. However, unlike the models proposed herein, GP leaves do not allow inference to be integrated over regression model parameters. This significantly complicates posterior inference, leading to either reversible-jump methods for MCMC or much higher-dimensional particles for sequential inference (which can kill algorithm efficiency). Thus, although dynamic trees can be paired with more complex models, the robust nature of less expensive trees will lead to them being the preferred choice in many industrial applications. Indeed, our results in Section 3 indicate that extra modeling does not necessarily lead to better performance.
Consider a tree partitioning data at time , and let be the number of observations in leaf node . The constant mean model assumes that are distributed as
Under this model, leaf sufficient statistics are and , where our prior forces the minimum data condition . Note that these statistics are easy to update when the leaf sets change.
Under the motivation of an automatic regression framework, we assume independent scale-invariant priors for each leaf model, such that . The leaf likelihood is then
For covariate vector allocated to leaf node , the posterior predictive distribution is
2.2 Linear Mean Leaves
Consider extending the above model to a linear leaf regression function. Leaf responses are accompanied by design matrix , where , and
for intercept parameter , -dimensional slope parameter , and error variance . Sufficient statistics for leaf node are then and , as in Section 2.2.1, plus the covariate mean vector , shifted design matrix , Gram matrix and slope vector for the shifted design matrix, and regression sum of squares . We now restrict . As before, these statistics are easily updated; in particular, matrices can be adapted through partitioned inverse equations.
We again assume the scale invariant prior and, as above, it is straightforward to calculate the essential predictive probability and marginal likelihood functions. The marginal likelihood for leaf node is then
For allocated to leaf node and with , the posterior predictive distribution is
2.3 Multinomial Leaves
For categorical data, as in (1.b), each leaf response is equal to one of different factors. The set of responses, , can be summarized through a count vector , such that . Summary counts for each leaf node are then modeled as
Covariates allocated to leaf node lead to predictive response probabilities
fnum@subsection2.3 Particle Learning for Posterior Simulation
In particular, our approach is a version of the particle learning (PL) sequential Monte Carlo algorithm introduced by Carvalho et al., (2010a) in the context of mixtures of dynamic linear models. The methods in this section are specific to dynamic regression trees and we defer further detail to the original PL paper (in addition, Carvalho et al., , 2010b, focuses on PL for general mixtures and thus describes a partition model with parallels to trees). In the context of our dynamic trees, PL recursive update equations are
Upon observing new covariates and response , update as follows.
Resample: Draw particle indices with predictive probability weight
as in (7), (10), or (13). There are various low-variance options for resampling (e.g., Cappé et al., , 2005), and our implementation makes use of residual-resampling.
Set for each to form a new particle set.
Finish with deterministic sufficient statistic updates .
In an appealing division of labor, resampling incorporates global changes to the tree posterior, while propagation provides local modifications. As with all particle simulation methods, some Monte Carlo error will accumulate and, in practice, one must be careful to assess its effect. However, as mentioned above, our strategy makes major gains by integrating over model parameters to obtain particles which consist of only split rules and sufficient statistics. Given this efficient low-dimensional particle definition, our resampling-first procedure will sequentially discard all models except those which are predicting well, and this tempers posterior search to a small set of plausible trees with good predictive properties. We will see in Section 3 that PL for trees has some significant advantages over the traditional MCMC approach.
fnum@subsection2.4 Marginal Likelihood Estimation
This is just the normalizing constant for PL resampling probabilities (refer to Section 2.3).
Application and Illustration
We now present a series of examples designed to illustrate our approach to on-line regression. Section 3.1 considers two simple 1-d regression problems, illustrating both constant and linear mean leaf models over multiple PL runs, before comparing dynamic tree predictive performance against competing estimators in a 5-d application. The next two sections focus on the sequential design of experiments, with optimization problems in Section 3.2 and active learning in Section 3.3. Finally, Section 3.4 describes the application of dynamic multinomial leaf trees to a 15-d classification problem. Throughout, our dynamic trees were fit using the dynaTree package (Gramacy and Taddy, , 2010b) for R under the default parametrization. In particular, the tree prior of Section 1.2 is specified with and ; inference is generally robust to reasonable changes to this parametrization (e.g., the four settings described in Figure 3 of CGM (2002) lead to no qualitative differences).
The top two rows of Figure 3 show 30 repeated filtered posterior fits for random re-orderings of the data, obtained through the PL algorithm of Section 2.3 with 1000 particles. The first row corresponds to constant leaf models while the second row corresponds to linear leaves, and each plot shows posterior mean and 90% predictive intervals. Although there is clear variation from one PL run to the next, this is not considered excessive given the random data orderings and small number of particles. Linear leaf models appear to be far better than constant models at adapting to the parabola data. With the parabola data, a constant leaf model leads to higher average tree height (9.4, vs 5.4 for linear leaves) while the opposite is true for the motorcycle data (average constant leaf height is 10.5, vs 12.8 for linear leaves).
A more rigorous assessment of the relative evidence in favor of each leaf model is possible through estimated Bayes factors, as described in Section 2.4. Figure 3 presents filtered log BFs comparing linear to constant leaves, calculated for the 30 random data orderings used in Figure 3 and conditional on the first observations. As discussed in 2.4, due to both random ordering and different training samples, the BF estimates are not directly comparable across runs. However, for the parabola data, in every case the linear model is clearly preferable. For the motorcycle data, although there is no consistent evidence in favor of either model, the majority of runs produce log BF values below zero and the mean across runs (which eliminates order dependence as the number of runs increases) shows fairly strong evidence against the linear model. This agrees with the visually similar regression fits drawn in Figure 3, which were obtained after fewer average partitions under constant leaves than with the linear leaves.
Finally, the bottom row of Figure 3 compares the means of the 30 dynamic tree fits to two similar modern nonparametric regression techniques: treed Gaussian processes (TGP) and Bayesian additive regression trees (BART; Chipman et al., , 2010). Each new model is designed as an extension to constant or linear regression trees, and makes steps to partially alleviate the mixing problems of CGM’s original MCMC inference. As mentioned previously, TGP takes advantage of a more flexible leaf regression model to allow for broader covariate partitions. In a different approach, BART proposes a mixture of relatively short trees, and the authors show that combinations of simple individual partitioning schemes can lead to a complicated predictive response surface. The TGP model is given 10 restarts of a 10,000 iteration MCMC run, while BART results are based on a single 1,100 iteration chain. For the parabola data, all of the models find practically identical fits, except for the poorly performing constant leaf dynamic tree. However, for the motorcycle data, we see that each model leads to mean functions that are very similar, but that posterior predictive 90% intervals for BART and TGP appear to variously over or under estimate data uncertainty around the regression mean. In particular, BART’s global variance term is misspecified for this heteroskedastic data.
For this experiment, 100 random training and prediction sets, of respective size 200 and 1000, were drawn with inputs uniform on . Following Chipman et al., (2002), we measure performance through predictive root mean-square error (RMSE) to the true mean, and results are shown in Figure 4. Dynamic tree models (DTC/DTL) were fit in a single (random) pass with particles, and the number of MCMC iterations and restarts, etc., for other Bayesian methods was such that Monte Carlo error in RMSE for repeated runs on the same data is negligible against the results in Figure 4. Non-Bayesian comparators were fit under package defaults. The dynamic versions of the tree models clearly outperform their static counterparts, and the dynamic treed linear model (DTL) performs nearly as well as or better than the GP and BART, both of which offer flexible mean functions under a (true) constant variance term.
fnum@subsection3.2 Search Optimization
In Taddy et al., (2009), TGP regression was used to augment a local pattern search scheme with ranked lists of locations that maximize a measure of expected improvement. The GP leaf model was successful in this setting – deterministic optimization with 8 inputs – but pairwise covariance calculations and repeated MCMC runs became unwieldy in higher dimensions. At the same time, such hybrid optimization schemes are most useful in complicated high-dimensional settings: although true global optimization is impossible without great expense, a hybrid method uses regression to locate promising input areas and move the local search appropriately, thus iterating towards robust optimality.
We propose that our dynamic regression trees, due to a flexible on-line inference framework, provide a very attractive regression tool for hybrid local-global search optimization. Although they are not interpolators, and thus of limited usefulness in deterministic optimization, both constant and linear regression trees provide an efficient model for the prediction and optimization of stochastic functions. Such stochastic optimization problems are common in operations research and, despite being deterministic, many engineering codes are more properly treated as stochastic due to numerical instability. This section will outline use of dynamic regression trees for the global component of a stochastic optimization search.
In a result that will parallel findings in Section 3.3, constant, rather than linear, leaves lead to a better exploration of the 2-d exponential function; this relative weakness exposes a difficulty for linear models in fitting the rapid oscillations around . Interestingly, constant trees also lead to better results than for the TGP optimization routine. This is despite the added flexibility of GP leaves, an augmented candidate sample, and use of a wide range of values for (which, although harder to compute when marginalizing over model parameters, is the literature standard for this type of search). In addition, due to repeated MCMC runs, the TGP algorithm requires near to 10 times the computation of sequential tree optimization.
fnum@subsection3.3 Active Learning
Since active learning is an inherently on-line algorithm, it provides a natural application for dynamic regression trees. The remainder of this section will illustrate ALC/ALM schemes built around particle learning for trees, and show that our methods compare favorably to existing MCMC-based alternatives. As in 3.2, both constant and linear leaf models lead to closed-form calculations of heuristic functionals conditional on a given tree. Fast prediction allows us to evaluate the necessary statistics across candidates, for each particle, and hence find the optimal under our heuristic. Trees are then updated for , and the process is repeated.
Expressions for may be obtained as a special case of the results given by Gramacy and Lee, (2009). Clearly, if and are allocated to different leaves of . Otherwise, for and allocated to in a given tree, we have that is (for constant and linear leaves respectively)
We use the parabola and motorcycle data examples of Section 3.1 to compare ALC and ALM for both constant and linear leaf dynamic trees. Figure 6 illustrates the statistics associated with each combination of the two heuristics and two regression models, evaluated given the complete datasets. For the parabola data (left column), the ALM and ALC plots look roughly similar and, in each case, the linear model statistics are much more flat than for the constant model. In contrast, the ALM and ALC plots are very different from each other for the motorcycle data, which exhibit heteroskedastic additive error.
For a further comparison we return to the sin/Cauchy data from Section 3.2. Figure 7 shows sequential design progress under our dynamic tree model with linear leaves and the ALC heuristic in three snapshots: after an initial Latin hypercube (LHS) sample, after 15 samples taken via ALC () and after 45 ALC samples (). With the exception of , when there is not enough data to support a split in the tree(s), the ALC statistic is high at points where is changing direction—highest where it is changing most rapidly.
Table 2 (right) shows the results of a similar experiment for the 2-d exponential data from Section 3.2. The setup is as described above, except that the initial LHS is now of size 20 and is followed by 55 active learning rounds. Here, the dynamic trees share the top 6 spots with TGP only, and clearly dominate their static analogues. Indeed, DTC is the second best performer, echoing our findings in Section 3.2 that the constant leaf model is a good fit for this function. In contrast to the results from our simple 1-d example, the offline (LHS and ME) and GP methods are poor here because the region of interest is confined to the bottom left quadrant of our 2-d input space. Finally, in all of these examples, dynamic regression trees (DTC and DTL) are the only methods that run on-line and do not require batch MCMC processing.
fnum@subsection3.4 Classification
Classification is one of the original applications for tree models, and is very often associated with sequential inference settings. Categorical data can be fit with multinomial leaf trees (as in 2.2.3) and, without an application specific loss function, the predictive classification rule assigns to new inputs the class with highest mean conditional posterior probability. That is,
for a set of particles, where each is the -class probability from (13) for the leaf corresponding to . Evaluating over the input space provides a predictive classification surface.
Figure 8 illustrates the outcome of this example. The dynamic tree results (top row) are based on a set of 1000 particles, while the GP soft-max results (bottom row) use every 100th observation of a 10,000 iteration MCMC chain. The left column of this figure shows the syndrome classification surface based on maximum mean posterior probability, as in (19). Although it is not possible to assess whether any classifier is outperforming the other, these surfaces clearly illustrate the effect of axis-symmetric (tree) classification as opposed to the surface from a radial-basis (GP) model. We also note that the GP soft-max classification rule is very similar to results from Ripley, (1996, chap. 5.2) for a neural network fit with weight decay.
The right column of Figure 8 shows the entropy, , of the posterior mean probability surface for each classification model. Such entropy plots, which are commonly used in classification settings to assess response variability, illustrate change in expected value of information over the input space. Indeed, Gramacy and Polson, (2010) propose entropy as an active learning criteria for classification problems, analogous to ALC in Section 3.3. In our application, dynamic tree entropy is peaked around a 3-class border region in the top-left. This intuitively offers better guidance about the value of additional information than the GP entropy surface, which is high everywhere except right at clusters of same-class observations.
To better assess the performance of our dynamic tree classifier, we turn to a larger “credit approval” data set from the UC Irvine Machine Learning Database. This data includes 690 credit card applicants grouped by approval status (either ‘+’ or ‘-’) and 15 input variables. Eleven of these inputs are categorical, and for each of these we have encoded the categories through a series of binary variables. This convenient reparametrization, which expands the input space to 47 dimensions, allows direct application of the model in Section 1.1 with nodes splitting on or for each binary variable (see Gramacy and Taddy, , 2010a, for discussion of this approach in general partition trees). As an aside, we note that the binary encoding can also be used to incorporate categorical data into constant and linear leaf trees; the only adaptation is, for linear leaves, to exclude binary variables from each regression design matrix.
We applied the multinomial dynamic tree to 100 independent repetitions of training on 90% of the credit approval data and predictive classification on the left-out 10% sample. This experiment is identical to that in Broderick and Gramacy, (2009), who tested both a TGP soft-max classifier (includes binary variables for partitioning of latent TGP processes and fits independent GP to continuous inputs within each partition) and a naïve GP soft-max classifier (fit to the reparametrized 47 inputs, treating binary variables as continuous in the correlation function). Their results are repeated here, beside those for our dynamic tree, in Table 3. Once again, despite using a less sophisticated leaf model, the dynamic tree is a clear performance winner and the only classifier able to beat 14% average missclassification. The dynamic tree was also orders-of-magnitude faster than the GP and TGP classifiers (we use 1000 particles, and they have 15,000 iterations), and only our approach will be feasible in on-line applications.
Discussion
We have reformulated regression trees as a dynamic model. This sequential characterization leads to an entirely new class of models which can be fit on-line; have a scale-free automatic prior specification; avoid the need for parameter sampling; and lead to robust and flexible prediction. We have shown empirically that our dynamic regression trees can provide superior performance and are less expensive than common alternatives.
In a key point, we have been able to define particles for filtering which contain only split rules and sufficient statistics, leading our sequential filtering to efficiently discard all partition models except those which are predicting well. Due to the size and complexity of potential tree space, this narrowing of posterior search is an essential aspect of our models’ success. In addition, restrictions on partition size allow us to make use of improper priors (for constant and linear leaves), which is not usually possible in particle inference.
The modeling and examples contained herein provide encouragement for further work under a strategy that looks to take advantage of sequential model characterizations. In particular, we hope to find that similar ideas on update mechanisms for covariate dependent model features will lead to insight about dynamic versions of other graphical structures.