Binary Space Partitioning Forests

Xuhui Fan, Bin Li, Scott Anthony Sisson

Introduction

Several machine learning methods, such as decision trees for regression and relational modelling for identifying interaction patterns, are concerned with space partitioning strategies to identify meaningful “blocks” in a product space. Models may be fitted to the data in each block, within which the data will exhibit certain types of homogeneity. These techniques have found application in relational modeling , community detection , collaborative filtering , and random forests – This space-partitioning strategy has shown its promising prospect in real-world applications. As a result, a number of structured space-partitioning priors have been developed, including the Mondrian process , the Rectangular Tiling process , the Ostomachion process , and the Binary Space Partitioning (BSP)-Tree process . Other strategies are also developed to complete the task, e.g., the Rectangular Bounding Process is recently proposed to use a bounding strategy to partition the space and it claims to obtain a parsimonious result.

Among the aforementioned approaches, the BSP-Tree process is an efficient way to partition the two-dimensional space. Instead of axis-aligned cuts adopted in most conventional approaches , the BSP-Tree process uses angled cuts to better describe the potential dependency between each dimension. The BSP-tree process is attractive because it is self-consistent, which ensures distributional invariance while restricting the process from a larger domain to a smaller one. However, as the BSP-Tree process is originally proposed for relational modelling, which only requires partitions on a two-dimensional space, it can hardly be extended to a dd-dimensional (d>2d>2) space. This restriction prohibits its applications to decision tree-style models, which usually consist of more than two dimensions of features. Unfortunately, we cannot straightforwardly extend the BSP-Tree process to a higher dimensional space because it would violate the self-consistency. It is possible to consider a proper extension that is able to retain the self-consistency property, however, for a dd-dimensional space, the hyperplane usually lies in the (d1)(d-1)-dimensional space and the measure of all potential cutting hyperplanes involves complicated integrals which would incur unacceptable calculations for a machine learning task.

In this work, we make the first endeavor to extend the domain of the BSP-Tree process to dd-dimensional (d>2d>2) space while still keeping its self-consistency. To simplify the process, we propose to generate a cutting hyperplane, which is assumed to be parallel to d2d-2 dimensions, to cut each node in the dd-dimensional BSP-tree. That is to say, each cutting hyperplane is allowed to have two degrees (dimensions) of freedom during the generative process. The current node, which is a convex polyhedron, is first projected onto a pair of dimensions – The pair is sampled in proportion to the perimeter of its projection onto the sampled pair of dimensions among all 1/2d(d1)1/2\cdot d(d-1) possible configurations. The subsequent cut on this node is then generated on the projected convex polygon through the same way of the BSP-Tree process and the rest dimensions of the cutting hyperplane are parallel to the other dimensions. This geometrically simple construction permits a flexible multi-dimensional extension to the BSP-tree process that provably retains the self-consistency property.

Our second contribution is to construct an ensemble of BSP-trees – the BSP-forest – to enable complex and flexible regression modelling. In contrast to Bayesian additive regression trees (BART) and the Mondrian Forest , which only implement node cuts in a single dimension to generate the tree structure, the BSP-forest uses two dimensions jointly to form a hyperplane cut in the feature space. As a result, the BSP-forest can achieve higher performance with a lower-depth hierarchical tree structure. In addition, because it consists of multiple BST-trees who select different pairs of dimensions to describe the data in a local region, the BSP-Forest is able to capture all-round pairwise dimensional dependence.

In the inference stage, an efficient Particle Gibbs sampler is developed to infer the BSP-Forest. Particularly, instead of cutting the entire node, we can further simplify the partitioning process by only conducting the hyperplane cut in the convex hull of the training data within the node to circumvent the complicated polyhedron related evaluation. Thanks to the self-consistency of the proposed extended BSP-Tree process, these two strategies lead to the same equilibrium distribution; we can thus largely simplify the sampling procedure and improve the efficiency.

The effectiveness of the proposed BSP-Forest is validated through an extensive study on the famous Friedman’s function and then five real-world data sets. Compared to its counterpart, the Mondrian Forest, the BSP-Forest can achieve similar performance with fewer cuts due to its flexibility. The BSP-Forest also outperforms other (Bayesian) regression forests on a number of real-world data sets.

Preliminary: The BSP-Tree Process

The BSP-tree process generates partitions, \boxplus, on an arbitrary two-dimensional complex polygon \square. The BSP-tree process is an almost surely right-continuous Markov jump process on (0,τ](0,\tau], where τ>0\tau>0 is a pre-fixed budget. Let t\boxplus_{t} denote the BSP-tree partition at time tt. t\boxplus_{t} is a hierarchical partition of \square, which can be represented as a triplet (,θ,u)(\boxplus,\boldsymbol{\theta},\boldsymbol{u}). \boxplus is a finite binary space partitioning tree, and θ=(θj)\boldsymbol{\theta}=(\theta_{j}) and u=(uj)\boldsymbol{u}=(\boldsymbol{u}_{j}) respectively specify, for each intermediate node jj of the tree, the orthogonal slope (θj\theta_{j}) for the cutting line and the cut position (uj\boldsymbol{u}_{j}) on the projected segment of the node.

The probability density function of θj\theta_{j} is proportional to the length of the projected segment l(θj)\boldsymbol{l}(\theta_{j}), and the cut position uj\boldsymbol{u}_{j} is uniformly distributed on l(θj)\boldsymbol{l}(\theta_{j}). t\boxplus_{t} represents a hierarchical partition of \square into the root node (ϵ=\square_{\epsilon}=\square; the base of the domain), intermediate nodes (j\square_{j}) and terminal nodes. Intermediate nodes j\square_{j} are generated from their parental node through a node split and are themselves split into two child nodes through the cutting line

Terminal nodes are the final generated blocks, which do not contain cutting lines.

Defining the time for the ll-th cut as t=τlt=\tau_{l}, the incremental cut time τlτl1\tau_{l}-\tau_{l-1} depends only on the partitions at time τl1\tau_{l-1}. Given an existing partition τl1\boxplus_{\tau_{l-1}}, the time to the next cut follows an Exponential distribution:

where PE(τl1(k))PE(\square_{\tau_{l-1}}^{(k)}) denotes the perimeter of the kk-th block in partition τl1\boxplus_{\tau_{l-1}}. Each cut divides one block (chosen with probability proportional to its perimeter) into two new blocks and forms a new partition. If the time index τl\tau_{l} of the new cut exceeds the budget τ\tau, the BSP-tree process terminates and returns the partition τl1\boxplus_{\tau_{l-1}} as the final realisation.

Self-consistency of the BSP-Tree process refers to that, when restricting the BSP-tree process on a convex polygon \square to any sub-region \triangle\subseteq\square, the resulting partitioning on the sub-region is distributed as if it is directly generated on \triangle through the BSP-tree’s generative process. This property can be usefully exploited in settings such as online learning and domain expansion. See for more detailed description of the BSP-tree process.

Extending the BSP-tree process to d𝑑d-dimensional space

In order to extend the domain of the BSP-Tree process to dd-dimensional (d>2d>2) space while still keeping its self-consistency, we consider a reduced generative process where each cutting hyperplane is only allowed to have two degrees (dimensions) of freedom. In this way, each potential cutting hyperplanes on k\square_{k} is assumed to be parallel to the rest d2d-2 dimensions, except the selected two.

Given the current partition τl1={τl1(k)}k=1l\boxplus_{\tau_{l-1}}=\{\square^{(k)}_{\tau_{l-1}}\}_{k=1}^{l} and τ\tau, the next cutting hyperplane is generated in the following steps (see the illustrations for Steps 1–4 in Figure 2):

Sample a candidate polyhedron (leaf node) ()\square^{(*)} from all the existing leaf nodes {τl1(k)}k=1l\{\square^{(k)}_{\tau_{l-1}}\}_{k=1}^{l} in proportion to {d1,d2PE(Πd1,d2(τl1(k)))}k=1l\left\{\sum_{d_{1},d_{2}}PE(\Pi_{d_{1},d_{2}}(\square^{(k)}_{\tau_{l-1}}))\right\}_{k=1}^{l}, where (d1,d2)(d_{1},d_{2}) denotes an arbitrary pair of dimensions from the dd dimensions, Πd1,d2()\Pi_{d_{1},d_{2}}(\square) denotes the projection of \square onto the dimensions of (d1,d2)(d_{1},d_{2}), and PE(Πd1,d2())PE(\Pi_{d_{1},d_{2}}(\square)) denotes the perimeter of the projection (i.e., a 2-dimensional polygon);

Sample a pair of free dimensions (d1(),d2())(d_{1}^{(*)},d_{2}^{(*)}) from all 1/2d(d1)1/2\cdot d(d-1) possible pairs in proportion to {PE(Πd1,d2(()))}(d1,d2)\left\{PE(\Pi_{d_{1},d_{2}}(\square^{(*)}))\right\}_{(d_{1},d_{2})};

On the projection Πd1(),d2()(())\Pi_{d_{1}^{(*)},d_{2}^{(*)}}(\square^{(*)}), sample a direction θ\theta from (0,π](0,\pi], where the probability density function is in proportion to the length of the line segment l(θ)\boldsymbol{l}(\theta), onto which Πd1(),d2()(())\Pi_{d_{1}^{(*)},d_{2}^{(*)}}(\square^{(*)}) is projected in the direction of θ\theta; and sample the cutting position u\boldsymbol{u} uniformly on the line segment l(θ)\boldsymbol{l}(\theta). The proposed cutting hyperplane is formed as the straight line passing through u\boldsymbol{u} and crossing through the projection Πd1(),d2()(())\Pi_{d_{1}^{(*)},d_{2}^{(*)}}(\square^{(*)}), orthogonal to l(θ)\boldsymbol{l}(\theta) in the dimensions of (d1(),d2())(d_{1}^{(*)},d_{2}^{(*)}) and parallel to the rest d2d-2 dimensionsGenerating a cutting line parameterized by θ\theta and u\boldsymbol{u} on the projection Πd1(),d2()(())\Pi_{d_{1}^{(*)},d_{2}^{(*)}}(\square^{(*)}) (two-dimensional polygon) follows the same sampling method used in the BSP-Tree process .;

Sample the incremental time for the new cut as (τlτl1)Exp(k=1l(i,j)DPE(Πdi,dj(τl1(k))))(\tau_{l}-\tau_{l-1})\sim\textrm{Exp}\left(\sum_{k=1}^{l}\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square_{\tau_{l-1}}^{(k)}))\right). If τl>τ\tau_{l}>\tau, reject the proposed cutting hyperplane and return {τl1(k)}k=1l\{\square^{(k)}_{\tau_{l-1}}\}_{k=1}^{l} as the final partition structure; otherwise accept the proposed cutting hyperplane, increase ll to l+1l+1 and go back to Step (1).

Sampling a two-dimensional pair (Step (2)) is the novel key step that helps extend the BSP-tree process to dd-dimensional spaces; all other steps are the natural and logical extensions of the generative process of the two-dimensional BSP-tree process.

Through the above generative process, the cutting hyperplane can be parameterised as H(k,(d1,d2),θ,u)={x(k)([xd1,xd2]u)(1;tanθ)=0}H(k,(d_{1},d_{2}),\theta,\boldsymbol{u})=\{\boldsymbol{x}\in\square^{(k^{*})}|([x_{d_{1}},x_{d_{2}}]-\boldsymbol{u})(1;\tan\theta)^{\top}=0\}, where kk^{*} denotes the index of the selected polygon (leaf node), xd1x_{d_{1}} denotes the d1d_{1}-th element of vector x\boldsymbol{x}, and u\boldsymbol{u} is a two-dimensional vector denoting the position on the dimensions of (d1,d2)(d_{1},d_{2}). The cutting hyperplane is parallel to all dimensions except d1d_{1} and d2d_{2} such that it is fully characterised on (d1,d2)(d_{1},d_{2}).

Moreover, the cutting hyperplane is only meaningful (i.e. it intersects with (k)\square^{(k^{*})}) if and only if it intersects with the projection Πd1,d2((k))\Pi_{d_{1},d_{2}}(\square^{(k^{*})}) on dimensions (d1,d2)(d_{1},d_{2}). As Πd1,d2((k))\Pi_{d_{1},d_{2}}(\square^{(k^{*})}) is a convex polygon, from Proposition 2 of , the measure of the hyperplane cuts on dimensions (d1,d2)(d_{1},d_{2}) is uniform over the perimeter of Πd1,d2((k))\Pi_{d_{1},d_{2}}(\square^{(k^{*})}). Thus, the measure of the hyperplane cuts for (k)\square^{(k^{*})} is uniform over the sum D|\mathcal{D}| of the perimeters of Πd1,d2((k))\Pi_{d_{1},d_{2}}(\square^{(k^{*})}) on all unique pairs of dimensions i.e. (i,j)DPE(Πdi,dj((k)))\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square^{(k^{*})})).

When d=2d=2, the above extended BSP-tree process reduces to the original two-dimensional version of . According to Propositions 1 and 2 in , the likelihood of the cutting hyperplane on the two dimensions (d1,d2)(d_{1}^{*},d_{2}^{*}) for (k)\square^{(k^{*})} is p(θ,ud1,d2,(k))=1PE(Πd1,d2((k)))p(\theta,\boldsymbol{u}|d_{1}^{*},d_{2}^{*},\square^{(k^{*})})=\frac{1}{PE(\Pi_{d^{*}_{1},d^{*}_{2}}(\square^{(k^{*})}))}. Extending this hyperplane to dd-dimensions, the likelihood is p(cut)=(i,j)DPE(Πdi,dj((k)))k=1l(i,j)DPE(Πdi,dj((k)))PE(Πd1,d2((k)))(i,j)DPE(Πdi,dj((k)))1PE(Πd1,d2((k)))=1k=1l(i,j)DPE(Πdi,dj((k)))p(cut)=\frac{\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square^{(k^{*})}))}{\sum_{k=1}^{l}\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square^{(k)}))}\cdot\frac{PE(\Pi_{d^{*}_{1},d^{*}_{2}}(\square^{(k^{*})}))}{\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square^{(k^{*})}))}\cdot\frac{1}{PE(\Pi_{d^{*}_{1},d^{*}_{2}}(\square^{(k^{*})}))}=\frac{1}{\sum_{k=1}^{l}\sum_{(i,j)\in\mathcal{D}}PE(\Pi_{d_{i},d_{j}}(\square^{(k)}))}. That is, all potential partitions are equally favoured and the hyperplane cut is uniformly distributed, without involving additional (prior) knowledge about the cutting hyperplane.

As a result of the particular technique of generating dd-dimensional cutting hyperplanes, the BSP-tree process retains the self-consistency property in dd-dimensions.

The extended BSP-tree process is self-consistent in dd-dimensional (d2d\geq 2) space, and maintains distributional invariance when restricting its domain from a convex polyhedron to a sub-domain.

That is, if the extended BSP-Tree process on a finite convex polyhedron \square is then restricted to a sub-region ,\triangle,\triangle\subseteq\square, then the resulting partitions restricted to \triangle are distributed as if they were directly generated on \triangle through the BSP-tree generative process. Subsequent application of the Kolmogorov Extension Theorem states that the self-consistency property of the BSP-tree process then enables it to be defined on infinite multidimensional space.

This useful property can strongly simplify the sampling procedure and improve algorithmic efficiency in dd-dimensional (d>2d>2) implementations, which would otherwise be extremely complicated to implement (see inference sections, below).

Binary space partitioning (BSP)-forests

We apply the extended BSP-tree process in a random forest style model. For the regression we attach mean intensity parameters, μ\mu, to the leaf nodes of each tree structure, and use the sum of the intensities from mm BSP-trees to predicting the unknown label. That is,

where μ(j)={μ1(j),,μk(j)(j)}\boldsymbol{\mu}^{(j)}=\{\mu^{(j)}_{1},\cdots,\mu^{(j)}_{k^{(j)}}\} denotes the set of mean parameters associated with the leaf nodes of the jj-th BSP-tree, k(j)k^{(j)} is the number of leaf nodes in the jj-th BSP-tree, g(x;(j),μ(j))=μk0(j)g\left(\boldsymbol{x};\boxplus^{(j)},\boldsymbol{\mu}^{(j)}\right)=\mu^{(j)}_{k_{0}} if x\boldsymbol{x} belongs to the k0k_{0}-th leaf node of the jj-th BSP-tree, and σ2\sigma^{2} denotes the observation variance.

Determination of node assignments for x\boldsymbol{x} follows the typical procedure used for a decision tree. To begin, x\boldsymbol{x} belongs to the root node of the BSP-tree (j)\boxplus^{(j)}. On any non-leaf node, the cutting hyperplane H(k,(d1,d2),θ,u)H(k,(d_{1},d_{2}),\theta,\boldsymbol{u}) would then be used to determine the child node that x\boldsymbol{x} belongs to: {(xd1,xd2)u)(1;tanθ)<0}\{(x_{d_{1}},x_{d_{2}})-\boldsymbol{u})(1;\tan\theta)^{\top}<0\} or {(xd1,xd2)u)(1;tanθ)>0}\{(x_{d_{1}},x_{d_{2}})-\boldsymbol{u})(1;\tan\theta)^{\top}>0\}. Through a sequence of such binary node assignments from the top to the bottom of the tree, x\boldsymbol{x} would finally fall into a single leaf node k0k_{0} of (j)\boxplus^{(j)}, with gg then allocating the intensity μk0(j)\mu^{(j)}_{k_{0}} to the regression.

In common with other regression forests, each individual tree only contributes proportionally 1m\frac{1}{m} of the model for label yy, which prevents any one of the trees dominating the prediction. Further, because the ensemble comprises multiple BST-trees, each potentially selecting different dimensional pairs to describe the data in local regions, the BSP-forest is able to capture all pair-wise dependence in dd-dimensions. Similarly, the potentially different depths of each tree allows the BSP-forest to reflect different granularities of interactions between dimensions.

Each regression tree in the BSP-forest presents a hierarchical partition structure on dd-dimensional space. From this perspective, the sum-of-trees forest mechanism provides a compositional “smoothing” of the partition structure for local regions of the space, thereby greatly increasing the flexibility of the regression model. Given mm, a BSP-forest is determined by {((j),μ(j))}j=1m\{(\boxplus^{(j)},\boldsymbol{\mu}^{(j)})\}_{j=1}^{m}, which includes the BSP-tree structures and their associated leaf node mean parameters, and the observational variance parameter σ2\sigma^{2}.

We specify the prior distributions for σ2\sigma^{2}, the number of trees in the BSP-forest mm, and the mean parameters {μ(j)}j=1m\{\boldsymbol{\mu}^{(j)}\}_{j=1}^{m} for all mm trees {(j)}j=1m\{\boxplus^{(j)}\}_{j=1}^{m} following . Namely: (1) σ2IGamma(1.5,λ)\sigma^{2}\sim\textrm{IGamma}(1.5,\lambda) where, given an estimate of the sample variance σ^2\hat{\sigma}^{2}, λ\lambda is set to satisfy the condition F(σ^2;1.5,λ)=0.9F(\hat{\sigma}^{2};1.5,\lambda)=0.9, where F()F(\cdot) is the Inverse-Gamma c.d.f.; (2) As the labels yy are typically standardised, the prior for each mean parameter is specified as μN(0,12m)\mu\sim N\left(0,\frac{1}{2\sqrt{m}}\right); (3) While a prior for mm can be specified, and inference performed by a reversible-jump type algorithm , for simplicity we fix m=50m=50 which we find performs well in practice.

Comparison with BART and MF:

The BSP-forest is directly related to two popular Bayesian regression-tree models: Bayesian additive regression trees (BART) and the Mondrian forest (MF). The key differences and advantage here is that the BSP-forest uses both angled cuts, and hyperplane cuts constructed from two dimensions, rather than the one-dimensional, axis-aligned cuts in both BART and MF. This clearly provides a more flexible way to model the observed data, and as a result, lower tree depth is required to obtain the same or even improved modelling capability.

2 Inference

Posterior inference for the BSP-forest is implemented using Markov chain Monte Carlo (MCMC). The joint distribution of all parameters, including the partition structures of mm extended BSP-trees {(j)}j=1m\{\boxplus^{(j)}\}_{j=1}^{m}, the leaf-node mean parameters of each tree {μ(j)}j=1m\{\boldsymbol{\mu}^{(j)}\}_{j=1}^{m}, and the observational variance σ2\sigma^{2} can be written as:

where Y={y1,,yN}Y=\{y_{1},\ldots,y_{N}\} and X={x1,,xN}\boldsymbol{X}=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}\}. Algorithm 1 outlines the MCMC algorithm, which iteratively samples the variance σ2\sigma^{2} (Line 3) and each BSP-tree structure (j)\boxplus^{(j)} with mean parameters μ(j)\boldsymbol{\mu}^{(j)}) (Line 5).

Through conjugacy, the full posterior conditional distribution of σ2\sigma^{2} is

where Ess=i=1N(yij=1mg(xi;(j),μ(j)))2Ess=\sum_{i=1}^{N}\left(y_{i}-\sum_{j=1}^{m}g(\boldsymbol{x}_{i};\boxplus^{(j)},\boldsymbol{\mu}^{(j)})\right)^{2}.

We use the Conditional-Sequential Monte Carlo (C-SMC) sampler to infer the partition structures of all the mm extended BSP-trees {(j)}j=1m\{\boxplus^{(j)}\}_{j=1}^{m} and the mean parameters {μ(j)}j=1m\{\boldsymbol{\mu}^{(j)}\}_{j=1}^{m}. There is a key difference between our sampler and that used in : These previous works take node splitting events on the time line as the dimensions of the sampled variable. The particles at the same step are under different budget constraints and the target distribution of the C-SMC samplers may not be well mixed. In our implementation, we adopt fixed intervals {(τs,τs+1)}s=0S\{(\tau_{s},\tau_{s+1})\}_{s=0}^{S} (τ0=0,τS+1=τ\tau_{0}=0,\tau_{S+1}=\tau) on the time line as the dimensions of the variable sampled by the C-SMC sampler. \forall step ss, all the particles are under the same budget constraints (0,τs](0,\tau_{s}] and each dimension might involve one, more than one, or even no splitting events. This helps to fix the number of “dimensions” in C-SMC and improve the efficiency of the sampler. Algorithm 2 shows the detail strategy to infer the tree structure in the (t+1)(t+1)-th iteration for the jj-th BSP-tree.

Implementing cutting hyperplane on convex hull:

Cutting the full dd-dimensional polyhedron generated from a series of cutting hyperplanes on the original domain, is both a challenging and unnecessary task. This is because: (1) complete indexing of this polyhedron requires extensive geometric calculations, including calculating the intersection of multiple hyperplanes, deciding if two hyperplanes intersect in the existence of other hyperplanes, and listing all the vertices of the polyhedron; (2) in some cases, the cutting hyperplane occurs outside the convex hull (i.e. the minimum convex polyhedron that contains all the points in this block), it does not partition the available data and thereby influence the regression. In practice, while data usually lies in a small portion of the feature space, implementing cutting hyperplanes directly on this polyhedron is particularly inefficient.

To address these issues, we implement the hyperplane cut on the convex hull only. The projection of the convex hull on any two dimensions can be obtained by simply slicing the x\boldsymbol{x} coordinates on these two dimensions and then using a conventional convex hull detection algorithm (such as the Graham Scan algorithm . Algorithm 3 describes our approach to generate the hyperplane cut in the convex hull and Fig. 3 visualises the convex hull projection in 33-dimensional space.

Due to the self-consistency property of the (extended) BSP-tree process, we conveniently have that:

The hyperplane restricted on the convex hull is distributed the same as if we first partition on the “full” polyhedron and then restrict attention to the convex hull. Both of these two methods lead to the idential equilibrium distribution in the MCMC algorithms.

Computational cost

The computational cost of the BSP-Forest is almost the same as the batched Mondrian Forest, except that the number of candidate features (used to generate a hyperplane) increases from O(d)\mathcal{O}(d) to O(d2)\mathcal{O}(d^{2}).

Related Work

Random forests have been proposed for decades and the volume of literature may be too huge to be reviewed here. The interested readers can refer to for a recent review. For the structure of a tree in the forest, the node splitting in a binary tree usually consists of two steps: (1) choose the most suitable dimension to cut; (2) find the best cutting position on the chosen dimension. A widely adopted strategy is to optimize some information gain related criterion in a greedy way. In terms of the decision tree, a random forest uses the sum-of-trees strategy to further improve the model capability. Some additional strategies to decorrelate the trees include bagging and subsampling the data set.

Here we use the Breimain-Random Forest and Gradient Boosting regressor as two representative works. The Breimain-RF adopts the bagging strategy and chooses a subset of the dd dimensions for each tree. The Gradient Boosting regressor uses the boosting strategy, which is to propose regression trees to cater for the negative gradient of the loss function at the current step.

For the Bayesian decision trees, Bayesian Additive Regression Trees (BART) and the Mondrian Forest are two representative counterparts of the proposed BSP-Forest. BART assigns probability distributions to the structure variables of all the trees, while MF uses the Mondrian process as the prior of the tree structure for the forest. In contrast to these two methods which merely select one dimension to conduct node cut to generate the tree structure, the BSP-Forest adopts two dimensions together to form a hyperplane cut in the feature space.

Multivariate decision trees (single-tree model) may be the closest one to the proposed BSP-tree in term of the inter-dimensional dependence. Its less popularity might be due to its high computational cost in finding an optimized cutting hyperplane. Compared to our BSP-tree, the lack of prior information in the multivariate decision tree requires more regularization techniques.

Experiments

We first evaluate the performance of the BSP-Forest on the Friedman’s function . In this test setting, each data point x\boldsymbol{x} is generated from the uniform distribution, while its label yy takes the following form:

where ϵN(0,σ2),σ2=1\epsilon\sim\mathcal{N}(0,\sigma^{2}),\sigma^{2}=1 denotes the white noise effect and xdx_{d} denotes the dd-th dimension of x\boldsymbol{x}. The Friedman’s function consists of two nonlinear terms, two linear terms and an interaction term between the dimensions.

We begin with a simple application on posterior mean estimates of the BSP-Forest. The number of dimensions is set to 1010, such that f(x)f(x) considers only the first 55 dimensions of the data points and the rest five dimensions are irrelevant; the number of data points NN is set to 300300. The results in Figure 4 show the estimated f^(x)\hat{f}(x) against the groundtruth f(x)f(x) on the training and test label sets. There is a vertical line on each point to indicate the 90%90\% posterior confidence interval. We can see that, on the training data set, the predicted values f^(x)\hat{f}(x) correlate very well with the true values f(x)f(x), and the confidence intervals tend to cover the true values as well. On the test data set, one can observe a slight degradation of the correlation and wider intervals, which indicate a larger variance estimation of f(x)f(x) at new xx values.

Partial Dependence

Partial dependence is commonly used in Statistics to show the effect of adding features to a model. Figure 5 shows the plots of points and interval estimates of the partial dependence functions for x1,,x10x_{1},\cdots,x_{10} from the MCMC samplers. The nonzero marginal effects of x1,,x5x_{1},\cdots,x_{5} and the zero marginal effects of x6,,x10x_{6},\cdots,x_{10} seem to be completely consistent with the form of f(x)f(x).

Dimension Usage

The BSP-Forest can also be used to identify the dependent dimensions that are most correlated to the function f(x)f(x). This can be completed by recording the pair of dimensions used for each regression tree and count the times of involvements of these dimensions. Figure 6 plots the proportions of x1,,x10x_{1},\cdots,x_{10} for the number of trees m=10,20,30,50,100m=10,20,30,50,100. As mm gets smaller, the fitted sum-of-trees models increasingly incorporate only those dimensions that are critical to explain the label yy. Without making use of any assumptions or information about the actual function, the BSP-Forest has exactly identified the underlying dimensions that determine f(x)f(x).

Performance Evaluation w.r.t. Budget

For the Mondrian Forest and the BSP-Forest, the parameter determines the expected depth of the BSP-tree, the budget λ\lambda, has more impact on their performance. Figures 7 and 8 show the Root Mean Absolute Error (RMAE) and the average number of cuts for the BSP-Forest and the Mondrian Forest, while budget is taking different values from 0.40.4 to 1.41.4. We can see that the BSP-Forest can obtain better RMAE performance on all budget values. Also, the number of cuts in the BSP-Forest is always smaller than that in the Mondrian Forest. The efficiency in the number of cuts is consistent with our expectation on the BSP-Forest. Although the curve seems quite flat across values of λ\lambda considered, the reason may be that the average number of cuts does not change much: it only changes from 4 to 5 in Figure 8, as the budget varies from 0.4 to 1.4.

2 Real-world Datasets

The performance of the BSP-Forest is compared with other tree-based models on a number of real-world data sets. The comparison methods we include Bremain’s Random Forest , Gradient Boosting regressor , Bayesian Additive Regression Trees (BART) , batch version of the Mondrian Forest (MF) . The implementations of Bremain’s Random Forest and Gradient Boosting regressor are imported from the Python module of Scikit-Learn . BART is implemented through the R package of BART . We implement the Mondrian Forest through the same strategy as the BSP-Forest here.

Conclusion

This paper makes the first endeavor to extend the BSP-Tree process to a dd-dimensional (d>2d>2) space. By designing a subtle strategy to only sample two free dimensions each time from the space, the extended BSP-Tree process can retain the essential self-consistency property. We further take the extended BSP-trees to construct the BSP-Forest for regression. Compared to other (Bayesian) regression forests, the BSP-Forest can perform best due to its flexibility.

Acknowledgements

Xuhui Fan and Scott A. Sisson are supported by the Australian Research Council through the Australian Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS, CE140100049), and Scott A. Sisson through the Discovery Project Scheme (DP160102544). Bin Li is supported by Fudan University Startup Research Grant and Shanghai Municipal Science & Technology Commission (16JC1420401).

References