Bayesian Regression Tree Ensembles that Adapt to Smoothness and Sparsity

Antonio Ricardo Linero, Yun Yang

Introduction

Consider a nonparametric regression model $Y=f_{0}(X)+\epsilon$ with response $Y$ , $X\in^{p}$ a $p$ -dimensional predictor, $f_{0}$ an unknown regression function of interest, and Gaussian noise $\epsilon\sim\operatorname{Normal}(0,\sigma^{2})$ . Suppose we observe $\mathcal{D}=((X_{1},Y_{1}),\ldots,(X_{n},Y_{n}))$ consisting of independent and identically distributed copies of $(X,Y)$ . A popular approach to estimating $f_{0}(x)$ is to form an ensemble of decision trees; common techniques include boosted decision trees (Freund et al.,, 1999) and random forests (Breiman,, 2001). Bayesian tree-based models, such as the Bayesian additive regression trees (BART) model (Chipman et al.,, 2010), have recently attracted interest from practitioners due to their excellent empirical performance and natural uncertainty quantification; BART has been applied in a wide variety of contexts such as nonparametric function estimation with variable selection (Bleich et al.,, 2014; Linero,, 2016), analysis of loglinear models (Murray,, 2017), and survival analysis (Sparapani et al.,, 2016). Additionally, BART is consistently among the best performing methodologies in the Atlantic Causal Inference Conference Data Analysis Challenge (Hill,, 2011, 2016; Hahn et al.,, 2017; Dorie et al.,, 2017).

Despite the recent popularity of Bayesian tree-based models, they suffer from several drawbacks. First, in the regression setting, estimators based on decision trees are not capable of adapting to higher smoothness levels exhibited in $f_{0}$ due to their piecewise-constant nature. Second, as illustrated by Linero, (2016), they suffer from the curse of dimensionality — their prediction performance deteriorates as the dimensionality $p$ increases. Last but not least, very little theoretical work has been done for understanding large sample properties of Bayesian tree-based approaches from a frequentist perspective.

In this article, we propose a new method, called soft Bayesian additive regression trees (SBART) which improves both practically and theoretically upon existing Bayesian sum-of-trees models. To address the first aforementioned drawback, we employ a ensemble of carefully designed “soft” decision trees as building blocks in the BART model, and show in both empirical studies and theoretical investigation that the resulting Bayesian approach can adapt to the unknown smoothness level of the true regression function $f_{0}$ — the corresponding posterior distribution achieves the minimax rate $n^{-\alpha/(2\alpha+p)}$ of contraction up to logarithmic terms (Ghosal et al.,, 2000) when $f_{0}\in\mathcal{C}^{\alpha,R}(^{d})$ where $C^{\alpha,R}(^{d})$ denotes a Hölder space with smoothness index $\alpha$ and radius $R$ .

To overcome the curse of dimensionality, we specify sparsity inducing priors (Linero,, 2016) for the splitting rule probabilities in the soft decision trees. We show that SBART takes advantage of structural sparsity in the true regression function $f_{0}$ — when $f_{0}$ only depends on $d\ll p$ predictors and is $\alpha$ -Hölder smooth, the resulting posterior distribution contracts towards the truth at a rate of $n^{-\alpha/(2\alpha+d)}+\sqrt{n^{-1}d\log p}$ up to logarithmic terms, which is near minimax-optimal even in the high-dimensional setting where $p$ grows nearly exponentially fast in $n$ (Yang and Tokdar,, 2015). Furthermore, due to the additive nature of sum-of-trees based models, we show that SBART can also adapt to low-order non-linear interactions: if $f_{0}$ can be decomposed into many low dimensional pieces $f_{0}=\sum_{v=1}^{V}f_{0v}$ , where each additive component $f_{0v}$ is $d_{v}$ -sparse and $\alpha_{v}$ -smooth, then SBART also achieves a near-minimax rate of posterior contraction. Compared to the rate for the general sparse case, which allows at most $o(\log n)$ many active predictors for consistency, the rate for additive structures potentially allows $o(n^{\beta})$ many predictors for some $\beta\in(0,1)$ ; this partly explains the empirical success of Bayesian sum-of-tree approaches, as many real-world phenomena can be explained in terms of a small number of low-order interactions.

Our proofs involve a key lemma that links sum-of-tree type estimators with kernel type estimators. Unlike frequentist kernel type estimators that require prior knowledge on the smoothness level of $f_{0}$ for choosing a smoothness matching kernel, Bayesian sum-of-tree based methods are adaptive, requiring no prior knowledge of the smoothness levels $\{\alpha_{v}\}$ , number of additive components $V$ , or degree of lower-order interactions $d_{v}$ , while still attaining near-minimax rates even under the high-dimensional setting. Practically, SBART can be implemented by making minimal modifications to existing strategies for fitting Bayesian tree-based models: the sparsity-inducing prior uses conditionally-conjugate Dirichlet priors which can be easily accommodated during Gibbs sampling, while replacing the usual decision trees with soft decision trees requires minor changes to the backfitting algorithm typically used with BART.

There has been a recent surge of interest in the theoretical properties of BART-type models. While our work was under review we learned that, in essentially simultaneous work, Rockova and van der Pas, (2017) established similar posterior contraction rates for a particularly designed BART prior, using a so-called “spike-and-tree” prior to allow for the ensemble to adapt to sparsity. In particular, they show that a single deep decision tree can approximate any function with smoothness level $\alpha\leq 1$ , which is then divided among trees with smaller depth. Our theory instead relies on linking sum-of-tree type estimators with kernel type estimators, which only need shallow trees and motivate the usage of soft-decision trees. Practically, the most relevant difference is that our SBART prior allows for adaptation to the smoothness level even when $\alpha>1$ , whereas the use of piecewise-constant basis functions in traditional BART models only allows for adaptation to functions which are at-most Lipschitz-smooth ( $\alpha\leq 1$ ). An additional difference is that we focus on establishing concentration results for the fractional posterior, which allows for less restrictive assumptions about our choice of prior; in our supplementary material, we also provide concentration results for the usual posterior, under more stringent conditions. In even more recent work, Alaa and van der Schaar, (2017) establish consistency results for BART-type priors for estimating individual treatment effects in causal inference settings, and also noted the limitation of BART in adapting to a smoothness order higher than $\alpha=1$ .

The soft decision trees we use are similar in spirit to those used by Irsoy et al., (2012), who considered a soft variant of the CART algorithm. Our work differs in that (i) our trees are not learned in a greedy fashion, but instead by extending the Bayesian backfitting approach of Chipman et al., (2010), (ii) we consider an ensemble of soft trees rather than a single tree, (iii) we use a different parameterization of the gating function which does not consider oblique decision boundaries, and (iv) we establish theoretical guarantees for our approach.

The rest of the paper is organized as follows. In Section 2, we develop our SBART prior. In Section 3 we state our theoretical results. In Section 4, we illustrate the methodology on both simulated and real datasets. We finish in Section 5 with a discussion. Proofs are deferred to the appendix. In supplementary material, we provide additional computational details, timing results, and additional theoretical results extending our fractional posterior results to the usual posterior.

Soft Bayesian sum of trees models

We begin by describing the usual “hard” decision tree prior used in BART. We model $f_{0}(x)$ as the realization of a random function

Following Chipman et al., (2010), we endow $\mathcal{T}_{t}$ with a branching process prior. The branching process begins with a root node of depth $k=0$ . For $k=0,1,2,\ldots$ , each node at depth $k$ is non-terminal with probability $q(k)=\gamma(1+k)^{-\beta}$ where $\gamma>0$ and $\beta>0$ are hyperparameters controlling the shape of the trees. It is easy to check using elementary branching process theory that this process terminates almost surely provided that $\beta>0$ (Athreya and Ney,, 2004).

2 Smoothness adaptation

A well-known feature of decision trees is their lack of smoothness. Single-tree algorithms such as the CART algorithm (Hastie et al.,, 2009, Chapter 9.2) result in step-function estimates, suggesting that they should not be capable of efficiently estimating smooth functions (Györfi et al.,, 2006). Methods based on ensembles of decision trees average over many distinct partitions of the predictor space, resulting in some degree of smoothing. Even with this averaging, the estimated regression functions are not smooth. Heuristically, we note that under our BART specification the function $f$ is not differentiable in quadratic mean. Indeed, with trees of depth $1$ , $p=1$ , and cutpoints $C_{b}\sim G$ , simple calculations give $E\{(f(x+\delta)-f(x))^{2}\}\propto\delta G^{\prime}(x)+o(\delta)$ . Consequently, BART ensembles with a large number of trees resemble nowhere-differentiable continuous functions, and in the limit as $T\to\infty$ the BART prior converges to a nowhere-differentiable Gaussian process. This heuristic argument suggests that BART can only adapt to functions with Hölder smoothness level no greater than $\alpha=1$ (Lipschitz functions).

Figure 2 compares the fit of BART to SBART with $\tau_{b}\equiv 0.1$ . We see that when $T=1$ trees are used we require a large number of leaf nodes to model relatively simple functions. At a large scale, we see that the BART fit resembles a nowhere-differentiable continuous function. While an improvement, the estimate from BART is still not sufficiently smooth and exhibits large fluctuations.

The fit of the soft decision tree in Figure 2 by comparison is infinitely differentiable and requires only a small number of parameters. Consequently, we obtain a fit with lower variance and negligible bias. An attractive feature of soft decision trees exhibited in Figure 2 is their ability to approximate linear relationships. In this case, even when $T=1$ , we recover the smooth functions almost exactly.

3 Prior specification and implementation

Following Chipman et al., (2010), in this section we develop a “default” SBART prior. The goal is to develop a prior which can be used routinely, without requiring the user to specify any hyperparameters; while the choices below may appear ad-hoc, they have been found to work remarkably well across a wide range of datasets. After adopting the following default prior, users may wish to further tune the number of trees $T$ , the parameter $r$ in the prior for $\tau_{b}$ , or use additional information regarding the targeted sparsity level. We stress, however, that a reasonable baseline level of performance is obtained without the need to do any further tuning.

Following Chipman et al., (2010), we recommend scaling $Y$ so that most/all of the responses fall in the interval $[-0.5,0.5]$ . We also preprocess $X_{j}$ so that $X_{j}\sim\operatorname{Uniform}(0,1)$ approximately by applying a quantile normalization in which each $X_{ij}$ is mapped to its rank, with $\min X_{ij}=1$ and $\max X_{ij}=n$ . We then apply a linear transformation so that the values of $X_{ij}$ are in $ $. The goal of this preprocessing of$ X $is to make the prior invariant under monotone transformations of$ X$, which is a highly desirable property of the original default BART model.

We now describe our default prior for the bandwidths $\tau_{b}$ and the splitting proportions $s=(s_{1},\ldots,s_{p})$ . We use a sparsity-inducing Dirichlet prior,

Our theoretical results require $\xi>1$ , however in practice we find that setting $\xi=1$ works adequately. This Dirichlet prior for $s$ was introduced by Linero, (2016); throughout, we refer to the BART model with (3) as Dirichlet additive regression trees (DART) to contrast with BART when no such sparsity-inducing prior is used. The parameter $a$ controls the expected amount of sparsity in $f$ . Conditional on there being $B$ branches in the ensemble, the number of predictors included in the ensemble is converges in distribution to $1+Z$ where $Z\sim\operatorname{Poisson}(\theta)$ and $\theta=a\sum_{i=1}^{B-1}(a+i)^{-1}$ (Linero,, 2016) when $\xi=1$ . When prior information is available on the sparsity of $f_{0}$ , we can choose $a$ to match the targeted amount of sparsity. By default we use a compound Gamma prior, $a/(a+\lambda_{a})\sim\operatorname{Be}(a_{a},b_{a}),$ with $a_{a}=0.5,b_{a}=1,\lambda_{a}=p$ . This prior attempts to strike a balance between the sparse and non-sparse settings by having an infinite density at , median $\alpha=p/4$ , and an infinite mean.

There are several possibilities for choosing the bandwidth $\tau_{b}$ . In preliminary work, using tree-specific $\tau_{t}$ ’s shared across branches in a fixed tree worked well, with $\tau_{t}\sim\operatorname{Exponential}(r)$ where $E(\tau_{t})=r$ . Our illustrations use $r=0.1$ , which, as shown in Figure 3, gives a wide range of possible gating functions. An interesting feature of the sampled gating functions is that both approximate step functions and approximately linear functions are supported.

We give $\sigma={\operatorname{Var}(\epsilon)}^{\nicefrac{{1}}{{2}}}$ a half-Cauchy prior, $\sigma\sim\operatorname{Cauchy}_{+}(0,\widehat{\sigma})$ . Again following Chipman et al., (2010), $\widehat{\sigma}$ is an estimate of $\sigma$ based on the data. We use an estimate $\widehat{\sigma}_{\text{lasso}}$ of $\sigma$ obtained by fitting the lasso using the glmnet package in R.

The model has hyperparameters $(\sigma_{\mu}^{2},\gamma,\beta,T)$ . In preliminary work, we did not have success placing priors on $\gamma$ and $\beta$ , and instead fix $\gamma=0.95$ and $\beta=2$ (Chipman et al.,, 2010). We give $\sigma_{\mu}$ a half-Cauchy prior, $\sigma_{\mu}\sim\operatorname{Cauchy}_{+}(0,0.25)$ , where $0.25$ is chosen so that $\sigma_{\mu}$ has median equal to the default value recommended by Chipman et al., (2010).

An important remaining specification is the number of trees $T$ to include in the ensemble. The theoretical results we establish in Section 3 make use of a prior distribution on $T$ ; however, our attempts to incorporate a prior on $T$ using reversible jump methods (Green,, 1995) resulted in poor mixing of the associated MCMC algorithms. Generally, we have found that fixing $T$ at a default value of $T=50$ or $T=200$ is sufficient to attain good performance on most datasets. Tuning $T$ further often provides a modest increase in performance, but may be worth the effort on some datasets (see Section 4.3).

There are a number of possible options for tuning $T$ , such as approximate leave-one-out cross validation using Pareto-smoothed importance sampling (PSIS-LOO) (Vehtari et al.,, 2015), maximizing an approximate marginal likelihood obtained using (say) WBIC (Watanabe,, 2013), or $K$ -fold cross validation as recommended by Chipman et al., (2010). The advantage of WBIC and PSIS-LOO is that they require fitting the model only once for each value of $T$ . In practice, we have found that approximations such as WBIC and PSIS-LOO are unreliable, with PSIS-LOO prone to overfitting and WBIC requiring potentially very long chains to estimate. Figure 4 displays the values of PSIS-LOO, a WBIC approximation of the negative marginal likelihood of $T$ (Watanabe,, 2013), and $5$ -fold cross validation, when used to select $T$ for a replicate of the illustration in Section 4.1 with $p=100$ predictors. Both WBIC and cross validation select $T=10$ , which also minimizes the root mean squared error $\int(f_{0}(x)-\widehat{f}(x))^{2}\ dx$ . Resource permitting, we have found $K$ -fold cross-validation to be the most reliable method for selecting $T$ .

As a default we use the following priors throughout the manuscript.

4 Variable grouping prior

The sparsity-inducing prior (3) can be extended to allow penalization of groups of predictors simultaneously, in a manner similar to the group lasso (Yuan and Lin,, 2006). Suppose that the predictors can be divided into $M$ groups of size $P_{m}$ . We set

We primarily use the grouping prior to allow for the inclusion of categorical predictors through the inclusion of dummy variables. This is an extension of the approach used by the bartMachine package in R. An alternative approach to the inclusion of categorical predictors, used in the BayesTree package, is to construct decision rules based on a dummy variable $Z_{j}=I(X_{j}\in A_{b})$ where $A_{b}$ is a random subset of the possible values of predictor $j$ . In our illustrations, we let $\omega\to\infty$ so that $v_{mk}=P_{m}^{-1}$ and set $a/(a+M)\sim\operatorname{Be}(0.5,1)$ .

5 Posterior computation

We use the Bayesian backfitting approach described by Chipman et al., (2010) to construct a Markov chain Monte Carlo (MCMC) algorithm to sample approximately from the posterior.

Within Algorithm 1, $\mathcal{T}_{t}$ is updated using a Metropolis-Hastings proposal. Proposals consist of one of three possible moves: Birth, which turns a leaf node into a branch node; Death, which turns a branch node into a leaf node; and Change, which changes the decision rule of a branch $b$ . A detailed description of these moves, and their associated transition probabilities, is given in the supplementary materials.

Constructing efficient updates for $\mathcal{T}_{t}$ and $\tau_{t}$ requires marginalizing over $\mathcal{M}_{t}$ . Because the errors are assumed Gaussian, this marginalization can be carried out in closed form. The main computational drawback of SBART relative to BART lies in this marginalization, as SBART requires computing a likelihood contribution for each leaf-observation pair, whereas BART only requires a single likelihood contribution for each tree. Hence, if the trees are deep, BART will be substantially faster. By the construction of the prior, trees generally are not deep enough for this difference to be prohibitive.

The Dirichlet prior $s\sim\mathcal{D}(a/p^{\xi},\ldots,a/p^{\xi})$ allows for a straight-forward Gibbs sampling update, with the full conditional given by $s\sim\mathcal{D}(a/p^{\xi}+c_{1},\ldots,a/p^{\xi}+c_{p}),$ where $c_{j}=\#\{b:\text{branch$ b $splits on predictor$ j $}\}$ . When the grouping prior (5) is used we also obtain simple Gibbs sampling updates, with $u\sim\mathcal{D}(a/M+z_{1},\ldots,a/M+z_{M})$ and $v_{m}\sim\mathcal{D}(\omega/P_{m}+c_{m1},\ldots,\omega/P_{m}+c_{mP_{m}})$ , where $z_{m}=\#\{b:\text{branch$ b $splits on a predictor in group$ m $}\}$ and $c_{mk}=\#\{b:\text{branch$ b $splits on predictor$ mk $}\}$ .

Theoretical results

We study the theoretical properties of the SBART procedure from a frequentist perspective by assuming that $(Y_{1},Y_{2},\ldots,Y_{n})$ are generated from the model $Y_{i}=f_{0}(X_{i})+\epsilon_{i}$ with some true unknown regression function $f_{0}$ . We assume that $f_{0}$ is a function over $^{p}$ . We prove posterior consistency results when $f_{0}$ is a member of certain Hölder spaces. Let $\mathcal{C}^{\alpha}(^{p})$ denote the Hölder space with smoothness index $\alpha$ , i.e., the space of functions on $^{p}$ with bounded partial derivatives up-to order $\beta$ , where $\beta$ is the largest integer strictly less than $\alpha$ and such that the partial derivatives of order $\beta$ are Hölder-continuous of order $\alpha-\beta$ . Let $\mathcal{C}^{\alpha,R}(^{p})=\{f\in\mathcal{C}^{\alpha}(^{p}):\|f\|_{\alpha}\leq R\}$ denote the Hölder-ball of radius $R$ with respect to the Hölder norm $\|f\|_{\alpha}$ (see Ghosal and van der Vaart,, 2017, Appendix C).

We consider the posterior convergence of the Bayesian fractional posterior obtained by raising the likelihood function by a factor $\eta\in(0,1]$ in the Bayes formula

where $\Pi$ denotes the prior probability measure over $\mathcal{L}^{2}(^{p})$ , the $\mathcal{L}^{2}$ space over $^{p}$ . Fractional posteriors have gained renewed attention in Bayesian statistics due to their robustness to model misspecification (Grünwald,, 2012; Miller and Dunson,, 2018). According to Walker and Hjort, (2001), the fractional posterior can be viewed as combining the original likelihood function with a data-dependent prior that is divided by a portion of the likelihood. This data dependent reweighting in the prior helps to prevent from possible inconsistencies by reducing the weights of those parameter values that “track the data too closely”. Additionally, the fractional posterior with $\eta<1$ permits much simpler theoretical analyses. Note that $\eta=1$ corresponds to the usual posterior distribution. Abusing notation slightly, we will also use $\Pi$ to denote the prior probability measure over the parameters $(\mathcal{T}_{t},\mathcal{M}_{t})$ and any hyperparameters in the model. Our goal is to find a sequence $\{\varepsilon_{n}:\,n\geq 1\}$ such that, for a sufficiently large constant $M$ and fixed $\eta$ ,

In this section, we focus on establishing (7) for $\eta<1$ . The benefit of considering $\eta<1$ is that this allows us to bypass verifying technical conditions regarding the effective support of the prior and the existence of a certain sieve (Ghosal et al.,, 2000, 2007), which allows for (7) to be established under very weak conditions. In the supplementary material we establish posterior consistency for $\eta=1$ under more stringent conditions on the prior.

The main condition governing the posterior contraction rate is that the prior $\Pi$ is sufficiently “thick” at $f_{0}$ , in the sense that there exists a $C>0$ such that

where $B_{\varepsilon}(f_{0})$ denotes an $\varepsilon$ -Kullback-Leibler (KL) neighborhood of the truth

$\displaystyle\int_{-\infty}^{+\infty}K(x)\,dx>0$ and for any positive integer $m$ , $\displaystyle\int_{-\infty}^{+\infty}|x|^{m}\,|K(x)|\,dx<\infty$ .

Suppose Assumption G holds for the gating function $\psi$ . For any function $f_{0}\in\mathcal{C}^{\alpha,R}(^{d})$ , any $\epsilon>0$ , and $\tau>0$ , there exists a sum of soft decision trees with a single bandwidth $\tau_{b}\equiv\tau$ for all branches,

where $C_{1}$ and $D_{1}$ are constants independent of $(\varepsilon,\tau)$ .

With the help of this lemma, we establish (8) as a direct consequence of the following result, where we make the following assumptions on the prior distribution.

Assumption P (prior conditions):

The prior density $\pi_{\tau}$ of tree specific bandwidth parameters $\tau_{t}$ satisfies $\pi_{\tau}(\tau)\geq a_{1}\tau^{a_{2}}$ for some constants $a_{1},a_{2}>0$ for all sufficiently small $\tau$ .

The prior on the splitting proportion vector $s$ is $\mathcal{D}(a/p^{\xi},\ldots,a/p^{\xi})$ for some $\xi>1$ and $a>0$ .

$\Pi(D_{t}=k)>0$ for $k=0,1,\ldots,2d$ , where $D_{t}$ denotes the depth of tree $t$ and $d$ is as in Theorem 2.

In the supplementary material we show that under extra technical conditions on the prior, the usual posterior (fractional posterior with $\eta=1$ ) can attain the same rate of convergence as in Theorem 3 below. These extra conditions are needed to control the size of the effective support of the prior and show the existence of a certain sieve (Ghosal et al.,, 2000). In particular, Assumption P only needs certain lower bounds on the prior density (mass) functions, while Assumption SP in the supplementary material requires some upper bound on the tail prior probability of various parameters in the model.

(Prior concentration for sparse function) Suppose that Assumptions G and P are satisfied. Let $f_{0}\in\mathcal{C}^{\alpha,R}(^{p})$ be a bounded regression function that depends on at most $d$ covariates. Then there exists constants $A$ and $C$ independent of $(n,p)$ such that for all sufficiently large $n$ , the prior $\Pi$ over regression function $f$ satisfies

where $\varepsilon_{n}=n^{-\alpha/(2\alpha+d)}(\log n)^{t}+\sqrt{n^{-1}\,d\,\log p}$ for any $t\geq\alpha(d+1)/(2\alpha+d)$ .

The following posterior concentration rate for sparse functions follows immediately from Theorem 2 and Theorem 3.2 in Bhattacharya et al., (2016) (see also Section 4.1 therein).

Suppose that Assumptions G and P are satisfied. Let $f_{0}\in\mathcal{C}^{\alpha,R}(^{p})$ be a bounded regression function that only depends on at most $d$ covariates. If $n\,\varepsilon_{n}^{2}\to\infty$ and $\varepsilon_{n}\to 0$ as $n,\,p\to\infty$ , then for all sufficiently large constant $M>0$ , we have

where $\varepsilon_{n}=n^{-\alpha/(2\alpha+d)}(\log n)^{t}+\sqrt{n^{-1}\,d\,\log p}$ for any $t\geq\alpha(d+1)/(2\alpha+d)$ .

This result shows a salient feature of our sum of soft decision trees model — by introducing the soft thresholding, the resulting posterior contraction rate adapts to the unknown smoothness level $\alpha$ of the truth $f_{0}$ , attaining a near-minimax rate (Yang and Tokdar,, 2015) without the need of knowing $\alpha$ in advance. Our next result shows that if the truth admits a sparse additive structure $f_{0}=\sum_{v=1}^{V}f_{0,v}(x)$ , where each additive component $f_{0,v}(x)$ is sparse and only depends on $d_{v}$ covariates for $v=1,\ldots,V$ , then the posterior contraction rate also adaptively (with respect to both the additive structure and unknown smoothness of each additive component) attains a near-minimax rate (Yang and Tokdar,, 2015) up to $\log n$ terms, which leads to a second salient feature of the sum of soft decision tree model — it also adaptively learns any unknown lower order nonlinear interactions among the covariates.

Suppose that Assumptions G and P are satisfied. Let $f_{0}=\sum_{v=1}^{V}f_{0,v}$ , where the $v$ th additive component $f_{0,v}$ belongs to $\mathcal{C}^{\alpha_{v},R}(^{p})$ , and is bounded and only depends on at most $d_{v}$ covariates for $v=1,\ldots,V$ . If $n\,\varepsilon_{n}^{2}\to\infty$ and $\varepsilon_{n}\to 0$ as $n,\,p\to\infty$ , then for all sufficiently large constant $M>0$ , we have

where $\varepsilon_{n}=\sum_{v=1}^{V}n^{-\alpha_{v}/(2\alpha_{v}+d_{v})}(\log n)^{t}+\sum_{v=1}^{V}\sqrt{n^{-1}\,d_{v}\,\log p}$ for any $t\geq\max_{v}\alpha_{v}(d_{v}+1)/(2\alpha_{v}+d_{v})$ .

Illustrations

A standard test case, initially proposed by Friedman, (1991) (see also Chipman et al.,, 2010), sets

This $f_{0}(x)$ features two nonlinear terms, two linear terms, with a nonlinear interaction.

In this experiment, we consider $n=250$ observations, $\sigma^{2}\in\{1,10\}$ , and $p$ from $5$ to $1000$ along an evenly-spaced grid on the scale of $\log p$ . We compare SBART to BART, DART, gradient boosted decision trees (xgboost), the lasso (glmnet), and random forests (randomForest). A similar experiment was conducted by Linero, (2016), who showed that the sparsity inducing prior used by DART resulted in substantial performance gains over BART. The purpose of this experiment is to demonstrate the further gains which are possible when the smoothness of (9) is also leveraged.

Methods are compared by root mean-squared error, $\operatorname{RMSE}=\{\int\{f(x)-\widehat{f}(x)\}^{2}\ dx\}^{1/2},$ which is approximated by Monte-Carlo integration. For the Bayesian procedures, we take $\widehat{f}$ to be the pointwise posterior mean of $f$ . DART, and SBART use their respective default priors and were fit using 2500 warmup iterations and 2500 sampling iterations, while cross-validation is used to tune the hyperparameters for BART. The non-Bayesian methods were tuned using cross validation for each replication of the experiment.

Results are given in Figure 5. Among the methods considered, SBART performs the best, obtaining a sizeable improvement over DART in both the low noise and high noise settings. Due to the use of a sparsity-inducing prior, both DART and SBART are largely invariant to the number of nuisance predictors, while random forests, BART-CV, and boosting have errors increasing in $\log p$ . The lasso also has stable, albeit poor, performance as $p$ increases.

We now compare SBART to DART for the task of variable selection (see Linero,, 2016 for a detailed comparison of DART, BART, random forests, and the lasso which found DART to perform best among these methods). Our goal is to assess whether leveraging smoothness can improve on the good variable selection properties of DART. We modify Friedman’s function, taking instead

where $\lambda$ is a tuning parameter for the simulation. A variable is included if its posterior inclusion probability exceeds 50%. We consider $\lambda\in[0.1,1]$ . As measures of accuracy, we consider precision $=TP/(TP+FP)$ , recall $=TP/(TP+FN)$ , and $F_{1}$ score (harmonic mean of precision and recall), where $TP,FP$ and $FN$ denote the number of true positives, false positives, and false negatives respectively.

Results for $20$ replications and $\sigma^{2}=1$ are given in Figure 6, along with the average RMSE. First, we see that both DART and SBART have a precision which is roughly constant in $\lambda$ , with SBART performing uniformly better. This makes intuitive sense, as varying $\lambda$ should have little influence on whether irrelevant predictors are selected. The precision of both methods is heavily dependent on $\lambda$ , and we see that SBART is generally capable of detecting smaller signal levels; at its largest, the difference in recall is about 10%. Once the signal level is high enough, both methods detect all relevant predictors consistently. The $F_{1}$ score reflects a mixture of these two behaviors. Perhaps most interesting is the influence of $\lambda$ on the RMSE. As $\lambda$ increases the performance of DART degrades while SBART remains roughly constant. Intuitively this is because, as $\lambda$ increases, DART must use an increasing number of branches to capture the additional signal in the data, while SBART is capable of representing the effects corresponding to $(x_{4},x_{5})$ with fewer parameters.

2 Approximation of non-smooth and locally smooth functions

A potential concern with the use of soft decision trees is that they may not be able to capture fine-scale variability in the underlying regression function. An extreme example of this is when $f$ is a step function. We consider the regression function $f(x)=2-4I(x_{1}<0.5).$ In this case, one might expect soft decision trees to perform suboptimally relative to hard decision trees because a soft decision tree must model the jump at in a continuous fashion.

Surprisingly, ensembles of soft decision trees can outperform ensembles of hard decision trees even in this case. Figure 7 shows fits of BART and SBART to $n=250$ data points and a high signal of $\sigma=0.1$ . We see that both methods can capture the large jump discontinuity at $x_{1}=0.5$ . SBART performs better away from the discontinuity, however, because the level of smoothness is allowed to vary at different points in the covariate space. The trees responsible for the jump discontinuity have small $\tau_{t}$ ’s to effectively replicate a step function, while elsewhere the trees have large $\tau_{t}$ ’s to allow the function to essentially be constant.

The ability to select different $\tau_{t}$ ’s allows SBART to obtain a locally-adaptive behavior. To illustrate this, Figure 8 gives the fit of BART and SBART when $f(x)$ is a highly localised Daubechies wavelet of smoothness order $10$ . We see that SBART is capable of adapting both to the constant regions outside of the support of the wavelet, and the fast oscillatory behavior within the support of the wavelet. The fit of BART, by contrast, possesses many artifacts outside the support of the wavelet, and possesses generally wider credible bands.

3 Benchmark datasets

We compare the SBART to various tree-based and non-tree-based methods on several benchmark datasets. We consider BART, DART, the LASSO (glmnet), random forests (randomForest), and gradient boosted decision trees (xgboost). The parameters for the non-Bayesian procedures were chosen, separately for each fit, using the caret package. Default priors (with $T=50$ ) for SBART and DART were used; additionally, we consider selecting the hyperparameters of SBART and BART by cross validation.

Ten datasets are considered. Aside from bbb and wipp, the datasets are a subset of those considered by Kim et al., (2007). While we consider only a subset of these datasets, no datasets considered for this experiment were omitted. Attributes of these datasets are presented in Table 1. The response in each dataset was transformed to be approximately Gaussian. The bbb, triazines, and wipp datasets were also considered by Linero, (2016) to illustrate features of the sparsity-inducing priors for decision tree methods.

Results of the experiment are given in Table 1. Methods are compared by an estimate of their root mean predictive error obtained using $5$ -fold cross-validation, with the results averaged over $20$ replications of the cross-validation. For each experiment, the root mean predictive error for each method is normalized by the root mean predictive error for SBART-CV, so that scores higher than 1.00 correspond to worse performance than SBART and scores lower than 1.00 correspond to better performance than SBART.

SBART/SBART-CV is seen to perform very well in practice, attaining the best performance on 8 out of the 10 datasets. The results here are consistent with the general observation of Chipman et al., (2010) that BART outperforms gradient boosting and random forests in aggregate over many datasets. Two datasets stand out as particularly interesting. First, for the tecator dataset, SBART outperforms all other methods by a very wide margin, indicating that leveraging smoothness for this dataset is essential to attaining good performance. Second, the only dataset for which SBART-CV substantially outperforms SBART is the hatco dataset, where tuning the number of trees is required to attain optimal performance. This indicates that, for most datasets, the default SBART procedure works very well, but that if one wants to be absolutely sure of optimal performance they should tune $T$ .

Discussion

We have introduced a novel Bayesian sum-of-trees framework and demonstrated that it is capable of attaining a meaningful improvement over existing methods both in simulated experiments and in practice. This was accomplished by incorporating soft decision trees and sparsity-inducing priors. We also provided theoretical support in the form of near-optimal results for posterior concentration, adaptively over smoothness classes, when $f_{0}(x)$ is a sparse, or additive, function.

While this paper has focused only on the case of nonparametric regression, the proposed methodology extends in a straight-forward manner to other settings. For example, the case of binary classification can be addressed in the usual way via a probit link and data augmentation.

Our theoretical results concern the rate of convergence of the posterior. Another relevant question is whether the model can consistently estimate the model support. That is, one can ask under what conditions $\Pi(S=S_{0}\mid\mathcal{D})\to 1$ as $n\to\infty$ , where $S=\{p:\text{predictor$ p $appears in the ensemble}\}$ and $S_{0}=\{p:\text{$ f_{0} $depends on$ p $}\}$ . This is an interesting area for future research.

Software which implements SBART is available online at https://github.com/theodds/SoftBART, and is undergoing active development. Our code is based on the implementation of BART in the BayesTree package. Given enough optimization, we hope that our implementation could reach speeds within a modest factor of existing highly-optimized implementations of BART (Kapelner and Bleich,, 2016).

Appendix A Proof of Lemma 1

Let $K^{(d)}_{\tau}(x_{1},x_{2},\ldots,x_{d})=\tau^{-d}\,\prod_{j=1}^{d}K(x_{j}/\tau)$ denote a $d$ -dimensional tensor product of the rescaled one dimensional kernel $K$ in Assumption G, where recall that $\tau$ is the bandwidth parameter in the gating function. Let $\displaystyle C_{K}:\,=\int K(x)\,dx$ denote the normalization constant of $K$ , so that we can write $K=C_{K}\widetilde{K}$ , and the rescaled kernel function $\widetilde{K}$ is has an unit normalization constant. Also write $\widetilde{K}^{(d)}_{\tau}=K^{(d)}_{\tau}/C_{K}^{d}$ . It is easy to verify that $\widetilde{K}$ also satisfies the two conditions in Assumption G, though it may not be associated to any $\widetilde{\psi}$ .

Our proof is composed of three steps. First, we provide error bound estimates of approximating any $\alpha$ -smooth function by a convolution $K^{(d)}_{\tau}*g$ with some carefully constructed function $g$ for any $\tau>0$ . Second, we show that any continuous convolution $K^{(d)}_{\tau}*g$ can be approximated by a discrete sum $\sum_{t=1}^{T}\mu_{t}K^{(d)}_{\tau}(\cdot-x_{t})$ with at most $O(\tau^{-d})$ atoms. Lastly, we provide an error bound estimate on approximating this sum of kernels with a sum of soft decision trees by identifying each kernel component $K^{(d)}_{\tau}(\cdot-x_{t})$ as one particular leaf in the $t$ th soft decision tree $g(x;\,\mathcal{T}_{t},\mathcal{M}_{t})$ whose depth is at most $2d$ via splitting at most $2d$ times, for $t=1,\ldots,T$ .

Step 1: This step is follows as a direct result of the following lemma, which is adapted from Lemma 3.4 of De Jonge et al., (2010).

Under Assumption G, for any $f_{0}\in\mathcal{C}^{\alpha,R}(^{d})$ , there exist some constants $(M_{1},M_{2})$ independent of $\tau$ , and a function $T_{b,\tau}f_{0}$ satisfying $\|T_{b,\tau}f_{0}\|_{\infty}\leq M_{1}R$ , such that

where $g=C_{K}^{-d}\,T_{b,\tau}f_{0}$ satisfies $\|g\|_{\infty}\leq M_{1}^{\prime}R$ , with $M_{1}^{\prime}=C_{K}^{-d}M_{1}$ independent of $\tau$ .

Step 2: This step generalizes the theory of approximating a continuous one-dimensional density function from by a mixture of Gaussians developed in Ghosal and Van Der Vaart, (2007) to by a location mixture of any kernel $K$ satisfying Assumption G. We also extend their result from density estimation to general function estimation as demanded in our regression setting, where the target function $f$ may not integrate to one and can take negative values. First, we state an extension of Lemma 3.1 of Ghosal and Van Der Vaart, (2001) from dimension one to dimension $d$ , and from the Gaussian kernel to any kernel $K$ satisfying Assumption G.

Under Assumption G, for any probability density function $p_{0}$ on $^{d}$ , any $\epsilon>0$ , and $\tau\in(0,1)$ , there is a discrete measure $P_{\tau}=\sum_{t=1}^{T}r_{t}\,\delta_{x_{t}}$ with $T\leq C_{1}\tau^{-d}\log^{d}(1/\tau)$ support points such that $\sum_{t=1}^{T}r_{t}=1$ and

where $(C_{1},D_{1})$ are independent of $\tau$ and $K$ .

We only sketch the key difference in the proof from Lemma 3.1 of Ghosal and Van Der Vaart, (2001) in the one-dimensional case, and a proof for extending the result from one-dimensional case to the multi-dimensional case follows similar lines as in the proof of Theorem 7 in Shen et al., (2013) (by replacing the Gaussian kernel with the kernel $K$ ).

The only key property of the Gaussian kernel used in the proof of Lemma 3.1 of Ghosal and Van Der Vaart, (2001) is in bounding the remainder term in the $k$ -th order Taylor expansion in their equation (3.11), where they used the fact that for any $k\geq 1$ , the $k$ th order derivative of the standard Gaussian density function $\phi(x)=(2\pi)^{-1/2}\,e^{-x^{2}/2}$ at the origin $x=0$ satisfies the bound

for some sufficiently large constant $C_{2}>0$ (since we only focus on the approximation error over the unit interval $ $, we do not need to include the additional$ \log(1/\varepsilon) $term in equation (3.11) therein). Therefore, it suffices to verify a similar exponentially decay bound for the$ k $th order derivative of function$ K_{\kappa}:\,=\tau^{-1}\,K(\cdot/\kappa) $for some sufficiently large number$ \kappa>0 $depending on$ C $. In fact, under Assumption G,$ K(\cdot) $can be analytically extend to the strip$ \mathcal{S}(\rho) $in the complex plane (for simplicity, we use the same notation$ K$ to denote this extension), which implies by applying Cauchy’s integral formula that

where the closed path $\Gamma_{\kappa}$ is chosen as a counter-clockwise circle centering at the origin with radius $\kappa\,\rho$ . Since $K$ is uniformly bounded on the path $\Gamma_{\kappa}$ by Assumption G, we can further deduce that

holds as long as $\kappa\geq\rho^{-1}\,\exp\{C_{2}\}$ , where $D$ is some constant only depending on $K$ , which completes the proof. ∎

With this lemma on the density function approximation as our preparation, we now return to the problem of approximating any general bounded function $g$ over $^{d}$ . Notice that we always have the decomposition $g=g_{+}-g_{-}$ where $g_{+}=\max\{0,g(x)\}$ and $g_{-}(x)=\max\{0,-g(x)\}$ are the positive parts and negative parts of $g$ , respectively, and both of them are nonnegative and bounded over $^{d}$ . Let $A_{+}=\int_{^{d}}g_{+}(x)\,dx\leq\|g\|_{\infty}$ and $A_{-}=\int_{^{d}}g_{-}(x)\,dx\leq\|g\|_{\infty}$ . It is obvious that $g_{+}/A_{+}$ and $g_{-}/A_{-}$ are two legitimate pdfs over $^{d}$ . By applying Lemma 6, we can find two discrete measures $P_{+}=\sum_{t=1}^{T_{+}}r^{+}_{t}\,\delta_{x^{+}_{t}}$ and $P_{-}=\sum_{t=1}^{T_{-}}r^{-}_{t}\,\delta_{x^{-}_{t}}$ such that

for any $x\in^{d}$ and $\max\{T_{+},\,T_{-}\}\leq C\,\tau^{-d}\log^{d}(1/\varepsilon)$ . Now we combine these two discrete measures into a new discrete signed measure $P_{0}=\sum_{t=1}^{T_{+}}A_{+}\,r^{+}_{t}\,K^{(d)}_{\tau}(x-x^{+}_{t})+\sum_{t=1}^{T_{-}}(-A_{-}\,r^{-}_{t})\,K^{(d)}_{\tau}(x-x^{-}_{t})$ , which will be denoted as $\sum_{t=1}^{T}\mu_{t}\,K^{(d)}_{\tau}\big{(}\cdot-x_{t})$ . Then $T\leq T_{-}+T_{+}\leq 2C\,\tau^{-d}\log^{d}(1/\varepsilon)$ and

for all $x\in^{d}$ . Moreover, we have $\sum_{t=1}^{T}|\mu_{t}|\leq A_{+}\sum_{t=1}^{T_{+}}r_{t}^{+}+A_{-}\sum_{t=1}^{T_{-}}r_{t}^{-}\leq 2\|g\|_{\infty}$ .

Finally, a combination of steps 1-3 together yields a proof of the lemma.

Appendix B Proof of Theorem 2

For convenience, we use the same notation $C$ to denote some constant independent of $(n,p)$ , whose value may change from line to line. Without loss of generality, we may assume that $f_{0}$ depends only on its first $d$ coordinates. Applying Lemma 1, we obtain that for some parameters $\tau$ and $\varepsilon$ to be determined later, there exists some $\widetilde{f}=\sum_{t=1}^{\widetilde{T}}g(x;\,\widetilde{\mathcal{T}}_{t},\widetilde{\mathcal{M}}_{t})$ such that $\widetilde{T}\leq C\,\tau^{-d}\log^{d}(\varepsilon^{-1})$ , $\|\widetilde{f}-f_{0}\|_{\infty}\leq C\,(\tau^{\alpha}+\varepsilon\,\tau^{-d})$ , and the total number of splits (all are along the first $d$ coordinates) across all trees are at most $2d\,\widetilde{T}$ ( $2d\,\widetilde{T}$ many leaves in total).

Recall that our prior over the sum of soft decision tree function $f$ is specified in a hierarchical manner: first, we specify the number $T$ of trees and the tree topology $\mathcal{T}=\{\mathcal{T}_{1},\ldots,\mathcal{T}_{T}\}$ ; second, conditional on these we decide the coordinates in all splits across all the decision trees; third, we sample the independent splitting locations along all the selected coordinates; last, we sample bandwidth parameters $\tau_{t}$ associated with each tree and parameters $\mu$ ’s associated with all leaves across the trees. We denote by $\widetilde{T}$ and $\widetilde{\mathcal{T}}$ the corresponding number of trees and the tree topology of $\widetilde{f}$ .

then we have the following perturbation error bound by applying the triangle inequality,

Now we apply Theorem 2.1 in Yang and Dunson, (2014) on the prior concentration probability for high-dimensional Dirichlet distribution and Assumption P3 to obtain that the splitting proportion vector $s=(s_{1},\ldots,s_{p})$ satisfies

This combined with the fact that each tree has depth at most $2d$ and Assumption P5 implies that the prior probability of $\mathcal{T}=\widetilde{\mathcal{T}}$ given $T=\widetilde{T}$ can be lower bounded by

where we have used the fact that $N\leq C\,\tau^{-d}\,\log^{d}(\varepsilon^{-1})$ in the last step. The perturbation error bound in (10) implies

where in the last step we applied Assumptions P2 and P4, and used the fact that $\sum_{u}|\widetilde{\mu}_{u}|\leq C\,\tau^{-d}$ for some constant $C$ only dependent of $f_{0}$ (due to Lemma 1). Putting all pieces together and using Assumption P1 and the properties of $\widetilde{f}$ , we obtain

Therefore, by choosing $\tau=(\log^{d+1}n/n)^{-1/(2\alpha+d)}$ , $\delta=\tau^{\alpha}$ , $\varepsilon=\tau^{d+\alpha}$ , we can obtain the claimed prior concentration probability lower bound as $\Pi\big{[}\|f-f_{0}\|_{\infty}\leq C\,\varepsilon_{n}\big{]}\geq\exp\{-C\,n\,\varepsilon_{n}^{2}\}$ .

Appendix C Proof of Theorem 4

Using Theorem 3.2 in Bhattacharya et al., (2016) (see also Section 4.1 therein), it suffices to show that $\Pi\big{[}\|f-f_{0}\|_{\infty}\leq C\,\varepsilon_{n}\big{]}\geq\exp\{-C\,n\,\varepsilon_{n}^{2}\}$ . The proof of this is almost the same as that of Theorem 2, the only difference is that now we apply Lemma 1 to find $V$ functions $\{\widetilde{f}_{v}:\,v=1,\ldots,V\}$ , where $\widetilde{f}_{v}$ contains $\widetilde{T}_{v}$ trees and approximates the $v$ th additive component $f_{0,v}$ in $f_{0}$ for $v=1,\ldots,V$ , and set $\widetilde{f}=\sum_{v=1}^{V}\widetilde{f}_{v}$ . Due to the additive structure in our sum of soft decision tree model, we can always write $f=\sum_{v=1}^{V}f_{v}$ where $f_{v}$ collects $\widetilde{T}_{v}$ trees and has the same sum of soft decision tree prior structure when conditioning on the total number of trees $T=\sum_{v=1}^{V}\widetilde{T}_{v}$ , and the conditional priors of $(f_{1},\ldots,f_{v})$ given $T=\sum_{v=1}^{V}\widetilde{T}_{v}$ and the splitting proportion vector $s$ are independent. Let $\mathcal{S}=\big{\{}s_{j}\geq(2d)^{-1}\mbox{ for }j=1,\ldots,d,\mbox{ and }\sum_{j=d+1}^{p}s_{j}\leq d^{-1}\big{\}}$ denote the event in inequality (11) with $d:\,=\sum_{v=1}^{V}d_{v}$ . Therefore, we obtain by applying Assumption P, the prior concentration bound (11) for $s$ , and Theorem 2 for a single $f_{v}$ (choose parameters $\tau_{v},\delta_{v},\varepsilon_{v}$ for each $f_{v}$ as in the proof of Theorem 2) that

where constants $C,C^{\prime}>0$ , $\varepsilon_{n,v}=n^{-\alpha_{v}/(2\alpha_{v}+d_{v})}(\log n)^{t_{v}}+\sqrt{n^{-1}\,d_{v}\,\log p}$ and $t_{v}\geq\alpha_{v}(d_{v}+1)/(2\alpha_{v}+d_{v})$ .

Acknowledgments

This work was partially supported by NSF grant DMS-1712870 and DOD grant SOT-FSU-FATs-16-06.