Bayesian Regression Tree Ensembles that Adapt to Smoothness and Sparsity

Antonio Ricardo Linero, Yun Yang

Introduction

Consider a nonparametric regression model Y=f0(X)+ϵY=f_{0}(X)+\epsilon with response YY, XpX\in^{p} a pp-dimensional predictor, f0f_{0} an unknown regression function of interest, and Gaussian noise ϵNormal(0,σ2)\epsilon\sim\operatorname{Normal}(0,\sigma^{2}). Suppose we observe D=((X1,Y1),,(Xn,Yn))\mathcal{D}=((X_{1},Y_{1}),\ldots,(X_{n},Y_{n})) consisting of independent and identically distributed copies of (X,Y)(X,Y). A popular approach to estimating f0(x)f_{0}(x) is to form an ensemble of decision trees; common techniques include boosted decision trees (Freund et al.,, 1999) and random forests (Breiman,, 2001). Bayesian tree-based models, such as the Bayesian additive regression trees (BART) model (Chipman et al.,, 2010), have recently attracted interest from practitioners due to their excellent empirical performance and natural uncertainty quantification; BART has been applied in a wide variety of contexts such as nonparametric function estimation with variable selection (Bleich et al.,, 2014; Linero,, 2016), analysis of loglinear models (Murray,, 2017), and survival analysis (Sparapani et al.,, 2016). Additionally, BART is consistently among the best performing methodologies in the Atlantic Causal Inference Conference Data Analysis Challenge (Hill,, 2011, 2016; Hahn et al.,, 2017; Dorie et al.,, 2017).

Despite the recent popularity of Bayesian tree-based models, they suffer from several drawbacks. First, in the regression setting, estimators based on decision trees are not capable of adapting to higher smoothness levels exhibited in f0f_{0} due to their piecewise-constant nature. Second, as illustrated by Linero, (2016), they suffer from the curse of dimensionality — their prediction performance deteriorates as the dimensionality pp increases. Last but not least, very little theoretical work has been done for understanding large sample properties of Bayesian tree-based approaches from a frequentist perspective.

In this article, we propose a new method, called soft Bayesian additive regression trees (SBART) which improves both practically and theoretically upon existing Bayesian sum-of-trees models. To address the first aforementioned drawback, we employ a ensemble of carefully designed “soft” decision trees as building blocks in the BART model, and show in both empirical studies and theoretical investigation that the resulting Bayesian approach can adapt to the unknown smoothness level of the true regression function f0f_{0} — the corresponding posterior distribution achieves the minimax rate nα/(2α+p)n^{-\alpha/(2\alpha+p)} of contraction up to logarithmic terms (Ghosal et al.,, 2000) when f0Cα,R(d)f_{0}\in\mathcal{C}^{\alpha,R}(^{d}) where Cα,R(d)C^{\alpha,R}(^{d}) denotes a Hölder space with smoothness index α\alpha and radius RR.

To overcome the curse of dimensionality, we specify sparsity inducing priors (Linero,, 2016) for the splitting rule probabilities in the soft decision trees. We show that SBART takes advantage of structural sparsity in the true regression function f0f_{0} — when f0f_{0} only depends on dpd\ll p predictors and is α\alpha-Hölder smooth, the resulting posterior distribution contracts towards the truth at a rate of nα/(2α+d)+n1dlogpn^{-\alpha/(2\alpha+d)}+\sqrt{n^{-1}d\log p} up to logarithmic terms, which is near minimax-optimal even in the high-dimensional setting where pp grows nearly exponentially fast in nn (Yang and Tokdar,, 2015). Furthermore, due to the additive nature of sum-of-trees based models, we show that SBART can also adapt to low-order non-linear interactions: if f0f_{0} can be decomposed into many low dimensional pieces f0=v=1Vf0vf_{0}=\sum_{v=1}^{V}f_{0v}, where each additive component f0vf_{0v} is dvd_{v}-sparse and αv\alpha_{v}-smooth, then SBART also achieves a near-minimax rate of posterior contraction. Compared to the rate for the general sparse case, which allows at most o(logn)o(\log n) many active predictors for consistency, the rate for additive structures potentially allows o(nβ)o(n^{\beta}) many predictors for some β(0,1)\beta\in(0,1); this partly explains the empirical success of Bayesian sum-of-tree approaches, as many real-world phenomena can be explained in terms of a small number of low-order interactions.

Our proofs involve a key lemma that links sum-of-tree type estimators with kernel type estimators. Unlike frequentist kernel type estimators that require prior knowledge on the smoothness level of f0f_{0} for choosing a smoothness matching kernel, Bayesian sum-of-tree based methods are adaptive, requiring no prior knowledge of the smoothness levels {αv}\{\alpha_{v}\}, number of additive components VV, or degree of lower-order interactions dvd_{v}, while still attaining near-minimax rates even under the high-dimensional setting. Practically, SBART can be implemented by making minimal modifications to existing strategies for fitting Bayesian tree-based models: the sparsity-inducing prior uses conditionally-conjugate Dirichlet priors which can be easily accommodated during Gibbs sampling, while replacing the usual decision trees with soft decision trees requires minor changes to the backfitting algorithm typically used with BART.

There has been a recent surge of interest in the theoretical properties of BART-type models. While our work was under review we learned that, in essentially simultaneous work, Rockova and van der Pas, (2017) established similar posterior contraction rates for a particularly designed BART prior, using a so-called “spike-and-tree” prior to allow for the ensemble to adapt to sparsity. In particular, they show that a single deep decision tree can approximate any function with smoothness level α1\alpha\leq 1, which is then divided among trees with smaller depth. Our theory instead relies on linking sum-of-tree type estimators with kernel type estimators, which only need shallow trees and motivate the usage of soft-decision trees. Practically, the most relevant difference is that our SBART prior allows for adaptation to the smoothness level even when α>1\alpha>1, whereas the use of piecewise-constant basis functions in traditional BART models only allows for adaptation to functions which are at-most Lipschitz-smooth (α1\alpha\leq 1). An additional difference is that we focus on establishing concentration results for the fractional posterior, which allows for less restrictive assumptions about our choice of prior; in our supplementary material, we also provide concentration results for the usual posterior, under more stringent conditions. In even more recent work, Alaa and van der Schaar, (2017) establish consistency results for BART-type priors for estimating individual treatment effects in causal inference settings, and also noted the limitation of BART in adapting to a smoothness order higher than α=1\alpha=1.

The soft decision trees we use are similar in spirit to those used by Irsoy et al., (2012), who considered a soft variant of the CART algorithm. Our work differs in that (i) our trees are not learned in a greedy fashion, but instead by extending the Bayesian backfitting approach of Chipman et al., (2010), (ii) we consider an ensemble of soft trees rather than a single tree, (iii) we use a different parameterization of the gating function which does not consider oblique decision boundaries, and (iv) we establish theoretical guarantees for our approach.

The rest of the paper is organized as follows. In Section 2, we develop our SBART prior. In Section 3 we state our theoretical results. In Section 4, we illustrate the methodology on both simulated and real datasets. We finish in Section 5 with a discussion. Proofs are deferred to the appendix. In supplementary material, we provide additional computational details, timing results, and additional theoretical results extending our fractional posterior results to the usual posterior.

Soft Bayesian sum of trees models

We begin by describing the usual “hard” decision tree prior used in BART. We model f0(x)f_{0}(x) as the realization of a random function

Following Chipman et al., (2010), we endow Tt\mathcal{T}_{t} with a branching process prior. The branching process begins with a root node of depth k=0k=0. For k=0,1,2,k=0,1,2,\ldots, each node at depth kk is non-terminal with probability q(k)=γ(1+k)βq(k)=\gamma(1+k)^{-\beta} where γ>0\gamma>0 and β>0\beta>0 are hyperparameters controlling the shape of the trees. It is easy to check using elementary branching process theory that this process terminates almost surely provided that β>0\beta>0 (Athreya and Ney,, 2004).

2 Smoothness adaptation

A well-known feature of decision trees is their lack of smoothness. Single-tree algorithms such as the CART algorithm (Hastie et al.,, 2009, Chapter 9.2) result in step-function estimates, suggesting that they should not be capable of efficiently estimating smooth functions (Györfi et al.,, 2006). Methods based on ensembles of decision trees average over many distinct partitions of the predictor space, resulting in some degree of smoothing. Even with this averaging, the estimated regression functions are not smooth. Heuristically, we note that under our BART specification the function ff is not differentiable in quadratic mean. Indeed, with trees of depth 11, p=1p=1, and cutpoints CbGC_{b}\sim G, simple calculations give E{(f(x+δ)f(x))2}δG(x)+o(δ)E\{(f(x+\delta)-f(x))^{2}\}\propto\delta G^{\prime}(x)+o(\delta). Consequently, BART ensembles with a large number of trees resemble nowhere-differentiable continuous functions, and in the limit as TT\to\infty the BART prior converges to a nowhere-differentiable Gaussian process. This heuristic argument suggests that BART can only adapt to functions with Hölder smoothness level no greater than α=1\alpha=1 (Lipschitz functions).

Figure 2 compares the fit of BART to SBART with τb0.1\tau_{b}\equiv 0.1. We see that when T=1T=1 trees are used we require a large number of leaf nodes to model relatively simple functions. At a large scale, we see that the BART fit resembles a nowhere-differentiable continuous function. While an improvement, the estimate from BART is still not sufficiently smooth and exhibits large fluctuations.

The fit of the soft decision tree in Figure 2 by comparison is infinitely differentiable and requires only a small number of parameters. Consequently, we obtain a fit with lower variance and negligible bias. An attractive feature of soft decision trees exhibited in Figure 2 is their ability to approximate linear relationships. In this case, even when T=1T=1, we recover the smooth functions almost exactly.

3 Prior specification and implementation

Following Chipman et al., (2010), in this section we develop a “default” SBART prior. The goal is to develop a prior which can be used routinely, without requiring the user to specify any hyperparameters; while the choices below may appear ad-hoc, they have been found to work remarkably well across a wide range of datasets. After adopting the following default prior, users may wish to further tune the number of trees TT, the parameter rr in the prior for τb\tau_{b}, or use additional information regarding the targeted sparsity level. We stress, however, that a reasonable baseline level of performance is obtained without the need to do any further tuning.

Following Chipman et al., (2010), we recommend scaling YY so that most/all of the responses fall in the interval [0.5,0.5][-0.5,0.5]. We also preprocess XjX_{j} so that XjUniform(0,1)X_{j}\sim\operatorname{Uniform}(0,1) approximately by applying a quantile normalization in which each XijX_{ij} is mapped to its rank, with minXij=1\min X_{ij}=1 and maxXij=n\max X_{ij}=n. We then apply a linear transformation so that the values of XijX_{ij} are in $.Thegoalofthispreprocessingof. The goal of this preprocessing ofXistomakethepriorinvariantundermonotonetransformationsofis to make the prior invariant under monotone transformations ofX$, which is a highly desirable property of the original default BART model.

We now describe our default prior for the bandwidths τb\tau_{b} and the splitting proportions s=(s1,,sp)s=(s_{1},\ldots,s_{p}). We use a sparsity-inducing Dirichlet prior,

Our theoretical results require ξ>1\xi>1, however in practice we find that setting ξ=1\xi=1 works adequately. This Dirichlet prior for ss was introduced by Linero, (2016); throughout, we refer to the BART model with (3) as Dirichlet additive regression trees (DART) to contrast with BART when no such sparsity-inducing prior is used. The parameter aa controls the expected amount of sparsity in ff. Conditional on there being BB branches in the ensemble, the number of predictors included in the ensemble is converges in distribution to 1+Z1+Z where ZPoisson(θ)Z\sim\operatorname{Poisson}(\theta) and θ=ai=1B1(a+i)1\theta=a\sum_{i=1}^{B-1}(a+i)^{-1} (Linero,, 2016) when ξ=1\xi=1. When prior information is available on the sparsity of f0f_{0}, we can choose aa to match the targeted amount of sparsity. By default we use a compound Gamma prior, a/(a+λa)Be(aa,ba),a/(a+\lambda_{a})\sim\operatorname{Be}(a_{a},b_{a}), with aa=0.5,ba=1,λa=pa_{a}=0.5,b_{a}=1,\lambda_{a}=p. This prior attempts to strike a balance between the sparse and non-sparse settings by having an infinite density at , median α=p/4\alpha=p/4, and an infinite mean.

There are several possibilities for choosing the bandwidth τb\tau_{b}. In preliminary work, using tree-specific τt\tau_{t}’s shared across branches in a fixed tree worked well, with τtExponential(r)\tau_{t}\sim\operatorname{Exponential}(r) where E(τt)=rE(\tau_{t})=r. Our illustrations use r=0.1r=0.1, which, as shown in Figure 3, gives a wide range of possible gating functions. An interesting feature of the sampled gating functions is that both approximate step functions and approximately linear functions are supported.

We give σ=Var(ϵ)\nicefrac12\sigma={\operatorname{Var}(\epsilon)}^{\nicefrac{{1}}{{2}}} a half-Cauchy prior, σCauchy+(0,σ^)\sigma\sim\operatorname{Cauchy}_{+}(0,\widehat{\sigma}). Again following Chipman et al., (2010), σ^\widehat{\sigma} is an estimate of σ\sigma based on the data. We use an estimate σ^lasso\widehat{\sigma}_{\text{lasso}} of σ\sigma obtained by fitting the lasso using the glmnet package in R.

The model has hyperparameters (σμ2,γ,β,T)(\sigma_{\mu}^{2},\gamma,\beta,T). In preliminary work, we did not have success placing priors on γ\gamma and β\beta, and instead fix γ=0.95\gamma=0.95 and β=2\beta=2 (Chipman et al.,, 2010). We give σμ\sigma_{\mu} a half-Cauchy prior, σμCauchy+(0,0.25)\sigma_{\mu}\sim\operatorname{Cauchy}_{+}(0,0.25), where 0.250.25 is chosen so that σμ\sigma_{\mu} has median equal to the default value recommended by Chipman et al., (2010).

An important remaining specification is the number of trees TT to include in the ensemble. The theoretical results we establish in Section 3 make use of a prior distribution on TT; however, our attempts to incorporate a prior on TT using reversible jump methods (Green,, 1995) resulted in poor mixing of the associated MCMC algorithms. Generally, we have found that fixing TT at a default value of T=50T=50 or T=200T=200 is sufficient to attain good performance on most datasets. Tuning TT further often provides a modest increase in performance, but may be worth the effort on some datasets (see Section 4.3).

There are a number of possible options for tuning TT, such as approximate leave-one-out cross validation using Pareto-smoothed importance sampling (PSIS-LOO) (Vehtari et al.,, 2015), maximizing an approximate marginal likelihood obtained using (say) WBIC (Watanabe,, 2013), or KK-fold cross validation as recommended by Chipman et al., (2010). The advantage of WBIC and PSIS-LOO is that they require fitting the model only once for each value of TT. In practice, we have found that approximations such as WBIC and PSIS-LOO are unreliable, with PSIS-LOO prone to overfitting and WBIC requiring potentially very long chains to estimate. Figure 4 displays the values of PSIS-LOO, a WBIC approximation of the negative marginal likelihood of TT (Watanabe,, 2013), and 55-fold cross validation, when used to select TT for a replicate of the illustration in Section 4.1 with p=100p=100 predictors. Both WBIC and cross validation select T=10T=10, which also minimizes the root mean squared error (f0(x)f^(x))2 dx\int(f_{0}(x)-\widehat{f}(x))^{2}\ dx. Resource permitting, we have found KK-fold cross-validation to be the most reliable method for selecting TT.

As a default we use the following priors throughout the manuscript.

4 Variable grouping prior

The sparsity-inducing prior (3) can be extended to allow penalization of groups of predictors simultaneously, in a manner similar to the group lasso (Yuan and Lin,, 2006). Suppose that the predictors can be divided into MM groups of size PmP_{m}. We set

We primarily use the grouping prior to allow for the inclusion of categorical predictors through the inclusion of dummy variables. This is an extension of the approach used by the bartMachine package in R. An alternative approach to the inclusion of categorical predictors, used in the BayesTree package, is to construct decision rules based on a dummy variable Zj=I(XjAb)Z_{j}=I(X_{j}\in A_{b}) where AbA_{b} is a random subset of the possible values of predictor jj. In our illustrations, we let ω\omega\to\infty so that vmk=Pm1v_{mk}=P_{m}^{-1} and set a/(a+M)Be(0.5,1)a/(a+M)\sim\operatorname{Be}(0.5,1).

5 Posterior computation

We use the Bayesian backfitting approach described by Chipman et al., (2010) to construct a Markov chain Monte Carlo (MCMC) algorithm to sample approximately from the posterior.

Within Algorithm 1, Tt\mathcal{T}_{t} is updated using a Metropolis-Hastings proposal. Proposals consist of one of three possible moves: Birth, which turns a leaf node into a branch node; Death, which turns a branch node into a leaf node; and Change, which changes the decision rule of a branch bb. A detailed description of these moves, and their associated transition probabilities, is given in the supplementary materials.

Constructing efficient updates for Tt\mathcal{T}_{t} and τt\tau_{t} requires marginalizing over Mt\mathcal{M}_{t}. Because the errors are assumed Gaussian, this marginalization can be carried out in closed form. The main computational drawback of SBART relative to BART lies in this marginalization, as SBART requires computing a likelihood contribution for each leaf-observation pair, whereas BART only requires a single likelihood contribution for each tree. Hence, if the trees are deep, BART will be substantially faster. By the construction of the prior, trees generally are not deep enough for this difference to be prohibitive.

The Dirichlet prior sD(a/pξ,,a/pξ)s\sim\mathcal{D}(a/p^{\xi},\ldots,a/p^{\xi}) allows for a straight-forward Gibbs sampling update, with the full conditional given by sD(a/pξ+c1,,a/pξ+cp),s\sim\mathcal{D}(a/p^{\xi}+c_{1},\ldots,a/p^{\xi}+c_{p}), where c_{j}=\#\{b:\text{branchbsplitsonpredictorsplits on predictorj}\}. When the grouping prior (5) is used we also obtain simple Gibbs sampling updates, with uD(a/M+z1,,a/M+zM)u\sim\mathcal{D}(a/M+z_{1},\ldots,a/M+z_{M}) and vmD(ω/Pm+cm1,,ω/Pm+cmPm)v_{m}\sim\mathcal{D}(\omega/P_{m}+c_{m1},\ldots,\omega/P_{m}+c_{mP_{m}}), where z_{m}=\#\{b:\text{branchbsplitsonapredictoringroupsplits on a predictor in groupm}\} and c_{mk}=\#\{b:\text{branchbsplitsonpredictorsplits on predictormk}\}.

Theoretical results

We study the theoretical properties of the SBART procedure from a frequentist perspective by assuming that (Y1,Y2,,Yn)(Y_{1},Y_{2},\ldots,Y_{n}) are generated from the model Yi=f0(Xi)+ϵiY_{i}=f_{0}(X_{i})+\epsilon_{i} with some true unknown regression function f0f_{0}. We assume that f0f_{0} is a function over p^{p}. We prove posterior consistency results when f0f_{0} is a member of certain Hölder spaces. Let Cα(p)\mathcal{C}^{\alpha}(^{p}) denote the Hölder space with smoothness index α\alpha, i.e., the space of functions on p^{p} with bounded partial derivatives up-to order β\beta, where β\beta is the largest integer strictly less than α\alpha and such that the partial derivatives of order β\beta are Hölder-continuous of order αβ\alpha-\beta. Let Cα,R(p)={fCα(p):fαR}\mathcal{C}^{\alpha,R}(^{p})=\{f\in\mathcal{C}^{\alpha}(^{p}):\|f\|_{\alpha}\leq R\} denote the Hölder-ball of radius RR with respect to the Hölder norm fα\|f\|_{\alpha} (see Ghosal and van der Vaart,, 2017, Appendix C).

We consider the posterior convergence of the Bayesian fractional posterior obtained by raising the likelihood function by a factor η(0,1]\eta\in(0,1] in the Bayes formula

where Π\Pi denotes the prior probability measure over L2(p)\mathcal{L}^{2}(^{p}), the L2\mathcal{L}^{2} space over p^{p}. Fractional posteriors have gained renewed attention in Bayesian statistics due to their robustness to model misspecification (Grünwald,, 2012; Miller and Dunson,, 2018). According to Walker and Hjort, (2001), the fractional posterior can be viewed as combining the original likelihood function with a data-dependent prior that is divided by a portion of the likelihood. This data dependent reweighting in the prior helps to prevent from possible inconsistencies by reducing the weights of those parameter values that “track the data too closely”. Additionally, the fractional posterior with η<1\eta<1 permits much simpler theoretical analyses. Note that η=1\eta=1 corresponds to the usual posterior distribution. Abusing notation slightly, we will also use Π\Pi to denote the prior probability measure over the parameters (Tt,Mt)(\mathcal{T}_{t},\mathcal{M}_{t}) and any hyperparameters in the model. Our goal is to find a sequence {εn:n1}\{\varepsilon_{n}:\,n\geq 1\} such that, for a sufficiently large constant MM and fixed η\eta,

In this section, we focus on establishing (7) for η<1\eta<1. The benefit of considering η<1\eta<1 is that this allows us to bypass verifying technical conditions regarding the effective support of the prior and the existence of a certain sieve (Ghosal et al.,, 2000, 2007), which allows for (7) to be established under very weak conditions. In the supplementary material we establish posterior consistency for η=1\eta=1 under more stringent conditions on the prior.

The main condition governing the posterior contraction rate is that the prior Π\Pi is sufficiently “thick” at f0f_{0}, in the sense that there exists a C>0C>0 such that

where Bε(f0)B_{\varepsilon}(f_{0}) denotes an ε\varepsilon-Kullback-Leibler (KL) neighborhood of the truth

+K(x)dx>0\displaystyle\int_{-\infty}^{+\infty}K(x)\,dx>0 and for any positive integer mm, +xmK(x)dx<\displaystyle\int_{-\infty}^{+\infty}|x|^{m}\,|K(x)|\,dx<\infty.

Suppose Assumption G holds for the gating function ψ\psi. For any function f0Cα,R(d)f_{0}\in\mathcal{C}^{\alpha,R}(^{d}), any ϵ>0\epsilon>0, and τ>0\tau>0, there exists a sum of soft decision trees with a single bandwidth τbτ\tau_{b}\equiv\tau for all branches,

where C1C_{1} and D1D_{1} are constants independent of (ε,τ)(\varepsilon,\tau).

With the help of this lemma, we establish (8) as a direct consequence of the following result, where we make the following assumptions on the prior distribution.

Assumption P (prior conditions):

The prior density πτ\pi_{\tau} of tree specific bandwidth parameters τt\tau_{t} satisfies πτ(τ)a1τa2\pi_{\tau}(\tau)\geq a_{1}\tau^{a_{2}} for some constants a1,a2>0a_{1},a_{2}>0 for all sufficiently small τ\tau.

The prior on the splitting proportion vector ss is D(a/pξ,,a/pξ)\mathcal{D}(a/p^{\xi},\ldots,a/p^{\xi}) for some ξ>1\xi>1 and a>0a>0.

Π(Dt=k)>0\Pi(D_{t}=k)>0 for k=0,1,,2dk=0,1,\ldots,2d, where DtD_{t} denotes the depth of tree tt and dd is as in Theorem 2.

In the supplementary material we show that under extra technical conditions on the prior, the usual posterior (fractional posterior with η=1\eta=1) can attain the same rate of convergence as in Theorem 3 below. These extra conditions are needed to control the size of the effective support of the prior and show the existence of a certain sieve (Ghosal et al.,, 2000). In particular, Assumption P only needs certain lower bounds on the prior density (mass) functions, while Assumption SP in the supplementary material requires some upper bound on the tail prior probability of various parameters in the model.

(Prior concentration for sparse function) Suppose that Assumptions G and P are satisfied. Let f0Cα,R(p)f_{0}\in\mathcal{C}^{\alpha,R}(^{p}) be a bounded regression function that depends on at most dd covariates. Then there exists constants AA and CC independent of (n,p)(n,p) such that for all sufficiently large nn, the prior Π\Pi over regression function ff satisfies

where εn=nα/(2α+d)(logn)t+n1dlogp\varepsilon_{n}=n^{-\alpha/(2\alpha+d)}(\log n)^{t}+\sqrt{n^{-1}\,d\,\log p} for any tα(d+1)/(2α+d)t\geq\alpha(d+1)/(2\alpha+d).

The following posterior concentration rate for sparse functions follows immediately from Theorem 2 and Theorem 3.2 in Bhattacharya et al., (2016) (see also Section 4.1 therein).

Suppose that Assumptions G and P are satisfied. Let f0Cα,R(p)f_{0}\in\mathcal{C}^{\alpha,R}(^{p}) be a bounded regression function that only depends on at most dd covariates. If nεn2n\,\varepsilon_{n}^{2}\to\infty and εn0\varepsilon_{n}\to 0 as n,pn,\,p\to\infty, then for all sufficiently large constant M>0M>0, we have

where εn=nα/(2α+d)(logn)t+n1dlogp\varepsilon_{n}=n^{-\alpha/(2\alpha+d)}(\log n)^{t}+\sqrt{n^{-1}\,d\,\log p} for any tα(d+1)/(2α+d)t\geq\alpha(d+1)/(2\alpha+d).

This result shows a salient feature of our sum of soft decision trees model — by introducing the soft thresholding, the resulting posterior contraction rate adapts to the unknown smoothness level α\alpha of the truth f0f_{0}, attaining a near-minimax rate (Yang and Tokdar,, 2015) without the need of knowing α\alpha in advance. Our next result shows that if the truth admits a sparse additive structure f0=v=1Vf0,v(x)f_{0}=\sum_{v=1}^{V}f_{0,v}(x), where each additive component f0,v(x)f_{0,v}(x) is sparse and only depends on dvd_{v} covariates for v=1,,Vv=1,\ldots,V, then the posterior contraction rate also adaptively (with respect to both the additive structure and unknown smoothness of each additive component) attains a near-minimax rate (Yang and Tokdar,, 2015) up to logn\log n terms, which leads to a second salient feature of the sum of soft decision tree model — it also adaptively learns any unknown lower order nonlinear interactions among the covariates.

Suppose that Assumptions G and P are satisfied. Let f0=v=1Vf0,vf_{0}=\sum_{v=1}^{V}f_{0,v}, where the vvth additive component f0,vf_{0,v} belongs to Cαv,R(p)\mathcal{C}^{\alpha_{v},R}(^{p}), and is bounded and only depends on at most dvd_{v} covariates for v=1,,Vv=1,\ldots,V. If nεn2n\,\varepsilon_{n}^{2}\to\infty and εn0\varepsilon_{n}\to 0 as n,pn,\,p\to\infty, then for all sufficiently large constant M>0M>0, we have

where εn=v=1Vnαv/(2αv+dv)(logn)t+v=1Vn1dvlogp\varepsilon_{n}=\sum_{v=1}^{V}n^{-\alpha_{v}/(2\alpha_{v}+d_{v})}(\log n)^{t}+\sum_{v=1}^{V}\sqrt{n^{-1}\,d_{v}\,\log p} for any tmaxvαv(dv+1)/(2αv+dv)t\geq\max_{v}\alpha_{v}(d_{v}+1)/(2\alpha_{v}+d_{v}).

Illustrations

A standard test case, initially proposed by Friedman, (1991) (see also Chipman et al.,, 2010), sets

This f0(x)f_{0}(x) features two nonlinear terms, two linear terms, with a nonlinear interaction.

In this experiment, we consider n=250n=250 observations, σ2{1,10}\sigma^{2}\in\{1,10\}, and pp from 55 to 10001000 along an evenly-spaced grid on the scale of logp\log p. We compare SBART to BART, DART, gradient boosted decision trees (xgboost), the lasso (glmnet), and random forests (randomForest). A similar experiment was conducted by Linero, (2016), who showed that the sparsity inducing prior used by DART resulted in substantial performance gains over BART. The purpose of this experiment is to demonstrate the further gains which are possible when the smoothness of (9) is also leveraged.

Methods are compared by root mean-squared error, RMSE={{f(x)f^(x)}2 dx}1/2,\operatorname{RMSE}=\{\int\{f(x)-\widehat{f}(x)\}^{2}\ dx\}^{1/2}, which is approximated by Monte-Carlo integration. For the Bayesian procedures, we take f^\widehat{f} to be the pointwise posterior mean of ff. DART, and SBART use their respective default priors and were fit using 2500 warmup iterations and 2500 sampling iterations, while cross-validation is used to tune the hyperparameters for BART. The non-Bayesian methods were tuned using cross validation for each replication of the experiment.

Results are given in Figure 5. Among the methods considered, SBART performs the best, obtaining a sizeable improvement over DART in both the low noise and high noise settings. Due to the use of a sparsity-inducing prior, both DART and SBART are largely invariant to the number of nuisance predictors, while random forests, BART-CV, and boosting have errors increasing in logp\log p. The lasso also has stable, albeit poor, performance as pp increases.

We now compare SBART to DART for the task of variable selection (see Linero,, 2016 for a detailed comparison of DART, BART, random forests, and the lasso which found DART to perform best among these methods). Our goal is to assess whether leveraging smoothness can improve on the good variable selection properties of DART. We modify Friedman’s function, taking instead

where λ\lambda is a tuning parameter for the simulation. A variable is included if its posterior inclusion probability exceeds 50%. We consider λ[0.1,1]\lambda\in[0.1,1]. As measures of accuracy, we consider precision =TP/(TP+FP)=TP/(TP+FP), recall =TP/(TP+FN)=TP/(TP+FN), and F1F_{1} score (harmonic mean of precision and recall), where TP,FPTP,FP and FNFN denote the number of true positives, false positives, and false negatives respectively.

Results for 2020 replications and σ2=1\sigma^{2}=1 are given in Figure 6, along with the average RMSE. First, we see that both DART and SBART have a precision which is roughly constant in λ\lambda, with SBART performing uniformly better. This makes intuitive sense, as varying λ\lambda should have little influence on whether irrelevant predictors are selected. The precision of both methods is heavily dependent on λ\lambda, and we see that SBART is generally capable of detecting smaller signal levels; at its largest, the difference in recall is about 10%. Once the signal level is high enough, both methods detect all relevant predictors consistently. The F1F_{1} score reflects a mixture of these two behaviors. Perhaps most interesting is the influence of λ\lambda on the RMSE. As λ\lambda increases the performance of DART degrades while SBART remains roughly constant. Intuitively this is because, as λ\lambda increases, DART must use an increasing number of branches to capture the additional signal in the data, while SBART is capable of representing the effects corresponding to (x4,x5)(x_{4},x_{5}) with fewer parameters.

2 Approximation of non-smooth and locally smooth functions

A potential concern with the use of soft decision trees is that they may not be able to capture fine-scale variability in the underlying regression function. An extreme example of this is when ff is a step function. We consider the regression function f(x)=24I(x1<0.5).f(x)=2-4I(x_{1}<0.5). In this case, one might expect soft decision trees to perform suboptimally relative to hard decision trees because a soft decision tree must model the jump at in a continuous fashion.

Surprisingly, ensembles of soft decision trees can outperform ensembles of hard decision trees even in this case. Figure 7 shows fits of BART and SBART to n=250n=250 data points and a high signal of σ=0.1\sigma=0.1. We see that both methods can capture the large jump discontinuity at x1=0.5x_{1}=0.5. SBART performs better away from the discontinuity, however, because the level of smoothness is allowed to vary at different points in the covariate space. The trees responsible for the jump discontinuity have small τt\tau_{t}’s to effectively replicate a step function, while elsewhere the trees have large τt\tau_{t}’s to allow the function to essentially be constant.

The ability to select different τt\tau_{t}’s allows SBART to obtain a locally-adaptive behavior. To illustrate this, Figure 8 gives the fit of BART and SBART when f(x)f(x) is a highly localised Daubechies wavelet of smoothness order 1010. We see that SBART is capable of adapting both to the constant regions outside of the support of the wavelet, and the fast oscillatory behavior within the support of the wavelet. The fit of BART, by contrast, possesses many artifacts outside the support of the wavelet, and possesses generally wider credible bands.

3 Benchmark datasets

We compare the SBART to various tree-based and non-tree-based methods on several benchmark datasets. We consider BART, DART, the LASSO (glmnet), random forests (randomForest), and gradient boosted decision trees (xgboost). The parameters for the non-Bayesian procedures were chosen, separately for each fit, using the caret package. Default priors (with T=50T=50) for SBART and DART were used; additionally, we consider selecting the hyperparameters of SBART and BART by cross validation.

Ten datasets are considered. Aside from bbb and wipp, the datasets are a subset of those considered by Kim et al., (2007). While we consider only a subset of these datasets, no datasets considered for this experiment were omitted. Attributes of these datasets are presented in Table 1. The response in each dataset was transformed to be approximately Gaussian. The bbb, triazines, and wipp datasets were also considered by Linero, (2016) to illustrate features of the sparsity-inducing priors for decision tree methods.

Results of the experiment are given in Table 1. Methods are compared by an estimate of their root mean predictive error obtained using 55-fold cross-validation, with the results averaged over 2020 replications of the cross-validation. For each experiment, the root mean predictive error for each method is normalized by the root mean predictive error for SBART-CV, so that scores higher than 1.00 correspond to worse performance than SBART and scores lower than 1.00 correspond to better performance than SBART.

SBART/SBART-CV is seen to perform very well in practice, attaining the best performance on 8 out of the 10 datasets. The results here are consistent with the general observation of Chipman et al., (2010) that BART outperforms gradient boosting and random forests in aggregate over many datasets. Two datasets stand out as particularly interesting. First, for the tecator dataset, SBART outperforms all other methods by a very wide margin, indicating that leveraging smoothness for this dataset is essential to attaining good performance. Second, the only dataset for which SBART-CV substantially outperforms SBART is the hatco dataset, where tuning the number of trees is required to attain optimal performance. This indicates that, for most datasets, the default SBART procedure works very well, but that if one wants to be absolutely sure of optimal performance they should tune TT.

Discussion

We have introduced a novel Bayesian sum-of-trees framework and demonstrated that it is capable of attaining a meaningful improvement over existing methods both in simulated experiments and in practice. This was accomplished by incorporating soft decision trees and sparsity-inducing priors. We also provided theoretical support in the form of near-optimal results for posterior concentration, adaptively over smoothness classes, when f0(x)f_{0}(x) is a sparse, or additive, function.

While this paper has focused only on the case of nonparametric regression, the proposed methodology extends in a straight-forward manner to other settings. For example, the case of binary classification can be addressed in the usual way via a probit link and data augmentation.

Our theoretical results concern the rate of convergence of the posterior. Another relevant question is whether the model can consistently estimate the model support. That is, one can ask under what conditions Π(S=S0D)1\Pi(S=S_{0}\mid\mathcal{D})\to 1 as nn\to\infty, where S=\{p:\text{predictorpappears in the ensemble}\} and S_{0}=\{p:\text{f_{0}dependsondepends onp}\}. This is an interesting area for future research.

Software which implements SBART is available online at https://github.com/theodds/SoftBART, and is undergoing active development. Our code is based on the implementation of BART in the BayesTree package. Given enough optimization, we hope that our implementation could reach speeds within a modest factor of existing highly-optimized implementations of BART (Kapelner and Bleich,, 2016).

Appendix A Proof of Lemma 1

Let Kτ(d)(x1,x2,,xd)=τdj=1dK(xj/τ)K^{(d)}_{\tau}(x_{1},x_{2},\ldots,x_{d})=\tau^{-d}\,\prod_{j=1}^{d}K(x_{j}/\tau) denote a dd-dimensional tensor product of the rescaled one dimensional kernel KK in Assumption G, where recall that τ\tau is the bandwidth parameter in the gating function. Let CK:=K(x)dx\displaystyle C_{K}:\,=\int K(x)\,dx denote the normalization constant of KK, so that we can write K=CKK~K=C_{K}\widetilde{K}, and the rescaled kernel function K~\widetilde{K} is has an unit normalization constant. Also write K~τ(d)=Kτ(d)/CKd\widetilde{K}^{(d)}_{\tau}=K^{(d)}_{\tau}/C_{K}^{d}. It is easy to verify that K~\widetilde{K} also satisfies the two conditions in Assumption G, though it may not be associated to any ψ~\widetilde{\psi}.

Our proof is composed of three steps. First, we provide error bound estimates of approximating any α\alpha-smooth function by a convolution Kτ(d)gK^{(d)}_{\tau}*g with some carefully constructed function gg for any τ>0\tau>0. Second, we show that any continuous convolution Kτ(d)gK^{(d)}_{\tau}*g can be approximated by a discrete sum t=1TμtKτ(d)(xt)\sum_{t=1}^{T}\mu_{t}K^{(d)}_{\tau}(\cdot-x_{t}) with at most O(τd)O(\tau^{-d}) atoms. Lastly, we provide an error bound estimate on approximating this sum of kernels with a sum of soft decision trees by identifying each kernel component Kτ(d)(xt)K^{(d)}_{\tau}(\cdot-x_{t}) as one particular leaf in the ttth soft decision tree g(x;Tt,Mt)g(x;\,\mathcal{T}_{t},\mathcal{M}_{t}) whose depth is at most 2d2d via splitting at most 2d2d times, for t=1,,Tt=1,\ldots,T.

Step 1: This step is follows as a direct result of the following lemma, which is adapted from Lemma 3.4 of De Jonge et al., (2010).

Under Assumption G, for any f0Cα,R(d)f_{0}\in\mathcal{C}^{\alpha,R}(^{d}), there exist some constants (M1,M2)(M_{1},M_{2}) independent of τ\tau, and a function Tb,τf0T_{b,\tau}f_{0} satisfying Tb,τf0M1R\|T_{b,\tau}f_{0}\|_{\infty}\leq M_{1}R, such that

where g=CKdTb,τf0g=C_{K}^{-d}\,T_{b,\tau}f_{0} satisfies gM1R\|g\|_{\infty}\leq M_{1}^{\prime}R, with M1=CKdM1M_{1}^{\prime}=C_{K}^{-d}M_{1} independent of τ\tau.

Step 2: This step generalizes the theory of approximating a continuous one-dimensional density function from by a mixture of Gaussians developed in Ghosal and Van Der Vaart, (2007) to by a location mixture of any kernel KK satisfying Assumption G. We also extend their result from density estimation to general function estimation as demanded in our regression setting, where the target function ff may not integrate to one and can take negative values. First, we state an extension of Lemma 3.1 of Ghosal and Van Der Vaart, (2001) from dimension one to dimension dd, and from the Gaussian kernel to any kernel KK satisfying Assumption G.

Under Assumption G, for any probability density function p0p_{0} on d^{d}, any ϵ>0\epsilon>0, and τ(0,1)\tau\in(0,1), there is a discrete measure Pτ=t=1TrtδxtP_{\tau}=\sum_{t=1}^{T}r_{t}\,\delta_{x_{t}} with TC1τdlogd(1/τ)T\leq C_{1}\tau^{-d}\log^{d}(1/\tau) support points such that t=1Trt=1\sum_{t=1}^{T}r_{t}=1 and

where (C1,D1)(C_{1},D_{1}) are independent of τ\tau and KK.

We only sketch the key difference in the proof from Lemma 3.1 of Ghosal and Van Der Vaart, (2001) in the one-dimensional case, and a proof for extending the result from one-dimensional case to the multi-dimensional case follows similar lines as in the proof of Theorem 7 in Shen et al., (2013) (by replacing the Gaussian kernel with the kernel KK).

The only key property of the Gaussian kernel used in the proof of Lemma 3.1 of Ghosal and Van Der Vaart, (2001) is in bounding the remainder term in the kk-th order Taylor expansion in their equation (3.11), where they used the fact that for any k1k\geq 1, the kkth order derivative of the standard Gaussian density function ϕ(x)=(2π)1/2ex2/2\phi(x)=(2\pi)^{-1/2}\,e^{-x^{2}/2} at the origin x=0x=0 satisfies the bound

for some sufficiently large constant C2>0C_{2}>0 (since we only focus on the approximation error over the unit interval $,wedonotneedtoincludetheadditional, we do not need to include the additional\log(1/\varepsilon)terminequation(3.11)therein).Therefore,itsufficestoverifyasimilarexponentiallydecayboundfortheterm in equation (3.11) therein). Therefore, it suffices to verify a similar exponentially decay bound for thekthorderderivativeoffunctionth order derivative of functionK_{\kappa}:\,=\tau^{-1}\,K(\cdot/\kappa)forsomesufficientlylargenumberfor some sufficiently large number\kappa>0dependingondepending onC.Infact,underAssumptionG,. In fact, under Assumption G,K(\cdot)canbeanalyticallyextendtothestripcan be analytically extend to the strip\mathcal{S}(\rho)inthecomplexplane(forsimplicity,weusethesamenotationin the complex plane (for simplicity, we use the same notationK$ to denote this extension), which implies by applying Cauchy’s integral formula that

where the closed path Γκ\Gamma_{\kappa} is chosen as a counter-clockwise circle centering at the origin with radius κρ\kappa\,\rho. Since KK is uniformly bounded on the path Γκ\Gamma_{\kappa} by Assumption G, we can further deduce that

holds as long as κρ1exp{C2}\kappa\geq\rho^{-1}\,\exp\{C_{2}\}, where DD is some constant only depending on KK, which completes the proof. ∎

With this lemma on the density function approximation as our preparation, we now return to the problem of approximating any general bounded function gg over d^{d}. Notice that we always have the decomposition g=g+gg=g_{+}-g_{-} where g+=max{0,g(x)}g_{+}=\max\{0,g(x)\} and g(x)=max{0,g(x)}g_{-}(x)=\max\{0,-g(x)\} are the positive parts and negative parts of gg, respectively, and both of them are nonnegative and bounded over d^{d}. Let A+=dg+(x)dxgA_{+}=\int_{^{d}}g_{+}(x)\,dx\leq\|g\|_{\infty} andA=dg(x)dxgA_{-}=\int_{^{d}}g_{-}(x)\,dx\leq\|g\|_{\infty}. It is obvious that g+/A+g_{+}/A_{+} and g/Ag_{-}/A_{-} are two legitimate pdfs over d^{d}. By applying Lemma 6, we can find two discrete measures P+=t=1T+rt+δxt+P_{+}=\sum_{t=1}^{T_{+}}r^{+}_{t}\,\delta_{x^{+}_{t}} and P=t=1TrtδxtP_{-}=\sum_{t=1}^{T_{-}}r^{-}_{t}\,\delta_{x^{-}_{t}} such that

for any xdx\in^{d} and max{T+,T}Cτdlogd(1/ε)\max\{T_{+},\,T_{-}\}\leq C\,\tau^{-d}\log^{d}(1/\varepsilon). Now we combine these two discrete measures into a new discrete signed measure P0=t=1T+A+rt+Kτ(d)(xxt+)+t=1T(Art)Kτ(d)(xxt)P_{0}=\sum_{t=1}^{T_{+}}A_{+}\,r^{+}_{t}\,K^{(d)}_{\tau}(x-x^{+}_{t})+\sum_{t=1}^{T_{-}}(-A_{-}\,r^{-}_{t})\,K^{(d)}_{\tau}(x-x^{-}_{t}), which will be denoted as \sum_{t=1}^{T}\mu_{t}\,K^{(d)}_{\tau}\big{(}\cdot-x_{t}). Then TT+T+2Cτdlogd(1/ε)T\leq T_{-}+T_{+}\leq 2C\,\tau^{-d}\log^{d}(1/\varepsilon) and

for all xdx\in^{d}. Moreover, we have t=1TμtA+t=1T+rt++At=1Trt2g\sum_{t=1}^{T}|\mu_{t}|\leq A_{+}\sum_{t=1}^{T_{+}}r_{t}^{+}+A_{-}\sum_{t=1}^{T_{-}}r_{t}^{-}\leq 2\|g\|_{\infty}.

Finally, a combination of steps 1-3 together yields a proof of the lemma.

Appendix B Proof of Theorem 2

For convenience, we use the same notation CC to denote some constant independent of (n,p)(n,p), whose value may change from line to line. Without loss of generality, we may assume that f0f_{0} depends only on its first dd coordinates. Applying Lemma 1, we obtain that for some parameters τ\tau and ε\varepsilon to be determined later, there exists some f~=t=1T~g(x;T~t,M~t)\widetilde{f}=\sum_{t=1}^{\widetilde{T}}g(x;\,\widetilde{\mathcal{T}}_{t},\widetilde{\mathcal{M}}_{t}) such that T~Cτdlogd(ε1)\widetilde{T}\leq C\,\tau^{-d}\log^{d}(\varepsilon^{-1}), f~f0C(τα+ετd)\|\widetilde{f}-f_{0}\|_{\infty}\leq C\,(\tau^{\alpha}+\varepsilon\,\tau^{-d}), and the total number of splits (all are along the first dd coordinates) across all trees are at most 2dT~2d\,\widetilde{T} (2dT~2d\,\widetilde{T} many leaves in total).

Recall that our prior over the sum of soft decision tree function ff is specified in a hierarchical manner: first, we specify the number TT of trees and the tree topology T={T1,,TT}\mathcal{T}=\{\mathcal{T}_{1},\ldots,\mathcal{T}_{T}\}; second, conditional on these we decide the coordinates in all splits across all the decision trees; third, we sample the independent splitting locations along all the selected coordinates; last, we sample bandwidth parameters τt\tau_{t} associated with each tree and parameters μ\mu’s associated with all leaves across the trees. We denote by T~\widetilde{T} and T~\widetilde{\mathcal{T}} the corresponding number of trees and the tree topology of f~\widetilde{f}.

then we have the following perturbation error bound by applying the triangle inequality,

Now we apply Theorem 2.1 in Yang and Dunson, (2014) on the prior concentration probability for high-dimensional Dirichlet distribution and Assumption P3 to obtain that the splitting proportion vector s=(s1,,sp)s=(s_{1},\ldots,s_{p}) satisfies

This combined with the fact that each tree has depth at most 2d2d and Assumption P5 implies that the prior probability of T=T~\mathcal{T}=\widetilde{\mathcal{T}} given T=T~T=\widetilde{T} can be lower bounded by

where we have used the fact that NCτdlogd(ε1)N\leq C\,\tau^{-d}\,\log^{d}(\varepsilon^{-1}) in the last step. The perturbation error bound in (10) implies

where in the last step we applied Assumptions P2 and P4, and used the fact that uμ~uCτd\sum_{u}|\widetilde{\mu}_{u}|\leq C\,\tau^{-d} for some constant CC only dependent of f0f_{0} (due to Lemma 1). Putting all pieces together and using Assumption P1 and the properties of f~\widetilde{f}, we obtain

Therefore, by choosing τ=(logd+1n/n)1/(2α+d)\tau=(\log^{d+1}n/n)^{-1/(2\alpha+d)}, δ=τα\delta=\tau^{\alpha}, ε=τd+α\varepsilon=\tau^{d+\alpha}, we can obtain the claimed prior concentration probability lower bound as \Pi\big{[}\|f-f_{0}\|_{\infty}\leq C\,\varepsilon_{n}\big{]}\geq\exp\{-C\,n\,\varepsilon_{n}^{2}\}.

Appendix C Proof of Theorem 4

Using Theorem 3.2 in Bhattacharya et al., (2016) (see also Section 4.1 therein), it suffices to show that \Pi\big{[}\|f-f_{0}\|_{\infty}\leq C\,\varepsilon_{n}\big{]}\geq\exp\{-C\,n\,\varepsilon_{n}^{2}\}. The proof of this is almost the same as that of Theorem 2, the only difference is that now we apply Lemma 1 to find VV functions {f~v:v=1,,V}\{\widetilde{f}_{v}:\,v=1,\ldots,V\}, where f~v\widetilde{f}_{v} contains T~v\widetilde{T}_{v} trees and approximates the vvth additive component f0,vf_{0,v} in f0f_{0} for v=1,,Vv=1,\ldots,V, and set f~=v=1Vf~v\widetilde{f}=\sum_{v=1}^{V}\widetilde{f}_{v}. Due to the additive structure in our sum of soft decision tree model, we can always write f=v=1Vfvf=\sum_{v=1}^{V}f_{v} where fvf_{v} collects T~v\widetilde{T}_{v} trees and has the same sum of soft decision tree prior structure when conditioning on the total number of trees T=v=1VT~vT=\sum_{v=1}^{V}\widetilde{T}_{v}, and the conditional priors of (f1,,fv)(f_{1},\ldots,f_{v}) given T=v=1VT~vT=\sum_{v=1}^{V}\widetilde{T}_{v} and the splitting proportion vector ss are independent. Let \mathcal{S}=\big{\{}s_{j}\geq(2d)^{-1}\mbox{ for }j=1,\ldots,d,\mbox{ and }\sum_{j=d+1}^{p}s_{j}\leq d^{-1}\big{\}} denote the event in inequality (11) with d:=v=1Vdvd:\,=\sum_{v=1}^{V}d_{v}. Therefore, we obtain by applying Assumption P, the prior concentration bound (11) for ss, and Theorem 2 for a single fvf_{v} (choose parameters τv,δv,εv\tau_{v},\delta_{v},\varepsilon_{v} for each fvf_{v} as in the proof of Theorem 2) that

where constants C,C>0C,C^{\prime}>0, εn,v=nαv/(2αv+dv)(logn)tv+n1dvlogp\varepsilon_{n,v}=n^{-\alpha_{v}/(2\alpha_{v}+d_{v})}(\log n)^{t_{v}}+\sqrt{n^{-1}\,d_{v}\,\log p} and tvαv(dv+1)/(2αv+dv)t_{v}\geq\alpha_{v}(d_{v}+1)/(2\alpha_{v}+d_{v}).

Acknowledgments

This work was partially supported by NSF grant DMS-1712870 and DOD grant SOT-FSU-FATs-16-06.

References