Particle Gibbs for Bayesian Additive Regression Trees

Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh

Introduction

Ensembles of regression trees are at the heart of many state-of-the-art approaches for nonparametric regression (Caruana and Niculescu-Mizil, 2006), and can be broadly classified into two families: randomized independent regression trees, wherein the trees are grown independently and predictions are averaged to reduce variance, and additive regression trees, wherein each tree fits the residual not explained by the remainder of the trees. In the former category are bagged decision trees (Breiman, 1996), random forests (Breiman, 2001), extremely randomized trees (Geurts et al., 2006), and many others, while additive regression trees can be further categorized into those that are fit in a serial fashion, like gradient boosted regression trees (Friedman, 2001), and those fit in an iterative fashion, like Bayesian additive regression trees (BART) (Chipman et al., 2010) and additive groves (Sorokina et al., 2007).

Among additive approaches, BART is extremely popular and has been successfully applied to a wide variety of problems including protein-DNA binding, credit risk modeling, automatic phishing/spam detection, and drug discovery (Chipman et al., 2010). Additive regression trees must be regularized to avoid overfitting (Friedman, 2002): in BART, over-fitting is controlled by a prior distribution preferring simpler tree structures and non-extreme predictions at leaves. At the same time, the posterior distribution underlying BART delivers a variety of inferential quantities beyond predictions, including credible intervals for those predictions as well as a measures of variable importance. At the same time, BART has been shown to achieve predictive performance comparable to random forests, boosted regression trees, support vector machines, and neural networks (Chipman et al., 2010).

The standard inference algorithm for BART is an iterative Bayesian backfitting Markov Chain Monte Carlo (MCMC) algorithm (Hastie et al., 2000). In particular, the MCMC algorithm introduced by Chipman et al. (2010) proposes local changes to individual trees. This sampler can be computationally expensive for large datasets, and so recent work on scaling BART to large datasets (Pratola et al., 2013) considers using only a subset of the moves proposed by Chipman et al. (2010). However, this smaller collection of moves has been observed to lead to poor mixing (Pratola, 2013) which in turn produces an inaccurate approximation to the posterior distribution. While a poorly mixing Markov chain might produce a reasonable prediction in terms of mean squared error, BART is often used in scenarios where its users rely on posterior quantities, and so there is a need for computationally efficient samplers that mix well across a range of hyper-parameter settings.

In this work, we describe a novel sampler for BART based on (1) the Particle Gibbs (PG) framework proposed by Andrieu et al. (2010) and (2) the top-down sequential Monte Carlo algorithm for Bayesian decision trees proposed by Lakshminarayanan et al. (2013). Loosely speaking, PG is the particle version of the Gibbs sampler where proposals from the exact conditional distributions are replaced by conditional versions of a sequential Monte Carlo (SMC) algorithm. The complete sampler follows the Bayesian backfitting MCMC framework for BART proposed by Chipman et al. (2010); the key difference is that trees are sampled using PG instead of the local proposals used by Chipman et al. (2010). Our sampler, which we refer to as PG-BART, approximately samples complete trees from the conditional distribution over a tree fitting the residual. As the experiments bear out, the PG-BART sampler explores the posterior distribution more efficiently than samplers based on local moves. Of course, one could easily consider non-local moves in a Metropolis–Hastings (MH) scheme by proposing complete trees from the tree prior, however these moves would be rejected, leading to slow mixing, in high-dimensional and large data settings. The PG-BART sampler succeeds not only because non-local moves are considered, but because those non-local moves have high posterior probability. Another advantage of the PG sampler is that it only requires one to be able to sample from the prior and does not require evaluation of tree prior in the acceptance ratio unlike (local) MH—hence PG can be computationally efficient in situations where the tree prior is expensive (or impossible) to compute, but relatively easier to sample from.

The paper is organized as follows: in section 2, we review the BART model; in section 3, we review the MCMC framework proposed by Chipman et al. (2010) and describe the PG sampler in detail. In section 4, we present experiments that compare the PG sampler to existing samplers for BART.

Model and notation

In this section, we briefly review decision trees and the BART model. We refer the reader to the paper of Chipman et al. (2010) for further details about the model. Our notation closely follows their’s, but also borrows from Lakshminarayanan et al. (2013).

2 Decision tree

3 Likelihood specification for BART

BART is a sum-of-trees model, i.e., BART assumes that the label yy for an input x\bm{x} is generated by an additive combination of mm regression trees. More precisely,

where eN(0,σ2)e\sim\mathcal{N}(0,\sigma^{2}) is an independent Gaussian noise term with zero mean and variance σ2\sigma^{2}. Hence, the likelihood for a training instance is

and the likelihood for the entire training dataset is

4 Prior specification for BART

The parameters of the BART model are the noise variance σ2\sigma^{2} and the regression trees (Tj,μj)(\mathcal{T}_{j},\bm{\mu}_{j}) for j=1,,mj=1,\dotsc,m. The conditional independencies in the prior are captured by the factorization

The prior over decision trees p(Tj={Tj,κj,τj}X)p(\mathcal{T}_{j}=\{\mathsf{T}_{j},\kappa_{j},\tau_{j}\}|\bm{X}) can be described by the following generative process (Chipman et al., 2010; Lakshminarayanan et al., 2013): Starting with a tree comprised only of a root node ϵ\epsilon, the tree is grown by deciding once for every node η\eta whether to 1) stop and make η\eta a leaf, or 2) split, making η\eta an internal node, and add η0\eta 0 and η1\eta 1 as children. The same stop/split decision is made for the children, and their children, and so on. Let ρη\rho_{\eta} be a binary indicator variable for the event that η\eta is split. Then every node η\eta is split independently with probability

where the indicator \mathds1[...]\mathds{1}[...] forces the probability to be zero when every possible split of η\eta is invalid, i.e., one of the children nodes contains no training data. Informally, the hyperparameters αs(0,1)\alpha_{s}\in(0,1) and βs[0,)\beta_{s}\in[0,\infty) control the depth and number of nodes in the tree. Higher values of αs\alpha_{s} lead to deeper trees while higher values of βs\beta_{s} lead to shallower trees.

In the event that a node η\eta is split, the dimension κη\kappa_{\eta} and location τη\tau_{\eta} of the split are assumed to be drawn independently from a uniform distribution over the set of all valid splits of η\eta. The decision tree prior is thus

where U()\mathcal{U}(\cdot) denotes the probability mass function of the uniform distribution over dimensions that contain at least one valid split, and U(κη)\mathcal{U}(\cdot|\kappa_{\eta}) denotes the probability density function of the uniform distribution over valid split locations along dimension κη\kappa_{\eta} in block BηB_{\eta}.

Given a decision tree T\mathcal{T}, the parameters associated with its leaves are independent and identically distributed normal random variables, and so

The mean mμm_{\mu} and variance σμ2\sigma_{\mu}^{2} hyperparameters are set indirectly: Chipman et al. (2010) shift and rescale the labels YY such that ymin=0.5y_{\min}=-0.5 and ymax=0.5y_{\max}=0.5, and set mμ=0m_{\mu}=0 and σμ=0.5/km\sigma_{\mu}=0.5/k\sqrt{m}, where k>0k>0 is an hyperparameter. This adjustment has the effect of keeping individual node parameters μη\mu_{\eta} small; the higher the values of kk and mm, the greater the shrinkage towards the mean mμm_{\mu}.

The prior p(σ2)p(\sigma^{2}) over the noise variance is an inverse gamma distribution. The hyperparameters ν\nu and qq indirectly control the shape and rate of the inverse gamma prior over σ2\sigma^{2}. Chipman et al. (2010) compute an overestimate of the noise variance σ^2\widehat{\sigma}^{2}, e.g., using the least-squares variance or the unconditional variance of YY, and, for a given shape parameter ν\nu, set the rate such that Pr(σσ^)=q\Pr(\sigma\leq\widehat{\sigma})=q, i.e., the qqth quantile of the prior over σ\sigma is located at σ^\widehat{\sigma}.

Chipman et al. (2010) recommend the default values: ν=3,q=0.9,k=2,m=200\nu=3,q=0.9,k=2,m=200 and αs=0.95,βs=2.0\alpha_{s}=0.95,\beta_{s}=2.0. Unless otherwise specified, we use this default hyperparameter setting in our experiments.

5 Sequential generative process for trees

Lakshminarayanan et al. (2013) present a sequential generative process for the tree prior p(TX)p(\mathcal{T}|\bm{X}), where a tree T\mathcal{T} is generated by starting from an empty tree T(0)\mathcal{T}_{(0)} and sampling a sequence T(1),T(2),\mathcal{T}_{(1)},\mathcal{T}_{(2)},\dotsc of partial trees. This sequential representation is used as the scaffolding for their SMC algorithm. Due to space limitations, we can only briefly review the sequential process. The interested reader should refer to the paper by Lakshminarayanan et al. (2013): Let T(t)=T(t),κ(t),τ(t),E(t)\mathcal{T}_{(t)}=\mathsf{T}_{(t)},\bm{\kappa}_{(t)},\bm{\tau}_{(t)},E_{(t)} denote the partial tree at stage tt, where E(t)E_{(t)} denotes the ordered set containing the list of nodes eligible for expansion at stage tt. At stage tt, the generative process samples T(t)\mathcal{T}_{(t)} from Prt(T(t1))\Pr_{t}(\cdot\,|\,\mathcal{T}_{(t-1)}) as follows: the first element of EE, say η\eta, is popped and is stochastically assigned to be an internal node or a leaf node with probability given by (2.4). If η\eta is chosen to be an internal node, we sample the split dimension κη\kappa_{\eta} and split location τη\tau_{\eta} uniformly among the valid splits, and append η0\eta 0 and η1\eta 1 to EE. Thus, the tree is expanded in a breadth-wise fashion and each node is visited just once. The process terminates when EE is empty. Figure 1 presents a cartoon of the sequential generative process.

Posterior inference for BART

In this section, we briefly review the MCMC framework proposed in (Chipman et al., 2010), discuss limitations of existing samplers and then present our PG sampler.

Given the likelihood and the prior, our goal is to compute the posterior distribution

Chipman et al. (2010) proposed a Bayesian backfitting MCMC to sample from the BART posterior. At a high level, the Bayesian backfitting MCMC is a Gibbs sampler that loops through the trees, sampling each tree Tj\mathcal{T}_{j} and associated parameters μj\bm{\mu}_{j} conditioned on σ2\sigma^{2} and the remaining trees and their associated parameters {Tj,μj}jj\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j}, and samples σ2\sigma^{2} conditioned on all the trees and parameters {Tj,μj}j=1m\{\mathcal{T}_{j},\bm{\mu}_{j}\}_{j=1}^{m}. Let Tj(i),μj(i),\mathcal{T}_{j}^{(i)},\mu_{j}^{(i)}, and σ2(i)\sigma^{2(i)} respectively denote the values of Tj,μj\mathcal{T}_{j},\mu_{j} and σ2\sigma^{2} at the ithi^{th} MCMC iteration. Sampling σ2\sigma^{2} conditioned on {Tj,μj}j=1m\{\mathcal{T}_{j},\bm{\mu}_{j}\}_{j=1}^{m} is straightforward due to conjugacy. To sample Tj,μj\mathcal{T}_{j},\bm{\mu}_{j} conditioned on the other trees {Tj,μj}jj\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j}, we first sample Tj{Tj,μj}jj,σ2\mathcal{T}_{j}|\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j},\sigma^{2} and then sample μjTj,{Tj,μj}jj,σ2\bm{\mu}_{j}|\mathcal{T}_{j},\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j},\sigma^{2}. (Note that μj\bm{\mu}_{j} is integrated out while sampling Tj\mathcal{T}_{j}.) More precisely, we compute the residual

Using the residual Rj(i)R_{j}^{(i)} as the target, Chipman et al. (2010) sample Tj(i)\mathcal{T}_{j}^{(i)} by proposing local changes to Tj(i1)\mathcal{T}_{j}^{(i-1)}. Finally, μj\bm{\mu}_{j} is sampled from a Gaussian distribution conditioned on Tj,{Tj,μj}jj,σ2\mathcal{T}_{j},\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j},\sigma^{2}. The procedure is summarized in Algorithm 1.

2 Existing samplers for BART

To sample Tj\mathcal{T}_{j}, Chipman et al. (2010) use the MCMC algorithm proposed by Chipman et al. (1998). This algorithm, which we refer to as CGM. CGM is a Metropolis-within-Gibbs sampler that randomly chooses one of the following four moves: grow (which randomly chooses a leaf node and splits it further into left and right children), prune (which randomly chooses an internal node where both the children are leaf nodes and prunes the two leaf nodes, thereby making the internal node a leaf node), change (which changes the decision rule at a randomly chosen internal node), swap (which swaps the decision rules at a parent-child pair where both the parent and child are internal nodes). There are two issues with the CGM sampler: (1) the CGM sampler makes local changes to the tree, which is known to affect mixing when computing the posterior over a single decision tree (Wu et al., 2007). Chipman et al. (2010) claim that the default hyper-parameter values encourage shallower trees and hence mixing is not affected significantly. However, if one wishes to use BART on large datasets where individual trees are likely to be deeper, the CGM sampler might suffer from mixing issues. (2) The change and swap moves in CGM sampler are computationally expensive for large datasets that involve deep trees (since they involve re-computation of all likelihoods in the subtree below the top-most node affected by the proposal). For computational efficiency, Pratola et al. (2013) propose using only the grow and prune moves; we will call this the GrowPrune sampler. However, as we illustrate in section 4, the GrowPrune sampler can inefficiently explore the posterior in scenarios where there are multiple possible trees that explain the observations equally well. In the next section, we present a novel sampler that addresses both of these concerns.

3 PG sampler for BART

Recall that Chipman et al. (2010) sample Tj(i)\mathcal{T}_{j}^{(i)} using Rj(i)R_{j}^{(i)} as the target by proposing local changes to Tj(i1)\mathcal{T}_{j}^{(i-1)}. It is natural to ask if it is possible to sample a complete tree Tj(i)\mathcal{T}_{j}^{(i)} rather than just local changes. Indeed, this is possible by marrying the sequential representation of the tree proposed by Lakshminarayanan et al. (2013) (see section 2.5) with the Particle Markov Chain Monte Carlo (PMCMC) framework (Andrieu et al., 2010) where an SMC algorithm (particle filter) is used as a high-dimensional proposal for MCMC. The PG sampler is implemented using the so-called conditional SMC algorithm (instead of the Metropolis-Hastings samplers described in Section 3.2) in line 7 of Algorithm 1. At a high level, the conditional SMC algorithm is similar to the SMC algorithm proposed by Lakshminarayanan et al. (2013), except that one of the particles is clamped to the current tree Tj(i1)\mathcal{T}_{j}^{(i-1)}.

Before describing the PG sampler, we derive the conditional posterior Tj{Tj,μj}jj,σ2,Y,X\mathcal{T}_{j}|\{\mathcal{T}_{j^{\prime}},\bm{\mu}_{j^{\prime}}\}_{j^{\prime}\neq j},\sigma^{2},Y,\bm{X}. Let N(η)N(\eta) denote the set of data point indices n{1,,N}n\in\{1,\dots,N\} such that xnBη\bm{x}_{n}\in B_{\eta}. Slightly abusing the notation, let RN(η)R_{N({\eta})} denote the vector containing residuals of data points in node η\eta. Given R:=Yjjg(X;Tj,μj)R:=Y-\sum_{j^{\prime}\neq j}g(\bm{X};\mathcal{T}_{j^{\prime}},\mu_{j^{\prime}}), it is easy to see that the conditional posterior over Tj,μj\mathcal{T}_{j},\bm{\mu}_{j} is given by

Let π(Tj)\pi(\mathcal{T}_{j}) denote the conditional posterior over Tj\mathcal{T}_{j}. Integrating out μ\bm{\mu} and using (5) for p(TjX)p(\mathcal{T}_{j}|\bm{X}), the conditional posterior π(Tj)\pi(\mathcal{T}_{j}) is

where p(RN(η)σ2,mμ,σμ2)p(R_{N(\eta)}|\sigma^{2},m_{\mu},\sigma_{\mu}^{2}) denotes the marginal likelihood at a node η\eta, given by

The goal is to sample from the (conditional) posterior distribution π(Tj)\pi(\mathcal{T}_{j}). Lakshminarayanan et al. (2013) presented a top-down particle filtering algorithm that approximates the posterior over decision trees. Since this SMC algorithm can sample complete trees, it is tempting to substitute an exact sample from π(Tj)\pi(\mathcal{T}_{j}) with an approximate sample from the particle filter. However, Andrieu et al. (2010) observed that this naive approximation does not leave the joint posterior distribution (7) invariant, and so they proposed instead to generate a sample using a modified version of the SMC algorithm, which they called the conditional-SMC algorithm, and demonstrated that this leaves the joint distribution (7) invariant. (We refer the reader to the paper by Andrieu et al. (2010) for further details about the PMCMC framework.) By building off the top-down particle filter for decision trees, we can define a conditional-SMC algorithm for sampling from π(Tj)\pi(\mathcal{T}_{j}).

The conditional-SMC algorithm is an MH kernel with π(Tj)\pi(\mathcal{T}_{j}) as its stationary distribution. To reduce clutter, let T\mathcal{T}^{*} denote the old tree and T\mathcal{T} denote the tree we wish we to sample. The conditional-SMC algorithm samples T\mathcal{T} from a CC-particle approximation of π(T)\pi(\mathcal{T}), which can be written as c=1Cw(c)δT(c)\sum_{c=1}^{C}\overline{w}{(c)}\delta_{\mathcal{T}{(c)}} where T(c)\mathcal{T}{(c)} denotes the cthc^{th} tree (particle) and the weights sum to 1, that is, cw(c)=1\sum_{c}\overline{w}{(c)}=1.

SMC proposal: Each particle T(c)\mathcal{T}{(c)} is the end product of a sequence of partial trees T(0)(c),T(1)(c),T(2)(c),\mathcal{T}_{(0)}{(c)},\mathcal{T}_{(1)}{(c)},\mathcal{T}_{(2)}{(c)},\dotsc, and the weight w(c)\overline{w}{(c)} reflects how well the cthc^{th} tree explains the residual RR. One of the particles, say the first particle, without loss of generality, is clamped to the old tree T\mathcal{T}^{*} at all stages of the particle filter, i.e., T(t)(1)=T(t)\mathcal{T}_{(t)}(1)=\mathcal{T}_{(t)}^{*}. At stage tt, the remaining C1C-1 particles are sampled from the sequential generative process Prt(T(t1)(c))\Pr_{t}(\cdot\,|\,\mathcal{T}_{(t-1)}(c)) described in section 2.5. Unlike state space models where the length of the latent state sequence is fixed, the sampled decision tree sequences may be of different length and could potentially be deeper than the old tree T\mathcal{T}^{*}. Hence, whenever E(t)=E_{(t)}=\emptyset, we set Pr(t)(T(t)T(t1))=δT(t1)\Pr_{(t)}(\mathcal{T}_{(t)}|\mathcal{T}_{(t-1)})=\delta_{\mathcal{T}_{(t-1)}}, i.e., T(t)=T(t1)\mathcal{T}_{(t)}=\mathcal{T}_{(t-1)}.

SMC weight update: Since the prior is used as the proposal, the particle weight w(t)(c)w_{(t)}(c) is multiplicatively updated with the ratio of the marginal likelihood of T(t)(c)\mathcal{T}_{(t)}(c) to the marginal likelihood of T(t1)(c)\mathcal{T}_{(t-1)}(c). The marginal likelihood associated with a (partial) tree T\mathcal{T} is a product of the marginal likelihoods associated with the leaf nodes of T\mathcal{T} defined in (10). Like Lakshminarayanan et al. (2013), we treat the eligible nodes E(t)E_{(t)} as leaf nodes while computing the marginal likelihood for a partial tree T(t)\mathcal{T}_{(t)}. Plugging in (10), the SMC weight update is given by (11) in Algorithm 2.

Resampling: The resampling step in the conditional-SMC algorithm is slightly different from the typical SMC resampling step. Recall that the first particle is always clamped to the old tree. The remaining C1C-1 particles are resampled such that the probability of choosing particle cc is proportional to its weight w(t)(c)w_{(t)}(c). We used multinomial resampling in our experiments, although other resampling strategies are possible.

When none of the trees contain eligible nodes, the conditional-SMC algorithm stops and returns a sample from the particle approximation. Without loss of generality, we assume that the CthC^{th} particle is returned. The PG sampler is summarized in Algorithm 2.

The computational complexity of the conditional-SMC algorithm in Algorithm 2 is similar to that of the top-down algorithm (Lakshminarayanan et al., 2013, §3.2). Even though the PG sampler has a higher per-iteration complexity in general compared to GrowPrune and CGM samplers, it can mix faster since it can propose a completely different tree that explains the data. The GrowPrune sampler requires many iterations to explore multiple modes (since a prune operation is likely to be rejected around a mode). The CGM sampler can change the decisions at internal nodes; however, it is inefficient since a change in an internal node that leaves any of the nodes in the subtree below empty will be rejected. We demonstrate the competitive performance of PG in the experimental section.

Experimental evaluation

In this section, we present experimental comparisons between the PG sampler and existing samplers for BART. Since the main contribution of this work is a different inference algorithm for an existing model, we just compare the efficiency of the inference algorithms and do not compare to other models. BART has been shown to demonstrate excellent prediction performance compared to other popular black-box non-linear regression approaches; we refer the interested reader to Chipman et al. (2010).

We implemented all the samplers in Python and ran experiments on the same desktop machine so that the timing results are comparable. The scripts can be downloaded from the authors’ webpages. We set the number of particles C=10C=10 for computational efficiency and max-stages=5000\textsf{max-stages}=5000, following Lakshminarayanan et al. (2013), although the algorithm always terminated much earlier.

We investigate the performance of the samplers on a dataset where there are multiple trees that explain the residual (conditioned on other trees). This problem is equivalent to posterior inference over a decision tree where the labels are equal to the residual. Hence, we generate a synthetic dataset where multiple trees are consistent with the observed labels. Intuitively, a local sampler can be expected to mix reasonably well when the true posterior consists of shallow trees; however, a local sampler will lead to an inefficient exploration when the posterior consists of deep trees. Since the depth of trees in the true posterior is at the heart of the mixing issue, we create synthetic datasets where the depth of trees in the true posterior can be controlled.

We generate the hypercube-DD dataset as follows: for each of the 2D2^{D} vertices of D^{D}, we sample 10 data points. The x\bm{x} location of a data point is generated as x=v+ϵ\bm{x}=\bm{v}+\epsilon where v\bm{v} is the vertex location and ϵ\epsilon is a random offset generated as ϵN(0,0.12ID)\epsilon\sim\mathcal{N}(\bm{0},0.1^{2}I_{D}). Each vertex is associated with a different function value and the function values are generated from N(0,32)\mathcal{N}(0,3^{2}). Finally the observed label is generated as y=f+ey=f+e where ff denotes the true function value at the vertex and eN(0,0.012)e\sim\mathcal{N}(0,0.01^{2}). Figure 2 shows a sample hypercube-22 dataset. As DD increases, the number of trees that explains the observations increases.

We fix m=1,αs=0.95m=1,\alpha_{s}=0.95 and set remaining BART hyperparameters to the default values. Since the true tree has 2D2^{D} leaves, we set βs\beta_{s} such that the expected number of leaves is roughly 2D2^{D}. We run 2000 iterations of MCMC. Figure 3 illustrates the posterior trace plots for D=4D=4. (See supplemental material for additional posterior trace plots.) We observe that PG converges much faster to the posterior in terms of number of leaves as well as the test MSE. We observe that GrowPrune sampler tends to overestimate the number of leaves; the low value of train MSE indicates that the GrowPrune sampler is stuck close to a mode and is unable to explore the true posterior. Pratola (2013) has reported similar behavior of GrowPrune sampler on a different dataset as well.

We compare the algorithms by computing effective sample size (ESS). ESS is a measure of how well the chain mixes and is frequently used to assess performance of MCMC algorithms; we compute ESS using R-CODA (Plummer et al., 2006). We discard the first 1000 iterations as burn-in and use the remaining 1000 iterations to compute ESS. Since the per iteration cost of generating a sample differs across samplers, we additionally report ESS per unit time. The ESS (computed using log-likelihood values) and ESS per second (ESS/s) values are shown in Tables 1 and 2 respectively. When the true tree is shallow (D=2D=2 and D=3D=3), we observe that CGM sampler mixes well and is computationally efficient. However, as the depth of the true tree increases (D=4,5,7D=4,5,7), PG achieves much higher ESS and ESS/s compared to CGM and GrowPrune samplers.

2 Real world datasets

In this experiment, we study the effect of the data dimensionality on mixing. Even when the trees are shallow, the number of trees consistent with the labels increases as the data dimensionality increases. Using the default BART prior (which promotes shallower trees), we compare the performance of the samplers on real world datasets of varying dimensionality.

We consider the CaliforniaHouses, YearPredictionMSD and CTslices datasets used by Johnson and Zhang (2013). For each dataset, there are three training sets, each of which contains 2000 data points, and a single test set. The dataset characteristics are summarized in Table 3.

We run each sampler using the three training datasets and report average ESS and ESS/s. All three samplers achieve very similar MSE to those reported by Johnson and Zhang (2013). The average number of leaves in the posterior trees was found to be small and very similar for all the samplers. Tables 4 and 5 respectively present results comparing ESS and ESS/s of the different samplers. As the data dimensionality increases, we observe that PG outperforms existing samplers.

Discussion

We have presented a novel PG sampler for BART. Unlike existing samplers which make local moves, PG can propose complete trees. Experimental results confirm that PG dramatically increases mixing when the true posterior consists of deep trees or when the data dimensionality is high. While we have presented PG only for the BART model, it is applicable to extensions of BART that use a different likelihood model as well. PG can also be used along with other priors for decision trees, e.g., those of Denison et al. (1998), Wu et al. (2007) and Lakshminarayanan et al. (2014). Backward simulation (Lindsten and Schön, 2013) and ancestral sampling (Lindsten et al., 2012) have been shown to significantly improve mixing of PG for state-space models. Extending these ideas to PG-BART is a challenging and interesting future direction.

Acknowledgments

BL gratefully acknowledges generous funding from the Gatsby Charitable Foundation. This research was carried out in part while DMR held a Research Fellowship at Emmanuel College, Cambridge, with funding also from a Newton International Fellowship through the Royal Society. YWT’s research leading to these results has received funding from EPSRC (grant EP/K009362/1) and the ERC under the EU’s FP7 Programme (grant agreement no. 617411).

References

Appendix A Results on hypercube−D𝐷-D dataset