Distributed Gaussian Processes

Marc Peter Deisenroth, Jun Wei Ng

Introduction

Gaussian processes (GPs) (Rasmussen & Williams, 2006) are the method of choice for probabilistic nonlinear regression: Their non-parametric nature allows for flexible modelling without specifying low-level assumptions (e.g., the degree of a polynomial) in advance. Inference can be performed in a principled way simply by applying Bayes’ theorem. GPs have had substantial impact in geostatistics (Cressie, 1993), optimisation (Jones et al., 1998; Brochu et al., 2009), data visualisation (Lawrence, 2005), robotics and reinforcement learning (Deisenroth et al., 2015), spatio-temporal modelling (Luttinen & Ilin, 2012), and active learning (Krause et al., 2008). A strength of GPs is that they are a fairly reliable black-box function approximator, i.e., they produce reasonable predictions without manual parameter tuning. However, their inherent weakness is that they scale poorly with the size NN of the data set: Training and predicting scale in O(N3)\mathcal{O}(N^{3}) and O(N2)\mathcal{O}(N^{2}), respectively, which practically limits GPs to data sets of size O(104)\mathcal{O}(10^{4}).

For large data sets (e.g., N>104N>10^{4}) sparse approximations are often used (Williams & Seeger, 2001; Quiñonero-Candela & Rasmussen, 2005; Hensman et al., 2013; Titsias, 2009; Lázaro-Gredilla et al., 2010; Shen et al., 2006; Gal et al., 2014). Typically, they lower the computational burden by implicitly (or explicitly) using a subset of the data, which scales (sparse) GPs to training set sizes NO(106)N\in\mathcal{O}(10^{6}). Recently, Gal et al. (2014) proposed an approach that scales variational sparse GPs (Titsias, 2009) further by exploiting distributed computations. In particular, they derive an exact re-parameterisation of the variational sparse GP by Titsias (2009) to update the variational parameters independently on different computing nodes. However, even with sparse approximations it is inconceivable to apply GPs to training set sizes of NO(107)N\geq\mathcal{O}(10^{7}).

An alternative to sparse approximations is to distribute the computations by using independent local “expert” models, which operate on subsets of the data. These local models typically require stationary kernels for a notion of “distance” and “locality”. Shen et al. (2006) used KD-trees to recursively partition the data space into a multi-resolution tree data structure, which scale GPs to O(104)\mathcal{O}(10^{4}) training points. However, no solutions for variance predictions are provided, and the approach is limited to stationary kernels. Along the lines of exploiting locality, mixture-of-experts (MoE) models (Jacobs et al., 1991) have been applied to GP regression (Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006; Yuan & Neubauer, 2009). However, these models have not primarily been used to speed up GP regression, but rather to increase the expressiveness of the model, i.e., allowing for heteroscedasticity and non-stationarity. Each local model possesses its own set of hyper-parameters to be optimised. Predictions are made by collecting the predictions of all local expert models, and weighting them using the responsibilities assigned by the gating network. Closed-form inference in these models is intractable, and approximations typically require MCMC. Nguyen & Bonilla (2014) sidestep MCMC inference and speed the GP-MoE model up by (i) fixing the number of GP experts, (ii) combining it with the pseudo-input sparse approximation by Snelson & Ghahramani (2006). This approach assigns data points to expert probabilistically using proximity information provided by stationary kernels and scales GPs to O(105)\mathcal{O}(10^{5}) data points.

Product-of-GP-experts models (PoEs) sidestep the weight assignment problem of mixture models: Since PoEs multiply predictions made by independent GP experts, the overall prediction naturally weights the contribution of each expert. However, the model tends to be overconfident (Ng & Deisenroth, 2014). Cao and Fleet (Cao & Fleet, 2014) recently proposed a generalised PoE-GP model in which the contribution of an expert in the overall prediction can weighted individually. This model is often too conservative, i.e., it over-estimates variances. Tresp’s Bayesian Committee Machine (BCM) (Tresp, 2000) can be considered a PoE-GP model, which provides a consistent framework for combining independent estimators within the Bayesian framework, but it suffers from weak experts.

In this paper, we exploit the fact that the computations of PoE models can be distributed amongst individual computing units and propose the robust BCM (rBMC), a new family of hierarchical PoE-GP models that (i) includes the BCM (Tresp, 2000) and to some degree the generalised PoE-GP (Cao & Fleet, 2014) as special cases, (ii) provides consistent approximations of a full GP, (iii) scales to arbitrarily large data sets by parallelisation. Unlike sparse GPs our rBCM operates on the full data set but distributes the computational and memory load amongst a large set of independent computational units. The rBCM recursively recombines these independent computations to form an efficient distributed GP inference/training framework.

A key advantage of the rBCM is that all computations can be performed analytically, i.e., no sampling is required. With sufficient computing power our model can handle arbitrarily large data sets. We demonstrate that the rBCM can be applied to data sets of size O(107)\mathcal{O}(10^{7}), which exceeds the typical data set sizes sparse GPs deal with by orders of magnitude. However, even with limited resources, our model is practical: A GP with a million data points can be trained in less than half an hour on a laptop.

Problem Set-up and Objective

We consider a regression problem y=f(x)+ϵ\mathdsRy=f(\boldsymbol{x})+\epsilon\in\mathds{R}, where x\mathdsRD\boldsymbol{x}\in\mathds{R}^{D}. The Gaussian likelihood p(yf(x))=N(f(x),σϵ2)p(y|f(\boldsymbol{x}))=\mathcal{N}(f(\boldsymbol{x}),\sigma_{\epsilon}^{2}) accounts for the i.i.d. measurement noise ϵN(0,σϵ2)\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}). The objective is to infer the latent function ff from a training data set X={xi}i=1N,y={yi}i=1N\boldsymbol{X}=\{\boldsymbol{x}_{i}\}_{i=1}^{N},\boldsymbol{y}=\{y_{i}\}_{i=1}^{N}. For small data set sizes NN, a Gaussian process (GP) is a method of choice for probabilistic non-parametric regression. A GP is defined as a collection of random variables, any finite number of which is Gaussian distributed. A GP is fully specified by a mean function mm and a covariance function kk (kernel) with hyper-parameters ψ\boldsymbol{\psi}. Without loss of generality, we assume that the prior mean function is 0.

A GP is typically trained by finding hyper-parameters θ={ψ,σϵ}\boldsymbol{\theta}=\{\boldsymbol{\psi},\sigma_{\epsilon}\} that maximise the log-marginal likelihood

where K=k(X,X)\mathdsRN×N\boldsymbol{K}=k(\boldsymbol{X},\boldsymbol{X})\in\mathds{R}^{N\times N} is the kernel matrix.

For a given set of hyper-parameters θ\boldsymbol{\theta}, a training set X,y\boldsymbol{X},\boldsymbol{y} and a test input x\mathdsRD\boldsymbol{x}_{*}\in\mathds{R}^{D}, the GP posterior predictive distribution of the corresponding function value f=f(x)f_{*}=f(\boldsymbol{x}_{*}) is Gaussian with mean and variance given by

respectively, where k=k(X,x)\boldsymbol{k}_{*}=k(\boldsymbol{X},\boldsymbol{x}_{*}) and k=k(x,x)k_{**}=k(\boldsymbol{x}_{*},\boldsymbol{x}_{*}).

Training requires the inversion and the determinant of K+σϵ2I\boldsymbol{K}+\sigma_{\epsilon}^{2}\boldsymbol{I} in (1), both of which scale in O(N3)\mathcal{O}(N^{3}) with a standard implementation. For predictions, we cache (K+σϵ2I)1(\boldsymbol{K}+\sigma_{\epsilon}^{2}\boldsymbol{I})^{-1}, such that the mean and variance in (2) and (3) require O(N)\mathcal{O}(N) and O(N2)\mathcal{O}(N^{2}) computations, respectively. For N>10,000N>10,000 training and predicting become time-consuming procedures, which additionally require O(N2+ND)\mathcal{O}(N^{2}+ND) memory.

Throughout this paper, we assume that a standard GP is a good model for the latent function ff. However, due to the data set size NN the full GP is not applicable.

To scale GPs to large data sets with NO(104)N\gg\mathcal{O}(10^{4}), we address both the computational and the memory issues of full GPs by distributing the computational and memory loads to many independent computational units that only operate on subsets of the data. For this purpose, we devise a robust and scalable hierarchical product-of-GP-experts model.

Distributed Product-of-GP-Experts Models

Product-of-experts models (PoEs) are generally promising for parallelisation and distributed computing. In a PoE model, an overall computation is the product of many independent (smaller) computations, performed by “experts”. In our case, every expert is a GP that accesses only a subset of the training data. In this paper, we consider a GP with a training data set D={X,y}\mathcal{D}=\{\boldsymbol{X},\boldsymbol{y}\}. We partition the training data into MM sets D(k)={X(k),y(k)}\mathcal{D}^{(k)}=\{\boldsymbol{X}^{(k)},\boldsymbol{y}^{(k)}\}, k=1,,Mk=1,\dotsc,M, and use a GP on each of them as a (local) expertThe notion of “locality” is misleading as our model does not require similarity measures induced by stationary kernels.. Each GP expert performs computations (e.g., mean/variance predictions) conditioned on their respective training data D(k)\mathcal{D}^{(k)}. These (local) predictions are recombined by a parent node (see Fig. 1(a)), which subsequently may play the role of an expert at the next level of the model architecture. Recursive application of these recombinations results in a multi-layered tree-structured computational graph, see Fig. 1(c).

The assumption of independent GP experts leads to a block-diagonal approximation of the kernel matrix, which (i) allows for efficient training and predicting (ii) can be computed efficiently (time and memory) by parallelisation.

Due to the independence assumption, the marginal likelihood p(yX,θ)p(\boldsymbol{y}|\boldsymbol{X},\boldsymbol{\theta}) in a PoE model factorises into the product of MM individual terms, such that

where each factor pkp_{k} is determined by the kkth GP expert. For training the PoE model, we seek GP hyper-parameters θ\boldsymbol{\theta} that maximise the corresponding log-marginal likelihood

where MM is the number of GP experts. The terms in (5) are independently computed and given by

where Kψ(k)=k(X(k),X(k))\boldsymbol{K}_{\boldsymbol{\psi}}^{(k)}=k(\boldsymbol{X}^{(k)},\boldsymbol{X}^{(k)}) is an nk×nkn_{k}\times n_{k} matrix, and nkNn_{k}\ll N is the size of the data set associated with the kkth GP expert. Since we assume that a standard GP is sufficient to model the latent function, all GP experts at the leaves of the tree-structured model are trained jointly and share a single set of hyper-parameters θ={ψ,σϵ}\boldsymbol{\theta}=\{\boldsymbol{\psi},\sigma_{\epsilon}\}. Computing the log-marginal likelihood terms in (6) requires the inversion and determinant of Kψ(k)+σϵ2I\boldsymbol{K}_{\boldsymbol{\psi}}^{(k)}+\sigma_{\epsilon}^{2}\boldsymbol{I}. These computations require O(nk3)\mathcal{O}(n_{k}^{3}) time with a standard implementation. The memory consumption is O(nk2+nkD)\mathcal{O}(n_{k}^{2}+n_{k}D) for each GP expert.

In (5), the number of parameters θ\boldsymbol{\theta} to be optimised is relatively small since we do not consider additional variational parameters or inducing inputs that we optimise. Training (i) allows for straightforward parallelisation, (ii) provides a significant speed-up compared to training a full GP, (iii) requires solving low-dimensional O(D)\mathcal{O}(D) optimisation problem unlike sparse GPs, which additionally optimise inducing inputs or variational parameters.

2 Predictions

In the following, we assume that a set of MM GP experts has been trained according to Section 3.1 and detail how the PoE (Ng & Deisenroth, 2014), the generalised PoE (Cao & Fleet, 2014) and the Bayesian Committee Machine (Tresp, 2000) combine predictions of the GP experts to form an overall prediction. Furthermore, we highlight strengths and weaknesses of these models, which motivates our robust Bayesian Committee Machine (rBCM). The rBCM unifies many other models while providing additional flexibility, which can address the shortcomings of the PoE, gPoE and the BCM. For illustration purposes, we focus on the model in Fig. 1(a), but many models generalise to an arbitrarily deep computational graph (see Section 4).

The product-of-GP-experts model predicts a function value ff_{*} at a corresponding test input x\boldsymbol{x}_{*} according to

where MM GP experts operate on different training data subsets D(k)\mathcal{D}^{(k)}. The MM GP experts predict means μk(x)\mu_{k}(\boldsymbol{x}_{*}) and variances σk2(x)\sigma^{2}_{k}(\boldsymbol{x}_{*}), k=1,,Mk=1,\dotsc,M, independently. The joint prediction p(fx,D)p(f_{*}|\boldsymbol{x}_{*},\mathcal{D}) is obtained by the product of all experts’ predictions. The product of these Gaussian predictions is proportional to a Gaussian with mean and precision

respectively. For k=1k=1, this model is identical to the full GP we wish to approximate.

Strengths of the PoE model are (a) the overall prediction p(fx,D)p(f_{*}|\boldsymbol{x}_{*},\mathcal{D}) is straightforward to compute, (b) there are no free weight parameters to be assigned to each prediction (unlike in MoE models). A shortcoming of this model is that with an increasing number of GP experts the predictive variances vanish (the precisions add up, see (9)), which leads to overconfident predictions, especially in regions without data. Thus, the PoE model is inconsistent in the sense that it does not fall back to the prior, see Fig. 2(a).

2.2 Generalised Product of GP Experts

The generalised product-of-experts model (gPoE) by Cao & Fleet (2014) adds the flexibility of increasing/reducing the importance of experts. The predictive distribution is

where the βk\beta_{k} weight the contributions of the experts. The predictive mean and precision are, therefore,

respectively. A strength of the gPoE is that with kβk=1\sum_{k}\beta_{k}=1 the model falls back to the prior outside the range of the data (and corresponds to a log-opinion-pool model (Heskes, 1998)). A weakness of the gPoE is that in the range of the data, it over-estimates the variances, i.e., the predictions are generally too conservative, especially with an increasing number of GP experts. Fig. 2(b) illustrates these two properties.

Cao & Fleet (2014) suggest to set βk\beta_{k} to the difference in the differential entropy between the prior and the posterior to determine the importance. In this paper, we do not consider this setting for the gPoE for two reasons: (i) kβk1\sum_{k}\beta_{k}\neq 1 leads to unreasonable error bars; (ii) Even with a normalisation, this setting does not allow for deep computational graphs. In fact, it only allows for a single-layer computational graph, such as Fig. 1(a). To allow for a general computational graph, we require an a-priori setting of the βk\beta_{k}. Thus, we set βk=1/M\beta_{k}=1/M, where MM is the number of GP experts. In this case, the predicted means in (8) and (11) are identical, but the precisions differ, see also Fig. 2(a) and 2(b) for a comparison.

2.3 Bayesian Committee Machine

A third model that falls in the category of PoE models is the Bayesian Committee Machine (BCM) proposed by Tresp (2000). Unlike the (g)PoE, the BCM explicitly incorporates the GP prior p(f)p(f) when combining predictions (and not only at the leaves).

For two experts j,kj,k and corresponding training data sets D(j),D(k)\mathcal{D}^{(j)},\mathcal{D}^{(k)}, the predictive distribution is generally given by

where p(f)p(f_{*}) is the GP prior over functions. The BCM makes the conditional independence assumption that D(j) ⁣ ⁣ ⁣D(k)f\mathcal{D}^{(j)}\perp\!\!\!\perp\mathcal{D}^{(k)}|f_{*}. With (13) this yields

which is the PoE model in (7) divided by the GP prior. For MM training data sets D(k)\mathcal{D}^{(k)}, k=1,,Mk=1,\dotsc,M, the BCM applies the above approximation repeatedly, leading to the BCM’s posterior predictive distribution

The (M1M-1)-fold division by the prior is the decisive difference between the BCM and the PoE model in (7) and leads to the BCM’s predictive mean and precision

respectively, where σ2\sigma_{**}^{-2} is the prior precision of p(f)p(f_{*}).

The repeated application of Bayes’s theorem leads to an (M1)(M-1)-fold division by the prior in (17), which plays the role of a “correction” term in (19) that ensures a consistent model that falls back to the prior.The BCM predictions fall into the category of normalised group odds (Bordley, 1982). The error bars of the BCM within the range of the data are usually good, but it is possible to “break” the BCM when only few data points are assigned to each GP expert. In Fig. 2(c), we see that the posterior mean suffers from weak experts when leaving the data (around x=0x=0).Here, each expert was assigned only two data points.

2.4 Robust Bayesian Committee Machine

In this section, we propose the robust Bayesian Committee Machine (rBCM), a unified model that (a) includes the gPoE and BCM as special cases, (b) yields consistent predictions, (c) can be implemented on a distributed computing architecture.

Inspired by the gPoE in (10), the rBCM is a BCM with the added flexibility of increasing/decreasing an expert’s importance. The rBCM’s predictive distribution is

where the predictive mean and precision are given as

respectively. The derivation of the rBCM in (20) is analogous to the BCM’s derivation in (14)–(16).

The rBCM combines the flexibility of the generalised PoE with the appropriate Bayesian treatment of the BCM, which leads to the correction term (1kβk)σ2(1-\sum\nolimits_{k}\beta_{k})\sigma_{**}^{-2} in the the precision in (22). This correction term ensures that the predictive variance falls back to the prior when leaving the data. Note that we no longer require kβk=1\sum_{k}\beta_{k}=1 to ensure this (see also (Bordley, 1982)), which will facilitate computational graphs with multiple layers. The gPoE from Section 3.2.2 and the BCM from Section 3.2.3 are recovered for βk=1/M\beta_{k}=1/M and βk=1\beta_{k}=1, respectively. For βk=0\beta_{k}=0, the rBCM is identical to the GP prior, and for βk=1\beta_{k}=1 but without the correction, the rBCM recovers the PoE from Section 3.2.1

The parameters βk\beta_{k} control not only the importance of the individual experts, but they also control how strong the influence of the prior is. Assuming each GP expert is a good predictive model, we would set βk=1\beta_{k}=1 for all kk, such that we retain the BCM. If the quality of the GP experts is weak, e.g., data is noisy and the experts’ data sets D(k)\mathcal{D}^{(k)} are small, βk\beta_{k} allows us to weaken the experts’ votes and to robustify the predictive model by putting (relatively) more weight on the prior. Therefore, we follow the suggestion by Cao & Fleet (2014) and choose βk\beta_{k} according to the predictive power of each expert at x\boldsymbol{x}_{*}. Specifically, we use the difference in differential entropy between the prior p(fx)p(f_{*}|\boldsymbol{x}_{*}) and the posterior pk(fx,D(k))p_{k}(f_{*}|\boldsymbol{x}_{*},\mathcal{D}^{(k)}). This quantity can be computed efficiently and is given as βk=12(logσ2logσk2(x))\beta_{k}=\tfrac{1}{2}(\log\sigma^{2}_{**}-\log\sigma_{k}^{2}(\boldsymbol{x}_{*})), where σ2\sigma_{**}^{2} is the prior variance and σk2(x)\sigma_{k}^{2}(\boldsymbol{x}_{*}) is the predictive variance of the kkth expert.

Fig. 2(d) shows that for this choice of βk\beta_{k} the rBCM expresses more uncertainty about the learned model than the BCM: Due to the adaptive influence of the prior in (21)–(22), the variances within the range of the data (black circles) are slightly larger, but the predictive mean no longer suffers from the dominant “kink” at around x=0x=0 compared to the BCM in Fig. 2(c). Overall, the rBCM provides more reasonable predictions than any other model in Fig. 2.

Distributed Computations

In the following, we show that for a given number MM of GP experts, the rBCM can be implemented in different computational graphs while providing identical predictions. For instance, with 32 experts, we show that a single-layer computational graph with 32 experts and one central node, see Fig. 1(a), is equivalent to a two-layer computational graph with 32 experts, 8 parent nodes (each of which is responsible for 4 GP experts), and one central node. This property can be exploited in distributed systems or to balance the communication between computing units.

In a single-layer model as shown in Fig. 1(a) the rBCM predictions in (20) can be constructed by a gPoE (numerator) combining predictions of GP experts, followed by a correction via the prior (denominator). Let us consider a two-layer model with MM GP experts, LL nodes at level 1 and one central node at level 2, see Fig. 1(b). Each grey node at level 1 is responsible for LkL_{k} GP experts (black nodes). All MM GP experts kik_{i}, compute weighted predictive distributions pkiβki(fD(ki))p_{k_{i}}^{\beta_{k_{i}}}(f_{*}|\mathcal{D}^{(k_{i})}). These predictions are then multiplied by the corresponding LL parent nodes (grey) to pkβk(fD(k))p_{k}^{\beta_{k}}(f_{*}|\mathcal{D}^{(k)}), where βk=iβki\beta_{k}=\sum_{i}\beta_{k_{i}} is the overall weight of the subtree following the kkth node LkL_{k} and D(k)=iD(ki)\mathcal{D}^{(k)}=\bigcup_{i}\mathcal{D}^{(k_{i})}. The overall prediction at the top node (blue in Fig. 1) is

where we accounted for the (kβk1)(\sum_{k}\beta_{k}-1)-fold correction by the prior. This computation can be obtained by a gPoE (black), followed by a PoE (red) and a prior correction (blue) in (23). Hence, for a given number of GP experts, the rBCM predictions can be equivalently realised in a single and two-layer computational graph, see Fig. 1.

This can be generalised further to an arbitrarily deep computational graph, whose general implementation structure is shown in Fig. 3. The GP experts at the leaves compute their individual means, variances and confidence values βi\beta_{i}. The next layer consists of gPoE models, which compute the weighted means and variances according to (8) and (12), respectively (plus their overall weights βk=ichildrenβki\beta_{k}=\sum_{i\in\text{children}}\beta_{k_{i}}, which are passed on to the next-higher level). The gPoE is followed by an arbitrary number of PoE models, which compute means and precisions according to (8) and (9), respectively (plus their overall weights βk=jchildrenβkj\beta_{k}=\sum_{j\in\text{children}}\beta_{k_{j}}, which are passed on to the next-higher level). The top layer accounts for the prior (blue term in (23)), which uses all the βk\beta_{k} from the subtrees starting at its children to compute the overall mean and precision according to (21) and (22), respectively.

Hence, for a given number of GP experts, there are many equivalent computational graphs for the rBCM (and the other distributed GPs discussed in Section 3). For instance, all computational graphs in Fig. 1 return identical predictive distributions. This allows us to choose the computational graph that works best with the computing infrastructure available: Shallow graphs minimise overall traffic. However, they are more vulnerable to communication bottlenecks at the central node since it has a large number of connections. Deeper computational graphs cause more overall communication, but the tree has a smaller branching factor. In practice, for a set of computational graphs we took the time it requires to compute the gradient of the marginal likelihood and chose the “fastest” architecture.

Experiments

We empirically assess three aspects of all distribute GP models: (1) The required training time, (2) the approximation quality, (3) a comparison with state-of-the-art sparse GP methods. In all experiments, we chose the standard squared exponential kernel with automatic relevance determination and a Gaussian likelihood. Moreover, we assigned training data to experts randomly for two reasons: First, we demonstrate that our models do not need locality information; second, random assignment is very fast compared to clustering methods, e.g., KD-trees.

To evaluate the training time for distributed GPs (DGP)This comprises the (g)PoE and (r)BCM for which the training time is identical., we measured the time required to compute the log-marginal likelihood and its gradient with respect to the kernel hyper-parameters. Since the model is trained using LBFGS, the overall training time is proportional to the time it takes to compute the log-marginal likelihood and its gradient. For this evaluation, we chose a computer architecture of 64 nodes with four cores each. Furthermore, we chose a three-layer computational graph with varying branching factors. For data sets of 220\leq 2^{20} data points each GP expert possessed 512 data points, for data set sizes of >220>2^{20}, we chose the number of data points per node to be 128.

Fig. 4 shows the time required for computing the,log-marginal likelihood and its gradient with respect to the hyper-parameters. The horizontal axis shows the size of the training set (logarithmic scale), the left vertical axis shows the computation time in seconds (logarithmic scale) for the DGP (blue-dashed), a full GP (red-dashed) and a sparse GP (FITC) with inducing inputs (Snelson & Ghahramani, 2006) (green-dashed). For the sparse GP model, we chose the number MM of inducing inputs to be 10% of the size of the training set, i.e., the computation time is of the order of O(NM2)=O(N3/100)\mathcal{O}(NM^{2})=\mathcal{O}(N^{3}/100), which offsets the curve of the full GP. The right vertical axis shows the number of GP experts (black-solid) amongst which we distribute the computation. While the training time of the full GP becomes impractical at data set sizes of about 20,000, the sparse GP model can be reasonably trained up to 50,000 data points.We did not include any computational overhead for learning the inducing inputs, which is often time consuming. The computational time required for the DGP to compute the marginal likelihood and gradients is significantly lower than that of the full GP, and we scaled it up to 2241.7×1072^{24}\approx 1.7\times 10^{7} training data points, which required about the same amount of time (230\approx 230 s) for training a full GP with 2141.6×1042^{14}\approx 1.6\times 10^{4} and a sparse GP with 2153.2×1042^{15}\approx 3.2\times 10^{4} data points. The figure shows that for any problem size we can find a computational graph that allows us to train the model within a reasonable amount of time.

Even if a big computing infrastructure is not available our model is useful in practice: We performed a full training cycle of the DGP with 10610^{6} data points on a standard laptop in about 20 minutes. This is also a clear indicator that the memory consumption of the DGP is relatively small.

2 Empirical Approximation Errors

We evaluate the predictive quality of the DGP models introduced in Section 3 on the Kin40K data set. The data set represents the forward dynamics of an 8-link all-revolute robot arm. The goal is to predict the distance of the end-effector from a target, given the joint angles of the eight links as features. The Kin40K data set consists of 10,000 training points and 30,000 test points. We use the same split into training and test data as Seeger et al. (2003), Lázaro-Gredilla et al. (2010), and Nguyen & Bonilla (2014).

We considered two baselines: a full (ground truth) GP and the subset-of-data (SOD) approximation, which uses a random subset of the full training data set to train a sparse GP. Taking training time into account, Chalupka et al. (2013) identified the SOD method as a good and robust sparse GP approximation. All approximate models (PoE, gPoE, BCM, rBCM, SOD) used the hyper-parameters from the full GP to eliminate issues with local optima. For every model, we took the time for computing the gradient of the marginal likelihood (training is proportional to this amount of time). We selected the number of data points for SOD, such that the gradient computation time approximately matches the one of the DGPs.

Experiments were repeated 10 times to average out the effect of the random assignment of data points to experts and the selection of the subset of training data for the SOD approximation. For this experiment, we used a Virtual Machine with 16 3 GHz cores and 8 GB RAM.

Fig. 5 shows the average performance (RMSE and negative log predictive density (NLPD) per data point) of all models as a function of (a) the time to compute the gradient of the marginal likelihood and (b) the number of training data points per GP expert. The full GP (black, dashed) shows the ground-truth performance, but requires 1162 seconds for the gradient computation. The performances of the (r)BCM and the (g)POE have been evaluated using 256, 64, 16, 4, and 1 expert with 39, 156, 625, 2500 and 10000 data points per expert, respectively. In Fig. 5(a), the rBCM consistently outperforms all other methods, where SOD is substantially worse than all DGPs. The NLPD in Fig. 5(b) allows us to make some more conclusive statements: While the rBCM again outperforms all other methods, the BCM and the PoE’s performances suffer when only a small number of data points is assigned to the GP experts. The PoE suffers from variance underestimation (see Fig. 2(a)) whereas the BCM suffers from “weak” experts (see Fig. 2(c)). SOD does not work well, even with 2500 data points. Overall, the rBCM provides an enormous training speed-up compared to a full GP, with a significantly better predictive performance than SOD.

3 Airline Delays (US Flight Data)

We assess the performance of the distributed GPs (PoE, gPoE, BCM and rBCM) on a large-scale non-stationary data set reporting flight arrival and departure times for every commercial fight in the US from January to April 2008http://stat-computing.org/dataexpo/2009/. This data set contains information about almost 6 million flights. We followed the procedure described by Hensman et al. (2013)Thanks to J Hensman for the pre-processing script. to predict the flight delay (in minutes) at arrival: We selected the first PP data points to train the model and the following 100,000 to test it. We chose the same eight input variables x\boldsymbol{x} as Hensman et al. (2013): age of the aircraft, distance that needs to be covered, airtime, departure and arrival times, day of the week and month, month. This data set has been evaluated by Hensman et al. (2013) and Gal et al. (2014), who use either Stochastic Variational Inference GP (SVIGP) or Distributed Variational GPs (Dist-VGP) to deal with this training set size.

Using a workstation with 12 3.5 GHz cores and 32 GB RAM, we conducted 10 experiments with P=7×105P=7\times 10^{5}, P=2×106P=2\times 10^{6} and P=5×106P=5\times 10^{6} where we chose 4096, 8192 and 32768 experts corresponding to 170, 244 and 152 training data points per expert, respectively. The computation of the marginal likelihoods required 13, 39, and 90 seconds, respectively.All experiments can be conducted on a MacBook Air (2012) with 8 GB RAM. The computational graphs were (16-16-16), (8192), and (32-32-32), respectively, where each number denotes the branching factor at the corresponding level in the tree. After normalising the data every single experiment consisted of an independent full training/test cycle, with random assignments of data points to GP experts. Training the DGPs normally required 30–100 line searches to converge.

Table 1 reports the performance (RMSE and NLPD) of various large-scale GP methods for the flight data set. The results for SVIGP and Dist-VGP are taken from Hensman et al. (2013) and Gal et al. (2014), respectively. Since SVIGP and Dist-VGP are difficult to optimise due to the large number of variational parameters, Gal et al. (2014) report only their best results, whereas we report an average of all experiments conducted.

The standard errors (not shown in Table 1) of the rBCM and gPoE are consistently below 0.3, whereas the BCM and the PoE suffered from a few outliers, which is also indicated by the relatively large NLPD values. Compared to the Dist-VGP and SVIGP on the 700K data set, the rBCM, gPoE and PoE perform significantly better in RMSE. The table highlights the weaknesses of the PoE (under-estimation of the variance) and the BCM (problems with weak experts) very clearly. The property of the gPoE (too conservative) is a bit hidden: Although the RMSE of the gPoE is consistently worse than that of the rBCM, its NLPD tends to be a bit lower. The NLPD values of the rBCM and the gPoE are fairly consistent across all three experiments.

The data set exhibits the property that the 700K/100K data set is more stationary than the 2M/100K and 5M/100K data sets. Therefore, we observe a decreasing performance although we include more training data. This effect has already been reported by Gal et al. (2014).

Discussion and Conclusion

The distributed GP models discussed in this paper exhibit the appealing property that they are conceptually simple: They split the data set into small pieces, train GP experts jointly, and subsequently combine individual predictions to an overall computation. Compared to sparse GP models, which are based on inducing or variational parameters, the distributed GPs possess only the normal GP hyper-parameters, i.e., it is less likely to end up in local optima.

In this paper, all experts share the same hyper-parameters, which leads to automatic regularisation: The overall gradient is an average of the experts’ marginal likelihood gradients, i.e., overfitting of individual experts is not favoured.

We believe that distributed computations are promising to scale GPs to truly large data sets. Sparse methods using inducing inputs are naturally limited by the number of inducing variables. In practice, we see O(102)\mathcal{O}(10^{2}) many inducing variables (optimisation is hard), although their theoretical limit might be at O(104)\mathcal{O}(10^{4}), if we assume that this is the limit up to which we can train a full GP. If we assume that a single inducing variable summarises 100 data points, the practical limit of sparse methods are data sets of size O(106)\mathcal{O}(10^{6}). This can be extended by either a higher compression rate or parallelisation (Hensman et al., 2013; Gal et al., 2014).

We introduced the robust Bayesian Committee Machine (rBCM), a conceptually straightforward, but effective, product-of-GP-experts model that scales GPs to (in principle) arbitrarily large data sets. The rBCM distributes computations amongst independent computing units, allowing for straightforward parallelisation. Recursive and closed-form recombinations of these independent computations yield a practical model that is both computationally and memory efficient. The rBCM addresses shortcomings of other distributed models by appropriately incorporating the GP prior when combining predictions. The rBCM is independent of the computational graph and can be equivalently used on heterogeneous computing infrastructures, ranging from laptops to large clusters. Training an rBCM with a million data points takes less than 30 minutes on a laptop. With more computing power training the rBCM with 10710^{7} data points can be done in a few hours. Compared to state-of-the-art sparse GPs, our model is conceptually simple, performs well, learns fast, requires little memory and does not require high-dimensional optimisation.

MPD has been supported by an Imperial College Junior Research Fellowship.

References