Thoughts on Massively Scalable Gaussian Processes

Andrew Gordon Wilson, Christoph Dann, Hannes Nickisch

Introduction

Every minute of the day, users share hundreds of thousands of pictures, videos, tweets, reviews, and blog posts. More than ever before, we have access to massive datasets in almost every area of science and engineering, including genomics, robotics, and climate science. This wealth of information provides an unprecedented opportunity to automatically learn rich representations of data, which allows us to greatly improve performance in predictive tasks, but also provides a mechanism for scientific discovery.

Expressive non-parametric methods, such as Gaussian processes (GPs) (Rasmussen and Williams,, 2006), have great potential for large-scale structure discovery; indeed, these methods can be highly flexible, and have an information capacity that grows with the amount of available data. However, large data problems are mostly uncharted territory for GPs, which can only be applied to at most a few thousand training points nn, due to the O(n3)\mathcal{O}(n^{3}) computations and O(n2)\mathcal{O}(n^{2}) storage required for inference and learning.

Even more scalable approximate GP approaches, such as inducing point methods (Quiñonero-Candela and Rasmussen, 2005a, ), typically require O(m2n+m3)\mathcal{O}(m^{2}n+m^{3}) computations and O(m2+mn)\mathcal{O}(m^{2}+mn) storage, for mm inducing points, and are hard to apply massive datasets, containing n>105n>10^{5} examples. Moreover, for computational tractability, these approaches require mnm\ll n, which can severely affect predictive performance, limit representational power, and the ability for kernel learning, which is most needed on large datasets (Wilson,, 2014). New directions for scalable Gaussian processes have involved mini-batches of data through stochastic variational inference (Hensman et al.,, 2013) and distributed learning (Deisenroth and Ng,, 2015). While these approaches are promising, inference can undergo severe approximations, and a small number of inducing points are still required. Indeed, stochastic variational approaches scale as O(m3)\mathcal{O}(m^{3}).

In this paper, we introduce a new framework for massively scalable Gaussian processes (MSGP), which provides near-exact O(n)\mathcal{O}(n) inference and learning and O(1)\mathcal{O}(1) test time predictions, and does not require distributed learning or severe assumptions. Our approach builds on the recently introduced KISS-GP framework (Wilson and Nickisch,, 2015), with several significant advances which enable its use on massive datasets. In particular, we provide:

Near-exact O(1)\mathcal{O}(1) mean and variance predictions. By contrast, standard GPs and KISS-GP cost O(n)\mathcal{O}(n) for the predictive mean and O(n2)\mathcal{O}(n^{2}) for the predictive variance per test point. Moreover, inducing point and finite basis expansions (e.g., Quiñonero-Candela and Rasmussen, 2005a, ; Lázaro-Gredilla et al.,, 2010; Yang et al.,, 2015) cost O(m)\mathcal{O}(m) and O(m2)\mathcal{O}(m^{2}) per test point.

Circulant approximations which (i) integrate Kronecker and Toeplitz structure, (ii) enable extremely and accurate fast log determinant evaluations for kernel learning, and (iii) increase the speed of Toeplitz methods on problems with 1D predictors.

The ability to exploit more general block-Toeplitz-Toeplitz-block (BTTB) structure, which enables fast and exact inference and learning in cases where multidimensional Kronecker structure is not present.

Projections which help alleviate the limitation of Kronecker methods to low-dimensional input spaces.

Code will be available as part of the GPML package (Rasmussen and Nickisch,, 2010), with demonstrations at http://www.cs.cmu.edu/~andrewgw/pattern.

We begin by briefly reviewing Gaussian processes, structure exploiting inference, and KISS-GP, in sections 2 - 4. We then introduce our MSGP approach in section 5. We demonstrate the scalability and accuracy of MSGP in the experiments of section 6. We conclude in section 7.

Gaussian Processes

We briefly review Gaussian processes (GPs), and the computational requirements for predictions and kernel learning. Rasmussen and Williams, (2006) contains a full treatment of GPs.

We assume a dataset D\mathcal{D} of nn input (predictor) vectors X=[x1,,xn]X=[{\bm{\mathbf{x}}}_{1},\dots,{\bm{\mathbf{x}}}_{n}], each of dimension DD, corresponding to a n×1n\times 1 vector of targets y=[y(x1),,y(xn)]{\bm{\mathbf{y}}}=[y({\bm{\mathbf{x}}}_{1}),\dots,y({\bm{\mathbf{x}}}_{n})]^{\top}. If f(x)GP(μ,kθ)f({\bm{\mathbf{x}}})\sim\mathcal{GP}(\mu,k_{{\bm{\mathbf{\theta}}}}), then any collection of function values f{\bm{\mathbf{f}}} has a joint Gaussian distribution,

with mean vector and covariance matrix defined by the mean vector and covariance function of the Gaussian process: (μX)i=μ(xi)({\bm{\mathbf{\mu}}}_{X})_{i}=\mu(x_{i}), and (KX,X)ij=kθ(xi,xj)(K_{X,X})_{ij}=k_{\bm{\mathbf{\theta}}}({\bm{\mathbf{x}}}_{i},{\bm{\mathbf{x}}}_{j}). The covariance function kθk_{\bm{\mathbf{\theta}}} is parametrized by θ{\bm{\mathbf{\theta}}}. Assuming additive Gaussian noise, y(x)f(x)N(y(x);f(x),σ2)y({\bm{\mathbf{x}}})|f({\bm{\mathbf{x}}})\sim\mathcal{N}(y({\bm{\mathbf{x}}});f({\bm{\mathbf{x}}}),\sigma^{2}), then the predictive distribution of the GP evaluated at the nn_{*} test points indexed by XX_{*}, is given by

KX,XK_{X_{*},X} represents the n×nn_{*}\times n matrix of covariances between the GP evaluated at XX_{*} and XX, and all other covariance matrices follow the same notational conventions. μX{\bm{\mathbf{\mu}}}_{X_{*}} is the n×1n_{*}\times 1 mean vector, and KX,XK_{X,X} is the n×nn\times n covariance matrix evaluated at training inputs XX. All covariance matrices implicitly depend on the kernel hyperparameters θ{\bm{\mathbf{\theta}}}.

The marginal likelihood of the targets y{\bm{\mathbf{y}}} is given by

where we have used KθK_{{\bm{\mathbf{\theta}}}} as shorthand for KX,XK_{X,X} given θ{\bm{\mathbf{\theta}}}. Kernel learning is performed by optimizing Eq. (3) with respect to θ{\bm{\mathbf{\theta}}}.

The computational bottleneck for inference is solving the linear system (KX,X+σ2I)1y(K_{X,X}+\sigma^{2}I)^{-1}{\bm{\mathbf{y}}}, and for kernel learning is computing the log determinant logKX,X+σ2I\log|K_{X,X}+\sigma^{2}I|. Standard procedure is to compute the Cholesky decomposition of the n×nn\times n matrix KX,XK_{X,X}, which requires O(n3)\mathcal{O}(n^{3}) operations and O(n2)\mathcal{O}(n^{2}) storage. Afterwards, the predictive mean and variance of the GP cost respectively O(n)\mathcal{O}(n) and O(n2)\mathcal{O}(n^{2}) per test point x{\bm{\mathbf{x}}}_{*}.

Structure Exploiting Inference

Structure exploiting approaches make use of existing structure in KX,XK_{X,X} to accelerate inference and learning. These approaches benefit from often exact predictive accuracy, and impressive scalability, but are inapplicable to most problems due to severe grid restrictions on the data inputs XX. We briefly review Kronecker and Toeplitz structure.

Kronecker (tensor product) structure arises when we have multidimensional inputs i.e. P>1P>1 on a rectilinear grid, xX1××XP{\bm{\mathbf{x}}}\in\mathcal{X}_{1}\times\dots\times\mathcal{X}_{P}, and a product kernel across dimensions k(xi,xj)=p=1Pk(xi(p),xj(p))k({\bm{\mathbf{x}}}_{i},{\bm{\mathbf{x}}}_{j})=\prod_{p=1}^{P}k({\bm{\mathbf{x}}}_{i}^{(p)},{\bm{\mathbf{x}}}_{j}^{(p)}). In this case, K=K1KPK=K_{1}\otimes\dots\otimes K_{P}. One can then compute the eigendecomposition of K=QVQK=QVQ^{\top} by separately taking the eigendecompositions of the much smaller K1,,KPK_{1},\dots,K_{P}. Inference and learning then proceed via (K+σ2I)1y=(QVQ+σ2I)1y=Q(V+σ2I)1Qy(K+\sigma^{2}I)^{-1}{\bm{\mathbf{y}}}=(QVQ^{\top}+\sigma^{2}I)^{-1}{\bm{\mathbf{y}}}=Q(V+\sigma^{2}I)^{-1}Q^{\top}{\bm{\mathbf{y}}}, and logK+σ2I=ilog(Vii+σ2)\log|K+\sigma^{2}I|=\sum_{i}\log(V_{ii}+\sigma^{2}), where QQ is an orthogonal matrix of eigenvectors, which also decomposes a Kronecker product (allowing for fast MVMs), and VV is a diagonal matrix of eigenvalues and thus simple to invert. Overall, for mm grid data points, and PP grid dimensions, inference and learning cost O(Pm1+1P)\mathcal{O}(Pm^{1+\frac{1}{P}}) operations (for P>1P>1) and O(Pm2P)\mathcal{O}(Pm^{\frac{2}{P}}) storage (Saatchi,, 2011; Wilson et al.,, 2014). Unfortunately, there is no efficiency gain for 1D inputs (e.g., time series).

2 Toeplitz Structure

A covariance matrix constructed from a stationary kernel k(x,z)=k(xz)k(x,z)=k(x-z) on a 1D regularly spaced grid has Toeplitz structure. Toeplitz matrices TT have constant diagonals, Ti,j=Ti+1,j+1T_{i,j}=T_{i+1,j+1}. Toeplitz structure has been exploited for GP inference (e.g., Zhang et al.,, 2005; Cunningham et al.,, 2008) in O(nlogn)\mathcal{O}(n\log n) computations. Computing logT\log|T|, and the predictive variance for a single test point, requires O(n2)\mathcal{O}(n^{2}) operations (although finite support can be exploited (Storkey,, 1999)) thus limiting Toeplitz methods to about n<10,000n<10,000 points when kernel learning is required. Since Toeplitz methods are limited to problems with 1D inputs (e.g., time series), they complement Kronecker methods, which exploit multidimensional grid structure.

KISS-GP

Recently, Wilson and Nickisch, (2015) introduced a fast Gaussian process method called KISS-GP, which performs local kernel interpolation, in combination with inducing point approximations (Quiñonero-Candela and Rasmussen, 2005b, ) and structure exploiting algebra (e.g., Saatchi,, 2011; Wilson,, 2014).

for any single inputs x{\bm{\mathbf{x}}} and z{\bm{\mathbf{z}}}. The n×nn\times n training covariance matrix KX,XK_{X,X} thus has the approximation

Wilson and Nickisch, (2015) propose to perform local kernel interpolation, in a method called KISS-GP, for extremely sparse interpolation matrices. For example, if we are performing local cubic (Keys,, 1981) interpolation for dd-dimensional input data, WXW_{X} and WZW_{Z} contain only 4d4^{d} non-zero entries per row.

Furthermore, Wilson and Nickisch, (2015) show that classical inducing point methods can be re-derived within their SKI framework as global interpolation with a noise free GP and non-sparse interpolation weights. For example, the subset of regression (SoR) inducing point method effectively uses the kernel kSoR(x,z)=Kx,UKU,U1KU,z{k}_{\text{SoR}}({\bm{\mathbf{x}}},{\bm{\mathbf{z}}})=K_{{\bm{\mathbf{x}}},U}K_{U,U}^{-1}K_{U,{\bm{\mathbf{z}}}} (Quiñonero-Candela and Rasmussen, 2005b, ), and thus has interpolation weights wSoR(x)=Kx,UKU,U1{\bm{\mathbf{w}}}_{\text{SoR}({\bm{\mathbf{x}}})}=K_{{\bm{\mathbf{x}}},U}K_{U,U}^{-1} within the SKI framework.

GP inference and learning can be performed in O(n)\mathcal{O}(n) using KISS-GP, a significant advance over the more standard O(m2n)\mathcal{O}(m^{2}n) scaling of fast GP methods (Quiñonero-Candela and Rasmussen, 2005a, ; Lázaro-Gredilla et al.,, 2010). Moreover, Wilson and Nickisch, (2015) show how – when performing local kernel interpolation – one can achieve close to linear scaling with the number of inducing points mm by placing these points UU on a rectilinear grid, and then exploiting Toeplitz or Kronecker structure in KU,UK_{U,U} (see, e.g., Wilson,, 2014), without requiring that the data inputs XX are on a grid. Such scaling with mm compares favourably to the O(m3)\mathcal{O}(m^{3}) operations for stochastic variational approaches (Hensman et al.,, 2013). Allowing for large mm enables near-exact performance, and large scale kernel learning.

In particular, for inference we can solve (KSKI+σ2I)1y(K_{\text{SKI}}+\sigma^{2}I)^{-1}{\bm{\mathbf{y}}}, by performing linear conjugate gradients, an iterative procedure which depends only on matrix vector multiplications (MVMs) with (KSKI+σ2I)(K_{\text{SKI}}+\sigma^{2}I). Only jnj\ll n iterations are required for convergence up to machine precision, and the value of jj in practice depends on the conditioning of KSKIK_{\text{SKI}} rather than nn. MVMs with sparse WW (corresponding to local interpolation) cost O(n)\mathcal{O}(n), and MVMs exploiting structure in KU,UK_{U,U} are roughly linear in mm. Moreover, we can efficiently approximate the eigenvalues of KSKIK_{\text{SKI}} to evaluate logKSKI+σ2I\log|K_{\text{SKI}}+\sigma^{2}I|, for kernel learning, by using fast structure exploiting eigendecompositions of KU,UK_{U,U}. Further details are in Wilson and Nickisch, (2015).

Massively Scalable Gaussian Processes

We introduce massively scalable Gaussian processes (MSGP), which significantly extend KISS-GP, inducing point, and structure exploiting approaches, for: (1) O(1)\mathcal{O}(1) test predictions (section 5.1); (2) circulant log determinant approximations which (i) unify Toeplitz and Kronecker structure; (ii) enable extremely fast marginal likelihood evaluations (section 5.2); and (iii) extend KISS-GP and Toeplitz methods for scalable kernel learning in D=1 input dimensions, where one cannot exploit multidimensional Kronecker structure for scalability. (3) more general BTTB structure, which enables fast exact multidimensional inference without requiring Kronecker (tensor) decompositions (section 5.3); and, (4) projections which enable KISS-GP to be used with structure exploiting approaches for D5D\gg 5 input dimensions, and increase the expressive power of covariance functions (section 5.4).

We note that the methodology here for fast test predictions does not rely on having performed inference and learning in any particular way: it can be applied to any trained Gaussian process model, including a full GP, inducing points methods such as FITC (Snelson and Ghahramani,, 2006), or the Big Data GP (Hensman et al.,, 2013), or finite basis models (e.g., Yang et al.,, 2015; Lázaro-Gredilla et al.,, 2010; Rahimi and Recht,, 2007; Le et al.,, 2013; Williams and Seeger,, 2001).

1.2 Predictive Variance

Practical applications typically do not require the full predictive covariance matrix cov(f)\text{cov}({\bm{\mathbf{f}}}_{*}) of Eq. (2) for a set of nn_{*} test inputs XX_{*}, but rather focus on the predictive variance,

where ν=diag(KX,X[KX,X+σ2I]1KX,X){\bm{\mathbf{\nu}}}_{*}=\text{diag}(K_{X_{*},X}[K_{X,X}+\sigma^{2}I]^{-1}K_{X,X_{*}}), the explained variance, is approximated by local interpolation from the explained variance on the grid UU

The overall (unbiased) estimate (Papandreou and Yuille,, 2011) is obtained by clipping

2 Circulant Approximation

Kronecker and Toeplitz methods (section 3) are greatly restricted by requiring that the data inputs XX are located on a grid. We lift this restriction by creating structure in KU,UK_{U,U}, with the unobserved inducing variables UU, as part of the structured kernel interpolation framework described in section 4.

Toeplitz methods apply only to 1D problems, and Kronecker methods require multidimensional structure for efficiency gains. Here we present a circulant approximation to unify the complementary benefits of Kronecker and Toeplitz methods, and to greatly scale marginal likelihood evaluations of Toeplitz based methods, while not requiring any grid structure in the data inputs XX.

If UU is a regularly spaced multidimensional grid, and we use a stationary product kernel (e.g., the RBF kernel), then KU,UK_{U,U} decomposes as a Kronecker product of Toeplitz (section 3.2) matrices:

Because the algorithms which leverage Kronecker structure in Gaussian processes require eigendecompositions of the constituent matrices, computations involving the structure in Eq. (11) are no faster than if the Kronecker product were over arbitrary positive definite matrices: this nested Toeplitz structure, which is often present in Kronecker decompositions, is wasted. Indeed, while it is possible to efficiently solve linear systems with Toeplitz matrices, there is no particularly efficient way to obtain a full eigendecomposition.

Fast operations with an m×mm\times m Toeplitz matrix TT can be obtained through its relationship with an a×aa\times a circulant matrix. Symmetric circulant matrices CC are Toeplitz matrices where the first column c{\bm{\mathbf{c}}} is given by a circulant vector: c=[c1,c2,c3,..,c3,c2]{\bm{\mathbf{c}}}=[c_{1},c_{2},c_{3},..,c_{3},c_{2}]^{\top}. Each subsequent column is shifted one position from the next. In other words, Ci,j=cji mod aC_{i,j}={\bm{\mathbf{c}}}_{|j-i|\text{ mod }a}. Circulant matrices are computationally attractive because their eigendecomposition is given by

where FF is the discrete Fourier transform (DFT): Fjk=exp(2jkπi/a)F_{jk}=\exp(-2jk\pi i/a). The eigenvalues of CC are thus given by the DFT of its first column, and the eigenvectors are proportional to the DFT itself (the aa roots of unity). The log determinant of CC – the sum of log eigenvalues – can therefore be computed from a single fast Fourier transform (FFT) which costs O(aloga)\mathcal{O}(a\log a) operations and O(a)\mathcal{O}(a) memory. Fast matrix vector products can be computed at the same asymptotic cost through Eq. (12), requiring two FFTs (one FFT if we pre-compute FcF{\bm{\mathbf{c}}}), one inverse FFT, and one inner product.

In a Gaussian process context, fast MVMs with Toeplitz matrices are typically achieved through embedding a m×mm\times m Toeplitz matrix KK into a (2m1)×(2m1)(2m-1)\times(2m-1) circulant matrix CC (e.g., Zhang et al.,, 2005; Cunningham et al.,, 2008), with first column c=[k1,k2,..,km1,km,km1,..,k2]{\bm{\mathbf{c}}}=[k_{1},k_{2},..,k_{m-1},k_{m},k_{m-1},..,k_{2}]. Therefore K=Ci=1m,j=1mK=C_{i=1\dots m,j=1\dots m}, and using zero padding and truncation, Ky=[Cy,0]i=1mK{\bm{\mathbf{y}}}=[C{\bm{\mathbf{y}}},{\bm{\mathbf{0}}}]^{\top}_{i=1\dots m}, where the circulant MVM CyC{\bm{\mathbf{y}}} can be computed efficiently through FFTs. GP inference can then be achieved through LCG, solving K1yK^{-1}{\bm{\mathbf{y}}} in an iterative procedure which only involves MVMs, and has an asymptotic cost of O(mlogm)\mathcal{O}(m\log m) computations. The log determinant and a single predictive variance, however, require O(m2)\mathcal{O}(m^{2}) computations.

To speed up LCG for solving linear Toeplitz systems, one can use circulant pre-conditioners which act as approximate inverses. One wishes to minimise the distance between the preconditioner CC and the Toeplitz matrix KK, arg minCd(C,K)\text{arg min}_{C}d(C,K). Three classical pre-conditioners include CStrang=argminCCK1C_{\text{Strang}}=\arg\min_{C}\left\|C-K\right\|_{1} (Strang,, 1986), CT. Chan=argminCCKFC_{\text{T. Chan}}=\arg\min_{C}\left\|C-K\right\|_{F} (Chan,, 1988), and CTyrtyshnikov=argminCIC1KFC_{\text{Tyrtyshnikov}}=\arg\min_{C}\left\|I-C^{-1}K\right\|_{F} (Tyrtyshnikov,, 1992).

A distinct line of research was explored more than 60 years ago in the context of statistical inference over spatial processes (Whittle,, 1954). The circulant Whittle approximation circWhittle(k)\text{circ}_{\text{Whittle}}({\bm{\mathbf{k}}}) is given by truncating the sum

i.e. we retain j=wwki+jm\sum_{j=-w}^{w}k_{i+jm} only.

Positive definiteness of C=toep(c)C=\text{toep}({\bm{\mathbf{c}}}) for the complete sum is guaranteed by construction (Guinness and Fuentes,, 2014, Section 2). For large lattices, the approach is often used due to its accuracy and favourable asymptotic properties such as consistency, efficiency and asymptotic normality (Lieberman et al.,, 2009). In fact, the circulant approximation cic_{i} is asymptotically equivalent to the initial covariance kik_{i}, see Gray, (2005, Lemma 4.5), hence the logdet approximation inevitably converges to the exact value.

where we threshold c=FHmax(Fcirc(k),0){\bm{\mathbf{c}}}=F^{\textsf{H}}\max(F\text{circ}({\bm{\mathbf{k}}}),{\bm{\mathbf{0}}}). 1{\bm{\mathbf{1}}} denotes a vector 1=[1,1,,1]{\bm{\mathbf{1}}}=[1,1,\dots,1]^{\top}, FHF^{H} is the conjugate transpose of the Fourier transform matrix. circ(k)\text{circ}({\bm{\mathbf{k}}}) denotes one of the T. Chan, Tyrtyshnikov, Strang, Helgason or Whittle approximations.

3 BCCB Approximation for BTTB

We note that blocks in the BTTB matrix need not be symmetric. Moreover, symmetric BCCB matrices – in contrast to symmetric BTTB matrices – are fully characterised by their first column.

4 Projections

Even if we do not exploit structure in KU,UK_{U,U}, our framework in section 5 provides efficiency gains over conventional inducing point approaches, particularly for test time predictions. However, if we are to place UU onto a multidimensional (Cartesian product) grid so that KU,UK_{U,U} has Kronecker structure, then the total number of inducing points mm (the cardinality of UU) grows exponentially with the number of grid dimensions, limiting one to about five or fewer grid dimensions for practical applications. However, we need not limit the applicability of our approach to data inputs XX with D5D\leq 5 input dimensions, even if we wish to exploit Kronecker structure in KU,UK_{U,U}. Indeed, many inducing point approaches suffer from the curse of dimensionality, and input projections have provided an effective countermeasure (e.g., Snelson,, 2007).

The resulting kernel generalises the RBF and ARD kernels, which respectively have spherical and diagonal covariance matrices, with a full covariance matrix Σ=PP\Sigma=PP^{\top}, which allows for richer models (Vivarelli and Williams,, 1998). But in our context, this added flexibility has special meaning. Kronecker methods typically require a kernel which separates as a product across input dimensions (section 3.1). Here, we can capture sophisticated correlations between the different dimensions of the data inputs x{\bm{\mathbf{x}}} through the projection matrix PP, while preserving Kronecker structure in KU,UK_{U,U}. Moreover, learning PP in a supervised manner, e.g., through the Gaussian porcess marginal likelihood, has immediate advantages over unsupervised dimensionality reduction of x{\bm{\mathbf{x}}}; for example, if only a subset of the data inputs were used in producing the target values, this structure would not be detected by an unsupervised method such as PCA, but can be learned through PP. Thus in addition to the critical benefit of allowing for applications with D>5D>5 dimensional data inputs, PP can also enrich the expressive power of our MSGP model.

The entries of PP become hyperparameters of the Gaussian process marginal likelihood of Eq. (3), and can be treated in exactly the same way as standard kernel hyperparameters such as length-scale. One can learn these parameters through marginal likelihood optimisation.

Computing the derivatives of the log marginal likelihood with respect to the projection matrix requires some care under the structured kernel interpolation approximation to the covariance matrix KX,XK_{X,X}. We provide the mathematical details in the appendix A.

For practical reasons, one may wish to restrict PP to be orthonormal or have unit scaling. We discuss this further in section 6.2.

Experiments

In these preliminary experiments, we stress test MSGP in terms of training and prediction runtime, as well as accuracy, empirically verifying its scalability and predictive performance. We also demonstrate the consistency of the model in being able to learn supervised projections, for higher dimensional input spaces.

We compare to exact Gaussian processes, FITC (Snelson and Ghahramani,, 2006), Sparse Spectrum Gaussian Processes (SSGP) (Lázaro-Gredilla et al.,, 2010), the Big Data GP (BDGP) (Hensman et al.,, 2013), MSGP with Toeplitz (rather than circulant) methods, and MSGP not using the new scalable approach for test predictions (scalable test predictions are described in section 5).

The experiments were executed on a workstation with Intel i7-4770 CPU and 32 GB RAM. We used a step length of 0.010.01 for BDGP based on the values reported by Hensman et al., (2013) and a batchsize of 300300. We also evaluated BDGP with a larger batchsize of 50005000 but found that the results are qualititively similar. We stopped the stochastic optimization in BDGP when the log-likelihood did not improve at least by 0.10.1 within the last 5050 steps or after 50005000 iterations.

One cannot exploit Kronecker structure in one dimensional inputs for scalability, and Toeplitz methods, which apply to 1D inputs, are traditionally limited to about 10,00010,000 points if one requires many marginal likelihood evaluations for kernel learning. Thus to stress test the value of the circulant approximation most transparently, and to give the greatest advantage to alternative approaches in terms of scalability with number of inducing points mm, we initially stress test in a 1D input space, before moving onto higher dimensional problems.

In particular, we sample 1D inputs xx uniform randomly in $,sothatthedatainputshavenogridstructure.Wethengenerateoutofclassgroundtruthdata, so that the data inputs have no grid structure. We then generate out of class ground truth dataf(x)=\sin(x)\exp\left(\frac{-x^{2}}{2\times 5^{2}}\right)withadditiveGaussiannoisetoformwith additive Gaussian noise to form{\bm{\mathbf{y}}}.Wedistributeinducingpointsonaregularlyspacedgridin. We distribute inducing points on a regularly spaced grid in$.

In Figure 2 we show the runtime for a marginal likelihood evaluation as a function of training points nn, and number inducing points mm. MSGP quickly overtakes the alternatives as nn increases past this point. Moreover, the runtimes for MSGP with different numbers of inducing points converge quickly with increases in mm. By n=107n=10^{7}, MSGP requires the same training time for m=103m=10^{3} inducing points as it does for m=106m=10^{6} inducing points! Indeed, MSGP is able to accommodate an unprecedented number of inducing points, overtaking the alternatives, which are using m=103m=10^{3} inducing points, when using m=106m=10^{6} inducing points. Such an exceptionally large number of inducing points allows for near-exact accuracy, and the ability to retain model structure necessary for large scale kernel learning. Note that m=103m=10^{3} inducing points (resp. basis functions) is a practical upper limit in alternative approaches (Hensman et al., (2013), for example, gives mm\in as a practical upper bound for conventional inducing approaches for large nn).

We emphasize that although the stochastic optimization of the BDGP can take some time to converge, the ability to use mini-batches for jointly optimizing the variational parameters and the GP hyper parameters, as part of a principled framework, makes BDGP highly scalable. Indeed, the methodology in MSGP and BDGP are complementary and could be combined for a particularly scalable Gaussian process framework.

In Figure 3, we show that the prediction runtime for MSGP is practically independent of both mm and nn, and for any fixed m,nm,n is much faster than the alternatives, which depend at least quadratically on mm and nn. We also see that the local interpolation strategy for test predictions, introduced in section 5.1, greatly improves upon MSGP using the exact predictive distributions while exploiting Kronecker and Toeplitz algebra.

As we increase the number of inducing points mm the quality of predictions are improved, as we expect. The fast predictive variances, based on local kernel interpolation, are not as sensitive to the number of inducing points as the alternatives, but nonetheless have reasonable performance. We also see that the the fast predictive variances are improved by having an increasing number of samples nsn_{s} in the stochastic estimator of section 5.1.2, which is most noticeable for larger values numbers of inducing points mm. Notably, the fast predictive mean, based on local interpolation, is essentially indistinguishable from the predictive mean (‘slow’) using the standard GP predictive equations without interpolation. Overall, the error of the fast predictions is much less than the average variability in the data.

2 Projections

Here we test the consistency of our approach in section 5.4 for recovering ground truth projections, and providing accurate predictions, on D5D\gg 5 dimensional input spaces.

To generate data, we begin by sampling the entries of a d×Dd\times D projection matrix PP from a standard Gaussian distribution, and then project n=3000n=3000 inputs x{\bm{\mathbf{x}}} (of dimensionality D×1D\times 1), with locations randomly sampled from a Gaussian distribution so that there is no input grid structure, into a d=2d=2 dimensional space: x=Px{\bm{\mathbf{x}}}^{\prime}=P{\bm{\mathbf{x}}}. We then sample data y{\bm{\mathbf{y}}} from a Gaussian process with an RBF kernel operating on the low dimensional inputs x{\bm{\mathbf{x}}}^{\prime}. We repeat this process 3030 times for each of D=1,2,3,4,5,6,7,8,9,10,15,20,25,,100D=1,2,3,4,5,6,7,8,9,10,15,20,25,\dots,100.

We now index the data y{\bm{\mathbf{y}}} by the high dimensional inputs x{\bm{\mathbf{x}}} and attempt to reconstruct the true low dimensional subspace described by x=Px{\bm{\mathbf{x}}}^{\prime}=P{\bm{\mathbf{x}}}. We learn the entries of PP jointly with covariance parameters through marginal likelihood optimisation (Eq. (3)). Using a (d=2)(d=2) 50×5050\times 50 Cartesian product grid for 25002500 total inducing points UU, we reconstruct the projection matrix PP, with the subspace error,

shown in Figure 5. Here GiG_{i} is the orthogonal projection onto the dd-dimensional subspace spanned by the rows of PiP_{i}. This metric is motivated by the one-to-one correspondence between subspaces and orthogonal projections. It is bounded in $,wherethemaximumdistance, where the maximum distance1$ indicates that the subspaces are orthogonal to each other and the minimum is achieved if the subspaces are identical. More information on this metric including how to compute dist of Eq. (13) is available in chapter 2.5.3 of Golub and Van Loan, (2012).

We implement variants of our approach where PP is constrained to be 1) orthonormal, 2) to have unit scaling, e.g., Punit=diag((PP)1)PP_{\text{unit}}=\text{diag}(\sqrt{(PP^{\top})}^{-1})P. Such constraints prevent degeneracies between PP and kernel hyperparameters from causing practical issues, such as length-scales growing to large values to shrink the marginal likelihood log determinant complexity penalty, and then re-scaling PP to leave the marginal likelihood model fit term unaffected. In practice we found that unit scaling was sufficient to avoid such issues, and thus preferable to orthonormal PP, which are more constrained.

Discussion

We introduce massively scalable Gaussian processes (MSGP), which significantly extend KISS-GP, inducing point, and structure exploiting approaches, for: (1) O(1)\mathcal{O}(1) test predictions (section 5.1); (2) circulant log determinant approximations which (i) unify Toeplitz and Kronecker structure; (ii) enable extremely fast marginal likelihood evaluations (section 5.2); and (iii) extend KISS-GP and Toeplitz methods for scalable kernel learning in D=1 input dimensions, where one cannot exploit multidimensional Kronecker structure for scalability. (3) more general BTTB structure, which enables fast exact multidimensional inference without requiring Kronecker (tensor) decompositions (section 5.3); and, (4) projections which enable KISS-GP to be used with structure exploiting approaches for D5D\gg 5 input dimensions, and increase the expressive power of covariance functions (section 5.4).

We demonstrate these advantages, comparing to state of the art alternatives. In particular, we show MSGP is exceptionally scalable in terms of training points nn, inducing points mm, and number of testing points. The ability to handle large mm will prove important for retaining accuracy in scalable Gaussian process methods, and in enabling large scale kernel learning.

This document serves to report substantial developments regarding the SKI and KISS-GP frameworks introduced in Wilson and Nickisch, (2015).

References

Appendix A APPENDIX

apply chain rule to obtain from ψQ\frac{\partial\psi}{\partial Q} from ψP\frac{\partial\psi}{\partial P}

A.2 Derivatives For Orthonormal Projections

apply chain rule to obtain from ψQ\frac{\partial\psi}{\partial Q} from ψP\frac{\partial\psi}{\partial P}

use eigenvalue decomposition PP=VFVPP^{\top}=VFV^{\top}, define S=Vdiag(s)VS=V\text{diag}(\mathbf{s})V^{\top}, s=diag(F)\mathbf{s}=\sqrt{\text{diag}(F)}

A.3 Circulant Determinant Approximation Benchmark