On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions

Francis Bach

Introduction

The numerical computation of high-dimensional integrals is one of the core computational tasks in many areas of machine learning, signal processing and more generally applied mathematics, in particular in the context of Bayesian inference (Gelman, 2004), or the study of complex systems (Robert and Casella, 2005). In this paper, we focus on quadrature rules, that aim at approximating the integral of a certain function from only the (potentially noisy) knowledge of the function values at as few as possible well-chosen points. Key situations that remain active areas of research are problems where the measurable space where the function is defined on is either high-dimensional or structured (e.g., presence of discrete structures, or graphs). For these problems, techniques based on positive definite kernels have emerged as having the potential to efficiently deal with these situations, and to improve over plain Monte-Carlo integration (O’Hagan, 1991; Rasmussen and Ghahramani, 2003; Huszár and Duvenaud, 2012; Oates and Girolami, 2015). In particular, the quadrature problem may be cast as the one of approximating a fixed element, the mean element (Smola et al., 2007), of a Hilbert space as a linear combination of well chosen elements, the goal being to minimize the number of these factors as it corresponds to the required number of function evaluations.

A seemingly unrelated problem on positive definite kernels have recently emerged, namely the representation of the corresponding infinite-dimensional feature space from random sets of features. If a certain positive definite kernel between two points may be represented as the expectation of the product of two random one-dimensional (typically non-linear) features computed on these two points, the full kernel (and hence its feature space) may be approximated by sufficiently many random samples, replacing the expectation by a sample average (Neal, 1995; Rahimi and Recht, 2007; Huang et al., 2006). When using these random features, the complexity of a regular kernel method such as the support vector machine or ridge regression goes from quadratic in the number of observations to linear in the number of observations, with a constant proportional to the number of random features, which thus drives the running time complexity of these methods.

In this paper, we make the following contributions:

After describing the functional analysis framework our analysis is based on and presenting many examples in Section 2, we show in Section 3 that these two problems are strongly related; more precisely, optimizing weights in kernel-based quadrature rules can be seen as decomposing a certain function in a special class of random features for a particular decomposition that always exists for all positive definite kernels on a measurable space.

We provide in Section 4 a theoretical analysis of the number of required samples for a given approximation error, leading to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms. In particular, we show that the upper bound may be obtained as independent and identically distributed samples from a specific non-uniform distribution, while the lower bound if valid for any set of points.

Applying our results to kernel quadrature, while our results are fairly general, we recover known upper and lower bounds for the special cases of Sobolev spaces (Section 4.4). Moreover, our results extend to the more general problem of full function approximations (beyond simply computing an integral), with results in L2L_{2}- and LL_{\infty}-norm that match known results for special cases (Section 5).

Applying our results to random feature expansions, we show in Section 4.5 an improvement of the number of random features needed for preserving the generalization guarantees for learning with Lipshitz-continuous losses.

Random Feature Expansions of Positive Definite Kernels

Throughout this paper, we consider a topological space X{\mathcal{X}} equipped with a Borel probability measure dρd\rho, which we assume to have full support. This naturally defines the space of square-integrable functionsFor simplicity and following most of the literature on kernel methods, we identify functions and their equivalence classes for the equivalence relationship of being equal except for a zero-measure (for dρd\rho) subset of X{\mathcal{X}}..

Membership of kernel evaluations: for any xXx\in{\mathcal{X}}, the function k(,x):yk(y,x)k(\cdot,x):y\mapsto k(y,x) is an element of F{\mathcal{F}}.

Reproducing property: for all fFf\in{\mathcal{F}} and xXx\in{\mathcal{X}}, f(x)=f,k(,x)Ff(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{F}}. In other words, function evaluations are equal to dot-products with a specific element of F{\mathcal{F}}.

Moreover, throughout the paper, we assume that the function xk(x,x)x\mapsto k(x,x) is integrable with respect to dρd\rho (which is weaker than supxXk(x,x)<\sup_{x\in\mathcal{X}}k(x,x)<\infty). This implies that F{\mathcal{F}} is a subset of L2(dρ)L_{2}(d\rho); that is, functions in the RKHS F{\mathcal{F}} are all square-integrable for dρd\rho. In general, F{\mathcal{F}} is strictly included in L2(dρ)L_{2}(d\rho), but, in this paper, we will always assume that it is dense in L2(dρ)L_{2}(d\rho), that is, any function in L2(dρ)L_{2}(d\rho) may be approximated arbitrarily closely by a function in F{\mathcal{F}}. Finally, for simplicity of our notation (to make sure that the sequence of eigenvalues of integral operators is infinite) we will always assume that L2(dρ)L_{2}(d\rho) is infinite-dimensional, which excludes finite sets for X{\mathcal{X}}. Note that the last two assumptions (denseness and infinite dimensionality) can easily be relaxed.

Reproducing kernel Hilbert spaces are often studied through a specific integral operator which leads to an isometry with L2(dρ)L_{2}(d\rho) (Smale and Cucker, 2001). Let Σ:L2(dρ)L2(dρ)\Sigma:L_{2}(d\rho)\to L_{2}(d\rho) be defined as

Since Xk(x,x)dρ(x)\int_{\mathcal{X}}k(x,x)d\rho(x) is finite, Σ\Sigma is self-adjoint, positive semi-definite and trace-class (Simon, 1979). Given that Σf\Sigma f is a linear combination of kernel functions k(,y)k(\cdot,y), it belongs to F{\mathcal{F}}. More precisely, since we have assumed that F{\mathcal{F}} is dense in L2(dρ)L_{2}(d\rho), Σ1/2\Sigma^{1/2}, which is the unique positive self-adjoint square root of Σ\Sigma, is a bijection from L2(dρ)L_{2}(d\rho) to our RKHS F{\mathcal{F}}; that is, for any fFf\in{\mathcal{F}}, there exists a unique gL2(dρ)g\in L_{2}(d\rho) such that f=Σ1/2gf=\Sigma^{1/2}g and fH=gL2(dρ)\|f\|_{\mathcal{H}}=\|g\|_{L_{2}(d\rho)} (Smale and Cucker, 2001). This justifies the notation Σ1/2f\Sigma^{-1/2}f for fFf\in{\mathcal{F}} and means that Σ1/2\Sigma^{1/2} is an isometry from L2(dρ)L_{2}(d\rho) to F{\mathcal{F}}; in other words, for any functions ff and gg in F{\mathcal{F}}, we have:

This justifies the view of F{\mathcal{F}} as the subspace of functions fL2(dρ)f\in L_{2}(d\rho) such that Σ1/2fL2(dρ)2\|\Sigma^{-1/2}f\|_{L_{2}(d\rho)}^{2}. This relationship is even more transparent when considering a spectral decomposition of Σ\Sigma.

Mercer decomposition.

From extensions of Mercer’s theorem (König, 1986), there exists an orthonormal basis (em)m1(e_{m})_{m\geqslant 1} of L2(dρ)L_{2}(d\rho) and a summable non-increasing sequence of strictly positive eigenvalues (μm)m1(\mu_{m})_{m\geqslant 1} such that Σem=μmem\Sigma e_{m}=\mu_{m}e_{m}. Note that since we have assumed that F{\mathcal{F}} is dense in L2(dρ)L_{2}(d\rho), there are no zero eigenvalues.

Since Σ1/2\Sigma^{1/2} is an isometry from L2(dρ)L_{2}(d\rho) to F{\mathcal{F}}, (μm1/2em)m1(\mu_{m}^{1/2}e_{m})_{m\geqslant 1} is an orthonormal basis of F{\mathcal{F}}. Moreover, we can use the eigendecomposition to characterize elements of F{\mathcal{F}} as the functions in L2(dρ)L_{2}(d\rho) such that

is finite. In other words, once projected in the orthonormal basis (em)m1(e_{m})_{m\geqslant 1}, elements ff of F{\mathcal{F}} correspond to a certain decay of its decomposition coefficients (f,emL2(dρ))m1(\langle f,e_{m}\rangle_{L_{2}(d\rho)})_{m\geqslant 1}.

Finally, by decomposing the function k(,y):xk(x,y)k(\cdot,y):x\mapsto k(x,y), we obtain the Mercer decomposition:

Properties of the spectrum.

The sequence of eigenvalues (μm)m1(\mu_{m})_{m\geqslant 1} is an important quantity that appears in the analysis of kernel methods (Hastie and Tibshirani, 1990; Caponnetto and De Vito, 2007; Harchaoui et al., 2008; Bach, 2013; El Alaoui and Mahoney, 2014). It depends both on the kernel kk and the chosen distribution dρd\rho.

Some modifications of the kernel kk or the distribution dρd\rho lead to simple behaviors for the spectrum. For example, if we have a second distribution so that dρdρ\frac{d\rho^{\prime}}{d\rho} is upper-bounded by a constant cc, then, as a consequence of the Courant-Fischer minimax theorem (Horn and Johnson, 2012), the eigenvalues for dρd\rho^{\prime} are less than than cc times that the ones for dρd\rho. Similarly, if the kernel kk^{\prime} is such that ckkck-k^{\prime} is a positive definite kernel, then we have a similar bound between eigenvalues.

In this paper, for any strictly positive λ\lambda, we will also consider the quantity m(λ)m^{\ast}(\lambda) equal to the number of eigenvalues μm\mu_{m} that are greater than or equal to λ\lambda. Since we have assume that the sequence mm is non-increasing, we have m(λ)=max{m1, μmλ}m^{\ast}(\lambda)=\max\{m\geqslant 1,\ \mu_{m}\geqslant\lambda\}. This is a left-continuous non-increasing function, that tends to ++\infty when λ\lambda tends to zero (since we have assumed that there are infinitely many strictly positive eigenvalues), and characterizes the sequence (μm)m1(\mu_{m})_{m\geqslant 1}, as we can recover μm\mu_{m} as μm=sup{λ0, m(λ)m}\mu_{m}=\sup\{\lambda\geqslant 0,\ m^{\ast}(\lambda)\geqslant m\}.

Potential confusion with covariance operator.

Note that the operator Σ\Sigma is a self-adjoint operator on L2(dρ)L_{2}(d\rho). It should not be confused with the (non-centered) covariance operator CC (Baker, 1973), which is a self-adjoint operator on a different space, namely the RKHS F{\mathcal{F}}, defined by g,CfF=Xf(x)g(x)dρ(x)\langle g,Cf\rangle_{\mathcal{F}}=\int_{{\mathcal{X}}}f(x)g(x)d\rho(x). Given that Σ1/2\Sigma^{1/2} is an isometry from L2(dρ)L_{2}(d\rho) to F{\mathcal{F}}, the operator CC may also be used to define an operator on L2(dρ)L_{2}(d\rho), which happens to be exactly Σ\Sigma. Thus, the two operators have the same eigenvalues. Moreover, we have, for any yXy\in{\mathcal{X}}:

that is, CC is equal to the restriction of Σ\Sigma on F{\mathcal{F}}.

2 Kernels as expectations

Such additional structure allows to give an explicit characterization of the RKHS F{\mathcal{F}} in terms of the features φ\varphi. In terms of operators, the function φ\varphi leads to a specific square-root of the integral operator Σ\Sigma defined in Section 2.1 (which is not the positive self-adjoint square-root Σ1/2\Sigma^{1/2}).

We consider the bounded linear operator T:L2(dτ)L2(dρ)T:L_{2}(d\tau)\to L_{2}(d\rho) defined as

Given T:L2(dτ)L2(dρ)T:L_{2}(d\tau)\to L_{2}(d\rho), the adjoint operator T:L2(dρ)L2(dτ)T^{\ast}:L_{2}(d\rho)\to L_{2}(d\tau) is the unique operator such that g,TfL2(dτ)=Tg,fL2(dρ)\langle g,T^{\ast}f\rangle_{L_{2}(d\tau)}=\langle Tg,f\rangle_{L_{2}(d\rho)} for all f,gf,g. Given the definition of TT in Eq. (2), we simply inverse the role of V{\mathcal{V}} and X{\mathcal{X}} and have:

that is we have an expression of the integral operator Σ\Sigma as Σ=TT\Sigma=TT^{\ast}. Thus, the decomposition of the kernel kk as an expectation corresponds to a particular square root TT of the integral operator—there are many possible choices for such square roots, and thus many possible expansions like Eq. (1). It turns out that the positive self-adjoint square root Σ1/2\Sigma^{1/2} will correspond to the equivalence with quadrature rules (see Section 3.2).

Decomposition of functions in ℱℱ{\mathcal{F}}.

Since Σ=TT\Sigma=TT^{\ast} and Σ1/2\Sigma^{1/2} is an isometry between L2(dρ)L_{2}(d\rho) and F{\mathcal{F}}, we can naturally expressed any elements of F{\mathcal{F}} through the operator TT and thus the features φ\varphi.

As a linear operator, TT defines a bijection from the orthogonal of its null space (Ker T)L2(dτ)({\rm Ker}\ T)^{\perp}\subset L_{2}(d\tau) to its image Im(T)L2(dρ){\rm Im}(T)\subset L_{2}(d\rho), and this allows to define uniquely T1f(Ker T)T^{-1}f\in({\rm Ker}\ T)^{\perp} for any fIm(T)f\in{\rm Im}(T), and a dot-product on Im(T){\rm Im}(T) as

As shown by Bach (2014, App. A), Im(T){\rm Im}(T) turns out to be equal to our RKHSThe proof goes as follows: (a) for any yXy\in{\mathcal{X}}, k(,y)k(\cdot,y) can be expressed as Vφ(v,y)φ(v,)dτ(v)=Tφ(,y)\int_{{\mathcal{V}}}\varphi(v,y)\varphi(v,\cdot)d\tau(v)=T\varphi(\cdot,y) and thus belongs to Im(T){\rm Im}(T); (b) for any fIm(T)f\in{\rm Im}(T), and yXy\in{\mathcal{X}}, we have f,k(,y)Im(T)=T1f,φ(,y)L2(dτ)=(TT1f)(y)=f(y)\langle f,k(\cdot,y)\rangle_{{\rm Im}(T)}=\langle T^{-1}f,\varphi(\cdot,y)\rangle_{L_{2}(d\tau)}=(TT^{-1}f)(y)=f(y), that is, the reproducing property is satisfied. These two properties are characteristic of F{\mathcal{F}}.. Thus, the norm fF2\|f\|_{\mathcal{F}}^{2} for fFf\in{\mathcal{F}} is equal to the squared L2L_{2}-norm of T1f(Ker T)T^{-1}f\in({\rm Ker}\ T)^{\perp}, which is itself equal to the minimum of gL2(dτ)2\|g\|^{2}_{L_{2}(d\tau)} over all gg such that Tg=fTg=f. The resulting gg may also be defined through pseudo-inverses.

In other words, a function fL2(dρ)f\in L_{2}(d\rho) is in F{\mathcal{F}} if and only if it may be written as

Singular value decomposition.

The operator TT is an Hilbert-Schmidt operator, to which the singular value decopomposition can be applied (Kato, 1995). That is, there exists an orthonormal basis (fm)m1(f_{m})_{m\geqslant 1} of (Ker T)L2(dτ)({\rm Ker}\ T)^{\perp}\subset L_{2}(d\tau), together with the orthonormal basis (em)m1(e_{m})_{m\geqslant 1} of L2(dρ)L_{2}(d\rho) which we have from the eigenvalue decomposition of Σ=TT\Sigma=TT^{\ast}, such that Tfm=μm1/2emTf_{m}=\mu_{m}^{1/2}e_{m}. Moreover, we have:

with a convergence in L2(dτdρ)L_{2}(d\tau\otimes d\rho). This extends the Mercer decomposition of the kernel k(x,y)k(x,y).

Integral operator as an expectation.

Given the expansion of the kernel kk in Eq. (1), we may express the integral operator Σ\Sigma as follows, explicitly as an expectation:

where aL2(dρ)ba\otimes_{L_{2}(d\rho)}b is the operator L2(dρ)L2(dρ)L_{2}(d\rho)\to L_{2}(d\rho) so that (aL2(dρ)b)f=b,fL2(dρ)a(a\otimes_{L_{2}(d\rho)}b)f=\langle b,f\rangle_{L_{2}(d\rho)}a. This will be useful to define empirical versions, where the integral over dτd\tau will be replaced by a finite average.

3 Examples

In this section, we provide examples of kernels and usual decompositions. We first start by decompositions that always exist, then focus on specific kernels based on Fourier components.

The Mercer decomposition provides an expansion for all kernels, as follows:

Periodic kernels on [0,1]01[0,1].

We consider X={\mathcal{X}}= and translation-invariant kernels k(x,y)k(x,y) of the form k(x,y)=t(xy)k(x,y)=t(x-y), where tt is a square-integrable 1-periodic function. These kernels are positive definite if and only if the Fourier series of tt is non-negative (Wahba, 1990). An orthonormal basis of L2()L_{2}() is composed of the constant function c0:x1c_{0}:x\mapsto 1 and the functions cm:x2cos2πmxc_{m}:x\mapsto\sqrt{2}\cos 2\pi mx and sm:x2sin2πmxs_{m}:x\mapsto\sqrt{2}\sin 2\pi mx. A kernel may thus be expressed as

A particularly interesting example is obtained through derivatives of ff. If ff is differentiable and has a derivative fL2()f^{\prime}\in L_{2}(), then, on the Fourier series coefficients of ff, taking the derivative corresponds to multiplying the two mm-th coefficients by 2πm2\pi m and swapping them. Sobolev spaces for periodic functions on $(i.e.,suchthat(i.e., such thatf(0)\!=\!f(1))aredefinedthroughintegrabilityofderivatives(AdamsandFournier,2003).IntheHilbertspacesetup,afunction) are defined through integrability of derivatives (Adams and Fournier, 2003). In the Hilbert space set-up, a functionfbelongstotheSobolevspaceoforderbelongs to the Sobolev space of ordersifonecandefineaif one can define asthordersquareintegrablederivativein-th order square-integrable derivative inL_{2}(fortheLebesguemeasure,whichhappenstobeequalto(for the Lebesgue measure, which happens to be equal tod\rho),thatis,), that is,f^{(s)}\in L_{2}().TheSobolevsquarednormisthendefinedasanypositivelinearcombinationofthequadraticforms. The Sobolev squared norm is then defined as any positive linear combination of the quadratic forms\int_{0}^{1}f^{(t)}(x)^{2}dx,,t\in\{0,\dots,s\},withnonzerocoefficientsfor, with non-zero coefficients fort=0andandt=s(allofthesenormsarethenequivalent).Ifonlyusing(all of these norms are then equivalent). If only usingt=0andandt=swithnonzerocoefficients,weneedwith non-zero coefficients, we need\nu_{0}^{-1}=1andand\nu_{m}^{-1}=1+m^{2s}.Anequivalent(i.e,withupperandlowerboundedratios)sequenceisobtainedbyreplacing. An equivalent (i.e, with upper and lower bounded ratios) sequence is obtained by replacing\nu_{m}=(1+m^{2s})^{-1}byby\nu_{m}=m^{-2s}$, leading to a closed-form formula:

where {xy}\{x-y\} denotes the fractional part of xyx-y, and B2sB_{2s} is the 2s2s-th Bernoulli polynomial (Wahba, 1990). The RKHS F{\mathcal{F}} is then the Sobolev space of order ss on $$, with a norm equivalent to any of the family of Sobolev norms; it will be used as a running example throughout this paper.

In order to extend to d>1d>1, we may consider several extensions as described by Oates and Girolami (2015), and compute the resulting eigenvalues of the integral operators. For simplicity, we consider the Sobolev space on $,with, with\nu_{0}=1andand\nu_{m}^{-1}=m^{2s}forform>0.Thefirstpossibilitytoextendto. The first possibility to extend to^{d}istotakeakernelwhichissimplythepointwiseproductofindividualkernelsonis to take a kernel which is simply the pointwise product of individual kernels on.Thatis,if. That is, ifk(x,y)isthekernelonis the kernel on,define, defineK(X,Y)=\prod_{j=1}^{d}k(x_{j},y_{j})betweenbetweenXandandYinin^{d}.AsshowninAppendixA,thisleadstoeigenvaluedecaysboundedby. As shown in Appendix A, this leads to eigenvalue decays bounded by(\log m)^{2s(d-1)}m^{-2s},andthusuptologarithmictermsatthesamespeed, and thus up to logarithmic terms at the same speedm^{-2s}asasd=1.Whilethissoundsattractiveintermsofgeneralizationperformance,itcorrespondstoaspaceafunctionwhichisnotaSobolevspacein. While this sounds attractive in terms of generalization performance, it corresponds to a space a function which is not a Sobolev space inddimensions.Thatistheassociatedsquarednormondimensions. That is the associated squared norm onfwouldbeequivalenttoalinearcombinationofsquaredwould be equivalent to a linear combination of squaredL_{2}$-norm of partial derivatives

for all t1,,tdt_{1},\dots,t_{d} in {0,,s}\{0,\dots,s\}. This corresponds to functions which have square-integrable partial derivatives with all individual orders less than ss. All values of s1s\geqslant 1 are allowed and lead to an RKHS.

This is thus to be contrasted with the usual multi-dimensional Sobolev space which is composed of functions which have square-integrable partial derivatives with all orders (t1,,td)(t_{1},\dots,t_{d}) with sum t1++tdt_{1}+\cdots+t_{d} less than ss. Only s>d/2s>d/2 is then allowed to get an RKHS. The Sobolev norm is then of the form

In the expansion on the dd-th order tensor product of the Fourier basis, the norm above is equivalent to putting a weight on the element (m1,,md)(m_{1},\dots,m_{d}) asymptotically equivalent to \big{(}\sum_{j=1}^{d}m_{j}\big{)}^{2s}, which thus represent the inverse of the eigenvalues of the corresponding kernel for the uniform distribution dρd\rho (this is simply an explicit Mercer decomposition). Thus, the number of eigenvalues which are greater than λ\lambda grows as the number of (m1,,md)(m_{1},\dots,m_{d}) such that their sum is less than λ1/(2s)\lambda^{-1/(2s)}, which itself is less than a constant times λd/(2s)\lambda^{-d/(2s)} (see a proof in Appendix A). This leads to an eigenvalue decay of m2s/dm^{-2s/d}, which is much worse because of the term in 1/d1/d in the exponent.

For these kernels, the decay of eigenvalues has been well-studied by Widom (1963), who relates the decay of eigenvalues to the tails of the distribution dρd\rho and the decay of the Fourier transform of tt. For example, for the Gaussian kernel where k(x,y)=exp(αxy22)k(x,y)=\exp(-\alpha\|x-y\|_{2}^{2}), on sub-Gaussian distributions, the decay of eigenvalues is geometric, and for kernels leading to Sobolev spaces of order ss, such as the Matern kernel (Furrer and Nychka, 2007), the decay is of the form m2s/dm^{-2s/d}. See also examples by Birman and Solomyak (1977); Harchaoui et al. (2008).

Finally, note that in terms of computation, there are extensions to avoid linear complexity in dd (Le et al., 2013).

Kernels on hyperspheres.

As shown by Smola et al. (2001); Bach (2014), the equivalent of Fourier series (which corresponds to d=1d=1) is then the basis of spherical harmonics, which is organized by integer frequencies k1k\geqslant 1; instead of having 2 basis vectors (sine and cosine) per frequency, there are O(kd1)O(k^{d-1}) of them. As shown by Bach (2014, page 44), we have an explicit expansion of k(x,y)k(x,y) in terms of spherical harmonics, leading to a sequence of eigenvalues equal to kd2s1k^{-d-2s-1} on the entire subspace associated with frequency kk. Thus, by taking multiplicity into account, after j=1kjd1kd\sum_{j=1}^{k}j^{d-1}\approx k^{d} (up to constants) eigenvalues, we have an eigenvalue of kd2s1k^{-d-2s-1}; this leads to an eigenvalue decay (where all eigenvalues are ordered in decreasing order and we consider the mm-th one) as (m1/d)d2s1=m11/d2s/d(m^{1/d})^{-d-2s-1}=m^{-1-1/d-2s/d}.

4 Approximation from randomly sampled features

Given the formulation of kk as an expectation in Eq. (1), it is natural to consider sampling nn elements v1,,vnVv_{1},\dots,v_{n}\in{\mathcal{V}} from the distribution dτd\tau and define the kernel approximation

which defines a finite-dimensional RKHS F^\hat{{\mathcal{F}}}.

In this paper, we rather consider approximations of functions in F{\mathcal{F}} by functions in F^\hat{{\mathcal{F}}}, the RKHS associated with k^\hat{k}. A key difficulty is that in general F^\hat{{\mathcal{F}}} is not even included in F\mathcal{F}, and therefore, we cannot use the norm in F{\mathcal{F}} to characterize approximations. In this paper, we choose the L2L_{2}-norm associated with the probability measure dρd\rho on X{\mathcal{X}} to characterize the approximation. Given fFf\in{\mathcal{F}} with norm fF\|f\|_{\mathcal{F}} less than one, we look for a function f^F^\hat{f}\in\hat{{\mathcal{F}}} of the smallest possible norm and so that ff^L2(dρ)\|f-\hat{f}\|_{L_{2}(d\rho)} is as small as possible.

Note that the measure dτd\tau is associated to the kernel kk and the random features φ\varphi, while the measure dρd\rho is associated to the way we want to measure errors (and leads to a specific defintion of the integral operator Σ\Sigma).

Goals.

We thus aim at sampling nn points v1,,vnVv_{1},\dots,v_{n}\in{\mathcal{V}} from a distribution with density qq with respect to dτd\tau. Then the kernel approximation using importance weights is equal to

(so that the law of large numbers leads to an approximation converging to kk), and we thus aim at minimizing \Big{\|}\sum_{i=1}^{n}\frac{\beta_{i}}{q(v_{i})^{1/2}}\varphi(v_{i},\cdot)-\int_{\mathcal{V}}g(v)\varphi(v,\cdot)d\tau(v)\Big{\|}_{L_{2}(d\rho)}, with nβ22n\|\beta\|_{2}^{2} (which represents the norm of the approximation in F^\hat{{\mathcal{F}}} because of our importance weights are taken into account) as small as possible.

Quadrature in RKHSs

Following Smola et al. (2007), the error may be expressed using the reproducing property as:

and by Cauchy-Schwarz inequality its supremum over hF1\|h\|_{\mathcal{F}}\leqslant 1 is equal to

In this paper, we explore sampling points xix_{i} from a probability distribution on X{\mathcal{X}} with density qq with respect to dρd\rho. Note that when gg is a constant function, it is sometimes required that the coefficients α\alpha are non-negative and sum to a fixed constant (so that constant functions are exactly integrated). We will not pursue this here as our theoretical results do not accommodate such constraints (see, e.g., Chen et al., 2010; Bach et al., 2012, and references therein).

In practice, independent (but not necessarily identically distributed) noise εi\varepsilon_{i} may be present with variance σ2(xi)\sigma^{2}(x_{i}). Then, the worst (with respect to hF1\|h\|_{\mathcal{F}}\leqslant 1) expected (with respect to the noise) squared error is

2 Reformulation as random features

For any xXx\in{\mathcal{X}}, the function k(,x)k(\cdot,x) is in F{\mathcal{F}}, and since we have assumed that Σ1/2\Sigma^{1/2} is an isometry from L2(dρ)L_{2}(d\rho) to F{\mathcal{F}}, there exists a unique element, which we denote ψ(,x)\psi(\cdot,x), of L2(dρ)L_{2}(d\rho) such that Σ1/2ψ(,x)=k(,x)\Sigma^{1/2}\psi(\cdot,x)=k(\cdot,x). Given the Mercer decomposition k(,x)=m1μmem(x)emk(\cdot,x)=\sum_{m\geqslant 1}\mu_{m}e_{m}(x)e_{m}, we have the expansion ψ(,x)=m1μm1/2em(x)em\psi(\cdot,x)=\sum_{m\geqslant 1}\mu_{m}^{1/2}e_{m}(x)e_{m} (with convergence in the L2L_{2}-norm for the measure dρdρd\rho\otimes d\rho; note that we do not assume that μm1/2\mu_{m}^{1/2} is summable), and thus we may consider ψ\psi as a symmetric function. Note that ψ\psi may not be easy to compute in many practical cases (except for some periodic kernels on $$).

We thus have for (x,y)X×X(x,y)\in{\mathcal{X}}\times{\mathcal{X}}:

That is, the kernel kk may always be written as an expectation. Moreover, we have the quadrature error in Eq. (7) equal to (again using the isometry Σ1/2\Sigma^{1/2} from L2(dρ)L_{2}(d\rho) to F{\mathcal{F}}):

which is exactly an instance of the approximation result in Eq. (6) with V=X{\mathcal{V}}={\mathcal{X}} and φ=ψ\varphi=\psi, that is the random feature is indexed by the same set X{\mathcal{X}} as the kernel. Thus, the quadrature problem, that is finding points xix_{i} and weights (αi)(\alpha_{i}) to get the best possible error over all functions of the unit ball of F{\mathcal{F}}, is a subcase of the random feature problem for a specific expansion. Note that this random decomposition in terms of ψ\psi is always possible (although not in closed form in general).

As shown in Section 2.2, random feature expansions correspond to square-roots of the integral operator Σ:L2(dρ)L2(dρ)\Sigma:L_{2}(d\rho)\to L_{2}(d\rho) as Σ=TT\Sigma=TT^{\ast}. Among the many possible square roots, the quadrature case corresponds exactly to the positive self-adjoint square root T=Σ1/2T=\Sigma^{1/2}. In this situation, the basis (fm)m1(f_{m})_{m\geqslant 1} of the singular value decomposition of T=Σ1/2T=\Sigma^{1/2} is equal to (em)m1(e_{m})_{m\geqslant 1}, recovering the expansion ψ(x,y)=m1μm1/2em(x)em(y)\psi(x,y)=\sum_{m\geqslant 1}\mu_{m}^{1/2}e_{m}(x)e_{m}(y) which we have seen above.

In this important situation, we have two different expansions: the one based on Fourier features, where the random variable indexing the one-dimensional feature is a frequency, while for the one based on the square root ψ\psi, the random variable is a spatial variable in X{\mathcal{X}}. As we show in Section 4, our results are independent of the chosen expansions and thus apply to both. However, (a) when the goal is to do quadrature, we need to use ψ\psi, and (b) in general, the decomposition based on Fourier features can be easily computed once samples are obtained, while for most kernels, ψ(x,y)\psi(x,y) does not have any closed-form simple expression. In Section 6, we provide a simple example with X={\mathcal{X}}= where the two decompositions are considered.

Goals.

In order to be able to make the parallel with random feature approximations, we consider importance-weighted coefficients βi=αiq(xi)1/2\beta_{i}=\alpha_{i}q(x_{i})^{1/2}, and we thus aim at minimizing the approximation error

3 Relationship with column sampling

The problem of quadrature is related to the problem of column sampling. Given nn observations x1,,xnXx_{1},\dots,x_{n}\in{\mathcal{X}}, the goal of column-sampling methods is to approximate the n×nn\times n matrix of pairwise kernel evalulations, the so-called kernel matrix, from a subset of its columns. It has appeared under many names: Nyström method (Williams and Seeger, 2001), sparse greedy approximations (Smola and Schölkopf, 2000), incomplete Cholesky decomposition (Fine and Scheinberg, 2001), Gram-Schmidt orthonormalization (Shawe-Taylor and Cristianini, 2004) or CUR matrix decompositions (Mahoney and Drineas, 2009).

While column sampling has typically been analyzed for a fixed kernel matrix, it has a natural extension which is related to quadrature problems: selecting nn points x1,,xnx_{1},\dots,x_{n} from X{\mathcal{X}} such that the projection of any element of the RKHS F{\mathcal{F}} onto the subspace spanned by k(,xi)k(\cdot,x_{i}), i=1,,ni=1,\dots,n is as small as possible. Natural functions from F{\mathcal{F}} are k(,x)k(\cdot,x), xXx\in{\mathcal{X}}, and thus the goal is to minimize, for such xXx\in{\mathcal{X}},

In the usual sampling approach, several points are considered for testing the projection error, and it is thus natural to consider the criterion averaged through the measure dρd\rho, that is:

In fact, when dρd\rho is supported on a finite set, this formulation is equivalent to minimizing the nuclear norm between the kernel matrix and its low-rank approximation. There are thus several differences and similarities between recent work on column sampling (Bach, 2013; El Alaoui and Mahoney, 2014) and the present paper on quadrature rules and random features:

Different error measures: The column sampling approach corresponds to a function gg in Eq. (7) which is a Dirac function at the point xx, and is thus not in L2(dρ)L_{2}(d\rho). Thus the two frameworks are not equivalent.

Approximation vs. prediction: The works by Bach (2013); El Alaoui and Mahoney (2014) aim at understanding when column sampling leads to no loss in predictive performance within a supervised learning framework, while the present paper looks at approximation properties, mostly regardless of any supervised learning problem, except in Section 4.5 for random features (but not for quadrature).

Lower bounds: In Section 4.3, we provide explicit lower bounds of approximations, which are not available for column sampling.

Similar sampling issues: In the two frameworks, points x1,,xnXx_{1},\dots,x_{n}\in{\mathcal{X}} are sampled i.i.d. with a certain distribution qq, and the best choice depends on the appropriate notion of leverage scores (Mahoney, 2011), while the standard uniform distribution leads to an inferior approximation result. Moreover, the proof techniques are similar and based on concentration inequalities for operators, here in Hilbert spaces rather in finite dimensions.

4 Related work on quadrature

Many methods have been designed for the computation of integrals of a function given evaluations at certain well-chosen points, in most cases when gg is constant equal to one. We review some of these below.

When the underlying set X{\mathcal{X}} is a compact interval of the real line, several methods exists, such as the trapezoidal or Simpson’s rules, which are based on interpolation between the sample points, and for which the error decays as O(1/n2)O(1/n^{2}) and O(1/n4)O(1/n^{4}) for functions with uniformly bounded second or fourth derivatives (Cruz-Uribe and Neugebauer, 2002).

Gaussian quadrature is another class of methods for one-dimensional integrals: it is based on a basis of orthogonal polynomials for L2(dρ)L_{2}(d\rho) where dρd\rho is a probability measure supported in an interval, and their zeros (Hildebrand, 1987, Chap. 8). This leads to quadrature rules which are exact for polynomials of degree 2n12n-1 but error bounds for non-polynomials rely on high-order derivatives, although the empirical performance on functions of a Sobolev space in our experiments is as good as optimal quadrature schemes (see Section 6); depending on the orthogonal polynomials, we get various quadrature rules, such as Gauss-Legendre quadrature for the Lebesgue measure on $$.

Quasi Monte-carlo methods employ a sequence of points with low discrepancy with uniform weights (Morokoff and Caflisch, 1994), leading to approximation errors of O(1/n)O(1/n) for univariate functions with bounded variation, but typically with no adaptation to smoother functions.

Higher-dimensional integrals.

All of the methods above may be generalized for products of intervals d^{d}, typically with dd small. For larger problems, Bayes-Hermite quadrature (O’Hagan, 1991) is essentially equivalent to the quadrature rules we study in this paper.

Some of the quadrature rules are constrained to have positive weights with unit sum (so that the positivity properties of integrals are preserved and constants are exactly inegrated). The quadrature rules we present do not satisfy these constraints. If these constraints are required, kernel herding (Chen et al., 2010; Bach et al., 2012) provides a novel way to select a sequence of points based on the conditional gradient algorithm, but with currently no convergence guarantees improving over O(1/n)O(1/\sqrt{n}) for infinite-dimensional spaces.

Theoretical results.

The best possible error for a quadrature rule with nn points has been well-studied in several settings; see Novak (1988) for a comprehensive review. For example, for X={\mathcal{X}}= and the space of Sobolev functions, which are RKHSs with eigenvalues of their integral operator decreasing as m2sm^{-2s}, Novak (1988, Prop. 2 and 3, page 38) shows that the best possible quadrature rule for the uniform distribution and g=1g=1 leads to an error rate of nsn^{-s}, as well as for any squared-integrable function gg. The proof of these results (both upper and lower bounds) relies on detailed properties of Sobolev spaces. In this paper, we recover these results using only the decay of eigenvalues of the associated integral operator Σ\Sigma, thus allowing straightforward extensions to many situations, like Sobolev spaces on manifolds such as hyperspheres (Hesse, 2006), where we also recover existing results (up to logarithmic terms).

Moreover, Novak (1988, page 17) shows that adaptive quadrature rules where points are selected sequentially with the knowledge of the function values at previous points cannot improve the worst-case guarantees. Our results do not recover this lower bound result for adaptivity.

Finally, Langberg and Schulman (2010) consider multiplicative errors in computing integrals and mainly focuses on different function spaces, such as ones used in clustering functionals. Although sampling quadrature points from a well-chosen density is common in the two approaches, the analysis tools are different. It would be interesting to see if some of these tools can be transferred to our RKHS setting.

From quadrature to function approximation and optimization.

The problem of quadrature, uniformly over all functions gL2(dρ)g\in L_{2}(d\rho) that define the integral, is in fact equivalent to the full approximation of a function hh given values at nn points, where the approximation error is characterized in L2L_{2}-norm. Indeed, given the observations h(xi)h(x_{i}), i=1,,ni=1,\dots,n, we build i=1nαih(xi)\sum_{i=1}^{n}\alpha_{i}h(x_{i}) as an approximation of Xg(x)h(x)dρ(x)\int_{{\mathcal{X}}}g(x)h(x)d\rho(x). It turns out that the coefficients αi\alpha_{i} are linear in gg, that is, there exists aiL2(dρ)a_{i}\in L_{2}(d\rho) such that αi=ai,gL2(dρ)\alpha_{i}=\langle a_{i},g\rangle_{L_{2}(d\rho)}. This implies that i=1nh(xi)ai,gL2(dρ)\sum_{i=1}^{n}h(x_{i})\langle a_{i},g\rangle_{L_{2}(d\rho)} is an approximation of h,gL2(dρ)\langle h,g\rangle_{L_{2}(d\rho)}. Thus, the worst case error with respect to gg in the unit ball of L2(dρ)L_{2}(d\rho) is \big{\|}\sum_{i=1}^{n}h(x_{i})a_{i}-h\big{\|}_{L_{2}(d\rho)}, that is, we have an approximation result of hh through observations of its values at certain points.

Novak (1988) considers the approximation problem in LL_{\infty}-norm and shows that for Sobolev spaces, going from L2L_{2}- to LL_{\infty}-norms incurs a loss of performance of n\sqrt{n}. We recover partially these results in Section 5 from a more general perspective. When optimizing the points at which the function is evaluated (adaptively or not), the approximation problem is often referred to as experimental design (Cochran and Cox, 1957; Chaloner and Verdinelli, 1995).

Finally, a third problem is of interest (and outside of the scope of this paper), namely the problem of finding the minimum of a function given (potentially noisy) function evaluations. For noiseless problems, Novak (1988, page 26) shows that the approximation and optimization problems have the same worst-case guarantees (with no influence of adaptivity); this optimization problem has also been studied in the bandit setting (Srinivas et al., 2012) and in the framework of “Bayesian optimization” (see, e.g. Bull, 2011).

Theoretical Analysis

In this section, we provide approximation bounds for the random feature problem outlined in Section 2.4 (and thus the quadrature problem in Section 3). In Section 4.1, we provide generic upper bounds, which depend on the eigenvalues of the integral operator Σ\Sigma and present matching lower bounds (up to logarithmic terms) in Section 4.3. The upper-bound depends on specific distributions of samples that we discuss in Section 4.2. We then consider consequences of these results on quadrature (Section 4.4) and random feature expansions (Section 4.5).

The following proposition (see proof in Appendix B.1) determines the minimal number of samples required for a given approximation accuracy:

For λ>0\lambda>0 and a distribution with positive density qq with respect to dτd\tau, we consider

Let v1,,vnv_{1},\dots,v_{n} be sampled i.i.d. from the density qq, then for any δ(0,1)\delta\in(0,1), if

with probability greater than 1δ1-\delta, we have 1ni=1nq(vi)1φ(vi,)L2(dρ)22trΣδ\frac{1}{n}\sum_{i=1}^{n}q(v_{i})^{-1}\|\varphi(v_{i},\cdot)\|_{L_{2}(d\rho)}^{2}\leqslant\frac{2\mathop{\rm tr}\Sigma}{\delta} and

We can interpret the proposition above as follows: given any squared error 4λ>04\lambda>0 and a distribution with density qq, the number nn of samples from qq needed so that the unit ball of F{\mathcal{F}} is approximated by the ball of radius 22 of F^\hat{{\mathcal{F}}} is, up to logarithmic terms, at most a constant times dmax(q,λ)d_{\max}(q,\lambda), defined in Eq. (9). The result above is a statement for a fixed qq and λ\lambda and this number of samples nn depends on these.

We could also invert the relationship between λ\lambda and nn, that is, answer the following question: given a fixed number nn of samples, what is the approximation error λ\lambda? This requires inverting the function λdmax(q,λ)\lambda\mapsto d_{\max}(q,\lambda). This will be done in Section 4.2 for a specific distribution qq where the expression simplifies, together with specific examples from Section 2.3.

Finally, note that we also have a bound on 1ni=1nq(vi)1φ(vi,)L2(dρ)2\frac{1}{n}\sum_{i=1}^{n}q(v_{i})^{-1}\|\varphi(v_{i},\cdot)\|_{L_{2}(d\rho)}^{2}, which shows that our random functions are not too large on average (this constraint will be needed in the lower bound as well in Section 4.3).

It turns out that the final bound on the squared error is exactly proportional to the regularization parameter λ\lambda. As shown in Appendix B.1, this leads to an approximation f^\hat{f} which is a linear function of ff, as f^=(Σ^+λI)1Σ^f\hat{f}=(\hat{\Sigma}+\lambda I)^{-1}\hat{\Sigma}f, where Σ^\hat{\Sigma} is a properly defined empirical integral operator and λ>0\lambda>0 is the regularization parameter. Then, Bernstein concentration inequalities for operators (Minsker, 2011) can be used in a way similar to the work of Bach (2013); El Alaoui and Mahoney (2014) on column sampling, to provide a bound on all desired quantities.

Result in expectation.

In Section 4.5, we will need a result in expectation. As shown at the end of Appendix B.1, as soons as, λ(trΣ)/4\lambda\leqslant(\mathop{\rm tr}\Sigma)/4 and n5dmax(λ)log2(trΣ)dmax(λ)λ,\displaystyle n\geqslant 5d_{\max}(\lambda)\log\frac{2(\mathop{\rm tr}\Sigma)d_{\max}(\lambda)}{\lambda}, then

2 Optimized distribution

We may now consider a specific distribution that depends on the kernel and on λ\lambda, namely

for which dmax(qλ,λ)=d(λ)=trΣ(Σ+λI)1d_{\max}(q^{\ast}_{\lambda},\lambda)=d(\lambda)=\mathop{\rm tr}\Sigma(\Sigma+\lambda I)^{-1}. With this distribution, we thus need to have n5d(λ)log16d(λ)δn\geqslant 5d(\lambda)\log\frac{16d(\lambda)}{\delta} with d(λ)=trΣ(Σ+λI)1d(\lambda)=\mathop{\rm tr}\Sigma(\Sigma+\lambda I)^{-1} is the degrees of freedom, a traditional quantity in the analysis of least-squares regression (Hastie and Tibshirani, 1990; Caponnetto and De Vito, 2007), which is always smaller than dmax(1,λ)d_{\max}(1,\lambda) and can be upper-bounded explicitly for many examples, as we now explain. The computation of dmax(1,λ)d_{\max}(1,\lambda) in the operator setting (for which we may use q=1q=1), a quantity often referred to as the maximal leverage score (Mahoney, 2011), remains an open problem.

The quantity d(λ)d(\lambda) only depends on the integral operator Σ\Sigma, that is, for all possible choices of square roots, i.e., all possible choices of feature expansions, the number of samples that our results guarantee is the same. This being said, some expansions may be more computationally practical than others, and when using the distribution with q(v)=1q(v)=1, the bounds will be different.

Given the singular value decomposition of φ\varphi in Eq. (3), we have, for any vVv\in{\mathcal{V}}, φ(v,)=m1μm1/2fm(v)em\varphi(v,\cdot)=\sum_{m\geqslant 1}\mu_{m}^{1/2}f_{m}(v)e_{m} and thus

which provides an explicit expression for the density qλq_{\lambda}^{\ast}.

For a given squared error value λ\lambda, the optimized distribution qλq^{\ast}_{\lambda}, while leading to the degrees of freedom that will happen to be optimal in terms of approximation, has two main drawbacks:

Dependence on λ\lambda: this implies that if we want a reduced error (i.e., a smaller λ\lambda), then the samples obtained from a higher λ\lambda, may not be reused to provably obtain the desired bound; in other words, the sampling is not anytime. For specific examples, e.g., quadrature with periodic kernels on $withtheuniformdistribution,thenwith the uniform distribution, thenq=1happenstobeoptimalforallhappens to be optimal for all\lambda$, and thus, we may reuse samples for different values of the error.

Hard to compute in practice: the optimal distribution depends on a leverage score φ(v,),(Σ+λI)1φ(v,)L2(dρ)\langle\varphi(v,\cdot),(\Sigma+\lambda I)^{-1}\varphi(v,\cdot)\rangle_{L_{2}(d\rho)}, which may be hard to use for several reasons; first, it requires access to the infinite-dimensional operator Σ\Sigma, which may be difficult; moreover, even if it possible to invert Σ+λI\Sigma+\lambda I, the set V{\mathcal{V}} might be particularly large and impractical to sample from. At the end of Section 4.1, we propose a simple algorithm based on sampling.

Eigenvalues and degrees of freedom.

In order to relate more directly to the eigenvalues of Σ\Sigma, we notice that we may lower bound the degrees of freedom by a constant times the number m(λ)m^{\ast}(\lambda) of eigenvalues greater than λ\lambda:

We now make the assumption that there exists a γ>0\gamma>0 independent of jj such that

This assumption essentially states that the eigenvalues decay sufficiently homogeneously and is satisfied by μmm2α\mu_{m}\propto m^{-2\alpha} with γ=(2α1)1\gamma=(2\alpha-1)^{-1}, μmrm\mu_{m}\propto r^{m} with γ=(1r)1\gamma=(1-r)^{-1} and similar bounds also hold for all examples in Section 2.3. It allows us to relate the degrees of freedom directly to eigenvalue decays.

Indeed, this implies that 1λμm<λμmγmax({m, μmλ})=m(λ)\frac{1}{\lambda}\sum_{\mu_{m}<\lambda}\mu_{m}\leqslant\gamma{\rm max}(\{m,\ \mu_{m}\geqslant\lambda\})=m^{\ast}(\lambda) for all λμ1\lambda\leqslant\mu_{1} (the largest eigenvalue) and thus

We can now restate the approximation result of Prop. 1 from Section 4.1 with the optimized distribution (see proof in Appendix B.2):

For λ>0\lambda>0 and the distribution with density qλq^{\ast}_{\lambda} defined in Eq. (10) with respect to dτd\tau, with degrees of freedom d(λ)d(\lambda). Let v1,,vnv_{1},\dots,v_{n} be sampled i.i.d. from the density qq, defining the kernel (and its associated RKHS F^\hat{{\mathcal{F}}}) k^(x,y)=1ni=1n1q(vi)φ(vi,x)φ(vi,y)\hat{k}(x,y)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{q(v_{i})}\varphi({v_{i}},x)\varphi({v_{i}},y). Then, for any δ(0,1)\delta\in(0,1), with probability 1δ1-\delta, we have:

if \displaystyle n\geqslant 5\ d(\lambda)\log\big{[}16d(\lambda)/\delta\big{]},

if Eq. (11) is satisfied, and, by choosing mn5(1+γ)log16n5δm\leqslant\frac{n}{5(1+\gamma)\log\frac{16n}{5\delta}}, and λ=μm\lambda=\mu_{m}.

The statement (a) above, is a simple corollary of Prop. 1, and goes from level of error λ\lambda to minimum number nn of samples. The statement (b) goes in the other direction, that is, from the number of samples nn to the achieved approximation error. It depends on the eigenvalues μm\mu_{m} of the integral operator taken at m=O(n/log(n))m=O(n/\log(n)). For example, for polynomial decays of eigenvalues of the form μm=O(m2s)\mu_{m}=O(m^{-2s}), we get (non squared) errors proportional to (logn)sns(\log n)^{s}n^{-s} for nn samples, while for geometric decays, we get geometric errors as a function of the number nn of samples.

Note however that for the statement (b) to hold, we need to sample the points v1,,vnv_{1},\dots,v_{n} from the distribution qμmq_{\mu_{m}}^{\ast}, that is, for different numbers of samples nn, the distribution is unfortunately different (except in special cases). It would be interesting to study the properties of independent but not identically distributed samples v1,,vnv_{1},\dots,v_{n} and the possibility of achieving the same rate adaptively.

Corollary for Sobolev spaces.

The same approximation results holds for translation-invariant kernels on d^{d}; but when dρd\rho is the uniform distribution, as shown in Section 4.4, the optimized distribution for the quadrature case is still the uniform distribution, for all values of λ\lambda, and can thus be computed.

Algorithm to estimate the optimized distribution.

This implies that the density of the optimized distribution with respect to the uniform measure on {v1,,vN}\{v_{1},\dots,v_{N}\} is proportional to \big{(}T^{\ast}T(T^{\ast}T+\lambda I)^{-1}\big{)}_{ii}. We can then sample any number nn of points from resampling from {v1,,vN}\{v_{1},\dots,v_{N}\} from the density above. The computational complexity is O(N3)O(N^{3}). A detailed analysis of the approximation properties of this algorithm is outside the scope of this paper.

We have (TT)ij=ηi1/2ηj1/2Xφ(vi,x)φ(vj,x)dρ(x)(T^{\ast}T)_{ij}={\eta_{i}^{1/2}\eta_{j}^{1/2}}\int_{\mathcal{X}}\varphi(v_{i},x)\varphi(v_{j},x)d\rho(x). In some cases, it can be computed in closed form—such as for quadrature where this is equal to ηi1/2ηj1/2k(vi,vj){\eta_{i}^{1/2}\eta_{j}^{1/2}}k(v_{i},v_{j}). In some others, it requires i.i.d. samples x1,,xMx_{1},\dots,x_{M} from dρd\rho, and the estimate: ηi1/2ηj1/2M1k=1Mφ(vi,xk)φ(vj,xk){{\eta_{i}^{1/2}\eta_{j}^{1/2}}}{M^{-1}}\sum_{k=1}^{M}\varphi(v_{i},x_{k})\varphi(v_{j},x_{k}).

3 Lower bound

In this section, we aim at providing lower-bounds on the number of samples required for a given accuracy. We have the following result (see proof in Appendix B.3):

For δ(0,1)\delta\in(0,1), if we have a family ψ1,,ψnL2(dρ)\psi_{1},\dots,\psi_{n}\in L_{2}(d\rho) such that

then nmax{m, μm144λ}4log10trΣλδ\displaystyle n\geqslant\frac{{\rm max}\{m,\ \mu_{m}\geqslant 144\lambda\}}{4\log\frac{10\mathop{\rm tr}\Sigma}{\lambda\delta}}.

The proof technique not surprisingly borrows tools from minimax estimation over ellipsoids, namely the Varshamov-Gilbert’s lemma.

We obtain matching upper and lower bounds up to logarithmic terms, using only the decay of eigenvalues (μm)m1(\mu_{m})_{m\geqslant 1} of the integral operator Σ\Sigma (of course, if sampling from the optimized distribution qλq^{\ast}_{\lambda} is possible). Indeed in that case, as shown in Prop. 2, we have shown that we need at most 10\ d(\lambda)\log\big{[}2d(\lambda)\big{]}, where d(λ)d(\lambda) is the degrees of freedom, which is upper and lower bounded by a constant times m(λ)=max{m, μmλ}m^{\ast}(\lambda)={\rm max}\{m,\ \mu_{m}\geqslant\lambda\}.

In order to obtain such a bound, we need to constrain both β2\|\beta\|_{2} and the norms of the vectors ψi\psi_{i}, which correspond to bounded features for the random feature interpretation and tolerance to noise for the quadrature interpretation. We choose our scaling to match the constraints we have in Prop. 1, for which the parameter δ\delta ends up entering the lower bound logarithmically.

4 Quadrature

We may specialize the results above to the quadrature case, namely give a formulation where the features φ\varphi do not appear (or equivalently using ψ\psi defined in Section 3.2). This is a special case where V=X{\mathcal{V}}={\mathcal{X}} and φ=ψ\varphi=\psi. In terms of operators TT in Section 2.2, this corresponds to T=Σ1/2T=\Sigma^{1/2}.

Following Section 4.1, we have an expression for the optimized distribution, both in terms of operators, as follows,

and in terms of eigenvalues and eigenvectors of kk, that is,

While this is uniform in some special cases (uniform distribution on $$ and Sobolev kernels, as shown below), this is typically hard to compute and sample from. An algorithm for approximating it was presented at the end of Section 4.1.

A weakness of our result is that in general our optimized distribution qλ(x)q^{\ast}_{\lambda}(x) depends on λ\lambda and thus on the number of samples. In some cases with symmetries (i.e., uniform distribution on $orthehypersphere),or the hypersphere),q^{\ast}_{\lambda}happenstobeconstantforallhappens to be constant for all\lambda.Notealsothatwehaveobservedempiricallythatinsomecases,. Note also that we have observed empirically that in some cases,q^{\ast}_{\lambda}convergestoacertaindistributionwhenconverges to a certain distribution when\lambda$ tends to zero (see an example in Section 6).

Sobolev spaces.

For the special case of Sobolev spaces on d^{d} with dρd\rho the uniform distribution, the optimized distribution in Eq. (12) is also the uniform distribution. Indeed, the eigenfunctions of the integral operator Σ\Sigma are dd-th order tensor products of the uni-dimensional Fourier basis (the constant and all pairs of sine/cosine at a given frequency), with the same eigenvalue for the 2d2^{d} possibilities of sines/cosines for a given multi-dimensional frequency (m1,,md)(m_{1},\dots,m_{d}). Therefore, when summing all squared values of the eigenfunctions corresponding to (m1,,md)(m_{1},\dots,m_{d}), we end up with the sum a{0,1}di=1dcos2ai(2πmixi)sin2(1ai)(2πmixi)\sum_{a\in\{0,1\}^{d}}\prod_{i=1}^{d}\cos^{2a_{i}}(2\pi m_{i}x_{i})\sin^{2(1-a_{i})}(2\pi m_{i}x_{i}), which ends up being constant equal to one (and thus independent of xx) because cos2ai(2πmixi)+sin2ai(2πmixi)=1\cos^{2a_{i}}(2\pi m_{i}x_{i})+\sin^{2a_{i}}(2\pi m_{i}x_{i})=1.

Quadrature rule.

We assume that points x1,,xnx_{1},\dots,x_{n} are sampled from the distribution with density qq with respect to dρd\rho. The quadrature rule for a function hFh\in{\mathcal{F}} is i=1nβih(xi)q(xi)1/2\sum_{i=1}^{n}\frac{\beta_{i}h(x_{i})}{q(x_{i})^{1/2}}. To compute β\beta, we need to minimize with respect to β\beta the error:

which is the regularized worst case squared error in the estimation of the integral of hh over hFh\in{\mathcal{F}}. The best error is obtained for λ=0\lambda=0, but our guarantees are valid for λ>0\lambda>0, with an explicit control over the norm β22\|\beta\|_{2}^{2}, which is important for robustness to noise.

Given the values of Xk(xi,x)g(x)dρ(x)=zi\int_{\mathcal{X}}k(x_{i},x)g(x)d\rho(x)=z_{i}, for i=1,,ni=1,\dots,n, which can be computed in closed form for several triplet (k,g,dρ)(k,g,d\rho) (see, e.g., Smola et al., 2007; Oates and Girolami, 2015), then the problem above is equivalent to minimizing with respect to β\beta:

which leads to a n×nn\times n linear system with running time complexity O(n3)O(n^{3}). Note that when adding points sequentially (in particular for kernels for which the distribution qλq_{\lambda}^{\ast} is independent of λ\lambda, such as Sobolev spaces on $),onemayupdatethesolutionsothatafter), one may update the solution so that afternsteps,theoverallcomplexityissteps, the overall complexity isO(n^{3})$.

Approximation of functions in ℱℱ{\mathcal{F}}.

With the quadrature weights β\beta estimated above and the quadrature rule i=1nβih(xi)q(xi)1/2\sum_{i=1}^{n}\frac{\beta_{i}h(x_{i})}{q(x_{i})^{1/2}} for the estimation of Xg(x)f(x)dρ(x)\int_{{\mathcal{X}}}g(x)f(x)d\rho(x), we may derive an expression which is explicitly linear in gg. Following the proof of Prop. 1 in Appendix B.1, we have, when specialized to the quadrature case:

Moreover, we have βi=1nq(xi)1/2k(,xi),Σ1/2(Σ^+λI)1Σ1/2gL2(dρ)\beta_{i}=\frac{1}{nq(x_{i})^{1/2}}\langle k(\cdot,x_{i}),\Sigma^{-1/2}(\hat{\Sigma}+\lambda I)^{-1}\Sigma^{1/2}g\rangle_{L_{2}(d\rho)} from Eq. (15) in Appendix B.1, and the quadrature rule becomes:

which can be put in the form h^,gL2(dρ)\langle\hat{h},g\rangle_{L_{2}(d\rho)} with the approximation h^=Σ1/2Σ^(Σ^+λI)1Σ1/2h\hat{h}=\Sigma^{1/2}\hat{\Sigma}(\hat{\Sigma}+\lambda I)^{-1}\Sigma^{-1/2}h of the function hFh\in{\mathcal{F}}. Having a bound for all functions gg such that gL2(dρ)1\|g\|_{L_{2}(d\rho)}\leqslant 1 is equivalent to having a bound on hh^L2(dρ)\|h-\hat{h}\|_{L_{2}(d\rho)}. In Section 5, we consider extensions, where we consider other norms than the L2L_{2}-norm for characterizing the approximation error h^h\hat{h}-h. Moreover, we consider cases where hh belongs to a strict subspace of F{\mathcal{F}} (with improved results).

5 Learning with random features

Because of the GG-Lipschitz-continuity of the loss, we have L(g^γ)L(f)Gg^γ)fL2(dρ){L}(\hat{g}_{\gamma})-L(f^{\prime})\leqslant{G}\|\hat{g}_{\gamma})-f^{\prime}\|_{L_{2}(d\rho)}, and thus the second term is less than 8λGF3GFλ\sqrt{8\lambda}GF\leqslant 3GF\sqrt{\lambda}. Overall, we obtain

If we consider λ=R2/m\lambda=R^{2}/m in order to lose only a constant factor compared to Eq. (13), we have the constraint n\geqslant 5d(R^{2}/m)\log\big{[}{2md(R^{2}/m)}\big{]}.

We may now look at several situations. In the worst case, where the decay of eigenvalue is not fast, i.e., very close to 1/i1/i, then we may only use the bound d(λ)=trΣ(Σ+λI)1λ1trΣ=R2/λd(\lambda)=\mathop{\rm tr}\Sigma(\Sigma+\lambda I)^{-1}\leqslant\lambda^{-1}\mathop{\rm tr}\Sigma=R^{2}/\lambda, and thus a sufficient condition n10mlog2mn\geqslant 10m\log 2m, and we obtain the same result as Rahimi and Recht (2009).

However, when we have eigenvalue decays as R2i2sR^{2}i^{-2s}, we get (up to constants), following the same computation as Section 4.2, d(λ)(R2/λ)1/(2s)d(\lambda)\leqslant(R^{2}/\lambda)^{1/(2s)}, and thus nm1/(2s)logmn\geqslant m^{1/(2s)}\log m, which is a significant improvement (regardless of the value of FF). Moreover, if the decay is geometric as rir^{i}, then we get d(λ)log(R2/λ)d(\lambda)\leqslant\log(R^{2}/\lambda), and thus n(logm)2n\geqslant(\log m)^{2}, which is even more significant.

Quadrature-related Extensions

In Section 4.4, we have built an approximation h^=Σ1/2Σ^(Σ^+λI)1Σ1/2h\hat{h}=\Sigma^{1/2}\hat{\Sigma}(\hat{\Sigma}+\lambda I)^{-1}\Sigma^{-1/2}h of a function hFh\in{\mathcal{F}}, which is based on nn function evaluations h(x1),,h(xn)h(x_{1}),\dots,h(x_{n}). We have presented in Section 4.4 a convergence rate for the L2L_{2}-norm h^hL2(dρ)\|\hat{h}-h\|_{L_{2}(d\rho)} for functions hh with less than unit F{\mathcal{F}}-norm hF1\|h\|_{\mathcal{F}}\leqslant 1. Up to logarithmic terms, if using the optimal distribution for sampling x1,,xnx_{1},\dots,x_{n}, then we get a squared error of μn\mu_{n} where μn\mu_{n} is the nn-th largest eigenvalue of the integral operator Σ\Sigma.

We have seen that if the noise in the function evaluations h(xi)h(x_{i}) has a variance less than q(xi)τ2q(x_{i})\tau^{2}, then the error hh^L2(dρ)2\|h-\hat{h}\|_{L_{2}(d\rho)}^{2} has an additional term τ2β224τ2n\tau^{2}\|\beta\|_{2}^{2}\leqslant\frac{4\tau^{2}}{n}. Hence, the amount of noise has to be less than nμnn\mu_{n} in order to incur no loss in performance (a bound which decreases with nn).

Adaptivity to smoother functions.

We assume that the function hh happens to be smoother than what is sufficient to be an element of the RKHS F{\mathcal{F}}, that is, if ΣshL2(dρ)1\|\Sigma^{-s}h\|_{L_{2}(d\rho)}\leqslant 1, where s1/2s\geqslant 1/2. The case s=1/2s=1/2 corresponds to being in the RKHS. In the proof of Prop. 1 in Appendix B.1, we have seen that with high-probability we have:

We now see that we can bound the error h^hL2(dρ)\|\hat{h}-h\|_{L_{2}(d\rho)} as follows:

We may now bound each term. The first one \big{\|}\Sigma^{1/2}(\hat{\Sigma}+\lambda I)^{-1/2}\big{\|}_{\rm op} is less than 2, because of Eq. (14). The second one \big{\|}(\hat{\Sigma}+\lambda I)^{-1/2}\Sigma^{-1/2+s}\big{\|}_{\rm op} is equal to \big{\|}(\hat{\Sigma}+\lambda I)^{s-1}(\hat{\Sigma}+\lambda I)^{1/2-s}\Sigma^{-1/2+s}\big{\|}_{\rm op}, and thus less than \big{\|}(\hat{\Sigma}+\lambda I)^{s-1}\|_{\rm op}\cdot\big{\|}(\hat{\Sigma}+\lambda I)^{1/2-s}\Sigma^{-1/2+s}\big{\|}_{\rm op}\leqslant 2\lambda^{s-1}. Overall we obtain

The norm hΣshL2(dρ)h\mapsto\|\Sigma^{-s}h\|_{L_{2}(d\rho)} is an RKHS norm with kernel m0μm2sem(x)em(y)\sum_{m\geqslant 0}\mu_{m}^{2s}e_{m}(x)e_{m}(y), with corresponding eigenvalues equal to (μm)2s(\mu_{m})^{2s}. From Prop. 2 and 3, the optimal number of quadrature points to reach a squared error less than ε\varepsilon is proportional to the number max({m, μm2sε}){\rm max}(\{m,\ \mu_{m}^{2s}\geqslant\varepsilon\}), while using the quadrature points from s=1/2s=1/2, leads to a number max({m, μmε1/(2s)}){\rm max}(\{m,\ \mu_{m}\geqslant\varepsilon^{1/(2s)}\}), which is equal. Thus if the RKHS used to compute the quadrature weights is a bit too large (but not too large, see experiments in Section 6), then we still get the optimal rate. Note that this robustness is only shown for the regularized estimation of the quadrature coefficients (in our simulations, the non-regularized ones also exhibit the same behavior).

Approximation with stronger norms.

We may consider characterizing the difference h^h\hat{h}-h with different norms than L2(dρ)\|\cdot\|_{L_{2}(d\rho)}, in particular norms Σr(h^h)L2(dρ)\|\Sigma^{-r}(\hat{h}-h)\|_{L_{2}(d\rho)}, with r[0,1/2]r\in[0,1/2]. For r=0r=0, this is our results in L2L_{2}-norm, while for r=1/2r=1/2, this is the RKHS norms. We have, using the same manipulations than above:

When r=1/2r=1/2, we get a result in the RKHS norm, but with no decay to zero; the RKHS norm F\|\cdot\|_{\mathcal{F}} would allow a control in LL_{\infty}-norm, but as noticed by Steinwart et al. (2009); Mendelson and Neeman (2010), such a control may be obtained in practice with rr much smaller. For example, when the eigenfunctions eme_{m} are uniformly bounded in LL_{\infty}-norm by a constant CC (as is the case for periodic kernels in $withtheuniformdistribution),then,foranywith the uniform distribution), then, for anyx\in{\mathcal{X}},wehavefor, we have fort>1$,

If for simplicity, we assume that μm=(m+1)2s\mu_{m}=(m+1)^{-2s} (like for Sobolev spaces), we have ΣrfL2(dρ)2=m=1μm2rf,emL2(dρ)2=m=1(m+1)tf,emL2(dρ)2\|\Sigma^{-r}f\|_{L_{2}(d\rho)}^{2}=\sum_{m=1}^{\infty}\mu_{m}^{-2r}\langle f,e_{m}\rangle^{2}_{L_{2}(d\rho)}=\sum_{m=1}^{\infty}(m+1)^{t}\langle f,e_{m}\rangle^{2}_{L_{2}(d\rho)} with r=t/4sr=t/4s. If λO(n2s)\lambda\leqslant O(n^{-2s}) (as suggested by Prop. 1), then we obtain a squared LL_{\infty}-error less than \frac{1}{t-1}\lambda^{1-2r}=O\big{(}\frac{1}{t-1}n^{-2s(1-t/2s)}\big{)}=O\big{(}\frac{n^{t}}{t-1}n^{-2s}\big{)}. With t=1+1lognt=1+\frac{1}{\log n}, we get O\big{(}\frac{n\log n}{n^{-2s}}\big{)}, and thus a degradation compared to the squared L2L_{2}-loss of nn (plus additional logarithmic terms), which corresponds to the (non-improvable) result of Novak (1988, page 36).

Simulations

In this section, we consider simple illustrative quadrature experimentsMatlab code for all 5 figures may be downloaded from http://www.di.ens.fr/~fbach/quadrature.html. with X={\mathcal{X}}= and kernels k(x,y)=1+m=11m2scos2πm(xy)k(x,y)=1+\sum_{m=1}^{\infty}\frac{1}{m^{2s}}\cos 2\pi m(x-y), with various values of ss and distributions dρd\rho which are Beta random variable with the two parameters equal to a=ba=b, hence symmetric around 1/21/2.

For b=1b=1, we have the uniform distribution on $forwhichthecosine/sinebasisisorthonormal,andtheoptimizeddistributionfor which the cosine/sine basis is orthonormal, and the optimized distributionq_{\lambda}^{\ast}isalsouniform.Moreover,wehaveis also uniform. Moreover, we have\int_{0}^{1}k(x,y)d\rho(x)=1.WereportresultscomparingdifferentSobolevspacesfortestingfunctionstointegrate(parameterizedby. We report results comparing different Sobolev spaces for testing functions to integrate (parameterized bys)andlearningquadratureweights(parameterizedby) and learning quadrature weights (parameterized byt)inFigure1,wherewecomputeerrorsaveragedover1000draws.Wedidnotuseregularizationtocomputequadratureweights) in Figure 1, where we compute errors averaged over 1000 draws. We did not use regularization to compute quadrature weights\alpha$. We can make the following observations:

The exponents in the convergence rates for s=ts=t (matching RKHSs for learning quadrature weights and testing functions) are close to 2s2s as expected.

When the functions to integrate are less smooth than the ones used for learning quadrature weights (that is t>st>s), then the quadrature performance does not necessarily decay with the number of samples.

On the contrary, when s>ts>t, then we have convergence and the rate is potentially worse than the optimal one (attained for s=ts=t), and equal when ts/2t\geqslant s/2, as shown in Section 5.

In Figure 2, we compare several quadrature rules on $,namelySimpsonsrulewithuniformlyspreadpoints,GaussLegendrequadratureandtheSobolsequencewithuniformweights.For, namely Simpson’s rule with uniformly spread points, Gauss-Legendre quadrature and the Sobol sequence with uniform weights. Fors=1,asexpected,allsquarederrorsdecayas, as expected, all squared errors decay asn^{-2}withaworseconstantforourkernelbasedrule,whileforwith a worse constant for our kernel-based rule, while fors=2(smoothertestfunctions),theSobolsequenceisnotadaptive,whileallothersareadaptiveandgetconvergenceratesaround(smoother test functions), the Sobol sequence is not adaptive, while all others are adaptive and get convergence rates aroundn^{-4}$.

Non-uniform distribution.

We consider the case a=b=1/2a=b=1/2, which is the distribution dρd\rho with density π1x1/2(1x)1/2\pi^{-1}x^{-1/2}(1-x)^{-1/2} with respect to the Lebesgue measure, and with cumulative distribution function F(x)=π1arccos(12x)F(x)=\pi^{-1}\arccos(1-2x). We may use an approximation of dτd\tau with NN unweighted points F^{-1}(k/N)=\big{(}1-\cos\frac{k\pi}{N}\big{)}/2, for k{1,,N}k\in\{1,\dots,N\} and the algorithms from the end of Section 4.2. We consider the Sobolev kernel with s=1s=1.

In Figure 3, we plot all densities qλq_{\lambda}^{\ast} as a function of λ\lambda. When λ\lambda is large, we unsuprisingly obtain the uniform density, while, more surprisingly, when λ\lambda tends to zero, the density tends to a density, which happens here to be proportional to x1/4(1x)1/4x^{1/4}(1-x)^{1/4} (leading to a Beta distribution with parameters a=b=.25a=b=.25).

Conclusion

In this paper, we have shown that kernel-based quadrature rules are a special case of random feature expansions for positive definite kernels, and derived upper and lower bounds on approximations, that match up to logarithmic terms. For quadrature, this leads to widely applicable results while for random features this allows a significantly improved guarantee within a supervised learning framework.

The present work could be extended in a variety of ways, for example towards bandit optimization rather than quadrature (Srinivas et al., 2012), the use of quasi-random sampling within our framework in the spirit of Yang et al. (2014); Oates and Girolami (2015), a similar analysis for kernel herding (Chen et al., 2010; Bach et al., 2012), an extension to fast rates for non-parametric least-squares regression (Hsu et al., 2014) but with an improved computational complexity, and a study of the consequences of our improved approximation result for online learning and stochastic approximation, in the spirit of Dai et al. (2014); Dieuleveut and Bach (2014).

This work was partially supported by the MSR-Inria Joint Centre and a grant by the European Research Council (SIERRA project 239993). Comments of the reviewers were greatly appreciated and helped improve the presentation significantly. The author would like to thank the STVI for the opportunity of writing a single-handed paper.

References

A Kernels on product spaces

In this appendix, we consider sets X{\mathcal{X}} which are products of several simple sets X1,,Xd{\mathcal{X}}_{1},\dots,{\mathcal{X}}_{d}, with known kernels k1,,kdk_{1},\dots,k_{d}, each with RKHS F1,,Fd{\mathcal{F}}_{1},\dots,{\mathcal{F}}_{d}. We also assume that we have dd measures dρ1,,dρdd\rho_{1},\dots,d\rho_{d}, leading to sequences of eigenvalues (μjmj)mj1(\mu_{jm_{j}})_{m_{j}\geqslant 1} and eigenfunctions (ejmj)mj1(e_{jm_{j}})_{m_{j}\geqslant 1}.

Our aim is to define a kernel kk on X=X1××Xd{\mathcal{X}}={\mathcal{X}}_{1}\times\cdots\times{\mathcal{X}}_{d} with the product measure dρ=dρ1dρdd\rho=d\rho_{1}\cdots d\rho_{d}. For illustration purposes, we consider decays of the form μmm2s\mu_{m}\propto m^{-2s} for the dd kernels, that will be useful for Sobolev spaces. We also consider the case where μmexp(ρm)\mu_{m}\propto\exp(-\rho m). For some combinations, eigenvalue decay is the most natural, in others, the number of eigenvalues m(λ)m^{\ast}(\lambda) greater than a given λ>0\lambda>0 is more natural.

In this situation, the RKHS for kk is isomorphic to F1××Fd{\mathcal{F}}_{1}\times\cdots\times{\mathcal{F}}_{d}, composed of functions gg such that there exists f1,,fdf_{1},\dots,f_{d} in F1,,Fd{\mathcal{F}}_{1},\dots,{\mathcal{F}}_{d} such that g(x)=j=1dfj(xj)g(x)=\sum_{j=1}^{d}f_{j}(x_{j}), that is we obtain separable functions, which are sometimes used in the context of generalized additive models (Hastie and Tibshirani, 1990). The corresponding integral operator is then block-diagonal with jj-th block equal to the integral operator for kjk_{j} and dρjd\rho_{j}. This implies that that its eigenvalues are the concatenation of all sequences (μjmj)mj0(\mu_{jm_{j}})_{m_{j}\geqslant 0}. Thus the function m(λ)m^{\ast}(\lambda) is the sum of functions m1(λ)++md(λ)m^{\ast}_{1}(\lambda)+\cdots+m^{\ast}_{d}(\lambda).

In terms of norms of functions, we have a norm equal to gF2=j=1dfjFj2.\|g\|_{\mathcal{F}}^{2}=\sum_{j=1}^{d}\|f_{j}\|_{{\mathcal{F}}_{j}}^{2}.

In the particular case where μjmjmj2s\mu_{jm_{j}}\propto m_{j}^{-2s} for all jj, or equivalently, a number of eigenvalues of kjk_{j} greater than λ\lambda proportional to λ1/(2s)\lambda^{-1/(2s)}, we have a number of eigenvalues of kk greater than λ\lambda equivalent to dλ1/(2s)d\lambda^{-1/(2s)}, that is a decay for the eigenvalues proportional to (m/d)2s(m/d)^{-2s}. Similarly, when the decay is exponential as exp(ρm)\exp(-\rho m), we get a decay of exp(ρm/d)\exp(-\rho m/d).

In this situation, the RKHS for KK is exactly the tensor product of F1,,Fd{\mathcal{F}}_{1},\dots,{\mathcal{F}}_{d}, i.e., the span of all functions j=1dfj(xj)\prod_{j=1}^{d}f_{j}(x_{j}), for f1,,fdf_{1},\dots,f_{d} in F1,,Fd{\mathcal{F}}_{1},\dots,{\mathcal{F}}_{d} (Berlinet and Thomas-Agnan, 2004). Moreover, the integral operator for kk is a tensor product of the dd integral operator for k1,,kdk_{1},\dots,k_{d}. This implies that its eigenvalues are μ1m1××μdmd\mu_{1m_{1}}\times\cdots\times\mu_{dm_{d}}, m1,,md0m_{1},\dots,m_{d}\geqslant 0. In terms of norms of functions defined on X1××Xd{\mathcal{X}}_{1}\times\cdots\times{\mathcal{X}}_{d}, this thus corresponds to

In the particular case where μjmjmj2s\mu_{jm_{j}}\propto m_{j}^{-2s} for all jj, we have a number of eigenvalues of kk greater than λ\lambda equivalent to the number of multi-indices such that m1××mdm_{1}\times\cdots\times m_{d} is less than λ1/(2s)\lambda^{-1/(2s)}. By counting first the index m1m_{1}, this can be upper-bounded by the sum of λ1/(2s)m2md\frac{\lambda^{-1/(2s)}}{m_{2}\cdots m_{d}} over all indices m2,,mdm_{2},\dots,m_{d} less than λ1/(2s)\lambda^{-1/(2s)}, which is less than \lambda^{-1/(2s)}\big{(}\sum_{m=1}^{\lambda^{-1/(2s)}}\frac{1}{m}\big{)}^{d-1}=O\Big{(}\lambda^{-1/(2s)}\big{(}s\log\frac{1}{\lambda}\big{)}^{d-1}\Big{)}. This results in a decay of eigenvalues bounded by (logm)2s(d1)m2s(\log m)^{2s(d-1)}m^{-2s} (this can be obtained by inverting approximately the function of λ\lambda).

When the decay is exponential as exp(ρλ)\exp(-\rho\lambda), then we get that m(λ)m^{\ast}(\lambda) is the number of multi-indices (m1,,md)(m_{1},\dots,m_{d}) such that their sum is less than c=log1λρc=\frac{\log\frac{1}{\lambda}}{\rho}; when cc is large, this is equivalent to cdc^{d} times the volume of the dd-dimensional simplex, and thus less than \frac{c^{d}}{d!}=\big{(}\frac{\log\frac{1}{\lambda}}{\rho}\big{)}^{d}\frac{1}{d!}. This leads to a decay of eigenvalues as exp(ρd!1/dm1/d)\exp(-\rho d!^{1/d}m^{1/d}) or, by using Stirling formula, less than exp(ρdm1/d)\exp(-\rho dm^{1/d}).

B Proofs

As shown in Section 2.2, any fFf\in{\mathcal{F}} with F{\mathcal{F}}-norm less than one may be represented as f=Vg(v)φ(v,)dτ(v)f=\int_{\mathcal{V}}g(v)\varphi(v,\cdot)d\tau(v), for a certain gL2(dτ)g\in L_{2}(d\tau) with L2(dτ)L_{2}(d\tau)-norm less than one. We do not solve the problem in β\beta exactly, but use a properly chosen Lagrange multiplier λ\lambda and consider the following minimization problem:

We then need to minimize the familiar least-squares problem:

with solution from the usual normal equations and the matrix inversion lemma for operators (Ogawa, 1988):

We consider the empirical integral operator Σ^:L2(dρ)L2(dρ)\hat{\Sigma}:L_{2}(d\rho)\to L_{2}(d\rho), defined as

The value of fΦβL2(dρ)2\|f-\Phi\beta\|_{L_{2}(d\rho)}^{2} is equal to

because (Σ^+λI)2λ1(Σ^+λI)1(\hat{\Sigma}+\lambda I)^{-2}\preccurlyeq\lambda^{-1}(\hat{\Sigma}+\lambda I)^{-1} (with the classical partial order between self-adjoint operators).

Finally, we have, with β=1nΦ(Σ^+λI)1f\beta=\frac{1}{n}\Phi^{\ast}(\hat{\Sigma}+\lambda I)^{-1}f:

using (Σ^+λI)2Σ^(Σ^+λI)1(\hat{\Sigma}+\lambda I)^{-2}\hat{\Sigma}\preccurlyeq(\hat{\Sigma}+\lambda I)^{-1}.

Thus fL2(dρ)fΣf\otimes_{L_{2}(d\rho)}f\preccurlyeq\Sigma, and we may thus define f,Σ1fL2(dρ)\langle f,\Sigma^{-1}f\rangle_{L_{2}(d\rho)}, which is less than one.

Overall we aim to study f,(Σ^+λI)1fL2(dρ)\langle f,(\hat{\Sigma}+\lambda I)^{-1}f\rangle_{L_{2}(d\rho)}, for f,Σ1fL2(dρ)1\langle f,\Sigma^{-1}f\rangle_{L_{2}(d\rho)}\leqslant 1, to control both the norm β22\|\beta\|_{2}^{2} in Eq. (17) and the approximation error fΦβL2(dρ)2\|f-\Phi\beta\|_{L_{2}(d\rho)}^{2} in Eq. (16). We have, following a similar argument than the one of Bach (2013); El Alaoui and Mahoney (2014) for column sampling, i.e., by a formulation using ΣΣ^\Sigma-\hat{\Sigma} in terms of operators in an appropriate way:

Thus, if (Σ+λI)1/2(Σ^Σ)(Σ+λI)1/2tI(\Sigma+\lambda I)^{-1/2}(\hat{\Sigma}-\Sigma)(\Sigma+\lambda I)^{-1/2}\succcurlyeq-tI, with t(0,1)t\in(0,1), we have

Moreover, we have shown (Σ^+λI)111t(Σ+λI)1(\hat{\Sigma}+\lambda I)^{-1}\preccurlyeq\frac{1}{1-t}({\Sigma}+\lambda I)^{-1}.

Thus, the performance depends on having (Σ+λI)1/2(ΣΣ^)(Σ+λI)1/2tI(\Sigma+\lambda I)^{-1/2}({\Sigma}-\hat{\Sigma})(\Sigma+\lambda I)^{-1/2}\preccurlyeq tI.

We consider the self-adjoint operators XiX_{i}, for i=1,,ni=1,\dots,n, which are independent and identically distributed:

so that our goal is to provide an upperbound on the probability that i=1nXiop>t\|\sum_{i=1}^{n}X_{i}\|_{\rm op}>t, where op\|\cdot\|_{\rm op} is the operator norm (largest singular values). We use the notation

with a maximal eigenvalue less than dmaxn\displaystyle\frac{d_{\max}}{n} and a trace less than dmaxntrΣ(Σ+λI)1=ddmaxn\displaystyle\frac{d_{\max}}{n}\mathop{\rm tr}\Sigma(\Sigma+\lambda I)^{-1}=\frac{d\,d_{\max}}{n}.

Following Hsu et al. (2014), we use a matrix Bernstein inequality which is independent of the underlying dimension (which is here infinite). We consider the bound of Minsker (2011, Theorem 2.1), which improves on the earlier result of Hsu et al. (2012, Theorem 4), that is:

We now consider t=34t=\frac{3}{4}, δ(0,1)\delta\in(0,1), and nBdmaxlogCdmaxδ\displaystyle n\geqslant Bd_{\max}\log\frac{Cd_{\max}}{\delta}, with appropriate constants B,C>0B,C>0. This implies that

and, if dmaxDd_{\max}\geqslant D, using nBdmaxlogCDn\geqslant Bd_{\max}\log CD,

while if dmaxDd_{\max}\leqslant D and n1n\geqslant 1,

In order to get a bound, we need (3/4)2B/25/41\frac{(3/4)^{2}B/2}{5/4}\geqslant 1, and we can take B=5B=5. If we take C=8C=8, then in order to have 1+6t2log2(1+nt/dmax)41+\frac{6}{t^{2}\log^{2}(1+nt/d_{\rm max})}\leqslant 4, we can take D=3/8D=3/8. Thus the probability is less than δ\delta.

By taking δ/2\delta/2 instead of δ\delta in the control of i=1nXiop>t\|\sum_{i=1}^{n}X_{i}\|_{\rm op}>t and in the Markov inequality above, we have a control over β22\|\beta\|_{2}^{2}, trΣ^\mathop{\rm tr}\hat{\Sigma} and the approximation error, which leads to the desired result in Prop 1. This will be useful for the lower bound of Prop. 3.

We can make the following extra observations regarding the proof:

It may be possible to derive a similar result with a thresholding of eigenvalues in the spirit of Zwald et al. (2004), but this would require Bernstein-type concentration inequalities for the projections on principal subspaces.

We have seen that with high-probability, we have (Σ^+λI)14(Σ+λI)1(\hat{\Sigma}+\lambda I)^{-1}\preccurlyeq 4({\Sigma}+\lambda I)^{-1}. Note that ABA\preccurlyeq B does not imply in A2B2A^{2}\preccurlyeq B^{2} (Bhatia, 2009, page 9) and that in general we do not have (Σ^+λI)2C(Σ+λI)2(\hat{\Sigma}+\lambda I)^{-2}\preccurlyeq C(\Sigma+\lambda I)^{-2} for any constant CC (which would allow an improvement in the error by replacing λ\lambda by λ2\lambda^{2}, and violate the lower bound of Prop. 3).

We may also obtain a result in expectation, by using δ=4λ/trΣ\delta=4\lambda/\mathop{\rm tr}\Sigma (which is assumed to be less than 1), leading to a squared error with expectation less than 8λ8\lambda as soon as n5dmax(λ)log2(trΣ)dmax(λ)λ.n\geqslant 5d_{\max}(\lambda)\log\frac{2(\mathop{\rm tr}\Sigma)d_{\max}(\lambda)}{\lambda}. Indeed, we can use the bound 4λ4\lambda with probability 1δ1-\delta and fL2(dρ)2trΣ\|f\|_{L_{2}(d\rho)}^{2}\leqslant\mathop{\rm tr}\Sigma with probability δ\delta, leading to a bound of 4λ(1δ)+δtrΣ8λ4\lambda(1-\delta)+\delta\mathop{\rm tr}\Sigma\leqslant 8\lambda. We use this result in Section 4.5.

B.2 Proof of Prop. 2

We start from the bound above, with the constraint n5d(λ)log16d(λ)δ.n\geqslant 5d(\lambda)\log\frac{16d(\lambda)}{\delta}. Statement (a) is a simple reformulation of Prop. 1. For statement (b), if we assume mn5(1+γ)log16n5δm\leqslant\frac{n}{5(1+\gamma)\log\frac{16n}{5\delta}}, and λ=μm\lambda=\mu_{m}, then we have d(λ)(1+γ)md(\lambda)\leqslant(1+\gamma)m, which implies n5d(λ)log16d(λ)δn\geqslant 5d(\lambda)\log\frac{16d(\lambda)}{\delta}, and (b) is a consequence of (a).

B.3 Proof of Prop. 3

We first use the Varshamov-Gilbert’s lemma (see, e.g., Massart, 2003, Lemma 4.7). That is, for any integer ss, there exists a family (θj)jJ(\theta_{j})_{j\in J} of at least Jes/8|J|\geqslant e^{s/8} distinct elements of {0,1}s\{0,1\}^{s}, such that for jjJj\neq j^{\prime}\in J, θjθj22s4\|\theta_{j}-\theta_{j^{\prime}}\|_{2}^{2}\geqslant\frac{s}{4}.

For each θ{0,1}s\theta\in\{0,1\}^{s}, we define an element of F{\mathcal{F}} with norm less than one, as f(θ)=μssi=1sθieiFf(\theta)=\frac{\sqrt{\mu_{s}}}{\sqrt{s}}\sum_{i=1}^{s}\theta_{i}e_{i}\in{\mathcal{F}}, where (ei,μi)(e_{i},\mu_{i}), i=1,,si=1,\dots,s are the eigenvector/eigenvalue pairs associated with the ss largest eigenvalues of Σ\Sigma. We have, since μiμs\mu_{i}\geqslant\mu_{s} for i{1,,s}i\in\{1,\dots,s\} and θ221\|\theta\|_{2}^{2}\leqslant 1:

Moreover, for any jjJj\neq j^{\prime}\in J, we have f(θj)f(θj)L2(dρ)2=μssθjθj22μs4\|f(\theta_{j})-f(\theta_{j^{\prime}})\|_{L_{2}(d\rho)}^{2}=\frac{\mu_{s}}{s}\|\theta_{j}-\theta_{j^{\prime}}\|_{2}^{2}\geqslant\frac{\mu_{s}}{4}.

This leads to, for any jjJj\neq j^{\prime}\in J,

Combining the last two inequalities, we get βjβj2δμs72ntrΣ=Δ\|\beta_{j}-\beta_{j^{\prime}}\|_{2}\geqslant\sqrt{\frac{\delta\mu_{s}}{72n\mathop{\rm tr}\Sigma}}=\Delta. Thus, es/8e^{s/8} is less than the Δ\Delta-packing number of the ball of radius r=2/nr=2/\sqrt{n}, which is itself less than (r/Δ)n(2+Δ/r)n(r/\Delta)^{n}(2+\Delta/r)^{n} (see, e.g., Massart, 2003, Lemma 4.14). Since Δ/r=δμs472trΣ1122\Delta/r=\sqrt{\frac{\delta\mu_{s}}{4\cdot 72\mathop{\rm tr}\Sigma}}\leqslant\frac{1}{12\sqrt{2}}, we have

This implies ns4logtrΣδμs+29.n\geqslant\frac{s}{4\log\frac{\mathop{\rm tr}\Sigma}{\delta\mu_{s}}+29}. Given that we have to choose μs144λ\mu_{s}\geqslant 144\lambda for the result to hold, this implies the desired result, since 4log(1440)294\log(1440)\geqslant 29.