On the Error of Random Fourier Features

Danica J. Sutherland, Jeff Schneider

INTRODUCTION

give two such embeddings, based on the Fourier transform P(ω)P(\omega) of the kernel kk: one of the form

The primary previous analyses of these embeddings, outside the one in the original paper, have been by , who bound the increase in error of empirical risk estimates when learning models in the induced rkhs, and by , who compare the ability of the Nyström and Fourier embeddings to exploit eigengaps in the learning problem. We instead study the approximation directly, providing a complementary view of the quality of these embeddings.

APPROXIMATION ERROR

We will give various analyses of the error due to each approximation.

The Gaussian kernel k(Δ)=exp(Δ22σ2)k(\Delta)=\exp\left(-\frac{\left\lVert\Delta\right\rVert^{2}}{2\sigma^{2}}\right) has

2 UNIFORM ERROR BOUND

Thus, we can achieve an embedding with pointwise error no more than ε\varepsilon with probability at least 1δ1-\delta as long as

For the Gaussian kernel, αε12+13ε\alpha_{\varepsilon}\leq\tfrac{1}{2}+\frac{1}{3}\varepsilon and σp2=d/σ2\sigma_{p}^{2}=d/\sigma^{2}; the Bernstein bound is tighter when ε<32\varepsilon<\frac{3}{2}.

For z˘\breve{z}, since the embedding s˘\breve{s} is not shift-invariant, we must instead place the ε\varepsilon-net on X2\mathcal{X}^{2}. The additional noise in s˘\breve{s} also increases the expected Lipschitz constant and gives looser bounds on each term in the sum, though there are twice as many such terms. The corresponding bound is as follows:

Thus, we can achieve an embedding with pointwise error no more than ε\varepsilon with probability at least 1δ1-\delta as long as

β48=98\beta^{\prime}_{48}=98, and limdβd=96\lim_{d\to\infty}\beta^{\prime}_{d}=96, also shown in Figure 2. The full proof is given in Section A.2.

For the Gaussian kernel, αε14+16ε\alpha^{\prime}_{\varepsilon}\leq\tfrac{1}{4}+\tfrac{1}{6}\varepsilon, so that the Berstein bound is essentially always superior.

2.2 Expected Max Error

We can, however, use a slight generalization of Dudley’s entropy integral [*]dudley to obtain the following bound:

where 0.803<γX<1.5420.803<\gamma^{\prime}_{\mathcal{X}}<1.542. See Section A.4 for details on γX\gamma^{\prime}_{\mathcal{X}} and the “not extremely small” assumption.

The proof is given in Section A.4. It is similar to that for Proposition 3, but the lack of shift invariance increases some constants and otherwise slightly complicates matters. Note also that the RR of Proposition 4 is somewhat larger than that of Proposition 3.

2.3 Concentration About Mean

Bousquet’s inequality [*]bousquet2002 can be used to show exponential concentration of supf\sup f about its mean.

using Equation 20. Clearly 1σw221\leq\sigma_{w}^{2}\leq 2; for the Gaussian kernel, it is 1.

A bound on the lower tail, unfortunately, is not available in the same form.

For f˘\breve{f}, note f˘3\lvert\breve{f}\rvert\leq 3, so we use f˘/3\breve{f}/3. Letting f˘ω:=1D(cos(ωTΔ)k(Δ))\breve{f}_{\omega}:=\tfrac{1}{D}(\cos(\omega^{\mathsf{T}}\Delta)-k(\Delta)), we have σf˘/32=118D(σw2+1)\sigma_{\breve{f}/3}^{2}=\frac{1}{18D}(\sigma_{w}^{2}+1). Thus the same argument gives us:

LL_{\infty} bounds provide useful guarantees, but are very strict. It can also be useful to consider a less stringent error measure. Let μ\mu be a σ\sigma-finite measure on X×X\mathcal{X}\times\mathcal{X}; define

where Equation 46 is justified by Tonelli’s theorem.

The second version of the bound is simpler, but somewhat looser for D1D\gg 1; asymptotically, the coefficient of the denominator becomes 128.

Similarly, the variation of f˘μ\lVert\breve{f}\rVert_{\mu} is bounded by at most 32D+1D2μ(X2)32\frac{D+1}{D^{2}}\mu(\mathcal{X}^{2}) (shown in Section B.2). Thus:

Let kk, μ\mu, μ\lVert\cdot\rVert_{\mu}, and M\mathcal{M} be as in Proposition 7. Define z˘\breve{z} as in Equation 4 and let f˘(x,y)=z˘(x)Tz˘(y)k(x,y)\breve{f}(x,y)=\breve{z}(x)^{\mathsf{T}}\breve{z}(y)-k(x,y). Then

The cost of a simpler dependence on DD is higher here; the asymptotic coefficient of the denominator is 512512.

DOWNSTREAM ERROR

Rahimi and Recht [*]rahimi-nips,rahimi-allerton give a bound on the L2L_{2} distance between any given function in the reproducing kernel Hilbert space (rkhs) induced by kk and the closest function in the rkhs of ss: results invaluable for the study of learning rates. In some situations, however, it is useful to consider not the learning-theoretic convergence of hypotheses to the assumed “true” function, but rather directly consider the difference in predictions due to using the zz embedding instead of the exact kernel kk.

Proposition 1 of bounds the change in krr predictions from approximating the kernel matrix KK by K^\hat{K}, in terms of K^K2\lVert\hat{K}-K\rVert_{2}. They assume, however, that the kernel evaluations at test time kxk_{x} are unapproximated, which is certainly not the case when using Fourier features. We therefore extend their result to Proposition 9 before using it to analyze the performance of Fourier features.

Let α=(K+λI)1y\alpha=(K+\lambda I)^{-1}y, α^=(K^+λI)1y\hat{\alpha}=(\hat{K}+\lambda I)^{-1}y. Thus, using M^1M1=M^1(M^M)M1\hat{M}^{-1}-M^{-1}=-\hat{M}^{-1}(\hat{M}-M)M^{-1}, we have

since the smallest eigenvalues of K+λIK+\lambda I and K^+λI\hat{K}+\lambda I are at least λ\lambda. Since kxnκ\lVert k_{x}\rVert\leq\sqrt{n}\kappa and α^y/λ\lVert\hat{\alpha}\rVert\leq\lVert y\rVert/\lambda:

The claim follows from λ=nλ0\lambda=n\lambda_{0}, y=nσy\lVert y\rVert=\sqrt{n}\sigma_{y}. ∎

Suppose that, per the uniform error bounds of Section 2.2, supk(x,y)s(x,y)ε\sup\left\lvert k(x,y)-s(x,y)\right\rvert\leq\varepsilon. Then k^xkxnε\lVert\hat{k}_{x}-k_{x}\rVert\leq\sqrt{n}\varepsilon and K^K2K^KFnε\lVert\hat{K}-K\rVert_{2}\leq\lVert\hat{K}-K\rVert_{F}\leq n\varepsilon, and Proposition 9 gives

which we can bound with Proposition 1 or 2. We can therefore guarantee h(x)h(x)ε\lvert h(x)-h^{\prime}(x)\rvert\leq\varepsilon with probability at least δ\delta if

Note that this rate does not depend on nn. If we want h(x)h(x)h^{\prime}(x)\to h(x) at least as fast as h(x)h(x)’s convergence rate of O(1/n)O(1/\sqrt{n}) , ignoring the logarithmic terms, we thus need DD to be linear in nn, matching the conclusion of .

2 SUPPORT VECTOR MACHINES

Consider a Support Vector Machine (svm) classifier with no offset, such that h(x)=wTΦ(x)h(x)=w^{\mathsf{T}}\Phi(x) for a kernel embedding Φ(x):XH\Phi(x):\mathcal{X}\to\mathcal{H} and ww is found by

Use the setup of Section 2.2 of . In particular, we will use wκC0\lVert w\rVert\leq\sqrt{\kappa}C_{0} and their (16-17):

where Kx=[KkxkxTk(x,x)]K_{x}=\begin{bmatrix}K&k_{x}\\ k_{x}^{\mathsf{T}}&k(x,x)\end{bmatrix} and eie_{i} the iith standard basis.

Further, Lemma 1 of says that K^x1/2Kx1/22K^xKx21/2\lVert\hat{K}_{x}^{1/2}-K_{x}^{1/2}\rVert_{2}\leq\lVert\hat{K}_{x}-K_{x}\rVert_{2}^{1/2}. Let fx:=k^(x,x)k(x,x)f_{x}:=\hat{k}(x,x)-k(x,x); Then, by Weyl’s inequality for singular values,

Then h^(x)h(x)u\lvert\hat{h}(x)-h(x)\rvert\geq u only if

This bound has the unfortunate property of requiring the approximation to be more accurate as the training set size increases, and thus can prove only a very loose upper bound on the number of features needed to achieve a given approximation accuracy, due to the looseness of Proposition 10. Analyses of generalization error in the induced rkhs, such as , are more useful in this case.

3 MAXIMUM MEAN DISCREPANCY

The distance φ(P)φ(Q)\lVert\varphi(P)-\varphi(Q)\rVert is known as the maximum mean discrepancy (mmd), which can be estimated with:

\textscmmk(X,X)\textsc{{mmk}}(X,X) is a biased estimator, because of the k(Xi,Xi)k(X_{i},X_{i}) and k(Yi,Yi)k(Y_{i},Y_{i}) terms; removing them gives an unbiased estimator . The mmk can be used in standard kernel methods to perform learning on probability distributions, such as when images are treated as sets of local patch descriptors or documents as sets of word descriptors . The mmd has strong applications to two-sample testing, where it serves as the statistic for testing the hypothesis that XX and YY are sampled from the same distribution ; this has applications in, for example, comparing microarray data from different experimental situations or in matching attributes when merging databases.

The mmk estimate can clearly be approximated with an explicit embedding: if k(x,y)z(x)Tz(y)k(x,y)\approx z(x)^{\mathsf{T}}z(y),

Thus the biased estimator of \textscmmk(X,X)\textsc{{mmk}}(X,X) is just zˉ(X)2\left\lVert\bar{z}(X)\right\rVert^{2}; the unbiased estimator is

This has been noticed a few times in the literature, e.g. by . gives different linear-time test statistics based on subsampling the sum over pairs; this version avoids reducing the amount of data used in favor of approximating the kernel. Additionally, when using the mmk in a kernel method this approximation allows the use of linear solvers, whereas the other linear approximations must still perform some pairwise computation. compare the empirical performance of an approximation equivalent to z˘\breve{z} against other linear-time approximations for two-sample testing. They find it is slower than the mmd-linear approximation but far more accurate, while being more accurate and comparable in speed to a block-based BB-test .

also state a simple uniform error bound on the quality of this approximation. Specifically, since we can write \textscmmkz(X,Y)\textscmmk(X,Y)\left\lvert\textsc{{mmk}}_{z}(X,Y)-\textsc{{mmk}}(X,Y)\right\rvert as the mean of f(Xi,Yj)\left\lvert f(X_{i},Y_{j})\right\rvert, uniform error bounds on ff apply directly to \textscmmkz\textsc{{mmk}}_{z}, including to the unbiased version of \textscmmkz(X,X)\textsc{{mmk}}_{z}(X,X). Moreover, since \textscmmd2(X,Y)=\textscmmk(X,X)+\textscmmk(Y,Y)2\textscmmk(X,Y)\textsc{{mmd}}^{2}(X,Y)=\textsc{{mmk}}(X,X)+\textsc{{mmk}}(Y,Y)-2\textsc{{mmk}}(X,Y), its error is at most 44 times f\lVert f\rVert_{\infty}. The advantage of this bound is that it applies uniformly to all sample sets on the input space X\mathcal{X}, which is useful when we use mmk for a kernel method.

which is at most 4/D4/D. The bound for z˘\breve{z} is in fact the same here. Thus McDiarmid’s inequality tells us that for fixed sets XX and YY and either zz,

and expected absolute error of at most 82π/D8\sqrt{2\pi/D}.

Now, if we consider the distributions PP and QQ to be fixed but the sample sets random, Theorems 7 and 10 of give exponential convergence bounds for the biased and unbiased population estimators of mmd, which can easily be combined with the above bounds. Note that this approach allows the domain X\mathcal{X} to be unbounded, unlike the other bound. One could extend this to a bound uniform over some smoothness class of distributions using the techniques of Section 2.2, though we do not do so here.

NUMERICAL EVALUATION

We first conduct a detailed study of the approximations on the interval X=[b,b]\mathcal{X}=[-b,b]. Specifically, we evenly spaced 10001\,000 points on $andapproximatedthekernelmatrixusingbothembeddingsatand approximated the kernel matrix using both embeddings atD\in\{50,100,200,\dots,900,1\,000,2\,000,\dots,9\,000,10\,000\},repeatingeachtrial, repeating each trial1\,000times,estimatingtimes, estimating\lVert f\rVert_{\infty}andand\lVert f\rVert_{\mu}atthosepoints.Wedonotconsiderat those points. We do not considerd>1here,becauseobtainingareliableestimateofhere, because obtaining a reliable estimate of\sup\lvert f\rvertbecomesverycomputationallyexpensiveevenforbecomes very computationally expensive even ford=2$.

Figure 4 fixes b=3b=3 and shows the expected maximal error as a function of DD. It also plots the expected error obtained by numerically integrating the bounds of Propositions 1 and 2 (using the minimum of 1 and the bound). We can see that all of the bounds are fairly loose, but that the first version of the bound in the propositions (with βd\beta_{d}, the exponent depending on dd, and αε\alpha_{\varepsilon}) is substantially tighter than the second version when d=1d=1.

Figure 5 shows the empirical survival function of the max error for D=500D=500, along with the bounds of Propositions 1 and 2 and those of Propositions 5 and 6 using the empirical mean. The latter bounds are tighter than the former for low ε\varepsilon, especially for low DD, but have a lower slope.

2 MAXIMUM MEAN DISCREPANCY

DISCUSSION

This work was funded in part by DARPA grant FA87501220324. DJS is also supported by a Sandia Campus Executive Program fellowship.

References

References

Appendix A PROOFS FOR UNIFORM ERROR BOUND (SECTION 2.2)

The proof strategy closely follows that of ; we fill in some (important) details, tightening some parts of the proof as we go.

f(x,ω)f(x,\omega) is a Lebesgue-integrable function of ω\omega for each xXx\in X.

For almost all ωΩ\omega\in\Omega, the derivative f(x,ω)x\frac{\partial f(x,\omega)}{\partial x} exists for all xXx\in X.

which is finite since the first moment of ω\omega is assumed to exist.

A.1.2 Lipschitz Constant

A.1.3 Anchor Points

Since we know the variance of each term from Equation 13, we could alternatively use Bernstein’s inequality:

This is a better bound when Var[cos(ωTΔ)]+13ε<1\operatorname{Var}[\cos(\omega^{\mathsf{T}}\Delta)]+\frac{1}{3}\varepsilon<1; for the RBF kernel, this is true whenever ε<32\varepsilon<\tfrac{3}{2}, and the improvement is bigger when k(Δ)k(\Delta) is large or ε\varepsilon is small.

To unify the two, let αε:=min(1,maxΔMΔ12+12k(2Δ)k(Δ2)+13ε)\alpha_{\varepsilon}:=\min\left(1,\max_{\Delta\in\mathcal{M}_{\Delta}}\tfrac{1}{2}+\tfrac{1}{2}k(2\Delta)-k(\Delta^{2})+\tfrac{1}{3}\varepsilon\right). Then

A.1.4 Optimizing Over r𝑟r

Combining these two bounds, we have a bound in terms of rr:

If we choose r=(κ1/κ2)1/(d+2)r=(\kappa_{1}/\kappa_{2})^{1/(d+2)}, as did , the bound again becomes 12κ12d+2κ2dd+21-2\kappa_{1}^{\frac{2}{d+2}}\kappa_{2}^{\frac{d}{d+2}}. But we could instead maximize the bound by choosing rr such that dκ1rd12κ2r=0d\kappa_{1}r^{-d-1}-2\kappa_{2}r=0, i.e. r=(dκ12κ2)1d+2r=\left(\frac{d\kappa_{1}}{2\kappa_{2}}\right)^{\frac{1}{d+2}}. Then the bound becomes 1((d2)dd+2+(d2)2d+2)κ12d+2κ2dd+21-\left(\left(\frac{d}{2}\right)^{\frac{-d}{d+2}}+\left(\frac{d}{2}\right)^{\frac{2}{d+2}}\right)\kappa_{1}^{\frac{2}{d+2}}\kappa_{2}^{\frac{d}{d+2}}:

To prove the final statement of Proposition 1, simply set Equation 116 to be at most δ\delta and solve for DD.

A.2 PROOF OF PROPOSITION 2

We will follow the proof strategy of Proposition 1 as closely as possible.

Our approximation is now s˘(x,y)=z˘(x)Tz˘(y)\breve{s}(x,y)=\breve{z}(x)^{\mathsf{T}}\breve{z}(y), and the error is f˘(x,y)=s˘(x,y)k(y,x)\breve{f}(x,y)=\breve{s}(x,y)-k(y,x). Note that s˘\breve{s} and f˘\breve{f} are not shift-invariant: for example, with D=1D=1, s˘(x,y)=cos(ωTΔ)+cos(ωT(x+y)+2b)\breve{s}(x,y)=\cos(\omega^{\mathsf{T}}\Delta)+\cos(\omega^{\mathsf{T}}(x+y)+2b) but s˘(Δ,0)=cos(ωTΔ)+cos(ωTΔ+2b)\breve{s}(\Delta,0)=\cos(\omega^{\mathsf{T}}\Delta)+\cos(\omega^{\mathsf{T}}\Delta+2b).

A.2.2 Lipschitz Constant

The argument follows that of Section A.1.2 up to Equation 105, using qq^{*} in place of Δ\Delta^{*}. Then:

Following through with Markov’s inequality:

A.2.3 Anchor Points

For any fixed x,yx,y, s˘\breve{s} takes a mean of DD terms with expectation k(x,y)k(x,y) bounded by ±2\pm 2. Using Hoeffding’s inequality:

Since the variance of each term is given by Equation 19, we can instead use Bernstein’s inequality:

Thus Bernstein’s gives us a tighter bound if

To unify the bounds, define αε=min(1,maxΔ18+14Var[cos(ωTΔ)]+16ε)\alpha^{\prime}_{\varepsilon}=\min\left(1,\max_{\Delta}\tfrac{1}{8}+\tfrac{1}{4}\operatorname{Var}[\cos(\omega^{\mathsf{T}}\Delta)]+\frac{1}{6}\varepsilon\right); then

A.2.4 Optimizing Over r𝑟r

This is maximized by rr when 2dκ1r2d12κ2r=02d\kappa_{1}r^{-2d-1}-2\kappa_{2}r=0, i.e. r=(dκ1κ2)12d+2r=\left(\frac{d\kappa_{1}}{\kappa_{2}}\right)^{\frac{1}{2d+2}}. Substituting that value of rr into the bound yields 1(ddd+1+d1d+1)κ11d+1κ2dd+11-\left(d^{\frac{-d}{d+1}}+d^{\frac{1}{d+1}}\right)\kappa_{1}^{\frac{1}{d+1}}\kappa_{2}^{\frac{d}{d+1}}, and thus:

To prove the final statement of Proposition 2, set Equation 133 to be at most δ\delta and solve for DD.

A.3 PROOF OF PROPOSITION 3

Our primary tool will be the following slight generalization of Dudley’s entropy integral, which is a special case of Lemma 13.1 of . (The only difference from their Corollary 13.2 is that we maintain the variance factor vv.)

Let T\mathcal{T} be a finite pseudometric space and let (Xt)tT(X_{t})_{t\in\mathcal{T}} be a collection of random variables such that for some constant v>0v>0,

for all t,tTt,t^{\prime}\in\mathcal{T} and all λ>0\lambda>0. Let δ=suptTd(t,t0)\delta=\sup_{t\in\mathcal{T}}d(t,t_{0}). Then, for any t0Tt_{0}\in\mathcal{T},

where γ:=4πerfc(2log2)+log20.964\gamma:=4\sqrt{\pi}\operatorname{erfc}(2\sqrt{\log 2})+\sqrt{\log 2}\approx 0.964.

Now, 2D(cos(ωiTΔ)k(Δ)cos(ωiTΔ)+k(Δ))\frac{2}{D}\left(\cos(\omega_{i}^{\mathsf{T}}\Delta)-k(\Delta)-\cos(\omega_{i}^{\mathsf{T}}\Delta^{\prime})+k(\Delta^{\prime})\right) has mean zero, and absolute value bounded by

Thus, via Hoeffding’s lemma [3, Lemma 2.2], each such term has log moment generating function at most 2D2(ωi+L)2λ2ΔΔ2\frac{2}{D^{2}}(\lVert\omega_{i}\rVert+L)^{2}\lambda^{2}\lVert\Delta-\Delta^{\prime}\rVert^{2}.

A.4 PROOF OF PROPOSITION 4

For the z˘\breve{z} features, the error process again must be defined over X2\mathcal{X}^{2} due to the non-shift invariant noise. We still assume that kk is LL-Lipschitz over XΔ\mathcal{X}_{\Delta}, however.

We will again use the notation of q=(x,y)X2q=(x,y)\in\mathcal{X}^{2}, Δ=xy\Delta=x-y, t=x+yt=x+y. Each term in the sum of f˘(q)f˘(q)\breve{f}(q)-\breve{f}(q^{\prime}) has mean zero and absolute value at most

Now, in order to cast this in terms of distance on X2\mathcal{X}^{2}, let δx=xx\delta_{x}=x-x^{\prime}, δy=yy\delta_{y}=y-y^{\prime}. Then

Pr(f˘ crosses 0)\Pr\left(\breve{f}\text{ crosses }0\right) is extremely close to 1 in “usual” situations.

We can do essentially the same thing for f˘\breve{f}: