On the Error of Random Fourier Features

Danica J. Sutherland, Jeff Schneider

INTRODUCTION

give two such embeddings, based on the Fourier transform $P(\omega)$ of the kernel $k$ : one of the form

The primary previous analyses of these embeddings, outside the one in the original paper, have been by , who bound the increase in error of empirical risk estimates when learning models in the induced rkhs, and by , who compare the ability of the Nyström and Fourier embeddings to exploit eigengaps in the learning problem. We instead study the approximation directly, providing a complementary view of the quality of these embeddings.

APPROXIMATION ERROR

We will give various analyses of the error due to each approximation.

The Gaussian kernel $k(\Delta)=\exp\left(-\frac{\left\lVert\Delta\right\rVert^{2}}{2\sigma^{2}}\right)$ has

2 UNIFORM ERROR BOUND

Thus, we can achieve an embedding with pointwise error no more than $\varepsilon$ with probability at least $1-\delta$ as long as

For the Gaussian kernel, $\alpha_{\varepsilon}\leq\tfrac{1}{2}+\frac{1}{3}\varepsilon$ and $\sigma_{p}^{2}=d/\sigma^{2}$ ; the Bernstein bound is tighter when $\varepsilon<\frac{3}{2}$ .

For $\breve{z}$ , since the embedding $\breve{s}$ is not shift-invariant, we must instead place the $\varepsilon$ -net on $\mathcal{X}^{2}$ . The additional noise in $\breve{s}$ also increases the expected Lipschitz constant and gives looser bounds on each term in the sum, though there are twice as many such terms. The corresponding bound is as follows:

Thus, we can achieve an embedding with pointwise error no more than $\varepsilon$ with probability at least $1-\delta$ as long as

$\beta^{\prime}_{48}=98$ , and $\lim_{d\to\infty}\beta^{\prime}_{d}=96$ , also shown in Figure 2. The full proof is given in Section A.2.

For the Gaussian kernel, $\alpha^{\prime}_{\varepsilon}\leq\tfrac{1}{4}+\tfrac{1}{6}\varepsilon$ , so that the Berstein bound is essentially always superior.

2.2 Expected Max Error

We can, however, use a slight generalization of Dudley’s entropy integral [*]dudley to obtain the following bound:

where $0.803<\gamma^{\prime}_{\mathcal{X}}<1.542$ . See Section A.4 for details on $\gamma^{\prime}_{\mathcal{X}}$ and the “not extremely small” assumption.

The proof is given in Section A.4. It is similar to that for Proposition 3, but the lack of shift invariance increases some constants and otherwise slightly complicates matters. Note also that the $R$ of Proposition 4 is somewhat larger than that of Proposition 3.

2.3 Concentration About Mean

Bousquet’s inequality [*]bousquet2002 can be used to show exponential concentration of $\sup f$ about its mean.

using Equation 20. Clearly $1\leq\sigma_{w}^{2}\leq 2$ ; for the Gaussian kernel, it is 1.

A bound on the lower tail, unfortunately, is not available in the same form.

For $\breve{f}$ , note $\lvert\breve{f}\rvert\leq 3$ , so we use $\breve{f}/3$ . Letting $\breve{f}_{\omega}:=\tfrac{1}{D}(\cos(\omega^{\mathsf{T}}\Delta)-k(\Delta))$ , we have $\sigma_{\breve{f}/3}^{2}=\frac{1}{18D}(\sigma_{w}^{2}+1)$ . Thus the same argument gives us:

$L_{\infty}$ bounds provide useful guarantees, but are very strict. It can also be useful to consider a less stringent error measure. Let $\mu$ be a $\sigma$ -finite measure on $\mathcal{X}\times\mathcal{X}$ ; define

where Equation 46 is justified by Tonelli’s theorem.

The second version of the bound is simpler, but somewhat looser for $D\gg 1$ ; asymptotically, the coefficient of the denominator becomes 128.

Similarly, the variation of $\lVert\breve{f}\rVert_{\mu}$ is bounded by at most $32\frac{D+1}{D^{2}}\mu(\mathcal{X}^{2})$ (shown in Section B.2). Thus:

Let $k$ , $\mu$ , $\lVert\cdot\rVert_{\mu}$ , and $\mathcal{M}$ be as in Proposition 7. Define $\breve{z}$ as in Equation 4 and let $\breve{f}(x,y)=\breve{z}(x)^{\mathsf{T}}\breve{z}(y)-k(x,y)$ . Then

The cost of a simpler dependence on $D$ is higher here; the asymptotic coefficient of the denominator is $512$ .

DOWNSTREAM ERROR

Rahimi and Recht [*]rahimi-nips,rahimi-allerton give a bound on the $L_{2}$ distance between any given function in the reproducing kernel Hilbert space (rkhs) induced by $k$ and the closest function in the rkhs of $s$ : results invaluable for the study of learning rates. In some situations, however, it is useful to consider not the learning-theoretic convergence of hypotheses to the assumed “true” function, but rather directly consider the difference in predictions due to using the $z$ embedding instead of the exact kernel $k$ .

Proposition 1 of bounds the change in krr predictions from approximating the kernel matrix $K$ by $\hat{K}$ , in terms of $\lVert\hat{K}-K\rVert_{2}$ . They assume, however, that the kernel evaluations at test time $k_{x}$ are unapproximated, which is certainly not the case when using Fourier features. We therefore extend their result to Proposition 9 before using it to analyze the performance of Fourier features.

Let $\alpha=(K+\lambda I)^{-1}y$ , $\hat{\alpha}=(\hat{K}+\lambda I)^{-1}y$ . Thus, using $\hat{M}^{-1}-M^{-1}=-\hat{M}^{-1}(\hat{M}-M)M^{-1}$ , we have

since the smallest eigenvalues of $K+\lambda I$ and $\hat{K}+\lambda I$ are at least $\lambda$ . Since $\lVert k_{x}\rVert\leq\sqrt{n}\kappa$ and $\lVert\hat{\alpha}\rVert\leq\lVert y\rVert/\lambda$ :

The claim follows from $\lambda=n\lambda_{0}$ , $\lVert y\rVert=\sqrt{n}\sigma_{y}$ . ∎

Suppose that, per the uniform error bounds of Section 2.2, $\sup\left\lvert k(x,y)-s(x,y)\right\rvert\leq\varepsilon$ . Then $\lVert\hat{k}_{x}-k_{x}\rVert\leq\sqrt{n}\varepsilon$ and $\lVert\hat{K}-K\rVert_{2}\leq\lVert\hat{K}-K\rVert_{F}\leq n\varepsilon$ , and Proposition 9 gives

which we can bound with Proposition 1 or 2. We can therefore guarantee $\lvert h(x)-h^{\prime}(x)\rvert\leq\varepsilon$ with probability at least $\delta$ if

Note that this rate does not depend on $n$ . If we want $h^{\prime}(x)\to h(x)$ at least as fast as $h(x)$ ’s convergence rate of $O(1/\sqrt{n})$ , ignoring the logarithmic terms, we thus need $D$ to be linear in $n$ , matching the conclusion of .

2 SUPPORT VECTOR MACHINES

Consider a Support Vector Machine (svm) classifier with no offset, such that $h(x)=w^{\mathsf{T}}\Phi(x)$ for a kernel embedding $\Phi(x):\mathcal{X}\to\mathcal{H}$ and $w$ is found by

Use the setup of Section 2.2 of . In particular, we will use $\lVert w\rVert\leq\sqrt{\kappa}C_{0}$ and their (16-17):

where $K_{x}=\begin{bmatrix}K&k_{x}\\ k_{x}^{\mathsf{T}}&k(x,x)\end{bmatrix}$ and $e_{i}$ the $i$ th standard basis.

Further, Lemma 1 of says that $\lVert\hat{K}_{x}^{1/2}-K_{x}^{1/2}\rVert_{2}\leq\lVert\hat{K}_{x}-K_{x}\rVert_{2}^{1/2}$ . Let $f_{x}:=\hat{k}(x,x)-k(x,x)$ ; Then, by Weyl’s inequality for singular values,

Then $\lvert\hat{h}(x)-h(x)\rvert\geq u$ only if

This bound has the unfortunate property of requiring the approximation to be more accurate as the training set size increases, and thus can prove only a very loose upper bound on the number of features needed to achieve a given approximation accuracy, due to the looseness of Proposition 10. Analyses of generalization error in the induced rkhs, such as , are more useful in this case.

3 MAXIMUM MEAN DISCREPANCY

The distance $\lVert\varphi(P)-\varphi(Q)\rVert$ is known as the maximum mean discrepancy (mmd), which can be estimated with:

$\textsc{{mmk}}(X,X)$ is a biased estimator, because of the $k(X_{i},X_{i})$ and $k(Y_{i},Y_{i})$ terms; removing them gives an unbiased estimator . The mmk can be used in standard kernel methods to perform learning on probability distributions, such as when images are treated as sets of local patch descriptors or documents as sets of word descriptors . The mmd has strong applications to two-sample testing, where it serves as the statistic for testing the hypothesis that $X$ and $Y$ are sampled from the same distribution ; this has applications in, for example, comparing microarray data from different experimental situations or in matching attributes when merging databases.

The mmk estimate can clearly be approximated with an explicit embedding: if $k(x,y)\approx z(x)^{\mathsf{T}}z(y)$ ,

Thus the biased estimator of $\textsc{{mmk}}(X,X)$ is just $\left\lVert\bar{z}(X)\right\rVert^{2}$ ; the unbiased estimator is

This has been noticed a few times in the literature, e.g. by . gives different linear-time test statistics based on subsampling the sum over pairs; this version avoids reducing the amount of data used in favor of approximating the kernel. Additionally, when using the mmk in a kernel method this approximation allows the use of linear solvers, whereas the other linear approximations must still perform some pairwise computation. compare the empirical performance of an approximation equivalent to $\breve{z}$ against other linear-time approximations for two-sample testing. They find it is slower than the mmd-linear approximation but far more accurate, while being more accurate and comparable in speed to a block-based $B$ -test .

also state a simple uniform error bound on the quality of this approximation. Specifically, since we can write $\left\lvert\textsc{{mmk}}_{z}(X,Y)-\textsc{{mmk}}(X,Y)\right\rvert$ as the mean of $\left\lvert f(X_{i},Y_{j})\right\rvert$ , uniform error bounds on $f$ apply directly to $\textsc{{mmk}}_{z}$ , including to the unbiased version of $\textsc{{mmk}}_{z}(X,X)$ . Moreover, since $\textsc{{mmd}}^{2}(X,Y)=\textsc{{mmk}}(X,X)+\textsc{{mmk}}(Y,Y)-2\textsc{{mmk}}(X,Y)$ , its error is at most $4$ times $\lVert f\rVert_{\infty}$ . The advantage of this bound is that it applies uniformly to all sample sets on the input space $\mathcal{X}$ , which is useful when we use mmk for a kernel method.

which is at most $4/D$ . The bound for $\breve{z}$ is in fact the same here. Thus McDiarmid’s inequality tells us that for fixed sets $X$ and $Y$ and either $z$ ,

and expected absolute error of at most $8\sqrt{2\pi/D}$ .

Now, if we consider the distributions $P$ and $Q$ to be fixed but the sample sets random, Theorems 7 and 10 of give exponential convergence bounds for the biased and unbiased population estimators of mmd, which can easily be combined with the above bounds. Note that this approach allows the domain $\mathcal{X}$ to be unbounded, unlike the other bound. One could extend this to a bound uniform over some smoothness class of distributions using the techniques of Section 2.2, though we do not do so here.

NUMERICAL EVALUATION

We first conduct a detailed study of the approximations on the interval $\mathcal{X}=[-b,b]$ . Specifically, we evenly spaced $1\,000$ points on $ $and approximated the kernel matrix using both embeddings at$ D\in\{50,100,200,\dots,900,1\,000,2\,000,\dots,9\,000,10\,000\} $, repeating each trial$ 1\,000 $times, estimating$ \lVert f\rVert_{\infty} $and$ \lVert f\rVert_{\mu} $at those points. We do not consider$ d>1 $here, because obtaining a reliable estimate of$ \sup\lvert f\rvert $becomes very computationally expensive even for$ d=2$.

Figure 4 fixes $b=3$ and shows the expected maximal error as a function of $D$ . It also plots the expected error obtained by numerically integrating the bounds of Propositions 1 and 2 (using the minimum of 1 and the bound). We can see that all of the bounds are fairly loose, but that the first version of the bound in the propositions (with $\beta_{d}$ , the exponent depending on $d$ , and $\alpha_{\varepsilon}$ ) is substantially tighter than the second version when $d=1$ .

Figure 5 shows the empirical survival function of the max error for $D=500$ , along with the bounds of Propositions 1 and 2 and those of Propositions 5 and 6 using the empirical mean. The latter bounds are tighter than the former for low $\varepsilon$ , especially for low $D$ , but have a lower slope.

2 MAXIMUM MEAN DISCREPANCY

DISCUSSION

This work was funded in part by DARPA grant FA87501220324. DJS is also supported by a Sandia Campus Executive Program fellowship.

References

Appendix A PROOFS FOR UNIFORM ERROR BOUND (SECTION 2.2)

The proof strategy closely follows that of ; we fill in some (important) details, tightening some parts of the proof as we go.

$f(x,\omega)$ is a Lebesgue-integrable function of $\omega$ for each $x\in X$ .

For almost all $\omega\in\Omega$ , the derivative $\frac{\partial f(x,\omega)}{\partial x}$ exists for all $x\in X$ .

which is finite since the first moment of $\omega$ is assumed to exist.

A.1.2 Lipschitz Constant

A.1.3 Anchor Points

Since we know the variance of each term from Equation 13, we could alternatively use Bernstein’s inequality:

This is a better bound when $\operatorname{Var}[\cos(\omega^{\mathsf{T}}\Delta)]+\frac{1}{3}\varepsilon<1$ ; for the RBF kernel, this is true whenever $\varepsilon<\tfrac{3}{2}$ , and the improvement is bigger when $k(\Delta)$ is large or $\varepsilon$ is small.

To unify the two, let $\alpha_{\varepsilon}:=\min\left(1,\max_{\Delta\in\mathcal{M}_{\Delta}}\tfrac{1}{2}+\tfrac{1}{2}k(2\Delta)-k(\Delta^{2})+\tfrac{1}{3}\varepsilon\right)$ . Then

A.1.4 Optimizing Over r𝑟r

Combining these two bounds, we have a bound in terms of $r$ :

If we choose $r=(\kappa_{1}/\kappa_{2})^{1/(d+2)}$ , as did , the bound again becomes $1-2\kappa_{1}^{\frac{2}{d+2}}\kappa_{2}^{\frac{d}{d+2}}$ . But we could instead maximize the bound by choosing $r$ such that $d\kappa_{1}r^{-d-1}-2\kappa_{2}r=0$ , i.e. $r=\left(\frac{d\kappa_{1}}{2\kappa_{2}}\right)^{\frac{1}{d+2}}$ . Then the bound becomes $1-\left(\left(\frac{d}{2}\right)^{\frac{-d}{d+2}}+\left(\frac{d}{2}\right)^{\frac{2}{d+2}}\right)\kappa_{1}^{\frac{2}{d+2}}\kappa_{2}^{\frac{d}{d+2}}$ :

To prove the final statement of Proposition 1, simply set Equation 116 to be at most $\delta$ and solve for $D$ .

A.2 PROOF OF PROPOSITION 2

We will follow the proof strategy of Proposition 1 as closely as possible.

Our approximation is now $\breve{s}(x,y)=\breve{z}(x)^{\mathsf{T}}\breve{z}(y)$ , and the error is $\breve{f}(x,y)=\breve{s}(x,y)-k(y,x)$ . Note that $\breve{s}$ and $\breve{f}$ are not shift-invariant: for example, with $D=1$ , $\breve{s}(x,y)=\cos(\omega^{\mathsf{T}}\Delta)+\cos(\omega^{\mathsf{T}}(x+y)+2b)$ but $\breve{s}(\Delta,0)=\cos(\omega^{\mathsf{T}}\Delta)+\cos(\omega^{\mathsf{T}}\Delta+2b)$ .

A.2.2 Lipschitz Constant

The argument follows that of Section A.1.2 up to Equation 105, using $q^{*}$ in place of $\Delta^{*}$ . Then:

Following through with Markov’s inequality:

A.2.3 Anchor Points

For any fixed $x,y$ , $\breve{s}$ takes a mean of $D$ terms with expectation $k(x,y)$ bounded by $\pm 2$ . Using Hoeffding’s inequality:

Since the variance of each term is given by Equation 19, we can instead use Bernstein’s inequality:

Thus Bernstein’s gives us a tighter bound if

To unify the bounds, define $\alpha^{\prime}_{\varepsilon}=\min\left(1,\max_{\Delta}\tfrac{1}{8}+\tfrac{1}{4}\operatorname{Var}[\cos(\omega^{\mathsf{T}}\Delta)]+\frac{1}{6}\varepsilon\right)$ ; then

A.2.4 Optimizing Over r𝑟r

This is maximized by $r$ when $2d\kappa_{1}r^{-2d-1}-2\kappa_{2}r=0$ , i.e. $r=\left(\frac{d\kappa_{1}}{\kappa_{2}}\right)^{\frac{1}{2d+2}}$ . Substituting that value of $r$ into the bound yields $1-\left(d^{\frac{-d}{d+1}}+d^{\frac{1}{d+1}}\right)\kappa_{1}^{\frac{1}{d+1}}\kappa_{2}^{\frac{d}{d+1}}$ , and thus:

To prove the final statement of Proposition 2, set Equation 133 to be at most $\delta$ and solve for $D$ .

A.3 PROOF OF PROPOSITION 3

Our primary tool will be the following slight generalization of Dudley’s entropy integral, which is a special case of Lemma 13.1 of . (The only difference from their Corollary 13.2 is that we maintain the variance factor $v$ .)

Let $\mathcal{T}$ be a finite pseudometric space and let $(X_{t})_{t\in\mathcal{T}}$ be a collection of random variables such that for some constant $v>0$ ,

for all $t,t^{\prime}\in\mathcal{T}$ and all $\lambda>0$ . Let $\delta=\sup_{t\in\mathcal{T}}d(t,t_{0})$ . Then, for any $t_{0}\in\mathcal{T}$ ,

where $\gamma:=4\sqrt{\pi}\operatorname{erfc}(2\sqrt{\log 2})+\sqrt{\log 2}\approx 0.964$ .

Now, $\frac{2}{D}\left(\cos(\omega_{i}^{\mathsf{T}}\Delta)-k(\Delta)-\cos(\omega_{i}^{\mathsf{T}}\Delta^{\prime})+k(\Delta^{\prime})\right)$ has mean zero, and absolute value bounded by

Thus, via Hoeffding’s lemma [3, Lemma 2.2], each such term has log moment generating function at most $\frac{2}{D^{2}}(\lVert\omega_{i}\rVert+L)^{2}\lambda^{2}\lVert\Delta-\Delta^{\prime}\rVert^{2}$ .

A.4 PROOF OF PROPOSITION 4

For the $\breve{z}$ features, the error process again must be defined over $\mathcal{X}^{2}$ due to the non-shift invariant noise. We still assume that $k$ is $L$ -Lipschitz over $\mathcal{X}_{\Delta}$ , however.

We will again use the notation of $q=(x,y)\in\mathcal{X}^{2}$ , $\Delta=x-y$ , $t=x+y$ . Each term in the sum of $\breve{f}(q)-\breve{f}(q^{\prime})$ has mean zero and absolute value at most

Now, in order to cast this in terms of distance on $\mathcal{X}^{2}$ , let $\delta_{x}=x-x^{\prime}$ , $\delta_{y}=y-y^{\prime}$ . Then

$\Pr\left(\breve{f}\text{ crosses }0\right)$ is extremely close to 1 in “usual” situations.

We can do essentially the same thing for $\breve{f}$ :