On the Error of Random Fourier Features
Danica J. Sutherland, Jeff Schneider
INTRODUCTION
give two such embeddings, based on the Fourier transform of the kernel : one of the form
The primary previous analyses of these embeddings, outside the one in the original paper, have been by , who bound the increase in error of empirical risk estimates when learning models in the induced rkhs, and by , who compare the ability of the Nyström and Fourier embeddings to exploit eigengaps in the learning problem. We instead study the approximation directly, providing a complementary view of the quality of these embeddings.
APPROXIMATION ERROR
We will give various analyses of the error due to each approximation.
The Gaussian kernel has
2 UNIFORM ERROR BOUND
Thus, we can achieve an embedding with pointwise error no more than with probability at least as long as
For the Gaussian kernel, and ; the Bernstein bound is tighter when .
For , since the embedding is not shift-invariant, we must instead place the -net on . The additional noise in also increases the expected Lipschitz constant and gives looser bounds on each term in the sum, though there are twice as many such terms. The corresponding bound is as follows:
Thus, we can achieve an embedding with pointwise error no more than with probability at least as long as
, and , also shown in Figure 2. The full proof is given in Section A.2.
For the Gaussian kernel, , so that the Berstein bound is essentially always superior.
2.2 Expected Max Error
We can, however, use a slight generalization of Dudley’s entropy integral [*]dudley to obtain the following bound:
where . See Section A.4 for details on and the “not extremely small” assumption.
The proof is given in Section A.4. It is similar to that for Proposition 3, but the lack of shift invariance increases some constants and otherwise slightly complicates matters. Note also that the of Proposition 4 is somewhat larger than that of Proposition 3.
2.3 Concentration About Mean
Bousquet’s inequality [*]bousquet2002 can be used to show exponential concentration of about its mean.
using Equation 20. Clearly ; for the Gaussian kernel, it is 1.
A bound on the lower tail, unfortunately, is not available in the same form.
For , note , so we use . Letting , we have . Thus the same argument gives us:
bounds provide useful guarantees, but are very strict. It can also be useful to consider a less stringent error measure. Let be a -finite measure on ; define
where Equation 46 is justified by Tonelli’s theorem.
The second version of the bound is simpler, but somewhat looser for ; asymptotically, the coefficient of the denominator becomes 128.
Similarly, the variation of is bounded by at most (shown in Section B.2). Thus:
Let , , , and be as in Proposition 7. Define as in Equation 4 and let . Then
The cost of a simpler dependence on is higher here; the asymptotic coefficient of the denominator is .
DOWNSTREAM ERROR
Rahimi and Recht [*]rahimi-nips,rahimi-allerton give a bound on the distance between any given function in the reproducing kernel Hilbert space (rkhs) induced by and the closest function in the rkhs of : results invaluable for the study of learning rates. In some situations, however, it is useful to consider not the learning-theoretic convergence of hypotheses to the assumed “true” function, but rather directly consider the difference in predictions due to using the embedding instead of the exact kernel .
Proposition 1 of bounds the change in krr predictions from approximating the kernel matrix by , in terms of . They assume, however, that the kernel evaluations at test time are unapproximated, which is certainly not the case when using Fourier features. We therefore extend their result to Proposition 9 before using it to analyze the performance of Fourier features.
Let , . Thus, using , we have
since the smallest eigenvalues of and are at least . Since and :
The claim follows from , . ∎
Suppose that, per the uniform error bounds of Section 2.2, . Then and , and Proposition 9 gives
which we can bound with Proposition 1 or 2. We can therefore guarantee with probability at least if
Note that this rate does not depend on . If we want at least as fast as ’s convergence rate of , ignoring the logarithmic terms, we thus need to be linear in , matching the conclusion of .
2 SUPPORT VECTOR MACHINES
Consider a Support Vector Machine (svm) classifier with no offset, such that for a kernel embedding and is found by
Use the setup of Section 2.2 of . In particular, we will use and their (16-17):
where and the th standard basis.
Further, Lemma 1 of says that . Let ; Then, by Weyl’s inequality for singular values,
Then only if
This bound has the unfortunate property of requiring the approximation to be more accurate as the training set size increases, and thus can prove only a very loose upper bound on the number of features needed to achieve a given approximation accuracy, due to the looseness of Proposition 10. Analyses of generalization error in the induced rkhs, such as , are more useful in this case.
3 MAXIMUM MEAN DISCREPANCY
The distance is known as the maximum mean discrepancy (mmd), which can be estimated with:
is a biased estimator, because of the and terms; removing them gives an unbiased estimator . The mmk can be used in standard kernel methods to perform learning on probability distributions, such as when images are treated as sets of local patch descriptors or documents as sets of word descriptors . The mmd has strong applications to two-sample testing, where it serves as the statistic for testing the hypothesis that and are sampled from the same distribution ; this has applications in, for example, comparing microarray data from different experimental situations or in matching attributes when merging databases.
The mmk estimate can clearly be approximated with an explicit embedding: if ,
Thus the biased estimator of is just ; the unbiased estimator is
This has been noticed a few times in the literature, e.g. by . gives different linear-time test statistics based on subsampling the sum over pairs; this version avoids reducing the amount of data used in favor of approximating the kernel. Additionally, when using the mmk in a kernel method this approximation allows the use of linear solvers, whereas the other linear approximations must still perform some pairwise computation. compare the empirical performance of an approximation equivalent to against other linear-time approximations for two-sample testing. They find it is slower than the mmd-linear approximation but far more accurate, while being more accurate and comparable in speed to a block-based -test .
also state a simple uniform error bound on the quality of this approximation. Specifically, since we can write as the mean of , uniform error bounds on apply directly to , including to the unbiased version of . Moreover, since , its error is at most times . The advantage of this bound is that it applies uniformly to all sample sets on the input space , which is useful when we use mmk for a kernel method.
which is at most . The bound for is in fact the same here. Thus McDiarmid’s inequality tells us that for fixed sets and and either ,
and expected absolute error of at most .
Now, if we consider the distributions and to be fixed but the sample sets random, Theorems 7 and 10 of give exponential convergence bounds for the biased and unbiased population estimators of mmd, which can easily be combined with the above bounds. Note that this approach allows the domain to be unbounded, unlike the other bound. One could extend this to a bound uniform over some smoothness class of distributions using the techniques of Section 2.2, though we do not do so here.
NUMERICAL EVALUATION
We first conduct a detailed study of the approximations on the interval . Specifically, we evenly spaced points on $D\in\{50,100,200,\dots,900,1\,000,2\,000,\dots,9\,000,10\,000\}1\,000\lVert f\rVert_{\infty}\lVert f\rVert_{\mu}d>1\sup\lvert f\rvertd=2$.
Figure 4 fixes and shows the expected maximal error as a function of . It also plots the expected error obtained by numerically integrating the bounds of Propositions 1 and 2 (using the minimum of 1 and the bound). We can see that all of the bounds are fairly loose, but that the first version of the bound in the propositions (with , the exponent depending on , and ) is substantially tighter than the second version when .
Figure 5 shows the empirical survival function of the max error for , along with the bounds of Propositions 1 and 2 and those of Propositions 5 and 6 using the empirical mean. The latter bounds are tighter than the former for low , especially for low , but have a lower slope.
2 MAXIMUM MEAN DISCREPANCY
DISCUSSION
This work was funded in part by DARPA grant FA87501220324. DJS is also supported by a Sandia Campus Executive Program fellowship.
References
References
Appendix A PROOFS FOR UNIFORM ERROR BOUND (SECTION 2.2)
The proof strategy closely follows that of ; we fill in some (important) details, tightening some parts of the proof as we go.
is a Lebesgue-integrable function of for each .
For almost all , the derivative exists for all .
which is finite since the first moment of is assumed to exist.
A.1.2 Lipschitz Constant
A.1.3 Anchor Points
Since we know the variance of each term from Equation 13, we could alternatively use Bernstein’s inequality:
This is a better bound when ; for the RBF kernel, this is true whenever , and the improvement is bigger when is large or is small.
To unify the two, let . Then
A.1.4 Optimizing Over r𝑟r
Combining these two bounds, we have a bound in terms of :
If we choose , as did , the bound again becomes . But we could instead maximize the bound by choosing such that , i.e. . Then the bound becomes :
To prove the final statement of Proposition 1, simply set Equation 116 to be at most and solve for .
A.2 PROOF OF PROPOSITION 2
We will follow the proof strategy of Proposition 1 as closely as possible.
Our approximation is now , and the error is . Note that and are not shift-invariant: for example, with , but .
A.2.2 Lipschitz Constant
The argument follows that of Section A.1.2 up to Equation 105, using in place of . Then:
Following through with Markov’s inequality:
A.2.3 Anchor Points
For any fixed , takes a mean of terms with expectation bounded by . Using Hoeffding’s inequality:
Since the variance of each term is given by Equation 19, we can instead use Bernstein’s inequality:
Thus Bernstein’s gives us a tighter bound if
To unify the bounds, define ; then
A.2.4 Optimizing Over r𝑟r
This is maximized by when , i.e. . Substituting that value of into the bound yields , and thus:
To prove the final statement of Proposition 2, set Equation 133 to be at most and solve for .
A.3 PROOF OF PROPOSITION 3
Our primary tool will be the following slight generalization of Dudley’s entropy integral, which is a special case of Lemma 13.1 of . (The only difference from their Corollary 13.2 is that we maintain the variance factor .)
Let be a finite pseudometric space and let be a collection of random variables such that for some constant ,
for all and all . Let . Then, for any ,
where .
Now, has mean zero, and absolute value bounded by
Thus, via Hoeffding’s lemma [3, Lemma 2.2], each such term has log moment generating function at most .
A.4 PROOF OF PROPOSITION 4
For the features, the error process again must be defined over due to the non-shift invariant noise. We still assume that is -Lipschitz over , however.
We will again use the notation of , , . Each term in the sum of has mean zero and absolute value at most
Now, in order to cast this in terms of distance on , let , . Then
is extremely close to 1 in “usual” situations.
We can do essentially the same thing for :