Optimal Rates for Random Fourier Features
Bharath K. Sriperumbudur, Zoltan Szabo
Introduction
Kernel methods [scholkopf02learning] have enjoyed tremendous success in solving several fundamental problems of machine learning ranging from classification, regression, feature extraction, dependency estimation, causal discovery, Bayesian inference and hypothesis testing. Such a success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows to implicitly compute inner products between these high dimensional feature maps, via a kernel function : . However, this flexibility and richness of kernels has a price: by resorting to implicit computations these methods operate on the Gram matrix of the data, which raises serious computational challenges while dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [Williams-01, drineas05nystrom, alaoui15fast], explicit feature maps designed for additive kernels [vedaldi12efficient, maji13efficient], hashing [shi09hash, kulis12kernelized], and random Fourier features (RFF) [rahimi07random] constructed for shift-invariant kernels, the focus of the current paper.
RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map associated with the kernel, by appealing to Bochner’s theorem [wendland05scattered]—any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure—-[rahimi07random] proposed an explicit low-dimensional random Fourier feature map obtained by empirically approximating the Fourier integral so that . The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby enabling to handle large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [rahimi07random]. Another advantage with the RFF approach is that unlike low rank matrix approximation approach [Williams-01, drineas05nystrom] which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful while dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [oliva15fast], differential privacy preserving [chaudhuri11differentially] and causal discovery [lopezpaz15towards].
Traditionally, kernel based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [zhou08derivative, shi10hermite], nonlinear variable selection [rosasco10regularization, rosasco13nonparametric], (multi-task) gradient learning [ying12learning] and fitting of distributions in an infinite-dimensional exponential family [sriperumbudur14density]. Given the importance of these derivative based kernel algorithms, similar to [rahimi07random], in Section 4, we propose a finite dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above mentioned derivative based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and -norms and show the rate of convergence to be in both these cases.
A summary of our contributions are as follows. We
provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives.
prove uniform and convergence on fixed compacts sets with optimal rate in terms of the RFF dimension ();
give sufficient conditions for the growth rate of compact sets while preserving a.s. convergence uniformly and in ; specializing our result we match the best attainable asymptotic growth rate.
Various notations and definitions that are used throughout the paper are provided in Section 2 along with a brief review of RFF approximation proposed by [rahimi07random]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.
Notations & preliminaries
where and when . The condition implies that (and therefore ) is twice differentiable. From (3) it is clear that the probability has polynomial tails if (i.e., small ) and Gaussian tails if (i.e., large ) and can be equivalently written as
where . For sufficiently large (i.e., ), it follows from (4) that
While (5) shows that is a consistent estimator of in the topology of compact convergence (i.e., convergences to uniformly over compact sets), the rate of convergence of is not optimal. In addition, the order of dependence on is not optimal. While a faster rate (in fact, an optimal rate) of convergence is desired—better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using —, the order of dependence on is also important as it determines the the number of RFF features (i.e., ) that are needed to achieve a given approximation accuracy. In fact, the order of dependence on controls the rate at which can be grown as a function of when (see Remark 1(ii) for a detailed discussion about the significance of growing ). In the following section, we present an analogue of (4)—see Theorem 1—that provides optimal rates and has correct dependence on .
Main results: approximation of k𝑘k
where .
Note that , where , which means the object of interest is the suprema of an empirical process indexed by . Instead of bounding by using Hoeffding’s inequality on a cover of and then applying union bound as carried out in [rahimi07random, sutherland15error], we use the refined technique of applying concentration via McDiarmid’s inequality, followed by symmetrization and bound the Rademacher average by Dudley entropy bound. The result is obtained by carefully bounding the -covering number of . The details are provided in Section B.1 of the supplementary material. ∎
Corollary 2 shows that and therefore if as , then consistency of in -norm is achieved as long as as . This means, in comparison to the uniform norm in Theorem 1 where can grow exponential in (), cannot grow faster than () to achieve consistency in -norm.
Instead of using Theorem 1 to obtain a bound on (this bound may be weak as for any ), a better bound (for ) can be obtained by directly bounding , as shown in the following result.
where is the Khintchine constant given by for and for .
Approximation of kernel derivatives
In the previous section we focused on the approximation of the kernel function where we presented uniform and convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow to preserve uniform and convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of the kernel through finite dimensional random feature map is also important as it enables to speed up several interesting machine learning tasks that involve the derivatives of the kernel [zhou08derivative, shi10hermite, rosasco10regularization, rosasco13nonparametric, ying12learning, sriperumbudur14density], see for example the recent infinite dimensional exponential family fitting technique [strathmann15gradient], which implements this idea.
so that can be approximated by replacing with , resulting in
where and . Now the goal is to understand the behavior of and for , i.e., obtain analogues of Theorems 1 and 3.
.
(i) Note that Theorem 4 reduces to Theorem 1 if , in which case . If or , then the boundedness of implies that and . (ii) Growth of : By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that if and if (for ) as . An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that is bounded and it can be shown that for , if .
The following result relaxes the boundedness of by imposing certain moment conditions on but at the expense of a worse rate. The proof relies on applying Bernstein inequality at the elements of a net (which exists by the compactness of ) combined with a union bound, and extending the approximation error from the anchors by a probabilistic Lipschitz argument.
where . Define . is monotonically decreasing in , . Then
Discussion
Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.
References
Appendix A Definitions & notation
We provide proofs of the results presented in Sections 3 and 4. Lemmas used in the proofs are enlisted in Section C.
Below we prove Theorem 4, thereby Theorem 1 (). The idea of the proof is as follows: (i) We note that
is a separable Carathéodory family: is a separable Carathéodory family w.r.t. since
is continuous for all .
Concentration of by its bounded difference property: By defining , we have that for ,
Applying McDiarmid’s inequality (Lemma C.1) to , for any , with probability at least over the choice of ,
Bounding : Using Dudley’s entropy integral [bousquet03new, Equation 4.4], we have
The upper limit of the integral can be bounded as
Bounding by the compactness of : For any , ,
By the mean value theorem, there exists such that
(B.6) shows that the existence of an -net on implies an -net on . In other words,
by noting that . Using (B.5) and (B.7) in (B.4), we have
where in the last inequality we used the fact that . By bounding , (B.8) reduces to
where the last equality is obtained by changing the variable of integration and defining . By applying Lemma C.2 to bound the integral in (B.9), we obtain
Bounding the expectation of the Rademacher average: From (B.10), we have
Final bound: Combining (B.2), (B.3) and (B.11) yields the result.∎
B.2 Proof of Theorem 3
and therefore by McDiarmid’s inequality (Lemma C.1), for any , with probability at least over the choice of , we have
One can rewrite the argument of this supremum by Eqs. (1)-(2) as
since is of type [lindenstrauss79classical, page 73] and there exists a universal constant independent of (the so-called Khintchine constant) [ledoux02probability, page 247] such that holds; in addition we used
and .
Combining (B.12)–(B.16) and using the bound on given in the proof of Corollary 2 yields the result.∎
B.3 Proof of Theorem 5
Below we give the detailed proof of Theorem 5. At high-level the proof goes as follows: (i) By the compactness of (implied by that of ) one can take an -net covering (for any ). (ii) Small approximation error can be guaranteed at the centers of the -net by Bernstein’s inequality combined with a union bound. (iii) Propagation of the error from the centers to arbitrary points is achieved by Lipschitzness. (iv) The Lipschitz constant is, however, a random quantity and we show with high probability that it is ‘not too large’. (v) Union bounding the two events (small errors at the centers and small Lipschitz constant) leads to a uniform bound for arbitrary , which holds with high probability. (vi) Optimizing over gives the stated result.
Formally, the proof is as follows. Let us define
where . Let us notice that since is compact (by the compactness of , implied by that of ) and is continuous, the supremum inside the expectation in is finite for any .
Covering of : By the compactness of there exist an -net with at most
balls covering [geer09empirical, Lemma 2.5, page 20], where we used that . Let us denote the centers of this -net by .
Bounding , where ; is fixed: Let
is continuously differentiable since is so. Thus by the mean value theorem such that
Hence by the Cauchy-Bunyakovsky-Schwarz inequality, we get
where we used the compactness of (implied by that of ) and the continuity of the mapping to guarantee that exists, and it is finite for any .
Bound on : Note that
By the homogenity of norms (), the chain rule, and (, )
Combining Eq. (B.20) and (B.21) results in the bound
Error propagation from the net centers: We will use the following note to propagate the error from the net centers (, ) to an arbitrary point. Note: If () and , then
where we used (B.18) and our assumptions in the note, thereby yielding (B.23).
Guaranteeing the conditions of (B.23) with high probability:
Setting , (B.24) is written as
By union bounding (), we get
Condition : Applying Markov’s inequality to (note that is non-negative), for any , we obtain
by invoking (B.19) and (B.22). Choosing , we have
Final bound for any : By (B.25) and (B.26), and substituting the explicit form of in (B.17), we get
Jensen’s inequality in (), , and .
Matching the two terms to choose : Maximizing w.r.t. in (B.27)
we note that maximizes it. Using this in (B.27), we have
where .
B.4 Proof of bounded supp(Λ)suppΛ\text{supp}(\Lambda) ⇒⇒\Rightarrow (7)
We prove that the boundedness of implies that of [see (B.28)], specifically (7).
Applying the triangle inequality and () we have
is finite since is bounded, thus is bounded.
Appendix C Supplementary results
In this section, we present some technical results that are used in the proofs.
By change of variables, we have . Applying partial integration, we have
Note: the -algebra of Lebesgue measurable sets is typically not countably generated [bogachev07measure, page 106 (vol. I)].