Optimal Rates for Random Fourier Features

Bharath K. Sriperumbudur, Zoltan Szabo

Introduction

Kernel methods [scholkopf02learning] have enjoyed tremendous success in solving several fundamental problems of machine learning ranging from classification, regression, feature extraction, dependency estimation, causal discovery, Bayesian inference and hypothesis testing. Such a success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows to implicitly compute inner products between these high dimensional feature maps, λ\lambda via a kernel function kk: k(x,y)=λ(x),λ(y)k(\mathbf{x},\mathbf{y})=\langle\lambda(\mathbf{x}),\lambda(\mathbf{y})\rangle. However, this flexibility and richness of kernels has a price: by resorting to implicit computations these methods operate on the Gram matrix of the data, which raises serious computational challenges while dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [Williams-01, drineas05nystrom, alaoui15fast], explicit feature maps designed for additive kernels [vedaldi12efficient, maji13efficient], hashing [shi09hash, kulis12kernelized], and random Fourier features (RFF) [rahimi07random] constructed for shift-invariant kernels, the focus of the current paper.

RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map λ\lambda associated with the kernel, by appealing to Bochner’s theorem [wendland05scattered]—any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure—-[rahimi07random] proposed an explicit low-dimensional random Fourier feature map ϕ\phi obtained by empirically approximating the Fourier integral so that k(x,y)ϕ(x),ϕ(y)k(\mathbf{x},\mathbf{y})\approx\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle. The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby enabling to handle large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [rahimi07random]. Another advantage with the RFF approach is that unlike low rank matrix approximation approach [Williams-01, drineas05nystrom] which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful while dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [oliva15fast], differential privacy preserving [chaudhuri11differentially] and causal discovery [lopezpaz15towards].

Traditionally, kernel based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [zhou08derivative, shi10hermite], nonlinear variable selection [rosasco10regularization, rosasco13nonparametric], (multi-task) gradient learning [ying12learning] and fitting of distributions in an infinite-dimensional exponential family [sriperumbudur14density]. Given the importance of these derivative based kernel algorithms, similar to [rahimi07random], in Section 4, we propose a finite dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above mentioned derivative based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and LrL^{r}-norms and show the rate of convergence to be m1/2m^{-1/2} in both these cases.

A summary of our contributions are as follows. We

provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives.

prove uniform and LrL^{r} convergence on fixed compacts sets with optimal rate in terms of the RFF dimension (mm);

give sufficient conditions for the growth rate of compact sets while preserving a.s. convergence uniformly and in LrL^{r}; specializing our result we match the best attainable asymptotic growth rate.

Various notations and definitions that are used throughout the paper are provided in Section 2 along with a brief review of RFF approximation proposed by [rahimi07random]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.

Notations & preliminaries

where σ2:=ω2dΛ(ω)\sigma^{2}:=\int\|\bm{\omega}\|^{2}\,d\Lambda(\bm{\omega}) and Cd:=26d+2d+2((2d)dd+2+(d2)2d+2)27d2d+2C_{d}:=2^{\frac{6d+2}{d+2}}\left(\left(\frac{2}{d}\right)^{\frac{d}{d+2}}+\left(\frac{d}{2}\right)^{\frac{2}{d+2}}\right)\leq 2^{7}d^{\frac{2}{d+2}} when d2d\geq 2. The condition σ2<\sigma^{2}<\infty implies that ψ\psi (and therefore kk) is twice differentiable. From (3) it is clear that the probability has polynomial tails if ϵ<Sσ\epsilon<|\mathscr{S}|\sigma (i.e., small ϵ\epsilon) and Gaussian tails if ϵSσ\epsilon\geq|\mathscr{S}|\sigma (i.e., large ϵ\epsilon) and can be equivalently written as

where α:=4dCdd+2dS2σ2\alpha:=4d-C^{\frac{d+2}{d}}_{d}|\mathscr{S}|^{2}\sigma^{2}. For S|\mathscr{S}| sufficiently large (i.e., α<0\alpha<0), it follows from (4) that

While (5) shows that k^\hat{k} is a consistent estimator of kk in the topology of compact convergence (i.e., k^\hat{k} convergences to kk uniformly over compact sets), the rate of convergence of (logm)/m\sqrt{(\log m)/m} is not optimal. In addition, the order of dependence on S|\mathscr{S}| is not optimal. While a faster rate (in fact, an optimal rate) of convergence is desired—better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using k^\hat{k}—, the order of dependence on S|\mathscr{S}| is also important as it determines the the number of RFF features (i.e., mm) that are needed to achieve a given approximation accuracy. In fact, the order of dependence on S|\mathscr{S}| controls the rate at which S|\mathscr{S}| can be grown as a function of mm when mm\rightarrow\infty (see Remark 1(ii) for a detailed discussion about the significance of growing S|\mathscr{S}|). In the following section, we present an analogue of (4)—see Theorem 1—that provides optimal rates and has correct dependence on S|\mathscr{S}|.

Main results: approximation of k𝑘k

where h(d,S,σ):=322dlog(2S+1)+322dlog(σ+1)+162d[log(2S+1)]1h(d,|\mathscr{S}|,\sigma):=32\sqrt{2d\log(2|\mathscr{S}|+1)}+32\sqrt{2d\log(\sigma+1)}+16\sqrt{2d[\log(2|\mathscr{S}|+1)]^{-1}}.

Note that k^kS×S=supx,ySk^(x,y)k(x,y)=supgGΛmgΛg\|\hat{k}-k\|_{\mathscr{S}\times\mathscr{S}}=\sup_{\mathbf{x},\mathbf{y}\in\mathscr{S}}|\hat{k}(\mathbf{x},\mathbf{y})-k(\mathbf{x},\mathbf{y})|=\sup_{g\in\mathcal{G}}|\Lambda_{m}g-\Lambda g|, where G:={gx,y(ω)=cos(ωT(xy)):x,yS}\mathcal{G}:=\{g_{\mathbf{x},\mathbf{y}}(\bm{\omega})=\cos(\bm{\omega}^{T}(\mathbf{x}-\mathbf{y}))\,:\,\mathbf{x},\mathbf{y}\in\mathscr{S}\}, which means the object of interest is the suprema of an empirical process indexed by G\mathcal{G}. Instead of bounding supgGΛmgΛg\sup_{g\in\mathcal{G}}|\Lambda_{m}g-\Lambda g| by using Hoeffding’s inequality on a cover of G\mathcal{G} and then applying union bound as carried out in [rahimi07random, sutherland15error], we use the refined technique of applying concentration via McDiarmid’s inequality, followed by symmetrization and bound the Rademacher average by Dudley entropy bound. The result is obtained by carefully bounding the L2(Λm)L^{2}(\Lambda_{m})-covering number of G\mathcal{G}. The details are provided in Section B.1 of the supplementary material. ∎

Corollary 2 shows that k^kLr(S)=Oa.s.(m1/2S2d/rlogS)\|\hat{k}-k\|_{L^{r}(\mathscr{S})}=O_{a.s.}(m^{-1/2}|\mathscr{S}|^{2d/r}\sqrt{\log|\mathscr{S}|}) and therefore if Sm|\mathscr{S}_{m}|\rightarrow\infty as mm\rightarrow\infty, then consistency of k^\hat{k} in Lr(Sm)L^{r}(\mathscr{S}_{m})-norm is achieved as long as m1/2Sm2d/rlogSm0m^{-1/2}|\mathscr{S}_{m}|^{2d/r}\sqrt{\log|\mathscr{S}_{m}|}\rightarrow 0 as mm\rightarrow\infty. This means, in comparison to the uniform norm in Theorem 1 where Sm|\mathscr{S}_{m}| can grow exponential in mδm^{\delta} (δ<1\delta<1), Sm|\mathscr{S}_{m}| cannot grow faster than mr4d(logm)r4dθm^{\frac{r}{4d}}(\log m)^{-\frac{r}{4d}-\theta} (θ>0\theta>0) to achieve consistency in LrL^{r}-norm.

Instead of using Theorem 1 to obtain a bound on k^kLr(S)\|\hat{k}-k\|_{L^{r}(\mathscr{S})} (this bound may be weak as k^kLr(S)k^kS×Svol2/r(S)\|\hat{k}-k\|_{L^{r}(\mathscr{S})}\leq\|\hat{k}-k\|_{\mathscr{S}\times\mathscr{S}}\text{vol}^{2/r}(\mathscr{S}) for any 1r<1\leq r<\infty), a better bound (for 2r<2\leq r<\infty) can be obtained by directly bounding k^kLr(S)\|\hat{k}-k\|_{L^{r}(\mathscr{S})}, as shown in the following result.

where CrC^{\prime}_{r} is the Khintchine constant given by Cr=1C^{\prime}_{r}=1 for r(1,2]r\in(1,2] and Cr=2[Γ(r+12)/π]1rC^{\prime}_{r}=\sqrt{2}\left[\Gamma\left(\frac{r+1}{2}\right)/\sqrt{\pi}\right]^{\frac{1}{r}} for r[2,)r\in[2,\infty).

Approximation of kernel derivatives

In the previous section we focused on the approximation of the kernel function where we presented uniform and LrL^{r} convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow to preserve uniform and LrL^{r} convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and LrL^{r} convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of the kernel through finite dimensional random feature map is also important as it enables to speed up several interesting machine learning tasks that involve the derivatives of the kernel [zhou08derivative, shi10hermite, rosasco10regularization, rosasco13nonparametric, ying12learning, sriperumbudur14density], see for example the recent infinite dimensional exponential family fitting technique [strathmann15gradient], which implements this idea.

so that p,qk(x,y)\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{x},\mathbf{y}) can be approximated by replacing Λ\Lambda with Λm\Lambda_{m}, resulting in

where ϕp(u):=1m(ω1php(ω1Tu),,ωmphp(ωmTu),ω1ph3+p(ω1Tu),,ωmph3+p(ωmTu))\phi^{\mathbf{p}}(\mathbf{u}):=\frac{1}{\sqrt{m}}\left(\bm{\omega}_{1}^{\mathbf{p}}h_{|\mathbf{p}|}(\bm{\omega}_{1}^{T}\mathbf{u}),\cdots,\bm{\omega}_{m}^{\mathbf{p}}h_{|\mathbf{p}|}(\bm{\omega}_{m}^{T}\mathbf{u}),\bm{\omega}_{1}^{\mathbf{p}}h_{3+|\mathbf{p}|}(\bm{\omega}_{1}^{T}\mathbf{u}),\cdots,\bm{\omega}_{m}^{\mathbf{p}}h_{3+|\mathbf{p}|}(\bm{\omega}_{m}^{T}\mathbf{u})\right) and (ωj)j=1mi.i.d.Λ(\bm{\omega}_{j})^{m}_{j=1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\Lambda. Now the goal is to understand the behavior of sp,qp,qkS×S\|s^{\mathbf{p},\mathbf{q}}-\partial^{\mathbf{p},\mathbf{q}}k\|_{\mathscr{S}\times\mathscr{S}} and sp,qp,qkLr(S)\|s^{\mathbf{p},\mathbf{q}}-\partial^{\mathbf{p},\mathbf{q}}k\|_{L^{r}(\mathscr{S})} for r[1,)r\in[1,\infty), i.e., obtain analogues of Theorems 1 and 3.

U(p,q,S)=log(2ST2p,2q1/2+1)U(\mathbf{p},\mathbf{q},|\mathscr{S}|)=\log\left(2|\mathscr{S}|T^{-1/2}_{2\mathbf{p},2\mathbf{q}}+1\right).

(i) Note that Theorem 4 reduces to Theorem 1 if p=q=0\mathbf{p}=\mathbf{q}=0, in which case Tp,q=T2p,2q=1T_{\mathbf{p},\mathbf{q}}=T_{2\mathbf{p},2\mathbf{q}}=1. If p0\mathbf{p}\neq\mathbf{0} or q0\mathbf{q}\neq\mathbf{0}, then the boundedness of supp(Λ)\text{supp}(\Lambda) implies that Tp,q<T_{\mathbf{p},\mathbf{q}}<\infty and T2p,2q<T_{2\mathbf{p},2\mathbf{q}}<\infty. (ii) Growth of Sm|\mathscr{S}_{m}|: By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that p,qksp,qSm×Sma.s.0\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{\mathscr{S}_{m}\times\mathscr{S}_{m}}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0 if Sm=eo(m)|\mathscr{S}_{m}|=e^{o(m)} and p,qksp,qLr(Sm)a.s.0\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{L^{r}(\mathscr{S}_{m})}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0 if m1/2Sm2d/rlogSm0m^{-1/2}|\mathscr{S}_{m}|^{2d/r}\sqrt{\log|\mathscr{S}_{m}|}\rightarrow 0 (for 1r<1\leq r<\infty) as mm\rightarrow\infty. An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that supp(Λ)\text{supp}(\Lambda) is bounded and it can be shown that for r[2,)r\in[2,\infty), p,qksp,qLr(Sm)a.s.0\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{L^{r}(\mathscr{S}_{m})}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0 if Sm=o(mr4d)|\mathscr{S}_{m}|=o(m^{\frac{r}{4d}}). \blacksquare

The following result relaxes the boundedness of supp(Λ)\text{supp}(\Lambda) by imposing certain moment conditions on Λ\Lambda but at the expense of a worse rate. The proof relies on applying Bernstein inequality at the elements of a net (which exists by the compactness of S\mathscr{S}) combined with a union bound, and extending the approximation error from the anchors by a probabilistic Lipschitz argument.

where f(z;ω)=p,qk(z)ωp(ω)qhp+q(ωTz)f(\mathbf{z};\bm{\omega})=\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{z})-\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right). Define Fd:=ddd+1+d1d+1F_{d}:=d^{-\frac{d}{d+1}}+d^{\frac{1}{d+1}}.FdF_{d} is monotonically decreasing in dd, F1=2F_{1}=2. Then

Discussion

Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.

References

Appendix A Definitions & notation

We provide proofs of the results presented in Sections 3 and 4. Lemmas used in the proofs are enlisted in Section C.

Below we prove Theorem 4, thereby Theorem 1 (p=q=0\mathbf{p}=\mathbf{q}=\mathbf{0}). The idea of the proof is as follows: (i) We note that

G\mathcal{G} is a separable Carathéodory family: G\mathcal{G} is a separable Carathéodory family w.r.t. SΔ\mathscr{S}_{\Delta} since

zωp(ω)qhp+q(ωTz)\mathbf{z}\mapsto\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right) is continuous for all ωsupp(Λ)\bm{\omega}\in\text{supp}(\Lambda).

Concentration of ΛΛmG\left\|\Lambda-\Lambda_{m}\right\|_{\mathcal{G}} by its bounded difference property: By defining f(ω1,,ωm):=ΛΛmGf(\bm{\omega}_{1},\ldots,\bm{\omega}_{m}):=\left\|\Lambda-\Lambda_{m}\right\|_{\mathcal{G}}, we have that for i{1,,m}\forall i\in\{1,\ldots,m\},

Applying McDiarmid’s inequality (Lemma C.1) to ff, for any τ>0\tau>0, with probability at least 1eτ1-e^{-\tau} over the choice of (ωi)i=1mi.i.d.Λ(\bm{\omega}_{i})^{m}_{i=1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\Lambda,

Bounding R(G,ω1:m)\mathscr{R}\left(\mathcal{G},\bm{\omega}_{1:m}\right): Using Dudley’s entropy integral [bousquet03new, Equation 4.4], we have

The upper limit of the integral can be bounded as

Bounding N(G,L2(Λm),r)\mathcal{N}(\mathcal{G},L^{2}(\Lambda_{m}),r) by the compactness of SΔ\mathscr{S}_{\Delta}: For any gz1g_{\mathbf{z}_{1}}, gz2Gg_{\mathbf{z}_{2}}\in\mathcal{G},

By the mean value theorem, there exists c(0,1)c\in(0,1) such that

(B.6) shows that the existence of an ϵ\epsilon-net on (SΔ,2)\left(\mathscr{S}_{\Delta},\left\|\cdot\right\|_{2}\right) implies an r=ϵ1mj=1mωj2(p+q)ωj22r=\epsilon\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left|\bm{\omega}_{j}^{2(\mathbf{p}+\mathbf{q})}\right|\left\|\bm{\omega}_{j}\right\|_{2}^{2}}-net on (G,L2(Λm))(\mathcal{G},L^{2}(\Lambda_{m})). In other words,

by noting that SΔ2S|\mathscr{S}_{\Delta}|\leq 2|\mathscr{S}|. Using (B.5) and (B.7) in (B.4), we have

where in the last inequality we used the fact that r2T2p,2qr\leq 2\sqrt{T_{2\mathbf{p},2\mathbf{q}}}. By bounding 2SAp,q+T2p,2q(2S+T2p,2q)(Ap,q+1)2|\mathscr{S}|A_{\mathbf{p},\mathbf{q}}+\sqrt{T_{2\mathbf{p},2\mathbf{q}}}\leq(2|\mathscr{S}|+\sqrt{T_{2\mathbf{p},2\mathbf{q}}})(A_{\mathbf{p},\mathbf{q}}+1), (B.8) reduces to

where the last equality is obtained by changing the variable of integration and defining Bp,q:=2ST2p,2qB_{\mathbf{p},\mathbf{q}}:=\frac{2|\mathscr{S}|}{\sqrt{T_{2\mathbf{p},2\mathbf{q}}}}. By applying Lemma C.2 to bound the integral in (B.9), we obtain

Bounding the expectation of the Rademacher average: From (B.10), we have

Final bound: Combining (B.2), (B.3) and (B.11) yields the result.∎

B.2 Proof of Theorem 3

and therefore by McDiarmid’s inequality (Lemma C.1), for any τ>0\tau>0, with probability at least 1eτ1-e^{-\tau} over the choice of (ωi)i=1mΛ(\bm{\omega}_{i})^{m}_{i=1}\sim\Lambda, we have

One can rewrite the argument of this supremum by Eqs. (1)-(2) as

since Lr(S)L^{r}(\mathscr{S}) is of type min(2,r)\min(2,r) [lindenstrauss79classical, page 73] and there exists a universal constant CrC^{\prime}_{r} independent of S\mathscr{S} (the so-called Khintchine constant) [ledoux02probability, page 247] such that ()(*) holds; in addition we used

and 1min{2,r}=max{12,1r}\frac{1}{\min\{2,r\}}=\max\left\{\frac{1}{2},\frac{1}{r}\right\}.

Combining (B.12)–(B.16) and using the bound on vol(S)\text{vol}(\mathscr{S}) given in the proof of Corollary 2 yields the result.∎

B.3 Proof of Theorem 5

Below we give the detailed proof of Theorem 5. At high-level the proof goes as follows: (i) By the compactness of SΔ\mathscr{S}_{\Delta} (implied by that of S\mathscr{S}) one can take an rr-net covering SΔ\mathscr{S}_{\Delta} (for any r>0r>0). (ii) Small approximation error can be guaranteed at the centers of the rr-net by Bernstein’s inequality combined with a union bound. (iii) Propagation of the error from the centers to arbitrary points is achieved by Lipschitzness. (iv) The Lipschitz constant is, however, a random quantity and we show with high probability that it is ‘not too large’. (v) Union bounding the two events (small errors at the centers and small Lipschitz constant) leads to a uniform bound for arbitrary rr, which holds with high probability. (vi) Optimizing over rr gives the stated result.

Formally, the proof is as follows. Let us define

where f(z;ω)=p,qk(z)ωp(ω)qhp+q(ωTz)f(\mathbf{z};\bm{\omega})=\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{z})-\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right). Let us notice that since conv(SΔ)conv(\mathscr{S}_{\Delta}) is compact (by the compactness of SΔ\mathscr{S}_{\Delta}, implied by that of S\mathscr{S}) and zzf(z;ω)2\mathbf{z}\mapsto\left\|\nabla_{\mathbf{z}}f(\mathbf{z};\bm{\omega})\right\|_{2} is continuous, the supremum inside the expectation in Bp,q,SB_{\mathbf{p},\mathbf{q},\mathscr{S}} is finite for any ω\bm{\omega}.

Covering of SΔ\mathscr{S}_{\Delta}: By the compactness of SΔ\mathscr{S}_{\Delta} there exist an rr-net with at most

balls covering SΔ\mathscr{S}_{\Delta} [geer09empirical, Lemma 2.5, page 20], where we used that SΔ2S|\mathscr{S}_{\Delta}|\leq 2|\mathscr{S}|. Let us denote the centers of this rr-net by c1,,cN\mathbf{c}_{1},\ldots,\mathbf{c}_{N}.

Bounding fˉ(b;ω1:m)fˉ(a;ω1:m)\bar{f}(\mathbf{b};\bm{\omega}_{1:m})-\bar{f}(\mathbf{a};\bm{\omega}_{1:m}), where a,bSΔ\mathbf{a},\mathbf{b}\in\mathscr{S}_{\Delta}; ω1:m=(ω1,,ωm)\bm{\omega}_{1:m}=(\bm{\omega}_{1},\ldots,\bm{\omega}_{m}) is fixed: Let

zfˉ(z;ω1:m)\mathbf{z}\mapsto\bar{f}(\mathbf{z};\bm{\omega}_{1:m}) is continuously differentiable since ψ\psi is so. Thus by the mean value theorem t(0,1)\exists\,t\in(0,1) such that

Hence by the Cauchy-Bunyakovsky-Schwarz inequality, we get

where we used the compactness of conv(SΔ)conv(\mathscr{S}_{\Delta}) (implied by that of SΔ\mathscr{S}_{\Delta}) and the continuity of the zzfˉ(z;ω1:m)2\mathbf{z}\mapsto\left\|\nabla_{\mathbf{z}}\bar{f}(\mathbf{z};\bm{\omega}_{1:m})\right\|_{2} mapping to guarantee that L(ω1:m)L(\bm{\omega}_{1:m}) exists, and it is finite for any ω1:m\bm{\omega}_{1:m}.

Bound on Bp,q,SB_{\mathbf{p},\mathbf{q},\mathscr{S}}: Note that

By the homogenity of norms (av=av\left\|a\mathbf{v}\right\|=|a|\left\|\mathbf{v}\right\|), the chain rule, and ha(v)1|h_{a}(v)|\leq 1 (a\forall a, v\forall v)

Combining Eq. (B.20) and (B.21) results in the bound

Error propagation from the net centers: We will use the following note to propagate the error from the net centers (cj\mathbf{c}_{j}, j=1,,Nj=1,\ldots,N) to an arbitrary zSΔ\mathbf{z}\in\mathscr{S}_{\Delta} point. Note: If fˉ(cj;ω1:m)<ϵ2|\bar{f}(\mathbf{c}_{j};\bm{\omega}_{1:m})|<\frac{\epsilon}{2} (j\forall j) and L(ω1:m)<ϵ2rL(\bm{\omega}_{1:m})<\frac{\epsilon}{2r}, then

where we used (B.18) and our assumptions in the note, thereby yielding (B.23).

Guaranteeing the conditions of (B.23) with high probability:

Setting ϵ=2ησm\epsilon=\frac{2\eta\sigma}{\sqrt{m}}, (B.24) is written as

By union bounding (j=1,,Nj=1,\ldots,N), we get

Condition L(ω1:m)<ϵ2rL(\bm{\omega}_{1:m})<\frac{\epsilon}{2r}: Applying Markov’s inequality to L(ω1:m)L(\bm{\omega}_{1:m}) (note that L(ω1:m)L(\bm{\omega}_{1:m}) is non-negative), for any t>0t>0, we obtain

by invoking (B.19) and (B.22). Choosing t=ϵ2rt=\frac{\epsilon}{2r}, we have

Final bound for any r>0r>0: By (B.25) and (B.26), and substituting the explicit form of NN in (B.17), we get

Jensen’s inequality in (\dagger), c:=2d1emϵ28σ2(1+ϵL2σ2)c_{*}:=2^{d-1}e^{-\frac{m\epsilon^{2}}{8\sigma^{2}\left(1+\frac{\epsilon L}{2\sigma^{2}}\right)}}, κ1:=4dSdc\kappa_{1}:=4^{d}|\mathscr{S}|^{d}c_{*} and κ2=2ϵ(Dp,q,S+Ep,q)\kappa_{2}=\frac{2}{\epsilon}(D_{\mathbf{p},\mathbf{q},\mathscr{S}}+E_{\mathbf{p},\mathbf{q}}).

Matching the two terms to choose rr: Maximizing w.r.t. rr in (B.27)

we note that r=(dκ1κ2)1d+1r=\left(\frac{d\kappa_{1}}{\kappa_{2}}\right)^{\frac{1}{d+1}} maximizes it. Using this in (B.27), we have

where Fd:=ddd+1+d1d+1F_{d}:=d^{-\frac{d}{d+1}}+d^{\frac{1}{d+1}}.

B.4 Proof of bounded supp​(Λ)suppΛ\text{supp}(\Lambda) ⇒⇒\Rightarrow (7)

We prove that the boundedness of supp(Λ)\text{supp}(\Lambda) implies that of ff [see (B.28)], specifically (7).

Applying the triangle inequality and ha(v)1|h_{a}(v)|\leq 1 (a,v\forall a,\forall v) we have

K:=supssupp(Λ)sp+qK:=\sup_{\mathbf{s}\in\text{supp}(\Lambda)}|\mathbf{s}^{\mathbf{p}+\mathbf{q}}| is finite since supp(Λ)\text{supp}(\Lambda) is bounded, thus f(z;ω)|f(\mathbf{z};\bm{\omega})| is bounded.

Appendix C Supplementary results

In this section, we present some technical results that are used in the proofs.

By change of variables, we have 01logaϵdϵ=alogatetdt\int^{1}_{0}\sqrt{\log\frac{a}{\epsilon}}\,d\epsilon=a\int^{\infty}_{\log a}\sqrt{t}e^{-t}\,dt. Applying partial integration, we have

Note: the σ\sigma-algebra of Lebesgue measurable sets is typically not countably generated [bogachev07measure, page 106 (vol. I)].