Optimal Rates for Random Fourier Features

Bharath K. Sriperumbudur, Zoltan Szabo

Introduction

Kernel methods [scholkopf02learning] have enjoyed tremendous success in solving several fundamental problems of machine learning ranging from classification, regression, feature extraction, dependency estimation, causal discovery, Bayesian inference and hypothesis testing. Such a success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows to implicitly compute inner products between these high dimensional feature maps, $\lambda$ via a kernel function $k$ : $k(\mathbf{x},\mathbf{y})=\langle\lambda(\mathbf{x}),\lambda(\mathbf{y})\rangle$ . However, this flexibility and richness of kernels has a price: by resorting to implicit computations these methods operate on the Gram matrix of the data, which raises serious computational challenges while dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [Williams-01, drineas05nystrom, alaoui15fast], explicit feature maps designed for additive kernels [vedaldi12efficient, maji13efficient], hashing [shi09hash, kulis12kernelized], and random Fourier features (RFF) [rahimi07random] constructed for shift-invariant kernels, the focus of the current paper.

RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map $\lambda$ associated with the kernel, by appealing to Bochner’s theorem [wendland05scattered]—any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure—-[rahimi07random] proposed an explicit low-dimensional random Fourier feature map $\phi$ obtained by empirically approximating the Fourier integral so that $k(\mathbf{x},\mathbf{y})\approx\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle$ . The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby enabling to handle large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [rahimi07random]. Another advantage with the RFF approach is that unlike low rank matrix approximation approach [Williams-01, drineas05nystrom] which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful while dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas such as fast function-to-function regression [oliva15fast], differential privacy preserving [chaudhuri11differentially] and causal discovery [lopezpaz15towards].

Traditionally, kernel based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [zhou08derivative, shi10hermite], nonlinear variable selection [rosasco10regularization, rosasco13nonparametric], (multi-task) gradient learning [ying12learning] and fitting of distributions in an infinite-dimensional exponential family [sriperumbudur14density]. Given the importance of these derivative based kernel algorithms, similar to [rahimi07random], in Section 4, we propose a finite dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above mentioned derivative based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and $L^{r}$ -norms and show the rate of convergence to be $m^{-1/2}$ in both these cases.

A summary of our contributions are as follows. We

provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives.

prove uniform and $L^{r}$ convergence on fixed compacts sets with optimal rate in terms of the RFF dimension ( $m$ );

give sufficient conditions for the growth rate of compact sets while preserving a.s. convergence uniformly and in $L^{r}$ ; specializing our result we match the best attainable asymptotic growth rate.

Various notations and definitions that are used throughout the paper are provided in Section 2 along with a brief review of RFF approximation proposed by [rahimi07random]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.

Notations & preliminaries

where $\sigma^{2}:=\int\|\bm{\omega}\|^{2}\,d\Lambda(\bm{\omega})$ and $C_{d}:=2^{\frac{6d+2}{d+2}}\left(\left(\frac{2}{d}\right)^{\frac{d}{d+2}}+\left(\frac{d}{2}\right)^{\frac{2}{d+2}}\right)\leq 2^{7}d^{\frac{2}{d+2}}$ when $d\geq 2$ . The condition $\sigma^{2}<\infty$ implies that $\psi$ (and therefore $k$ ) is twice differentiable. From (3) it is clear that the probability has polynomial tails if $\epsilon<|\mathscr{S}|\sigma$ (i.e., small $\epsilon$ ) and Gaussian tails if $\epsilon\geq|\mathscr{S}|\sigma$ (i.e., large $\epsilon$ ) and can be equivalently written as

where $\alpha:=4d-C^{\frac{d+2}{d}}_{d}|\mathscr{S}|^{2}\sigma^{2}$ . For $|\mathscr{S}|$ sufficiently large (i.e., $\alpha<0$ ), it follows from (4) that

While (5) shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence (i.e., $\hat{k}$ convergences to $k$ uniformly over compact sets), the rate of convergence of $\sqrt{(\log m)/m}$ is not optimal. In addition, the order of dependence on $|\mathscr{S}|$ is not optimal. While a faster rate (in fact, an optimal rate) of convergence is desired—better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using $\hat{k}$ —, the order of dependence on $|\mathscr{S}|$ is also important as it determines the the number of RFF features (i.e., $m$ ) that are needed to achieve a given approximation accuracy. In fact, the order of dependence on $|\mathscr{S}|$ controls the rate at which $|\mathscr{S}|$ can be grown as a function of $m$ when $m\rightarrow\infty$ (see Remark 1(ii) for a detailed discussion about the significance of growing $|\mathscr{S}|$ ). In the following section, we present an analogue of (4)—see Theorem 1—that provides optimal rates and has correct dependence on $|\mathscr{S}|$ .

Main results: approximation of k𝑘k

where $h(d,|\mathscr{S}|,\sigma):=32\sqrt{2d\log(2|\mathscr{S}|+1)}+32\sqrt{2d\log(\sigma+1)}+16\sqrt{2d[\log(2|\mathscr{S}|+1)]^{-1}}$ .

Note that $\|\hat{k}-k\|_{\mathscr{S}\times\mathscr{S}}=\sup_{\mathbf{x},\mathbf{y}\in\mathscr{S}}|\hat{k}(\mathbf{x},\mathbf{y})-k(\mathbf{x},\mathbf{y})|=\sup_{g\in\mathcal{G}}|\Lambda_{m}g-\Lambda g|$ , where $\mathcal{G}:=\{g_{\mathbf{x},\mathbf{y}}(\bm{\omega})=\cos(\bm{\omega}^{T}(\mathbf{x}-\mathbf{y}))\,:\,\mathbf{x},\mathbf{y}\in\mathscr{S}\}$ , which means the object of interest is the suprema of an empirical process indexed by $\mathcal{G}$ . Instead of bounding $\sup_{g\in\mathcal{G}}|\Lambda_{m}g-\Lambda g|$ by using Hoeffding’s inequality on a cover of $\mathcal{G}$ and then applying union bound as carried out in [rahimi07random, sutherland15error], we use the refined technique of applying concentration via McDiarmid’s inequality, followed by symmetrization and bound the Rademacher average by Dudley entropy bound. The result is obtained by carefully bounding the $L^{2}(\Lambda_{m})$ -covering number of $\mathcal{G}$ . The details are provided in Section B.1 of the supplementary material. ∎

Corollary 2 shows that $\|\hat{k}-k\|_{L^{r}(\mathscr{S})}=O_{a.s.}(m^{-1/2}|\mathscr{S}|^{2d/r}\sqrt{\log|\mathscr{S}|})$ and therefore if $|\mathscr{S}_{m}|\rightarrow\infty$ as $m\rightarrow\infty$ , then consistency of $\hat{k}$ in $L^{r}(\mathscr{S}_{m})$ -norm is achieved as long as $m^{-1/2}|\mathscr{S}_{m}|^{2d/r}\sqrt{\log|\mathscr{S}_{m}|}\rightarrow 0$ as $m\rightarrow\infty$ . This means, in comparison to the uniform norm in Theorem 1 where $|\mathscr{S}_{m}|$ can grow exponential in $m^{\delta}$ ( $\delta<1$ ), $|\mathscr{S}_{m}|$ cannot grow faster than $m^{\frac{r}{4d}}(\log m)^{-\frac{r}{4d}-\theta}$ ( $\theta>0$ ) to achieve consistency in $L^{r}$ -norm.

Instead of using Theorem 1 to obtain a bound on $\|\hat{k}-k\|_{L^{r}(\mathscr{S})}$ (this bound may be weak as $\|\hat{k}-k\|_{L^{r}(\mathscr{S})}\leq\|\hat{k}-k\|_{\mathscr{S}\times\mathscr{S}}\text{vol}^{2/r}(\mathscr{S})$ for any $1\leq r<\infty$ ), a better bound (for $2\leq r<\infty$ ) can be obtained by directly bounding $\|\hat{k}-k\|_{L^{r}(\mathscr{S})}$ , as shown in the following result.

where $C^{\prime}_{r}$ is the Khintchine constant given by $C^{\prime}_{r}=1$ for $r\in(1,2]$ and $C^{\prime}_{r}=\sqrt{2}\left[\Gamma\left(\frac{r+1}{2}\right)/\sqrt{\pi}\right]^{\frac{1}{r}}$ for $r\in[2,\infty)$ .

Approximation of kernel derivatives

In the previous section we focused on the approximation of the kernel function where we presented uniform and $L^{r}$ convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow to preserve uniform and $L^{r}$ convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and $L^{r}$ convergence behavior of the proposed approximation. As motivated in Section 1, the question of approximating the derivatives of the kernel through finite dimensional random feature map is also important as it enables to speed up several interesting machine learning tasks that involve the derivatives of the kernel [zhou08derivative, shi10hermite, rosasco10regularization, rosasco13nonparametric, ying12learning, sriperumbudur14density], see for example the recent infinite dimensional exponential family fitting technique [strathmann15gradient], which implements this idea.

so that $\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{x},\mathbf{y})$ can be approximated by replacing $\Lambda$ with $\Lambda_{m}$ , resulting in

where $\phi^{\mathbf{p}}(\mathbf{u}):=\frac{1}{\sqrt{m}}\left(\bm{\omega}_{1}^{\mathbf{p}}h_{|\mathbf{p}|}(\bm{\omega}_{1}^{T}\mathbf{u}),\cdots,\bm{\omega}_{m}^{\mathbf{p}}h_{|\mathbf{p}|}(\bm{\omega}_{m}^{T}\mathbf{u}),\bm{\omega}_{1}^{\mathbf{p}}h_{3+|\mathbf{p}|}(\bm{\omega}_{1}^{T}\mathbf{u}),\cdots,\bm{\omega}_{m}^{\mathbf{p}}h_{3+|\mathbf{p}|}(\bm{\omega}_{m}^{T}\mathbf{u})\right)$ and $(\bm{\omega}_{j})^{m}_{j=1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\Lambda$ . Now the goal is to understand the behavior of $\|s^{\mathbf{p},\mathbf{q}}-\partial^{\mathbf{p},\mathbf{q}}k\|_{\mathscr{S}\times\mathscr{S}}$ and $\|s^{\mathbf{p},\mathbf{q}}-\partial^{\mathbf{p},\mathbf{q}}k\|_{L^{r}(\mathscr{S})}$ for $r\in[1,\infty)$ , i.e., obtain analogues of Theorems 1 and 3.

$U(\mathbf{p},\mathbf{q},|\mathscr{S}|)=\log\left(2|\mathscr{S}|T^{-1/2}_{2\mathbf{p},2\mathbf{q}}+1\right)$ .

(i) Note that Theorem 4 reduces to Theorem 1 if $\mathbf{p}=\mathbf{q}=0$ , in which case $T_{\mathbf{p},\mathbf{q}}=T_{2\mathbf{p},2\mathbf{q}}=1$ . If $\mathbf{p}\neq\mathbf{0}$ or $\mathbf{q}\neq\mathbf{0}$ , then the boundedness of $\text{supp}(\Lambda)$ implies that $T_{\mathbf{p},\mathbf{q}}<\infty$ and $T_{2\mathbf{p},2\mathbf{q}}<\infty$ . (ii) Growth of $|\mathscr{S}_{m}|$ : By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that $\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{\mathscr{S}_{m}\times\mathscr{S}_{m}}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0$ if $|\mathscr{S}_{m}|=e^{o(m)}$ and $\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{L^{r}(\mathscr{S}_{m})}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0$ if $m^{-1/2}|\mathscr{S}_{m}|^{2d/r}\sqrt{\log|\mathscr{S}_{m}|}\rightarrow 0$ (for $1\leq r<\infty$ ) as $m\rightarrow\infty$ . An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that $\text{supp}(\Lambda)$ is bounded and it can be shown that for $r\in[2,\infty)$ , $\|\partial^{\mathbf{p},\mathbf{q}}k-s^{\mathbf{p},\mathbf{q}}\|_{L^{r}(\mathscr{S}_{m})}\stackrel{{\scriptstyle a.s.}}{{\longrightarrow}}0$ if $|\mathscr{S}_{m}|=o(m^{\frac{r}{4d}})$ . $\blacksquare$

The following result relaxes the boundedness of $\text{supp}(\Lambda)$ by imposing certain moment conditions on $\Lambda$ but at the expense of a worse rate. The proof relies on applying Bernstein inequality at the elements of a net (which exists by the compactness of $\mathscr{S}$ ) combined with a union bound, and extending the approximation error from the anchors by a probabilistic Lipschitz argument.

where $f(\mathbf{z};\bm{\omega})=\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{z})-\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right)$ . Define $F_{d}:=d^{-\frac{d}{d+1}}+d^{\frac{1}{d+1}}$ . $F_{d}$ is monotonically decreasing in $d$ , $F_{1}=2$ . Then

Discussion

Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.

References

Appendix A Definitions & notation

We provide proofs of the results presented in Sections 3 and 4. Lemmas used in the proofs are enlisted in Section C.

Below we prove Theorem 4, thereby Theorem 1 ( $\mathbf{p}=\mathbf{q}=\mathbf{0}$ ). The idea of the proof is as follows: (i) We note that

$\mathcal{G}$ is a separable Carathéodory family: $\mathcal{G}$ is a separable Carathéodory family w.r.t. $\mathscr{S}_{\Delta}$ since

$\mathbf{z}\mapsto\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right)$ is continuous for all $\bm{\omega}\in\text{supp}(\Lambda)$ .

Concentration of $\left\|\Lambda-\Lambda_{m}\right\|_{\mathcal{G}}$ by its bounded difference property: By defining $f(\bm{\omega}_{1},\ldots,\bm{\omega}_{m}):=\left\|\Lambda-\Lambda_{m}\right\|_{\mathcal{G}}$ , we have that for $\forall i\in\{1,\ldots,m\}$ ,

Applying McDiarmid’s inequality (Lemma C.1) to $f$ , for any $\tau>0$ , with probability at least $1-e^{-\tau}$ over the choice of $(\bm{\omega}_{i})^{m}_{i=1}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\Lambda$ ,

Bounding $\mathscr{R}\left(\mathcal{G},\bm{\omega}_{1:m}\right)$ : Using Dudley’s entropy integral [bousquet03new, Equation 4.4], we have

The upper limit of the integral can be bounded as

Bounding $\mathcal{N}(\mathcal{G},L^{2}(\Lambda_{m}),r)$ by the compactness of $\mathscr{S}_{\Delta}$ : For any $g_{\mathbf{z}_{1}}$ , $g_{\mathbf{z}_{2}}\in\mathcal{G}$ ,

By the mean value theorem, there exists $c\in(0,1)$ such that

(B.6) shows that the existence of an $\epsilon$ -net on $\left(\mathscr{S}_{\Delta},\left\|\cdot\right\|_{2}\right)$ implies an $r=\epsilon\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left|\bm{\omega}_{j}^{2(\mathbf{p}+\mathbf{q})}\right|\left\|\bm{\omega}_{j}\right\|_{2}^{2}}$ -net on $(\mathcal{G},L^{2}(\Lambda_{m}))$ . In other words,

by noting that $|\mathscr{S}_{\Delta}|\leq 2|\mathscr{S}|$ . Using (B.5) and (B.7) in (B.4), we have

where in the last inequality we used the fact that $r\leq 2\sqrt{T_{2\mathbf{p},2\mathbf{q}}}$ . By bounding $2|\mathscr{S}|A_{\mathbf{p},\mathbf{q}}+\sqrt{T_{2\mathbf{p},2\mathbf{q}}}\leq(2|\mathscr{S}|+\sqrt{T_{2\mathbf{p},2\mathbf{q}}})(A_{\mathbf{p},\mathbf{q}}+1)$ , (B.8) reduces to

where the last equality is obtained by changing the variable of integration and defining $B_{\mathbf{p},\mathbf{q}}:=\frac{2|\mathscr{S}|}{\sqrt{T_{2\mathbf{p},2\mathbf{q}}}}$ . By applying Lemma C.2 to bound the integral in (B.9), we obtain

Bounding the expectation of the Rademacher average: From (B.10), we have

Final bound: Combining (B.2), (B.3) and (B.11) yields the result.∎

B.2 Proof of Theorem 3

and therefore by McDiarmid’s inequality (Lemma C.1), for any $\tau>0$ , with probability at least $1-e^{-\tau}$ over the choice of $(\bm{\omega}_{i})^{m}_{i=1}\sim\Lambda$ , we have

One can rewrite the argument of this supremum by Eqs. (1)-(2) as

since $L^{r}(\mathscr{S})$ is of type $\min(2,r)$ [lindenstrauss79classical, page 73] and there exists a universal constant $C^{\prime}_{r}$ independent of $\mathscr{S}$ (the so-called Khintchine constant) [ledoux02probability, page 247] such that $(*)$ holds; in addition we used

and $\frac{1}{\min\{2,r\}}=\max\left\{\frac{1}{2},\frac{1}{r}\right\}$ .

Combining (B.12)–(B.16) and using the bound on $\text{vol}(\mathscr{S})$ given in the proof of Corollary 2 yields the result.∎

B.3 Proof of Theorem 5

Below we give the detailed proof of Theorem 5. At high-level the proof goes as follows: (i) By the compactness of $\mathscr{S}_{\Delta}$ (implied by that of $\mathscr{S}$ ) one can take an $r$ -net covering $\mathscr{S}_{\Delta}$ (for any $r>0$ ). (ii) Small approximation error can be guaranteed at the centers of the $r$ -net by Bernstein’s inequality combined with a union bound. (iii) Propagation of the error from the centers to arbitrary points is achieved by Lipschitzness. (iv) The Lipschitz constant is, however, a random quantity and we show with high probability that it is ‘not too large’. (v) Union bounding the two events (small errors at the centers and small Lipschitz constant) leads to a uniform bound for arbitrary $r$ , which holds with high probability. (vi) Optimizing over $r$ gives the stated result.

Formally, the proof is as follows. Let us define

where $f(\mathbf{z};\bm{\omega})=\partial^{\mathbf{p},\mathbf{q}}k(\mathbf{z})-\bm{\omega}^{\mathbf{p}}(-\bm{\omega})^{\mathbf{q}}h_{|\mathbf{p}+\mathbf{q}|}\left(\bm{\omega}^{T}\mathbf{z}\right)$ . Let us notice that since $conv(\mathscr{S}_{\Delta})$ is compact (by the compactness of $\mathscr{S}_{\Delta}$ , implied by that of $\mathscr{S}$ ) and $\mathbf{z}\mapsto\left\|\nabla_{\mathbf{z}}f(\mathbf{z};\bm{\omega})\right\|_{2}$ is continuous, the supremum inside the expectation in $B_{\mathbf{p},\mathbf{q},\mathscr{S}}$ is finite for any $\bm{\omega}$ .

Covering of $\mathscr{S}_{\Delta}$ : By the compactness of $\mathscr{S}_{\Delta}$ there exist an $r$ -net with at most

balls covering $\mathscr{S}_{\Delta}$ [geer09empirical, Lemma 2.5, page 20], where we used that $|\mathscr{S}_{\Delta}|\leq 2|\mathscr{S}|$ . Let us denote the centers of this $r$ -net by $\mathbf{c}_{1},\ldots,\mathbf{c}_{N}$ .

Bounding $\bar{f}(\mathbf{b};\bm{\omega}_{1:m})-\bar{f}(\mathbf{a};\bm{\omega}_{1:m})$ , where $\mathbf{a},\mathbf{b}\in\mathscr{S}_{\Delta}$ ; $\bm{\omega}_{1:m}=(\bm{\omega}_{1},\ldots,\bm{\omega}_{m})$ is fixed: Let

$\mathbf{z}\mapsto\bar{f}(\mathbf{z};\bm{\omega}_{1:m})$ is continuously differentiable since $\psi$ is so. Thus by the mean value theorem $\exists\,t\in(0,1)$ such that

Hence by the Cauchy-Bunyakovsky-Schwarz inequality, we get

where we used the compactness of $conv(\mathscr{S}_{\Delta})$ (implied by that of $\mathscr{S}_{\Delta}$ ) and the continuity of the $\mathbf{z}\mapsto\left\|\nabla_{\mathbf{z}}\bar{f}(\mathbf{z};\bm{\omega}_{1:m})\right\|_{2}$ mapping to guarantee that $L(\bm{\omega}_{1:m})$ exists, and it is finite for any $\bm{\omega}_{1:m}$ .

Bound on $B_{\mathbf{p},\mathbf{q},\mathscr{S}}$ : Note that

By the homogenity of norms ( $\left\|a\mathbf{v}\right\|=|a|\left\|\mathbf{v}\right\|$ ), the chain rule, and $|h_{a}(v)|\leq 1$ ( $\forall a$ , $\forall v$ )

Combining Eq. (B.20) and (B.21) results in the bound

Error propagation from the net centers: We will use the following note to propagate the error from the net centers ( $\mathbf{c}_{j}$ , $j=1,\ldots,N$ ) to an arbitrary $\mathbf{z}\in\mathscr{S}_{\Delta}$ point. Note: If $|\bar{f}(\mathbf{c}_{j};\bm{\omega}_{1:m})|<\frac{\epsilon}{2}$ ( $\forall j$ ) and $L(\bm{\omega}_{1:m})<\frac{\epsilon}{2r}$ , then

where we used (B.18) and our assumptions in the note, thereby yielding (B.23).

Guaranteeing the conditions of (B.23) with high probability:

Setting $\epsilon=\frac{2\eta\sigma}{\sqrt{m}}$ , (B.24) is written as

By union bounding ( $j=1,\ldots,N$ ), we get

Condition $L(\bm{\omega}_{1:m})<\frac{\epsilon}{2r}$ : Applying Markov’s inequality to $L(\bm{\omega}_{1:m})$ (note that $L(\bm{\omega}_{1:m})$ is non-negative), for any $t>0$ , we obtain

by invoking (B.19) and (B.22). Choosing $t=\frac{\epsilon}{2r}$ , we have

Final bound for any $r>0$ : By (B.25) and (B.26), and substituting the explicit form of $N$ in (B.17), we get

Jensen’s inequality in ( $\dagger$ ), $c_{*}:=2^{d-1}e^{-\frac{m\epsilon^{2}}{8\sigma^{2}\left(1+\frac{\epsilon L}{2\sigma^{2}}\right)}}$ , $\kappa_{1}:=4^{d}|\mathscr{S}|^{d}c_{*}$ and $\kappa_{2}=\frac{2}{\epsilon}(D_{\mathbf{p},\mathbf{q},\mathscr{S}}+E_{\mathbf{p},\mathbf{q}})$ .

Matching the two terms to choose $r$ : Maximizing w.r.t. $r$ in (B.27)

we note that $r=\left(\frac{d\kappa_{1}}{\kappa_{2}}\right)^{\frac{1}{d+1}}$ maximizes it. Using this in (B.27), we have

where $F_{d}:=d^{-\frac{d}{d+1}}+d^{\frac{1}{d+1}}$ .

B.4 Proof of bounded supp(Λ)suppΛ\text{supp}(\Lambda) ⇒⇒\Rightarrow (7)

We prove that the boundedness of $\text{supp}(\Lambda)$ implies that of $f$ [see (B.28)], specifically (7).

Applying the triangle inequality and $|h_{a}(v)|\leq 1$ ( $\forall a,\forall v$ ) we have

$K:=\sup_{\mathbf{s}\in\text{supp}(\Lambda)}|\mathbf{s}^{\mathbf{p}+\mathbf{q}}|$ is finite since $\text{supp}(\Lambda)$ is bounded, thus $|f(\mathbf{z};\bm{\omega})|$ is bounded.

Appendix C Supplementary results

In this section, we present some technical results that are used in the proofs.

By change of variables, we have $\int^{1}_{0}\sqrt{\log\frac{a}{\epsilon}}\,d\epsilon=a\int^{\infty}_{\log a}\sqrt{t}e^{-t}\,dt$ . Applying partial integration, we have

Note: the $\sigma$ -algebra of Lebesgue measurable sets is typically not countably generated [bogachev07measure, page 106 (vol. I)].