Generalization Properties of Learning with Random Features

Alessandro Rudi, Lorenzo Rosasco

Introduction

Supervised learning is a basic machine learning problem where the goal is estimating a function from random noisy samples . The function to be learned is fixed, but unknown, and flexible non-parametric models are needed for good results. A general class of models is based on functions of the form,

Most popular approaches to tackle these limitations are randomized and include sampling the centers at random, either in a data-dependent or in a data-independent way. Notable examples include Nyström and random features approaches. Given random centers, computations still reduce to convex optimization with potential big memory gains, provided that the centers are fewer than the data-points. In practice, the choice of the number of centers is based on heuristics or memory constraints, and the question arises of characterizing theoretically which choices provide optimal learning bounds. Answering this question allows to understand the statistical and computational trade-offs in using these randomized approximations. For Nyström methods, partial results in this direction were derived for example in and improved in , but only for a simplified setting where the input points are fixed. Results in the statistical learning setting were given in for ridge regression, showing in particular that O(nlogn)O(\sqrt{n}\log n) random centers uniformly sampled from nn training points suffices to yield O(1/n)O(1/\sqrt{n}) learning bounds, the same as full kernel ridge regression.

A question motivating our study is whether similar results hold for random features approaches. While several papers consider the properties of random features for approximating the kernel function, see and references therein, fewer results consider their generalization properties.

Several papers considered the properties of random features for approximating the kernel function, see and references therein, an interesting line of research with connections to sketching and non-linear (one-bit) compressed sensing . However, only a few results consider the generalization properties of learning with random features.

An exception is one of the original random features papers, which provides learning bounds for a general class of loss functions . These results show that O(n)O({n}) random features are needed for O(1/n)O(1/\sqrt{n}) learning bounds and choosing less random features leads to worse bounds. In other words, these results suggest that that computational gains come at the expense of learning accuracy. Later results, see e.g. , essentially confirm these considerations, albeit the analysis in suggests that fewer random features could suffice if sampled in a problem dependent way.

In this paper, we focus on the least squares loss, considering random features within a ridge regression approach. Our main result shows, under standard assumptions, that the estimator obtained with a number of random features proportional to O(nlogn)O(\sqrt{n}\log n) achieves O(1/n)O(1/\sqrt{n}) learning error, that is the same prediction accuracy of the exact kernel ridge regression estimator. In other words, there are problems for which random features can allow to drastically reduce computational costs without any loss of prediction accuracy. To the best of our knowledge this is the first result showing that such an effect is possible. Our study improves on previous results by taking advantage of analytic and probabilistic results developed to provide sharp analyses of kernel ridge regression. We further present a second set of more refined results deriving fast convergence rates. We show that indeed fast rates are possible, but, depending on the problem at hand, a larger number of features might be needed. We then discuss how the requirement on the number of random features can be weakened at the expense of typically more complex sampling schemes. Indeed, in this latter case either some knowledge of the data-generating distribution or some potentially data-driven sampling scheme is needed. For this latter case, we borrow and extend ideas from and inspired from the theory of statical leverage scores . Theoretical findings are complemented by numerical simulation validating the bounds.

The rest of the paper is organized as follows. In Section 2, we review relevant results on learning with kernels, least squares and learning with random features. In Section 3, we present and discuss our main results, while proofs are deferred to the appendix. Finally, numerical experiments are presented in Section 4.

Learning with random features and ridge regression

We begin recalling basics ideas in kernel methods and their approximation via random features.

where X^\widehat{X} is the nn by DD data matrix. In this case, the complexity becomes O(nD)O(nD) in space, and O(nD2+D3)O(nD^{2}+D^{3}) in time. Beyond the linear case, the above reasoning extends to inner product kernels

The basic idea of random features is to relax Eq. (4) assuming it holds only approximately,

Clearly, if one such approximation exists the approach described in the previous section can still be used. A first question is then for which kernels an approximation of the form (5) can be derived. A simple manipulation of the Gaussian kernel provides one basic example.

If we write the Gaussian kernel as K(x,x)=G(xx)K(x,x^{\prime})=G(x-x^{\prime}), with G(z)=e12σ2z2G(z)=e^{-\frac{1}{2\sigma^{2}}\lVert{z}\rVert^{2}}, for a σ>0\sigma>0, then since the inverse Fourier transform of GG is a Gaussian, and using a basic symmetry argument, it is easy to show that

where ZZ is a normalizing factor. Then, the Gaussian kernel has an approximation of the form (5) with ϕM(x)=M1/2 (2cos(w1x+b1),,2cos(wMx+bM)),\phi_{M}(x)=M^{-1/2}~{}(\sqrt{2}\cos(w_{1}^{\top}x+b_{1}),\dots,\sqrt{2}\cos(w_{M}^{\top}x+b_{M})), and w1,,wMw_{1},\dots,w_{M} and b1,,bMb_{1},\dots,b_{M} sampled independently from 1Zeσ2w2/2\frac{1}{Z}e^{-\sigma^{2}\lVert{w}\rVert^{2}/2} and uniformly in [0,2π][0,2\pi], respectively.

The above example can be abstracted to a general strategy. Assume the kernel KK to have an integral representation,

We note that specific examples of random features can be seen as form of sketching . This latter term typically refers to reducing data dimensionality by random projection, e.g. considering

where ωN(0,I)\omega\sim N(0,I) (or suitable bounded measures). From a random feature perspective, we are defining an approximation of the linear kernel since

More general non-linear sketching can also be considered. For example in one-bit compressed sensing the following random features are relevant,

with wN(0,I)w\sim N(0,I) and sign(a)=1\textrm{sign}(a)=1 if a>0a>0 and 1-1 otherwise. Deriving the corresponding kernel is more involved and we refer to (see Section E in the appendixes).

Back to supervised learning, combining random features with ridge regression leads to,

for λ>0\lambda>0, S^M:=n1/2 (ϕM(x1),,ϕM(xn))\widehat{S}_{M}^{\top}:=n^{-1/2}~{}(\phi_{M}(x_{1}),\dots,\phi_{M}(x_{n})) and y^:=n1/2 (y1,,yn)\widehat{y}:=n^{-1/2}~{}(y_{1},\dots,y_{n}).

Then, random features can be used to reduce the computational costs of full kernel ridge regression as soon as MnM\ll n (see Sec. 2). However, since random features rely on an approximation (5), the question is whether there is a loss of prediction accuracy. This is the question we analyze in the rest of the paper.

Main Results

In this section, we present our main results characterizing the generalization properties of random features with ridge regression. We begin considering a basic setting and then discuss fast learning rates and the possible benefits of problem dependent sampling schemes.

since this implies that ff will generalize/predict well new data. Since we consider estimators of the form (2), (7) we are potentially restricting the space of possible solutions. Indeed, estimators of this form can be naturally related to the so called reproducing kernel Hilbert space (RKHS) corresponding to the PD kernel KK. Recall that, the latter is the function space H\mathcal{H} defined as as the completion of the linear span of {K(x,) : xX}\{K(x,\cdot)~{}:~{}x\in X\} with respect to the inner product K(x,),K(x,):=K(x,x)\left\langle{K(x,\cdot)},{K(x^{\prime},\cdot)}\right\rangle:=K(x,x^{\prime}) . In this view, the best possible solution is fHf_{\mathcal{H}} solving

We will assume throughout that fHf_{\mathcal{H}} exists. We add one technical remark useful in the following.

Existence of fHf_{\mathcal{H}} is not ensured, since we consider a potentially infinite dimensional RKHS H\mathcal{H}, possibly universal . The situation is different if H\mathcal{H} is replaced by HR={fH : fR},\mathcal{H}_{R}=\{f\in\mathcal{H}~{}:~{}\lVert{f}\rVert\leq R\}, with RR fixed a priori. In this case a minimizer of risk E\cal E always exists, but RR needs to be fixed a priori and HR\mathcal{H}_{R} can’t be universal. Clearly, assuming fHf_{\mathcal{H}} to exist, implies it belongs to a ball of radius Rρ,HR_{\rho,\mathcal{H}}. However, our results do not require prior knowledge of Rρ,HR_{\rho,\mathcal{H}} and hold uniformly over all finite radii.

The following is our first result on the learning properties of random features with ridge regression.

Assume that KK is a kernel with an integral representation (6). Assume ψ\psi continuous, such that ψ(x,ω)κ|\psi(x,\omega)|\leq\kappa almost surely, with κ[1,)\kappa\in[1,\infty) and yb|y|\leq b almost surely, with b>0b>0. Let δ(0,1]\delta\in(0,1]. If nn0n\geq n_{0} and λn=n1/2\lambda_{n}=n^{-1/2}, then a number of random features MnM_{n} equal to

is enough to guarantee, with probability at least 1δ1-\delta, that

In particular the constants c0,c1c_{0},c_{1} do not depend on n,λ,δn,\lambda,\delta, and n0n_{0} does not depends on n,λ,fH,ρn,\lambda,f_{\mathcal{H}},\rho.

The above result is presented with some simplifications (e.g. the assumption of bounded output) for sake of presentation, while it is proved and presented in full generality in the Appendix. In particular, the values of all the constants are given explicitly. Here, we make a few comments. The learning bound is the same achieved by the exact kernel ridge regression estimator (2) choosing λ=n1/2\lambda=n^{-1/2}, see e.g. . The theorem derives a bound in a worst case situation, where no assumption is made besides existence of fHf_{\mathcal{H}}, and is optimal in a minmax sense . This means that, in this setting, as soon as the number of features is order nlogn\sqrt{n}\log n, the corresponding ridge regression estimator has optimal generalization properties. This is remarkable considering the corresponding gain from a computational perspective: from roughly O(n3)O(n^{3}) and O(n2)O(n^{2}) in time and space for kernel ridge regression to O(n2)O(n^{2}) and O(nn)O(n\sqrt{n}) for ridge regression with random features (see Section 2). Consider that taking δ1/n2\delta\propto 1/n^{2} changes only the constants and allows to derive bounds in expectation and almost sure convergence (see Cor. 1 in the appendix, for the result in expectation). The above result shows that there is a whole set of problems where computational gains are achieved without having to trade-off statistical accuracy. In the next sections we consider what happens under more benign assumptions, which are standard, but also somewhat more technical. We first compare with previous works since the above setting is the one more closely related.

This is one of the original random features paper and considers the question of generalization properties. In particular they study the estimator

rather than a RKHS, where RR is fixed a priori. The best possible solution is fGRf^{*}_{\mathcal{G}_{R}} solving minfGRE(f),\min_{f\in\mathcal{G}_{R}}{\cal E}(f), and the main result in provides the bound

This is the first and still one the main results providing a statistical analysis for an estimator based on random features for a wide class of loss functions. There are a few elements of comparison with the result in this paper, but the main one is that to get O(1/n)O(1/\sqrt{n}) learning bounds, the above result requires O(n)O(n) random features, while a smaller number leads to worse bounds. This shows the main novelty of our analysis. Indeed we prove that, considering the square loss, fewer random features are sufficient, hence allowing computational gains without loss of accuracy. We add a few more tehcnical comments explaining : 1) how the setting we consider covers a wider range of problems, and 2) why the bounds we obtain are sharper. First, note that the functional setting in our paper is more general in the following sense. It is easy to see that considering the RKHS H\mathcal{H} is equivalent to consider H2={ψ(,ω)β(ω)dπ(ω) | β(ω)2dπ(ω)<}\mathcal{H}_{2}=\left\{\int\psi(\cdot,\omega)\beta(\omega)d\pi(\omega)~{}\middle|~{}\int|\beta(\omega)|^{2}d\pi(\omega)<\infty\right\} and the following inclusions hold GRGH2\mathcal{G}_{R}\subset\mathcal{G}_{\infty}\subset\mathcal{H}_{2}. Clearly, assuming a minimizer of the expected risk to exists in H2\mathcal{H}_{2} does not imply it belongs to G\mathcal{G}_{\infty} or GR\mathcal{G}_{R}, while the converse is true. In this view, our results cover a wider range of problems. Second, note that, this gap is not easy to bridge. Indeed, even if we were to consider G\mathcal{G}_{\infty} in place of GR\mathcal{G}_{R}, the results in could be used to derive the bound

where A(R):=E(fGR)E(fG)A(R):={\cal E}(f^{*}_{\mathcal{G}_{R}})-{\cal E}(f^{*}_{\mathcal{G}_{\infty}}) and fGf^{*}_{\mathcal{G}_{\infty}} is a minimizer of the expected risk on G\mathcal{G}_{\infty}. In this case we would have to balance the various terms in (11), which would lead to a worse bound. For example, we could consider R:=lognR:=\log n, obtaining a bound n1/2lognn^{-1/2}\log n with an extra logarithmic term, but the result would hold only for nn larger than a number of examples n0n_{0} at least exponential with respect to the norm of ff_{\infty}. Moreover, to derive results uniform with respect to ff_{\infty}, we would have to keep into account the decay rate of A(R)A(R) and this would get bounds slower than n1/2n^{-1/2}.

Several other papers study the generalization properties of random features, see and references therein. For example, generalization bounds are derived in from very general arguments. However, the corresponding generalization bound requires a number of random features much larger than the number of training examples to give O(1/n)O(1/\sqrt{n}) bounds. The basic results in are analogous to those in with the set GR\mathcal{G}_{R} replaced by HR\mathcal{H}_{R}. These results are closer, albeit more restrictive then ours (see Remark 8) and especially like the bounds in suggest O(n)O(n) random features are needed for O(1/n)O(1/\sqrt{n}) learning bounds. A novelty in is the introduction of more complex problem dependent sampling that can reduce the number of random features. In Section 3.3, we show that using possibly-data dependent random features can lead to rates much faster than n1/2n^{-1/2}, and using much less than n\sqrt{n} features.

2 Refined Results: Fast Learning Rates

Faster rates can be achieved under favorable conditions. Such conditions for kernel ridge regression are standard, but somewhat technical. Roughly speaking they characterize the “size” of the considered RKHS and the regularity of fHf_{\mathcal{H}}. The key quantity needed to make this precise is the integral operator defined by the kernel KK and the marginal distribution ρX{\rho_{{X}}} of ρ\rho on X{X}, that is

For λ>0\lambda>0, let the effective dimension be defined as N(λ):=Tr((L+λI)1L),{\cal N}(\lambda):=\operatorname{Tr}\left((L+\lambda I)^{-1}L\right), and assume, there exists Q>0Q>0 and γ\gamma\in such that,

Moreover, assume there exists r1/2r\geq 1/2 and gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})} such that

Let δ(0,1]\delta\in(0,1]. Under Asm. 1 and the same assumptions of Thm. 1, if nn0n\geq n_{0}, and λn=n12r+γ\lambda_{n}=n^{-\frac{1}{2r+\gamma}}, then a number of random features MM equal to

is enough to guarantee, with probability at least 1δ1-\delta, that

for r1r\leq 1, and where c0,c1c_{0},c_{1} do not depend on n,τn,\tau, while n0n_{0} does not depends on n,fH,ρn,f_{\mathcal{H}},\rho.

The above bound is the same as the one obtained by the full kernel ridge regression estimator and is optimal in a minimax sense . For large rr and small γ\gamma it approaches a O(1/n)O(1/n) bound. When γ=1\gamma=1 and r=1/2r=1/2 the worst case bound of the previous section is recovered. Interestingly, the number of random features in different regimes is typically smaller than nn but can be larger than O(n)O(\sqrt{n}). Figure. 1 provides a pictorial representation of the number of random features needed for optimal rates in different regimes. In particular MnM\ll n random features are enough when γ>0\gamma>0 and r>1/2r>1/2. For example for r=1,γ=0r=1,\gamma=0 (higher regularity/sparsity and a small RKHS) O(n)O(\sqrt{n}) are sufficient to get a rate O(1/n)O(1/n). But, for example, if r=1/2,γ=0r=1/2,\gamma=0 (not too much regularity/sparsity but a small RKHS) O(n)O(n) are needed for O(1/n)O(1/n) error. The proof suggests that this effect can be a byproduct of sampling features in a data-independent way. Indeed, in the next section we show how much fewer features can be used considering problem dependent sampling schemes.

3 Refined Results: Beyond uniform sampling

We show next that fast learning rates can be achieved with fewer random features if they are somewhat compatible with the data distribution. This is made precise by the following condition.

Define the maximum random features dimension as

Assume there exists α\alpha\in, and F>0F>0 such that F(λ)Fλα,λ>0.{\cal F}_{\infty}(\lambda)\leq F\lambda^{-\alpha},\quad\forall\lambda>0.

The above assumption is abstract and we comment on it before showing how it affects the results. The maximum random features dimension (14) relates the random features to the data-generating distribution through the operator LL. It is always satisfied for α=1\alpha=1 ands F=κ2F=\kappa^{2}. e.g. considering any random feature satisfying (6). The favorable situation corresponds to random features such that case α=γ\alpha=\gamma. The following theoretical construction borrowed from gives an example.

Assume KK is a kernel with an integral representation (6). For s(ω)=(L+λI)1/2ψ(,ω)ρX2s(\omega)=\lVert{(L+\lambda I)^{-1/2}\psi(\cdot,\omega)}\rVert_{\rho_{{X}}}^{-2} and Cs:=1s(ω)dπ(ω)C_{s}:=\int\frac{1}{s(\omega)}d\pi(\omega), consider the random features ψs(x,ω)=ψ(x,ω)Css(ω),\psi_{s}(x,\omega)=\psi(x,\omega)\sqrt{C_{s}s(\omega)}, with distribution πs(ω):=π(ω)Css(ω)\pi_{s}(\omega):=\frac{\pi(\omega)}{C_{s}s(\omega)}. We show in the Appendix that these random features provide an integral representation of KK and satisfy Asm. 2 with α=γ\alpha=\gamma.

We next show how random features satisfying Asm. 2 can lead to better resuts.

Let δ(0,1]\delta\in(0,1]. Under Asm. 2 and the same assumptions of Thm. 1, 2, if nn0n\geq n_{0}, and λn=n12r+γ\lambda_{n}=n^{-\frac{1}{2r+\gamma}}, then a number of random features MnM_{n} equal to

is enough to guarantee, with probability at least 1δ1-\delta, that

where c0,c1c_{0},c_{1} do not depend on n,τn,\tau, while n0n_{0} does not depends on n,fH,ρn,f_{\mathcal{H}},\rho.

The above learning bound is the same as Thm. 2, but the number of random features is given by a more complex expression depending on α\alpha. In particular, in the slow O(1/n)O(1/\sqrt{n}) rates scenario, that is r=1/2r=1/2, γ=1\gamma=1, we see that O(nα/2)O(n^{\alpha/2}) are needed, recovering O(n)O(\sqrt{n}), since γα1\gamma\leq\alpha\leq 1. On the contrary, for a small RKHS, that is γ=0\gamma=0 and random features with α=γ\alpha=\gamma, a constant (!) number of feature is sufficient. A similar trend is seen considering fast rates. For γ>0\gamma>0 and r>1/2r>1/2, if α<1\alpha<1 then the number of random features is always smaller, and potentially much smaller, then the number of random features sampled in a problem independent way, that is α=1\alpha=1. For γ=0\gamma=0 and r=1/2r=1/2, the number of number of features is O(nα)O(n^{\alpha}) and can be again just constant if α=γ\alpha=\gamma. Figure 1 depicts the number of random features required if α=γ\alpha=\gamma. The above result shows the potentially dramatic effect of problem dependent random features. However the construction in Ex. 2 is theoretical. We comment on this in the next remark.

This question was recently considered in and our results offer new insights. In particular, recalling the results in , we see that in the slow rate setting there is essentially no difference between random features and Nyström approaches, neither from a statistical nor from a computational point of view. In the case of fast rates, Nyström methods with uniform sampling requires O(n12r+γ)O(n^{-\frac{1}{2r+\gamma}}) random centers, which compared to Thm. 2, suggests Nyström methods can be advantageous in this regime. While problem dependent random features provide a further improvement, it should be compared with the number of centers needed for Nyström with leverage scores, which is O(nγ2r+γ)O(n^{-\frac{\gamma}{2r+\gamma}}) and hence again better, see Thm. 3. In summary, both random features and Nyström methods achieve optimal statistical guarantees while reducing computations. They are essentially the same in the worst case, while Nyström can be better for benign problems. Finally we add a few words about the main steps in the proof.

The proofs are quite technical and long and are collected in the appendices. They use a battery of tools developed to analyze KRR and related methods. The key challenges in the analysis include analyzing the bias of the estimator, the effect of noise in the outputs, the effect of random sampling in the data, the approximation due to random features and a notion of orthogonality between the function space corresponding to random features and the full RKHS. The last two points are the main elements on novelty in the proof. In particular, compared to other studies, we identify and study the quantity needed to assess the effect of the random feature approximation if the goal is prediction rather than the kernel approximation itself.

Numerical results

While the learning bounds we present are optimal, there are no lower bounds on the number of random features, hence we present numerical experiments validating our bounds. Consider a spline kernel of order qq (see Eq. 2.1.7 when qq integer), defined as

Conclusion

In this paper, we provide a thorough analyses of the generalization properties of random features with ridge regression. We consider a statistical learning theory setting where data are noisy and sampled at random. Our main results show that there are large classes of learning problems where random features allow to reduce computations while preserving optimal statistical accuracy of exact kernel ridge regression. This in contrast with previous state of the art results suggesting computational gains needs to be traded-off with statistical accuracy. Our results open several venues for both theoretical and empirical work. As mentioned in the paper, it would be interesting to analyze random features with empirical leverage scores. This is immediate if input points are fixed, but our approach should allow to also consider the statistical learning setting. Beyond KRR, it would be interesting to analyze random features together with other approaches, in particular accelerated and stochastic gradient methods, or distributed techniques. It should be possible to extend the results in the paper to consider these cases. A more substantial generalization would be to consider loss functions other than quadratic loss, since this require different techniques from empirical process theory.

The authors gratefully acknowledge the contribution of Raffaello Camoriano who was involved in the initial phase of this project. These preliminary result appeared in the 2016 NIPS workshop “Adaptive and Scalable Nonparametric Methods in ML”. This work is funded by the Air Force project FA9550-17-1-0390 (European Office of Aerospace Research and Development) and by the FIRB project RBFR12M3AC (Italian Ministry of Education, University and Research).

References

Appendix A Proofs

In Sect. A.1, the notation is introduced and some standard identities are recalled. In Sect. A.2, the excess risk is decomposed in five terms (Eq. (17)-(21)) that are further simplified in Lemma 2, 3, 4, 5. The complete decomposition is presented in Thm. 4. In Sect. A.3, the terms in decomposition are bounded in probability, in particular Lemma 7 bounds the variance term, Lemma 8 the computational error term, while Lemma 10 controls the constants. Finally the proofs of the main results are presented in Section A.4 together with the more general results of Thm. 5.

First we recall the assumptions needed to derive the results. They are already presented or implied in the main text, here we collect and number them.

Assumption 2 (Compatibility condition) There exists α\alpha\in and F>0F>0 such that

The kernel KK has an integral representation as in Eq. 6, with ψ\psi continuous in both variables and bounded, that is, there exists κ1\kappa\geq 1 such that ψ(x,ω)κ|\psi(x,\omega)|\leq\kappa for any x,Xx,\in{X} and ωΩ\omega\in\Omega. The associated RKHS H\mathcal{H} is separable.

Moreover there exists fHHf_{\mathcal{H}}\in\mathcal{H} such that E(fH)=inffHE(f){\cal E}(f_{\mathcal{H}})=\inf_{f\in\mathcal{H}}{\cal E}(f).

Note that the above assumption on yy is satisfied when yy is bounded, sub-gaussian or sub-exponential. In particular, if y[b2,b2]|y|\in[-\frac{b}{2},\frac{b}{2}] almost surely, with b(0,)b\in(0,\infty) then the assumption above is satisfied with σ=B=b\sigma=B=b.

Let λ>0\lambda>0. There exists Q>0Q>0 and γ\gamma\in such that, for any λ>0\lambda>0

It is the first part of Asm. 1, for the sake of clarity we need to split it in two, since many results depend either on the first or on the second part.

There exists 1/2r11/2\leq r\leq 1 and gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})} such that

We denote with RR the quantity 1gρX1\vee\lVert{g}\rVert_{\rho_{{X}}}.

We now recall a useful characterization of the excess risk, in term of the quantities defined above.

When y2dρ\int y^{2}d\rho is finite, then fρL2(X,ρX)f_{\rho}\in{L^{2}({X},{\rho_{{X}}})} and fρf_{\rho} is the minimizer of E{\cal E} over all the measurable functions. When K(x,x)dρX\int K(x,x)d{\rho_{{X}}} is finite, the range of PP and of LL is the closure of H\mathcal{H} in L2(X,ρX){L^{2}({X},{\rho_{{X}}})}. When both conditions hold, for any fL2(X,ρX)f\in{L^{2}({X},{\rho_{{X}}})} the following hold

The latter term is zero if fHf\in\mathcal{H}, but we will also see that it is zero for all the functions defined by MM random features. Moreover if there exists fHHf_{\mathcal{H}}\in\mathcal{H} minimizing E{\cal E}, then Asm. 6 is equivalent to requiring the existence of r1/2r\geq 1/2, gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})} such that

with R:=gL2(X,ρX)R:=\lVert{g}\rVert_{L^{2}({X},{\rho_{{X}}})}.

In the following we define analogous operators for the approximated kernel KM:=ϕM(x)ϕM(x)K_{M}:=\phi_{M}(x)^{\top}\phi_{M}(x^{\prime}), with

Under Asm. 3 and the fact that ρ\rho is a finite measure, ψωL2(X,ρX)\psi_{\omega}\in{L^{2}({X},{\rho_{{X}}})} almost surely.

Now we are ready for defining the following operators, depending on ϕM\phi_{M} or KMK_{M}.

LM:L2(X,ρX)L2(X,ρX),(LMg)()=XKM(,z)g(z)dρX(z)L_{M}:{L^{2}({X},{\rho_{{X}}})}\to{L^{2}({X},{\rho_{{X}}})},\quad(L_{M}g)(\cdot)=\int_{X}K_{M}(\cdot,z)g(z)d{\rho_{{X}}}(z).

Note that the operators above satisfy the properties in the following remark.

Under Asm. 3 the linear operators LL is trace class and the linear operators LM,CM,SM,C^M,S^ML_{M},C_{M},S_{M},\widehat{C}_{M},\widehat{S}_{M} are finite dimensional. Moreover we have that L=SSL=SS^{*}, LM=SMSML_{M}=S_{M}S_{M}^{*}, CM=SMSMC_{M}=S_{M}^{*}S_{M} and C^M=S^MS^M\widehat{C}_{M}=\widehat{S}_{M}^{*}\widehat{S}_{M}. Finally L,LM,CM,C^ML,L_{M},C_{M},\widehat{C}_{M} are self-adjoint and positive operators, with spectrum is [0,κ2][0,\kappa^{2}].

In the next remark we rewrite f^λ,M\widehat{f}_{\lambda,M} in terms of the operators introduced above.

Let f^λ,M\widehat{f}_{\lambda,M} defined as in Eq. 7. Under Assumption 3, f^λ,ML2(X,ρX)\widehat{f}_{\lambda,M}\in{L^{2}({X},{\rho_{{X}}})} almost surely, since ψω\psi_{\omega} is in L2(X,ρX){L^{2}({X},{\rho_{{X}}})} almost surely (Rem. 7) and f^λ,M\widehat{f}_{\lambda,M} is a linear combination of ψω1,,ψωM\psi_{\omega_{1}},\dots,\psi_{\omega_{M}}. In particular,

A.2 Analytic Result

In this subsection we decompose analytically the excess risks in different terms, that will be bounded, via concentration inequalities, in the next section. Under Asm. 3, since f^λ,ML2(X,ρX)\widehat{f}_{\lambda,M}\in{L^{2}({X},{\rho_{{X}}})} almost surely, we have

(for more details see Rem. 6, 9). In our analysis we decompose the first term considering,

We comment on the role of the various terms in the decomposition. The first controls the variance of the outputs yy, the second the interaction between the space of models spanned by ψω1,,ψωM\psi_{\omega_{1}},\dots,\psi_{\omega_{M}} and H\mathcal{H}, the third the approximation of the inverse covariance operator C^M,λ1\widehat{C}_{M,\lambda}^{-1}, the fourth controls how close is the integral operator LML_{M} to LL, while the last controls the approximation error of the models in H\mathcal{H}. The L2(X,ρX){L^{2}({X},{\rho_{{X}}})} norm of f^λ,MPfρ\widehat{f}_{\lambda,M}-Pf_{\rho} is bounded by the sum of the L2(X,ρX){L^{2}({X},{\rho_{{X}}})} norms of the terms, that are further bounded in Lemma. 2, 3, 4, 5. The final analytical decomposition is given in Thm. 4. First we need a preliminary result.

Under Asm. 3, the operator LL is characterized by

By Asm. 3, we have that ψωL2(X,ρX)\psi_{\omega}\in{L^{2}({X},{\rho_{{X}}})} almost surely and uniformly bounded. By using the kernel expansion of Eq. (6), the linearity of the Bochner integral and of the dot product, we have that for any f,gL2(X,ρX)f,g\in{L^{2}({X},{\rho_{{X}}})} the following holds

Now we are ready to prove that the second term of the expansion in Eq. 17 is zero. We obtain this result by proving that (IP)ψω=0\lVert{(I-P)\psi_{\omega}}\rVert=0 almost everywhere.

where the latter result holds more generally for any frange(SM).f\in\text{range}(S_{M}).

Note that, since PP is the projection operator on the range of LL and LL is trace class, then (IP)L=0(I-P)L=0, this implies that Tr((IP)L(IP))=0\operatorname{Tr}((I-P)L(I-P))=0. By the characterization of LL given in Lemma 1, the linearity of the bounded operator IPI-P and of the trace, we have that

where the last step is due to the fact that 0,v=0\left\langle{0},{v}\right\rangle=0 for any vv and (IP)ψωi=0(I-P)\psi_{\omega_{i}}=0 almost surely, since ωi\omega_{i} are distributed according to π\pi and (IP)ψω=0(I-P)\psi_{\omega}=0 almost surely on the support of π\pi. Now

First of all we recall that Zf(ZZ)=f(ZZ)ZZ^{*}f(ZZ^{*})=f(Z^{*}Z)Z^{*} for any continuous spectral function and any compact operator ZZ. By the characterization of LML_{M} in Rem. 8 under Asm. 3, we have

since (+λI)1(\cdot+\lambda I)^{-1} is a continuos spectral function on [0,)[0,\infty), which contains the spectrum of LL that is in [0,κ2][0,\kappa^{2}]. Equivalently, the equation above could be proven algebraically via the Woodbury identity. Now we have

where the last step is due to the identity A1B1=A1(BA)B1A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1} valid for any bounded invertible linear operator A,BA,B. In particular by multiplying and dividing by CM,λ1/2C_{M,\lambda}^{1/2} we have the following decomposition

The result is given by bounding the norm of the lhs of the identity above, by the product of the norms of the parentheses on the rhs (see Rem. 5). Note that by applying Eq. (15), we have that there exists gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})}, such that Pfρ=LrgPf_{\rho}=L^{r}g and by dividing and multiplying for LM,λ1/2L_{M,\lambda}^{1/2}, we have

Now note that, by Asm. 6, we have r1/2r\geq 1/2, gρXR\lVert{g}\rVert_{\rho_{{X}}}\leq R and Lr1/2κ2r1\lVert{L^{r-1/2}}\rVert\leq\kappa^{2r-1} since LL is compact with the spectrum in [0,κ2][0,\kappa^{2}] and 2r102r-1\geq 0. By the fact that (+λI)2(\cdot+\lambda I)^{-2} is a continuous spectral function on [0,)[0,\infty) containing the spectrum of CMC_{M}, we have that SMCM,λ2SM=LM,λ2LMS_{M}C_{M,\lambda}^{-2}S_{M}^{*}=L_{M,\lambda}^{-2}L_{M} and so for any λ>0\lambda>0

By the algebraic identities A(A+λI)=Iλ(A+λI)1A(A+\lambda I)=I-\lambda(A+\lambda I)^{-1} valid for any bounded positive operator and A1B1=A1(BA)B1A^{-1}-B^{-1}=A^{-1}(B-A)B^{-1} valid for any invertible bounded operators, we have

By applying Eq. (15), we have that there exists gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})}, such that Pfρ=LrgPf_{\rho}=L^{r}g, so by multiplying and dividing by LM1/2L_{M}^{1/2}, we preform the following decomposition

The result is given by bounding the norm of the lhs of the identity above, by the product of the norms of the parentheses on the rhs (see Rem. 5). Note that λLM,λ1/21\lVert{\sqrt{\lambda}L_{M,\lambda}^{-1/2}}\rVert\leq 1 and LλrLr1\lVert{L_{\lambda}^{-r}{L}^{r}}\rVert\leq 1 for any λ>0\lambda>0 and gρXR\lVert{g}\rVert_{\rho_{{X}}}\leq R. Now we apply Proposition 9 on Lλ1/2(LLM)Lλ(1r)\lVert{L_{\lambda}^{-1/2}(L-L_{M})L_{\lambda}^{-(1-r)}}\rVert, indeed note that 01r1/20\leq 1-r\leq 1/2, so by setting σ=22r\sigma=2-2r, X=Lλ1/2(LLM)X=L_{\lambda}^{-1/2}(L-L_{M}), A=Lλ1/2A=L_{\lambda}^{-1/2} and applying the proposition, we have

Under Asm. 3, and Eq. (15) the following holds for any λ>0\lambda>0,

By the identity A(A+λI)1=Iλ(A+λ)1A(A+\lambda I)^{-1}=I-\lambda(A+\lambda)^{-1} valid for λ>0\lambda>0 and any bounded self-adjoint positive operator and by Eq. (15) for which there exists gL2(X,ρX)g\in{L^{2}({X},{\rho_{{X}}})} such that Pfρ=LrgPf_{\rho}={L}^{r}g, we have

The result is given by bounding the norm of the lhs of Eq. 22, by the product of the norms of the parentheses on the rhs (see Rem. 5). Note that λ1rLλ(1r)1\lVert{\lambda^{1-r}L_{\lambda}^{-(1-r)}}\rVert\leq 1 and LλrLr1\lVert{L_{\lambda}^{-r}{L}^{r}}\rVert\leq 1, while R:=gρXR:=\lVert{g}\rVert_{\rho_{{X}}} according to Eq. (15). ∎

S(λ,M,n):=CM,λ1/2(S^My^SMfρ)ρX+Rκ2r1CM,λ1/2(CMC^M){\cal S}(\lambda,M,n):=\lVert{C_{M,\lambda}^{-1/2}(\widehat{S}_{M}^{*}\widehat{y}-S_{M}^{*}f_{\rho})}\rVert_{\rho_{{X}}}+R\kappa^{2r-1}\lVert{C_{M,\lambda}^{-1/2}(C_{M}-\widehat{C}_{M})}\rVert,

C(λ,M):=RλLλ1/2(LLM)2v1Lλ1/2(LLM)Lλ1/222v{\cal C}(\lambda,M):=R\sqrt{\lambda}\lVert{L_{\lambda}^{-1/2}(L-L_{M})}\rVert^{2v-1}\lVert{L_{\lambda}^{-1/2}(L-L_{M})L_{\lambda}^{-1/2}}\rVert^{2-2v},

β:=max(1,(1β1)1)max(1,(1β2)1/2)\beta:=\max(1,(1-\beta_{1})^{-1})\max(1,(1-\beta_{2})^{-1/2}), with β1:=λmax(CM,λ1/2(CMC^M)CM,λ1/2)\beta_{1}:=\lambda_{\max}(C_{M,\lambda}^{-1/2}(C_{M}-\widehat{C}_{M})C_{M,\lambda}^{-1/2}) and β2:=λmax(Lλ1/2(LLM)Lλ1/2)\beta_{2}:=\lambda_{\max}(L_{\lambda}^{-1/2}(L-L_{M})L_{\lambda}^{-1/2}).

Under Asm. 3, the excess risk is characterized by Eq. 16 as we recalled at the beginning of this subsection. We decomposed the quantity f^λ,MPfρ\widehat{f}_{\lambda,M}-Pf_{\rho} according to the terms in Eq. 17-21. The L2(X,ρX){L^{2}({X},{\rho_{{X}}})} norm of f^λ,MPfρ\widehat{f}_{\lambda,M}-Pf_{\rho} is bounded by the sum of the L2(X,ρX){L^{2}({X},{\rho_{{X}}})} norms of the terms, that are further bounded in Lemma. 2, 3, 4, 5. In particular, for the first term, by writing f^λ,M\widehat{f}_{\lambda,M} in terms of the linear operators in Def. 2 (see Rem. 9) and by multipling and dividing by CM,λ1/2C_{M,\lambda}^{1/2} we have

then we bound the norm of the term with the norm of the parenthesis in the decomposition above (see Rem. 5). By collecting the result above with the bounds in Lemma. 2, 3, 4, 5, we have

where b1:=SMC^M,λ1CM,λ1/2b_{1}:=\lVert{S_{M}\widehat{C}_{M,\lambda}^{-1}C_{M,\lambda}^{1/2}}\rVert, b2:=LM,λ1/2L1/2b_{2}:=\lVert{L_{M,\lambda}^{-1/2}L^{1/2}}\rVert, b3:=LM,λ1/2Lλ1/2b_{3}:=\lVert{L_{M,\lambda}^{-1/2}L_{\lambda}^{1/2}}\rVert, A:=CM,λ1/2(S^My^SMfρ)ρXA:=\lVert{C_{M,\lambda}^{-1/2}(\widehat{S}_{M}^{*}\widehat{y}-S_{M}^{*}f_{\rho})}\rVert_{\rho_{{X}}}, B:=Rκ2r1CM,λ1/2(CMC^M)B:=R\kappa^{2r-1}\lVert{C_{M,\lambda}^{-1/2}(C_{M}-\widehat{C}_{M})}\rVert and D:=RλrD:=R\lambda^{r}. Now note that b2b3b_{2}\leq b_{3} for any λ>0\lambda>0 since, for any X,TX,T, bounded linear operators, with TT positive, by multiplying and dividing for TλT_{\lambda} the following holds

and Tλ1T1\lVert{T_{\lambda}^{-1}T}\rVert\leq 1, for any λ>0\lambda>0. Since A,B,C(λ,M),DA,B,{\cal C}(\lambda,M),D will contribute to the rates of the bound, while b1,b3b_{1},b_{3} are responsible for the numerical constants, we are going to bound the excess risk in order to collect b1,b3b_{1},b_{3} in a multiplicative term as follows

Finally, we further simplify b1,b3b_{1},b_{3}, in particular we apply Prop. 8 in the appendix, obtaining b3(1β2)1/2b_{3}\leq(1-\beta_{2})^{-1/2}. For b1b_{1} note that,

since, SMC^M,λ1/2CM,λ1/2C^M,λ1/2\lVert{S_{M}\widehat{C}_{M,\lambda}^{-1/2}}\rVert\leq\lVert{C_{M,\lambda}^{1/2}\widehat{C}_{M,\lambda}^{-1/2}}\rVert for the same reasoning in Eq. (24). Then we apply Prop. 8 in the appendix, obtaining b1(1β1)1b_{1}\leq(1-\beta_{1})^{-1}. ∎

In the next subsection, we are going to find probabilistic estimates for terms in the analytic decomposition of the excess risk in Thm. 4.

A.3 Probabilistic Estimates

The next lemma bounds the first term of S(λ,M,n){\cal S}(\lambda,M,n) and use a similar technique to the one in , while Lemma 7 bounds the whole S(λ,M,n){\cal S}(\lambda,M,n). First we need to introduce NM(λ){\cal N}_{M}(\lambda) that is the effective dimension induced by the kernel KMK_{M}. For any λ>0\lambda>0 define NM(λ){\cal N}_{M}(\lambda) as follows,

In Prop. 10 in the appendix, we bound NM(λ){\cal N}_{M}(\lambda) in terms of the effective dimension N(λ){\cal N}(\lambda) that is the one associated to the kernel KK. Prop. 10 refines the result of Prop. 1 of , with simpler proof and slightly improved constants.

In this proof we bound the quantity under study, by using the Bernstein inequality for sum of zero-mean random vectors (see Prop. 2 in the appendix). Since S^My^=n1i=1nϕM(xi)yi\widehat{S}_{M}^{*}\widehat{y}=n^{-1}\sum_{i=1}^{n}\phi_{M}(x_{i})y_{i} (see Def. 2) we have

So, by linearity of the expectation the ζi\zeta_{i},

which implies that ζi=ziμ\zeta_{i}=z_{i}-\mu is a zero-mean random variable for 1in1\leq i\leq n. Let zz be another random variable independent and identically distributed as the ziz_{i}’s. To apply the Bernstein inequality for random vectors, we need to bound their moments. First of all note that for any p1p\geq 1

In particular, by applying Asm. 3 and Asm. 4, we have

where J(λ)=XCM,λ1/2ϕM(x)2dρX(x)J(\lambda)=\int_{X}\lVert{C_{M,\lambda}^{-1/2}\phi_{M}(x)}\rVert^{2}d{\rho_{{X}}}(x), while CM,λ1/2ϕM(x)κ/λ\lVert{C_{M,\lambda}^{-1/2}\phi_{M}(x)}\rVert\leq\kappa/\sqrt{\lambda} a.s. is given by

where the last step is due to Asm. 3. Finally, to concentrate the sum of random vectors, we apply Prop. 2. To conclude the proof we need to prove that J(λ)=NM(λ)J(\lambda)={\cal N}_{M}(\lambda). Note that, by Rem. 8, we have that LM=SMSML_{M}=S_{M}S_{M}^{*} and CM=SMSMC_{M}=S_{M}^{*}S_{M}, so

since LM=SMSML_{M}=S_{M}S_{M}^{*} and SMLM,λ1SM=CMCM,λ1S_{M}^{*}L_{M,\lambda}^{-1}S_{M}=C_{M}C_{M,\lambda}^{-1}. By the the ciclicity of the trace and the definition of CMC_{M} in Def. 2, we have

when 0<λ<L0<\lambda<\lVert{L}\rVert and M(4+18F(λ))log12κ2λδM\geq(4+18{\cal F}_{\infty}(\lambda))\log\frac{12\kappa^{2}}{\lambda\delta}.

First of all we need to define four other events. Define the event Eω1ZE^{1}_{{\boldsymbol{\omega}}}\subseteq Z, as the event satisfying

A.3.2 Estimates for 𝒞​(λ,M)𝒞𝜆𝑀{\cal C}(\lambda,M)

Let C(λ,M){\cal C}(\lambda,M) as in Thm. 4, point 2. Let δ(0,1/2]\delta\in(0,1/2] and λ>0\lambda>0. Under Asm. 3, following holds with probability at least 12δ1-2\delta

when M(4+18F(λ))log8κ2λδM\geq(4+18{\cal F}_{\infty}(\lambda))\log\frac{8\kappa^{2}}{\lambda\delta} and t:=log11κ2λt:=\log\frac{11\kappa^{2}}{\lambda}.

We now study C(λ,M){\cal C}(\lambda,M) that is

where the last steps are due to the identity Tr(A(vv))=v,Av=A1/2v2\operatorname{Tr}(A(v\otimes v))=\left\langle{v},{Av}\right\rangle=\lVert{A^{1/2}v}\rVert^{2} valid for any vector vv and any bounded self-adjoint positive operator AA on a Hilbert space.

Define AΩMA\subseteq\Omega^{M} the event satisfying

Define BΩMB\subseteq\Omega^{M} the event satisfying

Now set E=ABE=A\cap B. When EE holds and under the assumption that M(4+18F(λ))log8κ2λδM\geq(4+18{\cal F}_{\infty}(\lambda))\log\frac{8\kappa^{2}}{\lambda\delta}, we have that the right hand side of Eq. (32) is smaller than 4ηF(λ)M,\sqrt{\frac{4\eta{\cal F}_{\infty}(\lambda)}{M}}, and η=log2δ+log4κ2λ\eta=\log\frac{2}{\delta}+\log\frac{4\kappa^{2}}{\lambda}, so

A.3.3 Estimates for β𝛽\beta

Let δ(0,1]\delta\in(0,1]. Under Asm. 3, the following holds with probability at least 1δ1-\delta,

when M32(κ2L+κ2)log2δM\geq 32\left(\frac{\kappa^{2}}{\lVert{L}\rVert}+\kappa^{2}\right)\log\frac{2}{\delta}.

Define the event AΩMA\in\Omega^{M} as the one satisfying

Note that when ωA{\boldsymbol{\omega}}\in A and M32(κ2L+κ2)log2δM\geq 32\left(\frac{\kappa^{2}}{\lVert{L}\rVert}+\kappa^{2}\right)\log\frac{2}{\delta}, we have LLMHS14L\lVert{L-L_{M}}\rVert_{HS}\leq\frac{1}{4}\lVert{L}\rVert and, by using the characterization of CM,LMC_{M},L_{M} in Rem. 8 and the fact that HS\lVert{\cdot}\rVert_{HS}\geq\lVert{\cdot}\rVert, we have

Let δ(0,1/3]\delta\in(0,1/3], and β\beta be as in Thm. 4, point 3. Under Asm. 3, the following holds with probability at least 13δ1-3\delta,

when 0<λ34L0<\lambda\leq\frac{3}{4}\lVert{L}\rVert, and

Let β,β1,β2\beta,\beta_{1},\beta_{2} be defined as in Thm. 4, point 3. To bound β\beta in probability, we first bound β1\beta_{1} and β2\beta_{2} in probability, and then, under the intersection of the events, we control β\beta. First of all, denote with (a)(a) the condition on λ\lambda, with (b)(b) the condition on nn and with (c.1)(c.1) the condition M32(κ2L+κ2)log2δM\geq 32\left(\frac{\kappa^{2}}{\lVert{L}\rVert}+\kappa^{2}\right)\log\frac{2}{\delta}, while with (c.2)(c.2) the condition M18(2+F(λ))log4κ2λδM\geq 18\left(2+{\cal F}_{\infty}(\lambda)\right)\log\frac{4\kappa^{2}}{\lambda\delta}. Define the event EWE\subseteq W as the one satisfying β113\beta_{1}\leq\frac{1}{3}. To bound the probability of EE we need an auxiliary event AΩMA\in\Omega^{M} that is the one satisfying 34LCM\frac{3}{4}\lVert{L}\rVert\leq\lVert{C_{M}}\rVert. The specific choice of AA will be made clear later. We have

A.4 Proof of the Main Result

Here we prove Thm. 1, 2, 3 that are the main results of the paper. In particular the following Thm. 5 is a general version of the three theorem above, without the need of Assumptions 5, 2 and valid for a wide range of λ,M\lambda,M. In Thm. 6, we specialize the result of Thm. 5, selecting MM in terms of λ\lambda such that the upper bound of the excess risk depends only on λ\lambda and is proportional to the same upper bound for kernel ridge regression that leads to optimal generalization bounds. Note that Thm. 6 is again independent of Asm. 5, 2. Finally, Thm. 7 is obtained by Thm. 6, by adding Asm. 5, 2 and has all the constant explicit. Then Thm. 1 is a specification of Thm. 7, for the simple scenario where it is only required the existence of fHf_{\mathcal{H}} (that is Asm. 5 satisfied with γ=1\gamma=1, Asm. 6 satisfied with r=1/2r=1/2 and Asm. 2 with α=1\alpha=1, see discussion after the introduction of the assumptions). Thm. 2 is a specification of Thm. 7 for the fast rates (Asm. 2 is satisfied with α=1\alpha=1), while Thm. 3 is a simplified version of Thm. 7 where the constants have been hidden.

Let δ(0,1]\delta\in(0,1]. Let f^λ,M\widehat{f}_{\lambda,M} be as in Eq. (7). Under Asm. 3, 6, 4 when 0<λ34L0<\lambda\leq\frac{3}{4}\lVert{L}\rVert and

with q0=2(2+κL+κ2)q_{0}=2(2+\frac{\kappa}{\lVert{L}\rVert}+\kappa^{2}), then the following holds with probability at least 1δ1-\delta,

where Bˉ=B+2Rκ\bar{B}=B+2R\kappa, σˉ=σ+2Rκ\bar{\sigma}=\sigma+2\sqrt{R}\kappa,

and t:=log11κ2λt:=\log\frac{11\kappa^{2}}{\lambda}.

Under Asm. 3, the existence of fHf_{\mathcal{H}} in Asm. 4 and Asm. 6, plus Rem. 6, we have the following analytical decomposition of the excess risk, by Thm, 4

where the quantities β\beta, C(λ,M){\cal C}(\lambda,M) and S(λ,M,n){\cal S}(\lambda,M,n) are defined in the statement of Thm. 4. Under the same assumptions, Lemma 7, 8 and 10 are devoted to bound in probability the three quantities, with the help of the concentration inequalities recalled in Section B, plus some auxiliary results in Section D, of the appendixes.

Define the event EWE\subseteq W as the one satisfying

Define the event GWG\subseteq W as the one satisfying

Finally Eq. (35) is obtained from Eq. (37), by bounding β\beta with 22, S(λ,M,n){\cal S}(\lambda,M,n) with Eq. (38) and C(λ,M){\cal C}(\lambda,M) by C(λ,M){\mathfrak{C}}(\lambda,M). So by definition, Eq. (35) holds under the event DEGD\cap E\cap G and the conditions (d1),(e1)(d_{1}),(e_{1}) on λ\lambda, (d2)(d_{2}) on nn and (d3),(e2),(g1)(d_{3}),(e_{2}),(g_{1}) on MM. The event DEGD\cap E\cap G has probability

Finally note that the conditions on λ,n,M\lambda,n,M in the statement of this theorem imply, respectively, conditions (d1),(e1)(d_{1}),(e_{1}) on λ\lambda, (d2)(d_{2}) on nn, and (d3),(e2),(g1)(d_{3}),(e_{2}),(g_{1}) on MM. ∎

Let δ(0,1]\delta\in(0,1]. Let f^λ,M\widehat{f}_{\lambda,M} be as in Eq. (7). Under Asm. 3, 6, 4, when 0<λ34L0<\lambda\leq\frac{3}{4}\lVert{L}\rVert and

with q0=2(2+κL+κ2)q_{0}=2(2+\frac{\kappa}{\lVert{L}\rVert}+\kappa^{2}), then the following holds with probability at least 1δ1-\delta,

Here Bˉ=B+2Rκ\bar{B}=B+2R\kappa, σˉ=σ+2Rκ\bar{\sigma}=\sigma+2\sqrt{R}\kappa.

First we apply Thm. 5, then we add a condition on MM with respect to λ\lambda such that we can bound C(λ,M)\mathfrak{C}(\lambda,M) with RλrR\lambda^{r}. The condition we will consider is the following

Indeed lower bounding with (g2)(g_{2}) the occurrences of MM in C(λ,M)\mathfrak{C}(\lambda,M), we have

where the second step is due to F(λ)κ2/λ{\cal F}_{\infty}(\lambda)\leq\kappa^{2}/\lambda and 1+4r24r01+4r^{2}-4r\geq 0 for r1/2r\geq 1/2, while the last step is due to the following three facts. First, that 42rN(λ)4r22r44^{2r}{\cal N}(\lambda)^{4r^{2}-2r}\geq 4, since 4r22r04r^{2}-2r\geq 0 on r[1/2,1]r\in[1/2,1] and, by denoting with (λi(L))i1(\lambda_{i}(L))_{i\geq 1} the eigenvalues of LL, with L:=λ1(L)λ2(L)0\lVert{L}\rVert:=\lambda_{1}(L)\geq\lambda_{2}(L)\geq\dots\geq 0, and recalling that 0λ34L0\leq\lambda\leq\frac{3}{4}\lVert{L}\rVert, we have

Second, that t6r4r221t^{6r-4r^{2}-2}\geq 1, since 6r4r2206r-4r^{2}-2\geq 0 on r[1/2,1]r\in[1/2,1] and t1t\geq 1, since 0λ34L34κ20\leq\lambda\leq\frac{3}{4}\lVert{L}\rVert\leq\frac{3}{4}\kappa^{2}. Third, that κ12r8r241\kappa^{12r-8r^{2}-4}\geq 1 and κ44r1\kappa^{4-4r}\geq 1, since 12r8r24012r-8r^{2}-4\geq 0 and 44r04-4r\geq 0 on r[1/2,1],κ1r\in[1/2,1],\kappa\geq 1. ∎

The following theorem is a specialization of the previous one, under 5, 2 and an explicit relation of λ\lambda with respect to nn.

Let δ(0,1]\delta\in(0,1]. Under Asm. 3 and 5, 6, 2, 4, let p:=(2r+γ1)1p:=(2r+\gamma-1)^{-1}, and

with c0=9(3+4κ2+4κ2L+κ24Q2r1F22r)c_{0}=9(3+4\kappa^{2}+\frac{4\kappa^{2}}{\lVert{L}\rVert}+\frac{\kappa^{2}}{4}Q^{2r-1}F^{2-2r}), then the following holds with probability at least 1δ1-\delta,

and c1=64(Bˉκ+σˉQ+R)2c_{1}=64(\bar{B}\kappa+\bar{\sigma}Q+R)^{2}.

Let λ=n12r+γ\lambda=n^{-\frac{1}{2r+\gamma}} in Thm. 6 and substitute F(λ){\cal F}_{\infty}(\lambda) and N(λ){\cal N}(\lambda) by their bounds given in Asm. 5, 2. Note that to guarantee that nn satisfies the associated constraint with respect to λ\lambda, in Thm. 6, and that λ\lambda is in (0,34L](0,\frac{3}{4}\lVert{L}\rVert] we need that n(43L)p+1pn\geq(\frac{4}{3\lVert{L}\rVert})^{\frac{p+1}{p}} and

Note that the theorems in Sect 3 are corollaries of the theorem above. For the sake of readability, in contrast to Thm. 7, the results in Thm. 1, 2, 3 are expressed with respect to τ:=log1δ\tau:=\log\frac{1}{\delta}. Moreover in the statement of Thm. 1, 2, 3 the constants and the logarithmic terms are omitted. They can be recovered by plugging the coefficients detailed in the following proofs in the statement of Thm. 7.

This is an application of Thm. 7, with minimum number of assumptions. Indeed the existence of fHf_{\mathcal{H}} and the fact that yb|y|\leq b a. s. satisfies Asm. 4 with σ=B=2b\sigma=B=2b. The fact that XX is a Polish space and that ψ\psi is bounded continuous satisfy Asm. 3, and so the kernel is bounded by κ2\kappa^{2}. Since the kernel is bounded, we have that Asm. 5 is always satisfied with γ=1,Q=κ\gamma=1,Q=\kappa; Asm. 6 is always satisfied with r=1/2,R=1fHHr=1/2,R=1\vee\lVert{f_{\mathcal{H}}}\rVert_{\mathcal{H}}; Asm. 2 is always satisfied with α=1,F=κ2\alpha=1,F=\kappa^{2}. In particular we have the following constants n0:=4L2  (264κ2 log556κ3δ)2n_{0}:=4\lVert{L}\rVert^{-2}~{}\vee~{}\left(264\kappa^{2}~{}\log\frac{556\kappa^{3}}{\delta}\right)^{2},

with Bˉ:=2b+2κ(1fHH)\bar{B}:=2b+2\kappa(1\vee\lVert{f_{\mathcal{H}}}\rVert_{\mathcal{H}}) and σˉ:=2b+2κ1fHH\bar{\sigma}:=2b+2\kappa\sqrt{1\vee\lVert{f_{\mathcal{H}}}\rVert_{\mathcal{H}}}. ∎

Under the same assumptions of Thm. 1, if nL2(1056log1056278κ5bc1)2n\geq\lVert{L}\rVert^{-2}\vee\left(1056\log\frac{1056\sqrt{278\kappa^{5}b}}{\sqrt{c_{1}}}\right)^{2} and λn=n1/2\lambda_{n}=n^{-1/2}, then a number of random features MnM_{n} equal to

In particular the constants c0,c1c_{0},c_{1} are as in Thm. 1 and c2=8κ2bc1c_{2}=\frac{8\kappa^{2}\sqrt{b}}{\sqrt{c_{1}}}.

In the rest we will denote E(f^λn,Mn)E(fH){\cal E}(\widehat{f}_{\lambda_{n},M_{n}})-{\cal E}(f_{\mathcal{H}}), with R(f^λn,Mn){\cal R}(\widehat{f}_{\lambda_{n},M_{n}}) and will use the notation of Sect. A.3. Fix δ0=2c1κ2bn1\delta_{0}=\frac{2c_{1}}{\kappa^{2}b}n^{-1}. Denote with EE, the event satisfying R(f^λn,Mn)>t0{\cal R}(\widehat{f}_{\lambda_{n},M_{n}})>t_{0}, with t0:=c1log218δ0 n1/2t_{0}:=c_{1}\log^{2}\frac{18}{\delta_{0}}~{}n^{-1/2}.

First, note that Mn2c0nlog(8κ2bc1n)M_{n}\geq 2c_{0}\sqrt{n}\log\left(\frac{8\kappa^{2}\sqrt{b}}{\sqrt{c_{1}}}n\right), satisfies Mnc0 nlog108κ2nδ0M_{n}\geq c_{0}~{}\sqrt{n}\log\frac{108\kappa^{2}\sqrt{n}}{\delta_{0}} and any nL2(1056log1056278κ5bc1)2n\geq\lVert{L}\rVert^{-2}\vee\left(1056\log\frac{1056\sqrt{278\kappa^{5}b}}{\sqrt{c_{1}}}\right)^{2} satisfies nn0(δ0)n\geq n_{0}(\delta_{0}) with n0(δ)n_{0}(\delta) as in Thm. 1. So, we can apply Thm. 1, from which we know that EE holds with probability smaller than δ0\delta_{0}.

Second, by Rem. 6 and Rem. 9, we have that

Now, by denoting with 1E{\bf 1}_{E} the indicator function for EE, we have

This is an application of Thm. 7, where assumption Asm. 2 is satisfied with F=κ2F=\kappa^{2} and α=1\alpha=1. Indeed the existence of fHf_{\mathcal{H}} and the fact that yb|y|\leq b a. s. satisfies Asm. 4 with σ=B=2b\sigma=B=2b. The fact that XX is a Polish space and that ψ\psi is bounded continuous satisfy, Asm. 3, and so the kernel is bounded by κ2\kappa^{2}. Since the kernel is bounded, we have that Asm. 2 is always satisfied with α=1,F=κ2\alpha=1,F=\kappa^{2}. Asm. 4 and Asm. 6 are directly satisfied by Asm. 1. In particular we obtain n0:=(2/L)p+1p  (264κ2p log(556κ2δ1pκ2))1+pn_{0}:=\left(2/\lVert{L}\rVert\right)^{\frac{p+1}{p}}~{}\vee~{}\left(264\kappa^{2}p~{}\log(556\kappa^{2}\delta^{-1}\sqrt{p\kappa^{2}})\right)^{1+p},

with Bˉ:=2b+2κR\bar{B}:=2b+2\kappa R and σˉ:=2b+2κR\bar{\sigma}:=2b+2\kappa\sqrt{R}. ∎

By definition of ψs,πs\psi_{s},\pi_{s} we have

Now we show that ψs,πs\psi_{s},\pi_{s} achieves F(λ)=N(λ){\cal F}_{\infty}(\lambda)={\cal N}(\lambda). By recalling that s(ω)=(L+λI)1/2ψ(,ω)ρX2s(\omega)=\lVert{(L+\lambda I)^{-1/2}\psi(\cdot,\omega)}\rVert_{\rho_{{X}}}^{-2}, we have

We recall that Cs=1s(ω)dπC_{s}=\int\frac{1}{s(\omega)}d\pi. Denoting with ψω\psi_{\omega} the function ψ(,ω)\psi(\cdot,\omega) and considering that AxρX=Tr(A2(xx))\lVert{Ax}\rVert_{\rho_{{X}}}=\operatorname{Tr}(A^{2}(x\otimes x)) for any bounded symmetrix linear operator AA and vector vv, and that the trace is linear,

where the fact that L=ψωψωdπ(ω)L=\int\psi_{\omega}\otimes\psi_{\omega}d\pi(\omega) is due to Lemma 1. ∎

This is a version of Thm. 7, with simplified set of assumptions. Indeed the existence of fHf_{\mathcal{H}} and the fact that yb|y|\leq b a. s. satisfies Asm. 4 with σ=B=2b\sigma=B=2b. The fact that XX is a Polish space and that ψ\psi is bounded continuous, satisfy Asm. 3, and so the kernel is bounded by κ2\kappa^{2}. Asm. 4 and Asm. 6 are directly satisfied by Asm. 1.In particular we obtain n0:=(2/L)p+1p  (264κ2p log(556κ2δ1pκ2))1+pn_{0}:=\left(2/\lVert{L}\rVert\right)^{\frac{p+1}{p}}~{}\vee~{}\left(264\kappa^{2}p~{}\log(556\kappa^{2}\delta^{-1}\sqrt{p\kappa^{2}})\right)^{1+p},

with Bˉ:=2b+2κR\bar{B}:=2b+2\kappa R and σˉ:=2b+2κR\bar{\sigma}:=2b+2\kappa\sqrt{R}. ∎

Appendix B Concentration Inequalities

Here we recall some standard concentration inequalities that will be used in Sect. A.3. The following inequality is from Thm.3 of and will be used in Lemma 10, together with other inequalities, to concentrate the empirical effective dimension to the true effective dimension.

If there exists TmaxixiT^{\prime}\geq\max_{i}|x_{i}| almost everywhere, then the same bound, with TT^{\prime} instead of TT, holds for the for the absolute value of the left hand side, with probability at least 12δ1-2\delta.

The following inequality is and adaptation of Thm. 3.3.4 in and is a generalization of the previous one to random vectors. It is used primarily in Lemma 6, to control the sample error. Moreover it is used in Prop. 10, Lemma 10, to control the empirical effective dimension and to bound the term β\beta of Thm. 4, in the main theorem. Finally it is used to prove the inequality in Prop. 5.

for any i{1,,n}i\in\{1,\dots,n\}. Then for any δ(0,1]\delta\in(0,1]:

The following inequality is essentially Thm. 7.3.1 in (generalized to separable Hilbert spaces by the technique in Section 4 of ). It is a generalization of the Bernstein inequality to random operators. It is mainly used to prove the inequality in Prop. 6.

with probability at least 1δ1-\delta. Here β=log2TrSSδ\beta=\log\frac{2\operatorname{Tr}S}{\lVert{S}\rVert{}\delta}.

If there exists LL^{\prime} such that LmaxiXiL^{\prime}\geq\max_{i}\lVert{X_{i}}\rVert{} almost everywhere, then the same bound holds with LL^{\prime} instead of LL for the operator norm, with probability at least 12δ1-2\delta.

The theorem is a restatement of Theorem 7.3.1 of generalized to the separable Hilbert space case by means of the technique in Section 4 of . ∎

Appendix C Operator Inequalities

Let H,K\mathcal{H},\mathcal{K} be separable Hilbert spaces and A,B:HHA,B:\mathcal{H}\to\mathcal{H} bounded linear operators.

The following inequality is needed to prove the interpolation inequality in Prop. 9, that is needed to perform a fine split of the computational error.

If A,BA,B are self-adjoint and positive, then

Appendix D Auxiliary Results

The next proposition is used in Lemma 7 to control the sample error. It is based on the Bernstein inequality for random vectors, Prop. 2.

with probability at least 1τ1-\tau, where ess sup\operatorname*{ess\,sup} denotes the essential supremum and

for any 1in1\leq i\leq n. Therefore for any p2p\geq 2 we have

The following inequality, together with Prop. 8, is used in Prop. 10, Lemmas 10, 8. A similar technique can be found in .

where β=log4TrQλδ\beta=\log\frac{4\operatorname{Tr}Q}{\lambda\delta}. Moreover, with the same probability

Let Qλ=Q+λIQ_{\lambda}=Q+\lambda I. Here we apply Prop. 3 on the random variables Zi=MQλ1/2viQλ1/2viZ_{i}=M-Q_{\lambda}^{-1/2}v_{i}\otimes Q_{\lambda}^{-1/2}v_{i} with M=Qλ1/2QQλ1/2M=Q_{\lambda}^{-1/2}QQ_{\lambda}^{-1/2} for 1in1\leq i\leq n. Note that the expectation of ZiZ_{i} is . The random vectors are bounded by

almost everywhere, for any 1in1\leq i\leq n. The second order moment is

for 1in1\leq i\leq n. Now we can apply Prop. 3. Now some considerations on β\beta. It is β=log2TrSSδ=2TrQλ1QQλ1Qδ\beta=\log\frac{2\operatorname{Tr}S}{\lVert{S}\rVert{}\delta}=\frac{2\operatorname{Tr}Q_{\lambda}^{-1}Q}{\lVert{Q_{\lambda}^{-1}Q}\rVert{}\delta}, now TrQλ1Q1λTrQ\operatorname{Tr}Q_{\lambda}^{-1}Q\leq\frac{1}{\lambda}\operatorname{Tr}Q. We need a lower bound for Qλ1Q=σ1σ1+λ\lVert{Q_{\lambda}^{-1}Q}\rVert{}=\frac{\sigma_{1}}{\sigma_{1}+\lambda} where σ1=Q\sigma_{1}=\lVert{Q}\rVert is the biggest eigenvalue of QQ, now λσ1\lambda\leq\sigma_{1} thus σ1σ1+λ1/2\frac{\sigma_{1}}{\sigma_{1}+\lambda}\geq 1/2 and so βlog2TrQλ1QQλ1Qδlog4TrQλδ\beta\leq\log\frac{2\operatorname{Tr}Q_{\lambda}^{-1}Q}{\lVert{Q_{\lambda}^{-1}Q}\rVert{}\delta}\leq\log\frac{4\operatorname{Tr}Q}{\lambda\delta}.

For the second bound of this proposition we use the second bound of Prop. 3, the analysis remains the same except for uniform bound on Z1Z_{1}, that now is

In the following remark, we start from the result of the previous proposition, expressing the conditions on nn and λ\lambda with respect to a given value for the bound.

With the same notation of Prop. 6, assume that vκ\lVert{v}\rVert\leq\kappa almost everywhereOr equivalently define κ2\kappa^{2} with respect to F(λ){\cal F}_{\infty}(\lambda) as κ2=infλ>0F(λ)(Q+λ)\kappa^{2}=\inf_{\lambda>0}{\cal F}_{\infty}(\lambda)(\lVert{Q}\rVert+\lambda)., then we have that

for any t(0,1]t\in(0,1], when n2t2(2t3+F(λ))log4κ2λδn\geq\frac{2}{t^{2}}\left(\frac{2t}{3}+{\cal F}_{\infty}(\lambda)\right)\log\frac{4\kappa^{2}}{\lambda\delta} and λQ\lambda\leq\lVert{Q}\rVert, we have

The equation above holds with the same probability with t=1/2t=1/2, when 9κ2nlogn2δλQ\frac{9\kappa^{2}}{n}\log\frac{n}{2\delta}\leq\lambda\leq\lVert{Q}\rVert{} and n405κ267κ2logκ22δn\geq 405\kappa^{2}\vee 67\kappa^{2}\log\frac{\kappa^{2}}{2\delta}.

The equation above holds with the same probability with t=1/3t=1/3, when 19κ2nlogn4δλQ\frac{19\kappa^{2}}{n}\log\frac{n}{4\delta}\leq\lambda\leq\lVert{Q}\rVert{} and n405κ267κ2logκ22δn\geq 405\kappa^{2}\vee 67\kappa^{2}\log\frac{\kappa^{2}}{2\delta}.

The next proposition, together with Prop. 10 are a restatement of Prop. 1 of . In particular the next proposition performs the analytic decomposition of the difference between the empirical and the true effective dimension, while Prop. 10, bounds the decomposition in probability.

Let L\cal{L} be an Hilbert space. Let L,LM:LLL,L_{M}:{\cal L}\to{\cal L} be two bounded positive linear operators, that are trace class. Given λ>0\lambda>0, let

Considering that for any bounded symmetric linear operator XX the following identity holds

The next result is essentially Prop. 7 of , while a similar technique can be found in . It is used, mainly together with Prop. 6, to give multiplicative bounds to empirical operators. It is used in the analytic decomposition of the excess risk, in Prop. 7, Thm. 4.

Let H\mathcal{H} be a separable Hilbert space, let A,BA,B two bounded self-adjoint positive linear operators on H\mathcal{H} and λ>0\lambda>0. Then

Note that βλmax(B)λmax(B)+λ<1\beta\leq\frac{\lambda_{\max}(B)}{\lambda_{\max}(B)+\lambda}<1 by definition.

Let Bλ=B+λIB_{\lambda}=B+\lambda I. First of all note that β<1\beta<1 for any λ>0\lambda>0. Indeed, by exploiting the variational formulation of the biggest eigenvalue, we have

since AA is a positive operator and thus f,Bλ1/2ABλ1/2f0\left\langle{f},{B_{\lambda}^{-1/2}AB_{\lambda}^{-1/2}f}\right\rangle\geq 0 for any fHf\in\mathcal{H}. Now note that

Now let X=(IBλ1/2(BA)Bλ1/2)1X=(I-B_{\lambda}^{-1/2}(B-A)B_{\lambda}^{-1/2})^{-1}. We have that,

because Z=ZZ1/2\lVert{Z}\rVert{}=\lVert{Z^{*}Z}\rVert{}^{1/2} for any bounded operator ZZ. Note that

Finally let Y=Bλ1/2(BA)Bλ1/2Y=B_{\lambda}^{-1/2}(B-A)B_{\lambda}^{-1/2}, we have seen that β=λmax(Y)<1\beta=\lambda_{\max}(Y)<1, then

since X=w(Y)X=w(Y) with w(σ)=(1σ)1w(\sigma)=(1-\sigma)^{-1} for σ<1-\infty\leq\sigma<1, and ww is positive and monotonically increasing on the domain. ∎

The following proposition is used to give a fine analytical decomposition of the excess risk in Prop. 7, Thm. 4. A similar interpolation inequality for finite dimensional matrices, can be found in . Here we prove it for bounded linear operators on separable Hilbert spaces.

Let H,K\mathcal{H},\mathcal{K} be two separable Hilbert spaces and X,AX,A be bounded linear operators, with X:HKX:\mathcal{H}\to\mathcal{K} and A:HHA:\mathcal{H}\to\mathcal{H} be positive semidefinite.

where the last inequality is due to Cordes (see Proposition 4). Then we have that (XX)1σX2(1σ)σXX(X^{*}X)^{\frac{1}{\sigma}}\preccurlyeq\lVert{X}\rVert^{\frac{2(1-\sigma)}{\sigma}}X^{*}X (where \preccurlyeq is the Löwner partial ordering on positive operators) and so

In the next proposition, we bound NM(λ){\cal N}_{M}(\lambda) in terms of the effective dimension N(λ){\cal N}(\lambda) that is the one associated to the kernel KK. The proof of Prop. 10 analogous to the one of Prop. 1 of , with simpler proof and slightly improved constants.

with c(λ,m,δ)=qN(λ)+3q2N(λ)+32(3qN(λ)+3q)2c(\lambda,m,\delta)=\frac{q}{{\cal N}(\lambda)}+\sqrt{\frac{3q}{2{\cal N}(\lambda)}}+\frac{3}{2}\left(\frac{3q}{\sqrt{{\cal N}(\lambda)}}+\sqrt{3q}\right)^{2} and q=4F(λ)log6δ3mq=\frac{4{\cal F}_{\infty}(\lambda)\log\frac{6}{\delta}}{3m}.

By noting that MM is upper bounded by M=Tr(λLλ2L)=Tr((ILλ1L)Lλ1L)N(λ)M=\operatorname{Tr}(\lambda L_{\lambda}^{-2}L)=\operatorname{Tr}((I-L_{\lambda}^{-1}L)L_{\lambda}^{-1}L)\leq{\cal N}(\lambda), we can apply the Bernstein inequality (Prop. 1) where TT and SS are

with probability at least 1τ1-\tau. Then, by taking a union bound for the three events we have

with probability at least 1δ1-\delta, where q=4F(λ)log6δ3mq=\frac{4{\cal F}_{\infty}(\lambda)\log\frac{6}{\delta}}{3m}. Finally, if m(4+18F(λ))log12κ2λδm\geq(4+18{\cal F}_{\infty}(\lambda))\log\frac{12\kappa^{2}}{\lambda\delta}, then we have q2/27q\leq 2/27. Noting that N(λ)LLλ1=LL+λ1/2{\cal N}(\lambda)\geq\lVert{LL_{\lambda}^{-1}}\rVert{}=\frac{\lVert{L}\rVert}{\lVert{L}\rVert+\lambda}\geq 1/2, we have that

Appendix E Examples of Random feature maps

A lot of works have been devoted to develop random feature maps in the the setting introduced above, or slight variations (see for example and references therein). In the rest of the section, we give several examples.

with π\pi proportional to v^\hat{v}, ψ(ω,x)=cos(wx+b)\psi(\omega,x)=\cos(w^{\top}x+b) and ω:=(w,b)Ω, xX\omega:=(w,b)\in\Omega,~{}x\in{X}. Note that X,Ω,π,ψX,\Omega,\pi,\psi satisfy Assumption 3. In , they further randomize the construction above, by using results from locally sensitive hashing, to obtain a feature map which is a binary code. It can be shown that their methods satisfy Assumption 3 for an appropriate choice of Ω\Omega and the probability distribution π\pi. consider the setting of , but selects ω1,,ωm\omega_{1},\dots,\omega_{m} by means of the fast Welsh-Hadamard transform in order to improve the computational complexity for the algorithm computing KMK_{M}, while selects them by using low-discrepancy sequences for quasi-random sampling to improve the statistical accuracy of KMK_{M} with respect to KK.

where CC is the normalization factor C=eσ2dC=e^{\sigma^{2}d}. Indeed by Taylor expansion of the exponential function and of the power of a multinomial, we have

Note that it is possible to sample from π\pi in the following way. By the steps above it is clear that a sample from π\pi can be obtained by first sampling tt from a Poisson distribution with parameter σ2d\sigma^{2}d and then sampling ω1,,ωd\omega_{1},\dots,\omega_{d} from a multinomial distribution with probability p1==pd=1/dp_{1}=\dots=p_{d}=1/d and number of trials tt.

and (xz)p(x^{\top}z)^{p} can be approximated by g(Wp,x)g(Wp,z)g(W_{p},x)^{\top}g(W_{p},z) with g(Wp,x)=i=1pxwig(W_{p},x)=\prod_{i=1}^{p}x^{\top}w_{i} and Wp=(w1,,wp){1,1}p×dW_{p}=(w_{1},\dots,w_{p})\in\{-1,1\}^{p\times d} a matrix of independent random variables with probability at least 1/21/2. Indeed it holds,

where wiw_{i} is the ii-th row of WpW_{p} with 1ip1\leq i\leq p. Now note that Eq. (6) is satisfied, indeed for any x,zXx,z\in{X}

Now note that Assumption 3 is satisfied when v(τR2d)<v(\tau R^{2}d)<\infty. Indeed, considering that (xw)2x2w2R2d(x^{\top}w)^{2}\leq\lVert{x}\rVert^{2}\lVert{w}\rVert^{2}\leq R^{2}d for any xXx\in{X} and w{1,1}dw\in\{-1,1\}^{d}, we have

approximate the construction above by using randomized tensor sketching and Johnson-Lindenstrauss random projections. It can be shown that even their methods satisfy Assumption 3 for an appropriate choice of Ω\Omega and the probability distribution π\pi.

The considered input space is X=[0,)dX=[0,\infty)^{d} and the considered kernels are of the form

In this work they focus on X=d{X}=^{d} and on additive homogeneous kernels, that are of the form

As pointed out in , this family of kernels is particularly useful on histograms. Exploiting the homogeneous property, the kernel kk is rewritten as follows

In this work a generalization of the ReLU activation function for one layer neural networks is considered, that is

where JpJ_{p} is defined in Eq. 4 of and θ(x,z)\theta(x,z) is the angle between the two vectors xx and zz. Examples of JpJ_{p} are the following