Linearized two-layers neural networks in high dimension

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

Introduction and main results

For a number of important applications, state-of-the-art performances are obtained by representing the function ff as a multi-layers neural network. The simplest model in this class is given by two-layers networks (NN):

Two-layers neural networks have been extensively studied in the nineties, with a focus on two goals: (i)(i) Establishing approximation guarantees over classical function spaces; (ii)(ii) Controlling the generalization error via Rademacher complexity arguments. We refer to [Pin99, AB09] for surveys of these results.

Computational aspects were notably under-represented within these early theoretical contributions. On the contrary, it is nowadays increasingly clear that computational and statistical aspects cannot be separated in the analysis of neural networks (see, e.g. [SHN+18, MMN18, CB18]). Indeed, the optimization algorithm does not simply compute the unique minimizer of a regularized empirical risk: it instead selects one among many possible near-minimizers, whose generalization properties can vary significantly. Therefore, the specific optimization algorithm is an integral part of the definition of the regularization method.

A concrete scenario in which this interplay can be understood precisely is the so-called ‘neural tangent kernel’ regime. First explicitly described in [JGH18], this regime has attracted considerable amount of work. The basic idea is that, for highly overparametrized networks, the network weights barely change from their random initialization. We can therefore replace the nonlinear function class FNN{\mathcal{F}}_{{\sf NN}} by its first order Taylor expansion around this initialization.

Denoting by (a0,i,w0,i)iN(a_{0,i},{\bm{w}}_{0,i})_{i\leq N} the weights at initialization, a first order Taylor expansion yields

where fNN,0f_{{\sf NN},0} is the neural network at initialization. In other words, fNNfNN,0f_{{\sf NN}}-f_{{\sf NN},0} is a function in the direct sum FNT(W)FRF(W){\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}}), where we defined

We will refer to FRF(W){\mathcal{F}}_{{\sf RF}}({\bm{W}}) as the ‘random features’ (RF) model: it amounts to fixing the first layer, and only optimizing the coefficients in the second layer. Equivalently, FRF(W){\mathcal{F}}_{{\sf RF}}({\bm{W}}) corresponds to the first order Taylor expansion of fNNf_{{\sf NN}} with respect to the second layer weights (ai)iN(a_{i})_{i\leq N}. This model can be traced back to the work of Neal [Nea96], and was successfully developed by Rahimi and Recht [RR08] as a randomized approximation to kernel methods.

The second function class FNT(W){\mathcal{F}}_{{\sf NT}}({\bm{W}}) corresponds to the first order Taylor expansion of fNNf_{{\sf NN}} with respect to the first layer weights (wi)iN({\bm{w}}_{i})_{i\leq N} [JGH18]. We will refer to FNT(W){\mathcal{F}}_{{\sf NT}}({\bm{W}}) as the neural tangent classOften the term ‘neural tangent’ is reserved for the direct sum FNT(W)FRF(W){\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}}). We find it more convenient to give distinct names to each of the two terms, especially since FRF(W){\mathcal{F}}_{{\sf RF}}({\bm{W}}) has much smaller dimension than FNT(W){\mathcal{F}}_{{\sf NT}}({\bm{W}}) for large dd..

A sequence of recent papers proves that, in a certain overparametrized regime, gradient descent (GD) applied to the nonlinear neural network class FNN{\mathcal{F}}_{{\sf NN}} effectively converges to a model in FNT(W)FRF(W){\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}}). Namely, if the number of neurons NN is larger than a threshold N0(n,d)N_{0}(n,d), and training is initialized with f0(x)=N1/2i=1Na0,iσ(w0,i,x)f_{0}({\bm{x}})=N^{-1/2}\sum_{i=1}^{N}a_{0,i}\,\sigma(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle) where {(a0,i,w0,i)}iNiidN(0,1)N(0,Id/d)\{(a_{0,i},{\bm{w}}_{0,i})\}_{i\leq N}\sim_{iid}{\sf N}(0,1)\otimes{\sf N}(0,{\mathbf{I}}_{d}/d), then gradient descent converges exponentially fast to weights {(ai,wi)}iN\{(a_{i},{\bm{w}}_{i})\}_{i\leq N} such that ff0f-f_{0} is well approximated by a function in FNT(W)FRF(W){\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}}). The specific value of the threshold N0(n,d)N_{0}(n,d) for the onset of this NT regime has been steadily pushed down over the last year [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19].

Does the NT regime explain the power of multi-layers neural networks, when trained by gradient descent methods? From an empirical point of view, the evidence is not univocal [LXS+19, GSJW19, COB19]. From a theoretical point of view, while the expressivity of neural networks is superior to the one of NT models, this hypothesis is not easy to dismiss for at least two reasons. First, neural networks learned by gradient descent algorithms form a significantly smaller class than general networks. Second, the answer depends on the data distribution, the target function ff_{*} and the sample size.

In order to clarify this question, we explore the behavior of RF and NT models in the high-dimensional setting. More precisely, we consider two specific asymptotic regimes:

In both cases we obtain sharp results, up to errors vanishing as dd\to\infty. Crucially, our results hold pointwise, i.e. they provide a characterization of approximation and generalization error which hold for a given function ff_{*}. This allows us to derive precise separation results between NN and NT models.

2 A parenthesis

The approximation properties of neural networks have been studied for over three decades [DHM89, Cyb89, Hor91, Bar93, MM94, GJP95, Mha96, Pet98, Mai99, Pin99]. It is useful to discuss the relation between the questions outlined above and existing literature.

A number of results are available on the approximation of functions in certain smoothness classes by two-layers neural networks. In particular [Bar93] controls smoothness by the average frequency content in the Fourier transform (the ‘Barron norm’), while [Mha96, Pet98, Mai99] use classical Sobolev norms. For instance [Mai99] proves that NN-neurons NN approximate functions in the Sobolev ball W2rW^{r}_{2} with worst case error

for some unspecified functions C1,C2C_{1},C_{2}. (Similar results are found in [Pet98].) These results cannot be used for our purposes.

First of all, we are interested in the NT class which is potentially much less powerful than NN.

Second, bounds of the type (1) make it hard to prove separation results between NN and NT. In order to prove such a separation, we would have to prove that neural networks trained by gradient descent have good approximation properties, uniformly over Sobolev balls. This objective is currently out of reach. Our pointwise approximation results make it much easier to prove separation statements.

Third, earlier work neglects polynomial dependencies in dd. Bounds of the type (1) have weak implications when both dd and NN are large, say d=100d=100, N=106N=10^{6}. We will instead prove sharp asymptotic results that are valid in this regime. As illustrated in the next section, our analysis captures the actual behavior in a quantitative manner, already when d100d\geq 100.

Quantitative results in the high-dimensional regime have been proved only recently. In particular, Bach [Bac17b] established quantitative upper and lower bounds for the approximation error in the RF model. However, these results do not have direct implications on the NT model which is our main interest here. Further, lower bounds in [Bac17b] are, as before, worst case over a certain RKHS. (See also [Bac13, AM15, RR17] for related work.)

Similar considerations apply to the generalization error of kernel methods. While this is a classical topic [CST+00, CDV07, RR17, LR18], earlier work proves minimax upper and lower bounds. Establishing pointwise lower bounds is instead important in order to understand precisely the separation between neural networks and their linearized counterparts. We refer to Section 4 for further discussion of related work.

3 A numerical experiment

Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3. We use shifted ReLU activations σ(u)=max(uu0,0)\sigma(u)=\max(u-u_{0},0), u0=0.5u_{0}=0.5. The choice of u0=0.5u_{0}=0.5 is not essential: (Lebesgue-)almost every u00u_{0}\neq 0 has similar behavior. In contrast, the case u0=0u_{0}=0 is degenerate because max(u,0)\max(u,0) is equal to a linear function plus an even function.

The target functions ff_{\star} in these examples are quite simple. Figures 1 and 2 use a quadratic function f,2(x)=id/2xi2i>d/2xi2f_{\star,2}({\bm{x}})=\sum_{i\leq\lfloor d/2\rfloor}x_{i}^{2}-\sum_{i>\lfloor d/2\rfloor}x_{i}^{2}. In Figure 3, the target function is a third-order polynomial f,3(x)=i=1d(xi33xi)f_{\star,3}({\bm{x}})=\sum_{i=1}^{d}(x_{i}^{3}-3x_{i}).

The results are somewhat disappointing: in two cases (first and third figures) RF and NT models do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and it appears to decrease to as the number of samples nn increase. We also note that the risk shows a cusp when npn\approx p, with pp the number of parameters (p=Np=N for RF, and p=Ndp=Nd for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19, MM19] for relevant work). We will instead focus on the population behavior nn\to\infty.

In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third order polynomial. Our main theorems (presented in the next sections) capture in a precise manner this behavior. In particular,

We will prove that for N=Od(d2δ)N=O_{d}(d^{2-\delta}) , RF does not outperform the trivial predictor on any function that has vanishing projection on linear functions. Similarly, NT does not outperform the trivial predictor on any function that has vanishing projection on linear and quadratic functions.

In contrast, there exists neural networks in FNN{\mathcal{F}}_{{\sf NN}} with N=Od(d)N=O_{d}(d) neurons, and a small approximation error both for f,2f_{\star,2} and f,3f_{\star,3} (see, e.g., [Bac17b], or [MMN18, Proposition 1]).

These two points illustrate the gap in approximation power between NT (or RF) and NN.

4 Summary of main results

The equivalence between RF regression and polynomial regression holds pointwise for target function ff_{\star}.

Again, this result holds pointwise over the choice of ff_{\star}.

The second aspect can be summarized as follows.

Approximation error of linearized neural networks

In this section, we state formally our results about the approximation error of RF{\sf RF} and NT{\sf NT} models. We define the minimum population error for any of the models M{RF,NT}{\sf M}\in\{{\sf RF},{\sf NT}\} by

See Section 6 for the proof of lower bound, and Section 7 for the proof of upper bound.

τd11\tau^{1}_{d-1} is supported on [d,d][-\sqrt{d},\sqrt{d}]. It is therefore sufficient that supudσd(u)=C1(d)<\sup_{|u|\leq\sqrt{d}}|\sigma_{d}(u)|=C_{1}(d)<\infty.

By an explicit calculation, the density of τd11(du)=C2(d)(1u2/d)(d3)/2du\tau^{1}_{d-1}({\rm d}u)=C_{2}(d)(1-u^{2}/d)^{(d-3)/2}{\rm d}u. Since this density is bounded, it is sufficient that σd\sigma_{d} is square integrable with respect to the Lebesgue measure on [d,d][-\sqrt{d},\sqrt{d}].

2 Approximation error of neural tangent models

The function σ\sigma is weakly differentiable, with weak derivative σ\sigma^{\prime} such that σ(u)2c0exp(c1u2/2)\sigma^{\prime}(u)^{2}\leq c_{0}\exp(c_{1}u^{2}/2) for some constants c0,c1c_{0},c_{1}, with c1<1c_{1}<1.

See Section 8 for the proof of lower bound, and Section 9 for the proof of upper bound.

For instance the ReLU activation σ(u)=max(u,0)\sigma(u)=\max(u,0) and its weak derivative σ(x)=1x0\sigma^{\prime}(x)={\bm{1}}_{x\geq 0} have subexponential growth. Further its Hermite coefficients are μ0(σ)=1/2\mu_{0}(\sigma^{\prime})=1/2 and

Assumption 2.(c) does not hold for ReLU activation σ(u)=max(u,0)\sigma(u)=\max(u,0), since μk(σ)=0\mu_{k}(\sigma)=0 for kk even. However it holds for shifted ReLU σ(u)=max(uu0,0)\sigma(u)=\max(u-u_{0},0), for a generic value of the shift u0u_{0}.

Theorems 1 and 2 can be illustrated by a cartoon, which we show as Figure 5. In words, the approximation error plotted as a function of log(#parameters)/logd\log(\#\text{parameters})/\log d follows a staircase: it drops close to integer values of this ratio, with each drop corresponding to the projection onto homogeneous polynomials of that degree. We can extract three useful statistical insights from these findings:

There is no difference between plain RF and the more recent NT approach in terms of approximation error, once we compare them at fixed number of parameters pp. All that changes is the relation between number of parameters and number of neurons: p=Np=N for RF, and p=Ndp=Nd for NT. The recent work [GMMM19] actually shows some advantage for the RF model, although in a special case. It is worth mentioning that the same equivalence holds when we consider the dependence on the sample size nn, at N=N=\infty, see Section 3.

We notice however an important computational advantage for NT, at constant parameters number. Indeed, the complexity at prediction time is O(Nd)=O(p)O(Nd)=O(p) for NT, while it is O(Nd)=O(pd)O(Nd)=O(pd) for RF.

3 Separation between NN and RF, NT

Crucially, as proven in [MBM16], running gradient descent over the space of neural networks consisting of a single neuron allows to learn the target function f(x)=σ(w,x)f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle) efficiently. In other words, we do not have simply a separation between the function classes FNN{\mathcal{F}}_{{\sf NN}} and FRF{\mathcal{F}}_{{\sf RF}} or FNT{\mathcal{F}}_{{\sf NT}}, but a separation between linearized neural networks, and neural networks trained by gradient descent.

Essentially the same example was independently considered by Yehudai and Shamir in concurrent work [YS19]. These authors prove that there exist finite constants c0,c1>0c_{0},c_{1}>0 such that, if Nexp{c1d}N\leq\exp\{c_{1}d\} and the coefficients ai,aia_{i},{\bm{a}}_{i} have magnitude at most exp{c1d}\exp\{c_{1}d\}, then there exists a vector w{\bm{w}}_{\star} such that, setting f(x)=σ(w,x)f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle), then RRF(f;W),RNT(f;W)c0R_{{\sf RF}}(f_{*};{\bm{W}}),R_{{\sf NT}}(f_{*};{\bm{W}})\geq c_{0}. An important difference with respect to our separation result is in the fact that Eq. 10 holds –once again– pointwise, i.e. for any fixed w{\bm{w}}_{\star}, while in [YS19] w{\bm{w}}_{\star} is chosen by an adversary who has knowledge of the vectors (wi)iN({\bm{w}}_{i})_{i\leq N}. Let us emphasize there are other important differences between our setting and the one of [YS19], and neither of the two analysis implies the other.

Generalization error of kernel methods

Analogously, ridge regression in FNT(W){\mathcal{F}}_{{\sf NT}}({\bm{W}}) can be shown to converge to KRR with respect to the kernel

Section 3.1 presents a lower bound on the prediction error of general kernel methods, and Section 3.2 derives an upper bound for kernel ridge regression.

Consider any regression method of the form

where fH\|f\|_{H} is the reproducing kernel Hilbert space (RKHS) norm with respect to the kernel HH of the form (13). By the representer theorem [BTA11] there exist coefficients a^1,,a^n\hat{a}_{1},\dots,\hat{a}_{n} such that

We are therefore led to define the following data-dependent prediction risk function for kernel methods

The next theorem provides a decomposition of this generalization error that is analogous to the one given in Theorem 1.(a)(a). Notice however that the controlling factor is not the number of neurons NN, but instead the sample size nn.

This follows immediately from Theorem 1.(a)(a). Indeed, setting σd(u)=hd(u/d)\sigma_{d}(u)=h_{d}(u/\sqrt{d}) and wi=xi/d{\bm{w}}_{i}={\bm{x}}_{i}/\sqrt{d}, we obtain RH(fd,X)=RRF(fd,W)R_{H}(f_{d},{\bm{X}})=R_{{\sf RF}}(f_{d},{\bm{W}}), whence the claim follows by applying Eq. (3). ∎

2 Upper bound for kernel ridge regression

where the kernel matrix H=(Hij)ij[n]{\bm{H}}=(H_{ij})_{ij\in[n]} is given by

and y=(y1,,yn)T{\bm{y}}=(y_{1},\ldots,y_{n})^{\mathsf{T}}. The prediction function at location x{\bm{x}} is given by

The test error of empirical kernel ridge regression is defined as

We assume that {hd}d1\{h_{d}\}_{d\geq 1} are positive-definite kernels, and we consider the associated eigenvalues:

where we recall that Qk(d)Q_{k}^{(d)} is the kk-th Gegenbauer polynomial.

If hdh_{d} has zero mean (i.e. hd(de1,x)τd(dx)=0\int h_{d}(\sqrt{d}\langle{\bm{e}}_{1},{\bm{x}}\rangle)\tau_{d}({\rm d}{\bm{x}})=0) further assume that fdf_{d} is centered (i.e. fd(x)τd(dx)=0\int f_{d}({\bm{x}})\tau_{d}({\rm d}{\bm{x}})=0).

See Section 10 for the proof of this theorem.

Assume hdhh_{d}\to h as dd\to\infty, uniformly over [δ,δ][-\delta,\delta], together with its derivatives, and further assume hd(x)c0exp(c1x2/2)|h_{d}(x)|\leq c_{0}\exp(c_{1}x^{2}/2) for some c0>0c_{0}>0, c1<1c_{1}<1. We expect this to be the case for many kernels of interest, and in particular it can be shown to be the case for hdRFh_{d}^{{\sf RF}} and hdNTh_{d}^{{\sf NT}} under mild conditions on the activation σ\sigma. Using Rodrigues’ formula described in Section 5.2, by an application of integration by part followed by dominated convergence, we get

3 Separation between kernel methods and neural networks

Repeating the same argument of Section 2.3, we see that Theorems 3 and 4 imply a separation between kernel methods, with rotationally invariant kernels, and gradient-descent trained neural networks.

Namely, consider again the target function f(x)=σ(w,x)f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle), for w2=1\|{\bm{w}}_{\star}\|_{2}=1. As proven in [MBM16], ff_{\star} can be learnt efficiently by minimizing the following empirical risk via gradient descent:

Namely, if nCdlogdn\geq C\,d\log d samples are used (and under some technical conditions on σ\sigma), gradient descent reaches prediction error of order (dlogd)/n(d\log d)/n

This test error is achieved by kernel ridge regression.

4 Near-optimality of interpolators

Optimal test error is achieved for any λ<λ\lambda<\lambda_{*}. In particular, by taking λ0\lambda\to 0, we obtain an interpolator, i.e. a predictor that interpolates the data (yi,xi)(y_{i},{\bm{x}}_{i}). This remark is made quantitative in the following bound on the empirical risk

Recall that the empirical risk of KRR is given by Eq. (24), where f^λ=(f^λ(x1),,f^λ(xn))\hat{\bm{f}}_{\lambda}=(\hat{f}_{\lambda}({\bm{x}}_{1}),\ldots,\hat{f}_{\lambda}({\bm{x}}_{n})) can be rewritten as

where we simply used the law of large numbers y22/nfdL22+τ2\|{\bm{y}}\|_{2}^{2}/n\to\|f_{d}\|_{L^{2}}^{2}+\tau^{2}. ∎

5 A conjecture for generalization error of random features model

Under the same data model of the previous sections, we are interested in the test prediction error

Theorem 1 characterized the test error RRF(fd,X,W,λ)R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda) in the population limit n=n=\infty, whereas Theorems 3 and 4 characterize the same quantity in the case when N=N=\infty.

What happens when both nn and NN are finite? In the proportional regime NdN\propto d and ndn\propto d, the precise asymptotics of RRF(fd,X,W,λ)R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda) was calculated in [MM19].

Further related work

Donoho and Johnstone [DJ89] study an approximation problem analogous to the one we considered in Section 2, although in d=2d=2 dimensions. Their problem essentially reduces to determining rates of approximation on the unit circle, with the technical difference that the wi{\bm{w}}_{i}’s are equi-spaced along the circle instead of being random. As for other references mentioned in Section 1.2, the lower bounds of [DJ89] are worst case over differentiable functions.

After the present paper appeared as a preprint, several authors presented important contributions to the same line of work. In particular, Liang, Rakhlin, and Zhai [LRZ19] studies kernel ridge regression in dd dimension using n=Od(dγ)n=O_{d}(d^{\gamma}) samples. Assuming the target function has bounded RKHS norm, they derive upper and lower bounds on the rate of convergence of the generalization error. This result is related to our Theorem 3. The most important difference is that we do not assume that the target function has bounded RKHS norm. Instead we obtain the precise asymptotics of the generalization error in a regime in which it is non-vanishing. As illustrated in Section 1.3, this asymptotic analysis captures indeed the actual behavior in practically reasonable settings.

From a technical viewpoint, several of our calculations make use of harmonic analysis over the dd-dimensional sphere, as it is natural given that xi{\bm{x}}_{i}’s are uniform over the sphere. Spherical harmonics expansion appear in related contexts, e.g. in [DJ89, Bac17a, VW18].

Let us finally mention that an alternative approach to the analysis of two-layers neural networks in the wide limit, was developed in [MMN18, RVE18, SS18, CB18, MMM19] using mean field theory. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.

Technical background

The dimension of each subspace is given by

2 Gegenbauer polynomials

We will use the following properties of Gegenbauer polynomials

Note in particular that property 2 implies that –up to a constant– Qk(d)(x,y)Q_{k}^{(d)}(\langle{\bm{x}},{\bm{y}}\rangle) is a representation of the projector onto the subspace of degree -kk spherical harmonics

then we have the following equation holds in L2([d,d],τd11)L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1}) sense

By rotational invariance, the space VkV_{k} of homogeneous polynomials of degree kk is an eigenspace of \mathscrsfsHd\mathscrsfs{H}_{d}, and we will denote the corresponding eigenvalue by ξd,k(hd)\xi_{d,k}(h_{d}). In other words \mathscrsfsHdf(x):=k=0λd,k(hd)Pkf\mathscrsfs{H}_{d}f({\bm{x}}):=\sum_{k=0}^{\infty}\lambda_{d,k}(h_{d}){\mathsf{P}}_{k}f. The eigenvalues can be computed via

3 Hermite polynomials

Here and below, for PP a polynomial, Coeff{P(x)}{\rm Coeff}\{P(x)\} is the vector of the coefficients of PP. As a consequence, for any fixed integer kk, we have

where μk(σ)\mu_{k}(\sigma) and λd,k(σ)\lambda_{d,k}(\sigma) are given in Eq. (42) and (38).

4 Notations

Proof of Theorem 1.(a): RF model lower bound

Define the random matrix U=(Uij)i,j[N]{\bm{U}}=(U_{ij})_{i,j\in[N]}, with

In what follows, we write RRF(fd)=RRF(fd,W)=RRF(fd,Θ/d)R_{{\sf RF}}(f_{d})=R_{{\sf RF}}(f_{d},{\bm{W}})=R_{{\sf RF}}(f_{d},{\bm{\Theta}}/\sqrt{d}) for the random features risk, omitting the dependence on the weights W=Θ/d{\bm{W}}={\bm{\Theta}}/\sqrt{d}. By the definition and a simple calculation, we have

where the last inequality used the fact that

This is achieved by the Proposition 1 and 2 stated below.

The proof of Proposition 2 relies on the following tight bound on the operator norm of the Gegenbauer polynomials of the Gram matrix:

The proofs of these three propositions are provided in the next sections. Proposition 1 implies

From Proposition 2, we have with high probability

Then by Markov inequality, we have with high probability

where K(d1,j)=(d2+jj)K(d-1,j)={d-2+j\choose j} is non-negative for d2d\geq 2. This immediately shows that B(d,k)B(d,k) is non-decreasing in kk. ∎

2 Proof of Proposition 1

and the Gegenbauer expansion of σd\sigma_{d} gives

3 Proof of Proposition 2

Recall the expansion of σd\sigma_{d} in terms of Gegenbauer polynomials, see Eqs. (51) and (52). From the properties of Gegenbauer polynomials, we have

where Wk=(Wk,ij)i,j[N]{\bm{W}}_{k}=(W_{k,ij})_{i,j\in[N]} with Wk,ij=Qk(θi,θj)W_{k,ij}=Q_{k}(\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle).

As a result, we have U^0\hat{\bm{U}}\succeq 0, and hence

In the following, we give a lower bound for Uˉ\bar{\bm{U}}. Note we have

Hence, there exists constant CC^{\prime}, such that for large dd, we have

Plug Eq. (56) into Eq. (53), we get with high probability

4 Proof of Proposition 3

Step 1. Bounding operator norm by moments.

We define Δ=WId{\bm{\Delta}}={\bm{W}}-{\mathbf{I}}_{d}. Then we have

For any sequence of integers p=p(d)p=p(d), we have

To prove the proposition, it suffices to show that for any sequence AdA_{d}\to\infty, we have

To calculate this quantity, we will apply repeatedly the following identity, which is an immediate consequence of Eq. (33). For any i1,i2,i3i_{1},i_{2},i_{3} distinct, we have

Throughout the proof, we will denote by C,C,CC,C^{\prime},C^{\prime\prime} constants that may depend on kk but not on p,d,Np,d,N. The value of these constants is allowed to change from line to line.

Step 2. The induced graph and equivalence of index sequences.

For any index sequence i=(i1,i2,,i2p)[N]2p{\bm{i}}=(i_{1},i_{2},\ldots,i_{2p})\in[N]^{2p}, we defined an undirected multigraph Gi=(Vi,Ei)G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}}) associated to index sequence i{\bm{i}}. The vertex set ViV_{\bm{i}} is the set of distinct elements in i1,,i2pi_{1},\ldots,i_{2p}. The edge set EiE_{{\bm{i}}} is formed as follows: for any j[2p]j\in[2p] we add an edge between iji_{j} and ij+1i_{j+1} (with convention 2p+112p+1\equiv 1). Notice that this could be a self-edge, or a repeated edge: Gi=(Vi,Ei)G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}}) will be –in general– a multigraph. We denote v(i)=Viv({\bm{i}})=|V_{\bm{i}}| to be the number of vertices of GiG_{\bm{i}}, and e(i)=Eie({\bm{i}})=|E_{\bm{i}}| to be the number of edges (counting multiplicities). In particular, e(i)=ke({\bm{i}})=k for i[N]k{\bm{i}}\in[N]^{k}. We define

For any two index sequences i1,i2{\bm{i}}_{1},{\bm{i}}_{2}, we say they are equivalent i1i2{\bm{i}}_{1}\asymp{\bm{i}}_{2}, if the two graphs Gi1G_{{\bm{i}}_{1}} and Gi2G_{{\bm{i}}_{2}} are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalent class of i{\bm{i}} to be

We define the quotient set Q(p){\mathcal{Q}}(p) by

For any integer k2k\geq 2 and i=(i1,,ik)[N]k{\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}, we define

The following properties holds for all sufficiently large NN and dd:

For any equivalent index sequences i=(i1,,i2p)j=(j1,,j2p){\bm{i}}=(i_{1},\ldots,i_{2p})\asymp{\bm{j}}=(j_{1},\ldots,j_{2p}), we have Mi=MjM_{{\bm{i}}}=M_{{\bm{j}}}.

For any index sequence i[N]2pT(p){\bm{i}}\in[N]^{2p}\setminus{\mathcal{T}}_{\star}(p), we have Mi=0M_{{\bm{i}}}=0.

For any index sequence iT(p){\bm{i}}\in{\mathcal{T}}_{\star}(p), the degree of any vertex in GiG_{\bm{i}} must be even.

The number of equivalent classes Q(p)(2p)2p|{\mathcal{Q}}(p)|\leq(2p)^{2p}.

Recall that v(i)=Viv({\bm{i}})=|V_{\bm{i}}| denotes the number of distinct elements in i{\bm{i}}. Then, for any i[N]2p{\bm{i}}\in[N]^{2p}, the number of elements in the corresponding equivalence class satisfies C(i)v(i)v(i)Nv(i)ppNv(i)|{\mathcal{C}}({\bm{i}})|\leq v({\bm{i}})^{v({\bm{i}})}\cdot N^{v({\bm{i}})}\leq p^{p}N^{v({\bm{i}})}.

Properties (a)(a), (b)(b) and (c)(c) are straightforward. Note that v(i)2pv({\bm{i}})\leq 2p for any i[N]2p{\bm{i}}\in[N]^{2p}. For property (d)(d), notice that to each distinct equivalence class we can associate, in an injective manner, a string of length 2p2p over an alphabet of size 2p2p (simply follow the elements in i{\bm{i}} in order, and replace the labels by some canonical ones, e.g. {1,2,3,}\{1,2,3,\dots\} in order of appearance). Therefore the number of classes is bounded as

For property (e)(e), we need to bound the number of elements in C(i){\mathcal{C}}({\bm{i}}) for representative i{\bm{i}} with degree v(i)v({\bm{i}}). Define a mapping ψ:C(i)[N]v(i)\psi:{\mathcal{C}}({\bm{i}})\to[N]^{v({\bm{i}})} as follows. For i[N]2p{\bm{i}}\in[N]^{2p}, ψ(i)\psi({\bm{i}}) is a vector of the distinct elements in i{\bm{i}}, listed in increasing order. For any k[N]v(i){\bm{k}}\in[N]^{v({\bm{i}})}, the pre-image ψ1(k)\psi^{-1}({\bm{k}}) contains at most v(i)!v(i)v(i)v({\bm{i}})!\leq v({\bm{i}})^{v({\bm{i}})} elements. As a result, we have

In view of property (a)(a) in the last lemma, given an equivalence class C=C(i){\mathcal{C}}={\mathcal{C}}({\bm{i}}), we will write MC=MiM_{{\mathcal{C}}}=M_{{\bm{i}}} for the corresponding value common to the equivalence class C{\mathcal{C}}.

It is easy to see that the outcome of this process is independent of the order in which we select vertices.

For illustration, we give two examples of skeletonization processes:

Let i=(1,2,1,3,4,3){\bm{i}}=(1,2,1,3,4,3), and set i0=i{\bm{i}}_{0}={\bm{i}}. First notice that {2,4}\{2,4\} are redundant vertices and we can remove them in arbitrary order to get i2=(1,3){\bm{i}}_{2}=(1,3). Then notice that 33 is redundant whence we get i3={1}{\bm{i}}_{3}=\{1\}. Hence we have r(i)=3r({\bm{i}})=3, and sk(i)=(1){\rm sk}({\bm{i}})=(1).

Consider the skeletonization process of j=(1,2,3,2,4,3){\bm{j}}=(1,2,3,2,4,3). Take j0=j{\bm{j}}_{0}={\bm{j}}. First notice that {1,4}\{1,4\} are redundant vertices and can be removed in arbitrary order to get j2=(2,3,2,3){\bm{j}}_{2}=(2,3,2,3). We see that there is no further redundant vertex in Gj1G_{{\bm{j}}_{1}}, so that r(j)=2r({\bm{j}})=2, and sk(j)=j1=(2,3,2,3){\rm sk}({\bm{j}})={\bm{j}}_{1}=(2,3,2,3).

For the above skeletonization process, the following properties hold:

If ij[N]p{\bm{i}}\asymp{\bm{j}}\in[N]^{p}, then sk(i)sk(j){\rm sk}({\bm{i}})\asymp{\rm sk}({\bm{j}}). That is, the skeletons of equivalent index sequences are equivalent.

For any i=(i1,,ik)[N]k{\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}, define

For any iT(p)[N]2p{\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}, its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to 44.

Property (a)(a) holds by the definition of equivalence which is graph isomorphism. Property (b)(b) used the fact that, if ij1i\neq j_{1} and ij2i\neq j_{2}, we have

so that deleting a redundant vertex will contribute a 1/B(d,k)1/B(d,k) factor.

To show property (c)(c), note that any intermediate index sequence is{\bm{i}}_{s} in the skeletonization process is such that GisG_{{\bm{i}}_{s}} only has even degree vertices, is connected, and has no self-edges (by induction). Hence, Gsk(i)G_{{\rm sk}({\bm{i}})} only has even degree vertices, is connected, and has no self-edges. Note that Gsk(i)G_{{\rm sk}({\bm{i}})} cannot have degree-2 vertices, and has at least one vertex (because the last vertex is not removed). Therefore, as long as sk(i){\rm sk}({\bm{i}}) contains at least two vertices, Gsk(i)G_{{\rm sk}({\bm{i}})} can only contain vertices with degree greater or equal to 44. ∎

Given an index sequence iT(p)[N]2p{\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}, we say i{\bm{i}} is of type 1, if sk(i){\rm sk}({\bm{i}}) contains only one index. We say i{\bm{i}} is of type 2 if sk(i){\rm sk}({\bm{i}}) has more than one index (so that by Lemma 3, Gsk(i)G_{{\rm sk}({\bm{i}})} can only contain vertices with degree greater or equal to 44). Denote the class of type 1 index sequence (respectively type 2 index sequence) by T1(p){\mathcal{T}}_{1}(p) (respectively T2(p){\mathcal{T}}_{2}(p)). We also denote by T~a(p)\widetilde{\mathcal{T}}_{a}(p), a{1,2}a\in\{1,2\} the set of equivalence classes of sequences in Ta(p){\mathcal{T}}_{a}(p). This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.

Recall that v(i)v({\bm{i}}) is the number of vertices in GiG_{\bm{i}}, and e(i)e({\bm{i}}) is the number of edges in GiG_{\bm{i}} (which coincides with the length of i{\bm{i}}). We consider iT1(p){\bm{i}}\in{\mathcal{T}}_{1}(p). Since for iT1(p){\bm{i}}\in{\mathcal{T}}_{1}(p), every edge of GiG_{\bm{i}} must be at most a double edge. Indeed, if (u1,u2)(u_{1},u_{2}) had multiplicity larger than 22 in GiG_{{\bm{i}}}, neither u1u_{1} nor u2u_{2} could be deleted during the skeletonization process, contradicting the assumption that sk(i){\rm sk}({\bm{i}}) contains a single vertex. Therefore, we must have miniT1v(i)=p+1\min_{{\bm{i}}\in{\mathcal{T}}_{1}}v({\bm{i}})=p+1. According the Lemma 3.(b)(b), for every iT1(p){\bm{i}}\in{\mathcal{T}}_{1}(p), we have

Note by Lemma 2.(e)(e), the number of elements in the equivalence class of i{\bm{i}} is C(i)ppNv(i)|{\mathcal{C}}({\bm{i}})|\leq p^{p}\cdot N^{v({\bm{i}})}. Hence we get

where in the last step we used Lemma 2 and the fact that B(d,k)C0dkB(d,k)\geq C_{0}d^{k} for some C0>0C_{0}>0.

We have the following simple lemma bounding MiM_{\bm{i}}. This bound is useful when i{\bm{i}} is a skeleton.

There exists constants CC and d0d_{0} depending uniquely on kk such that, for any dd0(k)d\geq d_{0}(k), and any index sequence i[N]m{\bm{i}}\in[N]^{m} with 2md/(4k)2\leq m\leq d/(4k), we have

The lemma following by the claim that (for dd0(k)d\geq d_{0}(k))

Therefore there exists a constant C0C_{0} such that for all dd large enough

As a consequence, for any integer mm, we have

Combining the above two upper bounds (63) and (64), we have

By noting that B(d,k)C0dkB(d,k)\geq C_{0}d^{k} for some C0>0C_{0}>0, this proves the claim. ∎

Suppose iT2(p){\bm{i}}\in{\mathcal{T}}_{2}(p), and denote v(i)v({\bm{i}}) to be the number of vertices in GiG_{\bm{i}}. We have, for a sequence p=od(d)p=o_{d}(d)

Here (1)(1) holds by Lemma 3.(b)(b); (2)(2) by Lemma 4, and the fact that sk(i)[N]e(sk(i)){\rm sk}({\bm{i}})\in[N]^{e({\rm sk}({\bm{i}}))}, together by B(d,k)C0dkB(d,k)\geq C_{0}d^{k}; (3)(3) because e(sk(i))2pe({\rm sk}({\bm{i}}))\leq 2p; (4)(4) by Lemma 3.(c)(c), implying that for iT2(p){\bm{i}}\in{\mathcal{T}}_{2}(p), each vertex of Gsk(i)G_{{\rm sk}({\bm{i}})} has degree greater or equal to 44, so that v(sk(i))e(sk(i))/2v({\rm sk}({\bm{i}}))\leq e({\rm sk}({\bm{i}}))/2 (notice that for dd0(k)d\geq d_{0}(k) we can assume Cp/d<1Cp/d<1). Finally, (5)(5) follows since r(i),v(sk(i))v(i)r({\bm{i}}),v({\rm sk}({\bm{i}}))\leq v({\bm{i}}), and (6)(6) the definition of r(i)r({\bm{i}}) implying r(i)=v(i)v(sk(i))r({\bm{i}})=v({\bm{i}})-v({\rm sk}({\bm{i}})).

Note by Lemma 2.(e)(e), the number of elements in equivalent class C(i)pv(i)Nv(i)|{\mathcal{C}}({\bm{i}})|\leq p^{v({\bm{i}})}\cdot N^{v({\bm{i}})}. Since v(i)v({\bm{i}}) depends only on the equivalence class of i{\bm{i}}, we will write, with a slight abuse of notation v(i)=v(C(i))v({\bm{i}})=v({\mathcal{C}}({\bm{i}})). Notice that the number of equivalence classes with v(C)=vv({\mathcal{C}})=v is upper bounded by the number multi-graphs with vv vertices and 2p2p edges, which is at most v4pv^{4p}. Hence we get

Define ε=CNpk+1/dk\varepsilon=CNp^{k+1}/d^{k}. We will assume hereafter that pp is selected such that

By calculus and condition (68), the function F(v)=v4pεvF(v)=v^{4p}\varepsilon^{v} is maximized over v[2,2p]v\in[2,2p] at v=2v=2, whence

Using Eqs. (61) and (69), we have, for any p=od(d)p=o_{d}(d) satisfying Eq. (68), we have

Finally setting N=dke2AlogdN=d^{k}e^{-2A\sqrt{\log d}} and p=(k/A)logdp=(k/A)\sqrt{\log d}, this yields

Proof of Theorem 1.(b): RF model upper bound

Consider f^RF(x;Θ,a)=i=1Naiσd(θi,x/d)\hat{f}_{{\sf RF}}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}a_{i}\sigma_{d}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle/\sqrt{d}). We can expand the risk achieved at parameter a{\bm{a}} as

Proof of Theorem 2.(a): NT model lower bound

We begin with some notations and simple remarks.

Assume σ\sigma is an activation function with σ(u)2c0exp(c1u2/2)\sigma(u)^{2}\leq c_{0}\,\exp(c_{1}\,u^{2}/2) for some constants c0>0c_{0}>0 and c1<1c_{1}<1. Then

A simple calculation shows that Cd(2π)1/2C_{d}\to(2\pi)^{-1/2} as dd\to\infty, and hence supdCdC<\sup_{d}C_{d}\leq\overline{C}<\infty. Therefore

where the last inequality holds provided dd0=10/(1c1)d\geq d_{0}=10/(1-c_{1}).

Finally, for point 3, without loss of generality we will take w=e1{\bm{w}}={\bm{e}}_{1}, so that w,x=x1\langle{\bm{w}},{\bm{x}}\rangle=x_{1}. By the same argument given above (and since both GG and x1x_{1} have densities bounded uniformly in dd), for any M>0M>0 we can choose σM\sigma_{M} bounded continuous so that for any dd,

It is therefore sufficient to prove the claim for σM\sigma_{M}. Letting ξN(0,Id1){\bm{\xi}}\sim{\sf N}(0,{\mathbf{I}}_{d-1}), independent of GG, we construct the coupling via

where we set x=(x1,x){\bm{x}}=(x_{1},{\bm{x}}^{\prime}). We thus have x1Gx_{1}\to G almost surely, and the claim follows by weak convergence. ∎

We denote the Hermite decomposition of σ\sigma by

We state separately the assumptions of Theorem 2.(a) for future reference.

It is also useful to notice that the Hermite coefficients of x2σ(x)x^{2}\sigma^{\prime}(x) can be computed from the ones of σ(x)\sigma^{\prime}(x) using the relation μk(x2σ)=μk+2(σ)+[1+2k]μk(σ)+k(k1)μk2(σ)\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+[1+2k]\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime}).

2 Proof of Theorem 2.(a): Outline

Proceeding as for the RF model, we obtain

This is achieved in the following two propositions.

Let σ\sigma be an activation function satisfying Assumption 4. Define

These two propositions will be proven in the next sections. Proposition 4 shows that

3 Proof of Proposition 4

We denote the Gegenbauer decomposition of σ(e,)\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle) by

By Lemma 5, applied to function σ\sigma^{\prime} (instead of σ\sigma), under Assumption 4, we have σ(e,)L22C\|\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)\|_{L^{2}}^{2}\leq C (for CC a constant independent of dd). We therefore have (recalling the normalization of the Gegenbauer polynomials in Eq. (32))

where in the last step we used Eq. (33). By the recurrence relationship for Gegenbauer polynomials (35), we have

We use the convention that td,1=0t_{d,-1}=0. This gives

The last inequality follows by Eqs. (89) and (91).

Using the fact that the kernel HH preserve the decomposition (29), we have

where we used the fact that B(d,k)B(d,k) is non-decreasing in kk given by Lemma 1. This concludes the proof.

4 Proof of Proposition 5

In the proof of this proposition, we will need the following lemmas.

We recall the following two formulas for k1k\geq 1 (see Section 5.2):

Furthermore, we have Q0(d)(x)=1Q^{(d)}_{0}(x)=1, Q1(d)(x)=x/dQ^{(d)}_{1}(x)=x/d and therefore therefore xQ0(d)(x)=dQ1(d)(x)xQ^{(d)}_{0}(x)=dQ^{(d)}_{1}(x). We insert these expressions in the expansion of the function ψ\psi

Matching the coefficients of the expansion yields

Similarly, we can write the decomposition of x2ψ(x)x^{2}\psi(x) to be

where the coefficients are given by the same relation as in the above lemma

Case 1: θ1θ2{\bm{\theta}}_{1}\neq{\bm{\theta}}_{2}.

Case 2: θ1=θ2{\bm{\theta}}_{1}={\bm{\theta}}_{2}.

Similarly, for some fixed α\alpha and β\beta, we define

We can therefore fix u1(1)=αu_{1}(1)=\alpha and u2(1)+u3(1)=β/2u_{2}(1)+u_{3}(1)=\beta/2. ∎

Let σ\sigma be an activation function such that σ(u)c0exp(c1u2)\sigma(u)\leq c_{0}\exp(c_{1}u^{2}) for some constants c0,c1c_{0},c_{1}, with c1<1c_{1}<1. Let the Hermite and Gegenbauer decompositions of σ\sigma be

Recall the correspondence (43) between Gegenbauer and Hermite polynomials. Note for any monomial mk(x)=xkm_{k}(x)=x^{k}, by Lemma 5.(c)(c), we have

For any fixed kk, let Qk(d)(x)Q_{k}^{(d)}(x) be the kk-th Gegenbauer polynomial. We expand

Using the correspondence (43) between Gegenbauer and Hermite polynomials we have

where the last inequality holds for all dd large enough, since Cd(2π)1/2C_{d}\to(2\pi)^{-1/2} as dd\to\infty. Hence, we have:

Taking t=O(log(d)1/2d1/2)t=O(\log(d)^{1/2}d^{-1/2}), we get

4.2 Proof of Proposition 5

Step 1. Construction of the activation function σ^\hat{\sigma}.

for some δ1,δ2\delta_{1},\delta_{2} that we will fix later (with δt1|\delta_{t}|\leq 1).

Step 2. The functions u,u^{\bm{u}},\hat{\bm{u}} and uˉ\bar{\bm{u}}.

Let u{\bm{u}} and u^\hat{\bm{u}} be the matrix-valued functions associated respectively to σ\sigma^{\prime} and σ^\hat{\sigma}^{\prime}

From Lemma 7, there exists functions u1,u2,u3u_{1},u_{2},u_{3} and u^1,u^2,u^3\hat{u}_{1},\hat{u}_{2},\hat{u}_{3}, such that

We define uˉ=uu^\bar{\bm{u}}={\bm{u}}-\hat{\bm{u}}. Then we can write

where uˉk=uku^k\bar{u}_{k}=u_{k}-\hat{u}_{k} for k=1,2,3k=1,2,3.

Step 3. Construction of the kernel matrices.

Note that we have U=U^+Uˉ{\bm{U}}=\hat{\bm{U}}+\bar{\bm{U}}. By Eq. (101) and (98), it is easy to see that U^0\hat{\bm{U}}\succeq 0. Then we have UUˉ{\bm{U}}\succeq\bar{\bm{U}}. In the following, we would like to lower bound matrix Uˉ\bar{\bm{U}}.

Denoting γij=θi,θj/d<1\gamma_{ij}=\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle/d<1, we get, from Eq. (92),

We get similar expressions for U^ij\hat{\bm{U}}_{ij} with λk,d(σ)\lambda_{k,d}(\sigma^{\prime}) replaced by λk,d(σ^)\lambda_{k,d}(\hat{\sigma}^{\prime}). Because we defined σ\sigma^{\prime} and σ^\hat{\sigma}^{\prime} by only modifying the k1k_{1}-th and k2k_{2}-th coefficients, we get

Recalling that λk,d(1)\lambda_{k,d}^{(1)} only depend on λk1,d\lambda_{k-1,d} and λk+1,d\lambda_{k+1,d} (Lemma 6), we get

By Assumption 4 and the convergence in Lemma 8, for any fixed kk,

From Lemma 9, we recall that the coefficients of the kk-th Gegenbauer polynomial Qk(d)(x)=s=0kpk,s(d)xsQ_{k}^{(d)}(x)=\sum_{s=0}^{k}p^{(d)}_{k,s}x^{s} satisfy

Plugging the estimates (109), (110) and (112) into Eqs. (107) and (108), we obtain that

We deduce from (113) (105) and (114) that

As a result, combining Eq. (115) with Eq. (102) and (99), we get

By the expression of Δ{\bm{\Delta}} given by (104), we conclude that

Step 5. Proving that DεINd{\bm{D}}\succeq\varepsilon{\mathbf{I}}_{Nd}.

By Lemma 7, we can express Uˉii\bar{\bm{U}}_{ii} by

with α\alpha, β\beta independent of ii, and given by Eq. (93), namely

(Notice that Tr(Uˉii){\rm Tr}(\bar{\bm{U}}_{ii}) and θi,Uˉiiθi\langle{\bm{\theta}}_{i},\bar{\bm{U}}_{ii}{\bm{\theta}}_{i}\rangle are independent of ii by construction, cf. Eqs. (97), (98) and (100), (101).) By the definition of D{\bm{D}} given in Eq. (103), We deduce that:

We claim that, under the assumptions of Proposition 5, and denoting δ=(δ1,δ2){\bm{\delta}}=(\delta_{1},\delta_{2}) (where δ1,δ2\delta_{1},\delta_{2} first appears in the definition of σ^\hat{\sigma} in Eq. (96), and till now δ1,δ2\delta_{1},\delta_{2} are still not determined)

where F1(0)=F2(0)=0F_{1}({\bm{0}})=F_{2}({\bm{0}})=0 and F1(0),F2(0)0\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}}, det(F1(0),F2(0))0\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0. Before proving this claim, let us show that it allows to finish the proof of Proposition 5. Since det(F1(0),F2(0))0\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0, there exists a unit-norm vector v{\bm{v}}, such that v,F1(0)>0\langle{\bm{v}},\nabla F_{1}({\bm{0}})\rangle>0, and v,F2(0)>0\langle{\bm{v}},\nabla F_{2}({\bm{0}})\rangle>0. Now we choose δ1,δ2\delta_{1},\delta_{2} (first appears in the definition of σ^\hat{\sigma} in Eq. (96)): we set δ=(δ1,δ2)=δ0v{\bm{\delta}}=(\delta_{1},\delta_{2})=\delta_{0}{\bm{v}} with some δ0>0\delta_{0}>0 small enough. This yields F1(δ)>0F_{1}({\bm{\delta}})>0, F2(δ)>0F_{2}({\bm{\delta}})>0. Define ε=min(F1(δ),F2(δ))/2\varepsilon=\min(F_{1}({\bm{\delta}}),F_{2}({\bm{\delta}}))/2, we have

We are left with the task of proving that the limits in Eqs. (118), (119) exist, with the desired properties. Using Eqs. (107) and (108), we get:

Using Eq. (110), we get that the limits (118), (119) exist. Further, letting μkμk(σ)\mu_{k}\equiv\mu_{k}(\sigma^{\prime}), we have

It is easy to check F1(0)=F2(0)=0F_{1}({\bm{0}})=F_{2}({\bm{0}})=0, and to compute the gradients, using the identity μk(x2σ)=μk+2(σ)+(2k+1)μk(σ)+k(k1)μk2(σ)\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+(2k+1)\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime}), we get

Under Assumption 5, we have F1(0),F2(0)0\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}} and det(F1(0),F2(0))0\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0 completing the proof.

Proof of Theorem 2.(b): NT model upper bound

and Γd,m\Gamma_{d,m} can be computed using the Gegenbauer recursion formula Eq. (35),

Consider f^NT(x;Θ,a)=i=1Nai,xσ(θi,x)\hat{f}_{\sf NT}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}\langle{\bm{a}}_{i},{\bm{x}}\rangle\sigma^{\prime}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle). We can expand the risk at parameter a{\bm{a}} as

From Assumption 2.(a) and Lemma 5.(b) applied to σ(e,)\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle) and e,σ(e,)\langle{\bm{e}},\cdot\rangle\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle), we get αd=Od(1)\alpha_{d}=O_{d}(1) and βd=Od(d1)\beta_{d}=O_{d}(d^{-1}). We deduce that the operator norm verifies K(θ,θ)op=αd+βdθ22=Od(1)\|{\bm{K}}({\bm{\theta}},{\bm{\theta}})\|_{{\rm op}}=\alpha_{d}+\beta_{d}\|{\bm{\theta}}\|_{2}^{2}=O_{d}(1).

Hence, there exists a constant C>0C>0 such that

Proof of Theorem 4: risk for KR

Step 1. Rewrite the y{\bm{y}}, E{\bm{E}}, H{\bm{H}}, M{\bm{M}} matrices.

The test error of empirical kernel ridge regression gives

where E=(E1,,En)T{\bm{E}}=(E_{1},\ldots,E_{n})^{\mathsf{T}}, M=(Mij)ij[n]{\bm{M}}=(M_{ij})_{ij\in[n]} and H=(Hij)ij[n]{\bm{H}}=(H_{ij})_{ij\in[n]} with

Let the spherical harmonics decomposition of fdf_{d} be

and the Gegenbauer decomposition of hdh_{d} be

We decompose the vectors and matrices f{\bm{f}}, E{\bm{E}}, H{\bm{H}}, and M{\bm{M}} in terms of spherical harmonics

By Proposition 3 and Eq. (56), the kernel H{\bm{H}} and M{\bm{M}} can be rewritten as

Recalling y=f+ε{\bm{y}}={\bm{f}}+{\bm{\varepsilon}}, we decompose the risk as follows

Using Cauchy Schwarz inequality for T22T_{22}, we get

As a result, combining Eqs. (130), (132) and (131), we have

Notice that by Lemma 11, Lemma 13 and the definition of M{\bm{M}}, for any integer LL:

Combining Eqs. (137), (133), (138), (139) and (140), we have

2 Auxiliary results

Then as long as n/(BlogB)n/(B\log B)\to\infty as dd\to\infty, we have

Integrating the tail bound proves the lemma. ∎

Now we look at B11S0B11{\bm{B}}_{11}{\bm{S}}_{0}{\bm{B}}_{11}. We have

By Assumption 3, we have λmin(D1)=ωd(1)\lambda_{\min}({\bm{D}}_{1})=\omega_{d}(1). This proves the lemma. ∎

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.

References

Appendix A Numerical results with ridge regression

The results are reported in Figures 6, 7, 8, and are consistent with the ones of Section 1.3. Regularization does not help: it only reduces the peak at npn\approx p, as expected from [HMRT19], but not the large nn behavior.

(Note that for RF we do not report results for d=100d=100, in Fig. 6. As in Fig. 1, the resulting risk is slightly below the baseline R0R_{0}: this effect vanishes for d100d\gtrsim 100.)