Linearized two-layers neural networks in high dimension

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

Introduction and main results

For a number of important applications, state-of-the-art performances are obtained by representing the function $f$ as a multi-layers neural network. The simplest model in this class is given by two-layers networks (NN):

Two-layers neural networks were extensively studied in the nineties, with a focus on two goals: $(i)$ establishing approximation guarantees over classical function spaces; $(ii)$ controlling the generalization error via Rademacher complexity arguments. We refer to [Pin99, AB09] for surveys of these results.

Computational aspects were notably under-represented within these early theoretical contributions. On the contrary, it is nowadays increasingly clear that computational and statistical aspects cannot be separated in the analysis of neural networks (see, e.g. [SHN+18, MMN18, CB18]). Indeed, the optimization algorithm does not simply compute the unique minimizer of a regularized empirical risk: it instead selects one among many possible near-minimizers, whose generalization properties can vary significantly. Therefore, the specific optimization algorithm is an integral part of the definition of the regularization method.

A concrete scenario in which this interplay can be understood precisely is the so-called ‘neural tangent kernel’ regime. First explicitly described in [JGH18], this regime has attracted a considerable amount of work. The basic idea is that, for highly overparametrized networks, the network weights barely change from their random initialization. We can therefore replace the nonlinear function class ${\mathcal{F}}_{{\sf NN}}$ by its first order Taylor expansion around this initialization.

Denoting by $(a_{0,i},{\bm{w}}_{0,i})_{i\leq N}$ the weights at initialization, a first order Taylor expansion yields

where $f_{{\sf NN},0}$ is the neural network at initialization. In other words, $f_{{\sf NN}}-f_{{\sf NN},0}$ is a function in the direct sum ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$ , where we defined

We will refer to ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ as the ‘random features’ (RF) model: it amounts to fixing the first layer, and only optimizing the coefficients in the second layer. Equivalently, ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ corresponds to the first order Taylor expansion of $f_{{\sf NN}}$ with respect to the second layer weights $(a_{i})_{i\leq N}$ . This model can be traced back to the work of Neal [Nea96], and was successfully developed by Rahimi and Recht [RR08] as a randomized approximation to kernel methods.

The second function class ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ corresponds to the first order Taylor expansion of $f_{{\sf NN}}$ with respect to the first layer weights $({\bm{w}}_{i})_{i\leq N}$ [JGH18]. We will refer to ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ as the neural tangent class. (Often the term ‘neural tangent’ is reserved for the direct sum ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$ . We find it more convenient to give distinct names to each of the two terms, especially since ${\mathcal{F}}_{{\sf RF}}({\bm{W}})$ has much smaller dimension than ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ for large $d$ .)
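To make the two linearized classes concrete, here is a minimal Python sketch of the feature maps behind them. It assumes the usual parametrization, in which an RF function is $\sum_{i}a_{i}\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ with scalar $a_{i}$ and an NT function is $\sum_{i}\langle{\bm{a}}_{i},{\bm{x}}\rangle\sigma^{\prime}(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ with ${\bm{a}}_{i}\in{\mathbb{R}}^{d}$. The function names, the shifted ReLU activation and the Gaussian first layer are illustrative assumptions, not the authors' code.

```python
import math
import random


def sigma(u, u0=0.5):
    """Shifted ReLU activation (an illustrative choice of sigma)."""
    return max(u - u0, 0.0)


def sigma_prime(u, u0=0.5):
    """Weak derivative of the shifted ReLU."""
    return 1.0 if u > u0 else 0.0


def inner(w, x):
    """Euclidean inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def rf_features(x, W):
    """RF feature map: one feature sigma(<w_i, x>) per neuron, so p = N."""
    return [sigma(inner(w, x)) for w in W]


def nt_features(x, W):
    """NT feature map: d features sigma'(<w_i, x>) * x per neuron, so p = N * d."""
    return [sigma_prime(inner(w, x)) * xj for w in W for xj in x]


def sample_first_layer(N, d, seed=0):
    """Draw w_i ~ N(0, I_d / d) independently (assumed initialization)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(N)]


N, d = 4, 3
W = sample_first_layer(N, d)
x = [1.0, -0.5, 2.0]
phi_rf = rf_features(x, W)  # length N
phi_nt = nt_features(x, W)  # length N * d
```

A function in either class is then a linear combination of the corresponding features, which is why both models reduce to linear regression once ${\bm{W}}$ is fixed.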

A sequence of recent papers proves that, in a certain overparametrized regime, gradient descent (GD) applied to the nonlinear neural network class ${\mathcal{F}}_{{\sf NN}}$ effectively converges to a model in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$ . Namely, if the number of neurons $N$ is larger than a threshold $N_{0}(n,d)$ , and training is initialized with $f_{0}({\bm{x}})=N^{-1/2}\sum_{i=1}^{N}a_{0,i}\,\sigma(\langle{\bm{w}}_{0,i},{\bm{x}}\rangle)$ where $\{(a_{0,i},{\bm{w}}_{0,i})\}_{i\leq N}\sim_{iid}{\sf N}(0,1)\otimes{\sf N}(0,{\mathbf{I}}_{d}/d)$ , then gradient descent converges exponentially fast to weights $\{(a_{i},{\bm{w}}_{i})\}_{i\leq N}$ such that $f-f_{0}$ is well approximated by a function in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})\oplus{\mathcal{F}}_{{\sf RF}}({\bm{W}})$ . The specific value of the threshold $N_{0}(n,d)$ for the onset of this NT regime has been steadily pushed down over the last year [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19].

Does the NT regime explain the power of multi-layers neural networks, when trained by gradient descent methods? From an empirical point of view, the evidence is not univocal [LXS+19, GSJW19, COB19]. From a theoretical point of view, while the expressivity of neural networks is superior to the one of NT models, this hypothesis is not easy to dismiss for at least two reasons. First, neural networks learned by gradient descent algorithms form a significantly smaller class than general networks. Second, the answer depends on the data distribution, the target function $f_{*}$ and the sample size.

In order to clarify this question, we explore the behavior of RF and NT models in the high-dimensional setting. More precisely, we consider two specific asymptotic regimes:

In both cases we obtain sharp results, up to errors vanishing as $d\to\infty$ . Crucially, our results hold pointwise, i.e. they provide a characterization of the approximation and generalization error that holds for a given function $f_{*}$ . This allows us to derive precise separation results between NN and NT models.

2 A parenthesis

The approximation properties of neural networks have been studied for over three decades [DHM89, Cyb89, Hor91, Bar93, MM94, GJP95, Mha96, Pet98, Mai99, Pin99]. It is useful to discuss the relation between the questions outlined above and existing literature.

A number of results are available on the approximation of functions in certain smoothness classes by two-layers neural networks. In particular [Bar93] controls smoothness by the average frequency content in the Fourier transform (the ‘Barron norm’), while [Mha96, Pet98, Mai99] use classical Sobolev norms. For instance [Mai99] proves that $N$ -neurons NN approximate functions in the Sobolev ball $W^{r}_{2}$ with worst case error

for some unspecified functions $C_{1},C_{2}$ . (Similar results are found in [Pet98].) These results cannot be used for our purposes.

First of all, we are interested in the NT class which is potentially much less powerful than NN.

Second, bounds of the type (1) make it hard to prove separation results between NN and NT. In order to prove such a separation, we would have to prove that neural networks trained by gradient descent have good approximation properties, uniformly over Sobolev balls. This objective is currently out of reach. Our pointwise approximation results make it much easier to prove separation statements.

Third, earlier work neglects polynomial dependencies in $d$ . Bounds of the type (1) have weak implications when both $d$ and $N$ are large, say $d=100$ , $N=10^{6}$ . We will instead prove sharp asymptotic results that are valid in this regime. As illustrated in the next section, our analysis captures the actual behavior in a quantitative manner, already when $d\geq 100$ .

Quantitative results in the high-dimensional regime have been proved only recently. In particular, Bach [Bac17b] established quantitative upper and lower bounds for the approximation error in the RF model. However, these results do not have direct implications on the NT model which is our main interest here. Further, lower bounds in [Bac17b] are, as before, worst case over a certain RKHS. (See also [Bac13, AM15, RR17] for related work.)

Similar considerations apply to the generalization error of kernel methods. While this is a classical topic [CST+00, CDV07, RR17, LR18], earlier work proves minimax upper and lower bounds. Establishing pointwise lower bounds is instead important in order to understand precisely the separation between neural networks and their linearized counterparts. We refer to Section 4 for further discussion of related work.

3 A numerical experiment

Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3. We use shifted ReLU activations $\sigma(u)=\max(u-u_{0},0)$ , $u_{0}=0.5$ . The choice of $u_{0}=0.5$ is not essential: (Lebesgue-)almost every $u_{0}\neq 0$ has similar behavior. In contrast, the case $u_{0}=0$ is degenerate because $\max(u,0)$ is equal to a linear function plus an even function.
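The degeneracy of the unshifted ReLU can be checked directly: $\max(u,0)=u/2+|u|/2$, so its odd part is exactly linear, whereas the odd part of the shifted ReLU is not. The short sketch below verifies this on a grid; the helper names are hypothetical.

```python
def relu(u):
    return max(u, 0.0)


def shifted_relu(u, u0=0.5):
    return max(u - u0, 0.0)


def odd_part(f, u):
    """Odd part of f at u: (f(u) - f(-u)) / 2."""
    return 0.5 * (f(u) - f(-u))


grid = [k / 10.0 for k in range(-30, 31)]

# ReLU = linear + even: u/2 + |u|/2 reproduces max(u, 0) on the grid.
max_gap = max(abs(relu(u) - (0.5 * u + 0.5 * abs(u))) for u in grid)

# Hence the odd part of ReLU is the linear function u/2.
relu_odd_dev = max(abs(odd_part(relu, u) - 0.5 * u) for u in grid)

# The odd part of the shifted ReLU vanishes for |u| <= u0 but not beyond,
# so it cannot be a linear function of u.
inside = odd_part(shifted_relu, 0.3)
outside = odd_part(shifted_relu, 1.0)
```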

The target functions $f_{\star}$ in these examples are quite simple. Figures 1 and 2 use a quadratic function $f_{\star,2}({\bm{x}})=\sum_{i\leq\lfloor d/2\rfloor}x_{i}^{2}-\sum_{i>\lfloor d/2\rfloor}x_{i}^{2}$ . In Figure 3, the target function is a third-order polynomial $f_{\star,3}({\bm{x}})=\sum_{i=1}^{d}(x_{i}^{3}-3x_{i})$ .
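For readers who want to reproduce the setting, the two targets can be written down in a few lines. The sketch below assumes covariates drawn uniformly on the sphere of radius $\sqrt{d}$ (the data model used in the later sections); the sampler and the sample size are illustrative choices.

```python
import math
import random


def sample_sphere(d, rng):
    """Uniform point on the sphere of radius sqrt(d) in R^d (assumed data model)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [math.sqrt(d) * v / norm for v in g]


def f_star_2(x):
    """Quadratic target: sum of squares over the first half minus the second half."""
    half = len(x) // 2
    return sum(v * v for v in x[:half]) - sum(v * v for v in x[half:])


def f_star_3(x):
    """Cubic target: sum over coordinates of x_i^3 - 3 x_i."""
    return sum(v ** 3 - 3.0 * v for v in x)


rng = random.Random(0)
d = 20
xs = [sample_sphere(d, rng) for _ in range(2000)]
mean_f2 = sum(f_star_2(x) for x in xs) / len(xs)  # close to 0 for even d, by symmetry
mean_f3 = sum(f_star_3(x) for x in xs) / len(xs)  # close to 0, since f_star_3 is odd
```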

The results are somewhat disappointing: in two cases (first and third figures) RF and NT models do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and its risk appears to decrease towards zero as the number of samples $n$ increases. We also note that the risk shows a cusp when $n\approx p$ , with $p$ the number of parameters ( $p=N$ for RF, and $p=Nd$ for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19, MM19] for relevant work). We will instead focus on the population behavior $n\to\infty$ .

In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third order polynomial. Our main theorems (presented in the next sections) capture in a precise manner this behavior. In particular,

We will prove that for $N=O_{d}(d^{2-\delta})$ , RF does not outperform the trivial predictor on any function that has vanishing projection on linear functions. Similarly, NT does not outperform the trivial predictor on any function that has vanishing projection on linear and quadratic functions.

In contrast, there exist neural networks in ${\mathcal{F}}_{{\sf NN}}$ with $N=O_{d}(d)$ neurons and a small approximation error both for $f_{\star,2}$ and $f_{\star,3}$ (see, e.g., [Bac17b], or [MMN18, Proposition 1]).

These two points illustrate the gap in approximation power between NT (or RF) and NN.

4 Summary of main results

The equivalence between RF regression and polynomial regression holds pointwise for target function $f_{\star}$ .

Again, this result holds pointwise over the choice of $f_{\star}$ .

The second aspect can be summarized as follows.

Approximation error of linearized neural networks

In this section, we state formally our results about the approximation error of ${\sf RF}$ and ${\sf NT}$ models. We define the minimum population error for any of the models ${\sf M}\in\{{\sf RF},{\sf NT}\}$ by

See Section 6 for the proof of lower bound, and Section 7 for the proof of upper bound.

$\tau^{1}_{d-1}$ is supported on $[-\sqrt{d},\sqrt{d}]$ . It is therefore sufficient that $\sup_{|u|\leq\sqrt{d}}|\sigma_{d}(u)|=C_{1}(d)<\infty$ .

By an explicit calculation, $\tau^{1}_{d-1}$ has density $C_{2}(d)(1-u^{2}/d)^{(d-3)/2}$ with respect to the Lebesgue measure on $[-\sqrt{d},\sqrt{d}]$ . Since this density is bounded, it is sufficient that $\sigma_{d}$ is square integrable with respect to the Lebesgue measure on $[-\sqrt{d},\sqrt{d}]$ .
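As a sanity check on this density, one can compute the normalizing constant $C_{2}(d)$ numerically and verify that the second moment equals $1$, as it must for one coordinate of a uniform point on the sphere of radius $\sqrt{d}$. The sketch below uses a plain midpoint rule; the function names and the grid size are arbitrary choices.

```python
import math


def unnormalized_density(u, d):
    """(1 - u^2/d)^((d-3)/2) on (-sqrt(d), sqrt(d)), and zero outside."""
    t = 1.0 - u * u / d
    return t ** ((d - 3) / 2.0) if t > 0.0 else 0.0


def midpoint_integral(func, a, b, m=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / m
    return h * sum(func(a + (j + 0.5) * h) for j in range(m))


d = 10
r = math.sqrt(d)
Z = midpoint_integral(lambda u: unnormalized_density(u, d), -r, r)
C2 = 1.0 / Z  # numerical value of the normalizing constant C_2(d)
second_moment = C2 * midpoint_integral(
    lambda u: u * u * unnormalized_density(u, d), -r, r
)
```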

2 Approximation error of neural tangent models

The function $\sigma$ is weakly differentiable, with weak derivative $\sigma^{\prime}$ such that $\sigma^{\prime}(u)^{2}\leq c_{0}\exp(c_{1}u^{2}/2)$ for some constants $c_{0},c_{1}$ , with $c_{1}<1$ .

See Section 8 for the proof of lower bound, and Section 9 for the proof of upper bound.

For instance the ReLU activation $\sigma(u)=\max(u,0)$ and its weak derivative $\sigma^{\prime}(x)={\bm{1}}_{x\geq 0}$ have subexponential growth. Further its Hermite coefficients are $\mu_{0}(\sigma^{\prime})=1/2$ and

Assumption 2.(c) does not hold for ReLU activation $\sigma(u)=\max(u,0)$ , since $\mu_{k}(\sigma)=0$ for $k$ even. However it holds for shifted ReLU $\sigma(u)=\max(u-u_{0},0)$ , for a generic value of the shift $u_{0}$ .

Theorems 1 and 2 can be illustrated by a cartoon, which we show as Figure 5. In words, the approximation error plotted as a function of $\log(\#\text{parameters})/\log d$ follows a staircase: it drops close to integer values of this ratio, with each drop corresponding to the projection onto homogeneous polynomials of that degree. We can extract three useful statistical insights from these findings:

There is no difference between plain RF and the more recent NT approach in terms of approximation error, once we compare them at fixed number of parameters $p$ . All that changes is the relation between number of parameters and number of neurons: $p=N$ for RF, and $p=Nd$ for NT. The recent work [GMMM19] actually shows some advantage for the RF model, although in a special case. It is worth mentioning that the same equivalence holds when we consider the dependence on the sample size $n$ , at $N=\infty$ , see Section 3.

We notice however an important computational advantage for NT, at a fixed number of parameters. Indeed, the complexity at prediction time is $O(Nd)=O(p)$ for NT, while it is $O(Nd)=O(pd)$ for RF.
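The parameter counting behind these remarks can be summarized in a few lines. The sketch below encodes one reading of the staircase cartoon, namely that the largest polynomial degree fitted is $\lfloor\log p/\log d\rfloor$ when this ratio is away from an integer; this ignores the $\delta$ margins of the theorems and is only meant as an illustration, with hypothetical function names.

```python
import math


def num_parameters(model, N, d):
    """p = N for the RF model and p = N * d for the NT model."""
    if model == "RF":
        return N
    if model == "NT":
        return N * d
    raise ValueError("model must be 'RF' or 'NT'")


def cartoon_degree(model, N, d):
    """Largest polynomial degree fitted in the cartoon: floor(log p / log d)."""
    p = num_parameters(model, N, d)
    return math.floor(math.log(p) / math.log(d))


def prediction_cost(N, d):
    """Both models evaluate N inner products of length d, i.e. O(N d) operations."""
    return N * d


d, N = 100, 1000  # N = d^{1.5}
rf_degree = cartoon_degree("RF", N, d)  # p = d^{1.5}: linear part only
nt_degree = cartoon_degree("NT", N, d)  # p = d^{2.5}: up to quadratic part
```

With the same number of neurons, NT reaches one degree higher than RF, but at equal $p$ the two fit the same degree, and NT pays $O(1)$ operations per parameter at prediction time against $O(d)$ for RF.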

3 Separation between NN and RF, NT

Crucially, as proven in [MBM16], running gradient descent over the space of neural networks consisting of a single neuron allows one to learn the target function $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ efficiently. In other words, we do not have simply a separation between the function classes ${\mathcal{F}}_{{\sf NN}}$ and ${\mathcal{F}}_{{\sf RF}}$ or ${\mathcal{F}}_{{\sf NT}}$ , but a separation between linearized neural networks, and neural networks trained by gradient descent.

Essentially the same example was independently considered by Yehudai and Shamir in concurrent work [YS19]. These authors prove that there exist finite constants $c_{0},c_{1}>0$ such that, if $N\leq\exp\{c_{1}d\}$ and the coefficients $a_{i},{\bm{a}}_{i}$ have magnitude at most $\exp\{c_{1}d\}$ , then there exists a vector ${\bm{w}}_{\star}$ such that, setting $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ , we have $R_{{\sf RF}}(f_{\star};{\bm{W}}),R_{{\sf NT}}(f_{\star};{\bm{W}})\geq c_{0}$ . An important difference with respect to our separation result is that Eq. (10) holds –once again– pointwise, i.e. for any fixed ${\bm{w}}_{\star}$ , while in [YS19] ${\bm{w}}_{\star}$ is chosen by an adversary who has knowledge of the vectors $({\bm{w}}_{i})_{i\leq N}$ . Let us emphasize that there are other important differences between our setting and the one of [YS19], and neither of the two analyses implies the other.

Generalization error of kernel methods

Analogously, ridge regression in ${\mathcal{F}}_{{\sf NT}}({\bm{W}})$ can be shown to converge to KRR with respect to the kernel

Section 3.1 presents a lower bound on the prediction error of general kernel methods, and Section 3.2 derives an upper bound for kernel ridge regression.

Consider any regression method of the form

where $\|f\|_{H}$ is the reproducing kernel Hilbert space (RKHS) norm with respect to the kernel $H$ of the form (13). By the representer theorem [BTA11] there exist coefficients $\hat{a}_{1},\dots,\hat{a}_{n}$ such that

We are therefore led to define the following data-dependent prediction risk function for kernel methods

The next theorem provides a decomposition of this generalization error that is analogous to the one given in Theorem 1. $(a)$ . Notice however that the controlling factor is not the number of neurons $N$ , but instead the sample size $n$ .

This follows immediately from Theorem 1. $(a)$ . Indeed, setting $\sigma_{d}(u)=h_{d}(u/\sqrt{d})$ and ${\bm{w}}_{i}={\bm{x}}_{i}/\sqrt{d}$ , we obtain $R_{H}(f_{d},{\bm{X}})=R_{{\sf RF}}(f_{d},{\bm{W}})$ , whence the claim follows by applying Eq. (3). ∎
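The change of variables in this proof is easy to check numerically. The sketch below assumes that the kernel has the inner-product form $H({\bm{x}},{\bm{x}}^{\prime})=h_{d}(\langle{\bm{x}},{\bm{x}}^{\prime}\rangle/d)$, which is the form consistent with the substitution above, and uses $h=\exp$ purely as a hypothetical example.

```python
import math


def h(t):
    """Illustrative kernel function h_d (a hypothetical choice)."""
    return math.exp(t)


def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def kernel_value(x, xi, d):
    """Kernel evaluated at (x, x_i), assumed to be h(<x, x_i> / d)."""
    return h(inner(x, xi) / d)


def rf_feature(x, xi, d):
    """Random feature sigma_d(<w_i, x>) with sigma_d(u) = h(u / sqrt(d)), w_i = x_i / sqrt(d)."""
    w = [v / math.sqrt(d) for v in xi]
    return h(inner(w, x) / math.sqrt(d))


d = 4
x = [1.0, -1.0, 1.0, 1.0]    # both points have squared norm d
xi = [1.0, 1.0, -1.0, 1.0]
gap = abs(kernel_value(x, xi, d) - rf_feature(x, xi, d))
```

In other words, a kernel method with $n$ samples is an RF model whose $N=n$ first-layer weights are the rescaled data points, which is why the sample size takes over the role of the number of neurons.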

2 Upper bound for kernel ridge regression

where the kernel matrix ${\bm{H}}=(H_{ij})_{ij\in[n]}$ is given by

and ${\bm{y}}=(y_{1},\ldots,y_{n})^{\mathsf{T}}$ . The prediction function at location ${\bm{x}}$ is given by

The test error of empirical kernel ridge regression is defined as

We assume that $\{h_{d}\}_{d\geq 1}$ are positive-definite kernels, and we consider the associated eigenvalues:

where we recall that $Q_{k}^{(d)}$ is the $k$ -th Gegenbauer polynomial.

If $h_{d}$ has zero mean (i.e. $\int h_{d}(\sqrt{d}\langle{\bm{e}}_{1},{\bm{x}}\rangle)\tau_{d}({\rm d}{\bm{x}})=0$ ) further assume that $f_{d}$ is centered (i.e. $\int f_{d}({\bm{x}})\tau_{d}({\rm d}{\bm{x}})=0$ ).

See Section 10 for the proof of this theorem.

Assume $h_{d}\to h$ as $d\to\infty$ , uniformly over $[-\delta,\delta]$ , together with its derivatives, and further assume $|h_{d}(x)|\leq c_{0}\exp(c_{1}x^{2}/2)$ for some $c_{0}>0$ , $c_{1}<1$ . We expect this to be the case for many kernels of interest, and in particular it can be shown to be the case for $h_{d}^{{\sf RF}}$ and $h_{d}^{{\sf NT}}$ under mild conditions on the activation $\sigma$ . Using Rodrigues’ formula described in Section 5.2, by an application of integration by parts followed by dominated convergence, we get

3 Separation between kernel methods and neural networks

Repeating the same argument of Section 2.3, we see that Theorems 3 and 4 imply a separation between kernel methods, with rotationally invariant kernels, and gradient-descent trained neural networks.

Namely, consider again the target function $f_{\star}({\bm{x}})=\sigma(\langle{\bm{w}}_{\star},{\bm{x}}\rangle)$ , for $\|{\bm{w}}_{\star}\|_{2}=1$ . As proven in [MBM16], $f_{\star}$ can be learnt efficiently by minimizing the following empirical risk via gradient descent:

Namely, if $n\geq C\,d\log d$ samples are used (and under some technical conditions on $\sigma$ ), gradient descent reaches prediction error of order $(d\log d)/n$

This test error is achieved by kernel ridge regression.

4 Near-optimality of interpolators

Optimal test error is achieved for any $\lambda<\lambda_{*}$ . In particular, by taking $\lambda\to 0$ , we obtain an interpolator, i.e. a predictor that interpolates the data $(y_{i},{\bm{x}}_{i})$ . This remark is made quantitative in the following bound on the empirical risk

Recall that the empirical risk of KRR is given by Eq. (24), where $\hat{\bm{f}}_{\lambda}=(\hat{f}_{\lambda}({\bm{x}}_{1}),\ldots,\hat{f}_{\lambda}({\bm{x}}_{n}))$ can be rewritten as

where we simply used the law of large numbers $\|{\bm{y}}\|_{2}^{2}/n\to\|f_{d}\|_{L^{2}}^{2}+\tau^{2}$ . ∎
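The interpolation phenomenon discussed in this section can be observed on a toy problem. The sketch below fits kernel ridge regression by solving $({\bm{H}}+\lambda{\mathbf{I}}_{n})\hat{\bm{a}}={\bm{y}}$ and reports the empirical risk for a large and a very small $\lambda$. The kernel $h=\exp$, the target, the noise level and the regularization convention ($\lambda$ rather than $n\lambda$) are assumptions made for illustration only.

```python
import math
import random


def sample_sphere(d, rng):
    """Uniform point on the sphere of radius sqrt(d) in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [math.sqrt(d) * v / norm for v in g]


def inner(a, b):
    return sum(u * v for u, v in zip(a, b))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = s / M[r][r]
    return z


def krr_empirical_risk(X, y, lam, h):
    """Mean squared training residual of KRR with kernel h(<x, x'> / d)."""
    n, d = len(X), len(X[0])
    H = [[h(inner(X[i], X[j]) / d) for j in range(n)] for i in range(n)]
    A = [[H[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    a_hat = solve(A, y)
    fitted = [sum(H[i][j] * a_hat[j] for j in range(n)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n


rng = random.Random(1)
d, n = 10, 8
X = [sample_sphere(d, rng) for _ in range(n)]
y = [x[0] * x[1] + 0.1 * rng.gauss(0.0, 1.0) for x in X]  # toy noisy target
risk_large = krr_empirical_risk(X, y, 1.0, math.exp)
risk_small = krr_empirical_risk(X, y, 1e-6, math.exp)  # near-interpolation
```

The training residual equals $\lambda\hat{\bm{a}}$, so it shrinks to zero with $\lambda$ whenever ${\bm{H}}$ is invertible; the statement of this section is that the test error does not deteriorate along the way.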

5 A conjecture for generalization error of random features model

Under the same data model of the previous sections, we are interested in the test prediction error

Theorem 1 characterized the test error $R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda)$ in the population limit $n=\infty$ , whereas Theorems 3 and 4 characterize the same quantity in the case when $N=\infty$ .

What happens when both $n$ and $N$ are finite? In the proportional regime $N\propto d$ and $n\propto d$ , the precise asymptotics of $R_{\sf RF}(f_{d},{\bm{X}},{\bm{W}},\lambda)$ was calculated in [MM19].

Further related work

Donoho and Johnstone [DJ89] study an approximation problem analogous to the one we considered in Section 2, although in $d=2$ dimensions. Their problem essentially reduces to determining rates of approximation on the unit circle, with the technical difference that the ${\bm{w}}_{i}$ ’s are equi-spaced along the circle instead of being random. As for other references mentioned in Section 1.2, the lower bounds of [DJ89] are worst case over differentiable functions.

After the present paper appeared as a preprint, several authors presented important contributions to the same line of work. In particular, Liang, Rakhlin, and Zhai [LRZ19] study kernel ridge regression in $d$ dimensions using $n=O_{d}(d^{\gamma})$ samples. Assuming the target function has bounded RKHS norm, they derive upper and lower bounds on the rate of convergence of the generalization error. This result is related to our Theorem 3. The most important difference is that we do not assume that the target function has bounded RKHS norm. Instead we obtain the precise asymptotics of the generalization error in a regime in which it is non-vanishing. As illustrated in Section 1.3, this asymptotic analysis indeed captures the actual behavior in practically reasonable settings.

From a technical viewpoint, several of our calculations make use of harmonic analysis over the $d$ -dimensional sphere, as is natural given that the ${\bm{x}}_{i}$ ’s are uniform over the sphere. Spherical harmonics expansions appear in related contexts, e.g. in [DJ89, Bac17a, VW18].

Let us finally mention that an alternative approach to the analysis of two-layers neural networks in the wide limit was developed in [MMN18, RVE18, SS18, CB18, MMM19] using mean field theory. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.

Technical background

The dimension of each subspace is given by

2 Gegenbauer polynomials

We will use the following properties of Gegenbauer polynomials

Note in particular that property 2 implies that –up to a constant– $Q_{k}^{(d)}(\langle{\bm{x}},{\bm{y}}\rangle)$ is a representation of the projector onto the subspace of degree - $k$ spherical harmonics

then the following equation holds in the $L^{2}([-\sqrt{d},\sqrt{d}],\tau^{1}_{d-1})$ sense

By rotational invariance, the space $V_{k}$ of homogeneous polynomials of degree $k$ is an eigenspace of $\mathscrsfs{H}_{d}$ , and we will denote the corresponding eigenvalue by $\xi_{d,k}(h_{d})$ . In other words $\mathscrsfs{H}_{d}f({\bm{x}}):=\sum_{k=0}^{\infty}\xi_{d,k}(h_{d}){\mathsf{P}}_{k}f$ . The eigenvalues can be computed via

3 Hermite polynomials

Here and below, for $P$ a polynomial, ${\rm Coeff}\{P(x)\}$ is the vector of the coefficients of $P$ . As a consequence, for any fixed integer $k$ , we have

where $\mu_{k}(\sigma)$ and $\lambda_{d,k}(\sigma)$ are given in Eq. (42) and (38).

4 Notations

Proof of Theorem 1.(a): RF model lower bound

Define the random matrix ${\bm{U}}=(U_{ij})_{i,j\in[N]}$ , with

In what follows, we write $R_{{\sf RF}}(f_{d})=R_{{\sf RF}}(f_{d},{\bm{W}})=R_{{\sf RF}}(f_{d},{\bm{\Theta}}/\sqrt{d})$ for the random features risk, omitting the dependence on the weights ${\bm{W}}={\bm{\Theta}}/\sqrt{d}$ . By the definition and a simple calculation, we have

where the last inequality used the fact that

This is achieved by Propositions 1 and 2 stated below.

The proof of Proposition 2 relies on the following tight bound on the operator norm of the Gegenbauer polynomials of the Gram matrix:

The proofs of these three propositions are provided in the next sections. Proposition 1 implies

From Proposition 2, we have with high probability

Then by Markov inequality, we have with high probability

where $K(d-1,j)={d-2+j\choose j}$ is non-negative for $d\geq 2$ . This immediately shows that $B(d,k)$ is non-decreasing in $k$ . ∎
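The monotonicity of $B(d,k)$ and its growth of order $d^{k}$ can be checked numerically. The sketch below assumes that $B(d,k)$ is the dimension of the space of degree-$k$ spherical harmonics in $d$ dimensions and uses the standard formula ${d+k-1\choose k}-{d+k-3\choose k-2}$ for it. The identity $B(d,k)=K(d-1,k)+K(d-1,k-1)$ tested in the code is a consequence of Pascal's rule and is not necessarily the display used in the proof above.

```python
from math import comb


def B(d, k):
    """Dimension of degree-k spherical harmonics on the sphere in R^d (standard formula)."""
    if k == 0:
        return 1
    if k == 1:
        return d
    return comb(d + k - 1, k) - comb(d + k - 3, k - 2)


def K(m, j):
    """K(m, j) = C(m - 1 + j, j), so that K(d - 1, j) = C(d - 2 + j, j)."""
    return comb(m - 1 + j, j)


# Monotonicity in k, for a range of dimensions d >= 2.
monotone = all(B(d, k) <= B(d, k + 1) for d in range(2, 30) for k in range(0, 10))

# Growth of order d^k for fixed k: the ratio B(d, k) / d^k stays bounded below.
ratios_k2 = [B(d, 2) / d ** 2 for d in (10, 100, 1000)]
```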

2 Proof of Proposition 1

and the Gegenbauer expansion of $\sigma_{d}$ gives

3 Proof of Proposition 2

Recall the expansion of $\sigma_{d}$ in terms of Gegenbauer polynomials, see Eqs. (51) and (52). From the properties of Gegenbauer polynomials, we have

where ${\bm{W}}_{k}=(W_{k,ij})_{i,j\in[N]}$ with $W_{k,ij}=Q_{k}(\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle)$ .

As a result, we have $\hat{\bm{U}}\succeq 0$ , and hence

In the following, we give a lower bound for $\bar{\bm{U}}$ . Note we have

Hence, there exists a constant $C^{\prime}$ , such that for large $d$ , we have

Plugging Eq. (56) into Eq. (53), we get with high probability

4 Proof of Proposition 3

Step 1. Bounding operator norm by moments.

We define ${\bm{\Delta}}={\bm{W}}_{k}-{\mathbf{I}}_{N}$ . Then we have

For any sequence of integers $p=p(d)$ , we have

To prove the proposition, it suffices to show that for any sequence $A_{d}\to\infty$ , we have

To calculate this quantity, we will apply repeatedly the following identity, which is an immediate consequence of Eq. (33). For any $i_{1},i_{2},i_{3}$ distinct, we have

Throughout the proof, we will denote by $C,C^{\prime},C^{\prime\prime}$ constants that may depend on $k$ but not on $p,d,N$ . The value of these constants is allowed to change from line to line.

Step 2. The induced graph and equivalence of index sequences.

For any index sequence ${\bm{i}}=(i_{1},i_{2},\ldots,i_{2p})\in[N]^{2p}$ , we define an undirected multigraph $G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}})$ associated to the index sequence ${\bm{i}}$ . The vertex set $V_{\bm{i}}$ is the set of distinct elements in $i_{1},\ldots,i_{2p}$ . The edge set $E_{{\bm{i}}}$ is formed as follows: for any $j\in[2p]$ we add an edge between $i_{j}$ and $i_{j+1}$ (with convention $2p+1\equiv 1$ ). Notice that this could be a self-edge, or a repeated edge: $G_{\bm{i}}=(V_{\bm{i}},E_{\bm{i}})$ will be –in general– a multigraph. We denote by $v({\bm{i}})=|V_{\bm{i}}|$ the number of vertices of $G_{\bm{i}}$ , and by $e({\bm{i}})=|E_{\bm{i}}|$ the number of edges (counting multiplicities). In particular, $e({\bm{i}})=k$ for ${\bm{i}}\in[N]^{k}$ . We define

For any two index sequences ${\bm{i}}_{1},{\bm{i}}_{2}$ , we say they are equivalent ${\bm{i}}_{1}\asymp{\bm{i}}_{2}$ , if the two graphs $G_{{\bm{i}}_{1}}$ and $G_{{\bm{i}}_{2}}$ are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalence class of ${\bm{i}}$ by

We define the quotient set ${\mathcal{Q}}(p)$ by

For any integer $k\geq 2$ and ${\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}$ , we define

The following properties hold for all sufficiently large $N$ and $d$ :

$(a)$ For any equivalent index sequences ${\bm{i}}=(i_{1},\ldots,i_{2p})\asymp{\bm{j}}=(j_{1},\ldots,j_{2p})$ , we have $M_{{\bm{i}}}=M_{{\bm{j}}}$ .

$(b)$ For any index sequence ${\bm{i}}\in[N]^{2p}\setminus{\mathcal{T}}_{\star}(p)$ , we have $M_{{\bm{i}}}=0$ .

$(c)$ For any index sequence ${\bm{i}}\in{\mathcal{T}}_{\star}(p)$ , the degree of any vertex in $G_{\bm{i}}$ must be even.

$(d)$ The number of equivalence classes satisfies $|{\mathcal{Q}}(p)|\leq(2p)^{2p}$ .

$(e)$ Recall that $v({\bm{i}})=|V_{\bm{i}}|$ denotes the number of distinct elements in ${\bm{i}}$ . Then, for any ${\bm{i}}\in[N]^{2p}$ , the number of elements in the corresponding equivalence class satisfies $|{\mathcal{C}}({\bm{i}})|\leq v({\bm{i}})^{v({\bm{i}})}\cdot N^{v({\bm{i}})}\leq p^{p}N^{v({\bm{i}})}$ .

Properties $(a)$ , $(b)$ and $(c)$ are straightforward. Note that $v({\bm{i}})\leq 2p$ for any ${\bm{i}}\in[N]^{2p}$ . For property $(d)$ , notice that to each distinct equivalence class we can associate, in an injective manner, a string of length $2p$ over an alphabet of size $2p$ (simply follow the elements in ${\bm{i}}$ in order, and replace the labels by some canonical ones, e.g. $\{1,2,3,\dots\}$ in order of appearance). Therefore the number of classes is bounded as

For property $(e)$ , we need to bound the number of elements in ${\mathcal{C}}({\bm{i}})$ for representative ${\bm{i}}$ with degree $v({\bm{i}})$ . Define a mapping $\psi:{\mathcal{C}}({\bm{i}})\to[N]^{v({\bm{i}})}$ as follows. For ${\bm{i}}\in[N]^{2p}$ , $\psi({\bm{i}})$ is a vector of the distinct elements in ${\bm{i}}$ , listed in increasing order. For any ${\bm{k}}\in[N]^{v({\bm{i}})}$ , the pre-image $\psi^{-1}({\bm{k}})$ contains at most $v({\bm{i}})!\leq v({\bm{i}})^{v({\bm{i}})}$ elements. As a result, we have

In view of property $(a)$ in the last lemma, given an equivalence class ${\mathcal{C}}={\mathcal{C}}({\bm{i}})$ , we will write $M_{{\mathcal{C}}}=M_{{\bm{i}}}$ for the corresponding value common to the equivalence class ${\mathcal{C}}$ .
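The combinatorial objects used in this step are simple to compute. The sketch below builds the cyclic edge list of $G_{\bm{i}}$, the vertex degrees, and the relabelling by order of first appearance used in the counting argument for the number of classes. The function names are hypothetical. Two sequences with the same relabelled string are equivalent, while the converse need not hold (for example for cyclic shifts), so counting strings only gives an upper bound on the number of classes.

```python
from collections import Counter


def edges(seq):
    """Cyclic edges (i_j, i_{j+1}), with the last index wrapping around to the first."""
    m = len(seq)
    return [(seq[j], seq[(j + 1) % m]) for j in range(m)]


def degrees(seq):
    """Degree of each vertex of the multigraph; a self-edge contributes 2."""
    deg = Counter()
    for u, v in edges(seq):
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def canonical_labels(seq):
    """Relabel the indices 1, 2, 3, ... in order of first appearance."""
    labels = {}
    out = []
    for x in seq:
        if x not in labels:
            labels[x] = len(labels) + 1
        out.append(labels[x])
    return tuple(out)


i = (7, 3, 7, 9, 4, 9)
j = (1, 2, 1, 3, 4, 3)
v_i = len(set(i))     # number of vertices v(i)
e_i = len(edges(i))   # number of edges e(i), equal to the sequence length
same_string = canonical_labels(i) == canonical_labels(j)
```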

It is easy to see that the outcome of this process is independent of the order in which we select vertices.

For illustration, we give two examples of skeletonization processes:

Let ${\bm{i}}=(1,2,1,3,4,3)$ , and set ${\bm{i}}_{0}={\bm{i}}$ . First notice that $\{2,4\}$ are redundant vertices and we can remove them in arbitrary order to get ${\bm{i}}_{2}=(1,3)$ . Then notice that $3$ is redundant whence we get ${\bm{i}}_{3}=(1)$ . Hence we have $r({\bm{i}})=3$ , and ${\rm sk}({\bm{i}})=(1)$ .

Consider the skeletonization process of ${\bm{j}}=(1,2,3,2,4,3)$ . Take ${\bm{j}}_{0}={\bm{j}}$ . First notice that $\{1,4\}$ are redundant vertices and can be removed in arbitrary order to get ${\bm{j}}_{2}=(2,3,2,3)$ . We see that there is no further redundant vertex in $G_{{\bm{j}}_{2}}$ , so that $r({\bm{j}})=2$ , and ${\rm sk}({\bm{j}})={\bm{j}}_{2}=(2,3,2,3)$ .
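The two examples above can be reproduced with a short routine. The sketch below rests on one reading of the process: a vertex is treated as redundant when its index occurs exactly once in the cyclic sequence (degree $2$ in the graph), and deleting it merges its two neighbours when they coincide. Removing the largest redundant label first is an arbitrary tie-breaking rule introduced here; it is harmless if, as stated above, the outcome does not depend on the order.

```python
from collections import Counter


def remove_vertex(seq, v):
    """Delete the single occurrence of v, then merge cyclically adjacent duplicates."""
    kept = [x for x in seq if x != v]
    merged = []
    for x in kept:
        if not merged or merged[-1] != x:
            merged.append(x)
    # The sequence is cyclic: the last and first entries are also adjacent.
    if len(merged) > 1 and merged[0] == merged[-1]:
        merged.pop()
    return tuple(merged)


def skeletonize(seq):
    """Return (skeleton, number of removed vertices) for a cyclic index sequence."""
    seq = tuple(seq)
    removed = 0
    while len(seq) > 1:  # the last remaining vertex is never removed
        counts = Counter(seq)
        redundant = [v for v in counts if counts[v] == 1]
        if not redundant:
            break
        seq = remove_vertex(seq, max(redundant))
        removed += 1
    return seq, removed


sk_i, r_i = skeletonize((1, 2, 1, 3, 4, 3))
sk_j, r_j = skeletonize((1, 2, 3, 2, 4, 3))
```

In both cases the number of removed vertices equals the number of vertices of the original graph minus the number of vertices of the skeleton.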

For the above skeletonization process, the following properties hold:

If ${\bm{i}}\asymp{\bm{j}}\in[N]^{p}$ , then ${\rm sk}({\bm{i}})\asymp{\rm sk}({\bm{j}})$ . That is, the skeletons of equivalent index sequences are equivalent.

For any ${\bm{i}}=(i_{1},\ldots,i_{k})\in[N]^{k}$ , define

For any ${\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}$ , its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to $4$ .

Property $(a)$ holds by the definition of equivalence which is graph isomorphism. Property $(b)$ used the fact that, if $i\neq j_{1}$ and $i\neq j_{2}$ , we have

so that deleting a redundant vertex will contribute a $1/B(d,k)$ factor.

To show property $(c)$ , note that any intermediate index sequence ${\bm{i}}_{s}$ in the skeletonization process is such that $G_{{\bm{i}}_{s}}$ only has even degree vertices, is connected, and has no self-edges (by induction). Hence, $G_{{\rm sk}({\bm{i}})}$ only has even degree vertices, is connected, and has no self-edges. Note that $G_{{\rm sk}({\bm{i}})}$ cannot have degree-2 vertices, and has at least one vertex (because the last vertex is not removed). Therefore, as long as ${\rm sk}({\bm{i}})$ contains at least two vertices, $G_{{\rm sk}({\bm{i}})}$ can only contain vertices with degree greater or equal to $4$ . ∎

Given an index sequence ${\bm{i}}\in{\mathcal{T}}_{\star}(p)\subset[N]^{2p}$ , we say ${\bm{i}}$ is of type 1, if ${\rm sk}({\bm{i}})$ contains only one index. We say ${\bm{i}}$ is of type 2 if ${\rm sk}({\bm{i}})$ has more than one index (so that by Lemma 3, $G_{{\rm sk}({\bm{i}})}$ can only contain vertices with degree greater or equal to $4$ ). Denote the class of type 1 index sequence (respectively type 2 index sequence) by ${\mathcal{T}}_{1}(p)$ (respectively ${\mathcal{T}}_{2}(p)$ ). We also denote by $\widetilde{\mathcal{T}}_{a}(p)$ , $a\in\{1,2\}$ the set of equivalence classes of sequences in ${\mathcal{T}}_{a}(p)$ . This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.

Recall that $v({\bm{i}})$ is the number of vertices in $G_{\bm{i}}$ , and $e({\bm{i}})$ is the number of edges in $G_{\bm{i}}$ (which coincides with the length of ${\bm{i}}$ ). We consider ${\bm{i}}\in{\mathcal{T}}_{1}(p)$ . For such ${\bm{i}}$ , every edge of $G_{\bm{i}}$ must be at most a double edge. Indeed, if $(u_{1},u_{2})$ had multiplicity larger than $2$ in $G_{{\bm{i}}}$ , neither $u_{1}$ nor $u_{2}$ could be deleted during the skeletonization process, contradicting the assumption that ${\rm sk}({\bm{i}})$ contains a single vertex. Therefore, we must have $\min_{{\bm{i}}\in{\mathcal{T}}_{1}}v({\bm{i}})=p+1$ . According to Lemma 3. $(b)$ , for every ${\bm{i}}\in{\mathcal{T}}_{1}(p)$ , we have

Note by Lemma 2. $(e)$ , the number of elements in the equivalence class of ${\bm{i}}$ is $|{\mathcal{C}}({\bm{i}})|\leq p^{p}\cdot N^{v({\bm{i}})}$ . Hence we get

where in the last step we used Lemma 2 and the fact that $B(d,k)\geq C_{0}d^{k}$ for some $C_{0}>0$ .

We have the following simple lemma bounding $M_{\bm{i}}$ . This bound is useful when ${\bm{i}}$ is a skeleton.

There exist constants $C$ and $d_{0}$ depending only on $k$ such that, for any $d\geq d_{0}(k)$ , and any index sequence ${\bm{i}}\in[N]^{m}$ with $2\leq m\leq d/(4k)$ , we have

The lemma follows from the claim that (for $d\geq d_{0}(k)$ )

Therefore there exists a constant $C_{0}$ such that for all $d$ large enough

As a consequence, for any integer $m$ , we have

Combining the above two upper bounds (63) and (64), we have

By noting that $B(d,k)\geq C_{0}d^{k}$ for some $C_{0}>0$ , this proves the claim. ∎

Suppose ${\bm{i}}\in{\mathcal{T}}_{2}(p)$ , and let $v({\bm{i}})$ denote the number of vertices in $G_{\bm{i}}$ . We have, for a sequence $p=o_{d}(d)$

Here $(1)$ holds by Lemma 3. $(b)$ ; $(2)$ by Lemma 4 and the fact that ${\rm sk}({\bm{i}})\in[N]^{e({\rm sk}({\bm{i}}))}$ , together with $B(d,k)\geq C_{0}d^{k}$ ; $(3)$ because $e({\rm sk}({\bm{i}}))\leq 2p$ ; $(4)$ by Lemma 3. $(c)$ , which implies that for ${\bm{i}}\in{\mathcal{T}}_{2}(p)$ , each vertex of $G_{{\rm sk}({\bm{i}})}$ has degree greater than or equal to $4$ , so that $v({\rm sk}({\bm{i}}))\leq e({\rm sk}({\bm{i}}))/2$ (notice that for $d\geq d_{0}(k)$ we can assume $Cp/d<1$ ). Finally, $(5)$ follows since $r({\bm{i}}),v({\rm sk}({\bm{i}}))\leq v({\bm{i}})$ , and $(6)$ follows from the definition of $r({\bm{i}})$ , which implies $r({\bm{i}})=v({\bm{i}})-v({\rm sk}({\bm{i}}))$ .
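For reference, the degree count behind step $(4)$ can be written out explicitly. It is the handshake identity for multigraphs (each edge contributes $2$ to the total degree, loops counted twice), combined with the minimum-degree bound from Lemma 3. $(c)$ . The notation $V(\cdot)$ for the vertex set and $\deg(u)$ for the degree of a vertex is introduced here only for this display.

```latex
2\, e({\rm sk}({\bm{i}}))
  \;=\; \sum_{u \in V(G_{{\rm sk}({\bm{i}})})} \deg(u)
  \;\geq\; 4\, v({\rm sk}({\bm{i}}))
\qquad\Longrightarrow\qquad
v({\rm sk}({\bm{i}})) \;\leq\; \frac{e({\rm sk}({\bm{i}}))}{2} .
```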

Note that, by Lemma 2. $(e)$ , the number of elements in the equivalence class of ${\bm{i}}$ satisfies $|{\mathcal{C}}({\bm{i}})|\leq p^{v({\bm{i}})}\cdot N^{v({\bm{i}})}$ . Since $v({\bm{i}})$ depends only on the equivalence class of ${\bm{i}}$ , we will write, with a slight abuse of notation, $v({\bm{i}})=v({\mathcal{C}}({\bm{i}}))$ . Notice that the number of equivalence classes with $v({\mathcal{C}})=v$ is upper bounded by the number of multi-graphs with $v$ vertices and $2p$ edges, which is at most $v^{4p}$ . Hence we get

Define $\varepsilon=CNp^{k+1}/d^{k}$ . We will assume hereafter that $p$ is selected such that

By calculus and condition (68), the function $F(v)=v^{4p}\varepsilon^{v}$ is maximized over $v\in[2,2p]$ at $v=2$ , whence

Using Eqs. (61) and (69), for any $p=o_{d}(d)$ satisfying Eq. (68), we have

Finally, setting $N=d^{k}e^{-2A\sqrt{\log d}}$ and $p=(k/A)\sqrt{\log d}$ , this yields

Proof of Theorem 1.(b): RF model upper bound

Consider $\hat{f}_{{\sf RF}}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}a_{i}\sigma_{d}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle/\sqrt{d})$ . We can expand the risk achieved at parameter ${\bm{a}}$ as

Proof of Theorem 2.(a): NT model lower bound

We begin with some notation and simple remarks.

Assume $\sigma$ is an activation function with $\sigma(u)^{2}\leq c_{0}\,\exp(c_{1}\,u^{2}/2)$ for some constants $c_{0}>0$ and $c_{1}<1$ . Then

A simple calculation shows that $C_{d}\to(2\pi)^{-1/2}$ as $d\to\infty$ , and hence $\sup_{d}C_{d}\leq\overline{C}<\infty$ . Therefore

where the last inequality holds provided $d\geq d_{0}=10/(1-c_{1})$ .

Finally, for point 3, without loss of generality we will take ${\bm{w}}={\bm{e}}_{1}$ , so that $\langle{\bm{w}},{\bm{x}}\rangle=x_{1}$ . By the same argument given above (and since both $G$ and $x_{1}$ have densities bounded uniformly in $d$ ), for any $M>0$ we can choose $\sigma_{M}$ bounded continuous so that for any $d$ ,

It is therefore sufficient to prove the claim for $\sigma_{M}$ . Letting ${\bm{\xi}}\sim{\sf N}(0,{\mathbf{I}}_{d-1})$ , independent of $G$ , we construct the coupling via

where we set ${\bm{x}}=(x_{1},{\bm{x}}^{\prime})$ . We thus have $x_{1}\to G$ almost surely, and the claim follows by weak convergence. ∎

We denote the Hermite decomposition of $\sigma$ by

We state separately the assumptions of Theorem 2.(a) for future reference.

It is also useful to notice that the Hermite coefficients of $x^{2}\sigma^{\prime}(x)$ can be computed from the ones of $\sigma^{\prime}(x)$ using the relation $\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+[1+2k]\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime})$ .
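This relation can be checked numerically. The sketch below assumes the convention $\mu_{k}(f)={\mathbb{E}}[f(G){\rm He}_{k}(G)]$ with $G\sim{\sf N}(0,1)$ and ${\rm He}_{k}$ the (unnormalized) probabilists' Hermite polynomials; with a different normalization the constants would change. The helper names and the test function $1/\cosh^{2}$ (the derivative of $\tanh$ , standing in for $\sigma^{\prime}$ ) are illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial import hermite_e as H

# Gauss-HermiteE quadrature integrates against exp(-x^2 / 2);
# dividing the weights by sqrt(2*pi) turns sums into expectations over N(0, 1).
NODES, WEIGHTS = H.hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def hermite_coeff(f, k):
    """Approximate mu_k(f) = E[f(G) He_k(G)]; returns 0 for k < 0 by convention."""
    if k < 0:
        return 0.0
    basis = np.zeros(k + 1)
    basis[k] = 1.0  # coefficient vector selecting He_k
    return float(np.sum(WEIGHTS * f(NODES) * H.hermeval(NODES, basis)))


def identity_gap(f, k):
    """|mu_k(x^2 f) - (mu_{k+2}(f) + (2k+1) mu_k(f) + k(k-1) mu_{k-2}(f))|."""
    lhs = hermite_coeff(lambda x: x**2 * f(x), k)
    rhs = (hermite_coeff(f, k + 2)
           + (2 * k + 1) * hermite_coeff(f, k)
           + k * (k - 1) * hermite_coeff(f, k - 2))
    return abs(lhs - rhs)


def sech2(x):
    """Derivative of tanh, used here as a stand-in for sigma'."""
    return 1.0 / np.cosh(x) ** 2


if __name__ == "__main__":
    for k in range(5):
        print(k, identity_gap(sech2, k))
```

Because both sides are evaluated with the same quadrature rule, the gap reflects only the pointwise polynomial identity $x^{2}{\rm He}_{k}={\rm He}_{k+2}+(2k+1){\rm He}_{k}+k(k-1){\rm He}_{k-2}$ , so it should vanish up to floating-point error.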

2 Proof of Theorem 2.(a): Outline

Proceeding as for the RF model, we obtain

This is achieved in the following two propositions.

Let $\sigma$ be an activation function satisfying Assumption 4. Define

These two propositions will be proven in the next sections. Proposition 4 shows that

3 Proof of Proposition 4

We denote the Gegenbauer decomposition of $\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ by

By Lemma 5, applied to function $\sigma^{\prime}$ (instead of $\sigma$ ), under Assumption 4, we have $\|\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)\|_{L^{2}}^{2}\leq C$ (for $C$ a constant independent of $d$ ). We therefore have (recalling the normalization of the Gegenbauer polynomials in Eq. (32))

where in the last step we used Eq. (33). By the recurrence relationship for Gegenbauer polynomials (35), we have

We use the convention that $t_{d,-1}=0$ . This gives

The last inequality follows by Eqs. (89) and (91).

Using the fact that the kernel $H$ preserves the decomposition (29), we have

where we used the fact that $B(d,k)$ is non-decreasing in $k$ , by Lemma 1. This concludes the proof.

4 Proof of Proposition 5

In the proof of this proposition, we will need the following lemmas.

We recall the following two formulas for $k\geq 1$ (see Section 5.2):

Furthermore, we have $Q^{(d)}_{0}(x)=1$ , $Q^{(d)}_{1}(x)=x/d$ and therefore $xQ^{(d)}_{0}(x)=dQ^{(d)}_{1}(x)$ . We insert these expressions in the expansion of the function $\psi$

Matching the coefficients of the expansion yields

Similarly, we can write the decomposition of $x^{2}\psi(x)$ as

where the coefficients are given by the same relation as in the above lemma

Case 1: ${\bm{\theta}}_{1}\neq{\bm{\theta}}_{2}$ .

Case 2: ${\bm{\theta}}_{1}={\bm{\theta}}_{2}$ .

Similarly, for some fixed $\alpha$ and $\beta$ , we define

We can therefore fix $u_{1}(1)=\alpha$ and $u_{2}(1)+u_{3}(1)=\beta/2$ . ∎

Let $\sigma$ be an activation function such that $\sigma(u)\leq c_{0}\exp(c_{1}u^{2})$ for some constants $c_{0},c_{1}$ , with $c_{1}<1$ . Let the Hermite and Gegenbauer decompositions of $\sigma$ be

Recall the correspondence (43) between Gegenbauer and Hermite polynomials. Note that for any monomial $m_{k}(x)=x^{k}$ , by Lemma 5. $(c)$ , we have

For any fixed $k$ , let $Q_{k}^{(d)}(x)$ be the $k$ -th Gegenbauer polynomial. We expand

Using the correspondence (43) between Gegenbauer and Hermite polynomials we have

where the last inequality holds for all $d$ large enough, since $C_{d}\to(2\pi)^{-1/2}$ as $d\to\infty$ . Hence, we have:

Taking $t=O(\log(d)^{1/2}d^{-1/2})$ , we get

4.2 Proof of Proposition 5

Step 1. Construction of the activation function $\hat{\sigma}$ .

for some $\delta_{1},\delta_{2}$ that we will fix later (with $|\delta_{t}|\leq 1$ ).

Step 2. The functions ${\bm{u}},\hat{\bm{u}}$ and $\bar{\bm{u}}$ .

Let ${\bm{u}}$ and $\hat{\bm{u}}$ be the matrix-valued functions associated respectively to $\sigma^{\prime}$ and $\hat{\sigma}^{\prime}$

From Lemma 7, there exist functions $u_{1},u_{2},u_{3}$ and $\hat{u}_{1},\hat{u}_{2},\hat{u}_{3}$ , such that

We define $\bar{\bm{u}}={\bm{u}}-\hat{\bm{u}}$ . Then we can write

where $\bar{u}_{k}=u_{k}-\hat{u}_{k}$ for $k=1,2,3$ .

Step 3. Construction of the kernel matrices.

Note that we have ${\bm{U}}=\hat{\bm{U}}+\bar{\bm{U}}$ . By Eqs. (101) and (98), it is easy to see that $\hat{\bm{U}}\succeq 0$ , and hence ${\bm{U}}\succeq\bar{\bm{U}}$ . In the following, we lower bound the matrix $\bar{\bm{U}}$ .

Denoting $\gamma_{ij}=\langle{\bm{\theta}}_{i},{\bm{\theta}}_{j}\rangle/d<1$ , we get, from Eq. (92),

We get similar expressions for $\hat{\bm{U}}_{ij}$ with $\lambda_{k,d}(\sigma^{\prime})$ replaced by $\lambda_{k,d}(\hat{\sigma}^{\prime})$ . Because we defined $\sigma^{\prime}$ and $\hat{\sigma}^{\prime}$ by only modifying the $k_{1}$ -th and $k_{2}$ -th coefficients, we get

Recalling that $\lambda_{k,d}^{(1)}$ only depends on $\lambda_{k-1,d}$ and $\lambda_{k+1,d}$ (Lemma 6), we get

By Assumption 4 and the convergence in Lemma 8, for any fixed $k$ ,

From Lemma 9, we recall that the coefficients of the $k$ -th Gegenbauer polynomial $Q_{k}^{(d)}(x)=\sum_{s=0}^{k}p^{(d)}_{k,s}x^{s}$ satisfy

Plugging the estimates (109), (110) and (112) into Eqs. (107) and (108), we obtain that

We deduce from Eqs. (113), (105) and (114) that

As a result, combining Eq. (115) with Eqs. (102) and (99), we get

By the expression of ${\bm{\Delta}}$ given by (104), we conclude that

Step 5. Proving that ${\bm{D}}\succeq\varepsilon{\mathbf{I}}_{Nd}$ .

By Lemma 7, we can express $\bar{\bm{U}}_{ii}$ by

with $\alpha$ , $\beta$ independent of $i$ , and given by Eq. (93), namely

(Notice that ${\rm Tr}(\bar{\bm{U}}_{ii})$ and $\langle{\bm{\theta}}_{i},\bar{\bm{U}}_{ii}{\bm{\theta}}_{i}\rangle$ are independent of $i$ by construction, cf. Eqs. (97), (98) and (100), (101).) By the definition of ${\bm{D}}$ given in Eq. (103), we deduce that:

We claim that, under the assumptions of Proposition 5, and denoting ${\bm{\delta}}=(\delta_{1},\delta_{2})$ (where $\delta_{1},\delta_{2}$ first appear in the definition of $\hat{\sigma}$ in Eq. (96) and have not yet been determined)

where $F_{1}({\bm{0}})=F_{2}({\bm{0}})=0$ and $\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}}$ , $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ . Before proving this claim, let us show that it allows us to finish the proof of Proposition 5. Since $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ , there exists a unit-norm vector ${\bm{v}}$ , such that $\langle{\bm{v}},\nabla F_{1}({\bm{0}})\rangle>0$ , and $\langle{\bm{v}},\nabla F_{2}({\bm{0}})\rangle>0$ . Now we choose $\delta_{1},\delta_{2}$ (which first appear in the definition of $\hat{\sigma}$ in Eq. (96)): we set ${\bm{\delta}}=(\delta_{1},\delta_{2})=\delta_{0}{\bm{v}}$ for some $\delta_{0}>0$ small enough. Since $F_{a}({\bm{0}})=0$ , a first order expansion gives $F_{a}({\bm{\delta}})=\delta_{0}\langle{\bm{v}},\nabla F_{a}({\bm{0}})\rangle+o(\delta_{0})$ for $a\in\{1,2\}$ , which yields $F_{1}({\bm{\delta}})>0$ , $F_{2}({\bm{\delta}})>0$ . Defining $\varepsilon=\min(F_{1}({\bm{\delta}}),F_{2}({\bm{\delta}}))/2$ , we have

We are left with the task of proving that the limits in Eqs. (118), (119) exist, with the desired properties. Using Eqs. (107) and (108), we get:

Using Eq. (110), we get that the limits (118), (119) exist. Further, letting $\mu_{k}\equiv\mu_{k}(\sigma^{\prime})$ , we have

It is easy to check that $F_{1}({\bm{0}})=F_{2}({\bm{0}})=0$ . Computing the gradients using the identity $\mu_{k}(x^{2}\sigma^{\prime})=\mu_{k+2}(\sigma^{\prime})+(2k+1)\mu_{k}(\sigma^{\prime})+k(k-1)\mu_{k-2}(\sigma^{\prime})$ , we get

Under Assumption 5, we have $\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}})\neq{\bm{0}}$ and $\det(\nabla F_{1}({\bm{0}}),\nabla F_{2}({\bm{0}}))\neq 0$ , completing the proof.
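The existence of the direction ${\bm{v}}$ used in Step 5 can be made constructive. The following is a minimal sketch, assuming only that the two gradients are linearly independent vectors in ${\mathbb{R}}^{2}$ : solving the linear system $\langle{\bm{w}},\nabla F_{1}({\bm{0}})\rangle=\langle{\bm{w}},\nabla F_{2}({\bm{0}})\rangle=1$ and normalizing ${\bm{w}}$ gives one valid choice. The function name `positive_direction` and the sample gradients are hypothetical and used only for illustration.

```python
import numpy as np


def positive_direction(g1, g2, tol=1e-12):
    """Return a unit vector v with <v, g1> > 0 and <v, g2> > 0.

    g1, g2 play the role of the two gradients at 0 and must be
    linearly independent vectors in R^2.
    """
    G = np.array([g1, g2], dtype=float)  # rows are the two gradients
    if abs(np.linalg.det(G)) < tol:
        raise ValueError("gradients must be linearly independent")
    # Solve G @ w = (1, 1): both inner products equal 1 before normalization.
    w = np.linalg.solve(G, np.ones(2))
    return w / np.linalg.norm(w)


if __name__ == "__main__":
    g1, g2 = [1.0, 0.0], [-1.0, 0.1]  # nearly opposite, but independent
    v = positive_direction(g1, g2)
    print(v, v @ np.array(g1), v @ np.array(g2))
```

After normalization both inner products equal $1/\|{\bm{w}}\|_{2}>0$ , so the construction works even when the two gradients point in nearly opposite directions.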

Proof of Theorem 2.(b): NT model upper bound

and $\Gamma_{d,m}$ can be computed using the Gegenbauer recursion formula Eq. (35),

Consider $\hat{f}_{\sf NT}({\bm{x}};{\bm{\Theta}},{\bm{a}})=\sum_{i=1}^{N}\langle{\bm{a}}_{i},{\bm{x}}\rangle\sigma^{\prime}(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle)$ . We can expand the risk at parameter ${\bm{a}}$ as

From Assumption 2.(a) and Lemma 5.(b) applied to $\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ and $\langle{\bm{e}},\cdot\rangle\sigma^{\prime}(\langle{\bm{e}},\cdot\rangle)$ , we get $\alpha_{d}=O_{d}(1)$ and $\beta_{d}=O_{d}(d^{-1})$ . We deduce that the operator norm verifies $\|{\bm{K}}({\bm{\theta}},{\bm{\theta}})\|_{{\rm op}}=\alpha_{d}+\beta_{d}\|{\bm{\theta}}\|_{2}^{2}=O_{d}(1)$ .
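The operator-norm identity can be illustrated numerically. This sketch assumes, as the stated norm suggests but the surviving text does not show, that the diagonal kernel has the form ${\bm{K}}({\bm{\theta}},{\bm{\theta}})=\alpha_{d}{\mathbf{I}}_{d}+\beta_{d}{\bm{\theta}}{\bm{\theta}}^{\mathsf{T}}$ with $\alpha_{d},\beta_{d}\geq 0$ and $\|{\bm{\theta}}\|_{2}^{2}=d$ . The numerical values of `alpha` and `beta` are arbitrary placeholders chosen to mimic $\alpha_{d}=O_{d}(1)$ and $\beta_{d}=O_{d}(d^{-1})$ .

```python
import numpy as np


def rank_one_kernel(alpha, beta, theta):
    """Assumed form alpha * I + beta * theta theta^T of K(theta, theta)."""
    d = theta.shape[0]
    return alpha * np.eye(d) + beta * np.outer(theta, theta)


rng = np.random.default_rng(0)
d = 50
theta = rng.standard_normal(d)
theta = theta * np.sqrt(d) / np.linalg.norm(theta)  # rescale so that |theta|^2 = d

alpha, beta = 0.7, 0.3 / d  # placeholders: alpha = O(1), beta = O(1/d)
K = rank_one_kernel(alpha, beta, theta)

op_norm = np.linalg.norm(K, 2)          # largest singular value of K
predicted = alpha + beta * theta @ theta  # alpha + beta * |theta|^2

if __name__ == "__main__":
    print(op_norm, predicted)
```

The top eigenvector is ${\bm{\theta}}$ itself, and every direction orthogonal to ${\bm{\theta}}$ has eigenvalue $\alpha_{d}$ , which is why the norm stays bounded when $\beta_{d}\|{\bm{\theta}}\|_{2}^{2}=O_{d}(1)$ .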

Hence, there exists a constant $C>0$ such that

Proof of Theorem 4: risk for KR

Step 1. Rewrite the ${\bm{y}}$ , ${\bm{E}}$ , ${\bm{H}}$ , ${\bm{M}}$ matrices.

The test error of empirical kernel ridge regression gives

where ${\bm{E}}=(E_{1},\ldots,E_{n})^{\mathsf{T}}$ , ${\bm{M}}=(M_{ij})_{ij\in[n]}$ and ${\bm{H}}=(H_{ij})_{ij\in[n]}$ with

Let the spherical harmonics decomposition of $f_{d}$ be

and the Gegenbauer decomposition of $h_{d}$ be

We decompose the vectors and matrices ${\bm{f}}$ , ${\bm{E}}$ , ${\bm{H}}$ , and ${\bm{M}}$ in terms of spherical harmonics

By Proposition 3 and Eq. (56), the kernel ${\bm{H}}$ and ${\bm{M}}$ can be rewritten as

Recalling ${\bm{y}}={\bm{f}}+{\bm{\varepsilon}}$ , we decompose the risk as follows

Using the Cauchy-Schwarz inequality for $T_{22}$ , we get

As a result, combining Eqs. (130), (132) and (131), we have

Notice that by Lemma 11, Lemma 13 and the definition of ${\bm{M}}$ , for any integer $L$ :

Combining Eqs. (137), (133), (138), (139) and (140), we have

2 Auxiliary results

Then as long as $n/(B\log B)\to\infty$ as $d\to\infty$ , we have

Integrating the tail bound proves the lemma. ∎

Now we look at ${\bm{B}}_{11}{\bm{S}}_{0}{\bm{B}}_{11}$ . We have

By Assumption 3, we have $\lambda_{\min}({\bm{D}}_{1})=\omega_{d}(1)$ . This proves the lemma. ∎

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.

References

Appendix A Numerical results with ridge regression

The results are reported in Figures 6, 7, 8, and are consistent with the ones of Section 1.3. Regularization does not help: it only reduces the peak at $n\approx p$ , as expected from [HMRT19], but not the large $n$ behavior.

(Note that for RF we do not report results for $d=100$ , in Fig. 6. As in Fig. 1, the resulting risk is slightly below the baseline $R_{0}$ : this effect vanishes for $d\gtrsim 100$ .)