Linearized two-layers neural networks in high dimension
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari
Introduction and main results
For a number of important applications, state-of-the-art performances are obtained by representing the function as a multi-layers neural network. The simplest model in this class is given by two-layers networks (NN):
Two-layers neural networks have been extensively studied in the nineties, with a focus on two goals: Establishing approximation guarantees over classical function spaces; Controlling the generalization error via Rademacher complexity arguments. We refer to [Pin99, AB09] for surveys of these results.
Computational aspects were notably under-represented within these early theoretical contributions. On the contrary, it is nowadays increasingly clear that computational and statistical aspects cannot be separated in the analysis of neural networks (see, e.g. [SHN+18, MMN18, CB18]). Indeed, the optimization algorithm does not simply compute the unique minimizer of a regularized empirical risk: it instead selects one among many possible near-minimizers, whose generalization properties can vary significantly. Therefore, the specific optimization algorithm is an integral part of the definition of the regularization method.
A concrete scenario in which this interplay can be understood precisely is the so-called ‘neural tangent kernel’ regime. First explicitly described in [JGH18], this regime has attracted considerable amount of work. The basic idea is that, for highly overparametrized networks, the network weights barely change from their random initialization. We can therefore replace the nonlinear function class by its first order Taylor expansion around this initialization.
Denoting by the weights at initialization, a first order Taylor expansion yields
where is the neural network at initialization. In other words, is a function in the direct sum , where we defined
We will refer to as the ‘random features’ (RF) model: it amounts to fixing the first layer, and only optimizing the coefficients in the second layer. Equivalently, corresponds to the first order Taylor expansion of with respect to the second layer weights . This model can be traced back to the work of Neal [Nea96], and was successfully developed by Rahimi and Recht [RR08] as a randomized approximation to kernel methods.
The second function class corresponds to the first order Taylor expansion of with respect to the first layer weights [JGH18]. We will refer to as the neural tangent classOften the term ‘neural tangent’ is reserved for the direct sum . We find it more convenient to give distinct names to each of the two terms, especially since has much smaller dimension than for large ..
A sequence of recent papers proves that, in a certain overparametrized regime, gradient descent (GD) applied to the nonlinear neural network class effectively converges to a model in . Namely, if the number of neurons is larger than a threshold , and training is initialized with where , then gradient descent converges exponentially fast to weights such that is well approximated by a function in . The specific value of the threshold for the onset of this NT regime has been steadily pushed down over the last year [DZPS18, DLL+18, AZLS18, ZCZG18, ADH+19].
Does the NT regime explain the power of multi-layers neural networks, when trained by gradient descent methods? From an empirical point of view, the evidence is not univocal [LXS+19, GSJW19, COB19]. From a theoretical point of view, while the expressivity of neural networks is superior to the one of NT models, this hypothesis is not easy to dismiss for at least two reasons. First, neural networks learned by gradient descent algorithms form a significantly smaller class than general networks. Second, the answer depends on the data distribution, the target function and the sample size.
In order to clarify this question, we explore the behavior of RF and NT models in the high-dimensional setting. More precisely, we consider two specific asymptotic regimes:
In both cases we obtain sharp results, up to errors vanishing as . Crucially, our results hold pointwise, i.e. they provide a characterization of approximation and generalization error which hold for a given function . This allows us to derive precise separation results between NN and NT models.
2 A parenthesis
The approximation properties of neural networks have been studied for over three decades [DHM89, Cyb89, Hor91, Bar93, MM94, GJP95, Mha96, Pet98, Mai99, Pin99]. It is useful to discuss the relation between the questions outlined above and existing literature.
A number of results are available on the approximation of functions in certain smoothness classes by two-layers neural networks. In particular [Bar93] controls smoothness by the average frequency content in the Fourier transform (the ‘Barron norm’), while [Mha96, Pet98, Mai99] use classical Sobolev norms. For instance [Mai99] proves that -neurons NN approximate functions in the Sobolev ball with worst case error
for some unspecified functions . (Similar results are found in [Pet98].) These results cannot be used for our purposes.
First of all, we are interested in the NT class which is potentially much less powerful than NN.
Second, bounds of the type (1) make it hard to prove separation results between NN and NT. In order to prove such a separation, we would have to prove that neural networks trained by gradient descent have good approximation properties, uniformly over Sobolev balls. This objective is currently out of reach. Our pointwise approximation results make it much easier to prove separation statements.
Third, earlier work neglects polynomial dependencies in . Bounds of the type (1) have weak implications when both and are large, say , . We will instead prove sharp asymptotic results that are valid in this regime. As illustrated in the next section, our analysis captures the actual behavior in a quantitative manner, already when .
Quantitative results in the high-dimensional regime have been proved only recently. In particular, Bach [Bac17b] established quantitative upper and lower bounds for the approximation error in the RF model. However, these results do not have direct implications on the NT model which is our main interest here. Further, lower bounds in [Bac17b] are, as before, worst case over a certain RKHS. (See also [Bac13, AM15, RR17] for related work.)
Similar considerations apply to the generalization error of kernel methods. While this is a classical topic [CST+00, CDV07, RR17, LR18], earlier work proves minimax upper and lower bounds. Establishing pointwise lower bounds is instead important in order to understand precisely the separation between neural networks and their linearized counterparts. We refer to Section 4 for further discussion of related work.
3 A numerical experiment
Figures 1, 2, 3 report the results of such a simulation using RF –for Figure 1– and NT –for Figures 2 and 3. We use shifted ReLU activations , . The choice of is not essential: (Lebesgue-)almost every has similar behavior. In contrast, the case is degenerate because is equal to a linear function plus an even function.
The target functions in these examples are quite simple. Figures 1 and 2 use a quadratic function . In Figure 3, the target function is a third-order polynomial .
The results are somewhat disappointing: in two cases (first and third figures) RF and NT models do not beat the trivial predictor. In one case (the second one), the NT model surpasses the trivial baseline, and it appears to decrease to as the number of samples increase. We also note that the risk shows a cusp when , with the number of parameters ( for RF, and for NT). This phenomenon is related to overparametrization, and will not be discussed further in this paper (see [BHMM18, BHX19, HMRT19, MM19] for relevant work). We will instead focus on the population behavior .
In other words, the RF model does not appear to be able to learn a simple quadratic function, and the NT model does not appear to be able to learn a third order polynomial. Our main theorems (presented in the next sections) capture in a precise manner this behavior. In particular,
We will prove that for , RF does not outperform the trivial predictor on any function that has vanishing projection on linear functions. Similarly, NT does not outperform the trivial predictor on any function that has vanishing projection on linear and quadratic functions.
In contrast, there exists neural networks in with neurons, and a small approximation error both for and (see, e.g., [Bac17b], or [MMN18, Proposition 1]).
These two points illustrate the gap in approximation power between NT (or RF) and NN.
4 Summary of main results
The equivalence between RF regression and polynomial regression holds pointwise for target function .
Again, this result holds pointwise over the choice of .
The second aspect can be summarized as follows.
Approximation error of linearized neural networks
In this section, we state formally our results about the approximation error of and models. We define the minimum population error for any of the models by
See Section 6 for the proof of lower bound, and Section 7 for the proof of upper bound.
is supported on . It is therefore sufficient that .
By an explicit calculation, the density of . Since this density is bounded, it is sufficient that is square integrable with respect to the Lebesgue measure on .
2 Approximation error of neural tangent models
The function is weakly differentiable, with weak derivative such that for some constants , with .
See Section 8 for the proof of lower bound, and Section 9 for the proof of upper bound.
For instance the ReLU activation and its weak derivative have subexponential growth. Further its Hermite coefficients are and
Assumption 2.(c) does not hold for ReLU activation , since for even. However it holds for shifted ReLU , for a generic value of the shift .
Theorems 1 and 2 can be illustrated by a cartoon, which we show as Figure 5. In words, the approximation error plotted as a function of follows a staircase: it drops close to integer values of this ratio, with each drop corresponding to the projection onto homogeneous polynomials of that degree. We can extract three useful statistical insights from these findings:
There is no difference between plain RF and the more recent NT approach in terms of approximation error, once we compare them at fixed number of parameters . All that changes is the relation between number of parameters and number of neurons: for RF, and for NT. The recent work [GMMM19] actually shows some advantage for the RF model, although in a special case. It is worth mentioning that the same equivalence holds when we consider the dependence on the sample size , at , see Section 3.
We notice however an important computational advantage for NT, at constant parameters number. Indeed, the complexity at prediction time is for NT, while it is for RF.
3 Separation between NN and RF, NT
Crucially, as proven in [MBM16], running gradient descent over the space of neural networks consisting of a single neuron allows to learn the target function efficiently. In other words, we do not have simply a separation between the function classes and or , but a separation between linearized neural networks, and neural networks trained by gradient descent.
Essentially the same example was independently considered by Yehudai and Shamir in concurrent work [YS19]. These authors prove that there exist finite constants such that, if and the coefficients have magnitude at most , then there exists a vector such that, setting , then . An important difference with respect to our separation result is in the fact that Eq. 10 holds –once again– pointwise, i.e. for any fixed , while in [YS19] is chosen by an adversary who has knowledge of the vectors . Let us emphasize there are other important differences between our setting and the one of [YS19], and neither of the two analysis implies the other.
Generalization error of kernel methods
Analogously, ridge regression in can be shown to converge to KRR with respect to the kernel
Section 3.1 presents a lower bound on the prediction error of general kernel methods, and Section 3.2 derives an upper bound for kernel ridge regression.
Consider any regression method of the form
where is the reproducing kernel Hilbert space (RKHS) norm with respect to the kernel of the form (13). By the representer theorem [BTA11] there exist coefficients such that
We are therefore led to define the following data-dependent prediction risk function for kernel methods
The next theorem provides a decomposition of this generalization error that is analogous to the one given in Theorem 1.. Notice however that the controlling factor is not the number of neurons , but instead the sample size .
This follows immediately from Theorem 1.. Indeed, setting and , we obtain , whence the claim follows by applying Eq. (3). ∎
2 Upper bound for kernel ridge regression
where the kernel matrix is given by
and . The prediction function at location is given by
The test error of empirical kernel ridge regression is defined as
We assume that are positive-definite kernels, and we consider the associated eigenvalues:
where we recall that is the -th Gegenbauer polynomial.
If has zero mean (i.e. ) further assume that is centered (i.e. ).
See Section 10 for the proof of this theorem.
Assume as , uniformly over , together with its derivatives, and further assume for some , . We expect this to be the case for many kernels of interest, and in particular it can be shown to be the case for and under mild conditions on the activation . Using Rodrigues’ formula described in Section 5.2, by an application of integration by part followed by dominated convergence, we get
3 Separation between kernel methods and neural networks
Repeating the same argument of Section 2.3, we see that Theorems 3 and 4 imply a separation between kernel methods, with rotationally invariant kernels, and gradient-descent trained neural networks.
Namely, consider again the target function , for . As proven in [MBM16], can be learnt efficiently by minimizing the following empirical risk via gradient descent:
Namely, if samples are used (and under some technical conditions on ), gradient descent reaches prediction error of order
This test error is achieved by kernel ridge regression.
4 Near-optimality of interpolators
Optimal test error is achieved for any . In particular, by taking , we obtain an interpolator, i.e. a predictor that interpolates the data . This remark is made quantitative in the following bound on the empirical risk
Recall that the empirical risk of KRR is given by Eq. (24), where can be rewritten as
where we simply used the law of large numbers . ∎
5 A conjecture for generalization error of random features model
Under the same data model of the previous sections, we are interested in the test prediction error
Theorem 1 characterized the test error in the population limit , whereas Theorems 3 and 4 characterize the same quantity in the case when .
What happens when both and are finite? In the proportional regime and , the precise asymptotics of was calculated in [MM19].
Further related work
Donoho and Johnstone [DJ89] study an approximation problem analogous to the one we considered in Section 2, although in dimensions. Their problem essentially reduces to determining rates of approximation on the unit circle, with the technical difference that the ’s are equi-spaced along the circle instead of being random. As for other references mentioned in Section 1.2, the lower bounds of [DJ89] are worst case over differentiable functions.
After the present paper appeared as a preprint, several authors presented important contributions to the same line of work. In particular, Liang, Rakhlin, and Zhai [LRZ19] studies kernel ridge regression in dimension using samples. Assuming the target function has bounded RKHS norm, they derive upper and lower bounds on the rate of convergence of the generalization error. This result is related to our Theorem 3. The most important difference is that we do not assume that the target function has bounded RKHS norm. Instead we obtain the precise asymptotics of the generalization error in a regime in which it is non-vanishing. As illustrated in Section 1.3, this asymptotic analysis captures indeed the actual behavior in practically reasonable settings.
From a technical viewpoint, several of our calculations make use of harmonic analysis over the -dimensional sphere, as it is natural given that ’s are uniform over the sphere. Spherical harmonics expansion appear in related contexts, e.g. in [DJ89, Bac17a, VW18].
Let us finally mention that an alternative approach to the analysis of two-layers neural networks in the wide limit, was developed in [MMN18, RVE18, SS18, CB18, MMM19] using mean field theory. Unlike in the neural tangent approach, the evolution of network weights is described beyond the linear regime in this theory.
Technical background
The dimension of each subspace is given by
2 Gegenbauer polynomials
We will use the following properties of Gegenbauer polynomials
Note in particular that property 2 implies that –up to a constant– is a representation of the projector onto the subspace of degree - spherical harmonics
then we have the following equation holds in sense
By rotational invariance, the space of homogeneous polynomials of degree is an eigenspace of , and we will denote the corresponding eigenvalue by . In other words . The eigenvalues can be computed via
3 Hermite polynomials
Here and below, for a polynomial, is the vector of the coefficients of . As a consequence, for any fixed integer , we have
where and are given in Eq. (42) and (38).
4 Notations
Proof of Theorem 1.(a): RF model lower bound
Define the random matrix , with
In what follows, we write for the random features risk, omitting the dependence on the weights . By the definition and a simple calculation, we have
where the last inequality used the fact that
This is achieved by the Proposition 1 and 2 stated below.
The proof of Proposition 2 relies on the following tight bound on the operator norm of the Gegenbauer polynomials of the Gram matrix:
The proofs of these three propositions are provided in the next sections. Proposition 1 implies
From Proposition 2, we have with high probability
Then by Markov inequality, we have with high probability
where is non-negative for . This immediately shows that is non-decreasing in . ∎
2 Proof of Proposition 1
and the Gegenbauer expansion of gives
3 Proof of Proposition 2
Recall the expansion of in terms of Gegenbauer polynomials, see Eqs. (51) and (52). From the properties of Gegenbauer polynomials, we have
where with .
As a result, we have , and hence
In the following, we give a lower bound for . Note we have
Hence, there exists constant , such that for large , we have
Plug Eq. (56) into Eq. (53), we get with high probability
4 Proof of Proposition 3
Step 1. Bounding operator norm by moments.
We define . Then we have
For any sequence of integers , we have
To prove the proposition, it suffices to show that for any sequence , we have
To calculate this quantity, we will apply repeatedly the following identity, which is an immediate consequence of Eq. (33). For any distinct, we have
Throughout the proof, we will denote by constants that may depend on but not on . The value of these constants is allowed to change from line to line.
Step 2. The induced graph and equivalence of index sequences.
For any index sequence , we defined an undirected multigraph associated to index sequence . The vertex set is the set of distinct elements in . The edge set is formed as follows: for any we add an edge between and (with convention ). Notice that this could be a self-edge, or a repeated edge: will be –in general– a multigraph. We denote to be the number of vertices of , and to be the number of edges (counting multiplicities). In particular, for . We define
For any two index sequences , we say they are equivalent , if the two graphs and are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalent class of to be
We define the quotient set by
For any integer and , we define
The following properties holds for all sufficiently large and :
For any equivalent index sequences , we have .
For any index sequence , we have .
For any index sequence , the degree of any vertex in must be even.
The number of equivalent classes .
Recall that denotes the number of distinct elements in . Then, for any , the number of elements in the corresponding equivalence class satisfies .
Properties , and are straightforward. Note that for any . For property , notice that to each distinct equivalence class we can associate, in an injective manner, a string of length over an alphabet of size (simply follow the elements in in order, and replace the labels by some canonical ones, e.g. in order of appearance). Therefore the number of classes is bounded as
For property , we need to bound the number of elements in for representative with degree . Define a mapping as follows. For , is a vector of the distinct elements in , listed in increasing order. For any , the pre-image contains at most elements. As a result, we have
In view of property in the last lemma, given an equivalence class , we will write for the corresponding value common to the equivalence class .
It is easy to see that the outcome of this process is independent of the order in which we select vertices.
For illustration, we give two examples of skeletonization processes:
Let , and set . First notice that are redundant vertices and we can remove them in arbitrary order to get . Then notice that is redundant whence we get . Hence we have , and .
Consider the skeletonization process of . Take . First notice that are redundant vertices and can be removed in arbitrary order to get . We see that there is no further redundant vertex in , so that , and .
For the above skeletonization process, the following properties hold:
If , then . That is, the skeletons of equivalent index sequences are equivalent.
For any , define
For any , its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to .
Property holds by the definition of equivalence which is graph isomorphism. Property used the fact that, if and , we have
so that deleting a redundant vertex will contribute a factor.
To show property , note that any intermediate index sequence in the skeletonization process is such that only has even degree vertices, is connected, and has no self-edges (by induction). Hence, only has even degree vertices, is connected, and has no self-edges. Note that cannot have degree-2 vertices, and has at least one vertex (because the last vertex is not removed). Therefore, as long as contains at least two vertices, can only contain vertices with degree greater or equal to . ∎
Given an index sequence , we say is of type 1, if contains only one index. We say is of type 2 if has more than one index (so that by Lemma 3, can only contain vertices with degree greater or equal to ). Denote the class of type 1 index sequence (respectively type 2 index sequence) by (respectively ). We also denote by , the set of equivalence classes of sequences in . This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.
Recall that is the number of vertices in , and is the number of edges in (which coincides with the length of ). We consider . Since for , every edge of must be at most a double edge. Indeed, if had multiplicity larger than in , neither nor could be deleted during the skeletonization process, contradicting the assumption that contains a single vertex. Therefore, we must have . According the Lemma 3., for every , we have
Note by Lemma 2., the number of elements in the equivalence class of is . Hence we get
where in the last step we used Lemma 2 and the fact that for some .
We have the following simple lemma bounding . This bound is useful when is a skeleton.
There exists constants and depending uniquely on such that, for any , and any index sequence with , we have
The lemma following by the claim that (for )
Therefore there exists a constant such that for all large enough
As a consequence, for any integer , we have
Combining the above two upper bounds (63) and (64), we have
By noting that for some , this proves the claim. ∎
Suppose , and denote to be the number of vertices in . We have, for a sequence
Here holds by Lemma 3.; by Lemma 4, and the fact that , together by ; because ; by Lemma 3., implying that for , each vertex of has degree greater or equal to , so that (notice that for we can assume ). Finally, follows since , and the definition of implying .
Note by Lemma 2., the number of elements in equivalent class . Since depends only on the equivalence class of , we will write, with a slight abuse of notation . Notice that the number of equivalence classes with is upper bounded by the number multi-graphs with vertices and edges, which is at most . Hence we get
Define . We will assume hereafter that is selected such that
By calculus and condition (68), the function is maximized over at , whence
Using Eqs. (61) and (69), we have, for any satisfying Eq. (68), we have
Finally setting and , this yields
Proof of Theorem 1.(b): RF model upper bound
Consider . We can expand the risk achieved at parameter as
Proof of Theorem 2.(a): NT model lower bound
We begin with some notations and simple remarks.
Assume is an activation function with for some constants and . Then
A simple calculation shows that as , and hence . Therefore
where the last inequality holds provided .
Finally, for point 3, without loss of generality we will take , so that . By the same argument given above (and since both and have densities bounded uniformly in ), for any we can choose bounded continuous so that for any ,
It is therefore sufficient to prove the claim for . Letting , independent of , we construct the coupling via
where we set . We thus have almost surely, and the claim follows by weak convergence. ∎
We denote the Hermite decomposition of by
We state separately the assumptions of Theorem 2.(a) for future reference.
It is also useful to notice that the Hermite coefficients of can be computed from the ones of using the relation .
2 Proof of Theorem 2.(a): Outline
Proceeding as for the RF model, we obtain
This is achieved in the following two propositions.
Let be an activation function satisfying Assumption 4. Define
These two propositions will be proven in the next sections. Proposition 4 shows that
3 Proof of Proposition 4
We denote the Gegenbauer decomposition of by
By Lemma 5, applied to function (instead of ), under Assumption 4, we have (for a constant independent of ). We therefore have (recalling the normalization of the Gegenbauer polynomials in Eq. (32))
where in the last step we used Eq. (33). By the recurrence relationship for Gegenbauer polynomials (35), we have
We use the convention that . This gives
The last inequality follows by Eqs. (89) and (91).
Using the fact that the kernel preserve the decomposition (29), we have
where we used the fact that is non-decreasing in given by Lemma 1. This concludes the proof.
4 Proof of Proposition 5
In the proof of this proposition, we will need the following lemmas.
We recall the following two formulas for (see Section 5.2):
Furthermore, we have , and therefore therefore . We insert these expressions in the expansion of the function
Matching the coefficients of the expansion yields
Similarly, we can write the decomposition of to be
where the coefficients are given by the same relation as in the above lemma
Case 1: .
Case 2: .
Similarly, for some fixed and , we define
We can therefore fix and . ∎
Let be an activation function such that for some constants , with . Let the Hermite and Gegenbauer decompositions of be
Recall the correspondence (43) between Gegenbauer and Hermite polynomials. Note for any monomial , by Lemma 5., we have
For any fixed , let be the -th Gegenbauer polynomial. We expand
Using the correspondence (43) between Gegenbauer and Hermite polynomials we have
where the last inequality holds for all large enough, since as . Hence, we have:
Taking , we get
4.2 Proof of Proposition 5
Step 1. Construction of the activation function .
for some that we will fix later (with ).
Step 2. The functions and .
Let and be the matrix-valued functions associated respectively to and
From Lemma 7, there exists functions and , such that
We define . Then we can write
where for .
Step 3. Construction of the kernel matrices.
Note that we have . By Eq. (101) and (98), it is easy to see that . Then we have . In the following, we would like to lower bound matrix .
Denoting , we get, from Eq. (92),
We get similar expressions for with replaced by . Because we defined and by only modifying the -th and -th coefficients, we get
Recalling that only depend on and (Lemma 6), we get
By Assumption 4 and the convergence in Lemma 8, for any fixed ,
From Lemma 9, we recall that the coefficients of the -th Gegenbauer polynomial satisfy
Plugging the estimates (109), (110) and (112) into Eqs. (107) and (108), we obtain that
We deduce from (113) (105) and (114) that
As a result, combining Eq. (115) with Eq. (102) and (99), we get
By the expression of given by (104), we conclude that
Step 5. Proving that .
By Lemma 7, we can express by
with , independent of , and given by Eq. (93), namely
(Notice that and are independent of by construction, cf. Eqs. (97), (98) and (100), (101).) By the definition of given in Eq. (103), We deduce that:
We claim that, under the assumptions of Proposition 5, and denoting (where first appears in the definition of in Eq. (96), and till now are still not determined)
where and , . Before proving this claim, let us show that it allows to finish the proof of Proposition 5. Since , there exists a unit-norm vector , such that , and . Now we choose (first appears in the definition of in Eq. (96)): we set with some small enough. This yields , . Define , we have
We are left with the task of proving that the limits in Eqs. (118), (119) exist, with the desired properties. Using Eqs. (107) and (108), we get:
Using Eq. (110), we get that the limits (118), (119) exist. Further, letting , we have
It is easy to check , and to compute the gradients, using the identity , we get
Under Assumption 5, we have and completing the proof.
Proof of Theorem 2.(b): NT model upper bound
and can be computed using the Gegenbauer recursion formula Eq. (35),
Consider . We can expand the risk at parameter as
From Assumption 2.(a) and Lemma 5.(b) applied to and , we get and . We deduce that the operator norm verifies .
Hence, there exists a constant such that
Proof of Theorem 4: risk for KR
Step 1. Rewrite the , , , matrices.
The test error of empirical kernel ridge regression gives
where , and with
Let the spherical harmonics decomposition of be
and the Gegenbauer decomposition of be
We decompose the vectors and matrices , , , and in terms of spherical harmonics
By Proposition 3 and Eq. (56), the kernel and can be rewritten as
Recalling , we decompose the risk as follows
Using Cauchy Schwarz inequality for , we get
As a result, combining Eqs. (130), (132) and (131), we have
Notice that by Lemma 11, Lemma 13 and the definition of , for any integer :
Combining Eqs. (137), (133), (138), (139) and (140), we have
2 Auxiliary results
Then as long as as , we have
Integrating the tail bound proves the lemma. ∎
Now we look at . We have
By Assumption 3, we have . This proves the lemma. ∎
Acknowledgements
This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.
References
Appendix A Numerical results with ridge regression
The results are reported in Figures 6, 7, 8, and are consistent with the ones of Section 1.3. Regularization does not help: it only reduces the peak at , as expected from [HMRT19], but not the large behavior.
(Note that for RF we do not report results for , in Fig. 6. As in Fig. 1, the resulting risk is slightly below the baseline : this effect vanishes for .)