Deep learning: a statistical viewpoint

Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin

Introduction

The past decade has witnessed dramatic advances in machine learning that have led to major breakthroughs in computer vision, speech recognition, and robotics. These achievements are based on a powerful and diverse toolbox of techniques and algorithms that now bears the name ‘deep learning’; see, for example, [GBC16]. Deep learning has evolved from the decades-old methodology of neural networks: circuits of parametrized nonlinear functions, trained by gradient-based methods. Practitioners have made major architectural and algorithmic innovations, and have exploited technological advances, such as increased computing power, distributed computing architectures, and the availability of large amounts of digitized data. The 2018 Turing Award celebrated these advances, a reflection of their enormous impact [LBH15].

While deep learning has been hugely successful in the hands of practitioners, there are significant gaps in our understanding of what makes these methods successful. Indeed, deep learning reveals some major surprises from a theoretical perspective: deep learning methods can find near-optimal solutions to highly non-convex empirical risk minimization problems, solutions that give a near-perfect fit to noisy training data, but despite making no explicit effort to control model complexity, these methods lead to excellent prediction performance in practice.

The deep learning revolution built on two surprising empirical discoveries that are suggestive of radically different ways of managing these trade-offs. First, deep learning exploits rich and expressive models, with many parameters, and the problem of optimizing the fit to the training data appears to simplify dramatically when the function class is rich enough, that is, when it is sufficiently overparametrized. In this regime, simple, local optimization approaches, variants of stochastic gradient methods, are extraordinarily successful at finding near-optimal fits to training data, even though the nonlinear parametrization—see equation (1)—implies that the optimization problems that these simple methods solve are notoriously non-convex. A posteriori, the idea that overparametrization could lead to tractability might seem natural, but it would have seemed completely foolish from the point of view of classical learning theory: the resulting models are outside the realm of uniform convergence, and therefore should not be expected to generalize well.

The second surprising empirical discovery was that these models are indeed outside the realm of uniform convergence. They are enormously complex, with many parameters, they are trained with no explicit regularization to control their statistical complexity, and they typically exhibit a near-perfect fit to noisy training data, that is, empirical risk close to zero. Nonetheless this overfitting is benign, in that they produce excellent prediction performance in a number of settings. Benign overfitting appears to contradict accepted statistical wisdom, which insists on a trade-off between the complexity of a model and its fit to the data. Indeed, the rule of thumb that models fitting noisy data too well will not generalize is found in most classical texts on statistics and machine learning [FHT01, Was13]. This viewpoint has become so prevalent that the word ‘overfitting’ is often taken to mean both fitting data better than should be expected and also giving poor predictive accuracy as a consequence. In this paper, we use the literal meaning of the word ‘overfitting’; deep learning practice has demonstrated that poor predictive accuracy is not an inevitable consequence.

This paper reviews some initial steps towards understanding these two surprising aspects of the success of deep learning. We have two working hypotheses:

Classically, tractable statistical learning is achieved by restricting to linearly parametrized classes of functions and convex objectives. A fundamentally new principle appears to be at work in deep learning. Although the objective is highly non-convex, we conjecture that the hardness of the optimization problem depends on the relationship between the dimension of the parameter space (the number of optimization variables) and the sample size (which, when we aim for a near-perfect fit to training data, we can think of as the number of constraints), that is, tractability is achieved if and only if we choose a model that is sufficiently under-constrained or, equivalently, overparametrized.

Even if overparametrized models simplify the optimization task, classically we would have believed that good generalization properties would be restricted to either an underparametrized regime or a suitably regularized regime. Statistical wisdom suggests that a method that takes advantage of too many degrees of freedom by perfectly interpolating noisy training data will be poor at predicting new outcomes. In deep learning, training algorithms appear to induce a bias that breaks the equivalence among all the models that interpolate the observed data. Because these models interpolate noisy data, the classical statistical perspective would suggest that this bias cannot provide sufficient regularization to give good generalization, but in practice it does. We conjecture that deep learning models can be decomposed into a low-complexity component for which classical uniform convergence occurs and a high-complexity component that enables a perfect fit to training data, and if the model is suitably overparameterized, this perfect fit does not have a significant impact on prediction accuracy.

As we shall see, both of these hypotheses are supported by results in specific scenarios, but there are many intriguing open questions in extending these results to realistic deep learning settings.

It is worth noting that none of the results that we review here make a case for any optimization or generalization benefits of increasing depth in deep learning. Although it is not the focus here, another important aspect of deep learning concerns how deep neural networks can effectively and parsimoniously express natural functions that are well matched to the data that arise in practice. It seems likely that depth is crucial for these issues of expressivity.

Section 2 starts by reviewing some results from classical statistical learning theory that are relevant to the problem of prediction with deep neural networks. It describes an explicit probabilistic formulation of prediction problems. Consistent with the data-driven perspective of deep learning, this formulation assumes little more than that the $({\boldsymbol{x}},y)$ pairs are sampled independently from a fixed probability distribution. We explain the role played by uniform bounds on deviations between risk and empirical risk,

in the analysis of the generalization question for functions chosen from a class ${\mathcal{F}}$ . We show how a partition of a rich function class ${\mathcal{F}}$ into a complexity hierarchy allows regularization methods that balance the statistical complexity and the empirical risk to enjoy the best bounds on generalization implied by the uniform convergence results. We consider consequences of these results for general pattern classification problems, for easier “large margin” classification problems and for regression problems, and we give some specific examples of risk bounds for feed-forward networks. Finally, we consider the implications of these results for benign overfitting: If an algorithm chooses an interpolating function to minimize some notion of complexity, what do the uniform convergence results imply about its performance? We see that there are very specific barriers to analysis of this kind in the overfitting regime; an analysis of benign overfitting must make stronger assumptions about the process that generates the data.

In Section 3, we review results on the implicit regularization that is imposed by the algorithmic approach ubiquitous in deep learning: gradient methods. We see examples of function classes and loss functions where gradient methods, suitably initialized, return the empirical risk minimizers that minimize certain parameter norms. While all of these examples involve parameterizations of linear functions with convex losses, we shall see in Section 5 that this linear/convex viewpoint can be important for nonconvex optimization problems that arise in neural network settings.

Section 4 reviews analyses of benign overfitting. We consider extreme cases of overfitting, where the prediction rule gives a perfect interpolating fit to noisy data. In all the cases that we review where this gives good predictive accuracy, we can view the prediction rule as a linear combination of two components: $\widehat{f}=\widehat{f}_{0}+\Delta$ . The first, $\widehat{f}_{0}$ , is a simple component that is useful for prediction, and the second, $\Delta$ , is a spiky component that is useful for overfitting. Classical statistical theory explains the good predictive accuracy of the simple component. The other component is not useful for prediction, but equally it is not harmful for prediction. The first example we consider is the classical Nadaraya-Watson kernel smoothing method with somewhat strange, singular kernels, which lead to an interpolating solution that, for a suitable choice of the kernel bandwidth, enjoys minimax estimation rates. In this case, we can view $\widehat{f}_{0}$ as the prediction of a standard kernel smoothing method and $\Delta$ as a spiky component that is harmless for prediction but allows interpolation. The other examples we consider are for high-dimensional linear regression. Here, ‘linear’ means linearly parameterized, which of course allows for the richness of highly nonlinear features, for instance the infinite dimensional feature vectors that arise in reproducing kernel Hilbert spaces (RKHSs). Motivated by the results of Section 3, we study the behavior of the minimum norm interpolating linear function. We see that it can be decomposed into a prediction component and an overfitting component, with the split determined by the eigenvalues of the data covariance matrix. The prediction component corresponds to a high-variance subspace and the overfitting component to the orthogonal, low-variance subspace. For sub-Gaussian features, benign overfitting occurs if and only if the high-variance subspace is low-dimensional (that is, the prediction component is simple enough for the corresponding subspace of functions to exhibit uniform convergence) and the low-variance subspace has high effective dimension and suitably low energy. In that case, we see a self-induced regularization: the projection of the data on the low-variance subspace is well-conditioned, just as it would be if a certain level of statistical regularization were imposed, so that even though this subspace allows interpolation, it does not significantly deteriorate the predictive accuracy. (Notice that this self-induced regularization is a consequence of the decay of eigenvalues of the covariance matrix, and should not be confused with the implicit regularization, which is a consequence of the gradient optimization method and leads to the minimum norm interpolant.) Using direct arguments that avoid the sub-Gaussian assumption, we see similar behavior of the minimum norm interpolant in certain infinite-dimensional RKHSs, including an example of an RKHS with fixed input dimension where benign overfitting cannot occur and examples of RKHSs where it does occur for suitably increasing input dimension, again corresponding to decompositions into a simple subspace—in this case, a subspace of polynomials, with dimension low enough for uniform convergence—and a complex high-dimensional orthogonal subspace that allows benign overfitting.

In Section 5, we consider a specific regime where overparametrization allows a non-convex empirical risk minimization problem to be solved efficiently by gradient methods: a linear regime, in which a parameterized function can be accurately approximated by its linearization about an initial parameter vector. For a suitable parameterization and initialization, we see that a gradient method remains in the linear regime, enjoys linear convergence of the empirical risk, and leads to a solution whose predictions are well approximated by the linearization at the initialization. In the case of two-layer networks, suitably large overparametrization and initialization suffice. On the other hand, the mean-field limit for wide two-layer networks, a limit that corresponds to a smaller—and perhaps more realistic—initialization, exhibits an essentially different behavior, highlighting the need to extend our understanding beyond linear models.

Section 6 returns to benign overfitting, focusing on the linear regime for two specific families of two-layer networks: a random features model, with randomly initialized first-layer parameters that remain constant throughout training, and a neural tangent model, corresponding to the linearization about a random initialization. Again, we see decompositions into a simple subspace (of low-degree polynomials) that is useful for prediction and a complex orthogonal subspace that allows interpolation without significantly harming prediction accuracy.

Section 7 outlines future directions. Specifically, for the two working hypotheses of tractability via overparametrization and generalization via implicit regularization, this section summarizes the insights from the examples that we have reviewed—mechanisms for implicit regularization, the role of dimension, decompositions into prediction and overfitting components, data-adaptive choices of these decompositions, and the tractability benefits of overparameterization. It also speculates on how these might extend to realistic deep learning settings.

Generalization and uniform convergence

This section reviews uniform convergence results from statistical learning theory and their implications for prediction with rich families of functions, such as those computed by neural networks. In classical statistical analyses, it is common to posit a specific probabilistic model for the process generating the data and to estimate the parameters of that model; see, for example, [BD07]. In contrast, the approach in this section is motivated by viewing neural networks as defining rich, flexible families of functions that are useful for prediction in a broad range of settings. We make only weak assumptions about the process generating the data, for example, that it is sampled independently from an unknown distribution, and we aim for the best prediction accuracy.

is close to zero, where the infimum is over all measurable functions. Notice that we assume only that $({\boldsymbol{x}},y)$ pairs are independent and identically distributed; in particular, we do not assume any functional relationship between ${\boldsymbol{x}}$ and $y$ .

Suppose that we choose $\widehat{f}$ from a set of functions ${\mathcal{F}}\subseteq{\mathcal{Y}}^{\mathcal{X}}$ . For instance, ${\mathcal{F}}$ might be the set of functions computed by a deep network with a particular architecture and with particular constraints on the parameters in the network. A natural approach to using the sample to choose $\widehat{f}$ is to minimize the empirical risk over the class ${\mathcal{F}}$ . Define

is the expectation of the loss under the empirical distribution defined by the sample. Often, we consider classes of functions ${\boldsymbol{x}}\mapsto f({\boldsymbol{x}};{\boldsymbol{\theta}})$ parameterized by ${\boldsymbol{\theta}}$ , and we use $L({\boldsymbol{\theta}})$ and $\widehat{L}({\boldsymbol{\theta}})$ to denote $L(f(\cdot;{\boldsymbol{\theta}}))$ and $\widehat{L}(f(\cdot;{\boldsymbol{\theta}}))$ , respectively.

We can split the excess risk of the empirical risk minimizer $\widehat{f}_{\scalebox{0.4}{erm}}$ into two components,

the second reflecting how well functions in the class ${\mathcal{F}}$ can approximate an optimal prediction rule and the first reflecting the statistical cost of estimating such a prediction rule from the finite sample. For a more complex function class ${\mathcal{F}}$ , we should expect the approximation error to decrease and the estimation error to increase. We focus on the estimation error, and on controlling it using uniform laws of large numbers.

2 Uniform laws of large numbers

Without any essential loss of generality, suppose that a minimizer $f_{{\mathcal{F}}}^{*}\in\arg\min_{f\in{\mathcal{F}}}L(f)$ exists. Then we can split the estimation error of an empirical risk minimizer $\widehat{f}_{\scalebox{0.4}{erm}}$ defined in (2) into three components:

where $\epsilon_{1},\ldots,\epsilon_{n}\in\{\pm 1\}$ are independent and uniformly distributed.

See [KP00, Kol01, BBL02, BM02] and [Kol06]. This theorem shows that for bounded losses, a uniform bound

3 Faster rates

Notice that, for any probability distribution on ${\mathcal{X}}$ , $R_{n}({\mathcal{F}})\leq\bar{R}_{n}({\mathcal{F}})$ .

Typically, faster rates like these arise when the variance of the excess loss is bounded in terms of its expectation, for instance

4 Complexity regularization

The results we have seen give bounds on the excess risk of $\widehat{f}_{\scalebox{0.4}{erm}}$ in terms of a sum of approximation error and a bound on the estimation error that depends on the complexity of the function class ${\mathcal{F}}$ . Rather than choosing the complexity of the function class ${\mathcal{F}}$ in advance, we could instead split a rich class ${\mathcal{F}}$ into a complexity hierarchy and choose the appropriate complexity based on the data, with the aim of managing this approximation-estimation tradeoff. We might define subsets ${\mathcal{F}}_{r}$ of a rich class ${\mathcal{F}}$ , indexed by a complexity parameter $r$ . We call each ${\mathcal{F}}_{r}$ a complexity class, and we say that it has complexity $r$ .

The following theorem gives an illustration of the effectiveness of this kind of complexity regularization. In the first part of the theorem, the complexity penalty for a complexity class is a uniform bound on deviations between expectations and sample averages for that class. We have seen that uniform deviation bounds of this kind imply upper bounds on the excess risk of the empirical risk minimizer in the class. In the second part of the theorem, the complexity penalty appears in the upper bounds on excess risk that arise in settings where faster rates are possible. In both cases, the theorem shows that when the bounds hold, choosing the best penalized empirical risk minimizer in the complexity hierarchy leads to the best of these upper bounds.

For each ${\mathcal{F}}_{r}\subseteq{\mathcal{F}}$ , define an empirical risk minimizer

Among these, select the one with complexity $\widehat{r}$ that gives an optimal balance between the empirical risk and a complexity penalty $p_{r}$ :

In the event that the complexity penalties are uniform deviation bounds:

Suppose that the complexity classes and penalties are ordered, that is,

and fix $f_{r}^{*}\in\arg\min_{f\in{\mathcal{F}}_{r}}L(f)$ . In the event that the complexity penalties satisfy the uniform relative deviation bounds

These are called oracle inequalities because (8) (respectively (10)) gives the error bound that follows from the best of the uniform bounds (7) (respectively (9)), as if we have access to an oracle who knows the complexity that gives the best bound. The proof of the first part is a straightforward application of the same decomposition as (4); see, for example, [BBL02]. The proof of the second part, which allows significantly smaller penalties $p_{r}$ when faster rates are possible, is also elementary; see [Bar08]. In both cases, the broad approach to managing the trade-off between approximation error and estimation error is qualitatively the same: having identified a complexity hierarchy $\{{\mathcal{F}}_{r}\}$ with corresponding excess risk bounds $p_{r}$ , these results show the effectiveness of choosing from the hierarchy a function $f$ that balances the complexity penalty $p_{r}$ with the fit to the training data $\widehat{L}(\widehat{f}_{\scalebox{0.4}{erm}}^{r})$ .

Later in this section, we will see examples of upper bounds on estimation error for neural network classes ${\mathcal{F}}_{r}$ indexed by a complexity parameter $r$ that depends on properties of the network, such as the size of the parameters. Thus, a prediction method that trades off the fit to the training data with these measures of complexity would satisfy an oracle inequality.

5 Computational complexity of empirical risk minimization

To this point, we have considered the statistical performance of the empirical risk minimizer $\widehat{f}_{\scalebox{0.4}{erm}}$ without considering the computational cost of solving this optimization problem. The classical cases where it can be solved efficiently involve linearly parameterized function classes, convex losses, and convex complexity penalties, so that penalized empirical risk minimization is a convex optimization problem. For instance, SVMs exploit a linear function class (an RKHS, ${\mathcal{H}}$ ), a convex loss,

Again, minimizing complexity-penalized empirical risk corresponds to solving a quadratic program.

On the other hand, the optimization problems that arise in a classification setting, where functions map to a discrete set, have a combinatorial flavor, and are often computationally hard in the worst case. For instance, empirical risk minimization over the set of linear classifiers

is NP-hard [JP78, GJ79]. In contrast, if there is a function in this class that classifies all of the training data correctly, finding an empirical risk minimizer is equivalent to solving a linear program, which can be solved efficiently. Another approach to simplifying the algorithmic challenge of empirical risk minimization is to replace the discrete loss for this family of thresholded linear functions with a surrogate convex loss for the family of linear functions. This is the approach used in SVMs: replacing a nonconvex loss with a convex loss allows for computational efficiency, even when there is no thresholded linear function that classifies all of the training data correctly.

However, the corresponding optimization problems for neural networks appear to be more difficult. Even when $\widehat{L}(\widehat{f}_{\scalebox{0.4}{erm}})=0$ , various natural empirical risk minimization problems over families of neural networks are NP-hard [Jud90, BR92, DSS95], and this is still true even for convex losses [Vu98, BBD02].

6 Classification

For ${\mathcal{F}}\subseteq^{{\mathcal{X}}}$ and for any distribution on ${\mathcal{X}}$ ,

If ${\mathcal{F}}\subseteq\{\pm 1\}^{{\mathcal{X}}}$ and $n\geq d=d_{VC}({\mathcal{F}})$ , then

where $d_{VC}({\mathcal{F}}):=\max\left\{d:\Pi_{{\mathcal{F}}}(d)=2^{d}\right\}$ . In that case, for any distribution on ${\mathcal{X}}$ ,

and conversely, for some probability distribution, $R_{n}({\mathcal{F}})=\Omega\left(\sqrt{d/n}\right)$ .

Part 1 is from [BH89]. The upper bound in part 2 is from [BHLM19]. The lower bound in part 2 and the bound in part 3 are from [BMM98]. There are also upper bounds for the smooth sigmoid $\sigma(\alpha)=1/(1+\exp(-\alpha))$ that are quadratic in $p$ ; see [KM97]. See Chapter 8 of [AB99] for a review.

The theorem shows that the VC-dimension of these neural networks grows at least linearly with the number of parameters in the network, and hence to achieve small excess risk or uniform convergence of sample averages to probabilities for discrete losses, the sample size must be large compared to the number of parameters in these networks.

There is an important caveat to this analysis: it captures arbitrarily fine-grained properties of real-valued functions, because the operation of thresholding these functions is very sensitive to perturbations, as the following example shows.

7 Large margin classification

where the infima are over measurable functions $f$ .

For $|{\mathcal{X}}|\geq 2$ , this inequality cannot hold if $\psi_{\phi}$ is replaced by any larger function:

$\psi_{\phi}(\theta_{i})\to 0$ implies $\theta_{i}\to 0$ if and only if both $\phi$ is differentiable at zero and $\phi^{\prime}(0)<0$ .

For example, for the hinge loss $\phi(\alpha)=(1-\alpha)\vee 0$ , the relationship between excess risk and excess $\phi$ -risk is given by $\psi_{\phi}(\theta)=|\theta|$ , for the quadratic loss $\phi(\alpha)=(1-\alpha)^{2}$ , $\psi_{\phi}(\theta)=\theta^{2}$ , and for the exponential loss $\phi(\alpha)=\exp(-\alpha)$ , $\psi_{\phi}(\theta)=1-\sqrt{1-\theta^{2}}$ . Theorem 2.8 is from [BJM06]; see also [Lin04, LV04] and [Zha04].

Notice that in addition to the Rademacher complexity of the real-valued class ${\mathcal{F}}$ , this bound includes an approximation error term defined in terms of the surrogate loss; the binary-valued prediction problem has been reduced to a real-valued problem.

8 Real prediction

For a real-valued function class ${\mathcal{F}}$ , there is an analog of Theorem 2.4 with the VC-dimension of ${\mathcal{F}}$ replaced by the pseudodimension of ${\mathcal{F}}$ , which is the VC-dimension of $\left\{({\boldsymbol{x}},y)\mapsto\boldsymbol{1}\left[f({\boldsymbol{x}})\geq y\right]:f\in{\mathcal{F}}\right\}$ ; see [Pol90]. Theorem 2.5 is true with the output nonlinearity $\sigma_{L}$ of ${\mathcal{F}}_{L,\sigma}$ replaced by any Lipschitz nonlinearity and with $d_{VC}$ replaced by the pseudodimension. However, using this result to obtain bounds on the excess risk of an empirical risk minimizer would again require the sample size to be large compared to the number of parameters.

The following result gives a bound on Rademacher complexity for neural networks that use a bounded, Lipschitz nonlinearity, such as the sigmoid function

For two-layer neural networks defined on ${\mathcal{X}}=^{d}$ ,

Theorem 2.9 is from [BM02]. The proof uses the contraction inequality (Theorem 2.7) and elementary properties of Rademacher complexity.

The following theorem gives similar error bounds for networks with Lipschitz nonlinearities that, like the ReLU nonlinearity, do not necessarily have a bounded range. The definition of the function class includes deviations of the parameter matrices ${\boldsymbol{W}}_{i}$ from fixed ‘centers’ ${\boldsymbol{M}}_{i}$ .

Theorem 2.10 is from [BFT17]. The proof uses different techniques (covering numbers rather than the Rademacher complexity) to address the key technical difficulty, which is controlling the scale of vectors that appear throughout the network.

where $\|{\boldsymbol{W}}_{i}\|_{F}$ denotes the Frobenius norm of ${\boldsymbol{W}}_{i}$ . Then we have

This result is from [GRS18], which also shows that it is possible to remove the $\sqrt{L}$ factor at the cost of a worse dependence on $n$ . See also [NTS15].

9 The mismatch between benign overfitting and uniform convergence

Theorems 2.9, 2.10, and 2.11 imply upper bounds on risk in terms of various notions of scale of network parameters. For these bounds to be meaningful for a given probability distribution, there must be an interpolating $\widehat{f}$ for which the complexity $r(\widehat{f})$ grows suitably slowly with the sample size $n$ so that the excess risk bounds converge to zero.

An easy example is when there is an $f^{*}\in{\mathcal{F}}_{r}$ with $L(f^{*})=0$ , where $r$ is a fixed complexity. Notice that this implies not just that the conditional expectation is in ${\mathcal{F}}_{r}$ , but that there is no noise, that is, almost surely $y=f^{*}({\boldsymbol{x}})$ . In that case, if we choose $\widehat{f}$ as the interpolant $\widehat{L}(\widehat{f})=0$ with minimum complexity, then its complexity will certainly satisfy $r(\widehat{f})\leq r(f^{*})=r$ . And then as the sample size $n$ increases, $L(\widehat{f})$ will approach zero. In fact, since $\widehat{L}(\widehat{f})=0$ , Theorem 2.2 implies a faster rate in this case: $L(\widehat{f})=O((\log n)^{4}\bar{R}_{n}^{2}({\mathcal{F}}_{r}))$ .

Theorem 2.3 shows that if we were to balance the complexity with the fit to the training data, then we can hope to enjoy excess risk as good as the best bound for any ${\mathcal{F}}_{r}$ in the complexity hierarchy. If we always choose a perfect fit to the data, there is no trade-off between complexity and empirical risk, but when there is a prediction rule $f^{*}$ with finite complexity and zero risk, then once the sample size is sufficiently large, the best trade-off does correspond to a perfect fit to the data. To summarize: when there is no noise, that is, when $y=f^{*}(x)$ , and $f^{*}\in{\mathcal{F}}$ , classical theory shows that a minimum-complexity interpolant $\widehat{f}\in{\mathcal{F}}$ will have risk $L(\widehat{f})$ converging to zero as the sample size increases.

But what if there is noise, that is, there is no deterministic relationship between ${\boldsymbol{x}}$ and $y$ ? Then it turns out that the bounds on the excess risk $L(\widehat{f})-L(f_{{\mathcal{F}}}^{*})$ presented in this section must become vacuous: they can never decrease below a constant, no matter how large the sample size. This is because these bounds do not rely on any properties of the distribution on ${\mathcal{X}}$ , and hence are also true in a fixed design setting, where the excess risk is at least the noise level.

To make this precise, fix ${\boldsymbol{x}}_{1},\ldots,{\boldsymbol{x}}_{n}\in{\mathcal{X}}$ and define the fixed design risk

Then the decomposition (4) extends to this risk: for any $\widehat{f}$ and $f^{*}$ ,

Finally, although Theorems 2.9 and Theorem 2.11 are stated as bounds on the Rademacher complexity of ${\mathcal{F}}_{r}$ , they are in fact bounds on $\bar{R}_{n}({\mathcal{F}}_{r})$ , the worst-case empirical Rademacher complexity of ${\mathcal{F}}$ .

Consider the complexity hierarchy defined in Theorem 2.9 or Theorem 2.11. For the minimum-complexity interpolant $\widehat{f}$ , these theorems give bounds that depend on the complexity $r(\widehat{f})$ , that is, bounds of the form $L(\widehat{f})-L(f^{*})\leq B(r(\widehat{f}))$ (ignoring the fact that that the minimum complexity $r(\widehat{f})$ is random; making the bounds uniform over $r$ would give a worse bound). Then these observations imply that

Thus, unless there is no noise, the upper bound on excess risk must be at least as big as a constant.

[BL20b] use a similar comparison between prediction problems in random design and fixed design settings to demonstrate situations where benign overfitting occurs but a general family of excess risk bounds—those that depend only on properties of $\widehat{f}$ and do not increase too quickly with sample size—must sometimes be very loose. [NK19] present a scenario where, with high probability, a classification method gives good predictive accuracy but uniform convergence bounds must fail for any function class that contains the algorithm’s output. Algorithmic stability approaches—see [DW79] and [BE02]—also aim to identify sufficient conditions for closeness of risk and empirical risk, and appear to be inapplicable in the interpolation regime. These examples illustrate that to understand benign overfitting, new analysis approaches are necessary that exploit additional information. We shall review results of this kind in Section 4, for minimum-complexity interpolants in regression settings. The notion of complexity that is minimized is obviously of crucial importance here; this is the topic of the next section.

Implicit regularization

When the model ${\mathcal{F}}$ is complex enough to ensure zero empirical error, such as in the case of overparametrized neural networks, the set of empirical minimizers may be large. Therefore, it may very well be the case that some empirical minimizers generalize well while others do not. Optimization algorithms introduce a bias in this choice: an iterative method may converge to a solution with certain properties. Since this bias is a by-product rather than an explicitly enforced property, we follow the recent literature and call it implicit regularization. In subsequent sections, we shall investigate statistical consequences of such implicit regularization.

Perhaps the simplest example of implicit regularization is gradient descent on the square-loss objective with linear functions:

This minimum-norm interpolant can be written in closed form as

where ${\boldsymbol{X}}^{\dagger}$ denotes the pseudoinverse. It can also be seen as a limit of ridge regression

as $\lambda\to 0^{+}$ . The connection between minimum-norm interpolation (14) and the “ridgeless” limit of ridge regression will be fruitful in the following sections when statistical properties of these methods are analyzed and compared.

Boosting is another notable example of implicit regularization arising from the choice of the optimization algorithm, this time for the problem of classification. Consider the linear classification objective

where $y_{1},\ldots,y_{n}\in\{\pm 1\}$ . In the classical formulation of the boosting problem, the coordinates of vectors ${\boldsymbol{x}}_{i}$ correspond to features computed by functions in some class of base classifiers. Boosting was initially proposed as a method for minimizing empirical classification loss (17) by iteratively updating ${\boldsymbol{\theta}}$ . In particular, AdaBoost [FS97] corresponds to coordinate descent on the exponential loss function

was shown in [ZY05] and [Tel13] assuming small enough step size, where separability means positivity of the margin

These results have been extended to multi-layer fully connected neural networks and convolutional neural networks (without nonlinearities) in [GLSS18b]. On the other hand, [GLSS18a] considered the implicit bias arising from other optimization procedures, including mirror descent, steepest descent, and AdaGrad, both in the case when the global minimum is attained (as for the square loss) and when the global minimizers are at infinity (as in the classification case with exponential-like tails of the loss function). We refer to [JT19] and [NLG+19] and references therein for further studies on faster rates of convergence to the direction of the max margin solution (with more aggressive time-varying step sizes) and on milder assumptions on the loss function.

In addition to the particular optimization algorithm being employed, implicit regularization arises from the choice of model parametrization. Consider re-parametrizing the least-squares objective in (13) as

where ${\boldsymbol{\theta}}({\boldsymbol{u}})_{i}={\boldsymbol{u}}_{i}^{2}$ is the coordinate-wise square. [GWB+17] show that if ${\boldsymbol{\theta}}_{\infty}(\alpha)$ is the limit point of gradient flow on (22) with initialization $\alpha\boldsymbol{1}$ and the limit $\widehat{{\boldsymbol{\theta}}}=\lim_{\alpha\to 0}{\boldsymbol{\theta}}_{\infty}(\alpha)$ exists and satisfies ${\boldsymbol{X}}\widehat{{\boldsymbol{\theta}}}={\boldsymbol{y}}$ , then it must be that

which can be viewed, in turn, as an empirical risk minimization objective for a two-layer neural network with linear activation functions.

In summary, in overparametrized problems that admit multiple minimizers of the empirical objective, the choice of the optimization method and the choice of parametrization both play crucial roles in selecting a minimizer with certain properties. As we show in the next section, these properties of the solution can ensure good generalization properties through novel mechanisms that go beyond the realm of uniform convergence.

Benign overfitting

We now turn our attention to generalization properties of specific solutions that interpolate training data. As emphasized in Section 2, mechanisms of uniform convergence alone cannot explain good statistical performance of such methods, at least in the presence of noise.

We assume that for any ${\boldsymbol{x}}$ , conditional variance of the noise $\xi=y-f^{*}({\boldsymbol{x}})$ is at most $\sigma_{\xi}^{2}$ , and we write $\xi_{i}=y_{i}-f^{*}({\boldsymbol{x}}_{i})$ .

As in the previous section, we say that a solution $\widehat{f}$ is interpolating if

In this section we consider linear (in ${\boldsymbol{y}}$ ) estimators of the form $\widehat{f}({\boldsymbol{x}})=\sum_{i=1}^{n}y_{i}\omega_{i}({\boldsymbol{x}}).$ For such estimators we have

with equality if conditional noise variances are equal to $\sigma_{\xi}^{2}$ at each ${\boldsymbol{x}}$ .

In classical statistics, the balance between bias and variance is achieved by tuning an explicit parameter. Before diving into the more unexpected interpolation results, where the behavior of bias and variance are driven by novel self-regularization phenomena, we discuss the bias-variance tradeoff in the context of one of the oldest statistical methods.

The Nadaraya-Watson (NW) smoothing estimator [Nad64, Wat64] is defined as

with a smaller power $0<a<d/2$ was shown in [BRT19] to lead to minimax optimal rates of estimation under the corresponding smoothness assumptions. Notably, the NW estimator with the kernel in (30) is necessarily interpolating the training data for any choice of $h$ .

The following result appears in [BRT19]; see also [BHM18]:

Let $f^{*}\in H(\beta,L)$ for $\beta\in(0,1]$ and $L>0$ . Suppose the marginal density $p$ of ${\boldsymbol{x}}$ satisfies $0<p_{\text{min}}\leq p({\boldsymbol{x}})\leq p_{\text{max}}$ for all ${\boldsymbol{x}}$ in its support. Then the estimator (29) with kernel (30) satisfiesIn the remainder of this paper, the symbol $\lesssim$ denotes inequality up to a multiplicative constant.

The result can be extended to smoothness parameters $\beta>1$ [BRT19]. The choice of $h=n^{-1/(2\beta+d)}$ balances the two terms and leads to minimax optimal rates for Hölder classes [Tsy08].

In retrospect, Theorem 4.1 should not be surprising, and we mention it here for pedagogical purposes. It should be clear from the definition (29) that the behavior of the kernel at , and in particular the presence of a singularity, determines whether the estimator fits the training data exactly. This is, however, decoupled from the level of smoothing, as given by the bandwidth parameter $h$ . In particular, it is the choice of $h$ alone that determines the bias-variance tradeoff, and the value of the empirical loss cannot inform us whether the estimator is over-smoothing or under-smoothing the data.

2 Linear regression in the interpolating regime

It is easy to see that the excess square loss can be written as

The solution has a closed form and yields the estimator

which can also be written as $\widehat{f}({\boldsymbol{x}})=\sum_{i=1}^{n}y_{i}\omega_{i}({\boldsymbol{x}})$ , with

Thus, from (27), the bias term can be written as

where $P^{\perp}=\mathbf{I}_{d}-{\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}({\boldsymbol{X}}{\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}})^{-1}{\boldsymbol{X}}$ , and from (28), the variance term is

Suppose ${\boldsymbol{z}}={\boldsymbol{\Sigma}}^{-1/2}{\boldsymbol{x}}$ is $1$ -sub-Gaussian. Without loss of generality, assume ${\boldsymbol{\Sigma}}=\text{diag}(\lambda_{1},\ldots,\lambda_{d})$ with $\lambda_{1}\geq\cdots\geq\lambda_{d}$ .

The central question now is: Are there mechanisms that can ensure small bias and variance of the minimum-norm interpolant? Surprisingly, we shall see that the answer is yes. To this end, choose an index $k\in\{1,\ldots,d\}$ and consider the subspace spanned by the top $k$ eigenvectors corresponding to $\lambda_{1},\ldots,\lambda_{k}$ . Write ${\boldsymbol{x}}^{\scriptscriptstyle\mathsf{\,T}}=[{\boldsymbol{x}}_{\leq k}^{\scriptscriptstyle\mathsf{\,T}},{\boldsymbol{x}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}]$ . For an appropriate choice of $k$ , it turns out the decomposition of the minimum-norm interpolant as $\langle\widehat{\boldsymbol{\theta}},{\boldsymbol{x}}\rangle=\langle\widehat{\boldsymbol{\theta}}_{\leq k},{\boldsymbol{x}}_{\leq k}\rangle+\langle\widehat{\boldsymbol{\theta}}_{>k},{\boldsymbol{x}}_{>k}\rangle$ corresponds to a decomposition into a prediction component and an interpolation component. Write the data matrix as ${\boldsymbol{X}}=[{\boldsymbol{X}}_{\leq k},{\boldsymbol{X}}_{>k}]$ and

Observe that if the eigenvalues of the second part were to be contained in an interval $[\gamma/c,c\gamma]$ for some $\gamma$ and a constant $c$ , we could write

where $c^{-1}\mathbf{I}_{n}\preceq{\boldsymbol{M}}\preceq c\mathbf{I}_{n}$ . If we replace ${\boldsymbol{M}}$ with the approximation $\mathbf{I}_{n}$ and substitute this expression into (33), we see that $\gamma$ would have an effect similar to explicit regularization through a ridge penalty: if that approximation were precise, the first $k$ components of $\widehat{\boldsymbol{\theta}}$ would correspond to

since this has the closed-form solution ${\boldsymbol{X}}_{\leq k}^{\scriptscriptstyle\mathsf{\,T}}({\boldsymbol{X}}_{\leq k}{\boldsymbol{X}}_{\leq k}^{\scriptscriptstyle\mathsf{\,T}}+\gamma\mathbf{I}_{n})^{-1}{\boldsymbol{y}}$ . Thus, if $\gamma$ is not too large, we might expect this approximation to have a minimal impact on the bias and variance of the prediction component.

It is, therefore, natural to ask when to expect such a near-isotropic behavior arising from the “tail” features. The following lemma provides an answer to this question [BLLT20]:

Suppose coordinates of ${\boldsymbol{\Sigma}}^{-1/2}{\boldsymbol{x}}$ are independent. Then there exists a constant $c>0$ such that, with probability at least $1-2\exp\{-n/c\}$ ,

The condition of independence of coordinates in Lemma 4.3 is satisfied for Gaussian ${\boldsymbol{x}}$ . It can be relaxed to the following small-ball assumption:

Under this assumption, the conclusion of Lemma 4.3 still holds with probability at least $1-2\exp\{-n/c\}-n\delta$ [TB20].

An appealing consequence of Lemma 4.3 is the small condition number of ${\boldsymbol{X}}_{>k}{\boldsymbol{X}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}$ for any $k$ such that $\sum_{i>k}\lambda_{i}\gtrsim\lambda_{k+1}n$ . Define the effective rank for a given index $k$ by

We see that $r_{k}({\boldsymbol{\Sigma}})\geq bn$ for some constant $b$ implies that the set of eigenvalues of ${\boldsymbol{X}}_{>k}{\boldsymbol{X}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}$ lies in the interval $[\gamma/c,c\gamma]$ for

and thus the scale of the self-induced regularization in (38) is the sum of the tail eigenvalues of the covariance operator. Interestingly, the reverse implication also holds: if for some $k$ the condition number of ${\boldsymbol{X}}_{>k}{\boldsymbol{X}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}$ is at most $\kappa$ with probability at least $1-\delta$ , then effective rank $r_{k}({\boldsymbol{\Sigma}})$ is at least $c_{\kappa}n$ with probability at least $1-\delta-c\exp\{-n/c\}$ for some constants $c,c_{\kappa}$ . Therefore, the condition $r_{k}({\boldsymbol{\Sigma}})\gtrsim n$ characterizes the indices $k$ such that ${\boldsymbol{X}}_{>k}{\boldsymbol{X}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}$ behaves as a scaling of $\mathbf{I}_{d}$ , and the scaling is proportional to $\sum_{i>k}\lambda_{i}$ . We may call the smallest such index $k$ the effective dimension, for reasons that will be clear in a bit.

How do the estimates on tail eigenvalues help in controlling the variance of the minimum-norm interpolant? Define

Then, omitting $\sigma_{\xi}^{2}$ for the moment, the variance upper bound in (36) can be estimated by

The first term is further upper-bounded by

and its expectation corresponds to the variance of $k$ -dimensional regression, which is of the order of $k/n$ . On the other hand, by Bernstein’s inequality, with probability at least $1-2\exp^{-cn}$ ,

so we have that the second term in (41) is, with high probability, of order at most

Putting these results together, we have the following theorem [TB20]:

Fix $\delta<1/2$ . Under Assumption 4.2, suppose for some $k$ the condition number of ${\boldsymbol{X}}_{>k}{\boldsymbol{X}}_{>k}^{\scriptscriptstyle\mathsf{\,T}}$ is at most $\kappa$ with probability at least $1-\delta$ . Then

We now turn to the analysis of the bias term. Since the projection operator in (35) annihilates any vector in the span of the rows of ${\boldsymbol{X}}$ , we can write

where $\widehat{{\boldsymbol{\Sigma}}}=n^{-1}{\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}{\boldsymbol{X}}$ is the sample covariance operator. Since projection contracts distances, we obtain an upper bound

The rate of approximation of the covariance operator by its sample-based counterpart has been studied in [KL17], and we conclude

The upper bound in (47) can be sharpened significantly by analyzing the bias in the two subspaces, as proved in [TB20]:

Under the assumptions of Theorem 4.4, for $n\gtrsim\log(1/\delta)$ , with probability at least $1-2\delta$ ,

The following result shows that without further assumptions, the bounds on variance and bias given in Theorems 4.4 and 4.5 cannot be improved by more than constant factors; see [BLLT20] and [TB20].

There are absolute constants $b$ and $c$ such that for Gaussian ${\boldsymbol{x}}\sim{\sf N}(0,{\boldsymbol{\Sigma}})$ , where ${\boldsymbol{\Sigma}}$ has eigenvalues $\lambda_{1}\geq\lambda_{2}\geq\cdots$ , with probability at least $1-\exp(-n/c)$ ,

for the constant $b$ in the definition of the effective dimension $k$ . That definition implies that $\lambda_{k}\geq\lambda\geq\lambda_{k+1}$ , so we can write the bias and variance terms, within constant factors, as

These are reminiscent of the bias and variance terms that arise in ridge regression (16). Indeed, a ridge regression estimate in a fixed design setting with ${\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}{\boldsymbol{X}}=\text{diag}(s_{1},\ldots,s_{d})$ has precisely these bias and variance terms with $\lambda_{i}$ replaced by $s_{i}$ ; see, for example, [DFKU13, Lemma 1]. In Section 4.3.3, we shall see the same bias-variance decomposition arise in a related setting, but with the dimension growing with sample size.

3 Linear regression in Reproducing Kernel Hilbert Spaces

Kernel methods are among the core algorithms in machine learning and statistics. These methods were introduced to machine learning in the pioneering work of [ABR64] as a generalization of the Perceptron algorithm to nonlinear functions by lifting the ${\boldsymbol{x}}$ -variable to a high- or infinite-dimensional feature space. Our interest in studying kernel methods here is two-fold: on the one hand, as discussed in detail in Sections 5 and 6, sufficiently wide neural networks with random initialization stay close to a certain kernel-based solution during optimization and are essentially equivalent to a minimum-norm interpolant; on the other hand, it has been noted that kernel methods exhibit similar surprising behavior of benign interpolation to neural networks [BMM18].

A kernel method in the regression setting amounts to choosing a feature map ${\boldsymbol{x}}\mapsto\phi({\boldsymbol{x}})$ and computing a (regularized) linear regression solution in the feature space. While Section 4.2 already addressed the question of overparametrized linear regression, the non-linear feature map $\phi({\boldsymbol{x}})$ might not satisfy Assumption 4.2. In this section, we study interpolating RKHS regression estimates using a more detailed analysis of certain random kernel matrices.

which has been extensively analyzed through the lens of bias-variance tradeoff with an appropriately tuned parameter $\lambda>0$ [CDV07]. As $\lambda\to 0^{+}$ , we obtain a minimum-norm interpolant

Alternatively, we can write the solution as

which makes it clear that $\omega_{i}({\boldsymbol{x}}_{j})=\boldsymbol{1}\left[i=j\right]$ . We first describe a setting where this approach does not lead to benign overfitting.

The RKHS norm corresponding to this kernel can be related to a Sobolev norm, and its RKHS has been shown [Bac17, GYK+20, CX21] to be closely related to the RKHS corresponding to the Neural Tangent Kernel (NTK), which we study in Section 6.

The intuition carries over to the more general case, as long as $d$ is a constant. The following theorem appears in [RZ19]:

We only state the results for the inner-product kernel

and remark that more general rotationally invariant kernels (including NTK: see Section 6) exhibit the same behavior under the independent-coordinate assumption [LRZ20].

For brevity, define $\boldsymbol{K}=n^{-1}K({\boldsymbol{X}},{\boldsymbol{X}})$ . Let ${\boldsymbol{r}}=(r_{1},\cdots,r_{d})\geq 0$ be a multi-index, and write $\left\|{\boldsymbol{r}}\right\|=\sum_{i=1}^{d}r_{i}$ . With this notation, each entry of the kernel matrix can be expanded as

and the monomials are $p_{{\boldsymbol{r}}}({\boldsymbol{x}}_{i})=({\boldsymbol{x}}_{i})^{r_{1}}\cdots({\boldsymbol{x}}_{i}[d])^{r_{d}}$ . If $h$ has infinitely many positive coefficients $\alpha$ , each ${\boldsymbol{x}}$ is lifted to an infinite-dimensional space. However, the resulting feature map $\phi({\boldsymbol{x}})$ is not (in general) sub-Gaussian. Therefore, results from Section 4.2 are not immediately applicable and a more detailed analysis that takes advantage of the structure of the feature map is needed.

As before, we separate the high-dimensional feature map into two parts, one corresponding to the prediction component, and the other corresponding to the overfitting part of the minimum-norm interpolant. More precisely, the truncated function $h^{\leq\iota}(t)=\sum_{i=0}^{\iota}\alpha_{i}t^{i}$ leads to the degree-bounded component of the empirical kernel:

The following theorem reveals the staircase structure of the eigenvalues of the kernel, with $\Theta(d^{\iota})$ eigenvalues of order $\Omega(d^{-\iota})$ , as long as $n$ is large enough to sketch these directions; see [LRZ20] and [GMMM20a].

Suppose $\alpha_{0},\ldots,\alpha_{\iota_{0}}>0$ and $d^{\iota_{0}}\log d=o(n)$ . Under Assumption 4.8, with probability at least $1-\exp^{-\Omega(n/d^{\iota_{0}})}$ , for any $\iota\leq\iota_{0}$ , $\boldsymbol{K}^{[\leq\iota]}$ has ${\iota+d}\choose\iota$ nonzero eigenvalues, all of them larger than $Cd^{-\iota}$ and the range of $\boldsymbol{K}^{[\leq\iota]}$ is the span of

The component $\boldsymbol{K}^{[\leq\iota]}$ of the kernel matrix sketches the low-frequency component of the signal in much the same way as the corresponding ${\boldsymbol{X}}_{\leq k}{\boldsymbol{X}}_{\leq k}^{\scriptscriptstyle\mathsf{\,T}}$ in linear regression sketches the top $k$ directions of the population distribution (see Section 4.2).

where $C(\iota)$ denotes constants that depend on $\iota$ . Since under our general assumptions on the distribution this orthogonality does not necessarily hold, we employ the Gram-Schmidt process on the basis $\{1,t,t^{2},\ldots\}$ with respect to $L^{2}(p)$ to produce an orthogonal polynomial basis $q_{0},q_{1},\ldots$ . This yields new features

As shown in [LRZ20], these features are weakly dependent and the orthogonalization process does not distort the eigenvalues of the covariance matrix by more than a multiplicative constant. A small-ball method [KM15] can then be used to prove the lower bound for the eigenvalues of $\Psi\Psi^{\scriptscriptstyle\mathsf{\,T}}$ and thus establish Theorem 4.9.

We now turn to variance and bias calculations. The analogue of (36) becomes

and, similarly to (37), we split the kernel matrix into two parts, according to the degree $\iota$ .

The following theorem establishes an upper bound on (53) [LRZ20]:

Under Assumption 4.8 and the additional assumption of sub-Gaussianity of the distribution $p$ for the coordinates of ${\boldsymbol{x}}$ , if $\alpha_{1},\ldots,\alpha_{\iota}>0$ , there exists $\iota^{\prime}\geq 2\iota+3$ with $\alpha_{\iota^{\prime}}>0$ , and $d^{\iota}\log d\lesssim n\lesssim d^{\iota+1}$ , then with probability at least $1-\exp^{-\Omega(n/d^{\iota})}$ ,

Notice that the behavior of the upper bound changes as $n$ increases from $d^{\iota}$ to $d^{\iota+1}$ . At $d\asymp n^{\iota}$ , variance is large since there is not enough data to reliably estimate all the $d^{\iota}$ directions in the feature space. As $n$ increases, variance in the first $d^{\iota}$ directions decreases; new directions in the data appear (those corresponding to monomials of degree $\iota+1$ , with smaller population eigenvalues) but cannot be reliably estimated. This second part of (54) grows linearly with $n$ , similarly to the second term in (44). The split between these two terms occurs at the effective dimension defined in Section 4.2.

Two aspects of the multiple-descent behavior of the upper bound (54) should be noted. First, variance is small when $d^{\iota}\ll n\ll d^{\iota+1}$ , between the peaks; second, the valleys become deeper as $d$ becomes larger, with variance at most $d^{-1/2}$ at $n=d^{\iota+1/2}$ .

We complete the discussion of this section by exhibiting one possible upper bound on the bias term [LRZ20]:

Assume the regression function can be written as

Let Assumption 4.8 hold, and suppose $\sup_{\boldsymbol{x}}k({\boldsymbol{x}},{\boldsymbol{x}})\lesssim 1$ . Then

with probability at least $1-\delta$ . The above expectation is precisely ${\widehat{\textsc{var}}}/\sigma^{2}_{\xi}$ and can be bounded as in Theorem 4.10.

We now turn our attention to the regime $d\asymp n$ and investigate the behavior of minimum norm interpolants in the RKHS in this high-dimensional setting. Random kernel matrices in the $d\asymp n$ regime have been extensively studied in the last ten years. As shown in [EK10], under assumptions specified below, the kernel matrix can be approximated in operator norm by

that is, a linear kernel plus a scaling of the identity. While this equivalence can be viewed as a negative result about the utility of kernels in the $d\asymp n$ regime, the term $c_{2}\mathbf{I}_{n}$ provides implicit regularization for the minimum-norm interpolant in the RKHS [LR20].

$\|{\boldsymbol{\Sigma}}\|\leq M$ , $d^{-1}\sum_{i=1}^{d}\lambda_{i}^{-1}\leq M$ , where $\lambda_{1},\dots,\lambda_{d}$ are the eigenvalues of ${\boldsymbol{\Sigma}}$ .

Note that, for $i\neq j$ , the rescaled scalar products $\left\langle{\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j}\right\rangle/d$ are typically of order $1/\sqrt{d}$ . We can therefore approximate the kernel function by its Taylor expansion around . To this end, define

Under Assumption 4.12, a variant of a result of [EK10] implies that for some $c_{0}\in(0,1/2)$ , the following holds with high probability

To make the self-induced regularization due to the ridge apparent, we develop an upper bound on the variance of the minimum-norm interpolant in (53). Up to an additive diminishing factor, this expression can be replaced by

where we assumed without loss of generality that $\alpha=0$ . Comparing to (41), we observe that here implicit regularization arises due to the ‘curvature’ of the kernel, in addition to any favorable tail behavior in the spectrum of ${\boldsymbol{X}}{\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}$ . Furthermore, this regularization arises under rather weak assumptions on the random variables even if Assumption 4.2 is not satisfied. A variant of the development in [LR20] yields a more interpretable upper bound of

for any $k\geq 1$ [Lia20]; the proof is in the Supplementary Material. Furthermore, a high probability bound on the bias

can be established with basic tools from empirical process theory under boundedness assumptions on $\sup_{{\boldsymbol{x}}}k({\boldsymbol{x}},{\boldsymbol{x}})$ [LR20].

With more recent developments on the bias and variance of linear interpolants in [HMRT20], a significantly more precise statement can be derived for the $d\asymp n$ regime. The proof of the following theorem is in the Supplementary Material.

Define $\mathscrsfs{B}({\boldsymbol{\Sigma}},{\boldsymbol{\beta}}_{0})$ and $\mathscrsfs{V}({\boldsymbol{\Sigma}})$ by

A few remarks are in order. First, note that the left hand side of (61) is strictly increasing in $\lambda_{*}$ , while the right hand side is strictly decreasing. By considering the limits as $\lambda_{*}\to 0$ and $\lambda_{*}\to\infty$ , it is easy to see that this equation indeed admits a unique solution. Second, the bias estimate in (60) requires $f^{*}\in{\mathcal{H}}$ , while the bias calculation in (64) does not make this assumption, but instead incurs an approximation error for non-linear components of $f^{*}$ .

We now remark that the minimum-norm interpolant with kernel $K^{\text{lin}}$ is simply ridge regression with respect to the plain covariates ${\boldsymbol{X}}$ and ridge penalty proportional to $\gamma$ :

The characterization in (61), (62), and (63) can be shown to imply upper bounds that are related to the analysis in Section 4.2.

Further, under the same assumptions, the effective regularization $\lambda_{*}$ (that is, the unique solution of (61)), satisfies

Note that apart from the $n^{-c_{0}}$ term, (67) recovers the result of Theorem 4.5, while (68) recovers Theorem 4.4 (setting $\gamma=0$ ), both with improved constants but limited to the proportional regime. We remark that analogues of Theorems 4.4, 4.5, and 4.6 for ridge regression with $\gamma\neq 0$ can be found in [TB20].

The formulas (61), (62), and (63) might seem somewhat mysterious. However, they have an appealing interpretation in terms of a simpler model that we will refer to as a ‘sequence model’ (this terminology comes from classical statistical estimation theory [Joh19]). As stated precisely in the remark below, the sequence model is a linear regression model in which the design matrix is deterministic (and diagonal), and the noise and regularization levels are determined via a fixed point equation.

where $\tau$ is a parameter given below. We then perform ridge regression with regularization $\lambda_{*}$ :

To conclude this section, we summarize the insights gained from the analyses of several models in the interpolation regime. First, in all cases, the interpolating solution $\widehat{f}$ can be decomposed into a prediction (or simple) component and an overfitting (or spiky) component. The latter ensures interpolation without hurting prediction accuracy. In the next section, we show, under appropriate conditions on the parameterization and the initialization, that gradient methods can be accurately approximated by their linearization, and hence can be viewed as converging to a minimum-norm linear interpolating solution despite their non-convexity. In Section 6, we return to the question of generalization, focusing specifically on two-layer neural networks in linear regimes.

Efficient optimization

The empirical risk minimization (ERM) problem is, in general, intractable even in simple cases. Section 2.5 gives examples of such hardness results. The classical approach to address this conundrum is to construct convex surrogates of the non-convex ERM problem. The problem of learning a linear classifier provides an easy-to-state—and yet subtle—example. Considering the - $1$ loss, ERM reads

In the case of linear classifiers, tractability arises because of the specific structure of the function class (which is linear in the parameters ${\boldsymbol{\theta}}$ ), but one might wonder whether it is instead a more general phenomenon. The problem of finding an interpolator can be phrased as a constraint optimization problem. Write the empirical risk as

Then we are seeking ${\boldsymbol{\theta}}\in\Theta$ such that

Random constraint satisfaction problems have been studied in depth over the last twenty years, although under different distributions from those arising from neural network theory. Nevertheless, a recurring observation is that, when the number of free parameters is sufficiently large compared to the number of constraints, these problems (which are NP-hard in the worst case) become tractable; see, for example, [FS96, AM97] and [CO10].

These remarks motivate a fascinating working hypothesis: modern neural networks are tractable because they are overparametrized.

Unfortunately, a satisfactory theory of this phenomenon is still lacking, with an important exception: the linear regime. This is a training regime in which the network can be approximated by a linear model, with a random featurization map associated with the training initialization. We discuss these results in Section 5.1.

While the linear theory can explain a number of phenomena observed in practical neural networks, it also misses some important properties. We will discuss these points, and results beyond the linear regime, in Section 5.2.

As first argued in [JGH18], in a highly overparametrized regime it can happen that ${\boldsymbol{\theta}}$ changes only slightly with respect to the initialization ${\boldsymbol{\theta}}_{0}$ . This suggests comparing the original gradient flow with the one obtained by linearizing the right-hand side of (77) around the initialization ${\boldsymbol{\theta}}_{0}$ :

More precisely, this is the gradient flow for the risk function

The linear (or ‘lazy’ ) regime is a training regime in which ${\boldsymbol{\theta}}_{t}$ is well approximated by $\overline{\boldsymbol{\theta}}_{t}$ at all times. Of course if $f_{n}({\boldsymbol{\theta}})$ is an affine function of ${\boldsymbol{\theta}}$ , that is, if ${\boldsymbol{D}}f_{n}({\boldsymbol{\theta}})$ is constant, then we have ${\boldsymbol{\theta}}_{t}=\overline{\boldsymbol{\theta}}_{t}$ for all times $t$ . It is therefore natural to quantify deviations from linearity by defining the Lipschitz constant

The next theorem establishes sufficient conditions for ${\boldsymbol{\theta}}_{t}$ to remain in the linear regime in terms of the singular values and Lipschitz constant of the Jacobian. Statements of this type were proved in several papers, starting with [DZPS19]; see, for example, [AZLS19, DLL+19, ZCZG20, OS20] and [LZB20]. We follow the abstract point of view in [OS19] and [COB19].

The empirical risk decreases exponentially fast to , with rate $\lambda_{0}=\sigma^{2}_{\min}/(2n)$ :

The parameters stay close to the initialization and are closely tracked by those of the linearized flow. Specifically, letting $L_{n}:={\rm Lip}({\boldsymbol{D}}f_{n})$ ,

The models constructed by gradient flow and by the linearized flow are similar on test data. Specifically, writing $f^{\mbox{\rm\tiny lin}}({\boldsymbol{\theta}})=f({\boldsymbol{\theta}}_{0})+{\boldsymbol{D}}f({\boldsymbol{\theta}}_{0})({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{0})$ , we have

The bounds in (85) and (86) follow from the main result of [OS19]. The coupling bounds in (87) and (88) are proved in the Supplementary Material.

Equivalently, the residuals ${\boldsymbol{r}}_{t}:={\boldsymbol{y}}-f^{\mbox{\rm\tiny lin}}_{n}(\overline{\boldsymbol{\theta}}_{t})$ are driven to zero according to $({\rm d}/{\rm d}t){\boldsymbol{r}}_{t}=-{\boldsymbol{K}}_{m,0}{\boldsymbol{r}}_{t}/n$ .

Applying Theorem 5.1 requires the evaluation of the minimum and maximum singular values of the Jacobian, as well as its Lipschitz constant. As an example, we consider the case of two-layer neural networks:

To simplify our task, we assume the second layer weights ${\boldsymbol{b}}=(b_{1},\dots,b_{m})\in\{+1,-1\}^{m}$ to be fixed with an equal number of $+1$ s and $-1$ s. Without loss of generality we can assume that $b_{1}=\cdots=b_{m/2}=+1$ and $b_{m/2+1}=\cdots=b_{m}=-1$ . We train the weights ${\boldsymbol{w}}_{1},\dots,{\boldsymbol{w}}_{m}$ via gradient flow. The number of parameters is $p=md$ . The scaling factor $\alpha$ allows tuning between different regimes. We consider two initializations, denoted by ${\boldsymbol{\theta}}_{0}^{(1)}$ and ${\boldsymbol{\theta}}_{0}^{(2)}$ :

Equations (94), (95) are straightforward [OS19]. The remaining inequalities are proved in the Supplementary Material. Using these estimates in Theorem 5.1, we get the following approximation theorem for two-layer neural nets.

then, with probability at least $1-2\exp\{-n/C_{0}\}$ , the following hold for all $t\geq 0$ .

Gradient flow converges exponentially fast to a global minimizer. Specifically, letting $\lambda_{*}=C_{1}\alpha^{2}d/n$ , we have

The model constructed by gradient flow and linearized flow are similar on test data, namely

It is instructive to consider Theorem 5.4 for two different choices of $\alpha$ (a third one will be considered in Section 5.2).

For $\alpha=\Theta(1)$ , we have $\overline{\alpha}=\Theta(1)$ and therefore the two initializations $\{{\boldsymbol{\theta}}^{(1)},{\boldsymbol{\theta}}_{0}^{(2)}\}$ behave similarly. In particular, condition (101) requires $md\gg n^{2}$ : the number of network parameters must be quadratic in the sample size. This is significantly stronger than the simple condition that the network is overparametrized, namely $md\gg n$ . Under the condition $md\gg n^{2}$ we have exponential convergence to vanishing training error, and the difference between the neural network and its linearization is bounded as in (103). This bound vanishes for $m\gg n^{5}/d^{4}$ . While we do not expect this condition to be tight, it implies that, under the choice $\alpha=\Theta(1)$ , sufficiently wide networks behave as linearly parametrized models.

Recall that, as $t\to\infty$ , $\overline{\boldsymbol{\theta}}_{t}$ converges to the min-norm interpolant $\overline{\boldsymbol{\theta}}_{\infty}$ ; see (80). Therefore, as long as condition (101) holds and the right-hand side of (103) is negligible, the generalization properties of the neural network are well approximated by those of min-norm interpolation in a linear model with featurization map ${\boldsymbol{x}}\mapsto{\boldsymbol{D}}f({\boldsymbol{x}};{\boldsymbol{\theta}}_{0})$ . We will study the latter in Section 6.

In the next subsection we will see that the linear theory outlined here fails to capture different training schemes in which the network weights genuinely change.

2 Beyond the linear regime?

For a given dimension $d$ and sample size $n$ , we can distinguish two ways to violate the conditions for the linear regime, as stated for instance in Theorem 5.4. First, we can reduce the network size $m$ . While Theorem 5.4 does not specify the minimum $m$ under which the conclusions of the theorem cease to hold, it is clear that $md\geq n$ is necessary in order for the training error to vanish as in (102).

However, even if the model is overparametrized, the same condition is violated if $\alpha$ is sufficiently small. In particular, the limit $m\to\infty$ with $\alpha=\alpha_{0}/\sqrt{m}$ has attracted considerable attention and is known as the mean field limit. In order to motivate the mean field analysis, we can suggestively rewrite (90) as

where $\widehat{\rho}:=m^{-1}\sum_{j=1}^{m}\delta_{{\boldsymbol{w}}_{j},b_{j}}$ is the empirical distribution of neuron weights. If the weights are drawn i.i.d. from a common distribution $({\boldsymbol{w}}_{j},b_{j})\sim\rho$ , we can asymptotically replace $\widehat{\rho}$ with $\rho$ in the above expression, by the law of large numbers.

The gradient flow (77) defines an evolution over the space of neuron weights, and hence an evolution in the space of empirical distributions $\widehat{\rho}$ . It is natural to ask whether this evolution admits a simple characterization. This question was first addressed by [NS17, MMN18, RVE18, SS20] and [CB18].

Here the gradient $\nabla$ is with respect to $({\boldsymbol{w}},b)$ (gradient in $d+1$ dimensions) if both first- and second-layer weights are trained, and only with respect to ${\boldsymbol{w}}$ (gradient in $d$ dimensions) if only first-layer weights are trained.

This statement can be obtained by checking the conditions of [CB18, Theorem 2.6]. A quantitative version can be obtained for bounded $\sigma$ using Theorem 1 of [MMM19].

A few remarks are in order. First, the limit in (105) requires time to be accelerated by a factor $m$ . This is to compensate for the fact that the function value is scaled by a factor $1/m$ . Second, while we stated this theorem as an asymptotic result, for large $m$ , the evolution described by the PDE (106) holds at any finite $m$ for the empirical measure $\widehat{\rho}_{t}$ . In that case, the gradient of $\rho_{t}$ is not well defined, and it is important to interpret this equation in the weak sense [AGS08, San15]. The advantage of working with the average measure $\rho_{t}$ instead of the empirical one $\widehat{\rho}_{t}$ is that the former is deterministic and has a positive density (this has important connections to global convergence). Third, quantitative versions of this theorem were proved in [MMN18, MMM19], and generalizations to multi-layer networks in [NP20].

Mean-field theory can be used to prove global convergence results. Before discussing these results, let us emphasize that —in this regime— the weights move in a non-trivial way during training, despite the fact that the network is infinitely wide. For the sake of simplicity, we will focus on the case already treated in the previous section in which the weights $b_{j}\in\{+1,-1\}$ are initialized with signs in equal proportions, and are not changed during training. Let us first consider the evolution of the predicted values $F_{n}(\rho_{t}):=(F({\boldsymbol{x}}_{1};\rho_{t}),\dots,F({\boldsymbol{x}}_{n};\rho_{t}))$ . Manipulating (106), we get

In the short-time limit we recover the linearized evolution of (89) [MMM19], but the kernel $K_{t}$ is now changing with training (with a factor $m$ acceleration in time).

Theorem 5.1 (see (86)) and Lemma 5.3 (see (94)–(96)) imply that, with high probability,

where $\overline{\alpha}=\alpha/(1+\alpha)$ for initialization ${\boldsymbol{\theta}}^{(1)}_{0}$ and $\overline{\alpha}=\alpha$ for initialization ${\boldsymbol{\theta}}^{(2)}_{0}$ . In the mean field regime $\overline{\alpha}\asymp\alpha\asymp 1/\sqrt{m}$ and the right hand side above is of order $\sqrt{n/d}$ , and hence it does not vanish. This is not due to a weakness of the analysis. By (110), we can choose $\varepsilon$ a small enough constant so that

This is bounded away from as long as $v_{2}(\rho_{0})$ is non-vanishing. In order to see this, note that $\lambda_{\min}({\boldsymbol{K}}_{0})\geq c_{0}\,d$ with high probability for $c_{0}$ a constant (note that ${\boldsymbol{K}}_{0}$ is a kernel inner product random matrix, and hence this claim follows from the general results of [MMM21]). Noting that $F_{n}(\rho_{0})=0$ (because $\int b\rho_{0}({\rm d}b,{\rm d}{\boldsymbol{w}})=0$ ), this implies, with high probability,

We expect this lower bound to be tight, as can be seen by considering the pure noise case ${\boldsymbol{y}}\sim{\sf N}(0,\tau^{2}{\mathbf{I}}_{n})$ , which leads to $v(\rho_{0})=\tau^{2}{\mathsf{tr}}({\boldsymbol{K}}_{0})/n^{2}(1+o_{n}(1))\asymp d/n$ .

hence the limit on the left-hand side of (113) is indeed non-vanishing as $m\to\infty$ at $n,d$ fixed. In other words, the fact that the upper bound in (112) is non-vanishing is not an artifact of the bounding technique, but a consequence of the change of training regime. We also note a gap between the upper and lower bounds in (115) when $n\gg d$ : a better understanding of this quantity is an interesting open problem. In conclusion, both a linear and a nonlinear regime can be obtained in the infinite-width limit of two-layer neural networks, for different scalings of the normalization factor $\alpha$ .

As mentioned above, the mean field limit can be used to prove global convergence results, both for two-layer [MMN18, CB18] and for multilayer networks [NP20]. Rather than stating these (rather technical) results formally, it is instructive to discuss the nature of fixed points of the evolution (106): this will also indicate the key role played by the support of the distribution $\rho_{t}$ .

$\rho_{*}$ is a fixed point of the evolution (106) if and only if, for all $(b,{\boldsymbol{w}})\in{\rm supp}(\rho_{*})$ , we have $\psi({\boldsymbol{w}};\rho_{*})=0$ and $b\nabla_{{\boldsymbol{w}}}\psi({\boldsymbol{w}};\rho_{*})=0$ .

This statement clarifies that fixed points of the gradient flow are only a ‘small’ superset of global minimizers, as $m\to\infty$ . Consider for instance the case of an analytic activation function $t\mapsto\sigma(t)$ . Let $\rho_{*}$ be a stationary point and assume that its support contains a sequence of distinct points $\{(b_{i},{\boldsymbol{w}}_{i})\}_{i\geq 1}$ such that $\{{\boldsymbol{w}}_{i}\}_{i\geq 1}$ has an accumulation point. Then, by condition $(b)$ , $\psi({\boldsymbol{w}};\rho_{*})=0$ identically and therefore $\rho_{*}$ is a global minimum. In other words, the only local minima correspond to $\rho_{*}$ supported on a set of isolated points. Global convergence proofs aim at ruling out this case.

3 Other approaches

The mean-field limit is only one of several analytical approaches that have been developed to understand training beyond the linear regime. A full survey of these directions goes beyond the scope of this review. Here we limit ourselves to highlighting a few of them that have a direct connection to the analysis in the previous section.

A natural idea is to view the linearized evolution as the first order in a Taylor expansion, and to construct higher order approximations. This can be achieved by writing an ordinary differential equation for the evolution of the kernel ${\boldsymbol{K}}_{t}$ (see (109) for the infinite-width limit). This takes the form [HY20]

Other approaches towards constructing a Taylor expansion around the linearized evolutions were proposed, among others, by [DGA20] and [HN20].

Note that the linearized approximation relies on the assumption that the Jacobian ${\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{0})$ is non-vanishing and well conditioned. [BL20a] propose specific neural network parametrizations in which the Jacobian at initialization vanishes, and the first non-trivial term in the Taylor expansion is quadratic. Under such initializations the gradient flow dynamics is ‘purely nonlinear’.

Generalization in the linear regime

As discussed in Sections 2 and 4, approaches that control the test error via uniform convergence fail for overparametrized interpolating models. So far, the most complete generalization results for such models have been obtained in the linear regime, namely under the assumption that we can approximate $f({\boldsymbol{\theta}})$ by its first order Taylor approximation $f_{\mbox{\rm\tiny lin}}({\boldsymbol{\theta}})=f({\boldsymbol{\theta}}_{0})+{\boldsymbol{D}}f({\boldsymbol{\theta}})({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{0})$ . While Theorem 5.1 provides a set of sufficient conditions for this approximation to be accurate, in this section we leave aside the question of whether or when this is indeed the case, and review what we know about the generalization properties of these linearized models. We begin in Section 6.1 by discussing the inductive bias induced by gradient descent on wide two-layer networks. Section 6.2 describes a general setup. Section 6.3 reviews random features models: two-layer neural networks in which the first layer is not trained and entirely random. While these are simpler than neural networks in the linear regime, their generalization behavior is in many ways similar. Finally, in Section 6.4 we review progress on the generalization error of linearized two-layer networks.

For simplicity, we will set $f({\boldsymbol{x}};{\boldsymbol{\theta}}_{0})=0$ . This can be achieved either by properly constructing the initialization ${\boldsymbol{\theta}}_{0}$ (as in the initialization ${\boldsymbol{\theta}}_{0}^{(2)}$ in Section 5.1) or by redefining the response vector ${\boldsymbol{y}}^{\prime}={\boldsymbol{y}}-f_{n}({\boldsymbol{\theta}}_{0})$ . If $f({\boldsymbol{x}};{\boldsymbol{\theta}}_{0})=0$ , the interpolation constraint $y_{i}=f_{\mbox{\rm\tiny lin}}({\boldsymbol{x}}_{i};{\boldsymbol{a}})$ for all $i\leq n$ can be written as ${\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{0}){\boldsymbol{a}}={\boldsymbol{y}}$ .

Consider the case of two-layer neural networks in which only first-layer weights are trained. Recalling the form of the Jacobian (93), we can rewrite (117) as

and we are assuming that the weights ${\boldsymbol{w}}_{j}$ in (119) are initialized as

This is also an RKHS norm whose kernel $K_{{\sf NT}}({\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2})$ will be described below; see (129).

Let us emphasize that moving out of the linear regime leads to different—and possibly more interesting—inductive biases than those described in (119) or (120). As an example, [CB20] analyze the mean field limit of two-layer networks, trained with logistic loss, for activation functions that have Lipschitz gradient and are positively $2$ -homogeneous. For instance, the square ReLU $\sigma(x)=(x_{+})^{2}$ with fixed second-layer coefficients fits this framework. The usual ReLU with trained second-layer coefficients $b_{j}\sigma(\langle{\boldsymbol{w}}_{j},{\boldsymbol{x}}\rangle)=b_{j}(\langle{\boldsymbol{w}}_{j},{\boldsymbol{x}}\rangle)_{+}$ is $2$ -homogeneous but not differentiable. In this setting, and under a convergence assumption, they show that gradient flow minimizes the following norm among interpolators:

This norm differs in two ways from the RKHS norm of (120). Each is defined in terms of a different integral operator,

2 Ridge regression in the linear regime

We generalize the min-norm procedure of (117) to consider the ridge regression estimator:

The min-norm estimator can be recovered by taking the limit of vanishing regularization $\lim_{\lambda\to 0}{\widehat{\boldsymbol{a}}}(\lambda)={\widehat{\boldsymbol{a}}}(0^{+})$ (with a slight abuse of notation, we will identify $\lambda=0$ with this limit). Apart from being intrinsically interesting, the behavior of ${\widehat{\boldsymbol{a}}}(\lambda)$ for $\lambda>0$ is a good approximation of the behavior of the estimator produced by gradient flow with early stopping [AKT19]. More precisely, letting $({\widehat{\boldsymbol{a}}}_{\mbox{\rm\tiny GF}}(t))_{t\geq 0}$ denote the path of gradient flow initialized at ${\widehat{\boldsymbol{a}}}_{\mbox{\rm\tiny GF}}(0)=0$ , there exists a parametrization $t\mapsto\lambda(t)$ , such that the test error at ${\widehat{\boldsymbol{a}}}_{\mbox{\rm\tiny GF}}(t)$ is well approximated by the test error at ${\widehat{\boldsymbol{a}}}(\lambda(t))$ .

Ridge regression (122) within either model ${\mathcal{F}}_{{\sf RF}}$ or ${\mathcal{F}}_{{\sf NT}}$ can be viewed as kernel ridge regression (KRR) with respect to the kernels

The convergence $K_{{\sf RF},m}\to K_{{\sf RF}}$ , $K_{{\sf NT},m}\to K_{{\sf NT},m}$ takes place under suitable assumptions, pointwise [RR07]. However, we would like to understand the qualitative behavior of the generalization error in the above linearized models.

Does the procedure (122) share qualitative behavior with KRR, as discussed in Section 4? In particular, can min-norm interpolation be (nearly) optimal in the RF or NT models as well?

How large should $m$ be for the generalization properties of RF or NT ridge regression to match those of the associated kernel?

What discrepancies between KRR and RF or NT regression can we observe when $m$ is not sufficiently large?

Is there any advantage of one of the three methods (KRR, RF, NT) over the others?

Throughout this section we assume an isotropic model for the distribution of the covariates ${\boldsymbol{x}}_{i}$ , namely we assume $\{({\boldsymbol{x}}_{i},y_{i})\}_{i\leq n}$ to be i.i.d., with

We evaluate the quality of method (122) using the square loss

3 Random features model

We begin by considering the random features model ${\mathcal{F}}_{{\sf RF}}$ . A number of authors have established upper bounds on its minimax generalization error for suitably chosen positive values of the regularization [RR17, RR09]. Besides the connection to neural networks, ${\mathcal{F}}_{{\sf RF}}$ can be viewed as a randomized approximation for the RKHS associated with $K_{{\sf RF}}$ . A closely related approach in this context is provided by randomized subset selection, also known as Nyström’s method [WS01, Bac13, EAM15, RCR15].

The classical random features model ${\mathcal{F}}_{{\sf RF}}$ is mathematically easier to analyze than the neural tangent model ${\mathcal{F}}_{{\sf NT}}$ , and a precise picture can be established that covers the interpolation limit. Several elements of this picture have been proved to generalize to the ${\sf NT}$ model as well, as discussed in the next subsection.

We focus on the high-dimensional regime, $m,n,d\to\infty$ ; as discussed in Section 4, interpolation methods have appealing properties in high dimension. Complementary asymptotic descriptions are obtained depending on how $m,n,d$ diverge. In Section 6.3.1 we discuss the behavior at a coarser scale, namely when $m$ and $n$ scale polynomially in $d$ : this type of analysis provides a simple quantitative answer to the question of how large $m$ should be to approach the $m=\infty$ limit. Next, in Section 6.3.2, we consider the proportional regime $m\asymp n\asymp d$ . This allows us to explore more precisely what happens in the transition from underparametrized to overparametrized.

In words, as long as the number of parameters $m$ and the number of samples $n$ are well separated, the test error is determined by the minimum of $m$ and $n$ :

Both of the above are achieved for any sufficiently small value of the regularization parameter $\lambda$ . In particular, they apply to min-norm interpolation (corresponding to the case $\lambda=0^{+}$ ).

From a practical perspective, if the sample size $n$ is given, we might be interested in choosing the number of neurons $m$ . The above result indicates that the test error roughly decreases until the overparametrization threshold $m\approx n$ , and that there is limited improvement from increasing the network size beyond $m\geq nd^{\delta}$ . At this point, RF ridge regression achieves the same error as the corresponding kernel method. Indeed the statement of Theorem 6.1 holds for the case of KRR as well, by identifying it with the limit $m=\infty$ [GMMM20a].

Note that the infinite width (kernel) limit $m=\infty$ corresponds to the setting already investigated in Theorem 4.10. Indeed, the staircase phenomenon in the $m=\infty$ case of Theorem 6.1 corresponds to the multiple descent behavior seen in Theorem 4.10. The two results do not imply each other because Theorem 4.10 assumes $f^{*}$ to have bounded RKHS norm; Theorem 6.1 does not make this assumption, but is not as sharp for functions with bounded RKHS norm.

3.2 Proportional scaling

Theorem 6.1 requires that $m$ and $n$ are well separated. When $m,n$ are close to each other, the feature matrix (135) is nearly square and we might expect its condition number to be large. When this is the case, the variance component of the risk can also be large.

What happens when $m$ is comparable to $n$ , and both are comparable to an integer power of $d$ ? Figure 1 reports simulations within the data model introduced above. We performed ridge regression as per (122), with a small value of the regularization parameter, $\lambda=10^{-3}(m/d)$ . We report test error and train error for several network widths $m$ , plotting them as a function of the overparametrization ratio $m/n$ .

We observe that the train error decreases with the overparametrization ratio, and becomes very small for $m/n\geq 1$ : it is not exactly because we are using $\lambda>0$ , but for $m/n>1$ it vanishes as $\lambda\to 0$ . On the other hand, the test error displays a peak at the interpolation threshold $m/n=1$ . For $\lambda=0^{+}$ the error actually diverges at this threshold. It then decreases and converges rapidly to an asymptotic value as $m/n\gg 1$ . If both $n/d\gg 1$ , and $m/n\gg 1$ , the asymptotic value of the test error is given by $\|{\mathsf{P}}_{>1}f^{*}\|_{L^{2}}$ : the model is fitting the degree-one polynomial component of the target function perfectly and behaves trivially on higher degree components. This matches the picture obtained under polynomial scalings, in Theorem 6.1, and actually indicates that a far smaller separation between $m$ and $n$ is required than assumed in that theorem. Namely, $m/n\gg 1$ instead of $m/n\geq d^{\delta}$ appears to be sufficient for the risk to be dominated by the statistical error.

The peculiar behavior illustrated in Figure 1 was first observed empirically in neural networks and then shown to be ubiquitous for numerous over-parametrized models [GSd+19, SGd+19, BHMM19]. It is commonly referred to as the ‘double descent phenomenon’, after [BHMM19].

Substituting these approximations for ${\boldsymbol{U}}$ and ${\boldsymbol{V}}$ in (139) yields an expression of the risk in terms of the three (correlated) random matrices ${\boldsymbol{X}}$ , ${\boldsymbol{\Theta}}$ , ${\boldsymbol{\Phi}}$ . Standard random matrix theory does not apply directly to compute the asymptotics of this expression. The main difficulty is that the matrix ${\boldsymbol{\Phi}}$ does not have independent or nearly independent entries. It is instead obtained by applying a nonlinear function to a product of matrices with (nearly) independent entries; see (135). The name ‘nonlinear random matrix theory’ has been coined to refer to this setting [PW17]. Techniques from random matrix theory have been adapted to this new class of random matrices. In particular, the leave-one-out method can be used to derive a recursion for the resolvent, as first shown for this type of matrices in [CS13], and the moments method was first used in [FM19] (both of these papers consider symmetric random matrices, but these techniques extend to the asymmetric case). Further results on kernel random matrices can be found in [DV13, LLC18] and [PW18].

where $\zeta:=|\mu_{1}|/\mu_{*}$ . The functions $\mathscrsfs{B}$ , $\mathscrsfs{V}\geq 0$ are explicit and correspond to an effective bias term and an effective variance term. Note the additive term $\|{\mathsf{P}}_{>1}f^{*}\|_{L^{2}}^{2}$ : in agreement with Theorem 6.1, the nonlinear component of $f^{*}$ cannot be learnt at all (recall that $m,n=O(d)$ here). Further $\|{\mathsf{P}}_{>1}f^{*}\|_{L^{2}}^{2}$ is added to the noise strength in the ‘variance’ term: high degree components of $f^{*}$ are equivalent to white noise at small sample/network size.

The expressions for $\mathscrsfs{B}$ , $\mathscrsfs{V}$ can be used to plot curves such as those in Figure 1: we refer to [MMM21] for explicit formulas. As an interesting conceptual consequence, these results establish a universality phenomenon: the risk under the random features model is asymptotically the same as the risk of a mathematically simpler model. This simpler model can be analyzed by a direct application of standard random matrix theory [HMRT20].

We refer to the simpler equivalent model as the ‘noisy features model.’ In order to motivate it, recall the decomposition $\sigma(x)=\mu_{0}+\mu_{1}x+\sigma_{\perp}(x)$ (with the three components being orthogonal in $L^{2}(\gamma)$ ). Accordingly, we decompose the feature matrix as

The next statement establishes asymptotic equivalence of the noisy and random features model.

Under the data distribution introduced above, let ${L}_{{\sf RF}}(\lambda)$ denote the risk of ridge regression in the random features model with regularization $\lambda$ , and let ${L}_{{\rm NF}}(\lambda)$ be the risk in the noisy features model. Then we have, in $n,m,d\to\infty$ with $m/d\to\psi_{\mbox{\rm\tiny w}}$ , $n/d\to\psi_{\mbox{\rm\tiny s}}$ ,

Knowing the exact asymptotics of the risk allows us to identify phenomena that otherwise would be out of reach. A particularly interesting one is the optimality of interpolation at high signal-to-noise ratio.

Define the signal-to-noise ratio of the random features model as ${\sf SNR}_{d}:=\|{\mathsf{P}}_{1}f^{*}\|_{L^{2}}^{2}/(\|{\mathsf{P}}_{>1}f^{*}\|_{L^{2}}^{2}+\tau^{2})$ , and let ${L}_{{\sf RF}}(\lambda)$ be the risk of ridge regression with regularization $\lambda$ . Then there exists a critical value ${\sf SNR}_{*}>0$ such that the following hold.

If $\lim_{d\to\infty}{\sf SNR}_{d}={\sf SNR}_{\infty}>{\sf SNR}_{*}$ , then the optimal regularization parameter is $\lambda=0^{+}$ , in the sense that ${L}_{{\sf RF},\infty}(\lambda):=\lim_{d\to\infty}{L}_{{\sf RF}}(\lambda)$ is monotone increasing for $\lambda\in(0,\infty)$ .

If $\lim_{d\to\infty}{\sf SNR}_{d}={\sf SNR}_{\infty}<{\sf SNR}_{*}$ , then the optimal regularization parameter is $\lambda>0$ , in the sense that ${L}_{{\sf RF},\infty}(\lambda):=\lim_{d\to\infty}{L}_{{\sf RF}}(\lambda)$ is monotone decreasing for $\lambda\in(0,\lambda_{0})$ with $\lambda_{0}>0$ .

In other words, above a certain threshold in ${\sf SNR}$ , (near) interpolation is required in order to achieve optimal risk, not just optimal rates.

The universality phenomenon of Theorem 6.2 first emerged in random matrix theory studies of (symmetric) kernel inner product random matrices. In that case, the spectrum of such a random matrix was shown in [CS13] to behave asymptotically as the one of the sum of independent Wishart and Wigner matrices, which correspond respectively to the linear and nonlinear parts of the kernel (see also [FM19] where this remark is made more explicit). In the context of random features ridge regression, this type of universality was first pointed out in [HMRT20], which proved a special case of Theorem 6.2. In [GMKZ19] and [GRM+20], a universality conjecture was put forward on the basis of statistical physics arguments and proved to hold in online learning schemes (that is, if each sample is visited only once).

Universality is conjectured to hold in significantly broader settings than ridge-regularized least-squares. This is interesting because analysing the noisy feature models is often significantly easier than the original random features model. For instance [MRSY19] studied max margin classification under the universality hypothesis, and derived an asymptotic characterization of the test error using Gaussian comparison inequalities. Related results were obtained by [TPT20] and [KT20], among others.

Finally, a direct proof of universality for general strongly convex smooth losses was recently proposed in [HL20] using the Lindeberg interpolation method.

4 Neural tangent model

Nevertheless, several results are available and point to a common conclusion: the generalization properties of NT are very similar to those of RF, provided we keep the number of parameters constant, which amounts to reducing the number of neurons according to $m_{{\sf NT}}d=p_{{\sf NT}}=p_{{\sf RF}}=m_{{\sf RF}}$ .

In this statement we abused notation in letting $m=\infty$ denote the case of KRR, and letting $n=\infty$ refer to the approximation error:

In order to study the test error, it is not sufficient to lower-bound the minimum singular value of ${\boldsymbol{\Phi}}$ , but we need to understand the structure of this matrix: results in this direction were obtained in [MZ20], for $m\leq C_{0}d$ , for some constant $C_{0}$ . Following the same strategy of previous sections, we decompose

For the diagonal entries we have (assuming for simplicity ${\boldsymbol{x}}_{i}\sim{\sf N}(0,{\mathbf{I}}_{d})$ ),

Equations (153) and (154) suggest that for $m=O(d)$ , ridge regression in the NT model can be approximated by ridge regression in the raw covariates, as long as the regularization parameter is suitably modified. The next theorem confirms this intuition [MZ20]. We define ridge regression with respect to the raw covariates as per

Assume $d^{1/C_{0}}\leq m\leq C_{0}d$ , $n\geq d/C_{0}$ and $md\gg n$ . Then with high probability there exists an interpolator. Further assume ${\boldsymbol{x}}_{i}\sim{\sf N}(0,{\mathbf{I}}_{d})$ and $f^{*}({\boldsymbol{x}})=\langle{\boldsymbol{\beta}}_{*},{\boldsymbol{x}}\rangle$ . Let

denote the risk of ridge regression with respect to the raw features.

Set $\lambda=\lambda_{0}(md/n)$ for some $\lambda_{0}\geq 0$ . Then there exists a constant $C>0$ such that, with high probability,

Notice that the shift in regularization parameter matches the heuristics given above (the scaling in $\lambda=\lambda_{0}(md/n)$ is introduced to match the typical scale of ${\boldsymbol{\Phi}}$ ).

Conclusions and future directions

Classical statistical learning theory establishes guarantees on the performance of a statistical estimator ${\widehat{f}}$ , by bounding the generalization error ${L}({\widehat{f}})-\widehat{L}({\widehat{f}})$ . This is often thought of as a small quantity compared to the training error ${L}({\widehat{f}})-\widehat{L}({\widehat{f}})\ll\widehat{L}({\widehat{f}})$ . Regularization methods are designed precisely with the aim of keeping the generalization error ${L}({\widehat{f}})-\widehat{L}({\widehat{f}})$ small.

The effort to understand deep learning has recently led to the discovery of a different learning scenario, in which the test error ${L}({\widehat{f}})$ is optimal or nearly optimal, despite being much larger than the training error. Indeed in deep learning the training error often vanishes or is extremely small. The model is so rich that it overfits the data, that is, $\widehat{L}({\widehat{f}})\ll\inf_{f}{L}(f)$ . When pushed, gradient-based training leads to interpolation or near-interpolation $\widehat{L}({\widehat{f}})\approx 0$ [ZBH+17]. We regard this as a particularly illuminating limit case.

As pointed out in Section 2, interpolation poses less of a conceptual problem if data are noiseless. Indeed, unlike the noisy case, we can exhibit at least one interpolating solution that has vanishing test error, for any sample size: the true function $f^{*}$ . Stronger results can also be established in the noiseless case: [Fel20] proved that interpolation is necessary to achieve optimal error rates when the data distribution is heavy-tailed in a suitable sense.

In this review we have focused on understanding when and why interpolation can be optimal or nearly optimal even with noisy data. Rigorous work has largely focused on models that are linear in a certain feature space, with the featurization map being independent of the data. Examples are RKHSs, the features produced by random network layers, or the neural tangent features defined by the Jacobian of the network at initialization. Mathematical work has established that interpolation can indeed be optimal and has described the underlying mechanism in a number of settings. While the scope of this analysis might appear to be limited (neural networks are notoriously nonlinear in their parameters), it is relevant to deep learning in two ways. First, in a direct way: as explained in Section 5, there are training regimes in which an overparametrized neural network is well approximated by a linear model that corresponds to the first-order Taylor expansion of the network around its initialization (the ‘neural tangent’ model). Second, in an indirect way: insights and hypotheses arising from the analysis of linear models can provide useful guidance for studying more complex settings.

Based on the work presented in this review, we can distill a few insights worthy of exploration in broader contexts.

Simple-plus-spiky decomposition. The function learnt in the overfitting (interpolating) regime takes the form

Here ${\widehat{f}}_{0}$ is simple in a suitable sense (for instance, it is smooth) and hence is far from interpolating the data, while $\Delta$ is spiky: it has large complexity and allows interpolation of the data, but it is small, in the sense that it has negligible effect on the test error, i.e. ${L}({\widehat{f}}_{0}+\Delta)\approx{L}({\widehat{f}}_{0})$ .

In the case of linear models, the decomposition (157) corresponds to a decomposition of ${\widehat{f}}$ into two orthogonal subspaces that do not depend on the data. Namely, ${\widehat{f}}_{0}$ is the projection of ${\widehat{f}}$ onto the top eigenvectors of the associated kernel and $\Delta$ is its orthogonal complement. In nonlinear models, the two components need not be orthogonal and the associated subspaces are likely to be data-dependent.

Understanding whether such a decomposition is possible, and what is its nature is a wide-open problem, which could be investigated both empirically and mathematically. A related question is whether the decomposition (157) is related to the widely observed ‘compressibility’ of neural network models. This is the observation that the test error of deep learning models does not change significantly if —after training— the model is simplified by a suitable compression operation [HMD15].

The mechanism by which the training algorithm selects a specific empirical risk minimizer is understood in only a handful of cases: we refer to Section 3 for pointers to this literature. It would be important to understand how the model nonlinearity interacts with gradient flow dynamics. This in turn impacts the decomposition (157), namely which part of the function ${\widehat{f}}$ is to be considered ‘simple’ and which one is ‘spiky’. Finally, the examples of kernel machines, random features and neural tangent models show that—in certain regimes—the simple component ${\widehat{f}}_{0}$ is also regularized in a non-trivial way, a phenomenon that we called self-induced regularization. Understanding these mechanisms in a more general setting is an outstanding challenge.

Role of dimension. As pointed out in Section 4, interpolation is sub-optimal in a fixed dimension in the presence of noise, for certain kernel methods [RZ19]. The underlying mechanism is as described above: for an interpolating model, ${\widehat{f}}({\boldsymbol{x}}_{i})-f^{*}({\boldsymbol{x}}_{i})$ is of the order of the noise level. If ${\widehat{f}}$ and $f^{*}$ are sufficiently regular (for instance, uniformly continuous, both in ${\boldsymbol{x}}$ and in $n$ ) ${\widehat{f}}({\boldsymbol{x}}_{\mbox{\tiny\rm test}})-f^{*}({\boldsymbol{x}}_{\mbox{\tiny\rm test}})$ is expected to be of the same order when ${\boldsymbol{x}}_{\mbox{\tiny\rm test}}$ is close to the training set. This happens with constant probability in fixed dimension. However, this probability decays rapidly with the dimension.

Typical data in deep learning applications are high-dimensional (images, text, and so on). On the other hand, it is reasonable to believe that deep learning methods are not affected by the ambient dimension (the number of pixels in an image), but rather by an effective or intrinsic dimension. This is the case for random feature models [GMMM20b]. This raises the question of how deep learning methods escape the intrinsic limitations of interpolators in low dimension. Is it because they construct a (near) interpolant ${\widehat{f}}$ that is highly irregular (not uniformly continuous)? Or perhaps because the effective dimension is at least moderately large? (After all the lower bounds mentioned above decrease rapidly with dimension.) What is the proper mathematical definition of effective dimension?

Adaptive model complexity. As mentioned above, in the case of linear models, the terms ${\widehat{f}}_{0}$ and $\Delta$ in the decomposition (157) correspond to the projections of ${\widehat{f}}$ onto ${\mathcal{V}}_{k}$ and ${\mathcal{V}}_{k}^{\perp}$ . Here ${\mathcal{V}}_{k}$ is the space spanned by the top $k$ eigenfunctions of the kernel associated with the linear regression problem. Note that this is the case also for the random features and neural tangent models of Section 6. In this case the relevant kernel is the expectation of the finite-network kernel ${\boldsymbol{D}}f({\boldsymbol{\theta}}_{0})^{{\mathsf{T}}}{\boldsymbol{D}}f({\boldsymbol{\theta}}_{0})$ with respect to the choice of random weights at initialization.

A crucial element of this behavior is the dependence of $k$ (the dimension of the eigenspace ${\mathcal{V}}_{k}$ ) on various features of the problem at hand: indeed $k$ governs the complexity of the ‘simple’ part of the model ${\widehat{f}}_{0}$ , which is the one actually relevant for prediction. As discussed in Section 4, in kernel methods $k$ increases with the sample size $n$ : as more data are used, the model ${\widehat{f}}_{0}$ becomes more complex. In random features and neural tangent models (see Section 6), $k$ depends on the minimum of $n$ and the number of network parameters (which is proportional to the width for two-layer networks). The model complexity increases with sample size, but saturates when it reaches the number of network parameters.

This suggests a general hypothesis that would be interesting to investigate beyond linear models. Namely, if a decomposition of the type (157) is possible, then the complexity of the simple part ${\widehat{f}}_{0}$ increases with the sample size and the network size.

Computational role of overparametrization. We largely focused on the surprising discovery that overparametrization and interpolation do not necessarily hurt generalization, even in the presence of noise. However, we should emphasize once more that the real motivation for working with overparametrized models is not statistical but computational. The empirical risk minimization problem for neural networks is computationally hard, and in general we cannot hope to be able to find a global minimizer using gradient-based algorithms. However, empirical evidence indicates that global optimization becomes tractable when the model is sufficiently overparametrized.

The linearized and mean field theories of Section 5 provide general arguments to confirm this empirical finding. However, we are far from understanding precisely what amount of overparametrization is necessary, even in simple neural network models.

Acknowledgements

PB, AM and AR acknowledge support from the NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. For insightful discussions on these topics, the authors also thank the other members of that Collaboration and many other collaborators and colleagues, including Emmanuel Abbe, Misha Belkin, Niladri Chatterji, Amit Daniely, Tengyuan Liang, Philip Long, Gábor Lugosi, Song Mei, Theodor Misiakiewicz, Hossein Mobahi, Elchanan Mossel, Phan-Minh Nguyen, Nati Srebro, Nike Sun, Alexander Tsigler, Roman Vershynin, and Bin Yu. We thank Tengyuan Liang and Song Mei for insightful comments on the draft. PB acknowledges support from the NSF through grant DMS-2023505. AM acknowledges support from the ONR through grant N00014-18-1-2729. AR acknowledges support from the NSF through grant DMS-1953181, and support from the MIT-IBM Watson AI Lab and the NSF AI Institute for Artificial Intelligence and Fundamental Interactions.

References

where $\lambda_{1}\geq\ldots\geq\lambda_{d}$ are the eigenvalues of ${\boldsymbol{\Sigma}}$ .

This deterministic argument is due to T. Liang [Lia20]. We write ${\boldsymbol{\Sigma}}={\boldsymbol{\Sigma}}_{\leq k}+{\boldsymbol{\Sigma}}_{>k}$ , with ${\boldsymbol{\Sigma}}_{\leq k}=\sum_{i\leq k}\lambda_{i}{\boldsymbol{u}}_{i}{\boldsymbol{u}}_{i}^{\scriptscriptstyle\mathsf{\,T}}$ . Then by the argument in [LR20, Remark 5.1],

where $\widehat{\lambda}_{i}$ are the eigenvalues of ${\boldsymbol{X}}{\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}$ . Here we use the fact that $\frac{t}{(r+t)^{2}}\leq\frac{1}{4r}$ for all $t,r>0$ . On the other hand,

Now, using the argument similar to that in [BLLT20], we define $A_{-i}=d\gamma\boldsymbol{I}_{n}+{\boldsymbol{X}}(\boldsymbol{I}_{n}-{\boldsymbol{u}}_{i}{\boldsymbol{u}}_{i}^{\scriptscriptstyle\mathsf{\,T}}){\boldsymbol{X}}^{\scriptscriptstyle\mathsf{\,T}}$ , ${\boldsymbol{v}}={\boldsymbol{X}}{\boldsymbol{u}}_{i}$ and write

by the Sherman-Morrison formula. The last quantity is upper bounded by

Substituting in (160), we obtain an upper bound of

A.2 Exact characterization in the proportional asymptotics

We will denote by ${\boldsymbol{K}}=(h(\langle{\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j}\rangle/d))_{i,j\leq n}$ the kernel matrix. We will also denote by ${\boldsymbol{K}}_{1}$ the linearized kernel

$\|{\boldsymbol{\Sigma}}\|\leq M$ , $d^{-1}\sum_{i=1}^{d}\lambda_{i}^{-1}\leq M$ , where $\lambda_{1},\dots,\lambda_{d}$ are the eigenvalues of ${\boldsymbol{\Sigma}}$ .

Define $\mathscrsfs{B}({\boldsymbol{\Sigma}},{\boldsymbol{\beta}}_{0})$ and $\mathscrsfs{V}({\boldsymbol{\Sigma}})$ by

The result for the variance will be proved under weaker assumptions and in a stronger form than stated. In particular, it does not require any assumption on the target function $f_{*}$ , and it holds with smaller error terms than stated.

Notice that by positive definiteness of the kernel, we have $h^{\prime}(0),h^{\prime\prime}(0)\geq 0$ . Hence the conditions that these are strictly positive is essentially a non-degeneracy requirement.

We note for future reference that the target function $f^{*}$ is decomposed as

Throughout the proof, we will use $C$ for constants that depend uniquely on the constants in Assumption 4.12 and Theorem 4.13. We also write that an inequality holds with very high probability if, for any $A>0$ , we can choose the constants $C$ in the inequality such that this holds with probability at least $1-n^{-A}$ for all $A$ large enough.

We will repeatedly use the following bound, see e.g. [EK10].

Under the assumptions of Theorem 4.13, we have, with very high probability

In particular, as long as $h$ is non-linear, we have ${\boldsymbol{K}}\succeq c_{*}{\mathbf{I}}_{n}$ , $c_{*}=\beta\gamma>0$ with probability at least $1-Cn^{-D}$ .

Our first lemma provides useful approximations of these quantities.

Define (here expectations are over $G\sim{\sf N}(0,1)$ ):

Then the following hold with very high probability (in other words, for any $A>0$ there exists $C$ such that the following hold with probability at least $1-n^{-A}$ for all $n$ large enough)

In particular, this implies $\|{\boldsymbol{v}}-{\boldsymbol{v}}_{0}\|_{2}\leq Cd^{-1}\sqrt{\log d}$ , $\|{\boldsymbol{M}}-{\boldsymbol{M}}_{0}\|_{F}\leq Cd^{-3/2}\log d$ .

Throughout the proof we will work on the intersection ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ of following events, which hold with very high probability by standard concentration arguments. These events are defined by

We denote by $h_{i,k}$ the $k$ -th coefficient of $h((Q_{ii}/d)^{1/2}x)$ in the basis of Hermite polynomials. Namely:

Here $h^{(k)}$ denotes the $k$ -th derivative of $h$ (recall that by the argument above we can assume, without loss of generality, that $h$ is $k$ -times differentiable).

We write $h_{i,>k}$ for the remainder after the first $k$ terms of the Hermite expansion have been removed:

Finally, we denote by $\hat{h}_{i,>k}(x)$ the remainder after the first $k$ terms in the Taylor expansion have been subtracted:

Of course $h-\hat{h}_{>k}$ is a polynomial of degree $k$ , and therefore its projection orthogonal to the first $k$ Hermite polynomials vanishes, whence

We also have that $|\hat{h}_{>k}(t)|\leq C\min(1,|t|^{k+1})$ . Define ${\boldsymbol{v}}_{i}={\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{x}}_{i}/\sqrt{d}$ , $\|{\boldsymbol{v}}_{i}\|_{2}^{2}=Q_{ii}$ . For any fixed $m\geq 2$ , by Eq. (191) and the triangle inequality,

where the inequality $(a)$ follows since $\langle{\boldsymbol{v}}_{i},{\boldsymbol{z}}\rangle$ is $C$ -sub-Gaussian. Note that Eqs. (189), (193) can also be rewritten as

We next prove Eq. (180). Using Eq. (194) with $k=2$ and recalling ${\rm He}_{0}(x)=1$ , ${\rm He}_{1}(x)=x$ , ${\rm He}_{2}(x)=x^{2}-1$ , we get

By the above and the Hanson-Wright inequality

and similarly for the lower tail. By taking a union bound over $i\leq n$ , we obtain $\max_{i\leq n}|\langle{\boldsymbol{x}}_{i},{\boldsymbol{F}}_{2}{\boldsymbol{x}}_{i}\rangle|\leq C\sqrt{d\log d}$ as claimed, thus completing the proof of Eq. (180).

We next prove Eq. (181). We claim that this bound holds for any realization in ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ . Therefore we can fix without loss of generality $i=1$ , $j=2$ . We use Eq. (194) with $k=4$ . Using Cauchy-Schwarz and Eqs. (194), (195), we get

where in the inequality we used the identities $h_{1,0}h_{2,0}M_{1,2}(0,0)=h_{1,0}h_{2,0}=a_{1,0}a_{2,0}$ , and

We next bound each of the terms above separately.

As for $k=4$ , we have (recalling ${\rm He}_{4}(x)=x^{4}-3x^{2}$ ):

where the last inequality follows since $\|{\boldsymbol{w}}\|_{2}=1$ by construction and $\|{\boldsymbol{w}}\|_{\infty}\leq C\sqrt{(\log d)/d}$ on ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ .

Therefore, on ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ ,

Therefore, on ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$

Next consider the term $|h_{1,0}h_{2,3}M_{1,2}(0,3)/6-a_{1,0}a_{2,1}|$ in Eq. (204). Using the fact that $h_{1,0}=a_{1,0}$ is bounded, we get

Recalling ${\rm He}_{3}(x)=x^{3}-3x$ , and letting ${\boldsymbol{w}}={\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{x}}_{2}/\|{\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{x}}_{2}\|_{2}$ :

In particular, on the event ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ , $|M_{1,2}(0,3)|\leq C\sqrt{(\log d)/d}$ . Comparing the definitions of $a_{2,1}$ and $h_{2,3}$ , we get

Finally, consider term $|a_{1,1}a_{2,1}|$ in Eq. (204). By the above estimates, we get $|a_{2,1}|\leq Cd^{-2}(\log d)^{1/2}$ , and hence this term is negligible as well. This completes the proof of Eq. (181).

Equation (182) follows by a similar argument, which we omit. ∎

A.2.2 An estimate on the entries of the resolvent

Then, for any $\lambda>0$ there exists a finite constant $C$ such that, for any $i\neq j$ ,

Then, by a simple linear algebra calculation, we have

Note that since ${\boldsymbol{R}}_{0}\succeq 0$ , we have $\langle{\boldsymbol{z}}_{1},{\boldsymbol{R}}_{0}{\boldsymbol{z}}_{2}\rangle^{2}\leq\langle{\boldsymbol{z}}_{1},{\boldsymbol{R}}_{0}{\boldsymbol{z}}_{1}\rangle\langle{\boldsymbol{z}}_{2},{\boldsymbol{R}}_{0}{\boldsymbol{z}}_{2}\rangle$ , and therefore

Here $(a)$ follows from the orthogonality of $g({\boldsymbol{z}})$ to linear functions.

Here $(a)$ follows from Hölder’s inequality and $(b)$ from the Hanson-Wright inequality using the fact that $\|{\boldsymbol{R}}_{0}\|$ is bounded. The proof is completed by taking expectation over $({\boldsymbol{z}}_{i})_{2<i\leq n}$ . ∎

Under the definitions and assumptions of Lemma A.4, let $Y_{ij}:=({\boldsymbol{Z}}{\boldsymbol{S}}{\boldsymbol{Z}}^{{\scriptscriptstyle\mathsf{\,T}}}/d+\lambda{\mathbf{I}}_{n})^{-1}_{i,j}$ . Then, for any tuple of four distinct indices $i,j,k,l$ , we have

We then have that ${\boldsymbol{Y}}=(Y_{ij})_{i,j\leq 4}$ is given by

Here $(a)$ holds because $g_{i}$ is orthogonal to ${\boldsymbol{z}}_{i}$ for $i\in\{2,3,4\}$ and hence the terms $\overline{q}^{-1}$ have vanishing contribution; $(b)$ by Hölder for $p=12$ , and using the fact that $q_{i}^{-1}$ is bounded; $(c)$ by the above bounds on the moments of $A_{ij}$ , $Q_{i}$ , plus $|q^{-1}_{i}-\overline{q}^{-1}|\leq C|Q_{i}|$ .

Notice that in the first term ${\boldsymbol{z}}_{3}$ only appears in $q_{3}$ , $A_{31}$ , and $g_{3}$ , and similarly ${\boldsymbol{z}}_{4}$ only appears in $q_{4}$ , $A_{24}$ , and $g_{4}$ . Hence

where the last inequality follows again by Hölder. Analogously, for the second term we have

This completes the proof of this lemma. ∎

This proof is very similar to the one of Lemma A.5, and we will follow the same notation introduced there.

where the last bound follows from Hölder inequality.

Then, for any $\lambda>0$ , with probability at least $1-Cd^{-1/4}$ , we have

Denote by $X$ the sum on the left-hand side of Eq. (248), and define $Y_{ij}:=({\boldsymbol{Z}}{\boldsymbol{S}}{\boldsymbol{Z}}^{{\scriptscriptstyle\mathsf{\,T}}}/d+\lambda{\mathbf{I}}_{n})^{-1}_{i,j}$ , $g_{i}=g({\boldsymbol{z}}_{i})$ . Further, let ${\mathcal{I}}_{m}:=\{(i,j,k,l):\;i<j\leq n,k<l\leq n,|\{i,j\}\cap\{k,j\}|=m\}$ , $m\in\{0,1\}$ . Then we have

The proof is completed by Chebyshev inequality. ∎

A.2.3 Proof of Theorem 4.13: Variance term

Throughout this section we will refer to the events ${\mathcal{E}}_{1}$ , ${\mathcal{E}}_{2}$ defined in Eqs. (183), (185). The variance is given by

The following lemma allows us to take the expectation with respect to ${\boldsymbol{x}}$ .

First notice that, defining ${\boldsymbol{M}}$ as in Eq. (173), we have

We then have, with very high probability,

where $(a)$ follows from Lemma A.3 and $(b)$ from Lemma A.2. ∎

The next lemma shows that ${\boldsymbol{B}}_{0}$ is a good approximation for ${\boldsymbol{B}}$ , defined in Eq. (177).

Let ${\boldsymbol{B}}$ be defined as per Eq. (177). With very high probability, we have $\|{\boldsymbol{B}}-{\boldsymbol{B}}_{0}\|\leq Cd^{-3/2}$ and $\|{\boldsymbol{B}}-{\boldsymbol{B}}_{0}\|_{*}\leq Cd^{-1/2}$ .

Notice that ${\boldsymbol{B}}={\boldsymbol{D}}{\boldsymbol{X}}{\boldsymbol{\Sigma}}{\boldsymbol{X}}^{{\scriptscriptstyle\mathsf{\,T}}}{\boldsymbol{D}}/d^{2}$ and, on ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ ,

This immediately implies $\|{\boldsymbol{B}}-{\boldsymbol{B}}_{0}\|_{*}\leq n\|{\boldsymbol{B}}-{\boldsymbol{B}}_{0}\|\leq C/\sqrt{d}$ . ∎

Under the assumptions of Theorem 4.13, let ${\boldsymbol{B}}$ be defined as per Eq. (177)and ${\boldsymbol{B}}_{0}$ as per Eq. (256). Also, recall the definition of ${\boldsymbol{K}}_{1}$ in Eq. (163). Then, with very high probability, we have

Throughout this proof, we work under events ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ defined in the proof of Lemma A.3. Recall that $\max_{i\leq n}|D_{i}|$ is bounded (see, e.g., Eq. (257)), whence

whence $\|{\boldsymbol{B}}\|_{F}\leq C\,\sqrt{(\log d)/d}$ . Using Lemma A.2, we have

Using again Lemma A.2 together with Lemma A.9, we obtain that the following holds with very high probability:

The desired claim follows from this display alongside Eq. (261). ∎

Under the assumptions of Theorem 4.13, let ${\boldsymbol{a}}$ be defined as in Lemma A.3. Then, with very high probability we have

Notice that the lower bound is trivial since ${\boldsymbol{K}}$ is positive semidefinite. We will write

On the other hand, recalling the definition of ${\boldsymbol{a}}_{1}$ in Eq. (178), we have, always on ${\mathcal{E}}_{1}\cap{\mathcal{E}}_{2}$ ,

We therefore obtain, again using Lemma A.2,

where we used the above remark $C^{-1}{\mathbf{I}}\preceq{\boldsymbol{K}}_{*}\preceq C{\mathbf{I}}$ .

Using the last two displays in Eq. (268) yields the desired claim. ∎

By virtue of Lemmas A.8, A.10, A.11, we have

Here and below we denote by ${\rm Err}(n)$ an error term bounded as $|{\rm Err}(n)|\leq Cn^{-c_{0}}$ with very high probability, and we defined

By an application of the Sherman-Morrison formula, and recalling that $\beta\gamma>0$ is bounded away from zero, we get

We are therefore left with the task of evaluating the asymptotics of

However, this is just the variance of ridge regression with respect to the simple features ${\boldsymbol{X}}$ , with ridge regularization proportional to $\gamma$ . We apply the results of [HMRT20] to obtain the claim. ∎

A.2.4 Proof of Theorem 4.13: Bias term

We begin with an elementary lemma on the norm of ${\boldsymbol{f}}^{*}$ .

With probability at least $1-Cn^{-1/4}$ , we have $|\|{\boldsymbol{f}}^{*}\|_{2}^{2}/n-\|f^{*}\|_{L^{2}}^{2}|\leq n^{-3/8}$ .

With probability at least $1-Cn^{-1/4}$ , we have $|\|{\boldsymbol{f}}_{\mbox{\rm\tiny NL}}^{*}\|_{2}^{2}/n-\|f_{\mbox{\rm\tiny NL}}^{*}\|_{L^{2}}^{2}|\leq n^{-3/8}$ .

Finally, $(c)$ follows by the same argument as for claim $(b)$ , once we bound $\|f^{*}_{\mbox{\rm\tiny NL}}\|_{L^{4}}$ . In order to show this, notice that, by triangle inequality, $\|f^{*}_{\mbox{\rm\tiny NL}}\|_{L^{4}}\leq\|f^{*}\|_{L^{4}}+\|f_{0}\|_{L^{4}}+\|f_{1}\|_{L^{4}}$ , where $f_{0}({\boldsymbol{x}})=b_{0}$ , $f_{1}({\boldsymbol{x}})=\langle{\boldsymbol{\beta}}_{0},{\boldsymbol{x}}\rangle$ . Since ${\boldsymbol{x}}={\boldsymbol{\Sigma}}{\boldsymbol{z}}$ , with ${\boldsymbol{z}}$ $C$ -sub-Gaussian, $\|f^{*}_{\mbox{\rm\tiny NL}}\|_{L^{4}}\leq\|f^{*}\|_{L^{4}}+b_{0}+C\|{\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{\beta}}_{0}\|_{2}\leq C$ . ∎

Under the assumptions of Theorem 4.13, let ${\boldsymbol{M}}_{0}$ , ${\boldsymbol{v}}_{0}$ be defined as in the statement of Lemma A.3. Then, with probability at least $1-Cn^{-1/4}$ , we have

Here, in the last line, we used Lemmas A.2, A.3 and the fact that $\|{\boldsymbol{f}}^{*}\|^{2}\leq Cn$ by Lemma A.12. ∎

In view of the last lemma, it is sufficient to work with ${\widehat{\textsc{bias}}^{2}}_{0}$ . We decompose it as

We next show that the contribution of the constant term in $f^{*}_{\mbox{\rm\tiny L}}({\boldsymbol{x}})$ and ${\boldsymbol{M}}_{0}$ is negligible.

Under the assumptions of Theorem 4.13, let ${\boldsymbol{M}}_{0}$ , ${\boldsymbol{B}}$ , ${\boldsymbol{v}}_{0}$ be defined as in the statement of Lemma A.3. Further define

The proof of this lemma is very similar to the one of Lemma A.11, and we omit it. ∎

Under the assumptions of Theorem 4.13, let $\mathscrsfs{B}({\boldsymbol{\Sigma}},{\boldsymbol{\beta}}_{0})$ be defined as in Eq. (168), and $R_{\mbox{\rm\tiny L}}$ be defined as in the statement of Lemma A.14. Let $a\in(0,1/2)$ . Then we have, with very high probability

Letting ${\boldsymbol{u}}={\boldsymbol{X}}{\boldsymbol{\beta}}_{0}={\boldsymbol{Z}}{\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{\beta}}_{0}$ , note that $\|{\boldsymbol{u}}\|_{2}\leq\|{\boldsymbol{Z}}\|\|{\boldsymbol{\Sigma}}^{1/2}{\boldsymbol{\beta}}_{0}\|_{2}\leq C\sqrt{n}$ with very high probability (using Lemma A.12). We then have

We bound each of the three terms with very high probability:

Here in Eq. (303) we used Lemma A.2 and Lemma A.9; in Eq. (304) Lemma A.2 and the fact that $\|{\boldsymbol{B}}_{0}\|\leq C/d$ ; in Eq. (305), Lemma A.2 and $\|{\boldsymbol{X}}\|\leq C\sqrt{d}$ . Hence we conclude that

where the last inequality holds with very high probability by [KY17, Theorem 3.16] (cf. also Lemma 4.4 in the same paper). We therefore have

We are left with the task of estimating $\accentset{\approx}{R}_{\mbox{\rm\tiny L}}$ which we rewrite explicitly as

We recognize in this the bias of ridge regression with respect to the linear features ${\boldsymbol{x}}_{i}$ , when the responses are also linear $\langle{\boldsymbol{\beta}}_{0},{\boldsymbol{x}}_{i}\rangle$ . Using the results of [HMRT20], we obtain that, for any $a\in(0,1/2)$ , the following holds with very high probability.

The proof is completed by using Eqs. (306), (315), (317). ∎

We next consider the nonlinear term $R_{\mbox{\rm\tiny NL}}$ , cf. Eq. (296).

Under the assumptions of Theorem 4.13, let $\mathscrsfs{V}({\boldsymbol{\Sigma}})$ be defined as in Eq. (167), and $R_{\mbox{\rm\tiny NL}}$ be defined as in the statement of Lemma A.14. Then there exists $c_{0}>0$ such that, with probability at least $1-Cn^{-1/4}$ ,

By the same argument as in the proof of Lemma A.15, we have, with very high probability,

We next use the following identity, which holds for any two symmetric matrices ${\boldsymbol{A}}$ , ${\boldsymbol{M}}$ , and any $t\neq 0$ ,

Therefore, for any matrix ${\boldsymbol{U}}$ and any $t>0$ , we have

We apply this inequality to ${\boldsymbol{A}}={\boldsymbol{Z}}{\boldsymbol{\Sigma}}{\boldsymbol{Z}}^{{\scriptscriptstyle\mathsf{\,T}}}/d+\gamma{\mathbf{I}}_{n}$ , ${\boldsymbol{M}}={\boldsymbol{Z}}{\boldsymbol{\Sigma}}^{2}{\boldsymbol{Z}}^{{\scriptscriptstyle\mathsf{\,T}}}/d$ and $U_{ij}=f^{*}_{\mbox{\rm\tiny NL}}({\boldsymbol{x}}_{i})f^{*}_{\mbox{\rm\tiny NL}}({\boldsymbol{x}}_{i}){\boldsymbol{1}}_{i\neq j}$ . Note that $\|{\boldsymbol{A}}^{-1}\|,\|{\boldsymbol{M}}\|,\|({\boldsymbol{A}}+t{\boldsymbol{M}})^{-1}\|\leq C$ . Further $\|{\boldsymbol{U}}\|_{*}\leq 2\|{\boldsymbol{f}}^{*}_{\mbox{\rm\tiny NL}}\|_{2}^{2}\leq Cn$ with probability at least $1-Cn^{-1/4}$ by Lemma A.12. Finally for any $t\in(0,1)$ , by Theorem A.7, the following hold with probability at least $1-Cd^{-1/4}$ :

where in the last step we selected $t=d^{-1/16}$ . Recalling the definitions of ${\boldsymbol{A}},{\boldsymbol{M}},{\boldsymbol{U}}$ , we have proved:

We are therefore left with the task of controlling the diagonal terms. Using the results of [KY17], we get

Further $|\|{\boldsymbol{f}}^{*}_{\mbox{\rm\tiny NL}}\|_{2}^{2}/n-\|f^{*}_{\mbox{\rm\tiny NL}}\|_{L^{2}}^{2}|\leq Cn^{-1/2}$ with probability at least $1-Cn^{-1/4}$ by Lemma A.12. Therefore, with probability at least $1-Cd^{-1/4}$ ,

We finally recognize that the term $V_{\mbox{\tiny\rm RR}}$ is just the variance of ridge regression with respect to the linear features ${\boldsymbol{x}}_{i}$ , and using [HMRT20], we obtain

The proof of the lemma is concluded by using the last equation together with Eq. (322). ∎

Under the assumptions of Theorem 4.13, $R_{\mbox{\rm\tiny mix}}$ be defined as in the statement of Lemma A.14. Then we have, with probability at least $1-Cd^{-1/4}$ ,

The proof of this lemma is analogous to the one of Lemma A.16 and we omit it. ∎

We are now in a position to prove Theorem 4.13.

Using Lemma A.13, Eq. (291) and Lemma A.14, we obtain that, with very high probability,

Hence the proof is completed by using Lemmas A.15, A.16, A.17. ∎

A.2.5 Consequences: Proof of Corollary 4.14

We denote by $\lambda_{1}\geq\dots\geq\lambda_{d}$ the eigenvalues of ${\boldsymbol{\Sigma}}$ in decreasing order.

First note that the left hand side of Eq. (166) is strictly increasing in $\lambda_{*}$ , while the right hand side is strictly decreasing. By considering the limits as $\lambda_{*}\to 0$ and $\lambda_{*}\to\infty$ , it is easy to see that this equation admits indeed a unique solution.

Next denoting by $F(x):={\mathsf{tr}}\Big{(}{\boldsymbol{\Sigma}}({\boldsymbol{\Sigma}}+x{\mathbf{I}})^{-1}\Big{)}$ the function appearing on the right hand side of Eq. (166), we have, for $x\geq c_{*}\lambda_{k+1}$ ,

Let $\underline{\lambda}_{*}$ be the unique non-negative solution of $n(1-(\gamma/\underline{\lambda}_{*}))=\underline{F}(\underline{\lambda}_{*})$ . Then, the above inequality implies that whenever $\underline{\lambda}_{*}\geq c_{*}\lambda_{k+1}$ we have $\lambda_{*}\geq\underline{\lambda}_{*}$ . Solving explicitly for $\underline{\lambda}_{*}$ , we get

If we assume that the right-hand side is less than $1/2$ , using Theorem 4.13, we obtain that, with high probability,

Next, considering again Eq. (166) and upper bounding the right-hand side, we get

Hence, using the assumption that the right hand side of Eq. (339) is upper bounded by $1/2$ , which implies $k\leq n/2$ , we get

Next consider the formula for the bias term, Eq. (168). Denoting by $(\beta_{0,i})_{i\leq p}$ the coordinates of ${\boldsymbol{\beta}}_{0}$ in the basis of the eigenvectors of ${\boldsymbol{\Sigma}}$ , we get

Together with Theorem 4.13, this implies the desired bound on the bias.

Appendix B Optimization in the linear regime

The empirical risk decreases exponentially fast to , with rate $\lambda_{0}=\sigma^{2}_{\min}/(2n)$ :

The parameters stay close to the initialization and are closely tracked by those of the linearized flow. Specifically, letting $L_{n}:={\rm Lip}({\boldsymbol{D}}f_{n})$ ,

Throughout the proof we let $L_{n}:={\rm Lip}({\boldsymbol{D}}f_{n})$ , and we use $\dot{{\boldsymbol{a}}}_{t}$ to denote the derivative of quantity ${\boldsymbol{a}}_{t}$ with respect to time.

Let ${\boldsymbol{y}}_{t}=f_{n}({\boldsymbol{\theta}}_{t})$ . By the gradient flow equation,

Defining the empirical kernel at time $t$ , ${\boldsymbol{K}}_{t}:={\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{t}){\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{t})^{{\mathsf{T}}}$ , we thus have

Letting $r_{*}:=\sigma_{\min}/(2L_{n})$ and $t_{*}:=\inf\{t:\;\|{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{0}\|_{2}>r_{*}\}$ , we have $\lambda_{\min}({\boldsymbol{K}}_{t})\geq(\sigma_{\min}/2)^{2}$ for all $t\leq t_{*}$ , whence

with $\lambda_{0}=\sigma_{\min}^{2}/(2n)$ .

Note that, for any $t\leq t_{*}$ , $\sigma_{\min}({\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{t}))\geq\sigma_{\min}/2$ . Therefore, by the gradient flow equations, for any $t\leq t_{*}$ ,

Assume by contradiction $t_{*}<\infty$ . The last equation together with the assumption (346) implies $\|{\boldsymbol{\theta}}_{t_{*}}-{\boldsymbol{\theta}}_{0}\|_{2}<r_{*}$ , which contradicts the definition of $t_{*}$ . We conclude that $t_{*}=\infty$ , and Eq. (347) follows from Eq. (354).

In order to prove Eq. (349), let $\overline{\boldsymbol{y}}_{t}:=f_{n}({\boldsymbol{\theta}}_{0})+{\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{0})(\overline{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{0})$ . Note that this satisfies an equation similar to (352), namely

Define the difference ${\boldsymbol{r}}_{t}:={\boldsymbol{y}}_{t}-\overline{\boldsymbol{y}}_{t}$ . We then have $\dot{{\boldsymbol{r}}}_{t}=-({\boldsymbol{K}}_{t}/n){\boldsymbol{r}}_{t}-(({\boldsymbol{K}}_{t}-{\boldsymbol{K}}_{0})/n)(\overline{\boldsymbol{y}}_{t}-{\boldsymbol{y}})$ , whence

Using $2\lambda_{\min}({\boldsymbol{K}}_{t})/n\geq\lambda_{0}$ and $\|\overline{\boldsymbol{y}}_{t}-{\boldsymbol{y}}_{t}\|_{2}\leq\|{\boldsymbol{y}}_{0}-{\boldsymbol{y}}\|_{2}e^{-\lambda_{0}t/2}$ , we get

(In the last inequality, we used the fact that $L_{n}\|{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{0}\|\leq\sigma_{\min}/2$ by definition of $r_{*}$ .) Applying Grönwall’s inequality, and using ${\boldsymbol{r}}_{0}=0$ , we obtain

Here in $(a)$ we used Eq. (367) and in $(b)$ Eq. (359). Further using $\|{\boldsymbol{r}}_{t}\|_{2}\leq\|{\boldsymbol{y}}_{t}-{\boldsymbol{y}}\|_{2}+\|\overline{\boldsymbol{y}}_{t}-{\boldsymbol{y}}\|_{2}\leq 2\|{\boldsymbol{y}}_{0}-{\boldsymbol{y}}\|\exp(-\lambda_{0}t/2)$ , we get

Recall the gradient flow equations for ${\boldsymbol{\theta}}_{t}$ and $\overline{\boldsymbol{\theta}}_{t}$ :

Taking the difference of these equations, we get

where in $(a)$ we used Eqs. (348), (354) and (374). Integrating the last expression (thanks to $\overline{\boldsymbol{\theta}}_{0}={\boldsymbol{\theta}}_{0}$ ), we get

By writing $f({\boldsymbol{\theta}}_{t})-f_{\mbox{\rm\tiny lin}}({\boldsymbol{\theta}}_{t})=\int_{0}^{t}\frac{{\rm d}\phantom{s}}{{\rm d}s}[f({\boldsymbol{\theta}}_{s})-f_{\mbox{\rm\tiny lin}}({\boldsymbol{\theta}}_{s})]{\rm d}s$ , we get

In the last step we used Eq. (348) and noted that the same argument to prove the latter indeed also bounds the integral $\int_{0}^{t}\|\dot{{\boldsymbol{\theta}}}_{s}\|_{2}{\rm d}s$ (see Eq. (358)).

Finally, to bound term $E_{2}$ , note that $f_{\mbox{\rm\tiny lin}}({\boldsymbol{\theta}}_{t})-f_{\mbox{\rm\tiny lin}}(\overline{\boldsymbol{\theta}}_{t})={\boldsymbol{D}}f({\boldsymbol{\theta}}_{0})({\boldsymbol{\theta}}_{t}-\overline{\boldsymbol{\theta}}_{t})$ , and using Eq. (349), we get

Equation (350) follows by putting together the above bounds for $E_{1}$ and $E_{2}$ . ∎

We next pass to the case of two-layers networks:

Since the $y_{i}$ are $B^{2}$ sub-Gaussian, we have $\|{\boldsymbol{y}}\|_{2}\leq C_{1}B\sqrt{n}$ with the stated probability. Equation (388) follows since by construction $f_{n}({\boldsymbol{\theta}}^{(2)}_{0})=0$ .

Next, proceeding as in the proof of [OS20, Lemma 7] (letting ${\boldsymbol{b}}=(b_{j})_{j\leq m}$ )

We have $\|{\boldsymbol{X}}\|\leq 2(\sqrt{n}+\sqrt{d})$ with the probability at least $1-2\exp\{-(n\vee d)/C)$ [Ver18]. On this event, $F({\boldsymbol{X}},\,\cdot\,)$ is $2(\sqrt{n}+\sqrt{d})$ -Lipschitz with respect to ${\boldsymbol{W}}$ . Recall that the uniform measure on the sphere of radius $\sqrt{d}$ satisfies a log-Sobolev inequality with $\Theta(1)$ constant, [Led01, Chapter 5], that the log-Sobolev constant for a product measure is the same as the worst constant of each of the terms. We then have

Taking $t=C_{1}\sqrt{n}$ for a sufficiently large constant $C_{1}$ implies that the right-hand side is at most $2\exp(-(n\vee d)/C)$ , which proves the claim.

Notice that all the following inequalities are homogeneous in $\alpha>0$ . Hence, we will assume—without loss of generality—that $\alpha=1$ . Equation (389) follows from [OS20, Lemma 4]. Indeed this lemma implies

where ${\boldsymbol{K}}$ is the empirical NT kernel

Since ${\boldsymbol{M}}\succeq 0$ , we have

Hence $\sigma_{\max}({\boldsymbol{D}}f_{n}({\boldsymbol{\theta}}_{0}))\leq B\|{\boldsymbol{X}}\|$ and the claim follows from standard estimates of operator norms of random matrices with independent entries.

Equation (391) follows from [OS20, Lemma 5], which yields (after adapting to the different normalization of the ${\boldsymbol{x}}_{i}$ , and using the fact that $\max_{i\leq n}\|{\boldsymbol{x}}_{i}\|_{2}\leq C\sqrt{d}$ with probability at least $1-2\exp(-d/C)$ ):

This implies $\|{\boldsymbol{D}}f({\boldsymbol{\theta}}_{0})\|\leq B$ .

then, with probability at least $1-2\exp\{-n/C_{0}\}$ , the following hold for all $t\geq 0$ .

Gradient flow converges exponentially fast to a global minimizer. Specifically, letting $\lambda_{*}=C_{1}\alpha^{2}d/n$ , we have

The model constructed by gradient flow and linearized flow are similar on test data, namely

Throughout the proof, we use $C$ to denote constants depending only on $\sigma$ , that might change from line to line. Using Lemma 5.3, the condition (346) reads

which is equivalent to Eq. (415). We can therefore apply Theorem 5.1.

Equation (416) follows from Theorem 5.1, point 1, using the lower bound on $\sigma_{\min}$ given in Eq. (389).

Equation (417) follows from Theorem 5.1, point 3, using the estimates in Lemma 5.3. ∎