The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

Ben Adlam, Jeffrey Pennington

Introduction

Machine learning models based on deep neural networks have achieved widespread success across a variety of domains, often playing integral roles in products and services people depend on. As users rely on these systems in increasingly important scenarios, it becomes paramount to establish a rigorous understanding of when the models might work, and, crucially, when they might not. Unfortunately, the current theoretical understanding of deep learning is modest at best, as large gaps persist between theory and observation and many basic questions remain unanswered.

One of the most conspicuous such gaps is the unexpectedly good generalization performance of large, heavily-overparameterized models. These models can be so expressive that they can perfectly fit the training data (even when the labels are replaced by pure noise), but still manage to generalize well on real data (Zhang et al., 2016). An emerging paradigm for describing this behavior is in terms of a double descent curve (Belkin et al., 2019a), in which increasing a model’s capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold (where the number of parameters equals the number of samples), and then decrease again in the overparameterized regime.

There are of course more elaborate measures of a model’s capacity than a naive parameter count. Recent empirical and theoretical work studying the correlation of these capacity measures with generalization has found mixed results, with many measures having the opposite relationship with generalization that theory would predict (Neyshabur et al., 2017). Other work has questioned whether it is possible in principle for uniform convergence results to explain the generalization performance of neural networks (Nagarajan & Kolter, 2019).

Our approach is quite different. We consider the algorithm’s asymptotic performance on a specific data distribution, leveraging the large system size to get precise theoretical results. In particular, we examine the high-dimensional asymptotics of kernel ridge regression with respect to the Neural Tangent Kernel (NTK) (Jacot et al., 2018) and conclude that double descent does not always provide an accurate or complete picture of generalization performance. Instead, we identify complex non-monotonic behavior in the test error as the number of parameters varies across multiple scales and find that it can exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

Our theoretical analysis focuses on the NTK of a single-layer fully-connected model when the samples are drawn independently from a Gaussian distribution and the targets are generated by a wide teacher neural network. We provide an exact analytical characterization of the generalization error in the high-dimensional limit in which the number of samples $m$ , the number of features $n_{0}$ , and the number of hidden units $n_{1}$ tend to infinity with fixed ratios $\phi\mathrel{\mathop{:}}=n_{0}/m$ and $\psi\mathrel{\mathop{:}}=n_{0}/n_{1}$ . By adjusting these ratios, we reveal the intricate ways in which the generalization error depends on the dataset size and the effective model capacity.

We investigate various limits of our results, including the behavior when the NTK degenerates into the kernel with respect to only the first-layer or only the second-layer weights. The latter corresponds to the standard setting of random feature ridge regression, which was recently analyzed in (Mei & Montanari, 2019). In this case, the total number of parameters $p$ is equal to the width $n_{1}$ , i.e. $p=n_{1}=(\phi/\psi)m$ , so that $p$ is linear in the dataset size. In contrast, for the full kernel, the number of parameters is $p=(n_{0}+1)n_{1}=(\phi^{2}/\psi)m^{2}+(\phi/\psi)m$ , i.e. it is quadratic in the dataset size. By studying these two kernels, we derive insight into the generalization performance in the vicinities of linear and quadratic overparameterization, and by piecing these two perspectives together, we infer the existence of multi-scale phenomena, which sometimes can include triple descent. See Fig. 1 for an illustration and Fig. 4 for empirical confirmation of this behavior.
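The two parameter counts above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `parameter_counts` and the numerical values of $m$, $\phi$, and $\psi$ are assumptions chosen for the example, not values used in the paper.

```python
def parameter_counts(m, phi, psi):
    """Return (n0, n1, p_second_layer, p_full) given m, phi = n0/m, psi = n0/n1."""
    n0 = phi * m                 # number of input features
    n1 = n0 / psi                # hidden-layer width
    p_second_layer = n1          # only second-layer weights (random features)
    p_full = (n0 + 1) * n1       # count for the full kernel, as stated in the text
    return n0, n1, p_second_layer, p_full


# Hypothetical example values.
m, phi, psi = 1000, 0.5, 0.25
n0, n1, p_second_layer, p_full = parameter_counts(m, phi, psi)

# Closed forms from the text: linear and quadratic in the dataset size m.
linear_form = (phi / psi) * m
quadratic_form = (phi**2 / psi) * m**2 + (phi / psi) * m
```

Doubling `m` at fixed `phi` and `psi` doubles `p_second_layer` but roughly quadruples `p_full`, which is the distinction between linear and quadratic overparameterization drawn above.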

We derive exact high-dimensional asymptotic expressions for the test error of NTK ridge regression.

We prove that the test error can exhibit non-monotonic behavior deep in the overparameterized regime.

We investigate the origins of this non-monotonicity and attribute them to the kernel with respect to the second-layer weights.

We provide empirical evidence that triple descent can indeed occur for finite-sized networks trained with gradient descent.

We find exceptionally fast learning curves in the noiseless case, with $E_{\text{test}}\sim m^{-2}$ .

Related Work

A recent line of work studying the behavior of interpolating models was initiated by the intriguing experimental results of (Zhang et al., 2016; Belkin et al., 2018b), which showed that deep neural networks and kernel methods can generalize well even in the interpolation regime. A number of theoretical results have since established this behavior in certain settings, such as interpolating nearest neighbor schemes (Belkin et al., 2018a) and kernel regression (Belkin et al., 2019c; Liang et al., 2020b).

These observations, coupled with classical notions of the bias-variance tradeoff, have given rise to the double descent paradigm for understanding how test error depends on model complexity. These ideas were first discussed in (Belkin et al., 2019a), and empirical evidence was obtained in (Advani & Saxe, 2017; Geiger et al., 2020) and recently in (Nakkiran et al., 2019). Precise theoretical predictions soon confirmed this picture for linear regression in various scenarios (Belkin et al., 2019b; Hastie et al., 2019; Mitra, 2019).

Linear models struggle to capture all of the phenomena relevant to double descent because the parameter count is tied to the number of features. Recent work found multiple descents in the test loss for minimum-norm interpolants in Reproducing Kernel Hilbert Spaces (Liang et al., 2020a), but it similarly requires changing the data distribution to vary model capacity. A precise analysis of a nonlinear system for a fixed data generating process is the most direct way to draw insight into double descent. A recent preprint (Mei & Montanari, 2019) shares this view and adopts a similar analysis to ours, but focuses entirely on the standard case of unstructured random features. Such a setup can indeed model double descent, and certainly bears relevance to certain wide neural networks in which only the top-layer weights are optimized (Neal, 1996; Rahimi & Recht, 2008; Lee et al., 2018; de G. Matthews et al., 2018; Lee et al., 2019), but its connection to neural networks trained with gradient descent remains less clear.

Gradient-based training of wide neural networks initialized in the standard way was recently shown to correspond to kernel gradient descent with respect to the Neural Tangent Kernel (Jacot et al., 2018). This result has spawned renewed interest in kernel methods and their connection to deep learning; a woefully incomplete list of papers in this direction includes Lee et al. (2019); Chizat et al. (2019); Du et al. (2019, 2018); Arora et al. (2019); Xiao et al. (2019).

To connect these research directions, our analysis requires tools and recent results from random matrix theory and free probability. A central challenge stems from the fact that many of the matrices in question have nonlinear dependencies between the elements, which arise from the nonlinear feature matrix $F=\sigma(WX)$. This challenge was overcome in (Pennington & Worah, 2017), which computed the spectrum of $F$, and in (Pennington & Worah, 2018), which examined the spectrum of the Fisher information matrix; see also (Louart et al., 2018). We also utilize the results of (Adlam et al., 2019; Péché et al., 2019), which established a linear signal plus noise model for $F$ that shares the same bulk statistics. This linearized model allows us to write the test error as the trace of a rational function of the underlying random matrices. The methods we use to compute such quantities rely on so-called linear pencils that represent the rational function in terms of the inverse of a larger block matrix (Helton et al., 2018), and on operator-valued free probability for computing the trace of the latter (Far et al., 2006).

Preliminaries

In this section, we introduce our theoretical setting and some of the tools required to state our results.

Let $\hat{y}(\mathbf{x})$ denote the model’s predictive function. We consider squared error, so the test loss is,

$$E_{\text{test}}=\mathbb{E}_{(\mathbf{x},y)}\big(y-\hat{y}(\mathbf{x})\big)^{2},\tag{2}$$

where the expectation is over an iid test point $(\mathbf{x},y)$ conditional on the training set, the teacher parameters, and any randomness in the learning algorithm producing $\hat{y}$ , such as the random parameters defining the random features. Note that the test loss is a random variable; however, in the high-dimensional asymptotics we consider here, it concentrates about its mean.

Neural Tangent Kernel Regression

We consider predictive functions $\hat{y}$ defined by approximate (i.e. random feature) kernel ridge regression using the Neural Tangent Kernel (NTK) of a single-hidden-layer neural network. The NTK can be considered a kernel $K$ that is approximated by random features corresponding to the Jacobian $J$ of the network’s output with respect to its parameters, i.e. $K(\mathbf{x}_{1},\mathbf{x}_{2})=J(\mathbf{x}_{1})J(\mathbf{x}_{2})^{\top}$ . As the width of the network becomes very large (compared to all other relevant scales in the system), the approximate NTK converges to a constant kernel determined by the network’s initial parameters and describes the trajectory of the network’s output under gradient descent. In particular,

where $N_{t}(\mathbf{x})$ is the output of the network at time $t$, $K\mathrel{\mathop{:}}=K(\gamma)=K(X,X)+\gamma I_{m}$, $K_{\mathbf{x}}\mathrel{\mathop{:}}=K(X,\mathbf{x})$, $\eta$ is the learning rate, and $\gamma$ is a ridge regularization constant. (These overloaded definitions of $K$ can be distinguished by the number of arguments and should be clear from context.) For this work, we are interested in the $t\to\infty$ limit of (3), which defines the predictive function,

We remark that if the width is not asymptotically larger than the dataset size, the validity of (3) can break down and (4) may not accurately describe the late-time predictions of the neural network. While this potential discrepancy is an interesting topic, we defer an in-depth analysis to future work (but see Fig. 4 for an empirical analysis of gradient descent). Instead, we regard (4) as the definition of our predictive function and focus on kernel regression with the NTK. We believe this setup is interesting in its own right; for example, recent work has demonstrated its effectiveness as a kernel method on complex image datasets (Li et al., 2019) and found it to be competitive with neural networks in small data regimes.
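For orientation, the dynamics (3) and its late-time limit (4) can be sketched as follows. This is a reconstruction under assumptions: the labels $Y$ and the initial outputs $N_{0}(X)$ are treated as row vectors of length $m$, and the dynamics take the standard linearized (NTK) form with the regularized kernel $K=K(X,X)+\gamma I_{m}$.

```latex
\begin{align}
N_{t}(\mathbf{x}) &= N_{0}(\mathbf{x}) + \big(Y - N_{0}(X)\big)\,K^{-1}\big(I_{m} - e^{-\eta t K}\big)\,K_{\mathbf{x}}, \tag{3}\\
\hat{y}(\mathbf{x}) &= \lim_{t\to\infty} N_{t}(\mathbf{x})
 = N_{0}(\mathbf{x}) + \big(Y - N_{0}(X)\big)\,K^{-1}K_{\mathbf{x}}. \tag{4}
\end{align}
```

Since $K$ is positive definite whenever $\gamma>0$, the factor $e^{-\eta tK}$ vanishes as $t\to\infty$, which is how the second line would follow from the first.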

In this work, we restrict our study to the NTK of single-hidden-layer fully-connected networks. In particular, consider a network with width $n_{1}$ and pointwise activation function $\sigma$, defined by,

We collect our assumptions on the activation functions below, in Assumption 1. Their main purpose is to ensure that certain moments and derivatives exist almost surely, but for simplicity we state somewhat stronger conditions than are actually required for our analysis. To simplify the already cumbersome algebraic manipulations, we assume that $\sigma$ has zero Gaussian mean. We emphasize that this condition is not essential and our techniques easily generalize to all commonly used activation functions.

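The Jacobian of (5) with respect to the parameters naturally decomposes into the Jacobian with respect to $W_{1}$ and $W_{2}$, i.e. $J(\mathbf{x})=[\partial N_{0}(\mathbf{x})/\partial W_{1},\partial N_{0}(\mathbf{x})/\partial W_{2}]=[J_{1}(\mathbf{x}),J_{2}(\mathbf{x})]$. Therefore the kernel $K$ also decomposes this way, and we can write $K(\mathbf{x}_{1},\mathbf{x}_{2})=J_{1}(\mathbf{x}_{1})J_{1}(\mathbf{x}_{2})^{\top}+J_{2}(\mathbf{x}_{1})J_{2}(\mathbf{x}_{2})^{\top}=:K_{1}(\mathbf{x}_{1},\mathbf{x}_{2})+K_{2}(\mathbf{x}_{1},\mathbf{x}_{2})$.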

A simple calculation yields the per-layer constituent kernels,

where we have introduced the abbreviations $F=\sigma(W_{1}X/\sqrt{n_{0}})$ and $F^{\prime}=\sigma^{\prime}(W_{1}X/\sqrt{n_{0}})$ . Notice that when $\sigma_{W_{2}}^{2}\to 0$ , $K=K_{2}$ , i.e. the NTK degenerates into the standard random features kernel. However, the predictive function (4) contains an offset $N_{0}(\mathbf{x})$ which would typically be set to zero in standard random feature kernel regression because it simply increases the variance of test predictions. Removing this variance component has an analogous operation in neural network training: either the function value at initialization can be subtracted throughout training, or a symmetrization trick can be used in which two copies of the NN are initialized identically, and their normalized difference $N\equiv\left({N^{(a)}-N^{(b)}}\right)/{\sqrt{2}}$ is trained with gradient descent. Either method preserves the kernel $K$ while enforcing $N_{0}\equiv 0$ . We call this setup centering, and present results with and without it.
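The per-layer kernels can be illustrated numerically. The sketch below is not the paper's code: it assumes the network $N(\mathbf{x})=W_{2}\,\sigma(W_{1}\mathbf{x}/\sqrt{n_{0}})/\sqrt{n_{1}}$ (the $1/\sqrt{n_{1}}$ normalization is an assumption of this example), uses $\tanh$ as a zero-mean activation, and the helper name `per_layer_ntk` is hypothetical. Under these assumptions $K_{1}$ is a Hadamard product involving $F^{\prime}$ and $W_{2}$, and $K_{2}$ is the random features Gram matrix built from $F$.

```python
import numpy as np


def per_layer_ntk(X, W1, W2, sigma, dsigma):
    """Per-layer empirical NTK Gram matrices (K1, K2) on the columns of X.

    Assumed network (normalization is an assumption of this sketch):
        N(x) = W2 @ sigma(W1 @ x / sqrt(n0)) / sqrt(n1)
    """
    n1, n0 = W1.shape
    Z = W1 @ X / np.sqrt(n0)                 # pre-activations, shape (n1, m)
    F, F_prime = sigma(Z), dsigma(Z)
    # Gradient w.r.t. W2 gives the random-features kernel.
    K2 = F.T @ F / n1
    # Gradient w.r.t. W1 gives a Hadamard product of two m x m matrices.
    K1 = (X.T @ X / n0) * ((F_prime * (W2 ** 2)[:, None]).T @ F_prime / n1)
    return K1, K2


rng = np.random.default_rng(0)
n0, n1, m, sigma_w2 = 3, 4, 5, 0.7           # tiny hypothetical sizes
X = rng.standard_normal((n0, m))
W1 = rng.standard_normal((n1, n0))
W2 = sigma_w2 * rng.standard_normal(n1)
dtanh = lambda z: 1.0 / np.cosh(z) ** 2      # derivative of tanh

K1, K2 = per_layer_ntk(X, W1, W2, np.tanh, dtanh)
K = K1 + K2                                  # full empirical NTK
```

Because $K_{1}$ scales with the squared entries of $W_{2}$, sending $\sigma_{W_{2}}\to 0$ in this sketch leaves only $K_{2}$, matching the degeneration described above.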

Finally, we note that ridge regularization in the kernel perspective corresponds to using L2 regularization of the neural network’s weights toward their initial values.
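This correspondence can be made explicit for a network linearized around its initialization. The derivation below is a sketch under stated assumptions: column-vector conventions (so $J(X)\in\mathbb{R}^{m\times p}$ and $Y\in\mathbb{R}^{m}$), a squared loss with penalty $\gamma\lVert\theta-\theta_{0}\rVert_{2}^{2}$, and the symbols $\theta$, $\theta_{0}$, and $\delta$, which are introduced here for illustration only.

```latex
\begin{align}
\delta^{\star} &= \arg\min_{\delta}\;\big\lVert Y - N_{0}(X) - J(X)\,\delta\big\rVert_{2}^{2} + \gamma\lVert\delta\rVert_{2}^{2},
\qquad \delta := \theta-\theta_{0},\\
\delta^{\star} &= \big(J(X)^{\top}J(X)+\gamma I_{p}\big)^{-1}J(X)^{\top}\big(Y-N_{0}(X)\big)
 = J(X)^{\top}\big(J(X)J(X)^{\top}+\gamma I_{m}\big)^{-1}\big(Y-N_{0}(X)\big),\\
N_{0}(\mathbf{x}) + J(\mathbf{x})\,\delta^{\star} &= N_{0}(\mathbf{x}) + K(\mathbf{x},X)\,\big(K(X,X)+\gamma I_{m}\big)^{-1}\big(Y-N_{0}(X)\big).
\end{align}
```

The second line uses the push-through identity $(J^{\top}J+\gamma I)^{-1}J^{\top}=J^{\top}(JJ^{\top}+\gamma I)^{-1}$, and the last line is the kernel ridge predictor with ridge constant $\gamma$.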

Three Regimes of Parameterization

In this section, we outline an argument based on the structure of the NTK as to why one should expect the test error to exhibit non-trivial phenomena at two different scales of overparameterization. From the expressions for the test error (2) and the predictive function (4), it is evident that the behavior of the test error is determined by the spectral properties of the NTK. Although the fine details of the relationship can only be revealed by the explicit calculation, we can nevertheless make some basic high-level observations based on the coarser structure of the kernel.

The number of trainable parameters $p$ relative to the dataset size $m$ controls the amount of parameterization or complexity of a model. In our setting of a single-hidden-layer fully-connected neural network, $p=n_{1}(n_{0}+1)$ , and for a fixed dataset, we can adjust the ratio $p/m$ by varying the hidden-layer width $n_{1}$ .

These two scales, $p=\Theta(m)$ and $p=\Theta(m^{2})$, partition the degree of parameterization into three regimes. We consider the classical regime to be when $p\lesssim m$, because classical generalization theory tends to hold and the U-shaped test error curve is observed. The transition around $p=\Theta(m)$ manifests as a sharp rise in the test loss near the interpolation threshold, followed by a quick descent as $p$ increases further, as can be seen in Fig. 2(a). We call this the linear scaling transition. After this, we enter a regime we call abundant parameterization, when $m\lesssim p\lesssim m^{2}$. In this regime, the test error tends to decrease until $p$ nears the vicinity of $m^{2}$, where it can sometimes increase again, producing a second U-shaped curve. When $p=\Theta(m^{2})$, another transition is observed, which we call the quadratic scaling transition and which can be seen in Fig. 2(b). On the other side of this transition, where $p\gtrsim m^{2}$, lies a regime we call superabundant parameterization. See Fig. 1 for an illustration of this general picture.

While the classical regime has been long studied, and the superabundant regime has generated considerable recent interest due to the NTK, our main aim in delineating the above regimes is to highlight the existence of the intermediate scale containing complex phenomenology. For this reason, we focus our theoretical analysis on the novel scaling regime in which $p=\Theta(m^{2})$ . In particular, as mentioned in Sec. 1, we consider the high-dimensional asymptotics in which $n_{0},n_{1},m\to\infty$ with $\phi\mathrel{\mathop{:}}=n_{0}/m$ and $\psi\mathrel{\mathop{:}}=n_{0}/n_{1}$ held constant.

Overview of Techniques

In this section, we provide a high-level overview of the analytical tools and mathematical results we use to compute the generalization error. To begin with, let us first describe the main technical challenges in computing explicit asymptotic limits of (2).

The first challenge, which is evident upon inspecting (8), is that the kernel contains a Hadamard product of random matrices, for which concrete results in the random matrix literature are few and far between. We address this problem in Sec. 4.1.

The second challenge, which is apparent by inspecting (9), is that the kernel depends on random matrices with nonlinear dependencies between the entries. We describe how to circumvent this difficulty in Sec. 4.2.

Finally, by expanding the square in (2) and substituting (4), we find terms that are constant, linear, and quadratic in $K^{-1}$. Some of the random matrices that appear inside the matrix inverses (e.g. $X$ and $W_{1}$) also appear outside of them as multiplicative factors, a situation that prevents the straightforward application of many standard proof techniques in random matrix theory. We describe how to overcome this challenge in Sec. 4.3.
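The third challenge can be seen by writing out the expansion. As a sketch, assume the centered predictor $\hat{y}(\mathbf{x})=YK^{-1}K_{\mathbf{x}}$ (that is, $N_{0}\equiv 0$) with $Y$ a row vector of training labels; both conventions are assumptions of this illustration.

```latex
\begin{align}
E_{\text{test}} &= \mathbb{E}_{(\mathbf{x},y)}\big(y - YK^{-1}K_{\mathbf{x}}\big)^{2}\\
 &= \underbrace{\mathbb{E}\,y^{2}}_{\text{constant in }K^{-1}}
 \;-\; 2\,\underbrace{\mathbb{E}\big[y\,K_{\mathbf{x}}^{\top}K^{-1}Y^{\top}\big]}_{\text{linear in }K^{-1}}
 \;+\; \underbrace{\mathbb{E}\big[YK^{-1}K_{\mathbf{x}}K_{\mathbf{x}}^{\top}K^{-1}Y^{\top}\big]}_{\text{quadratic in }K^{-1}} .
\end{align}
```

The factors $K_{\mathbf{x}}$ outside the inverse are built from the same $X$ and $W_{1}$ that define $K$, which is the coupling referred to above.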

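A straightforward central limiting argument shows that in the asymptotic limit the entries of $W_{1}X/\sqrt{n_{0}}$ are marginally Gaussian with mean zero and unit variance. As such, the first and second moments of the entries in the matrix $F^{\prime}=\sigma^{\prime}(W_{1}X/\sqrt{n_{0}})$ are equal to

$$\mathbb{E}_{z}\,\sigma^{\prime}(z)=\sqrt{\zeta}\qquad\text{and}\qquad\mathbb{E}_{z}\,\sigma^{\prime}(z)^{2}=\eta^{\prime},\qquad z\sim\mathcal{N}(0,1).$$
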
It follows that we can split $K_{1}$ into two terms,

where $\bar{F}^{\prime}$ is a centered version of $F^{\prime}$. Focusing on the first term, because $n_{0}n_{1}=(\phi^{2}/\psi)\,m^{2}$, the random fluctuations in the off-diagonal elements are $\mathcal{O}(1/m)$, which are too small to contribute to the spectrum or moments of an $m\times m$ matrix whose diagonal entries are order one. In fact, the diagonal entries are simply proportional to the variance of the entries of $F^{\prime}$, namely $(\eta^{\prime}-\zeta)$. Putting this together, we can eliminate the Hadamard product entirely and write,

where the $\cong$ notation means the two matrices share the same bulk statistics asymptotically. We make this argument precise in Sec. S1.
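The elimination of the Hadamard product rests on two elementary identities. As a sketch, take the matrix $M=\zeta\mathbf{1}\mathbf{1}^{\top}+(\eta^{\prime}-\zeta)I$ from Sec. S1 as the entrywise mean of the second factor, write $A:=X^{\top}X/n_{0}$ (a symbol introduced here), assume the entries of $X$ are standard Gaussian, and omit any overall prefactor.

```latex
\begin{align}
A\odot M &= \zeta\,\big(A\odot\mathbf{1}\mathbf{1}^{\top}\big) + (\eta^{\prime}-\zeta)\,\big(A\odot I_{m}\big)
 = \zeta\,A + (\eta^{\prime}-\zeta)\,\operatorname{diag}(A),\\
\operatorname{diag}(A)_{ii} &= \lVert\mathbf{x}_{i}\rVert_{2}^{2}/n_{0}\;\to\;1
\quad\Longrightarrow\quad
A\odot M\;\cong\;\zeta\,\frac{X^{\top}X}{n_{0}} + (\eta^{\prime}-\zeta)\,I_{m}.
\end{align}
```

The right-hand side contains only an ordinary matrix product and a multiple of the identity, so standard random matrix tools apply to it.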

Linearization 1: Gaussian Equivalents

The test error (2) involves large random matrices with nonlinear dependencies, which are not immediately amenable to standard methods of analysis in random matrix theory. The main culprit is the random feature matrix $F=\sigma(W_{1}X/\sqrt{n_{0}})$ , but $f\mathrel{\mathop{:}}=\sigma(W_{1}\mathbf{x}/\sqrt{n_{0}})$ , $Y=\omega\sigma_{\textsc{t}}(\Omega X/\sqrt{n_{0}})/\sqrt{n_{\textsc{t}}}+\mathcal{E}$ , and $y\mathrel{\mathop{:}}=\omega\sigma_{\textsc{t}}(\Omega\mathbf{x}/\sqrt{n_{0}})/\sqrt{n_{\textsc{t}}}$ all suffer from the same issue.

The solution is to replace each of these matrices with an equivalent matrix without nonlinear dependencies, but chosen to maintain the same first- and second-order moments for all of the terms that appear in the test error (2). This approach was described for $F$ in (Adlam et al., 2019) (see also (Péché et al., 2019)). The upshot is that the test error is asymptotically invariant to the following substitutions,

The new objects $\Theta_{F}$ , $\Theta_{Y}$ , $\theta_{f}$ , and $\theta_{y}$ are matrices of the appropriate shapes with iid standard Gaussian entries. The constants $\eta,\zeta,\eta_{\textsc{t}}$ , and $\zeta_{\textsc{t}}$ are chosen so that the mixed moments up to second order are the same for the original and linearized versions. In particular,

The statement that the test error only depends on $Y^{\text{lin}}$ is consistent with the observations made in (Ghorbani et al., 2019; Mei & Montanari, 2019) that in the high-dimensional regime where $n_{0}=\Theta(m)$ , only linear functions of the data can be learned. Indeed, $Y^{\text{lin}}$ is equivalent to a linear teacher plus noise with signal-to-noise ratio given by,

We often make this equivalence to a linear teacher explicit by setting $\sigma_{\textsc{t}}(x)=x$ , which implies $\eta_{\textsc{t}}=\zeta_{\textsc{t}}=1$ . Doing so also removes the noise from the test label, but since this noise merely contributes an additive shift to the test loss, removing it does not change any of our conclusions.
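The moment-matching idea can be illustrated numerically. The sketch below rests on assumptions that should be read as such: the constants are taken to be $\eta=\mathbb{E}\,\sigma(z)^{2}$, $\zeta=(\mathbb{E}\,\sigma^{\prime}(z))^{2}$, and $\eta^{\prime}=\mathbb{E}\,\sigma^{\prime}(z)^{2}$ for $z\sim\mathcal{N}(0,1)$, and the Gaussian equivalent of $F$ is taken to be $\sqrt{\zeta/n_{0}}\,W_{1}X+\sqrt{\eta-\zeta}\,\Theta_{F}$. The function names, the choice of $\tanh$, and the matrix sizes are hypothetical.

```python
import numpy as np


def gaussian_constants(sigma, dsigma, deg=80):
    """Return (eta, zeta, eta_prime) for z ~ N(0, 1) via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(deg)
    weights = weights / np.sqrt(2.0 * np.pi)           # normalize to a probability measure
    eta = np.sum(weights * sigma(nodes) ** 2)          # E[sigma(z)^2]
    zeta = np.sum(weights * dsigma(nodes)) ** 2        # (E[sigma'(z)])^2
    eta_prime = np.sum(weights * dsigma(nodes) ** 2)   # E[sigma'(z)^2]
    return eta, zeta, eta_prime


def linearized_features(W1, X, eta, zeta, rng):
    """Assumed Gaussian equivalent of F = sigma(W1 X / sqrt(n0))."""
    n0 = X.shape[0]
    theta_F = rng.standard_normal((W1.shape[0], X.shape[1]))
    return np.sqrt(zeta / n0) * (W1 @ X) + np.sqrt(eta - zeta) * theta_F


def gram_moments(F):
    """First two normalized spectral moments of the Gram matrix F^T F / n1."""
    n1, m = F.shape
    G = F.T @ F / n1
    return np.trace(G) / m, np.trace(G @ G) / m


rng = np.random.default_rng(0)
n0, n1, m = 300, 400, 350                              # hypothetical sizes
X = rng.standard_normal((n0, m))
W1 = rng.standard_normal((n1, n0))
dtanh = lambda z: 1.0 / np.cosh(z) ** 2

eta, zeta, eta_prime = gaussian_constants(np.tanh, dtanh)
F = np.tanh(W1 @ X / np.sqrt(n0))
F_lin = linearized_features(W1, X, eta, zeta, rng)
m1, m2 = gram_moments(F)
m1_lin, m2_lin = gram_moments(F_lin)
```

Comparing `(m1, m2)` with `(m1_lin, m2_lin)` gives a quick finite-size check that the nonlinear and linearized feature matrices have similar bulk spectral moments; it is a sanity check, not a substitute for the asymptotic statement.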

Linearization 2: Linear Pencil

Next we turn our attention to the actual computation of the asymptotic test loss. For simplicity, we discuss the centered setting with $N_{0}=0$, which captures all of the technical complexities. Expanding the test error (2), we have,

which, when applied to (21) together with the substitutions (13)-(16), expresses the test error directly in terms of the iid Gaussian random matrices $W_{1},X,\Theta_{F},\Omega,\Theta_{Y},\mathcal{E},\theta_{f},\theta_{y}$ and $\mathbf{x}$ . The expectations over $\mathbf{x}$ and $\mathcal{E}$ are trivial because these variables do not appear inside the matrix inverse $K^{-1}$ . Moreover, asymptotically the traces concentrate around their means with respect to $\Omega,\Theta_{Y},\theta_{f}$ and $\theta_{y}$ , which we can also compute easily for the same reason. Therefore, the test error can be written as,

Eqn. (24) is a rational function of the noncommutative random variables $W_{1},X,$ and $\Theta_{F}$ . A useful result from noncommutative algebra guarantees that such a rational function can be linearized in the sense that it can be expressed in terms of the inverse of a matrix whose entries are linear in the noncommutative variables. This representation is often called a linear pencil, and is not unique; see e.g. (Helton et al., 2018) for details.

To illustrate this concept, consider the simple case of $K^{-1}$ . After applying the substitutions (13)-(16) to (22), a linear pencil is given by

which can be checked by an explicit computation of the block matrix inverse. After obtaining a linear pencil for each of the terms in (24), the only task that remains is computing the trace. Since each linear pencil is a block matrix whose blocks are iid Gaussian random matrices, its trace can be evaluated using the techniques described in (Far et al., 2006) or through the general formalism of operator-valued free probability. We refer the reader to the book (Mingo & Speicher, 2017) for more details on these topics.
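The idea of a linear pencil can be illustrated on a toy resolvent. The example below is generic and is not the pencil used in the paper: it represents $(\gamma I+A^{\top}A)^{-1}$, which is quadratic in $A$, as a block of the inverse of a matrix whose entries are linear in $A$. The function name and the sizes are hypothetical.

```python
import numpy as np


def resolvent_via_pencil(A, gamma):
    """Return (gamma*I + A^T A)^{-1}, read off from the inverse of a block matrix linear in A."""
    n, m = A.shape
    pencil = np.block([
        [gamma * np.eye(m), A.T],
        [-A,                np.eye(n)],
    ])
    # By the Schur complement formula, the top-left block of the inverse is
    # (gamma*I - A^T (I)^{-1} (-A))^{-1} = (gamma*I + A^T A)^{-1}.
    return np.linalg.inv(pencil)[:m, :m]


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4)) / np.sqrt(6)
gamma = 0.1
direct = np.linalg.inv(gamma * np.eye(4) + A.T @ A)
via_pencil = resolvent_via_pencil(A, gamma)
```

The same mechanism, applied with more blocks, is what allows rational functions of several random matrices to be expressed through a single block inverse.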

Asymptotic Training and Test Error

The calculations described in the previous section are presented in the Supplementary Materials. Here we present the main results.

The subtraction of $\sigma_{\varepsilon}^{2}$ in eqn. (27) is because we have assumed that there is no label noise on the test points. Had we included the same label noise on both the training and test distributions, that term would be absent.

When $\nu=0$ , the quantity $(\gamma\tau_{1})^{-2}E_{\text{train}}$ on the right hand side of eqn. (27) is precisely the generalized cross-validation (GCV) metric of (Golub et al., 1979). Theorem 1 shows that the GCV gives the exact asymptotic test error for the problem studied here.
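The GCV identity can be checked on a small kernel ridge problem. The sketch below assumes that $\tau_{1}$ plays the role of the normalized trace $\tfrac{1}{m}\operatorname{tr}K^{-1}$ with $K=K(X,X)+\gamma I_{m}$, and that $E_{\text{train}}$ is the mean squared training residual; these identifications, the synthetic data, and the function name are assumptions of this illustration.

```python
import numpy as np


def gcv_kernel_ridge(K0, y, gamma):
    """Return (gcv, e_train, tau1) for kernel ridge regression with Gram matrix K0."""
    m = K0.shape[0]
    K_inv = np.linalg.inv(K0 + gamma * np.eye(m))
    # Training residual: y - K0 K^{-1} y = gamma * K^{-1} y.
    residual = gamma * (K_inv @ y)
    e_train = np.mean(residual ** 2)
    tau1 = np.trace(K_inv) / m               # assumed meaning of tau_1
    gcv = e_train / (gamma * tau1) ** 2      # (gamma * tau1)^{-2} * E_train
    return gcv, e_train, tau1


rng = np.random.default_rng(0)
m, d, gamma = 50, 20, 0.5                    # hypothetical sizes
Z = rng.standard_normal((m, d))
K0 = Z @ Z.T / d                             # a simple linear-kernel Gram matrix
y = Z @ rng.standard_normal(d) / np.sqrt(d) + 0.1 * rng.standard_normal(m)

gcv, e_train, tau1 = gcv_kernel_ridge(K0, y, gamma)
```

Here $\gamma\tau_{1}$ equals the normalized trace of $I-H$, where $H=K_{0}(K_{0}+\gamma I)^{-1}$ is the hat matrix, so the returned `gcv` coincides with the usual definition of generalized cross-validation.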

Test Error in Limiting Cases

While the explicit formulas in the preceding section provide an exact characterization of the asymptotic training and test loss, they do not readily admit clear interpretations. On the other hand, eqn. (LABEL:eq:prop1) and therefore the expressions for $E_{\text{test}}$ simplify considerably under several natural limits, which we examine in this section.

Here we examine the test error in the superabundant regime in which the width $n_{1}$ is larger than any constant times the dataset size $m$ , which can be obtained by letting $\psi\to 0$ and $\psi/\phi\to 0$ . In this setting we find,

where $\nu=0$ with centering and $\nu=1$ without it and $\rho\mathrel{\mathop{:}}=\zeta(1+\sigma_{W_{2}}^{2})$ , $\,\xi\mathrel{\mathop{:}}=\gamma+\eta+\sigma_{W_{2}}^{2}\eta^{\prime}$ , and

The learning curve is remarkably steep with centering. To see this, we expand the result as $m\to\infty$ , i.e. as $\phi\to 0$ ,

Interestingly, we see that when the network is superabundantly parameterized, we obtain very fast learning curves: for finite SNR, $E_{\text{test}}\sim m^{-1}$, and in the noiseless case $E_{\text{test}}\sim m^{-2}$. See Fig. 3(b).

Small Width Limit

Here we consider the limit in which the width $n_{1}$ is smaller than any constant times the dataset size $m$ or the number of features $n_{0}$ , which can be obtained by letting $\psi\to\infty$ with $\phi$ held constant. In this setting we find,

where $\,\xi_{1}\mathrel{\mathop{:}}=\eta^{\prime}+\gamma/\sigma_{W_{2}}^{2}$ , and

The small width limit characterizes one boundary of the abundant parameterization regime and as such provides an upper bound on the test loss in that regime. Therefore, a sufficient condition for the global minimum to occur at intermediate widths is $E_{\text{test}}|_{\psi\to\infty}<E_{\text{test}}|_{\psi=0}$ . By comparing eqn. (28) to eqn. (31), precise though unenlightening constraints on the parameters can be derived for satisfying this condition. One such configuration is illustrated in Fig. 4(b).

Large Dataset Limit

Here we consider the limit in which the dataset $m$ is larger than any constant times the width $n_{1}$ , which can be obtained by letting $\phi\to 0$ with $\phi/\psi\to 0$ . In this setting we find,

where $\nu=0$ with centering and $\nu=1$ without it, and,

Here again we observe very steep learning curves, similar to the large width limit above.

Ridgeless Limit: First-Layer Kernel

Here we examine the ridgeless limit $\gamma\to 0$ of the first-layer kernel $K_{1}$ . We find that the result can be obtained through a degeneration of (28),

where $\bar{\chi}\mathrel{\mathop{:}}=\sqrt{(\zeta+\eta^{\prime}\phi)^{2}-4\phi\zeta^{2}}$ and we have specialized to the centered case $\nu=0$. The expansion as $m\to\infty$ also looks similar to (30) and can be obtained from that equation by substituting $\xi/\rho\to\eta^{\prime}/\zeta$.

Ridgeless Limit: Second-Layer Kernel

Here we examine the ridgeless limit $\gamma\to 0$ when the kernel is due to the second-layer weights only, i.e. $K_{2}$ . This limit can be obtained by letting $\sigma_{W_{2}}\to 0$ . In this setting, the result can be expressed as,

where $\omega\mathrel{\mathop{:}}=\max\{{\phi,\psi}\}$ , $\beta\mathrel{\mathop{:}}=\zeta+\omega\eta-\chi$ , and

and we have again specialized to the centered case $\nu=0$ . This expression is in agreement with the result presented in (Mei & Montanari, 2019).

When the system is far in the regime of abundant parameterization, namely $p=n_{1}\gg m$ (or $\psi/\phi\to 0$ ), we can examine the large dataset behavior by first sending $\psi\to 0$ and then expanding as $\phi\to 0$ . The result is described by (30) by substituting $\xi/\rho\to\eta/\zeta$ .

Quadratic Overparameterization

In this section, we investigate the implications of our theoretical results about the generalization performance of NTK regression in the quadratic scaling limit $n_{0},n_{1},m\to\infty$ with $\phi=n_{0}/m$ and $\psi=n_{0}/n_{1}$ held constant. Our high-level observation is that there is complex non-monotonic behavior in this regime as these ratios are varied, and that this behavior can depend on the signal-to-noise ratio and the initial parameter variance $\sigma_{W_{2}}^{2}$ in intricate ways. We highlight a few examples in Fig. 3.

In Fig. 3(a), we plot the test error as a function of $\phi$ and $\phi/\psi$ , which reveals the behavior of jointly varying the number of features $n_{0}$ and the number of hidden units $n_{1}$ . As expected from Fig. 2(b), for fixed $\phi$ the test error has a hump near $n_{1}=m$ . Perhaps unexpectedly, for large $n_{1}$ , the test loss exhibits non-monotonic dependence on $n_{0}$ , with a spike near $n_{0}=m$ . Notice that for small $n_{1}$ , this non-monotonicity disappears. It is clear that the test error depends in a complex way on both variables, underscoring the richness of the quadratically-overparameterized regime.

Fig. 3(b) shows learning curves for fixed $\psi$ and various values of the SNR. For small enough SNR, there are visible bumps in the vicinity of $m=n_{0}$ and $m=n_{1}$ that reveal the existence of regimes in which more training data actually hurts test performance. Because $n_{0}=\Theta(n_{1})$, these two humps are separated by only a constant factor. Their presence in this figure is therefore not evidence of multi-scale behavior, though it surely reflects the complex behavior at the quadratic scale.

It is natural to wonder about the origins of this complex behavior. Can it be attributed to a particular component of the kernel $K$? We investigate this question in Fig. 3(c), which shows how the test error changes as the relative contributions of the per-layer kernels $K_{1}$ and $K_{2}$ are varied. As $\sigma_{W_{2}}$ decreases, the contribution of $K_{1}$ decreases and the kernel becomes more like $K_{2}$, and the small hump at the quadratic transition increases in size until it resembles the large spike at the linear transition (cf. Fig. 2), suggesting that $K_{2}$ is in fact responsible for the non-monotonicity in the quadratically-overparameterized regime.

Empirical Validation

Our theoretical results establish the existence of nontrivial behavior of the test loss at $p=m$ for the second-layer kernel $K_{2}$ and at $p=m^{2}$ for the full kernel $K$ . While these results are strongly suggestive of multi-scale behavior, they do not prove this behavior exists for a single kernel, nor do they guarantee it will be revealed for finite-size systems, let alone for models trained with gradient descent. Here we provide positive empirical evidence on all counts.

Fig. 4 demonstrates multi-scale phenomena, triple descent, and the linear and quadratic scaling transitions for random feature NTK regression and gradient descent for finite-dimensional systems. The simulations all show a peak near the linear parameterization transition, as well as a bump near the quadratic transition. The asymptotic theoretical predictions agree well with kernel regression in their regime of validity, which is when $n_{1}$ is near $m$ . While we found that the global minimum of the test error is often at $p=\infty$ , there are some configurations for which the optimal $p$ lies between $m$ and $m^{2}$ , as illustrated in Fig. 4(b).

Fig. 4(a) clearly shows triple descent for NTK regression and a marked difference in loss with and without centering, suggesting that this source of variance may often dominate the error for large $n_{1}$ .

Fig. 4(c) confirms the existence of triple descent for a single-layer neural network trained with gradient descent. The noticeable difference between kernel regression and the actual neural network is to be expected because the NTK can change during the course of training when the width is not significantly larger than the dataset size. Indeed, the deviation diminishes for large $n_{1}$ . In any case, the qualitative behavior is similar across all scales, providing support for the validity of our framework beyond pure kernel methods.

Conclusion

In this work, we provided a precise description of the high-dimensional asymptotic generalization performance of kernel regression with the Neural Tangent Kernel of a single-hidden-layer neural network. Our results revealed that the test error has complex non-monotonic behavior deep in the overparameterized regime, indicating that double descent does not always provide an accurate or complete picture of generalization performance. Instead, we argued that the test error may exhibit additional peaks and descents as the number of parameters varies across multiple scales, and we provided empirical evidence of this behavior for kernel ridge regression and for neural networks trained with gradient descent. We conjecture that similar multi-scale phenomena may exist for broader classes of architectures and datasets, but we leave that investigation for future work.

References

S1 Simplification of the first-layer kernel

In this section, we obtain explicit control in spectral norm of the difference between the empirical (i.e. finite-size) NTK and the version in eqn. (22) that arises through the simplification of the first-layer kernel $K_{1}$ in eqn. (12). We will use the notation $A_{i:}=(A_{i1},\ldots,A_{in})$ for the $i$ th row of a matrix $A$ , with the column $A_{:i}$ defined similarly. Recall from eqns. (8) and (9) that the empirical NTK is given by

and from eqn. (22) the simplified kernel is given by

where $M\mathrel{\mathop{:}}=\zeta\mathbf{1}\mathbf{1}^{\top}+(\eta^{\prime}-\zeta)I$ . Elementary arguments given in Sec. S1.2 show that, in operator norm, the two rightmost terms in eqn. (S6) are bounded by $\mathcal{O}(n_{0}^{3{\varepsilon}-1/2})$ . In Sec. S1.1, we bound $\Delta$ by using the fact that, conditional on $X$ , $\Delta$ is a sum of independent random matrices to apply the matrix Bernstein inequality (Tropp, 2015).
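For orientation, the matrix Bernstein inequality in its standard rectangular form states that a sum $Z=\sum_{k}S_{k}$ of independent, mean-zero $d_{1}\times d_{2}$ matrices with $\lVert S_{k}\rVert\leq L$ and variance proxy $v$ satisfies $\mathbb{P}(\lVert Z\rVert\geq t)\leq(d_{1}+d_{2})\exp\left(-\tfrac{t^{2}/2}{v+Lt/3}\right)$. The sketch below evaluates this bound and compares it with a Monte Carlo estimate. The summands are hypothetical (fixed symmetric matrices with random signs) and are not the matrices $\Delta_{k}$ of this section.

```python
import numpy as np

def matrix_bernstein_bound(t, v, L, d1, d2):
    """Right-hand side of the bound P(||Z|| >= t) <= (d1 + d2) exp(-(t^2/2) / (v + L t / 3))."""
    return (d1 + d2) * np.exp(-0.5 * t**2 / (v + L * t / 3.0))

rng = np.random.default_rng(0)
d, n, trials = 10, 100, 200

# Fixed symmetric matrices A_k, normalized so that ||A_k|| = 1.
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
A /= np.linalg.norm(A, ord=2, axis=(1, 2), keepdims=True)

# Summands S_k = eps_k A_k with independent random signs eps_k,
# so E[S_k] = 0, ||S_k|| <= L = 1 and E[S_k^2] = A_k^2.
L = 1.0
v = float(np.linalg.norm(np.einsum("kij,kjl->il", A, A), ord=2))

# Monte Carlo draws of Z = sum_k eps_k A_k and their spectral norms.
eps = rng.choice([-1.0, 1.0], size=(trials, n))
Z = np.einsum("tk,kij->tij", eps, A)
norms = np.linalg.norm(Z, ord=2, axis=(1, 2))

t = 3.0 * np.sqrt(v)
empirical_tail = float(np.mean(norms >= t))
bound = float(matrix_bernstein_bound(t, v, L, d, d))
```

The empirical tail frequency should not exceed `bound`; the inequality is typically far from tight, which is harmless here since only the scaling of $t$ with $n_{0}$ matters for the argument.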

We start with a supremum bound on $\lVert\Delta_{k}\rVert$ . For any vector $\mathbf{v}=\sum_{k}v_{k}\mathbf{e}_{k}$ , we have

by the Cauchy-Schwarz inequality. Note that by assumption on $X$ , eqn. (S7) is $\mathcal{O}\left({n_{0}^{2{\varepsilon}}m^{1/2}/n_{1}}\right)=\mathcal{O}(n_{0}^{3{\varepsilon}-1/2})$ .

which we note is the same for all $k$ . We now calculate these 2- and 4-point expectations to leading order.

Since the entries of $WX$ are multivariate Gaussian conditional on $X$ , we find

Taylor expanding in the covariance term, one can show that, for all $a$ ,

for some ${\varepsilon}_{abl}=\mathcal{O}(n_{0}^{2{\varepsilon}-1/2})$ . We find $\lVert\frac{1}{n_{1}}(X^{\top}X/n_{0})^{2}\odot M_{2}\rVert=\mathcal{O}(n_{0}^{{\varepsilon}}/n_{1})$ and

using the Cauchy-Schwarz inequality and the assumption that all dimensions are of the same order.

Thus finally applying the matrix Bernstein inequality with $t=Cn_{0}^{4{\varepsilon}-1/4}$ for some sufficiently large constant $C$ , we find for any $\delta>0$

for sufficiently large $n_{0}$ . Moreover, eqn. (S25) holds with $X$ random as it is independent of $W_{1}$ and $W_{2}$ , and our assumptions on $X$ hold for any $\delta^{\prime}>0$ for sufficiently large $n_{0}$ .

S1.2 Bounding remaining terms

where $E$ ’s diagonal entries are $\mathcal{O}(n_{0}^{2{\varepsilon}-1})$ and off-diagonal entries are $\mathcal{O}(n_{0}^{3{\varepsilon}-3/2})$ . Taking the terms one by one, we first bound

Eqn. (S28) can be demonstrated by taking the 4th power of the trace as in (El Karoui et al., 2010). This is expected, since the entries are mean zero and have variance of order $\mathcal{O}(n_{0}^{-1})$ . Proving the spectral bound is a straightforward calculation using the independence of the entries of $X$ , but we omit the details here. The final term can also be bounded in this way, yielding,

The inclusion of the matrix $R$ is necessary, due to the nonzero mean of the entries. See (El Karoui et al., 2010) for an example of this calculation.

Similarly using the assumptions on $X$ , we can bound the remaining diagonal matrix of eqn. (S6) as follows

Summing our bounds on $\Delta$ and eqns. (S27)-(S30) completes the proof of eqn. (S4).

S2 Gaussian equivalents

In this section we discuss the key arguments for existence of Gaussian equivalents and the linearizations of Sec. 4.2. As all the main elements of this argument have been established elsewhere, here we just provide the main intuitions and refer to prior work for the details.

Many of the statistics of random matrices are universal, that is, their limiting behavior as the matrix gets larger is insensitive to the detailed properties of their entries’ distributions. Considerable work has gone into demonstrating universality for an increasingly large class of random matrices and a growing number of detailed statistics. In our case, the test loss is a global measurement of several random matrices. This perspective gives some intuition for why we are able to replace many of the intractable terms in the expressions we analyze with tractable terms, which only need to match quite superficial properties of the distributions to ensure the limiting test loss is the same.
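Universality of a global statistic can be checked numerically in a simple case. The sketch below, an illustration with arbitrarily chosen sizes rather than a computation from this paper, compares the normalized trace of a resolvent for two sample covariance matrices whose entries share mean and variance but follow different laws (Gaussian and random sign).

```python
import numpy as np

def resolvent_trace(B, gamma):
    """Normalized trace of (B B^T / m + gamma I)^{-1} for an n x m matrix B."""
    n, m = B.shape
    eigs = np.linalg.eigvalsh(B @ B.T / m)
    return float(np.mean(1.0 / (eigs + gamma)))

rng = np.random.default_rng(0)
n, m, gamma = 300, 600, 0.1

B_gauss = rng.standard_normal((n, m))              # N(0, 1) entries
B_sign = rng.choice([-1.0, 1.0], size=(n, m))      # +/-1 entries: same mean and variance

t_gauss = resolvent_trace(B_gauss, gamma)
t_sign = resolvent_trace(B_sign, gamma)
```

The two traces should agree up to finite-size fluctuations that shrink as `n` and `m` grow proportionally, even though the entry distributions differ in all higher moments.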

In Secs. S3 and S4, we use this replacement strategy in two distinct situations. The first is for terms of the form

for deterministic $A$ and random $B$ . Under assumptions on $A$ and $B$ , standard concentration inequalities can be used to describe the limiting behavior of sums like eqn. (S31). In our setting, one finds that this behavior only depends on the low-order moments of $B$ . By matching these low-order moments with Gaussian random variables, we can replace $B$ with a Gaussian random matrix with the same limiting behavior. Note that $A$ is often not actually deterministic; we are simply conditioning on it and only considering the randomness in $B$ . The approach is suitable for determining the average behavior of eqn. (S31) when we have control over the (weak) correlations in the entries of $A$ and $B$ . Linearizing the matrices $A$ and $B$ in this setting is just a convenient bookkeeping device for performing these computations.

When one of the matrices in eqn. (S31) is inverted, the situation is more complex, and indeed this is the case for the kernel matrix $K$ in expressions for the training and test loss. To apply the linear pencil algorithm, we have to replace the NTK in all expressions with a linearized version (see eqn. (22)), which is a rational expression of the i.i.d. Gaussian matrices, $X$ , $W_{1}$ , etc. In Sec. S1, we bounded the difference between the first-layer kernel and its linearization, thus removing the Hadamard product structure. It remains to linearize the second-layer kernel, i.e. linearize $F$ . This has been discussed in previous works, see (Mei & Montanari, 2019; Adlam et al., 2019; Péché et al., 2019; Benigni & Péché, 2019).

It should be expected that a linearized version of $F$ will lead to the same asymptotic statistics due to some very general results on the limiting behavior of expressions of the form,

Finding Gaussian equivalents for $A$ and $B$ in expressions like eqns. (S31) and (S32) is relatively simple in our case. We encounter terms for which the matrix $B$ depends on some other random matrix $C$ through a coordinate-wise nonlinear function $f(C)$ . For such cases, Taylor expanding the function $f$ is the key tool for finding these equivalents (see e.g. (Adlam et al., 2019) for more details on this type of approach).
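A Gaussian equivalent for the random feature matrix can be illustrated numerically. The sketch below assumes unit-variance Gaussian weights and data, a centered activation ($\tanh$), and the constants $\eta=\mathbb{E}[\sigma(g)^{2}]$ and $\zeta=(\mathbb{E}[\sigma^{\prime}(g)])^{2}$ for $g\sim\mathcal{N}(0,1)$; these normalizations, the matrix sizes, and the name `Theta` for the independent Gaussian matrix are assumptions made for the example. It compares one global statistic of $F=\sigma(WX/\sqrt{n_{0}})$ with that of the linearized surrogate $\sqrt{\zeta}\,WX/\sqrt{n_{0}}+\sqrt{\eta-\zeta}\,\Theta$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite quadrature for expectations over g ~ N(0, 1).
nodes, weights = hermegauss(60)
weights = weights / weights.sum()

def sigma(z):
    return np.tanh(z)

def dsigma(z):
    return 1.0 - np.tanh(z) ** 2

eta = float(np.sum(weights * sigma(nodes) ** 2))      # E[sigma(g)^2]
zeta = float(np.sum(weights * dsigma(nodes))) ** 2    # (E[sigma'(g)])^2

rng = np.random.default_rng(0)
n0, n1, m, gamma = 300, 500, 400, 0.5
X = rng.standard_normal((n0, m))
W = rng.standard_normal((n1, n0))
Theta = rng.standard_normal((n1, m))                  # independent Gaussian noise matrix

F = sigma(W @ X / np.sqrt(n0))                        # nonlinear random features
F_lin = np.sqrt(zeta) * (W @ X) / np.sqrt(n0) + np.sqrt(eta - zeta) * Theta

def resolvent_trace(F, gamma):
    """Normalized trace of (F^T F / n1 + gamma I)^{-1}."""
    eigs = np.linalg.eigvalsh(F.T @ F / F.shape[0])
    return float(np.mean(1.0 / (eigs + gamma)))

t_nonlinear = resolvent_trace(F, gamma)
t_linearized = resolvent_trace(F_lin, gamma)
```

The two traces are expected to agree up to finite-size corrections. The extra term with coefficient $\sqrt{\eta-\zeta}$ accounts for the part of $\sigma$ that is orthogonal to its linear component; dropping it would change the spectrum.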

S3 Exact asymptotics for the training loss

The model’s predictions on the training set, $\hat{y}(X)$ , take a simple form,

The expected training loss can be written as,

where $\nu=0$ with centering and $\nu=1$ without it and,

Note we can suppress the terms linear in $N_{0}$ since they vanish in expectation owing to the linear dependence on the mean-zero random variable $\omega$ . Here $K=K(X,X)+\gamma I_{m}$ is the linearized NTK and is given by,

This substitution can be justified using the result of Sec. S1:

Note that taking the expectation over $W_{2}$ in eqn. (S39) and eqn. (S40) yields

which can be used to calculate the expectation over $\omega$ and $\Omega$ to leading order (i.e. with remainder terms $o(1)$ ) using the approach of eqn. (S31). Concretely,

Putting these pieces together, we can write for $\tau_{1}=\tau_{1}(\gamma)$ and $\tau_{2}=\tau_{2}(\gamma)$ ,

Self-consistent equations for $\tau_{1}$ and $\tau_{2}$ can be computed using the resolvent method, as was done in (Adlam et al., 2019) for the case of $\sigma_{W_{2}}=0$ . In order to pave the way for the analysis of the test error, we instead demonstrate how to compute these traces using operator-valued free probability.

In the remainder of this section, and in Sec. S4, we assume at times that $\sigma$ is non-linear (so that $\eta^{\prime}>\zeta$ and $\eta>\zeta$ ) and/or $\gamma>0$ in order that certain denominator factors are non-zero. The linear and/or ridgeless cases can be obtained by limits of our general results, or through special cases of the pertinent intermediate formulas.
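The statement that the ridgeless case arises as a limit can be seen concretely for a generic regression problem. The following sketch, which uses an arbitrary Gaussian feature matrix `Phi` rather than the NTK features of this paper, checks that the ridge solution approaches the minimum-norm interpolating solution as $\gamma\to 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p = 20, 60                                   # m samples, p > m features
Phi = rng.standard_normal((m, p)) / np.sqrt(p)  # hypothetical feature matrix
y = rng.standard_normal(m)

def ridge_solution(gamma):
    """Kernel form of ridge regression: Phi^T (Phi Phi^T + gamma I)^{-1} y."""
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + gamma * np.eye(m), y)

# Minimum-norm interpolant, i.e. the ridgeless solution.
min_norm = np.linalg.pinv(Phi) @ y

# Distance between the ridge solution and the ridgeless one for shrinking gamma.
gaps = [float(np.linalg.norm(ridge_solution(g) - min_norm))
        for g in (1e-1, 1e-3, 1e-6)]
```

The entries of `gaps` shrink with $\gamma$, consistent with treating $\gamma>0$ as a regularization of the denominators that can be removed at the end of the calculation.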

S3.2 Linear pencils

To begin, we construct linear pencils for $\tau_{1}$ and $\tau_{2}$ . Using the linearization eqn. (13), a straightforward block-matrix inversion confirms that

The matrix $Q_{T}$ is not self-adjoint, but a self-adjoint representation can be obtained from it by doubling the dimensionality. In particular, letting

Observe that $\bar{Q}_{T}$ is a self-adjoint matrix whose blocks are either constants or proportional to one of $\{X,X^{\top},W_{1},W_{1}^{\top},\Theta_{F},\Theta_{F}^{\top}\}$ ; let us denote the constant terms as $Z$ . As such, we can directly utilize the results of (Far et al., 2006; Mingo & Speicher, 2017) to compute the necessary traces.
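The doubling construction can be illustrated on a small example. The sketch below assumes the common off-diagonal embedding, in which a non-self-adjoint matrix $Q$ is placed in the lower-left block and its transpose in the upper-right block; the exact block layout used for $\bar{Q}_{T}$ may differ, and the matrix `Q` here is a generic stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Generic non-symmetric matrix; the diagonal shift keeps it safely invertible.
Q = rng.standard_normal((n, n)) + 4.0 * np.eye(n)

zeros = np.zeros((n, n))
Q_bar = np.block([[zeros, Q.T],
                  [Q, zeros]])                  # self-adjoint by construction

Q_bar_inv = np.linalg.inv(Q_bar)
upper_right = Q_bar_inv[:n, n:]                 # this block equals Q^{-1}
```

Because the inverse of `Q_bar` contains $Q^{-1}$ as a block, any block trace of $Q^{-1}$ can be read off from the self-adjoint matrix, at the cost of doubling the dimension.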

S3.3 Operator-valued Stieltjes transform

where $\alpha_{k}$ is the dimensionality of the $k$ th block and $\sigma(i,k;l,k)$ denotes the covariance between the entries of the $ij$ block of $\bar{Q}$ and the entries of the $kl$ block of $\bar{Q}$ . Eqn. (S53) may admit many solutions, but there is a unique solution such that $\text{Im}G\succ 0$ for $\text{Im}Z\succ 0$ .
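The role of the positivity condition in selecting a solution is easiest to see in the scalar analogue of such a self-consistent equation. The sketch below treats a Wigner matrix $H$ with the convention $m(z)=\frac{1}{n}\mathrm{tr}(H-z)^{-1}$, for which $m$ solves $m^{2}+zm+1=0$. This is a textbook example chosen for illustration, not the block system of this section, and the sign conventions are assumptions of the example.

```python
import numpy as np

def semicircle_stieltjes(z, iters=500):
    """Iterate m <- -1 / (z + m), the scalar self-consistent equation of a Wigner matrix."""
    m = 1j                                      # start in the upper half-plane
    for _ in range(iters):
        m = -1.0 / (z + m)                      # the map preserves Im m > 0 when Im z > 0
    return m

z = 0.5 + 0.5j
m_fixed_point = semicircle_stieltjes(z)

# Empirical counterpart: normalized trace of the resolvent of a Wigner matrix.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)                  # symmetric, off-diagonal variance 1/n
m_empirical = np.mean(1.0 / (np.linalg.eigvalsh(H) - z))
```

The quadratic equation has two roots, but only one has positive imaginary part for $\text{Im}\,z>0$. The iteration stays in the upper half-plane and converges to that root, which is the one matching the empirical resolvent trace.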

The constants $Z$ , the entries of $\sigma$ , and therefore the equations (S54) are manifest by inspection of the block matrix representation for $\bar{Q}_{T}$ . Although the matrix representation of the equations is too large to reproduce here, we can nevertheless extract the equations satisfied by each entry of $G$ .

The equations satisfied by the operator-valued Stieltjes transform $G$ of $\bar{Q}_{T}$ induce the following structure on $G$ ,

and the independent entry-wise component functions $g_{i}$ , $\tau_{1}$ and $\tau_{2}$ satisfy the following system of polynomial equations,

It is straightforward algebra to eliminate $g_{3},g_{4},g_{5}$ and $g_{6}$ from the above equations. A simple set of equations for $\tau_{1}$ and $\tau_{2}$ follows,

It will prove useful to obtain expressions for $\tau_{1}^{\prime}(\gamma)$ and $\tau_{2}^{\prime}(\gamma)$ . By differentiating eqns. (S67) and (S68) with respect to $\gamma$ , we find

where we have introduced some auxiliary variables to ease the presentation,

S4 Exact asymptotics for the test loss

As described in Sec. 4.3, the test loss can be written as,

As in Sec. S3, we suppress the terms linear in $\omega$ as they vanish in expectation. The Neural Tangent Kernels ${K=K(X,X)+\gamma I}$ and $K_{\mathbf{x}}=K(X,\mathbf{x})$ are given by,

where the substitution for the linearized NTK is justified as in Sec. S3 using the spectral norm bound of Sec. S1.

Using the cyclicity and linearity of the trace, the expectation over $\mathbf{x}$ requires the computation of

As described in Sec. 4.2, without loss of generality we can consider the case of a linear teacher, so that $\eta_{\textsc{t}}=\zeta_{\textsc{t}}=1$ and eqns. (15) and (16) become

Using these substitutions, the expectations over $\mathbf{x}$ are now trivial and we readily find,

One may interpret the substitutions in eqn. (S78) as a tool to calculate the expectations above to leading order as it leads to terms like eqn. (S31). Next we recall the substitution (S44),

As above, we consider the leading order behavior with respect to the random variables $\omega$ , $\Omega$ , and $W_{2}$ using eqn. (S31) to find

where $\nu=0$ with centering and $\nu=1$ without it,

S4.2 Linear pencils

Repeated application of the Schur complement formula for block matrix inversion establishes the following representations for $E_{21},E_{22},E_{31},E_{32},E_{33}.$
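The mechanism behind such representations can be seen in the simplest case. The sketch below is a toy example with a single random matrix $X$ and is not one of the pencils for the error terms above. It uses the Schur complement formula to express $(\gamma I+X^{\top}X)^{-1}$ as a block of the inverse of a matrix whose blocks are constants or linear in $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, m, gamma = 5, 8, 0.3
X = rng.standard_normal((n0, m)) / np.sqrt(n0)

# Linear pencil: every block is either a constant or linear in X.
Q = np.block([[gamma * np.eye(m), X.T],
              [-X, np.eye(n0)]])

# Schur complement of the lower-right block:
# gamma I - X^T (I)^{-1} (-X) = gamma I + X^T X.
top_left = np.linalg.inv(Q)[:m, :m]
direct = np.linalg.inv(gamma * np.eye(m) + X.T @ X)
```

The arrays `top_left` and `direct` coincide, so a rational expression in $X$ has been traded for the inverse of a larger matrix that depends on $X$ only linearly. Products of several such factors are handled by applying the same formula repeatedly, which is why the pencils grow in size.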

A linear pencil for $E_{21}$ follows from the representation,

A linear pencil for $E_{22}$ follows from the representation,

A linear pencil for $E_{31}$ follows from the representation,

and, for $\beta=\left(n_{0}(\zeta-\eta)-\zeta n_{1}\sigma_{W_{2}}^{2}\right)$ ,

A linear pencil for $E_{32}$ follows from the representation,

and, for $\beta=\left(n_{0}(\zeta-\eta)-\zeta n_{1}\sigma_{W_{2}}^{2}\right)$ ,

A linear pencil for $E_{33}$ follows from the representation,

and, for $\beta=\left(n_{0}(\zeta-\eta)-\zeta n_{1}\sigma_{W_{2}}^{2}\right)$ ,

S4.3 Operator-valued Stieltjes transform

Even though the individual error terms $E_{21},E_{22},E_{31},E_{32},E_{33}$ can be written as the trace of self-adjoint matrices, the individual $Q$ matrices are not themselves self-adjoint. However, by enlarging the dimensionality by a factor of two, equivalent self-adjoint representations can easily be constructed. To do so, we simply utilize the identity,

Observe that $\bar{Q}_{21},\bar{Q}_{22},\bar{Q}_{31},\bar{Q}_{32}$ and $\bar{Q}_{33}$ are all self-adjoint block matrices whose blocks are either constants or proportional to one of $\{X,X^{\top},W_{1},W_{1}^{\top},\Theta_{F},\Theta_{F}^{\top}\}$ ; let us denote the constant terms as $Z$ . As such, we can directly utilize the results of (Far et al., 2006; Mingo & Speicher, 2017) to compute the error terms in question.

where $\alpha_{k}$ is the dimensionality of the $k$ th block and $\sigma(i,k;l,k)$ denotes the covariance between the entries of the $ij$ block of $\bar{Q}$ and the entries of the $kl$ block of $\bar{Q}$ . Eqn. (S127) may admit many solutions, but there is a unique solution such that $\text{Im}G\succ 0$ for $\text{Im}Z\succ 0$ .

The constants $Z$ , the entries of $\sigma$ , and therefore the equations (S128) are manifest by inspection of the block matrix representations for $Q$ . Although the matrix representations are too large to reproduce here, we can nevertheless extract the equations satisfied by each entry of $G$ , which we present in the subsequent sections.

The equations satisfied by the operator-valued Stieltjes transform $G$ of $\bar{Q}_{21}$ induce the following structure on $G$ ,

and the independent entry-wise component functions $g_{i}$ combine to produce the error $E_{21}$ through the relation,