The generalization error of random features regression: Precise asymptotics and double descent curve

Song Mei, Andrea Montanari

Introduction

Statistical lore recommends not to use models that have too many parameters since this will lead to ‘overfitting’ and poor generalization. Indeed, a plot of the test error as a function of the model complexity often reveals a U-shaped curve. The test error first decreases because the model is less and less biased, but then increases because of a variance explosion [HTF09]. In particular, the interpolation threshold, i.e., the threshold in model complexity above which the training error vanishes (the model completely interpolates the data), corresponds to a large test error. It seems wise to keep the model complexity well below this threshold in order to obtain a small generalization error.

These classical prescriptions are in stark contrast with the current practice in deep learning. The number of parameters of modern neural networks can be much larger than the number of training samples, and the resulting models are often so complex that they can perfectly interpolate the data. Even more surprisingly, they can interpolate the data when the actual labels are replaced by pure noise [ZBH+16]. Despite such a large complexity, these models have small test error and can outperform others trained in the classical underparametrized regime.

This behavior has been rationalized in terms of a so-called ‘double-descent’ curve [BMM18, BHMM19]. A plot of the test error as a function of the model complexity follows the traditional U-shaped curve until the interpolation threshold. However, after a peak at the interpolation threshold, the test error decreases, and attains a global minimum in the overparametrized regime. In fact, the minimum error often appears to be ‘at infinite complexity’: the more overparametrized is the model, the smaller is the error. It is conjectured that the good generalization behavior in this highly overparametrized regime is due to the implicit regularization induced by gradient descent learning: among all interpolating models, gradient descent selects the simplest one, in a suitable sense. An example of double descent curve is plotted in Fig. 1. The main contribution of this paper is to describe a natural, analytically tractable model leading to this generalization curve, and to derive precise formulae for the same curve, in a suitable asymptotic regime.

Despite its simplicity, this random covariates model captures several features of the double descent scenario. In particular, the asymptotic generalization curve is U-shaped for $\gamma<1$ , diverging at the interpolation threshold $\gamma=1$ , and descends again exceeding that threshold. The divergence at $\gamma=1$ is explained by an explosion in the variance, which is in turn related to a divergence of the condition number of the random matrix ${\bm{X}}$ . At the same time, this simple model misses some interesting features that are observed in more complex settings: $(i)$ In the Gaussian covariates model, the global minimum of the test error is achieved in the underparametrized regime $\gamma<1$ , unless ad-hoc misspecification structure is assumed; $(ii)$ The number of parameters is tied to the covariates dimension $d$ and hence the effects of overparametrization are not isolated from the effects of the ambient dimensions; $(iii)$ Ridge regression, with some regularization $\lambda>0$ , is always found to outperform the ridgeless limit $\lambda\to 0$ . Moreover, this linear model is not directly connected to actual neural networks, which are highly nonlinear in the covariates ${\bm{x}}_{i}$ .

In this paper, we study the random features model of Rahimi and Recht [RR08]. The random features model can be viewed either as a randomized approximation to kernel ridge regression, or as a two-layers neural networks with random first layer wights. We compute the precise asymptotics of the test error and show that it reproduces all the qualitative features of the double-descent scenario.

We learn the coefficients ${\bm{a}}=(a_{i})_{i\leq N}$ by performing ridge regression

The choice of ridge penalty is motivated by the connection to kernel ridge regression, of which this method can be regarded as a finite-rank approximation. Further, the ridge regularization path is naturally connected to the path of gradient flow with respect to the mean square error $\sum_{i\leq n}(y_{i}-f({\bm{x}}_{i};{\bm{a}},{\bm{\Theta}}))^{2}$ , starting at ${\bm{a}}=0$ . In particular, gradient flow converges to the ridgeless limit ( $\lambda\to 0$ ) of $\hat{\bm{a}}(\lambda)$ , and there is a correspondence between positive $\lambda$ , and early stopping in gradient descent [YRC07].

$N,n,d$ lie in a proportional asymptotics regime. Namely, $N,n,d\to\infty$ with $N/d\to\psi_{1}$ , $n/d\to\psi_{2}$ for some $\psi_{1},\psi_{2}\in(0,\infty)$ .

where expectation is taken with respect to $G\sim{\mathsf{N}}(0,1)$ . Assume $\mu_{0},\mu_{1},\mu_{\star}\neq 0$ .

Then, for linear $f_{d}$ in the setting described above, for any $\lambda>0$ , the prediction risk convergences in probability

where $\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\overline{\lambda})$ is explicitly given in Definition 1.

Section 5.1 also contains an analogous statement for the nonlinear model.

Theorem 1 and its generalizations stated below require $\lambda>0$ fixed as $N,n,d\to\infty$ . We can then consider the ridgeless limit by taking $\lambda\to 0$ . Let us stress that this does not necessarily yield the prediction risk of the min-norm least square estimator that is also given by the limit $\hat{\bm{a}}(0+)\equiv\lim_{\lambda\to 0}\hat{\bm{a}}(\lambda)$ at $N,n,d$ fixed. Denoting by ${\bm{Z}}=\sigma({\bm{X}}{\bm{\Theta}}^{{\mathsf{T}}}/\sqrt{d})/\sqrt{d}$ the design matrix, the latter is given by $\hat{\bm{a}}(0+)=({\bm{Z}}^{{\mathsf{T}}}{\bm{Z}})^{\dagger}{\bm{Z}}^{{\mathsf{T}}}{\bm{y}}/\sqrt{d}$ . While we conjecture that indeed this is the same as taking $\lambda\to 0$ in the asymptotic expression of Theorem 1, establishing this rigorously would require proving that the limits $\lambda\to 0$ and $d\to\infty$ can be exchanged. We leave this to future work.

Our work is the first one to provide a complete treatment of the nonparametric model in the proportional asymptotics and to establish those phenomena. From a mathematical viewpoint, the calculation of the test error can be reduced to studying a block-structured kernel random matrix, with a more intricate structure than the one of [HMRT19]. The reduction itself is novel in the present context, and goes through the log determinant of this random matrix, while the variance computation of [HMRT19] is directly connected to the resolvent.

The proof of Theorem 1 builds on ideas from random matrix theory. A careful look at these arguments unveils an interesting phenomenon. While the random features $\{\sigma(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle/\sqrt{d})\}_{i\leq d}$ are highly non-Gaussian, it is possible to construct a Gaussian covariates model with the same asymptotic prediction error as for the random features model. Apart from being mathematically interesting, this finding provides additional intuition for the behavior of random features models, and opens the way to some interesting future directions. In particular, [MRSY19] uses this Gaussian covariates proxy to analyze maximum margin classification using random features.

The rest of the paper is organized as follows:

In Section 2 we summarize the main insights that can be extracted from the asymptotic theory, and illustrate them through plots.

Section 3 provides a succinct overview of related work.

Section 4 introduces the notations that are used in this paper.

Section 5 contains formal statements of our main results, which is the asymptotics of prediction error as in Theorem 2. This section also presents some special cases of the asymptotic formula.

Section 6 contains the statements of the asymptotics of the training error as in Theorem 6.

Section 7 presents an interesting phenomenon which is that the random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates.

In Section 8 we present the proof of main results. The main results will use several propositions that are proved in the following sections and in the appendices.

Results and insights: An informal overview

Before explaining in detail our technical results —which we will do in Section 5— it is useful to pause and describe some consequences of the exact asymptotic formulae that we prove. Our focus here will be on insights that have a chance to hold more generally, beyond the specific setting studied here.

Bias term also exhibits a singularity at the interpolation threshold. A prominent feature of the double descent curve is the peak in test error at the interpolation threshold which, in the present case, is located at $\psi_{1}=\psi_{2}$ . In the linear regression model of [AS17, HMRT19, BHX19], this phenomenon is entirely explained by a peak in the variance of the estimator (that diverges in the ridgeless limit $\lambda\to 0$ ), while its bias is monotone increasing across to this threshold.

In contrast, in the random features model studied here, both variance and bias have a peak at the interpolation threshold, diverging there when $\lambda\to 0$ . This is apparent from Figure 1 which was obtained for $\tau^{2}=0$ , and therefore in a setting in which the error is entirely due to bias. The fact that the double descent scenario persists in the noiseless limit is particularly important, especially in view of the fact that many machine learning tasks are usually considered nearly noiseless.

Optimal prediction error is achieved in the highly overparametrized regime. Figure 2 (left) reports the predicted test error in the ridgeless limit $\lambda\to 0$ (for a case with non-vanishing noise, $\tau^{2}>0$ ) as a function of $\psi_{1}=N/d$ , for several values of $\psi_{2}=n/d$ . Figure 3 plots the predicted test error as a function of $\psi_{1}/\psi_{2}=N/n$ , for fixed $\psi_{2}$ , several values of $\lambda>0$ , and two values of the SNR. We repeatedly observe that: $(i)$ For a fixed $\lambda$ , the minimum of test error (over $\psi_{1}$ ) is in the highly overparametrized regime $\psi_{1}\to\infty$ ; $(ii)$ The global minimum (over $\lambda$ and $\psi_{1}$ ) of test error is achieved at a value of $\lambda$ that depends on the SNR, but always at $\psi_{1}\to\infty$ ; $(iii)$ In the ridgeless limit $\lambda\to 0$ , the generalization curve is monotonically decreasing in $\psi_{1}$ when $\psi_{1}>\psi_{2}$ .

To the best of our knowledge, this is the first natural and analytically tractable model which satisfies the following requirements: $(1)$ Large overparametrization is necessary to achieve optimal prediction; $(2)$ No special misspecification structure needs to be postulated.

Optimal regularization eliminated the double-descent. Figure 3 reports the asymptotic prediction for the test error as a function of the overparametrization ratio $N/n$ for various values of the regularization parameter $\lambda$ . The peak at the interpolation threshold $N=n$ is apparent, but it becomes less prominent as the regularization increases. In particular, if we consider the optimal regularization (the lower envelope of these curves), the test error becomes monotone decreasing in the number of parameters: regularization compensates overparametrization.

Non-vanishing regularization can hurt (at high SNR). Figure 4 plots the predicted test error as a function of $\lambda$ , for several values of $\psi_{1}$ , with $\psi_{2}$ fixed. The lower envelope of these curves is given by the curve at $\psi_{1}\to\infty$ , confirming that the optimal error is achieved in the highly overparametrized regime. However the dependence of this lower envelope on $\lambda$ changes qualitatively, depending on the SNR. For small SNR, the global minimum is achieved as some $\lambda>0$ : regularization helps. However, for a large SNR the minimum error is achieved as $\lambda\to 0$ . The optimal regularization is vanishingly small.

These two noise regime are separated by a phase transition at a critical SNR which we denote by $\rho_{\star}$ . A characterization of this critical value is given in Section 5.2.2.

Self-induced regularization. What is the mechanism underlying the optimality of the ridgeless limit $\lambda\to 0$ ? An intuitive explanation can be obtained by considering the (random) kernel associated to the ridge regression (2), namely

Related literature

A recent stream of papers studied the generalization behavior of machine learning models in the interpolation regime. An incomplete list of references includes [BMM18, BRT19, LR18, BHMM19, RZ18]. The starting point of this line of work were the experimental results in [ZBH+16, BMM18], which showed that deep neural networks as well as kernel methods can generalize even if the prediction function interpolates all the data. It was proved that several machine learning models including kernel regression [BRT19] and kernel ridgeless regression [LR18] can generalize under certain conditions.

The double descent phenomenon, which is our focus in this paper, was first discussed in general terms in [BHMM19]. The same phenomenon was also observed in [AS17, GJS+19]. The paper [KLS18] observes that the optimal amount of ridge regularization is sometimes vanishing, and provides an explanation in terms of noisy features. Analytical predictions confirming this scenario were obtained, within the linear regression model, in two concurrent papers [HMRT19, BHX19]. In particular, [HMRT19] derives the precise high-dimensional asymptotics of the prediction error, for a general model with correlated covariates. On the other hand, [BHX19] gives exact formula for any finite dimension, for a model with i.i.d. Gaussian covariates. The same papers also compute the double descent curve within other models, including over-specified linear model [HMRT19], and a Fourier series model [BHX19].

As mentioned in the introduction, [HMRT19, Section 8] also calculates the variance term of the prediction error in the random features model in the ridgeless limit $\lambda\to 0$ . Both the simple linear regression models of [HMRT19, BHX19], and the variance calculation of [HMRT19, Section 8] capture the peak of the test error at the interpolation threshold. However, these calculations do not elucidate several crucial statistical phenomena, which are instead the main contribution of our work (see Section 2): optimality of large overparametrization; optimality of interpolators at high SNR ( $\lambda\to 0$ limit); the role of self-induced regularization; disappearance of the double descent at optimal overparametrization.

Rate-optimal bounds on the generalization error of overparametrized linear models were recently derived in [BLLT20] (see also [MVS19] for a different perspective).

2 Random features and kernels

The random features model has been studied in considerable depth since the original work in [RR08]. A classical viewpoint suggests that ${\mathcal{F}}_{{\rm RF}}({\bm{\Theta}})$ should be regarded as random approximation of the reproducing kernel Hilbert space ${\mathcal{F}}_{H}$ defined by the kernel

Indeed ${\mathcal{F}}_{{\rm RF}}({\bm{\Theta}})$ is an RKHS defined by the finite-rank approximation of this kernel defined in Eq. (6). The paper [RR08] showed the pointwise convergence of the empirical kernel $H_{N}$ to $H$ . Subsequent work [Bac17b] showed the convergence of the empirical kernel matrix to the population kernel in terms of operator norm and derived bound on the approximation error (see also [Bac13, AM15, RR17] for related work).

The setting in the present paper is quite different, since we take the limit of a large number neurons $N\to\infty$ , together with large dimension $d\to\infty$ . Our focus on this high-dimensional regime is partially motivated by [RZ18], which emphasizes that optimality of interpolators is somewhat un-natural in low dimension.

It is well-known that approximation using two-layers network suffers from the curse of dimensionality, in particular when first-layer weights are not trained [DHM89, Bac17a, VW18, GMMM19]. The recent paper [GMMM19] studies random features regression in a setting similar to ours, by considering two different regimes: $(1)$ the population limit $n=\infty$ , with $N$ scaling as a polynomial of $d$ ; $(2)$ the wide limit $N=\infty$ , with $n$ scaling as a polynomial of $d$ . In particular, [GMMM19] proves that, if $d^{k+\delta}\leq N\leq d^{k+1-\delta}$ and $n=\infty$ or $d^{k+\delta}\leq n\leq d^{k+1-\delta}$ and $N=\infty$ then a random features model can only fit the projection of the true function $f_{d}$ onto degree- $k$ polynomials.

3 Technical contribution

The asymptotic singular values distribution of ${\bm{Z}}$ is not sufficient to compute the asymptotic prediction error, which also depends on the singular vectors of ${\bm{Z}}$ . The paper [HMRT19] addresses this challenge for what concerns the variance term of the error, and only in the limit $\lambda\to 0$ . Notice that the variance term is given (up to constants) by ${\rm Tr}(({\bm{Z}}^{{\mathsf{T}}}{\bm{Z}})^{\dagger}{\bm{\Sigma}})$ . It is quite straightforward to express this quantity in terms of the Stieltjes transform of a certain block random matrix, and [HMRT19] use the leave-one-out method to characterize the asymptotics of this Stieltjes transform.

Unfortunately, the approach of [HMRT19] cannot be pushed to compute the full test error (i.e. both the bias and variance terms): the latter cannot be expressed in terms of the Stieltjes transform of the same matrix. A key observation of the present paper is that the full prediction error can be expressed in terms of derivatives of the log-determinant of a different block-structured random matrix. In order to compute the asymptotics of this log-determinant, we use leave-one-out arguments (e.g. [BS10, Chapter 3.3]) to derive fixed point equations for the Stieltjes transform of this random matrix, and then integrate this Stieltjes transform.

One further difference from [HMRT19] is that we consider the full nonparametric model $y_{i}=f_{d}({\bm{x}}_{i})+\varepsilon_{i}$ while for the calculation of [HMRT19] does not model the target function. As mentioned above, our setting is similar to the one of [GMMM19]. However, the main technical content of [GMMM19] is to prove that, under polynomial scalings of $n$ and $d$ (at $N=\infty$ ) or $N$ and $d$ (at $n=\infty$ ), the kernel matrix is near isometric. In contrast, here we study a regime in which it is not true that the same matrix is a near isometry, and we characterize its spectral distribution (alongside those properties of the eigenvectors that determine the test error).

Notations

Main results

We begin by stating our assumptions and notations for the activation function $\sigma$ . It is straightforward to check that these are satisfied by all commonly-used activations, including ReLU and sigmoid functions.

where expectation is with respect to $G\sim{\mathsf{N}}(0,1)$ . Assuming $0<\mu_{0}^{2},\mu_{1}^{2},\mu_{\star}^{2}<\infty$ , define ${\zeta}$ by

We will consider sequences of parameters $(N,n,d)$ that diverge proportionally to each other. When necessary, we can think such sequences to be indexed by $d$ , with $N=N(d)$ , $n=n(d)$ functions of $d$ .

Defining $\psi_{1,d}=N/d$ and $\psi_{2,d}=n/d$ , we assume that the following limits exist in $(0,\infty)$ :

The last assumption covers, as a special case, deterministic linear functions $f_{d}({\bm{x}})=\beta_{d,0}+\langle{\bm{\beta}}_{d,1},{\bm{x}}\rangle$ , but also a large class of random non-linear functions. As an example, let ${\bm{G}}=(G_{ij})_{i,j\leq d}$ , where $(G_{ij})_{i,j\leq d}\sim_{iid}{\mathsf{N}}(0,1)$ , and consider the random quadratic function

Higher order polynomials can be constructed analogously (or using the expansion of $f_{d}$ in spherical harmonics).

We also emphasize that that the nonlinear part $f_{d}^{\mbox{\tiny\rm NL}}({\bm{x}}_{2})$ , although being random, is the same for all samples, and hence should not be confused with additive noise $\varepsilon$ .

We finally introduce the formula for the asymptotic prediction error, denoted by $\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\lambda)$ in Theorem 1.

$(iii)$ $(\nu_{1}(\xi),\nu_{2}(\xi))$ is the unique solution of these equations with $|\nu_{1}(\xi)|\leq\psi_{1}/\Im(\xi)$ , $|\nu_{2}(\xi)|\leq\psi_{2}/\Im(\xi)$ for $\Im(\xi)>C$ , with $C$ a sufficiently large constant.

The formula for the asymptotic risk can be easily evaluated numerically. In order to gain further insight, it can be simplified in some interesting special cases, as shown in Section 5.2.

We are now in position to state our main theorem, which generalizes Theorem 1 to the case in which $f_{d}$ has a nonlinear component $f_{d}^{\mbox{\tiny\rm NL}}$ .

Then for any value of the regularization parameter $\lambda>0$ , the asymptotic prediction error of random features ridge regression satisfies

If the regression function $f_{d}({\bm{x}})$ is linear (i.e., $f^{\mbox{\tiny\rm NL}}_{d}({\bm{x}})=0$ ), we recover Theorem 1, where $\mathscrsfs{R}$ is defined as per Eq. (20). Numerical experiments suggest that Eq. (21) holds for any deterministic nonlinear functions $f_{d}$ as well, and that the convergence in Eq. (21) is uniform over $\lambda$ in compacts. We defer the study of these stronger properties to future work.

Note that the formula for a nonlinear truth, cf. Eq. (21), is almost identical to the one for a linear truth in Eq. (5). In fact, the only difference is that the the prediction error increases by a term $F_{\star}^{2}$ , and the noise level $\tau^{2}$ is replaced by $\tau^{2}+F_{\star}^{2}$ .

Figure 5 illustrates the last remark. We report the simulated and predicted test error as a function of $\psi_{1}/\psi_{2}=N/n$ , for three different choices of the function $f_{d}$ and noise level $\tau^{2}$ . In all the settings, the total power of nonlinearity and noise is $F_{\star}^{2}+\tau^{2}=0.5$ , while the power of the linear component is $F_{1}^{2}=1$ . The test errors in these three settings appear to be very close, as predicted by our theory.

The terms $\mathscrsfs{B}$ and $\mathscrsfs{V}$ in Eq. (21) correspond to the the limits of the bias and variance of the estimated function $f({\bm{x}};\hat{\bm{a}}(\lambda),{\bm{\Theta}})$ , when the ground truth function $f_{d}$ is linear. That is, for $f_{d}$ to be a linear function, we have

2 Simplifying the asymptotic risk in special cases

In order to gain further insight into the formula for the asymptotic risk $\mathscrsfs{R}(\rho,{\zeta},\psi_{1},\psi_{2},\overline{\lambda})$ , we consider here three special cases that are particularly interesting:

The highly overparametrized regime $\psi_{1}\to\infty$ (recall that $\psi_{1}=\lim_{d\to\infty}N/d$ ).

The large sample limit $\psi_{2}\to\infty$ (recall that $\psi_{2}=\lim_{d\to\infty}n/d$ ).

Let us emphasize that these limits are taken after the limit $N,n,d\to\infty$ with $N/d\to\infty$ and $n/d\to\infty$ . Hence, the correct interpretation of the highly overparametrized regime is not that the width $N$ is infinite, but rather much larger than $d$ (more precisely, larger than any constant times $d$ ). Analogously, the large sample limit does not coincide with infinite sample size $n$ , but instead sample size that is much larger than $d$ .

The ridgeless limit $\lambda\to 0+$ is important because it captures the asymptotic behavior the min-norm interpolation predictor (see also Remark 1.)

Under the assumptions of Theorem 2, set $\psi\equiv\min\{\psi_{1},\psi_{2}\}$ and define

Then the asymptotic prediction error of random features ridgeless regression is given by

The proof of this result can be found in Section 12.

The next proposition establishes the main qualitative properties of the ridgeless limit.

Recall the bias and variance functions $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}$ defined in Eq. (26) and (27). Then, for any ${\zeta}\in(0,\infty)$ and fixed $\psi_{2}\in(0,\infty)$ , we have

Divergence at the interpolation threshold $\psi_{1}=\psi_{2}$ :

Large width limit $\psi_{1}\to\infty$ (here $\chi$ is defined as per Eq. (24)):

Above the interpolation threshold (i.e. for $\psi_{1}\geq\psi_{2}$ ), the function $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{1},\psi_{2})$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{1},\psi_{2})$ are strictly decreasing in the rescaled number of neurons $\psi_{1}$ .

The proof of this proposition is presented in Section 13.1.

As anticipated, point $2$ establishes an important difference with respect to the random covariates linear regression model of [AS17, HMRT19, BHX19]. While in those models the peak in prediction error is entirely due to a variance divergence, in the present setting both variance and bias diverge.

Another important difference is established in point $4$ : both bias and variance are monotonically decreasing above the interpolation threshold. This, again, contrasts with the behavior of simpler models, in which bias increases after the interpolation threshold, or after a somewhat larger point in the number of parameters per dimension (if misspecification is added).

This monotone decrease of the bias is crucial, and is at the origin of the observation that highly overparametrized models outperform underparametrized or moderately overparametrized ones. See Figure 6 for an illustration.

2.2 Highly overparametrized regime

As the number of neurons $N$ diverges (for fixed dimension $d$ ), random features ridge regression is known to approach kernel ridge regression with respect to the kernel (7). It is therefore interesting what happens when $N$ and $d$ diverge together, but $N$ is larger than any constant times $d$ .

Under the assumptions of Theorem 2, define

Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by

The proof of this result can be found in Section 12. Note that, as expected, the risk remains lower bounded by $F_{\star}^{2}$ , even in the limit $\psi_{1}\to\infty$ . Naively, one could have expected to recover kernel ridge regression in this limit, and hence a method that can fit nonlinear functions. However, as shown in [GMMM19], random features methods can only learn linear functions for $N=O_{d}(d^{2-\delta})$ .

As observed in Figures 2 to 4 (which have been obtained by applying Theorem 2), the minimum prediction error is often achieved by highly overparametrized networks $\psi_{1}\to\infty$ . It is natural to ask what is the effect of regularization on such networks. Somewhat surprisingly (and as anticipated in Section 2), we find that regularization does not always help. Namely, there exists a critical value $\rho_{\star}$ of the signal-to-noise ratio, such that vanishing regularization is optimal for $\rho>\rho_{\star}$ , and is not optimal for $\rho<\rho_{\star}$ .

In order to state formally this result, we define the following quantities

Notice in particular that $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\psi_{2},\lambda/\mu_{\star}^{2})$ is the limiting value of the prediction error (right-hand side of (36)) up to an additive constant and an multiplicative constant.

Fix ${\zeta},\psi_{2}\in(0,\infty)$ and $\rho\in(0,\infty)$ . Then the function $\overline{\lambda}\mapsto\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\psi_{2},\overline{\lambda})$ is either strictly increasing in $\overline{\lambda}$ , or strictly decreasing first and then strictly increasing.

The proof of this proposition is presented in Section 13.2, which also provides further information about this phase transition (and, in particular, an explicit expression for $\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)$ ).

2.3 Large sample limit

Under the assumptions of Theorem 2, define

Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by

The proof of this result can be found in Section 12.

Asymptotics of the training error

Theorem 2 establishes the exact asymptotics of the test error in the random features model. However, the technical results obtained in the proofs allow us to characterize several other quantities of interest. Here we consider the behavior of the training error and of the norm of the parameters. We define the regularized training error by

We also recall that $\hat{\bm{a}}(\lambda)$ denotes the minimizer in the last expression, cf. Eq. (2) The next definition presents the asymptotic formulas for these quantities.

We next state our asymptotic characterization of $L_{\rm RF}(f_{d},{\bm{X}},{\bm{\Theta}},\lambda)$ and $\|\hat{\bm{a}}(\lambda)\|_{2}^{2}$ .

Then for any value of the regularization parameter $\lambda>0$ , the asymptotic regularized training error and norm square of its minimizer satisfy

The proof of Theorem 6 is similar to the proof of Theorem 2. We will give a sketch of proof of Theorem 6 in Section E.

In this section, we illustrate Theorem 6 through numerical simulations. Figure 7 reports the theoretical prediction and numerical results for the regularized training error, the test error, and the norm of the coefficients $\hat{\bm{a}}(\lambda)$ . We use a small non-zero value of the regularization parameter $\lambda=10^{-3}$ , fix the number of samples per dimension $\psi_{2}=n/d$ , and follow these quantities as a function of the overparameterization ratio $\psi_{1}/\psi_{2}=N/n$ .

As expected, the behavior of the training error strikingly different from the one of the test error. The training error is monotone decreasing in the overparameterization ratio $N/n$ , and is close to zero in the overparameterized regime $N/n>1$ (it is not exactly vanishing because we use a small $\lambda>0$ ). In other words, the fitted model is nearly interpolating the data, and the peak in test error matches the interpolation threshold.

On the other hand, the penalty term $\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}$ is non-monotone: it increases up to the interpolation threshold, then decreases for $N/n>1$ , and converges to a constant as $\psi_{1}\to\infty$ . If we take this as a proxy for the model complexity, the behavior of $\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}$ provides useful intuition about descent of the generalization error. As the number of parameters increases beyond the interpolation threshold, the model complexity decreases instead of increasing.

We can confirm the intuition that the double descent of the test error is driven by the behavior of the model complexity $\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}$ , by selecting $\lambda$ in an optimal way. Following [HMRT19], we expect that the optimal regularization should produce a smaller value of $\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}$ , and hence eliminate or reduce the double descent phenomenon. Indeed, this is illustrated in Figure 8 demonstrates the prediction of the regularized training error and the test error for two choices of $\lambda$ : $\lambda=0$ , and an optimal $\lambda$ such that the test error is minimized. When we choose an optimal $\lambda$ , the test error becomes strictly decreasing as $\psi_{1}=N/d$ increases. We expect this is a generic phenomenon that also holds in other interesting models.

An equivalent Gaussian covariates model

An examination of the proof of our main result (Theorem 2) reveals an interesting phenomenon. The random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates and response that is linear in these covariates, provided we use a special covariance and signal structure.

Draw ${\bm{x}}\sim{\mathsf{N}}(0,{\mathbf{I}}_{d})$ , $\varepsilon\sim{\mathsf{N}}({\bm{0}},\tau^{2})$ , and ${\bm{w}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N})$ independently, conditional on ${\bm{\Theta}}$ .

Let $y=\langle{\bm{\beta}}_{1},{\bm{x}}\rangle+\varepsilon$ .

Let ${\bm{u}}=(u_{1},\ldots,u_{N})^{\mathsf{T}}$ , $u_{j}=\mu_{0}+\mu_{1}\langle{\bm{\theta}}_{j},{\bm{x}}\rangle/\sqrt{d}+\mu_{\star}w_{j}$ , for some $0<|\mu_{0}|,|\mu_{1}|,|\mu_{\star}|<\infty$ .

Remarkably, in the proportional asymptotics $N,n,d\to\infty$ with $N/d\to\psi_{1},n/d\to\psi_{2}$ , the behavior of this model is the same as the one of the nonlinear random features model studied in the rest of the paper. In particular, the asymptotic prediction error $\mathscrsfs{R}$ is given by the same formula as in Definition 1.

(Gaussian covariates prediction model) Define $\zeta$ and the signal-to-noise ratio $\rho\in[0,\infty]$ as

and assume $\mu_{0},\mu_{1},\mu_{\star}\neq 0$ . Then, in the Gaussian covariates model described above, for any $\lambda>0$ , we have

where $\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\overline{\lambda})$ is explicitly given in Definition 1.

The proof of Theorem 7 is is almost the same as the one of Theorem 2 (with several simplifications, because of the greater amount of independence). To avoid repetitions, we will not present a proof here.

Figure 9 illustrates the content of Theorem 7 via numerical simulations. We report the simulated and predicted test error as a function of $\psi_{1}/\psi_{2}=N/n$ . The theoretical prediction here is exactly the same as the one reported in Figure 5. However, numerical simulations were carried out with the Gaussian covariates model instead of random features. The agreement is excellent, as predicted by Theorem 7.

Proof of Theorem 2

We begin by observing that the minimizer of the training error (2) is given by

Then $\hat{\bm{a}}(\lambda)$ can be written in a simpler form $\hat{\bm{a}}(\lambda)={\bm{\Xi}}{\bm{Z}}^{\mathsf{T}}{\bm{y}}/\sqrt{d}$ . After a simple calculation, we obtain

Our first step is to replace the exact expression (55) by a simpler one involving traces of combinations of ${\bm{\Xi}}$ and the following four random matrices:

The proof of this proposition is deferred to Section 9 and is based on the following main steps:

Finally, we show that $R_{\rm RF}(f_{d},{\bm{X}},{\bm{\Theta}},\lambda)$ concentrates around its expectation with respect to $f_{d}$ (i.e. the coefficients $\{{\bm{\beta}}_{d,k}\}_{k\geq 1}$ ) and ${\bm{\varepsilon}}$ .

Here ${\rm Log}$ is the complex logarithm with branch cut on the negative real axis and $\{\lambda_{i}({\bm{A}})\}_{i\in[M]}$ is the set of eigenvalues of ${\bm{A}}$ in non-increasing order.

The next proposition connects the quantities $\Psi_{j}$ to the transforms $G_{d}$ and $M_{d}$ .

The proof of Proposition 8.2 follows by basic calculus and linear algebra, and we defer its proof to Appendix B. Despite its simplicity, this statement provides the basic scheme of our proof. We will determine the asymptotics of $M_{d}(\xi;{\bm{q}})$ using a leave-one-out argument; then extract the behavior of $G_{d}(\xi;{\bm{q}})$ using Eq. (62); finally we characterize the test error using Eqs. (63) and Proposition 8.1.

The construction of the matrix ${\bm{A}}({\bm{q}})$ is related to the linear pencil method in free probability, see e.g., [HMS18]. A significantly simpler construction was used in [HMRT19, Section 8] to calculate the variance part of the risk $R_{{\rm RF}}$ in the limit $\lambda\to 0$ (in special cases). The approach of [HMRT19] amounts to computing the Stieltjes transform of ${\bm{A}}$ for $p=t_{1}=t_{2}=0$ , in the limit $\xi\to 0$ : unfortunately this quantity is not sufficient to extract the prediction error. We overcome this difficulty by considering a more complex block-structured matrix and expressing the risk in terms of derivatives of the log determinant $G_{d}(\xi;{\bm{q}})$ .

Unfortunately, even conditional on ${\bm{\theta}}_{N}$ , the projections of $({\bm{\theta}}_{a})_{a<N}$ and $({\bm{x}}_{i})_{i\leq n}$ along ${\bm{\theta}}_{N}$ and orthogonal to it are not independent (because of the sphere constraint). To overcome this problem we replace these by Gaussian vectors $(\overline{\bm{\theta}}_{a})_{a<N}$ and $(\overline{\bm{x}}_{i})_{i\leq n}$ and prove that the two distributions yield the same asymptotics of the Stieltjes transform. The decomposition of these Gaussian vectors takes the form

At this point independence can be exploited to obtain concentration results on the right-hand side. Let us emphasize that, while these paragraphs outline the main elements of the leave-one-out argument, several technical subtleties make the actual proof significantly longer, see Section 10 for details.

The proof of Proposition 8.3 is presented in Section 10. The fixed point equations (70) arise as a consequence of Eq. (64) (and the analogous equation for $m_{2,d}$ ). Indeed the proof also shows that the solution $(m_{1},m_{2})$ of these equations gives the limit of $(m_{1,d},m_{2,d})$ as $n,N,d\to\infty$ .

For a complete proof of this proposition we refer to Section 11.

We can now use Eqs. (76), (77), and (63) in Proposition 8.1, to get

The last display provides the desired asymptotics of bias and variance. However, these expressions involve derivatives of $g$ that are very inconvenient to evaluate. We conclude by proving more explicit expressions for these quantities. The key remark here is that the expression $g(\xi;{\bm{q}})$ in Proposition 8.4 has a special property: the fixed point equations (70) imply that $(m_{1}(\xi;{\bm{q}}),m_{2}(\xi;{\bm{q}}))$ is a stationary point of the function $\Xi(\xi,\,\cdot\,,\,\cdot\,;{\bm{q}})$ . This simplifies the calculation of derivatives with respect to ${\bm{q}}$ . In particular, the first derivative are obtained by computing the partial derivatives of $\Xi$ with respect to ${\bm{q}}$ and evaluating it at $m_{1},m_{2}$ .

The proof of this lemma follows by simple calculus and can be found in Appendix D.

By the definition of analytic functions $m_{1}$ and $m_{2}$ (satisfying Eq. (70) and (69) with ${\bm{q}}={\bm{0}}$ as defined in Proposition 8.3), the definition of $\nu_{1}$ and $\nu_{2}$ in Eq. (85) above is equivalent to its definition in Definition 1 (as per Eq. (15)). Moreover, for $\chi$ defined in Eq. (16) with $\overline{\lambda}=\lambda/\mu_{\star}^{2}$ and $m_{0}$ defined in Eq. (82), we have

Plugging in Eq. (83) and (84) into Eq. (80) and (81) and using Eq. (86), we can see that the expressions for $\mathscrsfs{B}$ and $\mathscrsfs{V}$ defined in Eq. (80) and (81) coincide with Eq. (18) and (19) where $\mathscrsfs{E}_{0},\mathscrsfs{E}_{1},\mathscrsfs{E}_{2}$ are provided in Eq. (17). Combining with Eq. (78) and (79) proves the theorem.

Proof of Proposition 8.1

As a useful preliminary remark, we note that the Gaussian process $f_{d}^{\mbox{\tiny\rm NL}}$ defined in Assumption 3 can be explicitly represented as a sum of spherical harmonics with Gaussian coefficients. The following lemma is standard (see, e.g., [MP11, Proposition 6.11]). For the reader’s convenience, we present a simple proof in Appendix C.

Let us write the random variable in the left hand side of Eq. (58) as a function of ${\bm{\beta}}_{d,1}$ and de-emphasize its dependence on other variables, i.e.,

Next, we state three lemmas that are used in the proof of Proposition 8.1.

Let $\lambda_{d,k}(\sigma)$ be the Gegenbauer coefficients of the function $\sigma$ , i.e., we have

Under the assumptions of Proposition 8.1, for any $\lambda>0$ , we have

where $S_{1k},S_{2k},S_{3}$ are given by Eq. (91), and $\Psi_{1},\Psi_{2},\Psi_{3}$ are given by Eq. (59).

Under the assumptions of Proposition 8.1, we have

We defer the proofs of these three lemmas to the following subsections and show here that they imply Proposition 8.1. Indeed, we have

where $(a)$ follows by Lemma 9.3 and triangular inequality, and $(b)$ from Lemma 9.4.

Recall the expression (55) for the risk. Taking expectation with respect to ${\bm{\beta}}$ and ${\bm{\varepsilon}}$ , we get

The proof of the lemma follows by evaluating each of these three terms. It is useful to introduce the matrices ${\bm{Y}}_{k,{\bm{x}}}$ , ${\bm{Y}}_{k,{\bm{\theta}}}$ , which denotes the evaluations of spherical harmonics of degree $k$ at the points $\{{\bm{x}}_{i}\}_{i\leq n}$ and $\{{\bm{\theta}}_{a}\}_{a\leq N}$ (c.f. Appendix A):

Using these expressions, we can evaluate terms $T_{1}$ and $T_{2}$ :

Combining the above formulas for $T_{1}$ , $T_{2}$ , $T_{3}$ proves Lemma 9.3.

2 Proof of Lemma 9.4

The next two lemmas will be used in the proofs of Lemma 9.4 and Lemma 9.5, and hold under the same assumptions. The first of these lemmas will be used to establish Eq. (92) (but notice that its statement does not coincide with that equation), and the second will be used to control several terms in those proofs. The proofs of these lemmas are given in Section C.2.

We will now use these lemmas to prove Lemma 9.4. We begin by recalling a few facts that are used several times in the proof. Since $\lambda>0$ , there exists a constant $C<\infty$ depending on $(\lambda,\psi_{1},\psi_{2})$ such that deterministically

By operator norm bounds on Wishart matrices [AGZ09], we have (the definition of these matrices are given in Eq. (57))

Finally we need some simple operator norm bounds on the matrices $Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})-{\mathbf{I}}_{n}$ , $Q_{k}({\bm{\Theta}}{\bm{\Theta}}^{\mathsf{T}})-{\mathbf{I}}_{N}$ , and $Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})$ . Notice that $Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})_{ii}=1$ (by the normalization condition of Gegenbauer polynomials) and the out-of-diagonal entries of $Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})$ have zero mean and typical size of order $1/d^{k/2}$ (see Appendix A). This suggests the following estimates, which are formalized in Lemma C.6,

Since further $\lambda_{d,k}(\sigma)^{2}B(d,k)k!\to\mu_{k}(\sigma)^{2}$ as $d\to\infty$ (see Eq. (225)), we have

(This estimate is stated formally in the appendices as Lemma C.7.) It is also useful to introduce the matrix

We have now finished presenting our preliminary estimates and can prove Lemma 9.4.

We begin by considering Eq. (92), where $S_{10}$ and $S_{20}$ are defined in Eq. (91). By the approximate linearization of ${\bm{U}}$ in Eq. (109), we have

Further recall the definition of $A_{1}$ , and $A_{2}$ in Eqs. (103), (104) and the definition of ${\bm{\Xi}}$ and ${\bm{\Pi}}$ as in Eq. (88), and we define

Then we have $S_{10}=A_{1}$ , $S_{20}=A_{2}+B$ and by Lemma 9.6, we have

We next consider Eq. (93), which requires to control $S_{1k}$ , defined in Eq. (91)). By Eq. (106), we have

Further note $\|\sigma\|_{L^{2}(\tau_{d})}^{2}=\sum_{k\geq 0}\lambda_{d,k}(\sigma)^{2}B(d,k)=O_{d}(1)$ , $B(d,k)=\Theta(d^{k})$ , and for fixed $d$ , $B(d,k)$ is non-decreasing in $k$ [GMMM19, Lemma 1]. Therefore

By Lemma 9.7 (with $\overline{\bm{M}}_{k}=Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})-{\mathbf{I}}_{n}$ ) and Eq. (108), we get

We next consider Eq. (95), where we recall the definition of $S_{11}$ in Eq. (91) and the definition of $\Psi_{1}$ in Eq. (59). By observing that $\lim_{d\to\infty}\sqrt{d}\lambda_{1,d}(\sigma)=\mu_{1}$ (see Eq. (225)) and that $\mu_{1}Q_{1}({\bm{X}}{\bm{\Theta}}^{\mathsf{T}})=\mu_{1}{\bm{X}}{\bm{\Theta}}^{\mathsf{T}}/d={\bm{Z}}_{1}$ , we immediately get

In order to prove Eq. (96), recall the definition of $S_{21}$ in Eq. (91) and the definition of $\Psi_{2}$ in Eq. (59). By the decomposition of ${\bm{U}}$ in Eq. (109) and recalling that $Q_{1}({\bm{X}}{\bm{X}}^{\mathsf{T}})={\bm{H}}$ , we have

Finally, Eq. (97) is proved analogously to Eq. (96): this completes the proof of the lemma.

3 Proof of Lemma 9.5

To prove Lemma 9.5, we begin with by rewriting the prediction risk —cf. Eq. (55)— as (note that ${\bm{y}}={\bm{f}}+{\bm{\varepsilon}}$ )

This obviously implies the claims of the lemma. In the rest of this proof, we show the variance bound for $\Gamma_{1}$ , as the other bounds are very similar.

Recall the definition of ${\bm{Y}}_{k,{\bm{x}}}$ and ${\bm{Y}}_{k,{\bm{\theta}}}$ in Eq. (99), the definition of Gegenbauer coefficients $\lambda_{d,l}\equiv\lambda_{d,l}(\sigma)$ in Eq. (89) and the expansion of ${\bm{f}}$ and ${\bm{V}}$ vectors in Eq. (100). We rewrite $\Gamma_{1}$ as

Calculating the variance of $\Gamma_{1}$ with respect to ${\bm{\beta}}_{d,k}\sim{\mathsf{N}}({\bm{0}},(F_{d,k}^{2}/B(d,k)){\mathbf{I}})$ for $k\geq 1$ using Lemma C.8 (which follows from direct calculation), we get

Further note that $\|\sigma\|_{L^{2}(\tau_{d})}^{2}=\sum_{k\geq 0}\lambda_{d,k}^{2}B(d,k)=O_{d}(1)$ , $B(d,k)=\Theta(d^{k})$ , and for fixed $d$ , $B(d,k)$ is non-decreasing in $k$ [GMMM19, Lemma 1]. Therefore

Substituting above we obtain, and using the fact that $\sum_{k\geq 1}F_{d,k}^{2}=O_{d}(1)$ by construction, we have

To bound the remaining two terms in this expression, note that

where the bound is implied by Lemma 9.7 (when applying Lemma 9.7, we change the role of $N$ and $n$ , the role of ${\bm{\Xi}}$ and ${\bm{\Pi}}$ , and the role of ${\bm{\Theta}}$ and ${\bm{X}}$ ; this can be done because the role of ${\bm{\Theta}}$ and ${\bm{X}}$ is symmetric), and by Eq. (107) and $\lambda_{d,0}(\sigma)=\Theta_{d}(1)$ (by Assumption 1 and note that $\mu_{0}(\sigma)=\lim_{d\to\infty}\lambda_{d,0}(\sigma)$ by Eq. (225)). This proves that

The bound on the first term in Eq. (114) is obtained analogously and we omit it for brevity.

Proof of Proposition 8.3

This section is organized as follows. We collect the elements to prove Proposition 8.3 in Sections 10.1, 10.2, 10.3, 10.4, and prove the proposition in Section 10.5.

In this subsection, we state Lemma 10.1, which is the key lemma that is used to prove Proposition 8.3. Lemma 10.1 studies $\overline{m}_{1,d}$ and $\overline{m}_{2,d}$ , the partial Stieltjes transforms of the Gaussian counterparts of the matrix ${\bm{A}}$ as defined in Eq. (60). This lemma shows that these partial Stieltjes transforms $\overline{m}_{1,d}$ and $\overline{m}_{2,d}$ approximately satisfy the fixed point equation which involves functions ${{\sf F}_{1}}$ and ${{\sf F}_{2}}$ as defined in Eq. (69). We will prove Lemma 10.1 in Section 10.3. Later in Section 10.4, we will show that the Gaussian counterpart of the Stieltjes transform shares the same asymptotics with its spherical version.

The matrix $\overline{\bm{A}}$ is in parallel with its spherical version matrix ${\bm{A}}$ defined as in Eq. (60).

In what follows, we will write ${\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p)$ . We would like to calculate the asymptotic behavior of the following partial Stieltjes transforms

The crucial step consists in showing that the expected Stieltjes transforms $\overline{m}_{1,d},\overline{m}_{2,d}$ are approximate solutions of the fixed point equations (70).

The proof of Lemma 10.1 uses a leave-one-out argument in deriving the fixed point equation for Stieltjes transform of random matrices, e.g. [BS10, Chapter 3.3] and [CS13]. We will prove Lemma 10.1 in Section 10.3. In the next subsection, we will collect some lemmas that are used in this proof.

2 Preliminaries of the proof of Lemma 10.1: Stieltjes transforms and the fixed point equation

In the following lemma, we fix a ${\bm{q}}\in{\mathcal{Q}}$ (as defined in Eq. (68)) and fix $0<\psi_{1},\psi_{2},\mu_{1},\mu_{\star}<\infty$ . Since the parameters ${\bm{q}},\psi_{1},\psi_{2},\mu_{1},\mu_{\star}$ are fixed, we will drop them from the argument of F unless necessary. In these notations, Eq. (70) reads

The following lemma shows that there exists a unique fixed point of the equation above in a certain domain provided $\Im\xi$ is large enough.

We rewrite the first equation in Eq. (69) as

In order to prove the Lipschitz continuity of F in this domain, notice that ${{\sf F}_{1}}$ is differentiable and

The functions $\xi\mapsto\overline{m}_{1,d}(\xi)$ , $\xi\mapsto\overline{m}_{2,d}(\xi)$ have the following properties:

The next lemma establishes the concentration of Stieltjes transforms to its mean, whose proof is the same as the proof of Lemma 9 in [HMRT19].

Let $\Im(\xi)\geq\xi_{0}>0$ and consider the partial Stieltjes transforms $\overline{M}_{i,d}(\xi;{\bm{q}})$ as per Eq. (119). Then there exists $c_{0}=c_{0}(\xi_{0})$ such that, for $i\in\{1,2\}$ ,

Assume $\sigma$ is an activation function with $\sigma(u)^{2}\leq c_{0}\,\exp(c_{1}\,u^{2}/2)$ for some constants $c_{0}>0$ and $c_{1}<1$ (this is implied by Assumption 1). Then

3 Proof of Lemma 10.1: Leave-one-out argument

Throughout the proof, we write $F(d)=O_{d}(G(d))$ if there exists a constant $C=C(\xi_{0},{\bm{q}},\psi_{1},\psi_{2},\varphi)$ which is uniformly bounded when $(\xi_{0},{\bm{q}},\psi_{1},\psi_{2})$ is in a compact set, such that $|F(d)|\leq C\cdot|G(d)|$ . We write $F(d)=o_{d}(G(d))$ if for any $\varepsilon>0$ , there exists a constant $C=C(\varepsilon,\xi_{0},{\bm{q}},\psi_{1},\psi_{2},\varphi)$ which is uniformly bounded when $(\xi_{0},{\bm{q}},\psi_{1},\psi_{2})$ is in a compact set, such that $|F(d)|\leq\varepsilon\cdot|G(d)|$ for any $d\geq C$ . We use $C$ to denote generically such a constant, that can change from line to line.

We will assume $p=0$ throughout the proof. For $p\neq 0$ , the lemma holds by viewing $\overline{\bm{J}}+p\overline{\bm{J}}_{1}=\varphi_{\star}({\bm{X}}{\bm{\Theta}}^{\mathsf{T}}/\sqrt{d})/\sqrt{d}$ as a new kernel inner product matrix, with $\varphi_{\star}(x)=\varphi(x)+p\mu_{1}x$ .

Step 1. Calculate the Schur complement and define some notations.

We decompose the vectors $\overline{\bm{\theta}}_{a},\overline{\bm{x}}_{i}$ in the components along $\overline{\bm{\theta}}_{N}$ and the orthogonal component:

We next write $\overline{\bm{B}}$ as the sum of three terms:

and ${\bm{\eta}}=(\eta_{1},\ldots,\eta_{N-1})^{\mathsf{T}}$ , ${\bm{u}}=(u_{1},\ldots,u_{n})^{\mathsf{T}}$ , and

where $\varphi_{\perp}(x)\equiv\varphi(x)-\mu_{1}x$ .

Step 2. Perturbation bound for the Schur complement.

Moreover, by Lemma 10.7, $|\omega_{1}-\omega_{2}|$ is deterministically bounded by $2/\xi_{0}$ . This gives

Using the definitions of $\omega_{1}$ and $\omega_{2}$ as in Eq. (145) and (146), for $\Im\xi\geq\xi_{0}$ , we have

Hence we have $|\omega_{1}|\leq 1/\xi_{0}$ , and, using a similar argument, $|\omega_{2}|\leq 1/\xi_{0}$ . Hence we get the bound $|\omega_{1}-\omega_{2}|\leq 2/\xi_{0}$ .

Under the assumptions of Lemma 10.1, we have

Since ${\bm{E}}_{1}$ is a sub-matrix of ${\bm{E}}$ , we have $\|{\bm{E}}_{1}\|_{{\rm op}}\leq\|{\bm{E}}\|_{{\rm op}}$ . By the intermediate value theorem

Note that $\varphi^{\prime\prime}_{\perp}(x)=\varphi^{\prime\prime}(x)$ is a polynomial with some fixed degree $\bar{k}$ . Therefore we have

Under the assumptions of Lemma 10.1, we have

For ${\bm{a}}_{1}$ , note we have $\|{\bm{a}}_{1}\|_{2}=|s_{2}|\cdot\|{\bm{\eta}}\|_{2}\|\overline{\bm{\theta}}_{N}\|_{2}/d$ where ${\bm{\eta}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N-1})$ and $\overline{\bm{\theta}}_{N}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{d})$ are independent. Hence we have

For ${\bm{a}}_{2}$ , note $\varphi$ is a polynomial with some fixed degree $\bar{k}$ , hence we have

Step 3. Simplification using Sherman-Morrison-Woodbury formula.

Notice that ${\bm{\Delta}}$ is a matrix with rank at most two. Indeed

Since we assumed ${\bm{q}}\in{\mathcal{Q}}$ so that $|s_{2}t_{2}|\leq\mu_{1}^{2}/2$ , the matrix ${\bm{M}}$ is invertible with $\|{\bm{M}}^{-1}\|_{{\rm op}}\leq C$ .

Recall the definition of $\omega_{2}$ in Eq. (146). By the Sherman-Morrison-Woodbury formula, we get

By auxiliary Lemmas 10.10, 10.11, and 10.12 below, we get

Elementary algebra simplifying Eq. (159) gives $\psi_{1,d}\omega_{3}={{\sf F}_{1}}(\overline{m}_{1,d},\overline{m}_{2,d};\xi;{\bm{q}},\psi_{1,d},\psi_{2,d},\mu_{1},\mu_{\star})$ . This proves Eq. (120) in Lemma 10.1. Eq. (121) follows by the same argument (exchanging $N$ and $n$ ). In the rest of this section, we prove auxiliary Lemmas 10.10, 10.11, and 10.12.

Using the formula of $\omega_{2}$ and $\omega_{3}$ as in Eq. (156) and (159), for $\Im\xi\geq\xi_{0}$ , we have

Combining with $|\omega_{2}-\omega_{3}|\leq|\omega_{2}|+|\omega_{3}|\leq O_{d}(1)$ proves the lemma. ∎

Under the assumptions of Lemma 10.1, we have (following the notations of Eq. (157) and (158))

The first bound is because (see Lemma 10.3 for the boundedness of $\overline{m}_{1,d}$ and $\overline{m}_{2,d}$ )

In the following, we limit ourselves to proving Eq. (161), since Eq. (162) and (163) follow by similar arguments.

Then by the definition of $v_{1}$ in Eq. (157), we have $v_{1}={\bm{a}}^{\mathsf{T}}{\bm{R}}{\bm{a}}$ . Note we have

We next compute $\text{\rm Var}({\bm{h}}^{{\mathsf{T}}}{\bm{R}}{\bm{h}}|{\bm{R}})$ . By a similar calculation of Lemma C.8, we have (for a complex matrix, denote ${\bm{R}}^{\mathsf{T}}$ to be the transpose of ${\bm{R}}$ , and ${\bm{R}}^{*}$ to be the conjugate transpose of ${\bm{R}}$ )

By Lemmas 10.4, partial Stieltjes transforms are stable with respect to deleting one row and one column of the same index. By Lemma 10.13 (which will be stated and proved later), partial Stieltjes transforms are stable with respect to small changes of the dimension $d$ . Moreover, by Lemma 10.5, partial Stieltjes transforms concentrate tightly around their mean. As a consequence of all these lemmas (Lemma 10.4, 10.13, 10.5), we have

Combining with Eq. (167) proves Eq. (161). ∎

The following lemma is the analog of Lemma B.7 and B.8 in [CS13].

Under the assumptions of Lemma 10.1, we have (using the definitions in Eq. (155), (157) and (158) )

Step 1. Bounding $\|({\bm{M}}^{-1}+{\bm{V}}_{3})^{-1}\|_{{\rm op}}$ . By Sherman-Morrison-Woodbury formula, we have

Step 2. Bounding $\|({\bm{M}}^{-1}+\overline{\bm{V}}_{3})^{-1}\|_{{\rm op}}$ . Define ${\bm{G}}={\bm{M}}^{1/2}{\bm{V}}_{3}{\bm{M}}^{1/2}$ and $\overline{\bm{G}}={\bm{M}}^{1/2}\overline{\bm{V}}_{3}{\bm{M}}^{1/2}$ . By Lemma 10.11, we have

Combining with Eq. (170) and (171), we get

The last equality holds because $\|({\mathbf{I}}_{2}+\overline{\bm{G}})^{-1}\|_{{\rm op}}$ is deterministic. Hence we have

The following lemma shows that, the partial resolvents are stable with respect to small changes of the dimension $d$ .

We denote $\overline{\bm{A}}_{ij}$ and $\underline{{\bm{A}}}_{ij}$ for $i,j\in$ to be

Then it’s easy to see that $\|\overline{\bm{\Omega}}\|_{{\rm op}},\|{\bm{\Omega}}_{1}\|_{{\rm op}},\|{\bm{\Omega}}_{2}\|_{{\rm op}},\|{\bm{\Omega}}_{3}\|_{{\rm op}},\|\underline{{\bm{\Omega}}}\|_{{\rm op}}\leq 1/\xi_{0}$ .

where ${\bm{\eta}}=(\overline{\theta}_{1d},\ldots,\overline{\theta}_{Nd})^{\mathsf{T}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N})$ . This gives

where ${\bm{u}}=(\overline{x}_{1d},\ldots,\overline{x}_{nd})^{\mathsf{T}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{n})$ . This gives

we have $\|{\bm{E}}\|_{{\rm op}}=O_{d}({\rm Poly}(\log d)/\sqrt{d})$ . Therefore, we get

Combining all these bounds establishes Eq. (176). Finally, Eq. (177) can be shown using the same argument. ∎

4 Equivalence between Gaussian and sphere version of Stieltjes transforms

In this subsection, we show that the Stieltjes transform of matrix ${\bm{A}}$ as defined in Eq. (60) and that of matrix $\overline{\bm{A}}$ as defined in Eq. (117) share the same asymptotics. For the reader’s convenience, we restate the definitions of these two matrices here.

and the Stieltjes transforms $\overline{M}_{d}({\bm{\xi}};{\bm{q}})$ and $M_{d}({\bm{\xi}};{\bm{q}})$ , defined by

The readers could keep in mind: a quantity with an overline corresponds to the case when features and data are Gaussian, while a quantity without overline usually corresponds to the case when features and data are on the sphere.

Step 1. Show that the resolvent is stable with respect to nuclear norm perturbation.

Since ${\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p)$ is fixed, we have

The nuclear norm of the term involving ${\bm{Z}}_{0}$ can be easily bounded by

For term $\overline{\bm{H}}-{\bm{H}}$ , denoting ${\bm{D}}_{\bm{x}}={\rm diag}(\sqrt{d}/\|\overline{\bm{x}}_{1}\|_{2},\ldots,\sqrt{d}/\|\overline{\bm{x}}_{n}\|_{2})$ , we have

Step 3. Bound for $\|\overline{\bm{J}}-{\bm{Z}}_{\star}\|_{F}/\sqrt{d}$ .

Define $\overline{\bm{Z}}_{\star}=\varphi({\bm{D}}_{\bm{x}}\overline{\bm{X}}\,\overline{\bm{\Theta}}^{\mathsf{T}}/\sqrt{d})/\sqrt{d}$ . Define $r_{i}=\sqrt{d}/\|\overline{\bm{x}}_{i}\|_{2}$ . We have (for $\zeta_{ia}$ between $r_{i}$ and $1$ )

where ${\bm{\Xi}}=(\zeta_{ia})_{i\in[n],a\in[N]}$ , and $\bar{\varphi}(x)=x\varphi^{\prime}(x)$ (so $\bar{\varphi}$ is a polynomial). It is easy to see that

Therefore, we have (denoting $\deg(\varphi)$ to be the degree of polynomial $\varphi$ , and $C(\varphi)$ to be a constant that only depends on $\varphi$ )

5 Proof of Proposition 8.3

Step 1. Polynomial activation function $\sigma$ .

By the property of Stieltjes transform as in Lemma 10.3 $(c)$ , we have

By the concentration result of Lemma 10.5, for $\overline{M}_{d}(\xi)=d^{-1}{\rm Tr}[(\overline{\bm{A}}-\xi{\mathbf{I}}_{M})^{-1}]$ , we also have

Then we use Lemma 10.14 to transfer this property from $\overline{M}_{d}$ to $M_{d}$ . Recall the definition of resolvent $M_{d}(\xi;{\bm{q}})$ in sphere case in Eq. (61). Combining Lemma 10.14 with Eq. (185), we have

Step 2. General activation function $\sigma$ satisfying Assumption 1.

Further, by continuity of the solution of the fixed point equation with respect to $\mu_{\star},\mu_{1}$ when $\Im\xi\geq\xi_{0}$ for some large $\xi_{0}$ (as stated in Lemma 10.2), we have for $\Im\xi\geq\xi_{0}$ ,

Combining Eq. (188), (189), and (190), we obtain

Taking $\varepsilon\to 0$ proves Eq. (71).

Step 3. Uniform convergence in compact sets (Eq. (72)).

By the concentration of $M_{d}(\xi_{\star};{\bm{q}})$ to its expectation (which is the spherical version of Lemma 10.5), we have

and since $\mathcal{N}(\varepsilon,\Omega)$ is a finite set, we have

This high probability bound will become an expectation bound by the uniform boundedness of $M_{d}(\xi;{\bm{q}})$ for $\xi$ in any compact domain. That is, we have

Combining Eq. (191), (192) and (194), we have

Letting $\varepsilon\to 0$ proves Eq. (72). This concludes the proof of Proposition 8.3.

Proof of Proposition 8.4

In Section 11.1 we state and prove some lemmas that are used in the proof of Proposition 8.4. We prove Proposition 8.4 in Section 11.2.

The first lemma concerns the behavior of the partial Stieltjes transforms $m_{1}$ and $m_{2}$ when $\Im\xi\to\infty$ .

Define $\overline{m}_{1}=-\psi_{1}/\xi$ , $\overline{m}_{2}=-\psi_{2}/\xi$ , $\overline{\bm{m}}=(\overline{m}_{1},\overline{m}_{2})^{\mathsf{T}}$ , and ${\bm{m}}=(m_{1},m_{2})^{\mathsf{T}}$ . Let ${{\sf F}_{1}},{{\sf F}_{2}}$ be defined as in Eq. (69), F be defined as in Eq. (122), and ${{\sf H}_{1}}$ defined as in Eq. (125). By simple calculus we can see that

The next lemma concerns the behavior of the log-determinants when $\Im\xi\to\infty$ .

Follow the notations and settings of Proposition 8.4. For any fixed ${\bm{q}}$ , we have

Combining the bound of the real part and the imaginary part, we have

Recall the definition of $\Xi$ as given in Eq. (73). Define

It is easy to see that for any fixed ${\bm{q}}$ , we have

Combining Eq. (198), (199) and (200), we get

Next, we give some uniform upper bounds on the difference of derivatives of $G_{d}$ and $g$ .

Define ${\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p)\equiv(q_{1},q_{2},q_{3},q_{4},q_{5})$ , and

Finally, we show that the derivatives of a function in a region can be upper bounded by the function value and the second derivatives of the function in the region.

The proof of Lemma 11.4 is elementary and simply follows from Taylor expansion.

2 Proof of Proposition 8.4

By the expression of $\Xi$ in Eq. (73), we have

By fixed point equation (70) with ${{\sf F}_{1}},{{\sf F}_{2}}$ defined in (69), we obtain that

As a result, by the definition of $g$ given in Eq. (74), and by formula for implicit differentiation, we have

Combining Eq. (202) with Eq. (201), we get

Combining Eq. (203), (204) and (205), we get Eq. (75).

By Eq. (75) and Lemma 11.3, by the covering number argument (similar to the Step 3 in Section 10.5), we get that for any compact region ${\mathcal{Q}}_{\star}$ , there is

Taking ${\mathcal{Q}}_{\star}={\mathsf{B}}({\bm{0}},2\varepsilon)$ , by Eq. (206) and by Lemma 11.3 again, there exists some constant $C$ , such that

Sending $\varepsilon\to 0$ gives Eq. (76). By similar argument we get Eq. (77). This finishes the proof of Proposition 8.4.

Proof of Theorem 3, 4, and 5

To prove this theorem, we just need to show that

More specifically, we just need to show that, the formula for $\chi$ defined in Eq. (16) as $\overline{\lambda}\to 0$ coincides with the formula for $\chi$ defined in Eq. (24). By the relationship of $\chi$ and $m_{1}m_{2}$ as per Eq. (86), we just need to show the lemma below.

By Proposition 8.3 this is equivalently saying $m_{1}(\xi;\psi_{1},\psi_{2}),m_{2}(\xi;\psi_{1},\psi_{2})$ is the analytic continuation of solution of Eq. (70) as defined in Proposition 8.3, when ${\bm{q}}={\bm{0}}$ . Defining $\psi=\min(\psi_{1},\psi_{2})$ , we have

and hence $m_{0}=O_{u}(1)$ . Taking the difference between Eq. (209) and (210), we get

This implies one of $m_{1}$ and $m_{2}$ should be of order $1/u$ and the other one should be of order $u$ , as $u\to 0$ .

By definition of $m_{1}$ and $m_{2}$ in Eq. (207), we have

2 Proof of Theorem 4

To prove this theorem, we just need to show that

This follows by simple calculus and a lemma below.

Under the same condition of Lemma 12.1, we have

The proof of this lemma is similar to the proof of Lemma 12.1.

3 Proof of Theorem 5

To prove this theorem, we just need to show that

This follows by simple calculus and a lemma below (this lemma is symmetric to Lemma 12.2).

Under the same condition of Lemma 12.1, we have

Proof of Proposition 5.1 and 5.2

Proof of Point (1). When $\psi_{1}\to 0$ , we have $\chi=O(\psi_{1})$ , so that $\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}=-\psi_{1}\psi_{2}+O(\psi_{1}^{2})$ , $\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}=O(\psi_{1}^{2})$ and $\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}=-\psi_{1}\psi_{2}+O(\psi_{1}^{2})$ . This proves Point (1).

Proof of Point (2). When $\psi_{1}=\psi_{2}$ , substituting the expression for $\chi$ into $\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}$ , we can see that $\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=0$ . We also see that $\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})\neq 0$ and $\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})\neq 0$ . This proves Point (2).

Proof of Point (3). When $\psi_{1}>\psi_{2}$ , we have

Proof of Point (4). For $\psi_{1}>\psi_{2}$ , taking derivative of $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}$ with respect to $\psi_{1}$ , we have

It is easy to check that when $\psi_{1}>\psi_{2}$ , the functions $\partial_{\psi_{1}}\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}-\partial_{\psi_{1}}\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}$ and $\partial_{\psi_{1}}\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}-\partial_{\psi_{1}}\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}$ are functions of ${\zeta}$ and $\psi_{2}$ , and are independent of $\psi_{1}$ (note when $\psi_{1}>\psi_{2}$ , $\chi$ is a function of $\psi_{2}$ and doesn’t depend on $\psi_{1}$ ). Therefore, $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\cdot,\psi_{2})$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\cdot,\psi_{2})$ as functions of $\psi_{1}$ must be strictly increasing, strictly decreasing, or staying constant on the interval $\psi_{1}\in(\psi_{2},\infty)$ . However, we know $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=\infty$ , and $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\infty,\psi_{2})$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\infty,\psi_{2})$ are finite. Therefore, we must have that $\mathscrsfs{B}_{\mbox{\tiny\rm rless}}$ and $\mathscrsfs{V}_{\mbox{\tiny\rm rless}}$ are strictly decreasing on $\psi_{1}\in(\psi_{2},\infty)$ .

2 Proof of Proposition 5.2

In Proposition 13.1 given by the following, we give a more precise description of the behavior of $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}$ , which is stronger than Proposition 5.2.

Moreover, For any $\rho<\rho_{\star}({\zeta},\psi_{2})$ , we have

For any $\rho\geq\rho_{\star}({\zeta},\psi_{2})$ , we have

Minimizing over $\overline{\lambda}$ and ${\zeta}$ , we have

The minimizer is achieved for any ${\zeta}^{2}\geq{\zeta}_{\star}^{2}(\rho,\psi_{2})$ , and $\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)$ .

In the following, we prove Proposition 13.1. It is easy to see that

Hence we study the properties of $\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}$ first.

Step 1. Properties of the function $\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}$ . Calculating the derivative of $\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}$ with respect to $u$ , we have

has one negative and one positive solution, and $\omega_{1}$ is the negative solution of the above equation. Therefore, when $u\leq\omega_{1}$ , $\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}$ will be strictly decreasing in $u$ ; when $0\geq u\geq\omega_{1}$ , $\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}$ will be strictly increasing in $u$ . Therefore, we have

Step 2. Properties of the function $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}$ . For fixed $({\zeta},\rho,\psi_{2})$ , we look at the minimizer over $\overline{\lambda}$ of the function $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega(\overline{\lambda},{\zeta},\psi_{2}),\rho,\psi_{2})$ . The minimum $\min_{\overline{\lambda}\geq 0}\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2})$ could be different from the minimum $\min_{u\in(-\infty,0]}\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(u,\rho,\psi_{2})$ , since $\operatorname*{arg\,min}_{u\in(-\infty,0]}\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(u,\rho,\psi_{2})=\omega_{1}(\rho,\psi_{2})$ may not be achievable by $\omega(\overline{\lambda},{\zeta},\psi_{2})$ when $\overline{\lambda}\geq 0$ .

One observation is that $\omega(\cdot,\psi_{2},{\zeta})$ as a function of $\overline{\lambda}$ is always negative and increasing.

Then for any $\psi_{2}\in(0,\infty)$ , ${\zeta}\in(0,\infty)$ and $\overline{\lambda}>0$ , we have

Let us for now admit this lemma holds. When $\rho$ is such that $\omega_{1}>\omega_{0}$ (i.e. $\rho<\rho_{\star}({\zeta},\psi_{2})$ ), we can choose $\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)>0$ such that $\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega(\overline{\lambda}_{\star},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2})$ , and then $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho),\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{1}(\rho,\psi_{2}),\rho,\psi_{2})$ gives the minimum of $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}$ optimizing over $\overline{\lambda}\in[0,\infty)$ . When $\rho$ is such that $\omega_{1}<\omega_{0}$ (i.e. $\rho>\rho_{\star}({\zeta},\psi_{2})$ ), there is not a $\overline{\lambda}$ such that $\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2})$ holds. Therefore, the best we can do is to take $\overline{\lambda}=0$ , and then $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},0,\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{0}(\rho,\psi_{2}),\rho,\psi_{2})$ gives the minimum of $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}$ optimizing over $\overline{\lambda}\in[0,\infty)$ .

Finally, when we minimize $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2})$ jointly over ${\zeta}$ and $\overline{\lambda}$ , note that as long as ${\zeta}^{2}\geq{\zeta}_{\star}^{2}$ , we can choose $\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)>0$ such that $\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega(\overline{\lambda}_{\star},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2})$ , and then $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho),\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{1}(\rho,\psi_{2}),\rho,\psi_{2})$ gives the minimum of $\mathscrsfs{R}_{\mbox{\tiny\rm wide}}$ optimizing over $\overline{\lambda}\in[0,\infty)$ and ${\zeta}\in(0,\infty)$ . This proves Proposition 13.1.

It is easy to see that $\omega(\overline{\lambda},\psi_{2},{\zeta})<0$ . In the following, we show $\partial_{\overline{\lambda}}\omega(\overline{\lambda},\psi_{2},{\zeta})>0$ .

It is easy to see that, when $\overline{\lambda}>0$ and $\psi_{2}>1$ , both the denominator and numerator is positive, so that $\partial_{\overline{\lambda}}\omega>0$ .

Step 2. When $\psi_{2}<1$ . Note $\omega$ is the negative solution of the quadratic equation

Differentiating the quadratic equation with respect to $\overline{\lambda}$ , we have

We can see that, since $\omega<0$ , when $\psi_{2}<1$ , both the denominator and numerator is negative. This proves $\partial_{\overline{\lambda}}\omega>0$ when $\psi_{2}<1$ . ∎

Acknowledgements

This work was partially supported by grants NSF CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729.

References

Appendix A Technical background

The dimension of each subspace is given by

A.2 Gegenbauer polynomials

We will use the following properties of Gegenbauer polynomials

Note in particular that property 2 implies that –up to a constant– $Q_{k}^{(d)}(\langle{\bm{x}},{\bm{y}}\rangle)$ is a representation of the projector onto the subspace of degree- $k$ spherical harmonics

then we have the following equation holds in $L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d})$ sense

A.3 Hermite polynomials

The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a $\sqrt{d}$ scaling in domain) are constructed by Gram-Schmidt orthogonalization of the monomials $\{x^{k}\}_{k\geq 0}$ with respect to the measure $\tau_{d}$ , while Hermite polynomial are obtained by Gram-Schmidt orthogonalization with respect to $\mu_{G}$ . Since $\tau_{d}\Rightarrow\mu_{G}$ (here $\Rightarrow$ denotes weak convergence), it is immediate to show that, for any fixed integer $k$ ,

Here and below, for $P$ a polynomial, ${\rm Coeff}\{P(x)\}$ is the vector of the coefficients of $P$ . As a consequence, for any fixed integer $k$ , we have

where $\mu_{k}(\sigma)$ and $\lambda_{d,k}(\sigma)$ are given in Eq. (223) and (220).

Appendix B Proof of Proposition 8.2

Moreover, the set of eigenvalues of ${\bm{A}}({\bm{q}})-\xi{\mathbf{I}}_{M}$ and $\det({\bm{A}}({\bm{q}})-\xi{\mathbf{I}}_{M})$ are continuous with respect to ${\bm{q}}$ . Therefore, for any perturbation $\Delta{\bm{q}}$ with $\|\Delta{\bm{q}}\|_{2}\leq\varepsilon$ and $\varepsilon$ small enough, we have $k({\bm{q}}+\Delta{\bm{q}},\xi)=k({\bm{q}},\xi)$ . As a result, we have

Moreover, ${\bm{A}}({\bm{q}})$ (defined as in Eq. (60)) is a linear matrix function of ${\bm{q}}$ , which gives $\partial_{q_{i},q_{j}}{\bm{A}}({\bm{q}})={\bm{0}}$ . Hence we have

and using the formula for block matrix inversion, we have

With simple algebra, we can show that Eq. (226) holds.

Appendix C Additional Proofs in Section 9

We define the sequence $(F_{d,k}^{2})_{k\geq 2}$ to be the coefficients of Gegenbauer expansion of $\Sigma_{d}$ :

In the expansion, the zeroth and first order coefficients are , because, according to Assumption 3,

To check point (1), we have $\Sigma_{d}(1)=\sum_{k=2}^{\infty}F_{d,k}^{2}Q_{k}^{(d)}(d)=\sum_{k=2}^{\infty}F_{d,k}^{2}$ , and by Assumption 3 we have $\lim_{d\to\infty}\Sigma_{d}(1)=F_{\star}^{2}$ , so that (1) holds.

To check point (2), defining $({\bm{\beta}}_{d,k})_{k\geq 2}$ and $g_{d}^{\mbox{\tiny\rm NL}}({\bm{x}})$ accordingly, we have

With a little abuse of notations, let us define

Moreover, note that ${\bm{X}}$ , ${\bm{\Theta}}$ , ${\bm{\varepsilon}}$ , and $f_{d}^{\mbox{\tiny\rm NL}}$ are mutually independent, ${\bm{X}}\stackrel{{\scriptstyle d}}{{=}}{\bm{X}}{\mathcal{O}}^{\mathsf{T}}$ , ${\bm{\Theta}}\stackrel{{\scriptstyle d}}{{=}}{\bm{\Theta}}{\mathcal{O}}^{\mathsf{T}}$ , and $f_{d}^{\mbox{\tiny\rm NL}}\stackrel{{\scriptstyle d}}{{=}}{\mathcal{T}}_{{\mathcal{O}}}[f_{d}^{\mbox{\tiny\rm NL}}]$ . Then, for any fixed ${\bm{\beta}}_{d,1}$ , we have

C.2 Proof of Lemma 9.6 and Lemma 9.7

To prove Lemma 9.6 and 9.7, first we state a lemma that reformulate $A_{1},A_{2}$ and $B_{\alpha}$ using Sherman-Morrison-Woodbury formula.

Note we have (denoting $\lambda_{d,0}=\lambda_{d,0}(\sigma)$ )

Hence we have (denoting ${\bm{T}}_{2}={\bm{J}}^{\mathsf{T}}\bm{1}_{n}/\sqrt{n}$ )

By the Sherman-Morrison-Woodbury formula, we have

Simplifying this formula using simple algebra proves Eq. (230). ∎

Step 1. Term $A_{1}$ . By Lemma C.1, we get

We can simplify $S_{20}$ in Eq. (233) further, and get

It is easy to see that $0\leq A\leq 1$ . Hence the high probability bound translates to an expectation bound. This proves the lemma. ∎

For notation simplicity, we prove this lemma under the case when ${\mathcal{A}}=\{\alpha\}$ which is a singleton. We denote $B=B_{\alpha}$ . The proof can be directly generalized to the case for arbitrary set ${\mathcal{A}}$ .

By Lemma C.1 (when applying Lemma C.1, we change the role of $N$ and $n$ , and the role of ${\bm{\Theta}}$ and ${\bm{X}}$ ; this can be done because the role of ${\bm{\Theta}}$ and ${\bm{X}}$ is symmetric), we have

Note we have shown in the proof of Lemma 9.6 that

Since all the quantities above are deterministically bounded by a constant, these high probability bounds translate to expectation bounds.

Plugging in the above bounds into Equation (235), we have

C.3 Some auxiliary lemmas

We denote by $\mu_{d}$ the probability law of $\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/\sqrt{d}$ when ${\bm{x}}_{1},{\bm{x}}_{2}\sim_{iid}{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{d})$ . Note that $\mu_{d}$ is symmetric, and $\int x^{2}\mu_{d}({\rm d}x)=1$ . By the central limit theorem, $\mu_{d}$ converges weakly to $\mu_{G}$ as $d\to\infty$ , where $\mu_{G}$ is the standard Gaussian measure. In fact, we have the following stronger convergence result.

For any $\lambda\in[-\sqrt{d}/2,\sqrt{d}/2]$ , we have

In order to prove Eq. (236), we note that the left hand side is given by

where the last inequality holds for $|\lambda|\leq\sqrt{d}/2$ using the fact that $(1-x)^{-1}\leq e^{2x}$ for $x\in[0,1/4]$ .

The next several lemmas establish general bounds on the operator norm of random kernel matrices which is of independent interest.

Then there exists a constant $C$ depending uniquely on $c_{0},c_{1},c_{2}$ , and a sequence of numbers $(\overline{\eta}_{d})_{d\geq 1}$ with $|\overline{\eta}_{d}|\leq C\exp\{C(\log d)^{1/2}\}$ , such that

By Lemma C.2 and Markov inequality, we have, for any $i\neq j$ and all $0\leq t\leq\sqrt{d}$ ,

By [DM16, Lemma 20], there exists a constant $C$ such that

Defining the event $\mathcal{G}\equiv\{|\langle\overline{\bm{z}}_{i},\overline{\bm{z}}_{j}\rangle/\sqrt{d}|\leq 16\sqrt{\log d},\,\forall 1\leq i<j\leq M\}$ , we have

Then there exists a constant $C$ depending uniquely on $c_{0},c_{1},c_{2}$ , such that

In the proof of this lemma, we assume $\sigma$ has continuous derivatives. In the case when $\sigma$ is only weak differentiable, the proof is the same, except that we need to replace the mean value theorem to its integral form.

Define $r_{i}=\sqrt{d}/\|\overline{\bm{z}}_{i}\|_{2}$ , and

By the concentration of $\chi$ -squared distribution, it is easy to see that

Moreover, we have (for $\zeta_{i}$ between $r_{i}$ and $1$ )

Moreover by the assumption that $|\sigma^{\prime}(u)|\leq c_{0}e^{c_{1}|u|}$ , we have

Then there exists a constant $C$ depending uniquely on $c_{0},c_{1},c_{2}$ , such that

By Lemma C.3, there exists a sequence $(\overline{\eta}_{d})_{d\geq 0}$ with $|\overline{\eta}_{d}|\leq C\exp\{C(\log d)^{1/2}\}$ , such that

Combining this equation with Eq. (245), we get

C.4 The decomposition of kernel inner product matrices

The following lemma (Lemma C.6) is a reformulation of Proposition 3 in [GMMM19]. We present it in a stronger form, but it can be easily derived from the proof of Proposition 3 in [GMMM19]. This lemma was first proved in [EK10] in the Gaussian case. (Notice that the second estimate —on $Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})$ — follows by applying the first one whereby ${\bm{\Theta}}$ is replaced by ${\bm{W}}=[{\bm{\Theta}}^{{\mathsf{T}}}|{\bm{X}}^{{\mathsf{T}}}]^{{\mathsf{T}}}$

Notice that the second estimate —on $Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})$ — follows by applying the first one —Eq. (246)— whereby ${\bm{\Theta}}$ is replaced by ${\bm{W}}=[{\bm{\Theta}}^{{\mathsf{T}}}|{\bm{X}}^{{\mathsf{T}}}]^{{\mathsf{T}}}$ , and we use $\|Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})\|_{{\rm op}}\leq\|Q_{k}({\bm{W}}{\bm{W}}^{\mathsf{T}})-{\mathbf{I}}_{N+n}\|_{{\rm op}}$ .

The following lemma (Lemma C.7) can be easily derived from Lemma C.6. Again, this lemma was first proved in [EK10] in the Gaussian case.

Then we can rewrite the matrix ${\bm{U}}$ to be

C.5 A lemma on the variance of the quadratic form

Step 1. Term ${\bm{g}}^{\mathsf{T}}{\bm{A}}{\bm{h}}$ . Calculating the expectation, we have

Step 2. Term ${\bm{g}}^{\mathsf{T}}{\bm{B}}{\bm{g}}$ . Calculating the expectation, we have

Appendix D Proof of Lemma 8.1

where we have, for ${\bm{u}}=(s_{1},s_{2},t_{1},t_{2},z_{1},z_{2})^{\mathsf{T}}$

Appendix E Proof sketch for Theorem 6

In this section, we sketch the calculations of Theorem 6. We assume $\psi_{1,d}\equiv N/d=\psi_{1}$ and $\psi_{2,d}\equiv n/d=\psi_{2}$ are constants independent of $d$ . Recall that the definitions of two useful resolvent matrix ${\bm{\Xi}}$ and ${\bm{\Pi}}$ are

Step 1. The expectation of regularized training error.

By Eq. (45), the regularized training error of random features regression gives

Step 3. The derivatives of the log determinant.

where $g$ is given in Eq. (74). The derivatives of $g$ can be obtained by differentiating Eq. (73) and using Daskin’s theorem. The theorem then follows by simple calculus.