The generalization error of random features regression: Precise asymptotics and double descent curve

Song Mei, Andrea Montanari

Introduction

Statistical lore recommends not to use models that have too many parameters since this will lead to ‘overfitting’ and poor generalization. Indeed, a plot of the test error as a function of the model complexity often reveals a U-shaped curve. The test error first decreases because the model is less and less biased, but then increases because of a variance explosion [HTF09]. In particular, the interpolation threshold, i.e., the threshold in model complexity above which the training error vanishes (the model completely interpolates the data), corresponds to a large test error. It seems wise to keep the model complexity well below this threshold in order to obtain a small generalization error.

These classical prescriptions are in stark contrast with the current practice in deep learning. The number of parameters of modern neural networks can be much larger than the number of training samples, and the resulting models are often so complex that they can perfectly interpolate the data. Even more surprisingly, they can interpolate the data when the actual labels are replaced by pure noise [ZBH+16]. Despite such a large complexity, these models have small test error and can outperform others trained in the classical underparametrized regime.

This behavior has been rationalized in terms of a so-called ‘double-descent’ curve [BMM18, BHMM19]. A plot of the test error as a function of the model complexity follows the traditional U-shaped curve until the interpolation threshold. However, after a peak at the interpolation threshold, the test error decreases, and attains a global minimum in the overparametrized regime. In fact, the minimum error often appears to be ‘at infinite complexity’: the more overparametrized is the model, the smaller is the error. It is conjectured that the good generalization behavior in this highly overparametrized regime is due to the implicit regularization induced by gradient descent learning: among all interpolating models, gradient descent selects the simplest one, in a suitable sense. An example of double descent curve is plotted in Fig. 1. The main contribution of this paper is to describe a natural, analytically tractable model leading to this generalization curve, and to derive precise formulae for the same curve, in a suitable asymptotic regime.

Despite its simplicity, this random covariates model captures several features of the double descent scenario. In particular, the asymptotic generalization curve is U-shaped for γ<1\gamma<1, diverging at the interpolation threshold γ=1\gamma=1, and descends again exceeding that threshold. The divergence at γ=1\gamma=1 is explained by an explosion in the variance, which is in turn related to a divergence of the condition number of the random matrix X{\bm{X}}. At the same time, this simple model misses some interesting features that are observed in more complex settings: (i)(i) In the Gaussian covariates model, the global minimum of the test error is achieved in the underparametrized regime γ<1\gamma<1, unless ad-hoc misspecification structure is assumed; (ii)(ii) The number of parameters is tied to the covariates dimension dd and hence the effects of overparametrization are not isolated from the effects of the ambient dimensions; (iii)(iii) Ridge regression, with some regularization λ>0\lambda>0, is always found to outperform the ridgeless limit λ0\lambda\to 0. Moreover, this linear model is not directly connected to actual neural networks, which are highly nonlinear in the covariates xi{\bm{x}}_{i}.

In this paper, we study the random features model of Rahimi and Recht [RR08]. The random features model can be viewed either as a randomized approximation to kernel ridge regression, or as a two-layers neural networks with random first layer wights. We compute the precise asymptotics of the test error and show that it reproduces all the qualitative features of the double-descent scenario.

We learn the coefficients a=(ai)iN{\bm{a}}=(a_{i})_{i\leq N} by performing ridge regression

The choice of ridge penalty is motivated by the connection to kernel ridge regression, of which this method can be regarded as a finite-rank approximation. Further, the ridge regularization path is naturally connected to the path of gradient flow with respect to the mean square error in(yif(xi;a,Θ))2\sum_{i\leq n}(y_{i}-f({\bm{x}}_{i};{\bm{a}},{\bm{\Theta}}))^{2}, starting at a=0{\bm{a}}=0. In particular, gradient flow converges to the ridgeless limit (λ0\lambda\to 0) of a^(λ)\hat{\bm{a}}(\lambda), and there is a correspondence between positive λ\lambda, and early stopping in gradient descent [YRC07].

N,n,dN,n,d lie in a proportional asymptotics regime. Namely, N,n,dN,n,d\to\infty with N/dψ1N/d\to\psi_{1}, n/dψ2n/d\to\psi_{2} for some ψ1,ψ2(0,)\psi_{1},\psi_{2}\in(0,\infty).

where expectation is taken with respect to GN(0,1)G\sim{\mathsf{N}}(0,1). Assume μ0,μ1,μ0\mu_{0},\mu_{1},\mu_{\star}\neq 0.

Then, for linear fdf_{d} in the setting described above, for any λ>0\lambda>0, the prediction risk convergences in probability

where \mathscrsfsR(ρ,ζ,ψ1,ψ2,λ)\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\overline{\lambda}) is explicitly given in Definition 1.

Section 5.1 also contains an analogous statement for the nonlinear model.

Theorem 1 and its generalizations stated below require λ>0\lambda>0 fixed as N,n,dN,n,d\to\infty. We can then consider the ridgeless limit by taking λ0\lambda\to 0. Let us stress that this does not necessarily yield the prediction risk of the min-norm least square estimator that is also given by the limit a^(0+)limλ0a^(λ)\hat{\bm{a}}(0+)\equiv\lim_{\lambda\to 0}\hat{\bm{a}}(\lambda) at N,n,dN,n,d fixed. Denoting by Z=σ(XΘT/d)/d{\bm{Z}}=\sigma({\bm{X}}{\bm{\Theta}}^{{\mathsf{T}}}/\sqrt{d})/\sqrt{d} the design matrix, the latter is given by a^(0+)=(ZTZ)ZTy/d\hat{\bm{a}}(0+)=({\bm{Z}}^{{\mathsf{T}}}{\bm{Z}})^{\dagger}{\bm{Z}}^{{\mathsf{T}}}{\bm{y}}/\sqrt{d}. While we conjecture that indeed this is the same as taking λ0\lambda\to 0 in the asymptotic expression of Theorem 1, establishing this rigorously would require proving that the limits λ0\lambda\to 0 and dd\to\infty can be exchanged. We leave this to future work.

Our work is the first one to provide a complete treatment of the nonparametric model in the proportional asymptotics and to establish those phenomena. From a mathematical viewpoint, the calculation of the test error can be reduced to studying a block-structured kernel random matrix, with a more intricate structure than the one of [HMRT19]. The reduction itself is novel in the present context, and goes through the log determinant of this random matrix, while the variance computation of [HMRT19] is directly connected to the resolvent.

The proof of Theorem 1 builds on ideas from random matrix theory. A careful look at these arguments unveils an interesting phenomenon. While the random features {σ(θi,x/d)}id\{\sigma(\langle{\bm{\theta}}_{i},{\bm{x}}\rangle/\sqrt{d})\}_{i\leq d} are highly non-Gaussian, it is possible to construct a Gaussian covariates model with the same asymptotic prediction error as for the random features model. Apart from being mathematically interesting, this finding provides additional intuition for the behavior of random features models, and opens the way to some interesting future directions. In particular, [MRSY19] uses this Gaussian covariates proxy to analyze maximum margin classification using random features.

The rest of the paper is organized as follows:

In Section 2 we summarize the main insights that can be extracted from the asymptotic theory, and illustrate them through plots.

Section 3 provides a succinct overview of related work.

Section 4 introduces the notations that are used in this paper.

Section 5 contains formal statements of our main results, which is the asymptotics of prediction error as in Theorem 2. This section also presents some special cases of the asymptotic formula.

Section 6 contains the statements of the asymptotics of the training error as in Theorem 6.

Section 7 presents an interesting phenomenon which is that the random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates.

In Section 8 we present the proof of main results. The main results will use several propositions that are proved in the following sections and in the appendices.

Results and insights: An informal overview

Before explaining in detail our technical results —which we will do in Section 5— it is useful to pause and describe some consequences of the exact asymptotic formulae that we prove. Our focus here will be on insights that have a chance to hold more generally, beyond the specific setting studied here.

Bias term also exhibits a singularity at the interpolation threshold. A prominent feature of the double descent curve is the peak in test error at the interpolation threshold which, in the present case, is located at ψ1=ψ2\psi_{1}=\psi_{2}. In the linear regression model of [AS17, HMRT19, BHX19], this phenomenon is entirely explained by a peak in the variance of the estimator (that diverges in the ridgeless limit λ0\lambda\to 0), while its bias is monotone increasing across to this threshold.

In contrast, in the random features model studied here, both variance and bias have a peak at the interpolation threshold, diverging there when λ0\lambda\to 0. This is apparent from Figure 1 which was obtained for τ2=0\tau^{2}=0, and therefore in a setting in which the error is entirely due to bias. The fact that the double descent scenario persists in the noiseless limit is particularly important, especially in view of the fact that many machine learning tasks are usually considered nearly noiseless.

Optimal prediction error is achieved in the highly overparametrized regime. Figure 2 (left) reports the predicted test error in the ridgeless limit λ0\lambda\to 0 (for a case with non-vanishing noise, τ2>0\tau^{2}>0) as a function of ψ1=N/d\psi_{1}=N/d , for several values of ψ2=n/d\psi_{2}=n/d. Figure 3 plots the predicted test error as a function of ψ1/ψ2=N/n\psi_{1}/\psi_{2}=N/n, for fixed ψ2\psi_{2}, several values of λ>0\lambda>0, and two values of the SNR. We repeatedly observe that: (i)(i) For a fixed λ\lambda, the minimum of test error (over ψ1\psi_{1}) is in the highly overparametrized regime ψ1\psi_{1}\to\infty; (ii)(ii) The global minimum (over λ\lambda and ψ1\psi_{1}) of test error is achieved at a value of λ\lambda that depends on the SNR, but always at ψ1\psi_{1}\to\infty; (iii)(iii) In the ridgeless limit λ0\lambda\to 0, the generalization curve is monotonically decreasing in ψ1\psi_{1} when ψ1>ψ2\psi_{1}>\psi_{2}.

To the best of our knowledge, this is the first natural and analytically tractable model which satisfies the following requirements: (1)(1) Large overparametrization is necessary to achieve optimal prediction; (2)(2) No special misspecification structure needs to be postulated.

Optimal regularization eliminated the double-descent. Figure 3 reports the asymptotic prediction for the test error as a function of the overparametrization ratio N/nN/n for various values of the regularization parameter λ\lambda. The peak at the interpolation threshold N=nN=n is apparent, but it becomes less prominent as the regularization increases. In particular, if we consider the optimal regularization (the lower envelope of these curves), the test error becomes monotone decreasing in the number of parameters: regularization compensates overparametrization.

Non-vanishing regularization can hurt (at high SNR). Figure 4 plots the predicted test error as a function of λ\lambda, for several values of ψ1\psi_{1}, with ψ2\psi_{2} fixed. The lower envelope of these curves is given by the curve at ψ1\psi_{1}\to\infty, confirming that the optimal error is achieved in the highly overparametrized regime. However the dependence of this lower envelope on λ\lambda changes qualitatively, depending on the SNR. For small SNR, the global minimum is achieved as some λ>0\lambda>0: regularization helps. However, for a large SNR the minimum error is achieved as λ0\lambda\to 0. The optimal regularization is vanishingly small.

These two noise regime are separated by a phase transition at a critical SNR which we denote by ρ\rho_{\star}. A characterization of this critical value is given in Section 5.2.2.

Self-induced regularization. What is the mechanism underlying the optimality of the ridgeless limit λ0\lambda\to 0? An intuitive explanation can be obtained by considering the (random) kernel associated to the ridge regression (2), namely

Related literature

A recent stream of papers studied the generalization behavior of machine learning models in the interpolation regime. An incomplete list of references includes [BMM18, BRT19, LR18, BHMM19, RZ18]. The starting point of this line of work were the experimental results in [ZBH+16, BMM18], which showed that deep neural networks as well as kernel methods can generalize even if the prediction function interpolates all the data. It was proved that several machine learning models including kernel regression [BRT19] and kernel ridgeless regression [LR18] can generalize under certain conditions.

The double descent phenomenon, which is our focus in this paper, was first discussed in general terms in [BHMM19]. The same phenomenon was also observed in [AS17, GJS+19]. The paper [KLS18] observes that the optimal amount of ridge regularization is sometimes vanishing, and provides an explanation in terms of noisy features. Analytical predictions confirming this scenario were obtained, within the linear regression model, in two concurrent papers [HMRT19, BHX19]. In particular, [HMRT19] derives the precise high-dimensional asymptotics of the prediction error, for a general model with correlated covariates. On the other hand, [BHX19] gives exact formula for any finite dimension, for a model with i.i.d. Gaussian covariates. The same papers also compute the double descent curve within other models, including over-specified linear model [HMRT19], and a Fourier series model [BHX19].

As mentioned in the introduction, [HMRT19, Section 8] also calculates the variance term of the prediction error in the random features model in the ridgeless limit λ0\lambda\to 0. Both the simple linear regression models of [HMRT19, BHX19], and the variance calculation of [HMRT19, Section 8] capture the peak of the test error at the interpolation threshold. However, these calculations do not elucidate several crucial statistical phenomena, which are instead the main contribution of our work (see Section 2): optimality of large overparametrization; optimality of interpolators at high SNR (λ0\lambda\to 0 limit); the role of self-induced regularization; disappearance of the double descent at optimal overparametrization.

Rate-optimal bounds on the generalization error of overparametrized linear models were recently derived in [BLLT20] (see also [MVS19] for a different perspective).

2 Random features and kernels

The random features model has been studied in considerable depth since the original work in [RR08]. A classical viewpoint suggests that FRF(Θ){\mathcal{F}}_{{\rm RF}}({\bm{\Theta}}) should be regarded as random approximation of the reproducing kernel Hilbert space FH{\mathcal{F}}_{H} defined by the kernel

Indeed FRF(Θ){\mathcal{F}}_{{\rm RF}}({\bm{\Theta}}) is an RKHS defined by the finite-rank approximation of this kernel defined in Eq. (6). The paper [RR08] showed the pointwise convergence of the empirical kernel HNH_{N} to HH. Subsequent work [Bac17b] showed the convergence of the empirical kernel matrix to the population kernel in terms of operator norm and derived bound on the approximation error (see also [Bac13, AM15, RR17] for related work).

The setting in the present paper is quite different, since we take the limit of a large number neurons NN\to\infty, together with large dimension dd\to\infty. Our focus on this high-dimensional regime is partially motivated by [RZ18], which emphasizes that optimality of interpolators is somewhat un-natural in low dimension.

It is well-known that approximation using two-layers network suffers from the curse of dimensionality, in particular when first-layer weights are not trained [DHM89, Bac17a, VW18, GMMM19]. The recent paper [GMMM19] studies random features regression in a setting similar to ours, by considering two different regimes: (1)(1) the population limit n=n=\infty, with NN scaling as a polynomial of dd; (2)(2) the wide limit N=N=\infty, with nn scaling as a polynomial of dd. In particular, [GMMM19] proves that, if dk+δNdk+1δd^{k+\delta}\leq N\leq d^{k+1-\delta} and n=n=\infty or dk+δndk+1δd^{k+\delta}\leq n\leq d^{k+1-\delta} and N=N=\infty then a random features model can only fit the projection of the true function fdf_{d} onto degree-kk polynomials.

3 Technical contribution

The asymptotic singular values distribution of Z{\bm{Z}} is not sufficient to compute the asymptotic prediction error, which also depends on the singular vectors of Z{\bm{Z}}. The paper [HMRT19] addresses this challenge for what concerns the variance term of the error, and only in the limit λ0\lambda\to 0. Notice that the variance term is given (up to constants) by Tr((ZTZ)Σ){\rm Tr}(({\bm{Z}}^{{\mathsf{T}}}{\bm{Z}})^{\dagger}{\bm{\Sigma}}). It is quite straightforward to express this quantity in terms of the Stieltjes transform of a certain block random matrix, and [HMRT19] use the leave-one-out method to characterize the asymptotics of this Stieltjes transform.

Unfortunately, the approach of [HMRT19] cannot be pushed to compute the full test error (i.e. both the bias and variance terms): the latter cannot be expressed in terms of the Stieltjes transform of the same matrix. A key observation of the present paper is that the full prediction error can be expressed in terms of derivatives of the log-determinant of a different block-structured random matrix. In order to compute the asymptotics of this log-determinant, we use leave-one-out arguments (e.g. [BS10, Chapter 3.3]) to derive fixed point equations for the Stieltjes transform of this random matrix, and then integrate this Stieltjes transform.

One further difference from [HMRT19] is that we consider the full nonparametric model yi=fd(xi)+εiy_{i}=f_{d}({\bm{x}}_{i})+\varepsilon_{i} while for the calculation of [HMRT19] does not model the target function. As mentioned above, our setting is similar to the one of [GMMM19]. However, the main technical content of [GMMM19] is to prove that, under polynomial scalings of nn and dd (at N=N=\infty) or NN and dd (at n=n=\infty), the kernel matrix is near isometric. In contrast, here we study a regime in which it is not true that the same matrix is a near isometry, and we characterize its spectral distribution (alongside those properties of the eigenvectors that determine the test error).

Notations

Main results

We begin by stating our assumptions and notations for the activation function σ\sigma. It is straightforward to check that these are satisfied by all commonly-used activations, including ReLU and sigmoid functions.

where expectation is with respect to GN(0,1)G\sim{\mathsf{N}}(0,1). Assuming 0<μ02,μ12,μ2<0<\mu_{0}^{2},\mu_{1}^{2},\mu_{\star}^{2}<\infty, define ζ{\zeta} by

We will consider sequences of parameters (N,n,d)(N,n,d) that diverge proportionally to each other. When necessary, we can think such sequences to be indexed by dd, with N=N(d)N=N(d), n=n(d)n=n(d) functions of dd.

Defining ψ1,d=N/d\psi_{1,d}=N/d and ψ2,d=n/d\psi_{2,d}=n/d, we assume that the following limits exist in (0,)(0,\infty):

The last assumption covers, as a special case, deterministic linear functions fd(x)=βd,0+βd,1,xf_{d}({\bm{x}})=\beta_{d,0}+\langle{\bm{\beta}}_{d,1},{\bm{x}}\rangle, but also a large class of random non-linear functions. As an example, let G=(Gij)i,jd{\bm{G}}=(G_{ij})_{i,j\leq d}, where (Gij)i,jdiidN(0,1)(G_{ij})_{i,j\leq d}\sim_{iid}{\mathsf{N}}(0,1), and consider the random quadratic function

Higher order polynomials can be constructed analogously (or using the expansion of fdf_{d} in spherical harmonics).

We also emphasize that that the nonlinear part fd\mboxNL(x2)f_{d}^{\mbox{\tiny\rm NL}}({\bm{x}}_{2}), although being random, is the same for all samples, and hence should not be confused with additive noise ε\varepsilon.

We finally introduce the formula for the asymptotic prediction error, denoted by \mathscrsfsR(ρ,ζ,ψ1,ψ2,λ)\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\lambda) in Theorem 1.

(iii)(iii) (ν1(ξ),ν2(ξ))(\nu_{1}(\xi),\nu_{2}(\xi)) is the unique solution of these equations with ν1(ξ)ψ1/(ξ)|\nu_{1}(\xi)|\leq\psi_{1}/\Im(\xi), ν2(ξ)ψ2/(ξ)|\nu_{2}(\xi)|\leq\psi_{2}/\Im(\xi) for (ξ)>C\Im(\xi)>C, with CC a sufficiently large constant.

The formula for the asymptotic risk can be easily evaluated numerically. In order to gain further insight, it can be simplified in some interesting special cases, as shown in Section 5.2.

We are now in position to state our main theorem, which generalizes Theorem 1 to the case in which fdf_{d} has a nonlinear component fd\mboxNLf_{d}^{\mbox{\tiny\rm NL}}.

Then for any value of the regularization parameter λ>0\lambda>0, the asymptotic prediction error of random features ridge regression satisfies

If the regression function fd(x)f_{d}({\bm{x}}) is linear (i.e., fd\mboxNL(x)=0f^{\mbox{\tiny\rm NL}}_{d}({\bm{x}})=0), we recover Theorem 1, where \mathscrsfsR\mathscrsfs{R} is defined as per Eq. (20). Numerical experiments suggest that Eq. (21) holds for any deterministic nonlinear functions fdf_{d} as well, and that the convergence in Eq. (21) is uniform over λ\lambda in compacts. We defer the study of these stronger properties to future work.

Note that the formula for a nonlinear truth, cf. Eq. (21), is almost identical to the one for a linear truth in Eq. (5). In fact, the only difference is that the the prediction error increases by a term F2F_{\star}^{2}, and the noise level τ2\tau^{2} is replaced by τ2+F2\tau^{2}+F_{\star}^{2}.

Figure 5 illustrates the last remark. We report the simulated and predicted test error as a function of ψ1/ψ2=N/n\psi_{1}/\psi_{2}=N/n, for three different choices of the function fdf_{d} and noise level τ2\tau^{2}. In all the settings, the total power of nonlinearity and noise is F2+τ2=0.5F_{\star}^{2}+\tau^{2}=0.5, while the power of the linear component is F12=1F_{1}^{2}=1. The test errors in these three settings appear to be very close, as predicted by our theory.

The terms \mathscrsfsB\mathscrsfs{B} and \mathscrsfsV\mathscrsfs{V} in Eq. (21) correspond to the the limits of the bias and variance of the estimated function f(x;a^(λ),Θ)f({\bm{x}};\hat{\bm{a}}(\lambda),{\bm{\Theta}}), when the ground truth function fdf_{d} is linear. That is, for fdf_{d} to be a linear function, we have

2 Simplifying the asymptotic risk in special cases

In order to gain further insight into the formula for the asymptotic risk \mathscrsfsR(ρ,ζ,ψ1,ψ2,λ)\mathscrsfs{R}(\rho,{\zeta},\psi_{1},\psi_{2},\overline{\lambda}), we consider here three special cases that are particularly interesting:

The highly overparametrized regime ψ1\psi_{1}\to\infty (recall that ψ1=limdN/d\psi_{1}=\lim_{d\to\infty}N/d).

The large sample limit ψ2\psi_{2}\to\infty (recall that ψ2=limdn/d\psi_{2}=\lim_{d\to\infty}n/d).

Let us emphasize that these limits are taken after the limit N,n,dN,n,d\to\infty with N/dN/d\to\infty and n/dn/d\to\infty. Hence, the correct interpretation of the highly overparametrized regime is not that the width NN is infinite, but rather much larger than dd (more precisely, larger than any constant times dd). Analogously, the large sample limit does not coincide with infinite sample size nn, but instead sample size that is much larger than dd.

The ridgeless limit λ0+\lambda\to 0+ is important because it captures the asymptotic behavior the min-norm interpolation predictor (see also Remark 1.)

Under the assumptions of Theorem 2, set ψmin{ψ1,ψ2}\psi\equiv\min\{\psi_{1},\psi_{2}\} and define

Then the asymptotic prediction error of random features ridgeless regression is given by

The proof of this result can be found in Section 12.

The next proposition establishes the main qualitative properties of the ridgeless limit.

Recall the bias and variance functions \mathscrsfsB\mboxrless\mathscrsfs{B}_{\mbox{\tiny\rm rless}} and \mathscrsfsV\mboxrless\mathscrsfs{V}_{\mbox{\tiny\rm rless}} defined in Eq. (26) and (27). Then, for any ζ(0,){\zeta}\in(0,\infty) and fixed ψ2(0,)\psi_{2}\in(0,\infty), we have

Divergence at the interpolation threshold ψ1=ψ2\psi_{1}=\psi_{2}:

Large width limit ψ1\psi_{1}\to\infty (here χ\chi is defined as per Eq. (24)):

Above the interpolation threshold (i.e. for ψ1ψ2\psi_{1}\geq\psi_{2}), the function \mathscrsfsB\mboxrless(ζ,ψ1,ψ2)\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{1},\psi_{2}) and \mathscrsfsV\mboxrless(ζ,ψ1,ψ2)\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{1},\psi_{2}) are strictly decreasing in the rescaled number of neurons ψ1\psi_{1}.

The proof of this proposition is presented in Section 13.1.

As anticipated, point 22 establishes an important difference with respect to the random covariates linear regression model of [AS17, HMRT19, BHX19]. While in those models the peak in prediction error is entirely due to a variance divergence, in the present setting both variance and bias diverge.

Another important difference is established in point 44: both bias and variance are monotonically decreasing above the interpolation threshold. This, again, contrasts with the behavior of simpler models, in which bias increases after the interpolation threshold, or after a somewhat larger point in the number of parameters per dimension (if misspecification is added).

This monotone decrease of the bias is crucial, and is at the origin of the observation that highly overparametrized models outperform underparametrized or moderately overparametrized ones. See Figure 6 for an illustration.

2.2 Highly overparametrized regime

As the number of neurons NN diverges (for fixed dimension dd), random features ridge regression is known to approach kernel ridge regression with respect to the kernel (7). It is therefore interesting what happens when NN and dd diverge together, but NN is larger than any constant times dd.

Under the assumptions of Theorem 2, define

Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by

The proof of this result can be found in Section 12. Note that, as expected, the risk remains lower bounded by F2F_{\star}^{2}, even in the limit ψ1\psi_{1}\to\infty. Naively, one could have expected to recover kernel ridge regression in this limit, and hence a method that can fit nonlinear functions. However, as shown in [GMMM19], random features methods can only learn linear functions for N=Od(d2δ)N=O_{d}(d^{2-\delta}).

As observed in Figures 2 to 4 (which have been obtained by applying Theorem 2), the minimum prediction error is often achieved by highly overparametrized networks ψ1\psi_{1}\to\infty. It is natural to ask what is the effect of regularization on such networks. Somewhat surprisingly (and as anticipated in Section 2), we find that regularization does not always help. Namely, there exists a critical value ρ\rho_{\star} of the signal-to-noise ratio, such that vanishing regularization is optimal for ρ>ρ\rho>\rho_{\star}, and is not optimal for ρ<ρ\rho<\rho_{\star}.

In order to state formally this result, we define the following quantities

Notice in particular that \mathscrsfsR\mboxwide(ρ,ζ,ψ2,λ/μ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\psi_{2},\lambda/\mu_{\star}^{2}) is the limiting value of the prediction error (right-hand side of (36)) up to an additive constant and an multiplicative constant.

Fix ζ,ψ2(0,){\zeta},\psi_{2}\in(0,\infty) and ρ(0,)\rho\in(0,\infty). Then the function λ\mathscrsfsR\mboxwide(ρ,ζ,ψ2,λ)\overline{\lambda}\mapsto\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\psi_{2},\overline{\lambda}) is either strictly increasing in λ\overline{\lambda}, or strictly decreasing first and then strictly increasing.

The proof of this proposition is presented in Section 13.2, which also provides further information about this phase transition (and, in particular, an explicit expression for λ(ζ,ψ2,ρ)\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)).

2.3 Large sample limit

Under the assumptions of Theorem 2, define

Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by

The proof of this result can be found in Section 12.

Asymptotics of the training error

Theorem 2 establishes the exact asymptotics of the test error in the random features model. However, the technical results obtained in the proofs allow us to characterize several other quantities of interest. Here we consider the behavior of the training error and of the norm of the parameters. We define the regularized training error by

We also recall that a^(λ)\hat{\bm{a}}(\lambda) denotes the minimizer in the last expression, cf. Eq. (2) The next definition presents the asymptotic formulas for these quantities.

(iii)(iii) (ν1(ξ),ν2(ξ))(\nu_{1}(\xi),\nu_{2}(\xi)) is the unique solution of these equations with ν1(ξ)ψ1/(ξ)|\nu_{1}(\xi)|\leq\psi_{1}/\Im(\xi), ν2(ξ)ψ2/(ξ)|\nu_{2}(\xi)|\leq\psi_{2}/\Im(\xi) for (ξ)>C\Im(\xi)>C, with CC a sufficiently large constant.

We next state our asymptotic characterization of LRF(fd,X,Θ,λ)L_{\rm RF}(f_{d},{\bm{X}},{\bm{\Theta}},\lambda) and a^(λ)22\|\hat{\bm{a}}(\lambda)\|_{2}^{2}.

Then for any value of the regularization parameter λ>0\lambda>0, the asymptotic regularized training error and norm square of its minimizer satisfy

The proof of Theorem 6 is similar to the proof of Theorem 2. We will give a sketch of proof of Theorem 6 in Section E.

In this section, we illustrate Theorem 6 through numerical simulations. Figure 7 reports the theoretical prediction and numerical results for the regularized training error, the test error, and the norm of the coefficients a^(λ)\hat{\bm{a}}(\lambda). We use a small non-zero value of the regularization parameter λ=103\lambda=10^{-3}, fix the number of samples per dimension ψ2=n/d\psi_{2}=n/d, and follow these quantities as a function of the overparameterization ratio ψ1/ψ2=N/n\psi_{1}/\psi_{2}=N/n.

As expected, the behavior of the training error strikingly different from the one of the test error. The training error is monotone decreasing in the overparameterization ratio N/nN/n, and is close to zero in the overparameterized regime N/n>1N/n>1 (it is not exactly vanishing because we use a small λ>0\lambda>0). In other words, the fitted model is nearly interpolating the data, and the peak in test error matches the interpolation threshold.

On the other hand, the penalty term ψ1a^(λ)22\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2} is non-monotone: it increases up to the interpolation threshold, then decreases for N/n>1N/n>1, and converges to a constant as ψ1\psi_{1}\to\infty. If we take this as a proxy for the model complexity, the behavior of ψ1a^(λ)22\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2} provides useful intuition about descent of the generalization error. As the number of parameters increases beyond the interpolation threshold, the model complexity decreases instead of increasing.

We can confirm the intuition that the double descent of the test error is driven by the behavior of the model complexity ψ1a^(λ)22\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}, by selecting λ\lambda in an optimal way. Following [HMRT19], we expect that the optimal regularization should produce a smaller value of ψ1a^(λ)22\psi_{1}\|\hat{\bm{a}}(\lambda)\|_{2}^{2}, and hence eliminate or reduce the double descent phenomenon. Indeed, this is illustrated in Figure 8 demonstrates the prediction of the regularized training error and the test error for two choices of λ\lambda: λ=0\lambda=0, and an optimal λ\lambda such that the test error is minimized. When we choose an optimal λ\lambda, the test error becomes strictly decreasing as ψ1=N/d\psi_{1}=N/d increases. We expect this is a generic phenomenon that also holds in other interesting models.

An equivalent Gaussian covariates model

An examination of the proof of our main result (Theorem 2) reveals an interesting phenomenon. The random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates and response that is linear in these covariates, provided we use a special covariance and signal structure.

Draw xN(0,Id){\bm{x}}\sim{\mathsf{N}}(0,{\mathbf{I}}_{d}), εN(0,τ2)\varepsilon\sim{\mathsf{N}}({\bm{0}},\tau^{2}), and wN(0,IN){\bm{w}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N}) independently, conditional on Θ{\bm{\Theta}}.

Let y=β1,x+εy=\langle{\bm{\beta}}_{1},{\bm{x}}\rangle+\varepsilon.

Let u=(u1,,uN)T{\bm{u}}=(u_{1},\ldots,u_{N})^{\mathsf{T}}, uj=μ0+μ1θj,x/d+μwju_{j}=\mu_{0}+\mu_{1}\langle{\bm{\theta}}_{j},{\bm{x}}\rangle/\sqrt{d}+\mu_{\star}w_{j}, for some 0<μ0,μ1,μ<0<|\mu_{0}|,|\mu_{1}|,|\mu_{\star}|<\infty.

Remarkably, in the proportional asymptotics N,n,dN,n,d\to\infty with N/dψ1,n/dψ2N/d\to\psi_{1},n/d\to\psi_{2}, the behavior of this model is the same as the one of the nonlinear random features model studied in the rest of the paper. In particular, the asymptotic prediction error \mathscrsfsR\mathscrsfs{R} is given by the same formula as in Definition 1.

(Gaussian covariates prediction model) Define ζ\zeta and the signal-to-noise ratio ρ[0,]\rho\in[0,\infty] as

and assume μ0,μ1,μ0\mu_{0},\mu_{1},\mu_{\star}\neq 0. Then, in the Gaussian covariates model described above, for any λ>0\lambda>0, we have

where \mathscrsfsR(ρ,ζ,ψ1,ψ2,λ)\mathscrsfs{R}(\rho,\zeta,\psi_{1},\psi_{2},\overline{\lambda}) is explicitly given in Definition 1.

The proof of Theorem 7 is is almost the same as the one of Theorem 2 (with several simplifications, because of the greater amount of independence). To avoid repetitions, we will not present a proof here.

Figure 9 illustrates the content of Theorem 7 via numerical simulations. We report the simulated and predicted test error as a function of ψ1/ψ2=N/n\psi_{1}/\psi_{2}=N/n. The theoretical prediction here is exactly the same as the one reported in Figure 5. However, numerical simulations were carried out with the Gaussian covariates model instead of random features. The agreement is excellent, as predicted by Theorem 7.

Proof of Theorem 2

We begin by observing that the minimizer of the training error (2) is given by

Then a^(λ)\hat{\bm{a}}(\lambda) can be written in a simpler form a^(λ)=ΞZTy/d\hat{\bm{a}}(\lambda)={\bm{\Xi}}{\bm{Z}}^{\mathsf{T}}{\bm{y}}/\sqrt{d}. After a simple calculation, we obtain

Our first step is to replace the exact expression (55) by a simpler one involving traces of combinations of Ξ{\bm{\Xi}} and the following four random matrices:

The proof of this proposition is deferred to Section 9 and is based on the following main steps:

Finally, we show that RRF(fd,X,Θ,λ)R_{\rm RF}(f_{d},{\bm{X}},{\bm{\Theta}},\lambda) concentrates around its expectation with respect to fdf_{d} (i.e. the coefficients {βd,k}k1\{{\bm{\beta}}_{d,k}\}_{k\geq 1}) and ε{\bm{\varepsilon}}.

Here Log{\rm Log} is the complex logarithm with branch cut on the negative real axis and {λi(A)}i[M]\{\lambda_{i}({\bm{A}})\}_{i\in[M]} is the set of eigenvalues of A{\bm{A}} in non-increasing order.

The next proposition connects the quantities Ψj\Psi_{j} to the transforms GdG_{d} and MdM_{d}.

The proof of Proposition 8.2 follows by basic calculus and linear algebra, and we defer its proof to Appendix B. Despite its simplicity, this statement provides the basic scheme of our proof. We will determine the asymptotics of Md(ξ;q)M_{d}(\xi;{\bm{q}}) using a leave-one-out argument; then extract the behavior of Gd(ξ;q)G_{d}(\xi;{\bm{q}}) using Eq. (62); finally we characterize the test error using Eqs. (63) and Proposition 8.1.

The construction of the matrix A(q){\bm{A}}({\bm{q}}) is related to the linear pencil method in free probability, see e.g., [HMS18]. A significantly simpler construction was used in [HMRT19, Section 8] to calculate the variance part of the risk RRFR_{{\rm RF}} in the limit λ0\lambda\to 0 (in special cases). The approach of [HMRT19] amounts to computing the Stieltjes transform of A{\bm{A}} for p=t1=t2=0p=t_{1}=t_{2}=0, in the limit ξ0\xi\to 0: unfortunately this quantity is not sufficient to extract the prediction error. We overcome this difficulty by considering a more complex block-structured matrix and expressing the risk in terms of derivatives of the log determinant Gd(ξ;q)G_{d}(\xi;{\bm{q}}).

Unfortunately, even conditional on θN{\bm{\theta}}_{N}, the projections of (θa)a<N({\bm{\theta}}_{a})_{a<N} and (xi)in({\bm{x}}_{i})_{i\leq n} along θN{\bm{\theta}}_{N} and orthogonal to it are not independent (because of the sphere constraint). To overcome this problem we replace these by Gaussian vectors (θa)a<N(\overline{\bm{\theta}}_{a})_{a<N} and (xi)in(\overline{\bm{x}}_{i})_{i\leq n} and prove that the two distributions yield the same asymptotics of the Stieltjes transform. The decomposition of these Gaussian vectors takes the form

At this point independence can be exploited to obtain concentration results on the right-hand side. Let us emphasize that, while these paragraphs outline the main elements of the leave-one-out argument, several technical subtleties make the actual proof significantly longer, see Section 10 for details.

The proof of Proposition 8.3 is presented in Section 10. The fixed point equations (70) arise as a consequence of Eq. (64) (and the analogous equation for m2,dm_{2,d}). Indeed the proof also shows that the solution (m1,m2)(m_{1},m_{2}) of these equations gives the limit of (m1,d,m2,d)(m_{1,d},m_{2,d}) as n,N,dn,N,d\to\infty.

For a complete proof of this proposition we refer to Section 11.

We can now use Eqs. (76), (77), and (63) in Proposition 8.1, to get

The last display provides the desired asymptotics of bias and variance. However, these expressions involve derivatives of gg that are very inconvenient to evaluate. We conclude by proving more explicit expressions for these quantities. The key remark here is that the expression g(ξ;q)g(\xi;{\bm{q}}) in Proposition 8.4 has a special property: the fixed point equations (70) imply that (m1(ξ;q),m2(ξ;q))(m_{1}(\xi;{\bm{q}}),m_{2}(\xi;{\bm{q}})) is a stationary point of the function Ξ(ξ,,;q)\Xi(\xi,\,\cdot\,,\,\cdot\,;{\bm{q}}). This simplifies the calculation of derivatives with respect to q{\bm{q}}. In particular, the first derivative are obtained by computing the partial derivatives of Ξ\Xi with respect to q{\bm{q}} and evaluating it at m1,m2m_{1},m_{2}.

The proof of this lemma follows by simple calculus and can be found in Appendix D.

By the definition of analytic functions m1m_{1} and m2m_{2} (satisfying Eq. (70) and (69) with q=0{\bm{q}}={\bm{0}} as defined in Proposition 8.3), the definition of ν1\nu_{1} and ν2\nu_{2} in Eq. (85) above is equivalent to its definition in Definition 1 (as per Eq. (15)). Moreover, for χ\chi defined in Eq. (16) with λ=λ/μ2\overline{\lambda}=\lambda/\mu_{\star}^{2} and m0m_{0} defined in Eq. (82), we have

Plugging in Eq. (83) and (84) into Eq. (80) and (81) and using Eq. (86), we can see that the expressions for \mathscrsfsB\mathscrsfs{B} and \mathscrsfsV\mathscrsfs{V} defined in Eq. (80) and (81) coincide with Eq. (18) and (19) where \mathscrsfsE0,\mathscrsfsE1,\mathscrsfsE2\mathscrsfs{E}_{0},\mathscrsfs{E}_{1},\mathscrsfs{E}_{2} are provided in Eq. (17). Combining with Eq. (78) and (79) proves the theorem.

Proof of Proposition 8.1

As a useful preliminary remark, we note that the Gaussian process fd\mboxNLf_{d}^{\mbox{\tiny\rm NL}} defined in Assumption 3 can be explicitly represented as a sum of spherical harmonics with Gaussian coefficients. The following lemma is standard (see, e.g., [MP11, Proposition 6.11]). For the reader’s convenience, we present a simple proof in Appendix C.

Let us write the random variable in the left hand side of Eq. (58) as a function of βd,1{\bm{\beta}}_{d,1} and de-emphasize its dependence on other variables, i.e.,

Next, we state three lemmas that are used in the proof of Proposition 8.1.

Let λd,k(σ)\lambda_{d,k}(\sigma) be the Gegenbauer coefficients of the function σ\sigma, i.e., we have

Under the assumptions of Proposition 8.1, for any λ>0\lambda>0, we have

where S1k,S2k,S3S_{1k},S_{2k},S_{3} are given by Eq. (91), and Ψ1,Ψ2,Ψ3\Psi_{1},\Psi_{2},\Psi_{3} are given by Eq. (59).

Under the assumptions of Proposition 8.1, we have

We defer the proofs of these three lemmas to the following subsections and show here that they imply Proposition 8.1. Indeed, we have

where (a)(a) follows by Lemma 9.3 and triangular inequality, and (b)(b) from Lemma 9.4.

Recall the expression (55) for the risk. Taking expectation with respect to β{\bm{\beta}} and ε{\bm{\varepsilon}}, we get

The proof of the lemma follows by evaluating each of these three terms. It is useful to introduce the matrices Yk,x{\bm{Y}}_{k,{\bm{x}}}, Yk,θ{\bm{Y}}_{k,{\bm{\theta}}}, which denotes the evaluations of spherical harmonics of degree kk at the points {xi}in\{{\bm{x}}_{i}\}_{i\leq n} and {θa}aN\{{\bm{\theta}}_{a}\}_{a\leq N} (c.f. Appendix A):

Using these expressions, we can evaluate terms T1T_{1} and T2T_{2}:

Combining the above formulas for T1T_{1}, T2T_{2}, T3T_{3} proves Lemma 9.3.

2 Proof of Lemma 9.4

The next two lemmas will be used in the proofs of Lemma 9.4 and Lemma 9.5, and hold under the same assumptions. The first of these lemmas will be used to establish Eq. (92) (but notice that its statement does not coincide with that equation), and the second will be used to control several terms in those proofs. The proofs of these lemmas are given in Section C.2.

We will now use these lemmas to prove Lemma 9.4. We begin by recalling a few facts that are used several times in the proof. Since λ>0\lambda>0, there exists a constant C<C<\infty depending on (λ,ψ1,ψ2)(\lambda,\psi_{1},\psi_{2}) such that deterministically

By operator norm bounds on Wishart matrices [AGZ09], we have (the definition of these matrices are given in Eq. (57))

Finally we need some simple operator norm bounds on the matrices Qk(XXT)InQ_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})-{\mathbf{I}}_{n}, Qk(ΘΘT)INQ_{k}({\bm{\Theta}}{\bm{\Theta}}^{\mathsf{T}})-{\mathbf{I}}_{N}, and Qk(ΘXT)Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}}). Notice that Qk(XXT)ii=1Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})_{ii}=1 (by the normalization condition of Gegenbauer polynomials) and the out-of-diagonal entries of Qk(XXT)Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}}) have zero mean and typical size of order 1/dk/21/d^{k/2} (see Appendix A). This suggests the following estimates, which are formalized in Lemma C.6,

Since further λd,k(σ)2B(d,k)k!μk(σ)2\lambda_{d,k}(\sigma)^{2}B(d,k)k!\to\mu_{k}(\sigma)^{2} as dd\to\infty (see Eq. (225)), we have

(This estimate is stated formally in the appendices as Lemma C.7.) It is also useful to introduce the matrix

We have now finished presenting our preliminary estimates and can prove Lemma 9.4.

We begin by considering Eq. (92), where S10S_{10} and S20S_{20} are defined in Eq. (91). By the approximate linearization of U{\bm{U}} in Eq. (109), we have

Further recall the definition of A1A_{1}, and A2A_{2} in Eqs. (103), (104) and the definition of Ξ{\bm{\Xi}} and Π{\bm{\Pi}} as in Eq. (88), and we define

Then we have S10=A1S_{10}=A_{1}, S20=A2+BS_{20}=A_{2}+B and by Lemma 9.6, we have

We next consider Eq. (93), which requires to control S1kS_{1k}, defined in Eq. (91)). By Eq. (106), we have

Further note σL2(τd)2=k0λd,k(σ)2B(d,k)=Od(1)\|\sigma\|_{L^{2}(\tau_{d})}^{2}=\sum_{k\geq 0}\lambda_{d,k}(\sigma)^{2}B(d,k)=O_{d}(1), B(d,k)=Θ(dk)B(d,k)=\Theta(d^{k}), and for fixed dd, B(d,k)B(d,k) is non-decreasing in kk [GMMM19, Lemma 1]. Therefore

By Lemma 9.7 (with Mk=Qk(XXT)In\overline{\bm{M}}_{k}=Q_{k}({\bm{X}}{\bm{X}}^{\mathsf{T}})-{\mathbf{I}}_{n}) and Eq. (108), we get

We next consider Eq. (95), where we recall the definition of S11S_{11} in Eq. (91) and the definition of Ψ1\Psi_{1} in Eq. (59). By observing that limddλ1,d(σ)=μ1\lim_{d\to\infty}\sqrt{d}\lambda_{1,d}(\sigma)=\mu_{1} (see Eq. (225)) and that μ1Q1(XΘT)=μ1XΘT/d=Z1\mu_{1}Q_{1}({\bm{X}}{\bm{\Theta}}^{\mathsf{T}})=\mu_{1}{\bm{X}}{\bm{\Theta}}^{\mathsf{T}}/d={\bm{Z}}_{1}, we immediately get

In order to prove Eq. (96), recall the definition of S21S_{21} in Eq. (91) and the definition of Ψ2\Psi_{2} in Eq. (59). By the decomposition of U{\bm{U}} in Eq. (109) and recalling that Q1(XXT)=HQ_{1}({\bm{X}}{\bm{X}}^{\mathsf{T}})={\bm{H}}, we have

Finally, Eq. (97) is proved analogously to Eq. (96): this completes the proof of the lemma.

3 Proof of Lemma 9.5

To prove Lemma 9.5, we begin with by rewriting the prediction risk —cf. Eq. (55)— as (note that y=f+ε{\bm{y}}={\bm{f}}+{\bm{\varepsilon}})

This obviously implies the claims of the lemma. In the rest of this proof, we show the variance bound for Γ1\Gamma_{1}, as the other bounds are very similar.

Recall the definition of Yk,x{\bm{Y}}_{k,{\bm{x}}} and Yk,θ{\bm{Y}}_{k,{\bm{\theta}}} in Eq. (99), the definition of Gegenbauer coefficients λd,lλd,l(σ)\lambda_{d,l}\equiv\lambda_{d,l}(\sigma) in Eq. (89) and the expansion of f{\bm{f}} and V{\bm{V}} vectors in Eq. (100). We rewrite Γ1\Gamma_{1} as

Calculating the variance of Γ1\Gamma_{1} with respect to βd,kN(0,(Fd,k2/B(d,k))I){\bm{\beta}}_{d,k}\sim{\mathsf{N}}({\bm{0}},(F_{d,k}^{2}/B(d,k)){\mathbf{I}}) for k1k\geq 1 using Lemma C.8 (which follows from direct calculation), we get

Further note that σL2(τd)2=k0λd,k2B(d,k)=Od(1)\|\sigma\|_{L^{2}(\tau_{d})}^{2}=\sum_{k\geq 0}\lambda_{d,k}^{2}B(d,k)=O_{d}(1), B(d,k)=Θ(dk)B(d,k)=\Theta(d^{k}), and for fixed dd, B(d,k)B(d,k) is non-decreasing in kk [GMMM19, Lemma 1]. Therefore

Substituting above we obtain, and using the fact that k1Fd,k2=Od(1)\sum_{k\geq 1}F_{d,k}^{2}=O_{d}(1) by construction, we have

To bound the remaining two terms in this expression, note that

where the bound is implied by Lemma 9.7 (when applying Lemma 9.7, we change the role of NN and nn, the role of Ξ{\bm{\Xi}} and Π{\bm{\Pi}}, and the role of Θ{\bm{\Theta}} and X{\bm{X}}; this can be done because the role of Θ{\bm{\Theta}} and X{\bm{X}} is symmetric), and by Eq. (107) and λd,0(σ)=Θd(1)\lambda_{d,0}(\sigma)=\Theta_{d}(1) (by Assumption 1 and note that μ0(σ)=limdλd,0(σ)\mu_{0}(\sigma)=\lim_{d\to\infty}\lambda_{d,0}(\sigma) by Eq. (225)). This proves that

The bound on the first term in Eq. (114) is obtained analogously and we omit it for brevity.

Proof of Proposition 8.3

This section is organized as follows. We collect the elements to prove Proposition 8.3 in Sections 10.1, 10.2, 10.3, 10.4, and prove the proposition in Section 10.5.

In this subsection, we state Lemma 10.1, which is the key lemma that is used to prove Proposition 8.3. Lemma 10.1 studies m1,d\overline{m}_{1,d} and m2,d\overline{m}_{2,d}, the partial Stieltjes transforms of the Gaussian counterparts of the matrix A{\bm{A}} as defined in Eq. (60). This lemma shows that these partial Stieltjes transforms m1,d\overline{m}_{1,d} and m2,d\overline{m}_{2,d} approximately satisfy the fixed point equation which involves functions F1{{\sf F}_{1}} and F2{{\sf F}_{2}} as defined in Eq. (69). We will prove Lemma 10.1 in Section 10.3. Later in Section 10.4, we will show that the Gaussian counterpart of the Stieltjes transform shares the same asymptotics with its spherical version.

The matrix A\overline{\bm{A}} is in parallel with its spherical version matrix A{\bm{A}} defined as in Eq. (60).

In what follows, we will write q=(s1,s2,t1,t2,p){\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p). We would like to calculate the asymptotic behavior of the following partial Stieltjes transforms

The crucial step consists in showing that the expected Stieltjes transforms m1,d,m2,d\overline{m}_{1,d},\overline{m}_{2,d} are approximate solutions of the fixed point equations (70).

The proof of Lemma 10.1 uses a leave-one-out argument in deriving the fixed point equation for Stieltjes transform of random matrices, e.g. [BS10, Chapter 3.3] and [CS13]. We will prove Lemma 10.1 in Section 10.3. In the next subsection, we will collect some lemmas that are used in this proof.

2 Preliminaries of the proof of Lemma 10.1: Stieltjes transforms and the fixed point equation

In the following lemma, we fix a qQ{\bm{q}}\in{\mathcal{Q}} (as defined in Eq. (68)) and fix 0<ψ1,ψ2,μ1,μ<0<\psi_{1},\psi_{2},\mu_{1},\mu_{\star}<\infty. Since the parameters q,ψ1,ψ2,μ1,μ{\bm{q}},\psi_{1},\psi_{2},\mu_{1},\mu_{\star} are fixed, we will drop them from the argument of F unless necessary. In these notations, Eq. (70) reads

The following lemma shows that there exists a unique fixed point of the equation above in a certain domain provided ξ\Im\xi is large enough.

We rewrite the first equation in Eq. (69) as

In order to prove the Lipschitz continuity of F in this domain, notice that F1{{\sf F}_{1}} is differentiable and

The functions ξm1,d(ξ)\xi\mapsto\overline{m}_{1,d}(\xi), ξm2,d(ξ)\xi\mapsto\overline{m}_{2,d}(\xi) have the following properties:

The next lemma establishes the concentration of Stieltjes transforms to its mean, whose proof is the same as the proof of Lemma 9 in [HMRT19].

Let (ξ)ξ0>0\Im(\xi)\geq\xi_{0}>0 and consider the partial Stieltjes transforms Mi,d(ξ;q)\overline{M}_{i,d}(\xi;{\bm{q}}) as per Eq. (119). Then there exists c0=c0(ξ0)c_{0}=c_{0}(\xi_{0}) such that, for i{1,2}i\in\{1,2\},

Assume σ\sigma is an activation function with σ(u)2c0exp(c1u2/2)\sigma(u)^{2}\leq c_{0}\,\exp(c_{1}\,u^{2}/2) for some constants c0>0c_{0}>0 and c1<1c_{1}<1 (this is implied by Assumption 1). Then

3 Proof of Lemma 10.1: Leave-one-out argument

Throughout the proof, we write F(d)=Od(G(d))F(d)=O_{d}(G(d)) if there exists a constant C=C(ξ0,q,ψ1,ψ2,φ)C=C(\xi_{0},{\bm{q}},\psi_{1},\psi_{2},\varphi) which is uniformly bounded when (ξ0,q,ψ1,ψ2)(\xi_{0},{\bm{q}},\psi_{1},\psi_{2}) is in a compact set, such that F(d)CG(d)|F(d)|\leq C\cdot|G(d)|. We write F(d)=od(G(d))F(d)=o_{d}(G(d)) if for any ε>0\varepsilon>0, there exists a constant C=C(ε,ξ0,q,ψ1,ψ2,φ)C=C(\varepsilon,\xi_{0},{\bm{q}},\psi_{1},\psi_{2},\varphi) which is uniformly bounded when (ξ0,q,ψ1,ψ2)(\xi_{0},{\bm{q}},\psi_{1},\psi_{2}) is in a compact set, such that F(d)εG(d)|F(d)|\leq\varepsilon\cdot|G(d)| for any dCd\geq C. We use CC to denote generically such a constant, that can change from line to line.

We will assume p=0p=0 throughout the proof. For p0p\neq 0, the lemma holds by viewing J+pJ1=φ(XΘT/d)/d\overline{\bm{J}}+p\overline{\bm{J}}_{1}=\varphi_{\star}({\bm{X}}{\bm{\Theta}}^{\mathsf{T}}/\sqrt{d})/\sqrt{d} as a new kernel inner product matrix, with φ(x)=φ(x)+pμ1x\varphi_{\star}(x)=\varphi(x)+p\mu_{1}x.

Step 1. Calculate the Schur complement and define some notations.

We decompose the vectors θa,xi\overline{\bm{\theta}}_{a},\overline{\bm{x}}_{i} in the components along θN\overline{\bm{\theta}}_{N} and the orthogonal component:

We next write B\overline{\bm{B}} as the sum of three terms:

and η=(η1,,ηN1)T{\bm{\eta}}=(\eta_{1},\ldots,\eta_{N-1})^{\mathsf{T}}, u=(u1,,un)T{\bm{u}}=(u_{1},\ldots,u_{n})^{\mathsf{T}}, and

where φ(x)φ(x)μ1x\varphi_{\perp}(x)\equiv\varphi(x)-\mu_{1}x.

Step 2. Perturbation bound for the Schur complement.

Moreover, by Lemma 10.7, ω1ω2|\omega_{1}-\omega_{2}| is deterministically bounded by 2/ξ02/\xi_{0}. This gives

Using the definitions of ω1\omega_{1} and ω2\omega_{2} as in Eq. (145) and (146), for ξξ0\Im\xi\geq\xi_{0}, we have

Hence we have ω11/ξ0|\omega_{1}|\leq 1/\xi_{0}, and, using a similar argument, ω21/ξ0|\omega_{2}|\leq 1/\xi_{0}. Hence we get the bound ω1ω22/ξ0|\omega_{1}-\omega_{2}|\leq 2/\xi_{0}.

Under the assumptions of Lemma 10.1, we have

Since E1{\bm{E}}_{1} is a sub-matrix of E{\bm{E}}, we have E1opEop\|{\bm{E}}_{1}\|_{{\rm op}}\leq\|{\bm{E}}\|_{{\rm op}}. By the intermediate value theorem

Note that φ(x)=φ(x)\varphi^{\prime\prime}_{\perp}(x)=\varphi^{\prime\prime}(x) is a polynomial with some fixed degree kˉ\bar{k}. Therefore we have

Under the assumptions of Lemma 10.1, we have

For a1{\bm{a}}_{1}, note we have a12=s2η2θN2/d\|{\bm{a}}_{1}\|_{2}=|s_{2}|\cdot\|{\bm{\eta}}\|_{2}\|\overline{\bm{\theta}}_{N}\|_{2}/d where ηN(0,IN1){\bm{\eta}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N-1}) and θNN(0,Id)\overline{\bm{\theta}}_{N}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{d}) are independent. Hence we have

For a2{\bm{a}}_{2}, note φ\varphi is a polynomial with some fixed degree kˉ\bar{k}, hence we have

Step 3. Simplification using Sherman-Morrison-Woodbury formula.

Notice that Δ{\bm{\Delta}} is a matrix with rank at most two. Indeed

Since we assumed qQ{\bm{q}}\in{\mathcal{Q}} so that s2t2μ12/2|s_{2}t_{2}|\leq\mu_{1}^{2}/2, the matrix M{\bm{M}} is invertible with M1opC\|{\bm{M}}^{-1}\|_{{\rm op}}\leq C.

Recall the definition of ω2\omega_{2} in Eq. (146). By the Sherman-Morrison-Woodbury formula, we get

By auxiliary Lemmas 10.10, 10.11, and 10.12 below, we get

Elementary algebra simplifying Eq. (159) gives ψ1,dω3=F1(m1,d,m2,d;ξ;q,ψ1,d,ψ2,d,μ1,μ)\psi_{1,d}\omega_{3}={{\sf F}_{1}}(\overline{m}_{1,d},\overline{m}_{2,d};\xi;{\bm{q}},\psi_{1,d},\psi_{2,d},\mu_{1},\mu_{\star}). This proves Eq. (120) in Lemma 10.1. Eq. (121) follows by the same argument (exchanging NN and nn). In the rest of this section, we prove auxiliary Lemmas 10.10, 10.11, and 10.12.

Using the formula of ω2\omega_{2} and ω3\omega_{3} as in Eq. (156) and (159), for ξξ0\Im\xi\geq\xi_{0}, we have

Combining with ω2ω3ω2+ω3Od(1)|\omega_{2}-\omega_{3}|\leq|\omega_{2}|+|\omega_{3}|\leq O_{d}(1) proves the lemma. ∎

Under the assumptions of Lemma 10.1, we have (following the notations of Eq. (157) and (158))

The first bound is because (see Lemma 10.3 for the boundedness of m1,d\overline{m}_{1,d} and m2,d\overline{m}_{2,d})

In the following, we limit ourselves to proving Eq. (161), since Eq. (162) and (163) follow by similar arguments.

Then by the definition of v1v_{1} in Eq. (157), we have v1=aTRav_{1}={\bm{a}}^{\mathsf{T}}{\bm{R}}{\bm{a}}. Note we have

We next compute Var(hTRhR)\text{\rm Var}({\bm{h}}^{{\mathsf{T}}}{\bm{R}}{\bm{h}}|{\bm{R}}). By a similar calculation of Lemma C.8, we have (for a complex matrix, denote RT{\bm{R}}^{\mathsf{T}} to be the transpose of R{\bm{R}}, and R{\bm{R}}^{*} to be the conjugate transpose of R{\bm{R}})

By Lemmas 10.4, partial Stieltjes transforms are stable with respect to deleting one row and one column of the same index. By Lemma 10.13 (which will be stated and proved later), partial Stieltjes transforms are stable with respect to small changes of the dimension dd. Moreover, by Lemma 10.5, partial Stieltjes transforms concentrate tightly around their mean. As a consequence of all these lemmas (Lemma 10.4, 10.13, 10.5), we have

Combining with Eq. (167) proves Eq. (161). ∎

The following lemma is the analog of Lemma B.7 and B.8 in [CS13].

Under the assumptions of Lemma 10.1, we have (using the definitions in Eq. (155), (157) and (158) )

Step 1. Bounding (M1+V3)1op\|({\bm{M}}^{-1}+{\bm{V}}_{3})^{-1}\|_{{\rm op}}. By Sherman-Morrison-Woodbury formula, we have

Step 2. Bounding (M1+V3)1op\|({\bm{M}}^{-1}+\overline{\bm{V}}_{3})^{-1}\|_{{\rm op}}. Define G=M1/2V3M1/2{\bm{G}}={\bm{M}}^{1/2}{\bm{V}}_{3}{\bm{M}}^{1/2} and G=M1/2V3M1/2\overline{\bm{G}}={\bm{M}}^{1/2}\overline{\bm{V}}_{3}{\bm{M}}^{1/2}. By Lemma 10.11, we have

Combining with Eq. (170) and (171), we get

The last equality holds because (I2+G)1op\|({\mathbf{I}}_{2}+\overline{\bm{G}})^{-1}\|_{{\rm op}} is deterministic. Hence we have

The following lemma shows that, the partial resolvents are stable with respect to small changes of the dimension dd.

We denote Aij\overline{\bm{A}}_{ij} and Aij\underline{{\bm{A}}}_{ij} for i,ji,j\in to be

Then it’s easy to see that Ωop,Ω1op,Ω2op,Ω3op,Ωop1/ξ0\|\overline{\bm{\Omega}}\|_{{\rm op}},\|{\bm{\Omega}}_{1}\|_{{\rm op}},\|{\bm{\Omega}}_{2}\|_{{\rm op}},\|{\bm{\Omega}}_{3}\|_{{\rm op}},\|\underline{{\bm{\Omega}}}\|_{{\rm op}}\leq 1/\xi_{0}.

where η=(θ1d,,θNd)TN(0,IN){\bm{\eta}}=(\overline{\theta}_{1d},\ldots,\overline{\theta}_{Nd})^{\mathsf{T}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{N}). This gives

where u=(x1d,,xnd)TN(0,In){\bm{u}}=(\overline{x}_{1d},\ldots,\overline{x}_{nd})^{\mathsf{T}}\sim{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{n}). This gives

we have Eop=Od(Poly(logd)/d)\|{\bm{E}}\|_{{\rm op}}=O_{d}({\rm Poly}(\log d)/\sqrt{d}). Therefore, we get

Combining all these bounds establishes Eq. (176). Finally, Eq. (177) can be shown using the same argument. ∎

4 Equivalence between Gaussian and sphere version of Stieltjes transforms

In this subsection, we show that the Stieltjes transform of matrix A{\bm{A}} as defined in Eq. (60) and that of matrix A\overline{\bm{A}} as defined in Eq. (117) share the same asymptotics. For the reader’s convenience, we restate the definitions of these two matrices here.

and the Stieltjes transforms Md(ξ;q)\overline{M}_{d}({\bm{\xi}};{\bm{q}}) and Md(ξ;q)M_{d}({\bm{\xi}};{\bm{q}}), defined by

The readers could keep in mind: a quantity with an overline corresponds to the case when features and data are Gaussian, while a quantity without overline usually corresponds to the case when features and data are on the sphere.

Step 1. Show that the resolvent is stable with respect to nuclear norm perturbation.

Since q=(s1,s2,t1,t2,p){\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p) is fixed, we have

The nuclear norm of the term involving Z0{\bm{Z}}_{0} can be easily bounded by

For term HH\overline{\bm{H}}-{\bm{H}}, denoting Dx=diag(d/x12,,d/xn2){\bm{D}}_{\bm{x}}={\rm diag}(\sqrt{d}/\|\overline{\bm{x}}_{1}\|_{2},\ldots,\sqrt{d}/\|\overline{\bm{x}}_{n}\|_{2}), we have

Step 3. Bound for JZF/d\|\overline{\bm{J}}-{\bm{Z}}_{\star}\|_{F}/\sqrt{d}.

Define Z=φ(DxXΘT/d)/d\overline{\bm{Z}}_{\star}=\varphi({\bm{D}}_{\bm{x}}\overline{\bm{X}}\,\overline{\bm{\Theta}}^{\mathsf{T}}/\sqrt{d})/\sqrt{d}. Define ri=d/xi2r_{i}=\sqrt{d}/\|\overline{\bm{x}}_{i}\|_{2}. We have (for ζia\zeta_{ia} between rir_{i} and 11)

where Ξ=(ζia)i[n],a[N]{\bm{\Xi}}=(\zeta_{ia})_{i\in[n],a\in[N]}, and φˉ(x)=xφ(x)\bar{\varphi}(x)=x\varphi^{\prime}(x) (so φˉ\bar{\varphi} is a polynomial). It is easy to see that

Therefore, we have (denoting deg(φ)\deg(\varphi) to be the degree of polynomial φ\varphi, and C(φ)C(\varphi) to be a constant that only depends on φ\varphi)

5 Proof of Proposition 8.3

Step 1. Polynomial activation function σ\sigma.

By the property of Stieltjes transform as in Lemma 10.3 (c)(c), we have

By the concentration result of Lemma 10.5, for Md(ξ)=d1Tr[(AξIM)1]\overline{M}_{d}(\xi)=d^{-1}{\rm Tr}[(\overline{\bm{A}}-\xi{\mathbf{I}}_{M})^{-1}], we also have

Then we use Lemma 10.14 to transfer this property from Md\overline{M}_{d} to MdM_{d}. Recall the definition of resolvent Md(ξ;q)M_{d}(\xi;{\bm{q}}) in sphere case in Eq. (61). Combining Lemma 10.14 with Eq. (185), we have

Step 2. General activation function σ\sigma satisfying Assumption 1.

Further, by continuity of the solution of the fixed point equation with respect to μ,μ1\mu_{\star},\mu_{1} when ξξ0\Im\xi\geq\xi_{0} for some large ξ0\xi_{0} (as stated in Lemma 10.2), we have for ξξ0\Im\xi\geq\xi_{0},

Combining Eq. (188), (189), and (190), we obtain

Taking ε0\varepsilon\to 0 proves Eq. (71).

Step 3. Uniform convergence in compact sets (Eq. (72)).

By the concentration of Md(ξ;q)M_{d}(\xi_{\star};{\bm{q}}) to its expectation (which is the spherical version of Lemma 10.5), we have

and since N(ε,Ω)\mathcal{N}(\varepsilon,\Omega) is a finite set, we have

This high probability bound will become an expectation bound by the uniform boundedness of Md(ξ;q)M_{d}(\xi;{\bm{q}}) for ξ\xi in any compact domain. That is, we have

Combining Eq. (191), (192) and (194), we have

Letting ε0\varepsilon\to 0 proves Eq. (72). This concludes the proof of Proposition 8.3.

Proof of Proposition 8.4

In Section 11.1 we state and prove some lemmas that are used in the proof of Proposition 8.4. We prove Proposition 8.4 in Section 11.2.

The first lemma concerns the behavior of the partial Stieltjes transforms m1m_{1} and m2m_{2} when ξ\Im\xi\to\infty.

Define m1=ψ1/ξ\overline{m}_{1}=-\psi_{1}/\xi, m2=ψ2/ξ\overline{m}_{2}=-\psi_{2}/\xi, m=(m1,m2)T\overline{\bm{m}}=(\overline{m}_{1},\overline{m}_{2})^{\mathsf{T}}, and m=(m1,m2)T{\bm{m}}=(m_{1},m_{2})^{\mathsf{T}}. Let F1,F2{{\sf F}_{1}},{{\sf F}_{2}} be defined as in Eq. (69), F be defined as in Eq. (122), and H1{{\sf H}_{1}} defined as in Eq. (125). By simple calculus we can see that

The next lemma concerns the behavior of the log-determinants when ξ\Im\xi\to\infty.

Follow the notations and settings of Proposition 8.4. For any fixed q{\bm{q}}, we have

Combining the bound of the real part and the imaginary part, we have

Recall the definition of Ξ\Xi as given in Eq. (73). Define

It is easy to see that for any fixed q{\bm{q}}, we have

Combining Eq. (198), (199) and (200), we get

Next, we give some uniform upper bounds on the difference of derivatives of GdG_{d} and gg.

Define q=(s1,s2,t1,t2,p)(q1,q2,q3,q4,q5){\bm{q}}=(s_{1},s_{2},t_{1},t_{2},p)\equiv(q_{1},q_{2},q_{3},q_{4},q_{5}), and

Finally, we show that the derivatives of a function in a region can be upper bounded by the function value and the second derivatives of the function in the region.

The proof of Lemma 11.4 is elementary and simply follows from Taylor expansion.

2 Proof of Proposition 8.4

By the expression of Ξ\Xi in Eq. (73), we have

By fixed point equation (70) with F1,F2{{\sf F}_{1}},{{\sf F}_{2}} defined in (69), we obtain that

As a result, by the definition of gg given in Eq. (74), and by formula for implicit differentiation, we have

Combining Eq. (202) with Eq. (201), we get

Combining Eq. (203), (204) and (205), we get Eq. (75).

By Eq. (75) and Lemma 11.3, by the covering number argument (similar to the Step 3 in Section 10.5), we get that for any compact region Q{\mathcal{Q}}_{\star}, there is

Taking Q=B(0,2ε){\mathcal{Q}}_{\star}={\mathsf{B}}({\bm{0}},2\varepsilon), by Eq. (206) and by Lemma 11.3 again, there exists some constant CC, such that

Sending ε0\varepsilon\to 0 gives Eq. (76). By similar argument we get Eq. (77). This finishes the proof of Proposition 8.4.

Proof of Theorem 3, 4, and 5

To prove this theorem, we just need to show that

More specifically, we just need to show that, the formula for χ\chi defined in Eq. (16) as λ0\overline{\lambda}\to 0 coincides with the formula for χ\chi defined in Eq. (24). By the relationship of χ\chi and m1m2m_{1}m_{2} as per Eq. (86), we just need to show the lemma below.

By Proposition 8.3 this is equivalently saying m1(ξ;ψ1,ψ2),m2(ξ;ψ1,ψ2)m_{1}(\xi;\psi_{1},\psi_{2}),m_{2}(\xi;\psi_{1},\psi_{2}) is the analytic continuation of solution of Eq. (70) as defined in Proposition 8.3, when q=0{\bm{q}}={\bm{0}}. Defining ψ=min(ψ1,ψ2)\psi=\min(\psi_{1},\psi_{2}), we have

and hence m0=Ou(1)m_{0}=O_{u}(1). Taking the difference between Eq. (209) and (210), we get

This implies one of m1m_{1} and m2m_{2} should be of order 1/u1/u and the other one should be of order uu, as u0u\to 0.

By definition of m1m_{1} and m2m_{2} in Eq. (207), we have

2 Proof of Theorem 4

To prove this theorem, we just need to show that

This follows by simple calculus and a lemma below.

Under the same condition of Lemma 12.1, we have

The proof of this lemma is similar to the proof of Lemma 12.1.

3 Proof of Theorem 5

To prove this theorem, we just need to show that

This follows by simple calculus and a lemma below (this lemma is symmetric to Lemma 12.2).

Under the same condition of Lemma 12.1, we have

Proof of Proposition 5.1 and 5.2

Proof of Point (1). When ψ10\psi_{1}\to 0, we have χ=O(ψ1)\chi=O(\psi_{1}), so that \mathscrsfsE1,\mboxrless=ψ1ψ2+O(ψ12)\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}=-\psi_{1}\psi_{2}+O(\psi_{1}^{2}), \mathscrsfsE2,\mboxrless=O(ψ12)\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}=O(\psi_{1}^{2}) and \mathscrsfsE0,\mboxrless=ψ1ψ2+O(ψ12)\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}=-\psi_{1}\psi_{2}+O(\psi_{1}^{2}). This proves Point (1).

Proof of Point (2). When ψ1=ψ2\psi_{1}=\psi_{2}, substituting the expression for χ\chi into \mathscrsfsE0,\mboxrless\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}, we can see that \mathscrsfsE0,\mboxrless(ζ,ψ2,ψ2)=0\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=0. We also see that \mathscrsfsE1,\mboxrless(ζ,ψ2,ψ2)0\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})\neq 0 and \mathscrsfsE2,\mboxrless(ζ,ψ2,ψ2)0\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})\neq 0. This proves Point (2).

Proof of Point (3). When ψ1>ψ2\psi_{1}>\psi_{2}, we have

Proof of Point (4). For ψ1>ψ2\psi_{1}>\psi_{2}, taking derivative of \mathscrsfsB\mboxrless\mathscrsfs{B}_{\mbox{\tiny\rm rless}} and \mathscrsfsV\mboxrless\mathscrsfs{V}_{\mbox{\tiny\rm rless}} with respect to ψ1\psi_{1}, we have

It is easy to check that when ψ1>ψ2\psi_{1}>\psi_{2}, the functions ψ1\mathscrsfsE1,\mboxrless\mathscrsfsE0,\mboxrlessψ1\mathscrsfsE0,\mboxrless\mathscrsfsE1,\mboxrless\partial_{\psi_{1}}\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}-\partial_{\psi_{1}}\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{1,\mbox{\tiny\rm rless}} and ψ1\mathscrsfsE2,\mboxrless\mathscrsfsE0,\mboxrlessψ1\mathscrsfsE0,\mboxrless\mathscrsfsE2,\mboxrless\partial_{\psi_{1}}\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}-\partial_{\psi_{1}}\mathscrsfs{E}_{0,\mbox{\tiny\rm rless}}\cdot\mathscrsfs{E}_{2,\mbox{\tiny\rm rless}} are functions of ζ{\zeta} and ψ2\psi_{2}, and are independent of ψ1\psi_{1} (note when ψ1>ψ2\psi_{1}>\psi_{2}, χ\chi is a function of ψ2\psi_{2} and doesn’t depend on ψ1\psi_{1}). Therefore, \mathscrsfsB\mboxrless(ζ,,ψ2)\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\cdot,\psi_{2}) and \mathscrsfsV\mboxrless(ζ,,ψ2)\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\cdot,\psi_{2}) as functions of ψ1\psi_{1} must be strictly increasing, strictly decreasing, or staying constant on the interval ψ1(ψ2,)\psi_{1}\in(\psi_{2},\infty). However, we know \mathscrsfsB\mboxrless(ζ,ψ2,ψ2)=\mathscrsfsV\mboxrless(ζ,ψ2,ψ2)=\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\psi_{2},\psi_{2})=\infty, and \mathscrsfsB\mboxrless(ζ,,ψ2)\mathscrsfs{B}_{\mbox{\tiny\rm rless}}({\zeta},\infty,\psi_{2}) and \mathscrsfsV\mboxrless(ζ,,ψ2)\mathscrsfs{V}_{\mbox{\tiny\rm rless}}({\zeta},\infty,\psi_{2}) are finite. Therefore, we must have that \mathscrsfsB\mboxrless\mathscrsfs{B}_{\mbox{\tiny\rm rless}} and \mathscrsfsV\mboxrless\mathscrsfs{V}_{\mbox{\tiny\rm rless}} are strictly decreasing on ψ1(ψ2,)\psi_{1}\in(\psi_{2},\infty).

2 Proof of Proposition 5.2

In Proposition 13.1 given by the following, we give a more precise description of the behavior of \mathscrsfsR\mboxwide\mathscrsfs{R}_{\mbox{\tiny\rm wide}}, which is stronger than Proposition 5.2.

Fix ζ,ψ2(0,){\zeta},\psi_{2}\in(0,\infty) and ρ(0,)\rho\in(0,\infty). Then the function λ\mathscrsfsR\mboxwide(ρ,ζ,ψ2,λ)\overline{\lambda}\mapsto\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\psi_{2},\overline{\lambda}) is either strictly increasing in λ\overline{\lambda}, or strictly decreasing first and then strictly increasing.

Moreover, For any ρ<ρ(ζ,ψ2)\rho<\rho_{\star}({\zeta},\psi_{2}), we have

For any ρρ(ζ,ψ2)\rho\geq\rho_{\star}({\zeta},\psi_{2}), we have

Minimizing over λ\overline{\lambda} and ζ{\zeta}, we have

The minimizer is achieved for any ζ2ζ2(ρ,ψ2){\zeta}^{2}\geq{\zeta}_{\star}^{2}(\rho,\psi_{2}), and λ=λ(ζ,ψ2,ρ)\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho).

In the following, we prove Proposition 13.1. It is easy to see that

Hence we study the properties of \mathscrsfsR\mboxwide\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}} first.

Step 1. Properties of the function \mathscrsfsR\mboxwide\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}. Calculating the derivative of \mathscrsfsR\mboxwide\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}} with respect to uu, we have

has one negative and one positive solution, and ω1\omega_{1} is the negative solution of the above equation. Therefore, when uω1u\leq\omega_{1}, \mathscrsfsR\mboxwide\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}} will be strictly decreasing in uu; when 0uω10\geq u\geq\omega_{1}, \mathscrsfsR\mboxwide\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}} will be strictly increasing in uu. Therefore, we have

Step 2. Properties of the function \mathscrsfsR\mboxwide\mathscrsfs{R}_{\mbox{\tiny\rm wide}}. For fixed (ζ,ρ,ψ2)({\zeta},\rho,\psi_{2}), we look at the minimizer over λ\overline{\lambda} of the function \mathscrsfsR\mboxwide(ρ,ζ,λ,ψ2)=\mathscrsfsR\mboxwide(ω(λ,ζ,ψ2),ρ,ψ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega(\overline{\lambda},{\zeta},\psi_{2}),\rho,\psi_{2}). The minimum minλ0\mathscrsfsR\mboxwide(ρ,ζ,λ,ψ2)\min_{\overline{\lambda}\geq 0}\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2}) could be different from the minimum minu(,0]\mathscrsfsR\mboxwide(u,ρ,ψ2)\min_{u\in(-\infty,0]}\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(u,\rho,\psi_{2}), since arg minu(,0]\mathscrsfsR\mboxwide(u,ρ,ψ2)=ω1(ρ,ψ2)\operatorname*{arg\,min}_{u\in(-\infty,0]}\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(u,\rho,\psi_{2})=\omega_{1}(\rho,\psi_{2}) may not be achievable by ω(λ,ζ,ψ2)\omega(\overline{\lambda},{\zeta},\psi_{2}) when λ0\overline{\lambda}\geq 0.

One observation is that ω(,ψ2,ζ)\omega(\cdot,\psi_{2},{\zeta}) as a function of λ\overline{\lambda} is always negative and increasing.

Then for any ψ2(0,)\psi_{2}\in(0,\infty), ζ(0,){\zeta}\in(0,\infty) and λ>0\overline{\lambda}>0, we have

Let us for now admit this lemma holds. When ρ\rho is such that ω1>ω0\omega_{1}>\omega_{0} (i.e. ρ<ρ(ζ,ψ2)\rho<\rho_{\star}({\zeta},\psi_{2})), we can choose λ=λ(ζ,ψ2,ρ)>0\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)>0 such that ω(λ,ζ,ψ2)=ω(λ,ζ,ψ2)=ω1(ρ,ψ2)\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega(\overline{\lambda}_{\star},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2}), and then \mathscrsfsR\mboxwide(ρ,ζ,λ(ζ,ψ2,ρ),ψ2)=\mathscrsfsR\mboxwide(ω1(ρ,ψ2),ρ,ψ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho),\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{1}(\rho,\psi_{2}),\rho,\psi_{2}) gives the minimum of \mathscrsfsR\mboxwide\mathscrsfs{R}_{\mbox{\tiny\rm wide}} optimizing over λ[0,)\overline{\lambda}\in[0,\infty). When ρ\rho is such that ω1<ω0\omega_{1}<\omega_{0} (i.e. ρ>ρ(ζ,ψ2)\rho>\rho_{\star}({\zeta},\psi_{2})), there is not a λ\overline{\lambda} such that ω(λ,ζ,ψ2)=ω1(ρ,ψ2)\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2}) holds. Therefore, the best we can do is to take λ=0\overline{\lambda}=0, and then \mathscrsfsR\mboxwide(ρ,ζ,0,ψ2)=\mathscrsfsR\mboxwide(ω0(ρ,ψ2),ρ,ψ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},0,\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{0}(\rho,\psi_{2}),\rho,\psi_{2}) gives the minimum of \mathscrsfsR\mboxwide\mathscrsfs{R}_{\mbox{\tiny\rm wide}} optimizing over λ[0,)\overline{\lambda}\in[0,\infty).

Finally, when we minimize \mathscrsfsR\mboxwide(ρ,ζ,λ,ψ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda},\psi_{2}) jointly over ζ{\zeta} and λ\overline{\lambda}, note that as long as ζ2ζ2{\zeta}^{2}\geq{\zeta}_{\star}^{2}, we can choose λ=λ(ζ,ψ2,ρ)>0\overline{\lambda}=\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho)>0 such that ω(λ,ζ,ψ2)=ω(λ,ζ,ψ2)=ω1(ρ,ψ2)\omega(\overline{\lambda},{\zeta},\psi_{2})=\omega(\overline{\lambda}_{\star},{\zeta},\psi_{2})=\omega_{1}(\rho,\psi_{2}), and then \mathscrsfsR\mboxwide(ρ,ζ,λ(ζ,ψ2,ρ),ψ2)=\mathscrsfsR\mboxwide(ω1(ρ,ψ2),ρ,ψ2)\mathscrsfs{R}_{\mbox{\tiny\rm wide}}(\rho,{\zeta},\overline{\lambda}_{\star}({\zeta},\psi_{2},\rho),\psi_{2})=\overline{\mathscrsfs}{R}_{\mbox{\tiny\rm wide}}(\omega_{1}(\rho,\psi_{2}),\rho,\psi_{2}) gives the minimum of \mathscrsfsR\mboxwide\mathscrsfs{R}_{\mbox{\tiny\rm wide}} optimizing over λ[0,)\overline{\lambda}\in[0,\infty) and ζ(0,){\zeta}\in(0,\infty). This proves Proposition 13.1.

It is easy to see that ω(λ,ψ2,ζ)<0\omega(\overline{\lambda},\psi_{2},{\zeta})<0. In the following, we show λω(λ,ψ2,ζ)>0\partial_{\overline{\lambda}}\omega(\overline{\lambda},\psi_{2},{\zeta})>0.

It is easy to see that, when λ>0\overline{\lambda}>0 and ψ2>1\psi_{2}>1, both the denominator and numerator is positive, so that λω>0\partial_{\overline{\lambda}}\omega>0.

Step 2. When ψ2<1\psi_{2}<1. Note ω\omega is the negative solution of the quadratic equation

Differentiating the quadratic equation with respect to λ\overline{\lambda}, we have

We can see that, since ω<0\omega<0, when ψ2<1\psi_{2}<1, both the denominator and numerator is negative. This proves λω>0\partial_{\overline{\lambda}}\omega>0 when ψ2<1\psi_{2}<1. ∎

Acknowledgements

This work was partially supported by grants NSF CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729.

References

Appendix A Technical background

The dimension of each subspace is given by

A.2 Gegenbauer polynomials

We will use the following properties of Gegenbauer polynomials

Note in particular that property 2 implies that –up to a constant– Qk(d)(x,y)Q_{k}^{(d)}(\langle{\bm{x}},{\bm{y}}\rangle) is a representation of the projector onto the subspace of degree-kk spherical harmonics

then we have the following equation holds in L2([d,d],τd)L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d}) sense

A.3 Hermite polynomials

The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a d\sqrt{d} scaling in domain) are constructed by Gram-Schmidt orthogonalization of the monomials {xk}k0\{x^{k}\}_{k\geq 0} with respect to the measure τd\tau_{d}, while Hermite polynomial are obtained by Gram-Schmidt orthogonalization with respect to μG\mu_{G}. Since τdμG\tau_{d}\Rightarrow\mu_{G} (here \Rightarrow denotes weak convergence), it is immediate to show that, for any fixed integer kk,

Here and below, for PP a polynomial, Coeff{P(x)}{\rm Coeff}\{P(x)\} is the vector of the coefficients of PP. As a consequence, for any fixed integer kk, we have

where μk(σ)\mu_{k}(\sigma) and λd,k(σ)\lambda_{d,k}(\sigma) are given in Eq. (223) and (220).

Appendix B Proof of Proposition 8.2

Moreover, the set of eigenvalues of A(q)ξIM{\bm{A}}({\bm{q}})-\xi{\mathbf{I}}_{M} and det(A(q)ξIM)\det({\bm{A}}({\bm{q}})-\xi{\mathbf{I}}_{M}) are continuous with respect to q{\bm{q}}. Therefore, for any perturbation Δq\Delta{\bm{q}} with Δq2ε\|\Delta{\bm{q}}\|_{2}\leq\varepsilon and ε\varepsilon small enough, we have k(q+Δq,ξ)=k(q,ξ)k({\bm{q}}+\Delta{\bm{q}},\xi)=k({\bm{q}},\xi). As a result, we have

Moreover, A(q){\bm{A}}({\bm{q}}) (defined as in Eq. (60)) is a linear matrix function of q{\bm{q}}, which gives qi,qjA(q)=0\partial_{q_{i},q_{j}}{\bm{A}}({\bm{q}})={\bm{0}}. Hence we have

and using the formula for block matrix inversion, we have

With simple algebra, we can show that Eq. (226) holds.

Appendix C Additional Proofs in Section 9

We define the sequence (Fd,k2)k2(F_{d,k}^{2})_{k\geq 2} to be the coefficients of Gegenbauer expansion of Σd\Sigma_{d}:

In the expansion, the zeroth and first order coefficients are , because, according to Assumption 3,

To check point (1), we have Σd(1)=k=2Fd,k2Qk(d)(d)=k=2Fd,k2\Sigma_{d}(1)=\sum_{k=2}^{\infty}F_{d,k}^{2}Q_{k}^{(d)}(d)=\sum_{k=2}^{\infty}F_{d,k}^{2}, and by Assumption 3 we have limdΣd(1)=F2\lim_{d\to\infty}\Sigma_{d}(1)=F_{\star}^{2}, so that (1) holds.

To check point (2), defining (βd,k)k2({\bm{\beta}}_{d,k})_{k\geq 2} and gd\mboxNL(x)g_{d}^{\mbox{\tiny\rm NL}}({\bm{x}}) accordingly, we have

With a little abuse of notations, let us define

Moreover, note that X{\bm{X}}, Θ{\bm{\Theta}}, ε{\bm{\varepsilon}}, and fd\mboxNLf_{d}^{\mbox{\tiny\rm NL}} are mutually independent, X=dXOT{\bm{X}}\stackrel{{\scriptstyle d}}{{=}}{\bm{X}}{\mathcal{O}}^{\mathsf{T}}, Θ=dΘOT{\bm{\Theta}}\stackrel{{\scriptstyle d}}{{=}}{\bm{\Theta}}{\mathcal{O}}^{\mathsf{T}}, and fd\mboxNL=dTO[fd\mboxNL]f_{d}^{\mbox{\tiny\rm NL}}\stackrel{{\scriptstyle d}}{{=}}{\mathcal{T}}_{{\mathcal{O}}}[f_{d}^{\mbox{\tiny\rm NL}}]. Then, for any fixed βd,1{\bm{\beta}}_{d,1}, we have

C.2 Proof of Lemma 9.6 and Lemma 9.7

To prove Lemma 9.6 and 9.7, first we state a lemma that reformulate A1,A2A_{1},A_{2} and BαB_{\alpha} using Sherman-Morrison-Woodbury formula.

Note we have (denoting λd,0=λd,0(σ)\lambda_{d,0}=\lambda_{d,0}(\sigma))

Hence we have (denoting T2=JT1n/n{\bm{T}}_{2}={\bm{J}}^{\mathsf{T}}\bm{1}_{n}/\sqrt{n})

By the Sherman-Morrison-Woodbury formula, we have

Simplifying this formula using simple algebra proves Eq. (230). ∎

Step 1. Term A1A_{1}. By Lemma C.1, we get

We can simplify S20S_{20} in Eq. (233) further, and get

It is easy to see that 0A10\leq A\leq 1. Hence the high probability bound translates to an expectation bound. This proves the lemma. ∎

For notation simplicity, we prove this lemma under the case when A={α}{\mathcal{A}}=\{\alpha\} which is a singleton. We denote B=BαB=B_{\alpha}. The proof can be directly generalized to the case for arbitrary set A{\mathcal{A}}.

By Lemma C.1 (when applying Lemma C.1, we change the role of NN and nn, and the role of Θ{\bm{\Theta}} and X{\bm{X}}; this can be done because the role of Θ{\bm{\Theta}} and X{\bm{X}} is symmetric), we have

Note we have shown in the proof of Lemma 9.6 that

Since all the quantities above are deterministically bounded by a constant, these high probability bounds translate to expectation bounds.

Plugging in the above bounds into Equation (235), we have

C.3 Some auxiliary lemmas

We denote by μd\mu_{d} the probability law of x1,x2/d\langle{\bm{x}}_{1},{\bm{x}}_{2}\rangle/\sqrt{d} when x1,x2iidN(0,Id){\bm{x}}_{1},{\bm{x}}_{2}\sim_{iid}{\mathsf{N}}({\bm{0}},{\mathbf{I}}_{d}). Note that μd\mu_{d} is symmetric, and x2μd(dx)=1\int x^{2}\mu_{d}({\rm d}x)=1. By the central limit theorem, μd\mu_{d} converges weakly to μG\mu_{G} as dd\to\infty, where μG\mu_{G} is the standard Gaussian measure. In fact, we have the following stronger convergence result.

For any λ[d/2,d/2]\lambda\in[-\sqrt{d}/2,\sqrt{d}/2], we have

In order to prove Eq. (236), we note that the left hand side is given by

where the last inequality holds for λd/2|\lambda|\leq\sqrt{d}/2 using the fact that (1x)1e2x(1-x)^{-1}\leq e^{2x} for x[0,1/4]x\in[0,1/4].

The next several lemmas establish general bounds on the operator norm of random kernel matrices which is of independent interest.

Then there exists a constant CC depending uniquely on c0,c1,c2c_{0},c_{1},c_{2}, and a sequence of numbers (ηd)d1(\overline{\eta}_{d})_{d\geq 1} with ηdCexp{C(logd)1/2}|\overline{\eta}_{d}|\leq C\exp\{C(\log d)^{1/2}\}, such that

By Lemma C.2 and Markov inequality, we have, for any iji\neq j and all 0td0\leq t\leq\sqrt{d},

By [DM16, Lemma 20], there exists a constant CC such that

Defining the event G{zi,zj/d16logd,1i<jM}\mathcal{G}\equiv\{|\langle\overline{\bm{z}}_{i},\overline{\bm{z}}_{j}\rangle/\sqrt{d}|\leq 16\sqrt{\log d},\,\forall 1\leq i<j\leq M\}, we have

Then there exists a constant CC depending uniquely on c0,c1,c2c_{0},c_{1},c_{2}, such that

In the proof of this lemma, we assume σ\sigma has continuous derivatives. In the case when σ\sigma is only weak differentiable, the proof is the same, except that we need to replace the mean value theorem to its integral form.

Define ri=d/zi2r_{i}=\sqrt{d}/\|\overline{\bm{z}}_{i}\|_{2}, and

By the concentration of χ\chi-squared distribution, it is easy to see that

Moreover, we have (for ζi\zeta_{i} between rir_{i} and 11)

Moreover by the assumption that σ(u)c0ec1u|\sigma^{\prime}(u)|\leq c_{0}e^{c_{1}|u|}, we have

Then there exists a constant CC depending uniquely on c0,c1,c2c_{0},c_{1},c_{2}, such that

By Lemma C.3, there exists a sequence (ηd)d0(\overline{\eta}_{d})_{d\geq 0} with ηdCexp{C(logd)1/2}|\overline{\eta}_{d}|\leq C\exp\{C(\log d)^{1/2}\}, such that

Combining this equation with Eq. (245), we get

C.4 The decomposition of kernel inner product matrices

The following lemma (Lemma C.6) is a reformulation of Proposition 3 in [GMMM19]. We present it in a stronger form, but it can be easily derived from the proof of Proposition 3 in [GMMM19]. This lemma was first proved in [EK10] in the Gaussian case. (Notice that the second estimate —on Qk(ΘXT)Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})— follows by applying the first one whereby Θ{\bm{\Theta}} is replaced by W=[ΘTXT]T{\bm{W}}=[{\bm{\Theta}}^{{\mathsf{T}}}|{\bm{X}}^{{\mathsf{T}}}]^{{\mathsf{T}}}

Notice that the second estimate —on Qk(ΘXT)Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})— follows by applying the first one —Eq. (246)— whereby Θ{\bm{\Theta}} is replaced by W=[ΘTXT]T{\bm{W}}=[{\bm{\Theta}}^{{\mathsf{T}}}|{\bm{X}}^{{\mathsf{T}}}]^{{\mathsf{T}}}, and we use Qk(ΘXT)opQk(WWT)IN+nop\|Q_{k}({\bm{\Theta}}{\bm{X}}^{\mathsf{T}})\|_{{\rm op}}\leq\|Q_{k}({\bm{W}}{\bm{W}}^{\mathsf{T}})-{\mathbf{I}}_{N+n}\|_{{\rm op}}.

The following lemma (Lemma C.7) can be easily derived from Lemma C.6. Again, this lemma was first proved in [EK10] in the Gaussian case.

Then we can rewrite the matrix U{\bm{U}} to be

C.5 A lemma on the variance of the quadratic form

Step 1. Term gTAh{\bm{g}}^{\mathsf{T}}{\bm{A}}{\bm{h}}. Calculating the expectation, we have

Step 2. Term gTBg{\bm{g}}^{\mathsf{T}}{\bm{B}}{\bm{g}}. Calculating the expectation, we have

Appendix D Proof of Lemma 8.1

where we have, for u=(s1,s2,t1,t2,z1,z2)T{\bm{u}}=(s_{1},s_{2},t_{1},t_{2},z_{1},z_{2})^{\mathsf{T}}

Appendix E Proof sketch for Theorem 6

In this section, we sketch the calculations of Theorem 6. We assume ψ1,dN/d=ψ1\psi_{1,d}\equiv N/d=\psi_{1} and ψ2,dn/d=ψ2\psi_{2,d}\equiv n/d=\psi_{2} are constants independent of dd. Recall that the definitions of two useful resolvent matrix Ξ{\bm{\Xi}} and Π{\bm{\Pi}} are

Step 1. The expectation of regularized training error.

By Eq. (45), the regularized training error of random features regression gives

Step 3. The derivatives of the log determinant.

where gg is given in Eq. (74). The derivatives of gg can be obtained by differentiating Eq. (73) and using Daskin’s theorem. The theorem then follows by simple calculus.