The generalization error of random features regression: Precise asymptotics and double descent curve
Song Mei, Andrea Montanari
Introduction
Statistical lore recommends not to use models that have too many parameters since this will lead to ‘overfitting’ and poor generalization. Indeed, a plot of the test error as a function of the model complexity often reveals a U-shaped curve. The test error first decreases because the model is less and less biased, but then increases because of a variance explosion [HTF09]. In particular, the interpolation threshold, i.e., the threshold in model complexity above which the training error vanishes (the model completely interpolates the data), corresponds to a large test error. It seems wise to keep the model complexity well below this threshold in order to obtain a small generalization error.
These classical prescriptions are in stark contrast with the current practice in deep learning. The number of parameters of modern neural networks can be much larger than the number of training samples, and the resulting models are often so complex that they can perfectly interpolate the data. Even more surprisingly, they can interpolate the data when the actual labels are replaced by pure noise [ZBH+16]. Despite such a large complexity, these models have small test error and can outperform others trained in the classical underparametrized regime.
This behavior has been rationalized in terms of a so-called ‘double-descent’ curve [BMM18, BHMM19]. A plot of the test error as a function of the model complexity follows the traditional U-shaped curve until the interpolation threshold. However, after a peak at the interpolation threshold, the test error decreases, and attains a global minimum in the overparametrized regime. In fact, the minimum error often appears to be ‘at infinite complexity’: the more overparametrized is the model, the smaller is the error. It is conjectured that the good generalization behavior in this highly overparametrized regime is due to the implicit regularization induced by gradient descent learning: among all interpolating models, gradient descent selects the simplest one, in a suitable sense. An example of double descent curve is plotted in Fig. 1. The main contribution of this paper is to describe a natural, analytically tractable model leading to this generalization curve, and to derive precise formulae for the same curve, in a suitable asymptotic regime.
Despite its simplicity, this random covariates model captures several features of the double descent scenario. In particular, the asymptotic generalization curve is U-shaped for , diverging at the interpolation threshold , and descends again exceeding that threshold. The divergence at is explained by an explosion in the variance, which is in turn related to a divergence of the condition number of the random matrix . At the same time, this simple model misses some interesting features that are observed in more complex settings: In the Gaussian covariates model, the global minimum of the test error is achieved in the underparametrized regime , unless ad-hoc misspecification structure is assumed; The number of parameters is tied to the covariates dimension and hence the effects of overparametrization are not isolated from the effects of the ambient dimensions; Ridge regression, with some regularization , is always found to outperform the ridgeless limit . Moreover, this linear model is not directly connected to actual neural networks, which are highly nonlinear in the covariates .
In this paper, we study the random features model of Rahimi and Recht [RR08]. The random features model can be viewed either as a randomized approximation to kernel ridge regression, or as a two-layers neural networks with random first layer wights. We compute the precise asymptotics of the test error and show that it reproduces all the qualitative features of the double-descent scenario.
We learn the coefficients by performing ridge regression
The choice of ridge penalty is motivated by the connection to kernel ridge regression, of which this method can be regarded as a finite-rank approximation. Further, the ridge regularization path is naturally connected to the path of gradient flow with respect to the mean square error , starting at . In particular, gradient flow converges to the ridgeless limit () of , and there is a correspondence between positive , and early stopping in gradient descent [YRC07].
lie in a proportional asymptotics regime. Namely, with , for some .
where expectation is taken with respect to . Assume .
Then, for linear in the setting described above, for any , the prediction risk convergences in probability
where is explicitly given in Definition 1.
Section 5.1 also contains an analogous statement for the nonlinear model.
Theorem 1 and its generalizations stated below require fixed as . We can then consider the ridgeless limit by taking . Let us stress that this does not necessarily yield the prediction risk of the min-norm least square estimator that is also given by the limit at fixed. Denoting by the design matrix, the latter is given by . While we conjecture that indeed this is the same as taking in the asymptotic expression of Theorem 1, establishing this rigorously would require proving that the limits and can be exchanged. We leave this to future work.
Our work is the first one to provide a complete treatment of the nonparametric model in the proportional asymptotics and to establish those phenomena. From a mathematical viewpoint, the calculation of the test error can be reduced to studying a block-structured kernel random matrix, with a more intricate structure than the one of [HMRT19]. The reduction itself is novel in the present context, and goes through the log determinant of this random matrix, while the variance computation of [HMRT19] is directly connected to the resolvent.
The proof of Theorem 1 builds on ideas from random matrix theory. A careful look at these arguments unveils an interesting phenomenon. While the random features are highly non-Gaussian, it is possible to construct a Gaussian covariates model with the same asymptotic prediction error as for the random features model. Apart from being mathematically interesting, this finding provides additional intuition for the behavior of random features models, and opens the way to some interesting future directions. In particular, [MRSY19] uses this Gaussian covariates proxy to analyze maximum margin classification using random features.
The rest of the paper is organized as follows:
In Section 2 we summarize the main insights that can be extracted from the asymptotic theory, and illustrate them through plots.
Section 3 provides a succinct overview of related work.
Section 4 introduces the notations that are used in this paper.
Section 5 contains formal statements of our main results, which is the asymptotics of prediction error as in Theorem 2. This section also presents some special cases of the asymptotic formula.
Section 6 contains the statements of the asymptotics of the training error as in Theorem 6.
Section 7 presents an interesting phenomenon which is that the random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates.
In Section 8 we present the proof of main results. The main results will use several propositions that are proved in the following sections and in the appendices.
Results and insights: An informal overview
Before explaining in detail our technical results —which we will do in Section 5— it is useful to pause and describe some consequences of the exact asymptotic formulae that we prove. Our focus here will be on insights that have a chance to hold more generally, beyond the specific setting studied here.
Bias term also exhibits a singularity at the interpolation threshold. A prominent feature of the double descent curve is the peak in test error at the interpolation threshold which, in the present case, is located at . In the linear regression model of [AS17, HMRT19, BHX19], this phenomenon is entirely explained by a peak in the variance of the estimator (that diverges in the ridgeless limit ), while its bias is monotone increasing across to this threshold.
In contrast, in the random features model studied here, both variance and bias have a peak at the interpolation threshold, diverging there when . This is apparent from Figure 1 which was obtained for , and therefore in a setting in which the error is entirely due to bias. The fact that the double descent scenario persists in the noiseless limit is particularly important, especially in view of the fact that many machine learning tasks are usually considered nearly noiseless.
Optimal prediction error is achieved in the highly overparametrized regime. Figure 2 (left) reports the predicted test error in the ridgeless limit (for a case with non-vanishing noise, ) as a function of , for several values of . Figure 3 plots the predicted test error as a function of , for fixed , several values of , and two values of the SNR. We repeatedly observe that: For a fixed , the minimum of test error (over ) is in the highly overparametrized regime ; The global minimum (over and ) of test error is achieved at a value of that depends on the SNR, but always at ; In the ridgeless limit , the generalization curve is monotonically decreasing in when .
To the best of our knowledge, this is the first natural and analytically tractable model which satisfies the following requirements: Large overparametrization is necessary to achieve optimal prediction; No special misspecification structure needs to be postulated.
Optimal regularization eliminated the double-descent. Figure 3 reports the asymptotic prediction for the test error as a function of the overparametrization ratio for various values of the regularization parameter . The peak at the interpolation threshold is apparent, but it becomes less prominent as the regularization increases. In particular, if we consider the optimal regularization (the lower envelope of these curves), the test error becomes monotone decreasing in the number of parameters: regularization compensates overparametrization.
Non-vanishing regularization can hurt (at high SNR). Figure 4 plots the predicted test error as a function of , for several values of , with fixed. The lower envelope of these curves is given by the curve at , confirming that the optimal error is achieved in the highly overparametrized regime. However the dependence of this lower envelope on changes qualitatively, depending on the SNR. For small SNR, the global minimum is achieved as some : regularization helps. However, for a large SNR the minimum error is achieved as . The optimal regularization is vanishingly small.
These two noise regime are separated by a phase transition at a critical SNR which we denote by . A characterization of this critical value is given in Section 5.2.2.
Self-induced regularization. What is the mechanism underlying the optimality of the ridgeless limit ? An intuitive explanation can be obtained by considering the (random) kernel associated to the ridge regression (2), namely
Related literature
A recent stream of papers studied the generalization behavior of machine learning models in the interpolation regime. An incomplete list of references includes [BMM18, BRT19, LR18, BHMM19, RZ18]. The starting point of this line of work were the experimental results in [ZBH+16, BMM18], which showed that deep neural networks as well as kernel methods can generalize even if the prediction function interpolates all the data. It was proved that several machine learning models including kernel regression [BRT19] and kernel ridgeless regression [LR18] can generalize under certain conditions.
The double descent phenomenon, which is our focus in this paper, was first discussed in general terms in [BHMM19]. The same phenomenon was also observed in [AS17, GJS+19]. The paper [KLS18] observes that the optimal amount of ridge regularization is sometimes vanishing, and provides an explanation in terms of noisy features. Analytical predictions confirming this scenario were obtained, within the linear regression model, in two concurrent papers [HMRT19, BHX19]. In particular, [HMRT19] derives the precise high-dimensional asymptotics of the prediction error, for a general model with correlated covariates. On the other hand, [BHX19] gives exact formula for any finite dimension, for a model with i.i.d. Gaussian covariates. The same papers also compute the double descent curve within other models, including over-specified linear model [HMRT19], and a Fourier series model [BHX19].
As mentioned in the introduction, [HMRT19, Section 8] also calculates the variance term of the prediction error in the random features model in the ridgeless limit . Both the simple linear regression models of [HMRT19, BHX19], and the variance calculation of [HMRT19, Section 8] capture the peak of the test error at the interpolation threshold. However, these calculations do not elucidate several crucial statistical phenomena, which are instead the main contribution of our work (see Section 2): optimality of large overparametrization; optimality of interpolators at high SNR ( limit); the role of self-induced regularization; disappearance of the double descent at optimal overparametrization.
Rate-optimal bounds on the generalization error of overparametrized linear models were recently derived in [BLLT20] (see also [MVS19] for a different perspective).
2 Random features and kernels
The random features model has been studied in considerable depth since the original work in [RR08]. A classical viewpoint suggests that should be regarded as random approximation of the reproducing kernel Hilbert space defined by the kernel
Indeed is an RKHS defined by the finite-rank approximation of this kernel defined in Eq. (6). The paper [RR08] showed the pointwise convergence of the empirical kernel to . Subsequent work [Bac17b] showed the convergence of the empirical kernel matrix to the population kernel in terms of operator norm and derived bound on the approximation error (see also [Bac13, AM15, RR17] for related work).
The setting in the present paper is quite different, since we take the limit of a large number neurons , together with large dimension . Our focus on this high-dimensional regime is partially motivated by [RZ18], which emphasizes that optimality of interpolators is somewhat un-natural in low dimension.
It is well-known that approximation using two-layers network suffers from the curse of dimensionality, in particular when first-layer weights are not trained [DHM89, Bac17a, VW18, GMMM19]. The recent paper [GMMM19] studies random features regression in a setting similar to ours, by considering two different regimes: the population limit , with scaling as a polynomial of ; the wide limit , with scaling as a polynomial of . In particular, [GMMM19] proves that, if and or and then a random features model can only fit the projection of the true function onto degree- polynomials.
3 Technical contribution
The asymptotic singular values distribution of is not sufficient to compute the asymptotic prediction error, which also depends on the singular vectors of . The paper [HMRT19] addresses this challenge for what concerns the variance term of the error, and only in the limit . Notice that the variance term is given (up to constants) by . It is quite straightforward to express this quantity in terms of the Stieltjes transform of a certain block random matrix, and [HMRT19] use the leave-one-out method to characterize the asymptotics of this Stieltjes transform.
Unfortunately, the approach of [HMRT19] cannot be pushed to compute the full test error (i.e. both the bias and variance terms): the latter cannot be expressed in terms of the Stieltjes transform of the same matrix. A key observation of the present paper is that the full prediction error can be expressed in terms of derivatives of the log-determinant of a different block-structured random matrix. In order to compute the asymptotics of this log-determinant, we use leave-one-out arguments (e.g. [BS10, Chapter 3.3]) to derive fixed point equations for the Stieltjes transform of this random matrix, and then integrate this Stieltjes transform.
One further difference from [HMRT19] is that we consider the full nonparametric model while for the calculation of [HMRT19] does not model the target function. As mentioned above, our setting is similar to the one of [GMMM19]. However, the main technical content of [GMMM19] is to prove that, under polynomial scalings of and (at ) or and (at ), the kernel matrix is near isometric. In contrast, here we study a regime in which it is not true that the same matrix is a near isometry, and we characterize its spectral distribution (alongside those properties of the eigenvectors that determine the test error).
Notations
Main results
We begin by stating our assumptions and notations for the activation function . It is straightforward to check that these are satisfied by all commonly-used activations, including ReLU and sigmoid functions.
where expectation is with respect to . Assuming , define by
We will consider sequences of parameters that diverge proportionally to each other. When necessary, we can think such sequences to be indexed by , with , functions of .
Defining and , we assume that the following limits exist in :
The last assumption covers, as a special case, deterministic linear functions , but also a large class of random non-linear functions. As an example, let , where , and consider the random quadratic function
Higher order polynomials can be constructed analogously (or using the expansion of in spherical harmonics).
We also emphasize that that the nonlinear part , although being random, is the same for all samples, and hence should not be confused with additive noise .
We finally introduce the formula for the asymptotic prediction error, denoted by in Theorem 1.
is the unique solution of these equations with , for , with a sufficiently large constant.
The formula for the asymptotic risk can be easily evaluated numerically. In order to gain further insight, it can be simplified in some interesting special cases, as shown in Section 5.2.
We are now in position to state our main theorem, which generalizes Theorem 1 to the case in which has a nonlinear component .
Then for any value of the regularization parameter , the asymptotic prediction error of random features ridge regression satisfies
If the regression function is linear (i.e., ), we recover Theorem 1, where is defined as per Eq. (20). Numerical experiments suggest that Eq. (21) holds for any deterministic nonlinear functions as well, and that the convergence in Eq. (21) is uniform over in compacts. We defer the study of these stronger properties to future work.
Note that the formula for a nonlinear truth, cf. Eq. (21), is almost identical to the one for a linear truth in Eq. (5). In fact, the only difference is that the the prediction error increases by a term , and the noise level is replaced by .
Figure 5 illustrates the last remark. We report the simulated and predicted test error as a function of , for three different choices of the function and noise level . In all the settings, the total power of nonlinearity and noise is , while the power of the linear component is . The test errors in these three settings appear to be very close, as predicted by our theory.
The terms and in Eq. (21) correspond to the the limits of the bias and variance of the estimated function , when the ground truth function is linear. That is, for to be a linear function, we have
2 Simplifying the asymptotic risk in special cases
In order to gain further insight into the formula for the asymptotic risk , we consider here three special cases that are particularly interesting:
The highly overparametrized regime (recall that ).
The large sample limit (recall that ).
Let us emphasize that these limits are taken after the limit with and . Hence, the correct interpretation of the highly overparametrized regime is not that the width is infinite, but rather much larger than (more precisely, larger than any constant times ). Analogously, the large sample limit does not coincide with infinite sample size , but instead sample size that is much larger than .
The ridgeless limit is important because it captures the asymptotic behavior the min-norm interpolation predictor (see also Remark 1.)
Under the assumptions of Theorem 2, set and define
Then the asymptotic prediction error of random features ridgeless regression is given by
The proof of this result can be found in Section 12.
The next proposition establishes the main qualitative properties of the ridgeless limit.
Recall the bias and variance functions and defined in Eq. (26) and (27). Then, for any and fixed , we have
Divergence at the interpolation threshold :
Large width limit (here is defined as per Eq. (24)):
Above the interpolation threshold (i.e. for ), the function and are strictly decreasing in the rescaled number of neurons .
The proof of this proposition is presented in Section 13.1.
As anticipated, point establishes an important difference with respect to the random covariates linear regression model of [AS17, HMRT19, BHX19]. While in those models the peak in prediction error is entirely due to a variance divergence, in the present setting both variance and bias diverge.
Another important difference is established in point : both bias and variance are monotonically decreasing above the interpolation threshold. This, again, contrasts with the behavior of simpler models, in which bias increases after the interpolation threshold, or after a somewhat larger point in the number of parameters per dimension (if misspecification is added).
This monotone decrease of the bias is crucial, and is at the origin of the observation that highly overparametrized models outperform underparametrized or moderately overparametrized ones. See Figure 6 for an illustration.
2.2 Highly overparametrized regime
As the number of neurons diverges (for fixed dimension ), random features ridge regression is known to approach kernel ridge regression with respect to the kernel (7). It is therefore interesting what happens when and diverge together, but is larger than any constant times .
Under the assumptions of Theorem 2, define
Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by
The proof of this result can be found in Section 12. Note that, as expected, the risk remains lower bounded by , even in the limit . Naively, one could have expected to recover kernel ridge regression in this limit, and hence a method that can fit nonlinear functions. However, as shown in [GMMM19], random features methods can only learn linear functions for .
As observed in Figures 2 to 4 (which have been obtained by applying Theorem 2), the minimum prediction error is often achieved by highly overparametrized networks . It is natural to ask what is the effect of regularization on such networks. Somewhat surprisingly (and as anticipated in Section 2), we find that regularization does not always help. Namely, there exists a critical value of the signal-to-noise ratio, such that vanishing regularization is optimal for , and is not optimal for .
In order to state formally this result, we define the following quantities
Notice in particular that is the limiting value of the prediction error (right-hand side of (36)) up to an additive constant and an multiplicative constant.
Fix and . Then the function is either strictly increasing in , or strictly decreasing first and then strictly increasing.
The proof of this proposition is presented in Section 13.2, which also provides further information about this phase transition (and, in particular, an explicit expression for ).
2.3 Large sample limit
Under the assumptions of Theorem 2, define
Then the asymptotic prediction error of random features ridge regression, in the large width limit is given by
The proof of this result can be found in Section 12.
Asymptotics of the training error
Theorem 2 establishes the exact asymptotics of the test error in the random features model. However, the technical results obtained in the proofs allow us to characterize several other quantities of interest. Here we consider the behavior of the training error and of the norm of the parameters. We define the regularized training error by
We also recall that denotes the minimizer in the last expression, cf. Eq. (2) The next definition presents the asymptotic formulas for these quantities.
is the unique solution of these equations with , for , with a sufficiently large constant.
We next state our asymptotic characterization of and .
Then for any value of the regularization parameter , the asymptotic regularized training error and norm square of its minimizer satisfy
The proof of Theorem 6 is similar to the proof of Theorem 2. We will give a sketch of proof of Theorem 6 in Section E.
In this section, we illustrate Theorem 6 through numerical simulations. Figure 7 reports the theoretical prediction and numerical results for the regularized training error, the test error, and the norm of the coefficients . We use a small non-zero value of the regularization parameter , fix the number of samples per dimension , and follow these quantities as a function of the overparameterization ratio .
As expected, the behavior of the training error strikingly different from the one of the test error. The training error is monotone decreasing in the overparameterization ratio , and is close to zero in the overparameterized regime (it is not exactly vanishing because we use a small ). In other words, the fitted model is nearly interpolating the data, and the peak in test error matches the interpolation threshold.
On the other hand, the penalty term is non-monotone: it increases up to the interpolation threshold, then decreases for , and converges to a constant as . If we take this as a proxy for the model complexity, the behavior of provides useful intuition about descent of the generalization error. As the number of parameters increases beyond the interpolation threshold, the model complexity decreases instead of increasing.
We can confirm the intuition that the double descent of the test error is driven by the behavior of the model complexity , by selecting in an optimal way. Following [HMRT19], we expect that the optimal regularization should produce a smaller value of , and hence eliminate or reduce the double descent phenomenon. Indeed, this is illustrated in Figure 8 demonstrates the prediction of the regularized training error and the test error for two choices of : , and an optimal such that the test error is minimized. When we choose an optimal , the test error becomes strictly decreasing as increases. We expect this is a generic phenomenon that also holds in other interesting models.
An equivalent Gaussian covariates model
An examination of the proof of our main result (Theorem 2) reveals an interesting phenomenon. The random features model has the same asymptotic prediction error as a simpler model with Gaussian covariates and response that is linear in these covariates, provided we use a special covariance and signal structure.
Draw , , and independently, conditional on .
Let .
Let , , for some .
Remarkably, in the proportional asymptotics with , the behavior of this model is the same as the one of the nonlinear random features model studied in the rest of the paper. In particular, the asymptotic prediction error is given by the same formula as in Definition 1.
(Gaussian covariates prediction model) Define and the signal-to-noise ratio as
and assume . Then, in the Gaussian covariates model described above, for any , we have
where is explicitly given in Definition 1.
The proof of Theorem 7 is is almost the same as the one of Theorem 2 (with several simplifications, because of the greater amount of independence). To avoid repetitions, we will not present a proof here.
Figure 9 illustrates the content of Theorem 7 via numerical simulations. We report the simulated and predicted test error as a function of . The theoretical prediction here is exactly the same as the one reported in Figure 5. However, numerical simulations were carried out with the Gaussian covariates model instead of random features. The agreement is excellent, as predicted by Theorem 7.
Proof of Theorem 2
We begin by observing that the minimizer of the training error (2) is given by
Then can be written in a simpler form . After a simple calculation, we obtain
Our first step is to replace the exact expression (55) by a simpler one involving traces of combinations of and the following four random matrices:
The proof of this proposition is deferred to Section 9 and is based on the following main steps:
Finally, we show that concentrates around its expectation with respect to (i.e. the coefficients ) and .
Here is the complex logarithm with branch cut on the negative real axis and is the set of eigenvalues of in non-increasing order.
The next proposition connects the quantities to the transforms and .
The proof of Proposition 8.2 follows by basic calculus and linear algebra, and we defer its proof to Appendix B. Despite its simplicity, this statement provides the basic scheme of our proof. We will determine the asymptotics of using a leave-one-out argument; then extract the behavior of using Eq. (62); finally we characterize the test error using Eqs. (63) and Proposition 8.1.
The construction of the matrix is related to the linear pencil method in free probability, see e.g., [HMS18]. A significantly simpler construction was used in [HMRT19, Section 8] to calculate the variance part of the risk in the limit (in special cases). The approach of [HMRT19] amounts to computing the Stieltjes transform of for , in the limit : unfortunately this quantity is not sufficient to extract the prediction error. We overcome this difficulty by considering a more complex block-structured matrix and expressing the risk in terms of derivatives of the log determinant .
Unfortunately, even conditional on , the projections of and along and orthogonal to it are not independent (because of the sphere constraint). To overcome this problem we replace these by Gaussian vectors and and prove that the two distributions yield the same asymptotics of the Stieltjes transform. The decomposition of these Gaussian vectors takes the form
At this point independence can be exploited to obtain concentration results on the right-hand side. Let us emphasize that, while these paragraphs outline the main elements of the leave-one-out argument, several technical subtleties make the actual proof significantly longer, see Section 10 for details.
The proof of Proposition 8.3 is presented in Section 10. The fixed point equations (70) arise as a consequence of Eq. (64) (and the analogous equation for ). Indeed the proof also shows that the solution of these equations gives the limit of as .
For a complete proof of this proposition we refer to Section 11.
We can now use Eqs. (76), (77), and (63) in Proposition 8.1, to get
The last display provides the desired asymptotics of bias and variance. However, these expressions involve derivatives of that are very inconvenient to evaluate. We conclude by proving more explicit expressions for these quantities. The key remark here is that the expression in Proposition 8.4 has a special property: the fixed point equations (70) imply that is a stationary point of the function . This simplifies the calculation of derivatives with respect to . In particular, the first derivative are obtained by computing the partial derivatives of with respect to and evaluating it at .
The proof of this lemma follows by simple calculus and can be found in Appendix D.
By the definition of analytic functions and (satisfying Eq. (70) and (69) with as defined in Proposition 8.3), the definition of and in Eq. (85) above is equivalent to its definition in Definition 1 (as per Eq. (15)). Moreover, for defined in Eq. (16) with and defined in Eq. (82), we have
Plugging in Eq. (83) and (84) into Eq. (80) and (81) and using Eq. (86), we can see that the expressions for and defined in Eq. (80) and (81) coincide with Eq. (18) and (19) where are provided in Eq. (17). Combining with Eq. (78) and (79) proves the theorem.
Proof of Proposition 8.1
As a useful preliminary remark, we note that the Gaussian process defined in Assumption 3 can be explicitly represented as a sum of spherical harmonics with Gaussian coefficients. The following lemma is standard (see, e.g., [MP11, Proposition 6.11]). For the reader’s convenience, we present a simple proof in Appendix C.
Let us write the random variable in the left hand side of Eq. (58) as a function of and de-emphasize its dependence on other variables, i.e.,
Next, we state three lemmas that are used in the proof of Proposition 8.1.
Let be the Gegenbauer coefficients of the function , i.e., we have
Under the assumptions of Proposition 8.1, for any , we have
where are given by Eq. (91), and are given by Eq. (59).
Under the assumptions of Proposition 8.1, we have
We defer the proofs of these three lemmas to the following subsections and show here that they imply Proposition 8.1. Indeed, we have
where follows by Lemma 9.3 and triangular inequality, and from Lemma 9.4.
Recall the expression (55) for the risk. Taking expectation with respect to and , we get
The proof of the lemma follows by evaluating each of these three terms. It is useful to introduce the matrices , , which denotes the evaluations of spherical harmonics of degree at the points and (c.f. Appendix A):
Using these expressions, we can evaluate terms and :
Combining the above formulas for , , proves Lemma 9.3.
2 Proof of Lemma 9.4
The next two lemmas will be used in the proofs of Lemma 9.4 and Lemma 9.5, and hold under the same assumptions. The first of these lemmas will be used to establish Eq. (92) (but notice that its statement does not coincide with that equation), and the second will be used to control several terms in those proofs. The proofs of these lemmas are given in Section C.2.
We will now use these lemmas to prove Lemma 9.4. We begin by recalling a few facts that are used several times in the proof. Since , there exists a constant depending on such that deterministically
By operator norm bounds on Wishart matrices [AGZ09], we have (the definition of these matrices are given in Eq. (57))
Finally we need some simple operator norm bounds on the matrices , , and . Notice that (by the normalization condition of Gegenbauer polynomials) and the out-of-diagonal entries of have zero mean and typical size of order (see Appendix A). This suggests the following estimates, which are formalized in Lemma C.6,
Since further as (see Eq. (225)), we have
(This estimate is stated formally in the appendices as Lemma C.7.) It is also useful to introduce the matrix
We have now finished presenting our preliminary estimates and can prove Lemma 9.4.
We begin by considering Eq. (92), where and are defined in Eq. (91). By the approximate linearization of in Eq. (109), we have
Further recall the definition of , and in Eqs. (103), (104) and the definition of and as in Eq. (88), and we define
Then we have , and by Lemma 9.6, we have
We next consider Eq. (93), which requires to control , defined in Eq. (91)). By Eq. (106), we have
Further note , , and for fixed , is non-decreasing in [GMMM19, Lemma 1]. Therefore
By Lemma 9.7 (with ) and Eq. (108), we get
We next consider Eq. (95), where we recall the definition of in Eq. (91) and the definition of in Eq. (59). By observing that (see Eq. (225)) and that , we immediately get
In order to prove Eq. (96), recall the definition of in Eq. (91) and the definition of in Eq. (59). By the decomposition of in Eq. (109) and recalling that , we have
Finally, Eq. (97) is proved analogously to Eq. (96): this completes the proof of the lemma.
3 Proof of Lemma 9.5
To prove Lemma 9.5, we begin with by rewriting the prediction risk —cf. Eq. (55)— as (note that )
This obviously implies the claims of the lemma. In the rest of this proof, we show the variance bound for , as the other bounds are very similar.
Recall the definition of and in Eq. (99), the definition of Gegenbauer coefficients in Eq. (89) and the expansion of and vectors in Eq. (100). We rewrite as
Calculating the variance of with respect to for using Lemma C.8 (which follows from direct calculation), we get
Further note that , , and for fixed , is non-decreasing in [GMMM19, Lemma 1]. Therefore
Substituting above we obtain, and using the fact that by construction, we have
To bound the remaining two terms in this expression, note that
where the bound is implied by Lemma 9.7 (when applying Lemma 9.7, we change the role of and , the role of and , and the role of and ; this can be done because the role of and is symmetric), and by Eq. (107) and (by Assumption 1 and note that by Eq. (225)). This proves that
The bound on the first term in Eq. (114) is obtained analogously and we omit it for brevity.
Proof of Proposition 8.3
This section is organized as follows. We collect the elements to prove Proposition 8.3 in Sections 10.1, 10.2, 10.3, 10.4, and prove the proposition in Section 10.5.
In this subsection, we state Lemma 10.1, which is the key lemma that is used to prove Proposition 8.3. Lemma 10.1 studies and , the partial Stieltjes transforms of the Gaussian counterparts of the matrix as defined in Eq. (60). This lemma shows that these partial Stieltjes transforms and approximately satisfy the fixed point equation which involves functions and as defined in Eq. (69). We will prove Lemma 10.1 in Section 10.3. Later in Section 10.4, we will show that the Gaussian counterpart of the Stieltjes transform shares the same asymptotics with its spherical version.
The matrix is in parallel with its spherical version matrix defined as in Eq. (60).
In what follows, we will write . We would like to calculate the asymptotic behavior of the following partial Stieltjes transforms
The crucial step consists in showing that the expected Stieltjes transforms are approximate solutions of the fixed point equations (70).
The proof of Lemma 10.1 uses a leave-one-out argument in deriving the fixed point equation for Stieltjes transform of random matrices, e.g. [BS10, Chapter 3.3] and [CS13]. We will prove Lemma 10.1 in Section 10.3. In the next subsection, we will collect some lemmas that are used in this proof.
2 Preliminaries of the proof of Lemma 10.1: Stieltjes transforms and the fixed point equation
In the following lemma, we fix a (as defined in Eq. (68)) and fix . Since the parameters are fixed, we will drop them from the argument of F unless necessary. In these notations, Eq. (70) reads
The following lemma shows that there exists a unique fixed point of the equation above in a certain domain provided is large enough.
We rewrite the first equation in Eq. (69) as
In order to prove the Lipschitz continuity of F in this domain, notice that is differentiable and
The functions , have the following properties:
The next lemma establishes the concentration of Stieltjes transforms to its mean, whose proof is the same as the proof of Lemma 9 in [HMRT19].
Let and consider the partial Stieltjes transforms as per Eq. (119). Then there exists such that, for ,
Assume is an activation function with for some constants and (this is implied by Assumption 1). Then
3 Proof of Lemma 10.1: Leave-one-out argument
Throughout the proof, we write if there exists a constant which is uniformly bounded when is in a compact set, such that . We write if for any , there exists a constant which is uniformly bounded when is in a compact set, such that for any . We use to denote generically such a constant, that can change from line to line.
We will assume throughout the proof. For , the lemma holds by viewing as a new kernel inner product matrix, with .
Step 1. Calculate the Schur complement and define some notations.
We decompose the vectors in the components along and the orthogonal component:
We next write as the sum of three terms:
and , , and
where .
Step 2. Perturbation bound for the Schur complement.
Moreover, by Lemma 10.7, is deterministically bounded by . This gives
Using the definitions of and as in Eq. (145) and (146), for , we have
Hence we have , and, using a similar argument, . Hence we get the bound .
Under the assumptions of Lemma 10.1, we have
Since is a sub-matrix of , we have . By the intermediate value theorem
Note that is a polynomial with some fixed degree . Therefore we have
Under the assumptions of Lemma 10.1, we have
For , note we have where and are independent. Hence we have
For , note is a polynomial with some fixed degree , hence we have
Step 3. Simplification using Sherman-Morrison-Woodbury formula.
Notice that is a matrix with rank at most two. Indeed
Since we assumed so that , the matrix is invertible with .
Recall the definition of in Eq. (146). By the Sherman-Morrison-Woodbury formula, we get
By auxiliary Lemmas 10.10, 10.11, and 10.12 below, we get
Elementary algebra simplifying Eq. (159) gives . This proves Eq. (120) in Lemma 10.1. Eq. (121) follows by the same argument (exchanging and ). In the rest of this section, we prove auxiliary Lemmas 10.10, 10.11, and 10.12.
Using the formula of and as in Eq. (156) and (159), for , we have
Combining with proves the lemma. ∎
Under the assumptions of Lemma 10.1, we have (following the notations of Eq. (157) and (158))
The first bound is because (see Lemma 10.3 for the boundedness of and )
In the following, we limit ourselves to proving Eq. (161), since Eq. (162) and (163) follow by similar arguments.
Then by the definition of in Eq. (157), we have . Note we have
We next compute . By a similar calculation of Lemma C.8, we have (for a complex matrix, denote to be the transpose of , and to be the conjugate transpose of )
By Lemmas 10.4, partial Stieltjes transforms are stable with respect to deleting one row and one column of the same index. By Lemma 10.13 (which will be stated and proved later), partial Stieltjes transforms are stable with respect to small changes of the dimension . Moreover, by Lemma 10.5, partial Stieltjes transforms concentrate tightly around their mean. As a consequence of all these lemmas (Lemma 10.4, 10.13, 10.5), we have
Combining with Eq. (167) proves Eq. (161). ∎
The following lemma is the analog of Lemma B.7 and B.8 in [CS13].
Under the assumptions of Lemma 10.1, we have (using the definitions in Eq. (155), (157) and (158) )
Step 1. Bounding . By Sherman-Morrison-Woodbury formula, we have
Step 2. Bounding . Define and . By Lemma 10.11, we have
Combining with Eq. (170) and (171), we get
The last equality holds because is deterministic. Hence we have
The following lemma shows that, the partial resolvents are stable with respect to small changes of the dimension .
We denote and for to be
Then it’s easy to see that .
where . This gives
where . This gives
we have . Therefore, we get
Combining all these bounds establishes Eq. (176). Finally, Eq. (177) can be shown using the same argument. ∎
4 Equivalence between Gaussian and sphere version of Stieltjes transforms
In this subsection, we show that the Stieltjes transform of matrix as defined in Eq. (60) and that of matrix as defined in Eq. (117) share the same asymptotics. For the reader’s convenience, we restate the definitions of these two matrices here.
and the Stieltjes transforms and , defined by
The readers could keep in mind: a quantity with an overline corresponds to the case when features and data are Gaussian, while a quantity without overline usually corresponds to the case when features and data are on the sphere.
Step 1. Show that the resolvent is stable with respect to nuclear norm perturbation.
Since is fixed, we have
The nuclear norm of the term involving can be easily bounded by
For term , denoting , we have
Step 3. Bound for .
Define . Define . We have (for between and )
where , and (so is a polynomial). It is easy to see that
Therefore, we have (denoting to be the degree of polynomial , and to be a constant that only depends on )
5 Proof of Proposition 8.3
Step 1. Polynomial activation function .
By the property of Stieltjes transform as in Lemma 10.3 , we have
By the concentration result of Lemma 10.5, for , we also have
Then we use Lemma 10.14 to transfer this property from to . Recall the definition of resolvent in sphere case in Eq. (61). Combining Lemma 10.14 with Eq. (185), we have
Step 2. General activation function satisfying Assumption 1.
Further, by continuity of the solution of the fixed point equation with respect to when for some large (as stated in Lemma 10.2), we have for ,
Combining Eq. (188), (189), and (190), we obtain
Taking proves Eq. (71).
Step 3. Uniform convergence in compact sets (Eq. (72)).
By the concentration of to its expectation (which is the spherical version of Lemma 10.5), we have
and since is a finite set, we have
This high probability bound will become an expectation bound by the uniform boundedness of for in any compact domain. That is, we have
Combining Eq. (191), (192) and (194), we have
Letting proves Eq. (72). This concludes the proof of Proposition 8.3.
Proof of Proposition 8.4
In Section 11.1 we state and prove some lemmas that are used in the proof of Proposition 8.4. We prove Proposition 8.4 in Section 11.2.
The first lemma concerns the behavior of the partial Stieltjes transforms and when .
Define , , , and . Let be defined as in Eq. (69), F be defined as in Eq. (122), and defined as in Eq. (125). By simple calculus we can see that
The next lemma concerns the behavior of the log-determinants when .
Follow the notations and settings of Proposition 8.4. For any fixed , we have
Combining the bound of the real part and the imaginary part, we have
Recall the definition of as given in Eq. (73). Define
It is easy to see that for any fixed , we have
Combining Eq. (198), (199) and (200), we get
Next, we give some uniform upper bounds on the difference of derivatives of and .
Define , and
Finally, we show that the derivatives of a function in a region can be upper bounded by the function value and the second derivatives of the function in the region.
The proof of Lemma 11.4 is elementary and simply follows from Taylor expansion.
2 Proof of Proposition 8.4
By the expression of in Eq. (73), we have
By fixed point equation (70) with defined in (69), we obtain that
As a result, by the definition of given in Eq. (74), and by formula for implicit differentiation, we have
Combining Eq. (202) with Eq. (201), we get
Combining Eq. (203), (204) and (205), we get Eq. (75).
By Eq. (75) and Lemma 11.3, by the covering number argument (similar to the Step 3 in Section 10.5), we get that for any compact region , there is
Taking , by Eq. (206) and by Lemma 11.3 again, there exists some constant , such that
Sending gives Eq. (76). By similar argument we get Eq. (77). This finishes the proof of Proposition 8.4.
Proof of Theorem 3, 4, and 5
To prove this theorem, we just need to show that
More specifically, we just need to show that, the formula for defined in Eq. (16) as coincides with the formula for defined in Eq. (24). By the relationship of and as per Eq. (86), we just need to show the lemma below.
By Proposition 8.3 this is equivalently saying is the analytic continuation of solution of Eq. (70) as defined in Proposition 8.3, when . Defining , we have
and hence . Taking the difference between Eq. (209) and (210), we get
This implies one of and should be of order and the other one should be of order , as .
By definition of and in Eq. (207), we have
2 Proof of Theorem 4
To prove this theorem, we just need to show that
This follows by simple calculus and a lemma below.
Under the same condition of Lemma 12.1, we have
The proof of this lemma is similar to the proof of Lemma 12.1.
3 Proof of Theorem 5
To prove this theorem, we just need to show that
This follows by simple calculus and a lemma below (this lemma is symmetric to Lemma 12.2).
Under the same condition of Lemma 12.1, we have
Proof of Proposition 5.1 and 5.2
Proof of Point (1). When , we have , so that , and . This proves Point (1).
Proof of Point (2). When , substituting the expression for into , we can see that . We also see that and . This proves Point (2).
Proof of Point (3). When , we have
Proof of Point (4). For , taking derivative of and with respect to , we have
It is easy to check that when , the functions and are functions of and , and are independent of (note when , is a function of and doesn’t depend on ). Therefore, and as functions of must be strictly increasing, strictly decreasing, or staying constant on the interval . However, we know , and and are finite. Therefore, we must have that and are strictly decreasing on .
2 Proof of Proposition 5.2
In Proposition 13.1 given by the following, we give a more precise description of the behavior of , which is stronger than Proposition 5.2.
Fix and . Then the function is either strictly increasing in , or strictly decreasing first and then strictly increasing.
Moreover, For any , we have
For any , we have
Minimizing over and , we have
The minimizer is achieved for any , and .
In the following, we prove Proposition 13.1. It is easy to see that
Hence we study the properties of first.
Step 1. Properties of the function . Calculating the derivative of with respect to , we have
has one negative and one positive solution, and is the negative solution of the above equation. Therefore, when , will be strictly decreasing in ; when , will be strictly increasing in . Therefore, we have
Step 2. Properties of the function . For fixed , we look at the minimizer over of the function . The minimum could be different from the minimum , since may not be achievable by when .
One observation is that as a function of is always negative and increasing.
Then for any , and , we have
Let us for now admit this lemma holds. When is such that (i.e. ), we can choose such that , and then gives the minimum of optimizing over . When is such that (i.e. ), there is not a such that holds. Therefore, the best we can do is to take , and then gives the minimum of optimizing over .
Finally, when we minimize jointly over and , note that as long as , we can choose such that , and then gives the minimum of optimizing over and . This proves Proposition 13.1.
It is easy to see that . In the following, we show .
It is easy to see that, when and , both the denominator and numerator is positive, so that .
Step 2. When . Note is the negative solution of the quadratic equation
Differentiating the quadratic equation with respect to , we have
We can see that, since , when , both the denominator and numerator is negative. This proves when . ∎
Acknowledgements
This work was partially supported by grants NSF CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729.
References
Appendix A Technical background
The dimension of each subspace is given by
A.2 Gegenbauer polynomials
We will use the following properties of Gegenbauer polynomials
Note in particular that property 2 implies that –up to a constant– is a representation of the projector onto the subspace of degree- spherical harmonics
then we have the following equation holds in sense
A.3 Hermite polynomials
The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials (up to a scaling in domain) are constructed by Gram-Schmidt orthogonalization of the monomials with respect to the measure , while Hermite polynomial are obtained by Gram-Schmidt orthogonalization with respect to . Since (here denotes weak convergence), it is immediate to show that, for any fixed integer ,
Here and below, for a polynomial, is the vector of the coefficients of . As a consequence, for any fixed integer , we have
where and are given in Eq. (223) and (220).
Appendix B Proof of Proposition 8.2
Moreover, the set of eigenvalues of and are continuous with respect to . Therefore, for any perturbation with and small enough, we have . As a result, we have
Moreover, (defined as in Eq. (60)) is a linear matrix function of , which gives . Hence we have
and using the formula for block matrix inversion, we have
With simple algebra, we can show that Eq. (226) holds.
Appendix C Additional Proofs in Section 9
We define the sequence to be the coefficients of Gegenbauer expansion of :
In the expansion, the zeroth and first order coefficients are , because, according to Assumption 3,
To check point (1), we have , and by Assumption 3 we have , so that (1) holds.
To check point (2), defining and accordingly, we have
With a little abuse of notations, let us define
Moreover, note that , , , and are mutually independent, , , and . Then, for any fixed , we have
C.2 Proof of Lemma 9.6 and Lemma 9.7
To prove Lemma 9.6 and 9.7, first we state a lemma that reformulate and using Sherman-Morrison-Woodbury formula.
Note we have (denoting )
Hence we have (denoting )
By the Sherman-Morrison-Woodbury formula, we have
Simplifying this formula using simple algebra proves Eq. (230). ∎
Step 1. Term . By Lemma C.1, we get
We can simplify in Eq. (233) further, and get
It is easy to see that . Hence the high probability bound translates to an expectation bound. This proves the lemma. ∎
For notation simplicity, we prove this lemma under the case when which is a singleton. We denote . The proof can be directly generalized to the case for arbitrary set .
By Lemma C.1 (when applying Lemma C.1, we change the role of and , and the role of and ; this can be done because the role of and is symmetric), we have
Note we have shown in the proof of Lemma 9.6 that
Since all the quantities above are deterministically bounded by a constant, these high probability bounds translate to expectation bounds.
Plugging in the above bounds into Equation (235), we have
C.3 Some auxiliary lemmas
We denote by the probability law of when . Note that is symmetric, and . By the central limit theorem, converges weakly to as , where is the standard Gaussian measure. In fact, we have the following stronger convergence result.
For any , we have
In order to prove Eq. (236), we note that the left hand side is given by
where the last inequality holds for using the fact that for .
The next several lemmas establish general bounds on the operator norm of random kernel matrices which is of independent interest.
Then there exists a constant depending uniquely on , and a sequence of numbers with , such that
By Lemma C.2 and Markov inequality, we have, for any and all ,
By [DM16, Lemma 20], there exists a constant such that
Defining the event , we have
Then there exists a constant depending uniquely on , such that
In the proof of this lemma, we assume has continuous derivatives. In the case when is only weak differentiable, the proof is the same, except that we need to replace the mean value theorem to its integral form.
Define , and
By the concentration of -squared distribution, it is easy to see that
Moreover, we have (for between and )
Moreover by the assumption that , we have
Then there exists a constant depending uniquely on , such that
By Lemma C.3, there exists a sequence with , such that
Combining this equation with Eq. (245), we get
C.4 The decomposition of kernel inner product matrices
The following lemma (Lemma C.6) is a reformulation of Proposition 3 in [GMMM19]. We present it in a stronger form, but it can be easily derived from the proof of Proposition 3 in [GMMM19]. This lemma was first proved in [EK10] in the Gaussian case. (Notice that the second estimate —on — follows by applying the first one whereby is replaced by
Notice that the second estimate —on — follows by applying the first one —Eq. (246)— whereby is replaced by , and we use .
The following lemma (Lemma C.7) can be easily derived from Lemma C.6. Again, this lemma was first proved in [EK10] in the Gaussian case.
Then we can rewrite the matrix to be
C.5 A lemma on the variance of the quadratic form
Step 1. Term . Calculating the expectation, we have
Step 2. Term . Calculating the expectation, we have
Appendix D Proof of Lemma 8.1
where we have, for
Appendix E Proof sketch for Theorem 6
In this section, we sketch the calculations of Theorem 6. We assume and are constants independent of . Recall that the definitions of two useful resolvent matrix and are
Step 1. The expectation of regularized training error.
By Eq. (45), the regularized training error of random features regression gives
Step 3. The derivatives of the log determinant.
where is given in Eq. (74). The derivatives of can be obtained by differentiating Eq. (73) and using Daskin’s theorem. The theorem then follows by simple calculus.