Surprises in High-Dimensional Ridgeless Least Squares Interpolation
Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani
Introduction
Modern deep learning models involve a huge number of parameters. In many applications, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures are complex enough that they can be trained to interpolate the data even when the actual labels are replaced by entirely random ones.
Despite their enormous complexity, deep neural networks are frequently observed to generalize well in practice. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is commonly taken to be a proxy for overfitting, poor generalization (large gap between training and test error), and hence large test error. In an insightful series of papers, Belkin et al. (2018b, c, a) pointed out that these concepts are in general distinct, and interpolation does not contradict generalization. For example, recent work has investigated interpolation —via kernel ridge regression— in reproducing kernel Hilbert spaces (Liang et al., 2020; Ghorbani et al., 2019). While in low dimension a positive regularization is needed to achieve good interpolation, in certain high dimensional settings interpolation can be nearly optimal.
We study the model (2) in the proportional regime , with a special focus on the overparametrized case . Our main contribution is to show that, by considering different choices of the features distribution , we can reproduce a number of statistically interesting phenomena that have emerged in the context of deep learning.
From a technical perspective, our main results are: Theorems 2 and 5, which assume the linear model with a vector with independent coordinates; and Theorem 8, which assumes a nonlinear model with . While the linear model has already a attracted significant amount of work (see Section 1.3 for an overview), Theorems 2 and 5 provide a more accurate non-asymptotic approximation of the prediction risk, as compared to available results in the literature.
The prediction risk depends on the geometry of the pair . We consider a few different choices for this geometry, which are broadly motivated by our objective to understand overparametrized models, and specialize our formulas to these special cases:
Isotropic features. This is the simplest case, in which and therefore —as we will see— the asymptotic risk depends on only through its norm . This simple model captures some interesting features of overparametrization, but misses others.
Nonlinear model. In all of the previous cases, the distribution of is of the form where is a vector with independent coordinates. In order to test the generality of our results, we consider a model in which is obtained by passing through a one-layer neural net with random first layer weights.
We will summarize our results for these four examples in the next subsection.
A skeptical reader might ask what linear models have to do with neural networks. We emphasize that linear models provide more than a simple analogy, and a recent line of work outlines a concrete connection between the two settings (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). We will discuss this connection in Section 1.2.
We denote by the overparametrization ratio. When , we call the problem underparametrized, and when , we call it overparametrized. Our most general results for the linear model (Theorem 2 and 5) apply to a non-asymptotic setting in which are finite, and provide a deterministic approximation of the risk with error bounds that are uniform in the distribution of the data. We will occasionally consider a simpler asymptotic scenario in which both and diverge with .
Our main results are twofold: We show that by taking , we can easily construct scenarios in which the minimum of the risk is achieved in the overparamertized regime ; We show that these findings are robust to the details of the distribution of .
As a preliminary remark, note that in the underparametrized regime (), the min-norm estimator coincides with the standard least squares estimator. Its risk is purely variance (there is no bias), and does not depend on (see Proposition 2). Interestingly, the asymptotic risk diverges as we approach the interpolation boundary (as ).
In contrast, in the overparametrized regime (), the risk is composed of both bias and varianceNote that in the overparametrized regime the bias is non-vanishing even in the interpolation limit . The reason is that the set of interpolators is an affine space of dimension , and the min-norm criterion selects one specific interpolator which has norm smaller than ., and generally depends on (see Theorem 2).
We next highlight some concrete results for the four models discussed in the previous section (unless explicitly said, we refer to the min-norm estimator).
In either case, the risk approaches the null risk as , and achieves its global minimum in the underparametrized regime (see Section 3.2).
Optimal tuning of the ridge penalty can be achieved by leave-one-out cross-validation (see Theorem 7).
In this case and the risk depends on the geometry of , and in particular on how aligns with the eigenvectors of .
If the coefficients vector is equidistributed along the eigenvectors of , the behavior is qualitatively similar to the isotropic case. This situation arises, for instance, if is itself random with a spherical prior.
If is aligned with the top eigenvectors of , the situation is qualitatively different. As an example we obtain an explicit formula for the asymptotic risk in the latent space model discussed above, see Figure 5 for an illustration. We find that, for natural choices of the model parameters, the risk is monotone decreasing in the overparametrized regime, and reaches its global minimum as . This qualitative behavior matches the one observed for neural networks (see Section 5.4).
For the latent space model, we observe that, at large overparametrization, the minimum error is achieved as , i.e. by min-norm interpolators (see Section 6.2, and Section 1.3 for related work).
From a technical viewpoint, analysis of the isotropic covariates model is straightforward and relies on standard random matrix theory results. However, we believe it provides useful insights.
In contrast, the results for general covariance and coefficients structure is technically novel. We discuss related work in Section 1.3. Our results for the nonlinear model are also novel. In this setting, we derive a new asymptotic result on resolvents of certain block matrices, which may be of independent interest (see Lemma 3).
We next discuss some intuition behind and implications of our results.
The shape of the asymptotic risk curve for min-norm least squares is, of course, controlled by its components: bias and variance. For fully specified models, the bias increases with in the overparametrized regime, which is intuitive. When , the min-norm least squares estimate of is constrained to lie the row space of , the training feature matrix. This is a subspace of dimension lying in a feature space of dimension . Thus as increases, so does the bias, since this row space accounts for less and less of the ambient -dimensional feature space.
Recently, Belkin et al. (2018a) pointed out a fascinating empirical trend where, for popular methods like neural networks and random forests, we can see a second bias-variance tradeoff in the out-of-sample prediction risk beyond the interpolation limit. The risk curve here resembles a traditional U-shape curve before the interpolation limit, and then descends again beyond the interpolation limit, which these authors call “double descent”. A closely related phenomenon was found earlier by Spigler et al. (2018), who studied the “jamming transition” from underparametrized to overparametrized neural networks. Our results formally verify that this double descent phenomenon occurs even in the simple and fundamental case of least squares regression. The appearance of the second descent in the risk, past the interpolation boundary (), is explained by the fact that the variance decreases as grows, as discussed above.
In the misspecified case, the variance still decreases with (for the same reasons), but interestingly, the bias can now also decrease with , provided is not too large (not too far past the interpolation boundary). The intuition here is that in a misspecified model, some part of the true regression function is always unaccounted for, and adding features generally improves our approximation capacity. As a consequence, the double descent phenomenon can be even more pronounced in the misspecified case (depending on the strength of the approximation bias), and that the risk can attain its global minimum past the interpolation limit.
The min-norm least squares estimator can be seen as the limit of ridge regression as the tuning parameter tends to zero. It is also the convergence point of gradient descent run on the least squares loss. We would not in general expect the best-predicting ridge solution to be at the end of its regularization path. Our results, comparing min-norm least squares to optimally-tuned ridge regression, show that (asymptotically) this is never the case, when is incoherent with respect to the eigenvectors of . This is for instance the case when , or is distributed according to a spherically symmetric prior. In contrast, Kobak et al. (2020) recently pointed out that —when is aligned with the leading eigenvectors of — min-norm regression can have optimal risk (i.e. the optimal regularization vanishes). We show that this is indeed the case in the latent space model mentioned above: this provides indeed an extremely simple example of a phenomenon that has been observed in the past for kernel methods Liang et al. (2020).
As mentioned above, early-stopped gradient descent is known to be closely connected to ridge regularization, see, e.g., Ali et al. (2019) which proves a tight coupling between the two (see Section 1.3 for further related work).
In practice, of course, we would not have access to the optimal tuning parameter for ridge (optimal stopping for gradient descent), and we would rely on, e.g., cross-validation (CV). Our theory shows that for ridge regression, CV tuning is asymptotically equivalent to optimal tuning (and we would expect the same results to carry over to gradient descent, but have not pursued this formally).
2 Connection to neural networks
As mentioned above, recent literature has pointed out a direct connection between linear models and more complex models such as neural networks (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). Our work contributes to this line of work and in particular points at the important role played by universality, a core concept in random matrix theory Tao (2012).
This model is still nonlinear in the input , but is linear in the parameters . We are therefore led to consider a linear regression problem, with random features , , of high-dimensionality ( much greater than ). Notice that the features are random because of the initialization . Further, since , many vectors give rise to a model that interpolates the data.
A different approach relies on a universality hypothesis. Recall that have joint distribution , and and assume these quantities to have zero meanWe further note that the assumption of zero mean is mainly for convenience, as a non-zero mean of the covariates can be effectively compensated by an intercept.. Universality posits that the asymptotic risk of ridge regression (or min-norm interpolation) does not change if we replace the joint distribution by a Gaussian with the same covariance.
While universality is not expected to hold for any distribution of the data , and for any function , it has been confirmed in a number of examples of interest (see Section 1.3). In fact, our analysis of the nonlinear model in Section 8 provides a concrete example in which universality can be confirmed rigorously.
Under the universality hypothesis, we can assume to be jointly Gaussian. If the true joint distribution of the data is also Gaussian, it is sufficient to study the data distribution
where are chosen to match the second order statistics of the original model. This is a special case of our ‘linear model’ (2). Summarizing, replacing the original nonlinear model , with the linear model rests on two key steps. The first step is the linearization in Eq. (3): this has been proved to be a good approximation under certain ‘lazy training’ schemes Chizat and Bach (2018b). The second step is the universality hypothesis, which is quite common (albeit for simpler models) in random matrix theory Tao (2012).
Finally, although we believe our our results are relevant to understanding overparametrized neural networks, the concrete correspondence outlined above only holds in a certain ‘lazy training’ regime, in which network weights do not change much during training, More generally, in a neural network, the feature representation and the regression function or classifier are learned simultaneously. In both our linear and nonlinear model settings, the features are not learned, but observed. Learning could significantly change some aspects of the behavior of an interpolator. (See for instance Chapter 9 of Goodfellow et al. (2016), and also Chizat and Bach (2018b); Zhang et al. (2019), which emphasize the importance of learning the representation.)
3 Related work
Ridge regression with random features has been studied in the past. Dicker (2016) considers a model in which the covariates are isotropic Gaussian and computes the asymptotic risk of ridge regression in the proportional asymptotics , with . Dobriban and Wager (2018) generalize these results to , where has independent entries with bounded -th moment.
Recently, Advani and Saxe (2017) study the effect of early stopping and ridge regularization in a model with isotropic Gaussian covariates , again focusing on the proportional asymptotics , with . They show that this simple model reproduces several phenomena observed in neural networks training. The same model is reconsidered in concurrent work by Belkin et al. (2019), who obtain exact results for the expected risk of min-norm regression, relying on the jointly Gaussian distribution of . We contribute to this line of work by extending the analysis to general covariance structures and to misspecified models. As we will see, these generalizations allow to produce examples for which the global minimum of the risk is achieved in the overparametrized regime .
The importance of the relation between the coefficient vector and the eigenvectors of was emphasized by Kobak et al. (2020) and Bartlett et al. (2020). These papers point out —under different asymptotic settings— that (i.e., min-norm regression) can be optimal or nearly optimal. After a preprint of this paper appeared, Wu and Xu (2020) and Richards et al. (2020) generalized our earlier results to cover the case in which is potentially aligned with . We review in further detail these important generalizations in Section 4. We contribute to this line of work by obtaining non-asymptotic approximations for the risk, with explicit and nearly optimal error bounds. These hold under weaker assumptions on the geometry of than the results of Wu and Xu (2020); Richards et al. (2020).
For the nonlinear model, the random matrix theory literature is much sparser, and focuses on the related model of kernel random matrices, namely, symmetric matrices of the form . El Karoui (2010) studied the spectrum of such matrices in a regime in which can be approximated by a linear function (for ) and hence the spectrum converges to a rescaled Marchenko-Pastur law. This approximation does not hold for the regime of interest here, which was studied instead by Cheng and Singer (2013) (who determined the limiting spectral distribution) and Fan and Montanari (2015) (who characterized the extreme eigenvalues). The resulting eigenvalue distribution is the free convolution of a semicircle law and a Marchenko-Pastur law. In the current paper, we must consider asymmetric (rectangular) matrices , whose singular value distribution was recently computed by Pennington and Worah (2017), using the moment method. Unfortunately, the prediction variance depends on both the singular values and vectors of this matrix. In order to address this issue, we apply the leave-one out method of Cheng and Singer (2013) to compute the asymptotics of the resolvent of a suitably extended matrix. We then extract the information of interest from this matrix. After appearance of a preprint of this paper, Mei and Montanari (2019) extended the results presented here, to obtain a complete characterization of the risk for the non-linear random features model.
Let us finally mention that the universality (or ‘invariance’) phenomenon is quite common in random matrix theory Tao (2012). In the context of kernel inner product random matrices, it appears (somewhat implicitly) in Cheng and Singer (2013) and (more explicitly) in Fan and Montanari (2015). After a first appearance of this manuscript, universality has been investigated in the context of neural networks in several papers Mei and Montanari (2019); Montanari et al. (2019); Gerace et al. (2020); Goldt et al. (2020); Hu and Lu (2020); Adlam and Pennington (2020).
4 Outline
Section 2 provides important background. Sections 3–7 consider the linear model case, focusing on isotropic features, correlated features, misspecified models, ridge regularization, and cross-validation, respectively. Section 8 covers the nonlinear model case. Nearly all proofs are deferred until the appendix.
Preliminaries
We describe our setup and gather a number of important preliminary results.
Consider a test point , independent of the training data. For an estimator (a function of the training data ), we define its out-of-sample prediction risk (or simply, risk) as
where . Note that our definition of risk is conditional on (as emphasized by our notation ). Note also that we have the bias-variance decomposition
2 Ridgeless least squares
This can be equivalently written as , where is the pseudoinverse of . An alternative name for (4) is the “ridgeless” least squares estimator, motivated by the fact that , where denotes the ridge regression estimator:
or, equivalently, .
When has full column rank the min-norm least squares estimator reduces to , the usual least squares estimator. When has rank , importantly, this estimator interpolates the training data: , for .
Lastly, the following is a well-known fact that connects the min-norm least squares solution to gradient descent (as referenced in the introduction).
Initialize , and consider running gradient descent on the least squares loss, yielding iterates
where we take (and is the largest eigenvalue of ). Then , the min-norm least squares solution in (4).
3 Bias and variance
We recall expressions for the bias and variance of the min-norm least squares estimator, which are standard.
Under the model (1), (2), the min-norm least squares estimator (4) has bias and variance
where is the (uncentered) sample covariance of , and is the projection onto the null space of .
4 Underparametrized asymptotics
We consider an asymptotic setup where , in such a way that . Recall that when , we call the problem underparametrized; when , we call it overparametrized. Here, we recall the risk of the min-norm least squares estimator in the underparametrized case. The rest of this paper focuses on the overparametrized case.
The following is a known result in random matrix theory, and can be found in Chapter 6 of Serdobolskii (2007). It can also be found in the wireless communications literature, see Chapter 4 of Tulino and Verdu (2004).
Assume the model (1), (2), and assume is of the form , where is a random vector with i.i.d. entries that have zero mean, unit variance, and a finite 4th moment, and is a (sequence of) deterministic positive definite matrix, such that , for all and a constant (here is the smallest eigenvalue of ). Then as , such that , the risk of the least squares estimator (4) satisfies, almost surely,
As it can be seen from the last proposition, in the underparametrized case the risk is just variance. In contrast, in the overparametrized case, the bias is nonzero, because is. This will be the focus of the next sections.
Isotropic features
We begin by considering the simpler case in which . In this case the limiting bias is relatively straightforward to compute and depends on only through . In Section 4, we generalize our analysis and study the dependence of the bias on the geometry of and .
Choosing so that , the th standard basis vector, then averaging over , yields
It is possible to show that, concentrates around its expectation ant therefore , almost surely. This is stated formally in the next section.
2 Limiting risk
As the next result shows, the independence of the risk on is still true outside of the Gaussian case, provided the features are isotropic. The next result can be proved as a corollary of the more general Theorem 3 below. We give a simpler self-contained proof using a theorem of Rubio and Mestre (2011) in Appendix C.2.
Assume the model (1), (2), where has i.i.d. entries with zero mean, unit variance, and a finite moment of order , for some . Also assume that for all . Then for the min-norm least squares estimator in (4), as , such that , it holds almost surely that
Hence, summarizing with Proposition 2, we have
We can see that the limiting norm, as a function of , has a somewhat similar profile to the limiting risk in (8): it is monotonically increasing on , diverges at the interpolation boundary, and is monotonically decreasing on .
Correlated features
We broaden the scope of our analysis from the last section, where we examined isotropic features. In this section, we take to be of the form , where is a random vector with independent entries that have zero mean and unit variance, and is arbitrary (but still deterministic and positive definite).
We next state our assumptions about the data distribution: our results will be uniform with respect to the (large) constants , appearing in this assumption.
The covariates vector is of the form , where, defining as per Eq. (9), we have
, .
, .
Condition bounds the tail probabilities on the covariates. Requiring finite moment of all order is useful to get strong bounds on the deviations of the risk from its predicted value. As discussed below, bounds on the first few moments are sufficient if we are satisfied in weaker probability bounds.
Conditions requires the eigenvalues of to be bounded, and not to accumulateThe latter assumption could have been further relaxed, by requiring only of the eigenvalues to be non-vanishing and to satisfy the other conditions. This would require to redefine as . near . For the analysis of min-norm interpolation, we will add the additional assumption that the minimum eigenvalue of is bounded away from zero. However condition is sufficient for the analysis of ridge regression in Section 6.
Finally, as our statements are non-asymptotic, we do not assume to converge to to a value. However condition requires to be bounded and bounded away from the interpolation threshold .
We then define the predicted bias and variance by
Note that evaluating , numerically is relatively straightforward, with the most complex part being the solution of Eq. (10). The next theorem establishes that —under suitable technical conditions— the functions , characterize the test error. Similar theorems were proved in Wu and Xu (2020); Richards et al. (2020), which generalized an earlier version of this manuscript to account for the geometry of .
Assume the data model (1), (2) and that the covariates distribution satisfies Assumption 1. Further assume . Define and let be the min-norm least squares estimator in Eq. (4).
Then for any constants (arbitrarily large) there exist such that, with probability at least the following hold
where and are given in Definition 2.
The proof of this theorem is deferred to Section B.
The order of the error bound in Eqs. (14), (15) is not optimal: a central limit theorem heuristics suggests the deterministic approximation to be accurate up to an error of order . Indeed, we are able to establish the correct order in the case of ridge regression, see Theorem 5.
Note that Theorem 2 establishes deterministic approximations for the bias and variance, that are valid at finite . The overparametrization ratio is a non-asymptotic quantity, and the error bounds are uniform, i.e. depend uniquely on the constant . This is to be contrasted with the asymptotic setting of Richards et al. (2020); Wu and Xu (2020). Both of these papers assume a sequence of regression problems with , , and obtain an asymptotically exact expression for the risk.
Technically, Richards et al. (2020); Wu and Xu (2020) apply asymptotic random matrix theory results, such as Ledoit and Peche (2011), while we have to take a longer detour to exploit non-asymptotic results established inKnowles and Yin (2017). We believe that the non-asymptotic approach provides more concrete and accurate statements.
Theorem 2 also implies asymptotic predictions under minimal assumptions. In particular, if the two probability measures , converge weakly to probability measures , on , then we obtain , almost surely.
As pointed out above the condition , (here denotes weak convergence) is strictly weaker than the condition assumed in Wu and Xu (2020) to establish asymptotic results. Further, we require weaker moment conditions.
In the next sections we illustrate the role of the geometry of by considering two models for which , as . First we consider the case , which we refer to as ‘equidistributed:’ the components of are roughly equally distributed along the eigenvectors of . In this case there is no special relation between and .
As a further application, we consider a latent space model in which is aligned with the top eigenvectors of . This can be regarded as a misspecified model, and is therefore presented in Section 5.4 below.
2 Equidistributed coefficients
In this section we assume , , and . One way to generate satisfying this condition is to draw it uniformly at random on the -dimensional sphere of radius . In this case the conditions of Theorem 2 (or Theorem 3), hold with .
Under the assumptions of Theorem 3, further assume , . Then for , with , almost surely
As a special case, we can revisit the isotropic case , which results in . In this case yielding immediately and .
Under the assumptions of Theorem 3, further assume , and let be defined as there. Then as , such that , the min-norm least squares estimator (4) satisfies, almost surely,
Misspecified model
In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting provides another potential motivation for interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation (increasing past 1) can now decrease both bias and variance.
As such, the misspecified model setting also yields further interesting asymptotic comparisons between the and regimes. Recall the isotropic features model of Section 3.2: the risk function in (8) is globally minimized at . This is a consequence of the fact that, in a well-specified linear model at there is no bias and no variance, and hence no risk. In a misspecified model, we will see that the story can be quite different, and the asymptotic risk can actually attain its global minimum on .
Consider, instead of (1), (2), a data model
Given a test point , and an estimator (fit using only, and not ), we define its out-of-sample prediction risk as
Note that this definition is conditional on , and we are integrating over the randomness not only in (the training errors), but in the unobserved features , as well. The next lemma decomposes this notion of risk in a useful way.
Under the misspecified model (19), (20), for any estimator , we have
The first term in the decomposition in Lemma 2 is precisely the risk that we studied previously in the well-specified case, except that the response distribution has changed (due to the presence of the middle term in (20)). We call the second term in Lemma 2 the misspecification bias.
In order to discuss some qualitative features, we focus on the simplest possible model by assuming independent covariates.
2 Isotropic features
Here, we make the additional simplifying assumption that has i.i.d. entries with unit variance, which implies that . (The case of independent features but general covariances is similar, and we omit the details.) Therefore, we may write the response distribution in (20) as
where is independent of , having mean zero and variance , for . Denote the total signal by , and the fraction of the signal captured by the observed features by . Then behaves exactly as we computed previously, for isotropic features in the well-specified setting (Theorem 2 for , and Theorem 1 for ), after we make the substitutions:
Furthermore, we can easily calculate the misspecification bias:
Putting these results together leads to the next conclusion.
Assume the misspecified model (19), (20), and assume has i.i.d. entries with zero mean, unit variance, and a finite moment of order , for some . Also assume that and for all . Then for the min-norm least squares estimator in (4), as , with , it holds almost surely that
We remark that, in the independence setting considered in Theorem 4, the dimension of the unobserved feature space does not play any role: we may equally well take for all (i.e., infinitely many unobserved features).
3 Polynomial approximation bias
Since adding features should generally improve our approximation capacity, it is reasonable to model as an increasing function of . To get an idea of the possible shapes taken by the asymptotic risk curve from Theorem 4, we consider the example of a polynomial decay for the approximation bias,
for some . In this case, the limiting risk in the isotropic setting, from Theorem 4, becomes
4 A latent space model
Here , are noise variables that are mutually independent, and independent of , with , . The features matrix takes the form and therefore, for , min-norm regression amounts to
Notice that the signal-to-noise ratio in the features is equal to . To be definite, we’ll fix , which implies in particular . In order to simplify our calculations, we assume all the non-zero singular values of to be equal, whence . It is not hard to check that this model satisfies the assumptions of Theorem 2, with
Using Theorem 2, we get the following explicit expressions.
Consider the latent space model described above, and assume , . Then, almost surely
where , and is the unique non-negative solution of the following second order equation
Figures 5 and 6 illustrate this corollary by comparing analytical predictions to numerical simulations. We observe that the prediction risk is monotone decreasing in the overparametrization ratio for , and reaches its global minimum asymptotically as (after ). To understand why this happens, notice that each feature vector can be viewed as a noisy measurement of the latent covariates . If the noise was absent, then performing min-norm regression with respect to amounts to minimize would be equivalent to min-norm regression with respect to . To see this, consider again Eq. (25). If we drop the noise , we are minimizing subject to , and the regression function is . Since is orthogonal, this is equivalent to computing subject to .
In presence of noise , the latent features cannot be estimated exactly. However as gets larger, the noise is effectively ‘averaged out’ and we approach the idealized situation in which the ’s are observed.
Ridge regularization
We generalize the formulas of Section 4 to non-vanishing ridge regularization. We work under the same assumptions of that section. In particular, recall that is the empirical distribution of eigenvalue of , and the same empirical distribution, reweighted by the projection of onto the eigenvector of the covariance . (Recall the eigenvalue decomposition .)
Further define via
These definitions are extended analytically to whenever possible. We then define the predicted bias and variance by
We next state our deterministic approximation of the risk.
Let , and Assumption 1 hold. Further assume and . Let be the ridge estimator of Eq. (5).
Then for any constants (arbitrarily large) and (arbitrarily small), there exist such that, with probability at least the following hold
where and are given in Definition 2.
The proof of this theorem is deferred to Appendix A. As for theorem 2, similar results were proved in Wu and Xu (2020); Richards et al. (2020), subsequently to a first version of this manuscript that only focused on random . The same comparison of Remark 1 applies here.
In particular, Theorem 5 establishes non-asymptotic deterministic approximations for the bias and variance . The error terms are uniform over the covariance matrix, and have nearly optimal dependence upon the sample size . Indeed, a central-limit theorem heuristics suggests indeed fluctuations of order .
As for the case of min-norm regression, it is easy to extract asymptotic prediction under minimal assumptions.
As a special case, we can consider the simple isotropic model that was already studied in Section 3. Very similar (though not identical) results can be found in Dicker (2016); Dobriban and Wager (2018).
Assume the conditions of Theorem 1 (well-specified model, isotropic features). Then for ridge regression estimator in (5) as , such that , it holds almost surely that
Here is given by Eq. (36) which in this case has the explicit solution .
Furthermore, the limiting ridge risk is minimized at , in which case we have the simpler expression .
Under the conditions of Theorem 4 (misspecified model, isotropic features), the limiting risk of ridge regression is as in the first two displays, and the optimal limiting risk is as in the third, after we make the substitutions in (21) and add , to each expression.
It is easy to recover the formulas in Theorem 1 as a limiting case of Eq. (43), by using the asymptotics for and for and for .
Figures 8 and 8 compare the risk curves of min-norm least squares to those from optimally-tuned ridge regression, in the well-specified and misspecified settings, respectively. There are two important points to make. The first is that optimally-tuned ridge regression is seen to have strictly better asymptotic risk throughout, regardless of , , . This should not be a surprise, as by definition optimal tuning should yield better risk than min-norm least squares, which is the special case given by .
As we will see, the behavior is rather different in the latent space model.
2 Latent space model
As a special application, we consider the latent space model of Section 5.4. It is immediate to specialize Eqs. (38) and (39) to this case. We omit giving giving explicit formulas for brevity, and instead plot the resulting curves for the prediction risk (test error).
In Figure 10 we plot the risk as a function of the overparametrization ration for several values of the regularization parameter (included the ridgeless limit ). The setting here is analogous to the one of Fig. 5. We observe several interesting phenomena:
Independently of in the probed range, the risk is minimized at large overparametrization .
As expected, the divergence of the risk at the interpolation threshold is smoothed out by regularization, and the risk becomes a monotone decreasing function of when is large enough. Crucially, the optimal amount of regularization (corresponding to the lower envelope of these curves) results in a monotonically decreasing risk.
At large overparametrization, the optimal value of the regularization parameter is .
We confirm the last finding in Fig. 10 which plots the risk as a function of : the optimal regularization is . Notice that this is the case despite the fact that the observations are noisy, namely strictly.
The optimality of has a known intuitive explanation that is worth recalling here. Recall that ridge predictor at point takes the form
where we introduced the kernel matrix , and the vector . In the present case can ve viewed as a noisy version of , namely Consider the case in which the covariates contain noise, as is our case, where . (While we are considering , it is instructive to regard the noise standard deviation as a parameter.) Then, we might expect . If this approximations hold, the noise acts as an extra ridge term, which can be sufficient to regularize the problem.
To the best of our knowledge, this argument was first presented by Webb (1994) and, more explicitly, by Bishop (1995). Recently, Kobak et al. (2020) elucidated its role in linear regression, establishing several of its consequences. In a parallel line of work, a closely related idea has recently emerged in the analysis of kernel methods in high dimension El Karoui (2010); Liang et al. (2020).
Cross-validation
We analyze the effect of using cross-validation to choose the tuning parameter in ridge regression. In short, we find that choosing the ridge tuning parameter to minimize the leave-one-out cross-validation error leads to the same asymptotic risk as the optimally-tuned ridge estimator. The next subsection gives the details; the following subsection presents a new “shortcut formula” for leave-one-out cross-validation in the overparametrized regime, for min-norm least squares, akin to the well-known formula for underparametrized least squares and ridge regression. We refer to Craven and Wahba (1979); Golub et al. (1979a) for background on CV and GCV, and to Arlot and Celisse (2010) for a more recent review.
Recomputing the leave-one-out predictors , can be burdensome, especially for large . Importantly, there is a well-known “shortcut formula” that allows us to express the leave-one-out CV error (45) as a weighted average of the training errors,
where is the ridge smoother matrix. This identity is an immediate consequence of the Sherman-Morrison-Woodbury formula. In the next subsection, we give an extension to the case and , i.e., to min-norm least squares.
The next result shows that, for isotropic features, the CV error of a ridge estimator converges almost surely to its prediction error. The focus on isotropic features is only for simplicity: a more general analysis is possible but is not pursued here. The proof, given in Appendix C.6, relies on the shortcut formula (46). In the proof, we actually first analyze generalized cross-validation (GCV), which turns out to be somewhat of an easier calculation (see the proof for details on the precise form of GCV), and then relate leave-one-out CV to GCV.
with the right-hand side above being the asymptotic risk of optimally-tuned ridge regression. Further, the exact same set of results holds for GCV.
Similar results were obtained for various linear smoothers in Li (1986, 1987), for the lasso in the high-dimensional (proportional) asymptotics in Miolane and Montanari (2018), and for general smooth penalized estimators in Xu et al. (2019). The latter paper covers ridge regression as a special case, and gives more precise results (convergence rates), but assumes more restrictive conditions.
The key implication of Theorem 7, in the context of the current paper and its central focus, is that the CV-tuned or GCV-tuned ridge estimator has the same asymptotic performance as the optimally-tuned ridge estimator. In other words, the ridge curves in Figures 1, 8, and 8 can be alternatively viewed as the asymptotic risk of ridge under CV tuning.
2 Shortcut formula for ridgeless CV
We extend the leave-one-out CV shortcut formula (46) to work when and , i.e., for min-norm least squares. In this case, both the numerator and denominator are zero in each summand of (46). To circumvent this, we can use the so-called “kernel trick” to rewrite the ridge regression solution (5) with as
Using this expression, the shortcut formula for leave-one-out CV in (46) can be rewritten as
Taking yields the shortcut formula for leave-one-out CV in min-norm least squares (assuming without a loss of generality that ),
In fact, the exact same arguments given here still apply when we replace by a positive definite kernel matrix (i.e., for each , where is a positive definite kernel function), in which case (48) gives a shortcut formula for leave-one-out CV in kernel ridgeless regression (the limit in kernel ridge regression as ). We also remark that, when we include an unpenalized intercept in the model, in either the linear or kernelized setting, the shortcut formula (48) still applies with or replaced by their doubly-centered (row- and column-centered) versions, and the matrix inverses replaced by pseudoinverses.
Nonlinear model
Our analysis in the previous sections assumed , with a vector with i.i.d. entries. As discussed in the introduction, we expect our results to have implications on certain neural networks in the lazy training regime, via the universality hypothesis. Roughly speaking, if covariates are obtained by applying the featurization map to data , the asymptotic behavior of ridge regression is expected to be close to the one of an equivalent Gaussian model, whose second-order statistics match the ones of .
We focus on a simple case in which the second order statistics match the ones of the isotropic model.
Notice that, conditionally on , the vectors , are independent. However, they do not have independent coordinates: intuitively if is significantly smaller than , contains much less randomness than a vector of independent entriesFor instance, if , we can reconstruct from the first coordinates of , and therefore the remaining coordinates of are a function of the first ..
Then for , the variance satisfies, almost surely,
which is again as in the case of linear isotropic features, recall Theorem 1.
The proof of Theorem 8 is lengthy and will be sketched shortly.
where are jointly Gaussian with unit variance and covariance . Denoting by the decomposition of into orthonormal Hermite polynomials, we thus obtain
Despite the nonlinear model matches the second-order population statistics of the isotropic model, it is far from obvious that the resulting least norm regression risk is the same. Indeed the coordinates of vector are highly dependent, as it can be seen by noting that they are a function of only independent random variables. Theorem 8 confirms that —despite dependence— the risk is asymptotically the same, thus providing a concrete example of the general universality phenomenon.
2 Proof outline for Theorem 8
We introduce the following resolvents (as usual, these are defined for and by analytic continuation, whenever possible, for ):
Here and henceforth, we write for integers . We also write for a matrix , and for a subset . The equalities in the first and third lines above follow by invariance of the distribution of under permutations of and . Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write , , and .
The next lemma characterizes the asymptotics of , .
Assume the conditions of Theorem 8. Consider or , , with . Let and be the unique solutions of the following quadratic equations:
subject to the condition of being analytic functions for , and satisfying for (with a sufficiently large constant). Then, as , such that and , we have almost surely (and in ),
The proof of this lemma is given in Appendix D.2. As a corollary of the above, we obtain that the asymptotical empirical spectral distribution of the empirical covariance matches the one for the independent entries model, and is hence given by the Marchenko-Pastur law (a result already obtained in Pennington and Worah (2017)). We state this formally using the Stieltjes transform
Assume the conditions of Theorem 8. Consider . As , with and , we have (almost surely and in ) where is nonrandom and coincides with the Stieltjes transform of the Marchenko-Pastur law, namely
We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents computed in Lemma 3 to the variance of min-norm least squares, hence finishing our proof outline.
Assume the conditions of Theorem 8. Let , be the asymptotic resolvents given in Lemma 3. Define
Then for , \partial_{x}m(\xi,x)\big{|}_{x=0} as a simple pole at , and hence admits a Taylor-Laurent expansion around , whose coefficients will be denoted by ,
Here each coefficient is a function of : , . Furthermore, for the ridge regression estimator in (5), as , such that , , the following ridgeless limit holds almost surely:
The proof of this lemma can be found in Appendix D.3. Theorem 8 follows by evaluating the formula in Lemma 4, by using the result of Lemma 3. We refer to the appendices material for details.
Acknowledgements
The authors are grateful to Brad Efron, Rob Tibshirani, and Larry Wasserman for inspiring us to work on this in the first place. RT sincerely thanks Edgar Dobriban for many helpful conversations about the random matrix theory literature, in particular, the literature trail leading up to Proposition 2, and the reference to Rubio and Mestre (2011). We are also grateful to Dmitry Kobak for bringing to our attention the Kobak et al. (2020) and clarifying the implications of this work on our setting.
TH was partially supported by grants NSF DMS-1407548, NSF IIS-1837931, and NIH 5R01-EB001988-21. AM was partially supported by NSF DMS-1613091, NSF CCF-1714305, NSF IIS-1741162, and ONR N00014-18-1-2729. RT was partially supported by NSF DMS-1554123.
Appendix A Proofs for the linear model: Risk of ridge regression
Before passing to the proof of Theorem 5, we state and prove a useful lemma in analysis.
Without loss of generality we can assume . Further we set . We consider a finite difference approximation of order of the first order derivative of at . One such approximation is given by (Trefethen, 1996, Chapter 3)
where is a point in the interval . Using the properties of listed above, we obtain
We are left wih the task of proving claims and about the operator . Substituting, these are equivalent to the claims
We further state and proof a lemma about a modified Stieltjes transform.
Let be such that and for some large constant , and . Further assume and . For let be the only solution in of the equation
is analytic and non-decreasing with for a constant uniquely dependent on .
If further , then for all .
We begin from poinr . Since is fixed here, we write , dropping the dependence on . Note that, for , any solution must be such that because otherwise the left hand side is negative and the right-hand side is positive. Call whence the above equation is equivalent to the following equation on :
where . For , we have , and therefore the sequence satisfies the same assumptions as , with replaced by . We will therefore drop the argument and study the fixed point equation (61) (which is equivalent to the case of Eq. (60)).
Let , and note that this function is mononone decreasing on with , and . Equation (61) can be rewritten as
Since is monotone decreasing with and as , this equation has a unique solution, denoted by . By the implicit function theorem, we have
The second inequality follows for instance by noting that . Notice that . By assumption . We will distinguish the two cases and .
Case . Since is decreasing, (62) implies whence
and therefore, by Eq. (63), we have for all .
Case . Let . Since is convex, . Note that . Therefore for some finite constant .
Since on , we have for all . Further,
Therefore by Eq. (63), also in this case.
and therefore , whence
Claim simply follows since by Eq. (60). Since as just shown, this implies .
Finally, for claim , notice that is a solution of the equation
By the implicit function theorem, it is sufficient to prove that \big{|}\frac{\partial^{k+l}f}{\partial u^{k}\partial\eta^{l}}\big{|}\leq C_{k,l}(M) for all and , , and that \big{|}\frac{\partial f}{\partial u}\big{|}\geq 1/C(M).
In order to bound the first derivative away from , note that, again by the implicit function theorem,
where the last inequality follows from point above. In order to bound higher order derivatives, write where . Recall that , and by Eq. (66). Therefore
where , are coefficients depending uniquly on . This sum is bounded since each term is bounded, by the above estimate on , and further using the fact that since by Eq. (66). ∎
For the proof of Theorem 5, it is convenient to introduce the notations and . We begin by recalling the expressions for bias and variance of ridge regression at regularization parameter :
The bias and variance for min-norm interpolation are recovered by taking . In the following, we will occasionally consider to be complex with .
A.2 Proof of Theorem 5, bias term
Note that the bias is homogeneous (of degree ) in . Hence, without loss of generality we can and will assume . For define
Here, for , , is defined as the unique solution of
where are the eigenvalues of . By Eq. (73) we have . Existence and uniqueness of the solution is given, for instance, in (Knowles and Yin, 2017, Lemma 2.2). This lemma also states that is the Stieltjes transorm of a probability measure with support in . As a consequence, for ,
Therefore, taking the limit in Eq. (75), we obtain, with probability at least
Here the last identity is simply the definition of the predicted bias in Eq. (38).
Notice that is infinitely differentiable in with
Note that , (we assumed this to hold without loss of generality) and , whence
We claim that a similar bound holds for the derivative of , namely
In what follows we use the shorhand . Note that, by Eq. (77), solves Eq. (60), and we can therefore apply Lemma 6. Note that
where follows from Lemma 6. and by the theorem’s assumption.
Further, by Lemma 6. we have for all , whence the desired claim (83) follows.
Using Eqs. (82), (83), and (78) in Lemma 5 we get, for any , and any , with probability at least
The claim follows by taking and , and recalling that, by Eqs. (74) and (81), we have
A.3 Proof of Theorem 5, variance term
Further the right-hand side is an analytic function of , for . Using again (Knowles and Yin, 2017, Theorem 3.16.) , we get that for any , , where exists such that, with probability at least ,
uniformly over , . Here is defined as in the previous section, namely as the solution of Eq. (77) with . As shown there , whence
Since both functions \lambda\mapsto\frac{\lambda}{p}{\sf Tr}\big{(}\Sigma(S_{X}+\lambda I)^{-1} and are analytic on , the last bound implies a similar bound on their derivarives, namely
We are left with the task of computing the derivative of :
where the last equality follows by the definition given above.
A.4 Generalization to heavier tail covariates: Proof of Theorem 6
This result follows from Theorem 2 via a standard truncation argument. We outline the argument for the bias as the same proof holds almost unchanged for the variance term. We write the bias to emphasize its dependende upon the population covariance. Using Eq. (69), we obtain, for some
For , decompose the random vriables as
We correspondingly decompose the matricex as
where . Using Eq. (95), we get
Now we can apply Theorem 5 to , since the variables have bounded moments of all orders. Denoting by , the corresponding empirical measures, we have, with probability at least :
Note that and , hence these measures converge weakly to and . Therefore, by Borel-Cantelli
The proof is completed by putting together Eqs. (100) and (101), taking and noting that , in that limit.
Appendix B Proofs for the linear model: Risk min-norm regression
The proof follows from Theorem 5 by approximating min-norm regression with ridge regression with a small value of . For this reason we will assume throughout. We treat separately the bias and variance terms. Since the variance is independent of and the bias is homogeneous in , we can assume, without loss of generality, .
where denotes the smallest non-vanishing singular value and the last inequality holds since by assumption . Finally, since , we have that with probability at least Bai and Yin (1993). Hence, with the same probability \big{|}B_{X}(\hat{\beta}_{\lambda};\beta)^{1/2}-B_{X}(\hat{\beta};\beta)^{1/2}\big{|}\leq C(M)\lambda. Also notice that, by Eq. (103), . Therefore, we obtain
We next claim that Eqs. (11) and (38) imply the bound
In to prove this estimate, recall that , and that, by assumption, is supported on . Define via . Note that , where is the compaion Stieltjes transform already introduced in Eq. (77). In particular, by (Knowles and Yin, 2017, Lemma 2.2), it is non-negative and monotone decreasing in . Further Eq. (36), satisfies
It follows from that is monotone decreasing on for any with and , for any . Further is monotone decreasing in with as . Therefore is monotone decreasing in , with (with defined as per Eq. (10)). Further where is obtained by replacing by , and is obtained by replacing by . Therefore for some finite constant depending uniquely on . In particular, this implies that is uniformly bounded. Finally
For and , the integrand is bounded and bounded away from zero. Therefore for some (possibly different) constant . By the implicit function theorem, it follows that is bounded for , and therefore , whence
(Here denotes a quantity bounded uniformly as .). Sustituting this in Eq. (37), and repeating similar arguments, we obtain
Finally, substituting in Eq. (38), we obtain the claimed bound (105).
Using Eqs. (104), (105) and Theorem 5 we deduce that, for all , and all , with probability at least ,
The desired bound is obtained by setting and choosing small enough.
Variance term. With the same notations introduced above, Eq. (70) implies
Taking the difference and bounding for , we get
The rest of the proof is analogous to one for the bias term and is omitted.
Appendix C Other proofs for the linear model
where the second inequality holds almost surely for all large enough, following from , and the Bai-Yin theorem (Bai and Yin, 1993), which implies that the smallest eigenvalue of is almost surely larger than for sufficiently large . As the right-hand side in the above display is strictly positive, we have that is almost surely invertible. Therefore by the bias and variance results from Lemma 1, we have almost surely and , and also
where is the spectral measure of and the last identity holds if almost surely for all (with a random number) by Eq. (110). By the Marchenko-Pastur theorem (Marchenko and Pastur, 1967; Silverstein, 1995), which says that converges weakly, almost surely, to the Marchenko-Pastur law (depending only on ), whenc, almost surely :
where the last equality follows since is supported in , provided . Thus the last display implies that as , almost surely,
It remains to compute the right-hand side above. We recognize the right-hand side as the evaluation of the Stieltjes transform of Marchenko-Pastur law at . Fortunately, this has an explicit form (e.g., Lemma 3.11 in Bai and Silverstein 2010), for real :
The proof is thus completed by taking the limit in this expression.
C.2 Proof of Theorem 1
Variance term. Recalling the expression for the bias from Lemma 1 (where now ), we have
where , are the nonzero eigenvalues of . Let , denote the eigenvalues of . Then we may write , , and
where is the spectral measure of . Now as , we are back precisely in the setting of Theorem 2, and by the same arguments, we may conclude that almost surely
Bias term. Recall the expression for the bias from Lemma 1 (where now ), and note the following key characterization of the pseudoinverse of a rectangular matrix ,
We can apply this to , and rewrite the bias as
for a deterministic sequence , (defined for each via a certain fixed-point equation). Taking in the above, note that this reduces to the almost sure convergence of the Stieltjes transform of the specral distribution of , and hence by the (classical) Marchenko-Pastur theorem, we learn that , where denotes the Stieltjes transform of the Marchenko-Pastur law . Further, taking , we see from (115) and that, almost surely,
Define . Notice that , and , so
where and denote the largest and smallest nonzero eigenvalues, respectively, of , and the second inequality holds almost surely for large enough , by the Bai-Yin theorem (Bai and Yin, 1993). As its derivatives are bounded, the sequence , is equicontinuous, and by the Arzela-Ascoli theorem, we deduce that converges uniformly to its limit. By the Moore-Osgood theorem, we can exchange limits (as and ) and conclude from (114), (116) that as , almost surely,
Finally, relying on the fact that the Stieltjes transform of the Marchenko-Pastur law has the explicit form in (112), we can compute the above limit:
C.3 Proof of Corollary 2
The corollary follows immediately from Theorem 3, once we estabilish that, for , . By homogeneity, we can assume without loss of generality that . By using Eq. (11), we get
where that last equality follows from the definition of in Eq. (10). Comparing with the definition of in Eq. (16) yields the desired claim.
C.4 The case of equicorrelated features
As an illustration of Corollary 2, we consider the case of equicorrelated covariates. Namellt, for , we assume that for all , and for all .
Assume the conditions of Theorem 2, and moreover, assume that has -equicorrelation structure for all , and some . Then as , with , we have almost surely
Let denote the weak limit of , when has -equicorrelation structure for all . A short calculation shows that such a matrix has one eigenvalue value equal to , and eigenvalues equal to . Thus the weak limit of its spectral measure is simply , i.e., , a point mass at of probability one.
We remark that the present case, strictly speaking, breaks the conditions that we assume in Theorem 2, because clearly diverges with . However, by decomposing (where denotes the vector of all 1s), and correspondingly decomposing the functions defined in the proof of Theorem 2, to handle the rank one part properly, we can ensure the appropriate boundedness conditions. Thus, the result in the theorem still holds when has -equicorrelation structure.
Now denote by the companion Stieltjes transform of the empirical spectral distribution , to emphasize its dependence on . Recall the Silverstein equation (120), which for the equicorrelation case, as , becomes
where is the companion Stieltjes transform in the case, the object of study in Appendix LABEL:app:risk_iso2. From the results for from (LABEL:eq:v0) and (LABEL:eq:v1), invoking the relationship in the above display, we have
Plugging these into the asymptotic risk expression from Theorem 2 gives
C.5 Autoregressive features
We consider a -autoregressive structure for , for a constant , meaning that for all . In this case, it is not clear that a closed-form exists for or . However, we can compute these numerically. In fact, the strategy we describe below applies to any situation in which we are able to perform numerical integration against , where is the weak limit of the spectral measure of .
The critical relationship that we use is the Silverstein equation (Silverstein, 1995), which relates the companion Stieltjes transform to via
Therefore, we can use a simple univariate root-finding algorithm (like the bisection method) to solve for in (121). With computed, we can compute by first differentiating (120) with respect to (see Dobriban 2015), and then taking , to yield
When is of -autoregressive form, it is known to have eigenvalues (Trench, 1999):
This allows us to efficiently approximate an integral with respect to (e.g., by taking each to be in the midpoint of its interval given above), solve for in (121), in (122), and evaluate the asymptotic risk as per Theorem 2.
Figure 13 shows the results from using such a numerical scheme to evaluate the asymptotic risk, as varies from 0 to 0.75.
C.6 Proof of Theorem 7
We begin by recalling an alternative to leave-one-out cross-validation, for linear smoothers, called generalized cross-validation (GCV) (Craven and Wahba, 1978; Golub et al., 1979b). The GCV error of the ridge regression estimator at a tuning parameter value defined as
Compared to the shortcut formula for leave-one-out CV in (46), we can see that GCV in (123) swaps out the th diagonal element in the denominator of each summand with the average diagonal element . This modification makes GCV rotationally invariant (Golub et al., 1979b).
It turns out that the GCV error is easier to analyze, compared to the CV error. Thus we proceed by first studying GCV, and then relating CV to GCV. We break up the exposition below into these two parts accordingly.
Let us rewrite the GCV criterion in (123) as
We will treat the almost sure convergence of the numerator and denominator separately.
The denominator is an easier calculation. Denoting , , we have
where this convergence holds almost surely as , a direct consequence of the Marchenko-Pastur theorem, and denotes the Marchenko-Pastur law. Meanwhile, we can rewrite this asymptotic limit as
where denotes the Stieltjes transform of the Marchenko-Pastur law , and therefore, almost surely,
The numerator requires only a bit more difficult calculation. Let and . Observe
Note that has independent entries with mean zero and variance , and further, note that and are independent. Therefore we can use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), which is adapted from Lemma B.26 in Bai and Silverstein (2010).As written, Lemma 7.6 of Dobriban and Wager (2018) assumes i.i.d. components for the random vector in question, which is not necessarily true of in our case. However, an inspection of their proof shows that they only require independent components with mean zero and common variance, which is precisely as stated in Lemma B.26 of Bai and Silverstein (2010). This result asserts that, almost surely,
A short calculation and application of the Marchenko-Pastur theorem gives that, almost surely,
where for the second sum, we used Vitali’s theorem to show convergence of the derivative of the Stieltjes transform of the spectral distribution of to the derivative of the Stieltjes transform of (note that Vitali’s theorem applies as the function in question is bounded and analytic). By a similar calculation, we have almost surely,
Putting (125), (126) together with (124), we have, almost surely,
To show that this matches to the asymptotic prediction error of ridge regression requires some nontrivial calculations. We start by reparametrizing the above asymptotic limit in terms of the companion Stieltjes transform, abbreviated by . This satisfies
Introducing , the almost sure limit of GCV is
where in the second line we rearranged, in the third line we applied the companion Stieltjes transform facts, and in the fourth line we simplified. We can now recognize the above as plus the asymptotic risk of ridge regression at tuning parameter , either from the proof of Theorem 2, or from Theorem 2.1 in Dobriban and Wager (2018). In terms of the Stieltjes transform itself, this is , which proves the first claimed result.
In the second line, we used , with , and in the third line, we used almost surely for sufficiently large , by the strong law of large numbers, and almost surely for sufficiently large , by the Bai-Yin theorem (Bai and Yin, 1993). Furthermore, writing for the numerator and denominator of , respectively, we have
The above argument just showed that is upper bounded on , and is lower bounded on ; also, clearly ; therefore to show that is almost surely bounded on , it suffices to show that both are. Denoting by , the eigenvectors of (corresponding to eigenvalues , ), a short calculation shows
the last step holding almost surely for large enough , by the law of large numbers. Also,
the last step holding almost surely for large enough , by the Bai-Yin theorem. Thus we have shown that is almost surely bounded on , for large enough , and applying the Arzela-Ascoli theorem, converges uniformly to . With , this means for any for any , almost surely
where we used the optimality of for , and then uniform convergence. In other words, almost surely,
As the almost sure convergence is also uniform for (by similar arguments, where we bound the risk and its derivative in ), we conclude that almost surely , completing the proof for GCV.
C.6.2 Analysis of CV
Let us rewrite the CV criterion, starting in its shortcut form (46), as
where is a diagonal matrix that has diagonal elements , .
First, fixing an arbitrary , we will study the limiting behavior of
Since and are not independent, we cannot immediately apply the almost sure convergence of quadratic forms lemma, as we did in the previous analysis of GCV. But, letting denote the matrix with the th row removed, we can write , and use the Sherman-Morrison-Woodbury formula to separate out the dependent and independent parts, as follows. Letting , , and , we have
Note and are independent (i.e., and are independent), so we can now use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), adapted from Lemma B.26 in Bai and Silverstein (2010), to get that, almost surely
Further, as almost surely by the Marchenko-Pastur theorem, we have, almost surely,
The strategy henceforth, based on the result in (129), is to replace the denominators , in the summands of the CV error by their asymptotic limits, and then control the remainder terms. More precisely, we define , and then write, from (127),
We will first show that almost surely . Observe, by the Cauchy-Schwartz inequality,
where in the second step, we used , which holds almost surely for large enough , by the strong law of large numbers. Meanwhile,
By the Marchenko-Pastur theorem, we have almost surely. Using on the maximands in ,
In the second line above, we used , , and also . In the third line, we used the Sherman-Morrison-Woodbury formula. In the fourth line, for each summand , we upper bounded the numerator by , and lower bounded the denominator by , with (which follows from a short calculation using the eigendecomposition of ). In the fifth line, we used , and in the sixth line, we used almost surely for sufficiently large , by the Bai-Yin theorem (Bai and Yin, 1993).
Now we will show that almost surely by showing that the convergence in (128) is rapid enough. As before, using on the maximands in , we have
Here we used that for , we have , and similarly, , with the last inequality holding almost surely for sufficiently large , by the Bai-Yin theorem (Bai and Yin, 1993). To show almost surely, we start with the union bound and Markov’s inequality, where , and is be specified later,
where in the last step we used for large enough . Since the right-hand side above is summable in , we conclude by the Borel-Cantelli lemma that almost surely, and therefore also almost surely. This finishes the proof that .
As before, it helps to reparametrize in terms of the companion Stieltjes transform, giving
Adding to both sides, then dividing both sides by , yields
and dividing through by , then rearranging, gives
C.7 Misspecified model results
These results follow because, as argued in Section 5.2, the misspecified case can be reduced to a well-specified case after we make the substitutions in (21).
C.8 CV and GCV simulations
Figures 16 and 16 investigate the effect of using CV and GCV tuning for ridge regression, under the same simulation setup as Figures 8 and 8, respectively. It is worth noting that for these figures, there is a difference in how we compute finite-sample risks, compared to what is done for all of the other figures in the paper. In all others, we can compute the finite-sample risks exactly, because, recall, our notion of risk is conditional on ,
and this quantity can be computed analytically for linear estimators like min-norm least squares and ridge regression. When we tune ridge by minimizing CV or GCV, however, we can no longer compute its risk analytically, and so in our experiments we approximate the above expectation by an average over 20 repetitions. Therefore, the finite-sample risks in Figures 16 and 16 are (only a little bit) farther from their asymptotic counterparts compared to all other figures in this paper, and we have included the finite-sample risk of the optimally-tuned ridge estimator in these figures, when the risk is again computed in the same way (approximated by an average over 20 repetitions), as a reference. All this being said, we still see excellent agreement between the risks under CV, GCV, and optimal tuning, and these all lie close to their (common) asymptotic risk curves, throughout.
Appendix D Nonlinear model: Proof of Theorem 8
Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write , , and .
The next theorem is the the central random matrix results, and characterizes the asymptotics of , . It provides a generalization of Lemma 3.
Consider or , , with . Let and be the unique solutions of the following fourth degree equations:
subject to the condition of being analytic functions for , and satisfying for (with a sufficiently large constant). Then, as , such that and , we have almost surely (and in ),
The proof of this theorem is given in Appendix D.2. Now define
where recall . As a corollary of the above, we obtain the asymptotic Stieltjes transform of the eigenvalue distribution of (this result was first derived in Pennington and Worah (2017)).
Assume the conditions of Theorem 9. Consider . As , with and , the Stieltjes transform of spectral distribution of in (137) satisfies almost surely (and in ) where is a nonrandom function that uniquely solves the following equations (abbreviating ):
subject to the condition of being analytic for , and satisfying for (where is a large enough constant). When , the function is the Stieltjes transform of the Marchenko-Pastur distribution.
We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents to the variance of min-norm least squares.
Assume the conditions of Theorem 9. Let , be the asymptotic resolvents given in Theorem 9. Define
Then for , the following Taylor-Laurent expansion holds around :
with each . Furthermore, for the ridge regression estimator in (5), as , such that , , the following ridgeless limit holds almost surely:
The proof of this lemma can be found in Appendix D.3.
D.2 Proof of Theorem 9
The proof follows the approach developed by Cheng and Singer (2013) to determine the asymptotic spectral measure of symmetric kernel random matrices. The latter is in turn inspired to the classical resolvent proof of the semicircle law for Wigner matrices, see Anderson et al. (2009). The present calculation is somewhat longer because of the block structure of the matrix , but does not present technical difficulties. We will therefore provide a proof outline, referring to Cheng and Singer (2013) for further detail.
This decomposition satisfies the following properties:
As mentioned, it is an orthogonal decomposition: . Further, by symmetry and the normalization assumption, .
, , as .
Finally, recall the following definitions for the random resolvents:
In the next section, we will summarize some basic facts about the resolvent , and their analyticity, and some concentration properties of . We will then derive Eqs. (133) and (134). Thanks to the analyticity properties, we will assume throughout these derivations for a large enough constant. Also, we will assume to be fixed throughout, and will not explicitly point out the dependence with respect to these arguments.
Consider, to be definite . Denoting by , , the eigenvalues and eigenvectors of , we have
Properties - are then standard consequences of being a Stieltjes transform (see, for instance, Anderson et al. (2009)) ∎
Without loss of generality, we will assume , and write , . Define
and , the same quantities when is replaced by . Then, by Schur complement formula, it is immediate to get the formulae
Let and (i.e. or depending whether or not). Define \Delta_{A}\equiv\big{|}{\sf Tr}_{A}\big{[}({W}-\xi{{I}}_{N})^{-1}\big{]}-{\sf Tr}_{A}\big{[}({W}^{(N)}-\xi{{I}}_{N})^{-1}\big{]}\big{|}. Then, using the bounds given above,
Next consider . Notice that
Letting , we have . Therefore, defining ,
which implies that Eq. (157) also holds in this case.
Using Eqs. (151) and (157) yields the desired claim. ∎
If , or , with . Then there exists such that, for ,
In particular, for , or , then almost surely.
The proof is completely analogous to (Cheng and Singer, 2013, Lemma 2.4), except that we use Azuma-Hoeffding instead of Burkholder’s inequality. Consider (we omit the arguments for the sake of simplicity). We construct the martingale
In particular Eqs. (134), (133) admit a unique solution with for .
whence by taking large enough. Analogously
Again, the claim follows by taking large enough. ∎
The next lemma allow to restrict ourselves to cases in which is polynomial with degree independent of .
The proof is essentially the same as for Lemma 4.4 in Cheng and Singer (2013). ∎
(Here is understood to be the same in each case.)
This follows immediately from Lemma 9. Denote, with a slight abuse of notation, by the matrix in Eq. (132). Consider for instance the difference
where is obtained from by zero-ing the last row and column. Lemma 9 then implied . The other terms are treated analogously. ∎
D.2.2 Derivation of Eqs. (133)
Using Eq. (179), and Woodbury’s formula, we get (writing for simplicity
By a concentration of measure argument (see, e.g., (Tao, 2012, Section 2.4.3)), the same statement also holds with high probability
By a concentration of measure argument, we also have
By using this together with Eqs. (199) and (204), we get
D.2.3 Derivation of Eqs. (134)
The derivation si analogous to the one in the previous section (notice that we will redefine some of the notations used in the last section), and hence we will only outline the main steps.
We decompose , , and , into their components along , and the orthogonal complement:
As in the previous section, it can be shown that (with an absolute constant), and therefore Eq. (207) yields
We then proceed to compute the various terms on the right-hand side. The calculation is very similar to the one in the previous section, and we limit ourselves to reporting the results:
Substituting Eqs. (221) to (224) into Eq. (220), we get
D.2.4 Completing the proof
Consider a large enough constant. By Eqs. (206), (206), we have
D.3 Proof of Lemma D.5
With the above definitions there exists a constant such that
Therefore, on the same event (for a suitable constant )
This implies Eq. (228) because with the stated probability we have for all and for all .
where the convergence takes place almost surely and in .
Denote by , the eigenvalues and eigenvectors of . The following qualitative behavior can be extracted from the asymptotic of the Stieltjes transform as stated in Corollary 6.
For any , , there exists such that the following happens. Let , . Then, the following limits hold almost surely
(Here denotes weak convergence of probability measures.)
Note that as . Further, for or , it it is immediate to show tha is bounded in (in a neighborhood of ). Hence
Since , a weak convergence argument implies almost surely. Further, defining , , Lemma 15 implies , , where is a measure supported on , with . This in turns implies
In particular, is analytic in a neighborhood of . This proves Eq. (141), with , .
Comparing the last two displays, we obtain our claim
D.4 Proof of Corollary 8 and Corollary 6
Throughout this section, set (and drop these arguments for the various functions), and let , , . In this case we have
and therefore, a simple linear algebra calculation yields
Theorem LABEL:thm:var_nonlinear immediately implies (almost surely and in ), where
Equations (133) and (134) simplify for the case (setting ) to yield
Taking a linear combination of these two equations, we get
Comparing this with Eq (260), we get Eq. (138). Substituting the latter in Eqs. (261), (262), we get Eqs. (139), (140).
Finally, Corollary 6 follows by taking .u
D.5 Proof of Theorem 8
First notice that Lemma 3 and 4 follow by taking , in Theorem 9 and Lemma .
The variance result follows simply by taking in Theorem LABEL:thm:var_nonlinear. Solving the quadratic equations (134), we get
The claimed formula for the variance is obtained by using these expressions in Lemma 4.
By Lemma 14, and using the fact that when , we get
with probability larger than . Therefore, by Borel-Cantelli it is sufficient to estabilish the claim for . By Corollary 6, for , the empirical spectral distribution of converges almost surely (in the weak topology) to the Marchenko-Pastur law . Hence (for the eigenvalues of )
Hence the asymptotic bias is the same as in the linear model (for random isotropic features). The claim hence follows by the results of Section 5.2. Alternatively, we may simply recall that and use dominated convergence.