Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Introduction

Modern deep learning models involve a huge number of parameters. In many applications, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures are complex enough that they can be trained to interpolate the data even when the actual labels are replaced by entirely random ones.

Despite their enormous complexity, deep neural networks are frequently observed to generalize well in practice. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is commonly taken to be a proxy for overfitting, poor generalization (large gap between training and test error), and hence large test error. In an insightful series of papers, Belkin et al. (2018b, c, a) pointed out that these concepts are in general distinct, and interpolation does not contradict generalization. For example, recent work has investigated interpolation —via kernel ridge regression— in reproducing kernel Hilbert spaces (Liang et al., 2020; Ghorbani et al., 2019). While in low dimension a positive regularization is needed to achieve good interpolation, in certain high dimensional settings interpolation can be nearly optimal.

We study the model (2) in the proportional regime pnp\asymp n, with a special focus on the overparametrized case p>np>n. Our main contribution is to show that, by considering different choices of the features distribution PxP_{x}, we can reproduce a number of statistically interesting phenomena that have emerged in the context of deep learning.

From a technical perspective, our main results are: Theorems 2 and 5, which assume the linear model xi=Σ1/2zix_{i}=\Sigma^{1/2}z_{i} with ziz_{i} a vector with independent coordinates; and Theorem 8, which assumes a nonlinear model xi=φ(Wzi)x_{i}=\varphi(Wz_{i}) with ziN(0,Id)z_{i}\sim N(0,I_{d}). While the linear model has already a attracted significant amount of work (see Section 1.3 for an overview), Theorems 2 and 5 provide a more accurate non-asymptotic approximation of the prediction risk, as compared to available results in the literature.

The prediction risk depends on the geometry of the pair (Σ,β)(\Sigma,\beta). We consider a few different choices for this geometry, which are broadly motivated by our objective to understand overparametrized models, and specialize our formulas to these special cases:

Isotropic features. This is the simplest case, in which Σ=Ip\Sigma=I_{p} and therefore —as we will see— the asymptotic risk depends on β\beta only through its norm β2\|\beta\|_{2}. This simple model captures some interesting features of overparametrization, but misses others.

Nonlinear model. In all of the previous cases, the distribution of xix_{i} is of the form xi=Σ1/2zix_{i}=\Sigma^{1/2}z_{i} where ziz_{i} is a vector with independent coordinates. In order to test the generality of our results, we consider a model in which xix_{i} is obtained by passing ziN(0,Id)z_{i}\sim{N}(0,I_{d}) through a one-layer neural net with random first layer weights.

We will summarize our results for these four examples in the next subsection.

A skeptical reader might ask what linear models have to do with neural networks. We emphasize that linear models provide more than a simple analogy, and a recent line of work outlines a concrete connection between the two settings (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). We will discuss this connection in Section 1.2.

We denote by γ:=p/n(0,)\gamma:=p/n\in(0,\infty) the overparametrization ratio. When γ<1\gamma<1, we call the problem underparametrized, and when γ>1\gamma>1, we call it overparametrized. Our most general results for the linear model (Theorem 2 and 5) apply to a non-asymptotic setting in which n,pn,p are finite, and provide a deterministic approximation of the risk with error bounds that are uniform in the distribution of the data. We will occasionally consider a simpler asymptotic scenario in which both pp and nn diverge with p/nγp/n\to\gamma.

Our main results are twofold: (i)(i) We show that by taking ΣIp\Sigma\neq I_{p}, we can easily construct scenarios in which the minimum of the risk is achieved in the overparamertized regime p>np>n; (ii)(ii) We show that these findings are robust to the details of the distribution of (yi,xi)(y_{i},x_{i}).

As a preliminary remark, note that in the underparametrized regime (γ<1\gamma<1), the min-norm estimator coincides with the standard least squares estimator. Its risk is purely variance (there is no bias), and does not depend on β,Σ\beta,\Sigma (see Proposition 2). Interestingly, the asymptotic risk diverges as we approach the interpolation boundary (as γ1\gamma\to 1).

In contrast, in the overparametrized regime (γ>1\gamma>1), the risk is composed of both bias and varianceNote that in the overparametrized regime the bias is non-vanishing even in the interpolation limit λ0\lambda\to 0. The reason is that the set of interpolators is an affine space of dimension pnp-n, and the min-norm criterion selects one specific interpolator which has norm smaller than β\beta., and generally depends on β,Σ\beta,\Sigma (see Theorem 2).

We next highlight some concrete results for the four models discussed in the previous section (unless explicitly said, we refer to the min-norm estimator).

In either case, the risk approaches the null risk as γ\gamma\to\infty, and achieves its global minimum in the underparametrized regime (see Section 3.2).

Optimal tuning of the ridge penalty can be achieved by leave-one-out cross-validation (see Theorem 7).

In this case ΣI\Sigma\neq{{I}} and the risk depends on the geometry of (Σ,β)(\Sigma,\beta), and in particular on how β\beta aligns with the eigenvectors of Σ\Sigma.

If the coefficients vector is equidistributed along the eigenvectors of Σ\Sigma, the behavior is qualitatively similar to the isotropic case. This situation arises, for instance, if β\beta is itself random with a spherical prior.

If β\beta is aligned with the top eigenvectors of Σ\Sigma, the situation is qualitatively different. As an example we obtain an explicit formula for the asymptotic risk in the latent space model discussed above, see Figure 5 for an illustration. We find that, for natural choices of the model parameters, the risk is monotone decreasing in the overparametrized regime, and reaches its global minimum as γ\gamma\to\infty. This qualitative behavior matches the one observed for neural networks (see Section 5.4).

For the latent space model, we observe that, at large overparametrization, the minimum error is achieved as λ0\lambda\to 0, i.e. by min-norm interpolators (see Section 6.2, and Section 1.3 for related work).

From a technical viewpoint, analysis of the isotropic covariates model is straightforward and relies on standard random matrix theory results. However, we believe it provides useful insights.

In contrast, the results for general covariance and coefficients structure (Σ,β)(\Sigma,\beta) is technically novel. We discuss related work in Section 1.3. Our results for the nonlinear model are also novel. In this setting, we derive a new asymptotic result on resolvents of certain block matrices, which may be of independent interest (see Lemma 3).

We next discuss some intuition behind and implications of our results.

The shape of the asymptotic risk curve for min-norm least squares is, of course, controlled by its components: bias and variance. For fully specified models, the bias increases with γ\gamma in the overparametrized regime, which is intuitive. When p>np>n, the min-norm least squares estimate of β\beta is constrained to lie the row space of XX, the training feature matrix. This is a subspace of dimension nn lying in a feature space of dimension pp. Thus as pp increases, so does the bias, since this row space accounts for less and less of the ambient pp-dimensional feature space.

Recently, Belkin et al. (2018a) pointed out a fascinating empirical trend where, for popular methods like neural networks and random forests, we can see a second bias-variance tradeoff in the out-of-sample prediction risk beyond the interpolation limit. The risk curve here resembles a traditional U-shape curve before the interpolation limit, and then descends again beyond the interpolation limit, which these authors call “double descent”. A closely related phenomenon was found earlier by Spigler et al. (2018), who studied the “jamming transition” from underparametrized to overparametrized neural networks. Our results formally verify that this double descent phenomenon occurs even in the simple and fundamental case of least squares regression. The appearance of the second descent in the risk, past the interpolation boundary (γ=1\gamma=1), is explained by the fact that the variance decreases as γ\gamma grows, as discussed above.

In the misspecified case, the variance still decreases with γ\gamma (for the same reasons), but interestingly, the bias can now also decrease with γ\gamma, provided γ\gamma is not too large (not too far past the interpolation boundary). The intuition here is that in a misspecified model, some part of the true regression function is always unaccounted for, and adding features generally improves our approximation capacity. As a consequence, the double descent phenomenon can be even more pronounced in the misspecified case (depending on the strength of the approximation bias), and that the risk can attain its global minimum past the interpolation limit.

The min-norm least squares estimator can be seen as the limit of ridge regression as the tuning parameter tends to zero. It is also the convergence point of gradient descent run on the least squares loss. We would not in general expect the best-predicting ridge solution to be at the end of its regularization path. Our results, comparing min-norm least squares to optimally-tuned ridge regression, show that (asymptotically) this is never the case, when β\beta is incoherent with respect to the eigenvectors of Σ\Sigma. This is for instance the case when Σ=Ip\Sigma=I_{p}, or β\beta is distributed according to a spherically symmetric prior. In contrast, Kobak et al. (2020) recently pointed out that —when β\beta is aligned with the leading eigenvectors of Σ\Sigma— min-norm regression can have optimal risk (i.e. the optimal regularization vanishes). We show that this is indeed the case in the latent space model mentioned above: this provides indeed an extremely simple example of a phenomenon that has been observed in the past for kernel methods Liang et al. (2020).

As mentioned above, early-stopped gradient descent is known to be closely connected to ridge regularization, see, e.g., Ali et al. (2019) which proves a tight coupling between the two (see Section 1.3 for further related work).

In practice, of course, we would not have access to the optimal tuning parameter for ridge (optimal stopping for gradient descent), and we would rely on, e.g., cross-validation (CV). Our theory shows that for ridge regression, CV tuning is asymptotically equivalent to optimal tuning (and we would expect the same results to carry over to gradient descent, but have not pursued this formally).

2 Connection to neural networks

As mentioned above, recent literature has pointed out a direct connection between linear models and more complex models such as neural networks (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). Our work contributes to this line of work and in particular points at the important role played by universality, a core concept in random matrix theory Tao (2012).

This model is still nonlinear in the input zz, but is linear in the parameters β\beta. We are therefore led to consider a linear regression problem, with random features xi=θf(zi;θ0)x_{i}=\nabla_{\theta}f(z_{i};\theta_{0}), i=1,,ni=1,\ldots,n, of high-dimensionality (pp much greater than nn). Notice that the features are random because of the initialization θ0\theta_{0}. Further, since p>np>n, many vectors β\beta give rise to a model that interpolates the data.

A different approach relies on a universality hypothesis. Recall that {(yi,zi)}in\{(y_{i},z_{i})\}_{i\leq n} have joint distribution Py,zP_{y,z}, and xi=θf(zi;θ0)x_{i}=\nabla_{\theta}f(z_{i};\theta_{0}) and assume these quantities to have zero meanWe further note that the assumption of zero mean is mainly for convenience, as a non-zero mean of the covariates can be effectively compensated by an intercept.. Universality posits that the asymptotic risk of ridge regression (or min-norm interpolation) does not change if we replace the joint distribution (xi,zi)(x_{i},z_{i}) by a Gaussian with the same covariance.

While universality is not expected to hold for any distribution of the data (yi,zi)(y_{i},z_{i}), and for any function ff, it has been confirmed in a number of examples of interest (see Section 1.3). In fact, our analysis of the nonlinear model in Section 8 provides a concrete example in which universality can be confirmed rigorously.

Under the universality hypothesis, we can assume (zi,xi)(z_{i},x_{i}) to be jointly Gaussian. If the true joint distribution of the data (yi,zi)(y_{i},z_{i}) is also Gaussian, it is sufficient to study the data distribution

where Σ,σ\Sigma,\sigma are chosen to match the second order statistics of the original model. This is a special case of our ‘linear model’ (2). Summarizing, replacing the original nonlinear model yi=f(zi;θ)y_{i}=f(z_{i};\theta), with the linear model rests on two key steps. The first step is the linearization in Eq. (3): this has been proved to be a good approximation under certain ‘lazy training’ schemes Chizat and Bach (2018b). The second step is the universality hypothesis, which is quite common (albeit for simpler models) in random matrix theory Tao (2012).

Finally, although we believe our our results are relevant to understanding overparametrized neural networks, the concrete correspondence outlined above only holds in a certain ‘lazy training’ regime, in which network weights do not change much during training, More generally, in a neural network, the feature representation and the regression function or classifier are learned simultaneously. In both our linear and nonlinear model settings, the features XX are not learned, but observed. Learning XX could significantly change some aspects of the behavior of an interpolator. (See for instance Chapter 9 of Goodfellow et al. (2016), and also Chizat and Bach (2018b); Zhang et al. (2019), which emphasize the importance of learning the representation.)

3 Related work

Ridge regression with random features has been studied in the past. Dicker (2016) considers a model in which the covariates are isotropic Gaussian xiN(0,Ip)x_{i}\sim N(0,I_{p}) and computes the asymptotic risk of ridge regression in the proportional asymptotics p,np,n\to\infty, with p/nγ(0,)p/n\to\gamma\in(0,\infty). Dobriban and Wager (2018) generalize these results to xi=Σ1/2zix_{i}=\Sigma^{1/2}z_{i}, where ziz_{i} has independent entries with bounded 1212-th moment.

Recently, Advani and Saxe (2017) study the effect of early stopping and ridge regularization in a model with isotropic Gaussian covariates xiN(0,Ip)x_{i}\sim N(0,I_{p}), again focusing on the proportional asymptotics p,np,n\to\infty, with p/nγ(0,)p/n\to\gamma\in(0,\infty). They show that this simple model reproduces several phenomena observed in neural networks training. The same model is reconsidered in concurrent work by Belkin et al. (2019), who obtain exact results for the expected risk of min-norm regression, relying on the jointly Gaussian distribution of (yi,xi)(y_{i},x_{i}). We contribute to this line of work by extending the analysis to general covariance structures and to misspecified models. As we will see, these generalizations allow to produce examples for which the global minimum of the risk is achieved in the overparametrized regime γ>1\gamma>1.

The importance of the relation between the coefficient vector β\beta and the eigenvectors of Σ\Sigma was emphasized by Kobak et al. (2020) and Bartlett et al. (2020). These papers point out —under different asymptotic settings— that λ=0+\lambda=0+ (i.e., min-norm regression) can be optimal or nearly optimal. After a preprint of this paper appeared, Wu and Xu (2020) and Richards et al. (2020) generalized our earlier results to cover the case in which β\beta is potentially aligned with Σ\Sigma. We review in further detail these important generalizations in Section 4. We contribute to this line of work by obtaining non-asymptotic approximations for the risk, with explicit and nearly optimal error bounds. These hold under weaker assumptions on the geometry of (Σ,β)(\Sigma,\beta) than the results of Wu and Xu (2020); Richards et al. (2020).

For the nonlinear model, the random matrix theory literature is much sparser, and focuses on the related model of kernel random matrices, namely, symmetric matrices of the form Kij=φ(ziTzj)K_{ij}=\varphi(z_{i}^{T}z_{j}). El Karoui (2010) studied the spectrum of such matrices in a regime in which φ\varphi can be approximated by a linear function (for iji\neq j) and hence the spectrum converges to a rescaled Marchenko-Pastur law. This approximation does not hold for the regime of interest here, which was studied instead by Cheng and Singer (2013) (who determined the limiting spectral distribution) and Fan and Montanari (2015) (who characterized the extreme eigenvalues). The resulting eigenvalue distribution is the free convolution of a semicircle law and a Marchenko-Pastur law. In the current paper, we must consider asymmetric (rectangular) matrices xij=φ(wjTzi)x_{ij}=\varphi(w_{j}^{T}z_{i}), whose singular value distribution was recently computed by Pennington and Worah (2017), using the moment method. Unfortunately, the prediction variance depends on both the singular values and vectors of this matrix. In order to address this issue, we apply the leave-one out method of Cheng and Singer (2013) to compute the asymptotics of the resolvent of a suitably extended matrix. We then extract the information of interest from this matrix. After appearance of a preprint of this paper, Mei and Montanari (2019) extended the results presented here, to obtain a complete characterization of the risk for the non-linear random features model.

Let us finally mention that the universality (or ‘invariance’) phenomenon is quite common in random matrix theory Tao (2012). In the context of kernel inner product random matrices, it appears (somewhat implicitly) in Cheng and Singer (2013) and (more explicitly) in Fan and Montanari (2015). After a first appearance of this manuscript, universality has been investigated in the context of neural networks in several papers Mei and Montanari (2019); Montanari et al. (2019); Gerace et al. (2020); Goldt et al. (2020); Hu and Lu (2020); Adlam and Pennington (2020).

4 Outline

Section 2 provides important background. Sections 3–7 consider the linear model case, focusing on isotropic features, correlated features, misspecified models, ridge regularization, and cross-validation, respectively. Section 8 covers the nonlinear model case. Nearly all proofs are deferred until the appendix.

Preliminaries

We describe our setup and gather a number of important preliminary results.

Consider a test point x0Pxx_{0}\sim P_{x}, independent of the training data. For an estimator β^\hat{\beta} (a function of the training data X,yX,y), we define its out-of-sample prediction risk (or simply, risk) as

where xΣ2=xTΣx\|x\|_{\Sigma}^{2}=x^{T}\Sigma x. Note that our definition of risk is conditional on XX (as emphasized by our notation RXR_{X}). Note also that we have the bias-variance decomposition

2 Ridgeless least squares

This can be equivalently written as β^=(XTX)+XTy\hat{\beta}=(X^{T}X)^{+}X^{T}y, where (XTX)+(X^{T}X)^{+} is the pseudoinverse of XTXX^{T}X. An alternative name for (4) is the “ridgeless” least squares estimator, motivated by the fact that β^=limλ0+β^λ\hat{\beta}=\lim_{\lambda\to 0^{+}}\hat{\beta}_{\lambda}, where β^λ\hat{\beta}_{\lambda} denotes the ridge regression estimator:

or, equivalently, β^λ=(XTX+nλI)1XTy\hat{\beta}_{\lambda}=(X^{T}X+n\lambda I)^{-1}X^{T}y.

When XX has full column rank the min-norm least squares estimator reduces to β^=(XTX)1XTy\hat{\beta}=(X^{T}X)^{-1}X^{T}y, the usual least squares estimator. When XX has rank nn, importantly, this estimator interpolates the training data: yi=xiTβ^y_{i}=x_{i}^{T}\hat{\beta}, for i=1,,ni=1,\ldots,n.

Lastly, the following is a well-known fact that connects the min-norm least squares solution to gradient descent (as referenced in the introduction).

Initialize β(0)=0\beta^{(0)}=0, and consider running gradient descent on the least squares loss, yielding iterates

where we take 0<t1/λmax(XTX)0<t\leq 1/\lambda_{\max}(X^{T}X) (and λmax(XTX)\lambda_{\max}(X^{T}X) is the largest eigenvalue of XTXX^{T}X). Then limkβ(k)=β^\lim_{k\to\infty}\beta^{(k)}=\hat{\beta}, the min-norm least squares solution in (4).

3 Bias and variance

We recall expressions for the bias and variance of the min-norm least squares estimator, which are standard.

Under the model (1), (2), the min-norm least squares estimator (4) has bias and variance

where Σ^=XTX/n\hat{\Sigma}=X^{T}X/n is the (uncentered) sample covariance of XX, and Π=IΣ^+Σ^\Pi=I-\hat{\Sigma}^{+}\hat{\Sigma} is the projection onto the null space of XX.

4 Underparametrized asymptotics

We consider an asymptotic setup where n,pn,p\to\infty, in such a way that p/nγ(0,)p/n\to\gamma\in(0,\infty). Recall that when γ<1\gamma<1, we call the problem underparametrized; when γ>1\gamma>1, we call it overparametrized. Here, we recall the risk of the min-norm least squares estimator in the underparametrized case. The rest of this paper focuses on the overparametrized case.

The following is a known result in random matrix theory, and can be found in Chapter 6 of Serdobolskii (2007). It can also be found in the wireless communications literature, see Chapter 4 of Tulino and Verdu (2004).

Assume the model (1), (2), and assume xPxx\sim P_{x} is of the form x=Σ1/2zx=\Sigma^{1/2}z, where zz is a random vector with i.i.d. entries that have zero mean, unit variance, and a finite 4th moment, and Σ\Sigma is a (sequence of) deterministic positive definite matrix, such that λmin(Σ)c>0\lambda_{\min}(\Sigma)\geq c>0, for all n,pn,p and a constant cc (here λmin(Σ)\lambda_{\min}(\Sigma) is the smallest eigenvalue of Σ\Sigma). Then as n,pn,p\to\infty, such that p/nγ<1p/n\to\gamma<1, the risk of the least squares estimator (4) satisfies, almost surely,

As it can be seen from the last proposition, in the underparametrized case the risk is just variance. In contrast, in the overparametrized case, the bias BX(β^;β)=βTΠΣΠβB_{X}(\hat{\beta};\beta)=\beta^{T}\Pi\Sigma\Pi\beta is nonzero, because Π\Pi is. This will be the focus of the next sections.

Isotropic features

We begin by considering the simpler case in which Σ=I\Sigma=I. In this case the limiting bias is relatively straightforward to compute and depends on β\beta only through β22\|\beta\|_{2}^{2}. In Section 4, we generalize our analysis and study the dependence of the bias on the geometry of Σ\Sigma and β\beta.

Choosing UU so that Uβ=reiU\beta=re_{i}, the iith standard basis vector, then averaging over i=1,,pi=1,\ldots,p, yields

It is possible to show that, BX(β^;β)B_{X}(\hat{\beta};\beta) concentrates around its expectation ant therefore BX(β^;β)r2(11/γ)B_{X}(\hat{\beta};\beta)\to r^{2}(1-1/\gamma), almost surely. This is stated formally in the next section.

2 Limiting risk

As the next result shows, the independence of the risk on β\beta is still true outside of the Gaussian case, provided the features are isotropic. The next result can be proved as a corollary of the more general Theorem 3 below. We give a simpler self-contained proof using a theorem of Rubio and Mestre (2011) in Appendix C.2.

Assume the model (1), (2), where xPxx\sim P_{x} has i.i.d. entries with zero mean, unit variance, and a finite moment of order 4+η4+\eta, for some η>0\eta>0. Also assume that β22=r2\|\beta\|_{2}^{2}=r^{2} for all n,pn,p. Then for the min-norm least squares estimator β^\hat{\beta} in (4), as n,pn,p\to\infty, such that p/nγ(0,)p/n\to\gamma\in(0,\infty), it holds almost surely that

Hence, summarizing with Proposition 2, we have

We can see that the limiting norm, as a function of γ\gamma, has a somewhat similar profile to the limiting risk in (8): it is monotonically increasing on (0,1)(0,1), diverges at the interpolation boundary, and is monotonically decreasing on (1,)(1,\infty).

Correlated features

We broaden the scope of our analysis from the last section, where we examined isotropic features. In this section, we take xPxx\sim P_{x} to be of the form x=Σ1/2zx=\Sigma^{1/2}z, where zz is a random vector with independent entries that have zero mean and unit variance, and Σ\Sigma is arbitrary (but still deterministic and positive definite).

We next state our assumptions about the data distribution: our results will be uniform with respect to the (large) constants MM, {Ck}k2\{C_{k}\}_{k\geq 2} appearing in this assumption.

The covariates vector xPxx\sim P_{x} is of the form x=Σ1/2zx=\Sigma^{1/2}z, where, defining H^n\widehat{H}_{n} as per Eq. (9), we have

s1=ΣopMs_{1}=\|\Sigma\|_{op}\leq M, s1dH^n(s)<M\int s^{-1}d\widehat{H}_{n}(s)<M.

1(p/n)1/M|1-(p/n)|\geq 1/M, 1/Mp/nM1/M\leq p/n\leq M.

Condition (a)(a) bounds the tail probabilities on the covariates. Requiring finite moment of all order is useful to get strong bounds on the deviations of the risk from its predicted value. As discussed below, bounds on the first few moments are sufficient if we are satisfied in weaker probability bounds.

Conditions (b)(b) requires the eigenvalues of Σ\Sigma to be bounded, and not to accumulateThe latter assumption could have been further relaxed, by requiring only p+p_{+} of the pp eigenvalues to be non-vanishing and to satisfy the other conditions. This would require to redefine γ\gamma as p+/np_{+}/n. near . For the analysis of min-norm interpolation, we will add the additional assumption that the minimum eigenvalue of Σ\Sigma is bounded away from zero. However condition (b)(b) is sufficient for the analysis of ridge regression in Section 6.

Finally, as our statements are non-asymptotic, we do not assume p/np/n to converge to to a value. However condition (c)(c) requires p/np/n to be bounded and bounded away from the interpolation threshold p/n=1p/n=1.

We then define the predicted bias and variance by

Note that evaluating \mathscrsfsB(H,G,γ)\mathscrsfs{B}(H,G,\gamma), \mathscrsfsV(H,γ)\mathscrsfs{V}(H,\gamma) numerically is relatively straightforward, with the most complex part being the solution of Eq. (10). The next theorem establishes that —under suitable technical conditions— the functions \mathscrsfsB\mathscrsfs{B}, \mathscrsfsV\mathscrsfs{V} characterize the test error. Similar theorems were proved in Wu and Xu (2020); Richards et al. (2020), which generalized an earlier version of this manuscript to account for the geometry of (Σ,β)(\Sigma,\beta).

Assume the data model (1), (2) and that the covariates distribution satisfies Assumption 1. Further assume sp=λmin(Σ)>1/Ms_{p}=\lambda_{\min}(\Sigma)>1/M. Define γ=p/n\gamma=p/n and let β^\hat{\beta} be the min-norm least squares estimator in Eq. (4).

Then for any constants D>0D>0 (arbitrarily large) there exist C=C(M,D)C=C(M,D) such that, with probability at least 1CnD1-Cn^{-D} the following hold

where \mathscrsfsB\mathscrsfs{B} and \mathscrsfsV\mathscrsfs{V} are given in Definition 2.

The proof of this theorem is deferred to Section B.

The order of the error bound in Eqs. (14), (15) is not optimal: a central limit theorem heuristics suggests the deterministic approximation to be accurate up to an error of order n1/2n^{-1/2}. Indeed, we are able to establish the correct order in the case of ridge regression, see Theorem 5.

Note that Theorem 2 establishes deterministic approximations for the bias and variance, that are valid at finite n,pn,p. The overparametrization ratio γ=p/n\gamma=p/n is a non-asymptotic quantity, and the error bounds are uniform, i.e. depend uniquely on the constant MM. This is to be contrasted with the asymptotic setting of Richards et al. (2020); Wu and Xu (2020). Both of these papers assume a sequence of regression problems with n,pn,p\to\infty, p/nγp/n\to\gamma, and obtain an asymptotically exact expression for the risk.

Technically, Richards et al. (2020); Wu and Xu (2020) apply asymptotic random matrix theory results, such as Ledoit and Peche (2011), while we have to take a longer detour to exploit non-asymptotic results established inKnowles and Yin (2017). We believe that the non-asymptotic approach provides more concrete and accurate statements.

Theorem 2 also implies asymptotic predictions under minimal assumptions. In particular, if the two probability measures H^n\widehat{H}_{n}, G^n\widehat{G}_{n} converge weakly to probability measures HH, GG on [0,)[0,\infty), then we obtain BX(β^;β)/β2\mathscrsfsB(H,G,γ)B_{X}(\hat{\beta};\beta)/\|\beta\|^{2}\to\mathscrsfs{B}(H,G,\gamma), VX(β^;β)\mathscrsfsV(H,γ)V_{X}(\hat{\beta};\beta)\to\mathscrsfs{V}(H,\gamma) almost surely.

As pointed out above the condition H^nH\widehat{H}_{n}\Rightarrow H, G^nG\widehat{G}_{n}\Rightarrow G (here \Rightarrow denotes weak convergence) is strictly weaker than the condition assumed in Wu and Xu (2020) to establish asymptotic results. Further, we require weaker moment conditions.

In the next sections we illustrate the role of the geometry of β,Σ\beta,\Sigma by considering two models for which H^nH\widehat{H}_{n}\Rightarrow H, G^nG\widehat{G}_{n}\Rightarrow G as n,pn,p\to\infty. First we consider the case G=HG=H, which we refer to as ‘equidistributed:’ the components of β\beta are roughly equally distributed along the eigenvectors of Σ\Sigma. In this case there is no special relation between β\beta and Σ\Sigma.

As a further application, we consider a latent space model in which β\beta is aligned with the top eigenvectors of Σ\Sigma. This can be regarded as a misspecified model, and is therefore presented in Section 5.4 below.

2 Equidistributed coefficients

In this section we assume G=HG=H, β2r\|\beta\|_{2}\to r, and p/nγp/n\to\gamma. One way to generate β\beta satisfying this condition is to draw it uniformly at random on the pp-dimensional sphere of radius β2=r\|\beta\|_{2}=r. In this case the conditions of Theorem 2 (or Theorem 3), hold with G^nG=H\widehat{G}_{n}\to G=H.

Under the assumptions of Theorem 3, further assume G=HG=H, β2r2\|\beta\|^{2}\to r^{2}. Then for n,pn,p\to\infty, with p/nγ>1p/n\to\gamma>1, almost surely

As a special case, we can revisit the isotropic case Σ=I\Sigma=I, which results in dH=δ1dH=\delta_{1}. In this case c0(H,γ)=γ(γ1)c_{0}(H,\gamma)=\gamma(\gamma-1) yielding immediately \mathscrsfsB\mboxequi(H,γ)=1γ1\mathscrsfs{B}_{\mbox{\tiny\rm equi}}(H,\gamma)=1-\gamma^{-1} and \mathscrsfsV\mboxequi(H,γ)=1/(γ1)\mathscrsfs{V}_{\mbox{\tiny\rm equi}}(H,\gamma)=1/(\gamma-1).

Under the assumptions of Theorem 3, further assume β2r2\|\beta\|^{2}\to r^{2}, and let c0=c0(H,γ)c_{0}=c_{0}(H,\gamma) be defined as there. Then as n,pn,p\to\infty, such that p/nγp/n\to\gamma, the min-norm least squares estimator (4) satisfies, almost surely,

Misspecified model

In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting provides another potential motivation for interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation (increasing γ\gamma past 1) can now decrease both bias and variance.

As such, the misspecified model setting also yields further interesting asymptotic comparisons between the γ<1\gamma<1 and γ>1\gamma>1 regimes. Recall the isotropic features model of Section 3.2: the risk function in (8) is globally minimized at γ=0\gamma=0. This is a consequence of the fact that, in a well-specified linear model at γ=0\gamma=0 there is no bias and no variance, and hence no risk. In a misspecified model, we will see that the story can be quite different, and the asymptotic risk can actually attain its global minimum on (1,)(1,\infty).

Consider, instead of (1), (2), a data model

Given a test point (x0,w0)Px,w(x_{0},w_{0})\sim P_{x,w}, and an estimator β^\hat{\beta} (fit using X,yX,y only, and not WW), we define its out-of-sample prediction risk as

Note that this definition is conditional on XX, and we are integrating over the randomness not only in ϵ\epsilon (the training errors), but in the unobserved features WW, as well. The next lemma decomposes this notion of risk in a useful way.

Under the misspecified model (19), (20), for any estimator β^\hat{\beta}, we have

The first term RX(β^;β,θ)R^{*}_{X}(\hat{\beta};\beta,\theta) in the decomposition in Lemma 2 is precisely the risk that we studied previously in the well-specified case, except that the response distribution has changed (due to the presence of the middle term in (20)). We call the second term M(β,θ)M(\beta,\theta) in Lemma 2 the misspecification bias.

In order to discuss some qualitative features, we focus on the simplest possible model by assuming independent covariates.

2 Isotropic features

Here, we make the additional simplifying assumption that (x,w)Px,w(x,w)\sim P_{x,w} has i.i.d. entries with unit variance, which implies that Σ=I\Sigma=I. (The case of independent features but general covariances Σx,Σw\Sigma_{x},\Sigma_{w} is similar, and we omit the details.) Therefore, we may write the response distribution in (20) as

where δi\delta_{i} is independent of xix_{i}, having mean zero and variance σ2+θ22\sigma^{2}+\|\theta\|_{2}^{2}, for i=1,,ni=1,\ldots,n. Denote the total signal by r2=β22+θ22r^{2}=\|\beta\|_{2}^{2}+\|\theta\|_{2}^{2}, and the fraction of the signal captured by the observed features by κ=β22/r2\kappa=\|\beta\|_{2}^{2}/r^{2}. Then RX(β^;β,θ)R_{X}^{*}(\hat{\beta};\beta,\theta) behaves exactly as we computed previously, for isotropic features in the well-specified setting (Theorem 2 for γ<1\gamma<1, and Theorem 1 for γ>1\gamma>1), after we make the substitutions:

Furthermore, we can easily calculate the misspecification bias:

Putting these results together leads to the next conclusion.

Assume the misspecified model (19), (20), and assume (x,w)Px,w(x,w)\sim P_{x,w} has i.i.d. entries with zero mean, unit variance, and a finite moment of order 8+η8+\eta, for some η>0\eta>0. Also assume that β22+θ22=r2\|\beta\|_{2}^{2}+\|\theta\|_{2}^{2}=r^{2} and β22/r2=κ\|\beta\|_{2}^{2}/r^{2}=\kappa for all n,pn,p. Then for the min-norm least squares estimator β^\hat{\beta} in (4), as n,pn,p\to\infty, with p/nγp/n\to\gamma, it holds almost surely that

We remark that, in the independence setting considered in Theorem 4, the dimension qq of the unobserved feature space does not play any role: we may equally well take q=q=\infty for all n,pn,p (i.e., infinitely many unobserved features).

3 Polynomial approximation bias

Since adding features should generally improve our approximation capacity, it is reasonable to model κ=κ(γ)\kappa=\kappa(\gamma) as an increasing function of γ\gamma. To get an idea of the possible shapes taken by the asymptotic risk curve from Theorem 4, we consider the example of a polynomial decay for the approximation bias,

for some a>0a>0. In this case, the limiting risk in the isotropic setting, from Theorem 4, becomes

4 A latent space model

Here (ξi)in(\xi_{i})_{i\leq n}, (uij)in,jp(u_{ij})_{i\leq n,j\leq p} are noise variables that are mutually independent, and independent of ziz_{i}, with ξiN(0,σξ2)\xi_{i}\sim{N}(0,\sigma_{\xi}^{2}), uijN(0,1)u_{ij}\sim{N}(0,1). The features matrix takes the form X=ZWT+UX=ZW^{T}+U and therefore, for p>np>n, min-norm regression amounts to

Notice that the signal-to-noise ratio in the features xijx_{ij} is equal to wi22\|w_{i}\|_{2}^{2}. To be definite, we’ll fix wi2=1\|w_{i}\|_{2}=1, which implies in particular WF2=p\|W\|_{F}^{2}=p. In order to simplify our calculations, we assume all the non-zero singular values of WW to be equal, whence WTW=(p/d)IdW^{T}W=(p/d)I_{d}. It is not hard to check that this model satisfies the assumptions of Theorem 2, with

Using Theorem 2, we get the following explicit expressions.

Consider the latent space model described above, and assume d/pψ(1,)d/p\to\psi\in(1,\infty), p/nγ(0,)p/n\to\gamma\in(0,\infty). Then, almost surely

where σ2=σξ2+ψrθ2/(1+ψ)\sigma^{2}=\sigma_{\xi}^{2}+\psi r_{\theta}^{2}/(1+\psi), and c0=c0(ψ,γ)0c_{0}=c_{0}(\psi,\gamma)\geq 0 is the unique non-negative solution of the following second order equation

Figures 5 and 6 illustrate this corollary by comparing analytical predictions to numerical simulations. We observe that the prediction risk is monotone decreasing in the overparametrization ratio for γ>1\gamma>1, and reaches its global minimum asymptotically as γ\gamma\to\infty (after p,n,dp,n,d\to\infty). To understand why this happens, notice that each feature vector xix_{i} can be viewed as a noisy measurement of the latent covariates ziz_{i}. If the noise uiju_{ij} was absent, then performing min-norm regression with respect to (xi)in(x_{i})_{i\leq n} amounts to minimize would be equivalent to min-norm regression with respect to (zi)in(z_{i})_{i\leq n}. To see this, consider again Eq. (25). If we drop the noise UU, we are minimizing b2\|b\|_{2} subject to Z(WTb)=0Z(W^{T}b)=0, and the regression function is f^(z)=xTβ^=zT(WTβ^)\hat{f}(z)=x^{T}\hat{\beta}=z^{T}(W^{T}\hat{\beta}). Since WW is orthogonal, this is equivalent to computing θ^=argmin{t2:\hat{\theta}=\arg\min\{\|t\|_{2}: subject to Zt=0}Zt=0\}.

In presence of noise uiju_{ij}, the latent features cannot be estimated exactly. However as pp gets larger, the noise is effectively ‘averaged out’ and we approach the idealized situation in which the ziz_{i}’s are observed.

Ridge regularization

We generalize the formulas of Section 4 to non-vanishing ridge regularization. We work under the same assumptions of that section. In particular, recall that H^n(s)=p1i=1p1{ssi}\widehat{H}_{n}(s)=p^{-1}\sum_{i=1}^{p}1_{\{s\geq s_{i}\}} is the empirical distribution of eigenvalue of Σ\Sigma, and G^n(s)=i=1pβ,vi21{ssi}/β2\widehat{G}_{n}(s)=\sum_{i=1}^{p}\langle\beta,v_{i}\rangle^{2}1_{\{s\geq s_{i}\}}/\|\beta\|^{2} the same empirical distribution, reweighted by the projection of β\beta onto the eigenvector viv_{i} of the covariance Σ\Sigma. (Recall the eigenvalue decomposition Σ=i=1psiviviT\Sigma=\sum_{i=1}^{p}s_{i}v_{i}v_{i}^{T}.)

Further define mn,1(z)=mn,1(z;H^n,γ)m_{n,1}(z)=m_{n,1}(z;\widehat{H}_{n},\gamma) via

These definitions are extended analytically to Im(z)=0{\operatorname{Im}}(z)=0 whenever possible. We then define the predicted bias and variance by

We next state our deterministic approximation of the risk.

Let M1p/nMM^{-1}\leq p/n\leq M, and Assumption 1 hold. Further assume λλmin(Σ)>1/M\lambda\vee\lambda_{\min}(\Sigma)>1/M and n2/3+1/M<λ<Mn^{-2/3+1/M}<\lambda<M. Let β^λ\hat{\beta}_{\lambda} be the ridge estimator of Eq. (5).

Then for any constants D>0D>0 (arbitrarily large) and ε>0{\varepsilon}>0 (arbitrarily small), there exist C=C(M,D)C=C(M,D) such that, with probability at least 1CnD1-Cn^{-D} the following hold

where \mathscrsfsB\mathscrsfs{B} and \mathscrsfsV\mathscrsfs{V} are given in Definition 2.

The proof of this theorem is deferred to Appendix A. As for theorem 2, similar results were proved in Wu and Xu (2020); Richards et al. (2020), subsequently to a first version of this manuscript that only focused on random β\beta. The same comparison of Remark 1 applies here.

In particular, Theorem 5 establishes non-asymptotic deterministic approximations for the bias BX(β^λ;β)B_{X}(\hat{\beta}_{\lambda};\beta) and variance VX(β^;β)V_{X}(\hat{\beta};\beta). The error terms are uniform over the covariance matrix, and have nearly optimal dependence upon the sample size nn. Indeed, a central-limit theorem heuristics suggests indeed fluctuations of order n1/2n^{-1/2}.

As for the case of min-norm regression, it is easy to extract asymptotic prediction under minimal assumptions.

As a special case, we can consider the simple isotropic model that was already studied in Section 3. Very similar (though not identical) results can be found in Dicker (2016); Dobriban and Wager (2018).

Assume the conditions of Theorem 1 (well-specified model, isotropic features). Then for ridge regression estimator in (5) as n,pn,p\to\infty, such that p/nγ(0,)p/n\to\gamma\in(0,\infty), it holds almost surely that

Here m(z)m(z) is given by Eq. (36) which in this case has the explicit solution m(z)=[1γz(1γz)24γz]/(2γz)m(z)=[1-\gamma-z-\sqrt{(1-\gamma-z)^{2}-4\gamma z}]/(2\gamma z).

Furthermore, the limiting ridge risk is minimized at λ=σ2γ/r2\lambda^{*}=\sigma^{2}\gamma/r^{2}, in which case we have the simpler expression RX(β^λ;β)σ2γm(λ)R_{X}(\hat{\beta}_{\lambda};\beta)\to\sigma^{2}\gamma m(-\lambda^{*}).

Under the conditions of Theorem 4 (misspecified model, isotropic features), the limiting risk of ridge regression is as in the first two displays, and the optimal limiting risk is as in the third, after we make the substitutions in (21) and add r2(1κ)r^{2}(1-\kappa), to each expression.

It is easy to recover the formulas in Theorem 1 as a limiting case of Eq. (43), by using the z0z\to 0 asymptotics m(z)=(1γ)1+O(z)m(z)=(1-\gamma)^{-1}+O(z) for γ<1\gamma<1 and m(z)=(1γ)1+O(z)m(z)=(1-\gamma)^{-1}+O(z) for γ<1\gamma<1 and m(z)=(γ1)/(γz)+[(γ1)γ]1+O(z)m(z)=-(\gamma-1)/(\gamma z)+[(\gamma-1)\gamma]^{-1}+O(z) for γ>1\gamma>1.

Figures 8 and 8 compare the risk curves of min-norm least squares to those from optimally-tuned ridge regression, in the well-specified and misspecified settings, respectively. There are two important points to make. The first is that optimally-tuned ridge regression is seen to have strictly better asymptotic risk throughout, regardless of r2r^{2}, γ\gamma, κ\kappa. This should not be a surprise, as by definition optimal tuning should yield better risk than min-norm least squares, which is the special case given by λ0+\lambda\to 0^{+}.

As we will see, the behavior is rather different in the latent space model.

2 Latent space model

As a special application, we consider the latent space model of Section 5.4. It is immediate to specialize Eqs. (38) and (39) to this case. We omit giving giving explicit formulas for brevity, and instead plot the resulting curves for the prediction risk (test error).

In Figure 10 we plot the risk as a function of the overparametrization ration γ=p/n\gamma=p/n for several values of the regularization parameter λ\lambda (included the ridgeless limit λ0\lambda\to 0). The setting here is analogous to the one of Fig. 5. We observe several interesting phenomena:

Independently of λ\lambda in the probed range, the risk is minimized at large overparametrization γ1\gamma\gg 1.

As expected, the divergence of the risk at the interpolation threshold γ=1\gamma=1 is smoothed out by regularization, and the risk becomes a monotone decreasing function of γ\gamma when λ\lambda is large enough. Crucially, the optimal amount of regularization (corresponding to the lower envelope of these curves) results in a monotonically decreasing risk.

At large overparametrization, the optimal value of the regularization parameter is λ0\lambda\to 0.

We confirm the last finding in Fig. 10 which plots the risk as a function of λ\lambda: the optimal regularization is λ0\lambda\to 0. Notice that this is the case despite the fact that the observations are noisy, namely σξ>0\sigma_{\xi}>0 strictly.

The optimality of λ0\lambda\to 0 has a known intuitive explanation that is worth recalling here. Recall that ridge predictor at point x0x_{0} takes the form

where we introduced the kernel matrix K(X,X)=XXT/nK(X,X)=XX^{T}/n, and the vector K(x0,X)=x0TX/nK(x_{0},X)=x_{0}^{T}X/n. In the present case XX can ve viewed as a noisy version of X=ZWT\overline{X}=ZW^{T}, namely Consider the case in which the covariates XX contain noise, as is our case, X=X+ηUX=\overline{X}+\eta U where uijN(0,1)u_{ij}\sim N(0,1). (While we are considering η=1\eta=1, it is instructive to regard the noise standard deviation as a parameter.) Then, we might expect K(X,X)K(X,X)+η2UUTK(X,X)+η2InK(X,X)\approx K(\overline{X},\overline{X})+\eta^{2}UU^{T}\approx K(\overline{X},\overline{X})+\eta^{2}I_{n}. If this approximations hold, the noise acts as an extra ridge term, which can be sufficient to regularize the problem.

To the best of our knowledge, this argument was first presented by Webb (1994) and, more explicitly, by Bishop (1995). Recently, Kobak et al. (2020) elucidated its role in linear regression, establishing several of its consequences. In a parallel line of work, a closely related idea has recently emerged in the analysis of kernel methods in high dimension El Karoui (2010); Liang et al. (2020).

Cross-validation

We analyze the effect of using cross-validation to choose the tuning parameter in ridge regression. In short, we find that choosing the ridge tuning parameter to minimize the leave-one-out cross-validation error leads to the same asymptotic risk as the optimally-tuned ridge estimator. The next subsection gives the details; the following subsection presents a new “shortcut formula” for leave-one-out cross-validation in the overparametrized regime, for min-norm least squares, akin to the well-known formula for underparametrized least squares and ridge regression. We refer to Craven and Wahba (1979); Golub et al. (1979a) for background on CV and GCV, and to Arlot and Celisse (2010) for a more recent review.

Recomputing the leave-one-out predictors f^λi\hat{f}^{-i}_{\lambda}, i=1,,ni=1,\ldots,n can be burdensome, especially for large nn. Importantly, there is a well-known “shortcut formula” that allows us to express the leave-one-out CV error (45) as a weighted average of the training errors,

where Sλ=X(XTX+nλ)1XTS_{\lambda}=X(X^{T}X+n\lambda)^{-1}X^{T} is the ridge smoother matrix. This identity is an immediate consequence of the Sherman-Morrison-Woodbury formula. In the next subsection, we give an extension to the case λ=0\lambda=0 and rank(X)=n{\rm rank}(X)=n, i.e., to min-norm least squares.

The next result shows that, for isotropic features, the CV error of a ridge estimator converges almost surely to its prediction error. The focus on isotropic features is only for simplicity: a more general analysis is possible but is not pursued here. The proof, given in Appendix C.6, relies on the shortcut formula (46). In the proof, we actually first analyze generalized cross-validation (GCV), which turns out to be somewhat of an easier calculation (see the proof for details on the precise form of GCV), and then relate leave-one-out CV to GCV.

with the right-hand side above being the asymptotic risk of optimally-tuned ridge regression. Further, the exact same set of results holds for GCV.

Similar results were obtained for various linear smoothers in Li (1986, 1987), for the lasso in the high-dimensional (proportional) asymptotics in Miolane and Montanari (2018), and for general smooth penalized estimators in Xu et al. (2019). The latter paper covers ridge regression as a special case, and gives more precise results (convergence rates), but assumes more restrictive conditions.

The key implication of Theorem 7, in the context of the current paper and its central focus, is that the CV-tuned or GCV-tuned ridge estimator has the same asymptotic performance as the optimally-tuned ridge estimator. In other words, the ridge curves in Figures 1, 8, and 8 can be alternatively viewed as the asymptotic risk of ridge under CV tuning.

2 Shortcut formula for ridgeless CV

We extend the leave-one-out CV shortcut formula (46) to work when p>np>n and λ=0+\lambda=0+, i.e., for min-norm least squares. In this case, both the numerator and denominator are zero in each summand of (46). To circumvent this, we can use the so-called “kernel trick” to rewrite the ridge regression solution (5) with λ>0\lambda>0 as

Using this expression, the shortcut formula for leave-one-out CV in (46) can be rewritten as

Taking λ0+\lambda\to 0^{+} yields the shortcut formula for leave-one-out CV in min-norm least squares (assuming without a loss of generality that rank(X)=n{\rm rank}(X)=n),

In fact, the exact same arguments given here still apply when we replace XXTXX^{T} by a positive definite kernel matrix KK (i.e., Kij=k(xi,xj)K_{ij}=k(x_{i},x_{j}) for each i,j=1,,ni,j=1,\ldots,n, where kk is a positive definite kernel function), in which case (48) gives a shortcut formula for leave-one-out CV in kernel ridgeless regression (the limit in kernel ridge regression as λ0+\lambda\to 0^{+}). We also remark that, when we include an unpenalized intercept in the model, in either the linear or kernelized setting, the shortcut formula (48) still applies with XXTXX^{T} or KK replaced by their doubly-centered (row- and column-centered) versions, and the matrix inverses replaced by pseudoinverses.

Nonlinear model

Our analysis in the previous sections assumed xi=Σ1/2zix_{i}=\Sigma^{1/2}z_{i}, with ziz_{i} a vector with i.i.d. entries. As discussed in the introduction, we expect our results to have implications on certain neural networks in the lazy training regime, via the universality hypothesis. Roughly speaking, if covariates are obtained by applying the featurization map zixi=f(zi;θ0)z_{i}\mapsto x_{i}=\nabla f(z_{i};\theta_{0}) to data ziz_{i}, the asymptotic behavior of ridge regression is expected to be close to the one of an equivalent Gaussian model, whose second-order statistics match the ones of f(zi;θ0)\nabla f(z_{i};\theta_{0}).

We focus on a simple case in which the second order statistics match the ones of the isotropic model.

Notice that, conditionally on WW, the vectors xi=φ(Wzi)x_{i}=\varphi(Wz_{i}), ini\leq n are independent. However, they do not have independent coordinates: intuitively if dd is significantly smaller than pp, xix_{i} contains much less randomness than a vector of pp independent entriesFor instance, if φ(t)=at2+b\varphi(t)=at^{2}+b, we can reconstruct ziz_{i} from the first 2d2d coordinates of xix_{i}, and therefore the remaining p2dp-2d coordinates of xix_{i} are a function of the first 2d2d..

Then for γ>1\gamma>1, the variance satisfies, almost surely,

which is again as in the case of linear isotropic features, recall Theorem 1.

The proof of Theorem 8 is lengthy and will be sketched shortly.

where G1,G2G_{1},G_{2} are jointly Gaussian with unit variance and covariance wiTwjw_{i}^{T}w_{j}. Denoting by φ(x)=k0λk(φ)hk(x)\varphi(x)=\sum_{k\geq 0}\lambda_{k}(\varphi)\,h_{k}(x) the decomposition of φ\varphi into orthonormal Hermite polynomials, we thus obtain

Despite the nonlinear model xi=φ(Wzi)x_{i}=\varphi(Wz_{i}) matches the second-order population statistics of the isotropic model, it is far from obvious that the resulting least norm regression risk is the same. Indeed the coordinates of vector xix_{i} are highly dependent, as it can be seen by noting that they are a function of only dpd\ll p independent random variables. Theorem 8 confirms that —despite dependence— the risk is asymptotically the same, thus providing a concrete example of the general universality phenomenon.

2 Proof outline for Theorem 8

We introduce the following resolvents (as usual, these are defined for Im(ξ)>0{\operatorname{Im}}(\xi)>0 and by analytic continuation, whenever possible, for Im(ξ)=0{\operatorname{Im}}(\xi)=0):

Here and henceforth, we write [i,j]={i+1,,i+j}[i,j]=\{i+1,\ldots,i+j\} for integers i,ji,j. We also write Mij1=(M1)ijM^{-1}_{ij}=(M^{-1})_{ij} for a matrix MM, and TrS(M)=iSMii{\sf Tr}_{S}(M)=\sum_{i\in S}M_{ii} for a subset SS. The equalities in the first and third lines above follow by invariance of the distribution of A(s){A}(s) under permutations of [1,p][1,p] and [p+1,p+n][p+1,p+n]. Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write A=A(s){A}={A}(s), m1,n=m1,n(ξ,s)m_{1,n}=m_{1,n}(\xi,s), and m2,n=m2,n(ξ,s)m_{2,n}=m_{2,n}(\xi,s).

The next lemma characterizes the asymptotics of m1,nm_{1,n}, m2,nm_{2,n}.

Assume the conditions of Theorem 8. Consider Im(ξ)>0{\operatorname{Im}}(\xi)>0 or Im(ξ)=0{\operatorname{Im}}(\xi)=0, Re(ξ)<0{\operatorname{Re}}(\xi)<0, with st0s\geq t\geq 0. Let m1m_{1} and m2m_{2} be the unique solutions of the following quadratic equations:

subject to the condition of being analytic functions for Im(z)>0{\operatorname{Im}}(z)>0, and satisfying m1(z,s),m2(z,s)1/Im(z)|m_{1}(z,s)|,|m_{2}(z,s)|\leq 1/{\operatorname{Im}}(z) for Im(z)>C{\operatorname{Im}}(z)>C (with CC a sufficiently large constant). Then, as n,p,dn,p,d\to\infty, such that p/nγp/n\to\gamma and d/pψd/p\to\psi, we have almost surely (and in L1L^{1}),

The proof of this lemma is given in Appendix D.2. As a corollary of the above, we obtain that the asymptotical empirical spectral distribution of the empirical covariance Σ^=XTX/n\hat{\Sigma}=X^{T}X/n matches the one for the independent entries model, and is hence given by the Marchenko-Pastur law (a result already obtained in Pennington and Worah (2017)). We state this formally using the Stieltjes transform

Assume the conditions of Theorem 8. Consider Im(z)>0{\operatorname{Im}}(z)>0. As n,p,dn,p,d\to\infty, with p/nγp/n\to\gamma and d/pψd/p\to\psi, we have (almost surely and in L1L^{1}) Rn(ξ)r(ξ)R_{n}(\xi)\to r(\xi) where rr is nonrandom and coincides with the Stieltjes transform of the Marchenko-Pastur law, namely

We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents computed in Lemma 3 to the variance of min-norm least squares, hence finishing our proof outline.

Assume the conditions of Theorem 8. Let m1m_{1}, m2m_{2} be the asymptotic resolvents given in Lemma 3. Define

Then for γ1\gamma\neq 1, \partial_{x}m(\xi,x)\big{|}_{x=0} as a simple pole at ξ=0\xi=0, and hence admits a Taylor-Laurent expansion around ξ=0\xi=0, whose coefficients will be denoted by D1D_{-1}, D0D_{0}

Here each coefficient is a function of γ,ψ\gamma,\psi: D1=D1(γ,ψ)D_{-1}=D_{-1}(\gamma,\psi), D0=D0(γ,ψ)D_{0}=D_{0}(\gamma,\psi). Furthermore, for the ridge regression estimator β^λ\hat{\beta}_{\lambda} in (5), as n,p,dn,p,d\to\infty, such that p/nγ(0,)p/n\to\gamma\in(0,\infty), d/pψ(0,1)d/p\to\psi\in(0,1), the following ridgeless limit holds almost surely:

The proof of this lemma can be found in Appendix D.3. Theorem 8 follows by evaluating the formula in Lemma 4, by using the result of Lemma 3. We refer to the appendices material for details.

Acknowledgements

The authors are grateful to Brad Efron, Rob Tibshirani, and Larry Wasserman for inspiring us to work on this in the first place. RT sincerely thanks Edgar Dobriban for many helpful conversations about the random matrix theory literature, in particular, the literature trail leading up to Proposition 2, and the reference to Rubio and Mestre (2011). We are also grateful to Dmitry Kobak for bringing to our attention the Kobak et al. (2020) and clarifying the implications of this work on our setting.

TH was partially supported by grants NSF DMS-1407548, NSF IIS-1837931, and NIH 5R01-EB001988-21. AM was partially supported by NSF DMS-1613091, NSF CCF-1714305, NSF IIS-1741162, and ONR N00014-18-1-2729. RT was partially supported by NSF DMS-1554123.

Appendix A Proofs for the linear model: Risk of ridge regression

Before passing to the proof of Theorem 5, we state and prove a useful lemma in analysis.

Without loss of generality we can assume x=0x=0. Further we set h=fgh=f-g. We consider a finite difference approximation of order 2k2k of the first order derivative of hh at . One such approximation is given by (Trefethen, 1996, Chapter 3)

where uxu_{x} is a point in the interval [0,x][0,x]. Using the properties of D2kD_{2k} listed above, we obtain

We are left wih the task of proving claims (i)(i) and (ii)(ii) about the operator D2kD_{2k}. Substituting, these are equivalent to the claims

We further state and proof a lemma about a modified Stieltjes transform.

Let s1sp>0s_{1}\geq\dots\geq s_{p}>0 be such that s1Ms_{1}\leq M and #{i:si1/M}p/M\#\{i:\,s_{i}\geq 1/M\}\geq p/M for some large constant M>1M>1, and M1p/nMM^{-1}\leq p/n\leq M. Further assume 1(p/n)1/M|1-(p/n)|\geq 1/M and n1i=1psi11si>0<Mn^{-1}\sum_{i=1}^{p}s_{i}^{-1}1_{s_{i}>0}<M. For (λ,η)(0,M]×(1/(10M2),1/(10M2))(\lambda,\eta)\in(0,M]\times(-1/(10M^{2}),1/(10M^{2})) let u=u(λ,η)u=u(\lambda,\eta) be the only solution in (0,)(0,\infty) of the equation

λu(λ,η)\lambda\mapsto u(\lambda,\eta) is analytic and non-decreasing with 0<λu(λ,η)C(M)0<\partial_{\lambda}u(\lambda,\eta)\leq C(M) for a constant C(M)C(M) uniquely dependent on MM.

If further λsp1/M\lambda\vee s_{p}\geq 1/M, then ηku(λ,η)Ck(M)|\partial^{k}_{\eta}u(\lambda,\eta)|\leq C_{k}(M) for all kk.

We begin from poinr (a)(a). Since η\eta is fixed here, we write u(λ)=u(λ,η)u(\lambda)=u(\lambda,\eta), dropping the dependence on η\eta. Note that, for η>0\eta>0, any solution uu must be such that u(0,1/η)u\in(0,1/\eta) because otherwise the left hand side is negative and the right-hand side is positive. Call x=u/(1ηu)x=u/(1-\eta u) whence the above equation is equivalent to the following equation on x(0,)x\in(0,\infty):

where si(η)=si/(1+ηsi)s_{i}(\eta)=s_{i}/(1+\eta s_{i}). For η(1/(10M2),1/(10M2))\eta\in(-1/(10M^{2}),1/(10M^{2})), we have (1(10M)1)si<si(η)<(1+(10M)1)si(1-(10M)^{-1})s_{i}<s_{i}(\eta)<(1+(10M)^{-1})s_{i}, and therefore the sequence (si(η))ip(s_{i}(\eta))_{i\leq p} satisfies the same assumptions as (si)ip(s_{i})_{i\leq p}, with MM replaced by 2M2M. We will therefore drop the argument η\eta and study the fixed point equation (61) (which is equivalent to the case η=0\eta=0 of Eq. (60)).

Let g(x):=n1i=1psi/(si+x)g(x):=n^{-1}\sum_{i=1}^{p}s_{i}/(s_{i}+x), and note that this function is mononone decreasing on [0,)[0,\infty) with g(0)C0(M)g(0)\leq C_{0}(M), and g(x)M2/xg(x)\leq M^{2}/x. Equation (61) can be rewritten as

Since xg^(x):=(λ/x)+g(x)x\mapsto\hat{g}(x):=(\lambda/x)+g(x) is monotone decreasing with g^(0+)=+\hat{g}(0+)=+\infty and g^(x)0\hat{g}(x)\to 0 as xx\to\infty, this equation has a unique solution, denoted by x(λ)x(\lambda). By the implicit function theorem, we have

The second inequality follows for instance by noting that n1i=1psi2/(si+x(λ))2<g(x(λ))=1λ/x(λ)<1n^{-1}\sum_{i=1}^{p}s_{i}^{2}/(s_{i}+x(\lambda))^{2}<g(x(\lambda))=1-\lambda/x(\lambda)<1. Notice that g(0)=p/ng(0)=p/n. By assumption g(0)11/M|g(0)-1|\geq 1/M. We will distinguish the two cases g(0)<1g(0)<1 and g(0)>1g(0)>1.

Case g(0)<1g(0)<1. Since g(x)g(x) is decreasing, (62) implies 1λx1+g(0)1\leq\lambda x^{-1}+g(0) whence

and therefore, by Eq. (63), we have λx(λ)M\partial_{\lambda}x(\lambda)\leq M for all λ(0,)\lambda\in(0,\infty).

Case g(0)>1g(0)>1. Let x0=g1(1)>0x_{0}=g^{-1}(1)>0. Since xg(x)x\mapsto g(x) is convex, x0(1g(0))/g(0)x_{0}\geq(1-g(0))/|g^{\prime}(0)|. Note that g(0)=n1i=1psi11si>0M|g^{\prime}(0)|=n^{-1}\sum_{i=1}^{p}s_{i}^{-1}1_{s_{i}>0}\leq M. Therefore x01/C(M)x_{0}\geq 1/C(M) for some finite constant C(M)C(M).

Since g^(x)=λ/x+g(x)>g(x)\hat{g}(x)=\lambda/x+g(x)>g(x) on (0,)(0,\infty), we have x(λ)>x(0+)=x01/C(M)x(\lambda)>x(0+)=x_{0}\geq 1/C(M) for all λ(0,)\lambda\in(0,\infty). Further,

Therefore by Eq. (63), 0<λx(λ)C(M)0<\partial_{\lambda}x(\lambda)\leq C(M) also in this case.

and therefore 7/10(1+ηx(λ))13/107/10\leq(1+\eta x(\lambda))\leq 13/10, whence

Claim (b)(b) simply follows since u/(1ηu)λu/(1-\eta u)\geq\lambda by Eq. (60). Since (1ηu)=1/(1+ηx)1/2(1-\eta u)=1/(1+\eta x)\geq 1/2 as just shown, this implies uλ/2u\geq\lambda/2.

Finally, for claim (c)(c), notice that u(λ,η)u(\lambda,\eta) is a solution of the equation

By the implicit function theorem, it is sufficient to prove that \big{|}\frac{\partial^{k+l}f}{\partial u^{k}\partial\eta^{l}}\big{|}\leq C_{k,l}(M) for all l1,k0l\geq 1,k\geq 0 and l=0l=0, k2k\geq 2, and that \big{|}\frac{\partial f}{\partial u}\big{|}\geq 1/C(M).

In order to bound the first derivative away from , note that, again by the implicit function theorem,

where the last inequality follows from point (a)(a) above. In order to bound higher order derivatives, write f(u,η)=f1(ηu)/ηλf2(u)f(u,\eta)=f_{1}(\eta u)/\eta-\lambda-f_{2}(u) where f1(v):=v/(1v)f_{1}(v):=v/(1-v). Recall that x=u/(1ηu)x=u/(1-\eta u), and x3M2x\leq 3M^{2} by Eq. (66). Therefore

where C0C_{0}, c1(m1,m2,m3)c_{1}(m_{1},m_{2},m_{3}) are coefficients depending uniquly on k,lk,l. This sum is bounded since each term is bounded, by the above estimate on kf1\partial^{k}f_{1}, and further using the fact that u=x/(1+ηx)10M2u=x/(1+\eta x)\leq 10M^{2} since x3M2x\leq 3M^{2} by Eq. (66). ∎

For the proof of Theorem 5, it is convenient to introduce the notations SZ:=ZTZ/nS_{Z}:=Z^{T}Z/n and SX:=XTX/n=Σ1/2SZΣ1/2S_{X}:=X^{T}X/n=\Sigma^{1/2}S_{Z}\Sigma^{1/2}. We begin by recalling the expressions for bias and variance of ridge regression at regularization parameter λ\lambda:

The bias and variance for min-norm interpolation are recovered by taking λ0+\lambda\to 0+. In the following, we will occasionally consider λ\lambda to be complex with Re(λ)>0{\operatorname{Re}}(\lambda)>0.

A.2 Proof of Theorem 5, bias term

Note that the bias is homogeneous (of degree 22) in β2\|\beta\|_{2}. Hence, without loss of generality we can and will assume β2=1\|\beta\|_{2}=1. For Re(η)>1/λmax(Σ){\operatorname{Re}}(\eta)>-1/\lambda_{\max}(\Sigma) define

Here, for Im(z)>0{\operatorname{Im}}(z)>0, Re(η)=0{\operatorname{Re}}(\eta)=0, rn=rn(z,η)r_{n}=r_{n}(z,\eta) is defined as the unique solution Im(rn(z,η))>0{\operatorname{Im}}(r_{n}(z,\eta))>0 of

where s1(η)s2(η)sp(η)s_{1}(\eta)\geq s_{2}(\eta)\geq\dots\geq s_{p}(\eta) are the eigenvalues of Ση\Sigma_{\eta}. By Eq. (73) we have si(η)=si/(1+ηsi)s_{i}(\eta)=s_{i}/(1+\eta s_{i}). Existence and uniqueness of the solution rn(z,η)r_{n}(z,\eta) is given, for instance, in (Knowles and Yin, 2017, Lemma 2.2). This lemma also states that rn(z,η)r_{n}(z,\eta) is the Stieltjes transorm of a probability measure ρη\rho_{\eta} with support in [0,C(M)][0,C(M)]. As a consequence, for λ=λ1+iλ2\lambda=\lambda_{1}+i\lambda_{2}, λ1>0\lambda_{1}>0

Therefore, taking the limit Im(λ)0{\operatorname{Im}}(\lambda)\to 0 in Eq. (75), we obtain, with probability at least 1CnD1-Cn^{-D}

Here the last identity is simply the definition of the predicted bias in Eq. (38).

Notice that ηF(λ,η)\eta\mapsto{\overline{F}}(\lambda,\eta) is infinitely differentiable in (1/(2M),)(-1/(2M),\infty) with

Note that ΣopM\|\Sigma\|_{op}\leq M, β2=1\|\beta\|_{2}=1 (we assumed this to hold without loss of generality) and R\mboxop1/λ\|R\|_{\mbox{\tiny\rm op}}\leq 1/\lambda, whence

We claim that a similar bound holds for the derivative of FnF_{n}, namely

In what follows we use the shorhand un=un(η,z)u_{n}=u_{n}(\eta,z). Note that, by Eq. (77), unu_{n} solves Eq. (60), and we can therefore apply Lemma 6. Note that

where (a)(a) follows from Lemma 6.(b)(b) and (b)(b) by the theorem’s assumption.

Further, by Lemma 6.(c)(c) we have ηkunCk(M)|\partial^{k}_{\eta}u_{n}|\leq C^{\prime}_{k}(M) for all kk, whence the desired claim (83) follows.

Using Eqs. (82), (83), and (78) in Lemma 5 we get, for any ε>0{\varepsilon}>0, kk and any δ<k/(4M)\delta<k/(4M), with probability at least 1nC1-n^{-C}

The claim follows by taking δ=nε/2\delta=n^{-{\varepsilon}/2} and k=1/εk=\lceil 1/{\varepsilon}\rceil, and recalling that, by Eqs. (74) and (81), we have

A.3 Proof of Theorem 5, variance term

Further the right-hand side is an analytic function of λ\lambda, for Re(λ)>0{\operatorname{Re}}(\lambda)>0. Using again (Knowles and Yin, 2017, Theorem 3.16.(i)(i)) , we get that for any D>0D>0, ε>0{\varepsilon}>0, where exists C=C(D,ε)C=C(D,{\varepsilon}) such that, with probability at least 1CnD1-Cn^{-D},

uniformly over Re(λ)>n2/3+(1/M){\operatorname{Re}}(\lambda)>n^{-2/3+(1/M)}, Im(λ)>0{\operatorname{Im}}(-\lambda)>0. Here rn(z)r_{n}(z) is defined as in the previous section, namely as the solution of Eq. (77) with η=0\eta=0. As shown there Im(rn(λ))Im(λ)/Re(λ)2{\operatorname{Im}}(r_{n}(-\lambda))|\leq|{\operatorname{Im}}(\lambda)|/{\operatorname{Re}}(\lambda)^{2}, whence

Since both functions \lambda\mapsto\frac{\lambda}{p}{\sf Tr}\big{(}\Sigma(S_{X}+\lambda I)^{-1} and λLn(λ)\lambda\mapsto L_{n}(\lambda) are analytic on Re(λ)>0{\operatorname{Re}}(\lambda)>0, the last bound implies a similar bound on their derivarives, namely

We are left with the task of computing the derivative of LnL_{n}:

where the last equality follows by the definition mn(x)=(1γ+γzrn(z))/(γz)m_{n}(x)=(1-\gamma+\gamma zr_{n}(z))/(\gamma z) given above.

A.4 Generalization to heavier tail covariates: Proof of Theorem 6

This result follows from Theorem 2 via a standard truncation argument. We outline the argument for the bias as the same proof holds almost unchanged for the variance term. We write the bias BX,Σ(β^;β)B_{X,\Sigma}(\hat{\beta};\beta) to emphasize its dependende upon the population covariance. Using Eq. (69), we obtain, for some C=C(λ)<C=C(\lambda)<\infty

For M>0M>0, decompose the random vriables zijz_{ij} as

We correspondingly decompose the matricex X,ZX,Z as

where ΣM=aM2Σ\Sigma_{M}=a_{M}^{2}\Sigma. Using Eq. (95), we get

Now we can apply Theorem 5 to XM,ΣMX_{M},\Sigma_{M}, since the variables zijMz^{M}_{ij} have bounded moments of all orders. Denoting by H^nM\widehat{H}_{n}^{M}, G^nM\widehat{G}_{n}^{M} the corresponding empirical measures, we have, with probability at least 1Cn21-Cn^{-2}:

Note that H^nM(s)=H^n(s/aM2)\widehat{H}^{M}_{n}(s)=\widehat{H}_{n}(s/a^{2}_{M}) and G^nM(s)=G^n(s/aM2)\widehat{G}^{M}_{n}(s)=\widehat{G}_{n}(s/a^{2}_{M}), hence these measures converge weakly to HM(s)=H(s/aM2)H^{M}(s)=H(s/a^{2}_{M}) and GM(s)=G(s/aM2)G^{M}(s)=G(s/a^{2}_{M}). Therefore, by Borel-Cantelli

The proof is completed by putting together Eqs. (100) and (101), taking MM\to\infty and noting that HMHH^{M}\Rightarrow H, GMGG^{M}\Rightarrow G in that limit.

Appendix B Proofs for the linear model: Risk min-norm regression

The proof follows from Theorem 5 by approximating min-norm regression with ridge regression with a small value of λ\lambda. For this reason we will assume λ1\lambda\leq 1 throughout. We treat separately the bias and variance terms. Since the variance is independent of β\beta and the bias is homogeneous in β2\|\beta\|_{2}, we can assume, without loss of generality, β21\|\beta\|_{2}\leq 1.

where σmin\sigma_{\min} denotes the smallest non-vanishing singular value and the last inequality holds since by assumption λmin(Σ)1/M\lambda_{\min}(\Sigma)\geq 1/M. Finally, since p/n11/M|p/n-1|\geq 1/M, we have that σmin(Z)/n1/C(M)\sigma_{\min}(Z)/\sqrt{n}\geq 1/C(M) with probability at least 1CnD1-Cn^{-D} Bai and Yin (1993). Hence, with the same probability \big{|}B_{X}(\hat{\beta}_{\lambda};\beta)^{1/2}-B_{X}(\hat{\beta};\beta)^{1/2}\big{|}\leq C(M)\lambda. Also notice that, by Eq. (103), BX(β^;β)Σ\mboxopMB_{X}(\hat{\beta};\beta)\leq\|\Sigma\|_{\mbox{\tiny\rm op}}\leq M. Therefore, we obtain

We next claim that Eqs. (11) and (38) imply the bound

In to prove this estimate, recall that γ>1+M1\gamma>1+M^{-1}, and that, by assumption, H^n\widehat{H}_{n} is supported on [M1,M][M^{-1},M]. Define c(λ)c_{*}(\lambda) via mn(λ)=(1γλγc(λ))/(γλ)m_{n}(-\lambda)=(1-\gamma-\lambda\gamma c_{*}(\lambda))/(-\gamma\lambda). Note that c(λ)=rn(λ)/γc_{*}(\lambda)=r_{n}(-\lambda)/\gamma, where rn(z)r_{n}(z) is the compaion Stieltjes transform already introduced in Eq. (77). In particular, by (Knowles and Yin, 2017, Lemma 2.2), cc_{*} it is non-negative and monotone decreasing in λ\lambda. Further Eq. (36), cc_{*} satisfies

It follows from supp(H^n)[M1,M]{\rm supp}(\widehat{H}_{n})\in[M^{-1},M] that f(x;λ)f(x;\lambda) is monotone decreasing on (0,)(0,\infty) for any λ0\lambda\geq 0 with f(0;λ)=1f(0;\lambda)=1 and limxf(x;0)=0\lim_{x\to\infty}f(x;0)=0, limxf(x;λ)=\lim_{x\to\infty}f(x;\lambda)=-\infty for any λ>0\lambda>0. Further f(x;λ)f(x;\lambda) is monotone decreasing in λ\lambda with f(x;λ)f(x;0)f(x;\lambda)\to f(x;0) as λ0\lambda\to 0. Therefore c(λ)c_{*}(\lambda) is monotone decreasing in λ\lambda, with limλ0c(λ)=c0>0\lim_{\lambda\to 0}c_{*}(\lambda)=c_{0}>0 (with c0c_{0} defined as per Eq. (10)). Further f(x;λ)f(x;λ)f(x;λ)\underline{f}(x;\lambda)\leq f(x;\lambda)\leq\overline{f}(x;\lambda) where f(x;λ)\overline{f}(x;\lambda) is obtained by replacing H^n\widehat{H}_{n} by δ1/M\delta_{1/M}, and f(x;λ)\underline{f}(x;\lambda) is obtained by replacing H^n\widehat{H}_{n} by δM\delta_{M}. Therefore C(M)1c(λ)c0C(M)C(M)^{-1}\leq c_{*}(\lambda)\leq c_{0}\leq C(M) for some finite constant C(M)C(M) depending uniquely on MM. In particular, this implies that λf(c;λ)\partial_{\lambda}f(c_{*};\lambda) is uniformly bounded. Finally

For s[M1,M]s\in[M^{-1},M] and x[C(M)1,C(M)]x\in[C(M)^{-1},C(M)], the integrand γs/(1+xγs)2\gamma s/(1+x\gamma s)^{2} is bounded and bounded away from zero. Therefore C(M)1xf(x;λ)C(M)C(M)^{-1}\leq-\partial_{x}f(x;\lambda)\leq C(M) for some (possibly different) constant C(M)C(M). By the implicit function theorem, it follows that λc(λ)C(M)|\partial_{\lambda}c_{*}(\lambda)|\leq C(M) is bounded for λ[0,M]\lambda\in[0,M], and therefore c(λ)c0C(M)λ|c_{*}(\lambda)-c_{0}|\leq C(M)\lambda, whence

(Here O(λ)O_{*}(\lambda) denotes a quantity bounded uniformly as O(λ)C(M)λ|O_{*}(\lambda)|\leq C(M)\lambda.). Sustituting this in Eq. (37), and repeating similar arguments, we obtain

Finally, substituting in Eq. (38), we obtain the claimed bound (105).

Using Eqs. (104), (105) and Theorem 5 we deduce that, for all M,D,εM,D,{\varepsilon}, and all n2/3+1/Mλ1n^{-2/3+1/M}\lambda\leq 1, with probability at least 1CnD1-Cn^{-D} ,

The desired bound is obtained by setting λ=n1/7\lambda=n^{-1/7} and choosing ε{\varepsilon} small enough.

Variance term. With the same notations introduced above, Eq. (70) implies

Taking the difference and bounding x(x+λ)2x12λ/x2|x(x+\lambda)^{-2}-x^{-1}|\leq 2\lambda/x^{2} for x,λ>0x,\lambda>0, we get

The rest of the proof is analogous to one for the bias term and is omitted.

Appendix C Other proofs for the linear model

where the second inequality holds almost surely for all nn large enough, following from λmin(Σ)c\lambda_{\min}(\Sigma)\geq c, and the Bai-Yin theorem (Bai and Yin, 1993), which implies that the smallest eigenvalue of ZTZ/nZ^{T}Z/n is almost surely larger than (1γ)2/2(1-\sqrt{\gamma})^{2}/2 for sufficiently large nn. As the right-hand side in the above display is strictly positive, we have that XTXX^{T}X is almost surely invertible. Therefore by the bias and variance results from Lemma 1, we have almost surely Π=0\Pi=0 and BX(β^;β)=0B_{X}(\hat{\beta};\beta)=0, and also

where FZTZ/nF_{Z^{T}Z/n} is the spectral measure of ZTZ/nZ^{T}Z/n and the last identity holds if M1<(c/2)(1γ)2M^{-1}<(c/2)(1-\sqrt{\gamma})^{2} almost surely for all nN0n\geq N_{0} (with N0N_{0} a random number) by Eq. (110). By the Marchenko-Pastur theorem (Marchenko and Pastur, 1967; Silverstein, 1995), which says that FZTZ/nF_{Z^{T}Z/n} converges weakly, almost surely, to the Marchenko-Pastur law FγF_{\gamma} (depending only on γ\gamma), whenc, almost surely :

where the last equality follows since FγF_{\gamma} is supported in [(1γ)2,(1+γ)2][(1-\sqrt{\gamma})^{2},(1+\sqrt{\gamma})^{2}], provided M>(1γ)2M>(1-\sqrt{\gamma})^{-2}. Thus the last display implies that as n,pn,p\to\infty, almost surely,

It remains to compute the right-hand side above. We recognize the right-hand side as the evaluation of the Stieltjes transform m(z)m(z) of Marchenko-Pastur law at z=0z=0. Fortunately, this has an explicit form (e.g., Lemma 3.11 in Bai and Silverstein 2010), for real z>0z>0:

The proof is thus completed by taking the limit z0+z\to 0+ in this expression.

C.2 Proof of Theorem 1

Variance term. Recalling the expression for the bias from Lemma 1 (where now Σ=I\Sigma=I), we have

where si=λi(XTX/n)s_{i}=\lambda_{i}(X^{T}X/n), i=1,,ni=1,\ldots,n are the nonzero eigenvalues of XTX/nX^{T}X/n. Let ti=λi(XXT/p)t_{i}=\lambda_{i}(XX^{T}/p), i=1,,pi=1,\ldots,p denote the eigenvalues of XXT/pXX^{T}/p. Then we may write si=(p/n)tis_{i}=(p/n)t_{i}, i=1,,ni=1,\ldots,n, and

where FXXT/pF_{XX^{T}/p} is the spectral measure of XXT/pXX^{T}/p. Now as n/pτ=1/γ<1n/p\to\tau=1/\gamma<1, we are back precisely in the setting of Theorem 2, and by the same arguments, we may conclude that almost surely

Bias term. Recall the expression for the bias from Lemma 1 (where now Σ=I\Sigma=I), and note the following key characterization of the pseudoinverse of a rectangular matrix AA,

We can apply this to A=X/nA=X/\sqrt{n}, and rewrite the bias as

for a deterministic sequence cn(z)>0c_{n}(z)>0, n=1,2,3,n=1,2,3,\ldots (defined for each nn via a certain fixed-point equation). Taking Θn=I/p\Theta_{n}=I/p in the above, note that this reduces to the almost sure convergence of the Stieltjes transform of the specral distribution of Σ^\hat{\Sigma}, and hence by the (classical) Marchenko-Pastur theorem, we learn that cn(z)m(z)c_{n}(z)\to m(-z), where mm denotes the Stieltjes transform of the Marchenko-Pastur law FγF_{\gamma}. Further, taking Θn=ββT/p\Theta_{n}=\beta\beta^{T}/p, we see from (115) and cn(z)m(z)c_{n}(z)\to m(-z) that, almost surely,

Define fn(z)=zβT(Σ^+zI)1βf_{n}(z)=z\beta^{T}(\hat{\Sigma}+zI)^{-1}\beta. Notice that fn(z)r2|f_{n}(z)|\leq r^{2}, and fn(z)=βT(Σ^+zI)2Σ^βf_{n}^{\prime}(z)=\beta^{T}(\hat{\Sigma}+zI)^{-2}\hat{\Sigma}\beta, so

where λmax(Σ^)\lambda_{\max}(\hat{\Sigma}) and λmin+(Σ^)\lambda_{\min}^{+}(\hat{\Sigma}) denote the largest and smallest nonzero eigenvalues, respectively, of Σ^\hat{\Sigma}, and the second inequality holds almost surely for large enough nn, by the Bai-Yin theorem (Bai and Yin, 1993). As its derivatives are bounded, the sequence fnf_{n}, n=1,2,3,n=1,2,3,\ldots is equicontinuous, and by the Arzela-Ascoli theorem, we deduce that fnf_{n} converges uniformly to its limit. By the Moore-Osgood theorem, we can exchange limits (as n,pn,p\to\infty and z0+z\to 0^{+}) and conclude from (114), (116) that as n,pn,p\to\infty, almost surely,

Finally, relying on the fact that the Stieltjes transform of the Marchenko-Pastur law has the explicit form in (112), we can compute the above limit:

C.3 Proof of Corollary 2

The corollary follows immediately from Theorem 3, once we estabilish that, for β2r2\|\beta\|^{2}\to r^{2}, \mathscrsfsB(H,H,γ)\mathscrsfsB\mboxequi(H,γ)\mathscrsfs{B}(H,H,\gamma)\to\mathscrsfs{B}_{\mbox{\tiny\rm equi}}(H,\gamma). By homogeneity, we can assume without loss of generality that β2=r2=1\|\beta\|^{2}=r^{2}=1. By using Eq. (11), we get

where that last equality follows from the definition of c0c_{0} in Eq. (10). Comparing with the definition of cuB\mboxequi(H,γ)cuB_{\mbox{\tiny\rm equi}}(H,\gamma) in Eq. (16) yields the desired claim.

C.4 The case of equicorrelated features

As an illustration of Corollary 2, we consider the case of equicorrelated covariates. Namellt, for ρ[0,1)\rho\in[0,1), we assume that Σii=1\Sigma_{ii}=1 for all ii, and Σij=ρ\Sigma_{ij}=\rho for all iji\not=j.

Assume the conditions of Theorem 2, and moreover, assume that Σ\Sigma has ρ\rho-equicorrelation structure for all n,pn,p, and some ρ[0,1)\rho\in[0,1). Then as n,pn,p\to\infty, with p/nγ>1p/n\to\gamma>1, we have almost surely

Let HρH_{\rho} denote the weak limit of FΣF_{\Sigma}, when Σ\Sigma has ρ\rho-equicorrelation structure for all n,pn,p. A short calculation shows that such a matrix Σ\Sigma has one eigenvalue value equal to 1+(p1)ρ1+(p-1)\rho, and p1p-1 eigenvalues equal to 1ρ1-\rho. Thus the weak limit of its spectral measure is simply Hρ=1[1ρ,)H_{\rho}=1_{[1-\rho,\infty)}, i.e., dHρ=δ1ρdH_{\rho}=\delta_{1-\rho}, a point mass at 1ρ1-\rho of probability one.

We remark that the present case, strictly speaking, breaks the conditions that we assume in Theorem 2, because λmax(Σ)=1+(p1)ρ\lambda_{\max}(\Sigma)=1+(p-1)\rho clearly diverges with pp. However, by decomposing Σ=(1ρ)I+ρ\mathds1\mathds1T\Sigma=(1-\rho)I+\rho\mathds{1}\mathds{1}^{T} (where \mathds1\mathds{1} denotes the vector of all 1s), and correspondingly decomposing the functions fn,gn,hnf_{n},g_{n},h_{n} defined in the proof of Theorem 2, to handle the rank one part ρ\mathds1\mathds1T\rho\mathds{1}\mathds{1}^{T} properly, we can ensure the appropriate boundedness conditions. Thus, the result in the theorem still holds when Σ\Sigma has ρ\rho-equicorrelation structure.

Now denote by vρv_{\rho} the companion Stieltjes transform of the empirical spectral distribution FHρ,γF_{H_{\rho},\gamma}, to emphasize its dependence on ρ\rho. Recall the Silverstein equation (120), which for the equicorrelation case, as dHρ=δ1ρdH_{\rho}=\delta_{1-\rho}, becomes

where v0v_{0} is the companion Stieltjes transform in the Σ=I\Sigma=I case, the object of study in Appendix LABEL:app:risk_iso2. From the results for v0v_{0} from (LABEL:eq:v0) and (LABEL:eq:v1), invoking the relationship in the above display, we have

Plugging these into the asymptotic risk expression from Theorem 2 gives

C.5 Autoregressive features

We consider a ρ\rho-autoregressive structure for Σ\Sigma, for a constant ρ[0,1)\rho\in[0,1), meaning that Σij=ρij\Sigma_{ij}=\rho^{|i-j|} for all i,ji,j. In this case, it is not clear that a closed-form exists for v(0)v(0) or v(0)v^{\prime}(0). However, we can compute these numerically. In fact, the strategy we describe below applies to any situation in which we are able to perform numerical integration against dHdH, where HH is the weak limit of the spectral measure FΣF_{\Sigma} of Σ\Sigma.

The critical relationship that we use is the Silverstein equation (Silverstein, 1995), which relates the companion Stieltjes transform vv to HH via

Therefore, we can use a simple univariate root-finding algorithm (like the bisection method) to solve for v(0)v(0) in (121). With v(0)v(0) computed, we can compute v(0)v^{\prime}(0) by first differentiating (120) with respect to zz (see Dobriban 2015), and then taking z0+z\to 0^{+}, to yield

When Σ\Sigma is of ρ\rho-autoregressive form, it is known to have eigenvalues (Trench, 1999):

This allows us to efficiently approximate an integral with respect to dHdH (e.g., by taking each θi\theta_{i} to be in the midpoint of its interval given above), solve for v(0)v(0) in (121), v(0)v^{\prime}(0) in (122), and evaluate the asymptotic risk as per Theorem 2.

Figure 13 shows the results from using such a numerical scheme to evaluate the asymptotic risk, as ρ\rho varies from 0 to 0.75.

C.6 Proof of Theorem 7

We begin by recalling an alternative to leave-one-out cross-validation, for linear smoothers, called generalized cross-validation (GCV) (Craven and Wahba, 1978; Golub et al., 1979b). The GCV error of the ridge regression estimator at a tuning parameter value λ\lambda defined as

Compared to the shortcut formula for leave-one-out CV in (46), we can see that GCV in (123) swaps out the iith diagonal element (Sλ)ii(S_{\lambda})_{ii} in the denominator of each summand with the average diagonal element Tr(Sλ)/n{\sf Tr}(S_{\lambda})/n. This modification makes GCV rotationally invariant (Golub et al., 1979b).

It turns out that the GCV error is easier to analyze, compared to the CV error. Thus we proceed by first studying GCV, and then relating CV to GCV. We break up the exposition below into these two parts accordingly.

Let us rewrite the GCV criterion in (123) as

We will treat the almost sure convergence of the numerator and denominator separately.

The denominator is an easier calculation. Denoting si=λi(XTX/n)s_{i}=\lambda_{i}(X^{T}X/n), i=1,,pi=1,\ldots,p, we have

where this convergence holds almost surely as n,pn,p\to\infty, a direct consequence of the Marchenko-Pastur theorem, and FγF_{\gamma} denotes the Marchenko-Pastur law. Meanwhile, we can rewrite this asymptotic limit as

where m=mFγm=m_{F_{\gamma}} denotes the Stieltjes transform of the Marchenko-Pastur law FγF_{\gamma}, and therefore, almost surely,

The numerator requires only a bit more difficult calculation. Let y=Xβ+ϵy=X\beta+\epsilon and cn=p(σ/r)c_{n}=\sqrt{p}(\sigma/r). Observe

Note that δ\delta has independent entries with mean zero and variance r2/pr^{2}/p, and further, note that δ\delta and AA are independent. Therefore we can use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), which is adapted from Lemma B.26 in Bai and Silverstein (2010).As written, Lemma 7.6 of Dobriban and Wager (2018) assumes i.i.d. components for the random vector in question, which is not necessarily true of δ\delta in our case. However, an inspection of their proof shows that they only require independent components with mean zero and common variance, which is precisely as stated in Lemma B.26 of Bai and Silverstein (2010). This result asserts that, almost surely,

A short calculation and application of the Marchenko-Pastur theorem gives that, almost surely,

where for the second sum, we used Vitali’s theorem to show convergence of the derivative of the Stieltjes transform of the spectral distribution of XTX/nX^{T}X/n to the derivative of the Stieltjes transform of FγF_{\gamma} (note that Vitali’s theorem applies as the function in question is bounded and analytic). By a similar calculation, we have almost surely,

Putting (125), (126) together with (124), we have, almost surely,

To show that this matches to the asymptotic prediction error of ridge regression requires some nontrivial calculations. We start by reparametrizing the above asymptotic limit in terms of the companion Stieltjes transform, abbreviated by v=vFγv=v_{F_{\gamma}}. This satisfies

Introducing α=r2/(σ2γ)\alpha=r^{2}/(\sigma^{2}\gamma), the almost sure limit of GCV is

where in the second line we rearranged, in the third line we applied the companion Stieltjes transform facts, and in the fourth line we simplified. We can now recognize the above as σ2\sigma^{2} plus the asymptotic risk of ridge regression at tuning parameter λ\lambda, either from the proof of Theorem 2, or from Theorem 2.1 in Dobriban and Wager (2018). In terms of the Stieltjes transform itself, this is σ2+σ2γ(m(λ)λ(1αλ)m(λ))\sigma^{2}+\sigma^{2}\gamma(m(-\lambda)-\lambda(1-\alpha\lambda)m^{\prime}(-\lambda)), which proves the first claimed result.

In the second line, we used Tr(Sλ)/n=(1/n)i=1psi/(si+λ)smax/(smax+λ){\sf Tr}(S_{\lambda})/n=(1/n)\sum_{i=1}^{p}s_{i}/(s_{i}+\lambda)\leq s_{\max}/(s_{\max}+\lambda), with smax=λmax(XTX/n)s_{\max}=\lambda_{\max}(X^{T}X/n), and in the third line, we used y22/n2(r2+σ2)\|y\|_{2}^{2}/n\leq 2(r^{2}+\sigma^{2}) almost surely for sufficiently large nn, by the strong law of large numbers, and smax2s_{\max}\leq 2 almost surely for sufficiently large nn, by the Bai-Yin theorem (Bai and Yin, 1993). Furthermore, writing gn,hng_{n},h_{n} for the numerator and denominator of fnf_{n}, respectively, we have

The above argument just showed that gn(λ)|g_{n}(\lambda)| is upper bounded on [λ1,λ2][\lambda_{1},\lambda_{2}], and hn(λ)|h_{n}(\lambda)| is lower bounded on [λ1,λ2][\lambda_{1},\lambda_{2}]; also, clearly hn(λ)1|h_{n}(\lambda)|\leq 1; therefore to show that fn|f_{n}^{\prime}| is almost surely bounded on [λ1,λ2][\lambda_{1},\lambda_{2}], it suffices to show that both gn,hn|g_{n}^{\prime}|,|h_{n}^{\prime}| are. Denoting by uiu_{i}, i=1,,pi=1,\ldots,p the eigenvectors of XTX/nX^{T}X/n (corresponding to eigenvalues sis_{i}, i=1,,pi=1,\ldots,p), a short calculation shows

the last step holding almost surely for large enough nn, by the law of large numbers. Also,

the last step holding almost surely for large enough nn, by the Bai-Yin theorem. Thus we have shown that fn|f_{n}^{\prime}| is almost surely bounded on [λ1,λ2][\lambda_{1},\lambda_{2}], for large enough nn, and applying the Arzela-Ascoli theorem, fnf_{n} converges uniformly to ff. With λn=argminλ[λ1,λ2]fn(λ)\lambda_{n}=\arg\min_{\lambda\in[\lambda_{1},\lambda_{2}]}f_{n}(\lambda), this means for any for any λ[λ1,λ2]\lambda\in[\lambda_{1},\lambda_{2}], almost surely

where we used the optimality of λn\lambda_{n} for fnf_{n}, and then uniform convergence. In other words, almost surely,

As the almost sure convergence RX(β^λ)+σ2f(λ)R_{X}(\hat{\beta}_{\lambda})+\sigma^{2}\to f(\lambda) is also uniform for λ[λ1,λ2]\lambda\in[\lambda_{1},\lambda_{2}] (by similar arguments, where we bound the risk and its derivative in λ\lambda), we conclude that almost surely RX(β^λ)σ2γm(1/α)R_{X}(\hat{\beta}_{\lambda})\to\sigma^{2}\gamma m(-1/\alpha), completing the proof for GCV.

C.6.2 Analysis of CV

Let us rewrite the CV criterion, starting in its shortcut form (46), as

where DλD_{\lambda} is a diagonal matrix that has diagonal elements (Dλ)ii=1(Sλ)ii(D_{\lambda})_{ii}=1-(S_{\lambda})_{ii}, i=1,,ni=1,\ldots,n.

First, fixing an arbitrary i=1,,ni=1,\ldots,n, we will study the limiting behavior of

Since xix_{i} and XTXX^{T}X are not independent, we cannot immediately apply the almost sure convergence of quadratic forms lemma, as we did in the previous analysis of GCV. But, letting XiX_{-i} denote the matrix XX with the iith row removed, we can write (XTX/n+λI)1=(XiTXi/n+λI+xixiT/n)1(X^{T}X/n+\lambda I)^{-1}=(X_{-i}^{T}X_{-i}/n+\lambda I+x_{i}x_{i}^{T}/n)^{-1}, and use the Sherman-Morrison-Woodbury formula to separate out the dependent and independent parts, as follows. Letting δi=xi/n\delta_{i}=x_{i}/\sqrt{n}, Ai=(XiTXi/n+λI)1A_{i}=(X_{-i}^{T}X_{-i}/n+\lambda I)^{-1}, and A=(XTX/n+λI)A=(X^{T}X/n+\lambda I), we have

Note δi\delta_{i} and AiA_{i} are independent (i.e., xix_{i} and XiTXiX_{-i}^{T}X_{-i} are independent), so we can now use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), adapted from Lemma B.26 in Bai and Silverstein (2010), to get that, almost surely

Further, as Tr(Ai)/nγm(λ){\sf Tr}(A_{i})/n\to\gamma m(-\lambda) almost surely by the Marchenko-Pastur theorem, we have, almost surely,

The strategy henceforth, based on the result in (129), is to replace the denominators 1(Sλ)ii1-(S_{\lambda})_{ii}, i=1,,ni=1,\ldots,n in the summands of the CV error by their asymptotic limits, and then control the remainder terms. More precisely, we define Dˉλ=(1+γm(λ))1I\bar{D}_{\lambda}=(1+\gamma m(-\lambda))^{-1}I, and then write, from (127),

We will first show that almost surely b0b\to 0. Observe, by the Cauchy-Schwartz inequality,

where in the second step, we used (ISλ)y22/ny22/n2(r2+σ2)\|(I-S_{\lambda})y\|_{2}^{2}/n\leq\|y\|_{2}^{2}/n\leq 2(r^{2}+\sigma^{2}), which holds almost surely for large enough nn, by the strong law of large numbers. Meanwhile,

By the Marchenko-Pastur theorem, we have d30d_{3}\to 0 almost surely. Using u2v2=(uv)(u+v)u^{2}-v^{2}=(u-v)(u+v) on the maximands in d2d_{2},

In the second line above, we used Tr(Ai)/nλmax(Ai)1/λ{\sf Tr}(A_{i})/n\leq\lambda_{\max}(A_{i})\leq 1/\lambda, i=1,,ni=1,\ldots,n, and also Tr(A)/n1/λ{\sf Tr}(A)/n\leq 1/\lambda. In the third line, we used the Sherman-Morrison-Woodbury formula. In the fourth line, for each summand i=1,,ni=1,\ldots,n, we upper bounded the numerator by λmax(A)2δi22δi22/λ2\lambda_{\max}(A)^{2}\|\delta_{i}\|_{2}^{2}\leq\|\delta_{i}\|_{2}^{2}/\lambda^{2}, and lower bounded the denominator by λ/(smax+λ)\lambda/(s_{\max}+\lambda), with smax=λmax(XTX/n)s_{\max}=\lambda_{\max}(X^{T}X/n) (which follows from a short calculation using the eigendecomposition of XTX/nX^{T}X/n). In the fifth line, we used δi22=eiT(XXT/n)eismax\|\delta_{i}\|_{2}^{2}=e_{i}^{T}(XX^{T}/n)e_{i}\leq s_{\max}, and in the sixth line, we used smax2s_{\max}\leq 2 almost surely for sufficiently large nn, by the Bai-Yin theorem (Bai and Yin, 1993).

Now we will show that d10d_{1}\to 0 almost surely by showing that the convergence in (128) is rapid enough. As before, using u2v2=(uv)(u+v)u^{2}-v^{2}=(u-v)(u+v) on the maximands in d1d_{1}, we have

Here we used that for i=1,,ni=1,\ldots,n, we have Tr(Ai)/n1/λ{\sf Tr}(A_{i})/n\leq 1/\lambda, and similarly, δiTAiδiδi22/λsmax/λ2/λ\delta_{i}^{T}A_{i}\delta_{i}\leq\|\delta_{i}\|_{2}^{2}/\lambda\leq s_{\max}/\lambda\leq 2/\lambda, with the last inequality holding almost surely for sufficiently large nn, by the Bai-Yin theorem (Bai and Yin, 1993). To show maxi=1,,nΔi0\max_{i=1,\ldots,n}\Delta_{i}\to 0 almost surely, we start with the union bound and Markov’s inequality, where q=2+η/2q=2+\eta/2, and tnt_{n} is be specified later,

where in the last step we used n/p2/γn/p\leq 2/\gamma for large enough nn. Since the right-hand side above is summable in pp, we conclude by the Borel-Cantelli lemma that maxi=1,,nΔi0\max_{i=1,\ldots,n}\Delta_{i}\to 0 almost surely, and therefore also d10d_{1}\to 0 almost surely. This finishes the proof that b0b\to 0.

As before, it helps to reparametrize in terms of the companion Stieltjes transform, giving

Adding v(λ)v(-\lambda) to both sides, then dividing both sides by 1+v(λ)1+v(-\lambda), yields

and dividing through by v(λ)v(-\lambda), then rearranging, gives

C.7 Misspecified model results

These results follow because, as argued in Section 5.2, the misspecified case can be reduced to a well-specified case after we make the substitutions in (21).

C.8 CV and GCV simulations

Figures 16 and 16 investigate the effect of using CV and GCV tuning for ridge regression, under the same simulation setup as Figures 8 and 8, respectively. It is worth noting that for these figures, there is a difference in how we compute finite-sample risks, compared to what is done for all of the other figures in the paper. In all others, we can compute the finite-sample risks exactly, because, recall, our notion of risk is conditional on XX,

and this quantity can be computed analytically for linear estimators like min-norm least squares and ridge regression. When we tune ridge by minimizing CV or GCV, however, we can no longer compute its risk analytically, and so in our experiments we approximate the above expectation by an average over 20 repetitions. Therefore, the finite-sample risks in Figures 16 and 16 are (only a little bit) farther from their asymptotic counterparts compared to all other figures in this paper, and we have included the finite-sample risk of the optimally-tuned ridge estimator in these figures, when the risk is again computed in the same way (approximated by an average over 20 repetitions), as a reference. All this being said, we still see excellent agreement between the risks under CV, GCV, and optimal tuning, and these all lie close to their (common) asymptotic risk curves, throughout.

Appendix D Nonlinear model: Proof of Theorem 8

Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write A=A(s,t){A}={A}(s,t), m1,n=m1,n(ξ,s,t)m_{1,n}=m_{1,n}(\xi,s,t), and m2,n=m2,n(ξ,s,t)m_{2,n}=m_{2,n}(\xi,s,t).

The next theorem is the the central random matrix results, and characterizes the asymptotics of m1,nm_{1,n}, m2,nm_{2,n}. It provides a generalization of Lemma 3.

Consider Im(ξ)>0{\operatorname{Im}}(\xi)>0 or Im(ξ)=0{\operatorname{Im}}(\xi)=0, Re(ξ)<0{\operatorname{Re}}(\xi)<0, with st0s\geq t\geq 0. Let m1m_{1} and m2m_{2} be the unique solutions of the following fourth degree equations:

subject to the condition of being analytic functions for Im(z)>0{\operatorname{Im}}(z)>0, and satisfying m1(z,s,t),m2(z,s,t)1/Im(z)|m_{1}(z,s,t)|,|m_{2}(z,s,t)|\leq 1/{\operatorname{Im}}(z) for Im(z)>C{\operatorname{Im}}(z)>C (with CC a sufficiently large constant). Then, as n,p,dn,p,d\to\infty, such that p/nγp/n\to\gamma and d/pψd/p\to\psi, we have almost surely (and in L1L^{1}),

The proof of this theorem is given in Appendix D.2. Now define

where recall Σ^=XTX/n\hat{\Sigma}=X^{T}X/n. As a corollary of the above, we obtain the asymptotic Stieltjes transform of the eigenvalue distribution of Σ^\hat{\Sigma} (this result was first derived in Pennington and Worah (2017)).

Assume the conditions of Theorem 9. Consider Im(ξ)>0{\operatorname{Im}}(\xi)>0. As n,p,dn,p,d\to\infty, with p/nγp/n\to\gamma and d/pψd/p\to\psi, the Stieltjes transform of spectral distribution of Σ^\hat{\Sigma} in (137) satisfies almost surely (and in L1L^{1}) Sn(ξ)s(ξ)S_{n}(\xi)\to s(\xi) where ss is a nonrandom function that uniquely solves the following equations (abbreviating s=s(ξ)s=s(\xi)):

subject to the condition of being analytic for Im(z)>0{\operatorname{Im}}(z)>0, and satisfying s(z2)<1/Im(z2)|s(z^{2})|<1/{\operatorname{Im}}(z^{2}) for Im(z)>C{\operatorname{Im}}(z)>C (where CC is a large enough constant). When c1=0c_{1}=0, the function ss is the Stieltjes transform of the Marchenko-Pastur distribution.

We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents to the variance of min-norm least squares.

Assume the conditions of Theorem 9. Let m1m_{1}, m2m_{2} be the asymptotic resolvents given in Theorem 9. Define

Then for γ1\gamma\neq 1, the following Taylor-Laurent expansion holds around ξ=0\xi=0:

with each Di=Di(γ,ψ,c1)D_{i}=D_{i}(\gamma,\psi,c_{1}). Furthermore, for the ridge regression estimator β^λ\hat{\beta}_{\lambda} in (5), as n,p,dn,p,d\to\infty, such that p/nγ(0,)p/n\to\gamma\in(0,\infty), d/pψ(0,1)d/p\to\psi\in(0,1), the following ridgeless limit holds almost surely:

The proof of this lemma can be found in Appendix D.3.

D.2 Proof of Theorem 9

The proof follows the approach developed by Cheng and Singer (2013) to determine the asymptotic spectral measure of symmetric kernel random matrices. The latter is in turn inspired to the classical resolvent proof of the semicircle law for Wigner matrices, see Anderson et al. (2009). The present calculation is somewhat longer because of the block structure of the matrix A{A}, but does not present technical difficulties. We will therefore provide a proof outline, referring to Cheng and Singer (2013) for further detail.

This decomposition satisfies the following properties:

As mentioned, it is an orthogonal decomposition: xφ(x)μd(dx)=0\int x\varphi_{\perp}(x)\,\mu_{d}({\rm d}x)=0. Further, by symmetry and the normalization assumption, xμd(dx)=φ(x)μd(dx)=0\int x\,\mu_{d}({\rm d}x)=\int\varphi_{\perp}(x)\,\mu_{d}({\rm d}x)=0.

φ(x)2μd(dx)1c1\int\varphi_{\perp}(x)^{2}\,\mu_{d}({\rm d}x)\to 1-c_{1}, φ(x)2μG(dx)1c1\int\varphi_{\perp}(x)^{2}\,\mu_{G}({\rm d}x)\to 1-c_{1}, as dd\to\infty.

Finally, recall the following definitions for the random resolvents:

In the next section, we will summarize some basic facts about the resolvent m1,nm_{1,n}, m2,nm_{2,n} and their analyticity, and some concentration properties of M1,n,M2,nM_{1,n},M_{2,n}. We will then derive Eqs. (133) and (134). Thanks to the analyticity properties, we will assume throughout these derivations Im(ξ)C{\operatorname{Im}}(\xi)\geq C for CC a large enough constant. Also, we will assume γ,ψ,φ\gamma,\psi,\varphi to be fixed throughout, and will not explicitly point out the dependence with respect to these arguments.

Consider, to be definite m1,nm_{1,n}. Denoting by (λi)iN(\lambda_{i})_{i\leq N}, (vi)iN({\boldsymbol{v}}_{i})_{i\leq N}, the eigenvalues and eigenvectors of A(n){A}(n), we have

Properties (a)(a)-(c)(c) are then standard consequences of m1,nm_{1,n} being a Stieltjes transform (see, for instance, Anderson et al. (2009)) ∎

Without loss of generality, we will assume i=Ni=N, and write W=(Wij)i,jN1{W}_{*}=(W_{ij})_{i,j\leq N-1} w=(wN,i)iN1{{w}}=({{w}}_{N,i})_{i\leq N-1}, ω=WNN\omega=W_{NN}. Define

and R(N){\boldsymbol{R}}^{(N)}, ρ(N)\rho^{(N)} the same quantities when W{W} is replaced by W(N){W}^{(N)}. Then, by Schur complement formula, it is immediate to get the formulae

Let S0S{N}S_{0}\equiv S\setminus\{N\} and S1S{N}S_{1}\equiv S\cap\{N\} (i.e. S1={N}S_{1}=\{N\} or S1=S_{1}=\emptyset depending whether NSN\in S or not). Define \Delta_{A}\equiv\big{|}{\sf Tr}_{A}\big{[}({W}-\xi{{I}}_{N})^{-1}\big{]}-{\sf Tr}_{A}\big{[}({W}^{(N)}-\xi{{I}}_{N})^{-1}\big{]}\big{|}. Then, using the bounds given above,

Next consider ΔS0\Delta_{S_{0}}. Notice that

Letting Vij=vi,PS0vjV_{ij}=\langle{\boldsymbol{v}}_{i},{\sf P}_{S_{0}}{\boldsymbol{v}}_{j}\rangle, we have V\mboxopPS0\mboxop1\|{\boldsymbol{V}}\|_{\mbox{\tiny\rm op}}\leq\|{\sf P}_{S_{0}}\|_{\mbox{\tiny\rm op}}\leq 1. Therefore, defining ui=w,vi/(λiξ)u_{i}=\langle{{w}},{\boldsymbol{v}}_{i}\rangle/(\lambda_{i}-\xi),

which implies that Eq. (157) also holds in this case.

Using Eqs. (151) and (157) yields the desired claim. ∎

If Im(ξ)ξ0>0{\operatorname{Im}}(\xi)\geq\xi_{0}>0, or Re(ξ)ξ0<0{\operatorname{Re}}(\xi)\leq-\xi_{0}<0, with st0s\geq t\geq 0. Then there exists c0=c0(ξ0)c_{0}=c_{0}(\xi_{0}) such that, for i{1,2}i\in\{1,2\},

In particular, for Im(ξ)>0{\operatorname{Im}}(\xi)>0, or Re(ξ)<0{\operatorname{Re}}(\xi)<0, then Mi,n(ξ,s,t)mi,n(ξ,s,t)0|M_{i,n}(\xi,s,t)-m_{i,n}(\xi,s,t)|\to 0 almost surely.

The proof is completely analogous to (Cheng and Singer, 2013, Lemma 2.4), except that we use Azuma-Hoeffding instead of Burkholder’s inequality. Consider M1,n(ξ)M_{1,n}(\xi) (we omit the arguments s,ts,t for the sake of simplicity). We construct the martingale

In particular Eqs. (134), (133) admit a unique solution with m1,m2<r0|m_{1}|,|m_{2}|<r_{0} for ξ>ξ0\xi>\xi_{0}.

whence Fi(m)<r|F_{i}({\boldsymbol{m}})|<r by taking ξ0\xi_{0} large enough. Analogously

Again, the claim follows by taking ξ0\xi_{0} large enough. ∎

The next lemma allow to restrict ourselves to cases in which φ\varphi is polynomial with degree independent of nn.

The proof is essentially the same as for Lemma 4.4 in Cheng and Singer (2013). ∎

(Here dd is understood to be the same in each case.)

This follows immediately from Lemma 9. Denote, with a slight abuse of notation, by A(n,p){A}(n,p) the matrix in Eq. (132). Consider for instance the difference

where A(n,p){A}^{*}(n,p) is obtained from A(n,p){A}(n,p) by zero-ing the last row and column. Lemma 9 then implied m1,n,p(ξ)m1,n,p1(ξ)3/ξ0|m_{1,n,p}(\xi)-m_{1,n,p-1}(\xi)|\leq 3/\xi_{0}. The other terms are treated analogously. ∎

D.2.2 Derivation of Eqs. (133)

Using Eq. (179), and Woodbury’s formula, we get (writing for simplicity v=A,n+p{\boldsymbol{v}}={A}_{\cdot,n+p}

By a concentration of measure argument (see, e.g., (Tao, 2012, Section 2.4.3)), the same statement also holds with high probability

By a concentration of measure argument, we also have

By using this together with Eqs. (199) and (204), we get

D.2.3 Derivation of Eqs. (134)

The derivation si analogous to the one in the previous section (notice that we will redefine some of the notations used in the last section), and hence we will only outline the main steps.

We decompose wa{{w}}_{a}, 1ap11\leq a\leq p-1, and zi{{z}}_{i}, 1in1\leq i\leq n into their components along wp{{w}}_{p}, and the orthogonal complement:

As in the previous section, it can be shown that E\mboxopεn(logn)c/n\|{\boldsymbol{E}}\|_{\mbox{\tiny\rm op}}\leq{\varepsilon}_{n}\equiv(\log n)^{c}/\sqrt{n} (with cc an absolute constant), and therefore Eq. (207) yields

We then proceed to compute the various terms on the right-hand side. The calculation is very similar to the one in the previous section, and we limit ourselves to reporting the results:

Substituting Eqs. (221) to (224) into Eq. (220), we get

D.2.4 Completing the proof

Consider Im(ξ)C0{\operatorname{Im}}(\xi)\geq C_{0} a large enough constant. By Eqs. (206), (206), we have

D.3 Proof of Lemma D.5

With the above definitions there exists a constant CC such that

Therefore, on the same event (for a suitable constant CC)

This implies Eq. (228) because with the stated probability we have wi1+εn\|{{w}}_{i}\|\leq 1+{\varepsilon}_{n} for all ii and wi,wjεn|\langle{{w}}_{i},{{w}}_{j}\rangle|\leq{\varepsilon}_{n} for all iji\neq j.

where the convergence takes place almost surely and in L1L^{1}.

Denote by (λi)ip(\lambda_{i})_{i\leq p}, (vi)ip({\boldsymbol{v}}_{i})_{i\leq p} the eigenvalues and eigenvectors of Σ^\hat{\Sigma}. The following qualitative behavior can be extracted from the asymptotic of the Stieltjes transform as stated in Corollary 6.

For any γ1\gamma\neq 1, c1[0,1)c_{1}\in[0,1), there exists ρ0>0\rho_{0}>0 such that the following happens. Let S+{i[p]:  λi>ρ0}S_{+}\equiv\{i\in[p]:\;|\lambda_{i}|>\rho_{0}\}, S{i[p]:  λiρ0}S_{-}\equiv\{i\in[p]:\;|\lambda_{i}|\leq\rho_{0}\}. Then, the following limits hold almost surely

(Here \Rightarrow denotes weak convergence of probability measures.)

Note that mn(ξ,x,c1x)m(ξ,x,c1x)m_{n}(\xi,x,c_{1}x)\to m(\xi,x,c_{1}x) as nn\to\infty. Further, for Im(ξ)>0{\operatorname{Im}}(\xi)>0 or Re(ξ)<0{\operatorname{Re}}(\xi)<0, it it is immediate to show tha x2mn(ξ,x,c1x)\partial^{2}_{x}m_{n}(\xi,x,c_{1}x) is bounded in nn (in a neighborhood of x=0x=0). Hence

Since Qn(ξ)q(ξ)Q_{n}(\xi)\to q(\xi), a weak convergence argument implies μnμ\mu_{n}\Rightarrow\mu_{\infty} almost surely. Further, defining μn+=1(ρ0,)μn\mu^{+}_{n}={\boldsymbol{1}}_{(\rho_{0},\infty)}\mu_{n}, μn=1[0,ρ0]μn\mu^{-}_{n}={\boldsymbol{1}}_{[0,\rho_{0}]}\mu_{n}, Lemma 15 implies μn+μ+\mu^{+}_{n}\Rightarrow\mu^{+}_{\infty}, μnc0δ0\mu^{-}_{n}\Rightarrow c_{0}\delta_{0}, where μ+\mu^{+}_{\infty} is a measure supported on [ρ0,)[\rho_{0},\infty), with μ+([ρ0,))=1c\mu^{+}_{\infty}([\rho_{0},\infty))=1-c. This in turns implies

In particular, q+q_{+} is analytic in a neighborhood of . This proves Eq. (141), with q+(0)=D0q_{+}(0)=D_{0}, γc=D1\gamma c=D_{-1}.

Comparing the last two displays, we obtain our claim

D.4 Proof of Corollary 8 and Corollary 6

Throughout this section, set s=t=0s=t=0 (and drop these arguments for the various functions), and let Mn(ξ)γM1,n(ξ)+M2,n(ξ)M_{n}(\xi)\equiv\gamma M_{1,n}(\xi)+M_{2,n}(\xi), mn(ξ)γm1,n(ξ)+m2,n(ξ)m_{n}(\xi)\equiv\gamma m_{1,n}(\xi)+m_{2,n}(\xi), m(ξ)=γm1(ξ)+m2(ξ)m(\xi)=\gamma m_{1}(\xi)+m_{2}(\xi). In this case we have

and therefore, a simple linear algebra calculation yields

Theorem LABEL:thm:var_nonlinear immediately implies Sn(z2)s(z2)S_{n}(z^{2})\to s(z^{2}) (almost surely and in L1L^{1}), where

Equations (133) and (134) simplify for the case s=t=0s=t=0 (setting mj=mj(z,s=0,t=0)\overline{m}_{j}=m_{j}(z,s=0,t=0)) to yield

Taking a linear combination of these two equations, we get

Comparing this with Eq (260), we get Eq. (138). Substituting the latter in Eqs. (261), (262), we get Eqs. (139), (140).

Finally, Corollary 6 follows by taking c1=0c_{1}=0.u

D.5 Proof of Theorem 8

First notice that Lemma 3 and 4 follow by taking t=0t=0, c1=0c_{1}=0 in Theorem 9 and Lemma .

The variance result follows simply by taking c10c_{1}\to 0 in Theorem LABEL:thm:var_nonlinear. Solving the quadratic equations (134), we get

The claimed formula for the variance is obtained by using these expressions in Lemma 4.

By Lemma 14, and using the fact that Σ0=Ip{{\Sigma}}_{0}={{I}}_{p} when c1=0c_{1}=0, we get

with probability larger than 11/n21-1/n^{2}. Therefore, by Borel-Cantelli it is sufficient to estabilish the claim for B~X(β^λ)\widetilde{B}_{{X}}(\hat{\beta}_{\lambda}). By Corollary 6, for c1=0c_{1}=0, the empirical spectral distribution of Σ^\hat{\Sigma} converges almost surely (in the weak topology) to the Marchenko-Pastur law μMP\mu_{MP}. Hence (for (λi)ip(\lambda_{i})_{i\leq p} the eigenvalues of Σ^\hat{\Sigma})

Hence the asymptotic bias is the same as in the linear model (for random isotropic features). The claim hence follows by the results of Section 5.2. Alternatively, we may simply recall that μMP({0})=(1γ1)+\mu_{MP}(\{0\})=(1-\gamma^{-1})_{+} and use dominated convergence.

References