Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Introduction

Modern deep learning models involve a huge number of parameters. In many applications, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures are complex enough that they can be trained to interpolate the data even when the actual labels are replaced by entirely random ones.

Despite their enormous complexity, deep neural networks are frequently observed to generalize well in practice. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is commonly taken to be a proxy for overfitting, poor generalization (large gap between training and test error), and hence large test error. In an insightful series of papers, Belkin et al. (2018b, c, a) pointed out that these concepts are in general distinct, and interpolation does not contradict generalization. For example, recent work has investigated interpolation —via kernel ridge regression— in reproducing kernel Hilbert spaces (Liang et al., 2020; Ghorbani et al., 2019). While in low dimension a positive regularization is needed to achieve good interpolation, in certain high dimensional settings interpolation can be nearly optimal.

We study the model (2) in the proportional regime $p\asymp n$ , with a special focus on the overparametrized case $p>n$ . Our main contribution is to show that, by considering different choices of the features distribution $P_{x}$ , we can reproduce a number of statistically interesting phenomena that have emerged in the context of deep learning.

From a technical perspective, our main results are: Theorems 2 and 5, which assume the linear model $x_{i}=\Sigma^{1/2}z_{i}$ with $z_{i}$ a vector with independent coordinates; and Theorem 8, which assumes a nonlinear model $x_{i}=\varphi(Wz_{i})$ with $z_{i}\sim N(0,I_{d})$ . While the linear model has already a attracted significant amount of work (see Section 1.3 for an overview), Theorems 2 and 5 provide a more accurate non-asymptotic approximation of the prediction risk, as compared to available results in the literature.

The prediction risk depends on the geometry of the pair $(\Sigma,\beta)$ . We consider a few different choices for this geometry, which are broadly motivated by our objective to understand overparametrized models, and specialize our formulas to these special cases:

Isotropic features. This is the simplest case, in which $\Sigma=I_{p}$ and therefore —as we will see— the asymptotic risk depends on $\beta$ only through its norm $\|\beta\|_{2}$ . This simple model captures some interesting features of overparametrization, but misses others.

Nonlinear model. In all of the previous cases, the distribution of $x_{i}$ is of the form $x_{i}=\Sigma^{1/2}z_{i}$ where $z_{i}$ is a vector with independent coordinates. In order to test the generality of our results, we consider a model in which $x_{i}$ is obtained by passing $z_{i}\sim{N}(0,I_{d})$ through a one-layer neural net with random first layer weights.

We will summarize our results for these four examples in the next subsection.

A skeptical reader might ask what linear models have to do with neural networks. We emphasize that linear models provide more than a simple analogy, and a recent line of work outlines a concrete connection between the two settings (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). We will discuss this connection in Section 1.2.

We denote by $\gamma:=p/n\in(0,\infty)$ the overparametrization ratio. When $\gamma<1$ , we call the problem underparametrized, and when $\gamma>1$ , we call it overparametrized. Our most general results for the linear model (Theorem 2 and 5) apply to a non-asymptotic setting in which $n,p$ are finite, and provide a deterministic approximation of the risk with error bounds that are uniform in the distribution of the data. We will occasionally consider a simpler asymptotic scenario in which both $p$ and $n$ diverge with $p/n\to\gamma$ .

Our main results are twofold: $(i)$ We show that by taking $\Sigma\neq I_{p}$ , we can easily construct scenarios in which the minimum of the risk is achieved in the overparamertized regime $p>n$ ; $(ii)$ We show that these findings are robust to the details of the distribution of $(y_{i},x_{i})$ .

As a preliminary remark, note that in the underparametrized regime ( $\gamma<1$ ), the min-norm estimator coincides with the standard least squares estimator. Its risk is purely variance (there is no bias), and does not depend on $\beta,\Sigma$ (see Proposition 2). Interestingly, the asymptotic risk diverges as we approach the interpolation boundary (as $\gamma\to 1$ ).

In contrast, in the overparametrized regime ( $\gamma>1$ ), the risk is composed of both bias and varianceNote that in the overparametrized regime the bias is non-vanishing even in the interpolation limit $\lambda\to 0$ . The reason is that the set of interpolators is an affine space of dimension $p-n$ , and the min-norm criterion selects one specific interpolator which has norm smaller than $\beta$ ., and generally depends on $\beta,\Sigma$ (see Theorem 2).

We next highlight some concrete results for the four models discussed in the previous section (unless explicitly said, we refer to the min-norm estimator).

In either case, the risk approaches the null risk as $\gamma\to\infty$ , and achieves its global minimum in the underparametrized regime (see Section 3.2).

Optimal tuning of the ridge penalty can be achieved by leave-one-out cross-validation (see Theorem 7).

In this case $\Sigma\neq{{I}}$ and the risk depends on the geometry of $(\Sigma,\beta)$ , and in particular on how $\beta$ aligns with the eigenvectors of $\Sigma$ .

If the coefficients vector is equidistributed along the eigenvectors of $\Sigma$ , the behavior is qualitatively similar to the isotropic case. This situation arises, for instance, if $\beta$ is itself random with a spherical prior.

If $\beta$ is aligned with the top eigenvectors of $\Sigma$ , the situation is qualitatively different. As an example we obtain an explicit formula for the asymptotic risk in the latent space model discussed above, see Figure 5 for an illustration. We find that, for natural choices of the model parameters, the risk is monotone decreasing in the overparametrized regime, and reaches its global minimum as $\gamma\to\infty$ . This qualitative behavior matches the one observed for neural networks (see Section 5.4).

For the latent space model, we observe that, at large overparametrization, the minimum error is achieved as $\lambda\to 0$ , i.e. by min-norm interpolators (see Section 6.2, and Section 1.3 for related work).

From a technical viewpoint, analysis of the isotropic covariates model is straightforward and relies on standard random matrix theory results. However, we believe it provides useful insights.

In contrast, the results for general covariance and coefficients structure $(\Sigma,\beta)$ is technically novel. We discuss related work in Section 1.3. Our results for the nonlinear model are also novel. In this setting, we derive a new asymptotic result on resolvents of certain block matrices, which may be of independent interest (see Lemma 3).

We next discuss some intuition behind and implications of our results.

The shape of the asymptotic risk curve for min-norm least squares is, of course, controlled by its components: bias and variance. For fully specified models, the bias increases with $\gamma$ in the overparametrized regime, which is intuitive. When $p>n$ , the min-norm least squares estimate of $\beta$ is constrained to lie the row space of $X$ , the training feature matrix. This is a subspace of dimension $n$ lying in a feature space of dimension $p$ . Thus as $p$ increases, so does the bias, since this row space accounts for less and less of the ambient $p$ -dimensional feature space.

Recently, Belkin et al. (2018a) pointed out a fascinating empirical trend where, for popular methods like neural networks and random forests, we can see a second bias-variance tradeoff in the out-of-sample prediction risk beyond the interpolation limit. The risk curve here resembles a traditional U-shape curve before the interpolation limit, and then descends again beyond the interpolation limit, which these authors call “double descent”. A closely related phenomenon was found earlier by Spigler et al. (2018), who studied the “jamming transition” from underparametrized to overparametrized neural networks. Our results formally verify that this double descent phenomenon occurs even in the simple and fundamental case of least squares regression. The appearance of the second descent in the risk, past the interpolation boundary ( $\gamma=1$ ), is explained by the fact that the variance decreases as $\gamma$ grows, as discussed above.

In the misspecified case, the variance still decreases with $\gamma$ (for the same reasons), but interestingly, the bias can now also decrease with $\gamma$ , provided $\gamma$ is not too large (not too far past the interpolation boundary). The intuition here is that in a misspecified model, some part of the true regression function is always unaccounted for, and adding features generally improves our approximation capacity. As a consequence, the double descent phenomenon can be even more pronounced in the misspecified case (depending on the strength of the approximation bias), and that the risk can attain its global minimum past the interpolation limit.

The min-norm least squares estimator can be seen as the limit of ridge regression as the tuning parameter tends to zero. It is also the convergence point of gradient descent run on the least squares loss. We would not in general expect the best-predicting ridge solution to be at the end of its regularization path. Our results, comparing min-norm least squares to optimally-tuned ridge regression, show that (asymptotically) this is never the case, when $\beta$ is incoherent with respect to the eigenvectors of $\Sigma$ . This is for instance the case when $\Sigma=I_{p}$ , or $\beta$ is distributed according to a spherically symmetric prior. In contrast, Kobak et al. (2020) recently pointed out that —when $\beta$ is aligned with the leading eigenvectors of $\Sigma$ — min-norm regression can have optimal risk (i.e. the optimal regularization vanishes). We show that this is indeed the case in the latent space model mentioned above: this provides indeed an extremely simple example of a phenomenon that has been observed in the past for kernel methods Liang et al. (2020).

As mentioned above, early-stopped gradient descent is known to be closely connected to ridge regularization, see, e.g., Ali et al. (2019) which proves a tight coupling between the two (see Section 1.3 for further related work).

In practice, of course, we would not have access to the optimal tuning parameter for ridge (optimal stopping for gradient descent), and we would rely on, e.g., cross-validation (CV). Our theory shows that for ridge regression, CV tuning is asymptotically equivalent to optimal tuning (and we would expect the same results to carry over to gradient descent, but have not pursued this formally).

2 Connection to neural networks

As mentioned above, recent literature has pointed out a direct connection between linear models and more complex models such as neural networks (Jacot et al., 2018; Du et al., 2018b, a; Allen-Zhu et al., 2018; Chizat and Bach, 2018b). Our work contributes to this line of work and in particular points at the important role played by universality, a core concept in random matrix theory Tao (2012).

This model is still nonlinear in the input $z$ , but is linear in the parameters $\beta$ . We are therefore led to consider a linear regression problem, with random features $x_{i}=\nabla_{\theta}f(z_{i};\theta_{0})$ , $i=1,\ldots,n$ , of high-dimensionality ( $p$ much greater than $n$ ). Notice that the features are random because of the initialization $\theta_{0}$ . Further, since $p>n$ , many vectors $\beta$ give rise to a model that interpolates the data.

A different approach relies on a universality hypothesis. Recall that $\{(y_{i},z_{i})\}_{i\leq n}$ have joint distribution $P_{y,z}$ , and $x_{i}=\nabla_{\theta}f(z_{i};\theta_{0})$ and assume these quantities to have zero meanWe further note that the assumption of zero mean is mainly for convenience, as a non-zero mean of the covariates can be effectively compensated by an intercept.. Universality posits that the asymptotic risk of ridge regression (or min-norm interpolation) does not change if we replace the joint distribution $(x_{i},z_{i})$ by a Gaussian with the same covariance.

While universality is not expected to hold for any distribution of the data $(y_{i},z_{i})$ , and for any function $f$ , it has been confirmed in a number of examples of interest (see Section 1.3). In fact, our analysis of the nonlinear model in Section 8 provides a concrete example in which universality can be confirmed rigorously.

Under the universality hypothesis, we can assume $(z_{i},x_{i})$ to be jointly Gaussian. If the true joint distribution of the data $(y_{i},z_{i})$ is also Gaussian, it is sufficient to study the data distribution

where $\Sigma,\sigma$ are chosen to match the second order statistics of the original model. This is a special case of our ‘linear model’ (2). Summarizing, replacing the original nonlinear model $y_{i}=f(z_{i};\theta)$ , with the linear model rests on two key steps. The first step is the linearization in Eq. (3): this has been proved to be a good approximation under certain ‘lazy training’ schemes Chizat and Bach (2018b). The second step is the universality hypothesis, which is quite common (albeit for simpler models) in random matrix theory Tao (2012).

Finally, although we believe our our results are relevant to understanding overparametrized neural networks, the concrete correspondence outlined above only holds in a certain ‘lazy training’ regime, in which network weights do not change much during training, More generally, in a neural network, the feature representation and the regression function or classifier are learned simultaneously. In both our linear and nonlinear model settings, the features $X$ are not learned, but observed. Learning $X$ could significantly change some aspects of the behavior of an interpolator. (See for instance Chapter 9 of Goodfellow et al. (2016), and also Chizat and Bach (2018b); Zhang et al. (2019), which emphasize the importance of learning the representation.)

3 Related work

Ridge regression with random features has been studied in the past. Dicker (2016) considers a model in which the covariates are isotropic Gaussian $x_{i}\sim N(0,I_{p})$ and computes the asymptotic risk of ridge regression in the proportional asymptotics $p,n\to\infty$ , with $p/n\to\gamma\in(0,\infty)$ . Dobriban and Wager (2018) generalize these results to $x_{i}=\Sigma^{1/2}z_{i}$ , where $z_{i}$ has independent entries with bounded $12$ -th moment.

Recently, Advani and Saxe (2017) study the effect of early stopping and ridge regularization in a model with isotropic Gaussian covariates $x_{i}\sim N(0,I_{p})$ , again focusing on the proportional asymptotics $p,n\to\infty$ , with $p/n\to\gamma\in(0,\infty)$ . They show that this simple model reproduces several phenomena observed in neural networks training. The same model is reconsidered in concurrent work by Belkin et al. (2019), who obtain exact results for the expected risk of min-norm regression, relying on the jointly Gaussian distribution of $(y_{i},x_{i})$ . We contribute to this line of work by extending the analysis to general covariance structures and to misspecified models. As we will see, these generalizations allow to produce examples for which the global minimum of the risk is achieved in the overparametrized regime $\gamma>1$ .

The importance of the relation between the coefficient vector $\beta$ and the eigenvectors of $\Sigma$ was emphasized by Kobak et al. (2020) and Bartlett et al. (2020). These papers point out —under different asymptotic settings— that $\lambda=0+$ (i.e., min-norm regression) can be optimal or nearly optimal. After a preprint of this paper appeared, Wu and Xu (2020) and Richards et al. (2020) generalized our earlier results to cover the case in which $\beta$ is potentially aligned with $\Sigma$ . We review in further detail these important generalizations in Section 4. We contribute to this line of work by obtaining non-asymptotic approximations for the risk, with explicit and nearly optimal error bounds. These hold under weaker assumptions on the geometry of $(\Sigma,\beta)$ than the results of Wu and Xu (2020); Richards et al. (2020).

For the nonlinear model, the random matrix theory literature is much sparser, and focuses on the related model of kernel random matrices, namely, symmetric matrices of the form $K_{ij}=\varphi(z_{i}^{T}z_{j})$ . El Karoui (2010) studied the spectrum of such matrices in a regime in which $\varphi$ can be approximated by a linear function (for $i\neq j$ ) and hence the spectrum converges to a rescaled Marchenko-Pastur law. This approximation does not hold for the regime of interest here, which was studied instead by Cheng and Singer (2013) (who determined the limiting spectral distribution) and Fan and Montanari (2015) (who characterized the extreme eigenvalues). The resulting eigenvalue distribution is the free convolution of a semicircle law and a Marchenko-Pastur law. In the current paper, we must consider asymmetric (rectangular) matrices $x_{ij}=\varphi(w_{j}^{T}z_{i})$ , whose singular value distribution was recently computed by Pennington and Worah (2017), using the moment method. Unfortunately, the prediction variance depends on both the singular values and vectors of this matrix. In order to address this issue, we apply the leave-one out method of Cheng and Singer (2013) to compute the asymptotics of the resolvent of a suitably extended matrix. We then extract the information of interest from this matrix. After appearance of a preprint of this paper, Mei and Montanari (2019) extended the results presented here, to obtain a complete characterization of the risk for the non-linear random features model.

Let us finally mention that the universality (or ‘invariance’) phenomenon is quite common in random matrix theory Tao (2012). In the context of kernel inner product random matrices, it appears (somewhat implicitly) in Cheng and Singer (2013) and (more explicitly) in Fan and Montanari (2015). After a first appearance of this manuscript, universality has been investigated in the context of neural networks in several papers Mei and Montanari (2019); Montanari et al. (2019); Gerace et al. (2020); Goldt et al. (2020); Hu and Lu (2020); Adlam and Pennington (2020).

4 Outline

Section 2 provides important background. Sections 3–7 consider the linear model case, focusing on isotropic features, correlated features, misspecified models, ridge regularization, and cross-validation, respectively. Section 8 covers the nonlinear model case. Nearly all proofs are deferred until the appendix.

Preliminaries

We describe our setup and gather a number of important preliminary results.

Consider a test point $x_{0}\sim P_{x}$ , independent of the training data. For an estimator $\hat{\beta}$ (a function of the training data $X,y$ ), we define its out-of-sample prediction risk (or simply, risk) as

where $\|x\|_{\Sigma}^{2}=x^{T}\Sigma x$ . Note that our definition of risk is conditional on $X$ (as emphasized by our notation $R_{X}$ ). Note also that we have the bias-variance decomposition

2 Ridgeless least squares

This can be equivalently written as $\hat{\beta}=(X^{T}X)^{+}X^{T}y$ , where $(X^{T}X)^{+}$ is the pseudoinverse of $X^{T}X$ . An alternative name for (4) is the “ridgeless” least squares estimator, motivated by the fact that $\hat{\beta}=\lim_{\lambda\to 0^{+}}\hat{\beta}_{\lambda}$ , where $\hat{\beta}_{\lambda}$ denotes the ridge regression estimator:

or, equivalently, $\hat{\beta}_{\lambda}=(X^{T}X+n\lambda I)^{-1}X^{T}y$ .

When $X$ has full column rank the min-norm least squares estimator reduces to $\hat{\beta}=(X^{T}X)^{-1}X^{T}y$ , the usual least squares estimator. When $X$ has rank $n$ , importantly, this estimator interpolates the training data: $y_{i}=x_{i}^{T}\hat{\beta}$ , for $i=1,\ldots,n$ .

Lastly, the following is a well-known fact that connects the min-norm least squares solution to gradient descent (as referenced in the introduction).

Initialize $\beta^{(0)}=0$ , and consider running gradient descent on the least squares loss, yielding iterates

where we take $0<t\leq 1/\lambda_{\max}(X^{T}X)$ (and $\lambda_{\max}(X^{T}X)$ is the largest eigenvalue of $X^{T}X$ ). Then $\lim_{k\to\infty}\beta^{(k)}=\hat{\beta}$ , the min-norm least squares solution in (4).

3 Bias and variance

We recall expressions for the bias and variance of the min-norm least squares estimator, which are standard.

Under the model (1), (2), the min-norm least squares estimator (4) has bias and variance

where $\hat{\Sigma}=X^{T}X/n$ is the (uncentered) sample covariance of $X$ , and $\Pi=I-\hat{\Sigma}^{+}\hat{\Sigma}$ is the projection onto the null space of $X$ .

4 Underparametrized asymptotics

We consider an asymptotic setup where $n,p\to\infty$ , in such a way that $p/n\to\gamma\in(0,\infty)$ . Recall that when $\gamma<1$ , we call the problem underparametrized; when $\gamma>1$ , we call it overparametrized. Here, we recall the risk of the min-norm least squares estimator in the underparametrized case. The rest of this paper focuses on the overparametrized case.

The following is a known result in random matrix theory, and can be found in Chapter 6 of Serdobolskii (2007). It can also be found in the wireless communications literature, see Chapter 4 of Tulino and Verdu (2004).

Assume the model (1), (2), and assume $x\sim P_{x}$ is of the form $x=\Sigma^{1/2}z$ , where $z$ is a random vector with i.i.d. entries that have zero mean, unit variance, and a finite 4th moment, and $\Sigma$ is a (sequence of) deterministic positive definite matrix, such that $\lambda_{\min}(\Sigma)\geq c>0$ , for all $n,p$ and a constant $c$ (here $\lambda_{\min}(\Sigma)$ is the smallest eigenvalue of $\Sigma$ ). Then as $n,p\to\infty$ , such that $p/n\to\gamma<1$ , the risk of the least squares estimator (4) satisfies, almost surely,

As it can be seen from the last proposition, in the underparametrized case the risk is just variance. In contrast, in the overparametrized case, the bias $B_{X}(\hat{\beta};\beta)=\beta^{T}\Pi\Sigma\Pi\beta$ is nonzero, because $\Pi$ is. This will be the focus of the next sections.

Isotropic features

We begin by considering the simpler case in which $\Sigma=I$ . In this case the limiting bias is relatively straightforward to compute and depends on $\beta$ only through $\|\beta\|_{2}^{2}$ . In Section 4, we generalize our analysis and study the dependence of the bias on the geometry of $\Sigma$ and $\beta$ .

Choosing $U$ so that $U\beta=re_{i}$ , the $i$ th standard basis vector, then averaging over $i=1,\ldots,p$ , yields

It is possible to show that, $B_{X}(\hat{\beta};\beta)$ concentrates around its expectation ant therefore $B_{X}(\hat{\beta};\beta)\to r^{2}(1-1/\gamma)$ , almost surely. This is stated formally in the next section.

2 Limiting risk

As the next result shows, the independence of the risk on $\beta$ is still true outside of the Gaussian case, provided the features are isotropic. The next result can be proved as a corollary of the more general Theorem 3 below. We give a simpler self-contained proof using a theorem of Rubio and Mestre (2011) in Appendix C.2.

Assume the model (1), (2), where $x\sim P_{x}$ has i.i.d. entries with zero mean, unit variance, and a finite moment of order $4+\eta$ , for some $\eta>0$ . Also assume that $\|\beta\|_{2}^{2}=r^{2}$ for all $n,p$ . Then for the min-norm least squares estimator $\hat{\beta}$ in (4), as $n,p\to\infty$ , such that $p/n\to\gamma\in(0,\infty)$ , it holds almost surely that

Hence, summarizing with Proposition 2, we have

We can see that the limiting norm, as a function of $\gamma$ , has a somewhat similar profile to the limiting risk in (8): it is monotonically increasing on $(0,1)$ , diverges at the interpolation boundary, and is monotonically decreasing on $(1,\infty)$ .

Correlated features

We broaden the scope of our analysis from the last section, where we examined isotropic features. In this section, we take $x\sim P_{x}$ to be of the form $x=\Sigma^{1/2}z$ , where $z$ is a random vector with independent entries that have zero mean and unit variance, and $\Sigma$ is arbitrary (but still deterministic and positive definite).

We next state our assumptions about the data distribution: our results will be uniform with respect to the (large) constants $M$ , $\{C_{k}\}_{k\geq 2}$ appearing in this assumption.

The covariates vector $x\sim P_{x}$ is of the form $x=\Sigma^{1/2}z$ , where, defining $\widehat{H}_{n}$ as per Eq. (9), we have

$s_{1}=\|\Sigma\|_{op}\leq M$ , $\int s^{-1}d\widehat{H}_{n}(s)<M$ .

$|1-(p/n)|\geq 1/M$ , $1/M\leq p/n\leq M$ .

Condition $(a)$ bounds the tail probabilities on the covariates. Requiring finite moment of all order is useful to get strong bounds on the deviations of the risk from its predicted value. As discussed below, bounds on the first few moments are sufficient if we are satisfied in weaker probability bounds.

Conditions $(b)$ requires the eigenvalues of $\Sigma$ to be bounded, and not to accumulateThe latter assumption could have been further relaxed, by requiring only $p_{+}$ of the $p$ eigenvalues to be non-vanishing and to satisfy the other conditions. This would require to redefine $\gamma$ as $p_{+}/n$ . near . For the analysis of min-norm interpolation, we will add the additional assumption that the minimum eigenvalue of $\Sigma$ is bounded away from zero. However condition $(b)$ is sufficient for the analysis of ridge regression in Section 6.

Finally, as our statements are non-asymptotic, we do not assume $p/n$ to converge to to a value. However condition $(c)$ requires $p/n$ to be bounded and bounded away from the interpolation threshold $p/n=1$ .

We then define the predicted bias and variance by

Note that evaluating $\mathscrsfs{B}(H,G,\gamma)$ , $\mathscrsfs{V}(H,\gamma)$ numerically is relatively straightforward, with the most complex part being the solution of Eq. (10). The next theorem establishes that —under suitable technical conditions— the functions $\mathscrsfs{B}$ , $\mathscrsfs{V}$ characterize the test error. Similar theorems were proved in Wu and Xu (2020); Richards et al. (2020), which generalized an earlier version of this manuscript to account for the geometry of $(\Sigma,\beta)$ .

Assume the data model (1), (2) and that the covariates distribution satisfies Assumption 1. Further assume $s_{p}=\lambda_{\min}(\Sigma)>1/M$ . Define $\gamma=p/n$ and let $\hat{\beta}$ be the min-norm least squares estimator in Eq. (4).

Then for any constants $D>0$ (arbitrarily large) there exist $C=C(M,D)$ such that, with probability at least $1-Cn^{-D}$ the following hold

where $\mathscrsfs{B}$ and $\mathscrsfs{V}$ are given in Definition 2.

The proof of this theorem is deferred to Section B.

The order of the error bound in Eqs. (14), (15) is not optimal: a central limit theorem heuristics suggests the deterministic approximation to be accurate up to an error of order $n^{-1/2}$ . Indeed, we are able to establish the correct order in the case of ridge regression, see Theorem 5.

Note that Theorem 2 establishes deterministic approximations for the bias and variance, that are valid at finite $n,p$ . The overparametrization ratio $\gamma=p/n$ is a non-asymptotic quantity, and the error bounds are uniform, i.e. depend uniquely on the constant $M$ . This is to be contrasted with the asymptotic setting of Richards et al. (2020); Wu and Xu (2020). Both of these papers assume a sequence of regression problems with $n,p\to\infty$ , $p/n\to\gamma$ , and obtain an asymptotically exact expression for the risk.

Technically, Richards et al. (2020); Wu and Xu (2020) apply asymptotic random matrix theory results, such as Ledoit and Peche (2011), while we have to take a longer detour to exploit non-asymptotic results established inKnowles and Yin (2017). We believe that the non-asymptotic approach provides more concrete and accurate statements.

Theorem 2 also implies asymptotic predictions under minimal assumptions. In particular, if the two probability measures $\widehat{H}_{n}$ , $\widehat{G}_{n}$ converge weakly to probability measures $H$ , $G$ on $[0,\infty)$ , then we obtain $B_{X}(\hat{\beta};\beta)/\|\beta\|^{2}\to\mathscrsfs{B}(H,G,\gamma)$ , $V_{X}(\hat{\beta};\beta)\to\mathscrsfs{V}(H,\gamma)$ almost surely.

As pointed out above the condition $\widehat{H}_{n}\Rightarrow H$ , $\widehat{G}_{n}\Rightarrow G$ (here $\Rightarrow$ denotes weak convergence) is strictly weaker than the condition assumed in Wu and Xu (2020) to establish asymptotic results. Further, we require weaker moment conditions.

In the next sections we illustrate the role of the geometry of $\beta,\Sigma$ by considering two models for which $\widehat{H}_{n}\Rightarrow H$ , $\widehat{G}_{n}\Rightarrow G$ as $n,p\to\infty$ . First we consider the case $G=H$ , which we refer to as ‘equidistributed:’ the components of $\beta$ are roughly equally distributed along the eigenvectors of $\Sigma$ . In this case there is no special relation between $\beta$ and $\Sigma$ .

As a further application, we consider a latent space model in which $\beta$ is aligned with the top eigenvectors of $\Sigma$ . This can be regarded as a misspecified model, and is therefore presented in Section 5.4 below.

2 Equidistributed coefficients

In this section we assume $G=H$ , $\|\beta\|_{2}\to r$ , and $p/n\to\gamma$ . One way to generate $\beta$ satisfying this condition is to draw it uniformly at random on the $p$ -dimensional sphere of radius $\|\beta\|_{2}=r$ . In this case the conditions of Theorem 2 (or Theorem 3), hold with $\widehat{G}_{n}\to G=H$ .

Under the assumptions of Theorem 3, further assume $G=H$ , $\|\beta\|^{2}\to r^{2}$ . Then for $n,p\to\infty$ , with $p/n\to\gamma>1$ , almost surely

As a special case, we can revisit the isotropic case $\Sigma=I$ , which results in $dH=\delta_{1}$ . In this case $c_{0}(H,\gamma)=\gamma(\gamma-1)$ yielding immediately $\mathscrsfs{B}_{\mbox{\tiny\rm equi}}(H,\gamma)=1-\gamma^{-1}$ and $\mathscrsfs{V}_{\mbox{\tiny\rm equi}}(H,\gamma)=1/(\gamma-1)$ .

Under the assumptions of Theorem 3, further assume $\|\beta\|^{2}\to r^{2}$ , and let $c_{0}=c_{0}(H,\gamma)$ be defined as there. Then as $n,p\to\infty$ , such that $p/n\to\gamma$ , the min-norm least squares estimator (4) satisfies, almost surely,

Misspecified model

In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting provides another potential motivation for interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation (increasing $\gamma$ past 1) can now decrease both bias and variance.

As such, the misspecified model setting also yields further interesting asymptotic comparisons between the $\gamma<1$ and $\gamma>1$ regimes. Recall the isotropic features model of Section 3.2: the risk function in (8) is globally minimized at $\gamma=0$ . This is a consequence of the fact that, in a well-specified linear model at $\gamma=0$ there is no bias and no variance, and hence no risk. In a misspecified model, we will see that the story can be quite different, and the asymptotic risk can actually attain its global minimum on $(1,\infty)$ .

Consider, instead of (1), (2), a data model

Given a test point $(x_{0},w_{0})\sim P_{x,w}$ , and an estimator $\hat{\beta}$ (fit using $X,y$ only, and not $W$ ), we define its out-of-sample prediction risk as

Note that this definition is conditional on $X$ , and we are integrating over the randomness not only in $\epsilon$ (the training errors), but in the unobserved features $W$ , as well. The next lemma decomposes this notion of risk in a useful way.

Under the misspecified model (19), (20), for any estimator $\hat{\beta}$ , we have

The first term $R^{*}_{X}(\hat{\beta};\beta,\theta)$ in the decomposition in Lemma 2 is precisely the risk that we studied previously in the well-specified case, except that the response distribution has changed (due to the presence of the middle term in (20)). We call the second term $M(\beta,\theta)$ in Lemma 2 the misspecification bias.

In order to discuss some qualitative features, we focus on the simplest possible model by assuming independent covariates.

2 Isotropic features

Here, we make the additional simplifying assumption that $(x,w)\sim P_{x,w}$ has i.i.d. entries with unit variance, which implies that $\Sigma=I$ . (The case of independent features but general covariances $\Sigma_{x},\Sigma_{w}$ is similar, and we omit the details.) Therefore, we may write the response distribution in (20) as

where $\delta_{i}$ is independent of $x_{i}$ , having mean zero and variance $\sigma^{2}+\|\theta\|_{2}^{2}$ , for $i=1,\ldots,n$ . Denote the total signal by $r^{2}=\|\beta\|_{2}^{2}+\|\theta\|_{2}^{2}$ , and the fraction of the signal captured by the observed features by $\kappa=\|\beta\|_{2}^{2}/r^{2}$ . Then $R_{X}^{*}(\hat{\beta};\beta,\theta)$ behaves exactly as we computed previously, for isotropic features in the well-specified setting (Theorem 2 for $\gamma<1$ , and Theorem 1 for $\gamma>1$ ), after we make the substitutions:

Furthermore, we can easily calculate the misspecification bias:

Putting these results together leads to the next conclusion.

Assume the misspecified model (19), (20), and assume $(x,w)\sim P_{x,w}$ has i.i.d. entries with zero mean, unit variance, and a finite moment of order $8+\eta$ , for some $\eta>0$ . Also assume that $\|\beta\|_{2}^{2}+\|\theta\|_{2}^{2}=r^{2}$ and $\|\beta\|_{2}^{2}/r^{2}=\kappa$ for all $n,p$ . Then for the min-norm least squares estimator $\hat{\beta}$ in (4), as $n,p\to\infty$ , with $p/n\to\gamma$ , it holds almost surely that

We remark that, in the independence setting considered in Theorem 4, the dimension $q$ of the unobserved feature space does not play any role: we may equally well take $q=\infty$ for all $n,p$ (i.e., infinitely many unobserved features).

3 Polynomial approximation bias

Since adding features should generally improve our approximation capacity, it is reasonable to model $\kappa=\kappa(\gamma)$ as an increasing function of $\gamma$ . To get an idea of the possible shapes taken by the asymptotic risk curve from Theorem 4, we consider the example of a polynomial decay for the approximation bias,

for some $a>0$ . In this case, the limiting risk in the isotropic setting, from Theorem 4, becomes

4 A latent space model

Here $(\xi_{i})_{i\leq n}$ , $(u_{ij})_{i\leq n,j\leq p}$ are noise variables that are mutually independent, and independent of $z_{i}$ , with $\xi_{i}\sim{N}(0,\sigma_{\xi}^{2})$ , $u_{ij}\sim{N}(0,1)$ . The features matrix takes the form $X=ZW^{T}+U$ and therefore, for $p>n$ , min-norm regression amounts to

Notice that the signal-to-noise ratio in the features $x_{ij}$ is equal to $\|w_{i}\|_{2}^{2}$ . To be definite, we’ll fix $\|w_{i}\|_{2}=1$ , which implies in particular $\|W\|_{F}^{2}=p$ . In order to simplify our calculations, we assume all the non-zero singular values of $W$ to be equal, whence $W^{T}W=(p/d)I_{d}$ . It is not hard to check that this model satisfies the assumptions of Theorem 2, with

Using Theorem 2, we get the following explicit expressions.

Consider the latent space model described above, and assume $d/p\to\psi\in(1,\infty)$ , $p/n\to\gamma\in(0,\infty)$ . Then, almost surely

where $\sigma^{2}=\sigma_{\xi}^{2}+\psi r_{\theta}^{2}/(1+\psi)$ , and $c_{0}=c_{0}(\psi,\gamma)\geq 0$ is the unique non-negative solution of the following second order equation

Figures 5 and 6 illustrate this corollary by comparing analytical predictions to numerical simulations. We observe that the prediction risk is monotone decreasing in the overparametrization ratio for $\gamma>1$ , and reaches its global minimum asymptotically as $\gamma\to\infty$ (after $p,n,d\to\infty$ ). To understand why this happens, notice that each feature vector $x_{i}$ can be viewed as a noisy measurement of the latent covariates $z_{i}$ . If the noise $u_{ij}$ was absent, then performing min-norm regression with respect to $(x_{i})_{i\leq n}$ amounts to minimize would be equivalent to min-norm regression with respect to $(z_{i})_{i\leq n}$ . To see this, consider again Eq. (25). If we drop the noise $U$ , we are minimizing $\|b\|_{2}$ subject to $Z(W^{T}b)=0$ , and the regression function is $\hat{f}(z)=x^{T}\hat{\beta}=z^{T}(W^{T}\hat{\beta})$ . Since $W$ is orthogonal, this is equivalent to computing $\hat{\theta}=\arg\min\{\|t\|_{2}:$ subject to $Zt=0\}$ .

In presence of noise $u_{ij}$ , the latent features cannot be estimated exactly. However as $p$ gets larger, the noise is effectively ‘averaged out’ and we approach the idealized situation in which the $z_{i}$ ’s are observed.

Ridge regularization

We generalize the formulas of Section 4 to non-vanishing ridge regularization. We work under the same assumptions of that section. In particular, recall that $\widehat{H}_{n}(s)=p^{-1}\sum_{i=1}^{p}1_{\{s\geq s_{i}\}}$ is the empirical distribution of eigenvalue of $\Sigma$ , and $\widehat{G}_{n}(s)=\sum_{i=1}^{p}\langle\beta,v_{i}\rangle^{2}1_{\{s\geq s_{i}\}}/\|\beta\|^{2}$ the same empirical distribution, reweighted by the projection of $\beta$ onto the eigenvector $v_{i}$ of the covariance $\Sigma$ . (Recall the eigenvalue decomposition $\Sigma=\sum_{i=1}^{p}s_{i}v_{i}v_{i}^{T}$ .)

Further define $m_{n,1}(z)=m_{n,1}(z;\widehat{H}_{n},\gamma)$ via

These definitions are extended analytically to ${\operatorname{Im}}(z)=0$ whenever possible. We then define the predicted bias and variance by

We next state our deterministic approximation of the risk.

Let $M^{-1}\leq p/n\leq M$ , and Assumption 1 hold. Further assume $\lambda\vee\lambda_{\min}(\Sigma)>1/M$ and $n^{-2/3+1/M}<\lambda<M$ . Let $\hat{\beta}_{\lambda}$ be the ridge estimator of Eq. (5).

Then for any constants $D>0$ (arbitrarily large) and ${\varepsilon}>0$ (arbitrarily small), there exist $C=C(M,D)$ such that, with probability at least $1-Cn^{-D}$ the following hold

where $\mathscrsfs{B}$ and $\mathscrsfs{V}$ are given in Definition 2.

The proof of this theorem is deferred to Appendix A. As for theorem 2, similar results were proved in Wu and Xu (2020); Richards et al. (2020), subsequently to a first version of this manuscript that only focused on random $\beta$ . The same comparison of Remark 1 applies here.

In particular, Theorem 5 establishes non-asymptotic deterministic approximations for the bias $B_{X}(\hat{\beta}_{\lambda};\beta)$ and variance $V_{X}(\hat{\beta};\beta)$ . The error terms are uniform over the covariance matrix, and have nearly optimal dependence upon the sample size $n$ . Indeed, a central-limit theorem heuristics suggests indeed fluctuations of order $n^{-1/2}$ .

As for the case of min-norm regression, it is easy to extract asymptotic prediction under minimal assumptions.

As a special case, we can consider the simple isotropic model that was already studied in Section 3. Very similar (though not identical) results can be found in Dicker (2016); Dobriban and Wager (2018).

Assume the conditions of Theorem 1 (well-specified model, isotropic features). Then for ridge regression estimator in (5) as $n,p\to\infty$ , such that $p/n\to\gamma\in(0,\infty)$ , it holds almost surely that

Here $m(z)$ is given by Eq. (36) which in this case has the explicit solution $m(z)=[1-\gamma-z-\sqrt{(1-\gamma-z)^{2}-4\gamma z}]/(2\gamma z)$ .

Furthermore, the limiting ridge risk is minimized at $\lambda^{*}=\sigma^{2}\gamma/r^{2}$ , in which case we have the simpler expression $R_{X}(\hat{\beta}_{\lambda};\beta)\to\sigma^{2}\gamma m(-\lambda^{*})$ .

Under the conditions of Theorem 4 (misspecified model, isotropic features), the limiting risk of ridge regression is as in the first two displays, and the optimal limiting risk is as in the third, after we make the substitutions in (21) and add $r^{2}(1-\kappa)$ , to each expression.

It is easy to recover the formulas in Theorem 1 as a limiting case of Eq. (43), by using the $z\to 0$ asymptotics $m(z)=(1-\gamma)^{-1}+O(z)$ for $\gamma<1$ and $m(z)=(1-\gamma)^{-1}+O(z)$ for $\gamma<1$ and $m(z)=-(\gamma-1)/(\gamma z)+[(\gamma-1)\gamma]^{-1}+O(z)$ for $\gamma>1$ .

Figures 8 and 8 compare the risk curves of min-norm least squares to those from optimally-tuned ridge regression, in the well-specified and misspecified settings, respectively. There are two important points to make. The first is that optimally-tuned ridge regression is seen to have strictly better asymptotic risk throughout, regardless of $r^{2}$ , $\gamma$ , $\kappa$ . This should not be a surprise, as by definition optimal tuning should yield better risk than min-norm least squares, which is the special case given by $\lambda\to 0^{+}$ .

As we will see, the behavior is rather different in the latent space model.

2 Latent space model

As a special application, we consider the latent space model of Section 5.4. It is immediate to specialize Eqs. (38) and (39) to this case. We omit giving giving explicit formulas for brevity, and instead plot the resulting curves for the prediction risk (test error).

In Figure 10 we plot the risk as a function of the overparametrization ration $\gamma=p/n$ for several values of the regularization parameter $\lambda$ (included the ridgeless limit $\lambda\to 0$ ). The setting here is analogous to the one of Fig. 5. We observe several interesting phenomena:

Independently of $\lambda$ in the probed range, the risk is minimized at large overparametrization $\gamma\gg 1$ .

As expected, the divergence of the risk at the interpolation threshold $\gamma=1$ is smoothed out by regularization, and the risk becomes a monotone decreasing function of $\gamma$ when $\lambda$ is large enough. Crucially, the optimal amount of regularization (corresponding to the lower envelope of these curves) results in a monotonically decreasing risk.

At large overparametrization, the optimal value of the regularization parameter is $\lambda\to 0$ .

We confirm the last finding in Fig. 10 which plots the risk as a function of $\lambda$ : the optimal regularization is $\lambda\to 0$ . Notice that this is the case despite the fact that the observations are noisy, namely $\sigma_{\xi}>0$ strictly.

The optimality of $\lambda\to 0$ has a known intuitive explanation that is worth recalling here. Recall that ridge predictor at point $x_{0}$ takes the form

where we introduced the kernel matrix $K(X,X)=XX^{T}/n$ , and the vector $K(x_{0},X)=x_{0}^{T}X/n$ . In the present case $X$ can ve viewed as a noisy version of $\overline{X}=ZW^{T}$ , namely Consider the case in which the covariates $X$ contain noise, as is our case, $X=\overline{X}+\eta U$ where $u_{ij}\sim N(0,1)$ . (While we are considering $\eta=1$ , it is instructive to regard the noise standard deviation as a parameter.) Then, we might expect $K(X,X)\approx K(\overline{X},\overline{X})+\eta^{2}UU^{T}\approx K(\overline{X},\overline{X})+\eta^{2}I_{n}$ . If this approximations hold, the noise acts as an extra ridge term, which can be sufficient to regularize the problem.

To the best of our knowledge, this argument was first presented by Webb (1994) and, more explicitly, by Bishop (1995). Recently, Kobak et al. (2020) elucidated its role in linear regression, establishing several of its consequences. In a parallel line of work, a closely related idea has recently emerged in the analysis of kernel methods in high dimension El Karoui (2010); Liang et al. (2020).

Cross-validation

We analyze the effect of using cross-validation to choose the tuning parameter in ridge regression. In short, we find that choosing the ridge tuning parameter to minimize the leave-one-out cross-validation error leads to the same asymptotic risk as the optimally-tuned ridge estimator. The next subsection gives the details; the following subsection presents a new “shortcut formula” for leave-one-out cross-validation in the overparametrized regime, for min-norm least squares, akin to the well-known formula for underparametrized least squares and ridge regression. We refer to Craven and Wahba (1979); Golub et al. (1979a) for background on CV and GCV, and to Arlot and Celisse (2010) for a more recent review.

Recomputing the leave-one-out predictors $\hat{f}^{-i}_{\lambda}$ , $i=1,\ldots,n$ can be burdensome, especially for large $n$ . Importantly, there is a well-known “shortcut formula” that allows us to express the leave-one-out CV error (45) as a weighted average of the training errors,

where $S_{\lambda}=X(X^{T}X+n\lambda)^{-1}X^{T}$ is the ridge smoother matrix. This identity is an immediate consequence of the Sherman-Morrison-Woodbury formula. In the next subsection, we give an extension to the case $\lambda=0$ and ${\rm rank}(X)=n$ , i.e., to min-norm least squares.

The next result shows that, for isotropic features, the CV error of a ridge estimator converges almost surely to its prediction error. The focus on isotropic features is only for simplicity: a more general analysis is possible but is not pursued here. The proof, given in Appendix C.6, relies on the shortcut formula (46). In the proof, we actually first analyze generalized cross-validation (GCV), which turns out to be somewhat of an easier calculation (see the proof for details on the precise form of GCV), and then relate leave-one-out CV to GCV.

with the right-hand side above being the asymptotic risk of optimally-tuned ridge regression. Further, the exact same set of results holds for GCV.

Similar results were obtained for various linear smoothers in Li (1986, 1987), for the lasso in the high-dimensional (proportional) asymptotics in Miolane and Montanari (2018), and for general smooth penalized estimators in Xu et al. (2019). The latter paper covers ridge regression as a special case, and gives more precise results (convergence rates), but assumes more restrictive conditions.

The key implication of Theorem 7, in the context of the current paper and its central focus, is that the CV-tuned or GCV-tuned ridge estimator has the same asymptotic performance as the optimally-tuned ridge estimator. In other words, the ridge curves in Figures 1, 8, and 8 can be alternatively viewed as the asymptotic risk of ridge under CV tuning.

2 Shortcut formula for ridgeless CV

We extend the leave-one-out CV shortcut formula (46) to work when $p>n$ and $\lambda=0+$ , i.e., for min-norm least squares. In this case, both the numerator and denominator are zero in each summand of (46). To circumvent this, we can use the so-called “kernel trick” to rewrite the ridge regression solution (5) with $\lambda>0$ as

Using this expression, the shortcut formula for leave-one-out CV in (46) can be rewritten as

Taking $\lambda\to 0^{+}$ yields the shortcut formula for leave-one-out CV in min-norm least squares (assuming without a loss of generality that ${\rm rank}(X)=n$ ),

In fact, the exact same arguments given here still apply when we replace $XX^{T}$ by a positive definite kernel matrix $K$ (i.e., $K_{ij}=k(x_{i},x_{j})$ for each $i,j=1,\ldots,n$ , where $k$ is a positive definite kernel function), in which case (48) gives a shortcut formula for leave-one-out CV in kernel ridgeless regression (the limit in kernel ridge regression as $\lambda\to 0^{+}$ ). We also remark that, when we include an unpenalized intercept in the model, in either the linear or kernelized setting, the shortcut formula (48) still applies with $XX^{T}$ or $K$ replaced by their doubly-centered (row- and column-centered) versions, and the matrix inverses replaced by pseudoinverses.

Nonlinear model

Our analysis in the previous sections assumed $x_{i}=\Sigma^{1/2}z_{i}$ , with $z_{i}$ a vector with i.i.d. entries. As discussed in the introduction, we expect our results to have implications on certain neural networks in the lazy training regime, via the universality hypothesis. Roughly speaking, if covariates are obtained by applying the featurization map $z_{i}\mapsto x_{i}=\nabla f(z_{i};\theta_{0})$ to data $z_{i}$ , the asymptotic behavior of ridge regression is expected to be close to the one of an equivalent Gaussian model, whose second-order statistics match the ones of $\nabla f(z_{i};\theta_{0})$ .

We focus on a simple case in which the second order statistics match the ones of the isotropic model.

Notice that, conditionally on $W$ , the vectors $x_{i}=\varphi(Wz_{i})$ , $i\leq n$ are independent. However, they do not have independent coordinates: intuitively if $d$ is significantly smaller than $p$ , $x_{i}$ contains much less randomness than a vector of $p$ independent entriesFor instance, if $\varphi(t)=at^{2}+b$ , we can reconstruct $z_{i}$ from the first $2d$ coordinates of $x_{i}$ , and therefore the remaining $p-2d$ coordinates of $x_{i}$ are a function of the first $2d$ ..

Then for $\gamma>1$ , the variance satisfies, almost surely,

which is again as in the case of linear isotropic features, recall Theorem 1.

The proof of Theorem 8 is lengthy and will be sketched shortly.

where $G_{1},G_{2}$ are jointly Gaussian with unit variance and covariance $w_{i}^{T}w_{j}$ . Denoting by $\varphi(x)=\sum_{k\geq 0}\lambda_{k}(\varphi)\,h_{k}(x)$ the decomposition of $\varphi$ into orthonormal Hermite polynomials, we thus obtain

Despite the nonlinear model $x_{i}=\varphi(Wz_{i})$ matches the second-order population statistics of the isotropic model, it is far from obvious that the resulting least norm regression risk is the same. Indeed the coordinates of vector $x_{i}$ are highly dependent, as it can be seen by noting that they are a function of only $d\ll p$ independent random variables. Theorem 8 confirms that —despite dependence— the risk is asymptotically the same, thus providing a concrete example of the general universality phenomenon.

2 Proof outline for Theorem 8

We introduce the following resolvents (as usual, these are defined for ${\operatorname{Im}}(\xi)>0$ and by analytic continuation, whenever possible, for ${\operatorname{Im}}(\xi)=0$ ):

Here and henceforth, we write $[i,j]=\{i+1,\ldots,i+j\}$ for integers $i,j$ . We also write $M^{-1}_{ij}=(M^{-1})_{ij}$ for a matrix $M$ , and ${\sf Tr}_{S}(M)=\sum_{i\in S}M_{ii}$ for a subset $S$ . The equalities in the first and third lines above follow by invariance of the distribution of ${A}(s)$ under permutations of $[1,p]$ and $[p+1,p+n]$ . Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write ${A}={A}(s)$ , $m_{1,n}=m_{1,n}(\xi,s)$ , and $m_{2,n}=m_{2,n}(\xi,s)$ .

The next lemma characterizes the asymptotics of $m_{1,n}$ , $m_{2,n}$ .

Assume the conditions of Theorem 8. Consider ${\operatorname{Im}}(\xi)>0$ or ${\operatorname{Im}}(\xi)=0$ , ${\operatorname{Re}}(\xi)<0$ , with $s\geq t\geq 0$ . Let $m_{1}$ and $m_{2}$ be the unique solutions of the following quadratic equations:

subject to the condition of being analytic functions for ${\operatorname{Im}}(z)>0$ , and satisfying $|m_{1}(z,s)|,|m_{2}(z,s)|\leq 1/{\operatorname{Im}}(z)$ for ${\operatorname{Im}}(z)>C$ (with $C$ a sufficiently large constant). Then, as $n,p,d\to\infty$ , such that $p/n\to\gamma$ and $d/p\to\psi$ , we have almost surely (and in $L^{1}$ ),

The proof of this lemma is given in Appendix D.2. As a corollary of the above, we obtain that the asymptotical empirical spectral distribution of the empirical covariance $\hat{\Sigma}=X^{T}X/n$ matches the one for the independent entries model, and is hence given by the Marchenko-Pastur law (a result already obtained in Pennington and Worah (2017)). We state this formally using the Stieltjes transform

Assume the conditions of Theorem 8. Consider ${\operatorname{Im}}(z)>0$ . As $n,p,d\to\infty$ , with $p/n\to\gamma$ and $d/p\to\psi$ , we have (almost surely and in $L^{1}$ ) $R_{n}(\xi)\to r(\xi)$ where $r$ is nonrandom and coincides with the Stieltjes transform of the Marchenko-Pastur law, namely

We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents computed in Lemma 3 to the variance of min-norm least squares, hence finishing our proof outline.

Assume the conditions of Theorem 8. Let $m_{1}$ , $m_{2}$ be the asymptotic resolvents given in Lemma 3. Define

Then for $\gamma\neq 1$ , $\partial_{x}m(\xi,x)\big{|}_{x=0}$ as a simple pole at $\xi=0$ , and hence admits a Taylor-Laurent expansion around $\xi=0$ , whose coefficients will be denoted by $D_{-1}$ , $D_{0}$

Here each coefficient is a function of $\gamma,\psi$ : $D_{-1}=D_{-1}(\gamma,\psi)$ , $D_{0}=D_{0}(\gamma,\psi)$ . Furthermore, for the ridge regression estimator $\hat{\beta}_{\lambda}$ in (5), as $n,p,d\to\infty$ , such that $p/n\to\gamma\in(0,\infty)$ , $d/p\to\psi\in(0,1)$ , the following ridgeless limit holds almost surely:

The proof of this lemma can be found in Appendix D.3. Theorem 8 follows by evaluating the formula in Lemma 4, by using the result of Lemma 3. We refer to the appendices material for details.

Acknowledgements

The authors are grateful to Brad Efron, Rob Tibshirani, and Larry Wasserman for inspiring us to work on this in the first place. RT sincerely thanks Edgar Dobriban for many helpful conversations about the random matrix theory literature, in particular, the literature trail leading up to Proposition 2, and the reference to Rubio and Mestre (2011). We are also grateful to Dmitry Kobak for bringing to our attention the Kobak et al. (2020) and clarifying the implications of this work on our setting.

TH was partially supported by grants NSF DMS-1407548, NSF IIS-1837931, and NIH 5R01-EB001988-21. AM was partially supported by NSF DMS-1613091, NSF CCF-1714305, NSF IIS-1741162, and ONR N00014-18-1-2729. RT was partially supported by NSF DMS-1554123.

Appendix A Proofs for the linear model: Risk of ridge regression

Before passing to the proof of Theorem 5, we state and prove a useful lemma in analysis.

Without loss of generality we can assume $x=0$ . Further we set $h=f-g$ . We consider a finite difference approximation of order $2k$ of the first order derivative of $h$ at . One such approximation is given by (Trefethen, 1996, Chapter 3)

where $u_{x}$ is a point in the interval $[0,x]$ . Using the properties of $D_{2k}$ listed above, we obtain

We are left wih the task of proving claims $(i)$ and $(ii)$ about the operator $D_{2k}$ . Substituting, these are equivalent to the claims

We further state and proof a lemma about a modified Stieltjes transform.

Let $s_{1}\geq\dots\geq s_{p}>0$ be such that $s_{1}\leq M$ and $\#\{i:\,s_{i}\geq 1/M\}\geq p/M$ for some large constant $M>1$ , and $M^{-1}\leq p/n\leq M$ . Further assume $|1-(p/n)|\geq 1/M$ and $n^{-1}\sum_{i=1}^{p}s_{i}^{-1}1_{s_{i}>0}<M$ . For $(\lambda,\eta)\in(0,M]\times(-1/(10M^{2}),1/(10M^{2}))$ let $u=u(\lambda,\eta)$ be the only solution in $(0,\infty)$ of the equation

$\lambda\mapsto u(\lambda,\eta)$ is analytic and non-decreasing with $0<\partial_{\lambda}u(\lambda,\eta)\leq C(M)$ for a constant $C(M)$ uniquely dependent on $M$ .

If further $\lambda\vee s_{p}\geq 1/M$ , then $|\partial^{k}_{\eta}u(\lambda,\eta)|\leq C_{k}(M)$ for all $k$ .

We begin from poinr $(a)$ . Since $\eta$ is fixed here, we write $u(\lambda)=u(\lambda,\eta)$ , dropping the dependence on $\eta$ . Note that, for $\eta>0$ , any solution $u$ must be such that $u\in(0,1/\eta)$ because otherwise the left hand side is negative and the right-hand side is positive. Call $x=u/(1-\eta u)$ whence the above equation is equivalent to the following equation on $x\in(0,\infty)$ :

where $s_{i}(\eta)=s_{i}/(1+\eta s_{i})$ . For $\eta\in(-1/(10M^{2}),1/(10M^{2}))$ , we have $(1-(10M)^{-1})s_{i}<s_{i}(\eta)<(1+(10M)^{-1})s_{i}$ , and therefore the sequence $(s_{i}(\eta))_{i\leq p}$ satisfies the same assumptions as $(s_{i})_{i\leq p}$ , with $M$ replaced by $2M$ . We will therefore drop the argument $\eta$ and study the fixed point equation (61) (which is equivalent to the case $\eta=0$ of Eq. (60)).

Let $g(x):=n^{-1}\sum_{i=1}^{p}s_{i}/(s_{i}+x)$ , and note that this function is mononone decreasing on $[0,\infty)$ with $g(0)\leq C_{0}(M)$ , and $g(x)\leq M^{2}/x$ . Equation (61) can be rewritten as

Since $x\mapsto\hat{g}(x):=(\lambda/x)+g(x)$ is monotone decreasing with $\hat{g}(0+)=+\infty$ and $\hat{g}(x)\to 0$ as $x\to\infty$ , this equation has a unique solution, denoted by $x(\lambda)$ . By the implicit function theorem, we have

The second inequality follows for instance by noting that $n^{-1}\sum_{i=1}^{p}s_{i}^{2}/(s_{i}+x(\lambda))^{2}<g(x(\lambda))=1-\lambda/x(\lambda)<1$ . Notice that $g(0)=p/n$ . By assumption $|g(0)-1|\geq 1/M$ . We will distinguish the two cases $g(0)<1$ and $g(0)>1$ .

Case $g(0)<1$ . Since $g(x)$ is decreasing, (62) implies $1\leq\lambda x^{-1}+g(0)$ whence

and therefore, by Eq. (63), we have $\partial_{\lambda}x(\lambda)\leq M$ for all $\lambda\in(0,\infty)$ .

Case $g(0)>1$ . Let $x_{0}=g^{-1}(1)>0$ . Since $x\mapsto g(x)$ is convex, $x_{0}\geq(1-g(0))/|g^{\prime}(0)|$ . Note that $|g^{\prime}(0)|=n^{-1}\sum_{i=1}^{p}s_{i}^{-1}1_{s_{i}>0}\leq M$ . Therefore $x_{0}\geq 1/C(M)$ for some finite constant $C(M)$ .

Since $\hat{g}(x)=\lambda/x+g(x)>g(x)$ on $(0,\infty)$ , we have $x(\lambda)>x(0+)=x_{0}\geq 1/C(M)$ for all $\lambda\in(0,\infty)$ . Further,

Therefore by Eq. (63), $0<\partial_{\lambda}x(\lambda)\leq C(M)$ also in this case.

and therefore $7/10\leq(1+\eta x(\lambda))\leq 13/10$ , whence

Claim $(b)$ simply follows since $u/(1-\eta u)\geq\lambda$ by Eq. (60). Since $(1-\eta u)=1/(1+\eta x)\geq 1/2$ as just shown, this implies $u\geq\lambda/2$ .

Finally, for claim $(c)$ , notice that $u(\lambda,\eta)$ is a solution of the equation

By the implicit function theorem, it is sufficient to prove that $\big{|}\frac{\partial^{k+l}f}{\partial u^{k}\partial\eta^{l}}\big{|}\leq C_{k,l}(M)$ for all $l\geq 1,k\geq 0$ and $l=0$ , $k\geq 2$ , and that $\big{|}\frac{\partial f}{\partial u}\big{|}\geq 1/C(M)$ .

In order to bound the first derivative away from , note that, again by the implicit function theorem,

where the last inequality follows from point $(a)$ above. In order to bound higher order derivatives, write $f(u,\eta)=f_{1}(\eta u)/\eta-\lambda-f_{2}(u)$ where $f_{1}(v):=v/(1-v)$ . Recall that $x=u/(1-\eta u)$ , and $x\leq 3M^{2}$ by Eq. (66). Therefore

where $C_{0}$ , $c_{1}(m_{1},m_{2},m_{3})$ are coefficients depending uniquly on $k,l$ . This sum is bounded since each term is bounded, by the above estimate on $\partial^{k}f_{1}$ , and further using the fact that $u=x/(1+\eta x)\leq 10M^{2}$ since $x\leq 3M^{2}$ by Eq. (66). ∎

For the proof of Theorem 5, it is convenient to introduce the notations $S_{Z}:=Z^{T}Z/n$ and $S_{X}:=X^{T}X/n=\Sigma^{1/2}S_{Z}\Sigma^{1/2}$ . We begin by recalling the expressions for bias and variance of ridge regression at regularization parameter $\lambda$ :

The bias and variance for min-norm interpolation are recovered by taking $\lambda\to 0+$ . In the following, we will occasionally consider $\lambda$ to be complex with ${\operatorname{Re}}(\lambda)>0$ .

A.2 Proof of Theorem 5, bias term

Note that the bias is homogeneous (of degree $2$ ) in $\|\beta\|_{2}$ . Hence, without loss of generality we can and will assume $\|\beta\|_{2}=1$ . For ${\operatorname{Re}}(\eta)>-1/\lambda_{\max}(\Sigma)$ define

Here, for ${\operatorname{Im}}(z)>0$ , ${\operatorname{Re}}(\eta)=0$ , $r_{n}=r_{n}(z,\eta)$ is defined as the unique solution ${\operatorname{Im}}(r_{n}(z,\eta))>0$ of

where $s_{1}(\eta)\geq s_{2}(\eta)\geq\dots\geq s_{p}(\eta)$ are the eigenvalues of $\Sigma_{\eta}$ . By Eq. (73) we have $s_{i}(\eta)=s_{i}/(1+\eta s_{i})$ . Existence and uniqueness of the solution $r_{n}(z,\eta)$ is given, for instance, in (Knowles and Yin, 2017, Lemma 2.2). This lemma also states that $r_{n}(z,\eta)$ is the Stieltjes transorm of a probability measure $\rho_{\eta}$ with support in $[0,C(M)]$ . As a consequence, for $\lambda=\lambda_{1}+i\lambda_{2}$ , $\lambda_{1}>0$

Therefore, taking the limit ${\operatorname{Im}}(\lambda)\to 0$ in Eq. (75), we obtain, with probability at least $1-Cn^{-D}$

Here the last identity is simply the definition of the predicted bias in Eq. (38).

Notice that $\eta\mapsto{\overline{F}}(\lambda,\eta)$ is infinitely differentiable in $(-1/(2M),\infty)$ with

Note that $\|\Sigma\|_{op}\leq M$ , $\|\beta\|_{2}=1$ (we assumed this to hold without loss of generality) and $\|R\|_{\mbox{\tiny\rm op}}\leq 1/\lambda$ , whence

We claim that a similar bound holds for the derivative of $F_{n}$ , namely

In what follows we use the shorhand $u_{n}=u_{n}(\eta,z)$ . Note that, by Eq. (77), $u_{n}$ solves Eq. (60), and we can therefore apply Lemma 6. Note that

where $(a)$ follows from Lemma 6. $(b)$ and $(b)$ by the theorem’s assumption.

Further, by Lemma 6. $(c)$ we have $|\partial^{k}_{\eta}u_{n}|\leq C^{\prime}_{k}(M)$ for all $k$ , whence the desired claim (83) follows.

Using Eqs. (82), (83), and (78) in Lemma 5 we get, for any ${\varepsilon}>0$ , $k$ and any $\delta<k/(4M)$ , with probability at least $1-n^{-C}$

The claim follows by taking $\delta=n^{-{\varepsilon}/2}$ and $k=\lceil 1/{\varepsilon}\rceil$ , and recalling that, by Eqs. (74) and (81), we have

A.3 Proof of Theorem 5, variance term

Further the right-hand side is an analytic function of $\lambda$ , for ${\operatorname{Re}}(\lambda)>0$ . Using again (Knowles and Yin, 2017, Theorem 3.16. $(i)$ ) , we get that for any $D>0$ , ${\varepsilon}>0$ , where exists $C=C(D,{\varepsilon})$ such that, with probability at least $1-Cn^{-D}$ ,

uniformly over ${\operatorname{Re}}(\lambda)>n^{-2/3+(1/M)}$ , ${\operatorname{Im}}(-\lambda)>0$ . Here $r_{n}(z)$ is defined as in the previous section, namely as the solution of Eq. (77) with $\eta=0$ . As shown there ${\operatorname{Im}}(r_{n}(-\lambda))|\leq|{\operatorname{Im}}(\lambda)|/{\operatorname{Re}}(\lambda)^{2}$ , whence

Since both functions $\lambda\mapsto\frac{\lambda}{p}{\sf Tr}\big{(}\Sigma(S_{X}+\lambda I)^{-1}$ and $\lambda\mapsto L_{n}(\lambda)$ are analytic on ${\operatorname{Re}}(\lambda)>0$ , the last bound implies a similar bound on their derivarives, namely

We are left with the task of computing the derivative of $L_{n}$ :

where the last equality follows by the definition $m_{n}(x)=(1-\gamma+\gamma zr_{n}(z))/(\gamma z)$ given above.

A.4 Generalization to heavier tail covariates: Proof of Theorem 6

This result follows from Theorem 2 via a standard truncation argument. We outline the argument for the bias as the same proof holds almost unchanged for the variance term. We write the bias $B_{X,\Sigma}(\hat{\beta};\beta)$ to emphasize its dependende upon the population covariance. Using Eq. (69), we obtain, for some $C=C(\lambda)<\infty$

For $M>0$ , decompose the random vriables $z_{ij}$ as

We correspondingly decompose the matricex $X,Z$ as

where $\Sigma_{M}=a_{M}^{2}\Sigma$ . Using Eq. (95), we get

Now we can apply Theorem 5 to $X_{M},\Sigma_{M}$ , since the variables $z^{M}_{ij}$ have bounded moments of all orders. Denoting by $\widehat{H}_{n}^{M}$ , $\widehat{G}_{n}^{M}$ the corresponding empirical measures, we have, with probability at least $1-Cn^{-2}$ :

Note that $\widehat{H}^{M}_{n}(s)=\widehat{H}_{n}(s/a^{2}_{M})$ and $\widehat{G}^{M}_{n}(s)=\widehat{G}_{n}(s/a^{2}_{M})$ , hence these measures converge weakly to $H^{M}(s)=H(s/a^{2}_{M})$ and $G^{M}(s)=G(s/a^{2}_{M})$ . Therefore, by Borel-Cantelli

The proof is completed by putting together Eqs. (100) and (101), taking $M\to\infty$ and noting that $H^{M}\Rightarrow H$ , $G^{M}\Rightarrow G$ in that limit.

Appendix B Proofs for the linear model: Risk min-norm regression

The proof follows from Theorem 5 by approximating min-norm regression with ridge regression with a small value of $\lambda$ . For this reason we will assume $\lambda\leq 1$ throughout. We treat separately the bias and variance terms. Since the variance is independent of $\beta$ and the bias is homogeneous in $\|\beta\|_{2}$ , we can assume, without loss of generality, $\|\beta\|_{2}\leq 1$ .

where $\sigma_{\min}$ denotes the smallest non-vanishing singular value and the last inequality holds since by assumption $\lambda_{\min}(\Sigma)\geq 1/M$ . Finally, since $|p/n-1|\geq 1/M$ , we have that $\sigma_{\min}(Z)/\sqrt{n}\geq 1/C(M)$ with probability at least $1-Cn^{-D}$ Bai and Yin (1993). Hence, with the same probability $\big{|}B_{X}(\hat{\beta}_{\lambda};\beta)^{1/2}-B_{X}(\hat{\beta};\beta)^{1/2}\big{|}\leq C(M)\lambda$ . Also notice that, by Eq. (103), $B_{X}(\hat{\beta};\beta)\leq\|\Sigma\|_{\mbox{\tiny\rm op}}\leq M$ . Therefore, we obtain

We next claim that Eqs. (11) and (38) imply the bound

In to prove this estimate, recall that $\gamma>1+M^{-1}$ , and that, by assumption, $\widehat{H}_{n}$ is supported on $[M^{-1},M]$ . Define $c_{*}(\lambda)$ via $m_{n}(-\lambda)=(1-\gamma-\lambda\gamma c_{*}(\lambda))/(-\gamma\lambda)$ . Note that $c_{*}(\lambda)=r_{n}(-\lambda)/\gamma$ , where $r_{n}(z)$ is the compaion Stieltjes transform already introduced in Eq. (77). In particular, by (Knowles and Yin, 2017, Lemma 2.2), $c_{*}$ it is non-negative and monotone decreasing in $\lambda$ . Further Eq. (36), $c_{*}$ satisfies

It follows from ${\rm supp}(\widehat{H}_{n})\in[M^{-1},M]$ that $f(x;\lambda)$ is monotone decreasing on $(0,\infty)$ for any $\lambda\geq 0$ with $f(0;\lambda)=1$ and $\lim_{x\to\infty}f(x;0)=0$ , $\lim_{x\to\infty}f(x;\lambda)=-\infty$ for any $\lambda>0$ . Further $f(x;\lambda)$ is monotone decreasing in $\lambda$ with $f(x;\lambda)\to f(x;0)$ as $\lambda\to 0$ . Therefore $c_{*}(\lambda)$ is monotone decreasing in $\lambda$ , with $\lim_{\lambda\to 0}c_{*}(\lambda)=c_{0}>0$ (with $c_{0}$ defined as per Eq. (10)). Further $\underline{f}(x;\lambda)\leq f(x;\lambda)\leq\overline{f}(x;\lambda)$ where $\overline{f}(x;\lambda)$ is obtained by replacing $\widehat{H}_{n}$ by $\delta_{1/M}$ , and $\underline{f}(x;\lambda)$ is obtained by replacing $\widehat{H}_{n}$ by $\delta_{M}$ . Therefore $C(M)^{-1}\leq c_{*}(\lambda)\leq c_{0}\leq C(M)$ for some finite constant $C(M)$ depending uniquely on $M$ . In particular, this implies that $\partial_{\lambda}f(c_{*};\lambda)$ is uniformly bounded. Finally

For $s\in[M^{-1},M]$ and $x\in[C(M)^{-1},C(M)]$ , the integrand $\gamma s/(1+x\gamma s)^{2}$ is bounded and bounded away from zero. Therefore $C(M)^{-1}\leq-\partial_{x}f(x;\lambda)\leq C(M)$ for some (possibly different) constant $C(M)$ . By the implicit function theorem, it follows that $|\partial_{\lambda}c_{*}(\lambda)|\leq C(M)$ is bounded for $\lambda\in[0,M]$ , and therefore $|c_{*}(\lambda)-c_{0}|\leq C(M)\lambda$ , whence

(Here $O_{*}(\lambda)$ denotes a quantity bounded uniformly as $|O_{*}(\lambda)|\leq C(M)\lambda$ .). Sustituting this in Eq. (37), and repeating similar arguments, we obtain

Finally, substituting in Eq. (38), we obtain the claimed bound (105).

Using Eqs. (104), (105) and Theorem 5 we deduce that, for all $M,D,{\varepsilon}$ , and all $n^{-2/3+1/M}\lambda\leq 1$ , with probability at least $1-Cn^{-D}$ ,

The desired bound is obtained by setting $\lambda=n^{-1/7}$ and choosing ${\varepsilon}$ small enough.

Variance term. With the same notations introduced above, Eq. (70) implies

Taking the difference and bounding $|x(x+\lambda)^{-2}-x^{-1}|\leq 2\lambda/x^{2}$ for $x,\lambda>0$ , we get

The rest of the proof is analogous to one for the bias term and is omitted.

Appendix C Other proofs for the linear model

where the second inequality holds almost surely for all $n$ large enough, following from $\lambda_{\min}(\Sigma)\geq c$ , and the Bai-Yin theorem (Bai and Yin, 1993), which implies that the smallest eigenvalue of $Z^{T}Z/n$ is almost surely larger than $(1-\sqrt{\gamma})^{2}/2$ for sufficiently large $n$ . As the right-hand side in the above display is strictly positive, we have that $X^{T}X$ is almost surely invertible. Therefore by the bias and variance results from Lemma 1, we have almost surely $\Pi=0$ and $B_{X}(\hat{\beta};\beta)=0$ , and also

where $F_{Z^{T}Z/n}$ is the spectral measure of $Z^{T}Z/n$ and the last identity holds if $M^{-1}<(c/2)(1-\sqrt{\gamma})^{2}$ almost surely for all $n\geq N_{0}$ (with $N_{0}$ a random number) by Eq. (110). By the Marchenko-Pastur theorem (Marchenko and Pastur, 1967; Silverstein, 1995), which says that $F_{Z^{T}Z/n}$ converges weakly, almost surely, to the Marchenko-Pastur law $F_{\gamma}$ (depending only on $\gamma$ ), whenc, almost surely :

where the last equality follows since $F_{\gamma}$ is supported in $[(1-\sqrt{\gamma})^{2},(1+\sqrt{\gamma})^{2}]$ , provided $M>(1-\sqrt{\gamma})^{-2}$ . Thus the last display implies that as $n,p\to\infty$ , almost surely,

It remains to compute the right-hand side above. We recognize the right-hand side as the evaluation of the Stieltjes transform $m(z)$ of Marchenko-Pastur law at $z=0$ . Fortunately, this has an explicit form (e.g., Lemma 3.11 in Bai and Silverstein 2010), for real $z>0$ :

The proof is thus completed by taking the limit $z\to 0+$ in this expression.

C.2 Proof of Theorem 1

Variance term. Recalling the expression for the bias from Lemma 1 (where now $\Sigma=I$ ), we have

where $s_{i}=\lambda_{i}(X^{T}X/n)$ , $i=1,\ldots,n$ are the nonzero eigenvalues of $X^{T}X/n$ . Let $t_{i}=\lambda_{i}(XX^{T}/p)$ , $i=1,\ldots,p$ denote the eigenvalues of $XX^{T}/p$ . Then we may write $s_{i}=(p/n)t_{i}$ , $i=1,\ldots,n$ , and

where $F_{XX^{T}/p}$ is the spectral measure of $XX^{T}/p$ . Now as $n/p\to\tau=1/\gamma<1$ , we are back precisely in the setting of Theorem 2, and by the same arguments, we may conclude that almost surely

Bias term. Recall the expression for the bias from Lemma 1 (where now $\Sigma=I$ ), and note the following key characterization of the pseudoinverse of a rectangular matrix $A$ ,

We can apply this to $A=X/\sqrt{n}$ , and rewrite the bias as

for a deterministic sequence $c_{n}(z)>0$ , $n=1,2,3,\ldots$ (defined for each $n$ via a certain fixed-point equation). Taking $\Theta_{n}=I/p$ in the above, note that this reduces to the almost sure convergence of the Stieltjes transform of the specral distribution of $\hat{\Sigma}$ , and hence by the (classical) Marchenko-Pastur theorem, we learn that $c_{n}(z)\to m(-z)$ , where $m$ denotes the Stieltjes transform of the Marchenko-Pastur law $F_{\gamma}$ . Further, taking $\Theta_{n}=\beta\beta^{T}/p$ , we see from (115) and $c_{n}(z)\to m(-z)$ that, almost surely,

Define $f_{n}(z)=z\beta^{T}(\hat{\Sigma}+zI)^{-1}\beta$ . Notice that $|f_{n}(z)|\leq r^{2}$ , and $f_{n}^{\prime}(z)=\beta^{T}(\hat{\Sigma}+zI)^{-2}\hat{\Sigma}\beta$ , so

where $\lambda_{\max}(\hat{\Sigma})$ and $\lambda_{\min}^{+}(\hat{\Sigma})$ denote the largest and smallest nonzero eigenvalues, respectively, of $\hat{\Sigma}$ , and the second inequality holds almost surely for large enough $n$ , by the Bai-Yin theorem (Bai and Yin, 1993). As its derivatives are bounded, the sequence $f_{n}$ , $n=1,2,3,\ldots$ is equicontinuous, and by the Arzela-Ascoli theorem, we deduce that $f_{n}$ converges uniformly to its limit. By the Moore-Osgood theorem, we can exchange limits (as $n,p\to\infty$ and $z\to 0^{+}$ ) and conclude from (114), (116) that as $n,p\to\infty$ , almost surely,

Finally, relying on the fact that the Stieltjes transform of the Marchenko-Pastur law has the explicit form in (112), we can compute the above limit:

C.3 Proof of Corollary 2

The corollary follows immediately from Theorem 3, once we estabilish that, for $\|\beta\|^{2}\to r^{2}$ , $\mathscrsfs{B}(H,H,\gamma)\to\mathscrsfs{B}_{\mbox{\tiny\rm equi}}(H,\gamma)$ . By homogeneity, we can assume without loss of generality that $\|\beta\|^{2}=r^{2}=1$ . By using Eq. (11), we get

where that last equality follows from the definition of $c_{0}$ in Eq. (10). Comparing with the definition of $cuB_{\mbox{\tiny\rm equi}}(H,\gamma)$ in Eq. (16) yields the desired claim.

C.4 The case of equicorrelated features

As an illustration of Corollary 2, we consider the case of equicorrelated covariates. Namellt, for $\rho\in[0,1)$ , we assume that $\Sigma_{ii}=1$ for all $i$ , and $\Sigma_{ij}=\rho$ for all $i\not=j$ .

Assume the conditions of Theorem 2, and moreover, assume that $\Sigma$ has $\rho$ -equicorrelation structure for all $n,p$ , and some $\rho\in[0,1)$ . Then as $n,p\to\infty$ , with $p/n\to\gamma>1$ , we have almost surely

Let $H_{\rho}$ denote the weak limit of $F_{\Sigma}$ , when $\Sigma$ has $\rho$ -equicorrelation structure for all $n,p$ . A short calculation shows that such a matrix $\Sigma$ has one eigenvalue value equal to $1+(p-1)\rho$ , and $p-1$ eigenvalues equal to $1-\rho$ . Thus the weak limit of its spectral measure is simply $H_{\rho}=1_{[1-\rho,\infty)}$ , i.e., $dH_{\rho}=\delta_{1-\rho}$ , a point mass at $1-\rho$ of probability one.

We remark that the present case, strictly speaking, breaks the conditions that we assume in Theorem 2, because $\lambda_{\max}(\Sigma)=1+(p-1)\rho$ clearly diverges with $p$ . However, by decomposing $\Sigma=(1-\rho)I+\rho\mathds{1}\mathds{1}^{T}$ (where $\mathds{1}$ denotes the vector of all 1s), and correspondingly decomposing the functions $f_{n},g_{n},h_{n}$ defined in the proof of Theorem 2, to handle the rank one part $\rho\mathds{1}\mathds{1}^{T}$ properly, we can ensure the appropriate boundedness conditions. Thus, the result in the theorem still holds when $\Sigma$ has $\rho$ -equicorrelation structure.

Now denote by $v_{\rho}$ the companion Stieltjes transform of the empirical spectral distribution $F_{H_{\rho},\gamma}$ , to emphasize its dependence on $\rho$ . Recall the Silverstein equation (120), which for the equicorrelation case, as $dH_{\rho}=\delta_{1-\rho}$ , becomes

where $v_{0}$ is the companion Stieltjes transform in the $\Sigma=I$ case, the object of study in Appendix LABEL:app:risk_iso2. From the results for $v_{0}$ from (LABEL:eq:v0) and (LABEL:eq:v1), invoking the relationship in the above display, we have

Plugging these into the asymptotic risk expression from Theorem 2 gives

C.5 Autoregressive features

We consider a $\rho$ -autoregressive structure for $\Sigma$ , for a constant $\rho\in[0,1)$ , meaning that $\Sigma_{ij}=\rho^{|i-j|}$ for all $i,j$ . In this case, it is not clear that a closed-form exists for $v(0)$ or $v^{\prime}(0)$ . However, we can compute these numerically. In fact, the strategy we describe below applies to any situation in which we are able to perform numerical integration against $dH$ , where $H$ is the weak limit of the spectral measure $F_{\Sigma}$ of $\Sigma$ .

The critical relationship that we use is the Silverstein equation (Silverstein, 1995), which relates the companion Stieltjes transform $v$ to $H$ via

Therefore, we can use a simple univariate root-finding algorithm (like the bisection method) to solve for $v(0)$ in (121). With $v(0)$ computed, we can compute $v^{\prime}(0)$ by first differentiating (120) with respect to $z$ (see Dobriban 2015), and then taking $z\to 0^{+}$ , to yield

When $\Sigma$ is of $\rho$ -autoregressive form, it is known to have eigenvalues (Trench, 1999):

This allows us to efficiently approximate an integral with respect to $dH$ (e.g., by taking each $\theta_{i}$ to be in the midpoint of its interval given above), solve for $v(0)$ in (121), $v^{\prime}(0)$ in (122), and evaluate the asymptotic risk as per Theorem 2.

Figure 13 shows the results from using such a numerical scheme to evaluate the asymptotic risk, as $\rho$ varies from 0 to 0.75.

C.6 Proof of Theorem 7

We begin by recalling an alternative to leave-one-out cross-validation, for linear smoothers, called generalized cross-validation (GCV) (Craven and Wahba, 1978; Golub et al., 1979b). The GCV error of the ridge regression estimator at a tuning parameter value $\lambda$ defined as

Compared to the shortcut formula for leave-one-out CV in (46), we can see that GCV in (123) swaps out the $i$ th diagonal element $(S_{\lambda})_{ii}$ in the denominator of each summand with the average diagonal element ${\sf Tr}(S_{\lambda})/n$ . This modification makes GCV rotationally invariant (Golub et al., 1979b).

It turns out that the GCV error is easier to analyze, compared to the CV error. Thus we proceed by first studying GCV, and then relating CV to GCV. We break up the exposition below into these two parts accordingly.

Let us rewrite the GCV criterion in (123) as

We will treat the almost sure convergence of the numerator and denominator separately.

The denominator is an easier calculation. Denoting $s_{i}=\lambda_{i}(X^{T}X/n)$ , $i=1,\ldots,p$ , we have

where this convergence holds almost surely as $n,p\to\infty$ , a direct consequence of the Marchenko-Pastur theorem, and $F_{\gamma}$ denotes the Marchenko-Pastur law. Meanwhile, we can rewrite this asymptotic limit as

where $m=m_{F_{\gamma}}$ denotes the Stieltjes transform of the Marchenko-Pastur law $F_{\gamma}$ , and therefore, almost surely,

The numerator requires only a bit more difficult calculation. Let $y=X\beta+\epsilon$ and $c_{n}=\sqrt{p}(\sigma/r)$ . Observe

Note that $\delta$ has independent entries with mean zero and variance $r^{2}/p$ , and further, note that $\delta$ and $A$ are independent. Therefore we can use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), which is adapted from Lemma B.26 in Bai and Silverstein (2010).As written, Lemma 7.6 of Dobriban and Wager (2018) assumes i.i.d. components for the random vector in question, which is not necessarily true of $\delta$ in our case. However, an inspection of their proof shows that they only require independent components with mean zero and common variance, which is precisely as stated in Lemma B.26 of Bai and Silverstein (2010). This result asserts that, almost surely,

A short calculation and application of the Marchenko-Pastur theorem gives that, almost surely,

where for the second sum, we used Vitali’s theorem to show convergence of the derivative of the Stieltjes transform of the spectral distribution of $X^{T}X/n$ to the derivative of the Stieltjes transform of $F_{\gamma}$ (note that Vitali’s theorem applies as the function in question is bounded and analytic). By a similar calculation, we have almost surely,

Putting (125), (126) together with (124), we have, almost surely,

To show that this matches to the asymptotic prediction error of ridge regression requires some nontrivial calculations. We start by reparametrizing the above asymptotic limit in terms of the companion Stieltjes transform, abbreviated by $v=v_{F_{\gamma}}$ . This satisfies

Introducing $\alpha=r^{2}/(\sigma^{2}\gamma)$ , the almost sure limit of GCV is

where in the second line we rearranged, in the third line we applied the companion Stieltjes transform facts, and in the fourth line we simplified. We can now recognize the above as $\sigma^{2}$ plus the asymptotic risk of ridge regression at tuning parameter $\lambda$ , either from the proof of Theorem 2, or from Theorem 2.1 in Dobriban and Wager (2018). In terms of the Stieltjes transform itself, this is $\sigma^{2}+\sigma^{2}\gamma(m(-\lambda)-\lambda(1-\alpha\lambda)m^{\prime}(-\lambda))$ , which proves the first claimed result.

In the second line, we used ${\sf Tr}(S_{\lambda})/n=(1/n)\sum_{i=1}^{p}s_{i}/(s_{i}+\lambda)\leq s_{\max}/(s_{\max}+\lambda)$ , with $s_{\max}=\lambda_{\max}(X^{T}X/n)$ , and in the third line, we used $\|y\|_{2}^{2}/n\leq 2(r^{2}+\sigma^{2})$ almost surely for sufficiently large $n$ , by the strong law of large numbers, and $s_{\max}\leq 2$ almost surely for sufficiently large $n$ , by the Bai-Yin theorem (Bai and Yin, 1993). Furthermore, writing $g_{n},h_{n}$ for the numerator and denominator of $f_{n}$ , respectively, we have

The above argument just showed that $|g_{n}(\lambda)|$ is upper bounded on $[\lambda_{1},\lambda_{2}]$ , and $|h_{n}(\lambda)|$ is lower bounded on $[\lambda_{1},\lambda_{2}]$ ; also, clearly $|h_{n}(\lambda)|\leq 1$ ; therefore to show that $|f_{n}^{\prime}|$ is almost surely bounded on $[\lambda_{1},\lambda_{2}]$ , it suffices to show that both $|g_{n}^{\prime}|,|h_{n}^{\prime}|$ are. Denoting by $u_{i}$ , $i=1,\ldots,p$ the eigenvectors of $X^{T}X/n$ (corresponding to eigenvalues $s_{i}$ , $i=1,\ldots,p$ ), a short calculation shows

the last step holding almost surely for large enough $n$ , by the law of large numbers. Also,

the last step holding almost surely for large enough $n$ , by the Bai-Yin theorem. Thus we have shown that $|f_{n}^{\prime}|$ is almost surely bounded on $[\lambda_{1},\lambda_{2}]$ , for large enough $n$ , and applying the Arzela-Ascoli theorem, $f_{n}$ converges uniformly to $f$ . With $\lambda_{n}=\arg\min_{\lambda\in[\lambda_{1},\lambda_{2}]}f_{n}(\lambda)$ , this means for any for any $\lambda\in[\lambda_{1},\lambda_{2}]$ , almost surely

where we used the optimality of $\lambda_{n}$ for $f_{n}$ , and then uniform convergence. In other words, almost surely,

As the almost sure convergence $R_{X}(\hat{\beta}_{\lambda})+\sigma^{2}\to f(\lambda)$ is also uniform for $\lambda\in[\lambda_{1},\lambda_{2}]$ (by similar arguments, where we bound the risk and its derivative in $\lambda$ ), we conclude that almost surely $R_{X}(\hat{\beta}_{\lambda})\to\sigma^{2}\gamma m(-1/\alpha)$ , completing the proof for GCV.

C.6.2 Analysis of CV

Let us rewrite the CV criterion, starting in its shortcut form (46), as

where $D_{\lambda}$ is a diagonal matrix that has diagonal elements $(D_{\lambda})_{ii}=1-(S_{\lambda})_{ii}$ , $i=1,\ldots,n$ .

First, fixing an arbitrary $i=1,\ldots,n$ , we will study the limiting behavior of

Since $x_{i}$ and $X^{T}X$ are not independent, we cannot immediately apply the almost sure convergence of quadratic forms lemma, as we did in the previous analysis of GCV. But, letting $X_{-i}$ denote the matrix $X$ with the $i$ th row removed, we can write $(X^{T}X/n+\lambda I)^{-1}=(X_{-i}^{T}X_{-i}/n+\lambda I+x_{i}x_{i}^{T}/n)^{-1}$ , and use the Sherman-Morrison-Woodbury formula to separate out the dependent and independent parts, as follows. Letting $\delta_{i}=x_{i}/\sqrt{n}$ , $A_{i}=(X_{-i}^{T}X_{-i}/n+\lambda I)^{-1}$ , and $A=(X^{T}X/n+\lambda I)$ , we have

Note $\delta_{i}$ and $A_{i}$ are independent (i.e., $x_{i}$ and $X_{-i}^{T}X_{-i}$ are independent), so we can now use the almost sure convergence of quadratic forms, from Lemma 7.6 in Dobriban and Wager (2018), adapted from Lemma B.26 in Bai and Silverstein (2010), to get that, almost surely

Further, as ${\sf Tr}(A_{i})/n\to\gamma m(-\lambda)$ almost surely by the Marchenko-Pastur theorem, we have, almost surely,

The strategy henceforth, based on the result in (129), is to replace the denominators $1-(S_{\lambda})_{ii}$ , $i=1,\ldots,n$ in the summands of the CV error by their asymptotic limits, and then control the remainder terms. More precisely, we define $\bar{D}_{\lambda}=(1+\gamma m(-\lambda))^{-1}I$ , and then write, from (127),

We will first show that almost surely $b\to 0$ . Observe, by the Cauchy-Schwartz inequality,

where in the second step, we used $\|(I-S_{\lambda})y\|_{2}^{2}/n\leq\|y\|_{2}^{2}/n\leq 2(r^{2}+\sigma^{2})$ , which holds almost surely for large enough $n$ , by the strong law of large numbers. Meanwhile,

By the Marchenko-Pastur theorem, we have $d_{3}\to 0$ almost surely. Using $u^{2}-v^{2}=(u-v)(u+v)$ on the maximands in $d_{2}$ ,

In the second line above, we used ${\sf Tr}(A_{i})/n\leq\lambda_{\max}(A_{i})\leq 1/\lambda$ , $i=1,\ldots,n$ , and also ${\sf Tr}(A)/n\leq 1/\lambda$ . In the third line, we used the Sherman-Morrison-Woodbury formula. In the fourth line, for each summand $i=1,\ldots,n$ , we upper bounded the numerator by $\lambda_{\max}(A)^{2}\|\delta_{i}\|_{2}^{2}\leq\|\delta_{i}\|_{2}^{2}/\lambda^{2}$ , and lower bounded the denominator by $\lambda/(s_{\max}+\lambda)$ , with $s_{\max}=\lambda_{\max}(X^{T}X/n)$ (which follows from a short calculation using the eigendecomposition of $X^{T}X/n$ ). In the fifth line, we used $\|\delta_{i}\|_{2}^{2}=e_{i}^{T}(XX^{T}/n)e_{i}\leq s_{\max}$ , and in the sixth line, we used $s_{\max}\leq 2$ almost surely for sufficiently large $n$ , by the Bai-Yin theorem (Bai and Yin, 1993).

Now we will show that $d_{1}\to 0$ almost surely by showing that the convergence in (128) is rapid enough. As before, using $u^{2}-v^{2}=(u-v)(u+v)$ on the maximands in $d_{1}$ , we have

Here we used that for $i=1,\ldots,n$ , we have ${\sf Tr}(A_{i})/n\leq 1/\lambda$ , and similarly, $\delta_{i}^{T}A_{i}\delta_{i}\leq\|\delta_{i}\|_{2}^{2}/\lambda\leq s_{\max}/\lambda\leq 2/\lambda$ , with the last inequality holding almost surely for sufficiently large $n$ , by the Bai-Yin theorem (Bai and Yin, 1993). To show $\max_{i=1,\ldots,n}\Delta_{i}\to 0$ almost surely, we start with the union bound and Markov’s inequality, where $q=2+\eta/2$ , and $t_{n}$ is be specified later,

where in the last step we used $n/p\leq 2/\gamma$ for large enough $n$ . Since the right-hand side above is summable in $p$ , we conclude by the Borel-Cantelli lemma that $\max_{i=1,\ldots,n}\Delta_{i}\to 0$ almost surely, and therefore also $d_{1}\to 0$ almost surely. This finishes the proof that $b\to 0$ .

As before, it helps to reparametrize in terms of the companion Stieltjes transform, giving

Adding $v(-\lambda)$ to both sides, then dividing both sides by $1+v(-\lambda)$ , yields

and dividing through by $v(-\lambda)$ , then rearranging, gives

C.7 Misspecified model results

These results follow because, as argued in Section 5.2, the misspecified case can be reduced to a well-specified case after we make the substitutions in (21).

C.8 CV and GCV simulations

Figures 16 and 16 investigate the effect of using CV and GCV tuning for ridge regression, under the same simulation setup as Figures 8 and 8, respectively. It is worth noting that for these figures, there is a difference in how we compute finite-sample risks, compared to what is done for all of the other figures in the paper. In all others, we can compute the finite-sample risks exactly, because, recall, our notion of risk is conditional on $X$ ,

and this quantity can be computed analytically for linear estimators like min-norm least squares and ridge regression. When we tune ridge by minimizing CV or GCV, however, we can no longer compute its risk analytically, and so in our experiments we approximate the above expectation by an average over 20 repetitions. Therefore, the finite-sample risks in Figures 16 and 16 are (only a little bit) farther from their asymptotic counterparts compared to all other figures in this paper, and we have included the finite-sample risk of the optimally-tuned ridge estimator in these figures, when the risk is again computed in the same way (approximated by an average over 20 repetitions), as a reference. All this being said, we still see excellent agreement between the risks under CV, GCV, and optimal tuning, and these all lie close to their (common) asymptotic risk curves, throughout.

Appendix D Nonlinear model: Proof of Theorem 8

Whenever clear from the context, we will omit the arguments from block matrix and resolvents, and write ${A}={A}(s,t)$ , $m_{1,n}=m_{1,n}(\xi,s,t)$ , and $m_{2,n}=m_{2,n}(\xi,s,t)$ .

The next theorem is the the central random matrix results, and characterizes the asymptotics of $m_{1,n}$ , $m_{2,n}$ . It provides a generalization of Lemma 3.

Consider ${\operatorname{Im}}(\xi)>0$ or ${\operatorname{Im}}(\xi)=0$ , ${\operatorname{Re}}(\xi)<0$ , with $s\geq t\geq 0$ . Let $m_{1}$ and $m_{2}$ be the unique solutions of the following fourth degree equations:

subject to the condition of being analytic functions for ${\operatorname{Im}}(z)>0$ , and satisfying $|m_{1}(z,s,t)|,|m_{2}(z,s,t)|\leq 1/{\operatorname{Im}}(z)$ for ${\operatorname{Im}}(z)>C$ (with $C$ a sufficiently large constant). Then, as $n,p,d\to\infty$ , such that $p/n\to\gamma$ and $d/p\to\psi$ , we have almost surely (and in $L^{1}$ ),

The proof of this theorem is given in Appendix D.2. Now define

where recall $\hat{\Sigma}=X^{T}X/n$ . As a corollary of the above, we obtain the asymptotic Stieltjes transform of the eigenvalue distribution of $\hat{\Sigma}$ (this result was first derived in Pennington and Worah (2017)).

Assume the conditions of Theorem 9. Consider ${\operatorname{Im}}(\xi)>0$ . As $n,p,d\to\infty$ , with $p/n\to\gamma$ and $d/p\to\psi$ , the Stieltjes transform of spectral distribution of $\hat{\Sigma}$ in (137) satisfies almost surely (and in $L^{1}$ ) $S_{n}(\xi)\to s(\xi)$ where $s$ is a nonrandom function that uniquely solves the following equations (abbreviating $s=s(\xi)$ ):

subject to the condition of being analytic for ${\operatorname{Im}}(z)>0$ , and satisfying $|s(z^{2})|<1/{\operatorname{Im}}(z^{2})$ for ${\operatorname{Im}}(z)>C$ (where $C$ is a large enough constant). When $c_{1}=0$ , the function $s$ is the Stieltjes transform of the Marchenko-Pastur distribution.

We refer to Appendix D.4 for a proof of this corollary. The next lemma connects the above resolvents to the variance of min-norm least squares.

Assume the conditions of Theorem 9. Let $m_{1}$ , $m_{2}$ be the asymptotic resolvents given in Theorem 9. Define

Then for $\gamma\neq 1$ , the following Taylor-Laurent expansion holds around $\xi=0$ :

with each $D_{i}=D_{i}(\gamma,\psi,c_{1})$ . Furthermore, for the ridge regression estimator $\hat{\beta}_{\lambda}$ in (5), as $n,p,d\to\infty$ , such that $p/n\to\gamma\in(0,\infty)$ , $d/p\to\psi\in(0,1)$ , the following ridgeless limit holds almost surely:

The proof of this lemma can be found in Appendix D.3.

D.2 Proof of Theorem 9

The proof follows the approach developed by Cheng and Singer (2013) to determine the asymptotic spectral measure of symmetric kernel random matrices. The latter is in turn inspired to the classical resolvent proof of the semicircle law for Wigner matrices, see Anderson et al. (2009). The present calculation is somewhat longer because of the block structure of the matrix ${A}$ , but does not present technical difficulties. We will therefore provide a proof outline, referring to Cheng and Singer (2013) for further detail.

This decomposition satisfies the following properties:

As mentioned, it is an orthogonal decomposition: $\int x\varphi_{\perp}(x)\,\mu_{d}({\rm d}x)=0$ . Further, by symmetry and the normalization assumption, $\int x\,\mu_{d}({\rm d}x)=\int\varphi_{\perp}(x)\,\mu_{d}({\rm d}x)=0$ .

$\int\varphi_{\perp}(x)^{2}\,\mu_{d}({\rm d}x)\to 1-c_{1}$ , $\int\varphi_{\perp}(x)^{2}\,\mu_{G}({\rm d}x)\to 1-c_{1}$ , as $d\to\infty$ .

Finally, recall the following definitions for the random resolvents:

In the next section, we will summarize some basic facts about the resolvent $m_{1,n}$ , $m_{2,n}$ and their analyticity, and some concentration properties of $M_{1,n},M_{2,n}$ . We will then derive Eqs. (133) and (134). Thanks to the analyticity properties, we will assume throughout these derivations ${\operatorname{Im}}(\xi)\geq C$ for $C$ a large enough constant. Also, we will assume $\gamma,\psi,\varphi$ to be fixed throughout, and will not explicitly point out the dependence with respect to these arguments.

Consider, to be definite $m_{1,n}$ . Denoting by $(\lambda_{i})_{i\leq N}$ , $({\boldsymbol{v}}_{i})_{i\leq N}$ , the eigenvalues and eigenvectors of ${A}(n)$ , we have

Properties $(a)$ - $(c)$ are then standard consequences of $m_{1,n}$ being a Stieltjes transform (see, for instance, Anderson et al. (2009)) ∎

Without loss of generality, we will assume $i=N$ , and write ${W}_{*}=(W_{ij})_{i,j\leq N-1}$ ${{w}}=({{w}}_{N,i})_{i\leq N-1}$ , $\omega=W_{NN}$ . Define

and ${\boldsymbol{R}}^{(N)}$ , $\rho^{(N)}$ the same quantities when ${W}$ is replaced by ${W}^{(N)}$ . Then, by Schur complement formula, it is immediate to get the formulae

Let $S_{0}\equiv S\setminus\{N\}$ and $S_{1}\equiv S\cap\{N\}$ (i.e. $S_{1}=\{N\}$ or $S_{1}=\emptyset$ depending whether $N\in S$ or not). Define $\Delta_{A}\equiv\big{|}{\sf Tr}_{A}\big{[}({W}-\xi{{I}}_{N})^{-1}\big{]}-{\sf Tr}_{A}\big{[}({W}^{(N)}-\xi{{I}}_{N})^{-1}\big{]}\big{|}$ . Then, using the bounds given above,

Next consider $\Delta_{S_{0}}$ . Notice that

Letting $V_{ij}=\langle{\boldsymbol{v}}_{i},{\sf P}_{S_{0}}{\boldsymbol{v}}_{j}\rangle$ , we have $\|{\boldsymbol{V}}\|_{\mbox{\tiny\rm op}}\leq\|{\sf P}_{S_{0}}\|_{\mbox{\tiny\rm op}}\leq 1$ . Therefore, defining $u_{i}=\langle{{w}},{\boldsymbol{v}}_{i}\rangle/(\lambda_{i}-\xi)$ ,

which implies that Eq. (157) also holds in this case.

Using Eqs. (151) and (157) yields the desired claim. ∎

If ${\operatorname{Im}}(\xi)\geq\xi_{0}>0$ , or ${\operatorname{Re}}(\xi)\leq-\xi_{0}<0$ , with $s\geq t\geq 0$ . Then there exists $c_{0}=c_{0}(\xi_{0})$ such that, for $i\in\{1,2\}$ ,

In particular, for ${\operatorname{Im}}(\xi)>0$ , or ${\operatorname{Re}}(\xi)<0$ , then $|M_{i,n}(\xi,s,t)-m_{i,n}(\xi,s,t)|\to 0$ almost surely.

The proof is completely analogous to (Cheng and Singer, 2013, Lemma 2.4), except that we use Azuma-Hoeffding instead of Burkholder’s inequality. Consider $M_{1,n}(\xi)$ (we omit the arguments $s,t$ for the sake of simplicity). We construct the martingale

In particular Eqs. (134), (133) admit a unique solution with $|m_{1}|,|m_{2}|<r_{0}$ for $\xi>\xi_{0}$ .

whence $|F_{i}({\boldsymbol{m}})|<r$ by taking $\xi_{0}$ large enough. Analogously

Again, the claim follows by taking $\xi_{0}$ large enough. ∎

The next lemma allow to restrict ourselves to cases in which $\varphi$ is polynomial with degree independent of $n$ .

The proof is essentially the same as for Lemma 4.4 in Cheng and Singer (2013). ∎

(Here $d$ is understood to be the same in each case.)

This follows immediately from Lemma 9. Denote, with a slight abuse of notation, by ${A}(n,p)$ the matrix in Eq. (132). Consider for instance the difference

where ${A}^{*}(n,p)$ is obtained from ${A}(n,p)$ by zero-ing the last row and column. Lemma 9 then implied $|m_{1,n,p}(\xi)-m_{1,n,p-1}(\xi)|\leq 3/\xi_{0}$ . The other terms are treated analogously. ∎

D.2.2 Derivation of Eqs. (133)

Using Eq. (179), and Woodbury’s formula, we get (writing for simplicity ${\boldsymbol{v}}={A}_{\cdot,n+p}$

By a concentration of measure argument (see, e.g., (Tao, 2012, Section 2.4.3)), the same statement also holds with high probability

By a concentration of measure argument, we also have

By using this together with Eqs. (199) and (204), we get

D.2.3 Derivation of Eqs. (134)

The derivation si analogous to the one in the previous section (notice that we will redefine some of the notations used in the last section), and hence we will only outline the main steps.

We decompose ${{w}}_{a}$ , $1\leq a\leq p-1$ , and ${{z}}_{i}$ , $1\leq i\leq n$ into their components along ${{w}}_{p}$ , and the orthogonal complement:

As in the previous section, it can be shown that $\|{\boldsymbol{E}}\|_{\mbox{\tiny\rm op}}\leq{\varepsilon}_{n}\equiv(\log n)^{c}/\sqrt{n}$ (with $c$ an absolute constant), and therefore Eq. (207) yields

We then proceed to compute the various terms on the right-hand side. The calculation is very similar to the one in the previous section, and we limit ourselves to reporting the results:

Substituting Eqs. (221) to (224) into Eq. (220), we get

D.2.4 Completing the proof

Consider ${\operatorname{Im}}(\xi)\geq C_{0}$ a large enough constant. By Eqs. (206), (206), we have

D.3 Proof of Lemma D.5

With the above definitions there exists a constant $C$ such that

Therefore, on the same event (for a suitable constant $C$ )

This implies Eq. (228) because with the stated probability we have $\|{{w}}_{i}\|\leq 1+{\varepsilon}_{n}$ for all $i$ and $|\langle{{w}}_{i},{{w}}_{j}\rangle|\leq{\varepsilon}_{n}$ for all $i\neq j$ .

where the convergence takes place almost surely and in $L^{1}$ .

Denote by $(\lambda_{i})_{i\leq p}$ , $({\boldsymbol{v}}_{i})_{i\leq p}$ the eigenvalues and eigenvectors of $\hat{\Sigma}$ . The following qualitative behavior can be extracted from the asymptotic of the Stieltjes transform as stated in Corollary 6.

For any $\gamma\neq 1$ , $c_{1}\in[0,1)$ , there exists $\rho_{0}>0$ such that the following happens. Let $S_{+}\equiv\{i\in[p]:\;|\lambda_{i}|>\rho_{0}\}$ , $S_{-}\equiv\{i\in[p]:\;|\lambda_{i}|\leq\rho_{0}\}$ . Then, the following limits hold almost surely

(Here $\Rightarrow$ denotes weak convergence of probability measures.)

Note that $m_{n}(\xi,x,c_{1}x)\to m(\xi,x,c_{1}x)$ as $n\to\infty$ . Further, for ${\operatorname{Im}}(\xi)>0$ or ${\operatorname{Re}}(\xi)<0$ , it it is immediate to show tha $\partial^{2}_{x}m_{n}(\xi,x,c_{1}x)$ is bounded in $n$ (in a neighborhood of $x=0$ ). Hence

Since $Q_{n}(\xi)\to q(\xi)$ , a weak convergence argument implies $\mu_{n}\Rightarrow\mu_{\infty}$ almost surely. Further, defining $\mu^{+}_{n}={\boldsymbol{1}}_{(\rho_{0},\infty)}\mu_{n}$ , $\mu^{-}_{n}={\boldsymbol{1}}_{[0,\rho_{0}]}\mu_{n}$ , Lemma 15 implies $\mu^{+}_{n}\Rightarrow\mu^{+}_{\infty}$ , $\mu^{-}_{n}\Rightarrow c_{0}\delta_{0}$ , where $\mu^{+}_{\infty}$ is a measure supported on $[\rho_{0},\infty)$ , with $\mu^{+}_{\infty}([\rho_{0},\infty))=1-c$ . This in turns implies

In particular, $q_{+}$ is analytic in a neighborhood of . This proves Eq. (141), with $q_{+}(0)=D_{0}$ , $\gamma c=D_{-1}$ .

Comparing the last two displays, we obtain our claim

D.4 Proof of Corollary 8 and Corollary 6

Throughout this section, set $s=t=0$ (and drop these arguments for the various functions), and let $M_{n}(\xi)\equiv\gamma M_{1,n}(\xi)+M_{2,n}(\xi)$ , $m_{n}(\xi)\equiv\gamma m_{1,n}(\xi)+m_{2,n}(\xi)$ , $m(\xi)=\gamma m_{1}(\xi)+m_{2}(\xi)$ . In this case we have

and therefore, a simple linear algebra calculation yields

Theorem LABEL:thm:var_nonlinear immediately implies $S_{n}(z^{2})\to s(z^{2})$ (almost surely and in $L^{1}$ ), where

Equations (133) and (134) simplify for the case $s=t=0$ (setting $\overline{m}_{j}=m_{j}(z,s=0,t=0)$ ) to yield

Taking a linear combination of these two equations, we get

Comparing this with Eq (260), we get Eq. (138). Substituting the latter in Eqs. (261), (262), we get Eqs. (139), (140).

Finally, Corollary 6 follows by taking $c_{1}=0$ .u

D.5 Proof of Theorem 8

First notice that Lemma 3 and 4 follow by taking $t=0$ , $c_{1}=0$ in Theorem 9 and Lemma .

The variance result follows simply by taking $c_{1}\to 0$ in Theorem LABEL:thm:var_nonlinear. Solving the quadratic equations (134), we get

The claimed formula for the variance is obtained by using these expressions in Lemma 4.

By Lemma 14, and using the fact that ${{\Sigma}}_{0}={{I}}_{p}$ when $c_{1}=0$ , we get

with probability larger than $1-1/n^{2}$ . Therefore, by Borel-Cantelli it is sufficient to estabilish the claim for $\widetilde{B}_{{X}}(\hat{\beta}_{\lambda})$ . By Corollary 6, for $c_{1}=0$ , the empirical spectral distribution of $\hat{\Sigma}$ converges almost surely (in the weak topology) to the Marchenko-Pastur law $\mu_{MP}$ . Hence (for $(\lambda_{i})_{i\leq p}$ the eigenvalues of $\hat{\Sigma}$ )

Hence the asymptotic bias is the same as in the linear model (for random isotropic features). The claim hence follows by the results of Section 5.2. Alternatively, we may simply recall that $\mu_{MP}(\{0\})=(1-\gamma^{-1})_{+}$ and use dominated convergence.