Two models of double descent for weak features

Mikhail Belkin, Daniel Hsu, Ji Xu

Introduction

The “double descent” risk curve was proposed by [Bel+19] as a general way to qualitatively describe the out-of-sample prediction performance of variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications (see Section 1.1 for references). In these studies, a predictive model with $p$ parameters is fit to a training sample of size $n$ , and the test risk (i.e., out-of-sample error) is examined as a function of $p$ . When $p$ is below the sample size $n$ (for regression or binary classification), the test risk is governed by the usual bias-variance decomposition. As $p$ is increased towards $n$ , the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up, sometimes toward infinity. The classical bias-variance analysis identifies a “sweet spot” value of $p\in[0,n]$ at which the bias and variance are balanced to achieve low test risk. However, in the “modern regime”, as $p$ grows beyond $n$ , the training risk remains zero, but the test risk decreases again, even when fitting noisy data, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). In many (but not all) cases from [Bel+19], the limiting risk as $p\to\infty$ is lower than what is achieved at the “sweet spot” value of $p$ .

In this article, we show that key aspects of the “double descent” risk curve can be observed with the least squares/least norm predictor in two simple random features models. The first is a Gaussian model studied by [BF83] in the classical $p\leq n$ regime, while the second is a Fourier series model for functions on the circle. In both cases, we prove that the risk is infinite around $p=n$ , and decreases again as $p$ increases beyond $n$ . When the signal-to-noise ratio is high, the minimum risk is, in fact, achieved in the modern regime, when $p>n$ . Our results provide a precise mathematical analysis in a simple and tractable setting of the mechanism that was qualitatively described by [Bel+19]. In particular, it captures a key aspect of many practical over-parameterized models: that increasing the number of parameters to the maximum can lead to better performance. We also establish some non-asymptotic concentration phenomena in the Gaussian model.

We note that in both of the models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner. Such scenarios are commonplace in machine learning practice, and they should be contrasted with “scientific” scenarios where features are carefully designed or curated, as is often the case in scientific applications. For comparison, we give an example of “prescient” feature selection, where the $p$ features a priori known to be most useful are included in the model. In this case, the optimal test risk is achieved at some $p\leq n$ , which is consistent with the classical analysis of [BF83].

The “double descent” risk curve was posited by [Bel+19] to connect the classical bias-variance trade-off to behaviors observed in over-parameterized regimes for a variety of machine learning models. The shape and features of the risk curve itself appear throughout the literature in a number of contexts [e.g., vallet1989linear, opper1990ability, le1991eigenvalues, krogh1992generalization, bos1998dynamics, watkin1993statistical, advani2017high]; see also [Loo+20] for a “brief prehistory” that focuses on the curious peak in the curve. These prior works analyze the risk of linear classification and regression models and neural networks in high-dimensional asymptotic regimes. Our analysis in the Gaussian model gives an exact expression for the risk for any finite sample size and number of parameters.

More recently, [Nea+18] observe that similar phenomena in neural networks can be explained by a variance reduction effect of increasing network width. The transition from under- to over-parameterized regimes was recently analyzed by [Spi+18] by drawing a connection to the physical phenomenon of “jamming” in a class of glassy systems. Our analysis makes these ideas concrete and explicit in the context of simple regression models. For instance, our analysis captures the transition from under- to over-parameterized regimes at a point where an inverse Wishart random matrix has no finite expectation. It also allows us to compare the risks at any two points on the curve and explain how the risk in the over-parameterized regime can be lower than any risk in the under-parameterized regime.

The initial version of this article [BHX19] appeared concurrently with the works of [Has+19], [Mut+20], and [Bar+20], all of which also study the behavior of the least squares/least norm predictor in over-parameterized linear regression. [Mut+20] focus on the well-specified scenario (essentially, $p=D$ ) and provide upper bounds on the risk that go to zero as $p\to\infty$ . (A related variance analysis was carried out by [Nea+18].) [Has+19] provide a much broader range of analyses in the high-dimensional asymptotic regime, including a “misspecified” setup that is related to ours. Their analyses require weaker distributional assumptions than ours, owing to their reliance on asymptotic analysis. (A special case of the results in the follow-up work by [XH19] further broadens the range of analyses to allow highly non-isotropic designs, but again only in the high-dimensional asymptotic regime.) The analysis of [Has+19] also considers the effect of ridge regularization; in particular, they show that when the optimal level of regularization is used, the risk curve no longer shows the “double descent” shape. Finally, [Bar+20] study non-asymptotic upper and lower bounds on the risk in the over-parameterized regime, and provide a characterization in terms of certain “effective dimensions” based on the tail of the eigenvalue sequence of the covariance operator.

Gaussian model

Given $n$ iid copies $(({\boldsymbol{x}}^{(i)},y^{(i)}))_{i=1}^{n}$ of $({\boldsymbol{x}},y)$ , we fit a linear model to the data only using a subset ${T}\subseteq[D]:=\{1,\dotsc,D\}$ of $p:=|{T}|$ variables.

Let ${\boldsymbol{X}}:=[{\boldsymbol{x}}^{(1)}|\dotsb|{\boldsymbol{x}}^{(n)}]^{*}$ be the $n\times D$ design matrix, and let ${\boldsymbol{y}}:=(y^{(1)},\dotsc,y^{(n)})$ be the vector of responses. For a subset $A\subseteq[D]$ and a $D$ -dimensional vector ${\boldsymbol{v}}$ , we use ${\boldsymbol{v}}_{A}:=(v_{j}:j\in A)$ to denote its $|A|$ -dimensional subvector of entries from $A$ ; we also use ${\boldsymbol{X}}_{A}:=[{\boldsymbol{x}}_{A}^{(1)}|\dotsb|{\boldsymbol{x}}_{A}^{(n)}]^{*}$ to denote the $n\times|A|$ design matrix with variables from $A$ . For $A\subseteq[D]$ , we denote its complement by $A^{c}:=[D]\setminus A$ . Finally, $\|\cdot\|$ denotes the Euclidean norm.

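We fit regression coefficients $\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D})$ with

$$\hat{\boldsymbol{\beta}}_{T}:={\boldsymbol{X}}_{T}^{\dagger}{\boldsymbol{y}},\qquad\hat{\boldsymbol{\beta}}_{{T}^{c}}:={\boldsymbol{0}}.$$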

Above, the symbol † denotes the Moore-Penrose pseudoinverse. In other words, we use the solution to the normal equations ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}{\boldsymbol{v}}={\boldsymbol{X}}_{T}^{*}{\boldsymbol{y}}$ of least norm for $\hat{\boldsymbol{\beta}}_{T}$ and force $\hat{\boldsymbol{\beta}}_{{T}^{c}}$ to all-zeros.
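A minimal sketch of this estimator, assuming NumPy and real-valued data (so that $^{*}$ is a plain transpose). The function name `fit_least_norm`, the seed, and the sizes are illustrative choices rather than anything fixed by the text. It fits on the columns in $T$ with the pseudoinverse, sets the remaining coefficients to zero, and checks the normal equations numerically.

```python
import numpy as np

def fit_least_norm(X, y, T):
    """Least squares / least norm fit using only the columns indexed by T."""
    beta_hat = np.zeros(X.shape[1])
    # pinv yields the least-norm solution of the normal equations on X_T.
    beta_hat[T] = np.linalg.pinv(X[:, T]) @ y
    return beta_hat

rng = np.random.default_rng(0)        # seed chosen arbitrarily
n, D, p, sigma = 20, 50, 30, 0.5      # illustrative sizes with p > n
X = rng.standard_normal((n, D))
beta = rng.standard_normal(D)
y = X @ beta + sigma * rng.standard_normal(n)
T = np.sort(rng.choice(D, size=p, replace=False))

beta_hat = fit_least_norm(X, y, T)
X_T = X[:, T]
# Residual of the normal equations X_T^* X_T v = X_T^* y at v = beta_hat_T.
normal_eq_gap = np.linalg.norm(X_T.T @ X_T @ beta_hat[T] - X_T.T @ y)
# Training residual; with p > n the fit is expected to interpolate.
train_residual = np.linalg.norm(X_T @ beta_hat[T] - y)
print(normal_eq_gap, train_residual)
```

With $p>n$ as in this sketch, the fitted model also interpolates the training responses, so `train_residual` should be at rounding-error level.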

In this section, our analysis assumes a model in which $({\boldsymbol{x}},\epsilon)$ follows a standard multivariate Gaussian distribution. This Gaussian model was also studied by [BF83], although their analysis is restricted to the case where the number of variables used $p$ is always at most $n$ ; our analysis will also consider the $p\geq n$ regime.

We derive a formula for the (prediction) risk of $\hat{\boldsymbol{\beta}}$ for an arbitrary choice of $p$ features ${T}\subseteq[D]$ , and then examine this risk under particular selection models for ${T}$ .

The proof of Theorem 1 is not hard; we give the details in Section 2.2. We now turn to the risk of $\hat{\boldsymbol{\beta}}$ under a random selection model for ${T}$ .

Let ${T}$ be a uniformly random subset of $[D]$ of cardinality $p$ . In the setting of Theorem 1, the risk of $\hat{\boldsymbol{\beta}}$ (taking expectation with respect to the random choice of ${T}$ in addition to the random design matrix and response vector) satisfies

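Since ${T}$ is a uniformly random subset of $[D]$ of cardinality $p$ , each index $j\in[D]$ belongs to ${T}$ with probability $p/D$ , so

$$\mathbb{E}\bigl[\|{\boldsymbol{\beta}}_{T}\|^{2}\bigr]=\frac{p}{D}\|{\boldsymbol{\beta}}\|^{2},\qquad\mathbb{E}\bigl[\|{\boldsymbol{\beta}}_{{T}^{c}}\|^{2}\bigr]=\Bigl(1-\frac{p}{D}\Bigr)\|{\boldsymbol{\beta}}\|^{2}.$$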

Plugging into Theorem 1 completes the proof. ∎

Thus, assuming $D>n+1$ , we observe that the risk first increases with $p$ up to the “interpolation threshold” ( $p=n$ ), after which the risk decreases with $p$ . Moreover, when the signal-to-noise ratio $\|{\boldsymbol{\beta}}\|^{2}/\sigma^{2}$ is larger than $D/(D-n-1)$ , the risk is smallest at $p=D$ ; in particular, it is smaller than the risk at any $p\leq n$ . This is the “double descent” risk curve where the first “descent” is degenerate (i.e., the “sweet spot” that balances bias and variance is at $p=0$ ). See Figure 1 for an illustration.
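The shape of this curve can be sketched by evaluating a closed-form risk at every $p$ . The formula in the code is an assumption: it combines the two expected-value terms from the proof of Theorem 1 (using the mean of an inverse Wishart matrix) with $\mathbb{E}\|{\boldsymbol{\beta}}_{T}\|^{2}=(p/D)\|{\boldsymbol{\beta}}\|^{2}$ , and it should be checked against the statement of Corollary 1. The function names and parameter values are hypothetical.

```python
import math

def risk_random_selection(p, n, D, beta_sq, sigma_sq):
    """Assumed closed-form risk for a uniformly random T of size p.

    beta_sq is ||beta||^2 and sigma_sq is sigma^2.
    """
    if n - 1 <= p <= n + 1:
        return math.inf                       # no finite expectation near p = n
    signal_out = (1 - p / D) * beta_sq        # E ||beta_{T^c}||^2
    if p <= n - 2:
        return (signal_out + sigma_sq) * (1 + p / (n - p - 1))
    signal_in = (p / D) * beta_sq             # E ||beta_T||^2
    return (signal_in * (1 - n / p)
            + (signal_out + sigma_sq) * (1 + n / (p - n - 1)))

n, D, sigma_sq = 40, 100, 1.0                 # illustrative sizes, D > n + 1
snr_threshold = D / (D - n - 1)               # threshold quoted in the text

def best_p(beta_sq):
    """Number of features p in {0, ..., D} with the smallest assumed risk."""
    risks = [risk_random_selection(p, n, D, beta_sq, sigma_sq)
             for p in range(D + 1)]
    return min(range(D + 1), key=risks.__getitem__)

# One signal-to-noise ratio above the threshold and one below it.
print(snr_threshold, best_p(4.0), best_p(1.0))
```

Under this assumed formula, comparing the value at $p=D$ with the value at $p=0$ reduces to comparing $\|{\boldsymbol{\beta}}\|^{2}/\sigma^{2}$ with $D/(D-n-1)$ , which is the threshold stated above.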

It is worth pointing out that the behavior under the random selection model of ${T}$ can be very different from that under a deterministic model of ${T}$ . Consider including variables in ${T}$ by decreasing order of $\beta_{j}^{2}$ —a kind of “prescient” selection model studied by [BF83]. The behavior of the risk as a function of $p$ , illustrated in Figure 2, reveals a striking difference between the random selection model and the “prescient” selection model.
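The contrast between the two selection models can be sketched numerically. The risk expression below is again an assumption (the form we take Theorem 1 to have for a fixed ${T}$ ), and the coefficient profile $\beta_{j}^{2}\propto 1/j^{2}$ , the noise level, and all names are hypothetical choices made only for illustration.

```python
import math

def risk_given_split(p, n, head_sq, tail_sq, sigma_sq):
    """Assumed risk for a fixed T of size p.

    head_sq = ||beta_T||^2 and tail_sq = ||beta_{T^c}||^2.
    """
    if n - 1 <= p <= n + 1:
        return math.inf
    if p <= n - 2:
        return (tail_sq + sigma_sq) * (1 + p / (n - p - 1))
    return (head_sq * (1 - n / p)
            + (tail_sq + sigma_sq) * (1 + n / (p - n - 1)))

n, D, sigma_sq = 40, 100, 1 / 25
raw = [1 / j ** 2 for j in range(1, D + 1)]   # hypothetical decaying beta_j^2
total = sum(raw)
beta_sq = [v / total for v in raw]            # normalised so ||beta||^2 = 1

prescient, random_sel = [], []
for p in range(D + 1):
    head = sum(beta_sq[:p])                   # the p largest beta_j^2 come first
    prescient.append(risk_given_split(p, n, head, 1 - head, sigma_sq))
    # Random selection: plug in the expected split (the risk is linear in it).
    random_sel.append(risk_given_split(p, n, p / D, 1 - p / D, sigma_sq))

best_prescient = min(range(D + 1), key=prescient.__getitem__)
best_random = min(range(D + 1), key=random_sel.__getitem__)
print(best_prescient, best_random)
```

In this particular configuration the prescient curve is minimised at a small $p$ below $n$ , while the random-selection curve is minimised at $p=D$ ; other coefficient profiles may behave differently.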

2.2 Proof of Theorem 1

Since $\hat{\boldsymbol{\beta}}_{{T}^{c}}={\boldsymbol{0}}$ , it follows that the risk of $\hat{\boldsymbol{\beta}}$ is

The risk of $\hat{\boldsymbol{\beta}}$ was computed by [BF83] in the regime where $p\leq n$ :

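We consider the regime where $p\geq n$ . Recall that the pseudoinverse of ${\boldsymbol{X}}_{T}$ can be written as ${\boldsymbol{X}}_{T}^{{\dagger}}={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}$ . Thus, letting ${\boldsymbol{\eta}}:={\boldsymbol{y}}-{\boldsymbol{X}}_{T}{\boldsymbol{\beta}}_{T}$ ,

$${\boldsymbol{\beta}}_{T}-\hat{\boldsymbol{\beta}}_{T}={\boldsymbol{\beta}}_{T}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}({\boldsymbol{X}}_{T}{\boldsymbol{\beta}}_{T}+{\boldsymbol{\eta}})=({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}.$$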

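On the right-hand side, the first term $({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T}$ is the orthogonal projection of ${\boldsymbol{\beta}}_{T}$ onto the null space of ${\boldsymbol{X}}_{T}$ , while the second term $-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}$ is a vector in the row space of ${\boldsymbol{X}}_{T}$ . These two subspaces are orthogonal, so by the Pythagorean theorem, the squared norm of their sum is equal to the sum of their squared norms:

$$\|{\boldsymbol{\beta}}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T}\|^{2}+\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}.$$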

We analyze the expected values of these two terms by exploiting properties of the standard normal distribution.

Note that ${\boldsymbol{\Pi}}_{T}:={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}$ is the orthogonal projection matrix for the row space of ${\boldsymbol{X}}_{T}$ . So, by the Pythagorean theorem, we have

By rotational symmetry of the standard normal distribution, it follows that

where the second equality holds almost surely because ${\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ is almost surely invertible. Since ${\boldsymbol{x}}_{T}^{*}{\boldsymbol{\beta}}_{T}$ and ${\boldsymbol{x}}_{{T}^{c}}^{*}{\boldsymbol{\beta}}_{{T}^{c}}+\sigma\epsilon$ are uncorrelated, it follows that

Combining the first and second terms gives the claimed expression for the risk. ∎
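The two expectations in this proof can be checked by simulation. The sketch below assumes NumPy and assumes that the two terms evaluate to $(1-n/p)\|{\boldsymbol{\beta}}_{T}\|^{2}$ (rotational symmetry) and $(\|{\boldsymbol{\beta}}_{{T}^{c}}\|^{2}+\sigma^{2})\,n/(p-n-1)$ (mean of an inverse Wishart matrix). The sizes, seed, and the variable `noise_var`, which stands in for $\|{\boldsymbol{\beta}}_{{T}^{c}}\|^{2}+\sigma^{2}$ , are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)            # seed chosen arbitrarily
n, p, trials = 10, 30, 2000               # illustrative sizes with p >= n + 2
beta_T = rng.standard_normal(p)
noise_var = 1.5                           # stands in for ||beta_{T^c}||^2 + sigma^2

proj_terms, noise_terms = [], []
for _ in range(trials):
    X_T = rng.standard_normal((n, p))
    # eta is Gaussian with the stated variance and independent of X_T.
    eta = np.sqrt(noise_var) * rng.standard_normal(n)
    pinv = np.linalg.pinv(X_T)            # equals X_T^* (X_T X_T^*)^{-1} here
    # First term: squared norm of the projection of beta_T onto null(X_T).
    proj_terms.append(np.sum((beta_T - pinv @ (X_T @ beta_T)) ** 2))
    # Second term: squared norm of X_T^* (X_T X_T^*)^{-1} eta.
    noise_terms.append(np.sum((pinv @ eta) ** 2))

proj_ratio = np.mean(proj_terms) / np.sum(beta_T ** 2)
noise_ratio = np.mean(noise_terms) / noise_var
print(proj_ratio, 1 - n / p)              # Monte Carlo estimate vs assumed value
print(noise_ratio, n / (p - n - 1))       # Monte Carlo estimate vs assumed value
```

The second ratio has heavier tails as $p$ approaches $n$ , which is the numerical counterpart of the inverse Wishart matrix losing its finite expectation at the interpolation threshold.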

2.3 Concentration

We briefly consider the measure concentration of $\|{\boldsymbol{\beta}}-\hat{\boldsymbol{\beta}}\|^{2}$ .

Consider the setting from Theorem 1, and fix any $\epsilon\in(0,1)$ . If $\alpha:=p/n<1$ , then

The proof is given in Appendix A. The main idea for the $p>n$ case is as follows. From the proof of Theorem 1, we have the decomposition

The same arguments can be used to give fixed-level confidence bounds; see Proposition 2 in Appendix B.

Finally, it is also possible to compare $\|{\boldsymbol{\beta}}_{T}\|^{2}$ to $(p/D)\|{\boldsymbol{\beta}}\|^{2}$ (and $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}$ to $(1-p/D)\|{\boldsymbol{\beta}}\|^{2}$ ) under the random selection model of $T$ from Corollary 1 using concentration inequalities for sampling without replacement [see, e.g., BM15]. The following is a simple consequence of Proposition 1.4 of [BM15].

For any $t>0$ , with probability at least $1-2e^{-t}$ ,

where $\mu:=\max_{i\in[D]}|\beta_{i}|/\|{\boldsymbol{\beta}}\|$ .

The proof is in Appendix C. The crucial parameter $\mu$ has range $[1/\sqrt{D},1]$ . It is small when there are many relevant “weak” features, each with a relatively small coefficient in ${\boldsymbol{\beta}}$ ; conversely, it is large when ${\boldsymbol{\beta}}$ is concentrated on a sparse subset of features.
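The role of $\mu$ can be illustrated with the two extreme cases. This is a minimal sketch using only the Python standard library; the function names `coherence` and `selected_energy`, the sizes, and the seed are hypothetical. It computes $\mu$ for a dense and a one-sparse ${\boldsymbol{\beta}}$ of unit norm and draws $\|{\boldsymbol{\beta}}_{T}\|^{2}$ for uniformly random subsets ${T}$ of size $p$ .

```python
import math
import random

def coherence(beta):
    """mu = max_i |beta_i| / ||beta|| (the name is ours, not the paper's)."""
    norm = math.sqrt(sum(b * b for b in beta))
    return max(abs(b) for b in beta) / norm

def selected_energy(beta, p, rng):
    """||beta_T||^2 for a uniformly random subset T of size p."""
    T = rng.sample(range(len(beta)), p)   # sampling without replacement
    return sum(beta[j] ** 2 for j in T)

D, p = 100, 30
dense = [1 / math.sqrt(D)] * D            # many equally weak features
sparse = [1.0] + [0.0] * (D - 1)          # all signal in a single feature
rng = random.Random(0)                    # seed chosen arbitrarily

mu_dense, mu_sparse = coherence(dense), coherence(sparse)
dense_draws = [selected_energy(dense, p, rng) for _ in range(200)]
sparse_draws = [selected_energy(sparse, p, rng) for _ in range(200)]

print(mu_dense, mu_sparse)
# Largest observed deviation of ||beta_T||^2 from (p/D) ||beta||^2 = p/D.
print(max(abs(e - p / D) for e in dense_draws))
print(max(abs(e - p / D) for e in sparse_draws))
```

For the dense vector every subset carries the same energy, so the deviation is zero up to rounding; for the one-sparse vector $\|{\boldsymbol{\beta}}_{T}\|^{2}$ is either $0$ or $1$ , so it never concentrates around $p/D$ .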

Fourier series model

In this section, we consider a noise-free Fourier series model, which can be regarded as a one-dimensional version of the random Fourier features model studied by [RR08] for functions defined on the unit circle.

${S}$ and ${T}$ are independent random subsets of $[D]$ . For any $i\in[D]$ , the membership of $i$ in ${S}$ (respectively, ${T}$ ) is determined by an independent Bernoulli variable with mean $\rho_{n}:=n/D$ (respectively, $\rho_{p}:=p/D$ ).

We observe the $n\times p$ design matrix ${\boldsymbol{F}}_{{S},{T}}$ and $n$ -dimensional vector of responses ${\boldsymbol{\mu}}_{S}$ . Here, ${\boldsymbol{F}}_{{S},{T}}$ is the submatrix of ${\boldsymbol{F}}$ with rows from ${S}$ and columns from ${T}$ , and ${\boldsymbol{\mu}}_{S}$ is the subvector of ${\boldsymbol{\mu}}$ of entries from ${S}$ .
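A minimal sketch of how such submatrices can be formed and inspected, assuming NumPy and assuming that ${\boldsymbol{F}}$ is the unitary discrete Fourier transform matrix with entries $\omega^{jk}/\sqrt{D}$ , $\omega=e^{-2\pi i/D}$ , indexed from zero (the text's $[D]$ is one-based). The helper names and the chosen subsets are hypothetical. The code checks full rank of small submatrices for a prime $D$ , the row-orthonormality identity over ${T}$ and ${T}^{c}$ , and one rank-deficient submatrix for a composite $D$ .

```python
import itertools
import numpy as np

def dft_matrix(D):
    """Unitary D x D DFT matrix with entries omega^(j k) / sqrt(D)."""
    idx = np.arange(D)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / D) / np.sqrt(D)

def sub(F, rows, cols):
    """Submatrix of F with the given row and column index lists."""
    return F[np.ix_(list(rows), list(cols))]

# Prime D: check that every submatrix with at most 3 rows/columns has full rank.
D = 7
F = dft_matrix(D)
subsets = [c for k in (1, 2, 3) for c in itertools.combinations(range(D), k)]
full_rank_prime = all(
    np.linalg.matrix_rank(sub(F, A, B)) == min(len(A), len(B))
    for A in subsets for B in subsets
)

# Rows of a unitary matrix are orthonormal, so the Gram matrices of the
# selected rows over T and over T^c sum to the identity.
S, T = [0, 2, 5], [1, 2, 4, 6]
T_c = [j for j in range(D) if j not in T]
gram_sum = (sub(F, S, T) @ sub(F, S, T).conj().T
            + sub(F, S, T_c) @ sub(F, S, T_c).conj().T)
identity_gap = np.linalg.norm(gram_sum - np.eye(len(S)))

# Composite D = 4: rows {0, 2} and columns {0, 2} give a constant 2 x 2 block.
rank_composite = np.linalg.matrix_rank(sub(dft_matrix(4), [0, 2], [0, 2]))
print(full_rank_prime, identity_gap, rank_composite)
```

The prime case is an instance of Chebotarev's theorem on roots of unity; the composite example suggests that full rank of arbitrary submatrices should not be taken for granted when $D$ is not prime.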

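We fit regression coefficients $\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D})$ with

$$\hat{\boldsymbol{\beta}}_{T}:={\boldsymbol{F}}_{{S},{T}}^{\dagger}{\boldsymbol{\mu}}_{S},\qquad\hat{\boldsymbol{\beta}}_{{T}^{c}}:={\boldsymbol{0}}.$$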

One important property of the discrete Fourier transform matrix that we use is that the matrix ${\boldsymbol{F}}_{A,B}$ has rank $\min\{|A|,|B|\}$ for all $A,B\subseteq[D]$ when $D$ is prime (Chebotarev's theorem on roots of unity). For composite $D$ , the fact that ${\boldsymbol{F}}$ is Vandermonde guarantees this only when $A$ or $B$ consists of consecutive indices. Thus, we have

In the remainder of this section, we analyze the risk of $\hat{\boldsymbol{\beta}}$ under a random model for ${\boldsymbol{\beta}}$ , where

Following the arguments from Section 2.1, we have

Now we take (conditional) expectations with respect to ${\boldsymbol{\beta}}$ , given ${S}$ and ${T}$ :

Since ${\boldsymbol{F}}_{{S},{T}}$ has rank $\min\{|{S}|,|{T}|\}$ , the first trace expression is equal to

For the second trace expression, we use the explicit formula for ${\boldsymbol{F}}_{{S},{T}}^{\dagger}$ and the fact that ${\boldsymbol{F}}_{{S},{T}}{\boldsymbol{F}}_{{S},{T}}^{*}+{\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}={\boldsymbol{I}}$ to obtain

where the $\lambda_{i}\in[0,1]$ are the eigenvalues of ${\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}$ . Therefore, from Equation 1, we have

To determine the asymptotic behavior of $(*)$ , we use a recent result of [Far11]:

as $D,n,p\to\infty$ with $\rho_{n}=n/D$ and $\rho_{p}=p/D$ held fixed. Further, under this limit, we have

since $\rho_{p}\geq\rho_{n}$ . Hence we have the following:

Assume the setting as above, with $D,n,p\to\infty$ and $\rho_{n}=n/D$ and $\rho_{p}=p/D$ held fixed. Then

Note that the right-hand side in the equation from Theorem 3 is well-defined in the limit because the ratios $\rho_{n},\rho_{p}$ are fixed. It diverges to $+\infty$ when $\rho_{p}$ is close to $\rho_{n}$ , and decreases as $\rho_{p}$ approaches $1$ . This is the same behavior as in the Gaussian model from Section 2 with random feature selection; we depict a non-asymptotic instantiation of it in Figure 3.
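The qualitative behavior can be explored at finite $D$ with a short simulation. The sketch below assumes NumPy, a unitary DFT matrix, noise-free responses ${\boldsymbol{\mu}}={\boldsymbol{F}}{\boldsymbol{\beta}}$ , and a random ${\boldsymbol{\beta}}$ with $\mathbb{E}[{\boldsymbol{\beta}}{\boldsymbol{\beta}}^{*}]={\boldsymbol{I}}/D$ ; these modelling details, the sizes, and the function names are assumptions made for illustration. Given $({S},{T})$ , the map ${\boldsymbol{\beta}}\mapsto\hat{\boldsymbol{\beta}}$ is linear, so the conditional risk is computed exactly as a Frobenius norm rather than by sampling ${\boldsymbol{\beta}}$ .

```python
import numpy as np

def dft_matrix(D):
    """Unitary D x D DFT matrix (zero-based indices)."""
    idx = np.arange(D)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / D) / np.sqrt(D)

def conditional_risk(F, S, T):
    """E ||beta - beta_hat||^2 given (S, T), assuming E[beta beta^*] = I / D."""
    D = F.shape[0]
    G = np.zeros((D, D), dtype=complex)   # linear map beta -> beta_hat
    # beta_hat_T = pinv(F_{S,T}) mu_S with mu_S = F_{S,:} beta; zeros off T.
    G[T, :] = np.linalg.pinv(F[np.ix_(S, T)]) @ F[S, :]
    return np.linalg.norm(np.eye(D) - G, "fro") ** 2 / D

def bernoulli_subset(D, rho, rng):
    """Indices kept by independent Bernoulli(rho) draws."""
    return np.flatnonzero(rng.random(D) < rho)

rng = np.random.default_rng(2)            # seed chosen arbitrarily
D, rho_n, draws = 61, 0.3, 20             # hypothetical sizes; D is prime
F = dft_matrix(D)

def average_risk(rho_p):
    """Average of the conditional risk over random draws of (S, T)."""
    risks = []
    for _ in range(draws):
        S = bernoulli_subset(D, rho_n, rng)
        T = bernoulli_subset(D, rho_p, rng)
        risks.append(conditional_risk(F, S, T))
    return float(np.mean(risks))

for rho_p in (0.35, 0.5, 0.75, 1.0):
    print(rho_p, average_risk(rho_p))
```

When ${T}=[D]$ the conditional risk under these assumptions is exactly $1-|{S}|/D$ , since $\hat{\boldsymbol{\beta}}$ is then the projection of ${\boldsymbol{\beta}}$ onto the row space of the selected rows of ${\boldsymbol{F}}$ .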

Discussion

Our analysis shows that when features are chosen in an uninformed manner, it may be optimal to choose as many as possible—even more than the number of data—rather than limit the number to that which balances bias and variance as suggested by classical analyses. This choice is simple, both conceptually and algorithmically (although it may incur a computational penalty for processing large numbers of parameters), and avoids the need for precise control of regularization parameters. It is reflective of the practice in modern machine learning applications like image and speech recognition, where signal processing-based features are individually weak but in great abundance, and models that use all of the features, notably neural networks, are highly successful. This stands in contrast to the “scientific” scenarios with informed selection of features; for example, in many science and medical applications, features are purposefully chosen based on the detailed understanding of the underlying phenomena. As illustrated by the “prescient” model that selects the best features, in that case choosing the number of features to balance bias and variance can be better than incurring the costs that come with using all of the features.

Finally, we remark that there appears to be a sharp divide between the classical analyses of statistics and machine learning in $p<n$ regimes and the modern “weak but plentiful features” interpolating settings. While the former are deeply explored, an understanding of the latter is only starting to emerge. It is clear that the best practices for model and feature selection depend crucially on the regime of the application.

We thank the anonymous referees for their remarks and suggestions (which, in particular, led to the inclusion of Section 2.3). This work was carried out in part while MB was at The Ohio State University. This research was supported by NSF CCF-1740833 and IIS-1815697 awards, a Sloan Research Fellowship, a Google Faculty Award, and a Cheung-Kong Graduate School of Business Fellowship.

References

Appendix A Proof of Theorem 2

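We first consider $p>n$ (i.e., $\alpha>1$ ). From the proof of Theorem 1, we have the decomposition

$$\|{\boldsymbol{\beta}}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T}\|^{2}+\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}.$$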

The second term $\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}$ is a (random) quadratic form in ${\boldsymbol{\eta}}$ . Let ${\boldsymbol{K}}_{T}:={\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ , which is non-singular almost surely. By Lemma 4 from [Das00], we have for any $\epsilon\in(0,1)$ ,

where $\kappa({\boldsymbol{X}}_{T})=\sigma_{\max}({\boldsymbol{X}}_{T})/\sigma_{\min}({\boldsymbol{X}}_{T})$ is the ratio of the largest singular value of ${\boldsymbol{X}}_{T}$ to the smallest singular value of ${\boldsymbol{X}}_{T}$ . For any $t>0$ ,

These inequalities follow from Gaussian comparison inequalities and concentration of measure on the sphere and in Gaussian space [see, e.g., RV09, Ver18]. Therefore, for $p>(1+t)^{2}n$ ,

Finally, observe that $1/({\boldsymbol{K}}_{T}^{-1})_{i,i}$ has a $\chi^{2}$ -distribution with $p-n+1$ degrees of freedom. Therefore, again using Lemma 4 from [Das00] and a union bound, we have for any $\epsilon\in(0,1)$ ,

Putting these probability inequalities together (with $t=(1-\epsilon)(\sqrt{\alpha}-1)$ ) completes the proof for $p>n$ .
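The distributional claim about the diagonal of ${\boldsymbol{K}}_{T}^{-1}$ used above can be checked by simulation. This is a sketch assuming NumPy; the sizes and seed are hypothetical. It compares the empirical mean and variance of $1/({\boldsymbol{K}}_{T}^{-1})_{1,1}$ with those of a $\chi^{2}$ variable with $p-n+1$ degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)            # seed chosen arbitrarily
n, p, trials = 10, 30, 2000               # illustrative sizes with p > n
samples = np.empty(trials)
for t in range(trials):
    X_T = rng.standard_normal((n, p))
    K_inv = np.linalg.inv(X_T @ X_T.T)    # K_T = X_T X_T^* is n x n
    samples[t] = 1.0 / K_inv[0, 0]        # any diagonal index works by symmetry

dof = p - n + 1
print(samples.mean(), dof)                # chi-square mean is its degrees of freedom
print(samples.var(), 2 * dof)             # chi-square variance is twice that
```

Matching the first two moments is only a sanity check, not a proof of the distributional statement.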

Now we consider $p<n$ (i.e., $\alpha<1$ ). We have

The matrix ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}$ is non-singular almost surely, so $\|\hat{\boldsymbol{\beta}}_{T}-{\boldsymbol{\beta}}\|^{2}={\boldsymbol{\eta}}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}={\boldsymbol{\eta}}^{*}{\boldsymbol{K}}_{T}^{\dagger}{\boldsymbol{\eta}}$ also holds almost surely. Note that ${\boldsymbol{K}}_{T}$ has the same eigenvalues as ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}$ , and hence ${\boldsymbol{K}}_{T}^{\dagger}$ has the same eigenvalues as $({\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T})^{-1}$ . Therefore, following essentially the same arguments as above for handling $\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}$ (but switching the roles of $p$ and $n$ , and hence replacing $\alpha$ with $\alpha^{-1}$ ) completes the proof for $p<n$ . ∎

Appendix B Confidence bounds

Fixed-level confidence bounds can be immediately derived from the probability inequalities in Appendix A.

Consider the setting from Theorem 1 and fix any $\delta\in(0,1)$ . If $p<n$ , then with probability at least $1-\delta$ ,

If $p>n$ , then with probability at least $1-\delta$ ,

In the expressions above, we assume $n$ and $p$ are large enough (perhaps in relation to each other) so that all denominators are positive.

Appendix C Proof of Proposition 1

Let $X_{1},\dotsc,X_{p}$ denote a random sample of cardinality $p$ from the finite population $(\beta_{1}^{2},\dotsc,\beta_{D}^{2})$ , drawn without replacement, so that $\|{\boldsymbol{\beta}}_{T}\|^{2}=\sum_{j=1}^{p}X_{j}$ . Since $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}=\|{\boldsymbol{\beta}}\|^{2}-\|{\boldsymbol{\beta}}_{T}\|^{2}$ , we have

Observe that the finite population $(\beta_{1}^{2},\dotsc,\beta_{D}^{2})$ has mean $\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2}$ , variance $\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{4}-(\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{2})^{2}\leq\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}\mu^{2}-(\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2})^{2}=\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}(\mu^{2}-\tfrac{1}{D})$ , and range $\max_{j\in[D]}\beta_{j}^{2}=\|{\boldsymbol{\beta}}\|^{2}\mu^{2}$ . Therefore, Proposition 1.4 of [BM15] and a union bound imply, with probability at least $1-2e^{-t}$ ,

If $p/D$ is more than $1/2$ , then we can replace $p/D$ by $1-p/D$ on the right-hand side by analogously applying the previous argument to the random sample of cardinality $D-p$ that determines ${\boldsymbol{\beta}}_{T^{c}}$ . ∎
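The population quantities used in this proof are easy to verify numerically. The sketch below uses only the Python standard library; the function names, the Gaussian draw for ${\boldsymbol{\beta}}$ , and the seed are hypothetical. It computes the mean, variance, and range of $(\beta_{1}^{2},\dotsc,\beta_{D}^{2})$ and compares the variance with the bound $\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}(\mu^{2}-\tfrac{1}{D})$ .

```python
import random

def population_stats(beta):
    """Mean, variance and range of the finite population of squared entries."""
    D = len(beta)
    squares = [b * b for b in beta]
    mean = sum(squares) / D
    variance = sum(s * s for s in squares) / D - mean ** 2
    # The text takes max_j beta_j^2 as the range (the minimum is at least 0).
    return mean, variance, max(squares)

def variance_bound(beta):
    """(1/D) ||beta||^4 (mu^2 - 1/D), with mu = max_i |beta_i| / ||beta||."""
    D = len(beta)
    norm_sq = sum(b * b for b in beta)
    mu_sq = max(b * b for b in beta) / norm_sq
    return norm_sq ** 2 * (mu_sq - 1 / D) / D

rng = random.Random(4)                    # seed chosen arbitrarily
beta = [rng.gauss(0.0, 1.0) for _ in range(50)]
mean, variance, value_range = population_stats(beta)
print(variance, variance_bound(beta))     # the variance should not exceed the bound
```

The bound is tight when all $\beta_{j}^{2}$ are equal, in which case both the variance and the bound are zero.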