Two models of double descent for weak features

Mikhail Belkin, Daniel Hsu, Ji Xu

Introduction

The “double descent” risk curve was proposed by [Bel+19] as a general way to qualitatively describe the out-of-sample prediction performance of variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications (see Section 1.1 for references). In these studies, a predictive model with pp parameters is fit to a training sample of size nn, and the test risk (i.e., out-of-sample error) is examined as a function of pp. When pp is below the sample size nn (for regression or binary classification), the test risk is governed by the usual bias-variance decomposition. As pp is increased towards nn, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up, sometimes toward infinity. The classical bias-variance analysis identifies a “sweet spot” value of p[0,n]p\in[0,n] at which the bias and variance are balanced to achieve low test risk. However, in the “modern regime”, as pp grows beyond nn, the training risk remains zero, but the test risk decreases again, even when fitting noisy data, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). In many (but not all) cases from [Bel+19], the limiting risk as pp\to\infty is lower than what is achieved at the “sweet spot” value of pp.

In this article, we show that key aspects of the “double descent” risk curve can be observed with the least squares/least norm predictor in two simple random features models. The first is a Gaussian model studied by [BF83] in the classical pnp\leq n regime, while the second is a Fourier series model for functions on the circle. In both cases, we prove that the risk is infinite around p=np=n, and decreases again as pp increases beyond nn. When the signal-to-noise ratio is high, the minimum risk is, in fact, achieved in the modern regime, when p>np>n. Our results provide a precise mathematical analysis in a simple and tractable setting of the mechanism that was qualitatively described by [Bel+19]. In particular, it captures a key aspect of many practical over-parameterized models: that increasing the number of parameters to the maximum can lead to better performance. We also establish some non-asymptotic concentration phenomena in the Gaussian model.

We note that in both of the models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner. Such scenarios are commonplace in machine learning practice, and they should be contrasted with “scientific” scenarios where features are carefully designed or curated, as is often the case in scientific applications. For comparison, we give an example of “prescient” feature selection, where the pp features a priori known to be most useful are included in the model. In this case, the optimal test risk is achieved at some pnp\leq n, which is consistent with the classical analysis of [BF83].

The “double descent” risk curve was posited by [Bel+19] to connect the classical bias-variance trade-off to behaviors observed in over-parameterized regimes for a variety of machine learning models. The shape and features of the risk curve itself appear throughout in the literature in a number of contexts [[, e.g.,]]vallet1989linear,opper1990ability,le1991eigenvalues,krogh1992generalization,bos1998dynamics,watkin1993statistical,advani2017high; see also [Loo+20] for a “brief prehistory” that focuses on the curious peak in the curve. These prior works analyze the risk of linear classification and regression models and neural networks in high-dimensional asymptotic regimes. Our analysis in the Gaussian model gives an exact expression for the risk for any finite sample size and number of parameters.

More recently, [Nea+18] observe that similar phenomena in neural networks can be explained by a variance reduction effect of increasing network width. The transition from under- to over-parametrized regimes was recently analyzed by [Spi+18] by drawing a connection to the physical phenomenon of “jamming” in a class of glassy systems. Our analysis makes these ideas concrete and explicit in the context of simple regression models. For instance, our analysis captures the transition from under- to over-parameterized regimes at a point where an inverse Wishart random matrix has no finite expectation. It also allows us to compare the risks at any points in the curve and explain how the risk in the over-parameterized regime can be lower than any risk in the under-parameterized regime.

The initial version of this article [BHX19] appeared concurrently with the works of [Has+19], [Mut+20], and [Bar+20], all of which also study the behavior of the least squares/least norm predictor in over-parameterized linear regression. [Mut+20] focus on the well-specified scenario (essentially, p=Dp=D) and provide upper-bounds on the risk that go to zero as pp\to\infty. (A related variance analysis was carried out by [Nea+18].) [Has+19] provide a much broader range of analyses in the high-dimensional asymptotic regime, including a “misspecified” setup that is related to ours. Their analyses require weaker distributional assumptions than ours, owing to their reliance on asymptotic analysis. (A special case of the results in the follow-up work by [XH19] further broadens the range of analyses to allow highly non-isotropic designs, but again only in the high-dimensional asymptotic regime.) The analysis of [Has+19] also considers the effect of ridge regularization; in particular, they show that when the optimal level of regularization is used, the risk curve no longer shows the “double descent” shape. Finally, [Bar+20] study non-asymptotic upper and lower bounds on the risk in the over-parameterized regime, and provide a characterization in terms of certain “effective dimensions” based on the tail of the eigenvalue sequence of the covariance operator.

Gaussian model

Given nn iid copies ((x(i),y(i)))i=1n(({\boldsymbol{x}}^{(i)},y^{(i)}))_{i=1}^{n} of (x,y)({\boldsymbol{x}},y), we fit a linear model to the data only using a subset T[D]:={1,,D}{T}\subseteq[D]:=\{1,\dotsc,D\} of p:=Tp:=|{T}| variables.

Let X:=[x(1)x(n)]{\boldsymbol{X}}:=[{\boldsymbol{x}}^{(1)}|\dotsb|{\boldsymbol{x}}^{(n)}]^{*} be the n×Dn\times D design matrix, and let y:=(y(1),,y(n)){\boldsymbol{y}}:=(y^{(1)},\dotsc,y^{(n)}) be the vector of responses. For a subset A[D]A\subseteq[D] and a DD-dimensional vector v{\boldsymbol{v}}, we use vA:=(vj:jA){\boldsymbol{v}}_{A}:=(v_{j}:j\in A) to denote its A|A|-dimensional subvector of entries from AA; we also use XA:=[xA(1)xA(n)]{\boldsymbol{X}}_{A}:=[{\boldsymbol{x}}_{A}^{(1)}|\dotsb|{\boldsymbol{x}}_{A}^{(n)}]^{*} to denote the n×An\times|A| design matrix with variables from AA. For A[D]A\subseteq[D], we denote its complement by Ac:=[D]AA^{c}:=[D]\setminus A. Finally, \|\cdot\| denotes the Euclidean norm.

We fit regression coefficients β^=(β^1,,β^D)\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D}) with

Above, the symbol † denotes the Moore-Penrose pseudoinverse. In other words, we use the solution to the normal equations XTXTv=XTy{\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}{\boldsymbol{v}}={\boldsymbol{X}}_{T}^{*}{\boldsymbol{y}} of least norm for β^T\hat{\boldsymbol{\beta}}_{T} and force β^Tc\hat{\boldsymbol{\beta}}_{{T}^{c}} to all-zeros.

In this section, our analysis assumes a model in which (x,ϵ)({\boldsymbol{x}},\epsilon) follows a standard multivariate Gaussian distribution. This Gaussian model was also studied by [BF83], although their analysis is restricted to the case where the number of variables used pp is always at most nn; our analysis will also consider the pnp\geq n regime.

We derive a formula for the (prediction) risk of β^\hat{\boldsymbol{\beta}} for an arbitrary choice of pp features T[D]{T}\subseteq[D], and then examine this risk under particular selection models for T{T}.

The proof of Theorem 1 is not hard, we give the details in Section 2.2. We now turn to the risk of β^\hat{\boldsymbol{\beta}} under a random selection model for T{T}.

Let T{T} be a uniformly random subset of [D][D] of cardinality pp. In the setting of Theorem 1, the risk of β^\hat{\boldsymbol{\beta}} (taking expectation with respect to the random choice of T{T} in addition to the random design matrix and response vector) satisfies

Since T{T} is a uniformly random subset of [D][D] of cardinality pp,

Plugging into Theorem 1 completes the proof. ∎

Thus, assuming D>n+1D>n+1, we observe that the risk first increases with pp up to the “interpolation threshold” (p=np=n), after which the risk decreases with pp. Moreover, when the signal-to-noise ratio β2/σ2\|{\boldsymbol{\beta}}\|^{2}/\sigma^{2} is larger than D/(Dn1)D/(D-n-1), the risk is smallest at p=Dp=D; in particular, it is smaller than the risk at any pnp\leq n. This is the “double descent” risk curve where the first “descent” is degenerate (i.e., the “sweet spot” that balances bias and variance is at p=0p=0). See Figure 1 for an illustration.

It is worth pointing out that the behavior under the random selection model of T{T} can be very different from that under a deterministic model of T{T}. Consider including variables in T{T} by decreasing order of βj2\beta_{j}^{2}—a kind of “prescient” selection model studied by [BF83]. The behavior of the risk as a function of pp, illustrated in Figure 2, reveals a striking difference between the random selection model and the “prescient” selection model.

2 Proof of Theorem 1

Since β^Tc=0\hat{\boldsymbol{\beta}}_{{T}^{c}}={\boldsymbol{0}}, it follows that the risk of β^\hat{\boldsymbol{\beta}} is

The risk of β^\hat{\boldsymbol{\beta}} was computed by [BF83] in the regime where pnp\leq n:

We consider the regime where pnp\geq n. Recall that the pseudoinverse of XT{\boldsymbol{X}}_{T} can be written as XT=XT(XTXT){\boldsymbol{X}}_{T}^{{\dagger}}={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}. Thus, letting η:=yXTβT{\boldsymbol{\eta}}:={\boldsymbol{y}}-{\boldsymbol{X}}_{T}{\boldsymbol{\beta}}_{T},

On the right hand side, the first term (IXT(XTXT)XT)βT({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T} is the orthogonal projection of βT{\boldsymbol{\beta}}_{T} onto the null space of XT{\boldsymbol{X}}_{T}, while the second term XT(XTXT)η-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}} is a vector in the row space of XT{\boldsymbol{X}}_{T}. By the Pythagorean theorem, the squared norm of their sum is equal to the sum of their squared norms, so

We analyze the expected values of these two terms by exploiting properties of the standard normal distribution.

Note that ΠT:=XT(XTXT)XT{\boldsymbol{\Pi}}_{T}:={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T} is the orthogonal projection matrix for the row space of XT{\boldsymbol{X}}_{T}. So, by the Pythagorean theorem, we have

By rotational symmetry of the standard normal distribution, it follows that

where the second equality holds almost surely because XTXT{\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*} is almost surely invertible. Since xTβT{\boldsymbol{x}}_{T}^{*}{\boldsymbol{\beta}}_{T} and xTcβTc+σϵ{\boldsymbol{x}}_{{T}^{c}}^{*}{\boldsymbol{\beta}}_{{T}^{c}}+\sigma\epsilon are uncorrelated, it follows that

Combining the first and second terms gives the claimed expression for the risk. ∎

3 Concentration

We briefly consider the measure concentration of ββ^2\|{\boldsymbol{\beta}}-\hat{\boldsymbol{\beta}}\|^{2}.

Consider the setting from Theorem 1, and fix any ϵ(0,1)\epsilon\in(0,1). If α:=p/n<1\alpha:=p/n<1, then

The proof is given in Appendix A. The main idea for the p>np>n case is as follows. From the proof of Theorem 1, we have the decomposition

The same arguments can be used to give fixed-level confidence bounds; see Proposition 2 in Appendix B.

Finally, it is also possible to compare βT2\|{\boldsymbol{\beta}}_{T}\|^{2} to (p/D)β2(p/D)\|{\boldsymbol{\beta}}\|^{2} (and βTc2\|{\boldsymbol{\beta}}_{T^{c}}\|^{2} to (1p/D)β2(1-p/D)\|{\boldsymbol{\beta}}\|^{2}) under the random selection model of TT from Corollary 1 using concentration inequalities for sampling without replacement [BM15, see, e.g.,]. The following is a simple consequence of Proposition 1.4 of [BM15].

For any t>0t>0, with probability at least 12et1-2e^{-t},

where μ:=maxi[D]βi/β\mu:=\max_{i\in[D]}|\beta_{i}|/\|{\boldsymbol{\beta}}\|.

The proof is in Appendix C. The crucial parameter μ\mu has range [1/D,1][1/\sqrt{D},1]. It is small when there are many relevant “weak” features, each with a relatively small coefficient in β{\boldsymbol{\beta}}; conversely, it is large when β{\boldsymbol{\beta}} is concentrated on a sparse subset of features.

Fourier series model

In this section, we consider a noise-free Fourier series model, which can be regarded as a one-dimensional version of the random Fourier features model studied by [RR08] for functions defined on the unit circle.

S{S} and T{T} are independent random subsets of [D][D]. For any i[D]i\in[D], the membership of ii in S{S} (respectively, T{T}) is determined by an independent Bernoulli variable with mean ρn:=n/D\rho_{n}:=n/D (respectively, ρp:=p/D\rho_{p}:=p/D).

We observe the n×pn\times p design matrix FS,T{\boldsymbol{F}}_{{S},{T}} and nn-dimensional vector of responses μS{\boldsymbol{\mu}}_{S}. Here, FS,T{\boldsymbol{F}}_{{S},{T}} is the submatrix of F{\boldsymbol{F}} with rows from S{S} and columns from T{T}, and μS{\boldsymbol{\mu}}_{S} is the subvector of μ{\boldsymbol{\mu}} of entries from S{S}.

We fit regression coefficients β^=(β^1,,β^D)\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D}) with

One important property of the discrete Fourier transform matrix that we use is that the matrix FA,B{\boldsymbol{F}}_{A,B} has rank min{A,B}\min\{|A|,|B|\} for any A,B[D]A,B\subseteq[D]. This is a consequence of the fact that F{\boldsymbol{F}} is Vandermonde. Thus, we have

In the remainder of this section, we analyze the risk of β^\hat{\boldsymbol{\beta}} under a random model for β{\boldsymbol{\beta}}, where

Following the arguments from Section 2.1, we have

Now we take (conditional) expectations with respect to β{\boldsymbol{\beta}}, given S{S} and T{T}:

Since FS,T{\boldsymbol{F}}_{{S},{T}} has rank min{S,T}\min\{|{S}|,|{T}|\}, the first trace expression is equal to

For the second trace expression, we use the explicit formula for FS,T{\boldsymbol{F}}_{{S},{T}}^{\dagger} and the fact that FS,TFS,T+FS,TcFS,Tc=I{\boldsymbol{F}}_{{S},{T}}{\boldsymbol{F}}_{{S},{T}}^{*}+{\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}={\boldsymbol{I}} to obtain

where the λi\lambda_{i}\in are the eigenvalues of FS,TcFS,Tc{\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}. Therefore, from Equation 1, we have

To determine the asymptotic behavior of ()(*), we use a recent result of [Far11]:

as D,n,pD,n,p\to\infty with ρn=n/D\rho_{n}=n/D and ρp=p/D\rho_{p}=p/D held fixed. Further, under this limit, we have

since ρpρn\rho_{p}\geq\rho_{n}. Hence we have the following:

Assume the setting as above, with D,n,pD,n,p\to\infty and ρn=n/D\rho_{n}=n/D and ρp=p/D\rho_{p}=p/D held fixed. Then

Note that the right-hand side in the equation from Theorem 3 is well-defined in the limit because the ratios ρn,ρp\rho_{n},\rho_{p} are fixed. It diverges to ++\infty when ρp\rho_{p} is close to ρn\rho_{n}, and decreases as ρp\rho_{p} approaches 11. This is the same behavior as in the Gaussian model from Section 2 with random feature selection; we depict a non-asymptotic instantiation of it in Figure 3.

Discussion

Our analysis shows that when features are chosen in an uninformed manner, it may be optimal to choose as many as possible—even more than the number of data—rather than limit the number to that which balances bias and variance as suggested by classical analyses. This choice is simple, both conceptually and algorithmically (although it may incur a computational penalty for processing large numbers of parameters), and avoids the need for precise control of regularization parameters. It is reflective of the practice in modern machine learning applications like image and speech recognition, where signal processing-based features are individually weak but in great abundance, and models that use all of the features, notably neural networks, are highly successful. This stands in contrast to the “scientific” scenarios with informed selection of features; for example, in many science and medical applications, features are purposefully chosen based on the detailed understanding of the underlying phenomena. As illustrated by the “prescient” model that selects the best features, in that case choosing the number of features to balance bias and variance can be better than incurring the costs that come with using all of the features.

Finally we remark, that there appears to be a sharp divide between the classical analyses of statistics and machine learning in p<np<n regimes and the modern “weak but plentiful features” interpolating settings. While the former are deeply explored, an understanding of the latter is only starting to emerge. It is clear that the best practices for model and feature selection depend crucially on the regime of the application.

We thank the anonymous referees for their remarks and suggestions (which, in particular, led to the inclusion of Section 2.3). This work was carried out in part while MB was at The Ohio State University. This research was supported by NSF CCF-1740833 and IIS-1815697 awards, a Sloan Research Fellowship, a Google Faculty Award, and a Cheung-Kong Graduate School of Business Fellowship.

References

Appendix A Proof of Theorem 2

We first consider p>np>n (i.e., α>1\alpha>1). From the proof of Theorem 1, we have the decomposition

The second term XT(XTXT)η2\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2} is a (random) quadratic form in η{\boldsymbol{\eta}}. Let KT:=XTXT{\boldsymbol{K}}_{T}:={\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}, which is non-singular almost surely. By Lemma 4 from [Das00], we have for any ϵ(0,1)\epsilon\in(0,1),

where κ(XT)=σmax(XT)/σmin(XT)\kappa({\boldsymbol{X}}_{T})=\sigma_{\max}({\boldsymbol{X}}_{T})/\sigma_{\min}({\boldsymbol{X}}_{T}) is the ratio of the largest singular value of XT{\boldsymbol{X}}_{T} to the smallest singular value of XT{\boldsymbol{X}}_{T}. For any t>0t>0,

These inequalities follow from Gaussian comparison inequalities and concentration of measure on the sphere and in Gaussian space [RV09, Ver18, see, e.g.,]. Therefore, for p>(1+t)2np>(1+t)^{2}n,

Finally, observe that 1/(KT1)i,i1/({\boldsymbol{K}}_{T}^{-1})_{i,i} has a χ2\chi^{2}-distribution with pn+1p-n+1 degrees of freedom. Therefore, again using Lemma 4 from [Das00] and a union bound, we have for any ϵ(0,1)\epsilon\in(0,1),

Putting these probability inequalities together (with t=(1ϵ)(α1)t=(1-\epsilon)(\sqrt{\alpha}-1)) completes the proof for p>np>n.

Now we consider p<np<n (i.e., α<1\alpha<1). We have

The matrix XTXT{\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T} is non-singular almost surely, so β^Tβ2=η(XTXT)η=ηKTη\|\hat{\boldsymbol{\beta}}_{T}-{\boldsymbol{\beta}}\|^{2}={\boldsymbol{\eta}}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}={\boldsymbol{\eta}}^{*}{\boldsymbol{K}}_{T}^{\dagger}{\boldsymbol{\eta}} also holds almost surely. Note that KT{\boldsymbol{K}}_{T} has the same eigenvalues as XTXT{\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}, and hence KT{\boldsymbol{K}}_{T}^{\dagger} has the same eigenvalues as (XTXT)1({\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T})^{-1}. Therefore, following essentially the same arguments as above for handling XT(XTXT)η2\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2} (but switching the roles of pp and nn, and hence replacing α\alpha with α1\alpha^{-1}) completes the proof for p<np<n. ∎

Appendix B Confidence bounds

Fixed-level confidence bounds can be immediately derived from the probability inequalities in Appendix A.

Consider the setting from Theorem 1 and fix any δ(0,1)\delta\in(0,1). If p<np<n, then with probability at least 1δ1-\delta,

If p>np>n, then with probability at least 1δ1-\delta,

In the expressions above, we assume nn and pp are large enough (perhaps in relation to each other) so that all denominators are positive.

Appendix C Proof of Proposition 1

Let X1,,XpX_{1},\dotsc,X_{p} denote a random sample of cardinality pp from the finite population (β12,,βD2)(\beta_{1}^{2},\dotsc,\beta_{D}^{2}), drawn without replacement, so that βT2=j=1pXj\|{\boldsymbol{\beta}}_{T}\|^{2}=\sum_{j=1}^{p}X_{j}. Since βTc2=β2βT2\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}=\|{\boldsymbol{\beta}}\|^{2}-\|{\boldsymbol{\beta}}_{T}\|^{2}, we have

Observe that the finite population (β12,,βD2)(\beta_{1}^{2},\dotsc,\beta_{D}^{2}) has mean 1Dβ2\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2}, variance 1Dj=1Dβj4(1Dj=1Dβj2)21Dβ4μ2(1Dβ2)2=1Dβ4(μ21D)\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{4}-(\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{2})^{2}\leq\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}\mu^{2}-(\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2})^{2}=\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}(\mu^{2}-\tfrac{1}{D}), and range maxj[D]βj2=β2μ2\max_{j\in[D]}\beta_{j}^{2}=\|{\boldsymbol{\beta}}\|^{2}\mu^{2}. Therefore, Proposition 1.4 of [BM15] and a union bound implies, with probability at least 12et1-2e^{-t},

If p/Dp/D is more than 1/21/2, then we can replace p/Dp/D by 1p/D1-p/D on the right-hand side by analogously applying the previous argument to the random sample of cardinality DpD-p that determines βTc{\boldsymbol{\beta}}_{T^{c}}. ∎