Benign Overfitting in Linear Regression

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler

Introduction

Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of non-zero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest-neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative, since deep neural networks can perform well even with a perfect fit to the training data.

As one example of this phenomenon, consider the experiment illustrated in Figure 1(c) in : standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. The deep networks in the experiments reported in achieved essentially zero cross-entropy loss on the training data. In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting (“… interpolating fits… [are] unlikely to predict future data well at all.” [25, p37]). Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that fit the training data perfectly.

In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression. That is, we consider quadratic loss and linear prediction rules, and we assume that the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters, θ\theta^{*}, corresponding to the linear prediction rule that minimizes the expected quadratic loss. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of θ\theta^{*}. Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, so there might be many interpolating solutions. We consider the most natural: choose the parameter vector θ^\hat{\theta} with the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see Section 2.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate θ^\hat{\theta}—without harming prediction accuracy.

Our main result is a finite sample characterization of when overfitting is benign in this setting. The linear regression problem depends on the optimal parameters θ\theta^{*} and the covariance Σ\Sigma of the covariates xx. The properties of Σ\Sigma turn out to be crucial, since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into two terms. The first is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of Σ\Sigma) is small compared to the sample size nn, the contribution to θ^\hat{\theta} that we can view as coming from θ\theta^{*} is not too distorted. The second term is more interesting, since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of Σ\Sigma in the subspace corresponding to low variance directions is large compared to nn. This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low variance (and hence unimportant) directions in parameter space where the label noise can be hidden.

The details are more complicated. The characterization depends in a specific way on two notions of effective rank, rr and RR; the smaller one, rr, determines a split of Σ\Sigma into large and small eigenvalues, and the excess prediction error depends on the effective rank, as measured by the larger notion RR, of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of Σ\Sigma must decay slowly.

Studying the patterns of eigenvalues that allow benign overfitting reveals an interesting role for large but finite dimensions: in an infinite-dimensional setting, benign overfitting occurs only for a narrow range of decay rates of the eigenvalues. On the other hand, it occurs with any suitably slowly decaying eigenvalue sequence in a finite dimensional space whose dimension grows faster than the sample size. Thus, for linear regression, data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.

The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years, since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in Spring 2017. Belkin, Ma and Mandal described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces, and explained the mismatch between this phenomenon and classical generalization bounds. Belkin, Hsu and Mitra gave an example of an interpolating decision rule—simplicial interpolation—with an asymptotic consistency property as the input dimension gets large. That work, and subsequent work of Belkin, Rakhlin, and Tsybakov , studied kernel smoothing methods based on singular kernels that both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation (building on earlier consistency results for these unusual kernels). Liang and Rakhlin considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin, Hsu, Ma and Mandal studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and non-linear classes.

Subsequent to our work, considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning , we became aware of the concurrent work of Belkin, Hsu and Xu and of Hastie, Montanari, Rosset and Tibshirani . Belkin et al calculated the excess risk for certain linear models (a regression problem with identity covariance, sparse optimal parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al considered linear regression in an asymptotic regime, where sample size nn and input dimension pp go to infinity together with asymptotic ratio p/nγp/n\to\gamma. They assumed that, as pp gets large, the empirical spectral distribution of Σ\Sigma (the discrete measure on its set of eigenvalues) converges to a fixed measure, and they applied random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as γ\gamma, the noise variance, and the eigenvalue distribution vary. They also studied the asymptotics of a model involving random nonlinear features. In contrast, we give upper and lower bounds on the excess prediction error for arbitrary finite sample size, for arbitrary covariance matrices, and for data of arbitrary dimension.

The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression and of various notions of effective rank of the covariance operator. Section 3 gives the characterization of benign overfitting, illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting, suggesting that slowly decaying covariance eigenvalues in input spaces of growing but finite dimension are the generic example of benign overfitting. Section 4 discusses the connections between these results and the benign overfitting phenomenon in deep neural networks. Section 5 outlines the proofs of the results.

Definitions and Notation

the conditional noise variance is bounded below by some constant σ2\sigma^{2},

almost surely, the projection of the data XX on the space orthogonal to any eigenvector of Σ\Sigma spans a space of dimension nn.

By the projection theorem, parameter vectors that solve the least squares problem minβXβy2\min_{\beta}\left\|X\beta-\boldsymbol{y}\right\|^{2} solve the normal equations, so we can equivalently write θ^\hat{\theta} as the minimum norm solution to the normal equations,

Our main result gives tight bounds on the excess risk of this minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.

For the covariance operator Σ\Sigma, define λi=μi(Σ)\lambda_{i}=\mu_{i}(\Sigma) for i=1,2,i=1,2,\ldots. If i=1λi<\sum_{i=1}^{\infty}\lambda_{i}<\infty and λk+1>0\lambda_{k+1}>0 for k0k\geq 0, define

Main Results

The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum-norm interpolating estimator.

For any σx\sigma_{x} there are b,c,c1>1b,c,c_{1}>1 for which the following holds. Consider a linear regression problem from Definition 1. Define

with probability at least 1δ1-\delta, and

Moreover, there are universal constants a1,a2,n0a_{1},a_{2},n_{0} such that for all nn0n\geq n_{0}, for all Σ\Sigma, for all t0t\geq 0, there is a θ\theta^{*} with θ=t\|\theta^{*}\|=t such that for xN(0,Σ)x\sim{\cal N}(0,\Sigma) and yxN(xθ,θ2Σ)y|x\sim{\cal N}(x^{\top}\theta^{*},\|\theta^{*}\|^{2}\|\Sigma\|), with probability at least 1/41/4,

In order to understand the implications of Theorem 4, we now study relationships between the two notions of effective rank, rkr_{k} and RkR_{k}, and establish sufficient and necessary conditions for the sequence {λi}\{\lambda_{i}\} of eigenvalues to lead to small excess risk.

The following lemma shows that the two notions of effective rank are closely related. See Appendix H for its proof, and for other properties of rkr_{k} and RkR_{k}. (All appendices may be found in the supporting material.)

rk(Σ)1r_{k}(\Sigma)\geq 1, rk2(Σ)=rk(Σ2)Rk(Σ)r^{2}_{k}(\Sigma)=r_{k}(\Sigma^{2})R_{k}(\Sigma), and

Both notions of symmetry ss and SS lie between 1/p1/p (when λ20\lambda_{2}\to 0) and 11 (when the λi\lambda_{i} are all equal).

Theorem 4 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, r0(Σ)r_{0}(\Sigma) should be small compared to the sample size nn (from the first term) and rk(Σ)r_{k^{*}}(\Sigma) and Rk(Σ)R_{k^{*}}(\Sigma) should be large compared to nn. Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of non-zero eigenvalues should be large compared to nn, they should have a small sum compared to nn, and there should be many eigenvalues no larger than λk\lambda_{k^{*}}. If the number of these small eigenvalues is not much larger than nn, then they should be roughly equal, but they can be more assymmetric if there are many more of them.

The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires Σ\Sigma to have a heavy tail. (The proof—and some other examples illustrating the boundary of benign overfitting—are in Appendix I.) In particular, if we fix Σ\Sigma in an infinite-dimensional Hilbert space and ask when does the excess risk of the minimum norm estimator approach zero as nn\to\infty, it imposes tight restrictions on the eigenvalues of Σ\Sigma. But there are many other possibilities for these asymptotics if Σ\Sigma can change with nn. Since rescaling XX affects the accuracy of the least-norm interpolant in an obvious way, we may assume without loss of generality that Σ=1\|\Sigma\|=1. If we restrict our attention to this case, then, informally, Theorem 4 implies that, when the covariance operator for data with nn examples is Σn\Sigma_{n}, the least-norm interpolant converges if r0(Σn)n0\frac{r_{0}(\Sigma_{n})}{n}\rightarrow 0, knn0\frac{k_{n}^{*}}{n}\rightarrow 0, and nRkn(Σn)0\frac{n}{R_{k_{n}^{*}}(\Sigma_{n})}\rightarrow 0, and only if r0(Σn)nlog(1+r0(Σn))0\frac{r_{0}(\Sigma_{n})}{n\log(1+r_{0}(\Sigma_{n}))}\rightarrow 0, knn0\frac{k_{n}^{*}}{n}\rightarrow 0, and nRkn(Σn)0\frac{n}{R_{k_{n}^{*}}(\Sigma_{n})}\rightarrow 0, where kn=min{k0:rk(Σn)bn}k^{*}_{n}=\min\left\{k\geq 0:r_{k}(\Sigma_{n})\geq bn\right\} for the universal constant bb in Theorem 4. For this reason, we say that a sequence of covariance operators Σn\Sigma_{n} is benign if

If μk(Σ)=kαlnβ(k+1)\mu_{k}(\Sigma)=k^{-\alpha}\ln^{-\beta}(k+1), then Σ\Sigma is benign iff α=1\alpha=1 and β>1\beta>1.

and γk=Θ(exp(k/τ))\gamma_{k}=\Theta(\exp(-k/\tau)), then Σn\Sigma_{n} is benign iff pn=ω(n)p_{n}=\omega(n) and neo(n)=ϵnpn=o(n)ne^{-o(n)}=\epsilon_{n}p_{n}=o(n). Furthermore, for pn=Ω(n)p_{n}=\Omega(n) and ϵnpn=neo(n)\epsilon_{n}p_{n}=ne^{-o(n)},

Compare the situations described by Parts 1 and 2 of Theorem 6. Part 1 shows that for infinite-dimensional data with a fixed covariance, benign overfitting occurs iff the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. Part 2 shows that the situation is very different if the data has finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs iff both the dimension is large compared to the sample size, and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared to the sample size.

These examples illustrate the tension between the slow decay of eigenvalues that is needed for k/n+n/Rkk/n+n/R_{k} to be small, and the summability of eigenvalues that is needed for r0(Σ)/nr_{0}(\Sigma)/n to be small. There are two ways to resolve this tension. First, in the infinite dimensional setting, slow decay of the eigenvalues suffices—decay just fast enough to ensure summability—as shown by Part 1 of Theorem 6. (Appendix I gives another example, where the eigenvalue decay is allowed to vary with nn; in that case, Σn\Sigma_{n} is benign iff the decay rate gets close—but not too close—to 1/k1/k as nn increases.) The other way to resolve the tension is to consider a finite dimensional setting (which ensures that the eigenvalues are summable), and in this case arbitrarily slow decay is possible. Part 2 of Theorem 6 gives an example of this: eigenvalues that are all at least as large as a small constant. Appendix I gives another example, with a truncated infinite series that decays sufficiently slowly that their sum does not converge. Theorem 6(1) shows that a very specific decay rate is required in infinite dimensions, which suggests that this is an unusual phenomenon in that case. The more generic scenario where benign overfitting will occur is demonstrated by Theorem 6(2), with eigenvalues that are either constant or slowly decaying in a very high—but finite dimensional—space.

Deep neural networks

How relevant are Theorems 4 and 6 to the phenomenon of benign overfitting in deep neural networks? One connection appears by considering regimes where deep neural networks are well-approximated by linear functions of their parameters. This so-called neural tangent kernel (NTK) viewpoint has been vigorously pursued recently in an attempt to understand the optimization properties of deep learning methods. Very wide neural networks, trained with gradient descent from a suitable random initialization, can be accurately approximated by linear functions in an appropriate Hilbert space, and in this case gradient descent finds an interpolating solution quickly; see . (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, [29, Assumption A1] implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of [1, Theorem 5.1] becomes vacuous when independent noise is added to the yiy_{i}s.) The eigenvalues of the covariance operator in this case can have a heavy tail under reasonable assumptions on the data distribution (see , where this kernel was introduced, and ), and the dimension is very large but finite, as required for benign overfitting. However, the assumptions of Theorem 4 do not apply in this case. In particular, the assumption that the random elements of the Hilbert space are a linearly transformed vector with independent components is not satisfied. Thus, our results are not directly applicable in this—somewhat unrealistic—setting. Note that the slow decay of the eigenvalues of the NTK is in contrast to the case of the gaussian and other smooth kernels, where the eigenvalues decay nearly exponentially quickly .

The phenomenon of benign overfitting was first observed in deep neural networks. Theorems 4 and 6 are steps towards understanding this phenomenon by characterizing when it occurs in the simple setting of linear regression. Those results suggest that covariance eigenvalues that are constant or slowly decaying in a high (but finite) dimensional space might be important in the deep network setting also. Some authors have suggested viewing neural networks as finite-dimensional approximations to infinite dimensional objects , and there are generalization bounds—although not for the overfitting regime—that are applicable to infinite width deep networks with parameter norm constraints . However, the intuition from the linear setting suggests that truncating to a finite dimensional space might be important for good statistical performance in the overfitting regime. Confirming this conjecture by extending our results to the setting of prediction in deep neural networks is an important open problem.

Proof

Throughout the proofs, we treat σx\sigma_{x} (the subgaussian norm of the covariates) as a constant. Therefore, we use the symbols b,c,c1,c2,b,c,c_{1},c_{2},\ldots to refer to constants that only depend on σx\sigma_{x}. Their values are suitably large (and always at least 11) but do not depend on any parameters of the problems we consider, besides σx\sigma_{x}. For universal constants that do not depend on any parameters of the problem at all we use the symbol aa. Also, whenever we sum over eigenvectors of Σ\Sigma, the sum is restricted to eigenvectors with non-zero eigenvalues.

The first step is a standard decomposition of the excess risk into two pieces, a term that corresponds to the distortion that is introduced by viewing θ\theta^{*} through the lens of the finite sample and a term that corresponds to the distortion introduced by the noise ε=yXθ\boldsymbol{\varepsilon}=\boldsymbol{y}-X\theta. The impact of both sources of error in θ^\hat{\theta} on the excess risk is modulated by the covariance Σ\Sigma, which gives different weight to different directions in parameter space.

The excess risk of the minimum norm estimator satisfies

with probability at least 1δ1-\delta over ϵ\epsilon, and

Unit variance subgaussians

Our assumptions allow the trace of CC to be expressed as a function of many independent subgaussian vectors.

where Ai=jiλjzjzjA_{-i}=\sum_{j\neq i}\lambda_{j}z_{j}z_{j}^{\top}.

By Assumption 2 in Definition 1, the random variables xvi/λix^{\top}v_{i}/\sqrt{\lambda_{i}} are independent σx2\sigma_{x}^{2}-subgaussian. We consider XX in the basis of eigenvectors of Σ\Sigma, Xvi=λiziXv_{i}=\sqrt{\lambda_{i}}z_{i}, to see that

For the second part, we use Lemma 20, which is a consequence of the Sherman-Woodbury-Morrison formula; see Appendix B.

by Lemma 20, for the case k=1k=1 and Z=λiziZ=\sqrt{\lambda_{i}}z_{i}. Note that AiA_{-i} is invertible by Assumption 5 in Definition 1. ∎

The weighted sum of outer products of these subgaussian vectors plays a central role in the rest of the proof. Define

Concentration of A𝐴A

The next step is to show that eigenvalues of AA, AiA_{-i} and AkA_{k} are concentrated. The proof of the following inequality is in Appendix C. Recall that μ1(A)\mu_{1}(A) and μn(A)\mu_{n}(A) denote the largest and the smallest eigenvalues of the n×nn\times n matrix AA.

There is a constant cc such that for any k0k\geq 0 with probability at least 12en/c1-2e^{-n/c},

The following lemma uses this result to give bounds on the eigenvalues of AkA_{k}, which in turn give bounds on some eigenvalues of AiA_{-i} and AA. For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of AkA_{k} should dominate the term involving its leading eigenvalue, which is a condition on the effective rank rk(Σ)r_{k}(\Sigma). The lemma shows that once rk(Σ)r_{k}(\Sigma) is sufficiently large, all of the eigenvalues of AkA_{k} are identical up to a constant factor.

There are constants b,c1b,c\geq 1 such that for any k0k\geq 0, with probability at least 12en/c1-2e^{-n/c},

By Lemma 9, we know that with probability at least 12en/c11-2e^{-n/c_{1}},

First, the matrix AAkA-A_{k} has rank at most kk (as a sum of kk matrices of rank 11). Thus, there is a linear space L\mathscr{L} of dimension nkn-k such that for all vLv\in\mathscr{L}, vAv=vAkvμ1(Ak)v2v^{\top}Av=v^{\top}A_{k}v\leq\mu_{1}(A_{k})\|v\|^{2}, and so μk+1(A)μ1(Ak)\mu_{k+1}(A)\leq\mu_{1}(A_{k}).

Second, by the Courant-Fischer-Weyl Theorem, for all ii and jj, μj(Ai)μj(A)\mu_{j}(A_{-i})\leq\mu_{j}(A) (see Lemma 28). On the other hand, for iki\leq k, AkAiA_{k}\preceq A_{-i}, so all the eigenvalues of AiA_{-i} are lower bounded by μn(Ak)\mu_{n}(A_{k}).

Choosing b>c12b>c_{1}^{2} and c>max{c1+1/c1,(1/c1c1/b)1}c>\max\left\{c_{1}+1/c_{1},\left(1/c_{1}-c_{1}/b\right)^{-1}\right\} gives the third claim of the lemma. ∎

Upper bound on the trace term

There are constants b,c1b,c\geq 1 such that if 0kn/c0\leq k\leq n/c, rk(Σ)bnr_{k}(\Sigma)\geq bn, and lkl\leq k then with probability at least 17en/c1-7e^{-n/c},

The proof uses the following lemma and its corollary. Their proofs are in Appendix C.

Suppose {λi}i\{\lambda_{i}\}_{i}^{\infty} is a non-increasing sequence of non-negative numbers such that i=1λi<\sum_{i=1}^{\infty}\lambda_{i}<\infty, and {ξi}i=1\{\xi_{i}\}_{i=1}^{\infty} are independent centered σ\sigma-subexponential random variables. Then for some universal constant aa for any t>0t>0 with probability at least 12et1-2e^{-t}

where ΠL\Pi_{\mathscr{L}} is the orthogonal projection on L\mathscr{L}.

(of Lemma 11) Fix bb to its value in Lemma 10. By Lemma 8,

and the upper bounds on the μk+1(Ai)\mu_{k+1}(A_{-i})’s give

where Li\mathscr{L}_{i} is the span of the nkn-k eigenvectors of AiA_{-i} corresponding to its smallest nkn-k eigenvalues. So for ili\leq l,

Next, we apply Corollary 13 ll times, together with a union bound, to show that with probability at least 13et1-3e^{-t}, for all 1il1\leq i\leq l,

provided that t<n/c0t<n/c_{0} and c>c0c>c_{0} for some sufficiently large c0c_{0} (note that c2c_{2} and c3c_{3} only depend on c0c_{0}, aa and σx\sigma_{x}, and we can still take cc large enough in the end without changing c2c_{2} and c3c_{3}). Combining (3), (4), and (5), with probability at least 15en/c01-5e^{-n/c_{0}},

Second, consider the second sum in (2). Lemma 10 shows that, on the same high probability event that we considered in bounding the first half of the sum, μn(A)λk+1rk(Σ)/c1\mu_{n}(A)\geq\lambda_{k+1}r_{k}(\Sigma)/c_{1}. Hence,

Notice that i>lλi2zi2\sum_{i>l}\lambda_{i}^{2}\|z_{i}\|^{2} is a weighted sum of σx2\sigma_{x}^{2}-subexponential random variables, with the weights given by the λi2\lambda_{i}^{2} in blocks of size nn. Lemma 12 implies that, with probability at least 12et1-2e^{-t},

because t<n/c0t<n/c_{0}. Combining the above gives

Finally, putting both parts together and taking c>max{c0,c4,c6}c>\max\{c_{0},c_{4},c_{6}\} gives the lemma. ∎

Lower bound on the trace term

There is a constant cc such that for any i1i\geq 1 with λi>0\lambda_{i}>0, and any 0kn/c0\leq k\leq n/c, with probability at least 15en/c1-5e^{-n/c},

Suppose nn\leq\infty and {ηi}i=1n\{\eta_{i}\}_{i=1}^{n} is a sequence of non-negative random variables, {ti}i=1n\{t_{i}\}_{i=1}^{n} is a sequence of non-negative real numbers (at least one of which is strictly positive) such that for some δ(0,1)\delta\in(0,1) and any ini\leq n, Pr(ηi>ti)1δ\Pr(\eta_{i}>t_{i})\geq 1-\delta. Then

These two lemmas imply the following lower bound.

There are constants cc such that for any 0kn/c0\leq k\leq n/c and any b>1b>1 with probability at least 110en/c1-10e^{-n/c},

From Lemmas 8, 14 and 15, with probability at least 110en/c11-10e^{-n/c_{1}},

Now, if rk(Σ)<bnr_{k}(\Sigma)<bn, then the second term in the minimum is always bigger than the third term, and in that case,

On the other hand, if rk(λ)bnr_{k}(\lambda)\geq bn,

where the equality follows from the fact that the λi\lambda_{i}s are non-increasing. ∎

A simple choice of l𝑙l

Recall that σx\sigma_{x} is a constant. If no kn/ck\leq n/c has rk(Σ)bnr_{k}(\Sigma)\geq bn, then Lemmas 7 and 16 imply that the expected excess risk is Ω(σ2)\Omega(\sigma^{2}), which proves the first paragraph of Theorem 4 for large kk^{*}. If some kn/ck\leq n/c does have rk(Σ)bnr_{k}(\Sigma)\geq bn, then the upper and lower bounds of Lemmas 11 and 16 are constant multiples of

For any b1b\geq 1 and k:=min{k:rk(Σ)bn}k^{*}:=\min\left\{k:r_{k}(\Sigma)\geq bn\right\}, if k<k^{*}<\infty, we have

Finally, we can finish the proof of Theorem 4. Set bb in Lemma 16 and Theorem 4 to the constant bb from Lemma 11. Take c1c_{1} to be the maximum of the constants cc from Lemmas 16 and 11.

The proof of the second paragraph is in Appendix K.

Conclusions and Further Work

Our results characterize when the phenomenon of benign overfitting occurs in high dimensional linear regression, with gaussian data and more generally. We give finite sample excess risk bounds that reveal the covariance structure that ensures that the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization depends on two notions of the effective rank of the data covariance operator. It shows that overparameterization, that is, the existence of many low-variance and hence unimportant directions in parameter space, is essential for benign overfitting, and that data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of Google through a Google Research Award. Part of this work was done as part the Fall 2018 program on Foundations of Data Science at the Simons Institute for the Theory of Computing. Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P and FEDER, EU; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI”.

References

Appendix A Proof of Lemma 7

We first give the decomposition of the excess risk.

The excess risk of the minimum norm estimator satisfies

Since ε=yxθ\varepsilon=y-x^{\top}\theta^{*} has mean zero conditionally on xx,

Using (1), the definition of Σ\Sigma, and the fact that y=Xθ+ε\boldsymbol{y}=X\theta^{*}+\boldsymbol{\varepsilon},

Also, since ε\boldsymbol{\varepsilon} has zero mean conditionally on XX, and is independent of xx, we have

The following lemma shows that we can obtain a high-probability upper bound on the term εCε\boldsymbol{\varepsilon}^{\top}C\boldsymbol{\varepsilon} in terms of the trace of CC. It is Lemma 36 in .

Combining this with Lemma 18 implies Lemma 7.

Appendix B An Algebraic Property

We use the Sherman–Morrison–Woodbury formula to write

Denote M1:=ZA1ZM_{1}:=Z^{\top}A^{-1}Z and M2:=ZA2ZM_{2}:=Z^{\top}A^{-2}Z. Applying (6), we get

where we used the identity I(I+M1)1M1=(I+M1)1I-(I+M_{1})^{-1}M_{1}=(I+M_{1})^{-1} twice in the second last equality and the identity IM1(I+M1)1=(I+M1)1I-M_{1}(I+M_{1})^{-1}=(I+M_{1})^{-1} in the last equality. ∎

Appendix C Proof of concentration inequalities

We use some standard results about subgaussian and subexponential random variables.

First of all, we need the following direct consequence of Propositions 2.5.2 and 2.7.1 and Lemma 2.7.6 from :

There is a universal constant cc such that for any random variable ξ\xi that is centered, σ2\sigma^{2}-subgaussian, and unit variance, ξ21\xi^{2}-1 is a centered cσ2c\sigma^{2}-subexponential random variable, that is,

Second, we are going to use the following form of Bernstein’s inequality, which is Theorem 2.8.2 in :

There is a universal constant cc such that for any non-increasing sequence {λi}i=1\{\lambda_{i}\}_{i=1}^{\infty} of non-negative numbers such that i=1λi<\sum_{i=1}^{\infty}\lambda_{i}<\infty, and any independent, centered, σ\sigma-subexponential random variables {ξi}i=1\{\xi_{i}\}_{i=1}^{\infty}, and any x>0x>0, with probability at least 12ex1-2e^{-x}

where ΠL\Pi_{\mathscr{L}} is the orthogonal projection on L\mathscr{L}.

First of all, since z2=i=1nzi2\|z\|^{2}=\sum_{i=1}^{n}z_{i}^{2} — a sum of nn σ2\sigma^{2}-subexponential random variables, by Corollary 23, for some absolute constant cc and for any t>0t>0, with probability at least 12et1-2e^{-t},

Thus, with probability at least 13et1-3e^{-t}

where the first inequality holds because the λi\lambda_{i}s are decreasing in magnitude, and the last two inequalities hold since the functions x+x2x+x^{2} and 2x+x22x+x^{2} are both increasing on (12,)(-\frac{1}{2},\infty) and Δv1Δvϵ12\Delta v_{1}\geq-\|\Delta v\|\geq-\epsilon\geq-\frac{1}{2}. ∎

There is a universal constant cc such that with probability at least 12en/c1-2e^{-n/c},

Let N\mathcal{N} be a 14\frac{1}{4}-net on the sphere Sn1\mathcal{S}^{n-1} with respect to the Euclidean distance such that N9n|\mathcal{N}|\leq 9^{n}. Applying the union bound over the elements of N\mathcal{N}, we see that with probability 12et1-2e^{-t}, every vNv\in\mathcal{N} satisfies

Since N\mathcal{N} is a 14\frac{1}{4}-net, by Lemma 25, we need to multiply the quantity above by (11/4)2(1-1/4)^{-2} to get the bound on the norm of the AIniλiA-I_{n}\sum_{i}\lambda_{i}. Denote

Thus, with probability at least 12et1-2e^{-t},

When t<n/c4t<n/c_{4} we can write t+nln9c5nt+n\ln 9\leq c_{5}n, and we have

by the AMGM inequality. (Recall that c1,c2,c_{1},c_{2},\ldots denote universal constants with value at least 11, and σx1/c7\sigma_{x}\geq 1/c_{7} is the subgaussian constant of a random variable with unit variance.) ∎

Appendix D Proof of Lemma 14

Fix i1i\geq 1 with λi>0\lambda_{i}>0 and 0kn/c0\leq k\leq n/c. By Lemma 10, with probability at least 12en/c11-2e^{-n/c_{1}},

By Corollary 13, with probability at least 13et1-3e^{-t},

provided that t<n/c0t<n/c_{0} and c>c0c>c_{0} for some sufficiently large c0c_{0}. Thus, with probability at least 15en/c31-5e^{-n/c_{3}},

Dividing λi2ziAi2zi\lambda_{i}^{2}z_{i}^{\top}A_{-i}^{-2}z_{i} by the square of both sides, we have

Also, from the Cauchy-Schwarz inequality and Corollary 13 again, we have that on the same event,

Choosing cc suitably large gives the lemma.

Appendix E Proof of Lemma 15

and denote its probability as cδc\delta for some c(0,δ1)c\in(0,\delta^{-1}). On the one hand, by the definition of the event, we have

On the other hand, note that for any ii,

Appendix F Proof of Lemma 17

We can write the function of ll being minimized as

where ll^{*} is the largest value of iki\leq k^{*} for which

since the λi2\lambda_{i}^{2} are non-increasing. This condition holds iff

The definition of kk^{*} implies rk1(Σ)<bnr_{k^{*}-1}(\Sigma)<bn. So we can write

and so the minimizing ll is kk^{*}. Also,

Appendix G Eigenvalue monotonicity

Recall (half of) the Courant-Fischer-Weyl theorem.

If symmetric matrices AA and BB satisfy ABA\preceq B, then, for any i[n]i\in[n], we have μi(A)μi(B)\mu_{i}(A)\leq\mu_{i}(B).

Appendix H Rank facts

The quantity r0(Σ)r_{0}(\Sigma) is an important complexity parameter for covariance estimation problems, where it has been called the ‘effective rank’ . Earlier, r0(Σ2)r_{0}(\Sigma^{2}) was called the ‘stable rank’ and the ‘numerical rank’ , although that term has a different meaning in computational linear algebra [23, p261].

rk(Σ)1r_{k}(\Sigma)\geq 1, rk2(Σ)=rk(Σ2)Rk(Σ)r^{2}_{k}(\Sigma)=r_{k}(\Sigma^{2})R_{k}(\Sigma), and rk(Σ2)rk(Σ)Rk(Σ)rk2(Σ)r_{k}(\Sigma^{2})\leq r_{k}(\Sigma)\leq R_{k}(\Sigma)\leq r_{k}^{2}(\Sigma).

The first inequality and the equality are immediate from the definitions. Together they imply Rk(Σ)rk2(Σ)R_{k}(\Sigma)\leq r_{k}^{2}(\Sigma). For the second inequality,

Substituting this in the equality implies rk(Σ)Rk(Σ)r_{k}(\Sigma)\leq R_{k}(\Sigma). ∎

Writing rkr_{k} and RkR_{k} for rk(Σ)r_{k}(\Sigma) and Rk(Σ)R_{k}(\Sigma),

Thus, the function ϕ(k)=k/(b2n)+n/Rk\phi(k)=k/(b^{2}n)+n/R_{k} satisfies the monotonicity property ϕ(k+1)>ϕ(k)\phi(k+1)>\phi(k) whenever rk>bn1r_{k}>bn\geq 1.

Since rk>1r_{k}>1, 0<1(21/rk)/rk<10<1-\left(2-1/r_{k}\right)/r_{k}<1, so

Appendix I Conditions on eigenvalues

In this section, we prove the following expanded version of Theorem 6.

Define λk,n:=μk(Σn)\lambda_{k,n}:=\mu_{k}(\Sigma_{n}) for all k,nk,n.

If λk,n=kαlnβ(k+1)\lambda_{k,n}=k^{-\alpha}\ln^{-\beta}(k+1), then Σn\Sigma_{n} is benign iff α=1\alpha=1 and β>1\beta>1.

If λk,n=k(1+αn)\lambda_{k,n}=k^{-(1+\alpha_{n})}, then Σn\Sigma_{n} is benign iff ω(1/n)=αn=o(1)\omega(1/n)=\alpha_{n}=o(1). Furthermore,

then Σn\Sigma_{n} is benign iff either 0<α<10<\alpha<1, pn=ω(n)p_{n}=\omega(n) and pn=o(n1/(1α))p_{n}=o\left(n^{1/(1-\alpha)}\right) or α=1\alpha=1, pn=eω(n)p_{n}=e^{\omega(\sqrt{n})} and pn=eo(n)p_{n}=e^{o(n)}.

and γk=Θ(exp(k/τ))\gamma_{k}=\Theta(\exp(-k/\tau)), then Σn\Sigma_{n} is benign iff pn=ω(n)p_{n}=\omega(n) and neo(n)=ϵnpn=o(n)ne^{-o(n)}=\epsilon_{n}p_{n}=o(n). Furthermore, for pn=Ω(n)p_{n}=\Omega(n) and ϵnpn=neo(n)\epsilon_{n}p_{n}=ne^{-o(n)},

We build up the proof in stages. First, we characterize those sequences of effective ranks that can arise.

Consider some positive summable sequence {λi}i=1\{\lambda_{i}\}_{i=1}^{\infty}, and for any non-negative integer ii denote

Then ri>1r_{i}>1 and iri1=\sum_{i}r_{i}^{-1}=\infty. Moreover, for any positive sequence {ui}\{u_{i}\} such that i=0ui1=\sum_{i=0}^{\infty}u_{i}^{-1}=\infty and for every ii ui>1u_{i}>1, there exists a positive sequence {λi}\{\lambda_{i}\} (unique up to constant multiplier) such that riuir_{i}\equiv u_{i}. The sequence is (a constant rescaling of)

which goes to zero if and only if iri1=\sum_{i}r_{i}^{-1}=\infty. On the other hand, we may rewrite the first equality in the proof as

So for any sequence {ui}\{u_{i}\} we can uniquely (up to a constant multiplier) recover the sequence {λi}\{\lambda_{i}\} such that ri=uir_{i}=u_{i} — the only candidate is

However, for such {λi}\{\lambda_{i}\} one can compute

so the resulting sequence {λi}\{\lambda_{i}\} sums to 1, and

Suppose bb is some constant, and k(n)=min{k:rkbn}k^{\ast}(n)=\min\{k:r_{k}\geq bn\}. Suppose also that the sequence {rn}\{r_{n}\} is increasing. Then, as nn goes to infinity, k(n)/nk^{\ast}(n)/n goes to zero if and only if rn/nr_{n}/n goes to infinity.

We prove the “if” part separately from the “only if” part.

If k(n)/n0k^{\ast}(n)/n\to 0 then rn/nr_{n}/n\to\infty.

Fix some C>1C>1. Since k(n)/n0k^{\ast}(n)/n\to 0, there exists some NCN_{C} such that for any nNCn\geq N_{C}, k(n)<n/C.k^{\ast}(n)<n/C. Thus, for all n>NCn>N_{C},

Since the constant CC is arbitrary, rn/nr_{n}/n goes to infinity.

If rn/nr_{n}/n\to\infty then k(n)/n0k^{\ast}(n)/n\to 0 .

Fix some constant C>1C>1. Since rn/nr_{n}/n\to\infty there exists some NCN_{C} such that for any nNCn\geq N_{C}, rn>Cnr_{n}>Cn. Thus, for any n>CNC/bn>CN_{C}/b

Since the constant CC is arbitrary, k(n)/nk^{\ast}(n)/n goes to zero.

Suppose the sequence {ri}\{r_{i}\} is increasing and rn/nr_{n}/n\to\infty as nn\to\infty. Then a sufficient condition for nRk(n)0\frac{n}{R_{k^{\ast}(n)}}\to 0 is

For example, this condition holds for rn=nlognr_{n}=n\log n.

Since rk(n)bnr_{k^{\ast}(n)}\geq bn and limnk(n)=\lim_{n\rightarrow\infty}k^{*}(n)=\infty, it is enough to prove that i>kλi2λk+12rk0\frac{\sum_{i>k}\lambda_{i}^{2}}{\lambda_{k+1}^{2}r_{k}}\to 0 as kk goes to infinity. Since

and it is sufficient to prove that the latter quantity goes to zero. We write

Since both numerator and denominator are decreasing in kk and go to zero as kk\to\infty, we can apply the Stolz–Cesáro theorem (an analog of L’Hôpital’s rule for discrete sequences):

where the last line is due to our sufficient condition. ∎

Part 1, if direction, first term: We have

Part 1, if direction, second term: By Theorem 33, it suffices to prove that limnrnn=\lim_{n\rightarrow\infty}\frac{r_{n}}{n}=\infty. This holds because

Part 1, if direction, third term: By Theorem 34, it suffices to prove that rk2=o(rk1rk+11)r_{k}^{-2}=o(r_{k}^{-1}-r_{k+1}^{-1}), that is

As argued above, when α=1\alpha=1 and β>1\beta>1, rk=Θ(klogk)r_{k}=\Theta(k\log k), so it suffices to show that limk(rk+1rk)=\lim_{k\rightarrow\infty}(r_{k+1}-r_{k})=\infty. We have

Since λi\lambda_{i} is non-increasing, we have

If we define ff on the positive reals by f(x)=xlogβ(x+1)f(x)=x\log^{\beta}(x+1), then ff is convex, and, since f(x)=βxlogβ1(x+1)x+1+logβ(x+1)f^{\prime}(x)=\frac{\beta x\log^{\beta-1}(x+1)}{x+1}+\log^{\beta}(x+1), we have

which goes to infinity for large kk, completing the proof of the “if” direction of the third term of Part 1.

Part 1, only if direction, α>1\alpha>1: If α>1\alpha>1, then

which does not grow faster than nn. Thus, by Theorem 33, k(n)/nk^{\ast}(n)/n does not go to zero.

Part 1, only if direction, α<1\alpha<1, or α=1\alpha=1 and β1\beta\leq 1: In this case, since, as above

and i=11iαlogβ(1+i)\sum_{i=1}^{\infty}\frac{1}{i^{\alpha}\log^{\beta}(1+i)} diverges in this case, Σnr0(Σn)n\frac{||\Sigma_{n}||\sqrt{r_{0}(\Sigma_{n})}}{n} does not go to zero.

Before starting on Part 2, let us define rk,n=rk(Σn)r_{k,n}=r_{k}(\Sigma_{n}) and Rk,n=Rk(Σn)R_{k,n}=R_{k}(\Sigma_{n}).

Part 2, if direction, first term: We have

so Σnr0,nn1+1αnn||\Sigma_{n}||\sqrt{\frac{r_{0,n}}{n}}\leq\sqrt{\frac{1+\frac{1}{\alpha_{n}}}{n}} which goes to zero with nn if αn=ω(1/n)\alpha_{n}=\omega(1/n).

Part 2, if direction, second term: First,

Thus, k(n)=O(αnn)k^{*}(n)=O(\alpha_{n}n), so that k(n)n=O(αn)=o(1)\frac{k^{*}(n)}{n}=O(\alpha_{n})=o(1).

Part 2, if direction, third term: We bound Rk,nR_{k,n} from below by separately bounding its numerator and denominator:

So now we want a lower bound on k(n)k^{*}(n). For that, we need an upper bound on rk,nr_{k,n}, and

This implies 2k(n)αneαn/k(n)bn\frac{2k^{*}(n)}{\alpha_{n}}e^{\alpha_{n}/k^{*}(n)}\geq bn. This, together with the fact that, for u>1u>1, ue1/uue^{1/u} is an increasing function of uu, implies that, for large enough nn, k(n)αnbn/3k^{*}(n)\geq\alpha_{n}bn/3. Since αn=ω(1/n)\alpha_{n}=\omega(1/n), this implies that k(n)=ω(1)k^{*}(n)=\omega(1). Combining this with (7), for large enough nn

Thus n/Rk(n),n=O(αn)=o(1)n/R_{k^{*}(n),n}=O(\alpha_{n})=o(1).

Part 2, only if direction, αn=O(1/n)\alpha_{n}=O(1/n): We have

so Σnr0,nn1αnn||\Sigma_{n}||\sqrt{\frac{r_{0,n}}{n}}\geq\sqrt{\frac{1}{\alpha_{n}n}}, which is bounded below by a constant for large nn if αn=O(1/n)\alpha_{n}=O(1/n).

Part 2, only if direction, αn=Ω(1)\alpha_{n}=\Omega(1): Recall that, in the proof of the “if” direction of the third term, we showed that k(n)αnbn/3k^{*}(n)\geq\alpha_{n}bn/3. This implies that k(n)n=Ω(αn)\frac{k^{*}(n)}{n}=\Omega(\alpha_{n}).

Part 3: Suppose that Σn\Sigma_{n} is benign. Then because Rk(Σn)pnkR_{k}(\Sigma_{n})\leq p_{n}-k, we must have pn=ω(n)p_{n}=\omega(n). Thus, we can restrict our attention to the sequences for which pn=ω(n)p_{n}=\omega(n) and find the necessary and sufficient conditions for that class.

Next, for any positive α\alpha and any natural number k[1,pn)k\in[1,p_{n}), we can write

As the sequence can only be benign if k=o(n)k^{\ast}=o(n), we can only consider values of kk that do not exceed some constant fraction of nn, e.g. n/2n/2. Since pn=ω(n)p_{n}=\omega(n), noting that, for x>0x>0, the sign of 11αx1α\frac{1}{1-\alpha}x^{1-\alpha} flips when α\alpha crosses 11, we can write, uniformly for all k[1,n/2]k\in[1,n/2],

Recall that we consider λi,n=iα\lambda_{i,n}=i^{-\alpha} for ipni\leq p_{n}. Using the formula above, we get uniformly for all k[1,n/2]k\in[1,n/2]

Recall that k=min{k:rk(Σn)bn}k^{\ast}=\min\{k:r_{k}(\Sigma_{n})\geq bn\}. We compute

One can see that for α>1\alpha>1, k=Ωα(n)k^{\ast}=\Omega_{\alpha}(n), so the sequence is not benign for α>1\alpha>1. On the other hand, k=o(n)k^{\ast}=o(n) for α1\alpha\leq 1.

Next, analogously to the asymptotics for rk(Σ)r_{k}(\Sigma), we have

Since Rk=rk(Σ)2rk(Σ2)R_{k}=\frac{r_{k}(\Sigma)^{2}}{r_{k}(\Sigma^{2})}, we can write uniformly for all k[1,n/2]k\in[1,n/2]

Now we plug in kk^{\ast} instead of kk. Recall that pn/k=Θα((pn/n)1/α)p_{n}/k^{\ast}=\Theta_{\alpha}\left((p_{n}/n)^{1/\alpha}\right) for α(0,1)\alpha\in(0,1), and pn/k=Θα(pn/nln(pn/n))p_{n}/k^{\ast}=\Theta_{\alpha}\left(p_{n}/n\ln(p_{n}/n)\right) for α=1\alpha=1. We get

Since pn=ω(n)p_{n}=\omega(n), for any α(0,1)\alpha\in(0,1), Rk=ω(n)R_{k^{\ast}}=\omega(n). For α=1\alpha=1 the necessary and sufficient for Rk=ω(n)R_{k^{\ast}}=\omega(n) is ln(pn/n)=ω(n)\ln(p_{n}/n)=\omega(\sqrt{n}).

So far, we obtained the necessary and sufficient conditions for the last terms to go to zero. Now let’s look at the upper bound for the first term: since λ1,n1\lambda_{1,n}\equiv 1, we just need r0/n0r_{0}/n\to 0. We write, for α(0,1]\alpha\in(0,1],

Thus, for α<1\alpha<1, r0(Σn)/nr_{0}(\Sigma_{n})/n goes to zero if and only if pn=o(n1/(1α))p_{n}=o\left(n^{1/(1-\alpha)}\right), and for α=1\alpha=1, r0(Σn)/nr_{0}(\Sigma_{n})/n goes to zero if and only if ln(pn)=o(n)\ln(p_{n})=o(n).

Part 4: Suppose that Σn\Sigma_{n} is benign. Then because Rk(Σn)pnkR_{k}(\Sigma_{n})\leq p_{n}-k, we must have pn=ω(n)p_{n}=\omega(n). Also,

and so pnϵn=o(n)p_{n}\epsilon_{n}=o(n). Since Σn\Sigma_{n} benign implies k=o(n)k^{*}=o(n), and hence k=o(pn)k^{*}=o(p_{n}), we consider k=o(pn)k=o(p_{n}). In this regime,

Substituting k=τln(n/(pnϵn))ak=\tau\ln(n/(p_{n}\epsilon_{n}))-a gives

which shows that kτln(n/(pnϵn))O(1)k^{*}\geq\tau\ln(n/(p_{n}\epsilon_{n}))-O(1). Thus, if Σn\Sigma_{n} is benign, we must have k=o(n)k^{*}=o(n), that is, ϵnpn=neo(n)\epsilon_{n}p_{n}=ne^{-o(n)}.

Conversely, assume pn=Ω(n)p_{n}=\Omega(n) and ϵnpn=neo(n)\epsilon_{n}p_{n}=ne^{-o(n)} (that is, ln(n/(pnϵn))=o(n)\ln(n/(p_{n}\epsilon_{n}))=o(n)). Set k=τln(n/(pnϵn))ak=\tau\ln(n/(p_{n}\epsilon_{n}))-a, for some aa, which we shall see is Θ(1)\Theta(1). Notice that k=o(n)k=o(n), so pnk=Ω(pn)p_{n}-k=\Omega(p_{n}) and epn=o(ek)e^{-p_{n}}=o(e^{-k}). Thus,

which shows that k=τln(n/(pnϵn))+O(1)k^{*}=\tau\ln(n/(p_{n}\epsilon_{n}))+O(1). Also, we have

Now, it is clear that pn=ω(n)p_{n}=\omega(n), ϵnpn=o(n)\epsilon_{n}p_{n}=o(n), and ϵnpn=neo(n)\epsilon_{n}p_{n}=ne^{-o(n)} imply that Σn\Sigma_{n} is benign.

Appendix J Upper bound on the B𝐵B term

We can control the term θBθ{\theta^{\ast}}^{\top}B\theta^{\ast} in Lemma 7 using a standard argument.

There is a constant cc, that depends only on σx\sigma_{x}, such that for any 1<t<n1<t<n, with probability at least 1et1-e^{-t},

Moreover, for any vv in the orthogonal complement to the span of the columns of XX^{\top},

Thus, due to Theorem 9 in , there is an absolute constant cc such that for any t>1t>1 with probability at least 1et1-e^{-t},

Appendix K Another lower bound

In this section, we prove the second paragraph of Theorem 4.

since, otherwise, the lower bound is vacuously satisfied.

so that, informally, a successful learning algorithm achieves ρ(θ^,θ)<τ0\rho(\hat{\theta},\theta)<\sqrt{\tau_{0}}.

Define sets S1,S2,...S_{1},S_{2},... of indices as follows. Let S1={1}S_{1}=\{1\}; let S2={2,...,i2}S_{2}=\{2,...,i_{2}\}, for the least i2i_{2} such that i=2i2λi1\sum_{i=2}^{i_{2}}\lambda_{i}\geq 1. Continue the same way as long as possible; for all j>2j>2, let Sj={ij1,...,ij}S_{j}=\{i_{j-1},...,i_{j}\}, where iji_{j} is the least index such that i=ij1ijλi1\sum_{i=i_{j-1}}^{i_{j}}\lambda_{i}\geq 1.

Definition 36 produces Ω(nlogn)\Omega(n\log n) sets.

yielding the contradiction and completing the proof. ∎

If the number of sets produced by the process of Definition 36 is finite, let dd be this finite number. Otherwise, let d=nlnnd=\lceil n\ln n\rceil.

Now, informally, we, in our role as an adversary, commit to assigning all covariates in SjS_{j} the same weight. The following definition formalizes this idea.

We would like to show that applying ϕ\phi to an L2L_{2} packing yields a ρ\rho-packing, which is done in the following lemma.

Let AA be the least-norm interpolation algorithm. We will bound the accuracy of AA by bounding its performance in terms of an algorithm CC built using AA as a subroutine, as was done in a related context in . The definition of Algorithm CC is illustrated in Figure 1, which is reproduced from .

The definition uses the function QαQ_{\alpha} that rounds its input to the nearest multiple of α\alpha. Algorithm CC applies algorithm AA to training data whose response variables have been modified. For each example (x,y)(x,y), and simulated artificial noise ε\varepsilon distributed as N(0,1)N(0,1), and artificial noise ζ\zeta distributed uniformly on (α/2,α/2)(-\alpha/2,\alpha/2), Algorithm CC gives (x,y+Qα(ε)+ζ)(x,y+Q_{\alpha}(\varepsilon)+\zeta) to AA. The following lemma is similar to Lemma 5 of . One important difference is that we show that Algorithm CC approximates the linear function parameterized by θ\theta^{*}, not its discretization.

If the linear interpolant algorithm AA has error τ\tau from nn examples drawn from N(0,Σ)N(0,\Sigma) with independent N(0,1)N(0,1) noise with probability 1δ1-\delta, and

then, in the absence of noise, Algorithm CC, given nn examples of the form (x,Qα(θx))(x,Q_{\alpha}(\theta^{\top}x)), with probability 12δ1-2\delta, achieves ρ(θ^,θ)2τ\rho(\hat{\theta},\theta^{*})^{2}\leq\tau.

The proof of Lemma 41 will be deferred until we have proved some more lemmas.

Recall the definition of total variation distance, dTV(P,Q)=supEP(E)Q(E)d_{TV}(P,Q)=\sup_{E}|P(E)-Q(E)|. The following lemma is implicit in the proof of Lemma 6 of .

Let η,ν\eta,\nu be random variables that are distributed according to N(0,1)N(0,1) and let ζ\zeta be uniform over [α/2,α/2][-\alpha/2,\alpha/2].

We will use the following, which is implicit in the proof of Lemma 8 of .

If P1,...,Pn,Q1,...,QnP_{1},...,P_{n},Q_{1},...,Q_{n} are probability distributions over a domain UU, and χ\chi is a $valuedrandomvariabledefinedon-valued random variable defined onU^{n}$ then

Now, we are ready to prove Lemma 41. The proof closely follows the proof of Lemma 5 in .

Let ζt\zeta_{t} be a random variable with distribution UαU_{\alpha}, where UαU_{\alpha} is the uniform distribution over (α/2,α/2)(-\alpha/2,\alpha/2). Let BB be the randomized algorithm that adds noise ζt\zeta_{t} to each yty_{t} value it receives, passes the result to Algorithm AA, and returns AA’s output.

where P1xtP_{1|x_{t}} is the distribution of (θ)xt+εt(\theta^{*})^{\top}x_{t}+\varepsilon_{t}.

Define P2xtP_{2|x_{t}} as the distribution of Qα((θ)xt+εt)+ζtQ_{\alpha}((\theta^{*})^{\top}x_{t}+\varepsilon_{t})+\zeta_{t}. From Lemma 42, dTV(P1xt,P2xt)αd_{TV}(P_{1|x_{t}},P_{2|x_{t}})\leq\alpha. Applying Lemma 43 with χ\chi as the indicator function for E1E_{1},

Since αδ2n\alpha\leq\frac{\delta}{2n}, this implies

Let P3xtP_{3|x_{t}} be the distribution of Qα((θ)xt+εt)Q_{\alpha}((\theta^{*})^{\top}x_{t}+\varepsilon_{t}), and let

Let P4xtP_{4|x_{t}} be the distribution of Qα((θ)xt)+Qα(εt)Q_{\alpha}((\theta^{*})^{\top}x_{t})+Q_{\alpha}(\varepsilon_{t}). Applying Lemma 43, we get

From Lemma 42, dTV(P3xt,P4xt)αd_{TV}(P_{3|x_{t}},P_{4|x_{t}})\leq\alpha, so

Averaging over the random choice of XX, the probability, for (X,ζ,ε)(X,\boldsymbol{\zeta},\boldsymbol{\varepsilon}) distributed as N(0,Σ)n×Uαn×N(0,1)nN(0,\Sigma)^{n}\times U_{\alpha}^{n}\times N(0,1)^{n}, that ρ(A(X,Qα(Xθ)+Qα(ε)+ζ)),θ)2>τ\rho(A(X,Q_{\alpha}(X\theta^{*})+Q_{\alpha}(\boldsymbol{\varepsilon})+\boldsymbol{\zeta})),\theta^{*})^{2}>\tau, is at most

But A(X,Qα(Xθ)+Qα(ε)+ζ)A(X,Q_{\alpha}(X\theta^{*})+Q_{\alpha}(\boldsymbol{\varepsilon})+\boldsymbol{\zeta}) is the output of the randomized algorithm CC, so this completes the proof. ∎

So, informally, we have shown that if the least norm interpolant can learn unit length weight vectors with noise and N(0,Σ)N(0,\Sigma) data, then there is an algorithm CC than can learn from quantized data without noise. The next step is to lower bound the error of CC.

We will use the following, which is an immediate consequence of Corollary 23.

For each row xtx_{t} of XX, and each q>1q>1,

The proof of the following lemma borrows heavily from .

If 1/α=O(n)1/\alpha=O(n), there is a constant τ\tau such that, for any regression algorithm CC, for all large enough nn, if CC is given nn examples of the form (X,Qα(Xθ))(X,Q_{\alpha}(X\theta^{*})), if the rows of XX are nn independent draws from N(0,Σ)N(0,\Sigma), with probability at least 1/21/2, its output θ^\hat{\theta} satisfies ρ(θ^,θ)2>τ\rho(\hat{\theta},\theta^{*})^{2}>\tau.

Our assumption about the learning ability of CC implies that

For any g,hGg,h\in G for which Qα(Xg)=Qα(Xh)Q_{\alpha}(Xg)=Q_{\alpha}(Xh), since ρ(g,h)>3τ\rho(g,h)>3\sqrt{\tau}, it cannot be the case that both ϕ(X,g)\phi(X,g) and ϕ(X,h)\phi(X,h) are both 11. Thus, recalling that x1,...,xnx_{1},...,x_{n} are the rows of XX, and that all elements of GG have length at most 1, we have

since d=Θ(s)d=\Theta(s). Since d=Ω(nlogn)d=\Omega(n\log n), for large enough nn and small enough τ\tau, this contradicts (11), completing the proof. ∎

Now we are ready to put everything together to prove the second paragraph of Theorem 4. By Lemma 41, it suffices to prove that, for a small enough constant τ0\tau_{0}, if 1/α=O(n)1/\alpha=O(n), with probability 1/21/2, Algorithm CC, given examples (x,Qα(θx))(x,Q_{\alpha}(\theta^{\top}x)), with probability 1/21/2, fails to achieves ρ(θ^,θ)2τ0\rho(\hat{\theta},\theta^{*})^{2}\leq\tau_{0}. By Lemma 45, this is the case, completing the proof.