Benign Overfitting in Linear Regression

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler

Introduction

Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of non-zero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest-neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative, since deep neural networks can perform well even with a perfect fit to the training data.

As one example of this phenomenon, consider the experiment illustrated in Figure 1(c) in : standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. The deep networks in the experiments reported in achieved essentially zero cross-entropy loss on the training data. In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting (“… interpolating fits… [are] unlikely to predict future data well at all.” [25, p37]). Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that fit the training data perfectly.

In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression. That is, we consider quadratic loss and linear prediction rules, and we assume that the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters, $\theta^{*}$ , corresponding to the linear prediction rule that minimizes the expected quadratic loss. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of $\theta^{*}$ . Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, so there might be many interpolating solutions. We consider the most natural: choose the parameter vector $\hat{\theta}$ with the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see Section 2.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate $\hat{\theta}$ —without harming prediction accuracy.
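The minimum norm interpolant described above can be illustrated numerically. The following is a minimal sketch, not the authors' code: the sample size, dimension, noise level, and the particular `theta_star` are all illustrative assumptions. It computes $\hat{\theta}$ with the pseudoinverse and compares it with a second interpolating solution obtained by adding a null-space component.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                      # sample size n, dimension p > n (illustrative)
X = rng.standard_normal((n, p))     # rows are the covariate vectors
theta_star = np.zeros(p)
theta_star[:5] = 1.0                # hypothetical "ideal" parameter vector
y = X @ theta_star + 0.5 * rng.standard_normal(n)   # noisy labels

# Minimum-norm interpolant: theta_hat = X^+ y, with X^+ the pseudoinverse.
X_pinv = np.linalg.pinv(X)
theta_hat = X_pinv @ y

# Any other interpolant differs from theta_hat by a vector in the null space of X.
v = rng.standard_normal(p)
null_part = v - X_pinv @ (X @ v)    # projection of v onto null(X)
theta_other = theta_hat + null_part

train_residual = np.linalg.norm(X @ theta_hat - y)
other_residual = np.linalg.norm(X @ theta_other - y)
print("residuals:", train_residual, other_residual)
print("norms:", np.linalg.norm(theta_hat), np.linalg.norm(theta_other))
```

Both vectors fit the noisy labels exactly (up to rounding), but $\hat{\theta}$ lies in the row space of $X$ and is orthogonal to the added null-space component, so it has the smaller norm.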

Our main result is a finite sample characterization of when overfitting is benign in this setting. The linear regression problem depends on the optimal parameters $\theta^{*}$ and the covariance $\Sigma$ of the covariates $x$ . The properties of $\Sigma$ turn out to be crucial, since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into two terms. The first is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of $\Sigma$ ) is small compared to the sample size $n$ , the contribution to $\hat{\theta}$ that we can view as coming from $\theta^{*}$ is not too distorted. The second term is more interesting, since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of $\Sigma$ in the subspace corresponding to low variance directions is large compared to $n$ . This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low variance (and hence unimportant) directions in parameter space where the label noise can be hidden.

The details are more complicated. The characterization depends in a specific way on two notions of effective rank, $r$ and $R$ ; the smaller one, $r$ , determines a split of $\Sigma$ into large and small eigenvalues, and the excess prediction error depends on the effective rank, as measured by the larger notion $R$ , of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of $\Sigma$ must decay slowly.

Studying the patterns of eigenvalues that allow benign overfitting reveals an interesting role for large but finite dimensions: in an infinite-dimensional setting, benign overfitting occurs only for a narrow range of decay rates of the eigenvalues. On the other hand, it occurs with any suitably slowly decaying eigenvalue sequence in a finite dimensional space whose dimension grows faster than the sample size. Thus, for linear regression, data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.

The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years, since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in Spring 2017. Belkin, Ma and Mandal described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces, and explained the mismatch between this phenomenon and classical generalization bounds. Belkin, Hsu and Mitra gave an example of an interpolating decision rule—simplicial interpolation—with an asymptotic consistency property as the input dimension gets large. That work, and subsequent work of Belkin, Rakhlin, and Tsybakov , studied kernel smoothing methods based on singular kernels that both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation (building on earlier consistency results for these unusual kernels). Liang and Rakhlin considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin, Hsu, Ma and Mandal studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and non-linear classes.

Subsequent to our work, considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning , we became aware of the concurrent work of Belkin, Hsu and Xu and of Hastie, Montanari, Rosset and Tibshirani . Belkin et al calculated the excess risk for certain linear models (a regression problem with identity covariance, sparse optimal parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al considered linear regression in an asymptotic regime, where sample size $n$ and input dimension $p$ go to infinity together with asymptotic ratio $p/n\to\gamma$ . They assumed that, as $p$ gets large, the empirical spectral distribution of $\Sigma$ (the discrete measure on its set of eigenvalues) converges to a fixed measure, and they applied random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as $\gamma$ , the noise variance, and the eigenvalue distribution vary. They also studied the asymptotics of a model involving random nonlinear features. In contrast, we give upper and lower bounds on the excess prediction error for arbitrary finite sample size, for arbitrary covariance matrices, and for data of arbitrary dimension.

The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression and of various notions of effective rank of the covariance operator. Section 3 gives the characterization of benign overfitting, illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting, suggesting that slowly decaying covariance eigenvalues in input spaces of growing but finite dimension are the generic example of benign overfitting. Section 4 discusses the connections between these results and the benign overfitting phenomenon in deep neural networks. Section 5 outlines the proofs of the results.

Definitions and Notation

the conditional noise variance is bounded below by some constant $\sigma^{2}$ ,

almost surely, the projection of the data $X$ on the space orthogonal to any eigenvector of $\Sigma$ spans a space of dimension $n$ .

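By the projection theorem, parameter vectors that solve the least squares problem $\min_{\beta}\left\|X\beta-\boldsymbol{y}\right\|^{2}$ solve the normal equations, so we can equivalently write $\hat{\theta}$ as the minimum norm solution to the normal equations,

$$\hat{\theta}=\arg\min_{\theta}\left\{\|\theta\|^{2}:X^{\top}X\theta=X^{\top}\boldsymbol{y}\right\}=\left(X^{\top}X\right)^{\dagger}X^{\top}\boldsymbol{y}=X^{\top}\left(XX^{\top}\right)^{-1}\boldsymbol{y}.$$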

Our main result gives tight bounds on the excess risk of this minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.

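For the covariance operator $\Sigma$ , define $\lambda_{i}=\mu_{i}(\Sigma)$ for $i=1,2,\ldots$ . If $\sum_{i=1}^{\infty}\lambda_{i}<\infty$ and $\lambda_{k+1}>0$ for $k\geq 0$ , define

$$r_{k}(\Sigma)=\frac{\sum_{i>k}\lambda_{i}}{\lambda_{k+1}},\qquad R_{k}(\Sigma)=\frac{\left(\sum_{i>k}\lambda_{i}\right)^{2}}{\sum_{i>k}\lambda_{i}^{2}}.$$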

Main Results

The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum-norm interpolating estimator.

For any $\sigma_{x}$ there are $b,c,c_{1}>1$ for which the following holds. Consider a linear regression problem from Definition 1. Define

with probability at least $1-\delta$ , and

Moreover, there are universal constants $a_{1},a_{2},n_{0}$ such that for all $n\geq n_{0}$ , for all $\Sigma$ , for all $t\geq 0$ , there is a $\theta^{*}$ with $\|\theta^{*}\|=t$ such that for $x\sim{\cal N}(0,\Sigma)$ and $y|x\sim{\cal N}(x^{\top}\theta^{*},\|\theta^{*}\|^{2}\|\Sigma\|)$ , with probability at least $1/4$ ,

In order to understand the implications of Theorem 4, we now study relationships between the two notions of effective rank, $r_{k}$ and $R_{k}$ , and establish sufficient and necessary conditions for the sequence $\{\lambda_{i}\}$ of eigenvalues to lead to small excess risk.
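To make the two notions of effective rank concrete, here is a small sketch that computes them for a finite eigenvalue list. It assumes the definitions $r_{k}(\Sigma)=\sum_{i>k}\lambda_{i}/\lambda_{k+1}$ and $R_{k}(\Sigma)=(\sum_{i>k}\lambda_{i})^{2}/\sum_{i>k}\lambda_{i}^{2}$; the function name `effective_ranks` and the example spectrum $\lambda_{i}=i^{-2}$ truncated at $p=1000$ are illustrative choices, not taken from the paper.

```python
def effective_ranks(eigs, k):
    """Return (r_k, R_k) for a non-increasing list of eigenvalues.

    Assumed definitions (0-based list, so eigs[k] is lambda_{k+1}):
      r_k = sum_{i>k} lambda_i / lambda_{k+1}
      R_k = (sum_{i>k} lambda_i)^2 / sum_{i>k} lambda_i^2
    """
    tail = eigs[k:]
    if not tail or tail[0] <= 0:
        raise ValueError("need lambda_{k+1} > 0")
    s1 = sum(tail)
    s2 = sum(x * x for x in tail)
    return s1 / tail[0], s1 * s1 / s2


# Illustrative spectrum: lambda_i = 1 / i^2, truncated at p = 1000 (an assumption).
eigs = [1.0 / (i * i) for i in range(1, 1001)]
squared = [x * x for x in eigs]     # eigenvalues of Sigma^2

for k in (0, 5, 50):
    r, R = effective_ranks(eigs, k)
    r_sq, _ = effective_ranks(squared, k)
    print(f"k={k}: r_k={r:.3f}, R_k={R:.3f}, r_k(Sigma^2)={r_sq:.3f}")
```

For a flat spectrum of $p$ equal eigenvalues both quantities equal $p$, while for a rapidly decaying spectrum they stay close to $1$; in that sense they measure how many directions carry a comparable share of the remaining variance.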

The following lemma shows that the two notions of effective rank are closely related. See Appendix H for its proof, and for other properties of $r_{k}$ and $R_{k}$ . (All appendices may be found in the supporting material.)

$r_{k}(\Sigma)\geq 1$ , $r^{2}_{k}(\Sigma)=r_{k}(\Sigma^{2})R_{k}(\Sigma)$ , and $r_{k}(\Sigma^{2})\leq r_{k}(\Sigma)\leq R_{k}(\Sigma)\leq r_{k}^{2}(\Sigma)$ .

Both notions of symmetry $s$ and $S$ lie between $1/p$ (when $\lambda_{2}\to 0$ ) and $1$ (when the $\lambda_{i}$ are all equal).

Theorem 4 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, $r_{0}(\Sigma)$ should be small compared to the sample size $n$ (from the first term) and $r_{k^{*}}(\Sigma)$ and $R_{k^{*}}(\Sigma)$ should be large compared to $n$ . Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of non-zero eigenvalues should be large compared to $n$ , they should have a small sum compared to $n$ , and there should be many eigenvalues no larger than $\lambda_{k^{*}}$ . If the number of these small eigenvalues is not much larger than $n$ , then they should be roughly equal, but they can be more asymmetric if there are many more of them.
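The three quantities discussed above, $r_{0}(\Sigma)/n$, $k^{*}/n$ and $n/R_{k^{*}}(\Sigma)$, can be computed directly from a spectrum. The sketch below is only an illustration under stated assumptions: it takes $k^{*}=\min\{k:r_{k}(\Sigma)\geq bn\}$ with the placeholder value $b=1$ (the theorem only asserts that a suitable constant exists), and the helper names and the two example spectra are hypothetical.

```python
def tail_sums(eigs):
    """Return lists s1, s2 with s1[k] = sum_{i>k} lambda_i and s2[k] = sum_{i>k} lambda_i^2."""
    p = len(eigs)
    s1 = [0.0] * (p + 1)
    s2 = [0.0] * (p + 1)
    for k in range(p - 1, -1, -1):      # eigs[k] is lambda_{k+1}
        s1[k] = s1[k + 1] + eigs[k]
        s2[k] = s2[k + 1] + eigs[k] ** 2
    return s1, s2


def benign_summary(eigs, n, b=1.0):
    """Return (r_0/n, k*/n, n/R_{k*}) with k* = min{k : r_k >= b*n}, or None if no k qualifies.

    b = 1.0 is a placeholder for the unspecified constant in the theorem.
    """
    s1, s2 = tail_sums(eigs)
    r0 = s1[0] / eigs[0]
    for k in range(len(eigs)):
        if eigs[k] > 0 and s1[k] / eigs[k] >= b * n:
            return r0 / n, k / n, n * s2[k] / s1[k] ** 2
    return None


n, p = 100, 100_000
flat = benign_summary([1.0] * p, n)                               # isotropic covariance
fast = benign_summary([1.0 / i ** 2 for i in range(1, p + 1)], n)  # quickly decaying
for name, s in (("flat", flat), ("fast decay", fast)):
    print(f"{name}: r0/n={s[0]:.3g}, k*/n={s[1]:.3g}, n/R_k*={s[2]:.3g}")
```

The two spectra fail in opposite ways. The flat spectrum has $k^{*}=0$ and a huge $R_{0}$, but its eigenvalue sum is far too large relative to $n$; the quickly decaying spectrum has a small sum, but $k^{*}$ is comparable to $n$, so the noise has too few low-variance directions in which to hide.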

The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires $\Sigma$ to have a heavy tail. (The proof, and some other examples illustrating the boundary of benign overfitting, are in Appendix I.) In particular, if we fix $\Sigma$ in an infinite-dimensional Hilbert space and ask when the excess risk of the minimum norm estimator approaches zero as $n\to\infty$ , the answer imposes tight restrictions on the eigenvalues of $\Sigma$ . But there are many other possibilities for these asymptotics if $\Sigma$ can change with $n$ . Since rescaling $X$ affects the accuracy of the least-norm interpolant in an obvious way, we may assume without loss of generality that $\|\Sigma\|=1$ . If we restrict our attention to this case, then, informally, Theorem 4 implies that, when the covariance operator for data with $n$ examples is $\Sigma_{n}$ , the least-norm interpolant converges if $\frac{r_{0}(\Sigma_{n})}{n}\rightarrow 0$ , $\frac{k_{n}^{*}}{n}\rightarrow 0$ , and $\frac{n}{R_{k_{n}^{*}}(\Sigma_{n})}\rightarrow 0$ , and only if $\frac{r_{0}(\Sigma_{n})}{n\log(1+r_{0}(\Sigma_{n}))}\rightarrow 0$ , $\frac{k_{n}^{*}}{n}\rightarrow 0$ , and $\frac{n}{R_{k_{n}^{*}}(\Sigma_{n})}\rightarrow 0$ , where $k^{*}_{n}=\min\left\{k\geq 0:r_{k}(\Sigma_{n})\geq bn\right\}$ for the universal constant $b$ in Theorem 4. For this reason, we say that a sequence of covariance operators $\Sigma_{n}$ is benign if

$$\lim_{n\to\infty}\frac{r_{0}(\Sigma_{n})}{n}=\lim_{n\to\infty}\frac{k_{n}^{*}}{n}=\lim_{n\to\infty}\frac{n}{R_{k_{n}^{*}}(\Sigma_{n})}=0.$$

If $\mu_{k}(\Sigma)=k^{-\alpha}\ln^{-\beta}(k+1)$ , then $\Sigma$ is benign iff $\alpha=1$ and $\beta>1$ .

and $\gamma_{k}=\Theta(\exp(-k/\tau))$ , then $\Sigma_{n}$ is benign iff $p_{n}=\omega(n)$ and $ne^{-o(n)}=\epsilon_{n}p_{n}=o(n)$ . Furthermore, for $p_{n}=\Omega(n)$ and $\epsilon_{n}p_{n}=ne^{-o(n)}$ ,

Compare the situations described by Parts 1 and 2 of Theorem 6. Part 1 shows that for infinite-dimensional data with a fixed covariance, benign overfitting occurs iff the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. Part 2 shows that the situation is very different if the data has finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs iff both the dimension is large compared to the sample size, and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared to the sample size.
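The finite-dimensional situation in Part 2 can be explored numerically. The sketch below assumes the spectrum $\lambda_{k}=e^{-k/\tau}+\epsilon$ for $k\leq p$ (rapid decay plus an isotropic component), with $\tau=1$, $n=100$, $p=20000$ and the placeholder constant $b=1$; these values and the helper names are illustrative assumptions. It reports $r_{0}/n$, $k^{*}/n$ and $n/R_{k^{*}}$ for three sizes of the isotropic component.

```python
import math


def benign_summary(eigs, n, b=1.0):
    """Return (r_0/n, k*/n, n/R_{k*}) with k* = min{k : r_k >= b*n}, or None if no k qualifies."""
    p = len(eigs)
    s1 = [0.0] * (p + 1)                # s1[k] = sum_{i>k} lambda_i
    s2 = [0.0] * (p + 1)                # s2[k] = sum_{i>k} lambda_i^2
    for k in range(p - 1, -1, -1):
        s1[k] = s1[k + 1] + eigs[k]
        s2[k] = s2[k + 1] + eigs[k] ** 2
    for k in range(p):
        if s1[k] / eigs[k] >= b * n:
            return s1[0] / eigs[0] / n, k / n, n * s2[k] / s1[k] ** 2
    return None


def spectrum(p, eps, tau=1.0):
    """lambda_k = exp(-k / tau) + eps for k = 1..p (assumed form of the Part 2 example)."""
    return [math.exp(-k / tau) + eps for k in range(1, p + 1)]


n, p = 100, 20_000
results = {eps: benign_summary(spectrum(p, eps), n) for eps in (1e-30, 1e-4, 1e-1)}
for eps, s in results.items():
    print(f"eps={eps:g}: r0/n={s[0]:.3g}, k*/n={s[1]:.3g}, n/R_k*={s[2]:.3g}")
```

With these choices, an extremely small $\epsilon$ pushes $k^{*}$ up to a constant fraction of $n$, a large $\epsilon$ makes $r_{0}/n$ large, and only the intermediate value keeps all three quantities small, which matches the qualitative window described above.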

These examples illustrate the tension between the slow decay of eigenvalues that is needed for $k/n+n/R_{k}$ to be small, and the summability of eigenvalues that is needed for $r_{0}(\Sigma)/n$ to be small. There are two ways to resolve this tension. First, in the infinite dimensional setting, slow decay of the eigenvalues suffices—decay just fast enough to ensure summability—as shown by Part 1 of Theorem 6. (Appendix I gives another example, where the eigenvalue decay is allowed to vary with $n$ ; in that case, $\Sigma_{n}$ is benign iff the decay rate gets close—but not too close—to $1/k$ as $n$ increases.) The other way to resolve the tension is to consider a finite dimensional setting (which ensures that the eigenvalues are summable), and in this case arbitrarily slow decay is possible. Part 2 of Theorem 6 gives an example of this: eigenvalues that are all at least as large as a small constant. Appendix I gives another example, with a truncated infinite series that decays sufficiently slowly that their sum does not converge. Theorem 6(1) shows that a very specific decay rate is required in infinite dimensions, which suggests that this is an unusual phenomenon in that case. The more generic scenario where benign overfitting will occur is demonstrated by Theorem 6(2), with eigenvalues that are either constant or slowly decaying in a very high—but finite dimensional—space.

Deep neural networks

How relevant are Theorems 4 and 6 to the phenomenon of benign overfitting in deep neural networks? One connection appears by considering regimes where deep neural networks are well-approximated by linear functions of their parameters. This so-called neural tangent kernel (NTK) viewpoint has been vigorously pursued recently in an attempt to understand the optimization properties of deep learning methods. Very wide neural networks, trained with gradient descent from a suitable random initialization, can be accurately approximated by linear functions in an appropriate Hilbert space, and in this case gradient descent finds an interpolating solution quickly; see . (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, [29, Assumption A1] implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of [1, Theorem 5.1] becomes vacuous when independent noise is added to the $y_{i}$ s.) The eigenvalues of the covariance operator in this case can have a heavy tail under reasonable assumptions on the data distribution (see , where this kernel was introduced, and ), and the dimension is very large but finite, as required for benign overfitting. However, the assumptions of Theorem 4 do not apply in this case. In particular, the assumption that the random elements of the Hilbert space are a linearly transformed vector with independent components is not satisfied. Thus, our results are not directly applicable in this—somewhat unrealistic—setting. Note that the slow decay of the eigenvalues of the NTK is in contrast to the case of the gaussian and other smooth kernels, where the eigenvalues decay nearly exponentially quickly .

The phenomenon of benign overfitting was first observed in deep neural networks. Theorems 4 and 6 are steps towards understanding this phenomenon by characterizing when it occurs in the simple setting of linear regression. Those results suggest that covariance eigenvalues that are constant or slowly decaying in a high (but finite) dimensional space might be important in the deep network setting also. Some authors have suggested viewing neural networks as finite-dimensional approximations to infinite dimensional objects , and there are generalization bounds—although not for the overfitting regime—that are applicable to infinite width deep networks with parameter norm constraints . However, the intuition from the linear setting suggests that truncating to a finite dimensional space might be important for good statistical performance in the overfitting regime. Confirming this conjecture by extending our results to the setting of prediction in deep neural networks is an important open problem.

Proof

Throughout the proofs, we treat $\sigma_{x}$ (the subgaussian norm of the covariates) as a constant. Therefore, we use the symbols $b,c,c_{1},c_{2},\ldots$ to refer to constants that only depend on $\sigma_{x}$ . Their values are suitably large (and always at least $1$ ) but do not depend on any parameters of the problems we consider, besides $\sigma_{x}$ . For universal constants that do not depend on any parameters of the problem at all we use the symbol $a$ . Also, whenever we sum over eigenvectors of $\Sigma$ , the sum is restricted to eigenvectors with non-zero eigenvalues.

The first step is a standard decomposition of the excess risk into two pieces, a term that corresponds to the distortion that is introduced by viewing $\theta^{*}$ through the lens of the finite sample and a term that corresponds to the distortion introduced by the noise $\boldsymbol{\varepsilon}=\boldsymbol{y}-X\theta^{*}$ . The impact of both sources of error in $\hat{\theta}$ on the excess risk is modulated by the covariance $\Sigma$ , which gives different weight to different directions in parameter space.
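The decomposition can be checked numerically. The following sketch assumes the matrices $B=(I-X^{\top}(XX^{\top})^{-1}X)\,\Sigma\,(I-X^{\top}(XX^{\top})^{-1}X)$ and $C=(XX^{\top})^{-1}X\Sigma X^{\top}(XX^{\top})^{-1}$, which are the natural weights for the two error sources when $\hat{\theta}=X^{\top}(XX^{\top})^{-1}\boldsymbol{y}$; the spectrum, dimensions and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 300
eigs = 1.0 / np.arange(1, p + 1)                  # illustrative spectrum of Sigma
Sigma = np.diag(eigs)
X = rng.standard_normal((n, p)) * np.sqrt(eigs)   # rows x_i ~ N(0, Sigma)
theta_star = rng.standard_normal(p) / np.sqrt(p)  # hypothetical optimal parameter
eps = 0.3 * rng.standard_normal(n)                # label noise
y = X @ theta_star + eps

G_inv = np.linalg.inv(X @ X.T)                    # (X X^T)^{-1}, an n x n matrix
P = X.T @ G_inv @ X                               # projection onto the row space of X
theta_hat = X.T @ G_inv @ y                       # minimum-norm interpolant

I_p = np.eye(p)
B = (I_p - P) @ Sigma @ (I_p - P)                 # weights the error inherited from theta*
C = G_inv @ X @ Sigma @ X.T @ G_inv               # weights the error inherited from the noise

diff = theta_hat - theta_star
excess = diff @ Sigma @ diff                      # excess risk (theta_hat - theta*)^T Sigma (...)
bias_term = theta_star @ B @ theta_star
noise_term = eps @ C @ eps
cross = theta_star @ (I_p - P) @ Sigma @ X.T @ G_inv @ eps

print("excess risk:", excess)
print("bias term, noise term, trace(C):", bias_term, noise_term, np.trace(C))
```

Since $\hat{\theta}-\theta^{*}=-(I-P)\theta^{*}+X^{\top}(XX^{\top})^{-1}\boldsymbol{\varepsilon}$, the excess risk equals `bias_term + noise_term - 2 * cross`, and the elementary inequality $\|u-w\|^{2}\leq 2\|u\|^{2}+2\|w\|^{2}$ bounds it by twice the sum of the two terms. For independent noise with variance $\sigma^{2}$, the expectation of the noise term over $\boldsymbol{\varepsilon}$ is $\sigma^{2}\operatorname{tr}(C)$.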

The excess risk of the minimum norm estimator satisfies

with probability at least $1-\delta$ over $\boldsymbol{\varepsilon}$ , and

Unit variance subgaussians

Our assumptions allow the trace of $C$ to be expressed as a function of many independent subgaussian vectors.

where $A_{-i}=\sum_{j\neq i}\lambda_{j}z_{j}z_{j}^{\top}$ .

By Assumption 2 in Definition 1, the random variables $x^{\top}v_{i}/\sqrt{\lambda_{i}}$ are independent $\sigma_{x}^{2}$ -subgaussian. We consider $X$ in the basis of eigenvectors of $\Sigma$ , $Xv_{i}=\sqrt{\lambda_{i}}z_{i}$ , to see that

For the second part, we use Lemma 20, which is a consequence of the Sherman–Morrison–Woodbury formula; see Appendix B.

by Lemma 20, for the case $k=1$ and $Z=\sqrt{\lambda_{i}}z_{i}$ . Note that $A_{-i}$ is invertible by Assumption 5 in Definition 1. ∎

The weighted sum of outer products of these subgaussian vectors plays a central role in the rest of the proof. Define

Concentration of $A$

The next step is to show that eigenvalues of $A$ , $A_{-i}$ and $A_{k}$ are concentrated. The proof of the following inequality is in Appendix C. Recall that $\mu_{1}(A)$ and $\mu_{n}(A)$ denote the largest and the smallest eigenvalues of the $n\times n$ matrix $A$ .

There is a constant $c$ such that for any $k\geq 0$ with probability at least $1-2e^{-n/c}$ ,

The following lemma uses this result to give bounds on the eigenvalues of $A_{k}$ , which in turn give bounds on some eigenvalues of $A_{-i}$ and $A$ . For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of $A_{k}$ should dominate the term involving its leading eigenvalue, which is a condition on the effective rank $r_{k}(\Sigma)$ . The lemma shows that once $r_{k}(\Sigma)$ is sufficiently large, all of the eigenvalues of $A_{k}$ are identical up to a constant factor.

There are constants $b,c\geq 1$ such that for any $k\geq 0$ , with probability at least $1-2e^{-n/c}$ ,

By Lemma 9, we know that with probability at least $1-2e^{-n/c_{1}}$ ,

First, the matrix $A-A_{k}$ has rank at most $k$ (as a sum of $k$ matrices of rank $1$ ). Thus, there is a linear space $\mathscr{L}$ of dimension $n-k$ such that for all $v\in\mathscr{L}$ , $v^{\top}Av=v^{\top}A_{k}v\leq\mu_{1}(A_{k})\|v\|^{2}$ , and so $\mu_{k+1}(A)\leq\mu_{1}(A_{k})$ .

Second, by the Courant-Fischer-Weyl Theorem, for all $i$ and $j$ , $\mu_{j}(A_{-i})\leq\mu_{j}(A)$ (see Lemma 28). On the other hand, for $i\leq k$ , $A_{k}\preceq A_{-i}$ , so all the eigenvalues of $A_{-i}$ are lower bounded by $\mu_{n}(A_{k})$ .

Choosing $b>c_{1}^{2}$ and $c>\max\left\{c_{1}+1/c_{1},\left(1/c_{1}-c_{1}/b\right)^{-1}\right\}$ gives the third claim of the lemma. ∎
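The qualitative content of this lemma can be seen in a small simulation. This is a sketch under assumptions, not a verification of the constants: it takes $A=\sum_{i}\lambda_{i}z_{i}z_{i}^{\top}$ and $A_{k}=\sum_{i>k}\lambda_{i}z_{i}z_{i}^{\top}$ with Gaussian $z_{i}$, and uses an illustrative spectrum with five large eigenvalues followed by a long flat tail, so that $r_{k}(\Sigma)$ is much larger than $n$ for $k=5$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 20, 20000, 5
eigs = np.concatenate([np.array([50.0, 40.0, 30.0, 20.0, 10.0]),
                       np.full(p - 5, 1e-3)])     # illustrative spectrum
Z = rng.standard_normal((n, p))                   # column i plays the role of z_i


def weighted_gram(Z, weights):
    """Return sum_i weights[i] * z_i z_i^T as an n x n matrix."""
    return (Z * weights) @ Z.T


A = weighted_gram(Z, eigs)
A_k = weighted_gram(Z[:, k:], eigs[k:])           # drop the k largest eigen-directions

r_k = eigs[k:].sum() / eigs[k]                    # effective rank of the tail
mu_A = np.linalg.eigvalsh(A)                      # ascending order
mu_Ak = np.linalg.eigvalsh(A_k)
ratio_A = mu_A[-1] / mu_A[0]
ratio_Ak = mu_Ak[-1] / mu_Ak[0]

print("r_k / n:", r_k / n)
print("largest/smallest eigenvalue of A  :", ratio_A)
print("largest/smallest eigenvalue of A_k:", ratio_Ak)
```

In this example the eigenvalues of $A_{k}$ are all within a small constant factor of each other, while $A$ itself has a few much larger eigenvalues coming from the first $k$ directions.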

Upper bound on the trace term

There are constants $b,c\geq 1$ such that if $0\leq k\leq n/c$ , $r_{k}(\Sigma)\geq bn$ , and $l\leq k$ then with probability at least $1-7e^{-n/c}$ ,

The proof uses the following lemma and its corollary. Their proofs are in Appendix C.

Suppose $\{\lambda_{i}\}_{i=1}^{\infty}$ is a non-increasing sequence of non-negative numbers such that $\sum_{i=1}^{\infty}\lambda_{i}<\infty$ , and $\{\xi_{i}\}_{i=1}^{\infty}$ are independent centered $\sigma$ -subexponential random variables. Then for some universal constant $a$ for any $t>0$ with probability at least $1-2e^{-t}$

where $\Pi_{\mathscr{L}}$ is the orthogonal projection on $\mathscr{L}$ .

(of Lemma 11) Fix $b$ to its value in Lemma 10. By Lemma 8,

and the upper bounds on the $\mu_{k+1}(A_{-i})$ ’s give

where $\mathscr{L}_{i}$ is the span of the $n-k$ eigenvectors of $A_{-i}$ corresponding to its smallest $n-k$ eigenvalues. So for $i\leq l$ ,

Next, we apply Corollary 13 $l$ times, together with a union bound, to show that with probability at least $1-3e^{-t}$ , for all $1\leq i\leq l$ ,

provided that $t<n/c_{0}$ and $c>c_{0}$ for some sufficiently large $c_{0}$ (note that $c_{2}$ and $c_{3}$ only depend on $c_{0}$ , $a$ and $\sigma_{x}$ , and we can still take $c$ large enough in the end without changing $c_{2}$ and $c_{3}$ ). Combining (3), (4), and (5), with probability at least $1-5e^{-n/c_{0}}$ ,

Second, consider the second sum in (2). Lemma 10 shows that, on the same high probability event that we considered in bounding the first half of the sum, $\mu_{n}(A)\geq\lambda_{k+1}r_{k}(\Sigma)/c_{1}$ . Hence,

Notice that $\sum_{i>l}\lambda_{i}^{2}\|z_{i}\|^{2}$ is a weighted sum of $\sigma_{x}^{2}$ -subexponential random variables, with the weights given by the $\lambda_{i}^{2}$ in blocks of size $n$ . Lemma 12 implies that, with probability at least $1-2e^{-t}$ ,

because $t<n/c_{0}$ . Combining the above gives

Finally, putting both parts together and taking $c>\max\{c_{0},c_{4},c_{6}\}$ gives the lemma. ∎

Lower bound on the trace term

There is a constant $c$ such that for any $i\geq 1$ with $\lambda_{i}>0$ , and any $0\leq k\leq n/c$ , with probability at least $1-5e^{-n/c}$ ,

Suppose $n\leq\infty$ and $\{\eta_{i}\}_{i=1}^{n}$ is a sequence of non-negative random variables, $\{t_{i}\}_{i=1}^{n}$ is a sequence of non-negative real numbers (at least one of which is strictly positive) such that for some $\delta\in(0,1)$ and any $i\leq n$ , $\Pr(\eta_{i}>t_{i})\geq 1-\delta$ . Then

These two lemmas imply the following lower bound.

There is a constant $c$ such that for any $0\leq k\leq n/c$ and any $b>1$ with probability at least $1-10e^{-n/c}$ ,

From Lemmas 8, 14 and 15, with probability at least $1-10e^{-n/c_{1}}$ ,

Now, if $r_{k}(\Sigma)<bn$ , then the second term in the minimum is always bigger than the third term, and in that case,

On the other hand, if $r_{k}(\Sigma)\geq bn$ ,

where the equality follows from the fact that the $\lambda_{i}$ s are non-increasing. ∎

A simple choice of $l$

Recall that $\sigma_{x}$ is a constant. If no $k\leq n/c$ has $r_{k}(\Sigma)\geq bn$ , then Lemmas 7 and 16 imply that the expected excess risk is $\Omega(\sigma^{2})$ , which proves the first paragraph of Theorem 4 for large $k^{*}$ . If some $k\leq n/c$ does have $r_{k}(\Sigma)\geq bn$ , then the upper and lower bounds of Lemmas 11 and 16 are constant multiples of

For any $b\geq 1$ and $k^{*}:=\min\left\{k:r_{k}(\Sigma)\geq bn\right\}$ , if $k^{*}<\infty$ , we have

Finally, we can finish the proof of Theorem 4. Set $b$ in Lemma 16 and Theorem 4 to the constant $b$ from Lemma 11. Take $c_{1}$ to be the maximum of the constants $c$ from Lemmas 16 and 11.

The proof of the second paragraph is in Appendix K.

Conclusions and Further Work

Our results characterize when the phenomenon of benign overfitting occurs in high dimensional linear regression, with gaussian data and more generally. We give finite sample excess risk bounds that reveal the covariance structure that ensures that the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization depends on two notions of the effective rank of the data covariance operator. It shows that overparameterization, that is, the existence of many low-variance and hence unimportant directions in parameter space, is essential for benign overfitting, and that data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of Google through a Google Research Award. Part of this work was done as part of the Fall 2018 program on Foundations of Data Science at the Simons Institute for the Theory of Computing. Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P and FEDER, EU; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI”.

References

Appendix A Proof of Lemma 7

We first give the decomposition of the excess risk.

The excess risk of the minimum norm estimator satisfies

Since $\varepsilon=y-x^{\top}\theta^{*}$ has mean zero conditionally on $x$ ,

Using (1), the definition of $\Sigma$ , and the fact that $\boldsymbol{y}=X\theta^{*}+\boldsymbol{\varepsilon}$ ,

Also, since $\boldsymbol{\varepsilon}$ has zero mean conditionally on $X$ , and is independent of $x$ , we have

The following lemma shows that we can obtain a high-probability upper bound on the term $\boldsymbol{\varepsilon}^{\top}C\boldsymbol{\varepsilon}$ in terms of the trace of $C$ . It is Lemma 36 in .

Combining this with Lemma 18 implies Lemma 7.

Appendix B An Algebraic Property

We use the Sherman–Morrison–Woodbury formula to write

Denote $M_{1}:=Z^{\top}A^{-1}Z$ and $M_{2}:=Z^{\top}A^{-2}Z$ . Applying (6), we get

where we used the identity $I-(I+M_{1})^{-1}M_{1}=(I+M_{1})^{-1}$ twice in the second last equality and the identity $I-M_{1}(I+M_{1})^{-1}=(I+M_{1})^{-1}$ in the last equality. ∎
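The algebra in this appendix can be sanity-checked numerically. The sketch below assumes that the identity being proved is $Z^{\top}(A+ZZ^{\top})^{-2}Z=(I+M_{1})^{-1}M_{2}(I+M_{1})^{-1}$, which is what the two identities quoted in the last step yield when combined with the Sherman–Morrison–Woodbury formula; the matrix sizes and the random test matrices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                 # symmetric positive definite, hence invertible
Z = rng.standard_normal((n, k))

A_inv = np.linalg.inv(A)
M1 = Z.T @ A_inv @ Z                    # Z^T A^{-1} Z
M2 = Z.T @ A_inv @ A_inv @ Z            # Z^T A^{-2} Z
S = np.linalg.inv(np.eye(k) + M1)       # (I + M1)^{-1}

# Sherman-Morrison-Woodbury: (A + Z Z^T)^{-1} = A^{-1} - A^{-1} Z (I + M1)^{-1} Z^T A^{-1}
full_inv = np.linalg.inv(A + Z @ Z.T)
woodbury = A_inv - A_inv @ Z @ S @ Z.T @ A_inv

lhs = Z.T @ full_inv @ full_inv @ Z     # Z^T (A + Z Z^T)^{-2} Z
rhs = S @ M2 @ S                        # (I + M1)^{-1} M2 (I + M1)^{-1}

print("Woodbury residual:", np.abs(full_inv - woodbury).max())
print("identity residual:", np.abs(lhs - rhs).max())
```

For $k=1$ and $Z=\sqrt{\lambda_{i}}z_{i}$ the matrices $M_{1}$ and $M_{2}$ are scalars, and the right-hand side reduces to $M_{2}/(1+M_{1})^{2}$.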

Appendix C Proof of concentration inequalities

We use some standard results about subgaussian and subexponential random variables.

First of all, we need the following direct consequence of Propositions 2.5.2 and 2.7.1 and Lemma 2.7.6 from :

There is a universal constant $c$ such that for any random variable $\xi$ that is centered, $\sigma^{2}$ -subgaussian, and unit variance, $\xi^{2}-1$ is a centered $c\sigma^{2}$ -subexponential random variable, that is,

Second, we are going to use the following form of Bernstein’s inequality, which is Theorem 2.8.2 in :

There is a universal constant $c$ such that for any non-increasing sequence $\{\lambda_{i}\}_{i=1}^{\infty}$ of non-negative numbers such that $\sum_{i=1}^{\infty}\lambda_{i}<\infty$ , and any independent, centered, $\sigma$ -subexponential random variables $\{\xi_{i}\}_{i=1}^{\infty}$ , and any $x>0$ , with probability at least $1-2e^{-x}$

where $\Pi_{\mathscr{L}}$ is the orthogonal projection on $\mathscr{L}$ .

First of all, since $\|z\|^{2}=\sum_{i=1}^{n}z_{i}^{2}$ — a sum of $n$ $\sigma^{2}$ -subexponential random variables, by Corollary 23, for some absolute constant $c$ and for any $t>0$ , with probability at least $1-2e^{-t}$ ,

Thus, with probability at least $1-3e^{-t}$

where the first inequality holds because the $\lambda_{i}$ s are decreasing in magnitude, and the last two inequalities hold since the functions $x+x^{2}$ and $2x+x^{2}$ are both increasing on $(-\frac{1}{2},\infty)$ and $\Delta v_{1}\geq-\|\Delta v\|\geq-\epsilon\geq-\frac{1}{2}$ . ∎

There is a universal constant $c$ such that with probability at least $1-2e^{-n/c}$ ,

Let $\mathcal{N}$ be a $\frac{1}{4}$ -net on the sphere $\mathcal{S}^{n-1}$ with respect to the Euclidean distance such that $|\mathcal{N}|\leq 9^{n}$ . Applying the union bound over the elements of $\mathcal{N}$ , we see that with probability $1-2e^{-t}$ , every $v\in\mathcal{N}$ satisfies

Since $\mathcal{N}$ is a $\frac{1}{4}$ -net, by Lemma 25, we need to multiply the quantity above by $(1-1/4)^{-2}$ to get the bound on the norm of $A-I_{n}\sum_{i}\lambda_{i}$ . Denote

Thus, with probability at least $1-2e^{-t}$ ,

When $t<n/c_{4}$ we can write $t+n\ln 9\leq c_{5}n$ , and we have

by the AM-GM inequality. (Recall that $c_{1},c_{2},\ldots$ denote universal constants with value at least $1$ , and $\sigma_{x}\geq 1/c_{7}$ is the subgaussian constant of a random variable with unit variance.) ∎

Appendix D Proof of Lemma 14

Fix $i\geq 1$ with $\lambda_{i}>0$ and $0\leq k\leq n/c$ . By Lemma 10, with probability at least $1-2e^{-n/c_{1}}$ ,

By Corollary 13, with probability at least $1-3e^{-t}$ ,

provided that $t<n/c_{0}$ and $c>c_{0}$ for some sufficiently large $c_{0}$ . Thus, with probability at least $1-5e^{-n/c_{3}}$ ,

Dividing $\lambda_{i}^{2}z_{i}^{\top}A_{-i}^{-2}z_{i}$ by the square of both sides, we have

Also, from the Cauchy-Schwarz inequality and Corollary 13 again, we have that on the same event,

Choosing $c$ suitably large gives the lemma.

Appendix E Proof of Lemma 15

and denote its probability as $c\delta$ for some $c\in(0,\delta^{-1})$ . On the one hand, by the definition of the event, we have

On the other hand, note that for any $i$ ,

Appendix F Proof of Lemma 17

We can write the function of $l$ being minimized as

where $l^{*}$ is the largest value of $i\leq k^{*}$ for which

since the $\lambda_{i}^{2}$ are non-increasing. This condition holds iff

The definition of $k^{*}$ implies $r_{k^{*}-1}(\Sigma)<bn$ . So we can write

and so the minimizing $l$ is $k^{*}$ . Also,

Appendix G Eigenvalue monotonicity

Recall (half of) the Courant-Fischer-Weyl theorem.

If symmetric matrices $A$ and $B$ satisfy $A\preceq B$ , then, for any $i\in[n]$ , we have $\mu_{i}(A)\leq\mu_{i}(B)$ .
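As a quick numerical sanity check of this monotonicity statement, the sketch below uses $2\times 2$ symmetric matrices, for which the eigenvalues have a closed form. The helper name `eig2` and the example entries are illustrative assumptions, not taken from the paper; $B$ is built as $A$ plus a rank-one positive semidefinite matrix, so $A\preceq B$ holds by construction.

```python
import math

def eig2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return (mean + radius, mean - radius)

# A is an arbitrary symmetric matrix (illustrative values).
A = (2.0, -1.0, 0.5)            # stands for [[2, -1], [-1, 0.5]]
v = (0.3, 1.2)                  # B = A + v v^T, so B - A is PSD
B = (A[0] + v[0] * v[0], A[1] + v[0] * v[1], A[2] + v[1] * v[1])

mu_A = eig2(*A)
mu_B = eig2(*B)
# Each ordered eigenvalue of A should be at most the matching one of B.
monotone = all(x <= y + 1e-12 for x, y in zip(mu_A, mu_B))
print(mu_A, mu_B, monotone)
```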

Appendix H Rank facts

The quantity $r_{0}(\Sigma)$ is an important complexity parameter for covariance estimation problems, where it has been called the ‘effective rank’. Earlier, $r_{0}(\Sigma^{2})$ was called the ‘stable rank’ and the ‘numerical rank’, although that term has a different meaning in computational linear algebra [23, p261].

$r_{k}(\Sigma)\geq 1$ , $r^{2}_{k}(\Sigma)=r_{k}(\Sigma^{2})R_{k}(\Sigma)$ , and $r_{k}(\Sigma^{2})\leq r_{k}(\Sigma)\leq R_{k}(\Sigma)\leq r_{k}^{2}(\Sigma)$ .

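The first inequality and the equality are immediate from the definitions. Together they imply $R_{k}(\Sigma)\leq r_{k}^{2}(\Sigma)$ . For the second inequality, since $\lambda_{i}\leq\lambda_{k+1}$ for all $i>k$ ,

$$r_{k}(\Sigma^{2})=\frac{\sum_{i>k}\lambda_{i}^{2}}{\lambda_{k+1}^{2}}\leq\frac{\lambda_{k+1}\sum_{i>k}\lambda_{i}}{\lambda_{k+1}^{2}}=r_{k}(\Sigma).$$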

Substituting this in the equality implies $r_{k}(\Sigma)\leq R_{k}(\Sigma)$ . ∎
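The rank facts above can be checked numerically. The following is a minimal sketch, assuming the definitions $r_{k}(\Sigma)=\sum_{i>k}\lambda_{i}/\lambda_{k+1}$ and $R_{k}(\Sigma)=(\sum_{i>k}\lambda_{i})^{2}/\sum_{i>k}\lambda_{i}^{2}$ from the main text; the function name `effective_ranks` and the truncated spectrum $\lambda_{i}=i^{-2}$ are illustrative choices.

```python
def effective_ranks(lam, k):
    """Return (r_k(Sigma), r_k(Sigma^2), R_k(Sigma)) for eigenvalues lam.

    lam is a non-increasing list of positive eigenvalues; lam[0] is lambda_1.
    """
    tail = lam[k:]                      # lambda_{k+1}, lambda_{k+2}, ...
    s1 = sum(tail)
    s2 = sum(x * x for x in tail)
    r_k = s1 / tail[0]
    r_k_sq = s2 / (tail[0] ** 2)        # r_k applied to Sigma^2
    R_k = s1 * s1 / s2
    return r_k, r_k_sq, R_k

lam = [1.0 / (i * i) for i in range(1, 201)]   # illustrative finite spectrum

facts_hold = True
for k in range(0, 150):
    r, r2, R = effective_ranks(lam, k)
    tol = 1e-9 * max(1.0, r * r)        # slack for floating-point rounding
    facts_hold &= r >= 1.0
    facts_hold &= abs(r * r - r2 * R) <= tol
    facts_hold &= r2 <= r + tol and r <= R + tol and R <= r * r + tol
print(facts_hold)
```

For a flat tail all three quantities coincide with the number of remaining eigenvalues, which is the sense in which they act as ranks.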

Writing $r_{k}$ and $R_{k}$ for $r_{k}(\Sigma)$ and $R_{k}(\Sigma)$ ,

Thus, the function $\phi(k)=k/(b^{2}n)+n/R_{k}$ satisfies the monotonicity property $\phi(k+1)>\phi(k)$ whenever $r_{k}>bn\geq 1$ .

Since $r_{k}>1$ , $0<1-\left(2-1/r_{k}\right)/r_{k}<1$ , so

Appendix I Conditions on eigenvalues

In this section, we prove the following expanded version of Theorem 6.

Define $\lambda_{k,n}:=\mu_{k}(\Sigma_{n})$ for all $k,n$ .

If $\lambda_{k,n}=k^{-\alpha}\ln^{-\beta}(k+1)$ , then $\Sigma_{n}$ is benign iff $\alpha=1$ and $\beta>1$ .

If $\lambda_{k,n}=k^{-(1+\alpha_{n})}$ , then $\Sigma_{n}$ is benign iff $\omega(1/n)=\alpha_{n}=o(1)$ . Furthermore,

then $\Sigma_{n}$ is benign iff either $0<\alpha<1$ , $p_{n}=\omega(n)$ and $p_{n}=o\left(n^{1/(1-\alpha)}\right)$ or $\alpha=1$ , $p_{n}=e^{\omega(\sqrt{n})}$ and $p_{n}=e^{o(n)}$ .

We build up the proof in stages. First, we characterize those sequences of effective ranks that can arise.

Consider some positive summable sequence $\{\lambda_{i}\}_{i=1}^{\infty}$ , and for any non-negative integer $i$ denote

$$r_{i}=\lambda_{i+1}^{-1}\sum_{j>i}\lambda_{j}.$$

Then $r_{i}>1$ and $\sum_{i}r_{i}^{-1}=\infty$ . Moreover, for any positive sequence $\{u_{i}\}$ such that $u_{i}>1$ for every $i$ and $\sum_{i=0}^{\infty}u_{i}^{-1}=\infty$ , there exists a positive sequence $\{\lambda_{i}\}$ (unique up to constant multiplier) such that $r_{i}\equiv u_{i}$ . The sequence is (a constant rescaling of)

$$\lambda_{k}=u_{k-1}^{-1}\prod_{i=0}^{k-2}\left(1-u_{i}^{-1}\right).$$

which goes to zero if and only if $\sum_{i}r_{i}^{-1}=\infty$ . On the other hand, we may rewrite the first equality in the proof as

So for any sequence $\{u_{i}\}$ we can uniquely (up to a constant multiplier) recover the sequence $\{\lambda_{i}\}$ such that $r_{i}=u_{i}$ — the only candidate is

However, for such $\{\lambda_{i}\}$ one can compute

so the resulting sequence $\{\lambda_{i}\}$ sums to 1, and
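The reconstruction of $\{\lambda_{i}\}$ from prescribed effective ranks can be sketched in code. The sketch assumes only the definition $r_{i}=\lambda_{i+1}^{-1}\sum_{j>i}\lambda_{j}$ , which gives the tail recursion $\sum_{j>i+1}\lambda_{j}=(1-r_{i}^{-1})\sum_{j>i}\lambda_{j}$ and hence $\lambda_{i+1}=u_{i}^{-1}\prod_{j<i}(1-u_{j}^{-1})$ when the total sum is normalised to 1. The function name `spectrum_from_ranks` and the target values are illustrative; tails are tracked analytically because the true sequence is infinite.

```python
def spectrum_from_ranks(u):
    """Given targets u[0..m-1] (each > 1), return (lam, tails).

    tails[i] = sum_{j>i} lambda_j, normalised so that tails[0] = 1;
    lam[i] = lambda_{i+1} = tails[i] / u[i].
    """
    lam, tails = [], []
    t = 1.0
    for ui in u:
        tails.append(t)
        lam.append(t / ui)
        t *= 1.0 - 1.0 / ui        # tail after removing lambda_{i+1}
    return lam, tails

u = [(i + 2) * 1.5 for i in range(50)]      # illustrative targets, all > 1
lam, tails = spectrum_from_ranks(u)

# r_i = tails[i] / lambda_{i+1} should reproduce u_i, and consecutive
# tails should differ by exactly lambda_{i+1}.
recovered = [tails[i] / lam[i] for i in range(len(u))]
max_err = max(abs(r - ui) for r, ui in zip(recovered, u))
tail_gap = max(abs(tails[i] - tails[i + 1] - lam[i]) for i in range(len(u) - 1))
print(max_err, tail_gap)
```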

Suppose $b$ is some constant, and $k^{\ast}(n)=\min\{k:r_{k}\geq bn\}$ . Suppose also that the sequence $\{r_{n}\}$ is increasing. Then, as $n$ goes to infinity, $k^{\ast}(n)/n$ goes to zero if and only if $r_{n}/n$ goes to infinity.

We prove the “if” part separately from the “only if” part.

If $k^{\ast}(n)/n\to 0$ then $r_{n}/n\to\infty$ .

Fix some $C>1$ . Since $k^{\ast}(n)/n\to 0$ , there exists some $N_{C}$ such that for any $n\geq N_{C}$ , $k^{\ast}(n)<n/C.$ Thus, for all $n>N_{C}$ ,

Since the constant $C$ is arbitrary, $r_{n}/n$ goes to infinity.

If $r_{n}/n\to\infty$ then $k^{\ast}(n)/n\to 0$ .

Fix some constant $C>1$ . Since $r_{n}/n\to\infty$ there exists some $N_{C}$ such that for any $n\geq N_{C}$ , $r_{n}>Cn$ . Thus, for any $n>CN_{C}/b$

Since the constant $C$ is arbitrary, $k^{\ast}(n)/n$ goes to zero.
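The equivalence can be illustrated numerically. The sketch below computes $k^{\ast}(n)=\min\{k:r_{k}\geq bn\}$ for two increasing rank sequences: one with $r_{n}/n\to\infty$ and one with $r_{n}/n$ bounded. The particular sequences, the value $b=1$ , and the helper name `k_star` are illustrative assumptions.

```python
import math

def k_star(r, b, n):
    """Smallest k >= 0 with r(k) >= b * n, for an increasing function r."""
    hi = 1
    while r(hi) < b * n:       # exponential search for an upper bracket
        hi *= 2
    lo = 0
    while lo < hi:             # binary search for the first k with r(k) >= b*n
        mid = (lo + hi) // 2
        if r(mid) >= b * n:
            hi = mid
        else:
            lo = mid + 1
    return lo

b = 1.0
superlinear = lambda k: (k + 2) * math.log(k + 2)   # r_n / n -> infinity
linear = lambda k: 2.0 * (k + 1)                    # r_n / n stays bounded

ns = [10**2, 10**4, 10**6]
ratios_super = [k_star(superlinear, b, n) / n for n in ns]
ratios_linear = [k_star(linear, b, n) / n for n in ns]
print(ratios_super)   # shrinks as n grows
print(ratios_linear)  # stays close to 1/2
```

In the superlinear case the ratio $k^{\ast}(n)/n$ shrinks only logarithmically, so large $n$ is needed to see it approach zero.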

Suppose the sequence $\{r_{i}\}$ is increasing and $r_{n}/n\to\infty$ as $n\to\infty$ . Then a sufficient condition for $\frac{n}{R_{k^{\ast}(n)}}\to 0$ is

For example, this condition holds for $r_{n}=n\log n$ .
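The example can be checked numerically, taking the sufficient condition in the form $r_{k}^{-2}=o(r_{k}^{-1}-r_{k+1}^{-1})$ that is used later in this appendix. The sketch below evaluates the ratio of the two sides for $r_{k}=k\log k$ ; the helper name `condition_ratio` and the sampled values of $k$ are illustrative.

```python
import math

def condition_ratio(r, k):
    """r_k^{-2} divided by (r_k^{-1} - r_{k+1}^{-1}) for an increasing r."""
    return (1.0 / r(k) ** 2) / (1.0 / r(k) - 1.0 / r(k + 1))

r = lambda k: k * math.log(k)      # the example r_n = n log n (for k >= 2)

ks = [10, 10**3, 10**5, 10**7]
ratios = [condition_ratio(r, k) for k in ks]
print(ratios)                      # the condition asks that this tends to 0
```

Here the ratio behaves roughly like $1/(1+\log k)$ , so it does go to zero, but slowly.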

Since $r_{k^{\ast}(n)}\geq bn$ and $\lim_{n\rightarrow\infty}k^{*}(n)=\infty$ , it is enough to prove that $\frac{\sum_{i>k}\lambda_{i}^{2}}{\lambda_{k+1}^{2}r_{k}}\to 0$ as $k$ goes to infinity. Since

and it is sufficient to prove that the latter quantity goes to zero. We write

Since both numerator and denominator are decreasing in $k$ and go to zero as $k\to\infty$ , we can apply the Stolz–Cesáro theorem (an analog of L’Hôpital’s rule for discrete sequences):

where the last line is due to our sufficient condition. ∎

Part 1, if direction, first term: We have

Part 1, if direction, second term: By Theorem 33, it suffices to prove that $\lim_{n\rightarrow\infty}\frac{r_{n}}{n}=\infty$ . This holds because

Part 1, if direction, third term: By Theorem 34, it suffices to prove that $r_{k}^{-2}=o(r_{k}^{-1}-r_{k+1}^{-1})$ , that is

As argued above, when $\alpha=1$ and $\beta>1$ , $r_{k}=\Theta(k\log k)$ , so it suffices to show that $\lim_{k\rightarrow\infty}(r_{k+1}-r_{k})=\infty$ . We have

Since $\lambda_{i}$ is non-increasing, we have

If we define $f$ on the positive reals by $f(x)=x\log^{\beta}(x+1)$ , then $f$ is convex, and, since $f^{\prime}(x)=\frac{\beta x\log^{\beta-1}(x+1)}{x+1}+\log^{\beta}(x+1)$ , we have

which goes to infinity for large $k$ , completing the proof of the “if” direction of the third term of Part 1.
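As a sanity check on the two facts about $f$ used in this step, the sketch below compares the stated derivative with a central finite difference and checks the convexity consequence $f(k+1)-f(k)\geq f^{\prime}(k)$ at a few points. The value $\beta=1.5$ and the sample points are illustrative assumptions.

```python
import math

def f(x, beta):
    return x * math.log(x + 1) ** beta

def f_prime(x, beta):
    """Closed-form derivative quoted in the text."""
    return (beta * x * math.log(x + 1) ** (beta - 1) / (x + 1)
            + math.log(x + 1) ** beta)

beta = 1.5                       # illustrative value with beta > 1
checks = []
for k in [1, 10, 100, 1000]:
    h = 1e-6
    numeric = (f(k + h, beta) - f(k - h, beta)) / (2 * h)   # central difference
    increment = f(k + 1, beta) - f(k, beta)
    checks.append((abs(numeric - f_prime(k, beta)) < 1e-4,
                   increment >= f_prime(k, beta)))
print(checks)
```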

Part 1, only if direction, $\alpha>1$ : If $\alpha>1$ , then

which does not grow faster than $n$ . Thus, by Theorem 33, $k^{\ast}(n)/n$ does not go to zero.

Part 1, only if direction, $\alpha<1$ , or $\alpha=1$ and $\beta\leq 1$ : In this case, since, as above

and $\sum_{i=1}^{\infty}\frac{1}{i^{\alpha}\log^{\beta}(1+i)}$ diverges in this case, $\frac{||\Sigma_{n}||\sqrt{r_{0}(\Sigma_{n})}}{n}$ does not go to zero.

Before starting on Part 2, let us define $r_{k,n}=r_{k}(\Sigma_{n})$ and $R_{k,n}=R_{k}(\Sigma_{n})$ .

Part 2, if direction, first term: We have

so $||\Sigma_{n}||\sqrt{\frac{r_{0,n}}{n}}\leq\sqrt{\frac{1+\frac{1}{\alpha_{n}}}{n}}$ which goes to zero with $n$ if $\alpha_{n}=\omega(1/n)$ .

Part 2, if direction, second term: First,

Thus, $k^{*}(n)=O(\alpha_{n}n)$ , so that $\frac{k^{*}(n)}{n}=O(\alpha_{n})=o(1)$ .

Part 2, if direction, third term: We bound $R_{k,n}$ from below by separately bounding its numerator and denominator:

So now we want a lower bound on $k^{*}(n)$ . For that, we need an upper bound on $r_{k,n}$ , and

This implies $\frac{2k^{*}(n)}{\alpha_{n}}e^{\alpha_{n}/k^{*}(n)}\geq bn$ . This, together with the fact that, for $u>1$ , $ue^{1/u}$ is an increasing function of $u$ , implies that, for large enough $n$ , $k^{*}(n)\geq\alpha_{n}bn/3$ . Since $\alpha_{n}=\omega(1/n)$ , this implies that $k^{*}(n)=\omega(1)$ . Combining this with (7), for large enough $n$

Thus $n/R_{k^{*}(n),n}=O(\alpha_{n})=o(1)$ .

Part 2, only if direction, $\alpha_{n}=O(1/n)$ : We have

so $||\Sigma_{n}||\sqrt{\frac{r_{0,n}}{n}}\geq\sqrt{\frac{1}{\alpha_{n}n}}$ , which is bounded below by a constant for large $n$ if $\alpha_{n}=O(1/n)$ .

Part 2, only if direction, $\alpha_{n}=\Omega(1)$ : Recall that, in the proof of the “if” direction of the third term, we showed that $k^{*}(n)\geq\alpha_{n}bn/3$ . This implies that $\frac{k^{*}(n)}{n}=\Omega(\alpha_{n})$ .

Part 3: Suppose that $\Sigma_{n}$ is benign. Then because $R_{k}(\Sigma_{n})\leq p_{n}-k$ , we must have $p_{n}=\omega(n)$ . Thus, we can restrict our attention to the sequences for which $p_{n}=\omega(n)$ and find the necessary and sufficient conditions for that class.

Next, for any positive $\alpha$ and any natural number $k\in[1,p_{n})$ , we can write

As the sequence can only be benign if $k^{\ast}=o(n)$ , we need only consider values of $k$ that do not exceed some constant fraction of $n$ , e.g. $n/2$ . Since $p_{n}=\omega(n)$ , noting that, for $x>0$ , the sign of $\frac{1}{1-\alpha}x^{1-\alpha}$ flips when $\alpha$ crosses $1$ , we can write, uniformly for all $k\in[1,n/2]$ ,

Recall that we consider $\lambda_{i,n}=i^{-\alpha}$ for $i\leq p_{n}$ . Using the formula above, we get uniformly for all $k\in[1,n/2]$

Recall that $k^{\ast}=\min\{k:r_{k}(\Sigma_{n})\geq bn\}$ . We compute

One can see that for $\alpha>1$ , $k^{\ast}=\Omega_{\alpha}(n)$ , so the sequence is not benign for $\alpha>1$ . On the other hand, $k^{\ast}=o(n)$ for $\alpha\leq 1$ .

Next, analogously to the asymptotics for $r_{k}(\Sigma)$ , we have

Since $R_{k}=\frac{r_{k}(\Sigma)^{2}}{r_{k}(\Sigma^{2})}$ , we can write uniformly for all $k\in[1,n/2]$

Now we plug in $k^{\ast}$ instead of $k$ . Recall that $p_{n}/k^{\ast}=\Theta_{\alpha}\left((p_{n}/n)^{1/\alpha}\right)$ for $\alpha\in(0,1)$ , and $p_{n}/k^{\ast}=\Theta_{\alpha}\left(p_{n}/n\ln(p_{n}/n)\right)$ for $\alpha=1$ . We get

Since $p_{n}=\omega(n)$ , for any $\alpha\in(0,1)$ , $R_{k^{\ast}}=\omega(n)$ . For $\alpha=1$ the necessary and sufficient condition for $R_{k^{\ast}}=\omega(n)$ is $\ln(p_{n}/n)=\omega(\sqrt{n})$ .

So far, we obtained the necessary and sufficient conditions for the last terms to go to zero. Now let’s look at the upper bound for the first term: since $\lambda_{1,n}\equiv 1$ , we just need $r_{0}/n\to 0$ . We write, for $\alpha\in(0,1]$ ,

Thus, for $\alpha<1$ , $r_{0}(\Sigma_{n})/n$ goes to zero if and only if $p_{n}=o\left(n^{1/(1-\alpha)}\right)$ , and for $\alpha=1$ , $r_{0}(\Sigma_{n})/n$ goes to zero if and only if $\ln(p_{n})=o(n)$ .
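The three quantities tracked in Part 3 can be computed directly for a finite spectrum $\lambda_{i}=i^{-\alpha}$ , $i\leq p_{n}$ . The following is a minimal sketch; the function name `benign_terms`, the choice $b=1$ , and the sample values of $n$ and $p_{n}$ are illustrative assumptions, and finite-size values only hint at the asymptotic statements.

```python
def benign_terms(alpha, p, n, b=1.0):
    """For lambda_i = i^{-alpha}, i = 1..p, return (r_0/n, k*/n, n/R_{k*}).

    k* = min{k : r_k >= b n}; returns None if no such k < p exists.
    """
    lam = [i ** (-alpha) for i in range(1, p + 1)]
    # Suffix sums: s1[k] = sum_{i>k} lam_i, s2[k] = sum_{i>k} lam_i^2.
    s1 = [0.0] * (p + 1)
    s2 = [0.0] * (p + 1)
    for k in range(p - 1, -1, -1):
        s1[k] = s1[k + 1] + lam[k]
        s2[k] = s2[k + 1] + lam[k] ** 2
    r0_over_n = s1[0] / lam[0] / n
    for k in range(p):
        if s1[k] / lam[k] >= b * n:          # r_k = s1[k] / lambda_{k+1}
            R = s1[k] ** 2 / s2[k]
            return r0_over_n, k / n, n / R
    return None

n = 100
for p in [10**3, 10**4, 10**5]:
    print(p, benign_terms(0.5, p, n))
```

With $\alpha=1/2$ the threshold $n^{1/(1-\alpha)}$ equals $n^{2}=10^{4}$ , so the first returned quantity is expected to be small for $p_{n}$ well below that value and large well above it.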

Part 4: Suppose that $\Sigma_{n}$ is benign. Then because $R_{k}(\Sigma_{n})\leq p_{n}-k$ , we must have $p_{n}=\omega(n)$ . Also,

and so $p_{n}\epsilon_{n}=o(n)$ . Since $\Sigma_{n}$ benign implies $k^{*}=o(n)$ , and hence $k^{*}=o(p_{n})$ , we consider $k=o(p_{n})$ . In this regime,

Substituting $k=\tau\ln(n/(p_{n}\epsilon_{n}))-a$ gives

which shows that $k^{*}\geq\tau\ln(n/(p_{n}\epsilon_{n}))-O(1)$ . Thus, if $\Sigma_{n}$ is benign, we must have $k^{*}=o(n)$ , that is, $\epsilon_{n}p_{n}=ne^{-o(n)}$ .

Conversely, assume $p_{n}=\Omega(n)$ and $\epsilon_{n}p_{n}=ne^{-o(n)}$ (that is, $\ln(n/(p_{n}\epsilon_{n}))=o(n)$ ). Set $k=\tau\ln(n/(p_{n}\epsilon_{n}))-a$ , for some $a$ , which we shall see is $\Theta(1)$ . Notice that $k=o(n)$ , so $p_{n}-k=\Omega(p_{n})$ and $e^{-p_{n}}=o(e^{-k})$ . Thus,

which shows that $k^{*}=\tau\ln(n/(p_{n}\epsilon_{n}))+O(1)$ . Also, we have

Now, it is clear that $p_{n}=\omega(n)$ , $\epsilon_{n}p_{n}=o(n)$ , and $\epsilon_{n}p_{n}=ne^{-o(n)}$ imply that $\Sigma_{n}$ is benign.

Appendix J Upper bound on the $B$ term

We can control the term ${\theta^{\ast}}^{\top}B\theta^{\ast}$ in Lemma 7 using a standard argument.

There is a constant $c$ , that depends only on $\sigma_{x}$ , such that for any $1<t<n$ , with probability at least $1-e^{-t}$ ,

Moreover, for any $v$ in the orthogonal complement to the span of the columns of $X^{\top}$ ,

Thus, due to Theorem 9 in , there is an absolute constant $c$ such that for any $t>1$ with probability at least $1-e^{-t}$ ,

Appendix K Another lower bound

In this section, we prove the second paragraph of Theorem 4.

since, otherwise, the lower bound is vacuously satisfied.

so that, informally, a successful learning algorithm achieves $\rho(\hat{\theta},\theta)<\sqrt{\tau_{0}}$ .

Define sets $S_{1},S_{2},...$ of indices as follows. Let $S_{1}=\{1\}$ ; let $S_{2}=\{2,...,i_{2}\}$ , for the least $i_{2}$ such that $\sum_{i=2}^{i_{2}}\lambda_{i}\geq 1$ . Continue the same way as long as possible; for all $j>2$ , let $S_{j}=\{i_{j-1}+1,...,i_{j}\}$ , where $i_{j}$ is the least index such that $\sum_{i=i_{j-1}+1}^{i_{j}}\lambda_{i}\geq 1$ .
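The greedy construction of the index sets can be sketched as follows. The sketch assumes the blocks are meant to be disjoint, so that each block starts immediately after the previous one ends, and it discards a trailing block whose eigenvalue sum never reaches 1. The function name `build_blocks` and the example eigenvalues are illustrative.

```python
def build_blocks(lam, threshold=1.0):
    """Greedy blocks of 1-based indices in the spirit of the definition above.

    S_1 = {1}; each later block is extended until its eigenvalue sum
    reaches `threshold`. A trailing block that never reaches the threshold
    is discarded ("as long as possible").
    Assumption: blocks are disjoint, i.e. each starts right after the
    previous block ends.
    """
    blocks = [[1]]
    current, total = [], 0.0
    for i in range(2, len(lam) + 1):
        current.append(i)
        total += lam[i - 1]            # lam[i - 1] is lambda_i
        if total >= threshold:
            blocks.append(current)
            current, total = [], 0.0
    return blocks

lam = [1.0, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3, 0.2, 0.1]   # illustrative values
blocks = build_blocks(lam)
print(blocks)   # [[1], [2, 3], [4, 5, 6]]
```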

Definition 36 produces $\Omega(n\log n)$ sets.

yielding the contradiction and completing the proof. ∎

If the number of sets produced by the process of Definition 36 is finite, let $d$ be this finite number. Otherwise, let $d=\lceil n\ln n\rceil$ .

Now, informally, we, in our role as an adversary, commit to assigning all covariates in $S_{j}$ the same weight. The following definition formalizes this idea.

We would like to show that applying $\phi$ to an $L_{2}$ packing yields a $\rho$ -packing, which is done in the following lemma.

Let $A$ be the least-norm interpolation algorithm. We will bound the accuracy of $A$ by bounding its performance in terms of an algorithm $C$ built using $A$ as a subroutine, as was done in a related context in . The definition of Algorithm $C$ is illustrated in Figure 1, which is reproduced from .

The definition uses the function $Q_{\alpha}$ that rounds its input to the nearest multiple of $\alpha$ . Algorithm $C$ applies algorithm $A$ to training data whose response variables have been modified. For each example $(x,y)$ , and simulated artificial noise $\varepsilon$ distributed as $N(0,1)$ , and artificial noise $\zeta$ distributed uniformly on $(-\alpha/2,\alpha/2)$ , Algorithm $C$ gives $(x,y+Q_{\alpha}(\varepsilon)+\zeta)$ to $A$ . The following lemma is similar to Lemma 5 of . One important difference is that we show that Algorithm $C$ approximates the linear function parameterized by $\theta^{*}$ , not its discretization.

If the linear interpolant algorithm $A$ has error $\tau$ from $n$ examples drawn from $N(0,\Sigma)$ with independent $N(0,1)$ noise with probability $1-\delta$ , and

then, in the absence of noise, Algorithm $C$ , given $n$ examples of the form $(x,Q_{\alpha}(\theta^{\top}x))$ , with probability $1-2\delta$ , achieves $\rho(\hat{\theta},\theta^{*})^{2}\leq\tau$ .
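The data modification performed by Algorithm $C$ can be sketched as follows: each response is shifted by quantized Gaussian noise $Q_{\alpha}(\varepsilon)$ plus a uniform term $\zeta$ before being passed to $A$ . The function names `quantize` and `perturb_response`, the tie-breaking rule, and the sample values are illustrative assumptions; the learning algorithm $A$ itself is not modelled.

```python
import math
import random

def quantize(v, alpha):
    """Q_alpha: round v to the nearest multiple of alpha (ties round up)."""
    return alpha * math.floor(v / alpha + 0.5)

def perturb_response(y, alpha, rng):
    """Response that Algorithm C passes to A: y + Q_alpha(eps) + zeta."""
    eps = rng.gauss(0.0, 1.0)                     # simulated N(0, 1) noise
    zeta = rng.uniform(-alpha / 2, alpha / 2)     # uniform noise of width alpha
    return y + quantize(eps, alpha) + zeta

rng = random.Random(0)
alpha = 0.25
ys = [0.0, 0.5, -1.25]                            # illustrative quantized responses
modified = [perturb_response(y, alpha, rng) for y in ys]

# Quantization never moves a value by more than alpha / 2.
max_shift = max(abs(quantize(v, alpha) - v) for v in [-0.9, 0.1, 0.37, 2.0])
print(modified, max_shift)
```

Because $\zeta$ has the same width as the quantization cell, $Q_{\alpha}(\varepsilon)+\zeta$ has a density, which is what allows it to be compared with genuine Gaussian noise in total variation.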

The proof of Lemma 41 will be deferred until we have proved some more lemmas.

Recall the definition of total variation distance, $d_{TV}(P,Q)=\sup_{E}|P(E)-Q(E)|$ . The following lemma is implicit in the proof of Lemma 6 of .

Let $\eta,\nu$ be random variables that are distributed according to $N(0,1)$ and let $\zeta$ be uniform over $[-\alpha/2,\alpha/2]$ .

We will use the following, which is implicit in the proof of Lemma 8 of .

If $P_{1},...,P_{n},Q_{1},...,Q_{n}$ are probability distributions over a domain $U$ , and $\chi$ is a $[0,1]$ -valued random variable defined on $U^{n}$ , then

Now, we are ready to prove Lemma 41. The proof closely follows the proof of Lemma 5 in .

Let $\zeta_{t}$ be a random variable with distribution $U_{\alpha}$ , where $U_{\alpha}$ is the uniform distribution over $(-\alpha/2,\alpha/2)$ . Let $B$ be the randomized algorithm that adds noise $\zeta_{t}$ to each $y_{t}$ value it receives, passes the result to Algorithm $A$ , and returns $A$ ’s output.

where $P_{1|x_{t}}$ is the distribution of $(\theta^{*})^{\top}x_{t}+\varepsilon_{t}$ .

Define $P_{2|x_{t}}$ as the distribution of $Q_{\alpha}((\theta^{*})^{\top}x_{t}+\varepsilon_{t})+\zeta_{t}$ . From Lemma 42, $d_{TV}(P_{1|x_{t}},P_{2|x_{t}})\leq\alpha$ . Applying Lemma 43 with $\chi$ as the indicator function for $E_{1}$ ,

Since $\alpha\leq\frac{\delta}{2n}$ , this implies

Let $P_{3|x_{t}}$ be the distribution of $Q_{\alpha}((\theta^{*})^{\top}x_{t}+\varepsilon_{t})$ , and let

Let $P_{4|x_{t}}$ be the distribution of $Q_{\alpha}((\theta^{*})^{\top}x_{t})+Q_{\alpha}(\varepsilon_{t})$ . Applying Lemma 43, we get

From Lemma 42, $d_{TV}(P_{3|x_{t}},P_{4|x_{t}})\leq\alpha$ , so

Averaging over the random choice of $X$ , the probability, for $(X,\boldsymbol{\zeta},\boldsymbol{\varepsilon})$ distributed as $N(0,\Sigma)^{n}\times U_{\alpha}^{n}\times N(0,1)^{n}$ , that $\rho(A(X,Q_{\alpha}(X\theta^{*})+Q_{\alpha}(\boldsymbol{\varepsilon})+\boldsymbol{\zeta}),\theta^{*})^{2}>\tau$ , is at most

But $A(X,Q_{\alpha}(X\theta^{*})+Q_{\alpha}(\boldsymbol{\varepsilon})+\boldsymbol{\zeta})$ is the output of the randomized algorithm $C$ , so this completes the proof. ∎

So, informally, we have shown that if the least-norm interpolant can learn unit-length weight vectors with noise and $N(0,\Sigma)$ data, then there is an algorithm $C$ that can learn from quantized data without noise. The next step is to lower bound the error of $C$ .

We will use the following, which is an immediate consequence of Corollary 23.

For each row $x_{t}$ of $X$ , and each $q>1$ ,

The proof of the following lemma borrows heavily from .

If $1/\alpha=O(n)$ , there is a constant $\tau$ such that, for any regression algorithm $C$ , for all large enough $n$ , if $C$ is given $n$ examples of the form $(X,Q_{\alpha}(X\theta^{*}))$ , if the rows of $X$ are $n$ independent draws from $N(0,\Sigma)$ , with probability at least $1/2$ , its output $\hat{\theta}$ satisfies $\rho(\hat{\theta},\theta^{*})^{2}>\tau$ .

Our assumption about the learning ability of $C$ implies that

For any $g,h\in G$ for which $Q_{\alpha}(Xg)=Q_{\alpha}(Xh)$ , since $\rho(g,h)>3\sqrt{\tau}$ , it cannot be the case that $\phi(X,g)$ and $\phi(X,h)$ are both $1$ . Thus, recalling that $x_{1},...,x_{n}$ are the rows of $X$ , and that all elements of $G$ have length at most 1, we have

since $d=\Theta(s)$ . Since $d=\Omega(n\log n)$ , for large enough $n$ and small enough $\tau$ , this contradicts (11), completing the proof. ∎

Now we are ready to put everything together to prove the second paragraph of Theorem 4. By Lemma 41, it suffices to prove that, for a small enough constant $\tau_{0}$ , if $1/\alpha=O(n)$ , then with probability at least $1/2$ , Algorithm $C$ , given examples $(x,Q_{\alpha}(\theta^{\top}x))$ , fails to achieve $\rho(\hat{\theta},\theta^{*})^{2}\leq\tau_{0}$ . By Lemma 45, this is the case, completing the proof.