Benign Overfitting in Linear Regression
Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler
Introduction
Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of non-zero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest-neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative, since deep neural networks can perform well even with a perfect fit to the training data.
As one example of this phenomenon, consider the experiment illustrated in Figure 1(c) in : standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. The deep networks in the experiments reported in achieved essentially zero cross-entropy loss on the training data. In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting (“… interpolating fits… [are] unlikely to predict future data well at all.” [25, p37]). Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that fit the training data perfectly.
In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression. That is, we consider quadratic loss and linear prediction rules, and we assume that the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters, , corresponding to the linear prediction rule that minimizes the expected quadratic loss. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of . Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, so there might be many interpolating solutions. We consider the most natural: choose the parameter vector with the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see Section 2.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate —without harming prediction accuracy.
Our main result is a finite sample characterization of when overfitting is benign in this setting. The linear regression problem depends on the optimal parameters and the covariance of the covariates . The properties of turn out to be crucial, since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into two terms. The first is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of ) is small compared to the sample size , the contribution to that we can view as coming from is not too distorted. The second term is more interesting, since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of in the subspace corresponding to low variance directions is large compared to . This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low variance (and hence unimportant) directions in parameter space where the label noise can be hidden.
The details are more complicated. The characterization depends in a specific way on two notions of effective rank, and ; the smaller one, , determines a split of into large and small eigenvalues, and the excess prediction error depends on the effective rank, as measured by the larger notion , of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of must decay slowly.
Studying the patterns of eigenvalues that allow benign overfitting reveals an interesting role for large but finite dimensions: in an infinite-dimensional setting, benign overfitting occurs only for a narrow range of decay rates of the eigenvalues. On the other hand, it occurs with any suitably slowly decaying eigenvalue sequence in a finite dimensional space whose dimension grows faster than the sample size. Thus, for linear regression, data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.
The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years, since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in Spring 2017. Belkin, Ma and Mandal described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces, and explained the mismatch between this phenomenon and classical generalization bounds. Belkin, Hsu and Mitra gave an example of an interpolating decision rule—simplicial interpolation—with an asymptotic consistency property as the input dimension gets large. That work, and subsequent work of Belkin, Rakhlin, and Tsybakov , studied kernel smoothing methods based on singular kernels that both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation (building on earlier consistency results for these unusual kernels). Liang and Rakhlin considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin, Hsu, Ma and Mandal studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and non-linear classes.
Subsequent to our work, considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning , we became aware of the concurrent work of Belkin, Hsu and Xu and of Hastie, Montanari, Rosset and Tibshirani . Belkin et al calculated the excess risk for certain linear models (a regression problem with identity covariance, sparse optimal parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al considered linear regression in an asymptotic regime, where sample size and input dimension go to infinity together with asymptotic ratio . They assumed that, as gets large, the empirical spectral distribution of (the discrete measure on its set of eigenvalues) converges to a fixed measure, and they applied random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as , the noise variance, and the eigenvalue distribution vary. They also studied the asymptotics of a model involving random nonlinear features. In contrast, we give upper and lower bounds on the excess prediction error for arbitrary finite sample size, for arbitrary covariance matrices, and for data of arbitrary dimension.
The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression and of various notions of effective rank of the covariance operator. Section 3 gives the characterization of benign overfitting, illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting, suggesting that slowly decaying covariance eigenvalues in input spaces of growing but finite dimension are the generic example of benign overfitting. Section 4 discusses the connections between these results and the benign overfitting phenomenon in deep neural networks. Section 5 outlines the proofs of the results.
Definitions and Notation
the conditional noise variance is bounded below by some constant ,
almost surely, the projection of the data on the space orthogonal to any eigenvector of spans a space of dimension .
By the projection theorem, parameter vectors that solve the least squares problem solve the normal equations, so we can equivalently write as the minimum norm solution to the normal equations,
Our main result gives tight bounds on the excess risk of this minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.
For the covariance operator , define for . If and for , define
Main Results
The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum-norm interpolating estimator.
For any there are for which the following holds. Consider a linear regression problem from Definition 1. Define
with probability at least , and
Moreover, there are universal constants such that for all , for all , for all , there is a with such that for and , with probability at least ,
In order to understand the implications of Theorem 4, we now study relationships between the two notions of effective rank, and , and establish sufficient and necessary conditions for the sequence of eigenvalues to lead to small excess risk.
The following lemma shows that the two notions of effective rank are closely related. See Appendix H for its proof, and for other properties of and . (All appendices may be found in the supporting material.)
, , and
Both notions of symmetry and lie between (when ) and (when the are all equal).
Theorem 4 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, should be small compared to the sample size (from the first term) and and should be large compared to . Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of non-zero eigenvalues should be large compared to , they should have a small sum compared to , and there should be many eigenvalues no larger than . If the number of these small eigenvalues is not much larger than , then they should be roughly equal, but they can be more assymmetric if there are many more of them.
The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires to have a heavy tail. (The proof—and some other examples illustrating the boundary of benign overfitting—are in Appendix I.) In particular, if we fix in an infinite-dimensional Hilbert space and ask when does the excess risk of the minimum norm estimator approach zero as , it imposes tight restrictions on the eigenvalues of . But there are many other possibilities for these asymptotics if can change with . Since rescaling affects the accuracy of the least-norm interpolant in an obvious way, we may assume without loss of generality that . If we restrict our attention to this case, then, informally, Theorem 4 implies that, when the covariance operator for data with examples is , the least-norm interpolant converges if , , and , and only if , , and , where for the universal constant in Theorem 4. For this reason, we say that a sequence of covariance operators is benign if
If , then is benign iff and .
and , then is benign iff and . Furthermore, for and ,
Compare the situations described by Parts 1 and 2 of Theorem 6. Part 1 shows that for infinite-dimensional data with a fixed covariance, benign overfitting occurs iff the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. Part 2 shows that the situation is very different if the data has finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs iff both the dimension is large compared to the sample size, and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared to the sample size.
These examples illustrate the tension between the slow decay of eigenvalues that is needed for to be small, and the summability of eigenvalues that is needed for to be small. There are two ways to resolve this tension. First, in the infinite dimensional setting, slow decay of the eigenvalues suffices—decay just fast enough to ensure summability—as shown by Part 1 of Theorem 6. (Appendix I gives another example, where the eigenvalue decay is allowed to vary with ; in that case, is benign iff the decay rate gets close—but not too close—to as increases.) The other way to resolve the tension is to consider a finite dimensional setting (which ensures that the eigenvalues are summable), and in this case arbitrarily slow decay is possible. Part 2 of Theorem 6 gives an example of this: eigenvalues that are all at least as large as a small constant. Appendix I gives another example, with a truncated infinite series that decays sufficiently slowly that their sum does not converge. Theorem 6(1) shows that a very specific decay rate is required in infinite dimensions, which suggests that this is an unusual phenomenon in that case. The more generic scenario where benign overfitting will occur is demonstrated by Theorem 6(2), with eigenvalues that are either constant or slowly decaying in a very high—but finite dimensional—space.
Deep neural networks
How relevant are Theorems 4 and 6 to the phenomenon of benign overfitting in deep neural networks? One connection appears by considering regimes where deep neural networks are well-approximated by linear functions of their parameters. This so-called neural tangent kernel (NTK) viewpoint has been vigorously pursued recently in an attempt to understand the optimization properties of deep learning methods. Very wide neural networks, trained with gradient descent from a suitable random initialization, can be accurately approximated by linear functions in an appropriate Hilbert space, and in this case gradient descent finds an interpolating solution quickly; see . (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, [29, Assumption A1] implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of [1, Theorem 5.1] becomes vacuous when independent noise is added to the s.) The eigenvalues of the covariance operator in this case can have a heavy tail under reasonable assumptions on the data distribution (see , where this kernel was introduced, and ), and the dimension is very large but finite, as required for benign overfitting. However, the assumptions of Theorem 4 do not apply in this case. In particular, the assumption that the random elements of the Hilbert space are a linearly transformed vector with independent components is not satisfied. Thus, our results are not directly applicable in this—somewhat unrealistic—setting. Note that the slow decay of the eigenvalues of the NTK is in contrast to the case of the gaussian and other smooth kernels, where the eigenvalues decay nearly exponentially quickly .
The phenomenon of benign overfitting was first observed in deep neural networks. Theorems 4 and 6 are steps towards understanding this phenomenon by characterizing when it occurs in the simple setting of linear regression. Those results suggest that covariance eigenvalues that are constant or slowly decaying in a high (but finite) dimensional space might be important in the deep network setting also. Some authors have suggested viewing neural networks as finite-dimensional approximations to infinite dimensional objects , and there are generalization bounds—although not for the overfitting regime—that are applicable to infinite width deep networks with parameter norm constraints . However, the intuition from the linear setting suggests that truncating to a finite dimensional space might be important for good statistical performance in the overfitting regime. Confirming this conjecture by extending our results to the setting of prediction in deep neural networks is an important open problem.
Proof
Throughout the proofs, we treat (the subgaussian norm of the covariates) as a constant. Therefore, we use the symbols to refer to constants that only depend on . Their values are suitably large (and always at least ) but do not depend on any parameters of the problems we consider, besides . For universal constants that do not depend on any parameters of the problem at all we use the symbol . Also, whenever we sum over eigenvectors of , the sum is restricted to eigenvectors with non-zero eigenvalues.
The first step is a standard decomposition of the excess risk into two pieces, a term that corresponds to the distortion that is introduced by viewing through the lens of the finite sample and a term that corresponds to the distortion introduced by the noise . The impact of both sources of error in on the excess risk is modulated by the covariance , which gives different weight to different directions in parameter space.
The excess risk of the minimum norm estimator satisfies
with probability at least over , and
Unit variance subgaussians
Our assumptions allow the trace of to be expressed as a function of many independent subgaussian vectors.
where .
By Assumption 2 in Definition 1, the random variables are independent -subgaussian. We consider in the basis of eigenvectors of , , to see that
For the second part, we use Lemma 20, which is a consequence of the Sherman-Woodbury-Morrison formula; see Appendix B.
by Lemma 20, for the case and . Note that is invertible by Assumption 5 in Definition 1. ∎
The weighted sum of outer products of these subgaussian vectors plays a central role in the rest of the proof. Define
Concentration of A𝐴A
The next step is to show that eigenvalues of , and are concentrated. The proof of the following inequality is in Appendix C. Recall that and denote the largest and the smallest eigenvalues of the matrix .
There is a constant such that for any with probability at least ,
The following lemma uses this result to give bounds on the eigenvalues of , which in turn give bounds on some eigenvalues of and . For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of should dominate the term involving its leading eigenvalue, which is a condition on the effective rank . The lemma shows that once is sufficiently large, all of the eigenvalues of are identical up to a constant factor.
There are constants such that for any , with probability at least ,
By Lemma 9, we know that with probability at least ,
First, the matrix has rank at most (as a sum of matrices of rank ). Thus, there is a linear space of dimension such that for all , , and so .
Second, by the Courant-Fischer-Weyl Theorem, for all and , (see Lemma 28). On the other hand, for , , so all the eigenvalues of are lower bounded by .
Choosing and gives the third claim of the lemma. ∎
Upper bound on the trace term
There are constants such that if , , and then with probability at least ,
The proof uses the following lemma and its corollary. Their proofs are in Appendix C.
Suppose is a non-increasing sequence of non-negative numbers such that , and are independent centered -subexponential random variables. Then for some universal constant for any with probability at least
where is the orthogonal projection on .
(of Lemma 11) Fix to its value in Lemma 10. By Lemma 8,
and the upper bounds on the ’s give
where is the span of the eigenvectors of corresponding to its smallest eigenvalues. So for ,
Next, we apply Corollary 13 times, together with a union bound, to show that with probability at least , for all ,
provided that and for some sufficiently large (note that and only depend on , and , and we can still take large enough in the end without changing and ). Combining (3), (4), and (5), with probability at least ,
Second, consider the second sum in (2). Lemma 10 shows that, on the same high probability event that we considered in bounding the first half of the sum, . Hence,
Notice that is a weighted sum of -subexponential random variables, with the weights given by the in blocks of size . Lemma 12 implies that, with probability at least ,
because . Combining the above gives
Finally, putting both parts together and taking gives the lemma. ∎
Lower bound on the trace term
There is a constant such that for any with , and any , with probability at least ,
Suppose and is a sequence of non-negative random variables, is a sequence of non-negative real numbers (at least one of which is strictly positive) such that for some and any , . Then
These two lemmas imply the following lower bound.
There are constants such that for any and any with probability at least ,
From Lemmas 8, 14 and 15, with probability at least ,
Now, if , then the second term in the minimum is always bigger than the third term, and in that case,
On the other hand, if ,
where the equality follows from the fact that the s are non-increasing. ∎
A simple choice of l𝑙l
Recall that is a constant. If no has , then Lemmas 7 and 16 imply that the expected excess risk is , which proves the first paragraph of Theorem 4 for large . If some does have , then the upper and lower bounds of Lemmas 11 and 16 are constant multiples of
For any and , if , we have
Finally, we can finish the proof of Theorem 4. Set in Lemma 16 and Theorem 4 to the constant from Lemma 11. Take to be the maximum of the constants from Lemmas 16 and 11.
The proof of the second paragraph is in Appendix K.
Conclusions and Further Work
Our results characterize when the phenomenon of benign overfitting occurs in high dimensional linear regression, with gaussian data and more generally. We give finite sample excess risk bounds that reveal the covariance structure that ensures that the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization depends on two notions of the effective rank of the data covariance operator. It shows that overparameterization, that is, the existence of many low-variance and hence unimportant directions in parameter space, is essential for benign overfitting, and that data that lies in a large but finite dimensional space exhibits the benign overfitting phenomenon with a much wider range of covariance properties than data that lies in an infinite dimensional space.
Acknowledgements
We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of Google through a Google Research Award. Part of this work was done as part the Fall 2018 program on Foundations of Data Science at the Simons Institute for the Theory of Computing. Gábor Lugosi was supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P and FEDER, EU; “High-dimensional problems in structured probabilistic models - Ayudas Fundación BBVA a Equipos de Investigación Cientifica 2017”; and Google Focused Award “Algorithms and Learning for AI”.
References
Appendix A Proof of Lemma 7
We first give the decomposition of the excess risk.
The excess risk of the minimum norm estimator satisfies
Since has mean zero conditionally on ,
Using (1), the definition of , and the fact that ,
Also, since has zero mean conditionally on , and is independent of , we have
The following lemma shows that we can obtain a high-probability upper bound on the term in terms of the trace of . It is Lemma 36 in .
Combining this with Lemma 18 implies Lemma 7.
Appendix B An Algebraic Property
We use the Sherman–Morrison–Woodbury formula to write
Denote and . Applying (6), we get
where we used the identity twice in the second last equality and the identity in the last equality. ∎
Appendix C Proof of concentration inequalities
We use some standard results about subgaussian and subexponential random variables.
First of all, we need the following direct consequence of Propositions 2.5.2 and 2.7.1 and Lemma 2.7.6 from :
There is a universal constant such that for any random variable that is centered, -subgaussian, and unit variance, is a centered -subexponential random variable, that is,
Second, we are going to use the following form of Bernstein’s inequality, which is Theorem 2.8.2 in :
There is a universal constant such that for any non-increasing sequence of non-negative numbers such that , and any independent, centered, -subexponential random variables , and any , with probability at least
where is the orthogonal projection on .
First of all, since — a sum of -subexponential random variables, by Corollary 23, for some absolute constant and for any , with probability at least ,
Thus, with probability at least
where the first inequality holds because the s are decreasing in magnitude, and the last two inequalities hold since the functions and are both increasing on and . ∎
There is a universal constant such that with probability at least ,
Let be a -net on the sphere with respect to the Euclidean distance such that . Applying the union bound over the elements of , we see that with probability , every satisfies
Since is a -net, by Lemma 25, we need to multiply the quantity above by to get the bound on the norm of the . Denote
Thus, with probability at least ,
When we can write , and we have
by the AMGM inequality. (Recall that denote universal constants with value at least , and is the subgaussian constant of a random variable with unit variance.) ∎
Appendix D Proof of Lemma 14
Fix with and . By Lemma 10, with probability at least ,
By Corollary 13, with probability at least ,
provided that and for some sufficiently large . Thus, with probability at least ,
Dividing by the square of both sides, we have
Also, from the Cauchy-Schwarz inequality and Corollary 13 again, we have that on the same event,
Choosing suitably large gives the lemma.
Appendix E Proof of Lemma 15
and denote its probability as for some . On the one hand, by the definition of the event, we have
On the other hand, note that for any ,
Appendix F Proof of Lemma 17
We can write the function of being minimized as
where is the largest value of for which
since the are non-increasing. This condition holds iff
The definition of implies . So we can write
and so the minimizing is . Also,
Appendix G Eigenvalue monotonicity
Recall (half of) the Courant-Fischer-Weyl theorem.
If symmetric matrices and satisfy , then, for any , we have .
Appendix H Rank facts
The quantity is an important complexity parameter for covariance estimation problems, where it has been called the ‘effective rank’ . Earlier, was called the ‘stable rank’ and the ‘numerical rank’ , although that term has a different meaning in computational linear algebra [23, p261].
, , and .
The first inequality and the equality are immediate from the definitions. Together they imply . For the second inequality,
Substituting this in the equality implies . ∎
Writing and for and ,
Thus, the function satisfies the monotonicity property whenever .
Since , , so
Appendix I Conditions on eigenvalues
In this section, we prove the following expanded version of Theorem 6.
Define for all .
If , then is benign iff and .
If , then is benign iff . Furthermore,
then is benign iff either , and or , and .
and , then is benign iff and . Furthermore, for and ,
We build up the proof in stages. First, we characterize those sequences of effective ranks that can arise.
Consider some positive summable sequence , and for any non-negative integer denote
Then and . Moreover, for any positive sequence such that and for every , there exists a positive sequence (unique up to constant multiplier) such that . The sequence is (a constant rescaling of)
which goes to zero if and only if . On the other hand, we may rewrite the first equality in the proof as
So for any sequence we can uniquely (up to a constant multiplier) recover the sequence such that — the only candidate is
However, for such one can compute
so the resulting sequence sums to 1, and
Suppose is some constant, and . Suppose also that the sequence is increasing. Then, as goes to infinity, goes to zero if and only if goes to infinity.
We prove the “if” part separately from the “only if” part.
If then .
Fix some . Since , there exists some such that for any , Thus, for all ,
Since the constant is arbitrary, goes to infinity.
If then .
Fix some constant . Since there exists some such that for any , . Thus, for any
Since the constant is arbitrary, goes to zero.
Suppose the sequence is increasing and as . Then a sufficient condition for is
For example, this condition holds for .
Since and , it is enough to prove that as goes to infinity. Since
and it is sufficient to prove that the latter quantity goes to zero. We write
Since both numerator and denominator are decreasing in and go to zero as , we can apply the Stolz–Cesáro theorem (an analog of L’Hôpital’s rule for discrete sequences):
where the last line is due to our sufficient condition. ∎
Part 1, if direction, first term: We have
Part 1, if direction, second term: By Theorem 33, it suffices to prove that . This holds because
Part 1, if direction, third term: By Theorem 34, it suffices to prove that , that is
As argued above, when and , , so it suffices to show that . We have
Since is non-increasing, we have
If we define on the positive reals by , then is convex, and, since , we have
which goes to infinity for large , completing the proof of the “if” direction of the third term of Part 1.
Part 1, only if direction, : If , then
which does not grow faster than . Thus, by Theorem 33, does not go to zero.
Part 1, only if direction, , or and : In this case, since, as above
and diverges in this case, does not go to zero.
Before starting on Part 2, let us define and .
Part 2, if direction, first term: We have
so which goes to zero with if .
Part 2, if direction, second term: First,
Thus, , so that .
Part 2, if direction, third term: We bound from below by separately bounding its numerator and denominator:
So now we want a lower bound on . For that, we need an upper bound on , and
This implies . This, together with the fact that, for , is an increasing function of , implies that, for large enough , . Since , this implies that . Combining this with (7), for large enough
Thus .
Part 2, only if direction, : We have
so , which is bounded below by a constant for large if .
Part 2, only if direction, : Recall that, in the proof of the “if” direction of the third term, we showed that . This implies that .
Part 3: Suppose that is benign. Then because , we must have . Thus, we can restrict our attention to the sequences for which and find the necessary and sufficient conditions for that class.
Next, for any positive and any natural number , we can write
As the sequence can only be benign if , we can only consider values of that do not exceed some constant fraction of , e.g. . Since , noting that, for , the sign of flips when crosses , we can write, uniformly for all ,
Recall that we consider for . Using the formula above, we get uniformly for all
Recall that . We compute
One can see that for , , so the sequence is not benign for . On the other hand, for .
Next, analogously to the asymptotics for , we have
Since , we can write uniformly for all
Now we plug in instead of . Recall that for , and for . We get
Since , for any , . For the necessary and sufficient for is .
So far, we obtained the necessary and sufficient conditions for the last terms to go to zero. Now let’s look at the upper bound for the first term: since , we just need . We write, for ,
Thus, for , goes to zero if and only if , and for , goes to zero if and only if .
Part 4: Suppose that is benign. Then because , we must have . Also,
and so . Since benign implies , and hence , we consider . In this regime,
Substituting gives
which shows that . Thus, if is benign, we must have , that is, .
Conversely, assume and (that is, ). Set , for some , which we shall see is . Notice that , so and . Thus,
which shows that . Also, we have
Now, it is clear that , , and imply that is benign.
Appendix J Upper bound on the B𝐵B term
We can control the term in Lemma 7 using a standard argument.
There is a constant , that depends only on , such that for any , with probability at least ,
Moreover, for any in the orthogonal complement to the span of the columns of ,
Thus, due to Theorem 9 in , there is an absolute constant such that for any with probability at least ,
Appendix K Another lower bound
In this section, we prove the second paragraph of Theorem 4.
since, otherwise, the lower bound is vacuously satisfied.
so that, informally, a successful learning algorithm achieves .
Define sets of indices as follows. Let ; let , for the least such that . Continue the same way as long as possible; for all , let , where is the least index such that .
Definition 36 produces sets.
yielding the contradiction and completing the proof. ∎
If the number of sets produced by the process of Definition 36 is finite, let be this finite number. Otherwise, let .
Now, informally, we, in our role as an adversary, commit to assigning all covariates in the same weight. The following definition formalizes this idea.
We would like to show that applying to an packing yields a -packing, which is done in the following lemma.
Let be the least-norm interpolation algorithm. We will bound the accuracy of by bounding its performance in terms of an algorithm built using as a subroutine, as was done in a related context in . The definition of Algorithm is illustrated in Figure 1, which is reproduced from .
The definition uses the function that rounds its input to the nearest multiple of . Algorithm applies algorithm to training data whose response variables have been modified. For each example , and simulated artificial noise distributed as , and artificial noise distributed uniformly on , Algorithm gives to . The following lemma is similar to Lemma 5 of . One important difference is that we show that Algorithm approximates the linear function parameterized by , not its discretization.
If the linear interpolant algorithm has error from examples drawn from with independent noise with probability , and
then, in the absence of noise, Algorithm , given examples of the form , with probability , achieves .
The proof of Lemma 41 will be deferred until we have proved some more lemmas.
Recall the definition of total variation distance, . The following lemma is implicit in the proof of Lemma 6 of .
Let be random variables that are distributed according to and let be uniform over .
We will use the following, which is implicit in the proof of Lemma 8 of .
If are probability distributions over a domain , and is a $U^{n}$ then
Now, we are ready to prove Lemma 41. The proof closely follows the proof of Lemma 5 in .
Let be a random variable with distribution , where is the uniform distribution over . Let be the randomized algorithm that adds noise to each value it receives, passes the result to Algorithm , and returns ’s output.
where is the distribution of .
Define as the distribution of . From Lemma 42, . Applying Lemma 43 with as the indicator function for ,
Since , this implies
Let be the distribution of , and let
Let be the distribution of . Applying Lemma 43, we get
From Lemma 42, , so
Averaging over the random choice of , the probability, for distributed as , that , is at most
But is the output of the randomized algorithm , so this completes the proof. ∎
So, informally, we have shown that if the least norm interpolant can learn unit length weight vectors with noise and data, then there is an algorithm than can learn from quantized data without noise. The next step is to lower bound the error of .
We will use the following, which is an immediate consequence of Corollary 23.
For each row of , and each ,
The proof of the following lemma borrows heavily from .
If , there is a constant such that, for any regression algorithm , for all large enough , if is given examples of the form , if the rows of are independent draws from , with probability at least , its output satisfies .
Our assumption about the learning ability of implies that
For any for which , since , it cannot be the case that both and are both . Thus, recalling that are the rows of , and that all elements of have length at most 1, we have
since . Since , for large enough and small enough , this contradicts (11), completing the proof. ∎
Now we are ready to put everything together to prove the second paragraph of Theorem 4. By Lemma 41, it suffices to prove that, for a small enough constant , if , with probability , Algorithm , given examples , with probability , fails to achieves . By Lemma 45, this is the case, completing the proof.