Learning without Concentration
Shahar Mendelson
Introduction
Our aim is to study the error of Empirical Risk Minimization (ERM), performed in a convex class and relative to the squared loss.
To be more precise, let be a class of real-valued functions on a probability space and let be an unknown target function. One would like to find some function in that is almost the ‘closest’ to in some sense.
Unlike questions in Approximation Theory, the point in learning problems is to approximate using random data – an independent sample selected according to the joint distribution defined by and the target .
One way of using the given data is by selecting a random element in , denoted by , that minimizes the empirical loss
where here, and throughout this article, denotes the empirical mean of the function with respect to the given sample.
With this choice of a learning procedure, is called the empirical minimizer, and the procedure that selects is Empirical Risk Minimization (ERM).
In the context of the estimation problem, it is natural to measure the success of ERM and the effectiveness of the choice of in the following way: that for ‘most’ samples , ERM produces a function that is close in to the best approximation of the target in ; that is, a high probability upper estimate on
for the empirical minimizer selected according to the data .
Our starting point is a well known result that deals with this very question: controlling the distance in between the function produced by ERM and . Theorem 1.1, formulated below, has been established in (see Corollary 5.3 there and also Theorem 5.1 in the survey ).
Let be the ball of radius , centred in . Thus, . For any , let
where are independent, symmetric, -valued random variables (random signs) that are independent of , and the expectation is with respect to both and .
There exist absolute constants and for which the following holds. If is a closed, convex class of functions that are bounded by and the target is also bounded by , then for every , with probability at least ,
The proof of Theorem 1.1 relies heavily on the fact that consists of functions that are bounded by and that the target is also bounded by . Both are restrictive assumptions and exclude many natural problems that one would like to consider.
1. Gaussian noise: arguably the most basic statistical problem is when for some and that is a centred gaussian variable with variance that is independent of . Thus, the given data consists of ‘noisy’ measurements of corrupted by gaussian noise.
Since a gaussian random variable is unbounded, Theorem 1.1 cannot be used to address the estimation problem that involves gaussian noise, regardless of the choice of .
2. Heavy-tailed noise: The vague term ‘heavy-tailed function’ is used to describe a function for which the measure of its tail, , decays to zero relatively slowly – for example, polynomially in . In particular, such a function need not be bounded. Hence, any kind of an estimation problem that involves a heavy-tailed target cannot be treated using Theorem 1.1.
Let be a random sign that is independent of , fix and , and put Y=\bigl{<}t_{0},\cdot\bigr{>}+\sigma\varepsilon_{n+1}.
More accurately, we will show that the outcome of Theorem 1.1 is as follows: there are absolute constants and and
On the other hand, it follows from that the correct rate for this problem is very different: if
then with probability at least ,
for absolute constants and .
Therefore, the outcome of Theorem 1.1 scales incorrectly with and with , and most notably, the estimation error does not converge to zero when becomes smaller and the problem is almost ‘noise-free’.
This example belongs to the persistence framework, which will be explored in greater detail later on. It, and other examples like it show that the poor outcome of Theorem 1.1 is endemic rather than merely an accident.
To get a clearer picture of the reasons why Theorem 1.1 is truly suboptimal, let us take a closer look at its assumptions.
The point-wise boundedness assumption in Theorem 1.1 has been used frequently in Learning Theory, and for a good reason. It allows one to invoke two important tools from empirical processes theory, which are simply not true in a more general setup.
2. Concentration methods: Talagrand’s concentration inequality for empirical processes indexed by a class of uniformly bounded functions has played an essential role in Learning Theory - and Theorem 1.1 is no exception. It is used to show that with high probability, the supremum of an empirical process is almost the same as its expectation, as well as the Rademacher averages of the class (see, e.g., and references therein for more details on the role of Rademacher averages in Learning Theory).
It takes no more than a glance to see that contraction and concentration are crucial to the proof of Theorem 1.1. A more careful examination shows that there is no realistic hope of extending the proof beyond the restricted setup of classes of uniformly bounded functions and a bounded target without introducing a totally new argument.
The noise barrier
An important observation that reflects the suboptimal nature of Theorem 1.1 is that the estimation error is insensitive to the noise level. The term ‘noise level’ is used here to describe the distance between the target and the class, either with respect to the norm, or with respect to a different norm, depending on the situationThe reason for naming the distance between the class and the target ‘noise level’ is the very simple choice of a target for some , and an independent, centred random variable , representing the noise. It is straightforward to verify that for such a target , and the distance between the class and the target is – the variance of ..
One would expect that when the noise level (distance) tends to zero and the problem becomes almost noise-free (or realizable), should decrease, and in some cases should even tend to zero. Among these problems are compressed sensing and phase recovery, in which the exact recovery of the signal is possible (see the discussion in ). Unfortunately, Theorem 1.1 is insensitive to the noise level, and this by-product of its proof cannot be resolved without totally changing the method of analysis.
All these observations are strong indications that if Theorem 1.1 is to be radically improved, the entire concentration-contraction framework on which it is based, must be abandoned. Moreover, any alternative framework should address two core issues:
It must be able to handle ‘heavy-tailed’ problems.
The estimation error must scale correctly with the key parameters of the problem, especially with the noise level.
Towards a heavy-tailed framework
In view of the requirements outlined above, an improved framework has to contend with functions that may have heavy tails, and certainly need not be bounded. We will first explain why standard concentration-based arguments fail miserably when faced with a heavy-tailed scenario. We will then suggest an alternative to standard concentration, introduce the complexity parameters that arise naturally in the suggested framework, and explain their statistical interpretation.
The title of this article may create the wrong impression – that concentration methods are not needed and will not take any part in the analysis of ERM. Actually, the way the title should be understood is different: that learning is possible even when concentration is false.
As a starting point, and seeing that one is interested in the squared loss, one should note the substantial difference between a two-sided concentration inequality, stating, for example, that with high probability,
and just the lower bound on the empirical mean,
If are independent copies of , then with probability at least there exists some for which . On that event,
for a suitable absolute constant – and that a similar estimate is true for any random variable for which is well behaved, with depending only on this ratio.
The difference between an upper estimate and a lower one is even more obvious if one only assumes that for fixed constants and . Using a simple binomial estimate, one may show that with probability at least , for a suitable absolute constant . However, there is no hope of obtaining any upper estimate on based on the given information.
An immediate consequence of these observations is that the standard method of analysis for the estimation problem, which is based on a two-sided concentration argument that holds with exponential probability, can never work in heavy-tailed situations. Thus, one must find a different argument altogether if one wishes to deal with learning problems that include classes of heavy-tailed functions or with a heavy-tailed target.
The difference between a two-sided estimate and a lower bound takes centre-stage when one realizes that the main component in solving the estimation problem is actually a lower bound on
rather than a two-sided bound. The significance of this lower bound will be made clear in Section 3. It will eventually lead to a sharp estimate on even when two-sided concentration is impossible.
2 Two parameters for two regimes
When one considers the performance of a learning procedure, it is reasonable to expect two different ‘performance regimes’, according to the difficulties the learner faces. As will be explained below, there are clear differences between ‘low noise’ problems, in which the target is close to and the rate one should expect is close to the noise-free (realizable) rate, and ‘high noise’ problems, when is far enough from and the interaction between the target and the class determines the estimation error. One may also expect that just as in a realizable problem, the complexity parameter governing the ‘low-noise’ regime is intrinsic to and therefore should not depend on at all, while the parameter controlling the ‘high-noise’ regime depends on in one way or another.
We begin with the definition of the first parameter we will be interested in, and which governs the ‘low-noise’ regime.
Given a class of functions and , set
where the expectation is taken with respect to both and .
Note that is indeed an intrinsic parameter, in the sense that it depends only on the class and not on the exact nature of the ‘noise’ . From a purely technical perspective, measures when the Rademacher averages of the ‘localized’ set scale like rather than like the normalization , which has been used in the definition of and in Theorem 1.1. Thus, will always be much smaller than when dealing with , as we do.
Off-hand, the meaning and significance of is not obvious. It is, perhaps, surprising that it captures properties of the version space of the problem.
The version space is a random subset of that consists of all the functions in the class that agree with on the sample . When the problem is noise-free (), a learning procedure makes significant mistakes only when there are functions in that, despite being ‘far-away’ from , still satisfy that for every . In other words, significant mistakes occur in a noise-free problem only when the version space is large.
As noted earlier, it seems plausible that when the noise level is low rather than zero, that is, when is close enough to , the situation does not change significantly and mistakes are essentially due to a large version space. Thus, when trying to bound the error of ERM, the first order of business is to identify a parameter that captures the ‘size’ of the version space, and gives thats and much more.
We will show that with high probability, if , then
for an appropriate constant , and with (almost) no assumption on (see Theorem 5.3 for the exact formulation).
Immediate outcomes of (2.1) are that the version space cannot include functions for which and that sampling is ‘stable’ for functions that are not too close to . Indeed, (2.1) implies that on a large event, the empirical distances are at least a fixed proportion of the original distances .
Controlling the interaction with the noise
The second regime is encountered once the noise level increases, and mistakes happen for a totally different reason: the ‘interaction’ of the target with class members. It turns out that this interaction is captured by the following parameter.
Let represent the noisecalling ‘the noise’ is a natural name when for some and that is independent of . We will use that name even when does not have that form. and set
The fixed point is of a similar nature to and the two share the same scaling, of the order of , but with one key difference: the supremum of the multiplier process in (2.2) measures the maximal correlation elements of the random set have with the random symmetrized noise vector , while the random process used in the definition of ,
measures the maximal correlation elements of the same random set have with the random vector . The obvious difference is that represents a ‘generic’ noise and has nothing to do with the specific noise that the learner has to deal with.
Note that if functions in and happen to be bounded by , then is bounded by . Applying a standard contraction argument (see, e.g. ) it follows that for every ,
and thus one may show that is dominated by in the bounded case. Moreover, while is insensitive to the noise level, because the contraction argument used in (2.3) destroys any dependence on the ‘noise multipliers’ , is highly affected by the noise – and in a favourable way. When is close to in the right sense, is close to zero, leading to smaller multipliers , and thus to a smaller fixed point .
The main result
Next, let us explain why this heuristic description of the roles of and is reasonable, and why splitting the estimation problem to two components, each captured by one of the two parameters, is the first step in bypassing the concentration-contraction mechanism used in the proof of Theorem 1.1.
For every , let be the excess loss functional associated with , which is defined by
Let be the excess loss class and note that it has two important properties.
Since , the empirical minimizer satisfies .
Therefore, if is a sample for which
then – simply because an empirical minimizer cannot belong to the set .
Setting , observe that by (3),
Hence, one may bound from below by showing that with high probability, and in rather general situations,
1. The version space condition holds: that is, if then
2. The noise interaction condition holds: that is, if , then
Therefore, if is chosen to be smaller than and if is a sample for which both (1) and (2) hold, then when . In particular, on that event, the estimation error satisfies
Let us emphasize that one may obtain a very good lower bound on the quadratic component of (and thus on the version space condition), solely because the required estimate is one-sided. A two-sided bound, obtained by an upper estimate on centred quadratic process
requires to be highly regular in some sense (see for two-sided estimates on the quadratic process). And, as noted earlier, a two-sided bound of this type may be false even for a single function, let alone for a class of functions, unless one imposes rather restrictive assumptions.
The key assumption leading the lower bound on the quadratic term is the following:
Let be a class of functions and set
Given a class of functions , let . We will assume that there is some for which .
In Section 4 we will present several generic examples showing that this weak small-ball condition is indeed minimal, and that in many cases one may choose and to be appropriate absolute constants.
This notion of small-ball is very different from concentration. The empirical mean of a function will concentrate around its true mean only if the function has well-behaved high moments. Such a condition is trivially satisfied when the function is a bounded, but in general it is highly restrictive. In contrast, it is well known that any sort of moment equivalence, even as weak as , leads to the nontrivial small-ball estimate
for constants and that depend only on (see Section 4 for more details).
The significance of (3.2) is that it may be used to derive a very high probability lower bound on . Indeed, let be independent, distributed according to . A straightforward binomial estimate shows that with probability at least , at least of the ’s are larger than . Hence, on that event,
We are, at last, ready to formulate the main result of the article.
Let be a closed, convex class of functions, set to be the unknown target and put .
Fix for which and set . For every , with probability at least one has
The proof of Theorem 3.1 will be presented in Section 5.
Unlike Theorem 1.1, Theorem 3.1 holds with essentially no restrictions on the class or on the target. In particular, it may be applied in all the examples described in the introduction that fall outside the scope of Theorem 1.1, once satisfies Assumption 3.1.
Consider a family of estimation problems in {\cal F}_{R}=\left\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\right\} and for a set of reasonable targets. In the persistence framework, the dimension , the radius and the noise level are allowed to grow with the sample size , and one has to find conditions on , and that still ensure that tends to zero with high probability. Therefore, one has to identify the correct way in which the estimation error scales with each one of these parameters.
We will study the persistence problem when has bounded iid coordinates and Y=\bigl{<}t_{0},\cdot\bigr{>}+W for and a bounded, symmetric random variable that is independent of . Although the problem fits the assumptions of Theorem 1.1, the outcome of Theorem 3.1 turns out to be superior by far, and, in fact, leads to the optimal estimate in the minimax sense.
2 Is Theorem 3.1 optimal?
The question of whether Theorem 3.1 is optimal is rather delicate and requires highly technical machinery if it is to be answered in full. To keep the scope and length of the article within reason, we will only sketch a few facts indicating that Theorem 3.1 is optimal, at least for a wide variety of classes.
As noted earlier, the intrinsic parameter , which characterizes the low-noise regime, is also an upper estimate on the diameter of the version space associated with in . There are many examples in which this estimate is sharp, including the class of linear functionals used in the persistence framework, endowed by .
As it turns out, the diameter of the version space is a lower bound on the estimation error in a rather general sense and independently of the learning procedure that is used. Indeed, fix a mean-zero random variable that is independent of , consider an arbitrary class and the family of targets , where . Given and ,
where the probability is taken with respect to the data .
Therefore, captures the estimation error in the low-noise regime, with the exception of the (rare) examples in which it is far from the typical diameter of the version space in .
As for , there are strong indications it is the right parameter for describing the interaction between the target and the class. Optimality questions of that flavour have been studied in , featuring arbitrary classes of function and targets for some and mean-zero gaussian variables that are independent of .
The results of show that a ‘subgaussian version’ of is optimal, under mild assumptions on the canonical gaussian process indexed by . Without going into accurate and rather technical definitions, one may show that if and satisfy a subgaussian condition, and if the canonical gaussian process indexed by , , is sufficiently regular, then
is the optimal complexity parameter for the high-noise regime in such estimation problems, for a constant that depends only on the subgaussian properties of .
Hence, for the right choices of and , coincides with , showing its optimality.
Unfortunately, extending this somewhat sketchy argument to the general case studied here, in which the subgaussian machinery is not at one’s disposal, requires methods that are well beyond the scope of this article. It is therefore deferred to future work.
Some Examples
Let us turn to situations one is likely to encounter in a heavy-tailed framework: classes of functions for which one has almost no moment control, and therefore, its members do not exhibit any useful two-sided concentration of empirical means; nevertheless, one may obtain a small-ball estimate as in Assumption 3.1.
Let be a class of functions on a probability space .
1. If for every , then there are constants and that depend only on for which .
2. If there are and for which for every , then there are constants and that depend only on and for which .
Lemma 4.1 is an immediate outcome of the Paley-Zygmund inequality (see, e.g. ). For example, if and then by the Paley-Zygmund inequality, for every ,
Thus, when applied to each and for , it follows that , and the small-ball condition holds uniformly in with reasonable constants.
The proof of Lemma 4.2 is presented in Section 6.1.
If either one of the moment conditions of Lemma 4.2 holds for , then with probability at least , ERM produced for which
for appropriate constants , and that depend only on or on and .
Needless to say that this problem falls outside the scope of Theorem 1.1, as functions in and need not be bounded, nor do they necessarily have rapidly decaying tails.
Let us turn to an example that illustrates the striking difference between Theorem 1.1 and Theorem 3.1, even in a bounded scenario, and in which and are not difficult to compute.
Let be a random vector with independent coordinates distributed according to a mean-zero, variance random variable . To give Theorem 3.1 and Theorem 1.1 a ‘level playing field’, assume that almost surely.
Recall that Problem 4.4 has been mentioned in the introduction for and .
The following two statements summarize the outcomes of Theorem 1.1 and Theorem 3.1 for this choice of and . The proofs of the claims may be found in Section 6.2.
Let us begin with the outcome of Theorem 1.1:
For every there exist constants and that depend only on for which the following holds. Assume that . Set
The estimation error in Theorem 4.5 does not scale correctly with ( is clearly too big) and also does not depend on any norm of the noise , except for the trivial bound , which is likely to be much larger than, say, the variance .
In contrast, the next result follows from Theorem 3.1. To formulate it, set
It turns out that , which is slightly larger than but smaller than for any , captures the noise level of the problem. Since in virtually all examples the and norms are equivalent, we will abuse notation and denote rather than .
For every there exist constants and that depend only on for which the following holds. Assume that and that . Put
The results of show that up to the exact probability estimate, the estimation error given in Theorem 4.6 is optimal in the minimax sense, and no procedure can perform with confidence larger than and yield a estimation error better than .
It turns out that Theorem 4.6 can be extended even further. For example, it holds for a general target rather than just for Y=\bigl{<}t_{0},\cdot\bigr{>}+W, the assumption that has iid coordinates can be relaxed, and ‘heavy-tailed’ measures may be used instead. Unfortunately, the proof of a more general version of Theorem 4.6 comes at a high technical cost and we refer the reader to and for more details.
Proof of Theorem 3.1
We begin this section with a few definitions that will be needed in the proof of Theorem 3.1.
A class is star-shaped around if for every and any , . Thus, a class is star-shaped around if it contains the entire interval whenever .
We will sometimes write instead of .
Let and observe that . Also, it is straightforward to verify that if is star-shaped around and then
The main component in the proof of Theorem 3.1 is the following:
Let be a closed, convex class and assume that there is some for which . Given , set . Then, for every , with probability at least , if , one has
The first step in the proof of Theorem 5.3 is the following uniform empirical small-ball estimate – which is of a similar nature to the results from and .
Let be the unit sphere and let . Assume that there is some for which . If
then with probability at least ,
Let . By the bounded differences inequality applied to (see, for example, ), it follows that for every , with probability at least ,
Note that is a Lipschitz function that vanishes in and with a Lipschitz constant . Therefore, by the Giné-Zinn symmetrization theorem and the contraction inequality for Bernoulli processes (see, e.g. ),
Hence, with probability at least , for every ,
If , set and . Since
it follows that with probability at least ,
Let be star-shaped around and assume that there is some for which . Then, for every , with probability at least , for every that satisfies ,
Proof. Let and since is star-shaped around ,
Clearly, and
By Theorem 5.4 applied to the set and since , it follows that with probability at least , for every ,
Next, fix any for which . is star-shaped around and thus , implying that . The claim follows because (5.2) is positive homogeneous,
Proof of Theorem 5.3. Given , let . Applying Corollary 5.5, it suffices to show that is star-shaped around . To that end, observe that if and , then for . Hence, as is convex, .
Proof of Theorem 3.1. Recall that , that , and that for every ,
Since , the assertion of Theorem 5.3 for holds with replacing the larger . Hence, if , then with probability at least , for every that satisfies , one has
Therefore, on that event, if ,
Fix to be named later and consider . Thus, with probability at least , if then
and by the Giné-Zinn symmetrization theorem , with probability at least ,
Fix a sample for which (5.4) holds and consider that satisfies . Since is star-shaped around and
Combining (5.5) and (5.3), it is evident that with probability at least
if then
provided that . On that event,
Additional proofs
Here, we will present proofs of some of the claims that have been formulated in previous sections.
We will prove a stronger statement: that the equivalence of the -th moment and the second moment of the linear forms \bigl{<}X,t\bigr{>} holds for any isotropic random vector that is also unconditional and whose coordinates belong to for a fixed . That is, when has the same distribution as for every choice of signs (unconditionality), and .
Such a random vector can be heavy-tailed, as its coordinates may only belong to an space for a fixed that is close to . Thus, the empirical mean of, say, \bigl{<}X,e_{j}\bigr{>}^{2}=x_{j}^{2} may exhibit rather poor concentration around the true mean, which is . Despite that, the claim is that \|\bigl{<}X,t\bigr{>}\|_{L_{p}}\leq c\sqrt{p}\kappa\|\bigl{<}X,t\bigr{>}\|_{L_{2}}, and any class of linear functionals satisfies a small-ball condition with constants that depend only on and .
Clearly, a possible choice of is a random vector with independent, symmetric, variance one coordinates, and a standard symmetrization argument may be used to show that the same holds when the coordinates are mean-zero rather than symmetric. Thus, the unconditional case extends Lemma 4.2.
and is a constant that depends only on and .
implying that \|\bigl{<}X,t\bigr{>}\|_{L_{1}}\geq c(\lambda,\kappa)\|\bigl{<}X,t\bigr{>}\|_{L_{2}}.
2 The persistence problem
The proofs of Theorem 4.5 and Theorem 4.6 follow from several observations.
The norm of a random variable is
A random variable is called -subgaussian if .
It is standard to verify that is equivalent to the smallest constant for which for every . Also, is equivalent to (see, e.g. ).
Another property of random variables is that if are independent and mean-zero, then for every ,
where is a monotone non-increasing rearrangement of .
There exists an absolute constant for which the following holds. Assume that are independent copies of a mean-zero, variance random variable , and that for every , . Then for every ,
Proof. For every and ,
Therefore, if and then
The final component needed in the proofs of the two theorems is the following. Note that if {\cal F}_{R}=\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\} and Y=\bigl{<}t_{0},\cdot\bigr{>}+W for and that is independent of , then f^{*}=\bigl{<}t_{0},\cdot\bigr{>}. Also, by the convexity and symmetry of ,
Therefore, if is a random vector with independent, mean-zero, variance 1 coordinates , and are independent copies of , then
where has independent coordinates. Each is distributed as , and are independent copies of . Hence, by Lemma 6.4,
If the ’s are distributed according to a mean-zero and variance random variable that is bounded in by , then and is -subgaussian. Moreover, , and as a weighted sum of independent random variables, for a suitable absolute constant . Hence, for every , .
provided that . Therefore, in that range,
while if one may verify that
for a suitable absolute constant .
Combining (6.2) with Theorem 1.1, there is a high probability event on which, if ,
Proof of Theorem 4.6. Since the argument is rather standard and follows a similar path to the previous proof, we will omit some of the details.
To prove the result, one has to bound and when and . Using (6.2) as above, it is evident that if ,
for suitable constants and that depend only on and . Moreover, using the notation of Theorem 3.1, and thus, is a constant that depends only on as well.
for constants and that depend only on .
Next, to bound , set , let
and recall that and are independent. By a standard multiplier theorem (see, e.g., Chapter 2.9),
and since is an isotropic, -subgaussian vector, it follows from Talagrand’s Majorizing Measures Theorem that
where are independent, standard gaussian variables and is an absolute constant.
and constants that depend only on .
Indeed, for every and since and are independent,
Hence, for a constant that depends only on , and , one has that with probability at least , . Therefore, , which completes the proof.
Concluding Remarks
Although it seems that the complexity parameters and (and as well) depend on the unknown function , in virtually all applications one may replace the indexing set with , where is the unit ball in – with the obvious modifications to the definitions of the parameters. Moreover, if is centrally symmetric (i.e. if then ) in addition to being convex, then and the indexing set becomes .
The fact that depends on an average while is based on a probability estimate is just an outcome of this presentation. It is possible to obtain a similar result using a ‘high probability’ version of , by using a slightly different argument. On the other hand, while one may replace with an averaged version, the latter makes little sense in heavy-tailed situations, as passing from an average-based parameter to a high probability one reverts to the question of concentration. In heavy-tailed situations the mean need not represents the typical behaviour, and should not be used when trying to obtain very high probability bounds.
Another fact worth mentioning is that the results presented here can be extended well beyond the squared loss, to arbitrary convex loss functions (see ).
for midpoints that belong to the intervals whose ends are and , and thus depend on and on the sample .
Just as in the squared-loss case, cannot be an empirical minimizer if . Therefore, using (7), one may obtain a positive lower bound on by identifying the levels and , for which, if then
for some constant , and if then
On that event the estimation error satisfies
for constants that are independent of .
A version of Theorem 3.1 for a general convex loss can be found in .