Learning without Concentration

Shahar Mendelson

Introduction

Our aim is to study the error of Empirical Risk Minimization (ERM), performed in a convex class and relative to the squared loss.

To be more precise, let F{\cal F} be a class of real-valued functions on a probability space (Ω,μ)(\Omega,\mu) and let YY be an unknown target function. One would like to find some function in F{\cal F} that is almost the ‘closest’ to YY in some sense.

Unlike questions in Approximation Theory, the point in learning problems is to approximate ff^{*} using random data – an independent sample (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N} selected according to the joint distribution defined by μ\mu and the target YY.

One way of using the given data (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N} is by selecting a random element in F{\cal F}, denoted by f^\hat{f}, that minimizes the empirical loss

where here, and throughout this article, PNgP_{N}g denotes the empirical mean of the function gg with respect to the given sample.

With this choice of a learning procedure, f^\hat{f} is called the empirical minimizer, and the procedure that selects f^\hat{f} is Empirical Risk Minimization (ERM).

In the context of the estimation problem, it is natural to measure the success of ERM and the effectiveness of the choice of f^\hat{f} in the following way: that for ‘most’ samples (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}, ERM produces a function that is close in L2(μ)L_{2}(\mu) to the best approximation of the target YY in F{\cal F}; that is, a high probability upper estimate on

for the empirical minimizer f^\hat{f} selected according to the data (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}.

Our starting point is a well known result that deals with this very question: controlling the distance in L2(μ)L_{2}(\mu) between the function produced by ERM and ff^{*}. Theorem 1.1, formulated below, has been established in (see Corollary 5.3 there and also Theorem 5.1 in the survey ).

Let Df{\cal D}_{f^{*}} be the L2(μ)L_{2}(\mu) ball of radius 11, centred in ff^{*}. Thus, {fF:ffL2r}=FrDf\{f\in{\cal F}:\|f-f^{*}\|_{L_{2}}\leq r\}={\cal F}\cap r{\cal D}_{f^{*}}. For any r>0r>0, let

where (εi)i=1N(\varepsilon_{i})_{i=1}^{N} are independent, symmetric, {1,1}\{-1,1\}-valued random variables (random signs) that are independent of (Xi)i=1N(X_{i})_{i=1}^{N}, and the expectation is with respect to both (εi)i=1N(\varepsilon_{i})_{i=1}^{N} and (Xi)i=1N(X_{i})_{i=1}^{N}.

There exist absolute constants c0,c1c_{0},c_{1} and c2c_{2} for which the following holds. If FL2(μ){\cal F}\subset L_{2}(\mu) is a closed, convex class of functions that are bounded by 11 and the target YY is also bounded by 11, then for every t>0t>0, with probability at least 1c0exp(t)1-c_{0}\exp(-t),

The proof of Theorem 1.1 relies heavily on the fact that F{\cal F} consists of functions that are bounded by 11 and that the target is also bounded by 11. Both are restrictive assumptions and exclude many natural problems that one would like to consider.

1. Gaussian noise: arguably the most basic statistical problem is when Y=f0(X)+WY=f_{0}(X)+W for some f0Ff_{0}\in{\cal F} and WW that is a centred gaussian variable with variance σ\sigma that is independent of XX. Thus, the given data consists of ‘noisy’ measurements of f0f_{0} corrupted by gaussian noise.

Since a gaussian random variable is unbounded, Theorem 1.1 cannot be used to address the estimation problem that involves gaussian noise, regardless of the choice of F{\cal F}.

2. Heavy-tailed noise: The vague term ‘heavy-tailed function’ is used to describe a function for which the measure of its tail, Pr(f>t)Pr(|f|>t), decays to zero relatively slowly – for example, polynomially in 1/t1/t. In particular, such a function need not be bounded. Hence, any kind of an estimation problem that involves a heavy-tailed target YY cannot be treated using Theorem 1.1.

Let εn+1\varepsilon_{n+1} be a random sign that is independent of (εi)i=1n(\varepsilon_{i})_{i=1}^{n}, fix 0σR0\leq\sigma\leq R and t0TRt_{0}\in T_{R}, and put Y=\bigl{<}t_{0},\cdot\bigr{>}+\sigma\varepsilon_{n+1}.

More accurately, we will show that the outcome of Theorem 1.1 is as follows: there are absolute constants c1,c2c_{1},c_{2} and c3c_{3} and

On the other hand, it follows from that the correct rate for this problem is very different: if

then with probability at least 12exp(c6Nmin{v2,1})1-2\exp\left(-c_{6}N\min\{v_{2},1\}\right),

for absolute constants c4,c5,c6c_{4},c_{5},c_{6} and c7c_{7}.

Therefore, the outcome of Theorem 1.1 scales incorrectly with RR and with σ\sigma, and most notably, the estimation error does not converge to zero when σ\sigma becomes smaller and the problem is almost ‘noise-free’.

This example belongs to the persistence framework, which will be explored in greater detail later on. It, and other examples like it show that the poor outcome of Theorem 1.1 is endemic rather than merely an accident.

To get a clearer picture of the reasons why Theorem 1.1 is truly suboptimal, let us take a closer look at its assumptions.

The point-wise boundedness assumption in Theorem 1.1 has been used frequently in Learning Theory, and for a good reason. It allows one to invoke two important tools from empirical processes theory, which are simply not true in a more general setup.

2. Concentration methods: Talagrand’s concentration inequality for empirical processes indexed by a class of uniformly bounded functions has played an essential role in Learning Theory - and Theorem 1.1 is no exception. It is used to show that with high probability, the supremum of an empirical process is almost the same as its expectation, as well as the Rademacher averages of the class (see, e.g., and references therein for more details on the role of Rademacher averages in Learning Theory).

It takes no more than a glance to see that contraction and concentration are crucial to the proof of Theorem 1.1. A more careful examination shows that there is no realistic hope of extending the proof beyond the restricted setup of classes of uniformly bounded functions and a bounded target without introducing a totally new argument.

The noise barrier

An important observation that reflects the suboptimal nature of Theorem 1.1 is that the estimation error is insensitive to the noise level. The term ‘noise level’ is used here to describe the distance between the target and the class, either with respect to the L2L_{2} norm, or with respect to a different LpL_{p} norm, depending on the situationThe reason for naming the distance between the class and the target ‘noise level’ is the very simple choice of a target Y=f0(X)+WY=f_{0}(X)+W for some f0Ff_{0}\in{\cal F}, and an independent, centred random variable WW, representing the noise. It is straightforward to verify that for such a target f=f0f^{*}=f_{0}, and the L2L_{2} distance between the class and the target is f(X)YL2=WL2\|f^{*}(X)-Y\|_{L_{2}}=\|W\|_{L_{2}} – the variance of WW..

One would expect that when the noise level (distance) tends to zero and the problem becomes almost noise-free (or realizable), f^fL2\|\hat{f}-f^{*}\|_{L_{2}} should decrease, and in some cases should even tend to zero. Among these problems are compressed sensing and phase recovery, in which the exact recovery of the signal is possible (see the discussion in ). Unfortunately, Theorem 1.1 is insensitive to the noise level, and this by-product of its proof cannot be resolved without totally changing the method of analysis.

All these observations are strong indications that if Theorem 1.1 is to be radically improved, the entire concentration-contraction framework on which it is based, must be abandoned. Moreover, any alternative framework should address two core issues:

\bullet It must be able to handle ‘heavy-tailed’ problems.

\bullet The estimation error must scale correctly with the key parameters of the problem, especially with the noise level.

Towards a heavy-tailed framework

In view of the requirements outlined above, an improved framework has to contend with functions that may have heavy tails, and certainly need not be bounded. We will first explain why standard concentration-based arguments fail miserably when faced with a heavy-tailed scenario. We will then suggest an alternative to standard concentration, introduce the complexity parameters that arise naturally in the suggested framework, and explain their statistical interpretation.

The title of this article may create the wrong impression – that concentration methods are not needed and will not take any part in the analysis of ERM. Actually, the way the title should be understood is different: that learning is possible even when concentration is false.

As a starting point, and seeing that one is interested in the squared loss, one should note the substantial difference between a two-sided concentration inequality, stating, for example, that with high probability,

and just the lower bound on the empirical mean,

If Z1,...,ZNZ_{1},...,Z_{N} are independent copies of ZZ, then with probability at least 1/2N1/2N there exists some 1iN1\leq i\leq N for which Zi=2N|Z_{i}|=2\sqrt{N}. On that event,

for a suitable absolute constant c1c_{1} – and that a similar estimate is true for any random variable ZZ for which ZL4/ZL2\|Z\|_{L_{4}}/\|Z\|_{L_{2}} is well behaved, with c1c_{1} depending only on this ratio.

The difference between an upper estimate and a lower one is even more obvious if one only assumes that Pr(fu)εPr(|f|\geq u)\geq\varepsilon for fixed constants uu and ε\varepsilon. Using a simple binomial estimate, one may show that with probability at least 12exp(c2εN)1-2\exp(-c_{2}\varepsilon N), PNf2εu2/2P_{N}f^{2}\geq\varepsilon u^{2}/2 for a suitable absolute constant c2c_{2}. However, there is no hope of obtaining any upper estimate on PNf2P_{N}f^{2} based on the given information.

An immediate consequence of these observations is that the standard method of analysis for the estimation problem, which is based on a two-sided concentration argument that holds with exponential probability, can never work in heavy-tailed situations. Thus, one must find a different argument altogether if one wishes to deal with learning problems that include classes of heavy-tailed functions or with a heavy-tailed target.

The difference between a two-sided estimate and a lower bound takes centre-stage when one realizes that the main component in solving the estimation problem is actually a lower bound on

rather than a two-sided bound. The significance of this lower bound will be made clear in Section 3. It will eventually lead to a sharp estimate on f^fL2\|\hat{f}-f^{*}\|_{L_{2}} even when two-sided concentration is impossible.

2 Two parameters for two regimes

When one considers the performance of a learning procedure, it is reasonable to expect two different ‘performance regimes’, according to the difficulties the learner faces. As will be explained below, there are clear differences between ‘low noise’ problems, in which the target YY is close to F{\cal F} and the rate one should expect is close to the noise-free (realizable) rate, and ‘high noise’ problems, when YY is far enough from F{\cal F} and the interaction between the target and the class determines the estimation error. One may also expect that just as in a realizable problem, the complexity parameter governing the ‘low-noise’ regime is intrinsic to F{\cal F} and therefore should not depend on YY at all, while the parameter controlling the ‘high-noise’ regime depends on YY in one way or another.

We begin with the definition of the first parameter we will be interested in, and which governs the ‘low-noise’ regime.

Given a class of functions F{\cal F} and γ>0\gamma>0, set

where the expectation is taken with respect to both (εi)i=1N(\varepsilon_{i})_{i=1}^{N} and (Xi)i=1N(X_{i})_{i=1}^{N}.

Note that βN\beta_{N}^{*} is indeed an intrinsic parameter, in the sense that it depends only on the class F{\cal F} and not on the exact nature of the ‘noise’ f(X)Yf^{*}(X)-Y. From a purely technical perspective, βN\beta_{N}^{*} measures when the Rademacher averages of the ‘localized’ set {ff:fFrDf}\{f-f^{*}:f\in{\cal F}\cap r{\cal D}_{f^{*}}\} scale like rr rather than like the normalization r2r^{2}, which has been used in the definition of kNk_{N}^{*} and in Theorem 1.1. Thus, βN\beta_{N}^{*} will always be much smaller than kNk_{N}^{*} when dealing with r1r\ll 1, as we do.

Off-hand, the meaning and significance of βN\beta_{N}^{*} is not obvious. It is, perhaps, surprising that it captures properties of the version space of the problem.

The version space is a random subset of F{\cal F} that consists of all the functions in the class that agree with ff^{*} on the sample (Xi)i=1N(X_{i})_{i=1}^{N}. When the problem is noise-free (Y=fY=f^{*}), a learning procedure makes significant mistakes only when there are functions in F{\cal F} that, despite being ‘far-away’ from ff^{*}, still satisfy that f(Xi)=f(Xi)f(X_{i})=f^{*}(X_{i}) for every 1iN1\leq i\leq N. In other words, significant mistakes occur in a noise-free problem only when the version space is large.

As noted earlier, it seems plausible that when the noise level is low rather than zero, that is, when YY is close enough to F{\cal F}, the situation does not change significantly and mistakes are essentially due to a large version space. Thus, when trying to bound the error of ERM, the first order of business is to identify a parameter that captures the ‘size’ of the version space, and βN\beta_{N}^{*} gives thats and much more.

We will show that with high probability, if ffL2βN\|f-f^{*}\|_{L_{2}}\geq\beta_{N}^{*}, then

for an appropriate constant cc, and with (almost) no assumption on F{\cal F} (see Theorem 5.3 for the exact formulation).

Immediate outcomes of (2.1) are that the version space cannot include functions for which ffL2βN\|f^{*}-f\|_{L_{2}}\geq\beta_{N}^{*} and that sampling is ‘stable’ for functions that are not too close to ff^{*}. Indeed, (2.1) implies that on a large event, the empirical distances (N1i=1N(ff)2(Xi))1/2(N^{-1}\sum_{i=1}^{N}(f-f^{*})^{2}(X_{i}))^{1/2} are at least a fixed proportion of the original L2(μ)L_{2}(\mu) distances ffL2\|f-f^{*}\|_{L_{2}}.

Controlling the interaction with the noise

The second regime is encountered once the noise level increases, and mistakes happen for a totally different reason: the ‘interaction’ of the target YY with class members. It turns out that this interaction is captured by the following parameter.

Let ξ(X,Y)=f(X)Y\xi(X,Y)=f^{*}(X)-Y represent the noisecalling ξ\xi ‘the noise’ is a natural name when Y=f0(X)+WY=f_{0}(X)+W for some f0Ff_{0}\in{\cal F} and WW that is independent of XX. We will use that name even when YY does not have that form. and set

The fixed point αN\alpha_{N}^{*} is of a similar nature to kNk_{N}^{*} and the two share the same scaling, of the order of Ns2\sqrt{N}s^{2}, but with one key difference: the supremum of the multiplier process in (2.2) measures the maximal correlation elements of the random set {(ff)(Xi):fFsDf}\{(f-f^{*})(X_{i}):f\in{\cal F}\cap s{\cal D}_{f^{*}}\} have with the random symmetrized noise vector (εiξi)i=1N(\varepsilon_{i}\xi_{i})_{i=1}^{N}, while the random process used in the definition of kNk_{N}^{*},

measures the maximal correlation elements of the same random set have with the random vector (εi)i=1N(\varepsilon_{i})_{i=1}^{N}. The obvious difference is that (εi)i=1N(\varepsilon_{i})_{i=1}^{N} represents a ‘generic’ noise and has nothing to do with the specific noise that the learner has to deal with.

Note that if functions in F{\cal F} and YY happen to be bounded by 11, then ξ=f(X)Y\xi=f^{*}(X)-Y is bounded by 22. Applying a standard contraction argument (see, e.g. ) it follows that for every u>0u>0,

and thus one may show that αN\alpha_{N}^{*} is dominated by kNk_{N}^{*} in the bounded case. Moreover, while kNk_{N}^{*} is insensitive to the noise level, because the contraction argument used in (2.3) destroys any dependence on the ‘noise multipliers’ (ξi)i=1N(\xi_{i})_{i=1}^{N}, αN\alpha_{N}^{*} is highly affected by the noise – and in a favourable way. When YY is close to F{\cal F} in the right sense, ξ\xi is close to zero, leading to smaller multipliers (ξi)i=1N(\xi_{i})_{i=1}^{N}, and thus to a smaller fixed point αN\alpha_{N}^{*}.

The main result

Next, let us explain why this heuristic description of the roles of αN\alpha_{N}^{*} and βN\beta_{N}^{*} is reasonable, and why splitting the estimation problem to two components, each captured by one of the two parameters, is the first step in bypassing the concentration-contraction mechanism used in the proof of Theorem 1.1.

For every fFf\in{\cal F}, let Lf{\cal L}_{f} be the excess loss functional associated with ff, which is defined by

Let LF={Lf:fF}{\cal L}_{{\cal F}}=\{{\cal L}_{f}:f\in{\cal F}\} be the excess loss class and note that it has two important properties.

\bullet Since 0LF0\in{\cal L}_{{\cal F}}, the empirical minimizer f^\hat{f} satisfies PNLf^0P_{N}{\cal L}_{\hat{f}}\leq 0.

Therefore, if (X1,...,XN)(X_{1},...,X_{N}) is a sample for which

then f^fL2<ρ\|\hat{f}-f^{*}\|_{L_{2}}<\rho – simply because an empirical minimizer cannot belong to the set {fF:ffL2ρ}\{f\in{\cal F}:\|f-f^{*}\|_{L_{2}}\geq\rho\}.

Setting ξi=f(Xi)Yi\xi_{i}=f^{*}(X_{i})-Y_{i}, observe that by (3),

Hence, one may bound PNLfP_{N}{\cal L}_{f} from below by showing that with high probability, and in rather general situations,

1. The version space condition holds: that is, if ffL2βN(γ1)\|f-f^{*}\|_{L_{2}}\geq\beta_{N}^{*}(\gamma_{1}) then

2. The noise interaction condition holds: that is, if ffL2αN(γ2,δ)\|f-f^{*}\|_{L_{2}}\geq\alpha_{N}^{*}(\gamma_{2},\delta), then

Therefore, if γ2\gamma_{2} is chosen to be smaller than c1γ1/2c2c_{1}\gamma_{1}/2c_{2} and if (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N} is a sample for which both (1) and (2) hold, then PNLf>0P_{N}{\cal L}_{f}>0 when ffL2max{αN(γ2,δ),βN(γ1)}\|f-f^{*}\|_{L_{2}}\geq\max\{\alpha_{N}^{*}(\gamma_{2},\delta),\beta_{N}^{*}(\gamma_{1})\}. In particular, on that event, the estimation error satisfies

Let us emphasize that one may obtain a very good lower bound on the quadratic component of PNLfP_{N}{\cal L}_{f} (and thus on the version space condition), solely because the required estimate is one-sided. A two-sided bound, obtained by an upper estimate on centred quadratic process

requires F{\cal F} to be highly regular in some sense (see for two-sided estimates on the quadratic process). And, as noted earlier, a two-sided bound of this type may be false even for a single function, let alone for a class of functions, unless one imposes rather restrictive assumptions.

The key assumption leading the lower bound on the quadratic term is the following:

Let HL2(μ){\cal H}\subset L_{2}(\mu) be a class of functions and set

Given a class of functions F{\cal F}, let FF={fh:f,hF}{\cal F}-{\cal F}=\{f-h:f,h\in{\cal F}\}. We will assume that there is some u>0u>0 for which QFF(u)>0Q_{{\cal F}-{\cal F}}(u)>0.

In Section 4 we will present several generic examples showing that this weak small-ball condition is indeed minimal, and that in many cases one may choose uu and QQ to be appropriate absolute constants.

This notion of small-ball is very different from concentration. The empirical mean of a function will concentrate around its true mean only if the function has well-behaved high moments. Such a condition is trivially satisfied when the function is a bounded, but in general it is highly restrictive. In contrast, it is well known that any sort of moment equivalence, even as weak as hL2LhL1\|h\|_{L_{2}}\leq L\|h\|_{L_{1}}, leads to the nontrivial small-ball estimate

for constants c1c_{1} and c2c_{2} that depend only on LL (see Section 4 for more details).

The significance of (3.2) is that it may be used to derive a very high probability lower bound on PNh2P_{N}h^{2}. Indeed, let X1,...,XNX_{1},...,X_{N} be independent, distributed according to μ\mu. A straightforward binomial estimate shows that with probability at least 12exp(c3(L)N)1-2\exp(-c_{3}(L)N), at least c2(L)N/2c_{2}(L)N/2 of the h(Xi)|h(X_{i})|’s are larger than c1(L)hL2c_{1}(L)\|h\|_{L_{2}}. Hence, on that event,

We are, at last, ready to formulate the main result of the article.

Let FL2(μ){\cal F}\subset L_{2}(\mu) be a closed, convex class of functions, set YY to be the unknown target and put ξ(X,Y)=f(X)Y\xi(X,Y)=f^{*}(X)-Y.

Fix τ>0\tau>0 for which QFF(2τ)>0Q_{{\cal F}-{\cal F}}(2\tau)>0 and set γ<τ2QFF(2τ)/16\gamma<\tau^{2}Q_{{\cal F}-{\cal F}}(2\tau)/16. For every 0<δ<10<\delta<1, with probability at least 1δexp(NQFF2(2τ)/2)1-\delta-\exp(-NQ_{{\cal F}-{\cal F}}^{2}(2\tau)/2) one has

The proof of Theorem 3.1 will be presented in Section 5.

Unlike Theorem 1.1, Theorem 3.1 holds with essentially no restrictions on the class or on the target. In particular, it may be applied in all the examples described in the introduction that fall outside the scope of Theorem 1.1, once FF{\cal F}-{\cal F} satisfies Assumption 3.1.

Consider a family of estimation problems in {\cal F}_{R}=\left\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\right\} and for a set of reasonable targets. In the persistence framework, the dimension nn, the radius RR and the noise level σ\sigma are allowed to grow with the sample size NN, and one has to find conditions on n(N)n(N), R(N)R(N) and σ(N)\sigma(N) that still ensure that f^fL2\|\hat{f}-f^{*}\|_{L_{2}} tends to zero with high probability. Therefore, one has to identify the correct way in which the estimation error scales with each one of these parameters.

We will study the persistence problem when XX has bounded iid coordinates and Y=\bigl{<}t_{0},\cdot\bigr{>}+W for t0RB1nt_{0}\in RB_{1}^{n} and a bounded, symmetric random variable WW that is independent of XX. Although the problem fits the assumptions of Theorem 1.1, the outcome of Theorem 3.1 turns out to be superior by far, and, in fact, leads to the optimal estimate in the minimax sense.

2 Is Theorem 3.1 optimal?

The question of whether Theorem 3.1 is optimal is rather delicate and requires highly technical machinery if it is to be answered in full. To keep the scope and length of the article within reason, we will only sketch a few facts indicating that Theorem 3.1 is optimal, at least for a wide variety of classes.

As noted earlier, the intrinsic parameter βN\beta_{N}^{*}, which characterizes the low-noise regime, is also an upper estimate on the diameter of the version space associated with ff^{*} in L2(μ)L_{2}(\mu). There are many examples in which this estimate is sharp, including the class of linear functionals used in the persistence framework, endowed by RB1nRB_{1}^{n}.

As it turns out, the L2(μ)L_{2}(\mu) diameter of the version space is a lower bound on the estimation error in a rather general sense and independently of the learning procedure that is used. Indeed, fix a mean-zero random variable WW that is independent of XX, consider an arbitrary class F{\cal F} and the family of targets Yf=f(X)+WY^{f}=f(X)+W, where fFf\in{\cal F}. Given (Xi)i=1N(X_{i})_{i=1}^{N} and fFf\in{\cal F},

where the probability is taken with respect to the data (Xi,Yif)i=1N(X_{i},Y_{i}^{f})_{i=1}^{N}.

Therefore, βN\beta_{N}^{*} captures the estimation error in the low-noise regime, with the exception of the (rare) examples in which it is far from the typical diameter of the version space V(f,(Xi)i=1N){\cal V}(f^{*},(X_{i})_{i=1}^{N}) in L2(μ)L_{2}(\mu) .

As for αN\alpha_{N}^{*}, there are strong indications it is the right parameter for describing the interaction between the target and the class. Optimality questions of that flavour have been studied in , featuring arbitrary classes of function F{\cal F} and targets Yf=f(X)+WY^{f}=f(X)+W for some fFf\in{\cal F} and mean-zero gaussian variables WW that are independent of XX.

The results of show that a ‘subgaussian version’ of αN\alpha_{N}^{*} is optimal, under mild assumptions on the canonical gaussian process indexed by F{\cal F}. Without going into accurate and rather technical definitions, one may show that if F{\cal F} and ξ=f(X)Y\xi=f^{*}(X)-Y satisfy a subgaussian condition, and if the canonical gaussian process indexed by F{\cal F}, {Gf:fF}\{G_{f}:f\in{\cal F}\}, is sufficiently regular, then

is the optimal complexity parameter for the high-noise regime in such estimation problems, for a constant γ\gamma that depends only on the subgaussian properties of F{\cal F}.

Hence, for the right choices of δ\delta and γ\gamma^{\prime}, αN(γ,δ)\alpha_{N}^{*}(\gamma^{\prime},\delta) coincides with sN(γ)s_{N}^{*}(\gamma), showing its optimality.

Unfortunately, extending this somewhat sketchy argument to the general case studied here, in which the subgaussian machinery is not at one’s disposal, requires methods that are well beyond the scope of this article. It is therefore deferred to future work.

Some Examples

Let us turn to situations one is likely to encounter in a heavy-tailed framework: classes of functions for which one has almost no moment control, and therefore, its members do not exhibit any useful two-sided concentration of empirical means; nevertheless, one may obtain a small-ball estimate as in Assumption 3.1.

Let F{\cal F} be a class of functions on a probability space (Ω,μ)(\Omega,\mu).

1. If f1f2L2κ1f1f2L1\|f_{1}-f_{2}\|_{L_{2}}\leq\kappa_{1}\|f_{1}-f_{2}\|_{L_{1}} for every f1,f2Ff_{1},f_{2}\in{\cal F}, then there are constants c1c_{1} and c2c_{2} that depend only on κ1\kappa_{1} for which QFF(c1)c2Q_{{\cal F}-{\cal F}}(c_{1})\geq c_{2}.

2. If there are p>2p>2 and κ2\kappa_{2} for which f1f2Lpκ2f1f2L2\|f_{1}-f_{2}\|_{L_{p}}\leq\kappa_{2}\|f_{1}-f_{2}\|_{L_{2}} for every f1,f2Ff_{1},f_{2}\in{\cal F}, then there are constants c1c_{1} and c2c_{2} that depend only on κ2\kappa_{2} and pp for which QFF(c1)c2Q_{{\cal F}-{\cal F}}(c_{1})\geq c_{2}.

Lemma 4.1 is an immediate outcome of the Paley-Zygmund inequality (see, e.g. ). For example, if p>2p>2 and hLpκ2hL2\|h\|_{L_{p}}\leq\kappa_{2}\|h\|_{L_{2}} then by the Paley-Zygmund inequality, for every 0<u<10<u<1,

Thus, when applied to each h=f1f2h=f_{1}-f_{2} and for u=1/2u=1/2, it follows that QFF(1/2)(3/4κ22)p/(p2)Q_{{\cal F}-{\cal F}}(1/2)\geq(3/4\kappa_{2}^{2})^{p/(p-2)}, and the small-ball condition holds uniformly in FF{\cal F}-{\cal F} with reasonable constants.

The proof of Lemma 4.2 is presented in Section 6.1.

If either one of the moment conditions of Lemma 4.2 holds for ζ\zeta, then with probability at least 1δexp(c1N)1-\delta-\exp(-c_{1}N), ERM produced t^T\hat{t}\in T for which

for appropriate constants c1c_{1}, c2c_{2} and c3c_{3} that depend only on κ1\kappa_{1} or on κ2\kappa_{2} and pp.

Needless to say that this problem falls outside the scope of Theorem 1.1, as functions in F{\cal F} and YY need not be bounded, nor do they necessarily have rapidly decaying tails.

Let us turn to an example that illustrates the striking difference between Theorem 1.1 and Theorem 3.1, even in a bounded scenario, and in which αN\alpha_{N}^{*} and βN\beta_{N}^{*} are not difficult to compute.

Let X=(ζi)i=1nX=(\zeta_{i})_{i=1}^{n} be a random vector with independent coordinates distributed according to a mean-zero, variance 11 random variable ζ\zeta. To give Theorem 3.1 and Theorem 1.1 a ‘level playing field’, assume that ζκ|\zeta|\leq\kappa almost surely.

Recall that Problem 4.4 has been mentioned in the introduction for X=(ε1,...,εN)X=(\varepsilon_{1},...,\varepsilon_{N}) and W=σεN+1W=\sigma\varepsilon_{N+1}.

The following two statements summarize the outcomes of Theorem 1.1 and Theorem 3.1 for this choice of F{\cal F} and YY. The proofs of the claims may be found in Section 6.2.

Let us begin with the outcome of Theorem 1.1:

For every κ>1\kappa>1 there exist constants c1,c2c_{1},c_{2} and c3c_{3} that depend only on κ\kappa for which the following holds. Assume that ζL,WLκ\|\zeta\|_{L_{\infty}},\|W\|_{L_{\infty}}\leq\kappa. Set

The estimation error in Theorem 4.5 does not scale correctly with RR (R2/NR^{2}/\sqrt{N} is clearly too big) and also does not depend on any LpL_{p} norm of the noise WW, except for the trivial bound WL=κ\|W\|_{L_{\infty}}=\kappa, which is likely to be much larger than, say, the variance WL2\|W\|_{L_{2}}.

In contrast, the next result follows from Theorem 3.1. To formulate it, set

It turns out that WL2,1\|W\|_{L_{2,1}}, which is slightly larger than WL2\|W\|_{L_{2}} but smaller than c(q)WLqc(q)\|W\|_{L_{q}} for any q>2q>2, captures the noise level of the problem. Since in virtually all examples the L2L_{2} and L2,1L_{2,1} norms are equivalent, we will abuse notation and denote σ=WL2,1\sigma=\|W\|_{L_{2,1}} rather than σ=WL2\sigma=\|W\|_{L_{2}}.

For every κ>1\kappa>1 there exist constants c1,c2,c3c_{1},c_{2},c_{3} and c4c_{4} that depend only on κ\kappa for which the following holds. Assume that ζL,WLκ\|\zeta\|_{L_{\infty}},\|W\|_{L_{\infty}}\leq\kappa and that W2,1=σ<\|W\|_{2,1}=\sigma<\infty. Put

The results of show that up to the exact probability estimate, the estimation error given in Theorem 4.6 is optimal in the minimax sense, and no procedure can perform with confidence larger than 3/43/4 and yield a estimation error better than max{v1,v2}\sim\max\{v_{1},v_{2}\}.

It turns out that Theorem 4.6 can be extended even further. For example, it holds for a general target YY rather than just for Y=\bigl{<}t_{0},\cdot\bigr{>}+W, the assumption that XX has iid coordinates can be relaxed, and ‘heavy-tailed’ measures may be used instead. Unfortunately, the proof of a more general version of Theorem 4.6 comes at a high technical cost and we refer the reader to and for more details.

Proof of Theorem 3.1

We begin this section with a few definitions that will be needed in the proof of Theorem 3.1.

A class H{\cal H} is star-shaped around if for every hHh\in{\cal H} and any 0<λ10<\lambda\leq 1, λhH\lambda h\in{\cal H}. Thus, a class is star-shaped around if it contains the entire interval [0,h][0,h] whenever hHh\in{\cal H}.

We will sometimes write βN(γ)\beta_{N}(\gamma) instead of βN(H,γ)\beta_{N}({\cal H},\gamma).

Let Ff={ff:fF}{\cal F}-f^{*}=\{f-f^{*}:f\in{\cal F}\} and observe that βN(γ)=βN(Ff,γ)\beta_{N}^{*}(\gamma)=\beta_{N}({\cal F}-f^{*},\gamma). Also, it is straightforward to verify that if H{\cal H} is star-shaped around and r>βN(γ)r>\beta_{N}(\gamma) then

The main component in the proof of Theorem 3.1 is the following:

Let FL2(μ){\cal F}\subset L_{2}(\mu) be a closed, convex class and assume that there is some τ>0\tau>0 for which QFF(2τ)>0Q_{{\cal F}-{\cal F}}(2\tau)>0. Given fFf^{*}\in{\cal F}, set H=Ff{\cal H}={\cal F}-f^{*}. Then, for every r>βN(H,τQH(2τ)/16)r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16), with probability at least 12exp(NQH2(2τ)/2)1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2), if ffL2r\|f-f^{*}\|_{L_{2}}\geq r, one has

The first step in the proof of Theorem 5.3 is the following uniform empirical small-ball estimate – which is of a similar nature to the results from and .

Let S(L2)S(L_{2}) be the L2(μ)L_{2}(\mu) unit sphere and let HS(L2){\cal H}\subset S(L_{2}). Assume that there is some τ>0\tau>0 for which QH(2τ)>0Q_{{\cal H}}(2\tau)>0. If

then with probability at least 12exp(NQH2(2τ)/2)1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2),

Let Z(X1,...,XN)=suphHPNϕu(h)Pϕu(h)Z(X_{1},...,X_{N})=\sup_{h\in{\cal H}}\left|P_{N}\phi_{u}(|h|)-P\phi_{u}(|h|)\right|. By the bounded differences inequality applied to ZZ (see, for example, ), it follows that for every t>0t>0, with probability at least 12exp(2t2)1-2\exp(-2t^{2}),

Note that ϕu\phi_{u} is a Lipschitz function that vanishes in and with a Lipschitz constant 1/u1/u. Therefore, by the Giné-Zinn symmetrization theorem and the contraction inequality for Bernoulli processes (see, e.g. ),

Hence, with probability at least 12exp(2t2)1-2\exp(-2t^{2}), for every hHh\in{\cal H},

If QH(2τ)>0Q_{{\cal H}}(2\tau)>0, set u=τu=\tau and t=NQH(2τ)/2t=\sqrt{N}Q_{{\cal H}}(2\tau)/2. Since

it follows that with probability at least 12exp(NQH2(2τ)/2)1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2),

Let H{\cal H} be star-shaped around and assume that there is some τ>0\tau>0 for which QH(2τ)>0Q_{{\cal H}}(2\tau)>0. Then, for every r>βN(H,τQH(2τ)/16)r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16), with probability at least 12exp(NQH2(2τ)/2)1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2), for every hHh\in{\cal H} that satisfies hL2r\|h\|_{L_{2}}\geq r,

Proof. Let r>βN(H,τQH(2τ)/16)r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16) and since H{\cal H} is star-shaped around ,

Clearly, QV(2τ)QH(2τ)Q_{V}(2\tau)\geq Q_{{\cal H}}(2\tau) and

By Theorem 5.4 applied to the set VV and since QV(2τ)QH(2τ)Q_{V}(2\tau)\geq Q_{{\cal H}}(2\tau), it follows that with probability at least 12exp(NQH2(2τ)/2)1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2), for every vVv\in V,

Next, fix any hHh\in{\cal H} for which hL2r\|h\|_{L_{2}}\geq r. H{\cal H} is star-shaped around and thus (r/hL2)hHrS(L2)(r/\|h\|_{L_{2}})h\in{\cal H}\cap rS(L_{2}), implying that h/hL2Vh/\|h\|_{L_{2}}\in V. The claim follows because (5.2) is positive homogeneous,

Proof of Theorem 5.3. Given fFf^{*}\in{\cal F}, let H=Ff={ff:fF}{\cal H}={\cal F}-f^{*}=\{f-f^{*}:f\in{\cal F}\}. Applying Corollary 5.5, it suffices to show that H{\cal H} is star-shaped around . To that end, observe that if ffHf-f^{*}\in{\cal H} and 0λ10\leq\lambda\leq 1, then λ(ff)=wf\lambda(f-f^{*})=w-f^{*} for w=λf+(1λ)fw=\lambda f+(1-\lambda)f^{*}. Hence, as F{\cal F} is convex, wFw\in{\cal F}.

Proof of Theorem 3.1. Recall that ξ(X,Y)=f(X)Y\xi(X,Y)=f^{*}(X)-Y, that (ξi)i=1N=(f(Xi)Yi)i=1N(\xi_{i})_{i=1}^{N}=(f^{*}(X_{i})-Y_{i})_{i=1}^{N}, and that for every fFf\in{\cal F},

Since HFF{\cal H}\subset{\cal F}-{\cal F}, the assertion of Theorem 5.3 for H=Ff{\cal H}={\cal F}-f^{*} holds with QFF(2τ)Q_{{\cal F}-{\cal F}}(2\tau) replacing the larger QH(2τ)Q_{{\cal H}}(2\tau). Hence, if r>βN(H,τQFF(2τ)/16)r>\beta_{N}({\cal H},\tau Q_{{\cal F}-{\cal F}}(2\tau)/16), then with probability at least 12exp(NQFF2(2τ)/2)1-2\exp(-NQ_{{\cal F}-{\cal F}}^{2}(2\tau)/2), for every fFf\in{\cal F} that satisfies ffL2r\|f-f^{*}\|_{L_{2}}\geq r, one has

Therefore, on that event, if ffL2r\|f-f^{*}\|_{L_{2}}\geq r,

Fix γ\gamma to be named later and consider αN(γ,δ/4)αN\alpha_{N}^{*}(\gamma,\delta/4)\equiv\alpha_{N}. Thus, with probability at least 1δ/41-\delta/4, if ffL2αN\|f-f^{*}\|_{L_{2}}\leq\alpha_{N} then

and by the Giné-Zinn symmetrization theorem , with probability at least 1δ1-\delta,

Fix a sample for which (5.4) holds and consider fFf\in{\cal F} that satisfies ffL2αN\|f-f^{*}\|_{L_{2}}\geq\alpha_{N}. Since Ff{\cal F}-f^{*} is star-shaped around and

Combining (5.5) and (5.3), it is evident that with probability at least

if ffL2max{αN,r}\|f-f^{*}\|_{L_{2}}\geq\max\{\alpha_{N},r\} then

provided that γ<τ2QFF(2τ)/16\gamma<\tau^{2}Q_{{\cal F}-{\cal F}}(2\tau)/16. On that event,

Additional proofs

Here, we will present proofs of some of the claims that have been formulated in previous sections.

We will prove a stronger statement: that the equivalence of the pp-th moment and the second moment of the linear forms \bigl{<}X,t\bigr{>} holds for any isotropic random vector XX that is also unconditional and whose coordinates belong to LpL_{p} for a fixed p>2p>2. That is, when X=(x1,...,xn)X=(x_{1},...,x_{n}) has the same distribution as (ε1x1,...,εnxn)(\varepsilon_{1}x_{1},...,\varepsilon_{n}x_{n}) for every choice of signs (εi)i=1n(\varepsilon_{i})_{i=1}^{n} (unconditionality), xiL2=1\|x_{i}\|_{L_{2}}=1 and xiLpκ\|x_{i}\|_{L_{p}}\leq\kappa.

Such a random vector can be heavy-tailed, as its coordinates may only belong to an LpL_{p} space for a fixed p>2p>2 that is close to 22. Thus, the empirical mean of, say, \bigl{<}X,e_{j}\bigr{>}^{2}=x_{j}^{2} may exhibit rather poor concentration around the true mean, which is 11. Despite that, the claim is that \|\bigl{<}X,t\bigr{>}\|_{L_{p}}\leq c\sqrt{p}\kappa\|\bigl{<}X,t\bigr{>}\|_{L_{2}}, and any class of linear functionals satisfies a small-ball condition with constants that depend only on κ\kappa and pp.

Clearly, a possible choice of XX is a random vector with independent, symmetric, variance one coordinates, and a standard symmetrization argument may be used to show that the same holds when the coordinates are mean-zero rather than symmetric. Thus, the unconditional case extends Lemma 4.2.

and c(λ,κ)c(\lambda,\kappa) is a constant that depends only on λ\lambda and κ\kappa.

implying that \|\bigl{<}X,t\bigr{>}\|_{L_{1}}\geq c(\lambda,\kappa)\|\bigl{<}X,t\bigr{>}\|_{L_{2}}.

2 The persistence problem

The proofs of Theorem 4.5 and Theorem 4.6 follow from several observations.

The ψ2\psi_{2} norm of a random variable XX is

A random variable XX is called LL-subgaussian if Xψ2LXL2\|X\|_{\psi_{2}}\leq L\|X\|_{L_{2}}.

It is standard to verify that Xψ2\|X\|_{\psi_{2}} is equivalent to the smallest constant κ\kappa for which Pr(Xt)2exp(t2/2κ2)Pr(|X|\geq t)\leq 2\exp(-t^{2}/2\kappa^{2}) for every t1t\geq 1. Also, Xψ2\|X\|_{\psi_{2}} is equivalent to supp2XLp/p\sup_{p\geq 2}\|X\|_{L_{p}}/\sqrt{p} (see, e.g. ).

Another property of ψ2\psi_{2} random variables is that if X1,...,XNX_{1},...,X_{N} are independent and mean-zero, then for every a1,...,aNa_{1},...,a_{N},

where (zi)i=1n(z_{i}^{*})_{i=1}^{n} is a monotone non-increasing rearrangement of (zi)i=1n(|z_{i}|)_{i=1}^{n}.

There exists an absolute constant CC for which the following holds. Assume that z1,...,znz_{1},...,z_{n} are independent copies of a mean-zero, variance 11 random variable zz, and that for every plognp\leq\log n, zLpκp\|z\|_{L_{p}}\leq\kappa\sqrt{p}. Then for every 1dn1\leq d\leq n,

Proof. For every 1jn1\leq j\leq n and p2p\geq 2,

Therefore, if t=uκlog(en/j)t=u\kappa\sqrt{\log(en/j)} and p=log(en/j)p=\log(en/j) then

The final component needed in the proofs of the two theorems is the following. Note that if {\cal F}_{R}=\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\} and Y=\bigl{<}t_{0},\cdot\bigr{>}+W for t0RB1nt_{0}\in RB_{1}^{n} and WW that is independent of XX, then f^{*}=\bigl{<}t_{0},\cdot\bigr{>}. Also, by the convexity and symmetry of B1nB_{1}^{n},

Therefore, if XX is a random vector with independent, mean-zero, variance 1 coordinates ζi\zeta_{i}, and X1,...,XNX_{1},...,X_{N} are independent copies of XX, then

where Z=(zi)i=1nZ=(z_{i})_{i=1}^{n} has independent coordinates. Each ziz_{i} is distributed as N1/2j=1Nζi,jN^{-1/2}\sum_{j=1}^{N}\zeta_{i,j}, and (ζi,j)j=1N(\zeta_{i,j})_{j=1}^{N} are independent copies of ζi\zeta_{i}. Hence, by Lemma 6.4,

If the ζi\zeta_{i}’s are distributed according to a mean-zero and variance 11 random variable ζ\zeta that is bounded in LL_{\infty} by κ\kappa, then ζψ2κζL2\|\zeta\|_{\psi_{2}}\leq\kappa\|\zeta\|_{L_{2}} and ζ\zeta is κ\kappa-subgaussian. Moreover, ziL2=1\|z_{i}\|_{L_{2}}=1, and as a weighted sum of independent ψ2\psi_{2} random variables, ziψ2cκ\|z_{i}\|_{\psi_{2}}\leq c\kappa for a suitable absolute constant cc. Hence, for every p2p\geq 2, ziLpc1κp\|z_{i}\|_{L_{p}}\leq c_{1}\kappa\sqrt{p}.

provided that (R/s)2n/4(R/s)^{2}\leq n/4. Therefore, in that range,

while if (R/s)2n/4(R/s)^{2}\geq n/4 one may verify that

for a suitable absolute constant c3c_{3}.

Combining (6.2) with Theorem 1.1, there is a high probability event on which, if Nn2N\leq n^{2},

Proof of Theorem 4.6. Since the argument is rather standard and follows a similar path to the previous proof, we will omit some of the details.

To prove the result, one has to bound βN\beta_{N}^{*} and αN\alpha_{N}^{*} when X=(ζ1,...,ζn)X=(\zeta_{1},...,\zeta_{n}) and ζLκ\|\zeta\|_{L_{\infty}}\leq\kappa. Using (6.2) as above, it is evident that if Nc1nN\leq c_{1}n,

for suitable constants c1c_{1} and c2c_{2} that depend only on κ\kappa and γ\gamma. Moreover, using the notation of Theorem 3.1, γ=τQFF(2τ)/16\gamma=\tau Q_{{\cal F}-{\cal F}}(2\tau)/16 and thus, γ\gamma is a constant that depends only on κ\kappa as well.

for constants c1c_{1} and c3c_{3} that depend only on κ\kappa.

Next, to bound αN(γ,δ)\alpha_{N}^{*}(\gamma,\delta), set σ=WL2,1\sigma=\|W\|_{L_{2,1}}, let

and recall that WW and XX are independent. By a standard multiplier theorem (see, e.g., Chapter 2.9),

and since 1mi=1mXi\frac{1}{\sqrt{m}}\sum_{i=1}^{m}X_{i} is an isotropic, cκc\kappa-subgaussian vector, it follows from Talagrand’s Majorizing Measures Theorem that

where g1,...,gng_{1},...,g_{n} are independent, standard gaussian variables and c5c_{5} is an absolute constant.

and constants c6,c7c_{6},c_{7} that depend only on κ\kappa.

Indeed, for every t2RB1nsB2nt\in 2RB_{1}^{n}\cap sB_{2}^{n} and since WW and XX are independent,

Hence, for a constant c9c_{9} that depends only on κ\kappa, x=c9Ns2min{1/R,1/σ2}x=c_{9}Ns^{2}\min\{1/R,1/\sigma^{2}\} and δ=exp(x)\delta=\exp(-x), one has that with probability at least 1δ1-\delta, VsγNs2V_{s}\leq\gamma\sqrt{N}s^{2}. Therefore, α(γ,δ)s\alpha(\gamma,\delta)\leq s, which completes the proof.

Concluding Remarks

Although it seems that the complexity parameters αN\alpha_{N}^{*} and βN\beta_{N}^{*} (and kNk_{N}^{*} as well) depend on the unknown function ff^{*}, in virtually all applications one may replace the indexing set FrDfF\cap r{\cal D}_{f^{*}} with (FF)rD({\cal F}-{\cal F})\cap r{\cal D}, where D{\cal D} is the unit ball in L2(μ)L_{2}(\mu) – with the obvious modifications to the definitions of the parameters. Moreover, if F{\cal F} is centrally symmetric (i.e. if fFf\in{\cal F} then fF-f\in{\cal F}) in addition to being convex, then FF2F{\cal F}-{\cal F}\subset 2{\cal F} and the indexing set becomes 2FrD2{\cal F}\cap r{\cal D}.

The fact that βN\beta_{N}^{*} depends on an average while αN\alpha_{N}^{*} is based on a probability estimate is just an outcome of this presentation. It is possible to obtain a similar result using a ‘high probability’ version of βN\beta_{N}^{*}, by using a slightly different argument. On the other hand, while one may replace αN\alpha_{N}^{*} with an averaged version, the latter makes little sense in heavy-tailed situations, as passing from an average-based parameter to a high probability one reverts to the question of concentration. In heavy-tailed situations the mean need not represents the typical behaviour, and should not be used when trying to obtain very high probability bounds.

Another fact worth mentioning is that the results presented here can be extended well beyond the squared loss, to arbitrary convex loss functions (see ).

for midpoints (Zi)i=1N(Z_{i})_{i=1}^{N} that belong to the intervals whose ends are ξi\xi_{i} and ξi+(ff)(Xi)\xi_{i}+(f-f^{*})(X_{i}), and thus depend on ff and on the sample (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}.

Just as in the squared-loss case, ff cannot be an empirical minimizer if PNLf>0P_{N}{\cal L}_{f}>0. Therefore, using (7), one may obtain a positive lower bound on PNLfP_{N}{\cal L}_{f} by identifying the levels αˉN\bar{\alpha}_{N} and βˉN\bar{\beta}_{N}, for which, if ffL2βˉN\|f-f^{*}\|_{L_{2}}\geq\bar{\beta}_{N} then

for some constant cc, and if ffL2αˉN\|f-f^{*}\|_{L_{2}}\geq\bar{\alpha}_{N} then

On that event the estimation error satisfies

for constants that are independent of ff.

A version of Theorem 3.1 for a general convex loss can be found in .

References