Learning without Concentration

Shahar Mendelson

Introduction

Our aim is to study the error of Empirical Risk Minimization (ERM), performed in a convex class and relative to the squared loss.

To be more precise, let ${\cal F}$ be a class of real-valued functions on a probability space $(\Omega,\mu)$ and let $Y$ be an unknown target function. One would like to find some function in ${\cal F}$ that is almost the ‘closest’ to $Y$ in some sense.

Unlike questions in Approximation Theory, the point in learning problems is to approximate $f^{*}$ using random data – an independent sample $(X_{i},Y_{i})_{i=1}^{N}$ selected according to the joint distribution defined by $\mu$ and the target $Y$ .

One way of using the given data $(X_{i},Y_{i})_{i=1}^{N}$ is by selecting a random element in ${\cal F}$ , denoted by $\hat{f}$ , that minimizes the empirical loss

where here, and throughout this article, $P_{N}g$ denotes the empirical mean of the function $g$ with respect to the given sample.

With this choice of a learning procedure, $\hat{f}$ is called the empirical minimizer, and the procedure that selects $\hat{f}$ is Empirical Risk Minimization (ERM).

In the context of the estimation problem, it is natural to measure the success of ERM and the effectiveness of the choice of $\hat{f}$ in the following way: that for ‘most’ samples $(X_{i},Y_{i})_{i=1}^{N}$ , ERM produces a function that is close in $L_{2}(\mu)$ to the best approximation of the target $Y$ in ${\cal F}$ ; that is, a high probability upper estimate on

for the empirical minimizer $\hat{f}$ selected according to the data $(X_{i},Y_{i})_{i=1}^{N}$ .

Our starting point is a well known result that deals with this very question: controlling the distance in $L_{2}(\mu)$ between the function produced by ERM and $f^{*}$ . Theorem 1.1, formulated below, has been established in (see Corollary 5.3 there and also Theorem 5.1 in the survey ).

Let ${\cal D}_{f^{*}}$ be the $L_{2}(\mu)$ ball of radius $1$ , centred in $f^{*}$ . Thus, $\{f\in{\cal F}:\|f-f^{*}\|_{L_{2}}\leq r\}={\cal F}\cap r{\cal D}_{f^{*}}$ . For any $r>0$ , let

where $(\varepsilon_{i})_{i=1}^{N}$ are independent, symmetric, $\{-1,1\}$ -valued random variables (random signs) that are independent of $(X_{i})_{i=1}^{N}$ , and the expectation is with respect to both $(\varepsilon_{i})_{i=1}^{N}$ and $(X_{i})_{i=1}^{N}$ .

There exist absolute constants $c_{0},c_{1}$ and $c_{2}$ for which the following holds. If ${\cal F}\subset L_{2}(\mu)$ is a closed, convex class of functions that are bounded by $1$ and the target $Y$ is also bounded by $1$ , then for every $t>0$ , with probability at least $1-c_{0}\exp(-t)$ ,

The proof of Theorem 1.1 relies heavily on the fact that ${\cal F}$ consists of functions that are bounded by $1$ and that the target is also bounded by $1$ . Both are restrictive assumptions and exclude many natural problems that one would like to consider.

1. Gaussian noise: arguably the most basic statistical problem is when $Y=f_{0}(X)+W$ for some $f_{0}\in{\cal F}$ and $W$ that is a centred gaussian variable with variance $\sigma$ that is independent of $X$ . Thus, the given data consists of ‘noisy’ measurements of $f_{0}$ corrupted by gaussian noise.

Since a gaussian random variable is unbounded, Theorem 1.1 cannot be used to address the estimation problem that involves gaussian noise, regardless of the choice of ${\cal F}$ .

2. Heavy-tailed noise: The vague term ‘heavy-tailed function’ is used to describe a function for which the measure of its tail, $Pr(|f|>t)$ , decays to zero relatively slowly – for example, polynomially in $1/t$ . In particular, such a function need not be bounded. Hence, any kind of an estimation problem that involves a heavy-tailed target $Y$ cannot be treated using Theorem 1.1.

Let $\varepsilon_{n+1}$ be a random sign that is independent of $(\varepsilon_{i})_{i=1}^{n}$ , fix $0\leq\sigma\leq R$ and $t_{0}\in T_{R}$ , and put $Y=\bigl{<}t_{0},\cdot\bigr{>}+\sigma\varepsilon_{n+1}$ .

More accurately, we will show that the outcome of Theorem 1.1 is as follows: there are absolute constants $c_{1},c_{2}$ and $c_{3}$ and

On the other hand, it follows from that the correct rate for this problem is very different: if

then with probability at least $1-2\exp\left(-c_{6}N\min\{v_{2},1\}\right)$ ,

for absolute constants $c_{4},c_{5},c_{6}$ and $c_{7}$ .

Therefore, the outcome of Theorem 1.1 scales incorrectly with $R$ and with $\sigma$ , and most notably, the estimation error does not converge to zero when $\sigma$ becomes smaller and the problem is almost ‘noise-free’.

This example belongs to the persistence framework, which will be explored in greater detail later on. It, and other examples like it show that the poor outcome of Theorem 1.1 is endemic rather than merely an accident.

To get a clearer picture of the reasons why Theorem 1.1 is truly suboptimal, let us take a closer look at its assumptions.

The point-wise boundedness assumption in Theorem 1.1 has been used frequently in Learning Theory, and for a good reason. It allows one to invoke two important tools from empirical processes theory, which are simply not true in a more general setup.

2. Concentration methods: Talagrand’s concentration inequality for empirical processes indexed by a class of uniformly bounded functions has played an essential role in Learning Theory - and Theorem 1.1 is no exception. It is used to show that with high probability, the supremum of an empirical process is almost the same as its expectation, as well as the Rademacher averages of the class (see, e.g., and references therein for more details on the role of Rademacher averages in Learning Theory).

It takes no more than a glance to see that contraction and concentration are crucial to the proof of Theorem 1.1. A more careful examination shows that there is no realistic hope of extending the proof beyond the restricted setup of classes of uniformly bounded functions and a bounded target without introducing a totally new argument.

The noise barrier

An important observation that reflects the suboptimal nature of Theorem 1.1 is that the estimation error is insensitive to the noise level. The term ‘noise level’ is used here to describe the distance between the target and the class, either with respect to the $L_{2}$ norm, or with respect to a different $L_{p}$ norm, depending on the situationThe reason for naming the distance between the class and the target ‘noise level’ is the very simple choice of a target $Y=f_{0}(X)+W$ for some $f_{0}\in{\cal F}$ , and an independent, centred random variable $W$ , representing the noise. It is straightforward to verify that for such a target $f^{*}=f_{0}$ , and the $L_{2}$ distance between the class and the target is $\|f^{*}(X)-Y\|_{L_{2}}=\|W\|_{L_{2}}$ – the variance of $W$ ..

One would expect that when the noise level (distance) tends to zero and the problem becomes almost noise-free (or realizable), $\|\hat{f}-f^{*}\|_{L_{2}}$ should decrease, and in some cases should even tend to zero. Among these problems are compressed sensing and phase recovery, in which the exact recovery of the signal is possible (see the discussion in ). Unfortunately, Theorem 1.1 is insensitive to the noise level, and this by-product of its proof cannot be resolved without totally changing the method of analysis.

All these observations are strong indications that if Theorem 1.1 is to be radically improved, the entire concentration-contraction framework on which it is based, must be abandoned. Moreover, any alternative framework should address two core issues:

$\bullet$ It must be able to handle ‘heavy-tailed’ problems.

$\bullet$ The estimation error must scale correctly with the key parameters of the problem, especially with the noise level.

Towards a heavy-tailed framework

In view of the requirements outlined above, an improved framework has to contend with functions that may have heavy tails, and certainly need not be bounded. We will first explain why standard concentration-based arguments fail miserably when faced with a heavy-tailed scenario. We will then suggest an alternative to standard concentration, introduce the complexity parameters that arise naturally in the suggested framework, and explain their statistical interpretation.

The title of this article may create the wrong impression – that concentration methods are not needed and will not take any part in the analysis of ERM. Actually, the way the title should be understood is different: that learning is possible even when concentration is false.

As a starting point, and seeing that one is interested in the squared loss, one should note the substantial difference between a two-sided concentration inequality, stating, for example, that with high probability,

and just the lower bound on the empirical mean,

If $Z_{1},...,Z_{N}$ are independent copies of $Z$ , then with probability at least $1/2N$ there exists some $1\leq i\leq N$ for which $|Z_{i}|=2\sqrt{N}$ . On that event,

for a suitable absolute constant $c_{1}$ – and that a similar estimate is true for any random variable $Z$ for which $\|Z\|_{L_{4}}/\|Z\|_{L_{2}}$ is well behaved, with $c_{1}$ depending only on this ratio.

The difference between an upper estimate and a lower one is even more obvious if one only assumes that $Pr(|f|\geq u)\geq\varepsilon$ for fixed constants $u$ and $\varepsilon$ . Using a simple binomial estimate, one may show that with probability at least $1-2\exp(-c_{2}\varepsilon N)$ , $P_{N}f^{2}\geq\varepsilon u^{2}/2$ for a suitable absolute constant $c_{2}$ . However, there is no hope of obtaining any upper estimate on $P_{N}f^{2}$ based on the given information.

An immediate consequence of these observations is that the standard method of analysis for the estimation problem, which is based on a two-sided concentration argument that holds with exponential probability, can never work in heavy-tailed situations. Thus, one must find a different argument altogether if one wishes to deal with learning problems that include classes of heavy-tailed functions or with a heavy-tailed target.

The difference between a two-sided estimate and a lower bound takes centre-stage when one realizes that the main component in solving the estimation problem is actually a lower bound on

rather than a two-sided bound. The significance of this lower bound will be made clear in Section 3. It will eventually lead to a sharp estimate on $\|\hat{f}-f^{*}\|_{L_{2}}$ even when two-sided concentration is impossible.

2 Two parameters for two regimes

When one considers the performance of a learning procedure, it is reasonable to expect two different ‘performance regimes’, according to the difficulties the learner faces. As will be explained below, there are clear differences between ‘low noise’ problems, in which the target $Y$ is close to ${\cal F}$ and the rate one should expect is close to the noise-free (realizable) rate, and ‘high noise’ problems, when $Y$ is far enough from ${\cal F}$ and the interaction between the target and the class determines the estimation error. One may also expect that just as in a realizable problem, the complexity parameter governing the ‘low-noise’ regime is intrinsic to ${\cal F}$ and therefore should not depend on $Y$ at all, while the parameter controlling the ‘high-noise’ regime depends on $Y$ in one way or another.

We begin with the definition of the first parameter we will be interested in, and which governs the ‘low-noise’ regime.

Given a class of functions ${\cal F}$ and $\gamma>0$ , set

where the expectation is taken with respect to both $(\varepsilon_{i})_{i=1}^{N}$ and $(X_{i})_{i=1}^{N}$ .

Note that $\beta_{N}^{*}$ is indeed an intrinsic parameter, in the sense that it depends only on the class ${\cal F}$ and not on the exact nature of the ‘noise’ $f^{*}(X)-Y$ . From a purely technical perspective, $\beta_{N}^{*}$ measures when the Rademacher averages of the ‘localized’ set $\{f-f^{*}:f\in{\cal F}\cap r{\cal D}_{f^{*}}\}$ scale like $r$ rather than like the normalization $r^{2}$ , which has been used in the definition of $k_{N}^{*}$ and in Theorem 1.1. Thus, $\beta_{N}^{*}$ will always be much smaller than $k_{N}^{*}$ when dealing with $r\ll 1$ , as we do.

Off-hand, the meaning and significance of $\beta_{N}^{*}$ is not obvious. It is, perhaps, surprising that it captures properties of the version space of the problem.

The version space is a random subset of ${\cal F}$ that consists of all the functions in the class that agree with $f^{*}$ on the sample $(X_{i})_{i=1}^{N}$ . When the problem is noise-free ( $Y=f^{*}$ ), a learning procedure makes significant mistakes only when there are functions in ${\cal F}$ that, despite being ‘far-away’ from $f^{*}$ , still satisfy that $f(X_{i})=f^{*}(X_{i})$ for every $1\leq i\leq N$ . In other words, significant mistakes occur in a noise-free problem only when the version space is large.

As noted earlier, it seems plausible that when the noise level is low rather than zero, that is, when $Y$ is close enough to ${\cal F}$ , the situation does not change significantly and mistakes are essentially due to a large version space. Thus, when trying to bound the error of ERM, the first order of business is to identify a parameter that captures the ‘size’ of the version space, and $\beta_{N}^{*}$ gives thats and much more.

We will show that with high probability, if $\|f-f^{*}\|_{L_{2}}\geq\beta_{N}^{*}$ , then

for an appropriate constant $c$ , and with (almost) no assumption on ${\cal F}$ (see Theorem 5.3 for the exact formulation).

Immediate outcomes of (2.1) are that the version space cannot include functions for which $\|f^{*}-f\|_{L_{2}}\geq\beta_{N}^{*}$ and that sampling is ‘stable’ for functions that are not too close to $f^{*}$ . Indeed, (2.1) implies that on a large event, the empirical distances $(N^{-1}\sum_{i=1}^{N}(f-f^{*})^{2}(X_{i}))^{1/2}$ are at least a fixed proportion of the original $L_{2}(\mu)$ distances $\|f-f^{*}\|_{L_{2}}$ .

Controlling the interaction with the noise

The second regime is encountered once the noise level increases, and mistakes happen for a totally different reason: the ‘interaction’ of the target $Y$ with class members. It turns out that this interaction is captured by the following parameter.

Let $\xi(X,Y)=f^{*}(X)-Y$ represent the noisecalling $\xi$ ‘the noise’ is a natural name when $Y=f_{0}(X)+W$ for some $f_{0}\in{\cal F}$ and $W$ that is independent of $X$ . We will use that name even when $Y$ does not have that form. and set

The fixed point $\alpha_{N}^{*}$ is of a similar nature to $k_{N}^{*}$ and the two share the same scaling, of the order of $\sqrt{N}s^{2}$ , but with one key difference: the supremum of the multiplier process in (2.2) measures the maximal correlation elements of the random set $\{(f-f^{*})(X_{i}):f\in{\cal F}\cap s{\cal D}_{f^{*}}\}$ have with the random symmetrized noise vector $(\varepsilon_{i}\xi_{i})_{i=1}^{N}$ , while the random process used in the definition of $k_{N}^{*}$ ,

measures the maximal correlation elements of the same random set have with the random vector $(\varepsilon_{i})_{i=1}^{N}$ . The obvious difference is that $(\varepsilon_{i})_{i=1}^{N}$ represents a ‘generic’ noise and has nothing to do with the specific noise that the learner has to deal with.

Note that if functions in ${\cal F}$ and $Y$ happen to be bounded by $1$ , then $\xi=f^{*}(X)-Y$ is bounded by $2$ . Applying a standard contraction argument (see, e.g. ) it follows that for every $u>0$ ,

and thus one may show that $\alpha_{N}^{*}$ is dominated by $k_{N}^{*}$ in the bounded case. Moreover, while $k_{N}^{*}$ is insensitive to the noise level, because the contraction argument used in (2.3) destroys any dependence on the ‘noise multipliers’ $(\xi_{i})_{i=1}^{N}$ , $\alpha_{N}^{*}$ is highly affected by the noise – and in a favourable way. When $Y$ is close to ${\cal F}$ in the right sense, $\xi$ is close to zero, leading to smaller multipliers $(\xi_{i})_{i=1}^{N}$ , and thus to a smaller fixed point $\alpha_{N}^{*}$ .

The main result

Next, let us explain why this heuristic description of the roles of $\alpha_{N}^{*}$ and $\beta_{N}^{*}$ is reasonable, and why splitting the estimation problem to two components, each captured by one of the two parameters, is the first step in bypassing the concentration-contraction mechanism used in the proof of Theorem 1.1.

For every $f\in{\cal F}$ , let ${\cal L}_{f}$ be the excess loss functional associated with $f$ , which is defined by

Let ${\cal L}_{{\cal F}}=\{{\cal L}_{f}:f\in{\cal F}\}$ be the excess loss class and note that it has two important properties.

$\bullet$ Since $0\in{\cal L}_{{\cal F}}$ , the empirical minimizer $\hat{f}$ satisfies $P_{N}{\cal L}_{\hat{f}}\leq 0$ .

Therefore, if $(X_{1},...,X_{N})$ is a sample for which

then $\|\hat{f}-f^{*}\|_{L_{2}}<\rho$ – simply because an empirical minimizer cannot belong to the set $\{f\in{\cal F}:\|f-f^{*}\|_{L_{2}}\geq\rho\}$ .

Setting $\xi_{i}=f^{*}(X_{i})-Y_{i}$ , observe that by (3),

Hence, one may bound $P_{N}{\cal L}_{f}$ from below by showing that with high probability, and in rather general situations,

1. The version space condition holds: that is, if $\|f-f^{*}\|_{L_{2}}\geq\beta_{N}^{*}(\gamma_{1})$ then

2. The noise interaction condition holds: that is, if $\|f-f^{*}\|_{L_{2}}\geq\alpha_{N}^{*}(\gamma_{2},\delta)$ , then

Therefore, if $\gamma_{2}$ is chosen to be smaller than $c_{1}\gamma_{1}/2c_{2}$ and if $(X_{i},Y_{i})_{i=1}^{N}$ is a sample for which both (1) and (2) hold, then $P_{N}{\cal L}_{f}>0$ when $\|f-f^{*}\|_{L_{2}}\geq\max\{\alpha_{N}^{*}(\gamma_{2},\delta),\beta_{N}^{*}(\gamma_{1})\}$ . In particular, on that event, the estimation error satisfies

Let us emphasize that one may obtain a very good lower bound on the quadratic component of $P_{N}{\cal L}_{f}$ (and thus on the version space condition), solely because the required estimate is one-sided. A two-sided bound, obtained by an upper estimate on centred quadratic process

requires ${\cal F}$ to be highly regular in some sense (see for two-sided estimates on the quadratic process). And, as noted earlier, a two-sided bound of this type may be false even for a single function, let alone for a class of functions, unless one imposes rather restrictive assumptions.

The key assumption leading the lower bound on the quadratic term is the following:

Let ${\cal H}\subset L_{2}(\mu)$ be a class of functions and set

Given a class of functions ${\cal F}$ , let ${\cal F}-{\cal F}=\{f-h:f,h\in{\cal F}\}$ . We will assume that there is some $u>0$ for which $Q_{{\cal F}-{\cal F}}(u)>0$ .

In Section 4 we will present several generic examples showing that this weak small-ball condition is indeed minimal, and that in many cases one may choose $u$ and $Q$ to be appropriate absolute constants.

This notion of small-ball is very different from concentration. The empirical mean of a function will concentrate around its true mean only if the function has well-behaved high moments. Such a condition is trivially satisfied when the function is a bounded, but in general it is highly restrictive. In contrast, it is well known that any sort of moment equivalence, even as weak as $\|h\|_{L_{2}}\leq L\|h\|_{L_{1}}$ , leads to the nontrivial small-ball estimate

for constants $c_{1}$ and $c_{2}$ that depend only on $L$ (see Section 4 for more details).

The significance of (3.2) is that it may be used to derive a very high probability lower bound on $P_{N}h^{2}$ . Indeed, let $X_{1},...,X_{N}$ be independent, distributed according to $\mu$ . A straightforward binomial estimate shows that with probability at least $1-2\exp(-c_{3}(L)N)$ , at least $c_{2}(L)N/2$ of the $|h(X_{i})|$ ’s are larger than $c_{1}(L)\|h\|_{L_{2}}$ . Hence, on that event,

We are, at last, ready to formulate the main result of the article.

Let ${\cal F}\subset L_{2}(\mu)$ be a closed, convex class of functions, set $Y$ to be the unknown target and put $\xi(X,Y)=f^{*}(X)-Y$ .

Fix $\tau>0$ for which $Q_{{\cal F}-{\cal F}}(2\tau)>0$ and set $\gamma<\tau^{2}Q_{{\cal F}-{\cal F}}(2\tau)/16$ . For every $0<\delta<1$ , with probability at least $1-\delta-\exp(-NQ_{{\cal F}-{\cal F}}^{2}(2\tau)/2)$ one has

The proof of Theorem 3.1 will be presented in Section 5.

Unlike Theorem 1.1, Theorem 3.1 holds with essentially no restrictions on the class or on the target. In particular, it may be applied in all the examples described in the introduction that fall outside the scope of Theorem 1.1, once ${\cal F}-{\cal F}$ satisfies Assumption 3.1.

Consider a family of estimation problems in ${\cal F}_{R}=\left\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\right\}$ and for a set of reasonable targets. In the persistence framework, the dimension $n$ , the radius $R$ and the noise level $\sigma$ are allowed to grow with the sample size $N$ , and one has to find conditions on $n(N)$ , $R(N)$ and $\sigma(N)$ that still ensure that $\|\hat{f}-f^{*}\|_{L_{2}}$ tends to zero with high probability. Therefore, one has to identify the correct way in which the estimation error scales with each one of these parameters.

We will study the persistence problem when $X$ has bounded iid coordinates and $Y=\bigl{<}t_{0},\cdot\bigr{>}+W$ for $t_{0}\in RB_{1}^{n}$ and a bounded, symmetric random variable $W$ that is independent of $X$ . Although the problem fits the assumptions of Theorem 1.1, the outcome of Theorem 3.1 turns out to be superior by far, and, in fact, leads to the optimal estimate in the minimax sense.

2 Is Theorem 3.1 optimal?

The question of whether Theorem 3.1 is optimal is rather delicate and requires highly technical machinery if it is to be answered in full. To keep the scope and length of the article within reason, we will only sketch a few facts indicating that Theorem 3.1 is optimal, at least for a wide variety of classes.

As noted earlier, the intrinsic parameter $\beta_{N}^{*}$ , which characterizes the low-noise regime, is also an upper estimate on the diameter of the version space associated with $f^{*}$ in $L_{2}(\mu)$ . There are many examples in which this estimate is sharp, including the class of linear functionals used in the persistence framework, endowed by $RB_{1}^{n}$ .

As it turns out, the $L_{2}(\mu)$ diameter of the version space is a lower bound on the estimation error in a rather general sense and independently of the learning procedure that is used. Indeed, fix a mean-zero random variable $W$ that is independent of $X$ , consider an arbitrary class ${\cal F}$ and the family of targets $Y^{f}=f(X)+W$ , where $f\in{\cal F}$ . Given $(X_{i})_{i=1}^{N}$ and $f\in{\cal F}$ ,

where the probability is taken with respect to the data $(X_{i},Y_{i}^{f})_{i=1}^{N}$ .

Therefore, $\beta_{N}^{*}$ captures the estimation error in the low-noise regime, with the exception of the (rare) examples in which it is far from the typical diameter of the version space ${\cal V}(f^{*},(X_{i})_{i=1}^{N})$ in $L_{2}(\mu)$ .

As for $\alpha_{N}^{*}$ , there are strong indications it is the right parameter for describing the interaction between the target and the class. Optimality questions of that flavour have been studied in , featuring arbitrary classes of function ${\cal F}$ and targets $Y^{f}=f(X)+W$ for some $f\in{\cal F}$ and mean-zero gaussian variables $W$ that are independent of $X$ .

The results of show that a ‘subgaussian version’ of $\alpha_{N}^{*}$ is optimal, under mild assumptions on the canonical gaussian process indexed by ${\cal F}$ . Without going into accurate and rather technical definitions, one may show that if ${\cal F}$ and $\xi=f^{*}(X)-Y$ satisfy a subgaussian condition, and if the canonical gaussian process indexed by ${\cal F}$ , $\{G_{f}:f\in{\cal F}\}$ , is sufficiently regular, then

is the optimal complexity parameter for the high-noise regime in such estimation problems, for a constant $\gamma$ that depends only on the subgaussian properties of ${\cal F}$ .

Hence, for the right choices of $\delta$ and $\gamma^{\prime}$ , $\alpha_{N}^{*}(\gamma^{\prime},\delta)$ coincides with $s_{N}^{*}(\gamma)$ , showing its optimality.

Unfortunately, extending this somewhat sketchy argument to the general case studied here, in which the subgaussian machinery is not at one’s disposal, requires methods that are well beyond the scope of this article. It is therefore deferred to future work.

Some Examples

Let us turn to situations one is likely to encounter in a heavy-tailed framework: classes of functions for which one has almost no moment control, and therefore, its members do not exhibit any useful two-sided concentration of empirical means; nevertheless, one may obtain a small-ball estimate as in Assumption 3.1.

Let ${\cal F}$ be a class of functions on a probability space $(\Omega,\mu)$ .

1. If $\|f_{1}-f_{2}\|_{L_{2}}\leq\kappa_{1}\|f_{1}-f_{2}\|_{L_{1}}$ for every $f_{1},f_{2}\in{\cal F}$ , then there are constants $c_{1}$ and $c_{2}$ that depend only on $\kappa_{1}$ for which $Q_{{\cal F}-{\cal F}}(c_{1})\geq c_{2}$ .

2. If there are $p>2$ and $\kappa_{2}$ for which $\|f_{1}-f_{2}\|_{L_{p}}\leq\kappa_{2}\|f_{1}-f_{2}\|_{L_{2}}$ for every $f_{1},f_{2}\in{\cal F}$ , then there are constants $c_{1}$ and $c_{2}$ that depend only on $\kappa_{2}$ and $p$ for which $Q_{{\cal F}-{\cal F}}(c_{1})\geq c_{2}$ .

Lemma 4.1 is an immediate outcome of the Paley-Zygmund inequality (see, e.g. ). For example, if $p>2$ and $\|h\|_{L_{p}}\leq\kappa_{2}\|h\|_{L_{2}}$ then by the Paley-Zygmund inequality, for every $0<u<1$ ,

Thus, when applied to each $h=f_{1}-f_{2}$ and for $u=1/2$ , it follows that $Q_{{\cal F}-{\cal F}}(1/2)\geq(3/4\kappa_{2}^{2})^{p/(p-2)}$ , and the small-ball condition holds uniformly in ${\cal F}-{\cal F}$ with reasonable constants.

The proof of Lemma 4.2 is presented in Section 6.1.

If either one of the moment conditions of Lemma 4.2 holds for $\zeta$ , then with probability at least $1-\delta-\exp(-c_{1}N)$ , ERM produced $\hat{t}\in T$ for which

for appropriate constants $c_{1}$ , $c_{2}$ and $c_{3}$ that depend only on $\kappa_{1}$ or on $\kappa_{2}$ and $p$ .

Needless to say that this problem falls outside the scope of Theorem 1.1, as functions in ${\cal F}$ and $Y$ need not be bounded, nor do they necessarily have rapidly decaying tails.

Let us turn to an example that illustrates the striking difference between Theorem 1.1 and Theorem 3.1, even in a bounded scenario, and in which $\alpha_{N}^{*}$ and $\beta_{N}^{*}$ are not difficult to compute.

Let $X=(\zeta_{i})_{i=1}^{n}$ be a random vector with independent coordinates distributed according to a mean-zero, variance $1$ random variable $\zeta$ . To give Theorem 3.1 and Theorem 1.1 a ‘level playing field’, assume that $|\zeta|\leq\kappa$ almost surely.

Recall that Problem 4.4 has been mentioned in the introduction for $X=(\varepsilon_{1},...,\varepsilon_{N})$ and $W=\sigma\varepsilon_{N+1}$ .

The following two statements summarize the outcomes of Theorem 1.1 and Theorem 3.1 for this choice of ${\cal F}$ and $Y$ . The proofs of the claims may be found in Section 6.2.

Let us begin with the outcome of Theorem 1.1:

For every $\kappa>1$ there exist constants $c_{1},c_{2}$ and $c_{3}$ that depend only on $\kappa$ for which the following holds. Assume that $\|\zeta\|_{L_{\infty}},\|W\|_{L_{\infty}}\leq\kappa$ . Set

The estimation error in Theorem 4.5 does not scale correctly with $R$ ( $R^{2}/\sqrt{N}$ is clearly too big) and also does not depend on any $L_{p}$ norm of the noise $W$ , except for the trivial bound $\|W\|_{L_{\infty}}=\kappa$ , which is likely to be much larger than, say, the variance $\|W\|_{L_{2}}$ .

In contrast, the next result follows from Theorem 3.1. To formulate it, set

It turns out that $\|W\|_{L_{2,1}}$ , which is slightly larger than $\|W\|_{L_{2}}$ but smaller than $c(q)\|W\|_{L_{q}}$ for any $q>2$ , captures the noise level of the problem. Since in virtually all examples the $L_{2}$ and $L_{2,1}$ norms are equivalent, we will abuse notation and denote $\sigma=\|W\|_{L_{2,1}}$ rather than $\sigma=\|W\|_{L_{2}}$ .

For every $\kappa>1$ there exist constants $c_{1},c_{2},c_{3}$ and $c_{4}$ that depend only on $\kappa$ for which the following holds. Assume that $\|\zeta\|_{L_{\infty}},\|W\|_{L_{\infty}}\leq\kappa$ and that $\|W\|_{2,1}=\sigma<\infty$ . Put

The results of show that up to the exact probability estimate, the estimation error given in Theorem 4.6 is optimal in the minimax sense, and no procedure can perform with confidence larger than $3/4$ and yield a estimation error better than $\sim\max\{v_{1},v_{2}\}$ .

It turns out that Theorem 4.6 can be extended even further. For example, it holds for a general target $Y$ rather than just for $Y=\bigl{<}t_{0},\cdot\bigr{>}+W$ , the assumption that $X$ has iid coordinates can be relaxed, and ‘heavy-tailed’ measures may be used instead. Unfortunately, the proof of a more general version of Theorem 4.6 comes at a high technical cost and we refer the reader to and for more details.

Proof of Theorem 3.1

We begin this section with a few definitions that will be needed in the proof of Theorem 3.1.

A class ${\cal H}$ is star-shaped around if for every $h\in{\cal H}$ and any $0<\lambda\leq 1$ , $\lambda h\in{\cal H}$ . Thus, a class is star-shaped around if it contains the entire interval $[0,h]$ whenever $h\in{\cal H}$ .

We will sometimes write $\beta_{N}(\gamma)$ instead of $\beta_{N}({\cal H},\gamma)$ .

Let ${\cal F}-f^{*}=\{f-f^{*}:f\in{\cal F}\}$ and observe that $\beta_{N}^{*}(\gamma)=\beta_{N}({\cal F}-f^{*},\gamma)$ . Also, it is straightforward to verify that if ${\cal H}$ is star-shaped around and $r>\beta_{N}(\gamma)$ then

The main component in the proof of Theorem 3.1 is the following:

Let ${\cal F}\subset L_{2}(\mu)$ be a closed, convex class and assume that there is some $\tau>0$ for which $Q_{{\cal F}-{\cal F}}(2\tau)>0$ . Given $f^{*}\in{\cal F}$ , set ${\cal H}={\cal F}-f^{*}$ . Then, for every $r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16)$ , with probability at least $1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2)$ , if $\|f-f^{*}\|_{L_{2}}\geq r$ , one has

The first step in the proof of Theorem 5.3 is the following uniform empirical small-ball estimate – which is of a similar nature to the results from and .

Let $S(L_{2})$ be the $L_{2}(\mu)$ unit sphere and let ${\cal H}\subset S(L_{2})$ . Assume that there is some $\tau>0$ for which $Q_{{\cal H}}(2\tau)>0$ . If

then with probability at least $1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2)$ ,

Let $Z(X_{1},...,X_{N})=\sup_{h\in{\cal H}}\left|P_{N}\phi_{u}(|h|)-P\phi_{u}(|h|)\right|$ . By the bounded differences inequality applied to $Z$ (see, for example, ), it follows that for every $t>0$ , with probability at least $1-2\exp(-2t^{2})$ ,

Note that $\phi_{u}$ is a Lipschitz function that vanishes in and with a Lipschitz constant $1/u$ . Therefore, by the Giné-Zinn symmetrization theorem and the contraction inequality for Bernoulli processes (see, e.g. ),

Hence, with probability at least $1-2\exp(-2t^{2})$ , for every $h\in{\cal H}$ ,

If $Q_{{\cal H}}(2\tau)>0$ , set $u=\tau$ and $t=\sqrt{N}Q_{{\cal H}}(2\tau)/2$ . Since

it follows that with probability at least $1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2)$ ,

Let ${\cal H}$ be star-shaped around and assume that there is some $\tau>0$ for which $Q_{{\cal H}}(2\tau)>0$ . Then, for every $r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16)$ , with probability at least $1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2)$ , for every $h\in{\cal H}$ that satisfies $\|h\|_{L_{2}}\geq r$ ,

Proof. Let $r>\beta_{N}({\cal H},\tau Q_{{\cal H}}(2\tau)/16)$ and since ${\cal H}$ is star-shaped around ,

Clearly, $Q_{V}(2\tau)\geq Q_{{\cal H}}(2\tau)$ and

By Theorem 5.4 applied to the set $V$ and since $Q_{V}(2\tau)\geq Q_{{\cal H}}(2\tau)$ , it follows that with probability at least $1-2\exp(-NQ_{{\cal H}}^{2}(2\tau)/2)$ , for every $v\in V$ ,

Next, fix any $h\in{\cal H}$ for which $\|h\|_{L_{2}}\geq r$ . ${\cal H}$ is star-shaped around and thus $(r/\|h\|_{L_{2}})h\in{\cal H}\cap rS(L_{2})$ , implying that $h/\|h\|_{L_{2}}\in V$ . The claim follows because (5.2) is positive homogeneous,

Proof of Theorem 5.3. Given $f^{*}\in{\cal F}$ , let ${\cal H}={\cal F}-f^{*}=\{f-f^{*}:f\in{\cal F}\}$ . Applying Corollary 5.5, it suffices to show that ${\cal H}$ is star-shaped around . To that end, observe that if $f-f^{*}\in{\cal H}$ and $0\leq\lambda\leq 1$ , then $\lambda(f-f^{*})=w-f^{*}$ for $w=\lambda f+(1-\lambda)f^{*}$ . Hence, as ${\cal F}$ is convex, $w\in{\cal F}$ .

Proof of Theorem 3.1. Recall that $\xi(X,Y)=f^{*}(X)-Y$ , that $(\xi_{i})_{i=1}^{N}=(f^{*}(X_{i})-Y_{i})_{i=1}^{N}$ , and that for every $f\in{\cal F}$ ,

Since ${\cal H}\subset{\cal F}-{\cal F}$ , the assertion of Theorem 5.3 for ${\cal H}={\cal F}-f^{*}$ holds with $Q_{{\cal F}-{\cal F}}(2\tau)$ replacing the larger $Q_{{\cal H}}(2\tau)$ . Hence, if $r>\beta_{N}({\cal H},\tau Q_{{\cal F}-{\cal F}}(2\tau)/16)$ , then with probability at least $1-2\exp(-NQ_{{\cal F}-{\cal F}}^{2}(2\tau)/2)$ , for every $f\in{\cal F}$ that satisfies $\|f-f^{*}\|_{L_{2}}\geq r$ , one has

Therefore, on that event, if $\|f-f^{*}\|_{L_{2}}\geq r$ ,

Fix $\gamma$ to be named later and consider $\alpha_{N}^{*}(\gamma,\delta/4)\equiv\alpha_{N}$ . Thus, with probability at least $1-\delta/4$ , if $\|f-f^{*}\|_{L_{2}}\leq\alpha_{N}$ then

and by the Giné-Zinn symmetrization theorem , with probability at least $1-\delta$ ,

Fix a sample for which (5.4) holds and consider $f\in{\cal F}$ that satisfies $\|f-f^{*}\|_{L_{2}}\geq\alpha_{N}$ . Since ${\cal F}-f^{*}$ is star-shaped around and

Combining (5.5) and (5.3), it is evident that with probability at least

if $\|f-f^{*}\|_{L_{2}}\geq\max\{\alpha_{N},r\}$ then

provided that $\gamma<\tau^{2}Q_{{\cal F}-{\cal F}}(2\tau)/16$ . On that event,

Additional proofs

Here, we will present proofs of some of the claims that have been formulated in previous sections.

We will prove a stronger statement: that the equivalence of the $p$ -th moment and the second moment of the linear forms $\bigl{<}X,t\bigr{>}$ holds for any isotropic random vector $X$ that is also unconditional and whose coordinates belong to $L_{p}$ for a fixed $p>2$ . That is, when $X=(x_{1},...,x_{n})$ has the same distribution as $(\varepsilon_{1}x_{1},...,\varepsilon_{n}x_{n})$ for every choice of signs $(\varepsilon_{i})_{i=1}^{n}$ (unconditionality), $\|x_{i}\|_{L_{2}}=1$ and $\|x_{i}\|_{L_{p}}\leq\kappa$ .

Such a random vector can be heavy-tailed, as its coordinates may only belong to an $L_{p}$ space for a fixed $p>2$ that is close to $2$ . Thus, the empirical mean of, say, $\bigl{<}X,e_{j}\bigr{>}^{2}=x_{j}^{2}$ may exhibit rather poor concentration around the true mean, which is $1$ . Despite that, the claim is that $\|\bigl{<}X,t\bigr{>}\|_{L_{p}}\leq c\sqrt{p}\kappa\|\bigl{<}X,t\bigr{>}\|_{L_{2}}$ , and any class of linear functionals satisfies a small-ball condition with constants that depend only on $\kappa$ and $p$ .

Clearly, a possible choice of $X$ is a random vector with independent, symmetric, variance one coordinates, and a standard symmetrization argument may be used to show that the same holds when the coordinates are mean-zero rather than symmetric. Thus, the unconditional case extends Lemma 4.2.

and $c(\lambda,\kappa)$ is a constant that depends only on $\lambda$ and $\kappa$ .

implying that $\|\bigl{<}X,t\bigr{>}\|_{L_{1}}\geq c(\lambda,\kappa)\|\bigl{<}X,t\bigr{>}\|_{L_{2}}$ .

2 The persistence problem

The proofs of Theorem 4.5 and Theorem 4.6 follow from several observations.

The $\psi_{2}$ norm of a random variable $X$ is

A random variable $X$ is called $L$ -subgaussian if $\|X\|_{\psi_{2}}\leq L\|X\|_{L_{2}}$ .

It is standard to verify that $\|X\|_{\psi_{2}}$ is equivalent to the smallest constant $\kappa$ for which $Pr(|X|\geq t)\leq 2\exp(-t^{2}/2\kappa^{2})$ for every $t\geq 1$ . Also, $\|X\|_{\psi_{2}}$ is equivalent to $\sup_{p\geq 2}\|X\|_{L_{p}}/\sqrt{p}$ (see, e.g. ).

Another property of $\psi_{2}$ random variables is that if $X_{1},...,X_{N}$ are independent and mean-zero, then for every $a_{1},...,a_{N}$ ,

where $(z_{i}^{*})_{i=1}^{n}$ is a monotone non-increasing rearrangement of $(|z_{i}|)_{i=1}^{n}$ .

There exists an absolute constant $C$ for which the following holds. Assume that $z_{1},...,z_{n}$ are independent copies of a mean-zero, variance $1$ random variable $z$ , and that for every $p\leq\log n$ , $\|z\|_{L_{p}}\leq\kappa\sqrt{p}$ . Then for every $1\leq d\leq n$ ,

Proof. For every $1\leq j\leq n$ and $p\geq 2$ ,

Therefore, if $t=u\kappa\sqrt{\log(en/j)}$ and $p=\log(en/j)$ then

The final component needed in the proofs of the two theorems is the following. Note that if ${\cal F}_{R}=\{\bigl{<}t,\cdot\bigr{>}:t\in RB_{1}^{n}\}$ and $Y=\bigl{<}t_{0},\cdot\bigr{>}+W$ for $t_{0}\in RB_{1}^{n}$ and $W$ that is independent of $X$ , then $f^{*}=\bigl{<}t_{0},\cdot\bigr{>}$ . Also, by the convexity and symmetry of $B_{1}^{n}$ ,

Therefore, if $X$ is a random vector with independent, mean-zero, variance 1 coordinates $\zeta_{i}$ , and $X_{1},...,X_{N}$ are independent copies of $X$ , then

where $Z=(z_{i})_{i=1}^{n}$ has independent coordinates. Each $z_{i}$ is distributed as $N^{-1/2}\sum_{j=1}^{N}\zeta_{i,j}$ , and $(\zeta_{i,j})_{j=1}^{N}$ are independent copies of $\zeta_{i}$ . Hence, by Lemma 6.4,

If the $\zeta_{i}$ ’s are distributed according to a mean-zero and variance $1$ random variable $\zeta$ that is bounded in $L_{\infty}$ by $\kappa$ , then $\|\zeta\|_{\psi_{2}}\leq\kappa\|\zeta\|_{L_{2}}$ and $\zeta$ is $\kappa$ -subgaussian. Moreover, $\|z_{i}\|_{L_{2}}=1$ , and as a weighted sum of independent $\psi_{2}$ random variables, $\|z_{i}\|_{\psi_{2}}\leq c\kappa$ for a suitable absolute constant $c$ . Hence, for every $p\geq 2$ , $\|z_{i}\|_{L_{p}}\leq c_{1}\kappa\sqrt{p}$ .

provided that $(R/s)^{2}\leq n/4$ . Therefore, in that range,

while if $(R/s)^{2}\geq n/4$ one may verify that

for a suitable absolute constant $c_{3}$ .

Combining (6.2) with Theorem 1.1, there is a high probability event on which, if $N\leq n^{2}$ ,

Proof of Theorem 4.6. Since the argument is rather standard and follows a similar path to the previous proof, we will omit some of the details.

To prove the result, one has to bound $\beta_{N}^{*}$ and $\alpha_{N}^{*}$ when $X=(\zeta_{1},...,\zeta_{n})$ and $\|\zeta\|_{L_{\infty}}\leq\kappa$ . Using (6.2) as above, it is evident that if $N\leq c_{1}n$ ,

for suitable constants $c_{1}$ and $c_{2}$ that depend only on $\kappa$ and $\gamma$ . Moreover, using the notation of Theorem 3.1, $\gamma=\tau Q_{{\cal F}-{\cal F}}(2\tau)/16$ and thus, $\gamma$ is a constant that depends only on $\kappa$ as well.

for constants $c_{1}$ and $c_{3}$ that depend only on $\kappa$ .

Next, to bound $\alpha_{N}^{*}(\gamma,\delta)$ , set $\sigma=\|W\|_{L_{2,1}}$ , let

and recall that $W$ and $X$ are independent. By a standard multiplier theorem (see, e.g., Chapter 2.9),

and since $\frac{1}{\sqrt{m}}\sum_{i=1}^{m}X_{i}$ is an isotropic, $c\kappa$ -subgaussian vector, it follows from Talagrand’s Majorizing Measures Theorem that

where $g_{1},...,g_{n}$ are independent, standard gaussian variables and $c_{5}$ is an absolute constant.

and constants $c_{6},c_{7}$ that depend only on $\kappa$ .

Indeed, for every $t\in 2RB_{1}^{n}\cap sB_{2}^{n}$ and since $W$ and $X$ are independent,

Hence, for a constant $c_{9}$ that depends only on $\kappa$ , $x=c_{9}Ns^{2}\min\{1/R,1/\sigma^{2}\}$ and $\delta=\exp(-x)$ , one has that with probability at least $1-\delta$ , $V_{s}\leq\gamma\sqrt{N}s^{2}$ . Therefore, $\alpha(\gamma,\delta)\leq s$ , which completes the proof.

Concluding Remarks

Although it seems that the complexity parameters $\alpha_{N}^{*}$ and $\beta_{N}^{*}$ (and $k_{N}^{*}$ as well) depend on the unknown function $f^{*}$ , in virtually all applications one may replace the indexing set $F\cap r{\cal D}_{f^{*}}$ with $({\cal F}-{\cal F})\cap r{\cal D}$ , where ${\cal D}$ is the unit ball in $L_{2}(\mu)$ – with the obvious modifications to the definitions of the parameters. Moreover, if ${\cal F}$ is centrally symmetric (i.e. if $f\in{\cal F}$ then $-f\in{\cal F}$ ) in addition to being convex, then ${\cal F}-{\cal F}\subset 2{\cal F}$ and the indexing set becomes $2{\cal F}\cap r{\cal D}$ .

The fact that $\beta_{N}^{*}$ depends on an average while $\alpha_{N}^{*}$ is based on a probability estimate is just an outcome of this presentation. It is possible to obtain a similar result using a ‘high probability’ version of $\beta_{N}^{*}$ , by using a slightly different argument. On the other hand, while one may replace $\alpha_{N}^{*}$ with an averaged version, the latter makes little sense in heavy-tailed situations, as passing from an average-based parameter to a high probability one reverts to the question of concentration. In heavy-tailed situations the mean need not represents the typical behaviour, and should not be used when trying to obtain very high probability bounds.

Another fact worth mentioning is that the results presented here can be extended well beyond the squared loss, to arbitrary convex loss functions (see ).

for midpoints $(Z_{i})_{i=1}^{N}$ that belong to the intervals whose ends are $\xi_{i}$ and $\xi_{i}+(f-f^{*})(X_{i})$ , and thus depend on $f$ and on the sample $(X_{i},Y_{i})_{i=1}^{N}$ .

Just as in the squared-loss case, $f$ cannot be an empirical minimizer if $P_{N}{\cal L}_{f}>0$ . Therefore, using (7), one may obtain a positive lower bound on $P_{N}{\cal L}_{f}$ by identifying the levels $\bar{\alpha}_{N}$ and $\bar{\beta}_{N}$ , for which, if $\|f-f^{*}\|_{L_{2}}\geq\bar{\beta}_{N}$ then

for some constant $c$ , and if $\|f-f^{*}\|_{L_{2}}\geq\bar{\alpha}_{N}$ then

On that event the estimation error satisfies

for constants that are independent of $f$ .

A version of Theorem 3.1 for a general convex loss can be found in .