Risk minimization by median-of-means tournaments

Gabor Lugosi, Shahar Mendelson

Introduction

We assume in what follows that the minimum is attained and $f^{*}\in{\mathcal{F}}$ exists and is unique, as is the case when ${\mathcal{F}}\subset L_{2}(\mu)$ is a closed, convex set.

One may formulate two natural goals in estimation and prediction problems. One of them is to find a function $f\in{\mathcal{F}}$ whose $L_{2}(\mu)$ distance to $f^{*}$

is as small as possible. The other is to ensure that the excess risk of the selected function

The crucial difference between this type of problems and standard questions in approximation theory is that the available information is limited to a random sample. One observes $\mathcal{D}_{N}=((X_{1},Y_{1}),\ldots,(X_{N},Y_{N}))$ , that is, $N$ independent pairs, where each $(X_{i},Y_{i})$ has the same distribution as $(X,Y)$ and $\mathcal{D}_{N}$ is independent of $(X,Y)$ . The fact that the distribution of the pair $(X,Y)$ is not known makes it impossible to invoke approximation-theoretical methods and identify directly the true minimizer of the risk.

where throughout the article, for $q\geq 1$ , we use the notation

The excess risk, also known as the prediction error, compares the ‘predictive capabilities’ of $\widehat{f}_{N}$ to that of the best in the class, and is defined by the conditional expectation

Note that, up to this point, $Y$ was an arbitrary square-integrable real-valued random variable, and obviously one would like to be able to treat as wide a variety of targets as possible. Clearly, the accuracy and confidence one may establish may depend on some features of the target–for example, some a-priori estimate on its $L_{q}$ norm–, or on its “distance” to ${\mathcal{F}}$ , etc. We consider a broad set of admissible targets ${\cal Y}$ , and the accuracy and confidence of $\Phi$ relates to its performance for any admissible target $Y\in{\cal Y}$ . Thus, a learning problem is the triplet $({\mathcal{F}},Y,X)$ , when $X$ and $Y$ are not known, though the learner does know that $Y\in{\cal Y}$ .

It is clear that there is a tradeoff between the accuracy and confidence in a given learning problem: the smaller the error is, the harder it is to attain it. The question of this accuracy/confidence tradeoff is of utmost importance in statistical learning theory, and has been investigated extensively in numerous manuscripts since the early days of the area in the late 1960’s (see, for example, the books for a sample of the work devoted to this question). To find the right accuracy/confidence tradeoff one must first identify a lower bound on the tradeoff in terms of the sample size, the structure of ${\mathcal{F}}$ and possibly some additional information on $X$ and $Y$ , and then come up with a learning procedure that attains the tradeoff.

Roughly put, one should explore the tradeoff for the set of “achievable accuracies” of each learning problem. An accuracy ${\cal E}$ is achievable if there is a learning procedure in ${\mathcal{F}}$ that achieves the accuracy ${\cal E}$ for the problem $({\mathcal{F}},Y,X)$ with constant probability–say at least $3/4$ –, and because $Y$ is not known, this has to hold for any $Y\in{\cal Y}$ . We define the accuracy edge as the smallest achievable accuracy of a problem.

The primary question is to find the correct accuracy/confidence tradeoff for any (reasonable) learning problem and identify a learning procedure that attains that tradeoff all the way down to the accuracy edge. It should be noted that up to this point in time, and other than in a few isolated examples, no learning procedure even came close to the optimal tradeoff at any nontrivial accuracy level.

In this article we solve this problem by presenting an optimal learning procedure: it yields the best possible accuracy/confidence tradeoff (almost) up to the achievable edge, and under minimal assumptions on the learning problem.

The minor reservation “almost” is due to fact that more often than not, the identity of the accuracy edge of a learning problem is not known. As it is explained in what follows, while one may provide lower estimates on the accuracy edge, there is a very real possibility that such estimates are too optimistic and the real accuracy edge is larger. Regardless, the procedure we introduce “gets as close” to the accuracy edge as any other known procedure–or better, and with a dramatically better confidence. It also exhibits the optimal accuracy/confidence tradeoff for larger errors, something that no other procedure is known to do.

It should be clarified at this point that by “optimal accuracy” we mean an accuracy that is optimal up to a constant factor and optimality in the confidence means that the probability with which the claimed accuracy does not hold is optimal up to a constant factor in the exponent.

Before we present a more accurate formulation of our main results and describe the optimal procedure, let us explain what our procedure is not: it is not empirical risk minimization (ERM), nor any of its “family members”.

Perhaps the most natural way of choosing $\widehat{f}_{N}$ is by empirical risk minimization, that is, by least squares regression,

Again, we assume that the minimum is attained, while if there are several minimizers, $\widehat{f}_{N}$ may be chosen among them in an arbitrary way.

The performance of least squares regression has been thoroughly studied in many different scenarios. A sample of the rich literature includes Györfi, Kohler, Krzyzak, Walk , van de Geer , Bartlett, Bousquet, and Mendelson , Koltchinskii , Massart .

It turns out that the performance of ERM changes dramatically according to the tail behaviour of the functions involved in the given learning problem. One may show (see, e.g., ) that if ${\mathcal{F}}$ is convex and the functions in ${\mathcal{F}}$ (more precisely, the random variables $\{f(X):f\in{\mathcal{F}}\}$ ) and the target $Y$ have well-behaved tails (and by “well-behaved” we mean sub-Gaussian), ERM preformed in ${\mathcal{F}}$ yields good results: for an accuracy that is not far from the accuracy edge, it attains the optimal confidence, though it does not maintain the optimal accuracy/confidence tradeoff for larger errors. Unfortunately, the situation deteriorates considerably when either members of ${\mathcal{F}}$ or one of the admissible targets is heavy-tailed in some sense. In such cases, the performance of ERM is significantly weaker than the known theoretical limitations of the accuracy/confidence tradeoff. Moreover, replacing ERM with a different procedure is of little use: other than in few and rather special learning problems, there have been no known alternatives to ERM whose performance comes close to the known theoretical limitations of the accuracy/confidence tradeoff, and certainly not when the problem is heavy-tailed.

The reason for ERM’s diminished capacity is that it is sensitive to even a small number of atypical points in the sample $(X_{i},Y_{i})_{i=1}^{N}$ : since ERM selects a minimizer of the empirical mean of the squared loss, atypical values may distort the selection and send ERM to the wrong part of ${\mathcal{F}}$ . This sensitivity is clearly reflected in the confidence with which ERM operates in heavy-tailed situations: roughly put, one can guarantee that ERM performs with the right accuracy only on samples that are not contaminated by a significant number of atypical values. However, in heavy-tailed situations, the latter does not occur frequently, and having atypical values is simply a fact of life one has to deal with.

In contrast, the procedure we suggest as an alternative to ERM leads to the optimal accuracy/confidence tradeoff even in heavy-tailed situations. Unlike ERM, it is not sensitive even to a large number of atypical sample points.

One may show that a nontrivial estimate is possible only when $N\geq cn$ for a suitable absolute constant $c$ , and we consider only such values of $N$ . Also, there are known estimates on the theoretical limitations of this problem: a lower bound on the accuracy edge is of the order of $\sigma\sqrt{n/N}$ , and for an accuracy level that is proportional to the accuracy edge, say, $c_{0}\sigma\sqrt{n/N}$ for a suitable absolute constant $c_{0}$ , the conjectured confidence is $1-2\exp(-c_{1}n)$ .

Moreover, because of the minimal assumptions on $W$ and $X$ , the best estimate one can hope for on $(*)$ and that holds with probability $1-\delta$ follows from Chebyshev’s inequality. In particular, it leads to the following, rather unsatisfactory, estimate on the performance of ERM:

Also, if one wishes for an error that is proportional to the (conjectured) accuracy edge, that is, of the order of $\sigma\sqrt{n/N}$ , the best that one can hope for is a constant confidence–a very different estimate from the conjectured confidence of $1-2\exp(-c_{1}n)$ .

Although what we have described is an upper bound, one may show (see ) that this estimate captures the performance of empirical risk minimization, and in particular exhibits ERM’s inability to deal with atypical sample points. The reality is that ERM performs with an accuracy of the order of $\sigma\sqrt{n/N}$ only on the relatively few samples that contain almost no misleading data.

The main result of this article, when applied to this example, shows that under the same assumptions, the procedure we suggest selects $\widehat{t}$ for which

for some numerical constants $c,C>0$ ; that is, it performs with optimal confidence at a level that is proportional to the accuracy edge. In fact, our procedure gives the optimal confidence for any accuracy $r\geq c^{\prime}\sigma\sqrt{n/N}$ .

In the next section we present the required definitions, outline the current state of the art, and formulate our main results. In Section 3, we describe the new procedure in detail. In Section 4 we illustrate the power of the main results on some canonical examples, before turning to the proofs of our results.

Let us point out the well-understood fact that the behaviour of the accuracy and confidence in learning problems in which ${\mathcal{F}}$ is not convex is trivial in some sense, and totally different from the convex case, which is why focus on the latter. As it happens, the dominating factor in non-convex problems is the ‘location’ of the targets $Y$ relative to ${\mathcal{F}}$ rather than the structure of ${\mathcal{F}}$ , and an ‘unfavourable location’ of a target $Y$ completely distorts the accuracy and confidence of the learning problem. However, even if ${\mathcal{F}}$ is not convex, all the targets of the form $Y=f_{0}(X)+W$ for $f_{0}\in{\mathcal{F}}$ and $W$ that is symmetric and independent of $X$ happen to be in a ‘favourable’ location and thus our results apply to such problems as well.

The accuracy edge and the accuracy/confidence tradeoff

We begin by describing the known theoretical limitations on the accuracy edge and on the accuracy/confidence tradeoff for a given learning problem. To this end, let us introduce some notation, following the path of .

Let $D=\{f:\|f\|_{L_{2}}\leq 1\}$ be the unit ball in $L_{2}(\mu)$ and set $S=\{f:\|f\|_{L_{2}}=1\}$ to be the unit sphere. For $h\in L_{2}(\mu)$ and $r>0$ , put $rD_{h}=\{f:\|f-h\|_{L_{2}}\leq r\}$ . Let

Thus, ${\rm star}({\mathcal{F}},h)$ is the star-shaped hull of ${\mathcal{F}}$ around $h$ , that is, the union of all segments for which one end-point is $h$ and the other is in ${\mathcal{F}}$ .

The star-shaped hull ${\rm star}({\mathcal{F}},h)$ adds regularity to the class around the fixed centre $h$ : on the one hand, it does not increase the size of the class by much, while on the other hand, it implies that every function of the form $f-h$ has a ‘scaled-down’ version when one moves towards . In particular, the level sets ${\rm star}({\cal F}-h,0)\cap rS$ become ‘richer’ as $r$ gets smaller: each one of them contains scaled-down copies of all ‘higher’ levels.

Consider the localization of ${\mathcal{F}}$

which is given by the shift that maps the designated point $h$ to . Then the resulting class is made ‘more regular’ by taking its star-shaped hull around , and finally it is localized, by considering its intersection with $rD$ , the $L_{2}(\mu)$ ball of radius $r$ , centred in .

Observe that if ${\mathcal{F}}$ is convex then for any $h\in{\mathcal{F}}$ , ${\rm star}({\mathcal{F}},h)={\mathcal{F}}$ . Also, in that case,

and if, in addition, ${\mathcal{F}}$ is centrally symmetric (that is, if $f\in{\mathcal{F}}$ then $-f\in{\mathcal{F}}$ ), then ${\mathcal{F}}_{h,r}\subset 2{\mathcal{F}}\cap rD$ .

One way of deriving lower estimates on the accuracy edge and on the accuracy/confidence tradeoff is based on the packing numbers of the localizations ${\mathcal{F}}_{h,r}$ .

Given a set $H\subset L_{2}(\mu)$ and $\varepsilon>0$ , denote by ${\cal M}(H,\varepsilon D)$ the cardinality of a maximal $\varepsilon$ -separated subset of $H$ . That is, ${\cal M}(H,\varepsilon D)$ is the maximal cardinality of a subset $\{h_{1},...,h_{m}\}\subset H$ , for which $\|h_{i}-h_{j}\|_{L_{2}}\geq\varepsilon$ for every $i\not=j$ .

Note that if $H^{\prime}$ is a maximal $\varepsilon$ -separated subset of $H$ , then it is also an $\varepsilon$ -cover of $H$ in the sense that for every $h\in H$ there is some $h^{\prime}\in H^{\prime}$ that satisfies $\|h^{\prime}-h\|_{L_{2}}\leq\varepsilon$ .

For $\kappa,\eta>0$ and $h\in{\mathcal{F}}$ , set

There exist absolute constants $\kappa$ and $\eta$ for which the following holds. Let ${\mathcal{F}}\subset L_{2}(\mu)$ be convex and centrally symmetric. For any learning procedure $\Phi$ there exists an $f_{0}\in{\cal F}$ and target $Y=f_{0}(X)$ for which, with probability at least $1/4$ ,

For $\kappa>0$ , $0<\eta<1$ and $h\in{\mathcal{F}}$ , set

(See, e.g., .) There exist absolute constants $\kappa$ and $\eta$ for which the following holds. Let ${\mathcal{F}}\subset L_{2}(\mu)$ and set $W$ to be a centred Gaussian variable with variance $\sigma>0$ that is independent of $X$ . Then, for any learning procedure $\Phi$ there exists $f_{0}\in{\mathcal{F}}$ and a target $Y=f_{0}(X)+W$ for which, with probability at least $1/4$ ,

Combining these two facts, we have a lower bound on the accuracy edge:

for some constants $\kappa_{i},\eta_{i}$ , $i=1,2$ . However, there is no guarantee that this lower estimate is sharp. As we explain in what follows, it is very possible that the true accuracy edge is larger.

Let us turn to the theoretical limitations of the accuracy/confidence tradeoff, assuming that the set of admissible targets ${\cal Y}$ is not too trivial. By that we mean that it at least contains all the targets of the form $Y=f_{0}(X)$ (the noise-free problems) and $Y=f_{0}(X)+W$ , for $f_{0}\in{\mathcal{F}}$ and $W$ that is a centred Gaussian variable with variance $\sigma$ , and is independent of $X$ . We call this set of targets minimal, and of course, ${\cal Y}$ could be much larger.

Applying the results in and one has the following:

There exists an absolute constant $c$ for which the following holds. Let ${\mathcal{F}}\subset L_{2}(\mu)$ be a class that is star-shaped around one of its points (i.e., for some $f_{0}\in{\mathcal{F}}$ and every $f\in{\mathcal{F}}$ , $[f_{0},f]\subset{\mathcal{F}}$ ). Consider ${\cal Y}$ that contains the minimal set of targets. If $\Phi$ is a learning procedure that performs with accuracy $r$ and confidence $1-\delta$ for every such target, then

These facts set our first benchmark (which may be too optimistic, of course): the lower bound on the accuracy edge

and the bound on the accuracy/confidence tradeoff for $r\geq c_{0}\lambda^{*}$ ,

As we noted earlier, $\lambda^{*}$ is an optimistic, and perhaps not very realistic, lower bound on the accuracy edge. A more reasonable conjecture relies on more “global” parameters that take into account the fine structure of ${\mathcal{F}}$ at an arbitrarily small level, defined next.

From here on, let $(\varepsilon_{i})_{i=1}^{N}$ be independent, symmetric $\{-1,1\}$ -valued random variables that are independent of $(X_{i},Y_{i})_{i=1}^{N}$ .

For $\kappa>0$ and $h\in{\mathcal{F}}$ let

and set $r_{E}(\kappa)=\sup_{h\in{\mathcal{F}}}r_{E}(\kappa,h)$ .

The other “global” parameter we require does depend on $Y$ . It is used to calibrate the interaction between ${\mathcal{F}}$ and the target.

Remark. The role of $\sigma$ and of ${\mathcal{F}}_{Y}^{(\sigma)}$ in Definition 2.8 deserves some explanation. The mean oscillation in Definition 2.8 involves the “multipliers” $(h(X_{i})-Y_{i})_{i=1}^{N}$ , and the right choice for the centre $h$ is the unknown $f^{*}$ . While one may take into account the “worst” $h\in{\mathcal{F}}$ , doing so makes little sense, as $f^{*}$ is the minimizer of the $L_{2}$ distance between ${\mathcal{F}}$ and $Y$ , and for the worst $h$ , $\|h-Y\|_{L_{2}}$ could be significantly larger than $\|f^{*}-Y\|_{L_{2}}$ . To overcome this obstacle we assume that an a-priori estimate on the $L_{2}$ distance between $Y$ and ${\mathcal{F}}$ (i.e., a value $\sigma$ such that $\|f^{*}-Y\|_{L_{2}}\leq\sigma$ ) is available. With this information one still needs to consider the worst centre $h$ , but only among all functions $h\in{\mathcal{F}}$ that satisfy $\|h-Y\|_{L_{2}}\leq\sigma$ . As we explain in Section 4, thanks to known estimates for the expectation of the supremum of a multiplier process, one only needs to keep in mind that the multipliers $\xi_{i}=h(X_{i})-Y_{i}$ are independent copies of some random variable $\xi$ that satisfies some moment condition, such as $\|\xi\|_{L_{q}}\leq L\sigma$ for some $q>2$ and a suitable constant $L$ .

In light of the results from , a realistic alternative to $\lambda^{*}$ is

for some constants $c_{1},c_{2}$ , and when for the given (and unknown) target $Y\in{\cal Y}$ one has $\|Y-f^{*}(X)\|_{L_{2}}\leq\sigma$ . Indeed, in it was shown that under some mild conditions on the learning problem, specified below, ERM performs in ${\mathcal{F}}$ with accuracy $cr^{*}$ and constant confidence. Thus, $r^{*}$ is a potential (and to-date, the best) candidate for the accuracy edge.

Therefore, up to the issue of the true identity of the accuracy edge, the (somewhat vaguely formulated) question of the accuracy/confidence tradeoff is as follows:

Is there a learning procedure which, for any reasonable learning problem and any $r\geq c_{1}r^{*}$ , performs with accuracy $r$ and confidence $1-2\exp\bigl{(}-c_{2}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)}$ , thus achieving the optimal accuracy/confidence tradeoff?

Our main result answers Question 2.9 in the affirmative where our notion of a “reasonable learning problem” is formulated rigorously below.

The first theorem we present requires the following conditions:

Let $L$ be a constant and let $X$ be distributed according to the measure $\mu$ on ${\cal X}$ . Given a locally compact, convex class of functions ${\mathcal{F}}\subset L_{2}(\mu)$ and $Y\in L_{2}$ , assume that

$\bullet$ for every $f,h\in{\mathcal{F}}$ , $\|f-h\|_{L_{4}}\leq L\|f-h\|_{L_{2}}$ ;

$\bullet$ for every $f\in{\mathcal{F}}$ , $\|f-Y\|_{L_{4}}\leq L\|f-Y\|_{L_{2}}$ ;

$\bullet$ $\|f^{*}-Y\|_{L_{2}}\leq\sigma$ for some known constant $\sigma>0$ .

Let $L\geq 1$ , $\sigma>0$ , and suppose Assumption 2.1. There exist constants $c,c_{0},c_{1}$ and $c_{2}$ that depend only on $L$ for which the following holds. Let

There exists a procedure that, based on the data $\mathcal{D}_{N}=(X_{i},Y_{i})_{i=1}^{N}$ and the values of $L$ , $\sigma$ and $r$ , selects a function $\widehat{f}\in{\mathcal{F}}$ such that, with probability at least

Of course, the identity of $f^{*}$ is not known, and therefore, it is not reasonable to expect that $r^{*}(f^{*})$ is known beforehand. A “legal” data-independent choice is any $r\geq 2r^{*}$ that is larger than $2r^{*}(f^{*})$ regardless of the identity of $f^{*}$ . In particular, Theorem 2.10 gives the optimal accuracy/confidence tradeoff for any accuracy $r\geq 2r^{*}$ . Alternatively, one may consider the “right” choice of the parameter $r$ as a model selection problem that may be selected using cross validation if independent data are available. We do not discuss the rather straightforward details further here.

Remark. In our main assumption of Theorem 2.10 we use the equivalence between the $L_{4}$ and $L_{2}$ norms for functions of the form $Y-f(X)$ . This allows us to derive bounds in terms of the variance of $Y-f^{*}(X)$ . In fact, if instead of norm equivalence we just have a bound on $\sigma_{4}=\|Y-f^{*}(X)\|_{L_{4}}$ , the arguments work equally well, with $\sigma_{4}$ replacing $\sigma$ . The sets ${\mathcal{F}}_{Y}^{\sigma}$ need to be adjusted as well: it should be replaced by all functions in ${\mathcal{F}}$ whose $L_{4}$ distance to $Y$ is at most $\sigma_{4}$ .

It turns out that, when dealing with independent noise, the assumptions required in Theorem 2.10 may be relaxed even further. In particular, we do not require convexity of the class ${\mathcal{F}}$ and the assumption of norm equivalence may be relaxed:

Let $q>2$ , $L>1$ , and $\sigma>0$ . There exist constants $c,c_{0},c_{1}$ and $c_{2}$ that depend only on $q$ and $L$ for which the following holds. Let ${\mathcal{F}}$ be a locally compact class of functions and assume that for every $f\in{\rm span}({\mathcal{F}})$ , $\|f\|_{L_{q}}\leq L\|f\|_{L_{2}}$ . Assume further that $Y=f_{0}(X)+W$ for some $f_{0}\in{\mathcal{F}}$ and $W$ that is mean-zero, independent of $X$ , and satisfies $\|W\|_{L_{2}}\leq\sigma$ .

Let $r^{*}(f^{*})$ be as above and fix $r\geq 2r^{*}(f^{*})$ . There exists a procedure that, based on the data $\mathcal{D}_{N}=(X_{i},Y_{i})_{i=1}^{N}$ and the values of $L,q$ , $\sigma$ and $r$ , selects a function $\widehat{f}\in{\mathcal{F}}$ such that, with probability at least

Note that in the case of independent additive noise, the assumptions of Theorem 2.11 are almost the minimal needed for the learning problem to be well defined: norm equivalence for $q$ that may be arbitrarily close to $2$ and $W\in L_{2}$ that perhaps does not have any higher moments. Even under these minimal assumptions, we still obtain the optimal accuracy/confidence tradeoff.

To put Theorem 2.10 in perspective, we describe the current sharpest estimates on the accuracy and confidence of a learning problem, focusing on possibly heavy-tailed distributions.

The best known estimate on prediction and estimation in a general convex class are based on a quite weak condition, rather than the norm equivalence we use. Recall that a class ${\mathcal{F}}$ satisfies a small-ball condition with constants $\kappa_{0}$ and $\rho_{0}$ if for every $f,h\in{\mathcal{F}}\cup\{0\}$ ,

for constants $c_{1},c_{2}$ and $c_{3}$ that depend only on $\kappa_{0}$ and $\rho_{0}$ .

In other words, when $N\geq cn$ , the procedure exhibits the optimal accuracy/confidence tradeoff for any $r\geq c_{3}\sigma\sqrt{n/N}$ .

The median-of-means tournament

The key to obtaining sharp estimates for both the accuracy and the confidence is identifying a procedure that is not sensitive to atypical values that may occur on a small part of the given sample. Thus, a natural starting point is the conceptually simple and attractive mean estimator, the so-called median-of-means estimator. It was proposed, independently, by Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy , and is defined as follows.

and define $\widehat{\mu}_{N}^{(\delta)}$ as the median of $W_{1},\ldots,W_{n}$ . (If the median is not uniquely defined, here, and in the rest of the paper, we choose the smallest one. Any other choice would work equally well.) It is straightforward to verify that for any $N\geq 4$ ,

For properties, applications, and extensions of the median-of-means estimator, we refer to Bubeck, Cesa-Bianchi, and Lugosi . Devroye, Lerasle, Lugosi, and Oliveira Hsu and Sabato , Lerasle and Oliveira , Minsker , Audibert and Catoni .

It is tempting to try to replace the empirical means $(1/N)\sum_{i=1}^{N}(f(X_{i})-Y_{i})^{2}$ by some median-of-means estimate of the risk for each $f\in{\mathcal{F}}$ , and select a function in ${\mathcal{F}}$ that minimizes the estimate. However, due to the nonlinear nature of the median-of-means, it is difficult to control the process of the estimated losses. Instead, the alternative we propose is to estimate the difference of the risk for all pairs $f,h$ and organize a two-stage “tournament”.

We mention here that Brownlees, Joly, and Lugosi propose empirical minimization based on a different robust mean estimator, a carefully designed M-estimator proposed by Catoni . Under general loss functions they derive analogs of Dudley’s chaining bound for the excess risk. However, the derived bounds are far from giving the optimal rate of convergence under the squared loss.

Without loss of generality and for convenience in the notation, we use $3N$ instead of $N$ for the sample size and assume that $\mathcal{D}_{3N}=(X_{i},Y_{i})_{i=1}^{3N}$ is the given sample. The sample is split into three equal parts $(X_{i},Y_{i})_{i=1}^{N}$ , $(X_{i},Y_{i})_{i=N+1}^{2N}$ and $(X_{i},Y_{i})_{i=2N+1}^{3N}$ . Given the wanted degree of accuracy $r\geq r^{*}$ , the first part of the sample–in fact, just $(X_{i})_{i=1}^{N}$ –is used to estimate pairwise distances within ${\mathcal{F}}$ , in a sense that will be clarified below. The second part is used in the preliminary round of the tournament. We show that the outcome of the preliminary round is a set $H\subset{\mathcal{F}}$ that contains $f^{*}$ and possibly some other functions whose $L_{2}(\mu)$ distance to $f^{*}$ is at most $cr$ . The final part of the tournament is a ‘champions league’ round. Participants in that final round are the elements in $H$ (the ‘qualifiers’ of the preliminary round), and the goal of that round is to identify a function $\widehat{f}\in H$ whose predictive capabilities are almost optimal, in the sense that

Let us now describe the three stages of the median-of-means tournament is detail.

The functional $\Phi$ allows one to identify distances in ${\mathcal{F}}$ in a crude (isomorphic) way, as the next theorem shows:

$\bullet$ If $\Phi_{\mathcal{C}_{N}}(f,f^{*})\geq\beta r$ then $\beta^{-1}\Phi_{\mathcal{C}_{N}}(f,f^{*})\leq\|f-f^{*}\|_{L_{2}}\leq\alpha^{-1}\Phi_{\mathcal{C}_{N}}(f,f^{*})$ .

$\bullet$ If $\Phi_{\mathcal{C}_{N}}(f,f^{*})<\beta r$ then $\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ .

Proposition 3.2 is an immediate modification of Theorem 3.3 from . For the sake of completeness we outline the main components of its proof in the appendix.

Next we introduce the “distance oracle”, denoted by ${\cal DO}$ . Recall the definition of $r^{*}$ from (2.7) and note that for the right choice of constants, $r^{*}\geq d^{*}$ . The distance oracle is adapted to the wanted degree of accuracy, that is, to any fixed $r\geq 2r^{*}$ .

Fix $r\geq 2r^{*}$ . Using the notation of Proposition 3.2, if $\Phi_{\mathcal{C}_{N}}(f,h)\geq\beta r$ set ${\cal DO}(f,h)=1$ , otherwise set ${\cal DO}(f,h)=0$ .

The distance oracle determines if a match between $f$ and $h$ takes place: it does if ${\cal DO}(f,h)=1$ and it is abandoned if ${\cal DO}(f,h)=0$ . Note that Proposition 3.2 only shows that ${\cal DO}$ is a realistic indication of the distance between pairs when one of the functions is the designated function $f^{*}$ . This serves our purposes since the designated function we are interested in is the minimizer of the true risk in ${\mathcal{F}}$ , and the success of the procedure only requires having accurate information on matches that involve $f^{*}$ , even if we do not know which matches those are.

It follows from Proposition 3.2 that with probability at least $1-2\exp(-cN)$ relative to $(X_{i})_{i=1}^{N}$ , if a match between $f^{*}$ and $f$ is allowed to proceed then $\|f-f^{*}\|_{L_{2}}\geq r$ , while if it is abandoned then $\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ .

2 The preliminary round

The goal of the preliminary round is to produce a subset $H\subset{\mathcal{F}}$ that, with overwhelming probability over the samples $(X_{i},Y_{i})_{i=N+1}^{2N}$ , contains $f^{*}$ and $\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ for any $h\in H$ .

The round consists of ‘matches’ between every pair $f,h\in{\mathcal{F}}$ , and a match can have three possible outcomes: a win by either side, or a draw (the latter includes abandoned matches because of the ruling of the distance oracle).

Each match is ‘played’ using the second part of the sample, $(X_{i},Y_{i})_{i=N+1}^{2N}$ . The sub-sample is partitioned to $n$ blocks $(I_{j})_{j=1}^{n}$ of cardinality $m=N/n$ each, for a choice of $n$ specified later. Let us note that $n$ depends on the desired degree of accuracy $r$ .

$\bullet$ A match between $f$ and $h$ takes place if the distance oracle, using the first part of the sample $(X_{i})_{i=1}^{N}$ , declares that ${\cal DO}(f,h)=1$ ; otherwise, the match is abandoned and results in a draw.

$\bullet$ Each match is decided according to the $n$ blocks generated by the partition of $(X_{i},Y_{i})_{i=N+1}^{2N}$ , with the $j$ -th block played on the coordinate block $I_{j}$ . Put

The function $h$ defeats $f$ on the $j$ -th block if $B_{f,h}(j)<0$ , and $f$ defeats $h$ if $B_{f,h}(j)>0$ .

$\bullet$ A winner of more than $n/2$ blocks is the winner of the match. If neither function wins more than half of the blocks, the match is drawn.

A function $f\in{\mathcal{F}}$ qualifies from the preliminary round if it has not lost a single match; that is, it has won or drawn all its matches. The set of “champions” $H$ consists of all functions qualified from the preliminary round.

The key fact regarding the outcome of the preliminary round is as follows:

Under the assumptions of Theorem 2.10 or of Theorem 2.11, and using their notation, with probability at least

with respect to $(X_{i},Y_{i})_{i=1}^{2N}$ , for all $h\in{\mathcal{F}}$ , if ${\cal DO}(f^{*},h)=1$ then $f^{*}$ defeats $h$ . In particular, $f^{*}\in H$ and for any $h\in H$ , ${\cal DO}(f^{*},h)=0$ , and therefore $\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ .

The proof of Proposition 3.5 is presented in Section 5.1. Note that Propositions 3.2 and 3.5 imply Theorem 2.11. In order to prove the general result of Theorem 2.10, another round of matches is necessary to choose a function from $H$ with small excess risk.

3 Champions league

The goal of the second round of the tournament is to choose, among the “champions” selected in the preliminary round, a function with a small excess risk. This round consists of different kind of matches, played between functions in $H$ . Since this round consists of matches between functions in $H$ , conditioned on the ‘good event’ from the preliminary round, every qualifier satisfies that $\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ .

The modified matches are decided using the third part of the sample $(X_{i},Y_{i})_{i=2N+1}^{3N}$ . The aim is to produce a function $\widehat{f}\in H$ that has a good excess risk, namely,

for some $r_{1}$ that is bounded by a constant multiple of $r$ .

Proof. Observe that for every $f\in{\mathcal{F}}$

The matches in the champions league consist of “home-and-away” legs:

Given a sample $(X_{i},Y_{i})_{i=2N+1}^{3N}$ , let $(I_{j})_{j=1}^{n}$ be the partition of $\{2N+1,3N\}$ to $n$ blocks, for the same value of $n$ as in the preliminary round. Let $\beta$ and $\alpha$ be as in Proposition 3.5 and set $r_{1}=2(\beta/\alpha)r$ . The function $f$ wins its home match against $h$ if

on more than $n/2$ of the blocks $I_{j}$ .

We select as $\widehat{f}$ any “champion” in $H$ that wins all of its home matches.

The main result regarding the champions league is as follows:

Let $H\subset{\mathcal{F}}$ as above. Under the assumptions of Theorem 2.10 and using its notation, with probability at least

with respect to $(X_{i},Y_{i})_{i=2N+1}^{3N}$ , one has:

$\bullet$ $f^{*}$ wins all of its home matches, and

The proof of Proposition 3.8 is presented in Section 5.2.

The combination of Propositions 3.2, 3.5, and 3.8 yields the proof of Theorem 2.10.

Examples

Before turning to the proofs of our main results, let us present some explicit examples of applications of Theorem 2.10.

The second example is similar to the first one and is motivated by questions in sparse recovery: linear regression in the set $T=\rho B_{1}^{n}=\{t:\|t\|_{1}\leq\rho\}$ . The difference between this example and the first one lies in the assumption on $X$ . In the second example $X$ is not assumed to be $L$ -sub-Gaussian, but rather satisfies a much weaker moment condition, the same condition that is needed to ensure that the basis pursuit algorithm has a unique solution with the optimal number of measurements (see ).

Both examples lead to explicit estimates on the accuracy and confidence of the median-of-means tournament. The estimates are better than the known bounds and hold with optimal confidence for accuracy larger than $cr^{*}$ (though in general, the true identity of the accuracy edge is an open question). Moreover, in the second example, of $\rho B_{1}^{n}$ , one may show that $r^{*}$ is proportional to the accuracy edge, and the optimal tradeoff holds all the way down to that value.

Thanks to the isotropicity assumption, if $T$ is a convex and centrally-symmetric set and $F=\{\left\langle\cdot,t\right\rangle:t\in T\}$ , then for every $h\in F$ ,

the mean-width of $T$ with respect to the Gaussian measure.

The following notion makes the task of obtaining such bounds more manageable, though still nontrivial.

The simplest examples of isotropic, $L$ -sub-Gaussian random vectors are vectors with independent, mean-zero, variance $1$ components that are sub-Gaussian. For instance, the standard Gaussian vector $(g_{i})_{i=1}^{n}$ , and $(\varepsilon_{i})_{i=1}^{n}$ , whose components are independent, symmetric random signs are both $L$ -sub-Gaussian for a constant that is independent of the dimension $n$ . Another family of examples consists of the random vectors whose density is uniform on sets of the form $\{t:\|t\|_{p}\leq cn^{1/p}\}$ for some $p\geq 2$ , normalized to have volume $1$ . Again, $L$ is an absolute constant, independent of $n$ and $p$ (see, e.g., ).

where $(\varepsilon_{i})_{i=1}^{N}$ are independent, symmetric signs that are independent of $(X_{i},\xi_{i})_{i=1}^{N}$ .

Note that if $\xi$ is heavy-tailed there is no hope of obtaining a high-probability version of Proposition 4.3. In fact, if all one knows is that $\xi$ belongs to $L_{q}$ , one cannot hope that

with a probability higher than $1-c_{1}/N^{(q/2)-1}$ .

Applying Proposition 4.3, it is evident that

Thus, assuming that $\|Y-\left\langle X,t\right\rangle\|_{L_{q}}\leq L\|Y-\left\langle X,t\right\rangle\|_{L_{2}}$ , as we do in Assumption 2.1, it follows that

For constants $c_{1}$ , $c_{2}$ and $\sigma$ , let

for constants $c_{1},c_{2},c_{3},c_{4}$ that depend only on $q$ and $L$ .

We obtain the following consequence of Theorem 2.10:

and the constants $c_{1},...,c_{5}$ depend only on $L$ and $q$ .

with probability at least $1-2\exp(-c_{1}N\min\{1,\sigma^{-2}s^{2}\})$ , exhibiting the optimal accuracy-confidence tradeoff.

Theorem 4.5 improves Theorem A from , where it was shown that ERM produces $\widehat{t}$ with the same accuracy and confidence as in the first part of Theorem 4.5, but only when $\|Y-\left\langle X,t\right\rangle\|_{L_{q}}\leq L\sqrt{q}\|Y-\left\langle X,t\right\rangle\|_{L_{2}}$ for every $q>2$ and every $t\in T$ . In other words, Theorem A from is based on the assumption that each $Y-\left\langle X,t\right\rangle$ is an $L$ -sub-Gaussian random variable, and holds only for the accuracy level $s^{*}$ . In contrast, Theorem 4.5 shows that the median-of-means tournament performs in an optimal way in heavy-tailed situations that are totally out of reach for ERM and for the entire range $s\geq cs^{*}$ .

Observe that the only range of accuracies in which Theorem 4.5 is (perhaps) suboptimal, is when

for well chosen values of $\kappa_{i},\eta_{i}$ , $i=1,2$ ; that is, for values that are larger than the known lower estimate on the accuracy edge for such problems. As noted in , there are many examples in which $\lambda^{*}$ and $s^{*}$ are equivalent (roughly speaking, this happens when Sudakov’s inequality is sharp). In such cases the median-of-means tournament is optimal in the entire range of accessible accuracies.

One important class of sets in which this equivalence is true is $\rho B_{1}^{n}=\{t:\|t\|_{1}\leq\rho\}$ (see for the proof). In light of Theorem 4.5, the median-of-means tournament performs in an optimal way in $\rho B_{1}^{n}$ . Moreover, it turns out that one may relax the sub-Gaussian assumption on $X$ and still obtain the optimal behaviour $\rho B_{1}^{n}$ , as we show next.

Being a penalized version of ERM, the analysis of LASSO is equivalent to the study of ERM in the sets $\rho B_{1}^{n}$ for an arbitrary choice of $\rho$ .

While we defer the question of a “LASSO-tournament” procedure to future work, it is clear that the first step in that direction is to explore the median-of-means tournament in $\rho B_{1}^{n}$ . Instead of the sub-Gaussian assumption used in Theorem 4.5, the assumption we use follows the path of :

Note that the coordinates of $X$ need not be independent and $X$ may be far from being an $L$ -sub-Gaussian random vector. Indeed, it is required that linear functionals $\left\langle\cdot,t\right\rangle$ satisfy a sub-Gaussian moment growth only up to the logarithm of the dimension, and it is possible that some do not have any higher moment beyond $p=\kappa\log n$ .

It turns out (see ) that if $X$ satisfies Assumption 4.1, then $N^{-1/2}\sum_{i=1}^{N}\left\langle X_{i},\cdot\right\rangle e_{i}$ , the random matrix whose rows are independent copies of $X$ , exhibits the best possible sparse recovery features. For example, one requires $N\sim s\log(en/s)$ random measurements $\left\langle X_{i},t\right\rangle$ to recover any $s$ -sparse vector $t$ using basis pursuit, and a similar type of estimate holds in the “noisy” setup for the LASSO. Moreover, one cannot relax the moment condition in Assumption 4.1 and still get the same recovery properties.

We show that as far as regression in $\rho B_{1}^{n}$ goes, the median-of-means tournament yields the optimal, “Gaussian” behaviour even when $X$ only satisfies Assumption 4.1 and $Y-\left\langle t^{*},X\right\rangle$ is heavy-tailed. To this end, we need sharp bounds on the parameters that are used to define $r^{*}$ in Theorem 2.10.

Let $q>2$ and let $X$ satisfy Assumption 4.1 for $\kappa=c_{1}(q)$ . Let $\xi\in L_{q}$ be a random variable (not necessarily independent of $X$ ) and let $(X_{i},\xi_{i})_{i=1}^{N}$ be independent copies of $(X,\xi)$ . Then

To put them in a more familiar form, set $\sqrt{s}=\rho/r$ and observe that

Recall that if $1\leq s\leq n$ and $V_{s}$ is the set of $s$ -sparse vectors in the Euclidean unit sphere (i.e., those with at most $s$ nonzero components), then ${\rm conv}(V_{s})\subset\sqrt{s}B_{1}^{n}\cap B_{2}^{n}\subset C\cdot{\rm conv}(V_{s})$ for a suitable absolute constant $C$ . Using what are by now standard estimates (see, e.g., ),

Using these observations, and with the same (tedious) computation as in , one obtains the following: let $c_{1}$ and $c_{2}$ be well-chosen absolute constants and set

for constants $c_{4},c_{5},c_{6}$ that depend only on $q$ and $L$ .

Proofs

Let us begin by considering the structure of the various indexing sets involved in the proofs, paying particular attention to the way their structure affects the regularity of the parameters we defined earlier.

For any class ${\mathcal{F}}$ and every $h\in{\mathcal{F}}$ , ${\mathcal{F}}-h\subset{\mathcal{F}}-{\mathcal{F}}$ , and in particular,

Thus, for all the complexity parameters defined above, one may avoid the need to take the supremum over all possible choices of centres by considering a slightly larger indexing set, namely, ${\rm star}({\mathcal{F}}-{\mathcal{F}},0)\cap rD$ . Moreover, if ${\mathcal{F}}$ is convex, then ${\mathcal{F}}-{\mathcal{F}}$ is both convex and centrally symmetric, and if ${\mathcal{F}}$ happens to be convex and centrally symmetric then ${\mathcal{F}}-{\mathcal{F}}=2{\mathcal{F}}$ and ${\rm star}({\mathcal{F}}-{\mathcal{F}},0)\cap rD=2{\mathcal{F}}\cap rD$ .

This argument shows that when $r>r_{E}(\kappa,h)$ ,

For the sake of transparency, we do not specify all the constants involved in the definitions of the fixed points right from the start. Instead, we collect conditions on these constants and use the fact that one may “combine” and “intersect” them. It turns out that all the constants depend on only two parameters: $q>2$ , for which the $L_{q}$ norm is equivalent to the $L_{2}$ norm, and the constant $L$ . In what follows, we denote by $c(q,L)$ a constant that depends only on $q$ and $L$ , for the values of $q$ and $L$ in Theorem 2.10 (where only $q=4$ is considered) or in Theorem 2.11, when $q$ can be arbitrarily close to $2$ –at the price of worse constants, of course.

With this in mind, let $\kappa_{1},\kappa_{2},\kappa_{3}$ and $\eta$ be constants that will be specified later and that depend only on $q$ and $L$ . Fix $f^{*}\in{\mathcal{F}}$ and consider $r^{*}$ for which, for every $r>r^{*}$ ,

The value of $r^{*}$ from (2.7) will be given by selecting right constants $\kappa_{1},\kappa_{2},\kappa_{3}$ and $\eta$ . We consider an accuracy $r\geq 2r^{*}$ , and for the rest of this section we fix its value.

Hence, $n$ depends on the wanted accuracy $r$ . Also, $n\leq\tau N$ and without loss of generality we may assume that both $n$ and $m=N/n$ are integers. Also, note that $m\geq 1/\tau$ .

In this section we prove Proposition 3.5.

Recall that by Proposition 3.2 and the resulting condition on $r$ , one has that, with probability at least $1-2\exp(-c_{0}N)$ with respect to $(X_{i})_{i=1}^{N}$ , if a match involving $f^{*}$ is allowed to take place (i.e., if ${\cal DO}(f^{*},f)=1$ ), then $\|f^{*}-f\|_{L_{2}}\geq r$ ; and if the match is abandoned (i.e., ${\cal DO}(f^{*},h)=0$ ), then $\|f^{*}-f\|_{L_{2}}\leq(\beta/\alpha)r$ . The constants $\alpha<1<\beta$ and $c_{0}$ depend only on $L$ and $q$ .

Therefore, to establish Proposition 3.5 it is enough to show that with probability $1-2\exp(-cn)$ , if $\|f-f^{*}\|_{L_{2}}\geq r$ , then $f^{*}$ wins its match against $f$ . In other words, that $B_{f,f^{*}}(j)>0$ for the majority of the blocks in any such match.

Let us begin by exploring the situation in a match between $f^{*}$ and $f$ , knowing that $\|f^{*}-f\|_{L_{2}}\geq r\geq 2r^{*}$ , as above.

Clearly, for every $f,h\in{\mathcal{F}}$ ,

This is summarized in the following lemma, established in the next two sections, and for the right choice of the constants $\kappa_{1},\kappa_{2},\kappa_{3}$ , $\eta,\tau$ , and $\theta$ that depend only on $L$ and $q$ .

There exists an absolute constant $c$ and a constant $C_{1}=C_{1}(L,q)$ for which the following holds. With probability at least $1-2\exp(-c\tau^{2}n)$ , for every $f\in{\mathcal{F}}$ with $\|f-f^{*}\|_{L_{2}}\geq r$ ,

for constants $\kappa_{0}$ and $\rho_{0}$ that depend only on $L$ and $q$ .

Recall that $(I_{j})_{j=1}^{n}$ are the $n$ blocks, each one of cardinality $m=N/n$ , that ${\mathcal{F}}_{1}=\{f\in{\rm star}({\mathcal{F}},f^{*}):\|f-f^{*}\|_{L_{2}}=r\}$ , and that $m\geq 1/\tau$ . For $t>0$ and a function $u$ , set

Thanks to (5.6), it is evident that if $f\in{\mathcal{F}}_{1}$ and $1\leq j\leq n$ then, with probability at least $1-2\exp(-c_{0}\rho_{0}m)$ ,

Recalling that $m\geq 1/\tau$ , if $\tau\leq c_{1}(\rho_{0})$ then

In other words, at least $(1-\tau/6)n$ of the blocks $I_{j}$ have at least $m\rho_{0}/2$ coordinates of size at least $\kappa_{0}r$ . Moreover, in light of the high probability estimate with which (5.7) holds, by the union bound, the same property is satisfied uniformly by any subset of ${\mathcal{F}}_{1}$ of cardinality at most $\exp(c_{2}\tau^{2}n/2)$ . The set we consider is a maximal $\eta r$ -separated subset of ${\mathcal{F}}_{1}$ with respect to the $L_{2}(\mu)$ norm and for a well-chosen $\eta$ specified later.

Let $H_{1}$ be such a maximal separated set. Since

it follows from the translation invariance of packing numbers that

Thus, to obtain a uniform control over points in $H_{1}$ , it suffices to show that

Recalling (5.4) and the choice of $n$ , it suffices to verify that

Observe that since $H_{1}$ is maximal, it is also an $\eta r$ -net in ${\mathcal{F}}_{1}$ , that is, every $f\in{\mathcal{F}}_{1}$ has some $\pi f\in H_{1}$ for which $\|f-\pi f\|_{L_{2}}\leq\eta r$ . Consider the following event:

$(A)$ (5.7) holds for every $f\in H_{1}$ , and

On this event, for every $f\in{\mathcal{F}}_{1}$ there are at least $(1-\tau/4)n$ blocks $I_{j}$ with the following properties:

$\bullet$ $|\pi f(X_{i})-f^{*}(X_{i})|\geq\kappa_{0}r$ on at least $m\rho_{0}/2$ coordinates in $I_{j}$ , and

$\bullet$ $|f(X_{i})-\pi f(X_{i})|\geq\kappa_{0}r/2$ on at most $m\rho_{0}/4$ coordinates in $I_{j}$ .

Hence, in each one of the $(1-\tau/4)n$ well-behaved blocks there are at least $m\rho_{0}/4$ coordinates $X_{i}$ that satisfy

In particular, since $\|f-f^{*}\|_{L_{2}}=r$ , one has

Moreover, (5.8) is positive homogeneous in $f-f^{*}$ and ${\mathcal{F}}_{1}$ is star-shaped around $f^{*}$ , implying that (5.8) holds for every $f\in{\mathcal{F}}_{1}$ as long as $\|f-f^{*}\|_{L_{2}}\geq r$ .

All that remains is to establish that $(B)$ holds with a sufficiently high probability. To that end, let

First note that, by the bounded differences inequality (see, e.g., ), with probability at least $1-\exp(-c_{3}u^{2})$ ,

Thus, the final step in the proof is to show that

In the last step we used a standard symmetrization argument, see, for example, . To conclude the proof, observe that $({8}/{\rho_{0}\kappa_{0}r})\cdot\eta r\leq\tau/48$ when

and for a suitable absolute constant $c_{5}$ . The fact that

follows because $r\geq 2r^{*}$ and invoking (5.2), as long as

The multiplier component

Next we complete the proof of Lemma 5.1 by showing that, with high probability, if $\|f-f^{*}\|_{L_{2}}\geq r$ , then

on at most $\tau n$ blocks. In the first step towards this goal, we consider a single function:

In particular, the same assertion holds uniformly for any fixed subset $H_{2}\subset{\mathcal{F}}_{1}$ of cardinality at most $\exp(C_{2}\tau^{2}n/2)$ .

Before proving Lemma 5.2, consider the assumption that

Another observation is that (5.11) passes to ${\rm star}({\mathcal{F}},f^{*})$ , simply because any function in that set is of the form $h=\lambda f+(1-\lambda)f^{*}$ for some $f\in{\mathcal{F}}$ and $0\leq\lambda\leq 1$ and therefore, $(f^{*}(X)-Y)(h(X)-f^{*}(X))=\lambda(f^{*}(X)-Y)(f(X)-f^{*}(X))$ .

for $(U_{i})_{i=1}^{M}$ that are independent copies $U$ . It is straightforward to verify that, with $(\varepsilon_{i})_{i=1}^{N}$ defined as independent symmetric random signs,

By the norm equivalence assumption of Theorem 2.10,

and by the independence assumption of Theorem 2.11,

Hence, setting $t=(C_{1}/2)r^{2}$ and noting that

Therefore, with probability at least $1-\tau/16$ ,

Just as before, one may select any fixed subset $H_{2}\subset{\mathcal{F}}_{1}$ of cardinality $\exp(C_{2}\tau^{2}n/2)$ , and the assertion of Lemma 5.2 holds with high probability and uniformly for every $h\in H_{2}$ . The choice of $H_{2}$ requires some care. It cannot be just an arbitrary maximal separated set.

There exists a subset $H_{2}\subset{\mathcal{F}}_{1}$ of cardinality at most $\exp(C_{2}n\tau^{2}/2)$ such that for every $f\in{\mathcal{F}}_{1}$ there is some $h\in H_{2}$ that satisfies

Proof. Let $H^{\prime}$ be an maximal $\eta r$ -separated subset of ${\mathcal{F}}_{1}$ . Recall that by the choice of $r$ and (5.4), $\log|H^{\prime}|\leq\kappa_{3}^{2}N\min\{1,\sigma^{-2}r^{2}\}$ , which is smaller than $(C_{2}/2)\tau^{2}n$ when

The class ${\mathcal{F}}$ is a locally compact subset of $L_{2}(\mu)$ and therefore, so is ${\mathcal{F}}_{1}$ . Thus, for every $h^{\prime}\in H^{\prime}$ , the intersections of ${\mathcal{F}}_{1}$ with the $L_{2}$ balls $B(h^{\prime},\eta r)$ are compact and the continuous linear functional on $L_{2}$

Hence, to complete the proof it suffices to show that for every $f\in{\mathcal{F}}_{1}$ , there are at most $(7/8)n\tau$ blocks $I_{j}$ with

There exists an absolute constant $c$ for which, with probability at least $1-2\exp(-c\tau^{2}n)$ ,

Consider the supremum of Binomial random variables

The next step is to centre the process. We show that for every $f\in{\mathcal{F}}_{1}$ , the centring term satisfies

Indeed, symmetrizing, applying either the assumption of norm equivalence of Theorem 2.10 or the independence assumption of Theorem 2.11, and recalling that $\|f-\pi f\|_{L_{2}}\leq 2\eta r$ , it is evident that

again using the fact that $\sqrt{n/N}\leq\sqrt{\theta}(r/\sigma)$ . Clearly,

for a suitable absolute constant $C_{3}$ .

Finally, we need to bound the centred empirical process. This is done by standard techniques of symmetrization, the contraction theorem for Bernoulli processes (for the Lipschitz function $\phi(t)=|t|$ ) and then de-symmetrization (see, e.g., ):

2 Champions league–proof

Finally, it remains to prove Proposition 3.8. Recall that, with probability at least $1-2\exp\bigl{(}-c_{0}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)}$ with respect to the sample $(X_{i},Y_{i})_{i=1}^{2N}$ , we have been able to identify a set of “qualifiers” that have not lost a single match in the preliminary round subset; that is, $H\subset{\mathcal{F}}$ , consisting of $f^{*}$ and possibly other functions that satisfy $\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ . In the rest of this section we work conditionally on this “good” event.

While producing a function that is close to $f^{*}$ solves the estimation problem, the question of prediction requires an additional step: we would like to choose one of the qualifiers that has an almost optimal statistical performance: a function $\widehat{f}$ that satisfies

for an appropriate $C$ . We show that this is possible with the required high probability for $C=16(\beta/\alpha)^{2}$ . To this end, set $r_{1}=(\beta/\alpha)r$ .

Recall that for $f,h\in{\rm star}({\mathcal{F}},f^{*})$ , $\Psi_{h,f}=(h(X)-f(X))(f(X)-Y)$ and note that

The “champions league” round is designed to have “home-and-away” legs. For the partition $(I_{j})_{j=1}^{n}$ of $(X_{i},Y_{i})_{i=2N+1}^{3N}$ , $f$ wins its home match against $h$ if

The main ingredient in the proof is the following lemma.

Let ${\mathcal{F}}_{2}={\rm star}({\mathcal{F}},f^{*})\cap(f^{*}+r_{1}D)$ . Under the conditions of Theorem 2.10, there is an absolute constant $c$ , such that, with probability at least $1-2\exp(-cn)$ ,

Proof. Using the assumption of norm equivalence of Theorem 2.10, one may verify that $\|\Psi_{f,f^{*}}\|_{L_{2}}\leq L^{2}r_{1}\sigma$ . Therefore,

Applying the same argument used in the previous section–namely, symmetrization, followed by contraction for a Bernoulli process and the Lipschitz function $\phi(t)=|t|$ and de-symmetrization–one has that for absolute constants $c_{1},c_{2}$ , and $c_{3}$ ,

provided that $n\leq c_{4}L^{4}\cdot(r_{1}^{2}/\sigma^{2})N$ for an absolute constant $c_{4}$ . Since $r_{1}=2(\beta/\alpha)r$ , it suffices that

Recall that for every $f\in H$ , $\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r$ and consider the “good” event from Lemma 5.5. For any $f\in{\mathcal{F}}_{2}$ , and, in particular, for any qualifier $f\in H$ ,

Finally, since $\Psi_{f^{*},f}=-(f^{*}(X)-f(X))^{2}-\Psi_{f,f^{*}}$ , on that event and the same coordinate blocks,

Thus, $f$ is defeated by $f^{*}$ in the majority of the blocks, and in particular, loses its home match against $f^{*}$ . It follows that, with probability at least $1-2\exp(-cn)$ , any function $\widehat{f}$ selected in the champions league round must satisfy that $\Psi_{f^{*},\widehat{f}}\geq-2r_{1}^{2}$ , and by Lemma 3.6,

Additional remarks

This fact has recently been established in in a very special situation: when ${\mathcal{F}}$ is a convex, $L$ -sub-Gaussian class of functions; all the admissible targets are of the form $Y=f_{0}(X)+W$ for $f_{0}\in{\mathcal{F}}$ and $W$ that is sub-Gaussian, zero-mean, independent of $X$ . The procedure used is a modification of ERM: one replaces ${\mathcal{F}}$ with an appropriate net, thus ‘erasing’ all the fine structure of ${\mathcal{F}}$ at the right level, and then runs ERM on the net. The idea behind this procedure is straightforward: if one is interested in accuracy $r$ , one is insensitive to perturbations of that order. From that perspective, a net with a mesh that is proportional to $r$ is just as good as the entire class. It turns out that $r^{*}$ of the net is proportional to $\lambda^{*}$ the original class ${\mathcal{F}}$ . Unfortunately, all the highly restrictive assumptions are essential to the proof and cannot be relaxed at all.

It should be noted that the median-of-means tournament may be modified in exactly the same way as ERM is modified in , leading to an accuracy that is proportional to $\lambda^{*}$ when ${\mathcal{F}}$ is a convex, $L$ -sub-Gaussian class and for an independent noise $W$ that may be heavy-tailed. We decided not pursue this point further because it is a very special case, and shifts the emphasis of the article from the question of the optimal accuracy/confidence tradeoff to the nature of the accuracy edge, a problem of a different nature.

As far as the latter is concerned, it is not clear whether the gap between $\lambda^{*}$ and $r^{*}$ can be closed in other cases. It is possible that both independent noise and a sub-Gaussian class are essential in attaining an accuracy proportional to $\lambda^{*}$ , and under weaker assumptions, the true accuracy edge lies somewhere between $\lambda^{*}$ and $r^{*}$ . We leave that question for future study.

Finally, it should be noted that in this work we completely ignore the algorithmic aspects of the new procedure. While computing the empirical risk minimizer often leads to thoroughly studied and well understood convex optimization problems, finding the winner of the median-of-means tournament in a computationally efficient manner is a highly nontrivial–and perhaps not hopeless–problem that goes beyong the scope of this paper. Techniques of “derivative-free optimization” with “function comparison oracle” may be useful, see, for example, .

References

Appendix A The distance oracle – outline of the proof

Here, we sketch the proof of Proposition 3.2. The proof is almost identical to that of Theorem 3.3 from , and follows the same path as the study of the quadratic component in the proof of Theorem 2.10.

Moreover, combining (A.1) with a straightforward application of Chebyshev’s inequality, one has that with probability at least $0.8$ ,

for constants $\kappa_{0}$ and $\kappa_{1}$ that depend only on $q$ and $L$ .

Let us apply this observation to our setup: fix $f^{*}\in{\mathcal{F}}$ and consider the set

for any fixed $r>d^{*}$ . By the choice of $d^{*}$ , ${\rm star}({\mathcal{F}}-f^{*},0)\cap rS$ contains an $\eta r$ -net ${\cal V}_{r}$ of cardinality at most $\exp(c_{1}k/2)$ . Therefore, with probability at least $1-2\exp(-c_{1}k/2)$ , for every $v\in{\cal V}_{r}$ there is $J_{v}\subset\{1,...,k\}$ of cardinality at least $0.7k$ , such that for every $j\in J_{v}$ ,

Next, for each $v\in{\rm star}({\mathcal{F}}-f^{*},0)\cap rS$ let $\pi v\in{\cal V}_{r}$ satisfy $\|v-\pi v\|_{L_{2}}\leq\eta r$ . Suppose that one can show that on a high probability event, for every such $v$ there are at most $k/10$ blocks $I_{j}^{\prime}$ on which

Then, on that event, and on at least $0.6k$ blocks,

This is precisely the isomorphic estimate we require, so far, for elements of ${\rm star}({\mathcal{F}}-f^{*},0)$ whose $L_{2}(\mu)$ norm is $r$ . Obtaining the isomorphic estimate for elements with a larger $L_{2}(\mu)$ norm follows because ${\rm star}({\mathcal{F}}-f^{*},0)$ is star-shaped around and the required isomorphic estimate is positive homogeneous.

Therefore, to complete the proof of the first part of Proposition 3.2 it suffices to show that

where the supremum is taken in ${\rm star}({\mathcal{F}}-f^{*},0)\cap rS$ . The proof of (A.3) is based on an identical argument to the one we used earlier, in the study of the quadratic component, and the fact that $r>d^{*}$ .

For each $v\in{\cal V}_{r}^{\prime}$ , with probability at least $0.8$ ,

and thus, with probability at least $1-2\exp(-c_{1}k)$ , at least $0.7k$ of the random variables ${\cal M}_{v,j}$ are smaller than $\kappa_{1}r$ . Next, we consider an $\eta r$ -net ${\cal V}^{\prime}_{r}\subset{\mathcal{F}}_{f^{*},r}$ , which, by the choice of $r$ is of cardinality at most $\exp(c_{1}k/2)$ . It follows that, with probability at least $1-2\exp(-c_{1}k/2)$ , (A.4) holds for every $v\in{\cal V}_{r}^{\prime}$ . The oscillation term is then controlled as we outlined above.

Finally, because $k=c(q,L)N$ , the estimates hold with the claimed probability of $1-2\exp(-cN)$ for a constant $c$ that depends only on $q$ and $L$ .