Risk minimization by median-of-means tournaments

Gabor Lugosi, Shahar Mendelson

Introduction

We assume in what follows that the minimum is attained and fFf^{*}\in{\mathcal{F}} exists and is unique, as is the case when FL2(μ){\mathcal{F}}\subset L_{2}(\mu) is a closed, convex set.

One may formulate two natural goals in estimation and prediction problems. One of them is to find a function fFf\in{\mathcal{F}} whose L2(μ)L_{2}(\mu) distance to ff^{*}

is as small as possible. The other is to ensure that the excess risk of the selected function

The crucial difference between this type of problems and standard questions in approximation theory is that the available information is limited to a random sample. One observes DN=((X1,Y1),,(XN,YN))\mathcal{D}_{N}=((X_{1},Y_{1}),\ldots,(X_{N},Y_{N})), that is, NN independent pairs, where each (Xi,Yi)(X_{i},Y_{i}) has the same distribution as (X,Y)(X,Y) and DN\mathcal{D}_{N} is independent of (X,Y)(X,Y). The fact that the distribution of the pair (X,Y)(X,Y) is not known makes it impossible to invoke approximation-theoretical methods and identify directly the true minimizer of the risk.

where throughout the article, for q1q\geq 1, we use the notation

The excess risk, also known as the prediction error, compares the ‘predictive capabilities’ of f^N\widehat{f}_{N} to that of the best in the class, and is defined by the conditional expectation

Note that, up to this point, YY was an arbitrary square-integrable real-valued random variable, and obviously one would like to be able to treat as wide a variety of targets as possible. Clearly, the accuracy and confidence one may establish may depend on some features of the target–for example, some a-priori estimate on its LqL_{q} norm–, or on its “distance” to F{\mathcal{F}}, etc. We consider a broad set of admissible targets Y{\cal Y}, and the accuracy and confidence of Φ\Phi relates to its performance for any admissible target YYY\in{\cal Y}. Thus, a learning problem is the triplet (F,Y,X)({\mathcal{F}},Y,X), when XX and YY are not known, though the learner does know that YYY\in{\cal Y}.

It is clear that there is a tradeoff between the accuracy and confidence in a given learning problem: the smaller the error is, the harder it is to attain it. The question of this accuracy/confidence tradeoff is of utmost importance in statistical learning theory, and has been investigated extensively in numerous manuscripts since the early days of the area in the late 1960’s (see, for example, the books for a sample of the work devoted to this question). To find the right accuracy/confidence tradeoff one must first identify a lower bound on the tradeoff in terms of the sample size, the structure of F{\mathcal{F}} and possibly some additional information on XX and YY, and then come up with a learning procedure that attains the tradeoff.

Roughly put, one should explore the tradeoff for the set of “achievable accuracies” of each learning problem. An accuracy E{\cal E} is achievable if there is a learning procedure in F{\mathcal{F}} that achieves the accuracy E{\cal E} for the problem (F,Y,X)({\mathcal{F}},Y,X) with constant probability–say at least 3/43/4–, and because YY is not known, this has to hold for any YYY\in{\cal Y}. We define the accuracy edge as the smallest achievable accuracy of a problem.

The primary question is to find the correct accuracy/confidence tradeoff for any (reasonable) learning problem and identify a learning procedure that attains that tradeoff all the way down to the accuracy edge. It should be noted that up to this point in time, and other than in a few isolated examples, no learning procedure even came close to the optimal tradeoff at any nontrivial accuracy level.

In this article we solve this problem by presenting an optimal learning procedure: it yields the best possible accuracy/confidence tradeoff (almost) up to the achievable edge, and under minimal assumptions on the learning problem.

The minor reservation “almost” is due to fact that more often than not, the identity of the accuracy edge of a learning problem is not known. As it is explained in what follows, while one may provide lower estimates on the accuracy edge, there is a very real possibility that such estimates are too optimistic and the real accuracy edge is larger. Regardless, the procedure we introduce “gets as close” to the accuracy edge as any other known procedure–or better, and with a dramatically better confidence. It also exhibits the optimal accuracy/confidence tradeoff for larger errors, something that no other procedure is known to do.

It should be clarified at this point that by “optimal accuracy” we mean an accuracy that is optimal up to a constant factor and optimality in the confidence means that the probability with which the claimed accuracy does not hold is optimal up to a constant factor in the exponent.

Before we present a more accurate formulation of our main results and describe the optimal procedure, let us explain what our procedure is not: it is not empirical risk minimization (ERM), nor any of its “family members”.

Perhaps the most natural way of choosing f^N\widehat{f}_{N} is by empirical risk minimization, that is, by least squares regression,

Again, we assume that the minimum is attained, while if there are several minimizers, f^N\widehat{f}_{N} may be chosen among them in an arbitrary way.

The performance of least squares regression has been thoroughly studied in many different scenarios. A sample of the rich literature includes Györfi, Kohler, Krzyzak, Walk , van de Geer , Bartlett, Bousquet, and Mendelson , Koltchinskii , Massart .

It turns out that the performance of ERM changes dramatically according to the tail behaviour of the functions involved in the given learning problem. One may show (see, e.g., ) that if F{\mathcal{F}} is convex and the functions in F{\mathcal{F}} (more precisely, the random variables {f(X):fF}\{f(X):f\in{\mathcal{F}}\}) and the target YY have well-behaved tails (and by “well-behaved” we mean sub-Gaussian), ERM preformed in F{\mathcal{F}} yields good results: for an accuracy that is not far from the accuracy edge, it attains the optimal confidence, though it does not maintain the optimal accuracy/confidence tradeoff for larger errors. Unfortunately, the situation deteriorates considerably when either members of F{\mathcal{F}} or one of the admissible targets is heavy-tailed in some sense. In such cases, the performance of ERM is significantly weaker than the known theoretical limitations of the accuracy/confidence tradeoff. Moreover, replacing ERM with a different procedure is of little use: other than in few and rather special learning problems, there have been no known alternatives to ERM whose performance comes close to the known theoretical limitations of the accuracy/confidence tradeoff, and certainly not when the problem is heavy-tailed.

The reason for ERM’s diminished capacity is that it is sensitive to even a small number of atypical points in the sample (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}: since ERM selects a minimizer of the empirical mean of the squared loss, atypical values may distort the selection and send ERM to the wrong part of F{\mathcal{F}}. This sensitivity is clearly reflected in the confidence with which ERM operates in heavy-tailed situations: roughly put, one can guarantee that ERM performs with the right accuracy only on samples that are not contaminated by a significant number of atypical values. However, in heavy-tailed situations, the latter does not occur frequently, and having atypical values is simply a fact of life one has to deal with.

In contrast, the procedure we suggest as an alternative to ERM leads to the optimal accuracy/confidence tradeoff even in heavy-tailed situations. Unlike ERM, it is not sensitive even to a large number of atypical sample points.

One may show that a nontrivial estimate is possible only when NcnN\geq cn for a suitable absolute constant cc, and we consider only such values of NN. Also, there are known estimates on the theoretical limitations of this problem: a lower bound on the accuracy edge is of the order of σn/N\sigma\sqrt{n/N}, and for an accuracy level that is proportional to the accuracy edge, say, c0σn/Nc_{0}\sigma\sqrt{n/N} for a suitable absolute constant c0c_{0}, the conjectured confidence is 12exp(c1n)1-2\exp(-c_{1}n).

Moreover, because of the minimal assumptions on WW and XX, the best estimate one can hope for on ()(*) and that holds with probability 1δ1-\delta follows from Chebyshev’s inequality. In particular, it leads to the following, rather unsatisfactory, estimate on the performance of ERM:

Also, if one wishes for an error that is proportional to the (conjectured) accuracy edge, that is, of the order of σn/N\sigma\sqrt{n/N}, the best that one can hope for is a constant confidence–a very different estimate from the conjectured confidence of 12exp(c1n)1-2\exp(-c_{1}n).

Although what we have described is an upper bound, one may show (see ) that this estimate captures the performance of empirical risk minimization, and in particular exhibits ERM’s inability to deal with atypical sample points. The reality is that ERM performs with an accuracy of the order of σn/N\sigma\sqrt{n/N} only on the relatively few samples that contain almost no misleading data.

The main result of this article, when applied to this example, shows that under the same assumptions, the procedure we suggest selects t^\widehat{t} for which

for some numerical constants c,C>0c,C>0; that is, it performs with optimal confidence at a level that is proportional to the accuracy edge. In fact, our procedure gives the optimal confidence for any accuracy rcσn/Nr\geq c^{\prime}\sigma\sqrt{n/N}.

In the next section we present the required definitions, outline the current state of the art, and formulate our main results. In Section 3, we describe the new procedure in detail. In Section 4 we illustrate the power of the main results on some canonical examples, before turning to the proofs of our results.

Let us point out the well-understood fact that the behaviour of the accuracy and confidence in learning problems in which F{\mathcal{F}} is not convex is trivial in some sense, and totally different from the convex case, which is why focus on the latter. As it happens, the dominating factor in non-convex problems is the ‘location’ of the targets YY relative to F{\mathcal{F}} rather than the structure of F{\mathcal{F}}, and an ‘unfavourable location’ of a target YY completely distorts the accuracy and confidence of the learning problem. However, even if F{\mathcal{F}} is not convex, all the targets of the form Y=f0(X)+WY=f_{0}(X)+W for f0Ff_{0}\in{\mathcal{F}} and WW that is symmetric and independent of XX happen to be in a ‘favourable’ location and thus our results apply to such problems as well.

The accuracy edge and the accuracy/confidence tradeoff

We begin by describing the known theoretical limitations on the accuracy edge and on the accuracy/confidence tradeoff for a given learning problem. To this end, let us introduce some notation, following the path of .

Let D={f:fL21}D=\{f:\|f\|_{L_{2}}\leq 1\} be the unit ball in L2(μ)L_{2}(\mu) and set S={f:fL2=1}S=\{f:\|f\|_{L_{2}}=1\} to be the unit sphere. For hL2(μ)h\in L_{2}(\mu) and r>0r>0, put rDh={f:fhL2r}rD_{h}=\{f:\|f-h\|_{L_{2}}\leq r\}. Let

Thus, star(F,h){\rm star}({\mathcal{F}},h) is the star-shaped hull of F{\mathcal{F}} around hh, that is, the union of all segments for which one end-point is hh and the other is in F{\mathcal{F}}.

The star-shaped hull star(F,h){\rm star}({\mathcal{F}},h) adds regularity to the class around the fixed centre hh: on the one hand, it does not increase the size of the class by much, while on the other hand, it implies that every function of the form fhf-h has a ‘scaled-down’ version when one moves towards . In particular, the level sets star(Fh,0)rS{\rm star}({\cal F}-h,0)\cap rS become ‘richer’ as rr gets smaller: each one of them contains scaled-down copies of all ‘higher’ levels.

Consider the localization of F{\mathcal{F}}

which is given by the shift that maps the designated point hh to . Then the resulting class is made ‘more regular’ by taking its star-shaped hull around , and finally it is localized, by considering its intersection with rDrD, the L2(μ)L_{2}(\mu) ball of radius rr, centred in .

Observe that if F{\mathcal{F}} is convex then for any hFh\in{\mathcal{F}}, star(F,h)=F{\rm star}({\mathcal{F}},h)={\mathcal{F}}. Also, in that case,

and if, in addition, F{\mathcal{F}} is centrally symmetric (that is, if fFf\in{\mathcal{F}} then fF-f\in{\mathcal{F}}), then Fh,r2FrD{\mathcal{F}}_{h,r}\subset 2{\mathcal{F}}\cap rD.

One way of deriving lower estimates on the accuracy edge and on the accuracy/confidence tradeoff is based on the packing numbers of the localizations Fh,r{\mathcal{F}}_{h,r}.

Given a set HL2(μ)H\subset L_{2}(\mu) and ε>0\varepsilon>0, denote by M(H,εD){\cal M}(H,\varepsilon D) the cardinality of a maximal ε\varepsilon-separated subset of HH. That is, M(H,εD){\cal M}(H,\varepsilon D) is the maximal cardinality of a subset {h1,...,hm}H\{h_{1},...,h_{m}\}\subset H, for which hihjL2ε\|h_{i}-h_{j}\|_{L_{2}}\geq\varepsilon for every iji\not=j.

Note that if HH^{\prime} is a maximal ε\varepsilon-separated subset of HH, then it is also an ε\varepsilon-cover of HH in the sense that for every hHh\in H there is some hHh^{\prime}\in H^{\prime} that satisfies hhL2ε\|h^{\prime}-h\|_{L_{2}}\leq\varepsilon.

For κ,η>0\kappa,\eta>0 and hFh\in{\mathcal{F}}, set

There exist absolute constants κ\kappa and η\eta for which the following holds. Let FL2(μ){\mathcal{F}}\subset L_{2}(\mu) be convex and centrally symmetric. For any learning procedure Φ\Phi there exists an f0Ff_{0}\in{\cal F} and target Y=f0(X)Y=f_{0}(X) for which, with probability at least 1/41/4,

For κ>0\kappa>0, 0<η<10<\eta<1 and hFh\in{\mathcal{F}}, set

(See, e.g., .) There exist absolute constants κ\kappa and η\eta for which the following holds. Let FL2(μ){\mathcal{F}}\subset L_{2}(\mu) and set WW to be a centred Gaussian variable with variance σ>0\sigma>0 that is independent of XX. Then, for any learning procedure Φ\Phi there exists f0Ff_{0}\in{\mathcal{F}} and a target Y=f0(X)+WY=f_{0}(X)+W for which, with probability at least 1/41/4,

Combining these two facts, we have a lower bound on the accuracy edge:

for some constants κi,ηi\kappa_{i},\eta_{i}, i=1,2i=1,2. However, there is no guarantee that this lower estimate is sharp. As we explain in what follows, it is very possible that the true accuracy edge is larger.

Let us turn to the theoretical limitations of the accuracy/confidence tradeoff, assuming that the set of admissible targets Y{\cal Y} is not too trivial. By that we mean that it at least contains all the targets of the form Y=f0(X)Y=f_{0}(X) (the noise-free problems) and Y=f0(X)+WY=f_{0}(X)+W, for f0Ff_{0}\in{\mathcal{F}} and WW that is a centred Gaussian variable with variance σ\sigma, and is independent of XX. We call this set of targets minimal, and of course, Y{\cal Y} could be much larger.

Applying the results in and one has the following:

There exists an absolute constant cc for which the following holds. Let FL2(μ){\mathcal{F}}\subset L_{2}(\mu) be a class that is star-shaped around one of its points (i.e., for some f0Ff_{0}\in{\mathcal{F}} and every fFf\in{\mathcal{F}}, [f0,f]F[f_{0},f]\subset{\mathcal{F}}). Consider Y{\cal Y} that contains the minimal set of targets. If Φ\Phi is a learning procedure that performs with accuracy rr and confidence 1δ1-\delta for every such target, then

These facts set our first benchmark (which may be too optimistic, of course): the lower bound on the accuracy edge

and the bound on the accuracy/confidence tradeoff for rc0λr\geq c_{0}\lambda^{*},

As we noted earlier, λ\lambda^{*} is an optimistic, and perhaps not very realistic, lower bound on the accuracy edge. A more reasonable conjecture relies on more “global” parameters that take into account the fine structure of F{\mathcal{F}} at an arbitrarily small level, defined next.

From here on, let (εi)i=1N(\varepsilon_{i})_{i=1}^{N} be independent, symmetric {1,1}\{-1,1\}-valued random variables that are independent of (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}.

For κ>0\kappa>0 and hFh\in{\mathcal{F}} let

and set rE(κ)=suphFrE(κ,h)r_{E}(\kappa)=\sup_{h\in{\mathcal{F}}}r_{E}(\kappa,h).

The other “global” parameter we require does depend on YY. It is used to calibrate the interaction between F{\mathcal{F}} and the target.

Remark. The role of σ\sigma and of FY(σ){\mathcal{F}}_{Y}^{(\sigma)} in Definition 2.8 deserves some explanation. The mean oscillation in Definition 2.8 involves the “multipliers” (h(Xi)Yi)i=1N(h(X_{i})-Y_{i})_{i=1}^{N}, and the right choice for the centre hh is the unknown ff^{*}. While one may take into account the “worst” hFh\in{\mathcal{F}}, doing so makes little sense, as ff^{*} is the minimizer of the L2L_{2} distance between F{\mathcal{F}} and YY, and for the worst hh, hYL2\|h-Y\|_{L_{2}} could be significantly larger than fYL2\|f^{*}-Y\|_{L_{2}}. To overcome this obstacle we assume that an a-priori estimate on the L2L_{2} distance between YY and F{\mathcal{F}} (i.e., a value σ\sigma such that fYL2σ\|f^{*}-Y\|_{L_{2}}\leq\sigma) is available. With this information one still needs to consider the worst centre hh, but only among all functions hFh\in{\mathcal{F}} that satisfy hYL2σ\|h-Y\|_{L_{2}}\leq\sigma. As we explain in Section 4, thanks to known estimates for the expectation of the supremum of a multiplier process, one only needs to keep in mind that the multipliers ξi=h(Xi)Yi\xi_{i}=h(X_{i})-Y_{i} are independent copies of some random variable ξ\xi that satisfies some moment condition, such as ξLqLσ\|\xi\|_{L_{q}}\leq L\sigma for some q>2q>2 and a suitable constant LL.

In light of the results from , a realistic alternative to λ\lambda^{*} is

for some constants c1,c2c_{1},c_{2}, and when for the given (and unknown) target YYY\in{\cal Y} one has Yf(X)L2σ\|Y-f^{*}(X)\|_{L_{2}}\leq\sigma. Indeed, in it was shown that under some mild conditions on the learning problem, specified below, ERM performs in F{\mathcal{F}} with accuracy crcr^{*} and constant confidence. Thus, rr^{*} is a potential (and to-date, the best) candidate for the accuracy edge.

Therefore, up to the issue of the true identity of the accuracy edge, the (somewhat vaguely formulated) question of the accuracy/confidence tradeoff is as follows:

Is there a learning procedure which, for any reasonable learning problem and any rc1rr\geq c_{1}r^{*}, performs with accuracy rr and confidence 1-2\exp\bigl{(}-c_{2}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)}, thus achieving the optimal accuracy/confidence tradeoff?

Our main result answers Question 2.9 in the affirmative where our notion of a “reasonable learning problem” is formulated rigorously below.

The first theorem we present requires the following conditions:

Let LL be a constant and let XX be distributed according to the measure μ\mu on X{\cal X}. Given a locally compact, convex class of functions FL2(μ){\mathcal{F}}\subset L_{2}(\mu) and YL2Y\in L_{2}, assume that

\bullet for every f,hFf,h\in{\mathcal{F}}, fhL4LfhL2\|f-h\|_{L_{4}}\leq L\|f-h\|_{L_{2}};

\bullet for every fFf\in{\mathcal{F}}, fYL4LfYL2\|f-Y\|_{L_{4}}\leq L\|f-Y\|_{L_{2}};

\bullet fYL2σ\|f^{*}-Y\|_{L_{2}}\leq\sigma for some known constant σ>0\sigma>0.

Let L1L\geq 1, σ>0\sigma>0, and suppose Assumption 2.1. There exist constants c,c0,c1c,c_{0},c_{1} and c2c_{2} that depend only on LL for which the following holds. Let

There exists a procedure that, based on the data DN=(Xi,Yi)i=1N\mathcal{D}_{N}=(X_{i},Y_{i})_{i=1}^{N} and the values of LL, σ\sigma and rr, selects a function f^F\widehat{f}\in{\mathcal{F}} such that, with probability at least

Of course, the identity of ff^{*} is not known, and therefore, it is not reasonable to expect that r(f)r^{*}(f^{*}) is known beforehand. A “legal” data-independent choice is any r2rr\geq 2r^{*} that is larger than 2r(f)2r^{*}(f^{*}) regardless of the identity of ff^{*}. In particular, Theorem 2.10 gives the optimal accuracy/confidence tradeoff for any accuracy r2rr\geq 2r^{*}. Alternatively, one may consider the “right” choice of the parameter rr as a model selection problem that may be selected using cross validation if independent data are available. We do not discuss the rather straightforward details further here.

Remark. In our main assumption of Theorem 2.10 we use the equivalence between the L4L_{4} and L2L_{2} norms for functions of the form Yf(X)Y-f(X). This allows us to derive bounds in terms of the variance of Yf(X)Y-f^{*}(X). In fact, if instead of norm equivalence we just have a bound on σ4=Yf(X)L4\sigma_{4}=\|Y-f^{*}(X)\|_{L_{4}}, the arguments work equally well, with σ4\sigma_{4} replacing σ\sigma. The sets FYσ{\mathcal{F}}_{Y}^{\sigma} need to be adjusted as well: it should be replaced by all functions in F{\mathcal{F}} whose L4L_{4} distance to YY is at most σ4\sigma_{4}.

It turns out that, when dealing with independent noise, the assumptions required in Theorem 2.10 may be relaxed even further. In particular, we do not require convexity of the class F{\mathcal{F}} and the assumption of norm equivalence may be relaxed:

Let q>2q>2, L>1L>1, and σ>0\sigma>0. There exist constants c,c0,c1c,c_{0},c_{1} and c2c_{2} that depend only on qq and LL for which the following holds. Let F{\mathcal{F}} be a locally compact class of functions and assume that for every fspan(F)f\in{\rm span}({\mathcal{F}}), fLqLfL2\|f\|_{L_{q}}\leq L\|f\|_{L_{2}}. Assume further that Y=f0(X)+WY=f_{0}(X)+W for some f0Ff_{0}\in{\mathcal{F}} and WW that is mean-zero, independent of XX, and satisfies WL2σ\|W\|_{L_{2}}\leq\sigma.

Let r(f)r^{*}(f^{*}) be as above and fix r2r(f)r\geq 2r^{*}(f^{*}). There exists a procedure that, based on the data DN=(Xi,Yi)i=1N\mathcal{D}_{N}=(X_{i},Y_{i})_{i=1}^{N} and the values of L,qL,q, σ\sigma and rr, selects a function f^F\widehat{f}\in{\mathcal{F}} such that, with probability at least

Note that in the case of independent additive noise, the assumptions of Theorem 2.11 are almost the minimal needed for the learning problem to be well defined: norm equivalence for qq that may be arbitrarily close to 22 and WL2W\in L_{2} that perhaps does not have any higher moments. Even under these minimal assumptions, we still obtain the optimal accuracy/confidence tradeoff.

To put Theorem 2.10 in perspective, we describe the current sharpest estimates on the accuracy and confidence of a learning problem, focusing on possibly heavy-tailed distributions.

The best known estimate on prediction and estimation in a general convex class are based on a quite weak condition, rather than the norm equivalence we use. Recall that a class F{\mathcal{F}} satisfies a small-ball condition with constants κ0\kappa_{0} and ρ0\rho_{0} if for every f,hF{0}f,h\in{\mathcal{F}}\cup\{0\},

for constants c1,c2c_{1},c_{2} and c3c_{3} that depend only on κ0\kappa_{0} and ρ0\rho_{0}.

In other words, when NcnN\geq cn, the procedure exhibits the optimal accuracy/confidence tradeoff for any rc3σn/Nr\geq c_{3}\sigma\sqrt{n/N}.

The median-of-means tournament

The key to obtaining sharp estimates for both the accuracy and the confidence is identifying a procedure that is not sensitive to atypical values that may occur on a small part of the given sample. Thus, a natural starting point is the conceptually simple and attractive mean estimator, the so-called median-of-means estimator. It was proposed, independently, by Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy , and is defined as follows.

and define μ^N(δ)\widehat{\mu}_{N}^{(\delta)} as the median of W1,,WnW_{1},\ldots,W_{n}. (If the median is not uniquely defined, here, and in the rest of the paper, we choose the smallest one. Any other choice would work equally well.) It is straightforward to verify that for any N4N\geq 4,

For properties, applications, and extensions of the median-of-means estimator, we refer to Bubeck, Cesa-Bianchi, and Lugosi . Devroye, Lerasle, Lugosi, and Oliveira Hsu and Sabato , Lerasle and Oliveira , Minsker , Audibert and Catoni .

It is tempting to try to replace the empirical means (1/N)i=1N(f(Xi)Yi)2(1/N)\sum_{i=1}^{N}(f(X_{i})-Y_{i})^{2} by some median-of-means estimate of the risk for each fFf\in{\mathcal{F}}, and select a function in F{\mathcal{F}} that minimizes the estimate. However, due to the nonlinear nature of the median-of-means, it is difficult to control the process of the estimated losses. Instead, the alternative we propose is to estimate the difference of the risk for all pairs f,hf,h and organize a two-stage “tournament”.

We mention here that Brownlees, Joly, and Lugosi propose empirical minimization based on a different robust mean estimator, a carefully designed M-estimator proposed by Catoni . Under general loss functions they derive analogs of Dudley’s chaining bound for the excess risk. However, the derived bounds are far from giving the optimal rate of convergence under the squared loss.

Without loss of generality and for convenience in the notation, we use 3N3N instead of NN for the sample size and assume that D3N=(Xi,Yi)i=13N\mathcal{D}_{3N}=(X_{i},Y_{i})_{i=1}^{3N} is the given sample. The sample is split into three equal parts (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}, (Xi,Yi)i=N+12N(X_{i},Y_{i})_{i=N+1}^{2N} and (Xi,Yi)i=2N+13N(X_{i},Y_{i})_{i=2N+1}^{3N}. Given the wanted degree of accuracy rrr\geq r^{*}, the first part of the sample–in fact, just (Xi)i=1N(X_{i})_{i=1}^{N}–is used to estimate pairwise distances within F{\mathcal{F}}, in a sense that will be clarified below. The second part is used in the preliminary round of the tournament. We show that the outcome of the preliminary round is a set HFH\subset{\mathcal{F}} that contains ff^{*} and possibly some other functions whose L2(μ)L_{2}(\mu) distance to ff^{*} is at most crcr. The final part of the tournament is a ‘champions league’ round. Participants in that final round are the elements in HH (the ‘qualifiers’ of the preliminary round), and the goal of that round is to identify a function f^H\widehat{f}\in H whose predictive capabilities are almost optimal, in the sense that

Let us now describe the three stages of the median-of-means tournament is detail.

The functional Φ\Phi allows one to identify distances in F{\mathcal{F}} in a crude (isomorphic) way, as the next theorem shows:

\bullet If ΦCN(f,f)βr\Phi_{\mathcal{C}_{N}}(f,f^{*})\geq\beta r then β1ΦCN(f,f)ffL2α1ΦCN(f,f)\beta^{-1}\Phi_{\mathcal{C}_{N}}(f,f^{*})\leq\|f-f^{*}\|_{L_{2}}\leq\alpha^{-1}\Phi_{\mathcal{C}_{N}}(f,f^{*}).

\bullet If ΦCN(f,f)<βr\Phi_{\mathcal{C}_{N}}(f,f^{*})<\beta r then ffL2(β/α)r\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r.

Proposition 3.2 is an immediate modification of Theorem 3.3 from . For the sake of completeness we outline the main components of its proof in the appendix.

Next we introduce the “distance oracle”, denoted by DO{\cal DO}. Recall the definition of rr^{*} from (2.7) and note that for the right choice of constants, rdr^{*}\geq d^{*}. The distance oracle is adapted to the wanted degree of accuracy, that is, to any fixed r2rr\geq 2r^{*}.

Fix r2rr\geq 2r^{*}. Using the notation of Proposition 3.2, if ΦCN(f,h)βr\Phi_{\mathcal{C}_{N}}(f,h)\geq\beta r set DO(f,h)=1{\cal DO}(f,h)=1, otherwise set DO(f,h)=0{\cal DO}(f,h)=0.

The distance oracle determines if a match between ff and hh takes place: it does if DO(f,h)=1{\cal DO}(f,h)=1 and it is abandoned if DO(f,h)=0{\cal DO}(f,h)=0. Note that Proposition 3.2 only shows that DO{\cal DO} is a realistic indication of the distance between pairs when one of the functions is the designated function ff^{*}. This serves our purposes since the designated function we are interested in is the minimizer of the true risk in F{\mathcal{F}}, and the success of the procedure only requires having accurate information on matches that involve ff^{*}, even if we do not know which matches those are.

It follows from Proposition 3.2 that with probability at least 12exp(cN)1-2\exp(-cN) relative to (Xi)i=1N(X_{i})_{i=1}^{N}, if a match between ff^{*} and ff is allowed to proceed then ffL2r\|f-f^{*}\|_{L_{2}}\geq r, while if it is abandoned then ffL2(β/α)r\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r.

2 The preliminary round

The goal of the preliminary round is to produce a subset HFH\subset{\mathcal{F}} that, with overwhelming probability over the samples (Xi,Yi)i=N+12N(X_{i},Y_{i})_{i=N+1}^{2N}, contains ff^{*} and hfL2(β/α)r\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r for any hHh\in H.

The round consists of ‘matches’ between every pair f,hFf,h\in{\mathcal{F}}, and a match can have three possible outcomes: a win by either side, or a draw (the latter includes abandoned matches because of the ruling of the distance oracle).

Each match is ‘played’ using the second part of the sample, (Xi,Yi)i=N+12N(X_{i},Y_{i})_{i=N+1}^{2N}. The sub-sample is partitioned to nn blocks (Ij)j=1n(I_{j})_{j=1}^{n} of cardinality m=N/nm=N/n each, for a choice of nn specified later. Let us note that nn depends on the desired degree of accuracy rr.

\bullet A match between ff and hh takes place if the distance oracle, using the first part of the sample (Xi)i=1N(X_{i})_{i=1}^{N}, declares that DO(f,h)=1{\cal DO}(f,h)=1; otherwise, the match is abandoned and results in a draw.

\bullet Each match is decided according to the nn blocks generated by the partition of (Xi,Yi)i=N+12N(X_{i},Y_{i})_{i=N+1}^{2N}, with the jj-th block played on the coordinate block IjI_{j}. Put

The function hh defeats ff on the jj-th block if Bf,h(j)<0B_{f,h}(j)<0, and ff defeats hh if Bf,h(j)>0B_{f,h}(j)>0.

\bullet A winner of more than n/2n/2 blocks is the winner of the match. If neither function wins more than half of the blocks, the match is drawn.

A function fFf\in{\mathcal{F}} qualifies from the preliminary round if it has not lost a single match; that is, it has won or drawn all its matches. The set of “champions” HH consists of all functions qualified from the preliminary round.

The key fact regarding the outcome of the preliminary round is as follows:

Under the assumptions of Theorem 2.10 or of Theorem 2.11, and using their notation, with probability at least

with respect to (Xi,Yi)i=12N(X_{i},Y_{i})_{i=1}^{2N}, for all hFh\in{\mathcal{F}}, if DO(f,h)=1{\cal DO}(f^{*},h)=1 then ff^{*} defeats hh. In particular, fHf^{*}\in H and for any hHh\in H, DO(f,h)=0{\cal DO}(f^{*},h)=0, and therefore hfL2(β/α)r\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r.

The proof of Proposition 3.5 is presented in Section 5.1. Note that Propositions 3.2 and 3.5 imply Theorem 2.11. In order to prove the general result of Theorem 2.10, another round of matches is necessary to choose a function from HH with small excess risk.

3 Champions league

The goal of the second round of the tournament is to choose, among the “champions” selected in the preliminary round, a function with a small excess risk. This round consists of different kind of matches, played between functions in HH. Since this round consists of matches between functions in HH, conditioned on the ‘good event’ from the preliminary round, every qualifier satisfies that hfL2(β/α)r\|h-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r.

The modified matches are decided using the third part of the sample (Xi,Yi)i=2N+13N(X_{i},Y_{i})_{i=2N+1}^{3N}. The aim is to produce a function f^H\widehat{f}\in H that has a good excess risk, namely,

for some r1r_{1} that is bounded by a constant multiple of rr.

Proof. Observe that for every fFf\in{\mathcal{F}}

The matches in the champions league consist of “home-and-away” legs:

Given a sample (Xi,Yi)i=2N+13N(X_{i},Y_{i})_{i=2N+1}^{3N}, let (Ij)j=1n(I_{j})_{j=1}^{n} be the partition of {2N+1,3N}\{2N+1,3N\} to nn blocks, for the same value of nn as in the preliminary round. Let β\beta and α\alpha be as in Proposition 3.5 and set r1=2(β/α)rr_{1}=2(\beta/\alpha)r. The function ff wins its home match against hh if

on more than n/2n/2 of the blocks IjI_{j}.

We select as f^\widehat{f} any “champion” in HH that wins all of its home matches.

The main result regarding the champions league is as follows:

Let HFH\subset{\mathcal{F}} as above. Under the assumptions of Theorem 2.10 and using its notation, with probability at least

with respect to (Xi,Yi)i=2N+13N(X_{i},Y_{i})_{i=2N+1}^{3N}, one has:

\bullet ff^{*} wins all of its home matches, and

The proof of Proposition 3.8 is presented in Section 5.2.

The combination of Propositions 3.2, 3.5, and 3.8 yields the proof of Theorem 2.10.

Examples

Before turning to the proofs of our main results, let us present some explicit examples of applications of Theorem 2.10.

The second example is similar to the first one and is motivated by questions in sparse recovery: linear regression in the set T=ρB1n={t:t1ρ}T=\rho B_{1}^{n}=\{t:\|t\|_{1}\leq\rho\}. The difference between this example and the first one lies in the assumption on XX. In the second example XX is not assumed to be LL-sub-Gaussian, but rather satisfies a much weaker moment condition, the same condition that is needed to ensure that the basis pursuit algorithm has a unique solution with the optimal number of measurements (see ).

Both examples lead to explicit estimates on the accuracy and confidence of the median-of-means tournament. The estimates are better than the known bounds and hold with optimal confidence for accuracy larger than crcr^{*} (though in general, the true identity of the accuracy edge is an open question). Moreover, in the second example, of ρB1n\rho B_{1}^{n}, one may show that rr^{*} is proportional to the accuracy edge, and the optimal tradeoff holds all the way down to that value.

Thanks to the isotropicity assumption, if TT is a convex and centrally-symmetric set and F={,t:tT}F=\{\left\langle\cdot,t\right\rangle:t\in T\}, then for every hFh\in F,

the mean-width of TT with respect to the Gaussian measure.

The following notion makes the task of obtaining such bounds more manageable, though still nontrivial.

The simplest examples of isotropic, LL-sub-Gaussian random vectors are vectors with independent, mean-zero, variance 11 components that are sub-Gaussian. For instance, the standard Gaussian vector (gi)i=1n(g_{i})_{i=1}^{n}, and (εi)i=1n(\varepsilon_{i})_{i=1}^{n}, whose components are independent, symmetric random signs are both LL-sub-Gaussian for a constant that is independent of the dimension nn. Another family of examples consists of the random vectors whose density is uniform on sets of the form {t:tpcn1/p}\{t:\|t\|_{p}\leq cn^{1/p}\} for some p2p\geq 2, normalized to have volume 11. Again, LL is an absolute constant, independent of nn and pp (see, e.g., ).

where (εi)i=1N(\varepsilon_{i})_{i=1}^{N} are independent, symmetric signs that are independent of (Xi,ξi)i=1N(X_{i},\xi_{i})_{i=1}^{N}.

Note that if ξ\xi is heavy-tailed there is no hope of obtaining a high-probability version of Proposition 4.3. In fact, if all one knows is that ξ\xi belongs to LqL_{q}, one cannot hope that

with a probability higher than 1c1/N(q/2)11-c_{1}/N^{(q/2)-1}.

Applying Proposition 4.3, it is evident that

Thus, assuming that YX,tLqLYX,tL2\|Y-\left\langle X,t\right\rangle\|_{L_{q}}\leq L\|Y-\left\langle X,t\right\rangle\|_{L_{2}}, as we do in Assumption 2.1, it follows that

For constants c1c_{1}, c2c_{2} and σ\sigma, let

for constants c1,c2,c3,c4c_{1},c_{2},c_{3},c_{4} that depend only on qq and LL.

We obtain the following consequence of Theorem 2.10:

and the constants c1,...,c5c_{1},...,c_{5} depend only on LL and qq.

with probability at least 12exp(c1Nmin{1,σ2s2})1-2\exp(-c_{1}N\min\{1,\sigma^{-2}s^{2}\}), exhibiting the optimal accuracy-confidence tradeoff.

Theorem 4.5 improves Theorem A from , where it was shown that ERM produces t^\widehat{t} with the same accuracy and confidence as in the first part of Theorem 4.5, but only when YX,tLqLqYX,tL2\|Y-\left\langle X,t\right\rangle\|_{L_{q}}\leq L\sqrt{q}\|Y-\left\langle X,t\right\rangle\|_{L_{2}} for every q>2q>2 and every tTt\in T. In other words, Theorem A from is based on the assumption that each YX,tY-\left\langle X,t\right\rangle is an LL-sub-Gaussian random variable, and holds only for the accuracy level ss^{*}. In contrast, Theorem 4.5 shows that the median-of-means tournament performs in an optimal way in heavy-tailed situations that are totally out of reach for ERM and for the entire range scss\geq cs^{*}.

Observe that the only range of accuracies in which Theorem 4.5 is (perhaps) suboptimal, is when

for well chosen values of κi,ηi\kappa_{i},\eta_{i}, i=1,2i=1,2; that is, for values that are larger than the known lower estimate on the accuracy edge for such problems. As noted in , there are many examples in which λ\lambda^{*} and ss^{*} are equivalent (roughly speaking, this happens when Sudakov’s inequality is sharp). In such cases the median-of-means tournament is optimal in the entire range of accessible accuracies.

One important class of sets in which this equivalence is true is ρB1n={t:t1ρ}\rho B_{1}^{n}=\{t:\|t\|_{1}\leq\rho\} (see for the proof). In light of Theorem 4.5, the median-of-means tournament performs in an optimal way in ρB1n\rho B_{1}^{n}. Moreover, it turns out that one may relax the sub-Gaussian assumption on XX and still obtain the optimal behaviour ρB1n\rho B_{1}^{n}, as we show next.

Being a penalized version of ERM, the analysis of LASSO is equivalent to the study of ERM in the sets ρB1n\rho B_{1}^{n} for an arbitrary choice of ρ\rho.

While we defer the question of a “LASSO-tournament” procedure to future work, it is clear that the first step in that direction is to explore the median-of-means tournament in ρB1n\rho B_{1}^{n}. Instead of the sub-Gaussian assumption used in Theorem 4.5, the assumption we use follows the path of :

Note that the coordinates of XX need not be independent and XX may be far from being an LL-sub-Gaussian random vector. Indeed, it is required that linear functionals ,t\left\langle\cdot,t\right\rangle satisfy a sub-Gaussian moment growth only up to the logarithm of the dimension, and it is possible that some do not have any higher moment beyond p=κlognp=\kappa\log n.

It turns out (see ) that if XX satisfies Assumption 4.1, then N1/2i=1NXi,eiN^{-1/2}\sum_{i=1}^{N}\left\langle X_{i},\cdot\right\rangle e_{i}, the random matrix whose rows are independent copies of XX, exhibits the best possible sparse recovery features. For example, one requires Nslog(en/s)N\sim s\log(en/s) random measurements Xi,t\left\langle X_{i},t\right\rangle to recover any ss-sparse vector tt using basis pursuit, and a similar type of estimate holds in the “noisy” setup for the LASSO. Moreover, one cannot relax the moment condition in Assumption 4.1 and still get the same recovery properties.

We show that as far as regression in ρB1n\rho B_{1}^{n} goes, the median-of-means tournament yields the optimal, “Gaussian” behaviour even when XX only satisfies Assumption 4.1 and Yt,XY-\left\langle t^{*},X\right\rangle is heavy-tailed. To this end, we need sharp bounds on the parameters that are used to define rr^{*} in Theorem 2.10.

Let q>2q>2 and let XX satisfy Assumption 4.1 for κ=c1(q)\kappa=c_{1}(q). Let ξLq\xi\in L_{q} be a random variable (not necessarily independent of XX) and let (Xi,ξi)i=1N(X_{i},\xi_{i})_{i=1}^{N} be independent copies of (X,ξ)(X,\xi). Then

To put them in a more familiar form, set s=ρ/r\sqrt{s}=\rho/r and observe that

Recall that if 1sn1\leq s\leq n and VsV_{s} is the set of ss-sparse vectors in the Euclidean unit sphere (i.e., those with at most ss nonzero components), then conv(Vs)sB1nB2nCconv(Vs){\rm conv}(V_{s})\subset\sqrt{s}B_{1}^{n}\cap B_{2}^{n}\subset C\cdot{\rm conv}(V_{s}) for a suitable absolute constant CC. Using what are by now standard estimates (see, e.g., ),

Using these observations, and with the same (tedious) computation as in , one obtains the following: let c1c_{1} and c2c_{2} be well-chosen absolute constants and set

for constants c4,c5,c6c_{4},c_{5},c_{6} that depend only on qq and LL.

Proofs

Let us begin by considering the structure of the various indexing sets involved in the proofs, paying particular attention to the way their structure affects the regularity of the parameters we defined earlier.

For any class F{\mathcal{F}} and every hFh\in{\mathcal{F}}, FhFF{\mathcal{F}}-h\subset{\mathcal{F}}-{\mathcal{F}}, and in particular,

Thus, for all the complexity parameters defined above, one may avoid the need to take the supremum over all possible choices of centres by considering a slightly larger indexing set, namely, star(FF,0)rD{\rm star}({\mathcal{F}}-{\mathcal{F}},0)\cap rD. Moreover, if F{\mathcal{F}} is convex, then FF{\mathcal{F}}-{\mathcal{F}} is both convex and centrally symmetric, and if F{\mathcal{F}} happens to be convex and centrally symmetric then FF=2F{\mathcal{F}}-{\mathcal{F}}=2{\mathcal{F}} and star(FF,0)rD=2FrD{\rm star}({\mathcal{F}}-{\mathcal{F}},0)\cap rD=2{\mathcal{F}}\cap rD.

This argument shows that when r>rE(κ,h)r>r_{E}(\kappa,h),

For the sake of transparency, we do not specify all the constants involved in the definitions of the fixed points right from the start. Instead, we collect conditions on these constants and use the fact that one may “combine” and “intersect” them. It turns out that all the constants depend on only two parameters: q>2q>2, for which the LqL_{q} norm is equivalent to the L2L_{2} norm, and the constant LL. In what follows, we denote by c(q,L)c(q,L) a constant that depends only on qq and LL, for the values of qq and LL in Theorem 2.10 (where only q=4q=4 is considered) or in Theorem 2.11, when qq can be arbitrarily close to 22–at the price of worse constants, of course.

With this in mind, let κ1,κ2,κ3\kappa_{1},\kappa_{2},\kappa_{3} and η\eta be constants that will be specified later and that depend only on qq and LL. Fix fFf^{*}\in{\mathcal{F}} and consider rr^{*} for which, for every r>rr>r^{*},

The value of rr^{*} from (2.7) will be given by selecting right constants κ1,κ2,κ3\kappa_{1},\kappa_{2},\kappa_{3} and η\eta. We consider an accuracy r2rr\geq 2r^{*}, and for the rest of this section we fix its value.

Hence, nn depends on the wanted accuracy rr. Also, nτNn\leq\tau N and without loss of generality we may assume that both nn and m=N/nm=N/n are integers. Also, note that m1/τm\geq 1/\tau.

In this section we prove Proposition 3.5.

Recall that by Proposition 3.2 and the resulting condition on rr, one has that, with probability at least 12exp(c0N)1-2\exp(-c_{0}N) with respect to (Xi)i=1N(X_{i})_{i=1}^{N}, if a match involving ff^{*} is allowed to take place (i.e., if DO(f,f)=1{\cal DO}(f^{*},f)=1), then ffL2r\|f^{*}-f\|_{L_{2}}\geq r; and if the match is abandoned (i.e., DO(f,h)=0{\cal DO}(f^{*},h)=0), then ffL2(β/α)r\|f^{*}-f\|_{L_{2}}\leq(\beta/\alpha)r. The constants α<1<β\alpha<1<\beta and c0c_{0} depend only on LL and qq.

Therefore, to establish Proposition 3.5 it is enough to show that with probability 12exp(cn)1-2\exp(-cn), if ffL2r\|f-f^{*}\|_{L_{2}}\geq r, then ff^{*} wins its match against ff. In other words, that Bf,f(j)>0B_{f,f^{*}}(j)>0 for the majority of the blocks in any such match.

Let us begin by exploring the situation in a match between ff^{*} and ff, knowing that ffL2r2r\|f^{*}-f\|_{L_{2}}\geq r\geq 2r^{*}, as above.

Clearly, for every f,hFf,h\in{\mathcal{F}},

This is summarized in the following lemma, established in the next two sections, and for the right choice of the constants κ1,κ2,κ3\kappa_{1},\kappa_{2},\kappa_{3}, η,τ\eta,\tau, and θ\theta that depend only on LL and qq.

There exists an absolute constant cc and a constant C1=C1(L,q)C_{1}=C_{1}(L,q) for which the following holds. With probability at least 12exp(cτ2n)1-2\exp(-c\tau^{2}n), for every fFf\in{\mathcal{F}} with ffL2r\|f-f^{*}\|_{L_{2}}\geq r,

for constants κ0\kappa_{0} and ρ0\rho_{0} that depend only on LL and qq.

Recall that (Ij)j=1n(I_{j})_{j=1}^{n} are the nn blocks, each one of cardinality m=N/nm=N/n, that F1={fstar(F,f):ffL2=r}{\mathcal{F}}_{1}=\{f\in{\rm star}({\mathcal{F}},f^{*}):\|f-f^{*}\|_{L_{2}}=r\}, and that m1/τm\geq 1/\tau. For t>0t>0 and a function uu, set

Thanks to (5.6), it is evident that if fF1f\in{\mathcal{F}}_{1} and 1jn1\leq j\leq n then, with probability at least 12exp(c0ρ0m)1-2\exp(-c_{0}\rho_{0}m),

Recalling that m1/τm\geq 1/\tau, if τc1(ρ0)\tau\leq c_{1}(\rho_{0}) then

In other words, at least (1τ/6)n(1-\tau/6)n of the blocks IjI_{j} have at least mρ0/2m\rho_{0}/2 coordinates of size at least κ0r\kappa_{0}r. Moreover, in light of the high probability estimate with which (5.7) holds, by the union bound, the same property is satisfied uniformly by any subset of F1{\mathcal{F}}_{1} of cardinality at most exp(c2τ2n/2)\exp(c_{2}\tau^{2}n/2). The set we consider is a maximal ηr\eta r-separated subset of F1{\mathcal{F}}_{1} with respect to the L2(μ)L_{2}(\mu) norm and for a well-chosen η\eta specified later.

Let H1H_{1} be such a maximal separated set. Since

it follows from the translation invariance of packing numbers that

Thus, to obtain a uniform control over points in H1H_{1}, it suffices to show that

Recalling (5.4) and the choice of nn, it suffices to verify that

Observe that since H1H_{1} is maximal, it is also an ηr\eta r-net in F1{\mathcal{F}}_{1}, that is, every fF1f\in{\mathcal{F}}_{1} has some πfH1\pi f\in H_{1} for which fπfL2ηr\|f-\pi f\|_{L_{2}}\leq\eta r. Consider the following event:

(A)(A) (5.7) holds for every fH1f\in H_{1}, and

On this event, for every fF1f\in{\mathcal{F}}_{1} there are at least (1τ/4)n(1-\tau/4)n blocks IjI_{j} with the following properties:

\bullet πf(Xi)f(Xi)κ0r|\pi f(X_{i})-f^{*}(X_{i})|\geq\kappa_{0}r on at least mρ0/2m\rho_{0}/2 coordinates in IjI_{j}, and

\bullet f(Xi)πf(Xi)κ0r/2|f(X_{i})-\pi f(X_{i})|\geq\kappa_{0}r/2 on at most mρ0/4m\rho_{0}/4 coordinates in IjI_{j}.

Hence, in each one of the (1τ/4)n(1-\tau/4)n well-behaved blocks there are at least mρ0/4m\rho_{0}/4 coordinates XiX_{i} that satisfy

In particular, since ffL2=r\|f-f^{*}\|_{L_{2}}=r, one has

Moreover, (5.8) is positive homogeneous in fff-f^{*} and F1{\mathcal{F}}_{1} is star-shaped around ff^{*}, implying that (5.8) holds for every fF1f\in{\mathcal{F}}_{1} as long as ffL2r\|f-f^{*}\|_{L_{2}}\geq r.

All that remains is to establish that (B)(B) holds with a sufficiently high probability. To that end, let

First note that, by the bounded differences inequality (see, e.g., ), with probability at least 1exp(c3u2)1-\exp(-c_{3}u^{2}),

Thus, the final step in the proof is to show that

In the last step we used a standard symmetrization argument, see, for example, . To conclude the proof, observe that (8/ρ0κ0r)ηrτ/48({8}/{\rho_{0}\kappa_{0}r})\cdot\eta r\leq\tau/48 when

and for a suitable absolute constant c5c_{5}. The fact that

follows because r2rr\geq 2r^{*} and invoking (5.2), as long as

The multiplier component

Next we complete the proof of Lemma 5.1 by showing that, with high probability, if ffL2r\|f-f^{*}\|_{L_{2}}\geq r, then

on at most τn\tau n blocks. In the first step towards this goal, we consider a single function:

In particular, the same assertion holds uniformly for any fixed subset H2F1H_{2}\subset{\mathcal{F}}_{1} of cardinality at most exp(C2τ2n/2)\exp(C_{2}\tau^{2}n/2).

Before proving Lemma 5.2, consider the assumption that

Another observation is that (5.11) passes to star(F,f){\rm star}({\mathcal{F}},f^{*}), simply because any function in that set is of the form h=λf+(1λ)fh=\lambda f+(1-\lambda)f^{*} for some fFf\in{\mathcal{F}} and 0λ10\leq\lambda\leq 1 and therefore, (f(X)Y)(h(X)f(X))=λ(f(X)Y)(f(X)f(X))(f^{*}(X)-Y)(h(X)-f^{*}(X))=\lambda(f^{*}(X)-Y)(f(X)-f^{*}(X)).

for (Ui)i=1M(U_{i})_{i=1}^{M} that are independent copies UU. It is straightforward to verify that, with (εi)i=1N(\varepsilon_{i})_{i=1}^{N} defined as independent symmetric random signs,

By the norm equivalence assumption of Theorem 2.10,

and by the independence assumption of Theorem 2.11,

Hence, setting t=(C1/2)r2t=(C_{1}/2)r^{2} and noting that

Therefore, with probability at least 1τ/161-\tau/16,

Just as before, one may select any fixed subset H2F1H_{2}\subset{\mathcal{F}}_{1} of cardinality exp(C2τ2n/2)\exp(C_{2}\tau^{2}n/2), and the assertion of Lemma 5.2 holds with high probability and uniformly for every hH2h\in H_{2}. The choice of H2H_{2} requires some care. It cannot be just an arbitrary maximal separated set.

There exists a subset H2F1H_{2}\subset{\mathcal{F}}_{1} of cardinality at most exp(C2nτ2/2)\exp(C_{2}n\tau^{2}/2) such that for every fF1f\in{\mathcal{F}}_{1} there is some hH2h\in H_{2} that satisfies

Proof. Let HH^{\prime} be an maximal ηr\eta r-separated subset of F1{\mathcal{F}}_{1}. Recall that by the choice of rr and (5.4), logHκ32Nmin{1,σ2r2}\log|H^{\prime}|\leq\kappa_{3}^{2}N\min\{1,\sigma^{-2}r^{2}\}, which is smaller than (C2/2)τ2n(C_{2}/2)\tau^{2}n when

The class F{\mathcal{F}} is a locally compact subset of L2(μ)L_{2}(\mu) and therefore, so is F1{\mathcal{F}}_{1}. Thus, for every hHh^{\prime}\in H^{\prime}, the intersections of F1{\mathcal{F}}_{1} with the L2L_{2} balls B(h,ηr)B(h^{\prime},\eta r) are compact and the continuous linear functional on L2L_{2}

Hence, to complete the proof it suffices to show that for every fF1f\in{\mathcal{F}}_{1}, there are at most (7/8)nτ(7/8)n\tau blocks IjI_{j} with

There exists an absolute constant cc for which, with probability at least 12exp(cτ2n)1-2\exp(-c\tau^{2}n),

Consider the supremum of Binomial random variables

The next step is to centre the process. We show that for every fF1f\in{\mathcal{F}}_{1}, the centring term satisfies

Indeed, symmetrizing, applying either the assumption of norm equivalence of Theorem 2.10 or the independence assumption of Theorem 2.11, and recalling that fπfL22ηr\|f-\pi f\|_{L_{2}}\leq 2\eta r, it is evident that

again using the fact that n/Nθ(r/σ)\sqrt{n/N}\leq\sqrt{\theta}(r/\sigma). Clearly,

for a suitable absolute constant C3C_{3}.

Finally, we need to bound the centred empirical process. This is done by standard techniques of symmetrization, the contraction theorem for Bernoulli processes (for the Lipschitz function ϕ(t)=t\phi(t)=|t|) and then de-symmetrization (see, e.g., ):

2 Champions league–proof

Finally, it remains to prove Proposition 3.8. Recall that, with probability at least 1-2\exp\bigl{(}-c_{0}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)} with respect to the sample (Xi,Yi)i=12N(X_{i},Y_{i})_{i=1}^{2N}, we have been able to identify a set of “qualifiers” that have not lost a single match in the preliminary round subset; that is, HFH\subset{\mathcal{F}}, consisting of ff^{*} and possibly other functions that satisfy ffL2(β/α)r\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r. In the rest of this section we work conditionally on this “good” event.

While producing a function that is close to ff^{*} solves the estimation problem, the question of prediction requires an additional step: we would like to choose one of the qualifiers that has an almost optimal statistical performance: a function f^\widehat{f} that satisfies

for an appropriate CC. We show that this is possible with the required high probability for C=16(β/α)2C=16(\beta/\alpha)^{2}. To this end, set r1=(β/α)rr_{1}=(\beta/\alpha)r.

Recall that for f,hstar(F,f)f,h\in{\rm star}({\mathcal{F}},f^{*}), Ψh,f=(h(X)f(X))(f(X)Y)\Psi_{h,f}=(h(X)-f(X))(f(X)-Y) and note that

The “champions league” round is designed to have “home-and-away” legs. For the partition (Ij)j=1n(I_{j})_{j=1}^{n} of (Xi,Yi)i=2N+13N(X_{i},Y_{i})_{i=2N+1}^{3N}, ff wins its home match against hh if

The main ingredient in the proof is the following lemma.

Let F2=star(F,f)(f+r1D){\mathcal{F}}_{2}={\rm star}({\mathcal{F}},f^{*})\cap(f^{*}+r_{1}D). Under the conditions of Theorem 2.10, there is an absolute constant cc, such that, with probability at least 12exp(cn)1-2\exp(-cn),

Proof. Using the assumption of norm equivalence of Theorem 2.10, one may verify that Ψf,fL2L2r1σ\|\Psi_{f,f^{*}}\|_{L_{2}}\leq L^{2}r_{1}\sigma. Therefore,

Applying the same argument used in the previous section–namely, symmetrization, followed by contraction for a Bernoulli process and the Lipschitz function ϕ(t)=t\phi(t)=|t| and de-symmetrization–one has that for absolute constants c1,c2c_{1},c_{2}, and c3c_{3},

provided that nc4L4(r12/σ2)Nn\leq c_{4}L^{4}\cdot(r_{1}^{2}/\sigma^{2})N for an absolute constant c4c_{4}. Since r1=2(β/α)rr_{1}=2(\beta/\alpha)r, it suffices that

Recall that for every fHf\in H, ffL2(β/α)r\|f-f^{*}\|_{L_{2}}\leq(\beta/\alpha)r and consider the “good” event from Lemma 5.5. For any fF2f\in{\mathcal{F}}_{2}, and, in particular, for any qualifier fHf\in H,

Finally, since Ψf,f=(f(X)f(X))2Ψf,f\Psi_{f^{*},f}=-(f^{*}(X)-f(X))^{2}-\Psi_{f,f^{*}}, on that event and the same coordinate blocks,

Thus, ff is defeated by ff^{*} in the majority of the blocks, and in particular, loses its home match against ff^{*}. It follows that, with probability at least 12exp(cn)1-2\exp(-cn), any function f^\widehat{f} selected in the champions league round must satisfy that Ψf,f^2r12\Psi_{f^{*},\widehat{f}}\geq-2r_{1}^{2}, and by Lemma 3.6,

Additional remarks

This fact has recently been established in in a very special situation: when F{\mathcal{F}} is a convex, LL-sub-Gaussian class of functions; all the admissible targets are of the form Y=f0(X)+WY=f_{0}(X)+W for f0Ff_{0}\in{\mathcal{F}} and WW that is sub-Gaussian, zero-mean, independent of XX. The procedure used is a modification of ERM: one replaces F{\mathcal{F}} with an appropriate net, thus ‘erasing’ all the fine structure of F{\mathcal{F}} at the right level, and then runs ERM on the net. The idea behind this procedure is straightforward: if one is interested in accuracy rr, one is insensitive to perturbations of that order. From that perspective, a net with a mesh that is proportional to rr is just as good as the entire class. It turns out that rr^{*} of the net is proportional to λ\lambda^{*} the original class F{\mathcal{F}}. Unfortunately, all the highly restrictive assumptions are essential to the proof and cannot be relaxed at all.

It should be noted that the median-of-means tournament may be modified in exactly the same way as ERM is modified in , leading to an accuracy that is proportional to λ\lambda^{*} when F{\mathcal{F}} is a convex, LL-sub-Gaussian class and for an independent noise WW that may be heavy-tailed. We decided not pursue this point further because it is a very special case, and shifts the emphasis of the article from the question of the optimal accuracy/confidence tradeoff to the nature of the accuracy edge, a problem of a different nature.

As far as the latter is concerned, it is not clear whether the gap between λ\lambda^{*} and rr^{*} can be closed in other cases. It is possible that both independent noise and a sub-Gaussian class are essential in attaining an accuracy proportional to λ\lambda^{*}, and under weaker assumptions, the true accuracy edge lies somewhere between λ\lambda^{*} and rr^{*}. We leave that question for future study.

Finally, it should be noted that in this work we completely ignore the algorithmic aspects of the new procedure. While computing the empirical risk minimizer often leads to thoroughly studied and well understood convex optimization problems, finding the winner of the median-of-means tournament in a computationally efficient manner is a highly nontrivial–and perhaps not hopeless–problem that goes beyong the scope of this paper. Techniques of “derivative-free optimization” with “function comparison oracle” may be useful, see, for example, .

References

Appendix A The distance oracle – outline of the proof

Here, we sketch the proof of Proposition 3.2. The proof is almost identical to that of Theorem 3.3 from , and follows the same path as the study of the quadratic component in the proof of Theorem 2.10.

Moreover, combining (A.1) with a straightforward application of Chebyshev’s inequality, one has that with probability at least 0.80.8,

for constants κ0\kappa_{0} and κ1\kappa_{1} that depend only on qq and LL.

Let us apply this observation to our setup: fix fFf^{*}\in{\mathcal{F}} and consider the set

for any fixed r>dr>d^{*}. By the choice of dd^{*}, star(Ff,0)rS{\rm star}({\mathcal{F}}-f^{*},0)\cap rS contains an ηr\eta r-net Vr{\cal V}_{r} of cardinality at most exp(c1k/2)\exp(c_{1}k/2). Therefore, with probability at least 12exp(c1k/2)1-2\exp(-c_{1}k/2), for every vVrv\in{\cal V}_{r} there is Jv{1,...,k}J_{v}\subset\{1,...,k\} of cardinality at least 0.7k0.7k, such that for every jJvj\in J_{v},

Next, for each vstar(Ff,0)rSv\in{\rm star}({\mathcal{F}}-f^{*},0)\cap rS let πvVr\pi v\in{\cal V}_{r} satisfy vπvL2ηr\|v-\pi v\|_{L_{2}}\leq\eta r. Suppose that one can show that on a high probability event, for every such vv there are at most k/10k/10 blocks IjI_{j}^{\prime} on which

Then, on that event, and on at least 0.6k0.6k blocks,

This is precisely the isomorphic estimate we require, so far, for elements of star(Ff,0){\rm star}({\mathcal{F}}-f^{*},0) whose L2(μ)L_{2}(\mu) norm is rr. Obtaining the isomorphic estimate for elements with a larger L2(μ)L_{2}(\mu) norm follows because star(Ff,0){\rm star}({\mathcal{F}}-f^{*},0) is star-shaped around and the required isomorphic estimate is positive homogeneous.

Therefore, to complete the proof of the first part of Proposition 3.2 it suffices to show that

where the supremum is taken in star(Ff,0)rS{\rm star}({\mathcal{F}}-f^{*},0)\cap rS. The proof of (A.3) is based on an identical argument to the one we used earlier, in the study of the quadratic component, and the fact that r>dr>d^{*}.

For each vVrv\in{\cal V}_{r}^{\prime}, with probability at least 0.80.8,

and thus, with probability at least 12exp(c1k)1-2\exp(-c_{1}k), at least 0.7k0.7k of the random variables Mv,j{\cal M}_{v,j} are smaller than κ1r\kappa_{1}r. Next, we consider an ηr\eta r-net VrFf,r{\cal V}^{\prime}_{r}\subset{\mathcal{F}}_{f^{*},r}, which, by the choice of rr is of cardinality at most exp(c1k/2)\exp(c_{1}k/2). It follows that, with probability at least 12exp(c1k/2)1-2\exp(-c_{1}k/2), (A.4) holds for every vVrv\in{\cal V}_{r}^{\prime}. The oscillation term is then controlled as we outlined above.

Finally, because k=c(q,L)Nk=c(q,L)N, the estimates hold with the claimed probability of 12exp(cN)1-2\exp(-cN) for a constant cc that depends only on qq and LL.