Risk minimization by median-of-means tournaments
Gabor Lugosi, Shahar Mendelson
Introduction
We assume in what follows that the minimum is attained and exists and is unique, as is the case when is a closed, convex set.
One may formulate two natural goals in estimation and prediction problems. One of them is to find a function whose distance to
is as small as possible. The other is to ensure that the excess risk of the selected function
The crucial difference between this type of problems and standard questions in approximation theory is that the available information is limited to a random sample. One observes , that is, independent pairs, where each has the same distribution as and is independent of . The fact that the distribution of the pair is not known makes it impossible to invoke approximation-theoretical methods and identify directly the true minimizer of the risk.
where throughout the article, for , we use the notation
The excess risk, also known as the prediction error, compares the ‘predictive capabilities’ of to that of the best in the class, and is defined by the conditional expectation
Note that, up to this point, was an arbitrary square-integrable real-valued random variable, and obviously one would like to be able to treat as wide a variety of targets as possible. Clearly, the accuracy and confidence one may establish may depend on some features of the target–for example, some a-priori estimate on its norm–, or on its “distance” to , etc. We consider a broad set of admissible targets , and the accuracy and confidence of relates to its performance for any admissible target . Thus, a learning problem is the triplet , when and are not known, though the learner does know that .
It is clear that there is a tradeoff between the accuracy and confidence in a given learning problem: the smaller the error is, the harder it is to attain it. The question of this accuracy/confidence tradeoff is of utmost importance in statistical learning theory, and has been investigated extensively in numerous manuscripts since the early days of the area in the late 1960’s (see, for example, the books for a sample of the work devoted to this question). To find the right accuracy/confidence tradeoff one must first identify a lower bound on the tradeoff in terms of the sample size, the structure of and possibly some additional information on and , and then come up with a learning procedure that attains the tradeoff.
Roughly put, one should explore the tradeoff for the set of “achievable accuracies” of each learning problem. An accuracy is achievable if there is a learning procedure in that achieves the accuracy for the problem with constant probability–say at least –, and because is not known, this has to hold for any . We define the accuracy edge as the smallest achievable accuracy of a problem.
The primary question is to find the correct accuracy/confidence tradeoff for any (reasonable) learning problem and identify a learning procedure that attains that tradeoff all the way down to the accuracy edge. It should be noted that up to this point in time, and other than in a few isolated examples, no learning procedure even came close to the optimal tradeoff at any nontrivial accuracy level.
In this article we solve this problem by presenting an optimal learning procedure: it yields the best possible accuracy/confidence tradeoff (almost) up to the achievable edge, and under minimal assumptions on the learning problem.
The minor reservation “almost” is due to fact that more often than not, the identity of the accuracy edge of a learning problem is not known. As it is explained in what follows, while one may provide lower estimates on the accuracy edge, there is a very real possibility that such estimates are too optimistic and the real accuracy edge is larger. Regardless, the procedure we introduce “gets as close” to the accuracy edge as any other known procedure–or better, and with a dramatically better confidence. It also exhibits the optimal accuracy/confidence tradeoff for larger errors, something that no other procedure is known to do.
It should be clarified at this point that by “optimal accuracy” we mean an accuracy that is optimal up to a constant factor and optimality in the confidence means that the probability with which the claimed accuracy does not hold is optimal up to a constant factor in the exponent.
Before we present a more accurate formulation of our main results and describe the optimal procedure, let us explain what our procedure is not: it is not empirical risk minimization (ERM), nor any of its “family members”.
Perhaps the most natural way of choosing is by empirical risk minimization, that is, by least squares regression,
Again, we assume that the minimum is attained, while if there are several minimizers, may be chosen among them in an arbitrary way.
The performance of least squares regression has been thoroughly studied in many different scenarios. A sample of the rich literature includes Györfi, Kohler, Krzyzak, Walk , van de Geer , Bartlett, Bousquet, and Mendelson , Koltchinskii , Massart .
It turns out that the performance of ERM changes dramatically according to the tail behaviour of the functions involved in the given learning problem. One may show (see, e.g., ) that if is convex and the functions in (more precisely, the random variables ) and the target have well-behaved tails (and by “well-behaved” we mean sub-Gaussian), ERM preformed in yields good results: for an accuracy that is not far from the accuracy edge, it attains the optimal confidence, though it does not maintain the optimal accuracy/confidence tradeoff for larger errors. Unfortunately, the situation deteriorates considerably when either members of or one of the admissible targets is heavy-tailed in some sense. In such cases, the performance of ERM is significantly weaker than the known theoretical limitations of the accuracy/confidence tradeoff. Moreover, replacing ERM with a different procedure is of little use: other than in few and rather special learning problems, there have been no known alternatives to ERM whose performance comes close to the known theoretical limitations of the accuracy/confidence tradeoff, and certainly not when the problem is heavy-tailed.
The reason for ERM’s diminished capacity is that it is sensitive to even a small number of atypical points in the sample : since ERM selects a minimizer of the empirical mean of the squared loss, atypical values may distort the selection and send ERM to the wrong part of . This sensitivity is clearly reflected in the confidence with which ERM operates in heavy-tailed situations: roughly put, one can guarantee that ERM performs with the right accuracy only on samples that are not contaminated by a significant number of atypical values. However, in heavy-tailed situations, the latter does not occur frequently, and having atypical values is simply a fact of life one has to deal with.
In contrast, the procedure we suggest as an alternative to ERM leads to the optimal accuracy/confidence tradeoff even in heavy-tailed situations. Unlike ERM, it is not sensitive even to a large number of atypical sample points.
One may show that a nontrivial estimate is possible only when for a suitable absolute constant , and we consider only such values of . Also, there are known estimates on the theoretical limitations of this problem: a lower bound on the accuracy edge is of the order of , and for an accuracy level that is proportional to the accuracy edge, say, for a suitable absolute constant , the conjectured confidence is .
Moreover, because of the minimal assumptions on and , the best estimate one can hope for on and that holds with probability follows from Chebyshev’s inequality. In particular, it leads to the following, rather unsatisfactory, estimate on the performance of ERM:
Also, if one wishes for an error that is proportional to the (conjectured) accuracy edge, that is, of the order of , the best that one can hope for is a constant confidence–a very different estimate from the conjectured confidence of .
Although what we have described is an upper bound, one may show (see ) that this estimate captures the performance of empirical risk minimization, and in particular exhibits ERM’s inability to deal with atypical sample points. The reality is that ERM performs with an accuracy of the order of only on the relatively few samples that contain almost no misleading data.
The main result of this article, when applied to this example, shows that under the same assumptions, the procedure we suggest selects for which
for some numerical constants ; that is, it performs with optimal confidence at a level that is proportional to the accuracy edge. In fact, our procedure gives the optimal confidence for any accuracy .
In the next section we present the required definitions, outline the current state of the art, and formulate our main results. In Section 3, we describe the new procedure in detail. In Section 4 we illustrate the power of the main results on some canonical examples, before turning to the proofs of our results.
Let us point out the well-understood fact that the behaviour of the accuracy and confidence in learning problems in which is not convex is trivial in some sense, and totally different from the convex case, which is why focus on the latter. As it happens, the dominating factor in non-convex problems is the ‘location’ of the targets relative to rather than the structure of , and an ‘unfavourable location’ of a target completely distorts the accuracy and confidence of the learning problem. However, even if is not convex, all the targets of the form for and that is symmetric and independent of happen to be in a ‘favourable’ location and thus our results apply to such problems as well.
The accuracy edge and the accuracy/confidence tradeoff
We begin by describing the known theoretical limitations on the accuracy edge and on the accuracy/confidence tradeoff for a given learning problem. To this end, let us introduce some notation, following the path of .
Let be the unit ball in and set to be the unit sphere. For and , put . Let
Thus, is the star-shaped hull of around , that is, the union of all segments for which one end-point is and the other is in .
The star-shaped hull adds regularity to the class around the fixed centre : on the one hand, it does not increase the size of the class by much, while on the other hand, it implies that every function of the form has a ‘scaled-down’ version when one moves towards . In particular, the level sets become ‘richer’ as gets smaller: each one of them contains scaled-down copies of all ‘higher’ levels.
Consider the localization of
which is given by the shift that maps the designated point to . Then the resulting class is made ‘more regular’ by taking its star-shaped hull around , and finally it is localized, by considering its intersection with , the ball of radius , centred in .
Observe that if is convex then for any , . Also, in that case,
and if, in addition, is centrally symmetric (that is, if then ), then .
One way of deriving lower estimates on the accuracy edge and on the accuracy/confidence tradeoff is based on the packing numbers of the localizations .
Given a set and , denote by the cardinality of a maximal -separated subset of . That is, is the maximal cardinality of a subset , for which for every .
Note that if is a maximal -separated subset of , then it is also an -cover of in the sense that for every there is some that satisfies .
For and , set
There exist absolute constants and for which the following holds. Let be convex and centrally symmetric. For any learning procedure there exists an and target for which, with probability at least ,
For , and , set
(See, e.g., .) There exist absolute constants and for which the following holds. Let and set to be a centred Gaussian variable with variance that is independent of . Then, for any learning procedure there exists and a target for which, with probability at least ,
Combining these two facts, we have a lower bound on the accuracy edge:
for some constants , . However, there is no guarantee that this lower estimate is sharp. As we explain in what follows, it is very possible that the true accuracy edge is larger.
Let us turn to the theoretical limitations of the accuracy/confidence tradeoff, assuming that the set of admissible targets is not too trivial. By that we mean that it at least contains all the targets of the form (the noise-free problems) and , for and that is a centred Gaussian variable with variance , and is independent of . We call this set of targets minimal, and of course, could be much larger.
Applying the results in and one has the following:
There exists an absolute constant for which the following holds. Let be a class that is star-shaped around one of its points (i.e., for some and every , ). Consider that contains the minimal set of targets. If is a learning procedure that performs with accuracy and confidence for every such target, then
These facts set our first benchmark (which may be too optimistic, of course): the lower bound on the accuracy edge
and the bound on the accuracy/confidence tradeoff for ,
As we noted earlier, is an optimistic, and perhaps not very realistic, lower bound on the accuracy edge. A more reasonable conjecture relies on more “global” parameters that take into account the fine structure of at an arbitrarily small level, defined next.
From here on, let be independent, symmetric -valued random variables that are independent of .
For and let
and set .
The other “global” parameter we require does depend on . It is used to calibrate the interaction between and the target.
Remark. The role of and of in Definition 2.8 deserves some explanation. The mean oscillation in Definition 2.8 involves the “multipliers” , and the right choice for the centre is the unknown . While one may take into account the “worst” , doing so makes little sense, as is the minimizer of the distance between and , and for the worst , could be significantly larger than . To overcome this obstacle we assume that an a-priori estimate on the distance between and (i.e., a value such that ) is available. With this information one still needs to consider the worst centre , but only among all functions that satisfy . As we explain in Section 4, thanks to known estimates for the expectation of the supremum of a multiplier process, one only needs to keep in mind that the multipliers are independent copies of some random variable that satisfies some moment condition, such as for some and a suitable constant .
In light of the results from , a realistic alternative to is
for some constants , and when for the given (and unknown) target one has . Indeed, in it was shown that under some mild conditions on the learning problem, specified below, ERM performs in with accuracy and constant confidence. Thus, is a potential (and to-date, the best) candidate for the accuracy edge.
Therefore, up to the issue of the true identity of the accuracy edge, the (somewhat vaguely formulated) question of the accuracy/confidence tradeoff is as follows:
Is there a learning procedure which, for any reasonable learning problem and any , performs with accuracy and confidence 1-2\exp\bigl{(}-c_{2}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)}, thus achieving the optimal accuracy/confidence tradeoff?
Our main result answers Question 2.9 in the affirmative where our notion of a “reasonable learning problem” is formulated rigorously below.
The first theorem we present requires the following conditions:
Let be a constant and let be distributed according to the measure on . Given a locally compact, convex class of functions and , assume that
for every , ;
for every , ;
for some known constant .
Let , , and suppose Assumption 2.1. There exist constants and that depend only on for which the following holds. Let
There exists a procedure that, based on the data and the values of , and , selects a function such that, with probability at least
Of course, the identity of is not known, and therefore, it is not reasonable to expect that is known beforehand. A “legal” data-independent choice is any that is larger than regardless of the identity of . In particular, Theorem 2.10 gives the optimal accuracy/confidence tradeoff for any accuracy . Alternatively, one may consider the “right” choice of the parameter as a model selection problem that may be selected using cross validation if independent data are available. We do not discuss the rather straightforward details further here.
Remark. In our main assumption of Theorem 2.10 we use the equivalence between the and norms for functions of the form . This allows us to derive bounds in terms of the variance of . In fact, if instead of norm equivalence we just have a bound on , the arguments work equally well, with replacing . The sets need to be adjusted as well: it should be replaced by all functions in whose distance to is at most .
It turns out that, when dealing with independent noise, the assumptions required in Theorem 2.10 may be relaxed even further. In particular, we do not require convexity of the class and the assumption of norm equivalence may be relaxed:
Let , , and . There exist constants and that depend only on and for which the following holds. Let be a locally compact class of functions and assume that for every , . Assume further that for some and that is mean-zero, independent of , and satisfies .
Let be as above and fix . There exists a procedure that, based on the data and the values of , and , selects a function such that, with probability at least
Note that in the case of independent additive noise, the assumptions of Theorem 2.11 are almost the minimal needed for the learning problem to be well defined: norm equivalence for that may be arbitrarily close to and that perhaps does not have any higher moments. Even under these minimal assumptions, we still obtain the optimal accuracy/confidence tradeoff.
To put Theorem 2.10 in perspective, we describe the current sharpest estimates on the accuracy and confidence of a learning problem, focusing on possibly heavy-tailed distributions.
The best known estimate on prediction and estimation in a general convex class are based on a quite weak condition, rather than the norm equivalence we use. Recall that a class satisfies a small-ball condition with constants and if for every ,
for constants and that depend only on and .
In other words, when , the procedure exhibits the optimal accuracy/confidence tradeoff for any .
The median-of-means tournament
The key to obtaining sharp estimates for both the accuracy and the confidence is identifying a procedure that is not sensitive to atypical values that may occur on a small part of the given sample. Thus, a natural starting point is the conceptually simple and attractive mean estimator, the so-called median-of-means estimator. It was proposed, independently, by Nemirovsky and Yudin , Jerrum, Valiant, and Vazirani , Alon, Matias, and Szegedy , and is defined as follows.
and define as the median of . (If the median is not uniquely defined, here, and in the rest of the paper, we choose the smallest one. Any other choice would work equally well.) It is straightforward to verify that for any ,
For properties, applications, and extensions of the median-of-means estimator, we refer to Bubeck, Cesa-Bianchi, and Lugosi . Devroye, Lerasle, Lugosi, and Oliveira Hsu and Sabato , Lerasle and Oliveira , Minsker , Audibert and Catoni .
It is tempting to try to replace the empirical means by some median-of-means estimate of the risk for each , and select a function in that minimizes the estimate. However, due to the nonlinear nature of the median-of-means, it is difficult to control the process of the estimated losses. Instead, the alternative we propose is to estimate the difference of the risk for all pairs and organize a two-stage “tournament”.
We mention here that Brownlees, Joly, and Lugosi propose empirical minimization based on a different robust mean estimator, a carefully designed M-estimator proposed by Catoni . Under general loss functions they derive analogs of Dudley’s chaining bound for the excess risk. However, the derived bounds are far from giving the optimal rate of convergence under the squared loss.
Without loss of generality and for convenience in the notation, we use instead of for the sample size and assume that is the given sample. The sample is split into three equal parts , and . Given the wanted degree of accuracy , the first part of the sample–in fact, just –is used to estimate pairwise distances within , in a sense that will be clarified below. The second part is used in the preliminary round of the tournament. We show that the outcome of the preliminary round is a set that contains and possibly some other functions whose distance to is at most . The final part of the tournament is a ‘champions league’ round. Participants in that final round are the elements in (the ‘qualifiers’ of the preliminary round), and the goal of that round is to identify a function whose predictive capabilities are almost optimal, in the sense that
Let us now describe the three stages of the median-of-means tournament is detail.
The functional allows one to identify distances in in a crude (isomorphic) way, as the next theorem shows:
If then .
If then .
Proposition 3.2 is an immediate modification of Theorem 3.3 from . For the sake of completeness we outline the main components of its proof in the appendix.
Next we introduce the “distance oracle”, denoted by . Recall the definition of from (2.7) and note that for the right choice of constants, . The distance oracle is adapted to the wanted degree of accuracy, that is, to any fixed .
Fix . Using the notation of Proposition 3.2, if set , otherwise set .
The distance oracle determines if a match between and takes place: it does if and it is abandoned if . Note that Proposition 3.2 only shows that is a realistic indication of the distance between pairs when one of the functions is the designated function . This serves our purposes since the designated function we are interested in is the minimizer of the true risk in , and the success of the procedure only requires having accurate information on matches that involve , even if we do not know which matches those are.
It follows from Proposition 3.2 that with probability at least relative to , if a match between and is allowed to proceed then , while if it is abandoned then .
2 The preliminary round
The goal of the preliminary round is to produce a subset that, with overwhelming probability over the samples , contains and for any .
The round consists of ‘matches’ between every pair , and a match can have three possible outcomes: a win by either side, or a draw (the latter includes abandoned matches because of the ruling of the distance oracle).
Each match is ‘played’ using the second part of the sample, . The sub-sample is partitioned to blocks of cardinality each, for a choice of specified later. Let us note that depends on the desired degree of accuracy .
A match between and takes place if the distance oracle, using the first part of the sample , declares that ; otherwise, the match is abandoned and results in a draw.
Each match is decided according to the blocks generated by the partition of , with the -th block played on the coordinate block . Put
The function defeats on the -th block if , and defeats if .
A winner of more than blocks is the winner of the match. If neither function wins more than half of the blocks, the match is drawn.
A function qualifies from the preliminary round if it has not lost a single match; that is, it has won or drawn all its matches. The set of “champions” consists of all functions qualified from the preliminary round.
The key fact regarding the outcome of the preliminary round is as follows:
Under the assumptions of Theorem 2.10 or of Theorem 2.11, and using their notation, with probability at least
with respect to , for all , if then defeats . In particular, and for any , , and therefore .
The proof of Proposition 3.5 is presented in Section 5.1. Note that Propositions 3.2 and 3.5 imply Theorem 2.11. In order to prove the general result of Theorem 2.10, another round of matches is necessary to choose a function from with small excess risk.
3 Champions league
The goal of the second round of the tournament is to choose, among the “champions” selected in the preliminary round, a function with a small excess risk. This round consists of different kind of matches, played between functions in . Since this round consists of matches between functions in , conditioned on the ‘good event’ from the preliminary round, every qualifier satisfies that .
The modified matches are decided using the third part of the sample . The aim is to produce a function that has a good excess risk, namely,
for some that is bounded by a constant multiple of .
Proof. Observe that for every
The matches in the champions league consist of “home-and-away” legs:
Given a sample , let be the partition of to blocks, for the same value of as in the preliminary round. Let and be as in Proposition 3.5 and set . The function wins its home match against if
on more than of the blocks .
We select as any “champion” in that wins all of its home matches.
The main result regarding the champions league is as follows:
Let as above. Under the assumptions of Theorem 2.10 and using its notation, with probability at least
with respect to , one has:
wins all of its home matches, and
The proof of Proposition 3.8 is presented in Section 5.2.
The combination of Propositions 3.2, 3.5, and 3.8 yields the proof of Theorem 2.10.
Examples
Before turning to the proofs of our main results, let us present some explicit examples of applications of Theorem 2.10.
The second example is similar to the first one and is motivated by questions in sparse recovery: linear regression in the set . The difference between this example and the first one lies in the assumption on . In the second example is not assumed to be -sub-Gaussian, but rather satisfies a much weaker moment condition, the same condition that is needed to ensure that the basis pursuit algorithm has a unique solution with the optimal number of measurements (see ).
Both examples lead to explicit estimates on the accuracy and confidence of the median-of-means tournament. The estimates are better than the known bounds and hold with optimal confidence for accuracy larger than (though in general, the true identity of the accuracy edge is an open question). Moreover, in the second example, of , one may show that is proportional to the accuracy edge, and the optimal tradeoff holds all the way down to that value.
Thanks to the isotropicity assumption, if is a convex and centrally-symmetric set and , then for every ,
the mean-width of with respect to the Gaussian measure.
The following notion makes the task of obtaining such bounds more manageable, though still nontrivial.
The simplest examples of isotropic, -sub-Gaussian random vectors are vectors with independent, mean-zero, variance components that are sub-Gaussian. For instance, the standard Gaussian vector , and , whose components are independent, symmetric random signs are both -sub-Gaussian for a constant that is independent of the dimension . Another family of examples consists of the random vectors whose density is uniform on sets of the form for some , normalized to have volume . Again, is an absolute constant, independent of and (see, e.g., ).
where are independent, symmetric signs that are independent of .
Note that if is heavy-tailed there is no hope of obtaining a high-probability version of Proposition 4.3. In fact, if all one knows is that belongs to , one cannot hope that
with a probability higher than .
Applying Proposition 4.3, it is evident that
Thus, assuming that , as we do in Assumption 2.1, it follows that
For constants , and , let
for constants that depend only on and .
We obtain the following consequence of Theorem 2.10:
and the constants depend only on and .
with probability at least , exhibiting the optimal accuracy-confidence tradeoff.
Theorem 4.5 improves Theorem A from , where it was shown that ERM produces with the same accuracy and confidence as in the first part of Theorem 4.5, but only when for every and every . In other words, Theorem A from is based on the assumption that each is an -sub-Gaussian random variable, and holds only for the accuracy level . In contrast, Theorem 4.5 shows that the median-of-means tournament performs in an optimal way in heavy-tailed situations that are totally out of reach for ERM and for the entire range .
Observe that the only range of accuracies in which Theorem 4.5 is (perhaps) suboptimal, is when
for well chosen values of , ; that is, for values that are larger than the known lower estimate on the accuracy edge for such problems. As noted in , there are many examples in which and are equivalent (roughly speaking, this happens when Sudakov’s inequality is sharp). In such cases the median-of-means tournament is optimal in the entire range of accessible accuracies.
One important class of sets in which this equivalence is true is (see for the proof). In light of Theorem 4.5, the median-of-means tournament performs in an optimal way in . Moreover, it turns out that one may relax the sub-Gaussian assumption on and still obtain the optimal behaviour , as we show next.
Being a penalized version of ERM, the analysis of LASSO is equivalent to the study of ERM in the sets for an arbitrary choice of .
While we defer the question of a “LASSO-tournament” procedure to future work, it is clear that the first step in that direction is to explore the median-of-means tournament in . Instead of the sub-Gaussian assumption used in Theorem 4.5, the assumption we use follows the path of :
Note that the coordinates of need not be independent and may be far from being an -sub-Gaussian random vector. Indeed, it is required that linear functionals satisfy a sub-Gaussian moment growth only up to the logarithm of the dimension, and it is possible that some do not have any higher moment beyond .
It turns out (see ) that if satisfies Assumption 4.1, then , the random matrix whose rows are independent copies of , exhibits the best possible sparse recovery features. For example, one requires random measurements to recover any -sparse vector using basis pursuit, and a similar type of estimate holds in the “noisy” setup for the LASSO. Moreover, one cannot relax the moment condition in Assumption 4.1 and still get the same recovery properties.
We show that as far as regression in goes, the median-of-means tournament yields the optimal, “Gaussian” behaviour even when only satisfies Assumption 4.1 and is heavy-tailed. To this end, we need sharp bounds on the parameters that are used to define in Theorem 2.10.
Let and let satisfy Assumption 4.1 for . Let be a random variable (not necessarily independent of ) and let be independent copies of . Then
To put them in a more familiar form, set and observe that
Recall that if and is the set of -sparse vectors in the Euclidean unit sphere (i.e., those with at most nonzero components), then for a suitable absolute constant . Using what are by now standard estimates (see, e.g., ),
Using these observations, and with the same (tedious) computation as in , one obtains the following: let and be well-chosen absolute constants and set
for constants that depend only on and .
Proofs
Let us begin by considering the structure of the various indexing sets involved in the proofs, paying particular attention to the way their structure affects the regularity of the parameters we defined earlier.
For any class and every , , and in particular,
Thus, for all the complexity parameters defined above, one may avoid the need to take the supremum over all possible choices of centres by considering a slightly larger indexing set, namely, . Moreover, if is convex, then is both convex and centrally symmetric, and if happens to be convex and centrally symmetric then and .
This argument shows that when ,
For the sake of transparency, we do not specify all the constants involved in the definitions of the fixed points right from the start. Instead, we collect conditions on these constants and use the fact that one may “combine” and “intersect” them. It turns out that all the constants depend on only two parameters: , for which the norm is equivalent to the norm, and the constant . In what follows, we denote by a constant that depends only on and , for the values of and in Theorem 2.10 (where only is considered) or in Theorem 2.11, when can be arbitrarily close to –at the price of worse constants, of course.
With this in mind, let and be constants that will be specified later and that depend only on and . Fix and consider for which, for every ,
The value of from (2.7) will be given by selecting right constants and . We consider an accuracy , and for the rest of this section we fix its value.
Hence, depends on the wanted accuracy . Also, and without loss of generality we may assume that both and are integers. Also, note that .
In this section we prove Proposition 3.5.
Recall that by Proposition 3.2 and the resulting condition on , one has that, with probability at least with respect to , if a match involving is allowed to take place (i.e., if ), then ; and if the match is abandoned (i.e., ), then . The constants and depend only on and .
Therefore, to establish Proposition 3.5 it is enough to show that with probability , if , then wins its match against . In other words, that for the majority of the blocks in any such match.
Let us begin by exploring the situation in a match between and , knowing that , as above.
Clearly, for every ,
This is summarized in the following lemma, established in the next two sections, and for the right choice of the constants , , and that depend only on and .
There exists an absolute constant and a constant for which the following holds. With probability at least , for every with ,
for constants and that depend only on and .
Recall that are the blocks, each one of cardinality , that , and that . For and a function , set
Thanks to (5.6), it is evident that if and then, with probability at least ,
Recalling that , if then
In other words, at least of the blocks have at least coordinates of size at least . Moreover, in light of the high probability estimate with which (5.7) holds, by the union bound, the same property is satisfied uniformly by any subset of of cardinality at most . The set we consider is a maximal -separated subset of with respect to the norm and for a well-chosen specified later.
Let be such a maximal separated set. Since
it follows from the translation invariance of packing numbers that
Thus, to obtain a uniform control over points in , it suffices to show that
Recalling (5.4) and the choice of , it suffices to verify that
Observe that since is maximal, it is also an -net in , that is, every has some for which . Consider the following event:
(5.7) holds for every , and
On this event, for every there are at least blocks with the following properties:
on at least coordinates in , and
on at most coordinates in .
Hence, in each one of the well-behaved blocks there are at least coordinates that satisfy
In particular, since , one has
Moreover, (5.8) is positive homogeneous in and is star-shaped around , implying that (5.8) holds for every as long as .
All that remains is to establish that holds with a sufficiently high probability. To that end, let
First note that, by the bounded differences inequality (see, e.g., ), with probability at least ,
Thus, the final step in the proof is to show that
In the last step we used a standard symmetrization argument, see, for example, . To conclude the proof, observe that when
and for a suitable absolute constant . The fact that
follows because and invoking (5.2), as long as
The multiplier component
Next we complete the proof of Lemma 5.1 by showing that, with high probability, if , then
on at most blocks. In the first step towards this goal, we consider a single function:
In particular, the same assertion holds uniformly for any fixed subset of cardinality at most .
Before proving Lemma 5.2, consider the assumption that
Another observation is that (5.11) passes to , simply because any function in that set is of the form for some and and therefore, .
for that are independent copies . It is straightforward to verify that, with defined as independent symmetric random signs,
By the norm equivalence assumption of Theorem 2.10,
and by the independence assumption of Theorem 2.11,
Hence, setting and noting that
Therefore, with probability at least ,
Just as before, one may select any fixed subset of cardinality , and the assertion of Lemma 5.2 holds with high probability and uniformly for every . The choice of requires some care. It cannot be just an arbitrary maximal separated set.
There exists a subset of cardinality at most such that for every there is some that satisfies
Proof. Let be an maximal -separated subset of . Recall that by the choice of and (5.4), , which is smaller than when
The class is a locally compact subset of and therefore, so is . Thus, for every , the intersections of with the balls are compact and the continuous linear functional on
Hence, to complete the proof it suffices to show that for every , there are at most blocks with
There exists an absolute constant for which, with probability at least ,
Consider the supremum of Binomial random variables
The next step is to centre the process. We show that for every , the centring term satisfies
Indeed, symmetrizing, applying either the assumption of norm equivalence of Theorem 2.10 or the independence assumption of Theorem 2.11, and recalling that , it is evident that
again using the fact that . Clearly,
for a suitable absolute constant .
Finally, we need to bound the centred empirical process. This is done by standard techniques of symmetrization, the contraction theorem for Bernoulli processes (for the Lipschitz function ) and then de-symmetrization (see, e.g., ):
2 Champions league–proof
Finally, it remains to prove Proposition 3.8. Recall that, with probability at least 1-2\exp\bigl{(}-c_{0}N\min\{1,\sigma^{-2}r^{2}\}\bigr{)} with respect to the sample , we have been able to identify a set of “qualifiers” that have not lost a single match in the preliminary round subset; that is, , consisting of and possibly other functions that satisfy . In the rest of this section we work conditionally on this “good” event.
While producing a function that is close to solves the estimation problem, the question of prediction requires an additional step: we would like to choose one of the qualifiers that has an almost optimal statistical performance: a function that satisfies
for an appropriate . We show that this is possible with the required high probability for . To this end, set .
Recall that for , and note that
The “champions league” round is designed to have “home-and-away” legs. For the partition of , wins its home match against if
The main ingredient in the proof is the following lemma.
Let . Under the conditions of Theorem 2.10, there is an absolute constant , such that, with probability at least ,
Proof. Using the assumption of norm equivalence of Theorem 2.10, one may verify that . Therefore,
Applying the same argument used in the previous section–namely, symmetrization, followed by contraction for a Bernoulli process and the Lipschitz function and de-symmetrization–one has that for absolute constants , and ,
provided that for an absolute constant . Since , it suffices that
Recall that for every , and consider the “good” event from Lemma 5.5. For any , and, in particular, for any qualifier ,
Finally, since , on that event and the same coordinate blocks,
Thus, is defeated by in the majority of the blocks, and in particular, loses its home match against . It follows that, with probability at least , any function selected in the champions league round must satisfy that , and by Lemma 3.6,
Additional remarks
This fact has recently been established in in a very special situation: when is a convex, -sub-Gaussian class of functions; all the admissible targets are of the form for and that is sub-Gaussian, zero-mean, independent of . The procedure used is a modification of ERM: one replaces with an appropriate net, thus ‘erasing’ all the fine structure of at the right level, and then runs ERM on the net. The idea behind this procedure is straightforward: if one is interested in accuracy , one is insensitive to perturbations of that order. From that perspective, a net with a mesh that is proportional to is just as good as the entire class. It turns out that of the net is proportional to the original class . Unfortunately, all the highly restrictive assumptions are essential to the proof and cannot be relaxed at all.
It should be noted that the median-of-means tournament may be modified in exactly the same way as ERM is modified in , leading to an accuracy that is proportional to when is a convex, -sub-Gaussian class and for an independent noise that may be heavy-tailed. We decided not pursue this point further because it is a very special case, and shifts the emphasis of the article from the question of the optimal accuracy/confidence tradeoff to the nature of the accuracy edge, a problem of a different nature.
As far as the latter is concerned, it is not clear whether the gap between and can be closed in other cases. It is possible that both independent noise and a sub-Gaussian class are essential in attaining an accuracy proportional to , and under weaker assumptions, the true accuracy edge lies somewhere between and . We leave that question for future study.
Finally, it should be noted that in this work we completely ignore the algorithmic aspects of the new procedure. While computing the empirical risk minimizer often leads to thoroughly studied and well understood convex optimization problems, finding the winner of the median-of-means tournament in a computationally efficient manner is a highly nontrivial–and perhaps not hopeless–problem that goes beyong the scope of this paper. Techniques of “derivative-free optimization” with “function comparison oracle” may be useful, see, for example, .
References
Appendix A The distance oracle – outline of the proof
Here, we sketch the proof of Proposition 3.2. The proof is almost identical to that of Theorem 3.3 from , and follows the same path as the study of the quadratic component in the proof of Theorem 2.10.
Moreover, combining (A.1) with a straightforward application of Chebyshev’s inequality, one has that with probability at least ,
for constants and that depend only on and .
Let us apply this observation to our setup: fix and consider the set
for any fixed . By the choice of , contains an -net of cardinality at most . Therefore, with probability at least , for every there is of cardinality at least , such that for every ,
Next, for each let satisfy . Suppose that one can show that on a high probability event, for every such there are at most blocks on which
Then, on that event, and on at least blocks,
This is precisely the isomorphic estimate we require, so far, for elements of whose norm is . Obtaining the isomorphic estimate for elements with a larger norm follows because is star-shaped around and the required isomorphic estimate is positive homogeneous.
Therefore, to complete the proof of the first part of Proposition 3.2 it suffices to show that
where the supremum is taken in . The proof of (A.3) is based on an identical argument to the one we used earlier, in the study of the quadratic component, and the fact that .
For each , with probability at least ,
and thus, with probability at least , at least of the random variables are smaller than . Next, we consider an -net , which, by the choice of is of cardinality at most . It follows that, with probability at least , (A.4) holds for every . The oscillation term is then controlled as we outlined above.
Finally, because , the estimates hold with the claimed probability of for a constant that depends only on and .