Robust machine learning by median-of-means : theory and practice
Guillaume Lecué, Matthieu Lerasle
Introduction
Recent technological developments have allowed companies and state organizations to collect and store huge datasets. These yield amazing achievements in artificial intelligence such as self driving cars or softwares defeating humans in highly complex games such as chess or Go. In fact, most big organizations have realized that data will have a major role in the future economy. Most companies in banks, electric or oil companies, etc. have a digital branch developing new data-based services for customers and new companies even build all their business on data collected on Twitter, Facebook or Google.
Big datasets have also challenged scientists in statistics and computer science to develop new methods. In particular, Machine Learning has attracted a lot of attention over the past few years. As explained in , machine learning can be seen as a branch of statistical learning dealing with large datasets where any procedure besides presenting optimal statistical guarantees should be provided with at least a tractable algorithm for its practical implementation. This new constraint raised several interesting problems while revisiting the classical learning theory of Vapnik . In particular, to name just a few, algorithms minimizing convex relaxations of the classical empirical -risk counting the number of misclassification have been proposed and studied over the last few years. More generally, optimization algorithms are now routinely used and analyzed by statisticians.
Our theoretical understanding of many classical procedures of machine learning such as LASSO heavily rely on two assumptions on the data, that should both be i.i.d. and have subgaussian behaviors. As noted in , “data from real-world experiments oftentimes tend to be corrupted with outliers and/or exhibit heavy tails”. For example, in finance, heavy-tailed processes are routinely used and, in biology or medical experiments datasets are regularly subject to some corruption by outliers. These outliers are even in some applications the data of actual interests, one can think of fraud detections for example. The need for robust procedures in data science can be appreciated, for instance, by the recently posted challenges on “kaggle”, the most popular data science competition platform. The 1.5 million dollars problem “Passenger Screening Algorithm Challenge” is about to find terrorist activity from 3D images. The “NIPS 2017: Defense Against Adversarial Attack” is about constructing algorithms robust to adversarial data.
LASSO is not an isolated example, most algorithms are actually designed to approximate minimizers of penalized empirical loss and empirical means of unbounded functions are highly sensitive to corrupted data. In statistics, building robust estimators has been an issue for a long time at least since the work of John Tukey, Frank Hampel and Peter Huber. It has yield interesting “robust” alternatives to the standard maximum likelihood estimator, one can refer to for an overview on robust statistics and to Baraud, Birgé and Sart or Baraud and Birgé see also for deep new developments in robust statistics.
1.2 Resist or detect outliers : a raising challenge in ML
There are mainly two types of outliers in practice : those corrupting a dataset which are not interesting (outliers can appear in datasets due to storage issues, they can also be adversarial data as fake news, false declarative data, etc.) and those that are rare but important observations like frauds, terrorist activities, tumors in medical images,… A famous example of the latter type of “outlier” discovered unexpectedly was the ozone hole [62, p.2]. In the former case, the challenge is to construct predictions as sharp as if the dataset was clean and in the latter, the main question is to detect outliers.
Of course, any dataset is preprocessed to be “cleaned” from outliers. This step, called “data jujitsu”, “data massage”, “data wrangling” or “data munging” usually takes of data scientists time. It requires an important quantity of expertise to guide data scientists . Some platforms like “Amazon mechanical turk” allow to preprocess collectively datasets with millions of lines. Even when done with maximal care, some contamination always affect modern datasets, because outliers are particularly hard to detect in high dimension and because the distinction between outliers and informative data is often delicate in practice.
2 Robustness : state of the art
The book traces back the origins of robust statistics to the fundamental works of John Tukey , Peter Huber and Frank Hampel . Tukey asked the following question for the location problem of a Gaussian distribution “what happens if the distribution slightly deviates from the normal distribution?”. Tukey’s 1960 example shows the dramatic lack of distributional robustness of some classical procedures like Maximum Likelihood Estimator (MLE) [42, Example 1.1]. A related question was asked by Hodges about “tolerance to extreme values”.
In fact, one doesn’t define one robustness property in general. Estimation, prediction or more generally any statistical properties like consistency, asymptotic normality, minimax optimality, etc. require assumptions (cf. the “no free lunch theorem” ). Such assumptions are naturally questionable on real databases. What properties of an estimator pertain when one or several assumptions are not satisfied? As there are several types of assumptions, one can define several types of robustness. For example, Tukey’s question is about robustness to a misspecification of the Gaussian model while Hodges question can be understood as robustness to contamination of the dataset by large outliers or robustness to heavy-tailed distributions in the model that may have caused the appearance of large data.
satisfies for all (where is the space where the take their values and the )
If the fraction is small enough, the ERM over a class of functions from to
performs theoretically as well as the ERM based only on the “good” data
which is statistically optimal in some problems . In that case, the number of outliers has to be less than times the rate of convergence of – this is a condition we will meet later for our estimators.
2.2 Robustness to heavy-tailed data
The Gaussian assumption on the noise in statistics is usually weakened in learning theory to a subgaussian assumption (the -moments of the noise are bounded by those of a Gaussian variable). This subgaussian assumption is already much more flexible and essentially allows to extend the use of least-squares estimators and their penalized versions to a much broader class of possible distributions for the noise. Going beyond the subgaussian assumption is in general way more technical. One way to proceed is to assume a subexponential assumption – random variables whose moments are smaller than those of an exponential variable– as in the Bernstein’s condition in . More recently, the subexponential assumption has also been relaxed into -moment assumptions. Least-squares estimators have been studied as well as some modifications less sensitive to large data for instance in . It appears that the statistical properties of least-squares estimators are deteriorated when the noise satisfies weaker assumptions . A modification of the ERM based on Le Cam’s estimation by tests principle has also been considered in .
3 Robustness in regression
Most of the robustness literature of the 1980s was on the regression problem , [42, Section 7.12]. These methods make a distinction between the problem of robustness with respect to the outputs on one side and with respect to the inputs on the other side [42, Chapter 7]. Outliers in the inputs remain extremely difficult to handle while many solutions have been proposed to deal with outliers in the outputs.
Even if some solutions have been proposed in particular cases, see for example , the problem of robustness with respect to outliers in the inputs (also called leverage points), has not been solved in general neither in theory nor in practice. In the book , a three steps approach is suggested to solve the robustness problem in regression
find routine methods when there are only few leverage points,
find analytical methods for identifying leverage points and, if possible, leverage groups.
In other words, suggest to make an important “datacleaning” step before any analysis for solving the leverage point problem. As already discussed, this may not be possible for nowadays large datasets. Therefore, designing procedures that can handle outliers such as leverage points is an important task that requires new ideas.
In statistics, an elegant solution to this issue (and many others) called -estimators has recently been proposed. However, -estimators do not provide a satisfying solution for learning issues since they require a statistical model (even if almost no assumption besides a controlled massiveness are required on this model), they are specifically built for the Hellinger loss (although they seem to work extremely well for other risk function in some special cases) and they remain far from being computable in general which is dead-end for Machine Learning.
3.2 Outliers in the outputs
Robust procedures can be analyzed under the assumption that “we can act as if the are free of gross errors” [42, paragraph 3 in Section 7.3] ( being the -th coordinate of for and ). In this approach, the square loss is replaced by a convex differentiable function and the estimator minimizes
If , the estimator is also solution of the equation
Many alternative robust procedures can be found in the regression literature, among other, one can mention the least median of squared residual estimator and the least trimmed sum of squares estimator which minimizes the sum of the smallest squared residuals. Finally, introduced Least Median of Squares in regression which naturally extends the median estimator in the location problem :
Its computational complexity rises exponentially with [37, p.330], .
3.3 Other challenges
We discussed the problems of detecting outliers and resisting corruption of the inputs earlier. Besides these, robust statistic has left some interesting open questions that have to be addressed.
Most results in robust statistics are asymptotic but the asymptotic approach may be quite misleading : “ […] approaches based on asymptotics are treacherous, and it could be quite misleading to transfer insights gained from asymptotic variance theory or from infinitesimal approaches (gross error sensitivity and the like) by heuristics to leverage points with ” [42, p.189]. In other words, robustness issues ultimately have to be tackled from a non-asymptotic point of view. Again, -estimators treat this issue from a statistical perspective but cannot be applied in our learning setting.
When designing robust strategies for ML, one should keep in mind that these should be computationally tractable. To the best of our knowledge, no estimator robust to outlier in the outputs nor any general learning procedure robust to heavy-tailed data can currently be used in a high dimensional learning problems. There is currently no available algorithm for high dimensional learning whose good statistical performance don’t break down in the presence of one outlier in the dataset.
4 Contributions
In this paper, we propose a new estimator of the learning function
This new estimator has minimax-optimal non-asymptotic risk bounds even when the dataset has been corrupted by outliers (that may have corrupted indifferently the inputs, the outputs or both of them) and when informative data (those that are not outliers) only satisfy somehow minimal assumptions relating their geometry on to that of (see Sections 1.4.1 and 3.1 for details). This estimator solves therefore all challenges related to outliers in the learning theory literature including “adversarial outliers” – which are data satisfying no assumptions – and “stochastic outliers” – which are informative data that may look like outliers because of some weak concentration properties due to heavy tail distributions.
We also show that our estimator can be used for ML purposes by providing several algorithms to compute proxies of our estimators and illustrate their performance in an extensive simulation study (see Section A). This opens several interesting conjectures for example, to study and compare theoretically convergence of these algorithms, and to automatically calibrate our estimators.
We propose the following setup that allows the presence of outliers and relax the standard i.i.d. setup. We assume that data are partitioned in two groups (unknown from the statistician), more precisely , with
data ( stands for “outliers”) are not assumed to be independent nor independent to the other ones , they may not be identically distributed, in fact nothing is required on these data; they can even be adversarial in the sense that may have been designed to make the learning task harder,
data (called informative) are the only data on which we can rely on to solve our learning task. In what follows, we only need very weak assumptions on those data to make MOM estimators achieving optimal statistical properties. In particular, we will always assume that the informative data are independent.
As we allow the presence of outliers, our procedure will be in a sense robust to the i.d. (identically distributed) assumption. This robustness goes actually further, since even informative data won’t be assumed identically distributed. Instead, our procedure will be shown to work under the much weaker requirement that the distributions of for all induce an equivalent metric over the class and equivalent covariance between functions in and the output. If various distributions have equivalent first and second moments for all functions , our estimators will exhibit the same optimal performance as if they were i.i.d. Gaussian with similar first and second moments on all . Our new assumption feels more natural than the usual i.i.d one as distributions inducing the same “risk geometry” on seem interesting to solve our learning task and any other higher moment assumption above seems unnatural. However, the standard ERM once again would fail to achieve this goal as the behavior of the supremum over subclasses of of the empirical process used to bound the risk of the ERM depends in general on higher moments of these distributions.
4.2 Quantifying robustness by breackdown point
The breakdown point of an estimator is the smallest proportion of corrupted observations necessary to push an estimator to infinity [28, p.1809], [42, p.279]. The notion has been introduced by Hampel in his 1968 Ph. D. thesis. In , the notion was revisited with the following non-asymptotic definition, see also . Let denote a dataset and let be an estimator. If there exists a strategic / malicious / adversarial choice of another dataset such that is arbitrarily large, the estimator is said to break down under the contamination fraction . The Donoho-Hampel-Huber breakdown point is then defined as the smallest contamination fraction under which estimator breaks down:
For the estimation of the mean of a random variable from i.i.d. observations, the empirical mean can be made arbitrarily large by adding a single “bad” observation. The breakdown point of the empirical mean estimator is therefore and this is the worst possible value for an estimator. The empirical mean is not robust to outliers, it performs very poorly in corrupted environments. On the other hand, the empirical median has a breakdown value of for the location problem.
Among equivariant estimators, the common sense heuristic applies : “when more than half data are bad, one cannot learn from good ones” . In particular, among equivariant estimators, is the best possible breakdown point for an estimator and the empirical median is optimal from this perspective. When the parameter of interest belongs to a bounded set, one can adapt the definition of breakdown point as in the discussion papers of . Other generalization to quantile estimation can be found in .
4.3 A modified notion of breakdown point for ML
We introduce a new concept of breakdown point for Learning. The standard breakdown point (4) does not come with any estimation or prediction property. As already mentioned, this leads to consider restricted classes of estimators to study optimality from the breakdown point of view to avoid degenerate cases. Instead of imposing algebraic restrictions, our new definition focus on the risk of the procedures.
where denotes the marginal on of .
In words, the breakdown number is the minimal number of points one has to add to a dataset to break statistical performance of an estimator . In this definition, the rate one can achieve is of particular importance. Ultimately, we would like to take it as small as possible. To that end, recall the definition of a minimax rate of convergence.
The minimax accuracy is the smallest accuracy that can be achieved uniformly over with confidence by some estimator.
The minimax rate of convergence defined in Definition 2 is obtained in the “Gaussian model” (that is when the dataset is made of i.i.d. observations in the regression model with Gaussian design and Gaussian noise independent of the design) which is somehow the benchmark model in statistics in which one can derive minimax rates of convergence over . We will therefore use the minimax rates obtained in this model as benchmark rates of convergence.
Using this new definition, one can be interested in various existence or optimality properties such as:
Given , , what is the largest set of sample distributions where one can build estimators with accuracy of the same order as in the Gaussian model ?
given , and , is there some procedure such that ?
given , and , how large can be ? and is this maximum achievable by a computationally tractable estimator ?
Most recent theoretical works on robust learning approaches focus on the first question, showing that statistical optimality can actually be achieved over much broader classes of distributions than the Gaussian model, see for example. In this paper, we build on the approach introduced in where we proved that, for any smaller than , there exists estimators such that where is the minimax rate of convergence in the Gaussian model for exponential deviation. It is for instance given by for the problem of estimation of a -dimensional vector and when this vector has only non-zero coordinates. But we go further in this paper, showing that these performance can be achieved by a procedure that is computationally tractable, hence by a Machine Learning algorithm. This ultimate breakthrough is due to a new estimator that we shall now introduce.
4.4 Our estimators
The key equation to understand our procedure is that the target
can be obtained as the solution of the following minimaximization problem
This remark is obvious since the expectation operator is linear, it is fundamental though in our construction as we use non-linear estimators of the expectation. More precisely, unknown expectations are estimated by median-of-means (MOM) estimators : given a partition of the dataset into blocks of equal sizes, MOM is the median of the empirical means over each block, see Section 2.2 for details. MOM estimators are naturally more robust than the empirical mean thanks to the median step : if the number of blocks is at least twice that of the outliers, outliers can only affect less than half blocks so MOM remains an acceptable estimator of the mean. MOM’s estimators are non linear, therefore plugging a MOM estimator into the minimization problem doesn’t yield the same estimator as the plugging estimator on the minimaximization problem. More precisely, it is not so hard to see that MOM minimizers cannot achieve (fast) minimax rates . This is why we focus in the following on the minimaximization problem
MOM estimators of the increments of criteria have already been used in . What is new here is that, instead of combining these estimators to build a “tournament” as in or using the approach of Le Cam as in , we simply plug these estimators in the minimaximization problem. Our construction is therefore extremely simplified compared to these previous papers, we’ll show that our new estimator achieves the same theoretical properties as these alternatives.
Then one can perform a descent/ascent algorithm over the block to get the next iteration . In practice though, this basic idea can be substantially improved by shuffling the blocks at each time step, cf. Section A.8. The first advantage of this shuffling step is that the algorithm doesn’t converge to local minimaxima.
Furthermore, our algorithm defines a natural notion of depth of data : deep data are typically regularly chosen as member of the median block while “outliers” on the other hand are typically left aside. This notion of depth, based on the risk function, is natural in our learning framework and should probably be investigated more carefully in future works. It also suggests an empirical definition of outliers and therefore an outliers detection algorithm as a by-product.
The paper is organised as follow. Section 2 introduces formally the classical least-squares learning framework and presents our new estimator, Section 3 details our main theoretical results, that are proved in Section 4. An extended simulation study is provided in Section A, including the presentation of many robust versions of standard optimization algorithms. Theoretical abstract rates derived in the main results are particularized in Section B into important problems in Machine Learning to show the extent of our results. These main results focus on penalized version of the basic tests presented in this introduction that are well suited for high dimensional learning frameworks, we complete these results in Section C, providing results for the basic estimators without penalizations in small dimension. Finally, Section D provides some optimality results for our procedures.
Setting
For any and any , let the -norm of whenever it’s defined. Finally, let be a norm defined on the span of ; will be used as a regularization norm.
A key observation in our analysis is that is solution of a minimaximization problem:
The minimaximization point of view has been considered in and . These papers used bounded modification of the quadratic loss. As for classification, the impact of outliers on the empirical mean is negligible if the loss is bounded and the number of outliers controlled, which ensures robustness. The cost to pay is that the function to be minimized is not convex and is therefore hard to implement. The idea of estimation by aggregation of tests is due to Le Cam and was further developed by Birgé , Baraud , Baraud, Birgé and Sart . MOM estimators have been recently considered see and Section 2.2 for details. Nevertheless, to the best of our knowledge, this paper is the first to consider the estimators obtained by plugging MOM’s estimators in the minimaximization problem (6). This approach will proved to be efficient for designing algorithms.
2 MOM estimator
We will make an extensive use of empirical medians and quantiles in the following. We now precise some conventions used repeatedly hereafter. For all and real numbers , we denote by
where in the last inequality we use the Minkowsky sum of two sets. More generally, inequalities involving empirical quantiles are always understood in the worst possible case. For example, one can check that, for all vectors and
In the following, we will simply write with some abuse of notation that for all vectors and :
Likewise, one can check that and . Moreover, to prove that it is enough to prove that and to prove that it is enough to prove that .
We conclude this section with the definition of MOM tests and their regularized versions.
and, for a given regularization parameter , its regularized version is
3 Minimaximization of MOM tests
One can rewrite our estimators as cost minimization estimators
where for all {\cal C}_{K}\big{(}f\big{)}=\sup_{g\in F}T_{K}(g,f) and {\cal C}_{K,\lambda}\big{(}f\big{)}=\sup_{g\in F}T_{K,\lambda}(g,f). These cost functions play a central role in the theoretical analysis of and (cf. Section 4).
Compared to the aggregation of tests by “tournaments” , or “Le Cam type” aggregation of tests , minimaximization estimators have several avantages. First they are very natural from Eq (6) and they do not require a proxy for the excess risk or for the -norm, which simplifies the presentation of this estimator. Finally, while building the estimators of or is totally impossible, these new estimators are minimaximization procedures of locally convex-concave functions that could be approximated by a numerical scheme. It is actually easy to “convert” most of the classical algorithms used for “cost minimization” into an algorithm for minimaximization solving (7) (see Section A below).
Assumptions and main results
Denote by a partition of , where has cardinality . Data are considered as outliers, no assumptions on the joint distribution of these data or on their distribution conditionally on data is made. These data may not be independent, nor independent from the remaining data. The remaining set is the set of informative data, that is the ones one can use for estimation. These are hereafter assumed independent. Given the data no one knows in advance which data is informative or not.
All the assumptions we need to get the results involve only first and second moment of the distributions on functions in and . In particular, this setting strongly relaxes usual strong concentration assumptions made on the informative data to study ERM estimators.
There exists such that for all and all ,
Of course, Assumption 1 holds in the i.i.d. framework, with and . The second assumption bounds the correlation between the “noise” and the shifted class .
There exists such that for all and all ,
and so Assumption 2 holds for .
Now, let us introduce a norm equivalence assumption over : we call it a assumption.
There exists such that for all and all
Note that for all and . Therefore, Assumption 1 and Assumption 3 are together equivalent to assume that all the norms are equivalent over . Note also that Assumption 3 is equivalent to the small ball property (cf. ) which has been recently used in Learning theory and signal processing. We refer to for examples of distributions satisfying the small ball assumption.
Let be a real-valued random variable. The following holds:
therefore, .
2 Complexity measures and minimax rates of convergence
Balls associated with the regularization norm and the norm play a prominent role in learning theory . In particular, for all , the “sub-models”
where and their “localizations” at various level , i.e. intersection of with -balls
are key sets because their Rademacher complexities, drives the minimax rates of convergence. Let us introduce these complexity measures.
Let be independent Rademacher random variables (i.e. uniformly distributed in ), independent from . For all , and , we denote the intersection of the -ball of radius and the -norm of radius centered at by
Let for all and for define
and let be a continuous and non decreasing function such that for every ,
It follows from Lemma 2.3 in that and are continuous and non decreasing functions. Note that depend on . According to , if one can choose equal to the maximum of and then is the minimax rate of convergence over . Note also that and are well defined when , which implies that at least half data are informative.
3 The sparsity equation
To control the risk of our estimator, we bound from above for all functions far from either in -norm or for the regularization norm . Recall that
The multiplier term “” is the one containing the noise and is therefore the term we will try to control from above using either the quadratic process “” or the regularization term “”. To that end we will need both to control from below the quadratic process and the regularization term.
Let and denote . When is small, the quadratic term will not help to bound from above , one shall only rely on the penalization term . One can bound from below for all close to in using the sparsity equation of . First, introduce the subdifferentials of : for all ,
where is the dual normed space of .
For any , let denote the set of functions “close” to in and at distance from in regularization norm and let denote the set of subdifferentials of all vectors close to :
If there exists a “sparse” in , that is is almost all the unit dual sphere, then is large for any so is large as well. More precisely, let us introduce, for all ,
The sparsity equation, introduced in , quantifies these notions of “large”.
A radius is said to satisfy the sparsity equation when .
One can check that, if satisfies the sparsity equation, so do all . Therefore, one can define
Note that if then . Moreover, equals to the dual ball (i.e. the unit ball of ) and so . This implies that any satisfies the sparsity equation. This simple observation will be used to get “complexity-dependent rates of convergence” as in .
4 Main results
Our first results study the performance of the estimators for a fixed value of . The other ones will provide an adaptive way to select .
Grant Assumptions 1, 2 and 3 and let , denote the functions introduced in Definition 5. Assume that and . Let be solution to the sparsity equation from Definition 6. Let denote the smallest integer such that
where and is defined in Definition 5 for and . For any , define the radius and the regularization parameter as
Assume that for every , and such that for , one has
Then, for all , with probability larger than , the estimator defined in Section 2.3 satisfies
Assumption (8) holds for instance when for every and one has
These assumptions involve second moments associated with and . As a consequence, if the metrics for and coincide on the functions and and the satisfies Assumption 2 and 3 then Theorem 1 holds. This means that we only need informative data to induce the same metric as to estimate the oracle even if we do not have any observation coming from itself. This setup relaxes the classical i.i.d. where all data are generated from . In this setting, our estimators achieve the same results as the ERM would if data were all i.i.d. with a noise independent of and both and had a Gaussian distribution (see Section D).
The function is used to define the regularization parameter, so it cannot depend on . When depend on , should be a computable upper bound independent from .
In Theorem 1, all rates depend on the choice of the tuning parameter . The following construction inspired from Lepski’s method provides an adaptive choice of this parameter. Let us first recall the definition of empirical criterion introduced in Section 2.3 and the associated confidence regions: for all , , and absolute constant ,
The following theorem shows the performance of these estimators.
Grant the assumptions of Theorem 1 and assume moreover that and . For any , with probability larger than
where and . In particular, for , we have . Therefore, if holds for some absolute constant , then the breakdown number of is larger than .
Note that is any continuous, non decreasing function such that for all , . In particular, if is itself a continuous function (it is clearly non decreasing) then for every , is another non decreasing upper bound. Therefore, one can derive results similar to those in Theorem 2 but with an extra confidence parameter : for all , with probability at least ,
Note however that depends on through the regularization parameter .
Proofs
Upper and lower bounds on follow from a study of “quadratic” and “multiplier” quantiles of means processes. As no assumption is granted on the outliers, any block of data containing one or more of these outliers is “lost” from our perspective meaning that empirical means over these blocks cannot be controlled. Let denote the set of blocks which have not been corrupted by outliers:
If , all data indexed by are informative. We will show that controls on the blocks indexed by are sufficient to insure statistical performance of MOM estimators.
The first result is a lower bound on the quantiles of means quadratic processes.
In particular, .
Define . For all , let and note that for all , by Assumption 3 and by Assumption 1. It follows from Markov’s inequality that, for all ( is defined in (10)),
Since , and so
Denote . By the bounded difference inequality (see, for instance [17, Theorem 6.2]), there exists an event with probability larger than , on which, for all ,
Since the function is 1-Lipschitz and , by the contraction principle (see, for example [57, Chapter 4] or [17, Theorem 11.6]), we have
The family , where for all , is a collection of centered random variables. Therefore, if and denote independent copies of and then
Then, as and are two independent families of independent variables therefore, if denote a family of i.i.d. Rademacher variables independent of then and have the same distribution. Therefore,
By the contraction principle, on ,
For any , because is convex. Moreover, and . Therefore, . Therefore, by definition of ,
Using the last inequality together with (12) and the assumption (so that ), we get that, on the event , for any ,
Hence, on , for any , there exists at least blocks for which . On these blocks, , therefore, on , .
Now, let us turn to a control of the multiplier process.
For all and , set and define
Let and . It follows from Markov’s inequality that
Let and let . We have
where in the last but one inequality, we used that the class is convex and the same argument as in the proof of Lemma 1. Since is a family of centered random variables, one can use the symmetrization argument to get
where the definition of has been used in the last but one inequality.
where we used (4.1) in the last inequality.
Furthermore, it follows from by the symmetrization argument that
and, from the contraction principle and (4.1), that
In conclusion, on , for all ,
Let us first introduce the event on which the statement of Theorem 1 holds. Denote by the intersection of the events , defined respectively in Lemmas 1 and 2 for and
for some absolute constants to be specified later. For these values, conditions in both Lemmas 1 and 2 are satisfied:
if then
there exists block with , for which
Moreover, on the blocks where (17) holds, it follows from assumption in (8) that all such that satisfies
It follows from the convexity of and the nearest point theorem that for all , therefore, for all such that ,
Moreover, still on the blocks where (17) holds, one also has, thanks to assumption (8), that for all such that ,
It follows that, for all such that ,
The main result of this section is Lemma 3. It will be used to bound from above the criterion {\cal C}_{K,\lambda}\big{(}f^{*}\big{)}=\sup_{g\in F}T_{K,\lambda}(g,f^{*}). Recall that and are defined as
where and are absolute constants. We also need to consider a partition of the space according to the distance between and w.r.t. and as in Figure 2: define for all ,
On the event , it holds for all ,
when and .
Let . Since the quadratic process is non negative,
Therefore, applying (18) for and the choice of and as in (20), we get
Let . Given that for any vector and , we have
Moreover when , so it follows from (16) and (18) for that
Putting both inequalities together and using that , we get
Let , (cf.) section 3.3). For all ,
Let , and . We have
where the last inequality follows from . The result follows by taking supremum over .
Let . Let be such that . Define . Then , and,
The first conclusion holds by convexity of , the second statement is obvious. For the last one, let and note that and , so we have
Now, let us bound . Let . Apply Lemma 4 and Lemma 5 to : there exists such that and
First assume that . In that case, and therefore, . Moreover, by definition of and since , we have which implies that satisfies the sparsity equation from Definition 6. Therefore, . Now, it follows from the definition of in (20) that
Moreover, since the quadratic process is non-negative, by (18) applied to ,
Finally, noting that when , binding all the pieces together in (4.2) yields
Second, assume that . Since , it follows from (16) and (17) for that
where we used that when in the last inequality. Plugging the last result in (4.2) we get
when .
The proof follows essentially the one of [56, Theorem 3.2] or [50, Lemma 2].
Let be such that, on , {\cal C}_{K,\lambda}\big{(}\hat{f}\big{)}\leqslant(2+c^{\prime})\epsilon r^{2}(\rho_{K}). Then, on , satisfies
Thus, on , . When and ,
therefore, on . This yields the results for both the regularization and the -norm.
Finally, let us turn to the control on the excess risk. It follows from (19) for that
4 End of the proof of Theorem 1
By definition of ,
where is the decomposition of as in Figure 2. It follows from Lemma 3 (for ) that on the event ,
Therefore, for and the conclusion of the proof of Theorem 1 follows from Lemma 6.
5 Proof of Theorem 2
Let and let where we recall that . Lemma 3 (for ) shows that, for , . Therefore, on , which implies that . By Lemma 6 (for and ), this implies that
Appendix A Simulation study
The aim of this section is to show that the min-max procedure introduced in this work can be computed using alternating descent-ascent algorithms. It appears that all the following algorithms can be recast as block gradient descent (BGD) but the major difference with the classical BGD is that blocks are chosen according to their “centrality” via the median operator instead of iterating over all the blocks (resp. at random) in the classical (resp. stochastic) BGD approach.
We study the performance of all the algorithms on a dataset corrupted by outliers of various forms. The “basic” set of “informative data” is called . This set is corrupted by various datasets of outliers, named , and . Good and bad data have been merged and shuffled in the dataset given to the statistician. Let us now detail the construction of these datasets.
The set of “informative data” is a set of i.i.d. data with common distribution
is a dataset of “bad data” such that and
is a dataset of “bad data” such that and
is a dataset of “bad data” where is a -Bernoulli random variable and is uniformly distributed over ,
is also a set of “outliers” that have been generated according to a linear model as in (22) (i.e. with the same target vector ) but for a different choice of design and noise . Here, we take where and is a heavy-tailed noise distributed according to a Student distribution with various degrees of freedom.
These different types of “outliers” have been chosen to illustrate that the theory allows for outliers that may have absolutely nothing to do with the oracle that can be neither independent nor random as illustrated by datasets and .
A.2 From algorithms for the LASSO to their “MOM versions”
There is a huge literature presenting many algorithms to implement the LASSO procedure. We explore several of them to investigate their performance regarding robustness.
Each algorithm designed for the LASSO can be transformed into an algorithm for the min-max estimator we have been studying in this work. Let us now explain how this can be done. Recall that MOM version of the LASSO estimator is
Moreover, our algorithms particularly fits the “big data” framework which is our original motivation for the introduction of robust procedures in machine learning. In this framework, the map-reduce paradigm has emerged as a leading idea to design procedures. In this situation, the data are spread out in a cluster of servers and are therefore naturally split into blocks. Then our procedures simply uses for mapper a mean function and for reducer a median function. This makes our algorithms easily scalable into the big data framework even when some servers have crashed down (making outliers data). MOM algorithm could therefore reduced the maintenance of big clusters.
In the sequel, we use this strategy to transform several algorithms implemented for the LASSO into algorithms for the min-max estimator (23). We first start with the subgradient descent algorithm.
A.3 Subgradient descent algorithm
The key insight in Algorithm 1 are step 2 and step 4 where the blocks number have been chosen according to their centrality among the other blocks via the median operator. Those steps are expected 1) to remove outliers from the descent / ascent directions 2) to improve the accuracy of the latter directions.
for some given , and initial point .
A.4 Proximal gradient descent algorithms
Note that the step sizes sequences and may be chosen according to the remarks below Algorithm 1.
A.5 Douglas-Racheford / ADMM
In this section, we consider the ADMM algorithm. ADMM stands for Alternating Direction Method of Multipliers. It is also a splitting algorithm which reads as follows in the LASSO case: at step ,
where is some parameter to be chosen (for instance ). The ADMM algorithm returns after a stopping criteria is met. In Algorithm 4, we provide a MOM version of this algorithm.
A.6 Cyclic coordinate descent
A.7 Adaptive choice of hyper-parameters via MOM V-fold Cross Validation
The choice of hyperparameters in a “data corrupted” environment has to be done carefully. Classical Cross-validation methods cannot be used because of the potential presence of outliers in the dataset that are likely to corrupt the classical CV-criterion. We design new (empirical) criteria that trustfully reveal performance of estimators even in situations where “test datasets” might have been corrupted.
MOM’s principles can be combined with the idea of multiple splitting into training / test datasets in cross-validation. Let us now explain the construction of estimators where the number of blocks and the regularization parameter are hyper-parameters learned via such version of the CV principle.
We are given an integer such that is divided by . We are also given two finite grids and . Our aim is to chose the “best” numbers of blocks and best regularization parameter within both grids. The dataset is splitted into disjoints blocks . For each , is used to train a family of estimators
The remaining of the dataset is used to test the performance of each estimator in the family (31). Using these notations, we can define a MOM version of the cross-validation procedure.
The Median of Means -fold Cross Validation procedure associated to the family of estimators (31) is where is minimizing the criteria
and is a partition of the test set into blocks where such that divides .
The difference between standard V-fold cross validation procedure and its MOM version in Definition 7 is that empirical means on test sets in the classical V-fold CV procedure have been replaced by MOM estimators in (32). Moreover, the mean over all splits in the classical -fold CV is replaced by a median.
The choice of raises the same issues for MOM CV as for classical -fold CV . In the simulations we use . The construction of the MOM-CV requires to choose another parameter: , the number of blocks used to construct the MOM criteria (32) over the test set. A possible solution is to take . This has the advantage to make only one split of the dataset into blocks and then use for each of the rounds, of these blocks to construct the family of estimators (31) and then of these blocks to test them.
In Figures 4 and 4, hyper-parameters (i.e. the number of blocks) and (i.e. the regularization parameter) have been chosen for the MOM LASSO estimator via the MOM V-fold Cross validation procedure introduced above.
In Figure 4, the adaptively chosen grows with the number of outliers which is something expected since one needs the number of blocks to be at least twice the number of outliers.
A.8 Maximinimization, saddle-point, random blocks, outliers detection and depth
In the previous sections, we considered a setup that particularly fits a dataset distributed on clusters. This distributed source of data yields some specific partition of the dataset in compliance with the clusters structure – in other words, the blocks of data are imposed and fixed by the specific physical form of the dataset. In a batch setup, there is a priori no structural restriction to choose a fixed partition of the dataset given that it has no predefined organization. It appears that, in this setup, there is a way to improve the performance of the various iterative algorithms constructed in the previous sections by choosing randomly the blocks at every (descent and ascent) steps of the algorithm. The aim of this section is to show some advantages of this choice and how this modified version works on the example of ADMM. Moreover, as a byproduct of this approach, it is possible to construct an outliers detection algorithm. This algorithm outputs a score to each data in the dataset measuring its centrality. In particular, data with a low score should be considered as outliers.
Following the same strategy as in Section 4, we can show that has the very same statistical performance as the estimator from Section 3.4 (cf. Section C in the case where there is no regularization for a proof). Nevertheless, there is a priori no reason that and are the same estimator. That is we don’t know in advance that
In other words there is no evidence that the duality gap is null. Since , (34) holds if and only if
In that case, is by definition a saddle-point estimator and therefore the two sets of minmax and maxmin estimators are equal.
When is a saddle-point, the choice of fixed blocks may result in a problem of “local saddle points”: iterations of our original algorithms remain close to a potentially suboptimal local saddle point.
The problem here is that our algorithms are looking for saddle-points so that if there are several cells containing saddle-points the previous algorithms may be stuck in one of them whereas a “better saddle-point” (that is a saddle point closer to ) may be in an other cell.
In Figure 5, we ran both MOM LASSO procedures via ADMM with fixed and random blocks. Both the objective function and the estimation error of MOM LASSO jump in the case of fixed blocks. These jumps correspond to a change of cell number for these iterations. The algorithm converge to local saddle-points before jumping to other cells. This slows down its convergence. On the other hand, the algorithms with random blocks do not suffer this drawback. Figure 5 shows that the estimation error converges much faster and smoothly for random blocks than for fixed blocks. As a conclusion choosing blocks at random improve both stability and speed of the algorithm, avoiding the issue of “local saddle points” in the cells . Note also that the objective function of the MOM ADMM with random blocks tends to zero so that the duality gap tends to zero (meaning that we do have a natural stopping criterium and that the MOM LASSO is a saddle point).
A byproduct of this approach is that one can construct an outliers detection procedure. To that end one simply has to count the number of times each data is selected at steps and of Algorithm 5. Every data starts with a null score. Then, every time a block is selected as median block, every data it contains increases its score by one. At the end, every data ends up with a score revealing its centrality for our learning task.
It is expected that aggressive outliers corrupt their respective blocks. These “corrupted blocks” will not be median blocks and will not be chosen. In the case of fixed blocks, informative data cannot be distinguished from outliers lying in the same block, therefore, this outliers detection algorithm only makes sense when blocks are chosen at random. Figure 6 shows performance of this strategy on synthetic data (cf. Section A.9 for more details on the simulations). On this example, outliers (data number and ) end up with a null score meaning that they have never been selected along the descent / ascent iterations of the algorithm. Rearranging the scores, one can observe a jump in the score function between outliers and informative data. The gap between these scores can be enlarged by running more iterations of the algorithm. Finally, the score may be seen as a depth of a data point , the most selected data are the most central.
A.9 Simulations setup for the figures
For Figure 5, we have ran similar experiments with , , , , , the number of iterations was and the regularization parameter was .
For Figure 6, we took , , , , the number of outliers is and the outliers are of the form and , , the number of iterations is and .
Appendix B Examples of applications
Even if recent advances have shown some limitations of LASSO, it remains the benchmark estimator in high-dimensional statistics because a high dimensional parameter space does not significantly affect its performance as long as is sparse. This was shown for example, in for estimation and sparse oracle inequalities, in for support recovery results; more results and references on LASSO can be found in the books .
B.2 SLOPE
B.3 Classical results for LASSO and SLOPE
Typical results for LASSO and SLOPE have been obtained in the i.i.d. setup under a subgaussian assumption on the design and, most of the time, on the noise as well. Let us for the moment provide a definition of this assumption and recall the one of isotropicity.
In the following assumption, we recall a setup where both estimators have been studied in .
the data are i.i.d. (in particular, and , i.e. there is no outlier),
for f^{*}=\bigl{<}t^{*},\cdot\bigr{>}, for some .
Unlike many results on these estimators, Assumption 5 only requires a “minimal” for moment on the noise. It appears that LASSO and SLOPE still achieve optimal rates of convergence under this weak stochastic assumption but the price to pay is a severely deteriorated probability estimate.
The constants depend only on and .
The error rate in Theorem 3 coincides with the standard estimate on the LASSO (cf. ), but in a broader context: does not need to be sparse but should be approximated by a sparse vector; the target is arbitrary (there is no need for a statistical model) and the noise may be heavy tailed and does not need to be independent from . But there is no room for outliers, the design matrix still needs to be subgaussian and the data are assumed to be i.i.d.. We will see below that the MOM version of the LASSO can go further, achieving minimax optimal error bounds with a much better probability estimate.
Turning to SLOPE, recall the following result for the regularization norm when .
The constants depend only on and .
B.4 Statistical analysis of MOM LASSO and MOM SLOPE
and ,
for f^{*}=\bigl{<}t^{*},\cdot\bigr{>}, for some .
there exists such that {\rm var}(\zeta\bigl{<}X,t\bigr{>})\leqslant\theta_{m}\left\|\bigl{<}X,t\bigr{>}\right\|_{L^{2}}.
There exist constants , and such that , is isotropic and for every and , \left\|\bigl{<}X,e_{j}\bigr{>}\right\|_{L^{p}}\leqslant L\sqrt{p}\left\|\bigl{<}X,e_{j}\bigr{>}\right\|_{L^{2}}.
Under Assumption 7, if , [70, Theorem 1.6] shows that, for every ,
As a consequence, if and if there exists a -sparse vector in , Lemma 7 and the choice of in (38) imply that for ,
then satisfies the sparsity equation and is the rate of convergence of the LASSO for . But, this choice of requires to know the sparsity parameter which is usually not available. That is the reason why we either need to choose a larger value for the function as in – this results in the suboptimal rates of convergence from Theorem 3 – or to use an adaptation step as section 3.4.1 – this results in the better minimax rate achieved by the MOM LASSO. To get the latter one needs a final ingredient which is the computation of the radii and . Let and . The equation is solved by
for the function defined in (38). Therefore,
The regularization parameter depends on the “level of noise” , the -norm of . This parameter is unknown in practice. Nevertheless, it can be estimated and replaced by this estimator in the regularization parameter as in [31, Sections 5.4 and 5.6.2].
The following result follows from Theorem 2 together with the computation of , , and from the previous sections.
Grant Assumption 6. The MOM-LASSO estimator satisfies, with probability at least , for every ,
where depends only on and .
In particular, Theorem 5 shows that, for our estimator contrary to the one in , the sparsity parameter does not have to be known in advance in the LASSO case.
Theoretical properties of MOM LASSO (cf. Theorem 5) outperform those of LASSO (cf. Theorem 3) in several ways:
Estimation rates achieved by MOM-LASSO are the actual minimax rates , see , while classical LASSO estimators achieve the rate . This improvement is possible thanks to the adaptation step in MOM-LASSO.
the probability deviation in (36) is polynomial – – whereas it is exponentially small for MOM LASSO. Exponential rates for LASSO hold only if is subgaussian ( for all ).
MOM LASSO is insensitive to data corruption by up to outliers while only one outlier can be responsible of a dramatic breakdown of the performance of LASSO (cf. Figure 1). Moreover, the informative data are only asked to have equivalent moments to the one of for the MOM LASSO whereas the properties of the LASSO are only known in the i.i.d. setup.
Assumptions on are weaker for MOM LASSO than for LASSO. In the LASSO case, we assume that is subgaussian whereas for the MOM LASSO we assumed that the coordinates of have moments and that it satisfies a equivalence assumption.
The sparsity equation relative to the SLOPE norm has been solved in Lemma 4.3 from .
Let and set . If is approximated (relative to the SLOPE norm) by an -sparse vector and if then .
For , one may verify that . Hence, the condition holds when and \rho\gtrsim_{L,q_{0}}\|\xi\|_{L_{q}}\frac{s}{\sqrt{N}}\log\Big{(}\frac{ed}{s}\Big{)}. Hence, it follows from Lemma 8 that when there is an -sparse vector in ; therefore, one may apply Theorem 1 for the choice of the regularization parameter: .
Now, the final ingredient is to compute the solution to . It is straightforward to check that and still .
The following result follows from Theorem 2 together with the computation of , , and above. Its proof is similar to the one of Theorem 5 and is therefore omitted.
Grant Assumption 6. The MOM-SLOPE estimator satisfies, with probability at least ,
where depends only on and .
MOM-SLOPE has the same advantages upon SLOPE as MOM-LASSO upon LASSO. Those improvements are listed below Theorem 5 and will not be repeated. The only difference is that SLOPE, unlike LASSO, already achieves the minimax rate .
Appendix C Learning without regularization
All the results from the previous sections also apply in the setup of learning with no regularization which is the framework one should consider when there is no a priori known structure on the oracle.
We consider the learning problem with no regularization. In this setup, we may use both minmaximization or maxminimization estimators
Indeed, the ideal setup for ERM is the subgaussian (and convex) framework: that is for a convex class of functions, i.i.d. data having the same distribution as and such that for some and all ,
When satisfies the right-hand side of (42), we say that is a -subgaussian class. It is proved in that in this setup the ERM is an optimal minimax procedure (cf. Theorem A′ from recalled in Theorem 9 below).
But first, we need a version of the two theorems 1 and 2 valid for and (that is for the learning problem with no regularization). Let us first introduce the set of assumptions we use and for the sake of shortness we consider the simplification introduced in Remark 2. Then, we will introduce the two fixed points driving the statistical properties of and .
For all and , ,
and .
The two fixed points associated to this problem are and as in Definition 5 for :
and let .
Grant Assumptions 8 and let , and be defined as above for , and . Assume that and . Let denote the smallest integer such that . Then, for all , with probability larger than , the estimators and defined in (41) satisfy
Moreover, one can choose adaptively via Lepski’s method. We will do it only for the maxmin estimators . Similar result hold for the minmax estimators from straightforward modifications (the same as in Section 3.4.1). Define the confidence regions: for all and ,
The following theorem shows the performance of the resulting estimator.
Grant Assumption 8. For and all , with probability larger than ,
The proofs of Theorem 7 and 8 essentially follow the one of Theorem 1 and 2. We will only sketch the proof for the maxmin estimator given that we already studied the minmax estimators in the regularized setup in Section 4.
if then
there exists block with , for which
Moreover, it follows from Assumption 8 that for all , and because of the convexity of and the nearest point theorem. Therefore, on the event , for all ,
Let us place ourself on the event and let be such that . Given that , it follows from (43) and (45) that if is such that then
for and if then . In particular,
and since one has . On the other hand, we have . Therefore, . But, we know from (47) that if is such that then . Therefore, one necessarily have .
The oracle inequality now follows from (C):
Proof of Theorem 8. Consider the same notations as in the proof of Theorem 7 and denote . It follows from the proof of Theorem 7, that with probability larger than , for all , therefore, and so . The latter implies that which, by using the same argument as in the end of the proof of Theorem 7 implies that and then .
As a consequence, if , i.e. if . Using the same arguments as above, we have
Therefore, and .
Now, it follows from Theorem 8, that if and then the MOM OLS with adaptively chosen number of blocks is such that for all , with probability at least ,
Appendix D Minimax optimality of Theorem 1, 2, 7 and 8
The aim of this section is to show that the rates obtained in Theorems 1, 2, 7 and 8 are optimal in a minimax sense. To that end we recall a minimax lower bound result from .
where is a set of i.i.d. copies of . Then, necessarily, one has
where denotes the diameter of .
Theorem 9 proves that if the statistical model (49) holds then there is a strong connexion between the deviation parameter and the uniform rate of convergence over : the smaller , the larger . We now use this result to prove that Theorems 1, 2, 7 and 8 are essentially optimal.
In Theorems 7 and 8, the deviation bounds are and the residual terms in the (to the square) estimation rates are like . Therefore, setting then Theorem 9 proves that no procedure can do better than
Given that one can obviously bound from above the performance of and as well as those of and in Theorems 7 and 8 by the -diameter of (because and those estimators are in ), then the result of Theorem 7 and 8 are optimal even in the very strong Gaussian setup with i.i.d. data satisfying a Gaussian regression model like (49). The remarkable point is that Theorem 7 and 8 have been obtained under much weaker assumptions than those considered in Theorem 9 since outliers may corrupt the dataset, the noise and the design do not have to be independent, the informative data are only assumed to have a norm equivalent to the one of and may therefore be heavy tailed.
Given the form of the deviation bounds in Theorems 1 and 2 and given that and that (if one assumes a weak regularity assumption on the class ) then the same conclusions hold for Theorems 1 and 2: there is no procedure doing better than the MOM estimators even in the very good framework of Theorem 9.