Robust machine learning by median-of-means : theory and practice

Guillaume Lecué, Matthieu Lerasle

Introduction

Recent technological developments have allowed companies and state organizations to collect and store huge datasets. These yield amazing achievements in artificial intelligence such as self driving cars or softwares defeating humans in highly complex games such as chess or Go. In fact, most big organizations have realized that data will have a major role in the future economy. Most companies in banks, electric or oil companies, etc. have a digital branch developing new data-based services for customers and new companies even build all their business on data collected on Twitter, Facebook or Google.

Big datasets have also challenged scientists in statistics and computer science to develop new methods. In particular, Machine Learning has attracted a lot of attention over the past few years. As explained in , machine learning can be seen as a branch of statistical learning dealing with large datasets where any procedure besides presenting optimal statistical guarantees should be provided with at least a tractable algorithm for its practical implementation. This new constraint raised several interesting problems while revisiting the classical learning theory of Vapnik . In particular, to name just a few, algorithms minimizing convex relaxations of the classical empirical 010-1-risk counting the number of misclassification have been proposed and studied over the last few years. More generally, optimization algorithms are now routinely used and analyzed by statisticians.

Our theoretical understanding of many classical procedures of machine learning such as LASSO heavily rely on two assumptions on the data, that should both be i.i.d. and have subgaussian behaviors. As noted in , “data from real-world experiments oftentimes tend to be corrupted with outliers and/or exhibit heavy tails”. For example, in finance, heavy-tailed processes are routinely used and, in biology or medical experiments datasets are regularly subject to some corruption by outliers. These outliers are even in some applications the data of actual interests, one can think of fraud detections for example. The need for robust procedures in data science can be appreciated, for instance, by the recently posted challenges on “kaggle”, the most popular data science competition platform. The 1.5 million dollars problem “Passenger Screening Algorithm Challenge” is about to find terrorist activity from 3D images. The “NIPS 2017: Defense Against Adversarial Attack” is about constructing algorithms robust to adversarial data.

LASSO is not an isolated example, most algorithms are actually designed to approximate minimizers of penalized empirical loss and empirical means of unbounded functions are highly sensitive to corrupted data. In statistics, building robust estimators has been an issue for a long time at least since the work of John Tukey, Frank Hampel and Peter Huber. It has yield interesting “robust” alternatives to the standard maximum likelihood estimator, one can refer to for an overview on robust statistics and to Baraud, Birgé and Sart or Baraud and Birgé see also for deep new developments in robust statistics.

1.2 Resist or detect outliers : a raising challenge in ML

There are mainly two types of outliers in practice : those corrupting a dataset which are not interesting (outliers can appear in datasets due to storage issues, they can also be adversarial data as fake news, false declarative data, etc.) and those that are rare but important observations like frauds, terrorist activities, tumors in medical images,… A famous example of the latter type of “outlier” discovered unexpectedly was the ozone hole [62, p.2]. In the former case, the challenge is to construct predictions as sharp as if the dataset was clean and in the latter, the main question is to detect outliers.

Of course, any dataset is preprocessed to be “cleaned” from outliers. This step, called “data jujitsu”, “data massage”, “data wrangling” or “data munging” usually takes 80%80\% of data scientists time. It requires an important quantity of expertise to guide data scientists . Some platforms like “Amazon mechanical turk” allow to preprocess collectively datasets with millions of lines. Even when done with maximal care, some contamination always affect modern datasets, because outliers are particularly hard to detect in high dimension and because the distinction between outliers and informative data is often delicate in practice.

2 Robustness : state of the art

The book traces back the origins of robust statistics to the fundamental works of John Tukey , Peter Huber and Frank Hampel . Tukey asked the following question for the location problem of a Gaussian distribution “what happens if the distribution slightly deviates from the normal distribution?”. Tukey’s 1960 example shows the dramatic lack of distributional robustness of some classical procedures like Maximum Likelihood Estimator (MLE) [42, Example 1.1]. A related question was asked by Hodges about “tolerance to extreme values”.

In fact, one doesn’t define one robustness property in general. Estimation, prediction or more generally any statistical properties like consistency, asymptotic normality, minimax optimality, etc. require assumptions (cf. the “no free lunch theorem” ). Such assumptions are naturally questionable on real databases. What properties of an estimator pertain when one or several assumptions are not satisfied? As there are several types of assumptions, one can define several types of robustness. For example, Tukey’s question is about robustness to a misspecification of the Gaussian model while Hodges question can be understood as robustness to contamination of the dataset by large outliers or robustness to heavy-tailed distributions in the model that may have caused the appearance of large data.

satisfies for all f:X{1,1}f:{\cal X}\to\left\{-1,1\right\} (where X{\cal X} is the space where the XiX_{i} take their values and the Yi{1,1}Y_{i}\in\{-1,1\})

If the fraction O/N|{\cal O}|/N is small enough, the ERM over a class F{\cal F} of functions from X{\cal X} to {1,1}\{-1,1\}

performs theoretically as well as the ERM based only on the “good” data (Xi,Yi)iI(X_{i},Y_{i})_{i\in{\cal I}}

which is statistically optimal in some problems . In that case, the number of outliers O|{\cal O}| has to be less than NN times the rate of convergence of f^I\widehat{f}_{{\cal I}} – this is a condition we will meet later for our estimators.

2.2 Robustness to heavy-tailed data

The Gaussian assumption on the noise Y1f(X1)Y_{1}-f^{*}(X_{1}) in statistics is usually weakened in learning theory to a subgaussian assumption (the LpL_{p}-moments of the noise are bounded by those of a Gaussian variable). This subgaussian assumption is already much more flexible and essentially allows to extend the use of least-squares estimators and their penalized versions to a much broader class of possible distributions for the noise. Going beyond the subgaussian assumption is in general way more technical. One way to proceed is to assume a subexponential assumption – random variables whose LpL_{p} moments are smaller than those of an exponential variable– as in the Bernstein’s condition in . More recently, the subexponential assumption has also been relaxed into LpL_{p}-moment assumptions. Least-squares estimators have been studied as well as some modifications less sensitive to large data for instance in . It appears that the statistical properties of least-squares estimators are deteriorated when the noise satisfies weaker assumptions . A modification of the ERM based on Le Cam’s estimation by tests principle has also been considered in .

3 Robustness in regression

Most of the robustness literature of the 1980s was on the regression problem , [42, Section 7.12]. These methods make a distinction between the problem of robustness with respect to the outputs (Yi)i=1N(Y_{i})_{i=1}^{N} on one side and with respect to the inputs (Xi)i=1N(X_{i})_{i=1}^{N} on the other side [42, Chapter 7]. Outliers in the inputs remain extremely difficult to handle while many solutions have been proposed to deal with outliers in the outputs.

Even if some solutions have been proposed in particular cases, see for example , the problem of robustness with respect to outliers in the inputs X1,,XNX_{1},\ldots,X_{N} (also called leverage points), has not been solved in general neither in theory nor in practice. In the book , a three steps approach is suggested to solve the robustness problem in regression

find routine methods when there are only few leverage points,

find analytical methods for identifying leverage points and, if possible, leverage groups.

In other words, suggest to make an important “datacleaning” step before any analysis for solving the leverage point problem. As already discussed, this may not be possible for nowadays large datasets. Therefore, designing procedures that can handle outliers such as leverage points is an important task that requires new ideas.

In statistics, an elegant solution to this issue (and many others) called ρ\rho-estimators has recently been proposed. However, ρ\rho-estimators do not provide a satisfying solution for learning issues since they require a statistical model (even if almost no assumption besides a controlled massiveness are required on this model), they are specifically built for the Hellinger loss (although they seem to work extremely well for other risk function in some special cases) and they remain far from being computable in general which is dead-end for Machine Learning.

3.2 Outliers in the outputs

Robust procedures can be analyzed under the assumption that “we can act as if the xijx_{ij} are free of gross errors” [42, paragraph 3 in Section 7.3] (xijx_{ij} being the jj-th coordinate of XiX_{i} for i[N]i\in[N] and j[d]j\in[d]). In this approach, the square loss is replaced by a convex differentiable function ρ\rho and the estimator minimizes

If ψ=ρ\psi=\rho^{\prime}, the estimator is also solution of the equation

Many alternative robust procedures can be found in the regression literature, among other, one can mention the least median of squared residual estimator and the least trimmed sum of squares estimator which minimizes the sum of the N/2+1N/2+1 smallest squared residuals. Finally, introduced Least Median of Squares in regression which naturally extends the median estimator in the location problem :

Its computational complexity rises exponentially with dd [37, p.330], .

3.3 Other challenges

We discussed the problems of detecting outliers and resisting corruption of the inputs earlier. Besides these, robust statistic has left some interesting open questions that have to be addressed.

Most results in robust statistics are asymptotic but the asymptotic approach may be quite misleading : “ […] approaches based on asymptotics are treacherous, and it could be quite misleading to transfer insights gained from asymptotic variance theory or from infinitesimal approaches (gross error sensitivity and the like) by heuristics to leverage points with hi>0.2h_{i}>0.2” [42, p.189]. In other words, robustness issues ultimately have to be tackled from a non-asymptotic point of view. Again, ρ\rho-estimators treat this issue from a statistical perspective but cannot be applied in our learning setting.

When designing robust strategies for ML, one should keep in mind that these should be computationally tractable. To the best of our knowledge, no estimator robust to outlier in the outputs nor any general learning procedure robust to heavy-tailed data can currently be used in a high dimensional learning problems. There is currently no available algorithm for high dimensional learning whose good statistical performance don’t break down in the presence of one outlier in the dataset.

4 Contributions

In this paper, we propose a new estimator of the learning function

This new estimator has minimax-optimal non-asymptotic risk bounds even when the dataset has been corrupted by outliers (that may have corrupted indifferently the inputs, the outputs or both of them) and when informative data (those that are not outliers) only satisfy somehow minimal assumptions relating their L2L_{2} geometry on FF to that of PP (see Sections 1.4.1 and 3.1 for details). This estimator solves therefore all challenges related to outliers in the learning theory literature including “adversarial outliers” – which are data satisfying no assumptions – and “stochastic outliers” – which are informative data that may look like outliers because of some weak concentration properties due to heavy tail distributions.

We also show that our estimator can be used for ML purposes by providing several algorithms to compute proxies of our estimators and illustrate their performance in an extensive simulation study (see Section A). This opens several interesting conjectures for example, to study and compare theoretically convergence of these algorithms, and to automatically calibrate our estimators.

We propose the following setup that allows the presence of outliers and relax the standard i.i.d. setup. We assume that data are partitioned in two groups (unknown from the statistician), more precisely {1,,N}=OI\{1,\ldots,N\}={\cal O}\cup{\cal I}, with OI={\cal O}\cap{\cal I}=\emptyset

data (Xi,Yi)iO(X_{i},Y_{i})_{i\in{\cal O}} (O{\cal O} stands for “outliers”) are not assumed to be independent nor independent to the other ones (Xi,Yi)iI(X_{i},Y_{i})_{i\in{\cal I}}, they may not be identically distributed, in fact nothing is required on these data; they can even be adversarial in the sense that may have been designed to make the learning task harder,

data (Xi,Yi)iI(X_{i},Y_{i})_{i\in{\cal I}} (called informative) are the only data on which we can rely on to solve our learning task. In what follows, we only need very weak assumptions on those data to make MOM estimators achieving optimal statistical properties. In particular, we will always assume that the informative data are independent.

As we allow the presence of outliers, our procedure will be in a sense robust to the i.d. (identically distributed) assumption. This robustness goes actually further, since even informative data won’t be assumed identically distributed. Instead, our procedure will be shown to work under the much weaker requirement that the distributions of (Xi,Yi)(X_{i},Y_{i}) for all iIi\in{\cal I} induce an equivalent L2L_{2} metric over the class FF and equivalent covariance between functions in FF and the output. If various distributions have equivalent first and second moments for all functions fFf\in F, our estimators will exhibit the same optimal performance as if they were i.i.d. Gaussian with similar first and second moments on all fFf\in F. Our new assumption feels more natural than the usual i.i.d one as distributions inducing the same “risk geometry” on FF seem interesting to solve our learning task and any other higher moment assumption above 22 seems unnatural. However, the standard ERM once again would fail to achieve this goal as the behavior of the supremum over subclasses of FF of the empirical process used to bound the risk of the ERM depends in general on higher moments of these distributions.

4.2 Quantifying robustness by breackdown point

The breakdown point of an estimator is the smallest proportion of corrupted observations necessary to push an estimator to infinity [28, p.1809], [42, p.279]. The notion has been introduced by Hampel in his 1968 Ph. D. thesis. In , the notion was revisited with the following non-asymptotic definition, see also . Let DI{\cal D}_{{\cal I}} denote a dataset and let TT be an estimator. If there exists a strategic / malicious / adversarial choice of another dataset DO{\cal D}_{{\cal O}} such that T(DIDO)T(DI)T({\cal D}_{{\cal I}}\cup{\cal D}_{{\cal O}})-T({\cal D}_{{\cal I}}) is arbitrarily large, the estimator is said to break down under the contamination fraction DO/(DI+DO)|{\cal D}_{\cal O}|/(|{\cal D}_{\cal I}|+|{\cal D}_{\cal O}|). The Donoho-Hampel-Huber breakdown point ϵ(T,DI)\epsilon^{*}(T,{\cal D}_{{\cal I}}) is then defined as the smallest contamination fraction under which estimator TT breaks down:

For the estimation of the mean μ\mu of a random variable from NN i.i.d. observations, the empirical mean can be made arbitrarily large by adding a single “bad” observation. The breakdown point of the empirical mean estimator is therefore 1/(N+1)1/(N+1) and this is the worst possible value for an estimator. The empirical mean is not robust to outliers, it performs very poorly in corrupted environments. On the other hand, the empirical median has a breakdown value of 1/21/2 for the location problem.

Among equivariant estimators, the common sense heuristic applies : “when more than half data are bad, one cannot learn from good ones” . In particular, among equivariant estimators, 1/21/2 is the best possible breakdown point for an estimator and the empirical median is optimal from this perspective. When the parameter of interest belongs to a bounded set, one can adapt the definition of breakdown point as in the discussion papers of . Other generalization to quantile estimation can be found in .

4.3 A modified notion of breakdown point for ML

We introduce a new concept of breakdown point for Learning. The standard breakdown point (4) does not come with any estimation or prediction property. As already mentioned, this leads to consider restricted classes of estimators to study optimality from the breakdown point of view to avoid degenerate cases. Instead of imposing algebraic restrictions, our new definition focus on the risk of the procedures.

where PXP_{X} denotes the marginal on X{\cal X} of PP.

In words, the breakdown number is the minimal number of points one has to add to a dataset to break statistical performance of an estimator TT. In this definition, the rate R{\cal R} one can achieve is of particular importance. Ultimately, we would like to take it as small as possible. To that end, recall the definition of a minimax rate of convergence.

The minimax accuracy R(δ,F){\cal R}(\delta,F) is the smallest accuracy that can be achieved uniformly over FF with confidence δ\delta by some estimator.

The minimax rate of convergence defined in Definition 2 is obtained in the “Gaussian model” (that is when the dataset is made of NN i.i.d. observations in the regression model with Gaussian design and Gaussian noise independent of the design) which is somehow the benchmark model in statistics in which one can derive minimax rates of convergence over FF. We will therefore use the minimax rates obtained in this model as benchmark rates of convergence.

Using this new definition, one can be interested in various existence or optimality properties such as:

Given 0<δ<10<\delta<1, C>0C>0, what is the largest set of sample distributions Pδ,C{\cal P}_{\delta,C} where one can build estimators with accuracy CR(δ,F)C{\cal R}(\delta,F) of the same order as in the Gaussian model ?

given 0<δ<10<\delta<1, R>R(δ,F){\cal R}>{\cal R}(\delta,F) and PPδ,C{\cal P}\subset{\cal P}_{\delta,C}, is there some procedure TT such that KML(T,N,R,δ,P)>0K^{*}_{\text{ML}}(T,N,{\cal R},\delta,{\cal P})>0?

given 0<δ<10<\delta<1, R>R(δ,F){\cal R}>{\cal R}(\delta,F) and PPδ,C{\cal P}\subset{\cal P}_{\delta,C}, how large can be supTKML(T,N,R,δ,P)\sup_{T}K^{*}_{\text{ML}}(T,N,{\cal R},\delta,{\cal P})? and is this maximum achievable by a computationally tractable estimator TT?

Most recent theoretical works on robust learning approaches focus on the first question, showing that statistical optimality can actually be achieved over much broader classes of distributions than the Gaussian model, see for example. In this paper, we build on the approach introduced in where we proved that, for any δ\delta smaller than exp(CNrN2)\exp(-CNr_{N}^{2}), there exists estimators such that KML(T,N,rN2,δ,P)CNrN2K^{*}_{\text{ML}}(T,N,r_{N}^{2},\delta,{\cal P})\geqslant CNr_{N}^{2} where rN2r_{N}^{2} is the minimax rate of convergence in the Gaussian model for exponential deviation. It is for instance given by rN2=σ2d/Nr_{N}^{2}=\sigma^{2}d/N for the problem of estimation of a dd-dimensional vector and rN2=σ2slog(ed/s)/Nr_{N}^{2}=\sigma^{2}s\log(ed/s)/N when this vector has only ss non-zero coordinates. But we go further in this paper, showing that these performance can be achieved by a procedure that is computationally tractable, hence by a Machine Learning algorithm. This ultimate breakthrough is due to a new estimator that we shall now introduce.

4.4 Our estimators

The key equation to understand our procedure is that the target

can be obtained as the solution of the following minimaximization problem

This remark is obvious since the expectation operator is linear, it is fundamental though in our construction as we use non-linear estimators of the expectation. More precisely, unknown expectations are estimated by median-of-means (MOM) estimators : given a partition of the dataset into blocks of equal sizes, MOM is the median of the empirical means over each block, see Section 2.2 for details. MOM estimators are naturally more robust than the empirical mean thanks to the median step : if the number of blocks is at least twice that of the outliers, outliers can only affect less than half blocks so MOM remains an acceptable estimator of the mean. MOM’s estimators are non linear, therefore plugging a MOM estimator into the minimization problem doesn’t yield the same estimator as the plugging estimator on the minimaximization problem. More precisely, it is not so hard to see that MOM minimizers cannot achieve (fast) minimax rates R(δ,F){\cal R}(\delta,F). This is why we focus in the following on the minimaximization problem

MOM estimators of the increments of criteria have already been used in . What is new here is that, instead of combining these estimators to build a “tournament” as in or using the approach of Le Cam as in , we simply plug these estimators in the minimaximization problem. Our construction is therefore extremely simplified compared to these previous papers, we’ll show that our new estimator achieves the same theoretical properties as these alternatives.

Then one can perform a descent/ascent algorithm over the block BmedB_{\text{med}} to get the next iteration (ft+1,gt+1)(f_{t+1},g_{t+1}). In practice though, this basic idea can be substantially improved by shuffling the blocks at each time step, cf. Section A.8. The first advantage of this shuffling step is that the algorithm doesn’t converge to local minimaxima.

Furthermore, our algorithm defines a natural notion of depth of data : deep data are typically regularly chosen as member of the median block BmedB_{\text{med}} while “outliers” on the other hand are typically left aside. This notion of depth, based on the risk function, is natural in our learning framework and should probably be investigated more carefully in future works. It also suggests an empirical definition of outliers and therefore an outliers detection algorithm as a by-product.

The paper is organised as follow. Section 2 introduces formally the classical least-squares learning framework and presents our new estimator, Section 3 details our main theoretical results, that are proved in Section 4. An extended simulation study is provided in Section A, including the presentation of many robust versions of standard optimization algorithms. Theoretical abstract rates derived in the main results are particularized in Section B into important problems in Machine Learning to show the extent of our results. These main results focus on penalized version of the basic tests presented in this introduction that are well suited for high dimensional learning frameworks, we complete these results in Section C, providing results for the basic estimators without penalizations in small dimension. Finally, Section D provides some optimality results for our procedures.

Setting

For any Q{P,(Pi)i[N]}Q\in\{P,(P_{i})_{i\in[N]}\} and any p1p\geqslant 1, let fLQp=(Qfp)1/p\left\|f\right\|_{L^{p}_{Q}}=(Q|f|^{p})^{1/p} the LQpL_{Q}^{p}-norm of ff whenever it’s defined. Finally, let \left\|\cdot\right\| be a norm defined on the span of FF; \left\|\cdot\right\| will be used as a regularization norm.

A key observation in our analysis is that ff^{*} is solution of a minimaximization problem:

The minimaximization point of view has been considered in and . These papers used bounded modification of the quadratic loss. As for classification, the impact of outliers on the empirical mean is negligible if the loss is bounded and the number of outliers controlled, which ensures robustness. The cost to pay is that the function to be minimized is not convex and is therefore hard to implement. The idea of estimation by aggregation of tests is due to Le Cam and was further developed by Birgé , Baraud , Baraud, Birgé and Sart . MOM estimators have been recently considered see and Section 2.2 for details. Nevertheless, to the best of our knowledge, this paper is the first to consider the estimators obtained by plugging MOM’s estimators in the minimaximization problem (6). This approach will proved to be efficient for designing algorithms.

2 MOM estimator

We will make an extensive use of empirical medians and quantiles in the following. We now precise some conventions used repeatedly hereafter. For all α(0,1)\alpha\in(0,1) and real numbers x1,,xKx_{1},\ldots,x_{K}, we denote by

where in the last inequality we use the Minkowsky sum of two sets. More generally, inequalities involving empirical quantiles are always understood in the worst possible case. For example, one can check that, for all vectors xx and yy

In the following, we will simply write with some abuse of notation that for all vectors xx and yy:

Likewise, one can check that Q1/2(xy)Q3/4(x)Q1/4(y)Q_{1/2}(x-y)\leqslant Q_{3/4}(x)-Q_{1/4}(y) and Q1/2(x)Q1/2(x)Q_{1/2}(x)\geqslant-Q_{1/2}(-x). Moreover, to prove that Qα(x)tQ_{\alpha}(x)\geqslant t it is enough to prove that J[K],J(1α)K,kJ,xkt\exists J\subset[K],|J|\geqslant(1-\alpha)K,\forall k\in J,x_{k}\geqslant t and to prove that Qα(x)tQ_{\alpha}(x)\leqslant t it is enough to prove that J[K],JαK,kJ,xkt\exists J\subset[K],|J|\geqslant\alpha K,\forall k\in J,x_{k}\leqslant t.

We conclude this section with the definition of MOM tests and their regularized versions.

and, for a given regularization parameter λ0\lambda\geqslant 0, its regularized version is

3 Minimaximization of MOM tests

One can rewrite our estimators as cost minimization estimators

where for all fF,f\in F, {\cal C}_{K}\big{(}f\big{)}=\sup_{g\in F}T_{K}(g,f) and {\cal C}_{K,\lambda}\big{(}f\big{)}=\sup_{g\in F}T_{K,\lambda}(g,f). These cost functions play a central role in the theoretical analysis of f^K\hat{f}_{K} and f^K,λ\hat{f}_{K,\lambda} (cf. Section 4).

Compared to the aggregation of tests by “tournaments” , or “Le Cam type” aggregation of tests , minimaximization estimators have several avantages. First they are very natural from Eq (6) and they do not require a proxy for the excess risk or for the L2(P)L^{2}(P)-norm, which simplifies the presentation of this estimator. Finally, while building the estimators of or is totally impossible, these new estimators are minimaximization procedures of locally convex-concave functions that could be approximated by a numerical scheme. It is actually easy to “convert” most of the classical algorithms used for “cost minimization” into an algorithm for minimaximization solving (7) (see Section A below).

Assumptions and main results

Denote by OI{\cal O}\cup{\cal I} a partition of [N][N], where O{\cal O} has cardinality O|{\cal O}|. Data (Xi,Yi)iO(X_{i},Y_{i})_{i\in{\cal O}} are considered as outliers, no assumptions on the joint distribution of these data or on their distribution conditionally on data (Xi,Yi)iI(X_{i},Y_{i})_{i\in{\cal I}} is made. These data may not be independent, nor independent from the remaining data. The remaining set (Xi,Yi)iI(X_{i},Y_{i})_{i\in{\cal I}} is the set of informative data, that is the ones one can use for estimation. These are hereafter assumed independent. Given the data (Xi,Yi)i[N](X_{i},Y_{i})_{i\in[N]} no one knows in advance which data is informative or not.

All the assumptions we need to get the results involve only first and second moment of the distributions P,(Pi)iIP,(P_{i})_{i\in{\cal I}} on functions in FF and YY. In particular, this setting strongly relaxes usual strong concentration assumptions made on the informative data to study ERM estimators.

There exists θr0>0\theta_{r0}>0 such that for all fFf\in F and all iIi\in{\cal I},

Of course, Assumption 1 holds in the i.i.d. framework, with θr0=1\theta_{r0}=1 and I=[N]{\cal I}=[N]. The second assumption bounds the correlation between the “noise” ζi=Yif(Xi)\zeta_{i}=Y_{i}-f^{*}(X_{i}) and the shifted class FfF-f^{*}.

There exists θm>0\theta_{m}>0 such that for all iIi\in{\cal I} and all fFf\in F,

and so Assumption 2 holds for θm=θ1θ2\theta_{m}=\theta_{1}\theta_{2}.

Now, let us introduce a norm equivalence assumption over FfF-f^{*}: we call it a L2/L1L^{2}/L^{1} assumption.

There exists θ01\theta_{0}\geqslant 1 such that for all fFf\in F and all iIi\in{\cal I}

Note that ffLPi1ffLPi2\left\|f-f^{*}\right\|_{L^{1}_{P_{i}}}\leqslant\left\|f-f^{*}\right\|_{L^{2}_{P_{i}}} for all fFf\in F and iIi\in{\cal I}. Therefore, Assumption 1 and Assumption 3 are together equivalent to assume that all the norms LP2,LPi2,LPi1,iIL^{2}_{P},L_{P_{i}}^{2},L_{P_{i}}^{1},i\in{\cal I} are equivalent over FfF-f^{*}. Note also that Assumption 3 is equivalent to the small ball property (cf. ) which has been recently used in Learning theory and signal processing. We refer to for examples of distributions satisfying the small ball assumption.

Let ZZ be a real-valued random variable. The following holds:

therefore, p(θ01κ0)2p\geqslant(\theta_{0}^{-1}-\kappa_{0})^{2}.

2 Complexity measures and minimax rates of convergence

Balls associated with the regularization norm \left\|\cdot\right\| and the LP2L^{2}_{P} norm play a prominent role in learning theory . In particular, for all ρ0\rho\geqslant 0, the “sub-models”

where B={fspan(F),fρ}B=\{f\in{\rm span}(F),\left\|f\right\|\leqslant\rho\} and their “localizations” at various level r0r\geqslant 0, i.e. intersection of B(f,ρ)B(f^{*},\rho) with LP2L^{2}_{P}-balls

are key sets because their Rademacher complexities, drives the minimax rates of convergence. Let us introduce these complexity measures.

Let (ϵi)i[N](\epsilon_{i})_{i\in[N]} be independent Rademacher random variables (i.e. uniformly distributed in {1,1}\{-1,1\}), independent from (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}. For all fFf\in F, r>0r>0 and ρ(0,+]\rho\in(0,+\infty], we denote the intersection of the \left\|\cdot\right\|-ball of radius rr and the LP2L^{2}_{P}-norm of radius ρ\rho centered at ff by

Let ζi=Yif(Xi)\zeta_{i}=Y_{i}-f^{*}(X_{i}) for all iIi\in{\cal I} and for γQ,γM>0\gamma_{Q},\gamma_{M}>0 define

and let ρr(ρ,γQ,γM)\rho\to r(\rho,\gamma_{Q},\gamma_{M}) be a continuous and non decreasing function such that for every ρ>0\rho>0,

It follows from Lemma 2.3 in that rMr_{M} and rQr_{Q} are continuous and non decreasing functions. Note that rM(),rQ()r_{M}(\cdot),r_{Q}(\cdot) depend on ff^{*}. According to , if one can choose r(ρ)r(\rho) equal to the maximum of rM(ρ)r_{M}(\rho) and rQ(ρ)r_{Q}(\rho) then r(ρ)r(\rho) is the minimax rate of convergence over B(f,ρ)B(f^{*},\rho). Note also that rQr_{Q} and rMr_{M} are well defined when IN/2{\cal I}\geqslant N/2, which implies that at least half data are informative.

3 The sparsity equation

To control the risk of our estimator, we bound from above TK,λ(f,f)T_{K,\lambda}(f,f^{*}) for all functions ff far from ff^{*} either in LP2L^{2}_{P}-norm or for the regularization norm \left\|\cdot\right\|. Recall that

The multiplier term “2ζ(ff)2\zeta(f-f^{*})” is the one containing the noise and is therefore the term we will try to control from above using either the quadratic process “(ff)2(f-f^{*})^{2}” or the regularization term “λ(ff)\lambda(\left\|f^{*}\right\|-\left\|f\right\|)”. To that end we will need both to control from below the quadratic process and the regularization term.

Let fFf\in F and denote ρ=ff\rho=\left\|f-f^{*}\right\|. When ffLP2\|f-f^{*}\|_{L^{2}_{P}} is small, the quadratic term (ff)2(f-f^{*})^{2} will not help to bound from above TK,λ(f,f)T_{K,\lambda}(f,f^{*}), one shall only rely on the penalization term λ(ff)\lambda(\left\|f^{*}\right\|-\left\|f\right\|). One can bound from below ffρ\left\|f^{*}\right\|-\left\|f\right\|\gtrsim\rho for all ff close to ff^{*} in LP2L_{P}^{2} using the sparsity equation of . First, introduce the subdifferentials of \left\|\cdot\right\| : for all fFf\in F,

where (E,)(E^{*},\left\|\cdot\right\|^{*}) is the dual normed space of (E,)(E,\left\|\cdot\right\|).

For any ρ>0\rho>0, let HρH_{\rho} denote the set of functions “close” to ff^{*} in LP2L_{P}^{2} and at distance ρ\rho from ff^{*} in regularization norm and let Γf(ρ)\Gamma_{f^{*}}(\rho) denote the set of subdifferentials of all vectors close to ff^{*}:

If there exists a “sparse” ff^{**} in {fF:ffρ/20}\{f\in F:\left\|f^{*}-f\right\|\leqslant\rho/20\}, that is ()f(\partial\left\|\cdot\right\|)_{f^{**}} is almost all the unit dual sphere, then ff\left\|f\right\|-\left\|f^{**}\right\| is large for any fHρf\in H_{\rho} so ffffff\left\|f\right\|-\left\|f^{*}\right\|\geqslant\left\|f\right\|-\left\|f^{**}\right\|-\left\|f^{*}-f^{**}\right\| is large as well. More precisely, let us introduce, for all ρ>0\rho>0,

The sparsity equation, introduced in , quantifies these notions of “large”.

A radius ρ>0\rho>0 is said to satisfy the sparsity equation when Δ(ρ)4ρ/5\Delta(\rho)\geqslant 4\rho/5.

One can check that, if ρ\rho^{*} satisfies the sparsity equation, so do all ρρ\rho\geqslant\rho^{*}. Therefore, one can define

Note that if ρ20f\rho\geqslant 20\left\|f^{*}\right\| then 0Γf(ρ)0\in\Gamma_{f^{*}}(\rho). Moreover, ()0(\partial\left\|\cdot\right\|)_{0} equals to the dual ball (i.e. the unit ball of (E,)(E^{*},\left\|\cdot\right\|^{*})) and so Δ(ρ)=ρ\Delta(\rho)=\rho. This implies that any ρ20f\rho\geqslant 20\left\|f^{*}\right\| satisfies the sparsity equation. This simple observation will be used to get “complexity-dependent rates of convergence” as in .

4 Main results

Our first results study the performance of the estimators f^K,λ\widehat{f}_{K,\lambda} for a fixed value of KK. The other ones will provide an adaptive way to select KK.

Grant Assumptions 1, 2 and 3 and let rQr_{Q}, rMr_{M} denote the functions introduced in Definition 5. Assume that N384(θ0θr0)2N\geqslant 384(\theta_{0}\theta_{r0})^{2} and ON/(768θ02)|{\cal O}|\leqslant N/(768\theta_{0}^{2}). Let ρ\rho^{*} be solution to the sparsity equation from Definition 6. Let KK^{*} denote the smallest integer such that

where ϵ=1/(833θ02)\epsilon=1/(833\theta_{0}^{2}) and r2()r^{2}(\cdot) is defined in Definition 5 for γQ=(384θ0)1\gamma_{Q}=(384\theta_{0})^{-1} and γM=ϵ/192\gamma_{M}=\epsilon/192. For any KKK\geqslant K^{*}, define the radius ρK\rho_{K} and the regularization parameter as

Assume that for every iIi\in{\cal I}, K[max(K,O),N]K\in[\max(K^{*},|{\cal O}|),N] and fFf\in F such that ffρ\left\|f-f^{*}\right\|\leqslant\rho for ρ[ρK,2ρK]\rho\in[\rho_{K},2\rho_{K}], one has

Then, for all K[max(K,8O),N/(96(θ0θr0)2)]K\in[\max(K^{*},8|{\cal O}|),N/(96(\theta_{0}\theta_{r0})^{2})], with probability larger than 14exp(7K/9216)1-4\exp(-7K/9216), the estimator f^K,λ{\hat{f}}_{K,\lambda} defined in Section 2.3 satisfies

Assumption (8) holds for instance when for every iIi\in{\cal I} and fFf\in F one has

These assumptions involve second moments associated with PP and (Pi)iI(P_{i})_{i\in{\cal I}}. As a consequence, if the metrics LPi2L^{2}_{P_{i}} for iIi\in{\cal I} and LP2L^{2}_{P} coincide on the functions x(ff)(x)x\to(f-f^{*})(x) and (x,y)yf(x)(x,y)\to y-f(x) and the PiP_{i} satisfies Assumption 2 and 3 then Theorem 1 holds. This means that we only need informative data to induce the same L2L^{2} metric as PP to estimate the oracle ff^{*} even if we do not have any observation coming from PP itself. This setup relaxes the classical i.i.d. where all data are generated from PP. In this setting, our estimators achieve the same results as the ERM would if data were all i.i.d. with a noise ζ\zeta independent of XX and both XX and ζ\zeta had a Gaussian distribution (see Section D).

The function r()r(\cdot) is used to define the regularization parameter, so it cannot depend on ff^{*}. When rM(),rQ()r_{M}(\cdot),r_{Q}(\cdot) depend on ff^{*}, rr should be a computable upper bound independent from ff^{*}.

In Theorem 1, all rates depend on the choice of the tuning parameter KK. The following construction inspired from Lepski’s method provides an adaptive choice of this parameter. Let us first recall the definition of empirical criterion introduced in Section 2.3 and the associated confidence regions: for all J[K]J\in[K], λ>0\lambda>0, fFf\in F and absolute constant cad>0c_{ad}>0,

The following theorem shows the performance of these estimators.

Grant the assumptions of Theorem 1 and assume moreover that and ON/(768θ02θr02)|{\cal O}|\leqslant N/(768\theta_{0}^{2}\theta_{r0}^{2}). For any K[max(K,8O),N/(96(θ0θr0)2)]K\in[\max(K^{*},8|{\cal O}|),N/(96(\theta_{0}\theta_{r0})^{2})], with probability larger than

where cad=18/833c_{ad}=18/833 and ϵ=(833θ02)1\epsilon=(833\theta_{0}^{2})^{-1}. In particular, for K=KK=K^{*}, we have r(2ρK)=max(r(2ρ),O/N)r(2\rho_{K^{*}})=\max\left(r(2\rho^{*}),\sqrt{|{\cal O}|/N}\right). Therefore, if r(2ρ)c1r(ρ)r(2\rho^{*})\leqslant c_{1}r(\rho^{*}) holds for some absolute constant c1c_{1}, then the breakdown number of f^cad\widehat{f}_{c_{ad}} is larger than Nr(ρ)2Nr(\rho^{*})^{2}.

Note that r()r(\cdot) is any continuous, non decreasing function such that for all ρ>0\rho>0, r(ρ)max(rQ(ρ,γQ),rM(ρ,γM))r(\rho)\geqslant\max\left(r_{Q}(\rho,\gamma_{Q}),r_{M}(\rho,\gamma_{M})\right). In particular, if r:ρmax(rQ(ρ,γQ),rM(ρ,γM))r_{*}:\rho\to\max\left(r_{Q}(\rho,\gamma_{Q}),r_{M}(\rho,\gamma_{M})\right) is itself a continuous function (it is clearly non decreasing) then for every x>0x>0, r(ρ)=max(rQ(ρ,γQ),rM(ρ,γM))+x/Nr(\rho)=\max\left(r_{Q}(\rho,\gamma_{Q}),r_{M}(\rho,\gamma_{M})\right)+x/N is another non decreasing upper bound. Therefore, one can derive results similar to those in Theorem 2 but with an extra confidence parameter : for all x>0x>0, with probability at least 14exp(c0Nr2(ρK)+c0x)1-4\exp(-c_{0}Nr_{*}^{2}(\rho_{K^{*}})+c_{0}x),

Note however that f^cad\widehat{f}_{c_{ad}} depends on xx through the regularization parameter λ=16ϵ(r(ρK)+x/N)/ρK\lambda=16\epsilon(r_{*}(\rho_{K})+x/N)/\rho_{K}.

Proofs

Upper and lower bounds on TK(,)T_{K}(\cdot,\cdot) follow from a study of “quadratic” and “multiplier” quantiles of means processes. As no assumption is granted on the outliers, any block of data containing one or more of these outliers is “lost” from our perspective meaning that empirical means over these blocks cannot be controlled. Let K{\cal K} denote the set of blocks which have not been corrupted by outliers:

If kKk\in{\cal K}, all data indexed by BkB_{k} are informative. We will show that controls on the blocks indexed by K{\cal K} are sufficient to insure statistical performance of MOM estimators.

The first result is a lower bound on the quantiles of means quadratic processes.

In particular, Qη,K((ff)2)(4θ0)2ffLP22Q_{\eta,K}((f-f^{*})^{2})\geqslant(4\theta_{0})^{-2}\left\|f-f^{*}\right\|^{2}_{L^{2}_{P}}.

Define Fρ=B(f,ρ)={fF:ffρ}F^{*}_{\rho}=B(f^{*},\rho)=\{f\in F:\left\|f-f^{*}\right\|\leqslant\rho\}. For all fFρf\in F^{*}_{\rho}, let nf=(ff)/ffLP2n_{f}=(f-f^{*})/\left\|f-f^{*}\right\|_{L^{2}_{P}} and note that for all iIi\in{\cal I}, Pinfθ01P_{i}|n_{f}|\geqslant\theta_{0}^{-1} by Assumption 3 and Pinf2θr02P_{i}n_{f}^{2}\leqslant\theta_{r0}^{2} by Assumption 1. It follows from Markov’s inequality that, for all kKk\in{\cal K} (K{\cal K} is defined in (10)),

Since KNα/(2θ0θr0)2K\leqslant N\alpha/(2\theta_{0}\theta_{r0})^{2}, Bk=N/K(2θ0θr0)2/α|B_{k}|=N/K\geqslant(2\theta_{0}\theta_{r0})^{2}/\alpha and so

Denote F={fF:ffρ,  ffLP2rQ(ρ,γQ)}{\cal F}=\{f\in F:\left\|f-f^{*}\right\|\leqslant\rho,\;\left\|f-f^{*}\right\|_{L^{2}_{P}}\geqslant r_{Q}(\rho,\gamma_{Q})\}. By the bounded difference inequality (see, for instance [17, Theorem 6.2]), there exists an event ΩQ(K)\Omega_{Q}(K) with probability larger than 1exp(x2K/2)1-\exp(-x^{2}|{\cal K}|/2), on which, for all fFf\in{\cal F},

Since the function ϕ\phi is 1-Lipschitz and ϕ(0)=0\phi(0)=0, by the contraction principle (see, for example [57, Chapter 4] or [17, Theorem 11.6]), we have

The family (ϵ[i]nf(Xi):ikKBk)(\epsilon_{[i]}|n_{f}(X_{i})|:i\in\cup_{k\in{\cal K}}B_{k}), where [i]=i/K[i]=\lceil i/K\rceil for all iIi\in{\cal I}, is a collection of centered random variables. Therefore, if (ϵk)kK(\epsilon^{\prime}_{k})_{k\in{\cal K}} and (Xi)iI(X_{i}^{\prime})_{i\in{\cal I}} denote independent copies of (ϵk)kK(\epsilon_{k})_{k\in{\cal K}} and (Xi)iI(X_{i})_{i\in{\cal I}} then

Then, as (Xi)iI(X_{i})_{i\in{\cal I}} and (Xi)iI(X_{i}^{\prime})_{i\in{\cal I}} are two independent families of independent variables therefore, if (ϵi)iI(\epsilon_{i}^{\prime\prime})_{i\in{\cal I}} denote a family of i.i.d. Rademacher variables independent of (ϵi),(ϵi),(Xi)iI,(Xi)iI(\epsilon_{i}),(\epsilon_{i}^{\prime}),(X_{i})_{i\in{\cal I}},(X_{i}^{\prime})_{i\in{\cal I}} then (ϵknf(Xi)ϵknf(Xi))(\epsilon_{k}|n_{f}(X_{i})|-\epsilon_{k}^{\prime}|n_{f}(X_{i}^{\prime})|) and (ϵi(ϵknf(Xi)ϵknf(Xi)))\left(\epsilon_{i}^{\prime\prime}\left(\epsilon_{k}|n_{f}(X_{i})|-\epsilon_{k}^{\prime}|n_{f}(X_{i}^{\prime})|\right)\right) have the same distribution. Therefore,

By the contraction principle, on ΩQ(K)\Omega_{Q}(K),

For any fFf\in{\cal F}, rQ(ρ,γQ)nf+fFr_{Q}(\rho,\gamma_{Q})n_{f}+f^{*}\in F because FF is convex. Moreover, rQ(ρ,γQ)nfLP2=rQ(ρ,γQ)\left\|r_{Q}(\rho,\gamma_{Q})n_{f}\right\|_{L^{2}_{P}}=r_{Q}(\rho,\gamma_{Q}) and rQ(ρ,γQ)nf=[rQ(ρ,γQ)/ffLP2]ffρ\left\|r_{Q}(\rho,\gamma_{Q})n_{f}\right\|=[r_{Q}(\rho,\gamma_{Q})/\left\|f-f^{*}\right\|_{L^{2}_{P}}]\left\|f-f^{*}\right\|\leqslant\rho. Therefore, rQ(ρ,γQ)nf+fFr_{Q}(\rho,\gamma_{Q})n_{f}+f^{*}\in{\cal F}. Therefore, by definition of rQ(ρ,γQ)r_{Q}(\rho,\gamma_{Q}),

Using the last inequality together with (12) and the assumption KO/(1γ)K\geqslant|{\cal O}|/(1-\gamma) (so that KKOγK|{\cal K}|\geqslant K-|{\cal O}|\geqslant\gamma K), we get that, on the event ΩQ(K)\Omega_{Q}(K), for any fFf\in{\cal F},

Hence, on ΩQ(K)\Omega_{Q}(K), for any fFf\in{\cal F}, there exists at least (1η)K(1-\eta)K blocks BkB_{k} for which PBknf(4θ0)1P_{B_{k}}|n_{f}|\geqslant(4\theta_{0})^{-1}. On these blocks, PBknf2(PBknf)2(4θ0)2P_{B_{k}}n_{f}^{2}\geqslant(P_{B_{k}}|n_{f}|)^{2}\geqslant(4\theta_{0})^{-2}, therefore, on ΩQ(K)\Omega_{Q}(K), Qη,K[nf2](4θ0)2Q_{\eta,K}[n_{f}^{2}]\geqslant(4\theta_{0})^{-2}.

Now, let us turn to a control of the multiplier process.

For all k[K]k\in[K] and fFf\in F, set Wk=((Xi,Yi))iBkW_{k}=((X_{i},Y_{i}))_{i\in B_{k}} and define

Let fFf\in F and kKk\in\mathcal{K}. It follows from Markov’s inequality that

Let J=kKBkJ=\cup_{k\in\mathcal{K}}B_{k} and let rM(ρ)=rM(ρ,γM)r_{M}(\rho)=r_{M}(\rho,\gamma_{M}). We have

where in the last but one inequality, we used that the class FF is convex and the same argument as in the proof of Lemma 1. Since (ϵ[i](ζi(ff)(Xi)Piζi(ff)):iI)(\epsilon_{[i]}(\zeta_{i}(f-f^{*})(X_{i})-P_{i}\zeta_{i}(f-f^{*})):i\in{\cal I}) is a family of centered random variables, one can use the symmetrization argument to get

where the definition of rM(ρ)r_{M}(\rho) has been used in the last but one inequality.

where we used (4.1) in the last inequality.

Furthermore, it follows from by the symmetrization argument that

and, from the contraction principle and (4.1), that

In conclusion, on ΩM(K)\Omega_{M}(K), for all fB(f,ρ)f\in B(f^{*},\rho),

Let us first introduce the event on which the statement of Theorem 1 holds. Denote by Ω(K)\Omega(K) the intersection of the events ΩQ(K)\Omega_{Q}(K), ΩM(K)\Omega_{M}(K) defined respectively in Lemmas 1 and 2 for ρ{κρK:κ{1,2}}\rho\in\{\kappa\rho_{K}:\kappa\in\{1,2\}\} and

for some absolute constants c>0c>0 to be specified later. For these values, conditions in both Lemmas 1 and 2 are satisfied:

if ffLP2rQ(ρ,γQ)\left\|f-f^{*}\right\|_{L^{2}_{P}}\geqslant r_{Q}(\rho,\gamma_{Q}) then

there exists 3K/43K/4 block BkB_{k} with kKk\in{\cal K}, for which

Moreover, on the blocks BkB_{k} where (17) holds, it follows from assumption in (8) that all fFf\in F such that ffρ\left\|f-f^{*}\right\|\leqslant\rho satisfies

It follows from the convexity of FF and the nearest point theorem that P[2ζ(ff)]0P[2\zeta(f-f^{*})]\leqslant 0 for all fFf\in F, therefore, for all fFf\in F such that ffρ\left\|f-f^{*}\right\|\leqslant\rho,

Moreover, still on the blocks BkB_{k} where (17) holds, one also has, thanks to assumption (8), that for all fFf\in F such that ffρ\left\|f-f^{*}\right\|\leqslant\rho,

It follows that, for all fFf\in F such that ffρ\left\|f-f^{*}\right\|\leqslant\rho,

The main result of this section is Lemma 3. It will be used to bound from above the criterion {\cal C}_{K,\lambda}\big{(}f^{*}\big{)}=\sup_{g\in F}T_{K,\lambda}(g,f^{*}). Recall that ρK\rho_{K} and λ\lambda are defined as

where ϵ=(cθ02)1\epsilon=(c\theta_{0}^{2})^{-1} and c,cc,c^{\prime}\geqslant are absolute constants. We also need to consider a partition of the space FF according to the distance between gg and ff^{*} w.r.t. \left\|\cdot\right\| and LP2\left\|\cdot\right\|_{L_{P}^{2}} as in Figure 2: define for all κ1\kappa\geqslant 1,

On the event Ω(K)\Omega(K), it holds for all κ{1,2}\kappa\in\{1,2\},

when c32c\geqslant 32 and 10ϵ/4cϵ((4θ0)22ϵ)10\epsilon/4\leqslant c^{\prime}\epsilon\leqslant((4\theta_{0})^{-2}-2\epsilon).

Let gF1(κ)g\in F_{1}^{(\kappa)}. Since the quadratic process is non negative,

Therefore, applying (18) for ρ=κρK\rho=\kappa\rho_{K} and the choice of ρK\rho_{K} and λ\lambda as in (20), we get

Let gF2(κ)g\in F_{2}^{(\kappa)}. Given that Q1/2(xy)Q3/4(x)Q1/4(y)Q_{1/2}(x-y)\leqslant Q_{3/4}(x)-Q_{1/4}(y) for any vector xx and yy, we have

Moreover 2ϵ(4θ0)22\epsilon\leqslant(4\theta_{0})^{-2} when c32c\geqslant 32, so it follows from (16) and (18) for ρ=κρK\rho=\kappa\rho_{K} that

Putting both inequalities together and using that λκρK=cκϵr2(ρK)\lambda\kappa\rho_{K}=c^{\prime}\kappa\epsilon r^{2}(\rho_{K}), we get

Let ρ0\rho\geqslant 0, Γf(ρ)=ff+(ρ/20)B()f\Gamma_{f^{*}}(\rho)=\cup_{f\in f^{*}+(\rho/20)B}(\partial\left\|\cdot\right\|)_{f} (cf.) section 3.3). For all gFg\in F,

Let gFg\in F, ff+(ρ/20)Bf^{**}\in f^{*}+(\rho/20)B and z()fz^{*}\in(\partial\left\|\cdot\right\|)_{f^{**}}. We have

where the last inequality follows from z(ff)ffz^{*}(f^{**}-f^{*})\leqslant\|f^{**}-f^{*}\|. The result follows by taking supremum over zΓf(ρ)z^{*}\in\Gamma_{f^{*}}(\rho).

Let ρ0\rho\geqslant 0. Let gFg\in F be such that gfρ\left\|g-f^{*}\right\|\geqslant\rho. Define f=f+ρ(gf)/gff=f^{*}+\rho(g-f^{*})/\left\|g-f^{*}\right\|. Then fFf\in F, ff=ρ\left\|f-f^{*}\right\|=\rho and,

The first conclusion holds by convexity of FF, the second statement is obvious. For the last one, let Υ=gf/ρ\Upsilon=\|g-f^{*}\|/\rho and note that Υ1\Upsilon\geqslant 1 and gf=Υ(ff)g-f^{*}=\Upsilon(f-f^{*}), so we have

Now, let us bound supgF3(κ)TK,λ(g,f)\sup_{g\in F_{3}^{(\kappa)}}T_{K,\lambda}(g,f^{*}). Let gF3(κ)g\in F_{3}^{(\kappa)}. Apply Lemma 4 and Lemma 5 to ρ=ρK\rho=\rho_{K}: there exists fFf\in F such that ff=ρK\left\|f-f^{*}\right\|=\rho_{K} and

First assume that ffLP2r(ρK)\left\|f-f^{*}\right\|_{L_{P}^{2}}\leqslant r(\rho_{K}). In that case, ff=ρK\left\|f-f^{*}\right\|=\rho_{K} and ffLP2r(ρK)\left\|f-f^{*}\right\|_{L_{P}^{2}}\leqslant r(\rho_{K}) therefore, fHρKf\in H_{\rho_{K}}. Moreover, by definition of KK^{*} and since KKK\geqslant K^{*}, we have ρKρ\rho_{K}\geqslant\rho^{*} which implies that ρK\rho_{K} satisfies the sparsity equation from Definition 6. Therefore, supzΓf(ρK)z(ff)Δ(ρK)4ρK/5\sup_{z^{*}\in\Gamma_{f^{*}}(\rho_{K})}z^{*}(f-f^{*})\geqslant\Delta(\rho_{K})\geqslant 4\rho_{K}/5. Now, it follows from the definition of λ\lambda in (20) that

Moreover, since the quadratic process is non-negative, by (18) applied to ρ=ρK\rho=\rho_{K},

Finally, noting that 2ϵ4cϵ/502\epsilon-4c^{\prime}\epsilon/5\leqslant 0 when c10/4c^{\prime}\geqslant 10/4, binding all the pieces together in (4.2) yields

Second, assume that ffLP2r(ρK)\left\|f-f^{*}\right\|_{L_{P}^{2}}\geqslant r(\rho_{K}). Since ff=ρK\left\|f-f^{*}\right\|=\rho_{K}, it follows from (16) and (17) for ρ=ρK\rho=\rho_{K} that

where we used that 2ϵ(16θ0)22\epsilon\leqslant(16\theta_{0})^{-2} when c32c\geqslant 32 in the last inequality. Plugging the last result in (4.2) we get

when 16(2+c)ϵθ0216(2+c^{\prime})\epsilon\leqslant\theta_{0}^{-2}.

The proof follows essentially the one of [56, Theorem 3.2] or [50, Lemma 2].

Let f^F\hat{f}\in F be such that, on Ω(K)\Omega(K), {\cal C}_{K,\lambda}\big{(}\hat{f}\big{)}\leqslant(2+c^{\prime})\epsilon r^{2}(\rho_{K}). Then, on Ω(K)\Omega(K), f^\hat{f} satisfies

Thus, on Ω(K)\Omega(K), f^{gF:TK,λ(g,f)(2+c)ϵr2(ρK)}\hat{f}\in\left\{g\in F:T_{K,\lambda}(g,f^{*})\geqslant-(2+c^{\prime})\epsilon r^{2}(\rho_{K})\right\}. When c=16c^{\prime}=16 and c>832c>832,

therefore, f^F1(2)\hat{f}\in F_{1}^{(2)} on Ω(K)\Omega(K). This yields the results for both the regularization and the LP2L^{2}_{P}-norm.

Finally, let us turn to the control on the excess risk. It follows from (19) for ρ=κρK\rho=\kappa\rho_{K} that

4 End of the proof of Theorem 1

By definition of f^K,λ\widehat{f}_{K,\lambda},

where {F1(1),F2(1),F3(1)}\{F^{(1)}_{1},F^{(1)}_{2},F^{(1)}_{3}\} is the decomposition of FF as in Figure 2. It follows from Lemma 3 (for κ=1\kappa=1) that on the event Ω(K)\Omega(K),

Therefore, for c=16c^{\prime}=16 and c=833c=833 the conclusion of the proof of Theorem 1 follows from Lemma 6.

5 Proof of Theorem 2

Let K[K1,K2]K\in[K_{1},K_{2}] and let ΩK,cad={fJ=KK2R^J,cad}\Omega_{K,c_{ad}}=\{f^{*}\in\cap_{J=K}^{K_{2}}\hat{R}_{J,c_{ad}}\} where we recall that R^J,cad={fF:CJ,λ(f)(cad/θ02)r2(ρJ)}\hat{R}_{J,c_{ad}}=\{f\in F:{\cal C}_{J,\lambda}(f)\leqslant(c_{ad}/\theta_{0}^{2})r^{2}(\rho_{J})\}. Lemma 3 (for κ=1\kappa=1) shows that, for cad=(2+c)/cc_{ad}=(2+c^{\prime})/c, ΩK,cadJ=KK2Ω(J)\Omega_{K,c_{ad}}\supset\cap_{J=K}^{K_{2}}\Omega(J). Therefore, on J=KK2Ω(J)\cap_{J=K}^{K_{2}}\Omega(J), K^cadK\hat{K}_{c_{ad}}\leqslant K which implies that f^cadR^K,cad\widehat{f}_{c_{ad}}\in\hat{R}_{K,c_{ad}}. By Lemma 6 (for c=16c^{\prime}=16 and c=833c=833), this implies that

Appendix A Simulation study

The aim of this section is to show that the min-max procedure introduced in this work can be computed using alternating descent-ascent algorithms. It appears that all the following algorithms can be recast as block gradient descent (BGD) but the major difference with the classical BGD is that blocks are chosen according to their “centrality” via the median operator instead of iterating over all the blocks (resp. at random) in the classical (resp. stochastic) BGD approach.

We study the performance of all the algorithms on a dataset corrupted by outliers of various forms. The “basic” set of “informative data” is called D1{\cal D}_{1}. This set is corrupted by various datasets of outliers, named D2,D3{\cal D}_{2},{\cal D}_{3}, D4{\cal D}_{4} and D5{\cal D}_{5}. Good and bad data have been merged and shuffled in the dataset D=D1D2D3D4{\cal D}={\cal D}_{1}\cup{\cal D}_{2}\cup{\cal D}_{3}\cup{\cal D}_{4} given to the statistician. Let us now detail the construction of these datasets.

The set D1{\cal D}_{1} of “informative data” is a set of NgoodN_{good} i.i.d. data (Xi,Yi)(X_{i},Y_{i}) with common distribution

D2{\cal D}_{2} is a dataset of Nbad1N_{bad-1} “bad data” (Xi,Yi)(X_{i},Y_{i}) such that Yi=1Y_{i}=1 and Xi=(1)j=1dX_{i}=(1)_{j=1}^{d}

D3{\cal D}_{3} is a dataset of Nbad2N_{bad-2} “bad data” (Xi,Yi)(X_{i},Y_{i}) such that Yi=10000Y_{i}=10000 and Xi=(1)j=1dX_{i}=(1)_{j=1}^{d}

D4{\cal D}_{4} is a dataset of Nbad3N_{bad-3} “bad data” (Xi,Yi)(X_{i},Y_{i}) where YiY_{i} is a 010-1-Bernoulli random variable and XiX_{i} is uniformly distributed over d^{d},

D5{\cal D}_{5} is also a set of “outliers” that have been generated according to a linear model as in (22) (i.e. with the same target vector tt^{*}) but for a different choice of design XX and noise ζ\zeta. Here, we take XN(0,Σ)X\sim{\cal N}(0,\Sigma) where Σ=(ρij)1i,jd\Sigma=(\rho^{|i-j|})_{1\leqslant i,j\leqslant d} and ζ\zeta is a heavy-tailed noise distributed according to a Student distribution with various degrees of freedom.

These different types of “outliers” Dj,j=2,3,4,5{\cal D}_{j},j=2,3,4,5 have been chosen to illustrate that the theory allows for outliers that may have absolutely nothing to do with the oracle tt^{*} that can be neither independent nor random as illustrated by datasets D2{\cal D}_{2} and D3{\cal D}_{3}.

A.2 From algorithms for the LASSO to their “MOM versions”

There is a huge literature presenting many algorithms to implement the LASSO procedure. We explore several of them to investigate their performance regarding robustness.

Each algorithm designed for the LASSO can be transformed into an algorithm for the min-max estimator we have been studying in this work. Let us now explain how this can be done. Recall that MOM version of the LASSO estimator is

Moreover, our algorithms particularly fits the “big data” framework which is our original motivation for the introduction of robust procedures in machine learning. In this framework, the map-reduce paradigm has emerged as a leading idea to design procedures. In this situation, the data are spread out in a cluster of servers and are therefore naturally split into blocks. Then our procedures simply uses for mapper a mean function and for reducer a median function. This makes our algorithms easily scalable into the big data framework even when some servers have crashed down (making outliers data). MOM algorithm could therefore reduced the maintenance of big clusters.

In the sequel, we use this strategy to transform several algorithms implemented for the LASSO into algorithms for the min-max estimator (23). We first start with the subgradient descent algorithm.

A.3 Subgradient descent algorithm

The key insight in Algorithm 1 are step 2 and step 4 where the blocks number have been chosen according to their centrality among the other blocks via the median operator. Those steps are expected 1) to remove outliers from the descent / ascent directions 2) to improve the accuracy of the latter directions.

for some given ρ(0,1)\rho\in(0,1), δ=104\delta=10^{-4} and initial point γ0=1\gamma_{0}=1.

A.4 Proximal gradient descent algorithms

Note that the step sizes sequences (ηp)p(\eta_{p})_{p} and (βp)p(\beta_{p})_{p} may be chosen according to the remarks below Algorithm 1.

A.5 Douglas-Racheford / ADMM

In this section, we consider the ADMM algorithm. ADMM stands for Alternating Direction Method of Multipliers. It is also a splitting algorithm which reads as follows in the LASSO case: at step pp,

where ρ\rho is some parameter to be chosen (for instance ρ=10\rho=10). The ADMM algorithm returns tpt_{p} after a stopping criteria is met. In Algorithm 4, we provide a MOM version of this algorithm.

A.6 Cyclic coordinate descent

A.7 Adaptive choice of hyper-parameters via MOM V-fold Cross Validation

The choice of hyperparameters in a “data corrupted” environment has to be done carefully. Classical Cross-validation methods cannot be used because of the potential presence of outliers in the dataset that are likely to corrupt the classical CV-criterion. We design new (empirical) criteria that trustfully reveal performance of estimators even in situations where “test datasets” might have been corrupted.

MOM’s principles can be combined with the idea of multiple splitting into training / test datasets in cross-validation. Let us now explain the construction of estimators f^K^,λ^\hat{f}_{\hat{K},\hat{\lambda}} where the number of blocks K^\hat{K} and the regularization parameter λ^\hat{\lambda} are hyper-parameters learned via such version of the CV principle.

We are given an integer V[N]V\in[N] such that NN is divided by VV. We are also given two finite grids GK[N]{\cal G}_{K}\subset[N] and Gλ(0,1]{\cal G}_{\lambda}\subset(0,1]. Our aim is to chose the “best” numbers of blocks and best regularization parameter within both grids. The dataset is splitted into VV disjoints blocks D1,,DV{\cal D}_{1},\ldots,{\cal D}_{V}. For each v[V]v\in[V], uvDu\cup_{u\neq v}{\cal D}_{u} is used to train a family of estimators

The remaining Dv{\cal D}_{v} of the dataset is used to test the performance of each estimator in the family (31). Using these notations, we can define a MOM version of the cross-validation procedure.

The Median of Means VV-fold Cross Validation procedure associated to the family of estimators (31) is f^K^,λ^\hat{f}_{\hat{K},\hat{\lambda}} where (K^,λ^)(\hat{K},\hat{\lambda}) is minimizing the MomCvV{\rm MomCv}_{V} criteria

and B1(v),BK(v)B_{1}^{(v)}\cup\cdots,\cup B_{K^{\prime}}^{(v)} is a partition of the test set Dv{\cal D}_{v} into KK^{\prime} blocks where K[N/V]K^{\prime}\in[N/V] such that KK^{\prime} divides N/VN/V.

The difference between standard V-fold cross validation procedure and its MOM version in Definition 7 is that empirical means on test sets Dv{\cal D}_{v} in the classical V-fold CV procedure have been replaced by MOM estimators in (32). Moreover, the mean over all VV splits in the classical VV-fold CV is replaced by a median.

The choice of VV raises the same issues for MOM CV as for classical VV-fold CV . In the simulations we use V=5V=5. The construction of the MOM-CV requires to choose another parameter: KK^{\prime}, the number of blocks used to construct the MOM criteria (32) over the test set. A possible solution is to take K=K/VK^{\prime}=K/V. This has the advantage to make only one split of the dataset D{\cal D} into KK blocks and then use for each of the VV rounds, (V1)K/V(V-1)K/V of these blocks to construct the family of estimators (31) and then K/VK/V of these blocks to test them.

In Figures 4 and 4, hyper-parameters KK (i.e. the number of blocks) and λ\lambda (i.e. the regularization parameter) have been chosen for the MOM LASSO estimator via the MOM V-fold Cross validation procedure introduced above.

In Figure 4, the adaptively chosen K^\hat{K} grows with the number of outliers which is something expected since one needs the number of blocks to be at least twice the number of outliers.

A.8 Maximinimization, saddle-point, random blocks, outliers detection and depth

In the previous sections, we considered a setup that particularly fits a dataset distributed on clusters. This distributed source of data yields some specific partition of the dataset in compliance with the clusters structure – in other words, the blocks of data B1,,BKB_{1},\cdots,B_{K} are imposed and fixed by the specific physical form of the dataset. In a batch setup, there is a priori no structural restriction to choose a fixed partition of the dataset given that it has no predefined organization. It appears that, in this setup, there is a way to improve the performance of the various iterative algorithms constructed in the previous sections by choosing randomly the blocks at every (descent and ascent) steps of the algorithm. The aim of this section is to show some advantages of this choice and how this modified version works on the example of ADMM. Moreover, as a byproduct of this approach, it is possible to construct an outliers detection algorithm. This algorithm outputs a score to each data in the dataset measuring its centrality. In particular, data with a low score should be considered as outliers.

Following the same strategy as in Section 4, we can show that g^K,λ\hat{g}_{K,\lambda} has the very same statistical performance as the estimator f^K,λ\hat{f}_{K,\lambda} from Section 3.4 (cf. Section C in the case where there is no regularization for a proof). Nevertheless, there is a priori no reason that g^K,λ\hat{g}_{K,\lambda} and f^K,λ\hat{f}_{K,\lambda} are the same estimator. That is we don’t know in advance that

In other words there is no evidence that the duality gap is null. Since TK,λ(g,f)=TK,λ(f,g)T_{K,\lambda}(g,f)=-T_{K,\lambda}(f,g), (34) holds if and only if

In that case, f^\hat{f} is by definition a saddle-point estimator and therefore the two sets of minmax and maxmin estimators are equal.

When f^\hat{f} is a saddle-point, the choice of fixed blocks B1,,BKB_{1},\ldots,B_{K} may result in a problem of “local saddle points”: iterations of our original algorithms remain close to a potentially suboptimal local saddle point.

The problem here is that our algorithms are looking for saddle-points so that if there are several cells Ck{\cal C}_{k} containing saddle-points the previous algorithms may be stuck in one of them whereas a “better saddle-point” (that is a saddle point closer to tt^{*}) may be in an other cell.

In Figure 5, we ran both MOM LASSO procedures via ADMM with fixed and random blocks. Both the objective function and the estimation error of MOM LASSO jump in the case of fixed blocks. These jumps correspond to a change of cell number for these iterations. The algorithm converge to local saddle-points before jumping to other cells. This slows down its convergence. On the other hand, the algorithms with random blocks do not suffer this drawback. Figure 5 shows that the estimation error converges much faster and smoothly for random blocks than for fixed blocks. As a conclusion choosing blocks at random improve both stability and speed of the algorithm, avoiding the issue of “local saddle points” in the cells Ck{\cal C}_{k}. Note also that the objective function of the MOM ADMM with random blocks tends to zero so that the duality gap tends to zero (meaning that we do have a natural stopping criterium and that the MOM LASSO is a saddle point).

A byproduct of this approach is that one can construct an outliers detection procedure. To that end one simply has to count the number of times each data is selected at steps 22 and 44 of Algorithm 5. Every data starts with a null score. Then, every time a block BB is selected as median block, every data it contains increases its score by one. At the end, every data ends up with a score revealing its centrality for our learning task.

It is expected that aggressive outliers corrupt their respective blocks. These “corrupted blocks” will not be median blocks and will not be chosen. In the case of fixed blocks, informative data cannot be distinguished from outliers lying in the same block, therefore, this outliers detection algorithm only makes sense when blocks are chosen at random. Figure 6 shows performance of this strategy on synthetic data (cf. Section A.9 for more details on the simulations). On this example, outliers (data number 1,32,1701,32,170 and 194194) end up with a null score meaning that they have never been selected along the descent / ascent iterations of the algorithm. Rearranging the scores, one can observe a jump in the score function between outliers and informative data. The gap between these scores can be enlarged by running more iterations of the algorithm. Finally, the score may be seen as a depth of a data point (Xi,Yi)(X_{i},Y_{i}), the most selected data are the most central.

A.9 Simulations setup for the figures

For Figure 5, we have ran similar experiments with N=200N=200, d=300d=300, s=20s=20, σ=1\sigma=1, K=10K=10, the number of iterations was 500500 and the regularization parameter was 1/N1/\sqrt{N}.

For Figure 6, we took N=200N=200, d=500d=500, s=20s=20, σ=1\sigma=1, the number of outliers is O=4|{\cal O}|=4 and the outliers are of the form Y=10000Y=10000 and X=(1)j=1dX=(1)_{j=1}^{d}, K=10K=10, the number of iterations is 5.0005.000 and λ=1/200\lambda=1/\sqrt{200}.

Appendix B Examples of applications

Even if recent advances have shown some limitations of LASSO, it remains the benchmark estimator in high-dimensional statistics because a high dimensional parameter space does not significantly affect its performance as long as tt^{*} is sparse. This was shown for example, in for estimation and sparse oracle inequalities, in for support recovery results; more results and references on LASSO can be found in the books .

B.2 SLOPE

B.3 Classical results for LASSO and SLOPE

Typical results for LASSO and SLOPE have been obtained in the i.i.d. setup under a subgaussian assumption on the design XX and, most of the time, on the noise ζ\zeta as well. Let us for the moment provide a definition of this assumption and recall the one of isotropicity.

In the following assumption, we recall a setup where both estimators have been studied in .

the data are i.i.d. (in particular, I=N|{\cal I}|=N and O=0|{\cal O}|=0, i.e. there is no outlier),

for f^{*}=\bigl{<}t^{*},\cdot\bigr{>}, ζ=Yf(X)Lq0\zeta=Y-f^{*}(X)\in L^{q_{0}} for some q0>2q_{0}>2.

Unlike many results on these estimators, Assumption 5 only requires a “minimal” Lq0L^{q_{0}} for q0>2q_{0}>2 moment on the noise. It appears that LASSO and SLOPE still achieve optimal rates of convergence under this weak stochastic assumption but the price to pay is a severely deteriorated probability estimate.

The constants (cj)j=16(c_{j})_{j=1}^{6} depend only on LL and q0q_{0}.

The error rate in Theorem 3 coincides with the standard estimate on the LASSO (cf. ), but in a broader context: tt^{*} does not need to be sparse but should be approximated by a sparse vector; the target YY is arbitrary (there is no need for a statistical model) and the noise ζ\zeta may be heavy tailed and does not need to be independent from XX. But there is no room for outliers, the design matrix XX still needs to be subgaussian and the data are assumed to be i.i.d.. We will see below that the MOM version of the LASSO can go further, achieving minimax optimal error bounds with a much better probability estimate.

Turning to SLOPE, recall the following result for the regularization norm Ψ(t)=j=1dβjtj\Psi(t)=\sum_{j=1}^{d}\beta_{j}t_{j}^{\sharp} when βj=Clog(ed/j)\beta_{j}=C\sqrt{\log(ed/j)}.

The constants (cj)j=15(c_{j})_{j=1}^{5} depend only on LL and q0q_{0}.

B.4 Statistical analysis of MOM LASSO and MOM SLOPE

IN/2|{\cal I}|\geqslant N/2 and Oc1slog(ed/s)|{\cal O}|\leqslant c_{1}s\log(ed/s),

for f^{*}=\bigl{<}t^{*},\cdot\bigr{>}, ζ=Yf(X)Lq0\zeta=Y-f^{*}(X)\in L^{q_{0}} for some q0>2q_{0}>2.

there exists θm\theta_{m} such that {\rm var}(\zeta\bigl{<}X,t\bigr{>})\leqslant\theta_{m}\left\|\bigl{<}X,t\bigr{>}\right\|_{L^{2}}.

There exist constants q0>2q_{0}>2, C0C_{0} and LL such that ζLq0\zeta\in L^{q_{0}}, XX is isotropic and for every j[d]j\in[d] and 1pC0logd1\leqslant p\leqslant C_{0}\log d, \left\|\bigl{<}X,e_{j}\bigr{>}\right\|_{L^{p}}\leqslant L\sqrt{p}\left\|\bigl{<}X,e_{j}\bigr{>}\right\|_{L^{2}}.

Under Assumption 7, if σ=ζLq0\sigma=\left\|\zeta\right\|_{L^{q_{0}}}, [70, Theorem 1.6] shows that, for every ρ>0\rho>0,

As a consequence, if Nslog(ed/s)N\gtrsim s\log(ed/s) and if there exists a ss-sparse vector in t+(ρ/20)B1dt^{*}+(\rho/20)B_{1}^{d}, Lemma 7 and the choice of r()r(\cdot) in (38) imply that for σ=ζLq0\sigma=\left\|\zeta\right\|_{L^{q_{0}}},

then ρ\rho^{*} satisfies the sparsity equation and r2(ρ)r^{2}(\rho^{*}) is the rate of convergence of the LASSO for λr2(ρ)/ρζLq0log(ed/s)/N\lambda\sim r^{2}(\rho^{*})/\rho^{*}\sim\left\|\zeta\right\|_{L^{q_{0}}}\sqrt{\log(ed/s)/N}. But, this choice of λ\lambda requires to know the sparsity parameter ss which is usually not available. That is the reason why we either need to choose a larger value for the r()r(\cdot) function as in – this results in the suboptimal log(ed)/N\sqrt{\log(ed)/N} rates of convergence from Theorem 3 – or to use an adaptation step as section 3.4.1 – this results in the better minimax rate log(ed/s)/N\sqrt{\log(ed/s)/N} achieved by the MOM LASSO. To get the latter one needs a final ingredient which is the computation of the radii ρK\rho_{K} and λr2(ρK)/ρK\lambda\sim r^{2}(\rho_{K})/\rho_{K}. Let K[N]K\in[N] and σ=ζLq0\sigma=\left\|\zeta\right\|_{L^{q_{0}}}. The equation K=cr(ρK)2NK=cr(\rho_{K})^{2}N is solved by

for the r()r(\cdot) function defined in (38). Therefore,

The regularization parameter depends on the “level of noise” σ\sigma, the Lq0L^{q_{0}}-norm of ζ\zeta. This parameter is unknown in practice. Nevertheless, it can be estimated and replaced by this estimator in the regularization parameter as in [31, Sections 5.4 and 5.6.2].

The following result follows from Theorem 2 together with the computation of ρ\rho^{*}, rQ()r_{Q}(\cdot), rM()r_{M}(\cdot) and r()r(\cdot) from the previous sections.

Grant Assumption 6. The MOM-LASSO estimator t^\hat{t} satisfies, with probability at least 1c1exp(c2slog(ed/s))1-c_{1}\exp(-c_{2}s\log(ed/s)), for every 1p21\leqslant p\leqslant 2,

where (cj)j=13(c_{j})_{j=1}^{3} depends only on θ0,θm\theta_{0},\theta_{m} and q0q_{0}.

In particular, Theorem 5 shows that, for our estimator contrary to the one in , the sparsity parameter ss does not have to be known in advance in the LASSO case.

Theoretical properties of MOM LASSO (cf. Theorem 5) outperform those of LASSO (cf. Theorem 3) in several ways:

Estimation rates achieved by MOM-LASSO are the actual minimax rates slog(ed/s)/Ns\log(ed/s)/N, see , while classical LASSO estimators achieve the rate slog(ed)/Ns\log(ed)/N. This improvement is possible thanks to the adaptation step in MOM-LASSO.

the probability deviation in (36) is polynomial – 1/N(q0/21)1/N^{(q_{0}/2-1)} – whereas it is exponentially small for MOM LASSO. Exponential rates for LASSO hold only if ζ\zeta is subgaussian (ζLpCpζL2\left\|\zeta\right\|_{L_{p}}\leqslant C\sqrt{p}\left\|\zeta\right\|_{L_{2}} for all p2p\geqslant 2).

MOM LASSO is insensitive to data corruption by up to slog(ed/s)s\log(ed/s) outliers while only one outlier can be responsible of a dramatic breakdown of the performance of LASSO (cf. Figure 1). Moreover, the informative data are only asked to have equivalent L2L^{2} moments to the one of PP for the MOM LASSO whereas the properties of the LASSO are only known in the i.i.d. setup.

Assumptions on XX are weaker for MOM LASSO than for LASSO. In the LASSO case, we assume that XX is subgaussian whereas for the MOM LASSO we assumed that the coordinates of XX have C0log(ed)C_{0}\log(ed) moments and that it satisfies a L2/L1L^{2}/L^{1} equivalence assumption.

The sparsity equation relative to the SLOPE norm has been solved in Lemma 4.3 from .

Let 1sd1\leqslant s\leqslant d and set Bs=jsβj/j{\cal B}_{s}=\sum_{j\leqslant s}\beta_{j}/\sqrt{j}. If tt^{*} is ρ/20\rho/20 approximated (relative to the SLOPE norm) by an ss-sparse vector and if 40Bsρ/r(ρ)40{\cal B}_{s}\leqslant\rho/r(\rho) then Δ(ρ)4ρ/5\Delta(\rho)\geqslant 4\rho/5.

For βjClog(ed/j)\beta_{j}\leqslant C\sqrt{\log(ed/j)}, one may verify that Bs=jsβj/jCslog(ed/s){\cal B}_{s}=\sum_{j\leqslant s}\beta_{j}/\sqrt{j}\lesssim C\sqrt{s\log(ed/s)}. Hence, the condition Bsρ/r(ρ){\cal B}_{s}\lesssim\rho/r(\rho) holds when NL,q0slog(ed/s)N\gtrsim_{L,q_{0}}s\log(ed/s) and \rho\gtrsim_{L,q_{0}}\|\xi\|_{L_{q}}\frac{s}{\sqrt{N}}\log\Big{(}\frac{ed}{s}\Big{)}. Hence, it follows from Lemma 8 that Δ(ρ)4ρ/5\Delta(\rho)\geqslant 4\rho/5 when there is an ss-sparse vector in t+(ρ/20)BΨt^{*}+(\rho/20)B_{\Psi}; therefore, one may apply Theorem 1 for the choice of the regularization parameter: λr2(ρ)/ρL,q,δξLq/N\lambda\sim r^{2}(\rho)/\rho\sim_{L,q,\delta}\|\xi\|_{L_{q}}/\sqrt{N}.

Now, the final ingredient is to compute the ρK\rho_{K} solution to K=cr(ρK)2NK=cr(\rho_{K})^{2}N. It is straightforward to check that ρKK/(σN)\rho_{K}\sim K/(\sigma\sqrt{N}) and still λr2(ρK)/ρKL,q,δξLq/N\lambda\sim r^{2}(\rho_{K})/\rho_{K}\sim_{L,q,\delta}\|\xi\|_{L_{q}}/\sqrt{N}.

The following result follows from Theorem 2 together with the computation of ρ,ρK\rho^{*},\rho_{K}, rQ()r_{Q}(\cdot), rM()r_{M}(\cdot) and r()r(\cdot) above. Its proof is similar to the one of Theorem 5 and is therefore omitted.

Grant Assumption 6. The MOM-SLOPE estimator t^\hat{t} satisfies, with probability at least 1c1exp(c2slog(ed/s))1-c_{1}\exp(-c_{2}s\log(ed/s)),

where (cj)j=13(c_{j})_{j=1}^{3} depends only on θ0,θm\theta_{0},\theta_{m} and q0q_{0}.

MOM-SLOPE has the same advantages upon SLOPE as MOM-LASSO upon LASSO. Those improvements are listed below Theorem 5 and will not be repeated. The only difference is that SLOPE, unlike LASSO, already achieves the minimax rate slog(ed/s)/Ns\log(ed/s)/N .

Appendix C Learning without regularization

All the results from the previous sections also apply in the setup of learning with no regularization which is the framework one should consider when there is no a priori known structure on the oracle.

We consider the learning problem with no regularization. In this setup, we may use both minmaximization or maxminimization estimators

Indeed, the ideal setup for ERM is the subgaussian (and convex) framework: that is for a convex class FF of functions, i.i.d. data (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N} having the same distribution as (X,Y)(X,Y) and such that for some L>0L>0 and all f,gFf,g\in F,

When FF satisfies the right-hand side of (42), we say that FF is a LL-subgaussian class. It is proved in that in this setup the ERM is an optimal minimax procedure (cf. Theorem A′ from recalled in Theorem 9 below).

But first, we need a version of the two theorems 1 and 2 valid for f^K\widehat{f}_{K} and g^K\widehat{g}_{K} (that is for the learning problem with no regularization). Let us first introduce the set of assumptions we use and for the sake of shortness we consider the simplification introduced in Remark 2. Then, we will introduce the two fixed points driving the statistical properties of f^K\widehat{f}_{K} and g^K\widehat{g}_{K}.

For all iIi\in{\cal I} and fFf\in F, f(Xi)f(Xi)L2=f(X)f(X)L2\left\|f(X_{i})-f^{*}(X_{i})\right\|_{L^{2}}=\left\|f(X)-f^{*}(X)\right\|_{L_{2}},

and f(Xi)f(Xi)L2θ0f(Xi)f(Xi)L1\left\|f(X_{i})-f^{*}(X_{i})\right\|_{L^{2}}\leqslant\theta_{0}\left\|f(X_{i})-f^{*}(X_{i})\right\|_{L^{1}}.

The two fixed points associated to this problem are rQ(ρ,γQ)r_{Q}(\rho,\gamma_{Q}) and rM(ρ,γM)r_{M}(\rho,\gamma_{M}) as in Definition 5 for ρ=\rho=\infty:

and let r=r(γQ,γM)=max{rQ(γQ),rM(γM)}r^{*}=r^{*}(\gamma_{Q},\gamma_{M})=\max\{r_{Q}(\gamma_{Q}),r_{M}(\gamma_{M})\}.

Grant Assumptions 8 and let rQ(γQ)r_{Q}(\gamma_{Q}), rM(γM)r_{M}(\gamma_{M}) and rr^{*} be defined as above for γQ=(384θ0)1\gamma_{Q}=(384\theta_{0})^{-1}, γM=ϵ/192\gamma_{M}=\epsilon/192 and ϵ=1/(32θ02)\epsilon=1/(32\theta_{0}^{2}). Assume that N384θ02N\geqslant 384\theta_{0}^{2} and ON/(768θ02)|{\cal O}|\leqslant N/(768\theta_{0}^{2}). Let KK^{*} denote the smallest integer such that KNϵ2(r)2/(384θm2)K^{*}\geqslant N\epsilon^{2}(r^{*})^{2}/(384\theta_{m}^{2}). Then, for all K[max(K,8O),N/(96θ02)]K\in[\max(K^{*},8|{\cal O}|),N/(96\theta_{0}^{2})], with probability larger than 12exp(7K/9216)1-2\exp(-7K/9216), the estimators f^K\widehat{f}_{K} and g^K\widehat{g}_{K} defined in (41) satisfy

Moreover, one can choose adaptively KK via Lepski’s method. We will do it only for the maxmin estimators g^K\widehat{g}_{K}. Similar result hold for the minmax estimators f^K\widehat{f}_{K} from straightforward modifications (the same as in Section 3.4.1). Define the confidence regions: for all J[K]J\in[K] and gFg\in F,

The following theorem shows the performance of the resulting estimator.

Grant Assumption 8. For ϵ=1/(32θ02)\epsilon=1/(32\theta_{0}^{2}) and all K[max(K,8O),N/(96θ02)]K\in[\max(K^{*},8|{\cal O}|),N/(96\theta_{0}^{2})], with probability larger than 12exp(K/2304)1-2\exp(-K/2304),

The proofs of Theorem 7 and 8 essentially follow the one of Theorem 1 and 2. We will only sketch the proof for the maxmin estimator g^K\widehat{g}_{K} given that we already studied the minmax estimators in the regularized setup in Section 4.

if ffLP2rQ(γQ)\left\|f-f^{*}\right\|_{L^{2}_{P}}\geqslant r_{Q}(\gamma_{Q}) then

there exists 3K/43K/4 block BkB_{k} with kKk\in{\cal K}, for which

Moreover, it follows from Assumption 8 that for all kKk\in{\cal K}, PBk[ζ(ff)]=P[ζ(ff)]\overline{P}_{B_{k}}[\zeta(f-f^{*})]=P[\zeta(f-f^{*})] and P[2ζ(ff)]0P[2\zeta(f-f^{*})]\leqslant 0 because of the convexity of FF and the nearest point theorem. Therefore, on the event Ω(K)\Omega(K), for all fFf\in F,

Let us place ourself on the event Ω(K)\Omega(K) and let rKr_{K} be such that rK2=384θm2K/(ϵ2N)r_{K}^{2}=384\theta_{m}^{2}K/(\epsilon^{2}N). Given that rKrr_{K}\geqslant r^{*}, it follows from (43) and (45) that if fFf\in F is such that ffLP2rK\left\|f-f^{*}\right\|_{L_{P}^{2}}\geqslant r_{K} then

for ϵ=1/(32θ02)\epsilon=1/(32\theta_{0}^{2}) and if ffLP2rK\left\|f-f^{*}\right\|_{L_{P}^{2}}\leqslant r_{K} then TK(f,f)Q3/4,K(2ζ(ff))ϵrK2T_{K}(f,f^{*})\leqslant Q_{3/4,K}(2\zeta(f-f^{*}))\leqslant\epsilon r_{K}^{2}. In particular,

and since CK(g^K)CK(f){\mathfrak{C}}_{K}(\hat{g}_{K})\geqslant{\mathfrak{C}}_{K}(f^{*}) one has CK(g^K)ϵrK2{\mathfrak{C}}_{K}(\hat{g}_{K})\geqslant-\epsilon r_{K}^{2}. On the other hand, we have CK(g^K)=inffFTK(g^K,f)TK(g^K,f){\mathfrak{C}}_{K}(\hat{g}_{K})=\inf_{f\in F}T_{K}(\hat{g}_{K},f)\leqslant T_{K}(\hat{g}_{K},f^{*}). Therefore, TK(g^K,f)ϵrK2T_{K}(\hat{g}_{K},f^{*})\geqslant-\epsilon r_{K}^{2}. But, we know from (47) that if gFg\in F is such that gfLP2>32ϵθ0rK\left\|g-f^{*}\right\|_{L^{2}_{P}}>\sqrt{32\epsilon}\theta_{0}r_{K} then TK(g,f)(1/(32θ02))gfLP22<ϵrK2T_{K}(g,f^{*})\leqslant(-1/(32\theta_{0}^{2}))\left\|g-f^{*}\right\|_{L^{2}_{P}}^{2}<-\epsilon r_{K}^{2}. Therefore, one necessarily have g^KfLP232ϵθ0rK=rK\left\|\hat{g}_{K}-f^{*}\right\|_{L^{2}_{P}}\leqslant\sqrt{32\epsilon}\theta_{0}r_{K}=r_{K}.

The oracle inequality now follows from (C):

Proof of Theorem 8. Consider the same notations as in the proof of Theorem 7 and denote K2=N/(96θ02)K_{2}=N/(96\theta_{0}^{2}). It follows from the proof of Theorem 7, that with probability larger than 12J=KK2exp(7J/9216)1-2\sum_{J=K}^{K_{2}}\exp(-7J/9216), for all J[K,K2]J\in[K,K_{2}], CJ(f)ϵrJ2{\mathfrak{C}}_{J}(f^{*})\geqslant-\epsilon r_{J}^{2} therefore, fR^Jf^{*}\in\hat{R}_{J} and so K^K\hat{K}\leqslant K . The latter implies that g^R^K\widehat{g}\in\hat{R}_{K}which, by using the same argument as in the end of the proof of Theorem 7 implies that g^fLP2rK\left\|\widehat{g}-f^{*}\right\|_{L^{2}_{P}}\leqslant r_{K} and then R(g^)R(f)(1+2ϵ)rKR(\widehat{g})-R(f^{*})\leqslant(1+2\epsilon)r_{K}.

As a consequence, rQ(γQ)=0r_{Q}(\gamma_{Q})=0 if γQJDJ\gamma_{Q}|J|\geqslant\sqrt{D|J|}, i.e. if γQD/J\gamma_{Q}\geqslant\sqrt{D/|J|}. Using the same arguments as above, we have

Therefore, rM(γM)(θm/γM)D/J(θm/γM)2D/Nr_{M}(\gamma_{M})\leqslant(\theta_{m}/\gamma_{M})\sqrt{D/|J|}\leqslant(\theta_{m}/\gamma_{M})\sqrt{2D/N} and K=DK^{*}=D.

Now, it follows from Theorem 8, that if N2(384θ0)2DN\geqslant 2(384\theta_{0})^{2}D and ON/(768θ02)|{\cal O}|\leqslant N/(768\theta_{0}^{2}) then the MOM OLS with adaptively chosen number of blocks KK is such that for all K[max(D,8O),N/(96θ02)]K\in\left[\max\left(D,8|{\cal O}|\right),N/(96\theta_{0}^{2})\right], with probability at least 12exp(K/2304)1-2\exp(-K/2304),

Appendix D Minimax optimality of Theorem 1, 2, 7 and 8

The aim of this section is to show that the rates obtained in Theorems 1, 2, 7 and 8 are optimal in a minimax sense. To that end we recall a minimax lower bound result from .

where D={(Xi,Yi):i[N]}{\cal D}=\{(X_{i},Y_{i}):i\in[N]\} is a set of NN i.i.d. copies of (X,Yf)(X,Y^{f^{*}}). Then, necessarily, one has

where diam(F,L2(PX)){\rm diam}(F,L^{2}(P_{X})) denotes the L2(PX)L^{2}(P_{X}) diameter of FF.

Theorem 9 proves that if the statistical model (49) holds then there is a strong connexion between the deviation parameter δN\delta_{N} and the uniform rate of convergence rN2r_{N}^{2} over FF : the smaller δN\delta_{N}, the larger rN2r^{2}_{N}. We now use this result to prove that Theorems 1, 2, 7 and 8 are essentially optimal.

In Theorems 7 and 8, the deviation bounds are 1c1exp(c2K)1-c_{1}\exp(-c_{2}K) and the residual terms in the LP2L^{2}_{P} (to the square) estimation rates are like c3K/Nc_{3}K/N. Therefore, setting δN=c1exp(c2K)\delta_{N}=c_{1}\exp(-c_{2}K) then Theorem 9 proves that no procedure can do better than

Given that one can obviously bound from above the performance of f^K\widehat{f}_{K} and g^K\widehat{g}_{K} as well as those of f^\widehat{f} and g^\widehat{g} in Theorems 7 and 8 by the LP2L^{2}_{P}-diameter of FF (because ff^{*} and those estimators are in FF), then the result of Theorem 7 and 8 are optimal even in the very strong Gaussian setup with i.i.d. data satisfying a Gaussian regression model like (49). The remarkable point is that Theorem 7 and 8 have been obtained under much weaker assumptions than those considered in Theorem 9 since outliers may corrupt the dataset, the noise and the design do not have to be independent, the informative data are only assumed to have a L2L^{2} norm equivalent to the one of PP and may therefore be heavy tailed.

Given the form of the deviation bounds in Theorems 1 and 2 and given that r(ρK)K/Nr(\rho_{K})\sim K/N and that r(2ρK)K/Nr(2\rho_{K})\sim K/N (if one assumes a weak regularity assumption on the class FF) then the same conclusions hold for Theorems 1 and 2: there is no procedure doing better than the MOM estimators even in the very good framework of Theorem 9.

References