Performance of empirical risk minimization in linear aggregation

Guillaume Lecué, Shahar Mendelson

Introduction and main results

Let (X,μ)(\mathcal{X},\mu) be a probability space, set XX to be distributed according to μ\mu and put YY to be an unknown target random variable.

is the risk of the function f^\hat{f} that is chosen by the procedure, using the observations (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}.

There are many different ways in which one may construct learning procedures (see, e.g., the books for numerous examples), but in general, there is no ‘universal’ choice of an optimal learning procedure.

The variety of learning algorithms motivated the introduction of aggregation or ensemble methods, in which one combines a batch or dictionary, created by learning procedures, in the hope of obtaining a function with ‘better’ prediction capabilities than individual members of the dictionary.

Aggregation procedures have been studied extensively (see, e.g., and references therein), and among the more well-known aggregation procedures are boosting and bagging .

This type of inequality is called an oracle inequality and the function ff^{*} for which R(f)=inffspan(F)R(f)R(f^{*})=\inf_{f\in\operatorname{span}(F)}R(f) is called the oracle.

Of course, in (1) one is looking for the smallest possible residual term rN(M)r_{N}(M), that holds uniformly for all choices of couples (X,Y)(X,Y) and dictionaries FF that satisfy certain assumptions.

The linear aggregation problem has been studied in in the Gaussian white noise model; in for the Gaussian model with random design; in for the density estimation problem and in in the learning theory setup, under moment conditions. And, based on these cases, it appears that the best possible residual term rN(M)r_{N}(M) that one may hope for is of the order of M/NM/N.

This rate is usually called the optimal rate of linear aggregation and, in fact, its optimality holds in some minimax sense, introduced in .

The only procedure we will focus on here is empirical risk minimization (ERM) performed in the span of the dictionary:

We do not claim that ERM is always the best procedure for the linear aggregation problem, but rather, our aim is to identify conditions under which it achieves the optimal rate of M/NM/N.

If x>0x>0 satisfies that 2/N2exp(x)12/N\leq 2\exp(-x)\leq 1 and

then with probability at least 12exp(x)1-2\exp(-x),

It follows from Theorem 1.1 that under an L4L_{4} assumption on Yf(X)Y-f^{*}(X) and the equivalence between the L2L_{2} and LL_{\infty} norms on the span of FF, ERM achieves a rate of convergence of order B2M/NB^{2}M/N when NcB3MN\geq cB^{3}M for an absolute constant cc.

However, it should be noted that the best probability estimate one may obtain in Theorem 1.1 is 12/N1-2/N; also, it is possible to show that the constant BB defined in (2) is necessarily larger than the dimension MM of span(F)\operatorname{span}(F). For the sake of completeness, we shall provide a proof of that fact in the Appendix. Therefore, the rate that Theorem 1.1 guarantees is, at best, of the order of M3/NM^{3}/N, to achieve that rate, at least NcM4N\geq cM^{4} observations are needed, and even with that sample size, the probability estimate is, at best, 12/N1-2/N. This estimate is far from the anticipated rate of M/NM/N, which should be achieved when NcMN\geq cM and preferably, with significantly higher probability.

Nevertheless, the optimal rate of M/NM/N can be obtained by relaxing assumption (2) and using a different method of proof. Recall that the ψ2\psi_{2} norm of a function ff is

One may show that fψ2cfL\|f\|_{\psi_{2}}\leq c\|f\|_{L_{\infty}} for a suitable absolute constant cc (see, e.g., Section 1 in ). Therefore, assuming that the ψ2\psi_{2}-norm and the L2L_{2}-norm are equivalent in span(F)\operatorname{span}(F) is a weaker requirement than the one in (2). The assumption that for every fspan(F)f\in\operatorname{span}(F),

Naturally, the analysis of ERM under a sub-Gaussian assumption requires a more sophisticated technical machinery than in situations in which the L2/LL_{2}/L_{\infty} equivalence assumption used in Theorem 1.1 holds. Invoking the main result from , one can show that if Yf(X)Y-f^{*}(X) is sub-Gaussian and span(F)\operatorname{span}(F) is a sub-Gaussian class, then for every x>0x>0, ERM achieves a rate rN(M)=c1xM/Nr_{N}(M)=c_{1}xM/N with probability at least 1exp(c2xM)1-\exp(-c_{2}xM).

Although the sub-Gaussian case is interesting, the goal of this note is the study of ERM as a linear aggregation procedure under much weaker assumptions.

Let F={f1,,fM}F=\{f_{1},\ldots,f_{M}\} and assume that there are constants κ0\kappa_{0} and β0\beta_{0} for which

for every fspan(F)f\in\operatorname{span}(F). Let N(400)2M/β02N\geq(400)^{2}M/\beta_{0}^{2} and set ζ=Yf(X)\zeta=Y-f^{*}(X). Assume further that one of the following two conditions holds:

Then, for every x>0x>0, with probability at least 1exp(β02N/4)(1/x)1-\exp(-\beta_{0}^{2}N/4)-(1/x),

Since the loss is the squared one, one has to assume that YY and functions in span(F)\operatorname{span}(F) have a second moment. It follows from Theorem A that in some cases, this is (almost) all that is needed for an optimal rate. Indeed, if ζ=Yf(X)\zeta=Y-f^{*}(X) is independent of the design XX – as is the case in any regression model with independent noise Y=f(X)+ζY=f^{*}(X)+\zeta, and if (4) holds, ERM achieves the optimal rate M/NM/N.

Consider the regression model Y=f(X)+ζY=f^{*}(X)+\zeta where ζ\zeta is a mean-zero noise that is independent of XX. Assume that ζL2\zeta\in L_{2} and that fspan(F)f^{*}\in\operatorname{span}(F). If span(F)\operatorname{span}(F) satisfies (4) and N(400)2M/β02N\geq(400)^{2}M/\beta_{0}^{2}, then for every x>0x>0, with probability at least 1exp(β02N/4)1/x1-\exp(-\beta_{0}^{2}N/4)-1/x,

It is possible to slightly modify the assumptions of Theorem A and still obtain the same type of estimate. For example, it is straightforward to verify that the small-ball condition (4) holds when the L2L_{2} and LpL_{p} norms are equivalent on span(F)\operatorname{span}(F) for some p>2p>2. This type of Lp/L2L_{p}/L_{2} equivalence assumption on span(F)\operatorname{span}(F) is weaker than the equivalence between the Lψ2L_{\psi_{2}} and the L2L_{2} norms in (3) because for every p1p\geq 1, fLpcpfψ2\|f\|_{L_{p}}\leq c\sqrt{p}\|f\|_{\psi_{2}} for a suitable absolute constant cc. And, it is clearly weaker than the L/L2L_{\infty}/L_{2} equivalence assumption (2) used in Theorem 1.1.

It turns out that if the L2L_{2} and L4L_{4} norms are equivalent on span(F)\operatorname{span}(F), one may obtain the optimal rate for an arbitrary target YY, as long as ζ=Yf(X)\zeta=Y-f^{*}(X) has a fourth moment. The difference between such a result and Theorem A is that ζ\zeta need not be independent of XX, nor must it be bounded.

There exist absolute constants c0,c1c_{0},c_{1} and c2c_{2} for which the following holds. Assume that there exists θ0\theta_{0} for which

One may show that a possible choice of constants in Theorem 1.3 is c0=1600c_{0}=1600, c1=64c_{1}=64 and c2=(256)2c_{2}=(256)^{2}, but since we have not made any real attempt of optimizing the choice of constants – because identifying the correct rate is the main focus of this note – we will not keep track of the values of constants in what follows.

One example in which Theorem 1.3 may be used is the regression problem with a misspecified model: Y=f0(X)+WY=f_{0}(X)+W where the regression function f0f_{0} may not be in the model span(F)\operatorname{span}(F) and ζ=(f0f)(X)+W\zeta=(f_{0}-f^{*})(X)+W has a fourth moment. If span(F)\operatorname{span}(F) satisfies (4), then with high probability,

The standard way of analyzing the performance of ERM is via certain trade-offs between concentration and complexity. However, in the case we study here, the functions involved may have ‘heavy tails’, and empirical means do not exhibit strong, two-sided concentration around their true means – which is a crucial component in the standard method of analysis. Therefore, a completely different path must be taken if one is to obtain the results formulated above.

The method we shall employ here has been introduced in for problems in Learning Theory; in in the context of the geometry of convex bodies; in for applications in random matrix theory; and in for Compressed Sensing.

Obviously, and regardless of the method of analysis, the (seemingly) unsatisfactory probability estimate is the price one pays for the moment assumptions on the ‘noise’ Yf(X)Y-f^{*}(X). The next result shows that without stronger moment assumptions, only weak polynomial probability estimates are true.

Finally, we would like to address the problem of linear aggregation under the classical boundedness assumptions: that Y1|Y|\leq 1 and f(X)1|f(X)|\leq 1 almost surely for every fFf\in F.

These are the standard assumptions that have been considered for the three problems of aggregation with a random design. For instance, optimal rates of aggregation have been obtained under these assumptions for the model selection aggregation problem in and for the convex aggregation problem in . And, it has been established that while ERM is suboptimal for the model selection aggregation problem (see, e.g., Section 3.5 in or ), it is optimal for the convex aggregation problem. However, the optimality of ERM in the linear aggregation problem under the boundedness assumption was left open. The final result of this article addresses that problem – and it turns out that the answer is negative in a very strong way.

For every 0<η<10<\eta<1 and integers NN and MM, there exists a couple (X,Y)(X,Y) and a dictionary F={f1,,fM}F=\{f_{1},\ldots,f_{M}\} with the following properties:

Y1|Y|\leq 1 almost surely and f(X)1|f(X)|\leq 1 almost surely for every fFf\in F.

With probability at least η\eta, for every κ>0\kappa>0 there is some

Proposition 1.6 shows that even if one assumes that Y1|Y|\leq 1 and f(X)1|f(X)|\leq 1 almost surely for every function in the dictionary, and despite the convexity of span(F)\operatorname{span}(F), the empirical risk minimization procedure performs poorly. This illustrates the major difference between assuming that the class is well bounded in LL_{\infty} and assuming that the L2L_{2} and LpL_{p} norms are equivalent on its span: while the latter suffices for an optimal bound, the former is rather useless.

An obvious outcome of Proposition 1.6 is that ERM should not be used to solve the linear aggregation problem under the boundedness assumption and one has to look for different procedures in the bounded setup. It should also be noted that since Proposition 1.6 is a non-asymptotic lower bound and XX may depend on NN and MM, the asymptotic result appearing in Theorem 2.1 in does not apply here.

thus, R(f)R(f)=PL(X,Y)0R(f)-R(f^{*})=P\mathcal{L}(X,Y)\geq 0. The empirical measure over the data is denoted by PNP_{N} and

Finally, all absolute constants are denoted by c1,c2c_{1},c_{2}, etc. Their value may change from line to line. We write ABA\lesssim B if there is an absolute constant cc for which AcBA\leq cB, and AαBA\lesssim_{\alpha}B if Ac(α)BA\leq c(\alpha)B for a constant cc that depends only on α\alpha.

Proofs of Theorem A and Theorem 1.3

The starting point of the proof of Theorem A is the same as in : a decomposition of the excess loss function

to a sum of quadratic and linear terms in (ff)(X)(f-f^{*})(X). The idea of the proof is to control the quadratic term from below using a ‘small-ball’ argument, and the linear term from above using standard methods from empirical processes theory. A combination of these two bounds suffices to show that if ffL2rN\|f-f^{*}\|_{L_{2}}\geq r_{N}^{*} for an appropriate choice of rNr_{N}^{*}, the quadratic term dominates the linear one, and in particular, for such functions PNLf>0P_{N}\mathcal{L}_{f}>0. Since the empirical excess loss of the empirical minimizer is non-positive, it follows that f^fL2<rN\|\hat{f}-f^{*}\|_{L_{2}}<r_{N}^{*}.

There exists an absolute constant c0c_{0} for which the following holds. Assume that there are κ0\kappa_{0} and β0\beta_{0} for which

for every fspan(F)f\in\operatorname{span}(F). If Nc0M/β02N\geq c_{0}M/\beta_{0}^{2}, then with probability at least 1exp(β02N/4)1-\exp(-\beta_{0}^{2}N/4), for every fspan(F)f\in\operatorname{span}(F),

Since NN independent copies of XX, X1,,XNX_{1},\ldots,X_{N}, endow NN independent copies of WW, denoted by W1,,WNW_{1},\ldots,W_{N}, it follows that

By the bounded differences inequality (see, e.g., Theorem 6.2 in ), with probability at least 1exp(x2/2)1-\exp(-x^{2}/2),

(one may show the c1100c_{1}\leq 100 using a rough estimate on Dudley’s entropy integral combined with Exercise 2.6.4 in ). Therefore, if c1M/Nβ0/4c_{1}\sqrt{M/N}\leq\beta_{0}/4 and (1/2)x/N=β0/4(1/2)\sqrt{x/N}=\beta_{0}/4, then with probability at least 1exp(β02N/4)1-\exp(-\beta_{0}^{2}N/4), Hβ0/2H\leq\beta_{0}/2.

it follows that on the event {Hβ0/2}\{H\leq\beta_{0}/2\},

Therefore, (9) holds with probability at least 1exp(β02N/4)1-\exp(-\beta_{0}^{2}N/4). ∎

Let ζ=Yf(X)\zeta=Y-f^{*}(X) and assume that one of the following two conditions hold:

Then, for every x>0x>0, with probability larger than 1(1/x)1-(1/x),

Let ε1,,εN\varepsilon_{1},\ldots,\varepsilon_{N} be independent Rademacher variables that are also independent of the couples (Xi,Yi)i=1N(X_{i},Y_{i})_{i=1}^{N}. A standard symmetrization argument shows that

If Z1,,ZNZ_{1},\ldots,Z_{N} are independent copies of ZZ, it follows that

The claim now follows from Markov’s inequality. ∎

Proof of Theorem A Combining Lemma 2.1 and Lemma 2.2 when Nc0M/β02N\geq c_{0}M/\beta_{0}^{2}, it follows that with probability at least 1exp(β02N/4)(1/x)1-\exp(-\beta_{0}^{2}N/4)-(1/x), if fspan(F)f\in\operatorname{span}(F) and

Proof of Theorem 1.3 The proof of Theorem 1.3 is almost identical to the proof of Theorem A, and we will only outline the minor differences.

The small-ball condition (4) follows from the Paley–Zygmund inequality (see, for instance, Proposition 3.3.1 in ): if VV is a real-valued random variable then

and thus the assertion of Lemma 2.1 holds for κ0=1/2\kappa_{0}=1/2 and β0=(4θ04)1\beta_{0}=(4\theta_{0}^{4})^{-1}.

As for the analogous version of Lemma 2.2, the one change in its proof is that

Proof of Proposition 1.6

Fix Y=1Y=1 as the target and let X=j=0MXj\mathcal{X}=\bigcup_{j=0}^{M}\mathcal{X}_{j} be some partition of X\mathcal{X}. Consider a random variable XX which is distributed as follows: fix kMk\geq M to be chosen later; for 1jM1\leq j\leq M, set P(XXj)=1kP(X\in\mathcal{X}_{j})=\frac{1}{k} and put P(XX0)=1MkP(X\in\mathcal{X}_{0})=1-\frac{M}{k}.

Note that Y1|Y|\leq 1 almost surely and that for every fFf\in F, f(X)1|f(X)|\leq 1 almost surely. It is straightforward to verify that the oracle in span(F)\operatorname{span}(F) is f=j=1Mfj()f^{*}=\sum_{j=1}^{M}f_{j}(\cdot), and thus

Let X1,,XNX_{1},\ldots,X_{N} be independent copies of XX. Given 0<η<10<\eta<1 and kk large enough (for instance, kc(η)N/logMk\geq c(\eta)N/\log M for a sufficiently large constant c(η)c(\eta) would suffice), there exists an event Ω0\Omega_{0} of probability at least η\eta on which the following holds: there exists j0{1,,M}j_{0}\in\{1,\ldots,M\} for which XiXj0X_{i}\notin\mathcal{X}_{j_{0}} for every 1iN1\leq i\leq N (this is a slight modification of the coupon-collector problem).

For every sample in Ω0\Omega_{0}, let j0{1,,N}j_{0}\in\{1,\ldots,N\} be the index for which XiXj0X_{i}\notin\mathcal{X}_{j_{0}} for every 1iN1\leq i\leq N. Therefore,

and the claim follows by selecting ξ\xi large enough.

Appendix

We begin by presenting a proof of the well-known fact that if the LL_{\infty} and L2L_{2} norms are B\sqrt{B}-equivalent on the span of MM linearly-independent functions, then BMB\geq M.

Let F={f1,,fM}L2F=\{f_{1},\ldots,f_{M}\}\subset L_{2} be a dictionary whose span is of dimension MM, and recall that

For μ\mu-almost every xXx\in\mathcal{X},

Hence, for μ\mu-almost every xXx\in\mathcal{X},

Let us begin by showing that, conditionally on ζ1,,ζN\zeta_{1},\ldots,\zeta_{N}, and if σ^N2=1Ni=1Nζi2\hat{\sigma}_{N}^{2}=\frac{1}{N}\sum_{i=1}^{N}\zeta_{i}^{2}, then with probability at least 12exp(c0N)1-2\exp(-c_{0}N),

and that for every sample, if r1<r2r_{1}<r_{2} and

one has that for any ζ1,,ζN\zeta_{1},\ldots,\zeta_{N}

for suitable absolute constants c1c_{1} and c2c_{2}. We refer the reader to Lemma 2.6.4 and Theorem 2.6.5 in for more details on the techniques used to obtain these observations.

Hence, by (13), it follows that for Nc0MN\geq c_{0}M and conditioned on ζ1,,ζN\zeta_{1},\ldots,\zeta_{N}, with probability at least 12exp(c3N)1-2\exp(-c_{3}N),

Now, all that remains is to show that P(σ^N2x)c5/xP(\hat{\sigma}_{N}^{2}\geq x)\geq c_{5}/x.

For every N2N\geq 2 and x1x\geq 1, there exists a mean-zero, variance one random variable ζ\zeta for which

Let ζi=Rεiηi\zeta_{i}=R\varepsilon_{i}\eta_{i}, i=1,,Ni=1,\ldots,N be independent copies of ζ\zeta. Recall that NR2x=1NR^{-2}x=1 and that δN1\delta N\leq 1. Therefore,

Acknowledgment

Shahar Mendelson was supported by the Mathematical Sciences Institute – The Australian National University and by ISF Grant 900/10.

References