Loss minimization and parameter estimation with heavy tails

Daniel Hsu, Sivan Sabato

Introduction

The minimax principle in statistical estimation prescribes procedures (i.e., estimators) that minimize the worst-case risk over a large class of distributions generating the data. For a given loss function, the risk is the expectation of the loss of the estimator, where the expectation is taken over the data examined by the estimator. For example, for a large class of loss functions including squared loss, the empirical mean estimator minimizes the worst-case risk over the class of Gaussian distributions with known variance (Wolfowitz, 1950). In fact, Gaussian distributions with the specified variance are essentially the worst-case family of distributions for squared loss, at least up to constants (see, e.g., Catoni, 2012, Proposition 6.1).

In this work, we are interested in estimators whose deviations from expected behavior are controlled with very high probability over the random draw of the data examined by the estimator. Deviations of the behavior of the estimator from its expected behavior are worrisome especially when data come from unbounded and/or heavy-tail distributions, where only very low order moments may be finite. For example, the Pareto distributions with shape parameter α>0\alpha>0 are unbounded and have finite moments only up to orders <α<\alpha; these distributions are commonly associated with the modeling of extreme events that manifest in data. Bounds on the expected behavior of an estimator are insufficient in these cases, since the high-probability guarantees that may be derived from such bounds (say, using Markov’s inequality) are rather weak. For example, if the risk (i.e., expected loss) of an estimator is bounded by ϵ\epsilon, then all that we may derive from Markov’s inequality is that the loss is no more than ϵ/δ\epsilon/\delta with probability at least 1δ1-\delta. For small values of δ(0,1)\delta\in(0,1), the guarantee is not very reassuring, but it may be all one can hope for in these extreme scenarios—see Remark 7 in Section 3.1 for an example where this is tight. Much of the work in statistical learning theory is also primarily concerned with such high probability guarantees, but the bulk of the work makes either boundedness or subgaussian tail assumptions that severely limit the applicability of the results even in settings as simple as linear regression (see, e.g., Srebro et al., 2010; Shamir, 2014).

Recently, it has been shown that it is possible to improve on methods which are optimal for expected behavior but suboptimal when high-probability deviations are concerned (Audibert and Catoni, 2011; Catoni, 2012; Brownlees et al., 2014). These improvements, which are important when dealing with heavy-tailed distributions, suggest that new techniques (e.g., beyond empirical risk minimization) may be able to remove the reliance on boundedness or control of high-order moments. Bubeck et al. (2013) show how a more robust mean estimator can be used for solving the stochastic multi-armed bandit problem under heavy-tailed distributions.

This work applies and generalizes a technique for controlling large deviations from the expected behavior with high probability, assuming only bounded low-order moments such as variances. We show that the technique is applicable to minimization of smooth and strongly convex losses, and derive specific loss bounds for least squares linear regression, which match existing rates, but without requiring the noise or covariates to be bounded or subgaussian. This contrasts with several recent works (Srebro et al., 2010; Hsu et al., 2014; Shamir, 2014) concerned with (possibly regularized) empirical risk minimizers that require such assumptions. It is notable that in finite dimensions, our result implies that a constant factor approximation to the optimal loss can be achieved with a sample size that is independent of the size of the optimal loss. This improves over the recent work of Mahdavi and Jin (2013), which has a logarithmic dependence on the optimal loss, as well as a suboptimal dependence on specific problem parameters (namely condition numbers). We also provide a new generalization of the basic technique for general metric spaces, which we apply to least squares linear regression with heavy tail covariate and noise distributions, yielding an improvement over the computationally expensive procedure of Audibert and Catoni (2011).

The basic technique, found in the textbook of Nemirovsky and Yudin (1983, p. 243), is very simple, and can be viewed as a generalization of the median-of-means estimator used by Alon et al. (1999) and many others. The idea is to repeat an estimate several times, by splitting the sample into several groups, and then selecting a single estimator out of the resulting list of candidates. If an estimator from one group is good with noticeably better-than-fair chance, then the selected estimator will be good with probability exponentially close to one. This is remininscant of techniques from robust statistics (Huber, 1981), although our aim is expressly different in that our aim is good performance on the same probability distribution generating the data, rather than an uncontaminated or otherwise better behaved distribution. Our new technique can be cast as a simple selection problem in general metric spaces that generalizes the scalar median.

We demonstrate the versatility of our technique by giving further examples in sparse linear regression (Tibshirani, 1996) under heavy-tailed noise and low-rank covariance covariance matrix approximation (Koltchinskii et al., 2011) under heavy-tailed covariate distributions. We also show that for prediction problems where there may not be a reasonable metric on the predictors, one can achieve similar high-probability guarantees by using median aggregation in the output space.

The initial version of this article (Hsu and Sabato, 2013, 2014) appeared concurrently with the simultaneous and independent work of Minsker (2013), which develops a different generalization of the median-of-means estimator for Banach and Hilbert spaces. We provide a new analysis and comparison of this technique to ours in Section 7. We have also since become aware of the earlier work by Lerasle and Oliveira (2011), which applies the median-of-means technique to empirical risks in various settings much like the way we do in Algorithm 3, although our metric formulation is more general. Finally, the recent work of Brownlees et al. (2014) vastly generalizes the techniques of Catoni (2012) to apply to much more general settings, although they retain some of the same deficiencies (such as the need to know the noise variance for the optimal bound for least squares regression), and hence their results are not directly comparable to ours.

Overview of Main Results

This section gives an overview of the main results.

In this work, we are interested in performance guarantees that hold with high probability over the random draw of the input sample and any internal randomization used by the estimation algorithm. Thus, for a given allowed probability of failure δ(0,1)\delta\in(0,1), we study excess loss L(w^)LL(\hat{{{\boldsymbol{w}}}})-L_{\operatorname{\star}} achieved by the predictor w^w^(δ)\hat{{{\boldsymbol{w}}}}\equiv\hat{{{\boldsymbol{w}}}}(\delta) returned by the algorithm on a 1δ1-\delta probability subset of the sample space. Ideally, the excess loss only depends sub-logarithmically on 1/δ1/\delta, which is the dependence achieved when the distribution of the excess loss has exponentially decreasing tails. Note that we assume that the value of δ\delta is provided as input to the estimation algorithm, and only demand the probabilistic guarantee for this given value of δ\delta. Therefore, strictly speaking, the excess loss need not exhibit exponential concentration. Nevertheless, in this article, we shall say that an estimation algorithm achieves exponential concentration whenever it guarantees, on input δ\delta, an excess loss that grows only as log(1/δ)\log(1/\delta).

2 Robust Distance Approximation

Consider an estimation problem, where the goal is to estimate an unknown parameter of the distribution, using a random i.i.d. sample from that distribution. We show throughout this work that for many estimation problems, if the sample is split into non-overlapping subsamples, and estimators are obtained independently from each subsample, then with high probability, this generates a set of estimators such that some fraction of them are close, under a meaningful metric, to the true, unknown value of the estimated parameter. Importantly, this can be guaranteed in many cases even under under heavy-tailed distributions.

In some cases, learning with heavy-tailed distributions requires using a metric that depends on the distribution. Then, the Robust Distance Estimation procedure has access only to noisy measurements of distances in the metric space, and is required to succeed with high probability. In Section 3 we formalize these notions, and provide simple implementations of Robust Distance Approximation for general metric spaces, with and without direct access to the metric. For the case of direct access to the metric our formulation is similar to that of Nemirovsky and Yudin (1983).

3 Convex Loss Minimization

The general approach to estimation described above has many applications. We give here the general form of our main results for applications, and defer the technical definitions and results to the relevant sections. Detailed discussion of related work for each application is also provided in the appropriate sections.

the dual norm {\|\cdot\|}_{*} is γ\gamma-smooth;

there exists α>0\alpha>0 and sample size nαn_{\alpha} such that, with probability at least 1/21/2, the empirical loss wL^(w){{\boldsymbol{w}}}\mapsto\hat{L}({{\boldsymbol{w}}}) is α\alpha-strongly convex with respect to {\|\cdot\|} whenever the sample is of size at least nαn_{\alpha};

nClog(1/δ)nαn\geq C\log(1/\delta)\cdot n_{\alpha} for some universal constant C>0C>0;

wL(w){{\boldsymbol{w}}}\mapsto L({{\boldsymbol{w}}}) is βˉ\bar{\beta}-smooth with respect to {\|\cdot\|};

then with probability at least 1δ1-\delta, for another universal constant C>0C^{\prime}>0,

This gives a constant approximation of the optimal loss with a number of samples that does not depend on the value of the optimal loss. The full results for smooth convex losses are provided in Section 4. Theorem 2 is stated in full as Corollary 16, and we further provide a result with more relaxed smoothness requirements. As apparent in the result, the only requirements on the distribution are those that are implied by the strong convexity and smoothness parameters. This allows support for fairly general heavy-tailed distributions, as we show below.

4 Least Squares Linear Regression

LsqL^{\operatorname{sq}} and LsqL^{\operatorname{sq}}_{\operatorname{\star}} are defined similarly to LL and LL_{\operatorname{\star}}.

Unlike standard high-probability bounds for regression, we give bounds that make no assumption on the range or the tails of the distribution of the response variables, other than a trivial requirement that the optimal squared loss be finite. The assumptions on the distribution of the covariates are also minimal.

This theorem is stated in full as Theorem 19 in Section 5. Under standard finite fourth-moment conditions, this result translates to the bound

with probability 1δ\geq 1-\delta. These results improve over recent results by Audibert and Catoni (2011), Catoni (2012), and Mahdavi and Jin (2013). We provide a full comparison to related work in Section 5.

Theorem 3 can be specialized for specific cases of interest. For instance, suppose X{{\boldsymbol{X}}} is bounded and well-conditioned in the sense that there exists R<R<\infty such that Pr[XΣ1XR2]=1\Pr[{{\boldsymbol{X}}}^{\scriptscriptstyle{\top}}{\varSigma}^{-1}{{\boldsymbol{X}}}\leq R^{2}]=1, but YY may still be heavy-tailed. Under this assumption we have the following result.

This theorem is stated in full as Theorem 20 in Section 5. Note that

5 Other Applications, Comparisons, and Extensions

The general method studied here allows handling heavy tails in other applications as well. We give two examples in Section 6. First, we consider parameter estimation using L1L^{1}-regularized linear least squares regression (Lasso) under random subgaussian design. We show that using the above approach, parameter estimation bounds can be guaranteed for general bounded variance noise, including heavy-tailed noise. This contrasts with standard results that assume sub-Gaussian noise. Second, we show that low-rank covariance matrix approximation can be obtained for heavy-tailed distributions, under a bounded 4+ϵ4+\epsilon moment assumption. These two applications have been analyzed also in the independent and simultaneous work of Minsker (2013).

All the results above are provided using a specific solution to the Robust Distance Approximation problem, which is easy to implement for any metric space. For the case of a fully known metric, in a Banach or a Hilbert space, Minsker (2013) proposed a different solution, which is based on the geometric median. In Section 7, we provide a detailed comparison of the approximation factor achieved by each approach, as well as some general lower bounds. Several interesting open questions remain regarding this general problem.

Lastly, in Section 8, we give a short proof to the intuitive fact that in some prediction problem, one can replace Robust Distance Approximation with taking the median of the predictions of the input estimators. This gives a possible improper-learning algorithm for relevant learning settings.

All of the techniques we have developed in this work are simple enough to implement and empirically evaluate, and indeed in some simulated experiments, we have verified the improvements over standard methods such as the empirical mean when the data follow heavy-tailed distributions. However, at present, the relatively large constant factors in our bounds are real enough to restrict the empirical improvements only to settings where very high confidence (i.e., small values of δ\delta) is required. By contrast, with an appropriately determined noise variance, the techniques of Catoni (2012) and Brownlees et al. (2014) may yield improvements more readily. Nevertheless, since our techniques are more general in some respects, it is worth investigating whether they can be made more practical (e.g., with greater sample reuse or overlapping groups), and we plan to do this in future work.

The Core Techniques

In this section we present the core technique used for achieving exponential concentration. We first demonstrate the underlying principle via the median-of-means estimator, and then explain the generalization to arbitrary metric spaces. Finally, we show a new generalization that supports noisy feature measurements.

We first motivate the estimation procedure by considering the special case of estimating a scalar population mean using a median-of-means estimator, given in Algorithm 1. This estimator, heavily used in the streaming algorithm literature (Alon et al., 1999) (though a similar technique also appears in Nemirovsky and Yudin (1983) as noted in Levin (2005)), partitions a sample into kk equal-size groups, and returns the median of the sample means of each group. Note that the possible non-uniqueness of the median does not affect the result; the arguments below apply to any one of them. The input parameter kk should be thought of as a constant determined by the desired confidence level (i.e., k=Θ(log(1/δ))k=\Theta(\log(1/\delta)) for confidence δ(0,1)\delta\in(0,1)). It is well known that the median-of-means achieves estimation with exponential concentration. The following proposition gives a simple statement and proof. The constant 66 in the statement (see Eq. (1) below) is lower the constant in the analysis of Lerasle and Oliveira (2011, Proposition 1), which is 26e8.082\sqrt{6e}\approx 8.08, but we require a larger value of nn. By requiring an even larger nn, the constant in the statement below can approach 333\sqrt{3}.

Let xx be a random variable with mean μ\mu and variance σ2<\sigma^{2}<\infty, and let SS be a set of nn independent copies of xx. Assume kn/2k\leq n/2. With probability at least 1ek/4.51-e^{-k/4.5}, the estimate μ^\hat{\mu} returned by Algorithm 1 on input (S,k)(S,k) satisfies μ^μσ8k/n|\hat{\mu}-\mu|\leq\sigma\sqrt{8k/n}. Therefore, if k=4.5log(1/δ)k=4.5\lceil\log(1/\delta)\rceil and n18log(1/δ)n\geq 18\lceil\log(1/\delta)\rceil, then with probability at least 1δ1-\delta,

Catoni (2012) proposes a mean estimator μ^\hat{\mu} that satisfies μ^μ=O(σlog(1/δ)/n)|\hat{\mu}-\mu|=O(\sigma\sqrt{\log(1/\delta)/n}) with probability at least 1δ1-\delta. Remarkably, the leading constant in the bound is asymptotically optimal: it approaches 2\sqrt{2} as nn\to\infty. However, the estimator takes both δ\delta and σ\sigma as inputs. Catoni also presents an estimator that takes only σ\sigma as an input; this estimator guarantees a O(σlog(1/δ)/n)O(\sigma\log(1/\delta)/\sqrt{n}) bound for all values of δ>exp(1n/2)\delta>\exp(1-n/2) simultaneously.

Catoni (2012) shows that the empirical mean cannot provide a qualitatively similar guarantee. Specifically, for any σ>0\sigma>0 and δ(0,1/(2e))\delta\in(0,1/(2e)), there is a distribution with mean zero and variance σ2\sigma^{2} such that the empirical average μ^emp\hat{\mu}_{\operatorname{emp}} of nn i.i.d. draws satisfies

Therefore the deviation of the empirical mean necessarily scales with 1/δ1/\sqrt{\delta} rather than log(1/δ)\sqrt{\log(1/\delta)} (with probability Ω(δ)\Omega(\delta)).

2 Generalization to Arbitrary Metric Spaces

The first abstraction captures the generation of candidate solutions obtained from independent subsamples. We assume there is an oracle APPROXρ,ε\operatorname{APPROX}_{\rho,{\varepsilon}} which satisfies the following assumptions.

Note that the 2/32/3 could be replaced by another constant larger than half; we have not optimized the constants. The second assumption regards statistical independence. For an integer kk, let w1,,wk{{\boldsymbol{w}}}_{1},\ldots,{{\boldsymbol{w}}}_{k} be responses to kk separate queries to APPROXρ,ε\operatorname{APPROX}_{\rho,{\varepsilon}}.

w1,,wk{{\boldsymbol{w}}}_{1},\ldots,{{\boldsymbol{w}}}_{k} are statistically independent.

The proposed procedure, given in Algorithm 2, generates kk candidate solutions by querying APPROXρ,ε\operatorname{APPROX}_{\rho,{\varepsilon}} kk times, and then selecting a single candidate using a generalization of the median. Specifically, for each i[k]i\in[k], the smallest ball centered at wi{{\boldsymbol{w}}}_{i} that contains more than half of {w1,w2,,wk}\{{{\boldsymbol{w}}}_{1},{{\boldsymbol{w}}}_{2},\dotsc,{{\boldsymbol{w}}}_{k}\} is determined; the wi{{\boldsymbol{w}}}_{i} with the smallest such ball is returned. If there are multiple such wi{{\boldsymbol{w}}}_{i} with the smallest radius ball, any one of them may be selected. This selection method is a Robust Distance Approximation procedure. The proof is given below and illustrated in Figure 1. Nemirovsky and Yudin (1983) proposed a similar technique, however their formulation relies on knowledge of ε{\varepsilon}.

Let ri:=min{r0:Bρ(wi,r)W>k/2}r_{i}:=\min\{r\geq 0:|B_{\rho}({{\boldsymbol{w}}}_{i},r)\cap W|>k/2\}. Selecting wi{{\boldsymbol{w}}}_{i_{\operatorname{\star}}} such that i=argminirii_{\operatorname{\star}}=\operatorname*{argmin}_{i}r_{i} is a Robust Distance Approximation procedure with C0=3C_{0}=3.

Proof Assume that Δ(w,0)ε\Delta(w_{\operatorname{\star}},0)\leq{\varepsilon}. Then Bρ(w,ε)W>k/2|B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon})\cap W|>k/2. For any vBρ(w,ε)W{{\boldsymbol{v}}}\in B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon})\cap W, by the triangle inequality, Bρ(v,2ε)W>k/2|B_{\rho}({{\boldsymbol{v}}},2{\varepsilon})\cap W|>k/2. This implies that ri2εr_{i_{\operatorname{\star}}}\leq 2{\varepsilon}, and so Bρ(wi,2ε)W>k/2|B_{\rho}({{\boldsymbol{w}}}_{i_{\operatorname{\star}}},2{\varepsilon})\cap W|>k/2. By the pigeonhole principle, Bρ(w,ε)Bρ(wi,2ε)B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon})\cap B_{\rho}({{\boldsymbol{w}}}_{i_{\operatorname{\star}}},2{\varepsilon})\neq\emptyset. Therefore, by the triangle inequality again, ρ(w,wi)3ε\rho({{\boldsymbol{w}}}_{\operatorname{\star}},{{\boldsymbol{w}}}_{i_{\operatorname{\star}}})\leq 3{\varepsilon}.

Again, the number of candidates kk determines the resulting confidence level. The following theorem provides a guarantee for Algorithm 2. We note that the resulting constants here might not be optimal in specific applications, since they depend on the arbitrary constant in Assumption 1.

3 Random Distance Measurements

Zj=1Z_{j}=1 indicates that fjf_{j} provides a sufficiently accurate estimate of the distances from wj{{\boldsymbol{w}}}_{j}. Note that fjf_{j} need not correspond to a metric. We assume the following.

For any j[k]j\in[k], Pr[Zj=1]8/9\Pr[Z_{j}=1]\geq 8/9.

We further require the following independence assumption.

The random variables Z1,,ZkZ_{1},\ldots,Z_{k} are statistically independent.

Note that there is no assumption on the statistical relationship between Z1,,ZkZ_{1},\ldots,Z_{k} and w1,,wk{{\boldsymbol{w}}}_{1},\ldots,{{\boldsymbol{w}}}_{k}.

Algorithm 3 is a variant of Algorithm 2 that simply replaces computation of ρ\rho-distances with computations using the functions returned by querying the DISTρj\operatorname{DIST}^{j}_{\rho}’s. The resulting selection procedure is, with high probability, a Robust Distance Approximation.

Consider a run of Algorithm 3, with output w^\hat{{{\boldsymbol{w}}}}. Let Z1,,ZkZ_{1},\ldots,Z_{k} as defined above, and suppose that Assumption 3 and Assumption 4 hold. Then, with probability at least 1ek/6481-e^{-k/648},

where W={w1,,wk}W=\{{{\boldsymbol{w}}}_{1},\ldots,{{\boldsymbol{w}}}_{k}\}.

Proof By Assumptions 3 and 4, and by Hoeffding’s inequality,

Assume this event holds, and denote ε=ΔW(w,536){\varepsilon}=\Delta_{W}({{\boldsymbol{w}}}_{\operatorname{\star}},\frac{5}{36}). We have B(w,ε)W2336k|B({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon})\cap W|\geq\frac{23}{36}k.

Let i[k]i\in[k] such that wiBρ(w,ε){{\boldsymbol{w}}}_{i}\in B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon}). Then, for any j[k]j\in[k] such that wjBρ(w,ε){{\boldsymbol{w}}}_{j}\in B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon}), by the triangle inequality ρ(wi,wj)2ε\rho({{\boldsymbol{w}}}_{i},{{\boldsymbol{w}}}_{j})\leq 2{\varepsilon}. There are at least 2336k\frac{23}{36}k such indices jj, therefore for more than k/2k/2 of the indices jj, we have

For jj such that this holds, by the definition of ZjZ_{j}, fj(wi)4ε.f_{j}({{\boldsymbol{w}}}_{i})\leq 4{\varepsilon}. It follows that ri:=median{fj(wi)j[k]}4ϵr_{i}:=\operatorname{median}\{f_{j}({{\boldsymbol{w}}}_{i})\mid j\in[k]\}\leq 4\epsilon.

Now, let i[k]i\in[k] such that wiB(w,9ϵ){{\boldsymbol{w}}}_{i}\notin B({{\boldsymbol{w}}}_{\operatorname{\star}},9\epsilon). Then, for any j[k]j\in[k] such that wjBρ(w,ε){{\boldsymbol{w}}}_{j}\in B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon}), by the triangle inequality ρ(wi,wj)ρ(w,wi)ρ(w,wj)>8ε\rho({{\boldsymbol{w}}}_{i},{{\boldsymbol{w}}}_{j})\geq\rho({{\boldsymbol{w}}}_{\operatorname{\star}},{{\boldsymbol{w}}}_{i})-\rho({{\boldsymbol{w}}}_{\operatorname{\star}},{{\boldsymbol{w}}}_{j})>8{\varepsilon}. As above, for more than k/2k/2 of the indices jj,

For jj such that this holds, by the definition of ZjZ_{j}, fj(wi)>4ε.f_{j}({{\boldsymbol{w}}}_{i})>4{\varepsilon}. It follows that ri:=median{fj(wi)j[k]}>4ϵr_{i}:=\operatorname{median}\{f_{j}({{\boldsymbol{w}}}_{i})\mid j\in[k]\}>4\epsilon.

By Eq. (3), We conclude that with probability at least 1exp(k/648)1-\exp(-k/648),

ri4εr_{i}\leq 4{\varepsilon} for all wiWBρ(w,ε){{\boldsymbol{w}}}_{i}\in W\cap B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},{\varepsilon}), and

ri>4εr_{i}>4{\varepsilon} for all wiWBρ(w,9ε){{\boldsymbol{w}}}_{i}\in W\setminus B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},9{\varepsilon}).

In this event the wiW{{\boldsymbol{w}}}_{i}\in W with the smallest rir_{i} satisfies wiBρ(w,9ε){{\boldsymbol{w}}}_{i}\in B_{\rho}({{\boldsymbol{w}}}_{\operatorname{\star}},9{\varepsilon}).

The properties of the approximation procedure and of APPROXρ,ϵ\operatorname{APPROX}_{\rho,\epsilon} are combined to give a guarantee for Algorithm 3.

Minimizing Strongly Convex Losses

In this section we apply the core techniques to the problem of approximately minimizing strongly convex losses, which includes least squares linear regression as a special case.

In other words, the sample TT induces a loss LTL_{T} which is α\alpha-strongly convex around w{{\boldsymbol{w}}}_{\operatorname{\star}}.Technically, we only need the sample size to guarantee Eq. (4) for all wB(w,r){{\boldsymbol{w}}}\in B_{\|\cdot\|}({{\boldsymbol{w}}}_{\operatorname{\star}},r) for some r>0r>0. We assume that nα<n_{\alpha}<\infty for some α>0\alpha>0.

We use the following facts in our analysis.

2 Subsampled Empirical Loss Minimization

The following lemma proves that Assumption 1 holds under these assumptions with

Let ε{\varepsilon} be as defined in Eq. (5). Assume kn/4k\leq n/4, and that SS is an i.i.d. sample from D\mathcal{D} of size nn such that n/knα\lfloor n/k\rfloor\geq n_{\alpha}. Then subsampled empirical loss minimization using the sample SS is a correct implementation of APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}} for up to kk queries.

where the last inequality follows from the definition of the dual norm, and the optimality of wi{{\boldsymbol{w}}}_{i} on LSiL_{S_{i}}. Rearranging and combining with the above probability inequality implies

Combining Lemma 14 and Proposition 9 gives the following theorem.

Let nαn_{\alpha} be as defined in Section 4.1, and assume that {\|\cdot\|}_{*} is γ\gamma-smooth. Also, assume k:=18log(1/δ)k:=18\lceil\log(1/\delta)\rceil, n72log(1/δ)n\geq 72\lceil\log(1/\delta)\rceil, and that SS is an i.i.d. sample from D\mathcal{D} of size nn such that n/knα\lfloor n/k\rfloor\geq n_{\alpha}. Finally, assume Algorithm 3 uses the subsampled empirical loss minimization to implement APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}}, where ε{\varepsilon} is as in Eq. (5). Then with probability at least 1δ1-\delta, the parameter w^\hat{{\boldsymbol{w}}} returned by Algorithm 2 satisfies

Assume the same conditions as Theorem 15, and also that:

wL(w){{\boldsymbol{w}}}\mapsto L({{\boldsymbol{w}}}) is βˉ\bar{\beta}-smooth with respect to {\|\cdot\|}.

Then with probability at least 1δ1-\delta,

Corollary 16 implies that for smooth losses, Algorithm 2 provides a constant factor approximation to the optimal loss with a sample size max{nα,γββˉ/α2}O(log(1/δ))\max\{n_{\alpha},\gamma\beta\bar{\beta}/\alpha^{2}\}\cdot O(\log(1/\delta)) (with probability at least 1δ1-\delta). In subsequent sections, we exemplify cases where the two arguments of the max\max are roughly of the same order, and thus imply a sample size requirement of O(γβˉβ/α2log(1/δ))O(\gamma\bar{\beta}\beta/\alpha^{2}\log(1/\delta)). Note that there is no dependence on the optimal loss L(w)L({{\boldsymbol{w}}}_{\operatorname{\star}}) in the sample size, and the algorithm has no parameters besides k=O(log(1/δ))k=O(\log(1/\delta)).

We can also obtain a variant of Theorem 15 based on Algorithm 3 and Theorem 11, in which we assume that there exists some sample size nk,DISTn_{k,\operatorname{DIST}_{\|\cdot\|}} that allows DIST\operatorname{DIST}_{\|\cdot\|} to be correctly implemented using an i.i.d. sample of size at least nk,DISTn_{k,\operatorname{DIST}_{\|\cdot\|}}. Under such an assumption, essentially the same guarantee as in Theorem 15 can be afforded to Algorithm 3 using the subsampled empirical loss minimization to implement APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}} (for ε{\varepsilon} as in Eq. (5)) and the assumed implementation of DIST\operatorname{DIST}_{\|\cdot\|}. Note that since Theorem 11 does not require APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}} and DIST\operatorname{DIST}_{\|\cdot\|} to be statistically independent, both can be implemented using the same sample.

Let nαn_{\alpha} be as defined in Section 4.1, nk,DISTn_{k,\operatorname{DIST}_{\|\cdot\|}} be as defined above, and assume that {\|\cdot\|}_{*} is γ\gamma-smooth. Also, assume k:=648log(2/δ)k:=648\lceil\log(2/\delta)\rceil, SS is an i.i.d. sample from D\mathcal{D} of size nn such that nmax{4k,nk,DIST}n\geq\max\{4k,n_{k,\operatorname{DIST}_{\|\cdot\|}}\}, and n/knα\lfloor n/k\rfloor\geq n_{\alpha}. Further, assume Algorithm 3 implements APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}} using SS with subsampled empirical loss minimization, where ε{\varepsilon} is as in Eq. (5), and implements DIST\operatorname{DIST}_{\|\cdot\|} using SS as well. Then with probability at least 1δ1-\delta, the parameter w^\hat{{\boldsymbol{w}}} returned by Algorithm 3 satisfies

with probability at least 2δ2\delta. Therefore empirical risk minimization cannot provide a qualitatively similar guarantee as Corollary 16. It is easy to check that minimizing a regularized objective also does not work, since any non-trivial regularized objective necessarily provides an estimator with a positive error for some distribution with zero variance.

In the next section we use the analysis for general smooth and convex losses to derive new algorithms and bounds for linear regression.

Least Squares Linear Regression

The regularized squared loss, for λ0\lambda\geq 0, is denoted

The proposed algorithm for regression (Algorithm 4) is as follows. Set k=Clog(1/δ)k=C\log(1/\delta), where CC is a universal constant. First, draw kk independent random samples i.i.d. from D\mathcal{D}, and perform linear regression with λ\lambda-regularization on each sample separately to obtain kk linear regressors. Then, use the same kk samples to generate kk estimates of the covariance matrix of the marginal of D\mathcal{D} on the data space. Finally, use the estimated covariances to select a single regressor from among the kk at hand. The slightly simpler variants of steps 4 and 5 can be used in some cases, as detailed below.

In Section 5.1, the full results for regression, mentioned in Section 2, are listed in full detail, and compared to previous work. The proofs are provided in Section 5.2.

Under this condition, we show the following guarantee for least squares regression.

Assume Σ{\varSigma} is not singular. If X{{\boldsymbol{X}}} satisfies Condition 1 with some fixed parameters c>0c>0 and η>0\eta>0, then if Algorithm 4 is run with nO(dlog(1/δ))n\geq O(d\log(1/\delta)) and δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta,

Our loss bound is given in terms of the following population quantity

which we assume is finite. This assumption only requires bounded low-order moments of X{{\boldsymbol{X}}} and YY and is essentially the same as the conditions from Audibert and Catoni (2011) (see the discussion following their Theorem 3.1). Define the following finite fourth-moment conditions:

with probability 1δ\geq 1-\delta. In comparison, the recent work of Audibert and Catoni (2011) proposes an estimator for linear regression based on optimization of a robust loss function which achieves essentially the same guarantee as Theorem 19 (with only mild differences in the moment conditions, see the discussion following their Theorem 3.1). However, that estimator depends on prior knowledge about the response distribution, and removing this dependency using Lepski’s adaptation method (Lepski, 1991) may result in a suboptimal convergence rate. It is also unclear whether that estimator can be computed efficiently.

Other analyses for linear least squares regression and ridge regression by Srebro et al. (2010) and Hsu et al. (2014) consider specifically the empirical minimizer of the squared loss, and give sharp rates of convergence to LsqL^{\operatorname{sq}}_{\operatorname{\star}}. However, both of these require either boundedness of the loss or boundedness of the approximation error. In Srebro et al. (2010), the specialization of the main result to square loss includes additive terms of order O(L(w)blog(1/δ)/n+blog(1/δ)/n)O(\sqrt{L({{\boldsymbol{w}}}_{\operatorname{\star}})b\log(1/\delta)/n}+b\log(1/\delta)/n), where b>0b>0 is assumed to bound the square loss of any predictions almost surely. In Hsu et al. (2014), the convergence rate includes an additive term involving almost-sure bounds on the approximation error/non-subgaussian noise (The remaining terms are comparable to Eq. (9) for λ=0\lambda=0, and Eq. (7) for λ>0\lambda>0, up to logarithmic factors). The additional terms preclude multiplicative approximations to L(w)L({{\boldsymbol{w}}}_{\operatorname{\star}}) in cases where the loss or approximation error is unbounded. In recent work, Mendelson (2014) proposes a more subtle ‘small-ball’ criterion for analyzing the performance of the risk minimizer. However, as evident from the lower bound in Remark 18, the empirical risk minimizer cannot obtain the same type of guarantees as our estimator.

The next result is for the case where there exists R<R<\infty such that Pr[XΣ1XR2]=1\Pr[{{\boldsymbol{X}}}^{\scriptscriptstyle{\top}}{\varSigma}^{-1}{{\boldsymbol{X}}}\leq R^{2}]=1 (and, here, we do not assume Condition 1). In contrast, YY may still be heavy-tailed. Then, the following result can be derived using Algorithm 4. Moreover, the simpler variant of Algorithm 4 suffices here.

Assume Σ{\varSigma} is not singular. Let w^\hat{{{\boldsymbol{w}}}} be the output of the variant of Algorithm 4 with λ=0\lambda=0. With probability at least 1δ1-\delta, for nO(R2log(R)log(1/δ))n\geq O(R^{2}\log(R)\log(1/\delta)),

2 Analysis

Therefore, to establish a bound on nαn_{\alpha}, it suffices to control

for an i.i.d. sample TT from D\mathcal{D}. The following lemma allows doing just that.

We use the boundedness assumption for sake of simplicity; it is possible to remove the boundedness assumption, and the logarithmic dependence on the cardinality of TT, under different conditions on X{{\boldsymbol{X}}} (e.g., assuming Σ1/2X{\varSigma}^{-1/2}{{\boldsymbol{X}}} has subgaussian projections, see Litvak et al. 2005). We now prove Theorem 20, Theorem 21 and Theorem 19.

as soon as nO(R2log(R)log(1/δ))n\geq O(R^{2}\log(R)\log(1/\delta)).

2.2 Ridge Regression

(here, X{Vei:i[d]}{{\boldsymbol{X}}}\in\{V{{\boldsymbol{e}}}_{i}:i\in[d]\} has Euclidean length VV almost surely, and BB is a bound on the Euclidean length of w{{\boldsymbol{w}}}_{\operatorname{\star}}). For d=B2V2n/σ2d=\sqrt{B^{2}V^{2}n/\sigma^{2}}, the bound becomes

As before, this minimax bound also implies a lower bound on any estimator with exponential concentration.

2.3 Heavy-tail Covariates

When the covariates are not bounded or subgaussian, the empirical second-moment matrix may deviate significantly from its population counterpart with non-negligible probability. In this case it is not possible to approximate the norm a=a(Σ+λId)a\|{\boldsymbol{a}}\|=\sqrt{{\boldsymbol{a}}^{\scriptscriptstyle{\top}}({\varSigma}+\lambda\operatorname{Id}){\boldsymbol{a}}} in Step 2 of Algorithm 2 using a single small sample (as discussed in Section 5.2.1 and Section 5.2.2). However, we may use Algorithm 3 instead of Algorithm 2, which only requires the stochastic distance measurements to be relatively accurate with some constant probability. The full version of Algorithm 4 is exactly such an implementation.

We now prove Theorem 19. Define cη:=512(48c)2+2/η(6+6/η)1+4/ηc_{\eta}:=512(48c)^{2+2/\eta}(6+6/\eta)^{1+4/\eta} (which is CmainC_{\text{\emph{main}}} from Srivastava and Vershynin, 2013). The following lemma shows that O(d)O(d) samples suffice so that the expected spectral norm distance between the empirical second-moment matrix and Σ{\varSigma} is bounded.

Let X{{\boldsymbol{X}}} satisfy Condition 1, and let X1,X2,,Xn{{\boldsymbol{X}}}_{1},{{\boldsymbol{X}}}_{2},\dotsc,{{\boldsymbol{X}}}_{n} be independent copies of X{{\boldsymbol{X}}}. Let Σ^:=1ni=1nXiXi\widehat{\varSigma}:=\frac{1}{n}\sum_{i=1}^{n}{{\boldsymbol{X}}}_{i}{{\boldsymbol{X}}}_{i}^{\scriptscriptstyle{\top}}. For any ϵ(0,1){\epsilon}\in(0,1), if ncηϵ22/ηdn\geq c_{\eta}{\epsilon}^{-2-2/\eta}d, then

Lemma 23 implies that n0.5=O(cηd)n_{0.5}=O(c_{\eta}^{\prime}d) where cη=cη2O(1+1/η)c_{\eta}^{\prime}=c_{\eta}\cdot 2^{O(1+1/\eta)}. Therefore, for k=O(log(1/δ))k=O(\log(1/\delta)), subsampled empirical loss minimization requires nkn0.5=O(cηdlog(1/δ))n\geq k\cdot n_{0.5}=O(c_{\eta}^{\prime}d\log(1/\delta)) samples to correctly implement APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}}, for ε{\varepsilon} as in Eq. (5).

In particular, this holds for T=SjT=S_{j}, as long as SjO(cηd)|S_{j}|\geq O(c_{\eta}^{\prime}d). Thus, for k=O(log(1/δ))k=O(\log(1/\delta)), Assumption 3 holds if nO(cηdlog(1/δ))n\geq O(c_{\eta}^{\prime}d\log(1/\delta)). Assumption 4 (independence) also holds, since fjf_{j} depends only on SjS_{j}, and S1,,SkS_{1},\ldots,S_{k} are statistically independent.

Putting everything together, we have (as in Section 5.2.1) α=0.5\alpha=0.5 and γ=1\gamma=1. We obtain the final bound from Theorem 17 as follows: if nO(cηdlog(1/δ))n\geq O(c_{\eta}^{\prime}d\log(1/\delta)), then with probability at least 1δ1-\delta,

Other Applications

In this section we show how the core techniques we discuss can be used for other applications, namely Lasso and low-rank matrix approximation.

In this section we consider L1L^{1}-regularized linear least squared regression (Lasso) (Tibshirani, 1996) with a random subgaussian design, and show that Algorithm 2 achieves the same fast convergence rates for sparse parameter estimation as Lasso, even when the noise is heavy-tailed.

The following theorem shows that when Algorithm 2 is used with subsampled empirical loss minimization over the Lasso loss, and DD generates a subgaussian random design, then w{{\boldsymbol{w}}} can be estimated for any type of noise ε{\varepsilon}, including heavy-tailed noise.

For the proof of Theorem 24, we use the following theorem, adapted from Bickel et al. (2009) and Zhang (2009). The proof is provided in Appendix A for completeness.

If E\mathcal{E} holds and Ψεnλ/2\|\Psi^{\scriptscriptstyle{\top}}{\boldsymbol{{\varepsilon}}}\|_{\infty}\leq n\lambda/2, then we can apply Theorem 25 (with nλn\lambda instead of λ\lambda). We now show that this inequality holds with a constant probability. Fix the noise vector ε=yΨw{\boldsymbol{\varepsilon}}={\boldsymbol{y}}-\Psi{{\boldsymbol{w}}}_{\operatorname{\star}}. For l[d]l\in[d], since the coordinates of ε{\boldsymbol{{\varepsilon}}} are independent and each row of Ψ\Psi is an independent copy of the vector X=Σ12Z{{\boldsymbol{X}}}={\varSigma}^{\frac{1}{2}}{\boldsymbol{Z}}, we have

Since εjΣ12el2εjη\|{\boldsymbol{\varepsilon}}_{j}{\varSigma}^{\frac{1}{2}}{\boldsymbol{e}}_{l}\|_{2}\leq{\boldsymbol{\varepsilon}}_{j}\eta, we conclude that

Therefore APPROX,ϵ\operatorname{APPROX}_{{\|\cdot\|},\epsilon} satisfies Assumption 1 with ϵ\epsilon as in the right hand side above. The statement of the theorem now follows by applying Proposition 9 with k=O(log(1/δ)k=O(\log(1/\delta), and noting that ni=O(n/log(1/δ))n_{i}=O(n/\log(1/\delta)).

We note that while standard analyses of sparse estimation with mean-zero noise assume light-tailed noise (Zhang, 2009; Bickel et al., 2009), there are several works that analyze sparse estimation with heavy-tailed noise under various assumptions. For example, several works assume that the median of the noise is zero (e.g., Wang 2013; Belloni and Chernozhukov 2011; Zou and Yuan 2008; Wu and Liu 2009; Wang et al. 2007; Fan et al. 2012). van de Geer and Müller (2012) analyze a class of optimization functions that includes the Lasso and show polynomial convergence under fourth-moment bounds on the noise. Chatterjee and Lahiri (2013) study a two-phase sparse estimator for mean-zero noise termed the Adaptive Lasso, proposed in Zou (2006), and show asymptotic convergence results under mild moment assumptions on the noise.

2 Low-rank Matrix Approximation

If λΣ^Σ\lambda\geq\|\hat{{\varSigma}}-{\varSigma}\|, then

Now, assume condition 1 holds for XD\mathcal{X}\sim\mathcal{D}, and suppose for simplicity that Σ1\|{\varSigma}\|\leq 1. In this case, by Lemma 23, A random sample SS of size n=cηϵ22/ηdn^{\prime}=c^{\prime}_{\eta}\epsilon^{-2-2/\eta}d, where cη=cη(3/2)2+2/ηc^{\prime}_{\eta}=c_{\eta}(3/2)^{2+2/\eta} suffices to get an empirical covariance matrix ΣS{\varSigma}_{S} such that ΣSΣϵ\|{\varSigma}_{S}-{\varSigma}\|\leq\epsilon with probability at least 2/32/3.

Given a sample of size nn from D\mathcal{D}, We can thus implement APPROX,ε\operatorname{APPROX}_{{\|\cdot\|},{\varepsilon}} that simply returns the empirical covariance matrix of a sub-sample of size n=n/kn^{\prime}=n/k, so that Assumption 1 holds for an appropriate ε{\varepsilon}. By Proposition 9, Algorithm 2 returns Σ^\hat{{\varSigma}} such that with probability at least 1exp(k/18)1-\exp(-k/18), Σ^A3ϵ\|\hat{{\varSigma}}-A\|\leq 3\epsilon. The resulting Σ^\hat{{\varSigma}} can be used to minimize Eq. (13) with λ=3ϵ:=O((cηdlog(1/δ)/n)1/2(1+1/η))\lambda=3\epsilon:=O\left((c^{\prime}_{\eta}d\log(1/\delta)/n)^{1/2(1+1/\eta)}\right). The output matrix Σλ{\varSigma}_{\lambda} satisfies, with probability at least 1δ1-\delta,

A Comparison of Robust Distance Approximation Methods

In the following, we provide detailed guarantees for the approximation factor CαC_{\alpha} of the two types of procedures, for general metric spaces as well as for Banach and Hilbert spaces, and for set-based and sample-based procedures. We further provide lower bounds for specific procedures, as well as lower bounds that hold for any procedure. In Section 7.4 we summarize the results and compare the guarantees of the two procedures and the lower bounds. For a more useful comparison, we take into account the fact that the value of α\alpha usually affects not only the approximation factor, but also the upper bound obtained for ΔW(w,α)\Delta_{W}(w_{\operatorname{\star}},\alpha).

Minimizing the median distance over the set of input points was shown in Proposition 8 to achieve an approximation factor of 3. In this section we show that this upper bound on the approximation factor is tight for this procedure, even in a Hilbert space. Here and below, we say that an approximation factor upper bound is tight if for any constant smaller than this upper bound, there are a suitable space and a set of points in that space, such that the procedure achieves for this input a larger approximation factor than said constant.

The approximation factor can be improved to 22 for a sample-based procedure. This factor is tight as well, even assuming a Hilbert space. The following theorem summarizes these facts.

For any metric space, ρ(w,y)3ϵ\rho(w_{\operatorname{\star}},y)\leq 3\epsilon;

For any metric space, ρ(w,yˉ)2ϵ\rho(w_{\operatorname{\star}},\bar{y})\leq 2\epsilon;

There exists a set on the real line such that ρ(w,y)=3ϵ\rho(w_{\operatorname{\star}},y)=3\epsilon, where ρ\rho is the distance induced by the inner product;

There exists a set on the real line such that ρ(w,yˉ)=2ϵ\rho(w_{\operatorname{\star}},\bar{y})=2\epsilon, where ρ\rho is the distance induced by the inner product.

Proof First, we prove the two upper bounds. Since ΔW(w,γ)ϵ\Delta_{W}(w_{\operatorname{\star}},\gamma)\leq\epsilon, we have B(w,ϵ)W>k/2|B(w_{\operatorname{\star}},\epsilon)\cap W|>k/2. Let wB(w,ϵ)Ww\in|B(w_{\operatorname{\star}},\epsilon)\cap W|. Then by the triangle inequality, B(w,2ϵ)B(w,ϵ)B(w,2\epsilon)\supseteq B(w_{\operatorname{\star}},\epsilon). Therefore ΔW(w,0)2ϵ\Delta_{W}(w,0)\leq 2\epsilon. It follows that ΔW(y,0)2ϵ\Delta_{W}(y,0)\leq 2\epsilon, hence B(y,2ϵ)Wk/2|B(y,2\epsilon)\cap W|\geq k/2. By the pigeon hole principle, B(w,ϵ)B(y,2ϵ)>0|B(w_{\operatorname{\star}},\epsilon)\cap B(y,2\epsilon)|>0, therefore ρ(w,y)3ϵ\rho(w_{\operatorname{\star}},y)\leq 3\epsilon.

To see that these bounds are tight, we construct simple examples on the real line. For yy, suppose w=ϵw_{\operatorname{\star}}=\epsilon, and consider WW with kk points as follows: k/21k/2-1 points at , 22 points at 2ϵ2\epsilon, and k/21k/2-1 points at 4ϵ4\epsilon. The points at 4ϵ4\epsilon are clearly in argminwWΔW(w,0)\operatorname*{argmin}_{w\in W}\Delta_{W}(w,0), therefore ρ(w,y)=3ϵ\rho(w_{\operatorname{\star}},y)=3\epsilon.

For yˉ\bar{y}, suppose w=ϵw_{\operatorname{\star}}=\epsilon, and consider WW with kk points as follows: 22 points at , k/21k/2-1 points at 2ϵ2\epsilon, and k/21k/2-1 points at 3ϵ3\epsilon. The points at 3ϵ3\epsilon are clearly in argminwW+ΔW(w,0)\operatorname*{argmin}_{w\in W_{+}}\Delta_{W}(w,0), therefore ρ(w,yˉ)=2ϵ\rho(w_{\operatorname{\star}},\bar{y})=2\epsilon.

The non-uniqueness of the median distance minimizer is exploited in the lower bounds in Theorem 27. This suggests that some kind of aggregation of the median distance minimizers may provide a smaller bound at least in certain scenarios.

2 The Geometric Median

Minimizing over the entire space is a computationally intensive procedure, involving convex approximation. Moreover, if the only access to the metric is via estimated distances based on samples, as in Algorithm 3, then there are additional statistical challenges. It is thus of interest to also consider the simpler set-based procedure, and we provide approximation guarantees for this procedure as well. We show that an approximation factor of 2+12α2+\frac{1}{2\alpha} can be guaranteed for set-based procedures in general metric spaces, and this is also tight, even for Banach spaces.

The following theorem provides a bound that holds in several of these settings.

For any constant C<(2+12α)C<(2+\frac{1}{2\alpha}), there exists a problem in a Banach space such that ρ(w,y)>CΔW(w,α)\rho(w_{\operatorname{\star}},y)>C\cdot\Delta_{W}(w_{\operatorname{\star}},\alpha). Thus the upper bound above is tight.

For any constant C<(1+12α)C<(1+\frac{1}{2\alpha}), there exists a problem in a metric space such that ρ(w,yˉ)>CΔW(w,α)\rho(w_{\operatorname{\star}},\bar{y})>C\cdot\Delta_{W}(w_{\operatorname{\star}},\alpha). Thus the upper bound above is tight for general metric spaces.

Proof Let wargminwB(w,ϵ)Wρ(w,y)w\in\operatorname*{argmin}_{w\in B(w_{\operatorname{\star}},\epsilon)\cap W}\rho(w,y). Let ZB(w,ϵ)WZ\subset B(w_{\operatorname{\star}},\epsilon)\cap W such that Z=k(12+α)|Z|=k(\frac{1}{2}+\alpha) (we assume for simplicity that k(12+α)k(\frac{1}{2}+\alpha) is an integer; the proof can be easily modified to accommodate the general case). For vZv\in Z, ρ(w,v)ρ(w,w)+ρ(w,v)\rho(w,v)\leq\rho(w,w_{\operatorname{\star}})+\rho(w_{\operatorname{\star}},v). For vWZv\in W\setminus Z, ρ(w,v)ρ(w,y)+ρ(y,v)\rho(w,v)\leq\rho(w,y)+\rho(y,v). Therefore

By the definition of ww as a minimizer, for vZv\in Z, ρ(y,v)ρ(y,w)\rho(y,v)\geq\rho(y,w). Thus

Hence, since ρ(v,w)ϵ\rho(v,w_{\operatorname{\star}})\leq\epsilon for vZv\in Z,

Since Z=k(12+α)|Z|=k(\frac{1}{2}+\alpha) it follows that ρ(w,y)(1+12α)ϵ\rho(w,y)\leq(1+\frac{1}{2\alpha})\epsilon. In addition,

This shows that for any metric space, the set-based geometric median gives an approximation factor of 2+12α2+\frac{1}{2\alpha}, proving item 1.

Since ρ(w,v)ϵ\rho(w_{\operatorname{\star}},v)\leq\epsilon for vZv\in Z, and ρ(w,wˉ)ϵ\rho(w_{\operatorname{\star}},\bar{w})\leq\epsilon, it follows

Therefore ρ(wˉ,yˉ)12αϵ\rho(\bar{w},\bar{y})\leq\frac{1}{2\alpha}\epsilon, hence

This gives an approximation factor of 1+12α1+\frac{1}{2\alpha} for space-based geometric median, proving item 3.

Since α(0,12)\alpha\in(0,\frac{1}{2}), the guarantee for the geometric median in these settings is always worse than the guarantee for minimizing the median distance. Factoring in the dependence on α\alpha, the difference is even more pronounced. The full comparison is given in Section 7.4 below.

3 Optimal Approximation Factor

In this section we give lower bounds that hold for any robust distance approximation procedure. A lower bound of C>0C>0 for a category of metric spaces and a type of procedure indicates that if a procedure of this type guarantees a distance approximation CαC_{\alpha} for all metric spaces of the given category, then necessarily CαCC_{\alpha}\geq C. As shown below, in many cases the lower bounds provided here match the upper bounds obtained by either the median distance or the geometric median.

The following theorem gives a lower bound of 33 for the achievable approximation factor of set-based procedures in Banach spaces (and so, also in general metric spaces). This factor is achieved by the median distance minimizer, as shown in Theorem 27.

Consider set-based robust distance approximation procedures. For any α(0,12)\alpha\in(0,\frac{1}{2}), and for any such procedure, there exists a problem in a Banach space for which the approximation factor of the procedure is at least 33.

Consider the multi-set WW with k/nk/n elements at every bib_{i}. It is easy to check that for every aia_{i}, ΔW(ai,α)ΔW(ai,1/21/n)=1\Delta_{W}(a_{i},\alpha)\leq\Delta_{W}(a_{i},1/2-1/n)=1. On the other hand, since the problem is symmetric for permutations of the indices 1,,n1,\ldots,n, no procedure can distinguish the cases w=aiw_{\operatorname{\star}}=a_{i} for different i[n]i\in[n]. For any choice y=biWy=b_{i}\in W, if w=aiw_{\operatorname{\star}}=a_{i} then ρ(w,y)=3\rho(w_{\operatorname{\star}},y)=3. Therefore the approximation factor of any procedure is at least 33. Since any metric space can be embedded into a Banach space (Kuratowski, 1935) this result holds also for Banach spaces.

Next, we give a lower bound of 22 for space-based procedures over general metric spaces. Theorem 27 shows that this factor is also achieved by minimizing the median distance.

Consider robust space-based distance approximation procedures. For any α(0,12)\alpha\in(0,\frac{1}{2}), and for any such procedure, there exists a problem for which the approximation factor of the procedure is at least 22.

Consider the multi-set WW with k/nk/n points at every bib_{i}. It is easy to check that for every aia_{i}, ΔW(ai,α)ΔW(ai,1/21/n)=1\Delta_{W}(a_{i},\alpha)\leq\Delta_{W}(a_{i},1/2-1/n)=1. On the other hand, since the problem is symmetric for permutations of the indices 1,,n1,\ldots,n, no procedure can distinguish the cases w=aiw_{\operatorname{\star}}=a_{i} for different i[n]i\in[n]. Moreover, any point yy in the space has ρ(ai,y)=2\rho(a_{i},y)=2 for at least one i[n]i\in[n]. Therefore the approximation factor of any procedure is at least 22.

For lower bounds on Hilbert spaces and Banach spaces, we require the following lemma, which gives the radius of the ball inscribing the regular simplex in a pp-normed space.

We now prove a lower bound for robust distance approximation in Hilbert spaces. Unlike the previous lower bounds, this lower bound depends on the value of α\alpha.

For any set-based procedure, there exists a problem such that the procedure achieves an approximation factor at least

For any space-based procedure, there exists a problem such that the procedure achieves an approximation factor at least

The space-based bound given in Theorem 32 is tight for α1/2\alpha\rightarrow 1/2. This can be seen by noting that the limit of the space-based lower bound for α1/2\alpha\rightarrow 1/2 is (12+α)/2α(\frac{1}{2}+\alpha)/\sqrt{2\alpha}, which is exactly the guarantee provided in Minsker (2013) for the space-based geometric median procedure. For smaller α\alpha, there is a gap between the guarantee of Minsker for the geometric median and our lower bound.

Consider WW with k/nk/n points at each of b1,,bnb_{1},\ldots,b_{n}. Then ΔW(ei,α)ΔW(ei,11n)=eibj=rn1,2\Delta_{W}(e_{i},\alpha)\leq\Delta_{W}(e_{i},1-\frac{1}{n})=\|e_{i}-b_{j}\|=r_{n-1,2} for any jij\neq i. Any set-based procedure must select bib_{i} for some ii. if w=eiw_{\operatorname{\star}}=e_{i}, the resulting approximation factor is eibi/rn1,2=n2n1eibi\|e_{i}-b_{i}\|/r_{n-1,2}=\sqrt{\frac{n-2}{n-1}}\|e_{i}-b_{i}\|. For biei\|b_{i}-e_{i}\|, consider for instance b1b_{1} and e1e_{1}. We have b1=(0,1n1,,1n1)b_{1}=(0,\frac{1}{n-1},\ldots,\frac{1}{n-1}), therefore b1e1=nn1\|b_{1}-e_{1}\|=\sqrt{\frac{n}{n-1}}. The approximation factor of the procedure is thus at least nn2\sqrt{\frac{n}{n-2}}.

For a set-based procedure, whatever yy it returns, there exists at least one ii such that yairn,2\|y-a_{i}\|\geq r_{n,2}. Therefore the approximation factor is at least rn,2/rn1,2=n1n/n2n1=1+1n22nr_{n,2}/r_{n-1,2}=\sqrt{\frac{n-1}{n}}/\sqrt{\frac{n-2}{n-1}}=\sqrt{1+\frac{1}{n^{2}-2n}}.

For space-based procedures, we have seen that while there exists a lower bound of 22 for general metric spaces, in a Hilbert space better approximation factors can be achieved. Is it possible that in Banach spaces the same approximation factor can also be achieved? The following theorem shows that the answer is no.

Let α=1/6\alpha=1/6. There exists a Banach space for which an approximation factor of (12+α)/2α(\frac{1}{2}+\alpha)/\sqrt{2\alpha} cannot be achieved.

4 Comparison of Selection Procedures

We observe that for set-based procedures, the median distance is superior to the geometric median for general metric spaces as well as for general Banach spaces. It is an open question whether better results can be achieved for Hilbert spaces using set-based procedures.

For space-based procedures, the median distance is again superior, except in the case of a Hilbert space, where the geometric median is superior. The case of a Hilbert space is arguably the most useful in common applications such as linear regression. Nevertheless, gaps still remain and it would be interesting to develop optimal methods.

Implementing the geometric median procedure in a space-based formulation is computationally efficient for Hilbert spaces when accurate distances are available Minsker (2013). However, it is unknown whether and how the procedure can be implemented when only unreliable distance estimations are available, as in Section 3.3. A useful implementation should be both computationally feasible and statistically efficient, while degrading the approximation factors as little as possible.

Predicting Without a Metric on Predictors

The core technique presented above allows selecting a good candidate out of a set that includes mostly good candidates, in the presence of a metric between candidates. If the final goal is prediction of a scalar label, good prediction can still be achieved without access to a metric between candidates, using the following simple procedure: For every input data point, calculate the prediction of every candidate, and output the median of the predictions. This is a straight-forward generalization of voting techniques for classification such as when using bagging (Breiman, 1996).Note, however, that the usual implementation of bagging for regression involves averaging over the outputs of the classifiers, and not taking the median. The following lemma shows that this approach leads to guarantees similar to those achieved by Proposition 9.

Taking expectation over (x,y)({{\boldsymbol{x}}},y),

where the last inequality follows from the assumption that I(12+γ)k|I|\geq(\frac{1}{2}+\gamma)k.

A downside of this approach is that each prediction requires many applications of a predictor. If there is also access to unlimited unlabeled data, a possible approach to circumvent this issue is to generate predictions for a large set of random unlabeled data points based on the aggregate predictor, and then use the resulting labeled pairs as a training set to find a single predictor with a loss that approaches the loss of the aggregate predictor. A similar approach for derandomizing randomized classifiers was suggested by Kääriäinen (2005).

Conclusion

In this paper we show several applications of a generalized median-of-means approach to estimation. In particular, for linear regression we establish convergence rates for heavy-tailed distributions that match the min-max rates up to logarithmic factors. We further show conditions that allow parameter estimation using the Lasso under heavy-tailed noise, and cases under which low-rank covariance matrix approximation is possible for heavy-tailed distributions.

The core technique is based on performing independent estimates on separate random samples, and then combining these estimates. Other works have considered approaches which resemble this general scheme but provide other types of guarantees. For instance, in Zhang et al. (2013), faster parallel kernel ridge regression is achieved by performing loss minimizations on independent samples and then averaging the resulting estimators. In Rakhlin et al. (2013), faster rates of convergence for regression for some classes of estimators are achieved, using linear combinations of risk minimizers over subsets of the class of estimators. These works, together with ours, demonstrate that empirical risk minimization can be used as a black box to generate new algorithms with improved statistical performance.

Part of this work was completed while the authors were at Microsoft Research New England. Daniel Hsu was supported by a Yahoo Academic Career Enhancement Award. Sivan Sabato is supported by the Lynne and William Frankel Center for Computer Science.

A Proof of Theorem 25

From the definition of w^\hat{{{\boldsymbol{w}}}} as a minimizer we have

By Hölder’s inequality the assumptions of the theorem, 2εΨ(w^w)2εΨw^w1λw^w12{\varepsilon}^{\scriptscriptstyle{\top}}\Psi(\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}}_{\operatorname{\star}})\leq 2\|{\varepsilon}^{\scriptscriptstyle{\top}}\Psi\|_{\infty}\|\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}}_{\operatorname{\star}}\|_{1}\leq\lambda\|\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}}_{\operatorname{\star}}\|_{1}. Combining this with Eq. (14) gives

Adding λ(w^w)1\lambda\|(\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}})\|_{1} to both sides we get

therefore w^wEs\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}}_{\operatorname{\star}}\in E_{s}. Denote δ=w^w{\boldsymbol{\delta}}=\hat{{{\boldsymbol{w}}}}-{{\boldsymbol{w}}}. The above derivation also implies

Denote for brevity γ=γ(Ψ,s)\gamma=\gamma(\Psi,s). From the definition of γ\gamma,

Therefore δ[s]23λsγ2\|{\boldsymbol{\delta}}_{[s]}\|_{2}\leq\frac{3\lambda\sqrt{s}}{\gamma^{2}}. Now,

From δEs{\boldsymbol{\delta}}\in E_{s} we get δ[s]C13δ[s]1\|{\boldsymbol{\delta}}_{[s]^{C}}\|_{1}\leq 3\|{\boldsymbol{\delta}}_{[s]}\|_{1}. In addition, since δ[s]{\boldsymbol{\delta}}_{[s]} spans the largest coordinates of δ{\boldsymbol{\delta}} in absolute value, δ[s]Cδ[s]1/s\|{\boldsymbol{\delta}}_{[s]^{C}}\|_{\infty}\leq\|{\boldsymbol{\delta}}_{[s]}\|_{1}/s. Combining these with the inequality above we get

References