Geometric median and robust estimation in Banach spaces

Stanislav Minsker

Introduction

for some absolute constant CC without any extra assumptions on Π\Pi? What happens if the sample contains a fixed number of outliers of arbitrary nature? Does the estimator still exist?

A (somewhat surprising) answer is yes, and several ways to construct μ^\hat{\mu} are known. The earliest reference that we are aware of is the book by Nemirovski and Yudin , where related question was investigated in the context of stochastic optimization. We learned about problem (1) and its solution from the work of Oliveira and Lerasle who used the ideas in spirit of to develop the theory of “robust empirical mean estimators”. Method described in consists of the following steps: divide the given sample into VtV\approx t blocks, compute the sample mean within each block and then take the median of these sample means. A relatively simple analysis shows that the resulting estimator indeed satisfies (1). Similar idea was employed earlier in the work of Alon, Matias and Szegedy to construct randomized algorithms for approximating the so-called “frequency moments” of a sequence. Recently, the aforementioned “median of the means” construction appeared in in the context of multi-armed bandit problem under weak assumptions on the reward distribution. A different approach to the question (1) (based on PAC-Bayesian truncation) was given in . A closely related independent recent work applies original ideas of Nemirovski and Yudin to general convex loss minimization.

The main goal of this work is to design a general technique that allows construction of estimators satisfying a suitable version of (1) for Banach space-valued μ\mu. To achieve this goal, we show that a collection of independent estimators of a Banach space-valued parameter can be transformed into a new estimator which preserves the rate and admits much tighter concentration bounds. The method we propose is based on the properties of a geometric median, which is one of the possible extensions of a univariate median to higher dimensions.

Many popular estimators (e.g., Lasso in the context of high-dimensional linear regression) admit strong theoretical guarantees if the distribution of the noise satisfies restrictive assumptions (such as sub-Gaussian tails). An important question that we attempt to answer is: can one design algorithms which preserve nice properties of existing techniques and at the same time:

admit strong performance guarantees under weak assumptions on the noise;

are not affected by the presence of a fixed number of outliers of arbitrary nature and size;

can be implemented in parallel for faster computation with large data sets.

Our results imply that in many important applications the answer is positive. In Section 4, we illustrate this assertion with several classical examples, including principal component analysis, sparse linear regression and low-rank matrix recovery. In each case, we present non-asymptotic probabilistic bounds describing performance of proposed methods.

For an overview of classical and modern results in robust statistics, see , and references therein. Existing literature contains several approaches to estimation in the presence of heavy-tailed noise based on the aforementioned estimators satisfying (1). However, most of the previous work concentrated on one-dimensional versions of (1) and used it as a tool to solve intrinsically high-dimensional problems. For example, in authors develop robust estimator selection procedures based on the medians of empirical risks with respect to disjoint subsets of the sample. While this approach admits strong theoretical guarantees, it requires several technical assumptions that are not always easy to check it practice. Another related work discusses robust estimation in the context of ridge regression. Proposed method is based on a “min–max” estimator which has good theoretical properties but can only be evaluated approximately based on heuristic methods. It is also not immediately clear if this technique can be extended to robust estimation in other frameworks. An exception is the approach described in and further explored in , where authors use a version of the multidimensional median for estimator selection. However, this method has several weaknesses in statistical applications when compared to our technique; see Section 3 for more details and discussion. The main results of our work require minimal assumptions, apply to a wide range of models, and allow to use many existing algorithms as a subroutine to produce robust estimators which can be evaluated exactly via a simple iterative scheme.

Geometric median

The cornerstone of our subsequent presentation is the following lemma, which states that if a given point zz is “far” from the geometric median x=med(x1,,xk)x_{\ast}=\operatorname{med}(x_{1},\ldots,x_{k}), then it is also “far” from a constant fraction of the points x1,,xkx_{1},\ldots,x_{k}. We will denote F(y):=j=1kyxjF(y):=\sum_{j=1}^{k}\|y-x_{j}\|.

and r>0r>0. Then there exists a subset J{1,,k}J\subseteq\{1,\ldots,k\} of cardinality J>αk|J|>\alpha k such that for all jJj\in J, xjz>r\|x_{j}-z\|>r.

For general Banach spaces, the claim holds with a constant Cα=2(1α)12αC_{\alpha}=\frac{2(1-\alpha)}{1-2\alpha}.

(a) Assume that the implication is not true. Without loss of generality, it means that xizr\|x_{i}-z\|\leq r, i=1,,(1α)k+1i=1,\ldots,\lfloor(1-\alpha)k\rfloor+1.

For j=1,,(1α)k+1j=1,\ldots,\lfloor(1-\alpha)k\rfloor+1 and γj=arccos(xjx,zxxjxzx)\gamma_{j}=\arccos(\frac{\langle x_{j}-x_{\ast},z-x_{\ast}\rangle}{\|x_{j}-x_{\ast}\|\|z-x_{\ast}\|}), we clearly have (see Figure 1)

while xjx,zxxjxzx1\frac{\langle x_{j}-x_{\ast},z-x_{\ast}\rangle}{\|x_{j}-x_{\ast}\|\|z-x_{\ast}\|}\geq-1 for j>(1α)k+1j>\lfloor(1-\alpha)k\rfloor+1. This yields

whenever Cα(1α)112αC_{\alpha}\geq(1-\alpha)\sqrt{\frac{1}{1-2\alpha}}, which leads to a contradiction.

In general Banach spaces, it might be convenient to consider

where co(x1,,xk)\operatorname{co}(x_{1},\ldots,x_{k}) is the convex hull of {x1,,xk}\{x_{1},\ldots,x_{k}\}. The claim of Lemma 2.1 remains valid for x^\hat{x}_{\ast} whenever zco(x1,,xk)z\in\operatorname{co}(x_{1},\ldots,x_{k}).

“Boosting the confidence” by taking the geometric median of independent estimators

A useful property of the geometric median is that it transforms a collection of independent estimators that are “weakly” concentrated around the true parameter of interest into a single estimator which admits significantly tighter deviation bounds. For 0<p<α<120<p<\alpha<\frac{1}{2}, define

where CαC_{\alpha} is a constant defined in Lemma 2.1 above.

Assume that event E:={μ^μ>Cαε}\mathcal{E}:=\{\|\hat{\mu}-\mu\|>C_{\alpha}\varepsilon\} occurs. Lemma 2.1 implies that there exists a subset J{1,,k}J\subseteq\{1,\ldots,k\} of cardinality Jαk|J|\geq\alpha k such that μjμ>ε\|\mu_{j}-\mu\|>\varepsilon for all jJj\in J, hence

If WW has Binomial distribution WB(k,p)W\sim B(k,p), then

(see Lemma 23 in for a rigorous proof of this fact). Chernoff bound (e.g., Proposition A.6.1 in ) implies that

If (5) is replaced by a weaker condition assuming that

is satisfied only for μ^j\hat{\mu}_{j}, jJ{1,,k}j\in J\subset\{1,\ldots,k\}, where J=(1τ)k|J|=(1-\tau)k for 0τ<αp1p0\leq\tau<\frac{\alpha-p}{1-p}, then the previous argument implies

In particular, this version is useful in addressing the situation when the sample contains a subset of cardinality at most τk\tau k consisting of “outliers” of arbitrary nature.

It is also clear that results of Theorem 3.1 can be used to positively answer question (3) posed in the Introduction. Indeed, if several autonomous computational resources (e.g., processors) are available, one can evaluate estimators μ^j\hat{\mu}_{j}, j=1,,kj=1,\ldots,k in parallel and combine them via the geometric median as a final step. In many situations, the improvement in computational cost will be significant.

Note that it is often easy to obtain an estimator satisfying (5) with the correct rate ε\varepsilon under minimal assumptions on the underlying distribution. In particular, if μ\mu is the mean and μ^k\hat{\mu}_{k} is the sample mean, then (5) can be deduced from Chebyshev’s inequality, see Section 4.1 below for more details.

Next, we describe the method proposed in which is based on a different notion of the median. Let μ^1,,μ^k\hat{\mu}_{1},\ldots,\hat{\mu}_{k} be a collection of independent estimators of μ\mu and assume that ε>0\varepsilon>0 is chosen to satisfy

and j=1j_{\ast}=1 if none of μ^j\hat{\mu}_{j}’s satisfy the condition in braces. It is not hard to show that

It is important to mention the fact that (10) and the inequality (7) of Theorem 3.1 have different constants in front of ε\varepsilon: it is equal to CαC_{\alpha} in (7) and to 33 in (10). Note that in the Hilbert space case, Cα=(1α)112α1C_{\alpha}=(1-\alpha)\sqrt{\frac{1}{1-2\alpha}}\to 1 as α0\alpha\to 0, while for general Banach spaces Cα=2(1α)12α2C_{\alpha}=\frac{2(1-\alpha)}{1-2\alpha}\to 2 as α0\alpha\to 0. In particular, Cα<3C_{\alpha}<3 for all sufficiently small α\alpha (e.g., for α<8+620.485\alpha<-8+6\sqrt{2}\approx 0.485 in Hilbert space framework). This difference becomes substantial when ε\varepsilon is of the form

where the first term in the sum is a constant and the second term decreases with the growth of the sample size. This is a typical situation when the model is misspecified, see Section 4.4 below for a concrete example related to matrix regression. Our method allows to keep the constant in front of the approximation error term arbitrary close to 11 (and often leads to noticeably better constants in general).

Examples

In this section, we discuss applications of Theorem 3.1 to several classical problems, namely, estimation of the mean, principal component analysis, sparse linear regression and low-rank matrix recovery.

Our priority was simplicity and clarity of exposition of the main ideas which could affect optimality of some constants and generality of obtained results.

Set α:=718\alpha_{\ast}:=\frac{7}{18} and p:=0.1p_{\ast}:=0.1 (these numerical values allow to optimize the constants in Corollary 4.1 below). Let 0<δ<10<\delta<1 be the confidence parameter, and set

(we will assume that δ\delta is such that kn2k\leq\frac{n}{2}). Divide the sample X1,,XnX_{1},\ldots,X_{n} into kk disjoint groups G1,,GkG_{1},\ldots,G_{k} of size nk\lfloor\frac{n}{k}\rfloor each, and define

We will apply Theorem 3.1 to the independent estimators μ^1,,μ^k\hat{\mu}_{1},\ldots,\hat{\mu}_{k}.

To this end, we need to find ε\varepsilon satisfying (5). Since for all 1jkn21\leq j\leq k\leq\frac{n}{2}

which is further bounded by pp_{\ast} whenever ε22kpntr(Σ)\varepsilon^{2}\geq\frac{2k}{p_{\ast}n}\operatorname{tr}(\Sigma). The claim now follows from Theorem 3.1 and the bounds Cα2pψ(α;p)11C_{\alpha_{\ast}}\sqrt{\frac{2}{p_{\ast}\psi(\alpha_{\ast};p_{\ast})}}\leq 11 and log(1/δ)+ψ(α;p)log(1.4δ)\log(1/\delta)+\psi(\alpha_{\ast};p_{\ast})\leq\log(\frac{1.4}{\delta}). ∎

It is easy to see that the proof of Corollary 4.1 actually yields a better bound

with Cαpψ(α;p)7.6\frac{C_{\alpha_{\ast}}}{\sqrt{p_{\ast}\psi(\alpha_{\ast};p_{\ast})}}\leq 7.6.

which yields a noticeably larger constant (13.213.2 versus 7.67.6).

It is easy to see that for the univariate median, inequality (7) holds with α=1/2\alpha=1/2 and Cα=1C_{\alpha}=1, hence the union bound over i=1,,Di=1,\ldots,D implies that

(here, pp was set to be 0.120.12). This bound should be compared to (13) – the latter becomes better only when DD is sufficiently large (e.g., D165D\geq 165 for δ=0.1\delta=0.1 and D15806D\geq 15\,806 for δ=0.01\delta=0.01).We want to thank the anonymous reviewer for pointing this out. Note that the constant in (14) can be further improved in a situation when tight upper bounds on the true variances or kurtoses of coordinates of XX are known by using a univariate estimator of to construct μ^\hat{\mu}_{\ast}.

Our estimation technique naturally extends to the problem of constructing the confidence sets for the mean. Indeed, when faced with the task of obtaining the non-asymptotic confidence interval, one usually fixes the desired coverage probability in advance, which is exactly how we build our estimator. To obtain a parameter-free confidence ball from (12), one has to estimate tr(Σ)\operatorname{tr}(\Sigma). To this end, we will apply Theorem 3.1 to a collection of independent statistics T^1,,T^k\hat{T}_{1},\ldots,\hat{T}_{k}, where

and μ^j\hat{\mu}_{j} are the sample means defined in (4.1). Let T^:=med(T^1,,T^k)\hat{T}:=\operatorname{med}(\hat{T}_{1},\ldots,\hat{T}_{k}) (if kk is even, the median is not unique, so we pick an arbitrary representative).

Chebyshev’s inequality gives (assuming that kn/2k\leq n/2)

hence Pr(T^jtr(Σ)ε1+ε2)p\Pr(|\hat{T}_{j}-\operatorname{tr}(\Sigma)|\geq\varepsilon_{1}+\varepsilon_{2})\leq p_{\ast}. Theorem 3.1 implies that

Since Pr(T^tr(Σ)2)Pr(T^tr(Σ)Cα(ε1+ε2))\Pr(\hat{T}\leq\frac{\operatorname{tr}(\Sigma)}{2})\leq\Pr(\hat{T}\leq\operatorname{tr}(\Sigma)-C_{\alpha_{\ast}}(\varepsilon_{1}+\varepsilon_{2})) whenever (15) is satisfied, the result follows. ∎

Combining (16) with Corollary 4.1, we immediately get the following statement.

and let μ^\hat{\mu} be the estimator defined by (4.1). If (15) holds, then

2 Robust Principal Component Analysis

It is well known that classical Principal Component Analysis (PCA) is very sensitive to the presence of the outliers in a sample. The literature on robust PCA suggests several computationally efficient and theoretically sound methods to recover the linear structure in the data. For instance, if part of the observations is contained in a low-dimensional subspace while the rest are corrupted by noise, the low-dimensional subspace can often be recovered exactly, see and references therein.

However, for the case when no additional geometric structure in the data can be assumed, we suggest a simple and easy-to-implement alternative which uses the geometric median to obtain a robust estimator of the covariance matrix. In this section, we study the simplest case when the geometric median is combined with the sample covariance estimator. However, it is possible to use various alternatives in place of the sample covariance, such as the shrinkage estimator , banding/tapering estimator , hard thresholding estimator or the nuclear norm-penalized estimator , to name a few.

Assume first that the data is centered (so that μ=0\mu=0). As before, set α:=718\alpha_{\ast}:=\frac{7}{18}, p:=0.1p_{\ast}:=0.1, divide the sample X1,,XnX_{1},\ldots,X_{n} into k=log(1/δ)ψ(α;p)+1k=\lfloor\frac{\log({1}/{\delta})}{\psi(\alpha_{\ast};p_{\ast})}\rfloor+1 disjoint groups G1,,GkG_{1},\ldots,G_{k} of size nk\lfloor\frac{n}{k}\rfloor each, and let

Note that Σ^\hat{\Sigma} is positive semidefinite as a convex combination of positive semidefinite matrices.

Let Projm\operatorname{Proj}_{m} be the orthogonal projector on a subspace corresponding to the mm largest positive eigenvalues of Σ\Sigma. Let Projm^\widehat{\operatorname{Proj}_{m}} be the orthogonal projector of the same rank as Projm\operatorname{Proj}_{m} corresponding to the mnkm\leq\lfloor\frac{n}{k}\rfloor largest eigenvalues of Σ^\hat{\Sigma}. In this case, the following bound holds.

Let Δm:=λmλm+1\Delta_{m}:=\lambda_{m}-\lambda_{m+1} and assume that

Similar bounds can be obtained in a more general situation when XX is not necessarily centered. To this end, let

Note that Σ^1,,Σ^k\hat{\Sigma}_{1},\ldots,\hat{\Sigma}_{k} are independent. Then, using the fact that for any 1jk1\leq j\leq k

and assume that Δm>4εn(δ)\Delta_{m}>4\varepsilon_{n}(\delta). Then

3 High-dimensional sparse linear regression

We are interested in the case when DnD\gg n and λ0\lambda_{0} is sparse, meaning that

In this situation, a (version of) the famous Lasso estimator of λ0\lambda_{0} is obtained as a solution of the following optimization problem:

The goal of this section is to extend the applicability of some well-known results for this estimator to the case of a heavy-tailed noise distribution.

Existing literature on high-dimensional linear regression suggests several ways to handle corrupted measurements, for instance, by using a different loss function (e.g., the so-called Huber’s loss ), or by implementing a more flexible penalty term . In particular, in authors study the model

However, as in the case of the usual Lasso, confidence of estimation depends on the distribution of ξ\xi. In particular, Gaussian-type concentration holds only if the entries of ξ\xi have sub-Gaussian tails.

The main result of this subsection (stated in Theorem 4.2) provides strong performance guarantees for the robust version of the usual Lasso estimator (22) and requires only standard conditions on the degree of sparsity and restricted eigenvalues of the design. Similar method can be used to improve performance guarantees for the model (23) in the case of heavy-tailed noise ξ\xi.

Let 1sD1\leq s\leq D and c0>0c_{0}>0. We will say that the restricted eigenvalue condition holds if

Let Θ:=1nj=1nξjxj\Theta:=\|\frac{1}{n}\sum_{j=1}^{n}\xi_{j}x_{j}\|_{\infty}. The following result shows that the amount of regularization ε\varepsilon sufficient for recovery of λ0\lambda_{0} is closely related to the size of Θ\Theta.

On the event E={ε4Θ}\mathcal{E}=\{\varepsilon\geq 4\Theta\}, the following inequality holds:

In particular, when ξjN(0,σ2)\xi_{j}\sim N(0,\sigma^{2}) and ε=4σtlogDn\varepsilon=4\sigma t\sqrt{\frac{\log D}{n}},

This result is similar to the statement of Theorem 7.2 in , and its proof can be obtained along the same lines. See for more details. ∎

Our next goal is to construct an estimator of λ0\lambda_{0} which admits high confidence error bounds without restrictive assumptions on the noise variable, such as sub-Gaussian tails. Let t>0t>0 be fixed, and set k:=3.5t+1k:=\lfloor 3.5t\rfloor+1, m=nkm=\lfloor\frac{n}{k}\rfloor (as before, we will assume that kn2k\leq\frac{n}{2}). For 1lk1\leq l\leq k, let Gl:={(l1)m+1,,lm}G_{l}:=\{(l-1)m+1,\ldots,lm\} and

be the m×Dm\times D design matrix corresponding to the llth group of design vectors {xj,jGl}\{x_{j},j\in G_{l}\}. Moreover, let κl(s,c0)\kappa_{l}(s,c_{0}) be the corresponding restricted eigenvalues.

Assume that xjM,1jn\|x_{j}\|_{\infty}\leq M,1\leq j\leq n and κˉ(2N(λ0),3):=min1lkκl(2N(λ0),3)>0\bar{\kappa}(2N(\lambda_{0}),3):=\min_{1\leq l\leq k}\kappa_{l}(2N(\lambda_{0}),\allowbreak 3)>0. Then for any

We will first obtain a “weak concentration” bound from Theorem 4.1 and then apply Theorem 3.1 with α=718\alpha=\frac{7}{18} to get the result.

To this end, we need to estimate Θl:=1mjGlξjxj\Theta_{l}:=\|\frac{1}{m}\sum_{j\in G_{l}}\xi_{j}x_{j}\|_{\infty}, l=1,,kl=1,\ldots,k.

Assume that D3D\geq 3. Then for any ll, 1lk1\leq l\leq k,

whenever t4σMklog(2D)0.1nt\geq 4\sigma M\sqrt{\frac{k\log(2D)}{0.1n}}. In particular, for ε16σM3.5(t+2/7)log(2D)0.1n\varepsilon\geq 16\sigma M\sqrt{\frac{3.5(t+2/7)\log(2D)}{0.1n}}, the bound of Theorem 4.1 holds for λ^εl\hat{\lambda}_{\varepsilon}^{l} with probability 10.1\geq 1-0.1; note that 163.50.19516\sqrt{\frac{3.5}{0.1}}\leq 95. It remains to apply Theorem 3.1 to complete the proof. ∎

4 Matrix regression with isotropic sub-Gaussian design

The problem of low-rank matrix estimation has attracted a lot of attention during the last several years, for example, see and references therein. Recovery guarantees were later extended to allow the presence of noise. Results in this direction can be found in , to name a few.

In particular, these assumptions hold in the following important cases:

XX is symmetric and such that Xi,j=12εi,jX_{i,j}=\frac{1}{\sqrt{2}}\varepsilon_{i,j}, i<ji<j, Xi,i=εi,iX_{i,i}=\varepsilon_{i,i}, 1iD1\leq i\leq D, where εi,j\varepsilon_{i,j} are i.i.d. Rademacher random variables (i.e., random signs).

In a special case when all involved matrices are diagonal, the problem becomes a version of sparse linear regression with random design. In this case, isotropic design includes a situation when XX is a random diagonal matrix X=diag(x1,,xD)X=\operatorname{diag}(x_{1},\ldots,x_{D}), where xix_{i} are i.i.d. standard normal or Rademacher random variables.

In what follows, C1,C2,C_{1},C_{2},\ldots denote the constants that may depend on parameters of the underlying distribution (such as γ\gamma).

It is easy to see that ζψ2<\|\zeta\|_{\psi_{2}}<\infty for any sub-Gaussian random variable ζ\zeta. It follows from Proposition 9.1 in that there exists C(γ)>0C(\gamma)>0 such that for any sub-Gaussian isotropic matrix XX,

We will also need the following useful inequality: for any p1p\geq 1,

The proofs of the facts mentioned above can be found in .

Let (X1,Y1),,(Xn,Yn)(X_{1},Y_{1}),\ldots,(X_{n},Y_{n}) be i.i.d. observations with the same distribution as (X,Y)(X,Y). We are mainly interested in the case when D<nD2D<n\ll D^{2}. In this situation, it is impossible to estimate A0A_{0} consistently in general, however, if A0A_{0} is low-rank (or approximately low-rank), then the solution of the following optimization problem provides a good approximation to A0A_{0}:

The noise variable ξ\xi is such that ξψ2<\|\xi\|_{\psi_{2}}<\infty.

If Assumption 4.1 is satisfied, then, whenever

where Cˉ1>0\bar{C}_{1}>0 is an absolute constant.

Indeed, recall that Θ=1nj=1nξjXj\Theta=\frac{1}{n}\sum_{j=1}^{n}\xi_{j}X_{j}, and apply Theorem 4.4 to Yj:=ξjXjY_{j}:=\xi_{j}X_{j}, noting that by (26), (27)

where C2C_{2}, C3C_{3} depend only on γ\gamma, hence giving the result.

As we mentioned above, our goal is to construct the estimator of A0A_{0} which admits bounds in flavor of Theorem 4.3 that hold with high probability under a much weaker assumption on the tail of the noise variable ξ\xi.

To achieve this goal, we follow the same pattern as before. Let t1t\geq 1 be fixed, let k:=t+1,m=nkk:=\lfloor t\rfloor+1,m=\lfloor\frac{n}{k}\rfloor, and assume that kn2k\leq\frac{n}{2}. Divide the data {(Xj,Yj)}j=1n\{(X_{j},Y_{j})\}_{j=1}^{n} into kk disjoint groups G1,,GkG_{1},\ldots,G_{k} of size mm each, and define

where the geometric median is evaluated with respect to the Frobenius norm.

We will start by deriving a “weak concentration” bound from Theorem 4.3. To this end, we need to estimate

The following result is a direct consequence of the so-called multiplier inequality (Lemma 2.9.1 in ).

Let ε1,,εm\varepsilon_{1},\ldots,\varepsilon_{m} be i.i.d. Rademacher random variables independent of X1,,XmX_{1},\ldots,X_{m}. Then

Next, it follows from Chebyshev’s inequality that for any 1lk1\leq l\leq k, with probability 1p(α)2\geq 1-\frac{p^{\ast}(\alpha)}{2}

the inequality of Theorem 4.3 (with the confidence parameter equal to log(2/p(α))\log(2/p^{\ast}(\alpha))) applied to the estimator A^εl\hat{A}^{l}_{\varepsilon} gives that with probability 1p(α)\geq 1-p^{\ast}(\alpha)

The claim (33) now follows from Theorem 3.1. ∎

Numerical evaluation of the geometric median and simulation results

As was mentioned in the introduction, the function F(z):=j=1kzxjF(z):=\sum_{j=1}^{k}\|z-x_{j}\| is convex, moreover, its minimum is unique unless {x1,,xk}\{x_{1},\ldots,x_{k}\} are on the same line.

where αm+1(j)=xjzm1j=1kxjzm1\alpha^{(j)}_{m+1}=\frac{\|x_{j}-z_{m}\|^{-1}}{\sum_{j=1}^{k}\|x_{j}-z_{m}\|^{-1}}. Kuhn proved that Weiszfeld’s algorithm converges to the geometric median for all but countably many initial points (additionally, his result states that zmz_{m} converges to the geometric median if none of zmz_{m} belong to {x1,,xk}\{x_{1},\ldots,x_{k}\}). It is straightforward to check that (36) is actually a gradient descent scheme: indeed, it is equivalent to

where βm+1=1j=1kxjzm1\beta_{m+1}=\frac{1}{\sum_{j=1}^{k}\|x_{j}-z_{m}\|^{-1}} and gm+1=j=1kzmxjzmxjg_{m+1}=\sum_{j=1}^{k}\frac{z_{m}-x_{j}}{\|z_{m}-x_{j}\|} is the gradient of FF (we assume that zm{x1,,xk}z_{m}\notin\{x_{1},\ldots,x_{k}\}).

Ostresh proposed a method which avoids the possibility of hitting one of the vertices {x1,,xk}\{x_{1},\ldots,x_{k}\} by considering the following descent scheme: starting with some z0z_{0} in the affine hull of {x1,,xk}\{x_{1},\ldots,x_{k}\}, let

For other approaches to fast numerical evaluation of the geometric median, see and references therein.

2 Simulation results

In this case, the usual sample covariance matrix does not provide any useful information about the principal components. However, in most cases our method gave reasonable approximation to the truth. We used the estimator described in Section 4.2 with the number of groups k=10k=10 containing 1616 observations each. The error was measured by the spectral norm Proj^5Proj5Op\|\widehat{\operatorname{Proj}}_{5}-\operatorname{Proj}_{5}\|_{\operatorname{Op}}, where Proj^5\widehat{\operatorname{Proj}}_{5} is a projector on the eigenvectors corresponding to 55 largest eigenvalues of the estimator. Figures 2, 3 show the histograms of the errors evaluated over 100100 runs of the simulation. Figure 4 shows performance of a “thresholded geometric median” estimator which is defined in Section 6 below.

2.2 High-dimensional sparse linear regression

The following model was used for simulation:

where ξ1,jN(0,1/8)\xi_{1,j}\sim N(0,1/8) and ξ2,j\xi_{2,j} takes values ±2502\pm\frac{250}{\sqrt{2}} with probability 1/21/2 each. All parameters λ0\lambda_{0}, xjx_{j}, ξj\xi_{j}, j=1,,300j=1,\ldots,300, were sampled independently. Error of the estimator λ^\hat{\lambda} was measured by the ratio λ^λ0λ0\frac{\|\hat{\lambda}-\lambda_{0}\|}{\|\lambda_{0}\|}. Size of the regularization parameter ε\varepsilon was chosen based on 44-fold cross validation. On each stage of the simulation, we evaluated the usual Lasso estimator (22) and the “median Lasso” estimator (24) based on partitioning the observations into 44 groups of size 7575 each. Figures 5 and 6 show the histograms of the errors over 5050 runs of the simulation. Note that the maximal error of the “median Lasso” is 0.0550.055 while the error of the usual Lasso exceeded 0.150.15 in 1818 out of 5050 cases.

Final remarks

Let α^1,,α^k0\hat{\alpha}_{1},\ldots,\hat{\alpha}_{k}\geq 0, j=1kαj=1\sum_{j=1}^{k}\alpha_{j}=1 be the coefficients such that μ^=j=1kα^kμk\hat{\mu}=\sum_{j=1}^{k}\hat{\alpha}_{k}\mu_{k} is the geometric median of a collection of estimators {μ1,,μk}\{\mu_{1},\ldots,\mu_{k}\}. Our numerical experiments reveal that performance of μ^\hat{\mu} can be significantly improved by setting the coefficients below a certain threshold level ν\nu to , that is,

An interesting problem that we plan to address in subsequent work is the possibility of adaptive choice of the threshold parameter.

Examples presented above cover only a small area on the map of possible applications. For instance, it would be interesting to obtain an estimator in the low-rank matrix completion framework that admits strong performance guarantees for the heavy-tailed noise model. Results obtained in Section 4.4 for the matrix regression problem do not seem to yield a straightforward solution in this case. Another promising direction is related to design of robust techniques for Bayesian inference and evaluation of the geometric median in the space of probability measures. We plan to address these questions in the future work.

Appendix: Proof of Lemma 2.1, part (b)

Once again, assume that the claim does not hold and xizr\|x_{i}-z\|\leq r, i=1,,(1α)k+1i=1,\ldots,\lfloor(1-\alpha)k\rfloor+1.

be the directional derivative of \|\cdot\| at the point xx in direction uu. We need the following useful fact from convex analysis:

There exists g:=gx,uxg^{\ast}:=g^{\ast}_{x,u}\in\partial\|x\| such that D(x;u)=g,uD(\|x\|;u)=\langle g^{\ast},u\rangle, where g,u:=g(u)\langle g^{\ast},u\rangle:=g^{\ast}(u).

Using Lemma .1, it is easy to see that there exist gjxjxg^{\ast}_{j}\in\partial\|x_{j}-x_{\ast}\|, j=1,,kj=1,\ldots,k such that

Moreover, for any uu, DF(x;zx)0DF(x_{\ast};z-x_{\ast})\geq 0 by the definition of xx_{\ast}. Note that for any jj

By the definition of gjg^{\ast}_{j} and triangle inequality,

and, since gj1\|g^{\ast}_{j}\|_{\ast}\leq 1,

whenever Cα2(1α)12αC_{\alpha}\geq\frac{2(1-\alpha)}{1-2\alpha}.

Acknowledgements

S. Minsker was supported by grants NSF DMS-0847388, NSF CCF-0808847, and R01-ES-017436 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH).

I want to thank Anirban Bhattacharya, David Dunson, the anonymous Referees and the Area Editor for their valuable comments and suggestions, and Philippe Rigollet for pointing out several missing references.

References