Geometric Median in Nearly Linear Time

Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, Aaron Sidford

Introduction

This problem, also known as the geometric median problem, is well studied and has numerous applications. It is often considered over low dimensional spaces in the context of the facility location problem and over higher dimensional spaces it has applications to clustering in machine learning and data analysis. For example, computing the geometric median is a subroutine in popular expectation maximization heuristics for kk-medians clustering.

Our O(ndlog3nϵ)O(nd\log^{3}\frac{n}{\epsilon}) time algorithm is a careful modification of standard interior point methods for solving the geometric median problem. We provide a long step interior point method tailored to the geometric median problem for which we can implement every iteration in nearly linear time. While our analysis starts with a simple O((nd)O(1)log1ϵ)O((nd)^{O(1)}\log\frac{1}{\epsilon}) time interior point method and shows how to improve it, our final algorithm is quite non-standard from the perspective of interior point literature. Our result is one of very few cases we are aware of outperforming traditional interior point theory and the only we are aware of using interior point methods to obtain a nearly linear time algorithm for a canonical optimization problem that traditionally requires superlinear time. We hope our work leads to further improvements in this line of research.

Our O(dϵ2)O(d\epsilon^{-2}) algorithm is a relatively straightforward application of sampling techniques and stochastic subgradient descent. Some additional insight is required simply to provide a rigorous analysis of the robustness of the geometric median and use this to streamline our application of stochastic subgradient descent. We include it for completeness however, we defer its proof to Appendix C. The bulk of the work in this paper is focused on developing our O(ndlog3nϵ)O(nd\log^{3}\frac{n}{\epsilon}) time algorithm which we believe uses a set of techniques of independent interest.

The geometric median problem was first formulated for the case of three points in the early 1600s by Pierre de Fermat . A simple elegant ruler and compass construction was given in the same century by Evangelista Torricelli. Such a construction does not generalize when a larger number of points is considered: Bajaj has shown the even for five points, the geometric median is not expressible by radicals over the rationals . Hence, the (1+ϵ)(1+\epsilon)-approximate problem has been studied for larger values of nn.

we instead write the problem as minimizing a linear function over a convex set:

Clearly, these problems are the same as at optimality αi=x(i)a(i)2\alpha_{i}=\|x^{(i)}-a^{(i)}\|_{2}.

Difficulties

Unfortunately obtaining a nearly linear time algorithm for geometric median using interior point methods as presented poses numerous difficulties. Particularly troubling is the number of iterations required by standard interior point algorithms. The approach outlined in the previous section produced an O(n)O(n)-self concordant barrier and even if we use more advanced self concordance machinery, i.e. the universal barrier , the best known self concordance of barrier for the convex set i[n]xa(i)2c\sum_{i\in[n]}\|x-a^{(i)}\|_{2}\leq c is O(d)O(d). An interesting open question still left open by our work is to determine what is the minimal self concordance of a barrier for this set.

Consequently, even if we could implement every iteration of an interior point scheme in nearly linear time it is unclear whether one should hope for a nearly linear time interior point algorithm for the geometric median. While there are a instances of outperforming standard self-concordance analysis , these instances are few, complex, and to varying degrees specialized to the problems they solve. Moreover, we are unaware of any interior point scheme providing a provable nearly linear time for a general nontrivial convex optimization problem.

Beyond Standard Interior Point

Despite these difficulties we do obtain a nearly linear time interior point based algorithm that only requires O(lognϵ)O(\log\frac{n}{\epsilon}) iterations, i.e. increases to the path parameter. After choosing the natural penalty functions p(i)p^{(i)} described above, we optimize in closed form over the αi\alpha_{i} to obtain the following penalized objective function:It is unclear how to extend our proof for the simpler function: i[n]1+t2xa(i)22\sum_{i\in[n]}\sqrt{1+t^{2}\|x-a^{(i)}\|_{2}^{2}}.

In fact, we show that this movement over such a long step, i.e. a constant increase in tt, in the directions orthogonal to the bad direction is small enough that for any movement around a ball of this size the Hessian of ftf_{t} only changes by a small multiplicative constant. In short, starting at xtx_{t} there exists a point yy obtained just by moving from xtx_{t} in the bad direction, such that yy is close enough to xtx_{t^{\prime}} that standard first order method will converge quickly to xtx_{t^{\prime}}! Thus, we might hope to find such a yy, quickly converge to xtx_{t^{\prime}} and repeat. If we increase tt by a multiplicative constant in every such iterations, standard interior point theory suggests that O(lognϵ)O(\log\frac{n}{\epsilon}) iterations suffices.

Building an Algorithm

To turn the structural result in the previous section into a fast algorithm there are several further issues we need to address. We need to

(1) Show how to find the point along the bad direction that is close to xtx_{t^{\prime}}

(2) Show how to solve linear systems in the Hessian to actually converge quickly to xtx_{t^{\prime}}

(4) Bound the accuracy required by these computations

Deferring (1) for the moment, our solution to the rest are relatively straightforward. Careful inspection of the Hessian of ftf_{t} reveals that it is well approximated by a multiple of the identity matrix minus a rank 1 matrix. Consequently using explicit formulas for the inverse of of matrix under rank 1 updates, i.e. the Sherman-Morrison formula, we can solve such systems in nearly linear time thereby addressing (2). For (3), we show that the well known power method carefully applied to the Hessian yields the bad direction if it exists. Finally, for (4) we show that a constant approximate geometric median is near enough to the central path for t=Θ(1f(x))t=\Theta(\frac{1}{f(x_{*})}) and that it suffices to compute a central path point at t=O(nf(x)ϵ)t=O(\frac{n}{f(x_{*})\epsilon}) to compute a 1+ϵ1+\epsilon-geometric median. Moreover, for these values of tt, the precision needed in other operations is clear.

The more difficult operation is (1). Given xtx_{t} and the bad direction exactly, it is still not clear how to find the point along the bad direction line from xtx_{t} that is close to xtx_{t^{\prime}}. Just performing binary search on the objective function a priori might not yield such a point due to discrepancies between a ball in Euclidean norm and a ball in hessian norm and the size of the distance from the optimal point in euclidean norm. To overcome this issue we still line search on the bad direction, however rather than simply using f(xt+αvt)f(x_{t}+\alpha\cdot v_{t}) as the objective function to line search on, we use the function g(α)=minxxtαvt2cf(x)g(\alpha)=\min_{\|x-x_{t}-\alpha\cdot v_{t}\|_{2}\leq c}f(x) for some constant cc, that is given an α\alpha we move α\alpha in the bad direction and take the best objective function value in a ball around that point. For appropriate choice of cc the minimizers of α\alpha will include the optimal point we are looking for. Moreover, we can show that gg is convex and that it suffices to perform the minimization approximately.

Putting these pieces together yields our result. We perform O(lognϵ)O(\log\frac{n}{\epsilon}) iterations of interior point (i.e. increasing tt), where in each iteration we spend O(ndlognϵ)O(nd\log\frac{n}{\epsilon}) time to compute a high quality approximation to the bad direction, and then we perform O(lognϵ)O(\log\frac{n}{\epsilon}) approximate evaluations on g(α)g(\alpha) to binary search on the bad direction line, and then to approximately evaluate gg we perform gradient descent in approximate Hessian norm to high precision which again takes O(ndlognϵ)O(nd\log\frac{n}{\epsilon}) time. Altogether this yields a O(ndlog3nϵ)O(nd\log^{3}\frac{n}{\epsilon}) time algorithm to compute a 1+ϵ1+\epsilon geometric median. Here we made minimal effort to improve the log factors and plan to investigate this further in future work.

In addition to providing a nearly linear time algorithm we provide a stand alone result on quickly computing a crude (1+ϵ)(1+\epsilon)-approximate geometric median in Section C. In particular, given an oracle for sampling a random a(i)a^{(i)} we provide an O(dϵ2)O(d\epsilon^{-2}), i.e. sublinear, time algorithm that computes such an approximate median. Our algorithm for this result is fairly straightforward. First, we show that random sampling can be used to obtain some constant approximate information about the optimal point in constant time. In particular we show how this can be used to deduce an Euclidean ball which contains the optimal point. Second, we perform stochastic subgradient descent within this ball to achieve our desired result.

4 Paper Organization

The rest of the paper is structured as follows. After covering preliminaries in Section 2, in Section 3 we provide various results about the central path that we use to derive our nearly linear time algorithm. In Section 4 we then provide our nearly linear time algorithm. All the proofs and supporting lemmas for these sections are deferred to Appendix A and Appendix B. In Appendix C we provide our O(d/ϵ2)O(d/\epsilon^{2}) algorithm, in Appendix D we provide the derivation of our penalized objective function, in Appendix E we provide general technical machinery we use throughout and in Appendix F we show how to extend our results to Weber’s problem, i.e. weighted geometric median.

Notation

2 Problem Notation

3 Penalized Objective Notation

To solve this problem, we smooth the objective function ff and instead consider the following family of penalized objective functions parameterized by t>0t>0

Properties of the Central Path

Here provide various facts regarding the penalized objective function and the central path. While we use the lemmas in this section throughout the paper, the main contribution of this section is Lemma 5 in Section 3.3. There we prove that with the exception of a single direction, the change in the central path is small over a constant multiplicative change in the path parameter. In addition, we show that our penalized objective function is stable under changes in a O(1t)O(\frac{1}{t}) Euclidean ball (Section 3.1), we bound the change in the Hessian over the central path (Section 3.2), and we relate f(xt)f(x_{t}) to f(x)f(x_{*}) (Section 3.4).

Here, we show that the Hessian of the penalized objective function is stable under changes in a O(1t)O(\frac{1}{t}) sized Euclidean ball. This shows that if we have a point which is close to a central path point in Euclidean norm, then we can use Newton method to find it.

Suppose that xy2ϵt\|x-y\|_{2}\leq\frac{\epsilon}{t} with ϵ120\epsilon\leq\frac{1}{20}. Then, we have

2 How Much Does the Hessian Change Along the Path?

Here we bound how much the Hessian of the penalized objective function can change along the central path. First we provide the following lemma bound several aspects of the penalized objective function and proving that the weight, wtw_{t}, only changes by a small amount multiplicatively given small multiplicative changes in the path parameter, tt.

For all t0t\geq 0 and i[n]i\in[n] the following hold

Consequently, for all ttt^{\prime}\geq t we have that (tt)2wtwt(tt)2wt\left(\frac{t}{t^{\prime}}\right)^{2}w_{t}\leq w_{t^{\prime}}\leq\left(\frac{t^{\prime}}{t}\right)^{2}w_{t}.

Next we use this lemma to bound the change in the Hessian with respect to tt.

and therefore for all β[0,18]\beta\in[0,\frac{1}{8}]

3 Where is the Next Optimal Point?

Here we prove our main result of this section. We prove that over a long step the central path moves very little in directions orthogonal to the smallest eigenvector of the Hessian. We begin by noting the Hessian is approximately a scaled identity minus a rank 1 matrix.

Using this and the lemmas of the previous section we bound the amount xtx_{t} can move in every direction far from vtv_{t}.

For all t0t\geq 0, β[0,1600]\beta\in[0,\frac{1}{600}], and any unit vector yy with y,vt1t2κ|\langle y,v_{t}\rangle|\leq\frac{1}{t^{2}\cdot\kappa} where κ=maxδ[t,(1+β)t]wδμδ\kappa=\max_{\delta\in[t,(1+\beta)t]}\frac{w_{\delta}}{\mu_{\delta}}, we have y(x(1+β)txt)6βty^{\top}(x_{(1+\beta)t}-x_{t})\leq\frac{6\beta}{t}.

4 Where is the End?

In this section, we bound the quality of the central path with respect to the geometric median objective. In particular, we show that if we can solve the problem for some t=2nϵf(x)t=\frac{2n}{\epsilon f(x_{*})} then we obtain an (1+ϵ)(1+\epsilon)-approximate solution. As our algorithm ultimately starts from an initial t=1/O(f(x))t=1/O(f(x_{*})) and increases tt by a multiplicative constant in every iteration, this yields an O(lognϵ)O(\log\frac{n}{\epsilon}) iteration algorithm.

f(xt)f(x)2ntf(x_{t})-f(x_{*})\leq\frac{2n}{t} for all t>0t>0.

Nearly Linear Time Geometric Median

Here we show how to use the structural results from the previous section to obtain a nearly linear time algorithm for computing the geometric median. Our algorithm follows a simple structure (See Algorithm 1). First we use simply average the a(i)a^{(i)} to compute a 2-approximate median, denoted x(0)x^{(0)}. Then for a number of iterations we repeatedly move closer to xtx_{t} for some path parameter tt, compute the minimum eigenvector of the Hessian, and line search in that direction to find an approximation to a point further along the central path. Ultimately, this yields a point x(k)x^{(k)} that is precise enough approximation to a point along the central path with large enough tt that we can simply out x(k)x^{(k)} as our (1+ϵ)(1+\epsilon)-approximate geometric median.

Furthermore, we show that the v(i)v^{(i)} computed by this algorithm is sufficiently close to the bad direction. Combining 7 with the structural results from the previous section and Lemma 28, a minor technical lemma regarding the transitivity of large inner products,we provide the following lemma.

Let (λ,u)=ApproxMinEig(x,t,ϵv)(\lambda,u)=\mathtt{ApproxMinEig}(x,t,\epsilon_{v}) for ϵv<18\epsilon_{v}<\frac{1}{8} and xxt2ϵct\|x-x_{t}\|_{2}\leq\frac{\epsilon_{c}}{t} for ϵc(ϵv36)32\epsilon_{c}\leq(\frac{\epsilon_{v}}{36})^{\frac{3}{2}}. If μt14t2wt\mu_{t}\leq\frac{1}{4}t^{2}\cdot w_{t} then with high probability in n/ϵvn/\epsilon_{v} for all unit vectors yuy\perp u, we have y,vt28ϵv\langle y,v_{t}\rangle^{2}\leq 8\epsilon_{v}.

Note that this lemma assumes μt\mu_{t} is small. When μt\mu_{t} is large, we instead show that the next central path point is close to the current point and hence we do not need to compute the bad direction to center quickly.

Suppose μt14t2wt\mu_{t}\geq\frac{1}{4}t^{2}\cdot w_{t} and let t[t,(1+1600)t]t^{\prime}\in[t,(1+\frac{1}{600})t] then xtxt21100t\|x_{t^{\prime}}-x_{t}\|_{2}\leq\frac{1}{100t}.

2 Line Searching

Here we show how to line search along the bad direction to find the next point on the central path. Unfortunately, simply performing binary search on objective function directly may not suffice. If we search over α\alpha to minimize fti+1(y(i)+αv(i))f_{t_{i+1}}(y^{(i)}+\alpha v^{(i)}) it is unclear if we actually obtain a point close to xt+1x_{t+1}. It might be the case that even after minimizing α\alpha we would be unable to move towards xt+1x_{t+1} efficiently.

To overcome this difficulty, we use the fact that over the region xy2=O(1t)\|x-y\|_{2}=O(\frac{1}{t}) the Hessian changes by at most a constant and therefore we can minimize ft(x)f_{t}(x) over this region extremely quickly. Therefore, we instead line search on the following function

and use that we can evaluate gt,y,v(α)g_{t,y,v}(\alpha) approximately by using an appropriate centering procedure. We can show (See Lemma 30) that gt,y,v(α)g_{t,y,v}(\alpha) is convex and therefore we can minimize it efficiently just by doing an appropriate binary search. By finding the approximately minimizing α\alpha and outputting the corresponding approximately minimizing xx, we can obtain x(i+1)x^{(i+1)} that is close enough to xti+1x_{t_{i+1}}. For notational convenience, we simply write g(α)g(\alpha) if t,y,vt,y,v is clear from the context.

First, we show how we can locally center and provide error analysis for that algorithm.

Using this local centering algorithm as well as a general result for minimizing one dimensional convex functions using a noisy oracle (See Section E.3) we obtain our line search algorithm.

We also provide the following lemma useful for finding the first center.

3 Putting It All Together

Combining the results of the previous sections, we prove our main theorem.

In O(ndlog3(nϵ))O(nd\log^{3}(\frac{n}{\epsilon})) time, Algorithm 1 outputs an (1+ϵ)(1+\epsilon)-approximate geometric median with constant probability.

Acknowledgments

We thank Yan Kit Chim, Ravi Kannan, and Jonathan A. Kelner for many helpful conversations. We thank the reviewers for their help in completing the previous work table. This work was partially supported by NSF awards 0843915, 1065106 and 1111109, NSF Graduate Research Fellowship (grant no. 1122374) and Sansom Graduate Fellowship in Computer Science. Part of this work was done while authors were visiting the Simons Institute for the Theory of Computing, UC Berkeley.

References

Appendix A Properties of the Central Path (Proofs)

Here we provide proofs of the claims in Section 3 as well as additional technical lemmas we use throughout the paper.

Here we provide basic facts regarding the central path that we will use throughout our analysis. First we compute various derivatives of the penalized objective function.

and solving for ddtxt\frac{d}{dt}x_{t} then yields

Next, in we provide simple facts regarding the Hessian of the penalized objective function.

A.2 Stability of Hessian

Here we prove the following stronger statement, for all i[n]i\in[n]

where we used that y=xˉ+αvy=\bar{x}+\alpha v and ya(i)22=α2+xˉa(i)22\|y-a^{(i)}\|_{2}^{2}=\alpha^{2}+\|\bar{x}-a^{(i)}\|_{2}^{2} (since v(xˉa(i))v\perp(\bar{x}-a^{(i)})). Now we know that α2ϵ2t2\alpha^{2}\leq\frac{\epsilon^{2}}{t^{2}} and therefore, by Young’s inequality and Cauchy Schwarz we have that for all γ>0\gamma>0

Now, we separate the proof into two cases depending if txa(i)22ϵ1/2gt(i)(x)t\|x-a^{(i)}\|_{2}\geq 2\epsilon^{1/2}\sqrt{g_{t}^{(i)}(x)}.

If txa(i)22ϵ1/3gt(i)(x)t\|x-a^{(i)}\|_{2}\geq 2\epsilon^{1/3}\sqrt{g_{t}^{(i)}(x)} then since ϵ120\epsilon\leq\frac{1}{20} we have that

and tya(i)ϵt\|y-a^{(i)}\|\geq\epsilon, justifying our assumption that u(i)(x)0u^{(i)}(x)\neq 0 and u(i)(y)0u^{(i)}(y)\neq 0. Furthermore, this implies that

and therefore letting γ=ϵ2/3gt(i)(x)\gamma=\frac{\epsilon^{2/3}}{g_{t}^{(i)}(x)} yields

Since vu(i)(x)v\perp u^{(i)}(x) and v,zv,z are unit vectors, both [vz]2\left[v^{\top}z\right]^{2} and 1gt(i)(x)\frac{1}{g_{t}^{(i)}(x)} are less than

Otherwise, txa(i)2<2ϵ1/3gt(i)(x)t\|x-a^{(i)}\|_{2}<2\epsilon^{1/3}\sqrt{g_{t}^{(i)}(x)} and therefore

Therefore independent of (A.1) and the assumption that u(i)(x)0u^{(i)}(x)\neq 0 and u(i)(y)0u^{(i)}(y)\neq 0 we have

Now, we note that xy2ϵtϵgt(i)(x)t\|x-y\|_{2}\leq\frac{\epsilon}{t}\leq\epsilon\cdot\frac{g_{t}^{(i)}(x)}{t}. Therefore, by Lemma 15 we have that

Since ϵ<120\epsilon<\frac{1}{20}, the result follows. ∎

Consequently, so long as we have a point within a O(1t)O(\frac{1}{t}) sized Euclidean ball of some xtx_{t}, Newton’s method (or an appropriately transformed first order method) within the ball will converge quickly.

A.3 How Much Does the Hessian Change Along the Path?

Using this fact and the fact that txta(i)2gt(i)t\|x_{t}-a^{(i)}\|_{2}\leq g_{t}^{(i)} we have

which by Cauchy Schwarz and that txta(i)2gt(i)(xt)t\|x_{t}-a^{(i)}\|_{2}\leq g_{t}^{(i)}(x_{t}) yields the second equation. Furthermore,

which yields the third equation. Therefore, we have that

Exponentiating the above inequality yields the final inequality. ∎

and therefore by Lemma 2 and the fact that txta(i)2gt(i)t\|x_{t}-a^{(i)}\|_{2}\leq g_{t}^{(i)} we have

which completes the proof of (3.1). To prove (3.2), let vv be any unit vector and note that

where we used Lemma 3 and 0β180\leq\beta\leq\frac{1}{8} at the last line. ∎

A.4 Where is the next Optimal Point?

This follows immediately from Lemma 14, regarding the hessian of the penalized objective function, and Lemma 25, regarding the sum of PSD matrices expressed as the identity matrix minus a rank 1 matrix. ∎

Now since clearly αxαa(i)2gα(i)\alpha\|x_{\alpha}-a^{(i)}\|_{2}\leq g_{\alpha}^{(i)}, invoking Lemma 2 yields that

Now by invoking Lemma 3 and the Lemma 4, we have that

Let SS be the subspace orthogonal to vtv_{t}. Then, Lemma 4 shows that Ht12t2wtI\mathbf{H}_{t}\succeq\frac{1}{2}t^{2}w_{t}\mathbf{I} on SS and hence Ht214t4wt2I\mathbf{H}_{t}^{2}\succeq\frac{1}{4}t^{4}w_{t}^{2}\mathbf{I} on SS.By AB\mathbf{A}\preceq\mathbf{B} on SS we mean that for all xSx\in S we have xAxxBxx^{\top}\mathbf{A}x\leq x^{\top}\mathbf{B}x. The meaning of AB\mathbf{A}\succeq\mathbf{B} on SS is analagous. Since Hα2Ht2240βt4wt2\|\mathbf{H}_{\alpha}^{2}-\mathbf{H}_{t}^{2}\|_{2}\leq 40\beta t^{4}w_{t}^{2}, we have that

Now, we split y=z+y,vtvty=z+\langle y,v_{t}\rangle v_{t} where zSz\in S. Then, we have that

Combining these and using that β[0,1/600]\beta\in[0,1/600] yields that

A.5 Where is the End?

Therefore, by Cauchy Schwarz and the fact that txta(i)2gt(i)(xt)1+gt(i)t\|x_{t}-a^{(i)}\|_{2}\leq g_{t}^{(i)}(x_{t})\leq 1+g_{t}^{(i)}

Furthermore, since 1+gt(i)(xt)2+txta(i)21+g_{t}^{(i)}(x_{t})\leq 2+t\|x_{t}-a^{(i)}\|_{2} we have

A.6 Simple Lemmas

Here we provide various small technical Lemmas that we will use to bound the accuracy with which we need to carry out various operations in our algorithm. Here we use some notation from Section 4 to simplify our bounds and make them more readily applied.

For any xx, we have that xxt2f(x).\|x-x_{t}\|_{2}\leq f(x).

For the final inequality, we use the fact that f(x)f(xt)+nxxt2f(x)\leq f(x_{t})+n\|x-x_{t}\|_{2} and the fact that f(xt)f(x)+2ntf(x_{t})\leq f(x_{*})+\frac{2n}{t} by Lemma 6 and get

For the second inequality, note that Lemma 14 and Lemma 18 yields that

Consequently, applying 29 again yields the lower bound.∎

Appendix B Nearly Linear Time Geometric Median (Proofs)

Here we provide proofs, algorithms, and technical lemmas from Section 4.

We write x=i[d]αivi(A)x=\sum_{i\in[d]}\alpha_{i}v_{i}(\mathbf{A}). Then, we have

where we used that λ2λ1=1geg.\frac{\lambda_{2}}{\lambda_{1}}=1-g\leq e^{-g}.

where 1j\vec{1}_{j} is the indicator vector for coordinate jj. Consequently hh is 11-Lipschitz and by Gaussian concentration for Lipschitz functions we know there are absolute constants CC and cc such that

By the concavity of square root and the expected value of the chi-squared distribution we have

Consequently, since s1s\geq 1 we have that Pr[h(α)(1+λ)s]Cexp(cλ2)\Pr[h(\alpha)\geq(1+\lambda)\cdot\sqrt{s}]\leq C\exp(-c\cdot\lambda^{2}) for λ1\lambda\geq 1 and that j1αj(λj(A)λ1(A))=O(ns/ϵ)\sum_{j\neq 1}\alpha_{j}\cdot\left(\frac{\lambda_{j}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}\right)=O(ns/\epsilon) with high probability in n/ϵn/\epsilon. Since k=Ω(1glog(nsϵ))k=\Omega(\frac{1}{g}\log(\frac{ns}{\epsilon})), we have v1(A),u21ϵ\left\langle v_{1}(\mathbf{A}),u\right\rangle^{2}\geq 1-\epsilon with high probability in n/ϵn/\epsilon. Furthermore, this implies that

By Lemma 4 we know that 12Z2ft(x)Z\frac{1}{2}\mathbf{Z}\preceq\nabla^{2}f_{t}(x)\preceq\mathbf{Z} where

Consequently, if μt(x)14t2wt(x)\mu_{t}(x)\leq\frac{1}{4}t^{2}w_{t}(x), then for all unit vectors zvt(x)z\perp v_{t}(x), we have that

Therefore, in this case, A\mathbf{A} has a constant multiplicative gap between its top two eigenvectors and stable rank at most a constant (i.e. g=Ω(1)g=\Omega(1) and s=O(1)s=O(1) in Theorem 20). Consequently, by Theorem 20 we have vt(x),u21ϵ\langle v_{t}(x),u\rangle^{2}\geq 1-\epsilon.

On the other hand, by Lemma 26, we have that

By Lemma 7 we know that vt(x),u21ϵv\langle v_{t}(x),u\rangle^{2}\geq 1-\epsilon_{v}. Since clearly xxt2120t\|x-x_{t}\|_{2}\leq\frac{1}{20t}, by assumption, Lemma 1 shows

Furthermore, since μt14t2wt\mu_{t}\leq\frac{1}{4}t^{2}\cdot w_{t}, as in Lemma 7 we know that the largest eigenvalue of A\mathbf{A} defined in ApproxMinEig(x,t,ϵ)\mathtt{ApproxMinEig}(x,t,\epsilon) is at least 34t2wt\frac{3}{4}t^{2}\cdot w_{t} while the second largest eigenvalue is at most 12t2wt\frac{1}{2}t^{2}\cdot w_{t}. Consequently, the eigenvalue gap, gg, defined in Lemma 27 is at least 13\frac{1}{3} and this lemma shows that vt,u2136ϵc2/31ϵv\langle v_{t},u\rangle^{2}\geq 1-36\epsilon_{c}^{2/3}\geq 1-\epsilon_{v}. Consequently, by Lemma 28, we have that u,vt214ϵv\langle u,v_{t}\rangle^{2}\geq 1-4\epsilon_{v}.

To prove the final claim, we write u=αvt+βwu=\alpha v_{t}+\beta w for an unit vector wvtw\perp v_{t}. Since yuy\perp u, we have that 0=αvt,y+βw,y0=\alpha\langle v_{t},y\rangle+\beta\langle w,y\rangle. Then, either vt,y=0\langle v_{t},y\rangle=0 and the result follows or α2vt,y2=β2w,y2\alpha^{2}\langle v_{t},y\rangle^{2}=\beta^{2}\langle w,y\rangle^{2} and since α2+β2=1\alpha^{2}+\beta^{2}=1, we have

where in the last line we used that α214ϵv>12\alpha^{2}\geq 1-4\epsilon_{v}>\frac{1}{2} since ϵv18\epsilon_{v}\leq\frac{1}{8}. ∎

Consequently, by Lemma 13, the fact that txta(i)2gt(i)t\|x_{t}-a^{(i)}\|_{2}\leq g_{t}^{(i)}, and Lemma 2 we have

B.2 Line Searching

Here we prove the main results we use on centering, Lemma 10, and line searching Lemma 11. These results are our main tools for computing approximations to the central path. To prove Lemma 11 we also include here two preliminary lemmas, Lemma 21 and Lemma 22, on the structure of gt,y,vg_{t,y,v} defined in (4.1).

The guarantee on x(k)x^{(k)} then follows from our choice of kk.

For the running time, Lemma 7 showed the cost of ApproxMinEig\mathtt{ApproxMinEig} is O(ndlog(nϵ))O(nd\log(\frac{n}{\epsilon})). Using Lemma 31 we see that the cost per iteration is O(nd)O(nd) and therefore, the total cost of the kk iterations is O(ndlog(1ϵ))O(nd\log(\frac{1}{\epsilon})). Combining yields the running time. ∎

Next, by Lemma 13, triangle inequality, and the fact that txa(i)2gt(i)(x)t\|x-a^{(i)}\|_{2}\leq g_{t}^{(i)}(x) we have

If μt14t2wt\mu_{t}\leq\frac{1}{4}t^{2}\cdot w_{t} then by Lemma 8 and our choice of ϵv\epsilon_{v} we have that if zuz\perp u then

Otherwise, we have μt14t2wt\mu_{t}\geq\frac{1}{4}t^{2}\cdot w_{t} and by Lemma 9 we have xtxt21100t\|x_{t^{\prime}}-x_{t}\|_{2}\leq\frac{1}{100t}.

In either case, since yxt21100t\|y-x_{t}\|_{2}\leq\frac{1}{100t}, we can reach xtx_{t^{\prime}} from yy by first moving an Euclidean distance of 1100t\frac{1}{100t} to go from yy to xtx_{t}, then adding some multiple of vv, then moving an Euclidean distance of 1100t\frac{1}{100t} in a direction perpendicular to vv. Since the total movement perpendicular to vv is 1100t+1100t149t\frac{1}{100t}+\frac{1}{100t}\leq\frac{1}{49t^{\prime}} we have that minαgt,y,v(α)=ft(xt)\min_{\alpha}g_{t^{\prime},y,v}(\alpha)=f_{t^{\prime}}(x_{t^{\prime}}) as desired.

All that remains is to show that there is a minimizer of gt,y,vg_{t^{\prime},y,v} in the range [6f(x),6f(x)][-6f(x_{*}),6f(x_{*})]. However, by Lemma 6 and Lemma 16 we know that

Consequently, α[6f(x),6f(x)]\alpha_{*}\in[-6f(x_{*}),6f(x_{*})] as desired. ∎

By (B.3) we know that ftf_{t^{\prime}} is ntnt^{\prime} Lipschitz and therefore

Furthermore, for α[6f(x),6f(x)]\alpha\in[-6f(x_{*}),6f(x_{*})] we know that by Lemma 16

The proof is strictly easier than the proof of Lemma 11 as xαuxt21100t\|x-\alpha^{*}u-x_{t}\|_{2}\leq\frac{1}{100t} is satisfied automatically for α=0.\alpha^{*}=0. Note that this lemma assume less for the initial point. ∎

B.3 Putting It All Together

Appendix C Pseudo Polynomial Time Algorithm

We first prove that the geometric median is stable even if we are allowed to modify up to half of the points. The following lemma is a strengthening of the robustness result in .

Let xx_{*} be a geometric median of {a(i)}i[n]\{a^{(i)}\}_{i\in[n]} and let S[n]S\subseteq[n] with S<n2\left|S\right|<\frac{n}{2}. For all xx

For notational convenience let r=xx2r=\|x_{*}-x\|_{2} and let M=maxiSa(i)x2M=\max_{i\notin S}\|a^{(i)}-x\|_{2}.

For all iSi\notin S, we have that xa(i)2M\|x-a^{(i)}\|_{2}\leq M, hence, we have

Furthermore, by triangle inequality for all iSi\in S, we have

Since xx_{*} is a minimizer of i[n]xa(i)2\sum_{i\in[n]}\|x_{*}-a^{(i)}\|_{2}, we have that

Now, we use Lemma 23 to show that the algorithm CrudeApproximate\mathtt{CrudeApproximate} outputs a constant approximation of the geometric median with high probability.

Let xx_{*} be a geometric median of {a(i)}i[n]\{a^{(i)}\}_{i\in[n]} and (x~,λ)(\widetilde{x},\lambda) be the output of CrudeApproximateK\mathtt{CrudeApproximate}_{K}. We define dTk(x)d_{T}^{k}(x) be the kk-percentile of {xa(i)}iT\left\{\|x-a^{(i)}\|\right\}_{i\in T}. Then, we have that xx~26d[n]60(x~)\|x_{*}-\widetilde{x}\|_{2}\leq 6d_{[n]}^{60}(\widetilde{x}). Furthermore, with probability 1eΘ(K)1-e^{-\Theta(K)}, we have

Lemma 23 shows that for all xx and T[n]T\subseteq[n] with Tn2|T|\leq\frac{n}{2}

Picking TT to be the indices of largest 40% of a(i)x~2\|a^{(i)}-\widetilde{x}\|_{2}, we have

For any point xx, we have that d[n]60(x)dS165(x)d_{[n]}^{60}(x)\leq d_{S_{1}}^{65}(x) with probability 1eΘ(K)1-e^{-\Theta(K)} because S1S_{1} is a random subset of [n][n] with size KK. Taking union bound over elements on S2S_{2}, with probability 1KeΘ(K)=1eΘ(K)1-Ke^{-\Theta(K)}=1-e^{-\Theta(K)}, for all points xS2x\in S_{2}

yielding that d[n]60(x~)λd_{[n]}^{60}(\widetilde{x})\leq\lambda.

Again, since S1S_{1} is a random subset of [n][n] with size KK, we have that dS165(a(i))d[n]70(a(i))d_{S_{1}}^{65}(a^{(i)})\leq d_{[n]}^{70}(a^{(i)}) with probability 1KeΘ(K)=1eΘ(K)1-Ke^{-\Theta(K)}=1-e^{-\Theta(K)}. Therefore,

Since S2S_{2} is an independent random subset, with probability 1eΘ(K)1-e^{-\Theta(K)}, there is iS2i\in S_{2} such that a(i)x2d[n]70(x)\|a^{(i)}-x_{*}\|_{2}\leq d_{[n]}^{70}(x_{*}). In this case, we have

Since ii^{*} minimize dS165(a(i))d_{S_{1}}^{65}(a^{(i)}) over all iS2i\in S_{2}, we have that

C.2 A 1+ϵ1italic-ϵ1+\epsilon Approximation of Geometric Median

1italic-ϵ1+\epsilon Approximation of Geometric Median Here we show how to improve the constant approximation in the previous section to a 1+ϵ1+\epsilon approximation. Our algorithm is essentially stochastic subgradient where we use the information from the previous section to bound the domain in which we need to search for a geometric median.

Let xx be the output of ApproximateMedian\mathtt{ApproximateMedian}. With probability 1eΘ(1/ϵ)1-e^{-\Theta(1/\epsilon)}, we have

Furthermore, the algorithm takes O(d/ϵ2)O(d/\epsilon^{2}) time.

(see [5, Thm 6.1]). Lemma 24 shows that xx(1)26λ\|x^{*}-x^{(1)}\|_{2}\leq 6\lambda and λ2d[n]70(x)\lambda\leq 2d_{[n]}^{70}(x^{*}) with probability 1TeΘ(T)1-\sqrt{T}e^{-\Theta(\sqrt{T})}. In this case, we have

Since d[n]70(x)10.3nf(x)d_{[n]}^{70}(x^{*})\leq\frac{1}{0.3n}f(x^{*}), we have

Appendix D Derivation of Penalty Function

Here we derive our penalized objective function. Consider the following optimization problem:

Since f(x,α)f(x,\alpha) is convex in α\alpha, the minimum αj\alpha_{j}^{*} must satisfy

Solving for such αj\alpha_{j}^{*} under the restriction αj0\alpha_{j}^{*}\geq 0 we obtain

If we drop the terms that do not affect the minimizing xx we obtain our penalty function ftf_{t} :

Appendix E Technical Facts

Here we provide various technical lemmas we use through the paper.

First we provide the following lemma that shows that any matrix obtained as a non-negative linear combination of the identity minus a rank 1 matrix less than the identity results in a matrix that is well approximated spectrally by the identity minus a rank 1 matrix. We use this lemma to characterize the Hessian of our penalized objective function and thereby imply that it is possible to apply the inverse of the Hessian to a vector with high precision.

Consequently, λd1(A)[α2,α]\lambda_{d-1}(\mathbf{A})\in[\frac{\alpha}{2},\alpha] and the result holds by the monotonicity of λi\lambda_{i}. ∎

Next we bound the spectral difference between the outer product of two unit vectors by their inner product. We use this lemma to bound the amount of precision required in our eigenvector computations.

For unit vectors u1u_{1} and u2u_{2} we have

Consequently if (u1u2)21ϵ\left(u_{1}^{\top}u_{2}\right)^{2}\geq 1-\epsilon for ϵ1\epsilon\leq 1 we have that

Note that u1u1u2u2u_{1}u_{1}^{\top}-u_{2}u_{2}^{\top} is a symmetric matrix and all eigenvectors are either orthogonal to both u1u_{1} and u2u_{2} (with eigenvalue 0) or are of the form v=αu1+βu2v=\alpha u_{1}+\beta u_{2} where α\alpha and β\beta are real numbers that are not both . Thus, if vv is an eigenvector of non-zero eigenvalue λ\lambda it must be that

By computing the determinant we see this has a solution only when

Solving for λ\lambda then yields (E.1) and completes the proof. ∎

Next we show how the top eigenvectors of two spectrally similar matrices are related. We use this to bound the amount of spectral approximation we need to obtain accurate eigenvector approximations.

Furthermore, by the optimality of v1(B)v_{1}(\mathbf{B}) we have that

Now since β2=1α2\beta^{2}=1-\alpha^{2} combining these inequalities yields

Rearranging terms, using the definition of gg, and that g(0,1]g\in(0,1] and ϵ0\epsilon\geq 0 yields

Here we prove a an approximate transitivity lemma for inner products of vectors. We use this to bound the accuracy need for certain eigenvector computations.

Without loss of generality, we can write v1=α1v2+β1w1v_{1}=\alpha_{1}v_{2}+\beta_{1}w_{1} for α12+β12=1\alpha_{1}^{2}+\beta_{1}^{2}=1 and unit vector w1v2w_{1}\perp v_{2}. Similarly we can write v3=α3v2+β3w3v_{3}=\alpha_{3}v_{2}+\beta_{3}w_{3} for α32+β32=1\alpha_{3}^{2}+\beta_{3}^{2}=1 and unit vector w3v2w_{3}\perp v_{2}. Now, by the inner products we know that α121ϵ\alpha_{1}^{2}\geq 1-\epsilon and α321ϵ\alpha_{3}^{2}\geq 1-\epsilon and therefore β1ϵ|\beta_{1}|\leq\sqrt{\epsilon} and β3ϵ\left|\beta_{3}\right|\leq\sqrt{\epsilon}. Consequently, since ϵ12\epsilon\leq\frac{1}{2}, β1β3ϵ1ϵα1α3|\beta_{1}\beta_{3}|\leq\epsilon\leq 1-\epsilon\leq|\alpha_{1}\alpha_{3}|, and we have

E.2 Convex Optimization

First we provide a single general lemma about about first order methods for convex optimization. We use this lemma for multiple purposes including bounding errors and quickly compute approximations to the central path.

Next we provide a short technical lemma about the convexity of functions that arises naturally in our line searching procedure.

Let xx^{*} be the solution of this problem. If xy22<α\|x^{*}-y\|_{2}^{2}<\alpha, then x=zx^{*}=z. Otherwise, there is λ>0\lambda>0 such that xx^{*} is the minimizer of

Let Q=Ivv\mathbf{Q}=\mathbf{I}-vv^{\top}. Then, the optimality condition of the above equation shows that

Let η=1+λ\eta=1+\lambda, then we have (Q+λI)=ηIvv(\mathbf{Q}+\lambda\mathbf{I})=\eta\mathbf{I}-vv^{\top}and hence Sherman–Morrison formula shows that

Let c1=Q(zy)22c_{1}=\|\mathbf{Q}(z-y)\|_{2}^{2} and c2=(vQ(zy))2c_{2}=\left(v^{\top}\mathbf{Q}(z-y)\right)^{2}, then we have

Note that this is a polynomial of degree 44 in η\eta and all coefficients can be computed in O(d)O(d) time. Solving this by explicit formula, one can test all 4 possible η\eta’s into the formula (E.3) of xx. Together with trivial case x=zx^{*}=z, we simply need to check among 55 cases to check which is the solution.∎

E.3 Noisy One Dimensional Convex Optimization

Here we show how to minimize a one dimensional convex function giving a noisy oracle for evaluating the function. While this could possibly be done using general results on convex optimization with a membership oracle, the proof in one dimension is much simpler and we provide it here for completeness.

Appendix F Weighted Geometric Median

First, we show that it suffices to consider the case where the weights are integers with bounded sum (Lemma 33). Then, we show that such an instance of the weighted geometric median problem can be solved using the algorithms developed for the unweighted problem.

Any (1+ϵ/5)(1+\epsilon/5)-approximate weighted geometric median of a(1),,a(n)a^{(1)},\ldots,a^{(n)} with the weights w1(1),,w1(n)w_{1}^{(1)},\ldots,w_{1}^{(n)} is also a (1+ϵ)(1+\epsilon)-approximate weighted geometric median of a(1),,a(n)a^{(1)},\ldots,a^{(n)} with the weights w(1),,w(n)w^{(1)},\ldots,w^{(n)}, and

w1(1),,w1(n)w_{1}^{(1)},\ldots,w_{1}^{(n)} are nonnegative integers and i[n]w1(i)5nϵ1\sum_{i\in[n]}w_{1}^{(i)}\leq 5n\epsilon^{-1}.

and W=i[n]w(i)W=\sum_{i\in[n]}w^{(i)}. Furthermore, let ϵ=ϵ/5\epsilon^{\prime}=\epsilon/5 and for each i[n]i\in[n], define w0(i)=nϵWw(i)w_{0}^{(i)}=\frac{n}{\epsilon^{\prime}W}w^{(i)}, w1(i)=w0(i)w_{1}^{(i)}=\left\lfloor w_{0}^{(i)}\right\rfloor and w2(i)=w0(i)w1(i)w_{2}^{(i)}=w_{0}^{(i)}-w_{1}^{(i)}. We also define f0,f1,f2,W0,W1,W2f_{0},f_{1},f_{2},W_{0},W_{1},W_{2} analogously to ff and WW.

Now, assume f1(x)(1+ϵ)f1(x)f_{1}(x)\leq(1+\epsilon^{\prime})f_{1}(x_{*}), where xx_{*} is the minimizer of ff and f0f_{0}. Then:

Now, since W0=nϵW_{0}=\frac{n}{\epsilon^{\prime}} and W1W0nW_{1}\geq W_{0}-n we have

We now proceed to show the main result of this section.

By applying Lemma 33, we can assume that the weights are integer and their sum does not exceed nϵ1.n\epsilon^{-1}. Note that computing the weighted geometric median with such weights is equivalent to computing an unweighted geometric median of O(nϵ1)O(n\epsilon^{-1}) points (where each point of the original input is repeated with the appropriate multiplicity). We now show how to simulate the behavior of our unweighted geometric median algorithms on such a set of points without computing it explicitly.

If ϵ>n1/2\epsilon>n^{-1/2}, we will apply the algorithm ApproximateMedian\mathtt{ApproximateMedian}, achieving a runtime of O(dϵ2)=O(nd)O(d\epsilon^{-2})=O(nd). It is only necessary to check that we can implement weighted sampling from our points with O(n)O(n) preprocessing and O(1)O(1) time per sample. This is achieved by the alias method .

Now assume ϵ<n1/2\epsilon<n^{-1/2}. We will employ the algorithm AccurateMedian\mathtt{AccurateMedian}. Note that we can implement the subroutines LineSearch\mathtt{LineSearch} and ApproxMinEig\mathtt{ApproxMinEig} on the implicitly represented multiset of O(nϵ1)O(n\epsilon^{-1}) points. It is enough to observe only nn of the points are distinct, and all computations performed by these subroutines are identical for identical points. The total runtime will thus be O(ndlog3(n/ϵ2))=O(ndlog3ϵ1)O(nd\log^{3}(n/\epsilon^{2}))=O(nd\log^{3}\epsilon^{-1}).∎