Convex Optimization: Algorithms and Complexity

Sébastien Bubeck

Chapter 1 Introduction

We are interested in algorithms that take as input a convex set X\mathcal{X} and a convex function ff and output an approximate minimum of ff over X\mathcal{X}. We write compactly the problem of finding the minimum of ff over X\mathcal{X} as

In the following we will make more precise how the set of constraints X\mathcal{X} and the objective function ff are specified to the algorithm. Before that we proceed to give a few important examples of convex optimization problems in machine learning.

Many fundamental convex optimization problems in machine learning take the following form:

In classification one has Y={1,1}\mathcal{Y}=\{-1,1\}. Taking fi(x)=max(0,1yixwi)f_{i}(x)=\max(0,1-y_{i}x^{\top}w_{i}) (the so-called hinge loss) and R(x)=x22\mathcal{R}(x)=\|x\|_{2}^{2} one obtains the SVM problem. On the other hand taking fi(x)=log(1+exp(yixwi))f_{i}(x)=\log(1+\exp(-y_{i}x^{\top}w_{i})) (the logistic loss) and again R(x)=x22\mathcal{R}(x)=\|x\|_{2}^{2} one obtains the (regularized) logistic regression problem.

Our last two examples are of a slightly different flavor. In particular the design variable xx is now best viewed as a matrix, and thus we denote it by a capital letter XX. The sparse inverse covariance estimation problem can be written as follows, given some empirical covariance matrix YY,

Intuitively the above problem is simply a regularized maximum likelihood estimator (under a Gaussian assumption).

Finally we introduce the convex version of the matrix completion problem. Here our data set consists of observations of some of the entries of an unknown matrix YY, and we want to “complete" the unobserved entries of YY in such a way that the resulting matrix is “simple" (in the sense that it has low rank). After some massaging (see Candès and Recht (2009)) the (convex) matrix completion problem can be formulated as follows:

where Ω[n]2\Omega\subset[n]^{2} and (Yi,j)(i,j)Ω(Y_{i,j})_{(i,j)\in\Omega} are given.

2 Basic properties of convexity

A basic result about convex sets that we shall use extensively is the Separation Theorem.

Note that if X\mathcal{X} is not closed then one can only guarantee that wx0wx,xXw^{\top}x_{0}\leq w^{\top}x,\forall x\in\mathcal{X} (and w0w\neq 0). This immediately implies the Supporting Hyperplane Theorem (X\partial\mathcal{X} denotes the boundary of X\mathcal{X}, that is the closure without the interior):

We introduce now the key notion of subgradients.

The set of subgradients of ff at xx is denoted f(x)\partial f(x).

To put it differently, for any xXx\in\mathcal{X} and gf(x)g\in\partial f(x), ff is above the linear function yf(x)+g(yx)y\mapsto f(x)+g^{\top}(y-x). The next result shows (essentially) that a convex functions always admit subgradients.

It is obvious that a function is convex if and only if its epigraph is a convex set.

The first claim is almost trivial: let gf((1γ)x+γy)g\in\partial f((1-\gamma)x+\gamma y), then by definition one has

which clearly shows that ff is convex by adding the two (appropriately rescaled) inequalities.

Clearly, by letting tt tend to infinity, one can see that b0b\leq 0. Now let us assume that xx is in the interior of X\mathcal{X}. Then for ε>0\varepsilon>0 small enough, y=x+εaXy=x+\varepsilon a\in\mathcal{X}, which implies that bb cannot be equal to (recall that if b=0b=0 then necessarily a0a\neq 0 which allows to conclude by contradiction). Thus rewriting (1.2) for t=f(y)t=f(y) one obtains

Thus a/bf(x)a/|b|\in\partial f(x) which concludes the proof of the second claim.

Finally let ff be a convex and differentiable function. Then by definition:

which shows that f(x)f(x)\nabla f(x)\in\partial f(x).

3 Why convexity?

The key to the algorithmic success in minimizing convex functions is that these functions exhibit a local to global phenomenon. We have already seen one instance of this in Proposition 1.2.4, where we showed that f(x)f(x)\nabla f(x)\in\partial f(x): the gradient f(x)\nabla f(x) contains a priori only local information about the function ff around xx while the subdifferential f(x)\partial f(x) gives a global information in the form of a linear lower bound on the entire function. Another instance of this local to global phenomenon is that local minima of convex functions are in fact global minima:

Let ff be convex. If xx is a local minimum of ff then xx is a global minimum of ff. Furthermore this happens if and only if 0f(x)0\in\partial f(x).

Clearly 0f(x)0\in\partial f(x) if and only if xx is a global minimum of ff. Now assume that xx is local minimum of ff. Then for γ\gamma small enough one has for any yy,

which implies f(x)f(y)f(x)\leq f(y) and thus xx is a global minimum of ff.

The nice behavior of convex functions will allow for very fast algorithms to optimize them. This alone would not be sufficient to justify the importance of this class of functions (after all constant functions are pretty easy to optimize). However it turns out that surprisingly many optimization problems admit a convex (re)formulation. The excellent book Boyd and Vandenberghe (2004) describes in great details the various methods that one can employ to uncover the convex aspects of an optimization problem. We will not repeat these arguments here, but we have already seen that many famous machine learning problems (SVM, ridge regression, logistic regression, LASSO, sparse covariance estimation, and matrix completion) are formulated as convex problems.

We conclude this section with a simple extension of the optimality condition “0f(x)0\in\partial f(x)” to the case of constrained optimization. We state this result in the case of a differentiable function for sake of simplicity.

Let ff be convex and X\mathcal{X} a closed convex set on which ff is differentiable. Then

The “if" direction is trivial by using that a gradient is also a subgradient. For the “only if" direction it suffices to note that if f(x)(yx)<0\nabla f(x)^{\top}(y-x)<0, then ff is locally decreasing around xx on the line to yy (simply consider h(t)=f(x+t(yx))h(t)=f(x+t(y-x)) and note that h(0)=f(x)(yx)h^{\prime}(0)=\nabla f(x)^{\top}(y-x)).

4 Black-box model

A zeroth order oracle takes as input a point xXx\in\mathcal{X} and outputs the value of ff at xx.

A first order oracle takes as input a point xXx\in\mathcal{X} and outputs a subgradient of ff at xx.

In this context we are interested in understanding the oracle complexity of convex optimization, that is how many queries to the oracles are necessary and sufficient to find an ε\varepsilon-approximate minima of a convex function. To show an upper bound on the sample complexity we need to propose an algorithm, while lower bounds are obtained by information theoretic reasoning (we need to argue that if the number of queries is “too small" then we don’t have enough information about the function to identify an ε\varepsilon-approximate solution).

The black-box model was essentially developed in the early days of convex optimization (in the Seventies) with Nemirovski and Yudin (1983) being still an important reference for this theory (see also Nemirovski (1995)). In the recent years this model and the corresponding algorithms have regained a lot of popularity, essentially for two reasons:

It is possible to develop algorithms with dimension-free oracle complexity which is quite attractive for optimization problems in very high dimension.

Many algorithms developed in this model are robust to noise in the output of the oracles. This is especially interesting for stochastic optimization, and very relevant to machine learning applications. We will explore this in details in Chapter 6.

Chapter 2, Chapter 3 and Chapter 4 are dedicated to the study of the black-box model (noisy oracles are discussed in Chapter 6). We do not cover the setting where only a zeroth order oracle is available, also called derivative free optimization, and we refer to Conn et al. (2009); Audibert et al. (2011) for further references on this.

5 Structured optimization

The black-box model described in the previous section seems extremely wasteful for the applications we discussed in Section 1.1. Consider for instance the LASSO objective: xWxy22+x1x\mapsto\|Wx-y\|_{2}^{2}+\|x\|_{1}. We know this function globally, and assuming that we can only make local queries through oracles seem like an artificial constraint for the design of algorithms. Structured optimization tries to address this observation. Ultimately one would like to take into account the global structure of both ff and X\mathcal{X} in order to propose the most efficient optimization procedure. An extremely powerful hammer for this task are the Interior Point Methods. We will describe this technique in Chapter 5 alongside with other more recent techniques such as FISTA or Mirror Prox.

We briefly describe now two classes of optimization problems for which we will be able to exploit the structure very efficiently, these are the LPs (Linear Programs) and SDPs (Semi-Definite Programs). Ben-Tal and Nemirovski (2001) describe a more general class of Conic Programs but we will not go in that direction here.

6 Overview of the results and disclaimer

The overarching aim of this monograph is to present the main complexity theorems in convex optimization and the corresponding algorithms. We focus on five major results in convex optimization which give the overall structure of the text: the existence of efficient cutting-plane methods with optimal oracle complexity (Chapter 2), a complete characterization of the relation between first order oracle complexity and curvature in the objective function (Chapter 3), first order methods beyond Euclidean spaces (Chapter 4), non-black box methods (such as interior point methods) can give a quadratic improvement in the number of iterations with respect to optimal black-box methods (Chapter 5), and finally noise robustness of first order methods (Chapter 6). Table 1.1 can be used as a quick reference to the results proved in Chapter 2 to Chapter 5, as well as some of the results of Chapter 6 (this last chapter is the most relevant to machine learning but the results are also slightly more specific which make them harder to summarize).

An important disclaimer is that the above selection leaves out methods derived from duality arguments, as well as the two most popular research avenues in convex optimization: (i) using convex optimization in non-convex settings, and (ii) practical large-scale algorithms. Entire books have been written on these topics, and new books have yet to be written on the impressive collection of new results obtained for both (i) and (ii) in the past five years.

A few of the blatant omissions regarding (i) include (a) the theory of submodular optimization (see Bach (2013)), (b) convex relaxations of combinatorial problems (a short example is given in Section 6.6), and (c) methods inspired from convex optimization for non-convex problems such as low-rank matrix factorization (see e.g. Jain et al. (2013) and references therein), neural networks optimization, etc.

With respect to (ii) the most glaring omissions include (a) heuristics (the only heuristic briefly discussed here is the non-linear conjugate gradient in Section 2.4), (b) methods for distributed systems, and (c) adaptivity to unknown parameters. Regarding (a) we refer to Nocedal and Wright (2006) where the most practical algorithms are discussed in great details (e.g., quasi-newton methods such as BFGS and L-BFGS, primal-dual interior point methods, etc.). The recent survey Boyd et al. (2011) discusses the alternating direction method of multipliers (ADMM) which is a popular method to address (b). Finally (c) is a subtle and important issue. In the entire monograph the emphasis is on presenting the algorithms and proofs in the simplest way, and thus for sake of convenience we assume that the relevant parameters describing the regularity and curvature of the objective function (Lipschitz constant, smoothness constant, strong convexity parameter) are known and can be used to tune the algorithm’s own parameters. Line search is a powerful technique to replace the knowledge of these parameters and it is heavily used in practice, see again Nocedal and Wright (2006). We observe however that from a theoretical point of view (c) is only a matter of logarithmic factors as one can always run in parallel several copies of the algorithm with different guesses for the values of the parametersNote that this trick does not work in the context of Chapter 6.. Overall the attitude of this text with respect to (ii) is best summarized by a quote of Thomas Cover: “theory is the first term in the Taylor series of practice”, Cover (1992).

Chapter 2 Convex optimization in finite dimension

As we will see these algorithms have an oracle complexity which is linear (or quadratic) in the dimension, hence the title of the chapter (in the next chapter the oracle complexity will be independent of the dimension). An interesting feature of the methods discussed here is that they only need a separation oracle for the constraint set X\mathcal{X}. In the literature such algorithms are often referred to as cutting plane methods. In particular these methods can be used to find a point xXx\in\mathcal{X} given only a separating oracle for X\mathcal{X} (this is also known as the feasibility problem).

We consider the following simple iterative algorithmAs a warm-up we assume in this section that X\mathcal{X} is known. It should be clear from the arguments in the next section that in fact the same algorithm would work if initialized with S1X\mathcal{S}_{1}\supset\mathcal{X}.: let S1=X\mathcal{S}_{1}=\mathcal{X}, and for t1t\geq 1 do the following:

Query the first order oracle at ctc_{t} and obtain wtf(ct)w_{t}\in\partial f(c_{t}). Let

If stopped after tt queries to the first order oracle then we use tt queries to a zeroth order oracle to output

This procedure is known as the center of gravity method, it was discovered independently on both sides of the Wall by Levin (1965) and Newman (1965).

Before proving this result a few comments are in order.

To attain an ε\varepsilon-optimal point the center of gravity method requires O(nlog(2B/ε))O(n\log(2B/\varepsilon)) queries to both the first and zeroth order oracles. It can be shown that this is the best one can hope for, in the sense that for ε\varepsilon small enough one needs Ω(nlog(1/ε))\Omega(n\log(1/\varepsilon)) calls to the oracle in order to find an ε\varepsilon-optimal point, see Nemirovski and Yudin (1983) for a formal proof.

The rate of convergence given by Theorem 2.1.1 is exponentially fast. In the optimization literature this is called a linear rate as the (estimated) error at iteration t+1t+1 is linearly related to the error at iteration tt.

The last and most important comment concerns the computational complexity of the method. It turns out that finding the center of gravity ctc_{t} is a very difficult problem by itself, and we do not have computationally efficient procedure to carry out this computation in general. In Section 6.7 we will discuss a relatively recent (compared to the 50 years old center of gravity method!) randomized algorithm to approximately compute the center of gravity. This will in turn give a randomized center of gravity method which we will describe in detail.

We now turn to the proof of Theorem 2.1.1. We will use the following elementary result from convex geometry:

Let xx^{*} be such that f(x)=minxXf(x)f(x^{*})=\min_{x\in\mathcal{X}}f(x). Since wtf(ct)w_{t}\in\partial f(c_{t}) one has

which clearly implies that one can never remove the optimal point from our sets in consideration, that is xStx^{*}\in\mathcal{S}_{t} for any tt. Without loss of generality we can assume that we always have wt0w_{t}\neq 0, for otherwise one would have f(ct)=f(x)f(c_{t})=f(x^{*}) which immediately conludes the proof. Now using that wt0w_{t}\neq 0 for any tt and Lemma 2.1.2 one clearly obtains

2 The ellipsoid method

Recall that an ellipsoid is a convex set of the form

We give now a simple geometric lemma, which is at the heart of the ellipsoid method.

By doing a quick picture, one can see that it makes sense to look for an ellipsoid E\mathcal{E} that would be centered at c=twc=-tw, with tt\in (presumably tt will be small), and such that one principal direction is ww (with inverse squared semi-axis a>0a>0), and the other principal directions are all orthogonal to ww (with the same inverse squared semi-axes b>0b>0). In other words we are looking for E={x:(xc)H1(xc)1}\mathcal{E}=\{x:(x-c)^{\top}H^{-1}(x-c)\leq 1\} with

Now we have to express our constraints on the fact that E\mathcal{E} should contain the half Euclidean ball {xB:xw0}\{x\in\mathcal{B}:x^{\top}w\leq 0\}. Since we are also looking for E\mathcal{E} to be as small as possible, it makes sense to ask for E\mathcal{E} to "touch" the Euclidean ball, both at x=wx=-w, and at the equator Bw\partial\mathcal{B}\cap w^{\perp}. The former condition can be written as:

As one can see from the above two equations, we are still free to choose any value for t[0,1/2)t\in[0,1/2) (the fact that we need t<1/2t<1/2 comes from b=1(tt1)2>0b=1-\left(\frac{t}{t-1}\right)^{2}>0). Quite naturally we take the value that minimizes the volume of the resulting ellipsoid. Note that

where f(h)=h2(2hh2)n1f(h)=h^{2}(2h-h^{2})^{n-1}. Elementary computations show that the maximum of ff (on $)isattainedat) is attained ath=1+\frac{1}{n}(whichcorrespondsto(which corresponds tot=\frac{1}{n+1}$), and the value is

where the lower bound follows again from elementary computations. Thus we showed that, for E0=B\mathcal{E}_{0}=\mathcal{B}, (2.3) and (2.4) are satisfied with the ellipsoid given by the set of points xx satisfying:

Applying Sherman-Morrison formula to (2.8) one can recover (2.6) which concludes the proof.

Let Et+1={x:(xct+1)Ht+11(xct+1)1}\mathcal{E}_{t+1}=\{x:(x-c_{t+1})^{\top}H_{t+1}^{-1}(x-c_{t+1})\leq 1\} be the ellipsoid given in Lemma 2.2.1 that contains {xEt:(xct)wt0}\{x\in\mathcal{E}_{t}:(x-c_{t})^{\top}w_{t}\leq 0\}, that is

If stopped after tt iterations and if {c1,,ct}X\{c_{1},\ldots,c_{t}\}\cap\mathcal{X}\neq\emptyset, then we use the zeroth order oracle to output

The following rate of convergence can be proved with the exact same argument than for Theorem 2.1.1 (observe that at step tt one can remove a point in X\mathcal{X} from the current ellipsoid only if ctXc_{t}\in\mathcal{X}).

For t2n2log(R/r)t\geq 2n^{2}\log(R/r) the ellipsoid method satisfies {c1,,ct}X\{c_{1},\ldots,c_{t}\}\cap\mathcal{X}\neq\emptyset and

We observe that the oracle complexity of the ellipsoid method is much worse than the one of the center gravity method, indeed the former needs O(n2log(1/ε))O(n^{2}\log(1/\varepsilon)) calls to the oracles while the latter requires only O(nlog(1/ε))O(n\log(1/\varepsilon)) calls. However from a computational point of view the situation is much better: in many cases one can derive an efficient separation oracle, while the center of gravity method is basically always intractable. This is for instance the case in the context of LPs and SDPs: with the notation of Section 1.5 the computational complexity of the separation oracle for LPs is O(mn)O(mn) while for SDPs it is O(max(m,n)n2)O(\max(m,n)n^{2}) (we use the fact that the spectral decomposition of a matrix can be done in O(n3)O(n^{3}) operations). This gives an overall complexity of O(max(m,n)n3log(1/ε))O(\max(m,n)n^{3}\log(1/\varepsilon)) for LPs and O(max(m,n2)n6log(1/ε))O(\max(m,n^{2})n^{6}\log(1/\varepsilon)) for SDPs. We note however that the ellipsoid method is almost never used in practice, essentially because the method is too rigid to exploit the potential easiness of real problems (e.g., the volume decrease given by (2.4) is essentially always tight).

3 Vaidya’s cutting plane method

We focus here on the feasibility problem (it should be clear from the previous sections how to adapt the argument for optimization). We have seen that for the feasibility problem the center of gravity has a O(n)O(n) oracle complexity and unclear computational complexity (see Section 6.7 for more on this), while the ellipsoid method has oracle complexity O(n2)O(n^{2}) and computational complexity O(n4)O(n^{4}). We describe here the beautiful algorithm of Vaidya (1989, 1996) which has oracle complexity O(nlog(n))O(n\log(n)) and computational complexity O(n4)O(n^{4}), thus getting the best of both the center of gravity and the ellipsoid method. In fact the computational complexity can even be improved further, and the recent breakthrough Lee et al. (2015) shows that it can essentially (up to logarithmic factors) be brought down to O(n3)O(n^{3}).

This section, while giving a fundamental algorithm, should probably be skipped on a first reading. In particular we use several concepts from the theory of interior point methods which are described in Section 5.3.

We also consider the volumetric barrier vv defined by

The intuition is clear: v(x)v(x) is equal to the logarithm of the inverse volume of the Dikin ellipsoid (for the logarithmic barrier) at xx. It will be useful to spell out the hessian of the logarithmic barrier:

3.2 Vaidya’s algorithm

If for some i[mt]i\in[m_{t}] one has σi(t)=minj[mt]σj(t)<ε\sigma_{i}^{(t)}=\min_{j\in[m_{t}]}\sigma_{j}^{(t)}<\varepsilon, then (A(t+1),b(t+1))(A^{(t+1)},b^{(t+1)}) is defined by removing the ithi^{th} row in (A(t),b(t))(A^{(t)},b^{(t)}) (in particular mt+1=mt1m_{t+1}=m_{t}-1).

Then we define (A(t+1),b(t+1))(A^{(t+1)},b^{(t+1)}) by adding to (A(t),b(t))(A^{(t)},b^{(t)}) the row given by (c(t),β(t))(c^{(t)},\beta^{(t)}) (in particular mt+1=mt+1m_{t+1}=m_{t}+1).

It can be shown that the volumetric barrier is a self-concordant barrier, and thus it can be efficiently minimized with Newton’s method. In fact it is enough to do one step of Newton’s method on vtv_{t} initialized at xt1x_{t-1}, see Vaidya (1989, 1996) for more details on this.

3.3 Analysis of Vaidya’s method

The construction of Vaidya’s method is based on a precise understanding of how the volumetric barrier changes when one adds or removes a constraint to the polytope. This understanding is derived in Section 2.3.4. In particular we obtain the following two key inequalities: If case 1 happens at iteration tt then

We show now how these inequalities imply that Vaidya’s method stops after O(nlog(nR/r))O(n\log(nR/r)) steps. First we claim that after 2t2t iterations, case 2 must have happened at least t1t-1 times. Indeed suppose that at iteration 2t12t-1, case 2 has happened t2t-2 times; then 2F(x)\nabla^{2}F(x) is singular and the leverage scores are infinite, so case 2 must happen at iteration 2t2t. Combining this claim with the two inequalities above we obtain:

Since E(x,1)\mathcal{E}(x,1) is always included in the polytope we have that v0(x0)-v_{0}(x_{0}) is at most the logarithm of the volume of the initial polytope which is O(nlog(R))O(n\log(R)). This clearly concludes the proof as the procedure will necessarily stop when the volume is below exp(nlog(r))\exp(n\log(r)) (we also used the trivial bound mtn+1+tm_{t}\leq n+1+t).

3.4 Constraints and the volumetric barrier

The leverage scores σi\sigma_{i} and σ~i\widetilde{\sigma}_{i} are closely related:

First we observe that by Sherman-Morrison’s formula (A+uv)1=A1A1uvA11+A1[u,v](A+uv^{\top})^{-1}=A^{-1}-\frac{A^{-1}uv^{\top}A^{-1}}{1+A^{-1}[u,v]} one has

This immediately proves σ~i(x)σi(x)\widetilde{\sigma}_{i}(x)\leq\sigma_{i}(x). It also implies the inequality σ~i(x)(1σm+1(x))σi(x)\widetilde{\sigma}_{i}(x)\geq(1-\sigma_{m+1}(x))\sigma_{i}(x) thanks the following fact: AAuuA1+A[u,u](1A[u,u])AA-\frac{Auu^{\top}A}{1+A[u,u]}\succeq(1-A[u,u])A. For the last inequality we use that A+AuuA1+A[u,u]11A[u,u]AA+\frac{Auu^{\top}A}{1+A[u,u]}\preceq\frac{1}{1-A[u,u]}A together with

We now assume the following key result, which was first proven by Vaidya. To put the statement in context recall that for a self-concordant barrier ff the suboptimality gap f(x)minff(x)-\min f is intimately related to the Newton decrement f(x)(2f(x))1\|\nabla f(x)\|_{(\nabla^{2}f(x))^{-1}}. Vaidya’s inequality gives a similar claim for the volumetric barrier. We use the version given in [Theorem 2.6, Anstreicher (1998)] which has slightly better numerical constants than the original bound. Recall also the definition of QQ from (2.10).

Let λ(x)=v(x)Q(x)1\lambda(x)=\|\nabla v(x)\|_{Q(x)^{-1}} be an approximate Newton decrement, ε=mini[m]σi(x)\varepsilon=\min_{i\in[m]}\sigma_{i}(x), and assume that λ(x)22εε36\lambda(x)^{2}\leq\frac{2\sqrt{\varepsilon}-\varepsilon}{36}. Then

We also denote λ~\widetilde{\lambda} for the approximate Newton decrement of v~\widetilde{v}. The goal for the rest of the section is to prove the following theorem which gives the precise understanding of the volumetric barrier we were looking for.

Let ε:=mini[m]σi(x)\varepsilon:=\min_{i\in[m]}\sigma_{i}(x^{*}), δ:=σm+1(x)/ε\delta:=\sigma_{m+1}(x^{*})/\sqrt{\varepsilon} and assume that (δε+δ3ε)21δε<2εε36\frac{\left(\delta\sqrt{\varepsilon}+\sqrt{\delta^{3}\sqrt{\varepsilon}}\right)^{2}}{1-\delta\sqrt{\varepsilon}}<\frac{2\sqrt{\varepsilon}-\varepsilon}{36}. Then one has

On the other hand assuming that σ~m+1(x~)=mini[m+1]σ~i(x~)=:ε\widetilde{\sigma}_{m+1}(\widetilde{x}^{*})=\min_{i\in[m+1]}\widetilde{\sigma}_{i}(\widetilde{x}^{*})=:\varepsilon and that ε1/4\varepsilon\leq 1/4, one has

Before going into the proof let us see briefly how Theorem 2.3.4 give the two inequalities stated at the beginning of Section 2.3.3. To prove (2.12) we use (2.14) with δ=1/5\delta=1/5 and ε0.006\varepsilon\leq 0.006, and we observe that in this case the right hand side of (2.14) is lower bounded by 120ε\frac{1}{20}\sqrt{\varepsilon}. On the other hand to prove (2.11) we use (2.15), and we observe that for ε0.006\varepsilon\leq 0.006 the right hand side of (2.15) is upper bounded by ε\varepsilon.

We start with the proof of (2.14). First observe that by factoring (2F(x))1/2(\nabla^{2}F(x))^{1/2} on the left and on the right of 2F~(x)\nabla^{2}\widetilde{F}(x) one obtains

To bound the suboptimality gap of xx^{*} in v~\widetilde{v} we will invoke Theorem 2.3.3 and thus we have to upper bound the approximate Newton decrement λ~\widetilde{\lambda}. Using [(2.16), Lemma 2.3.6] below one has

We now turn to the proof of (2.15). Following the same steps as above we immediately obtain

To invoke Theorem 2.3.3 it remains to upper bound λ(x~)\lambda(\widetilde{x}^{*}). Using [(2.17), Lemma 2.3.6] below one has

We can apply Theorem 2.3.3 since the assumption ε1/4\varepsilon\leq 1/4 implies that (2ε1ε)22εε36\left(\frac{2\varepsilon}{1-\varepsilon}\right)^{2}\leq\frac{2\sqrt{\varepsilon}-\varepsilon}{36}. This concludes the proof of (2.15).

Furthermore if σ~m+1(x)=mini[m+1]σ~i(x)\widetilde{\sigma}_{m+1}(x)=\min_{i\in[m+1]}\widetilde{\sigma}_{i}(x) then one also has

We start with the proof of (2.16). First observe that by Lemma 2.3.1 one has Q~(x)(1σm+1(x))Q(x)\widetilde{Q}(x)\succeq(1-\sigma_{m+1}(x))Q(x) and thus by definition of the Newton decrement

We now use that Q(x)(mini[m]σi(x))2F(x)Q(x)\succeq(\min_{i\in[m]}\sigma_{i}(x))\nabla^{2}F(x) to obtain

By Lemma 2.3.1 one has σ~m+1(x)σm+1(x)\widetilde{\sigma}_{m+1}(x)\leq{\sigma}_{m+1}(x) and thus we see that it only remains to prove

The above inequality follows from a beautiful calculation of Vaidya (see [Lemma 12, Vaidya (1996)]), starting from the identity

We now turn to the proof of (2.17). Following the same steps as above we immediately obtain

Using Lemma 2.3.1 together with the assumption σ~m+1(x)=mini[m+1]σ~i(x)\widetilde{\sigma}_{m+1}(x)=\min_{i\in[m+1]}\widetilde{\sigma}_{i}(x) yields (2.17), thus concluding the proof.

4 Conjugate gradient

Equation (2.20) is true by construction for i=ti=t, and for it1i\leq t-1 it follows by induction, assuming (2.20) at t=1t=1 and using the following formula:

which concludes the proof of xn=xx_{n}=x^{*}.

In order to have a proper black-box method it remains to describe how to build iteratively the orthogonal set {p0,,pn1}\{p_{0},\ldots,p_{n-1}\} based only on gradient evaluations of ff. A natural guess to obtain a set of orthogonal directions (w.r.t. ,A\langle\cdot,\cdot\rangle_{A}) is to take p0=f(x0)p_{0}=\nabla f(x_{0}) and for t1t\geq 1,

Furthermore using the definition (2.22) and f(xt),pt1=0\langle\nabla f(x_{t}),p_{t-1}\rangle=0 one also has

Thus we arrive at the following rewriting of the (linear) conjugate gradient algorithm, where we recall that x0x_{0} is some fixed starting point and p0=f(x0)p_{0}=\nabla f(x_{0}),

Observe that the algorithm defined by (2.23) and (2.24) makes sense for an arbitary convex function, in which case it is called the non-linear conjugate gradient. There are many variants of the non-linear conjugate gradient, and the above form is known as the Fletcher-–Reeves method. Another popular version in practice is the Polak-Ribière method which is based on the fact that for the general non-quadratic case one does not necessarily have f(xt+1),f(xt)=0\langle\nabla f(x_{t+1}),\nabla f(x_{t})\rangle=0, and thus one replaces (2.24) by

We refer to Nocedal and Wright (2006) for more details about these algorithms, as well as for advices on how to deal with the line search in (2.23).

Finally we also note that the linear conjugate gradient method can often attain an approximate solution in much fewer than nn steps. More precisely, denoting κ\kappa for the condition number of AA (that is the ratio of the largest eigenvalue to the smallest eigenvalue of AA), one can show that linear conjugate gradient attains an ε\varepsilon optimal point in a number of iterations of order κlog(1/ε)\sqrt{\kappa}\log(1/\varepsilon). The next chapter will demistify this convergence rate, and in particular we will see that (i) this is the optimal rate among first order methods, and (ii) there is a way to generalize this rate to non-quadratic convex functions (though the algorithm will have to be modified).

Chapter 3 Dimension-free convex optimization

where η>0\eta>0 is a fixed step-size parameter. The rationale behind (3.1) is to make a small step in the direction that minimizes the local first order Taylor approximation of ff (also known as the steepest descent direction).

As we shall see, methods of the type (3.1) can obtain an oracle complexity independent of the dimensionOf course the computational complexity remains at least linear in the dimension since one needs to manipulate gradients.. This feature makes them particularly attractive for optimization in very high dimension.

The following lemma will prove to be useful in our study. It is an easy corollary of Proposition 1.3.3, see also Figure 3.1.

which also implies ΠX(y)x2+yΠX(y)2yx2\|\Pi_{\mathcal{X}}(y)-x\|^{2}+\|y-\Pi_{\mathcal{X}}(y)\|^{2}\leq\|y-x\|^{2}.

Unless specified otherwise all the proofs in this chapter are taken from Nesterov (2004a) (with slight simplification in some cases).

In this section we assume that X\mathcal{X} is contained in an Euclidean ball centered at x1Xx_{1}\in\mathcal{X} and of radius RR. Furthermore we assume that ff is such that for any xXx\in\mathcal{X} and any gf(x)g\in\partial f(x) (we assume f(x)\partial f(x)\neq\emptyset), one has gL\|g\|\leq L. Note that by the subgradient inequality and Cauchy-Schwarz this implies that ff is LL-Lipschitz on X\mathcal{X}, that is f(x)f(y)Lxy|f(x)-f(y)|\leq L\|x-y\|.

In this context we make two modifications to the basic gradient descent (3.1). First, obviously, we replace the gradient f(x)\nabla f(x) (which may not exist) by a subgradient gf(x)g\in\partial f(x). Secondly, and more importantly, we make sure that the updated point lies in X\mathcal{X} by projecting back (if necessary) onto it. This gives the projected subgradient descent algorithmIn the optimization literature the term “descent” is reserved for methods such that f(xt+1)f(xt)f(x_{t+1})\leq f(x_{t}). In that sense the projected subgradient descent is not a descent method. which iterates the following equations for t1t\geq 1:

This procedure is illustrated in Figure 3.2. We prove now a rate of convergence for this method under the above assumptions.

The projected subgradient descent method with η=RLt\eta=\frac{R}{L\sqrt{t}} satisfies

Using the definition of subgradients, the definition of the method, and the elementary identity 2ab=a2+b2ab22a^{\top}b=\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}, one obtains

Now note that gsL\|g_{s}\|\leq L, and furthermore by Lemma 3.0.1

Summing the resulting inequality over ss, and using that x1xR\|x_{1}-x^{*}\|\leq R yield

Plugging in the value of η\eta directly gives the statement (recall that by convexity f((1/t)s=1txs)1ts=1tf(xs)f((1/t)\sum_{s=1}^{t}x_{s})\leq\frac{1}{t}\sum_{s=1}^{t}f(x_{s})).

We will show in Section 3.5 that the rate given in Theorem 3.1.1 is unimprovable from a black-box perspective. Thus to reach an ε\varepsilon-optimal point one needs Θ(1/ε2)\Theta(1/\varepsilon^{2}) calls to the oracle. In some sense this is an astonishing result as this complexity is independentObserve however that the quantities RR and LL may dependent on the dimension, see Chapter 4 for more on this. of the ambient dimension nn. On the other hand this is also quite disappointing compared to the scaling in log(1/ε)\log(1/\varepsilon) of the center of gravity and ellipsoid method of Chapter 2. To put it differently with gradient descent one could hope to reach a reasonable accuracy in very high dimension, while with the ellipsoid method one can reach very high accuracy in reasonably small dimension. A major task in the following sections will be to explore more restrictive assumptions on the function to be optimized in order to have the best of both worlds, that is an oracle complexity independent of the dimension and with a scaling in log(1/ε)\log(1/\varepsilon).

Finally we observe that the step-size recommended by Theorem 3.1.1 depends on the number of iterations to be performed. In practice this may be an undesirable feature. However using a time-varying step size of the form ηs=RLs\eta_{s}=\frac{R}{L\sqrt{s}} one can prove the same rate up to a logt\log t factor. In any case these step sizes are very small, which is the reason for the slow convergence. In the next section we will see that by assuming smoothness in the function ff one can afford to be much more aggressive. Indeed in this case, as one approaches the optimum the size of the gradients themselves will go to , resulting in a sort of “auto-tuning" of the step sizes which does not happen for an arbitrary convex function.

2 Gradient descent for smooth functions

We say that a continuously differentiable function ff is β\beta-smooth if the gradient f\nabla f is β\beta-Lipschitz, that is

Before embarking on the proof we state a few properties of smooth convex functions.

We represent f(x)f(y)f(x)-f(y) as an integral, apply Cauchy-Schwarz and then β\beta-smoothness:

This gives in particular the following important inequality to evaluate the improvement in one step of gradient descent:

The next lemma, which improves the basic inequality for subgradients under the smoothness assumption, shows that in fact ff is convex and β\beta-smooth if and only if (3.4) holds true. In the literature (3.4) is often used as a definition of smooth convex functions.

Let z=y1β(f(y)f(x))z=y-\frac{1}{\beta}(\nabla f(y)-\nabla f(x)). Then one has

Using (3.5) and the definition of the method one has

In particular, denoting δs=f(xs)f(x)\delta_{s}=f(x_{s})-f(x^{*}), this shows:

We will prove that xsx\|x_{s}-x^{*}\| is decreasing with ss, which with the two above displays will imply

Let us see how to use this last inequality to conclude the proof. Let ω=12βx1x2\omega=\frac{1}{2\beta\|x_{1}-x^{*}\|^{2}}, thenThe last step in the sequence of implications can be improved by taking δ1\delta_{1} into account. Indeed one can easily show with (3.4) that δ114ω\delta_{1}\leq\frac{1}{4\omega}. This improves the rate of Theorem 3.2.1 from 2βx1x2t1\frac{2\beta\|x_{1}-x^{*}\|^{2}}{t-1} to 2βx1x2t+3\frac{2\beta\|x_{1}-x^{*}\|^{2}}{t+3}.

Thus it only remains to show that xsx\|x_{s}-x^{*}\| is decreasing with ss. Using Lemma 3.2.4 one immediately gets

We use this as follows (together with f(x)=0\nabla f(x^{*})=0)

We now come back to the constrained problem

Similarly to what we did in Section 3.1 we consider the projected gradient descent algorithm, which iterates xt+1=ΠX(xtηf(xt))x_{t+1}=\Pi_{\mathcal{X}}(x_{t}-\eta\nabla f(x_{t})).

The key point in the analysis of gradient descent for unconstrained smooth optimization is that a step of gradient descent started at xx will decrease the function value by at least 12βf(x)2\frac{1}{2\beta}\|\nabla f(x)\|^{2}, see (3.5). In the constrained case we cannot expect that this would still hold true as a step may be cut short by the projection. The next lemma defines the “right" quantity to measure progress in the constrained case.

Let x,yXx,y\in\mathcal{X}, x+=ΠX(x1βf(x))x^{+}=\Pi_{\mathcal{X}}\left(x-\frac{1}{\beta}\nabla f(x)\right), and gX(x)=β(xx+)g_{\mathcal{X}}(x)=\beta(x-x^{+}). Then the following holds true:

Indeed the above inequality is equivalent to

which follows from Lemma 3.0.1. Now we use (3.7) as follows to prove the lemma (we also use (3.4) which still holds true in the constrained case)

Let ff be convex and β\beta-smooth on X\mathcal{X}. Then projected gradient descent with η=1β\eta=\frac{1}{\beta} satisfies

We will prove that xsx\|x_{s}-x^{*}\| is decreasing with ss, which with the two above displays will imply

Thus it only remains to show that xsx\|x_{s}-x^{*}\| is decreasing with ss. Using Lemma 3.2.7 one can see that gX(xs)(xsx)12βgX(xs)2g_{\mathcal{X}}(x_{s})^{\top}(x_{s}-x^{*})\geq\frac{1}{2\beta}\|g_{\mathcal{X}}(x_{s})\|^{2} which implies

3 Conditional gradient descent, aka Frank-Wolfe

We describe now an alternative algorithm to minimize a smooth convex function ff over a compact convex set X\mathcal{X}. The conditional gradient descent, introduced in Frank and Wolfe (1956), performs the following update for t1t\geq 1, where (γs)s1(\gamma_{s})_{s\geq 1} is a fixed sequence,

In words conditional gradient descent makes a step in the steepest descent direction given the constraint set X\mathcal{X}, see Figure 3.3 for an illustration. From a computational perspective, a key property of this scheme is that it replaces the projection step of projected gradient descent by a linear optimization over X\mathcal{X}, which in some cases can be a much simpler problem.

Let ff be a convex and β\beta-smooth function w.r.t. some norm \|\cdot\|, R=supx,yXxyR=\sup_{x,y\in\mathcal{X}}\|x-y\|, and γs=2s+1\gamma_{s}=\frac{2}{s+1} for s1s\geq 1. Then for any t2t\geq 2, one has

The following inequalities hold true, using respectively β\beta-smoothness (it can easily be seen that (3.4) holds true for smoothness in an arbitrary norm), the definition of xs+1x_{s+1}, the definition of ysy_{s}, and the convexity of ff:

Rewriting this inequality in terms of δs=f(xs)f(x)\delta_{s}=f(x_{s})-f(x^{*}) one obtains

A simple induction using that γs=2s+1\gamma_{s}=\frac{2}{s+1} finishes the proof (note that the initialization is done at step 22 with the above inequality yielding δ2β2R2\delta_{2}\leq\frac{\beta}{2}R^{2}).

Next we describe an application where the three properties of conditional gradient descent (projection-free, norm-free, and sparse iterates) are critical to develop a computationally efficient procedure.

This assumption is met for many combinatorial dictionaries. For instance the dictionary elements could be vector of incidence of spanning trees in some fixed graph, in which case the linear optimization problem can be solved with a greedy algorithm.

First let us study the computational complexity of the ttht^{th} step of conditional gradient descent. Observe that

To derive a rate of convergence it remains to study the smoothness of ff. This can be done as follows:

Putting together (3.11) and (3.12) we proved that one can get an ε\varepsilon-optimal solution to (3.10) with a computational effort of O(m2p(n)/ε+m4/ε2)O(m^{2}p(n)/\varepsilon+m^{4}/\varepsilon^{2}) using the conditional gradient descent.

4 Strong convexity

Of course this definition does not require differentiability of the function ff, and one can replace f(x)\nabla f(x) in the inequality above by gf(x)g\in\partial f(x). It is immediate to verify that a function ff is α\alpha-strongly convex if and only if xf(x)α2x2x\mapsto f(x)-\frac{\alpha}{2}\|x\|^{2} is convex (in particular if ff is twice differentiable then the eigenvalues of the Hessians of ff have to be larger than α\alpha). The strong convexity parameter α\alpha is a measure of the curvature of ff. For instance a linear function has no curvature and hence α=0\alpha=0. On the other hand one can clearly see why a large value of α\alpha would lead to a faster rate: in this case a point far from the optimum will have a large gradient, and thus gradient descent will make very big steps when far from the optimum. Of course if the function is non-smooth one still has to be careful and tune the step-sizes to be relatively small, but nonetheless we will be able to improve the oracle complexity from O(1/ε2)O(1/\varepsilon^{2}) to O(1/(αε))O(1/(\alpha\varepsilon)). On the other hand with the additional assumption of β\beta-smoothness we will prove that gradient descent with a constant step-size achieves a linear rate of convergence, precisely the oracle complexity will be O(βαlog(1/ε))O(\frac{\beta}{\alpha}\log(1/\varepsilon)). This achieves the objective we had set after Theorem 3.1.1: strongly-convex and smooth functions can be optimized in very large dimension and up to very high accuracy.

Before going into the proofs let us discuss another interpretation of strong-convexity and its relation to smoothness. Equation (3.13) can be read as follows: at any point xx one can find a (convex) quadratic lower bound qx(y)=f(x)+f(x)(yx)+α2xy2q_{x}^{-}(y)=f(x)+\nabla f(x)^{\top}(y-x)+\frac{\alpha}{2}\|x-y\|^{2} to the function ff, i.e. qx(y)f(y),yXq_{x}^{-}(y)\leq f(y),\forall y\in\mathcal{X} (and qx(x)=f(x)q_{x}^{-}(x)=f(x)). On the other hand for β\beta-smoothness (3.4) implies that at any point yy one can find a (convex) quadratic upper bound qy+(x)=f(y)+f(y)(xy)+β2xy2q_{y}^{+}(x)=f(y)+\nabla f(y)^{\top}(x-y)+\frac{\beta}{2}\|x-y\|^{2} to the function ff, i.e. qy+(x)f(x),xXq_{y}^{+}(x)\geq f(x),\forall x\in\mathcal{X} (and qy+(y)=f(y)q_{y}^{+}(y)=f(y)). Thus in some sense strong convexity is a dual assumption to smoothness, and in fact this can be made precise within the framework of Fenchel duality. Also remark that clearly one always has βα\beta\geq\alpha.

We consider here the projected subgradient descent algorithm with time-varying step size (ηt)t1(\eta_{t})_{t\geq 1}, that is

The following result is extracted from Lacoste-Julien et al. (2012).

Let ff be α\alpha-strongly convex and LL-Lipschitz on X\mathcal{X}. Then projected subgradient descent with ηs=2α(s+1)\eta_{s}=\frac{2}{\alpha(s+1)} satisfies

Coming back to our original analysis of projected subgradient descent in Section 3.1 and using the strong convexity assumption one immediately obtains

Multiplying this inequality by ss yields

Now sum the resulting inequality over s=1s=1 to s=ts=t, and apply Jensen’s inequality to obtain the claimed statement.

4.2 Strongly convex and smooth functions

As we will see now, having both strong convexity and smoothness allows for a drastic improvement in the convergence rate. We denote κ=βα\kappa=\frac{\beta}{\alpha} for the condition number of ff. The key observation is that Lemma 3.2.7 can be improved to (with the notation of the lemma):

Let ff be α\alpha-strongly convex and β\beta-smooth on X\mathcal{X}. Then projected gradient descent with η=1β\eta=\frac{1}{\beta} satisfies for t0t\geq 0,

Using (3.14) with y=xy=x^{*} one directly obtains

We now show that in the unconstrained case one can improve the rate by a constant factor, precisely one can replace κ\kappa by (κ+1)/4(\kappa+1)/4 in the oracle complexity bound by using a larger step size. This is not a spectacular gain but the reasoning is based on an improvement of (3.6) which can be of interest by itself. Note that (3.6) and the lemma to follow are sometimes referred to as coercivity of the gradient.

Let φ(x)=f(x)α2x2\varphi(x)=f(x)-\frac{\alpha}{2}\|x\|^{2}. By definition of α\alpha-strong convexity one has that φ\varphi is convex. Furthermore one can show that φ\varphi is (βα)(\beta-\alpha)-smooth by proving (3.4) (and using that it implies smoothness). Thus using (3.6) one gets

which gives the claimed result with straightforward computations. (Note that if α=β\alpha=\beta the smoothness of φ\varphi directly implies that f(x)f(y)=α(xy)\nabla f(x)-\nabla f(y)=\alpha(x-y) which proves the lemma in this case.)

First note that by β\beta-smoothness (since f(x)=0\nabla f(x^{*})=0) one has

5 Lower bounds

We prove here various oracle complexity lower bounds. These results first appeared in Nemirovski and Yudin (1983) but we follow here the simplified presentation of Nesterov (2004a). In general a black-box procedure is a mapping from “history" to the next query point, that is it maps (x1,g1,,xt,gt)(x_{1},g_{1},\ldots,x_{t},g_{t}) (with gsf(xs)g_{s}\in\partial f(x_{s})) to xt+1x_{t+1}. In order to simplify the notation and the argument, throughout the section we make the following assumption on the black-box procedure: x1=0x_{1}=0 and for any t0t\geq 0, xt+1x_{t+1} is in the linear span of g1,,gtg_{1},\ldots,g_{t}, that is

Let tnt\leq n, L,R>0L,R>0. There exists a convex and LL-Lipschitz function ff such that for any black-box procedure satisfying (3.15),

There also exists an α\alpha-strongly convex and LL-lipschitz function ff such that for any black-box procedure satisfying (3.15),

Note that the above result is restricted to a number of iterations smaller than the dimension, that is tnt\leq n. This restriction is of course necessary to obtain lower bounds polynomial in 1/t1/t: as we saw in Chapter 2 one can always obtain an exponential rate of convergence when the number of calls to the oracle is larger than the dimension.

We consider the following α\alpha-strongly convex function:

Next we describe the first order oracle for this function: when asked for a subgradient at xx, it returns αx+γei\alpha x+\gamma e_{i} where ii is the first coordinate that satisfies x(i)=max1jtx(j)x(i)=\max_{1\leq j\leq t}x(j). In particular when asked for a subgradient at x1=0x_{1}=0 it returns e1e_{1}. Thus x2x_{2} must lie on the line generated by e1e_{1}. It is easy to see by induction that in fact xsx_{s} must lie in the linear span of e1,,es1e_{1},\ldots,e_{s-1}. In particular for sts\leq t we necessarily have xs(t)=0x_{s}(t)=0 and thus f(xs)0f(x_{s})\geq 0.

It remains to compute the minimal value of ff. Let yy be such that y(i)=γαty(i)=-\frac{\gamma}{\alpha t} for 1it1\leq i\leq t and y(i)=0y(i)=0 for t+1int+1\leq i\leq n. It is clear that 0f(y)0\in\partial f(y) and thus the minimal value of ff is

Wrapping up, we proved that for any sts\leq t one must have

Taking γ=L/2\gamma=L/2 and R=L2αR=\frac{L}{2\alpha} we proved the lower bound for α\alpha-strongly convex functions (note in particular that y2=γ2α2t=L24α2tR2\|y\|^{2}=\frac{\gamma^{2}}{\alpha^{2}t}=\frac{L^{2}}{4\alpha^{2}t}\leq R^{2} with these parameters). On the other taking α=LR11+t\alpha=\frac{L}{R}\frac{1}{1+\sqrt{t}} and γ=Lt1+t\gamma=L\frac{\sqrt{t}}{1+\sqrt{t}} concludes the proof for convex functions (note in particular that y2=γ2α2t=R2\|y\|^{2}=\frac{\gamma^{2}}{\alpha^{2}t}=R^{2} with these parameters).

We proceed now to the smooth case. As we will see in the following proofs we restrict our attention to quadratic functions, and it might be useful to recall that in this case one can attain the exact optimum in nn calls to the oracle (see Section 2.4). We also recall that for a twice differentiable function ff, β\beta-smoothness is equivalent to the largest eigenvalue of the Hessians of ff being smaller than β\beta at any point, which we write

Furthermore α\alpha-strong convexity is equivalent to

Let t(n1)/2t\leq(n-1)/2, β>0\beta>0. There exists a β\beta-smooth convex function ff such that for any black-box procedure satisfying (3.15),

We consider now the following β\beta-smooth convex function:

Similarly to what happened in the proof Theorem 3.5.1, one can see here too that xsx_{s} must lie in the linear span of e1,,es1e_{1},\ldots,e_{s-1} (because of our assumption on the black-box procedure). In particular for sts\leq t we necessarily have xs(i)=0x_{s}(i)=0 for i=s,,ni=s,\ldots,n, which implies xsA2t+1xs=xsAsxsx_{s}^{\top}A_{2t+1}x_{s}=x_{s}^{\top}A_{s}x_{s}. In other words, if we denote

Thus it simply remains to compute the minimizer xkx^{*}_{k} of fkf_{k}, its norm, and the corresponding function value fkf_{k}^{*}.

The point xkx^{*}_{k} is the unique solution in the span of e1,,eke_{1},\ldots,e_{k} of Akx=e1A_{k}x=e_{1}. It is easy to verify that it is defined by xk(i)=1ik+1x^{*}_{k}(i)=1-\frac{i}{k+1} for i=1,,ki=1,\ldots,k. Thus we immediately have:

Note that for large values of the condition number κ\kappa one has

Furthermore since ff is α\alpha-strongly convex, one has

Thus it only remains to compute xx^{*}. This can be done by differentiating ff and setting the gradient to , which gives the following infinite set of equations

It is easy to verify that xx^{*} defined by x(i)=(κ1κ+1)ix^{*}(i)=\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{i} satisfy this infinite set of equations, and the conclusion of the theorem then follows by straightforward computations.

6 Geometric descent

So far our results leave a gap in the case of smooth optimization: gradient descent achieves an oracle complexity of O(1/ε)O(1/\varepsilon) (respectively O(κlog(1/ε))O(\kappa\log(1/\varepsilon)) in the strongly convex case) while we proved a lower bound of Ω(1/ε)\Omega(1/\sqrt{\varepsilon}) (respectively Ω(κlog(1/ε))\Omega(\sqrt{\kappa}\log(1/\varepsilon))). In this section we close these gaps with the geometric descent method which was recently introduced in Bubeck et al. (2015b). Historically the first method with optimal oracle complexity was proposed in Nemirovski and Yudin (1983). This method, inspired by the conjugate gradient (see Section 2.4), assumes an oracle to compute plane searches. In Nemirovski (1982) this assumption was relaxed to a line search oracle (the geometric descent method also requires a line search oracle). Finally in Nesterov (1983) an optimal method requiring only a first order oracle was introduced. The latter algorithm, called Nesterov’s accelerated gradient descent, has been the most influential optimal method for smooth optimization up to this day. We describe and analyze this method in Section 3.7. As we shall see the intuition behind Nesterov’s accelerated gradient descent (both for the derivation of the algorithm and its analysis) is not quite transparent, which motivates the present section as geometric descent has a simple geometric interpretation loosely inspired from the ellipsoid method (see Section 2.2).

We focus here on the unconstrained optimization of a smooth and strongly convex function, and we prove that geometric descent achieves the oracle complexity of O(κlog(1/ε))O(\sqrt{\kappa}\log(1/\varepsilon)), thus reducing the complexity of the basic gradient descent by a factor κ\sqrt{\kappa}. We note that this improvement is quite relevant for machine learning applications. Consider for example the logistic regression problem described in Section 1.1: this is a smooth and strongly convex problem, with a smoothness of order of a numerical constant, but with strong convexity equal to the regularization parameter whose inverse can be as large as the sample size. Thus in this case κ\kappa can be of order of the sample size, and a faster rate by a factor of κ\sqrt{\kappa} is quite significant. We also observe that this improved rate for smooth and strongly convex objectives also implies an almost optimal rate of O(log(1/ε)/ε)O(\log(1/\varepsilon)/\sqrt{\varepsilon}) for the smooth case, as one can simply run geometric descent on the function xf(x)+εx2x\mapsto f(x)+\varepsilon\|x\|^{2}.

In Section 3.6.1 we describe the basic idea of geometric descent, and we show how to obtain effortlessly a geometric method with an oracle complexity of O(κlog(1/ε))O(\kappa\log(1/\varepsilon)) (i.e., similar to gradient descent). Then we explain why one should expect to be able to accelerate this method in Section 3.6.2. The geometric descent method is described precisely and analyzed in Section 3.6.3.

Rewriting the definition of strong convexity (3.13) as

one obtains an enclosing ball for the minimizer of ff with the 0th0^{th} and 1st1^{st} order information at xx:

Furthermore recall that by smoothness (see (3.5)) one has f(x+)f(x)12βf(x)2f(x^{+})\leq f(x)-\frac{1}{2\beta}\|\nabla f(x)\|^{2} which allows to shrink the above ball by a factor of 11κ1-\frac{1}{\kappa} and obtain the following:

Thus we see that in the strategy described above, the radius squared of the enclosing ball for xx^{*} shrinks by a factor 11κ1-\frac{1}{\kappa} at each iteration, thus matching the rate of convergence of gradient descent (see Theorem 3.4.3).

6.2 Acceleration

Thus it only remains to deal with the caveat noted above, which we do via a line search. In turns this line search might shift the new ball (3.16), and to deal with this we shall need the following strengthening of the above set inclusion (we refer to Bubeck et al. (2015b) for a simple proof of this result):

6.3 The geometric descent method

and ct+1c_{t+1} (respectively Rt+12R^{2}_{t+1}) be the center (respectively the squared radius) of the ball given by (the proof of) Lemma 3.6.1 which contains

Formulas for ct+1c_{t+1} and Rt+12R^{2}_{t+1} are given at the end of this section.

We will prove a stronger claim by induction that for each t0t\geq 0, one has

The case t=0t=0 follows immediately by (3.16). Let us assume that the above display is true for some t0t\geq 0. Then using f(xt+1+)f(xt+1)12βf(xt+1)2f(xt+)12βf(xt+1)2,f(x_{t+1}^{+})\leq f(x_{t+1})-\frac{1}{2\beta}\|\nabla f(x_{t+1})\|^{2}\leq f(x_{t}^{+})-\frac{1}{2\beta}\|\nabla f(x_{t+1})\|^{2}, one gets

Thus it only remains to observe that the squared radius of the ball given by Lemma 3.6.1 which encloses the intersection of the two above balls is smaller than (11κ)Rt22α(f(xt+1+)f(x))\left(1-\frac{1}{\sqrt{\kappa}}\right)R_{t}^{2}-\frac{2}{\alpha}(f(x_{t+1}^{+})-f(x^{*})). We apply Lemma 3.6.1 after moving ctc_{t} to the origin and scaling distances by RtR_{t}. We set ε=1κ\varepsilon=\frac{1}{\kappa}, g=f(xt+1)αg=\frac{\|\nabla f(x_{t+1})\|}{\alpha}, δ=2α(f(xt+1+)f(x))\delta=\frac{2}{\alpha}\left(f(x_{t+1}^{+})-f(x^{*})\right) and a=xt+1++cta={x_{t+1}^{++}-c_{t}}. The line search step of the algorithm implies that f(xt+1)(xt+1ct)=0\nabla f(x_{t+1})^{\top}(x_{t+1}-c_{t})=0 and therefore, a=xt+1++ctf(xt+1)/α=g\|a\|=\|x_{t+1}^{++}-c_{t}\|\geq\|\nabla f(x_{t+1})\|/\alpha=g and Lemma 3.6.1 applies to give the result.

One can use the following formulas for ct+1c_{t+1} and Rt+12R^{2}_{t+1} (they are derived from the proof of Lemma 3.6.1). If f(xt+1)2/α2<Rt2/2|\nabla f(x_{t+1})|^{2}/\alpha^{2}<R_{t}^{2}/2 then one can tate ct+1=xt+1++c_{t+1}=x_{t+1}^{++} and Rt+12=f(xt+1)2α2(11κ)R_{t+1}^{2}=\frac{|\nabla f(x_{t+1})|^{2}}{\alpha^{2}}\left(1-\frac{1}{\kappa}\right). On the other hand if f(xt+1)2/α2Rt2/2|\nabla f(x_{t+1})|^{2}/\alpha^{2}\geq R_{t}^{2}/2 then one can tate

7 Nesterov’s accelerated gradient descent

We describe here the original Nesterov’s method which attains the optimal oracle complexity for smooth convex optimization. We give the details of the method both for the strongly convex and non-strongly convex case. We refer to Su et al. (2014) for a recent interpretation of the method in terms of differential equations, and to Allen-Zhu and Orecchia (2014) for its relation to mirror descent (see Chapter 4).

Nesterov’s accelerated gradient descent, illustrated in Figure 3.6, can be described as follows: Start at an arbitrary initial point x1=y1x_{1}=y_{1} and then iterate the following equations for t1t\geq 1,

Let ff be α\alpha-strongly convex and β\beta-smooth, then Nesterov’s accelerated gradient descent satisfies

We define α\alpha-strongly convex quadratic functions Φs,s1\Phi_{s},s\geq 1 by induction as follows:

Intuitively Φs\Phi_{s} becomes a finer and finer approximation (from below) to ff in the following sense:

The above inequality can be proved immediately by induction, using the fact that by α\alpha-strong convexity one has

Equation (3.18) by itself does not say much, for it to be useful one needs to understand how “far" below ff is Φs\Phi_{s}. The following inequality answers this question:

The rest of the proof is devoted to showing that (3.19) holds true, but first let us see how to combine (3.18) and (3.19) to obtain the rate given by the theorem (we use that by β\beta-smoothness one has f(x)f(x)β2xx2f(x)-f(x^{*})\leq\frac{\beta}{2}\|x-x^{*}\|^{2}):

In particular Φs+1\Phi_{s+1} is by definition minimized at vs+1v_{s+1} which can now be defined by induction using the above identity, precisely:

Using the form of Φs\Phi_{s} and Φs+1\Phi_{s+1}, as well as the original definition (3.17) one gets the following identity by evaluating Φs+1\Phi_{s+1} at xsx_{s}:

Finally we show by induction that vsxs=κ(xsys)v_{s}-x_{s}=\sqrt{\kappa}(x_{s}-y_{s}), which concludes the proof of (3.20) and thus also concludes the proof of the theorem:

where the first equality comes from (3.21), the second from the induction hypothesis, the third from the definition of ys+1y_{s+1} and the last one from the definition of xs+1x_{s+1}.

7.2 The smooth case

In this section we show how to adapt Nesterov’s accelerated gradient descent for the case α=0\alpha=0, using a time-varying combination of the elements in the primary sequence (yt)(y_{t}). First we define the following sequences:

(Note that γt0\gamma_{t}\leq 0.) Now the algorithm is simply defined by the following equations, with x1=y1x_{1}=y_{1} an arbitrary initial point,

Let ff be a convex and β\beta-smooth function, then Nesterov’s accelerated gradient descent satisfies

We follow here the proof of Beck and Teboulle (2009). We also refer to Tseng (2008) for a proof with simpler step-sizes.

Using the unconstrained version of Lemma 3.2.7 one obtains

Now multiplying (3.23) by (λs1)(\lambda_{s}-1) and adding the result to (3.24), one obtains with δs=f(ys)f(x)\delta_{s}=f(y_{s})-f(x^{*}),

Multiplying this inequality by λs\lambda_{s} and using that by definition λs12=λs2λs\lambda_{s-1}^{2}=\lambda_{s}^{2}-\lambda_{s}, as well as the elementary identity 2aba2=b2ba22a^{\top}b-\|a\|^{2}=\|b\|^{2}-\|b-a\|^{2}, one obtains

Putting together (3.25) and (3.26) one gets with us=λsxs(λs1)ysxu_{s}=\lambda_{s}x_{s}-(\lambda_{s}-1)y_{s}-x^{*},

Summing these inequalities from s=1s=1 to s=t1s=t-1 one obtains:

By induction it is easy to see that λt1t2\lambda_{t-1}\geq\frac{t}{2} which concludes the proof.

Chapter 4 Almost dimension-free convex optimization in non-Euclidean spaces

We also define the Bregman divergence associated to ff as

The following identity will be useful several times:

Φ\Phi is strictly convex and differentiable.

The gradient of Φ\Phi diverges on the boundary of D\mathcal{D}, that is

Property (i) and (iii) ensures the existence and uniqueness of this projection (in particular since xDΦ(x,y)x\mapsto D_{\Phi}(x,y) is locally increasing on the boundary of D\mathcal{D}). The following lemma shows that the Bregman divergence essentially behaves as the Euclidean norm squared in terms of projections (recall Lemma 3.0.1).

Let xXDx\in\mathcal{X}\cap\mathcal{D} and yDy\in\mathcal{D}, then

The proof is an immediate corollary of Proposition 1.3.3 together with the fact that xDΦ(x,y)=Φ(x)Φ(y)\nabla_{x}D_{\Phi}(x,y)=\nabla\Phi(x)-\nabla\Phi(y).

2 Mirror descent

See Figure 4.1 for an illustration of this procedure.

Let Φ\Phi be a mirror map ρ\rho-strongly convex on XD\mathcal{X}\cap\mathcal{D} w.r.t. \|\cdot\|. Let R2=supxXDΦ(x)Φ(x1)R^{2}=\sup_{x\in\mathcal{X}\cap\mathcal{D}}\Phi(x)-\Phi(x_{1}), and ff be convex and LL-Lipschitz w.r.t. \|\cdot\|. Then mirror descent with η=RL2ρt\eta=\frac{R}{L}\sqrt{\frac{2\rho}{t}} satisfies

Let xXDx\in\mathcal{X}\cap\mathcal{D}. The claimed bound will be obtained by taking a limit xxx\rightarrow x^{*}. Now by convexity of ff, the definition of mirror descent, equation (4.1), and Lemma 4.1.1, one has

which concludes the proof up to trivial computation.

We observe that one can rewrite mirror descent as follows:

This last expression is often taken as the definition of mirror descent (see Beck and Teboulle (2003)). It gives a proximal point of view on mirror descent: the method is trying to minimize the local linearization of the function while not moving too far away from the previous point, with distances measured via the Bregman divergence of the mirror map.

3 Standard setups for mirror descent

“Simplex setup". A more interesting choice of a mirror map is given by the negative entropy

“Spectrahedron setup". We consider here functions defined on matrices, and we are interested in minimizing a function ff on the spectrahedron Sn\mathcal{S}_{n} defined as:

where λ1(X),,λn(X)\lambda_{1}(X),\ldots,\lambda_{n}(X) are the eigenvalues of XX. It can be shown that the gradient update Φ(Yt+1)=Φ(Xt)ηf(Xt)\nabla\Phi(Y_{t+1})=\nabla\Phi(X_{t})-\eta\nabla f(X_{t}) can be written equivalently as

where the matrix exponential and matrix logarithm are defined as usual. Furthermore the projection on Sn\mathcal{S}_{n} is a simple trace renormalization.

With highly non-trivial computation one can show that Φ\Phi is 12\frac{1}{2}-strongly convex with respect to the Schatten 11-norm defined as

4 Lazy mirror descent, aka Nesterov’s dual averaging

In this section we consider a slightly more efficient version of mirror descent for which we can prove that Theorem 4.2.1 still holds true. This alternative algorithm can be advantageous in some situations (such as distributed settings), but the basic mirror descent scheme remains important for extensions considered later in this text (saddle points, stochastic oracles, …).

In lazy mirror descent, also commonly known as Nesterov’s dual averaging or simply dual averaging, one replaces (4.2) by

and also y1y_{1} is such that Φ(y1)=0\nabla\Phi(y_{1})=0. In other words instead of going back and forth between the primal and the dual, dual averaging simply averages the gradients in the dual, and if asked for a point in the primal it simply maps the current dual point to the primal using the same methodology as mirror descent. In particular using (4.2) one immediately sees that dual averaging is defined by:

Let Φ\Phi be a mirror map ρ\rho-strongly convex on XD\mathcal{X}\cap\mathcal{D} w.r.t. \|\cdot\|. Let R2=supxXDΦ(x)Φ(x1)R^{2}=\sup_{x\in\mathcal{X}\cap\mathcal{D}}\Phi(x)-\Phi(x_{1}), and ff be convex and LL-Lipschitz w.r.t. \|\cdot\|. Then dual averaging with η=RLρ2t\eta=\frac{R}{L}\sqrt{\frac{\rho}{2t}} satisfies

where the second inequality comes from the first order optimality condition for xt+1x_{t+1} (see Proposition 1.3.3). Next observe that

Putting together the two above displays and using Cauchy-Schwarz (with the assumption gtL\|g_{t}\|_{*}\leq L) one obtains

In particular this shows that xt+1xt2ηLρ\|x_{t+1}-x_{t}\|\leq\frac{2\eta L}{\rho} and thus with the above display

Now we claim that for any xXDx\in\mathcal{X}\cap\mathcal{D},

which would clearly conclude the proof thanks to (4.7) and straightforward computations. Equation (4.8) is equivalent to

5 Mirror prox

It can be shown that mirror descent accelerates for smooth functions to the rate 1/t1/t. We will prove this result in Chapter 6 (see Theorem 6.2.1). We describe here a variant of mirror descent which also attains the rate 1/t1/t for smooth functions. This method is called mirror prox and it was introduced in Nemirovski (2004a). The true power of mirror prox will reveal itself later in the text when we deal with smooth representations of non-smooth functions as well as stochastic oraclesBasically mirror prox allows for a smooth vector field point of view (see Section 4.6), while mirror descent does not..

Mirror prox is described by the following equations:

In words the algorithm first makes a step of mirror descent to go from xtx_{t} to yt+1y_{t+1}, and then it makes a similar step to obtain xt+1x_{t+1}, starting again from xtx_{t} but this time using the gradient of ff evaluated at yt+1y_{t+1} (instead of xtx_{t}), see Figure 4.2 for an illustration. The following result justifies the procedure.

Let Φ\Phi be a mirror map ρ\rho-strongly convex on XD\mathcal{X}\cap\mathcal{D} w.r.t. \|\cdot\|. Let R2=supxXDΦ(x)Φ(x1)R^{2}=\sup_{x\in\mathcal{X}\cap\mathcal{D}}\Phi(x)-\Phi(x_{1}), and ff be convex and β\beta-smooth w.r.t. \|\cdot\|. Then mirror prox with η=ρβ\eta=\frac{\rho}{\beta} satisfies

Let xXDx\in\mathcal{X}\cap\mathcal{D}. We write

We will now bound separately these three terms. For the first one, using the definition of the method, Lemma 4.1.1, and equation (4.1), one gets

For the second term using the same properties than above and the strong-convexity of the mirror map one obtains

Finally for the last term, using Cauchy-Schwarz, β\beta-smoothness, and 2aba2+b22ab\leq a^{2}+b^{2} one gets

Thus summing up these three terms and using that η=ρβ\eta=\frac{\rho}{\beta} one gets

The proof is concluded with straightforward computations.

6 The vector field point of view on MD, DA, and MP

In this section we consider a mirror map Φ\Phi that satisfies the assumptions from Theorem 4.2.1.

The observation that the sequence of vectors (gs)(g_{s}) does not have to come from the subgradients of a fixed function ff is the starting point for the theory of online learning, see Bubeck (2011) for more details. In this monograph we will use this observation to generalize mirror descent to saddle point calculations as well as stochastic settings. We note that we could also use dual averaging (defined by (4.6)) which satisfies

Under the assumption that the vector field is β\beta-Lipschitz w.r.t. \|\cdot\|, i.e., g(x)g(y)βxy\|g(x)-g(y)\|_{*}\leq\beta\|x-y\| one obtains with η=ρβ\eta=\frac{\rho}{\beta}

Chapter 5 Beyond the black-box model

We consider here the following problemWe restrict to unconstrained minimization for sake of simplicity. One can extend the discussion to constrained minimization by using ideas from Section 3.2.:

where ff is convex and β\beta-smooth, and gg is convex. We assume that ff can be accessed through a first order oracle, and that gg is known and “simple". What we mean by simplicity will be clear from the description of the algorithm. For instance a separable function, that is g(x)=i=1ngi(x(i))g(x)=\sum_{i=1}^{n}g_{i}(x(i)), will be considered as simple. The prime example being g(x)=x1g(x)=\|x\|_{1}. This section is inspired from Beck and Teboulle (2009) (see also Nesterov (2007); Wright et al. (2009)).

ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall that gradient descent on the smooth function ff can be written as (see (4.5))

Here one wants to minimize f+gf+g, and gg is assumed to be known and “simple". Thus it seems quite natural to consider the following update rule, where only ff is locally approximated with a first order oracle:

The algorithm described by the above iteration is known as ISTA (Iterative Shrinkage-Thresholding Algorithm). In terms of convergence rate it is easy to show that ISTA has the same convergence rate on f+gf+g as gradient descent on ff. More precisely with η=1β\eta=\frac{1}{\beta} one has

This improved convergence rate over a subgradient descent directly on f+gf+g comes at a price: in general (5.1) may be a difficult optimization problem by itself, and this is why one needs to assume that gg is simple. For instance if gg can be written as g(x)=i=1ngi(x(i))g(x)=\sum_{i=1}^{n}g_{i}(x(i)) then one can compute xt+1x_{t+1} by solving nn convex problems in dimension 11. In the case where g(x)=λx1g(x)=\lambda\|x\|_{1} this one-dimensional problem is given by:

Elementary computations shows that this problem has an analytical solution given by τλη(x0)\tau_{\lambda\eta}(x_{0}), where τ\tau is the shrinkage operator (hence the name ISTA), defined by

Much more is known about (5.1) (which is called the proximal operator of gg), and in fact entire monographs have been written about this equation, see e.g. Parikh and Boyd (2013); Bach et al. (2012).

FISTA (Fast ISTA)

An obvious idea is to combine Nesterov’s accelerated gradient descent (which results in a 1/t21/t^{2} rate to optimize ff) with ISTA. This results in FISTA (Fast ISTA) which is described as follows. Let

Let x1=y1x_{1}=y_{1} an arbitrary initial point, and

Again it is easy show that the rate of convergence of FISTA on f+gf+g is similar to the one of Nesterov’s accelerated gradient descent on ff, more precisely:

CMD and RDA

ISTA and FISTA assume smoothness in the Euclidean metric. Quite naturally one can also use these ideas in a non-Euclidean setting. Starting from (4.5) one obtains the CMD (Composite Mirror Descent) algorithm of Duchi et al. (2010), while with (4.6) one obtains the RDA (Regularized Dual Averaging) of Xiao (2010). We refer to these papers for more details.

2 Smooth saddle-point representation of a non-smooth function

Quite often the non-smoothness of a function ff comes from a max\max operation. More precisely non-smooth functions can often be represented as

where the functions fif_{i} are smooth. This was the case for instance with the function we used to prove the black-box lower bound 1/t1/\sqrt{t} for non-smooth optimization in Theorem 3.5.1. We will see now that by using this structural representation one can in fact attain a rate of 1/t1/t. This was first observed in Nesterov (2004b) who proposed the Nesterov’s smoothing technique. Here we will present the alternative method of Nemirovski (2004a) which we find more transparent (yet another version is the Chambolle-Pock algorithm, see Chambolle and Pock (2011)). Most of what is described in this section can be found in Juditsky and Nemirovski (2011a, b).

In the next subsection we introduce the more general problem of saddle point computation. We then proceed to apply a modified version of mirror descent to this problem, which will be useful both in Chapter 6 and also as a warm-up for the more powerful modified mirror prox that we introduce next.

By Sion’s minimax theorem there exists a pair (x,y)X×Y(x^{*},y^{*})\in\mathcal{X}\times\mathcal{Y} such that

We will explore algorithms that produce a candidate pair of solutions (x~,y~)X×Y(\widetilde{x},\widetilde{y})\in\mathcal{X}\times\mathcal{Y}. The quality of (x~,y~)(\widetilde{x},\widetilde{y}) is evaluated through the so-called duality gapObserve that the duality gap is the sum of the primal gap maxyYφ(x~,y)φ(x,y)\max_{y\in\mathcal{Y}}\varphi(\widetilde{x},y)-\varphi(x^{*},y^{*}) and the dual gap φ(x,y)minxXφ(x,y~)\varphi(x^{*},y^{*})-\min_{x\in\mathcal{X}}\varphi(x,\widetilde{y}).

The key observation is that the duality gap can be controlled similarly to the suboptimality gap f(x)f(x)f(x)-f(x^{*}) in a simple convex optimization problem. Indeed for any (x,y)X×Y(x,y)\in\mathcal{X}\times\mathcal{Y},

In particular, using the notation z=(x,y)Z:=X×Yz=(x,y)\in\mathcal{Z}:=\mathcal{X}\times\mathcal{Y} and g(z)=(gX(x,y),gY(x,y))g(z)=(g_{\mathcal{X}}(x,y),g_{\mathcal{Y}}(x,y)) we just proved

We will assume in the next subsections that X\mathcal{X} is equipped with a mirror map ΦX\Phi_{\mathcal{X}} (defined on DX\mathcal{D}_{\mathcal{X}}) which is 11-strongly convex w.r.t. a norm X\|\cdot\|_{\mathcal{X}} on XDX\mathcal{X}\cap\mathcal{D}_{\mathcal{X}}. We denote RX2=supxXΦ(x)minxXΦ(x)R^{2}_{\mathcal{X}}=\sup_{x\in\mathcal{X}}\Phi(x)-\min_{x\in\mathcal{X}}\Phi(x). We define similar quantities for the space Y\mathcal{Y}.

2.2 Saddle Point Mirror Descent (SP-MD)

where gt=(gX,t,gY,t)g_{t}=(g_{\mathcal{X},t},g_{\mathcal{Y},t}) with gX,txφ(xt,yt)g_{\mathcal{X},t}\in\partial_{x}\varphi(x_{t},y_{t}) and gY,ty(φ(xt,yt))g_{\mathcal{Y},t}\in\partial_{y}(-\varphi(x_{t},y_{t})).

Assume that φ(,y)\varphi(\cdot,y) is LXL_{\mathcal{X}}-Lipschitz w.r.t. X\|\cdot\|_{\mathcal{X}}, that is gX(x,y)XLX,(x,y)X×Y\|g_{\mathcal{X}}(x,y)\|_{\mathcal{X}}^{*}\leq L_{\mathcal{X}},\forall(x,y)\in\mathcal{X}\times\mathcal{Y}. Similarly assume that φ(x,)\varphi(x,\cdot) is LYL_{\mathcal{Y}}-Lipschitz w.r.t. Y\|\cdot\|_{\mathcal{Y}}. Then SP-MD with a=LXRXa=\frac{L_{\mathcal{X}}}{R_{\mathcal{X}}}, b=LYRYb=\frac{L_{\mathcal{Y}}}{R_{\mathcal{Y}}}, and η=2t\eta=\sqrt{\frac{2}{t}} satisfies

First we endow Z\mathcal{Z} with the norm Z\|\cdot\|_{\mathcal{Z}} defined by

It is immediate that Φ\Phi is 11-strongly convex with respect to Z\|\cdot\|_{\mathcal{Z}} on ZD\mathcal{Z}\cap\mathcal{D}. Furthermore one can easily check that

and thus the vector field (gt)(g_{t}) used in the SP-MD satisfies:

Using (4.10) together with (5.3) and the values of a,ba,b and η\eta concludes the proof.

2.3 Saddle Point Mirror Prox (SP-MP)

We now consider the most interesting situation in the context of this chapter, where the function φ\varphi is smooth. Precisely we say that φ\varphi is (β11,β12,β22,β21)(\beta_{11},\beta_{12},\beta_{22},\beta_{21})-smooth if for any x,xX,y,yYx,x^{\prime}\in\mathcal{X},y,y^{\prime}\in\mathcal{Y},

Assume that φ\varphi is (β11,β12,β22,β21)(\beta_{11},\beta_{12},\beta_{22},\beta_{21})-smooth. Then SP-MP with a=1RX2a=\frac{1}{R_{\mathcal{X}}^{2}}, b=1RY2b=\frac{1}{R_{\mathcal{Y}}^{2}}, and η=1/(2max(β11RX2,β22RY2,β12RXRY,β21RXRY))\eta=1/\left(2\max\left(\beta_{11}R^{2}_{\mathcal{X}},\beta_{22}R^{2}_{\mathcal{Y}},\beta_{12}R_{\mathcal{X}}R_{\mathcal{Y}},\beta_{21}R_{\mathcal{X}}R_{\mathcal{Y}}\right)\right) satisfies

In light of the proof of Theorem 5.2.1 and (4.11) it clearly suffices to show that the vector field g(z)=(xφ(x,y),yφ(x,y))g(z)=(\nabla_{x}\varphi(x,y),-\nabla_{y}\varphi_{(}x,y)) is β\beta-Lipschitz w.r.t. zZ=1RX2xX2+1RY2yY2\|z\|_{\mathcal{Z}}=\sqrt{\frac{1}{R_{\mathcal{X}}^{2}}\|x\|_{\mathcal{X}}^{2}+\frac{1}{R_{\mathcal{Y}}^{2}}\|y\|_{\mathcal{Y}}^{2}} with β=2max(β11RX2,β22RY2,β12RXRY,β21RXRY)\beta=2\max\left(\beta_{11}R^{2}_{\mathcal{X}},\beta_{22}R^{2}_{\mathcal{Y}},\beta_{12}R_{\mathcal{X}}R_{\mathcal{Y}},\beta_{21}R_{\mathcal{X}}R_{\mathcal{Y}}\right). In other words one needs to show that

which can be done with straightforward calculations (by introducing g(x,y)g(x^{\prime},y) and using the definition of smoothness for φ\varphi).

2.4 Applications

We investigate briefly three applications for SP-MD and SP-MP.

The problem (5.2) (when ff has to minimized over X\mathcal{X}) can be rewritten as

that is β21=L\beta_{21}=L. On the other hand xφ(x,y)=i=1myifi(x)\nabla_{x}\varphi(x,y)=\sum_{i=1}^{m}y_{i}\nabla f_{i}(x), and thus

that is β11=β\beta_{11}=\beta and β12=L\beta_{12}=L. Thus using SP-MP with some mirror map on X\mathcal{X} and the negentropy on Δm\Delta_{m} (see the “simplex setup" in Section 4.3), one obtains an ε\varepsilon-optimal point of f(x)=max1imfi(x)f(x)=\max_{1\leq i\leq m}f_{i}(x) in O(βRX2+LRXlog(m)ε)O\left(\frac{\beta R_{\mathcal{X}}^{2}+LR_{\mathcal{X}}\sqrt{\log(m)}}{\varepsilon}\right) iterations. Furthermore an iteration of SP-MP has a computational complexity of order of a step of mirror descent in X\mathcal{X} on the function xi=1my(i)fi(x)x\mapsto\sum_{i=1}^{m}y(i)f_{i}(x) (plus O(m)O(m) for the update in the Y\mathcal{Y}-space).

Thus by using the structure of ff we were able to obtain a much better rate than black-box procedures (which would have required Ω(1/ε2)\Omega(1/\varepsilon^{2}) iterations as ff is potentially non-smooth).

Here we equip both Δn\Delta_{n} and Δm\Delta_{m} with 1\|\cdot\|_{1}. Let φ(x,y)=xAy\varphi(x,y)=x^{\top}Ay. Using that xφ(x,y)=Ay\nabla_{x}\varphi(x,y)=Ay and yφ(x,y)=Ax\nabla_{y}\varphi(x,y)=A^{\top}x one immediately obtains β11=β22=0\beta_{11}=\beta_{22}=0. Furthermore since

3 Interior point methods

We describe here interior point methods (IPM), a class of algorithms fundamentally different from what we have seen so far. The first algorithm of this type was described in Karmarkar (1984), but the theory we shall present was developed in Nesterov and Nemirovski (1994). We follow closely the presentation given in [Chapter 4, Nesterov (2004a)]. Other useful references (in particular for the primal-dual IPM, which are the ones used in practice) include Renegar (2001); Nemirovski (2004b); Nocedal and Wright (2006).

IPM are designed to solve convex optimization problems of the form

The idea of the barrier method is to move along the central path by “boosting" a fast locally convergent algorithm, which we denote for the moment by A\mathcal{A}, using the following scheme: Assume that one has computed x(t)x^{*}(t), then one uses A\mathcal{A} initialized at x(t)x^{*}(t) to compute x(t)x^{*}(t^{\prime}) for some t>tt^{\prime}>t. There is a clear tension for the choice of tt^{\prime}, on the one hand tt^{\prime} should be large in order to make as much progress as possible on the central path, but on the other hand x(t)x^{*}(t) needs to be close enough to x(t)x^{*}(t^{\prime}) so that it is in the basin of fast convergence for A\mathcal{A} when run on FtF_{t^{\prime}}.

IPM follows the above methodology with A\mathcal{A} being Newton’s method. Indeed as we will see in the next subsection, Newton’s method has a quadratic convergence rate, in the sense that if initialized close enough to the optimum it attains an ε\varepsilon-optimal point in loglog(1/ε)\log\log(1/\varepsilon) iterations! Thus we now have a clear plan to make these ideas formal and analyze the iteration complexity of IPM:

First we need to describe precisely the region of fast convergence for Newton’s method. This will lead us to define self-concordant functions, which are “natural" functions for Newton’s method.

Then we need to evaluate precisely how much larger tt^{\prime} can be compared to tt, so that x(t)x^{*}(t) is still in the region of fast convergence of Newton’s method when optimizing the function FtF_{t^{\prime}} with t>tt^{\prime}>t. This will lead us to define ν\nu-self concordant barriers.

3.2 Traditional analysis of Newton’s method

Thus, starting at xx, in order to minimize ff it seems natural to move in the direction hh that minimizes

While this method can have an arbitrarily bad behavior in general, if started close enough to a strict local minimum of ff, it can have a very fast convergence:

Then Newton’s method is well-defined and converges to xx^{*} at a quadratic rate:

Now note that f(x)=0\nabla f(x^{*})=0, and thus with the above formula one obtains

Using the Lipschitz property of the Hessian one immediately obtains that

3.3 Self-concordant functions

Before giving the definition of self-concordant functions let us try to get some insight into the “geometry" of Newton’s method. Let AA be a n×nn\times n non-singular matrix. We look at a Newton step on the functions f:xf(x)f:x\mapsto f(x) and φ:yf(A1y)\varphi:y\mapsto f(A^{-1}y), starting respectively from xx and y=Axy=Ax, that is:

In other words Newton’s method will follow the same trajectory in the “xx-space" and in the “yy-space" (the image through AA of the xx-space), that is Newton’s method is affine invariant. Observe that this property is not shared by the methods described in Chapter 3 (except for the conditional gradient descent).

The issue is that this inequality depends on the choice of an inner product. More importantly it is easy to see that a convex function which goes to infinity on a compact set simply cannot satisfy the above inequality. A natural idea to try fix these issues is to replace the Euclidean metric on the right hand side by the metric given by the function ff itself at xx, that is:

Observe that to be clear one should rather use the notation x,f\|\cdot\|_{x,f}, but since ff will always be clear from the context we stick to x\|\cdot\|_{x}.

We say that ff is standard self-concordant if ff is self-concordant with constant M=2M=2.

An easy consequence of the definition is that a self-concordant function is a barrier for the set X\mathcal{X}, see [Theorem 4.1.4, Nesterov (2004a)]. The main example to keep in mind of a standard self-concordant function is f(x)=logxf(x)=-\log x for x>0x>0. The next definition will be key in order to describe the region of quadratic convergence for Newton’s method on self-concordant functions.

see [Equation 4.1.18, Nesterov (2004a)]. We state the next theorem without a proof, see also [Theorem 4.1.14, Nesterov (2004a)].

In other words the above theorem states that, if initialized at a point x0x_{0} such that λf(x0)1/4\lambda_{f}(x_{0})\leq 1/4, then Newton’s iterates satisfy λf(xk+1)2λf(xk)2\lambda_{f}(x_{k+1})\leq 2\lambda_{f}(x_{k})^{2}. Thus, Newton’s region of quadratic convergence for self-concordant functions can be described as a “Newton decrement ball" {x:λf(x)1/4}\{x:\lambda_{f}(x)\leq 1/4\}. In particular by taking the barrier to be a self-concordant function we have now resolved Step (1) of the plan described in Section 5.3.1.

3.4 ν𝜈\nu-self-concordant barriers

We deal here with Step (2) of the plan described in Section 5.3.1. Given Theorem 5.3.5 we want tt^{\prime} to be as large as possible and such that

Since the Hessian of FtF_{t^{\prime}} is the Hessian of FF, one has

Observe that, by first order optimality, one has tc+F(x(t))=0,tc+\nabla F(x^{*}(t))=0, which yields

immediately yields (5.6). In particular with the value of tt^{\prime} given in (5.8) the Newton’s method on FtF_{t^{\prime}} initialized at x(t)x^{*}(t) will converge quadratically fast to x(t)x^{*}(t^{\prime}).

It remains to verify that by iterating (5.8) one obtains a sequence diverging to infinity, and to estimate the rate of growth. Thus one needs to control cx(t)=1tF(x(t))x(t)\|c\|^{*}_{x^{*}(t)}=\frac{1}{t}\|\nabla F(x^{*}(t))\|_{x^{*}(t)}^{*}. Luckily there is a natural class of functions for which one can control F(x)x\|\nabla F(x)\|_{x}^{*} uniformly over xx. This is the set of functions such that

Thus a safe choice to increase the penalization parameter is t=(1+14ν)tt^{\prime}=\left(1+\frac{1}{4\sqrt{\nu}}\right)t. Note that the condition (5.9) can also be written as the fact that the function FF is 1ν\frac{1}{\nu}-exp-concave, that is xexp(1νF(x))x\mapsto\exp(-\frac{1}{\nu}F(x)) is concave. We arrive at the following definition.

FF is a ν\nu-self-concordant barrier if it is a standard self-concordant function, and it is 1ν\frac{1}{\nu}-exp-concave.

A key property of ν\nu-self-concordant barriers is the following inequality:

see [Equation (4.2.17), Nesterov (2004a)]. More generally using (5.10) together with (5.5) one obtains

In the next section we describe a precise algorithm based on the ideas we developed above. As we will see one cannot ensure to be exactly on the central path, and thus it is useful to generalize the identity (5.7) for a point xx close to the central path. We do this as follows:

3.5 Path-following scheme

We can now formally describe and analyze the most basic IPM called the path-following scheme. Let FF be ν\nu-self-concordant barrier for X\mathcal{X}. Assume that one can find x0x_{0} such that λFt0(x0)1/4\lambda_{F_{t_{0}}}(x_{0})\leq 1/4 for some small value t0>0t_{0}>0 (we describe a method to find x0x_{0} at the end of this subsection). Then for k0k\geq 0, let

The next theorem shows that after O(νlogνt0ε)O\left(\sqrt{\nu}\log\frac{\nu}{t_{0}\varepsilon}\right) iterations of the path-following scheme one obtains an ε\varepsilon-optimal point.

The path-following scheme described above satisfies

We show that the iterates (xk)k0(x_{k})_{k\geq 0} remain close to the central path (x(tk))k0(x^{*}(t_{k}))_{k\geq 0}. Precisely one can easily prove by induction that

Indeed using Theorem 5.3.5 and equation (5.12) one immediately obtains

where we used in the last inequality that tk+1/tk=1+113νt_{k+1}/t_{k}=1+\frac{1}{13\sqrt{\nu}} and ν1\nu\geq 1.

Observe that tk=(1+113ν)kt0t_{k}=\left(1+\frac{1}{13\sqrt{\nu}}\right)^{k}t_{0}, which finally yields

At this point we still need to explain how one can get close to an intial point x(t0)x^{*}(t_{0}) of the central path. This can be done with the following rather clever trick. Assume that one has some point y0Xy_{0}\in\mathcal{X}. The observation is that y0y_{0} is on the central path at t=1t=1 for the problem where cc is replaced by F(y0)-\nabla F(y_{0}). Now instead of following this central path as t+t\to+\infty, one follows it as t0t\to 0. Indeed for tt small enough the central paths for cc and for F(y0)-\nabla F(y_{0}) will be very close. Thus we iterate the following equations, starting with t0=1t_{0}^{\prime}=1,

A straightforward analysis shows that for k=O(νlogν)k=O(\sqrt{\nu}\log\nu), which corresponds to tk=1/νO(1)t_{k}^{\prime}=1/\nu^{O(1)}, one obtains a point yky_{k} such that λFtk(yk)1/4\lambda_{F_{t_{k}^{\prime}}}(y_{k})\leq 1/4. In other words one can initialize the path-following scheme with t0=tkt_{0}=t_{k}^{\prime} and x0=ykx_{0}=y_{k}.

3.6 IPMs for LPs and SDPs

There is one important issue that we overlooked so far. In most interesting cases LPs and SDPs come with equality constraints, resulting in a set of constraints X\mathcal{X} with empty interior. From a theoretical point of view there is an easy fix, which is to reparametrize the problem as to enforce the variables to live in the subspace spanned by X\mathcal{X}. This modification also has algorithmic consequences, as the evaluation of the Newton direction will now be different. In fact, rather than doing a reparametrization, one can simply search for Newton directions such that the updated point will stay in X\mathcal{X}. In other words one has now to solve a convex quadratic optimization problem under linear equality constraints. Luckily using Lagrange multipliers one can find a closed form solution to this problem, and we refer to previous references for more details.

Chapter 6 Convex optimization and randomness

In this chapter we explore the interplay between optimization and randomness. A key insight, going back to Robbins and Monro (1951), is that first order methods are quite robust: the gradients do not have to be computed exactly to ensure progress towards the optimum. Indeed since these methods usually do many small steps, as long as the gradients are correct on average, the error introduced by the gradient approximations will eventually vanish. As we will see below this intuition is correct for non-smooth optimization (since the steps are indeed small) but the picture is more subtle in the case of smooth optimization (recall from Chapter 3 that in this case we take long steps).

We also note that the situation with a biased oracle is quite different, and we refer to d’Aspremont (2008); Schmidt et al. (2011) for some works in this direction.

The two canonical examples of a stochastic oracle in machine learning are as follows.

The second example is the one described in Section 1.1, where one wants to minimize f(x)=1mi=1mfi(x)f(x)=\frac{1}{m}\sum_{i=1}^{m}f_{i}(x). In this situation a stochastic oracle can be obtained by selecting uniformly at random I[m]I\in[m] and reporting fI(x)\nabla f_{I}(x).

This immediately yields a rate of convergence thanks to the following simple observation based on the tower rule:

Similarly, in the Euclidean and strongly convex case, one can directly generalize Theorem 3.4.1. Precisely we consider stochastic gradient descent (SGD), that is S-MD with Φ(x)=12x22\Phi(x)=\frac{1}{2}\|x\|_{2}^{2}, with time-varying step size (ηt)t1(\eta_{t})_{t\geq 1}, that is

2 Smooth stochastic optimization and mini-batch SGD

In the previous section we showed that, for non-smooth optimization, there is basically no cost for having a stochastic oracle instead of an exact oracle. Unfortunately one can show (see e.g. Tsybakov (2003)) that smoothness does not bring any acceleration for a general stochastic oracleWhile being true in general this statement does not say anything about specific functions/oracles. For example it was shown in Bach and Moulines (2013) that acceleration can be obtained for the square loss and the logistic loss.. This is in sharp contrast with the exact oracle case where we showed that gradient descent attains a 1/t1/t rate (instead of 1/t1/\sqrt{t} for non-smooth), and this could even be improved to 1/t21/t^{2} thanks to Nesterov’s accelerated gradient descent.

The next result interpolates between the 1/t1/\sqrt{t} for stochastic smooth optimization, and the 1/t1/t for deterministic smooth optimization. We will use it to propose a useful modification of SGD in the smooth case. The proof is extracted from Dekel et al. (2012).

Using β\beta-smoothness, Cauchy-Schwarz (with 2abxa2+b2/x2ab\leq xa^{2}+b^{2}/x for any x>0x>0), and the 1-strong convexity of Φ\Phi, one obtains

Observe that, using the same argument as to derive (4.9), one has

By summing this inequality from s=1s=1 to s=ts=t one can easily conclude with the standard argument.

where g~i(xt),i=1,,m\widetilde{g}_{i}(x_{t}),i=1,\ldots,m are independent random variables (conditionally on xtx_{t}) obtained from repeated queries to the stochastic oracle. Assuming that ff is β\beta-smooth and that the stochastic oracle is such that g~(x)2B\|\widetilde{g}(x)\|_{2}\leq B, one can obtain a rate of convergence for mini-batch SGD with Theorem 6.2.1. Indeed one can apply this result with the modified stochastic oracle that returns 1mi=1mg~i(x)\frac{1}{m}\sum_{i=1}^{m}\widetilde{g}_{i}(x), it satisfies

Thus one obtains that with tt calls to the (original) stochastic oracle, that is t/mt/m iterations of the mini-batch SGD, one has a suboptimality gap bounded by

Thus as long as mBRβtm\leq\frac{B}{R\beta}\sqrt{t} one obtains, with mini-batch SGD and tt calls to the oracle, a point which is 3RBt3\frac{RB}{\sqrt{t}}-optimal.

Mini-batch SGD can be a better option than basic SGD in at least two situations: (i) When the computation for an iteration of mini-batch SGD can be distributed between multiple processors. Indeed a central unit can send the message to the processors that estimates of the gradient at point xsx_{s} have to be computed, then each processor can work independently and send back the estimate they obtained. (ii) Even in a serial setting mini-batch SGD can sometimes be advantageous, in particular if some calculations can be re-used to compute several estimated gradients at the same point.

3 Sum of smooth and strongly convex functions

Let us examine in more details the main example from Section 1.1. That is one is interested in the unconstrained minimization of

where f1,,fmf_{1},\ldots,f_{m} are β\beta-smooth and convex functions, and ff is α\alpha-strongly convex. Typically in machine learning α\alpha can be as small as 1/m1/m, while β\beta is of order of a constant. In other words the condition number κ=β/α\kappa=\beta/\alpha can be as large as Ω(m)\Omega(m). Let us now compare the basic gradient descent, that is

where iti_{t} is drawn uniformly at random in [m][m] (independently of everything else). Theorem 3.4.3 shows that gradient descent requires O(mκlog(1/ε))O(m\kappa\log(1/\varepsilon)) gradient computations (which can be improved to O(mκlog(1/ε))O(m\sqrt{\kappa}\log(1/\varepsilon)) with Nesterov’s accelerated gradient descent), while Theorem 6.1.2 shows that SGD (with appropriate averaging) requires O(1/(αε))O(1/(\alpha\varepsilon)) gradient computations. Thus one can obtain a low accuracy solution reasonably fast with SGD, but for high accuracy the basic gradient descent is more suitable. Can we get the best of both worlds? This question was answered positively in Le Roux et al. (2012) with SAG (Stochastic Averaged Gradient) and in Shalev-Shwartz and Zhang (2013a) with SDCA (Stochastic Dual Coordinate Ascent). These methods require only O((m+κ)log(1/ε))O((m+\kappa)\log(1/\varepsilon)) gradient computations. We describe below the SVRG (Stochastic Variance Reduced Gradient descent) algorithm from Johnson and Zhang (2013) which makes the main ideas of SAG and SDCA more transparent (see also Defazio et al. (2014) for more on the relation between these different methods). We also observe that a natural question is whether one can obtain a Nesterov’s accelerated version of these algorithms that would need only O((m+mκ)log(1/ε))O((m+\sqrt{m\kappa})\log(1/\varepsilon)), see Shalev-Shwartz and Zhang (2013b); Zhang and Xiao (2014); Agarwal and Bottou (2014) for recent works on this question.

To obtain a linear rate of convergence one needs to make “big steps", that is the step-size should be of order of a constant. In SGD the step-size is typically of order 1/t1/\sqrt{t} because of the variance introduced by the stochastic oracle. The idea of SVRG is to “center" the output of the stochastic oracle in order to reduce the variance. Precisely instead of feeding fi(x)\nabla f_{i}(x) into the gradient descent one would use fi(x)fi(y)+f(y)\nabla f_{i}(x)-\nabla f_{i}(y)+\nabla f(y) where yy is a centering sequence. This is a sensible idea since, when xx and yy are close to the optimum, one should have that fi(x)fi(y)\nabla f_{i}(x)-\nabla f_{i}(y) will have a small variance, and of course f(y)\nabla f(y) will also be small (note that fi(x)\nabla f_{i}(x) by itself is not necessarily small). This intuition is made formal with the following lemma.

Let gi(x)=fi(x)fi(x)fi(x)(xx)g_{i}(x)=f_{i}(x)-f_{i}(x^{*})-\nabla f_{i}(x^{*})^{\top}(x-x^{*}). By convexity of fif_{i} one has gi(x)0g_{i}(x)\geq 0 for any xx and in particular using (3.5) this yields gi(x)12βgi(x)22-g_{i}(x)\leq-\frac{1}{2\beta}\|\nabla g_{i}(x)\|_{2}^{2} which can be equivalently written as

On the other hand the computation of f(y)\nabla f(y) is expensive (it requires mm gradient computations), and thus the centering sequence should be updated more rarely than the main sequence. These ideas lead to the following epoch-based algorithm.

where it(s)i_{t}^{(s)} is drawn uniformly at random (and independently of everything else) in [m][m]. Also let

which clearly implies the theorem. To simplify the notation in the following we drop the dependency on ss, that is we want to show that

We start as for the proof of Theorem 3.4.3 (analysis of gradient descent for smooth and strongly convex functions) with

and thus plugging this into (6.2) together with (6.3) one obtains

Summing the above inequality over t=1,,kt=1,\ldots,k yields

Noting that x1=yx_{1}=y and that by α\alpha-strong convexity one has f(x)f(x)α2xx22f(x)-f(x^{*})\geq\frac{\alpha}{2}\|x-x^{*}\|_{2}^{2}, one can rearrange the above display to obtain

Using that η=110β\eta=\frac{1}{10\beta} and k=20κk=20\kappa finally yields (6.1) which itself concludes the proof.

4 Random coordinate descent

where isi_{s} is drawn uniformly at random from [n][n] (and independently of everything else).

Thus using Theorem 6.1.1 (with Φ(x)=12x22\Phi(x)=\frac{1}{2}\|x\|_{2}^{2}, that is S-MD being SGD) one immediately obtains the following result.

Somewhat unsurprisingly RCD requires nn times more iterations than gradient descent to obtain the same accuracy. In the next section, we will see that this statement can be greatly improved by taking into account directional smoothness.

If ff is twice differentiable then this is equivalent to (2f(x))i,iβi(\nabla^{2}f(x))_{i,i}\leq\beta_{i}. In particular, since the maximal eigenvalue of a matrix is upper bounded by its trace, one can see that the directional smoothness implies that ff is β\beta-smooth with βi=1nβi\beta\leq\sum_{i=1}^{n}\beta_{i}. We now study the following “aggressive" RCD, where the step-sizes are of order of the inverse smoothness:

Furthermore we study a more general sampling distribution than uniform, precisely for γ0\gamma\geq 0 we assume that isi_{s} is drawn (independently) from the distribution pγp_{\gamma} defined by

This algorithm was proposed in Nesterov (2012), and we denote it by RCD(γ\gamma). Observe that, up to a preprocessing step of complexity O(n)O(n), one can sample from pγp_{\gamma} in time O(log(n))O(\log(n)).

The following rate of convergence is derived in Nesterov (2012), using the dual norms [γ],[γ]\|\cdot\|_{[\gamma]},\|\cdot\|_{[\gamma]}^{*} defined by

Recall from Theorem 3.2.1 that in this context the basic gradient descent attains a rate of βx1x22/t\beta\|x_{1}-x^{*}\|_{2}^{2}/t where βi=1nβi\beta\leq\sum_{i=1}^{n}\beta_{i} (see the discussion above). Thus we see that RCD(11) greatly improves upon gradient descent for functions where β\beta is of order of i=1nβi\sum_{i=1}^{n}\beta_{i}. Indeed in this case both methods attain the same accuracy after a fixed number of iterations, but the iterations of coordinate descent are potentially much cheaper than the iterations of gradient descent.

Thus putting together the above calculations one obtains

The proof can be concluded with similar computations than for Theorem 3.2.1.

We discussed above the specific case of γ=1\gamma=1. Both γ=0\gamma=0 and γ=1/2\gamma=1/2 also have an interesting behavior, and we refer to Nesterov (2012) for more details. The latter paper also contains a discussion of high probability results and potential acceleration à la Nesterov. We also refer to Richtárik and Takác (2012) for a discussion of RCD in a distributed setting.

4.2 RCD for smooth and strongly convex optimization

If in addition to directional smoothness one also assumes strong convexity, then RCD attains in fact a linear rate.

By strong convexity, Hölder’s inequality, and an elementary calculation,

which concludes the proof by taking y=xy=x^{*}.

In the proof of Theorem 6.4.2 we showed that

The proof is concluded with straightforward calculations.

5 Acceleration by randomization for saddle points

Using S-SP-MD we revisit the examples of Section 5.2.4 and Section 5.2.4. In both cases one has φ(x,y)=xAy\varphi(x,y)=x^{\top}Ay (with AiA_{i} being the ithi^{th} column of AA), and thus xφ(x,y)=Ay\nabla_{x}\varphi(x,y)=Ay and yφ(x,y)=Ax\nabla_{y}\varphi(x,y)=A^{\top}x.

Matrix games. Here xΔnx\in\Delta_{n} and yΔmy\in\Delta_{m}. Thus there is a quite natural stochastic oracle:

Unfortunately this last term can be O(n)O(n). However it turns out that one can do a more careful analysis of mirror descent in terms of local norms, which allows to prove that the “local variance" is dimension-free. We refer to Bubeck and Cesa-Bianchi (2012) for more details on these local norms, and to Clarkson et al. (2012) for the specific details in the linear classification situation.

6 Convex relaxation and randomized rounding

In this section we briefly discuss the concept of convex relaxation, and the use of randomization to find approximate solutions. By now there is an enormous literature on these topics, and we refer to Barak (2014) for further pointers.

Viewing AA as the (weighted) adjacency matrix of a graph, one can rewrite (6.6) as follows, using the graph Laplacian L=DAL=D-A where DD is the diagonal matrix with entries (j=1nAi,j)i[n](\sum_{j=1}^{n}A_{i,j})_{i\in[n]},

It turns out that this optimization problem is NP\mathbf{NP}-hard, that is the existence of a polynomial time algorithm to solve (6.7) would prove that P=NP\mathbf{P}=\mathbf{NP}. The combinatorial difficulty of this problem stems from the hypercube constraint. Indeed if one replaces {1,1}n\{-1,1\}^{n} by the Euclidean sphere, then one obtains an efficiently solvable problem (it is the problem of computing the maximal eigenvalue of LL).

We show now that, while (6.7) is a difficult optimization problem, it is in fact possible to find relatively good approximate solutions by using the power of randomization. Let ζ\zeta be uniformly drawn on the hypercube {1,1}n\{-1,1\}^{n}, then clearly

Next we show that one can obtain an even better approximation ratio by combining the power of convex optimization and randomization. This approach was pioneered by Goemans and Williamson (1995). The Goemans-Williamson algorithm is based on the following inequality

The proof of this result is based on the following elementary geometric lemma.

We can now get to the proof of Theorem 6.6.1.

and in particular for x{1,1}nx\in\{-1,1\}^{n}, xLx=i,j=1nAi,j(1xixj)x^{\top}Lx=\sum_{i,j=1}^{n}A_{i,j}(1-x_{i}x_{j}). Thus, using Lemma 6.6.2, and the facts that Ai,j0A_{i,j}\geq 0 and Σi,j1|\Sigma_{i,j}|\leq 1 (see the proof of Lemma 6.6.2), one has

Theorem 6.6.1 depends on the form of the Laplacian LL (insofar as (6.8) was used). We show next a result from Nesterov (1997) that applies to any positive semi-definite matrix, at the expense of the constant of approximation. Precisely we are now interested in the following optimization problem:

Finally one can conclude using the fact if A,B0A,B\succeq 0 then AB0A\circ B\succeq 0. This can be seen by writing A=VVA=VV^{\top}, B=UUB=UU^{\top}, and thus

In other words ABA\circ B is a Gram-matrix and, thus it is positive semi-definite.

7 Random walk based methods

Randomization naturally suggests itself in the center of gravity method (see Section 2.1), as a way to circumvent the exact calculation of the center of gravity. This idea was proposed and developed in Bertsimas and Vempala (2004). We give below a condensed version of the main ideas of this paper.

Thus if one can ensure that St\mathcal{S}_{t} is in (near) isotropic position, and ctc^t2\|c_{t}-\hat{c}_{t}\|_{2} is small (say smaller than 0.10.1), then the randomized center of gravity method (which replaces ctc_{t} by c^t\hat{c}_{t}) will converge at the same speed than the original center of gravity method.

Let us now consider the issue of putting St\mathcal{S}_{t} in near-isotropic position. Let Σ^t=1Ni=1N(Xic^t)(Xic^t)\hat{\Sigma}_{t}=\frac{1}{N}\sum_{i=1}^{N}(X_{i}-\hat{c}_{t})(X_{i}-\hat{c}_{t})^{\top}. Rudelson (1999) showed that as long as N=Ω~(n)N=\widetilde{\Omega}(n), one has with high probability (say at least probability 11/n21-1/n^{2}) that the set Σ^t1/2(Stc^t)\hat{\Sigma}_{t}^{-1/2}(\mathcal{S}_{t}-\hat{c}_{t}) is in near-isotropic position.

Thus it only remains to explain how to sample from a near-isotropic convex set K\mathcal{K}. This is where random walk ideas come into the picture. The hit-and-run walkOther random walks are known for this problem but hit-and-run is the one with the sharpest theoretical guarantees. Curiously we note that one of those walks is closely connected to projected gradient descent, see Bubeck et al. (2015a). is described as follows: at a point xKx\in\mathcal{K}, let L\mathcal{L} be a line that goes through xx in a direction taken uniformly at random, then move to a point chosen uniformly at random in LK\mathcal{L}\cap\mathcal{K}. Lovász (1998) showed that if the starting point of the hit-and-run walk is chosen from a distribution “close enough" to the uniform distribution on K\mathcal{K}, then after O(n3)O(n^{3}) steps the distribution of the last point is ε\varepsilon away (in total variation) from the uniform distribution on K\mathcal{K}. In the randomized center of gravity method one can obtain a good initial distribution for St\mathcal{S}_{t} by using the distribution that was obtained for St1\mathcal{S}_{t-1}. In order to initialize the entire process correctly we start here with S1=[L,L]nX\mathcal{S}_{1}=[-L,L]^{n}\supset\mathcal{X} (in Section 2.1 we used S1=X\mathcal{S}_{1}=\mathcal{X}), and thus we also have to use a separation oracle at iterations where c^t∉X\hat{c}_{t}\not\in\mathcal{X}, just like we did for the ellipsoid method (see Section 2.2).

Wrapping up the above discussion, we showed (informally) that to attain an ε\varepsilon-optimal point with the randomized center of gravity method one needs: O~(n)\widetilde{O}(n) iterations, each iterations requires O~(n)\widetilde{O}(n) random samples from St\mathcal{S}_{t} (in order to put it in isotropic position) as well as a call to either the separation oracle or the first order oracle, and each sample costs O~(n3)\widetilde{O}(n^{3}) steps of the random walk. Thus overall one needs O~(n)\widetilde{O}(n) calls to the separation oracle and the first order oracle, as well as O~(n5)\widetilde{O}(n^{5}) steps of the random walk.

References