A geometric alternative to Nesterov's accelerated gradient descent

Sébastien Bubeck, Yin Tat Lee, Mohit Singh

Introduction

Let κ=β/α\kappa=\beta/\alpha be its condition number. It is a one line calculation to verify that a step of gradient descent on ff will decrease (multiplicatively) the squared distance to the optimum by 11/κ1-1/\kappa. In this paper we propose a new method, which can be viewed as some combination of gradient descent and the ellipsoid method, for which the squared distance to the optimum decreases at a rate of (11/κ)(1-1/\sqrt{\kappa}) (and each iteration requires one gradient evaluation and two line-searches). This matches the optimal rate of convergence among the class of first-order methods, [Nesterov(1983), Nesterov(2004)].

Nesterov’s acceleration (i.e., replacing κ\kappa by κ\sqrt{\kappa} in the convergence rate) has proven to be of fundamental importance both in theory and in practice, see e.g. [Bubeck(2014)] for references. However the intuition behind Nesterov’s accelerated gradient descent is notoriously difficult to grasp, and this has led to a recent surge of interest in new interpretations of this algorithm, as well as the reasons behind the possibility of acceleration for smooth problems, see [Allen-Zhu and Orecchia(2014), Lessard et al.(2014)Lessard, Recht, and Packard, Su et al.(2014)Su, Boyd, and Candès, Flammarion and Bach(2015)].

In this paper we propose a new method with a clear intuition and which achieves acceleration. Since the function is strongly convex, gradient at any point gives a ball, say AA, containing the optimum solution. Using the fact that the function is smooth, one can get an improved bound on the radius of this ball. The algorithm also maintains a ball BB containing the optimal solution obtained via the information from previous iterations. A simple calculation then shows that the smallest ball enclosing the intersection of AA and BB already has a radius shrinking at the rate of 11κ1-\frac{1}{\kappa}. To achieve the accelerated rate, we make the observation that the gradient information in this iteration can also be used to shrink the ball BB and therefore, the radius of the enclosing ball containing the intersection of AA and BB shrinks at a faster rate. We detail this intuition in Section 2. The new optimal method is described and analyzed in Section 3. We conclude with some experiments in Section 4.

2 Preliminaries

Furthermore recall that by smoothness one has f(x+)f(x)12βf(x)2f(x^{+})\leq f(x)-\frac{1}{2\beta}|\nabla f(x)|^{2} which allows to shrink the above ball by a factor of 11κ1-\frac{1}{\kappa} and obtain the following:

Intuition

In Section 2.1 we describe a geometric alternative to gradient descent (with the same convergence rate) which gives the core of our new optimal method. Then in Section 2.2 we explain why one can expect to accelerate this geometric algorithm.

Denote by TT the map from x0x_{0} to x1x_{1} defined implicitely above, and let (xk)(x_{k}) be defined by xk+1=T(xk)x_{k+1}=T(x_{k}). Then we just proved

In other words, after 2κlog(R0/ε)2\kappa\log(R_{0}/\varepsilon) iterations where each iteration cost one call to the gradient oracle) one obtains a point ε\varepsilon-close to the minimizer of ff.

2 Why one can accelerate

which, intuitively, allows us the shrink the radius squared from R02R_{0}^{2} to R02f(x0)2α2κR_{0}^{2}-\frac{|\nabla f(x_{0})|^{2}}{\alpha^{2}\kappa} using the local information at x0x_{0}. From (1), we have

Now, intersecting the above two shrunk balls and using Lemma 1 (see below and also see Figure 2), we obtain that there is an x1x_{1}^{\prime} such that

giving us an acceleration in shrinking of the radius. To carry the argument for the next iteration, we would have required that f(x1)f(x0+)f(x_{1}^{\prime})\leq f(x_{0}^{+}) but it may not hold. Thus, we choose x1x_{1} by a line search

which ensures that f(x1)f(x0+)f(x_{1})\leq f(x_{0}^{+}). To remedy the fact that the ball for the next iteration is centered at x1x_{1}^{\prime} and not x1x_{1}, we observe that the line search also ensures that f(x1)\nabla f(x_{1}) is perpendicular to the line going through x1x_{1} and x1x_{1}^{\prime}. This geometric fact is enough for the algorithm to work at the next iteration as well. In the next section we describe precisely our proposed algorithm which is based on the above insights.

An optimal algorithm

and ck+1c_{k+1} (respectively Rk+12R^{2}_{k+1}) be the center (respectively the squared radius) of the ball given by (the proof of) Lemma 1 which contains

The formulas for ck+1c_{k+1} and Rk+12R^{2}_{k+1} are given in Algorithm 1.

Proof We will prove a stronger claim by induction that for each k0k\geq 0, one has

The case k=0k=0 follows immediately by (1). Let us assume that the above display is true for some k0k\geq 0. Then using f(x)f(xk+1+)f(xk+1)12βf(xk+1)2f(xk+)12βf(xk+1)2,f(x^{*})\leq f(x_{k+1}^{+})\leq f(x_{k+1})-\frac{1}{2\beta}|\nabla f(x_{k+1})|^{2}\leq f(x_{k}^{+})-\frac{1}{2\beta}|\nabla f(x_{k+1})|^{2}, one gets

Thus it only remains to observe that the squared radius of the ball given by Lemma 1 which encloses the intersection of the two above balls is smaller than (11κ)Rk22α(f(xk+1+)f(x))\left(1-\frac{1}{\sqrt{\kappa}}\right)R_{k}^{2}-\frac{2}{\alpha}(f(x_{k+1}^{+})-f(x^{*})). We apply Lemma 1 after moving ckc_{k} to the origin and scaling distances by RkR_{k}. We set ε=1κ\varepsilon=\frac{1}{\kappa}, g=f(xk+1)αg=\frac{|\nabla f(x_{k+1})|}{\alpha}, δ=2α(f(xk+1+)f(x))\delta=\frac{2}{\alpha}\left(f(x_{k+1}^{+})-f(x^{*})\right) and a=xk+1++cka={x_{k+1}^{++}-c_{k}}. The line search step of the algorithm implies that f(xk+1)(xk+1ck)=0\nabla f(x_{k+1})^{\top}(x_{k+1}-c_{k})=0 and therefore, a=xk+1++ckf(xk+1)/α=g|a|=|x_{k+1}^{++}-c_{k}|\geq|\nabla f(x_{k+1})|/\alpha=g and Lemma 1 applies to give the result.

Proof First observe that if g21/2g^{2}\leq 1/2 then one can take c=ac=a since 12(1ε)1ε\frac{1}{2}(1-\varepsilon)\leq 1-\sqrt{\varepsilon}. Thus we assume now that g2>1/2g^{2}>1/2, and note that we can also clearly assume that n=2n=2. Consider the segment joining the two points at the intersection of the two balls under consideration. We denote cc for the point at the intersection of this segment and [0,a][0,a], and x=cx=|c| (that is c=xaac=x\frac{a}{|a|}). A simple picture reveals that xx satisfies

which is straightforward to verify (recall that a2g21/2|a|^{2}\geq g^{2}\geq 1/2).

Algorithm 2 we give is more agressive than Theorem 1, for instance, using line search instead of fixed step size. The correctness of this version follows from a similar proof as Theorem 1.

This algorithm does not require the smoothness parameter and the number of iterations; and it guarantees the function value is strictly decreasing. They are useful properties for machine learning applications because the only required parameter α\alpha is usually given. Furthermore, we believe that the integration of zeroth and first order information about the function makes our new method particularly well-suited in practice.

Experiments

In this section, we compare Geometric Descent method (GeoD) with a variety of full gradient methods. It includes steepest descent (SD), accelerated full gradient method (AFG), accelerated full gradient method with adaptive restart (AFGwR) and quasi-Newton with limited-memory BFGS updating (L-BFGS). For SD, we compute the gradient and perform an exact line search on the gradient direction. For AFG, we use the ‘Constant Step Scheme II’ in [Nesterov(2004)]. For AFGwR, [O’Donoghue and Candes(2013)], we use the function restart scheme and replace the gradient step by an exact line search to improve its performance. For both AFG and AFGwR, the parameter is chosen among all powers of 2 for each dataset individually. For L-BFGS, we use the software developed by Mark Schmidt with default settings (see [Schmidt(2012)]).

In all experiments, the minimization problem is of the form iφ(aiTx)\sum_{i}\varphi(a_{i}^{T}x) where computing aiTxa_{i}^{T}x is the computational bottleneck. Therefore, if we reuse the calculations carefully, each iteration of all mentioned methods requires only one calculation of aiTxa_{i}^{T}x for some xx. In particular, the cost of exact line searches is negligible compares with the cost of computing aiTxa_{i}^{T}x. Hence, we simply report the number of iterations in the following experiments.

We evaluate the algorithms via the binary classification problem on the 40 datasetsWe omitted all datasets of size \geq 100 MB for time consideration. from LIBSVM data, [Chang and Lin(2011)]. The problem is to minimize the regularized empirical risk:

We solve this problem with different regularization coefficients λ{104,105,106,107,108}\lambda\in\{10^{-4},10^{-5},10^{-6},10^{-7},10^{-8}\} and report the median and 90th percentile of the number of steps required to achieve a certain accuracy. In figure 3, we see that GeoD is better than SD, AFG and AFGwR, but worse than L-BFGS. Since L-BFGS stores and uses the gradients of the previous iterations, it is interesting to see if GeoD will be competitive to L-BFGS if it computes the intersection of multiple balls instead of 2 balls.

2 Worst Case Experiment

In this section, we consider the minimization problem

where β\beta is the smothness parameter. Within the first nn iterations, it is known that any iterative methods uses only the gradient information cannot minimize this function faster than the rate 1Θ(β1/2)1-\Theta({\beta}^{-1/2}).

In figure 4, we see that every method except SD converge in the same rate with different constants for the first nn iterations. However, after Θ(n)\Theta(n) iterations, both SD and AFG continue to converge in the rate the theory predicted while other methods converge much faster. We remark that the memory size of L-BFGS we are using is 100 and if in the right example we choose n=100n=100 instead of 200200, L-BFGS will converge at n=100n=100 immediately. It is surprising that the AFGwR and GeoD can achieve a comparable result by using “memory size” being 1.

References