Regularized Nonlinear Acceleration

Damien Scieur, Alexandre d'Aspremont, Francis Bach

Introduction

Suppose we seek to solve the following optimization problem

Since the publication of Nesterov’s optimal first-order smooth convex minimization algorithm [Nesterov, 1983], significant efforts have been focused on either providing more interpretable views on current acceleration techniques, or on replicating these complexity gains using different, more intuitive schemes. Early efforts sought to directly extend the original acceleration result in [Nesterov, 1983] to broader function classes [Nemirovskii and Nesterov, 1985], allow for generic metrics, line searches, produce simpler proofs [Beck and Teboulle, 2009; Nesterov, 2013] or adaptive accelerated algorithms [Nesterov, 2015], etc. More recently however, several authors [Drori and Teboulle, 2014; Lessard et al., 2016] have started using classical results from control theory to obtain numerical bounds on convergence rates that match the optimal rates. Others have studied the second order ODEs obtained as the limit for small step sizes of classical accelerated schemes, to better understand their convergence [Su et al., 2014; Wibisono and Wilson, 2015]. Finally, recent results have also shown how to wrap classical algorithms in an outer optimization loop, to accelerate convergence and reach optimal complexity bounds [Lin et al., 2015] on certain structured problems.

Here, we take a significantly different approach to convergence acceleration stemming from classical results in numerical analysis. We use the iterates produced by any (converging) optimization algorithm, and estimate the solution directly from this sequence, assuming only some regularity conditions on the function to minimize. Our scheme is based on the idea behind Aitken’s Δ2\Delta^{2}-algorithm [Aitken, 1927], generalized as the Shanks transform [Shanks, 1955], whose recursive formulation is known as the ε\varepsilon-algorithm [Wynn, 1956] (see e.g. [Brezinski, 2006; Sidi et al., 1986] for a survey). In a nutshell, these methods fit geometrical models to linearly converging sequences, then extrapolate their limit from the fitted model.

In a sense, this approach is more statistical in nature. It assumes an approximately linear model holds for iterations near the optimum, and estimates this model using the iterates. In fact, Wynn’s algorithm [Wynn, 1956] is directly connected to the Levinson-Durbin algorithm [Levinson, 1949; Durbin, 1960] used to solve Toeplitz systems recursively and fit autoregressive models (the Shanks transform solves Hankel systems, but this is essentially the same problem [Heinig and Rost, 2011]). The key difference in these extrapolation techniques is that estimating the autocovariance operator AA is not required, as we only focus on the limit. Moreover, the method presents strong links with the conjugate gradient when applied to unconstrained quadratic optimization, but does not further calls to the operator.

We start from a formulation of these techniques known as Anderson Acceleration [Anderson, 1965], Mešina’s Algorithm [Mešina, 1977] or minimal polynomial extrapolation (MPE) [Sidi et al., 1986; Smith et al., 1987]. They use the minimal polynomial of the linear operator driving iterations to estimate the optimum by a nonlinear average of the iterates (i.e. computing a weighted average using weights which are nonlinear functions of the iterates).

Our contribution here is to regularize this procedure and produce explicit bounds on the distance to optimality by controlling stability, thus explicitly quantifying acceleration. We show that these extrapolation algorithms reach optimal performance (asymptotically) and describe several numerical examples where these stabilized estimates often speed up convergence by an order of magnitude. So far, for all the techniques cited above, no proofs of convergence of the estimates were given when the estimation process became unstable. Furthermore, the acceleration scheme runs in parallel with the original algorithm, providing improved estimates of the solution on the fly, while the original method is progressing, so its numerical complexity is marginal.

The paper is organized as follows. In Section 2 we recall basic results behind the acceleration for linear iterations. Then, in Section 3, we generalize these results to nonlinear iterations and show how to fully control the impact of nonlinearity. We use these results to derive explicit bounds on the acceleration performance of our estimates. In Section 4 we connect the acceleration methods to the conjugate gradient method and Nesterov’s method. Finally, we present numerical results in Section 5.

Convergence Acceleration

Suppose g(x)g(x) is differentiable and let GG be the Jacobian of gg evaluated at xx^{*}. In the rest of the paper, we assume GG to be symmetric, positive semi-definite and GσIG\preceq\sigma I, with σ<1\sigma<1. Equation (FPI) becomes

By neglecting the second order term, and because g(x)=xg(x^{*})=x^{*}, we obtain the linear fixed-point iteration

Because G2σ<1\|G\|_{2}\leq\sigma<1, the iterates xkx_{k} converges to xx^{*} at a linear rate, with

where \|\cdot\| stands for the Euclidean norm here and throughout the paper. We will now see how to improve convergence rates using a linear combination of the previous iterates.

Suppose we run kk iterations of (LFPI), a linear combination of iterates xix_{i} with coefficients cic_{i} reads

we can write (3) more concisely in terms of the matrix polynomial p(G)p(G), setting p(1)=i=0kci=1p(1)=\sum_{i=0}^{k}c_{i}=1 without loss of generality, to get

Ideally, we need to find cc (or equivalently pp) which minimizes the error term p(G)(x0x)p(G)(x_{0}-x^{*}). We will study the error when linearly combining the last k+1k+1 iterates xix_{i}, assuming we have an algorithm computing this optimal combination, i.e.

The next proposition produces an uniform bound on the value of this error using Chebyshev polynomials.

where mm is the number of distinct eigenvalues of GG and

Proof. Because GG is symmetric, it admits the eigenvalue decomposition

First, assume kmk\geq m. The Cayley-Hamilton theorem means that if p(x)p(x) is the characteristic polynomial of GG, then p(G)=p(Λ)=0p(G)=p(\Lambda)=0. By assumption none of the λi\lambda_{i} is equal to 1 (we assumed λi[0,σ]\lambda_{i}\in[0,\sigma] with σ<1\sigma<1) so we can normalize pp so that p(1)=1p(1)=1 and p(λi)=0p(\lambda_{i})=0 for all i=1,,mi=1,\ldots,m.

We now assume k<mk<m. We have, for any polynomial pp,

Because 0Gσ0\preceq G\preceq\sigma, we have 0λiσ0\leq\lambda_{i}\leq\sigma, so the error bound becomes

where the right hand side involves a minmax problem, explicitly solved using Chebyshev’s polynomials. Let CkC_{k} be the Chebyshev polynomial of degree kk. By definition, CkC_{k} is a monic polynomial (i.e. a polynomial whose leading coefficient is one) solving

Golub and Varga use a variant of Ck(x)C_{k}(x) to solve the problem in (7), whose solution is a rescaled Chebyshev polynomial given by

where t(x,σ)t(x,\sigma) is simply a linear mapping from interval [0,σ][0,\sigma] to $$. Moreover, they show

Injecting the result of (9) in (7) yields the desired result.

In the case of the gradient method applied on quadratic function with eigenvalues bounded in the interval [μ,L][\mu,L] (this correspond also to a LL-smooth and μ\mu-strongly convex function), we have

The proof of this proposition suggests Tk(x,σ)T_{k}(x,\sigma) is a good universal solution for the convergence acceleration problem and we plot both T3(x,σ)T_{3}(x,\sigma) and T5(x,σ)T_{5}(x,\sigma) for xx\in and σ=0.85\sigma=0.85 in Figure 1. This solution is called the Chebyshev semi-iterative method in [Golub and Varga, 1961] and was further studied by e.g. [Nemirovskiy and Polyak, 1984]. Combining the kk iterates xix_{i} using the coefficients cic_{i} in Tk(x,σ)T_{k}(x,\sigma), we ensure

which means convergence is indeed accelerated. However, this method has some key drawbacks. First, we need to know σ\sigma to form Tk(x,σ)T_{k}(x,\sigma), which is not always the case. For example, in the case of the gradient method, σ\sigma depends on the smoothness constant LL and the strong convexity constant μ\mu. In the general non-linear case, σ\sigma depends on the spectrum of the Jacobian at the optimum, which is clearly not observed. Second, the algorithm does not allow us to control the magnitude of the coefficients in the polynomial T(x,σ)T(x,\sigma), which has a strong impact on the stability of the algorithm in the presence of numerical errors, or when the iterates are generated by a non-linear function gg.

Because of stability issues with Chebyshev acceleration, we focus now on a method which will approximately minimize the error p(G)(x0x)2\|p(G)(x_{0}-x^{*})\|_{2}. Since we of course do not observe GG and xx^{*} we will work with the residuals

when gg is a linear function (LFPI) this becomes

A linear combination of residuals rir_{i} with coefficients cic_{i} is written

We recognize the error term we wanted to minimize, multiplied by the matrix (GI)(G-I). Using the coefficients which minimize this alternative quantity will approximately minimize the error, as stated in the following proposition.

whose coefficients, written cc^{*}, satisfy

The iterates xix_{i} defined in (LFPI) averaged with coefficients cc^{*} satisfy

where we have assumed 0GσI0\preceq G\preceq\sigma I, with σ<1\sigma<1.

Proof. By definition of cc^{*}, and using (LFPI),

We can bound this last error term because GI1\|G-I\|\leq 1,

Using the fact that (GI)111σ\|(G-I)^{-1}\|\leq\frac{1}{1-\sigma} yields the desired result.

This leads to the following acceleration algorithm.

This acceleration algorithm is called nonlinear because the coefficients cic_{i} vary with of xix_{i}. This method is also known as Anderson acceleration [Anderson, 1965], the Eddy-Mesina algorithm [Mešina, 1977; Eddy, 1979], Minimal Polynomial Extrapolation [Cabay and Jackson, 1976], or Reduced Rank Extrapolation [Sidi et al., 1986; Smith et al., 1987]. There are small variations between all these methods, which lie in the way they solve the minimization problem in (12). The next proposition gives us an explicit solution, involving the inversion of a kk-by-kk matrix.

Proof. Let μ\mu be the dual variable of the equality constraint. Both cc^{*} and μ\mu^{*} should satisfy the KKT system

This block matrix can be inverted explicitly, with

Using this inverse we easily solve the linear system, which gives the result in (15).

In practice of course, instead of computing the inverse of the matrix RTRR^{T}R, we solve the linear system

then set c=z/(1Tz)c^{*}=z/(\mathbf{1}^{T}z). This formula is used in Anderson Acceleration algorithm and Mesina method. Other algorithms usually force the coefficient ckc_{k} to be equal to one, solve the remaining linear system, then normalize the vector. However, these alternative strategies are harder to analyze when the iterates are generated by a non-linear function gg. We will now apply this acceleration algorithm on gradient method for nonlinear functions and compute its rate of convergence.

Regularized Nonlinear Acceleration of Convergence

So far, we have only considered linear functions gg in (LFPI), without perturbations, when computing the iterates xix_{i}. In general, the fixed-point iteration (FPI) is usually generated by a nonlinear function gg, thus inducing a second order error term in O(xix2)O(\|x_{i}-x^{*}\|^{2}) compared to the dynamics in (LFPI).

Here, in §3.1 we first give a bound on the deviation error when there are perturbations in (LFPI). In §3.2 we then derive a regularized version of Algorithm 1 which better controls the impact of perturbations. We then study the impact of regularization on the solution when there are no perturbations in §3.3. Finally, in §3.4 we gather the results of the previous sections to bound the rate of convergence of the regularized acceleration algorithm.

We now study the sensitivity of the acceleration algorithm to perturbations. Consider the following perturbed linear fixed point iteration

where eie_{i} is the noise injected in xi+1x_{i+1} at iteration ii. For now, we do not assume any structure on the noise, so eie_{i} may be the nonlinearity of gg, stochastic noise, roundoff error, etc. The iterates of this process will be compared to their noiseless counterpart,

The next proposition describes the sensitivity of Algorithm 1 using RR and PP.

Indeed, using the definition of cc^{*} and μ\mu^{*} in (16),

With this simplification, the system becomes

The explicit solution in obtained by inverting the block matrix, and is written

Because the first factor is the norm of a projector of rank k1k-1, its value is bounded by 11, so we get the desired result.

2. Regularized Nonlinear Acceleration of Convergence

In this section, we will analyze the following acceleration algorithm, which uses Tikhonov regularization to solve the linear system in (15).

Regularization controls the norm of the coefficients produced by the algorithm and reduces the impact of perturbations, as shown in the following proposition.

Proof. Using the same proof technique of Propositions 2.4 and 3.1, we have

Regularization allows a better control of the impact of perturbations, but also changes the solution cc^{*} into cλc^{*}_{\lambda}. The next section analyses the impact of regularization on the extrapolated solution when there are no perturbations.

3. Regularized Chebyshev Polynomial

The previous section shows that regularization is important for the control of the perturbations present in (Pert. LFPI). However, the convergence analysis becomes more complicated in the perturbation-free case, and we introduce regularized Chebyshev polynomials.

The regularized Chebyshev polynomial Cσ(x,k,α)C^{*}_{\sigma}(x,k,\alpha) of degree kk, range σ\sigma and regularization parameter α\alpha is defined as the solution of

Using this specific polynomial we can now bound the accuracy of the extrapolated point using the regularized algorithm.

Let cλc_{\lambda}^{*} be the output of Algorithm 2 using the sequence xix_{i} generated by (LFPI) (with Gσ<1\|G\|\leq\sigma<1) and the parameter λ>0\lambda>0. The accuracy of the extrapolation is bounded by

Proof. Consider the optimization problem in Algorithm 2,

Since ri=(GI)(xix)=(GI)Gi(x0x)r_{i}=(G-I)(x_{i}-x^{*})=(G-I)G^{i}(x_{0}-x^{*}), if we use the polynomial pp with coefficients cc, the problem becomes

Because MM is symmetric, we only need to look at its eigenvalues which are inside the segment [0,σ][0,\sigma],

It remains to link the optimization problem to the accuracy of the extrapolation. Indeed,

We proved in (27) that this quantity can be bounded by Sσ2(k,λˉ)x0x2S^{2}_{\sigma}(k,\bar{\lambda})\|x_{0}-x^{*}\|^{2}, so we finally have

Regularized Chebyshev polynomials are crucial for the bound on the accuracy of Algorithm 2. Unfortunately, there is no explicit expressions for Sσ(k,α)S_{\sigma}(k,\alpha) in (24). However, this value can be computed numerically using sum-of-squares optimization. We show in Figure 2 the difference of performances when using the coefficients of the regularized Chebyshev polynomial instead of its non-regularized version.

which means that checking if a polynomial is non-negative on the real line is equivalent to solving a linear matrix inequality (see e.g. [Ben-Tal and Nemirovski, 2001, §4.2] for details). We can thus write the problem of computing the maximum of a polynomial over the real line as

is non-negative on the real line, or equivalently, that the polynomial

is non-negative on the real line. Overall, this implies that problem (24) can be written

4. Convergence rate

We will now prove global accuracy bounds, using the following decomposition of the error term,

where κ>1\kappa>1 with (GI)111σ=κ\|(G-I)^{-1}\|\leq\frac{1}{1-\sigma}=\kappa.

Proof. The proof is divided into four parts, where the three first parts bound each term of (30) and the last one combines everything. The bound on the first term comes explicitly from Proposition 3.4, with

We finally combine the bounds (31), (32) and (33) according to the decomposition (30), to get

Here, cλ\|c_{\lambda}^{*}\| appears twice in the expression. We remove it by maximizing the bound over cλ\|c_{\lambda}^{*}\|. The first two terms of (34) can be written

with a=Sσ2(k,λˉ)x0x2a=S^{2}_{\sigma}(k,\bar{\lambda})\|x_{0}-x^{*}\|^{2} and b=XˉPλb=\|\bar{X}\|\frac{\|P\|}{\lambda}. By Proposition A.1 (in the Appendix), its maximum value is equal to

The bound on extrapolation accuracy in (34) now becomes

We can further simplify the bound above by bounding P\|P\| using σ\sigma, E\mathcal{E} and Xˉ\|\bar{X}\|.

Let PP the perturbation matrix defined in (17). Then

where E\mathcal{E} is defined in Theorem 3.5, RR is the matrix of residuals for the sequence xix_{i} generated by (LFPI), for Gσ\|G\|\leq\sigma.

Since r0=(GI)(x0x)r_{0}=(G-I)(x_{0}-x^{*}), and GI1\|G-I\|\leq 1, we have r0x0x\|r_{0}\|\leq\|x_{0}-x^{*}\|. Injecting this result in the previous bound gives the desired result,

It remains to bound Δ\|\Delta\|. Consider Xˉ\bar{X}, where each column of Xˉ=(xixˉ)\bar{X}=(x_{i}-\bar{x}) for some point xˉ\bar{x}. Then we can build RR from XX,

By identification, we have Δ=ED\Delta=\mathcal{E}D, so ΔDE2E\|\Delta\|\leq\|D\|\|\mathcal{E}\|\leq 2\|\mathcal{E}\|.

Assuming again Gσ\|G\|\leq\sigma, the following propositions bound Xˉ\|\bar{X}\| when xˉ=x\bar{x}=x^{*}.

Let Xˉ\bar{X} be the matrix built with the columns Xˉi=xixˉ\bar{X}_{i}=x_{i}-\bar{x}, where the sequence xix_{i} is generated by (LFPI) and xˉ=x\bar{x}=x^{*}. If Gσ\|G\|\leq\sigma, where GG is the matrix present in (LFPI), the norm of Xˉ\bar{X} is bounded by

Proof. Since each column of Xˉ\bar{X} correspond to xixx_{i}-x^{*},

5. Accelerating Gradient Descent

When using gradient method on a μ\mu-strongly convex, LL-smooth function with a Lipschitz-continuous Hessian with constant MM, we have the following bounds,

where σ=1μL\sigma=1-\frac{\mu}{L} satisfies Gσ\|G\|\leq\sigma.

Using these expressions, we can compare convergence rates between convergence acceleration in Algorithm 2 and Nesterov’s method. In Figure 3 we illustrate the difference on a particular instance where x0x=104\|x_{0}-x^{*}\|=10^{-4}, L=1L=1, μ=M=0.1\mu=M=0.1. We see that, despite the highly conservative nature of this bound, for small kk at least our method is faster than Nesterov’s acceleration.

When using the gradient method, this result bounds all quantities present in Theorem 3.5 as a function of μ,L,M\mu,\,L,\,M and x0x\|x_{0}-x^{*}\|. Asymptotically, i.e. when x0x0{\|x_{0}-x^{*}\|\rightarrow 0} and we are starting close enough to the optimal point, we show that we recover the acceleration rate for linear sequences in Proposition 2.3 if the regularization parameter λ\lambda il well-chosen.

Let λ=O(x0xs)\lambda=O(\|x_{0}-x^{*}\|^{s}) for some scalar ss. The bound of Theorem 3.5 normalized by x0x\|x_{0}-x^{*}\| becomes

If 4s<04-s<0, clearly the last terms vanishes when x0x0\|x_{0}-x^{*}\|\rightarrow 0. It remains to analyze

If s]2,8/3[s\in]2,8/3[ then s2>0s-2>0 and 83s>08-3s>0, implying (x0xs2)0(\|x_{0}-x^{*}\|^{s-2})\rightarrow 0 and O(x0x83s)0O(\|x_{0}-x^{*}\|^{8-3s})\rightarrow 0 when x0x0\|x_{0}-x^{*}\|\rightarrow 0. The bound finally becomes

However, Sσ(k,0)S_{\sigma}(k,0) is exactly equal to the maximum value of the rescaled (non-regularized) Chebyshev polynomial Tk(x,σ)T_{k}(x,\sigma), so by equation (9),

In other words, the result above means that the bound of Theorem 3.5 tends to be the bound of Proposition 2.3 (for the case where k<mk<m) and, asymptotically, we recover the optimal rate of convergence in [Nesterov, 2013]. In fact, the result may also hold for other kinds of methods, because the proof only needs

These assumptions are not too restrictive, and are often encountered when using deterministic, twice-differentiable, linearly convergent iterations gg in (FPI).

In practice of course, xx\|x-x^{*}\| is unknown and the regularization parameter λ\lambda should decrease fast enough to ensure Sσ(k,λˉ)Sσ(k,0)S_{\sigma}(k,\bar{\lambda})\rightarrow S_{\sigma}(k,0), but not too fast otherwise the algorithm becomes unstable. An adaptive strategy thus ensures a good convergence rate, which is what we detail next.

6. Adaptive regularization

The major problem of the regularized Algorithm 2 is the presence of the parameter λ\lambda, unknown in advance. Of course, one can use the bound in Theorem 3.5 to search the best λ\lambda, but this requires a lot of information on the problem, like the constants L,μL,\,\mu and MM as well as the distance to the optimum x0x\|x_{0}-x^{*}\|. Moreover, the bound is extremely pessimistic and does not correspond to the good numerical performances of the algorithm.

To avoid this problem we use adaptive strategy to find λ\lambda, based on grid search, which requires kk additional calls to f(x)f(x). In comparison, we also need to call kk times the oracle for common adaptive strategy in the (accelerated) gradient method. For example, the backtracking line-search over the constant LL requires the evaluation of f(xi)f(x_{i}) at each iteration i=1...ki=1...k.

Finally, the introduction of the regularization parameter introduces some damping in the acceleration algorithm, in the sense that the step length xextr(λ)x0x_{\text{extr}}(\lambda)-x_{0} is reduced with higher values of λ\lambda. A simple line search over the step-size, which consists in finding a good scalar tt which minimizes the function, solving

significantly improves the solution. Nevertheless, this requires further calls to f(x)f(x), and an inexact line-search is usually preferable. We start with t=1t=1, then multiply the value by two until the objective function increases,

In our numerical experiments, this line-search dramatically increases acceleration performances.

7. Computational Complexity of Convergence Acceleration

We can explicitly solve this system in variables aa and bb, and the solutions are

7.2. Batch mode.

Extensions & Links with other Methods

We can extend our results to smooth functions that are not strongly convex using a simple regularization trick which we trace back at least to [Hazan, 2014]. Suppose we seek to solve

using the smoothness of fε(x)f_{\varepsilon}(x) and writing xεx_{\varepsilon}^{*} the optimum of problem (38). It suffices to optimize fεf_{\varepsilon} up to ε/2\varepsilon/2 to find an ε\varepsilon-solution for the original problem. The linear convergence of gradient [Nesterov, 2013] algorithms guarantees

The number of iterations required to reach a target precision ε/2\varepsilon/2 is thus bounded by

while accelerated algorithms have r=1ε/(LD2+ε)r=1-\sqrt{\varepsilon/(LD^{2}+\varepsilon)} which yields

Up to a logarithmic constant, these upper bounds match the complexity of gradient and accelerated gradient methods. Overall, an algorithm for strongly convex function used with this regularization trick recovers an ε\varepsilon-approximated solution. This means we can always reduce a not strongly convex problem to (1), where our acceleration analysis applies.

2. Convergence Acceleration on Gradient Method for Quadratic Functions

This formulation is equivalent to f=Axb\nabla f=Ax-b, where b=Axb=Ax^{*} but it will be more convenient in this section to manipulate directly xx^{*}. Let μIALI\mu I\preceq A\preceq LI so that the function ff is strongly convex of constant μ\mu and smooth of constant LL. If we use the fixed-step gradient method, with step-size 1/L1/L,

Notice that g(xk+1)g(x_{k+1}) and (39) are equivalent. The Jacobian of gg is thus equal to (IA/L)(I-A/L). We have the following bounds on GG,

By consequence, σ=1μL\sigma=1-\frac{\mu}{L}, thus the rate of convergence of our method is linear and the bound is

However, if we use Algorithm (1), we combine the iterates xix_{i} with coefficients cc^{*} (computed by formula (15)). By equations (5) and (13) the accuracy of this extrapolation is bounded by

This bound matches the rate obtained using the optimal method in [Nesterov, 2013]. Outside of the normalization constraint, this is very similar to the convergence analysis of Lanczos’ method.

3. Convergence acceleration versus conjugate gradient

The rate of convergence obtained above also matches that of the conjugate gradient within a factor L/μL/\mu. Indeed, the acceleration algorithm has a strong link with the conjugate gradient. Denote vM=vTMv\|v\|_{M}=\sqrt{v^{T}Mv} the norm induced by the positive definite matrix MM. Also, assume we want to solve Ax=bAx=b using conjugate gradient method (where AA is assumed to be symmetric and positive definite). By definition, at the kk-th iteration, the conjugate gradient computes an approximation of xx^{*} which follows

where Kk=\mboxspan{b,Ab,...,Ak1b}=\mboxspan{Ax,A2x,...,Akx}\mathcal{K}_{k}=\mbox{{span}}\{b,Ab,...,A^{k-1}b\}=\mbox{{span}}\{Ax^{*},A^{2}x^{*},...,A^{k}x^{*}\} is called a Krylov subspace. Since the constraint xKkx\in\mathcal{K}_{k} impose us to build xx from a linear combination of the basis of Kk\mathcal{K}_{k}, we can write

where q(x)q(x) is a polynomial of degree kk and q(1)=0q(1)=0. So the conjugate gradient method solves

which is very similar to the equations in (2.3). However, while conjugate gradient has access to an oracle giving the result of the product between AA and any vector vv, the acceleration algorithm can only use the iterations produced by (LFPI), so it does not require the knowledge of AA. Moreover, the convergence of conjugate gradient is analyzed in another norm (A\|\cdot\|_{A} instead of 2\|\cdot\|_{2}), which explains why a condition number appears in the bound (40).

Analysis of convergence on conjugate gradient often use Chebyshev’s polynomial, like the acceleration algorithm (1). We will now see that Nesterov’s algorithm generates also a polynomial, making the convergence analysis for quadratics easier.

4. Chebyshev’s Acceleration and Nesterov’s Accelerated Gradient Method

In Proposition 2.1, we bounded the rate of convergence of Algorithm 1 using Chebyshev polynomials. In fact, this is exactly the idea behind Chebyshev’s semi-iterative method, which uses these coefficients in order to accelerate gradient descent on quadratic functions. Here, we present Chebyshev semi-iterative acceleration and its analysis, then use the same arguments on Nesterov’s method. These points were also discussed in [Hardt, 2013].

Assume as above that we use the gradient method to minimize a quadratic function, we get the recurrence (39). We see easily that

Since G21μL=σ\|G\|_{2}\leq 1-\frac{\mu}{L}=\sigma, the rate of convergence is xkx2σkx0x2\|x_{k}-x^{*}\|_{2}\leq\sigma^{k}\|x_{0}-x^{*}\|_{2}. Moreover, if we average the vectors xix_{i} using coefficients cic_{i} (with unitary sum) from 0 to kk, we get

where TkT_{k} and tσt_{\sigma} are also defined in (8). Furthermore, the Chebyshev polynomials can be constructed using a three-terms recurrence

This scheme looks very similar to Nesterov’s accelerated gradient method, which reads

Compared with Chebyshev acceleration, Nesterov’s scheme is iteratively building a polynomial Nk(x)N_{k}(x) with yky=Nk(G)(y0x)y_{k}-y^{*}=N_{k}\left(G\right)(y_{0}-x^{*}). If we replace zkz_{k} by its definition in the expression of yky_{k} in the Nesterov’s scheme we get the following recurrence of order two

We can extract the polynomial NkN_{k}, which reads

with initial conditions N0(x)=1N_{0}(x)=1 and N1(x)=xN_{1}(x)=x. Notice that Nk(1)=1N_{k}(1)=1 for all kk.

When minimizing smooth strongly convex functions with Nesterov’s method, we use

Moreover, empirically at least, the maximum value of Nk(x)N_{k}(x) in the interval [0,σ][0,\sigma] is Nk(σ)N_{k}(\sigma). We conjecture that this always holds. We thus have the following recurrence

To get linear convergence with rate rr, we need NkrNk1r2Nk2N_{k}\leq rN_{k-1}\leq r^{2}N_{k-2}, or again

We have that Nesterov’s coefficients and rate, i.e. β=(1μ/L)/(1+μ/L)\beta={(1-\sqrt{\mu/L})}/{(1+\sqrt{\mu/L})} and r=(1μ/L)r=(1-\sqrt{\mu/L}), satisfy this condition, showing that Nesterov’s method converges with a rate at least r=(1μ/L)r=(1-\sqrt{\mu/L}) on quadratic problems. This provides an alternate proof of Nesterov’s acceleration result on these problems using Chebyshev polynomials (provided the conjecture on N(σ)N(\sigma) holds).

Numerical Experiments

In this section, we evaluate the performance of the adaptive acceleration methods without/with line-search on the step size, described in Algorithm 3.

We begin by testing our methods on a regularized logistic regression problem written

Fixed-step gradient method for smooth strongly convex functions [Nesterov, 2013, Th. 2.1.15]

Accelerated gradient method for smooth strongly convex functions [Nesterov, 2013, Th. 2.2.3]

The accelerated gradient method with backtracking line-search on the parameter LL.

The Adaptive acceleration algorithm 3 on kk iterations of gradient descent without line search (written RNA k).

The Black-box acceleration algorithm 3 (written RNA k + LS) on kk iterations of gradient descent.

The matrix ZZ is build using datasets Sonar (60 features, 208 points), Madelon (500 features, 4400 points) or Sido0 (4932 features, 12678 points), concatenated with a column of ones. The optimization is done on the raw data, i.e. without normalization. The starting point is always w0=0w_{0}=0.

Conclusion and Perspectives

References

Acknowledgements

AA is at the département d’informatique de l’ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA Sierra project-team. The authors would like to acknowledge support from a starting grant from the European Research Council (ERC project SIPA), from the ITN MacSeNet (project number 642685), as well as support from the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche, and from a Google focused award.

Appendix A Missing propositions and proofs

defined for x[0,a/λ]x\in[0,\sqrt{a/\lambda}]. The its maximal value is attained at

and its maximal value is thus, if xopt[0,a/λ]x_{\text{opt}}\in[0,\sqrt{a/\lambda}],

Proof. The (positive) root of the derivative of ff follows

If we inject the solution in our function, we obtain its maximal value,

The simplification with λ\lambda in the last equality concludes the proof.

A.2. Proof of proposition 3.8

First, we show that the choice σ=1μL\sigma=1-\frac{\mu}{L} satisfies G=g(x)σ\|G\|=\|g^{\prime}(x^{*})\|\leq\sigma. Our fixed-point function gg reads

Since g(x)=I1Lf(x)g^{\prime}(x)=I-\frac{1}{L}f^{\prime\prime}(x), we have g(x)=I1Lf(x)g^{\prime}(x^{*})=I-\frac{1}{L}f^{\prime\prime}(x^{*}). Because ff is μ\mu-strongly convex, f(x)μIf^{\prime\prime}(x)\succeq\mu I, in particular at x=xx=x^{*}. In conclusion,

In the last inequality, we used the fact that ff is LL-Lipschitz, so f(x)f(x)Lxx\|f(x)-f(x^{*})\|\leq L\|x-x^{*}\|. It is also possible to prove [Nesterov, 2013] that gradient method converges at rate

Since our function has a Lipschitz-continuous Hessian, it is possible to show that (Nesterov , Lemma 1.2.4)

We can thus bound the norm of the error at the ithi^{\text{th}} iteration,

By equation (42), and because g(x)σ\|g^{\prime\prime}(x^{*})\|\leq\sigma, we have

The simplification in the last line greatly simplifies future computations. We thus have the bound

Despite the simplification made earlier, the results of this bounds are close to the one obtained without simplification.