A Universal Catalyst for First-Order Optimization

Hongzhou Lin, Julien Mairal, Zaid Harchaoui

Introduction

where ff is convex and has Lipschitz continuous derivatives with constant LL and ψ\psi is convex but may not be differentiable. The variable xx represents model parameters and the role of ff is to ensure that the estimated parameters fit some observed data. Specifically, ff is often a large sum of functions

Our goal is to accelerate gradient-based or first-order methods that are designed to solve (1), with a particular focus on large sums of functions (2). By “accelerating”, we mean generalizing a mechanism invented by Nesterov that improves the convergence rate of the gradient descent algorithm. More precisely, when ψ=0\psi=0, gradient descent steps produce iterates (xk)k0(x_{k})_{k\geq 0} such that F(xk)F=O(1/k)F(x_{k})-F^{*}=O(1/k), where FF^{*} denotes the minimum value of FF. Furthermore, when the objective FF is strongly convex with constant μ\mu, the rate of convergence becomes linear in O((1μ/L)k)O((1-{\mu}/{L})^{k}). These rates were shown by Nesterov to be suboptimal for the class of first-order methods, and instead optimal rates—O(1/k2)O(1/k^{2}) for the convex case and O((1μ/L)k)O((1-\sqrt{{\mu}/{L}})^{k}) for the μ\mu-strongly convex one—could be obtained by taking gradient steps at well-chosen points. Later, this acceleration technique was extended to deal with non-differentiable regularization functions ψ\psi .

For modern machine learning problems involving a large sum of nn functions, a recent effort has been devoted to developing fast incremental algorithms that can exploit the particular structure of (2). Unlike full gradient approaches which require computing and averaging nn gradients f(x)=(1/n)i=1nfi(x)\nabla f(x)=(1/n)\sum_{i=1}^{n}\nabla f_{i}(x) at every iteration, incremental techniques have a cost per-iteration that is independent of nn. The price to pay is the need to store a moderate amount of information regarding past iterates, but the benefit is significant in terms of computational complexity.

Our main achievement is a generic acceleration scheme that applies to a large class of optimization methods. By analogy with substances that increase chemical reaction rates, we call our approach a “catalyst”. A method may be accelerated if it has linear convergence rate for strongly convex problems. This is the case for full gradient and block coordinate descent methods , which already have well-known accelerated variants. More importantly, it also applies to incremental algorithms such as SAG , SAGA , Finito/MISO , SDCA , and SVRG . Whether or not these methods could be accelerated was an important open question. It was only known to be the case for dual coordinate ascent approaches such as SDCA or SDPC for strongly convex objectives. Our work provides a universal positive answer regardless of the strong convexity of the objective, which brings us to our second achievement.

Some approaches such as Finito/MISO, SDCA, or SVRG are only defined for strongly convex objectives. A classical trick to apply them to general convex functions is to add a small regularization εx2\varepsilon\|x\|^{2} . The drawback of this strategy is that it requires choosing in advance the parameter ε\varepsilon, which is related to the target accuracy. A consequence of our work is to automatically provide a direct support for non-strongly convex objectives, thus removing the need of selecting ε\varepsilon beforehand.

The approach Finito/MISO, which was proposed in and , is an incremental technique for solving smooth unconstrained μ\mu-strongly convex problems when nn is larger than a constant βL/μ\beta L/\mu (with β=2\beta=2 in ). In addition to providing acceleration and support for non-strongly convex objectives, we also make the following specific contributions: \bullet we extend the method and its convergence proof to deal with the composite problem (1); \bullet we fix the method to remove the “big data condition” nβL/μn\geq\beta L/\mu. The resulting algorithm can be interpreted as a variant of proximal SDCA with a different step size and a more practical optimality certificate—that is, checking the optimality condition does not require evaluating a dual objective. Our construction is indeed purely primal. Neither our proof of convergence nor the algorithm use duality, while SDCA is originally a dual ascent technique.

The catalyst acceleration can be interpreted as a variant of the proximal point algorithm , which is a central concept in convex optimization, underlying augmented Lagrangian approaches, and composite minimization schemes . The proximal point algorithm consists of solving (1) by minimizing a sequence of auxiliary problems involving a quadratic regularization term. In general, these auxiliary problems cannot be solved with perfect accuracy, and several notations of inexactness were proposed, including . The catalyst approach hinges upon (i) an acceleration technique for the proximal point algorithm originally introduced in the pioneer work ; (ii) a more practical inexactness criterion than those proposed in the past.Note that our inexact criterion was also studied, among others, in , but the analysis of led to the conjecture that this criterion was too weak to warrant acceleration. Our analysis refutes this conjecture. As a result, we are able to control the rate of convergence for approximately solving the auxiliary problems with an optimization method M\mathcal{M}. In turn, we are also able to obtain the computational complexity of the global procedure for solving (1), which was not possible with previous analysis . When instantiated in different first-order optimization settings, our analysis yields systematic acceleration.

Beyond , several works have inspired this paper. In particular, accelerated SDCA is an instance of an inexact accelerated proximal point algorithm, even though this was not explicitly stated in . Their proof of convergence relies on different tools than ours. Specifically, we use the concept of estimate sequence from Nesterov , whereas the direct proof of , in the context of SDCA, does not extend to non-strongly convex objectives. Nevertheless, part of their analysis proves to be helpful to obtain our main results. Another useful methodological contribution was the convergence analysis of inexact proximal gradient methods of . Finally, similar ideas appear in the independent work . Their results overlap in part with ours, but both papers adopt different directions. Our analysis is for instance more general and provides support for non-strongly convex objectives. Another independent work with related results is , which introduce an accelerated method for the minimization of finite sums, which is not based on the proximal point algorithm.

The Catalyst Acceleration

We present here our generic acceleration scheme, which can operate on any first-order or gradient-based optimization algorithm with linear convergence rate for strongly convex objectives.

where FF^{*} denotes the minimum value of FF. The quantity τM,F\tau_{\mathcal{M},F} controls the convergence rate: the larger is τM,F\tau_{\mathcal{M},F}, the faster is convergence to FF^{*}. However, for a given algorithm M\mathcal{M}, the quantity τM,F\tau_{\mathcal{M},F} depends usually on the ratio L/μL/\mu, which is often called the condition number of FF.

Our approach can accelerate a wide range of first-order optimization algorithms, starting from classical gradient descent. It also applies to randomized algorithms such as SAG, SAGA, SDCA, SVRG and Finito/MISO, whose rates of convergence are given in expectation. Such methods should be contrasted with stochastic gradient methods , which minimize a different non-deterministic function. Acceleration of stochastic gradient methods is beyond the scope of this work.

We now highlight the mechanics of the catalyst algorithm, which is presented in Algorithm 1. It consists of replacing, at iteration kk, the original objective function FF by an auxiliary objective GkG_{k}, close to FF up to a quadratic term:

where κ\kappa will be specified later and yky_{k} is obtained by an extrapolation step described in (6). Then, at iteration kk, the accelerated algorithm A\mathcal{A} minimizes GkG_{k} up to accuracy εk\varepsilon_{k}.

Substituting (4) to (1) has two consequences. On the one hand, minimizing (4) only provides an approximation of the solution of (1), unless κ=0\kappa=0; on the other hand, the auxiliary objective GkG_{k} enjoys a better condition number than the original objective FF, which makes it easier to minimize. For instance, when M\mathcal{M} is the regular gradient descent algorithm with ψ=0\psi=0, M\mathcal{M} has the rate of convergence (3) for minimizing FF with τM,F=μ/L\tau_{\mathcal{M},F}=\mu/L. However, owing to the additional quadratic term, GkG_{k} can be minimized by M\mathcal{M} with the rate (3) where τM,Gk=(μ+κ)/(L+κ)>τM,F\tau_{\mathcal{M},G_{k}}=(\mu+\kappa)/(L+\kappa)>\tau_{\mathcal{M},F}. In practice, there exists an “optimal” choice for κ\kappa, which controls the time required by M\mathcal{M} for solving the auxiliary problems (4), and the quality of approximation of FF by the functions GkG_{k}. This choice will be driven by the convergence analysis in Sec. 3.1-3.3; see also Sec. C for special cases.

Similar to the classical gradient descent scheme of Nesterov , Algorithm 1 involves an extrapolation step (6). As a consequence, the solution of the auxiliary problem (5) at iteration k+1k+1 is driven towards the extrapolated variable yky_{k}. As shown in , this step is in fact sufficient to reduce the number of iterations of Algorithm 1 to solve (1) when εk=0\varepsilon_{k}=0—that is, for running the exact accelerated proximal point algorithm.

Nevertheless, to control the total computational complexity of an accelerated algorithm A\mathcal{A}, it is necessary to take into account the complexity of solving the auxiliary problems (5) using M\mathcal{M}. This is where our approach differs from the classical proximal point algorithm of . Essentially, both algorithms are the same, but we use the weaker inexactness criterion Gk(xk)GkεkG_{k}(x_{k})-G_{k}^{*}\leq\varepsilon_{k}, where the sequence (εk)k0(\varepsilon_{k})_{k\geq 0} is fixed beforehand, and only depends on the initial point. This subtle difference has important consequences: (i) in practice, this condition can often be checked by computing duality gaps; (ii) in theory, the methods M\mathcal{M} we consider have linear convergence rates, which allows us to control the complexity of step (5), and then to provide the computational complexity of A\mathcal{A}.

Convergence Analysis

In this section, we present the theoretical properties of Algorithm 1, for optimization methods M\mathcal{M} with deterministic convergence rates of the form (3). When the rate is given as an expectation, a simple extension of our analysis described in Section 4 is needed. For space limitation reasons, we shall sketch the proof mechanics here, and defer the full proofs to Appendix B.

We first analyze the convergence rate of Algorithm 1 for solving problem 1, regardless of the complexity required to solve the subproblems (5). We start with the μ\mu-strongly convex case.

Choose α0=q\alpha_{0}=\sqrt{q} with q=μ/(μ+κ)q=\mu/(\mu+\kappa) and

Then, Algorithm 1 generates iterates (xk)k0(x_{k})_{k\geq 0} such that

This theorem characterizes the linear convergence rate of Algorithm 1. It is worth noting that the choice of ρ\rho is left to the discretion of the user, but it can safely be set to ρ=0.9q\rho=0.9\sqrt{q} in practice. The choice α0=q\alpha_{0}=\sqrt{q} was made for convenience purposes since it leads to a simplified analysis, but larger values are also acceptable, both from theoretical and practical point of views. Following an advice from Nesterov[17, page 81] originally dedicated to his classical gradient descent algorithm, we may for instance recommend choosing α0\alpha_{0} such that α02+(1q)α01=0\alpha_{0}^{2}+(1-q)\alpha_{0}-1=0.

The choice of the sequence (εk)k0(\varepsilon_{k})_{k\geq 0} is also subject to discussion since the quantity F(x0)FF(x_{0})-F^{*} is unknown beforehand. Nevertheless, an upper bound may be used instead, which will only affects the corresponding constant in (7). Such upper bounds can typically be obtained by computing a duality gap at x0x_{0}, or by using additional knowledge about the objective. For instance, when FF is non-negative, we may simply choose εk=(2/9)F(x0)(1ρ)k\varepsilon_{k}=(2/9)F(x_{0})(1-\rho)^{k}.

The proof of convergence uses the concept of estimate sequence invented by Nesterov , and introduces an extension to deal with the errors (εk)k0(\varepsilon_{k})_{k\geq 0}. To control the accumulation of errors, we borrow the methodology of for inexact proximal gradient algorithms. Our construction yields a convergence result that encompasses both strongly convex and non-strongly convex cases. Note that estimate sequences were also used in , but, as noted by , the proof of only applies when using an extrapolation step (6) that involves the true minimizer of (5), which is unknown in practice. To obtain a rigorous convergence result like (7), a different approach was needed.

Theorem 3.1 is important, but it does not provide yet the global computational complexity of the full algorithm, which includes the number of iterations performed by M\mathcal{M} for approximately solving the auxiliary problems (5). The next proposition characterizes the complexity of this inner-loop.

Under the assumptions of Theorem 3.1, let us consider a method M\mathcal{M} generating iterates (zt)t0(z_{t})_{t\geq 0} for minimizing the function GkG_{k} with linear convergence rate of the form

This proposition is generic since the assumption (8) is relatively standard for gradient-based methods . It may now be used to obtain the global rate of convergence of an accelerated algorithm. By calling FsF_{s} the objective function value obtained after performing s=kTMs=kT_{\mathcal{M}} iterations of the method M\mathcal{M}, the true convergence rate of the accelerated algorithm A\mathcal{A} is

As a result, algorithm A\mathcal{A} has a global linear rate of convergence with parameter

where τM\tau_{\mathcal{M}} typically depends on κ\kappa (the greater, the faster is M\mathcal{M}). Consequently, κ\kappa will be chosen to maximize the ratio τM/μ+κ\tau_{\mathcal{M}}/{\sqrt{\mu+\kappa}}. Note that for other algorithms M\mathcal{M} that do not satisfy (8), additional analysis and possibly a different initialization z0z_{0} may be necessary (see Appendix D for example).

2 Convergence Analysis for Convex but Non-Strongly Convex Objective Functions

We now state the convergence rate when the objective is not strongly convex, that is when μ=0\mu=0.

When μ=0\mu=0, choose α0=(51)/2\alpha_{0}=(\sqrt{5}-1)/2 and

Then, Algorithm 1 generates iterates (xk)k0(x_{k})_{k\geq 0} such that

This theorem is the counter-part of Theorem 3.1 when μ=0\mu=0. The choice of η\eta is left to the discretion of the user; it empirically seem to have very low influence on the global convergence speed, as long as it is chosen small enough (e.g., we use η=0.1\eta=0.1 in practice). It shows that Algorithm 1 achieves the optimal rate of convergence of first-order methods, but it does not take into account the complexity of solving the subproblems (5). Therefore, we need the following proposition:

We can now draw up the global complexity of an accelerated algorithm A\mathcal{A} when M\mathcal{M} has a linear convergence rate (8) for κ\kappa-strongly convex objectives. To produce xkx_{k}, M\mathcal{M} is called at most kTMlog(k+2)kT_{\mathcal{M}}\log(k+2) times. Using the global iteration counter s=kTMlog(k+2)s=kT_{\mathcal{M}}\log(k+2), we get

If M\mathcal{M} is a first-order method, this rate is near-optimal, up to a logarithmic factor, when compared to the optimal rate O(1/s2)O(1/s^{2}), which may be the price to pay for using a generic acceleration scheme.

Acceleration in Practice

We show here how to accelerate existing algorithms M\mathcal{M} and compare the convergence rates obtained before and after catalyst acceleration. For all the algorithms we consider, we study rates of convergence in terms of total number of iterations (in expectation, when necessary) to reach accuracy ε\varepsilon. We first show how to accelerate full gradient and randomized coordinate descent algorithms . Then, we discuss other approaches such as SAG , SAGA , or SVRG . Finally, we present a new proximal version of the incremental gradient approaches Finito/MISO , along with its accelerated version. Table 1 summarizes the acceleration obtained for the algorithms considered.

The convergence rate of an accelerated algorithm A\mathcal{A} is driven by the parameter κ\kappa. In the strongly convex case, the best choice is the one that maximizes the ratio τM,Gk/μ+κ\tau_{\mathcal{M},G_{k}}/\sqrt{\mu+\kappa}. As discussed in Appendix C, this rule also holds when (8) is given in expectation and in many cases where the constant CM,Gk\mathcal{C}_{\mathcal{M},G_{k}} is different than A(Gk(z0)Gk)A(G_{k}(z_{0})-G_{k}^{*}) from (8). When μ=0\mu=0, the choice of κ>0\kappa>0 only affects the complexity by a multiplicative constant. A rule of thumb is to maximize the ratio τM,Gk/L+κ\tau_{\mathcal{M},G_{k}}/\sqrt{L+\kappa} (see Appendix C for more details).

After choosing κ\kappa, the global iteration-complexity is given by Compkinkout\text{Comp}\leq k_{\text{in}}\>k_{\text{out}}, where kink_{\text{in}} is an upper-bound on the number of iterations performed by M\mathcal{M} per inner-loop, and koutk_{\text{out}} is the upper-bound on the number of outer-loop iterations, following from Theorems 3.1-3.3. Note that for simplicity, we always consider that LμL\gg\mu such that we may write LμL-\mu simply as “LL” in the convergence rates.

1 Acceleration of Existing Algorithms

Table 1 presents convergence rates that are valid for proximal and non-proximal settings, since most methods we consider are able to deal with such non-differentiable penalties. The exception is SAG , for which proximal variants are not analyzed. The incremental method Finito/MISO has also been limited to non-proximal settings so far. In Section 4.2, we actually introduce the extension of MISO to composite minimization, and establish its theoretical convergence rates.

We now consider randomized incremental gradient methods, resp. SAG and SAGA . When μ>0\mu>0, we focus on the “ill-conditioned” setting nL/μn\leq L/\mu, where these methods have the complexity O((L/μ)log(1/ε))O((L/\mu)\log(1/\varepsilon)). Otherwise, their complexity becomes O(nlog(1/ε))O(n\log(1/\varepsilon)), which is independent of the condition number and seems theoretically optimal .

For these methods, the best choice for κ\kappa has the form κ=a(Lμ)/(n+b)μ\kappa={a(L-\mu)}/{(n+b)}-\mu, with (a,b)=(2,2)(a,b)=(2,-2) for SAG, (a,b)=(1/2,1/2)(a,b)=(1/2,1/2) for SAGA. A similar formula, with a constant LL^{\prime} in place of LL, holds for SVRG; we omit it here for brevity. SDCA and Finito/MISO are actually related to incremental gradient methods, and the choice for κ\kappa has a similar form with (a,b)=(1,1)(a,b)=(1,1).

2 Proximal MISO and its Acceleration

Finito/MISO was proposed in and for solving the problem (1) when ψ=0\psi=0 and when ff is a sum of nn μ\mu-strongly convex functions fif_{i} as in (2), which are also differentiable with LL-Lipschitz derivatives. The algorithm maintains a list of quadratic lower bounds—say (dik)i=1n(d_{i}^{k})_{i=1}^{n} at iteration kk—of the functions fif_{i} and randomly updates one of them at each iteration by using strong-convexity inequalities. The current iterate xkx_{k} is then obtained by minimizing the lower-bound of the objective

Interestingly, since DkD_{k} is a lower-bound of FF we also have Dk(xk)FD_{k}(x_{k})\leq F^{*}, and thus the quantity F(xk)Dk(xk)F(x_{k})-D_{k}(x_{k}) can be used as an optimality certificate that upper-bounds F(xk)FF(x_{k})-F^{*}. Furthermore, this certificate was shown to converge to zero with a rate similar to SAG/SDCA/SVRG/SAGA under the condition n2L/μn\geq 2L/\mu. In this section, we show how to remove this condition and how to provide support to non-differentiable functions ψ\psi whose proximal operator can be easily computed. We shall briefly sketch the main ideas, and we refer to Appendix D for a thorough presentation.

The first idea to deal with a nonsmooth regularizer ψ\psi is to change the definition of DkD_{k}:

which was also proposed in without a convergence proof. Then, because the dikd_{i}^{k}’s are quadratic functions, the minimizer xkx_{k} of DkD_{k} can be obtained by computing the proximal operator of ψ\psi at a particular point. The second idea to remove the condition n2L/μn\geq 2L/\mu is to modify the update of the lower bounds dikd_{i}^{k}. Assume that index iki_{k} is selected among {1,,n}\{1,\ldots,n\} at iteration kk, then

Whereas the original Finito/MISO uses δ=1\delta=1, our new variant uses δ=min(1,μn/2(Lμ))\delta=\min(1,{\mu n}/{2(L-\mu)}). The resulting algorithm turns out to be very close to variant “5” of proximal SDCA , which corresponds to using a different value for δ\delta. The main difference between SDCA and MISO-Prox is that the latter does not use duality. It also provides a different (simpler) optimality certificate F(xk)Dk(xk)F(x_{k})-D_{k}(x_{k}), which is guaranteed to converge linearly, as stated in the next theorem.

Let (xk)k0(x_{k})_{k\geq 0} be obtained by MISO-Prox, then

Furthermore, we also have fast convergence of the certificate

The proof of convergence is given in Appendix D. Finally, we conclude this section by noting that MISO-Prox enjoys the catalyst acceleration, leading to the iteration-complexity presented in Table 1. Since the convergence rate (14) does not have exactly the same form as (8), Propositions 3.2 and 3.4 cannot be used and additional analysis, given in Appendix D, is needed. Practical forms of the algorithm are also presented there, along with discussions on how to initialize it.

Experiments

We use three datasets used in , namely real-sim, rcv1, and ocr, which are relatively large, with up to n=2500000n=2\,500\,000 points for ocr and p=47152p=47\,152 variables for rcv1. We consider three regimes: μ=0\mu=0 (no regularization), μ/L=0.001/n\mu/L=0.001/n and μ/L=0.1/n\mu/L=0.1/n, which leads significantly larger condition numbers than those used in other studies (μ/L1/n\mu/L\approx 1/n in ). We compare MISO, SAG, and SAGA with their default parameters, which are recommended by their theoretical analysis (step-sizes 1/L1/L for SAG and 1/3L1/3L for SAGA), and study several accelerated variants. The values of κ\kappa and ρ\rho and the sequences (εk)k0(\varepsilon_{k})_{k\geq 0} are those suggested in the previous sections, with η ⁣= ⁣0.1\eta\!=\!0.1 in (10). Other implementation details are presented in Appendix E.

The restarting strategy for M\mathcal{M} is key to achieve acceleration in practice. All of the methods we compare store nn gradients evaluated at previous iterates of the algorithm. We always use the gradients from the previous run of M\mathcal{M} to initialize a new one. We detail in Appendix E the initialization for each method. Finally, we evaluated a heuristic that constrain M\mathcal{M} to always perform at most nn iterations (one pass over the data); we call this variant AMISO2 for MISO whereas AMISO1 refers to the regular “vanilla” accelerated variant, and we also use this heuristic to accelerate SAG.

The results are reported in Table 1. We always obtain a huge speed-up for MISO, which suffers from numerical stability issues when the condition number is very large (for instance, μ/L=103/n=4.1010\mu/L=10^{-3}/n=4.10^{-10} for ocr). Here, not only does the catalyst algorithm accelerate MISO, but it also stabilizes it. Whereas MISO is slower than SAG and SAGA in this “small μ\mu” regime, AMISO2 is almost systematically the best performer. We are also able to accelerate SAG and SAGA in general, even though the improvement is less significant than for MISO. In particular, SAGA without acceleration proves to be the best method on ocr. One reason may be its ability to adapt to the unknown strong convexity parameter μμ\mu^{\prime}\geq\mu of the objective near the solution. When μ/L1/n\mu^{\prime}/L\geq 1/n, we indeed obtain a regime where acceleration does not occur (see Sec. 4). Therefore, this experiment suggests that adaptivity to unknown strong convexity is of high interest for incremental optimization.

This work was supported by ANR (MACARON ANR-14-CE23-0003-01), MSR-Inria joint centre, CNRS-Mastodons program (Titan), and NYU Moore-Sloan Data Science Environment.

References

Appendix A Construction of the Approximate Estimate Sequence

The estimate sequence is a generic tool introduced by Nesterov for proving the convergence of accelerated gradient-based algorithms. We start by recalling the definition given in .

Estimate sequences are used for proving convergence rates thanks to the following lemma

If for some sequence (xk)k0(x_{k})_{k\geq 0} we have

for an estimate sequence (φk)k0(\varphi_{k})_{k\geq 0} of FF, then

The rate of convergence of F(xk)F(x_{k}) is thus directly related to the convergence rate of λk\lambda_{k}. Constructing estimate sequences is thus appealing, even though finding the most appropriate one is not trivial for the catalyst algorithm because of the approximate minimization of GkG_{k} in (5). In a nutshell, the main steps of our convergence analysis are

define an “approximate” estimate sequence for FF corresponding to Algorithm 1—that is, finding a function φ\varphi that almost satisfies Definition A.1 up to the approximation errors εk\varepsilon_{k} made in (5) when minimizing GkG_{k}, and control the way these errors sum up together.

extend Lemma A.2 to deal with the approximation errors εk\varepsilon_{k} to derive a generic convergence rate for the sequence (xk)k0(x_{k})_{k\geq 0}.

This is also the strategy proposed by Güler in for his inexact accelerated proximal point algorithm, which essentially differs from ours in its stopping criterion. The estimate sequence we choose is also different and leads to a more rigorous convergence proof. Specifically, we prove in this section the following theorem:

where the αi\alpha_{i}’s are defined in Algorithm 1. Then, the sequence (xk)k0(x_{k})_{k\geq 0} satisfies

where FF^{*} is the minimum value of FF and

Then, the theorem will be used with the following lemma from to control the convergence rate of the sequence (λk)k0(\lambda_{k})_{k\geq 0}, whose definition follows the classical use of estimate sequences . This will provide us convergence rates both for the strongly convex and non-strongly convex cases.

If in the quantity γ0\gamma_{0} defined in (17) satisfies γ0μ\gamma_{0}\geq\mu, then the sequence (λk)k0(\lambda_{k})_{k\geq 0} from (15) satisfies

We may now move to the proof of the theorem.

The first step is to construct an estimate sequence is typically to find a sequence of lower bounds of FF. By calling xkx_{k}^{*} the minimizer of GkG_{k}, the following one is used in :

By strong convexity, Gk(x)Gk(xk)+κ+μ2xxk2G_{k}(x)\geq G_{k}(x_{k}^{*})+\frac{\kappa+\mu}{2}\|x-x_{k}^{*}\|^{2}, which is equivalent to

After developing the quadratic terms, we directly obtain (19). ∎

Unfortunately, the exact value xkx_{k}^{*} is unknown in practice and the estimate sequence of yields in fact an algorithm where the definition of the anchor point yky_{k} involves the unknown quantity xkx_{k}^{*} instead of the approximate solutions xkx_{k} and xk1x_{k-1} as in (6), as also noted by others . To obtain a rigorous proof of convergence for Algorithm 1, it is thus necessary to refine the analysis of . To that effect, we construct below a sequence of functions that approximately satisfies the definition of estimate sequences. Essentially, we replace in (19) the quantity xkx_{k}^{*} by xkx_{k} to obtain an approximate lower bound, and control the error by using the condition Gk(xk)Gk(xk)εkG_{k}(x_{k})-G_{k}(x_{k}^{*})\leq\varepsilon_{k}. This leads us to the following construction:

ϕ0(x)=F(x0)+γ02xx02\phi_{0}(x)=F(x_{0})+\frac{\gamma_{0}}{2}\|x-x_{0}\|^{2};

where the value of γ0\gamma_{0}, given in (17) will be explained later. Note that if one replaces xkx_{k} by xkx_{k}^{*} in the above construction, it is easy to show that (ϕk)k0(\phi_{k})_{k\geq 0} would be exactly an estimate sequence for FF with the relation λk\lambda_{k} given in (15).

Before extending Lemma A.2 to deal with the approximate sequence and conclude the proof of the theorem, we need to characterize a few properties of the sequence (ϕk)k0(\phi_{k})_{k\geq 0}. In particular, the functions ϕk\phi_{k} are quadratic and admit a canonical form:

For all k0k\geq 0, ϕk\phi_{k} can be written in the canonical form

where the sequences (γk)k0(\gamma_{k})_{k\geq 0}, (vk)k0(v_{k})_{k\geq 0}, and (ϕk)k0(\phi_{k}^{*})_{k\geq 0} are defined as follows

Differentiate twice the relations (23) gives us directly (20). Since vkv_{k} minimizes ϕk\phi_{k}, the optimality condition ϕk(vk)=0\nabla\phi_{k}(v_{k})=0 gives

and then we obtain (21). Finally, apply x=xkx=x_{k} to (23), which yields

Using the expression of vkv_{k} from (21), we have

It remains to plug this relation into (24), use once (20), and we obtain the formula (22) for ϕk\phi_{k}^{*}. ∎

We may now start analyzing the errors εk\varepsilon_{k} to control how far is the sequence (ϕk)k0(\phi_{k})_{k\geq 0} from an exact estimate sequence. For that, we need to understand the effect of replacing xkx_{k}^{*} by xkx_{k} in the lower bound (19). The following lemma will be useful for that purpose.

where GkG_{k}^{*} is the minimum value of GkG_{k}. Replacing GkG_{k} by its definition (5) gives

We can now show that Algorithm 1 generates iterates (xk)k0(x_{k})_{k\geq 0} that approximately satisfy the condition of Lemma A.2 from Nesterov .

Let ϕk\phi_{k} be the estimate sequence constructed above. Then, Algorithm 1 generates iterates (xk)k0(x_{k})_{k\geq 0} such that

where the sequence (ξk)k0(\xi_{k})_{k\geq 0} is defined by ξ0=0\xi_{0}=0 and

We proceed by induction. For k=0k=0, ϕ0=F(x0)\phi_{0}^{*}=F(x_{0}) and ξ0=0\xi_{0}=0. Assume now that F(xk1)ϕk1+ξk1F(x_{k-1})\leq\phi_{k-1}^{*}+\xi_{k-1}. Then,

where the second inequality is due to (25). By Lemma A.6, we now have,

We now need to show that the choice of the sequences (αk)k0(\alpha_{k})_{k\geq 0} and (yk)k0(y_{k})_{k\geq 0} will cancel all the terms involving yk1xky_{k-1}-x_{k}. In other words, we want to show that

which will be sufficient to conclude that ϕk+ξkF(xk)\phi_{k}^{*}+\xi_{k}\geq F(x_{k}). The relation (27) can be obtained from the definition of αk\alpha_{k} in (6) and the form of γk\gamma_{k} given in (20). We have indeed from (6) that

Then, the quantity (κ+μ)αk2(\kappa+\mu)\alpha_{k}^{2} follows the same recursion as γk+1\gamma_{k+1} in (20). Moreover, we have

from the definition of γ0\gamma_{0} in (17). We can then conclude by induction that γk+1=(κ+μ)αk2\gamma_{k+1}=(\kappa+\mu)\alpha_{k}^{2} for all k0k\geq 0 and (27) is satisfied.

To prove (26), we assume that yk1y_{k-1} is chosen such that (26) is satisfied, and show that it is equivalent to defining yky_{k} as in (6). By lemma A.6,

As a result, using (26) by replacing k1k-1 by kk yields

and we obtain the original equivalent definition of (6). This concludes the proof. ∎

With this lemma in hand, we introduce the following proposition, which brings us almost to Theorem A.3, which we want to prove.

Let us consider the sequence (λk)k0(\lambda_{k})_{k\geq 0} defined in (15). Then, the sequence (xk)k0(x_{k})_{k\geq 0} satisfies

where xx^{*} is a minimizer of FF and FF^{*} its minimum value.

By the definition of the function ϕk\phi_{k}, we have

where the inequality comes from (25). Therefore, by using the definition of ξk\xi_{k} in Lemma A.8,

where the first equality uses the relation (28), the last inequality comes from the strong convexity relation εkGk(xk)Gk(xk)(1/2)(κ+μ)xkxk2\varepsilon_{k}\geq G_{k}(x_{k})-G_{k}(x_{k}^{*})\geq(1/2)(\kappa+\mu)\|x_{k}^{*}-x_{k}\|^{2}, and the last equality uses the relation γk=(κ+μ)αk12\gamma_{k}=(\kappa+\mu)\alpha_{k-1}^{2}.

Dividing both sides by λk\lambda_{k} yields

To control the error term on the right and finish the proof of Theorem A.3, we are going to borrow some methodology used to analyze the convergence of inexact proximal gradient algorithms from , and use an extension of a lemma presented in to bound the value of vix\|v_{i}-x^{*}\|. This lemma is presented below.

Assume that the nonnegative sequences (uk)k0(u_{k})_{k\geq 0} and (ak)k0(a_{k})_{k\geq 0} satisfy the following recursion for all k0k\geq 0:

where (Sk)k0(S_{k})_{k\geq 0} is an increasing sequence such that S0u02S_{0}\geq u_{0}^{2}. Then,

The first part—that is, Eq. (31)—is exactly Lemma 1 from . The proof is in their appendix. Then, by calling bkb_{k} the right-hand side of (31), we have that for all k1k\geq 1, ukbku_{k}\leq b_{k}. Furthermore (bk)k0(b_{k})_{k\geq 0} is increasing and we have

and using the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}, we have

We are now in shape to conclude the proof of Theorem A.3. We apply the previous lemma to (29):

Appendix B Proofs of the Main Theorems and Propositions

We simply use Theorem A.3 and specialize it to the choice of parameters. The initialization α0=q\alpha_{0}=\sqrt{q} leads to a particularly simple form of the algorithm, where αk=q\alpha_{k}=\sqrt{q} for all k0k\geq 0. Therefore, the sequence (λk)k0(\lambda_{k})_{k\geq 0} from Theorem A.3 is also simple. For all k0k\geq 0, we indeed have λk=(1q)k\lambda_{k}=(1-\sqrt{q})^{k}. To upper-bound the quantity SkS_{k} from Theorem A.3, we now remark that γ0=μ\gamma_{0}=\mu and thus, by strong convexity of FF,

Therefore, Theorem A.3 combined with the previous inequality gives us

Since 1x+x2\sqrt{1-x}+\frac{x}{2} is decreasing in $,wehave, we have\sqrt{1-\rho}+\frac{\rho}{2}\geq\sqrt{1-\sqrt{q}}+\frac{\sqrt{q}}{2}$. Consequently,

B.2 Proof of Proposition 3.2

To control the number of calls of M\mathcal{M}, we need to upper bound Gk(xk1)GkG_{k}(x_{k-1})-G_{k}^{*} which is given by the following lemma:

Let (xk)k0(x_{k})_{k\geq 0} and (yk)k0(y_{k})_{k\geq 0} be generated by Algorithm 1. Remember that by definition of xk1x_{k-1},

which can be shown by using the respective definitions of GkG_{k} and Gk1G_{k-1} and manipulate the quadratic term resulting from the difference Gk(x)Gk1(x)G_{k}(x)-G_{k-1}(x).

Plugging x=xk1x=x_{k-1} and y=xky=x_{k}^{*} in the previous relation yields

where the last inequality comes from the strong convexity inequality of

Moreover, from the inequality x,y12x2+12y2{\langle x,y\rangle}\leq\frac{1}{2}\|x\|^{2}+\frac{1}{2}\|y\|^{2}, we also have

Summing inequalities (33), (34) and (35) gives the desired result. ∎

Next, we need to upper-bound the term yk1yk22\|y_{k-1}-y_{k-2}\|^{2}, which was also required in the convergence proof of the accelerated SDCA algorithm . We follow here their methodology.

Let us consider the iterates (xk)k0(x_{k})_{k\geq 0} and (yk)k0(y_{k})_{k\geq 0} produced by Algorithm 1, and define

which appears in Theorem 3.1 and which is such that F(xk)FδkF(x_{k})-F^{*}\leq\delta_{k}. Then, for any k3k\geq 3,

We follow here . By definition of yky_{k}, we have

where βk\beta_{k} is defined in (6). The last inequality was due to the fact that βk1\beta_{k}\leq 1. Indeed, the specific choice of α0=q\alpha_{0}=\sqrt{q} in Theorem A.3 leads to βk=qqq+q1\beta_{k}=\frac{\sqrt{q}-{q}}{\sqrt{q}+q}\leq 1 for all kk. Note, however, that this relation βk1\beta_{k}\leq 1 is true regardless of the choice of α0\alpha_{0}:

where the last equality uses the relation αk2+αkαk12=αk12+qαk\alpha_{k}^{2}+\alpha_{k}\alpha_{k-1}^{2}=\alpha_{k-1}^{2}+q\alpha_{k} from Algorithm 1. To conclude the lemma, we notice that by triangle inequality

We are now in shape to conclude the proof of Proposition 3.2.

By Proposition B.1 and lemma B.2, we have for all k3k\geq 3,

Let (zt)t0(z_{t})_{t\geq 0} be the sequence of using M\mathcal{M} to solve GkG_{k} with initialization z0=xk1z_{0}=x_{k-1}. By assumption (8), we have

The number of iterations TMT_{\mathcal{M}} of M\mathcal{M} to guarantee an accuracy of εk\varepsilon_{k} needs to satisfy

Let us denote RR the right-hand side. We remark that this upper bound holds for k3k\geq 3. We now consider the cases k=1k=1 and k=2k=2.

When k=1k=1, G1(x)=F(x)+κ2xy02G_{1}(x)=F(x)+\frac{\kappa}{2}\|x-y_{0}\|^{2}. Note that x0=y0x_{0}=y_{0}, then G1(x0)=F(x0)G_{1}(x_{0})=F(x_{0}). As a result,

When k=2k=2, we remark that y1y0=(1+β1)(x1x0)y_{1}-y_{0}=(1+\beta_{1})(x_{1}-x_{0}). Then, by following similar steps as in the proof of Lemma B.2, we have

which is smaller than 72δ1μ\frac{72\delta_{-1}}{\mu}. Therefore, the previous steps from the case k3k\geq 3 apply and G2(x1)G2ε2R\frac{G_{2}(x_{1})-G_{2}^{*}}{\varepsilon_{2}}\leq R. Thus, for any k1k\geq 1,

B.3 Proof of Theorem 3.3.

We will again Theorem A.3 and specialize it to the choice of parameters. To apply it, the following Lemma will be useful to control the growth of (λk)k0(\lambda_{k})_{k\geq 0}.

Let (λk)k0(\lambda_{k})_{k\geq 0} be the sequence defined in (15) where (αk)k0(\alpha_{k})_{k\geq 0} is produced by Algorithm 1 with α0=512\alpha_{0}=\frac{\sqrt{5}-1}{2} and μ=0\mu=0. Then, we have the following bounds for all k0k\geq 0,

Note that by definition of αk\alpha_{k}, we have for all k1k\geq 1,

With the choice of α0\alpha_{0}, the quantity γ0\gamma_{0} defined in (17) is equal to κ\kappa. By Lemma A.4, we have λk4(k+2)2\lambda_{k}\leq\frac{4}{(k+2)^{2}} for all k0k\geq 0 and thus αk2k+3\alpha_{k}\leq\frac{2}{k+3} for all k1k\geq 1 (it is also easy to check numerically that this is also true for k=0k=0 since 5120.6223\frac{\sqrt{5}-1}{2}\approx 0.62\leq\frac{2}{3}). We now have all we need to conclude the lemma:

With this lemma in hand, we may now proceed and apply Theorem A.3. We have remarked in the proof of the previous lemma that γ0=κ\gamma_{0}=\kappa. Then,

where the last inequality uses Lemma B.3 to upper-bound the ratio εi/λi\varepsilon_{i}/\lambda_{i}. Moreover,

The last inequality uses (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}).

B.4 Proof of Proposition 3.4

When μ=0\mu=0, we remark that Proposition B.1 still holds but Lemma B.2 does not. The main difficulty is thus to find another way to control the quantity yk1yk2\|y_{k-1}-y_{k-2}\|.

Since F(xk)FF(x_{k})-F^{*} is bounded by Theorem 3.3, we may use the bounded level set assumptions to ensure that there exists B>0B>0 such that xkxB\|x_{k}-x^{*}\|\leq B for any k0k\geq 0 where xx^{*} is a minimizer of FF. We can now follow similar steps as in the proof of Lemma B.2, and show that

Since κ>0\kappa>0, GkG_{k} is strongly convex, then using the same argument as in the strongly convex case, the number of calls for M\mathcal{M} is given by

The right hand side is upper-bounded by O((k+2)4+η)O((k+2)^{4+\eta}). Plugging this relation into (38) gives the desired result.

Appendix C Derivation of Global Convergence Rates

We give here a generic “template” for computing the optimal choice of κ\kappa to accelerate a given algorithm M\mathcal{M}, and therefore compute the rate of convergence of the accelerated algorithm A\mathcal{A}.

We assume here that M\mathcal{M} is a randomized first-order optimization algorithm, i.e. the iterates (xk)(x_{k}) generated by M\mathcal{M} are a sequence of random variables; specialization to a deterministic algorithm is straightforward. Also, for the sake of simplicity, we shall use simple notations to denote the stopping time to reach accuracy ε\varepsilon. Definition and notation using filtrations, σ\sigma-algebras, etc. are unnecessary for our purpose here where the quantity of interest has a clear interpretation.

Assume that algorithm M\mathcal{M} enjoys a linear rate of convergence, in expectation. There exists constants CM,F\mathcal{C}_{\mathcal{M},F} and τM,F\tau_{\mathcal{M},F} such that the sequence of iterates (xk)k0(x_{k})_{k\geq 0} for minimizing a strongly-convex objective FF satisfies

Define the random variable TM,F(ε)T_{\mathcal{M},F}(\varepsilon) (stopping time) corresponding to the minimum number of iterations to guarantee an accuracy ε\varepsilon in the course of running M\mathcal{M}

Then, an upper bound on the expectation is provided by the following lemma.

Let M\mathcal{M} be an optimization method with the expected rate of convergence (39). Then,

where we have dropped the dependency in FF to simplify the notation.

We abbreviate τM\tau_{\mathcal{M}} by τ\tau. Set

As simple calculation shows that for any τ(0,1)\tau\in(0,1), τ21eτ\frac{\tau}{2}\leq 1-e^{-\tau} and then

Note that the previous lemma mirrors Eq. (36-37) in the proof of Prop. 3.1 in Appendix B. For all optimization methods of interest, the rate τM,Gk\tau_{\mathcal{M},G_{k}} is independent of kk and varies with the parameter κ\kappa. We may now compute the iteration-complexity (in expectation) of the accelerated algorithm A\mathcal{A}—that is, for a given ε\varepsilon, the expected total number of iterations performed by the method M\mathcal{M}. Let us now fix ε>0\varepsilon>0. Calculating the iteration-complexity decomposes into three steps:

Find κ\kappa that maximizes the ratio τM,Gk/μ+κ\tau_{\mathcal{M},G_{k}}/\sqrt{\mu+\kappa} for algorithm M\mathcal{M} when FF is μ\mu-strongly convex. In the non-strongly convex case, we suggest maximizing instead the ratio τM,Gk/L+κ\tau_{\mathcal{M},G_{k}}/\sqrt{L+\kappa}. Note that the choice of κ\kappa is less critical for non-strongly convex problems since it only affects multiplicative constants in the global convergence rate.

Compute the upper-bound of the number of outer iterations koutk_{\text{out}} using Theorem 3.1 (for the strongly convex case), or Theorem 3.3 (for the non-strongly convex case), by replacing κ\kappa by the optimal value found in step 1.

Compute the upper-bound of the expected number of inner iterations

by replacing the appropriate quantities in Eq. 41 for algorithm M\mathcal{M}; for that purpose, the proofs of Propositions 3.2 of 3.4 my be used to upper-bound the ratio CM,Gk/εk\mathcal{C}_{\mathcal{M},G_{k}}/\varepsilon_{k}, or another dedicated analysis for M\mathcal{M} may be required if the constant CM,Gk\mathcal{C}_{\mathcal{M},G_{k}} does not have the required form A(Gk(z0)Gk)A(G_{k}(z_{0})-G_{k}^{*}) in (8).

Then, the iteration-complexity (in expectation) denoted Comp. is given by

Appendix D A Proximal MISO/Finito Algorithm

In this section, we present the algorithm MISO/Finito, and show how to extend it in two ways. First, we propose a proximal version to deal with composite optimization problems, and we analyze its rate of convergence. Second, we show how to remove a large sample condition n2L/μn\geq 2L/\mu, which was necessary for the convergence of the algorithm. The resulting algorithm is a variant of proximal SDCA with a different stepsize and a stopping criterion that does not use duality.

MISO/Finito was proposed in and for solving the following smooth unconstrained convex minimization problem

where each fif_{i} is differentiable with LL-Lipschitz continuous derivatives and μ\mu-strongly convex. At iteration kk, the algorithm updates a list of lower bounds dikd_{i}^{k} of the functions fif_{i}, by randomly picking up one index iki_{k} among {1,,n}\{1,\cdots,n\} and performing the following update

which is a lower bound of fif_{i} because of the μ\mu-strong convexity of fif_{i}. Equivalently, one may perform the following updates

and all functions dikd_{i}^{k} have the form

where cikc_{i}^{k} is a constant. Then, MISO/Finito performs the following minimization to produce the iterate (xk)(x_{k}):

In many machine learning problems, it is worth remarking that each function fi(x)f_{i}(x) has the specific form fi(x)=li(x,wi)+μ2x2f_{i}(x)=l_{i}(\langle x,w_{i}\rangle)+\frac{\mu}{2}\|x\|^{2}. In such cases, the vectors zikz_{i}^{k} can be obtained by storing only O(n)O(n) scalars.Note that even though we call this algorithm MISO (or Finito), it was called MISOμ\mu in , whereas “MISO” was originally referring to an incremental majorization-minimization procedure that uses upper bounds of the functions fif_{i} instead of lower bounds, which is appropriate for non-convex optimization problems. The main convergence result of is that the procedure above converges with a linear rate of convergence of the form (3), with τMISO=1/3n\tau_{\textrm{MISO}}=1/3n (also refined in 1/2n1/2n in ), when the large sample size constraint n2L/μn\geq 2L/\mu is satisfied.

Removing this condition and extending MISO to the composite optimization problem (1) is the purpose of the next section.

D.2 Proximal MISO

We now consider the composite optimization problem below,

where the functions fif_{i} are differentiable with LL-Lipschitz derivatives and μ\mu-strongly convex. As in typical composite optimization problems, ψ\psi is convex but not necessarily differentiable. We assume that the proximal operator of ψ\psi can be computed easily. The algorithm needs to be initialized with some lower bounds for the functions fif_{i}:

Then, we remark that under the large sample size condition n2L/μn\geq 2L/\mu, we have δ=1\delta=1 and the update of the quantities zikz_{i}^{k} in (45) is the same as in the original MISO/Finito algorithm. As we will see in the convergence analysis, the choice of δ\delta ensures convergence of the algorithm even in the small sample size regime n<2L/μn<2L/\mu.

The algorithm MISO-Prox is almost identical to variant 55 of proximal SDCA , which performs the same updates with δ=μn/(L+μn)\delta=\mu n/(L+\mu n) instead of δ=min(1,μn2(Lμ))\delta=\min(1,\frac{\mu n}{2(L-\mu)}). It is however not clear that MISO-Prox actually performs dual ascent steps in the sense of SDCA since the proof of convergence of SDCA cannot be directly modified to use the stepsize of proximal MISO and furthermore, the convergence proof of MISO-Prox does not use the concept of duality. Another difference lies in the optimality certificate of the algorithms. Whereas Proximal-SDCA provides a certificate in terms of linear convergence of a duality gap based on Fenchel duality, Proximal-SDCA ensures linear convergence of a gap that relies on strong convexity but not on the Fenchel dual (at least explicitly).

Similar to the original MISO algorithm, Proximal MISO maintains a list (dik)(d_{i}^{k}) of lower bounds of the functions fif_{i}, which are updated in the following fashion

Then, the following function is a lower bound of the objective FF:

and the update (45) can be shown to exactly minimize DkD_{k}. As a lower bound of FF, we have that Dk(xk)FD_{k}(x_{k})\leq F^{*} and thus

The quantity F(xk)Dk(xk)F(x_{k})-D_{k}(x_{k}) can then be interpreted as an optimality gap, and the analysis below will show that it converges linearly to zero. In practice, it also provides a convenient stopping criterion, which yields Algorithm 3.

To explain the stopping criterion in Algorithm 3, we remark that the functions dikd_{i}^{k} are quadratic and can be written

where the cikc_{i}^{k}’s are some constants and cik=cik+μ2zik2c_{i}^{\prime k}=c_{i}^{k}+\frac{\mu}{2}\|z_{i}^{k}\|^{2}. Equation (48) shows how to update recursively these constants cikc_{i}^{\prime k}, and finally

which justifies the stopping criterion. Since computing F(xk)F(x_{k}) requires scanning all the data points, the criterion is only computed every nn iterations.

The convergence of MISO-Prox is guaranteed by Theorem 4.1 from the main part of paper. Before we prove this theorem, we note that this rate is slightly better than the one proven in MISO , which converges as (113n)k(1-\frac{1}{3n})^{k}. We start by recalling a classical lemma that provides useful inequalities. Its proof may be found in .

To start the proof, we need a sequence of upper and lower bounds involving the functions DkD_{k} and Dk1D_{k-1}. The first one is given in the next lemma

where the definition of dikd_{i}^{k} is given in (46). The first inequality uses Lemma D.1, and the last one uses the inequality fidik1f_{i}\geq d_{i}^{k-1}. From this inequality, we can obtain (50) by simply using Dk(x)=i=1ndik(x)+ψ(x)=Dk1(x)+1n(dikk(x)dikk1(x))D_{k}(x)=\sum_{i=1}^{n}d_{i}^{k}(x)+\psi(x)=D_{k-1}(x)+\frac{1}{n}\left(d^{k}_{\,i_{k}}(x)-d^{k-1}_{\,i_{k}}(x)\right). ∎

Next, we prove the following lemma to compare DkD_{k} and Dk1D_{k-1}.

Remember that the functions dikd_{i}^{k} are quadratic and have the form (49), that DkD_{k} is defined in (47), and that zˉk\bar{z}_{k} minimizes 1ni=1ndik\frac{1}{n}\sum_{i=1}^{n}d_{i}^{k}. Then, there exists a constant AkA_{k} such that

Then, we are able to control the value of Dk(xk1)D_{k}(x_{k-1}) in the next lemma.

Using Lemma D.3 with x=xk1x=x_{k-1} and y=xky=x_{k} yields

Moreover xk1x_{k-1} is the minimum of Dk1D_{k-1} which is μ\mu-strongly convex. Thus,

Adding the two previous inequalities gives the first inequality below

and the last one comes from the basic inequality 12a2+a,b+12b20\frac{1}{2}\|a\|^{2}+\langle a,b\rangle+\frac{1}{2}\|b\|^{2}\geq 0. ∎

We have now all the inequalities in hand to prove Theorem 4.1.

Choose yy that maximizes the above quadratic function, i.e.

Then, we start introducing expected values. By construction

After taking expectation, we obtain the relation

On the one hand, since F(xk1)FF(x_{k-1})\geq F^{*}, we have

This is true for any k1k\geq 1, as a result

On the other hand, since FDk(xk)F^{*}\geq D_{k}(x_{k}), then

which gives us the relation (14) of the theorem. We conclude giving the choice of δ\delta. We choose it to maximize the rate of convergence, which turns to maximize τ\tau. This is a quadratic function, which is maximized at δ=nμ2(Lμ)\delta=\frac{n\mu}{2(L-\mu)}. However, by definition δ1\delta\leq 1. Therefore, the optimal choice of δ\delta is given by

When nμ2(Lμ)1\frac{n\mu}{2(L-\mu)}\leq 1, we have δ=nμ2(Lμ)\delta=\frac{n\mu}{2(L-\mu)} and τ=μ4(Lμ)\tau=\frac{\mu}{4(L-\mu)}.

When 1nμ2(Lμ)1\leq\frac{n\mu}{2(L-\mu)}, we have δ=1\delta=1 and τ=1nLμn2μ12n\tau=\frac{1}{n}-\frac{L-\mu}{n^{2}\mu}\geq\frac{1}{2n}.

Therefore, τmin(12n,μ4(Lμ))\tau\geq\min\left(\frac{1}{2n},\frac{\mu}{4(L-\mu)}\right), which concludes the first part of the theorem.

To prove the second part, we use (58) and (14), which gives

D.3 Accelerating MISO-Prox

The convergence rate of MISO (or also SDCA) requires a special handling since it does not satisfy exactly the condition (8) from Proposition 3.2. The rate of convergence is linear, but with a constant proportional to FD0(x0)F^{*}-D_{0}(x_{0}) instead of F(x0)FF(x_{0})-F^{*} for many classical gradient-based approaches.

To achieve acceleration, we show in this section how to obtain similar guarantees as Proposition 3.2 and 3.4—that is, how to solve efficiently the subproblems (5). This essentially requires the right initialization each time MISO-Prox is called. By initialization, we mean initializing the variables zi0z_{i}^{0}.

Assume that MISO-Prox is used to obtain xk1x_{k-1} from Algorithm 1 with Gk1(xk1)Gkεk1G_{k-1}(x_{k-1})-G_{k}^{*}\leq\varepsilon_{k-1}, and that one wishes to use MISO-Prox again on GkG_{k} to compute xkx_{k}. Then, let us call DD^{\prime} the lower-bound of Gk1G_{k-1} produced by MISO-Prox when computing xk1x_{k-1} such that

Note that we do not index these quantities with k1k-1 or kk for the sake of simplicity. The convergence of MISO-Prox may ensure that not only do we have Gk1(xk1)Gkεk1G_{k-1}(x_{k-1})-G_{k}^{*}\leq\varepsilon_{k-1}, but in fact we have the stronger condition Gk1(xk1)D(xk1)εk1G_{k-1}(x_{k-1})-D^{\prime}(x_{k-1})\leq\varepsilon_{k-1}. Remember now that

and that DD^{\prime} is a lower-bound of Gk1G_{k-1}. Then, we may set for all ii in {1,,n}\{1,\ldots,n\}

which is equivalent to initializing the new instance of MISO-Prox with

and by choosing appropriate quantities ci0c_{i}^{0}. Then, the following function is a lower bound of GkG_{k}

and the new instance of MISO-Prox to minimize GkG_{k} and compute xkx_{k} will produce iterates, whose first point, which we call x0x^{0}, minimizes D0D_{0}. This leads to the relation

where we use the notation zˉ0=1ni=1nzi0\bar{z}^{0}=\frac{1}{n}\sum_{i=1}^{n}{z^{0}_{i}} and zˉ=1ni=1nzi\bar{z}^{\prime}=\frac{1}{n}\sum_{i=1}^{n}{z^{\prime}_{i}} as in Algorithm 2.

Then, it remains to show that the quantity GkD0(x0)G_{k}^{*}-D_{0}(x^{0}) is upper bounded in a similar fashion as Gk(xk1)GkG_{k}(x_{k-1})-G_{k}^{*} in Propositions 3.2 and 3.4 to obtain a similar result for MISO-Prox and control the number of inner-iterations. This is indeed the case, as stated in the next lemma.

When initializing MISO-Prox as described above, we have

where the last inequality is using a simple relation 12a2+2a,b+12b20\frac{1}{2}\|a\|^{2}+2\langle a,b\rangle+\frac{1}{2}\|b\|^{2}\geq 0. As a result,

We remark that this bound is half of the bound shown in (32). Hence, a similar argument gives the bound on the number of inner iterations. We may finally compute the iteration-complexity of accelerated MISO-Prox.

When FF is μ\mu-strongly convex, the accelerated MISO-Prox algorithm achieves the accuracy ε\varepsilon with an expected number of iteration upper bounded by

When n>2(Lμ)/μn>2(L-\mu)/\mu, there is no acceleration. The optimal value for κ\kappa is zero, and we may use Theorem 4.1 and Lemma C.1 to obtain the complexity

When n<2(Lμ)/μn<2(L-\mu)/\mu, there is an acceleration, with κ=2(Lμ)/μ  μ\kappa=2(L-\mu)/\mu\;-\mu. Let us compute the global complexity using the “template” presented in Appendix C. The number of outer iteration is given by

At each inner iteration, we initialize with the value x0x^{0} described above, and we use Lemma D.5:

With Miso-Prox, with have τGk=12n\tau_{G_{k}}=\frac{1}{2n}, thus the expected number of inner iteration is given by Lemma C.1:

To conclude, the complexity of the accelerated algorithm is given by

Appendix E Implementation Details of Experiments

In the experimental section, we compare the performance with and without acceleration for three algorithms SAG, SAGA and MISO-Prox on l2l_{2}-logistic regression problem. In this part, we clarify some details about the implementation of the experiments.

Firstly, we normalize the observed data before running the regression. Then we apply Catalyst using parameters according to the theoretical settings. Standard analysis of the logistic function shows that the Lipschitz gradient parameter LL is 1/41/4 and strongly convex parameter μ=0\mu=0 when there is no regularization. Adding properly a l2l_{2} term generates the strongly-convex regimes. Several parameters need to be fixed at the beginning stage. The parameter κ\kappa is set to its optimal value suggested by theory, which only depends on nn, μ\mu and LL. More precisely, κ\kappa writes as κ=a(Lμ)/(n+b)μ\kappa={a(L-\mu)}/{(n+b)}-\mu, with (a,b)=(2,2)(a,b)=(2,-2) for SAG, (a,b)=(1/2,1/2)(a,b)=(1/2,1/2) for SAGA and (a,b)=(1,1)(a,b)=(1,1) for MISO-Prox. The parameter α0\alpha_{0} is initialized as the positive solution of x2+(1q)x1=0x^{2}+(1-q)x-1=0 where q=μ/(μ+κ)q=\sqrt{\mu/(\mu+\kappa)}. Furthermore, since the objective function is always positive, F(x0)FF(x_{0})-F^{*} can be upper bounded by F(x0)F(x_{0}) which allow us to set the εk=(2/9)F(x0)(1ρ)k\varepsilon_{k}=(2/9)F(x_{0})(1-\rho)^{k} in the strongly convex case and εk=2F(x0)/9(k+2)4+η\varepsilon_{k}=2F(x_{0})/9(k+2)^{4+\eta} in the non-strongly convex case. Finally, we set the free parameter in the expression of εk\varepsilon_{k} as follows. We simply set ρ=0.9q\rho=0.9\sqrt{q} in the strongly convex case and η=0.1\eta=0.1 in the non strongly convex case.

To solve the subproblem at each iteration, the step-sizes parameter for SAG, SAGA and MISO are set to the values suggested by theory, which only depend on μ\mu, LL and κ\kappa. All of the methods we compare store nn gradients evaluated at previous iterates of the algorithm. For MISO, the convergence analysis in Appendix D leads to the initialization xk1+κμ+κ(yk1yk2)x_{k-1}+\frac{\kappa}{\mu+\kappa}(y_{k-1}-y_{k-2}) that moves xk1x_{k-1} closer to yk1y_{k-1} and further away from yk2y_{k-2}. We found that using this initial point for SAGA was giving slightly better results than xk1x_{k-1}.