Catalyst Acceleration for Gradient-Based Non-Convex Optimization

Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, Zaid Harchaoui

Introduction

We consider optimization problems of the form

Yet, nonconvex problems in machine learning are of high interest. For instance, the variable xx may represent the parameters of a neural network, where each term fi(x)f_{i}(x) measures the fit between xx and a data point indexed by ii, or (1) may correspond to a nonconvex matrix factorization problem (see Section 7). Besides, even when the data-fitting functions fif_{i} are convex, it is also typical to consider nonconvex regularization functions ψ\psi, for example for feature selection in signal processing . In this work, we address two questions from nonconvex optimization:

How to apply a method for convex optimization to a nonconvex problem?

How to design an algorithm which does not need to know whether the objective function is convex while obtaining the optimal convergence guarantee if the function is convex?

Several works attempted to transfer ideas from the convex world to the nonconvex one, see, e.g., . Our paper has a similar goal and studies the extension of Nesterov’s acceleration for convex problems to nonconvex composite ones. For C1C^{1}-smooth and nonconvex problems, gradient descent is optimal among first-order methods in terms of information based complexity to find an ε\varepsilon-stationary point [10, Theorem 2, Sec. 5]. Without additional assumptions, worst case complexity for first-order methods can not achieve better than O(ε2)\mathcal{O}(\varepsilon^{-2}) oracle queries . Under a stronger assumption that the objective function is C2C^{2}-smooth, state-of-the-art methods [e.g., 11] achieve a marginal gain with complexity O(ε7/4log(1/ε))O(\varepsilon^{-7/4}\log(1/\varepsilon)), and do not appear to generalize to composite or finite-sum settings. For this reason, our work fits within a broader stream of recent research on methods that do not perform worse than gradient descent in the nonconvex case (in terms of worst-case complexity), while automatically accelerating for minimizing convex functions. The hope when applying such methods to nonconvex problems is to see acceleration in practice, by heuristically exploiting convexity that is “hidden” in the objective (for instance, local convexity near the optimum, or convexity along the trajectory of iterates).

Inspired by Nesterov’s acceleration method for convex optimization , the first accelerated method performing universally well for nonconvex and convex problems was introduced in . Specifically, this work addresses composite problems such as (1) with n=1n=1, and, provided the iterates are bounded, it performs no worse than gradient descent on nonconvex instances with complexity O(ε2)O(\varepsilon^{-2}) on the gradient norm. When the problem is convex, it accelerates with complexity O(ε2/3)O(\varepsilon^{-2/3}). Extensions to accelerated Gauss-Newton type methods were also recently developed in . In a follow-up work, the authors of proposed a new scheme that monotonically interlaces proximal gradient descent steps and Nesterov’s extrapolation; thereby achieving similar guarantees as but without the need to assume the iterates to be bounded. Extensions when the gradient of ψ\psi is only Hölder continuous can also be devised. Whether accelerated methods are superior to gradient descent remains an open question in the nonconvex setting; their performance escaping saddle points faster than gradient descent has been observed .

In , a similar strategy is proposed, focusing instead on convergence guarantees under the so-called Kurdyka-Łojasiewicz inequality—a property corresponding to polynomial-like growth of the function, as shown by . Our scheme is in the same spirit as these previous papers, since it monotonically interlaces proximal-point steps (instead of proximal-gradient as in ) and extrapolation/acceleration steps. A fundamental difference is that our method is generic and accommodates inexact computations, since we allow the subproblems to be approximately solved by any method we wish to accelerate.

By considering C2C^{2}-smooth nonconvex objective functions ff with Lipschitz continuous gradient f\nabla f and Hessian 2f\nabla^{2}f, the authors of propose an algorithm with complexity O(ε7/4log(1/ε))O(\varepsilon^{-7/4}\log(1/\varepsilon)), based on iteratively solving convex subproblems closely related to the original problem. It is not clear if the complexity of their algorithm improves in the convex setting. Note also that the algorithm proposed in is inherently for C2C^{2}-smooth minimization and requires exact gradient evaluations. This implies that the scheme does not allow incorporating nonsmooth regularizers and can not exploit finite sum structure.

A stochastic scheme for minimizing (1) under the nonconvex but smooth setting were recently considered in . The method can be seen as a nonconvex variant of the stochastically controlled stochastic gradient (SCSG) methods . If the target accuracy is small, then the method performs no worse than nonconvex SVRG . If the target accuracy is large, the method achieves a rate better than SGD. The proposed scheme does not incorporate nonsmooth regularizers and it is unclear whether numerically the scheme performs as well as SVRG.

Finally, a stochastic method related to SVRG for minimizing large sums while automatically adapting to the weak convexity constant of the objective function is proposed in . When the weak convexity constant is small (i.e., the function is nearly convex), the proposed method enjoys an improved efficiency estimate. This algorithm, however, does not automatically accelerate for convex problems, in the sense that the overall rate is slower than O(ε3/2)O(\varepsilon^{-3/2}) in terms of target accuracy ε\varepsilon on the gradient norm.

Organization of the paper.

Section 2 presents mathematical tools for non-convex and non-smooth analysis, which are used throughout the paper. We provide a discussion of related works for solving the nonconvex and nonsmooth problem (1) in Section 3. In Sections 4 and 5, we introduce the main algorithm and important extensions, respectively. Section 6 presents global convergence guarantees of the scheme and convergence guarantees when the algorithm wraps specific algorithms such as SAGA, SVRG, and randomized coordinate descent. Finally, we present experimental results on matrix factorization and training of neural networks in Section 7.

Tools for nonconvex optimization

In this paper, we focus on a broad class of nonconvex functions known as weakly convex functions, which covers most of the cases of interest in machine learning and signal processing.

Weakly convex functions have appeared in a wide variety of contexts, and under different names. Some notable examples are globally lower-C2C^{2} , prox-regular , proximally smooth functions , and those functions whose epigraph has positive reach . We recall here basic definitions and classical results.

When ρ=0\rho=0, the above definition reduces to the classical definition of convex functions.

A function ff is ρ\rho-weakly convex if and only if the function fρf_{\rho} is convex, where

In order to prove the result, it suffices to show that

This follows by rearranging the terms in Equation (3). Next, we suppose ff is ρ\rho-weakly convex; hence Equation (2) holds. We observe that ρλ(1λ)2xy2\tfrac{\rho\lambda(1-\lambda)}{2}\|x-y\|^{2} can be rewritten as Equation (3). As a result, we conclude

Rearranging the terms, we get the desired result. ∎

If ff is twice differentiable, then ff is ρ\rho-weakly convex if and only if 2f(x)ρI\nabla^{2}f(x)\succeq-\rho I for all xx.

This follows from the observations that a twice differentiable function is convex if and only if 2f(x)0\nabla^{2}f(x)\succeq 0 and Proposition 2.3. ∎

Intuitively, a function is weakly convex when it is “nearly convex” up to a quadratic function. This represents a complementary notion to strong convexity.

If a function ff is differentiable and its gradient is Lipschitz continuous with Lipschitz parameter LL, then ff is LL-weakly convex.

Hence, we see the function fρ(x)f_{\rho}(x) is convex for ρ=L\rho=L and the result follows by applying Proposition 2.3. ∎

We remark that for most of the interesting machine learning problems, the smooth part of the objective function admits Lipchitz gradients, meaning that the function is in fact weakly convex.

2 Subdifferential

Convergence results for nonsmooth optimization typically rely on the concept of subdifferential. However, the generalization of the subdifferential to nonconvex nonsmooth function is not unique . With the weak convexity in hand, all these constructions coincide, and therefore we slightly abuse standard notation as set out in Rockafellar and Wets .

Thus, a vector ξ\xi lies in f(x)\partial f(x) whenever the linear function yf(x)+ξT(yx)y\mapsto f(x)+\xi^{T}(y-x) is a lower-model of ff, up to first-order around xx. In particular, the subdifferential f(x)\partial f(x) of a differentiable function ff is the singleton {f(x)}\{\nabla f(x)\}; while for a convex function ff it coincides with the subdifferential in the sense of convex analysis, see [49, Exercise 8.8]. Moreover, the following sum rule,

holds for any differentiable function gg.

In non-convex optimization, standard complexity bounds are derived to guarantee

Recall when ε=0\varepsilon=0, we are at a stationary point and satisfy first-order optimality conditions. For functions that are nonconvex, first-order methods search for points with small subgradients, which does not necessarily imply small function values, in contrast to convex functions where the two criteria are much closer related. In our convergence analysis, we will use the following differential characterization of ρ\rho-weakly convex functions, which generalize classical properties of convex functions.

Related work on weakly convex functions

For many machine learning problems, the objective functions includes a smooth component which is often assumed to have an LL-Lipschitz gradient. The precise relationship between the weak-convexity constant ρ\rho and the Lipschitz constant LL is given in Proposition 2.5:

If ff is differentiable and f(x)\nabla f(x) is LL-Lipschitz, then ff is ρ\rho-weakly convex for some ρL\rho\leq L.

Many functions with LL-Lipschitz gradients have weak-convexity constants ρ\rho which are smaller than LL. Our goal is to develop a method that exploits this property of the weak convexity constant for nonconvex functions while obtaining optimal convergence rates for convex problems. Up until now, nearly all the research for methods to solve the large finite sum problem (1) have assumed ρ=0\rho=0 (i.e. convex) or ρ=L\rho=L. We provide a short, selective list of convergence guarantees for a few popular approaches.

When ρ=L\rho=L, Full Gradient Descent (FG) finds a point satisfying f(x)ε\left\|\nabla f(x)\right\|\leq\varepsilon after O(nL/ε2)O(nL/\varepsilon^{2}) number of gradient computations.

When ρ=0\rho=0, AdaGrad uses regret guarantees in an online convex optimization setting. We are not aware of guarantees for convex optimization with finite-sum structure nor for non-convex optimization with finite-sum structure.

Stochastic methods based on variance-reduced stochastic gradients have recently been applied to nonconvex problems. The authors of propose for instance stochastic methods for minimizing (1) using variants of SVRG and SAGA under the assumption that ρ=L\rho=L. Their scheme works for both convex and nonconvex settings and achieves convergence guarantees of O(Ln/ε)O(Ln/\varepsilon) (convex) and O(n2/3L/ε2)O(n^{2/3}L/\varepsilon^{2}) (nonconvex) and includes a minibatch variant.

2 Behavior of finite-sum optimization methods when the objective is convex.

For the stochastic methods previously considered in the nonconvex setting, we note their convex rates. In , for convex objectives, the methods attain convergence guaranties of O(nL/ε)O(nL/\varepsilon). In both , they only focus on nonconvex problems, as such their convex rates are the same as the nonconvex setting, that is, O(ε2)O(\varepsilon^{-2}). When the objective is assumed to be convex, methods often achieve faster rates of convergence. Accelerated gradient methods designed by Nesterov are known to require O(nL2/3/ε2/3)O\left(nL^{2/3}/\varepsilon^{2/3}\right) number of gradient computations to obtain near stationary point, f(x)ε\left\|\nabla f(x)\right\|\leq\varepsilon, but only O(nnL/ε)O(n\sqrt{nL/\varepsilon}) number of gradient computations to obtain a near optimal point, f(x)fεf(x)-f^{*}\leq\varepsilon . This gap between the two guarantees was resolved in . By adding a regularization term and an additional known bound on x0x2\left\|x_{0}-x^{*}\right\|^{2}, one can improve the gradient complexity to O(nL/ε)O(n\sqrt{L/\varepsilon}). Without such a bound on the distance to the optimal solution set, it is unclear if one can improve the convergence rate. We will assume throughout this paper that we do not know a bound on xx02\left\|x^{*}-x_{0}\right\|^{2} for (1).

The authors of based their work off a class of algorithms called stochastically controlled stochastic gradient (SCSG) methods . In these methods, the functions fif_{i} are smooth and convex. The SCSG method satisfies: when the target accuracy is low, the method has the same O(ε2)O(\varepsilon^{-2}) rate as SGD but with a small data-dependent constant factor and when the target accuracy is high, the method has the same rate as the best non-accelerated methods, O(nL/ε)O(nL/\varepsilon).

3 Our results

In this paper, we design a generic method that performs no worse than gradient descent in the nonconvex case, while automatically accelerating for minimizing convex functions. In particular, we devise a single algorithm which adapts to the weak convexity constant if the objective is nonconvex, while also obtaining the accelerated rate of O(ε2/3)O(\varepsilon^{-2/3}) when the objective is convex. The hope is that by applying such methods to nonconvex problems we see acceleration in practice by heuristically exploiting convexity that is “hidden” in the objective function. Moreover, our algorithm applies to incremental methods SVRG/SAGA and randomized coordinate descent. Designing such an acceleration scheme for possibly nonconvex optimization problems is challenging. Whether convergence guarantees for optimization algorithms accelerated naively with classical Nesterov or momentum acceleration match gradient descent on nonconvex problems remains an open problem; yet in the vicinity of saddle points, accelerated gradient methods escape faster than gradient descent . Our scheme capitalizes on this valuable observation.

First, we consider the situation where the weak convexity constant ρ\rho is known. By interlacing incremental methods such as SVRG and SAGA, our proposed algorithm, Basic 4WD-Catalyst-SVRG/SAGA, where ρ\rho is known, finds an ε\varepsilon-approximate stationarity point of f(x)f(x) in gradient complexity

It is common in machine learning problems for the weak convexity constant to be unknown. Previous work in the area, namely , required the parameter ρ\rho to be specified (see Line 8 in Algorithms 1 and 2 of ). Our second method, 4WD-Catalyst-SVRG/SAGA, incorporates a procedure that eliminates the need to specify the weak convexity constant ρ\rho. The resulting method, 4WD-Catalyst-SVRG/SAGA, finds an ε\varepsilon-approximate stationarity point of f(x)f(x) in gradient complexity

The Basic 4WD-Catalyst algorithm for non-convex optimization

We now present a generic scheme (Algorithm 1) for applying a convex optimization method to minimize

where ff is only ρ\rho-weakly convex and ff is lower bounded. Our goal is to develop a unified framework that automatically accelerates in convex settings. Consequently, the scheme must be agnostic to the constant ρ\rho.

At the center of our meta algorithm (Algorithm 1) are two sequences of subproblems obtained by adding simple quadratics to ff. The proposed approach extends the Catalyst acceleration of and comes with a simplified convergence analysis. We next describe in detail each step of the scheme.

Here, κ\kappa is a regularization parameter and yy is called the prox-center. By adding the quadratic, we make the problem more “convex”: when ff is non convex, with a large enough κ\kappa, the subproblem will be convex; when ff is convex, we improve the conditioning of the problem.

At the kk-th iteration, given a previous iterate xk1x_{k-1} and the extrapolation term vk1v_{k-1}, we construct the two following subproblems.

Proximal point step. We first perform an inexact proximal point step with prox-center xk1x_{k-1}:

Accelerated proximal point step. Then we build the next prox-center yky_{k} as the convex combination

Next, we use yky_{k} as a prox-center and update the next extrapolation term:

where αk+1(0,1)\alpha_{k+1}\in(0,1) is a sequence of coefficients satisfying (1αk+1)/αk+12=1/αk2\small{(1-\alpha_{k+1})/\alpha^{2}_{k+1}=1/{\alpha_{k}}^{2}}. Essentially, the sequences (αk)k,(yk)k,(vk)k(\alpha_{k})_{k},(y_{k})_{k},(v_{k})_{k} are built upon the extrapolation principles of Nesterov .

Picking the best.

The proposed scheme blends the two steps in a synergistic way, allowing us to recover the near-optimal rates of convergence in both worlds: convex and non-convex. Intuitively, when xˉk\bar{x}_{k} is chosen, it means that Nesterov’s extrapolation step “fails” to accelerate convergence.

Stopping criterion for the subproblems.

In order to derive complexity bounds, it is important to properly define the stopping criterion for the proximal subproblems. When the subproblem is convex, a functional gap like fκ(z;x)infzfκ(z;x)f_{\kappa}(z;x)-\inf_{z}f_{\kappa}(z;x) may be used as a control of the inexactness, as in . Without convexity, this criterion cannot be used since such quantities can not be easily bounded. In particular, first order methods seek points whose subgradient is small. Since small subgradients do not necessarily imply small function values in a non-convex setting, first order methods only test for near-stationarity is small subgradients. In contrast, in the convex setting, small subgradients imply small function values; thus a first order method in the convex setting can “test” for small function values. Hence, we cannot use a direct application of Catalyst which uses the functional gap as a stopping criteria. Because we are working in the nonconvex setting, we include a stationarity stopping criteria.

We propose to use jointly the following two types of stopping criteria:

Descent condition: fκ(z;y)fκ(y;y)f_{\kappa}(z;y)\leq f_{\kappa}(y;y);

Adaptive stationary condition: \text{\rm dist}\big{(}0,\partial f_{\kappa}(z;y)\big{)}<\kappa\left\|z-y\right\|.

Without the descent condition, the stationarity condition is insufficient for defining a good stopping criterion because of the existence of local maxima in nonconvex problems. In the nonconvex setting, local maxima and local minima satisfy this stationarity condition. The descent condition ensures the iterates generated by the algorithm always decrease the value of the objective function ff; thus ensuring we move away from local maxima. The second criterion, adaptive stationary condition, provides a flexible relative tolerance on termination of algorithm used for solving the subproblems; a detailed analysis is forthcoming.

In Basic 4WD-Catalyst, we use both the stationary condition and the descent condition as a stopping criteria to produce the point xˉ\bar{x}:

2 Convergence analysis.

We present here the theoretical properties of Algorithm 1. In this first stage, we do not take into account the complexity of solving the subproblems (7) and (9). For the next two theorems, we assume that the stopping criteria for the proximal subproblems are satisfied at each iteration of Algorithm 1.

Suppose that the function ff is lower bounded. For any κ>0\kappa>0 and N1N\geq 1, the iterates generated by Algorithm 1 satisfy

It is important to notice that this convergence result is valid for any κ\kappa and does not require it to be larger than the weak convexity parameter. As long as the stopping criteria for the proximal subproblems are satisfied, the quantities dist(0,f(xjˉ)){\rm dist}(0,\partial f(\bar{x_{j}})) tend to zero. The proof is inspired by that of inexact proximal algorithms and appears in Appendix B.

If the function ff turns out to be convex, the scheme achieves a faster convergence rate both in function values and in stationarity:

If the function ff is convex, then for any κ>0\kappa>0 and N1N\geq 1, the iterates generated by Algorithm 1 satisfy

where xx^{*} is any minimizer of the function ff.

The proof of Theorem 4.2 appears in Appendix B. This theorem establishes a rate of O(N2)O(N^{-2}) for suboptimality in function value and convergence in O(N3/2)O(N^{-3/2}) for the minimal norm of subgradients. The first rate is optimal in terms of information-based complexity for the minimization of a convex composite function . The second can be improved to O(N2log(N))O(N^{-2}\log(N)) through a regularization technique, if one knew in advance that the function is convex and had an estimate on the distance of the initial point to an optimal solution .

The 4WD-Catalyst algorithm

In this section, we work towards understanding the global efficiency of Algorithm 1, which automatically adapts to the weak convexity parameter. For this, we must take into account the cost of approximately solving the proximal subproblems to the desired stopping criteria. We expect that once the subproblem becomes strongly convex, the given optimization method M\mathcal{M} can solve it efficiently. For this reason, we first focus on the computational cost for solving the sub-problems, before introducing a new algorithm with known worst-case complexity.

When κ\kappa is large enough, the subproblems become strongly convex; thus globally solvable. Henceforth, we will assume that M\mathcal{M} satisfies the following natural linear convergence assumption.

We assume that for any κ>ρ\kappa>\rho, there exist Aκ0A_{\kappa}\geq 0 and τκ(0,1)\tau_{\kappa}\in(0,1) so that the following hold:

where fκ(y):=infzfκ(z;y)f_{\kappa}(y)^{*}:=\inf_{z}f_{\kappa}(z;y). If the method M{\mathcal{M}} is randomized, we require the same inequality to hold in expectation.

The rates τκ\tau_{\kappa} and the constants AκA_{\kappa} are increasing in κ\kappa.

Our assumption on the linear rate of convergence of M{\mathcal{M}} differs from the one considered by , which was given in terms of function values. However, if the problem is a composite one, both points of view are near-equivalent, as discussed in Section A and the precise relationship is given in Appendix C. We choose the norm of the subgradient as our measurement because the complexity analysis is easier.

Then, a straightforward analysis bounds the computational complexity to achieve an ε\varepsilon-stationary point.

Let us consider a strongly convex problem fκ(;y)f_{\kappa}(\cdot;y) and a linearly convergent method M{\mathcal{M}} generating a sequence of iterates {zt}t0\{z_{t}\}_{t\geq 0}. Define T(\varepsilon)=\inf\{t\geq 1,\text{\rm dist}\big{(}0,\partial f_{\kappa}(z_{t};y)\big{)}\leq\varepsilon\}, where ε\varepsilon is the target accuracy; then,

As we can see, we only lose a factor in the log term by switching from deterministic to randomized algorithms. For the sake of simplicity, we perform our analysis only for deterministic algorithms and the analysis for randomized algorithms holds in the same way in expectation.

We can now prove a global complexity bound for Basic 4WD-Catalyst if the weak convexity constant ρ\rho is known. For this, we introduce κcvx\kappa_{\text{\rm cvx}}, a M\mathcal{M} dependent smoothing parameter and set it in the same way as the smoothing parameter in .

Algorithm 1 generates a point satisfying dist(0,f(x))ε\text{dist}(0,\partial f(x))\leq\varepsilon after at most

If ff is convex, then Algorithm 1 generates a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 1 generates a point xx satisfying f(x)fεf(x)-f^{*}\leq\varepsilon after at most

Bounding the required iterations when κ>ρ𝜅𝜌\kappa>\rho and restart strategy.

Recall that we add a quadratic to ff with the hope to make each subproblem convex. Thus, if ρ\rho is known, then we should set κ>ρ\kappa>\rho. In this first stage, we show that whenever κ>ρ\kappa>\rho, then the number of inner calls to M\mathcal{M} can be bounded with a proper initialization. Consider the subproblem

and define the initialization point z0z_{0} by

if f=f0+ψf=f_{0}+\psi is composite, with f0f_{0} LL-smooth, then set z0=proxηψ(yηf0(y))z_{0}=\text{prox}_{\eta\psi}(y-\eta\nabla f_{0}(y)) with η1L+κ\eta\leq\frac{1}{L+\kappa}.

Consider the subproblem (17) and suppose κ>ρ\kappa>\rho. Then initializing M\mathcal{M} at the previous z0z_{0} generates a sequence of iterates (zt)t0(z_{t})_{t\geq 0} such that

the output zTz_{T} satisfies fκ(zT;y)fκ(z0;y)f_{\kappa}(z_{T};y)\leq f_{\kappa}(z_{0};y) (descent condition) and dist(0,fκ(zT;y))κzTy\text{dist}(0,\partial f_{\kappa}(z_{T};y))\leq\kappa\left\|z_{T}-y\right\| (adaptive stationary condition);

in at most Sκlog(k+1)S_{\kappa}\log(k+1) iterations where

the output zSz_{S} satisfies dist(0,fκ(zS;y))κk+1zSy\text{dist}(0,\partial f_{\kappa}(z_{S};y))\leq\frac{\kappa}{k+1}\left\|z_{S}-y\right\| (modified adaptive stationary condition).

The proof is technical and is presented in Appendix D. The lesson we learn here is that as soon as the subproblem becomes strongly convex, it can be solved in almost a constant number of iterations. Herein arises a problem–the choice of the smoothing parameter κ\kappa. On one hand, when ff is already convex, we may want to choose κ\kappa small in order to obtain the desired optimal complexity. On the other hand, when the problem is non convex, a small κ\kappa may not ensure the strong convexity of the subproblems. Because of such different behavior according to the convexity of the function, we introduce an additional parameter κcvx\kappa_{\text{\rm cvx}} to handle the regularization of the extrapolation step. Moreover, in order to choose a κ>ρ\kappa>\rho in the nonconvex case, we need to know in advance an estimate of ρ\rho. This is not an easy task for large scale machine learning problems such as neural networks. Thus we propose an adaptive step to handle it automatically.

2 4WD-Catalyst: adaptation to weak convexity

We now introduce 4WD-Catalyst, presented in Algorithm 2, which can automatically adapt to the unknown weak convexity constant of the objective. The algorithm relies on a procedure to automatically adapt to ρ\rho, described in Algorithm 3.

Descent condition: fκ(zT;x)fκ(x;x)f_{\kappa}(z_{T};x)\leq f_{\kappa}(x;x);

Adaptive stationary condition: \text{\rm dist}\big{(}0,\partial f_{\kappa}(z_{T};x)\big{)}\leq\kappa\left\|z_{T}-x\right\|.

Thus, if either condition is not satisfied, then the subproblem is deemed not convex and we double κ\kappa and repeat. The procedure yields an estimate of ρ\rho in a logarithmic number of increases; see Lemma D.3.

We shall see in the experiments that our strategy of predefining TT and SS works quite well. The theoretical bounds we derive are, in general, too conservative; we observe in our experiments that one may choose TT and SS significantly smaller than the theory suggests and still retain the stopping criteria.

To derive the global complexity results for 4WD-Catalyst that match optimal convergence guarantees, we make a distinction between the regularization parameter κ\kappa in the proximal point step and in the extrapolation step. For the proximal point step, we apply Algorithm 3 to adaptively produce a sequence of κk\kappa_{k} initializing at κ0>0\kappa_{0}>0, an initial guess of ρ\rho. The resulting xˉk\bar{x}_{k} and κk\kappa_{k} satisfy both the following inequalities:

3 Convergence analysis

Fix real constants κ0,κcvx>0\kappa_{0},\kappa_{\text{\rm cvx}}>0, the function ff is lower bounded, and x0dom fx_{0}\in\text{dom}~{}f. Set κmax:=maxk1κk\kappa_{\max}:=\max_{k\geq 1}\kappa_{k}. Suppose that the number of iterations TT is such that xˉk\bar{x}_{k} satisfies (20). Define f:=limkf(xk)f^{*}:=\lim_{k\to\infty}f(x_{k}). Then for any N1N\geq 1, the iterates generated by Algorithm 2 satisfy,

where xx^{*} is any minimizer of the function ff.

Suppose the stopping criteria are (20) and (21) as in in Theorem 5.5, and choose TT and SS in Algorithm 2 to be the smallest numbers satisfying

Then κmax4L\kappa_{\max}\leq 4L and the following hold for any index k1k\geq 1:

Appendix D is devoted to proving Theorem 5.6, but we outline below the general procedure and state the two main propositions (see Proposition 5.7 and Proposition 5.8).

We summarize the proof of Theorem 5.6 as followed:

When κ>ρ+L\kappa>\rho+L, we compute the number of iterations of M\mathcal{M} to produce a point satisfying (20). Such a point will become xˉk\bar{x}_{k}.

We compute the smallest number of times we must double κ0\kappa_{0} until it becomes larger than ρ+L\rho+L. Thus eventually the condition 4Lκ>ρ+L4L\geq\kappa>\rho+L will occur.

The next proposition shows that Auto-adapt terminates with a suitable choice for xˉk\bar{x}_{k} after TT number of iterations.

Suppose ρ+L<κ4L\rho+L<\kappa\leq 4L. By initializing the method M{\mathcal{M}} using the strategy suggested in Algorithm 2 for solving

we may run the method M{\mathcal{M}} for at least TT iterations, where

then, the output zTz_{T} satisfies fκ(zT;x)fκ(x;x)f_{\kappa}(z_{T};x)\leq f_{\kappa}(x;x) and \text{dist}\big{(}0,\partial f_{\kappa}(z_{{T}};x)\big{)}\leq\kappa\left\|z_{T}-x\right\|.

Under the additional assumption that the function ff is convex, we produce a point with (21) when the number of iterations SS is chosen sufficiently large.

Consider the method M{\mathcal{M}} with the initialization strategy suggested in Algorithm 2 for minimizing fκcvx(;yk)f_{\kappa_{\text{\rm cvx}}}(\cdot;y_{k}) with linear convergence rates of the form (16). Suppose the function ff is convex. If the number of iterations of M\mathcal{M} is greater than

We can now derive global complexity bounds by combining Theorem 5.5 and Theorem 5.6, and a good choice for the constant κcvx\kappa_{\text{\rm cvx}}.

Algorithm 2 generates a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 generates a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 generates a point xx satisfying f(x)fεf(x)-f^{*}\leq\varepsilon after at most

In general, the linear convergence parameter of M{\mathcal{M}}, τκ\tau_{\kappa}, depends on the condition number of the problem fκf_{\kappa}. Here, τL\tau_{L} and τκcvx\tau_{\kappa_{\text{\rm cvx}}} are precisely given by plugging in κ=L\kappa=L and κcvx\kappa_{\text{\rm cvx}} respectively into τκ\tau_{\kappa}. To clarify, let M{\mathcal{M}} be SVRG, τκ\tau_{\kappa} is given by 1n+κ+Lκ\frac{1}{n+\frac{\kappa+L}{\kappa}} which yields τL=1/(n+2)\tau_{L}=1/(n+2). A more detailed computation is given in Table 3. For all the incremental methods we considered, these parameters τL\tau_{L} and τκ\tau_{\kappa} are on the order of 1/n1/n.

If M\mathcal{M} is a first order method, the convergence guarantee in the convex setting is near-optimal, up to logarithmic factors, when compared to O(1/ε)O(1/\sqrt{\varepsilon}) . In the non-convex setting, our approach matches, up to logarithmic factors, the best known rate for this class of functions, namely O(1/ε2)O(1/\varepsilon^{2}) . Moreover, our rates dependence on the dimension and Lipschitz constant equals, up to log factors, the best known dependencies in both the convex and nonconvex setting. These logarithmic factors may be the price we pay for having a generic algorithm.

Applications to Existing Algorithms

We now show how to accelerate existing algorithms M\mathcal{M} and compare the convergence guaranties before and after 4WD-Catalyst. In particular, we focus on the gradient descent algorithm, randomized coordinate descent, and on the incremental methods SAGA and SVRG. For all the algorithms considered, we state the convergence guaranties in terms of the total number of iterations (in expectation, if appropriate) to reach an accuracy of ε\varepsilon; in the convex setting, the accuracy is stated in terms of functional error, f(x)inff<εf(x)-\inf f<\varepsilon and in the nonconvex setting, the appropriate measure is stationarity, namely dist(0,f(x))<ε\text{dist}(0,\partial f(x))<\varepsilon. All the algorithms considered have formulations for the composite setting with analogous convergence rates.

Table 2 presents convergence rates for SAGA , (prox) SVRG , randomized coordinate descent (Rand. CD) , and gradient descent (FG).

The original SVRG has no guarantee for nonconvex functions. However a nonconvex extension of SVRG was proposed in . Their convergence rate gives a better dependence on nn compared to ours, namely O(n2/3Lε2)O(\frac{n^{2/3}L}{\varepsilon^{2}}). This is achieved thanks to a mini-batching strategy. In order to obtain a similar dependency on nn, we need a tighter bound for SVRG with mini-batching applied to μ\mu-strongly convex problems, namely O((n2/3+Lμ)log(1ε))O\left(\left(n^{2/3}+\frac{L}{\mu}\right)\log\left(\frac{1}{\varepsilon}\right)\right). To the best of our knowledge, such a rate is currently unknown. Therefore, for ncvx-SVRG, we present the results without mini-batching. With mini-batching, the same convergence rate can be obtained by using a batch size b=n2/3b=n^{2/3} and a stepsize O(1/L)O(1/L). Similarly for ncvx-SAGA. For Rand. CD, we present the results for a smooth function ff, with LmaxL_{\max} the max. of the coordinate-wise Lipschitz constants for f\nabla f and pp is the dimension of the domain of ff.

The smoothing parameter κcvx\kappa_{\text{\rm cvx}} drives the convergence rate of 4WD-Catalyst in the convex setting. To determine κcvx\kappa_{\text{\rm cvx}}, we pretend ρ=0\rho=0 and compute the global complexity of our scheme. As such, we end up with the same complexity result as Catalyst . Following their work, the rule of thumb is to maximize the ratio τκ/L+κ\tau_{\kappa}/\sqrt{L+\kappa} for convex problems. On the other hand, the choice of κ0\kappa_{0} is independent of M\mathcal{M}; it is an initial lower estimate for the weak convexity constant ρ\rho. In practice, we typically choose κ0=κcvx\kappa_{0}=\kappa_{\text{\rm cvx}}; For incremental approaches a natural heuristic is also to choose S=T=nS=T=n, meaning that SS iterations of M{\mathcal{M}} performs one pass over the data. In Table 3, we present the values of κcvx\kappa_{\text{\rm cvx}} used for various algorithms, as well as other quantities that are useful to derive the convergence rates.

A first illustration is the algorithm obtained when accelerating the regular “full” gradient (FG). Here, the optimal choice for κcvx\kappa_{\text{\rm cvx}} is LL. In the convex setting, we get an accelerated rate of O(nL/εlog(1/ε))O(n\sqrt{L/\varepsilon}\log(1/\varepsilon)) which agrees with Nesterov’s accelerated variant (AFG) up to logarithmic factors. On the other hand, in the nonconvex setting, our approach achieves no worse rate than O(nL/ε2log(1/ε))O(nL/\varepsilon^{2}\log(1/\varepsilon)), which agrees with the standard gradient descent up to logarithmic factors. We note that under stronger assumptions, namely C2C^{2}-smoothness of the objective, the accelerated algorithm in achieves the same rate as (AFG) for the convex setting and O(ε7/4log(1/ε))O(\varepsilon^{-7/4}\log(1/\varepsilon)) for the nonconvex setting. Their approach, however, does not extend to composite setting nor to stochastic methods. Our marginal loss is the price we pay for considering a much larger class of functions.

Randomized Coordinate Descent (Rand. CD).

Randomized incremental gradient.

We now consider randomized incremental gradient methods such as SAGA and (prox) SVRG . Here, the optimal choice for κcvx\kappa_{\text{\rm cvx}} is O(L/n)O(L/n). Under the convex setting, we achieve an accelerated rate of O(nL/εlog(1/ε))O(\sqrt{n}\sqrt{L/\varepsilon}\log(1/\varepsilon)). A direct application of SVRG and SAGA have no convergence guarantees in the non-convex setting. With our approach, the resulting algorithm matches the guarantees for FG up to log factors.

2 Detailed derivation of convergence rates

Using the values of Table 3, we may now specialize our convergence results to different methods. Many of the linearly convergent methods (e.g. Rand. CD and incremental methods) state convergence results in terms of function values instead of subgradients as in Equation (16). We relate function values to subgradients by using the Lipschitz constant LL:

To compute the parameters τL\tau_{L}, κcvx\kappa_{\text{\rm cvx}}, etc, we use the convergence analysis from for full gradient:

, where xx^{*} is the optimal solution of ff.

With this result, the number of iterations in the inner loop are

The global complexity for gradient descent is

Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying f(x)fεf(x)-f^{*}\leq\varepsilon after at most

Rand. CD.

We use the convergence analysis from [Algorithm 3, Theorem 1]:

Set Lmax=maxi=1,,pLiL_{\max}=\max_{i=1,\ldots,p}L_{i}. Then the iterates of Algorithm 3 in satisfy

For the Rand. CD, the number of iterations in the inner loop are

Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying f(x)fεf(x)-f^{*}\leq\varepsilon after at most

SVRG.

We use the convergence analysis established in :

Suppose the function 1/ni=1nfi1/n\sum_{i=1}^{n}f_{i} is LL-Lipschitz and the function ff is μ\mu-strongly convex. Choose the real constant 0<θ<1/40<\theta<1/4 sufficiently small so that

Then the Prox-SVRG method in has geometric convergence in expectation:

In particular, each stage requires n+100L/μn+100L/\mu component gradient evaluations so the overall complexity is

For SVRG, the number of iterations in the inner loop are

The global complexity for SVRG when nn is sufficiently large is

Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most

If ff is convex, then Algorithm 2 will generate a point xx satisfying f(x)fεf(x)-f^{*}\leq\varepsilon after at most

SAGA.

We observe that the variables for SAGA are the same as for SVRG up to a multiplicative factors. Therefore, the global complexities results for SAGA are, up to constant factors, the same as SVRG.

Experiments

We investigate the performance of 4WD-Catalyst on two standard non-convex problems in machine learning, namely on sparse matrix factorization and on training a simple two-layer neural network.

We report experimental results of 4WD-Catalyst when applied to the incremental algorithms SVRG and SAGA , and consider the following variants:

ncvx SVRG/SAGA with its theoretical stepsize η=1/Ln2/3\eta=1/Ln^{2/3}.

a minibatch variant of ncvx SVRG/SAGA with batch size b=n2/3b=n^{2/3} and stepsize η=1/L\eta=1/L.

SVRG/SAGA with large stepsize η=1/L\eta=1/L. This is variant of SVRG/SAGA, whose stepsize is not justified by theory for nonconvex problems, but which performs well in practice.

4WD-Catalyst SVRG/SAGA with its theoretical stepsize η=1/2L\eta=1/2L.

The algorithm SVRG (resp. SAGA) was originally designed for minimizing convex objectives. The nonconvex version was developed in , using a significantly smaller stepsize η=1/Ln2/3\eta=1/Ln^{2/3}. Following , we also include in the comparison a heuristic variant that uses a large stepsize η=1/L\eta=1/L, where no theoretical guarantee is available for nonconvex objectives. 4WD-Catalyst SVRG and 4WD-Catalyst SAGA use a similar stepsize, but the Catalyst mechanism makes this choice theoretically grounded.

Comparison with popular stochastic algorithms.

We also include as baselines three popular stochastic algorithms: stochastic gradient descent (SGD), AdaGrad , and Adam .

AdaGrad with stepsize η=0.1\eta=0.1 or 0.010.01.

Adam with stepsize α=0.01\alpha=0.01 or 0.0010.001, β1=0.9\beta_{1}=0.9, and β2=0.999\beta_{2}=0.999.

The stepsize (learning rate) of these algorithms are manually tuned to output the best performance. Note that none of them, SGD, AdaGrad , or Adam enjoys linear convergence when the problem is strongly convex. Therefore, we do not apply 4WD-Catalyst to these algorithms. SGD is used in both experiments, whereas AdaGrad and Adam are used only on the neural network experiments and not on sparse matrix factorization since it is unclear how to apply it to a nonsmooth objective.

Parameter settings.

We start from an initial estimate of the Lipschitz constant LL and use the theoretically recommended κ0=κcvx=2L/n\kappa_{0}=\kappa_{\text{cvx}}=2L/n in 4WD-Catalyst. We set the number of inner iterations T=S=nT=S=n in all experiments which means making at most one pass over the data to solve each sub-problem. Moreover, the log(k)\log(k) dependency dictated by the theory is dropped while solving the subproblem in (18). These choices turn out to be justified a posteriori, as both SVRG and SAGA have a much better convergence rate in practice than the theoretical rate derived from a worst-case analysis. Indeed, in all experiments, one pass over the data to solve each sub-problem was found to be enough to guarantee sufficient descent.

Sparse matrix factorization a.k.a. dictionary learning.

Neural networks.

Initial estimates of L𝐿L.

The proposed algorithm 4WD-Catalyst requires an initial estimate of the Lipschitz constant LL. In the problems we are considering, there is no simple closed form formula available to compute an estimate of LL. We use the following heuristics to estimate LL:

For matrix factorization, it can be shown that the function fif_{i} defined in (24) is differentiable according to Danskin’s theorem [see Bertsekas , Proposition B.25] and its gradient is given by

If the coefficients αi\alpha_{i} were fixed, the gradient would be linear in DD and thus admit αi2\left\|\alpha_{i}\right\|^{2} as Lipschitz constant. Therefore, when initializing our algorithm at D0D_{0}, we find αi(D0)\alpha_{i}(D_{0}) for any i[1,n]i\in[1,n] and use maxi[1,n]αi(D0)2\max_{i\in[1,n]}{\left\|\alpha_{i}(D_{0})\right\|^{2}} as an estimate of LL.

For neural networks, the formulation we are considering is differentiable. We randomly generate two pairs of weight vectors (W1,W2)(W_{1},W_{2}) and (W1,W2)(W_{1}^{\prime},W_{2}^{\prime}) and use the quantity

as an estimate of the Lipschitz constant, where fif_{i} denotes the loss function respect to ii-th training sample (ai,bi)(a_{i},b_{i}). We separate weights in each layer to estimate the Lipschitz constant per layer. Indeed the scales of the weights can be quite different across layers.

Computational cost.

We report in Figure 6 an experimental study where we vary SS on the neural network example. In terms of number of iterations, of course, the larger SkS_{k} the better the performance. This is not surprising as we solve each subproblem more accurately. Nevertheless, in terms of number of gradient evaluations, the relative performance is reversed. There is clearly no benefit to take larger SkS_{k}. This justifies in hindsight our choice of setting S=nS=n.

Experimental conclusions.

In the matrix factorization experiments in Fig. 2 and Fig. 3, 4WD-Catalyst-SVRG/SAGA were always competitive, with a similar performance to the heuristic SVRG/SAGA-η=1/L\eta=1/L in two cases out of three, while being significantly better as soon as the amount of data nn was large enough. As expected, the variants of SVRG with theoretical stepsizes have slow convergence, but exhibit a stable behavior compared to SVRG-η=1/L\eta=1/L. This confirms the remarkable ability of 4WD-Catalyst-SVRG/SAGA to adapt to nonconvex terrains.

In the neural network experiments, we observe that 4WD-Catalyst-SVRG converges much faster overall in terms of objective values than other algorithms. Yet Adam and AdaGrad often perform well-during the first iterations, they oscillate a lot, which is a behavior commonly observed. In constrast, 4WD-Catalyst-SVRG always decreases and keeps decreasing while other algorithms tend to stabilize, hence achieving significantly lower objective values.

More interestingly, as the algorithm proceeds, the subgradient norm may increase at some point and then decrease, while the function value keeps decreasing. This suggests that the extrapolation step, or the Auto-adapt procedure, is helpful to escape bad stationary points, e.g., saddle-points. We leave the study of this particular phenomenon as a potential direction for future work.

Acknowledgments.

The authors would like to thank J. Duchi for fruitful discussions related to this work. C. Paquette was partially supported by the “Learning in Machines and Brains” program of CIFAR. H. Lin and J. Mairal were supported by ERC grant SOLARIS (# 714381) and ANR grant MACARON (ANR-14-CE23-0003-01). D. Drusvyatskiy was supported by AFOSR YIP FA9550-15-1-0237, NSF DMS 1651851, and NSF CCF 1740551 awards. Z. Harchaoui was supported by NSF CCF 1740551 award, the “Learning in Machines and Brains” program of CIFAR, and faculty research awards. This work was performed while C. Paquette was at University of Washington and H. Lin was at Inria.

Appendix A Convergence rates in strongly-convex composite minimization

We now briefly discuss convergence rates, which are typically given in different forms in the convex and non-convex cases. If the weak-convex constant is known, we can form a strongly convex approximation similar to . For that purpose, we consider a strongly-convex composite minimization problem

Let xx^{*} be the minimizer of hh and hh^{*} be the minimal value of hh. In general, there are three types of measures of optimality that one can monitor: xx2\|x-x^{*}\|^{2}, h(x)hh(x)-h^{*}, and dist(0,h(x))\text{dist}(0,\partial h(x)).

Since hh is strongly convex, the three of them are equivalent in terms of convergence rates if one can take an extra prox-gradient step:

To see this, define the displacement vector, also known as the gradient mapping, gL(x):=L(x[x]L)g_{L}(x):=L(x-[x]_{L}), and notice the inclusion gL(x)h([x]L)g_{L}(x)\in\partial h([x]_{L}). In particular gL(x)=0g_{L}(x)=0 if and only if xx is the minimizer of hh. These next inequalities follow directly from Theorem 2.2.7 in :

Thus, an estimate of any one of the four quantities xx\|x-x^{*}\|, h(x)hh(x)-h^{*}, gL(x)\|g_{L}(x)\|, or dist(0,h(x))\text{dist}(0,\partial h(x)) directly implies an estimate of the other three evaluated either at xx or at [x]L[x]_{L}.

Appendix B Theoretical analysis of the basic algorithm

We present here proofs of the theoretical results of the paper. All throughout the proofs, we shall work under the Assumptions on ff stated in Section 4 and the Assumptions on M\mathcal{M} stated in Section 5.

In Theorem 4.1 and Theorem 4.2 under an appropriate tolerance policy on the proximal subproblems (7) and (9), Basic 4WD-Catalyst performs no worse than an exact proximal point method in general, while automatically accelerating when ff is convex. For this, we need the following observations.

Suppose the sequence {αk}k1\{\alpha_{k}\}_{k\geq 1} is produced by Algorithm 1. Then, the following bounds hold for all k1k\geq 1:

This result is noted without proof in a remark of . For completeness, we give below a simple proof using induction. Clearly, the statement holds for k=1k=1. Assume the inequality on the right-hand side holds for kk. By using the induction hypothesis, we get

as claimed and the expression for αk+1\alpha_{k+1} is given by explicitly solving (11). To show the lower bound, we note that for all k1k\geq 1, we have

Using the established upper bound αk2k+1\alpha_{k}\leq\frac{2}{k+1} yields

Suppose y+y^{+} satisfies dist(0,fκ(y+;y))<ε\text{dist}(0,\partial f_{\kappa}(y^{+};y))<\varepsilon. Then, the inequality holds:

We can find ξfκ(y+;y)\xi\in\partial f_{\kappa}(y^{+};y) with ξε\left\|\xi\right\|\leq\varepsilon. Taking into account fκ(y+;y)=f(y+)+κ(y+y)\partial f_{\kappa}(y^{+};y)=\partial f(y^{+})+\kappa(y^{+}-y) the result follows. ∎

Next we establish convergence guarantees of Theorem 4.1 and Theorem 4.2 for Basic 4WD-Catalyst .

The proof of Theorem 4.1 follows the analysis of inexact proximal point method . The descent condition in (13) implies {f(xk)}k0\{f(x_{k})\}_{k\geq 0} are monotonically decreasing. From this, we deduce

Using the adaptive stationarity condition (13), we apply Lemma B.2 with y=xk1y=x_{k-1}, y+=xˉky^{+}=\bar{x}_{k} and ε=κxˉkxk1\varepsilon=\kappa\left\|\bar{x}_{k}-x_{k-1}\right\|; hence we obtain

We combine the above inequality with (25) to deduce

We substitute x=αkx+(1αk)xk1x=\alpha_{k}x^{*}+(1-\alpha_{k})x_{k-1} where xx^{*} is any minimizer of ff. Using the convexity of ff, the norm of ξk\xi_{k}, and Equations (8) and (10), we deduce

Set θk=1k+1\theta_{k}=\frac{1}{k+1}. Completing the square on Equation (27), we obtain

Denote Ak:=1θk2A_{k}:=1-\theta_{k}^{2}. Subtracting ff^{*} from both sides and using the inequality 1αkαk2=1αk12\frac{1-\alpha_{k}}{\alpha_{k}^{2}}=\frac{1}{\alpha_{k-1}^{2}} and α11\alpha_{1}\equiv 1, we derive the following recursion argument:

The last inequality follows because 0<Ak110<A_{k-1}\leq 1. Iterating NN times,we deduce

thereby concluding the result. Summing up (26) from j=N+1j=N+1 to 2N2N, we obtain

Combining this inequality with (28), the result is shown. ∎

Appendix C Analysis of 4WD-Catalyst and Auto-adapt

Our assumption on the linear rate of convergence of M\mathcal{M} (see (16)) may look strange at first sight. Nevertheless, most linearly convergent first-order methods M\mathcal{M} for composite minimization either already satisfy this assumption or can be made to satisfy it by introducing an extra prox-gradient step. To see this, recall the convex composite minimization problem from Section A

for each t=0,1,2,,t=0,1,2,\ldots,\infty. To bring such convergence guarantees into the desired form (16), define the prox-gradient step

and notice the inclusion gL(z)h([z]L)g_{L}(z)\in\partial h([z]_{L}). The following inequality follows from :

Thus, the linear rate of convergence (30) implies

which is exactly in the desired form (16).

C.1 Convergence analysis of the adaptive algorithm: 4WD-Catalyst

First, under some reasonable assumptions on the method M\mathcal{M} (see Section 5.1), the sub-method Auto-adapt terminates.

Assume that τκ1\tau_{\kappa}\to 1 when κ+\kappa\to+\infty. The procedure Auto-adapt(x,κ,ε,T)(x,\kappa,\varepsilon,T) terminates after finitely many iterations.

Due to our assumptions on M\mathcal{M} and the expressions fκ(x;x)=f(x)f_{\kappa}(x;x)=f(x) and fκ(x)ff_{\kappa}^{*}(x)\geq f^{*}, we have

Since τκ\tau_{\kappa} tends to one, for all sufficiency large κ\kappa, we can be sure that the right-hand-side is smaller than ε2\varepsilon^{2}. On the other hand, for κ>ρ\kappa>\rho, the function fκ(;x)f_{\kappa}(\cdot;x) is (κρ)(\kappa-\rho)-strongly convex and therefore we have dist2(0,fκ(zT;x))2(κρ)(fκ(zT;x)fκ(x))\text{dist}^{2}(0,\partial f_{\kappa}(z_{T};x))\geq 2(\kappa-\rho)(f_{\kappa}(z_{T};x)-f_{\kappa}^{*}(x)). Combining this with (31), we deduce

Letting κ\kappa\to\infty, we deduce fκ(zT;x)f(x)f_{\kappa}(z_{T};x)\leq f(x), as required. Thus the loop indeed terminates. ∎

We prove the main result, Theorem 5.5, for 4WD-Catalyst.

The proof closely resembles the proofs of Theorem 4.2 and Theorem 4.2, so we omit some of the details. The main difference in the proof is that we keep track of the effects the parameters κcvx\kappa_{\text{\rm cvx}} and κ0\kappa_{0} have on the inequalities as well as the sequence of κk\kappa_{k}. Since {f(xk)}k0\{f(x_{k})\}_{k\geq 0} are monotonically decreasing, we deduce

Using the adaptive stationary condition (20), we apply Lemma B.2 with ε=κkxˉkxk1\varepsilon=\kappa_{k}\left\|\bar{x}_{k}-x_{k-1}\right\|; hence we obtain

We combine the above inequality with (32) to deduce

Suppose the function ff is convex. Using in the stopping criteria (19) in replacement of (13), we deduce a similar expression as (27):

Denote θk=1k+1\theta_{k}=\frac{1}{k+1}. Completing the square, we obtain

Denote Ak:=1θk2A_{k}:=1-\theta_{k}^{2}. Following the standard recursion argument as in the proofs of Theorem 4.2 and Theorem 4.2, we conclude

The last inequality follows because 0<Ak110<A_{k-1}\leq 1. Iterating NN times, we deduce

thus the result is shown. Summing up (33) from j=N+1j=N+1 to 2N2N, we obtain

Combining this inequality with (34), the result is shown. ∎

Appendix D Inner-loop complexity: proof of Theorem 5.6

Assuming ψ(x)\psi(x) is convex and the parameter κ>ρ\kappa>\rho, then

where yy^{*} is a minima of fκ(;y)f_{\kappa}(\cdot;y) and fκ(y)f_{\kappa}^{*}(y) is the optimal value.

As the κ\kappa is chosen sufficiently large, we know f0(;y)f_{0}(\cdot;y) is convex and differentiable with (κ+L)(\kappa+L)-Lipschitz continuous gradient. Hence, we deduce for all xx

Using the definition of y0y^{0} and the (κ+L)(\kappa+L)-Lip. continuous gradient of f0(;y)f_{0}(\cdot;y), we conclude for all xx

By setting x=yx=y^{*} in both (37) and (38) and combining these results, we conclude

Note that if we are not in the composite setting and κ>ρ\kappa>\rho, then fκ(,y)f_{\kappa}(\cdot,y) is (κ+L)(\kappa+L)-strongly convex. Using standard bounds for strongly convex functions, Equation (36) follows (see ). We next show an important lemma for deducing the inner complexities.

Assume κ>ρ\kappa>\rho. Given any εκρ2\varepsilon\leq\frac{\kappa-\rho}{2}, if an iterate zz satisfies dist(0,fκ(z;y))εyy,\text{dist}(0,\partial f_{\kappa}(z;y))\leq\varepsilon\left\|y^{*}-y\right\|, then

Since κ>ρ\kappa>\rho, we know fκ(;y)f_{\kappa}(\cdot;y) is (κρ)(\kappa-\rho)-strongly convex. Therefore, by , we know

By the triangle inequality and Equation (40), we deduce

The last inequality follows because of the assumption εκρ2\varepsilon\leq\frac{\kappa-\rho}{2}. Rearranging the terms above, we get the desired result. ∎

These two lemmas together give us Theorem 5.4.

First, we prove that zTz_{T} satisfies both adaptive stationary condition and the descent condition. Recall, the point y0y^{0} is defined to be the prox or yy depending on if fκ(;y)f_{\kappa}(\cdot;y) is a composite form or smooth, respectively (see statement of Theorem 5.4). By Lemma D.1 (or the remark following it), the starting y0y^{0} satisfies

By the linear convergence assumption of M\mathcal{M} (see (16)) and the above equation, after T:=TκT:=T_{\kappa} iterations initializing from y0y^{0}, we have

Take the square root and apply Lemma D.2 yields

which gives the adaptive stationary condition. Next, we show the descent condition. Let vfκ(zT;y)v\in\partial f_{\kappa}(z_{T};y) such that v(κρ)zTy/2\left\|v\right\|\leq(\kappa-\rho)\left\|z_{T}-y\right\|/2, by the (κρ)(\kappa-\rho)-strong convexity of fκ(;y)f_{\kappa}(\cdot;y), we deduce

This yields the descent condition which completes the proof for TT. The proof for SκS_{\kappa} is similar to TκT_{\kappa}, so we omit many of the details. In this case, we only need to show the adaptive stationary condition. For convenience, we denote S=SκS=S_{\kappa}. Following the same argument as in Equation (41) but with Slog(k+1)S\log(k+1) number of iterations, we deduce

which proves the desired result for zSz_{S}. ∎

Assuming Proposition 5.7 and Proposition 5.8 hold as well as Lemma D.3, we begin by providing the proof of Theorem 5.6.

We consider two cases: (i) the function ff is non-convex and (ii) the function ff is convex. First, we consider the non-convex setting. To produce xˉk\bar{x}_{k}, the method M\mathcal{M} is called

Next, we supply the proof of Proposition 5.7 which shows that by choosing κ\kappa large enough, Algorithm 3 terminates.

The idea is to apply Theorem 5.4. Since the parameter AκA_{\kappa} increases with κ\kappa, then we upper bound it by AκkA4LA_{\kappa_{k}}\leq A_{4L}. Moreover, we have κρρ+Lρ=L\kappa-\rho\geq\rho+L-\rho=L. Lastly, since τκ\tau_{\kappa} is increasing in κ\kappa, we know 1τκ1τL\frac{1}{\tau_{\kappa}}\leq\frac{1}{\tau_{L}}. Plugging these bound into Theorem 5.4, we see that for any smoothing parameter κ\kappa satisfying ρ+L<κ<4L\rho+L<\kappa<4L, we get the desired result. ∎

Next, we compute the maximum number of times we must double κ\kappa until κ>ρ+L\kappa>\rho+L.

If we set TT and SS according to Theorem 5.6, then the doubling of κ0\kappa_{0} will terminate as soon as κ>ρ+L\kappa>\rho+L. Thus the number of times κ0\kappa_{0} must be doubled in Algorithm 3 is at most

Since κ\kappa is doubled (Algorithm 3) and TT is chosen as in Proposition 5.7 , the maximum the value κ\kappa, κmax\kappa_{\max}, takes is 2(ρ+L)4L2(\rho+L)\leq 4L.

The proof immediately follows from Theorem 5.4 by setting κ=κcvx\kappa=\kappa_{\text{cvx}} and ρ=0\rho=0 as the function ff is convex. ∎

References