Catalyst Acceleration for Gradient-Based Non-Convex Optimization
Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, Zaid Harchaoui
Introduction
We consider optimization problems of the form
Yet, nonconvex problems in machine learning are of high interest. For instance, the variable may represent the parameters of a neural network, where each term measures the fit between and a data point indexed by , or (1) may correspond to a nonconvex matrix factorization problem (see Section 7). Besides, even when the data-fitting functions are convex, it is also typical to consider nonconvex regularization functions , for example for feature selection in signal processing . In this work, we address two questions from nonconvex optimization:
How to apply a method for convex optimization to a nonconvex problem?
How to design an algorithm which does not need to know whether the objective function is convex while obtaining the optimal convergence guarantee if the function is convex?
Several works attempted to transfer ideas from the convex world to the nonconvex one, see, e.g., . Our paper has a similar goal and studies the extension of Nesterov’s acceleration for convex problems to nonconvex composite ones. For -smooth and nonconvex problems, gradient descent is optimal among first-order methods in terms of information based complexity to find an -stationary point [10, Theorem 2, Sec. 5]. Without additional assumptions, worst case complexity for first-order methods can not achieve better than oracle queries . Under a stronger assumption that the objective function is -smooth, state-of-the-art methods [e.g., 11] achieve a marginal gain with complexity , and do not appear to generalize to composite or finite-sum settings. For this reason, our work fits within a broader stream of recent research on methods that do not perform worse than gradient descent in the nonconvex case (in terms of worst-case complexity), while automatically accelerating for minimizing convex functions. The hope when applying such methods to nonconvex problems is to see acceleration in practice, by heuristically exploiting convexity that is “hidden” in the objective (for instance, local convexity near the optimum, or convexity along the trajectory of iterates).
Inspired by Nesterov’s acceleration method for convex optimization , the first accelerated method performing universally well for nonconvex and convex problems was introduced in . Specifically, this work addresses composite problems such as (1) with , and, provided the iterates are bounded, it performs no worse than gradient descent on nonconvex instances with complexity on the gradient norm. When the problem is convex, it accelerates with complexity . Extensions to accelerated Gauss-Newton type methods were also recently developed in . In a follow-up work, the authors of proposed a new scheme that monotonically interlaces proximal gradient descent steps and Nesterov’s extrapolation; thereby achieving similar guarantees as but without the need to assume the iterates to be bounded. Extensions when the gradient of is only Hölder continuous can also be devised. Whether accelerated methods are superior to gradient descent remains an open question in the nonconvex setting; their performance escaping saddle points faster than gradient descent has been observed .
In , a similar strategy is proposed, focusing instead on convergence guarantees under the so-called Kurdyka-Łojasiewicz inequality—a property corresponding to polynomial-like growth of the function, as shown by . Our scheme is in the same spirit as these previous papers, since it monotonically interlaces proximal-point steps (instead of proximal-gradient as in ) and extrapolation/acceleration steps. A fundamental difference is that our method is generic and accommodates inexact computations, since we allow the subproblems to be approximately solved by any method we wish to accelerate.
By considering -smooth nonconvex objective functions with Lipschitz continuous gradient and Hessian , the authors of propose an algorithm with complexity , based on iteratively solving convex subproblems closely related to the original problem. It is not clear if the complexity of their algorithm improves in the convex setting. Note also that the algorithm proposed in is inherently for -smooth minimization and requires exact gradient evaluations. This implies that the scheme does not allow incorporating nonsmooth regularizers and can not exploit finite sum structure.
A stochastic scheme for minimizing (1) under the nonconvex but smooth setting were recently considered in . The method can be seen as a nonconvex variant of the stochastically controlled stochastic gradient (SCSG) methods . If the target accuracy is small, then the method performs no worse than nonconvex SVRG . If the target accuracy is large, the method achieves a rate better than SGD. The proposed scheme does not incorporate nonsmooth regularizers and it is unclear whether numerically the scheme performs as well as SVRG.
Finally, a stochastic method related to SVRG for minimizing large sums while automatically adapting to the weak convexity constant of the objective function is proposed in . When the weak convexity constant is small (i.e., the function is nearly convex), the proposed method enjoys an improved efficiency estimate. This algorithm, however, does not automatically accelerate for convex problems, in the sense that the overall rate is slower than in terms of target accuracy on the gradient norm.
Organization of the paper.
Section 2 presents mathematical tools for non-convex and non-smooth analysis, which are used throughout the paper. We provide a discussion of related works for solving the nonconvex and nonsmooth problem (1) in Section 3. In Sections 4 and 5, we introduce the main algorithm and important extensions, respectively. Section 6 presents global convergence guarantees of the scheme and convergence guarantees when the algorithm wraps specific algorithms such as SAGA, SVRG, and randomized coordinate descent. Finally, we present experimental results on matrix factorization and training of neural networks in Section 7.
Tools for nonconvex optimization
In this paper, we focus on a broad class of nonconvex functions known as weakly convex functions, which covers most of the cases of interest in machine learning and signal processing.
Weakly convex functions have appeared in a wide variety of contexts, and under different names. Some notable examples are globally lower- , prox-regular , proximally smooth functions , and those functions whose epigraph has positive reach . We recall here basic definitions and classical results.
When , the above definition reduces to the classical definition of convex functions.
A function is weakly convex if and only if the function is convex, where
In order to prove the result, it suffices to show that
This follows by rearranging the terms in Equation (3). Next, we suppose is -weakly convex; hence Equation (2) holds. We observe that can be rewritten as Equation (3). As a result, we conclude
Rearranging the terms, we get the desired result. ∎
If is twice differentiable, then is -weakly convex if and only if for all .
This follows from the observations that a twice differentiable function is convex if and only if and Proposition 2.3. ∎
Intuitively, a function is weakly convex when it is “nearly convex” up to a quadratic function. This represents a complementary notion to strong convexity.
If a function is differentiable and its gradient is Lipschitz continuous with Lipschitz parameter , then is -weakly convex.
Hence, we see the function is convex for and the result follows by applying Proposition 2.3. ∎
We remark that for most of the interesting machine learning problems, the smooth part of the objective function admits Lipchitz gradients, meaning that the function is in fact weakly convex.
2 Subdifferential
Convergence results for nonsmooth optimization typically rely on the concept of subdifferential. However, the generalization of the subdifferential to nonconvex nonsmooth function is not unique . With the weak convexity in hand, all these constructions coincide, and therefore we slightly abuse standard notation as set out in Rockafellar and Wets .
Thus, a vector lies in whenever the linear function is a lower-model of , up to first-order around . In particular, the subdifferential of a differentiable function is the singleton ; while for a convex function it coincides with the subdifferential in the sense of convex analysis, see [49, Exercise 8.8]. Moreover, the following sum rule,
holds for any differentiable function .
In non-convex optimization, standard complexity bounds are derived to guarantee
Recall when , we are at a stationary point and satisfy first-order optimality conditions. For functions that are nonconvex, first-order methods search for points with small subgradients, which does not necessarily imply small function values, in contrast to convex functions where the two criteria are much closer related. In our convergence analysis, we will use the following differential characterization of -weakly convex functions, which generalize classical properties of convex functions.
Related work on weakly convex functions
For many machine learning problems, the objective functions includes a smooth component which is often assumed to have an -Lipschitz gradient. The precise relationship between the weak-convexity constant and the Lipschitz constant is given in Proposition 2.5:
If is differentiable and is -Lipschitz, then is -weakly convex for some .
Many functions with -Lipschitz gradients have weak-convexity constants which are smaller than . Our goal is to develop a method that exploits this property of the weak convexity constant for nonconvex functions while obtaining optimal convergence rates for convex problems. Up until now, nearly all the research for methods to solve the large finite sum problem (1) have assumed (i.e. convex) or . We provide a short, selective list of convergence guarantees for a few popular approaches.
When , Full Gradient Descent (FG) finds a point satisfying after number of gradient computations.
When , AdaGrad uses regret guarantees in an online convex optimization setting. We are not aware of guarantees for convex optimization with finite-sum structure nor for non-convex optimization with finite-sum structure.
Stochastic methods based on variance-reduced stochastic gradients have recently been applied to nonconvex problems. The authors of propose for instance stochastic methods for minimizing (1) using variants of SVRG and SAGA under the assumption that . Their scheme works for both convex and nonconvex settings and achieves convergence guarantees of (convex) and (nonconvex) and includes a minibatch variant.
2 Behavior of finite-sum optimization methods when the objective is convex.
For the stochastic methods previously considered in the nonconvex setting, we note their convex rates. In , for convex objectives, the methods attain convergence guaranties of . In both , they only focus on nonconvex problems, as such their convex rates are the same as the nonconvex setting, that is, . When the objective is assumed to be convex, methods often achieve faster rates of convergence. Accelerated gradient methods designed by Nesterov are known to require number of gradient computations to obtain near stationary point, , but only number of gradient computations to obtain a near optimal point, . This gap between the two guarantees was resolved in . By adding a regularization term and an additional known bound on , one can improve the gradient complexity to . Without such a bound on the distance to the optimal solution set, it is unclear if one can improve the convergence rate. We will assume throughout this paper that we do not know a bound on for (1).
The authors of based their work off a class of algorithms called stochastically controlled stochastic gradient (SCSG) methods . In these methods, the functions are smooth and convex. The SCSG method satisfies: when the target accuracy is low, the method has the same rate as SGD but with a small data-dependent constant factor and when the target accuracy is high, the method has the same rate as the best non-accelerated methods, .
3 Our results
In this paper, we design a generic method that performs no worse than gradient descent in the nonconvex case, while automatically accelerating for minimizing convex functions. In particular, we devise a single algorithm which adapts to the weak convexity constant if the objective is nonconvex, while also obtaining the accelerated rate of when the objective is convex. The hope is that by applying such methods to nonconvex problems we see acceleration in practice by heuristically exploiting convexity that is “hidden” in the objective function. Moreover, our algorithm applies to incremental methods SVRG/SAGA and randomized coordinate descent. Designing such an acceleration scheme for possibly nonconvex optimization problems is challenging. Whether convergence guarantees for optimization algorithms accelerated naively with classical Nesterov or momentum acceleration match gradient descent on nonconvex problems remains an open problem; yet in the vicinity of saddle points, accelerated gradient methods escape faster than gradient descent . Our scheme capitalizes on this valuable observation.
First, we consider the situation where the weak convexity constant is known. By interlacing incremental methods such as SVRG and SAGA, our proposed algorithm, Basic 4WD-Catalyst-SVRG/SAGA, where is known, finds an -approximate stationarity point of in gradient complexity
It is common in machine learning problems for the weak convexity constant to be unknown. Previous work in the area, namely , required the parameter to be specified (see Line 8 in Algorithms 1 and 2 of ). Our second method, 4WD-Catalyst-SVRG/SAGA, incorporates a procedure that eliminates the need to specify the weak convexity constant . The resulting method, 4WD-Catalyst-SVRG/SAGA, finds an -approximate stationarity point of in gradient complexity
The Basic 4WD-Catalyst algorithm for non-convex optimization
We now present a generic scheme (Algorithm 1) for applying a convex optimization method to minimize
where is only -weakly convex and is lower bounded. Our goal is to develop a unified framework that automatically accelerates in convex settings. Consequently, the scheme must be agnostic to the constant .
At the center of our meta algorithm (Algorithm 1) are two sequences of subproblems obtained by adding simple quadratics to . The proposed approach extends the Catalyst acceleration of and comes with a simplified convergence analysis. We next describe in detail each step of the scheme.
Here, is a regularization parameter and is called the prox-center. By adding the quadratic, we make the problem more “convex”: when is non convex, with a large enough , the subproblem will be convex; when is convex, we improve the conditioning of the problem.
At the -th iteration, given a previous iterate and the extrapolation term , we construct the two following subproblems.
Proximal point step. We first perform an inexact proximal point step with prox-center :
Accelerated proximal point step. Then we build the next prox-center as the convex combination
Next, we use as a prox-center and update the next extrapolation term:
where is a sequence of coefficients satisfying . Essentially, the sequences are built upon the extrapolation principles of Nesterov .
Picking the best.
The proposed scheme blends the two steps in a synergistic way, allowing us to recover the near-optimal rates of convergence in both worlds: convex and non-convex. Intuitively, when is chosen, it means that Nesterov’s extrapolation step “fails” to accelerate convergence.
Stopping criterion for the subproblems.
In order to derive complexity bounds, it is important to properly define the stopping criterion for the proximal subproblems. When the subproblem is convex, a functional gap like may be used as a control of the inexactness, as in . Without convexity, this criterion cannot be used since such quantities can not be easily bounded. In particular, first order methods seek points whose subgradient is small. Since small subgradients do not necessarily imply small function values in a non-convex setting, first order methods only test for near-stationarity is small subgradients. In contrast, in the convex setting, small subgradients imply small function values; thus a first order method in the convex setting can “test” for small function values. Hence, we cannot use a direct application of Catalyst which uses the functional gap as a stopping criteria. Because we are working in the nonconvex setting, we include a stationarity stopping criteria.
We propose to use jointly the following two types of stopping criteria:
Descent condition: ;
Adaptive stationary condition: \text{\rm dist}\big{(}0,\partial f_{\kappa}(z;y)\big{)}<\kappa\left\|z-y\right\|.
Without the descent condition, the stationarity condition is insufficient for defining a good stopping criterion because of the existence of local maxima in nonconvex problems. In the nonconvex setting, local maxima and local minima satisfy this stationarity condition. The descent condition ensures the iterates generated by the algorithm always decrease the value of the objective function ; thus ensuring we move away from local maxima. The second criterion, adaptive stationary condition, provides a flexible relative tolerance on termination of algorithm used for solving the subproblems; a detailed analysis is forthcoming.
In Basic 4WD-Catalyst, we use both the stationary condition and the descent condition as a stopping criteria to produce the point :
2 Convergence analysis.
We present here the theoretical properties of Algorithm 1. In this first stage, we do not take into account the complexity of solving the subproblems (7) and (9). For the next two theorems, we assume that the stopping criteria for the proximal subproblems are satisfied at each iteration of Algorithm 1.
Suppose that the function is lower bounded. For any and , the iterates generated by Algorithm 1 satisfy
It is important to notice that this convergence result is valid for any and does not require it to be larger than the weak convexity parameter. As long as the stopping criteria for the proximal subproblems are satisfied, the quantities tend to zero. The proof is inspired by that of inexact proximal algorithms and appears in Appendix B.
If the function turns out to be convex, the scheme achieves a faster convergence rate both in function values and in stationarity:
If the function is convex, then for any and , the iterates generated by Algorithm 1 satisfy
where is any minimizer of the function .
The proof of Theorem 4.2 appears in Appendix B. This theorem establishes a rate of for suboptimality in function value and convergence in for the minimal norm of subgradients. The first rate is optimal in terms of information-based complexity for the minimization of a convex composite function . The second can be improved to through a regularization technique, if one knew in advance that the function is convex and had an estimate on the distance of the initial point to an optimal solution .
The 4WD-Catalyst algorithm
In this section, we work towards understanding the global efficiency of Algorithm 1, which automatically adapts to the weak convexity parameter. For this, we must take into account the cost of approximately solving the proximal subproblems to the desired stopping criteria. We expect that once the subproblem becomes strongly convex, the given optimization method can solve it efficiently. For this reason, we first focus on the computational cost for solving the sub-problems, before introducing a new algorithm with known worst-case complexity.
When is large enough, the subproblems become strongly convex; thus globally solvable. Henceforth, we will assume that satisfies the following natural linear convergence assumption.
We assume that for any , there exist and so that the following hold:
where . If the method is randomized, we require the same inequality to hold in expectation.
The rates and the constants are increasing in .
Our assumption on the linear rate of convergence of differs from the one considered by , which was given in terms of function values. However, if the problem is a composite one, both points of view are near-equivalent, as discussed in Section A and the precise relationship is given in Appendix C. We choose the norm of the subgradient as our measurement because the complexity analysis is easier.
Then, a straightforward analysis bounds the computational complexity to achieve an -stationary point.
Let us consider a strongly convex problem and a linearly convergent method generating a sequence of iterates . Define T(\varepsilon)=\inf\{t\geq 1,\text{\rm dist}\big{(}0,\partial f_{\kappa}(z_{t};y)\big{)}\leq\varepsilon\}, where is the target accuracy; then,
As we can see, we only lose a factor in the log term by switching from deterministic to randomized algorithms. For the sake of simplicity, we perform our analysis only for deterministic algorithms and the analysis for randomized algorithms holds in the same way in expectation.
We can now prove a global complexity bound for Basic 4WD-Catalyst if the weak convexity constant is known. For this, we introduce , a dependent smoothing parameter and set it in the same way as the smoothing parameter in .
Algorithm 1 generates a point satisfying after at most
If is convex, then Algorithm 1 generates a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 1 generates a point satisfying after at most
Bounding the required iterations when κ>ρ𝜅𝜌\kappa>\rho and restart strategy.
Recall that we add a quadratic to with the hope to make each subproblem convex. Thus, if is known, then we should set . In this first stage, we show that whenever , then the number of inner calls to can be bounded with a proper initialization. Consider the subproblem
and define the initialization point by
if is composite, with -smooth, then set with .
Consider the subproblem (17) and suppose . Then initializing at the previous generates a sequence of iterates such that
the output satisfies (descent condition) and (adaptive stationary condition);
in at most iterations where
the output satisfies (modified adaptive stationary condition).
The proof is technical and is presented in Appendix D. The lesson we learn here is that as soon as the subproblem becomes strongly convex, it can be solved in almost a constant number of iterations. Herein arises a problem–the choice of the smoothing parameter . On one hand, when is already convex, we may want to choose small in order to obtain the desired optimal complexity. On the other hand, when the problem is non convex, a small may not ensure the strong convexity of the subproblems. Because of such different behavior according to the convexity of the function, we introduce an additional parameter to handle the regularization of the extrapolation step. Moreover, in order to choose a in the nonconvex case, we need to know in advance an estimate of . This is not an easy task for large scale machine learning problems such as neural networks. Thus we propose an adaptive step to handle it automatically.
2 4WD-Catalyst: adaptation to weak convexity
We now introduce 4WD-Catalyst, presented in Algorithm 2, which can automatically adapt to the unknown weak convexity constant of the objective. The algorithm relies on a procedure to automatically adapt to , described in Algorithm 3.
Descent condition: ;
Adaptive stationary condition: \text{\rm dist}\big{(}0,\partial f_{\kappa}(z_{T};x)\big{)}\leq\kappa\left\|z_{T}-x\right\|.
Thus, if either condition is not satisfied, then the subproblem is deemed not convex and we double and repeat. The procedure yields an estimate of in a logarithmic number of increases; see Lemma D.3.
We shall see in the experiments that our strategy of predefining and works quite well. The theoretical bounds we derive are, in general, too conservative; we observe in our experiments that one may choose and significantly smaller than the theory suggests and still retain the stopping criteria.
To derive the global complexity results for 4WD-Catalyst that match optimal convergence guarantees, we make a distinction between the regularization parameter in the proximal point step and in the extrapolation step. For the proximal point step, we apply Algorithm 3 to adaptively produce a sequence of initializing at , an initial guess of . The resulting and satisfy both the following inequalities:
3 Convergence analysis
Fix real constants , the function is lower bounded, and . Set . Suppose that the number of iterations is such that satisfies (20). Define . Then for any , the iterates generated by Algorithm 2 satisfy,
where is any minimizer of the function .
Suppose the stopping criteria are (20) and (21) as in in Theorem 5.5, and choose and in Algorithm 2 to be the smallest numbers satisfying
Then and the following hold for any index :
Appendix D is devoted to proving Theorem 5.6, but we outline below the general procedure and state the two main propositions (see Proposition 5.7 and Proposition 5.8).
We summarize the proof of Theorem 5.6 as followed:
When , we compute the number of iterations of to produce a point satisfying (20). Such a point will become .
We compute the smallest number of times we must double until it becomes larger than . Thus eventually the condition will occur.
The next proposition shows that Auto-adapt terminates with a suitable choice for after number of iterations.
Suppose . By initializing the method using the strategy suggested in Algorithm 2 for solving
we may run the method for at least iterations, where
then, the output satisfies and \text{dist}\big{(}0,\partial f_{\kappa}(z_{{T}};x)\big{)}\leq\kappa\left\|z_{T}-x\right\|.
Under the additional assumption that the function is convex, we produce a point with (21) when the number of iterations is chosen sufficiently large.
Consider the method with the initialization strategy suggested in Algorithm 2 for minimizing with linear convergence rates of the form (16). Suppose the function is convex. If the number of iterations of is greater than
We can now derive global complexity bounds by combining Theorem 5.5 and Theorem 5.6, and a good choice for the constant .
Algorithm 2 generates a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 generates a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 generates a point satisfying after at most
In general, the linear convergence parameter of , , depends on the condition number of the problem . Here, and are precisely given by plugging in and respectively into . To clarify, let be SVRG, is given by which yields . A more detailed computation is given in Table 3. For all the incremental methods we considered, these parameters and are on the order of .
If is a first order method, the convergence guarantee in the convex setting is near-optimal, up to logarithmic factors, when compared to . In the non-convex setting, our approach matches, up to logarithmic factors, the best known rate for this class of functions, namely . Moreover, our rates dependence on the dimension and Lipschitz constant equals, up to log factors, the best known dependencies in both the convex and nonconvex setting. These logarithmic factors may be the price we pay for having a generic algorithm.
Applications to Existing Algorithms
We now show how to accelerate existing algorithms and compare the convergence guaranties before and after 4WD-Catalyst. In particular, we focus on the gradient descent algorithm, randomized coordinate descent, and on the incremental methods SAGA and SVRG. For all the algorithms considered, we state the convergence guaranties in terms of the total number of iterations (in expectation, if appropriate) to reach an accuracy of ; in the convex setting, the accuracy is stated in terms of functional error, and in the nonconvex setting, the appropriate measure is stationarity, namely . All the algorithms considered have formulations for the composite setting with analogous convergence rates.
Table 2 presents convergence rates for SAGA , (prox) SVRG , randomized coordinate descent (Rand. CD) , and gradient descent (FG).
The original SVRG has no guarantee for nonconvex functions. However a nonconvex extension of SVRG was proposed in . Their convergence rate gives a better dependence on compared to ours, namely . This is achieved thanks to a mini-batching strategy. In order to obtain a similar dependency on , we need a tighter bound for SVRG with mini-batching applied to -strongly convex problems, namely . To the best of our knowledge, such a rate is currently unknown. Therefore, for ncvx-SVRG, we present the results without mini-batching. With mini-batching, the same convergence rate can be obtained by using a batch size and a stepsize . Similarly for ncvx-SAGA. For Rand. CD, we present the results for a smooth function , with the max. of the coordinate-wise Lipschitz constants for and is the dimension of the domain of .
The smoothing parameter drives the convergence rate of 4WD-Catalyst in the convex setting. To determine , we pretend and compute the global complexity of our scheme. As such, we end up with the same complexity result as Catalyst . Following their work, the rule of thumb is to maximize the ratio for convex problems. On the other hand, the choice of is independent of ; it is an initial lower estimate for the weak convexity constant . In practice, we typically choose ; For incremental approaches a natural heuristic is also to choose , meaning that iterations of performs one pass over the data. In Table 3, we present the values of used for various algorithms, as well as other quantities that are useful to derive the convergence rates.
A first illustration is the algorithm obtained when accelerating the regular “full” gradient (FG). Here, the optimal choice for is . In the convex setting, we get an accelerated rate of which agrees with Nesterov’s accelerated variant (AFG) up to logarithmic factors. On the other hand, in the nonconvex setting, our approach achieves no worse rate than , which agrees with the standard gradient descent up to logarithmic factors. We note that under stronger assumptions, namely -smoothness of the objective, the accelerated algorithm in achieves the same rate as (AFG) for the convex setting and for the nonconvex setting. Their approach, however, does not extend to composite setting nor to stochastic methods. Our marginal loss is the price we pay for considering a much larger class of functions.
Randomized Coordinate Descent (Rand. CD).
Randomized incremental gradient.
We now consider randomized incremental gradient methods such as SAGA and (prox) SVRG . Here, the optimal choice for is . Under the convex setting, we achieve an accelerated rate of . A direct application of SVRG and SAGA have no convergence guarantees in the non-convex setting. With our approach, the resulting algorithm matches the guarantees for FG up to log factors.
2 Detailed derivation of convergence rates
Using the values of Table 3, we may now specialize our convergence results to different methods. Many of the linearly convergent methods (e.g. Rand. CD and incremental methods) state convergence results in terms of function values instead of subgradients as in Equation (16). We relate function values to subgradients by using the Lipschitz constant :
To compute the parameters , , etc, we use the convergence analysis from for full gradient:
, where is the optimal solution of .
With this result, the number of iterations in the inner loop are
The global complexity for gradient descent is
Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying after at most
Rand. CD.
We use the convergence analysis from [Algorithm 3, Theorem 1]:
Set . Then the iterates of Algorithm 3 in satisfy
For the Rand. CD, the number of iterations in the inner loop are
Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying after at most
SVRG.
We use the convergence analysis established in :
Suppose the function is -Lipschitz and the function is -strongly convex. Choose the real constant sufficiently small so that
Then the Prox-SVRG method in has geometric convergence in expectation:
In particular, each stage requires component gradient evaluations so the overall complexity is
For SVRG, the number of iterations in the inner loop are
The global complexity for SVRG when is sufficiently large is
Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying \text{dist}\big{(}0,\partial f(x)\big{)}\leq\varepsilon after at most
If is convex, then Algorithm 2 will generate a point satisfying after at most
SAGA.
We observe that the variables for SAGA are the same as for SVRG up to a multiplicative factors. Therefore, the global complexities results for SAGA are, up to constant factors, the same as SVRG.
Experiments
We investigate the performance of 4WD-Catalyst on two standard non-convex problems in machine learning, namely on sparse matrix factorization and on training a simple two-layer neural network.
We report experimental results of 4WD-Catalyst when applied to the incremental algorithms SVRG and SAGA , and consider the following variants:
ncvx SVRG/SAGA with its theoretical stepsize .
a minibatch variant of ncvx SVRG/SAGA with batch size and stepsize .
SVRG/SAGA with large stepsize . This is variant of SVRG/SAGA, whose stepsize is not justified by theory for nonconvex problems, but which performs well in practice.
4WD-Catalyst SVRG/SAGA with its theoretical stepsize .
The algorithm SVRG (resp. SAGA) was originally designed for minimizing convex objectives. The nonconvex version was developed in , using a significantly smaller stepsize . Following , we also include in the comparison a heuristic variant that uses a large stepsize , where no theoretical guarantee is available for nonconvex objectives. 4WD-Catalyst SVRG and 4WD-Catalyst SAGA use a similar stepsize, but the Catalyst mechanism makes this choice theoretically grounded.
Comparison with popular stochastic algorithms.
We also include as baselines three popular stochastic algorithms: stochastic gradient descent (SGD), AdaGrad , and Adam .
AdaGrad with stepsize or .
Adam with stepsize or , , and .
The stepsize (learning rate) of these algorithms are manually tuned to output the best performance. Note that none of them, SGD, AdaGrad , or Adam enjoys linear convergence when the problem is strongly convex. Therefore, we do not apply 4WD-Catalyst to these algorithms. SGD is used in both experiments, whereas AdaGrad and Adam are used only on the neural network experiments and not on sparse matrix factorization since it is unclear how to apply it to a nonsmooth objective.
Parameter settings.
We start from an initial estimate of the Lipschitz constant and use the theoretically recommended in 4WD-Catalyst. We set the number of inner iterations in all experiments which means making at most one pass over the data to solve each sub-problem. Moreover, the dependency dictated by the theory is dropped while solving the subproblem in (18). These choices turn out to be justified a posteriori, as both SVRG and SAGA have a much better convergence rate in practice than the theoretical rate derived from a worst-case analysis. Indeed, in all experiments, one pass over the data to solve each sub-problem was found to be enough to guarantee sufficient descent.
Sparse matrix factorization a.k.a. dictionary learning.
Neural networks.
Initial estimates of L𝐿L.
The proposed algorithm 4WD-Catalyst requires an initial estimate of the Lipschitz constant . In the problems we are considering, there is no simple closed form formula available to compute an estimate of . We use the following heuristics to estimate :
For matrix factorization, it can be shown that the function defined in (24) is differentiable according to Danskin’s theorem [see Bertsekas , Proposition B.25] and its gradient is given by
If the coefficients were fixed, the gradient would be linear in and thus admit as Lipschitz constant. Therefore, when initializing our algorithm at , we find for any and use as an estimate of .
For neural networks, the formulation we are considering is differentiable. We randomly generate two pairs of weight vectors and and use the quantity
as an estimate of the Lipschitz constant, where denotes the loss function respect to -th training sample . We separate weights in each layer to estimate the Lipschitz constant per layer. Indeed the scales of the weights can be quite different across layers.
Computational cost.
We report in Figure 6 an experimental study where we vary on the neural network example. In terms of number of iterations, of course, the larger the better the performance. This is not surprising as we solve each subproblem more accurately. Nevertheless, in terms of number of gradient evaluations, the relative performance is reversed. There is clearly no benefit to take larger . This justifies in hindsight our choice of setting .
Experimental conclusions.
In the matrix factorization experiments in Fig. 2 and Fig. 3, 4WD-Catalyst-SVRG/SAGA were always competitive, with a similar performance to the heuristic SVRG/SAGA- in two cases out of three, while being significantly better as soon as the amount of data was large enough. As expected, the variants of SVRG with theoretical stepsizes have slow convergence, but exhibit a stable behavior compared to SVRG-. This confirms the remarkable ability of 4WD-Catalyst-SVRG/SAGA to adapt to nonconvex terrains.
In the neural network experiments, we observe that 4WD-Catalyst-SVRG converges much faster overall in terms of objective values than other algorithms. Yet Adam and AdaGrad often perform well-during the first iterations, they oscillate a lot, which is a behavior commonly observed. In constrast, 4WD-Catalyst-SVRG always decreases and keeps decreasing while other algorithms tend to stabilize, hence achieving significantly lower objective values.
More interestingly, as the algorithm proceeds, the subgradient norm may increase at some point and then decrease, while the function value keeps decreasing. This suggests that the extrapolation step, or the Auto-adapt procedure, is helpful to escape bad stationary points, e.g., saddle-points. We leave the study of this particular phenomenon as a potential direction for future work.
Acknowledgments.
The authors would like to thank J. Duchi for fruitful discussions related to this work. C. Paquette was partially supported by the “Learning in Machines and Brains” program of CIFAR. H. Lin and J. Mairal were supported by ERC grant SOLARIS (# 714381) and ANR grant MACARON (ANR-14-CE23-0003-01). D. Drusvyatskiy was supported by AFOSR YIP FA9550-15-1-0237, NSF DMS 1651851, and NSF CCF 1740551 awards. Z. Harchaoui was supported by NSF CCF 1740551 award, the “Learning in Machines and Brains” program of CIFAR, and faculty research awards. This work was performed while C. Paquette was at University of Washington and H. Lin was at Inria.
Appendix A Convergence rates in strongly-convex composite minimization
We now briefly discuss convergence rates, which are typically given in different forms in the convex and non-convex cases. If the weak-convex constant is known, we can form a strongly convex approximation similar to . For that purpose, we consider a strongly-convex composite minimization problem
Let be the minimizer of and be the minimal value of . In general, there are three types of measures of optimality that one can monitor: , , and .
Since is strongly convex, the three of them are equivalent in terms of convergence rates if one can take an extra prox-gradient step:
To see this, define the displacement vector, also known as the gradient mapping, , and notice the inclusion . In particular if and only if is the minimizer of . These next inequalities follow directly from Theorem 2.2.7 in :
Thus, an estimate of any one of the four quantities , , , or directly implies an estimate of the other three evaluated either at or at .
Appendix B Theoretical analysis of the basic algorithm
We present here proofs of the theoretical results of the paper. All throughout the proofs, we shall work under the Assumptions on stated in Section 4 and the Assumptions on stated in Section 5.
In Theorem 4.1 and Theorem 4.2 under an appropriate tolerance policy on the proximal subproblems (7) and (9), Basic 4WD-Catalyst performs no worse than an exact proximal point method in general, while automatically accelerating when is convex. For this, we need the following observations.
Suppose the sequence is produced by Algorithm 1. Then, the following bounds hold for all :
This result is noted without proof in a remark of . For completeness, we give below a simple proof using induction. Clearly, the statement holds for . Assume the inequality on the right-hand side holds for . By using the induction hypothesis, we get
as claimed and the expression for is given by explicitly solving (11). To show the lower bound, we note that for all , we have
Using the established upper bound yields
Suppose satisfies . Then, the inequality holds:
We can find with . Taking into account the result follows. ∎
Next we establish convergence guarantees of Theorem 4.1 and Theorem 4.2 for Basic 4WD-Catalyst .
The proof of Theorem 4.1 follows the analysis of inexact proximal point method . The descent condition in (13) implies are monotonically decreasing. From this, we deduce
Using the adaptive stationarity condition (13), we apply Lemma B.2 with , and ; hence we obtain
We combine the above inequality with (25) to deduce
We substitute where is any minimizer of . Using the convexity of , the norm of , and Equations (8) and (10), we deduce
Set . Completing the square on Equation (27), we obtain
Denote . Subtracting from both sides and using the inequality and , we derive the following recursion argument:
The last inequality follows because . Iterating times,we deduce
thereby concluding the result. Summing up (26) from to , we obtain
Combining this inequality with (28), the result is shown. ∎
Appendix C Analysis of 4WD-Catalyst and Auto-adapt
Our assumption on the linear rate of convergence of (see (16)) may look strange at first sight. Nevertheless, most linearly convergent first-order methods for composite minimization either already satisfy this assumption or can be made to satisfy it by introducing an extra prox-gradient step. To see this, recall the convex composite minimization problem from Section A
for each . To bring such convergence guarantees into the desired form (16), define the prox-gradient step
and notice the inclusion . The following inequality follows from :
Thus, the linear rate of convergence (30) implies
which is exactly in the desired form (16).
C.1 Convergence analysis of the adaptive algorithm: 4WD-Catalyst
First, under some reasonable assumptions on the method (see Section 5.1), the sub-method Auto-adapt terminates.
Assume that when . The procedure Auto-adapt terminates after finitely many iterations.
Due to our assumptions on and the expressions and , we have
Since tends to one, for all sufficiency large , we can be sure that the right-hand-side is smaller than . On the other hand, for , the function is -strongly convex and therefore we have . Combining this with (31), we deduce
Letting , we deduce , as required. Thus the loop indeed terminates. ∎
We prove the main result, Theorem 5.5, for 4WD-Catalyst.
The proof closely resembles the proofs of Theorem 4.2 and Theorem 4.2, so we omit some of the details. The main difference in the proof is that we keep track of the effects the parameters and have on the inequalities as well as the sequence of . Since are monotonically decreasing, we deduce
Using the adaptive stationary condition (20), we apply Lemma B.2 with ; hence we obtain
We combine the above inequality with (32) to deduce
Suppose the function is convex. Using in the stopping criteria (19) in replacement of (13), we deduce a similar expression as (27):
Denote . Completing the square, we obtain
Denote . Following the standard recursion argument as in the proofs of Theorem 4.2 and Theorem 4.2, we conclude
The last inequality follows because . Iterating times, we deduce
thus the result is shown. Summing up (33) from to , we obtain
Combining this inequality with (34), the result is shown. ∎
Appendix D Inner-loop complexity: proof of Theorem 5.6
Assuming is convex and the parameter , then
where is a minima of and is the optimal value.
As the is chosen sufficiently large, we know is convex and differentiable with -Lipschitz continuous gradient. Hence, we deduce for all
Using the definition of and the -Lip. continuous gradient of , we conclude for all
By setting in both (37) and (38) and combining these results, we conclude
Note that if we are not in the composite setting and , then is -strongly convex. Using standard bounds for strongly convex functions, Equation (36) follows (see ). We next show an important lemma for deducing the inner complexities.
Assume . Given any , if an iterate satisfies then
Since , we know is -strongly convex. Therefore, by , we know
By the triangle inequality and Equation (40), we deduce
The last inequality follows because of the assumption . Rearranging the terms above, we get the desired result. ∎
These two lemmas together give us Theorem 5.4.
First, we prove that satisfies both adaptive stationary condition and the descent condition. Recall, the point is defined to be the prox or depending on if is a composite form or smooth, respectively (see statement of Theorem 5.4). By Lemma D.1 (or the remark following it), the starting satisfies
By the linear convergence assumption of (see (16)) and the above equation, after iterations initializing from , we have
Take the square root and apply Lemma D.2 yields
which gives the adaptive stationary condition. Next, we show the descent condition. Let such that , by the -strong convexity of , we deduce
This yields the descent condition which completes the proof for . The proof for is similar to , so we omit many of the details. In this case, we only need to show the adaptive stationary condition. For convenience, we denote . Following the same argument as in Equation (41) but with number of iterations, we deduce
which proves the desired result for . ∎
Assuming Proposition 5.7 and Proposition 5.8 hold as well as Lemma D.3, we begin by providing the proof of Theorem 5.6.
We consider two cases: (i) the function is non-convex and (ii) the function is convex. First, we consider the non-convex setting. To produce , the method is called
Next, we supply the proof of Proposition 5.7 which shows that by choosing large enough, Algorithm 3 terminates.
The idea is to apply Theorem 5.4. Since the parameter increases with , then we upper bound it by . Moreover, we have . Lastly, since is increasing in , we know . Plugging these bound into Theorem 5.4, we see that for any smoothing parameter satisfying , we get the desired result. ∎
Next, we compute the maximum number of times we must double until .
If we set and according to Theorem 5.6, then the doubling of will terminate as soon as . Thus the number of times must be doubled in Algorithm 3 is at most
Since is doubled (Algorithm 3) and is chosen as in Proposition 5.7 , the maximum the value , , takes is .
The proof immediately follows from Theorem 5.4 by setting and as the function is convex. ∎