Inexact Successive Quadratic Approximation for Regularized Optimization

Ching-pei Lee, Stephen J. Wright

1 Introduction

We consider the following regularized optimization problem:

$$\min_{x}\; F(x)\coloneqq f(x)+\psi(x). \tag{1}$$

We consider algorithms that generate a sequence $\{x^{k}\}_{k=0,1,\dotsc}$ from some starting point $x^{0}$, and solve the following subproblem inexactly at each iteration, for some symmetric matrix $H_{k}$:

$$\min_{d}\; \nabla f(x^{k})^{T}d+\frac{1}{2}d^{T}H_{k}d+\psi(x^{k}+d)-\psi(x^{k}). \tag{2}$$

We abbreviate the objective in (2) as $Q_{k}(\cdot)$ (or as $Q(\cdot)$ when we focus on the inner workings of iteration $k$ ). In some results, we allow $H_{k}$ to have zero or negative eigenvalues, provided that $Q_{k}$ itself is strongly convex. (Strong convexity in $\psi$ may overcome any lack of strong convexity in the quadratic part of (2).)

In the special case of the proximal-gradient algorithm (ComW05a; WriNF08a), where $H_{k}$ is a positive multiple of the identity, the subproblem (2) can often be solved cheaply, particularly when $\psi$ is (block) separable, by means of a prox-operator involving $\psi$. For more general choices of $H_{k}$, or for more complicated regularization functions $\psi$, it may make sense to solve (2) by an iterative process, such as accelerated proximal gradient or coordinate descent. Since it may be too expensive to run this iterative process to obtain a high-accuracy solution of (2), we consider the possibility of an inexact solution. In this paper, we assume that the inexact solution satisfies the following condition, for some constant $\eta\in[0,1)$:

$$Q(d)-Q^{*}\leq\eta\left(Q(0)-Q^{*}\right), \tag{3}$$

where $Q^{*}\coloneqq\inf_{d}Q(d)$ and $Q(0)=0$. The value $\eta=0$ corresponds to an exact solution of (2). Other values $\eta\in(0,1)$ indicate solutions that are inexact to within a multiplicative constant.
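As a hedged illustration of the inexactness condition, the sketch below assumes $\psi=\lambda\|\cdot\|_{1}$ and $H=\xi I$, and assumes the subproblem objective has the form $Q(d)=\nabla f(x)^{T}d+\frac{\xi}{2}\|d\|^{2}+\psi(x+d)-\psi(x)$, so that the exact minimizer is available by soft-thresholding and $Q^{*}$ can be computed. The function names and the data are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def q_value(d, x, grad, xi, lam):
    """Assumed Q(d) = grad^T d + (xi/2)||d||^2 + lam*(||x+d||_1 - ||x||_1); Q(0) = 0."""
    return grad @ d + 0.5 * xi * (d @ d) + lam * (np.abs(x + d).sum() - np.abs(x).sum())

def exact_step(x, grad, xi, lam):
    """Exact minimizer of Q when H = xi * I (one proximal-gradient step)."""
    return soft_threshold(x - grad / xi, lam / xi) - x

def satisfies_condition_3(q_d, q_star, eta, tol=1e-12):
    """Check Q(d) - Q* <= eta * (Q(0) - Q*), using Q(0) = 0."""
    return q_d - q_star <= eta * (0.0 - q_star) + tol

# Hypothetical data, for illustration only.
x = np.array([1.0, -2.0, 0.0])
grad = np.array([0.5, -1.5, 0.2])
xi, lam = 2.0, 0.3

d_star = exact_step(x, grad, xi, lam)
q_star = q_value(d_star, x, grad, xi, lam)

d_half = 0.5 * d_star                      # a deliberately inexact candidate
q_half = q_value(d_half, x, grad, xi, lam)
```

By convexity of $Q$, halving the exact step always satisfies the condition with $\eta=1/2$, whereas much smaller values of $\eta$ reject it in this example.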

The condition (3) is studied in (BonLPP16a, Section 4.1), which applies a primal-dual approach to (2) to satisfy it. In this connection, note that if we have access to a lower bound $Q_{LB}\leq Q^{*}$ (obtained by finding a feasible point for the dual of (2), or other means), then any $d$ satisfying $Q(d)\leq(1-\eta)Q_{LB}$ also satisfies (3).
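The lower-bound remark can be checked in two lines. This is a sketch that assumes (3) reads $Q(d)-Q^{*}\leq\eta(Q(0)-Q^{*})$ with $Q(0)=0$:

$$Q(d)-Q^{*}\leq\eta\left(Q(0)-Q^{*}\right)=-\eta Q^{*}\quad\Longleftrightarrow\quad Q(d)\leq(1-\eta)Q^{*}.$$

Since $1-\eta>0$ and $Q_{LB}\leq Q^{*}$, we have $(1-\eta)Q_{LB}\leq(1-\eta)Q^{*}$, so

$$Q(d)\leq(1-\eta)Q_{LB}\quad\Longrightarrow\quad Q(d)\leq(1-\eta)Q^{*},$$

which is the condition in its rearranged form.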

In practical situations, we need not enforce (3) explicitly for some chosen value of $\eta$ . In fact, we do not necessarily require $\eta$ to be known, or (3) to be checked at all. Rather, we can take advantage of the convergence rates of whatever solver is applied to (2) to ensure that (3) holds for some value of $\eta\in(0,1)$ , possibly unknown. For instance, if we apply an iterative solver to the strongly convex function $Q$ in (2) that converges at a global linear rate $(1-\tau)$ , then the “inner” iteration sequence $\{d^{(t)}\}_{t=0,1,\dotsc}$ (starting from some $d^{(0)}$ with $Q(d^{(0)})\leq 0$ ) satisfies

$$Q(d^{(t)})-Q^{*}\leq(1-\tau)^{t}\left(Q(d^{(0)})-Q^{*}\right)\leq(1-\tau)^{t}\left(Q(0)-Q^{*}\right),\quad t=0,1,2,\dotsc. \tag{4}$$

If we fix the number of inner iterations at $T$ (say), then $d^{(T)}$ satisfies (3) with $\eta=(1-\tau)^{T}$ . Although $\tau$ might be unknown as well, we can implicitly tune the accuracy of the solution by adjusting $T$ . On the other hand, if we wish to attain a certain target accuracy $\eta$ and have an estimate of rate $\tau$ , we can choose the number of iterations $T$ large enough that $(1-\tau)^{T}\leq\eta$ . Note that $\tau$ depends on the extreme eigenvalues of $H_{k}$ in some algorithms; we can therefore choose $H_{k}$ to ensure that $\tau$ is restricted to a certain range for all $k$ .
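The relation between the inner iteration count $T$, the rate $\tau$, and the implied accuracy $\eta=(1-\tau)^{T}$ can be sketched as follows. The helper names are hypothetical, and the sketch assumes that an estimate of $\tau\in(0,1)$ is available.

```python
import math

def implied_eta(T, tau):
    """Accuracy implied by T inner iterations at linear rate (1 - tau)."""
    return (1.0 - tau) ** T

def inner_iterations_for(eta, tau):
    """Smallest integer T >= 0 with (1 - tau)**T <= eta."""
    if not (0.0 < eta <= 1.0 and 0.0 < tau < 1.0):
        raise ValueError("need eta in (0, 1] and tau in (0, 1)")
    T = math.ceil(math.log(eta) / math.log(1.0 - tau))
    # Guard against floating-point rounding in the logarithms.
    while T > 0 and implied_eta(T - 1, tau) <= eta:
        T -= 1
    while implied_eta(T, tau) > eta:
        T += 1
    return T

# Example: a solver with tau = 0.1 and a target accuracy eta = 0.1.
T_needed = inner_iterations_for(0.1, 0.1)
```

Because $T$ grows only logarithmically in $1/\eta$, a moderate target accuracy costs few inner iterations even when $\tau$ is small.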

Empirically, we observe that Q-linear methods for solving (2) often have rapid convergence in their early stages, with slower convergence later. Thus, a moderate value of $\eta$ may be preferable to a smaller value, because moderate accuracy is attainable in disproportionately fewer iterations than high accuracy.

A practical stopping condition for the subproblem solver in our framework is just to set a fixed number of iterations, provided that a linearly convergent method is used to solve (2). This guideline can be combined with other more sophisticated approaches, possibly adjusting the number of inner iterations (and hence implicitly $\eta$ ) according to some heuristics. For simplicity, our analysis assumes a fixed choice of $\eta\in(0,1)$ . We examine in particular the number of outer iterations required to solve (1) to a given accuracy $\epsilon$ . We show that the dependence of the iteration complexity on the inexactness measure $\eta$ is benign, increasing only modestly with $\eta$ over approaches that require exact solution of (2) for each $k$ .

To build complete algorithms around the subproblem (2), we either do a step size line search along the inexact solution $d^{k}$ , or adjust $H_{k}$ and recompute $d^{k}$ , seeking in both cases to satisfy a familiar “sufficient decrease” criterion. We present two algorithms that reflect each of these approaches. The first uses a line search approach on the step size with a modified Armijo rule, as presented in TseY09a . We consider a backtracking line-search procedure for simplicity; the analysis could be adapted for more sophisticated procedures. Given the current point $x^{k}$ , the update direction $d^{k}$ and parameters $\beta,\gamma\in(0,1)$ , backtracking finds the smallest nonnegative integer $i$ such that the step size $\alpha_{k}=\beta^{i}$ satisfies

$$F(x^{k}+\alpha_{k}d^{k})\leq F(x^{k})+\alpha_{k}\gamma\Delta_{k}, \tag{5}$$

where

$$\Delta_{k}\coloneqq\nabla f(x^{k})^{T}d^{k}+\psi(x^{k}+d^{k})-\psi(x^{k}). \tag{6}$$

This version appears as Algorithm 1. The exact version of this algorithm can be considered as a special case of the block-coordinate descent algorithm of TseY09a. (The definition of $\Delta$ in TseY09a contains another term $\omega d^{T}Hd/2$, where $\omega\in[0,1)$ is a parameter. We take $\omega=0$ for simplicity, but our analysis can be extended in a straightforward way to the case of $\omega\in(0,1)$.) In BonLPP16a, Algorithm 1 (with possibly a different criterion on $d^{k}$) is called the “variable metric inexact line-search-based method”. (We avoid the term “metric” because we consider the possibility of indefinite $H_{k}$ in some of our results.) More complicated metrics, not representable by a matrix norm, were also considered in BonLPP16a. Since our analysis makes use only of the smallest and largest eigenvalues of $H_{k}$ (which correspond to the strong convexity and Lipschitz continuity parameters of the quadratic approximation term), we could also generalize our approach to this setting. We present only the matrix-representable case, however, as it allows a more direct comparison with the second algorithm presented next.
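A minimal sketch of the backtracking procedure follows. It assumes that the modified Armijo rule takes the form $F(x+\alpha d)\leq F(x)+\alpha\gamma\Delta$ with $\Delta=\nabla f(x)^{T}d+\psi(x+d)-\psi(x)$. The function name, default parameters, and test problem are hypothetical, and this is not the authors' implementation of Algorithm 1.

```python
import numpy as np

def backtracking_step(F, x, d, delta, beta=0.5, gamma=1e-4, max_backtracks=60):
    """Return alpha = beta**i for the smallest i >= 0 such that
    F(x + alpha*d) <= F(x) + alpha*gamma*delta (assumed Armijo-type rule)."""
    if delta >= 0:
        raise ValueError("delta must be negative for the rule to be meaningful")
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= Fx + alpha * gamma * delta:
            return alpha
        alpha *= beta
    raise RuntimeError("line search did not terminate within max_backtracks")

# Hypothetical example: f(x) = 0.5 x^T A x and psi(x) = lam * ||x||_1.
A = np.diag([1.0, 10.0])
lam = 0.1
F = lambda z: 0.5 * z @ A @ z + lam * np.abs(z).sum()

x = np.array([1.0, 1.0])
grad = A @ x
d = -grad                                   # a crude direction, for illustration only
delta = grad @ d + lam * (np.abs(x + d).sum() - np.abs(x).sum())
alpha = backtracking_step(F, x, d, delta)
```

With the poorly scaled direction $d=-\nabla f(x)$, the unit step is rejected and the step size is halved three times. A direction computed from a subproblem with curvature information would typically need fewer reductions.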

The second algorithm uses the following sufficient decrease criterion from SchT16a; GhaS16a:

$$F(x^{k})-F(x^{k}+d^{k})\geq-\gamma Q_{k}(d^{k}) \tag{7}$$

for a given parameter $\gamma\in(0,1]$. If this criterion is not satisfied, the algorithm modifies $H$ and recomputes $d^{k}$. The criterion (7) is identical to that used by trust-region methods (see, for example, (NocW06a, Chapter 4)), in that the ratio between the actual objective decrease and the decrease predicted by $Q$ is bounded below by $\gamma$; that is,

$$\frac{F(x^{k})-F(x^{k}+d^{k})}{Q_{k}(0)-Q_{k}(d^{k})}\geq\gamma.$$

We consider two variants of modifying $H$ such that (7) is satisfied. The first successively increases $H$ by a factor $\beta^{-1}$ (for some parameter $\beta\in(0,1)$ ) until (7) holds. We require in this variant that the initial choice of $H$ is positive definite, so that all eigenvalues grow by a factor of $\beta^{-1}$ at each multiplication. The second variant uses a similar strategy, except that $H$ is modified by adding a successively larger multiple of the identity, until (7) holds. (This algorithm allows negative eigenvalues in the initial estimate of $H$ .) These two approaches are defined as the first and the second variants of Algorithm 2, respectively.
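The two ways of modifying $H$ can be sketched as below. This is a hedged illustration only: it assumes $\psi=\lambda\|\cdot\|_{1}$, an acceptance test of the form $F(x)-F(x+d)\geq-\gamma Q(d)\geq 0$, an inner solver that runs a fixed number of proximal-gradient steps on $Q$, and a doubling schedule for the identity shift in the second variant. All names, defaults, and the test problem are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def q_value(d, x, g, H, lam):
    """Assumed subproblem objective with psi = lam * ||.||_1; Q(0) = 0."""
    return g @ d + 0.5 * d @ H @ d + lam * (np.abs(x + d).sum() - np.abs(x).sum())

def inexact_subproblem_step(x, g, H, lam, num_inner=20):
    """Fixed number of proximal-gradient steps on Q, started at d = 0.
    Assumes H is positive semidefinite and nonzero."""
    step = 1.0 / np.linalg.norm(H, 2)
    d = np.zeros_like(x)
    for _ in range(num_inner):
        d = soft_threshold(x + d - step * (g + H @ d), step * lam) - x
    return d

def modify_h_until_decrease(f, grad_f, x, H0, lam, variant=1,
                            beta=0.5, gamma=0.5, max_tries=60):
    """One outer iteration: enlarge H until the decrease test is met."""
    F = lambda z: f(z) + lam * np.abs(z).sum()
    g = grad_f(x)
    H, shift = H0.copy(), 1.0
    for _ in range(max_tries):
        d = inexact_subproblem_step(x, g, H, lam)
        if F(x) - F(x + d) >= -gamma * q_value(d, x, g, H, lam) >= 0.0:
            return x + d, H
        if variant == 1:
            H = H / beta                      # scale every eigenvalue by 1/beta
        else:
            H = H0 + shift * np.eye(len(x))   # add a growing multiple of I
            shift /= beta
    raise RuntimeError("no acceptable H found within max_tries")

# Hypothetical least-squares example with an l1 regularizer.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad_f = lambda z: A.T @ (A @ z - b)
lam = 0.1
F = lambda z: f(z) + lam * np.abs(z).sum()

x0 = np.zeros(2)
x1, H1 = modify_h_until_decrease(f, grad_f, x0, np.eye(2), lam, variant=1)
x2, H2 = modify_h_until_decrease(f, grad_f, x0, np.eye(2), lam, variant=2)
```

The initial $H=I$ underestimates the curvature of $f$ here, so both variants must enlarge $H$ several times before the step is accepted. Each rejection costs one more approximate subproblem solve.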

Algorithm 1 and Variant 1 of Algorithm 2 are direct extensions of backtracking line search in the smooth case, in the sense that when $\psi$ is not present, both approaches are identical to shrinking the step size. However, aside from the sufficient decrease criteria, the two differ when the regularization term is present.

The second variant of Algorithm 2 is similar to the method proposed in SchT16a; GhaS16a, with the only difference being the inexactness criterion of the subproblem solution. This variant of modifying $H$ can be seen as interpolating between the step from the original $H$ and the proximal gradient step. It is also a generalization of the trust-region technique for smooth optimization. When $\psi$ is not present, adding a multiple of the identity to $H$ in (2) is equivalent to shrinking the trust region (MorS83a). We can therefore think of Algorithm 2, Variant 2 as a generalized trust-region approach for regularized problems.
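The interpolation can be seen numerically in the smooth case. The sketch below assumes $\psi\equiv 0$ and a positive definite $H$, so that the subproblem minimizer with shift $c$ is $d(c)=-(H+cI)^{-1}g$. The matrix, the vector, and the function name are hypothetical.

```python
import numpy as np

def shifted_step(H, g, c):
    """Minimizer of g^T d + 0.5 d^T (H + c I) d when psi is absent."""
    return -np.linalg.solve(H + c * np.eye(len(g)), g)

H = np.array([[10.0, 2.0], [2.0, 1.0]])     # positive definite
g = np.array([1.0, -1.0])

shifts = [0.0, 1.0, 10.0, 1e6]
steps = [shifted_step(H, g, c) for c in shifts]

# Step lengths shrink as the shift grows (a smaller effective trust region).
norms = [float(np.linalg.norm(d)) for d in steps]

# Cosine of the angle between each step and the steepest-descent direction -g.
cosines = [float(-g @ d / (np.linalg.norm(g) * np.linalg.norm(d))) for d in steps]
```

With $c=0$ the step is the Newton-like direction for this $H$. As $c$ grows, the step shortens and aligns with $-g$, which is the gradient step in the smooth case.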

Rather than our multiplicative criterion (3), the works SchT16a ; GhaS16a use an additive criterion to measure inexactness of the solution. In the analysis of SchT16a ; GhaS16a , this tolerance must then be reduced to zero at a certain rate as the algorithm progresses, resulting in growth of the number of inner iterations per outer iteration as the algorithms progress. By contrast, we attain satisfactory performance (both in theory and practice) for a fixed value $\eta\in(0,1)$ in (3).

Which of the algorithms described above is “best” depends on the circumstances. When (2) is expensive to solve, Algorithm 1 may be preferred, as it requires approximate solution of this subproblem just once on each outer iteration. On the other hand, when $\psi$ has special properties, such as inducing sparsity or low rank in $x$ , Algorithm 2 might benefit from working with sparse iterates and solving the subproblem in spaces of reduced dimension.

Variants and special cases of the algorithms above have been discussed extensively in the literature. Proximal gradient algorithms have $H=\xi I$ for some $\xi>0$ (ComW05a; WriNF08a); proximal-Newton uses $H=\nabla^{2}f$ (LeeSS14a; RodK16a; LiAV17a); proximal-quasi-Newton and variable metric use quasi-Newton approximations for $H_{k}$ (SchT16a; GhaS16a). The term “successive quadratic approximation” is also used by ByrNO16a. Our methods can even be viewed as a special case of block-coordinate descent (TseY09a) with a single block. The key difference in this work is the use of the inexactness criterion (3), while existing works either assume exact solution of (2), or use a different criterion that requires increasing accuracy as the number of outer iterations grows. Some of these works provide only an asymptotic convergence guarantee and a local convergence rate, with a lack of clarity about when the fast local convergence rate will take effect. An exception is BonLPP16a, which also makes use of the condition (3). However, BonLPP16a gives a convergence rate only for convex $f$ and requires existence of a scalar $\mu\geq 1$ and a sequence $\{\zeta_{k}\}$ such that

where $A\succeq B$ means that $A-B$ is positive semidefinite. This condition may preclude such useful and practical choices of $H_{k}$ as the Hessian and quasi-Newton approximations. We believe that our setting may be more general, practical, and straightforward in some situations.

1.2 Contribution

This paper shows that, when the initial value of $H_{k}$ at all outer iterations $k$ is chosen appropriately and (3) is satisfied for all iterations, the objectives of the two algorithms converge at a global Q-linear rate under an “optimal set strong convexity” condition defined in (10), and at a sublinear rate for general convex functions. When $F$ is nonconvex, we show sublinear convergence of the first-order optimality condition. Moreover, to discuss the relation between the subproblem solution precision and the convergence rate, we show that the iteration complexity is proportional to either $1/(1-\eta)$ or $1/(2(1-\sqrt{\eta}))$, depending on the properties of $f$ and $\psi$, and the algorithm parameter choices. Note that for $\eta\in[0,1)$, $1/(1-\eta)>1/(2(1-\sqrt{\eta}))$.
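The comparison of the two factors follows from a one-line factorization, shown here as a worked check:

$$\frac{1}{1-\eta}=\frac{1}{(1-\sqrt{\eta})(1+\sqrt{\eta})}>\frac{1}{2(1-\sqrt{\eta})}\quad\text{for all }\eta\in[0,1),$$

because $1+\sqrt{\eta}<2$ on this interval. For example, at $\eta=1/4$ the two factors are $4/3$ and $1$, and both tend to infinity as $\eta\to 1$ with their ratio approaching one.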

In comparison to existing works, our major contributions are as follows.

We quantify how the inexactness criterion (3) affects the step size of Algorithm 1, the norm of the final $H$ in Algorithm 2, and the iteration complexity of these algorithms. We discuss why the process for finding a suitable value of $\alpha_{k}$ in each algorithm can potentially improve the convergence speed when the quadratic approximations incorporate curvature information, leading to acceptance of step sizes whose values are close to one.

We provide a global convergence rate result on the first-order optimality condition for the case of nonconvex $f$ in (1) for general choices of $H_{k}$ , without assumptions beyond the Lipschitzness of $\nabla f$ .

The global R-linear convergence case of a similar algorithm in GhaS16a when $F$ is strongly convex is improved to a global Q-linear convergence result for a broader class of problems.

For general convex problems, in addition to the known sublinear ( $1/k$ ) convergence rate, we show linear convergence with a rate independent of the conditioning of the problem in the early stages of the algorithm.

Faster linear convergence in the early iterations also applies to problems with global Q-linear convergence, explaining in part the empirical observation that many methods converge rapidly in their early stages before settling down to a slower rate. This observation also allows improvement of iteration complexities.

1.3 Related Work

Our general framework and approach, and special cases thereof, have been widely studied in the literature. Some related work has already been discussed above. We give a broader discussion in this section.

When $\psi$ is the indicator function of a convex constraint set, our approach includes an inexact variant of a constrained Newton or quasi-Newton method. There are a number of papers on this approach, but their convergence results generally have a different flavor from ours. They typically show only asymptotic convergence rates, together with global convergence results without rates, under weaker smoothness and convexity assumptions on $f$ than we make here. For example, when $\psi$ is the indicator function of a “box” defined by bound constraints, ConGT88a applies a trust-region framework to solve (2) approximately, and shows asymptotic convergence. The paper ByrLNZ95a uses a line-search approach, with $H_{k}$ defined by an L-BFGS update, and omits convergence results. For constraint sets defined by linear inequalities, or general convex constraints, BurMT90a shows global convergence of a trust region method using the Cauchy point. A similar approach using the exact Hessian as $H_{k}$ is considered in LinM99a , proving local superlinear or quadratic convergence in the case of linear constraints.

Turning to our formulation (1) in its full generality, Algorithm 1 is analyzed in BonLPP16a , which refers to the condition (3) as “ $\eta$ -approximation.” (Their $\eta$ is equivalent to $1-\eta$ in our notation.) This paper shows asymptotic convergence of $Q_{k}(d)$ to zero without requiring convexity of $F$ , Lipschitz continuity of $\nabla f$ , or a fixed value of $\eta$ . The only assumptions are that $Q_{k}(d^{k})<0$ for all $k$ and the sequence of objective function values converges (which always happens when $F$ is bounded below). Under the additional assumptions that $\nabla f$ is Lipschitz continuous, $F$ is convex, (8), and (3), they showed convergence of the objective value at a $1/k$ rate. The same authors considered convergence for nonconvex functions satisfying a Kurdyka-Łojasiewicz condition in BonLPPR17a , but the exact rates are not given. Our results differ in not requiring the assumption (8), and we are more explicit about the dependence of the rates on $\eta$ . Moreover, we show detailed convergence rates for several additional classes of problems.

A version of Algorithm 2 without line search but requiring $H_{k}$ to overestimate the Hessian, as follows:

is considered in ChoPR14a . Asymptotic convergence is proved, but no rates are given.

Convergence of an inexact proximal-gradient method (for which $H_{k}=LI$ for all $k$ ) is discussed in SchRB11a . With this choice of $H_{k}$ , (7) always holds with $\gamma=1$ . They also discuss its accelerated version for convex and strongly convex problems. Instead of our multiplicative inexactness criterion, they assume an additive inexactness criterion in the subproblem, of the form

Their analysis also allows for an error $e^{k}$ in the gradient term in (2). The paper shows that for general convex problems, the objective value converges at a $1/k$ rate provided that $\sum_{k}\sqrt{\epsilon_{k}}$ and $\sum_{k}\|e^{k}\|$ converge. For strongly convex problems, they proved R-linear convergence of $\|x^{k}-x^{*}\|$, provided that the sequences $\{\|e^{k}\|\}$ and $\{\sqrt{\epsilon_{k}}\}$ both decrease linearly to zero. When our approaches are specialized to proximal gradient ($H_{k}=LI$), our analysis shows a Q-linear rate (rather than R-linear) for the strongly convex case, and applies to the convergence of the objective value rather than the iterates. Additionally, our results show convergence for nonconvex problems.

Variant 2 of Algorithm 2 is proposed in SchT16a; GhaS16a for convex and strongly convex objectives, with inexactness defined additively as in (9). For convex $f$, SchT16a showed that if $\sum_{k=0}^{\infty}\epsilon_{k}/\|H_{k}\|$ and $\sum_{k=0}^{\infty}\sqrt{\epsilon_{k}/\|H_{k}\|}$ converge, then a $1/k$ convergence rate is achievable. The same rate can be achieved if $\epsilon_{k}\leq(a/k)^{2}$ for any $a\in[0,1]$. When $F$ is $\mu$-strongly convex, GhaS16a showed that if $\sum\epsilon_{k}/\rho^{k}$ is finite (where $\rho=1-(\gamma\mu)/(\mu+M)$, $M$ is the upper bound for $\|H_{k}\|$, and $\gamma$ is as defined in (7)), then a global R-linear convergence rate is attained. In both cases, the conditions require a certain rate of decrease for $\epsilon_{k}$, a condition that can be achieved by performing more and more inner iterations as $k$ increases. By contrast, our multiplicative inexactness criterion (3) can be attained with a fixed number of inner iterations. Moreover, we attain a Q-linear rather than an R-linear result.

For the case in which $f$ is convex, thrice continuously differentiable, and self-concordant, and $\psi$ is the indicator function of a closed convex set, TraKC14a analyzed global and local convergence rates of inexact damped proximal Newton with a fixed step size. The paper LiAV17a extends this convergence analysis to general convex $\psi$ . However, generalization of these results beyond the case of $H_{k}=\nabla^{2}f(x^{k})$ and self-concordant $f$ is not obvious.

Accelerated inexact proximal gradient is discussed in SchRB11a; VilSBV13a for convex $f$ to obtain an improved $O(1/k^{2})$ convergence rate. The work JiaST12a considers acceleration with more general choices of $H$ under the requirement $H_{k}\succeq H_{k+1}$ for all $k$, which precludes many interesting choices of $H_{k}$. This requirement is relaxed by GhaS16a to $\theta_{k}H_{k}\succeq\theta_{k+1}H_{k+1}$ for scalars $\theta_{k}$ that are used to decide the extrapolation step size. However, as shown in the experiment in GhaS16a, extrapolation may not accelerate the algorithm. Our analysis does not include acceleration using extrapolation steps, but by combining with the Catalyst framework (LinMH15a), similar improved rates could be attained.

1.4 Outline: Remainder of the Paper

In Section 2, we introduce notation and prove some preliminary results. Convergence analysis appears in Section 3 for Algorithms 1 and 2, covering both convex and nonconvex problems. Some interesting and practical choices of $H_{k}$ are discussed in Section 4 to show that our framework includes many existing algorithms. We provide some preliminary numerical results in Section 5, and make some final comments in Section 6.

2 Notations and Preliminaries

The norm $\|\cdot\|$, when applied to vectors, denotes the Euclidean norm. When applied to a symmetric matrix $A$, it denotes the corresponding induced norm, which equals the spectral radius of $A$. For any symmetric matrix $A$, $\lambda_{\mbox{\rm\scriptsize{min}}}(A)$ denotes its smallest eigenvalue. For any two symmetric matrices $A$ and $B$, $A\succeq B$ (respectively $A\succ B$) denotes that $A-B$ is positive semidefinite (respectively positive definite). For our nonsmooth function $F$, $\partial F$ denotes the set of generalized gradients, defined as

$$\partial F(x)\coloneqq\nabla f(x)+\partial\psi(x),$$

where $\partial\psi$ denotes the subdifferential (as $\psi$ is convex). When the minimum $F^{*}$ of $F(x)$ is attainable, we denote the solution set by $\Omega\coloneqq\left\{x\mid F\left(x\right)=F^{*}\right\}$, and define $P_{\Omega}(x)$ as the (Euclidean-norm) projection of $x$ onto $\Omega$.
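The matrix notation can be checked numerically with NumPy. The following is a small sketch; the matrix and the helper name `succeq` are hypothetical.

```python
import numpy as np

def succeq(A, B, tol=1e-10):
    """True when A - B is positive semidefinite (up to a small tolerance)."""
    return bool(np.linalg.eigvalsh(A - B)[0] >= -tol)

A = np.array([[2.0, 1.0], [1.0, -3.0]])     # symmetric and indefinite

eigs = np.linalg.eigvalsh(A)                # eigenvalues in ascending order
lambda_min = eigs[0]                        # smallest eigenvalue
induced_norm = np.linalg.norm(A, 2)         # largest singular value
spectral_radius = np.max(np.abs(eigs))      # largest eigenvalue in magnitude
```

For a symmetric matrix the induced 2-norm and the spectral radius coincide, and $\|A\|I\succeq A$ always holds, even when $A$ itself is indefinite.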

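In some results, we use a particular strong convexity assumption to obtain a faster rate. We say that $F$ satisfies the optimal set strong convexity condition with modulus $\mu\geq 0$ if for any $x$ and any $\lambda\in[0,1]$, we have

$$F\left(\lambda x+(1-\lambda)P_{\Omega}(x)\right)\leq\lambda F(x)+(1-\lambda)F^{*}-\frac{\mu\lambda(1-\lambda)}{2}\left\|x-P_{\Omega}(x)\right\|^{2}. \tag{10}$$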

This condition does not require the strong convexity to hold globally, but only between the current point and its projection onto the solution set. Examples of functions that are not strongly convex but satisfy (10) include:

$F(x)=h(Ax)$ where $h$ is strongly convex, and $A$ is any matrix;

$F(x)=h(Ax)+\mathbf{1}_{X}(x)$ , where $X$ is a polyhedron;

Squared-hinge loss: $F(x)=\sum\max(0,a_{i}^{T}x-b_{i})^{2}$ .

A similar condition is the “quasi-strong convexity” condition proposed by NecNG18a , which always implies (10), and can be implied by optimal set strong convexity if $F$ is differentiable. However, since we allow $\psi$ (and therefore $F$ ) to be nonsmooth, we need a different definition here.

Turning to the subproblem (2) and the definition of $\Delta_{k}$ in (6), we find a condition for $d$ to be a descent direction.

Lemma 1. If $\psi$ is convex and $f$ is differentiable, then $d$ is a descent direction for $F$ at $x$ if $\Delta<0$.

Proof. We know that $d$ is a descent direction for $F$ at $x$ if the directional derivative

is negative. Note that since $f$ is differentiable and $\psi$ is convex,

is well-defined. Now from the convexity of $\psi$,

Therefore, when $\Delta<0$ , the directional derivative is negative and $d$ is a descent direction. ∎
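One way to fill in the displayed steps of this argument is sketched below. It is a hedged reconstruction that assumes $\Delta\coloneqq\nabla f(x)^{T}d+\psi(x+d)-\psi(x)$ and that $\psi(x)$ is finite; the notation $F^{\prime}(x;d)$ is introduced here for illustration.

$$F^{\prime}(x;d)\coloneqq\lim_{\alpha\downarrow 0}\frac{F(x+\alpha d)-F(x)}{\alpha}=\nabla f(x)^{T}d+\lim_{\alpha\downarrow 0}\frac{\psi(x+\alpha d)-\psi(x)}{\alpha}.$$

For $\alpha\in(0,1]$, convexity of $\psi$ gives $\psi(x+\alpha d)\leq(1-\alpha)\psi(x)+\alpha\psi(x+d)$, so that

$$\frac{\psi(x+\alpha d)-\psi(x)}{\alpha}\leq\psi(x+d)-\psi(x)\quad\Longrightarrow\quad F^{\prime}(x;d)\leq\nabla f(x)^{T}d+\psi(x+d)-\psi(x)=\Delta.$$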

The following lemma motivates our algorithms.

Lemma 2. If $Q$ and $\psi$ are convex and $f$ is differentiable, then $Q(d)<0$ implies that $d$ is a descent direction for $F$ at $x$.

Proof. Note that $Q(0)=0$. Therefore, if $Q$ is convex, we have

$$Q(\lambda d)\leq\lambda Q(d)+(1-\lambda)Q(0)=\lambda Q(d)<0$$

for all $\lambda\in(0,1]$ . It follows that $\nabla f(x)^{T}(\lambda d)+\psi(x+\lambda d)-\psi(x)<0$ for all sufficiently small $\lambda$ . Therefore, from Lemma 1, $\lambda d$ is a descent direction, and since $d$ and $\lambda d$ only differ in their lengths, so is $d$ . ∎

Positive semidefiniteness of $H$ suffices to ensure convexity of $Q$ . However, Lemma 2 may be used even when $H$ has negative eigenvalues, as $\psi$ may have a strong convexity property that ensures convexity of $Q$ . Lemma 2 then suggests that no matter how coarse the approximate solution of (2) is, as long as it is better than $d=0$ for a convex $Q$ , it results in a descent direction. This fact implies finite termination of the backtracking line search procedure in Algorithm 1.

3 Convergence Analysis

We start our analysis for both algorithms by showing finite termination of the line search procedures. We then discuss separately three classes of problems involving different assumptions on $F$ , namely, that $F$ is convex, that $F$ satisfies optimal set strong convexity (10), and that $F$ is nonconvex. Different iteration complexities are proved in each case. The following condition is assumed throughout our analysis in this section.

Assumption 1. In (1), $f$ is $L$-Lipschitz-continuously differentiable for some $L>0$; $\psi$ is convex, extended-valued, proper, and closed; $F$ is lower-bounded; and the solution set $\Omega$ of (1) is nonempty.

We show that the line search procedures have finite termination. The following lemma for the backtracking line search in Algorithm 1 does not require $H$ to be positive definite, though it does require strong convexity of $Q$ in (2).

Lemma 3. If Assumption 1 holds, $Q$ is $\sigma$-strongly convex for some $\sigma>0$, and the approximate solution $d$ to (2) satisfies (3) for some $\eta<1$, then for $\Delta$ defined in (6), we have

then the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size $\alpha$ that satisfies the following lower bound:

Proof. From (3) and strong convexity of $Q$, we have that for any $\lambda\in[0,1]$,

Since $Q(0)=0$ , we obtain by substituting from the definition of $Q$ that

Since $1/(1-\eta)\geq 1\geq\lambda$ , we have

It follows immediately that the following bound holds for any $\lambda\in[0,1]$:

We make the following specific choice of $\lambda$ :

The result (11) follows by substituting these identities into (14).

If the right-hand side of (11) is negative, then we have from the Lipschitz continuity of $\nabla f$, the convexity of $\psi$, and the mean value theorem that the following relationships are true for all $\alpha\in[0,1]$:

This leads to (12), when we introduce a factor $\beta$ to account for possible undershoot of the backtracking procedure. ∎

Note that Lemma 3 allows indefinite $H$ , and suggests that we can still obtain a certain amount of objective decrease as long as $\lambda_{\mbox{\rm\scriptsize{min}}}(H)$ is not too negative in comparison to the strong convexity parameter of $Q$ . When the strong convexity of $Q$ is accounted for completely by the quadratic part (that is, $\lambda_{\mbox{\rm\scriptsize{min}}}(H)=\sigma>0$ ) we have the following simplification of Lemma 3.

Corollary 1. If Assumption 1 holds, $\lambda_{\mbox{\rm\scriptsize{min}}}(H)=\sigma>0$, and the approximate solution $d$ to (2) satisfies (3) for some $\eta<1$, we have

Moreover, the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size that satisfies the following lower bound:

Proof. Following (13), we have from convexity of $\psi$ for any $\lambda\in[0,1]$ that

Using (15) in (18), we obtain (16). The bound (17) follows by substituting $\sigma=\lambda_{\mbox{\rm\scriptsize{min}}}(H)$ into (12). ∎

Note that the first inequality in (11) and the second inequality in (16) make use of the pessimistic lower bound $d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}$. In practice, we observe (see Section 5) that the unit step $\alpha_{k}=1$, which is significantly larger than the lower bounds (12) and (17), is often accepted when $H_{k}$ is the actual Hessian $\nabla^{2}f(x^{k})$ or a quasi-Newton approximation to it.

Lemma 4. If Assumption 1 holds, $Q$ is $\sigma$-strongly convex for some $\sigma>0$, and $d$ is an approximate solution to (2) satisfying (3) for some $\eta\in[0,1)$, then (7) is satisfied if

Therefore, in Algorithm 2, if the initial $H^{0}_{k}$ satisfies

for some $M_{0}>0$ , $m_{0}\leq M_{0}$ , then for Variant 2, the final $H_{k}$ satisfies

For Variant 1, if we assume in addition that $m_{0}>0$ , we have

Proof. From Lipschitz continuity of $\nabla f$, we have that

where in (23) we used the definition (6), and in (24) we used Lemma 3. By noting $d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}$ , (24) shows that (19) implies (7).

Since $\psi$ is convex, we have that $\sigma\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)$ , so that a sufficient condition for (19) is that

Let the coefficient of $\lambda_{\mbox{\rm\scriptsize{min}}}(H)$ in the above inequality be denoted by $C_{1}$. This observation suggests that for Variant 1 the smallest eigenvalue of the final $H$ is no larger than $L/(C_{1}\beta)$. Since the ratio between the largest and the smallest eigenvalues of $H_{k}$ remains unchanged after scaling the whole matrix, we obtain (22).

For Variant 2, to satisfy $C_{1}H\succeq LI$ , the coefficient for $I$ must be at least $L/C_{1}-m_{0}$ . Considering the overshoot, and that the difference between the largest and the smallest eigenvalues is fixed after adding a multiple of identity, we obtain the condition (21). ∎

Because the analysis relies on the pessimistic bound $d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}$, we rarely observe the worst-case bounds (22) or (21) in practice, unless $H^{0}$ is a multiple of the identity.

3.2 Iteration Complexity

Now we turn to the iteration complexity of our algorithms, considering three different assumptions on $F$ : convexity, optimal set strong convexity, and the general (possibly nonconvex) case.

The following lemma is modified from some intermediate results in GhaS16a , which shows R-linear convergence of Variant 2 of Algorithm 2 for a strongly convex objective when the inexactness is measured by an additive criterion. A proof can be found in Appendix A.

Lemma 5. Let $F^{*}$ be the optimum of $F$. If Assumption 1 holds, $f$ is convex and $F$ is $\mu$-optimal-set-strongly convex as defined in (10) for some $\mu\geq 0$, then for any given $x$ and $H$, and for all $\lambda\in[0,1]$, we have

where $Q^{*}$ is the optimal objective value of (2). In particular, by setting $\lambda=\mu/(\mu+\|H\|)$ (as in GhaS16a ), we have

We start with the case of convex $F$, that is, $\mu=0$ in the definition (10). In this case, the first inequality in (25) reduces to

for all $\lambda\in[0,1]$. We assume the following in this subsection.

Assumption 2. There exist finite $R_{0},M>0$ such that

Using this assumption, we can bound the second term in (27) by

The bound $\hat{A}\leq MR_{0}^{2}$ is quite pessimistic, but we use it for purposes of comparing with existing works.

The following lemma is inspired by (Bac15a, Lemma 4.4) but contains many nontrivial modifications, and will be needed in proving the convergence rate for general convex problems. Its proof can be found in Appendix B.

Lemma 6. Assume we have three nonnegative sequences $\{\delta_{k}\}_{k\geq 0}$, $\{c_{k}\}_{k\geq 0}$, and $\{A_{k}\}_{k\geq 0}$, and a constant $A>0$ such that for all $k=0,1,2,\dotsc$, and for all $\lambda_{k}\in[0,1]$, we have

In addition, if we define $k_{0}\coloneqq\arg\min\{k:\delta_{k}<A\}$ , then

By Lemma 6 together with Assumption 2, we can show that the algorithms converge at a global sublinear rate (with a linear rate in the early stages) for the case of convex $F$ , provided that the final value of $H_{k}$ for each iteration $k$ of Algorithms 1 and 2 is positive semidefinite.

Theorem 3.1. Assume that $f$ is convex, Assumptions 1 and 2 hold, $H_{k}\succeq 0$ for all $k$, and there is some $\eta\in[0,1)$ such that the approximate solution $d^{k}$ of (2) satisfies (3) for all $k$. Then the following claims for Algorithm 1 are true.

1. When $F(x^{k})-F^{*}\geq(x^{k}-P_{\Omega}(x^{k}))^{T}H_{k}(x^{k}-P_{\Omega}(x^{k}))$, we have a linear improvement of the objective error at iteration $k$, that is,

2. For any $k\geq k_{0}$, where $k_{0}\coloneqq\arg\min\{k:F(x^{k})-F^{*}<MR_{0}^{2}\}$, we have

suggesting sublinear convergence of the objective error. If there exists $\bar{\alpha}>0$ such that $\alpha_{k}\geq\bar{\alpha}$ for all $k$ , we have

Proof. Denoting $\delta_{k}\coloneqq F(x^{k})-F^{*}$, we have for Algorithm 1 that the sufficient decrease condition (5) together with $H_{k}\succeq 0$ implies that

(note that $A_{k}\leq A$ follows from (29)) and using (3), (36), and (27), we obtain

The results now follow directly from Lemma 6.

For Algorithm 2, from (7) and (3), we get that for any $k\geq 0$ ,

and the remainder of the proof follows the above procedure starting from the right-hand side of (36) with $\alpha_{k}\equiv 1$ . ∎

The conditions of Parts 1 and 2 of Theorem 3.1 bear further consideration. When the regularization term $\psi$ is not present in $F$, and $M$ is a global bound on the norm of the true Hessian $\nabla^{2}f(x)$, the condition in Part 2 of Theorem 3.1 is satisfied for $k_{0}=0$, since $f(x^{0})-f^{*}\leq\frac{1}{2}M\|x^{0}-P_{\Omega}(x^{0})\|^{2}\leq\frac{1}{2}MR_{0}^{2}$. Under these circumstances, the linear convergence result of Part 1 may appear not to be interesting. We note, however, that the contribution from $\psi$ may make a significant difference in the general case (in particular, it may result in $F(x^{0})-F^{*}>MR_{0}^{2}$) and, moreover, a choice of $H_{k}$ with $\|H_{k}\|$ significantly less than $M$ may result in the condition of Part 1 being satisfied intermittently during the computation. In particular, Part 1 lends some support to the empirical observation of rapid convergence in the early stages of the algorithms, as we discuss further below. Note that (Nes13a, Theorem 4) suggests that when the algorithm is exact proximal gradient, we get $F(x^{k})-F^{*}\leq MR_{0}^{2}$ for all $k\geq 1$, but this is not always the case when a different $H$ is picked or when (2) is solved only approximately.
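The bound on $f(x^{0})-f^{*}$ used in the smooth case can be derived as follows. This sketch assumes that $\psi$ is absent, that $\|\nabla^{2}f(x)\|\leq M$ everywhere, and that $R_{0}$ bounds $\|x^{0}-P_{\Omega}(x^{0})\|$. Writing $\bar{x}\coloneqq P_{\Omega}(x^{0})$, we have $\nabla f(\bar{x})=0$ and $f(\bar{x})=f^{*}$, so a second-order Taylor bound gives

$$f(x^{0})\leq f(\bar{x})+\nabla f(\bar{x})^{T}(x^{0}-\bar{x})+\frac{M}{2}\|x^{0}-\bar{x}\|^{2}=f^{*}+\frac{M}{2}\|x^{0}-P_{\Omega}(x^{0})\|^{2}\leq f^{*}+\frac{1}{2}MR_{0}^{2}.$$

The right-hand side is strictly below $f^{*}+MR_{0}^{2}$, which is why $k_{0}=0$ in this setting.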

By combining Theorem 3.1 with Lemma 3 and Corollary 1 (which yield lower bounds on $\alpha_{k}$ ), we obtain the following results for Algorithm 1.

Assume the conditions of Theorem 3.1 are all satisfied. Then we have the following.

If there exists $\sigma>0$ such that $\lambda_{\mbox{\rm\scriptsize{min}}}(H_{k})\geq\sigma$ for all $k$ , then (33) becomes

If $Q_{k}$ is $\sigma$ -strongly convex and $H_{k}\succeq 0$ for all $k$ , then (33) becomes

We make some remarks on the results above.

Therefore, Algorithm 1 with positive definite $H_{k}$ has better dependency on $\eta$ than the case in which we set $\lambda_{\mbox{\rm\scriptsize{min}}}(H_{k})=0$ and rely on $\psi$ to make $Q_{k}$ strongly convex. If $\psi$ is strongly convex, we can move some of its curvature to $H_{k}$ without changing the subproblems (2). This strategy may require us to increase $M$ , but this has only a slight effect on the bounds in Corollary 2. These bounds give good reasons to capture the curvature of $Q_{k}$ in the Hessian $H_{k}$ alone, so henceforth we focus our discussion on this case.

For Algorithm 2, when we use the bounds (22) and (21) for $M$ in (28), the dependency of the global complexity on $\eta$ becomes

This result is slightly worse than that of using positive definite $H$ in Algorithm 1 if we compare the second part in the max operation.

The bound in (29) is not tight for general $H$, unless $H_{k}\equiv MI$, as in standard prox-gradient methods. This observation gives further intuition for why second-order methods tend to perform well even though their iteration complexities (which are based on the bound (29)) tend to be worse than those of first-order methods. Moreover, when $H_{k}$ incorporates curvature information for $f$, step sizes $\alpha_{k}$ are often much larger than the worst-case bounds that are used in Corollary 2. Theorem 3.1, which shows how the convergence rates are related directly to the $\alpha_{k}$, would give tighter bounds in such cases. Line search on $H_{k}$ in Algorithm 2 does not improve the rate directly, but we note that using $H_{k}$ with smaller norm whenever possible gives more chances of switching to the intermittent linear rate (33).

Part 1 of Theorem 3.1 also explains why solving the subproblem (2) approximately can reduce the running time significantly: because of the fast early convergence rate, a solution of moderate accuracy can be attained relatively quickly.

2.2 Linear Convergence for Optimal Set Strongly Convex Functions

We now consider problems that satisfy the $\mu$ -optimal-set-strong-convexity condition (10) for some $\mu>0$ , and show that our algorithms have a global linear convergence property.

If Assumption 1 holds, $f$ is convex, $F$ is $\mu$ -optimal-set-strongly convex for some $\mu>0$ , there is some $\eta\in[0,1)$ such that at every iteration of Algorithm 1, the approximate solution $d$ of (2) satisfies (3), and

Moreover, on iterates $k$ for which $F(x^{k})-F^{*}\geq(x^{k}-P_{\Omega}(x^{k}))^{T}H_{k}(x^{k}-P_{\Omega}(x^{k}))$ , these per-iteration contraction rates can be replaced by the faster rates (33) and (39).

where in (42a) we used the inexactness condition (3) and in (42b) we used (26). Using the result in Corollary 1 to lower-bound $\alpha_{k}$ , we obtain (41b).

To show that the early fast rates (33) and (39) can be applied, we show that Assumption 2 holds. Because $f$ is also assumed to be convex here, Theorem 3.1 and Corollary 2 then apply. Rearranging the terms in (10), we get

as $F\left(\lambda x+\left(1-\lambda\right)P_{\Omega}\left(x\right)\right)\geq F^{*}$ from optimality. By dividing both sides of (43) by $\lambda$ and letting $\lambda\rightarrow 0$ , we get the bound

Note that the parameter $\mu$ in the theorem above is decided by the problem and cannot be changed, while $\sigma$ can be altered according to the algorithm choice. We have a similar result for Algorithm 2.

If Assumption 1 holds, $f$ is convex, $F$ is $\mu$ -optimal-set-strongly convex for some $\mu>0$ , there exists some $\eta\in[0,1)$ such that at every iteration of Algorithm 2, the approximate solution $d$ of (2) satisfies (3), and the conditions for $H^{0}_{k}$ in Lemma 4 are satisfied for all $k$ . Then we have

and the right-hand side of (45) can be further bounded by

By reasoning with the extreme eigenvalues of $H_{k}$ , we can see that the convergence rates still depend on the conditioning of $f$ . For Algorithm 1, if we select $M\leq L$ , then backtracking may be necessary, and the bound (41b) (in which a factor $\mu/L$ appears) is germane. This same factor appears in both (41a) and (41b) when $M>L$ . Often, however, the backtracking line search chooses a value of $\alpha_{k}$ that is not much less than $1$ , which is why we believe that the bounds (33), (34), and (41a) (which depend explicitly on $\alpha_{k}$ ) have some value in revealing the actual performance of the algorithm. Similar comments apply to Algorithm 2, because (7) may be satisfied with $\|H_{k}\|$ much smaller than the bounds for properly chosen $H^{0}_{k}$ .

In the interesting case in which we choose $H_{k}\equiv LI$ and $\eta=0$, we have $m_{0}=\|H_{k}\|=L$ in Algorithm 2, and modification of $H_{k}$ is not needed, since (7) always holds for $\gamma=1$. The bound (34) becomes $F(x^{k})-F^{*}\leq 2LR_{0}^{2}/(k+2)$, which matches the known convergence rates of proximal gradient (Nes13a) and gradient descent (Nes04a). The global linear rate in Theorem 3.3 also matches that of existing proximal gradient analysis for strongly convex problems, but the intermittent linear rate (33) that applies to both cases is new. Accelerated proximal gradient, as covered in Nes13a, does not fall directly within the framework studied in this work, but one can combine our algorithm and analysis with the Catalyst framework (LinMH15a) to obtain similar accelerated rates for both the strongly convex and the general convex cases.

2.3 Sublinear Convergence of the First-order Optimality Condition for Nonconvex Problems

We consider now the case of nonconvex $F$ . In this situation, Lemma 5 cannot be used, so we consider other properties of $Q$ . We can no longer guarantee the convergence of the objective value to the global minimum. Instead, we consider the norm of the exact solution of the subproblem as the indicator of closeness to the first-order optimality condition $0\in\partial F(x)$ for (1) (see, for example, (Fle87a, , (14.2.16))). In particular, it is known that $0\in\partial F(x)$ if and only if

This is a consequence of the following lemma.

Given any $H\succ 0$ , and $Q_{H}^{x}$ defined as in (2), the following are true.

A point $x$ satisfies the first-order optimality condition $0\in\partial F(x)$ if and only if

For any $x$ , defining $d^{*}$ to be the minimizer of $Q_{H}^{x}(\cdot)$ , we have

Part 1 is well known. For Part 2, we have from the optimality conditions for $d^{*}$ that $-\nabla f(x)-Hd^{*}\in\partial\psi(x+d^{*})$ . By convexity of $\psi$ , we thus have

As in (47), we consider the following measure of closeness to a stationary point:

We show that the minimum value of the norm of this measure over the first $k$ iterations converges to zero at a sublinear rate of $O(1/\sqrt{k})$ . The first step is to show that the minimum of $|Q_{k}|$ converges at a $O(1/k)$ rate.
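As a concrete illustration of such a stationarity measure, the sketch below evaluates the exact minimizer of $Q_{H}^{x}$ for the special case $H=I$ and $\psi=\lambda\|\cdot\|_{1}$, where it has a closed form through soft-thresholding. The choice $H=I$, the $\ell_{1}$ regularizer, and the function name `stationarity_measure` are assumptions made for this example; by Part 1 of the lemma above, the returned vector is zero exactly at points satisfying $0\in\partial F(x)$.

```python
import numpy as np

def stationarity_measure(x, grad, lam):
    """Minimizer over d of grad^T d + 0.5*||d||^2 + lam*(||x+d||_1 - ||x||_1).

    This is the subproblem with H = I (an assumption of this sketch);
    its solution is a soft-thresholded gradient step minus x.
    """
    v = x - grad
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0) - x

# Toy problem (hypothetical): f(x) = 0.5*||x - c||^2, so grad f(x) = x - c.
c = np.array([3.0, -0.2, 1.5])
lam = 0.5
x_star = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)  # minimizer of f + lam*||.||_1

print(np.linalg.norm(stationarity_measure(x_star, x_star - c, lam)))
print(np.linalg.norm(stationarity_measure(np.zeros(3), -c, lam)))
```

The first norm vanishes (up to rounding) because `x_star` is stationary, while the second is positive because the origin is not.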

Assume that there is an $\eta\in[0,1)$ such that (3) is satisfied at all iterations. For Algorithm 1, if Assumption 1 holds and $H_{k}\succeq\sigma I$ for some $\sigma>0$ and all $k$ , we have

For Algorithm 2 (requires $H_{k}^{0}\succ 0$ for the first variant), we have

From (36), we have that for any $k\geq 0$ ,

From Corollary 1, we have that $\alpha_{t}$ for all $t$ is lower bounded by a positive value. Therefore, using $\left|Q_{t}\left(d^{t}\right)\right|=-Q_{t}\left(d^{t}\right)$ for all $t$ , we obtain

Substituting the lower bound for $\alpha$ from Corollary 1 gives the desired result (50). The result for Algorithm 2 follows from the same reasoning applied to (7). ∎

The following lemma is from TseY09a . (Its proof is omitted.)

Given $H_{k}$ satisfying (40) for all $k$ , we have

We are now ready to show the convergence of $\|G_{k}\|$ .

For Algorithm 2, if the initial $H_{k}^{0}$ satisfies $M_{0}I\succeq H_{k}^{0}\succeq m_{0}I$ with $M_{0}\geq m_{0}>0$ then for Variant 1 we have:

Letting $\bar{k}\coloneqq\arg\min_{0\leq t\leq k}\left|Q_{t}(d^{t})\right|$, we see that the condition (3) and Lemmas 7 and 9 imply

Finally, we note that $\|G_{\bar{k}}\|\geq\min_{0\leq t\leq k}\|G_{t}\|$ . The proof is finished by combining (52) with Lemma 8. ∎

If we replace the definition of $G_{k}$ in (49) by the solution of (2), the inequality in Lemma 9 is not needed. In particular, when we use the proximal gradient algorithm with $H_{k}=LI$ and $\eta=0$ (so that (7) holds with $\gamma=1$ , and $M=L$ ) we obtain a bound of $2(F(x^{0})-F^{*})/(L(k+1))$ on $\|d^{k}\|^{2}$ , matching the result shown in Nes13a ; DruL16a .

2.4 Comparison Among Different Approaches

Algorithms 1 and 2 both require evaluation of the function $F$ for each choice of the parameter $\alpha_{k}$ , to check whether the decrease conditions (5) and (7) (respectively) are satisfied. The difference is that Algorithm 2 may also require solution of the subproblem (2) for each $\alpha_{k}$ . This additional computation comes with two potential benefits. First, the second variant of Algorithm 2 allows the initial choice of approximate Hessian $H^{0}_{k}$ to be indefinite, although the final value $H_{k}$ at each iteration needs to be positive semidefinite for our analysis to hold. (There is a close analogy here to trust-region methods for nonconvex smooth optimization, where an indefinite Hessian is adjusted to be positive semidefinite in the process of solving the trust-region subproblem.) Second, because full steps are always taken in Algorithm 2, any structure induced in the iterates $x^{k}$ by the regularizer $\psi$ (such as sparsity) will be preserved. This fact in turn may lead to faster convergence, as the algorithm will effectively be working in a low-dimensional subspace.

Here we discuss some ways to choose $H_{k}$ so that the algorithms are well defined and practical, and our convergence theory can be applied.

When the $H_{k}$ are chosen to be positive multiples of the identity ($H_{k}=\zeta_{k}I$, say), our algorithms reduce to variants of proximal gradient. If we set $\zeta_{k}\geq L$, then the unit step size is always accepted even if the subproblem is not solved exactly, because $Q_{k}(d^{k})$ is an upper bound on $F(x^{k}+d^{k})-F(x^{k})$. When $L$ is not known in advance, adaptive strategies can be used to find it. For Algorithm 2, we could define $\zeta_{k}^{0}$ (such that $H_{k}^{0}=\zeta_{k}^{0}I$) to be the final value $\zeta_{k-1}$ from the previous iteration, possibly choosing a smaller value at some iterations to avoid being too conservative. For Algorithm 1, we could increase $\zeta_{k}^{0}$ over $\zeta_{k-1}$ if backtracking was necessary at iteration $k-1$, and shrink it when a unit step size sufficed for several successive iterations.
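The adaptive choice of $\zeta_{k}$ can be sketched as follows for an $\ell_{1}$-regularized least-squares problem. This is an illustrative sketch rather than the authors' implementation: the test problem, the names `prox_grad_adaptive` and `soft_threshold`, the halving and doubling factors, the cap of 60 increases, and the acceptance test $F(x+d)\leq F(x)+\gamma Q(d)$ (taken here as a stand-in for condition (7)) are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_adaptive(A, b, lam, iters=100, zeta=1.0, gamma=1e-4):
    """Minimize 0.5*||Ax-b||^2 + lam*||x||_1 using H_k = zeta_k * I."""
    l1 = lambda z: lam * np.sum(np.abs(z))
    F = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + l1(z)
    x = np.zeros(A.shape[1])
    history = [F(x)]
    for _ in range(iters):
        g = A.T @ (A @ x - b)
        zeta = max(0.5 * zeta, 1e-8)   # start below the last accepted value
        for _ in range(60):            # bounded number of increases (safeguard)
            # Exact minimizer of the subproblem when H = zeta * I.
            d = soft_threshold(x - g / zeta, lam / zeta) - x
            Q = g @ d + 0.5 * zeta * (d @ d) + l1(x + d) - l1(x)
            if F(x + d) <= history[-1] + gamma * Q:   # assumed decrease test
                x = x + d
                break
            zeta *= 2.0                # model too optimistic: add curvature
        history.append(F(x))
    return x, history, zeta

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
x, history, zeta = prox_grad_adaptive(A, b, lam=0.1, iters=50)
print(history[0], history[-1], zeta)
```

Once $\zeta_{k}$ reaches the Lipschitz constant of $\nabla f$, the test is passed in exact arithmetic, so the inner loop stops after a small number of doublings.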

The proximal Newton approach of setting $H_{k}=\nabla^{2}f(x^{k})$ is a common choice in the convex case (LeeSS14a, ), where we can guarantee that $H_{k}$ is at least positive semidefinite. In LeeSS14a , it is shown that in some neighborhood of the optimum, when $d^{k}$ is the exact solution of (2), then unit step size is always taken, and superlinear or quadratic convergence to the optimum ensues. (A global complexity condition is not required for this result.) Generally, however, indefiniteness in $\nabla^{2}f(x^{k})$ may lead to the search direction $d^{k}$ not being a descent direction, and the backtracking line search will not terminate in this situation. (Our convergence results for Algorithm 1 do not apply in the case of $H_{k}$ indefinite.) A common fix is to use damping, setting $H_{k}=\nabla^{2}f(x^{k})+\zeta_{k}I$ , for some $\zeta_{k}\geq 0$ that at least ensures positive definiteness of $H_{k}$ . Strategies for choosing $\zeta_{k}$ adaptively have been the subject of much research in the context of smooth minimization, for example, in trust-region methods and the Levenberg-Marquardt method for nonlinear least squares (see NocW06a ). Variant 2 of our Algorithm 2 uses this strategy. It is desirable to ensure that $\zeta_{k}\to 0$ as the iterates approach a solution at which local convexity holds, to ensure rapid local convergence.
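A very simple way to pick a damping parameter is sketched below: it shifts the Hessian by just enough to push its smallest eigenvalue up to a prescribed threshold. This is not the adaptive rule of Variant 2 of Algorithm 2, only an illustration of the damping idea; the function name `damped_hessian`, the threshold `min_eig`, and the use of a full eigenvalue computation (which would be too costly for large problems) are assumptions of the sketch.

```python
import numpy as np

def damped_hessian(hess, min_eig=1e-6):
    """Return (H, zeta) with H = hess + zeta*I and lambda_min(H) >= min_eig.

    zeta is the smallest nonnegative shift that achieves the threshold.
    """
    hess = 0.5 * (hess + hess.T)                 # symmetrize against round-off
    lam_min = np.linalg.eigvalsh(hess)[0]        # eigenvalues come in ascending order
    zeta = max(0.0, min_eig - lam_min)
    return hess + zeta * np.eye(hess.shape[0]), zeta

indefinite = np.array([[1.0, 0.0], [0.0, -2.0]])
H, zeta = damped_hessian(indefinite)
print(zeta, np.linalg.eigvalsh(H))
```

When the input is already sufficiently positive definite, the shift is zero, which is consistent with the goal $\zeta_{k}\to 0$ near a solution at which local convexity holds.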

An L-BFGS approximation of $\nabla^{2}f(x^{k})$ could also be used for $H_{k}$. When $\psi$ is not present in (1) and $f$ is strongly convex, it is shown in LiuN89a that this approach has global linear convergence because the eigenvalues of $H_{k}$ are restricted to a bounded positive interval. This proof can be extended to our algorithms, when a convex $\psi$ is present in (1). When $f$ is not strongly convex, one can apply safeguards to the L-BFGS update procedure (as described in LiF01a) to ensure that the eigenvalues of $H_{k}$ are uniformly bounded above and bounded away from zero.

Another interesting choice of $H_{k}$ is a block-diagonal approximation of the Hessian, which (when $\psi$ can be partitioned accordingly) allows the subproblem (2) to be solved in parallel while still retaining some curvature information. Strategies like this one are often used in distributed optimization for machine learning problems (see, for example, Yan13a ; LeeC17a ; ZheXXZ17a ).
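The decoupling produced by a block-diagonal $H_{k}$ can be illustrated with a small sketch. To keep it short, the example drops $\psi$ (an assumption made only for this illustration), so that each block of the subproblem reduces to a linear system; the function names and the two-block partition are hypothetical.

```python
import numpy as np

def block_diagonal_part(H, blocks):
    """Keep only the diagonal blocks of H indexed by `blocks`; zero the rest."""
    B = np.zeros_like(H)
    for idx in blocks:
        ix = np.ix_(idx, idx)
        B[ix] = H[ix]
    return B

def solve_blockwise(H, g, blocks):
    """Minimize g^T d + 0.5 * d^T B d, with B the block-diagonal part of H.

    Each block is solved independently, so the loop could run in parallel.
    """
    d = np.zeros_like(g)
    for idx in blocks:
        d[idx] = np.linalg.solve(H[np.ix_(idx, idx)], -g[idx])
    return d

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
H = M @ M.T + 6.0 * np.eye(6)            # symmetric positive definite test matrix
g = rng.standard_normal(6)
blocks = [[0, 1, 2], [3, 4, 5]]

B = block_diagonal_part(H, blocks)
d_joint = np.linalg.solve(B, -g)          # solve with the full block-diagonal matrix
d_split = solve_blockwise(H, g, blocks)   # solve each block separately
print(np.max(np.abs(d_joint - d_split)))
```

The two solutions agree up to rounding, which is the property that lets each processor work on its own block.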

Numerical Results

We define $H_{k}$ to be the limited-memory BFGS approximation (LiuN89a) based on the past 10 steps, with a safeguard mechanism proposed in LiF01a to ensure uniform boundedness of $H_{k}$. The subproblems (2) are solved with SpaRSA (WriNF08a), a proximal-gradient method which, for bounded $H_{k}$, converges globally at a linear rate. We consider the publicly available data sets listed in Table 1 (downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and present empirical convergence results by showing the relative objective error, defined as

where $F^{*}$ is the optimal objective value, obtained approximately by running our algorithm for a sufficiently long time. For all variants of our framework, we used parameters $\beta=0.5$ and $\gamma=10^{-4}$. Further details of our implementation are described in LeeLW18a.

We use the two smaller data sets a9a and rcv1 to quantify the relationship between accuracy of the subproblem solution and the number of outer iterations. We compare running SpaRSA with a fixed number of iterations $T\in\{5,10,15,20,25,30\}$ . Figure 1 shows that, in all cases, the number of outer iterations decreases monotonically as the (fixed) number of inner iterations is increased. For $T\geq 15$ , the degradation in number of outer iterations resulting from less accurate solution of the subproblems is modest, as our theory suggests. We also observe the initial fast linear rates in the early stages of the method that are predicted by our theory, settling down to a slower linear rate on later iterations, but with sudden drops of the objective, possibly as a consequence of intermittent satisfaction of the condition in Part 1 of Theorem 3.1.
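The role of the fixed inner-iteration count $T$ can be seen in a toy version of the outer/inner loop. The sketch below is not the experimental code: it uses $\ell_{1}$-regularized least squares with $H_{k}=A^{T}A$ instead of L-BFGS, plain proximal-gradient steps instead of SpaRSA, and a backtracking test $F(x+\alpha d)\leq F(x)+\alpha\gamma Q(d)$ that is assumed here to play the role of condition (5). All names are hypothetical.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inexact_sqa(A, b, lam, T, outer=30, beta=0.5, gamma=1e-4):
    """Toy inexact SQA for 0.5*||Ax-b||^2 + lam*||x||_1 with H = A^T A."""
    H = A.T @ A
    step = 1.0 / np.linalg.norm(H, 2)      # 1 / largest eigenvalue of H
    l1 = lambda z: lam * np.sum(np.abs(z))
    F = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + l1(z)
    x = np.zeros(A.shape[1])
    values = [F(x)]
    for _ in range(outer):
        g = A.T @ (A @ x - b)
        d = np.zeros_like(x)
        for _ in range(T):                 # T proximal-gradient steps on Q, from d = 0
            d = soft(x + d - step * (g + H @ d), step * lam) - x
        Q = g @ d + 0.5 * d @ (H @ d) + l1(x + d) - l1(x)
        alpha = 1.0                        # backtracking line search on the step size
        while F(x + alpha * d) > values[-1] + alpha * gamma * Q and alpha > 1e-12:
            alpha *= beta
        x = x + alpha * d
        values.append(F(x))
    return x, values

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
_, v1 = inexact_sqa(A, b, lam=0.5, T=1, outer=20)
_, v10 = inexact_sqa(A, b, lam=0.5, T=10, outer=20)
print(v1[-1], v10[-1])
```

Because $f$ is quadratic and $H_{k}$ is its exact Hessian here, the model $Q_{k}$ is exact and the unit step is accepted except for rounding effects; this keeps the sketch short but is not representative of the L-BFGS setting used in the experiments.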

Next, we examine empirically the step size distribution for Algorithm 1 and how often in Algorithm 2 the matrix $H_{k}$ needs to be modified. On both a9a and rcv1, the initial step estimate $\alpha=1$ is accepted on over 99.5% of iterations in Algorithm 1, while in both variants of Algorithm 2, the initial choice of $H_{k}$ is used without modification on over 99% of iterations. These statistics hold regardless of the value of $T$ (the number of inner iterations), though in the case of Algorithm 2, we see a faint trend toward more adjustments for larger values of $T$ . When adjustments are needed, they never number more than $4$ at any one iteration, except for a single case (a9a for Variant 1 of Algorithm 2 with $T=5$ ) for which up to $8$ adjustments are needed.

We next compare our inexact method with an exact version, in which the subproblems (2) are solved to near-optimality at each iteration. Since the three algorithms behave similarly, we use Algorithm 1 as the representative for this investigation. We use a local cluster with 16 nodes for the two larger data sets rcv1 and epsilon, while for the small data set a9a, only one node is used. Iteration counts and running time comparisons are shown in Figure 2. The exact version requires fewer iterations, as expected, but the inexact version requires only modestly more iterations. In terms of runtime, the inexact versions with a moderate number of inner iterations (at least $30$) have the advantage, owing to the savings obtained by solving the subproblems inexactly.

We note that the approach of gradually increasing the number of inner iterations, suggested in SchT16a ; GhaS16a, produces good results for this application: the number of iterations required is comparable to that of the exact solver, while the running time is slightly faster than that of $T=30$ for epsilon and competitive with it for the other two data sets.

With the notation $A\coloneqq(b_{1}a_{1},b_{2}a_{2},\dotsc,b_{l}a_{l})$ , the dual of this problem is

which is $(1/2C)$ -strongly convex. We consider the distributed setting such that the columns of $A$ are stored across multiple processors. In this setup, only the block-diagonal parts (up to a permutation) of $A^{T}A$ can be easily formed locally on each processor. We take $H_{k}$ to be the matrix formed by these diagonal blocks, so that the subproblem (2) can be decomposed into independent parts. We use cyclic coordinate descent with random permutation (RPCD) as the solver for each subproblem. (Note that this algorithm partitions trivially across processors, because of the block-diagonal structure of $H_{k}$ .)

Our experiment compares the strategy of performing a fixed number of RPCD iterations for each subproblem with one of increasing the number of inner iterations as the algorithm proceeds, as in SchT16a ; GhaS16a. We use the data sets in Table 1, and compare the two strategies on Algorithm 1, but use an exact line search to choose $\alpha_{k}$ rather than the backtracking approach. (An exact line search is made easy by the quadratic objective.) For the first strategy, we use ten iterations of RPCD on each subproblem, while for the second strategy, we perform $1+\lfloor k/10\rfloor$ iterations of RPCD at the $k$th outer iteration, as suggested by SchT16a ; GhaS16a. The implementation is a modification of the experimental code of LeeR15a. We run the algorithms on a local cluster with 16 machines, so that $H_{k}$ contains 16 diagonal blocks. Results are shown in Figure 3. Since the choice of $H_{k}$ in this case does not capture global curvature information adequately, the strategy of increasing the accuracy of subproblem solution on later iterations does not reduce the number of iterations as significantly as in the previous experiment. The runtime results show a significant advantage for the first strategy of a fixed number of inner iterations, particularly on the a9a and rcv1 data sets. Judging from the trend in the approach of increasing inner iterations, we can expect the exact version to show an even greater running-time disadvantage in this case. We also observe the faster linear rate on early iterations, matching our theory.
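For reference, the two inner-iteration schedules compared in this experiment can be written down in a few lines. The function name `inner_iterations` and the strategy labels are hypothetical; only the counts (ten iterations, and $1+\lfloor k/10\rfloor$ iterations) come from the text.

```python
def inner_iterations(k, strategy="fixed", fixed=10):
    """Number of RPCD passes to run on the subproblem at outer iteration k."""
    if strategy == "fixed":
        return fixed               # first strategy: constant count
    if strategy == "increasing":
        return 1 + k // 10         # second strategy: 1 + floor(k / 10)
    raise ValueError(f"unknown strategy: {strategy}")

# Total inner work over the first 100 outer iterations under each schedule.
total_fixed = sum(inner_iterations(k, "fixed") for k in range(100))
total_increasing = sum(inner_iterations(k, "increasing") for k in range(100))
print(total_fixed, total_increasing)
```

Under these formulas the increasing schedule does less inner work than the fixed one per outer iteration until $k$ reaches 90, matches it for $90\leq k<100$, and exceeds it from $k=100$ onward.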

Conclusions

We have analyzed global convergence rates of three practical inexact successive quadratic approximation algorithms under different assumptions on the objective function, including the nonconvex case. Our analysis shows that inexact solution of the subproblems affects the rates of convergence in fairly benign ways, with a modest factor appearing in the bounds. When linearly convergent methods are used to solve the subproblems, the inexactness condition holds when a fixed number of inner iterations is applied at each outer iteration $k$ .
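The last claim can be checked in one line. The sketch below assumes that (3) has the form $Q(d)-Q^{*}\leq\eta\,(Q(0)-Q^{*})$, which is consistent with the remark on $Q_{LB}$ in the introduction, that the inner solver is started from $d^{(0)}=0$, and that it reduces the subproblem error by a factor $1-c$ at every inner iteration, with $c\in(0,1]$ independent of $k$. The symbols $c$, $T$, and $d^{(t)}$ are introduced here for illustration.

```latex
Q(d^{(t+1)})-Q^{*}\leq(1-c)\bigl(Q(d^{(t)})-Q^{*}\bigr),\quad t=0,1,\dotsc,T-1
\quad\Longrightarrow\quad
Q(d^{(T)})-Q^{*}\leq(1-c)^{T}\bigl(Q(0)-Q^{*}\bigr).
```

Under these assumptions, (3) holds at every outer iteration with $\eta=(1-c)^{T}\in[0,1)$, a value that depends only on the fixed inner-iteration count $T$.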

References

Appendix A Proof of Lemma 5

where in (56a) we used the convexity of $f$ , in (56b) we set $d=\lambda(P_{\Omega}(x)-x)$ , and in (56c) we used the optimal set strong convexity (10) of $F$ . Thus we obtain (25). ∎

Appendix B Proof of Lemma 6

then by setting the derivative to zero in (57), we have

When $\delta_{k}\geq A_{k}$ , we have from (58) that $\lambda_{k}=1$ . Therefore, from (30) we get

On the other hand, since $A\geq A_{k}>0,c_{k}\geq 0$ for all $k$ , (30) can be further upper-bounded by

For $\delta_{k}\geq A\geq A_{k}$ , (31) still applies. If $A>\delta_{k}$ , we have from (59) that $\lambda_{k}=\delta_{k}/A$ , hence

This together with (31) imply that $\{\delta_{k}\}$ is a monotonically decreasing sequence. Dividing both sides of (60) by $\delta_{k+1}\delta_{k}$ , and from the fact that $\delta_{k}$ is decreasing and nonnegative, we conclude

Summing this inequality from $k_{0}$ , and using $\delta_{k_{0}}<A$ , we obtain