Inexact Successive Quadratic Approximation for Regularized Optimization

Ching-pei Lee, Stephen J. Wright

Introduction

We consider the following regularized optimization problem:

We consider algorithms that generate a sequence {xk}k=0,1,\{x^{k}\}_{k=0,1,\dotsc} from some starting point x0x^{0}, and solve the following subproblem inexactly at each iteration, for some symmetric matrix HkH_{k}:

We abbreviate the objective in (2) as Qk()Q_{k}(\cdot) (or as Q()Q(\cdot) when we focus on the inner workings of iteration kk). In some results, we allow HkH_{k} to have zero or negative eigenvalues, provided that QkQ_{k} itself is strongly convex. (Strong convexity in ψ\psi may overcome any lack of strong convexity in the quadratic part of (2).)

In the special case of the proximal-gradient algorithm (ComW05a, ; WriNF08a, ), where HkH_{k} is a positive multiple of the identity, the subproblem (2) can often be solved cheaply, particularly when ψ\psi is (block) separable, by means of a prox-operator involving ψ\psi. For more general choices of HkH_{k}, or for more complicated regularization functions ψ\psi, it may make sense to solve (2) by an iterative process, such as accelerated proximal gradient or coordinate descent. Since it may be too expensive to run this iterative process to obtain a high-accuracy solution of (2), we consider the possibility of an inexact solution. In this paper, we assume that the inexact solution satisfies the following condition, for some constant η[0,1)\eta\in[0,1):

where QinfdQ(d)Q^{*}\coloneqq\inf_{d}Q(d) and Q(0)=0Q(0)=0. The value η=0\eta=0 corresponds to exact solution of (2). Other values η(0,1)\eta\in(0,1) indicate solutions that are inexact to within a multiplicative constant.

The condition (3) is studied in (BonLPP16a, , Section 4.1), which applies a primal-dual approach to (2) to satisfy it. In this connection, note that if we have access to a lower bound QLBQQ_{LB}\leq Q^{*} (obtained by finding a feasible point for the dual of (2), or other means), then any dd satisfying Q(d)(1η)QLBQ(d)\leq(1-\eta)Q_{LB} also satisfies (3).

In practical situations, we need not enforce (3) explicitly for some chosen value of η\eta. In fact, we do not necessarily require η\eta to be known, or (3) to be checked at all. Rather, we can take advantage of the convergence rates of whatever solver is applied to (2) to ensure that (3) holds for some value of η(0,1)\eta\in(0,1), possibly unknown. For instance, if we apply an iterative solver to the strongly convex function QQ in (2) that converges at a global linear rate (1τ)(1-\tau), then the “inner” iteration sequence {d(t)}t=0,1,\{d^{(t)}\}_{t=0,1,\dotsc} (starting from some d(0)d^{(0)} with Q(d(0))0Q(d^{(0)})\leq 0) satisfies

If we fix the number of inner iterations at TT (say), then d(T)d^{(T)} satisfies (3) with η=(1τ)T\eta=(1-\tau)^{T}. Although τ\tau might be unknown as well, we can implicitly tune the accuracy of the solution by adjusting TT. On the other hand, if we wish to attain a certain target accuracy η\eta and have an estimate of rate τ\tau, we can choose the number of iterations TT large enough that (1τ)Tη(1-\tau)^{T}\leq\eta. Note that τ\tau depends on the extreme eigenvalues of HkH_{k} in some algorithms; we can therefore choose HkH_{k} to ensure that τ\tau is restricted to a certain range for all kk.

Empirically, we observe that Q-linear methods for solving (2) often have rapid convergence in their early stages, with slower convergence later. Thus, a moderate value of η\eta may be preferable to a smaller value, because moderate accuracy is attainable in disproportionately fewer iterations than high accuracy.

A practical stopping condition for the subproblem solver in our framework is just to set a fixed number of iterations, provided that a linearly convergent method is used to solve (2). This guideline can be combined with other more sophisticated approaches, possibly adjusting the number of inner iterations (and hence implicitly η\eta) according to some heuristics. For simplicity, our analysis assumes a fixed choice of η(0,1)\eta\in(0,1). We examine in particular the number of outer iterations required to solve (1) to a given accuracy ϵ\epsilon. We show that the dependence of the iteration complexity on the inexactness measure η\eta is benign, increasing only modestly with η\eta over approaches that require exact solution of (2) for each kk.

To build complete algorithms around the subproblem (2), we either do a step size line search along the inexact solution dkd^{k}, or adjust HkH_{k} and recompute dkd^{k}, seeking in both cases to satisfy a familiar “sufficient decrease” criterion. We present two algorithms that reflect each of these approaches. The first uses a line search approach on the step size with a modified Armijo rule, as presented in TseY09a . We consider a backtracking line-search procedure for simplicity; the analysis could be adapted for more sophisticated procedures. Given the current point xkx^{k}, the update direction dkd^{k} and parameters β,γ(0,1)\beta,\gamma\in(0,1), backtracking finds the smallest nonnegative integer ii such that the step size αk=βi\alpha_{k}=\beta^{i} satisfies

This version appears as Algorithm 1. The exact version of this algorithm can be considered as a special case of the block-coordinate descent algorithm of TseY09a .The definition of Δ\Delta in TseY09a contains another term ωdTHd/2\omega d^{T}Hd/2, where ω[0,1)\omega\in[0,1) is a parameter. We take ω=0\omega=0 for simplicity, but our analysis can be extended in a straightforward way to the case of ω(0,1)\omega\in(0,1). In BonLPP16a , Algorithm 1 (with possibly a different criterion on dkd^{k}) is called the “variable metric inexact line-search-based method”. (We avoid the term “metric” because we consider the possibility of indefinite HkH_{k} in some of our results.) More complicated metrics, not representable by a matrix norm, were also considered in BonLPP16a . Since our analysis makes use only of the smallest and largest eigenvalues of HkH_{k} (which correspond to the strong convexity and Lipschitz continuity parameters of the quadratic approximation term), we could also generalize our approach to this setting. We present only the matrix-representable case, however, as it allows a more direct comparison with the second algorithm presented next.

The second algorithm uses the following sufficient decrease criterion from SchT16a ; GhaS16a :

for a given parameter γ(0,1]\gamma\in(0,1]. If this criterion is not satisfied, the algorithm modifies HH and recomputes dkd^{k}. The criterion (7) is identical to that used by trust-region methods (see, for example, (NocW06a, , Chapter 4)), in that the ratio between the actual objective decrease and the decrease predicted by QQ is bounded below by γ\gamma; that is,

We consider two variants of modifying HH such that (7) is satisfied. The first successively increases HH by a factor β1\beta^{-1} (for some parameter β(0,1)\beta\in(0,1)) until (7) holds. We require in this variant that the initial choice of HH is positive definite, so that all eigenvalues grow by a factor of β1\beta^{-1} at each multiplication. The second variant uses a similar strategy, except that HH is modified by adding a successively larger multiple of the identity, until (7) holds. (This algorithm allows negative eigenvalues in the initial estimate of HH.) These two approaches are defined as the first and the second variants of Algorithm 2, respectively.

Algorithm 1 and Variant 1 of Algorithm 2 are direct extensions of backtracking line search in the smooth case, in the sense that when ψ\psi is not present, both approaches are identical to shrinking the step size. However, aside from the sufficient decrease criteria, the two differ when the regularization term is present.

The second variant of Algorithm 2 is similar to the method proposed in SchT16a ; GhaS16a , with the only difference being the inexactness criterion of the subproblem solution. This variant of modifying HH can be seen as interpolating between the step from the original HH and the proximal gradient step. It is also a generalization of the trust-region technique for smooth optimization. When ψ\psi is not present, adding a multiple of the identity to HH in (2) is equivalent to shrinking the trust region (MorS83a, ). We can therefore think of Algorithm 2, Variant 2 as a generalized trust-region approach for regularized problems.

Rather than our multiplicative criterion (3), the works SchT16a ; GhaS16a use an additive criterion to measure inexactness of the solution. In the analysis of SchT16a ; GhaS16a , this tolerance must then be reduced to zero at a certain rate as the algorithm progresses, resulting in growth of the number of inner iterations per outer iteration as the algorithms progress. By contrast, we attain satisfactory performance (both in theory and practice) for a fixed value η(0,1)\eta\in(0,1) in (3).

Which of the algorithms described above is “best” depends on the circumstances. When (2) is expensive to solve, Algorithm 1 may be preferred, as it requires approximate solution of this subproblem just once on each outer iteration. On the other hand, when ψ\psi has special properties, such as inducing sparsity or low rank in xx, Algorithm 2 might benefit from working with sparse iterates and solving the subproblem in spaces of reduced dimension.

Variants and special cases of the algorithms above have been discussed extensively in the literature. Proximal gradient algorithms have H=ξIH=\xi I for some ξ>0\xi>0 (ComW05a, ; WriNF08a, ); proximal-Newton uses H=2fH=\nabla^{2}f (LeeSS14a, ; RodK16a, ; LiAV17a, ); proximal-quasi-Newton and variable metric use quasi-Newton approximations for HkH_{k} (SchT16a, ; GhaS16a, ). The term “successive quadratic approximation” is also used by ByrNO16a . Our methods can even be viewed as a special case of block-coordinate descent (TseY09a, ) with a single block. The key difference in this work is the use of the inexactness criterion (3), while existing works either assume exact solution of (2), or use a different criterion that requires increasing accuracy as the number of outer iterations grows. Some of these works provide only an asymptotic convergence guarantee and a local convergence rate, with a lack of clarity about when the fast local convergence rate will take effect. An exception is BonLPP16a , which also makes use of the condition (3). However, BonLPP16a gives convergence rate only for convex ff and requires existence of a scalar μ1\mu\geq 1 and a sequence {ζk}\{\zeta_{k}\} such that

where ABA\succeq B means that ABA-B is positive semidefinite. This condition may preclude such useful and practical choices of HkH_{k} as the Hessian and quasi-Newton approximations. We believe that our setting may be more general, practical, and straightforward in some situations.

2 Contribution

This paper shows that, when the initial value of HkH_{k} at all outer iterations kk is chosen appropriately, and that (3) is satisfied for all iterations, then the objectives of the two algorithms converge at a global Q-linear rate under an “optimal set strong convexity” condition defined in (10), and at a sublinear rate for general convex functions. When FF is nonconvex, we show sublinear convergence of the first-order optimality condition. Moreover, to discuss the relation between the subproblem solution precision and the convergence rate, we show that the iteration complexity is proportional to either 1/(1η)1/(1-\eta) or 1/(2(1η))1/(2(1-\sqrt{\eta})), depending on the properties of ff and ψ\psi, and the algorithm parameter choices.Note that for η[0,1)\eta\in[0,1), 1/(1η)>1/(2(1η))1/(1-\eta)>1/(2(1-\sqrt{\eta})).

In comparison to existing works, our major contributions are as follows.

We quantify how the inexactness criterion (3) affects the step size of Algorithm 1, the norm of the final HH in Algorithm 2, and the iteration complexity of these algorithms. We discuss why the process for finding a suitable value of αk\alpha_{k} in each algorithm can potentially improve the convergence speed when the quadratic approximations incorporate curvature information, leading to acceptance of step sizes whose values are close to one.

We provide a global convergence rate result on the first-order optimality condition for the case of nonconvex ff in (1) for general choices of HkH_{k}, without assumptions beyond the Lipschitzness of f\nabla f.

The global R-linear convergence case of a similar algorithm in GhaS16a when FF is strongly convex is improved to a global Q-linear convergence result for a broader class of problems.

For general convex problems, in addition to the known sublinear (1/k1/k) convergence rate, we show linear convergence with a rate independent of the conditioning of the problem in the early stages of the algorithm.

Faster linear convergence in the early iterations also applies to problems with global Q-linear convergence, explaining in part the empirical observation that many methods converge rapidly in their early stages before settling down to a slower rate. This observation also allows improvement of iteration complexities.

3 Related Work

Our general framework and approach, and special cases thereof, have been widely studied in the literature. Some related work has already been discussed above. We give a broader discussion in this section.

When ψ\psi is the indicator function of a convex constraint set, our approach includes an inexact variant of a constrained Newton or quasi-Newton method. There are a number of papers on this approach, but their convergence results generally have a different flavor from ours. They typically show only asymptotic convergence rates, together with global convergence results without rates, under weaker smoothness and convexity assumptions on ff than we make here. For example, when ψ\psi is the indicator function of a “box” defined by bound constraints, ConGT88a applies a trust-region framework to solve (2) approximately, and shows asymptotic convergence. The paper ByrLNZ95a uses a line-search approach, with HkH_{k} defined by an L-BFGS update, and omits convergence results. For constraint sets defined by linear inequalities, or general convex constraints, BurMT90a shows global convergence of a trust region method using the Cauchy point. A similar approach using the exact Hessian as HkH_{k} is considered in LinM99a , proving local superlinear or quadratic convergence in the case of linear constraints.

Turning to our formulation (1) in its full generality, Algorithm 1 is analyzed in BonLPP16a , which refers to the condition (3) as “η\eta-approximation.” (Their η\eta is equivalent to 1η1-\eta in our notation.) This paper shows asymptotic convergence of Qk(d)Q_{k}(d) to zero without requiring convexity of FF, Lipschitz continuity of f\nabla f, or a fixed value of η\eta. The only assumptions are that Qk(dk)<0Q_{k}(d^{k})<0 for all kk and the sequence of objective function values converges (which always happens when FF is bounded below). Under the additional assumptions that f\nabla f is Lipschitz continuous, FF is convex, (8), and (3), they showed convergence of the objective value at a 1/k1/k rate. The same authors considered convergence for nonconvex functions satisfying a Kurdyka-Łojasiewicz condition in BonLPPR17a , but the exact rates are not given. Our results differ in not requiring the assumption (8), and we are more explicit about the dependence of the rates on η\eta. Moreover, we show detailed convergence rates for several additional classes of problems.

A version of Algorithm 2 without line search but requiring HkH_{k} to overestimate the Hessian, as follows:

is considered in ChoPR14a . Asymptotic convergence is proved, but no rates are given.

Convergence of an inexact proximal-gradient method (for which Hk=LIH_{k}=LI for all kk) is discussed in SchRB11a . With this choice of HkH_{k}, (7) always holds with γ=1\gamma=1. They also discuss its accelerated version for convex and strongly convex problems. Instead of our multiplicative inexactness criterion, they assume an additive inexactness criterion in the subproblem, of the form

Their analysis also allows for an error eke^{k} in the gradient term in (2). The paper shows that for general convex problems, the objective value converges at a 1/k1/k rate provided that kϵk\sum_{k}\sqrt{\epsilon_{k}} and kek\sum_{k}\|e^{k}\| converge. For strongly convex problems, they proved R-linear convergence of xkx\|x^{k}-x^{*}\|, provided that the sequence {ek}\{\|e^{k}\|\} and {ϵk}\{\sqrt{\epsilon_{k}}\} both decrease linearly to zero. When our approaches are specialized to proximal gradient (Hk=LIH_{k}=LI), our analysis shows a Q-linear rate (rather than R-linear) for the strongly convex case, and applies to the convergence of the objective value rather than the iterates. Additionally, our results shows convergence for nonconvex problems.

Variant 2 of Algorithm 2 is proposed in SchT16a ; GhaS16a for convex and strongly convex objectives, with inexactness defined additively as in (9). For convex ff, SchT16a showed that if k=0ϵk/Hk\sum_{k=0}^{\infty}\epsilon_{k}/\|H_{k}\| and k=0ϵk/Hk\sum_{k=0}^{\infty}\sqrt{\epsilon_{k}/\|H_{k}\|} converge then a 1/k1/k convergence rate is achievable. The same rate can be achieved if ϵk(a/k)2\epsilon_{k}\leq(a/k)^{2} for any aa\in. When FF is μ\mu-strongly convex, GhaS16a showed that if ϵk/ρk\sum\epsilon_{k}/\rho^{k} is finite (where ρ=1(γμ)/(μ+M)\rho=1-(\gamma\mu)/(\mu+M), MM is the upper bound for Hk\|H_{k}\|, and γ\gamma is as defined in (7)), then a global R-linear convergence rate is attained. In both cases, the conditions require a certain rate of decrease for ϵk\epsilon_{k}, a condition that can be achieved by performing more and more inner iterations as kk increases. By contrast, our multiplicative inexactness criterion (3) can be attained with a fixed number of inner iterations. Moreover, we attain a Q-linear rather than an R-linear result.

For the case in which ff is convex, thrice continuously differentiable, and self-concordant, and ψ\psi is the indicator function of a closed convex set, TraKC14a analyzed global and local convergence rates of inexact damped proximal Newton with a fixed step size. The paper LiAV17a extends this convergence analysis to general convex ψ\psi. However, generalization of these results beyond the case of Hk=2f(xk)H_{k}=\nabla^{2}f(x^{k}) and self-concordant ff is not obvious.

Accelerated inexact proximal gradient is discussed in SchRB11a ; VilSBV13a for convex ff to obtain an improved O(1/k2)O(1/k^{2}) convergence rate. The work JiaST12a considers acceleration with more general choices of HH under the requirement HkHk+1H_{k}\succeq H_{k+1} for all kk, which precludes many interesting choices of HkH_{k}. This requirement is relaxed by GhaS16a to θkHkθk+1Hk+1\theta_{k}H_{k}\succeq\theta_{k+1}H_{k+1} for scalars θk\theta_{k} that are used to decide the extrapolation step size. However, as shown in the experiment in GhaS16a , extrapolation may not accelerate the algorithm. Our analysis does not include acceleration using extrapolation steps, but by combining with the Catalyst framework (LinMH15a, ), similar improved rates could be attained.

4 Outline: Remainder of the Paper

In Section 2, we introduce notation and prove some preliminary results. Convergence analysis appears in Section 3 for Algorithms 1 and 2, covering both convex and nonconvex problems. Some interesting and practical choices of HkH_{k} are discussed in Section 4 to show that our framework includes many existing algorithms. We provide some preliminary numerical results in Section 5, and make some final comments in Section 6.

Notations and Preliminaries

The norm \|\cdot\|, when applied on vectors, denotes the Euclidean norm. When applied to a symmetric matrix AA, it denotes the corresponding induced norm, which is equivalent to the spectral radius of AA. For any symmetric matrix AA, λ\mboxmin(A)\lambda_{\mbox{\rm\scriptsize{min}}}(A) denotes its smallest eigenvalue. For any two symmetric matrices AA and BB, ABA\succeq B (respectively ABA\succ B) denotes that ABA-B is positive semidefinite (respectively positive definite). For our nonsmooth function FF, F\partial F denotes the set of generalized gradient defined as

where Ψ\partial\Psi denotes the subdifferential (as Ψ\Psi is convex). When the minimum FF^{*} of F(x)F(x) is attainable, we denote the solution set by Ω{xF(x)=F}\Omega\coloneqq\left\{x\mid F\left(x\right)=F^{*}\right\}, and define PΩ(x)P_{\Omega}(x) as the (Euclidean-norm) projection of xx onto Ω\Omega.

In some results, we use a particular strong convexity assumption to obtain a faster rate. We say that FF satisfies the optimal set strong convexity condition with modulus μ0\mu\geq 0 if for any xx and any λ\lambda\in, we have

This condition does not require the strong convexity to hold globally, but only between the current point and its projection onto the solution set. Examples of functions that are not strongly convex but satisfy (10) include:

F(x)=h(Ax)F(x)=h(Ax) where hh is strongly convex, and AA is any matrix;

F(x)=h(Ax)+1X(x)F(x)=h(Ax)+\mathbf{1}_{X}(x), where XX is a polyhedron;

Squared-hinge loss: F(x)=max(0,aiTxbi)2F(x)=\sum\max(0,a_{i}^{T}x-b_{i})^{2}.

A similar condition is the “quasi-strong convexity” condition proposed by NecNG18a , which always implies (10), and can be implied by optimal set strong convexity if FF is differentiable. However, since we allow ψ\psi (and therefore FF) to be nonsmooth, we need a different definition here.

Turning to the subproblem (2) and the definition of Δk\Delta_{k} in (6), we find a condition for dd to be a descent direction.

If Ψ\Psi is convex and ff is differentiable, then dd is a descent direction for FF at xx if Δ<0\Delta<0.

We know that dd is a descent direction for FF at xx if the directional derivative

is negative. Note that since ff is differentiable and Ψ\Psi is convex,

is well-defined. Now from the convexity of Ψ\Psi,

Therefore, when Δ<0\Delta<0, the directional derivative is negative and dd is a descent direction. ∎

The following lemma motivates our algorithms.

If QQ and Ψ\Psi are convex and ff is differentiable, then Q(d)<0Q(d)<0 implies that dd is a descent direction for FF at xx.

Note that Q(0)=0Q(0)=0. Therefore, if QQ is convex, we have

for all λ(0,1]\lambda\in(0,1]. It follows that f(x)T(λd)+ψ(x+λd)ψ(x)<0\nabla f(x)^{T}(\lambda d)+\psi(x+\lambda d)-\psi(x)<0 for all sufficiently small λ\lambda. Therefore, from Lemma 1, λd\lambda d is a descent direction, and since dd and λd\lambda d only differ in their lengths, so is dd. ∎

Positive semidefiniteness of HH suffices to ensure convexity of QQ. However, Lemma 2 may be used even when HH has negative eigenvalues, as ψ\psi may have a strong convexity property that ensures convexity of QQ. Lemma 2 then suggests that no matter how coarse the approximate solution of (2) is, as long as it is better than d=0d=0 for a convex QQ, it results in a descent direction. This fact implies finite termination of the backtracking line search procedure in Algorithm 1.

Convergence Analysis

We start our analysis for both algorithms by showing finite termination of the line search procedures. We then discuss separately three classes of problems involving different assumptions on FF, namely, that FF is convex, that FF satisfies optimal set strong convexity (10), and that FF is nonconvex. Different iteration complexities are proved in each case. The following condition is assumed throughout our analysis in this section.

In (1), ff is LL-Lipschitz-continuously differentiable for some L>0L>0; ψ\psi is convex, extended-valued, proper, and closed; FF is lower-bounded; and the solution set Ω\Omega of (1) is nonempty.

We show that the line search procedures have finite termination. The following lemma for the backtracking line search in Algorithm 1 does not require HH to be positive definite, though it does require strong convexity of QQ (2).

If Assumption 1 holds, QQ is σ\sigma-strongly convex for some σ>0\sigma>0, and the approximate solution dd to (2) satisfies (3) for some η<1\eta<1, then for Δ\Delta defined in (6), we have

then the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size α\alpha that satisfies the following lower bound:

From (3) and strong convexity of QQ, we have that for any λ\lambda\in,

Since Q(0)=0Q(0)=0, we obtain by substituting from the definition of QQ that

Since 1/(1η)1λ1/(1-\eta)\geq 1\geq\lambda, we have

It follows immediately that the following bound holds for any λ\lambda\in:

We make the following specific choice of λ\lambda:

The result (11) follows by substituting these identities into (14).

If the right-hand side of (11) is negative, then we have from the Lipschitz continuity of f\nabla f, the convexity of ψ\psi, and the mean value theorem that the following relationships are true for all α\alpha\in:

This leads to (12), when we introduce a factor β\beta to account for possible undershoot of the backtracking procedure. ∎

Note that Lemma 3 allows indefinite HH, and suggests that we can still obtain a certain amount of objective decrease as long as λ\mboxmin(H)\lambda_{\mbox{\rm\scriptsize{min}}}(H) is not too negative in comparison to the strong convexity parameter of QQ. When the strong convexity of QQ is accounted for completely by the quadratic part (that is, λ\mboxmin(H)=σ>0\lambda_{\mbox{\rm\scriptsize{min}}}(H)=\sigma>0) we have the following simplification of Lemma 3.

If Assumption 1 holds, λ\mboxmin(H)=σ>0\lambda_{\mbox{\rm\scriptsize{min}}}(H)=\sigma>0, and the approximate solution dd to (2) satisfies (3) for some η<1\eta<1, we have

Moreover, the backtracking line search procedure in Algorithm 1 terminates in finite steps and produces a step size that satisfies the following lower bound:

Following (13), we have from convexity of ψ\psi for any λ\lambda\in that

Using (15) in (18), we obtain (16). The bound (17) follows by substituting σ=λ\mboxmin(H)\sigma=\lambda_{\mbox{\rm\scriptsize{min}}}(H) into (12). ∎

Note that the first inequality in (11) and the second inequality in (16) make use of the pessimistic lower bound dTHdλ\mboxmin(H)d2d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}, in practice, we observe (see Section 5) that the unit step αk=1\alpha_{k}=1 is often accepted in practice (significantly larger than the lower bounds (12) and (17)) when HkH_{k} is the actual Hessian 2f(xk)\nabla^{2}f(x^{k}) or its quasi-Newton approximation.

If Assumption 1 holds, QQ is σ\sigma-strongly convex for some σ>0\sigma>0, and dd is an approximate solution to (2) satisfying (3) for some η[0,1)\eta\in[0,1), then (7) is satisfied if

Therefore, in Algorithm 2, if the initial Hk0H^{0}_{k} satisfies

for some M0>0M_{0}>0, m0M0m_{0}\leq M_{0}, then for Variant 2, the final HkH_{k} satisfies

For Variant 1, if we assume in addition that m0>0m_{0}>0, we have

From Lipschitz continuity of f\nabla f, we have that

where in (23) we used the definition (6), and in (24) we used Lemma 3. By noting dTHdλ\mboxmin(H)d2d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}, (24) shows that (19) implies (7).

Since ψ\psi is convex, we have that σλ\mboxmin(H)\sigma\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H), so that a sufficient condition for (19) is that

Let the coefficient of λ\mboxmin(H)\lambda_{\mbox{\rm\scriptsize{min}}}(H) in the above inequality be denoted by C1C_{1}, this observation suggests that for Variant 1 the smallest eigenvalue of the final HH is no larger than L/(C1β)L/(C_{1}\beta), and since the proportion between the largest and the smallest eigenvalues of HkH_{k} remains unchanged after scaling the whole matrix, we obtain (22).

For Variant 2, to satisfy C1HLIC_{1}H\succeq LI, the coefficient for II must be at least L/C1m0L/C_{1}-m_{0}. Considering the overshoot, and that the difference between the largest and the smallest eigenvalues is fixed after adding a multiple of identity, we obtain the condition (21). ∎

By noting the simplification from dTHdλ\mboxmin(H)d2d^{T}Hd\geq\lambda_{\mbox{\rm\scriptsize{min}}}(H)\|d\|^{2}, we rarely observe the worst-case bounds (22) or (21) in practice, unless H0H^{0} is a multiple of the identity.

2 Iteration Complexity

Now we turn to the iteration complexity of our algorithms, considering three different assumptions on FF: convexity, optimal set strong convexity, and the general (possibly nonconvex) case.

The following lemma is modified from some intermediate results in GhaS16a , which shows R-linear convergence of Variant 2 of Algorithm 2 for a strongly convex objective when the inexactness is measured by an additive criterion. A proof can be found in Appendix A.

Let FF^{*} be the optimum of FF. If Assumption 1 holds, ff is convex and FF is μ\mu-optimal-set-strongly convex as defined in (10) for some μ0\mu\geq 0, then for any given xx and HH, and for all λ\lambda\in, we have

where QQ^{*} is the optimal objective value of (2). In particular, by setting λ=μ/(μ+H)\lambda=\mu/(\mu+\|H\|) (as in GhaS16a ), we have

We start with case of FF convex, that is, μ=0\mu=0 in the definition (10). In this case, the first inequality in (25) reduces to

for all λ\lambda\in. We assume the following in this subsection.

There exists finite R0,M>0R_{0},M>0 such that

Using this assumption, we can bound the second term in (27) by

The bound A^MR02\hat{A}\leq MR_{0}^{2} is quite pessimistic, but we use it for purposes of comparing with existing works.

The following lemma is inspired by (Bac15a, , Lemma 4.4) but contains many nontrivial modifications, and will be needed in proving the convergence rate for general convex problems. Its proof can be found in Appendix B.

Assume we have three nonnegative sequences {δk}k0\{\delta_{k}\}_{k\geq 0}, {ck}k0\{c_{k}\}_{k\geq 0}, and {Ak}k0\{A_{k}\}_{k\geq 0}, and a constant A>0A>0 such that for all k=0,1,2,k=0,1,2,\dotsc, and for all λk\lambda_{k}\in, we have

In addition, if we define k0argmin{k:δk<A}k_{0}\coloneqq\arg\min\{k:\delta_{k}<A\}, then

By Lemma 6 together with Assumption 2, we can show that the algorithms converge at a global sublinear rate (with a linear rate in the early stages) for the case of convex FF, provided that the final value of HkH_{k} for each iteration kk of Algorithms 1 and 2 is positive semidefinite.

Assume that ff is convex, Assumptions 1 and 2 hold, Hk0H_{k}\succeq 0 for all kk, and there is some η[0,1)\eta\in[0,1) such that the approximate solution dkd^{k} of (2) satisfies (3) for all kk. Then the following claims for Algorithm 1 are true.

When F(xk)F(xkPΩ(xk))THk(xkPΩ(xk))F(x^{k})-F^{*}\geq(x^{k}-P_{\Omega}(x^{k}))^{T}H_{k}(x^{k}-P_{\Omega}(x^{k})), we have a linear improvement of the objective error at iteration kk, that is,

For any kk0k\geq k_{0}, where k0argmin{k:F(xk)F<MR02}k_{0}\coloneqq\arg\min\{k:F(x^{k})-F^{*}<MR_{0}^{2}\}, we have

suggesting sublinear convergence of the objective error. If there exists αˉ>0\bar{\alpha}>0 such that αkαˉ\alpha_{k}\geq\bar{\alpha} for all kk, we have

Denoting δkF(xk)F\delta_{k}\coloneqq F(x^{k})-F^{*}, we have for Algorithm 1 that the sufficient decrease condition (5) together with Hk0H_{k}\succeq 0 imply that

(note that AkAA_{k}\leq A follows from (29)) and using (3), (36), and (27), we obtain

The results now follow directly from Lemma 6.

For Algorithm 2, from (7) and (3), we get that for any k0k\geq 0,

and the remainder of the proof follows the above procedure starting from the right-hand side of (36) with αk1\alpha_{k}\equiv 1. ∎

The conditions of Parts 1 and 2 of Theorem 3.1 bear further consideration. When the regularization term ψ\psi is not present in FF, and MM is a global bound on the norm of the true Hessian 2f(x)\nabla^{2}f(x), the condition in Part 2 of Theorem 3.1 is satisfied for k0=0k_{0}=0, since f(x0)f12Mx0PΩ(x0)212MR02f(x^{0})-f^{*}\leq\frac{1}{2}M\|x^{0}-P_{\Omega}(x^{0})\|^{2}\leq\frac{1}{2}MR_{0}^{2}. Under these circumstances, the linear convergence result of Part 1 may appear not to be interesting. We note, however, that the contribution from ψ\psi may make a significant difference in the general case (in particular, it may result in F(x0)F>MR02F(x^{0})-F^{*}>MR_{0}^{2}) and, moreover, a choice of HkH_{k} with Hk\|H_{k}\| significantly less than MM may result in the condition of Part 1 being satisfied intermittently during the computation. In particular, Part 1 lends some support to the empirical observation of rapid convergence on the early stages of the algorithms, as we discuss further below. Note that (Nes13a, , Theorem 4) suggests that when the algorithm is exact proximal gradient, we get F(xk)FMR02F(x^{k})-F^{*}\leq MR_{0}^{2} for all k1k\geq 1, but this is not always the case when a different HH is picked or when (2) is solved only approximately.

By combining Theorem 3.1 with Lemma 3 and Corollary 1 (which yield lower bounds on αk\alpha_{k}), we obtain the following results for Algorithm 1.

Assume the conditions of Theorem 3.1 are all satisfied. Then we have the following.

If there exists σ>0\sigma>0 such that λ\mboxmin(Hk)σ\lambda_{\mbox{\rm\scriptsize{min}}}(H_{k})\geq\sigma for all kk, then (33) becomes

If QkQ_{k} is σ\sigma-strongly convex and Hk0H_{k}\succeq 0 for all kk, then (33) becomes

We make some remarks on the results above.

Therefore, Algorithm 1 with positive definite HkH_{k} has better dependency on η\eta than the case in which we set λ\mboxmin(Hk)=0\lambda_{\mbox{\rm\scriptsize{min}}}(H_{k})=0 and rely on ψ\psi to make QkQ_{k} strongly convex. If ψ\psi is strongly convex, we can move some of its curvature to HkH_{k} without changing the subproblems (2). This strategy may require us to increase MM, but this has only a slight effect on the bounds in Corollary 2. These bounds give good reasons to capture the curvature of QkQ_{k} in the Hessian HkH_{k} alone, so henceforth we focus our discussion on this case.

For Algorithm 2, when we use the bounds (22) and (21) for MM in (28), the dependency of the global complexity on η\eta becomes

This result is slightly worse than that of using positive definite HH in Algorithm 1 if we compare the second part in the max operation.

The bound in (29) is not tight for general HH, unless HkMIH_{k}\equiv MI, as in standard prox-gradient methods. This observation gives further intuition for why second-order methods tend to perform well even though their iteration complexities (which are based on the bound (29)) tend to be worse than first-order methods. Moreover, when HkH_{k} incorporates curvature information for ff, step sizes αk\alpha_{k} are often much larger than the worst-case bounds that are used in Corollary 2. Theorem 3.1, which shows how the convergence rates are related directly to the αk\alpha_{k}, would give tighter bounds in such cases. Line search on HkH_{k} in Algorithm 2 does not improve the rate directly, but we note that using HkH_{k} with smaller norm whenever possible gives more chances of switching to the intermittent linear rate (33).

Part 1 of Theorem 3.1 also explains why solving the subproblem (2) approximately can save the running time significantly, since because of fast early convergence rate, a solution of moderate accuracy can be attained relatively quickly.

2.2 Linear Convergence for Optimal Set Strongly Convex Functions

We now consider problems that satisfy the μ\mu-optimal-set-strong-convexity condition (10) for some μ>0\mu>0, and show that our algorithms have a global linear convergence property.

If Assumption 1 holds, ff is convex, FF is μ\mu-optimal-set-strongly convex for some μ>0\mu>0, there is some η[0,1)\eta\in[0,1) such that at every iteration of Algorithm 1, the approximate solution dd of (2) satisfies (3), and

Moreover, on iterates kk for which F(xk)F(xkPΩ(xk))THk(xkPΩ(xk))F(x^{k})-F^{*}\geq(x^{k}-P_{\Omega}(x^{k}))^{T}H_{k}(x^{k}-P_{\Omega}(x^{k})), these per-iteration contraction rates can be replaced by the faster rates (33) and (39).

where in (42a) we used the inexactness condition (3) and in (42b) we used (26). Using the result in Corollary 1 to lower-bound αk\alpha_{k}, we obtain (41b).

To show that the part for the early fast rate in (33) and (39) can be applied, we show that Assumption 2 holds. Then because ff is assumed to be convex as well here, Theorem 3.1 and Corollary 2 apply as well. Consider (10), by rearranging the terms, we get

as F(λx+(1λ)PΩ(x))FF\left(\lambda x+\left(1-\lambda\right)P_{\Omega}\left(x\right)\right)\geq F^{*} from optimality. By dividing both sides of (43) by λ\lambda and letting λ0\lambda\rightarrow 0, we get the bound

Note that the parameter μ\mu in the theorem above is decided by the problem and cannot be changed, while σ\sigma can be altered according to the algorithm choice. We have a similar result for Algorithm 2.

If Assumption 1 holds, ff is convex, FF is μ\mu-optimal-set-strongly convex for some μ>0\mu>0, there exists some η[0,1)\eta\in[0,1) such that at every iteration of Algorithm 2, the approximate solution dd of (2) satisfies (3), and the conditions for Hk0H^{0}_{k} in Lemma 4 are satisfied for all kk. Then we have

and the right-hand side of (45) can be further bounded by

By reasoning with the extreme eigenvalues of HkH_{k}, we can see that the convergence rates still depend on the conditioning of ff. For Algorithm 1, if we select MLM\leq L, then backtracking may be necessary, and the bound (41b) (in which a factor μ/L\mu/L appears) is germane. This same factor appears in both (41a) and (41b) when M>LM>L. Often, however, the backtracking line search chooses a value of αk\alpha_{k} that is not much less than 11, which is why we believe that the bounds (33), (34), and (41a) (which depend explicitly on αk\alpha_{k}) have some value in revealing the actual performance of the algorithm. Similar comments apply to Algorithm 2, because (7) may be satisfied with Hk\|H_{k}\| much smaller than the bounds for properly chosen Hk0H^{0}_{k}.

In the interesting case in which we choose HkLIH_{k}\equiv LI and η=0\eta=0, we have m0=Hk=Lm_{0}=\|H_{k}\|=L in Algorithm 2, and modification of HkH_{k} is not needed, since (7) always holds for γ=1\gamma=1. The bound (34) becomes (F(xk)F)2LR02/(k+2)(F(x^{k})-F^{*})\leq 2LR_{0}^{2}/(k+2), which matches the known convergence rates of proximal gradient (Nes13a, ) and gradient descent (Nes04a, ). The global linear rate in Theorem 3.3 also matches that of existing proximal gradient analysis for strongly convex problems, but the intermittent linear rate (33) that applies to both cases is new. For the case of accelerated proximal gradient covered in Nes13a , although not covered directly by our framework studied in this work, one can combine our algorithm and analysis with the Catalyst framework (LinMH15a, ) to obtain similar accelerated rates for both the strongly convex and the general convex cases.

2.3 Sublinear Convergence of the First-order Optimality Condition for Nonconvex Problems

We consider now the case of nonconvex FF. In this situation, Lemma 5 cannot be used, so we consider other properties of QQ. We can no longer guarantee the convergence of the objective value to the global minimum. Instead, we consider the norm of the exact solution of the subproblem as the indicator of closeness to the first-order optimality condition 0F(x)0\in\partial F(x) for (1) (see, for example, (Fle87a, , (14.2.16))). In particular, it is known that 0F(x)0\in\partial F(x) if and only if

This is a consequence of the following lemma.

Given any H0H\succ 0, and QHxQ_{H}^{x} defined as in (2), the following are true.

A point xx satisfies the first-order optimality condition 0F(x)0\in\partial F(x) if and only if

For any xx, defining dd^{*} to be the minimizer of QHx()Q_{H}^{x}(\cdot), we have

Part 1 is well known. For Part 2, we have from the optimality conditions for dd^{*} that f(x)Hdψ(x+d)-\nabla f(x)-Hd^{*}\in\partial\psi(x+d^{*}). By convexity of ψ\psi, we thus have

As in (47), we consider the following measure of closeness to a stationary point:

We show that the minimum value of the norm of this measure over the first kk iterations converges to zero at a sublinear rate of O(1/k)O(1/\sqrt{k}). The first step is to show that the minimum of Qk|Q_{k}| converges at a O(1/k)O(1/k) rate.

Assume that there is an η[0,1)\eta\in[0,1) such that (3) is satisfied at all iterations. For Algorithm 1, if Assumption 1 holds and HkσIH_{k}\succeq\sigma I for some σ>0\sigma>0 and all kk, we have

For Algorithm 2 (requires Hk00H_{k}^{0}\succ 0 for the first variant), we have

From (36), we have that for any k0k\geq 0,

From Corollary 1, we have that αt\alpha_{t} for all tt is lower bounded by a positive value. Therefore, using Qt(dt)=Qt(dt)\left|Q_{t}\left(d^{t}\right)\right|=-Q_{t}\left(d^{t}\right) for all tt, we obtain

Substituting the lower bound for α\alpha from Corollary 1 gives the desired result (50). The result for Algorithms 2 follows from the same reasoning applied to (7). ∎

The following lemma is from TseY09a . (Its proof is omitted.)

Given HkH_{k} satisfying (40) for all kk, we have

We are now ready to show the convergence of Gk\|G_{k}\|.

For Algorithm 2, if the initial Hk0H_{k}^{0} satisfies M0IHk0m0IM_{0}I\succeq H_{k}^{0}\succeq m_{0}I with M0m0>0M_{0}\geq m_{0}>0 then for Variant 1 we have:

Let kˉargmin0tkQt(dt)\bar{k}\coloneqq\arg\min_{0\leq t\leq k}\left|Q_{t}(d^{t})\right|, the condition (3) and Lemmas 7 and 9 imply

Finally, we note that Gkˉmin0tkGt\|G_{\bar{k}}\|\geq\min_{0\leq t\leq k}\|G_{t}\|. The proof is finished by combining (52) with Lemma 8. ∎

If we replace the definition of GkG_{k} in (49) by the solution of (2), the inequality in Lemma 9 is not needed. In particular, when we use the proximal gradient algorithm with Hk=LIH_{k}=LI and η=0\eta=0 (so that (7) holds with γ=1\gamma=1, and M=LM=L) we obtain a bound of 2(F(x0)F)/(L(k+1))2(F(x^{0})-F^{*})/(L(k+1)) on dk2\|d^{k}\|^{2}, matching the result shown in Nes13a ; DruL16a .

2.4 Comparison Among Different Approaches

Algorithms 1 and 2 both require evaluation of the function FF for each choice of the parameter αk\alpha_{k}, to check whether the decrease conditions (5) and (7) (respectively) are satisfied. The difference is that Algorithm 2 may also require solution of the subproblem (2) for each αk\alpha_{k}. This additional computation comes with two potential benefits. First, the second variant of Algorithm 2 allows the initial choice of approximate Hessian Hk0H^{0}_{k} to be indefinite, although the final value HkH_{k} at each iteration needs to be positive semidefinite for our analysis to hold. (There is a close analogy here to trust-region methods for nonconvex smooth optimization, where an indefinite Hessian is adjusted to be positive semidefinite in the process of solving the trust-region subproblem.) Second, because full steps are always taken in Algorithm 2, any structure induced in the iterates xkx^{k} by the regularizer ψ\psi (such as sparsity) will be preserved. This fact in turn may lead to faster convergence, as the algorithm will effectively be working in a low-dimensional subspace.

Here we discuss some ways to choose HkH_{k} so that the algorithms are well defined and practical, and our convergence theory can be applied.

When HkH_{k} are chosen to be positive multiples of identity (Hk=ζkIH_{k}=\zeta_{k}I, say), our algorithms reduce to variants of proximal gradient. If we set ζkL\zeta_{k}\geq L, then the unit step size is always accepted even if the problem is not solved exactly, because Qk(dk)Q_{k}(d^{k}) is an upper bound of F(xk)F(xk+dk)F(x^{k})-F(x^{k}+d^{k}). When LL is not known in advance, adaptive strategies can be used to find it. For Algorithm 2, we could define ζk0\zeta_{k}^{0} (such that Hk0=ζk0IH_{k}^{0}=\zeta_{k}^{0}I) to be the final value ζk1\zeta_{k-1} from the previous iteration, possibly choosing a smaller value at some iterations to avoid being too conservative. For Algorithm 1, we could increase ζk0\zeta_{k}^{0} over ζk1\zeta_{k-1} if backtracking was necessary at iteration k1k-1, and shrink it when a unit stepsize sufficed for several successive iterations.

The proximal Newton approach of setting Hk=2f(xk)H_{k}=\nabla^{2}f(x^{k}) is a common choice in the convex case (LeeSS14a, ), where we can guarantee that HkH_{k} is at least positive semidefinite. In LeeSS14a , it is shown that in some neighborhood of the optimum, when dkd^{k} is the exact solution of (2), then unit step size is always taken, and superlinear or quadratic convergence to the optimum ensues. (A global complexity condition is not required for this result.) Generally, however, indefiniteness in 2f(xk)\nabla^{2}f(x^{k}) may lead to the search direction dkd^{k} not being a descent direction, and the backtracking line search will not terminate in this situation. (Our convergence results for Algorithm 1 do not apply in the case of HkH_{k} indefinite.) A common fix is to use damping, setting Hk=2f(xk)+ζkIH_{k}=\nabla^{2}f(x^{k})+\zeta_{k}I, for some ζk0\zeta_{k}\geq 0 that at least ensures positive definiteness of HkH_{k}. Strategies for choosing ζk\zeta_{k} adaptively have been the subject of much research in the context of smooth minimization, for example, in trust-region methods and the Levenberg-Marquardt method for nonlinear least squares (see NocW06a ). Variant 2 of our Algorithm 2 uses this strategy. It is desirable to ensure that ζk0\zeta_{k}\to 0 as the iterates approach a solution at which local convexity holds, to ensure rapid local convergence.

An L-BFGS approximation of 2f(xk)\nabla^{2}f(x^{k}) could also be used for HkH_{k}. When ψ\psi is not present in (1) and ff is strongly convex, it is shown in LiuN89a that this approach has global linear convergence because the eigenvalues of HkH_{k} are restricted to a bounded positive interval. This proof can be extended to our algorithms, when a convex ψ\psi is present in (1). When ff is not strongly convex, one can apply safeguards to the L-BFGS update procedure (as described in LiF01a ) to ensure that the upper and lower eigenvalues of HkH_{k} are bounded uniformly away from zero.

Another interesting choice of HkH_{k} is a block-diagonal approximation of the Hessian, which (when ψ\psi can be partitioned accordingly) allows the subproblem (2) to be solved in parallel while still retaining some curvature information. Strategies like this one are often used in distributed optimization for machine learning problems (see, for example, Yan13a ; LeeC17a ; ZheXXZ17a ).

Numerical Results

We define HkH_{k} to be the limited-memory BFGS approximation (LiuN89a, ) based on the past 10 steps, with a safeguard mechanism proposed in LiF01a to ensure uniform boundedness of HkH_{k}. The subproblems (2) are solved with SpaRSA (WriNF08a, ), a proximal-gradient method which, for bounded HkH_{k}, converges globally at a linear rate. We consider the publicly available data sets listed in Table 1,Downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. and present empirical convergence results by showing the relative objective error, defined as

where FF^{*} is the optimum, obtained approximately through running our algorithm with long enough time. For all variants of our framework, we used parameters β=0.5\beta=0.5, and γ=104\gamma=10^{-4}. Further details of our implementation are described in LeeLW18a .

We use the two smaller data sets a9a and rcv1 to quantify the relationship between accuracy of the subproblem solution and the number of outer iterations. We compare running SpaRSA with a fixed number of iterations T{5,10,15,20,25,30}T\in\{5,10,15,20,25,30\}. Figure 1 shows that, in all cases, the number of outer iterations decreases monotonically as the (fixed) number of inner iterations is increased. For T15T\geq 15, the degradation in number of outer iterations resulting from less accurate solution of the subproblems is modest, as our theory suggests. We also observe the initial fast linear rates in the early stages of the method that are predicted by our theory, settling down to a slower linear rate on later iterations, but with sudden drops of the objective, possibly as a consequence of intermittent satisfaction of the condition in Part 1 of Theorem 3.1.

Next, we examine empirically the step size distribution for Algorithm 1 and how often in Algorithm 2 the matrix HkH_{k} needs to be modified. On both a9a and rcv1, the initial step estimate α=1\alpha=1 is accepted on over 99.5% of iterations in Algorithm 1, while in both variants of Algorithm 2, the initial choice of HkH_{k} is used without modification on over 99% of iterations. These statistics hold regardless of the value of TT (the number of inner iterations), though in the case of Algorithm 2, we see a faint trend toward more adjustments for larger values of TT. When adjustments are needed, they never number more than 44 at any one iteration, except for a single case (a9a for Variant 1 of Algorithm 2 with T=5T=5) for which up to 88 adjustments are needed.

We next compare our inexact method with an exact version, in which the subproblems (2) are solved to near-optimality at each iteration. Since the three algorithms behave similarly, we use Algorithm 1 as the representative for this investigation. We use a local cluster with 16 nodes for the two larger data sets rcv1 and epsilon, while for the small data set a9a, only one node is used. Iteration counts and running time comparisons are shown in Figure 2. The exact version requires fewer iterations, as expected, but the inexact version requires only modestly more iterations. In terms of runtime, the inexact versions with moderate amount of inner iterations (at least 3030) has the advantage, due to the savings obtained by solving the subproblem inexactly.

We note that the approach of gradually increasing the number of inner iterations, suggested in SchT16a ; GhaS16a , produces good results for this application, the number of iterations required being comparable to those for the exact solver while the running time is slightly faster than that of T=30T=30 for epsilon and competitive with it for the rest two data sets.

With the notation A(b1a1,b2a2,,blal)A\coloneqq(b_{1}a_{1},b_{2}a_{2},\dotsc,b_{l}a_{l}), the dual of this problem is

which is (1/2C)(1/2C)-strongly convex. We consider the distributed setting such that the columns of AA are stored across multiple processors. In this setup, only the block-diagonal parts (up to a permutation) of ATAA^{T}A can be easily formed locally on each processor. We take HkH_{k} to be the matrix formed by these diagonal blocks, so that the subproblem (2) can be decomposed into independent parts. We use cyclic coordinate descent with random permutation (RPCD) as the solver for each subproblem. (Note that this algorithm partitions trivially across processors, because of the block-diagonal structure of HkH_{k}.)

Our experiment compares the strategy of performing a fixed number of RPCD iterations for each subproblem with one of increasing the number of inner iterations as the algorithm proceeds, as in SchT16a ; GhaS16a . We use the data sets in Table 1, and compare the two strategies on Algorithm 1, but use an exact line search to choose αk\alpha_{k} rather than the backtracking approach. (An exact line search is made easy by the quadratic objective.) For the first strategy, we use ten iterations of RPCD on each subproblem, while for the second strategy, we perform 1+k/101+\lfloor k/10\rfloor iterations of RPCD at the kkth outer iteration as suggested by SchT16a ; GhaS16a . The implementation is a modification of the experimental code of LeeR15a . We run the algorithms on a local cluster with 16 machines, so that HkH_{k} contains 16 diagonal blocks. Results are shown in Figure 3. Since the choice of HkH_{k} in this case does not capture global curvature information adequately, the strategy of increasing the accuracy of subproblem solution on later iterations does not reduce the number of iterations as significantly as in the previous experiment. The runtime results show a significant advantage for the first strategy of a fixed number of inner iterations, particularly on the a9a and rcv1 data sets. Judging from the trend in the approach of increasing inner iterations, we can expect that the exact version will show huger disadvantage for running time in this case. We also observe the faster linear rate on early iterations, matching our theory.

Conclusions

We have analyzed global convergence rates of three practical inexact successive quadratic approximation algorithms under different assumptions on the objective function, including the nonconvex case. Our analysis shows that inexact solution of the subproblems affects the rates of convergence in fairly benign ways, with a modest factor appearing in the bounds. When linearly convergent methods are used to solve the subproblems, the inexactness condition holds when a fixed number of inner iterations is applied at each outer iteration kk.

References

Appendix A Proof of Lemma 5

where in (56a) we used the convexity of ff, in (56b) we set d=λ(PΩ(x)x)d=\lambda(P_{\Omega}(x)-x), and in (56c) we used the optimal set strong convexity (10) of FF. Thus we obtain (25). ∎

Appendix B Proof of Lemma 6

then by setting the derivative to zero in (57), we have

When δkAk\delta_{k}\geq A_{k}, we have from (58) that λk=1\lambda_{k}=1. Therefore, from (30) we get

On the other hand, since AAk>0,ck0A\geq A_{k}>0,c_{k}\geq 0 for all kk, (30) can be further upper-bounded by

For δkAAk\delta_{k}\geq A\geq A_{k}, (31) still applies. If A>δkA>\delta_{k}, we have from (59) that λk=δk/A\lambda_{k}=\delta_{k}/A, hence

This together with (31) imply that {δk}\{\delta_{k}\} is a monotonically decreasing sequence. Dividing both sides of (60) by δk+1δk\delta_{k+1}\delta_{k}, and from the fact that δk\delta_{k} is decreasing and nonnegative, we conclude

Summing this inequality from k0k_{0}, and using δk0<A\delta_{k_{0}}<A, we obtain