Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone line-search algorithms
Andreas Themelis, Lorenzo Stella, Panagiotis Patrinos
Introduction
In this paper we deal with optimization problems of the form
under the following assumptions, which will be valid throughout the paper without further mention. {ass}[Basic assumption]In problem (1)
(differentiable with -Lipschitz continuous gradient);
is proper, closed and -prox-bounded (see Section 2.1);
a solution exists, that is, .
Both and are allowed to be nonconvex, making (1) prototypic for a plethora of applications spanning signal and image processing, machine learning, statistics, control and system identification. A well known algorithm addressing (1) is forward-backward splitting (FBS), also known as proximal gradient method. FBS has been thoroughly analyzed under the assumption of being convex. If moreover is convex, then FBS is known to converge globally with rate in terms of objective value, where is the iteration count. In this case, accelerated variants of FBS, also known as fast forward-backward splitting (FFBS), can be derived thanks to the work of Nesterov , that only require minimal additional computations per iteration but achieve the provably optimal global convergence rate of order .
The work in pioneered an alternative acceleration technique. The method is based on an exact, real-valued penalty function for the original problem (1), namely the forward-backward envelope (FBE), defined as follows
The name forward-backward envelope comes from the fact that is the value of the minimization problem that defines the forward-backward step and alludes to the kinship that it has with the Moreau envelope. These claims will be addressed more in detail in Section 4. When is sufficiently smooth and both and are convex, the FBE was shown to be continuously differentiable and amenable to be minimized with generalized Newton methods. More recently, proposed a linesearch algorithm based on (L-)BFGS quasi-Newton directions for minimizing the FBE. The curvature information exploited by Newton-like methods acts as an online preconditioner, enabling superlinear rates of convergence under mild assumptions. However, unlike plain (F)FBS schemes, such methods require accessing second-order information of the smooth term (needed for the evaluation of ), and are well defined only as long as the nonsmooth term is convex. On the contrary, FBS merely requires first-order information on and prox-boundedness of the nonsmooth term , in which case all accumulation points are stationary for , \iethey satisfy the first order necessary conditions .
In this paper we propose ZeroFPR, a nonmonotone linesearch algorithm that, to the best of our knowledge, is the first that (1) addresses the same range of problems as FBS, (2) requires the same black-box oracle as FBS (gradient of one function and proximity operator of the other), (3) yet achieves superlinear rates under mild assumptions (only) at the limit point. Though related to minFBE algorithm , ZeroFPR is conceptually different, mainly because it is gradient-free, in the sense that it does not require the gradient of the FBE. Moreover,
We provide the necessary theoretical background linking the concepts of stationarity of a point for problem (1), criticality and optimality. To the best of our knowledge, such an analysis was previously made only for the proximal point algorithm and for a special case of the projected gradient method .
The analysis of the FBE, previously studied only in the case of being and convex , is extended to and as in Equation 1. In particular, we provide mild assumptions on and that ensure (1) continuous differentiabilty of the FBE around critical points, (2) (strict) twice differentiability at critical points, and (3) equivalence of strong local minimality for the original function and the FBE.
Exploiting the investigated properties of the FBE and of critical points we prove that ZeroFPR with monotone linesearch converges (1) globally if has the Kurdyka-Łojasiewicz property , and (2) superlinearly when quasi-Newton Broyden directions are employed, under mild additional requirements at the limit point.
In Section 2 we introduce some notation and list some known facts about FBS. In Section 3 we define and explore notions of stationarity and criticality for the investigated problem and relate them with properties of the forward-backward operator. In Section 4 we extend the results of about the fundamental properties of the FBE to the more general setting addressed in this paper, where and satisfy Equation 1; for the sake of readability, some of the proofs are deferred to Appendix A. Section 5 addresses the core contribution of the paper, ZeroFPR; although arbitrary directions can be chosen, we specialize the results on superlinear convergence to a quasi-Newton Broyden method so as to truely maintain the same black-box oracle as FBS. Some ancillary results needed for the proofs are listed in Appendix B. Finally, Section 6 illustrates numerical results obtained with the proposed method.
Preliminaries
The identity matrix is denoted as , and the extended real line as . The open and closed ball of radius centered in is denoted as and , respectively. Given a set and a sequence we write with the obvious meaning of for all . The (possibly empty) set of cluster points of is denoted as , or simply as whenever the indexing is clear from the context. We say that is \DEFsummable if is finite, and \DEFsquare-summable if is summable.
Following the terminology of , we say that a function is strictly continuous at if is finite, and \DEFstrictly differentiable at if exists and . The set of functions with Lipschitz continuous gradient is denoted as , and for we write to indicate the Lipschitz modulus of .
For a proper, closed function , a vector is a \DEFsubgradient of at , where the \DEFsubdifferential is considered in the sense of [44, Def. 8.3]
We have and [44, Ex. 8.8(c)].
Given a parameter value , the \DEFMoreau envelope function and the \DEFproximal mapping are defined by
We now summarize some properties of and ; the interested reader is referred to for a detailed discussion. A function is \DEFprox-bounded if there exists such that is bounded below on . The supremum of all such is the \DEFthreshold of prox-boundedness for . In particular, if is convex or bounded below then . In general, for any the proximal mapping is nonempty- and compact-valued, and the Moreau envelope finite [44, Thm. 1.25].
Given a nonempty closed set we let denote its \DEFindicator function, namely if and otherwise, and the (set-valued) \DEFprojection . Proximal mappings can be seen as generalized projections, due to the relation for any .
For a set-valued mapping we let denote its \DEFgraph, the set of its \DEFzeros and the set of its \DEFfixed-points.
2. Forward-backward iterations
holding for all [9, Prop. A.24], for any the function
furnishes a majorization model for , in the sense that
where \gamma\in\bigl{(}0,\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}\bigr{)} is the stepsize parameter. The (set-valued) \DEFforward-backward operator can be equivalently expressed as
Stationary and critical points
Unless is convex, the stationarity condition in problem (1) is only necessary for the optimality of [44, Thm. 10.1]. In this section we define different concepts of (sub)optimality and show how they are related for generic functions as in Equation 1. {defin} We say that a point is
stationary if ;
critical if it is \DEF-critical for some , \ieif ;
optimal if , \ieif it solves (1).
The notion of criticality was already discussed in under the name of -stationarity ( plays the role of ) for the special case of , where is a convex set and is the (nonconvex) set of vectors with at most nonzero entries.
If is convex, then and we may talk of criticality without mention of : in this case, the properties of -criticality and stationarity are equivalent regardless of the value of . For more general functions , instead, the value of plays a role in determining whether a point is -critical or not, which legitimizes the following definition. {defin} The \DEFcriticality threshold is the function
As usual, whenever and are clear from the context we simply write in place of . That is due to the fact that (and consequently ) is everywhere empty-valued for . Considering also forces the set in the definition to be nonempty, and the lower-bound in particular; more precisely, observe that, by definition, iff is a critical point.
Let us consider for and where . Clearly, (as is lower-bounded), and are both (unique) optima. Since for and is clearly empty elsewhere, all points in are stationary. is the (set-valued) projection on , therefore the forward-backward operator is . We have
In particular, . We now list some properties of critical and optimal points which will be used to derive regularity properties of and . {thm}[Properties of critical points]The following properties hold:
for , a point is -critical iff
if is critical, then it is -critical for all ; moreover, is also -critical provided that ;
and for any critical point and .
By suitably rearranging, the claim readily follows.
2: due to 1, if is -critical, apparently it is also -critical for any . From the definition (9) of the criticality threshold , it then follows that is -critical for any . Suppose now that . Then, due to 1 for all we have
By taking the limit as we obtain that the inequality holds for as well, proving the claim in light of the characterization 1.
3: let be a critical point, and let for some . Fix . From 1 and 2 it then follows that
Since , necessarily .
In the next result we show that criticality is a halfway property between stationarity and optimality. In light of these relations we shall seek “suboptimal” solutions which we characterize as critical points. {prop}[Optimality, criticality, stationarity]Let \bar{\gamma}{}\coloneqq{}\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}.
(criticality stationarity) for all ;
(optimality criticality) for all ; in particular, for all , and also for if ;
1: let and . Since minimizes , we have 0{}\in{}\hat{\partial}\bigl{[}g+\tfrac{1}{2\gamma}\|{}\cdot{}-x+\gamma\nabla f(x)\|^{2}\bigr{]}(x){}={}\hat{\partial}g(x)+\nabla f(x){}={}\hat{\partial}\varphi(x), where the first inclusion follows from [44, Thm. 10.1] and the equalities from [44, Thm. 8.8(c)]. This proves that is stationary.
2: Fix , and . Necessarily , otherwise, due to section 2.2, would contradict minimality of . Therefore, is -critical and the claim follows from the arbitrarity of .
As already seen in Section 3, the bound \Gamma(x^{\star})\geq\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}} at optimal points in Item 2 is tight, and clearly the implication “optimality criticality” cannot be reversed (consider, \egthe point for ). The next example shows that the other implication is also proper. {es}[Stationarity criticality] Let and . We have , , and for it holds that . Therefore, is stationary; however, T_{\gamma}(x^{\star})=\prox_{\gamma g}(0)=\set{-(\nicefrac{{5\gamma}}{{3}})^{3}}, and in particular for any , proving to be non critical.
Forward-backward envelope
The FBE (2) was introduced in and further analyzed in in the case when is convex. Under such assumption the FBE was shown to be continuously differentiable, which made it possible to derive minimization algorithms based on its gradient. In the general setting addressed in this paper the FBE might fail to be (continuously) differentiable, and as such we need to resort to gradient-free methods. This task will be addressed in Section 5 where Section 5 will be proposed; other than being applicable to a wider range of problems, the proposed scheme is entirely based on the same oracle of forward-backward iterations, unlike the approaches in which instead require the computation of . All this will be possible thanks to continuity properties of the FBE, and to the behavior of at critical points. We now focus on its continuity, while the other property will be addressed shortly after in Section 4.2.
[Alternative expressions for ] By expanding the square and rearranging the terms in the definition (2), can equivalently be expressed as
Comparing with (7), it is apparent that the set of minimizers in the above expression coincides with , the forward-backward operator at . Moreover, taking out the constant term from the infimum we immediately obtain the following expression involving the Moreau envelope of :
Other than providing an explicit way of computing the FBE, (11) emphasizes how inherits the regularity properties of the Moreau envelope of . In particular, the next key property follows from the strict continuity of [44, Ex. 10.32]. {prop}[Strict continuity of ]For any , the FBE is a real-valued and strictly continuous function on .
In the next section we will see the fundamental qualitative similarities between the FBE and the Moreau envelope. Namely, for small enough both and are lower bounds for the original function with same minimizers and minimum; in particular the minimization of is equivalent to that of or . Similarly, the identity
2. Basic properties
We now provide bounds relating to the original function that extend the well known inequalities involving the Moreau envelope. {prop}Let be fixed. Then
for all and .
1 is obvious from the definition of the FBE (consider in (2)). As to 2, since the set of minimizers in (2) is (cf. (12b)), (5) yields
With respect to the inequalities holding for convex treated in , the lower bound in Section 4.2 is weaker, while the upper bound unchanged. Regardless, an immediate consequence of the result is that the value of and at critical points is the same, and minimizers and infima of the two functions coincide for small enough. {thm}The following hold
for all and ;
and for all \gamma{}\in{}\bigl{(}0,\min\set{\nicefrac{{1}}{{L_{f}}},\gamma_{g}}\bigr{)}.
The bound in Item 2 is tight even when and are convex, as the counterexample with and shows (see [45, Ex. 2.4] for details).
Although we will address problem (1) by simply exploiting the continuity of the FBE, nevertheless enjoys favorable properties which are key for the efficacy of the method which will be discussed in Section 5. Firstly, observe that, due to strict continuity, is almost everywhere differentiable, as it follows from Rademacher’s theorem. The same applies to the mapping , its Jacobian being
which is symmetric wherever it exists [44, Cor. 13.42 and Prop. 13.34]. However, in order to show that the proposed method achieves fast convergence we need additional regularity properties, namely (strict) twice differentiability at critical points and continuous differentiability around. The rest of the section is dedicated to this task.
3. Prox-regularity and first-order properties
In the favorable case in which is convex and , the FBE enjoys global continuous differentiability . In our setting, \DEFprox-regularity acts as a surrogate of convexity; the interested reader is referred to [44, §13.F] for a detailed discussion. {defin}[Prox-regularity]Function is said to be \DEFprox-regular at for if there exist such that for all and
it holds that . Prox-regularity is a mild requirement enjoyed globally and for any subgradient by all convex functions, with and . When is prox-regular at for , then for sufficiently small the Moreau envelope is continuously differentiable in a neighborhood of . To our purposes, when needed, prox-regularity of will be required only at critical points , and only for the subgradient . Therefore, with a slight abuse of terminology we define prox-regularity of critical points as follows. {defin}[Prox-regularity of critical points] We say that a critical point is \DEFprox-regular if is prox-regular at for . Examples where a critical point fails to be prox-regular are of challenging construction; before illustrating a cumbersome such instance in Section 4.3, we first prove an important result that connects prox-regularity with first-order properties of the FBE. {thm}[Continuous differentiability of ]Suppose that is of class around a prox-regular critical point . Then, for all there exists a neighborhood of on which the following properties hold:
and are strictly continuous, and in particular single-valued;
with , where is as in (13).
For , using LABEL:{prop:ProxGrad}, LABEL: and LABEL:{prop:SingleValuedFB} we obtain that
Replacing with in the above expression, the inequality is strict for all . From [40, Thm. 4.4] applied to the “tilted” function it follows that there is a neighborhood of in which is strictly continuous and is of class with gradient for all . By possibly narrowing , we may assume that and for all . 2 then follows from (11) and the chain rule of differentiation, and 1 from the fact that strict continuity is preserved by composition.
When , Section 4.3 restates the known fact that if is prox-regular at for , then is continuosly differentiable around with . Notice that the bound is tight: in general, for no continuity of nor continuous differentiability of around can be guaranteed. In fact, even when is -critical, might even fail to be single-valued and differentiable at , as the following counterexample shows. {es}[Why in first-order properties] Consider and where . Then, , , and the FBE is . At the critical point , which satisfies , is prox-regular for any subgradient. For any it is easy to see that is differentiable in a neighborhood of . However, for the distance function has a first-order singularity in , due to the -valuedness of .
[Prox-nonregularity of critical points]Consider where , and S=\set{\nicefrac{{1}}{{n}}}[n\in\N_{\geq 1}]\cup\set{0}. For we have , however fails to be prox-regular at for . For any and for any neighborhood of in it is always possible to find a point arbitrarily close to with multi-valued projection on . Specifically, the midpoint P_{n}{}={}\bigl{(}\frac{1}{2}(\frac{1}{n}+\frac{1}{n+1}),\,{}-\nicefrac{{1}}{{\rho}}\bigr{)} has 2-valued projection on for any , being it \proj_{\graph g}(P_{n})=\set{\nicefrac{{1}}{{n}},\nicefrac{{1}}{{n+1}}}. By considering a large , can be made arbitrarily close to and at the same time its projection(s) arbitrarily close to . Therefore, cannot be prox-regular at for , for otherwise such projections would be single-valued close enough to [40, Cor. 3.4 and Thm. 3.5]. As a result, is not differentiable around , and indeed at each midpoint for it has a nonsmooth spike.
To underline how unfortunate the situation depicted in Section 4.3 is, notice that adding a linear term to for any , yet leaving unchanged, restores the desired prox-regularity of each critical point. Indeed, this is trivially true for any nonzero critical point; besides, is prox-regular at for any , and for we have that is nomore critical.
4. Second-order properties
In this section we discuss sufficient conditions for twice-differentiability of the FBE at critical points. Additionally to prox-regularity, which is needed for local continuous differentiability, we will also need generalized second-order properties of . The interested reader is referred to [44, §13] for an extensive discussion on epi-differentiability. {ass}With respect to a given critical point
exists and is (strictly) continuous around ;
is prox-regular and (strictly) twice epi-differentiable at for , with its second order epi-derivative being generalized quadratic:
where is a linear subspace and . Without loss of generality we take symmetric, and such that and .This can indeed be done without loss of generality: if and satisfy (15), then it suffices to replace with to ensure the desired properties.
We say that the assumptions are “strictly” satisfied if the stronger conditions in parenthesis hold.
We now show that the quite common properties required in Section 4.4 are all is needed for ensuring first-order properties of the proximal mapping and second-order properties of the FBE at critical points. {thm}[Twice differentiability of ]Suppose that Section 4.4 is (strictly) satisfied with respect to a critical point . Then, for any
is (strictly) differentiable at with symmetric and positive semidefinite Jacobian
is (strictly) differentiable at with Jacobian
where is as in (13) and as in (16);
is (strictly) twice differentiable at with symmetric Hessian
See Appendix A. Again, when Section 4.4 covers the differentiability properties of the proximal mapping (and consequently the second-order properties of the Moreau envelope, due to the identity ) as discussed in .
We now provide a key result that links nonsingularity of the Jacobian of the forward-backward residual to strong (local) minimality for the original cost and for the FBE , under the generalized second-order properties of Section 4.4. {thm}[Conditions for strong local minimality]Suppose that Section 4.4 is satisfied with respect to a critical point , and let \gamma{}\in{}(0,\min\set{\Gamma(x^{\star}),\nicefrac{{1}}{{L_{f}}}}). The following are equivalent:
is a strong local minimum for ;
is a local minimum for and is nonsingular;
the (symmetric) matrix is positive definite;
is a strong local minimum for ;
is a local minimum for and is nonsingular. {proof} See Appendix A.
ZeroFPR algorithm
The first algorithmic framework exploiting the FBE for solving composite minimization problems was studied in , and other schemes have been recently investigated in . All such methods tackle the problem by looking for a (local) minimizer of the FBE, exploting the equivalence of (local) minimality for the original function and for the FBE , for small enough. To do so, they all employ the concept of directions of descent, thus requiring the gradient of the FBE to be well defined everywhere. In the more general framework addressed in this paper, such basic requirement is not met, which is why we approach the problem from a different perspective. This leads to ZeroFPR, the first algorithm, to the best of our knowledge, that despite requiring only the black-box oracle of FBS and being suited for fully nonconvex problems it achieves superlinear convergence rates.
Instead of directly addressing the minimization of or , we seek solutions of the following nonlinear inclusion (generalized equation)
By doing so we address the problem from the same perspective of FBS, that is, finding fixed points of the forward-backward operator or, equivalently, zeros of its residual . Despite might be quite irregular when is nonconvex, it enjoys favorable properties at the very solutions to (20) — \ieat -critical points — starting from single-valuedness, cf. Item 3. If mild assumptions are met, turns out to be continuous around and even differentiable at critical points (cf. Sections 4.3 and 4.4), and as a consequence the inclusion problem (20) reduces to a well behaved system of equations, as opposed to generalized equations, when close to solutions.
This motivates addressing problem (20) with fast methods for nonlinear equations. Newton-like schemes are iterative methods that prescribe updates of the form
which essentially amount to selecting , a linear operator that ideally carries information of the geometry of around , in the attempt to yield an optimal iterate . For instance, when is sufficiently regular Newton method corresponds to selecting as the inverse of an element of the generalized Jacobian of at , enabling fast convergence when close to a solution under some assumptions. However, selecting as in Newton method would require information additional to the forward-backward oracle , and as such it goes beyond the scope of the paper. For this reason we focus instead on quasi-Newton schemes, in which are linear operators recursively defined with low-rank updates that satisfy the (inverse) secant condition
A famous result states that, under mild assumptions and starting sufficiently close to a solution , updates as in (21) are superlinearly convergent to iff the Dennis-Moré condition holds, namely the limit . More recently, in the result was extended to generalized equations of the form , where is smooth and possibly set-valued. The study focuses on Josephy-Newton methods where the update is the solution of the inner problem , where , which can be interpreted as a forward-backward step in the metric induced by . In particular, differently from the here proposed ZeroFPR, the method in has the crucial limitation that, unless the operator has a very particular structure, the backward step may be prohibitely challenging.
Quasi-Newton schemes are powerful and widely used methods. However, it is well known that they are effective only when close enough to a solution and might even diverge otherwise. To cope with this crucial downside there comes the need of a globalization strategy; this is usually addressed by means of a linesearch over a suitable merit function , along directions of descent for so as to ensure sufficient decrease for small enough stepsizes. Unfortunately, the potential choice is not regular enough for a ‘direction of descent’ to be everywhere defined. The proposed Section 5 bypasses this limitation by exploiting the favorable properties of the FBE.
Globalizing the convergence of any fast local method is the core contribution of ZeroFPR, an algorithm that exploits the favorable properties of the FBE, and that requires exactly the same oracle of FBS. Conceptually, ZeroFPR is really elementary; for simplicity, let us first consider the monotone case, \iewith so that (cf. 5). The following steps are executed for updating iterate :
first, at 1 a nominal forward-backward call yields an element that decreases the value of by at least (item 1);
then, at 3 an update direction at (not at !) is selected;
because of the sufficient decrease on and the continuity of , at 4 a stepsize can be found with finite many backtrackings that ensures a decrease for of at least in the update , for any .
In order to reduce the number of backtrackings, can be selected resulting in a nonmonotone linesearch. The sufficient decrease is enforced with respect to a parameter (cf. section 5.1.1), namely a convex combination of . For the sake of convergence, can be selected arbitrarily in as long as it is bounded away from , hence the role of the user-set lower bound . Consequently, small values of and concur in reducing conservatism in the linesearch by favoring larger stepsizes. {lem}[Nonmonotone linesearch globalization]For all the iterates generated by ZeroFPR satisfy
and there exists such that
In particular, the number of backtrackings at 4 is finite. {proof} The first two inequalities in (23) are due to item 2. Moreover,
where the inequality follows by the linesearch condition (19); this proves the last inequality in (23). As to (24), let be fixed and contrary to the claim suppose that for all there exists such that the point satisfies . Taking the limit for , continuity of as ensured by eq. 11 yields
where the last inequality is due to the fact that . This contradicts item 2; therefore, there exists such that for all . By combining this with (23) the claim follows. Section 5.1.1 ensures that regardless of the choice of , ZeroFPR does not get stuck in infinite loops. In Section 5.4 we will also show that the algorithm returns solutions of problem (20), and that under mild assumptions at the limit point the convergence rate is superlinear when good directions are selected at 3. Before going through the technicalities, we briefly anticipate what such good directions are.
1.2. Choice of the directions: quasi-Newton methods
As already emphasized, fast convergence of ZeroFPR will be obtained thanks to the employment of Newton-like directions . Differently from the classical Newton-like step (21), when stepsize is accepted, the update in ZeroFPR is of the form rather than , where is an element of . Therefore, needs to be a Newton-like direction at , and not at , namely
(as opposed to ).
We consider a modified Broyden’s scheme that performs rank-one updates of the form
and is a fixed parameter, with the convention . Starting from an invertible matrix this selection ensures that all matrices are invertible.
BFGS method consists in the following update rule for matrices in (25): starting from a symmetric and positive definite ,
with and , see \eg[34, §6.1]. BFGS is the most popular quasi-Newton scheme; it is based on rank-two updates that, additionally to the secant condition, enforce also symmetricity. In fact, BFGS is guaranteed to satisfy the Dennis-Moré condition only provided that the Jacobian of the nonlinear system at the limit point is symmetric . Although this is not the case for , we observed in practice that BFGS directions (27) perform extremely well.
Ultimately, instead of storing and operating on dense matrices, limited-memory variants of quasi-Newton schemes keep in memory only a few (usually to ) most recent pairs implicitly representing the approximate inverse Jacobian. Their employment considerably reduces storage and computations over the full-memory counterparts, and as such they are the methods of choice for large-scale problems. The most popular limited-memory method is L-BFGS: based on BFGS, it efficiently computes matrix-vector products with the approximate inverse Jacobian using a two-loop recursion procedure .
2. Connections with other methods
The first algorithmic framework exploiting the FBE was studied in , where two semismooth Newton methods were analyzed for convex and with (twice continuously differentiable with Lipschitz continuous gradient). A generalization of the scheme was then studied in under less restrictive assumptions, with particular attention to quasi-Newton directions in place of semismooth Newton methods. The proposed algorithm interleaves descent steps over the FBE with forward-backward steps. then analyzed global and linear convergence properties of a generic linesearch algorithmic framework for minimizing the FBE based on gradient-related directions, for analytic and subanalytic, convex, and lower bounded .
Though apparently closely related, the approach that we provide in this paper presents major conceptual differences from any of the ones above. Apart from the significantly less restrictive assumptions, the crucial distinction is that our method is derivative-free, \ieit does not require the gradient of the FBE. As a consequence, no computation nor the existence of is required, resulting in a method that, differently from the others, truly relies on the very same oracle information of the forward-backward operator .
3. Main remarks
In this section we list a few observations that come in handy when implementing ZeroFPR. {rem}[Adaptive variant when is unknown]In practice, no prior knowledge of the global Lipschitz constant is required for ZeroFPR. In fact, replacing with an initial estimate and fixing a backtracking ratio , after 2 the following instruction can be added:
[Support for locally Lipschitz ]If is bounded and, as it is reasonable, the directions selected at 3 do not diverge, then Item 1 on can be relaxed to being locally Lipschitz.
In fact, it follows from the definition of proximal mapping that , and if the directions are bounded then there exists a compact domain such that . Then, all results of the paper apply by replacing with , the (finite) Lipschitz constant of on .
[Cost per iteration]Evaluating essentially amounts to one evaluation of ; this is evident from the expression (11), together with the observation that for any . Therefore, computing at 4 yields an element required in 1, since at every iteration. In general, one evaluation of per backtracking step is required. If the directions are computed with Broyden or BFGS methods (26) and (27), then one additional evaluation of is required for retrieving ; in the best case of being accepted, which asymptotically happens under mild assumptions (cf. section 5.4.2), the algorithm then requires exactly two evaluations of per iteration.
[Extension of FBS] Observe that by selecting ZeroFPR reduces to the classical FBS algorithm. Item 2 combined with the relation due to (23) shows that the condition at 4 is always statisfied (with ). Therefore, for all , which is FBS, cf. (7).
4. Convergence results
In this section we analyze the properties of cluster points of the iterates generated by ZeroFPR. Specifically,
every cluster point of and solves problem (20) (Section 5.4);
if the linesearch is (eventually) monotone, then global and linear convergence are achieved under mild assumptions (Sections 5.4.1 and 5.4.1);
directions satisfying the Dennis-Moré condition, such as Broyden’s, enable superlinear rates under mild assumptions (Sections 5.4.2 and 5.4.2).
In what follows, we exclude the trivial case in which the optimality condition is achieved in a finite number of iterations, and therefore assume for all . {thm}[Criticality of cluster points]The following hold for the iterates generated by ZeroFPR:
square-summably, and all cluster points of and are critical; more precisely, ;
converges to a (finite) value , and so does if is bounded.
By telescoping the above inequality and using (23), we obtain
proving square-summably. Suppose now that for some and . Then, since , in particular as well. Due to the arbitrarity of the cluster point it follows that , and a similar reasoning proves the converse inclusion, hence . Moreover, we have and since , from the outer semicontinuity of [44, Ex. 5.23(b)] it follows that , \ie.
2: from (28) it follows that is decreasing, and in particular its limit exists, be it . Due to (23), necessarily , therefore
proving that . If is bounded, then so is due to compact-valuedness of [44, Thm. 1.25]. Due to eq. 11 is -Lipschitz continuous on a compact set containing and for some . Then,
where the inequalities follow from section 4.2. Consequently, as well.
If follows from (23) and the fact that is a decreasing sequence (cf. (28)), that the iterates of ZeroFPR satisfy . As a consequence, a sufficient condition for ensuring that the sequence does not diverge — and consequently nor does provided that the sequence of directions is bounded — is that the level set is compact. In the adaptive variant discussed in Section 5.3, this translates to boundedness of the level set , where denotes the iteration starting from which is constant. Since such point is unknown a priori, the sufficient condition needs be strengthened to having bounded level sets.
We now show that if is well-behaved at cluster points, then the whole sequence generated by ZeroFPR is convergent. Good behavior involves the existence of a desingularizing function, that is, needs to possess the Kurdyka-Łojasiewicz property, a mild requirement that we restate here for the reader’s convenience. {defin}[KL property]A proper and lower semicontinuous function has the Kurdyka-Łojasiewicz property (KL property) at if there exist a concave desingularizing function (or KL function) for some and a neighborhood of , such that
is with on ;
for all s.t. it holds that
The KL property is a mild requirement enjoyed by semi-algebraic functions and by subanalytic functions which are continuous on their domain see also . Moreover, since semi-algebraic functions are closed under parametric minimization, from the expression (2) it is apparent that is semi-algebraic provided that and are. More precisely, in all such cases the desingularizing function can be taken of the form for some and , in which case it is usually referred to as a Łojasiewicz function. This property has been extensively exploited to provide convergence rates of optimization algorithms such as FBS, see . Further properties of and that ensure to satisfy such requirement are discussed in .
We first show how the KL property on ensures global convergence of the iterates of ZeroFPR if the linesearch is eventually monotone, \ieif for sufficiently large, and then show that linear convergence is attained when the KL function is actually a Łojasiewicz function with large enough exponent. {thm}[Global convergence (monotone LS)]Consider the iterates generated by ZeroFPR with for ’s large enough, and with directions satisfying
for some . Suppose that remains bounded, that has the KL property on , and that every cluster point is prox-regular. If is of class in a neighborhood of , then and are convergent to (the same point) , and the sequence of residuals is summable. {proof} From appendix B we know that is constant on the (nonempty) compact set . It then follows from [12, Lem. 6] that there exist and a uniformized KL function, namely a function satisfying LABEL:{def:KL1}, LABEL:, LABEL:{def:KL2}, LABEL: and LABEL:{def:KL3} for all and such that and . Let , which exists and is finite (cf. section 5.4), and let be such that for all . Then we have (cf. 5 and (19))
By possibly restricting , from item 2 and since is compact it follows that is differentiable in an -enlargement of . appendix B ensures that there exists such that for all we have and . For all such , by item 2 we have \nabla\varphi_{\gamma}(x^{k}){}={}Q_{\gamma}(x^{k})R_{\gamma}(x^{k}){}={}\bigl{[}I-\gamma\nabla^{2}f(x^{k})\bigr{]}r^{k} and the uniformized KL property yields
Letting \Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}>{}0, by concavity of and (32) it follows that
By telescoping the inequality it follows that is summable, hence, due to item 1, also is. Therefore, is a Cauchy sequence and as such it admits a limit, this being also the limit of in light of item 1 (and the fact that is also bounded). {thm}[Linear convergence (monotone LS)]Consider the iterates generated by ZeroFPR. Suppose that the hypothesis of Section 5.4.1 are satisfied, and that the KL function can be taken of the form for some . Then, and are -linearly convergent. {proof} From section 5.4.1 we know that and converge to the same (-critical) point, be it . Defining , from LABEL:{lem:Deltaxr} and LABEL:{lem:DeltaBarxr} we have
Therefore, the proof reduces to showing that converges with asymptotic -linear rate. Inequality (33) reads \varphi_{\gamma}(x^{k}){}-{}\varphi_{\star}{}\leq{}\bigl{[}(1+\gamma L_{f})\rho\theta\|r^{k}\|\bigr{]}^{\frac{1}{1-\theta}}, and since for large enough we have
Therefore, eventually \Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}<{}1 and from (34) we get
for some . Therefore, for large enough we have , \ie, proving asymptotic -linear convergence of .
4.2. Superlinear convergence
In the next result we show that under mild assumptions ZeroFPR exhibits superlinear rates of convergence if the directions satisfy a Dennis-Moré condition. Then, we show that the Broyden scheme (26) produces directions that satisfy such condition, and that due to the acceptance of unit stepsize , eventually each iteration of ZeroFPR will require only two evaluations of (cf. section 5.3). We remind that a sequence such that for all is said to be \DEFsuperlinearly convergent to if as .
[Superlinear convergence under Dennis-Moré condition]Suppose that Section 4.4 is strictly satisfied at a strong local minimum of , and consider the iterates generated by ZeroFPR. Suppose that converges to and that the directions satisfy the Dennis-Moré condition
Then, eventually stepsize is always accepted and the sequences , , and , converge with superlinear rate. {proof} From sections 4.3, 2, 4.4 and 3 we know that and are strictly differentiable at , with and that there exists a neighborhood of in which is differentiable and Lipschitz continuous. Since due to item 1, it holds that for all large enough. By single-valuedness of , for all such we may write and in place of and , respectively. In particular, since (cf. item 1), necessarily . In turn, due to (35) it also holds that . Let ; then,
and since , from (35) and strict differentiability of at applied on the first term on the right-hand side it follows that
By possibly restricting , nonsingularity of ensures the existence of a constant such that for all . Since , eventually . We have
A second-order expansion of at yields \mathtight
where the last equality is due to (37). Substracting,
where . Therefore, there exists such that for all ; in particular, for all such
where the second inequality follows from item 2, and the last one from (23) and the fact that . Therefore, for the linesearch condition (19) holds with , and unitary stepsize is always accepted. In particular, the limit (37) reads and from the inequality
superlinear convergence of follows. Since , then also converges superlinearly, and in turn, since , also does.
We conclude the section showing that employing Broyden directions (26) in ZeroFPR enables superlinear convergence rates, provided that is Lipschitz continuously semidifferentiable at the limit point (see ). {thm}[Superlinear convergence with Broyden directions]Suppose that Section 4.4 is (strictly) satisfied at a strong local minimum of at which is Lipschitz-continuously semidifferentiable. Consider the iterates generated by ZeroFPR with directions selected with Broyden method (26), and suppose that .
Then, the Dennis-Moré condition (35) is satisfied, and in particular all the claims of Section 5.4.2 hold. {proof} From section 4.4 we know that is strictly differentiable at the critical point and Lipschitz-continuously semidifferentiable there. Denoting ,
and since , due to [22, Lem. 2.2] there exists such that
In particular, due to sections 5.4.1 and B, is summable. Let and let denote the Frobenius norm. With a simple modification of the proofs of [22, Thm. 4.1] and [2, Lem. 4.4] that takes into account the scalar we obtain
The last term on the right-hand side, be it , is summable and therefore the sequence is bounded. Therefore,
where . Telescoping the inequality, summability of ensures that of proving in particular the claimed Dennis-Moré condition (35).
Simulations
We now present numerical results with the proposed method. In ZeroFPR we set , and for the nonmonotone linesearch we used the sequence where , , : in this way is computed as in .
We performed experiments with different choices of in step 3. In particular,
ZeroFPR(Broyden): , and obtained by the Broyden method (26) with ;
ZeroFPR(BFGS): , where is computed using BFGS updates (27);
ZeroFPR(L-BFGS): is computed using L-BFGS [34, Alg. 7.4] with memory .
We only show the results with full quasi-Newton updates (Broyden, BFGS) for one of the examples: for the other experiments we focus on L-BFGS, which is better suited for large-scale problems. Although is nonsymmetric at the critical points in general, we observed that the symmetric updates of BFGS and L-BFGS perform very well in practice and outperform the Broyden method.
We compared ZeroFPR with the forward-backward splitting algorithm (denoted FBS), that is (7), the inertial FBS (denoted IFBS) proposed in [14, Eq. (7)] (with parameter ), and the nonmonotone accelerated FBS (denoted AFBS) proposed in [26, Alg. 2] for fully nonconvex problems. All experiments were performed in MATLAB. The implementation of the methods used in the tests are available online.http://github.com/kul-forbes/ForBES
Here we consider the problem of finding a sparse solution to a least-squares problem. As discussed in , this is achieved by solving the following nonconvex problem:
2. Dictionary learning
Given a collection of signals of dimension , collected as columns in a matrix , we seek for a sparse representation of each of them as combination of a set of vectors , called dictionary atoms. To do so, we solve the following problem
Problem (39) takes the form (1) by letting and , where S{}={}\overbrace{S_{D}\times\ldots\times S_{D}}^{\text{ktimes}}{}\times{}\overbrace{S_{C}\times\ldots\times S_{C}}^{\text{mtimes}}, with
3. Matrix decomposition
We consider the problem of approximating a given matrix as the sum of a low-rank and a sparse component, by solving
This problem has application, for example, in the analysis of video imagery, specifically the separation of the background (fixed over time) scenery from the foreground (moving) objects in a series of video frames. In this case, matrix contains video frames (columns), each consisting of pixels, and , will respectively contain the background scenery and foreground objects identified in each frame. Therefore here and , while . The proximal mapping of is given by
Here, is the hard-thresholding operation, defined componentwise as
The set of matrices of rank at most is nonconvex and closed, and the projection onto it is given by , where are largest singular values of , and are the matrices of left and right singular vectors, respectively. Each computation of requires a partial SVD which is, from the computational perspective, the most significantly expensive operation in this case.
We applied this technique to a sequence of frames coming from the ShoppingMall dataset.http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html The footage consists of grayscale pixels frames, therefore the problem has variables in total. In problem (40) we used and . The results are shown in Figures 3 and 4. Also in this case, the fast asymptotic convergence of ZeroFPR(L-BFGS) is apparent.
Conclusions
The forward-backward envelope is a valuable tool for deriving efficient algorithms tackling nonsmooth and nonconvex problems of the form , as it can be used as a merit function to devise globally convergent linesearch methods solving the system of nonlinear equations defining the stationary points of .
ZeroFPR implements this idea, and we proved that it globally converges to a stationary point under the assumption that has the Kurdyka-Łojasiewicz property. Furthermore, if the linesearch directions satisfy the Dennis-Moré condition (for example, if they are determined according to the Broyden method), the convergence rate at strong local minima is superlinear.
Numerical simulations with the proposed method on convex and nonconvex problems confirm our theoretical results. Using Broyden method, BFGS (in the case of small-scale problems) and L-BFGS (for large-scale problems) to compute directions in ZeroFPR greatly outperform FBS and its accelerated variant. It is our belief that the surprising efficacy of (L-)BFGS is due to the fact that, under the appropriate assumptions, the Jacobian of at strong local minima is similar to a symmetric and positive definite matrix. Future investigation may better explain the effectiveness of symmetric update formulas in this framework.
References
Appendix A Proofs of Section 4
1: It follows from [37, Thm.s 3.8 and 4.1] that is (strictly) differentiable at iff (strictly) satisfies item 2. Consequently, if is of class around (and in particular strictly differentiable at [44, Cor. 9.19]), is (strictly) differentiable at with Jacobian as in (17) due to the chain rule of differentiation (and the fact that strict differentiability is preserved by composition). For and we have
The expression (15) of the second-order epi-derivative then implies for all (since for ). Therefore, , proving to be positive definite, and in particular invertible. We may now trace the proof of [45, Lem. 2.9] to infer that . Apparently, is symmetric and positive semidefinite.
2: Since is (strictly) continuous at and is (strictly) differentiable at , from [45, Lem. 6.2] we have that is (strictly) differentiable at , and (17) follows by the chain rule.
3: A simple application of the chain rule proves (18); moreover, combined with (17) we obtain , and since both and are symmetric, then so is .
[Proof of Section 4.4]We will show that all conditions are equivalent to either one of the following
, where and are as in section 4.4;
is similar to a symmetric and positive definite matrix.
4.4 4.4: trivial, since exists as shown in item 3.
4.4 thm:StrongMinimality(f): follows from [44, Thm. 13.24(c)], since
4.4 4.4: if , then is a (strong) local minimum for and, due to (18), necessarily is invertible. Conversely, if is a local minimum for , then . If, additionally, is invertible, then due to (18) is also invertible, and therefore positive definite.
4.4 thm:StrongMinimality(g): by comparing (17) and (18) we observe that is similar to the (symmetric) matrix , which is positive definite iff is.
thm:StrongMinimality(f) thm:StrongMinimality(g): the proof is the same as that of [45, Thm. 2.11(b)(c)].
4.4 thm:StrongMinimality(g): with similar reasonings as in the proof of the implications “4.4 thm:StrongMinimality(f) thm:StrongMinimality(g)”, we conclude that local minimality of for entails being similar to a symmetric and positive semidefinite matrix. Therefore, if is nonsingular, then it is similar to a symmetric and positive definite matrix.
4.4 4.4: trivial, since and (cf. items 1 and 1).
Appendix B Additional results for Section 5
Consider the iterates generated by ZeroFPR and suppose that the directions are selected so as to satisfy (31). Then,
in particular, and converge to 0.
where in the last inequality we used the fact that . This proves 1, and 2 trivially follows by the triangular inequality . Using this, 3 follows from item 1.
Consider the iterates generated by ZeroFPR. Suppose that (31) is satisfied and that the sequence is bounded. Then, are nonempty compact and connected sets over which and are constant and coincide. Moreover,
The sets of cluster points are nonempty because of boundedness of the sequences; in turn, connectedness and compactness as well as (41) are shown in [12, Rem. 5], which applies since and converge to (cf. item 3). Moreover, since converges to some value and as shown in section 5.4, it follows Item 1 that and coincide on (and equal ).
Suppose that Section 4.4 is satisfied at a strong local minimum of . Then, for any the FBE possesses the KL property at , and the desingularizing function can be taken of the form for some . {proof} From section 4.4 it follows that is a strong local minimum for at which is twice differentiable with . Let and . Since , from a second-order expansion of and a first-order expansion of we obtain that there exists a neighborhood of such that, for all , and , and in particular . Letting we obtain that is a KL function for at .