Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone line-search algorithms

Andreas Themelis, Lorenzo Stella, Panagiotis Patrinos

Introduction

In this paper we deal with optimization problems of the form

under the following assumptions, which will be valid throughout the paper without further mention. {ass}[Basic assumption]In problem (1)

fC1,1(Rn)f\in C^{1,1}(\R^{n}) (differentiable with LfL_{f}-Lipschitz continuous gradient);

\funcgRn\Rinf\func{g}{\R^{n}}{\Rinf} is proper, closed and γg\gamma_{g}-prox-bounded (see Section 2.1);

a solution exists, that is, arg minφ\argmin\varphi\neq\emptyset.

Both ff and gg are allowed to be nonconvex, making (1) prototypic for a plethora of applications spanning signal and image processing, machine learning, statistics, control and system identification. A well known algorithm addressing (1) is forward-backward splitting (FBS), also known as proximal gradient method. FBS has been thoroughly analyzed under the assumption of gg being convex. If moreover ff is convex, then FBS is known to converge globally with rate O(1/k)O(1/k) in terms of objective value, where kk is the iteration count. In this case, accelerated variants of FBS, also known as fast forward-backward splitting (FFBS), can be derived thanks to the work of Nesterov , that only require minimal additional computations per iteration but achieve the provably optimal global convergence rate of order o(1/k2)o(1/k^{2}) .

The work in pioneered an alternative acceleration technique. The method is based on an exact, real-valued penalty function for the original problem (1), namely the forward-backward envelope (FBE), defined as follows

The name forward-backward envelope comes from the fact that φγ(x)\varphi_{\gamma}(x) is the value of the minimization problem that defines the forward-backward step and alludes to the kinship that it has with the Moreau envelope. These claims will be addressed more in detail in Section 4. When ff is sufficiently smooth and both ff and gg are convex, the FBE was shown to be continuously differentiable and amenable to be minimized with generalized Newton methods. More recently, proposed a linesearch algorithm based on (L-)BFGS quasi-Newton directions for minimizing the FBE. The curvature information exploited by Newton-like methods acts as an online preconditioner, enabling superlinear rates of convergence under mild assumptions. However, unlike plain (F)FBS schemes, such methods require accessing second-order information of the smooth term ff (needed for the evaluation of φγ\nabla\varphi_{\gamma}), and are well defined only as long as the nonsmooth term gg is convex. On the contrary, FBS merely requires first-order information on ff and prox-boundedness of the nonsmooth term gg, in which case all accumulation points are stationary for φ\varphi, \iethey satisfy the first order necessary conditions .

In this paper we propose ZeroFPR, a nonmonotone linesearch algorithm that, to the best of our knowledge, is the first that (1) addresses the same range of problems as FBS, (2) requires the same black-box oracle as FBS (gradient of one function and proximity operator of the other), (3) yet achieves superlinear rates under mild assumptions (only) at the limit point. Though related to minFBE algorithm , ZeroFPR is conceptually different, mainly because it is gradient-free, in the sense that it does not require the gradient of the FBE. Moreover,

We provide the necessary theoretical background linking the concepts of stationarity of a point for problem (1), criticality and optimality. To the best of our knowledge, such an analysis was previously made only for the proximal point algorithm and for a special case of the projected gradient method .

The analysis of the FBE, previously studied only in the case of ff being C2(Rn)C^{2}(\R^{n}) and gg convex , is extended to ff and gg as in Equation 1. In particular, we provide mild assumptions on ff and gg that ensure (1) continuous differentiabilty of the FBE around critical points, (2) (strict) twice differentiability at critical points, and (3) equivalence of strong local minimality for the original function and the FBE.

Exploiting the investigated properties of the FBE and of critical points we prove that ZeroFPR with monotone linesearch converges (1) globally if φγ\varphi_{\gamma} has the Kurdyka-Łojasiewicz property , and (2) superlinearly when quasi-Newton Broyden directions are employed, under mild additional requirements at the limit point.

In Section 2 we introduce some notation and list some known facts about FBS. In Section 3 we define and explore notions of stationarity and criticality for the investigated problem and relate them with properties of the forward-backward operator. In Section 4 we extend the results of about the fundamental properties of the FBE to the more general setting addressed in this paper, where ff and gg satisfy Equation 1; for the sake of readability, some of the proofs are deferred to Appendix A. Section 5 addresses the core contribution of the paper, ZeroFPR; although arbitrary directions can be chosen, we specialize the results on superlinear convergence to a quasi-Newton Broyden method so as to truely maintain the same black-box oracle as FBS. Some ancillary results needed for the proofs are listed in Appendix B. Finally, Section 6 illustrates numerical results obtained with the proposed method.

Preliminaries

The identity n×nn\times n matrix is denoted as \id\id, and the extended real line as \Rinf=R{}\Rinf=\R\cup\set{\infty}. The open and closed ball of radius r0r\geq 0 centered in xRnx\in\R^{n} is denoted as \ballxr\ball xr and \cballxr\cball xr, respectively. Given a set EE and a sequence \seqxk\seq{x^{k}} we write \seqxkE\seq{x^{k}}\subset E with the obvious meaning of xkEx^{k}\in E for all kNk\in\N. The (possibly empty) set of cluster points of \seqxk\seq{x^{k}} is denoted as ω(\seqxk)\omega\left(\seq{x^{k}}\right), or simply as ω(xk)\omega(x^{k}) whenever the indexing is clear from the context. We say that \seqxkRn\seq{x^{k}}\subset\R^{n} is \DEFsummable if kNxk\sum_{k\in\N}\|x^{k}\| is finite, and \DEFsquare-summable if \seqxk2\seq{\|x^{k}\|^{2}} is summable.

Following the terminology of , we say that a function \funcfRnR\func{f}{\R^{n}}{\R} is strictly continuous at xˉ\bar{x} if lim supy,zxˉyzf(y)f(z)yz\limsup_{\begin{subarray}{c}y,z\to\bar{x}\\ y\neq z\end{subarray}}{\frac{|f(y)-f(z)|}{\|y-z\|}} is finite, and \DEFstrictly differentiable at xˉ\bar{x} if f(xˉ)\nabla f(\bar{x}) exists and limy,zxˉyzf(y)f(z)\innprodf(xˉ)yzyz=0\lim_{\begin{subarray}{c}y,z\to\bar{x}\\ y\neq z\end{subarray}}{\frac{f(y)-f(z)-\innprod{\nabla f(\bar{x})}{y-z}}{\|y-z\|}}{}={}0. The set of functions RnR\R^{n}\to\R with Lipschitz continuous gradient is denoted as C1,1(Rn)C^{1,1}(\R^{n}), and for fC1,1(Rn)f\in C^{1,1}(\R^{n}) we write LfL_{f} to indicate the Lipschitz modulus of f\nabla f.

For a proper, closed function \funcgRn\Rinf\func{g}{\R^{n}}{\Rinf}, a vector vg(x)v\in\partial g(x) is a \DEFsubgradient of gg at xx, where the \DEFsubdifferential g(x)\partial g(x) is considered in the sense of [44, Def. 8.3]

We have φ(x)=f(x)+g(x)\partial\varphi(x){}={}\nabla f(x){}+{}\partial g(x) and ^φ(x)=f(x)+^g(x)\hat{\partial}\varphi(x){}={}\nabla f(x){}+{}\hat{\partial}g(x) [44, Ex. 8.8(c)].

Given a parameter value γ>0\gamma>0, the \DEFMoreau envelope function gγg^{\gamma} and the \DEFproximal mapping \proxγg\prox_{\gamma g} are defined by

We now summarize some properties of gγg^{\gamma} and \proxγg\prox_{\gamma g}; the interested reader is referred to for a detailed discussion. A function \funcgRn\Rinf\func{g}{\R^{n}}{\Rinf} is \DEFprox-bounded if there exists γ>0\gamma>0 such that g+12γ2g+\tfrac{1}{2\gamma}\|{}\cdot{}\|^{2} is bounded below on Rn\R^{n}. The supremum of all such γ\gamma is the \DEFthreshold γg\gamma_{g} of prox-boundedness for gg. In particular, if gg is convex or bounded below then γg=\gamma_{g}=\infty. In general, for any γ(0,γg)\gamma\in(0,\gamma_{g}) the proximal mapping \proxγg\prox_{\gamma g} is nonempty- and compact-valued, and the Moreau envelope gγg^{\gamma} finite [44, Thm. 1.25].

Given a nonempty closed set SRnS\subseteq\R^{n} we let \func\indicatorSRn\Rinf\func{\indicator_{S}}{\R^{n}}{\Rinf} denote its \DEFindicator function, namely \indicatorS(x)=0\indicator_{S}(x)=0 if xSx\in S and \indicatorS(x)=\indicator_{S}(x)=\infty otherwise, and \ffunc\projSRnRn\ffunc{\proj_{S}}{\R^{n}}{\R^{n}} the (set-valued) \DEFprojection xarg minzSzxx\mapsto\argmin_{z\in S}\|z-x\|. Proximal mappings can be seen as generalized projections, due to the relation \projS=\proxγ\indicatorS\proj_{S}=\prox_{\gamma{\indicator_{S}}} for any γ>0\gamma>0.

For a set-valued mapping \ffuncTRnRn\ffunc{T}{\R^{n}}{\R^{n}} we let \graphT={(x,y)}[yT(x)]\graph T=\set{(x,y)}[y\in T(x)] denote its \DEFgraph, \zerT={xRn}[0T(x)]\zer T=\set{x\in\R^{n}}[0\in T(x)] the set of its \DEFzeros and \fixT={xRn}[xT(x)]\fix T=\set{x\in\R^{n}}[x\in T(x)] the set of its \DEFfixed-points.

2. Forward-backward iterations

holding for all x,zRnx,z\in\R^{n} [9, Prop. A.24], for any γ(0,\nicefrac1Lf)\gamma\in(0,\nicefrac{{1}}{{L_{f}}}) the function

furnishes a majorization model for φ\varphi, in the sense that

where \gamma\in\bigl{(}0,\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}\bigr{)} is the stepsize parameter. The (set-valued) \DEFforward-backward operator Tγf,gT_{\gamma}^{f,g} can be equivalently expressed as

Stationary and critical points

Unless φ\varphi is convex, the stationarity condition 0^φ(x)0\in\hat{\partial}\varphi(x^{\star}) in problem (1) is only necessary for the optimality of xx^{\star} [44, Thm. 10.1]. In this section we define different concepts of (sub)optimality and show how they are related for generic functions φ=f+g\varphi=f+g as in Equation 1. {defin} We say that a point x\domφx^{\star}\in\dom\varphi is

stationary if 0^φ(x)0\in\hat{\partial}\varphi(x^{\star});

critical if it is \DEFγ\gamma-critical for some γ(0,γg)\gamma\in(0,\gamma_{g}), \ieif xTγ(x)x^{\star}\in T_{\gamma}(x^{\star});

optimal if xarg minφx^{\star}\in\argmin\varphi, \ieif it solves (1).

The notion of criticality was already discussed in under the name of LL-stationarity (LL plays the role of \nicefrac1γ\nicefrac{{1}}{{\gamma}}) for the special case of g=\indicatorBCsg=\indicator_{B\cap C_{s}}, where BB is a convex set and CsC_{s} is the (nonconvex) set of vectors with at most ss nonzero entries.

If gg is convex, then γg=\gamma_{g}=\infty and we may talk of criticality without mention of γ\gamma: in this case, the properties of γ\gamma-criticality and stationarity are equivalent regardless of the value of γ\gamma. For more general functions gg, instead, the value of γ\gamma plays a role in determining whether a point is γ\gamma-critical or not, which legitimizes the following definition. {defin} The \DEFcriticality threshold is the function \funcΓf,gRn[0,γg]\func{\Gamma^{f,g}}{\R^{n}}{[0,\gamma_{g}]}

As usual, whenever ff and gg are clear from the context we simply write Γ\Gamma in place of Γf,g\Gamma^{f,g}. That Γγg\Gamma\leq\gamma_{g} is due to the fact that \proxγg\prox_{\gamma g} (and consequently TγT_{\gamma}) is everywhere empty-valued for γ>γg\gamma>\gamma_{g}. Considering also γ=0\gamma=0 forces the set in the definition to be nonempty, and the lower-bound Γ0\Gamma\geq 0 in particular; more precisely, observe that, by definition, Γ(x)>0\Gamma(x)>0 iff xx is a critical point.

Let us consider φ=f+g\varphi=f+g for f(x)=12x2f(x)=\frac{1}{2}x^{2} and g=\indicatorCg=\indicator_{C} where C={±1}C=\set{\pm 1}. Clearly, γg=+\gamma_{g}=+\infty (as gg is lower-bounded), Lf=1L_{f}=1 and ±1\pm 1 are both (unique) optima. Since ^φ(x)=R\hat{\partial}\varphi(x)=\R for xCx\in C and ^φ\hat{\partial}\varphi is clearly empty elsewhere, all points in CC are stationary. \proxγg\prox_{\gamma g} is the (set-valued) projection on CC, therefore the forward-backward operator is Tγ(x)=\projC((1γ)x)T_{\gamma}(x){}={}\proj_{C}((1-\gamma)x). We have

In particular, Γ(1)=Γ(1)=1\Gamma(1)=\Gamma(-1)=1. We now list some properties of critical and optimal points which will be used to derive regularity properties of TγT_{\gamma} and gγg^{\gamma}. {thm}[Properties of critical points]The following properties hold:

for γ(0,γg)\gamma\in(0,\gamma_{g}), a point xx^{\star} is γ\gamma-critical iff

if xx^{\star} is critical, then it is γ\gamma-critical for all γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star})); moreover, xx^{\star} is also Γ(x)\Gamma(x^{\star})-critical provided that Γ(x)<γg\Gamma(x^{\star})<\gamma_{g};

Tγ(x)={x}T_{\gamma}(x^{\star}){=}\set{x^{\star}} and Rγ(x)={0}R_{\gamma}(x^{\star}){=}\set{0} for any critical point xx^{\star} and γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star})).

By suitably rearranging, the claim readily follows.

2: due to 1, if xx^{\star} is γ\gamma-critical, apparently it is also γ\gamma^{\prime}-critical for any γ(0,γ]\gamma^{\prime}\in(0,\gamma]. From the definition (9) of the criticality threshold Γ(x)\Gamma(x^{\star}), it then follows that xx^{\star} is γ\gamma-critical for any γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star})). Suppose now that Γ(x)<γg\Gamma(x^{\star})<\gamma_{g}. Then, due to 1 for all γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star})) we have

By taking the limit as γΓ(x)\gamma\nearrow\Gamma(x^{\star}) we obtain that the inequality holds for Γ(x)\Gamma(x^{\star}) as well, proving the claim in light of the characterization 1.

3: let xx^{\star} be a critical point, and let xTγ(x)x\in T_{\gamma}(x^{\star}) for some γ<Γ(x)\gamma<\Gamma(x^{\star}). Fix γ(γ,Γ(x))\gamma^{\prime}\in(\gamma,\Gamma(x^{\star})). From 1 and 2 it then follows that

Since 12γ12γ>0\tfrac{1}{2\gamma}-\tfrac{1}{2\gamma^{\prime}}>0, necessarily x=xx=x^{\star}.

In the next result we show that criticality is a halfway property between stationarity and optimality. In light of these relations we shall seek “suboptimal” solutions which we characterize as critical points. {prop}[Optimality, criticality, stationarity]Let \bar{\gamma}{}\coloneqq{}\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}.

(criticality \Rightarrow stationarity) \fixTγ\zer^φ\fix T_{\gamma}\subseteq\zer\hat{\partial}\varphi for all γ(0,γg)\gamma\in(0,\gamma_{g});

(optimality \Rightarrow criticality) Γ(x)γˉ\Gamma(x^{\star})\geq\bar{\gamma} for all xarg minφx^{\star}\in\argmin\varphi; in particular, arg minφ\fixTγ\argmin\varphi\subseteq\fix T_{\gamma} for all γ(0,γˉ)\gamma\in(0,\bar{\gamma}), and also for γ=\nicefrac1Lf\gamma=\nicefrac{{1}}{{L_{f}}} if γg>\nicefrac1Lf\gamma_{g}>\nicefrac{{1}}{{L_{f}}};

1: let γ(0,γg)\gamma\in(0,\gamma_{g}) and x\fixTγx\in\fix T_{\gamma}. Since xx minimizes g+12γx+γf(x)2g+\tfrac{1}{2\gamma}\|{}\cdot{}-x+\gamma\nabla f(x)\|^{2}, we have 0{}\in{}\hat{\partial}\bigl{[}g+\tfrac{1}{2\gamma}\|{}\cdot{}-x+\gamma\nabla f(x)\|^{2}\bigr{]}(x){}={}\hat{\partial}g(x)+\nabla f(x){}={}\hat{\partial}\varphi(x), where the first inclusion follows from [44, Thm. 10.1] and the equalities from [44, Thm. 8.8(c)]. This proves that xx is stationary.

2: Fix γ(0,γˉ)\gamma\in(0,\bar{\gamma}), xarg minφx^{\star}\in\argmin\varphi and yTγ(x)y\in T_{\gamma}(x^{\star}). Necessarily y=xy=x^{\star}, otherwise, due to section 2.2, φ(y)\varphi(y) would contradict minimality of φ(x)\varphi(x^{\star}). Therefore, xx^{\star} is γ\gamma-critical and the claim follows from the arbitrarity of γ(0,γˉ)\gamma\in(0,\bar{\gamma}).

As already seen in Section 3, the bound \Gamma(x^{\star})\geq\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}} at optimal points in Item 2 is tight, and clearly the implication “optimality \Rightarrow criticality” cannot be reversed (consider, \egthe point x=0x^{\star}=0 for φ=cos\varphi=\cos). The next example shows that the other implication is also proper. {es}[Stationarity ⇏\not\Rightarrow criticality] Let f(x)=12x2f(x)=\frac{1}{2}x^{2} and g(x)=x\nicefrac53g(x)=x^{\nicefrac{{5}}{{3}}}. We have γg=+\gamma_{g}=+\infty, Lf=1L_{f}=1, and for x=0x^{\star}=0 it holds that ^φ(x)={φ(x)}={0}\hat{\partial}\varphi(x^{\star})=\set{\nabla\varphi(x^{\star})}=\set{0}. Therefore, xx^{\star} is stationary; however, T_{\gamma}(x^{\star})=\prox_{\gamma g}(0)=\set{-(\nicefrac{{5\gamma}}{{3}})^{3}}, and in particular xTγ(x)x^{\star}\notin T_{\gamma}(x^{\star}) for any γ>0\gamma>0, proving xx^{\star} to be non critical.

Forward-backward envelope

The FBE (2) was introduced in and further analyzed in in the case when gg is convex. Under such assumption the FBE was shown to be continuously differentiable, which made it possible to derive minimization algorithms based on its gradient. In the general setting addressed in this paper the FBE might fail to be (continuously) differentiable, and as such we need to resort to gradient-free methods. This task will be addressed in Section 5 where Section 5 will be proposed; other than being applicable to a wider range of problems, the proposed scheme is entirely based on the same oracle of forward-backward iterations, unlike the approaches in which instead require the computation of 2f\nabla^{2}f. All this will be possible thanks to continuity properties of the FBE, and to the behavior of φγ\varphi_{\gamma} at critical points. We now focus on its continuity, while the other property will be addressed shortly after in Section 4.2.

[Alternative expressions for φγ\varphi_{\gamma}] By expanding the square and rearranging the terms in the definition (2), φγ\varphi_{\gamma} can equivalently be expressed as

Comparing with (7), it is apparent that the set of minimizers zz in the above expression coincides with Tγ(x)T_{\gamma}(x), the forward-backward operator at xx. Moreover, taking out the constant term f(x)γ2f(x)2f(x){}-{}\tfrac{\gamma}{2}\|\nabla f(x)\|^{2} from the infimum we immediately obtain the following expression involving the Moreau envelope of gg:

Other than providing an explicit way of computing the FBE, (11) emphasizes how φγ\varphi_{\gamma} inherits the regularity properties of the Moreau envelope of gg. In particular, the next key property follows from the strict continuity of gγg^{\gamma} [44, Ex. 10.32]. {prop}[Strict continuity of φγ\varphi_{\gamma}]For any γ(0,γg)\gamma\in(0,\gamma_{g}), the FBE φγ\varphi_{\gamma} is a real-valued and strictly continuous function on Rn\R^{n}.

In the next section we will see the fundamental qualitative similarities between the FBE and the Moreau envelope. Namely, for γ\gamma small enough both φγ\varphi^{\gamma} and φγ\varphi_{\gamma} are lower bounds for the original function φ\varphi with same minimizers and minimum; in particular the minimization of φ\varphi is equivalent to that of φγ\varphi^{\gamma} or φγ\varphi_{\gamma}. Similarly, the identity

2. Basic properties

We now provide bounds relating φγ\varphi_{\gamma} to the original function φ\varphi that extend the well known inequalities involving the Moreau envelope. {prop}Let γ(0,γg)\gamma\in(0,\gamma_{g}) be fixed. Then

φ(xˉ)φγ(x)1γLf2γxxˉ2\varphi(\bar{x}){}\leq{}\varphi_{\gamma}(x){}-{}\tfrac{1-\gamma L_{f}}{2\gamma}\|x-\bar{x}\|^{2} for all xRnx\in\R^{n} and xˉTγ(x)\bar{x}\in T_{\gamma}(x).

1 is obvious from the definition of the FBE (consider z=xz=x in (2)). As to 2, since the set of minimizers in (2) is Tγ(x)T_{\gamma}(x) (cf. (12b)), (5) yields

With respect to the inequalities holding for convex gg treated in , the lower bound in Section 4.2 is weaker, while the upper bound unchanged. Regardless, an immediate consequence of the result is that the value of φ\varphi and φγ\varphi_{\gamma} at critical points is the same, and minimizers and infima of the two functions coincide for γ\gamma small enough. {thm}The following hold

φ(x)=φγ(x)\varphi(x)=\varphi_{\gamma}(x) for all γ(0,γg)\gamma\in(0,\gamma_{g}) and x\fixTγx\in\fix T_{\gamma};

infφ=infφγ\inf\varphi=\inf\varphi_{\gamma} and arg minφ=arg minφγ\argmin\varphi=\argmin\varphi_{\gamma} for all \gamma{}\in{}\bigl{(}0,\min\set{\nicefrac{{1}}{{L_{f}}},\gamma_{g}}\bigr{)}.

The bound γ<\nicefrac1Lf\gamma<\nicefrac{{1}}{{L_{f}}} in Item 2 is tight even when ff and gg are convex, as the counterexample with f(x)=12x2f(x)=\frac{1}{2}x^{2} and g=\indicatorR+g=\indicator_{\R_{+}} shows (see [45, Ex. 2.4] for details).

Although we will address problem (1) by simply exploiting the continuity of the FBE, nevertheless φγ\varphi_{\gamma} enjoys favorable properties which are key for the efficacy of the method which will be discussed in Section 5. Firstly, observe that, due to strict continuity, φγ\varphi_{\gamma} is almost everywhere differentiable, as it follows from Rademacher’s theorem. The same applies to the mapping xxγf(x)x\mapsto x-\gamma\nabla f(x), its Jacobian being

which is symmetric wherever it exists [44, Cor. 13.42 and Prop. 13.34]. However, in order to show that the proposed method achieves fast convergence we need additional regularity properties, namely (strict) twice differentiability at critical points and continuous differentiability around. The rest of the section is dedicated to this task.

3. Prox-regularity and first-order properties

In the favorable case in which gg is convex and fC2(Rn)f\in C^{2}(\R^{n}), the FBE enjoys global continuous differentiability . In our setting, \DEFprox-regularity acts as a surrogate of convexity; the interested reader is referred to [44, §13.F] for a detailed discussion. {defin}[Prox-regularity]Function gg is said to be \DEFprox-regular at x0x_{0} for v0g(x0)v_{0}\in\partial g(x_{0}) if there exist ρ,ε>0\rho,\varepsilon>0 such that for all x\ballx0εx^{\prime}\in\ball{x_{0}}{\varepsilon} and

it holds that g(x)g(x)+\innprodvxxρ2xx2g(x^{\prime}){}\geq{}g(x){}+{}\innprod{v}{x^{\prime}-x}{}-{}\tfrac{\rho}{2}\|x^{\prime}-x\|^{2}. Prox-regularity is a mild requirement enjoyed globally and for any subgradient by all convex functions, with ε=+\varepsilon=+\infty and ρ=0\rho=0. When gg is prox-regular at x0x_{0} for v0v_{0}, then for sufficiently small γ>0\gamma>0 the Moreau envelope gγg^{\gamma} is continuously differentiable in a neighborhood of x0+γv0x_{0}+\gamma v_{0} . To our purposes, when needed, prox-regularity of gg will be required only at critical points xx^{\star}, and only for the subgradient f(x)-\nabla f(x^{\star}). Therefore, with a slight abuse of terminology we define prox-regularity of critical points as follows. {defin}[Prox-regularity of critical points] We say that a critical point xx^{\star} is \DEFprox-regular if gg is prox-regular at xx^{\star} for f(x)-\nabla f(x^{\star}). Examples where a critical point fails to be prox-regular are of challenging construction; before illustrating a cumbersome such instance in Section 4.3, we first prove an important result that connects prox-regularity with first-order properties of the FBE. {thm}[Continuous differentiability of φγ\varphi_{\gamma}]Suppose that ff is of class C2C^{2} around a prox-regular critical point xx^{\star}. Then, for all γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star})) there exists a neighborhood UxU_{x^{\star}} of xx^{\star} on which the following properties hold:

TγT_{\gamma} and RγR_{\gamma} are strictly continuous, and in particular single-valued;

φγC1\varphi_{\gamma}\in C^{1} with φγ=QγRγ\nabla\varphi_{\gamma}{}={}Q_{\gamma}R_{\gamma}, where QγQ_{\gamma} is as in (13).

For γ(γ,Γ(x))\gamma^{\prime}\in(\gamma,\Gamma(x^{\star})), using LABEL:{prop:ProxGrad}, LABEL: and LABEL:{prop:SingleValuedFB} we obtain that

Replacing γ\gamma^{\prime} with γ\gamma in the above expression, the inequality is strict for all xxx\neq x^{\star}. From [40, Thm. 4.4] applied to the “tilted” function xg(x+x)g(x)\innprodf(x)xx{}\mapsto{}g(x+x^{\star}){}-{}g(x^{\star}){}-{}\innprod{\nabla f(x^{\star})}{x} it follows that there is a neighborhood VV of \Fwx\Fw{x^{\star}} in which \proxγg\prox_{\gamma g} is strictly continuous and gγg^{\gamma} is of class C1+C^{1+} with gradient gγ(x)=γ1(x\proxγg(x))\nabla g^{\gamma}(x){}={}\gamma^{-1}\left(x-\prox_{\gamma g}(x)\right) for all xVx\in V. By possibly narrowing UxU_{x^{\star}}, we may assume that fC2(Ux)f\in C^{2}(U_{x^{\star}}) and \FwxV\Fw x\in V for all xUxx\in U_{x^{\star}}. 2 then follows from (11) and the chain rule of differentiation, and 1 from the fact that strict continuity is preserved by composition.

When f=0f=0, Section 4.3 restates the known fact that if gg is prox-regular at xx^{\star} for 0g(x)0\in\partial g(x^{\star}), then gγg^{\gamma} is continuosly differentiable around xx^{\star} with gγ(x)=1γ(x\proxγg(x))\nabla g^{\gamma}(x)=\frac{1}{\gamma}(x-\prox_{\gamma g}(x)). Notice that the bound γ<Γ(x)\gamma<\Gamma(x^{\star}) is tight: in general, for γ=Γ(x)\gamma=\Gamma(x^{\star}) no continuity of TγT_{\gamma} nor continuous differentiability of φγ\varphi_{\gamma} around xx^{\star} can be guaranteed. In fact, even when xx^{\star} is Γ(x)\Gamma(x^{\star})-critical, TγT_{\gamma} might even fail to be single-valued and φγ\varphi_{\gamma} differentiable at xx^{\star}, as the following counterexample shows. {es}[Why γΓ(x)\gamma\neq\Gamma(x^{\star}) in first-order properties] Consider f=12x2f=\frac{1}{2}x^{2} and g=\indicatorSg=\indicator_{S} where S={0,1}S=\set{0,1}. Then, Lf=1L_{f}=1, γg=+\gamma_{g}=+\infty, Tγ(x)=\projS((1γ)x)T_{\gamma}(x)=\proj_{S}((1-\gamma)x) and the FBE is φγ(x)=1γ2x2+12γ\dist((1γ)x,S)2\varphi_{\gamma}(x){}={}\tfrac{1-\gamma}{2}\|x\|^{2}{}+{}\tfrac{1}{2\gamma}\dist((1-\gamma)x,S)^{2}. At the critical point x=1x=1, which satisfies Γ(1)=\nicefrac12\Gamma(1)=\nicefrac{{1}}{{2}}, gg is prox-regular for any subgradient. For any γ(0,\nicefrac12)\gamma\in(0,\nicefrac{{1}}{{2}}) it is easy to see that φγ\varphi_{\gamma} is differentiable in a neighborhood of x=1x=1. However, for γ=\nicefrac12\gamma=\nicefrac{{1}}{{2}} the distance function has a first-order singularity in x=1x=1, due to the 22-valuedness of Tγ(1)=\projS(\nicefrac12)={0,1}T_{\gamma}(1)=\proj_{S}(\nicefrac{{1}}{{2}})=\set{0,1}.

[Prox-nonregularity of critical points]Consider φ=f+g\varphi=f+g where f(x)=12x2f(x)=\tfrac{1}{2}x^{2}, g(x)=\indicatorS(x)g(x)=\indicator_{S}(x) and S=\set{\nicefrac{{1}}{{n}}}[n\in\N_{\geq 1}]\cup\set{0}. For x0=0x_{0}=0 we have Γ(x0)=+\Gamma(x_{0})=+\infty, however gg fails to be prox-regular at x0x_{0} for v0=0=f(x0)v_{0}=0=-\nabla f(x_{0}). For any ρ>0\rho>0 and for any neighborhood VV of (0,0)(0,0) in \graphg\graph g it is always possible to find a point arbitrarily close to (0,\nicefrac1ρ)(0,-\nicefrac{{1}}{{\rho}}) with multi-valued projection on VV. Specifically, the midpoint P_{n}{}={}\bigl{(}\frac{1}{2}(\frac{1}{n}+\frac{1}{n+1}),\,{}-\nicefrac{{1}}{{\rho}}\bigr{)} has 2-valued projection on \graphg\graph g for any nN1n\in\N_{\geq 1}, being it \proj_{\graph g}(P_{n})=\set{\nicefrac{{1}}{{n}},\nicefrac{{1}}{{n+1}}}. By considering a large nn, PnP_{n} can be made arbitrarily close to (0,\nicefrac1ρ)(0,-\nicefrac{{1}}{{\rho}}) and at the same time its projection(s) arbitrarily close to (0,0)(0,0). Therefore, gg cannot be prox-regular at for , for otherwise such projections would be single-valued close enough to (0,0)(0,0) [40, Cor. 3.4 and Thm. 3.5]. As a result, gγ(x)=12γ\dist(x,S)2g^{\gamma}(x)=\frac{1}{2\gamma}\dist(x,S)^{2} is not differentiable around x=0x=0, and indeed at each midpoint 12(1n+1n+1)\frac{1}{2}(\frac{1}{n}+\frac{1}{n+1}) for nN1n\in\N_{\geq 1} it has a nonsmooth spike.

To underline how unfortunate the situation depicted in Section 4.3 is, notice that adding a linear term λx\lambda x to ff for any λ0\lambda\neq 0, yet leaving gg unchanged, restores the desired prox-regularity of each critical point. Indeed, this is trivially true for any nonzero critical point; besides, gg is prox-regular at for any λ(0,)\lambda\in(0,-\infty), and for λ<0\lambda<0 we have that is nomore critical.

4. Second-order properties

In this section we discuss sufficient conditions for twice-differentiability of the FBE at critical points. Additionally to prox-regularity, which is needed for local continuous differentiability, we will also need generalized second-order properties of gg. The interested reader is referred to [44, §13] for an extensive discussion on epi-differentiability. {ass}With respect to a given critical point xx^{\star}

2f\nabla^{2}f exists and is (strictly) continuous around xx^{\star};

gg is prox-regular and (strictly) twice epi-differentiable at xx^{\star} for f(x)-\nabla f(x^{\star}), with its second order epi-derivative being generalized quadratic:

where SRnS\subseteq\R^{n} is a linear subspace and MRn×nM\in\R^{n\times n}. Without loss of generality we take MM symmetric, and such that Im(M)S\operatorname*{Im}(M)\subseteq S and ker(M)S\ker(M)\supseteq S^{\perp}.This can indeed be done without loss of generality: if MM and SS satisfy (15), then it suffices to replace MM with M=\projSM+\transM2\projSM^{\prime}=\proj_{S}\frac{M+\trans M}{2}\proj_{S} to ensure the desired properties.

We say that the assumptions are “strictly” satisfied if the stronger conditions in parenthesis hold.

We now show that the quite common properties required in Section 4.4 are all is needed for ensuring first-order properties of the proximal mapping and second-order properties of the FBE at critical points. {thm}[Twice differentiability of φγ\varphi_{\gamma}]Suppose that Section 4.4 is (strictly) satisfied with respect to a critical point xx^{\star}. Then, for any γ(0,Γ(x))\gamma\in(0,\Gamma(x^{\star}))

\proxγg\prox_{\gamma g} is (strictly) differentiable at \Fwx\Fw{x^{\star}} with symmetric and positive semidefinite Jacobian

RγR_{\gamma} is (strictly) differentiable at xx^{\star} with Jacobian

where QγQ_{\gamma} is as in (13) and PγP_{\gamma} as in (16);

φγ\varphi_{\gamma} is (strictly) twice differentiable at xx^{\star} with symmetric Hessian

See Appendix A. Again, when f0f\equiv 0 Section 4.4 covers the differentiability properties of the proximal mapping (and consequently the second-order properties of the Moreau envelope, due to the identity gγ(x)=1γ(x\proxγg(x))\nabla g^{\gamma}(x)=\frac{1}{\gamma}(x-\prox_{\gamma g}(x))) as discussed in .

We now provide a key result that links nonsingularity of the Jacobian of the forward-backward residual RγR_{\gamma} to strong (local) minimality for the original cost φ\varphi and for the FBE φγ\varphi_{\gamma}, under the generalized second-order properties of Section 4.4. {thm}[Conditions for strong local minimality]Suppose that Section 4.4 is satisfied with respect to a critical point xx^{\star}, and let \gamma{}\in{}(0,\min\set{\Gamma(x^{\star}),\nicefrac{{1}}{{L_{f}}}}). The following are equivalent:

xx^{\star} is a strong local minimum for φ\varphi;

xx^{\star} is a local minimum for φ\varphi and JRγ(x)JR_{\gamma}(x^{\star}) is nonsingular;

the (symmetric) matrix 2φγ(x)\nabla^{2}\varphi_{\gamma}(x^{\star}) is positive definite;

xx^{\star} is a strong local minimum for φγ\varphi_{\gamma};

xx^{\star} is a local minimum for φγ\varphi_{\gamma} and JRγ(x)JR_{\gamma}(x^{\star}) is nonsingular. {proof} See Appendix A.

ZeroFPR algorithm

The first algorithmic framework exploiting the FBE for solving composite minimization problems was studied in , and other schemes have been recently investigated in . All such methods tackle the problem by looking for a (local) minimizer of the FBE, exploting the equivalence of (local) minimality for the original function φ\varphi and for the FBE φγ\varphi_{\gamma}, for γ\gamma small enough. To do so, they all employ the concept of directions of descent, thus requiring the gradient of the FBE to be well defined everywhere. In the more general framework addressed in this paper, such basic requirement is not met, which is why we approach the problem from a different perspective. This leads to ZeroFPR, the first algorithm, to the best of our knowledge, that despite requiring only the black-box oracle of FBS and being suited for fully nonconvex problems it achieves superlinear convergence rates.

Instead of directly addressing the minimization of φ\varphi or φγ\varphi_{\gamma}, we seek solutions of the following nonlinear inclusion (generalized equation)

By doing so we address the problem from the same perspective of FBS, that is, finding fixed points of the forward-backward operator TγT_{\gamma} or, equivalently, zeros of its residual RγR_{\gamma}. Despite RγR_{\gamma} might be quite irregular when gg is nonconvex, it enjoys favorable properties at the very solutions to (20) — \ieat γ\gamma-critical points — starting from single-valuedness, cf. Item 3. If mild assumptions are met, RγR_{\gamma} turns out to be continuous around and even differentiable at critical points (cf. Sections 4.3 and 4.4), and as a consequence the inclusion problem (20) reduces to a well behaved system of equations, as opposed to generalized equations, when close to solutions.

This motivates addressing problem (20) with fast methods for nonlinear equations. Newton-like schemes are iterative methods that prescribe updates of the form

which essentially amount to selecting H=H(x)H=H(x), a linear operator that ideally carries information of the geometry of RγR_{\gamma} around xx, in the attempt to yield an optimal iterate x+x^{+}. For instance, when RγR_{\gamma} is sufficiently regular Newton method corresponds to selecting HH as the inverse of an element of the generalized Jacobian of RγR_{\gamma} at xx, enabling fast convergence when close to a solution under some assumptions. However, selecting HH as in Newton method would require information additional to the forward-backward oracle TγT_{\gamma}, and as such it goes beyond the scope of the paper. For this reason we focus instead on quasi-Newton schemes, in which HH are linear operators recursively defined with low-rank updates that satisfy the (inverse) secant condition

A famous result states that, under mild assumptions and starting sufficiently close to a solution xx^{\star}, updates as in (21) are superlinearly convergent to xx^{\star} iff the Dennis-Moré condition holds, namely the limit (H1JRγ(x))ss0\frac{\|(H^{-1}-JR_{\gamma}(x^{\star}))s\|}{\|s\|}{}\to{}0. More recently, in the result was extended to generalized equations of the form f(x)+G(x)0f(x)+G(x)\ni 0, where ff is smooth and GG possibly set-valued. The study focuses on Josephy-Newton methods where the update x+x^{+} is the solution of the inner problem f(x)BxBx++G(x+)f(x)-Bx\in Bx^{+}+G(x^{+}), where B=H1B=H^{-1}, which can be interpreted as a forward-backward step in the metric induced by BB. In particular, differently from the here proposed ZeroFPR, the method in has the crucial limitation that, unless the operator BB has a very particular structure, the backward step (B+G)1(B+G)^{-1} may be prohibitely challenging.

Quasi-Newton schemes are powerful and widely used methods. However, it is well known that they are effective only when close enough to a solution and might even diverge otherwise. To cope with this crucial downside there comes the need of a globalization strategy; this is usually addressed by means of a linesearch over a suitable merit function ψ\psi, along directions of descent for ψ\psi so as to ensure sufficient decrease for small enough stepsizes. Unfortunately, the potential choice ψ(x)=12Rγ(x)2\psi(x)=\frac{1}{2}\|R_{\gamma}(x)\|^{2} is not regular enough for a ‘direction of descent’ to be everywhere defined. The proposed Section 5 bypasses this limitation by exploiting the favorable properties of the FBE.

Globalizing the convergence of any fast local method is the core contribution of ZeroFPR, an algorithm that exploits the favorable properties of the FBE, and that requires exactly the same oracle of FBS. Conceptually, ZeroFPR is really elementary; for simplicity, let us first consider the monotone case, \iewith pk1p_{k}\equiv 1 so that Φˉk=φγ(xk)\bar{\Phi}_{k}=\varphi_{\gamma}(x^{k}) (cf. 5). The following steps are executed for updating iterate xkx^{k}:

first, at 1 a nominal forward-backward call yields an element xˉkTγ(xk)\bar{x}^{k}\in T_{\gamma}(x^{k}) that decreases the value of φγ\varphi_{\gamma} by at least γ1γLf2rk2\gamma\frac{1-\gamma L_{f}}{2}\|r^{k}\|^{2} (item 1);

then, at 3 an update direction dkd^{k} at xˉk\bar{x}^{k} (not at xkx^{k}!) is selected;

because of the sufficient decrease xkxˉkx^{k}\mapsto\bar{x}^{k} on φγ\varphi_{\gamma} and the continuity of φγ\varphi_{\gamma}, at 4 a stepsize τk\tau_{k} can be found with finite many backtrackings τkβτk\tau_{k}\leftarrow\beta\tau_{k} that ensures a decrease for φγ\varphi_{\gamma} of at least σrk2\sigma\|r^{k}\|^{2} in the update xkxˉk+τkdkx^{k}\mapsto\bar{x}^{k}+\tau_{k}d^{k}, for any σ<1γLf2\sigma<\frac{1-\gamma L_{f}}{2}.

In order to reduce the number of backtrackings, pk<1p_{k}<1 can be selected resulting in a nonmonotone linesearch. The sufficient decrease is enforced with respect to a parameter Φˉkφγ(xk)\bar{\Phi}_{k}\geq\varphi_{\gamma}(x^{k}) (cf. section 5.1.1), namely a convex combination of {φγ(xi)}i=0k\set{\varphi_{\gamma}(x^{i})}_{i=0}^{k}. For the sake of convergence, \seqpk\seq{p_{k}} can be selected arbitrarily in (0,1](0,1] as long as it is bounded away from , hence the role of the user-set lower bound pminp_{\rm min}. Consequently, small values of σ\sigma and pkp_{k} concur in reducing conservatism in the linesearch by favoring larger stepsizes. {lem}[Nonmonotone linesearch globalization]For all kNk\in\N the iterates generated by ZeroFPR satisfy

and there exists τˉk>0\bar{\tau}_{k}>0 such that

In particular, the number of backtrackings at 4 is finite. {proof} The first two inequalities in (23) are due to item 2. Moreover,

where the inequality follows by the linesearch condition (19); this proves the last inequality in (23). As to (24), let kk be fixed and contrary to the claim suppose that for all ε>0\varepsilon>0 there exists τε[0,ε]\tau_{\varepsilon}\in[0,\varepsilon] such that the point xε=xˉk+τεdkx_{\varepsilon}{}={}\bar{x}^{k}+\tau_{\varepsilon}d^{k} satisfies φγ(xε)>φγ(xk)σxkxˉk2\varphi_{\gamma}(x_{\varepsilon}){}>{}\varphi_{\gamma}(x^{k}){}-{}\sigma\|x^{k}-\bar{x}^{k}\|^{2}. Taking the limit for ε0+\varepsilon\to 0^{+}, continuity of φγ\varphi_{\gamma} as ensured by eq. 11 yields

where the last inequality is due to the fact that xkxˉkx^{k}\neq\bar{x}^{k}. This contradicts item 2; therefore, there exists τˉk>0\bar{\tau}_{k}>0 such that φγ(xˉk+τdk)φγ(xk)σxkxˉk2\varphi_{\gamma}(\bar{x}^{k}+\tau d^{k}){}\leq{}\varphi_{\gamma}(x^{k}){}-{}\sigma\|x^{k}-\bar{x}^{k}\|^{2} for all τ[0,τˉk]\tau\in[0,\bar{\tau}_{k}]. By combining this with (23) the claim follows. Section 5.1.1 ensures that regardless of the choice of dkd^{k}, ZeroFPR does not get stuck in infinite loops. In Section 5.4 we will also show that the algorithm returns solutions of problem (20), and that under mild assumptions at the limit point the convergence rate is superlinear when good directions are selected at 3. Before going through the technicalities, we briefly anticipate what such good directions are.

1.2. Choice of the directions: quasi-Newton methods

As already emphasized, fast convergence of ZeroFPR will be obtained thanks to the employment of Newton-like directions dkd^{k}. Differently from the classical Newton-like step (21), when stepsize 11 is accepted, the update in ZeroFPR is of the form x+=xˉ+dx^{+}=\bar{x}+d rather than x+=x+dx^{+}=x+d, where xˉ\bar{x} is an element of Tγ(x)T_{\gamma}(x). Therefore, dd needs to be a Newton-like direction at xˉ\bar{x}, and not at xx, namely

(as opposed to rˉkRγ(xk)\bar{r}^{k}\in R_{\gamma}(x^{k})).

We consider a modified Broyden’s scheme that performs rank-one updates of the form

and ϑˉ(0,1)\bar{\vartheta}\in(0,1) is a fixed parameter, with the convention \sign0=1\sign 0=1. Starting from an invertible matrix H0H_{0} this selection ensures that all matrices HkH_{k} are invertible.

BFGS method consists in the following update rule for matrices HkH_{k} in (25): starting from a symmetric and positive definite H0H_{0},

with sk=xk+1xˉks_{k}=x^{k+1}-\bar{x}^{k} and yk=rk+1rˉky_{k}=r^{k+1}-\bar{r}^{k}, see \eg[34, §6.1]. BFGS is the most popular quasi-Newton scheme; it is based on rank-two updates that, additionally to the secant condition, enforce also symmetricity. In fact, BFGS is guaranteed to satisfy the Dennis-Moré condition only provided that the Jacobian of the nonlinear system at the limit point is symmetric . Although this is not the case for JRγ(x)JR_{\gamma}(x^{\star}), we observed in practice that BFGS directions (27) perform extremely well.

Ultimately, instead of storing and operating on dense m×mm\times m matrices, limited-memory variants of quasi-Newton schemes keep in memory only a few (usually 33 to 2020) most recent pairs (sk,yk)(s^{k},y^{k}) implicitly representing the approximate inverse Jacobian. Their employment considerably reduces storage and computations over the full-memory counterparts, and as such they are the methods of choice for large-scale problems. The most popular limited-memory method is L-BFGS: based on BFGS, it efficiently computes matrix-vector products with the approximate inverse Jacobian using a two-loop recursion procedure .

2. Connections with other methods

The first algorithmic framework exploiting the FBE was studied in , where two semismooth Newton methods were analyzed for convex ff and gg with fC2,1(Rn)f\in C^{2,1}(\R^{n}) (twice continuously differentiable with Lipschitz continuous gradient). A generalization of the scheme was then studied in under less restrictive assumptions, with particular attention to quasi-Newton directions in place of semismooth Newton methods. The proposed algorithm interleaves descent steps over the FBE with forward-backward steps. then analyzed global and linear convergence properties of a generic linesearch algorithmic framework for minimizing the FBE based on gradient-related directions, for analytic ff and subanalytic, convex, and lower bounded gg.

Though apparently closely related, the approach that we provide in this paper presents major conceptual differences from any of the ones above. Apart from the significantly less restrictive assumptions, the crucial distinction is that our method is derivative-free, \ieit does not require the gradient of the FBE. As a consequence, no computation nor the existence of 2f\nabla^{2}f is required, resulting in a method that, differently from the others, truly relies on the very same oracle information of the forward-backward operator TγT_{\gamma}.

3. Main remarks

In this section we list a few observations that come in handy when implementing ZeroFPR. {rem}[Adaptive variant when LfL_{f} is unknown]In practice, no prior knowledge of the global Lipschitz constant LfL_{f} is required for ZeroFPR. In fact, replacing LfL_{f} with an initial estimate L>0L>0 and fixing a backtracking ratio α(0,1)\alpha\in(0,1), after 2 the following instruction can be added:

[Support for locally Lipschitz f\nabla f]If \domg\dom g is bounded and, as it is reasonable, the directions \seqdk\seq{d^{k}} selected at 3 do not diverge, then Item 1 on ff can be relaxed to f\nabla f being locally Lipschitz.

In fact, it follows from the definition of proximal mapping that \seqxˉk\domg\seq{\bar{x}^{k}}\subseteq\dom g, and if the directions are bounded then there exists a compact domain Ω\domg\Omega\supseteq\dom g such that \seqxkΩ\seq{x^{k}}\subseteq\Omega. Then, all results of the paper apply by replacing LfL_{f} with \lipΩf\lip_{\Omega}\nabla f, the (finite) Lipschitz constant of f\nabla f on Ω\Omega.

[Cost per iteration]Evaluating φγ\varphi_{\gamma} essentially amounts to one evaluation of TγT_{\gamma}; this is evident from the expression (11), together with the observation that gγ(\Fwx)=g(xˉ)+12γ\Fwxxˉ2g^{\gamma}(\Fw x){}={}g(\bar{x})+\tfrac{1}{2\gamma}\|\Fw x-\bar{x}\|^{2} for any xˉTγ(x)\bar{x}\in T_{\gamma}(x). Therefore, computing φγ(xˉk+τkdk)\varphi_{\gamma}(\bar{x}^{k}+\tau_{k}d^{k}) at 4 yields an element xˉk+1Tγ(xk+1)\bar{x}^{k+1}\in T_{\gamma}(x^{k+1}) required in 1, since xk+1=xˉk+τkdkx^{k+1}=\bar{x}^{k}+\tau_{k}d^{k} at every iteration. In general, one evaluation of TγT_{\gamma} per backtracking step is required. If the directions dkd^{k} are computed with Broyden or BFGS methods (26) and (27), then one additional evaluation of TγT_{\gamma} is required for retrieving dkd^{k}; in the best case of τk=1\tau_{k}=1 being accepted, which asymptotically happens under mild assumptions (cf. section 5.4.2), the algorithm then requires exactly two evaluations of TγT_{\gamma} per iteration.

[Extension of FBS] Observe that by selecting dk0d^{k}\equiv 0 ZeroFPR reduces to the classical FBS algorithm. Item 2 combined with the relation φγ(xk)Φˉk\varphi_{\gamma}(x^{k})\leq\bar{\Phi}_{k} due to (23) shows that the condition at 4 is always statisfied (with τk=1\tau_{k}=1). Therefore, xk+1=xˉk+dk=xˉkTγ(xk)x^{k+1}{}={}\bar{x}^{k}+d^{k}{}={}\bar{x}^{k}{}\in{}T_{\gamma}(x^{k}) for all kk, which is FBS, cf. (7).

4. Convergence results

In this section we analyze the properties of cluster points of the iterates generated by ZeroFPR. Specifically,

every cluster point of \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} solves problem (20) (Section 5.4);

if the linesearch is (eventually) monotone, then global and linear convergence are achieved under mild assumptions (Sections 5.4.1 and 5.4.1);

directions satisfying the Dennis-Moré condition, such as Broyden’s, enable superlinear rates under mild assumptions (Sections 5.4.2 and 5.4.2).

In what follows, we exclude the trivial case in which the optimality condition rk=0r^{k}=0 is achieved in a finite number of iterations, and therefore assume rk0r^{k}\neq 0 for all kk. {thm}[Criticality of cluster points]The following hold for the iterates generated by ZeroFPR:

rk0r^{k}\to 0 square-summably, and all cluster points of \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} are critical; more precisely, ω(xk)=ω(xˉk)\fixTγ\omega(x^{k})=\omega(\bar{x}^{k})\subseteq\fix T_{\gamma};

\seqφγ(xk)\seq{\varphi_{\gamma}(x^{k})} converges to a (finite) value φ\varphi_{\star}, and so does \seqφ(xˉk)\seq{\varphi(\bar{x}^{k})} if \seqxk\seq{x^{k}} is bounded.

By telescoping the above inequality and using (23), we obtain

proving rk0r^{k}\to 0 square-summably. Suppose now that \seqxk[kK]x\seq{x^{k}}[k\in K]\to x^{\prime} for some xRnx^{\prime}\in\R^{n} and KNK\subseteq\N. Then, since xˉkxk=γrk0\|\bar{x}^{k}-x^{k}\|{}={}\gamma\|r^{k}\|{}\to{}0, in particular \seqxˉk[kK]x\seq{\bar{x}^{k}}[k\in K]\to x^{\prime} as well. Due to the arbitrarity of the cluster point xx^{\prime} it follows that ω(xk)ω(xˉk)\omega(x^{k})\subseteq\omega(\bar{x}^{k}), and a similar reasoning proves the converse inclusion, hence ω(xk)=ω(xˉk)\omega(x^{k})=\omega(\bar{x}^{k}). Moreover, we have xk\cballxˉkγrk\FBxk+\cball0γrkx^{k}{}\in{}\cball{\bar{x}^{k}}{\gamma\|r^{k}\|}{}\subseteq{}\FB{x^{k}}{}+{}\cball{0}{\gamma\|r^{k}\|} and since \seqxkγf(xk)[kK]xγf(x)\seq{x^{k}{}-{}\gamma\nabla f(x^{k})}[k\in K]{}\to{}x^{\prime}{}-{}\gamma\nabla f(x^{\prime}), from the outer semicontinuity of \proxγg\prox_{\gamma g} [44, Ex. 5.23(b)] it follows that x\FBxx^{\prime}\in\FB{x^{\prime}}, \iex\fixTγx^{\prime}\in\fix T_{\gamma}.

2: from (28) it follows that \seqΦˉk\seq{\bar{\Phi}_{k}} is decreasing, and in particular its limit exists, be it φ\varphi_{\star}. Due to (23), necessarily φinfφ>\varphi_{\star}\geq\inf\varphi>-\infty, therefore

proving that φγ(xk+1)φ\varphi_{\gamma}(x^{k+1})\to\varphi_{\star}. If \seqxk\seq{x^{k}} is bounded, then so is \seqxˉk\seq{\bar{x}^{k}} due to compact-valuedness of \proxγg\prox_{\gamma g} [44, Thm. 1.25]. Due to eq. 11 φγ\varphi_{\gamma} is LL-Lipschitz continuous on a compact set containing \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} for some L>0L>0. Then,

where the inequalities follow from section 4.2. Consequently, \seqφ(xˉk)φ\seq{\varphi(\bar{x}^{k})}\to\varphi_{\star} as well.

If follows from (23) and the fact that \seqΦˉk\seq{\bar{\Phi}_{k}} is a decreasing sequence (cf. (28)), that the iterates of ZeroFPR satisfy φ(xˉk)Φˉ0=φ(xˉ0)\varphi(\bar{x}^{k})\leq\bar{\Phi}_{0}=\varphi(\bar{x}^{0}). As a consequence, a sufficient condition for ensuring that the sequence \seqxˉk\seq{\bar{x}^{k}} does not diverge — and consequently nor does \seqxk\seq{x^{k}} provided that the sequence of directions \seqdk\seq{d^{k}} is bounded — is that the level set {φφ(xˉ0)}\set{\varphi\leq\varphi(\bar{x}^{0})} is compact. In the adaptive variant discussed in Section 5.3, this translates to boundedness of the level set {φφ(xˉk0)}\set{\varphi\leq\varphi(\bar{x}^{k_{0}})}, where k0k_{0} denotes the iteration starting from which γ\gamma is constant. Since such point is unknown a priori, the sufficient condition needs be strengthened to φ\varphi having bounded level sets.

We now show that if φγ\varphi_{\gamma} is well-behaved at cluster points, then the whole sequence generated by ZeroFPR is convergent. Good behavior involves the existence of a desingularizing function, that is, φγ\varphi_{\gamma} needs to possess the Kurdyka-Łojasiewicz property, a mild requirement that we restate here for the reader’s convenience. {defin}[KL property]A proper and lower semicontinuous function \funchRn\Rinf\func{h}{\R^{n}}{\Rinf} has the Kurdyka-Łojasiewicz property (KL property) at x\domhx^{\star}\in\dom\partial h if there exist a concave desingularizing function (or KL function) \funcψ[0,η][0,+)\func{\psi}{[0,\eta]}{[0,+\infty)} for some η>0\eta>0 and a neighborhood UxU_{x^{\star}} of xx^{\star}, such that

ψ\psi is C1C^{1} with ψ>0\psi^{\prime}>0 on (0,η)(0,\eta);

for all xUxx\in U_{x^{\star}} s.t. h(x)<h(x)<h(x)+ηh(x^{\star})<h(x)<h(x^{\star})+\eta it holds that

The KL property is a mild requirement enjoyed by semi-algebraic functions and by subanalytic functions which are continuous on their domain see also . Moreover, since semi-algebraic functions are closed under parametric minimization, from the expression (2) it is apparent that φγ\varphi_{\gamma} is semi-algebraic provided that ff and gg are. More precisely, in all such cases the desingularizing function can be taken of the form ψ(s)=ρsθ\psi(s)=\rho s^{\theta} for some ρ>0\rho>0 and θ(0,1]\theta\in(0,1], in which case it is usually referred to as a Łojasiewicz function. This property has been extensively exploited to provide convergence rates of optimization algorithms such as FBS, see . Further properties of ff and gg that ensure φγ\varphi_{\gamma} to satisfy such requirement are discussed in .

We first show how the KL property on φγ\varphi_{\gamma} ensures global convergence of the iterates of ZeroFPR if the linesearch is eventually monotone, \ieif pk=1p_{k}=1 for kk sufficiently large, and then show that linear convergence is attained when the KL function is actually a Łojasiewicz function with large enough exponent. {thm}[Global convergence (monotone LS)]Consider the iterates generated by ZeroFPR with pk=1p_{k}=1 for kk’s large enough, and with directions satisfying

for some D0D\geq 0. Suppose that \seqxk\seq{x^{k}} remains bounded, that φγ\varphi_{\gamma} has the KL property on ω(xk)\omega(x^{k}), and that every cluster point is prox-regular. If ff is of class C2C^{2} in a neighborhood of ω(xk)\omega(x^{k}), then \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} are convergent to (the same point) xx^{\star}, and the sequence of residuals \seqrk\seq{r^{k}} is summable. {proof} From appendix B we know that φγ\varphi_{\gamma} is constant on the (nonempty) compact set ω(xk)\omega(x^{k}). It then follows from [12, Lem. 6] that there exist η,ε>0\eta,\varepsilon>0 and a uniformized KL function, namely a function ψ\psi satisfying LABEL:{def:KL1}, LABEL:, LABEL:{def:KL2}, LABEL: and LABEL:{def:KL3} for all xω(xk)x^{\star}\in\omega(x^{k}) and xx such that \dist(x,ω(xk))<ε\dist(x,\omega(x^{k}))<\varepsilon and φ(x)<φ(x)<φ(x)+η\varphi(x^{\star})<\varphi(x)<\varphi(x^{\star})+\eta. Let φlimkφγ(xk)\varphi_{\star}{}\coloneqq{}\lim_{k\to\infty}\varphi_{\gamma}(x^{k}), which exists and is finite (cf. section 5.4), and let k1Nk_{1}\in\N be such that pk=1p_{k}=1 for all kk1k\geq k_{1}. Then we have (cf. 5 and (19))

By possibly restricting ε\varepsilon, from item 2 and since ω(xk)\omega(x^{k}) is compact it follows that φγ\varphi_{\gamma} is differentiable in an ε\varepsilon-enlargement of ω(xk)\omega(x^{k}). appendix B ensures that there exists k2k1k_{2}\geq k_{1} such that for all kk2k\geq k_{2} we have φ<φγ(xk)<φ+η\varphi_{\star}{}<{}\varphi_{\gamma}(x^{k}){}<{}\varphi_{\star}+\eta and \dist(xk,ω(xk))<ε\dist(x^{k},\omega(x^{k})){}<{}\varepsilon. For all such kk, by item 2 we have \nabla\varphi_{\gamma}(x^{k}){}={}Q_{\gamma}(x^{k})R_{\gamma}(x^{k}){}={}\bigl{[}I-\gamma\nabla^{2}f(x^{k})\bigr{]}r^{k} and the uniformized KL property yields

Letting \Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}>{}0, by concavity of ψ\psi and (32) it follows that

By telescoping the inequality it follows that \seqrk\seq{\|r^{k}\|} is summable, hence, due to item 1, also \seqxk+1xk\seq{\|x^{k+1}-x^{k}\|} is. Therefore, \seqxk\seq{x^{k}} is a Cauchy sequence and as such it admits a limit, this being also the limit of \seqxˉk\seq{\bar{x}^{k}} in light of item 1 (and the fact that \seqxˉk\seq{\bar{x}^{k}} is also bounded). {thm}[Linear convergence (monotone LS)]Consider the iterates generated by ZeroFPR. Suppose that the hypothesis of Section 5.4.1 are satisfied, and that the KL function can be taken of the form ψ(s)=ρsθ\psi(s)=\rho s^{\theta} for some θ[\nicefrac12,1]\theta\in[\nicefrac{{1}}{{2}},1]. Then, \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} are RR-linearly convergent. {proof} From section 5.4.1 we know that \seqxk\seq{x^{k}} and \seqxˉk\seq{\bar{x}^{k}} converge to the same (γ\gamma-critical) point, be it xx^{\star}. Defining BkikriB_{k}{}\coloneqq{}\sum_{i\geq k}\|r^{i}\|, from LABEL:{lem:Deltaxr} and LABEL:{lem:DeltaBarxr} we have

Therefore, the proof reduces to showing that \seqBk\seq{B_{k}} converges with asymptotic QQ-linear rate. Inequality (33) reads \varphi_{\gamma}(x^{k}){}-{}\varphi_{\star}{}\leq{}\bigl{[}(1+\gamma L_{f})\rho\theta\|r^{k}\|\bigr{]}^{\frac{1}{1-\theta}}, and since rk0r^{k}\to 0 for large enough kk we have

Therefore, eventually \Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}<{}1 and from (34) we get

for some C>0C>0. Therefore, for large enough kk we have BkCrk=C(BkBk+1)B_{k}{}\leq{}C\|r^{k}\|{}={}C(B_{k}-B_{k+1}), \ieBk+1(1\nicefrac1C)BkB_{k+1}{}\leq{}(1-\nicefrac{{1}}{{C}})B_{k}, proving asymptotic QQ-linear convergence of BkB_{k}.

4.2. Superlinear convergence

In the next result we show that under mild assumptions ZeroFPR exhibits superlinear rates of convergence if the directions satisfy a Dennis-Moré condition. Then, we show that the Broyden scheme (26) produces directions that satisfy such condition, and that due to the acceptance of unit stepsize τk=1\tau_{k}=1, eventually each iteration of ZeroFPR will require only two evaluations of TγT_{\gamma} (cf. section 5.3). We remind that a sequence \seqxk\seq{x^{k}} such that xkxx^{k}\neq x^{\star} for all kk is said to be \DEFsuperlinearly convergent to xx^{\star} if xk+1x/xkx0\|x^{k+1}-x^{\star}\|/\|x^{k}-x^{\star}\|{}\to{}0 as kk\to\infty.

[Superlinear convergence under Dennis-Moré condition]Suppose that Section 4.4 is strictly satisfied at a strong local minimum xx^{\star} of φ\varphi, and consider the iterates generated by ZeroFPR. Suppose that \seqxk\seq{x^{k}} converges to xx^{\star} and that the directions \seqdk\seq{d^{k}} satisfy the Dennis-Moré condition

Then, eventually stepsize τk=1\tau_{k}=1 is always accepted and the sequences \seqxk\seq{x^{k}}, \seqxˉk\seq{\bar{x}^{k}}, and \seqrk\seq{r^{k}}, converge with superlinear rate. {proof} From sections 4.3, 2, 4.4 and 3 we know that φγ\nabla\varphi_{\gamma} and RγR_{\gamma} are strictly differentiable at xx^{\star}, with G2φγ(x)=Qγ(x)JRγ(x)0,G_{\star}{}\coloneqq{}\nabla^{2}\varphi_{\gamma}(x^{\star}){}={}Q_{\gamma}(x^{\star})JR_{\gamma}(x^{\star}){}\succ{}0, and that there exists a neighborhood UxU_{x^{\star}} of xx^{\star} in which φγ\varphi_{\gamma} is differentiable and RγR_{\gamma} Lipschitz continuous. Since xˉk=xkγrkx\bar{x}^{k}=x^{k}-\gamma r^{k}\to x^{\star} due to item 1, it holds that xk,xˉkUxx^{k},\bar{x}^{k}\in U_{x^{\star}} for all kk large enough. By single-valuedness of RγR_{\gamma}, for all such kk we may write Rγ(xk)R_{\gamma}(x^{k}) and Rγ(xˉk)R_{\gamma}(\bar{x}^{k}) in place of rkr^{k} and rˉk\bar{r}^{k}, respectively. In particular, since x\fixTγx^{\star}\in\fix T_{\gamma} (cf. item 1), necessarily Rγ(xˉk)0R_{\gamma}(\bar{x}^{k})\to 0. In turn, due to (35) it also holds that dk0d^{k}\to 0. Let x0k+1xˉk+dkx^{k+1}_{0}\coloneqq\bar{x}^{k}+d^{k}; then,

and since x0k+1xˉk=dk0x^{k+1}_{0}-\bar{x}^{k}=d^{k}\to 0, from (35) and strict differentiability of RγR_{\gamma} at xx^{\star} applied on the first term on the right-hand side it follows that

By possibly restricting UxU_{x^{\star}}, nonsingularity of JRγ(x)JR_{\gamma}(x^{\star}) ensures the existence of a constant α>0\alpha>0 such that Rγ(x)αxx\|R_{\gamma}(x)\|{}\geq{}\alpha\|x-x^{\star}\| for all xUxx\in U_{x^{\star}}. Since xˉk+dkx\bar{x}^{k}+d^{k}\to x^{\star}, eventually x0k+1Uxx^{k+1}_{0}\in U_{x^{\star}}. We have

A second-order expansion of φγ\varphi_{\gamma} at xx^{\star} yields \mathtight

where the last equality is due to (37). Substracting,

where β=12λmin(G)>0\beta=\frac{1}{2}\lambda_{\rm min}(G_{\star})>0. Therefore, there exists k0Nk_{0}\in\N such that φγ(xˉk+dk)φγ(xˉk)\varphi_{\gamma}(\bar{x}^{k}+d^{k}){}\leq{}\varphi_{\gamma}(\bar{x}^{k}) for all kk0k\geq k_{0}; in particular, for all such kk

where the second inequality follows from item 2, and the last one from (23) and the fact that σ<γ1γLf2\sigma<\gamma\frac{1-\gamma L_{f}}{2}. Therefore, for kk0k\geq k_{0} the linesearch condition (19) holds with τk=1\tau_{k}=1, and unitary stepsize is always accepted. In particular, the limit (37) reads limkxk+1x/xˉkx=0,\lim_{k\to\infty}{\|x^{k+1}-x^{\star}\|/\|\bar{x}^{k}-x^{\star}\|}{}={}0, and from the inequality

superlinear convergence of \seqxk\seq{x^{k}} follows. Since rkLRxkx\|r^{k}\|{}\leq{}L_{R}\|x^{k}-x^{\star}\|, then also \seqrk\seq{r^{k}} converges superlinearly, and in turn, since xˉkxγrk+xkx\|\bar{x}^{k}-x^{\star}\|{}\leq{}\gamma\|r^{k}\|{}+{}\|x^{k}-x^{\star}\|, also \seqxˉk\seq{\bar{x}^{k}} does.

We conclude the section showing that employing Broyden directions (26) in ZeroFPR enables superlinear convergence rates, provided that RγR_{\gamma} is Lipschitz continuously semidifferentiable at the limit point (see ). {thm}[Superlinear convergence with Broyden directions]Suppose that Section 4.4 is (strictly) satisfied at a strong local minimum xx^{\star} of φ\varphi at which RγR_{\gamma} is Lipschitz-continuously semidifferentiable. Consider the iterates generated by ZeroFPR with directions dkd^{k} selected with Broyden method (26), and suppose that xkxx^{k}\to x^{\star}.

Then, the Dennis-Moré condition (35) is satisfied, and in particular all the claims of Section 5.4.2 hold. {proof} From section 4.4 we know that RγR_{\gamma} is strictly differentiable at the critical point xx^{\star} and Lipschitz-continuously semidifferentiable there. Denoting G=JRγ(x)G_{\star}=JR_{\gamma}(x_{\star}),

and since xk,xˉkxx^{k},\bar{x}^{k}\to x^{\star}, due to [22, Lem. 2.2] there exists L>0L>0 such that

In particular, due to sections 5.4.1 and B, ykGsksk\frac{\|y_{k}-G_{\star}s_{k}\|}{\|s_{k}\|} is summable. Let Ek=BkGE_{k}=B_{k}-G_{\star} and let F\|{}\cdot{}\|_{F} denote the Frobenius norm. With a simple modification of the proofs of [22, Thm. 4.1] and [2, Lem. 4.4] that takes into account the scalar ϑk[ϑˉ,2ϑˉ]\vartheta_{k}\in[\bar{\vartheta},2-\bar{\vartheta}] we obtain

The last term on the right-hand side, be it σk\sigma_{k}, is summable and therefore the sequence \seqEk\seq{E_{k}} is bounded. Therefore,

where Eˉsup\seqEkF\bar{E}\coloneqq\sup\seq{\|E_{k}\|_{F}}. Telescoping the inequality, summability of σk\sigma_{k} ensures that of (BkG)sk2sk2\frac{\|(B_{k}-G_{\star})s_{k}\|^{2}}{\|s_{k}\|^{2}} proving in particular the claimed Dennis-Moré condition (35).

Simulations

We now present numerical results with the proposed method. In ZeroFPR we set β=\nicefrac12\beta=\nicefrac{{1}}{{2}}, and for the nonmonotone linesearch we used the sequence pk=(ηQk+1)1p_{k}=(\eta Q_{k}+1)^{-1} where Q0=1Q_{0}=1, Qk+1=ηQk+1Q_{k+1}=\eta Q_{k}+1, η=0.85\eta=0.85: in this way \seqpk\seq{p_{k}} is computed as in .

We performed experiments with different choices of dkd^{k} in step 3. In particular,

ZeroFPR(Broyden): dk=Hkrˉkd^{k}=-H_{k}\bar{r}^{k}, and HkH_{k} obtained by the Broyden method (26) with ϑˉ=104\bar{\vartheta}=10^{-4};

ZeroFPR(BFGS): dk=Hkrˉkd^{k}=-H_{k}\bar{r}^{k}, where HkH_{k} is computed using BFGS updates (27);

ZeroFPR(L-BFGS): dkd^{k} is computed using L-BFGS [34, Alg. 7.4] with memory 1010.

We only show the results with full quasi-Newton updates (Broyden, BFGS) for one of the examples: for the other experiments we focus on L-BFGS, which is better suited for large-scale problems. Although JRγJR_{\gamma} is nonsymmetric at the critical points in general, we observed that the symmetric updates of BFGS and L-BFGS perform very well in practice and outperform the Broyden method.

We compared ZeroFPR with the forward-backward splitting algorithm (denoted FBS), that is (7), the inertial FBS (denoted IFBS) proposed in [14, Eq. (7)] (with parameter β=0.2\beta=0.2), and the nonmonotone accelerated FBS (denoted AFBS) proposed in [26, Alg. 2] for fully nonconvex problems. All experiments were performed in MATLAB. The implementation of the methods used in the tests are available online.http://github.com/kul-forbes/ForBES

Here we consider the problem of finding a sparse solution to a least-squares problem. As discussed in , this is achieved by solving the following nonconvex problem:

2. Dictionary learning

Given a collection of mm signals of dimension nn, collected as columns in a matrix YRn×mY\in\R^{n\times m}, we seek for a sparse representation of each of them as combination of a set of kk vectors {d1,,dk}\set{d_{1},\ldots,d_{k}}, called dictionary atoms. To do so, we solve the following problem

Problem (39) takes the form (1) by letting f(D,C)=12YDCF2f(D,C)=\tfrac{1}{2}\|Y-DC\|^{2}_{F} and g(D,C)=δS(D,C)g(D,C)=\delta_{S}(D,C), where S{}={}\overbrace{S_{D}\times\ldots\times S_{D}}^{\text{ktimes}}{}\times{}\overbrace{S_{C}\times\ldots\times S_{C}}^{\text{mtimes}}, with

3. Matrix decomposition

We consider the problem of approximating a given matrix ARm×nA\in\R^{m\times n} as the sum of a low-rank and a sparse component, by solving

This problem has application, for example, in the analysis of video imagery, specifically the separation of the background (fixed over time) scenery from the foreground (moving) objects in a series of video frames. In this case, matrix AA contains nn video frames (columns), each consisting of mm pixels, and XLX_{L}, XSX_{S} will respectively contain the background scenery and foreground objects identified in each frame. Therefore here f(XL,XS)=12AXLXSF2f(X_{L},X_{S})=\tfrac{1}{2}\|A-X_{L}-X_{S}\|_{F}^{2} and Lf=2L_{f}=2, while g(XL,XS)=\indicator\rankr(XL)+λXS0g(X_{L},X_{S})=\indicator_{\rank\leq r}(X_{L})+\lambda\|X_{S}\|_{0}. The proximal mapping of gg is given by

Here, \proxγλ0\prox_{\gamma\lambda\|\cdot\|_{0}} is the hard-thresholding operation, defined componentwise as

The set of matrices of rank at most rr is nonconvex and closed, and the projection onto it is given by \proj\rankr(X)=Ur\diag(σ1,,σr)VrT\proj_{\rank\leq r}(X){}={}U_{r}\diag(\sigma_{1},\ldots,\sigma_{r})V_{r}^{T}, where σ1σr\sigma_{1}\ldots\sigma_{r} are rr largest singular values of XX, and Ur,VrU_{r},V_{r} are the matrices of left and right singular vectors, respectively. Each computation of Π\rankr\Pi_{\rank\leq r} requires a partial SVD which is, from the computational perspective, the most significantly expensive operation in this case.

We applied this technique to a sequence of n=50n=50 frames coming from the ShoppingMall dataset.http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html The footage consists of m=320×256m=320\times 256 grayscale pixels frames, therefore the problem has 81920008192000 variables in total. In problem (40) we used r=1r=1 and λ=3103\lambda=3\cdot 10^{-3}. The results are shown in Figures 3 and 4. Also in this case, the fast asymptotic convergence of ZeroFPR(L-BFGS) is apparent.

Conclusions

The forward-backward envelope is a valuable tool for deriving efficient algorithms tackling nonsmooth and nonconvex problems of the form φ=f+g\varphi=f+g, as it can be used as a merit function to devise globally convergent linesearch methods solving the system of nonlinear equations defining the stationary points of φ\varphi.

ZeroFPR implements this idea, and we proved that it globally converges to a stationary point under the assumption that φγ\varphi_{\gamma} has the Kurdyka-Łojasiewicz property. Furthermore, if the linesearch directions satisfy the Dennis-Moré condition (for example, if they are determined according to the Broyden method), the convergence rate at strong local minima is superlinear.

Numerical simulations with the proposed method on convex and nonconvex problems confirm our theoretical results. Using Broyden method, BFGS (in the case of small-scale problems) and L-BFGS (for large-scale problems) to compute directions in ZeroFPR greatly outperform FBS and its accelerated variant. It is our belief that the surprising efficacy of (L-)BFGS is due to the fact that, under the appropriate assumptions, the Jacobian of RγR_{\gamma} at strong local minima is similar to a symmetric and positive definite matrix. Future investigation may better explain the effectiveness of symmetric update formulas in this framework.

References

Appendix A Proofs of Section 4

1: It follows from [37, Thm.s 3.8 and 4.1] that \proxγg\prox_{\gamma g} is (strictly) differentiable at xγf(x)x^{\star}-\gamma\nabla f(x^{\star}) iff gg (strictly) satisfies item 2. Consequently, if ff is of class C2C^{2} around xx^{\star} (and in particular strictly differentiable at xx^{\star} [44, Cor. 9.19]), Rγ(x)=x\FBxR_{\gamma}(x)=x-\FB x is (strictly) differentiable at xx^{\star} with Jacobian as in (17) due to the chain rule of differentiation (and the fact that strict differentiability is preserved by composition). For γ(γ,Γ(x))\gamma^{\prime}\in(\gamma,\Gamma(x^{\star})) and wRnw\in\R^{n} we have

The expression (15) of the second-order epi-derivative then implies \innprodMww1γw2\innprod{Mw}{w}{}\geq{}-\frac{1}{\gamma^{\prime}}\|w\|^{2} for all wRnw\in\R^{n} (since Mw=0Mw=0 for wSw\in S^{\perp}). Therefore, λmin(M)\nicefrac1γ>\nicefrac1γ\lambda_{\rm min}(M){}\geq{}-\nicefrac{{1}}{{\gamma^{\prime}}}{}>{}-\nicefrac{{1}}{{\gamma}}, proving I+γMI+\gamma M to be positive definite, and in particular invertible. We may now trace the proof of [45, Lem. 2.9] to infer that JPγ(x)=\projS[I+γM]1\projSJP_{\gamma}(x^{\star}){}={}\proj_{S}[I+\gamma M]^{-1}\proj_{S}. Apparently, JPγ(x)JP_{\gamma}(x^{\star}) is symmetric and positive semidefinite.

2: Since QγQ_{\gamma} is (strictly) continuous at xx^{\star} and RγR_{\gamma} is (strictly) differentiable at xx^{\star}, from [45, Lem. 6.2] we have that φγ=QγRγ\nabla\varphi_{\gamma}=Q_{\gamma}R_{\gamma} is (strictly) differentiable at xx^{\star}, and (17) follows by the chain rule.

3: A simple application of the chain rule proves (18); moreover, combined with (17) we obtain 2φγ(x)=1γ[Qγ(x)Qγ(x)Pγ(x)Qγ(x)]\nabla^{2}\varphi_{\gamma}(x^{\star}){}={}\tfrac{1}{\gamma}\left[Q_{\gamma}(x^{\star}){}-{}Q_{\gamma}(x^{\star})P_{\gamma}(x^{\star})Q_{\gamma}(x^{\star})\right], and since both Qγ(x)Q_{\gamma}(x^{\star}) and Pγ(x)P_{\gamma}(x^{\star}) are symmetric, then so is 2φ(x)\nabla^{2}\varphi(x^{\star}).

[Proof of Section 4.4]We will show that all conditions are equivalent to either one of the following

\tinnprodd(2f(x)+M)d>0\tinnprod{d}{(\nabla^{2}f(x^{\star})+M)d}>0 dS\forall d\in S, where MM and SS are as in section 4.4;

JRγ(x)JR_{\gamma}(x^{\star}) is similar to a symmetric and positive definite matrix.

4.4 \Leftrightarrow 4.4: trivial, since 2φγ(x)\nabla^{2}\varphi_{\gamma}(x^{\star}) exists as shown in item 3.

4.4 \Leftrightarrow thm:StrongMinimality(f): follows from [44, Thm. 13.24(c)], since

4.4 \Leftrightarrow 4.4: if 2φγ(x)0\nabla^{2}\varphi_{\gamma}(x^{\star})\succ 0, then xx^{\star} is a (strong) local minimum for φγ\varphi_{\gamma} and, due to (18), necessarily JRγ(x)JR_{\gamma}(x^{\star}) is invertible. Conversely, if xx^{\star} is a local minimum for φγ\varphi_{\gamma}, then 2φγ(x)0\nabla^{2}\varphi_{\gamma}(x^{\star})\succeq 0. If, additionally, JRγ(x)JR_{\gamma}(x^{\star}) is invertible, then due to (18) 2φγ(x)\nabla^{2}\varphi_{\gamma}(x^{\star}) is also invertible, and therefore positive definite.

4.4 \Leftrightarrow thm:StrongMinimality(g): by comparing (17) and (18) we observe that JRγ(x)JR_{\gamma}(x^{\star}) is similar to the (symmetric) matrix Qγ(x)\nicefrac122φγ(x)Qγ(x)\nicefrac12Q_{\gamma}(x^{\star})^{-\nicefrac{{1}}{{2}}}\nabla^{2}\varphi_{\gamma}(x^{\star})Q_{\gamma}(x^{\star})^{-\nicefrac{{1}}{{2}}}, which is positive definite iff 2φγ(x)\nabla^{2}\varphi_{\gamma}(x^{\star}) is.

thm:StrongMinimality(f) \Leftrightarrow thm:StrongMinimality(g): the proof is the same as that of [45, Thm. 2.11(b)\Leftrightarrow(c)].

4.4 \Rightarrow thm:StrongMinimality(g): with similar reasonings as in the proof of the implications “4.4 \Leftrightarrow thm:StrongMinimality(f) \Leftrightarrow thm:StrongMinimality(g)”, we conclude that local minimality of xx^{\star} for φ\varphi entails JRγ(x)JR_{\gamma}(x^{\star}) being similar to a symmetric and positive semidefinite matrix. Therefore, if JRγ(x)JR_{\gamma}(x^{\star}) is nonsingular, then it is similar to a symmetric and positive definite matrix.

4.4 \Rightarrow 4.4: trivial, since φγφ\varphi_{\gamma}\leq\varphi and φγ(x)=φ(x)\varphi_{\gamma}(x^{\star})=\varphi(x^{\star}) (cf. items 1 and 1).

Appendix B Additional results for Section 5

Consider the iterates generated by ZeroFPR and suppose that the directions \seqdk\seq{d^{k}} are selected so as to satisfy (31). Then,

xk+1xk(γ+D)rk\|x^{k+1}-x^{k}\|{}\leq{}(\gamma+D)\|r^{k}\|

xˉk+1xˉkγrk+1+(2γ+D)rk\|\bar{x}^{k+1}-\bar{x}^{k}\|{}\leq{}\gamma\|r^{k+1}\|{}+{}(2\gamma+D)\|r^{k}\|

in particular, xk+1xk\|x^{k+1}-x^{k}\| and xˉk+1xˉk\|\bar{x}^{k+1}-\bar{x}^{k}\| converge to 0.

where in the last inequality we used the fact that τk(0,1]\tau_{k}\in(0,1]. This proves 1, and 2 trivially follows by the triangular inequality xˉk+1xˉkxk+1xk+γrk+1+γrk\|\bar{x}^{k+1}-\bar{x}^{k}\|{}\leq{}\|x^{k+1}-x^{k}\|{}+{}\gamma\|r^{k+1}\|{}+{}\gamma\|r^{k}\|. Using this, 3 follows from item 1.

Consider the iterates generated by ZeroFPR. Suppose that (31) is satisfied and that the sequence \seqxk\seq{x^{k}} is bounded. Then, ω(xk)=ω(xˉk)\omega(x^{k})=\omega(\bar{x}^{k}) are nonempty compact and connected sets over which φ\varphi and φγ\varphi_{\gamma} are constant and coincide. Moreover,

The sets of cluster points are nonempty because of boundedness of the sequences; in turn, connectedness and compactness as well as (41) are shown in [12, Rem. 5], which applies since xk+1xk\|x^{k+1}-x^{k}\| and xˉk+1xˉk\|\bar{x}^{k+1}-\bar{x}^{k}\| converge to (cf. item 3). Moreover, since \seqφγ(xk)\seq{\varphi_{\gamma}(x^{k})} converges to some value φR\varphi_{\star}\in\R and ω(xk)=ω(xˉk)\fixTγ\omega(x^{k})=\omega(\bar{x}^{k})\subseteq\fix T_{\gamma} as shown in section 5.4, it follows Item 1 that φ\varphi and φγ\varphi_{\gamma} coincide on ω(xk)\omega(x^{k}) (and equal φ\varphi_{\star}).

Suppose that Section 4.4 is satisfied at a strong local minimum xx^{\star} of φ\varphi. Then, for any γ(0,\nicefrac1Lf)\gamma\in(0,\nicefrac{{1}}{{L_{f}}}) the FBE φγ\varphi_{\gamma} possesses the KL property at xx^{\star}, and the desingularizing function ψ\psi can be taken of the form ψ(s)=ρs\nicefrac12\psi(s){}={}\rho s^{\nicefrac{{1}}{{2}}} for some ρ>0\rho>0. {proof} From section 4.4 it follows that xx^{\star} is a strong local minimum for φγ\varphi_{\gamma} at which φγ\varphi_{\gamma} is twice differentiable with H2φγ(x)0H_{\star}\coloneqq\nabla^{2}\varphi_{\gamma}(x^{\star})\succ 0. Let λλmin(H)\lambda{}\coloneqq{}\lambda_{\rm min}(H_{\star}) and Λλmax(H)\Lambda{}\coloneqq{}\lambda_{\rm max}(H_{\star}). Since φγ(x)=0\nabla\varphi_{\gamma}(x^{\star})=0, from a second-order expansion of φγ\varphi_{\gamma} and a first-order expansion of φγ\nabla\varphi_{\gamma} we obtain that there exists a neighborhood UxU_{x^{\star}} of xx^{\star} such that, for all xUxx\in U_{x^{\star}}, φγ(x)φγ(x)Λ4xx2\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star}){}\leq{}\tfrac{\Lambda}{4}\|x-x^{\star}\|^{2} and φγ(x)λ2xx\|\nabla\varphi_{\gamma}(x)\|{}\geq{}\tfrac{\lambda}{2}\|x-x^{\star}\|, and in particular ψ(φγ(x)φγ(x))φγ(x)=ρ2φγ(x)φγ(x)φγ(x)ρλ2Λ\psi^{\prime}\left(\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star})\right)\|\nabla\varphi_{\gamma}(x)\|{}={}\tfrac{\rho}{2\sqrt{\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star})}}\|\nabla\varphi_{\gamma}(x)\|{}\geq{}\tfrac{\rho\lambda}{2\sqrt{\Lambda}}. Letting ρ=2Λλ\rho{}={}\frac{2\sqrt{\Lambda}}{\lambda} we obtain that ψ\psi is a KL function for φγ\varphi_{\gamma} at xx^{\star}.