Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone line-search algorithms

Andreas Themelis, Lorenzo Stella, Panagiotis Patrinos

Introduction

In this paper we deal with optimization problems of the form

$\operatorname*{minimize}_{x\in\R^{n}}\ \varphi(x)\equiv f(x)+g(x)$ (1)

under the following assumptions, which will be valid throughout the paper without further mention.

Assumption (Basic assumption). In problem (1)

$f\in C^{1,1}(\R^{n})$ (differentiable with $L_{f}$ -Lipschitz continuous gradient);

$\func{g}{\R^{n}}{\Rinf}$ is proper, closed and $\gamma_{g}$ -prox-bounded (see Section 2.1);

a solution exists, that is, $\argmin\varphi\neq\emptyset$ .

Both $f$ and $g$ are allowed to be nonconvex, making (1) prototypic for a plethora of applications spanning signal and image processing, machine learning, statistics, control and system identification. A well known algorithm addressing (1) is forward-backward splitting (FBS), also known as the proximal gradient method. FBS has been thoroughly analyzed under the assumption of $g$ being convex. If moreover $f$ is convex, then FBS is known to converge globally with rate $O(1/k)$ in terms of objective value, where $k$ is the iteration count. In this case, accelerated variants of FBS, also known as fast forward-backward splitting (FFBS), can be derived thanks to the work of Nesterov; these require only minimal additional computations per iteration but achieve the provably optimal global convergence rate of order $O(1/k^{2})$.

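Earlier work pioneered an alternative acceleration technique. The method is based on an exact, real-valued penalty function for the original problem (1), namely the forward-backward envelope (FBE), defined as follows

$\varphi_{\gamma}(x){}\coloneqq{}\inf_{z\in\R^{n}}\set{f(x)+\innprod{\nabla f(x)}{z-x}+g(z)+\tfrac{1}{2\gamma}\|z-x\|^{2}}$ (2)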

The name forward-backward envelope comes from the fact that $\varphi_{\gamma}(x)$ is the value of the minimization problem that defines the forward-backward step and alludes to the kinship that it has with the Moreau envelope. These claims will be addressed in more detail in Section 4. When $f$ is sufficiently smooth and both $f$ and $g$ are convex, the FBE was shown to be continuously differentiable and amenable to be minimized with generalized Newton methods. More recently, a linesearch algorithm based on (L-)BFGS quasi-Newton directions was proposed for minimizing the FBE. The curvature information exploited by Newton-like methods acts as an online preconditioner, enabling superlinear rates of convergence under mild assumptions. However, unlike plain (F)FBS schemes, such methods require accessing second-order information of the smooth term $f$ (needed for the evaluation of $\nabla\varphi_{\gamma}$ ), and are well defined only as long as the nonsmooth term $g$ is convex. On the contrary, FBS merely requires first-order information on $f$ and prox-boundedness of the nonsmooth term $g$ , in which case all accumulation points are stationary for $\varphi$ , i.e., they satisfy the first-order necessary conditions.

In this paper we propose ZeroFPR, a nonmonotone linesearch algorithm that, to the best of our knowledge, is the first that (1) addresses the same range of problems as FBS, (2) requires the same black-box oracle as FBS (gradient of one function and proximity operator of the other), (3) yet achieves superlinear rates under mild assumptions (only) at the limit point. Though related to the minFBE algorithm, ZeroFPR is conceptually different, mainly because it is gradient-free, in the sense that it does not require the gradient of the FBE. Moreover,

We provide the necessary theoretical background linking the concepts of stationarity of a point for problem (1), criticality and optimality. To the best of our knowledge, such an analysis was previously made only for the proximal point algorithm and for a special case of the projected gradient method .

The analysis of the FBE, previously studied only in the case of $f$ being $C^{2}(\R^{n})$ and $g$ convex, is extended to $f$ and $g$ as in the basic assumption. In particular, we provide mild assumptions on $f$ and $g$ that ensure (1) continuous differentiability of the FBE around critical points, (2) (strict) twice differentiability at critical points, and (3) equivalence of strong local minimality for the original function and the FBE.

Exploiting the investigated properties of the FBE and of critical points we prove that ZeroFPR with monotone linesearch converges (1) globally if $\varphi_{\gamma}$ has the Kurdyka-Łojasiewicz property , and (2) superlinearly when quasi-Newton Broyden directions are employed, under mild additional requirements at the limit point.

In Section 2 we introduce some notation and list some known facts about FBS. In Section 3 we define and explore notions of stationarity and criticality for the investigated problem and relate them with properties of the forward-backward operator. In Section 4 we extend earlier results about the fundamental properties of the FBE to the more general setting addressed in this paper, where $f$ and $g$ satisfy the basic assumption; for the sake of readability, some of the proofs are deferred to Appendix A. Section 5 addresses the core contribution of the paper, ZeroFPR; although arbitrary directions can be chosen, we specialize the results on superlinear convergence to a quasi-Newton Broyden method so as to truly maintain the same black-box oracle as FBS. Some ancillary results needed for the proofs are listed in Appendix B. Finally, Section 6 illustrates numerical results obtained with the proposed method.

Preliminaries

The identity $n\times n$ matrix is denoted as $\id$ , and the extended real line as $\Rinf=\R\cup\set{\infty}$ . The open and closed balls of radius $r\geq 0$ centered in $x\in\R^{n}$ are denoted as $\ball xr$ and $\cball xr$ , respectively. Given a set $E$ and a sequence $\seq{x^{k}}$ we write $\seq{x^{k}}\subset E$ with the obvious meaning of $x^{k}\in E$ for all $k\in\N$ . The (possibly empty) set of cluster points of $\seq{x^{k}}$ is denoted as $\omega\left(\seq{x^{k}}\right)$ , or simply as $\omega(x^{k})$ whenever the indexing is clear from the context. We say that $\seq{x^{k}}\subset\R^{n}$ is summable if $\sum_{k\in\N}\|x^{k}\|$ is finite, and square-summable if $\seq{\|x^{k}\|^{2}}$ is summable.

Following standard terminology, we say that a function $\func{f}{\R^{n}}{\R}$ is strictly continuous at $\bar{x}$ if $\limsup_{\begin{subarray}{c}y,z\to\bar{x}\\ y\neq z\end{subarray}}{\frac{|f(y)-f(z)|}{\|y-z\|}}$ is finite, and strictly differentiable at $\bar{x}$ if $\nabla f(\bar{x})$ exists and $\lim_{\begin{subarray}{c}y,z\to\bar{x}\\ y\neq z\end{subarray}}{\frac{f(y)-f(z)-\innprod{\nabla f(\bar{x})}{y-z}}{\|y-z\|}}{}={}0$ . The set of functions $\R^{n}\to\R$ with Lipschitz continuous gradient is denoted as $C^{1,1}(\R^{n})$ , and for $f\in C^{1,1}(\R^{n})$ we write $L_{f}$ to indicate the Lipschitz modulus of $\nabla f$ .

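For a proper, closed function $\func{g}{\R^{n}}{\Rinf}$ , a vector $v\in\partial g(x)$ is a subgradient of $g$ at $x$ , where the subdifferential $\partial g(x)$ is considered in the sense of [44, Def. 8.3]:

$\partial g(x){}={}\set{v\in\R^{n}}[\exists\,(x^{k},v^{k})_{k\in\N}\text{ such that }x^{k}\to x,\ g(x^{k})\to g(x),\ \hat{\partial}g(x^{k})\ni v^{k}\to v]$ ,

and $\hat{\partial}g(x)$ is the set of regular subgradients of $g$ at $x$ , namely the vectors $v\in\R^{n}$ such that $\liminf_{\begin{subarray}{c}z\to x\\ z\neq x\end{subarray}}{\frac{g(z)-g(x)-\innprod{v}{z-x}}{\|z-x\|}}{}\geq{}0$ .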

We have $\partial\varphi(x){}={}\nabla f(x){}+{}\partial g(x)$ and $\hat{\partial}\varphi(x){}={}\nabla f(x){}+{}\hat{\partial}g(x)$ [44, Ex. 8.8(c)].

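Given a parameter value $\gamma>0$ , the Moreau envelope function $g^{\gamma}$ and the proximal mapping $\prox_{\gamma g}$ are defined by

$g^{\gamma}(x){}\coloneqq{}\inf_{z\in\R^{n}}\set{g(z)+\tfrac{1}{2\gamma}\|z-x\|^{2}}$ and $\prox_{\gamma g}(x){}\coloneqq{}\argmin_{z\in\R^{n}}\set{g(z)+\tfrac{1}{2\gamma}\|z-x\|^{2}}$ .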

We now summarize some properties of $g^{\gamma}$ and $\prox_{\gamma g}$ ; the interested reader is referred to the literature for a detailed discussion. A function $\func{g}{\R^{n}}{\Rinf}$ is prox-bounded if there exists $\gamma>0$ such that $g+\tfrac{1}{2\gamma}\|{}\cdot{}\|^{2}$ is bounded below on $\R^{n}$ . The supremum of all such $\gamma$ is the threshold $\gamma_{g}$ of prox-boundedness for $g$ . In particular, if $g$ is convex or bounded below then $\gamma_{g}=\infty$ . In general, for any $\gamma\in(0,\gamma_{g})$ the proximal mapping $\prox_{\gamma g}$ is nonempty- and compact-valued, and the Moreau envelope $g^{\gamma}$ is finite [44, Thm. 1.25].

Given a nonempty closed set $S\subseteq\R^{n}$ we let $\func{\indicator_{S}}{\R^{n}}{\Rinf}$ denote its indicator function, namely $\indicator_{S}(x)=0$ if $x\in S$ and $\indicator_{S}(x)=\infty$ otherwise, and $\ffunc{\proj_{S}}{\R^{n}}{\R^{n}}$ the (set-valued) projection $x\mapsto\argmin_{z\in S}\|z-x\|$ . Proximal mappings can be seen as generalized projections, due to the relation $\proj_{S}=\prox_{\gamma{\indicator_{S}}}$ for any $\gamma>0$ .
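As a concrete illustration of the objects just introduced, the following minimal Python sketch evaluates the proximal mapping and the Moreau envelope in one dimension for two choices of $g$: the convex function $|\cdot|$ (closed-form soft-thresholding) and the indicator of a finite set (set-valued projection). The function names and the numerical values are illustrative assumptions and are not taken from the paper.

```python
def prox_abs(x, gamma):
    """prox of g(z) = |z|: soft-thresholding (single-valued, since g is convex)."""
    if x > gamma:
        return x - gamma
    if x < -gamma:
        return x + gamma
    return 0.0


def moreau_abs(x, gamma):
    """Moreau envelope of |.|, evaluated at the minimizer prox_abs(x, gamma)."""
    z = prox_abs(x, gamma)
    return abs(z) + (z - x) ** 2 / (2 * gamma)


def proj_finite(x, points, tol=1e-12):
    """Set-valued projection onto a finite set: every nearest point is returned."""
    dist = min(abs(p - x) for p in points)
    return sorted(p for p in points if abs(p - x) <= dist + tol)


def moreau_indicator(x, points, gamma):
    """Moreau envelope of the indicator of a finite set: dist(x, S)^2 / (2*gamma)."""
    return min((p - x) ** 2 for p in points) / (2 * gamma)


if __name__ == "__main__":
    print(prox_abs(3.0, 1.0), moreau_abs(3.0, 1.0))
    print(proj_finite(0.5, [0.0, 1.0]))  # a tie: the projection is 2-valued
    print(moreau_indicator(0.25, [0.0, 1.0], 0.5))
```

Note that the projection does not depend on $\gamma$, in agreement with the relation $\proj_{S}=\prox_{\gamma{\indicator_{S}}}$, whereas the envelope value does.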

For a set-valued mapping $\ffunc{T}{\R^{n}}{\R^{n}}$ we let $\graph T=\set{(x,y)}[y\in T(x)]$ denote its graph, $\zer T=\set{x\in\R^{n}}[0\in T(x)]$ the set of its zeros and $\fix T=\set{x\in\R^{n}}[x\in T(x)]$ the set of its fixed points.

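2.2. Forward-backward iterations

Because of the quadratic upper bound

$f(z){}\leq{}f(x)+\innprod{\nabla f(x)}{z-x}+\tfrac{L_{f}}{2}\|z-x\|^{2}$ (5)

holding for all $x,z\in\R^{n}$ [9, Prop. A.24], for any $\gamma\in(0,\nicefrac{{1}}{{L_{f}}})$ the function

$\mathcal{M}_{\gamma}(z;x){}\coloneqq{}f(x)+\innprod{\nabla f(x)}{z-x}+\tfrac{1}{2\gamma}\|z-x\|^{2}+g(z)$

furnishes a majorization model for $\varphi$ , in the sense that

$\mathcal{M}_{\gamma}(x;x)=\varphi(x)$ and $\mathcal{M}_{\gamma}(z;x)\geq\varphi(z)$ for all $x,z\in\R^{n}$ . Minimizing this model with respect to $z$ gives the forward-backward iteration

$x^{+}{}\in{}T_{\gamma}(x){}\coloneqq{}\argmin_{z\in\R^{n}}\mathcal{M}_{\gamma}(z;x)$ (7)

where $\gamma\in\bigl{(}0,\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}\bigr{)}$ is the stepsize parameter. The (set-valued) forward-backward operator $T_{\gamma}^{f,g}$ can be equivalently expressed as

$T_{\gamma}(x){}={}\prox_{\gamma g}(x-\gamma\nabla f(x))$ ,

and we let $R_{\gamma}(x){}\coloneqq{}\tfrac{1}{\gamma}(x-T_{\gamma}(x))$ denote the corresponding (set-valued) forward-backward residual.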
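The forward-backward iteration can be sketched in a few lines of Python. The sketch below assumes the standard expression $T_{\gamma}(x)=\prox_{\gamma g}(x-\gamma\nabla f(x))$ and uses a hypothetical one-dimensional toy problem, $f(x)=\tfrac{1}{2}(x-3)^{2}$ and $g(x)=|x|$, chosen only because its minimizer $x=2$ is known in closed form; none of the names below come from the paper.

```python
import math


def soft_threshold(y, t):
    """prox of t*|.| evaluated at y."""
    return math.copysign(max(abs(y) - t, 0.0), y)


def fbs(grad_f, prox_g, x0, gamma, iters=100):
    """Forward-backward splitting: gradient step on f, then proximal step on g."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - gamma * grad_f(x), gamma)
    return x


def grad_f(x):
    """Gradient of the toy smooth term f(x) = (x - 3)^2 / 2 (so L_f = 1)."""
    return x - 3.0


# Stepsize gamma = 0.5 lies in (0, 1/L_f); g = |.| is convex, so gamma_g is infinite.
x_star = fbs(grad_f, soft_threshold, x0=0.0, gamma=0.5)

if __name__ == "__main__":
    print(x_star)
```

On this toy problem each iteration halves the distance to the minimizer, which is the kind of linear behavior that the Newton-type directions discussed later are meant to improve upon.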

Stationary and critical points

Unless $\varphi$ is convex, the stationarity condition $0\in\hat{\partial}\varphi(x^{\star})$ in problem (1) is only necessary for the optimality of $x^{\star}$ [44, Thm. 10.1]. In this section we define different concepts of (sub)optimality and show how they are related for generic functions $\varphi=f+g$ as in the basic assumption.

Definition. We say that a point $x^{\star}\in\dom\varphi$ is

stationary if $0\in\hat{\partial}\varphi(x^{\star})$ ;

critical if it is $\gamma$ -critical for some $\gamma\in(0,\gamma_{g})$ , i.e., if $x^{\star}\in T_{\gamma}(x^{\star})$ ;

optimal if $x^{\star}\in\argmin\varphi$ , i.e., if it solves (1).

The notion of criticality was already discussed in under the name of $L$ -stationarity ( $L$ plays the role of $\nicefrac{{1}}{{\gamma}}$ ) for the special case of $g=\indicator_{B\cap C_{s}}$ , where $B$ is a convex set and $C_{s}$ is the (nonconvex) set of vectors with at most $s$ nonzero entries.

If $g$ is convex, then $\gamma_{g}=\infty$ and we may talk of criticality without mention of $\gamma$ : in this case, the properties of $\gamma$ -criticality and stationarity are equivalent regardless of the value of $\gamma$ . For more general functions $g$ , instead, the value of $\gamma$ plays a role in determining whether a point is $\gamma$ -critical or not, which legitimizes the following definition.

Definition. The criticality threshold is the function $\func{\Gamma^{f,g}}{\R^{n}}{[0,\gamma_{g}]}$

$\Gamma^{f,g}(x){}\coloneqq{}\sup\bigl(\set{\gamma>0}[x\in T_{\gamma}^{f,g}(x)]\cup\set{0}\bigr)$ . (9)

As usual, whenever $f$ and $g$ are clear from the context we simply write $\Gamma$ in place of $\Gamma^{f,g}$ . That $\Gamma\leq\gamma_{g}$ is due to the fact that $\prox_{\gamma g}$ (and consequently $T_{\gamma}$ ) is everywhere empty-valued for $\gamma>\gamma_{g}$ . Considering also $\gamma=0$ forces the set in the definition to be nonempty, and the lower-bound $\Gamma\geq 0$ in particular; more precisely, observe that, by definition, $\Gamma(x)>0$ iff $x$ is a critical point.

Let us consider $\varphi=f+g$ for $f(x)=\frac{1}{2}x^{2}$ and $g=\indicator_{C}$ where $C=\set{\pm 1}$ . Clearly, $\gamma_{g}=+\infty$ (as $g$ is lower-bounded), $L_{f}=1$ and $\pm 1$ are both (unique) optima. Since $\hat{\partial}\varphi(x)=\R$ for $x\in C$ and $\hat{\partial}\varphi$ is clearly empty elsewhere, all points in $C$ are stationary. $\prox_{\gamma g}$ is the (set-valued) projection on $C$ , therefore the forward-backward operator is $T_{\gamma}(x){}={}\proj_{C}((1-\gamma)x)$ . For $x\in C$ we have

$T_{\gamma}(x)=\set{x}$ if $\gamma<1$ , $T_{\gamma}(x)=\set{\pm 1}$ if $\gamma=1$ , and $T_{\gamma}(x)=\set{-x}$ if $\gamma>1$ .

In particular, $\Gamma(1)=\Gamma(-1)=1$ . We now list some properties of critical and optimal points which will be used to derive regularity properties of $T_{\gamma}$ and $g^{\gamma}$ .

Theorem (Properties of critical points). The following properties hold:

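for $\gamma\in(0,\gamma_{g})$ , a point $x^{\star}$ is $\gamma$ -critical iff

$g(x){}\geq{}g(x^{\star})+\innprod{-\nabla f(x^{\star})}{x-x^{\star}}-\tfrac{1}{2\gamma}\|x-x^{\star}\|^{2}$ for all $x\in\R^{n}$ ;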

if $x^{\star}$ is critical, then it is $\gamma$ -critical for all $\gamma\in(0,\Gamma(x^{\star}))$ ; moreover, $x^{\star}$ is also $\Gamma(x^{\star})$ -critical provided that $\Gamma(x^{\star})<\gamma_{g}$ ;

$T_{\gamma}(x^{\star}){=}\set{x^{\star}}$ and $R_{\gamma}(x^{\star}){=}\set{0}$ for any critical point $x^{\star}$ and $\gamma\in(0,\Gamma(x^{\star}))$ .

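1: by definition, $x^{\star}$ is $\gamma$ -critical iff $x^{\star}$ minimizes $z\mapsto g(z)+\innprod{\nabla f(x^{\star})}{z-x^{\star}}+\tfrac{1}{2\gamma}\|z-x^{\star}\|^{2}$ , that is, iff $g(x^{\star}){}\leq{}g(x)+\innprod{\nabla f(x^{\star})}{x-x^{\star}}+\tfrac{1}{2\gamma}\|x-x^{\star}\|^{2}$ for all $x\in\R^{n}$ . By suitably rearranging, the claim readily follows.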

2: due to 1, if $x^{\star}$ is $\gamma$ -critical, evidently it is also $\gamma^{\prime}$ -critical for any $\gamma^{\prime}\in(0,\gamma]$ . From the definition (9) of the criticality threshold $\Gamma(x^{\star})$ , it then follows that $x^{\star}$ is $\gamma$ -critical for any $\gamma\in(0,\Gamma(x^{\star}))$ . Suppose now that $\Gamma(x^{\star})<\gamma_{g}$ . Then, due to 1 for all $\gamma\in(0,\Gamma(x^{\star}))$ we have

$g(x){}\geq{}g(x^{\star})+\innprod{-\nabla f(x^{\star})}{x-x^{\star}}-\tfrac{1}{2\gamma}\|x-x^{\star}\|^{2}$ for all $x\in\R^{n}$ .

By taking the limit as $\gamma\nearrow\Gamma(x^{\star})$ we obtain that the inequality holds for $\Gamma(x^{\star})$ as well, proving the claim in light of the characterization 1.

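3: let $x^{\star}$ be a critical point, and let $x\in T_{\gamma}(x^{\star})$ for some $\gamma<\Gamma(x^{\star})$ . Fix $\gamma^{\prime}\in(\gamma,\Gamma(x^{\star}))$ . From 1 and 2 it then follows that

$g(x^{\star})+\innprod{-\nabla f(x^{\star})}{x-x^{\star}}-\tfrac{1}{2\gamma^{\prime}}\|x-x^{\star}\|^{2}{}\leq{}g(x){}\leq{}g(x^{\star})+\innprod{-\nabla f(x^{\star})}{x-x^{\star}}-\tfrac{1}{2\gamma}\|x-x^{\star}\|^{2}$ ,

where the first inequality is the $\gamma^{\prime}$ -criticality of $x^{\star}$ and the second one expresses that $x$ minimizes the forward-backward model at $x^{\star}$ . Hence $\bigl(\tfrac{1}{2\gamma}-\tfrac{1}{2\gamma^{\prime}}\bigr)\|x-x^{\star}\|^{2}\leq 0$ .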

Since $\tfrac{1}{2\gamma}-\tfrac{1}{2\gamma^{\prime}}>0$ , necessarily $x=x^{\star}$ .
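The example with $C=\set{\pm 1}$ preceding the theorem offers a quick numerical sanity check of properties 2 and 3. The Python sketch below (an illustration with hypothetical helper names, not part of the paper) evaluates $T_{\gamma}(1)=\proj_{C}(1-\gamma)$ for several stepsizes: the point $x=1$ is $\gamma$-critical exactly for $\gamma\leq 1=\Gamma(1)$, and $T_{\gamma}(1)$ is single-valued for $\gamma<1$.

```python
def proj_C(y, C=(-1.0, 1.0), tol=1e-12):
    """Set-valued projection onto the finite set C: all nearest points."""
    dist = min(abs(c - y) for c in C)
    return {c for c in C if abs(c - y) <= dist + tol}


def T(x, gamma):
    """Forward-backward operator for f(x) = x^2/2 and g the indicator of C."""
    return proj_C((1.0 - gamma) * x)


def is_critical(x, gamma):
    """gamma-criticality: x is a fixed point of the set-valued map T."""
    return x in T(x, gamma)


if __name__ == "__main__":
    for gamma in (0.25, 0.5, 0.99, 1.0, 1.01, 2.0):
        print(gamma, sorted(T(1.0, gamma)), is_critical(1.0, gamma))
```

At $\gamma=1$ the operator becomes 2-valued but still contains the point itself, which is consistent with the claim that $x^{\star}$ is also $\Gamma(x^{\star})$-critical when $\Gamma(x^{\star})<\gamma_{g}$.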

In the next result we show that criticality is a halfway property between stationarity and optimality. In light of these relations we shall seek “suboptimal” solutions which we characterize as critical points.

Proposition (Optimality, criticality, stationarity). Let $\bar{\gamma}{}\coloneqq{}\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}$ .

(criticality $\Rightarrow$ stationarity) $\fix T_{\gamma}\subseteq\zer\hat{\partial}\varphi$ for all $\gamma\in(0,\gamma_{g})$ ;

(optimality $\Rightarrow$ criticality) $\Gamma(x^{\star})\geq\bar{\gamma}$ for all $x^{\star}\in\argmin\varphi$ ; in particular, $\argmin\varphi\subseteq\fix T_{\gamma}$ for all $\gamma\in(0,\bar{\gamma})$ , and also for $\gamma=\nicefrac{{1}}{{L_{f}}}$ if $\gamma_{g}>\nicefrac{{1}}{{L_{f}}}$ ;

1: let $\gamma\in(0,\gamma_{g})$ and $x\in\fix T_{\gamma}$ . Since $x$ minimizes $g+\tfrac{1}{2\gamma}\|{}\cdot{}-x+\gamma\nabla f(x)\|^{2}$ , we have $0{}\in{}\hat{\partial}\bigl{[}g+\tfrac{1}{2\gamma}\|{}\cdot{}-x+\gamma\nabla f(x)\|^{2}\bigr{]}(x){}={}\hat{\partial}g(x)+\nabla f(x){}={}\hat{\partial}\varphi(x)$ , where the first inclusion follows from [44, Thm. 10.1] and the equalities from [44, Ex. 8.8(c)]. This proves that $x$ is stationary.

2: Fix $\gamma\in(0,\bar{\gamma})$ , $x^{\star}\in\argmin\varphi$ and $y\in T_{\gamma}(x^{\star})$ . Necessarily $y=x^{\star}$ , otherwise, due to the sufficient decrease property of forward-backward steps (Section 2.2), $\varphi(y)<\varphi(x^{\star})$ would contradict minimality of $\varphi(x^{\star})$ . Therefore, $x^{\star}$ is $\gamma$ -critical and the claim follows from the arbitrariness of $\gamma\in(0,\bar{\gamma})$ .

As already seen in the example above, the bound $\Gamma(x^{\star})\geq\min\set{\gamma_{g},\nicefrac{{1}}{{L_{f}}}}$ at optimal points in Item 2 is tight, and clearly the implication “optimality $\Rightarrow$ criticality” cannot be reversed (consider, e.g., the point $x^{\star}=0$ for $\varphi=\cos$ ). The next example shows that the other implication is also proper.

Example (Stationarity $\not\Rightarrow$ criticality). Let $f(x)=\frac{1}{2}x^{2}$ and $g(x)=x^{\nicefrac{{5}}{{3}}}$ . We have $\gamma_{g}=+\infty$ , $L_{f}=1$ , and for $x^{\star}=0$ it holds that $\hat{\partial}\varphi(x^{\star})=\set{\nabla\varphi(x^{\star})}=\set{0}$ . Therefore, $x^{\star}$ is stationary; however, $T_{\gamma}(x^{\star})=\prox_{\gamma g}(0)=\set{-(\nicefrac{{5\gamma}}{{3}})^{3}}$ , and in particular $x^{\star}\notin T_{\gamma}(x^{\star})$ for any $\gamma>0$ , proving $x^{\star}$ to be noncritical.
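The closed-form proximal point in the last example can be cross-checked by brute force. The following Python sketch minimizes $z\mapsto g(z)+\tfrac{1}{2\gamma}z^{2}$ on a uniform grid and compares the result with $-(\nicefrac{5\gamma}{3})^{3}$. The grid window $[-1,1]$, the stepsize $\gamma=0.3$ and the helper names are assumptions made for illustration; $x^{\nicefrac{5}{3}}$ is interpreted with the real (odd) cube root.

```python
def g(x):
    """g(x) = x^(5/3) with the real odd root, so that g(-x) = -g(x)."""
    return (abs(x) ** (5.0 / 3.0)) * (1.0 if x >= 0 else -1.0)


def prox_at_zero(gamma, half_width=1.0, n=20000):
    """Grid approximation of the minimizer of g(z) + z^2 / (2*gamma)."""
    grid = [-half_width + 2.0 * half_width * i / n for i in range(n + 1)]
    return min(grid, key=lambda z: g(z) + z * z / (2.0 * gamma))


gamma = 0.3
z_grid = prox_at_zero(gamma)             # brute-force estimate of prox_{gamma g}(0)
z_formula = -(5.0 * gamma / 3.0) ** 3    # closed form stated in the example

if __name__ == "__main__":
    print(z_grid, z_formula)
```

Since the minimizer is nonzero, $0\notin T_{\gamma}(0)$ for this stepsize, in line with the claim that the stationary point $x^{\star}=0$ is not critical.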

Forward-backward envelope

The FBE (2) was introduced and further analyzed in earlier works in the case when $g$ is convex. Under such assumption the FBE was shown to be continuously differentiable, which made it possible to derive minimization algorithms based on its gradient. In the general setting addressed in this paper the FBE might fail to be (continuously) differentiable, and as such we need to resort to gradient-free methods. This task will be addressed in Section 5, where ZeroFPR will be proposed; other than being applicable to a wider range of problems, the proposed scheme is entirely based on the same oracle as forward-backward iterations, unlike those earlier approaches, which instead require the computation of $\nabla^{2}f$ . All this will be possible thanks to continuity properties of the FBE, and to the behavior of $\varphi_{\gamma}$ at critical points. We now focus on its continuity, while the other property will be addressed shortly after in Section 4.2.

[Alternative expressions for $\varphi_{\gamma}$ ] By expanding the square and rearranging the terms in the definition (2), $\varphi_{\gamma}$ can equivalently be expressed as

$\varphi_{\gamma}(x){}={}\inf_{z\in\R^{n}}\set{f(x)-\tfrac{\gamma}{2}\|\nabla f(x)\|^{2}+g(z)+\tfrac{1}{2\gamma}\|z-x+\gamma\nabla f(x)\|^{2}}$ .

Comparing with (7), it is apparent that the set of minimizers $z$ in the above expression coincides with $T_{\gamma}(x)$ , the forward-backward operator at $x$ . Moreover, taking out the constant term $f(x){}-{}\tfrac{\gamma}{2}\|\nabla f(x)\|^{2}$ from the infimum we immediately obtain the following expression involving the Moreau envelope of $g$ :

$\varphi_{\gamma}(x){}={}f(x)-\tfrac{\gamma}{2}\|\nabla f(x)\|^{2}+g^{\gamma}(x-\gamma\nabla f(x))$ . (11)

Other than providing an explicit way of computing the FBE, (11) emphasizes how $\varphi_{\gamma}$ inherits the regularity properties of the Moreau envelope of $g$ . In particular, the next key property follows from the strict continuity of $g^{\gamma}$ [44, Ex. 10.32].

Proposition (Strict continuity of $\varphi_{\gamma}$ ). For any $\gamma\in(0,\gamma_{g})$ , the FBE $\varphi_{\gamma}$ is a real-valued and strictly continuous function on $\R^{n}$ .

In the next section we will see the fundamental qualitative similarities between the FBE and the Moreau envelope. Namely, for $\gamma$ small enough both $\varphi^{\gamma}$ and $\varphi_{\gamma}$ are lower bounds for the original function $\varphi$ with same minimizers and minimum; in particular the minimization of $\varphi$ is equivalent to that of $\varphi^{\gamma}$ or $\varphi_{\gamma}$ . Similarly, the identity

4.2. Basic properties

We now provide bounds relating $\varphi_{\gamma}$ to the original function $\varphi$ that extend the well known inequalities involving the Moreau envelope.

Proposition. Let $\gamma\in(0,\gamma_{g})$ be fixed. Then

$\varphi_{\gamma}(x){}\leq{}\varphi(x)$ for all $x\in\R^{n}$ ;

$\varphi(\bar{x}){}\leq{}\varphi_{\gamma}(x){}-{}\tfrac{1-\gamma L_{f}}{2\gamma}\|x-\bar{x}\|^{2}$ for all $x\in\R^{n}$ and $\bar{x}\in T_{\gamma}(x)$ .

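1 is obvious from the definition of the FBE (consider $z=x$ in (2)). As to 2, since the set of minimizers in (2) is $T_{\gamma}(x)$ (cf. (12b)), (5) yields

$\varphi(\bar{x}){}\leq{}f(x)+\innprod{\nabla f(x)}{\bar{x}-x}+\tfrac{L_{f}}{2}\|\bar{x}-x\|^{2}+g(\bar{x}){}={}\varphi_{\gamma}(x)-\tfrac{1-\gamma L_{f}}{2\gamma}\|x-\bar{x}\|^{2}$ .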

With respect to the inequalities holding for convex $g$ treated in earlier work, the lower bound in the proposition above is weaker, while the upper bound is unchanged. Regardless, an immediate consequence of the result is that the value of $\varphi$ and $\varphi_{\gamma}$ at critical points is the same, and minimizers and infima of the two functions coincide for $\gamma$ small enough.

Theorem. The following hold:

$\varphi(x)=\varphi_{\gamma}(x)$ for all $\gamma\in(0,\gamma_{g})$ and $x\in\fix T_{\gamma}$ ;

$\inf\varphi=\inf\varphi_{\gamma}$ and $\argmin\varphi=\argmin\varphi_{\gamma}$ for all $\gamma{}\in{}\bigl{(}0,\min\set{\nicefrac{{1}}{{L_{f}}},\gamma_{g}}\bigr{)}$ .

The bound $\gamma<\nicefrac{{1}}{{L_{f}}}$ in Item 2 is tight even when $f$ and $g$ are convex, as the counterexample with $f(x)=\frac{1}{2}x^{2}$ and $g=\indicator_{\R_{+}}$ shows (see [45, Ex. 2.4] for details).
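The bounds and the equivalence of infima stated above can be checked numerically on a small nonconvex instance. The Python sketch below evaluates the FBE through its Moreau-envelope expression, $\varphi_{\gamma}(x)=f(x)-\tfrac{\gamma}{2}\|\nabla f(x)\|^{2}+g^{\gamma}(x-\gamma\nabla f(x))$, for the hypothetical choice $f=\sin$ (so $L_{f}=1$) and $g$ the indicator of the finite set $\set{-1,0,2}$; the instance, the grid and all names are assumptions made for illustration only.

```python
import math

S = (-1.0, 0.0, 2.0)   # g is the indicator of this finite (nonconvex) set
GAMMA = 0.5            # stepsize, below 1/L_f = 1 for f = sin

f = math.sin
grad_f = math.cos


def phi(x):
    """Original cost f + g (finite only on S)."""
    return f(x) if x in S else math.inf


def T(x, tol=1e-12):
    """Forward-backward operator: projection of the gradient step onto S."""
    y = x - GAMMA * grad_f(x)
    dist = min(abs(s - y) for s in S)
    return [s for s in S if abs(s - y) <= dist + tol]


def fbe(x):
    """FBE: f - (gamma/2)|grad f|^2 + Moreau envelope of g at the gradient step."""
    y = x - GAMMA * grad_f(x)
    env = min((s - y) ** 2 for s in S) / (2.0 * GAMMA)
    return f(x) - 0.5 * GAMMA * grad_f(x) ** 2 + env


grid = [i / 100.0 for i in range(-300, 301)]
# Upper bound: the FBE never exceeds the original cost.
upper_ok = all(fbe(s) <= phi(s) + 1e-12 for s in S)
# Lower bound: phi at a forward-backward point is below the FBE minus a quadratic term.
lower_ok = all(
    phi(xb) <= fbe(x) - (1.0 - GAMMA) / (2.0 * GAMMA) * (x - xb) ** 2 + 1e-12
    for x in grid for xb in T(x)
)
# Infima of phi and of the FBE (over the grid, which contains the minimizer -1).
min_gap = abs(min(fbe(x) for x in grid) - min(phi(s) for s in S))

if __name__ == "__main__":
    print(upper_ok, lower_ok, min_gap)
```

The check only samples a grid, so it illustrates the statements rather than proving them; the grid is built so that it contains the minimizer $x=-1$ exactly.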

Although we will address problem (1) by simply exploiting the continuity of the FBE, $\varphi_{\gamma}$ enjoys favorable properties which are key for the efficacy of the method which will be discussed in Section 5. Firstly, observe that, due to strict continuity, $\varphi_{\gamma}$ is almost everywhere differentiable, as follows from Rademacher’s theorem. The same applies to the mapping $x\mapsto x-\gamma\nabla f(x)$ , its Jacobian being

$Q_{\gamma}(x){}\coloneqq{}\id-\gamma\nabla^{2}f(x)$ (13)

which is symmetric wherever it exists [44, Cor. 13.42 and Prop. 13.34]. However, in order to show that the proposed method achieves fast convergence we need additional regularity properties, namely (strict) twice differentiability at critical points and continuous differentiability around. The rest of the section is dedicated to this task.

4.3. Prox-regularity and first-order properties

In the favorable case in which $g$ is convex and $f\in C^{2}(\R^{n})$ , the FBE enjoys global continuous differentiability. In our setting, prox-regularity acts as a surrogate of convexity; the interested reader is referred to [44, §13.F] for a detailed discussion.

Definition (Prox-regularity). Function $g$ is said to be prox-regular at $x_{0}$ for $v_{0}\in\partial g(x_{0})$ if there exist $\rho,\varepsilon>0$ such that for all $x^{\prime}\in\ball{x_{0}}{\varepsilon}$ and

$(x,v)\in\graph\partial g$ with $x\in\ball{x_{0}}{\varepsilon}$ , $v\in\ball{v_{0}}{\varepsilon}$ and $g(x)\leq g(x_{0})+\varepsilon$

it holds that $g(x^{\prime}){}\geq{}g(x){}+{}\innprod{v}{x^{\prime}-x}{}-{}\tfrac{\rho}{2}\|x^{\prime}-x\|^{2}$ .

Prox-regularity is a mild requirement enjoyed globally and for any subgradient by all convex functions, with $\varepsilon=+\infty$ and $\rho=0$ . When $g$ is prox-regular at $x_{0}$ for $v_{0}$ , then for sufficiently small $\gamma>0$ the Moreau envelope $g^{\gamma}$ is continuously differentiable in a neighborhood of $x_{0}+\gamma v_{0}$ . For our purposes, when needed, prox-regularity of $g$ will be required only at critical points $x^{\star}$ , and only for the subgradient $-\nabla f(x^{\star})$ . Therefore, with a slight abuse of terminology we define prox-regularity of critical points as follows.

Definition (Prox-regularity of critical points). We say that a critical point $x^{\star}$ is prox-regular if $g$ is prox-regular at $x^{\star}$ for $-\nabla f(x^{\star})$ .

Examples where a critical point fails to be prox-regular are of challenging construction; before illustrating a cumbersome such instance later in this section, we first prove an important result that connects prox-regularity with first-order properties of the FBE.

Theorem (Continuous differentiability of $\varphi_{\gamma}$ ). Suppose that $f$ is of class $C^{2}$ around a prox-regular critical point $x^{\star}$ . Then, for all $\gamma\in(0,\Gamma(x^{\star}))$ there exists a neighborhood $U_{x^{\star}}$ of $x^{\star}$ on which the following properties hold:

$T_{\gamma}$ and $R_{\gamma}$ are strictly continuous, and in particular single-valued;

$\varphi_{\gamma}\in C^{1}$ with $\nabla\varphi_{\gamma}{}={}Q_{\gamma}R_{\gamma}$ , where $Q_{\gamma}$ is as in (13).

For $\gamma^{\prime}\in(\gamma,\Gamma(x^{\star}))$ , using the properties of critical points established in Section 3 we obtain that

Replacing $\gamma^{\prime}$ with $\gamma$ in the above expression, the inequality is strict for all $x\neq x^{\star}$ . From [40, Thm. 4.4] applied to the “tilted” function $x{}\mapsto{}g(x+x^{\star}){}-{}g(x^{\star}){}-{}\innprod{\nabla f(x^{\star})}{x}$ it follows that there is a neighborhood $V$ of $\Fw{x^{\star}}$ in which $\prox_{\gamma g}$ is strictly continuous and $g^{\gamma}$ is of class $C^{1+}$ with gradient $\nabla g^{\gamma}(x){}={}\gamma^{-1}\left(x-\prox_{\gamma g}(x)\right)$ for all $x\in V$ . By possibly narrowing $U_{x^{\star}}$ , we may assume that $f\in C^{2}(U_{x^{\star}})$ and $\Fw x\in V$ for all $x\in U_{x^{\star}}$ . 2 then follows from (11) and the chain rule of differentiation, and 1 from the fact that strict continuity is preserved by composition.
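The gradient formula $\nabla\varphi_{\gamma}=Q_{\gamma}R_{\gamma}$ can be compared against finite differences on a simple instance. The Python sketch below assumes that the residual is $R_{\gamma}(x)=\tfrac{1}{\gamma}(x-T_{\gamma}(x))$ and that $Q_{\gamma}(x)=\id-\gamma\nabla^{2}f(x)$; it uses the hypothetical one-dimensional choice $f(x)=\log(1+e^{x})$ (so $L_{f}=\nicefrac{1}{4}$) and the convex, hence everywhere prox-regular, function $g=|\cdot|$. Names, stepsize and test points are illustrative assumptions.

```python
import math

GAMMA = 1.0  # below 1/L_f = 4 for the softplus function


def f(x):
    return math.log1p(math.exp(x))        # softplus, twice continuously differentiable


def df(x):
    return 1.0 / (1.0 + math.exp(-x))     # first derivative (sigmoid)


def d2f(x):
    s = df(x)
    return s * (1.0 - s)                  # second derivative, bounded by 1/4


def prox_abs(y, t):
    return math.copysign(max(abs(y) - t, 0.0), y)


def T(x):
    return prox_abs(x - GAMMA * df(x), GAMMA)


def R(x):
    """Forward-backward residual, assumed here to be (x - T(x)) / gamma."""
    return (x - T(x)) / GAMMA


def fbe(x):
    """FBE via the Moreau-envelope expression, with g = |.|."""
    y = x - GAMMA * df(x)
    z = prox_abs(y, GAMMA)
    return f(x) - 0.5 * GAMMA * df(x) ** 2 + abs(z) + (z - y) ** 2 / (2.0 * GAMMA)


def grad_formula(x):
    return (1.0 - GAMMA * d2f(x)) * R(x)  # Q_gamma(x) * R_gamma(x)


def grad_fd(x, h=1e-6):
    return (fbe(x + h) - fbe(x - h)) / (2.0 * h)  # central difference


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.4, 2.5, 5.0):
        print(x, grad_formula(x), grad_fd(x))
```

Evaluating the formula requires the second derivative of $f$, whereas evaluating $\varphi_{\gamma}$ itself needs only the gradient and the proximal mapping.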

When $f=0$ , the theorem above restates the known fact that if $g$ is prox-regular at $x^{\star}$ for $0\in\partial g(x^{\star})$ , then $g^{\gamma}$ is continuously differentiable around $x^{\star}$ with $\nabla g^{\gamma}(x)=\frac{1}{\gamma}(x-\prox_{\gamma g}(x))$ . Notice that the bound $\gamma<\Gamma(x^{\star})$ is tight: in general, for $\gamma=\Gamma(x^{\star})$ neither continuity of $T_{\gamma}$ nor continuous differentiability of $\varphi_{\gamma}$ around $x^{\star}$ can be guaranteed. In fact, even when $x^{\star}$ is $\Gamma(x^{\star})$ -critical, $T_{\gamma}$ might even fail to be single-valued and $\varphi_{\gamma}$ differentiable at $x^{\star}$ , as the following counterexample shows.

Example (Why $\gamma\neq\Gamma(x^{\star})$ in first-order properties). Consider $f(x)=\frac{1}{2}x^{2}$ and $g=\indicator_{S}$ where $S=\set{0,1}$ . Then, $L_{f}=1$ , $\gamma_{g}=+\infty$ , $T_{\gamma}(x)=\proj_{S}((1-\gamma)x)$ and the FBE is $\varphi_{\gamma}(x){}={}\tfrac{1-\gamma}{2}\|x\|^{2}{}+{}\tfrac{1}{2\gamma}\dist((1-\gamma)x,S)^{2}$ . At the critical point $x=1$ , which satisfies $\Gamma(1)=\nicefrac{{1}}{{2}}$ , $g$ is prox-regular for any subgradient. For any $\gamma\in(0,\nicefrac{{1}}{{2}})$ it is easy to see that $\varphi_{\gamma}$ is differentiable in a neighborhood of $x=1$ . However, for $\gamma=\nicefrac{{1}}{{2}}$ the distance function has a first-order singularity in $x=1$ , due to the $2$ -valuedness of $T_{\gamma}(1)=\proj_{S}(\nicefrac{{1}}{{2}})=\set{0,1}$ .
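The kink described in this counterexample is easy to observe numerically. The following Python sketch (with hypothetical helper names and an arbitrary finite-difference step) computes one-sided difference quotients of $\varphi_{\gamma}$ at $x=1$: for $\gamma=\nicefrac{1}{4}$ both slopes are close to zero, whereas for $\gamma=\nicefrac{1}{2}$ the left slope is close to $1$ and the right slope close to $0$.

```python
S = (0.0, 1.0)


def fbe(x, gamma):
    """FBE for f(x) = x^2/2 and g the indicator of S = {0, 1}."""
    y = (1.0 - gamma) * x
    return 0.5 * (1.0 - gamma) * x * x + min((s - y) ** 2 for s in S) / (2.0 * gamma)


def T(x, gamma, tol=1e-12):
    """Forward-backward operator: set-valued projection of (1 - gamma) x onto S."""
    y = (1.0 - gamma) * x
    dist = min(abs(s - y) for s in S)
    return [s for s in S if abs(s - y) <= dist + tol]


def one_sided_slopes(gamma, x=1.0, h=1e-6):
    """Backward and forward difference quotients of the FBE at x."""
    left = (fbe(x, gamma) - fbe(x - h, gamma)) / h
    right = (fbe(x + h, gamma) - fbe(x, gamma)) / h
    return left, right


if __name__ == "__main__":
    for gamma in (0.25, 0.5):
        print(gamma, T(1.0, gamma), one_sided_slopes(gamma))
```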

Example (Prox-nonregularity of critical points). Consider $\varphi=f+g$ where $f(x)=\tfrac{1}{2}x^{2}$ , $g(x)=\indicator_{S}(x)$ and $S=\set{\nicefrac{{1}}{{n}}}[n\in\N_{\geq 1}]\cup\set{0}$ . For $x_{0}=0$ we have $\Gamma(x_{0})=+\infty$ , however $g$ fails to be prox-regular at $x_{0}$ for $v_{0}=0=-\nabla f(x_{0})$ . For any $\rho>0$ and for any neighborhood $V$ of $(0,0)$ in $\graph g$ it is always possible to find a point arbitrarily close to $(0,-\nicefrac{{1}}{{\rho}})$ with multi-valued projection on $V$ . Specifically, the midpoint $P_{n}{}={}\bigl{(}\frac{1}{2}(\frac{1}{n}+\frac{1}{n+1}),\,{}-\nicefrac{{1}}{{\rho}}\bigr{)}$ has 2-valued projection on $\graph g$ for any $n\in\N_{\geq 1}$ , namely $\proj_{\graph g}(P_{n})=\set{\nicefrac{{1}}{{n}},\nicefrac{{1}}{{n+1}}}$ . By considering a large $n$ , $P_{n}$ can be made arbitrarily close to $(0,-\nicefrac{{1}}{{\rho}})$ and at the same time its projection(s) arbitrarily close to $(0,0)$ . Therefore, $g$ cannot be prox-regular at $x_{0}$ for $v_{0}$ , for otherwise such projections would be single-valued close enough to $(0,0)$ [40, Cor. 3.4 and Thm. 3.5]. As a result, $g^{\gamma}(x)=\frac{1}{2\gamma}\dist(x,S)^{2}$ is not differentiable around $x=0$ , and indeed at each midpoint $\frac{1}{2}(\frac{1}{n}+\frac{1}{n+1})$ for $n\in\N_{\geq 1}$ it has a nonsmooth spike.

To underline how unfortunate the situation depicted in the example above is, notice that adding a linear term $\lambda x$ to $f$ for any $\lambda\neq 0$ , yet leaving $g$ unchanged, restores the desired prox-regularity of each critical point. Indeed, this is trivially true for any nonzero critical point; besides, $g$ is prox-regular at $x_{0}=0$ for any $\lambda\in(0,+\infty)$ , and for $\lambda<0$ we have that $x_{0}=0$ is no longer critical.

4.4. Second-order properties

In this section we discuss sufficient conditions for twice-differentiability of the FBE at critical points. In addition to prox-regularity, which is needed for local continuous differentiability, we will also need generalized second-order properties of $g$ . The interested reader is referred to [44, §13] for an extensive discussion on epi-differentiability.

Assumption. With respect to a given critical point $x^{\star}$

$\nabla^{2}f$ exists and is (strictly) continuous around $x^{\star}$ ;

$g$ is prox-regular and (strictly) twice epi-differentiable at $x^{\star}$ for $-\nabla f(x^{\star})$ , with its second order epi-derivative being generalized quadratic:

where $S\subseteq\R^{n}$ is a linear subspace and $M\in\R^{n\times n}$ . Without loss of generality we take $M$ symmetric, and such that $\operatorname*{Im}(M)\subseteq S$ and $\ker(M)\supseteq S^{\perp}$ . (This can indeed be done without loss of generality: if $M$ and $S$ satisfy (15), then it suffices to replace $M$ with $M^{\prime}=\proj_{S}\frac{M+\trans M}{2}\proj_{S}$ to ensure the desired properties.)

We say that the assumptions are “strictly” satisfied if the stronger conditions in parenthesis hold.

We now show that the quite common properties required in the assumption above are all that is needed for ensuring first-order properties of the proximal mapping and second-order properties of the FBE at critical points.

Theorem (Twice differentiability of $\varphi_{\gamma}$ ). Suppose that the assumption above is (strictly) satisfied with respect to a critical point $x^{\star}$ . Then, for any $\gamma\in(0,\Gamma(x^{\star}))$

$\prox_{\gamma g}$ is (strictly) differentiable at $\Fw{x^{\star}}$ with symmetric and positive semidefinite Jacobian

$R_{\gamma}$ is (strictly) differentiable at $x^{\star}$ with Jacobian

where $Q_{\gamma}$ is as in (13) and $P_{\gamma}$ as in (16);

$\varphi_{\gamma}$ is (strictly) twice differentiable at $x^{\star}$ with symmetric Hessian

Proof. See Appendix A.

Again, when $f\equiv 0$ the theorem above covers the differentiability properties of the proximal mapping (and consequently the second-order properties of the Moreau envelope, due to the identity $\nabla g^{\gamma}(x)=\frac{1}{\gamma}(x-\prox_{\gamma g}(x))$ ), as discussed in the literature.

We now provide a key result that links nonsingularity of the Jacobian of the forward-backward residual $R_{\gamma}$ to strong (local) minimality for the original cost $\varphi$ and for the FBE $\varphi_{\gamma}$ , under the generalized second-order properties of Section 4.4.

Theorem (Conditions for strong local minimality). Suppose that the second-order assumption of Section 4.4 is satisfied with respect to a critical point $x^{\star}$ , and let $\gamma{}\in{}(0,\min\set{\Gamma(x^{\star}),\nicefrac{{1}}{{L_{f}}}})$ . The following are equivalent:

$x^{\star}$ is a strong local minimum for $\varphi$ ;

$x^{\star}$ is a local minimum for $\varphi$ and $JR_{\gamma}(x^{\star})$ is nonsingular;

the (symmetric) matrix $\nabla^{2}\varphi_{\gamma}(x^{\star})$ is positive definite;

$x^{\star}$ is a strong local minimum for $\varphi_{\gamma}$ ;

$x^{\star}$ is a local minimum for $\varphi_{\gamma}$ and $JR_{\gamma}(x^{\star})$ is nonsingular.

Proof. See Appendix A.

ZeroFPR algorithm

The first algorithmic framework exploiting the FBE for solving composite minimization problems was studied in earlier work, and other schemes have been investigated more recently. All such methods tackle the problem by looking for a (local) minimizer of the FBE, exploiting the equivalence of (local) minimality for the original function $\varphi$ and for the FBE $\varphi_{\gamma}$ , for $\gamma$ small enough. To do so, they all employ the concept of directions of descent, thus requiring the gradient of the FBE to be well defined everywhere. In the more general framework addressed in this paper, such basic requirement is not met, which is why we approach the problem from a different perspective. This leads to ZeroFPR, the first algorithm, to the best of our knowledge, that achieves superlinear convergence rates despite requiring only the black-box oracle of FBS and being suited for fully nonconvex problems.

Instead of directly addressing the minimization of $\varphi$ or $\varphi_{\gamma}$ , we seek solutions of the following nonlinear inclusion (generalized equation)

find $x^{\star}\in\R^{n}$ such that $0\in R_{\gamma}(x^{\star})$ . (20)

By doing so we address the problem from the same perspective as FBS, that is, finding fixed points of the forward-backward operator $T_{\gamma}$ or, equivalently, zeros of its residual $R_{\gamma}$ . Although $R_{\gamma}$ might be quite irregular when $g$ is nonconvex, it enjoys favorable properties at the very solutions to (20), i.e., at $\gamma$ -critical points, starting from single-valuedness (cf. property 3 of critical points in Section 3). If mild assumptions are met, $R_{\gamma}$ turns out to be continuous around and even differentiable at critical points (cf. Sections 4.3 and 4.4), and as a consequence the inclusion problem (20) reduces to a well behaved system of equations, as opposed to generalized equations, when close to solutions.

This motivates addressing problem (20) with fast methods for nonlinear equations. Newton-like schemes are iterative methods that prescribe updates of the form

$x^{+}{}={}x-HR_{\gamma}(x)$ (21)

which essentially amount to selecting $H=H(x)$ , a linear operator that ideally carries information of the geometry of $R_{\gamma}$ around $x$ , in the attempt to yield an optimal iterate $x^{+}$ . For instance, when $R_{\gamma}$ is sufficiently regular Newton's method corresponds to selecting $H$ as the inverse of an element of the generalized Jacobian of $R_{\gamma}$ at $x$ , enabling fast convergence when close to a solution under some assumptions. However, selecting $H$ as in Newton's method would require information additional to the forward-backward oracle $T_{\gamma}$ , and as such it goes beyond the scope of the paper. For this reason we focus instead on quasi-Newton schemes, in which $H$ are linear operators recursively defined with low-rank updates that satisfy the (inverse) secant condition

$H^{+}y{}={}s$ , where $s=x^{+}-x$ and $y=R_{\gamma}(x^{+})-R_{\gamma}(x)$ .

A famous result states that, under mild assumptions and starting sufficiently close to a solution $x^{\star}$ , updates as in (21) are superlinearly convergent to $x^{\star}$ iff the Dennis-Moré condition holds, namely the limit $\frac{\|(H^{-1}-JR_{\gamma}(x^{\star}))s\|}{\|s\|}{}\to{}0$ . More recently, in the result was extended to generalized equations of the form $f(x)+G(x)\ni 0$ , where $f$ is smooth and $G$ possibly set-valued. The study focuses on Josephy-Newton methods where the update $x^{+}$ is the solution of the inner problem $f(x)-Bx\in Bx^{+}+G(x^{+})$ , where $B=H^{-1}$ , which can be interpreted as a forward-backward step in the metric induced by $B$ . In particular, unlike ZeroFPR proposed here, the method in has the crucial limitation that, unless the operator $B$ has a very particular structure, the backward step $(B+G)^{-1}$ may be prohibitively challenging.

Quasi-Newton schemes are powerful and widely used methods. However, it is well known that they are effective only when close enough to a solution and might even diverge otherwise. Coping with this crucial downside requires a globalization strategy; this is usually a linesearch over a suitable merit function $\psi$ , along directions of descent for $\psi$ so as to ensure sufficient decrease for small enough stepsizes. Unfortunately, the potential choice $\psi(x)=\frac{1}{2}\|R_{\gamma}(x)\|^{2}$ is not regular enough for a ‘direction of descent’ to be everywhere defined. The proposed algorithm (cf. Section 5) bypasses this limitation by exploiting the favorable properties of the FBE.

Globalizing the convergence of any fast local method is the core contribution of ZeroFPR, an algorithm that exploits the favorable properties of the FBE, and that requires exactly the same oracle as FBS. Conceptually, ZeroFPR is really elementary; for simplicity, let us first consider the monotone case, i.e., with $p_{k}\equiv 1$ so that $\bar{\Phi}_{k}=\varphi_{\gamma}(x^{k})$ (cf. 5). The following steps are executed for updating iterate $x^{k}$ :

first, at 1 a nominal forward-backward call yields an element $\bar{x}^{k}\in T_{\gamma}(x^{k})$ that decreases the value of $\varphi_{\gamma}$ by at least $\gamma\frac{1-\gamma L_{f}}{2}\|r^{k}\|^{2}$ (item 1);

then, at 3 an update direction $d^{k}$ at $\bar{x}^{k}$ (not at $x^{k}$ !) is selected;

because of the sufficient decrease $x^{k}\mapsto\bar{x}^{k}$ on $\varphi_{\gamma}$ and the continuity of $\varphi_{\gamma}$ , at 4 a stepsize $\tau_{k}$ can be found with finitely many backtrackings $\tau_{k}\leftarrow\beta\tau_{k}$ that ensures a decrease for $\varphi_{\gamma}$ of at least $\sigma\|r^{k}\|^{2}$ in the update $x^{k}\mapsto\bar{x}^{k}+\tau_{k}d^{k}$ , for any $\sigma<\gamma\frac{1-\gamma L_{f}}{2}$ .
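The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the test problem ($f(x)=\tfrac12\|x-b\|^{2}$ with $L_{f}=1$ and $g=\lambda\|\cdot\|_{1}$), the function names, the placeholder direction $d^{k}=T_{\gamma}(\bar{x}^{k})-\bar{x}^{k}$ and the cap on the number of backtrackings are all hypothetical choices made for the example.

```python
import math

def soft_threshold(v, t):
    """Proximal mapping of t*||.||_1 (illustrative choice of g)."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def zerofpr_monotone(x0, b, lam, gamma, sigma, beta=0.5, tol=1e-8, max_iter=200):
    """Monotone ZeroFPR sketch for f(x) = 0.5*||x - b||^2, g = lam*||x||_1."""
    def fb_step(x):
        # forward (gradient) step on f followed by backward (prox) step on g
        fw = [xi - gamma * (xi - bi) for xi, bi in zip(x, b)]
        return soft_threshold(fw, gamma * lam)

    def fbe(x):
        # FBE = optimal value of the problem defining the forward-backward step
        xb = fb_step(x)
        val = sum(0.5 * (xi - bi) ** 2 + (xi - bi) * (xbi - xi)
                  + lam * abs(xbi) + (xbi - xi) ** 2 / (2.0 * gamma)
                  for xi, bi, xbi in zip(x, b, xb))
        return val, xb

    x, history = list(x0), []
    for _ in range(max_iter):
        phi, xbar = fbe(x)                       # nominal forward-backward call
        history.append(phi)
        r = [(xi - xbi) / gamma for xi, xbi in zip(x, xbar)]
        r_sq = sum(ri * ri for ri in r)
        if math.sqrt(r_sq) <= tol:
            break
        # direction at xbar (not at x); placeholder choice d = T(xbar) - xbar
        d = [ti - xbi for ti, xbi in zip(fb_step(xbar), xbar)]
        # backtracking linesearch on the FBE, starting from tau = 1
        tau, x_new = 1.0, xbar                   # fallback: plain FBS step
        for _ in range(60):                      # safety cap (assumption)
            cand = [xbi + tau * di for xbi, di in zip(xbar, d)]
            if fbe(cand)[0] <= phi - sigma * r_sq:
                x_new = cand
                break
            tau *= beta
        x = x_new
    return x, history
```

With $\gamma=0.5$ and $L_{f}=1$ one may take, for instance, $\sigma=0.0625<\gamma\frac{1-\gamma L_{f}}{2}=0.125$. For this particular test problem the minimizer is the soft-thresholding of $b$ at level $\lambda$, which gives a simple way to check the output.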

In order to reduce the number of backtrackings, $p_{k}<1$ can be selected resulting in a nonmonotone linesearch. The sufficient decrease is enforced with respect to a parameter $\bar{\Phi}_{k}\geq\varphi_{\gamma}(x^{k})$ (cf. section 5.1.1), namely a convex combination of $\set{\varphi_{\gamma}(x^{i})}_{i=0}^{k}$ . For the sake of convergence, $\seq{p_{k}}$ can be selected arbitrarily in $(0,1]$ as long as it is bounded away from $0$ , hence the role of the user-set lower bound $p_{\rm min}$ . Consequently, small values of $\sigma$ and $p_{k}$ concur in reducing conservatism in the linesearch by favoring larger stepsizes. {lem}[Nonmonotone linesearch globalization]For all $k\in\N$ the iterates generated by ZeroFPR satisfy

and there exists $\bar{\tau}_{k}>0$ such that

In particular, the number of backtrackings at 4 is finite. {proof} The first two inequalities in (23) are due to item 2. Moreover,

where the inequality follows by the linesearch condition (19); this proves the last inequality in (23). As to (24), let $k$ be fixed and contrary to the claim suppose that for all $\varepsilon>0$ there exists $\tau_{\varepsilon}\in[0,\varepsilon]$ such that the point $x_{\varepsilon}{}={}\bar{x}^{k}+\tau_{\varepsilon}d^{k}$ satisfies $\varphi_{\gamma}(x_{\varepsilon}){}>{}\varphi_{\gamma}(x^{k}){}-{}\sigma\|x^{k}-\bar{x}^{k}\|^{2}$ . Taking the limit for $\varepsilon\to 0^{+}$ , continuity of $\varphi_{\gamma}$ as ensured by eq. 11 yields

where the last inequality is due to the fact that $x^{k}\neq\bar{x}^{k}$ . This contradicts item 2; therefore, there exists $\bar{\tau}_{k}>0$ such that $\varphi_{\gamma}(\bar{x}^{k}+\tau d^{k}){}\leq{}\varphi_{\gamma}(x^{k}){}-{}\sigma\|x^{k}-\bar{x}^{k}\|^{2}$ for all $\tau\in[0,\bar{\tau}_{k}]$ . By combining this with (23) the claim follows. Section 5.1.1 ensures that regardless of the choice of $d^{k}$ , ZeroFPR does not get stuck in infinite loops. In Section 5.4 we will also show that the algorithm returns solutions of problem (20), and that under mild assumptions at the limit point the convergence rate is superlinear when good directions are selected at 3. Before going through the technicalities, we briefly anticipate what such good directions are.

1.2. Choice of the directions: quasi-Newton methods

As already emphasized, fast convergence of ZeroFPR will be obtained thanks to the employment of Newton-like directions $d^{k}$ . Differently from the classical Newton-like step (21), when stepsize $1$ is accepted, the update in ZeroFPR is of the form $x^{+}=\bar{x}+d$ rather than $x^{+}=x+d$ , where $\bar{x}$ is an element of $T_{\gamma}(x)$ . Therefore, $d$ needs to be a Newton-like direction at $\bar{x}$ , and not at $x$ , namely

(as opposed to $\bar{r}^{k}\in R_{\gamma}(x^{k})$ ).

We consider a modified Broyden’s scheme that performs rank-one updates of the form

and $\bar{\vartheta}\in(0,1)$ is a fixed parameter, with the convention $\sign 0=1$ . Starting from an invertible matrix $H_{0}$ this selection ensures that all matrices $H_{k}$ are invertible.
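As an illustration of how such a rank-one update keeps $H_{k}$ invertible, the following Python sketch implements a Powell-type damped Broyden update in inverse form. The precise formula is an assumption (update (26) is not reproduced in this excerpt); the names `broyden_update` and `theta_bar` are hypothetical. The damping parameter $\vartheta$ is chosen so that the determinant of the updated matrix is a nonzero multiple of that of $H$, and $H^{-1}$ is never formed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def broyden_update(H, s, y, theta_bar=1e-4):
    """Damped rank-one update of the inverse-Jacobian approximation H.

    Assumed (Powell-type) rule: with gam = <H y, s> / ||s||^2,
    theta = 1 if |gam| >= theta_bar, otherwise
    theta = (1 - sign(gam) * theta_bar) / (1 - gam), with sign(0) = 1.
    """
    n = len(s)
    Hy = matvec(H, y)
    gam = dot(Hy, s) / dot(s, s)
    if abs(gam) >= theta_bar:
        theta = 1.0
    else:
        sign = 1.0 if gam >= 0 else -1.0
        theta = (1.0 - sign * theta_bar) / (1.0 - gam)
    # H*y_tilde, where y_tilde = (1 - theta) * H^{-1} s + theta * y
    Hyt = [(1.0 - theta) * si + theta * hyi for si, hyi in zip(s, Hy)]
    denom = dot(s, Hyt)                      # nonzero by the choice of theta
    sTH = [sum(s[i] * H[i][j] for i in range(n)) for j in range(n)]
    u = [(si - hi) / denom for si, hi in zip(s, Hyt)]
    # H_new = H + (s - H*y_tilde) (s^T H) / <s, H*y_tilde>
    return [[H[i][j] + u[i] * sTH[j] for j in range(n)] for i in range(n)]
```

When no damping is needed ($\vartheta=1$) the updated matrix satisfies the inverse secant condition $H_{k+1}y_{k}=s_{k}$; when $\langle H_{k}y_{k},s_{k}\rangle$ is close to zero, the plain Broyden update would be (nearly) singular, and the damping prevents this.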

BFGS method consists in the following update rule for matrices $H_{k}$ in (25): starting from a symmetric and positive definite $H_{0}$ ,

with $s_{k}=x^{k+1}-\bar{x}^{k}$ and $y_{k}=r^{k+1}-\bar{r}^{k}$ , see e.g., [34, §6.1]. BFGS is the most popular quasi-Newton scheme; it is based on rank-two updates that, in addition to the secant condition, also enforce symmetry. In fact, BFGS is guaranteed to satisfy the Dennis-Moré condition only provided that the Jacobian of the nonlinear system at the limit point is symmetric . Although this is not the case for $JR_{\gamma}(x^{\star})$ , we observed in practice that BFGS directions (27) perform extremely well.

Ultimately, instead of storing and operating on dense $m\times m$ matrices, limited-memory variants of quasi-Newton schemes keep in memory only a few (usually $3$ to $20$ ) most recent pairs $(s^{k},y^{k})$ implicitly representing the approximate inverse Jacobian. Their employment considerably reduces storage and computations over the full-memory counterparts, and as such they are the methods of choice for large-scale problems. The most popular limited-memory method is L-BFGS: based on BFGS, it efficiently computes matrix-vector products with the approximate inverse Jacobian using a two-loop recursion procedure .
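For concreteness, here is a minimal sketch of the standard two-loop recursion in plain Python. It computes the product $Hv$ from the stored pairs only, without ever forming a matrix. The function name `lbfgs_apply`, the list-based storage and the initial scaling $\langle s,y\rangle/\langle y,y\rangle$ are illustrative assumptions; each pair is assumed to satisfy $\langle s,y\rangle>0$. In ZeroFPR one would then take $d^{k}=-H\bar{r}^{k}$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_apply(pairs, v):
    """Return H v, with H the L-BFGS matrix defined by pairs [(s, y), ...].

    pairs are ordered from oldest to most recent; each must have <s, y> > 0.
    """
    q = list(v)
    coeffs = []
    # first loop: from the most recent pair to the oldest one
    for s, y in reversed(pairs):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        coeffs.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # initial matrix: scaled identity (a common heuristic, assumed here)
    if pairs:
        s, y = pairs[-1]
        scale = dot(s, y) / dot(y, y)
    else:
        scale = 1.0
    r = [scale * qi for qi in q]
    # second loop: from the oldest pair to the most recent one
    for (s, y), (a, rho) in zip(pairs, reversed(coeffs)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r
```

A quick sanity check is the secant condition on the most recent pair: applying the operator to the latest $y$ must return the latest $s$, regardless of the initial scaling.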

2. Connections with other methods

The first algorithmic framework exploiting the FBE was studied in , where two semismooth Newton methods were analyzed for convex $f$ and $g$ with $f\in C^{2,1}(\R^{n})$ (twice continuously differentiable with Lipschitz continuous gradient). A generalization of the scheme was then studied in under less restrictive assumptions, with particular attention to quasi-Newton directions in place of semismooth Newton methods. The proposed algorithm interleaves descent steps over the FBE with forward-backward steps. then analyzed global and linear convergence properties of a generic linesearch algorithmic framework for minimizing the FBE based on gradient-related directions, for analytic $f$ and subanalytic, convex, and lower bounded $g$ .

Though apparently closely related, the approach that we provide in this paper presents major conceptual differences from any of the ones above. Apart from the significantly less restrictive assumptions, the crucial distinction is that our method is derivative-free, i.e., it does not require the gradient of the FBE. As a consequence, neither the computation nor the existence of $\nabla^{2}f$ is required, resulting in a method that, unlike the others, truly relies on the very same oracle information of the forward-backward operator $T_{\gamma}$ .

3. Main remarks

In this section we list a few observations that come in handy when implementing ZeroFPR. {rem}[Adaptive variant when $L_{f}$ is unknown]In practice, no prior knowledge of the global Lipschitz constant $L_{f}$ is required for ZeroFPR. In fact, replacing $L_{f}$ with an initial estimate $L>0$ and fixing a backtracking ratio $\alpha\in(0,1)$ , after 2 the following instruction can be added:

[Support for locally Lipschitz $\nabla f$ ]If $\dom g$ is bounded and, as it is reasonable, the directions $\seq{d^{k}}$ selected at 3 do not diverge, then Item 1 on $f$ can be relaxed to $\nabla f$ being locally Lipschitz.

In fact, it follows from the definition of proximal mapping that $\seq{\bar{x}^{k}}\subseteq\dom g$ , and if the directions are bounded then there exists a compact domain $\Omega\supseteq\dom g$ such that $\seq{x^{k}}\subseteq\Omega$ . Then, all results of the paper apply by replacing $L_{f}$ with $\lip_{\Omega}\nabla f$ , the (finite) Lipschitz constant of $\nabla f$ on $\Omega$ .

[Cost per iteration]Evaluating $\varphi_{\gamma}$ essentially amounts to one evaluation of $T_{\gamma}$ ; this is evident from the expression (11), together with the observation that $g^{\gamma}(\Fw x){}={}g(\bar{x})+\tfrac{1}{2\gamma}\|\Fw x-\bar{x}\|^{2}$ for any $\bar{x}\in T_{\gamma}(x)$ . Therefore, computing $\varphi_{\gamma}(\bar{x}^{k}+\tau_{k}d^{k})$ at 4 yields an element $\bar{x}^{k+1}\in T_{\gamma}(x^{k+1})$ required in 1, since $x^{k+1}=\bar{x}^{k}+\tau_{k}d^{k}$ at every iteration. In general, one evaluation of $T_{\gamma}$ per backtracking step is required. If the directions $d^{k}$ are computed with Broyden or BFGS methods (26) and (27), then one additional evaluation of $T_{\gamma}$ is required for retrieving $d^{k}$ ; in the best case of $\tau_{k}=1$ being accepted, which asymptotically happens under mild assumptions (cf. section 5.4.2), the algorithm then requires exactly two evaluations of $T_{\gamma}$ per iteration.
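The following Python sketch illustrates why one call to $T_{\gamma}$ suffices to evaluate the FBE. It assumes that expression (11) has the usual form $\varphi_{\gamma}(x)=f(x)-\tfrac{\gamma}{2}\|\nabla f(x)\|^{2}+g^{\gamma}(\Fw x)$ with $\Fw x=x-\gamma\nabla f(x)$, and it uses the hypothetical test problem $f(x)=\tfrac12\|x-b\|^{2}$, $g=\lambda\|\cdot\|_{1}$; the function name is likewise an illustrative choice.

```python
import math

def fbe_and_fb_point(x, b, lam, gamma):
    """Return (FBE value at x, xbar in T_gamma(x)) with one prox evaluation.

    Illustrative problem: f(x) = 0.5*||x - b||^2, g(x) = lam*||x||_1.
    """
    grad = [xi - bi for xi, bi in zip(x, b)]                 # gradient of f
    fw = [xi - gamma * gi for xi, gi in zip(x, grad)]        # forward point
    # backward step: prox of gamma*g is soft-thresholding at gamma*lam
    xbar = [math.copysign(max(abs(v) - gamma * lam, 0.0), v) for v in fw]
    grad_sq = sum(gi * gi for gi in grad)
    f_x = 0.5 * grad_sq                                      # since grad = x - b
    # Moreau envelope of g at fw, read off from xbar at no extra cost
    g_env = (lam * sum(abs(v) for v in xbar)
             + sum((v - w) ** 2 for v, w in zip(fw, xbar)) / (2.0 * gamma))
    return f_x - 0.5 * gamma * grad_sq + g_env, xbar
```

The returned point `xbar` can be reused as the forward-backward point of the next iterate whenever the candidate is accepted in the linesearch, which is what keeps the count at one evaluation of $T_{\gamma}$ per backtracking step.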

[Extension of FBS] Observe that by selecting $d^{k}\equiv 0$ ZeroFPR reduces to the classical FBS algorithm. Item 2 combined with the relation $\varphi_{\gamma}(x^{k})\leq\bar{\Phi}_{k}$ due to (23) shows that the condition at 4 is always satisfied (with $\tau_{k}=1$ ). Therefore, $x^{k+1}{}={}\bar{x}^{k}+d^{k}{}={}\bar{x}^{k}{}\in{}T_{\gamma}(x^{k})$ for all $k$ , which is FBS, cf. (7).

4. Convergence results

In this section we analyze the properties of cluster points of the iterates generated by ZeroFPR. Specifically,

every cluster point of $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ solves problem (20) (Section 5.4);

if the linesearch is (eventually) monotone, then global and linear convergence are achieved under mild assumptions (Sections 5.4.1 and 5.4.1);

directions satisfying the Dennis-Moré condition, such as Broyden’s, enable superlinear rates under mild assumptions (Sections 5.4.2 and 5.4.2).

In what follows, we exclude the trivial case in which the optimality condition $r^{k}=0$ is achieved in a finite number of iterations, and therefore assume $r^{k}\neq 0$ for all $k$ . {thm}[Criticality of cluster points]The following hold for the iterates generated by ZeroFPR:

$r^{k}\to 0$ square-summably, and all cluster points of $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ are critical; more precisely, $\omega(x^{k})=\omega(\bar{x}^{k})\subseteq\fix T_{\gamma}$ ;

$\seq{\varphi_{\gamma}(x^{k})}$ converges to a (finite) value $\varphi_{\star}$ , and so does $\seq{\varphi(\bar{x}^{k})}$ if $\seq{x^{k}}$ is bounded.

By telescoping the above inequality and using (23), we obtain

proving $r^{k}\to 0$ square-summably. Suppose now that $\seq{x^{k}}[k\in K]\to x^{\prime}$ for some $x^{\prime}\in\R^{n}$ and $K\subseteq\N$ . Then, since $\|\bar{x}^{k}-x^{k}\|{}={}\gamma\|r^{k}\|{}\to{}0$ , in particular $\seq{\bar{x}^{k}}[k\in K]\to x^{\prime}$ as well. Due to the arbitrariness of the cluster point $x^{\prime}$ it follows that $\omega(x^{k})\subseteq\omega(\bar{x}^{k})$ , and a similar reasoning proves the converse inclusion, hence $\omega(x^{k})=\omega(\bar{x}^{k})$ . Moreover, we have $x^{k}{}\in{}\cball{\bar{x}^{k}}{\gamma\|r^{k}\|}{}\subseteq{}\FB{x^{k}}{}+{}\cball{0}{\gamma\|r^{k}\|}$ and since $\seq{x^{k}{}-{}\gamma\nabla f(x^{k})}[k\in K]{}\to{}x^{\prime}{}-{}\gamma\nabla f(x^{\prime})$ , from the outer semicontinuity of $\prox_{\gamma g}$ [44, Ex. 5.23(b)] it follows that $x^{\prime}\in\FB{x^{\prime}}$ , i.e., $x^{\prime}\in\fix T_{\gamma}$ .

2: from (28) it follows that $\seq{\bar{\Phi}_{k}}$ is decreasing, and in particular its limit exists, be it $\varphi_{\star}$ . Due to (23), necessarily $\varphi_{\star}\geq\inf\varphi>-\infty$ , therefore

proving that $\varphi_{\gamma}(x^{k+1})\to\varphi_{\star}$ . If $\seq{x^{k}}$ is bounded, then so is $\seq{\bar{x}^{k}}$ due to compact-valuedness of $\prox_{\gamma g}$ [44, Thm. 1.25]. Due to eq. 11 $\varphi_{\gamma}$ is $L$ -Lipschitz continuous on a compact set containing $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ for some $L>0$ . Then,

where the inequalities follow from section 4.2. Consequently, $\seq{\varphi(\bar{x}^{k})}\to\varphi_{\star}$ as well.

It follows from (23) and the fact that $\seq{\bar{\Phi}_{k}}$ is a decreasing sequence (cf. (28)), that the iterates of ZeroFPR satisfy $\varphi(\bar{x}^{k})\leq\bar{\Phi}_{0}=\varphi(\bar{x}^{0})$ . As a consequence, a sufficient condition for ensuring that the sequence $\seq{\bar{x}^{k}}$ does not diverge (and consequently nor does $\seq{x^{k}}$ , provided that the sequence of directions $\seq{d^{k}}$ is bounded) is that the level set $\set{\varphi\leq\varphi(\bar{x}^{0})}$ is compact. In the adaptive variant discussed in Section 5.3, this translates to boundedness of the level set $\set{\varphi\leq\varphi(\bar{x}^{k_{0}})}$ , where $k_{0}$ denotes the iteration starting from which $\gamma$ is constant. Since such a point is unknown a priori, the sufficient condition needs to be strengthened to $\varphi$ having bounded level sets.

We now show that if $\varphi_{\gamma}$ is well-behaved at cluster points, then the whole sequence generated by ZeroFPR is convergent. Good behavior involves the existence of a desingularizing function, that is, $\varphi_{\gamma}$ needs to possess the Kurdyka-Łojasiewicz property, a mild requirement that we restate here for the reader’s convenience. {defin}[KL property]A proper and lower semicontinuous function $\func{h}{\R^{n}}{\Rinf}$ has the Kurdyka-Łojasiewicz property (KL property) at $x^{\star}\in\dom\partial h$ if there exist a concave desingularizing function (or KL function) $\func{\psi}{[0,\eta]}{[0,+\infty)}$ for some $\eta>0$ and a neighborhood $U_{x^{\star}}$ of $x^{\star}$ , such that

$\psi$ is $C^{1}$ with $\psi^{\prime}>0$ on $(0,\eta)$ ;

for all $x\in U_{x^{\star}}$ s.t. $h(x^{\star})<h(x)<h(x^{\star})+\eta$ it holds that

The KL property is a mild requirement enjoyed by semi-algebraic functions and by subanalytic functions which are continuous on their domain see also . Moreover, since semi-algebraic functions are closed under parametric minimization, from the expression (2) it is apparent that $\varphi_{\gamma}$ is semi-algebraic provided that $f$ and $g$ are. More precisely, in all such cases the desingularizing function can be taken of the form $\psi(s)=\rho s^{\theta}$ for some $\rho>0$ and $\theta\in(0,1]$ , in which case it is usually referred to as a Łojasiewicz function. This property has been extensively exploited to provide convergence rates of optimization algorithms such as FBS, see . Further properties of $f$ and $g$ that ensure $\varphi_{\gamma}$ to satisfy such requirement are discussed in .

We first show how the KL property on $\varphi_{\gamma}$ ensures global convergence of the iterates of ZeroFPR if the linesearch is eventually monotone, i.e., if $p_{k}=1$ for $k$ sufficiently large, and then show that linear convergence is attained when the KL function is actually a Łojasiewicz function with large enough exponent. {thm}[Global convergence (monotone LS)]Consider the iterates generated by ZeroFPR with $p_{k}=1$ for $k$ ’s large enough, and with directions satisfying

for some $D\geq 0$ . Suppose that $\seq{x^{k}}$ remains bounded, that $\varphi_{\gamma}$ has the KL property on $\omega(x^{k})$ , and that every cluster point is prox-regular. If $f$ is of class $C^{2}$ in a neighborhood of $\omega(x^{k})$ , then $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ are convergent to (the same point) $x^{\star}$ , and the sequence of residuals $\seq{r^{k}}$ is summable. {proof} From appendix B we know that $\varphi_{\gamma}$ is constant on the (nonempty) compact set $\omega(x^{k})$ . It then follows from [12, Lem. 6] that there exist $\eta,\varepsilon>0$ and a uniformized KL function, namely a function $\psi$ satisfying all the properties listed in the definition of the KL property above for all $x^{\star}\in\omega(x^{k})$ and $x$ such that $\dist(x,\omega(x^{k}))<\varepsilon$ and $\varphi(x^{\star})<\varphi(x)<\varphi(x^{\star})+\eta$ . Let $\varphi_{\star}{}\coloneqq{}\lim_{k\to\infty}\varphi_{\gamma}(x^{k})$ , which exists and is finite (cf. section 5.4), and let $k_{1}\in\N$ be such that $p_{k}=1$ for all $k\geq k_{1}$ . Then we have (cf. 5 and (19))

By possibly restricting $\varepsilon$ , from item 2 and since $\omega(x^{k})$ is compact it follows that $\varphi_{\gamma}$ is differentiable in an $\varepsilon$ -enlargement of $\omega(x^{k})$ . appendix B ensures that there exists $k_{2}\geq k_{1}$ such that for all $k\geq k_{2}$ we have $\varphi_{\star}{}<{}\varphi_{\gamma}(x^{k}){}<{}\varphi_{\star}+\eta$ and $\dist(x^{k},\omega(x^{k})){}<{}\varepsilon$ . For all such $k$ , by item 2 we have $\nabla\varphi_{\gamma}(x^{k}){}={}Q_{\gamma}(x^{k})R_{\gamma}(x^{k}){}={}\bigl{[}I-\gamma\nabla^{2}f(x^{k})\bigr{]}r^{k}$ and the uniformized KL property yields

Letting $\Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}>{}0$ , by concavity of $\psi$ and (32) it follows that

By telescoping the inequality it follows that $\seq{\|r^{k}\|}$ is summable, hence, due to item 1, also $\seq{\|x^{k+1}-x^{k}\|}$ is. Therefore, $\seq{x^{k}}$ is a Cauchy sequence and as such it admits a limit, this being also the limit of $\seq{\bar{x}^{k}}$ in light of item 1 (and the fact that $\seq{\bar{x}^{k}}$ is also bounded). {thm}[Linear convergence (monotone LS)]Consider the iterates generated by ZeroFPR. Suppose that the hypotheses of Section 5.4.1 are satisfied, and that the KL function can be taken of the form $\psi(s)=\rho s^{\theta}$ for some $\theta\in[\nicefrac{{1}}{{2}},1]$ . Then, $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ are $R$ -linearly convergent. {proof} From section 5.4.1 we know that $\seq{x^{k}}$ and $\seq{\bar{x}^{k}}$ converge to the same ( $\gamma$ -critical) point, be it $x^{\star}$ . Defining $B_{k}{}\coloneqq{}\sum_{i\geq k}\|r^{i}\|$ , from LABEL:{lem:Deltaxr} and LABEL:{lem:DeltaBarxr} we have

Therefore, the proof reduces to showing that $\seq{B_{k}}$ converges with asymptotic $Q$ -linear rate. Inequality (33) reads $\varphi_{\gamma}(x^{k}){}-{}\varphi_{\star}{}\leq{}\bigl{[}(1+\gamma L_{f})\rho\theta\|r^{k}\|\bigr{]}^{\frac{1}{1-\theta}}$ , and since $r^{k}\to 0$ for large enough $k$ we have

Therefore, eventually $\Delta_{k}{}\coloneqq{}\psi\bigl{(}\varphi_{\gamma}(x^{k})-\varphi_{\star}\bigr{)}{}<{}1$ and from (34) we get

for some $C>0$ . Therefore, for large enough $k$ we have $B_{k}{}\leq{}C\|r^{k}\|{}={}C(B_{k}-B_{k+1})$ , \ie $B_{k+1}{}\leq{}(1-\nicefrac{{1}}{{C}})B_{k}$ , proving asymptotic $Q$ -linear convergence of $B_{k}$ .

4.2. Superlinear convergence

In the next result we show that under mild assumptions ZeroFPR exhibits superlinear rates of convergence if the directions satisfy a Dennis-Moré condition. Then, we show that the Broyden scheme (26) produces directions that satisfy such condition, and that due to the acceptance of unit stepsize $\tau_{k}=1$ , eventually each iteration of ZeroFPR will require only two evaluations of $T_{\gamma}$ (cf. section 5.3). Recall that a sequence $\seq{x^{k}}$ such that $x^{k}\neq x^{\star}$ for all $k$ is said to be superlinearly convergent to $x^{\star}$ if $\|x^{k+1}-x^{\star}\|/\|x^{k}-x^{\star}\|{}\to{}0$ as $k\to\infty$ .

[Superlinear convergence under Dennis-Moré condition]Suppose that Section 4.4 is strictly satisfied at a strong local minimum $x^{\star}$ of $\varphi$ , and consider the iterates generated by ZeroFPR. Suppose that $\seq{x^{k}}$ converges to $x^{\star}$ and that the directions $\seq{d^{k}}$ satisfy the Dennis-Moré condition

Then, eventually stepsize $\tau_{k}=1$ is always accepted and the sequences $\seq{x^{k}}$ , $\seq{\bar{x}^{k}}$ , and $\seq{r^{k}}$ , converge with superlinear rate. {proof} From sections 4.3, 2, 4.4 and 3 we know that $\nabla\varphi_{\gamma}$ and $R_{\gamma}$ are strictly differentiable at $x^{\star}$ , with $G_{\star}{}\coloneqq{}\nabla^{2}\varphi_{\gamma}(x^{\star}){}={}Q_{\gamma}(x^{\star})JR_{\gamma}(x^{\star}){}\succ{}0,$ and that there exists a neighborhood $U_{x^{\star}}$ of $x^{\star}$ in which $\varphi_{\gamma}$ is differentiable and $R_{\gamma}$ Lipschitz continuous. Since $\bar{x}^{k}=x^{k}-\gamma r^{k}\to x^{\star}$ due to item 1, it holds that $x^{k},\bar{x}^{k}\in U_{x^{\star}}$ for all $k$ large enough. By single-valuedness of $R_{\gamma}$ , for all such $k$ we may write $R_{\gamma}(x^{k})$ and $R_{\gamma}(\bar{x}^{k})$ in place of $r^{k}$ and $\bar{r}^{k}$ , respectively. In particular, since $x^{\star}\in\fix T_{\gamma}$ (cf. item 1), necessarily $R_{\gamma}(\bar{x}^{k})\to 0$ . In turn, due to (35) it also holds that $d^{k}\to 0$ . Let $x^{k+1}_{0}\coloneqq\bar{x}^{k}+d^{k}$ ; then,

and since $x^{k+1}_{0}-\bar{x}^{k}=d^{k}\to 0$ , from (35) and strict differentiability of $R_{\gamma}$ at $x^{\star}$ applied on the first term on the right-hand side it follows that

By possibly restricting $U_{x^{\star}}$ , nonsingularity of $JR_{\gamma}(x^{\star})$ ensures the existence of a constant $\alpha>0$ such that $\|R_{\gamma}(x)\|{}\geq{}\alpha\|x-x^{\star}\|$ for all $x\in U_{x^{\star}}$ . Since $\bar{x}^{k}+d^{k}\to x^{\star}$ , eventually $x^{k+1}_{0}\in U_{x^{\star}}$ . We have

A second-order expansion of $\varphi_{\gamma}$ at $x^{\star}$ yields

where the last equality is due to (37). Subtracting,

where $\beta=\frac{1}{2}\lambda_{\rm min}(G_{\star})>0$ . Therefore, there exists $k_{0}\in\N$ such that $\varphi_{\gamma}(\bar{x}^{k}+d^{k}){}\leq{}\varphi_{\gamma}(\bar{x}^{k})$ for all $k\geq k_{0}$ ; in particular, for all such $k$

where the second inequality follows from item 2, and the last one from (23) and the fact that $\sigma<\gamma\frac{1-\gamma L_{f}}{2}$ . Therefore, for $k\geq k_{0}$ the linesearch condition (19) holds with $\tau_{k}=1$ , and unitary stepsize is always accepted. In particular, the limit (37) reads $\lim_{k\to\infty}{\|x^{k+1}-x^{\star}\|/\|\bar{x}^{k}-x^{\star}\|}{}={}0,$ and from the inequality

superlinear convergence of $\seq{x^{k}}$ follows. Since $\|r^{k}\|{}\leq{}L_{R}\|x^{k}-x^{\star}\|$ , then also $\seq{r^{k}}$ converges superlinearly, and in turn, since $\|\bar{x}^{k}-x^{\star}\|{}\leq{}\gamma\|r^{k}\|{}+{}\|x^{k}-x^{\star}\|$ , also $\seq{\bar{x}^{k}}$ does.

We conclude the section showing that employing Broyden directions (26) in ZeroFPR enables superlinear convergence rates, provided that $R_{\gamma}$ is Lipschitz continuously semidifferentiable at the limit point (see ). {thm}[Superlinear convergence with Broyden directions]Suppose that Section 4.4 is (strictly) satisfied at a strong local minimum $x^{\star}$ of $\varphi$ at which $R_{\gamma}$ is Lipschitz-continuously semidifferentiable. Consider the iterates generated by ZeroFPR with directions $d^{k}$ selected with Broyden method (26), and suppose that $x^{k}\to x^{\star}$ .

Then, the Dennis-Moré condition (35) is satisfied, and in particular all the claims of Section 5.4.2 hold. {proof} From section 4.4 we know that $R_{\gamma}$ is strictly differentiable at the critical point $x^{\star}$ and Lipschitz-continuously semidifferentiable there. Denoting $G_{\star}=JR_{\gamma}(x^{\star})$ ,

and since $x^{k},\bar{x}^{k}\to x^{\star}$ , due to [22, Lem. 2.2] there exists $L>0$ such that

In particular, due to sections 5.4.1 and B, $\frac{\|y_{k}-G_{\star}s_{k}\|}{\|s_{k}\|}$ is summable. Let $E_{k}=B_{k}-G_{\star}$ and let $\|{}\cdot{}\|_{F}$ denote the Frobenius norm. With a simple modification of the proofs of [22, Thm. 4.1] and [2, Lem. 4.4] that takes into account the scalar $\vartheta_{k}\in[\bar{\vartheta},2-\bar{\vartheta}]$ we obtain

The last term on the right-hand side, be it $\sigma_{k}$ , is summable and therefore the sequence $\seq{E_{k}}$ is bounded. Therefore,

where $\bar{E}\coloneqq\sup\seq{\|E_{k}\|_{F}}$ . Telescoping the inequality, summability of $\sigma_{k}$ ensures that of $\frac{\|(B_{k}-G_{\star})s_{k}\|^{2}}{\|s_{k}\|^{2}}$ proving in particular the claimed Dennis-Moré condition (35).

Simulations

We now present numerical results with the proposed method. In ZeroFPR we set $\beta=\nicefrac{{1}}{{2}}$ , and for the nonmonotone linesearch we used the sequence $p_{k}=(\eta Q_{k}+1)^{-1}$ where $Q_{0}=1$ , $Q_{k+1}=\eta Q_{k}+1$ , $\eta=0.85$ : in this way $\seq{p_{k}}$ is computed as in .
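To make the choice of $\seq{p_{k}}$ concrete, the Python sketch below generates the sequence described above and shows one way the reference value $\bar{\Phi}_{k}$ could be updated. The update rule in `update_reference` (a convex combination with weight $p_{k+1}$ on the newest FBE value) is an assumption consistent with the description of $\bar{\Phi}_{k}$ as a convex combination of past FBE values; the exact rule and indexing are those of the algorithm statement, which is not reproduced here. Both function names are hypothetical.

```python
def nonmonotone_weights(num, eta=0.85):
    """First `num` terms of p_k = (eta*Q_k + 1)^{-1}, Q_0 = 1, Q_{k+1} = eta*Q_k + 1."""
    Q = 1.0
    weights = []
    for _ in range(num):
        Q_next = eta * Q + 1.0
        weights.append(1.0 / Q_next)   # p_k
        Q = Q_next                     # Q_{k+1}
    return weights

def update_reference(phi_bar, p_next, fbe_next):
    """Assumed update: convex combination of the old reference value and the new FBE value."""
    return (1.0 - p_next) * phi_bar + p_next * fbe_next
```

Since $Q_{k}$ increases to $1/(1-\eta)$, the weights decrease from $p_{0}=1/1.85$ to the limit $1-\eta=0.15$, so the sequence stays in $(0,1]$ and is bounded away from zero, as required for convergence.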

We performed experiments with different choices of $d^{k}$ in step 3. In particular,

ZeroFPR(Broyden): $d^{k}=-H_{k}\bar{r}^{k}$ , and $H_{k}$ obtained by the Broyden method (26) with $\bar{\vartheta}=10^{-4}$ ;

ZeroFPR(BFGS): $d^{k}=-H_{k}\bar{r}^{k}$ , where $H_{k}$ is computed using BFGS updates (27);

ZeroFPR(L-BFGS): $d^{k}$ is computed using L-BFGS [34, Alg. 7.4] with memory $10$ .

We only show the results with full quasi-Newton updates (Broyden, BFGS) for one of the examples: for the other experiments we focus on L-BFGS, which is better suited for large-scale problems. Although $JR_{\gamma}$ is nonsymmetric at the critical points in general, we observed that the symmetric updates of BFGS and L-BFGS perform very well in practice and outperform the Broyden method.

We compared ZeroFPR with the forward-backward splitting algorithm (denoted FBS), that is (7), the inertial FBS (denoted IFBS) proposed in [14, Eq. (7)] (with parameter $\beta=0.2$ ), and the nonmonotone accelerated FBS (denoted AFBS) proposed in [26, Alg. 2] for fully nonconvex problems. All experiments were performed in MATLAB. The implementations of the methods used in the tests are available online at http://github.com/kul-forbes/ForBES

Here we consider the problem of finding a sparse solution to a least-squares problem. As discussed in , this is achieved by solving the following nonconvex problem:

2. Dictionary learning

Given a collection of $m$ signals of dimension $n$ , collected as columns in a matrix $Y\in\R^{n\times m}$ , we seek a sparse representation of each of them as a combination of a set of $k$ vectors $\set{d_{1},\ldots,d_{k}}$ , called dictionary atoms. To do so, we solve the following problem

Problem (39) takes the form (1) by letting $f(D,C)=\tfrac{1}{2}\|Y-DC\|^{2}_{F}$ and $g(D,C)=\delta_{S}(D,C)$ , where $S{}={}\overbrace{S_{D}\times\ldots\times S_{D}}^{\text{$k$ times}}{}\times{}\overbrace{S_{C}\times\ldots\times S_{C}}^{\text{$m$ times}},$ with

3. Matrix decomposition

We consider the problem of approximating a given matrix $A\in\R^{m\times n}$ as the sum of a low-rank and a sparse component, by solving

This problem has application, for example, in the analysis of video imagery, specifically the separation of the background (fixed over time) scenery from the foreground (moving) objects in a series of video frames. In this case, matrix $A$ contains $n$ video frames (columns), each consisting of $m$ pixels, and $X_{L}$ , $X_{S}$ will respectively contain the background scenery and foreground objects identified in each frame. Therefore here $f(X_{L},X_{S})=\tfrac{1}{2}\|A-X_{L}-X_{S}\|_{F}^{2}$ and $L_{f}=2$ , while $g(X_{L},X_{S})=\indicator_{\rank\leq r}(X_{L})+\lambda\|X_{S}\|_{0}$ . The proximal mapping of $g$ is given by

Here, $\prox_{\gamma\lambda\|\cdot\|_{0}}$ is the hard-thresholding operation, defined componentwise as

The set of matrices of rank at most $r$ is nonconvex and closed, and the projection onto it is given by $\proj_{\rank\leq r}(X){}={}U_{r}\diag(\sigma_{1},\ldots,\sigma_{r})V_{r}^{T}$ , where $\sigma_{1},\ldots,\sigma_{r}$ are the $r$ largest singular values of $X$ , and $U_{r},V_{r}$ are the matrices of the corresponding left and right singular vectors, respectively. Each computation of $\proj_{\rank\leq r}$ requires a partial SVD which is, from the computational perspective, the most expensive operation in this case.
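As a small illustration of the sparse part of the proximal mapping, the Python sketch below implements componentwise hard-thresholding: an entry is kept when its magnitude exceeds $\sqrt{2\gamma\lambda}$ and set to zero otherwise. The function name is hypothetical, and at a tie (where both values are minimizers) the sketch arbitrarily selects zero. The rank projection is not included, since it requires an SVD routine.

```python
import math

def hard_threshold(x, gamma, lam):
    """One element of prox_{gamma*lam*||.||_0}(x), computed componentwise.

    For each entry, keeping it costs lam, while zeroing it costs x_i^2/(2*gamma);
    hence the entry is kept iff |x_i| > sqrt(2*gamma*lam). Ties are mapped to 0.
    """
    thr = math.sqrt(2.0 * gamma * lam)
    return [xi if abs(xi) > thr else 0.0 for xi in x]
```

For instance, with $\gamma=0.5$ and $\lambda=1$ the threshold equals $1$, so that `hard_threshold([1.5, -0.3, 1.0, -2.0], 0.5, 1.0)` returns `[1.5, 0.0, 0.0, -2.0]`.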

We applied this technique to a sequence of $n=50$ frames coming from the ShoppingMall dataset (http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html). The footage consists of frames of $m=320\times 256$ grayscale pixels, therefore the problem has $8192000$ variables in total. In problem (40) we used $r=1$ and $\lambda=3\cdot 10^{-3}$ . The results are shown in Figures 3 and 4. Also in this case, the fast asymptotic convergence of ZeroFPR(L-BFGS) is apparent.

Conclusions

The forward-backward envelope is a valuable tool for deriving efficient algorithms tackling nonsmooth and nonconvex problems of the form $\varphi=f+g$ , as it can be used as a merit function to devise globally convergent linesearch methods solving the system of nonlinear equations defining the stationary points of $\varphi$ .

ZeroFPR implements this idea, and we proved that it globally converges to a stationary point under the assumption that $\varphi_{\gamma}$ has the Kurdyka-Łojasiewicz property. Furthermore, if the linesearch directions satisfy the Dennis-Moré condition (for example, if they are determined according to the Broyden method), the convergence rate at strong local minima is superlinear.

Numerical simulations with the proposed method on convex and nonconvex problems confirm our theoretical results. Using the Broyden method, BFGS (in the case of small-scale problems) and L-BFGS (for large-scale problems) to compute directions in ZeroFPR greatly outperforms FBS and its accelerated variant. It is our belief that the surprising efficacy of (L-)BFGS is due to the fact that, under the appropriate assumptions, the Jacobian of $R_{\gamma}$ at strong local minima is similar to a symmetric and positive definite matrix. Future investigation may better explain the effectiveness of symmetric update formulas in this framework.

References

Appendix A Proofs of Section 4

1: It follows from [37, Thm.s 3.8 and 4.1] that $\prox_{\gamma g}$ is (strictly) differentiable at $x^{\star}-\gamma\nabla f(x^{\star})$ iff $g$ (strictly) satisfies item 2. Consequently, if $f$ is of class $C^{2}$ around $x^{\star}$ (and in particular strictly differentiable at $x^{\star}$ [44, Cor. 9.19]), $R_{\gamma}(x)=x-\FB x$ is (strictly) differentiable at $x^{\star}$ with Jacobian as in (17) due to the chain rule of differentiation (and the fact that strict differentiability is preserved by composition). For $\gamma^{\prime}\in(\gamma,\Gamma(x^{\star}))$ and $w\in\R^{n}$ we have

The expression (15) of the second-order epi-derivative then implies $\innprod{Mw}{w}{}\geq{}-\frac{1}{\gamma^{\prime}}\|w\|^{2}$ for all $w\in\R^{n}$ (since $Mw=0$ for $w\in S^{\perp}$ ). Therefore, $\lambda_{\rm min}(M){}\geq{}-\nicefrac{{1}}{{\gamma^{\prime}}}{}>{}-\nicefrac{{1}}{{\gamma}}$ , proving $I+\gamma M$ to be positive definite, and in particular invertible. We may now trace the proof of [45, Lem. 2.9] to infer that $JP_{\gamma}(x^{\star}){}={}\proj_{S}[I+\gamma M]^{-1}\proj_{S}$ . Apparently, $JP_{\gamma}(x^{\star})$ is symmetric and positive semidefinite.

2: Since $Q_{\gamma}$ is (strictly) continuous at $x^{\star}$ and $R_{\gamma}$ is (strictly) differentiable at $x^{\star}$ , from [45, Lem. 6.2] we have that $\nabla\varphi_{\gamma}=Q_{\gamma}R_{\gamma}$ is (strictly) differentiable at $x^{\star}$ , and (17) follows by the chain rule.

3: A simple application of the chain rule proves (18); moreover, combined with (17) we obtain $\nabla^{2}\varphi_{\gamma}(x^{\star}){}={}\tfrac{1}{\gamma}\left[Q_{\gamma}(x^{\star}){}-{}Q_{\gamma}(x^{\star})P_{\gamma}(x^{\star})Q_{\gamma}(x^{\star})\right]$ , and since both $Q_{\gamma}(x^{\star})$ and $P_{\gamma}(x^{\star})$ are symmetric, so is $\nabla^{2}\varphi_{\gamma}(x^{\star})$ .
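The linear-algebra step in item 1 can be checked numerically. The following is a minimal sketch, not part of the original proof: the subspace $S$, the matrix $M$ and the value of $\gamma$ are hypothetical test data, chosen so that $M$ is symmetric, vanishes on $S^{\perp}$ and has all eigenvalues above $-\nicefrac{1}{\gamma}$. Under these assumptions $I+\gamma M$ should be positive definite and $\proj_{S}[I+\gamma M]^{-1}\proj_{S}$ symmetric and positive semidefinite.

```python
import numpy as np

def projected_inverse(M, P, gamma):
    """Return P (I + gamma*M)^{-1} P for a symmetric M and an orthogonal projector P."""
    n = M.shape[0]
    return P @ np.linalg.solve(np.eye(n) + gamma * M, P)

rng = np.random.default_rng(0)
n, gamma = 6, 0.5

# Hypothetical subspace S: span of 3 random vectors, with orthonormal basis B.
B, _ = np.linalg.qr(rng.standard_normal((n, 3)))
P = B @ B.T                                   # orthogonal projector onto S

# Symmetric M vanishing on S^perp, with eigenvalues on S above -1/gamma = -2.
W, _ = np.linalg.qr(rng.standard_normal((3, 3)))
M = B @ (W @ np.diag([-1.5, 0.3, 4.0]) @ W.T) @ B.T

J = projected_inverse(M, P, gamma)

# Smallest eigenvalue of I + gamma*M (should be positive) and of J (should be >= 0).
min_eig_shifted = np.linalg.eigvalsh(np.eye(n) + gamma * M).min()
min_eig_J = np.linalg.eigvalsh((J + J.T) / 2).min()
print("min eig of I + gamma*M:", min_eig_shifted)
print("J symmetric:", np.allclose(J, J.T), " min eig of J:", min_eig_J)
```

The negative eigenvalue $-1.5$ is included on purpose: $M$ need not be positive semidefinite when $g$ is nonconvex, and the bound $\lambda_{\rm min}(M)>-\nicefrac{1}{\gamma}$ is all that the argument uses.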

[Proof of Section 4.4] We will show that all conditions are equivalent to either one of the following

$\tinnprod{d}{(\nabla^{2}f(x^{\star})+M)d}>0$ $\forall d\in S$ , where $M$ and $S$ are as in section 4.4;

$JR_{\gamma}(x^{\star})$ is similar to a symmetric and positive definite matrix.

4.4 $\Leftrightarrow$ 4.4: trivial, since $\nabla^{2}\varphi_{\gamma}(x^{\star})$ exists as shown in item 3.

4.4 $\Leftrightarrow$ thm:StrongMinimality(f): follows from [44, Thm. 13.24(c)], since

4.4 $\Leftrightarrow$ 4.4: if $\nabla^{2}\varphi_{\gamma}(x^{\star})\succ 0$ , then $x^{\star}$ is a (strong) local minimum for $\varphi_{\gamma}$ and, due to (18), necessarily $JR_{\gamma}(x^{\star})$ is invertible. Conversely, if $x^{\star}$ is a local minimum for $\varphi_{\gamma}$ , then $\nabla^{2}\varphi_{\gamma}(x^{\star})\succeq 0$ . If, additionally, $JR_{\gamma}(x^{\star})$ is invertible, then due to (18) $\nabla^{2}\varphi_{\gamma}(x^{\star})$ is also invertible, and therefore positive definite.

4.4 $\Leftrightarrow$ thm:StrongMinimality(g): by comparing (17) and (18) we observe that $JR_{\gamma}(x^{\star})$ is similar to the (symmetric) matrix $Q_{\gamma}(x^{\star})^{-\nicefrac{{1}}{{2}}}\nabla^{2}\varphi_{\gamma}(x^{\star})Q_{\gamma}(x^{\star})^{-\nicefrac{{1}}{{2}}}$ , which is positive definite iff $\nabla^{2}\varphi_{\gamma}(x^{\star})$ is.

thm:StrongMinimality(f) $\Leftrightarrow$ thm:StrongMinimality(g): the proof is the same as that of [45, Thm. 2.11(b) $\Leftrightarrow$ (c)].

4.4 $\Rightarrow$ thm:StrongMinimality(g): reasoning as in the proof of the implications “4.4 $\Leftrightarrow$ thm:StrongMinimality(f) $\Leftrightarrow$ thm:StrongMinimality(g)”, we conclude that local minimality of $x^{\star}$ for $\varphi$ entails $JR_{\gamma}(x^{\star})$ being similar to a symmetric and positive semidefinite matrix. Therefore, if $JR_{\gamma}(x^{\star})$ is nonsingular, then it is similar to a symmetric and positive definite matrix.

4.4 $\Rightarrow$ 4.4: trivial, since $\varphi_{\gamma}\leq\varphi$ and $\varphi_{\gamma}(x^{\star})=\varphi(x^{\star})$ (cf. items 1 and 1).
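The similarity argument used in the proof above can be illustrated numerically. The sketch below is not taken from the paper and rests on an assumption: that (18) has the form $\nabla^{2}\varphi_{\gamma}(x^{\star})=Q_{\gamma}(x^{\star})JR_{\gamma}(x^{\star})$ with $Q_{\gamma}(x^{\star})$ symmetric and positive definite. The matrices `Q` and `H` are random stand-ins for $Q_{\gamma}(x^{\star})$ and $\nabla^{2}\varphi_{\gamma}(x^{\star})$. Under that assumption $JR=Q^{-1}H=Q^{-\nicefrac{1}{2}}\bigl(Q^{-\nicefrac{1}{2}}HQ^{-\nicefrac{1}{2}}\bigr)Q^{\nicefrac{1}{2}}$, so the nonsymmetric matrix $JR$ and the symmetric matrix in parentheses share the same real spectrum.

```python
import numpy as np

def inv_sqrt_spd(Q):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(1)
n = 5

# Hypothetical stand-ins: Q is SPD, H is symmetric (possibly indefinite).
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)
C = rng.standard_normal((n, n))
H = C + C.T

# Assumed relation H = Q @ JR, i.e. JR = Q^{-1} H (generally nonsymmetric).
JR = np.linalg.solve(Q, H)

# Symmetric matrix to which JR is similar.
Q_mh = inv_sqrt_spd(Q)
S_sym = Q_mh @ H @ Q_mh

eig_JR = np.linalg.eigvals(JR)
eig_JR_sorted = np.sort(eig_JR.real)
eig_sym = np.linalg.eigvalsh(S_sym)           # returned in ascending order

print("max |imag part| of eig(JR):", np.abs(eig_JR.imag).max())
print("spectra agree:", np.allclose(eig_JR_sorted, eig_sym))
```

Since $S\mapsto Q^{-\nicefrac{1}{2}}SQ^{-\nicefrac{1}{2}}$ is a congruence, it preserves the signs of the eigenvalues, which is why positive definiteness of the symmetric matrix is equivalent to that of the Hessian.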

Appendix B Additional results for Section 5

Consider the iterates generated by ZeroFPR and suppose that the directions $\seq{d^{k}}$ are selected so as to satisfy (31). Then,

$\|x^{k+1}-x^{k}\|{}\leq{}(\gamma+D)\|r^{k}\|$

$\|\bar{x}^{k+1}-\bar{x}^{k}\|{}\leq{}\gamma\|r^{k+1}\|{}+{}(2\gamma+D)\|r^{k}\|$

in particular, $\|x^{k+1}-x^{k}\|$ and $\|\bar{x}^{k+1}-\bar{x}^{k}\|$ converge to 0.

where in the last inequality we used the fact that $\tau_{k}\in(0,1]$ . This proves 1, and 2 trivially follows by the triangle inequality $\|\bar{x}^{k+1}-\bar{x}^{k}\|{}\leq{}\|x^{k+1}-x^{k}\|{}+{}\gamma\|r^{k+1}\|{}+{}\gamma\|r^{k}\|$ . Using this, 3 follows from item 1.
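The two step-size bounds can be sanity-checked on random data. The sketch below is illustrative only and relies on assumptions that are not visible in this appendix: that the update has the form $x^{k+1}=\bar{x}^{k}+\tau_{k}d^{k}$ with $\bar{x}^{k}=x^{k}-\gamma r^{k}$, and that (31) reads $\|d^{k}\|\leq D\|r^{k}\|$. The vectors are random stand-ins, not iterates of an actual run.

```python
import numpy as np

def worst_bound_violation(trials=1000, n=4, gamma=0.3, D=2.0, seed=0):
    """Return the largest value of (lhs - rhs) for both bounds over random samples.

    Assumed (hypothetical) relations: x_bar = x - gamma*r,
    x_next = x_bar + tau*d with tau in (0, 1] and ||d|| <= D*||r||.
    """
    rng = np.random.default_rng(seed)
    worst1, worst2 = -np.inf, -np.inf
    for _ in range(trials):
        x = rng.standard_normal(n)
        r = rng.standard_normal(n)
        r_next = rng.standard_normal(n)       # arbitrary next residual
        tau = rng.uniform(1e-3, 1.0)
        d = rng.standard_normal(n)
        # Rescale d so that ||d|| <= D * ||r||.
        d *= rng.uniform(0.0, 1.0) * D * np.linalg.norm(r) / np.linalg.norm(d)

        x_bar = x - gamma * r
        x_next = x_bar + tau * d
        x_bar_next = x_next - gamma * r_next

        nr, nr_next = np.linalg.norm(r), np.linalg.norm(r_next)
        gap1 = np.linalg.norm(x_next - x) - (gamma + D) * nr
        gap2 = np.linalg.norm(x_bar_next - x_bar) - (gamma * nr_next + (2 * gamma + D) * nr)
        worst1, worst2 = max(worst1, gap1), max(worst2, gap2)
    return worst1, worst2

worst1, worst2 = worst_bound_violation()
print("largest lhs - rhs, first bound: ", worst1)
print("largest lhs - rhs, second bound:", worst2)
```

Both reported values should be nonpositive up to rounding, since under the assumed update each bound reduces to the triangle inequality together with $\tau_{k}\leq 1$.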

Consider the iterates generated by ZeroFPR. Suppose that (31) is satisfied and that the sequence $\seq{x^{k}}$ is bounded. Then, $\omega(x^{k})=\omega(\bar{x}^{k})$ are nonempty compact and connected sets over which $\varphi$ and $\varphi_{\gamma}$ are constant and coincide. Moreover,

The sets of cluster points are nonempty because of boundedness of the sequences; in turn, connectedness and compactness as well as (41) are shown in [12, Rem. 5], which applies since $\|x^{k+1}-x^{k}\|$ and $\|\bar{x}^{k+1}-\bar{x}^{k}\|$ converge to $0$ (cf. item 3). Moreover, since $\seq{\varphi_{\gamma}(x^{k})}$ converges to some value $\varphi_{\star}\in\R$ and $\omega(x^{k})=\omega(\bar{x}^{k})\subseteq\fix T_{\gamma}$ as shown in section 5.4, it follows from item 1 that $\varphi$ and $\varphi_{\gamma}$ coincide on $\omega(x^{k})$ (and equal $\varphi_{\star}$ ).

Suppose that Section 4.4 is satisfied at a strong local minimum $x^{\star}$ of $\varphi$ . Then, for any $\gamma\in(0,\nicefrac{{1}}{{L_{f}}})$ the FBE $\varphi_{\gamma}$ possesses the KL property at $x^{\star}$ , and the desingularizing function $\psi$ can be taken of the form $\psi(s){}={}\rho s^{\nicefrac{{1}}{{2}}}$ for some $\rho>0$ .

Proof. From section 4.4 it follows that $x^{\star}$ is a strong local minimum for $\varphi_{\gamma}$ at which $\varphi_{\gamma}$ is twice differentiable with $H_{\star}\coloneqq\nabla^{2}\varphi_{\gamma}(x^{\star})\succ 0$ . Let $\lambda{}\coloneqq{}\lambda_{\rm min}(H_{\star})$ and $\Lambda{}\coloneqq{}\lambda_{\rm max}(H_{\star})$ . Since $\nabla\varphi_{\gamma}(x^{\star})=0$ , a second-order expansion of $\varphi_{\gamma}$ and a first-order expansion of $\nabla\varphi_{\gamma}$ at $x^{\star}$ have leading terms $\tfrac{1}{2}\innprod{H_{\star}(x-x^{\star})}{x-x^{\star}}{}\leq{}\tfrac{\Lambda}{2}\|x-x^{\star}\|^{2}$ and $\|H_{\star}(x-x^{\star})\|{}\geq{}\lambda\|x-x^{\star}\|$ , respectively. Absorbing the remainders, there exists a neighborhood $U_{x^{\star}}$ of $x^{\star}$ such that, for all $x\in U_{x^{\star}}$ , $\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star}){}\leq{}\Lambda\|x-x^{\star}\|^{2}$ and $\|\nabla\varphi_{\gamma}(x)\|{}\geq{}\tfrac{\lambda}{2}\|x-x^{\star}\|$ . Shrinking $U_{x^{\star}}$ if necessary so that $\varphi_{\gamma}(x)>\varphi_{\gamma}(x^{\star})$ for all $x\in U_{x^{\star}}\setminus\{x^{\star}\}$ , for such $x$ we have $\psi^{\prime}\left(\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star})\right)\|\nabla\varphi_{\gamma}(x)\|{}={}\tfrac{\rho}{2\sqrt{\varphi_{\gamma}(x)-\varphi_{\gamma}(x^{\star})}}\|\nabla\varphi_{\gamma}(x)\|{}\geq{}\tfrac{\rho\lambda}{4\sqrt{\Lambda}}$ . Letting $\rho{}={}\frac{4\sqrt{\Lambda}}{\lambda}$ , the right-hand side equals $1$ , and therefore $\psi$ is a KL function for $\varphi_{\gamma}$ at $x^{\star}$ .
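The KL inequality with $\psi(s)=\rho s^{\nicefrac{1}{2}}$ can be illustrated on the simplest instance, a strongly convex quadratic. The sketch below is an assumption-laden toy, not the FBE itself: `H` is a random symmetric positive definite stand-in for $H_{\star}$, the model function is $q(x)=\tfrac{1}{2}\innprod{Hx}{x}$ with minimizer $x^{\star}=0$, and the constant $\rho=\nicefrac{\sqrt{2\Lambda}}{\lambda}$ is chosen for this pure quadratic (where no remainder terms need to be absorbed), so it is not the constant appearing in the proof.

```python
import numpy as np

def min_kl_ratio(H, rho, samples=2000, radius=1e-2, seed=0):
    """Smallest value of psi'(q(x) - q(0)) * ||grad q(x)|| over random x near 0,

    where q(x) = 0.5 * x^T H x and psi(s) = rho * sqrt(s).
    """
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    worst = np.inf
    for _ in range(samples):
        x = rng.standard_normal(n)
        x *= radius * rng.uniform(0.01, 1.0) / np.linalg.norm(x)
        gap = 0.5 * x @ H @ x                 # q(x) - q(x_star), positive for x != 0
        grad_norm = np.linalg.norm(H @ x)     # ||grad q(x)||
        worst = min(worst, rho / (2.0 * np.sqrt(gap)) * grad_norm)
    return worst

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
H = A @ A.T + np.eye(4)                       # hypothetical SPD Hessian

eigs = np.linalg.eigvalsh(H)
lam, Lam = eigs[0], eigs[-1]
rho = np.sqrt(2.0 * Lam) / lam                # constant chosen for the quadratic model

worst_ratio = min_kl_ratio(H, rho)
print("smallest psi'(gap) * ||grad||:", worst_ratio)
```

The printed value should be at least $1$, which is the KL inequality for this model. For the FBE the same computation applies near $x^{\star}$ up to higher-order remainders, which is why the proof works with slightly looser constants.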