An Inexact Variable Metric Proximal Point Algorithm for Generic Quasi-Newton Acceleration

Hongzhou Lin, Julien Mairal, Zaid Harchaoui

Introduction

Convex composite optimization arises in many scientific fields, such as image and signal processing or machine learning. It consists of minimizing a real-valued function composed of two convex terms:

where $\|\cdot\|$ denotes the Euclidean norm. Note that when $\psi$ is an indicator function, the proximal operator corresponds to the simple Euclidean projection.

To solve (1), significant efforts have been devoted to (i) extending techniques for smooth optimization to deal with composite terms; (ii) exploiting the underlying structure of the problem—is $f$ a finite sum of independent terms? Is $\psi$ separable in different blocks of coordinates? (iii) exploiting the local curvature of the smooth term $f$ to achieve faster convergence than gradient-based approaches when the dimension $d$ is large. Typically, the first point is well understood in the context of optimal first-order methods, see , and the third point is tackled with effective heuristics such as L-BFGS when the problem is smooth . Yet, addressing all these challenges at the same time is difficult, which is precisely the focus of this paper.

where the functions $f_{i}$ are convex and smooth with Lipschitz continuous derivatives, and $\psi$ is a composite term, possibly non-smooth. The function $f_{i}$ measures the fit of some model parameters $x$ to a specific data point indexed by $i$ , and $\psi$ is a regularization penalty to prevent over-fitting. To exploit the sum structure of $f$ , a large number of randomized incremental gradient-based techniques have been proposed, such as SAG , SAGA , SDCA , SVRG , Finito , or MISO . These approaches access a single gradient $\nabla f_{i}(x)$ at every iteration instead of the full gradient $(1/n)\sum_{i=1}^{n}\nabla f_{i}(x)$ and achieve lower computational complexity in expectation than optimal first-order methods under a few assumptions. Yet, these methods are unable to exploit the curvature of the objective function; this is indeed also the case for variants that are accelerated in the sense of Nesterov .

To tackle (2), dedicated first-order methods are often the default choice in machine learning, but it is also known that standard Quasi-Newton approaches can sometimes be surprisingly effective in the smooth case—that is when $\psi=0$ , see, e.g., for extensive benchmarks. Since the dimension of the problem $d$ is typically very large ( $d\geq 10\,000$ ), “limited memory” variants of these algorithms, such as L-BFGS, are necessary to achieve the desired scalability . The theoretical guarantees offered by L-BFGS are somewhat limited, meaning that it does not outperform accelerated first-order methods in terms of worst-case convergence rate, and also it is not guaranteed to correctly approximate the Hessian of the objective. Yet, L-BFGS remains one of the greatest practical success of smooth optimization. Adapting L-BFGS to composite and structured problems, such as the finite sum of functions (2), is of utmost importance nowadays.

For instance, there have been several attempts to develop a proximal Quasi-Newton method . These algorithms typically require computing many times the proximal operator of $\psi$ with respect to a variable metric. Quasi-Newton steps were also incorporated as local search steps into accelerated first-order methods to further enhance their numerical performance . More related to our work, L-BFGS is combined with SVRG for minimizing smooth finite sums in . The scope of our approach is broader beyond the case of SVRG. We present a generic Quasi-Newton scheme, applicable to a large-class of first-order methods for composite optimization, including other incremental algorithms and block coordinate descent methods

The idea of combining second-order or Quasi-Newton methods with Moreau envelope is in fact relatively old. It may be traced back to variable metric proximal bundle methods , which aims to incorporate curvature information into the bundle methods. Our approach revisits this principle with a limited-memory variant (to deal with large dimension $d$ ), with a simple line search scheme, with several warm start strategies for the sub-problems and with a global complexity analysis that is more relevant than convergence rates that do not take into account the cost per iteration.

To demonstrate the effectiveness of our scheme in practice, we evaluate QNing on regularized logistic regression and regularized least-squares, with smooth and nonsmooth regularization penalities such as the Elastic-Net . We use large-scale machine learning datasets and show that QNing performs at least as well as the recently proposed accelerated incremental algorithm Catalyst , and other Quasi-Newton baselines such as proximal Quasi-Newton methods and Stochastic L-BFGS in all numerical experiments, and significantly outperforms them in many cases.

The paper is organized as follows: Section 2 presents related work on Quasi-Newton methods such as L-BFGS; we introduce QNing in Section 3 and its convergence analysis in Section 4; Section 5 is devoted to numerical experiments and Section 6 concludes the paper.

Related work and preliminaries

The history of Quasi-Newton methods can be traced back to the 1950’s . Quasi-Newton methods often lead to significantly faster convergence in practice compared to simpler gradient-based methods for solving smooth optimization problems . Yet, a theoretical analysis of Quasi-Newton methods that explains their impressive empirical behavior is still an open topic. Here, we briefly review the well-known BFGS algorithm in Section 2.1, its limited memory variant , and a few recent extensions in Section 2.2. Then, we present earlier works that combine proximal point algorithm and Quasi-Newton methods in Section 2.3.

The most popular Quasi-Newton method is probably BFGS, named after its inventors (Broyden-Fletcher-Goldfarb-Shanno), and its limited variant L-BFGS . These approaches will be the workhorses of the QNing meta-algorithm in practice. Consider a smooth convex objective $f$ to be minimized, the BFGS method constructs at iteration $k$ a couple $(x_{k},B_{k})$ with the following update:

where $\alpha_{k}$ is a suitable stepsize and

The matrix $B_{k}$ aims to approximate the Hessian matrix at the iterate $x_{k}$ . When $f$ is strongly convex, the positive definiteness of $B_{k}$ is guaranteed, as well as the condition $y_{k}^{\top}s_{k}>0$ , which ensures that (3) is well defined. The stepsize $\alpha_{k}$ is usually determined by a line search strategy. For instance, applying Wolfe’s line-search strategy provides linear convergence rate for strongly convex objectives. Moreover, under stronger conditions that the objective $f$ is twice differentiable and its Hessian is Lipschitz continuous, the algorithm can asymptotically achieve superlinear convergence rate .

However, when the dimension $d$ is large, storing the $d$ -by- $d$ matrix $B_{k}$ is infeasible. The limited memory variant L-BFGS overcomes this issue by restricting the matrix $B_{k}$ to be low rank. More precisely, instead of storing the full matrix, a “generating list” of at most $l$ pairs of vectors $\{(s_{i}^{k},y_{i}^{k})\}_{i=0...j}$ is kept in memory. The low rank matrix $B_{k}$ can then be recovered by performing the matrix update recursion in (3) involving all pairs of the generating list. Between iteration $k$ and $k+1$ , the generating list is incrementally updated, by removing the oldest pair in the list (when $j=l$ ) and adding a new one. What makes the approach appealing is the ability of computing the matrix-vector product $H_{k}z=B_{k}^{-1}z$ with only $O(ld)$ floating point operations for any vector $z$ . This procedure entirely relies on vector-vector product which does not explicitly construct the $d$ -by- $d$ matrix $B_{k}$ or $H_{k}$ . The price to pay is that superlinear convergence becomes out of reach.

L-BFGS is thus appropriate for high-dimensional problems (when $d$ is large), but it still requires computing the full gradient at each iteration, which may be cumbersome in the large sum setting (2). This motivated stochastic counterparts of the Quasi-Newton methods (SQN) . Unfortunately, directly substituting the full gradient $\nabla f(x_{k})$ by its stochastic counterpart does not lead to a convergent scheme. Instead, the SQN method uses updates with sub-sampled Hessian-vector products, which leads to a sublinear convergence rate. Later, a linearly-convergent SQN algorithm is proposed by exploiting a variance reduction scheme . However, it is unclear how to extend these techniques into the composite setting.

2 Quasi-Newton methods for composite optimization

Different approaches have been proposed to extend Quasi-Newton methods to composite optimization problems. A first approach consists in minimizing successive quadratic approximations, also called proximal Quasi-Newton methods . More concretely, a local quadratic approximation $q_{k}$ is minimized at each iteration:

where $B_{k}$ is a Hessian approximation based on Quasi-Newton methods. The minimizer of $q_{k}$ provides a descent direction, which is subsequently used to build the next iterate. However, a closed form solution of (4) is usually not available since $B_{k}$ changes over the iterations. Thus, one needs to apply an optimization algorithm to approximately solve (4). The composite structure of the subproblem naturally leads to choosing a first-order optimization algorithm, such as randomized coordinate descent algorithms. Then, superlinear complexity becomes out of reach since it requires the subproblems (4) to be solved with “high accuracy” . The global convergence rate of this inexact variant has been for instance analyzed in , where a sublinear convergence rate is obtained for convex problems; later, the analysis has been extended to strongly convex problems in , where linear convergence rate is achieved.

A second approach of extending Quasi-Newton methods to composite optimization problems is based on smoothing techniques. More precisely, a Quasi-Newton method is applied to a smoothed version of the objective. For instance, one may use the forward-backward envelope . The idea is to mimic forward-backward splitting methods and apply Quasi-Newton instead of gradient steps on top of the envelope. Another well known smoothing technique is to apply the Moreau-Yosida regularization which gives the smoothed function called Moreau envelope. Then apply Quasi-Newton methods on it leads to the family of variable metric proximal point algorithms . Our method pursues this line of work by developing a practical inexact variant with global complexity guarantees.

3 Combining the proximal point algorithm and Quasi-Newton methods

We briefly recall the definition of the Moreau envelope and its basic properties.

Given an objective function $f$ and a smoothing parameter $\kappa>0$ , the Moreau envelope of $f$ is the function $F$ obtained by performing the infimal convolution

When $f$ is convex, the sub-problem defined in (5) is strongly convex which provides an unique minimizer, called the proximal point of $x$ , which we denote by $p(x)$ .

If $f$ is convex, the Moreau envelope $F$ defined in (5) satisfies

and the solution set of the two above problems coincide with each other.

$F$ is continuously differentiable even when $f$ is not and

Moreover the gradient $\nabla F$ is Lipschitz continuous with constant $L_{F}=\kappa$ .

$F$ is convex; moreover, when $f$ is $\mu$ -strongly convex with respect to the Euclidean norm, $F$ is $\mu_{F}$ -strongly convex with $\mu_{F}=\frac{\mu\kappa}{\mu+\kappa}.$

Interestingly, $F$ inherits all the convex properties of $f$ and more importantly it is always continuously differentiable, see for elementary proofs. Moreover, the condition number of $F$ is given by

which may be adjusted by the regularization parameter $\kappa$ . Then, a naive approach to overcome the non-smoothness of the function $f$ is to transfer the optimization problem to its Moreau envelope $F$ . More concretely, we may apply an optimization algorithm to minimize $F$ and use the obtained solution as a solution to the original problem, since both functions share the same minimizers. This yields the following well-known algorithms.

Consider the gradient descent method with constant step size $1/L_{F}$ :

By rewriting the gradient $\nabla F(x_{k})$ as $\kappa(x_{k}-p(x_{k}))$ , we obtain the proximal point algorithm :

Accelerated proximal point algorithm.

Since gradient descent on $F$ yields the proximal point algorithm, it is natural to apply an accelerated first-order method to get faster convergence. To that effect, Nesterov’s algorithm uses a two-stage update, along with a specific extrapolation parameter $\beta_{k+1}$ :

This is known as the accelerated proximal point algorithm introduced by Güler , which was recently extended in .

Variable metric proximal point algorithm.

One can also apply Quasi-Newton methods on $F$ , which yields

where $B_{k}$ is the Hessian approximation of $F$ based on Quasi-Newton methods. This is known as the variable metric proximal point algorithm .

Towards an inexact variable metric proximal point algorithm.

Quasi-Newton approaches have been applied after inexact Moreau envelope in various ways . In particular, it is shown in that if the sub-problems (5) are solved up to high enough accuracy, then the inexact variable metric proximal point algorithm preserves the superlinear convergence rate. However, the complexity for solving the sub-problems with high accuracy is typically not taken into account in these previous works.

In the unrealistic case where $p(x_{k})$ can be obtained at no cost, proximal point algorithm can afford much larger step sizes than classical gradient methods, thus are more effective. For instance, when $f$ is strongly convex, the Moreau envelope $F$ can be made arbitrarily well-conditioned by making $\kappa$ arbitrarily small, according to (8). Then, a single gradient step on $F$ is enough to be arbitrarily close to the optimum. In practice, however, sub-problems are solved only approximately and the complexity of solving the sub-problems is directly related to the smoothing parameter $\kappa$ . This leaves an important question: how to choose the smoothing parameter $\kappa$ . A small $\kappa$ makes the smoothed function $F$ better conditioned, while a large $\kappa$ is needed to improve the conditioning of the sub-problem (5).

The main contribution of our paper is to close this gap by providing a global complexity analysis which takes into account the complexity of solving the subproblems. More concretely, in the proposed QNing algorithm, we provide i) a practical stopping criterion for the sub-problems; ii) several warm-start strategies; iii) a simple line-search strategy which guarantees sufficient descent in terms of function value. These three components together yield the global convergence analysis, which allows us to use first-order method as a subproblem solver. Moreover, the global complexity we developed depends on the smoothing parameter $\kappa$ , which provides some insight about how to practically choose this parameter.

Solving the subproblems with first-order algorithms.

In the composite setting, both proximal Quasi-Newton methods and the variable metric proximal point algorithm require solving sub-problems, namely (4) and (5), respectively. In the general case, when a generic first-order method—e.g., proximal gradient descent—is used, our worst-case complexity analysis does not provide a clear winner, and our experiments in Section 5.4 confirm that both approaches perform similarly. However, when it is possible to exploit the specific structure of the sub-problems in one case, but not in the other one, the conclusion may differ.

QNing: a Quasi-Newton meta-algorithm

We now present the QNing method in Algorithm 1, which consists of applying variable metric algorithms on the smoothed objective $F$ with inexact gradients. Each gradient approximation is the result of a minimization problem tackled with the algorithm $\mathcal{M}$ , used as a sub-routine. The outer loop of the algorithm performs Quasi-Newton updates. The method $\mathcal{M}$ can be any algorithm of the user’s choice, as long as it enjoys linear convergence rate for strongly convex problems. More technical details are given in Section 3.1.

We now discuss the main algorithm components and its main features.

We apply variable metric algorithms with a simple line search strategy similar to on the Moreau envelope $F$ . Given a positive definite matrix $H_{k}$ and a step size $\eta_{k}$ in $$, the algorithm computes the update

where $H_{0}=\frac{1}{\kappa}I$ . When $\eta_{k}=1$ , the update uses the metric $H_{k}$ , and when $\eta_{k}=0$ , it uses an inexact proximal point update $x_{k+1}=x_{k}-(1/\kappa)g_{k}$ . In other words, when the quality of the metric $H_{k}$ is not good enough, due to the inexactness of the gradients used in its construction, the update is corrected towards a simple proximal point update, whose convergence is well understood when the gradients are inexact.

In order to determine the stepsize $\eta_{k}$ , we introduce the following descent condition,

We show that the descent condition (13) is always satisfied when $\eta_{k}=0$ , thus the finite termination of the line search follows, see Section 4.3 for more details. In our experiments, we observed empirically that the stepsize $\eta_{k}=1$ was almost always selected. In practice, we try the values $\eta_{k}$ in $\{1,1/2,1/4,1/8,0\}$ starting from the largest one and stops whenever condition (13) is satisfied.

Example of variable metric algorithm: inexact L-BFGS method.

The L-BFGS rule we consider is the standard one and consists in updating incrementally a generating list of vectors $\{(s_{i},y_{i})\}_{i=1\ldots j}$ , which implicitly defines the L-BFGS matrix. We use here the two-loop recursion detailed in [51, Algorithm 7.4] and use skipping steps when the condition $s_{i}^{\top}y_{i}>0$ is not satisfied, in order to ensure the positive-definiteness of the L-BFGS matrix $H_{k}$ (see ).

Inner-loop: approximate Moreau envelope.

The inexactness of our scheme comes from the approximation of the Moreau envelope $F$ and its gradient. The procedure $\text{{ApproxGradient}}\left({}\right)$ calls an minimization algorithm $\mathcal{M}$ and apply $\mathcal{M}$ to minimize the sub-problem (11). When the problem is solved exactly, the function returns the exact values $g=\nabla F(x)$ , $F_{a}=F(x)$ , and $z=p(x)$ . However, this is infeasible in practice and we can only expect approximate solutions. In particular, a stopping criterion should be specified. We consider the following variants:

we define an adaptive stopping criterion based on function values and stop $\mathcal{M}$ when the approximate solution satisfies the inequality (12). In contrast to standard stopping criterion where the accuracy is an absolute constant, our stopping criterion is adaptive since the righthand side of (12) also depends on the current iterate $z$ . More detailed theoretical insights will be given in Section 4. Typically, checking whether or not the criterion is satisfied requires computing a duality gap, as in Catalyst .

using a pre-defined budget $T_{\mathcal{M}}$ in terms of number of iterations of the method $\mathcal{M}$ , where $T_{\mathcal{M}}$ is a constant independent of $k$ .

Note that such an adaptive stopping criterion is relatively classical in the literature of inexact gradient-based methods . As we will see later in Section 4, when $T_{\mathcal{M}}$ is large enough, criterion (12) is guaranteed.

Requirements on ℳℳ\mathcal{M}.

To apply QNing, the optimization method $\mathcal{M}$ needs to have linear convergence rates for strongly-convex problems. More precisely, for any strongly-convex objective $h$ , the method $\mathcal{M}$ should be able to generate a sequence of iterates $(w_{t})_{t\geq 0}$ such that

where $w_{0}$ is the initial point given to $\mathcal{M}$ . The notion of linearly-convergent methods extends naturally to non-deterministic methods where (14) is satisfied in expectation:

The linear convergence condition typically holds for many primal gradient-based optimization techniques, including classical full gradient descent methods, block-coordinate descent algorithms , or variance reduced incremental algorithms . In particular, our method provides a generic way to combine incremental algorithms with Quasi-Newton methods which are suitable for large-scale optimization problems. For the simplicity of the presentation, we only consider the deterministic variant (14) in the analysis. However, it is possible to show that the same complexity results still hold for non-deterministic methods in expectation, as discussed in Section 4.5. We emphasize that we do not assume any convergence guarantee of $\mathcal{M}$ on non-strongly convex problems since our sub-problems are always strongly convex.

Warm starts for the sub-problems.

Employing an adequate initialization for solving each sub-problem plays an important rule in our analysis. The warm start strategy we proposed here ensures that the stopping criterion in each subproblem can be achieved in a constant number of iterations: Consider the minimization of a sub-problem

Then, our warm start strategy depends on the nature of $f$ :

when $f$ is smooth, we initialize with $w_{0}=x$ ;

when $f=f_{0}+\psi$ is composite, we initialize with

which performs an additional proximal step comparing to the smooth case.

Handling composite objective functions.

Convergence and complexity analysis

In this section, we study the convergence of the QNing algorithm—that is, the rate of convergence of the quantities $(F(x_{k})-F^{*})_{k\geq 0}$ and $(f(z_{k})-f^{*})_{k\geq 0}$ , and also the computational complexity due to solving the sub-problems (11). We start by stating the main properties of the gradient approximation in Section 4.1. Then, we analyze the convergence of the outer loop algorithm in Section 4.2, and Section 4.3 is devoted to the properties of the line search strategy. After that, we provide the cost of solving the sub-problems in Section 4.4 and derive the global complexity analysis in Section 4.5.

The next lemma is classical and provides approximation guarantees about the quantities returned by the ApproxGradient procedure (Algorithm 2); see . We recall here the proof for completeness.

Moreover, $F_{a}$ is related to $f$ by the following relationship

(16) and (19) are straightforward by definition of $h(z)$ . Since $f$ is convex, the function $h$ is $\kappa$ -strongly convex, and (17) follows from

where we recall that $p(x)$ minimizes $h$ . Finally, we obtain (18) from

by using the definitions of $g$ and the property (6). ∎

This lemma allows us to quantify the quality of the gradient and function value approximations, which is crucial to control the error accumulation of inexact proximal point methods. Moreover, the relation (19) establishes a link between the approximate function value of $F$ and the function value of the original objective $f$ ; as a consequence, it is possible to relate the convergence rate of $f$ from the convergence rate of $F$ . Finally, the following result is a direct consequence of Lemma 1:

Consider the same quantities introduced in Lemma 1. Then,

The right-hand side of Eq. (20) follows from

Interchanging $\nabla F(x)$ and $g$ gives the left-hand side inequality. ∎

If $\varepsilon\leq\frac{c}{\kappa}\|g\|^{2}$ with $c<\frac{1}{4}$ , then

This corollary is important since it allows to replace the unknown exact gradient $\|\nabla F(x)\|$ by its approximation $\|g\|$ , at the cost of a constant factor, as long as the condition $\varepsilon\leq\frac{c}{\kappa}\|g\|^{2}$ is satisfied.

2 Convergence analysis of the outer loop

We are now in shape to establish the convergence of the QNing meta-algorithm, without considering yet the cost of solving the sub-problems (11). At iteration $k$ , an approximate proximal point is evaluated:

The following lemma characterizes the expected descent in terms of objective function value.

At iteration $k$ , if the sub-problem (22) is solved up to accuracy $\varepsilon_{k}$ in the sense of Lemma 1 and the next iterate $x_{k+1}$ satisfies the descent condition (13), then,

This lemma gives us a first intuition about the natural choice of the accuracy $\varepsilon_{k}$ , which should be in the same order as $\|\nabla F(x_{k})\|^{2}$ . In particular, if

which is a typical inequality used for analyzing gradient descent methods. Before presenting the convergence result, we remark that condition (24) cannot be used directly since it requires knowing the exact gradient $\|\nabla F(x_{k})\|$ . A more practical choice consists of replacing it by the approximate gradient.

The following condition implies inequality (24):

This is the first stopping criterion (12) in Algorithm 2. Finally, we obtain the following convergence result for strongly convex problems, which is classical in the literature of inexact gradient methods (see Section 4.1 of for a similar result).

Assume that $f$ is $\mu$ -strongly convex. Let $(x_{k})_{k\geq 0}$ be the sequences generated by Algorithm 1 where the stopping criterion (12) is used. Then,

The proof follows directly from (25) and the standard analysis of the gradient descent algorithm for the $\mu_{F}$ -strongly convex and $L_{F}$ -smooth function $F$ by remarking that $L_{F}=\kappa$ and $\mu_{F}=\frac{\mu\kappa}{\mu+\kappa}$ . ∎

Under the conditions of Proposition 2, we have

Moreover, $F(x_{0})$ is upper-bounded by $f(x_{0})$ following (7). ∎

It is worth pointing out that our analysis establishes a linear convergence rate whereas one would expect a superlinear convergence rate as for classical variable metric methods. The tradeoff lies in the choice of the accuracy $\varepsilon_{k}$ . In order to achieve superlinear convergence, the approximation error $\varepsilon_{k}$ needs to decrease superlinearly, as shown in . However, a fast decreasing sequence $\varepsilon_{k}$ requires an increasing effort in solving the sub-problems, which will dominate the global complexity. In other words, the global complexity may become worse even though we achieve faster convergence in the outer-loop. This will become clearer when we discuss the inner loop complexity in Section 4.4.

Next, we show that under a bounded level set condition, QNing enjoys the classical sublinear $O(1/k)$ convergence rate when the objective is convex but not strongly convex.

Let $f$ be a convex function with bounded level sets. Then, there exists a constant $R>0$ , which depends on the initialization point $x_{0}$ , such that the sequences $(x_{k})_{k\geq 0}$ and $(z_{k})_{k\geq 0}$ generated by Algorithm 1 with stopping criterion (12), satisfies

We defer the proof and the proper definition of the bounded level set assumption to Appendix A. ∎

So far, we have assumed in our analysis that the iterates satisfy the descent condition (13), which means the line search strategy will always terminate. We prove this is indeed the case in the next section and provide some additional conditions under which a non-zero step size will be selected.

At iteration $k$ , a line search is performed on the stepsize $\eta_{k}$ to find the next iterate

such that $x_{k+1}$ satisfies the descent condition (13). We first show that the descent condition holds when $\eta_{k}=0$ before giving a more general result.

If the sub-problems are solved up to accuracy $\varepsilon_{k}\leq\frac{1}{36\kappa}\|g_{k}\|^{2}$ , then the descent condition (13) holds when $\eta_{k}=0$ .

When $\eta_{k}=0$ , $x_{k+1}=x_{k}-\frac{1}{\kappa}g_{k}=z_{k}$ . Then,

Therefore, it is theoretically sound to take the trivial step size $\eta_{k}=0$ , which implies the termination of our line search strategy. In other words, the descent condition always holds by taking an inexact gradient step on the Moreau envelope $F$ , which corresponds to the update of the proximal point algorithm. However, the purpose of using variable metric method is to exploit the curvature of the function, which is not the case when $\eta_{k}=0$ . Thus, the trivial step size should be only considered as a backup plan and we show in the following some sufficient conditions for taking non-zero step sizes and even stronger, unit step sizes.

If the sub-problems are solved up to accuracy $\varepsilon_{k}\leq\frac{1}{36\kappa}\|g_{k}\|^{2}$ , then the sufficient condition (13) holds for any $x_{k+1}=x_{k}-A_{k}g_{k}$ where $A_{k}$ is a positive definite matrix satisfying $\frac{1-\alpha}{\kappa}I\preceq A_{k}\preceq\frac{1+\alpha}{\kappa}I$ with $\alpha\leq\frac{1}{3}$ .

As a consequence, a line search strategy consisting of finding the largest $\eta_{k}$ of the form $\gamma^{i}$ , with $i=1,\ldots,+\infty$ and $\gamma$ in $(0,1)$ always terminates in a bounded number of iterations if the sequence of variable metric $(H_{k})_{k\geq 0}$ is bounded, i.e. there exists $0<m<M$ such that for any $k$ , $mI\preceq H_{k}\preceq MI$ . This is the case for L-BFGS update:

The variable metric matrices $(B_{k})_{k}$ constructed by the L-BFGS rule are positive definite and bounded.

First, we recall that $z_{k}=x_{k}-\frac{1}{\kappa}g_{k}$ and we rewrite

We are going to bound the two error terms $E_{1}$ and $E_{2}$ by some factors of $\|g_{k}\|^{2}$ . Noting that the subproblems are solved up to $\varepsilon_{k}\leq\frac{c}{\kappa}\|g_{k}\|^{2}$ with $c=\frac{1}{36}$ , we obtain by construction

where the last inequality comes from Corollary 1. Moreover,

Since $x_{k+1}-z_{k}=(\frac{1}{\kappa}-A_{k})g_{k}$ , we have $\|x_{k+1}-z_{k}\|\leq\frac{\alpha}{\kappa}\|g_{k}\|$ . This implies that

Second, by the $\kappa$ -smoothness of $F$ , we have

where the last inequality follows from $\|x_{k+1}-z_{k}\|\leq\frac{\alpha}{\kappa}\|g_{k}\|$ . Combining (30) and (31) yields

When $c\leq\frac{1}{36}$ and $\alpha\leq\frac{1}{3}$ , we have

where the last equality follows from (19), this completes the proof. ∎

Note that in practice, we consider a set of step sizes $\eta_{k}=\gamma^{i}$ for $i\leq i_{\text{max}}$ or $\eta_{k}=0$ , which naturally upper-bounds the number of line search iterations to $i_{\text{max}}$ . More precisely, all experiments performed in this paper use $\gamma=1/2$ and $i_{\text{max}}=3$ . Moreover, we observe that the unit stepsize is very often sufficient for the descent condition to hold, as empirically studied in Appendix C.2.

The following result shows that under a specific assumption on the Moreau envelope $F$ , the unit stepsize is indeed selected when the iterate are close to the optimum. The condition, called Dennis-Moré criterion , is classical in the literature of Quasi-Newton methods. Even though we cannot formally show that the criterion holds for the Moreau envelope $F$ , since it requires $F$ to be twice continuously differentiable, which is not true in general, see , it provides a sufficient condition for the unit step size. Therefore, the lemma below should not be seen as an formal explanation for the choice of step size $\eta_{k}=1$ , but simply as a reasonable condition that leads to this choice.

Assume that $f$ is strongly convex and $F$ is twice-continuously differentiable with Lipschitz continuous Hessian $\nabla^{2}F$ . If the sub-problems are solved up to accuracy $\varepsilon_{k}\leq\frac{\mu^{2}}{128\kappa(\mu+\kappa)^{2}}\|g_{k}\|^{2}$ and the Dennis-Moré criterion is satisfied, i.e.

where $x^{*}$ is the minimizer of the problem and $B_{k}=H_{k}^{-1}$ is the variable metric matrix, then the descent condition (13) is satisfied with $\eta_{k}=1$ when $k$ is large enough.

We remark that the Dennis-Moré criterion we use here is slightly different from the standard one since our criterion is based on approximate gradients $g_{k}$ . If the $g_{k}$ ’s are indeed the exact gradients and the variable metric $B_{k}$ are bounded, then our criterion is equivalent to the standard Dennis-Moré criterion. The proof of the lemma is close to that of similar lemmas appearing in the proximal Quasi-Newton literature , and is relegated to the appendix. Interestingly, this proof also suggests that a stronger stopping criterion $\varepsilon_{k}$ such that $\varepsilon_{k}=o(\|g_{k}\|^{2})$ could lead to superlinear convergence. However, such a choice of $\varepsilon_{k}$ would significantly increase the complexity for solving the sub-problems, and overall degrade the global complexity.

4 Complexity analysis of the inner loop

In this section, we evaluate the complexity of solving the sub-problems (11) up to the desired accuracy using a linearly convergent method $\mathcal{M}$ . Our main result is that all sub-problems can be solved in a constant number $T_{\mathcal{M}}$ of iterations (in expectation if the method is non-deterministic) using the proposed warm start strategy. Let us consider the sub-problem with an arbitrary prox center $x$ ,

The number of iterations needed is determined by the ratio between the initial gap $h(w_{0})-h^{*}$ and the desired accuracy. We are going to bound this ratio by a constant factor.

If $f$ is differentiable with $L$ -Lipschitz continuous gradients, we initialize the method $\mathcal{M}$ with $w_{0}=x$ . Then, we have the guarantee that

Denote by $w^{*}$ the minimizer of $h$ . Then, we have the optimality condition $\nabla f(w^{*})+\kappa(w^{*}-x)=0$ . As a result,

The inequality in the above proof relies on the smoothness of $f$ , which does not hold for composite problems. The next lemma addresses this issue.

Consider the composite optimization problem $f=f_{0}+\psi$ , where $f_{0}$ is $L$ -smooth. By initializing with

We use the inequality corresponding to Lemma 2.3 in : for any $w$ ,

with $L^{\prime}=L+\kappa$ . Then, we apply this inequality to $w=w^{*}$ , and

We get an initialization of the same quality in the composite case as in the smooth case, by performing an additional proximal step. It is important to remark that the above analysis do not require strong convexity of $f$ , which allows us to derive the desired inner-loop complexity.

Consider Algorithm 1 with the warm start strategy described in Lemma 9 or in Lemma 10. Assume that the optimization method $\mathcal{M}$ applied in the inner loop produces a sequence $(w_{t})_{t\geq 0}$ for each sub-problem (34) such that

Then, the stopping criterion $\varepsilon\leq\frac{1}{72\kappa}\|g\|^{2}$ is achieved in at most $T_{\mathcal{M}}$ iterations with

Consider at iteration $k$ , we apply $\mathcal{M}$ to approximate the proximal mapping according to $x$ . With the given $T_{\mathcal{M}}$ (which we abbreviate by $T$ ), we have

where the last inequality follows from Lemma 2. ∎

Next, we extend the previous result obtained with deterministic methods $\mathcal{M}$ to randomized ones, where linear convergence is only achieved in expectation. The proof is a simple application of Lemma C.1 in (see also for related results on the expected complexity of randomized algorithms).

Assume that the optimization method $\mathcal{M}$ applied to each sub-problem (34) produces a sequence $(w_{t})_{t\geq 0}$ such that

We define the stopping time $T_{\mathcal{M}}$ by

which is the random variable corresponding to the minimum number of iterations to guarantee the stopping condition (12). Then, when the warm start strategy described in Lemma 9 or in Lemma 10 is applied, the expected number of iterations satisfies

It is worth to notice that the stopping criterium (12), i.e. $h(w)-h^{*}\leq\frac{\kappa}{36}\|w-x\|^{2}$ , can not be directly checked since $h^{*}$ is unknown. Nevertheless, an upper bound on the optimality gap $h(w)-h^{*}$ is usually available. In particular,

When $f$ is smooth, which implies $h$ is smooth, we have

Otherwise, we can evaluate the Fenchel conjugate function, which is a natural lower bound of $h^{*}$ , see Section D.2.3 in .

5 Global complexity of QNing

Finally, we can use the previous results to upper-bound the complexity of the QNing algorithm in terms of iterations of the method $\mathcal{M}$ for minimizing $f$ up to $\varepsilon$ .

Given a linearly-convergent method $\mathcal{M}$ satisfying (14), we apply $\mathcal{M}$ to solve the sub-problems of Algorithm 1 with the warm start strategy given in Lemma 9 or Lemma 10 up to accuracy $\varepsilon_{k}\leq\frac{1}{36\kappa}\|g_{k}\|^{2}$ . Then, the number of iterations of the method $\mathcal{M}$ to guarantee the optimality condition $f(z_{k})-f^{*}\leq\varepsilon$ is

for convex problems with bounded level sets:

The total number of calls of method $\mathcal{M}$ is simply $T_{\mathcal{M}}$ times the number of outer-loop iterations times the potential number of line search steps at each iteration (which is hidden in the $O(.)$ notation since this number can be made arbitrarily small). ∎

For non-deterministic methods, applying (40) yields a global complexity in expectation similar to the previous result with additional constant $2/\tau_{\mathcal{M}}$ in the last $\log$ factor.

As we shall see, the global complexity of our algorithm is mainly controlled by the smoothing parameter $\kappa$ . Unfortunately, under the current analysis, our algorithm QNing does not lead to an improved convergence rate in terms of the worst-case complexity bounds. It is worthwhile to underline, though, that this result is not surprising since it is often the case for L-BFGS-type methods, for which an important gap remains between theory and practice. Indeed, L-BFGS often outperforms the vanilla gradient descent method in many practical cases, but never in theory, which turns out to be the bottleneck in our analysis.

We give below the worst-case global complexity of QNing when applied to two optimization methods $\mathcal{M}$ of interest. Proposition 5 and its application to the two examples show that, in terms of worse-case complexity, the QNing scheme leaves the convergence rate almost unchanged.

Consider gradient descent with fixed constant step-size $1/L$ as the optimization method $\mathcal{M}$ . Directly applying gradient descent (GD) to minimize $f$ requires

iterations to achieve $\varepsilon$ accuracy. The complexity to achieve the same accuracy with QNing-GD is in the worst case

Consider the stochastic variance-reduced gradient (SVRG) as the optimization method $\mathcal{M}$ . SVRG minimizes $f$ to $\varepsilon$ accuracy in

iterations in expectation. QNing-SVRG achieves the same result with the worst-case expected complexity

Minimizing the above worst-case complexity respect to $\kappa$ suggests that $\kappa$ should be chosen as small as possible. However, such statement is based on the pessimistic theoretical analysis of L-BFGS-type method, which is not better than standard gradient descent methods. Noting that for smooth functions, L-BFGS method often outperforms Nesterov’s accelerated gradient method, it is reasonable to expect they achieve a similar complexity bound. In other words, the choice of $\kappa$ may be substantially different if one is able to show that L-BFGS-type method enjoys an accelerated convergence rate.

In order to illustrate the difference, we heuristically assume that L-BFGS method enjoys a similar convergence rarte as Nesterov’s accelerated gradient method. Then, the global complexity of our algorithm QNing matches the complexity of the related Catalyst acceleration scheme , which will be

for $\mu$ -strongly-convex problems. In such case, the complexity of QNing-GD and QNing-SVRG will be

which do enjoy acceleration by taking $\kappa=O(L)$ and $\kappa=O(L/n)$ respectively. In the following section, we will experiment with this heuristic, as if L-BFGS method enjoys an accelerated convergence rate. More precisely, we will choose the smoothing parameter $\kappa$ as in the related Catalyst acceleration scheme , and we present empirical evidence in support of this heuristic.

Experiments and practical details

In this section, we present the experimental results obtained by applying QNing to several first-order optimization algorithms. We start the section by presenting various benchmarks and practical parameter-tuning choices. Then, we study the performance of QNing applied to SVRG (Section 5.3) and to the proximal gradient algorithm ISTA (Section 5.4), which reduces to gradient descent (GD) in the smooth case. We demonstrate that QNing can be viewed as an acceleration scheme: by applying QNing to an optimization algorithm $\mathcal{M}$ , we achieve better performance than when applying $\mathcal{M}$ directly to the problem. Besides, we also compare QNing to existing stochastic variants of L-BFGS algorithm in Section 5.3. Finally, we study the behavior of QNing under different choice of parameters in Section 5.5. The code used for all the experiments is available at https://github.com/hongzhoulin89/Catalyst-QNing/.

We consider three common optimization problems in machine learning and signal processing, including logistic regression, Lasso and linear regression with Elastic-Net regularization. These three formulations all admit the composite finite-sum structure but differ in terms of smoothness and strong-convexness. Specifically, the three formulations are listed below.

which leads to a $\mu$ -strongly convex smooth optimization problem.

which is convex and non-smooth, but not strongly convex.

which is based on the Elastic-Net regularization leading to non-smooth strongly-convex problems.

We consider five standard machine learning datasets with different characteristics in terms of size and dimension, which are described below:

The first three data sets are standard machine learning data sets from LIBSVM . We normalize the features, which provides a natural estimate of the Lipschitz constant as mentioned previously. The last two data sets are coming from computer vision applications. MNIST and CIFAR-10 are two image classification data sets involving 10 classes. The feature representation of each image was computed using an unsupervised convolutional kernel network . We focus here on the task of classifying class #1 vs. other classes.

2 Choice of hyper-parameters and variants

We now discuss the choice of default parameters used in the experiments as well as the different variants. First, to deal with the high-dimensional nature of the data, we systematically use the L-BFGS metric $H_{k}$ and maintain the positive definiteness by skipping updates when necessary (see ).

We apply QNing to proximal SVRG algorithm and proximal gradient algorithm. The proximal SVRG algorithm is an incremental algorithm that is able to exploit the finite-sum structure of the objective and can deal with the composite regularization. We also consider the gradient descent algorithm and its proximal variant ISTA, which allows us to perform a comparison with the natural baselines FISTA and L-BFGS.

Stopping criterion for the inner loop.

Choice of regularization parameter κ𝜅\kappa.

We choose $\kappa$ as in the Catalyst algorithm , which is $\kappa=L$ for gradient descent/ISTA and $\kappa=L/2n$ for SVRG. Indeed, convergence of L-BFGS is hard to characterize and its theoretical rate of convergence can be pessimistic as shown in our theoretical analysis. Noting that for smooth functions, L-BFGS often outperforms Nesterov’s accelerated gradient method, it is reasonable to expect QNing achieves a similar complexity bound as Catalyst. Later in Section 5.5, we make a comparison between different values of $\kappa$ to demonstrate the effectiveness of this strategy.

Choice of limited memory parameter l𝑙l.

The default setting is $l=100$ . We show later in Section 5.5 a comparison with different values to study the influence of this parameter.

Implementation of the line search.

As mentioned earlier, we consider the stepsizes $\eta_{k}$ in the set $\{1,1/2,1/4,1/8,0\}$ and select the largest one that satisfies the descent condition.

Evaluation metric.

For all experiments, we use the number of gradient evaluations as a measure of complexity, assuming this is the computational bottleneck of all methods considered. This is indeed the case here since the L-BFGS step cost $O(ld)$ floating-point operations , whereas evaluating the gradient of the full objective costs $O(nd)$ , with $l\ll n$ .

3 QNing-SVRG for minimizing large sums of functions

We now apply QNing to SVRG and compare different variants.

SVRG: the Prox-SVRG algorithm of with default parameters $m=1$ and $\eta=1/L$ , where $L$ is the upper-bound on Lipschitz constant of the gradient, as described in the Section 5.1.

Catalyst-SVRG: The Catalyst meta-algorithm of applied to Prox-SVRG, using the variant (C3) that performs best among the different variants of Catalyst.

L-BFGS/Orthant: Since implementing effectively L-BFGS with a line-search algorithm is a bit involved, we use the implementation of Mark Schmidtavailable here http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html, which has been widely used in other comparisons . In particular, the Orthant-wise method follows the algorithm developed in . We use L-BFGS for the logistic regression experiment and the Orthant-wise method for elastic-net and lasso experiments. The limited memory parameter $l$ is set to $100$ .

QNing-SVRG: the algorithm according to the theory by solving the sub-problems until $\varepsilon_{k}\leq\frac{1}{36}\|g_{k}\|^{2}$ .

The result of the comparison is presented in Figure 1 and leads to the conclusions below, showing that QNing-SVRG1 is a safe heuristic, which never decreases the speed of the method SVRG:

L-BFGS/Orthant is less competitive than other approaches that exploit the sum structure of the objective, except on the dataset real-sim; the difference in performance with the SVRG-based approaches can be important (see dataset alpha).

QNing-SVRG1 is significantly faster than or on par with SVRG and QNing-SVRG.

QNing-SVRG is significantly faster than, or on par with, or only slightly slower than SVRG.

QNing-SVRG1 is significantly faster, or on par with Catalyst-SVRG. This justifies our choice of $\kappa$ which assumes “a priori” that L-BFGS performs as well as Nesterov’s method.

So far, we have shown that applying QNing with SVRG provides a significant speedup compared to the original SVRG algorithm or other acceleration scheme such as Catalyst. Now we compare our algorithm to other variable metric approaches including Proximal L-BFGS and Stochastic L-BFGS :

Proximal L-BFGS: We apply the Matlab package PNOPTavailable here https://web.stanford.edu/group/SOL/software/pnopt implemented by . The sub-problems are solved by the default algorithm up to desired accuracy. We consider one sub-problem as one gradient evaluation in our plot, even though it often requires multiple passes.

Stochastic L-BFGS (for smooth objectives): We apply the Matlab package StochBFGSavailable here https://perso.telecom-paristech.fr/rgower/software.html implemented by . We consider the ’prev’ variant which has the best practical performance.

The result of the comparison is presented in Figure 2 and we observe that QNing-SVRG1 is significantly faster than Proximal L-BFGS and Stochastic L-BFGS:

Proximal L-BFGS often outperforms Orthant-based methods but it is less competitive than QNing.

Stochastic L-BFGS is sensitive to parameters and data since the variable metric is based on stochastic information which may have high variance. It performs well on dataset covtype but becomes less competitive on other datasets. Moreover, it only applies to smooth problems.

The previous results are complemented by Appendix C.1, which also presents some comparison in terms of outer-loop iterations, regardless of the cost of the inner-loop.

4 QNing-ISTA and comparison with L-BFGS

The previous experiments have included a comparison between L-BFGS and approaches that are able to exploit the sum structure of the objective. It is then interesting to study the behavior of QNing when applied to a basic proximal gradient descent algorithm such as ISTA. Specifically, we now consider

GD/ISTA: the classical proximal gradient descent algorithm ISTA with back-tracking line-search to automatically adjust the Lipschitz constant of the gradient objective;

Acc-GD/FISTA: the accelerated variant of ISTA from .

QNing-ISTA, and QNing-ISTA1, as in the previous section replacing SVRG by GD/ISTA.

The results are reported in Figure 3 and lead to the following conclusions

L-BFGS is slightly better on average than QNing-ISTA1 for smooth problems, which is not surprising since we use a state-of-the-art implementation with a well-calibrated line search.

QNing-ISTA1 is always significantly faster than ISTA and QNing-ISTA.

The QNing-ISTA approaches are significantly faster than FISTA in 12 cases out of 15.

There is no clear conclusion regarding the performance of the Orthant-wise method vs other approaches. For three datasets, covtype, alpha and mnist, QNing-ISTA is significantly better than Orthant-wise. However, on the other two datasets, the behavior is different Orthant-wise method outperforms QNing-ISTA.

5 Experimental study of hyper-parameters l𝑙l and κ𝜅\kappa

In this section, we study the influence of the limited memory parameter $l$ and of the regularization parameter $\kappa$ in QNing. More precisely, we start with the parameter $l$ and try the method QNing-SVRG1 with the values $l=1,2,5,10,20,100$ . Note that all previous experiments were conducted with $l=100$ , which is the most expensive in terms of memory and computational cost for the L-BFGS step. The results are presented in Figure 4. Interestingly, the experiment suggests that having a large value for $l$ is not necessarily the best choice, especially for composite problems where the solution is sparse, where $l=10$ seems to perform reasonably well.

The next experiment consists of studying the robustness of QNing to the smoothing parameter $\kappa$ . We present in Figure 5 an experiment by trying the values $\kappa=10^{i}\kappa_{0}$ , for $i=-3,-2,\ldots,2,3$ , where $\kappa_{0}$ is the default parameter that we used in the previous experiments. The conclusion is clear: QNing clearly slows down when using a larger smoothing parameter than $\kappa_{0}$ , but it is robust to small values of $\kappa$ (and in fact it even performs better for smaller values than $\kappa_{0}$ ).

Discussions and concluding remarks

A few questions naturally arise regarding the QNing scheme: one may wonder whether or not our convergence rates may be improved, or if the Moreau envelope could be replaced by another smoothing technique. In this section, we discuss these two points and present concluding remarks.

In this paper, we have established the linear convergence of QNing for strongly convex objectives when sub-problems are solved with enough accuracy. Since QNing uses Quasi-Newton steps, one might have expected a superlinear convergence rate as several Quasi-Newton algorithms often enjoy . The situation is as follows. Consider the BFGS algorithm (without limited memory), as shown in , if the sequence $(\varepsilon_{k})_{k\geq 0}$ decreases super-linearly, then, it is possible to design a scheme similar to QNing that indeed enjoys a super-linear convergence rate. There is a major downside though: a super-linearly decreasing sequence $(\varepsilon_{k})_{k\geq 0}$ implies an exponentially growing number of iterations in the inner-loops, which will degrade the global complexity of the algorithm. This issue makes the approach proposed in impractical.

Another potential strategy for obtaining a faster convergence rate consists in interleaving a Nesterov-type extrapolation step in the QNing algorithm. Indeed, the convergence rate of QNing scales linearly in the condition number $\mu_{F}/L_{F}$ , which suggests that a faster convergence rate could be obtained using a Nesterov-type acceleration scheme. Empirically, we did not observe any benefit of such a strategy, probably because of the pessimistic nature of the convergence rates that are typically obtained for Quasi-Newton approaches based on L-BFGS. Obtaining a linear convergence rate for an L-BFGS algorithm is still an important sanity check, but to the best of our knowledge, the gap in performance between these worst-case rates and practice has always been huge for this class of algorithms.

2 Other types of smoothing

The Moreau envelope we considered is a particular instance of infimal convolution smoothing , whose family also encompasses the so-called Nesterov smoothing . Other ways to smooth a function include randomization techniques or specific strategies tailored for the objective at hand.

One of the main purposes of applying the Moreau envelope is to provide a better conditioning. As recalled in Proposition 1, the gradient of the smoothed function $F$ is $\kappa-$ Lipschitz continuous regardless of whether the original function is continuously differentiable or not. Furthermore, the conditioning of $F$ is improved with respect to the original function, with a condition number depending on the amount of smoothing. As highlighted in , this property is also shared by other types of infimal convolutions. Therefore, QNing could potentially be extended to such types of smoothing in place of the Moreau envelope. A major advantage of our approach, though, is its outstanding simplicity.

3 Concluding remarks

To conclude, we have proposed a generic mechanism, QNing, to accelerate existing first-order optimization algorithms with Quasi-Newton-type methods. QNing’s main features are the compatibility with variable metric update rule and composite optimization. Its ability of combining with incremental approaches makes it a promising tool for solving large-scale machine learning problems. A few questions remain however open regarding the use of the method in a pure stochastic optimization setting, and the gap in performance between worst-case convergence analysis and practice is significant. We are planning to address the first question about stochastic optimization in future work; the second question is unfortunately difficult and is probably one of the main open questions in the literature about L-BFGS methods.

Acknowledgements

The authors would like to thank the editor and the reviewers for their constructive and detailed comments. This work was supported by the ERC grant SOLARIS (number 714381), a grant from ANR (MACARON project ANR-14-CE23-0003-01), and the program “Learning in Machines and Brains” (CIFAR).

References

Appendix A Proof of Proposition 3

First, we show that the Moreau envelope $F$ inherits the bounded level set property from $f$ .

If $f$ has bounded level sets, then its Moreau envelope $F$ has bounded level sets as well.

First, from Proposition 1, the minimum of $F$ is attained at $x^{*}$ . Next, we reformulate the bounded level set property by contraposition: for any $x$ , there exists $R_{x}>0$ such that

Let $y$ satisfies the above inequality, by definition,

Then either $\|y-p(y)\|>\sqrt{\frac{2(f(x)-f^{*})}{\kappa}}$ or $\|p(y)-x^{*}\|>R_{x}$ .

If $\|y-p(y)\|>\sqrt{\frac{2(f(x)-f^{*})}{\kappa}}$ , then

We are now in shape to prove the proposition.

Thus $F(x_{k})$ is decreasing. From the bounded level set property of $F$ , there exists $R>0$ such that $\|x_{k}-x^{*}\|\leq R$ for any $k$ . By the convexity of $F$ , we have

Let us define $r_{k}\triangleq f(x_{k})-f^{*}$ . Thus,

Then, after exploiting the telescoping sum,

Appendix B Proof of Lemma 8

Let us denote $\delta_{k}=-B_{k}^{-1}g_{k}$ and let the subproblems solved to accuracy $\varepsilon_{k}\leq\frac{c}{\kappa}\|g_{k}\|^{2}$ . We show that when $c\leq\frac{\mu^{2}}{128(\mu+\kappa)^{2}}$ , the following two inequalities hold:

Then summing up the above inequalities yields

where the last inequality holds since $F(x_{k})\leq F_{k}$ and $o(\|g_{k}\|^{2})\leq\frac{1}{4\kappa}\|g_{k}\|^{2}$ when $k$ is large enough. This is the desired descent condition (13).

We first prove (42) which relies on the somoothness and Lipschitz Hessian assumption of $F$ . More concretely,

We are going upper bound each term one by one. First,

where the last inequality uses (DM) and the $\kappa$ -smoothness of $F$ which implies $\nabla^{2}F(x^{*})\preceq\kappa I$ . Second,

where the last line comes from (DM) and the fact that $\|x_{k}-x^{*}\|\rightarrow 0$ . Last, since

Summing up above four inequalities yields (42). Next, we prove the other desired inequality (43). The main effort is to bound $\|g_{k+1}\|$ by a constant factor times $\|g_{k}\|$ . From the inexactness of the subproblem, we have

Appendix C Additional experiments

In this section, we provide additional experimental results including experimental comparisons in terms of outer loop iterations, and an empirical study regarding the choice of the unit step size $\eta_{k}=1$ .

In the main paper, we have used the number of gradient evaluations as a natural measure of complexity. Here, we also provide a comparison in terms of outer-loop iterations, which does not take into account the complexity of solving the sub-problems. While interesting, the comparison artificially gives an advantage to the stopping criteria (12) since achieving it usually requires multiple passes.

The result of the comparison is presented in Figure 6. We observe that the theoretical grounded variant QNing-SVRG always outperform the one-pass heuristic QNing-SVRG1. This is not surprising since the sub-problems are solved more accurately in the theoretical grounded variant. However, once we take the complexity of the sub-problems into account, QNing-SVRG never outperforms QNing-SVRG1. This suggests that it is not beneficial to solve the sub-problem up to high accuracy as long as the algorithm converge.

C.2 Empirical frequency of choosing the unit stepsize

In this section, we evaluate how often the unit stepsize is taken in the line search. When the unit stepsize is taken, the variable metric step provides sufficient decrease, which is the key for acceleration. The statistics of QNing-SVRG1 (one-pass variant) and QNing-SVRG (the sub-problems are solved until the stopping criteria (12) is satisfed ) are given in Table 2 and Table 4, respectively. As we can see, for most of the iterations ( $>90\%$ ), the unit stepsize is taken.