A Linearly-Convergent Stochastic L-BFGS Algorithm

Philipp Moritz, Robert Nishihara, Michael I. Jordan

Introduction

A trend in machine learning has been toward using more parameters to model larger datasets. As a consequence, it is important to design optimization algorithms for these large-scale problems. A typical optimization problem arising in this setting is empirical risk minimization. That is,

When $d$ is small, Newton’s method is often the algorithm of choice due to its rapid convergence (both in theory and in practice). However, Newton’s method requires the computation and inversion of the Hessian matrix $\nabla^{2}f(w)$ , which may be computationally too expensive in high dimensions. As a consequence, practitioners are often limited to using first-order methods which only compute gradients of the objective, requiring $O(d)$ computation per iteration. The gradient method is the simplest example of a first-order method, but much work has been done to design quasi-Newton methods which incorporate information about the curvature of the objective without ever computing second derivatives. L-BFGS (Liu and Nocedal, 1989), the limited-memory version of the classic BFGS algorithm, is one of the most successful algorithms in this space. Inexact Newton methods are another approach to using second order information for large-scale optimization. They approximately invert the Hessian in $O(d)$ steps. This can be done by using a constant number of iterations of the conjugate gradient method (Dembo et al., 1982; Dembo and Steihaug, 1983; Nocedal and Wright, 2006).

When $N$ is large, batch algorithms such as the gradient method, which compute the gradient of the full objective at every iteration, are slowed down by the fact that they have to process every data point before updating the model. Stochastic optimization algorithms get around this problem by updating the model $w$ after processing only a small subset of the data, allowing them to make much progress in the time that it takes the gradient method to make a single step.

For many machine learning problems, where both $d$ and $N$ are large, stochastic gradient descent (SGD) and its variants are the most widely used algorithms (Robbins and Monro, 1951; Bottou, 2010; Bottou and LeCun, 2004), often because they are some of the few algorithms that can realistically be applied in this setting.

Given this context, much research in optimization has been directed toward designing better stochastic first-order algorithms. For a partial list, see (Kingma and Ba, 2015; Sutskever et al., 2013; Duchi et al., 2011; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Roux et al., 2012; Wang et al., 2013; Nesterov, 2009; Frostig et al., 2015; Agarwal et al., 2014). In particular, much progress has gone toward designing stochastic variants of L-BFGS (Mokhtari and Ribeiro, 2014a; Wang et al., 2014; Byrd et al., 2014; Bordes et al., 2009; Schraudolph et al., 2007; Sohl-Dickstein et al., 2014).

Unlike gradient descent, L-BFGS does not immediately lend itself to a stochastic version. The updates in the stochastic gradient method average together to produce a downhill direction in expectation. However, as pointed out in Byrd et al. (2014), the updates used in L-BFGS to construct the inverse Hessian approximation overwrite one another instead of averaging. Our algorithm addresses this problem in the same ways as Byrd et al. (2014), by computing Hessian vector products formed from larger minibatches.

Though stochastic methods often make rapid progress early on, the variance of the estimates of the gradient slow their convergence near the optimum. To illustrate this phenomenon, even if SGD is initialized at the optimum, it will immediately move to a point with a worse objective value. For this reason, convergence guarantees typically require diminishing step sizes. One promising line of work involves speeding up the convergence of stochastic first-order methods by reducing the variance of the gradient estimates (Johnson and Zhang, 2013; Roux et al., 2012; Defazio et al., 2014; Shalev-Shwartz and Zhang, 2013).

We introduce a stochastic variant of L-BFGS that incorporates the idea of variance reduction and has two desirable features. First, it obtains a guaranteed linear rate of convergence in the strongly-convex case. In particular, it does not require a diminishing step size in order to guarantee convergence (as partially evidenced by the fact that if our algorithm is initialized at the optimum it will stay there). Second, it performs very well on large-scale optimization problems, exhibiting a qualitatively linear rate of convergence in practice.

The Algorithm

We consider the problem of minimizing the function

Our updates will use stochastic estimates of the gradient $\nabla f_{\mathcal{S}}$ as well as stochastic approximations to the inverse Hessian $\nabla^{2}f_{\mathcal{T}}$ . Following Byrd et al. (2014), we use distinct subsets $\mathcal{S},\mathcal{T}\subseteq\{1,\ldots,N\}$ in order to decouple the estimation of the gradient from the estimation of the Hessian. We let $b=|\mathcal{S}|$ and $b_{H}=|\mathcal{T}|$ .

Following Johnson and Zhang (2013), we occasionally compute full gradients, which we use to reduce the variance of our stochastic gradient estimates.

The update rule for our algorithm will take the form

In the gradient method, $H_{k}$ is the identity matrix. In Newton’s method, it is the inverse Hessian $(\nabla^{2}f(w_{k}))^{-1}$ . In our algorithm, as in L-BFGS, $H_{k}$ will be an approximation to the inverse Hessian. Instead of the usual stochastic estimate of the gradient, $v_{k}$ will be a stochastic estimate of the gradient with reduced variance.

Code for our algorithm is given in Algorithm 1. Our algorithm is specified by several parameters. It requires a step size $\eta$ , a memory size $M$ , and positive integers $m$ and $L$ . Every $m$ iterations, the algorithm performs a full gradient computation, which it uses to reduce the variance of the stochastic gradient estimates. Every $L$ iterations, the algorithm updates the inverse Hessian approximation. The vector $s_{r}$ records the average direction in which the algorithm has made progress over the past $2L$ iterations. The vector $y_{r}$ is obtained by multiplying $s_{r}$ by a stochastic estimate of the Hessian. Note that this differs from the usual L-BFGS algorithm, which produces $y_{r}$ by taking the difference between successive gradients. We find that this approach works better in the stochastic setting. The inverse Hessian approximation $H_{r}$ is defined from the pairs $(s_{j},y_{j})$ for $r-M+1\leq j\leq r$ using the standard L-BFGS update rule, which is described in Section 2.1. The user must also choose batch sizes $b$ and $b_{H}$ from which to construct the stochastic gradient and stochastic Hessian estimates.

In Algorithm 1 and below, we use $I$ to refer to the identity matrix. We use $\mathcal{F}_{k,t}$ to denote the sigma algebra generated by the random variables introduced up to the time when the iteration counters $k$ and $t$ have the specified values. That is,

We define the inverse Hessian approximation $H_{r}$ in Section 2.1. Note that we do not actually construct the matrix $H_{r}$ because doing so would require $O(d^{2})$ computation. In practice, we directly compute products of the form $H_{r}v$ using the two-loop recursion (Nocedal and Wright, 2006, Algorithm 7.4).

To define the inverse Hessian approximation $H_{r}$ from the pairs $(s_{j},y_{j})$ , we follow the usual L-BFGS method. Let $\rho_{j}=1/s_{j}^{\top}y_{j}$ and recursively define

for $r-M+1\leq j\leq r$ . Initialize $H_{r}^{(r-M)}=(s_{r}^{\top}y_{r}/\|y_{r}\|^{2})I$ and set $H_{r}=H_{r}^{(r)}$ .

Note that the update in Equation 4 preserves positive definiteness (note that $\rho_{j}>0$ ), which implies that $H_{r}$ and each $H_{r}^{(j)}$ will be positive definite, as will their inverses.

Preliminaries

Our analysis makes use of the following assumptions.

There exist positive constants $\lambda$ and $\Lambda$ such that

We will typically force strong convexity to hold by adding a strongly-convex regularizer to our objective (which can be absorbed into the $f_{i}$ ’s). These assumptions imply that $f$ has a unique minimizer, which we denote by $w_{*}$ .

Suppose that Assumption 1 and Assumption 2 hold. Let $B_{r}=H_{r}^{-1}$ . Then

Suppose that Assumption 1 and Assumption 2 hold. Then there exist constants $0<\gamma\leq\Gamma$ such that $H_{r}$ satisfies

In Section 7.2, we prove Lemma 4 with the values

We will make use of Lemma 5, a simple result for strongly convex functions. We include a proof for completeness.

The last equality holds by plugging in the minimizer $v=-\nabla f(x)/\lambda$ . ∎

In Lemma 6, we bound the variance of our variance-reduced gradient estimates. The proof of Lemma 6, given in Section 7.3, closely follows that of Johnson and Zhang (2013, Theorem 1).

Let $w_{*}$ be the unique minimizer of $f$ . Let $\mu_{k}=\nabla f(w_{k})$ and let $v_{t}=\nabla f_{\mathcal{S}}(x_{t})-\nabla f_{\mathcal{S}}(w_{k})+\mu_{k}$ be the variance-reduced stochastic gradient. Conditioning on $\mathcal{F}_{k,t}$ and taking an expectation with respect to $\mathcal{S}$ , we have

Convergence Analysis

Suppose that Assumption 1 and Assumption 2 hold. Let $w_{*}$ be the unique minimizer of $f$ . Then for all $k\geq 0$ , we have

where the convergence rate $\alpha$ is given by

assuming that we choose $\eta<\gamma\lambda/(2\Gamma^{2}\Lambda^{2})$ and that we choose $m$ large enough to satisfy

Using the Lipschitz continuity of $\nabla f$ , which follows from Assumption 2, we have

Conditioning on $\mathcal{F}_{k,t}$ and taking expectations in Equation 9, this becomes

Taking expectations over all random variables, summing over $t=0,\ldots,m-1$ , and using a telescoping sum gives

The second inequality follows from the fact that $f(w_{*})\leq f(x_{m})$ . Using the fact that $\eta<\gamma\lambda/(2\Gamma^{2}\Lambda^{2})$ , it follows that

Since we chose $m$ and $\eta$ to satisfy Equation 8, it follows that the rate $\alpha$ is less than one. This completes the proof. ∎

Related Work

There is a large body of work that attempts to improve on stochastic gradient descent by reducing variance. Shalev-Shwartz and Zhang (2013) propose stochastic dual coordinate ascent (SDCA). Roux et al. (2012) propose the stochastic average gradient method (SAG). Johnson and Zhang (2013) propose the stochastic variance reduced gradient (SVRG). Wang et al. (2013) develop an approach based on the construction of control variates. More recently, Frostig et al. (2015) devise an online version of SVRG that uses streaming estimates of the gradient to perform variance reduction.

Similarly, a number of stochastic quasi-Newton methods have been proposed. Bordes et al. (2009) propose a variant of stochastic gradient descent that makes use of second order information. Mokhtari and Ribeiro (2014a) analyze the straightforward application of L-BFGS in the stochastic setting and prove a $O(1/k)$ convergence rate in the strongly-convex setting. Byrd et al. (2014) propose a modified version of L-BFGS in the stochastic setting and prove a $O(1/k)$ convergence rate in the strongly-convex setting. Sohl-Dickstein et al. (2014) propose a stochastic quasi-Newton method for minimizing sums of functions by maintaining a separate approximation of the inverse Hessian for each function in the sum. Schraudolph et al. (2007) develop a stochastic version of L-BFGS for the online convex optimization setting. Wang et al. (2014) prove the convergence of various stochastic quasi-Newton methods in the nonconvex setting. Our work differs from the preceding in that we guarantee a linear rate of convergence.

Lucchi et al. (2015) independently propose a variance-reduction procedure to speed up stochastic quasi-Newton methods and to achieve a linear rate of convergence. Their approach to updating the inverse-Hessian approximation is similar to that of L-BFGS, whereas our method leverages Hessian-vector products to stabilize the approximation.

Experimental Results

To probe our theoretical results, we compare Algorithm 1 (SLBFGS) to the stochastic variance-reduced gradient method (SVRG) (Johnson and Zhang, 2013), the stochastic quasi-Newton method (SQN) (Byrd et al., 2014), and stochastic gradient descent (SGD). We evaluate these algorithms on several popular machine learning models, including ridge regression, support vector machines, and matrix completion. Our experiments show the effeciveness of the algorithm on real-world problems that are not neccessarily (strongly) convex.

Because SLBFGS and SVRG require computations of the full gradient, each epoch requires an additional pass through the data. Additionally, SLBFGS and SQN require Hessian-vector-product calculations, each of which is about as expensive as a gradient calculation Pearlmutter (1994). The number of Hessian-vector-product computations per epoch introduced by this is $(b_{H}N)/(bL)$ , which in our experiments is either $N$ or $2N$ . To incorporate these additional costs, our plots show error with respect to the number of passes through the data (that is, the number of gradient or Hessian-vector-product computations divided by $N$ ). For this reason, the first iterations of SLBFGS, SVRG, SQN, and SGD all begin at different times, with SGD appearing first and SLBFGS appearing last.

For all experiments, we set the batch size $b$ to either $20$ or $100$ , we set the Hessian batch size $b_{H}$ to $10b$ or $20b$ , we set the Hessian update interval $L$ to $10$ , we set the memory size $M$ to $10$ , and we set the number of stochastic updates $m$ to $N/b$ . We optimize the learning rate via grid search. SLBFGS and SVRG use a constant step size. For SQN and SGD, we try three different step-size schemes: constant, $1/\sqrt{t}$ , and $1/t$ , and we report the best one. All experiments are initialized with a vector of zeros, except for the matrix completion problem, where in order to break symmetry, we initialize the experiments with a vector of standard normal random variables scaled by $10^{-5}$ .

First, we performed ridge regression on the millionsong dataset (Bertin-Mahieux et al., 2011) consisting of approximately $4.6\times 10^{5}$ data points. We set the regularization parameter $\lambda=10^{-3}$ . In this experiment, both SLBFGS and SVRG rapidly solve the problem to high levels of precision. Second, we trained a support vector machine on RCV1 (Lewis et al., 2004), with approximately $7.8\times 10^{6}$ data points. We set the regularization parameter to $\lambda=0$ . In this experiment, SGD and SQN make more progress initially as expected, but SLBFGS finds a better optimum. Third, we solve a nonconvex matrix completion problem on the Netflix Prize dataset, as formulated in Recht and Ré (2013), with approximately $10^{8}$ data points. We set the regularization parameter to $\lambda=10^{-4}$ . The poor performance of SVRG and SGD on this problem may be accounted for by the fact that the algorithms are initialized near the vector of all zeros, which is a stationary point (though not the optimum). Presumably the use of curvature information helps SLBFGS and SQN escape the neighborhood of the all zeros vector faster than SVRG and SGD.

Figure 1 plots a comparison of these methods on the three problems. For the convex problems, we plot the logarithm of the optimization error with respect to a precomputed reference solution. For the nonconvex problem, we simply plot the objective value as the global optimum is not necessarily known.

In this section, we illustrate that SLBFGS performs well on convex problems for a large range of step sizes. The windows in which SVRG, SQN, and SGD perform well are much narrower. In Figure 2, we plot the performance of SLBFGS, SVRG, SQN, and SGD for ridge regression on the millionsong dataset for step sizes varying over a couple orders of magnitude. In Figure 3, we show a similar plot for a support vector machine on the RCV1 dataset. In both cases, SLBFGS performs well, solving the problem to a high degree of precision over a large range of step sizes, whereas the performance of SVRG, SQN, and SGD degrade much more rapidly with poor step-size choices.

Proofs of Preliminaries

The analysis below closely follows many other analyses of the inverse Hessian approximation used in L-BFGS (Nocedal and Wright, 2006; Byrd et al., 2014; Mokhtari and Ribeiro, 2014a, b), and we include it for completeness.

Note that $s_{j}^{\top}y_{j}=s_{j}\nabla^{2}f_{\mathcal{T}_{j}}(u_{j})s_{j}$ , it follows from Assumption 2 that

Similarly, letting $z_{j}=(\nabla^{2}f_{\mathcal{T}_{j}}(u_{j}))^{1/2}s_{j}$ and noting that

Note that using the Sherman-Morrison-Woodbury formula, we can equivalently write Equation 4 in terms of the Hessian approximation $B_{r}=H_{r}^{-1}$ as

We will begin by bounding the eigenvalues of $B_{r}$ . We will do this indirectly by bounding the trace and determinant of $B_{r}$ . We have

The first equality follows from the linearity of the trace operator. The second equality follows from the fact that $\operatorname{tr}(AB)=\operatorname{tr}(BA)$ . The fourth relation follows from Equation 12. Since

The first equality uses $\det(AB)=\det(A)\det(B)$ . The second equality follows from the identity

by setting $u_{1}=-s_{j}$ , $v_{1}=(B_{r}^{(j-1)}s_{j})/(s_{j}^{\top}B_{r}^{(j-1)}s_{j})$ , $u_{2}=(B_{r}^{(j-1)})^{-1}y_{j}$ , and $v_{2}=y_{j}/(y_{j}^{\top}s_{j})$ . See Dennis and Moré (1977, Lemma 7.6) for a proof, or simply note that Equation 14 follows from two applications of the identity $\det(A+uv^{\top})=(1+v^{\top}A^{-1}u)\det(A)$ when $I+u_{1}v_{1}^{\top}$ is invertible and by continuity when it isn’t. The third equality follows by multiplying the numerator and denominator by $\|s_{j}\|^{2}$ . The fourth relation follows from Equation 11 and from the fact that $s_{j}^{\top}B_{r}^{(j-1)}s_{j}\leq\lambda_{\max}(B_{r}^{(j-1)})\|s_{j}\|^{2}$ . The fifth relation uses the fact that the largest eigenvalue of a positive definite matrix is bounded by its trace. The sixth relation uses the previous bound on $\operatorname{tr}(B_{r}^{(j-1)})$ . Since

2 Proof of Lemma 4

Using Lemma 3 as well as the fact that $H_{r}$ is positive definite, we have

Since we defined $B_{r}=H_{r}^{-1}$ , it follows that

3 Proof of Lemma 6

Define the function $g_{\mathcal{S}}(w)=f_{\mathcal{S}}(w)-f_{\mathcal{S}}(w_{*})-\nabla f_{\mathcal{S}}(w_{*})^{\top}(w-w_{*})$ to get the linearization of $f_{\mathcal{S}}$ around the optimum $w_{*}$ , and note that $g_{\mathcal{S}}$ is minimized at $w_{*}$ . It follows that for any $w$ , we have

Averaging over all possible minibatches $\mathcal{S}\subseteq\{1,\ldots,N\}$ of cardinality $b$ and using the fact that $\nabla f(w_{*})=0$ , we see that

Now, let $\mu_{k}=\nabla f(w_{k})$ and $v_{t}=\nabla f_{\mathcal{S}}(x_{t})-\nabla f_{\mathcal{S}}(w_{k})+\mu_{k}$ . Conditioning on $\mathcal{F}_{k,t}$ and taking an expectation with respect to $\mathcal{S}$ , we find

Discussion

This paper introduces a stochastic version of L-BFGS and proves a linear rate of convergence in the strongly convex case. Theorem 7 captures the qualitatively linear rate of convergence of SLBFGS, which is reflected in our experimental results. We expect SLBFGS to outperform other stochastic first-order methods in poorly conditioned settings where curvature information is valuable as well in settings where we wish to solve the optimization problem to high precision.

There are a number of interesting points to address in future work. The proof of Theorem 7 and many similar proofs used to analyze quasi-Newton methods result in constants that scale poorly with the problem size. At a deeper level, the point of studying quasi-Newton methods is to devise algorithms that lie somewhere along the spectrum from gradient descent to Newton’s method, reaping the computational benefits of gradient descent and the rapid convergence of Newton’s method. Many of the proofs in the literature, including the proof of Theorem 7, bound the extent to which the quasi-Newton method deviates from gradient descent by bounding the extent to which the inverse Hessian approximation deviates from the identity matrix. Those bounds are then used to show that the quasi-Newton method does not perform too much worse than gradient descent. A future avenue of research is to study if stochastic quasi-Newton methods can be designed that provably exhibit superlinear convergence as has been done in the non-stochastic case.