Stochastic Block BFGS: Squeezing More Curvature out of Data

Robert M. Gower, Donald Goldfarb, Peter Richtárik

Introduction

We design a new stochastic variable-metric (quasi-Newton) method—the stochastic block BFGS method—for solving the Empirical Risk Minimization (ERM) problem:

To solve (1), we employ iterative methods of the form

The most successful classical optimization methods fit the format (2), such as gradient descent $\left(H_{t}=I\right)$ , Newton’s method $\left(H_{t}=\nabla^{2}f(x_{t})^{-1}\right)$ , and the quasi-Newton methods $\left(H_{t}\approx\nabla^{2}f(x_{t})^{-1}\right)$ ; all with $g_{t}=\nabla f(x_{t})$ . The difficulty in our setting is that the large number of data points makes the computational costs of a single iteration of these classical methods prohibitively expensive.

To amortize these costs, the current state-of-the-art methods use subsampling, where $g_{t}$ and $H_{t}$ are calculated using only derivatives of the subsampled function

where $S\subseteq[n]\overset{\text{def}}{=}\{1,2,\dots,n\}$ is a subset of examples selected uniformly at random. Using the subsampled gradient $\nabla f_{S}(x)$ as a proxy for the gradient is the basis for (minibatch) stochastic gradient descent (SGD), but also for many successful variance-reduced methods that make use of the subsampled gradient in calculating $g_{t}.$

In particular, when using automatic differentiation techniques or backpropagation on a neural network , evaluating the above costs at most five times as much as the cost of evaluating the subsampled gradient $\nabla f_{T}(x_{t})$ .

Using only a single Hessian-vector product to update $H_{t}$ yields only a very limited amount of curvature information, and thus may result in an ineffective metric matrix. The block BFGS method addresses this issue. The starting point in the development of the block BFGS method is the simple observation that, ideally, we would like the metric matrix $H_{t}$ to satisfy the inverse equation

since then $H_{t}$ would be the inverse of an unbiased estimate of the Hessian. But solving the inverse equation is computationally expensive. So instead, we propose that $H_{t}$ should instead satisfy a sketched version of this equation, namely

In the remainder of the paper we describe the block BFGS update, propose a new limited-memory block BFGS update and introduce several new sketching strategies. We conclude by presenting the results of numerical tests of a method that combines the limited-memory block BFGS update with the SVRG/S2GD method , and demonstrate that our new method yields dramatically better results when compared to SVRG, or the SVRG method coupled with the classical L-BFGS update as proposed in .

This paper makes five main contributions:

We develop a stochastic block BFGS updateIn this paper we use the word “update” to denote metric learning, i.e., an algorithm for updating one positive definite matrix into another. for approximately tracking the inverse of the Hessian of $f$ . This technique is novel in two ways: the update is more flexible than the traditional BFGS update as it works by employing the actions of a subsampled Hessian on a set of random vectors rather than just on a single deterministic vector, as is the case with the standard BFGS. That is, we use block sketches of the subsampled Hessian.

(b) Stochastic block BFGS method.

Our block BFGS update is capable of incorporating enriched second-order information into gradient based stochastic approximation methods for solving problem (1). In this paper we illustrate the power of this strategy in conjunction with the strategy employed in the SVRG method of for computing variance-reduced stochastic gradients. We prove that the resulting combined method is linearly convergent and empirically demonstrate its ability to substantially outperform current state-of-the-art methods.

(c) Limited-memory method.

To make the stochastic block BFGS method applicable to large-scale problems, we devise a new limited-memory variant of it. As is the case for L-BFGS , our limited-memory approach allows for a user-defined amount of memory to be set aside. But unlike L-BFGS, our limited-memory approach allows one to use the available memory to store more recent curvature information, encoded in sketches of previous Hessian matrices. This development of a new limited block BFGS method should also be of general interest to the optimization community.

(d) Factored form.

We develop a limited-memory factored form of the block BFGS update. While factorized versions of standard quasi-Newton methods were proposed in the 1970’s (e.g., see ), as far as we know, no efficient limited memory versions of such methods have been developed. Factored forms are important as they can be used to enforce positive definiteness of the metric even in the presence numerical imprecision. Furthermore, we use the factored form in calculating new sketching matrices.

(e) Adaptive sketching.

Not only can sketching be used to tackle the large dimensions of the Hessian, but it can also simultaneously precondition the inverse equation (4). We present a self-conditioning (i.e., adaptive) sketching that makes use of the efficient factored form of the block BFGS method developed earlier. We also present a sketching approach based on using previous search directions. Our numerical tests show that adaptive sketching can in practice lead to a significant speedup in comparison with sketching from a fixed distribution.

2 Background and Related Work

The first stochastic variable-metric method developed that makes use of subsampling was the online L-BFGS method . In this work the authors adapt the L-BFGS method to make use of subsampled gradients, among other empirically verified improvements. The regularized BFGS method also makes use of stochastic gradients, and further modifies the BFGS update by adding a regularizer to the metric matrix.

The first method to use subsampled Hessian-vector products in the BFGS update, as opposed to using differences of stochastic gradients, was the SQN method . Recently, propose combining SQN with SVRG. The resulting method performs very well in numerical tests. In our work we combine a novel stochastic block BFGS update with SVRG, and prove linear convergence. The resulting method is more versatile and superior in practice to SQN as it can capture more useful curvature information.

The update formula, that we refer to as the block BFGS update, has a rather interesting background. The formula first appeared in 1983 in unpublished work by Schnabel on designing quasi-Newton methods that make use of multiple secant equations. Schnabel’s method requires several modifications that stem from the lack of symmetry and positive definiteness of the resulting update. Later, and completely independently, the block BFGS update appears in the domain decomposition literature as a preconditioner, where it is referred to as the balancing preconditioner. In that work, the motivation and derivation are very different from those used in the quasi-Newton literature; for instance, no variational interpretation is given for the method. The balancing preconditioner was subsequently taken out of the PDE context and tested as a general purpose preconditioner for solving a single linear system and systems with a changing right hand side . Furthermore, in the authors present a factored form of the update in a different context, which we adapt for limited-memory implementation. Finally, and again independently, a family of block quasi-Newton methods that includes the block BFGS is presented in through a variational formulation and in using Bayesian inference.

Stochastic Block BFGS Update

The stochastic block BFGS update, applied to $H_{t-1}$ to obtain $H_{t}$ , is defined by the projection

where $\Delta_{t}\overset{\text{def}}{=}(D_{t}^{T}Y_{t})^{-1}$ and $Y_{t}\overset{\text{def}}{=}\nabla^{2}f_{T}(x_{t})D_{t}.$ This solution was given in and in for multiple secant equations.

We take this opportunity to point out that stochastic block BFGS can generate the metric used in the Stochastic Dual Newton Ascent (SDNA) method as a special case. Indeed, when $H_{t-1}=0$ , then $H_{t}$ is given by

When the sketching matrix $D_{t}$ is a random column submatrix of the identity, then (7) is the positive semidefinite matrix used in calculating the iterates of the SDNA method . However, SDNA operates in the dual of (1).

Stochastic Block BFGS Method

The goal of this paper is to design a method that uses a low-variance estimate of the gradient, but also gradually incorporates enriched curvature information. To this end, we propose to combine the stochastic variance reduced gradient (SVRG) approach with our novel stochastic block BFGS update, described in the previous section. The resulting method is Algorithm 1.

To form the sketching matrix $D_{t}$ we employ one of the following three strategies:

$D_{t}$ has standard Gaussian entries sampled i.i.d at each iteration.

(b) Previous search directions delayed.

Let us write $d_{t}=-H_{t}g_{t}$ for the search direction used in step $t$ of the method. In this strategy we store $L$ such search directions as columns of matrix $D_{t}$ : $D_{t}=[d_{t+1-L},\ldots,d_{t}]$ , and then update $H_{t}$ only once every $L$ inner iterations.

(c) Self-conditioning.

We sample $C_{t}\subseteq[d]$ uniformly at random and set $D_{t}=L_{t-1}I_{:C_{t}}=[L_{t-1}]_{:C_{t}}$ , where $L_{t-1}L_{t-1}^{T}=H_{t-1}$ and $I_{:C_{t}}$ denotes the concatenation of the columns of the identity matrix indexed by a set $C_{t}\subset[d]$ . Thus the sketching matrix is formed with a random subset of columns of a factored form of $H_{t}.$ The idea behind this strategy is that the ideal sketching matrix should be $D_{t}=(\nabla^{2}f_{T}(x_{t}))^{-1/2}$ so that the sketch not only compresses but also acts as a preconditioner on the inverse equation (4). It was also shown in that this choice of sketching matrix can accelerate the convergence of $H_{t}$ to the inverse of a fixed matrix. In Section 3.2 we describe in detail how to efficiently maintain and update the factored form.

1 Limited-Memory Block BFGS

When $d$ is large, we cannot store the $d\times d$ matrix $H_{t}$ . Instead, we store $M$ block triples, consisting of previous block curvature pairs and the inverse of their products

With these triples we can form the $H_{t}$ operator implicitly by using a block limited-memory two loop recurrence. To describe this two loop recurrence, let $V_{t}\overset{\text{def}}{=}I-D_{t}\Delta_{t}Y_{t}^{T}.$

The block BFGS update (6) with memory parameter $M$ can be expanded as a function of the $M$ block triples (8) and of $H_{t-M}$ as

Since we do not store $H_{t}$ for any $t$ , we do not have access to $H_{t-M}$ . In our experiments we simply set $H_{t-M}=I$ (the identity matrix). Other, more sophisticated, choices are possible, but we do not explore them further here. Using the above expansion, the action of the operator $H_{t}$ on a vector $v$ can be efficiently calculated using Algorithm 2.

The total cost in floating point operations of executing Algorithm 2 is $Mq(4d+2q).$ In our experiments $M=5$ and $q$ will be orders of magnitude less than $d$ , typically $q\leq\sqrt{d}$ . Thus the cost of applying Algorithm 2 is approximately $O(d^{3/2}).$ This does not include the cost of computing the product ${D_{t}}^{T}Y_{t}$ ( $O(q^{2}d)$ operations) and its Cholesky factorization ( $O(q^{3})$ operations), which is done outside of Algorithm 2. The two places in Algorithm 2 where multiplication by $\Delta_{i}$ is indicated is in practice performed by solving two triangular systems using the Cholesky factor of ${D_{i}}^{T}Y_{i}$ . We do this because it is more numerically stable than explicitly calculating the inverse matrix $\Delta_{i}.$

2 Factored Form

Here we develop a new efficient method for maintaining and updating a factored form of the metric matrix. This facilitates the development of a novel idea which we call self-conditioning sketch.

This factored form of $H_{t}$ is too costly to compute because it requires inverting $L_{t-1}$ . However, if we let $D_{t}=L_{t-1}I_{:C_{t}}$ , where $C_{t}\subset[d]$ , then (9) reduces to

which can be computed efficiently. Furthermore, this update of the factored form (10) is amenable to recursion and can thus be expanded as

Again, we can implement a more numerically stable version of Algorithm 3 by storing the Cholesky factor of $D_{i}^{T}Y_{i}$ and using triangular solves, as opposed to calculating the inverse matrix $\Delta_{i}=(D_{i}^{T}Y_{i})^{-1}$ .

Convergence

In this section we prove that Algorithm 1 converges linearly. Our analysis relies on the following assumption, and is a combination of novel insights and techniques from and .

There exist constants $0<\lambda\leq\Lambda$ such that

We need two technical lemmas, whose proofs are given in Sections 7 and 8 at the end of the paper.

Let $H_{t}$ be the result of applying the limited-memory Block BFGS update with memory $M$ , as implicitly defined by Algorithm 2. Then there exists positive constants $\Gamma\geq\gamma>0$ such that for all $t$ we have

A proof of this lemma is given in Appendix 2, where in particular it is shown that the lower bound satisfies $\gamma~{}\geq~{}\frac{1}{1+M\Lambda}$ and the upper bound satisfies

where $\kappa~{}\overset{\text{def}}{=}~{}\Lambda/\lambda.$

We now state a bound on the norm of the SVRG variance-reduced gradient for minibatches.

The following theorem guarantees the linear convergence of Algorithm 1.

Suppose that Assumption 1 holds. Let $w_{*}$ be the unique minimizer of $f$ . When Option II is used in Algorithm 1, we have for all $k\geq 0$ that

assuming we have chosen $\eta<\gamma\lambda/(2\Gamma^{2}\Lambda^{2})$ and that we choose $m$ large enough to satisfyBy our assumption on $\eta$ , the expression on the right is nonnegative.

Proof. Since $g_{t}=\nabla f_{S_{t}}(x_{t})-\nabla f_{S_{t}}(w_{k})+\mu$ and $x_{t+1}=x_{t}-\eta H_{t}g_{t}$ in Algorithm 1, from (12) we have that

Taking expectation conditioned on $x_{t}$ (i.e., with respect to $S_{t}$ , $T_{t}$ and $D_{t}$ ) and using Lemma 1 we have

where $\alpha=2\eta\left(\gamma\lambda-\eta\Gamma^{2}\Lambda^{2}\right)$ and $\beta=2\eta^{2}\Gamma^{2}\Lambda(\Lambda-\lambda)$ . Taking expectation, summing over $t=0,\ldots,m-1$ and using telescopic cancellation gives

where we used that $f(w_{*})\leq f(x_{m})$ . Using that $\eta~{}<~{}\gamma\lambda/(2\Gamma^{2}\Lambda^{2}),$ it follows that

Numerical Experiments

To validate our approach, we compared our algorithm to SVRG and the variable-metric adaption of SVRG presented in , which we refer to as the MNJ method. We tested the methods on seven empirical risk minimization problems with a logistic loss and L2 regularizer, that is, we solved

We tested three variants of Algorithm 1, each specified by the use of a different sketching matrix. In Table 1 we present a key to the abbreviations used in all our figures. The first three methods, gauss, prev and fact, are implementations of the three variants (a), (b) and (c), respectively, of Algorithm 1 using the three different sketching methods discussed at the start of Section 3. In these three methods the $M$ stands for the number of stored curvature pairs (8) used.

For each method and a given parameter choice, we tried the stepsizes

and reported the one that gave the best results. We used the error $f(x_{t})-f(w_{*})$ for the $y$ -axis in all our figuresThanks to Mark Schmidt, whose code prettyPlot was used to generate all figures: https://www.cs.ubc.ca/~schmidtm/Software/prettyPlot.html. Note that in all plots the markers (circles, plus signs, triangles … etc) are used to help distinguish the different methods, and are not related to the iteration count.. We calculated $f(w_{*})$ by running all methods for 30 passes over the data, and then taking the minimum function value.

Finally, we used $m=\lfloor n/|S_{t}|\rfloor$ for the number of inner iterations in all variants of Algorithm 1, SVRG and MNJ, so that all methods perform an entire pass over the data before recalculating the gradient.

In our first set of tests we explored the parameter space of the prev variant of Algorithm 1. By fixing all parameters but one, we can see how sensitive the prev method is to the singled-out parameter, but also, build some intuition as to what value or, interval of values, yields the best results for this parameter. We focused our tests on the prev method since it proved to be overall, the most robust method.

In Figure 1 (a) we depict the results of varying the memory parameter $M$ , while fixing the remaining parameters. In particular, we fixed $|T_{t}|=|S_{t}|=15.$ In both the subplots of error $\times$ datapasses and error $\times$ time in Figure 1 (a), we see that $1\leq M\leq 4$ resulted in the best performance. Furthermore, the error $\times$ datapasses is insensitive to increasing the memory, since increasing the memory parameter does not incur in any additional data passes. On larger dimensional problems we found that approximately $M=5$ yielded the overall best performance. Thus we used $M=5$ in our tests on large-scale problems in Sections 5.2 and 5.3.

In Figure 1 (b) we experimented with varying $|T_{t}|$ , the size of the Hessian subsampling. Here the results of the error $\times$ datapasses subplot conflict with those of the the error $\times$ time subplots. While in the error $\times$ time subplot the method improves as $|T_{t}|$ increases, in the error $\times$ datapasses subplot the method improves as $|T_{t}|$ decreases. As a compromise, in Sections 5.2 and 5.3 we use $|S_{t}|=|T_{t}|$ as our default choice.

In Figure 2 (a) we experimented with varying the size of gradient subsampling and Hessian subsampling jointly with $|S_{t}|=|T_{t}|.$ In both the error $\times$ datapasses and the error $\times$ time subplot the range

resulted in a good performance. Based on this experiment, we use $|S_{t}|=|T_{t}|=n^{1/2}$ as our default choice in Sections 5.2 and 5.3. Note that when $|S_{t}|$ is large the method passes through the data in fewer iterations and consequentially less time. This is why the method terminated early (in time) when $|S_{t}|$ is large.

Finally in Figure 2 (b) we vary the parameter $L$ , that is, the number of previous search directions used to form the columns of $D_{t}$ . From these tests we can see that using a value that is too small, e.g. $L=1$ , or a value that is too large $L=2\lceil\sqrt{d}\rceil=24$ , results in an inefficient method in terms of both time and datapasses. Instead, we get the best performance when $\lceil d^{1/4}\rceil\leq L\leq\lceil d^{1/2}\rceil.$ Thus on the first five problems we used either $L=d^{1/3}$ or $L=d^{1/4}$ , depending on which gave the best performance. As for the last two problems: rcv1-train.binary and url-combined, we found that $L=d^{1/4}$ was too large. Instead, we probed the set $L\in\{2,\ldots,10\}$ for an $L$ that resulted in a reasonable performance.

Through these four experiments in Figures 1 and 2, we can conclude that the prev method is not overly sensitive to the choice of these parameters. That is, the method works well for a range of parameter choices. This is in contrast with choosing the stepsize parameter, whose choice can make the difference between a divergent method and a fast method.

2 Data passes

We now compare all the methods in Table 1 in terms of error $\times$ datapasses. Since our experiment in Figure 1 (b) indicated that $|S_{t}|=|T_{t}|$ resulted in a reasonable method, for simplicity, we used the same subsampling for the gradient and Hessian in all of our methods; that is $S_{t}=T_{t}.$ This is not necessarily an optimal choice. Furthermore, we set the subsampling size $|S_{t}|=\sqrt{n}$ .

For the MNJ method, we used the suggestion of the authors of both and , and chose $L=10$ and $|T_{t}|\approx L|S_{t}|$ so that the computational workload performed by $L$ inner iterations of the SVRG method was approximately equal to that of applying the L-BFGS metric once. The exact rule we found to be efficient was

We set the memory to $10$ for the MNJ method in all tests, which is a standard choice for the L-BFGS method.

On the problems with $d$ significantly smaller then $n$ , such as Figures 1(c) and 1(d), all the methods that make use of curvature information performed similarly and significantly better than the SVRG method. The prev method proved to be the best overall and the most robust, performing comparably well on problems with $d\ll n$ , such as Figures 1(c) and 1(d), but was also the most efficient method in Figures 1(a), 1(b), 1(e) (shared being most efficient with MNJ) and Figure 2(a). The only problem on which the prev method was not the most efficient method was on the url-combined problem in Figure 2(b), where the MNJ method proved to be the most efficient.

Overall, these experiments illustrate that incorporating curvature information results in a fast and robust method. Moreover, the added flexibility of the block BFGS update to incorporate more curvature information, as compared to a single Hessian-vector product in the MNJ method, can significantly improve the convergence of the method, as can be seen in Figures 1(a) and 1(b).

3 Timed

In Figures 5 and 6, we compare the evolution of the error over time for each method. While measuring time is implementation and machine dependent, we include these time plots as to provide further insight into the methods performance. Note that we did not use any sophisticated implementation tricks such as “lazy” gradient updates , but instead implemented each method as originally designed so that the methods can be compared on a equal footing.

The results in these tests corroborate with our conclusions in Section 5.2

Extensions

This work opens up many new research avenues. For instance, sketching techniques are increasingly successful tools in large scale machine learning, numerical linear algebra, optimization and computer science. Therefore, one could employ a number of new sketching methods in the block BFGS method, such as the Walsh-Hadamard matrix . Using new sophisticated sketching methods, combined with the block BFGS update, could result in even more efficient and accurate estimates of the underlying curvature.

Viewed in terms of sketching, the MNJ method and our Stochastic Block BFGS method with prev sketching follow opposite strategies. While the MNJ method sketches the Hessian matrix with a single vector $v$ , where $v$ is the average over a combination of previous search directions (a very coarse approximation), our prev variant uses all previous search directions to form the sketching matrix $D_{t}$ (a finer approximation). Deciding between these two extremes or how to combine them could be done adaptively by examining the curvature matrix $D_{t}\nabla f(x_{t})D_{t}.$ When the previous search directions are almost collinear, $D_{t}\nabla f(x_{t})D_{t}$ becomes ill-conditioned. In this case one should form matrix $D_{t}$ with less columns using fewer or a coarser combination of previous search directions, while if $D_{t}\nabla f(x_{t})D_{t}$ is well-conditioned, one could use more or a finer combination of previous search directions.

While in this work we have for simplicity focused on utilizing our metric learning techniques in conjunction with SVRG, they can be used with other optimization algorithms as well, including SGD , SDCA and more.

Proof of Lemma 2

Minimizing the right hand side of the above in $b$ gives

Re-arranging the above and switching back to $f$ , we have

Recalling the notation $\delta_{f}(x)\overset{\text{def}}{=}f(x)-f(w_{*})$ and taking expectation with respect to $S$ gives

Proof of Lemma 1

To simplify notation, we define $G\overset{\text{def}}{=}\nabla^{2}f_{T}(x_{t})$ , $\Delta=\Delta_{t}$ , $Y=Y_{t}$ , $H=H_{t-1}$ , $H^{+}=H_{t}$ , $B=H^{-1}$ , $B^{+}=(H^{+})^{-1}$ and $V=Y\Delta D^{T}$ . Thus, the block BFGS update can be written as

Proposition 2.2 in proves that so long as $G$ and $H$ (and hence $B$ ) are positive definite and $D$ has full rank, then $H^{+}$ is positive definite and non-singular, and consequently, $B^{+}$ is well defined and positive definite. Using the Sherman-Morrison-Woodbury identity the update formula for $B^{+}$ , as shown in the Appendix in , is given by

We will now bound $\lambda_{\max}(H^{+})=\|H^{+}\|_{2}$ from above and $\lambda_{\min}(H^{+})=1/\|B^{+}\|_{2}$ from below.

Let $C=BD(D^{{}^{T}}BD)^{-1}D^{{}^{T}}B$ . Then since $C\succeq 0$ , $B-C\preceq B$ and hence, $\|B-C\|_{2}\leq\|B\|$ and

Now, letting $G^{\frac{1}{2}}$ and $G^{-\frac{1}{2}}$ denote the unique square root of $G$ and its inverse, and defining $U=G^{\frac{1}{2}}D$ , we have

where $P=U(U^{{}^{T}}U)^{-1}U^{{}^{T}}$ is an orthogonal projection matrix. Moreover, it is easy to see that

Since $\|MN\|_{2}\leq\|M\|_{2}\|N\|_{2}$ and $\|P\|_{2}=1$ , we have $\|D\Delta D^{{}^{T}}\|_{2}\leq\|G^{-1}\|_{2}$ , $\|Y\Delta Y^{{}^{T}}\|_{2}\leq\|G\|_{2}$ and $\|Y\Delta D^{{}^{T}}\|_{2}\leq\|G^{-\frac{1}{2}}\|_{2}\|G^{\frac{1}{2}}\|_{2}$ .

where $\kappa=\Lambda/\lambda$ and $\alpha=(1+\sqrt{\kappa})^{2}$ .

Since we use a memory of $M$ block triples $(D_{i},Y_{i},\Delta_{i})$ , and the metric matrix $H_{t}$ is the result of applying, at most, $M$ block updates BFGS (6) to $H_{0}$ , we have that

Finally, since $\alpha=(1+\sqrt{\kappa})^{2}$ , we have

The bound (14) now follows by observing that $H_{t-M}=I.$