A Stochastic Quasi-Newton Method for Large-Scale Optimization

R. H. Byrd, S. L. Hansen, J. Nocedal, Y. Singer

Introduction

In many applications of machine learning, one constructs very large models from massive amounts of training data. Learning such models imposes high computational and memory demands on the optimization algorithms employed to learn the models. In some applications, a full-batch (sample average approximation) approach is feasible and appropriate. However, in most large scale learning problems, it is imperative to employ stochastic approximation algorithms that update the prediction model based on a relatively small subset of the training data. These algorithms are particularly suited for settings where data is perpetually streamed to the learning process; examples include computer network traffic, web search, online advertisement, and sensor networks.

The goal of this paper is to propose a quasi-Newton method that operates in the stochastic approximation regime. We employ the well known limited memory BFGS updating formula, and show how to collect second-order information that is reliable enough to produce stable and productive Hessian approximations. The key is to compute average curvature estimates at regular intervals using (sub-sampled) Hessian-vector products. This ensures sample uniformity and avoids the potentially harmful effects of differencing noisy gradients.

The problem under consideration is the minimization of a convex stochastic function,

where $\xi$ is a random variable. Although problem (1.1) arises in other settings, such as simulation optimization , we assume for concreteness that $\xi$ is a random instance consisting of an input-output pair $(x,z)$ . The vector $x$ is typically referred to in machine learning as the input representation while $z$ as the target output. In this setting, $f$ typically takes the form

In learning applications with very large amounts of training data, it is common to use a mini-batch stochastic gradient based on $b\stackrel{{\scriptstyle\triangle}}{{=}}|{\cal S}|\ll N$ input-output instances, yielding the following estimate

The subset ${\cal S}$ $\subset\{1,2,\cdots,N\}$ is randomly chosen, with $b$ sufficiently small so that the algorithm operates in the stochastic approximation regime. Therefore, the stochastic estimates of the gradient are substantially faster to compute than a gradient based on the entire training set.

Our optimization method employs iterations of the form

where $B_{k}$ is a symmetric positive definite approximation to the Hessian matrix $\nabla^{2}F(w)$ , and $\alpha^{k}>0$ . Since the stochastic gradient is not an accurate approximation to the gradient of (1.3) it is essential (to guarantee convergence) that the steplength parameter $\alpha^{k}\rightarrow 0$ . In our experiments and analysis, $\alpha^{k}$ has the form $\alpha^{k}=\beta/k$ , where $\beta>0$ is given, but other choices can be employed.

A critical question is how to construct the Hessian approximation in a stable and efficient manner. For the algorithm to be scalable, it must update the inverse matrix $H_{k}=B_{k}^{-1}$ directly, so that (1.5) can be implemented as

Furthermore, this step computation should require only $O(n)$ operations, as in limited memory quasi-Newton methods for deterministic optimization.

If we set $H_{k}=I$ and $\alpha^{k}=\beta/k$ in (1.6), we recover the classical Robbins-Monro method , which is also called the stochastic gradient descent method. Under standard convexity assumptions, the number of iterations needed by this method to compute an $\epsilon$ -accurate solution is of order ${n\nu\kappa^{2}}/{\epsilon}\,,$ where $\kappa$ is the condition number of the Hessian at the optimal solution, $\nabla^{2}F(w^{*})$ , and $\nu$ is a parameter that depends on both the Hessian matrix and the gradient covariance matrix; see . Therefore, the stochastic gradient descent method is adversely affected by ill conditioning in the Hessian. In contrast, it is shown by Murata that setting $H_{k}=\nabla^{2}F(w^{*})^{-1}$ in (1.6) completely removes the dependency on $\kappa$ from the complexity estimate. Although the choice $H_{k}=\nabla^{2}F(w^{*})^{-1}$ is not viable in practice, it suggests that an appropriate choice of $H_{k}$ may result in an algorithm that improves upon the stochastic gradient descent method.

In the next section, we present a stochastic quasi-Newton method of the form (1.6) that is designed for large-scale applications. It employs the limited memory BFGS update, which is defined in terms of correction pairs $(s,y)$ that provide an estimate of the curvature of the objective function $F(w)$ along the most recently generated directions. We propose an efficient way of defining these correction pairs that yields curvature estimates that are not corrupted by the effect of differencing the noise in the gradients. Our numerical experiments using problems arising in machine learning, suggest that the new method is robust and efficient.

The paper is organized into 6 sections. The new algorithm is presented in section 2, and its convergence properties are discussed in section 3. Numerical experiments that illustrate the practical performance of the algorithm are reported in section 4. A literature survey on related stochastic quasi-Newton methods is given in section 5. The paper concludes in section 6 with some remarks about the contributions of the paper.

Notation. The terms Robbins-Monro method, stochastic approximation (SA) method, and stochastic gradient descent (SGD) method are used in the literature to denote (essentially) the same algorithm. The first term is common in statistics, the second term is popular in the stochastic programming literature, and the acronym SGD is standard in machine learning. We will use the name stochastic gradient descent method (SGD) in the discussion that follows.

A stochastic quasi-Newton method

The success of quasi-Newton methods for deterministic optimization lies in the fact that they construct curvature information during the course of the optimization process, and this information is good enough to endow the iteration with a superlinear rate of convergence. In the classical BFGS method for minimizing a deterministic function $F(w)$ , the new inverse approximation $H_{k+1}$ is uniquely determined by the previous approximation $H_{k}$ and the correction pairs

This BFGS update is well defined as long as the curvature condition $y_{k}^{T}s_{k}>0$ is satisfied, which is always the case when $F(w)$ is strictly convex.

For large scale applications, it is necessary to employ a limited memory variant that is scalable in the number of variables, but enjoys only a linear rate of convergence. This so-called L-BFGS method is considered generally superior to the steepest descent method for deterministic optimization: it produces well scaled and productive search directions that yield an approximate solution in fewer iterations and function evaluations.

When extending the concept of limited memory quasi-Newton updating to the stochastic approximation regime it is not advisable to mimic the classical approach for deterministic optimization and update the model based on information from only one iteration. This is because quasi-Newton updating is inherently an overwriting process rather than an averaging process, and therefore the vector $y$ must reflect the action of the Hessian of the entire objective $F$ given in (1.1) — something that is not achieved by differencing stochastic gradients (1.4) based on small samples.

The displacement $s$ can be computed based on a collection of average iterates. Assuming that new curvature estimates are calculated every $L$ iterations, we define $s_{t}$ as the difference of disjoint averages between the $2L$ most recent iterations:

(and $\bar{w}_{t-1}$ is defined similarly). In order to avoid the potential harmful effects of gradient differencing when $\|s_{t}\|$ is small, we chose to compute $y_{t}$ via a Hessian vector product,

i.e., by approximating differences in gradients via a first order Taylor expansion, where $\widehat{\nabla}^{2}F$ is a sub-sampled Hessian defined as follows. Let ${\cal S}_{H}\subset\{1,\cdots,N\}$ be a randomly chosen subset of the training examples and let

where $b_{H}$ is the cardinality of ${\cal S}_{H}$ .

We emphasize that the matrix $\widehat{\nabla}^{2}F(\bar{w}_{t})$ is never constructed explicitly when computing $y_{t}$ in (2.2), rather, the Hessian-vector product can be coded directly. To provide useful curvature information, ${\cal S}_{H}$ should be relatively large (see section 4), regardless of the size of $b$ .

The pseudocode of the complete method is given in Algorithm 1.

Using the averages (2.1) is not essential. One could also define $s$ to be the difference between two recent iterates.

The L-BFGS step computation in Step 10 follows standard practice . Having chosen a memory parameter $M$ , the matrix $H_{t}$ is defined as the result of applying $M$ BFGS updates to an initial matrix using the $M$ most recent correction pairs $\{s_{j},y_{j}\}_{j=t-M+1}^{t}$ computed by the algorithm. This procedure is mathematically described as follows.

In practice, the quasi-Newton matrix $H_{t}$ is not formed explicitly; to compute the product $H_{t}\widehat{\nabla}F(w^{k})$ in Step 10 of Algorithm 1 one employs a formula based on the structure of the 2-rank BFGS update. This formula, commonly called the two-loop recursion, computes the step directly from the correction pairs and stochastic gradient as described in [17, §7.2].

In summary, the algorithm builds upon the strengths of BFGS updating, but deviates from the classical method in that the correction pairs $(s,y)$ are based on sub-sampled Hessian-vector products computed at regularly spaced intervals, which amortize their cost. Our task in the remainder of the paper is to argue that even with the extra computational cost of Hessian-vector products (2.2) and the extra cost of computing the iteration (1.6), the stochastic quasi-Newton method is competitive with the SGD method in terms of computing time (even in the early stages of the optimization), and is able to find a lower objective value.

Let us compare the cost of the stochastic gradient descent method

The quasi-Newton matrix-vector product in (2.6) requires approximately $4Mn$ operations . To measure the cost of the gradient and Hessian-vector computations, let us consider one particular but representative example, namely the binary classification test problem tested in section 4; see (4.1). In this case, the component function $f$ in (1.2) is given by

The gradient and Hessian-vector product of $f$ are given by,

The evaluation of the function $c(w;x_{i})$ requires approximately $n$ operations (where we follow the convention of counting a multiplication and an addition as an operation). Therefore, by (2.8) the cost of evaluating one batch gradient is approximately $2{b}n$ , and the cost of computing the Hessian-vector product $\widehat{\nabla}^{2}F(\bar{w}_{t})s_{t}$ is about $3{b_{H}}n$ . This assumes these two vectors are computed independently. If the Hessian is computed at the same point where we compute a gradient and ${b}\geq{b_{H}}$ then $c(w;x_{i})$ can be reused for a savings of ${b_{H}}n$ .

Therefore, for binary logistic problems the total number of floating point operations of the stochastic quasi-Newton iteration (2.6) is approximately

On the other hand, the cost associated with the computation of the SGD step is only $bn$ . At first glance it may appear that the SQN method is prohibitively expensive, but this is not the case when using the values for ${b}$ , ${b_{H}}$ , ${L}$ and ${M}$ suggested in this paper. To see this, note that

In the experiments reported below, we use ${M}=5$ , ${b}=50,100,\ldots,$ ${L}=10$ or 20, and choose ${b_{H}}\geq 300$ . For such parameter settings, the additional cost of the SQN iteration is small relative to the cost of the SGD method.

For the multi-class logistic regression problem described in section 4.3, the costs of gradient and Hessian-vector products are slightly different. Nevertheless, the relative cost of the SQN and SGD iterations is similar to that given in (2.11).

The quasi-Newton method can take advantage of parallelism. Instead of employing the two-loop recursion mentioned above to implement the limited memory BFGS step computation in step 10 of Algorithm 1, we can employ the compact form of limited memory BFGS updating [17, §7.2] in which $H_{t}$ is represented as the outer product of two matrices. This computation can be parallelized and its effective cost is around $3n$ operations, which is smaller than the $4{M}n$ operations assumed above. The precise cost of parallelizing the compact form computation depends on the computer architecture, but is in any case independent of ${M}$ .

Additionally, the Hessian-vector products can be computed in parallel with the main iteration (2.6) if we allow freedom in the choice of the point $\bar{w}_{t}$ where (2.2) is computed. The choice of this point is not delicate since it suffices to estimate the average curvature of the problem around the current iterate, and hence the computation of (2.2) can lag behind the main iteration. In such a parallel setting, the computational overhead of Hessian-vector products may be negligible.

The SQN method contains several parameters, and we provide the following guidelines on how to select them. First, the minibatch size $b$ is often dictated by the experimental set-up or the computing environment, and we view it as an exogenous parameter. A key requirement in the design of our algorithm is that it should work well with any value of $b$ . Given $b$ , our experiments show that it is most efficient if the per-iteration cost of updating, namely $b_{H}/L$ , is less than the cost of the stochastic gradient $b$ , with the ratio $Lb/b_{H}$ in the range . The choice of the parameter $M$ in L-BFGS updating is similar as in deterministic optimization; the best value is problem dependent but values in the range are commonly used.

Convergence Analysis

In this paper, we do not specify the precise mechanism by which the strong convexity is ensured. The assumptions made in our analysis are summarized as follows.

(1) The objective function $F$ is twice continuously differentiable.

(2) There exist positive constants $\lambda$ and $\Lambda$ such that

We first show that the Hessian approximations generated by the SQN method have eigenvalues that are uniformly bounded above and away from zero.

If Assumptions 1 hold, there exist constants $0<\mu_{1}\leq\mu_{2}$ such that the Hessian approximations $\{H_{t}\}$ generated by Algorithm 1 satisfy

Proof: Instead of analyzing the inverse Hessian approximation $H_{k}$ , we will study the direct Hessian approximation $B_{k}$ (see (1.5)) because this allows us to easily quote a result from the literature. In this case, the limited memory quasi-Newton updating formula is given as follows:

and since $\widehat{\nabla}^{2}F(\bar{w}_{t})$ is symmetric and positive definite, it has a square root so that

This proves that the eigenvalues of the matrices $B_{t}^{(0)}=\frac{y_{t}^{T}y_{t}}{{s}_{t}^{T}{y}_{t}}I$ at the start of the L-BFGS update cycles are bounded above and away from zero, for all $t$ .

for some positive constant $M_{3}$ . This implies that the largest eigenvalue of all matrices $B_{t}$ is bounded uniformly.

We now derive an expression for the determinant of $B_{t}$ . It is shown by Powell that

Since by (3.9) the largest eigenvalue of $B_{t}^{(i)}$ is less than $M_{3}$ , we have, using (3.7) and the fact that the smallest eigenvalue of $B_{t}^{(0)}$ is bounded away from zero,

for some positive constant $M_{4}$ . This shows that the smallest eigenvalue of the matrices $B_{t}$ is bounded away from zero, uniformly. Therefore, condition (3.4) is satisfied. $\Box$

Our next task is to establish global convergence. Rather than proving this result just for our SQN method, we analyze a more general iteration that covers it as a special case. We do so because the more general result is of interest beyond this paper and we are unaware of a self-contained proof of this result in the literature (c.f. ).

when applied to a strongly convex objective function $F(w)$ . (As in (1.2), we used the notation $\xi=(x,z)$ .) We assume that the eigenvalues of $\{H_{k}\}$ are uniformly bounded above and away from zero, and that $E_{\xi^{k}}[\nabla f(w^{k},\xi^{k})]=F(w^{k})$ . Clearly Algorithm 1 is a special case of (3.12) in which $H_{k}$ is constant for $L$ iterations.

We make the following assumptions about the functions $f$ and $F$ .

(1) $F(w)$ is twice continuously differentiable.

For convenience, we define $\alpha^{k}=\beta/k$ , for an appropriate choice of $\beta$ , rather than assuming well known and more general conditions $\sum\alpha^{k}=\infty$ , $\sum(\alpha^{k})^{2}<\infty$ . This allows us to provide a short proof similar to the analysis of Nemirovski et al. .

Suppose that Assumptions 2 hold. Let $w^{k}$ be the iterates generated by the Newton-like method (3.12), where for $k=1,2,\ldots$

Taking the expectation over all possible values of $\xi^{k}$ and recalling (3.14) gives

where the second inequality follows from the fact that $\hat{v}=w^{k}-\frac{1}{\lambda}\nabla F(w^{k})$ minimizes the quadratic $q_{k}(v)=F(w^{k})+\nabla F(w^{k})^{T}(v-w^{k})+\textstyle\frac{\lambda}{2}||v-w^{k}||^{2}.$ Setting $v=w^{\ast}$ in (3.19) yields

Let us define $\phi_{k}$ to be the expectation of $F(w^{k})-F(w^{\ast})$ over all choices $\{\xi^{1},\xi^{2},\ldots,\xi^{k-1}\}$ starting at $w^{1}$ , which we write as

We prove the desired result (3.16) by induction. The result clearly holds for $k=1$ . Assuming it holds for some value of $k$ , inequality (3.22), definition (3.17), and the choice of $\alpha^{k}$ imply

We can now establish the convergence result.

Suppose that Assumptions 1 and the bound (3.14) hold. Let $\{w^{k}\}$ be the iterates generated by Algorithm 1. Then there is a constant $\mu_{1}$ such that $\|H_{k}^{-1}\|\leq{1/\mu_{1}}$ , and if the steplength is chosen by

Proof: Lemma 3.1 ensures that the Hessian approximation satisfies (3.4). Now, the iteration in Step 10 of Algorithm 1 is a special case of iteration (3.12). Therefore, the result follows from Theorem 3.2. $\Box$

Numerical Experiments

In this section, we compare the performance of the stochastic gradient descent (SGD) method (2.5) and the stochastic quasi-Newton (SQN) method (2.6) on three test problems of the form (1.2)-(1.3) arising in supervised machine learning. The parameter $\beta>0$ is fixed at the beginning of each run, as discussed below, and the SQN method is implemented as described in Algorithm 1.

It is well known amongst the optimization and machine learning communities that the SGD method can be improved by choosing the parameter $\beta$ via a set of problem dependent heuristics . In some cases, $\beta_{k}$ (rather than $\beta$ ) is made to vary during the course of the iteration, and could even be chosen so that $\beta_{k}/k$ is constant, in which case only convergence to a neighborhood of the solution is guaranteed . There is, however, no generally accepted rule for choosing $\beta_{k}$ , so our testing approach is to consider the simple strategy of selecting the (constant) $\beta$ so as to give good performance for each problem.

Specifically, in the experiments reported below, we tried several values for $\beta$ in (2.5) and (2.6) and chose a value for which increasing or decreasing it by a fixed increment results in inferior performance. This allows us to observe the effect of the quasi-Newton Hessian approximation $H_{k}$ in a controlled setting, without the clutter introduced by elaborate step length strategies for $\beta_{k}$ .

In the figures provided below, we use the following notation.

$N$ : the number of training points in the dataset.

${b}$ : size of the batch used in the computation of the stochastic gradient $\widehat{\nabla}F(w)$ defined in (1.4); i.e., ${b}=|{\cal S}|$ .

${b_{H}}$ : size of the batch used in the computation of Hessian-vector products (2.2) and (2.3); i.e., ${b_{H}}=|{\cal S}_{H}|$ .

${L}$ : controls the frequency of limited memory BFGS updating. Every ${L}$ iterations a new curvature pair (2.2) is formed and the oldest pair is removed.

${M}$ : memory used in limited memory BFGS updating.

adp: Accessed data points. At each iteration the SGD method evaluates the stochastic gradient $\widehat{\nabla}F(w^{k})$ using ${b}$ randomly chosen training points $(x_{i},z_{i})$ , so we say that the iteration accessed ${b}$ data points. On the other hand, an iteration of the stochastic BFGS method accesses ${b}+{b_{H}}/{L}$ points.

iteration: In some graphs we compare SGD and SQN iteration by iteration (in addition to comparing them in terms of accessed data points).

epoch: One complete pass through the dataset.

In our experiments, the stochastic gradient (1.4) is formed by randomly choosing ${b}$ training points from the dataset without replacement. This process is repeated every epoch, which guarantees that all training points are equally used when forming the stochastic gradient. Independent of the stochastic gradients, the Hessian-vector products are formed by randomly choosing ${b_{H}}$ training points from the dataset without replacement.

We first test our algorithm on a binary classification problem. The objective function is given by

The training points were generated randomly as described in , with $N=7000$ and $n=50$ . To establish a reference benchmark with a well known algorithm, we used the particular implementation of one of the coordinate descent (CD) methods of Tseng and Yun .

Figure 1 reports the performance of SGD (with $\beta=7$ ) and SQN (with $\beta=2$ ), as measured by accessed data points. Both methods use a gradient batch size of ${b}=50$ ; for SQN we display results for two values of the Hessian batch size ${b_{H}}$ , and set ${M}=10$ and ${L}=10$ . The vertical axis, labeled fx, measures the value of the objective (4.1); the dotted black line marks the best function value obtained by the coordinate descent (CD) method mentioned above. We observe that the SQN method with ${b_{H}}=300$ and 600 outperforms SGD, and obtains the same or better objective value than the coordinate descent method.

In Figure 2 we explore the effect of the memory size ${M}$ . Increasing ${M}$ beyond 1 and 2 steadily improves the performance of the SQN algorithm, both during the first few epochs (left figure), and after letting the algorithm run for many epochs (right figure). For this problem, a large memory size is helpful in the later stages of the run.

2 RCV1 Data Set

The RCV1 dataset is a composition of newswire articles produced by Reuters from 1996-1997. Each article was manually labeled into 4 different classes: Corporate/Industrial, Economics, Government/Social, and Markets. For the purpose of classification, each article was then converted into a boolean feature vector with a 1 representing the appearance of a given word. Post word stemming, this gives each feature vector a dimension of $n=112919$ .

Each data point $x_{i}\in^{n}$ is extremely sparse, with an average of 91 ( $.013\%)$ nonzero elements. There are $N=688329$ training points. We consider the binary classification problem of predicting whether or not an article is in the fourth class, Markets, and accordingly we have labels $z_{i}\in\left\{0,1\right\}$ . We use logistic regression to model the problem, and define the objective function by equation (4.1).

In our numerical experiments with this problem, we used gradient batch sizes of ${b}=50$ , 300 or 1000, which respectively comprise $.0073\%$ , $.044\%$ and $.145\%$ of the dataset. The frequency of quasi-Newton updates was set to ${L}=20$ , a value that balances the aims of quickly retrieving curvature information and minimizing computational costs. For the SGD method we chose $\beta=5$ when ${b}=50$ , and $\beta=10$ when ${b}=300$ or 1000; for the SQN method (2.6) we chose $\beta=1$ when ${b}=50$ , and $\beta=5$ when ${b}=300$ or 1000, and we set ${b_{H}}=1000$ .

Figure 3 reports the performance of the two methods as measured by either iteration count or accessed data points. As before, the vertical axis, labeled fx, measures the value of the objective (4.1). Figure 3 shows that for each batch size, the SQN method outperforms SGD, and both methods improve as batch size increases. We observe that using ${b}=300$ or 1000 yields different relative outcomes for the SQN method when measured in terms of iterations or adp: a batch size of 300 provides the fastest initial decrease in the objective, but that method is eventually overtaken by the variant with the larger batch size of 1000.

Figure 4 illustrates the effect of varying the Hessian batch size ${b_{H}}$ from 10 to 10000, while keeping the gradient batch size ${b}$ fixed at 300 or 1000. For ${b}=300$ (Figure 4a) increasing ${b_{H}}$ improves the performance of SQN, in terms of adp, up until ${b_{H}}=1000$ , where the benefits of the additional accuracy in the Hessian approximation do not outweigh the additional computational cost. In contrast, Figure 4b shows that for ${b}=1000$ , a high value for ${b_{H}}$ , such as 10000, can be effectively used since the cost of the Hessian-vector product relative to the gradient is lower. One of the conclusions drawn from this experiment is that there is much freedom in the choice of ${b_{H}}$ , and that only a small sub-sample of the data (e.g. ${b_{H}}=100$ ) is needed for the stochastic quasi-Newton approach to yield benefits.

One should guard, however, against the use of very small values for ${b_{H}}$ , as seen in the large blue spike in Figure 4a corresponding to ${b_{H}}=10$ . To understand this behavior, we monitored the angle between the vectors $s$ and $y$ and observed that it approached $90^{\circ}$ between iteration 3100 and 3200, which is where the spike occurred. Since the term $s^{T}y$ enters in the denominator of the BFGS update formula (2.4), this led to a very large and poor step. Monitoring $s^{T}y$ (relative to, say, $s^{T}Bs$ ) can a useful indicator of a potentially harmful update; one can increase ${b_{H}}$ or skip the update when this number is smaller than a given threshold.

The impact of the memory size parameter ${M}$ is shown in Figure 5. The results improve consistently as ${M}$ increases, but beyond ${M}=2$ these improvements are rather small, especially in comparison to the results in Figure 2 for the synthetic data. The reasons for this difference are not clear, but for the deterministic L-BFGS method the effect of ${M}$ on performance is known to be problem dependent. We observe that performance with a value of ${M}=0$ , which results in a Hessian approximation of the form $H_{t}=\frac{s_{t}^{T}y_{t}}{y_{t}^{T}y_{t}}I$ , is poor and also unstable in early iterations, as shown by the spikes in Figure 5.

To gain a better understanding of the behavior of the SQN method, we also monitored the following two errors:

The quantities $\nabla F(w)$ and $\nabla^{2}F(\bar{w}_{I})(\bar{w}_{I}-\bar{w}_{J})$ are computed with the entire data set, as indicated by (4.1). Therefore, the ratios above report the relative error in the stochastic gradient used in (2.6) and the relative error in the computation of the Hessian-vector product (2.2).

Figure 6 displays these relative errors for various batch sizes ${b}$ and ${b_{H}}$ , along with the norms of the stochastic gradients. These errors were calculated every 20 iterations during a single run of SQN with the following parameter settings: ${b}=300$ , ${L}=20$ , ${M}=5$ , and ${b_{H}}=688329$ . Batch sizes larger than ${b}=10000$ exhibit non-stochastic behavior in the sense that all relative errors are less than one, and the norm of these approximate gradients decreases during the course of the iteration. Gradients with a batch size less than 10000 have relative errors greater than 1, and their norm does not exhibit decrease over the course of the run.

Figure 6 indicates that the Hessian-vector errors stay relatively constant throughout the run and have smaller relative error than the gradient. As discussed above, some accuracy here is important while it is not needed for the batch gradient.

3 A Speech Recognition Problem

The speech dataset, provided by Google, is a collection of feature vectors representing 10 millisecond frames of speech with a corresponding label representing the phonetic state assigned to that frame. Each feature $x_{i}$ has a dimension of NF $=235$ and has corresponding label $z_{i}\in C=\left\{1,2,\ldots,129\right\}$ . There are a total of $N=191607$ samples; the number of variables is $n=$ NF $\times|C|=30315$ .

Figure 7 displays the performance of SGD and SQN for ${b}=100$ and 500 (which represent approximately 0.05%, and 0.25% of the dataset). For the SGD method, we chose the step length $\beta=10$ for both values of ${b}$ ; for the SQN method we set $\beta=2$ , ${L}=10,\ {M}=5,\ {b_{H}}=1000$ .

We observe from Figure 7 that SQN improves upon SGD in terms of adp, both initially and in finding a lower objective value. Although the number of SQN iterations decreases when ${b}$ is increased from 100 to 500, in terms of computational cost the two versions of the SQN method yield almost identical performance.

The effect of varying the Hessian batch size ${b_{H}}$ is illustrated in Figure 8. The figure on the left shows that increasing ${b_{H}}$ improves the performance of SQN, as measured by iterations, but only marginally from ${b_{H}}=1000$ to $10,000$ . Once the additional computation cost of Hessian-vector products is accounted for, we observe from the figure on the right that ${b_{H}}=100$ is as effective as ${b_{H}}=1000$ . Once more, we conclude that only a small subset of data points ${\cal S}_{H}$ is needed to obtain useful curvature information in the SQN method.

Figure 9 illustrates the impact of increasing the memory size ${M}$ from 0 to 20 for the SQN method. A memory size of zero leads to a marked degradation of performance. Increasing ${M}$ from 0 to 5 improves SQN, but values greater than 5 yield no measurable benefit.

4 Generalization Error

The primary focus of this paper is on the minimization of training error (1.3), but it is also interesting to explore the performance of the SQN method in terms of generalization (testing) error. For this purpose we consider the RCV1 dataset, and in Figure 10 we report the performance of algorithms SQN and SGD with respect to unseen data (dotted lines). Both algorithms were trained using $75\%$ of the data and then tested on the remaining $25\%$ (the test set). In Figure 10a, the generalization error is measured in terms of decrease of the objective (4.1) over the test set, and in Figure 10b, in terms of the percent of correctly classified data points from the test set. The first measure complements the latter in the sense that it takes into account the confidence of the correct predictions and the inaccuracies wrought by the misclassifications. Recall that there are 2 classes in our RCV1 experiments, so random guessing would yield a percentage of correct classification of 0.5.

As expected, the objective on the training set is lower than the objective on the test set, but not by much. These graphs suggests that over-fitting is not occurring since the objective on the test set decreases monotonically. The performance of the stochastic quasi-Newton method is clearly very good on this problem.

5 Small Mini-Batches

In the experiments reported in the sections 4.2 and 4.3, we used fairly large gradient batch sizes, such as ${b}=50,100,1000$ , because they gave good performance for both the SGD and SQN methods on our test problems. Since we set ${M}=5$ , the cost of the multiplication $H_{t}\widehat{\nabla}F(w^{k})$ (namely, $4{M}n=20n$ ) is small compared to the cost of ${b}n$ for a batch gradient evaluation. We now explore the efficiency of the stochastic quasi-Newton method for smaller values of the batch size ${b}$ .

In Figure 11 we report results for the SGD and SQN methods for problem RCV1, for ${b}=10$ and 20. We use two measures of performance: total computational work and adp. For the SQN method, the work measure is given by (2.10), which includes the evaluation of the gradient (1.4), the computation of the quasi-Newton step (2.6), and the Hessian-vector products (2.2).

In order to compare total work and adp on the same figure, we scale the work by $1/n$ . The solid lines in Figure 11 plot the objective value vs adp, while the dotted lines plot function value vs total work. We observe from Figure 11 that, even for small ${b}$ , the SQN method outperforms SGD by a significant margin in spite of the additional Hessian-vector product cost. Note that in this experiment the $4Mn$ cost of computing the steps is still less than half the total computational cost (2.10) of the SQN iteration.

In this experiment, it was crucial to update the quasi-Newton matrix infrequently ( ${L}=200$ ), as this allowed us to employ a large value of ${b_{H}}$ at an acceptable cost. In general, the parameters ${L},\ {M}$ and ${b_{H}}$ provide much freedom in adapting the SQN method to a specific application.

6 Comparison to the oLBFGS method

We also compared our algorithm to the oLBFGS method , which is the best known stochastic quasi-Newton method in the literature. It is of the form (1.6) but differs from our approach in three crucial respects: the L-BFGS update is performed at every iteration, the curvature estimate is calculated using gradient differencing, and the sample size for gradient differencing is the same as the sample size for the stochastic gradient. This approach requires two gradient evaluations per iteration; it computes

We implemented the oLBFGS method as described in , with the following parameter settings: i) we found it to be unnecessary to add a damping parameter to the computation $y_{k}$ , and thus set $\lambda=0$ in the reset $y_{k}\leftarrow y_{k}+\lambda s_{k}$ ; ii) the parameter $\epsilon$ used to rescaled the first iteration, $w^{1}=w^{0}-\epsilon\alpha^{k}\widehat{\nabla}F(w^{0})$ , was set to $\epsilon=10^{-6}$ ; iii) the initial choice of scaling parameter in Hessian updating (see Step 1 of Algorithm 2) was the average of the quotients $s_{i}^{T}y_{i}/y_{i}^{T}y_{i}$ averaged over the last $M$ iterations, as recommended in .

Figure 12 compares our SQN method to the aforementioned oLBFGS on our two realistic test problems, in terms of accessed data points. We observe that SQN has overall better performance, which is more pronounced for smaller batch sizes.

Related Work

Various stochastic quasi-Newton algorithms have been proposed in the literature , but have not been entirely successful. The methods in and use the BFGS framework; the first employs an L-BFGS implementation, as mentioned in the previous section, and the latter uses a regularized BFGS matrix. Both methods enforce uniformity in gradient differencing by resampling data points so that two consecutive gradients are evaluated with the same sample ${\cal S}$ ; this strategy requires an extra gradient evaluation at each iteration. The algorithm presented in uses SGD with a diagonal rescaling matrix based on the secant condition associated with quasi-Newton methods. Similar to our approach, updates the rescaling matrix at fixed intervals in order to reduce computational costs. A common feature of is that the Hessian approximation might be updated with a high level of noise.

A two-stage online Newton strategy is proposed in . The first stage runs averaged SGD with a step size of order $O(1/\sqrt{k})$ , and the second stage minimizes a quadratic model of the objective function using SGD with a constant step size. The second stage effectively takes one Newton step, and employs Hessian-vector products in order to compute stochastic derivatives of the quadratic model. This method is significantly different from our quasi-Newton approach.

A stochastic approximation method that has shown to be effective in practice is AdaGrad . The iteration is of the form (1.5), where $B_{k}$ is a diagonal matrix that estimates the diagonal of the squared root of the uncentered covariance matrix of the gradients; it is shown in that such a matrix minimizes a regret bound. The algorithm presented in this paper is different in nature from AdaGrad, in that it employs a full (non-diagonal) approximation to the Hessian $\nabla^{2}F(w)$ .

Amari popularized the idea of incorporating information from the geometric space of the inputs into online learning with his presentation of the natural gradient method. This method seeks to find the steepest descent direction in the feature space $x$ by using the Fisher information matrix, and is shown to achieve asymptotically optimal bounds. The method does, however, require knowledge of the underlying distribution of the training points $(x,z)$ , and the Fisher information matrix must be inverted. These concerns are addressed in , which presents an adaptive method for computing the inverse Fisher Information matrix in the context of multi-layer neural networks.

The authors of TONGA interpret natural gradient descent as the direction that maximizes the probability of reducing the generalization error. They outline an online implementation using the uncentered covariance matrix of the empirical gradients that is updated in a weighted manner at each iteration. Additionally, they show how to maintain a low rank approximation of the covariance matrix so that the cost of the natural gradient step is $O(n)$ . In it is argued that an algorithm should contain information about both the Hessian and covariance matrix, maintaining that that covariance information is needed to cope with the variance due to the space of inputs, and Hessian information is useful to improve the optimization.

Our algorithm may appear at first sight to be similar to the method proposed by Byrd et al. , which also employs Hessian-vector products to gain curvature information. We note, however, that the algorithms are different in nature, as the algorithm presented here operates in the stochastic approximation regime, whereas is a batch (or SAA) method.

Final Remarks

In this paper, we presented a quasi-Newton method that operates in the stochastic approximation regime. It is designed for the minimization of convex stochastic functions, and was tested on problems arising in machine learning. In contrast to previous attempts at designing stochastic quasi-Newton methods, our approach does not compute gradient differences at every iteration to gather curvature information; instead it computes (sub-sampled) Hessian-vector products at regular intervals to obtain this information in a stable manner.

Our numerical results suggest that the method does more than rescale the gradient, i.e., that its improved performance over the stochastic gradient descent method of Robbins-Monro is the result of incorporating curvature information in the form of a full matrix.

The practical success of the algorithm relies on the fact that the batch size ${b_{H}}$ for Hessian-vector products can be chosen large enough to provide useful curvature estimates, while the update spacing $L$ can be chosen large enough (say $L=20$ ) to amortize the cost of Hessian-vector products, and make them affordable. Similarly, there is a wide range of values for the gradient batch size $b$ that makes the overall quasi-Newton approach (1.6) viable.

The use of the Hessian-vector products (2.2) may not be essential; one might be able to achieve the same goals using differences in gradients, i.e.,

This would require, however, that the evaluation of these gradients employ the same sample $\cal S$ so as to obtain sample uniformity, as well as the development of a strategy to prevent close gradient differences from magnifying round off noise. In comparison, the use of Hessian-vector products takes care of these issues automatically, but it requires code for Hessian-vector computations (a task that is not often not onerous).

We established global convergence of the algorithm on strongly convex objective functions. Our numerical results indicate that the algorithm is more effective than the best known stochastic quasi-Newton method (oLBFGS ) and suggest that it holds much promise for the solution of large-scale problems arising in stochastic optimization. Although we presented and analyzed the algorithm in the convex case, our approach is applicable to non-convex problems provided it employs a mechanism for ensuring that the condition $s_{t}^{T}y_{t}>0$ is satisfied.