Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Farbod Roosta-Khorasani, Michael W. Mahoney

Introduction

Large scale optimization problems arise frequently in machine learning and data analysis and there has been a great deal of effort to devise algorithms for efficiently solving such problems. Here, following many data-fitting applications, we consider the optimization problem of the form

The rest of this paper is organized as follows: in Section 1.1, we first give a very brief background on the general methodology for optimizing (1). The notation and the assumptions used in this paper are given in Section 1.2. The contributions of this paper are listed in Section 1.3. Section 1.4 surveys the related work. Section 2 gives global convergence results for the case where only the Hessian is sub-sampled while the full gradient is used. In particular, Section 2.1.1 gives a linearly convergent global algorithm with exact update, whereas Section 2.1.2 gives a similar result for the case when approximate solution of the linear system is used as search direction. The case where the gradient, as well as Hessian, is sub-sampled is treated in Section 3. More specifically, Section 3.1.1 gives a globally convergent algorithm with linear rate with exact update, whereas Section 3.1.2 addresses the algorithm with inexact updates. A few examples from generalized linear models (GLM), a very popular class of problems in machine learning community, as well as numerical simulations are given in Section 4. Conclusions and further thoughts are gathered in Section 5. All proofs are given in the appendix.

For optimizing (1), the standard deterministic or full gradient method, which dates back to Cauchy , uses iterations of the form

where αk\alpha_{k} is the step size at iteration kk. However, when n1n\gg 1, the full gradient method can be inefficient because its iteration cost scales linearly in nn. In addition, when p1p\gg 1 or when each individual fif_{i} are complicated functions (e.g., evaluating each fif_{i} may require the solution of a partial differential equation), the mere evaluation of the gradient can be computationally prohibitive. Consequently, stochastic variant of full gradient descent, e.g., (mini-batch) stochastic gradient descent (SGD) was developed . In such methods a subset S{1,2,,n}\mathcal{S}\subset\{1,2,\cdots,n\} is chosen at random and the update is obtained by

When Sn|\mathcal{S}|\ll n (e.g., S=1|\mathcal{S}|=1 for simple SGD), the main advantage of such stochastic gradient methods is that the iteration cost is independent of nn and can be much cheaper than the full gradient methods, making them suitable for modern problems with large nn.

The above class of methods are among what is known as first-order methods where only the gradient information is used at every iteration. One attractive feature of such class of methods is their relatively low per-iteration-cost. Despite the low per-iteration-cost of first order methods, in almost all problems, incorporating curvature information (e.g., Hessian) as a form of scaling the gradient, i.e.,

can significantly improve the convergence rate. Such class of methods which take the curvature information into account are known as second-order methods, and compared to first-order methods, they enjoy superior convergence rate in both theory and practice. This is so since there is an implicit local scaling of coordinates at a given x{\bf x}, which is determined by the local curvature of FF. This local curvature in fact determines the condition number of a FF at x{\bf x}. Consequently, by taking the curvature information into account (e.g., in the form of the Hessian), second order methods can rescale the gradient direction so it is a much more “useful” direction to follow. This is in contrast to first order methods which can only scale the gradient uniformly for all coordinates. Such second order information have long been used in many machine learning applications .

The canonical example of second order methods, i.e., Newton’s method , is with DkD_{k} taken to be the inverse of the full Hessian and αk=1\alpha_{k}=1, i.e.,

However, when n,p1n,p\gg 1, the per-iteration-cost of such algorithm is significantly higher than that of first-order methods. As a result, a line of research is to try to construct an approximation of the Hessian in a way that the update is computationally feasible, and yet, still provides sufficient second order information. One such class of methods are quasi-Newton methods, which are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In such methods, the approximation to the Hessian is updated iteratively using only first order information from the gradients and the iterates through low-rank updates. Among these methods, the celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its limited memory version (L-BFGS) , are the most popular and widely used members of this class. Another class of methods for approximating the Hessian is based on sub-sampling where the Hessian of the full function FF is estimated using that of the randomly selected subset of functions fif_{i}, . More precisely, a subset S{1,2,,n}\mathcal{S}\subset\{1,2,\cdots,n\} is chosen at random and, if the sub-sampled matrix is invertible, the update is obtained by

where Sg\mathcal{S}_{{\bf g}} and SH\mathcal{S}_{H} are sample sets used for approximating the Hessian and the gradient, respectively. The variants (2b) are what we call sub-sampled Newton methods in this paper.

This paper has a companion paper , henceforth called SSN2, which considers the technically-more-sophisticated local convergence rates for sub-sampled Newton methods (by local convergence, it is meant that the initial iterate is close enough to a local minimizer at which the sufficient conditions hold). However, here, we only concentrate on designing such algorithms with global convergence guarantees. In doing so, we need to ensure the following requirements:

Our sampling strategy needs to provide a sample size S|\mathcal{S}| which is independent of nn, or at least smaller. Note that this is the same requirement as in SSN2 [40, (R.1)]. However, as a result of the simpler goals of the present paper, we will show that, comparatively, a much smaller sample size can be required here than that used in SSN2 .

In addition, any such method must, at least probabilistically, ensure that the sub-sampled matrix is invertible. If the gradient is also sub-sampled, we need to ensure that sampling is done in a way to keep as much of this first order information as possible. Note that unlike SSN2 [40, (R.2)] where we require a strong spectrum-preserving property, here, we only require the much weaker invertibility condition, which is enough to yield global convergence.

We need to ensure that our designed algorithms are globally convergent and approach the optimum starting from any initial guess. In addition, we require to have bounds which yield explicit convergence rate as opposed to asymptotic results. Note that unlike SSN2 [40, (R.3)] where our focus is on speed, here, we mainly require global convergence guarantees.

For p1p\gg 1, even when the sub-sampled Hessian is invertible, computing the update at each iteration can indeed pose a significant computational challenge. More precisely, it is clear from (2b) that to compute the update, sub-sampled Newton methods require the solution of a linear system, which regardless of the sample size, can be the bottleneck of the computations. Solving such systems inexactly can further improve the computational efficiency of sub-sampled algorithms. Hence, it is imperative to allow for approximate solutions and still guarantee convergence.

In this paper, we give global convergence rates for sub-sampled Newton methods, addressing challenges (R.1), (R.2), (R.3) and (R.4). As a result, the local rates of the companion paper, SSN2 , coupled with the global convergence guarantees presented here, provide globally convergent algorithms with fast and problem-independent local rates (e.g., see Theorems 2 and 7). To the best of our knowledge, the present paper and SSN2 are the very first to thoroughly and quantitatively study the convergence behavior of such sub-sampled second order algorithms, in a variety of settings.

2 Notation and Assumptions

Note that Assumption (3b) implies uniqueness of the minimizer, x{\bf x}^{*}, which is assumed to be attained. The quantity

is known as the condition number of the problem.

For an integer 1qn1\leq q\leq n, let Q\mathcal{Q} be the set of indices corresponding to qq largest KiK_{i}’s and define the “sub-sampling” condition number as

It is easy to see that for any two integers qq and rr such that 1qrn1\leq q\leq r\leq n, we have κκrκq\kappa\leq\kappa_{r}\leq\kappa_{q}. Finally, define

where κ1\kappa_{1} and κS\kappa_{|\mathcal{S}|} are as in (5).

3 Contributions

The contributions of this paper can be summarized as follows:

Under the above assumptions, we propose various globally convergent algorithms. These algorithms are guaranteed to approach the optimum, regardless of their starting point. Our algorithms are designed for the following settings.

Algorithm 1 practically implements (2a). In other words, we incorporate sub-sampled Hessian while the full gradient is used. Theorem 1 establishes the global convergence of Algorithm 1.

Algorithms 2 and 3, are modifications of Algorithm 1 in which the sub-sampled Hessian is regularized by modifying its spectrum or by Levenberg-type regularization (henceforth called ridge-type regularization), respectively. Such regularization can be used to guarantee global convergence, in the absence of positive definiteness of the full Hessian. Theorems 4 and 5 as well as Corollaries 1 and 2 guarantee global convergence of these algorithms.

Algorithm 4 is the implementation of the fully stochastic formulation (2b), in which the gradient as well the Hessian is sub-sampled. The global convergence of Algorithm 4 is guaranteed by Theorem 6.

For all of our algorithms, we present analysis for the case of inexact update where the linear system is solved only approximately, to a given tolerance. In addition, we establish criteria for the tolerance to guarantee faster convergence rate. The results of Theorems 3, 4, 5, and 8 give global convergence of the corresponding algorithms with inexact updates.

4 Related Work

The results of Section 3, can be applied to more general setting where nn can be arbitrarily large. This is so since sub-sampling the gradient, in addition to that of the Hessian, allows for iteration complexity, which can be much smaller than nn. Within the context of first order methods, there has been numerous variants of gradient sampling from a simple stochastic gradient descent, , to the most recent improvements by incorporating the previous gradient directions in the current update . For second order methods, such sub-sampling strategy has been successfully applied in large scale non-linear inverse problems . However, to the best of our knowledge, Section 3 offers the first quantitative and global convergence results for such sub-sampled methods.

Finally, inexact updates have been used in many second-order optimization algorithms; see and references therein.

Sub-Sampling Hessian

For the optimization problem (1), at each iteration, consider picking a sample of indices from {1,2,,n}\{1,2,\ldots,n\}, uniformly at random with or without replacement. Let S\mathcal{S} and S|\mathcal{S}| denote the sample collection and its cardinality, respectively and define

to be the sub-sampled Hessian. As mentioned before in Section 1.1, in order for such sub-sampling to be useful, we need to ensure that the sample size S|\mathcal{S}| satisfies the requirement (R.1), while H(x)H({\bf x}) is invertible as mentioned in (R.2). Below, we make use of random matrix concentration inequalities to probabilistically guarantee such properties.

then for H(x)H({\bf x}) defined in (8), we have

where γ\gamma and κ1\kappa_{1} are defined in (3b) and (5), respectively.

Hence, depending on κ1\kappa_{1}, the sample size S|\mathcal{S}| can be smaller than nn. In addition, we can always probabilistically guarantee that the sub-sampled Hessian is uniformly positive definite and, consequently, the direction given by it, indeed, yields a direction of descent.

It is important to note that the sufficient sample size, S|\mathcal{S}|, here grows only linearly in κ1\kappa_{1}, i.e., Ω(κ1){\Omega}(\kappa_{1}), as opposed to quadratically, i.e., Ω(κ12)\Omega(\kappa_{1}^{2}), in . In fact, it might be worth elaborating more on the differences between the above sub-sampling strategy and that of SSN2 [40, Lemmas 1, 2, and 3]. These differences, in fact, boil down to the differences between the requirement (R.2) and the corresponding one in SSN2 [40, Section 1.1, (R.2)]. As a result of a “coarser-grained” analysis in the present paper and in order to guarantee global convergence, we only require that the sub-sampled Hessian is uniformly postive definite. Consequently, Lemma 1 require a smaller sample size, i.e., in the order of κ1\kappa_{1} vs. κ12\kappa_{1}^{2} for SSN2 [40, Lemma 1, 2, and 3], while delivering a much weaker guarantee about the invertibility of the sub-sampled Hessian. In contrast, for the finer-grained analysis in SSN2 , we needed a much stronger guarantee to preserve the spectrum of the true Hessian, and not just simple invertibility.

In Section 2.1.1, we first present an iterative algorithm in which, at every iteration, the linear system in (2a) is solved exactly. In Section 2.1.2, we then present a modified algorithm where such step is done only approximately and the update is computed inexactly, to within a desired tolerance. Finally, Section 2.2 will present algorithms in which the sub-sampled Hessian is regularized through modifying its spectrum or ridge-type regularization. For this latter section, the algorithms are given for the case of inexact update as extensions to exact solve is straightforward. The proofs of all the results are given in the appendix.

For the sub-sampled Hessian H(x(k))H({\bf x}^{(k)}), consider the update

for some β(0,1)\beta\in(0,1) and α^1\widehat{\alpha}\geq 1. Recall that (11c) can be approximately solved using various methods such as Armijo backtracking line search .

Theorem 1 guarantees global convergence with at least a linear rate which depends on the quantities related to the specific problem. In SSN2 , we have shown, through a finer grained analysis, that the locally linear convergence rate of such sub-sampled Newton method with a constant step size αk=1\alpha_{k}=1 is indeed problem independent. In fact, it is possible to combine both results to obtain a globally convergent algorithm with a locally linear and problem-independent rate, which is indeed much faster than what Theorem 1 implies.

Let Assumptions (3) hold and each fif_{i} have a Lipschitz continuous Hessian as

iterations, with probability (1δ)k(1-\delta)^{k} we get “problem-independent” Q-linear convergence, i.e.,

In fact, it is even possible to obtain a globally convergent algorithm with locally superlinear rate of convergence using Algorithm 1 with iteration dependent ϵ^(k)\hat{\epsilon}^{(k)} as

where ϵ(k)\epsilon^{(k)} is chosen as in SSN2 [40, Theorem 3 or 4]. The details are similar to Theorem 2 and are omitted here.

1.2 Inexact update

In many situations, where finding the exact update, pk{\bf p}_{k}, in (11b) is computationally expensive, it is imperative to be able to calculate the update direction pk{\bf p}_{k} only approximately. Such inexactness can indeed reduce the computational costs of each iteration and is particularly beneficial when the iterates are far from the optimum. This makes intuitive sense because, if the current iterate is far from x{\bf x}^{*}, it may be computationally wasteful to exactly solve for pk{\bf p}_{k} in (11b). Such inexact updates have been used in many second-order optimization algorithms, e.g. . Here, in the context of uniform sub-sampling, we give similar global results using inexact search directions inspired by .

For computing the search direction, pk{\bf p}_{k}, consider the linear system H(x(k))pk=F(x(k))H({\bf x}^{(k)}){\bf p}_{k}=-\nabla F({\bf x}^{(k)}) at kthk^{th} iteration. Instead, in order to allow for inexactness, one can solve the linear system such that for some 0θ1,θ2<10\leq\theta_{1},\theta_{2}<1, pk{\bf p}_{k} satisfies

The condition (16a) is the usual relative residual of the approximate solution. However, for a given θ1\theta_{1}, any pk{\bf p}_{k} satisfying (16a) might not necessarily result in a descent direction. As a result, condition (16b) ensures that such a pk{\bf p}_{k} is always a direction of descent. Note that given any 0θ1,θ2<10\leq\theta_{1},\theta_{2}<1, one can always find a pk{\bf p}_{k} satisfying (16) (e.g., the exact solution always satisfies (16)).

Comment 2: The dependence of the guarantees of Theorem 3 on the inexactness parameters, θ1\theta_{1} and θ2\theta_{2}, is indeed intuitive. The minimum amount of decrease in the objective function is mainly dependent on θ1\theta_{1}, i.e., the accuracy of the the linear system solve. On the other hand, the dependence of the step size, αk\alpha_{k}, on θ2\theta_{2} indicates that the algorithm can take larger steps along a search direction, pk,{\bf p}_{k}, that points more accurately towards the direction of the largest rate of decrease, i.e., pkTF(x(k)){\bf p}_{k}^{T}\nabla F({\bf x}^{(k)}) is more negative which means that pk{\bf p}_{k} points more accurately in the direction of F(x(k))-\nabla F({\bf x}^{(k)}). As a result, the more exactly we solve for pk{\bf p}_{k}, the larger the step that the algorithm might take and the larger the decrease in the objective function. However, the cost of any such calculation at each iteration might be expensive. As a result, there is a trade-off between accuracy for computing the update in each iteration and the overall progress of the algorithm.

2 Modifying the Sample Hessian

As mentioned in the introduction, if FF is not strongly convex, it is still possible to obtain globally convergent algorithms. This is done through regularization of the Hessian of FF to ensure that the resulting search direction is indeed a direction of descent. Even when FF is strongly convex, the use of regularization can still be beneficial. Indeed, the lower bound for the step size in Theorems 1 and 3 imply that a small γ\gamma can potentially slow down the initial progress of the algorithm. More specifically, small γ\gamma might force the algorithm to adopt small step sizes, at least in the early stages of the algorithm. This observation is indeed reinforced by the composite nature of error recursions obtained in SSN2 . In particular, it has been shown in SSN2 that in the early stages of the algorithm, when the iterates are far from the optimum, the error recursion is dominated by a quadratic term which transitions to a linear term as the iterates get closer to the optimum. However, unlike the linear term, the quadratic term is negatively affected by the small values of γ\gamma. In other words, small γ\gamma can hinder the initial progress and even using the full Hessian cannot address this issue. As a result, one might resort to regularization of the (estimated) Hessian to improve upon the initial slow progress. Here, we explore two strategies for such regularization which are incorporated as part of our algorithms. The results are given for when (11b) is solved approximately with a “small-enough” tolerance and for the case of sub-sampling without replacement. Extensions to arbitrary tolerance as well as sampling with replacement is as before and straightforward.

In this section, we follow the ideas presented in , by accounting for such a potentially negative factor γ\gamma, through a regularization of eigenvalue distribution of the sub-sampled Hessian. More specifically, for some λ0\lambda\geq 0, let

The operation (17b) can be equivalently represented as

with vi{\bf v}_{i} being the ithi^{th} eigenvector of HH corresponding to the ithi^{th} eigenvalue, λi(H)\lambda_{i}(H). Operation (17b) can be performed using truncated SVD (TSVD). Note that although TSVD can be done through standard methods, faster randomized alternatives exist which provide accurate approximations to TSVD much more efficiently .

As a result of Steps 5–7 in Algorithm 2, the regularized sub-sampled matrix, H^(x(k))\hat{H}({\bf x}^{(k)}), is always positive definite with minimum eigenvalue λ(k)>0\lambda^{(k)}>0. Consequently, just to obtain global convergence, there is no need to resort to sampling Lemma 1 to ensure invertibility of the sub-sampled matrix. In fact, such regularization guarantees the global convergence of Algorithm 2, even in the absence of strong convexity assumption (3b). Theorem 4 gives such a result for Algorithm 2 in the case of inexact update.

where K^S\widehat{K}_{|\mathcal{S}|} is defined as in (6). If in addition, Assumption (3b) holds, then we have (12) with

Moreover, for both cases, the step size is at least

where K^S\widehat{K}_{|\mathcal{S}|} is defined as in (6). Moreover, with probability 1δ1-\delta, the step size is at least

2.2 Ridge Regularization

As an alternative to the spectral regularization, we can consider the following simple ridge-type regularization

for some λ0\lambda\geq 0, similar to the Levenberg-Marquardt type algorithms . Such regularization might be preferable to the spectral regularization of Section 2.2.1, as it avoids the projection operation 17b at every iteration. Theorem 5 gives a global convergence guarantee for Algorithm 3 in the case of inexact update.

where K^S\widehat{K}_{|\mathcal{S}|} is defined as in (6). If, in addition, Assumption (3b) holds, then we have (12) with

Moreover, for both cases, the step size is at least

As mentioned before in Section 2.2.1, under Assumption (3b), if the sample size S|\mathcal{S}| is not chosen large enough then H(x(k))H({\bf x}^{(k)}) might (nearly) be singular. This can in turn cause a need for a larger λ\lambda or, alternatively, if λ1\lambda\ll 1, the accuracy tolerance θ1\theta_{1} and the step-size αk\alpha_{k} can be very small. However, by using the sample size given by Lemma 1, one can probabilistically guarantee a minimum step-size as well as a minimum sufficient upper bound for θ1\theta_{1} which are bounded away from zero even if λ=0\lambda=0.

where K^S\widehat{K}_{|\mathcal{S}|} is defined as in (6). Moreover, with probability 1δ1-\delta, the step size is at least

Sub-Sampling Hessian & Gradient

In order to compute the update x(k+1){\bf x}^{(k+1)} in Section 2, full gradient was used. In many problems, this can be a major bottleneck and reduction in computational costs can be made by considering sub-sampling the gradient as well. This issue arises more prominently in high dimensional settings where n,p1n,p\gg 1 and evaluating the full gradient at each iteration can pose a significant challenge. In such problems, sub-sampling the gradient can, at times, drastically reduce the computational complexity of many problems.

Consider selecting a sample collection from {1,2,,n}\{1,2,\ldots,n\}, uniformly at random with replacement and let

be the sub-sampled gradient. As mentioned before in the requirement (R.2), we need to ensure that sampling is done in a way to keep as much of the first order information from the full gradient as possible.

By a simple observation, the gradient F(x)\nabla F({\bf x}) can be written in matrix-matrix product from as

Hence, we can use approximate matrix multiplication results as a fundamental primitive in RandNLA , to probabilistically control the error in approximation of F(x)\nabla F({\bf x}) by g(x){\bf g}({\bf x}), through uniform sampling of the columns and rows of the involved matrices above. As a result, we have the following lemma (a more general form of this lemma for the case of constrained optimization is given in the companion paper, SSN2 [40, Lemma 4]).

For any 0<ϵ<10<\epsilon<1 and 0<δ<10<\delta<1, if

then for g(x){\bf g}({\bf x}) defined in (19), we have

Comment 4: In order to use the above result in our gradient sub-sampling, we need to be able to efficiently estimate G(x)G({\bf x}) at every iteration. Fortunately, in many different problems, this is often possible; for concrete examples, see Section 4.

We now show that by combining the gradient sub-sampling of Lemma 2 with Hessian sub-sampling of Lemma 1, we can still obtain global guarantees for some modification of Algorithm 1. This indeed generalizes our algorithms to fully stochastic variants where both the gradient and the Hessian are approximated.

In Section 3.1.1, we first present an iterative algorithm with exact update where, at every iteration, the linear system in (2a) is solved exactly. In Section 3.1.2, we then present a modified algorithm where such step is done only approximately and the update is computed inexactly, to within a desired tolerance. The proofs of all the results are given in the appendix.

For the sub-sampled Hessian, H(x(k))H({\bf x}^{(k)}), and the sub-sampled gradient, g(x(k)){\bf g}({\bf x}^{(k)}), consider the update

we have the following with probability (1δ)2(1-\delta)^{2}:

Theorem 6 guarantees global convergence with at least a linear rate which depends on the quantities related to the specific problem, i.e., condition number. As mentioned before, in SSN2 , we have shown, through a finer grained analysis, that the locally linear convergence rate of such sub-sampled Newton method with a constant step size αk=1\alpha_{k}=1 is indeed problem independent. As in Theorem 2, it is possible to combine the two results and obtain a globally convergent algorithm with fast and problem-independent rate.

Let Assumptions (3) and (13) hold. Consider any 0<ρ0,ρ1,ρ2<10<\rho_{0},\rho_{1},\rho_{2}<1 such that ρ0+ρ1<ρ2\rho_{0}+\rho_{1}<\rho_{2}, set

iterations, we have the following with probability (1δ)2k(1-\delta)^{2k}:

otherwise, we get “problem-independent” linear convergence, i.e.,

1.2 Inexact update

As in Section 2.1.2, we now consider the inexact version of (21b), as a solution of

we have the following with probability 1δ1-\delta:

Examples

In this Section, we present an instance of problems which are of the form (1). Specifically, examples from generalized linear models (GLM) are given in Sections 4.1, followed by some numerical simulations in Section 4.2.

Consider the problem of maximum a posteriori (MAP) estimation using any GLM with canonical link function and Gaussian prior. This problem boils down to minimizing the regularized negative log-likelihood as

As mentioned before in Section 3, in order to use Lemma 2, we need to be able to efficiently estimate G(x(k))G({\bf x}^{(k)}) at every iteration. For illustration purposes only, Table 1 gives some very rough estimates of G(x)G({\bf x}) for GLMs. In practice, as a pre-processing and before starting the algorithms, one can pre-compute the quantities which depend only on (ai,bi)({\bf a}_{i},b_{i})’s. Then updating G(x)G({\bf x}) at every iteration is done very efficiently as it is just a matter of computing x\|{\bf x}\|.

2 Numerical Simulations

Gradient Descent (GD) with constant step-size (the step-size was hand tuned to obtain the best performance),

Accelerated Gradient Descent (AGD), , which improves over GD by using a momentum term,

L-BFGS with Armijo line-search and using limited past memory of 1010, 100100

Full Newton’s method with Armijo line-search,

SSN with inexact update (SSN-NX) with inexactness tolerances of (θ1=102,θ2=0.5)(\theta_{1}=10^{-2},\theta_{2}=0.5) for data sets D1D_{1} and D2D_{2}, and (θ1=104,θ2=0.5)(\theta_{1}=10^{-4},\theta_{2}=0.5) for D3D_{3}.

We run the simulations for each method, starting from the same initial point, until F(x)108\|\nabla F({\bf x})\|\leq 10^{-8} or a maximum number of iterations is reached, and report the relative errors of the iterates, i.e., x(k)x/x\|{\bf x}^{(k)}-{\bf x}^{*}\|/\|{\bf x}^{*}\| as well as the relative errors of the objective function, i.e., F(x(k))F(x)/F(x)|F({\bf x}^{(k)})-F({\bf x}^{*})|/|F({\bf x}^{*})|, both versus elapsed time (in seconds). The results are shown in Figure 1.

The first order methods, i.e., GD and AGD, in none of these examples, managed to converge to anything reasonable. For the ill-conditioned problems with data sets D2D_{2} and D3D_{3}, while all instances of SSN converged to the desired accuracy with no difficulty, no other method managed to go past a “single” digit of accuracy, in the same time-period, or anytime soon after! This is indeed expected, as SSN captures the regions with high and low curvature, and scales the gradient accordingly to make progress. These examples show that, when dealing with ill-conditioned problems, using only first order information is certainly not enough to obtain a reasonable solution. In these problems, employing a second-order algorithm such as SSN with inexact update not only yields the desired solution, but also does it very efficiently! In particular note the fast convergence of SSN-NX for both of these problems.

For the much better conditioned problem using the data set D1D_{1}, BFGS and L-BFGS with the history size of 100100 outperform SSN-X with 5%5\% and 10%10\% sampling. This is also expected since solving the 10,000×10,00010,000\times 10,000 linear system exactly at every iteration is the bottleneck of computations for SSN-X, in comparison to matrix free BFGS and L-BFGS. However, even in such a well-conditioned problem where BFGS and L-BFGS appear very attractive, inexactness coupled with SSN can be more efficient. It is clear that all SSN-NX variants converge to the desired solution faster than either BFGS or its limited memory variant. For example, using D1D_{1}, the speed-up of using SSN-NX with 10%10\% of the data over L-BFGS is at least 44 times. Note also that for the very ill-conditioned problem with D3D_{3}, larger sample size, i.e., 20%20\%, was required to obtain a fast convergence, illustrating that the sample size could grow with the condition number.

Conclusion

In this paper, we studied globally convergent sub-sampled Newton algorithms for unconstrained optimization in various settings. In the first part of the paper, we studied the case in which only the Hessian is sub-sampled and the full gradient is used. In this setting, we showed that using random matrix concentration result, it is possible to, probabilistically, guarantee that the sub-sampled Hessian yields a descent direction at every iteration. We then provided global convergence for modifications of this algorithms where the sub-sampled Hessian is regularized by changing its spectrum or ridge-type regularization. We argued that such regularization can only be beneficial at early stages of the algorithm and we need to revert to using the true sub-sampled Hessian (of course, if it is invertible) as iterates get closer to the optimum.

In the second part of the paper, we considered the global convergence of a fully stochastic algorithm in which both the Hessian and the gradients are sub-sampled, independently of each other, as way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from RandNLA to obtain the proper sampling strategy.

Although our main focus here was merely to provide global convergence guarantees for sub-sampled Newton methods under a variety of settings, admittedly, one major downside of the bounds presented in this paper is that they are all pessimistic. In fact, some of the bounds of the present paper exhibit a dependence on the condition-number which is very discouraging. However, the consolation lies in combining the global results presented here and the corresponding local convergence rates of the companion paper, SSN2 . Indeed, in SSN2 , we show that, using Newton’s “natural” step-size of αk=1\alpha_{k}=1, such sub-sampled Newton algorithms enjoy local convergence rates which are condition-number independent. In addition, we show that, through controlling the sub-sampling accuracy, one can make the local convergence speed of such algorithms as close to that of the full Newton’s method as desired (though we stay shy of obtaining its famous quadratic rate). Consequently through such combination of the results of the two companion papers, we guaranteed that, ultimately, the convergence rate of the algorithms of the present paper which use exact update, becomes condition-number independent. In addition, using the Armijo rule, the “natural” step size of αk=1\alpha_{k}=1 will eventually be always accepted.

Finally, despite the fact that we considered the global convergence behavior of the algorithms which incorporate inexact updates, the local convergence properties of such algorithms are not known. The results of SSN2 only address such properties for the algorithms which use exact update. As a result, studying local convergence behavior of such inexact algorithms is left for future work. In addition, here, the global convergence have been established for algorithms for solving unconstrained optimization. SSN2 considers the general case of constrained optimization, but only in studying the local convergence rates of the presented algorithms. As a result, extensions of global convergence guarantees to convex constrained problems are important avenues for future research.

References

Appendix A Proofs

Consider S|\mathcal{S}| i.i.d random matrices Xj(x),j=1,2,,SX_{j}({\bf x}),j=1,2,\ldots,|\mathcal{S}| such that Pr(Xj(x)=2fi(x))=1/n;  i=1,2,,n,\Pr(X_{j}({\bf x})=\nabla^{2}f_{i}({\bf x}))=1/n;\;\forall i=1,2,\ldots,n,. Define

where K^1\widehat{K}_{1} is defined in (6). Hence we can apply Matrix Chernoff [48, Theorem 1.1] or [47, Theorem 2.2] for sub-sampling with or without replacement, respectively, to get

implies that pkTg(x(k))<0{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function.

Now, it suffices to show that there exists an iteration-independent α~>0\widetilde{\alpha}>0, such that the constrain in (11c) holds for any 0αα~0\leq\alpha\leq\widetilde{\alpha}. For any 0α<10\leq\alpha<1, define xα=x(k)+αpk{\bf x}_{\alpha}={\bf x}^{(k)}+\alpha{\bf p}_{k}. By Assumption (3b), we have

Now in order to pass the Armijo rule, we search for α\alpha such that

This latter inequality is satisfied if we require that

satisfies the Armijo rule. So in particular, we can always find an iteration independent lower bound on step size such that the constrain in (11c) holds. On the other hand, for sampling without replacement and from H(x(k))pk=F(x(k))H({\bf x}^{(k)}){\bf p}_{k}=-\nabla F({\bf x}^{(k)}) we get

Similarly for sampling with replacement, we have

Now the result follows immediately by noting that Assumption (3b) implies (see [35, Theorem 2.1.10])

The choice of ϵ\epsilon is to meet a requirement of SSN2 [40, Theorem 2] and account for the differences between Lemma 1 and SSN2 [40, Lemma 1].

The rest of the proof follows closely the line of argument in [9, Section 9.5.3]. define xα=x(k)+αpk{\bf x}_{\alpha}={\bf x}^{(k)}+\alpha{\bf p}_{k}. From (13), it follows that

Defining F^(α):=F(x(k)+αpk)\widehat{F}(\alpha)\mathrel{\mathop{:}}=F({\bf x}^{(k)}+\alpha{\bf p}_{k}), we have

as well as F^(0)=αpkTF(x(k))=αF(x(k))T[H(x(k))]1F(x(k))\widehat{F}^{{}^{\prime}}(0)=\alpha{\bf p}_{k}^{T}\nabla F({\bf x}^{(k)})=-\alpha\nabla F({\bf x}^{(k)})^{T}[H({\bf x}^{(k)})]^{-1}\nabla F({\bf x}^{(k)}) and

The last inequality follows since by the choice of ϵ\epsilon and SSN2 [40, Lemma 5], we have

Hence, with α=1\alpha=1 and denoting c(x):=F(x)T[H(x)]1F(x)c({\bf x})\mathrel{\mathop{:}}=\nabla F({\bf x})^{T}[H({\bf x})]^{-1}\nabla F({\bf x}), we have

Hence, noting that c(x)F(x)2/((1ϵ)γ)c({\bf x})\leq\|\nabla F({\bf x})\|^{2}/((1-\epsilon)\gamma), if

which implies that (11c) is satisfied with α=1\alpha=1.

The proof is complete if we can find kk such that both the sufficient condition of SSN2 [40, Theorem 2] as well as (25) is satisfied. First, note that from Theorem 1, Assumtpion (3b) and by using the iteration-independent lower bound on αk\alpha_{k}, it follows that

In order to satisfy (25), we require that

which yields (14). Again, from Theorem 1 and Assumtpion (3b),we get

and hence the sufficient condition of SSN2 [40, Theorem 2] is also satisfied and we get (15).

We give the proof only for the case of sampling without replacement. The proof for sampling with replacement is obtained similarly.

So, pkTg(x(k))<0{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function. As in the proof of Theorem (1), we get

Hence, in order to pass the Armijo rule, we search for α\alpha such that

it follows that the condition (16a), implies

Now the result follows, using Assumption (3b), as in the end of the proof of Theorem 1.

By the choice of λ(k)\lambda^{(k)} and the convexity of fif_{i}, we have λ(k)>0\lambda^{(k)}>0 and, so by (16b) it gives

So it follows that pkTF(x(k))<0{\bf p}_{k}^{T}\nabla F({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function. Now as before, in order to pass the Armijo rule, we search for α\alpha such that

Now the rest of the proof is similar to that of Theorem 3 by noting that

As before, by the choice of λ>0\lambda>0, convexity of fif_{i}, and (16b) we get

which implies that pkTg(x(k))<0{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function. Similarly to the proof previous theorems, it is easy to see that

satisfies the Armijo rule. The rest of the results also follow as in the proof of Theorem 3 by noting that

A.2 Proofs of Section 3

The proof of the following lemma, in a more general format for constrained optimization, is given in the companion paper, SSN2 [40, Lemma 4]. However, it is given here as well for completeness.

As mentioned before, the full gradient, F(x)\nabla F({\bf x}), can be equivalently written as a product of two matrices as F(x)=AB\nabla F({\bf x})=AB, where

with probability 1δ1-\delta. Now the result follows by requiring that

We give the proof only for the case of sampling without replacement. The proof for sampling with replacement is obtained similarly.

As in the proof of Theorem 1, we first need to show that there exists an iteration-independent step-size, α~>0\widetilde{\alpha}>0, such that the constrain in (21c) holds for any 0αα~0\leq\alpha\leq\widetilde{\alpha}. For any 0α<10\leq\alpha<1, define xα=x(k)+αpk{\bf x}_{\alpha}={\bf x}^{(k)}+\alpha{\bf p}_{k}. By Assumption (3b), we have

which shows that pkTg(x(k))<0{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function. Now, using the above, it follows that

As a result, we need to search for α\alpha such that

Hence, from H(x(k))pk=g(x(k))H({\bf x}^{(k)}){\bf p}_{k}=-{\bf g}({\bf x}^{(k)}), it follows that in order to guarantee an iteration independent lower bound for α\alpha as above, we need to have

which, by the choice of σ\sigma and ϵ1\epsilon_{1}, is imposed by the algorithm. If the stopping criterion succeeds, then by

However, if the stopping criterion fails and the algorithm is allowed to continue, then by

Hence, from H(x(k))pk=g(x(k))H({\bf x}^{(k)}){\bf p}_{k}=-{\bf g}({\bf x}^{(k)}), we get

Finally, using Assumption (3b), the desired result follows as in the end of the proof of Theorem 1.

The choice of ϵ1\epsilon_{1} and ϵ2(k)\epsilon_{2}^{(k)} is to meet a requirement of SSN2 [40, Theorem 13] and account for the differences between Lemma 1 and SSN2 [40, Lemma 1].

In addition, from F(x(k))g(x(k))ϵ2\|\nabla F({\bf x}^{(k)})-{\bf g}({\bf x}^{(k)})\|\leq\epsilon_{2}, we get pkTF(x(k))pkTg(x(k))+ϵ2pk{\bf p}_{k}^{T}\nabla F({\bf x}^{(k)})\leq{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})\|+\epsilon_{2}\|{\bf p}_{k}\| and so

Finally, as in the proof of Theorem 2, we have

Hence, with α=1\alpha=1 and denoting h(x):=g(x)T[H(x)]1g(x)h({\bf x})\mathrel{\mathop{:}}={\bf g}({\bf x})^{T}[H({\bf x})]^{-1}{\bf g}({\bf x}), we have

where the last inequality follows from g(x)2/K^Sh(x)g(x)2/((1ϵ1)γ)\|{\bf g}({\bf x})\|^{2}/\widehat{K}_{|\mathcal{S}|}\leq h({\bf x})\leq\|{\bf g}({\bf x})\|^{2}/((1-\epsilon_{1})\gamma). Now denoting

Now, if the stopping criterion fails and the algorithm is allowed to continue, then by

As a result, in order to satisfy the right hand side of (28), we require that

which yields (22). Again, from Theorem 6 and Assumtpion (3b),we get

and hence the sufficient condition of SSN2 [40, Theorem 13] is also satisfied and we get (23).

The proof is given by combining the arguments used to prove Theorems 3 and 6, and is given here only for completeness. We also give the proof only for the case of sampling without replacement. The proof for sampling with replacement is obtained similarly.

Hence, pkTg(x(k))<0{\bf p}_{k}^{T}{\bf g}({\bf x}^{(k)})<0 and we can indeed obtain decrease in the objective function. For the Armijo rule to hold, we search for α\alpha such that

Now H(x(k))pk+g(x(k))θ1g(x(k))\|H({\bf x}^{(k)}){\bf p}_{k}+{\bf g}({\bf x}^{(k)})\|\leq\theta_{1}\|{\bf g}({\bf x}^{(k)})\| implies

which, by the choice of σ\sigma and ϵ1\epsilon_{1}, is imposed by the algorithm. If the stopping criterion holds, then by

However, if the stopping criterion fails and the algorithm continues, then by

which, since σ4\sigma\geq 4, implies that

it follows that the condition (16a), implies

Now as in the proof of Theorem 3, we get that if

Since F(x(k))ϵ2g(x(k))\|\nabla F({\bf x}^{(k)})\|-\epsilon_{2}\leq\|{\bf g}({\bf x}^{(k)})\|, we get

For part (ii), we note that by H(x(k))pk+g(x(k))θ1g(x(k))\|H({\bf x}^{(k)}){\bf p}_{k}+{\bf g}({\bf x}^{(k)})\|\leq\theta_{1}\|{\bf g}({\bf x}^{(k)})\|, we get

we then square both sides and use (29) to get

Now the result follows, using Assumption (3b), as in the end of the proof of Theorem 1.