Sub-Sampled Newton Methods II: Local Convergence Rates
Farbod Roosta-Khorasani, Michael W. Mahoney
Introduction
In this paper, we provide a detailed analysis of the use of sub-sampling as way to introduce randomness to the classical Newton’s method and derive variants which are more suited for the modern big data problems. In doing so, we give conditions under which the local convergence properties of the full Newton’s method is, to the extend possible, persevered. In particular, we will show that error recursions in all of our results exhibit a composite behavior whose dominating error term varies according to the distance of iterates to optimality. These results give a better control over various tradeoffs which exhibit themselves in different applications, e.g., practitioners might require faster running-time while statisticians might be more interested in statistical aspects regarding the recovered solution.
The rest of the paper is organized as follows: in Section 1.1, we first give a very brief background on the general methodology for optimizing (1). The notation and the assumptions used in this paper are given in Sections 1.2 and 1.3. An overview of the main contributions of this paper is given in Section 1.4. In Section 1.5, we give a brief survey of the related work, and in their light, we discuss, in more details, the contributions of the present article. Section 2 addresses the local convergence behavior of sub-sampled Newton method in the case where only the Hessian is sub-sampled, while the gradient is used in full. Specifically, Section 2.1 gives such local results for when the sub-sampled Hessian is not “altered”, whereas Section 2.2 establishes similar results for the cases where the sub-sampled Hessian is regularized by modifying its spectrum or Levenberg-type (henceforth called ridge-type) regularization. The case where the gradient, as well as Hessian, is sub-sampled is treated in Section 3. A few examples from generalized linear models (GLM) and support vector machine (SVM), very popular classes of problems in machine learning community, are given in Section 4. Conclusions and further thoughts are gathered in Section 5. All proofs are given in the appendix.
where is the step size at iteration . It is well-known, , that full gradient method achieves rates of and , for smooth and smooth-strongly convex objectives, respectively. However, when , the full gradient method can be inefficient because its iteration cost scales linearly in . Consequently, stochastic variants of full gradient descent, e.g., (mini-batch) stochastic gradient descent (SGD) were developed . In such methods a subset is chosen at random and the update is obtained by
When (e.g., for simple SGD), the main advantage of these stochastic methods is that the iteration cost is independent of and can be much cheaper than the full gradient methods, making them suitable for modern problems with large . However, this advantage comes at a cost: the convergence rate of these stochastic variants can be significantly slower that that of the full gradient descent. For example, under standard assumptions, it has been shown(e.g., ) that, for a suitably chosen decreasing step-size sequence , the simple SGD iterations have an expected sub-optimality, i.e., and for smooth and smooth-strongly convex objectives, respectively. Since these sub-linear rates are slower than the corresponding rates for the full gradient method, great deal of efforts have been made to devise modifications to achieve the convergence rates of the full gradient methods while preserving the per-iteration cost of stochastic methods .
The above class of methods are among what is known as first-order methods where only the gradient information is used at every iteration. One attractive feature of such class of methods is their relatively low per-iteration-cost. However, despite such low cost, in almost all problems, incorporating curvature information (e.g., Hessian) as a form of scaling the gradient, i.e.,
can significantly improve the convergence rate. Such class of methods which take the curvature information into account are known as second-order methods, and compared to first-order methods, they enjoy superior convergence rate in both theory and practice. This is so since there is an implicit local scaling of coordinates at a given , which is determined by the local curvature of . This local curvature in fact determines the condition number of a at . Consequently, by taking the curvature information into account (e.g., in the form of the Hessian), second order methods can rescale the gradient so it is a much more “useful” direction to follow. This is in contrast to first order methods which can only scale the gradient uniformly for all coordinates. Such second order information have long been used in many machine learning applications as well as many PDE constrained inverse problems .
It is well known, , that the canonical example of second order methods, i.e., Newton’s method, where is taken to be the inverse of the full Hessian and , i.e.,
However when , the per-iteration-cost of such algorithm is significantly higher than that of first order methods. As a result, a line of research is to try to construct an approximation of the Hessian in a way that the update is computationally feasible, and yet, still provides sufficient second order information. One such class of methods are quasi-Newton methods, which are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In such methods, the approximation to the Hessian is updated iteratively using only first order information from the gradients and the iterates through low-rank updates. Among these methods, the celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its limited memory version (L-BFGS) , are the most popular and widely used members of this class. Another class of methods for approximating the Hessian is based on sub-sampling where the Hessian of the full function is estimated using that of the randomly selected subset of functions , . More precisely, a subset is chosen at random and, if the sub-sampled matrix is invertible, the update is obtained by
where and are sample sets used for approximating the Hessian and the gradient, respectively. The variants (2b) are what we call sub-sampled Newton methods in this paper.
Unlike the sub-sampling in first order methods, theoretical properties of such techniques in the class of second-order methods are not yet very well understood. As a result, our aim in the present paper is to address the local convergence behavior of such sub-sampled Newton methods in a variety of situations (by local convergence, it is meant that the initial iterate is close enough to a local minimizer at which the sufficient conditions hold). This paper has an associated companion paper, , henceforth called SSN1, in which we consider globally convergent sub-sampled Newton algorithms and their convergence properties (by globally convergent algorithm, it is meant an algorithm that approaches the optimal solution starting from any initial point). The reason for splitting into two papers is that, while SSN1 , introduces several of the ideas in simpler settings, the more advanced techniques of this paper are needed to provide more control on the convergence behavior under a variety of assumptions. We expect that the insight into the theoretical properties of the algorithms presented in this paper and SSN1 , will enable the development of still-further improved sub-sampled Newton algorithms in a variety of applications in scientific computing, statistical data analysis, etc.
For the rest of this paper, we consider the general case of constrained optimization (1). More specifically, given the current iterate, , we consider the following iterative scheme,
where and are some approximations to (in our case, sub-samples of) the actual gradient and the Hessian at the iteration, respectively. A variety of first and second order methods are of this form. For example,
full Newton’s method is obtained by setting and ,
a step of the Frank-Wolfe, , algorithm can be obtained by considering and ,
and finally, choosing the pair or for some index sets gives rise to sub-sampled Newton methods, which are the focus of this paper (as well as SSN1 ).
Unlike SSN1 where our focus is mainly on designing sub-sampled Newton algorithms to guarantee global convergence, here, we concentrate on the actual speed of convergence. In particular, we aim to ensure that any such sub-sampled Newton algorithms preserves, at least locally, as much of the convergence properties of the full Newton’s method as possible. In doing so, we need to ensure the following requirements:
Our sampling strategy needs to provide a sample size which is independent of , or at least smaller. Note that this is the same requirement as in SSN1 [44, (R.1)]. However, as a result of the more intricate goals of the present paper, we will show that, comparatively, a larger sample size is required here than that used in SSN1 .
In addition, we need to ensure, at least probabilistically, that the sub-sampled matrix preserves the spectrum of the true Hessian as much as possible. If the gradient is also sub-sampled, we need to ensure that sampling is done in a way to keep as much of this first order information as possible. Note that unlike SSN1 [44, (R.2)], in order to preserve as much of the local convergence properties of the full Newton’s method as possible, the mere invertibility of the sub-sampled Hessian is not enough.
Finally, we need to ensure our algorithms enjoy a reasonably fast convergence rate which is, at least locally, similar to that of the full Newton’s method. Note that unlike SSN1 [44, (R.3)] where mere global convergence guarantee is required, here, the emphasis in on local convergence speed.
In this paper, we address challenges (R.1), (R.2), and (R.3). More precisely, by using a random matrix concentration inequality as well as results from approximate matrix multiplication of randomized numerical linear algebra (RandNLA), we ensure (R.1) and (R.2). To address (R.3), we give algorithms whose local convergence rates can be made close to that of full Newton’s method. These local rates coupled with the global convergence guarantees of SSN1 , provide globally convergent algorithms with fast local rates (e.g., see SSN1 [44, Theorems 2 and 7]). The present paper and the companion, SSN1 , to the best of our knowledge, are the very first to thoroughly and quantitatively study the convergence behavior of such sub-sampled second order algorithms, in a variety of settings.
2 Notation
For a vector and a matrix A, using such cone, we can define their -restricted norms, respectively, as
Similarly, one can define the -restricted maximum and the minimum eigenvalues of a symmetric matrix as
Alternatively, let be an orthonormal basis for the cone . The definitions above are equivalent to the following:
Throughout this paper we make use of two standard definitions of convergence rate: Q-convergence rate and R-convergence rate. Recall that a sequence of vectors is said to converge Q-linearly to a limiting value , if for some ,
Q-superlinear convergence is defined similarly as
The notion of R-convergence rate is an extension which captures sequences which still converge reasonably fast, but whose “speed” is variable. A sequence of vectors is said to converge R-superlinearly to a limiting value , if
for some and a sequence such that
R-superlinear convergence is similarly defined by requiring that
3 Assumptions
Our theoretical analysis is based on the following assumptions.
Lipschitz Hessian: Throughout this article, we assume that each is twice-differentiable and has a Lipschitz continuous Hessian with respect to the cone , i.e., for some ,
where is defined in (5b).
Hessian Regularity: Regularity of the Hessian refers to smoothness and strong convexity of . Such properties can either be required for or alternatively, only at a local minimum, , which is assumed to always exist. In this paper, these properties are referred to as global and local regularity, respectively. In particular, for a given , let and be such that
where and are defined in (5b) and (6a), respectively. We make the following regularity assumptions:
Global Hessian Regularity: Throughout Sections 2 and 3, every convergence result is first given for the case where we have global smoothness and strong convexity with respect to . More precisely, we assume that
where and are defined in (9). Note that since is convex, Assumption (10b) implies strong convexity of which, in turn, implies the uniqueness of .
Local Hessian Regularity: Assumptions (10) are rather restrictive and limit the class of problems where the methods in the present paper apply. Fortunately, these assumptions can be further relaxed and be required to hold only locally at a local optimum . In other words, we require that
where and are defined in (9). Note that since is convex, Assumption (11b) imply that is an isolated local optimum. These assumptions are relaxations of (10). For example, (11b) appears as a second order sufficiency condition for a local optimum . In fact, some highly non-convex problems with multiple local minima exhibit such local structures, e.g., . As a result, each analysis in Sections 2 and 3 is complemented with a convergence results using Assumptions (11), to make them applicable in a more general setting.
Locally Bounded Gradient: In Section 3, in addition to the Hessian, the gradient is also sub-sampled. For some of the results presented there, we give convergence results using the following regularity assumptions on the gradients in the form of boundedness at a local optimum :
For the results of Section 2.2.1 only, due to technical reasons which will be explained there, we need to remove -constrained condition from the above assumptions. More specifically, (8) will be replaced by
Table 1 gives an overview of where each of the above assumptions are used to prove convergence of different algorithms presented in this paper. As noted before, for each algorithm, we first give convergence results using the global regularity assumptions, and then we provide separate convergence results by relaxing these assumptions to hold locally only at .
4 Contributions
The contributions of this paper can be summarized as follows:
Under the above assumptions, we study the local convergence behavior of various sub-sampled algorithms. We show that all of our error recursions exhibit a composite behavior whose dominating error term varies according to the distance of iterates to optimality. More specifically, we will show that a quadratic error term dominates when we are far from an optimum and it transitions to lower degree terms according to the distance to optimality.
Our algorithms are designed for the following settings.
Algorithms 1 and 2 practically implement (3) for the case where only the Hessian is sub-sampled, while the full gradient is used, i.e., . We give locally Q-linear and Q-superlinear convergence rates for Algorithms 1 and 2, respectively.
Algorithms 3 and 4, are modifications of Algorithm 1 in which the sub-sampled Hessian is regularized by modifying its spectrum or by ridge-type regularization, respectively. We show that such regularizations can be used to improve upon the initial progress of the algorithm, and for both of these regularization methods, we give Q-linear convergence rates.
Algorithms 5 and 6 are the implementation of the fully stochastic formulation of (3), in which the gradient as well the Hessian is sub-sampled. We show that one can sub-sample the gradient independently of the Hessian, or simultaneously using the same collection of sample indices and we provide R-linear and R-superlinear convergence rates for these algorithms.
For all of these algorithms, we give quantitative convergence results, i.e., our bounds contain an actual worst-case convergence rates. Though the sample size depends on the condition number of the problem, we show that the rates for the (super)linear convergence phase are, in fact, problem-independent. As a result, by increasing the estimation accuracy, one can arbitrarily improve the speed of local convergence.
Our results here only address the local convergence behaviors of different algorithms for when the initial iterate is sufficiently close to an optimum. However, under global regularity assumptions (10), in SSN1 , we show that by simple modifications of all of our algorithms here, one can obtain globally convergent algorithms which approach the optimum, regardless of the initial starting point. These connections guarantee that the modified algorithms converge globally with a local rate which is problem-independent; see SSN1 [44, Theorems 2 and 7].
Table 2 gives a summary of the main results of this paper. In addition, in Section 1.5 and in light of the related work in the literature, we give more details of the contributions of the present paper.
5 Related Work
The results of Section 2, where the full gradient is used, offer computational efficiency for the regime where both and are large. However, it is required that is not so large as to make the gradient evaluation prohibitive. In such regime (where but is not too large), similar results can be found in :
The pioneering work in establishes, for the first time, the convergence of Newton’s method with sub-sampled Hessian and full gradient. There, two sub-sampled Hessian algorithms are proposed, where one is based on a matrix-free inexact Newton iteration and the other incorporates a preconditioned limited memory BFGS iteration. However, the results are asymptotic, i.e., for , and no quantitative convergence rate is given. In addition, convergence is established for the case where each is assumed to be strongly convex. Here, we concentrate on obtaining non-asymptotic and quantitative local convergence rates of such sub-sampled algorithm under milder assumption where only is required to be (locally) strongly convex. In addition, we extend these algorithms to the case where the gradient as well as the Hessian is sub-sampled. These results are presented in Sections 2 and 3.
Within the context of deep learning, is the first to study the application of a modification of Newton’s method. It suggests a heuristic algorithm where at each iteration, the full Hessian is approximated by a relatively large subset of ’s, i.e. a “mini-batch”, and the size of such mini-batch grows as the optimization progresses. The resulting matrix is then damped in a Levenberg-Marquardt style, , and conjugate gradient, , is used to approximately solve the resulting linear system
Authors in were among the first to explore hybrid methods, combining the inexpensive iterations of incremental gradient algorithms and the steady convergence of full gradient methods, which exhibit the benefits of both approaches. Their analysis shows that by carefully increasing the sample size across iterations, it is possible to maintain the steady convergence rates of full-gradient methods. They also present a practical quasi-Newton implementation based on their approach. Such careful control over the increase of the sample size is one of the main ingredients of our analysis here.
Our analysis here most resembles that of . The work in is the first to establish quantitative convergence rate for the case where the Hessian is sub-sampled and the full gradient is used. The authors consider the two metric projection formulation of the Newton’s method and establish probabilistic linear convergence rate using Hoeffding matrix concentration result. They suggest an algorithm, where at each iteration, the spectrum of the sub-sampled Hessian is modified as a form of regularization. Our work here resembles that of , but is different in many aspects:
For the constrained optimization, the results in the present paper are given for Newton’s method formulated as a scaled gradient projection method (3), whereas the results in are given for what is called as two metric projection method. The main difficulty with the latter formulation is that an arbitrary positive definite matrix will not necessarily yield a descent direction, i.e., for all , or even the algorithm might not recognize, i.e., fail to stop at, an stationary point [3, Section 2.4], which might result in instability in the algorithms relying on choosing the correct . However, the two metric projection method, can potentially have an easier projection step as opposed to scaled gradient projection (3). In fact, the scaled gradient projection algorithm can be viewed as a form of projection method with norm , resulting in an oblique projection. It should also be noted that, for the case of unconstrained optimization, the both formulations coincide.
Within the context of scaled gradient projection method, we extend the work in in several dimensions. More specifically, we give a wide range of results from simple sub-sampling to adding different kinds of regularization. For each of these methods, we give sufficient conditions for local convergence of the respective algorithm, under both global and local regularity assumptions as in Section 1.3. We also give results for the case where both Hessian and the gradients are sub-sampled; see Sections 2 and 3.
In situations where the Hessian is sparse, spectral regularization suggested in can destroy the sparsity. This in turn might make the storage and the computation with the resulting matrix less efficient. Here, in addition to extending the results using such spectral regularization, in Section 2.2.2, we suggest an alternative using ridge type regularization, which effectively achieves the same goal as the spectral counterpart, but at no extra computational or storage cots.
In addition to results with global regularity assumptions, in every section, we give additional convergence results under milder assumptions where the regularity is only required at a local minimum. In addition, most regularity assumptions are given with respect to the tangent cone of the constraints at a local minimum. These assumptions, depending on the geometry of this cone, can be significantly weaker than those made for unconstrained optimization (e.g., compare (10) and (11) with (13)).
As for the sufficient condition to guarantee convergence using the spectral regularization, in [19, Corollary 3.5], a sample size of is necessarily required (to at least obtain a positive region of convergence for the initial iterate), where is the spectral regularization threshold as in Section 2.2.1. However, the sample size required for Theorems 7 and 8 is of order , which is much smaller.
The results of Section 3, apply to more general settings where can be arbitrarily large. This is so since sub-sampling the gradient, in addition to that of the Hessian, allows for per-iteration cost which can be much smaller than . Within the context of first order methods, there has been numerous variants of gradient sampling from a simple stochastic gradient descent, , to the most recent improvements by incorporating the previous gradient directions in the current update . For second order methods, such sub-sampling strategy has been successfully applied in large scale non-linear inverse problems . However, to the best of our knowledge, Section 3 offers the first quantitative convergence results for such sub-sampled methods. In particular, these results guarantee the local convergence of many heuristic algorithms used in large scale nonlinear inverse problems, e.g. .
Sub-Sampling Hessian
For the optimization problem (1), at each iteration, consider picking a sample of indices from , uniformly at random with or without replacement. Let and denote the sample collection and its cardinality, respectively and define
to be the sub-sampled Hessian. As mentioned before in Section 1.1, in order for such Hessian sub-sampling to be effective in reducing as well as yielding a fast convergent algorithm, we need to ensure that the sample size satisfies the requirement (R.1), while as mentioned in (R.2), the spectrum of is as close to that of as possible. The latter requirement is reinforced by noticing that the bounds in Section A.1.1 contain quantities which directly relate to the relative eigenvalue distributions of and (with respect to the cone ). This indicates that the closer we can approximate the spectrum of the full Hessian, the faster the convergence is expected to be. Below, we make use of some matrix concentration inequalities to probabilistically guarantee the above properties.
For a given , let
With the above in mind, we can present our main sub-sampling Lemma to ensure that, to a desired accuracy, the spectrum of the full Hessian is preserved after sub-sampling.
Given any , and , if
then for defined in (14), we have
where is the -restricted condition number at , defined in (15), and is defined in (7).
Lemma 1 indicates that, with respect to the cone , it is indeed possible to sub-sample in a way that the spectrum of the sub-sampled Hessian does not deviate from that of the full Hessian.
Below, we show a couple of ways to improve upon the lower bound (16). First, we show that if each is convex, it is possible to reduce the constant factor further:
Given any , , and , if and
then (17) holds for defined in (14).
Even in the absence of convexity of each , it is still possible to improve the dependence on by using the concept of intrinsic dimension [55, Chapter 7]. This, in turn, allows us to incorporate the dimension of the subspace in our bound. More specifically, we have the following Lemma:
be the intrinsic dimension (a.k.a effective rank) of . Given any , , and , if
and is chosen uniformly at random with replacement, then (17) holds for defined in (14).
Comment 1: It is important to note that , and consequently, when is low dimensional subspace, (19) can indeed offer some computational savings, compared to (16). In fact, if computing is not feasible, one can simply replace that with in (19). For the rest of this paper, however, we will simply stick with the original bound (16), and note that extension to the above improvements is trivially done.
Comment 2: It might be worth noting the differences between the above sub-sampling strategies and that of SSN1 [44, Lemma 1]. These differences, in fact, boil down to the differences between the requirement (R.2) and the corresponding one in SSN1 [44, Section 1.1, (R.2)]. As a result of a “finer-grained” analysis in the present paper and in order to preserve as much of the local convergence properties of the full Newton’s method, Lemmas 1, 2, and 3 require a larger sample size, i.e., in the order of vs. for SSN1 [44, Lemma 1], while delivering a much stronger guarantee about the spectrum of the sub-sampled Hessian. In contrast, for the global analysis in SSN1 , we only need to ensure that the sub-sampled Hessian is invertible to yield a descent direction at every iteration.
Comment 3: In our results and algorithms below, we use Lemma 1 with the following iteration independent sample sizes:
if Assumptions (10) hold, we use a larger and, yet, iteration-independent sample size
with , defined as in (10), and
with , and defined as in (11).
For the results and algorithms of Section 2.2.1 where Assumptions (13) are used, we use similar sample sizes, but with the respective constants defined as in (13).
Using the above sampling strategies, we can now present our main algorithms for the case of sub-sampling Hessian and full gradient (22). Note that for the following algorithms, we always consider the “natural” Newton step size, i.e., . A brief explanation for this choice is given in the beginning of Section A.1.2.
Throughout this section, for a given , we consider the update
where is as in (14). Note that (22) is the same as (3) with , i.e., in (22) the full gradient in used.
Global Regularity: Let Assumptions (8) and (10) hold and let and be given. Set as described in Comment 2, and let be as in (14). Then, for the update (22) with , with probability , we have
Local Regularity: Under Assumptions (8), (11), and (53), the result of theorem 1 holds with
Comment 4: Bounds given here exhibit a composite behavior where the error recursion, when far from an optimum, is first dominated by a quadratic term and it transforms to linear term near an optimum. What is rather interesting is that the rate for the linear phase is indeed independent of any problem dependent quantities, and only depends on the sub-sampling accuracy! Of course, such problem dependent constants indeed appear in the lower bound for the sample size, in the form of (local) condition number.
Now we establish sufficient conditions for Q-linear convergence of sub-sampled Newton methods (22). We remind that in this case, the sub-sampling is done only for the Hessian and the full gradient is used.
Global Regularity: Let Assumptions (8) and (10) hold and consider any . Using Algorithm 1 with
with as in Theorem 1, we get locally Q-linear convergence
Local Regularity: Under Assumptions (8), (11), Theorem 2 holds with
In addition to Q-linear convergence, if the accuracy by which Hessian is sampled increases as the iterations progress, it is also possible to obtain Q-superlinear convergence.
Let the respective assumptions of Theorem 2 hold. Using Algorithm 2, with
if satisfies (26) with , , and , we get locally Q-superlinear convergence
with probability , where is as in (24) (or (25)) with .
It is even possible to obtain Q-superlinear rate with a very mild growth of the sample size. For example, we can consider a growth which is only logarithmic with the iteration counter.
Global Regularity: Let Assumptions (8) and (10) hold. Using Algorithm 2 with
Local Regularity: Under Assumptions (8), (11), the result of theorem 4 holds with
Comment 5: Note that since for all , Theorem 4 guarantees that after only a few iterations, we see a very fast local convergence rate with only logarithmic increase in Hessian estimation accuracy.
Comment 6: Theorems 3 and 4 state that in order to obtain locally superlinear convergence, as we get closer to the optimal solution, the Hessian is required to be estimated more accurately. The rate at which this estimation accuracy is improved, in turn, directly determines that of the Q-superlinear convergence.
2 Modifying the Sample Hessian
As was discussed earlier, the composite behavior in the error recursion of Theorem 1 indicates that in early stages of the algorithm, when the iterates are far from an optimum, the error is dominated by a quadratic term. Now it can be seen from (23) that the quadratic term is negatively affected by small values of , and unlike the linear term, the estimation accuracy cannot reverse the effect. In other words, small values of can hinder the initial progress and even using the full Hessian cannot address this issue. As a result, one might resort to regularization of the (estimated) Hessian to improve upon the initial convergence rate. Here, we explore two options for such regularization and will discuss the pros and cons of such strategies.
In this section, we follow the ideas presented in , by accounting for such a potentially negative factor , through a regularization of eigenvalue distribution of the sub-sampled Hessian. More specifically, for some , let
The operation (30) can be equivalently represented as
with being the eigenvector of corresponding to the eigenvalue, . Operation (30) can be performed using truncated SVD (TSVD). Note that although TSVD can be done through standard methods, faster randomized alternatives exist which provide accurate approximations to TSVD much more efficiently . In the constrained optimization case, where the restricted eigenvalues of with respect to the cone are not a priori known, it is impossible to compute . As a result, we replace Assumptions (8), (10) and (11), with the corresponding assumptions in (13).
For this regularization, we have the following error recursion result:
Let Assumptions (13a), (13b), and (13c) hold and let and be given. Set as described in Comment 2, and as in (14). For some , let
where is as in (30). Then, for the update (22) with , if , it follows that (23) holds with probability where
Now we consider the local regularity counterparts of Assumptions (13b) and (13c) by replacing them with (13d) and (13e). Under such relaxed assumptions, we have the following convergence result:
Under Assumptions (13a), (13d), and (13e) hold, if , the result of Theorem 5 holds with
Comment 7: The results of Theorems 5 and 6 indicate that, by increasing the threshold , we get a better factor for the quadratic term, but a worse factor for the linear term. This is specially important in light of the fact that, without regularization and by Theorem 1, linear term only depends on sub-sampling accuracy which can be arbitrarily made small. Hence, it might be advisable to have larger threshold, , when far from the solution and gradually decrease the threshold, as iterates get closer to the optimum. This remedies the slow initial progress and yet allows for the sub-sampling dependent linear rate to be recovered when close enough to the solution.
Let Assumptions (13a), (13b), and (13c) hold. Using Algorithm 3, if
and at every iteration the threshold is chosen as
we get locally Q-linear convergence with variable rates
with probability , where
Let Assumptions (13a), (13d), and (13e) hold. Using Algorithm 3, if
and at every iteration the threshold is chosen as
Comment 9: To get convergence, we only need a rough estimate of the smallest eigenvalue of at iteration . As a result, one can consider relatively large in Algorithm 3 and have a small sample size for such rough estimation.
Comment 10: The results of Theorems 7 and 8 indicate that, when close enough to the solution, increasing the threshold slows down the convergence rate! Hence, more aggressive regularization might, in fact, adversarially affect the efficiency of the algorithm. As a result, it is only advisable to have large thresholds at early stages of the algorithm. For example, using Theorem 5 and any given , it can be shown that if , for some , then while
2.2 Ridge Regularization
As an alternative to the spectral regularization, we can consider the following simple ridge-type regularization
for some , similar to the Levenberg-Marquardt type algorithms . Such regularization might be preferable to the spectral regularization of Section 2.2.1, as it avoids the projection operation (30) at every iteration. In addition, in order to establish convergence in Section 2.2.1, the values of the threshold have to be chosen with regards to the eigenvalue distributions of (or ). As a result of this undesirable coupling, in Section 2.2.1, assumptions on the full eigenvalue distribution (as opposed to the ones restricted to the cone of the constraints) were unavoidable, and depending on the cone , this can be significant (compare Assumptions (13) with their respective counterparts in (8), (10) and (11)). However, here can be chosen a priori without any restriction and also, if desired, can be kept fixed across all iterations. As a result, we can continue to use Assumptions (8), (10) and (11), and still acquire convergence. For such regularization, we have the following results:
Global Regularity: Let Assumptions (8) and (10) hold, and let and be given. Set as described in Comment 2 and as in (14). For some , let be as in (37). Then, for the update (22) with , it follows that (23) holds with probability where
Local Regularity: Under Assumptions (8), (11), and (53) the result of theorem 9 holds with
Comment 11: From Theorem 9 it can be seen that increasing results in decreasing the rate for the quadratic term, whilst increasing that of the linear term. As a result, there is a trade-off in choosing the values of regularization and it might be preferable to have larger values for when far from the solution and gradually decrease it, as iterates get closer to the optimum. This is so because as gets larger, the method tends to behave more like first order gradient descent.
Global Regularity: Let Assumptions (8) and (10) hold. For any , consider and such that
if (26) holds with as in Theorem 9, then with probability , we get locally Q-linear convergence as in (27) with the rate .
Local Regularity: Under Assumptions (8), (11), for any
Comment 12: Similar observations as in Theorem 7 and 8 can also be made here. In other words, we see that large value for negatively affects the linear convergence phase, while it is beneficial for the early stages of iteration where the rate is quadratic. For example, using Theorem 9 and any given , it can be shown that if, for some ,
Sub-Sampling Hessian & Gradient
In order to compute the update in Section 2, full gradient was used. In many problems, this can be a major bottleneck and reduction in computational costs can be made by considering sub-sampling the gradient as well. Such methods have recently been used in PDE constrained inverse problems, , where each function is only implicitly available. In such problems, computing requires solving a PDE twice, amounting to a total cost of PDE solves for each full gradient evaluation. In high dimensional settings where , this can pose a significant challenge and sub-sampling the gradient can, at times, drastically reduce the computational complexity of many problems. As such, for this section, we consider the general update of (3), where is a sub-sampled approximation to .
As in Section 2, consider picking a sample of indices from , uniformly at random with replacement. Also let and denote the sample collection and its cardinality, respectively. Let
be the sub-samples gradient. As mentioned before in the requirement (R.2), we need to ensure that sampling is done in a way to keep as much of the first order information from the full gradient as possible.
By a simple observation, the gradient can be written in matrix-matrix product from as
Hence, we can use approximate matrix multiplication results as a fundamental primitive in RandNLA , to probabilistically control the error in approximation of by , through uniform sampling of the columns and rows of the involved matrices above.
For a given , let
For any and , if
then for defined in (38), we have
where is defined as in (5a).
Comment 13: Note that for the results and algorithms below which make use of Assumption (12), we use the sample size, , given by Lemma 4 with related to a local optimum, . However, for those which do not make such assumption, the sample size, , from Lemma 4 is given with the current iterate, . For these latter algorithms, we need to be able to efficiently estimate at every iteration or, a priori, have a uniform upper bound for for all . Fortunately, in many different problems, it is often possible to estimate very efficiently; see Section 4 and SSN1 [44, Section 4] for concrete examples of a uniform and per-iteration estimations, respectively.
In this section, we present various algorithms which incorporate both Hessian and gradient sub-sampling. In such a setting, one is generally faced with two options: either to sub-sample the gradient and the Hessian independently of each other, or to use the same collection of indices and perform simultaneous sub-sampling for both. In this section, we present algorithms and convergence results for both of these strategies. Combining the sampling of Lemma 1 for the Hessian that of Lemma 4 for the gradient, we are now in the position to present our main results for this Section. We remind that, for this Section, we consider the general update of (3), where and are sub-sampled approximations to and , respectively.
Global Regularity: Let Assumptions (8) and (10) hold, and let , , and be given. Set as described in Comment 2 with and as in (39) with and . Independently, choose and , and let and be as in (14) and (38), respectively. Then, for the update (3) with , with probability , we have
Local Regularity: Under Assumptions (8), (11), and (53) with , the results of Theorem 11 holds with
Comment 14: Similar to the case of Hessian sub-sampling, the bounds given here exhibit a composite behavior where the error is at first dominated by a quadratic term, which transforms to linear term, and finally is dominated by the rate at which the gradient is approximated near an optimum.
It is also possible to obtain an alternative convergence result under local regularity assumption on the gradients at a local optimum as in (12). This gives rise to an algorithm where the Hessian and the gradient are simultaneously sub-sampled, resulting in dependency among the approximations; see Algorithm 6.
Let Assumptions (8), (11), and (12) hold, and let and be given. Using the same , set as described in Comment 2 and as in (39) with . Select a sample set of size , uniformly at random and let and be as in (14) and (38), respectively, both using . For the update (3) with , if satisfies (53), then with probability , we have
We are now in the position to give R-linear convergence of the proposed algorithms.
Global Regularity: Let Assumptions (8) and (10) hold. Consider any , , and such that . Let
if the initial iterate, , satisfies
Local Regularity: Under Assumptions (8) and (11), Theorem 13 holds with
Comment 15: From Theorems 13, it can be seen that in order to get linear convergence rate, estimation of the gradient must be done, progressively, more accurately, whereas the sample size for Hessian can remain unchanged across iterations. This is in line with the common knowledge where, in practice, as the iterations get closer to the optimal solution, the accuracy of gradient estimation is more important than that of Hessian.
Let Assumptions (8), (11) and (12) hold and consider any . Let be such that
Using Algorithm 6, if the initial iterate, , satisfies
we get locally R-linear convergence as in (41) with probability .
Comment 16: Unlike Theorem 13, the assumptions in Theorem 14 are more relaxed and the sampling for both are done simultaneously, i.e., using the same sample collection . As a result in order to guarantee linear convergence, using Algorithm 6, sampling the gradient and the Hessian both must be done, progressively, more accurately.
Using a slight modification of Algorithm 6, it is even possible to obtain R-superlinear convergence. However, in this case, the sample size increase must be done “extremely aggressively”, which might render such strategy rather impractical and we only give the following result for completeness. By aggressive, we refer to a very fast rate at which is decreased across iterations, compared to a more moderate decrease of the previous results.
Under the assumptions of Theorem 14, Algorithm 6 with
Examples
In this Section, we present a few instances of problems which are of the form (1). Specifically, examples from generalized linear models (GLM) and linear support vector machine (SVM) are given in Sections 4.1 and 4.2 respectively.
Consider sparse maximum likelihood (ML) estimation using any GLM with canonical link function. One way to induce sparsity is by using the constraint set, , as
Hence, sparse ML is equivalent to minimizing the negative log-likelihood over , where the negative log-likelihood of such model can be written as
The cumulant generating function, , determines the type of GLM. For example, gives rise to ordinary least squares formulation (OLS), while and yield logistic regression (LR) and Poisson regression (PR), respectively. It is also easily verified that the gradient and the Hessian of are
As mentioned in Comment 3, in order to use Lemma 4, we need to be able to efficiently estimate at every iteration or, a priori, have a uniform upper bound for for all . For illustration purposes only, Table 3 gives some very rough estimates of the constant for GLMs with respect to . Note that, here, is a uniform bound for . For an example where a per-iteration bound can be efficiently computed see Section 4.2.
2 SVM with Smooth Quadratic Hinge Loss
A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. The vast majority of text books and articles introducing SVM very briefly state the the primal optimization problem, and then go directly to the dual formulation . Primal optimizations of linear SVMs have already been studied by . This is so since the dual problem does not scale well with the number of data points, e.g., for some approaches, , the primal might be better-suited for optimization of linear SVMs.
where and is either (hinge loss) or 2 (quadratic loss). Note that, here, the coordinate corresponding to the intercept is implicitly included in each . The unconstrained optimization formulation of the above is
the gradient and the Hessian of is given as follows
As mentioned before, in practice, as a pre-processing and before starting the algorithms, one can pre-compute the quantities which depend only on ’s. Then updating at every iteration is done very efficiently as it is just a matter of computing .
Conclusion
In this paper, we studied the local convergence behavior of sub-sampled Newton methods for constrained optimization in two general settings. We first studied the convergence behavior of the algorithm in which only the Hessian is sub-sampled and the full gradient is used. We showed that the error recursion has a composite nature where it is first dominated by a quadratic term and is subsequently transformed to a linear term near an optimum. In such a setting, we proved locally Q-linear convergence of the proposed algorithm. We also showed that by increasing the accuracy of such sub-sampling as the iterations progress, one can indeed recover superlinear rates. We then studied the situations in which the sub-sampled Hessian is regularized by either modifying its spectrum or by ridge-type regularization. We showed that such regularization can only be beneficial at early stages of the algorithm, i.e., the phase where the quadratic term dominates the error recursion, and we need to revert to using the unmodified sub-sampled Hessian as iterates get closer to the optimum.
Finally, we studied the local convergence of a fully stochastic algorithm in which both Hessian and the gradients are sub-sampled. Such sub-sampling can be done independently for the Hessian and the gradients, or simultaneously where one sample collection of indices is used for sub-sampling both. As before, for all of our results, we showed the composite rate of the error recursions. We also showed that by progressively increasing the sub-sampling accuracy of the gradient, we can indeed guarantee a local R-linear convergence rate. In addition, we argued that through a more aggressive sub-sampling strategy, one can obtain a superlinear rate.
For all of our results, we showed that the error bounds exhibit a composite behavior whose dominating term varies according to the distance of iterates to optimality. For example, for Hessian sub-sampling, we showed that, when far from a local optimum, the dominating error term is quadratic which then transitions to linear when the iterates are within a small enough region around that local optimum. These results might give a better control over various tradeoffs which exhibit themselves in different applications, e.g., practitioners might require faster running-time while statisticians might be more interested in statistical aspects regarding the recovered solution.
One major drawback of our algorithms is that we require the quadratic sub-problems (3) to be solved exactly. This can indeed be a computational challenge in many situations. In SSN1 and in the context of globally convergent sub-sampled algorithms for unconstrained problems, we relax this requirement and solve such sub-problems only approximately. This results in globally convergent sub-sampled algorithms with inexact updates which have much lower per-iteration cost. The extension of such inexact updates for the sub-sampled algorithms for constrained optimization and studying their local convergence behavior are left for future work.
References
Appendix A Proofs
Here we give the proofs of all of the results in Section 2.
Let be an orthonormal basis for the cone defined in (4). Consider i.i.d random matrices such that . Define
Hence we can apply Operator-Bernstein inequality [21, Theorem 1] to get
Noting that for any symmetric matrix ,
Now having a sample size as in (16), we get that
Similarly, using [2, Theorem 8.4.11, Eqn. (8.4.12)], and for any such that , we have
Let be as in the proof of Lemma 1. Since is convex, we have , , and for , we have
and hence, after obtaining the same bound for and applying Operator-Bernstein inequalit as in the proof of Lemma 1, we get a lower bound for the sample size as in (18).
Let be as in the proof of Lemma 1. We have
we can apply Matrix Bernstein using the intrinsic dimension [55, Theorem 7.7.1], to get for ,
Applying the same bound for and , followed by using the union bound, we get the desired result. It is also easily verified that (19) implies (45), and so our derivation is valid.
Note that for the simplicity of the presentation of the proofs of our main results, we will make use of the following lemma, which is just a simple variant of Lemma 1, and whose proof is very similar to that of Lemma 1
Given any , and , if the sample size is as in (16), then for defined in (14), we have all of the following statements, simultaneously, with probability :
We first give some structural lemma which will be the foundation of our main results for this Section.
Let Assumptions (8) and (10) hold. Also assume that
Define . By optimality of in (22), we have for any ,
In particular, setting , and noting that , we get
Since by optimality of , we have , it follows that
where is defined as in (5b).
By Assumption (49), we have . Hence we finally get
Comment 17: Note that for the case of and , we exactly recover the convergence rate for the standard Newton’s method [3, Proposition 1.4.1].
Under the relaxed assumptions (11), we have the following result:
Let Assumptions (8) and (11) hold. In addition, we assume that
As in Lemma 6, let . We get,
By Assumption (50), we have . Hence, by setting , we get
A.1.2 Main proofs
Before delving in to the proofs, we first make a note regarding the step-size chosen for all of our algorithms. Let us denote
By Weyl’s inequality [2, Theorem 8.4.11], we have
minimizes the maximum of both right hand sides. Now if the sampling is done as in Lemma 5, then on the event that
Hence, the optimal step size for the linear term is . As a result, for the following results, we always consider the “natural” Newton step size (i.e., ).
In addition, using (47), the requirement (52) holds if
where we have . Also, under Assumptions (10b) and (11b), and always satisfy (49) and (50), respectively.
Since is set as described in Comment 2 and , then (48) holds with probability . Now the results follow immediately by applying Lemmas 6 and 7 and noting that for our choice of , we have .
Using this particular choice of , Theorem 1, for every , yields
Also note that for the latter part of Theorem 2, since , by induction we have, for every
Let denote the event that . The overall success probability is
since for every , the conditional probability of a successful update , given the past successful iterations , is .
where depending on the assumptions of the theorem, and are as in (24) or (25) using . Note that, by and set as in Theorem 2, it follows that
We prove the result by induction on . Define . For , by assumptions on , , and , we have
Now assume that (28) holds up to the iteration . For , we get
By induction hypothesis, we have , and
Under the local regularity Assumptions (11), we can see by induction that for every
so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.
We only give the proof for the first part, as the proof for the second part is almost identical. By the choice of in (24), we have
We again prove the result by induction on . For , by assumptions on , we have
Now assume that (29) holds up to the iteration . For , we get
i.e., is decreasing, it follows that, we have
Now, by induction hypothesis, we have , and
Now suppose is chosen as described in Comment 2. By Lemma 5, it follows that, with probability , we have
and using the same line of reasoning as in the proof of Lemma 6.
Proof goes along the same line as that of Lemma 7. In particular, using Assumption (13a), we have the following chain of inequalities
where the second inequality follows by [5, Lemma VII.5.5] and noting that the scalar version of the projection operator (30), i.e., is 1-Lipschitz. Specifically,
By setting as in (33) and using Lemma 5, we in fact have and a requirement of Theorem 5 holds with probability . Now we solve for , so that for some , which is to be determined later, we have
For this choice of , using Theorem 5, we get (23) with and . Now setting
it can be easily shown, by induction, that under Assumption (31) and the choice of in (32), we have
for every . Furthermore, since two sampling steps in Algorithm 3 are independent, the success probability for each iteration is . Now the overall success probability can computed as in the end of the proof of Theorem 2.
Suppose at iteration , (34) holds. By Weyl’s inequality and Assumption (13a), we have
Hence, by setting as in (36), we in fact have and a requirement of Theorem 6 holds with probability . The rest of the proof follows the same reasoning as in the proof of Theorem 7 by using the results of Theorem 6.
The first part is straightforward by using Lemma 6. The proof of the second part follows the same reasoning as in Lemma 7 by noting
Proof follows the same reasoning as before using Theorem 9, so we omit the details. Note that, for the latter part of Theorem 10, since , we have
so the condition (53) is satisfied. The last inequality follows since, by the choice of and , we have
A.2 Proofs of the theorems and lemma of Section 3
Here we give the proofs of all of the results in Section 3.
Same as in the proof of Lemma 1, let be an orthonormal basis for . Hence, by the assumption and the definition (5a), we have that at a given
As mentioned before, the gradient can be equivalently written as a product of two matrices as , where
with probability . Now the result follows by requiring that
As before, we first give some structural lemmas which form the foundation of our main results for this Section.
Let Assumptions (8), (10) and (49) hold. For the update (3), we have
where is as in (5a).
The result is obtained as in the proof of Lemma 6, and using the identity , and noting that .
As in Section A.1.1, it is possible to have similar results as in Lemma 8 by replacing (10) with its local counterpart (11). The proof is omitted as it is similar to the proof of previous lemmas.
Let Assumptions (8), (11), (50), and (51) hold. For the update (3), if satisfies (52), we have
It is even possible to relax Assumption (11a) and give a more general result where the gradient approximation error is measured only at a local optimum .
Let Assumptions (11b), (50) and (51) hold and let
denote the Jacobian of at . Also assume that for some ,
For the update (3) with , if satisfies (52), we have
As in Lemma 6, let . We get,
where, for the last inequality, we used the fact that, by first order optimality condition, we have . Using mean value theorem, we get
Now the same reasoning as in the end of the proof of Lemma 7 gives the result.
A.2.2 Main proofs
The results are immediately obtained using Lemmas 8 and 9.
First note that since and are not independent, the joint probability that they are within the desired tolerance, by sub-additivity of probability, is lower bounded by . Now by construction, , and the result follows by Assumption (8) on Lipschitz continuity of and Lemma 10 with .
Using Theorem 11, the particular choice of and , for each , gives
where and . We prove the result by induction on . Define . For , using Assumption (40) and noting that by the definition of
Now assume that (41) holds for . For , we get
Under the local regularity assumptions (11), we can see by induction that for every
so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.
First note that by (42) and (43), we get that
The proof is again by induction using Theorem 12. First note that, by construction and the value of ,
and . For , we have
where the last inequality follows from the definition (43) and noting that
Now assume that (41) holds for . For , we get
Under the local regularity assumption (11b), we can see by induction that for every
so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.
First, as in the proof of Theorem 14, by (42) and (43) , we get
Now we note that , , and
Again, the result is obtained by induction on . Define . For the base case of , we have
where for the inequality, we used the definition of and noting that
Now assume that (44) holds for . For , we have