Sub-Sampled Newton Methods II: Local Convergence Rates

Farbod Roosta-Khorasani, Michael W. Mahoney

Introduction

In this paper, we provide a detailed analysis of the use of sub-sampling as way to introduce randomness to the classical Newton’s method and derive variants which are more suited for the modern big data problems. In doing so, we give conditions under which the local convergence properties of the full Newton’s method is, to the extend possible, persevered. In particular, we will show that error recursions in all of our results exhibit a composite behavior whose dominating error term varies according to the distance of iterates to optimality. These results give a better control over various tradeoffs which exhibit themselves in different applications, e.g., practitioners might require faster running-time while statisticians might be more interested in statistical aspects regarding the recovered solution.

The rest of the paper is organized as follows: in Section 1.1, we first give a very brief background on the general methodology for optimizing (1). The notation and the assumptions used in this paper are given in Sections 1.2 and 1.3. An overview of the main contributions of this paper is given in Section 1.4. In Section 1.5, we give a brief survey of the related work, and in their light, we discuss, in more details, the contributions of the present article. Section 2 addresses the local convergence behavior of sub-sampled Newton method in the case where only the Hessian is sub-sampled, while the gradient is used in full. Specifically, Section 2.1 gives such local results for when the sub-sampled Hessian is not “altered”, whereas Section 2.2 establishes similar results for the cases where the sub-sampled Hessian is regularized by modifying its spectrum or Levenberg-type (henceforth called ridge-type) regularization. The case where the gradient, as well as Hessian, is sub-sampled is treated in Section 3. A few examples from generalized linear models (GLM) and support vector machine (SVM), very popular classes of problems in machine learning community, are given in Section 4. Conclusions and further thoughts are gathered in Section 5. All proofs are given in the appendix.

where αk\alpha_{k} is the step size at iteration kk. It is well-known, , that full gradient method achieves rates of O(1/k)\mathcal{O}(1/k) and O(ρk),  ρ<1\mathcal{O}(\rho^{k}),\;\rho<1, for smooth and smooth-strongly convex objectives, respectively. However, when n1n\gg 1, the full gradient method can be inefficient because its iteration cost scales linearly in nn. Consequently, stochastic variants of full gradient descent, e.g., (mini-batch) stochastic gradient descent (SGD) were developed . In such methods a subset S{1,2,,n}\mathcal{S}\subset\{1,2,\cdots,n\} is chosen at random and the update is obtained by

When Sn|\mathcal{S}|\ll n (e.g., S=1|\mathcal{S}|=1 for simple SGD), the main advantage of these stochastic methods is that the iteration cost is independent of nn and can be much cheaper than the full gradient methods, making them suitable for modern problems with large nn. However, this advantage comes at a cost: the convergence rate of these stochastic variants can be significantly slower that that of the full gradient descent. For example, under standard assumptions, it has been shown(e.g., ) that, for a suitably chosen decreasing step-size sequence αk\alpha_{k}, the simple SGD iterations have an expected sub-optimality, i.e., O(1/k)\mathcal{O}(1/\sqrt{k}) and O(1/k)\mathcal{O}(1/k) for smooth and smooth-strongly convex objectives, respectively. Since these sub-linear rates are slower than the corresponding rates for the full gradient method, great deal of efforts have been made to devise modifications to achieve the convergence rates of the full gradient methods while preserving the per-iteration cost of stochastic methods .

The above class of methods are among what is known as first-order methods where only the gradient information is used at every iteration. One attractive feature of such class of methods is their relatively low per-iteration-cost. However, despite such low cost, in almost all problems, incorporating curvature information (e.g., Hessian) as a form of scaling the gradient, i.e.,

can significantly improve the convergence rate. Such class of methods which take the curvature information into account are known as second-order methods, and compared to first-order methods, they enjoy superior convergence rate in both theory and practice. This is so since there is an implicit local scaling of coordinates at a given xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, which is determined by the local curvature of FF. This local curvature in fact determines the condition number of a FF at x{\bf x}. Consequently, by taking the curvature information into account (e.g., in the form of the Hessian), second order methods can rescale the gradient so it is a much more “useful” direction to follow. This is in contrast to first order methods which can only scale the gradient uniformly for all coordinates. Such second order information have long been used in many machine learning applications as well as many PDE constrained inverse problems .

It is well known, , that the canonical example of second order methods, i.e., Newton’s method, where DkD_{k} is taken to be the inverse of the full Hessian and αk=1\alpha_{k}=1, i.e.,

However when n,p1n,p\gg 1, the per-iteration-cost of such algorithm is significantly higher than that of first order methods. As a result, a line of research is to try to construct an approximation of the Hessian in a way that the update is computationally feasible, and yet, still provides sufficient second order information. One such class of methods are quasi-Newton methods, which are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In such methods, the approximation to the Hessian is updated iteratively using only first order information from the gradients and the iterates through low-rank updates. Among these methods, the celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its limited memory version (L-BFGS) , are the most popular and widely used members of this class. Another class of methods for approximating the Hessian is based on sub-sampling where the Hessian of the full function FF is estimated using that of the randomly selected subset of functions fif_{i}, . More precisely, a subset S{1,2,,n}\mathcal{S}\subset\{1,2,\cdots,n\} is chosen at random and, if the sub-sampled matrix is invertible, the update is obtained by

where Sg\mathcal{S}_{{\bf g}} and SH\mathcal{S}_{H} are sample sets used for approximating the Hessian and the gradient, respectively. The variants (2b) are what we call sub-sampled Newton methods in this paper.

Unlike the sub-sampling in first order methods, theoretical properties of such techniques in the class of second-order methods are not yet very well understood. As a result, our aim in the present paper is to address the local convergence behavior of such sub-sampled Newton methods in a variety of situations (by local convergence, it is meant that the initial iterate is close enough to a local minimizer at which the sufficient conditions hold). This paper has an associated companion paper, , henceforth called SSN1, in which we consider globally convergent sub-sampled Newton algorithms and their convergence properties (by globally convergent algorithm, it is meant an algorithm that approaches the optimal solution starting from any initial point). The reason for splitting into two papers is that, while SSN1 , introduces several of the ideas in simpler settings, the more advanced techniques of this paper are needed to provide more control on the convergence behavior under a variety of assumptions. We expect that the insight into the theoretical properties of the algorithms presented in this paper and SSN1 , will enable the development of still-further improved sub-sampled Newton algorithms in a variety of applications in scientific computing, statistical data analysis, etc.

For the rest of this paper, we consider the general case of constrained optimization (1). More specifically, given the current iterate, x(k)DX{\bf x}^{(k)}\in\mathcal{D}\cap\mathcal{X}, we consider the following iterative scheme,

where g(x(k)){\bf g}({\bf x}^{(k)}) and H(x(k))H({\bf x}^{(k)}) are some approximations to (in our case, sub-samples of) the actual gradient and the Hessian at the kthk^{th} iteration, respectively. A variety of first and second order methods are of this form. For example,

full Newton’s method is obtained by setting g(x(k))=F(x(k)){\bf g}({\bf x}^{(k)})=\nabla F({\bf x}^{(k)}) and H(x(k))=2F(x(k))H({\bf x}^{(k)})=\nabla^{2}F({\bf x}^{(k)}),

a step of the Frank-Wolfe, , algorithm can be obtained by considering g(x(k))=F(x(k)){\bf g}({\bf x}^{(k)})=\nabla F({\bf x}^{(k)}) and H(x(k))=0H({\bf x}^{(k)})=0,

and finally, choosing the pair {g(x(k))=F(x(k)),H(x(k))=1/SHjSH2fi(x(k))}\left\{{\bf g}({\bf x}^{(k)})=\nabla F({\bf x}^{(k)}),H({\bf x}^{(k)})=1/|\mathcal{S}_{H}|\sum_{j\in\mathcal{S}_{H}}\nabla^{2}f_{i}({\bf x}^{(k)})\right\} or {g(x(k))=1/SgjSgfj(x(k)),H(x(k))=1/SHjSH2fi(x(k))}\left\{{\bf g}({\bf x}^{(k)})=1/|\mathcal{S}_{{\bf g}}|\sum_{j\in\mathcal{S}_{{\bf g}}}\nabla f_{j}({\bf x}^{(k)}),H({\bf x}^{(k)})=1/|\mathcal{S}_{H}|\sum_{j\in\mathcal{S}_{H}}\nabla^{2}f_{i}({\bf x}^{(k)})\right\} for some index sets Sg,SH[n]\mathcal{S}_{{\bf g}},\mathcal{S}_{H}\subseteq[n] gives rise to sub-sampled Newton methods, which are the focus of this paper (as well as SSN1 ).

Unlike SSN1 where our focus is mainly on designing sub-sampled Newton algorithms to guarantee global convergence, here, we concentrate on the actual speed of convergence. In particular, we aim to ensure that any such sub-sampled Newton algorithms preserves, at least locally, as much of the convergence properties of the full Newton’s method as possible. In doing so, we need to ensure the following requirements:

Our sampling strategy needs to provide a sample size S|\mathcal{S}| which is independent of nn, or at least smaller. Note that this is the same requirement as in SSN1 [44, (R.1)]. However, as a result of the more intricate goals of the present paper, we will show that, comparatively, a larger sample size is required here than that used in SSN1 .

In addition, we need to ensure, at least probabilistically, that the sub-sampled matrix preserves the spectrum of the true Hessian as much as possible. If the gradient is also sub-sampled, we need to ensure that sampling is done in a way to keep as much of this first order information as possible. Note that unlike SSN1 [44, (R.2)], in order to preserve as much of the local convergence properties of the full Newton’s method as possible, the mere invertibility of the sub-sampled Hessian is not enough.

Finally, we need to ensure our algorithms enjoy a reasonably fast convergence rate which is, at least locally, similar to that of the full Newton’s method. Note that unlike SSN1 [44, (R.3)] where mere global convergence guarantee is required, here, the emphasis in on local convergence speed.

In this paper, we address challenges (R.1), (R.2), and (R.3). More precisely, by using a random matrix concentration inequality as well as results from approximate matrix multiplication of randomized numerical linear algebra (RandNLA), we ensure (R.1) and (R.2). To address (R.3), we give algorithms whose local convergence rates can be made close to that of full Newton’s method. These local rates coupled with the global convergence guarantees of SSN1 , provide globally convergent algorithms with fast local rates (e.g., see SSN1 [44, Theorems 2 and 7]). The present paper and the companion, SSN1 , to the best of our knowledge, are the very first to thoroughly and quantitatively study the convergence behavior of such sub-sampled second order algorithms, in a variety of settings.

2 Notation

For a vector v{\bf v} and a matrix A, using such cone, we can define their K\mathcal{K}-restricted norms, respectively, as

Similarly, one can define the K\mathcal{K}-restricted maximum and the minimum eigenvalues of a symmetric matrix AA as

Alternatively, let UU be an orthonormal basis for the cone K\mathcal{K}. The definitions above are equivalent to the following:

Throughout this paper we make use of two standard definitions of convergence rate: Q-convergence rate and R-convergence rate. Recall that a sequence of vectors {z(k)}k\{{\bf z}^{(k)}\}_{k} is said to converge Q-linearly to a limiting value z{\bf z}^{*}, if for some 0ρ<10\leq\rho<1,

Q-superlinear convergence is defined similarly as

The notion of R-convergence rate is an extension which captures sequences which still converge reasonably fast, but whose “speed” is variable. A sequence of vectors {z(k)}k\{{\bf z}^{(k)}\}_{k} is said to converge R-superlinearly to a limiting value z{\bf z}^{*}, if

for some R>0R>0 and a sequence {r(k)}k\{r^{(k)}\}_{k} such that

R-superlinear convergence is similarly defined by requiring that

3 Assumptions

Our theoretical analysis is based on the following assumptions.

Lipschitz Hessian: Throughout this article, we assume that each fif_{i} is twice-differentiable and has a Lipschitz continuous Hessian with respect to the cone K\mathcal{K}, i.e., for some L>0L>0,

where AK\|A\|_{\mathcal{K}} is defined in (5b).

Hessian Regularity: Regularity of the Hessian refers to smoothness and strong convexity of FF. Such properties can either be required for xDX\forall{\bf x}\in\mathcal{D}\cap\mathcal{X} or alternatively, only at a local minimum, x{\bf x}^{*}, which is assumed to always exist. In this paper, these properties are referred to as global and local regularity, respectively. In particular, for a given xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, let K(x)K({\bf x}) and γ(x)\gamma({\bf x}) be such that

where AK\|A\|_{\mathcal{K}} and λminK(A)\lambda_{\min}^{\mathcal{K}}(A) are defined in (5b) and (6a), respectively. We make the following regularity assumptions:

Global Hessian Regularity: Throughout Sections 2 and 3, every convergence result is first given for the case where we have global smoothness and strong convexity with respect to K\mathcal{K}. More precisely, we assume that

where K(x)K({\bf x}) and γ(x)\gamma({\bf x}) are defined in (9). Note that since DX\mathcal{D}\cap\mathcal{X} is convex, Assumption (10b) implies strong convexity of FF which, in turn, implies the uniqueness of x{\bf x}^{*}.

Local Hessian Regularity: Assumptions (10) are rather restrictive and limit the class of problems where the methods in the present paper apply. Fortunately, these assumptions can be further relaxed and be required to hold only locally at a local optimum x{\bf x}^{*}. In other words, we require that

where K(x)K({\bf x}^{*}) and γ(x)\gamma({\bf x}^{*}) are defined in (9). Note that since DX\mathcal{D}\cap\mathcal{X} is convex, Assumption (11b) imply that x{\bf x}^{*} is an isolated local optimum. These assumptions are relaxations of (10). For example, (11b) appears as a second order sufficiency condition for a local optimum x{\bf x}^{*}. In fact, some highly non-convex problems with multiple local minima exhibit such local structures, e.g., . As a result, each analysis in Sections 2 and 3 is complemented with a convergence results using Assumptions (11), to make them applicable in a more general setting.

Locally Bounded Gradient: In Section 3, in addition to the Hessian, the gradient is also sub-sampled. For some of the results presented there, we give convergence results using the following regularity assumptions on the gradients in the form of boundedness at a local optimum x{\bf x}^{*}:

For the results of Section 2.2.1 only, due to technical reasons which will be explained there, we need to remove K\mathcal{K}-constrained condition from the above assumptions. More specifically, (8) will be replaced by

Table 1 gives an overview of where each of the above assumptions are used to prove convergence of different algorithms presented in this paper. As noted before, for each algorithm, we first give convergence results using the global regularity assumptions, and then we provide separate convergence results by relaxing these assumptions to hold locally only at x{\bf x}^{*}.

4 Contributions

The contributions of this paper can be summarized as follows:

Under the above assumptions, we study the local convergence behavior of various sub-sampled algorithms. We show that all of our error recursions exhibit a composite behavior whose dominating error term varies according to the distance of iterates to optimality. More specifically, we will show that a quadratic error term dominates when we are far from an optimum and it transitions to lower degree terms according to the distance to optimality.

Our algorithms are designed for the following settings.

Algorithms 1 and 2 practically implement (3) for the case where only the Hessian is sub-sampled, while the full gradient is used, i.e., g(x)=F(x){\bf g}({\bf x})=\nabla F({\bf x}). We give locally Q-linear and Q-superlinear convergence rates for Algorithms 1 and 2, respectively.

Algorithms 3 and 4, are modifications of Algorithm 1 in which the sub-sampled Hessian is regularized by modifying its spectrum or by ridge-type regularization, respectively. We show that such regularizations can be used to improve upon the initial progress of the algorithm, and for both of these regularization methods, we give Q-linear convergence rates.

Algorithms 5 and 6 are the implementation of the fully stochastic formulation of (3), in which the gradient as well the Hessian is sub-sampled. We show that one can sub-sample the gradient independently of the Hessian, or simultaneously using the same collection of sample indices and we provide R-linear and R-superlinear convergence rates for these algorithms.

For all of these algorithms, we give quantitative convergence results, i.e., our bounds contain an actual worst-case convergence rates. Though the sample size depends on the condition number of the problem, we show that the rates for the (super)linear convergence phase are, in fact, problem-independent. As a result, by increasing the estimation accuracy, one can arbitrarily improve the speed of local convergence.

Our results here only address the local convergence behaviors of different algorithms for when the initial iterate is sufficiently close to an optimum. However, under global regularity assumptions (10), in SSN1 , we show that by simple modifications of all of our algorithms here, one can obtain globally convergent algorithms which approach the optimum, regardless of the initial starting point. These connections guarantee that the modified algorithms converge globally with a local rate which is problem-independent; see SSN1 [44, Theorems 2 and 7].

Table 2 gives a summary of the main results of this paper. In addition, in Section 1.5 and in light of the related work in the literature, we give more details of the contributions of the present paper.

5 Related Work

The results of Section 2, where the full gradient is used, offer computational efficiency for the regime where both nn and pp are large. However, it is required that nn is not so large as to make the gradient evaluation prohibitive. In such regime (where n,p1n,p\gg 1 but nn is not too large), similar results can be found in :

The pioneering work in establishes, for the first time, the convergence of Newton’s method with sub-sampled Hessian and full gradient. There, two sub-sampled Hessian algorithms are proposed, where one is based on a matrix-free inexact Newton iteration and the other incorporates a preconditioned limited memory BFGS iteration. However, the results are asymptotic, i.e., for kk\rightarrow\infty, and no quantitative convergence rate is given. In addition, convergence is established for the case where each fif_{i} is assumed to be strongly convex. Here, we concentrate on obtaining non-asymptotic and quantitative local convergence rates of such sub-sampled algorithm under milder assumption where only FF is required to be (locally) strongly convex. In addition, we extend these algorithms to the case where the gradient as well as the Hessian is sub-sampled. These results are presented in Sections 2 and 3.

Within the context of deep learning, is the first to study the application of a modification of Newton’s method. It suggests a heuristic algorithm where at each iteration, the full Hessian is approximated by a relatively large subset of 2fi\nabla^{2}f_{i}’s, i.e. a “mini-batch”, and the size of such mini-batch grows as the optimization progresses. The resulting matrix is then damped in a Levenberg-Marquardt style, , and conjugate gradient, , is used to approximately solve the resulting linear system

Authors in were among the first to explore hybrid methods, combining the inexpensive iterations of incremental gradient algorithms and the steady convergence of full gradient methods, which exhibit the benefits of both approaches. Their analysis shows that by carefully increasing the sample size across iterations, it is possible to maintain the steady convergence rates of full-gradient methods. They also present a practical quasi-Newton implementation based on their approach. Such careful control over the increase of the sample size is one of the main ingredients of our analysis here.

Our analysis here most resembles that of . The work in is the first to establish quantitative convergence rate for the case where the Hessian is sub-sampled and the full gradient is used. The authors consider the two metric projection formulation of the Newton’s method and establish probabilistic linear convergence rate using Hoeffding matrix concentration result. They suggest an algorithm, where at each iteration, the spectrum of the sub-sampled Hessian is modified as a form of regularization. Our work here resembles that of , but is different in many aspects:

For the constrained optimization, the results in the present paper are given for Newton’s method formulated as a scaled gradient projection method (3), whereas the results in are given for what is called as two metric projection method. The main difficulty with the latter formulation is that an arbitrary positive definite matrix HH will not necessarily yield a descent direction, i.e., F(x(k+1))>F(x(k))F({\bf x}^{(k+1)})>F({\bf x}^{(k)}) for all αk>0\alpha_{k}>0, or even the algorithm might not recognize, i.e., fail to stop at, an stationary point [3, Section 2.4], which might result in instability in the algorithms relying on choosing the correct HH. However, the two metric projection method, can potentially have an easier projection step as opposed to scaled gradient projection (3). In fact, the scaled gradient projection algorithm can be viewed as a form of projection method with norm .H\|.\|_{H}, resulting in an oblique projection. It should also be noted that, for the case of unconstrained optimization, the both formulations coincide.

Within the context of scaled gradient projection method, we extend the work in in several dimensions. More specifically, we give a wide range of results from simple sub-sampling to adding different kinds of regularization. For each of these methods, we give sufficient conditions for local convergence of the respective algorithm, under both global and local regularity assumptions as in Section 1.3. We also give results for the case where both Hessian and the gradients are sub-sampled; see Sections 2 and 3.

In situations where the Hessian is sparse, spectral regularization suggested in can destroy the sparsity. This in turn might make the storage and the computation with the resulting matrix less efficient. Here, in addition to extending the results using such spectral regularization, in Section 2.2.2, we suggest an alternative using ridge type regularization, which effectively achieves the same goal as the spectral counterpart, but at no extra computational or storage cots.

In addition to results with global regularity assumptions, in every section, we give additional convergence results under milder assumptions where the regularity is only required at a local minimum. In addition, most regularity assumptions are given with respect to the tangent cone of the constraints at a local minimum. These assumptions, depending on the geometry of this cone, can be significantly weaker than those made for unconstrained optimization (e.g., compare (10) and (11) with (13)).

As for the sufficient condition to guarantee convergence using the spectral regularization, in [19, Corollary 3.5], a sample size of Ω(ln(p)λ2K2/γ6)\Omega(\ln(p)\lambda^{2}K^{2}/\gamma^{6}) is necessarily required (to at least obtain a positive region of convergence for the initial iterate), where λ\lambda is the spectral regularization threshold as in Section 2.2.1. However, the sample size required for Theorems 7 and 8 is of order Ω(ln(p)K2/γ2)\Omega(\ln(p)K^{2}/\gamma^{2}), which is much smaller.

The results of Section 3, apply to more general settings where nn can be arbitrarily large. This is so since sub-sampling the gradient, in addition to that of the Hessian, allows for per-iteration cost which can be much smaller than nn. Within the context of first order methods, there has been numerous variants of gradient sampling from a simple stochastic gradient descent, , to the most recent improvements by incorporating the previous gradient directions in the current update . For second order methods, such sub-sampling strategy has been successfully applied in large scale non-linear inverse problems . However, to the best of our knowledge, Section 3 offers the first quantitative convergence results for such sub-sampled methods. In particular, these results guarantee the local convergence of many heuristic algorithms used in large scale nonlinear inverse problems, e.g. .

Sub-Sampling Hessian

For the optimization problem (1), at each iteration, consider picking a sample of indices from {1,2,,n}\{1,2,\ldots,n\}, uniformly at random with or without replacement. Let S\mathcal{S} and S|\mathcal{S}| denote the sample collection and its cardinality, respectively and define

to be the sub-sampled Hessian. As mentioned before in Section 1.1, in order for such Hessian sub-sampling to be effective in reducing nn as well as yielding a fast convergent algorithm, we need to ensure that the sample size S|\mathcal{S}| satisfies the requirement (R.1), while as mentioned in (R.2), the spectrum of H(x)H({\bf x}) is as close to that of 2F(x)\nabla^{2}F({\bf x}) as possible. The latter requirement is reinforced by noticing that the bounds in Section A.1.1 contain quantities which directly relate to the relative eigenvalue distributions of H(x(k))H({\bf x}^{(k)}) and 2F(x(k))\nabla^{2}F({\bf x}^{(k)}) (with respect to the cone K\mathcal{K}). This indicates that the closer we can approximate the spectrum of the full Hessian, the faster the convergence is expected to be. Below, we make use of some matrix concentration inequalities to probabilistically guarantee the above properties.

For a given xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, let

With the above in mind, we can present our main sub-sampling Lemma to ensure that, to a desired accuracy, the spectrum of the full Hessian is preserved after sub-sampling.

Given any 0<ϵ<10<\epsilon<1, 0<δ<10<\delta<1 and xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, if

then for H(x)H({\bf x}) defined in (14), we have

where κ(x,K)\kappa({\bf x},\mathcal{K}) is the K\mathcal{K}-restricted condition number at x{\bf x}, defined in (15), and λiK(A)\lambda_{i}^{\mathcal{K}}(A) is defined in (7).

Lemma 1 indicates that, with respect to the cone K\mathcal{K}, it is indeed possible to sub-sample in a way that the spectrum of the sub-sampled Hessian does not deviate from that of the full Hessian.

Below, we show a couple of ways to improve upon the lower bound (16). First, we show that if each fif_{i} is convex, it is possible to reduce the constant factor further:

Given any 0<ϵ<10<\epsilon<1, 0<δ<10<\delta<1, and xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, if 2fi(x)0,i\nabla^{2}f_{i}({\bf x})\succeq 0,\forall i and

then (17) holds for H(x)H({\bf x}) defined in (14).

Even in the absence of convexity of each fif_{i}, it is still possible to improve the dependence on ln(p)\ln(p) by using the concept of intrinsic dimension [55, Chapter 7]. This, in turn, allows us to incorporate the dimension of the subspace K\mathcal{K} in our bound. More specifically, we have the following Lemma:

be the intrinsic dimension (a.k.a effective rank) of VV. Given any 0<ϵ1/20<\epsilon\leq 1/2 , 0<δ<10<\delta<1, and xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, if

and S\mathcal{S} is chosen uniformly at random with replacement, then (17) holds for H(x)H({\bf x}) defined in (14).

Comment 1: It is important to note that drank(K)pd\leq\text{rank}(\mathcal{K})\leq p, and consequently, when K\mathcal{K} is low dimensional subspace, (19) can indeed offer some computational savings, compared to (16). In fact, if computing dd is not feasible, one can simply replace that with rank(K)\text{rank}(\mathcal{K}) in (19). For the rest of this paper, however, we will simply stick with the original bound (16), and note that extension to the above improvements is trivially done.

Comment 2: It might be worth noting the differences between the above sub-sampling strategies and that of SSN1 [44, Lemma 1]. These differences, in fact, boil down to the differences between the requirement (R.2) and the corresponding one in SSN1 [44, Section 1.1, (R.2)]. As a result of a “finer-grained” analysis in the present paper and in order to preserve as much of the local convergence properties of the full Newton’s method, Lemmas 1, 2, and 3 require a larger sample size, i.e., in the order of κ2\kappa^{2} vs. κ\kappa for SSN1 [44, Lemma 1], while delivering a much stronger guarantee about the spectrum of the sub-sampled Hessian. In contrast, for the global analysis in SSN1 , we only need to ensure that the sub-sampled Hessian is invertible to yield a descent direction at every iteration.

Comment 3: In our results and algorithms below, we use Lemma 1 with the following iteration independent sample sizes:

if Assumptions (10) hold, we use a larger and, yet, iteration-independent sample size

with KK, γ\gamma defined as in (10), and

with KK^{*}, and γ\gamma^{*} defined as in (11).

For the results and algorithms of Section 2.2.1 where Assumptions (13) are used, we use similar sample sizes, but with the respective constants defined as in (13).

Using the above sampling strategies, we can now present our main algorithms for the case of sub-sampling Hessian and full gradient (22). Note that for the following algorithms, we always consider the “natural” Newton step size, i.e., αk=1\alpha_{k}=1. A brief explanation for this choice is given in the beginning of Section A.1.2.

Throughout this section, for a given x(k)DX{\bf x}^{(k)}\in\mathcal{D}\cap\mathcal{X}, we consider the update

where H(x(k))H({\bf x}^{(k)}) is as in (14). Note that (22) is the same as (3) with g=F(x(k)){\bf g}=\nabla F({\bf x}^{(k)}), i.e., in (22) the full gradient in used.

Global Regularity: Let Assumptions (8) and (10) hold and let 0<δ<10<\delta<1 and 0<ϵ<10<\epsilon<1 be given. Set S|\mathcal{S}| as described in Comment 2, and let H(x(k))H({\bf x}^{(k)}) be as in (14). Then, for the update (22) with αk=1\alpha_{k}=1 , with probability 1δ1-\delta, we have

Local Regularity: Under Assumptions (8), (11), and (53), the result of theorem 1 holds with

Comment 4: Bounds given here exhibit a composite behavior where the error recursion, when far from an optimum, is first dominated by a quadratic term and it transforms to linear term near an optimum. What is rather interesting is that the rate for the linear phase is indeed independent of any problem dependent quantities, and only depends on the sub-sampling accuracy! Of course, such problem dependent constants indeed appear in the lower bound for the sample size, in the form of (local) condition number.

Now we establish sufficient conditions for Q-linear convergence of sub-sampled Newton methods (22). We remind that in this case, the sub-sampling is done only for the Hessian and the full gradient is used.

Global Regularity: Let Assumptions (8) and (10) hold and consider any 0<ρ0<ρ<10<\rho_{0}<\rho<1. Using Algorithm 1 with

with ξ\xi as in Theorem 1, we get locally Q-linear convergence

Local Regularity: Under Assumptions (8), (11), Theorem 2 holds with

In addition to Q-linear convergence, if the accuracy by which Hessian is sampled increases as the iterations progress, it is also possible to obtain Q-superlinear convergence.

Let the respective assumptions of Theorem 2 hold. Using Algorithm 2, with

if x(0){\bf x}^{(0)} satisfies (26) with ρ\rho, ρ0\rho_{0}, and ξ(0)\xi^{(0)}, we get locally Q-superlinear convergence

with probability (1δ)k0(1-\delta)^{k_{0}}, where ξ(0)\xi^{(0)} is as in (24) (or (25)) with ϵ(0)\epsilon^{(0)}.

It is even possible to obtain Q-superlinear rate with a very mild growth of the sample size. For example, we can consider a growth which is only logarithmic with the iteration counter.

Global Regularity: Let Assumptions (8) and (10) hold. Using Algorithm 2 with

Local Regularity: Under Assumptions (8), (11), the result of theorem 4 holds with

Comment 5: Note that since 1/ln(3+k)1/21/\ln(3+k)\leq 1/2 for all k5k\geq 5, Theorem 4 guarantees that after only a few iterations, we see a very fast local convergence rate with only logarithmic increase in Hessian estimation accuracy.

Comment 6: Theorems 3 and 4 state that in order to obtain locally superlinear convergence, as we get closer to the optimal solution, the Hessian is required to be estimated more accurately. The rate at which this estimation accuracy is improved, in turn, directly determines that of the Q-superlinear convergence.

2 Modifying the Sample Hessian

As was discussed earlier, the composite behavior in the error recursion of Theorem 1 indicates that in early stages of the algorithm, when the iterates are far from an optimum, the error is dominated by a quadratic term. Now it can be seen from (23) that the quadratic term is negatively affected by small values of γ\gamma, and unlike the linear term, the estimation accuracy ϵ\epsilon cannot reverse the effect. In other words, small values of γ\gamma can hinder the initial progress and even using the full Hessian cannot address this issue. As a result, one might resort to regularization of the (estimated) Hessian to improve upon the initial convergence rate. Here, we explore two options for such regularization and will discuss the pros and cons of such strategies.

In this section, we follow the ideas presented in , by accounting for such a potentially negative factor γ\gamma, through a regularization of eigenvalue distribution of the sub-sampled Hessian. More specifically, for some λ0\lambda\geq 0, let

The operation (30) can be equivalently represented as

with vi{\bf v}_{i} being the ithi^{th} eigenvector of HH corresponding to the ithi^{th} eigenvalue, λi(H)\lambda_{i}(H). Operation (30) can be performed using truncated SVD (TSVD). Note that although TSVD can be done through standard methods, faster randomized alternatives exist which provide accurate approximations to TSVD much more efficiently . In the constrained optimization case, where the restricted eigenvalues of HH with respect to the cone K\mathcal{K} are not a priori known, it is impossible to compute λiK(H)\lambda_{i}^{\mathcal{K}}(H). As a result, we replace Assumptions (8), (10) and (11), with the corresponding assumptions in (13).

For this regularization, we have the following error recursion result:

Let Assumptions (13a), (13b), and (13c) hold and let 0<δ<10<\delta<1 and 0<ϵ<10<\epsilon<1 be given. Set S|\mathcal{S}| as described in Comment 2, and H(x(k))H({\bf x}^{(k)}) as in (14). For some λ>0\lambda>0, let

where P(λ;.)\mathcal{P}(\lambda;.) is as in (30). Then, for the update (22) with αk=1\alpha_{k}=1, if λ(1ϵ)γ\lambda\geq(1-\epsilon)\gamma, it follows that (23) holds with probability 1δ1-\delta where

Now we consider the local regularity counterparts of Assumptions (13b) and (13c) by replacing them with (13d) and (13e). Under such relaxed assumptions, we have the following convergence result:

Under Assumptions (13a), (13d), and (13e) hold, if λ(1ϵ)γ\lambda\geq(1-\epsilon)\gamma^{*}, the result of Theorem 5 holds with

Comment 7: The results of Theorems 5 and 6 indicate that, by increasing the threshold λ\lambda, we get a better factor for the quadratic term, but a worse factor for the linear term. This is specially important in light of the fact that, without regularization and by Theorem 1, linear term only depends on sub-sampling accuracy which can be arbitrarily made small. Hence, it might be advisable to have larger threshold, λ\lambda, when far from the solution and gradually decrease the threshold, as iterates get closer to the optimum. This remedies the slow initial progress and yet allows for the sub-sampling dependent linear rate to be recovered when close enough to the solution.

Let Assumptions (13a), (13b), and (13c) hold. Using Algorithm 3, if

and at every iteration the threshold is chosen as

we get locally Q-linear convergence with variable rates

with probability (1δ)2k0(1-\delta)^{2k_{0}}, where

Let Assumptions (13a), (13d), and (13e) hold. Using Algorithm 3, if

and at every iteration the threshold is chosen as

Comment 9: To get convergence, we only need a rough estimate of the smallest eigenvalue of 2F(x(k))\nabla^{2}F({\bf x}^{(k)}) at iteration kk. As a result, one can consider relatively large ϵ0\epsilon_{0} in Algorithm 3 and have a small sample size for such rough estimation.

Comment 10: The results of Theorems 7 and 8 indicate that, when close enough to the solution, increasing the threshold λ\lambda slows down the convergence rate! Hence, more aggressive regularization might, in fact, adversarially affect the efficiency of the algorithm. As a result, it is only advisable to have large thresholds at early stages of the algorithm. For example, using Theorem 5 and any given 0<ξ0<10<\xi_{0}<1, it can be shown that if λβL/(2ξ0)\lambda\geq\beta L/(2\xi_{0}), for some β>1\beta>1, then while

2.2 Ridge Regularization

As an alternative to the spectral regularization, we can consider the following simple ridge-type regularization

for some λ0\lambda\geq 0, similar to the Levenberg-Marquardt type algorithms . Such regularization might be preferable to the spectral regularization of Section 2.2.1, as it avoids the projection operation (30) at every iteration. In addition, in order to establish convergence in Section 2.2.1, the values of the threshold have to be chosen with regards to the eigenvalue distributions of 2F(xk)\nabla^{2}F({\bf x}^{k}) (or 2F(x)\nabla^{2}F({\bf x}^{*})). As a result of this undesirable coupling, in Section 2.2.1, assumptions on the full eigenvalue distribution (as opposed to the ones restricted to the cone of the constraints) were unavoidable, and depending on the cone K\mathcal{K}, this can be significant (compare Assumptions (13) with their respective counterparts in (8), (10) and (11)). However, here λ\lambda can be chosen a priori without any restriction and also, if desired, can be kept fixed across all iterations. As a result, we can continue to use Assumptions (8), (10) and (11), and still acquire convergence. For such regularization, we have the following results:

Global Regularity: Let Assumptions (8) and (10) hold, and let 0<δ<10<\delta<1 and 0<ϵ<10<\epsilon<1 be given. Set S|\mathcal{S}| as described in Comment 2 and H(x(k))H({\bf x}^{(k)}) as in (14). For some λ0\lambda\geq 0, let H^(x(k))\hat{H}({\bf x}^{(k)}) be as in (37). Then, for the update (22) with αk=1\alpha_{k}=1, it follows that (23) holds with probability 1δ1-\delta where

Local Regularity: Under Assumptions (8), (11), and (53) the result of theorem 9 holds with

Comment 11: From Theorem 9 it can be seen that increasing λ\lambda results in decreasing the rate for the quadratic term, whilst increasing that of the linear term. As a result, there is a trade-off in choosing the values of regularization and it might be preferable to have larger values for λ\lambda when far from the solution and gradually decrease it, as iterates get closer to the optimum. This is so because as λ\lambda gets larger, the method tends to behave more like first order gradient descent.

Global Regularity: Let Assumptions (8) and (10) hold. For any λ0\lambda\geq 0, consider ρ0\rho_{0} and ρ\rho such that

if (26) holds with ξ\xi as in Theorem 9, then with probability (1δ)k0(1-\delta)^{k_{0}}, we get locally Q-linear convergence as in (27) with the rate ρ\rho.

Local Regularity: Under Assumptions (8), (11), for any

Comment 12: Similar observations as in Theorem 7 and 8 can also be made here. In other words, we see that large value for λ\lambda negatively affects the linear convergence phase, while it is beneficial for the early stages of iteration where the rate is quadratic. For example, using Theorem 9 and any given 0<ξ0<10<\xi_{0}<1, it can be shown that if, for some β>1\beta>1,

Sub-Sampling Hessian & Gradient

In order to compute the update x(k+1){\bf x}^{(k+1)} in Section 2, full gradient was used. In many problems, this can be a major bottleneck and reduction in computational costs can be made by considering sub-sampling the gradient as well. Such methods have recently been used in PDE constrained inverse problems, , where each function fif_{i} is only implicitly available. In such problems, computing fi\nabla f_{i} requires solving a PDE twice, amounting to a total cost of 2n2n PDE solves for each full gradient evaluation. In high dimensional settings where n,p1n,p\gg 1, this can pose a significant challenge and sub-sampling the gradient can, at times, drastically reduce the computational complexity of many problems. As such, for this section, we consider the general update of (3), where g(x(k)){\bf g}({\bf x}^{(k)}) is a sub-sampled approximation to F(x(k))\nabla F({\bf x}^{(k)}).

As in Section 2, consider picking a sample of indices from {1,2,,n}\{1,2,\ldots,n\}, uniformly at random with replacement. Also let S\mathcal{S} and S|\mathcal{S}| denote the sample collection and its cardinality, respectively. Let

be the sub-samples gradient. As mentioned before in the requirement (R.2), we need to ensure that sampling is done in a way to keep as much of the first order information from the full gradient as possible.

By a simple observation, the gradient F(x)\nabla F({\bf x}) can be written in matrix-matrix product from as

Hence, we can use approximate matrix multiplication results as a fundamental primitive in RandNLA , to probabilistically control the error in approximation of F(x)\nabla F({\bf x}) by g(x){\bf g}({\bf x}), through uniform sampling of the columns and rows of the involved matrices above.

For a given xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, let

For any 0<ϵ<10<\epsilon<1 and 0<δ<10<\delta<1, if

then for g(x){\bf g}({\bf x}) defined in (38), we have

where .K\|.\|_{\mathcal{K}} is defined as in (5a).

Comment 13: Note that for the results and algorithms below which make use of Assumption (12), we use the sample size, Sg|\mathcal{S}_{{\bf g}}|, given by Lemma 4 with G=G(x)G^{*}=G({\bf x}^{*}) related to a local optimum, x{\bf x}^{*}. However, for those which do not make such assumption, the sample size, Sg|\mathcal{S}_{{\bf g}}|, from Lemma 4 is given with the current iterate, x(k){\bf x}^{(k)}. For these latter algorithms, we need to be able to efficiently estimate G(x(k))G({\bf x}^{(k)}) at every iteration or, a priori, have a uniform upper bound for G(x)G({\bf x}) for all xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}. Fortunately, in many different problems, it is often possible to estimate G(x)G({\bf x}) very efficiently; see Section 4 and SSN1 [44, Section 4] for concrete examples of a uniform and per-iteration estimations, respectively.

In this section, we present various algorithms which incorporate both Hessian and gradient sub-sampling. In such a setting, one is generally faced with two options: either to sub-sample the gradient and the Hessian independently of each other, or to use the same collection of indices and perform simultaneous sub-sampling for both. In this section, we present algorithms and convergence results for both of these strategies. Combining the sampling of Lemma 1 for the Hessian that of Lemma 4 for the gradient, we are now in the position to present our main results for this Section. We remind that, for this Section, we consider the general update of (3), where H(x(k))H({\bf x}^{(k)}) and g(x(k)){\bf g}({\bf x}^{(k)}) are sub-sampled approximations to 2F(x(k))\nabla^{2}F({\bf x}^{(k)}) and F(x(k))\nabla F({\bf x}^{(k)}), respectively.

Global Regularity: Let Assumptions (8) and (10) hold, and let 0<δ<10<\delta<1, 0<ϵ1<10<\epsilon_{1}<1, and 0<ϵ2<10<\epsilon_{2}<1 be given. Set SH|\mathcal{S}_{H}| as described in Comment 2 with (ϵ1,δ)(\epsilon_{1},\delta) and Sg|\mathcal{S}_{{\bf g}}| as in (39) with (ϵ2,δ)(\epsilon_{2},\delta) and x(k){\bf x}^{(k)}. Independently, choose SH\mathcal{S}_{H} and Sg\mathcal{S}_{{\bf g}}, and let H(x(k))H({\bf x}^{(k)}) and g(x(k)){\bf g}({\bf x}^{(k)}) be as in (14) and (38), respectively. Then, for the update (3) with αk=1\alpha_{k}=1, with probability (1δ)2(1-\delta)^{2}, we have

Local Regularity: Under Assumptions (8), (11), and (53) with ϵ1\epsilon_{1}, the results of Theorem 11 holds with

Comment 14: Similar to the case of Hessian sub-sampling, the bounds given here exhibit a composite behavior where the error is at first dominated by a quadratic term, which transforms to linear term, and finally is dominated by the rate at which the gradient is approximated near an optimum.

It is also possible to obtain an alternative convergence result under local regularity assumption on the gradients at a local optimum x{\bf x}^{*} as in (12). This gives rise to an algorithm where the Hessian and the gradient are simultaneously sub-sampled, resulting in dependency among the approximations; see Algorithm 6.

Let Assumptions (8), (11), and (12) hold, and let 0<δ<10<\delta<1 and 0<ϵ<10<\epsilon<1 be given. Using the same ϵ\epsilon, set SH|\mathcal{S}_{H}| as described in Comment 2 and Sg|\mathcal{S}_{{\bf g}}| as in (39) with GG^{*}. Select a sample set S\mathcal{S} of size max{SH,SG}\max\{|\mathcal{S}_{H}|,|\mathcal{S}_{G}|\}, uniformly at random and let H(x(k))H({\bf x}^{(k)}) and g(x(k)){\bf g}({\bf x}^{(k)}) be as in (14) and (38), respectively, both using S\mathcal{S}. For the update (3) with αk=1\alpha_{k}=1, if x(k){\bf x}^{(k)} satisfies (53), then with probability 12δ1-2\delta, we have

We are now in the position to give R-linear convergence of the proposed algorithms.

Global Regularity: Let Assumptions (8) and (10) hold. Consider any 0<ρ<10<\rho<1, 0<ρ0<10<\rho_{0}<1, and 0<ρ1<10<\rho_{1}<1 such that ρ0+ρ1<ρ\rho_{0}+\rho_{1}<\rho. Let

if the initial iterate, x(0){\bf x}^{(0)}, satisfies

Local Regularity: Under Assumptions (8) and (11), Theorem 13 holds with

Comment 15: From Theorems 13, it can be seen that in order to get linear convergence rate, estimation of the gradient must be done, progressively, more accurately, whereas the sample size for Hessian can remain unchanged across iterations. This is in line with the common knowledge where, in practice, as the iterations get closer to the optimal solution, the accuracy of gradient estimation is more important than that of Hessian.

Let Assumptions (8), (11) and (12) hold and consider any 0<ρ0<ρ<10<\rho_{0}<\rho<1. Let 0<ϵ<10<\epsilon<1 be such that

Using Algorithm 6, if the initial iterate, x(0){\bf x}^{(0)}, satisfies

we get locally R-linear convergence as in (41) with probability (12δ)k(1-2\delta)^{k}.

Comment 16: Unlike Theorem 13, the assumptions in Theorem 14 are more relaxed and the sampling for both are done simultaneously, i.e., using the same sample collection S\mathcal{S}. As a result in order to guarantee linear convergence, using Algorithm 6, sampling the gradient and the Hessian both must be done, progressively, more accurately.

Using a slight modification of Algorithm 6, it is even possible to obtain R-superlinear convergence. However, in this case, the sample size increase must be done “extremely aggressively”, which might render such strategy rather impractical and we only give the following result for completeness. By aggressive, we refer to a very fast rate at which ϵ(k)\epsilon^{(k)} is decreased across iterations, compared to a more moderate decrease of the previous results.

Under the assumptions of Theorem 14, Algorithm 6 with

Examples

In this Section, we present a few instances of problems which are of the form (1). Specifically, examples from generalized linear models (GLM) and linear support vector machine (SVM) are given in Sections 4.1 and 4.2 respectively.

Consider sparse maximum likelihood (ML) estimation using any GLM with canonical link function. One way to induce sparsity is by using the constraint set, X\mathcal{X}, as

Hence, sparse ML is equivalent to minimizing the negative log-likelihood over X\mathcal{X}, where the negative log-likelihood of such model can be written as

The cumulant generating function, Φ\Phi, determines the type of GLM. For example, Φ(t)=0.5t2\Phi(t)=0.5t^{2} gives rise to ordinary least squares formulation (OLS), while Φ(t)=ln(1+exp(t))\Phi(t)=\ln\left(1+\exp(t)\right) and Φ(t)=exp(t)\Phi(t)=\exp(t) yield logistic regression (LR) and Poisson regression (PR), respectively. It is also easily verified that the gradient and the Hessian of FF are

As mentioned in Comment 3, in order to use Lemma 4, we need to be able to efficiently estimate G(x(k))G({\bf x}^{(k)}) at every iteration or, a priori, have a uniform upper bound for G(x)G({\bf x}) for all xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}. For illustration purposes only, Table 3 gives some very rough estimates of the constant GG for GLMs with respect to X\mathcal{X}. Note that, here, GG is a uniform bound for G(x)G({\bf x}). For an example where a per-iteration bound can be efficiently computed see Section 4.2.

2 SVM with Smooth Quadratic Hinge Loss

A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, which can be used for classification, regression, or other tasks. The vast majority of text books and articles introducing SVM very briefly state the the primal optimization problem, and then go directly to the dual formulation . Primal optimizations of linear SVMs have already been studied by . This is so since the dual problem does not scale well with the number of data points, e.g., O(n3)\mathcal{O}(n^{3}) for some approaches, , the primal might be better-suited for optimization of linear SVMs.

where C>0C>0 and mm is either 11 (hinge loss) or 2 (quadratic loss). Note that, here, the coordinate corresponding to the intercept is implicitly included in each ai{\bf a}_{i}. The unconstrained optimization formulation of the above is

the gradient and the Hessian of fif_{i} is given as follows

As mentioned before, in practice, as a pre-processing and before starting the algorithms, one can pre-compute the quantities which depend only on (ai,bi)({\bf a}_{i},b_{i})’s. Then updating G(x)G({\bf x}) at every iteration is done very efficiently as it is just a matter of computing x\|{\bf x}\|.

Conclusion

In this paper, we studied the local convergence behavior of sub-sampled Newton methods for constrained optimization in two general settings. We first studied the convergence behavior of the algorithm in which only the Hessian is sub-sampled and the full gradient is used. We showed that the error recursion has a composite nature where it is first dominated by a quadratic term and is subsequently transformed to a linear term near an optimum. In such a setting, we proved locally Q-linear convergence of the proposed algorithm. We also showed that by increasing the accuracy of such sub-sampling as the iterations progress, one can indeed recover superlinear rates. We then studied the situations in which the sub-sampled Hessian is regularized by either modifying its spectrum or by ridge-type regularization. We showed that such regularization can only be beneficial at early stages of the algorithm, i.e., the phase where the quadratic term dominates the error recursion, and we need to revert to using the unmodified sub-sampled Hessian as iterates get closer to the optimum.

Finally, we studied the local convergence of a fully stochastic algorithm in which both Hessian and the gradients are sub-sampled. Such sub-sampling can be done independently for the Hessian and the gradients, or simultaneously where one sample collection of indices is used for sub-sampling both. As before, for all of our results, we showed the composite rate of the error recursions. We also showed that by progressively increasing the sub-sampling accuracy of the gradient, we can indeed guarantee a local R-linear convergence rate. In addition, we argued that through a more aggressive sub-sampling strategy, one can obtain a superlinear rate.

For all of our results, we showed that the error bounds exhibit a composite behavior whose dominating term varies according to the distance of iterates to optimality. For example, for Hessian sub-sampling, we showed that, when far from a local optimum, the dominating error term is quadratic which then transitions to linear when the iterates are within a small enough region around that local optimum. These results might give a better control over various tradeoffs which exhibit themselves in different applications, e.g., practitioners might require faster running-time while statisticians might be more interested in statistical aspects regarding the recovered solution.

One major drawback of our algorithms is that we require the quadratic sub-problems (3) to be solved exactly. This can indeed be a computational challenge in many situations. In SSN1 and in the context of globally convergent sub-sampled algorithms for unconstrained problems, we relax this requirement and solve such sub-problems only approximately. This results in globally convergent sub-sampled algorithms with inexact updates which have much lower per-iteration cost. The extension of such inexact updates for the sub-sampled algorithms for constrained optimization and studying their local convergence behavior are left for future work.

References

Appendix A Proofs

Here we give the proofs of all of the results in Section 2.

Let UU be an orthonormal basis for the cone K\mathcal{K} defined in (4). Consider S|\mathcal{S}| i.i.d random matrices Hj(x),j=1,2,,SH_{j}({\bf x}),j=1,2,\ldots,|\mathcal{S}| such that Pr(Hj(x)=2fi(x))=1/n;  i=1,2,,n,\Pr(H_{j}({\bf x})=\nabla^{2}f_{i}({\bf x}))=1/n;\;\forall i=1,2,\ldots,n,. Define

Hence we can apply Operator-Bernstein inequality [21, Theorem 1] to get

Noting that for any symmetric matrix AA,

Now having a sample size as in (16), we get that

Similarly, using [2, Theorem 8.4.11, Eqn. (8.4.12)], and for any 1i,jp1\leq i,j\leq p such that i+j=p+1i+j=p+1, we have

Let XjX_{j} be as in the proof of Lemma 1. Since fif_{i} is convex, we have 2fi(x)0\nabla^{2}f_{i}({\bf x})\succeq 0, 2F(x)0\nabla^{2}F({\bf x})\succeq 0, and for Hj=2fi(x)H_{j}=\nabla^{2}f_{i}({\bf x}), we have

and hence, after obtaining the same bound for Yj=XjY_{j}=-X_{j} and applying Operator-Bernstein inequalit as in the proof of Lemma 1, we get a lower bound for the sample size as in (18).

Let XjX_{j} be as in the proof of Lemma 1. We have

we can apply Matrix Bernstein using the intrinsic dimension [55, Theorem 7.7.1], to get for ϵ1/2\epsilon\leq 1/2,

Applying the same bound for Yj=XjY_{j}=-X_{j} and Y=jSYjY=\sum_{j\in\mathcal{S}}{Y_{j}}, followed by using the union bound, we get the desired result. It is also easily verified that (19) implies (45), and so our derivation is valid.

Note that for the simplicity of the presentation of the proofs of our main results, we will make use of the following lemma, which is just a simple variant of Lemma 1, and whose proof is very similar to that of Lemma 1

Given any 0<ϵ<10<\epsilon<1, 0<δ<10<\delta<1 and xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}, if the sample size S|\mathcal{S}| is as in (16), then for H(x)H({\bf x}) defined in (14), we have all of the following statements, simultaneously, with probability 1δ1-\delta:

We first give some structural lemma which will be the foundation of our main results for this Section.

Let Assumptions (8) and (10) hold. Also assume that

Define Δk:=x(k)x\Delta_{k}\mathrel{\mathop{:}}={\bf x}^{(k)}-{\bf x}^{*}. By optimality of x(k+1){\bf x}^{(k+1)} in (22), we have for any xDX{\bf x}\in\mathcal{D}\cap\mathcal{X},

In particular, setting x=x{\bf x}={\bf x}^{*}, and noting that x(k+1)x(k)=Δk+1Δk{\bf x}^{(k+1)}-{\bf x}^{(k)}=\Delta_{k+1}-\Delta_{k}, we get

Since by optimality of x{\bf x}^{*}, we have F(x)T(x(k+1)x)0\nabla F({\bf x}^{*})^{T}({\bf x}^{(k+1)}-{\bf x}^{*})\geq 0, it follows that

where AK\|A\|_{\mathcal{K}} is defined as in (5b).

By Assumption (49), we have λminK(H(x(k)))>0\lambda_{\min}^{\mathcal{K}}\left(H({\bf x}^{(k)})\right)>0. Hence we finally get

Comment 17: Note that for the case of H(x(k))=2F(x(k))H({\bf x}^{(k)})=\nabla^{2}F({\bf x}^{(k)}) and αk=1\alpha_{k}=1, we exactly recover the convergence rate for the standard Newton’s method [3, Proposition 1.4.1].

Under the relaxed assumptions (11), we have the following result:

Let Assumptions (8) and (11) hold. In addition, we assume that

As in Lemma 6, let Δk=x(k)x\Delta_{k}=\|{\bf x}^{(k)}-{\bf x}^{*}\|. We get,

By Assumption (50), we have λminK(H(x))>0\lambda_{\min}^{\mathcal{K}}\left(H({\bf x}^{*})\right)>0. Hence, by setting ΔkλminK(H(x))/(2Γ)\|\Delta_{k}\|\leq\lambda_{\min}^{\mathcal{K}}\left(H({\bf x}^{*})\right)/(2\Gamma), we get

A.1.2 Main proofs

Before delving in to the proofs, we first make a note regarding the step-size chosen for all of our algorithms. Let us denote

By Weyl’s inequality [2, Theorem 8.4.11], we have

minimizes the maximum of both right hand sides. Now if the sampling is done as in Lemma 5, then on the event that

Hence, the optimal step size for the linear term is αk1\alpha_{k}\approx 1. As a result, for the following results, we always consider the “natural” Newton step size (i.e., αk=1\alpha_{k}=1).

In addition, using (47), the requirement (52) holds if

where we have Γ=L\Gamma=L. Also, under Assumptions (10b) and (11b), H(x(k))H({\bf x}^{(k)}) and H(x)H({\bf x}^{*}) always satisfy (49) and (50), respectively.

Since S|\mathcal{S}| is set as described in Comment 2 and αk=1\alpha_{k}=1, then (48) holds with probability 1δ1-\delta. Now the results follow immediately by applying Lemmas 6 and 7 and noting that for our choice of HH, we have Γ=L\Gamma=L.

Using this particular choice of ϵ\epsilon, Theorem 1, for every kk, yields

Also note that for the latter part of Theorem 2, since 0<ρ0<ρ<10<\rho_{0}<\rho<1, by induction we have, for every kk

Let AkA_{k} denote the event that x(k)xρx(k1)x\|{\bf x}^{(k)}-{\bf x}^{*}\|\leq\rho\|{\bf x}^{(k-1)}-{\bf x}^{*}\|. The overall success probability is

since for every kk, the conditional probability of a successful update x(k+1){\bf x}^{(k+1)}, given the past successful iterations {xi}i=1k\{{\bf x}_{i}\}_{i=1}^{k}, is 1δ1-\delta.

where depending on the assumptions of the theorem, ρ0(k)\rho_{0}^{(k)} and ξ(k)\xi^{(k)} are as in (24) or (25) using ϵ(k)\epsilon^{(k)}. Note that, by ϵ(k)=ρkϵ\epsilon^{(k)}=\rho^{k}\epsilon and ϵ\epsilon set as in Theorem 2, it follows that

We prove the result by induction on kk. Define Δk:=x(k)x\Delta_{k}\mathrel{\mathop{:}}={\bf x}^{(k)}-{\bf x}^{*}. For k=0k=0, by assumptions on ρ\rho, ρ0\rho_{0}, and ξ(0)\xi^{(0)}, we have

Now assume that (28) holds up to the iteration kk. For k+1k+1, we get

By induction hypothesis, we have Δk1Δ0\|\Delta_{k-1}\|\leq\|\Delta_{0}\|, and

Under the local regularity Assumptions (11), we can see by induction that for every kk

so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.

We only give the proof for the first part, as the proof for the second part is almost identical. By the choice of ϵ(k)\epsilon^{(k)} in (24), we have

We again prove the result by induction on kk. For k=0k=0, by assumptions on ξ(0)\xi^{(0)}, we have

Now assume that (29) holds up to the iteration kk. For k+1k+1, we get

i.e., ϕ(x)\phi(x) is decreasing, it follows that, we have

Now, by induction hypothesis, we have Δk1Δ0\|\Delta_{k-1}\|\leq\|\Delta_{0}\|, and

Now suppose S|\mathcal{S}| is chosen as described in Comment 2. By Lemma 5, it follows that, with probability 1δ1-\delta, we have

and using the same line of reasoning as in the proof of Lemma 6.

Proof goes along the same line as that of Lemma 7. In particular, using Assumption (13a), we have the following chain of inequalities

where the second inequality follows by [5, Lemma VII.5.5] and noting that the scalar version of the projection operator (30), i.e., max(x,λ)\max(x,\lambda) is 1-Lipschitz. Specifically,

By setting λ(k)\lambda^{(k)} as in (33) and using Lemma 5, we in fact have λ(k)(1ϵ)γ\lambda^{(k)}\geq(1-\epsilon)\gamma and a requirement of Theorem 5 holds with probability 1δ1-\delta. Now we solve for ϵ(k)\epsilon^{(k)}, so that for some ρ0(k)>0\rho_{0}^{(k)}>0, which is to be determined later, we have

For this choice of ϵ(k)\epsilon^{(k)}, using Theorem 5, we get (23) with ρ0(k)\rho_{0}^{(k)} and ξ(k)=L/(2λ(k))\xi^{(k)}=L/(2\lambda^{(k)}). Now setting

it can be easily shown, by induction, that under Assumption (31) and the choice of ϵ\epsilon in (32), we have

for every kk. Furthermore, since two sampling steps in Algorithm 3 are independent, the success probability for each iteration is (1δ)2(1-\delta)^{2}. Now the overall success probability can computed as in the end of the proof of Theorem 2.

Suppose at iteration kk, (34) holds. By Weyl’s inequality and Assumption (13a), we have

Hence, by setting λ(k)\lambda^{(k)} as in (36), we in fact have λ(k)(1ϵ)γ\lambda^{(k)}\geq(1-\epsilon)\gamma^{*} and a requirement of Theorem 6 holds with probability 1δ1-\delta. The rest of the proof follows the same reasoning as in the proof of Theorem 7 by using the results of Theorem 6.

The first part is straightforward by using Lemma 6. The proof of the second part follows the same reasoning as in Lemma 7 by noting

Proof follows the same reasoning as before using Theorem 9, so we omit the details. Note that, for the latter part of Theorem 10, since 1γ/(γ+2λ)<ρ0<ρ<11-\gamma^{*}/(\gamma^{*}+2\lambda)<\rho_{0}<\rho<1, we have

so the condition (53) is satisfied. The last inequality follows since, by the choice of ϵ\epsilon and ρ0<1\rho_{0}<1, we have

A.2 Proofs of the theorems and lemma of Section 3

Here we give the proofs of all of the results in Section 3.

Same as in the proof of Lemma 1, let UU be an orthonormal basis for K\mathcal{K}. Hence, by the assumption and the definition (5a), we have that at a given xDX{\bf x}\in\mathcal{D}\cap\mathcal{X}

As mentioned before, the gradient F(x)\nabla F({\bf x}) can be equivalently written as a product of two matrices as F(x)=AB\nabla F({\bf x})=AB, where

with probability 1δ1-\delta. Now the result follows by requiring that

As before, we first give some structural lemmas which form the foundation of our main results for this Section.

Let Assumptions (8), (10) and (49) hold. For the update (3), we have

where vK\|{\bf v}\|_{\mathcal{K}} is as in (5a).

The result is obtained as in the proof of Lemma 6, and using the identity g(x(k))=g(x(k))F(x(k))+F(x(k)){\bf g}({\bf x}^{(k)})={\bf g}({\bf x}^{(k)})-\nabla F({\bf x}^{(k)})+\nabla F({\bf x}^{(k)}), and noting that Δk+1T(g(x(k))F(x(k)))g(x(k))F(x(k))KΔk+1|\Delta_{k+1}^{T}\left({\bf g}({\bf x}^{(k)})-\nabla F({\bf x}^{(k)})\right)|\leq\|{\bf g}({\bf x}^{(k)})-\nabla F({\bf x}^{(k)})\|_{\mathcal{K}}\|\Delta_{k+1}\|.

As in Section A.1.1, it is possible to have similar results as in Lemma 8 by replacing (10) with its local counterpart (11). The proof is omitted as it is similar to the proof of previous lemmas.

Let Assumptions (8), (11), (50), and (51) hold. For the update (3), if x(k){\bf x}^{(k)} satisfies (52), we have

It is even possible to relax Assumption (11a) and give a more general result where the gradient approximation error is measured only at a local optimum x{\bf x}^{*}.

Let Assumptions (11b), (50) and (51) hold and let

denote the Jacobian of g{\bf g} at x{\bf x}. Also assume that for some T0T\geq 0,

For the update (3) with αk=1\alpha_{k}=1, if x(k){\bf x}^{(k)} satisfies (52), we have

As in Lemma 6, let Δk=x(k)x\Delta_{k}=\|{\bf x}^{(k)}-{\bf x}^{*}\|. We get,

where, for the last inequality, we used the fact that, by first order optimality condition, we have Δk+1TF(x)0\Delta_{k+1}^{T}\nabla F({\bf x}^{*})\geq 0. Using mean value theorem, we get

Now the same reasoning as in the end of the proof of Lemma 7 gives the result.

A.2.2 Main proofs

The results are immediately obtained using Lemmas 8 and 9.

First note that since H(x(k))H({\bf x}^{(k)}) and g(x(k)){\bf g}({\bf x}^{(k)}) are not independent, the joint probability that they are within the desired tolerance, by sub-additivity of probability, is lower bounded by 12δ1-2\delta. Now by construction, Jg(y)=H(y)J_{{\bf g}}({\bf y})=H({\bf y}), and the result follows by Assumption (8) on Lipschitz continuity of 2fi(x)\nabla^{2}f_{i}({\bf x}) and Lemma 10 with T=LT=L.

Using Theorem 11, the particular choice of ϵ1\epsilon_{1} and ϵ2(k)=ρkϵ2\epsilon^{(k)}_{2}=\rho^{k}\epsilon_{2}, for each kk, gives

where η(k)=ρη(k1)\eta^{(k)}=\rho\eta^{(k-1)} and η(0)ρ1σ\eta^{(0)}\leq\rho_{1}\sigma. We prove the result by induction on kk. Define Δk:=x(k)x\Delta_{k}\mathrel{\mathop{:}}={\bf x}^{(k)}-{\bf x}^{*}. For k=0k=0, using Assumption (40) and noting that by the definition of σ\sigma

Now assume that (41) holds for kk. For k+1k+1, we get

Under the local regularity assumptions (11), we can see by induction that for every kk

so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.

First note that by (42) and (43), we get that

The proof is again by induction using Theorem 12. First note that, by construction and the value of ϵ\epsilon,

and ξ(k)ξ(k1)\xi^{(k)}\leq\xi^{(k-1)}. For k=0k=0, we have

where the last inequality follows from the definition (43) and noting that

Now assume that (41) holds for kk. For k+1k+1, we get

Under the local regularity assumption (11b), we can see by induction that for every kk

so the condition (53) is satisfied. The overall success probability is computed as in the end of the proof of Theorem 2.

First, as in the proof of Theorem 14, by (42) and (43) , we get

Now we note that η(0)ρ0σ\eta^{(0)}\leq\rho_{0}\sigma, η(k)ρkη(k1)\eta^{(k)}\leq\rho^{k}\eta^{(k-1)}, ξ(k)ξ(k1)\xi^{(k)}\leq\xi^{(k-1)} and

Again, the result is obtained by induction on kk. Define Δk:=x(k)x\Delta_{k}\mathrel{\mathop{:}}={\bf x}^{(k)}-{\bf x}^{*}. For the base case of k=0k=0, we have

where for the inequality, we used the definition of σ\sigma and noting that

Now assume that (44) holds for kk. For k+1k+1, we have