Convergence rates of sub-sampled Newton methods
Murat A. Erdogdu, Andrea Montanari
Introduction
in a batch setting, where is assumed to be much larger than . Most machine learning models can be expressed as above, where each function corresponds to an observation. Examples include logistic regression, support vector machines, neural networks and graphical models.
Many optimization algorithms have been developed to solve the above minimization problem using iterative methods [Bis95, BV04, Nes04]. In this paper, we consider the iterations of the following form
where, as , and is the -th largest eigenvalue of the Hessian of at minimizer .
A key challenge is that the sub-sampled Hessian is close to the actual Hessian along the directions corresponding to large eigenvalues (large curvature directions in ), but is a poor approximation in the directions corresponding to small eigenvalues (flatter directions in ). In order to overcome this problem, we use low-rank approximation. More precisely, we treat all the eigenvalues below the -th as if they were equal to the -th. This yields the desired stability with respect to the sub-sample: we call our algorithm NewSamp . In this paper, we establish the following:
NewSamp has a composite convergence rate: quadratic at start and linear near the minimizer, as illustrated in Figure 1. Formally, we prove a bound of the form
with coefficient that are explicitly given (and are computable from data).
The asymptiotic behavior of the linear convergence coefficient is , for small. The condition number which controls the convergence of GD, has been replaced by the milder . For datasets with strong spectral features, this can be a large improvement, as shown in Figure 1.
The above results are achived without tuning the step-size, in particular, by setting .
The complexity per iteration of NewSamp is with the sample size.
Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling algorithms.
We demonstrate the performance of NewSamp on four datasets, and compare it to the well-known optimization methods.
The rest of the paper is organized as follows: Section 1.1 surveys the related work. In Section 2, we describe the proposed algorithm and provide the intuition behind it. Next, we present our theoretical results in Section 3, i.e., convergence rates corresponding to different sub-sampling schemes, followed by a discussion on how to choose the algorithm parameters. Two applications of the algorithm are discussed in Section 4. We compare our algorithm with several existing methods on various datasets in Section 5. Finally, in Section 6, we conclude with a brief discussion.
Even a synthetic review of optimization algorithms for large-scale machine learning would go beyond the page limits of this paper. Here, we emphasize that the method of choice depends crucially on the amount of data to be used, and their dimensionality (i.e., respectively, on the parameters and ). In this paper, we focus on a regime in which is large but not so large as to make matrix manipulations (of order to ) impossible. Also is large but not so large as to make batch gradient computation (of order ) prohibitive. On the other hand, our aim is to avoid calculations required by standard Newton method. Examples of this regime are given in Section 4.
In contrast, online algorithms are the option of choice for very large since the computation per update is independent of . In the case of Stochastic Gradient Descent (SGD), the descent direction is formed by a randomly selected gradient [RM51]. Improvements to SGD have been developed by incorporating the previous gradient directions in the current update [SRB13, SHRY13, Bot10, DHS11].
Batch algorithms, on the other hand, can achieve faster convergence and exploit second order information. They are competitive for intermediate . Several methods in this category aim at quadratic, or at least super-linear convergence rates. In particular, Quasi-Newton methods have proven effective [Bis95, Nes04]. Another approach towards the same goal is to utilize sub-sampling to form an approximate Hessian [Mar10, BCNN11, VP12, QRTF15, EM15, Erd15a]. If the sub-sampled Hessian is close to the true Hessian, these methods can approach NM in terms of convergence rate, nevertheless, they enjoy much smaller complexity per update. No convergence rate analysis is available for these methods: this analysis is the main contribution of our paper. To the best of our knowledge, the best result in this direction is proven in [BCNN11] that estabilishes asymptotic convergence without quantitative bounds (exploiting general theory from [GNS09]).
Further improvements have been suggested either by utilizing Conjugate Gradient (CG) methods and/or using Krylov sub-spaces [Mar10, BCNN11, VP12]. Sub-sampling can be also used to obtain an approximate solution, if an exact solution is not required [DLFU13]. Lastly, there are various hybrid algorithms that combine two or more techniques to gain improvement. Examples include, sub-sampling and Quasi-Newton [SYG07, SDPG13, BHNS14], SGD and GD [FS12], NGD and NM [RF10], NGD and low-rank approximation [RaMB08].
NewSamp : A Newton method via sub-sampling and eigenvalue thresholding
In the regime we consider, , there are two main drawbacks associated with the classical second order methods such as Newton’s method. The predominant issue in this regime is the computation of the Hessian matrix, which requires operations, and the other issue is finding the inverse of the Hessian, which requires computation. Sub-sampling is an effective and efficient way of addressing the first issue, by forming an approximate Hessian to exploit curvature information. Recent empirical studies show that sub-sampling the Hessian provides significant improvement in terms of computational cost, yet preserves the fast convergence rate of second order methods [Mar10, VP12, Erd15b]. If a uniform sub-sample is used, the sub-sampled Hessian will be a random matrix with expected value at the true Hessian, which can be considered as a sample estimator to the mean. Recent advances in statistics have shown that the performance of various estimators can be significantly improved by simple procedures such as shrinkage and/or thresholding [CCS10, DGJ13, GD14, GD14]. To this extent, we use a specialized low-rank approximation as the important second order information is generally contained in the largest few eigenvalues/vectors of the Hessian. We will see in Section 3, how this procedure provides faster convergence rates compared to the bare sub-sampling methods.
which is simply the sum of a scaled identity matrix and a rank- matrix. Note that the low-rank approximation that is suggested to improve the curvature estimation has been further utilized to reduce the cost of computing the inverse matrix. Final per-iteration cost of NewSamp will be . NewSamp takes the parameters and as inputs. We discuss in Section 3.4, how to choose these parameters near-optimally, based on the theory we develop in Section 3.
Operator projects the current iterate to the feasible set using Euclidean projection. Throughout, we assume that this projection can be done efficiently. In general, most unconstrained optimization problems do not require this step, and can be omitted. The purpose of projected iterations in our algorithm is mostly theoretical, and will be clear in Section 3.
Theoretical results
In this section, we provide the convergence analysis of NewSamp based on two different sub-sampling schemes:
Independent sub-sampling: At each iteration , is uniformly sampled from , independently from the sets , with or without replacement.
Sequentially dependent sub-sampling: At each iteration , is sampled from , based on a distribution which might depend on the previous sets , but not on any randomness in the data.
The first sub-sampling scheme is simple and commonly used in optimization. One drawback is that the sub-sampled set at the current iteration is independent of the previous sub-samples, hence does not consider which of the samples were previously used to form the approximate curvature information. In order to prevent cycles and obtain better performance near the optimum, one might want to increase the sample size as the iteration advances [Mar10], including previously unused samples. This process results in a sequence of dependent sub-samples which falls into the sub-sampling scheme S2. In our theoretical analysis, we make the following assumptions:
For any subset , there exists a constant depending on the size of , such that ,
, the Hessian of the function , , is upper bounded by an absolute constant , i.e.,
Assume that the parameter set is convex and is based on sub-sampling scheme S1. Further, let the Assumptions 1 and 2 hold and . Then, for an absolute constant , with probability at least , the updates of the form Eq. (1.2) satisfy
for coefficients and defined as
If the initial point is close to , the algorithm will start with a quadratic rate of convergence which will transform into linear rate later in the close neighborhood of the optimum.
then we have, with probability at least ,
for an absolute constant , for the coefficients and are defined as
NewSamp has a composite convergence rate where and are the coefficients of the linear and the quadratic terms, respectively (See the right plot in Figure 1). We observe that the sub-sampling size has a significant effect on the linear term, whereas the quadratic term is governed by the Lipschitz constant. We emphasize that the case is feasible for the conditions of Theorem 3.2. In the case of quadratic functions, since the Lipschitz constant is 0 , we obtain and the algorithm converges linearly. Following corollary summarizes this case.
2 Sequentially dependent sub-sampling
Here, we assume that the sub-sampling scheme S2 is used to generate . Distribution of sub-sampled sets may depend on each other, but not on any randomness in the dataset. Examples include fixed sub-samples as well as sub-samples of increasing size, sequentially covering unused data. In addition to Assumptions 1-2, we assume the following.
Most statistical learning algorithms can be formulated as above, e.g., in classification problems, one has access to i.i.d. samples where and denote the class label and the covariate, and measures the classification error (See Section 4 for examples). For the sub-sampling scheme S2, an analogue of Lemma 3.1 is stated in Appendix as Lemma A.1, which immediately leads to the following theorem.
for the coefficients and defined as
Compared to the Theorem 3.2, we observe that the coefficient of the quadratic term does not change. This is due to Assumption 1. However, the bound on the linear term is worse, since we use the uniform bound over the convex parameter set . The same order of magnitude is also observed by [Erd15b], which relies on a similar proof technique. Similar to Corollary 3.3, we have the following result for the quadratic functions.
Let the assumptions of Theorem 3.4 hold. Further assume that , the functions are quadratic. Then, conditioned on the event , with probability at least , NewSamp iterates satisfy
for coefficient defined as in Theorem 3.4.
3 Dependence of coefficients on t𝑡t and convergence guarantees
The coefficients and depend on the iteration step which is an undesirable aspect of the above results. However, these constants can be well approximated by their analogues and evaluated at the optimum which are defined by simply replacing with in their definition, where the latter is the -th eigenvalue of full-Hessian at . For the sake of simplicity, we only consider the case where the functions are quadratic.
Assume that the functions are quadratic, is based on scheme S1 and . Let the full Hessian at be lower bounded by a constant . Then for sufficiently large , we have, with probability
for some absolute constants .
Theorem 3.6 implies that, when the sub-sampling size is sufficiently large, will concentrate around . Generalizing the above theorem to non-quadratic functions is straightforward, in which case, one would get additional terms involving the difference . In the case of scheme S2, if one uses fixed sub-samples, i.e., , , then the coefficient does not depend on . The following corollary gives a sufficient condition for convergence. A detailed discussion on the number of iterations until convergence and further local convergence properties can be found in Appendix B.
Assume that and are well-approximated by and with an error bound of , i.e., for , as in Theorem 3.6. For the initial point , a sufficient condition for convergence is
4 Choosing the algorithm parameters
Algorithm parameters play a crucial role in most optimization methods. Based on the theoretical results from previous sections, we discuss procedures to choose the optimal values for the step size , sub-sample size and rank threshold.
Step size: For the step size of NewSamp at iteration , we suggest
where . Note that is the upper bound in Theorems 3.2 and 3.4 and it minimizes the first component of . The other terms in and linearly depend on . To compensate for that, we shrink towards 1. Contrary to most algorithms, optimal step size of NewSamp is larger than 1. See Appendix C for a rigorous derivation of Eq. 3.3.
Sample size: By Theorem 3.2, a sub-sample of size should be sufficient to obtain a small coefficient for the linear phase. Also note that sub-sample size scales quadratically with the condition number.
Rank threshold: For a full-Hessian with effective rank (trace divided by the largest eigenvalue), it suffices to use samples [Ver10, Ver12]. Effective rank is upper bounded by the dimension . Hence, one can use samples to approximate the full-Hessian and choose a rank threshold which retains the important curvature information.
Examples
Finding the maximum likelihood estimator in Generalized Linear Models (GLMs) is equivalent to minimizing the negative log-likelihood ,
The gradient and the Hessian of the above function can be written as:
We note that the Hessian of the GLM problem is always positive definite. This is because the second derivative of the cumulant generating function is simply the variance of the observations. Using the results from Section 3, we perform a convergence analysis of our algorithm on a GLM problem.
Let be a uniform sub-sample, and be a convex parameter set. Assume that the second derivative of the cumulant generating function, is bounded by , and it is Lipschitz continuous with Lipschitz constant . Further, assume that the covariates are contained in a ball of radius , i.e. Then, for given by NewSamp with constant step size at iteration , with probability at least , we have
for constants and defined as
Proof of Corollary 4.1 can be found in Appendix A. Note that the bound on the second derivative is quite loose for Poisson regression due to exponentially fast growing cumulant generating function.
2 Support Vector Machines
A linear Support Vector Machine (SVM) provides a separating hyperplane which maximizes the margin, i.e., the distance between the hyperplane and the support vectors. Although the vast majority of the literature focuses on the dual problem [Vap98, SS02], SVMs can be trained using the primal as well. Since the dual problem does not scale well with the number of data points (some approaches get complexity, [WG11]), the primal might be better-suited for optimization of linear SVMs [KD05, Cha07].
The primal problem for the linear SVM can be written as
For the sake of simplicity, we will focus on Hinge-2 loss. Denote by , the set of indices of all the support vectors at iteration , i.e.,
When the loss is set to be the Hinge-2 loss, the Hessian of the SVM problem, normalized by the number of support vectors, can be written as
When is large, the problem falls into our setup and can be solved efficiently using NewSamp . Note that unlike the GLM setting, Lipschitz condition of our Theorems do not apply here. However, we empirically demonstrate that NewSamp works regardless of such assumptions.
Experiments
In this section, we validate the performance of NewSamp through extensive numerical studies. We experimented on two optimization problems, namely, Logistic Regression (LR) and Support Vector Machines (SVM) with quadratic loss. LR minimizes Eq. 4.1 for the logistic function, whereas SVM minimizes Eq. 4.3 for the Hinge-2 loss.
In the following, we briefly describe the algorithms that are used in the experiments:
Gradient Descent (GD), at each iteration, takes a step proportional to negative of the full gradient evaluated at the current iterate. Under certain regularity conditions, GD exhibits a linear convergence rate.
Accelerated Gradient Descent (AGD) is proposed by Nesterov [Nes83], which improves over the gradient descent by using a momentum term. Performance of AGD strongly depends of the smoothness of the function and decreasing step size adjustments may be necessary for convergence.
Newton’s Method (NM) achieves a quadratic convergence rate by utilizing the inverse Hessian evaluated at the current iterate. However, the computation of Hessian makes it impractical for large-scale datasets.
Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable Quasi-Newton method. Scaling matrix is formed by accumulating the information from iterates and gradients, satisfying Quasi-Newton rule. The convergence rate is locally super-linear and per-iteration cost is comparable to first order methods.
Limited Memory BFGS (L-BFGS) is a variant of BFGS, which uses only the recent iterates and gradients to form the approximate Hessian, providing significant improvement in terms of memory usage.
Stochastic Gradient Descent (SGD) is a simplified version of GD where, at each iteration, instead of the full gradient, a randomly selected gradient is used. Per-iteration cost is independent of , yet the convergence rate is significantly slower compared to batch algorithms. We follow the guidelines of [Bot10, SHRY13] for the step size,, i.e.,
Adaptive Gradient Scaling (AdaGrad) is an online algorithm which uses an adaptive learning rate based on the previous gradients. AdaGrad significantly improves the performance and stability of SGD [DHS11]. This is achieved by scaling each entry of gradient differently. , i.e., at iteration step , step size for the -th coordinate is
For each of the batch algorithms, we used constant step size, and for all the algorithms, we choose the step size that provides the fastest convergence. For the stochastic algorithms, we optimized over the parameters that define the step size. Parameters of NewSamp are selected following the guidelines described in Section 3.4.
The effects of sub-sampling size and rank threshold are demonstrated in Figure 1. A thorough comparison of the aforementioned optimization techniques is presented in Figure 2. In the case of LR, we observe that stochastic algorithms enjoy fast convergence at start, but slows down later as they get close to the true minimizer. The algorithm that comes close to NewSamp in terms of performance is BFGS. In the case of SVM, Newton’s method is the closest algorithm to NewSamp, yet in all scenarios, NewSamp outperforms its competitors. Note that the global convergence of BFGS is not better than that of GD [Nes04]. The condition for super-linear rate is for which, an initial point close to the optimum is required [DM77]. This condition can be rarely satisfied in practice, which also affects the performance of the other second order methods. For NewSamp , even though the rank thresholding provides a certain level of robustness, we observed that the choice of a good starting point is still an important factor. Details about Figure 2 can be found in Table 3 in Appendix. For additional experiments and a detailed discussion, see Appendix D.
Conclusion
In this paper, we proposed a sub-sampling based second order method utilizing low-rank Hessian estimation. The proposed method has the target regime and has complexity per-iteration. We showed that the convergence rate of NewSamp is composite for two widely used sub-sampling schemes, i.e., starts as quadratic convergence and transforms to linear convergence near the optimum. Convergence behavior under other sub-sampling schemes is an interesting line of research. Numerical experiments on both real and synthetic datasets demonstrate the performance of the proposed algorithm which we compared to the classical optimization methods.
Acknowledgments
We are grateful to Mohsen Bayati for stimulating conversations on the topic of this work. We would like to thank Robert M. Gower for carefully reading this manuscript and providing valuable feedback. A.M. was partially supported by NSF grants CCF-1319979 and DMS-1106627 and the AFOSR grant FA9550-13-1-0036.
References
Appendix A Proofs of Theorems and Lemmas
Note that the first term on the right hand side governs the convergence behavior of the algorithm.
By the triangle inequality, the governing term that determines the convergence rate can be bounded as
In the following, we will use some matrix concentration results to bound the right hand side of Eq. (A.1). The result for sampling with replacement can be obtained by matrix Hoeffding’s inequality given in [Tro12]. Note that this explicitly assumes that the samples are independent. For the concentration bounds under sampling without replacement (see i.e. [GN10, Gro11, MJC+14]), we will use the Operator-Bernstein inequality given in [GN10] which is provided in Section E as Lemma E.3 for convenience.
Using any indexing over the elements of sub-sample , we denote the each element in by , i.e.,
Next, we apply the matrix Bernstein’s inequality given in Lemma E.3. For , and ,
Therefore, to obtain a convergence rate of , we let
where is sufficient. We also note that the condition on is trivially satisfied by our choice of in the target regime.
First inequality follows from the fact that norm of an integral is less than or equal to the integral of the norm. Second inequality follows from the Lipschitz property.
Combining the above results, we obtain the following for the governing term in Eq.(A.1): For some absolute constants , with probability at least , we have
Assume that the parameter set is convex and is based on sub-sampling scheme S2. Further, let the Assumptions 1, 2 and 3 hold, almost surely. Then, for some absolute constants , with probability at least , the updates of the form stated in Eq. (1.2) satisfy
for coefficients defined as
The first part of the proof is the same as Lemma 3.1. We carry our analysis from Eq.(A.1). Note that in this general set-up, the iterates are random variables that depend on the random functions. Therefore, we use a uniform bound for the right hand side in Eq.(A.1). That is,
By the Assumption 1, given such that , we have,
Next, we will use a covering net argument to obtain a bound on the matrix empirical process. Note that similar bounds on the matrix forms can be obtained through other approaches like chaining as well [DE15]. Let be a -net over the convex set . By the above inequality, we obtain
Now we will argue that the right hand side is small with high probability using the matrix Hoeffding’s inequality from [Tro12]. By the union bound over , we have
For the first term on the right hand side, by Lemma E.1, we write:
By the Assumption (2), we have the same bounds as in Eq.(A.2). Hence, for and , by the matrix Hoeffding’s inequality [Tro12],
We would like to obtain an exponential decay with a rate of at least . Hence, we require,
which gives the optimal value of as
Therefore, we conclude that for the above choice of , with probability at least , we have
Applying this result to the inequality in Eq.(A.5), we obtain that with probability at least ,
The right hand side of the above inequality depends on the net covering diameter . We optimize over using Lemma E.5 which provides for
we obtain that with probability at least ,
Combining this with the bound stated in Eq.(A.1), we conclude the proof. ∎
By the Weyl’s and matrix Hoeffding’s [Tro12] inequalities (See Eq. (A.3) for details), we can write
for some constants and . ∎
Observe that , and . For an index set , we have
Therefore, the Assumption 1 is satisfied with the Lipschitz constant Moreover, by the inequality
the Assumption 2 is satisfied for We conclude the proof by applying Theorem 3.2. ∎
Appendix B Properties of composite convergence
and assume that the algorithm gets a composite convergence rate as
where denote the coefficients of linearly and quadratically converging terms, respectively.
We state the following theorem on the local convergence properties of compositely converging algorithms.
For a compositely converging algorithm as in Eq. (B.1) with coefficients , if the initial distance satisfies , then we have
The above theorem states that the local convergence of a compositely converging algorithm will be dominated by the linear term.
The condition on the initial point implies that as . Hence, for any given , there exists a positive integer such that , we have . For such values of , we write
Taking the limit on both sides concludes the proof. ∎
B.2 Number of iterations
The total number of iterations, combined with the per-iteration cost, determines the total complexity of an algorithm. Therefore, it is important to derive an upper bound on the total number of iterations of a compositely converging algorithm.
For a compositely converging algorithm as in Eq. (B.1) with coefficients , assume that the initial distance satisfies and for a given tolerance , define the interval
Then the total number of iterations needed to approximate the true minimizer with tolerance is upper bounded by , where
We have as by the condition on initial point . Let be a real number and be the last iteration step such that . Then ,
Therefore, in this regime, the convergence rate of the algorithm is dominated by a quadratically converging term with coefficient . The total number of iterations needed to attain a tolerance of is upper bounded by
When , namely , we have
In this regime, the convergence rate is dominated by a linearly converging term with coefficient . Therefore, the total number of iterations since until a tolerance of is reached can be upper bounded by
Hence, the total number of iterations needed for a composite algorithm as in Eq. B.1 to reach a tolerance of is upper bounded by
The above statement holds for any . Therefore, we minimize over the set . ∎
In most optimization algorithms, step size plays a crucial role. If the dataset is so large that one cannot try out many values of the step size. In this section, we describe an efficient and adaptive way for this purpose by using the theoretical results derived in the previous sections.
In the proof of Lemma 3.1, we observe that the convergence rate of NewSamp is governed by the term
If we optimize the above quantity over , we obtain the optimal step size as
By the Lipschitz continuity of eigenvalues , we write
Similarly, for the minimum eigenvalue, we can write
We are more interested in the case where . In this case, we suggest the step size for the iteration step as
Appendix D Further experiments and details
In this section, we present the details of the experiments presented in Figure 2 and provide additional simulation results.
We first start with additional experiments. The goal of this experiment is to further analyze the effect of rank in the performance of NewSamp . We experimented using -spiked model for . The case was already presented in Figure 2, which is included in Figure 3 to ease the comparison. The results are presented in Figures 3 and the details are summarized in Table 2. In the case of LR optimization, we observe through Figure 3 that stochastic algorithms enjoy fast convergence in the beginning but slows down later as they get close to the true minimizer. The algorithms that come closer to NewSamp in terms of performance are BFGS and LBFGS. Especially when , performance of BFGS and that of NewSamp are similar, yet NewSamp still does better. In the case of SVM optimization, the algorithm that comes closer to NewSamp is Newton’s method.
We further demonstrate how the algorithm coefficients and between datasets in Figure 4.
Appendix E Useful lemmas
A similar proof appears in [VdVW96]. The set can be contained in a -dimensional cube of size . Consider a grid over this cube with mesh width . Then can be covered with at most many cubes of edge length . If ones takes the projection of the centers of such cubes onto and considers the circumscribed balls of radius , we may conclude that can be covered with at most
Let be a symmetric matrix, and let be an -net over . Then,
Given its size, let denote a uniformly random sample from with or without replacement. Then we have
Let be a random variable with a density function and cumulative distribution function . If , then,
Since , we have
we have .
Since and is a monotone increasing function, the above inequality condition is equivalent to
Now, we define the function for . is continuous and invertible on . Note that is also a continuous and increasing function for . Therefore, we have
Observe that the smallest possible value for would be simply the square root of . For simplicity, we will obtain a more interpretable expression for . By the definition of , we have
Since the condition on and enforces to be larger than 1, we obtain the simple inequality that
Using the above inequality, if satisfies