A Generic Approach for Escaping Saddle points

Sashank J Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J Smola

Introduction

We study nonconvex finite-sum problems of the form

In the large-scale settings, algorithms based on first-order information of functions $f_{i}$ are typically favored as they are relatively inexpensive and scale seamlessly. An algorithm widely used in practice is stochastic gradient descent (Sgd), which has the iterative update:

where $i_{t}\in[n]$ is a randomly chosen index and $\eta_{t}$ is a learning rate. Under suitable selection of the learning rate, we can show that Sgd converges to a point $x$ that, in expectation, satisfies the stationarity condition $\|\nabla f(x)\|\leq\epsilon$ in $O(1/\epsilon^{4})$ iterations . This result has two critical weaknesses: (i) It does not ensure convergence to local optima or second-order critical points; (ii) The rate of convergence of the Sgd algorithm is slow.

A key work that explicitly uses Hessians to obtain faster convergence rates is the cubic regularization (CR) method . In particular, Nesterov and Polyak showed that CR requires $O(1/\epsilon^{3/2})$ iterations to achieve the second-order critical conditions. However, each iteration of CR is expensive as it requires computing the Hessian and solving multiple linear systems, each of which has complexity $O(d^{\omega})$ ( $\omega$ is the matrix multiplication constant), thus, undermining the benefit of its faster convergence. Recently, Agarwal et al. designed an algorithm to solve the CR more efficiently, however, it still exhibits slower convergence in practice compared to first-order methods. Both of these approaches use Hessian based optimization in each iteration, which make them slow in practice.

A second line of work focuses on using Hessian information (or its structure) whenever the method gets stuck at stationary points that are not second-order critical. To our knowledge, the first work in this line is , which shows that for a class of functions that satisfy a special property called “strict-saddle” property, a noisy variant of Sgd can converge to a point close to a local minimum. For this class of functions, points close to saddle points have a Hessian with a large negative eigenvalue, which proves instrumental in escaping saddle points using an isotropic noise. While such a noise-based method is appealing as it only uses first-order information, it has a very bad dependence on the dimension $d$ , and furthermore, the result only holds when the strict-saddle property is satisfied . More recently, Carmon et al. presented a new faster algorithm that alternates between first-order and second-order subroutines. However, their algorithm is designed for the simple case of $n=1$ in (1) and hence, can be expensive in practice.

Inspired by this line of work, we develop a general framework for finding second-order critical points. The key idea of our framework is to use first-order information for the most part of the optimization process and invoke Hessian information only when stuck at stationary points that are not second-order critical. We summarize the key idea and main contributions of this paper below.

Main Contributions: We develop an algorithmic framework for converging to second-order critical points and provide convergence analysis for it. Our framework carefully alternates between two subroutines that use gradient and Hessian information, respectively, and ensures second-order criticality. Furthermore, we present two instantiations of our framework and provide convergence rates for them. In particular, we show that a simple instance of our framework, based on Svrg, achieves convergence rates competitive with the current state-of-the-art methods; thus highlighting the simplicity and applicability of our framework. Finally, we demonstrate the empirical performance of a few algorithms encapsulated by our framework and show their superior performance.

There is a vast literature on algorithms for solving optimization problems of the form (1). A classical approach for solving such optimization problems is Sgd, which dates back at least to the seminal work of . Since then, Sgd has been a subject of extensive research, especially in the convex setting . Recently, new faster methods, called variance reduced (VR) methods, have been proposed for convex finite-sum problems. VR methods attain faster convergence by reducing the variance in the stochastic updates of Sgd, see e.g., . Accelerated variants of these methods achieve the lower bounds proved in , thereby settling the question of their optimality. Furthermore, developed an asynchronous framework for VR methods and demonstrated their benefits in parallel environments.

Most of the aforementioned prior works study stochastic methods in convex or very specialized nonconvex settings that admit theoretical guarantees on sub-optimality. For the general nonconvex setting, it is only recently that non-asymptotic convergence rate analysis for Sgd and its variants was obtained in , who showed that Sgd ensures $\|\nabla f\|\leq\epsilon$ (in expectation) in $O(1/\epsilon^{4})$ iterations. A similar rate for parallel and distributed Sgd was shown in . For these problems, Reddi et al. proved faster convergence rates that ensure the same optimality criteria in $O(n+n^{2/3}/\epsilon^{2})$ , which is an order $n^{1/3}$ faster than GD. While these methods ensure convergence to stationary points at a faster rate, the question of convergence to local minima (or in general to second-order critical points) has not been addressed. To our knowledge, convergence rates to second-order critical points (defined in Definition 1) for general nonconvex functions was first studied by . However, each iteration of the algorithm in is prohibitively expensive since it requires eigenvalue decompositions, and hence, is unsuitable for large-scale high-dimensional problems. More recently, Carmon et al. , Agarwal et al. presented algorithms for finding second-order critical points by tackling some practical issues that arise in . However, these algorithms are either only applicable to a restricted setting or heavily use Hessian based computations, making them unappealing from a practical standpoint. Noisy variants of first-order methods have also been shown to escape saddle points (see ), however, these methods have strong dependence on either $n$ or $d$ , both of which are undesirable.

Background & Problem Setup

We assume that each of the functions $f_{i}$ in (1) is $L$ -smooth, i.e., $\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|$ for all $i\in[n]$ . Furthermore, we assume that the Hessian of $f$ in (1) is Lipschitz, i.e., we have

We must exercise caution while interpreting results pertaining to $(\epsilon,\gamma)$ -second order critical points. Such points need not be close to any local minima either in objective function value, or in the domain of (1). For our algorithms, we use only an Incremental First-order Oracle (IFO) and an Incremental Second-order Oracle (ISO), defined below.

IFO and ISO calls are typically cheap, with ISO call being relatively more expensive. In many practical settings that arise in machine learning, the time complexity of these oracle calls is linear in $d$ . For clarity and clean comparison, the dependence of time complexity on Lipschitz constant $L$ , $M$ , initial point and any polylog factors in the results is hidden.

Generic Framework

In this section, we propose a generic framework for escaping saddle points while solving nonconvex problems of form (1). One of the primary difficulties in reaching a second-order critical point is the presence of saddle points. To evade such points, one needs to use properties of both gradients and Hessians. To this end, our framework is based on two core subroutines: Gradient-Focused-Optimizer and Hessian-Focused-Optimizer.

The idea is to use these two subroutines, each focused on different aspects of the optimization procedure. Gradient-Focused-Optimizer focuses on using gradient information for decreasing the function. On its own, the Gradient-Focused-Optimizer might not converge to a local minimizer since it can get stuck at a saddle point. Hence, we require the subroutine Hessian-Focused-Optimizer to help avoid saddle points. A natural idea is to interleave these subroutines to obtain a second-order critical point. But it is not even clear if such a procedure even converges. We propose a carefully designed procedure that effectively balances these two subroutines, which not only provides meaningful theoretical guarantees, but remarkably also translates into strong empirical gains in practice.

Algorithm 1 provides pseudocode of our framework. Observe that the algorithm is still abstract, since it does not specify the subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer. These subroutines determine the crucial update mechanism of the algorithm. We will present specific instance of these subroutines in the next section, but we assume the following properties to hold for these subroutines.

Here the expectation is over any randomness in subroutine Hessian-Focused-Optimizer. The two conditions ensure that the objective function value, in expectation, never increases and furthermore, decreases with a certain rate when $\lambda_{\min}(\nabla^{2}f(x))\leq-\gamma$ . In general, this subroutine utilizes the Hessian or its properties for minimizing the objective function. Typically, this is the most expensive part of the Algorithm 1 and hence, needs to be invoked judiciously.

The key aspect of these subroutines is that they, in expectation, never increase the objective function value. The functions $g$ and $h$ will determine the convergence rate of Algorithm 1. In order to provide a concrete implementation, we need to specify the aforementioned subroutines. Before we delve into those details, we will provide a generic convergence analysis for Algorithm 1.

Let $\Delta=f(x^{0})-B$ and $\theta=\min((1-p)\epsilon^{2}g(n,\epsilon),ph(n,\epsilon,\gamma))$ . Also, let set $\Gamma$ be the output of Algorithm 1 with Gradient-Focused-Optimizer satisfying G.1 and G.2 and Hessian-Focused-Optimizer satisfying H.1 and H.2. Furthermore, $T$ be such that $T>\Delta/\theta$ .

Suppose the multiset $S=\{i_{1},...i_{k}\}$ are $k$ indices selected independently and uniformly randomly from {1, …, $|\Gamma|$ }. Then the following holds for the indices in $S$ :

$y^{t}$ , where $t\in\{i_{1},...,i_{k}\}$ , is a $(\epsilon,\gamma)$ -critical point with probability at least $1-\max(\Delta/(T\theta),q)$ .

If $k=O(\log(1/\zeta)/\min(\log(\Delta/(T\theta)),\log(1/q)))$ , with at least probability $1-\zeta$ , at least one iterate $y^{t}$ where $t\in\{i_{1},...,i_{k}\}$ is a $(\epsilon,\gamma)$ -critical point.

The proof of the result is presented in Appendix A. The key point regarding the above result is that the overall convergence rate depends on the magnitude of both functions $g$ and $h$ . Theorem 1 shows that the slowest amongst the subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer governs the overall rate of Algorithm 1. Thus, it is important to ensure that both these procedures have good convergence. Also, note that the optimal setting for $p$ based on the result above satisfies $1/p=1/\epsilon^{2}g(n,\epsilon)+1/h(n,\epsilon,\gamma)$ . We defer further discussion of convergence to next section, where we present more specific convergence and rate analysis.

Concrete Instantiations

We now present specific instantiations of our framework in this section. Before we state our key results, we discuss an important subroutine that is used as Gradient-Focused-Optimizer for rest of this paper: Svrg. We give a brief description of the algorithm in this section and show that it meets the conditions required for a Gradient-Focused-Optimizer. Svrg is a stochastic algorithm recently shown to be very effective for reducing variance in finite-sum problems. We seek to understand its benefits for nonconvex optimization, with a particular focus on the issue of escaping saddle points. Algorithm 2 presents Svrg’s pseudocode.

Suppose $\eta_{t}=\eta=1/4Ln^{2/3}$ , $m=n$ and $T_{g}=T_{\epsilon}$ , which depends on $\epsilon$ , then Algorithm 2 is a Gradient-Focused-Optimizer with $g(n,\epsilon)=T_{\epsilon}/40Ln^{2/3}$ .

In rest of this section, we discuss approaches using Svrg as a Gradient-Focused-Optimizer. In particular, we propose and provide convergence analysis for two different methods with different Hessian-Focused-Optimizer but which use Svrg as a Gradient-Focused-Optimizer.

The first approach is based on directly using the eigenvector corresponding to the smallest eigenvalue as a Hessian-Focused-Optimizer. More specifically, when the smallest eigenvalue of the Hessian is negative and reasonably large in magnitude, the Hessian information can be used to ensure descent in the objective function value. The pseudo-code for the algorithm is given in Algorithm 3.

The key idea is to utilize the minimum eigenvalue information in order to make a descent step. If $\lambda_{\min}(\nabla^{2}f(x))\leq-\gamma$ then the idea is to use this information to take a descent step. Note the subroutine is designed in a fashion such that the objective function value never increases. Thus, it naturally satisfies the requirement H.1 of Hessian-Focused-Optimizer. The following result shows that HessianDescent is a Hessian-Focused-Optimizer.

HessianDescent is a Hessian-Focused-Optimizer with $h(n,\epsilon,\gamma)=\frac{\rho}{24M^{2}}\gamma^{3}$ .

The proof of the result is presented in Appendix C. With Svrg as Gradient-Focused-Optimizer and HessianDescent as Hessian-Focused-Optimizer, we show the following key result:

Suppose Svrg with $m=n$ , $\eta_{t}=\eta=1/4Ln^{2/3}$ for all $t\in\{1,...,m\}$ and $T_{g}=40Ln^{2/3}/\epsilon^{1/2}$ is used as Gradient-Focused-Optimizer and HessianDescent is used as Hessian-Focused-Optimizer with $q=0$ , then Algorithm 1 finds a $(\epsilon,\sqrt{\epsilon})$ -second order critical point in $T=O(\Delta/\min(p,1-p)\epsilon^{3/2})$ with probability at least $0.9$ .

The result directly follows from using Lemma 1 and 2 in Theorem 1. The result shows that the iteration complexity of Algorithm 1 in this case is $O(\Delta/\epsilon^{3/2}\min(p,1-p))$ . Thus, the overall IFO complexity of Svrg algorithm is $(n+T_{g})\times T=O(n/\epsilon^{3/2}+n^{2/3}/\epsilon^{2})$ . Since each IFO call takes $O(d)$ time, the overall time complexity of all Gradient-Focused-Optimizer steps is $O(nd/\epsilon^{3/2}+n^{2/3}d/\epsilon^{2})$ . To understand the time complexity of HessianDescent, we need the following result .

Note that each iteration of Algorithm 1 in this case has just linear dependence on $d$ . Since the total number of HessianDescent iterations is $O(\Delta/\min(p,1-p)\epsilon^{3/2})$ and each iteration has the complexity of $O(nd+n^{3/4}d/\epsilon^{1/4})$ , using the above remark, we obtain an overall time complexity of HessianDescent is $O(nd/\epsilon^{3/2}+n^{3/4}d/\epsilon^{7/4})$ . Combining this with the time complexity of Svrg, we get the following result.

The overall running time of Algorithm 1 to find a $(\epsilon,\sqrt{\epsilon})$ -second order critical point, with parameter settings used in Theorem 2, is $O(nd/\epsilon^{3/2}+n^{3/4}d/\epsilon^{7/4}+n^{2/3}d/\epsilon^{2})$ .

Note that the dependence on $\epsilon$ is much better in comparison to that of Noisy SGD used in . Furthermore, our results are competitive with in their respective settings, but with a much simpler algorithm and analysis. We also note that our algorithm is faster than the one proposed in , which has a time complexity of $O(nd/\epsilon^{2})$ .

2 Cubic Descent

In this section, we show that the cubic regularization method in can be used as Hessian-Focused-Optimizer. More specifically, here Hessian-Focused-Optimizer approximately solves the following optimization problem:

and returns $(y,\diamond)$ as output. The following result can be proved for this approach.

Suppose Svrg (same as Theorem 2) is used as Gradient-Focused-Optimizer and CubicDescent is used as Hessian-Focused-Optimizer with $q=0$ , then Algorithm 1 finds a $(\epsilon,\sqrt{\epsilon})$ -second order critical point in $T=O(\Delta/\min(p,1-p)\epsilon^{3/2})$ with probability at least $0.9$ .

In principle, Algorithm 1 with CubicDescent as Hessian-Focused-Optimizer can converge without the use of Gradient-Focused-Optimizer subroutine at each iteration since it essentially reduces to the cubic regularization method of . However, in practice, we would expect Gradient-Focused-Optimizer to perform most of the optimization and Hessian-Focused-Optimizer to be used for far fewer iterations. Using the method developed in for solving CubicDescent, we obtain the following corollary.

The overall running time of Algorithm 1 to find a $(\epsilon,\sqrt{\epsilon})$ -second order critical point, with parameter settings used in Theorem 3, is $O(nd^{\omega}/\epsilon^{3/2}+n^{2/3}d/\epsilon^{2})$ .

Here $\omega$ is the matrix multiplication constant. The dependence on $\epsilon$ is weaker in comparison to Corollary 1. However, each iteration of CubicDescent is expensive (as seen from the factor $d^{\omega}$ in the corollary above) and thus, in high dimensional settings typically encountered in machine learning, this approach can be expensive in comparison to HessianDescent.

3 Practical Considerations

The focus of this section was to demonstrate the wide applicability of our framework; wherein using a simple instantiation of this framework, we could achieve algorithms with fast convergence rates. To further achieve good empirical performance, we had to slightly modify these procedures. For Hessian-Focused-Optimizer, we found stochastic, adaptive and inexact approaches for solving HessianDescent and CubicDescent work well in practice. Due to lack of space, the exact description of these modifications is deferred to Appendix F. Furthermore, in the context of deep learning, empirical evidence suggests that first-order methods like Adam exhibit behavior that is in congruence with properties G.1 and G.2. While theoretical analysis for a setting where Adam is used as Gradient-Focused-Optimizer is still unresolved, we nevertheless demonstrate its performance through empirical results in the following section.

Experiments

We now present empirical results for our saddle point avoidance technique with an aim to highlight three aspects: (i) the framework successfully escapes non-degenerate saddle points, (ii) the framework is fast, and (iii) the framework is practical on large-scale problems. All the algorithms are implemented on TensorFlow . In case of deep networks, the Hessian-vector product is evaluated using the trick presented in . We run our experiments on a commodity machine with Intel ${}^{\text{\textregistered}}$ Xeon ${}^{\text{\textregistered}}$ CPU E5-2630 v4 CPU, 256GB RAM, and NVidia ${}^{\text{\textregistered}}$ Titan X (Pascal) GPU.

Synthetic Problem To demonstrate the fast escape from a saddle point by the proposed method, we consider the following simple nonconvex finite-sum problem:

Here the parameters are designed such that $\sum_{i}b_{i}=0$ and $\sum_{i}A_{i}$ matrix has exactly one negative eigenvalue of $-0.001$ and other eigenvalues randomly chosen in the interval $ $. The total number of examples$ n $is set to be 100,000 and$ d $is$ 1000 $. It is not hard to see that this problem has a non-degenerate saddle point at the origin. This allows us to explore the behaviour of different optimization algorithms in the vicinity of the saddle point. In this experiment, we compare a mix of Svrg and HessianDescent (as in Theorem 2) with Sgd (with constant step size), Adam, Svrg and CubicDescent. The parameter of these algorithms is chosen by grid search so that it gives the best performance. The subproblem of CubicDescent was solved with gradient descent until the gradient norm of the subproblem is reduced below$ 10^{-3}$. We study the progress of optimization, i.e., decrease in function value with wall clock time, IFO calls, and ISO calls. All algorithms were initialized with the same starting point very close to origin.

The results are presented in Figure 2, which shows that our proposed mix framework was the fastest to escape the saddle point in terms of wall clock time. We observe that performance of the first order methods suffered severely due to the saddle point. Note that Sgd eventually escaped the saddle point due to inherent noise in the mini-batch gradient. CubicDescent, a second-order method, escaped the saddle point faster in terms of iterations using the Hessian information. But operating on Hessian information is expensive as a result this method was slow in terms of wall clock time. The proposed framework, which is a mix of the two strategies, inherits the best of both worlds by using cheap gradient information most of the time and reducing the use of relatively expensive Hessian information (ISO calls) by 100x. This resulted in faster escape from saddle point in terms of wall clock time.

Deep Networks To investigate the practical performance of the framework for deep learning problems, we applied it to two deep autoencoder optimization problems from called “CURVES” and “MNIST”. Due to their high difficulty, performance on these problems has become a standard benchmark for neural network optimization methods, e.g. . The “CURVES” autoencoder consists of an encoder with layers of size (28x28)-400-200-100- 50-25-6 and a symmetric decoder totaling in 0.85M parameters. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The data set contains images of curves that were generated from three randomly chosen points in two dimensions. The “MNIST” autoencoder consists of an encoder with layers of size (28x28)-1000-500-250-30 and a symmetric decoder, totaling in 2.8M parameters. The thirty units in the code layer were linear and all the other units were logistic. The network was trained on 60,000 images and tested on 10,000 new images. The data set contains images of handwritten digits 0-9. The pixel intensities were normalized to lie between 0 and 1.Data available at: www.cs.toronto.edu/~jmartens/digs3pts_1.mat, mnist_all.mat

As an instantiation of our framework, we use a mix of Adam, which is popular in deep learning community, and an ApproxCubicDescent for the practical reasons mentioned in Section 4.3. This method with Adam and ApproxCubicDescent. The parameters of these algorithms were chosen to produce the best generalization on a held out test set. The regularization parameter $M$ was chosen as the smallest value such that the function value does not fluctuate in the first 10 epochs. We use the initialization suggested in and a mini-batch size of 1000 for all the algorithms. We report objective function value against wall clock time and ISO calls.

The results are presented in Figure 3, which shows that our proposed mix framework was the fastest to escape the saddle point in terms of wall clock time. Adam took considerably more time to escape the saddle point, especially in the case of MNIST. While ApproxCubicDescent escaped the saddle point in relatively fewer iterations, each iteration required considerably large number of ISO calls; as a result, the method was extremely slow in terms of wall clock time, despite our efforts to improve it via approximations and code optimizations. On the other hand, our proposed framework, seamlessly balances these two methods, thereby, resulting in the fast decrease of training loss.

Discussion

In this paper, we examined a generic strategy to escape saddle points in nonconvex finite-sum problems and presented its convergence analysis. The key intuition is to alternate between a first-order and second-order based optimizers; the latter is mainly intended to escape points that are only stationary but are not second-order critical points. We presented two different instantiations of our framework and provided their detailed convergence analysis. While both our methods explicity use the Hessian information, one can also use noisy first-order methods as Hessian-Focused-Optimizer (see for e.g. noisy Sgd in ). In such a scenario, we exploit the negative eigenvalues of the Hessian to escape saddle points by using isotropic noise, and do not explicitly use ISO. For these methods, under strict-saddle point property , we can show convergence to local optima within our framework.

We primarily focused on obtaining second-order critical points for nonconvex finite-sums (1). This does not necessarily imply low test error or good generalization capabilities. Thus, we should be careful when interpreting the results presented in this paper. A detailed discussion or analysis of these issues is out of scope of this paper. While a few prior works argue for convergence to local optima, the exact connection between generalization and local optima is not well understood, and is an interesting open problem. Nevertheless, we believe the techniques presented in this paper can be used alongside other optimization tools for faster and better nonconvex optimization.

References

Appendix A Proof of Theorem 1

The case of $\tau=\varnothing$ can be handled in a straightforward manner, so let us focus on the case where $\tau=\diamond$ . We split our analysis into cases, each analyzing the change in objective function value depending on second-order criticality of $y^{t}$ .

We start with the case where the gradient condition of second-order critical point is violated and then proceed to the case where the Hessian condition is violated.

This follows from the fact that $y^{t}$ is the output of Gradient-Focused-Optimizer subroutine, which satisfies the condition that for $(y,z)$ = $\textsc{Gradient-Focused-Optimizer}(x,n,\epsilon)$ , we have

The second inequality is due to he non-increasing property of Gradient-Focused-Optimizer. Therefore, when the hessian condition is violated, the objective function value always decreases by at least $ph(n,\epsilon,\gamma)$ .

This is the favorable case for the algorithm. The only condition to note is that the objective function value will be non-increasing in this case too. This is, again, due to the non-increasing properties of subroutines Gradient-Focused-Optimizer and Hessian-Focused-Optimizer. In general, greater the occurrence of this case during the course of the algorithm, higher will the probability that the output of our algorithm satisfies the desired property.

The key observation is that Case I & II cannot occur large number of times since each of these cases strictly decreases the objective function value. In particular, from Equation (6) and (7), it is easy to see that each occurrence of Case I & II the following holds:

where $\theta=\min((1-p)\epsilon^{2}g(n,\epsilon),ph(n,\epsilon,\gamma))$ . Furthermore, the function $f$ is lower bounded by B, thus, Case I & II cannot occur more than $(f(x^{0})-B)/\theta$ times. Therefore, the probability of occurrence of Case III is at least $1-(f(x^{0})-B)/(T\theta)$ , which completes the first part of the proof.

The second part of the proof simply follows from first part. As seen above, the probability of Case I & II is at most $(f(x^{0})-B)/T\theta$ . Therefore, probability that an element of the set $S$ falls in Case III is at least $1-((f(x^{0})-B)/T\theta)^{k}$ , which gives us the required result for the second part.

Appendix B Proof of Lemma 1

The proof follows from the analysis in with some additional reasoning. We need to show two properties: G.1 and G.2, both of which are based on objective function value. To this end, we start with an update in the $s^{\text{th}}$ epoch. We have the following:

The first inequality is due to $L$ -smoothness of the function $f$ . The second inequality simply follows from the unbiasedness of Svrg update in Algorithm 2. For the analysis of the algorithm, we need the following Lyapunov function:

for all $t\in\{0,\cdots,m-1\}$ and $\mu_{m}=0$ . For bounding the Lypunov function $A$ , we need the following bound on the distance of the current iterate from the latest snapshot:

The second equality is due to the unbiasedness of the update of Svrg. The last inequality follows from a simple application of Cauchy-Schwarz and Young’s inequality. Substituting Equation (8) and Equation (9) into the Lypunov function $A^{s+1}_{t+1}$ , we obtain the following:

The second inequality follows from the definition of $\mu_{t}$ and $A_{t}^{s+1}$ . Since $\eta_{t}=\eta=1/(4Ln^{2/3})$ for $j>0$ and $t\in\{0,\dots,j-1\}$ ,

We now prove that $\upsilon_{n}>0$ and also G.2 of Gradient-Focused-Optimizer is satisifed for Svrg algorithm. By using telescoping the sum with $j=m$ in Equation (11), we obtain

Using the above bound on $\theta$ , we get

wherein the second inequality follows upon noting that $(1+\frac{1}{l})^{l}$ is increasing for $l>0$ and $\lim_{l\rightarrow\infty}(1+\frac{1}{l})^{l}=e$ (here $e$ is the Euler’s number). Now we can lower bound $\upsilon_{n}$ , as

The first inequality holds since $\mu_{t}$ decreases with $t$ . The second inequality holds since (a) $\mu_{0}/\beta$ can be upper bounded by $(e-1)/4$ (follows from Equation (B)), (b) $\eta^{2}L\leq\eta/4$ and (c) $2\mu_{0}\eta^{2}\leq(e-1)\eta/8$ (follows from Equation (B)). Substituting the above lower bound in Equation (13), we obtain the following:

From the definition of $(y,z)$ in output of Algorithm 2 i.e., $y$ is Iterate $x_{a}$ chosen uniformly random from $\{\{x^{s+1}_{t}\}_{t=0}^{m-1}\}_{s=0}^{S-1}$ and $z=x^{S}_{m}$ , it is clear that Algorithm 2 satisfies the G.2 requirement of Gradient-Focused-Optimizer with $g(n,\epsilon)=T_{\epsilon}/40Ln^{2/3}$ . Since both G.1 and G.2 are satisified for Algorithm 2, we conclude that Svrg is a Gradient-Focused-Optimizer. ∎

Appendix C Proof of Lemma 2

The first important observation is that the function value never increases because $y=\arg\min_{z\in\{u,x\}}f(z)$ i.e., $f(y)\leq f(x)$ , thus satisfying H.1 of Hessian-Focused-Optimizer. We now analyze the scenario where $\lambda_{min}(\nabla^{2}f(x))\leq-\gamma$ . Consider the event where we obtain $v$ such that

This event (denoted by $\mathcal{E}$ ) happens with at least probability $\rho$ . Note that, since $\lambda_{min}(\nabla^{2}f(x))\leq-\gamma$ , we have $\langle v,\nabla^{2}f(x)v\rangle\leq-\tfrac{\gamma}{2}$ . In this case, we have the following relationship:

The first inequality follows from the $M$ -lipschitz continuity of the Hessain $\nabla^{2}f(x)$ . The first equality follows from the update rule of HessianDescent. The second inequality is obtained by dropping the negative term and using the fact that $\|v\|=1$ . The second equality is obtained by substituting $\alpha=\frac{|v^{T}\nabla^{2}f(x)v|}{M}$ . The last inequality is due to the fact that $\langle v,\nabla^{2}f(x)v\rangle\leq-\tfrac{\gamma}{2}$ . In the other scenario where

we can at least ensure that $f(y)\leq f(x)$ since $y=\arg\min_{z\in\{u,x\}}f(z)$ . Therefore, we have

The last inequality is due to Equation (16). Hence, Hessian-Focused-Optimizer satisfies H.2 of Hessian-Focused-Optimizer with $h(n,\epsilon,\gamma)=\frac{\rho}{24M^{2}}\gamma^{3}$ , thus concluding the proof. ∎

Appendix D Proof of Theorem 3

First note that cubic method is a descent method (refer to Theorem 1 of ); thus, H.1 is trivially satisfied. Furthermore, cubic descent is a Hessian-Focused-Optimizer with $h(n,\epsilon,\gamma)=\frac{2\gamma^{3}}{81M^{3}}\gamma^{3}$ . This, again, follows from Theorem 1 of . The result easily follows from the aforementioned observations.

Appendix E Other Lemmas

The following bound on the variance of Svrg is useful for our proof .

Let $v^{s+1}_{t}$ be computed by Algorithm 2. Then,

We use the definition of $v^{s+1}_{t}$ to get

The inequality follows from the simple fact that $(a+b)^{2}\leq a^{2}+b^{2}$ . From the above inequality, we get the following:

Appendix F Approximate Cubic Regularization

Cubic regularization method of is designed to operate on full batch, i.e., it does not exploit the finite-sum structure of the problem and requires the computation of the gradient and the Hessian on the entire dataset to make an update. However, such full-batch methods do not scale gracefully with the size of data and become prohibitively expensive on large datasets. To overcome this challenge, we devised an approximate cubic regularization method described below:

Pick a mini-batch $\mathcal{B}$ and obtain the gradient and the hessian based on $\mathcal{B}$ , i.e.,

We found that this mini-batch training strategy, which requires the computation of the gradient and the Hessian on a small subset of the dataset, to work well on a few datasets (CURVES, MNIST, CIFAR10). A similar method has been analysed in .

Furthermore, in many deep-networks, adaptive per-parameter learning rate helps immensely . One possible explanation for this is that the scale of the gradients in each layer of the network often differ by several orders of magnitude. A well-suited optimization method should take this into account. This is the reason for popularity of methods like Adam or RMSprop in the deep learning community. On similar lines, to account for different per-parameter behaviour in cubic regularization, we modify the sub-problem by adding a diagonal matrix $M_{d}$ in addition to the scalar regularization coefficient $M$ , i.e.,

Also we devised an adaptive rule to obtain the diagonal matrix as $M_{d}=\mathsf{diag}((s+10^{-12})^{1/9})$ , where $s$ is maintained as a moving average of third order polynomial of the mini-batch gradient $g$ , in a fashion similar to RMSprop and Adam:

where $|g|^{3}$ and $g^{2}$ are vectors such that $[|g|^{3}]_{i}=|g_{i}|^{3}$ and $[g^{2}]_{i}=g_{i}^{2}$ respectively for all $i\in[n]$ . The experiments reported on CURVES and MNIST in this paper utilizes both the above modifications to the cubic regularization, with $\beta$ set to 0.9. We refer to this modified procedure as ACubic in our results.

Appendix G Experiment Details

In this section we provide further experimental details and results to aid reproducibility.

The parameter selection for all the methods were carried as follows:

Sgd: The scalar step-size was determined by a grid search.

Adam: We performed a grid search over $\alpha$ and $\varepsilon$ parameters of Adam tied together, i.e., $\alpha=\varepsilon$ .

Svrg: The scalar step-size was determined by a grid search.

CubicDescent: The regularization parameter $M$ was chosen by grid search. The sub-problem was solved with gradient descent with the step-size of solver to be $10^{-2}$ and run till the gradient norm of the sub-problem is reduced below $10^{-3}$ .

Further Observations The results are presented in Figure 4. The other first order methods like Adam with higher noise could escape relatively faster whereas Svrg with reduced noise stayed stuck at the saddle point.

G.2 Deep Networks

Methods The parameter selection for all the methods were carried as follows::

Adam: We performed a grid search over $\alpha$ and $\varepsilon$ parameters of Adam so as to produce the best generalization on a held out test set. We found it to be $\alpha=10^{-3},\varepsilon=10^{-3}$ for CURVES and $\alpha=10^{-2},\varepsilon=10^{-1}$ for MNIST.

ApproxCubicDescent: The regularization parameter $M$ was chosen as the largest value such function value does not jump in first 10 epochs. We found it to be $M=10^{3}$ for both CURVES and MNIST. The sub-problem was solved with gradient descent with the step-size of solver to be $10^{-3}$ and run till the gradient norm of the sub-problem is reduced below 0.1.