The Power of Normalization: Faster Evasion of Saddle Points
Kfir Y. Levy
Introduction
Owing to its efficiency, simplicity and intuitive interpretation, Gradient Descent (GD) and its numerous variants are often the method of choice in large-scale optimization tasks, including neural network optimization Bengio (2009), ranking Burges et al. (2005), matrix completion Jain et al. (2010), and reinforcement learning Sutton et al. (1999). Normalized Gradient Descent (NGD) is a less popular alternation of GD, which enjoys the same efficiency and simplicity.
Exploring the limitations of applying GD to non-convex tasks is an important an active line of research. Several phenomena have been found to prevent GD from attaining satisfactory results. Among them are local-minima, and saddle points. Local minima might entrap GD, preventing it from reaching a possibly better solution. Gradients vanish near saddle points, which causes GD to stall. This saddle phenomenon was recently studied in the context of deep learning, where it was established both empirically and theoretically that saddles are prevalent in such tasks, and cause GD to decelerate Saad and Solla (1995); Saxe et al. (2015); Dauphin et al. (2014); Choromanska et al. (2015). Several empirical studies suggest that arriving at local minima is satisfactory for deep learning tasks: in Dauphin et al. (2014), it is asserted that all local minima are of approximately the same quality as the global one. Additionally, in Choromanska et al. (2015) it is argued that global minima may cause overfitting, while local minima are likely to yields sufficient generalization. Very recently, evading saddles was found to be crucial in other non-convex applications, among them are complete dictionary recovery Sun et al. (2015), phase retrieval Sun et al. (2016), and Matrix completion Ge et al. (2016).
Our experimental study demonstrates the benefits of using Saddle-NGD for the setting of online tensor decomposition. The tensor decomposition optimization problem contains both local minima and saddle points with the interesting property that any local minimum is also a global minimum.
Non-convex optimization problems are in general NP-hard and thus most literature on iterative optimization methods focuses on local guarantees. The most natural guarantee to expect GD in this context is to approach a stationary point, namely a point where gradients vanish. This kind of guarantees is provided in Allen-Zhu and Hazan (2016); Ghadimi and Lan (2013), which focus on the stochastic setting. Approaching local minima while evading saddle points is much more intricate than ensuring stationary points. In Dauphin et al. (2014); Sun et al. (2016), it is demonstrated how to avoid saddle points through a trust region approach which utilizes second order derivatives. A perturbed version of GD was explored in Ge et al. (2015), showing both theoretically and empirically that it efficiently evades saddles. Very recently, it was shown in Agarwal et al. (2016) how to use second order methods in order to rapidly approach local minima. Nevertheless, this approach is only practical in settings where Hessian-vector products can be done efficiently.
NGD is well known to guarantee convergence in the offline settings of convex and quasi-convex optimization Nesterov (1984); Kiwiel (2001) . Recently, a stochastic version of NGD was explored in the context of stochastic quasi-convex optimization, Hazan et al. (2015).
Setting and Assumptions
We focus on the optimization of strict-saddle functions, a family of multi-modal functions which encompasses objectives that acquire saddle points. Interestingly, Ge et al. (2015) have found that tensor decomposition problems acquire this strict-saddle property.
There exists a local minimum such that , and the function restricted to a neighbourhood of is -strongly-convex.
We also assume that is -smooth, meaning:
Lastly, we assume that has -Lipschitz Hessians:
Finally, we recall the definition of strong-convexity:
Saddle-NGD algorithm and Main Results
In Algorithm 1 we present Saddle-NGD, an algorithm that is adapted to handle strict-saddle functions. The main difference from GD is the use of the direction of the gradients rather than the gradients themselves. Note that most updates are noiseless, yet once every rounds we add zero mean gaussian noise , here is the -dimensional identity matrix. The noisy updates ensure that once we arrive near a saddle, the direction with most negative eigenvalue will be sufficiently large. This in turn implies a fast evasion of the saddle.
Following is the main theorem of this paper:
In fact, since strict-saddle functions are strongly-convex in a radius around local minima, our ultimate goal should be arriving -close to such minima. Then we could use the well known convex optimization machinery to rapidly converge. The following corollary formalizes the number of rounds required to reach -close to some local minimum:
In the Stochastic Optimization setting, an exact gradient for the objective function is unavailable, and we may only access noisy (unbiased) gradient estimates. This setting captures some of the most important tasks in machine learning, and is therefore the subject of an extensive study.
Our Saddle-NGD algorithm for offline optimization of strict-saddle functions could be extended to the stochastic setting. This could be done by using minibatches in order to calculate the gradient, i.e. instead of appearing in Algorithm 1, we use:
here are independent and unbiased estimates of . Except for this alternation, the stochastic version of Saddle-NGD is similar to the one presented in Algorithm 1. We have found that in order to ensure convergence in the stochastic case, a minibatch is required. This dependence implies that in the stochastic setting Saddle-NGD obtains no better guarantees than noisy-GD. We therefore omit the proof for the stochastic setting.
Nevertheless, we have found that in practice, the stochastic version of Saddle-NGD demonstrates a superior performance compared to noisy-GD. Even when we employ a moderate minibatch size. This is illustrated in Section 6 where Saddle-NGD is applied to the task of online tensor decomposition.
Analysis Overview
Our analysis of Algorithm 1 divides according to the three scenarios defined by the strict-saddle property. Here we present the main statements regarding the guarantees of the algorithm for each scenario, and provide a proof sketch of Theorem 3. For ease of notation we assume that we reach at each scenario at .
In case that the gradientNote the use of the notation here and in the rest of the paper. is sufficiently large, the following ensures us to improve by value within one step:
Suppose that , and we use the Saddle-NGD Algorithm, then:
The next lemma ensures that once we arrive close enough to a local minimum we will remain in its proximity.
The next statement describes the improvement attained by Saddle-NGD near saddle points.
Theorem 3 is based on the above three lemmas. The full proof of the Theorem appears in Appendix A. Next we provide a short sketch of the proof.
Analysis
Here we prove the main statements regrading the three scenarios defined by the strict-saddle property. Section 5.1 analyses the scenario of large gradients (Lemma 5), Section 5.2 analyses the local-minimum scenario (Lemma 6), and Section 5.3 analyses the case of saddle points (Lemma 7). For brevity we do not always provide full proofs, which are deferred to appendix.
We will prove assuming a noisy update, i.e. and (the noiseless update case is similar). By the update rule:
2 Local Minimum
For brevity we will not prove Lemma 6, but rather state and prove a simpler lemma assuming all updates are noiseless; the proof of Lemma 6 regarding the general case appears in Appendix B.
Suppose that is close to a local minimum , i.e., , and . Then the following holds for any :
Due to the local strong-strong convexity of around , we know that . In order to be consistent with the definition of strict-saddle property we choose such that .
The proof requires the following lemma regarding strongly-convex functions:
We are now ready to prove Lemma 8 by induction, assuming all updates are noiseless, i.e. . Note that the case naturally holds, next we discuss the case where . First assume that , the noiseless Saddle-NGD update rule implies:
here the first inequality uses Lemma 9, the second inequality uses which follows from smoothness, and the third inequality uses .
For the case where similarly to the above we can conclude that:
we use , which follows by the local strong-convexity around . ∎
3 Saddle Points
We first provide some intuition regarding the benefits of using NGD rather than GD in escaping saddles. Later we present a short proof sketch of Lemma 7.
Lemma 7 states the decrease in function values attained by the Saddle-NGD near saddle points. Intuitively, Saddle-NGD implicitly performs an approximate power method over the Hessian matrix . Since the gradients near saddle points tend to be small the use of normalized gradients rather than the gradients themselves yields a faster improvement. Consider the minimization of a pure saddle function: . As can be seen in Figure 1, the gradients are almost vanishing around the saddle point . Conversely, the normalized gradients (Figure 1) are of a constant magnitude. This intuitively suggets that using NGD instead of (noisy) GD yields a faster escape of the saddle. Figure 1 compares between NGD and (noisy) GD for the pure saddle function; both methods are initialized in the proximity of , and employ a learning rate of . As expected, NGD attains a much faster initial improvement. Later, when the gradients are sufficiently large, GD prevails. Since our goal is the optimization of a general family of functions where saddles behave like pure saddles only locally, we are mostly concerned about the initial local improvement in the case of a pure saddles. This renders NGD more appropriate than GD for our goal.
Let be a point such that , and also let . Letting , it can be shown that the Saddle-NGD update rule implies:
The noise injections utilized by Saddle-NGD ensure that the additive term, , does not interfere with the increase of the components related to the negative eigenvectors. Since is bounded, a careful choice of the noise magnitude ensures that there is a sufficiently large initial component in these directions. ∎
Analysis
Before proving Lemma 7 we introduce some notation and establish several lemmas regarding the dynamics of the Saddle-NGD update rule near a saddle point.
Let be an orthonormal eigen-basis of with eigenvalues :
Also assume without loss of generality that is the direction with the most negative eigenvector, i.e. . We represent each point in the eigen-basis of , i.e.,
Denoting ; the NGD update rule: , translates coordinate-wise as follows:
The above enables to relate the gradient of the original function to the gradient of the approximation:
the above uses the -Lipschitzness of the Hessian.
Next we show that whenever then we have improved by value.
Suppose we are in a saddle, and then for the first such that , the following holds w.p.
The Lemma follows directly by Lemmas 14, 15. ∎
In Appendix C we provide the complete proofs for the statements that appear in this section.
Experiments
In many challenging machine learning tasks, first and second order moments are not sufficient in order to extract the underlying parameters of the problem; and higher order moments are required. Such problems include Gaussian Mixture Models (GMMs), Independent Component Analysis (ICA), Latent Dirichlet Allocation (LDA), and more. Tensors of order greater than may capture such high order statistics, and their decomposition enables to reveal the underlying parameters of the problem (see Anandkumar et al. (2014) with references therein). Thus, tensor decomposition methods have been extensively studied over the years Harshman (1970); Kolda (2001); Anandkumar et al. (2014). While most studies of tensor decomposition methods have focused on the offline setting, Ge et al. (2015) recently proposed a new setting of online tensor decomposition which is more appropriate for big data tasks.
Tensor decomposition is an intriguing multi-modal optimization task, which provably acquires many saddle points. Interestingly, every local minimum is also a global minimum for this task. Thus we decided to focus our experimental study on this task.
In what follows, we briefly review tensors and the online decompositions task. Then we present our experimental results comparing our method to noisy GD, which was proposed in Ge et al. (2015).
and we call the components of . Given that has an orthogonal decomposition, the offline decomposition problem is to find its components (which are unique up to permutations and sign flips).
Ge et al. (2015) suggest to calculate the components of a decomposable tensor by minimizing the following strict-saddle objective:
In many machine learning tasks, data often arrives sequentially (sampled from some unknown distribution). When the dimensionality of the problem is large, it is desirable to store only small batches of the data. In many such tasks where tensor decomposition is required, data samples arrive from some unknown distribution . And we aim at decomposing a tensor , which is often an expectation over mulitilinear operators , i.e. . Using the linearity of multilinear maps, and the objective appearing in Equation (5), we can formulate such problems as follows:
where .
We adopt the online tensor decomposition setting for ICA suggested above, and present experimental results comparing our method to the noisy GD method of Ge et al. (2015). We take , and our performance measure is the reconstruction error defined as:
In our experiments, both methods use a minibatch of size to calculate gradient estimates. Moreover, both methods employ the following learning rate rule:
In Figure 2 we present our resultsNote that we have repeated each experiment times. Figure 2 presents the reconstruction error averaged over these 10 runs, as well as error bars, for three values of initial learning rates . As can be seen in all three cases, noisy-GD obtains a faster initial improvement. Yet, at some point the error obtained by Saddle-NGD decreases sharply, and eventually Saddle-NGD outperforms noisy-GD. We found that this behaviour persisted when we employed learning rate rules other than (7).This behaviour also persisted when have employed several noise injection magnitudes to both algorithms. Note that for Saddle-NGD outperforms noisy-GD within iterations, for this only occurs after iterations.
Discussion
We have demonstrated both empirically and theoretically the benefits of using normalized gradients rather than gradients for an intriguing family of non-convex objectives. It is natural to ask what are the limits of the achievable rates for evading saddles. Concretely, we ask wether we can do better than Saddle-NGD using only first order information.
Acknowledgement
I would like to thank Elad Hazan for many useful discussions during the early stages of this work.
References
Appendix A Proof of Theorem 3
Let us define the following sets with respects to the three scenarios of the strict saddle property:
Define the following sequence of stopping times with , and
The second inequality holds by the following Corollary of Lemma 7
Now define a series of events as follows . Note that and therefore . Now consider
Summing the above equation over we conclude that:
Appendix B Proof of Lemma 6 (Local Minimum)
Here we prove Lemma 6 regarding the local minimum for the general case which includes the noisy updates. In Section B.1 we prove Lemma 9 which is used during the proof of Lemmas 6,8.
Let us first bound the increase in the square distance to due to the first noisy update.
Let , and suppose , the Saddle-NGD update rule implies:
here in the first inequality we use Lemma 9, the second inequality uses which follows from smoothness, and the last inequality uses .
Combining Equations (B),(B) it follows that within the total of rounds of noiseless updates the distance to the local minimum must decrease to . The rest of the proof goes along the same lines as the proof of Lemma 8. ∎
Writing the strong-convexity inequality for at both , and using , which follows by the optimality of , we have:
summing the above equations the lemma follows. ∎
Appendix C Omitted Proofs from Section 5.3 (Saddle Analysis)
Denote , then coordinate-wise the NGD rule translates to
First part: First we show that the following always applies:
We will prove Equation (18) by showing the following to hold for all :
We will now prove Equation (19) by induction. The base case clearly holds. Now assume by the induction assumption that it holds for , then we divide into two cases:
Case 1: Suppose that . Since can not change by more than in each round then the following holds:
thus the induction hypothesis of Equation (19) holds in this case.
Case 2: Suppose that . In this case our induction hypothesis asserts:
If then using Equation (17), we conclude that:
thus by Equation (C.1), the induction hypothesis holds.
If , then since each coordinate does not change more than in each iterartion then we have:
and again, the induction hypothesis holds.
Second part: Here we show that whenever then the following applies:
We will now show by induction that Equation (21) holds for any . The base case for clearly holds. Now assume by the Induction Hypothesis that . We now divide into two cases:
Case 1: Suppose that . Since each coordinate does not change by more than in each round, then:
Case 2: Suppose that . Recall that , and , thus:
In this case, a similar analysis to the one appearing in Equation (C.1), shows:
Combining Equations (18),(21), establishes the Lemma. ∎
C.2 Proof of Lemma 11
Denote , then coordinate-wise the NGD rule translates to
First part: First we show that the following always applies for some :
Equation (23) follows since by the Saddle-NGD update rule the magnitude of the query points can not decrease by more than over rounds.
Second part: Here we show that the following applies whenever :
We will prove Equation (24), by induction. Clearly the base case holds. Now assume that . Since , and , then we necessarily have:
where we used . Hence,
and the induction hypothesis holds. Combining Equations (23), (24), proves the lemma. ∎
C.3 Proof of Corollary 12
where the first inequality uses Lemma 10, and the last inequality uses , and .
Low : Suppose that . Denote , and note that this expression is maximized when .
Combining Equations (C.3),(C.3), establishes the lemma. ∎
C.4 Proof of Corollary 13
Using Equation (27) we divide into two cases: Case 1: Suppose that , then by Equation (27) we have and therefore
Case 2: Suppose that , using Equation (27) we get:
C.5 Proof of Lemma 14
Following is the key relation that enables us to prove the lemma:
where we denote .
Using the Saddle-NGD update rule inside Equation (C.5) we obtain:
Let us look at the gradient component in the most negative direction :
Next we will show that taking that fulfill the following two conditions imply that the magnitude of the gradient rises beyond within less than rounds:
Assume by contradiction that the gradient does not rise beyond for that fulfill the above conditions. We will now show by induction that the following holds for any :
The above clearly holds for . Assume it holds for , and we will now show that is holds for . By Equation (30) we have:
here the second inequality uses , which is our contradiction assumption. The third inequality holds since , and by the induction hypothesis , combining these with Equation (31) implies that
Thus the induction hypothesis holds for any . The induction hypothesis implies that within less than rounds the magnitude of the gradient rises beyond , combining this with the condition of Equation (32) contradicts our assumption that the gradient does not rise beyond within less that . We therefore conclude that the gradient rises beyond within less than rounds, for that fulfill conditions (31),(32).
C.6 Proof of Lemma 15
Given , there always exists such that the following holds:
where . Using the above equation, and the Lipschitzness of the Hessian, we may bound the difference between the original function and its quadratic approximation around as follows: