Convex and Non-convex Optimization Under Generalized Smoothness

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, Ali Jadbabaie

Introduction

In this paper, we study the following unconstrained optimization problem

Recently, Zhang et al. (2019) proposed the more general (L0,L1)(L_{0},L_{1})-smoothness condition, which assumes 2f(x)L0+L1f(x)\left\|\nabla^{2}f(x)\right\|\leq L_{0}+L_{1}\left\|\nabla f(x)\right\| for some constants L0,L10L_{0},L_{1}\geq 0, motivated by their extensive language model experiments. This notion generalizes the standard Lipschitz smoothness condition and also contains e.g. univariate polynomial and exponential functions. For non-convex and (L0,L1)(L_{0},L_{1})-smooth functions, they prove convergence of gradient descent (GD) and stochastic gradient descent (SGD) with gradient clipping and also provide a complexity lower bound for constant-stepsize GD/SGD without clipping. Based on these results, they claim gradient clipping or other forms of adaptivity provably accelerate the convergence for (L0,L1)(L_{0},L_{1})-smooth functions. Perhaps due to the lower bound, all the follow-up works under this condition that we are aware of limit their analyses to adaptive methods. Most of these focus on non-convex functions. See Section 2 for more discussions of related works.

Contributions. In light of the above discussions, we summarize our main contributions as follows.

We prove the convergence of constant-stepsize GD/SGD/NAG in the convex and non-convex settings, and obtain the classical rates for all of them, as summarized in Table 1.

Besides the generalized smoothness condition and the new approach, our results are also novel in the following aspects.

The convergence results of constant-stepsize methods challenge the folklore belief on the necessity of adaptive stepsize for generalized smooth functions.

We obtain new convergence results for GD and NAG in the convex setting under the generalized smoothness condition.

We relax the assumption of bounded noise to the weaker one of bounded variance of noise in the stochastic setting with the simple SGD method.

Related work

Gradient-based optimizaiton. The classical gradient-based optimization problems for the standard Lipschitz smooth functions have been well studied for both convex (Nemirovskij and Yudin, 1983; Nesterov, 2003; d’Aspremont et al., 2021) and non-convex functions. In the convex setting, the goal is to reach an ϵ\epsilon-sub-optimal point xx satisfying f(x)infxf(x)ϵf(x)-\inf_{x}f(x)\leq\epsilon. It is well known that GD achieves the O(1/ϵ)\mathcal{O}(1/\epsilon) gradient complexity and NAG achieves the accelerated O(1/ϵ)\mathcal{O}(1/\sqrt{\epsilon}) complexity which is optimal among all gradient-based methods. For strongly convex functions, GD and NAG achieve the O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) and O(κlog(1/ϵ))\mathcal{O}(\sqrt{\kappa}\log(1/\epsilon)) complexity respectively, where κ\kappa is the condition number and the latter is again optimal. In the non-convex setting, the goal is to find an ϵ\epsilon-stationary point xx satisfying f(x)ϵ\left\|\nabla f(x)\right\|\leq\epsilon, since finding a global minimum is NP-hard in general. It is well known that GD achieves the optimal O(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) complexity which matches the lower bound in (Carmon et al., 2017). In the stochastic setting for unbiased stochastic gradient with bounded variance, SGD achieves the optimal O(1/ϵ4)\mathcal{O}(1/\epsilon^{4}) complexity (Ghadimi and Lan, 2013), matching the lower bound in (Arjevani et al., 2019). In this paper, we obtain the classical rates in terms of ϵ\epsilon for all the above-mentioned methods and settings, under a far more general smoothness condition.

Generalized smoothness. The (L0,L1)(L_{0},L_{1})-smoothness condition proposed by Zhang et al. (2019) was studied by many follow-up works. Under the same condition, (Zhang et al., 2020) considers momentum in the updates and improves the constant dependency of the convergence rate for SGD with clipping derived in (Zhang et al., 2019). (Qian et al., 2021) studies gradient clipping in incremental gradient methods, (Zhao et al., 2021) studies stochastic normalized gradient descent, and (Crawshaw et al., 2022) studies a generalized SignSGD method, under the (L0,L1)(L_{0},L_{1})-smoothess condition. (Reisizadeh et al., 2023) studies variance reduction for (L0,L1)(L_{0},L_{1})-smooth functions. (Chen et al., 2023) proposes a new notion of α\alpha-symmetric generalized smoothness, which is roughly as general as (L0,L1)(L_{0},L_{1})-smoothness. (Wang et al., 2022) analyzes convergence of Adam and provides a lower bound which shows non-adaptive SGD may diverge. In the stochastic setting, the above-mentioned works either consider the strong assumption of bounded gradient noise or require a very large batch size that depends on ϵ\epsilon, which essentially reduces the analysis to the deterministic setting. (Faw et al., 2023) proposes an AdaGrad-type algorithm in order to relax the bounded noise assumption. Perhaps due to the lower bounds in (Zhang et al., 2019; Wang et al., 2022), all the above works study methods with an adaptive stepsize. In this and our concurrent work (Li et al., 2023), we further generalize the smoothness condition and analyze various methods under this condition through bounding the gradients along the trajectory.

Function class

In this section, we discuss the function class of interest where the objective function ff lies. We start with the following two standard assumptions in the literature of unconstrained optimization, which will be assumed throughout Sections 4 and 5 unless explicitly stated.

The objective function ff is differentiable and closed within its open domain X\mathcal{X}.

The objective function ff is bounded from below, i.e., f:=infxXf(x)>f^{*}:=\inf_{x\in\mathcal{X}}f(x)>-\infty.

In this section, we formally define the generalized smoothness condition, and present its properties and examples.

Definitions 1 and 2 below are two equivalent ways of stating the definition, where we use B(x,R)\mathcal{B}(x,R) to denote the Euclidean ball with radius RR centered at xx.

Next, we prove that the above two definitions are equivalent in the following proposition, whose proof is involved and deferred to Appendix A.2.

The condition in Definition 1 is simple and one can easily check whether it is satisfied for a given example function. On the other hand, Definition 2 is a local Lipschitz condition on the gradient that is harder to verify. However, it is useful for deriving several useful properties in the next section.

1.2 Properties

First, we provide the following lemma which is very useful in our analyses of all the methods considered in this paper. Its proof is deferred to Appendix A.3.

Lemma 3.3 states that, if the gradient at xx is bounded by some constant GG, then within its neighborhood with a constant radius, we can obtain (2), the same inequalities that were derived in the textbook analysis (Nesterov, 2003) under the standard Lipschitz smoothness condition. With (2), the analysis for generalized smoothness is not much harder than that for standard smoothness. Since we mostly choose x=x2=xtx=x_{2}=x_{t} and x1=xt+1x_{1}=x_{t+1} in the analysis, in order to apply Lemma 3.3, we need two conditions: f(xt)G\left\|\nabla f(x_{t})\right\|\leq G and xt+1xtr(G)\left\|x_{t+1}-x_{t}\right\|\leq r(G) for some constant GG. The latter is usually directly implied by the former for most deterministic methods with a small enough stepsize, and the former can be obtained with our new approach that bounds the gradients along the trajectory.

With Lemma 3.3, we can derive the following useful lemma which is the reverse direction of a generalized Polyak-Lojasiewicz (PL) inequality, whose proof is deferred to Appendix A.3.

Therefore, in order to bound the gradients along the trajectory as we discussed below Lemma 3.3, it suffices to bound the function values, which is usually easier.

1.3 Examples

Convex setting

In this section, we present the convergence results of gradient descent (GD) and Nesterov’s accelerated gradient method (NAG) in the convex setting. Formally, we define convexity as follows.

We assume the existence of a global optimal point xx^{*} throughout this section, as in the following assumption. However, we want to note that, for gradient descent, this assumption is just for simplicity rather than necessary.

There exists a point xXx^{*}\in\mathcal{X} such that f(x)=f=infxXf(x)f(x^{*})=f^{*}=\inf_{x\in\mathcal{X}}f(x).

The gradient descent method with a constant stepsize η\eta is defined via the following update rule

As discussed below Lemma 3.3, the key in the convergence analysis is to show f(xt)G\left\|\nabla f(x_{t})\right\|\leq G for all t0t\geq 0 and some constant GG. We will prove it by induction relying on the following lemma whose proof is deferred to Appendix B.

Lemma 4.1 suggests that for gradient descent (3) with a small enough stepsize, if the gradient norm at xtx_{t} is bounded by GG, then we have f(xt+1)f(xt)G\left\|\nabla f(x_{t+1})\right\|\leq\left\|\nabla f(x_{t})\right\|\leq G, i.e., the gradient norm is also bounded by GG at t+1t+1. In other words, the gradient norm is indeed a non-increasing potential function for gradient descent in the convex setting. With a standard induction argument, we can show that f(xt)f(x0)\left\|\nabla f(x_{t})\right\|\leq\left\|\nabla f(x_{0})\right\| for all t0t\geq 0. As discussed below Lemma 3.3, then we can basically apply the classical analysis to obtain the convergence guarantee in the convex setting as in the following theorem, whose proof is deferred to Appendix B.

Since η\eta is a constant independent of ϵ\epsilon or TT, Theorem 4.2 achieves the classical O(1/T)\mathcal{O}(1/T) rate, or O(1/ϵ)\mathcal{O}(1/\epsilon) gradient complexity to achieve an ϵ\epsilon-sub-optimal point, under the generalized smoothness condition. Since strongly convex functions are a subset of convex functions, Lemma 4.1 still holds for them. Then we immediately obtain the following result in the strongly convex setting, whose proof is deferred to Appendix B.

2 Nesterov’s accelerated gradient method

It is easy to note that Theorem 4.4 achieves the accelerated O(1/T2)\mathcal{O}(1/T^{2}) convergence rate, or equivalently the O(1/ϵ)\mathcal{O}(1/\sqrt{\epsilon}) gradient complexity to find an ϵ\epsilon-sub-optimal point, which is optimal among gradient-based methods (Nesterov, 2003).

In order to prove Theorem 4.4, we also use induction to show the gradients along the trajectory of Algorithm 1 are bounded by GG. However, unlike gradient descent, the gradient norm is no longer a potential function or monotonically non-increasing, which makes the induction analysis more challenging. Suppose that we have shown f(ys)G\left\|\nabla f(y_{s})\right\|\leq G for s<ts<t. To complete the induction, it suffices to prove f(yt)G\left\|\nabla f(y_{t})\right\|\leq G. Since xt=yt1ηf(yt1)x_{t}=y_{t-1}-\eta\nabla f(y_{t-1}) is a gradient descent step by Line 6 of Algorithm 1, Lemma 4.1 directly shows f(xt)G\left\|\nabla f(x_{t})\right\|\leq G. In order to also bound f(yt)\left\|\nabla f(y_{t})\right\|, we try to control ytxt\left\|y_{t}-x_{t}\right\|, which is the most challenging part of our proof. Since ytxty_{t}-x_{t} can be expressed as a linear combination of past gradients {f(ys)}s<t\{\nabla f(y_{s})\}_{s<t}, it might grow linearly with tt if we simply apply f(ys)G\left\|\nabla f(y_{s})\right\|\leq G for s<ts<t. Fortunately, Lemma 3.5 allows us to control the gradient norm with the function value. Thus if the function value is decreasing sufficiently fast, which can be shown by following the standard Lyapunov analysis of NAG, we are able to obtain a good enough bound on f(ys)\left\|\nabla f(y_{s})\right\| for s<ts<t, which allows us to control ytxt\left\|y_{t}-x_{t}\right\|. We defer the detailed proof to Appendix C.

Non-convex setting

In this section, we present convergence results of gradient descent (GD) and stochastic gradient descent (SGD) in the non-convex setting.

Similar to the convex setting, we still want to bound the gradients along the trajectory. However, in the non-convex setting, the gradient norm is not necessarily non-increasing. Fortunately, similar to the classical analyses, the function value is still non-increasing and thus a potential function, as formally shown in the following lemma, whose proof is deferred to Appendix E.

Then using a standard induction argument, we can show f(xt)f(x0)f(x_{t})\leq f(x_{0}) for all t0t\geq 0. According to Corollary 3.6, it implies bounded gradients along the trajectory. Therefore, we can show convergence of gradient descent as in the following theorem, whose proof is deferred to Appendix E.

It is clear that Theorem 5.2 gives the classical O(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) gradient complexity to achieve an ϵ\epsilon-stationary point, which is optimal as it matches the lower bound in (Carmon et al., 2017).

2 Stochastic gradient descent

In this part, we present the convergence result for stochastic gradient descent defined as follows.

where gtg_{t} is an estimate of the gradient f(xt)\nabla f(x_{t}). We consider the following standard assumption on the gradient noise ϵt:=gtf(xt)\epsilon_{t}:=g_{t}-\nabla f(x_{t}).

Under Assumption 4, we can obtain the following theorem.

As we choose η=O(1/T)\eta=\mathcal{O}(1/\sqrt{T}), Theorem 5.3 gives the classical O(1/ϵ4)\mathcal{O}(1/\epsilon^{4}) gradient complexity, where we ignore non-leading terms. This rate is optimal as it matches the lower bound in (Arjevani et al., 2019). The key to its proof is again to bound the gradients along the trajectory. However, bounding gradients in the stochastic setting is much more challenging than in the deterministic setting, especially with the heavy-tailed noise in Assumption 4. We briefly discuss some of the challenges as well as our approach below and defer the detailed proof of Theorem 5.3 to Appendix F.

First, due to the existence of heavy-tailed gradient noise as considered in Assumption 4, neither the gradient nor the function values is non-increasing. The induction analyses we have used in the deterministic setting hardly work. In addition, to apply Lemma 3.3, we need to control the update at each step and make sure xt+1xt=ηgtG/L\left\|x_{t+1}-x_{t}\right\|=\eta\left\|g_{t}\right\|\leq G/L. However, gtg_{t} might be unbounded due to the potentially unbounded gradient noise.

To overcome these challenges, we define the following random variable τ\tau.

3 Reconciliation with existing lower bounds

In this section, we reconcile our convergence results for constant-stepsize GD/SGD in the non-convex setting with existing lower bounds in (Zhang et al., 2019) and (Wang et al., 2022), based on which the authors claim that adaptive methods such as GD/SGD with clipping and Adam are provably faster than non-adaptive GD/SGD. This may seem to contradict our convergence results. In fact, we show that any gain in adaptive methods is at most by constant factors, as GD and SGD already achieve the optimal rates in the non-convex setting.

(Zhang et al., 2019) provides both upper and lower complexity bounds for constant-stepsize GD for (L0,L1)(L_{0},L_{1})-smooth functions, and shows that its complexity is O(Mϵ2)\mathcal{O}(M\epsilon^{-2}), where

is the supremum gradient norm below the level set of the initial function value. If MM is very large, then the O(Mϵ2)\mathcal{O}(M\epsilon^{-2}) complexity can be viewed as a negative result, and as evidence that constant-stepsize GD can be slower than GD with gradient clipping, since in the latter case, they obtain the O(ϵ2)\mathcal{O}(\epsilon^{-2}) complexity without MM. However, based on our Corollary 3.6, their MM can be actually bounded by our GG, which is a constant. Therefore, the gain in adaptive methods is at most by constant factors.

(Wang et al., 2022) further provides a lower bound which shows non-adaptive GD may diverge for some examples. However, their counter-example does not allow the stepsize to depend on the initial sub-optimality gap. In contrast, our stepsize η\eta depends on the effective smoothness constant LL, which depends on the initial sub-optimality gap through GG. Therefore, there is no contradiction here either. We should point out that in the practice of training neural networks, the stepsize is usually tuned after fixing the loss function and initialization, so it does depend on the problem instance and initialization.

4 Lower bound

For (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth functions with ρ<2\rho<2, it is easy to verify that the constant GG in both Theorem 5.2 and Theorem 5.3 is a polynomial function of problem-dependent parameters like L0,Lρ,f(x0)f,σL_{0},L_{\rho},f(x_{0})-f^{*},\sigma, etc. In other words, GD and SGD are provably efficient methods in the non-convex setting for ρ<2\rho<2. In this section, we show that the requirement of ρ<2\rho<2 is necessary in the non-convex setting with the lower bound for GD in the following Theorem 5.4, whose proof is deferred in Appendix G. Since SGD reduces to GD when there is no gradient noise, it is also a lower bound for SGD.

Given L0,L2,G0,Δ0>0L_{0},L_{2},G_{0},\Delta_{0}>0 satisfying L2Δ010L_{2}\Delta_{0}\geq 10, for any η0\eta\geq 0, there exists a (2,L0,L2)(2,L_{0},L_{2})-smooth function ff that satisfies Assumptions 1 and 2, and initial point x0x_{0} that satisfies f(x0)G0\|\nabla f(x_{0})\|\leq G_{0} and f(x0)fΔ0f(x_{0})-f^{\ast}\leq\Delta_{0}, such that gradient descent with stepsize η\eta (3) either cannot reach a 11-stationary point or takes at least exp(L2Δ0/8)/6\exp(L_{2}\Delta_{0}/8)/6 steps to reach a 11-stationary point.

Conclusion

Acknowledgments

This work was supported, in part, by the MIT-IBM Watson AI Lab and ONR Grants N00014-20-1-2394 and N00014-23-1-2299. We also acknowledge support from DOE under grant DE-SC0022199, and NSF through awards DMS-2031883, DMS-1953181, and DMS-2022448 (TRIPODS program).

References

Appendix A Proofs related to generalized smoothness

In this section, we provide the proofs of propositions and lemmas related to the generalized smoothness condition in Definition 1 or 2. First, in Appendix A.1, we justify the examples we discussed in Section 3. Next, we provide the detailed proof of Proposition 3.2 in Appendix A.2. Finally, we provide the proofs of the useful properties of generalized smoothness in Appendix A.3, including Lemma 3.3, Lemma 3.5, and Corollary 3.6 stated in Section 3.1.2.

In this section, we justify the univariate examples of (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth functions listed in Table 2 and also provide the proof of Propositions 3.7.

First, it is well-known that all quadratic functions have bounded Hessian and are Lipschitz smooth, corresponding to ρ=0\rho=0. Next, [Zhang et al., 2019, Lemma 2] shows that any univariate polynomial is (L0,L1)(L_{0},L_{1})-smooth, corresponding to ρ=1\rho=1. Then, regarding the exponential function f(x)=axf(x)=a^{x} where a>1a>1, we have f(x)=log(a)axf^{\prime}(x)=\log(a)a^{x} and f(x)=log(a)2ax=log(a)f(x)f^{\prime\prime}(x)=\log(a)^{2}a^{x}=\log(a)f^{\prime}(x), which implies ff is (1,0,log(a))(1,0,\log(a))-smooth. Similarly, by standard calculations, it is straight forward to verify that logarithmic functions and xpx^{p}, p1p\neq 1 are also (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth with ρ=2\rho=2 and ρ=p2p1\rho=\frac{p-2}{p-1} respectively. So far we have justified all the examples in Table 2 except double exponential functions a(bx)a^{(b^{x})} and rational functions, which will be justified rigorously by the two propositions below.

First, for double exponential functions in the form of f(x)=a(bx)f(x)=a^{(b^{x})} where a,b>1a,b>1, we have the following proposition, which shows ff is (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth for any ρ>1\rho>1.

For any ρ>1\rho>1, the double exponential function f(x)=a(bx)f(x)=a^{(b^{x})}, where a,b>1a,b>1, is (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth for some L0,Lρ0L_{0},L_{\rho}\geq 0. However, it is not necessarily (L0,L1)(L_{0},L_{1})-smooth for any L0,L10L_{0},L_{1}\geq 0.

Next, to show ff is not necessarily (L0,L1)(L_{0},L_{1})-smooth, consider the specific double exponential function f(x)=e(ex)f(x)=e^{(e^{x})}. Then we have

For any xmax{log(L0+1),log(L1+1)}x\geq\max\left\{\log(L_{0}+1),\log(L_{1}+1)\right\}, we can show that

which shows ff is not (L0,L1)(L_{0},L_{1}) smooth for any L0,L10L_{0},L_{1}\geq 0. ∎

In the next proposition, we show that any univariate rational function f(x)=P(x)/Q(x)f(x)=P(x)/Q(x), where PP and QQ are two polynomials, is (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth with ρ=1.5\rho=1.5.

The rational function f(x)=P(x)/Q(x)f(x)=P(x)/Q(x), where PP and QQ are two polynomials, is (1.5,L0,L1.5)(1.5,L_{0},L_{1.5})-smooth for some L0,L1.50L_{0},L_{1.5}\geq 0. However, it is not necessarily (ρ,L0,Lρ)(\rho,L_{0},L_{\rho})-smooth for any ρ<1.5\rho<1.5 and L0,Lρ0L_{0},L_{\rho}\geq 0.

Let f(x)=P(x)/Q(x)f(x)=P(x)/Q(x) where PP and QQ are two polynomials. Then the partial fractional decomposition of f(x)f(x) is given by

Then we have f(x)=w(x)+i=1mr=1jipir(x)+i=1nr=1kiqir(x)f(x)=w(x)+\sum_{i=1}^{m}\sum_{r=1}^{j_{i}}p_{ir}(x)+\sum_{i=1}^{n}\sum_{r=1}^{k_{i}}q_{ir}(x). We know that r+2r+11.5\frac{r+2}{r+1}\leq 1.5 for any r1r\geq 1. Then we can show that

where the first equality is because one can easily verify that the first and second order derivatives of pijip_{ij_{i}} dominate those of all other terms when xx goes to aia_{i}, and the second equality is by standard calculations noting that ji+2ji+11.5\frac{j_{i}+2}{j_{i}+1}\leq 1.5. Note that (7) implies that, for any Lρ>ji+1L_{\rho}>j_{i}+1, there exists δi>0\delta_{i}>0 such that

Similarly, one can show limxf(x)1.5f(x)=\lim_{x\to\infty}\frac{\left|f^{\prime}(x)\right|^{1.5}}{\left|f^{\prime\prime}(x)\right|}=\infty, which implies there exists x0>0x_{0}>0 such that

We know B\mathcal{B} is a compact set and therefore the continuous function ff^{\prime\prime} is bounded within B\mathcal{B}, i.e., there exists some constant L0>0L_{0}>0 such that

Combining (8), (9), and (10), we have shown

which completes the proof of the first part.

For the second part, consider the ration function f(x)=1/xf(x)=1/x. Then we know that f(x)=1/x2f^{\prime}(x)=-1/x^{2} and f(x)=2/x3f^{\prime\prime}(x)=2/x^{3}. Note that for any ρ<1.5\rho<1.5 and 0<xmin{(L0+1)1/3,(Lρ+1)1/(32ρ)}0<x\leq\min\{(L_{0}+1)^{-1/3},(L_{\rho}+1)^{-1/(3-2\rho)}\}, we have

which shows ff is not (ρ,L0,Lρ)(\rho,L_{0},L_{\rho}) smooth for any ρ<1.5\rho<1.5 and L0,Lρ0L_{0},L_{\rho}\geq 0. ∎

Finally, we complete this section with the proof of Proposition 3.7, which shows self-concordant functions are (2,L0,L2)(2,L_{0},L_{2})-smooth for some L0,Lρ0L_{0},L_{\rho}\geq 0.

Integrating both sides from x0x_{0} to yy for x0,y(a,b)x_{0},y\in(a,b), we have

Since h(y)>0h^{\prime\prime}(y)>0, we have h(y)=h(y)\left|h^{\prime\prime}(y)\right|=h^{\prime\prime}(y). Therefore, the above inequality shows that hh is (2,L0,L2)(2,L_{0},L_{2})-smooth with L0=2(h(x0)1/2h(x0))2L_{0}=2(h^{\prime\prime}(x_{0})^{1/2}-h^{\prime}(x_{0}))^{2} and L2=2L_{2}=2. ∎

A.2 Proof of Proposition 3.2

In order to prove Proposition 3.2, we need the following several lemmas. First, the lemma below partially generalizes Grönwall’s inequality.

Let α:[a,b][0,)\alpha:[a,b]\to[0,\infty) and β:[0,)(0,)\beta:[0,\infty)\to(0,\infty) be two continuous functions. Suppose α(t)β(α(t))\alpha^{\prime}(t)\leq\beta(\alpha(t)) almost everywhere over (a,b)(a,b). Denote function ϕ(u):=1β(u)du\phi(u):=\int\frac{1}{\beta(u)}\,du. We have for all t[a,b]t\in[a,b],

Integrating both sides, noting that γ(a)=α(a)\gamma(a)=\alpha(a) by (11), we obtain

Then it suffices to show ϕ(α(t))ϕ(γ(t)),  t[a,b]\phi(\alpha(t))\leq\phi(\gamma(t)),\;\forall t\in[a,b]. Note that the following inequality holds almost everywhere.

where the inequality is because α(t)β(α(t))\alpha^{\prime}(t)\leq\beta(\alpha(t)) by the assumption of this lemma and γ(t)=β(γ(t))\gamma^{\prime}(t)=\beta(\gamma(t)) by (11). Since ϕ(α(a))ϕ(γ(a))=0\phi(\alpha(a))-\phi(\gamma(a))=0, we know for all t[a,b]t\in[a,b], ϕ(α(t))ϕ(γ(t))\phi(\alpha(t))\leq\phi(\gamma(t)), which completes the proof. ∎

With Lemma A.3, one can bound the gradient norm within a small enough neighborhood of a given point as in the following lemma.

Denote z(t):=(1t)x+tyz(t):=(1-t)x+ty for 0t10\leq t\leq 1. Then we know z(t)Xz(t)\in\mathcal{X} for all 0t10\leq t\leq 1 by the assumption made in this lemma. Then we can also define α(t):=f(z(t))\alpha(t):=\left\|\nabla f(z(t))\right\| for 0t10\leq t\leq 1. Note that for any 0ts10\leq t\leq s\leq 1, by triangle inequality,

We know that α(t)=f(z(t))\alpha(t)=\left\|\nabla f(z(t))\right\| is differentiable almost everywhere since ff is second order differentiable almost everywhere (Here we assume α(t)0\alpha(t)\neq 0 for 0<t<10<t<1 without loss of generality. Otherwise, one can define tm=sup{0<t<1α(t)=0}t_{m}=\sup\{0<t<1\mid\alpha(t)=0\} and consider the interval [tm,1][t_{m},1] instead). Then the following equality holds almost everywhere

Since ψ\psi is increasing, we have f(y)f(x)+a\left\|\nabla f(y)\right\|\leq\left\|\nabla f(x)\right\|+a. ∎

With Lemma A.4, we are ready to prove Proposition 3.2.

We prove the two directions in this proposition separately.

For each fixed xXx\in\mathcal{X} where 2f(x)\nabla^{2}f(x) exists and any unit-norm vector ww, by Definition 2, we know that for any tr(f(x))t\leq r(\left\|\nabla f(x)\right\|),

On the other hand, by the definition of tbt_{\text{b}}, we know z(t)Xz(t)\in\mathcal{X} for every 0t<tb0\leq t<t_{\text{b}}. Then by Lemma A.4, for all 0t<tb0\leq t<t_{\text{b}}, we have f(z(t))f(x)+a\left\|\nabla f(z(t))\right\|\leq\left\|\nabla f(x)\right\|+a. Therefore for all 0t<tb0\leq t<t_{\text{b}},

which contradicts (13). Therefore we have shown yXy\in\mathcal{X}. Since yy is chosen arbitrarily with the ball B(x,r(f(x)))\mathcal{B}(x,r(\left\|\nabla f(x)\right\|)), we have B(x,r(f(x)))X\mathcal{B}(x,r(\left\|\nabla f(x)\right\|))\subseteq\mathcal{X}. Then for any x1,x2B(x,r(f(x)))x_{1},x_{2}\in\mathcal{B}(x,r(\left\|\nabla f(x)\right\|)), we denote w(t):=tx1+(1t)x2w(t):=tx_{1}+(1-t)x_{2}. Then we know w(t)B(x,r(f(x)))w(t)\in\mathcal{B}(x,r(\left\|\nabla f(x)\right\|)) for all 0t10\leq t\leq 1 and can obtain

where the last inequality is due to Lemma A.4. ∎

A.3 Proofs of lemmas implied by generalized smoothness

In this part, we provide the proofs of the useful properties stated in Section 3.1.2, including Lemma 3.3, Lemma 3.5, and Corollary 3.6.

Next, for the second inequality in (2), define z(t):=(1t)x2+tx1z(t):=(1-t)x_{2}+tx_{1} for 0t10\leq t\leq 1. We know z(t)B(x,r(G))z(t)\in\mathcal{B}(x,r(G)). Note that we have shown

Then based on the definition of GG, we have f(x)G\left\|\nabla f(x)\right\|\leq G. ∎

Appendix B Analysis of GD for convex functions

In this section, we provide the detailed convergence analysis of gradient descent in the convex setting, including the proofs of Lemma 4.1 and Theorem 4.2, for which the following lemma will be helpful.

Define the Bregman divergences ϕx(w):=f(w)f(x),w\phi_{x}(w):=f(w)-\langle\nabla f(x),w\rangle and ϕy(w):=f(w)f(y),w\phi_{y}(w):=f(w)-\langle\nabla f(y),w\rangle, which are both convex functions. Since ϕx(w)=f(w)f(x)\nabla\phi_{x}(w)=\nabla f(w)-\nabla f(x), we have ϕx(x)=0\nabla\phi_{x}(x)=0 which implies minwϕx(w)=ϕx(x)\min_{w}\phi_{x}(w)=\phi_{x}(x) as ϕx\phi_{x} is convex. Similarly we have minwϕy(w)=ϕy(y)\min_{w}\phi_{y}(w)=\phi_{y}(y).

Note that for any yB(x,r(f(x))/2)y\in\mathcal{B}(x,r(\left\|\nabla f(x)\right\|)/2) as in the lemma statement,

where the first inequality uses triangle inequality and ϕx(y)=f(y)f(x)\nabla\phi_{x}(y)=\nabla f(y)-\nabla f(x); and the second inequality uses Definition 2. It implies that y1Lϕx(y)B(x,rx)y-\frac{1}{L}\nabla\phi_{x}(y)\in\mathcal{B}(x,r_{x}). Then we can obtain

where the last inequality uses (15) where we choose x1=y1Lϕx(y)x_{1}=y-\frac{1}{L}\nabla\phi_{x}(y) and x2=yx_{2}=y. By the definition of ϕx\phi_{x}, the above inequality is equivalent to

Similar argument can be made for ϕy()\phi_{y}(\cdot) to obtain

Summing up the two inequalities, we can obtain the desired result. ∎

With Lemma B.1, we prove Lemma 4.1 as follows.

where we choose ηr(G)/(2G)\eta\leq r(G)/(2G). Thus by Lemma B.1, we have

where the first inequality uses Lemma B.1 and the last inequality chooses η2/L\eta\leq 2/L. ∎

With Lemma 4.1, we are ready to prove both Theorem 4.2 and Theorem 4.3.

Denote G:=f(x0)G:=\left\|\nabla f(x_{0})\right\|. Then we trivially have f(x0)G\left\|\nabla f(x_{0})\right\|\leq G. Lemma 4.1 states that if f(xt)G\left\|\nabla f(x_{t})\right\|\leq G for any t0t\geq 0, then we also have f(xt+1)f(xt)G\left\|\nabla f(x_{t+1})\right\|\leq\left\|\nabla f(x_{t})\right\|\leq G. By induction, we can show that f(xt)G\left\|\nabla f(x_{t})\right\|\leq G for all t0t\geq 0. Then the rest of the proof basically follows the standard textbook analysis. We still provide the detailed proof below for completeness.

Note that xt+1xt=ηf(xt)ηGr(G)\left\|x_{t+1}-x_{t}\right\|=\eta\left\|\nabla f(x_{t})\right\|\leq\eta G\leq r(G), where we choose ηr(G)/(2G)\eta\leq r(G)/(2G). Thus we can apply Lemma 3.3 to obtain

where the last inequality chooses η1/L\eta\leq 1/L. Meanwhile, by convexity between xtx_{t} and xx^{*}, we have

Then reorganizing the terms of the above inequality, noting that

The above inequality implies t(f(xt)f)+12ηxtx2t(f(x_{t})-f^{*})+\frac{1}{2\eta}\left\|x_{t}-x^{*}\right\|^{2} is a non-increasing potential function, which directly implies the desired result. ∎

Since strongly convex functions are also convex, by the same argument as in the proof of Theorem 4.2, we have f(xt)G:=f(x0)\left\|\nabla f(x_{t})\right\|\leq G:=\left\|\nabla f(x_{0})\right\| for all t0t\geq 0. Moreover, (B) still holds. For μ\mu-strongly-convex function, we can obtain a tighter version of (17) as follows.

Let A0=0A_{0}=0 and At+1=(1+At)/(1ημ)A_{t+1}=(1+A_{t})/(1-\eta\mu) for all t0t\geq 0. Combining (B) and (18), we have

Then reorganizing the terms of the above inequality, noting that

The above inequality means At(f(xt)f)+1+ημAt2ηxtx2A_{t}(f(x_{t})-f^{*})+\frac{1+\eta\mu A_{t}}{2\eta}\left\|x_{t}-x^{*}\right\|^{2} is a non-increasing potential function. Thus by telescoping we have

Appendix C Analysis of NAG for convex functions

Choose ηmin{116L32/α,12L}\eta\leq\min\left\{\frac{1}{16L^{3-2/\alpha}},\frac{1}{2L}\right\}. Then the iterates of Algorithm 1 satisfy

Before proving Theorem C.1, we first present several additional useful lemmas. To start with, we provide two lemmas regarding the weights {At}t0\{A_{t}\}_{t\geq 0} and {Bt}t0\{B_{t}\}_{t\geq 0} used in Algorithm 1. The lemma below states that Bt=Θ(t2)B_{t}=\Theta(t^{2}).

The weights {Bt}t0\{B_{t}\}_{t\geq 0} in Algorithm 1 satisfy 14t2Btt2\frac{1}{4}t^{2}\leq B_{t}\leq t^{2} for all t0t\geq 0.

We prove this lemma by induction. First note that the inequality obviously holds for B0=0B_{0}=0. Suppose its holds up to tt. Then we have

Lemma C.2 implies the following useful lemma.

The weights {At}t0\{A_{t}\}_{t\geq 0} in Algorithm 1 satisfy that

First, note that it is easy to verify that As+1As1=Bs+1Bs10A_{s+1}-A_{s}-1=B_{s+1}-B_{s}-1\geq 0, which implies each term in the LHS of the above inequality is non-negative. Then we have

The following lemma summarizes the results in the classical potential function analysis of NAG in [d’Aspremont et al., 2021]. In order to not deal with the generalized smoothness condition for now, we directly assume the inequality (20) holds in the lemma, which will be proved later under the generalized smoothness condition.

For any t0t\geq 0, if the following inequality holds,

These derivations below can be found in [d’Aspremont et al., 2021]. We present them here for completeness.

First, since ff is convex, the convexity between xx^{*} and yty_{t} gives

Similarly the convexity between xtx_{t} and yty_{t} gives

Combining the above two inequalities as well as (20) assumed in this lemma, we have

Then we complete the proof noting that it is easy to verify

In the next lemma, we show that if f(yt)G\left\|\nabla f(y_{t})\right\|\leq G, then the condition (20) assumed in Lemma C.4 is satisfied at time tt.

For any t0t\geq 0, if f(yt)G\left\|\nabla f(y_{t})\right\|\leq G, then we have f(xt+1)G\left\|\nabla f(x_{t+1})\right\|\leq G, and furthermore,

As disccued below Theorem C.1, the stepsize satisfies η1/(2L)min{2m(G),r(G)2G}\eta\leq 1/(2L)\leq\min\{\frac{2}{m(G)},\frac{r(G)}{2G}\}. Therefore we can apply Lemma 4.1 to show f(xt+1)f(yt)G\left\|\nabla f(x_{t+1})\right\|\leq\left\|\nabla f(y_{t})\right\|\leq G. For the second part, note that xt+1yt=ηf(yt)G2Lr(G)\left\|x_{t+1}-y_{t}\right\|=\eta\left\|\nabla f(y_{t})\right\|\leq\frac{G}{2L}\leq r(G), we can apply Lemma 3.3 to show

With Lemma C.4 and Lemma C.5, we can show that f(yt)G\left\|\nabla f(y_{t})\right\|\leq G for all t0t\geq 0, as in the lemma below.

For all t0t\geq 0, f(yt)G\left\|\nabla f(y_{t})\right\|\leq G.

We will prove this lemma by induction. First, by Lemma 3.5 and the choice of GG, it is easy to verify that f(x0)G\left\|\nabla f(x_{0})\right\|\leq G. Then for any fixed t0t\geq 0, suppose that f(xs)G\left\|\nabla f(x_{s})\right\|\leq G for all s<ts<t. Then by Lemma C.4 and Lemma C.5, we know that f(xs)G\left\|\nabla f(x_{s})\right\|\leq G for all 0st0\leq s\leq t, and that for all s<ts<t,

By telescoping (24), we have for all 0s<t0\leq s<t,

For 0st0\leq s\leq t, since f(xs)G\left\|\nabla f(x_{s})\right\|\leq G, then Lemma 3.5 implies

Since f(ys)G\left\|\nabla f(y_{s})\right\|\leq G and xs+1ys=ηf(ys)r(G)\left\|x_{s+1}-y_{s}\right\|=\left\|\eta\nabla f(y_{s})\right\|\leq r(G) for s<ts<t, by Lemma 3.3, we have

Since f(xt)G\left\|\nabla f(x_{t})\right\|\leq G and we just showed xtytr(G)\left\|x_{t}-y_{t}\right\|\leq r(G), by Lemma 3.3, we have

Then we complete the induction as well as the proof.

With the three lemmas above, it is straight forward to prove Theorem C.1.

Combining Lemmas C.4, C.5, and C.6, we know the following inequality holds for all t0t\geq 0.

Then by telescoping, we directly complete the proof.

Appendix D Analysis of NAG for strongly convex functions

In this section, we provide the convergence analysis of the modified version of Nesterov’s accelerated gradient method for μ\mu-strongly-convex functions defined in Algorithm 2.

The convergence results is formally presented in the following theorem.

The iterates generated by Algorithm 2 satisfy

In what follows, we will provide the proof of Theorem D.1. We will always use the parameter choices in the theorem throughout this section.

In this part, we provide several useful lemmas for proving Theorem D.1. To start with, the following two lemmas provide two useful inequalities.

For any 0u10\leq u\leq 1, we have log(1+u)12u\log(1+u)\geq\frac{1}{2}u.

For all 0<p10<p\leq 1 and t0t\geq 0, we have

It is obvious that f(t)0f(t)\geq 0 for t2plog(e+1p)t\leq\frac{2}{\sqrt{p}}\log(e+\frac{1}{p}). For t>2plog(e+1p)t>\frac{2}{\sqrt{p}}\log(e+\frac{1}{p}), we have

In the next four lemmas, we provide several useful inequalities regarding the weights {At}t0\{A_{t}\}_{t\geq 0} and {Bt}t0\{B_{t}\}_{t\geq 0} used in Algorithm 2.

which implies τtδs1.\tau_{t}\cdot\delta_{s}\leq 1.

where in the inequality, we use the fact that BsB_{s} is non-decreasing with ss. Therefore

If 0<ημ<10<\eta\mu<1, then for any t1t\geq 1, we have

For t1t\geq 1, we have Bt1B_{t}\geq 1 thus

Combined with Lemma D.5, we have the desired result. ∎

D.2 Proof of Theorem D.1

With all the useful lemmas in the previous section, we proceed to prove Theorem D.1, for which we need several additional lemmas. First, similar to Lemma C.4, the following lemma summarizes the results in the classical potential function analysis of NAG for strongly convex functions in [d’Aspremont et al., 2021].

For any t0t\geq 0, if the following inequality holds

These derivations can be found in d’Aspremont et al. . We present it here for completeness.

The strong convexity between xx^{*} and yty_{t} gives

The convexity between xtx_{t} and yty_{t} gives

Combining the above two inequalities and the one assumed in this lemma, we have

Next, note that Lemma C.5 still holds in the strongly convex setting. We repeat it below for completeness.

For any t0t\geq 0, if f(yt)G\left\|\nabla f(y_{t})\right\|\leq G, then we have f(xt+1)G\left\|\nabla f(x_{t+1})\right\|\leq G, and furthermore,

With Lemma D.8 and Lemma D.9, we will show that f(yt)G\left\|\nabla f(y_{t})\right\|\leq G for all t0t\geq 0 by induction in the following lemma.

For all t0t\geq 0, we have f(yt)G\left\|\nabla f(y_{t})\right\|\leq G.

We will prove this lemma by induction. First, by Lemma 3.5 and the choice of GG, it is easy to verify that f(x0)G\left\|\nabla f(x_{0})\right\|\leq G. Then for any fixed t0t\geq 0, suppose that f(xs)G\left\|\nabla f(x_{s})\right\|\leq G for all s<ts<t. Then by Lemma D.8 and Lemma D.9, we know that f(xs)G\left\|\nabla f(x_{s})\right\|\leq G for all 0st0\leq s\leq t, and that for all s<ts<t,

By telescoping (30), we have for all 0s<t0\leq s<t,

For 0st0\leq s\leq t, since f(xs)G\left\|\nabla f(x_{s})\right\|\leq G, then Lemma 3.5 implies

where the second inequality follows from Lemma D.4. We further control term I\mathcal{I} by

Since f(xt)G\left\|\nabla f(x_{t})\right\|\leq G and we just showed xtytr(G)\left\|x_{t}-y_{t}\right\|\leq r(G), by Lemma 3.3, we have

Then we complete the induction as well as the proof. ∎

Combining Lemmas D.8, D.9, and D.10, we know the following inequality holds for all t0t\geq 0.

Finally, applying Lemma D.5, we have At=Bt+1/(ημ)1/(1ημ)t1+1/(ημ)A_{t}=B_{t}+1/(\eta\mu)\geq 1/(1-\sqrt{\eta\mu})^{t-1}+1/(\eta\mu). Thus completes the proof. ∎

Appendix E Analysis of GD for non-convex functions

In this section, we provide the proofs related to analysis of gradient descent for non-convex function, including those of Lemma 5.1 and Theorem 5.2.

First, based on Corollary 3.6, we know f(x)G<\left\|\nabla f(x)\right\|\leq G<\infty. Also note that

Then by Lemma 3.3 and Remark 3.4, we have x+Xx^{+}\in\mathcal{X} and

By Lemma 5.1, using induction, we directly obtain f(xt)f(x0)f(x_{t})\leq f(x_{0}) for all t0t\geq 0. Then by Corollary 3.6, we have f(xt)G\left\|\nabla f(x_{t})\right\|\leq G for all t0t\geq 0. Following the proof of Lemma 5.1, we can similarly show

Taking a summation over t<Tt<T and rearanging terms, we have

Appendix F Analysis of SGD for non-convex functions

We first present some useful inequalities related to the parameter choices in Theorem 5.3.

Under the parameters choices in Theorem 5.3, the following inequalities hold.

First note that by Corollary 3.6, we know

i.e., σLG2δ/16\sigma L\leq G^{2}\delta/16. Then since we choose η14GT\eta\leq\frac{1}{4G\sqrt{T}}, we have

Under the parameters choices in Theorem 5.3, the following inequality holds

If t<τt<\tau, by the definition of τ\tau, we know f(xt)fFf(x_{t})-f^{*}\leq F and ϵtG5ηL\left\|\epsilon_{t}\right\|\leq\frac{G}{5\eta L}, and the former also implies f(xt)G\left\|\nabla f(x_{t})\right\|\leq G by Corollary 3.6. Then we can bound

where we use the choice of η12L\eta\leq\frac{1}{2L}. Then based on Lemma 3.3 and Remark 3.4, for any t<τt<\tau, we have

where the equality is due to (4); the second inequality uses gt=ϵt+f(xt)g_{t}=\epsilon_{t}+\nabla f(x_{t}) and Young’s inequality y+z22y2+2z2\left\|y+z\right\|^{2}\leq 2\left\|y\right\|^{2}+2\left\|z\right\|^{2} for any vectors y,zy,z; and the last inequality chooses η1/(2L)\eta\leq 1/(2L). Taking a summation over t<τt<\tau and rearanging terms, we have

Now we bound the last two terms on th RHS. First, for the last term, we have

where the first inequality uses τT\tau\leq T by its defnition; and in the last inequality we use Assumption 4.

where the last inequality is due to Lemma F.1. ∎

With Lemma F.2, we are ready to prove Theorem 5.3.

We want to show the probability of {τ<T}\{\tau<T\} is small, as its complement {τ=T}\{\tau=T\} means f(xt)fFf(x_{t})-f^{*}\leq F for all tTt\leq T which implies f(xt)G\left\|\nabla f(x_{t})\right\|\leq G for all tTt\leq T. Note that

Therefore we only need to bound the probability of each of these two events on the RHS.

where we choose η12L\eta\leq\frac{1}{2L}. Then based on Lemma 3.3 and Remark 3.4, we have

where the first inequality is obtained following the same derivation as in (F); the last equality is due to Corollary 3.6. Therefore we can show that under the event {τ1<T,τ2=T}\{\tau_{1}<T,\tau_{2}=T\},

where the second inequality uses Markov’s inequality; the third inequality uses Lemma F.2; and in the last inequality we choose F=8(f(x0)f+σ)/δF=8(f(x_{0})-f^{*}+\sigma)/\delta.

Appendix G Lower bound

In this section, we provide the proof of Theorem 5.4.

Let c,η0>0c,\eta_{0}>0 satisfy η0c2/2\eta_{0}\leq c^{2}/2. Consider

where c>0c>0 is a constant and y=(c+c2+2η0)/2>0y=(c+\sqrt{c^{2}+2\eta_{0}})/2>0 is the fixed point of the iteration

and kk, bb are chosen in such a way that f(x)f(x) and f(x)f^{\prime}(x) are continuous. Specifically, choose k=c1f(c/2)k=c^{-1}f^{\prime}(c/2) and b=f(c/2)cf(c/2)/4b=f(c/2)-cf^{\prime}(c/2)/4. Since f(x)=f(x)f(-x)=f(x), f(x)f(x) is symmetric about the line x=0x=0. In a small neighborhood, f(x)f(x) is symmetric about (y,f(y))(y,f(y)), so f(x)f^{\prime}(x) is continuous at yy.

Let us first consider the smoothness of ff. By symmetry, it suffices to consider x>0x>0. Then,

Note that f(x)f(x) has a stationary point . For stepsize ηf\eta_{f} satisfying η0ηfc2/4\eta_{0}\leq\eta_{f}\leq c^{2}/4, there exists z=(c+c2+2ηf)yz=(c+\sqrt{c^{2}+2\eta_{f}})\geq y such that z=zηf(yc)1-z=z-\eta_{f}(y-c)^{-1} and by symmetry, once xτ=zx_{\tau}=z, xt=±zx_{t}=\pm z for all tτt\geq\tau, making the GD iterations stuck. Now we choose a proper x0x_{0} such that f(x0)f^{\prime}(x_{0}) and f(x0)f(0)f(x_{0})-f(0) are bounded.

We consider arriving at yy from above. That is, x0x1xτ=z>c>0x_{0}\geq x_{1}\geq\ldots x_{\tau}=z>c>0. Since in each update where xt+1=xtηf(xtc)1>cx_{t+1}=x_{t}-\eta_{f}(x_{t}-c)^{-1}>c,

Hence, we can choose τ\tau in such a way that 3c/2x0<3c/2+ηf3c/2\leq x_{0}<3c/2+\sqrt{\eta_{f}}. Then,

By definition, yc=η0(c+c2+2η0)1y-c=\eta_{0}(c+\sqrt{c^{2}+2\eta_{0}})^{-1}. Hence,

For stepsize ηf<η0\eta_{f}<\eta_{0}, reaching below 4c/34c/3 takes at least

steps to reach 4c/34c/3, where f(4c/3)=log(c/3)f^{\prime}(4c/3)=\log(c/3).

Now we set cc and η0\eta_{0} and scale function f(x)f(x) to satisfy the parameter specifications L0,L2,G0,Δ0L_{0},L_{2},G_{0},\Delta_{0}. Define g(x)=L21f(x)g(x)=L_{2}^{-1}f(x). Then, g(x)g(x) is (2,2kL21,L2)(2,2kL_{2}^{-1},L_{2})-smooth. Since the gradient of g(x)g(x) is L21L_{2}^{-1} times f(x)f(x), the above analysis for f(x)f(x) applies to g(x)g(x) by replacing η0\eta_{0} with η1=L2η0\eta_{1}=L_{2}\eta_{0} and ηf\eta_{f} with η=L2ηf\eta=L_{2}\eta_{f}. To ensure that

it suffices to take c2/L0L2c\geq 2/\sqrt{L_{0}L_{2}}. To ensure that

it suffices to take c2/(L2G0)c\geq 2/(L_{2}G_{0}). To ensure that

Since we require η1c2/2\eta_{1}\leq c^{2}/2, parameters L2L_{2} and Δ0\Delta_{0} need to satisfy

that is, L2Δ05.5log2+0.5L_{2}\Delta_{0}\geq 5.5\log 2+0.5, which holds because L2Δ010L_{2}\Delta_{0}\geq 10. Take c=max{2/L0L2,2/(L2G0),8/L0}c=\max\{2/\sqrt{L_{0}L_{2}},2/(L_{2}G_{0}),\sqrt{8/L_{0}}\}. Then, as long as η2/L0\eta\leq 2/L_{0}, the requirement that ηc2/4\eta\leq c^{2}/4 is satisfied. Therefore, on g(x)g(x) with initial point x0x_{0}, gradient descent with a constant stepsize either gets stuck, or takes at least

On the other hand, if η>2/L0\eta>2/L_{0}, consider the function f(x)=L02x2f(x)=\frac{L_{0}}{2}x^{2}. For any xt0x_{t}\neq 0, we always have xt+1/xt=1ηL0>1\left|x_{t+1}\right|/\left|x_{t}\right|=\left|1-\eta L_{0}\right|>1, which means the iterates diverge to infinity. ∎