Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

Introduction

A fundamental problem in stochastic optimization is to characterize the convergence behavior of the Stochastic Gradient Descent algorithm:

Much of the literature on stochastic optimization, e.g., [NY83, GL13, Bub15, FSSSSW19], focuses on a special case of (Affine-var) where the variance is uniformly upper-bounded (σ1=0\sigma_{1}=0):

Rates of convergence to a first-order stationary point in these settings are now well-understood. Under (Bounded-var) regime, [GL13] prove an O(\nicefracσ02L0(F(w1)F)T)\mathcal{O}(\nicefrac{{\sqrt{\sigma_{0}^{2}L_{0}(F(\mathbf{w}_{1})-F^{*})}}}{{\sqrt{T}}}) rate of convergence with a fixed step-size schedule. Later, [ACDFSW22] show that this rate is optimal up to constant factors. Further, as noted by [BCN18], a minor modification to this step-size gives nearly the same rate in the more general (Affine-var) setting, i.e., σ1>0\sigma_{1}>0. This rate is obtained by making trivial changes to the proof technique of [GL13].

One crucial assumption in these lines of work is L0L_{0}-smoothness, i.e., L0L_{0}-Lipschitz gradients of the loss landscape. However, recent works [ZJFW20, ZHSJ20] provide empirical evidence that this assumption is often not satisfied in practical machine learning problems. For instance, in large-scale language modeling including BERT [DCLT18] and other variants [Rad+21, Car+21, LYFJHN23], the loss landscape of transformer architectures either does not satisfy the L0L_{0}-smoothness assumption, or the value of L0L_{0} becomes so large that it produces a significantly weaker rate of convergence [ZJFW20, ZHSJ20].

Aiming to address these issues, there has been a recent surge of interest in relaxing the standard L0L_{0}-smoothness assumption and characterizing the rate of convergence. One appealing relaxation proposed by [ZHSJ20] is that of (L0,L1)(L_{0},L_{1})-smoothnessFor convenience, we state this assumption in terms of a bound on the hessian of FF. The requirement that the hessian exists everywhere can be relaxed to a condition on the gradients [ZJFW20]. This relaxation is the one we use for our main results, see Assumption 2.:

While every L0L_{0}-smooth function is also (L0,0)(L_{0},0)-smooth, this relaxation admits functions that grow significantly faster than a quadratic function, e.g., F(w)=wdF(w)=w^{d} is (\nicefracd(d1)L1d2,(d1)L1)(\nicefrac{{d(d-1)}}{{L_{1}^{d-2}}},(d-1)L_{1})-smooth for any L1>0L_{1}>0, and F(w)=exp(L1w)F(w)=\exp(L_{1}w) is (0,L1)(0,L_{1})-smooth. With regards to convergence, recent works [ZHSJ20, CLOZZ22] establish an O(\nicefrac1T)\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}}) rate in the (L0,L1)(L_{0},L_{1})-smooth setting, as long as the noise of the stochastic gradients has uniformly-bounded support, i.e.,

In many real-world scenarios, the (Bounded-supp) assumption does not hold. For instance, when running SGD in standard least-squares regression settings, the stochastic gradients have multiplicative noise, as noted in [DFB17, FB17, JKKNS18, JT19]. Similar noise assumptions have also been considered, e.g., in convergence of stochastic proximal gradient methods [RVV20], Hilbert-valued stochastic subgradient methods [BRS07], and adaptive gradient methods [FTCMSW22]. Moreover, multiplicative noise naturally arises in machine learning problems with (additive or multiplicative) feature noise [LW11, Hwa86, CRSC06]. Thus, we believe that characterizing (L0,L1)(L_{0},L_{1})-smooth functions under (Affine-var) is an important step in extending the theory of non-convex stochastic optimization beyond the standard L0L_{0}-smooth setting.

A major challenge in the analysis of adaptive stochastic gradient descent is the correlation between the stochastic gradients and the step-size. Here, we develop a technique to simplify this challenge. Our key innovation is a recursively-defined stopping time which satisfies two crucial properties: (i) before the stopping time is reached, the step sizes behave roughly independently of the gradients, and (ii) on average, the stopping time is at least a constant fraction of the time horizon. As a consequence, instead of analyzing over the entire time horizon, we conduct the analysis over this sub-interval over which we exploit this convenient almost-independent property. This tool allows us to prove the first O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) rate of convergence for (L0,L1)(L_{0},L_{1})-smooth functions beyond the (Bounded-supp) setting. Our main contributions are three-fold:

(a) Convergence for (L0,L1)(L_{0},L_{1})-smoothness when σ1<1\sigma_{1}<1. We show in Section 4 that AdaGrad-Norm converges at a rate O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) when the stochastic gradient oracle satisfies (Affine-var) with σ00\sigma_{0}\geq 0 and σ1[0,1)\sigma_{1}\in[0,1). This is the first convergence rate for any algorithm even under (Bounded-var) (i.e., σ1=0\sigma_{1}=0) for general (L0,L1)(L_{0},L_{1})-smooth optimization. Note that the scaling of this bound with TT matches (up to poly-logarithmic factors) the best-known rate for L0L_{0}-smooth functions – with a minor caveat that σ1<1\sigma_{1}<1 is not needed in the L0L_{0}-smooth setting. Also, we show that the rate improves to O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}}) in the “small variance” regime when σ0,σ10\sigma_{0},\sigma_{1}\to 0 even without tuning the step-size.

(b) Convergence for all σ1\sigma_{1}. We establish a sufficient condition under which AdaGrad-Norm converges at a rate O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) when σ11\sigma_{1}\geq 1, see Section 5. This condition allows us to analyze a broad subset of (L0,L1)(L_{0},L_{1})-smooth functions that includes all L0L_{0}-smooth functions as well as fixed-degree polynomials without any restrictions on σ1\sigma_{1}. This simultaneously generalizes the result and simplifies a key proof technique of [FTCMSW22] for L0L_{0}-smooth functions.

(c) Negative results for known algorithms. We prove a set of negative results in Section 6 for most algorithms analyzed under (L0,L1)(L_{0},L_{1})-smoothness and (Bounded-supp). We construct an oracle for Clipped and Normalized SGD [ZHSJ20, ZJFW20] and Sign SGD with Momentum [CLOZZ22] that leads to failure with constant probability in a wide parameter regime. We also prove that AdaGrad-Norm can diverge with constant probability if the step-size is not carefully tuned in the “large variance” regime for (L0,L1)(L_{0},L_{1})-smooth functions. By contrast, no parameter tuning is needed in the L0L_{0}-smooth setting in this noise regime.

Related Works

Stochastic gradient descent. (SGD) has been well-studied for many decades [RM51]. [PT73] proved almost-sure convergence to a first-order stationary point of (SGD) for non-convex and L0L_{0}-smooth functions with F(w)FF(\mathbf{w})\geq F^{*} with stochastic gradient oracle satisfying (a slightly weaker condition than) (Affine-var). [BT00] extended the result to a setting where F(w)F(\mathbf{w}) does not have a uniform lower-bound. [GL13] proved that (SGD) with step-size ηt=η=min{1/L0,\nicefrac2(F(w1)F)(L0σ02T)}\eta_{t}=\eta=\min\left\{1/L_{0},\sqrt{\nicefrac{{2(F(\mathbf{w}_{1})-F^{*})}}{{(L_{0}\sigma_{0}^{2}T)}}}\right\} achieves a convergence rate to a first-order stationary point of O(\nicefracL0σ02(F(w1)F)T)\mathcal{O}\left(\sqrt{\nicefrac{{L_{0}\sigma_{0}^{2}(F(\mathbf{w}_{1})-F^{*})}}{{T}}}\right), assuming L0L_{0}-smoothness and (Bounded-var). [DS20] proved that this is the optimal rate for (SGD) without further assumptions. Recently, [ACDFSW22] proved that the convergence rate of [GL13] is optimal among all first-order methods, not just SGD.

(L0,L1)(L_{0},L_{1})-smoothness in the (Bounded-supp) regime. Recent work by [ZHSJ20] argued that the L0L_{0}-smoothness assumption is not realistic for many practical machine learning tasks, e.g., large-scale natural language processing using transformer architectures. Instead, they demonstrated that (L0,L1)(L_{0},L_{1})-smooth functions (Generalized-smooth) better capture the loss landscape, and proved that the gradient clipping algorithm converges at a rate O(\nicefrac1T)\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}}) in the (Bounded-supp) regime. [ZJFW20] later proved convergences for a generalized class of gradient clipping algorithms. They used a slightly weaker definition of (L0,L1)(L_{0},L_{1})-smoothness, which we use in Assumption 2. Very recently, [CLOZZ22] considered a “coordinate-wise” generalization of (L0,L1)(L_{0},L_{1})-smoothness, and proved that a “generalized SignSGD” algorithm converges at a rate O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}). By contrast, they proved that gradient descent with fixed step-sizes must scale linearly in ML1ML_{1}, where M=sup{F(w):F(w)F(w1)}M=\sup\left\{\left\lVert\nabla F(\mathbf{w})\right\rVert:F(\mathbf{w})\leq F(\mathbf{w}_{1})\right\} is the largest gradient in the sublevel set F(w)F(w1)F(\mathbf{w})\leq F(\mathbf{w}_{1}). Interestingly, this line of work establishes that adaptive step-size schedules can avoid this dependence on MM.

Problem Setting

We are interested in finding a first-order stationary point of a non-convex function, given access to a stochastic gradient oracle, using (SGD). For compactness, let gt:=g(wt)\bm{g}_{t}:=\bm{g}(\mathbf{w}_{t}). Our objective function F(w)F(\mathbf{w}) satisfies the following:

We note that (L0,L1)(L_{0},L_{1})-smoothness was originally defined in [ZHSJ20] as a bound on the Hessian of F()F(\cdot), as in (Generalized-smooth). Following [ZJFW20, Remark 2.3], we choose to adopt the alternative condition in Assumption 2 for two reasons. First, Assumption 2 is strictly weaker than L0L_{0}-smoothness, since (L0,0)(L_{0},0)-smoothness implies the gradients are L0L_{0}-Lipschitz. Second, whenever the objective is twice-differentiable, Assumption 2 implies (Generalized-smooth) (up to constant factors in the definitions of L0L_{0} and L1L_{1}):

A function satisfying (L0,L1)(L_{0},L_{1})-smoothness as per (Generalized-smooth) is also (2L0,(e1)L1)(2L_{0},(e-1)L_{1})-smooth as per Assumption 2. If F()F(\cdot) is twice continuously differentiable and (L0,L1)(L_{0},L_{1})-smooth as per Assumption 2, then it is also (L0,L1)(L_{0},L_{1})-smooth as per (Generalized-smooth).

Let Ft\mathcal{F}_{t} be the sigma-algebra generated by the interaction between the algorithm and stochastic gradient oracle for tt rounds, i.e., Ft:=σ{w1,g1,,wt,gt,wt+1}\mathcal{F}_{t}:=\sigma\left\{\mathbf{w}_{1},\bm{g}_{1},\ldots,\mathbf{w}_{t},\bm{g}_{t},\mathbf{w}_{t+1}\right\}. We impose the following conditions on the stochastic gradients:

Assumptions 3 and 4 imply the following bound on the stochastic gradients in terms of the true gradient:

We are interested in studying algorithms which require as little hyper-parameter tuning as possible and, simultaneously, can handle potentially unbounded smoothness constant. To achieve this, we analyze AdaGrad-Norm [SM10], a step-size sequence ηt\eta_{t} for (SGD) which, at each time tt, depends on the current and past stochastic gradients {gs}s[t]\left\{\bm{g}_{s}\right\}_{s\in[t]}:

Our main results, Sections 4 and 4, both establish O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) convergence rates for (AG-Norm) in the (L0,L1)(L_{0},L_{1})-smooth regime under (Affine-var). Section 4 holds for any (L0,L1)(L_{0},L_{1})-smooth function under a mild restriction that σ1<1\sigma_{1}<1. It is easy to extend this result for σ11\sigma_{1}\geq 1 by computing mini-batch gradients with a batch size Bσ12B\approx\sigma_{1}^{2}, refer Section A.3 for a proof. Despite the restriction of Section 4 to σ1<1\sigma_{1}<1, we emphasize that, prior to our work, no proof of convergence even for the (Bounded-var) setting (i.e., σ1=0\sigma_{1}=0) was known for a general class of (L0,L1)(L_{0},L_{1})-smooth functions. Besides, Section 4 holds for all σ1\sigma_{1} and a subclass of (L0,L1)(L_{0},L_{1})-smooth functions, i.e., excluding functions like exp(L1x)\exp(L_{1}x).

Fix any constants ε,ε,ε,ε(0,1)\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1) such that ε+ε+ε+ε<1\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1. Consider (AG-Norm) with any parameters η\nicefrac2ε5L1\eta\leq\nicefrac{{2\varepsilon^{\prime}}}{{5L_{1}}} and b02>0b_{0}^{2}>0, running on an objective function satisfying Assumption 2, and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4. Assuming that σ1(1(ε+ε+ε+ε))\sigma_{1}\leq\left(1-(\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime})\right), then for any T1T\geq 1 and δ(0,1)\delta^{\prime}\in(0,1), with probability at least 1δ1-\delta^{\prime}, the iterates of (AG-Norm) satisfy

To extend our convergence proofs beyond σ1<1\sigma_{1}<1, we consider a subclass of (L0,L1)(L_{0},L_{1})-smooth functions which satisfy the following additional assumption:

Notice that, whereas Assumption 2 is a local constraint on the objective, Section 4 enforces a global polynomial growth constraint – thus ruling out such (L0,L1)(L_{0},L_{1})-smooth functions as exponentials, while capturing a significantly broader class of functions than L0L_{0}-smoothness. We refer the interested reader to Section C.1 for some properties of this class of functions. Using this definition, we are able to prove the following:

Fix any constants ε,ε,ε,ε(0,1)\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1) such that ε+ε+ε+ε<1\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1. Consider (AG-Norm) with any parameters η\nicefrac2εL1(4+σ12)\eta\leq\nicefrac{{2\varepsilon^{\prime}}}{{L_{1}(4+\sigma_{1}^{2})}} and b02>0b_{0}^{2}>0, running on an objective function satisfying Assumption 2 and Section 4 for some constants k2,ck1,ck>0k\geq 2,c_{k}\geq 1,c_{k}^{\prime}>0, and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4 for any σ0,σ10\sigma_{0},\sigma_{1}\geq 0. Then, for any T1T\geq 1 and δ(0,1)\delta^{\prime}\in(0,1), with probability at least 1δO~(\nicefrac1T)1-\delta^{\prime}-\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}}), the iterates of (AG-Norm) satisfy

There are several notable takeaways from the above results.

Noise adaptivity. Both Sections 4 and 4 provide “noise-adaptive” convergence rates, in a sense that as σ0,σ10\sigma_{0},\sigma_{1}\to 0, the convergence rates automatically improve from O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) to O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}}) without any additional hyperparameter tuning.

Less hyperparameter tuning. These rates hold without tuning the parameters b0b_{0} or η\eta with respect to σ0\sigma_{0} or L0L_{0}, unlike all prior algorithms for (L0,L1)(L_{0},L_{1})-smoothness that we are aware of [ZHSJ20, ZJFW20, CLOZZ22]This feature, however, manifests into a worse dependence on L1L_{1} unlike [ZJFW20, CLOZZ22]..

Generalization of prior work. We remark that Section 4 strictly generalizes the result of [FTCMSW22] beyond the uniform L0L_{0}-smooth setting. Further, our stopped analysis simplifies their “recursive improvement” technique [FTCMSW22, Lemma 13].

Key technical ideas

As discussed earlier, the main technical tool we use to obtain our convergence rates in Sections 4 and 4 is a recursively-defined stopping time. Before we are ready to define this time and discuss its utility, we first give a brief overview of the main initial steps of our analysis.

This inequality is no longer true for (L0,L1)(L_{0},L_{1})-smooth functions. Indeed, it is clearly not satisfied for all w,w\mathbf{w},\mathbf{w}^{\prime} on the (0,L1)(0,L_{1})-smooth function exp(L1x)\exp(L_{1}x). However, [ZHSJ20, ZJFW20] note that a similar variant holds “locally” for ww\nicefrac1L1\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{L_{1}}} (see Section A.2). Using this variant, we obtain the following inequality, which is our first tool for studying the convergence of (AG-Norm).

Fix any ε,ε(0,1)\varepsilon,\varepsilon^{\prime}\in(0,1). Suppose that η2εL1(4+σ12)\eta\leq\frac{2\varepsilon^{\prime}}{L_{1}(4+\sigma_{1}^{2})}. Then, for any tt,

Intuitively, the “good” times are those times when Section 5 is non-vacuous. Using Section 5, we sum the expression Section 5 until any stopping time τ\tau to obtain the following more useful form.

We leverage the fact that Section 5 holds for any stopping time τ[2,T+1]\tau\in[2,T+1] as follows. Suppose there were some stopping time τ[2,T+1]\tau\in[2,T+1] such that:

Notice that τ1(δ)=1\tau_{1}(\delta)=1, S1(δ)=0S_{1}(\delta)=0, X1(δ)=1X_{1}(\delta)=1, and τ2(δ)=2\tau_{2}(\delta)=2 deterministically. Further, one can show that St(δ),Xt(δ)S_{t}(\delta),X_{t}(\delta), and τt+1(δ)\tau_{t+1}(\delta) are Ft1\mathcal{F}_{t-1}-measurable for every t1t\geq 1. Intuitively, τT+1(δ)\tau_{T+1}(\delta) is the first time that the sum of stochastic gradient norms is significantly larger than its expectation, where the expectation is crucially over the random summation range (refer to Section B.2 for a further discussion). The following result shows the utility of this recursive construction:

For any T1T\geq 1 and δ(0,1]\delta\in(0,1], let τT+1(δ)\tau_{T+1}(\delta) be the stopping time from Section 5.1. Then, we have the following:

τT+1(δ)\tau_{T+1}(\delta) is a stopping time with respect to (Fs1)s1(\mathcal{F}_{s-1})_{s\geq 1}, i.e., s1\forall s\geq 1, {s<τT+1(δ)}Fs1\left\{s<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{s-1}.

Notice that an immediate consequence of Section 5.1 is that:

General (L0,L1)(L_{0},L_{1})-smooth functions clearly violate this inequality. Indeed, F(wt)\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert can potentially be a multiplicative factor of exp(ηL1tt)\exp(\eta L_{1}\left|t-t^{\prime}\right|) times larger than F(wt)\left\lVert\nabla F(\mathbf{w}_{t^{\prime}})\right\rVert (for instance, when the (0,L1)(0,L_{1})-smooth objective is exp(L1x)\exp(L_{1}x). Thus, even if we could guarantee deterministically that only the first O(log(T))\mathcal{O}(\log(T)) time-steps are “bad”, the objective function (and also the norm of the gradient) could grow by polynomial factor in TT during this interval! In fact, this is exactly the intuition behind our negative result for (AG-Norm) in the “large σ1\sigma_{1}” regime (see Section D.1).

Since the numerator and denominator both depend on the same summation, we can apply essentially the same arguments from the σ1<1\sigma_{1}<1 case to obtain our convergence rate for σ11\sigma_{1}\geq 1 in Section 4.

Given our positive results for (AG-Norm) from the previous sections, we now turn our focus to algorithms which have been analyzed in prior works on (L0,L1)(L_{0},L_{1})-smooth optimization. Some of the first-studied algorithms for (L0,L1)(L_{0},L_{1})-smooth optimization take the following forms: for parameters η>0\eta>0 and γ0\gamma\geq 0:

These closely-related updates are referred to as Normalized SGD and Clipped SGD respectively. One motivation for considering these specific updates, at least in the noiseless setting where gt=F(wt)\bm{g}_{t}=\nabla F(\mathbf{w}_{t}), comes through a comparison with the natural SGD step-size for L0L_{0}-smooth non-convex optimization. Indeed, [GL13] show that a constant step-size of ηt=\nicefrac1L0\eta_{t}=\nicefrac{{1}}{{L_{0}}} yields a \nicefrac1T\nicefrac{{1}}{{T}} rate of convergence to a first-order stationary point. Further, a simple extension of this result (see, e.g., [BCN18] for a proof) is that, under L0L_{0}-smoothness and Assumption 4 with σ0=0\sigma_{0}=0 and σ10\sigma_{1}\geq 0, the step size ηt=\nicefrac1L0(1+σ12)\eta_{t}=\nicefrac{{1}}{{L_{0}(1+\sigma_{1}^{2})}} still achieves the \nicefrac1T\nicefrac{{1}}{{T}} convergence rate. Thus, by analogy, in the (L0,L1)(L_{0},L_{1})-smooth setting, ηt=\nicefrac1(L0+L1F(wt))\eta_{t}=\nicefrac{{1}}{{(L_{0}+L_{1}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert)}} (and, in the multiplicative noise regime, ηt=\nicefrac1(1+σ12)(L0+L1F(wt))\eta_{t}=\nicefrac{{1}}{{(1+\sigma_{1}^{2})(L_{0}+L_{1}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert)}}) is a natural candidate step size.

A number of works, including [ZHSJ20, ZJFW20, CLOZZ22], have proved that (variants of) these algorithms converge whenever the noise of the stochastic gradient satisfies (Bounded-supp). It turns out, however, that these algorithms can diverge under the noise model considered in this paper, (Affine-var). To see this, it is useful to consider a specific stochastic gradient oracle which satisfies Assumptions 3 and 4:

We can then take the output of the oracle to be g(w):=ξadd(w)+ξmult(w)F(w)\bm{g}(\mathbf{w}):=\xi_{\textrm{add}}(\mathbf{w})+\xi_{\textrm{mult}}(\mathbf{w})\nabla F(\mathbf{w}). Then, this construction satisfies Assumptions 3 and 4 with the specified σ0\sigma_{0} and σ1\sigma_{1}.

Consider the above oracle with σ0=0\sigma_{0}=0 and σ11+ε\sigma_{1}\gg 1+\varepsilon. This oracle outputs stochastic gradients with the same sign as the true gradient for only roughly a \nicefrac1σ12\nicefrac{{1}}{{\sigma_{1}^{2}}} fraction of the times it is queried. The majority of stochastic gradients thus have the opposite sign of the true gradient! This turns out to be quite problematic for algorithms of the form (5). Indeed, consider the behavior of (5) when gtγ\left\lVert\bm{g}_{t}\right\rVert\geq\gamma. In this regime, both algorithms discard the magnitude of the stochastic gradients gt\bm{g}_{t}, and use only their sign to perform updates. Since the stochastic gradients gt\bm{g}_{t} of Section 6 have the opposite sign of F(wt)\nabla F(\mathbf{w}_{t}) for almost all time steps tt, one can prove that algorithms of the form (5) do not converge to a stationary point with constant probability under Assumptions 3 and 4, even when the objective function is a 11-dimensional quadratic function (i.e., both smooth and strongly-convex). We give a proof of (a slightly more general version of) this fact in Section D.2.

Acknowledgements

This research is supported in part by NSF Grants 2019844 and 2112471, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References

Overview of Appendix

Appendix A Auxiliary Lemmas

Let {ai}i=1\{a_{i}\}_{i=1}^{\infty} be a sequence of non-negative integers such that a1>0a_{1}>0. Then, for any TT,

We proceed via induction. The base case of T=1T=1 holds trivially, with equality. Assuming the hypothesis holds at some time T1T\geq 1, we have that

Now, using the fact that exp(x)\nicefrac1(1x)\exp(x)\leq\nicefrac{{1}}{{(1-x)}} for any x<1x<1, we have that

Combining these two bounds, we conclude that

so the claim holds also for T+1T+1. Thus, the claim holds for all TT by induction. ∎

The (AG-Norm) step-sizes satisfy, for any (possibly random) times 1t0t11\leq t_{0}\leq t_{1} and s0s\geq 0,

We first note that, by definition of btb_{t}:

The iterates {ws}s=1\left\{\mathbf{w}_{s}\right\}_{s=1}^{\infty} generated by (AG-Norm) satisfy, for every t1t\geq 1,

Moreover, for any k2k\geq 2 and t>tt>t^{\prime},

which establishes the first inequality. To obtain the second, we apply the first, together with Jensen’s inequality (noting that k1\left\lVert\cdot\right\rVert^{k-1} is convex), to obtain:

For any function FF satisfying Assumption 2, the sequence of iterates {ws}s=1\{\mathbf{w}_{s}\}_{s=1}^{\infty} generated by (AG-Norm) with η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}} satisfy

Thus, by choosing η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}}, the claim is an immediate consequence of Section A.1. ∎

For any (L0,L1)(L_{0},L_{1})-smooth function F()F(\cdot), assuming that η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}}, the gradient F(wt)2\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2} evaluated at the iterate of (AG-Norm) at time tt satisfies:

Since η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}}, wt+1wt\nicefrac1L1\left\lVert\mathbf{w}_{t+1}-\mathbf{w}_{t}\right\rVert\leq\nicefrac{{1}}{{L_{1}}} by Section A.1. Thus, we may apply Assumption 2 to obtain

The proof of the first statement is from [ZJFW20, Corollary A.4]. The proof of the second statement closely follows the analogous proof for L0L_{0}-smooth functions from [Nes03, Lemma 1.2.2]. We give a proof of this claim for completeness.

Hence, by the limit inequality theorem, we have that, for any 0<s\nicefrac1L10<\left\lVert\mathbf{s}\right\rVert\leq\nicefrac{{1}}{{L_{1}}},

In particular, by taking the supremum over all such s\mathbf{s}, we conclude that

as claimed, where the first equality follows by observing that 2F(x)2F(x)\nabla^{2}F(\mathbf{x})\nabla^{2}F(\mathbf{x})^{\top} and 2F(x)2F(x)\nabla^{2}F(\mathbf{x})^{\top}\nabla^{2}F(\mathbf{x}) have the same non-zero eigenvalues (since all entries of 2F(x)\nabla^{2}F(\mathbf{x}) are real, and by appealing to the singular value decomposition), which implies that 2F(x)\nabla^{2}F(x) and 2F(x)\nabla^{2}F(x)^{\top} have the same spectral norm. ∎

Suppose that the stochastic gradient oracle satisfies Assumptions 3 and 4 for some σ00\sigma_{0}\geq 0 and σ11\sigma_{1}\geq 1. Then, assuming this oracle returns independent stochastic gradients each time g(w)\bm{g}(\mathbf{w}) is sampled, one can construct, for any ε(0,1)\varepsilon\in(0,1), a new stochastic gradient oracle from this one through mini-batching which satisfies Assumptions 3 and 4 with σ0~σ0\widetilde{\sigma_{0}}\leq\sigma_{0} and σ1~=1ε\widetilde{\sigma_{1}}=1-\varepsilon, and where each call to the new gradient requires only B=\nicefracσ12(1ε)2B=\left\lceil\nicefrac{{\sigma_{1}^{2}}}{{(1-\varepsilon)^{2}}}\right\rceil calls to the old one.

where the first inequality follows by Assumption 3 and since gj(w)\bm{g}_{j}(\mathbf{w}) and gj(w)\bm{g}_{j^{\prime}}(\mathbf{w}) are independent. The second inequality follows by Assumption 4 and our choice of B\nicefracσ12(1ε)2B\geq\nicefrac{{\sigma_{1}^{2}}}{{(1-\varepsilon)^{2}}}. Thus, Assumption 4 is satisfied with σ0~2=\nicefrac(1ε)2σ02σ12σ02\widetilde{\sigma_{0}}^{2}=\nicefrac{{(1-\varepsilon)^{2}\sigma_{0}^{2}}}{{\sigma_{1}^{2}}}\leq\sigma_{0}^{2} and σ1~2=(1ε)2\widetilde{\sigma_{1}}^{2}=(1-\varepsilon)^{2}. ∎

An immediate consequence of Section A.2 and [FTCMSW22, Lemma 5] is that, as long as η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}},

We provide a proof of this inequality in Section B.5A careful reader may notice that the inequality in Section B.5 is actually slightly smaller than the one from [FTCMSW22, Lemma 5], since the dependence on constants is strictly better..

Now, let’s focus on bounding the final term above. We start by rewriting it as follows: Let us take Eσ0={F(wt)>σ0}\mathcal{E}_{\sigma_{0}}=\{\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert>\sigma_{0}\}. Then, we can decompose the final term (trivially) as

Now, whenever Eσ0\mathcal{E}_{\sigma_{0}} is false, then this expression is easy to bound, since

Notice that the first term above can be absorbed into the first term in (6), assuming η\eta is sufficiently small. For the remaining term, we begin by noticing that

Therefore, collecting these results and choosing η<\nicefrac2εL1(4+σ12)\eta<\nicefrac{{2\varepsilon^{\prime}}}{{L_{1}(4+\sigma_{1}^{2})}}, we have that

where c~0=c0+η2L1σ0\widetilde{c}_{0}=c_{0}+\eta^{2}L_{1}\sigma_{0}. ∎

Fix any ε(0,1)\varepsilon^{\prime\prime}\in(0,1), and let 2τT+12\leq\tau\leq T+1 be any stopping time with respect to (Fs1)s1(\mathcal{F}_{s-1})_{s\geq 1}. Then, we have that

Let us define, for a parameter λ\lambda to be determined,

More specifically, for any tt, we can decompose

In the other case, for any fixed t[T]t\in[T], we have that

and in the third, we used the fact that ~t2=σ02+F(wt)2\|\widetilde{\nabla}_{t}\|^{2}=\sigma_{0}^{2}+\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2}. Then, noting that

Thus, if we choose λ=εη2(2+σ12)c~0(τ1),\lambda=\frac{\varepsilon^{\prime\prime}\eta^{2}}{(2+\sigma_{1}^{2})\widetilde{c}_{0}(\tau-1)}, then we obtain

Now, summing over t[τ1]t\in[\tau-1], and using the fact that {t<τ}Ft1\left\{t<\tau\right\}\in\mathcal{F}_{t-1} by assumption on τ\tau, we have:

Focusing on the last term in the above inequality, and recalling that (deterministically) τ>1\tau>1 by assumption, we may apply the above bounds together with Jensen’s inequality to obtain:

Combining these bounds yields the claimed inequality. ∎

In the following, we restate Section 5 with an equivalent characterization that is sometimes more convenient for our analysis.

A time t[T]t\in[T] is “good” if, for fixed parameters ε,ε,ε,ε(0,1)\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1), satisfying ε+ε+ε+ε<1\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1

Now, we may use the first and second inequalities in Section B.1 to bound the sum over “good” and “bad” times, respectively, and, collecting terms, we obtain

B.2 Constructing the “nice” stopping time

Let us recall the definition of τT+1(δ)\tau_{T+1}(\delta), the “nice” stopping time: See 5.1

Here, we show that these random variables are well-defined, and enumerate the crucial properties that they satisfy.

For any δ(0,1]\delta\in(0,1] and t1t\geq 1, let τt(δ)\tau_{t}(\delta), St(δ)S_{t}(\delta), and Xt(δ)X_{t}(\delta) be recursively-defined random variables from Section 5.1. Then, we have that, for all t1t\geq 1,

τt(δ)\tau_{t}(\delta) is Ft2\mathcal{F}_{t-2}-measurable, and St(δ),Xt(δ)S_{t}(\delta),X_{t}(\delta) are each Ft1\mathcal{F}_{t-1}-measurable (where we take F0=F1\mathcal{F}_{0}=\mathcal{F}_{-1} to be the trivial σ\sigma-algebra).

τt(δ)\tau_{t}(\delta) is a stopping time with respect to (Fs1)s1(\mathcal{F}_{s-1})_{s\geq 1}, i.e., for all s0s\geq 0 {s<τt(δ)}Fs1\{s<\tau_{t}(\delta)\}\in\mathcal{F}_{s-1}.

For all t1t\geq 1, τt+1(δ)τt(δ)\tau_{t+1}(\delta)\geq\tau_{t}(\delta), St+1(δ)St(δ)S_{t+1}(\delta)\geq S_{t}(\delta), and Xt+1(δ)Xt(δ)X_{t+1}(\delta)\leq X_{t}(\delta).

ττt(δ)1(δ)=a.s.τt(δ)1\tau_{\tau_{t}(\delta)-1}(\delta)\overset{\text{a.s.}}{=}\tau_{t}(\delta)-1

For every s<τt(δ)s<\tau_{t}(\delta), the following inequalities hold deterministically:

Before proving this result, let us briefly discuss an alternative construction to Section 5.1 which is (perhaps) more natural and easier to define, but does not satisfy a property we rely on to prove Section B.3:

Notice that (8) is true no matter the realization of τt(δ)\tau_{t}(\delta). Indeed, for any realization of τt(δ)\tau_{t}(\delta), the bound on the right-hand side still involves a random index inside the expectation. This is not the case with ( ‣ B.2) (there, the random index is outside of the expectation). This difference is crucial, and this special property of Sτt(δ)1S_{\tau_{t}(\delta)-1} is actually what makes the proof of Section B.3 possible.

For the second claim, it suffices to consider 0st20\leq s\leq t-2 (since we just established that τt(δ)\tau_{t}(\delta) is Ft2\mathcal{F}_{t-2}-measurable, and Ft2Ft2\mathcal{F}_{t-2}\subset\mathcal{F}_{t^{\prime}-2} for any ttt^{\prime}\geq t). Now, for any such ss, since s<ts<t, we have that

For the fourth claim, we have that, by definition of St(δ)S_{t}(\delta) and the tower rule of expectation,

Now, since {s<τt(δ)}Fs1\left\{s<\tau_{t}(\delta)\right\}\in\mathcal{F}_{s-1}, and applying (1),

Summing the above expression over s[t1]s\in[t-1], we conclude that

For the seventh claim, assuming s<τt(δ)s<\tau_{t}(\delta), we have that, by Section A.2,

Further, since ττt(δ)1(δ)=τt(δ)1\tau_{\tau_{t}(\delta)-1}(\delta)=\tau_{t}(\delta)-1, and by definition of St(δ)S_{t}(\delta), we have that

For the final claim, we note that τt(δ)t\tau_{t}(\delta)\leq t deterministically, by construction. Thus, we focus on the lower bound. Indeed, notice that, since τt(δ)[t]\tau_{t}(\delta)\in[t],

Therefore, by applying the union bound and Markov’s inequality, we conclude that

B.3 The key consequence of the nice stopping time construction

The following result is the most crucial place where the properties of Section 5.1 are utilized. It tells us that, as long as the sum of “bad” gradients is comparable to the sum of “good” ones, and as long as the descent inequality (Section 5) holds, then the sum of gradients scales (roughly) as O(\nicefracb(T)2δ+b(T)\nicefracTδ)\mathcal{O}(\nicefrac{{b(T)^{2}}}{{\delta}}+b(T)\sqrt{\nicefrac{{T}}{{\delta}}}). One can compare this result to that of [FTCMSW22, Lemma 13 ], which obtained a similar bound in the simpler L0L_{0}-smooth setting. Their argument utilized a technique they termed “recursive improvement,” which required recursively invoking gradually improving bounds in order to reach their desired conclusion after infinitely many calls. Moreover, their argument crucially relies on properties of L0L_{0}-smoothness in order to obtain worst-case upper bounds on the sum of gradients, which are no longer true in our setting. Through our construction of the stopping time τT+1(δ)\tau_{T+1}(\delta), we are able to obtain a similar bound as in their setting, but with an (arguably) significantly simpler and more general proof which works even in the (L0,L1)(L_{0},L_{1})-smooth setting.

Then, we obtain the inequality given below:

Now, by Item 7 in Section B.2, since t<τt(δ)t<\tau_{t}(\delta) for any tS~(τT+1(δ))t\in\widetilde{S}(\tau_{T+1}(\delta)),

then we may solve this inequality to conclude that

and c~0=ησ02ε+η2L0+σ0L12\widetilde{c}_{0}=\frac{\eta\sigma_{0}}{2\varepsilon}+\eta^{2}\frac{L_{0}+\sigma_{0}L_{1}}{2}.

Now, applying Hölder’s inequality to the above, we have:

where we used the following version of Hölder’s:

Recalling that {s<τT+1(δ)}Fs1\left\{s<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{s-1} by Item 2 of Section B.2, we may apply Assumption 4 to obtain

where δ(0,1)\delta\in(0,1) is a parameter of our choosing. In particular, choosing (with foresight) δ=\nicefracδ4T\delta=\nicefrac{{\delta^{\prime}}}{{4T}} for any δ(0,1)\delta^{\prime}\in(0,1), the above can be rewritten as:

To obtain a convergence rate, we begin by noting, for any δ(0,1)\delta^{\prime}\in(0,1), we can decompose

The first term is easy to bound via Markov’s inequality, since, choosing δ=δ4Tδ2(T+1)\delta=\frac{\delta^{\prime}}{4T}\leq\frac{\delta^{\prime}}{2(T+1)} (since T1T\geq 1),

To bound the second term, we note that, whenever S~(τT+1(δ))>\nicefracT2|\widetilde{S}(\tau_{T+1}(\delta))|>\nicefrac{{T}}{{2}}, then

Hence, we have by Markov’s inequality and our previous bounds,

B.5 A deferred proof for establishing Section 5

Fix any ε(0,1)\varepsilon\in(0,1). Suppose that η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}}. Then, for any time tt, the iterates of (AG-Norm) satisfy

The proof proceeds using similar arguments as in [FTCMSW22, Lemma 5]. By Section A.2 and the definition of (AG-Norm), we know that

We begin by bounding the inner product term above as:

Combining the above arguments, and applying Hölder’s inequality, we have that

where the last step comes from ~tF(wt)\|\widetilde{\nabla}_{t}\|\geq\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert. Collecting our bounds so far yields:

Focusing on the term depending on σ0\sigma_{0}, we have that for any ε>0\varepsilon>0,

In this section, we show that Section B.4 can be used to establish a O~(\nicefrac1T)\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}}) convergence rate without the restriction of σ1<1\sigma_{1}<1. The key is to restrict our attention to (L0,L1)(L_{0},L_{1})-smooth functions which satisfy the following additional property:

The following result provides a characterization of these functions relative to L0L_{0}-smooth functions and (L0,L1)(L_{0},L_{1})-smooth functions. In particular, it tells us that Section 4 is a richer function class than (L0,L1)(L_{0},L_{1})-smooth functions. However, not all (L0,L1)(L_{0},L_{1})-smooth functions satisfy Section 4.

Every L0L_{0}-smooth function satisfies Section 4 with k=2k=2, ck=1c_{k}=1, and ck=L0c_{k}^{\prime}=L_{0}.

Every (L0,L1)(L_{0},L_{1})-smooth function satisfies Section 4 locally (i.e., when ww\nicefrac1L1\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{L_{1}}}) with k=2,ck=2k=2,c_{k}=2 and ck=L0c_{k}^{\prime}=L_{0}.

There is a (0,L1)(0,L_{1})-smooth function which does not satisfy Section 4 for any fixed k,ck,ckk,c_{k},c_{k}^{\prime}.

For any k2k\geq 2, F(w)=wwkF(\mathbf{w})=\left\lVert\mathbf{w}-\mathbf{w}^{*}\right\rVert^{k} satisfies Section 4 with k=kk=k, ck=2k2c_{k}=2^{k-2}, and ck=k2k2c_{k}^{\prime}=k2^{k-2}. Additionally, for any L1>0L_{1}>0, F(w)F(\mathbf{w}) is (\nicefrac2k(k1)L1k2,(e1)(k1)L1)(\nicefrac{{2k(k-1)}}{{L_{1}^{k-2}}},(e-1)(k-1)L_{1})-smooth. However, this F(w)F(\mathbf{w}) is not L0L_{0}-smooth when k>2k>2.

The second follows since, for any (L0,L1)(L_{0},L_{1})-smooth function, for every wwL1\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq L_{1},

For the third claim, consider the function F(w)=exp(L1w)F(w)=\exp(L_{1}w). Since F(w)=L12exp(L1w)=L1F(w)F^{\prime\prime}(w)=L_{1}^{2}\exp(L_{1}w)=L_{1}F^{\prime}(w). Suppose there were some k,ck,ckk,c_{k},c_{k}^{\prime} such that Section 4 is satisfied. Then, it must be the case that, for any x>0x>0:

where the inequality follows from the definition of Section 4, the first equality by L’Hôpital’s rule, and the second by rewriting the previous expression. Repeating this argument k1k-1 times, this implies that

a contradiction. Hence, exp(L1x)\exp(L_{1}x) cannot satisfy Section 4.

For the final claim, we see that F(w)F(\mathbf{w}) satisfies Section 4 with ck=2k2c_{k}=2^{k-2} and ck=k2k2c_{k}^{\prime}=k2^{k-2} since, by Jensen’s inequality,

Further, F(w)F(\mathbf{w}) is also (2k(k1),(e1)(k1))(2k(k-1),(e-1)(k-1))-smooth, since simple calculations yield that

In particular, this implies that ww\mathbf{w}-\mathbf{w}^{*} is an eigenvector with largest eigenvalue, so, for any L1>0L_{1}>0,

Therefore, by [ZJFW20, Corollary A.4], for any ww\nicefrac1(k1)L1\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{(k-1)L_{1}}},

The following result demonstrates the difference in worst-case gradient norm scaling that (L0,L1)(L_{0},L_{1})-smooth functions provide, versus the worst-case scaling of functions satisfying Section 4.

For any function satisfying Assumption 2, and any algorithm producing iterates (ws)s1(\mathbf{w}_{s})_{s\geq 1} satisfying ws+1wsη\nicefrac1L1\left\lVert\mathbf{w}_{s+1}-\mathbf{w}_{s}\right\rVert\leq\eta\leq\nicefrac{{1}}{{L_{1}}} for every s1s\geq 1, the following inequality holds for every t>tt>t^{\prime}:

Moreover, this inequality is essentially unimprovable, in the sense that there exists a (0,O(L1))(0,\mathcal{O}(L_{1}))-smooth function and η\nicefrac1L1\eta\leq\nicefrac{{1}}{{L_{1}}} such that F(wT+1)=(1+O(ηL1))T+1F(w1)\left\lVert\nabla F(\mathbf{w}_{T+1})\right\rVert=(1+\mathcal{O}(\eta L_{1}))^{T+1}\left\lVert\nabla F(\mathbf{w}_{1})\right\rVert for any T1T\geq 1, and a (O(L0),O(L1))(\mathcal{O}(L_{0}),\mathcal{O}(L_{1}))-smooth function such that F(w1)=0\left\lVert\nabla F(\mathbf{w}_{1})\right\rVert=0 and F(wT+1)=O(\nicefracL0L1)((1+O(ηL1))T1)\left\lVert\nabla F(\mathbf{w}_{T+1})\right\rVert=\mathcal{O}(\nicefrac{{L_{0}}}{{L_{1}}})((1+\mathcal{O}(\eta L_{1}))^{T}-1). By contrast, any function satisfying Section 4 satisfies:

We begin by proving the first claim by by induction on ttt-t^{\prime}. The base case of tt=1t-t^{\prime}=1 holds by definition, since

Now, supposing the claim holds for tt=1,,st-t^{\prime}=1,\ldots,s, we have that:

where the first inequality follows by applying the induction hypothesis for tt=st-t^{\prime}=s, the second by applying the induction hypothesis for tt=1t-t^{\prime}=1, and the final equality follows by rearranging the prior line. Thus, the inequality holds also at tt=s+1t-t^{\prime}=s+1, and thus our claim holds by induction.

To see that this inequality is essentially unimprovable, let us consider first consider, for any L1>0L_{1}>0, the function:

Since F(x)=L12exp(L1x)=L1F(x)F^{\prime\prime}(x)=L_{1}^{2}\exp(L_{1}x)=L_{1}F^{\prime}(x), it follows from Section 3 that F()F(\cdot) is (0,(e1)L1)(0,(e-1)L_{1})-smooth. Notice that, if xs+1xs=η=\nicefrac1(e1)L1x_{s+1}-x_{s}=\eta=\nicefrac{{1}}{{(e-1)L_{1}}}, then, taking x1=0x_{1}=0 and t1t\geq 1,

Further, for any L0,L1>0L_{0},L_{1}>0, consider the function:

Noting that F(x)2L0+2L1F(x)F^{\prime\prime}(x)\leq 2L_{0}+2L_{1}F^{\prime}(x) when x\nicefrac2L1\left|x\right|\geq\nicefrac{{\sqrt{2}}}{{L_{1}}}, and F(x)L1(2+exp(2))+2L1F(x)F^{\prime\prime}(x)\leq L_{1}(2+\exp(\sqrt{2}))+2L_{1}F^{\prime}(x) otherwise, it follows that FF is (2(2+exp(2))L0,2(e1)L1)(2(2+\exp(\sqrt{2}))L_{0},2(e-1)L_{1})-smooth (by Section 3). Therefore, whenever η=\nicefrac12(e1)L1\eta=\nicefrac{{1}}{{2(e-1)L_{1}}},

where the first equality follows by rearranging the definition, the second since ηL1=Θ(1)\eta L_{1}=\Theta(1), the third since 2c1+c(e1)ηL1=2c(e1)ηL11+2c(e1)ηL1log(1+c2(e1)ηL1)2c(e1)ηL12\frac{c}{1+c}(e-1)\eta L_{1}=\frac{2c(e-1)\eta L_{1}}{1+2c(e-1)\eta L_{1}}\leq\log(1+c2(e-1)\eta L_{1})\leq 2c(e-1)\eta L_{1}, and the fourth by rearranging.

The final inequality follows immediately from Sections A.1 and 4, which together imply that

where we use the fact that ck1c_{k}\geq 1 and bt12bt12b_{t-1}^{2}\geq b_{t^{\prime}-1}^{2} (since ttt\geq t^{\prime}) for the first inequality, the definition of Section 4 for the second, and the definition of b~t\widetilde{b}_{t} from Section 3 for the third. Now, either F(wt)2max{ckwtwtk1,L0wtwt}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert\geq 2\max\left\{c_{k}^{\prime}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert^{k-1},L_{0}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert\right\}, or not. In the first case, we note that

In the alternate case that F(wt)<2max{ckwtwtk1,L0wtwt}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert<2\max\left\{c_{k}^{\prime}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert^{k-1},L_{0}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert\right\}, we obtain:

To bound the second term, we apply Sections 4 and A.1, together with the above construction, to obtain:

Thus, by the Multinomial theorem, we have that

Combining this with the above, we have the following:

We claim that, for any s1s\geq 1, the inner summation term above is bounded in expectation by:

To see this, first note that, by Section B.1, and since {t<τT+1(δ)}Ft1\left\{t<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{t-1} by Section B.2, for any t0t^{\prime}\geq 0,

Now, by Items 4, 5 and 6 of Section B.2, we have that, almost surely,

Therefore, collecting these results, we conclude that, for any t0t^{\prime}\geq 0,

Now, the base case of s=1s=1 for (C.2) follows immediately from (14) with t=0t^{\prime}=0. Let us now suppose that the claim (C.2) holds for some s1s\geq 1. Then, to apply the induction hypothesis, we begin by decomposing:

Notice that the above expectation is a product of two terms: indicators depending of times t1,,tst_{1},\ldots,t_{s}, and those depending on ts+1>tst_{s+1}>t_{s}. Therefore, since, by Sections B.2 and B.1,

we may apply the tower rule of expectations and the inequality from (14):

Therefore, summing the above expression over 1t1<<tsT1\leq t_{1}<\ldots<t_{s}\leq T and applying the induction hypothesis, we conclude that:

Now, finally noting that, for any x1x\geq 1,

C.3 Bounding the sum of “bad” gradients by the sum of “good” ones

We recall from Section B.4 that, in order to use this bound, we need to show that the sum of “bad” gradients can be upper-bounded (relatively) by the sum of “good” ones. It turns out, for functions satisfying Section 4, this is possible, as we now show.

Let τ1\tau\geq 1 be any (possibly random) time, and consider any (possibly random) set S(τ)[τ1]S(\tau)\subseteq[\tau-1]. Denote S(τ)c=[τ1]S(τ)S(\tau)^{c}=[\tau-1]\setminus S(\tau). Then, assuming F()F(\cdot) satisfies Section 4, the following is satisfied deterministically:

The proof of this result follows a similar argument as used in Section 5.2. The main idea here is to, for every tS(τ)ct\in S(\tau)^{c} in decreasing order, find the first available time tS(τ)t^{\prime}\in S(\tau) which has not been associated with an earlier time from S(τ)cS(\tau)^{c}. Then, using Section 4, we show that, as long as tt and tt^{\prime} are not too far apart, then F(wt)2\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2} and F(wt)2\left\lVert\nabla F(\mathbf{w}_{t^{\prime}})\right\rVert^{2} must also be close. For some times tS(τ)ct\in S(\tau)^{c}, there may not be such a tS(τ).t^{\prime}\in S(\tau). However, because of the greedy construction, these times must be relatively small (roughly within the first S(τ)c|S(\tau)^{c}| time steps). Thus, as long as S(τ)c|S(\tau)^{c}| is not “too big” (in expectation), then we can still bound these remaining terms. We now make these arguments precise.

To begin, note that for every t,t1t,t^{\prime}\geq 1, by Section 4,

We use this bound as follows: let us index the times in S(τ)cS(\tau)^{c}, denoting τ~[i]\widetilde{\tau}_{[i]} to be the iith largest time in S(τ)cS(\tau)^{c}, i.e.,

Indeed, this follows by first decomposing

Next, notice that, for every iii\geq i^{*},

where the first inequality is by definition of τ~[i]\widetilde{\tau}_{[i]}. To see the second inequality, we follow a similar argument as before. Indeed, observe that

where in the second inequality, we applied Section 5.2. Thus, we obtain the claimed result. ∎

and c~0=ησ02ε+η2L0+σ0L12\widetilde{c}_{0}=\frac{\eta\sigma_{0}}{2\varepsilon}+\eta^{2}\frac{L_{0}+\sigma_{0}L_{1}}{2} (where we use the notation (x)+:=max{0,x}(x)_{+}:=\max\left\{0,x\right\}).

Thus, the conditions to apply Section B.4 are satisfied, and we obtain the convergence rate. ∎

In this section, we consider the convergence behavior of several natural candidate algorithms which have been studied in the literature on (L0,L1)(L_{0},L_{1})-smooth optimization. These algorithms take the form wt+1=wtut\mathbf{w}_{t+1}=\mathbf{w}_{t}-\bm{u}_{t}, where ut\bm{u}_{t} takes a number of different forms, including: in Normalized SGD:

and Sign-SGD with Momentum (operations performed element-wise):

[ZHSJ20, ZJFW20, CLOZZ22] prove O(\nicefrac1T)\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}}) convergence of these algorithms in the setting of (Bounded-supp). In this section, we show that these step-size choices for (L0,L1)(L_{0},L_{1})-smooth optimization fail under (Affine-var), despite working in the noiseless and (Bounded-supp) settings. Our negative results rely on the following stochastic gradient oracle construction:

Fix any ε,σ10\varepsilon,\sigma_{1}\geq 0. We begin by establishing that Assumption 3 holds for our construction of g(w)\bm{g}(\mathbf{w}). Begin by denoting δ=\nicefrac(1+ε)2((1+ε)2+σ12)\delta=\nicefrac{{(1+\varepsilon)^{2}}}{{((1+\varepsilon)^{2}+\sigma_{1}^{2})}}. Under this notation, we have that

which establishes Assumption 4 for any σ0,σ10\sigma_{0},\sigma_{1}\geq 0. ∎

We establish all of the following negative results using the stochastic gradient oracle described in Section 6. Before stating our results, let us briefly discuss some intuition behind why one should expect (NormSGD), (ClippedSGD), and (SignSGD-M) to fail under Section 6. Consider the setting where σ11+ε\sigma_{1}\gg 1+\varepsilon. Then, notice that the stochastic gradient gt\bm{g}_{t} only has the same sign as F(wt)\nabla F(\mathbf{w}_{t}) with roughly \nicefrac1σ12\nicefrac{{1}}{{\sigma_{1}^{2}}} probability. Otherwise, gt\bm{g}_{t} has the opposite sign as F(wt)\nabla F(\mathbf{w}_{t}). Now, for an algorithm which incorporates the magnitude of the stochastic gradients together with the signs, the oracle in Section 6 may not be so problematic – indeed, even though the updates with correct sign are somewhat “rare”, they are also of significantly larger magnitude compared to the updates with proper sign. However, notice that (NormSGD), (ClippedSGD), and (SignSGD-M) are (effectively) unit step-length algorithms (at least, in the setting where gtγ\left\lVert\bm{g}_{t}\right\rVert\geq\gamma). Thus, in many parameter regimes, all of these algorithms effectively disregard the magnitude of the stochastic gradients and only use their signs. This results in a biased random walk which never finds an iterate better than the initial one with constant probability. We formalize this intuition in the following:

We note that the statement Section D.1 follows from Section D.2 by choosing the parameter ε=\nicefracσ122\varepsilon=\nicefrac{{\sigma_{1}}}{{2\sqrt{2}}}. The main takeaway here is that, for a reasonably wide range of parameters, (NormSGD), (ClippedSGD), and (SignSGD-M) can diverge in the affine variance setting, even for very simple smooth and strongly convex problems (in fact, even on a 11-dimensional quadratic function). In particular, this says that, whenever (NormSGD) is run with γ=0\gamma=0 (or (SignSGD-M) with β=0\beta=0), then there is no parameter tuning with respect to η\eta such that mint[T]F(xt)2\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2} converges!

Fix any L1>0L_{1}>0, time horizon T>1T>1, and affine variance parameter

D.2 Full statement and proof of negative results for (SignSGD-M), (NormSGD), and (ClippedSGD)

Here, we give the complete negative result for (SignSGD-M), (NormSGD), and (ClippedSGD), and formalize the intuition given there.

Let us choose, for arbitrary L0>0L_{0}>0 and Δ>0\Delta>0, the (L0,0)(L_{0},0)-smooth objective F(x)=\nicefracL02 x2F(x)=\nicefrac{{L_{0}}}{{2}}\ x^{2}, and assume without loss of generality that x1=\nicefrac2ΔL0x_{1}=-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}} (indeed, if this is not the case, then we can always translate the function F(x)F(x) to be F(x)=\nicefracL02(xx1\nicefrac2ΔL0)2F(x)=\nicefrac{{L_{0}}}{{2}}(x-x_{1}-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}})^{2}, and our arguments remain unchanged). Notice that F(x1)F=F(x1)=ΔF(x_{1})-F^{*}=F(x_{1})=\Delta.

Consider, for any ε>0\varepsilon>0 and σ12>(1+ε)2\sigma_{1}^{2}>(1+\varepsilon)^{2}, the stochastic gradient oracle from Section 6, i.e.,

Let τ\tau^{*} be the first time when an iterate becomes larger than the original one, i.e.,

Notice that this implies that, for any 1t<τ1\leq t<\tau^{*}:

This guarantees that, before τ\tau^{*}, (i) the iterates are always to the left of the minimizer, (ii) that the algorithm (ClippedSGD) never “clips” (i.e., ut=\nicefracηgtgtu_{t}=\nicefrac{{\eta g_{t}}}{{\left|g_{t}\right|}}), and (iii), mint<τF(xt)2=F(x1)2\min_{t<\tau^{*}}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2} (i.e., the algorithm never achieves any nontrivial target minimization criterion). Additionally, it must be the case that:

since xτ1<x1x_{\tau^{*}-1}<x_{1} and xτ=xτ1uτ1x1x_{\tau^{*}}=x_{\tau^{*}-1}-u_{\tau^{*}-1}\geq x_{1}.

Now, let us distinguish the updates of (SignSGD-M), (NormSGD), and (ClippedSGD) as (xt(1),ut(1))(x_{t}(1),u_{t}(1)), (xt(2),ut(2))(x_{t}(2),u_{t}(2)), and (xt(3),ut(3))(x_{t}(3),u_{t}(3)), respectively. Now, instead of reasoning about the dynamics of each of these algorithms individually, we instead reason about an algorithm with simpler dynamics, and draw conclusions about each of these processes via a stochastic dominance argument.

To do this, we utilize the coupling of these algorithms defined in Section D.2 – namely, we let x1(i)=x1=\nicefrac2ΔL0x_{1}(i)=x_{1}=\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}} (as discussed above), and g(xt(i))=ξmult,tF(xt(i))g(x_{t}(i))=\xi_{\textrm{mult},t}\nabla F(x_{t}(i)) for every ii, where ξmult,t\xi_{\textrm{mult},t} is ε-\varepsilon with probability 1δ1-\delta, and 1+\nicefracσ12(1+ε)1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}} otherwise. That is, each process starts from the same initial iterate, and receives the same multiplicative noise on the stochastic gradient at time tt.

Similarly, let us define the “simpler” comparison process as:

and take x1(4)=x1x_{1}(4)=x_{1} and xt+1(4)=xt(4)ut(4)x_{t+1}(4)=x_{t}(4)-u_{t}(4). Now, denote τ(i)\tau^{*}(i) as the stopping time from (17) corresponding to the process ii\in. Then, by Section D.2, we have that, under our coupling of these algorithms, τ(4)miniτ(i)\tau^{*}(4)\leq\min_{i\in}\tau^{*}(i), which implies that, for each algorithm ii\in:

By Section D.2, we have that, for any t00t_{0}\geq 0:

Thus, using the geometric summation formula, we can bound the above summations as:

Now, let us focus on bounding the two terms in the above expression. To do this, we first observe that, for any μ,x>0\mu,x>0 and i0i\geq 0,

In particular, since exp(x)<\nicefrac1(1+x)\exp(-x)<\nicefrac{{1}}{{(1+x)}} for any x>0x>0, and thus also \nicefracexp(x)1exp(x)<\nicefrac1x\nicefrac{{\exp(-x)}}{{1-\exp(-x)}}<\nicefrac{{1}}{{x}}, since x>x1\left\lfloor x\right\rfloor>x-1, we have that the above inequality is satisfied whenever:

and, combining our results, we conclude that, for any algorithm ii\in:

Consider the process {xt}t1\left\{x_{t}\right\}_{t\geq 1} from (SignSGD-M) as defined in Section D.2, where x1<0x_{1}<0, F(x):=\nicefracL02 x2F(x):=\nicefrac{{L_{0}}}{{2}}\ x^{2} for some L0>0L_{0}>0, and gtg_{t} are the stochastic gradients output by the oracle from Section 6. Suppose that the parameter β\beta of (SignSGD-M) satisfies:

Let τ=min{t>1:x1xt}\tau^{*}=\min\left\{t>1:x_{1}\leq x_{t}\right\}. Then, if t<τt<\tau^{*} and gt=εF(xt)g_{t}=-\varepsilon\nabla F(x_{t}), then ut=ηu_{t}=\eta.

Recall that, by construction of the stochastic gradient oracle from Section 6, and since F(x)=L0x\nabla F(x)=L_{0}x:

We wish to show that the process from (SignSGD-M) has the property that, whenever gt=εL0xtg_{t}=-\varepsilon L_{0}x_{t} and xs<0x_{s}<0 for every s[t]s\in[t], then ut=ηu_{t}=\eta. We consider any initialization x1<0x_{1}<0, and denote τ\tau^{*} to be the first time when an iterate becomes non-negative, i.e.,

Notice that, since ut{±η}u_{t}\in\left\{\pm\eta\right\} by definition of (SignSGD-M), and by construction of the stochastic gradient oracle:

Thus, it suffices to prove by induction that, for any i0i\geq 0, either τiτ\tau_{i}\geq\tau^{*}, or uτi=ηu_{\tau_{i}}=\eta, as long as

For the base case of i=0i=0, we may assume without loss of generality that uτ0=u0=ηu_{\tau_{0}}=u_{0}=\eta, since m0=0m_{0}=0 and the dynamics of the update rule do not depend on u0u_{0} (i.e., the dynamics begin at time t=1t=1 and x1x_{1} is the starting point of the process). Thus, the base case is true by construction.

Now, suppose the claim holds for some i1i\geq 1. Either τi+1τ\tau_{i+1}\geq\tau^{*} or not. In the former case, the claim follows trivially, so let us assume that τi+1<τ\tau_{i+1}<\tau^{*}. Since τi<τi+1<τ\tau_{i}<\tau_{i+1}<\tau^{*} by construction, uτi=ηu_{\tau_{i}}=\eta by the induction hypothesis. Further, let us assume that gτi+1=εL0xτi+1g_{\tau_{i+1}}=-\varepsilon L_{0}x_{\tau_{i+1}}, since otherwise the claim again follows trivially by definition of τi+1\tau_{i+1}. Thus, we can write:

where the first equality is the definition of mτi+1m_{\tau_{i+1}}. The second inequality follows from observation (21). Further, since uτi=ηu_{\tau_{i}}=\eta, then by definition of (SignSGD-M), either mτi>0m_{\tau_{i}}>0, or mτi=0m_{\tau_{i}}=0 and the algorithm chooses uτi=ηu_{\tau_{i}}=\eta. In either case, mτi0m_{\tau_{i}}\geq 0. Therefore, since, for β[0,1)\beta\in[0,1):

we obtain, using the fact that xτi+1=xτi+1+η(τi+1(τi+1))x_{\tau_{i+1}}=x_{\tau_{i}+1}+\eta(\tau_{i+1}-(\tau_{i}+1)) and mτi0m_{\tau_{i}}\geq 0:

Thus, since xτi+1xτi+1<0x_{\tau_{i}+1}\leq x_{\tau_{i+1}}<0, and since τi+1<τ\tau_{i+1}<\tau^{*} (which implies, since each update of (SignSGD-M) satisfies ut{±η}u_{t}\in\left\{\pm\eta\right\} and by definition of τ\tau^{*}, xτi+1xτ1=x1η<0x_{\tau_{i+1}}\leq x_{\tau^{*}-1}=x_{1}-\eta<0), the above inequality implies that mτi+1>0m_{\tau_{i+1}}>0 as long as:

Since we require 0β<10\leq\beta<1, the second inequality is equivalent to:

Thus, since 1x<1\nicefracx2\sqrt{1-x}<1-\nicefrac{{x}}{{2}} for 0<x10<x\leq 1, it suffices to choose β\beta as:

In this case, mτi+1>0m_{\tau_{i+1}}>0, and thus ut=ηu_{t}=\eta, which establishes the induction step. Thus, for every i0i\geq 0, either τiτ\tau_{i}\geq\tau^{*} or uτi=ηu_{\tau_{i}}=\eta, as claimed. ∎

Let us recall the i.i.d random process {ξmult,t}t1\left\{\xi_{\textrm{mult},t}\right\}_{t\geq 1} from Section 6, where each ξmult,t\xi_{\textrm{mult},t} is ε-\varepsilon with probability 1δ1-\delta, and (1+\nicefracσ12(1+ε))(1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}}) otherwise. Let us distinguish the three processes from Section D.2 (Eqs. SignSGD-M, ClippedSGD and NormSGD) as, respectively, {xt(i)}t1\left\{x_{t}(i)\right\}_{t\geq 1} for ii\in. Consider the coupling of these three processes, where x1(i)=x1:=\nicefrac2ΔL0x_{1}(i)=x_{1}:=-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}} for every ii\in, and for each t1t\geq 1 and ii\in, g(xt(i))=ξmult,tF(xt(i))g(x_{t}(i))=\xi_{\textrm{mult},t}\nabla F(x_{t}(i)). Further, let us denote, for each t1t\geq 1:

and take x1(4)=x1x_{1}(4)=x_{1} and xt+1(4)=xt(4)ut(4)x_{t+1}(4)=x_{t}(4)-u_{t}(4). Further, let, for each ii\in,

Then, under the constraints on parameters of the three algorithms as imposed in Section D.2, we have that:

We claim that, for each ii\in, and any t<τ(i)t<\tau^{*}(i), ut(4)ut(i)u_{t}(4)\leq u_{t}(i). Notice that, supposing this claim is true, then τ(4)τ(i)\tau^{*}(4)\leq\tau^{*}(i) for each ii\in, since, by definition of τ(i)\tau^{*}(i):

Thus, since τ(i)\tau^{*}(i) is the first time t>1t>1 for which xt(i)x1x_{t}(i)\geq x_{1}, it follows that τ(4)τ(i)\tau^{*}(4)\leq\tau^{*}(i). Having established this implication, it suffices to prove the claim for each of the ut(i)u_{t}(i)s.

For the case of i=2i=2 (i.e., algorithm (NormSGD)), for every t<τ(2)t<\tau^{*}(2), since g(xt(2))εL0x1γ\left|g(x_{t}(2))\right|\geq\varepsilon L_{0}|x_{1}|\geq\gamma by (18) and since \nicefracx(x+y)\nicefrac{{x}}{{(x+y)}} is non-decreasing in xx on the interval x(0,)x\in(0,\infty) for any fixed y0y\geq 0,

Therefore, the claim is established in all three cases, which also concludes the proof. ∎

Consider the algorithm 44 as defined in Section D.2. Then, under the assumptions of Section D.2, we have that, for any T1T\geq 1 and any t00t_{0}\geq 0,

as the “net movement” of algorithm 44 to the left of xt1x_{t_{1}} after t2t1+1t_{2}-t_{1}+1 time steps. and observe that

Additionally, note that, recalling the definition of τ(4)\tau^{*}(4) from Eq. 17,

Therefore, we have that, for any t00t_{0}\geq 0,

it remains only to upper-bound each probability inside of the above summation. To do this, let us denote, for any t[t0+2,T]t\in[t_{0}+2,T],

Collecting the above results, we arrive at the claimed lower bound. ∎

In particular, whenever η\nicefracαL1\eta\geq\nicefrac{{\alpha}}{{L_{1}}} for some α>0\alpha>0, and, for any δ(0,1)\delta\in(0,1),

and g(xt)=ξmult,tF(xt)g(x_{t})=\xi_{\textrm{mult},t}\nabla F(x_{t}). As established in Section 6, this oracle satisfies Assumptions 3 and 4 with σ0=0\sigma_{0}=0 and the specified σ1>1\sigma_{1}>1.

Let us define, for a parameter t01t_{0}\geq 1 to be determined shortly:

Now, since the noise is sampled i.i.d at each time step, we have that, for any t00t_{0}\geq 0:

Whenever Enc\mathcal{E}_{\text{nc}} is true, notice that:

Now, using the fact that, whenever Enc\mathcal{E}_{\text{nc}} is true, then F(xt)F(xt+1)\nabla F(x_{t})\leq\nabla F(x_{t+1}) for each t[t0]t\in[t_{0}], and assuming b02ε2F(x1)2b_{0}^{2}\leq\varepsilon^{2}\left\lVert\nabla F(x_{1})\right\rVert^{2}, we can bound

Now, for a parameter α>0\alpha>0 to be determined shortly, let us define:

Notice that, by construction, ξmult,τi=1+\nicefracσ12(1+ε)\xi_{\textrm{mult},\tau_{i}}=1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}} for every i0i\geq 0. Further, xτi+1xτi+1x_{\tau_{i}+1}\leq x_{\tau_{i+1}} since τi+1\tau_{i+1} is the first time after τi\tau_{i} satisfying F(xτi+1+1)<F(xτi+1)\nabla F(x_{\tau_{i+1}+1})<\nabla F(x_{\tau_{i}+1}), or equivalently, xτi+1+1<xτi+1x_{\tau_{i+1}+1}<x_{\tau_{i}+1}. This implies that

Now, by construction of the τi\tau_{i}, we have that, assuming Enc\mathcal{E}_{\text{nc}} is true, then

Thus, to ensure that mint[T]F(xt)2=F(x1)2\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2}, it suffices to have that, for every i<Ti<T, F(xτi)2F(x1)2\left\lVert\nabla F(x_{\tau_{i}})\right\rVert^{2}\geq\left\lVert\nabla F(x_{1})\right\rVert^{2}. Now, notice that

Thus, it suffices to establish conditions under which

Thus, if we choose α=\nicefraclog(T1)ηL1\alpha=\nicefrac{{\log(T-1)}}{{\eta L_{1}}}, then it suffices to take:

in which case mint[T]F(xt)2=F(x1)2\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2} under Enc\mathcal{E}_{\text{nc}}. Hence,

In particular, using the fact that 1x>exp(\nicefracx(1x))1-x>\exp(\nicefrac{{-x}}{{(1-x)}}) for x<1x<1, it follows that:

Hence, as long as, for some δ(0,1)\delta\in(0,1),