Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

Introduction

A fundamental problem in stochastic optimization is to characterize the convergence behavior of the Stochastic Gradient Descent algorithm:

Much of the literature on stochastic optimization, e.g., [NY83, GL13, Bub15, FSSSSW19], focuses on a special case of (Affine-var) where the variance is uniformly upper-bounded ( $\sigma_{1}=0$ ):

Rates of convergence to a first-order stationary point in these settings are now well-understood. Under (Bounded-var) regime, [GL13] prove an $\mathcal{O}(\nicefrac{{\sqrt{\sigma_{0}^{2}L_{0}(F(\mathbf{w}_{1})-F^{*})}}}{{\sqrt{T}}})$ rate of convergence with a fixed step-size schedule. Later, [ACDFSW22] show that this rate is optimal up to constant factors. Further, as noted by [BCN18], a minor modification to this step-size gives nearly the same rate in the more general (Affine-var) setting, i.e., $\sigma_{1}>0$ . This rate is obtained by making trivial changes to the proof technique of [GL13].

One crucial assumption in these lines of work is $L_{0}$ -smoothness, i.e., $L_{0}$ -Lipschitz gradients of the loss landscape. However, recent works [ZJFW20, ZHSJ20] provide empirical evidence that this assumption is often not satisfied in practical machine learning problems. For instance, in large-scale language modeling including BERT [DCLT18] and other variants [Rad+21, Car+21, LYFJHN23], the loss landscape of transformer architectures either does not satisfy the $L_{0}$ -smoothness assumption, or the value of $L_{0}$ becomes so large that it produces a significantly weaker rate of convergence [ZJFW20, ZHSJ20].

Aiming to address these issues, there has been a recent surge of interest in relaxing the standard $L_{0}$ -smoothness assumption and characterizing the rate of convergence. One appealing relaxation proposed by [ZHSJ20] is that of $(L_{0},L_{1})$ -smoothnessFor convenience, we state this assumption in terms of a bound on the hessian of $F$ . The requirement that the hessian exists everywhere can be relaxed to a condition on the gradients [ZJFW20]. This relaxation is the one we use for our main results, see Assumption 2.:

While every $L_{0}$ -smooth function is also $(L_{0},0)$ -smooth, this relaxation admits functions that grow significantly faster than a quadratic function, e.g., $F(w)=w^{d}$ is $(\nicefrac{{d(d-1)}}{{L_{1}^{d-2}}},(d-1)L_{1})$ -smooth for any $L_{1}>0$ , and $F(w)=\exp(L_{1}w)$ is $(0,L_{1})$ -smooth. With regards to convergence, recent works [ZHSJ20, CLOZZ22] establish an $\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}})$ rate in the $(L_{0},L_{1})$ -smooth setting, as long as the noise of the stochastic gradients has uniformly-bounded support, i.e.,

In many real-world scenarios, the (Bounded-supp) assumption does not hold. For instance, when running SGD in standard least-squares regression settings, the stochastic gradients have multiplicative noise, as noted in [DFB17, FB17, JKKNS18, JT19]. Similar noise assumptions have also been considered, e.g., in convergence of stochastic proximal gradient methods [RVV20], Hilbert-valued stochastic subgradient methods [BRS07], and adaptive gradient methods [FTCMSW22]. Moreover, multiplicative noise naturally arises in machine learning problems with (additive or multiplicative) feature noise [LW11, Hwa86, CRSC06]. Thus, we believe that characterizing $(L_{0},L_{1})$ -smooth functions under (Affine-var) is an important step in extending the theory of non-convex stochastic optimization beyond the standard $L_{0}$ -smooth setting.

A major challenge in the analysis of adaptive stochastic gradient descent is the correlation between the stochastic gradients and the step-size. Here, we develop a technique to simplify this challenge. Our key innovation is a recursively-defined stopping time which satisfies two crucial properties: (i) before the stopping time is reached, the step sizes behave roughly independently of the gradients, and (ii) on average, the stopping time is at least a constant fraction of the time horizon. As a consequence, instead of analyzing over the entire time horizon, we conduct the analysis over this sub-interval over which we exploit this convenient almost-independent property. This tool allows us to prove the first $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ rate of convergence for $(L_{0},L_{1})$ -smooth functions beyond the (Bounded-supp) setting. Our main contributions are three-fold:

(a) Convergence for $(L_{0},L_{1})$ -smoothness when $\sigma_{1}<1$ . We show in Section 4 that AdaGrad-Norm converges at a rate $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ when the stochastic gradient oracle satisfies (Affine-var) with $\sigma_{0}\geq 0$ and $\sigma_{1}\in[0,1)$ . This is the first convergence rate for any algorithm even under (Bounded-var) (i.e., $\sigma_{1}=0$ ) for general $(L_{0},L_{1})$ -smooth optimization. Note that the scaling of this bound with $T$ matches (up to poly-logarithmic factors) the best-known rate for $L_{0}$ -smooth functions – with a minor caveat that $\sigma_{1}<1$ is not needed in the $L_{0}$ -smooth setting. Also, we show that the rate improves to $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}})$ in the “small variance” regime when $\sigma_{0},\sigma_{1}\to 0$ even without tuning the step-size.

(b) Convergence for all $\sigma_{1}$ . We establish a sufficient condition under which AdaGrad-Norm converges at a rate $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ when $\sigma_{1}\geq 1$ , see Section 5. This condition allows us to analyze a broad subset of $(L_{0},L_{1})$ -smooth functions that includes all $L_{0}$ -smooth functions as well as fixed-degree polynomials without any restrictions on $\sigma_{1}$ . This simultaneously generalizes the result and simplifies a key proof technique of [FTCMSW22] for $L_{0}$ -smooth functions.

(c) Negative results for known algorithms. We prove a set of negative results in Section 6 for most algorithms analyzed under $(L_{0},L_{1})$ -smoothness and (Bounded-supp). We construct an oracle for Clipped and Normalized SGD [ZHSJ20, ZJFW20] and Sign SGD with Momentum [CLOZZ22] that leads to failure with constant probability in a wide parameter regime. We also prove that AdaGrad-Norm can diverge with constant probability if the step-size is not carefully tuned in the “large variance” regime for $(L_{0},L_{1})$ -smooth functions. By contrast, no parameter tuning is needed in the $L_{0}$ -smooth setting in this noise regime.

Related Works

Stochastic gradient descent. (SGD) has been well-studied for many decades [RM51]. [PT73] proved almost-sure convergence to a first-order stationary point of (SGD) for non-convex and $L_{0}$ -smooth functions with $F(\mathbf{w})\geq F^{*}$ with stochastic gradient oracle satisfying (a slightly weaker condition than) (Affine-var). [BT00] extended the result to a setting where $F(\mathbf{w})$ does not have a uniform lower-bound. [GL13] proved that (SGD) with step-size $\eta_{t}=\eta=\min\left\{1/L_{0},\sqrt{\nicefrac{{2(F(\mathbf{w}_{1})-F^{*})}}{{(L_{0}\sigma_{0}^{2}T)}}}\right\}$ achieves a convergence rate to a first-order stationary point of $\mathcal{O}\left(\sqrt{\nicefrac{{L_{0}\sigma_{0}^{2}(F(\mathbf{w}_{1})-F^{*})}}{{T}}}\right)$ , assuming $L_{0}$ -smoothness and (Bounded-var). [DS20] proved that this is the optimal rate for (SGD) without further assumptions. Recently, [ACDFSW22] proved that the convergence rate of [GL13] is optimal among all first-order methods, not just SGD.

$(L_{0},L_{1})$ -smoothness in the (Bounded-supp) regime. Recent work by [ZHSJ20] argued that the $L_{0}$ -smoothness assumption is not realistic for many practical machine learning tasks, e.g., large-scale natural language processing using transformer architectures. Instead, they demonstrated that $(L_{0},L_{1})$ -smooth functions (Generalized-smooth) better capture the loss landscape, and proved that the gradient clipping algorithm converges at a rate $\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}})$ in the (Bounded-supp) regime. [ZJFW20] later proved convergences for a generalized class of gradient clipping algorithms. They used a slightly weaker definition of $(L_{0},L_{1})$ -smoothness, which we use in Assumption 2. Very recently, [CLOZZ22] considered a “coordinate-wise” generalization of $(L_{0},L_{1})$ -smoothness, and proved that a “generalized SignSGD” algorithm converges at a rate $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ . By contrast, they proved that gradient descent with fixed step-sizes must scale linearly in $ML_{1}$ , where $M=\sup\left\{\left\lVert\nabla F(\mathbf{w})\right\rVert:F(\mathbf{w})\leq F(\mathbf{w}_{1})\right\}$ is the largest gradient in the sublevel set $F(\mathbf{w})\leq F(\mathbf{w}_{1})$ . Interestingly, this line of work establishes that adaptive step-size schedules can avoid this dependence on $M$ .

Problem Setting

We are interested in finding a first-order stationary point of a non-convex function, given access to a stochastic gradient oracle, using (SGD). For compactness, let $\bm{g}_{t}:=\bm{g}(\mathbf{w}_{t})$ . Our objective function $F(\mathbf{w})$ satisfies the following:

We note that $(L_{0},L_{1})$ -smoothness was originally defined in [ZHSJ20] as a bound on the Hessian of $F(\cdot)$ , as in (Generalized-smooth). Following [ZJFW20, Remark 2.3], we choose to adopt the alternative condition in Assumption 2 for two reasons. First, Assumption 2 is strictly weaker than $L_{0}$ -smoothness, since $(L_{0},0)$ -smoothness implies the gradients are $L_{0}$ -Lipschitz. Second, whenever the objective is twice-differentiable, Assumption 2 implies (Generalized-smooth) (up to constant factors in the definitions of $L_{0}$ and $L_{1}$ ):

A function satisfying $(L_{0},L_{1})$ -smoothness as per (Generalized-smooth) is also $(2L_{0},(e-1)L_{1})$ -smooth as per Assumption 2. If $F(\cdot)$ is twice continuously differentiable and $(L_{0},L_{1})$ -smooth as per Assumption 2, then it is also $(L_{0},L_{1})$ -smooth as per (Generalized-smooth).

Let $\mathcal{F}_{t}$ be the sigma-algebra generated by the interaction between the algorithm and stochastic gradient oracle for $t$ rounds, i.e., $\mathcal{F}_{t}:=\sigma\left\{\mathbf{w}_{1},\bm{g}_{1},\ldots,\mathbf{w}_{t},\bm{g}_{t},\mathbf{w}_{t+1}\right\}$ . We impose the following conditions on the stochastic gradients:

Assumptions 3 and 4 imply the following bound on the stochastic gradients in terms of the true gradient:

We are interested in studying algorithms which require as little hyper-parameter tuning as possible and, simultaneously, can handle potentially unbounded smoothness constant. To achieve this, we analyze AdaGrad-Norm [SM10], a step-size sequence $\eta_{t}$ for (SGD) which, at each time $t$ , depends on the current and past stochastic gradients $\left\{\bm{g}_{s}\right\}_{s\in[t]}$ :

Our main results, Sections 4 and 4, both establish $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ convergence rates for (AG-Norm) in the $(L_{0},L_{1})$ -smooth regime under (Affine-var). Section 4 holds for any $(L_{0},L_{1})$ -smooth function under a mild restriction that $\sigma_{1}<1$ . It is easy to extend this result for $\sigma_{1}\geq 1$ by computing mini-batch gradients with a batch size $B\approx\sigma_{1}^{2}$ , refer Section A.3 for a proof. Despite the restriction of Section 4 to $\sigma_{1}<1$ , we emphasize that, prior to our work, no proof of convergence even for the (Bounded-var) setting (i.e., $\sigma_{1}=0$ ) was known for a general class of $(L_{0},L_{1})$ -smooth functions. Besides, Section 4 holds for all $\sigma_{1}$ and a subclass of $(L_{0},L_{1})$ -smooth functions, i.e., excluding functions like $\exp(L_{1}x)$ .

Fix any constants $\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1)$ such that $\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1$ . Consider (AG-Norm) with any parameters $\eta\leq\nicefrac{{2\varepsilon^{\prime}}}{{5L_{1}}}$ and $b_{0}^{2}>0$ , running on an objective function satisfying Assumption 2, and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4. Assuming that $\sigma_{1}\leq\left(1-(\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime})\right)$ , then for any $T\geq 1$ and $\delta^{\prime}\in(0,1)$ , with probability at least $1-\delta^{\prime}$ , the iterates of (AG-Norm) satisfy

To extend our convergence proofs beyond $\sigma_{1}<1$ , we consider a subclass of $(L_{0},L_{1})$ -smooth functions which satisfy the following additional assumption:

Notice that, whereas Assumption 2 is a local constraint on the objective, Section 4 enforces a global polynomial growth constraint – thus ruling out such $(L_{0},L_{1})$ -smooth functions as exponentials, while capturing a significantly broader class of functions than $L_{0}$ -smoothness. We refer the interested reader to Section C.1 for some properties of this class of functions. Using this definition, we are able to prove the following:

Fix any constants $\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1)$ such that $\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1$ . Consider (AG-Norm) with any parameters $\eta\leq\nicefrac{{2\varepsilon^{\prime}}}{{L_{1}(4+\sigma_{1}^{2})}}$ and $b_{0}^{2}>0$ , running on an objective function satisfying Assumption 2 and Section 4 for some constants $k\geq 2,c_{k}\geq 1,c_{k}^{\prime}>0$ , and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4 for any $\sigma_{0},\sigma_{1}\geq 0$ . Then, for any $T\geq 1$ and $\delta^{\prime}\in(0,1)$ , with probability at least $1-\delta^{\prime}-\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}})$ , the iterates of (AG-Norm) satisfy

There are several notable takeaways from the above results.

Noise adaptivity. Both Sections 4 and 4 provide “noise-adaptive” convergence rates, in a sense that as $\sigma_{0},\sigma_{1}\to 0$ , the convergence rates automatically improve from $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ to $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{T}})$ without any additional hyperparameter tuning.

Less hyperparameter tuning. These rates hold without tuning the parameters $b_{0}$ or $\eta$ with respect to $\sigma_{0}$ or $L_{0}$ , unlike all prior algorithms for $(L_{0},L_{1})$ -smoothness that we are aware of [ZHSJ20, ZJFW20, CLOZZ22]This feature, however, manifests into a worse dependence on $L_{1}$ unlike [ZJFW20, CLOZZ22]..

Generalization of prior work. We remark that Section 4 strictly generalizes the result of [FTCMSW22] beyond the uniform $L_{0}$ -smooth setting. Further, our stopped analysis simplifies their “recursive improvement” technique [FTCMSW22, Lemma 13].

Key technical ideas

As discussed earlier, the main technical tool we use to obtain our convergence rates in Sections 4 and 4 is a recursively-defined stopping time. Before we are ready to define this time and discuss its utility, we first give a brief overview of the main initial steps of our analysis.

This inequality is no longer true for $(L_{0},L_{1})$ -smooth functions. Indeed, it is clearly not satisfied for all $\mathbf{w},\mathbf{w}^{\prime}$ on the $(0,L_{1})$ -smooth function $\exp(L_{1}x)$ . However, [ZHSJ20, ZJFW20] note that a similar variant holds “locally” for $\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{L_{1}}}$ (see Section A.2). Using this variant, we obtain the following inequality, which is our first tool for studying the convergence of (AG-Norm).

Fix any $\varepsilon,\varepsilon^{\prime}\in(0,1)$ . Suppose that $\eta\leq\frac{2\varepsilon^{\prime}}{L_{1}(4+\sigma_{1}^{2})}$ . Then, for any $t$ ,

Intuitively, the “good” times are those times when Section 5 is non-vacuous. Using Section 5, we sum the expression Section 5 until any stopping time $\tau$ to obtain the following more useful form.

We leverage the fact that Section 5 holds for any stopping time $\tau\in[2,T+1]$ as follows. Suppose there were some stopping time $\tau\in[2,T+1]$ such that:

Notice that $\tau_{1}(\delta)=1$ , $S_{1}(\delta)=0$ , $X_{1}(\delta)=1$ , and $\tau_{2}(\delta)=2$ deterministically. Further, one can show that $S_{t}(\delta),X_{t}(\delta)$ , and $\tau_{t+1}(\delta)$ are $\mathcal{F}_{t-1}$ -measurable for every $t\geq 1$ . Intuitively, $\tau_{T+1}(\delta)$ is the first time that the sum of stochastic gradient norms is significantly larger than its expectation, where the expectation is crucially over the random summation range (refer to Section B.2 for a further discussion). The following result shows the utility of this recursive construction:

For any $T\geq 1$ and $\delta\in(0,1]$ , let $\tau_{T+1}(\delta)$ be the stopping time from Section 5.1. Then, we have the following:

$\tau_{T+1}(\delta)$ is a stopping time with respect to $(\mathcal{F}_{s-1})_{s\geq 1}$ , i.e., $\forall s\geq 1$ , $\left\{s<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{s-1}$ .

Notice that an immediate consequence of Section 5.1 is that:

General $(L_{0},L_{1})$ -smooth functions clearly violate this inequality. Indeed, $\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert$ can potentially be a multiplicative factor of $\exp(\eta L_{1}\left|t-t^{\prime}\right|)$ times larger than $\left\lVert\nabla F(\mathbf{w}_{t^{\prime}})\right\rVert$ (for instance, when the $(0,L_{1})$ -smooth objective is $\exp(L_{1}x)$ . Thus, even if we could guarantee deterministically that only the first $\mathcal{O}(\log(T))$ time-steps are “bad”, the objective function (and also the norm of the gradient) could grow by polynomial factor in $T$ during this interval! In fact, this is exactly the intuition behind our negative result for (AG-Norm) in the “large $\sigma_{1}$ ” regime (see Section D.1).

Since the numerator and denominator both depend on the same summation, we can apply essentially the same arguments from the $\sigma_{1}<1$ case to obtain our convergence rate for $\sigma_{1}\geq 1$ in Section 4.

Given our positive results for (AG-Norm) from the previous sections, we now turn our focus to algorithms which have been analyzed in prior works on $(L_{0},L_{1})$ -smooth optimization. Some of the first-studied algorithms for $(L_{0},L_{1})$ -smooth optimization take the following forms: for parameters $\eta>0$ and $\gamma\geq 0$ :

These closely-related updates are referred to as Normalized SGD and Clipped SGD respectively. One motivation for considering these specific updates, at least in the noiseless setting where $\bm{g}_{t}=\nabla F(\mathbf{w}_{t})$ , comes through a comparison with the natural SGD step-size for $L_{0}$ -smooth non-convex optimization. Indeed, [GL13] show that a constant step-size of $\eta_{t}=\nicefrac{{1}}{{L_{0}}}$ yields a $\nicefrac{{1}}{{T}}$ rate of convergence to a first-order stationary point. Further, a simple extension of this result (see, e.g., [BCN18] for a proof) is that, under $L_{0}$ -smoothness and Assumption 4 with $\sigma_{0}=0$ and $\sigma_{1}\geq 0$ , the step size $\eta_{t}=\nicefrac{{1}}{{L_{0}(1+\sigma_{1}^{2})}}$ still achieves the $\nicefrac{{1}}{{T}}$ convergence rate. Thus, by analogy, in the $(L_{0},L_{1})$ -smooth setting, $\eta_{t}=\nicefrac{{1}}{{(L_{0}+L_{1}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert)}}$ (and, in the multiplicative noise regime, $\eta_{t}=\nicefrac{{1}}{{(1+\sigma_{1}^{2})(L_{0}+L_{1}\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert)}}$ ) is a natural candidate step size.

A number of works, including [ZHSJ20, ZJFW20, CLOZZ22], have proved that (variants of) these algorithms converge whenever the noise of the stochastic gradient satisfies (Bounded-supp). It turns out, however, that these algorithms can diverge under the noise model considered in this paper, (Affine-var). To see this, it is useful to consider a specific stochastic gradient oracle which satisfies Assumptions 3 and 4:

We can then take the output of the oracle to be $\bm{g}(\mathbf{w}):=\xi_{\textrm{add}}(\mathbf{w})+\xi_{\textrm{mult}}(\mathbf{w})\nabla F(\mathbf{w})$ . Then, this construction satisfies Assumptions 3 and 4 with the specified $\sigma_{0}$ and $\sigma_{1}$ .

Consider the above oracle with $\sigma_{0}=0$ and $\sigma_{1}\gg 1+\varepsilon$ . This oracle outputs stochastic gradients with the same sign as the true gradient for only roughly a $\nicefrac{{1}}{{\sigma_{1}^{2}}}$ fraction of the times it is queried. The majority of stochastic gradients thus have the opposite sign of the true gradient! This turns out to be quite problematic for algorithms of the form (5). Indeed, consider the behavior of (5) when $\left\lVert\bm{g}_{t}\right\rVert\geq\gamma$ . In this regime, both algorithms discard the magnitude of the stochastic gradients $\bm{g}_{t}$ , and use only their sign to perform updates. Since the stochastic gradients $\bm{g}_{t}$ of Section 6 have the opposite sign of $\nabla F(\mathbf{w}_{t})$ for almost all time steps $t$ , one can prove that algorithms of the form (5) do not converge to a stationary point with constant probability under Assumptions 3 and 4, even when the objective function is a $1$ -dimensional quadratic function (i.e., both smooth and strongly-convex). We give a proof of (a slightly more general version of) this fact in Section D.2.

Acknowledgements

This research is supported in part by NSF Grants 2019844 and 2112471, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References

Overview of Appendix

Appendix A Auxiliary Lemmas

Let $\{a_{i}\}_{i=1}^{\infty}$ be a sequence of non-negative integers such that $a_{1}>0$ . Then, for any $T$ ,

We proceed via induction. The base case of $T=1$ holds trivially, with equality. Assuming the hypothesis holds at some time $T\geq 1$ , we have that

Now, using the fact that $\exp(x)\leq\nicefrac{{1}}{{(1-x)}}$ for any $x<1$ , we have that

Combining these two bounds, we conclude that

so the claim holds also for $T+1$ . Thus, the claim holds for all $T$ by induction. ∎

The (AG-Norm) step-sizes satisfy, for any (possibly random) times $1\leq t_{0}\leq t_{1}$ and $s\geq 0$ ,

We first note that, by definition of $b_{t}$ :

The iterates $\left\{\mathbf{w}_{s}\right\}_{s=1}^{\infty}$ generated by (AG-Norm) satisfy, for every $t\geq 1$ ,

Moreover, for any $k\geq 2$ and $t>t^{\prime}$ ,

which establishes the first inequality. To obtain the second, we apply the first, together with Jensen’s inequality (noting that $\left\lVert\cdot\right\rVert^{k-1}$ is convex), to obtain:

For any function $F$ satisfying Assumption 2, the sequence of iterates $\{\mathbf{w}_{s}\}_{s=1}^{\infty}$ generated by (AG-Norm) with $\eta\leq\nicefrac{{1}}{{L_{1}}}$ satisfy

Thus, by choosing $\eta\leq\nicefrac{{1}}{{L_{1}}}$ , the claim is an immediate consequence of Section A.1. ∎

For any $(L_{0},L_{1})$ -smooth function $F(\cdot)$ , assuming that $\eta\leq\nicefrac{{1}}{{L_{1}}}$ , the gradient $\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2}$ evaluated at the iterate of (AG-Norm) at time $t$ satisfies:

Since $\eta\leq\nicefrac{{1}}{{L_{1}}}$ , $\left\lVert\mathbf{w}_{t+1}-\mathbf{w}_{t}\right\rVert\leq\nicefrac{{1}}{{L_{1}}}$ by Section A.1. Thus, we may apply Assumption 2 to obtain

The proof of the first statement is from [ZJFW20, Corollary A.4]. The proof of the second statement closely follows the analogous proof for $L_{0}$ -smooth functions from [Nes03, Lemma 1.2.2]. We give a proof of this claim for completeness.

Hence, by the limit inequality theorem, we have that, for any $0<\left\lVert\mathbf{s}\right\rVert\leq\nicefrac{{1}}{{L_{1}}}$ ,

In particular, by taking the supremum over all such $\mathbf{s}$ , we conclude that

as claimed, where the first equality follows by observing that $\nabla^{2}F(\mathbf{x})\nabla^{2}F(\mathbf{x})^{\top}$ and $\nabla^{2}F(\mathbf{x})^{\top}\nabla^{2}F(\mathbf{x})$ have the same non-zero eigenvalues (since all entries of $\nabla^{2}F(\mathbf{x})$ are real, and by appealing to the singular value decomposition), which implies that $\nabla^{2}F(x)$ and $\nabla^{2}F(x)^{\top}$ have the same spectral norm. ∎

Suppose that the stochastic gradient oracle satisfies Assumptions 3 and 4 for some $\sigma_{0}\geq 0$ and $\sigma_{1}\geq 1$ . Then, assuming this oracle returns independent stochastic gradients each time $\bm{g}(\mathbf{w})$ is sampled, one can construct, for any $\varepsilon\in(0,1)$ , a new stochastic gradient oracle from this one through mini-batching which satisfies Assumptions 3 and 4 with $\widetilde{\sigma_{0}}\leq\sigma_{0}$ and $\widetilde{\sigma_{1}}=1-\varepsilon$ , and where each call to the new gradient requires only $B=\left\lceil\nicefrac{{\sigma_{1}^{2}}}{{(1-\varepsilon)^{2}}}\right\rceil$ calls to the old one.

where the first inequality follows by Assumption 3 and since $\bm{g}_{j}(\mathbf{w})$ and $\bm{g}_{j^{\prime}}(\mathbf{w})$ are independent. The second inequality follows by Assumption 4 and our choice of $B\geq\nicefrac{{\sigma_{1}^{2}}}{{(1-\varepsilon)^{2}}}$ . Thus, Assumption 4 is satisfied with $\widetilde{\sigma_{0}}^{2}=\nicefrac{{(1-\varepsilon)^{2}\sigma_{0}^{2}}}{{\sigma_{1}^{2}}}\leq\sigma_{0}^{2}$ and $\widetilde{\sigma_{1}}^{2}=(1-\varepsilon)^{2}$ . ∎

An immediate consequence of Section A.2 and [FTCMSW22, Lemma 5] is that, as long as $\eta\leq\nicefrac{{1}}{{L_{1}}}$ ,

We provide a proof of this inequality in Section B.5A careful reader may notice that the inequality in Section B.5 is actually slightly smaller than the one from [FTCMSW22, Lemma 5], since the dependence on constants is strictly better..

Now, let’s focus on bounding the final term above. We start by rewriting it as follows: Let us take $\mathcal{E}_{\sigma_{0}}=\{\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert>\sigma_{0}\}$ . Then, we can decompose the final term (trivially) as

Now, whenever $\mathcal{E}_{\sigma_{0}}$ is false, then this expression is easy to bound, since

Notice that the first term above can be absorbed into the first term in (6), assuming $\eta$ is sufficiently small. For the remaining term, we begin by noticing that

Therefore, collecting these results and choosing $\eta<\nicefrac{{2\varepsilon^{\prime}}}{{L_{1}(4+\sigma_{1}^{2})}}$ , we have that

where $\widetilde{c}_{0}=c_{0}+\eta^{2}L_{1}\sigma_{0}$ . ∎

Fix any $\varepsilon^{\prime\prime}\in(0,1)$ , and let $2\leq\tau\leq T+1$ be any stopping time with respect to $(\mathcal{F}_{s-1})_{s\geq 1}$ . Then, we have that

Let us define, for a parameter $\lambda$ to be determined,

More specifically, for any $t$ , we can decompose

In the other case, for any fixed $t\in[T]$ , we have that

and in the third, we used the fact that $\|\widetilde{\nabla}_{t}\|^{2}=\sigma_{0}^{2}+\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2}$ . Then, noting that

Thus, if we choose $\lambda=\frac{\varepsilon^{\prime\prime}\eta^{2}}{(2+\sigma_{1}^{2})\widetilde{c}_{0}(\tau-1)},$ then we obtain

Now, summing over $t\in[\tau-1]$ , and using the fact that $\left\{t<\tau\right\}\in\mathcal{F}_{t-1}$ by assumption on $\tau$ , we have:

Focusing on the last term in the above inequality, and recalling that (deterministically) $\tau>1$ by assumption, we may apply the above bounds together with Jensen’s inequality to obtain:

Combining these bounds yields the claimed inequality. ∎

In the following, we restate Section 5 with an equivalent characterization that is sometimes more convenient for our analysis.

A time $t\in[T]$ is “good” if, for fixed parameters $\varepsilon,\varepsilon^{\prime},\varepsilon^{\prime\prime},\varepsilon^{\prime\prime\prime}\in(0,1)$ , satisfying $\varepsilon+\varepsilon^{\prime}+\varepsilon^{\prime\prime}+\varepsilon^{\prime\prime\prime}<1$

Now, we may use the first and second inequalities in Section B.1 to bound the sum over “good” and “bad” times, respectively, and, collecting terms, we obtain

B.2 Constructing the “nice” stopping time

Let us recall the definition of $\tau_{T+1}(\delta)$ , the “nice” stopping time: See 5.1

Here, we show that these random variables are well-defined, and enumerate the crucial properties that they satisfy.

For any $\delta\in(0,1]$ and $t\geq 1$ , let $\tau_{t}(\delta)$ , $S_{t}(\delta)$ , and $X_{t}(\delta)$ be recursively-defined random variables from Section 5.1. Then, we have that, for all $t\geq 1$ ,

$\tau_{t}(\delta)$ is $\mathcal{F}_{t-2}$ -measurable, and $S_{t}(\delta),X_{t}(\delta)$ are each $\mathcal{F}_{t-1}$ -measurable (where we take $\mathcal{F}_{0}=\mathcal{F}_{-1}$ to be the trivial $\sigma$ -algebra).

$\tau_{t}(\delta)$ is a stopping time with respect to $(\mathcal{F}_{s-1})_{s\geq 1}$ , i.e., for all $s\geq 0$ $\{s<\tau_{t}(\delta)\}\in\mathcal{F}_{s-1}$ .

For all $t\geq 1$ , $\tau_{t+1}(\delta)\geq\tau_{t}(\delta)$ , $S_{t+1}(\delta)\geq S_{t}(\delta)$ , and $X_{t+1}(\delta)\leq X_{t}(\delta)$ .

$\tau_{\tau_{t}(\delta)-1}(\delta)\overset{\text{a.s.}}{=}\tau_{t}(\delta)-1$

For every $s<\tau_{t}(\delta)$ , the following inequalities hold deterministically:

Before proving this result, let us briefly discuss an alternative construction to Section 5.1 which is (perhaps) more natural and easier to define, but does not satisfy a property we rely on to prove Section B.3:

Notice that (8) is true no matter the realization of $\tau_{t}(\delta)$ . Indeed, for any realization of $\tau_{t}(\delta)$ , the bound on the right-hand side still involves a random index inside the expectation. This is not the case with ( ‣ B.2) (there, the random index is outside of the expectation). This difference is crucial, and this special property of $S_{\tau_{t}(\delta)-1}$ is actually what makes the proof of Section B.3 possible.

For the second claim, it suffices to consider $0\leq s\leq t-2$ (since we just established that $\tau_{t}(\delta)$ is $\mathcal{F}_{t-2}$ -measurable, and $\mathcal{F}_{t-2}\subset\mathcal{F}_{t^{\prime}-2}$ for any $t^{\prime}\geq t$ ). Now, for any such $s$ , since $s<t$ , we have that

For the fourth claim, we have that, by definition of $S_{t}(\delta)$ and the tower rule of expectation,

Now, since $\left\{s<\tau_{t}(\delta)\right\}\in\mathcal{F}_{s-1}$ , and applying (1),

Summing the above expression over $s\in[t-1]$ , we conclude that

For the seventh claim, assuming $s<\tau_{t}(\delta)$ , we have that, by Section A.2,

Further, since $\tau_{\tau_{t}(\delta)-1}(\delta)=\tau_{t}(\delta)-1$ , and by definition of $S_{t}(\delta)$ , we have that

For the final claim, we note that $\tau_{t}(\delta)\leq t$ deterministically, by construction. Thus, we focus on the lower bound. Indeed, notice that, since $\tau_{t}(\delta)\in[t]$ ,

Therefore, by applying the union bound and Markov’s inequality, we conclude that

B.3 The key consequence of the nice stopping time construction

The following result is the most crucial place where the properties of Section 5.1 are utilized. It tells us that, as long as the sum of “bad” gradients is comparable to the sum of “good” ones, and as long as the descent inequality (Section 5) holds, then the sum of gradients scales (roughly) as $\mathcal{O}(\nicefrac{{b(T)^{2}}}{{\delta}}+b(T)\sqrt{\nicefrac{{T}}{{\delta}}})$ . One can compare this result to that of [FTCMSW22, Lemma 13 ], which obtained a similar bound in the simpler $L_{0}$ -smooth setting. Their argument utilized a technique they termed “recursive improvement,” which required recursively invoking gradually improving bounds in order to reach their desired conclusion after infinitely many calls. Moreover, their argument crucially relies on properties of $L_{0}$ -smoothness in order to obtain worst-case upper bounds on the sum of gradients, which are no longer true in our setting. Through our construction of the stopping time $\tau_{T+1}(\delta)$ , we are able to obtain a similar bound as in their setting, but with an (arguably) significantly simpler and more general proof which works even in the $(L_{0},L_{1})$ -smooth setting.

Then, we obtain the inequality given below:

Now, by Item 7 in Section B.2, since $t<\tau_{t}(\delta)$ for any $t\in\widetilde{S}(\tau_{T+1}(\delta))$ ,

then we may solve this inequality to conclude that

and $\widetilde{c}_{0}=\frac{\eta\sigma_{0}}{2\varepsilon}+\eta^{2}\frac{L_{0}+\sigma_{0}L_{1}}{2}$ .

Now, applying Hölder’s inequality to the above, we have:

where we used the following version of Hölder’s:

Recalling that $\left\{s<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{s-1}$ by Item 2 of Section B.2, we may apply Assumption 4 to obtain

where $\delta\in(0,1)$ is a parameter of our choosing. In particular, choosing (with foresight) $\delta=\nicefrac{{\delta^{\prime}}}{{4T}}$ for any $\delta^{\prime}\in(0,1)$ , the above can be rewritten as:

To obtain a convergence rate, we begin by noting, for any $\delta^{\prime}\in(0,1)$ , we can decompose

The first term is easy to bound via Markov’s inequality, since, choosing $\delta=\frac{\delta^{\prime}}{4T}\leq\frac{\delta^{\prime}}{2(T+1)}$ (since $T\geq 1$ ),

To bound the second term, we note that, whenever $|\widetilde{S}(\tau_{T+1}(\delta))|>\nicefrac{{T}}{{2}}$ , then

Hence, we have by Markov’s inequality and our previous bounds,

B.5 A deferred proof for establishing Section 5

Fix any $\varepsilon\in(0,1)$ . Suppose that $\eta\leq\nicefrac{{1}}{{L_{1}}}$ . Then, for any time $t$ , the iterates of (AG-Norm) satisfy

The proof proceeds using similar arguments as in [FTCMSW22, Lemma 5]. By Section A.2 and the definition of (AG-Norm), we know that

We begin by bounding the inner product term above as:

Combining the above arguments, and applying Hölder’s inequality, we have that

where the last step comes from $\|\widetilde{\nabla}_{t}\|\geq\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert$ . Collecting our bounds so far yields:

Focusing on the term depending on $\sigma_{0}$ , we have that for any $\varepsilon>0$ ,

In this section, we show that Section B.4 can be used to establish a $\widetilde{\mathcal{O}}(\nicefrac{{1}}{{\sqrt{T}}})$ convergence rate without the restriction of $\sigma_{1}<1$ . The key is to restrict our attention to $(L_{0},L_{1})$ -smooth functions which satisfy the following additional property:

The following result provides a characterization of these functions relative to $L_{0}$ -smooth functions and $(L_{0},L_{1})$ -smooth functions. In particular, it tells us that Section 4 is a richer function class than $(L_{0},L_{1})$ -smooth functions. However, not all $(L_{0},L_{1})$ -smooth functions satisfy Section 4.

Every $L_{0}$ -smooth function satisfies Section 4 with $k=2$ , $c_{k}=1$ , and $c_{k}^{\prime}=L_{0}$ .

Every $(L_{0},L_{1})$ -smooth function satisfies Section 4 locally (i.e., when $\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{L_{1}}}$ ) with $k=2,c_{k}=2$ and $c_{k}^{\prime}=L_{0}$ .

There is a $(0,L_{1})$ -smooth function which does not satisfy Section 4 for any fixed $k,c_{k},c_{k}^{\prime}$ .

For any $k\geq 2$ , $F(\mathbf{w})=\left\lVert\mathbf{w}-\mathbf{w}^{*}\right\rVert^{k}$ satisfies Section 4 with $k=k$ , $c_{k}=2^{k-2}$ , and $c_{k}^{\prime}=k2^{k-2}$ . Additionally, for any $L_{1}>0$ , $F(\mathbf{w})$ is $(\nicefrac{{2k(k-1)}}{{L_{1}^{k-2}}},(e-1)(k-1)L_{1})$ -smooth. However, this $F(\mathbf{w})$ is not $L_{0}$ -smooth when $k>2$ .

The second follows since, for any $(L_{0},L_{1})$ -smooth function, for every $\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq L_{1}$ ,

For the third claim, consider the function $F(w)=\exp(L_{1}w)$ . Since $F^{\prime\prime}(w)=L_{1}^{2}\exp(L_{1}w)=L_{1}F^{\prime}(w)$ . Suppose there were some $k,c_{k},c_{k}^{\prime}$ such that Section 4 is satisfied. Then, it must be the case that, for any $x>0$ :

where the inequality follows from the definition of Section 4, the first equality by L’Hôpital’s rule, and the second by rewriting the previous expression. Repeating this argument $k-1$ times, this implies that

a contradiction. Hence, $\exp(L_{1}x)$ cannot satisfy Section 4.

For the final claim, we see that $F(\mathbf{w})$ satisfies Section 4 with $c_{k}=2^{k-2}$ and $c_{k}^{\prime}=k2^{k-2}$ since, by Jensen’s inequality,

Further, $F(\mathbf{w})$ is also $(2k(k-1),(e-1)(k-1))$ -smooth, since simple calculations yield that

In particular, this implies that $\mathbf{w}-\mathbf{w}^{*}$ is an eigenvector with largest eigenvalue, so, for any $L_{1}>0$ ,

Therefore, by [ZJFW20, Corollary A.4], for any $\left\lVert\mathbf{w}-\mathbf{w}^{\prime}\right\rVert\leq\nicefrac{{1}}{{(k-1)L_{1}}}$ ,

The following result demonstrates the difference in worst-case gradient norm scaling that $(L_{0},L_{1})$ -smooth functions provide, versus the worst-case scaling of functions satisfying Section 4.

For any function satisfying Assumption 2, and any algorithm producing iterates $(\mathbf{w}_{s})_{s\geq 1}$ satisfying $\left\lVert\mathbf{w}_{s+1}-\mathbf{w}_{s}\right\rVert\leq\eta\leq\nicefrac{{1}}{{L_{1}}}$ for every $s\geq 1$ , the following inequality holds for every $t>t^{\prime}$ :

Moreover, this inequality is essentially unimprovable, in the sense that there exists a $(0,\mathcal{O}(L_{1}))$ -smooth function and $\eta\leq\nicefrac{{1}}{{L_{1}}}$ such that $\left\lVert\nabla F(\mathbf{w}_{T+1})\right\rVert=(1+\mathcal{O}(\eta L_{1}))^{T+1}\left\lVert\nabla F(\mathbf{w}_{1})\right\rVert$ for any $T\geq 1$ , and a $(\mathcal{O}(L_{0}),\mathcal{O}(L_{1}))$ -smooth function such that $\left\lVert\nabla F(\mathbf{w}_{1})\right\rVert=0$ and $\left\lVert\nabla F(\mathbf{w}_{T+1})\right\rVert=\mathcal{O}(\nicefrac{{L_{0}}}{{L_{1}}})((1+\mathcal{O}(\eta L_{1}))^{T}-1)$ . By contrast, any function satisfying Section 4 satisfies:

We begin by proving the first claim by by induction on $t-t^{\prime}$ . The base case of $t-t^{\prime}=1$ holds by definition, since

Now, supposing the claim holds for $t-t^{\prime}=1,\ldots,s$ , we have that:

where the first inequality follows by applying the induction hypothesis for $t-t^{\prime}=s$ , the second by applying the induction hypothesis for $t-t^{\prime}=1$ , and the final equality follows by rearranging the prior line. Thus, the inequality holds also at $t-t^{\prime}=s+1$ , and thus our claim holds by induction.

To see that this inequality is essentially unimprovable, let us consider first consider, for any $L_{1}>0$ , the function:

Since $F^{\prime\prime}(x)=L_{1}^{2}\exp(L_{1}x)=L_{1}F^{\prime}(x)$ , it follows from Section 3 that $F(\cdot)$ is $(0,(e-1)L_{1})$ -smooth. Notice that, if $x_{s+1}-x_{s}=\eta=\nicefrac{{1}}{{(e-1)L_{1}}}$ , then, taking $x_{1}=0$ and $t\geq 1$ ,

Further, for any $L_{0},L_{1}>0$ , consider the function:

Noting that $F^{\prime\prime}(x)\leq 2L_{0}+2L_{1}F^{\prime}(x)$ when $\left|x\right|\geq\nicefrac{{\sqrt{2}}}{{L_{1}}}$ , and $F^{\prime\prime}(x)\leq L_{1}(2+\exp(\sqrt{2}))+2L_{1}F^{\prime}(x)$ otherwise, it follows that $F$ is $(2(2+\exp(\sqrt{2}))L_{0},2(e-1)L_{1})$ -smooth (by Section 3). Therefore, whenever $\eta=\nicefrac{{1}}{{2(e-1)L_{1}}}$ ,

where the first equality follows by rearranging the definition, the second since $\eta L_{1}=\Theta(1)$ , the third since $2\frac{c}{1+c}(e-1)\eta L_{1}=\frac{2c(e-1)\eta L_{1}}{1+2c(e-1)\eta L_{1}}\leq\log(1+c2(e-1)\eta L_{1})\leq 2c(e-1)\eta L_{1}$ , and the fourth by rearranging.

The final inequality follows immediately from Sections A.1 and 4, which together imply that

where we use the fact that $c_{k}\geq 1$ and $b_{t-1}^{2}\geq b_{t^{\prime}-1}^{2}$ (since $t\geq t^{\prime}$ ) for the first inequality, the definition of Section 4 for the second, and the definition of $\widetilde{b}_{t}$ from Section 3 for the third. Now, either $\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert\geq 2\max\left\{c_{k}^{\prime}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert^{k-1},L_{0}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert\right\}$ , or not. In the first case, we note that

In the alternate case that $\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert<2\max\left\{c_{k}^{\prime}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert^{k-1},L_{0}\left\lVert\mathbf{w}_{t}-\mathbf{w}_{t^{\prime}}\right\rVert\right\}$ , we obtain:

To bound the second term, we apply Sections 4 and A.1, together with the above construction, to obtain:

Thus, by the Multinomial theorem, we have that

Combining this with the above, we have the following:

We claim that, for any $s\geq 1$ , the inner summation term above is bounded in expectation by:

To see this, first note that, by Section B.1, and since $\left\{t<\tau_{T+1}(\delta)\right\}\in\mathcal{F}_{t-1}$ by Section B.2, for any $t^{\prime}\geq 0$ ,

Now, by Items 4, 5 and 6 of Section B.2, we have that, almost surely,

Therefore, collecting these results, we conclude that, for any $t^{\prime}\geq 0$ ,

Now, the base case of $s=1$ for (C.2) follows immediately from (14) with $t^{\prime}=0$ . Let us now suppose that the claim (C.2) holds for some $s\geq 1$ . Then, to apply the induction hypothesis, we begin by decomposing:

Notice that the above expectation is a product of two terms: indicators depending of times $t_{1},\ldots,t_{s}$ , and those depending on $t_{s+1}>t_{s}$ . Therefore, since, by Sections B.2 and B.1,

we may apply the tower rule of expectations and the inequality from (14):

Therefore, summing the above expression over $1\leq t_{1}<\ldots<t_{s}\leq T$ and applying the induction hypothesis, we conclude that:

Now, finally noting that, for any $x\geq 1$ ,

C.3 Bounding the sum of “bad” gradients by the sum of “good” ones

We recall from Section B.4 that, in order to use this bound, we need to show that the sum of “bad” gradients can be upper-bounded (relatively) by the sum of “good” ones. It turns out, for functions satisfying Section 4, this is possible, as we now show.

Let $\tau\geq 1$ be any (possibly random) time, and consider any (possibly random) set $S(\tau)\subseteq[\tau-1]$ . Denote $S(\tau)^{c}=[\tau-1]\setminus S(\tau)$ . Then, assuming $F(\cdot)$ satisfies Section 4, the following is satisfied deterministically:

The proof of this result follows a similar argument as used in Section 5.2. The main idea here is to, for every $t\in S(\tau)^{c}$ in decreasing order, find the first available time $t^{\prime}\in S(\tau)$ which has not been associated with an earlier time from $S(\tau)^{c}$ . Then, using Section 4, we show that, as long as $t$ and $t^{\prime}$ are not too far apart, then $\left\lVert\nabla F(\mathbf{w}_{t})\right\rVert^{2}$ and $\left\lVert\nabla F(\mathbf{w}_{t^{\prime}})\right\rVert^{2}$ must also be close. For some times $t\in S(\tau)^{c}$ , there may not be such a $t^{\prime}\in S(\tau).$ However, because of the greedy construction, these times must be relatively small (roughly within the first $|S(\tau)^{c}|$ time steps). Thus, as long as $|S(\tau)^{c}|$ is not “too big” (in expectation), then we can still bound these remaining terms. We now make these arguments precise.

To begin, note that for every $t,t^{\prime}\geq 1$ , by Section 4,

We use this bound as follows: let us index the times in $S(\tau)^{c}$ , denoting $\widetilde{\tau}_{[i]}$ to be the $i$ th largest time in $S(\tau)^{c}$ , i.e.,

Indeed, this follows by first decomposing

Next, notice that, for every $i\geq i^{*}$ ,

where the first inequality is by definition of $\widetilde{\tau}_{[i]}$ . To see the second inequality, we follow a similar argument as before. Indeed, observe that

where in the second inequality, we applied Section 5.2. Thus, we obtain the claimed result. ∎

and $\widetilde{c}_{0}=\frac{\eta\sigma_{0}}{2\varepsilon}+\eta^{2}\frac{L_{0}+\sigma_{0}L_{1}}{2}$ (where we use the notation $(x)_{+}:=\max\left\{0,x\right\}$ ).

Thus, the conditions to apply Section B.4 are satisfied, and we obtain the convergence rate. ∎

In this section, we consider the convergence behavior of several natural candidate algorithms which have been studied in the literature on $(L_{0},L_{1})$ -smooth optimization. These algorithms take the form $\mathbf{w}_{t+1}=\mathbf{w}_{t}-\bm{u}_{t}$ , where $\bm{u}_{t}$ takes a number of different forms, including: in Normalized SGD:

and Sign-SGD with Momentum (operations performed element-wise):

[ZHSJ20, ZJFW20, CLOZZ22] prove $\mathcal{O}(\nicefrac{{1}}{{\sqrt{T}}})$ convergence of these algorithms in the setting of (Bounded-supp). In this section, we show that these step-size choices for $(L_{0},L_{1})$ -smooth optimization fail under (Affine-var), despite working in the noiseless and (Bounded-supp) settings. Our negative results rely on the following stochastic gradient oracle construction:

Fix any $\varepsilon,\sigma_{1}\geq 0$ . We begin by establishing that Assumption 3 holds for our construction of $\bm{g}(\mathbf{w})$ . Begin by denoting $\delta=\nicefrac{{(1+\varepsilon)^{2}}}{{((1+\varepsilon)^{2}+\sigma_{1}^{2})}}$ . Under this notation, we have that

which establishes Assumption 4 for any $\sigma_{0},\sigma_{1}\geq 0$ . ∎

We establish all of the following negative results using the stochastic gradient oracle described in Section 6. Before stating our results, let us briefly discuss some intuition behind why one should expect (NormSGD), (ClippedSGD), and (SignSGD-M) to fail under Section 6. Consider the setting where $\sigma_{1}\gg 1+\varepsilon$ . Then, notice that the stochastic gradient $\bm{g}_{t}$ only has the same sign as $\nabla F(\mathbf{w}_{t})$ with roughly $\nicefrac{{1}}{{\sigma_{1}^{2}}}$ probability. Otherwise, $\bm{g}_{t}$ has the opposite sign as $\nabla F(\mathbf{w}_{t})$ . Now, for an algorithm which incorporates the magnitude of the stochastic gradients together with the signs, the oracle in Section 6 may not be so problematic – indeed, even though the updates with correct sign are somewhat “rare”, they are also of significantly larger magnitude compared to the updates with proper sign. However, notice that (NormSGD), (ClippedSGD), and (SignSGD-M) are (effectively) unit step-length algorithms (at least, in the setting where $\left\lVert\bm{g}_{t}\right\rVert\geq\gamma$ ). Thus, in many parameter regimes, all of these algorithms effectively disregard the magnitude of the stochastic gradients and only use their signs. This results in a biased random walk which never finds an iterate better than the initial one with constant probability. We formalize this intuition in the following:

We note that the statement Section D.1 follows from Section D.2 by choosing the parameter $\varepsilon=\nicefrac{{\sigma_{1}}}{{2\sqrt{2}}}$ . The main takeaway here is that, for a reasonably wide range of parameters, (NormSGD), (ClippedSGD), and (SignSGD-M) can diverge in the affine variance setting, even for very simple smooth and strongly convex problems (in fact, even on a $1$ -dimensional quadratic function). In particular, this says that, whenever (NormSGD) is run with $\gamma=0$ (or (SignSGD-M) with $\beta=0$ ), then there is no parameter tuning with respect to $\eta$ such that $\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2}$ converges!

Fix any $L_{1}>0$ , time horizon $T>1$ , and affine variance parameter

D.2 Full statement and proof of negative results for (SignSGD-M), (NormSGD), and (ClippedSGD)

Here, we give the complete negative result for (SignSGD-M), (NormSGD), and (ClippedSGD), and formalize the intuition given there.

Let us choose, for arbitrary $L_{0}>0$ and $\Delta>0$ , the $(L_{0},0)$ -smooth objective $F(x)=\nicefrac{{L_{0}}}{{2}}\ x^{2}$ , and assume without loss of generality that $x_{1}=-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}}$ (indeed, if this is not the case, then we can always translate the function $F(x)$ to be $F(x)=\nicefrac{{L_{0}}}{{2}}(x-x_{1}-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}})^{2}$ , and our arguments remain unchanged). Notice that $F(x_{1})-F^{*}=F(x_{1})=\Delta$ .

Consider, for any $\varepsilon>0$ and $\sigma_{1}^{2}>(1+\varepsilon)^{2}$ , the stochastic gradient oracle from Section 6, i.e.,

Let $\tau^{*}$ be the first time when an iterate becomes larger than the original one, i.e.,

Notice that this implies that, for any $1\leq t<\tau^{*}$ :

This guarantees that, before $\tau^{*}$ , (i) the iterates are always to the left of the minimizer, (ii) that the algorithm (ClippedSGD) never “clips” (i.e., $u_{t}=\nicefrac{{\eta g_{t}}}{{\left|g_{t}\right|}}$ ), and (iii), $\min_{t<\tau^{*}}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2}$ (i.e., the algorithm never achieves any nontrivial target minimization criterion). Additionally, it must be the case that:

since $x_{\tau^{*}-1}<x_{1}$ and $x_{\tau^{*}}=x_{\tau^{*}-1}-u_{\tau^{*}-1}\geq x_{1}$ .

Now, let us distinguish the updates of (SignSGD-M), (NormSGD), and (ClippedSGD) as $(x_{t}(1),u_{t}(1))$ , $(x_{t}(2),u_{t}(2))$ , and $(x_{t}(3),u_{t}(3))$ , respectively. Now, instead of reasoning about the dynamics of each of these algorithms individually, we instead reason about an algorithm with simpler dynamics, and draw conclusions about each of these processes via a stochastic dominance argument.

To do this, we utilize the coupling of these algorithms defined in Section D.2 – namely, we let $x_{1}(i)=x_{1}=\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}}$ (as discussed above), and $g(x_{t}(i))=\xi_{\textrm{mult},t}\nabla F(x_{t}(i))$ for every $i$ , where $\xi_{\textrm{mult},t}$ is $-\varepsilon$ with probability $1-\delta$ , and $1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}}$ otherwise. That is, each process starts from the same initial iterate, and receives the same multiplicative noise on the stochastic gradient at time $t$ .

Similarly, let us define the “simpler” comparison process as:

and take $x_{1}(4)=x_{1}$ and $x_{t+1}(4)=x_{t}(4)-u_{t}(4)$ . Now, denote $\tau^{*}(i)$ as the stopping time from (17) corresponding to the process $i\in$ . Then, by Section D.2, we have that, under our coupling of these algorithms, $\tau^{*}(4)\leq\min_{i\in}\tau^{*}(i)$ , which implies that, for each algorithm $i\in$ :

By Section D.2, we have that, for any $t_{0}\geq 0$ :

Thus, using the geometric summation formula, we can bound the above summations as:

Now, let us focus on bounding the two terms in the above expression. To do this, we first observe that, for any $\mu,x>0$ and $i\geq 0$ ,

In particular, since $\exp(-x)<\nicefrac{{1}}{{(1+x)}}$ for any $x>0$ , and thus also $\nicefrac{{\exp(-x)}}{{1-\exp(-x)}}<\nicefrac{{1}}{{x}}$ , since $\left\lfloor x\right\rfloor>x-1$ , we have that the above inequality is satisfied whenever:

and, combining our results, we conclude that, for any algorithm $i\in$ :

Consider the process $\left\{x_{t}\right\}_{t\geq 1}$ from (SignSGD-M) as defined in Section D.2, where $x_{1}<0$ , $F(x):=\nicefrac{{L_{0}}}{{2}}\ x^{2}$ for some $L_{0}>0$ , and $g_{t}$ are the stochastic gradients output by the oracle from Section 6. Suppose that the parameter $\beta$ of (SignSGD-M) satisfies:

Let $\tau^{*}=\min\left\{t>1:x_{1}\leq x_{t}\right\}$ . Then, if $t<\tau^{*}$ and $g_{t}=-\varepsilon\nabla F(x_{t})$ , then $u_{t}=\eta$ .

Recall that, by construction of the stochastic gradient oracle from Section 6, and since $\nabla F(x)=L_{0}x$ :

We wish to show that the process from (SignSGD-M) has the property that, whenever $g_{t}=-\varepsilon L_{0}x_{t}$ and $x_{s}<0$ for every $s\in[t]$ , then $u_{t}=\eta$ . We consider any initialization $x_{1}<0$ , and denote $\tau^{*}$ to be the first time when an iterate becomes non-negative, i.e.,

Notice that, since $u_{t}\in\left\{\pm\eta\right\}$ by definition of (SignSGD-M), and by construction of the stochastic gradient oracle:

Thus, it suffices to prove by induction that, for any $i\geq 0$ , either $\tau_{i}\geq\tau^{*}$ , or $u_{\tau_{i}}=\eta$ , as long as

For the base case of $i=0$ , we may assume without loss of generality that $u_{\tau_{0}}=u_{0}=\eta$ , since $m_{0}=0$ and the dynamics of the update rule do not depend on $u_{0}$ (i.e., the dynamics begin at time $t=1$ and $x_{1}$ is the starting point of the process). Thus, the base case is true by construction.

Now, suppose the claim holds for some $i\geq 1$ . Either $\tau_{i+1}\geq\tau^{*}$ or not. In the former case, the claim follows trivially, so let us assume that $\tau_{i+1}<\tau^{*}$ . Since $\tau_{i}<\tau_{i+1}<\tau^{*}$ by construction, $u_{\tau_{i}}=\eta$ by the induction hypothesis. Further, let us assume that $g_{\tau_{i+1}}=-\varepsilon L_{0}x_{\tau_{i+1}}$ , since otherwise the claim again follows trivially by definition of $\tau_{i+1}$ . Thus, we can write:

where the first equality is the definition of $m_{\tau_{i+1}}$ . The second inequality follows from observation (21). Further, since $u_{\tau_{i}}=\eta$ , then by definition of (SignSGD-M), either $m_{\tau_{i}}>0$ , or $m_{\tau_{i}}=0$ and the algorithm chooses $u_{\tau_{i}}=\eta$ . In either case, $m_{\tau_{i}}\geq 0$ . Therefore, since, for $\beta\in[0,1)$ :

we obtain, using the fact that $x_{\tau_{i+1}}=x_{\tau_{i}+1}+\eta(\tau_{i+1}-(\tau_{i}+1))$ and $m_{\tau_{i}}\geq 0$ :

Thus, since $x_{\tau_{i}+1}\leq x_{\tau_{i+1}}<0$ , and since $\tau_{i+1}<\tau^{*}$ (which implies, since each update of (SignSGD-M) satisfies $u_{t}\in\left\{\pm\eta\right\}$ and by definition of $\tau^{*}$ , $x_{\tau_{i+1}}\leq x_{\tau^{*}-1}=x_{1}-\eta<0$ ), the above inequality implies that $m_{\tau_{i+1}}>0$ as long as:

Since we require $0\leq\beta<1$ , the second inequality is equivalent to:

Thus, since $\sqrt{1-x}<1-\nicefrac{{x}}{{2}}$ for $0<x\leq 1$ , it suffices to choose $\beta$ as:

In this case, $m_{\tau_{i+1}}>0$ , and thus $u_{t}=\eta$ , which establishes the induction step. Thus, for every $i\geq 0$ , either $\tau_{i}\geq\tau^{*}$ or $u_{\tau_{i}}=\eta$ , as claimed. ∎

Let us recall the i.i.d random process $\left\{\xi_{\textrm{mult},t}\right\}_{t\geq 1}$ from Section 6, where each $\xi_{\textrm{mult},t}$ is $-\varepsilon$ with probability $1-\delta$ , and $(1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}})$ otherwise. Let us distinguish the three processes from Section D.2 (Eqs. SignSGD-M, ClippedSGD and NormSGD) as, respectively, $\left\{x_{t}(i)\right\}_{t\geq 1}$ for $i\in$ . Consider the coupling of these three processes, where $x_{1}(i)=x_{1}:=-\sqrt{\nicefrac{{2\Delta}}{{L_{0}}}}$ for every $i\in$ , and for each $t\geq 1$ and $i\in$ , $g(x_{t}(i))=\xi_{\textrm{mult},t}\nabla F(x_{t}(i))$ . Further, let us denote, for each $t\geq 1$ :

and take $x_{1}(4)=x_{1}$ and $x_{t+1}(4)=x_{t}(4)-u_{t}(4)$ . Further, let, for each $i\in$ ,

Then, under the constraints on parameters of the three algorithms as imposed in Section D.2, we have that:

We claim that, for each $i\in$ , and any $t<\tau^{*}(i)$ , $u_{t}(4)\leq u_{t}(i)$ . Notice that, supposing this claim is true, then $\tau^{*}(4)\leq\tau^{*}(i)$ for each $i\in$ , since, by definition of $\tau^{*}(i)$ :

Thus, since $\tau^{*}(i)$ is the first time $t>1$ for which $x_{t}(i)\geq x_{1}$ , it follows that $\tau^{*}(4)\leq\tau^{*}(i)$ . Having established this implication, it suffices to prove the claim for each of the $u_{t}(i)$ s.

For the case of $i=2$ (i.e., algorithm (NormSGD)), for every $t<\tau^{*}(2)$ , since $\left|g(x_{t}(2))\right|\geq\varepsilon L_{0}|x_{1}|\geq\gamma$ by (18) and since $\nicefrac{{x}}{{(x+y)}}$ is non-decreasing in $x$ on the interval $x\in(0,\infty)$ for any fixed $y\geq 0$ ,

Therefore, the claim is established in all three cases, which also concludes the proof. ∎

Consider the algorithm $4$ as defined in Section D.2. Then, under the assumptions of Section D.2, we have that, for any $T\geq 1$ and any $t_{0}\geq 0$ ,

as the “net movement” of algorithm $4$ to the left of $x_{t_{1}}$ after $t_{2}-t_{1}+1$ time steps. and observe that

Additionally, note that, recalling the definition of $\tau^{*}(4)$ from Eq. 17,

Therefore, we have that, for any $t_{0}\geq 0$ ,

it remains only to upper-bound each probability inside of the above summation. To do this, let us denote, for any $t\in[t_{0}+2,T]$ ,

Collecting the above results, we arrive at the claimed lower bound. ∎

In particular, whenever $\eta\geq\nicefrac{{\alpha}}{{L_{1}}}$ for some $\alpha>0$ , and, for any $\delta\in(0,1)$ ,

and $g(x_{t})=\xi_{\textrm{mult},t}\nabla F(x_{t})$ . As established in Section 6, this oracle satisfies Assumptions 3 and 4 with $\sigma_{0}=0$ and the specified $\sigma_{1}>1$ .

Let us define, for a parameter $t_{0}\geq 1$ to be determined shortly:

Now, since the noise is sampled i.i.d at each time step, we have that, for any $t_{0}\geq 0$ :

Whenever $\mathcal{E}_{\text{nc}}$ is true, notice that:

Now, using the fact that, whenever $\mathcal{E}_{\text{nc}}$ is true, then $\nabla F(x_{t})\leq\nabla F(x_{t+1})$ for each $t\in[t_{0}]$ , and assuming $b_{0}^{2}\leq\varepsilon^{2}\left\lVert\nabla F(x_{1})\right\rVert^{2}$ , we can bound

Now, for a parameter $\alpha>0$ to be determined shortly, let us define:

Notice that, by construction, $\xi_{\textrm{mult},\tau_{i}}=1+\nicefrac{{\sigma_{1}^{2}}}{{(1+\varepsilon)}}$ for every $i\geq 0$ . Further, $x_{\tau_{i}+1}\leq x_{\tau_{i+1}}$ since $\tau_{i+1}$ is the first time after $\tau_{i}$ satisfying $\nabla F(x_{\tau_{i+1}+1})<\nabla F(x_{\tau_{i}+1})$ , or equivalently, $x_{\tau_{i+1}+1}<x_{\tau_{i}+1}$ . This implies that

Now, by construction of the $\tau_{i}$ , we have that, assuming $\mathcal{E}_{\text{nc}}$ is true, then

Thus, to ensure that $\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2}$ , it suffices to have that, for every $i<T$ , $\left\lVert\nabla F(x_{\tau_{i}})\right\rVert^{2}\geq\left\lVert\nabla F(x_{1})\right\rVert^{2}$ . Now, notice that

Thus, it suffices to establish conditions under which

Thus, if we choose $\alpha=\nicefrac{{\log(T-1)}}{{\eta L_{1}}}$ , then it suffices to take:

in which case $\min_{t\in[T]}\left\lVert\nabla F(x_{t})\right\rVert^{2}=\left\lVert\nabla F(x_{1})\right\rVert^{2}$ under $\mathcal{E}_{\text{nc}}$ . Hence,

In particular, using the fact that $1-x>\exp(\nicefrac{{-x}}{{(1-x)}})$ for $x<1$ , it follows that:

Hence, as long as, for some $\delta\in(0,1)$ ,