Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD
Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai
Introduction
A fundamental problem in stochastic optimization is to characterize the convergence behavior of the Stochastic Gradient Descent algorithm:
Much of the literature on stochastic optimization, e.g., [NY83, GL13, Bub15, FSSSSW19], focuses on a special case of (Affine-var) where the variance is uniformly upper-bounded ():
Rates of convergence to a first-order stationary point in these settings are now well-understood. Under (Bounded-var) regime, [GL13] prove an rate of convergence with a fixed step-size schedule. Later, [ACDFSW22] show that this rate is optimal up to constant factors. Further, as noted by [BCN18], a minor modification to this step-size gives nearly the same rate in the more general (Affine-var) setting, i.e., . This rate is obtained by making trivial changes to the proof technique of [GL13].
One crucial assumption in these lines of work is -smoothness, i.e., -Lipschitz gradients of the loss landscape. However, recent works [ZJFW20, ZHSJ20] provide empirical evidence that this assumption is often not satisfied in practical machine learning problems. For instance, in large-scale language modeling including BERT [DCLT18] and other variants [Rad+21, Car+21, LYFJHN23], the loss landscape of transformer architectures either does not satisfy the -smoothness assumption, or the value of becomes so large that it produces a significantly weaker rate of convergence [ZJFW20, ZHSJ20].
Aiming to address these issues, there has been a recent surge of interest in relaxing the standard -smoothness assumption and characterizing the rate of convergence. One appealing relaxation proposed by [ZHSJ20] is that of -smoothnessFor convenience, we state this assumption in terms of a bound on the hessian of . The requirement that the hessian exists everywhere can be relaxed to a condition on the gradients [ZJFW20]. This relaxation is the one we use for our main results, see Assumption 2.:
While every -smooth function is also -smooth, this relaxation admits functions that grow significantly faster than a quadratic function, e.g., is -smooth for any , and is -smooth. With regards to convergence, recent works [ZHSJ20, CLOZZ22] establish an rate in the -smooth setting, as long as the noise of the stochastic gradients has uniformly-bounded support, i.e.,
In many real-world scenarios, the (Bounded-supp) assumption does not hold. For instance, when running SGD in standard least-squares regression settings, the stochastic gradients have multiplicative noise, as noted in [DFB17, FB17, JKKNS18, JT19]. Similar noise assumptions have also been considered, e.g., in convergence of stochastic proximal gradient methods [RVV20], Hilbert-valued stochastic subgradient methods [BRS07], and adaptive gradient methods [FTCMSW22]. Moreover, multiplicative noise naturally arises in machine learning problems with (additive or multiplicative) feature noise [LW11, Hwa86, CRSC06]. Thus, we believe that characterizing -smooth functions under (Affine-var) is an important step in extending the theory of non-convex stochastic optimization beyond the standard -smooth setting.
A major challenge in the analysis of adaptive stochastic gradient descent is the correlation between the stochastic gradients and the step-size. Here, we develop a technique to simplify this challenge. Our key innovation is a recursively-defined stopping time which satisfies two crucial properties: (i) before the stopping time is reached, the step sizes behave roughly independently of the gradients, and (ii) on average, the stopping time is at least a constant fraction of the time horizon. As a consequence, instead of analyzing over the entire time horizon, we conduct the analysis over this sub-interval over which we exploit this convenient almost-independent property. This tool allows us to prove the first rate of convergence for -smooth functions beyond the (Bounded-supp) setting. Our main contributions are three-fold:
(a) Convergence for -smoothness when . We show in Section 4 that AdaGrad-Norm converges at a rate when the stochastic gradient oracle satisfies (Affine-var) with and . This is the first convergence rate for any algorithm even under (Bounded-var) (i.e., ) for general -smooth optimization. Note that the scaling of this bound with matches (up to poly-logarithmic factors) the best-known rate for -smooth functions – with a minor caveat that is not needed in the -smooth setting. Also, we show that the rate improves to in the “small variance” regime when even without tuning the step-size.
(b) Convergence for all . We establish a sufficient condition under which AdaGrad-Norm converges at a rate when , see Section 5. This condition allows us to analyze a broad subset of -smooth functions that includes all -smooth functions as well as fixed-degree polynomials without any restrictions on . This simultaneously generalizes the result and simplifies a key proof technique of [FTCMSW22] for -smooth functions.
(c) Negative results for known algorithms. We prove a set of negative results in Section 6 for most algorithms analyzed under -smoothness and (Bounded-supp). We construct an oracle for Clipped and Normalized SGD [ZHSJ20, ZJFW20] and Sign SGD with Momentum [CLOZZ22] that leads to failure with constant probability in a wide parameter regime. We also prove that AdaGrad-Norm can diverge with constant probability if the step-size is not carefully tuned in the “large variance” regime for -smooth functions. By contrast, no parameter tuning is needed in the -smooth setting in this noise regime.
Related Works
Stochastic gradient descent. (SGD) has been well-studied for many decades [RM51]. [PT73] proved almost-sure convergence to a first-order stationary point of (SGD) for non-convex and -smooth functions with with stochastic gradient oracle satisfying (a slightly weaker condition than) (Affine-var). [BT00] extended the result to a setting where does not have a uniform lower-bound. [GL13] proved that (SGD) with step-size achieves a convergence rate to a first-order stationary point of , assuming -smoothness and (Bounded-var). [DS20] proved that this is the optimal rate for (SGD) without further assumptions. Recently, [ACDFSW22] proved that the convergence rate of [GL13] is optimal among all first-order methods, not just SGD.
-smoothness in the (Bounded-supp) regime. Recent work by [ZHSJ20] argued that the -smoothness assumption is not realistic for many practical machine learning tasks, e.g., large-scale natural language processing using transformer architectures. Instead, they demonstrated that -smooth functions (Generalized-smooth) better capture the loss landscape, and proved that the gradient clipping algorithm converges at a rate in the (Bounded-supp) regime. [ZJFW20] later proved convergences for a generalized class of gradient clipping algorithms. They used a slightly weaker definition of -smoothness, which we use in Assumption 2. Very recently, [CLOZZ22] considered a “coordinate-wise” generalization of -smoothness, and proved that a “generalized SignSGD” algorithm converges at a rate . By contrast, they proved that gradient descent with fixed step-sizes must scale linearly in , where is the largest gradient in the sublevel set . Interestingly, this line of work establishes that adaptive step-size schedules can avoid this dependence on .
Problem Setting
We are interested in finding a first-order stationary point of a non-convex function, given access to a stochastic gradient oracle, using (SGD). For compactness, let . Our objective function satisfies the following:
We note that -smoothness was originally defined in [ZHSJ20] as a bound on the Hessian of , as in (Generalized-smooth). Following [ZJFW20, Remark 2.3], we choose to adopt the alternative condition in Assumption 2 for two reasons. First, Assumption 2 is strictly weaker than -smoothness, since -smoothness implies the gradients are -Lipschitz. Second, whenever the objective is twice-differentiable, Assumption 2 implies (Generalized-smooth) (up to constant factors in the definitions of and ):
A function satisfying -smoothness as per (Generalized-smooth) is also -smooth as per Assumption 2. If is twice continuously differentiable and -smooth as per Assumption 2, then it is also -smooth as per (Generalized-smooth).
Let be the sigma-algebra generated by the interaction between the algorithm and stochastic gradient oracle for rounds, i.e., . We impose the following conditions on the stochastic gradients:
Assumptions 3 and 4 imply the following bound on the stochastic gradients in terms of the true gradient:
We are interested in studying algorithms which require as little hyper-parameter tuning as possible and, simultaneously, can handle potentially unbounded smoothness constant. To achieve this, we analyze AdaGrad-Norm [SM10], a step-size sequence for (SGD) which, at each time , depends on the current and past stochastic gradients :
Our main results, Sections 4 and 4, both establish convergence rates for (AG-Norm) in the -smooth regime under (Affine-var). Section 4 holds for any -smooth function under a mild restriction that . It is easy to extend this result for by computing mini-batch gradients with a batch size , refer Section A.3 for a proof. Despite the restriction of Section 4 to , we emphasize that, prior to our work, no proof of convergence even for the (Bounded-var) setting (i.e., ) was known for a general class of -smooth functions. Besides, Section 4 holds for all and a subclass of -smooth functions, i.e., excluding functions like .
Fix any constants such that . Consider (AG-Norm) with any parameters and , running on an objective function satisfying Assumption 2, and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4. Assuming that , then for any and , with probability at least , the iterates of (AG-Norm) satisfy
To extend our convergence proofs beyond , we consider a subclass of -smooth functions which satisfy the following additional assumption:
Notice that, whereas Assumption 2 is a local constraint on the objective, Section 4 enforces a global polynomial growth constraint – thus ruling out such -smooth functions as exponentials, while capturing a significantly broader class of functions than -smoothness. We refer the interested reader to Section C.1 for some properties of this class of functions. Using this definition, we are able to prove the following:
Fix any constants such that . Consider (AG-Norm) with any parameters and , running on an objective function satisfying Assumption 2 and Section 4 for some constants , and given access to a stochastic gradient oracle satisfying Assumptions 3 and 4 for any . Then, for any and , with probability at least , the iterates of (AG-Norm) satisfy
There are several notable takeaways from the above results.
Noise adaptivity. Both Sections 4 and 4 provide “noise-adaptive” convergence rates, in a sense that as , the convergence rates automatically improve from to without any additional hyperparameter tuning.
Less hyperparameter tuning. These rates hold without tuning the parameters or with respect to or , unlike all prior algorithms for -smoothness that we are aware of [ZHSJ20, ZJFW20, CLOZZ22]This feature, however, manifests into a worse dependence on unlike [ZJFW20, CLOZZ22]..
Generalization of prior work. We remark that Section 4 strictly generalizes the result of [FTCMSW22] beyond the uniform -smooth setting. Further, our stopped analysis simplifies their “recursive improvement” technique [FTCMSW22, Lemma 13].
Key technical ideas
As discussed earlier, the main technical tool we use to obtain our convergence rates in Sections 4 and 4 is a recursively-defined stopping time. Before we are ready to define this time and discuss its utility, we first give a brief overview of the main initial steps of our analysis.
This inequality is no longer true for -smooth functions. Indeed, it is clearly not satisfied for all on the -smooth function . However, [ZHSJ20, ZJFW20] note that a similar variant holds “locally” for (see Section A.2). Using this variant, we obtain the following inequality, which is our first tool for studying the convergence of (AG-Norm).
Fix any . Suppose that . Then, for any ,
Intuitively, the “good” times are those times when Section 5 is non-vacuous. Using Section 5, we sum the expression Section 5 until any stopping time to obtain the following more useful form.
We leverage the fact that Section 5 holds for any stopping time as follows. Suppose there were some stopping time such that:
Notice that , , , and deterministically. Further, one can show that , and are -measurable for every . Intuitively, is the first time that the sum of stochastic gradient norms is significantly larger than its expectation, where the expectation is crucially over the random summation range (refer to Section B.2 for a further discussion). The following result shows the utility of this recursive construction:
For any and , let be the stopping time from Section 5.1. Then, we have the following:
is a stopping time with respect to , i.e., , .
Notice that an immediate consequence of Section 5.1 is that:
General -smooth functions clearly violate this inequality. Indeed, can potentially be a multiplicative factor of times larger than (for instance, when the -smooth objective is . Thus, even if we could guarantee deterministically that only the first time-steps are “bad”, the objective function (and also the norm of the gradient) could grow by polynomial factor in during this interval! In fact, this is exactly the intuition behind our negative result for (AG-Norm) in the “large ” regime (see Section D.1).
Since the numerator and denominator both depend on the same summation, we can apply essentially the same arguments from the case to obtain our convergence rate for in Section 4.
Given our positive results for (AG-Norm) from the previous sections, we now turn our focus to algorithms which have been analyzed in prior works on -smooth optimization. Some of the first-studied algorithms for -smooth optimization take the following forms: for parameters and :
These closely-related updates are referred to as Normalized SGD and Clipped SGD respectively. One motivation for considering these specific updates, at least in the noiseless setting where , comes through a comparison with the natural SGD step-size for -smooth non-convex optimization. Indeed, [GL13] show that a constant step-size of yields a rate of convergence to a first-order stationary point. Further, a simple extension of this result (see, e.g., [BCN18] for a proof) is that, under -smoothness and Assumption 4 with and , the step size still achieves the convergence rate. Thus, by analogy, in the -smooth setting, (and, in the multiplicative noise regime, ) is a natural candidate step size.
A number of works, including [ZHSJ20, ZJFW20, CLOZZ22], have proved that (variants of) these algorithms converge whenever the noise of the stochastic gradient satisfies (Bounded-supp). It turns out, however, that these algorithms can diverge under the noise model considered in this paper, (Affine-var). To see this, it is useful to consider a specific stochastic gradient oracle which satisfies Assumptions 3 and 4:
We can then take the output of the oracle to be . Then, this construction satisfies Assumptions 3 and 4 with the specified and .
Consider the above oracle with and . This oracle outputs stochastic gradients with the same sign as the true gradient for only roughly a fraction of the times it is queried. The majority of stochastic gradients thus have the opposite sign of the true gradient! This turns out to be quite problematic for algorithms of the form (5). Indeed, consider the behavior of (5) when . In this regime, both algorithms discard the magnitude of the stochastic gradients , and use only their sign to perform updates. Since the stochastic gradients of Section 6 have the opposite sign of for almost all time steps , one can prove that algorithms of the form (5) do not converge to a stationary point with constant probability under Assumptions 3 and 4, even when the objective function is a -dimensional quadratic function (i.e., both smooth and strongly-convex). We give a proof of (a slightly more general version of) this fact in Section D.2.
Acknowledgements
This research is supported in part by NSF Grants 2019844 and 2112471, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.
References
Overview of Appendix
Appendix A Auxiliary Lemmas
Let be a sequence of non-negative integers such that . Then, for any ,
We proceed via induction. The base case of holds trivially, with equality. Assuming the hypothesis holds at some time , we have that
Now, using the fact that for any , we have that
Combining these two bounds, we conclude that
so the claim holds also for . Thus, the claim holds for all by induction. ∎
The (AG-Norm) step-sizes satisfy, for any (possibly random) times and ,
We first note that, by definition of :
The iterates generated by (AG-Norm) satisfy, for every ,
Moreover, for any and ,
which establishes the first inequality. To obtain the second, we apply the first, together with Jensen’s inequality (noting that is convex), to obtain:
For any function satisfying Assumption 2, the sequence of iterates generated by (AG-Norm) with satisfy
Thus, by choosing , the claim is an immediate consequence of Section A.1. ∎
For any -smooth function , assuming that , the gradient evaluated at the iterate of (AG-Norm) at time satisfies:
Since , by Section A.1. Thus, we may apply Assumption 2 to obtain
The proof of the first statement is from [ZJFW20, Corollary A.4]. The proof of the second statement closely follows the analogous proof for -smooth functions from [Nes03, Lemma 1.2.2]. We give a proof of this claim for completeness.
Hence, by the limit inequality theorem, we have that, for any ,
In particular, by taking the supremum over all such , we conclude that
as claimed, where the first equality follows by observing that and have the same non-zero eigenvalues (since all entries of are real, and by appealing to the singular value decomposition), which implies that and have the same spectral norm. ∎
Suppose that the stochastic gradient oracle satisfies Assumptions 3 and 4 for some and . Then, assuming this oracle returns independent stochastic gradients each time is sampled, one can construct, for any , a new stochastic gradient oracle from this one through mini-batching which satisfies Assumptions 3 and 4 with and , and where each call to the new gradient requires only calls to the old one.
where the first inequality follows by Assumption 3 and since and are independent. The second inequality follows by Assumption 4 and our choice of . Thus, Assumption 4 is satisfied with and . ∎
An immediate consequence of Section A.2 and [FTCMSW22, Lemma 5] is that, as long as ,
We provide a proof of this inequality in Section B.5A careful reader may notice that the inequality in Section B.5 is actually slightly smaller than the one from [FTCMSW22, Lemma 5], since the dependence on constants is strictly better..
Now, let’s focus on bounding the final term above. We start by rewriting it as follows: Let us take . Then, we can decompose the final term (trivially) as
Now, whenever is false, then this expression is easy to bound, since
Notice that the first term above can be absorbed into the first term in (6), assuming is sufficiently small. For the remaining term, we begin by noticing that
Therefore, collecting these results and choosing , we have that
where . ∎
Fix any , and let be any stopping time with respect to . Then, we have that
Let us define, for a parameter to be determined,
More specifically, for any , we can decompose
In the other case, for any fixed , we have that
and in the third, we used the fact that . Then, noting that
Thus, if we choose then we obtain
Now, summing over , and using the fact that by assumption on , we have:
Focusing on the last term in the above inequality, and recalling that (deterministically) by assumption, we may apply the above bounds together with Jensen’s inequality to obtain:
Combining these bounds yields the claimed inequality. ∎
In the following, we restate Section 5 with an equivalent characterization that is sometimes more convenient for our analysis.
A time is “good” if, for fixed parameters , satisfying
Now, we may use the first and second inequalities in Section B.1 to bound the sum over “good” and “bad” times, respectively, and, collecting terms, we obtain
B.2 Constructing the “nice” stopping time
Let us recall the definition of , the “nice” stopping time: See 5.1
Here, we show that these random variables are well-defined, and enumerate the crucial properties that they satisfy.
For any and , let , , and be recursively-defined random variables from Section 5.1. Then, we have that, for all ,
is -measurable, and are each -measurable (where we take to be the trivial -algebra).
is a stopping time with respect to , i.e., for all .
For all , , , and .
For every , the following inequalities hold deterministically:
Before proving this result, let us briefly discuss an alternative construction to Section 5.1 which is (perhaps) more natural and easier to define, but does not satisfy a property we rely on to prove Section B.3:
Notice that (8) is true no matter the realization of . Indeed, for any realization of , the bound on the right-hand side still involves a random index inside the expectation. This is not the case with ( ‣ B.2) (there, the random index is outside of the expectation). This difference is crucial, and this special property of is actually what makes the proof of Section B.3 possible.
For the second claim, it suffices to consider (since we just established that is -measurable, and for any ). Now, for any such , since , we have that
For the fourth claim, we have that, by definition of and the tower rule of expectation,
Now, since , and applying (1),
Summing the above expression over , we conclude that
For the seventh claim, assuming , we have that, by Section A.2,
Further, since , and by definition of , we have that
For the final claim, we note that deterministically, by construction. Thus, we focus on the lower bound. Indeed, notice that, since ,
Therefore, by applying the union bound and Markov’s inequality, we conclude that
B.3 The key consequence of the nice stopping time construction
The following result is the most crucial place where the properties of Section 5.1 are utilized. It tells us that, as long as the sum of “bad” gradients is comparable to the sum of “good” ones, and as long as the descent inequality (Section 5) holds, then the sum of gradients scales (roughly) as . One can compare this result to that of [FTCMSW22, Lemma 13 ], which obtained a similar bound in the simpler -smooth setting. Their argument utilized a technique they termed “recursive improvement,” which required recursively invoking gradually improving bounds in order to reach their desired conclusion after infinitely many calls. Moreover, their argument crucially relies on properties of -smoothness in order to obtain worst-case upper bounds on the sum of gradients, which are no longer true in our setting. Through our construction of the stopping time , we are able to obtain a similar bound as in their setting, but with an (arguably) significantly simpler and more general proof which works even in the -smooth setting.
Then, we obtain the inequality given below:
Now, by Item 7 in Section B.2, since for any ,
then we may solve this inequality to conclude that
and .
Now, applying Hölder’s inequality to the above, we have:
where we used the following version of Hölder’s:
Recalling that by Item 2 of Section B.2, we may apply Assumption 4 to obtain
where is a parameter of our choosing. In particular, choosing (with foresight) for any , the above can be rewritten as:
To obtain a convergence rate, we begin by noting, for any , we can decompose
The first term is easy to bound via Markov’s inequality, since, choosing (since ),
To bound the second term, we note that, whenever , then
Hence, we have by Markov’s inequality and our previous bounds,
B.5 A deferred proof for establishing Section 5
Fix any . Suppose that . Then, for any time , the iterates of (AG-Norm) satisfy
The proof proceeds using similar arguments as in [FTCMSW22, Lemma 5]. By Section A.2 and the definition of (AG-Norm), we know that
We begin by bounding the inner product term above as:
Combining the above arguments, and applying Hölder’s inequality, we have that
where the last step comes from . Collecting our bounds so far yields:
Focusing on the term depending on , we have that for any ,
In this section, we show that Section B.4 can be used to establish a convergence rate without the restriction of . The key is to restrict our attention to -smooth functions which satisfy the following additional property:
The following result provides a characterization of these functions relative to -smooth functions and -smooth functions. In particular, it tells us that Section 4 is a richer function class than -smooth functions. However, not all -smooth functions satisfy Section 4.
Every -smooth function satisfies Section 4 with , , and .
Every -smooth function satisfies Section 4 locally (i.e., when ) with and .
There is a -smooth function which does not satisfy Section 4 for any fixed .
For any , satisfies Section 4 with , , and . Additionally, for any , is -smooth. However, this is not -smooth when .
The second follows since, for any -smooth function, for every ,
For the third claim, consider the function . Since . Suppose there were some such that Section 4 is satisfied. Then, it must be the case that, for any :
where the inequality follows from the definition of Section 4, the first equality by L’Hôpital’s rule, and the second by rewriting the previous expression. Repeating this argument times, this implies that
a contradiction. Hence, cannot satisfy Section 4.
For the final claim, we see that satisfies Section 4 with and since, by Jensen’s inequality,
Further, is also -smooth, since simple calculations yield that
In particular, this implies that is an eigenvector with largest eigenvalue, so, for any ,
Therefore, by [ZJFW20, Corollary A.4], for any ,
The following result demonstrates the difference in worst-case gradient norm scaling that -smooth functions provide, versus the worst-case scaling of functions satisfying Section 4.
For any function satisfying Assumption 2, and any algorithm producing iterates satisfying for every , the following inequality holds for every :
Moreover, this inequality is essentially unimprovable, in the sense that there exists a -smooth function and such that for any , and a -smooth function such that and . By contrast, any function satisfying Section 4 satisfies:
We begin by proving the first claim by by induction on . The base case of holds by definition, since
Now, supposing the claim holds for , we have that:
where the first inequality follows by applying the induction hypothesis for , the second by applying the induction hypothesis for , and the final equality follows by rearranging the prior line. Thus, the inequality holds also at , and thus our claim holds by induction.
To see that this inequality is essentially unimprovable, let us consider first consider, for any , the function:
Since , it follows from Section 3 that is -smooth. Notice that, if , then, taking and ,
Further, for any , consider the function:
Noting that when , and otherwise, it follows that is -smooth (by Section 3). Therefore, whenever ,
where the first equality follows by rearranging the definition, the second since , the third since , and the fourth by rearranging.
The final inequality follows immediately from Sections A.1 and 4, which together imply that
where we use the fact that and (since ) for the first inequality, the definition of Section 4 for the second, and the definition of from Section 3 for the third. Now, either , or not. In the first case, we note that
In the alternate case that , we obtain:
To bound the second term, we apply Sections 4 and A.1, together with the above construction, to obtain:
Thus, by the Multinomial theorem, we have that
Combining this with the above, we have the following:
We claim that, for any , the inner summation term above is bounded in expectation by:
To see this, first note that, by Section B.1, and since by Section B.2, for any ,
Now, by Items 4, 5 and 6 of Section B.2, we have that, almost surely,
Therefore, collecting these results, we conclude that, for any ,
Now, the base case of for (C.2) follows immediately from (14) with . Let us now suppose that the claim (C.2) holds for some . Then, to apply the induction hypothesis, we begin by decomposing:
Notice that the above expectation is a product of two terms: indicators depending of times , and those depending on . Therefore, since, by Sections B.2 and B.1,
we may apply the tower rule of expectations and the inequality from (14):
Therefore, summing the above expression over and applying the induction hypothesis, we conclude that:
Now, finally noting that, for any ,
C.3 Bounding the sum of “bad” gradients by the sum of “good” ones
We recall from Section B.4 that, in order to use this bound, we need to show that the sum of “bad” gradients can be upper-bounded (relatively) by the sum of “good” ones. It turns out, for functions satisfying Section 4, this is possible, as we now show.
Let be any (possibly random) time, and consider any (possibly random) set . Denote . Then, assuming satisfies Section 4, the following is satisfied deterministically:
The proof of this result follows a similar argument as used in Section 5.2. The main idea here is to, for every in decreasing order, find the first available time which has not been associated with an earlier time from . Then, using Section 4, we show that, as long as and are not too far apart, then and must also be close. For some times , there may not be such a However, because of the greedy construction, these times must be relatively small (roughly within the first time steps). Thus, as long as is not “too big” (in expectation), then we can still bound these remaining terms. We now make these arguments precise.
To begin, note that for every , by Section 4,
We use this bound as follows: let us index the times in , denoting to be the th largest time in , i.e.,
Indeed, this follows by first decomposing
Next, notice that, for every ,
where the first inequality is by definition of . To see the second inequality, we follow a similar argument as before. Indeed, observe that
where in the second inequality, we applied Section 5.2. Thus, we obtain the claimed result. ∎
and (where we use the notation ).
Thus, the conditions to apply Section B.4 are satisfied, and we obtain the convergence rate. ∎
In this section, we consider the convergence behavior of several natural candidate algorithms which have been studied in the literature on -smooth optimization. These algorithms take the form , where takes a number of different forms, including: in Normalized SGD:
and Sign-SGD with Momentum (operations performed element-wise):
[ZHSJ20, ZJFW20, CLOZZ22] prove convergence of these algorithms in the setting of (Bounded-supp). In this section, we show that these step-size choices for -smooth optimization fail under (Affine-var), despite working in the noiseless and (Bounded-supp) settings. Our negative results rely on the following stochastic gradient oracle construction:
Fix any . We begin by establishing that Assumption 3 holds for our construction of . Begin by denoting . Under this notation, we have that
which establishes Assumption 4 for any . ∎
We establish all of the following negative results using the stochastic gradient oracle described in Section 6. Before stating our results, let us briefly discuss some intuition behind why one should expect (NormSGD), (ClippedSGD), and (SignSGD-M) to fail under Section 6. Consider the setting where . Then, notice that the stochastic gradient only has the same sign as with roughly probability. Otherwise, has the opposite sign as . Now, for an algorithm which incorporates the magnitude of the stochastic gradients together with the signs, the oracle in Section 6 may not be so problematic – indeed, even though the updates with correct sign are somewhat “rare”, they are also of significantly larger magnitude compared to the updates with proper sign. However, notice that (NormSGD), (ClippedSGD), and (SignSGD-M) are (effectively) unit step-length algorithms (at least, in the setting where ). Thus, in many parameter regimes, all of these algorithms effectively disregard the magnitude of the stochastic gradients and only use their signs. This results in a biased random walk which never finds an iterate better than the initial one with constant probability. We formalize this intuition in the following:
We note that the statement Section D.1 follows from Section D.2 by choosing the parameter . The main takeaway here is that, for a reasonably wide range of parameters, (NormSGD), (ClippedSGD), and (SignSGD-M) can diverge in the affine variance setting, even for very simple smooth and strongly convex problems (in fact, even on a -dimensional quadratic function). In particular, this says that, whenever (NormSGD) is run with (or (SignSGD-M) with ), then there is no parameter tuning with respect to such that converges!
Fix any , time horizon , and affine variance parameter
D.2 Full statement and proof of negative results for (SignSGD-M), (NormSGD), and (ClippedSGD)
Here, we give the complete negative result for (SignSGD-M), (NormSGD), and (ClippedSGD), and formalize the intuition given there.
Let us choose, for arbitrary and , the -smooth objective , and assume without loss of generality that (indeed, if this is not the case, then we can always translate the function to be , and our arguments remain unchanged). Notice that .
Consider, for any and , the stochastic gradient oracle from Section 6, i.e.,
Let be the first time when an iterate becomes larger than the original one, i.e.,
Notice that this implies that, for any :
This guarantees that, before , (i) the iterates are always to the left of the minimizer, (ii) that the algorithm (ClippedSGD) never “clips” (i.e., ), and (iii), (i.e., the algorithm never achieves any nontrivial target minimization criterion). Additionally, it must be the case that:
since and .
Now, let us distinguish the updates of (SignSGD-M), (NormSGD), and (ClippedSGD) as , , and , respectively. Now, instead of reasoning about the dynamics of each of these algorithms individually, we instead reason about an algorithm with simpler dynamics, and draw conclusions about each of these processes via a stochastic dominance argument.
To do this, we utilize the coupling of these algorithms defined in Section D.2 – namely, we let (as discussed above), and for every , where is with probability , and otherwise. That is, each process starts from the same initial iterate, and receives the same multiplicative noise on the stochastic gradient at time .
Similarly, let us define the “simpler” comparison process as:
and take and . Now, denote as the stopping time from (17) corresponding to the process . Then, by Section D.2, we have that, under our coupling of these algorithms, , which implies that, for each algorithm :
By Section D.2, we have that, for any :
Thus, using the geometric summation formula, we can bound the above summations as:
Now, let us focus on bounding the two terms in the above expression. To do this, we first observe that, for any and ,
In particular, since for any , and thus also , since , we have that the above inequality is satisfied whenever:
and, combining our results, we conclude that, for any algorithm :
Consider the process from (SignSGD-M) as defined in Section D.2, where , for some , and are the stochastic gradients output by the oracle from Section 6. Suppose that the parameter of (SignSGD-M) satisfies:
Let . Then, if and , then .
Recall that, by construction of the stochastic gradient oracle from Section 6, and since :
We wish to show that the process from (SignSGD-M) has the property that, whenever and for every , then . We consider any initialization , and denote to be the first time when an iterate becomes non-negative, i.e.,
Notice that, since by definition of (SignSGD-M), and by construction of the stochastic gradient oracle:
Thus, it suffices to prove by induction that, for any , either , or , as long as
For the base case of , we may assume without loss of generality that , since and the dynamics of the update rule do not depend on (i.e., the dynamics begin at time and is the starting point of the process). Thus, the base case is true by construction.
Now, suppose the claim holds for some . Either or not. In the former case, the claim follows trivially, so let us assume that . Since by construction, by the induction hypothesis. Further, let us assume that , since otherwise the claim again follows trivially by definition of . Thus, we can write:
where the first equality is the definition of . The second inequality follows from observation (21). Further, since , then by definition of (SignSGD-M), either , or and the algorithm chooses . In either case, . Therefore, since, for :
we obtain, using the fact that and :
Thus, since , and since (which implies, since each update of (SignSGD-M) satisfies and by definition of , ), the above inequality implies that as long as:
Since we require , the second inequality is equivalent to:
Thus, since for , it suffices to choose as:
In this case, , and thus , which establishes the induction step. Thus, for every , either or , as claimed. ∎
Let us recall the i.i.d random process from Section 6, where each is with probability , and otherwise. Let us distinguish the three processes from Section D.2 (Eqs. SignSGD-M, ClippedSGD and NormSGD) as, respectively, for . Consider the coupling of these three processes, where for every , and for each and , . Further, let us denote, for each :
and take and . Further, let, for each ,
Then, under the constraints on parameters of the three algorithms as imposed in Section D.2, we have that:
We claim that, for each , and any , . Notice that, supposing this claim is true, then for each , since, by definition of :
Thus, since is the first time for which , it follows that . Having established this implication, it suffices to prove the claim for each of the s.
For the case of (i.e., algorithm (NormSGD)), for every , since by (18) and since is non-decreasing in on the interval for any fixed ,
Therefore, the claim is established in all three cases, which also concludes the proof. ∎
Consider the algorithm as defined in Section D.2. Then, under the assumptions of Section D.2, we have that, for any and any ,
as the “net movement” of algorithm to the left of after time steps. and observe that
Additionally, note that, recalling the definition of from Eq. 17,
Therefore, we have that, for any ,
it remains only to upper-bound each probability inside of the above summation. To do this, let us denote, for any ,
Collecting the above results, we arrive at the claimed lower bound. ∎
In particular, whenever for some , and, for any ,
and . As established in Section 6, this oracle satisfies Assumptions 3 and 4 with and the specified .
Let us define, for a parameter to be determined shortly:
Now, since the noise is sampled i.i.d at each time step, we have that, for any :
Whenever is true, notice that:
Now, using the fact that, whenever is true, then for each , and assuming , we can bound
Now, for a parameter to be determined shortly, let us define:
Notice that, by construction, for every . Further, since is the first time after satisfying , or equivalently, . This implies that
Now, by construction of the , we have that, assuming is true, then
Thus, to ensure that , it suffices to have that, for every , . Now, notice that
Thus, it suffices to establish conditions under which
Thus, if we choose , then it suffices to take:
in which case under . Hence,
In particular, using the fact that for , it follows that:
Hence, as long as, for some ,