Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization

Shuo Xie, Zhiyuan Li

Introduction

Which solution does $\mathtt{AdamW}$ converge to, if it converges?

Our main result, Theorem 1.1, characterizes the implicit bias of $\mathtt{AdamW}$ in the deterministic case, where a full-batch loss is used:

If $L$ is additionally convex, then $\mathtt{AdamW}$ converges to the constrained minimizer, i.e., ${\bm{x}}_{\infty}\in\operatorname*{arg\,min}_{\left\|{\bm{x}}\right\|_{\infty}\leq\frac{1}{\lambda}}L({\bm{x}})$ .

Despite its simplicity, the full-batch setting is still an interesting and highly non-trivial regime, because the two main hypotheses for why $\mathtt{Adam}$ outperforms $\mathtt{SGD}$ were recently challenged in the deterministic regime (Kunstner et al., 2022). The first hypothesis is that $\mathtt{Adam}$ outperforms $\mathtt{SGD}$ by better handling heavy-tailed noise (Zhang et al., 2020). However, Kunstner et al. (2022) find that $\mathtt{Adam}$ still outperforms $\mathtt{GD}$ on language tasks even in the full-batch setting. The second hypothesis is that the smoothness of the training loss landscape can increase linearly with the gradient norm, so that clipping or normalization is necessary for gradient descent. Intriguingly, Kunstner et al. (2022) find that normalizing each update of $\mathtt{GD}$ cannot close the gap to $\mathtt{Adam}$ in the full-batch setting, but normalizing each coordinate to its sign (i.e., $\mathtt{SignGD}$ ) closes the gap.

In Section 3.1, we prove normalized steepest descent with weight decay optimizes convex functions under norm constraints (Theorem 3.5). In Section 3.2, we prove it must converge to KKT points of the norm-constrained optimization problem for general loss functions if it converges with a learning rate schedule whose partial sum diverges (Theorem 3.7).

In Section 4, we prove $\mathtt{AdamW}$ must converge to KKT points of the norm-constrained optimization problem for general loss functions if it converges with a non-increasing learning rate schedule whose partial sum diverges (Theorem 1.1).

Towards generalizing the proof of Theorem 3.7 to Theorem 1.1, we prove a novel and tight upper bound on the average update size of $\mathtt{Adam}$ (Lemma 4.2), which holds even in non-deterministic settings and might be of independent interest to the community. We test various predictions made by our bound in experiments.

Preliminaries and Notations

Steepest Descent:

We say ${\bm{v}}$ is a steepest descent direction for objective function $L$ at current iterate ${\bm{x}}$ w.r.t. norm $\left\|\cdot\right\|$ iff $\left\|{\bm{v}}\right\|=1$ and $\left\langle{\bm{v}},\nabla L({\bm{x}})\right\rangle=\min_{\left\|{\bm{v}}^{\prime}\right\|\leq 1}\left\langle{\bm{v}}^{\prime},\nabla L({\bm{x}})\right\rangle$ . Thus for every steepest descent direction ${\bm{v}}$ , we have that $\left\langle{\bm{v}},\nabla L({\bm{x}})\right\rangle=-\left\|\nabla L({\bm{x}})\right\|_{*}$ , where $\left\|\cdot\right\|_{*}$ denotes the dual norm.

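Given initialization ${\bm{x}}_{0}$ , learning rate schedule $\{\eta_{t}\}_{t=0}^{\infty}$ and weight decay factor $\lambda$ , the $t$ th iterate of normalized steepest descent w.r.t. $\left\|\cdot\right\|$ with decoupled weight decay is defined as

${\bm{x}}_{t}=(1-\lambda\eta_{t}){\bm{x}}_{t-1}-\eta_{t}{\bm{\Delta}}_{t}$ , where ${\bm{\Delta}}_{t}\in\operatorname*{arg\,max}_{\left\|{\bm{\Delta}}\right\|\leq 1}\nabla L({\bm{x}}_{t-1})^{\top}{\bm{\Delta}}$ . (1)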

Because the dual norm of the dual norm is always equal to the original norm, by Lemma 2.1, we can also characterize the steepest descent directions as the subgradient of its dual norm.

$\operatorname*{arg\,max}\limits_{\left\|{\bm{\Delta}}\right\|\leq 1}\nabla L({\bm{x}})^{\top}{\bm{\Delta}}=\left.\partial\left\|{\bm{y}}\right\|_{*}\right|_{{\bm{y}}=\nabla L({\bm{x}})}$ .
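For the $\ell_\infty$ norm, which is the case relevant to $\mathtt{SignGD}$, the maximizer in the display above is the coordinate-wise sign of the gradient and the attained value is the dual ($\ell_1$) norm. The following minimal Python sketch illustrates this on a made-up gradient vector; the helper names (`sign`, `linf_maximizer`, and so on) are hypothetical and not taken from the paper.

```python
def sign(z):
    """Return the sign of z as -1.0, 0.0, or 1.0."""
    return float((z > 0) - (z < 0))


def linf_maximizer(grad):
    """A maximizer of <grad, delta> over the unit l_inf ball: the coordinate-wise sign."""
    return [sign(g) for g in grad]


def inner(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def l1_norm(v):
    """l_1 norm, i.e. the dual norm of l_inf."""
    return sum(abs(c) for c in v)


grad = [0.5, -2.0, 3.0]                   # illustrative gradient, not from the paper
delta = linf_maximizer(grad)              # coordinate-wise sign of the gradient
attained = inner(grad, delta)             # value of the linear objective at delta
descent_direction = [-d for d in delta]   # negating delta gives a steepest descent direction
```

Here `attained` coincides with `l1_norm(grad)`, matching the identity $\left\langle{\bm{v}},\nabla L({\bm{x}})\right\rangle=-\left\|\nabla L({\bm{x}})\right\|_{*}$ for the descent direction ${\bm{v}}=-{\bm{\Delta}}$.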

Warm Up: Implicit Bias of Normalized Steepest Descent w. Weight Decay

Our analysis in this section holds for all norms, including the non-differentiable ones, like $\left\|\cdot\right\|_{\infty}$ .

In this subsection, we give a simple non-asymptotic convergence analysis for normalized steepest descent with weight decay ( $\mathtt{NSD}$ - $\mathtt{WD}$ ) w.r.t. general norms over smooth convex loss functions. If the norm of the initialization is no larger than $\frac{1}{\lambda}$ , where $\lambda$ is the weight decay factor, then surprisingly $\mathtt{NSD}$ - $\mathtt{WD}$ is exactly equivalent to a well-known optimization algorithm in the literature, $\mathtt{Frank}$ - $\mathtt{Wolfe}$ (Frank et al., 1956), where the constraint set is the norm ball with radius $\frac{1}{\lambda}$ . If the norm of the initialization is larger than $\frac{1}{\lambda}$ , then the analysis contains an additional phase where the norm of the iterates converges linearly to $\frac{1}{\lambda}$ . In this case, the iterates of $\mathtt{NSD}$ - $\mathtt{WD}$ may always stay outside the $\frac{1}{\lambda}$ norm ball, but the convergence analysis of $\mathtt{Frank}$ - $\mathtt{Wolfe}$ can still be adopted (e.g., Jaggi (2013)). First, we show that the norm of the iterates shrinks to $\frac{1}{\lambda}$ as long as the norm of each update is bounded by $1$ , i.e., $\left\|{\bm{\Delta}}_{t}\right\|\leq 1$ . Note that this conclusion uses neither the convexity of $L({\bm{x}})$ nor the fact that the update ${\bm{\Delta}}_{t}$ is a steepest descent direction, so it also holds in non-deterministic settings.

For any learning rate schedule $\{\eta_{t}\}_{t=1}^{\infty}$ and update $\{{\bm{\Delta}}_{t}\}_{t=1}^{\infty}$ such that $\lambda\eta_{t}<1$ and $\left\|{\bm{\Delta}}_{t}\right\|\leq 1$ , $\left\|{\bm{x}}_{t}\right\|-\frac{1}{\lambda}\leq\max\left(e^{-\lambda\sum_{i=1}^{t}\eta_{i}}\left(\left\|{\bm{x}}_{0}\right\|-\frac{1}{\lambda}\right),0\right)$ .
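As a sanity check of this bound, the following Python sketch applies arbitrary (here random) updates with $\left\|{\bm{\Delta}}_{t}\right\|_{\infty}\leq 1$ and compares the excess norm with the right-hand side. It assumes the update rule ${\bm{x}}_{t}=(1-\lambda\eta_{t}){\bm{x}}_{t-1}-\eta_{t}{\bm{\Delta}}_{t}$, which is the form implied by the identity $\eta_{t}{\bm{\Delta}}_{t}={\bm{x}}_{t-1}-{\bm{x}}_{t}-\lambda\eta_{t}{\bm{x}}_{t-1}$ used in Section A.3; the constants and function names are illustrative choices only.

```python
import math
import random


def linf(v):
    """l_inf norm of a list of floats."""
    return max(abs(c) for c in v)


def run_bounded_updates(x0, lam, etas, rng):
    """Iterate x_t = (1 - lam*eta_t) x_{t-1} - eta_t * delta_t with ||delta_t||_inf <= 1.

    The updates are random, so neither convexity nor steepest descent is used.
    Returns the l_inf norms of x_0, x_1, ..., x_T.
    """
    x = list(x0)
    norms = [linf(x)]
    for eta in etas:
        delta = [rng.uniform(-1.0, 1.0) for _ in x]
        x = [(1.0 - lam * eta) * xi - eta * di for xi, di in zip(x, delta)]
        norms.append(linf(x))
    return norms


def lemma_bound(x0_norm, lam, etas):
    """Right-hand side max(exp(-lam * sum eta) * (||x_0|| - 1/lam), 0) for t = 0..T."""
    bounds = [max(x0_norm - 1.0 / lam, 0.0)]
    cumulative = 0.0
    for eta in etas:
        cumulative += eta
        bounds.append(max(math.exp(-lam * cumulative) * (x0_norm - 1.0 / lam), 0.0))
    return bounds


lam = 0.5                      # weight decay factor (illustrative)
etas = [0.1] * 200             # constant schedule with lam * eta < 1
x0 = [10.0, -4.0, 7.0]         # initialization outside the 1/lam ball
norms = run_bounded_updates(x0, lam, etas, random.Random(0))
bounds = lemma_bound(linf(x0), lam, etas)
excess = [n - 1.0 / lam for n in norms]
```

Each entry of `excess` stays below the corresponding entry of `bounds`, so the distance to the ball of radius $\frac{1}{\lambda}$ decays at least exponentially in $\lambda\sum_{i}\eta_{i}$.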

The proof is deferred to Section A.1. Lemma 3.1 shows that ${\bm{x}}_{t}$ is either always inside the norm ball with radius $\frac{1}{\lambda}$ , or their distance shrinks exponentially as the sum of learning rates increases. Whenever ${\bm{x}}_{t}$ gets into the norm ball with radius $\frac{1}{\lambda}$ , ${\bm{x}}_{t}$ will not leave it and the remaining trajectory of $\mathtt{NSD}$ - $\mathtt{WD}$ is exactly the same as $\mathtt{Frank}$ - $\mathtt{Wolfe}$ , as shown in the following theorem. We note the relationship between $\mathtt{Frank}$ - $\mathtt{Wolfe}$ and steepest descent algorithms is also observed very recently in the continuous case (Chen et al., 2023).

For any norm $\left\|\cdot\right\|$ , weight decay $\lambda$ , and $\left\|{\bm{x}}_{t-1}\right\|\leq\frac{1}{\lambda}$ , $\mathtt{NSD}$ - $\mathtt{WD}$ with learning rate $\eta_{t}<\frac{1}{\lambda}$ and $\mathtt{Frank}$ - $\mathtt{Wolfe}$ (Algorithm 2) with step size $\gamma_{t}=\eta_{t}\lambda$ and convex set $\mathcal{X}\triangleq\{{\bm{y}}\mid\left\|{\bm{y}}\right\|\leq\frac{1}{\lambda}\}$ generate the same next iterate ${\bm{x}}_{t}$ .
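The equivalence can be checked numerically for the $\ell_\infty$ norm. In the sketch below, `frank_wolfe_step` is the textbook Frank-Wolfe update with a linear minimization oracle over the $\ell_\infty$ ball, which is assumed to match Algorithm 2 (not reproduced in this text), and `nsd_wd_step` uses the sign of the gradient as the normalized update. All numerical values are made up for illustration.

```python
def sign(z):
    """Return the sign of z as -1.0, 0.0, or 1.0."""
    return float((z > 0) - (z < 0))


def nsd_wd_step(x, grad, eta, lam):
    """One l_inf NSD-WD step: x <- (1 - lam*eta) x - eta * sign(grad)."""
    return [(1.0 - lam * eta) * xi - eta * sign(g) for xi, g in zip(x, grad)]


def frank_wolfe_step(x, grad, gamma, radius):
    """One textbook Frank-Wolfe step over the l_inf ball of the given radius.

    The linear minimization oracle returns s = -radius * sign(grad), a minimizer
    of <s, grad> over the ball; the iterate moves towards s with step size gamma.
    """
    s = [-radius * sign(g) for g in grad]
    return [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]


lam, eta = 1.0, 0.25                 # illustrative values with eta < 1/lam
x_prev = [0.3, -0.8, 0.1]            # satisfies ||x||_inf <= 1/lam
grad = [1.5, -0.2, 0.7]              # illustrative gradient at x_prev
x_nsd = nsd_wd_step(x_prev, grad, eta, lam)
x_fw = frank_wolfe_step(x_prev, grad, gamma=eta * lam, radius=1.0 / lam)
```

With $\gamma_{t}=\eta_{t}\lambda$, the two steps agree coordinate by coordinate, since $(1-\gamma_{t}){\bm{x}}_{t-1}-\gamma_{t}\frac{1}{\lambda}\mathrm{sign}(\nabla L)=(1-\lambda\eta_{t}){\bm{x}}_{t-1}-\eta_{t}\,\mathrm{sign}(\nabla L)$.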

Define ${\bm{x}}^{*}=\operatorname*{arg\,min}_{\left\|{\bm{x}}\right\|\leq\frac{1}{\lambda}}L({\bm{x}})$ to be the constrained minimizer of convex function $L({\bm{x}})$ . We first compute how much the gap between $L({\bm{x}}_{t})$ and $L({\bm{x}}^{*})$ can decrease in one normalized steepest descent step when the iterate ${\bm{x}}_{t}$ is bounded.

Suppose loss function $L$ is convex and has $H$ -lipschitz gradient w.r.t. norm $\left\|\cdot\right\|$ . For iterates $\{{\bm{x}}_{t}\}$ in $\mathtt{NSD}$ - $\mathtt{WD}$ (Equation 1), we have that

The proof of Lemma 3.3 is deferred to Section A.2. With Lemma 3.3, we can prove the convergence of $L({\bm{x}}_{t})$ for learning rate schedules with certain conditions. The proof is also deferred to Section A.2.

Assume that $\eta_{t}\geq 0$ , $\lim_{t\rightarrow\infty}\eta_{t}=0$ and $\sum_{t=1}^{\infty}\eta_{t}=\infty$ . For any convex loss $L$ with $H$ -lipschitz gradient, $\lim_{t\rightarrow\infty}L({\bm{x}}_{t})=L({\bm{x}}^{*})$ .

We also provide a specific example of learning rates $\{\eta_{t}\}_{t=1}^{\infty}$ that achieves $O(\frac{1}{t})$ convergence of $L({\bm{x}}_{t})$ , which is the same rate as $\mathtt{Frank}$ - $\mathtt{Wolfe}$ over convex objectives (Jaggi, 2013). The proof is standard; for completeness, we provide a proof of Theorem 3.5 in Section A.2.

Define $B=\max{\{\left\|{\bm{x}}_{0}\right\|,\frac{1}{\lambda}\}}$ . For $\mathtt{NSD}$ - $\mathtt{WD}$ with learning rate schedule $\eta_{t}=\frac{2}{\lambda(t+1)}$ , we have $L({\bm{x}}_{t})-L({\bm{x}}^{*})\leq\frac{2H(1+\lambda B)^{2}}{(t+2)\lambda^{2}}$ for $t\geq 1$ .
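To make the rate concrete, the sketch below runs $\ell_\infty$ $\mathtt{NSD}$ - $\mathtt{WD}$ (i.e. $\mathtt{SignGD}$ with weight decay) on a toy quadratic $L({\bm{x}})=\frac{1}{2}\left\|{\bm{x}}-{\bm{c}}\right\|_{2}^{2}$ and compares the optimality gap with the bound above. The objective, the vector `c`, and the constant $H=d$ are assumptions made for this illustration: for this loss, $\left\|\nabla L({\bm{x}})-\nabla L({\bm{y}})\right\|_{1}\leq d\left\|{\bm{x}}-{\bm{y}}\right\|_{\infty}$, and the constrained minimizer is obtained by clipping `c` to $[-\frac{1}{\lambda},\frac{1}{\lambda}]$.

```python
def sign(z):
    """Return the sign of z as -1.0, 0.0, or 1.0."""
    return float((z > 0) - (z < 0))


def loss(x, c):
    """Toy convex loss L(x) = 0.5 * ||x - c||_2^2 (illustrative, not from the paper)."""
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def grad(x, c):
    """Gradient of the toy loss."""
    return [xi - ci for xi, ci in zip(x, c)]


lam = 1.0
c = [2.0, -0.5, 0.3]
d = len(c)
H = float(d)                                   # smoothness constant w.r.t. l_inf for this loss
x_star = [max(-1.0 / lam, min(1.0 / lam, ci)) for ci in c]   # constrained minimizer

x = [0.0] * d                                  # initialization inside the ball
B = max(max(abs(xi) for xi in x), 1.0 / lam)

gaps, bounds = [], []
for t in range(1, 201):
    eta = 2.0 / (lam * (t + 1))                # schedule from the theorem
    g = grad(x, c)
    x = [(1.0 - lam * eta) * xi - eta * sign(gi) for xi, gi in zip(x, g)]
    gaps.append(loss(x, c) - loss(x_star, c))
    bounds.append(2.0 * H * (1.0 + lam * B) ** 2 / ((t + 2) * lam ** 2))
```

In this run every entry of `gaps` sits below the matching entry of `bounds`, consistent with the $O(\frac{1}{t})$ guarantee; the bound is typically loose on such a simple problem.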

3.2 Non-convex setting: convergence to KKT points

In this subsection, we study the implicit bias of $\mathtt{SignGD}$ (or more generally, $\mathtt{NSD}$ - $\mathtt{WD}$ ) when the loss is non-convex. In this case, last-iterate parameter convergence is, in general, difficult to show (indeed, even in the convex case, $\mathtt{Frank}$ - $\mathtt{Wolfe}$ may not converge in parameter (Bolte et al., 2023)), and thus we turn to studying which parameters $\mathtt{SignGD}$ and $\mathtt{NSD}$ - $\mathtt{WD}$ can converge to. Our main result, Theorem 3.7, shows that such parameters must be KKT points (see Definition 3.6) of the constrained optimization problem. In particular, if the objective is convex, then since the norm ball constraint is convex for every norm, all KKT points are constrained minimizers.

For convex $L$ , all KKT points ${\bm{x}}^{*}$ are optimal and the dual variable $s^{*}\geq 0$ is the certificate of optimality. To see this, for any other ${\bm{y}}$ with $\left\|{\bm{y}}\right\|\leq\frac{1}{\lambda}$ , it holds that $L({\bm{y}})\geq L({\bm{y}})+s^{*}(\left\|{\bm{y}}\right\|-\frac{1}{\lambda})\geq L({\bm{x}}^{*})+s^{*}(\left\|{\bm{x}}^{*}\right\|-\frac{1}{\lambda})$ , where the first inequality uses $s^{*}\geq 0$ and $\left\|{\bm{y}}\right\|\leq\frac{1}{\lambda}$ , and the second inequality holds because $L({\bm{x}})+s^{*}\left\|{\bm{x}}\right\|$ is also convex and ${\bm{0}}$ is a subgradient of it at ${\bm{x}}^{*}$ . Thus we conclude $L({\bm{y}})\geq L({\bm{x}}^{*})+s^{*}(\left\|{\bm{x}}^{*}\right\|-\frac{1}{\lambda})=L({\bm{x}}^{*})$ , where the final equality is complementary slackness.

Now we state the main result for this subsection.

To prove Theorem 3.7, we use the following alternative characterization of the KKT points of $\min_{\left\|{\bm{x}}\right\|\leq\frac{1}{\lambda}}L({\bm{x}})$ , which is based on Lemma 2.1.

${\bm{x}}$ is a KKT point of $\min_{\left\|{\bm{x}}\right\|\leq\frac{1}{\lambda}}L({\bm{x}})$ iff $\left\|{\bm{x}}\right\|\leq\frac{1}{\lambda}$ and $\left\langle-\lambda{\bm{x}},\nabla L({\bm{x}})\right\rangle=\left\|\nabla L({\bm{x}})\right\|_{*}$ .
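For the $\ell_\infty$ norm, whose dual norm is $\ell_1$, this characterization becomes a simple numerical test. The sketch below implements it and applies it to the toy quadratic $L({\bm{x}})=\frac{1}{2}\left\|{\bm{x}}-{\bm{c}}\right\|_{2}^{2}$; the function name `is_kkt_linf`, the tolerance, and the example vectors are hypothetical choices made for illustration.

```python
def is_kkt_linf(x, grad, lam, tol=1e-9):
    """Check the two conditions of the characterization for the l_inf norm.

    1. Feasibility: ||x||_inf <= 1/lam.
    2. Alignment:   <-lam * x, grad> == ||grad||_1 (the dual norm of l_inf).
    """
    feasible = max(abs(xi) for xi in x) <= 1.0 / lam + tol
    lhs = sum(-lam * xi * gi for xi, gi in zip(x, grad))
    rhs = sum(abs(gi) for gi in grad)
    return feasible and abs(lhs - rhs) <= tol


lam = 1.0
c = [2.0, -0.5, 0.3]                         # illustrative target, not from the paper


def grad_at(x):
    """Gradient of the toy loss 0.5 * ||x - c||_2^2."""
    return [xi - ci for xi, ci in zip(x, c)]


x_clipped = [1.0, -0.5, 0.3]                 # c clipped to [-1/lam, 1/lam]
x_origin = [0.0, 0.0, 0.0]

kkt_at_clipped = is_kkt_linf(x_clipped, grad_at(x_clipped), lam)
kkt_at_origin = is_kkt_linf(x_origin, grad_at(x_origin), lam)
```

At `x_clipped` the gradient is nonzero only on the coordinate that sits on the boundary, and there it points outward, so the alignment condition holds; at the origin the left-hand side is $0$ while the gradient is nonzero, so the test fails.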

The following lemma (Lemma 3.9) circumvents the above issue by considering the weighted average of past steepest descent directions, which provably converges, given that the iterates $\{{\bm{x}}_{t}\}_{t=1}^{\infty}$ converge. Theorem 3.7 is a direct combination of Lemma 3.9 and Lemma 3.8 and we omit its proof. The proof of Lemma 3.9 is deferred to Section A.3.

For any learning rate schedule $\{\eta_{t}\}_{t=1}^{\infty}$ satisfying $\sum_{t=1}^{\infty}\eta_{t}=\infty$ , if the iterates $\{{\bm{x}}_{t}\}_{t=0}^{\infty}$ of $\mathtt{NSD}$ - $\mathtt{WD}$ converge to some ${\bm{x}}_{\infty}$ , we have that

${\bm{\Delta}}_{\infty}:=\lim\limits_{T\rightarrow\infty}\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}}{\sum_{t=1}^{T}\eta_{t}}$ exists and ${\bm{\Delta}}_{\infty}=-\lambda{\bm{x}}_{\infty}$ .

$\left\langle\nabla L({\bm{x}}_{\infty}),{\bm{\Delta}}_{\infty}\right\rangle=\left\|\nabla L({\bm{x}}_{\infty})\right\|_{*}$ .

$\left\|{\bm{\Delta}}_{\infty}\right\|\leq 1$ .
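The first property rests on a telescoping identity: summing $\eta_{t}{\bm{\Delta}}_{t}={\bm{x}}_{t-1}-{\bm{x}}_{t}-\lambda\eta_{t}{\bm{x}}_{t-1}$ over $t$ gives $\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}={\bm{x}}_{0}-{\bm{x}}_{T}-\lambda\sum_{t=1}^{T}\eta_{t}{\bm{x}}_{t-1}$. The Python sketch below verifies this identity numerically for $\ell_\infty$ $\mathtt{NSD}$ - $\mathtt{WD}$ on a toy quadratic; the loss, the schedule, and the variable names are assumptions chosen for illustration.

```python
def sign(z):
    """Return the sign of z as -1.0, 0.0, or 1.0."""
    return float((z > 0) - (z < 0))


lam = 1.0
c = [2.0, -0.5, 0.3]          # toy loss L(x) = 0.5 * ||x - c||_2^2 (illustrative)
x0 = [0.0, 0.0, 0.0]
x = list(x0)

sum_eta = 0.0
sum_eta_delta = [0.0] * len(x0)   # accumulates eta_t * Delta_t
sum_eta_xprev = [0.0] * len(x0)   # accumulates eta_t * x_{t-1}

for t in range(1, 501):
    eta = 2.0 / (lam * (t + 1))                       # divergent partial sums
    delta = [sign(xi - ci) for xi, ci in zip(x, c)]   # sign of the gradient
    sum_eta += eta
    sum_eta_delta = [s + eta * di for s, di in zip(sum_eta_delta, delta)]
    sum_eta_xprev = [s + eta * xi for s, xi in zip(sum_eta_xprev, x)]
    x = [(1.0 - lam * eta) * xi - eta * di for xi, di in zip(x, delta)]

# Weighted average of the updates, computed directly and via the telescoped form.
weighted_avg = [s / sum_eta for s in sum_eta_delta]
telescoped = [
    (x0i - xi - lam * sx) / sum_eta
    for x0i, xi, sx in zip(x0, x, sum_eta_xprev)
]
```

Because $\sum_{t}\eta_{t}$ diverges while ${\bm{x}}_{0}-{\bm{x}}_{T}$ stays bounded, the telescoped form shows why the weighted average can only converge to $-\lambda{\bm{x}}_{\infty}$ when the iterates converge.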

Implicit Bias of AdamW

In this section, we extend the analysis of $\mathtt{NSD}$ - $\mathtt{WD}$ in Section 3 to $\mathtt{AdamW}$ to prove that the converged parameters of $\mathtt{AdamW}$ are KKT points of the constrained optimization problem. The proof relies on an upper bound on the average update size of $\mathtt{AdamW}$ , and we find that the bound can also be used to guide hyperparameter tuning in empirical studies.

For non-increasing learning rate schedule $\{\eta_{t}\}_{t=0}^{\infty}$ satisfying $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\beta_{2}\geq\beta_{1}$ , we get $\{{\bm{x}}_{t}\}_{t=1}^{\infty}$ by running AdamW with weight decay factor $\lambda$ . If $\{{\bm{x}}_{t}\}_{t=0}^{\infty}$ converges to some ${\bm{x}}_{\infty}$ , then it holds that

${\bm{\Delta}}_{\infty}:=\lim\limits_{T\rightarrow\infty}\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}}{\sum_{t=1}^{T}\eta_{t}}$ exists and ${\bm{\Delta}}_{\infty}=-\lambda{\bm{x}}_{\infty}$ .

$\left\langle\nabla L({\bm{x}}_{\infty}),{\bm{\Delta}}_{\infty}\right\rangle=\left\|\nabla L({\bm{x}}_{\infty})\right\|_{1}$ .

$\left\|{\bm{\Delta}}_{\infty}\right\|_{\infty}\leq 1$ .

The first two properties in Lemma 4.1 follow from an argument similar to that for Lemma 3.9, and the main technical difficulty lies in the proof of the third property. This is because for any single $t$ , $\left\|{\bm{\Delta}}_{t}\right\|$ could be larger than $1$ , which is different from the case of $\mathtt{NSD}$ - $\mathtt{WD}$ . To prove the third property, we need a tight upper bound on the average update size of the $\mathtt{Adam}$ -like update rule, which is Lemma 4.2. The proof of Lemma 4.1 is deferred to Appendix B.

As mentioned earlier, $\mathtt{Adam}$ updates $\left\|\frac{{\bm{m}}_{t}}{\sqrt{{\bm{v}}_{t}}}\right\|$ can easily go beyond $1$ and thus we prove the following upper bound for the average update size of $\mathtt{Adam}$ (Lemma 4.2). The proof of Lemma 4.2 is deferred to Section B.1.

In particular, when $\beta_{1}=\beta_{2}$ , it even holds that $|\Delta_{t}|\leq 1$ .
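The contrast between $\beta_{1}=\beta_{2}$ and $\beta_{1}<\beta_{2}$ is easy to observe numerically. The sketch below tracks the per-step ratio $m_{t}/\sqrt{v_{t}}$ for a single coordinate under the assumptions that $m_{0}=v_{0}=0$, that there is no bias correction, and that no $\epsilon$ is added to the denominator; these conventions, the random gradient sequence, and the function name `adam_ratios` are illustrative assumptions rather than a reproduction of Algorithm 1.

```python
import math
import random


def adam_ratios(grads, beta1, beta2):
    """Return the sequence m_t / sqrt(v_t) for scalar gradients.

    Assumes m_0 = v_0 = 0, no bias correction and no epsilon (illustrative setup).
    """
    m, v = 0.0, 0.0
    ratios = []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        ratios.append(m / math.sqrt(v))
    return ratios


rng = random.Random(0)
grads = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # arbitrary, possibly noisy gradients

ratios_equal = adam_ratios(grads, beta1=0.9, beta2=0.9)      # beta1 == beta2
ratios_unequal = adam_ratios(grads, beta1=0.9, beta2=0.999)  # beta1 < beta2

max_equal = max(abs(r) for r in ratios_equal)
max_unequal = max(abs(r) for r in ratios_unequal)
```

With $\beta_{1}=\beta_{2}$, Cauchy-Schwarz gives $m_{t}^{2}\leq(1-\beta_{1}^{t})v_{t}$, so `max_equal` never exceeds $1$. With $\beta_{1}<\beta_{2}$ the very first step already has ratio $(1-\beta_{1})/\sqrt{1-\beta_{2}}>1$ for these values, which is why only the average update size can be controlled in general.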

Note $\{v_{t}\}_{t=0}^{\infty}$ here only needs to satisfy a more general condition rather than to be the exact moving average of $g_{t}^{2}$ . It can be applied to the practical scenario where a small positive constant $\epsilon$ is added to $\sqrt{{\bm{v}}_{t}}$ in the denominator to improve the numerical stability of $\mathtt{Adam}$ . It is easy to verify that for ${\bm{v}}_{t}$ in Algorithm 1, we have that

Therefore, for $\mathtt{Adam}$ with $\epsilon$ , $v_{t}$ in Lemma 4.2 is always lower bounded, and if we further have an upper bound on the gradients, then we can easily control the average update size of $\mathtt{Adam}$ . One nice property is that the upper bound scales only logarithmically in $1/\epsilon$ , rather than linearly as the naive upper bound does.

Another application of Lemma 4.2 is to provide a tight upper bound for the norm of iterates for any setting, e.g., before convergence or even when the gradient is stochastic. In particular, when the learning rate does not change over steps, we have the following upper bound whose proof is in Section B.4.

For any coordinate $j\in[d]$ , for $\mathtt{AdamW}$ with constant learning rate $\eta$ and weight decay factor $\lambda$ , with $C\triangleq\max\limits_{1\leq t\leq T}\left|\ln{\frac{{\bm{v}}_{t,j}}{{\bm{v}}_{1,j}}}\right|$ , it holds that

When $\beta_{1}=\beta_{2}$ , we only need $T=\Omega\left(\frac{\log\left\|{\bm{x}}_{0}\right\|_{\infty}}{\lambda\eta}\right)$ to guarantee that $\left|{\bm{x}}_{T,j}\right|$ is no larger than $\frac{1}{\lambda}$ for any $\lambda\eta\leq 1$ . However, when $\beta_{1}<\beta_{2}$ and $\beta_{1}<1-\lambda\eta$ , the dominating term on the right-hand side is $C\cdot\frac{\eta\lambda(\beta_{2}-\beta_{1})}{(1-\beta_{2})(1-\eta\lambda-\beta_{1})}$ . Assuming $C=O(1)$ , it also requires $\lambda\eta\ll 1-\beta_{2}<1-\beta_{1}$ or $\lambda\eta<1-\beta_{2}\approx 1-\beta_{1}$ to ensure the remaining term is small.

Experiments

5.2 A synthetic problem

Related Work

While Stochastic Gradient Descent (Robbins & Monro, 1951) remains popular for optimizing deep learning models like ResNet (He et al., 2016), only adaptive methods can efficiently train recently emerged large language models (Zhang et al., 2020). There is a large body of research on adaptive gradient methods, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), AdaDelta (Zeiler, 2012), Adam (Kingma & Ba, 2014), AdaFactor (Shazeer & Stern, 2018), AMSGrad (Reddi et al., 2018), AdaBound (Luo et al., 2018), Lion (Chen et al., 2024), etc. Recently there have also been adaptive methods that attempt to accelerate training by leveraging second-order information, e.g., AdaHessian (Yao et al., 2021) and Sophia (Liu et al., 2023). However, most algorithms that are able to train large language models adopt coordinate-wise adaptivity. In contrast, stochastic gradient descent, even when equipped with global gradient norm clipping, cannot match the performance of coordinate-wise adaptive algorithms on language tasks (Li et al., 2022a). Previous work has given convergence rates for RMSProp and Adam under different assumptions (Chen et al., 2018; Zou et al., 2019; Shi & Li, 2021; Guo et al., 2021; Défossez et al., 2022; Zhang et al., 2022).

Our work shows that $\mathtt{AdamW}$ and $\mathtt{SignGD}$ with weight decay converge to the same point assuming convergence. Balles & Hennig (2018); Kunstner et al. (2022) point out that the similarity with $\mathtt{SignGD}$ largely accounts for the advantage of $\mathtt{Adam}$ over $\mathtt{SGD}$ . Moreover, when $\mathtt{SignGD}$ is equipped with momentum which is one key component of $\mathtt{Adam}$ , it can achieve comparable empirical results with $\mathtt{Adam}$ for various tasks (Balles & Hennig, 2018; Kunstner et al., 2022; Bernstein et al., 2018; Crawshaw et al., 2022).

Role of Weight Decay:

The usage of weight decay, which refers to shrinking the parameter by a small constant fraction, can be dated back to the 1980s (Rumelhart et al., 1986; Hinton, 1987). It has been recognized as a standard trick to improve the generalization performance of neural networks (Krogh & Hertz, 1991; Bos & Chug, 1996) for a long time. Krizhevsky et al. (2012) first noticed that weight decay can sometimes accelerate optimization in deep learning. For modern architectures equipped with normalization layers, e.g., BatchNorm (Ioffe & Szegedy, 2015) and LayerNorm (Ba et al., 2016), only the direction of the parameters before normalization layers matters, rather than their norms. Turning on weight decay in such settings changes the effective learning rate of the parameters (Hoffer et al., 2018; Arora et al., 2018; Zhang et al., 2018; Li & Arora, 2019; Li et al., 2020).

Implicit Regularization:

The concurrent work by Chen et al. (2023) is arguably the most related to ours. There, the recently discovered optimization algorithm found by auto-search, Lion (Chen et al., 2024), is elegantly generalized to a family of algorithms, Lion- $\mathcal{K}$ , where $\mathcal{K}$ is some convex function. When $\mathcal{K}$ is chosen to be the dual norm and momentum in Lion- $\mathcal{K}$ is turned off, Lion- $\mathcal{K}$ becomes normalized steepest descent. Their analysis shows that even with momentum, normalized steepest descent with weight decay can be viewed as optimization under the original norm constraint. However, in any Lion- $\mathcal{K}$ algorithm, the update at step $t$ only depends on past iterates through the first-order momentum ${\bm{m}}_{t}$ . Their analysis cannot be applied to $\mathtt{AdamW}$ because $\mathtt{AdamW}$ cannot be written in the form of Lion- $\mathcal{K}$ for any convex function $\mathcal{K}$ . To see this, simply note that the update of Lion- $\mathcal{K}$ for a fixed $\mathcal{K}$ is completely determined by ${\bm{g}}_{t},{\bm{m}}_{t}$ and ${\bm{x}}_{t}$ , while the update of $\mathtt{AdamW}$ can still differ if the second-order momentum ${\bm{v}}_{t}$ is different. In terms of proof technique, Chen et al. (2023) construct a Lyapunov function, while we directly characterize the KKT points and connect the converged point to a KKT point through the weighted average update.

Discussion and Future Works

This work focuses on the implicit bias of $\mathtt{AdamW}$ in the deterministic (or full-batch) case. Though our upper bound on the average update size of $\mathtt{Adam}$ holds unconditionally on the input gradients, whether or not they are stochastic, it is unlikely that the $\frac{1}{\lambda}$ upper bound can be reached when there is large gradient noise, especially when $\beta_{2}$ is very close to $1$ . In that case, the denominator of the update of $\mathtt{AdamW}$ is roughly the square root of the square of the expected gradient plus an additional gradient variance term, which strictly dominates the expected gradient in the numerator. Malladi et al. (2022) use a Stochastic Differential Equation (SDE) approximation to model the trajectories of $\mathtt{Adam}$ in this regime and empirically test an implication of the SDE approximation, namely the square root scaling rule.

Another important future direction is to provide non-asymptotic convergence rates for $\mathtt{AdamW}$ in both convex and non-convex settings.

Conclusions

References

Appendix A Omitted Proofs in Section 3

In this section, we provide the omitted proofs in Section 3, which characterize the iterates and the converged solution of normalized steepest descent with decoupled weight decay, before diving into the analysis of $\mathtt{AdamW}$ . In Section A.1, we prove that the iterates will enter or stay in the norm ball with radius $\frac{1}{\lambda}$ for any normalized update. In Section A.2, we prove that, with proper learning rates, the iterates of normalized steepest descent with weight decay converge to the constrained minimizer of $L({\bm{x}})$ in the same ball.

Proof of Lemma 3.1. We prove by induction that $\left\|{\bm{x}}_{t}\right\|\leq\frac{1}{\lambda}+\prod_{i=1}^{t}(1-\lambda\eta_{i})\left(\left\|{\bm{x}}_{0}\right\|-\frac{1}{\lambda}\right)$ .

When $\left\|{\bm{x}}_{0}\right\|>\frac{1}{\lambda}$ , we have that

When $\left\|{\bm{x}}_{0}\right\|\leq\frac{1}{\lambda}$ , $\left\|{\bm{x}}_{t}\right\|-\frac{1}{\lambda}\leq 0$ . This completes the proof. ∎

A.2 Omitted proofs for convergence to constrained minimizer with proper learning rates

For normalized steepest descent update ${\bm{\Delta}}_{t}$ from Equation 1,

where the first inequality uses the convexity of $L$ and the second inequality uses $\left\|{\bm{x}}^{*}\right\|\leq\frac{1}{\lambda}$ .

Since the gradient of $L$ is $H$ -lipschitz, by Taylor expansion, we have that

Because the update ${\bm{\Delta}}_{t}$ is normalized and thus has unit norm by definition, it holds that

The proof of Theorem 3.4 is a direct application of Lemma A.1 on the one-step descent lemma Lemma 3.3. ∎

Assume that $\eta_{t}\geq 0$ , $\lim_{t\rightarrow\infty}\eta_{t}=0$ and $\sum_{t=1}^{\infty}\eta_{t}=\infty$ . $C$ is any positive number and $a_{0}\geq 0$ . If the sequence $\{a_{t}\}_{t=0}^{\infty}$ satisfies that $a_{t}\leq(1-\eta_{t})a_{t-1}+C\eta_{t}^{2}$ , then $\lim_{t\rightarrow\infty}a_{t}=0$ .

First we show by induction that $a_{t}\leq a_{0}\exp\left(-\sum_{i=1}^{t}\eta_{i}\right)+C\sum_{i=1}^{t}\eta_{i}^{2}\exp\left(-\sum_{j=i+1}^{t}\eta_{j}\right)$ .

Because $\sum_{t=1}^{\infty}\eta_{t}=\infty$ , $\lim_{t\rightarrow\infty}a_{0}\exp\left(-\sum_{i=1}^{t}\eta_{i}\right)=0$ . In order to show $\lim_{t\rightarrow\infty}a_{t}=0$ , it’s sufficient to show $\lim_{t\rightarrow\infty}\sum_{i=1}^{t}\eta_{i}^{2}\exp\left(-\sum_{j=i+1}^{t}\eta_{j}\right)=0$ .
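The recursion in Lemma A.1 and the induction bound above can be checked numerically. The sketch below takes the worst case $a_{t}=(1-\eta_{t})a_{t-1}+C\eta_{t}^{2}$ with the illustrative schedule $\eta_{t}=\frac{1}{t+1}$ (an assumption made here, chosen so that $\eta_{t}\to 0$ and $\sum_{t}\eta_{t}=\infty$) and evaluates the exponential bound through the equivalent recursion $b_{t}=e^{-\eta_{t}}b_{t-1}+C\eta_{t}^{2}$ with $b_{0}=a_{0}$.

```python
import math


def simulate(a0, C, T):
    """Run a_t = (1 - eta_t) a_{t-1} + C eta_t^2 alongside the exponential bound.

    The bound a_0 exp(-sum eta) + C sum_i eta_i^2 exp(-sum_{j>i} eta_j) satisfies
    b_t = exp(-eta_t) b_{t-1} + C eta_t^2, which avoids a double loop.
    Uses the illustrative schedule eta_t = 1 / (t + 1).
    """
    a, b = a0, a0
    a_values, b_values = [a], [b]
    for t in range(1, T + 1):
        eta = 1.0 / (t + 1)
        a = (1.0 - eta) * a + C * eta * eta
        b = math.exp(-eta) * b + C * eta * eta
        a_values.append(a)
        b_values.append(b)
    return a_values, b_values


a_values, b_values = simulate(a0=5.0, C=1.0, T=10000)
final_a = a_values[-1]
```

Since $1-\eta\leq e^{-\eta}$, the sequence `a_values` stays below `b_values` at every step, and both decay towards $0$; for this schedule the decay is of order $\frac{\log t}{t}$.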

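From Lemma 3.1, $\left\|{\bm{x}}_{t}\right\|\leq\max{\{\left\|{\bm{x}}_{0}\right\|,\frac{1}{\lambda}\}}=B$ for $t\geq 0$ . Define $C\triangleq\frac{H(1+\lambda B)^{2}}{2\lambda^{2}}$ , so that $\frac{H(1+\lambda B)^{2}}{2}\eta_{t}^{2}=\frac{4C}{(t+1)^{2}}$ for $\eta_{t}=\frac{2}{\lambda(t+1)}$ and the bound in Theorem 3.5 reads $L({\bm{x}}_{t})-L({\bm{x}}^{*})\leq\frac{4C}{t+2}$ .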

Suppose $L({\bm{x}}_{t-1})-L({\bm{x}}^{*})\leq\frac{4C}{t+1}$ , we have that

A.3 Omitted Proofs for Lemma 3.9

For any $\epsilon>0$ , there exists $t^{\prime}$ such that $\left\|{\bm{x}}_{t}-{\bm{x}}_{\infty}\right\|\leq\frac{\epsilon}{2\lambda}$ for any $t>t^{\prime}$ . Because $\eta_{t}{\bm{\Delta}}_{t}={\bm{x}}_{t-1}-{\bm{x}}_{t}-\lambda\eta_{t}{\bm{x}}_{t-1}$ , we have that

There exists $T^{\prime}\geq t^{\prime}$ such that $\sum_{t=1}^{T}\eta_{t}\geq\frac{2}{\epsilon}\left(\left\|{\bm{x}}_{0}-{\bm{x}}_{\infty}-\lambda\left(\sum_{t=1}^{t^{\prime}}\eta_{t}{\bm{x}}_{t-1}-\sum_{t=1}^{t^{\prime}}\eta_{t}{\bm{x}}_{\infty}\right)\right\|+\frac{\epsilon}{2}\right)$ for $T\geq T^{\prime}$ . Then we have

So ${\bm{\Delta}}_{\infty}:=\lim\limits_{T\rightarrow\infty}\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}}{\sum_{t=1}^{T}\eta_{t}}$ exists and ${\bm{\Delta}}_{\infty}=-\lambda{\bm{x}}_{\infty}$ .

Because $\nabla L({\bm{x}})$ is a continuous function and $\lim_{t\rightarrow\infty}{\bm{x}}_{t}={\bm{x}}_{\infty}$ , $\lim_{t\rightarrow\infty}\nabla L({\bm{x}}_{t})=\nabla L({\bm{x}}_{\infty})$ . For any $\epsilon>0$ , there exists $T_{1}$ such that

for any $t\geq T_{1}$ . It also holds that

because $\left\|{\bm{\Delta}}_{t}\right\|\leq 1$ . Because $\sum_{t=1}^{\infty}\eta_{t}=\infty$ , there exists $T_{2}\geq T_{1}$ such that

Therefore, we prove that $\left\|\nabla L({\bm{x}}_{\infty})\right\|_{*}=\lim_{T\to\infty}\frac{\sum_{t=1}^{T}\eta_{t}\left\langle\nabla L({\bm{x}}_{\infty}),{\bm{\Delta}}_{t}\right\rangle}{\sum_{t=1}^{T}\eta_{t}}$ . On the other hand, we have that

For any $T$ , we know $\left\|\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}}{\sum_{t=1}^{T}\eta_{t}}\right\|\leq\frac{\sum_{t=1}^{T}\eta_{t}\left\|{\bm{\Delta}}_{t}\right\|}{\sum_{t=1}^{T}\eta_{t}}\leq\frac{\sum_{t=1}^{T}\eta_{t}}{\sum_{t=1}^{T}\eta_{t}}=1$ . By the continuity of $\left\|\cdot\right\|$ , $\left\|\Delta_{\infty}\right\|=\lim_{T\rightarrow\infty}\left\|\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t}}{\sum_{t=1}^{T}\eta_{t}}\right\|\leq 1$ .

Appendix B Omitted Proofs in Section 4

We first represent $m_{t}$ and $v_{t}$ as a weighted sum of $g_{t}$ and $g_{t}^{2}$ .

By Cauchy–Schwarz inequality, we have that

We further analyze the first term on the RHS and have that

B.2 Proof for Lemma 4.1

The proof for this part is the same as the proof of Lemma 3.9 in Section A.3.

If $\nabla L({\bm{x}}_{\infty})={\bm{\mathbf{0}}}$ , $\left\langle\nabla L({\bm{x}}_{\infty}),{\bm{\Delta}}_{\infty}\right\rangle=0=\left\|\nabla L({\bm{x}}_{\infty})\right\|_{1}$ .

If $\nabla L({\bm{x}}_{\infty})\neq{\bm{\mathbf{0}}}$ , we consider each coordinate $j$ such that $\nabla L({\bm{x}}_{\infty})_{j}\neq 0$ . Since we have that $\lim_{t\rightarrow\infty}\nabla L({\bm{x}}_{t})_{j}=\nabla L({\bm{x}}_{\infty})_{j}$ , we can get the convergence for ${\bm{m}}_{t,j}$ and ${\bm{v}}_{t,j}$ .

Then we have that $\lim_{t\rightarrow\infty}{\bm{\Delta}}_{t,j}=\lim_{t\rightarrow\infty}\frac{{\bm{m}}_{t,j}}{\sqrt{{\bm{v}}_{t,j}}}=\text{sign}(\nabla L({\bm{x}}_{\infty})_{j})$ . For any $\epsilon>0$ , there exists $t^{\prime}$ such that $\left\|{\bm{\Delta}}_{t,j}-\text{sign}\left(\nabla L({\bm{x}}_{\infty})_{j}\right)\right\|\leq\frac{\epsilon}{2}$ for $t\geq t^{\prime}$ . And there exists $T^{\prime}\geq t^{\prime}$ such that $\sum_{t=1}^{T}\eta_{t}\geq\frac{2}{\epsilon}\sum_{t=1}^{t^{\prime}}\eta_{t}\left({\bm{\Delta}}_{t,j}-\text{sign}\left(\nabla L({\bm{x}}_{\infty})_{j}\right)\right)$ for any $T\geq T^{\prime}$ . Then for any $T\geq T^{\prime}$ , we have that

So ${\bm{\Delta}}_{\infty,j}=\lim_{T\rightarrow\infty}\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t,j}}{\sum_{t=1}^{T}\eta_{t}}=\text{sign}(\nabla L({\bm{x}}_{\infty})_{j})$ for $\nabla L({\bm{x}}_{\infty})_{j}\neq 0$ . Then we have that

For nonzero coordinate $j$ of $\nabla L({\bm{x}}_{\infty})$ , from above we have $|{\bm{\Delta}}_{\infty,j}|=|\text{sign}(\nabla L({\bm{x}}_{\infty})_{j})|=1$ .

For $j$ such that $\nabla L({\bm{x}}_{\infty})_{j}=0$ , we know $\lim_{t\rightarrow\infty}{\bm{g}}_{t,j}=\lim_{t\rightarrow\infty}{\bm{m}}_{t,j}=\lim_{t\rightarrow\infty}{\bm{v}}_{t,j}=0$ . We employ the upper bound for average update in Lemma 4.2 since $\{{\bm{g}}_{t,j}\}_{t=1}^{\infty}$ and $\{{\bm{v}}_{t,j}\}_{t=0}^{\infty}$ in Algorithm 1 satisfy the condition that ${\bm{v}}_{t,j}-\beta_{2}{\bm{v}}_{t-1,j}\geq(1-\beta_{2}){\bm{g}}_{t,j}^{2}$ and ${\bm{m}}_{0,j}=0\leq\sqrt{{\bm{v}}_{0,j}}$ . By Lemma 4.2 we have

The denominator goes to $\infty$ when $T\rightarrow\infty$ . So it suffices to bound the last two terms in the numerator by constants in order to show $\left\|{\bm{\Delta}}_{\infty}\right\|\leq 1$ . Because $\eta_{t}$ is non-increasing in $t$ , it holds that

For the last term, we first analyze the coefficient between each $\ln{{\bm{v}}_{t,j}}$ . Define $\alpha_{t}=\eta_{t}\frac{1-\beta_{1}^{t-1}}{1-\beta_{1}}-\sum_{i=1}^{T-t}\eta_{t+i}\beta_{1}^{i-1}$ . We claim that $\left|\alpha_{t}\right|\leq\max{\{\frac{\beta_{1}^{t-1}}{1-\beta_{1}}\eta_{t+1},\frac{\eta_{t}}{1-\beta_{1}}\}}=\frac{\eta_{t}}{1-\beta_{1}}$ . This is because

and again by monotonicity of learning rates $\eta_{t}$ , we have that

We can also have $\ln{\frac{{\bm{v}}_{t,j}}{{\bm{v}}_{1,j}}}\geq(t-1)\ln{\beta_{2}}$ because

And there exists $t^{\prime}$ such that $\ln{\frac{{\bm{v}}_{t,j}}{{\bm{v}}_{1,j}}}\leq 0$ for any $t\geq t^{\prime}$ because $\lim_{t\rightarrow\infty}{\bm{v}}_{t,j}=0$ . Then

Define $C:=\frac{(\beta_{2}-\beta_{1})\eta_{1}\beta_{1}}{(1-\beta_{2})(1-\beta_{1})}+\frac{\beta_{2}-\beta_{1}}{1-\beta_{2}}\left(\sum_{t=2}^{t^{\prime}}\eta_{t}\left|\ln{{\bm{v}}_{t,j}}\right|-\frac{\eta_{1}\beta_{1}^{2}\ln{\beta_{2}}}{(1-\beta_{1})^{2}}\right)$ , we now have

Then $\left|{\bm{\Delta}}_{\infty,j}\right|=\left|\lim\limits_{T\to\infty}\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t,j}}{\sum_{t=1}^{T}\eta_{t}}\right|\leq\lim\limits_{T\to\infty}\left|\frac{\sum_{t=1}^{T}\eta_{t}{\bm{\Delta}}_{t,j}}{\sum_{t=1}^{T}\eta_{t}}\right|\leq 1$ because $\sum_{t=1}^{\infty}\eta_{t}=\infty$ . This completes the proof.

B.4 Proof for upper bound for norm of iterates in $\mathtt{AdamW}$

For $\mathtt{AdamW}$ with constant learning rate $\eta$ and each coordinate $j$ , ${\bm{x}}_{T,j}$ can be written as weighted average of past update

Define $\eta_{t}=\eta(1-\lambda\eta)^{T-t}$ for $1\leq t\leq T$ . We apply Lemma 4.2 on $\{{\bm{v}}_{t,j}\}_{t=1}^{T}$ and $\{{\bm{g}}_{t,j}\}_{t=1}^{T}$ to bound $\left|\sum_{t=1}^{T}\eta(1-\lambda\eta)^{T-t}\frac{{\bm{m}}_{t,j}}{\sqrt{{\bm{v}}_{t,j}}}\right|$ .
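The weighted-average representation used here can be checked on a single coordinate. Unrolling ${\bm{x}}_{t,j}=(1-\lambda\eta){\bm{x}}_{t-1,j}-\eta\frac{{\bm{m}}_{t,j}}{\sqrt{{\bm{v}}_{t,j}}}$ gives ${\bm{x}}_{T,j}=(1-\lambda\eta)^{T}{\bm{x}}_{0,j}-\sum_{t=1}^{T}\eta(1-\lambda\eta)^{T-t}\frac{{\bm{m}}_{t,j}}{\sqrt{{\bm{v}}_{t,j}}}$. The sketch below compares the recursion with this unrolled form, assuming $m_{0}=v_{0}=0$, no bias correction, no $\epsilon$, and an arbitrary random gradient sequence; these conventions and all names are illustrative assumptions.

```python
import math
import random


def adamw_coordinate(x0, grads, eta, lam, beta1, beta2):
    """Run scalar AdamW with constant learning rate and return (x_T, updates).

    Assumes m_0 = v_0 = 0, no bias correction and no epsilon (illustrative setup).
    """
    x, m, v = x0, 0.0, 0.0
    updates = []
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        delta = m / math.sqrt(v)
        x = (1.0 - lam * eta) * x - eta * delta
        updates.append(delta)
    return x, updates


rng = random.Random(1)
T = 300
grads = [rng.gauss(0.0, 1.0) for _ in range(T)]   # gradients may be arbitrary here
x0, eta, lam = 50.0, 0.1, 0.5                     # illustrative values with lam * eta < 1

x_T, updates = adamw_coordinate(x0, grads, eta, lam, beta1=0.9, beta2=0.9)

# Unrolled form: decayed initialization minus a weighted sum of past updates,
# with weights eta_t = eta * (1 - lam*eta)^(T - t) for t = 1..T.
weights = [eta * (1.0 - lam * eta) ** (T - t) for t in range(1, T + 1)]
unrolled = (1.0 - lam * eta) ** T * x0 - sum(w * d for w, d in zip(weights, updates))
weight_sum = sum(weights)
```

The weights sum to $\frac{1-(1-\lambda\eta)^{T}}{\lambda}\leq\frac{1}{\lambda}$. In this run $\beta_{1}=\beta_{2}$, so every update has magnitude at most $1$ and $\left|{\bm{x}}_{T,j}\right|\leq(1-\lambda\eta)^{T}\left|{\bm{x}}_{0,j}\right|+\frac{1}{\lambda}$ follows directly.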

We first compute $\sum_{t=1}^{T}\eta_{t}=\frac{1-(1-\lambda\eta)^{T}}{\lambda}\leq\frac{1}{\lambda}$ . For the second term in Lemma 4.2, we have that

For the last term, we define $\alpha_{t}=\eta_{t}\frac{1-\beta_{1}^{t-1}}{1-\beta_{1}}-\sum_{i=1}^{T-t}\eta_{t+i}\beta_{1}^{i-1}$ and we can compute the exact form of $\alpha_{t}$ as follows

Then we can bound the last term by showing that

Appendix C Experimental Details and More Results

The architecture of the two-layer transformer is the same as in Kunstner et al. (2022), which is also used as a tutorial example in PyTorch. It consists of a 200-dimensional embedding layer, $2$ transformer layers and a linear layer. Each transformer layer consists of a $2$ -head self-attention and an MLP with a hidden dimension $200$ . The experiments are run on a single A4000 or A6000.