DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule

Maor Ivgi, Oliver Hinder, Yair Carmon

Introduction

While stochastic optimization methods drive continual improvements in machine learning, choosing the optimization parameters—and particularly the learning rate—remains a difficulty. Standard methodologies include searching over a set of learning rates, or simply picking the learning rate from prior work. The former incurs a substantial computational overhead, while the latter risks training a suboptimal model.

The rich literature on adaptive gradient methods (AdaGrad, Adam, and their many variants) offers optimization algorithms that better exploit problem structure [e.g., 29, 48, 34, 84, 57]. However, these methods still have a learning rate parameter that requires tuning. The theoretically-optimal value of this parameter depends on unknown problem properties. For example, on convex problems the optimal learning rate of AdaGrad is related to the distance between the initial point and the optimal solution, while in non-convex settings it is related to the function’s smoothness and initial optimality gap .

Parameter-free optimization aims to remove the need for such tuning by designing algorithms that achieve a near-optimal rate of convergence with almost no knowledge of the problem properties . Most works in this field [e.g., 58, 69, 21, 61, 10, 42, 110] use advanced online learning techniques to construct algorithms that, for the fundamental setting of stochastic convex optimization (SCO) with bounded stochastic gradients, achieve optimal rates of convergence up to logarithmic factors. While practical parameter-free algorithms exist [e.g. 68, 71, 47, 15], there is little research into practical parameter-free step size selection methods for SGD. Recently, Carmon and Hinder have shown that performing a careful bisection over the SGD step size yields a parameter-free optimization method that is optimal for SCO up to a double-logarithmic factor. While theoretically novel, on a practical level the result leaves much to be desired, as it essentially prescribes the standard recipe of running SGD multiple times with different learning rates.

In this work, we use key insights from Carmon and Hinder to go a step further and develop a parameter-free step size schedule. For SGD iterations of the form xt+1=xtηtgtx_{t+1}=x_{t}-\eta_{t}g_{t}, where xtx_{t} denotes the model parameters at the tt’th iteration and gtg_{t} denotes the stochastic gradient of the loss function, our proposed steps size sequence is (for all t1t\geq 1)

In words, the step size at iteration tt is the maximum distance to between the initial point and observed iterates, divided by the sum of squared stochastic gradient norms, i.e., Distance over Gradients (DoG). At the first step, we set η0\eta_{0} to be rϵ/g0r_{\epsilon}/\|g_{0}\|, i.e., we take a normalized gradient step of size rϵr_{\epsilon}; we show that, as long as rϵr_{\epsilon} is small, its precise setting has only mild effect.

Crucially, DoG has no multiplicative “learning rate” parameter: if one considers step sizes of the form ηt=cmaxitxix0itgi2\eta_{t}=c\cdot\frac{\max_{i\leq t}\|x_{i}-x_{0}\|}{\sqrt{\sum_{i\leq t}\|g_{i}\|^{2}}} then c=1c=1 is a universally good setting (see Section 2 for a heuristic justification and Section 4.3 for empirical evidence for this claim).

Figure 1 highlights key aspects of DoG. The top row shows the DoG step size sequence for different values of rϵr_{\epsilon} in convex (left) and non-convex (right) stochastic optimization problems. The DoG step size increases rapidly (note the logarithmic xx scale) and stabilizes around values close to the optimal SGD step size with little dependence on rϵr_{\epsilon}. The bottom row of the figure compares the test errors of DoG and SGD with various step sizes, showing that (for all choices of rϵr_{\epsilon}) DoG is on par with well-tuned SGD.

1 Summary of results

In Section 3 we analyze DoG for stochastic convex optimization with bounded stochastic gradients and a (potentially unbounded) closed convex domain. To present our results, let B\mathcal{B} denote a ball around the initial point x0x_{0} with radius 3d03d_{0}, where d0d_{0} is the distance between x0x_{0} and an optimum.

First, we show that if the iterates of DoG remain in B\mathcal{B}, then with high probability DoG achieves a convergence rate that is optimal up to a factor of O(log(1+d0rϵ))O(\log(1+\frac{d_{0}}{r_{\epsilon}})). In practice, DoG appears to indeed be stable as long as rϵr_{\epsilon} is sufficiently small. However, DoG is not always stable: on pathological functions its iterates can move far from the optimum.

To address this, we consider a theoretical, tamed variant of DoG, which we call T-DoG, whose step sizes are smaller by a logarithmic factor. We prove that, with high probability, the T-DoG iterates never leave B\mathcal{B}. Thus, we obtain a high probability parameter-free convergence guarantee that is optimal up logarithmic factors.

To our knowledge, this is the first dynamic SGD step size schedule to attain such theoretical guarantee, and only the third high probability parameter-free guarantee in the literature [following 11, 109]. Moreover, it is the first parameter-free result assuming only locally bounded stochastic gradients (i.e., in the set B\mathcal{B}). This is significant since the usually-assumed global stochastic gradient bound does not exist in many problems, including least squares.

Empirical study.

Our experiments in Section 4 focus on fine-tuning neural networks, because this is a practically important setting that still allows for thorough experiments at a reasonable computational budget. We also perform a small-scale experiment with training a neural network from scratch. Our experiments span 23 natural language understanding and image classification tasks and 8 popular model architectures.

Our results indicate that, compared to DoG, SGD with a cosine step size schedule and tuned base learning rarely attains a relative error improvement of more than 5% (e.g., the difference between accuracy 95% and 95.25%). For convex problems (linear probes), the relative difference in errors is below 1%. In our testbed, well-tuned Adam tends to outperform both SGD and DoG, but a layer-wise version of DoG (which we call L-DoG) closes some of this performance gap.

We also test the sensitivity of DoG to the value of rϵr_{\epsilon}. We find that for most model/task combinations, DoG performs consistently well across a wide range of rϵr_{\epsilon} values as our theory predicts. However, in certain cases, choosing rϵr_{\epsilon} to be too low results in poor performance. We provide some preliminary findings showing that this is due in part to batch normalization.

Put together, our theory and experiments suggest DoG has the potential to save significant computation currently spent on learning rate tuning at little or no cost in performance—especially if we reinvest some of the saved computation in training a larger model on more data.

Algorithm Derivation

Before providing rigorous theoretical guarantees for DoG, in this section we explain the origin of the algorithm. Our starting point is the following result by Carmon and Hinder . Suppose we run TT iterations of SGD with fixed step size η\eta, i.e., the recursion xt+1=xtηgtx_{t+1}=x_{t}-\eta g_{t}, where xtx_{t} is the SGD iterate and gtg_{t} is the stochastic gradient at step tt. If, for some c(0,1)c\in(0,1), it happens to hold that

then the averaged iterates satisfies an excess loss bound that is at most a factor 1c(1c2)\frac{1}{c(1-c^{2})} larger than the worst-case optimal bound achieved by perfectly tuned SGD.This results holds in the non-stochastic case [11, Proposition 1], but a qualitatively similar results holds with high probability in the stochastic case as well [11, Proposition 3].

The condition (1) is an implicit equation: it allows us to check whether the choice of step size η\eta is good only after running TT steps of SGD using that η\eta. Solving this implicit equation therefore requires multiple calls to SGD. We derive the DoG step size sequence by making the equation explicit: we choose ηt\eta_{t} so that equation (1) holds at each step. For c=1c=1, this yields the step size formula (DoG). Our reason for choosing c=1c=1 is that it is the threshold under which a solution to the implicit equation yields an optimal rate of convergence. Therefore, in practice we expect 11 to be close to the highest stable value of cc, and thus obtain the best performance; we verify this empirically in Section 4.3.

Theoretical Analysis

The function ff is convex, its domain X\mathcal{X} is closed and convex, and its minimum is attained at some xXx_{\star}\in\mathcal{X}, i.e., finfxXf(x)=f(x)f_{\star}\coloneqq\inf_{x\in\mathcal{X}}f(x)=f(x_{\star}).

In Appendix A we discuss a possible relaxation of convexity under which our results continue to hold.

Algorithm statement.

We study (projected) SGD with dynamic learning rate schedule {ηt}\{\eta_{t}\}, i.e.,

where abmax{a,b}a\vee b\coloneqq\max\{a,b\} and rϵr_{\epsilon} is a small user-specified initial movement size parameter. With this notation, we define a family of DoG-like learning rate schedules.

for a positive nondecreasing sequence GtG_{t}^{\prime} that depends only on x0,g0,,gtx_{0},g_{0},\ldots,g_{t} and satisfies GtGtG_{t}^{\prime}\geq G_{t}.

DoG corresponds to simply setting Gt=GtG_{t}^{\prime}=G_{t}; in Section 3.3 we consider a theoretical (or tamed) DoG-like algorithm for which we guarantee bounded iterates by making GtG_{t}^{\prime} larger than GtG_{t} by polylogarithmic factors. Throughout, we bound the error of the weighted average sequence

Finally, to streamline the analysis we define:

Logarithm conventions.

Throughout the paper log\log is base ee and log+()1+log()\log_{+}\left(\cdot\right)\coloneqq 1+\log(\cdot).

2 Optimality gap bounds assuming bounded iterates

In this section, we bound the optimality gap attained by any DoG-like algorithm. Our bounds depend on the quantities rˉT\bar{r}_{T} and GTG_{T}, and are nearly optimal when rˉT=O(d0)\bar{r}_{T}=O(d_{0}) (i.e., the DoG iterates don’t move too far away from x0x_{0}) and GTG_{T}^{\prime} is not much larger than GTG_{T}. In the next section we describe a specific DoG-like algorithm that is guaranteed to satisfy both requirements.

Convexity and Jensen’s inequality imply that xˉt\bar{x}_{t} satisfies

The sum in the RHS decomposes to two components:

where Δkgkf(xk)\Delta_{k}\coloneqq g_{k}-\nabla f(x_{k}). We give probability 1 bounds for the weighted regret (Lemma 1) and high probability bounds for the noise term (Lemma 2). In each case, the key challenge is replacing a-priori bounds on d0d_{0} (or the domain size) with the empirically observed rˉT\bar{r}_{T}. We present and discuss each lemma in turn.

If X\mathcal{X} is a closed convex set then any DoG-like scheme (Definition 1) satisfies k=0t1rˉk<gk,xkx>rˉt(2dˉt+rˉt)Gt1\sum_{k=0}^{t-1}\bar{r}_{k}\left<g_{k},x_{k}-x_{\star}\right>\leq\bar{r}_{t}(2\bar{d}_{t}+\bar{r}_{t})\sqrt{G_{t-1}^{\prime}}, t1\forall t\geq 1.

Therefore, k=0t1rˉk<gk,xkx>\sum_{k=0}^{t-1}\bar{r}_{k}\left<g_{k},x_{k}-x_{\star}\right> is at most

We bound the terms (A)(A) and (B)(B) in turn, beginning with the former:

Inequality (i)(i) uses dkdˉtd_{k}\leq\bar{d}_{t} and that GkG_{k}^{\prime} is nondecreasing as per Definition 1. Inequality (ii)(ii) holds since, for sarg maxktdks\in\operatorname*{arg\,max}_{k\leq t}d_{k}, we have dˉt2dt2=ds2dt2=(dsdt)(ds+dt)xsxt(ds+dt)(rˉs+rˉt)(ds+dt)4rˉtdˉt\bar{d}_{t}^{2}-d_{t}^{2}=d_{s}^{2}-d_{t}^{2}=(d_{s}-d_{t})(d_{s}+d_{t})\leq\|x_{s}-x_{t}\|(d_{s}+d_{t})\leq(\bar{r}_{s}+\bar{r}_{t})(d_{s}+d_{t})\leq 4\bar{r}_{t}\bar{d}_{t}. Bounding the second term (B)(B), we have:

where the final inequality uses the standard Lemma 4 with ak=Gk=ikgi2a_{k}=G_{k}=\sum_{i\leq k}\|g_{i}\|^{2}. ∎

While the proof of Lemma 1 is similar to the analysis of adaptive SGD where ηt=ρGt\eta_{t}=\frac{\rho}{\sqrt{G_{t}}} , there are a couple of key differences. First, the DoG step sizes can increase, which typically makes adaptive gradient methods difficult to analyze . We bypass this difficulty by considering regret weighted by rˉk\bar{r}_{k}, which factors out the increasing portion of the step size. Second, the standard adaptive SGD analysis yields a bound proportional to dˉt2\bar{d}_{t}^{2} (typically further bounded using the domain diameter) rather than rˉtdˉt\bar{r}_{t}\bar{d}_{t} as in our bound. This is a crucial difference, since—as we soon argue—rˉt\bar{r}_{t} “cancels” when dividing through by k<trˉk\sum_{k<t}\bar{r}_{k}, while dˉt\bar{d}_{t} does not. We obtain the improved result by keeping around the term dt2Gt1-d_{t}^{2}\sqrt{G_{t-1}^{\prime}} in the bound for (A)(A) above; a trick similar to Carmon and Hinder [11, Lemma 1].

Next, we handle the noise term in (4), recalling the notation Δtgtf(xt)\Delta_{t}\coloneqq g_{t}-\nabla f(x_{t}) and θt,δlog60log(6t)δ\theta_{t,\delta}\coloneqq\log\frac{60\log(6t)}{\delta}.

The proof of Lemma 2 appears in Appendix C.1 and is based on a new concentration bound, Lemma 7, which allows us to bound the noise term despite having no deterministic bound on the magnitude of the martingale difference sequence rˉk<Δk,xkx>\bar{r}_{k}\left<\Delta_{k},x_{k}-x_{\star}\right>. The proof of Lemma 7 involves combining time-uniform Bernstein bounds and a general bound on the cumulative sums of sequence products (Lemma 5), which may be of independent interest.

Combining the above results, we obtain the following.

Follows from Equations 3 and 4, Lemma 1, Lemma 2 and the fact that dˉtd0+rˉt\bar{d}_{t}\leq d_{0}+\bar{r}_{t}. ∎

The following algebraic fact shows that there is always an iteration τT\tau\leq T where the denominator i<trˉirˉtΩ(T/logrˉTrϵ)\sum_{i<t}\frac{\bar{r}_{i}}{\bar{r}_{t}}\geq\Omega(T/\log\frac{\bar{r}_{T}}{r_{\epsilon}}); see Section B.1 for proof.

Let s0,s1,,sTs_{0},s_{1},\ldots,s_{T} be a positive nondecreasing sequence. Then

Combining Proposition 1 and Lemma 3 yields the following (see short proof in Section C.2).

Corollary 1 is immediately useful when X\mathcal{X} is bounded but its exact diameter is unknown, for example when X\mathcal{X} is a polytope as is common in two-stage stochastic programming .

Furthermore, in realistic DoG trajectories, even the multiplicative term logrˉTrϵ\log\frac{\bar{r}_{T}}{r_{\epsilon}} is likely too pessimistic. This is because rˉt\bar{r}_{t} typically increases rapidly for t0<1000t_{0}<1000 steps and then plateaus (see Figure 12 in the appendix). Consequently, rˉi/rˉt1/10\bar{r}_{i}/\bar{r}_{t}\geq 1/10 for most of the optimization trajectory, and i<trˉirˉtt/10t0\sum_{i<t}\frac{\bar{r}_{i}}{\bar{r}_{t}}\geq t/10-t_{0}. Substituting back into Proposition 2, we get that xˉT\bar{x}_{T} is O(d0LTt0θT,δ)O\left(\frac{d_{0}L_{\star}}{\sqrt{T-t_{0}}}\theta_{T,\delta}\right) suboptimal.

DoG can run wild.

While DoG is empirically stable, there exist (non-stochastic) examples where rˉt\bar{r}_{t} grows much larger than d0d_{0}: in Appendix C.3 we describe a variant of Nemirovski’s function for which rˉt=rϵt\bar{r}_{t}=r_{\epsilon}\sqrt{t} and therefore rˉt/d0\bar{r}_{t}/d_{0} diverges as tt grows. Next, we show that by slightly decreasing the DoG step sizes we can guarantee that rˉT/d03\bar{r}_{T}/d_{0}\leq 3 with high probability.

3 Iterate stability bound

This section introduces a new DoG-like step size scheme whose iterates are guaranteed to remain bounded with high probability. We call this scheme T-DoG, where the T stands for “theoretical” or “tamed.” The step sizes are given by ηt=rˉt/Gt\eta_{t}={\bar{r}_{t}}/{\sqrt{G_{t}^{\prime}}}, where

We are ready to state T-DoG’s key property: guaranteed iterate stability.

We defer the full proof to Section C.4 and proceed to highlight the key argument by proving the result in the noiseless case.

In the noiseless case we have gk=f(xk)g_{k}=\nabla f(x_{k}) and therefore <gk,xkx>f(xk)f0\left<g_{k},x_{k}-x_{\star}\right>\geq f(x_{k})-f_{\star}\geq 0. Substituting into (5) and rearranging gives dk+12dk2ηk2gk2d_{k+1}^{2}-d_{k}^{2}\leq\eta_{k}^{2}\|g_{k}\|^{2}. Assuming by induction that rˉt3d0\bar{r}_{t}\leq 3d_{0} and telescoping yields dk+12dk2ηk2gk2d_{k+1}^{2}-d_{k}^{2}\leq\eta_{k}^{2}\|g_{k}\|^{2}. Assuming by induction that rˉt3d0\bar{r}_{t}\leq 3d_{0} and telescoping yields

where (i)(i) uses that gk2=GkGk1\|g_{k}\|^{2}=G_{k}-G_{k-1} (with the shorthand G10G_{-1}\coloneqq 0) and

With all the ingredients in hand, we state the main guarantee for T-DoG.

where cδ,rϵ,T=log+(Td0Lf(x0)f)log+(d0rϵ)log(log+(T)δ)c_{\delta,r_{\epsilon},T}=\log_{+}\left(T\frac{d_{0}L_{\star}}{f(x_{0})-f_{\star}}\right)\log_{+}\left(\frac{d_{0}}{r_{\epsilon}}\right)\log\left(\frac{\log_{+}(T)}{\delta}\right).

Theorem 1 yields the optimal convergence bound up to logarithmic factors. To the best of our knowledge this is the first parameter-free stochastic optimization method that does not require the stochastic gradients to be uniformly bounded across the domain X\mathcal{X} and instead produces a bound that depends on the ‘local’ gradient bound LL_{\star}. Crucially, the T-DoG step size formula does not require advance knowledge of LL_{\star}.There is prior work that develop methods with steps that do not require a global Lipschitz bound , but these methods do not guarantee that iterates remain in a ball of radius O(d0)O(d_{0}) around the initial point. Consequently, the rates of convergence of these methods cannot be expressed in terms of a quantity like LL_{\star}.

While the weighted iterate average (2) is convenient to our analysis, bounds similar to Proposition 1, Corollary 1 and Theorem 1 hold also for the standard unweighted iterate average x^T=1Tt=0T1xt\hat{x}_{T}=\frac{1}{T}\sum_{t=0}^{T-1}x_{t}. For x^T\hat{x}_{T} it is also straightforward to show a 1/T1/T error bound for DoG in the smooth noiseless case. See Appendix D for details.

Experiments

To test DoG in practical scenarios, we perform extensive experiments over a diverse set of tasks and model architectures in both the vision and language domains. We construct a testbed that consists of over 20 tasks and 7 model architecture, covering natural language understanding and computer vision (Section 4.1). In this testbed we compare DoG to SGD and Adam (Section 4.2), showing that DoG performs on par with tuned SGD, but not as well as tuned Adam. Nevertheless, a per-layer version of DoG (defined below) closes much of this gap with Adam without requiring tuning. We also use our testbed to analyze the sensitivity of DoG to its fixed parameters (Section 4.3), and demonstrate its effectiveness in convex logistic regression settings (Section 4.4). Finally, we apply DoG and L-DoG to fine-tuning a CLIP model on ImageNet (Section 4.5) and training a CIFAR10 model from scratch (Section 4.6), and provide preliminary comparison to previously-proposed tuning free methods (Section 4.7). A PyTorch implementation of DoG is available at https://github.com/formll/dog.

Neural models in general and transformer-based models in particular often benefit from using a per-parameter or per-layer step sizes . With this in mind, we consider a per-layer version of DoG, which we call L-DoG, where we apply the (DoG) formula separately for every layer. Namely, if we consider xtlx_{t}^{l} to be the weights in layerMore precisely, our implementation treats each element in the PyTorch .parameters() list as a separate layer. ll at step tt, then we set the learning rate for that layer to be ηtl=maxitxilx0litgil2+ϵ\eta_{t}^{l}=\frac{\max_{i\leq t}\|x_{i}^{l}-x_{0}^{l}\|}{\sqrt{\sum_{i\leq t}\|g_{i}^{l}\|^{2}+\epsilon}}, where ϵ=108\epsilon=10^{-8} is added to the denominator for numerical stability. While we do not provide theoretical guarantees for L-DoG, we show below that it performs well in practice.

1 Fine-tuning testbed

Our main experiments focus on fine-tuning pre-trained models, which allows us to experiment with advanced models while also thoroughly tuning the learning rate for the baseline optimizers, using an academic computational budget.

For each baseline algorithm, we use best-practice learning rate schedule (cosine annealing for all experiments, with a warmup stage for language experiments) and sweep over the peak learning rate for each model/task pair. We give each pair a fixed step budget designed to suffice for convergence, performing evaluation throughout the training. In all cases, we use polynomial decay averagingWe apply the weight averaging with a fixed parameter (γ=8\gamma=8, following ); we did not try any other parameter in our experiments. as proposed by Shamir and Zhang , and select the best checkpoint (either averaged or not) based on evaluation performance. We repeat relevant learning setups with 5 different seeds, and report the mean performance across the seeds. For simplicity, we do not use weight decay throughout. The complete set of hyper-parameters appears in Appendix E.

Natural language understanding (NLU).

To test DoG’s efficacy in modern NLU, we use it to fine-tune transformer language models on the well-studied GLUE benchmark which measures models’ performance on diverse text classification tasks (listed in Section E.3).

Additionally, we fine-tune models on SQuAD 1.1, a question answering dataset . We fine-tune a RoBERTa-base checkpoint and T5-base .Throughout the paper we often use the shorthand names RoBERTa-b and T5-b, respectively. For each task, we use the official evaluation metrics defined in Wang et al. and Rajpurkar et al. as well as their original proposed splits, and report the results over the evaluation set.

Computer vision.

We also fine-tune 5 models architectures on 12 different computer vision tasks from the VTAB benchmark (see Section E.3); of the other 7 tasks in VTAB, 5 are trivial (accuracy greater than 99%) and 2 have small validation splits leading to unreliable model selection. We follow the training, validation and test splits defined in VTAB, and report performance on the test split (using the validation split for model selection). We fine-tune 5 models: VGG11 , ResNet50 , Densenet121 , ViT-B/32 , and ConvNeXt-T , where the ViT model is pre-trained on ImageNet 21K and the others are trained on ImageNet 1K .

Normalized performance metric.

A positive RED value indicates that optimizer xx is better than DoG, and a negative value indicates the opposite. When the absolute value of RED is beneath a few percentage points, the compared methods are nearly equivalent. For example, a 5% RED is equivalent to the difference between accuracy 95% and 95.25%.

Our theoretical analysis suggests that the particular choice of rϵr_{\epsilon} does not matter as long as it is sufficiently small relative to the distance between the weight initialization x0x_{0} and the optimum. Consequently, for vision experiments we set rϵ=α(1+x0)r_{\epsilon}=\alpha\cdot(1+\|x_{0}\|) for α=104\alpha=10^{-4}, assuming that the distance to the optimum is more than 0.01% of the initialization norm. For language experiments, this assumption turned out to be wrong (causing DoG to diverge in some cases), and we decreased α\alpha to 10610^{-6} for DoG and to 10810^{-8} for L-DoG, where the additive 10610^{-6} term was too large in some layers. We believe that 10610^{-6} and 10810^{-8} should be good defaults for DoG and L-DoG, respectively, though networks with batch normalization or different initialization schemes could require a larger value; see Section 4.3 for additional discussion.

2 Comparison of fine-tuning performance

Figure 2 depicts the median, IQR (inter-quantile range) and mean RED of each model,When aggregating results over tasks, we always report the RED statistics across tasks, where for each task we average the RED values over seeds. See Section E.5 for details. when trained with SGD and Adam with different peak learning rates. The figure shows that, when comparing across models, there is no good default learning rate for neither SGD nor Adam. Moreover, even for a single model only very specific SGD learning rate performs well, while most are considerably inferior to using DoG. Even when tuned to the best fixed learning-rate value per model (which we refer to as model tuned LR), some tasks may still fail (compared to DoG) as indicated by the large IQR and the gap between the mean (triangles) and the median RED (circles) in models such as ViT-B/32 ad Densenet121. While Adam also requires tuning, it is somewhat less sensitive than SGD to the choice of peak learning rate. For a full breakdown of performance per task, see Figure 7 and Sections F.1 and F.1 in Section F.1.

DoG performs similarly to well-tuned SGD in 79 out of the 80 model/task combinations in our testbed. The one exception is tuning T5-b on CoLA, where DoG behaves erratically while SGD succeeds only with a few learning rates. In contrast, both Adam and L-DoG achieved reasonable performance consistently. DoG’s poor performance on CoLA results in high RED measures for this case, which draw the mean RED (triangles) above the median one in Figure 2 for T5-b. We further analyze this exception in Section F.3 and show that choosing significantly smaller rϵr_{\epsilon} for DoG alleviates the problem.

Figure 3 (top) compares DoG to SGD with model tuned LR as defined above, as well as instance tuned LR, where for each model/task pair we select the best learning rate, at a computational expense 5–7 times larger than running DoG. The performance of DoG remains close to that of SGD with instance-tuned LR, with the largest median RED observed for ResNet50 and ViT-B/32.

Figure 3 (bottom) compares DoG to model-tuned and instance-tuned Adam, as well as to L-DoG. In a few cases (namely ResNet50 and ConvNeXt-T) the gaps between DoG and Adam are significant, and favor Adam. We hypothesize this is due to Adam’s per-parameter step-sizes and momentum mechanisms, which DoG does not exploit. L-DoG, which has per-layer steps, has positive median RED for all models, and narrows the gap between DoG and Adam, particularly for ResNet50.

The instance-tuned baselines consume significantly more compute than DoG and L-DoG. In Section F.2 we equalize the compute budget by reducing the number of steps for SGD and Adam. This makes DoG outperform instance-tune SGD in most cases, and brings L-DoG substantially closer to Adam.

3 Sensitivity of DoG’s fixed parameters

Our theory suggests that all sufficiently small choices of rϵr_{\epsilon} should perform similarly, but choosing rϵr_{\epsilon} too large (compared to the initial distance to the optimum) can hurt the performance of the algorithm. In Figure 4 (left) we plot the test performance as a function of rϵr_{\epsilon} for 8 model/task combinations. For 7 out of the 8, DoG is highly robust to the value of rϵr_{\epsilon} as long as it small enough, as predicted. However, ResNet50 on CIFAR-100 (bottom left) is an exception, where smaller values of rϵr_{\epsilon} result in an accuracy drop. We hypothesize this is due to scale invariance introduced by batch normalization (BN), and provide supporting evidence for that in Section F.4 (Figure 10), where we show that DoG is insensitive to rϵr_{\epsilon} when we turn off BN. In the appendix we also provide a complementary diagnostic for rϵr_{\epsilon} sensitivity by plotting ηt\eta_{t} vs. η0\eta_{0} for different values of tt (see Figure 8).

Base learning rate.

For this experiment only, we consider variants of DoG with different values of base learning, i.e., step sizes of the form ηt=cmaxitxix0itgi2\eta_{t}=c\cdot\frac{\max_{i\leq t}\|x_{i}-x_{0}\|}{\sqrt{\sum_{i\leq t}\|g_{i}\|^{2}}} with different values of cc. We expect optimal performance when cc is close to 1. More specifically, we expect the algorithm to be unstable when c>1c>1 and to be slower to converge (and less likely to generalize well) when c<1c<1. As can be observed in Figure 4 (right), values around c=1c=1 perform well for all models. For smaller values, there is indeed inferior performance in some models (mainly ResNet50 and RoBERTa-b)—indicating T-DoG would not work well in practice—while larger values result in divergence (in 6 out of 8 cases). Hence, the useful range for cc is very narrow (about [0.5, 1.5]) and tuning it is not likely to produce significant improvements. This is in contrast to Adam and SGD which generally require searching over a space spanning a few orders of magnitude to properly train a model.

4 Convex optimization

We also evaluate DoG on convex optimization tasks, matching the assumptions of our theoretical analysis. To do so, we perform multi-class logistic regression on features obtained from the computer vision models in our testbed, i.e., linear probes. We find that model-tuned SGD performs on par or worse than DoG, while instance-tuned SGD barely gains any advantage (Figure 5), with RED values well under 1% (corresponding to the difference between accuracies 90% and 90.1%). Moreover, even in this simple setting, SGD is sensitive to the choice of learning rate, which differ significantly between models (Figure 6).

5 Fine-tuning on ImageNet

To complement our main fine-tuning testbed, we perform a more limited experiment involving ImageNet as a downstream task, which is more expensive to tune due its larger scale. We fine-tune a ViT-B/32 CLIP model and compare DoG and L-DoG to training with SGD or AdamW . We use a training prescription similar to Wortsman et al. ; see Section E.7 for additional details. Table 1 shows the ImageNet top-1 validation accuracies of the final model checkpoints, with and without the polynomial decay averaging used throughout our experiments. DoG performs similarly to SGD, but both algorithms perform significantly worse than AdamW, perhaps due to an insufficient iteration budget. L-DoG performs well in this setting, improving on AdamW by a little over 1 point.

6 Training from scratch

We conduct a preliminary experiment with training a model from scratch, specifically a Wide ResNet 28-10 on CIFAR-10 ; see Section E.8 for details. Table 2 shows the test accuracy of the final checkpoint, with and without the polynomial averaging used throughout our experiments. Here DoG performs on par with the setting’s canonical training prescription of SGD with momentum 0.9 and learning rate 0.1 . In this setting Adam produces poorer results, and L-DoG is 0.5 point worse than tuned Adam with the best learning rate, perhaps due to not reaching convergence.

7 Comparison to other tuning-free methods

We perform preliminary comparisons between DoG and L-DoG and other methods for removing the learning rate parameter: the Stochastic Polyak Step , D-Adaptation and Continuous Coin Betting (COCOB) . In all cases, we find that DoG and L-DoG provide better performance on most tasks and on average (see Sections G.4 and G.4). We provide detailed results in Appendix G, where we also discuss the practical prospects of the bisection procedure of Carmon and Hinder .

Related Work

Previous attempts to design theoretically principled and practical optimization algorithms that do not require learning rate tuning approach the problem from a variety of perspectives, resulting in a large variety of proposed algorithms. Rolinek and Martius , Vaswani et al. , Paquette and Scheinberg lift classical line search technique from non-stochastic optimization to the stochastic setting, while Berrada et al. , Loizou et al. do the same for the classical Polyak step size . Asi and Duchi develop a class of algorithms based on stochastic proximal methods and demonstrate their improved robustness both theoretically and empirically. Schaul et al. use a stochastic quadratic approximation for designing learning rates that maximize the expected one-step objective decrease. Chandra et al. nest hypergradient descent to make a method that is insensitive to initial hyper-parameter choices. However, none of these results are parameter-free in the same sense as DoG: they either do not have converges guarantees, or have suboptimality bounds that blow up polynomially when the method’s parameters do not match a problem-dependent value. In contrast, parameter-free methods have convergence rates that depend at most logarithmically on algorithmic parameters.

While the parameter-free optimization literature has focused mainly on theoretical schemes, a number of works also include empirical studies . In particular, Orabona and Tommasi build on coin-betting schemes to design an algorithm for training neural networks that has AdaGrad-style convergence guarantees for quasi-convex functions, showing promising results on neural network training problems. In recent work Chen et al. obtain improved empirical results with an algorithm that leverages coin betting and truncated linear models. However, this method lacks theoretical guarantees.

In recent independent work Defazio and Mishchenko propose a parameter-free dynamic step size schedule of dual averaging. While our work has the same motivation and shares a number of technical similarities (including the use of weighted regret bounds and an independently obtained Lemma 3), the proposed algorithms are quite different, and dual averaging is rarely used in training neural networks. (See additional discussion in Section G.3). Moreover, Defazio and Mishchenko only prove parameter-free rates of convergence in the non-stochastic setting, while we establish high probability guarantees in the stochastic setting. Concurrently with our work, Defazio and Mishchenko heuristically extended their dual averaging scheme to SGD- and Adam-like algorithms, reporting promising experimental results.

Finally, a number of neural network optimization methods—LARS , LAMB , Adafactor , and Fromage —use the norm of neural network weights to scale the step size. DoG and L-DoG are similar in also using a norm to scale their step size, but they differ from prior work by considering the distance from initialization rather than the norm of the weights. We believe that this difference is crucial in making DoG parameter-free, while the above-mentioned method have a learning-rate parameter to tune (though Bernstein et al. report that a single default value works well across different tasks).

Limitations and Outlook

Our theoretical and empirical results place DoG as a promising step toward a new generation of principled and efficient tuning-free optimization algorithms. However, much additional work is necessary for these algorithms to become ubiquitous. First, it is important to understand how to correctly combine DoG with proven technique such as momentum, per-parameter learning rates, and learning rate annealing—this is a challenge both from a theoretical and a practical perspective. Second, it is important to gain a better understanding of situations where DoG is more sensitive to the choice of rϵr_{\epsilon} than theory would have us expect. Our preliminary investigations suggest a connection to batch normalization, and following that lead could lead to even more robust training methods. Finally, while our experiments aim to cover a broad range of tasks and architectures, future work needs to explore DoG in additional settings, particularly those involving training from scratch.

Acknowledgments

We thank Francesco Orabona, Mitchell Wortsman, Simon Kornblith and our anonymous reviewers for their insightful comments. This work was supported by the NSF-BSF program, under NSF grant #2239527 and BSF grant #2022663. MI acknowledges support from the Israeli council of higher education. OH acknowledges support from Pitt Momentum Funds, and AFOSR grant #FA9550-23-1-0242. YC acknowledges support from the Israeli Science Foundation (ISF) grant no. 2486/21, the Alon Fellowship, the Yandex Initiative for Machine Learning, and the Len Blavatnik and the Blavatnik Family Foundation.

References

Appendix A Relaxing the Convexity Assumption

This section describes relaxations of convexity under which our main theoretical results still hold. In particular, our results naturally extend to star-convex functions which satisfy

Our results also extend (with changed constants) to quasarconvex functions , which require that f(x)fc<f(x),xx>f(x)-f_{\star}\leq c\left<\nabla f(x),x-x_{\star}\right> holds for some c<c<\infty and all xXx\in\mathcal{X}. A further relaxation of star convexity requires it to hold only along the optimization trajectory:

There exists xarg minxf(x)x_{\star}\in\operatorname*{arg\,min}_{x}f(x) and constant c<c<\infty such that the iterates of SGD satisfy

Zhou et al. introduce this notion of a “star-convex path” and provide some empirical evidence that it may hold when training deep neural networks with SGD (see also Kleinberg et al. for a related assumption). Zhou et al. also prove that the assumption suffices to prove that SGD converges to the global minimizer; it suffices for DoG for similar reasons.

When substituting Assumption 1 with Assumption 3, our analysis goes through unchanged, except we can no longer use Jensen’s inequality to argue directly about the suboptimality of the point xˉτ\bar{x}_{\tau}. Instead, Theorem 1 with Assumption 3 says that, with probability at least 1δ1-\delta,

with ωkrˉki=0t1rˉi\omega_{k}\coloneqq\frac{\bar{r}_{k}}{\sum_{i=0}^{t-1}\bar{r}_{i}}, the final iterate tau τ\tau, and the coefficient cδ,rϵ,Tc_{\delta,r_{\epsilon},T} as defined in Theorem 1. (Note that Assumption 3 implies k=0t1ωk(f(xk)f)k=0t1ωk<f(xk),xkx>\sum_{k=0}^{t-1}\omega_{k}(f(x_{k})-f_{\star})\leq\sum_{k=0}^{t-1}\omega_{k}\left<\nabla f(x_{k}),x_{k}-x_{\star}\right> which replaces (3)).

We can turn the above bound into a constant-probability guarantee for a specific T-DoG iterate xKx_{K} by sampling KωK\sim\omega and using Markov’s inequality:

To obtain a high probability guarantee, we can make l=log1δl=\lceil\log\frac{1}{\delta}\rceil independent draws from ω\omega, denoted K1,,KlK_{1},\ldots,K_{l} and use the fact that

We remark that the literature contains a plethora of other convexity relaxations such as quasiconvexity , pseudoconvexity , Polyak-Łojasiewicz conditions and weak convexity . Exploring the convergence of DoG under these additional convexity relaxations is left to future work.

Appendix B Useful Algebraic Facts

Define Klog(sT/s0)K\coloneqq\lceil\log(s_{T}/s_{0})\rceil, and nTKn\coloneqq\lfloor\frac{T}{K}\rfloor. Then, we have

Rearranging and using the definition of KK gives

where the implication follows from monotonicity of the exponential function. Therefore,

where the first inequality uses that ss is positive nondecreasing sequence and the second inequality uses mink<Ksn(k+1)snke\min_{k<K}\frac{s_{n(k+1)}}{s_{nk}}\leq e as shown above. ∎

B.2 Lemma 4

Let a0,,ata_{0},\dots,a_{t} be a nondecreasing sequence of nonnegative numbers. Then

B.3 Lemma 5

Let ai=aiai1a^{\prime}_{i}=a_{i}-a_{i-1} and Bi=jibjB_{i}=\sum_{j\leq i}b_{j}. Then (by discrete integration by parts)

where (i)(i) uses the triangle and Hölder’s inequalities, and (ii)(ii) uses that ata_{t} is nonnegative and nondecreasing and therefore i=1t1ai+1ai=ata1at\sum_{i=1}^{t-1}|a_{i+1}-a_{i}|=a_{t}-a_{1}\leq a_{t}. ∎

B.4 Lemma 6

Recall that log+(z)1+log(t)\log_{+}(z)\coloneqq 1+\log(t).

Let a1,a0,a1,,ata_{-1},a_{0},a_{1},\dots,a_{t} be a nondecreasing sequence of nonnegative numbers, then

Appendix C Proofs for Section 3

We begin by citing the following corollary of a general bound due to Howard et al. . (Recall that θt,δlog60log(6t)δ\theta_{t,\delta}\coloneqq\log\frac{60\log(6t)}{\delta}).

Let c>0c>0 and XtX_{t} be a martingale difference sequence adapted to Ft\mathcal{F}_{t} such that Xtc\left|X_{t}\right|\leq c with probability 1 for all tt. Then, for all δ(0,1)\delta\in\left(0,1\right), and X^tFt1\hat{X}_{t}\in\mathcal{F}_{t-1} such that X^tc\lvert\hat{X}_{t}\rvert\leq c with probability 1,

Building on Corollary 2 we prove the following result, which allows for the sequence XtX_{t} to be almost-surely bounded by a random (rather than deterministic) quantity.

Let CtFt1C_{t}\in\mathcal{F}_{t-1} and let XtX_{t} be a martingale difference sequence adapted to Ft\mathcal{F}_{t} such that XtCt\left|X_{t}\right|\leq C_{t} with probability 1 for all tt. Then, for all δ(0,1)\delta\in\left(0,1\right), c>0c>0, and X^tFt1\hat{X}_{t}\in\mathcal{F}_{t-1} such that X^tCt\lvert\hat{X}_{t}\rvert\leq C_{t} with probability 1,

and note that they satisfy the requirements of Corollary 2: WtW_{t} is a martingale difference sequence adapted to Ft\mathcal{F}_{t} while W^tFt1\hat{W}_{t}\in\mathcal{F}_{t-1} and they both have absolute value bounded by 1 almost surely.

Writing Cˉt=max{c,Ct}\bar{C}_{t}=\max\{c,C_{t}\} for short, we have

where (i)(i) uses the fact that Cˉs=c\bar{C}_{s}=c for all sTs\leq T when ¬HT\neg H_{T} holds, and (ii)(ii) uses Corollary 2. ∎

Next, we connect Corollary 3 with a handy algebraic fact (Lemma 5) to obtain the following result, which underpins Lemma 2.

Let SS be the set of nonnegative and nondecreasing sequences. Let CtFt1C_{t}\in\mathcal{F}_{t-1} and let XtX_{t} be a martingale difference sequence adapted to Ft\mathcal{F}_{t} such that XtCt\left|X_{t}\right|\leq C_{t} with probability 1 for all tt. Then, for all δ(0,1)\delta\in\left(0,1\right), c>0c>0, and X^tFt1\hat{X}_{t}\in\mathcal{F}_{t-1} such that X^tCt\lvert\hat{X}_{t}\rvert\leq C_{t} with probability 1,

Follows from Lemma 5 (with yiy_{i} and XiX_{i} taking the roles of aia_{i} and bib_{i}, respectively), and Corollary 3 that bounds maxititXi\max_{i\leq t}\left\lvert\sum_{i\leq t}X_{i}\right\rvert for all tTt\leq T. ∎

For k[T]k\in[T] define the random variables:

where the last inequality uses Lemma 7. ∎

C.2 Proof of Corollary 1

C.3 DoG can diverge on a pathological instance

where [x]i[x]_{i} denotes the ii’th coordinate of xx and [x0]i=10rϵ/m[x_{0}]_{i}=10r_{\epsilon}/\sqrt{m} for all ii, so that d0=10rϵ>rϵd_{0}=10r_{\epsilon}>r_{\epsilon}. We show that applying DoG on this function gives rˉT/d0=T/10\bar{r}_{T}/d_{0}=\sqrt{T}/10 for all TmT\leq m, meaning that the ratio rˉT/d0\bar{r}_{T}/d_{0} can be made arbitrarily large by increasing TT and mm.

i.e., i(x)i(x) is the smallest coordinate which is candidate for providing a subgradient. Using this notation, a valid subgradient for ff is:

where eje_{j} is a vector with one in the jjth entry and zero elsewhere. With this subgradient choice for kmk\leq m the iterates become:

and therefore rˉk=krϵ=kd0/10\bar{r}_{k}=\sqrt{k}r_{\epsilon}=\sqrt{k}d_{0}/10 as claimed. We confirm (7) by induction. Since [x0]i=10rϵ/m[x_{0}]_{i}=10r_{\epsilon}/\sqrt{m} for all ii, the expression (7) holds for k=0k=0. If (7) holds for all kn<mk\leq n<m then

and therefore Gn=nG_{n}=n so that ηn=rϵnn=rϵ\eta_{n}=\frac{r_{\epsilon}\sqrt{n}}{\sqrt{n}}=r_{\epsilon} and xn+1=xnnnrϵenx_{n+1}=x_{n}-\frac{\sqrt{n}}{\sqrt{n}}r_{\epsilon}e_{n}, meaning that

C.4 Proof of Proposition 2

To show iterate boundedness in the stochastic setting we define the stopping time

Under Assumption 2 the truncated T-DoG step sizes (8) satisfy, for all tTt\leq T,

Equation (9) holds directly from the definition of (T-DoG) and (8).

where (i)(i) uses that gk2=GkGk1\|g_{k}\|^{2}=G_{k}-G_{k-1} (with the shorthand G10G_{-1}\coloneqq 0) and

The final bound (12) follows immediately from (11) by noting that

The above properties allow us to establish the following concentration bound.

Finally, we show that the event defined in Lemma 9 implies the desired distance bound.

for all kk. Summing this inequality from k=0k=0 to k=t1k=t-1, we get

Proposition 2 follows immediately from Lemmas 9 and 10.

C.5 Illustrating DoG’s guarantees for least squares problems

For simplicity, we set x0=0x_{0}=0, and let xx_{\star} be the minimum norm minimizer of ff.

If we assume that aA\|a\|\leq A and bB|b|\leq B with probability 1, then G()\mathcal{G}(\cdot) satisfies Assumption 2 with

The bounds aA\|a\|\leq A and bB|b|\leq B are often easy to verify (e.g., via data normalization).

where O~\widetilde{O} hides polylogarithmic terms in T,δT,\delta, and rϵr_{\epsilon}. We emphasize that the bound above depends on the value of x\|x_{\star}\| (the smallest minimizer norm) even though T-DoG assumes no knowledge of this value.

where GTG_{T} denotes the sum of square stochastic gradient norms observed by the algorithm. The coarseness of the upper bound DxD\geq\|x_{\star}\| affects the leading order term in the above display: since the iterates can be anywhere in X\mathcal{X}^{\prime}, the best we can guarantee about GTG_{T} is GT=O(L2T)=O((AD+B)2T)G_{T}=O(L^{2}T)=O((AD+B)^{2}T). Therefore, the best deterministic upper bound on the optimality gap of previous methods is

Comparing the bounds (C.5) and (C.5), we see that, for least squares problems, T-DoG can offer substantially stronger guarantees than previous parameter-free methods, even when we bound the domain to ensure that the latter methods apply.

Appendix D Guarantees for the unweighted DoG iterate average

In this section we derive guarantees similar to those presented in Section 3 for the unweighted iterate average

The following proposition shows that the bound resulting from combining Proposition 1 with Lemma 3 holds also for uniform iterate averaging.

Moreover, in the noiseless case we have (with probability 1)

with τ00\tau_{0}\coloneqq 0. Moreover, let KK be the first index such that τK=T\tau_{K}=T and note that K1+log2rˉTrϵK\leq 1+\log_{2}\frac{\bar{r}_{T}}{r_{\epsilon}} by construction.

for all kKk\leq K. Moreover, by Lemma 2 we have that

where the last transition holds since, for any DoG-like method and all tt,

Summing Equation 18 over kk from 11 to KK, we obtain

Finally, recall that K=O(log+rˉTrϵ)K=O\left(\log_{+}\frac{\bar{r}_{T}}{r_{\epsilon}}\right) and note that k=1Krˉτk=O(rˉT)\sum_{k=1}^{K}\bar{r}_{\tau_{k}}=O(\bar{r}_{T}) since rˉτirˉτK12K1+i\frac{\bar{r}_{\tau_{i}}}{\bar{r}_{\tau_{K-1}}}\leq 2^{-K-1+i} for all iK1i\leq K-1. The proof of Equation 15 is complete upon noting that f(x^T)1Tt=0T1f(xt)f(\hat{x}_{T})\leq\frac{1}{T}\sum_{t=0}^{T-1}f(x_{t}) by Jensen’s inequality. To show Equation 17 we simply set Δt=0\Delta_{t}=0 in the bound above. ∎

Using Proposition 3 we may replace the early-stopped weighted average xˉτ\bar{x}_{\tau} with the final unweighted average x^T\hat{x}_{T} in Corollary 1 and Theorem 1.

Proposition 3 also shows that (as long as iterates remain bounded) DoG attains a 1/T1/T rate of convergence in the smooth noiseless cases. In particular, if we assume that f\nabla f is SS-Lipschitz and noiseless gradients (i.e., gi=f(xi))g_{i}=\nabla f(x_{i})), then we have

Substituting this bound back into Equation 17 (with GT1=GT1G_{T-1}^{\prime}=G_{T-1} for DoG), we obtain

Dividing through by 1Tt=0T1f(xt)f(x)\sqrt{\frac{1}{T}\sum_{t=0}^{T-1}f(x_{t})-f(x_{\star})} and squaring yields

Appendix E Experiment Details

All experiments were based on PyTorch (version 1.12.0).

Language experiments were done with the transformers library (version 4.21.0) and tracked using the Comet.ML . All datasets were provided by the Datasets library (version 2.4.0) and were left as is, including train-eval-test splits.

Vision experiments were based on the pytorch-image-models (timm, version0.7.0dev0) repository , with TensorFlow datasets (version 4.6.0) as a dataset backend .

To support the training and analysis of the results, we used numpy , scipy , pandas and scikit-learn .

E.2 Implementation details

Whenever possible, we used existing scripts and recipes provided by timm and transformers to fine-tune the models. We implemented DoG, L-DoG and the polynomial model averaging as a subclass of PyTorch Optimizer interface. We provide implementation of both in https://github.com/formll/dog.

E.3 Datasets

The datasets used in the language experiments are: CoLA , SST-2 , MRPC , QQP , STS-B , MNLI , QNLI , and RTE . Following Liu et al. , we discard WNLI as it was found to be ill-defined and was reformulated differently in SuperGLUE .

The datasets used in the vision experiments are: Caltech101 , CIFAR-100 , CLEVR-Dist , DMLab , dSprites-Ori , DTD , Flowers102 , Pets , Resisc45 , Retinopathy , Sun397 , and SVHN .

E.4 Models

When fine-tuning RoBERTa (from the ‘roberta-base’ checkpoint) on classification tasks, we follow the common technique of prepending a CLS token to the input, and feeding its final representation to a one hidden-layer, randomly initialized MLP that is used as a classification head. For SQuAD, the classification head is tasked with multi-label classification, predicting the probability that each word (token) in the input is the beginning/end of the answer span, and we then used the span that has the maximum likelihood as the model’s output. When fine-tuning T5 (from the ‘t5-base’ checkpoint), we treated all tasks as sequence-to-sequence tasks, translating classification labels to appropriate words (e.g. 0/1 to positive/negative) and then evaluated accuracy with exact match. The computer vision pre-trained models were accessed via timm, and had randomly initialized classification heads. The strings used to load the models were: ‘convnext_tiny’, ‘resnet50’, ‘densenet121’, ‘vit_base_patch32_224_in21k’ and ‘vgg11’.

E.5 Hyper-parameters

We trained each model/task combination a fixed number of steps (see Table 3), performing evaluation every 500 update steps (except for the smaller datasets Caltech101, DTD, Flowers102 and Pets where we evaluated every 200) with both the latest checkpoint, and the polynomial averaged one (see below). We did not use any weight decay. For language models, we left dropout at its default value in the transformers library. We used batch sizes as is common practice for each task, as detailed in Table 3.

The VTAB suite divides its datasets into three categories: natural, specialized and structured, and we used a suitable data augmentation strategy for each of the categories. In particular, for structured datasets we simply resized the images to a (224, 224) resolution, while for the natural and specialized datasets we used the standard “inception crop” at training time and a 0.875 center crop at test time. For natural datasets we additionally applied a color jitter operation with parameter 0.4 (as implemented in timm). Finally, we applied a random horizontal flip for all datasets except SVHN and dSprites-Ori, where such augmentation clearly interferes with the task.

Model selection in vision experiments.

For computer vision experiments, we used the VTAB evaluation splits to select the best checkpoint, and then reported performance on the test split. Unlike the experiments accompanying the VTAB suite , we did not retrain selected models on the combination of training and validation data.

Repeated runs.

To account for randomness, we repeated our fine-tuning experiments using multiple seeds. In most cases (with exceptions listed below) we repeated each DoG and L-DoG training 5 times. For SGD and Adam repeating the training with all learning rates was computationally prohibitive, so instead for each task / model pair we repeated 5 times only the best-performing LR (i.e., instance-tuned LR) and the best-performing LR across all tasks for that model (i.e., model-tuned LR) according the validation split. A few experiments were too computationally expensive to repeat: for QQP and MNLI (which require a large step budget) we have only 1–3 repetitions per training configuration, and for ConvNeXt-T (which takes a long time per step) we did repeat the training runs.

Each relative error difference (RED) score combines the error of two optimization methods (one being DoG) on a particular model task combination. Given multiple seeds for each optimization method, we computed the RED scores for each possible seed combination. In Figures 2, 3, 6 and 5 (which aggregate multiple tasks) we average over those multiple RED values and compute the statistics of the average RED. In per-task breakdowns such as Figures 7, F.1 and F.1 we report the statistics over the multiple RED values.

Baseline optimizers.

For both SGD and Adam, we used cosine learning rate decay, and searched over appropriate values for peak learning rate. The base learning rate search space used when performing fine-tuning for each model/task combination can be found in Sections F.1 and F.1. We did not use momentum for SGD. For Adam we used β1=0.9\beta_{1}=0.9 for all experiments, and β2=0.999\beta_{2}=0.999 for language experiments and β2=0.99\beta_{2}=0.99 for vision experiments. For language models only, we used warmup of 10% of the maximum steps count, and gradient clipping at global norm 1. We did not perform learning rate warmup or gradient clipping for the vision experiment since we did not encounter any training stability issues there.

As explained in Section 4.1, setting rϵα(1+x0)r_{\epsilon}\coloneqq\alpha(1+\|x_{0}\|) generally works well for α=104\alpha=10^{-4}. However, in some cases such as with T5, x0\lVert x_{0}\rVert can be very large, causing destructively large first updates, with ηt\eta_{t} increasing exponentially and the model diverging. This is easily detectable early during training, as usually ηt\eta_{t} exceeds 10001000 within the first 100100 steps. Since the theory requires rϵr_{\epsilon} to be small, we simply decreased α\alpha by a factor of 100. While preliminary experiments with RoBERTa indicated that DoG also performed well with α=104\alpha=10^{-4}, for the sake of consistency we use the same values in all models of the same domain. Thus, models fine-tuned on vision tasks used α=104\alpha=10^{-4}, while language models used 10610^{-6} for DoG and 10810^{-8} for L-DoG.

Model averaging.

As mentioned in Section 4.1, we used the polynomial decay averaging as proposed by Shamir and Zhang . Namely, we kept an additional copy of the model weights, and in every update step we updated our running average of the model parameters as follows:

The vector xˉtγ\bar{x}_{t}^{\gamma} roughly corresponds to an average of the last t/γt/\gamma iterates preceding iteration tt. For all models, we set γ=8\gamma=8. We did not perform any tuning of the parameter γ\gamma; we chose the value 88 because 1/81/8 seemed like a good fraction of iterates to average, and because it worked well in the experiments of Levy et al. .

To ensure that iterate averaging is never harmful, for each optimization method we selected the best-performing checkpoint across both xtx_{t} and xˉtγ\bar{x}_{t}^{\gamma} (i.e., with or without averaging).

E.6 Figure 1 details

We generated Figure 1 as part of our fine-tuning testbed. In particular, SGD used a cosine learning rate annealing (without warmup), both algorithms use polynomial decay averaging, and we report test performance on the best checkpoint selected on a validation set.

E.7 Fine-tuning ImageNet

Our training setup mostly followed the default configuration in Wortsman et al. . In particular, we used batch size 512 and the default timm augmentation (as in our main computer vision experiments) which Wortsman et al. refer to as ‘medium aug.’ We trained for 25K steps, corresponding to roughly 10 passes over the data. However (keeping with our computer vision testbed setting) we did not perform learning rate warmup or gradient clipping, and we initialized the classification head to be random.

For AdamW we used weight decay 0.1 and cosine learning rate annealing as in Wortsman et al. . We obtained accuracies within 0.5% of the numbers reported in Appendix L of Wortsman et al. .

For DoG and L-DoG we used weight decay 0 since the value 0.1 is meant for decoupled weight decay and we did not wish to re-tune a weight decay parameter. We set rϵr_{\epsilon} to be 106(1+x0)10^{-6}\cdot(1+\|x_{0}\|) without trying different values of this parameter.

For SGD we used cosine learning rate annealing and set weight decay to 0 for a more direct comparison to DoG.

E.8 Training from scratch

Our training setup followed the basic training configuration of Cubuk et al. , which is typical for training ResNets on CIFAR-10: data augmentations comprising a random crop after 4 pixel padding and random horizontal flip, batch size of 128, weight decay of 0.0005 and 200 epochs of training. SGD used cosine learning weight annealing and (when applicable) Nesterov momentum. We did not use dropout or any other additional form of regularization. For DoG and L-DoG, we set rϵ=104(1+x0)r_{\epsilon}=10^{-4}\cdot(1+\|x_{0}\|) without trying different values of this parameter. The accuracies we obtained using SGD and DoG are consistent (and slightly better) than the baseline number reported in Table 2 of Cubuk et al. and within 0.1% of the one reported in Table 1 of Carmon et al. .

Appendix F Additional experiment results

Figure 7 as well as Sections F.1 and F.1 provide the full breakdown of our main fine-tuning results, comparing DoG and L-DoG to SGD and Adam with different learning rates for each model/task combination.

Throughout the paper, our experiments focus on comparing different methods by the final test performance they are able to reach given sufficient compute budget to essentially fully converge. As a consequence, SGD and Adam—which require learning rate tuning—use up significantly more computation than DoG and L-DoG. More specifically, for each model/task combination we tune the SGD and Adam learning rates over a grid of at least 5 values (and often 6 or more), resulting in computational cost increased by the same factor.

In this subsection only, we compare different optimizers using roughly the same computational budget, by measuring the performance of Adam and SGD after roughly 20% of their overall step budget.Since these results are just a re-analysis of our original experiments, for language experiments we take all the warmup iterates plus the first 20% of the remaining iterates, overall using 28% of the budget. Figure 9 shows the result of this comparison, contrasting it to our main experiment. The figure shows that DoG often exceeds the performance of instance-tuned SGD with equalized step budget.

We note a number of caveats regarding the equalized compute budget comparison:

Since our experiments are focused on getting the best possible generalization, we substantially over-provisioned the iteration budget, and hence the performance of instance-tuned SGD and Adam declines only mildly when we cut the budget by roughly 5. Our tightest budget was for RoBERTa (150% the iterations in Liu et al. ), and there we can see that performance degraded more substantially. If we were to instead take the number of iterations DoG actually needs to reach its peak performance, its advantage over equalized-compute SGD would likely be far larger.

Since the results reported here are obtained by re-analysis of our original experiments, the cosine learning rate schedule for SGD and Adam is not properly set for using 20% of the iteration budget; in particular, the learning rate does not decay to zero at the end of the training. Running these methods with a properly set cosine schedule would likely improve their performance. However, we note that the addition of iterate averaging appears to partially compensate for the lack of sharp learning rate decay.

Given sufficient resources, it is possible to run all the different learning rates of SGD and Adam in parallel. Therefore, the comparison equalizes overall computational cost, but not necessarily wall time.

The comparison also does not take into account more sophisticated learning rate tuning schemes that early-stop unpromising learning rate candidates. However, such schemes run the risk of choosing a suboptimal learning rate.

As discussed in Section 4.2, DoG with rϵ=106(1+x0)r_{\epsilon}=10^{-6}(1+\|x_{0}\|) failed in fine-tuning T5-b on CoLA. To investigate this issue, we ran DoG and L-DoG with different choices of rϵr_{\epsilon}. Figure 11 depicts the results of this experiment as well as the performance of SGD and Adam with different learning rates. The figure shows that using lower values of rϵr_{\epsilon} allows DoG to reach reasonable results, but with some seeds still failing. In contrast, L-DoG shows consistent and superior performance across a large range of rϵr_{\epsilon} values. We leave further investigations on the cause of failure in CoLA to future work.

In Section 4.3, we discuss DoG’s insensitivity to the choice of rϵr_{\epsilon} as long as it is small enough. Here, we expand on this analysis by testing how the DoG step size at iteration tt, denoted ηt\eta_{t}, depends on its initial step size η0=rϵ/g0\eta_{0}=r_{\epsilon}/\|g_{0}\|. For each task in our testbed and for 4 models, we perform short training runs with a large number of η0\eta_{0} values. In Figure 8 we plot ηt\eta_{t} vs. η0\eta_{0} for t{2,10,100,1000}t\in\{2,10,100,1000\}. We also show a horizontal line for the learning rate of of SGD reaching the best validation accuracy, and the y=xy=x diagonal. The figure shows that for most model/task combinations, ηt\eta_{t} converges quickly (within the first 10001000 steps) to a value near the optimal one for SGD, and mostly independent of η0\eta_{0} as long as it is small enough.

However, we also observe some failure cases where ηt\eta_{t} strongly depends on η0\eta_{0}, such as fine-tuning ResNet50 on CIFAR-100. This provides a complementary perspective on the fact DoG is sensitive to rϵr_{\epsilon} in this setting, as already shown in Figure 3: when η0\eta_{0} is to low, DoG fails to reach a suitable value of ηt\eta_{t} in a reasonable time. We hypothesize that this is due to the batch normalization (BN) layers in the model causing many different step size to “look” like solutions to the implicit equation motivating DoG. To test this hypothesis, we repeat the CIFAR-100 training experiment but without BN (we disable BN by fine-tuning the model in evaluation mode). Figure 10(a) shows that removing BN allows DoG to recover its stabilizing behavior. Moreover, Figure 10(b) further shows that without batch normalization, the performance of DoG again becomes insensitive to the choice of rϵr_{\epsilon} provided it is sufficiently small. Unsurprisingly, we also observe that removing BN slightly hurts generalization performance in this task. As mentioned in Section 6, improving DoG to be more robust in the presence of normalization layers in general and batch normalization in particular is an important direction for future research.

Figure 12 plots rˉt\bar{r}_{t} for DoG as a function of the iteration index tt. As the figure shows, rˉt\bar{r}_{t} grows very rapidly and then approximately plateaus. Therefore, the quantity itrˉirˉt\sum_{i\leq t}\frac{\bar{r}_{i}}{\bar{r}_{t}} grows roughly linearly in tt, implying a near-optimal rate of convergence for DoG, as discussed in Section 3.2.

Section 4.7 discusses a comparison between DoG and other parameter-free optimizers. In this section, we provide further details on the experiments.

Carmon and Hinder propose a bisection procedure for tuning the SGD learning rate. A direct implementation of this method would need to perform at least 4 or 5 bisection steps and therefore, in the best case, perform similarly to our instance-tuned SGD baseline. Since our learning rate tuning employs a tight grid of values selected using some prior knowledge of the problem, and since we select learning rates based on validation set performance and not a step size certificate, instance-tuned SGD is likely a conservative upper bound on the performance of bisection approach.

Similar to instance tuned SGD, the bisection procedure has increased computational cost relative to DoG that is due to the need for multiple SGD runs. That is, performing 5 steps of bisection where each SGD call has the same step budget as DoG consumes 5 times more compute than DoG. We may also consider a situation where each bisection step uses only 20% of the DoG compute budget, leading to equal overall cost. In this setting, the “equalized compute budget” comparison we perform in Section F.2 and Figure 9 provides a conservative upper bound on the bisection performance, indicating it is likely to under-perform DoG.

G.2 Stochastic Polyak step-size

We apply the Stochastic Polyak Step (SPS) proposed by Loizou et al. using their open-source implementationhttps://github.com/IssamLaradji/sps to a subset of our fine-tuning testbed, and present the results in Sections G.4 and G.4. For the vision experiments, the SPS with the hyper-parameters proposed in the paper (c=0.2c=0.2, τ=2\tau=2) and initial step size of 1.0 (the default in the code) worked reasonably well, but not as well as DoG. For the language experiments the same algorithm diverges; we find initial learning rate of 0.01 worked reasonably well, but again not as well as DoG (we also attempted an initial learning rate of 0.0001, which produced worse results). For vision tasks, similarly tuning the initial step size did not significantly improve performance. We run 5 random seeds per experiment, and average the results across seeds. DoG outperforms SPS in 22 out of 34 task/model combinations, and by 2.3 percentage points on average. L-DoG further increases this gap by outperforming SPS in 24 out of 34 pairs, with an average of 5.3 percentage points.

G.3 D-adaptation

We perform a preliminary empirical evaluation of the practical algorithms proposed in Defazio and Mishchenko using the code they releasehttps://github.com/facebookresearch/dadaptation and a subset of our fine-tuning testbed. As Sections G.4 and G.4 show, D-adapt SGD and and D-adapt Adam perform reasonably well but slightly worse than DoG, and noticeably worse than L-DoG and Adam. DoG outperforms D-adapt SGD in 24 out of 34 task/model combinations, and by 1.9 percentage points on average. L-DoG further increases this gap by outperforming D-adapt SGD in 30 out of 34 pairs, with an average of 4.9 percentage points. D-adapt Adam is less stable on many of the tasks in our testbed, being outperformed by DoG in 26 out of 34 task/model combinations, and by L-DoG in 27, with an average of 10.8 and 13.8 percentage points respectively.

Theoretical comparison to Algorithm 1 of Defazio and Mishchenko [25].

Defazio and Mishchenko carry out their main theoretical analysis on a “Parameter Free Dual Averaging” (PFDA) method. We now provide some additional remarks comparing PFDA and DoG. The iterate xtx_{t} in PFDA is

where Gt=itgi2G_{t}=\sum_{i\leq t}\|g_{i}\|^{2} and qiq_{i} is a lower bound on the distance to optimality (denoted did_{i} in ). In contrast, the DoG iterates are

where rˉt=maxitxix0\bar{r}_{t}=\max_{i\leq t}\|x_{i}-x_{0}\|. While both dtd_{t} in PFDA and rˉt\bar{r}_{t} are lower bounds for (a constant factor times) the distance to optimality, only DoG aims to approximate η=x0xGt\eta_{\star}=\frac{\|x_{0}-x_{\star}\|}{\sqrt{G_{t}}}; PFDA instead approximates the optimal step size for dual averaging.

The dual averaging prescription of putting the factor 1/Gt1/\sqrt{G_{t}} outside the summation defining xtx_{t} likely hurts performance in practice. The practical D-adapt SGD and D-adapt Adam methods that Defazio and Mishchenko propose do not follow this prescription. Consequently, these algorithms are very different from PFDA and have no theoretical guarantees.

G.4 Continuous Coin Betting

Orabona and Tommasi propose a parameter-free algorithm based on coin-betting schemes and demonstrate its effectiveness in training neural networks. We use an open source implementationhttps://github.com/bremen79/parameterfree to apply COCOB on the same subset of our testbed as the other experiments in this section, and present the results in Sections G.4 and G.4. While the default parameters work well for ResNet50 (slightly better than DoG and similar to L-DoG), they produced poor results on the other models. We found that increasing the α\alpha parameter (which roughly corresponds to the number of “warmup” steps) from a constant 100 to 10% of the step budget improves the results dramatically for all transformer-based models, though in most cases DoG continues to significantly outperform it. Additionally, we find that in almost all cases, our averaging scheme benefits COCOB.