Prodigy: An Expeditiously Adaptive Parameter-Free Learner
Konstantin Mishchenko, Aaron Defazio
Introduction
Optimization is an essential tool in modern machine learning, enabling efficient solutions to large-scale problems that arise in various domains, such as computer vision, natural language processing, and reinforcement learning. One of the key challenges is the selection of appropriate learning rates, which can significantly impact the convergence speed and the quality of the final solution. Learning-rate tuning has been particularly challenging in applications where there are multiple agents that use their own optimizer. For instance, when training Generative Adversarial Networks (GANs) (Goodfellow et al., 2020), there are two neural networks with different architectures. In federated learning, tuning is even more challenging (Khodak et al., 2021), since there might be billions of devices (Kairouz et al., 2021), each optimizing their objective locally. Another example is Neural Architecture Search (NAS) (Zoph and Le, 2017), where the goal is to find the best neural network architecture automatically by training a lot of networks and evaluating them on a validation set. In cases like that, it becomes very expensive to manually tune the learning rate.
In recent years, "parameter-free" adaptive learning rate methods (Orabona and Tommasi, 2017; Cutkosky and Orabona, 2018; Zhang et al., 2022; Carmon and Hinder, 2022; Ivgi et al., 2023) have gained considerable attention due to their ability to automatically adjust learning rates based on the problem structure and data characteristics. Among these, the D-Adaptation method, introduced by (Defazio and Mishchenko, 2023), has emerged as a promising practical approach for learning-rate-free optimization.
D-Adaptation works by maintaining a lower bound on the initial distance to solution , for any in the solution set of the following problem:
In practice, the lower bound estimated by D-Adaptation increases rapidly during the course of optimization, plateauing to a value close to the true . This quantity is the key unknown constant needed to set the learning rate for non-smooth optimization methods, forming the numerator of the step size:
and the denominator is based on the Adagrad step size Duchi et al. (2011); Streeter and McMahan (2010); Ward et al. (2019). The Gradient Descent form of D-Adaptation simply plugs in the current lower bound at each step in place of . This simple approach can be applied to estimate the step size in Adam (Kingma and Ba, 2015), which yields state-of-the-art performance across a wide-range of deep learning problems. Defazio and Mishchenko (2023) also show that asymptotically, D-Adaptation is as fast as specifying the step size using the true (up to small constant factors).
In this paper, we present two novel modifications to the D-Adaptation method that enhance its worst-case non-asymptotic convergence rate. By refining the algorithm’s adaptive learning rate scheme, we achieve improved performance in terms of convergence speed and solution quality. We then conduct extensive experiments that consistently demonstrate that the improved D-Adaptation methods adapt the learning rate much faster than the standard D-Adaptation, leading to enhanced convergence rates and better optimization outcomes.
Prodigy Approach
To understand how we can improve upon D-Adaptation, let us take a closer look at some nuggets in its analysis. In D-adapted Dual Averaging, the gradient at iteration is scaled with weight . This leads to the error term
The theory then proceeds to upper bound this sum using the largest of all ’s by using the upper bound . This, however, is quite pessimistic since then is set to be , so can be as large as and can be as small as . Therefore, replacing with can introduce a multiplicative error of in this term.
We take a different approach and instead handle the error term using modified Adagrad-like step sizes. In the Adagrad theory, the error term does not have any factors, which is exactly why Adagrad places in the step-size denominator. The required modification is then obvious: since the error terms are now instead of , the new adaptive step size should be
for the Gradient Descent algorithm. This way, we can still control the error term of D-Adaptation but the obtained step size is provably larger since is non-decreasing. For instance, for Gradient Descent, we have
Having larger step sizes while preserving the main error term is the key reason why the new algorithms converge, as we show below, with a faster rate.
Notice, however, that the methods might still be slow because the denominator in the step size might grow too large over time. To remedy this, we introduce a modification for the Gradient Descent step size by placing an extra weight next to the gradients:
In fact, the modified step size might even increase between iterations, whereas the Adagrad step size always decreases. We will show that as long as does not grow too quickly, the worst-case convergence rate is almost the same.
To have non-asymptotic theory, we also introduce in our algorithms an extra term in the denominator which upper bound the gradient norm. We define it formally in the assumption below.
Algorithm 1 and Algorithm 2 give Gradient Descent and the Dual Averaging variants of our new method. In contrast to Adagrad, they estimate the product of and in the denominator, so we call the proposed technique Prodigy. We give the following convergence result for Algorithm 1:
Assume is convex and -Lipschitz. Given any weights , the functional gap of the average iterate of Algorithm 1 converges as
where is the weighted average iterate.
Notice that we have the freedom to choose any non-decreasing sequence as long as the right-hand side is decreasing. This allows us to put much more weight on the recent gradients and get more reasonable step sizes. For instance, we can choose , where and since , it would result in an extra factor of in the numerator due to the log term. The denominator, on the other hand, would increase as well, giving us a trade-off that depends on the values of ’s. We note that weights have appeared previously in Accelegrad (Levy et al., 2018) and UniXGrad (Kavis et al., 2019), which combine Adagrad step sizes with momentum.
To understand why this improves the convergence rate, consider the following lemma, which we prove in the appendix. The lemma presents an upper bound on the terms related to the sequence in the right-hand side of equation 1.
Let be positive numbers and assume , where . Then,
In contrast to the bound in Defazio and Mishchenko (2023), we bound instead of . This is the reason why the overall guarantee improves by a factor of . For instance, if we set for all and substitute the bound from Lemma 1, we get the convergence rate
where is chosen as the argmin from Lemma 1. Furthermore, for arbitrary increasing positive weights, we get the following guarantee by applying Lemma 1 directly to the bound in Theorem 1:
Even though our theory does not guarantee that it is beneficial to use increasing weights , this result is, to the best of our knowledge, new for Adagrad-like methods. It allows for a wide range of choices in . For example, if we set with and , then the method is still guaranteed to converge at the rate of . This is of particular interest when we study Adam-like methods, see Section 6 for a discussion.
The logarithmic term is, however, not necessary and only arises due to the use of Gradient Descent update. The Dual Averaging update, provides a tighter guarantee as given in the next theorem.
Let be a convex and -Lipschitz function. For Algorithm 2, it holds that:
where and .
Comparing this with the previous rate, the only difference is the removal of a multiplicative factor. This improvement, however, is mostly theoretical as Gradient Descent typically performs better in practice than Dual Averaging. We also note that we do not have a convergence result for Algorithm 2 with weights other than . This is due to the fact that the DA analysis requires the step size to be monotonically decreasing, so we cannot place an extra weighting factor in the numerator of .
Resetting Approach
Algorithm 3 describes a variant of D-Adaptation where the Dual Averaging process is reset whenever the current estimate increases by more than a factor of 2. We call the interval between resetting events an epoch. This resetting process has a number of other effects:
The step-size sequence is also reset, resulting in larger steps right after the reset.
The convergence of the method is proven with respect to an unweighted average of the iterates, rather than a weighted average.
Since the quantities tracked to compute are also reset, the value of often will increase more rapidly than it can when using the standard D-Adaptation estimate.
This resetting variant has the advantage of being significantly simpler to analyze in the non-asymptotic case than standard D-Adaptation or Prodigy. This makes it well suited to be used as a basis for extensions and modifications of D-Adaptation.
Under the assumption of convex and -Lipschitz , we have for Algorithm 3:
This is the same rate as we established for the Dual Averaging variant of Prodigy, but we return the average of all iterates, rather than an average computed up to some point , a significant simplification. However, due to the resetting operation, this method is not expected to work as well as Prodigy in practice.
Lower Complexity Bounds for Exponentially Bounded Algorithms
We can obtain an interesting class of algorithms, which contains our two D-Adaptation variants, by restricting the rate of growth.
An optimization algorithm is exponentially bounded if there exists a constant , so that for any sequence of G-bounded gradients it returns a sequence of iterates such that for all :
D-Adaptation, DoG, Prodigy and D-Adaptation with resetting are exponentially bounded.
A simple lower complexity bound can be established via a simple 1-dimensional resisting oracle. The bound depends on the "scale" of the initial step of the algorithm, which is the size of the initial step from to . This initial step is for D-Adaptation, and can be though of as an algorithm-agnostic measure of .
Our lower bound allows the resisting oracle to choose a constant after seeing , which is a much stronger oracle than required for establishing a lower bound. Ideally, a lower bound could be established where the constant is fixed but unknown to the algorithm, and the actual distance to solution given by the oracle is allowed to depend on the iterate sequence.
The primary consequence of this difference is that our construction only tells us that hard problems exist for small relative to , of the scale . It remains an open problem to show a more general lower bound for larger . This is in a sense a trivial consequence of the exponentially bounded property, but is actually representative of the real behavior of the methods during the early steps of the algorithm, where both and actually are small. Any more general lower bound must cover this case.
Our new D-Adaptation variants are optimal among exponentially bounded algorithms for this complexity class:
Consider any exponentially bounded algorithm for minimizing a convex -Lipschitz function starting from , which has no knowledge of problem constants G and D. There exists a fixed gradient oracle such that any sequence of , there exists a convex Lipschitz problem with and for all minimizing points , consistent with the gradient oracle such that:
Using the simple construction from Theorem 5, we show in Appendix C that the class of exponentially bounded methods (potentially with an exponent other than 2) covers all Gradient Descent approaches that use an estimate of for some constant c, and use a step size without line-search or other additional queries. So the only way to achieve a dependence on is by using a method that performs some queries that overshoot the standard step size by more than a fixed constant factor. Although using larger step sizes is not problematic for Lipschitz functions, it comes with the risk of causing training divergence when applied to functions whose gradients are only locally bounded by , which is common in machine learning settings.
Lower complexity bounds for the average regret in the more general online learning setting also apply here. They are of the form (Zhang et al., 2022):
where is a “user-specificed constant” playing a similar role to . Bounds on the average regret directly bound function value sub-optimality as
where .
Related Work
In this section, we review the major classes of techniques for optimizing convex Lipschitz functions with some level of problem parameter independence.
The Polyak step size Polyak (1987) trades the knowledge of for , achieving optimal convergence rate without additional log factors. Stable convergence requires accurate estimates. A restarting scheme converges within a multiplicative log factor of the optimal rate Hazan and Kakade (2019). There has been substantial recent research on modifications of the Polyak step size to make it better suited to machine learning tasks (Loizou et al., 2021; Gower et al., 2021; Orvieto et al., 2022) but as of yet they have not seen widespread adoption.
Coin-betting Orabona and Tommasi (2017); McMahan and Orabona (2014); Cutkosky and Orabona (2018); Zhang et al. (2022); Orabona and Pál (2021) is a family of approaches from the online learning setting which are also applicable for convex non-smooth optimization. They work by establishing a relationship by duality between regret minimization and wealth-maximization. Existing approaches for wealth-maximization can then be mapped to algorithms for regret minimization. Coin-betting approaches give convergence rates for an equal-weighted average of the iterates of the form:
Standard D-Adaptation obtains asymptotic rates without the log factor, but was otherwise (theoretically) slower in finite time, as it had a rather than a dependence on :
Our two new variants close this gap, giving the same sqrt-log dependence as coin betting.
The DoG method (Ivgi et al., 2023), based on the bisection approach of Carmon and Hinder (2022), is the only other approach that we are aware of that estimates in an online fashion. DoG estimates by :
Ivgi et al. (2023) use this quantity as a plug-in estimate for the numerator of the step size, similar to D-Adaptation’s approach. This approach can divergence in theory, but with additional modifications to the step size, the "tamed" T-DoG method is shown to converge. It has a dependence on the initial sub-optimally of the D estimate, so our approach improves on this dependence by a factor.
Malitsky and Mishchenko (2020) proposed AdGD, a method for convex optimization that does not require any hyperparameters and has a rate that is at least as good as that of the optimally tuned Gradient Descent. However, AdGD requires the objective to be locally smooth, which hinders its use in many practical problems. Latafat et al. (2023) partially addressed this gap by proposing a proximal extension, but the case of non-smooth differentiable functions has remained unstudied.
Deriving Adam-like Step Sizes
To derive an Adam-like method, which should use exponential moving average for the step size, we want to approximate Adam’s update of the exponential moving average of squared gradients:
where is the coordinate-wise square of the gradient . We can achieve this using exponential weights, , which after substituting the definition of give us the following identity:
The update uses exponential moving average as well, although it is more conservative as it uses instead of . Note that there is an extra of in the update for , which can be optionally compensated for by using the bias correction discussed by Kingma and Ba (2015). These update rules are summarized in Algorithm 4. This is the main algorithm that we study numerically in the next section.
Experiments
We test our methods on convex logistic regression as well as deep learning problems. The Prodigy method is used as presented in Algorithm 4 in all deep learning experiments.
For the convex setting, we ran a set of classification experiments. For each dataset, we used the multi-margin loss (Weston and Watkins, 1999), a multi-class generalization of the hinge loss. This non-smooth loss results in bounded gradients, which is required by our theory. We perform full-batch rather that stochastic optimization, for two reasons. Firstly, it matches the assumptions of our theory. Secondly, fast learning rate adaptation is more crucial for full-batch optimization than stochastic optimization as fewer total steps will be performed.
We performed 1,000 steps for each dataset, using a randomized and plot the results of 10 seeds. We ran both DA and SGD variants of each method. Each plot shows the accuracy of the average iterate for each method. Figure 1 shows that both our proposed algorithms greatly out-perform regular D-Adaptation. Our weighted SGD variant of D-Adaptation is faster consistently across each dataset. Additionally, it adapts faster than the DoG method (Ivgi et al., 2023) on 10 of the 12 problems.
CIFAR10.
For neural network experimentsThe PyTorch code of our optimizer is available at https://github.com/konstmish/prodigy, we consider training on CIFAR10 (Krizhevsky, 2009) with batch size 256, where D-Adapted Adam has a gap of a few percent compared to the standard Adam. We use cosine annealing with initial step size 1 for all Adam-based methods and initial step size for Adam itself. The considered networks are VGG11 (Simonyan and Zisserman, 2014) and ResNet-50 (He et al., 2016)VGG11 and ResNet-50 implementation along with the data loaders were taken from https://github.com/kuangliu/pytorch-cifar. To simplify the experiment, we do not use weight decay, so both networks slightly overfit and do not reach high test accuracy values. All methods were run using same 8 random seeds.
We show the results in Figure 2. As we can see, this gap is closed by Prodigy, which is achieved by the larger estimates of the step size.
For DoG and L-DoG, we compute the polynomial-averaging iterate and then report the best of the average and the last iterate. We average with , see (Ivgi et al., 2023) for the details. While DoG produces larger step size estimate than Prodigy (see the right column in Figure 2, this is counterbalanced by the larger denominator in DoG. We also tried to modify DoG to use Adam-like step sizes but our heuristic modification diverged on this problem. We also observed that among DoG and its layer-wise version, L-DoG, there is no clear winner as the former performed better on VGG11 and the latter was better when training ResNet-50.
nanoGPT transformer.
We also train a 6-layer transformer network from nanoGPThttps://github.com/karpathy/nanoGPT on the Shakespeare dataset. For all methods, we use batch size 256, clip the gradients to have norm not exceeding 1 and use float16 numbers. We use AdamW with hyperparameters given in the repository, i.e., , weight decay , stepsize , cosine annealing with warmup over 100 steps. The same weight decay value and cosine annealing is used for Prodigy and D-Adapted Adam, except that the latter two methods use stepsize 1. We accumulate minibatches of size 12 into a batch of size 480. We tuned the weight decay for DoG and L-DoG and found the value to work well for this problem. We ran each method with 8 random seeds and report the average as well as one-standard-deviation confidence intervals.
See Figure 3 for the results. In terms of the test loss, all methods are roughly equivalent except that DoG and L-DoG were slower to reach the best value of roughly 1.5. For the train loss, Prodigy was on par with tuned AdamW and slightly better than D-Adapted Adam. Surprisingly, the estimated step size in Prodigy was very consistent across the 8 random seeds.
1 Large-scale Adam experiments
To validate the performance on large-scale practical applications directly against D-Adaptation, we ran the subset of the experiments from Defazio and Mishchenko (2023) that use the Adam optimizer. Methods without coordinate adaptivity are known to underperform on these problems and so we exclude SGD and DoG from these comparisons.
On the smallest problem of LSTM training, Prodigy appears to converge significantly faster in training loss and slightly overfits in test loss compared to the baselines. For RoBERTa (Liu et al., 2019) and GPT (Radford et al., 2019) training on BookWiki, Prodigy matches the performance of the baseline with only negligible differences. For the application problems, DLRM (Naumov et al., 2019) on the Criteo Kaggle Display Advertising dataset, and fastMRI VarNet (Zbontar et al., 2018), Prodigy again closely matches the baselines.
ViT training.
Defazio and Mishchenko (2023) present a negative result for training vision transformer (Dosovitskiy et al., 2021), where D-Adaptation significantly underperforms tuned Adam. We investigated this effect, and we were able to reproduce this gap across a wide range of weight-decay values, although this problem has high run-to-run variance of 1-2% of test accuracy, which makes comparison difficult. Using weight decay 0.05 instead of 0.1 significantly improved performance of each variant, and so we present results for both the baselines and Prodigy at that value. We can see that Prodigy almost closes the gap between tuned Adam and D-Adaptation, giving a test accuracy of 74.63% compared to 75.4% for Adam, and more than 2% higher than D-Adaptation. See Figure 5 for the results.
Conclusion
We have presented two new methods for learning rate adaptation that improve upon the adaptation rate of the state-of-the-art D-Adaptation method. Prodigy, a form of weighted D-Adaptation, was shown to adapt faster than other known methods across a range of experiments. The second method, D-Adaptation with resetting, is shown to achieve the same theoretical rate as Prodigy with a much simpler theory than Prodigy or even D-Adaptation.
References
Appendix A Analysis of Prodigy
As a reminder, we use the notation to denote the logarithm that is lower bounded by for any .
For any sequence of nonnegative real numbers
For completeness, we prove both statements here. Notice that for any , it holds . Substituting gives
If we multiply all sides by , the inequality above becomes
Summing over , we get the stated bound. ∎
For any sequence of nonnegative numbers and , it holds
If for some , we can simply ignore the corresponding summands, so let us assume that for all . For any it holds . Substituting , where for and , we get
Summing this over from to , we get
This is exactly what we wanted to prove. ∎
A.2 Proof of Lemma 1
Following the proof in Ivgi et al. , we define K=\bigl{\lceil}\log_{2}\bigl{(}\frac{d_{N}}{d_{0}}\bigr{)}\bigr{\rceil} and n=\bigl{\lfloor}\frac{N}{K}\bigr{\rfloor}. Consider a partitioning of the sequence into half-open intervals for to . We want to show that there is at least one interval such that changes by at most a factor of 2 on that interval. We will use proof by contradiction.
Suppose that for all intervals, . Then at least doubles in every interval, and so:
which implies that and so which contradictions our definition K=\bigl{\lceil}\log_{2}\bigl{(}\frac{d_{N}}{d_{0}}\bigr{)}\bigr{\rceil}. Therefore, there exists some such that . We can now proceed with proving the Lemma by considering the summation over interval only:
A.3 GD Analysis
Assume that . Then, the estimate in Algorithm 1 satisfies for all .
By optimality of , we have , so
Collecting the gradients in the first sum together and using Cauchy-Schwarz inequality, we obtain
Using the definition of , this is equivalent to , which implies . Therefore, since , we can show by induction as well. ∎
The following inequality holds for the iterates of Algorithm 1:
Let us rewrite in a slightly different manner:
Combining this with the property , we derive
Applying inequality with and and plugging-in the bound above, we establish
It remains to divide this inequality by to get the desired claim. ∎
Assuming the weights are positive, it holds for the iterates of Algorithm 1:
The lemma follows straightforwardly from Proposition 2 by substituting for from 0 to :
where in the last step we used . ∎
Given any weights , the functional gap of the average iterate of Algorithm 1 converges as
The first steps in the proof follow the same lines as the theory in Defazio and Mishchenko , but we still provide them for completeness.
Firstly, let us continue developing the bound proved in the proof of Lemma 2:
We upper bound the first term with the help of Lemma 3:
Since by Lemma 2, , we can simplify it to
Using the convexity of , we can apply Jensen’s inequality on the iterate to get
Sum over from to and using gives
Consider Algorithm 1 with and define . If we choose weights , then it holds
Substituting in the bound of Theorem 1, we get for any
Using the definition of , the result of Lemma 1 and the property , we obtain
Choose any ans set the weights to be . Then,
Since the sequence is non-decreasing and upper bounded by , there exists an index such that for any . Moreover, we have for
Let us plug this into the bound of Theorem 1 for :
Notice that the bound in Corollary 2 does not depend on . This is only possible asymptotically for a large enough and a similar bound without weights was presented by Defazio and Mishchenko .
A.4 DA Analysis
When studying Dual Averaging, we need to introduce an extra sequence that lower bounds :
Let us show that by comparing their numerators:
Using the definition of , and the property , we derive
Using inequality with and and then the bound above, we establish
It remains to divide both sides by . ∎
The Dual Averaging algorithm (Algorithm 2) satisfies
Summing inequality with weights , we get
Using Cauchy-Schwarz on the first product in the right-hand side and then telescoping the second sum, we obtain
where .
Let us sum inequality and then apply Lemma 6:
Clearly, this implies that , and by induction it follows that as well. Now let us upper bound the functional values:
We can drop the last two terms since :
The first term in the right-hand side is readily bounded by Lemma 5:
Let us now plug-in and bound each gradient norm using :
Thus, we get the following convergence rate:
A.5 Coordinate-wise Prodigy
The gradients are upper bounded coordinate-wise: .
It holds for the iterates of Algorithm 5:
As in the proof of Lemma 5, let us introduce an extra sequence :
The next step is to show that by comparing the numerators:
Using the definition of , and the property , we derive
Using inequality with and for and then the bound above, we establish
It remains to divide both sides by . ∎
The coordinate-wise version of Prodigy (Algorithm 5) satisfies
where .
Summing inequality with weights , we get
Using Hölder’s inequality on the first product in the right-hand side and then telescoping the second sum, we obtain
where .
so we can prove by induction that . Using the same bounds as before, we get for the average iterate
We now apply Proposition 1, substitute , and use :
Using Lemma 1, we get the rate for :
Appendix B Analysis of D-Adaptation with Resetting
In Algorithm 3 the counter tracks the epoch. Let represent the number of steps performed in epoch . Let denote the total number of epochs performed before the algorithm returns, where .
Consider the steps within a single epoch, dropping the index, we have that the norm of is bounded by:
We start with Lemma 5 from Defazio and Mishchenko , which applies within an epoch in our case since the gamma decreases within epochs:
Note that we have used a slightly tightened version, where rather than , appears on the right, which easily follows by not using on the last step of their telescoping.
Using the definition of and the property , we have:
Using inequality with and and then the bound above, we establish
It remains to divide this inequality by to get:
Now plugging in , and using the AdaGradNorm error bound:
Starting from the bound in Defazio and Mishchenko :
Using the AdaGradNorm bound . and further dropping the term gives the result. ∎
Starting from Lemma 10, for epoch it holds that:
We sum over epochs up to the final epoch :
Jensen’s inequality tells us that for concave :
Applying to our case, we use , , and :
Applying Jensen’s inequality to obtain a bound on the average iterate completes the proof. ∎
Appendix C Lower Complexity Theory
Consider any exponentially bounded algorithm for minimizing a convex -Lipschitz function starting from , which has no knowledge of problem constants G and D. There exists a fixed gradient oracle such that any sequence of , there exists a convex Lipschitz problem with and for all minimizing points , consistent with the gradient oracle such that:
We consider the construction of a 1D oracle for this problem. Our oracle returns and for all queries. Without loss of generality we assume that for all , and .
and thus . and our construction uses the following function value and gradient sequence
Note that for all query points , the gradient is negative, and only the left arm of the absolute value function is seen by the algorithm, so the function appears linear for all test points. Using this construction, we have:
D-Adaptation, DoG, Prodigy and D-Adaptation with resetting are exponentially bounded.
Consider the lower bound from D-Adaptation:
Note also that . So:
So the sequence is upper bounded by the sequence:
This sequence has the following closed form:
We can prove this by induction. The base case is by definition . Then
Note that for both the Dual Averaging form and the GD form we have, we have:
It follows that D-Adaptation is exponentially bounded. The same argument applies to the resetting variant since the resetting operation does not increase the rate of accumulation of .
The rest of the argument follows the D-Adaptation case, with:
For DoG, recall the basic DoG step is gradient descent with step sizes:
Suppose that and . then:
Without loss of generality assume that . Firstly, note that using the absolute value function as constructed in Theorem 5, it’s clear that there is always exists a function with at step consistent with the sequence of gradients seen so far. Therefore, it must hold that
We prove the result by induction. For the base case, trivially: