Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution

Antonio Orvieto, Simon Lacoste-Julien, Nicolas Loizou

Introduction

We consider the stochastic optimization problem:

In this setting, the algorithm of choice is often Stochastic Gradient Descent (SGD), i.e. xk+1=xkγkfSk(xk)x^{k+1}=x^{k}-\gamma_{k}\nabla f_{\mathcal{S}_{k}}(x^{k}), where γk>0\gamma_{k}>0 is the stepsize at iteration kk, Sk[n]\mathcal{S}_{k}\subseteq[n] a random subset of datapoints (minibatch) with cardinality BB sampled independently at each iteration kk, and fSk(x):=1BiSkfi(x)\nabla f_{\mathcal{S}_{k}}(x):=\frac{1}{B}\sum_{i\in\mathcal{S}_{k}}\nabla f_{i}(x) is the minibatch gradient.

A careful choice of γk\gamma_{k} is crucial for most applications . The simplest option is to pick γk\gamma_{k} to be constant over training, with its value inversely proportional to the Lipschitz constant of the gradient. While this choice yields fast convergence to the neighborhood of a minimizer, two main problems arise: (a) the optimal γ\gamma depends on (often unknown) problem parameters — hence often requires heavy tuning ; and (b) it cannot be guaranteed that X\mathcal{X}^{*} is reached in the limit . A simple fix for the last problem is to allow polynomially decreasing stepsizes (second option) : this choice for γk\gamma_{k} often leads to convergence to X\mathcal{X}^{*}, but hurts the overall algorithm speed. The third option, which became very popular with the rise of deep learning, is to implement an adaptive stepsize. These methods do not commit to a fixed schedule, but instead use the optimization statistics (e.g. gradient history, cost history) to tune the value of γk\gamma_{k} at each iteration. These stepsizes are known to work very well in deep learning , and include Adam , Adagrad , and RMSprop .

A promising new direction in the adaptive stepsizes literature is based on the idea of Polyak stepsizes, introduced by in the context of deterministic convex optimization. Recently successfully adapted Polyak stepsizes to the stochastic setting, and provided convergence rates matching fine-tuned SGD — while the algorithm does not require knowledge of the unknown quantities such as the gradient Lipschitz constant. The results especially shines in the overparameterized strongly convex setting, where linear convergence to xx^{*} is shown. This result is especially important since, under the same assumption, no such rate exists for AdaGrad (see e.g. for the latest results) or other adaptive stepsizes. Moreover, the method was shown to work surprisingly well on deep learning problems, without requiring heavy tuning .

Even if the stochastic Polyak stepsize (SPS) comes with strong convergence guarantees, it has two main drawbacks when it is used in non-over-parameterized regimes: (i) It requires a priori knowledge of the optimal mini-batch losses, which are not often available for big batch sizes or regularized objectives (see discussion in §1.1) and (ii) it guarantees convergence only to a neighborhood of the solution. In this work, we study the dynamics and the convergence properties of SGD equipped with new variants of SPS for solving general convex optimization problems. Our new proposed variants provide solutions to both drawbacks of the original SPS.

Crucially the algorithm requires knowledge of fSkf_{\mathcal{S}_{k}}^{*} for every realization of the mini-batch Sk\mathcal{S}_{k}. In the non-regularized overparametrized setting (e.g. neural networks), fSkf_{\mathcal{S}_{k}} is often zero for every subset S\mathcal{S} . However, this is not the only setting where fSf_{\mathcal{S}}^{*} is computable: e.g., in the regularized logistic loss with batch size 1, it is possible to recover a cheap closed form expression for each fif_{i}^{*} . Unfortunately, if the batch-size is bigger than 11 or the loss becomes more demanding (e.g. cross-entropy), then no such closed-form computation is possible.

Rates and comparison with AdaGrad.

2 Main Contributions

As we already mentioned, in the non-interpolated setting SPSmax has the following issues:

For B>1B>1 (minibatch setting), SPSmax requires the exact knowledge of fSf_{\mathcal{S}}^{*}. This is not practical.

SPSmax guarantees convergence to a neighborhood of the solution. It is not clear how to modify it to yield convergence to the exact minimizer.

Having the above two issues in mind, the main contributions of our work (see also Table 1 for a summary of the main complexity results obtained in this paper) are summarized as follows:

In §3, we provide a direct solution for Issue (1). We explain how only a lower bound on fSf_{\mathcal{S}}^{*} (trivial if all fif_{i}s are non-negative) is required for convergence to a neighborhood of the solution. While this neighborhood is bigger that the one for SPSmax, our modified version provides a practical baseline for the solution to the second issue.

We explain why Issue (2) is highly non-trivial and requires an in-depth study of the bias induced by the interaction between gradients and Polyak stepsizes. Namely, we show that simply multiplying the stepsize of SPSmax by 1/k1/\sqrt{k} — which would work for vanilla SGD — yields a bias in the solution found by SPS (§4), regardless of the estimation of fSf_{\mathcal{S}}^{*}.

In §5, we provide a solution to the problem (Issue (2)) by introducing additional structure — as well as the fix to Issue (1) — into the stepsize. We call the new algorithm Decreasing SPS (DecSPS), and provide a convergence guarantee under the bounded domain assumption — matching the standard AdaGrad results.

In §5.2 we go one step further and show that, if strong convexity is assumed, iterates are bounded with probability 1 and hence we can remove the bounded iterates assumption. To the best of our knowledge, DecSPS, is the first stochastic adaptive optimization method that converges to the exact solution without assuming strong assumptions like bounded iterates/gradients.

In §5.3 we provide extensions of our approach to the non-smooth setting.

In §6, we corroborate our theoretical results with experimental testing.

Background on Stochastic Polyak Stepsize

In this section, we provide a concise overview of the results in , and highlight the main assumptions and open questions.

It is easy to realize that as soon as problem (1) is interpolated, then σB2=0\sigma^{2}_{B}=0 for each BnB\leq n. In addition, note that σB2\sigma^{2}_{B} is non-increasing as a function of BB.

In the overparametrized setting, the result guarantees convergence to the exact minimizer, without knowledge of the gradient Lipschitz constant (as vanilla SGD would instead require) and without assuming bounded iterates (in contrast to ).

As soon as (1) a regularizer is applied to the loss (e.g. L2L2 penalty), or (2) the number of datapoints gets comparable to the dimension, then the problem is not interpolated and SPSmax only converges to a neighborhood and it gets impractical to compute fSf_{\mathcal{S}}^{*} — this is the setting we study in this paper.

In the rare case that fSk(xk)2=0\|\nabla f_{S_{k}}(x^{k})\|^{2}=0, there is no need to evaluate the stepsize. In this scenario, the update direction fSk(xk)=0\nabla f_{S_{k}}(x^{k})=0 and thus the iterate is not updated irrespective of the choice of step-size. If this happens, the user should simply sample a different minibatch. We note that in our experiments (see §6), such event never occurred.

The classical Polyak stepsize has been successfully used in the analysis of deterministic subgradient methods in different settings . First attempts on providing an efficient variant of the stepsize that works well in the stochastic setting were made in . However, as explained in , none of these approaches provide a natural stochastic extension with strong theoretical convergence guarantees, and thus Loizou et al. proposed the stochastic Polyak stepsize SPSmax as a better alternative.A variant of SGD with SPSmax was also proposed by Asi and Duchi as a special case of a model-based method called the lower-truncated model. Asi and Duchi also proposed a decreasing step-size variant of SPSmax which is closely related but different than the DecSPS that we propose in §5. Among some differences, they assume interpolation for their convergence results whereas we do not in §5. We describe the differences between our work and Asi and Duchi in more detail in Appendix A. Despite its recent appearance, SPSmax has already been used and analyzed as a stepsize for SGD for solving structured non-convex problems , in combination with other adaptive methods , with a moving target and in the update rule of stochastic mirror descent . These extensions are orthogonal to our approach, and we speculate that our proposed variants can also be used in the above settings. We leave such extensions for future work.

The following results can be seen as an easy extension of the main result of , under a newly defined suboptimality measure:

And we also have an easy practical corollary.

Bias in the SPS dynamics.

In this section, we study convergence of the standard SPSmax in the non-interpolated regime, under an additional (decreasing) multiplicative factor, in the most ideal setting: batch size 1, and we have knowledge of each fif_{i}^{*}. That is, we consider γk=min{fik(xk)fikckfik(xk)2,γb}\gamma_{k}=\min\{\frac{f_{i_{k}}(x^{k})-f_{i_{k}}^{*}}{c_{k}\|\nabla f_{i_{k}}(x^{k})\|^{2}},\gamma_{b}\} with ckc_{k}\to\infty, e.g. ck=O(k)c_{k}=\mathcal{O}(\sqrt{k}) or ck=O(k)c_{k}=\mathcal{O}(k). We note that, in the SGD case, simply picking e.g. γk=γ0/k+1\gamma_{k}=\gamma_{0}/\sqrt{k+1} would guarantee convergence of f(xk)f(x^{k}) to f(x)f(x^{*}), in expectation and with high probability . Therefore, it is natural to expect a similar behavior for SPS, if 1/ck1/c_{k} safisfies the usual Robbins-Monro conditions : k=01/ck=,k=01/ck2<\sum_{k=0}^{\infty}1/c_{k}=\infty,\sum_{k=0}^{\infty}1/c_{k}^{2}<\infty.

We show that this is not the case: quite interestingly, f(xk)f(x^{k}) converges to a biased solution due to the correlation between fik\nabla f_{i_{k}} and γk\gamma_{k}. we show this formally, in the case of non-interpolation (otherwise both SGD and SPS do not require a decreasing learning rate).

Consider the following finite-sum setting: f(x)=12f1(x)+12f2(x)f(x)=\frac{1}{2}f_{1}(x)+\frac{1}{2}f_{2}(x) with f1(x)=a12(x1)2,f2(x)=a22(x+1)2f_{1}(x)=\frac{a_{1}}{2}(x-1)^{2},f_{2}(x)=\frac{a_{2}}{2}(x+1)^{2}. To make the problem interesting, we choose a1=2a_{1}=2 and a2=1a_{2}=1: this introduces asymmetry in the average landscape with respect to the origin. During optimization, we sample f1f_{1} and f2f_{2} independently and seek convergence to the unique minimizer x=a1a2a1+a2=1/3x^{*}=\frac{a_{1}-a_{2}}{a_{1}+a_{2}}=1/3. The first thing we notice is that xx^{*} is not a stationary point for the dynamics under SPS. Indeed note that since fi=0f_{i}^{*}=0 for i=1,2i=1,2 we have (assuming γb\gamma_{b} large enough): γkfik(x)=x12ck\gamma_{k}\nabla f_{i_{k}}(x)=\frac{x-1}{2c_{k}}, if ik=1i_{k}=1, and γkfik(x)=x+12ck\gamma_{k}\nabla f_{i_{k}}(x)=\frac{x+1}{2c_{k}} if ik=2i_{k}=2.

In the same picture, we show how our modified variant of the vanilla stepsize — we call this new algorithm DecSPS, see §5 — instead converges to the correct solution.

In the appendix, we go one step further and provide an analysis of the bias of SPS in the one-dimensional quadratic case (Prop. 4). Yet, we expect the precise characterization of the bias phenomenon in the non-quadratic setting to be particularly challenging. We provide additional insights in §D.2. Instead, in the next section, we show how to effectively modify γk\gamma_{k} to yield convergence to xx^{*} without further assumptions.

DecSPS: Convergence to the exact solution

We propose the following modification of the vanilla SPS proposed in , designed to yield convergence to the exact minimizer while keeping the main adaptiveness propertiesSimilar choices are possible. We found that this leads to the biggest stepsize magnitude, allowing for faster convergence in practice.. We call it Decreasing SPS (DecSPS), since it combines a steady stepsize decrease with the adaptiveness of SPS.

Let each fif_{i} be LiL_{i} smooth and let (ck)k=0(c_{k})_{k=0}^{\infty} be any non-decreasing positive sequence of real numbers. Under DecSPS, we have min{12ckLmax,c0γbck}γkc0γbck\min\left\{\frac{1}{2c_{k}L_{\max}},\frac{c_{0}\gamma_{b}}{c_{k}}\right\}\leq\gamma_{k}\leq\frac{c_{0}\gamma_{b}}{c_{k}}, and γk1γk\gamma_{k-1}\leq\gamma_{k}

As stated in the last lemma, under the assumption of ckc_{k} non-decreasing, γk\gamma_{k} is trivially non-increasing since γkck1γk1/ck\gamma_{k}\leq c_{k-1}\gamma_{k-1}/c_{k}.

The proof can be found in the appendix, and is based on a simple induction argument.

The following result provides a proof of convergence of SGD for the γk\gamma_{k} sequence defined above.

Under the setting of Thm. 3, for ck=k+1c_{k}=\sqrt{k+1} (c1=c0c_{-1}=c_{0}) we have

The result above crucially relies on the bounded iterates assumption: D2<D^{2}<\infty. To the best of our knowledge, if no further regularity is assumed, modern convergence results for adaptive methods (e.g. variants of AdaGrad) in convex stochastic programming requirePerhaps the only exception is the result of , where the authors work on a different setting: i.e. they introduce the RUIG inequality. this assumption, or else require gradients to be globally bounded. To mention a few: . A simple algorithmic fix to this problem is adding a cheap projection step onto a large bounded domain . We can of course include this projection step in DecSPS, and the theorem above will hold with no further modification. Yet we found this to be not necessary: the strong guarantees of SPS in the strongly convex setting let us go one step beyond: in §5.2 we show that, if each fif_{i} is strongly convex (e.g. regularizer is added), then one can bound the iterates globally with probability one, without knowledge of the gradient Lipschitz constant. To the best of our knowledge, no such result exist for AdaGrad — except , for the deterministic case.

In standard results for AdaGrad, a dependency on the problem dimension often appears (e.g. Thm. 1 in ). This dependency follows from a bound on the AdaGrad preconditioner that can be found e.g. in Thm. 4 in . In the SPS case no such dependency appears — specifically because the stepsize is lower bounded by 1/(2ckLmax)1/(2c_{k}L_{\max}).

2 Removing the bounded iterates assumption

We prove that under DecSPS the iterates live in a set of diameter DmaxD_{\max} almost surely. This can be done by assuming strong convexity of each fif_{i}.

Note that trivially σ^B,max2<\hat{\sigma}^{2}_{B,\max}<\infty under the assumption that all fif_{i} are lower bounded and n<n<\infty.

The proof relies on the variations of constants formula and an induction argument — it is provided in the appendix. We are now ready to state the main theorem for the unconstrained setting, which follows from Prop. 1 and Thm. 3.

The careful reader might notice that, while we assumed strong convexity, our rate is slower than the optimal O(1/K)\mathcal{O}(1/K). This is due to the adaptive nature of DecSPS. It is indeed notoriously hard to achieve a convergence rate of O(1/K)\mathcal{O}(1/K) for adaptive methods in the strongly convex regime. While further investigations will shed light on this interesting problem, we note that the result we provide is somewhat unique in the literature: we are not aware of any adaptive method that enjoys a similar convergence rate without either (a) assuming bounded iterates/gradients or (b) assuming knowledge of the gradient Lipschitz constant or the strong convexity constant.

On a convex problem, the non-asymptotic performance of SGD with a decreasing stepsize γk=η/k\gamma_{k}=\eta/\sqrt{k} strongly depends on the choice of η\eta. The optimizer might diverge if η\eta is too big for the problem at hand. Indeed, most bounds for SGD, under no access to the gradient Lipschitz constant, display a dependency on the size of the domain and rely on projections after each step. If one applies the method in the unconstrained setting, such convergence rates technically do not hold, and tuning is sometimes necessary to retrieve stability and good performance. Instead, for DecSPS, simply by adding a small regularizer, the method is guaranteed to converge at the non-asymptotic rate we derived even in the unconstrained setting.

3 Extension to the non-smooth setting

For any S[n]\mathcal{S}\subseteq[n], we denote in this section by gS(x)g_{\mathcal{S}}(x) the subgradient of fSf_{\mathcal{S}} evaluated at xx. We discuss the extension of DecSPS to the non-smooth setting.

A straightforward application of DecSPS leads to a stepsize γk\gamma_{k} which is no longer lower bounded (see Lemma 1) by the positive quantity min{12ckLmax,c0γbck}\min\left\{\frac{1}{2c_{k}L_{\max}},\frac{c_{0}\gamma_{b}}{c_{k}}\right\}. Indeed, the gradient Lipschitz constant in the non-smooth case is formally Lmax=L_{\max}=\infty. Hence, γk\gamma_{k} prescribed by DecSPS can get arbitrarily smallTake for instance the deterministic setting one-dimensional setting f(x)=xf(x)=|x|. As x0x\to 0, the stepsize prescribed by DecSPS converges to zero. This is not the case e.g. in the quadratic setting. for finite kk. One easy solution to the problem is to enforce a lower bound, and adopt a new proof technique. Specifically we propose the following:

For any non-decreasing positive sequence (ck)k=0(c_{k})_{k=0}^{\infty}, consider SGD with DecSPS-NS. Assume that each fif_{i} is convex and lower bounded. We have

where D2:=maxk[K1]xkx2D^{2}:=\max_{k\in[K-1]}\|x^{k}-x^{*}\|^{2} and G2:=maxk[K1]gSk(xk)2G^{2}:=\max_{k\in[K-1]}\|g_{\mathcal{S}_{k}}(x^{k})\|^{2}.

One can then easily derive an O(1/k)\mathcal{O}(1/\sqrt{k}) convergence rate. This is presented in §F (appendix).

Numerical Evaluation

Comparison with vanilla SGD with decreasing stepsize.

We compare the performance of DecSPS against the classical decreasing SGD stepsize η/k+1\eta/\sqrt{k+1}, which guarantees convergence to the exact solution at the same asymptotic rate as DecSPS. We show that, while the asymptotics are the same, DecSPS with hyperparameters c0=1,γb=10c_{0}=1,\gamma_{b}=10 performs competitively to a fine-tuned η\eta — where crucially the optimal value of η\eta depends on the problem. This behavior is shown on all the considered datasets, and is reported in Figures 4 (Breast and Synthetic reported in the appendix for space constraints). If lower regularization (1e4,1e61e-4,1e-6) is considered, then DecSPS can still match the performance of tuned SGD — but further tuning is needed (see Figure 14. Specifically, since the non-regularized problems do not have strong curvature, we found that DecSPS works best with a much higher γb\gamma_{b} parameter and c0=0.05c_{0}=0.05.

DecSPS yields a truly adaptive stepsize.

We inspect the value of γk\gamma_{k} returned by DecSPS, shown in Figures 4 & 8 (in the appendix). Compared to the vanilla SGD stepsize η/k+1\eta/\sqrt{k+1}, a crucial difference appears: γk\gamma_{k} decreases faster than O(1/k)\mathcal{O}(1/\sqrt{k}). This showcases that, while the factor k+1\sqrt{k+1} can be found in the formula of DecSPSWe pick ck=c0k+1c_{k}=c_{0}\sqrt{k+1}, as suggested by Cor. 2 & 3 , the algorithm structure provides additional adaptation to curvature. Indeed, in (regularized) logistic regression, the local gradient Lipschitz constant increases as we approach the solution. Since the optimal stepsize for steadily-decreasing SGD is 1/(Lk+1)1/(L\sqrt{k+1}), where LL is the global Lipschitz constant , it is pretty clear that η\eta should be decreased over training for optimal converge (as LL effectively increases). This is precisely what DecSPS is doing.

Comparison with AdaGrad stepsizes.

Last, we compare DecSPS with another adaptive coordinate-independent stepsize with strong theoretical guarantees: the norm version of AdaGrad (a.k.a. AdaGrad-Norm, AdaNorm), which guarantees the exact solution at the same asymptotic rate as DecSPS . AdaGrad-norm at each iteration updates the scalar bk+12=bk2+fSk(xk)2b_{k+1}^{2}=b_{k}^{2}+\|\nabla f_{\mathcal{S}_{k}}(x_{k})\|^{2} and then selects the next step as xk+1=xkηbk+1fi(xk)x_{k+1}=x_{k}-\frac{\eta}{b_{k+1}}\nabla f_{i}(x_{k}). Hence, it has tuning parameters b0b_{0} and η\eta. In Fig. 4 we show that, on the Breast Cancer dataset, after fixing b0=0.1b_{0}=0.1 as recommended in (see their Figure 3), tuning η\eta cannot quite match the performance of DecSPS. This behavior is also observed on the other two datasets we consider (see Fig. 9 in the Appendix). Last, in Fig. 10& 11 in the Appendix, we show that further tuning of b0b_{0} likely does not yield a substantial improvement.

Comparison with Adam and AMSgrad without momentum.

In Figures 5&12&13 we compare DecSPS with Adam and AMSgrad on the A1A and Breast Cancer datasets. Results show that DecSPS with the usual hyperparameters is comparable to the fine-tuned version of both these algorithms — which however do not enjoy convergence guarantees in the unbounded domain setting.

Conclusions and Future Work

We provided a practical variant of SPS , which converges to the true problem solution without the interpolation assumption in convex stochastic problems — matching the rate of AdaGrad. If in addition, strong convexity is assumed, then we show how, in contrast to current results for AdaGrad, the bounded iterates assumption can be dropped. The main open direction is a proof of a faster rate O(1/K)\mathcal{O}(1/K) under strong convexity. Other possible extensions of our work include using the proposed new variants of SPS with accelerated methods, studying further the effect of mini-batching and non-uniform sampling of DecSPS, and extensions to the distributed and decentralized settings.

Acknowledgements

This work was partially supported by the Canada CIFAR AI Chair Program. Simon Lacoste-Julien is a CIFAR Associate Fellow in the Learning in Machines & Brains program.

References

Appendix A Comparison with closely related work

In this section we present a more detailed comparison to closely related works on stochastic variants of the Polyak stepsize. We start with the work of Asi and Duchi , and then continue with a brief presentation of other papers (already presented in Loizou et al. ).

In Loizou et al. , the proposed SPSmax stepsize is

This stepsize is similar to the one in Asi and Duchi : in both, a Polyak-like stochastic stepsize is bounded from above in order to guarantee convergence. However there are crucial differences.

SPSmax can be applied to non-interpolated problems and leads to fast convergence to a ball around the solution in the non-interpolated setting (see Theorem 1). Instead, Asi and Duchi only formulated and studied Eq. (8) in the interpolated setting.

As we will see in the next paragraph, one can formulate few conditions under which it is possible to derive linear convergence rates for Eq. (8) in the interpolated setting. As can be easily seen from Theorem 1, SPSmax has similar convergence guarantees but works under a more standard/restrictive set of assumptions. In particular, in the interpolated setting, while Asi and Duchi require some specific assumptions on the noise statistics (see next paragraph), the rates in Loizou et al. can be applied without the need for, e.g., probabilistic bound on the gradient magnitude.

In this paper, starting from the SPSmax algorithm we propose the following stepsize for convergence to the exact solution in the non-interpolated setting:

We now compare DecSPS with Eq. (8) and our results with the rates in .

Convergence rates: The form of Eq. (8) and the convergence guarantees of are restricted to the interpolated setting. Instead, in this paper we focus on the non-interpolated setting: using DecSPS we provided the first stochastic adaptive optimization method that converges in the non-interpolated setting to the exact solution without restrictive assumptions like bounded iterates/gradients.

Inspection of the stepsize: DecSPS provides a version of SPS where γk\gamma_{k} is steadily decreasing and is upper bounded by the decreasing quantity c0γb/ckc_{0}\gamma_{b}/c_{k}, where ck=k+1c_{k}=\sqrt{k+1} yields the optimal asymptotic rate (Theorem 4) . Hence, DecSPS can be compared to Eq. (8) for αk=α0/k+1\alpha_{k}=\alpha_{0}/\sqrt{k+1}. However, note that there are two fundamental differences: First, in DecSPS we have that γkγk1\gamma_{k}\geq\gamma_{k-1} (see Lemma 1), a feature which Eq. (8) does not have. Secondly, compared to our DecSPS, the stepsize in Eq. (8) with αk\alpha_{k} decreasing polynomially is asymptotically non-adaptive. Indeed, assuming that each fif_{i} has LiL_{i}-Lipschitz gradients and that each fSf_{\mathcal{S}}^{*} is non-negative, we have (see ) that

therefore, after logβ(2Lmaxα0)\log_{\beta}(2L_{\max}\alpha_{0}) iterationsSimply plugging in αk=α0kβ\alpha_{k}=\alpha_{0}k^{-\beta} and solving for kk. the algorithm coincides with SGD with stepsize αk\alpha_{k}.

For completeness, we provide in the next paragraph an overview of the results in Asi and Duchi .

Precise theoretical guarantees in Asi and Duchi [2].

The stepsize in Equation (8) yields linear convergence guarantees under a specific set of assumptions. We summarize the two main results of Asi and Duchi below in the case of differentiable lossesThe results of Asi and Duchi also work in the subdifferentiable setting.:

Then, for αk=α0kβ\alpha_{k}=\alpha_{0}k^{-\beta} with β(,1)\beta\in(-\infty,1) the stepsize of Equation (8) yields a linear convergence rate dependent on λ1\lambda_{1} and the choice of β\beta.

Then, for αk=α0kβ\alpha_{k}=\alpha_{0}k^{-\beta} with β(,)\beta\in(-\infty,\infty) the stepsize of Equation (8) yields a linear convergence rate dependent on λ0,λ1\lambda_{0},\lambda_{1} and the choice of β\beta.

A.2 Comparisons with other versions of the Polyak stepsize for stochastic problems

To the best of our knowledge, no prior work has provided a computationally feasible modification of the Polyak stepsize for convergence to the exact solution in stochastic non-interpolated problems. In the next lines, we outline the details for a few related works on Polyak stepsize for stochastic problems.

SPSmax: As discussed in the main paper, our starting point is the SPSmax algorithm in , which provides linear (for strongly convex) or sublinear (for convex) convergence to a ball around the minimizer, with size dependent of the problems’ degree of interpolation. Instead, in this work, we provide convergence guarantees to the exact solution in the non-interpolated setting for a modified version of this algorithm. In addition, when compared to SPSmax, our method does not require knowledge of the single fif_{i}^{*}s, but just of lower bounds on these quantities (see §3).

L4: A stepsize very similar to SPSmax (the L4 algorithm) was proposed back in 2018 by . While this stepsize results in promising performance in deep learning, (1) it has no theoretical convergence guarantees, and (2) each update requires an online estimation of the fif_{i}^{*}, which in turn requires tuning up to three hyper-parameters.

ALI-G: Last, the ALI-G stepsize proposed by Berrada et al. is γk=min{fi(xk)fi(xk)2+δ,η}\gamma_{k}=\min\left\{\frac{f_{i}(x^{k})}{\|\nabla f_{i}(x^{k})\|^{2}+\delta},\eta\right\}, where δ>0\delta>0 is a tuning parameter. Unlike the SPSmax setting, their theoretical analysis relies on an ϵ\epsilon-interpolation condition. Moreover, the values of the parameter δ\delta and η\eta that guarantee convergence heavily depend on the smoothness parameter of the objective ff, limiting the method’s practical applicability. In addition, in the interpolated setting, while ALI-G converges to a neighborhood of the solution, the SPSmax method is able to provide linear convergence to the solution.

Appendix B Technical Preliminaries

Let us present some basic definitions used throughout the paper.

If a function gg is μ\mu-strongly convex and LL-smooth the following bounds holds:

The following lemma is the fundamental starting point in .

Let f(x)=1ni=1nfi(x)f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x) where the functions fif_{i} are μi\mu_{i}-strongly convex and LiL_{i}-smooth, then

where fi:=infxfi(x)f_{i}^{*}:=\inf_{x}f_{i}(x), Lmax=max{Li}i=1nL_{\max}=\max\{L_{i}\}_{i=1}^{n} and μmin=min{μi}i=1n\mu_{\min}=\min\{\mu_{i}\}_{i=1}^{n}.

The proofs in this subsection is an easy adaptation of the proofs appeared in . To avoid redundancy in the literature, we provide skecth of the proofs showing the fundamental differences and invite the interested reader to read the details in .

We highlight in blue text the differences between this proof and the one in .

As in we use a standard expansion as well as the stepsize definition.

Next, adding and subtracting fSkf_{\mathcal{S}_{k}}^{*} gives

Since c>12c>\frac{1}{2} it holds that (21c)>0\left(2-\frac{1}{c}\right)>0. We obtain:

where in the last inequality we use that α(21c)[fSk(x)fSk]>0\alpha\left(2-\frac{1}{c}\right)\left[f_{\mathcal{S}_{k}}(x^{*})-f_{\mathcal{S}_{k}}^{*}\right]>0. Note that this factor fSk(x)fSkf_{\mathcal{S}_{k}}(x^{*})-f_{\mathcal{S}_{k}}^{*} pops up in the proof, not in the stepsize! By rearranging:

where xˉK=1Kk=0K1xk\bar{x}^{K}=\frac{1}{K}\sum_{k=0}^{K-1}x^{k}, α:=min{12cLmax,γb}\alpha:=\min\{\frac{1}{2cL_{\max}},\gamma_{b}\} and Lmax=max{Li}i=1nL_{\max}=\max\{L_{i}\}_{i=1}^{n} is the maximum smoothness constant.

Strongly Convex setting.

From convexity of functions fSkf_{\mathcal{S}_{k}} it holds that xkx,fSk(xk)+fSk(xk)fSk(x)0-\langle x^{k}-x^{*},\nabla f_{\mathcal{S}_{k}}(x^{k})\rangle+f_{\mathcal{S}_{k}}(x^{k})-f_{\mathcal{S}_{k}}(x^{*})\leq 0, Sk[n]\forall\mathcal{S}_{k}\subseteq[n]. Thus,

where α:=min{12cLmax,γb}\alpha:=\min\{\frac{1}{2cL_{\max}},\gamma_{b}\} and Lmax=max{Li}i=1nL_{\max}=\max\{L_{i}\}_{i=1}^{n} is the maximum smoothness constant. ∎

Appendix D Lack of convergence of SGD with SPSmax in the non-interpolated setting

We recall the variation of constants formula

For k=1k=1 we get z1=A0z0+ε0z_{1}=A_{0}z_{0}+\varepsilon_{0}. The induction step yields

This completes the proof of the variations of constants formula. ∎

If cj=(j+1)/2c_{j}=(j+1)/2 then (112cj)=jj+1\left(1-\frac{1}{2c_{j}}\right)=\frac{j}{j+1} and therefore

Finally, to get a rate on the distance-shrinking, we take the expectation w.r.t. iki_{k} conditioned on xkx_{k}: the cross-term disappears and we get

Therefore, using the tower property and the variation of constants formula,

D.2 Asymptotic vanishing of the SPS bias in 1d quadratics

As nn\to\infty, it is possible to see that, under some additional assumptions (e.g. aia_{i} Beta-distributed), the variance of i=1naixii=1nai\frac{\sum_{i=1}^{n}a_{i}x_{i}^{*}}{\sum_{i=1}^{n}a_{i}} collapses to zero, hence one has i=1naixii=1naix\frac{\sum_{i=1}^{n}a_{i}x_{i}^{*}}{\sum_{i=1}^{n}a_{i}}\to x^{*} in probability, as nn\to\infty, with rate O(1/n)\mathcal{O}(1/n).

where MX(t)\mathcal{M}_{X}(t) denotes the moment generating function of the Γ(k,λ)\Gamma(k,\lambda) distribution. Next, we solve the integral using the closed-form expression MX(t)=(1tλ)k\mathcal{M}_{X}(t)=\left(1-\frac{t}{\lambda}\right)^{-k} for tλt\leq\lambda (else does not exist). Note that we integrate only for t0t\leq 0 so the MGF is always defined:

where in the third-last inequality we changed variables tλut\to\lambda u and in the second last we changed variables t1sst\to\frac{1-s}{s}.

All in all, we have that Var[i=1naixii=1nai]=O(1/n)\text{Var}\left[\frac{\sum_{i=1}^{n}a_{i}x_{i}^{*}}{\sum_{i=1}^{n}a_{i}}\right]=\mathcal{O}(1/n). This implies that i=1naixii=1nai\frac{\sum_{i=1}^{n}a_{i}x_{i}^{*}}{\sum_{i=1}^{n}a_{i}} converges to xx^{*} in quadratic mean — hence also in probability.

Appendix E Convergence of SGD with DecSPS in the smooth setting

Here we study Decreasing SPS (DecSPS), which combines stepsize decrease with the adaptiveness of SPS.

for k0k\geq 0, where we set c1=c0c_{-1}=c_{0} and γ1=γb\gamma_{-1}=\gamma_{b} (stepsize bound, similar to ), to get

First, note that γk\gamma_{k} is trivially non-increasing since γkck1γk1/ck\gamma_{k}\leq c_{k-1}\gamma_{k-1}/c_{k}. Next, we prove the bounds on γk\gamma_{k}. For k=0k=0, we can directly use Lemma 2:

Next, we proceed by induction: we assume the proposition holds true for γk\gamma_{k}:

Then, we have : γk+1=1ck+1min{fSk+1(xk+1)fSk+1fSk+1(xk+1)2,ι}\gamma_{k+1}=\frac{1}{c_{k+1}}\min\left\{\frac{f_{\mathcal{S}_{k+1}}(x^{k+1})-f^{*}_{\mathcal{S}_{k+1}}}{\|\nabla f_{\mathcal{S}_{k+1}}(x^{k+1})\|^{2}},\iota\right\}, where

by the induction hypothesis. This bound directly implies that the proposition holds true for γk+1\gamma_{k+1}, since again by Lemma 2 we have fSk+1(xk+1)fSk+1fSk+1(xk+1)212Lmax\frac{f_{\mathcal{S}_{k+1}}(x^{k+1})-f^{*}_{\mathcal{S}_{k+1}}}{\|\nabla f_{\mathcal{S}_{k+1}}(x^{k+1})\|^{2}}\geq\frac{1}{2L_{\max}}. This concludes the induction step. ∎

E.2 Proof of Thm. 3

The fundamental problem towards a proof for DecSPS is that the error to control due to gradient stochasticity does not come from the term γk2f(xk)2\gamma_{k}^{2}\|\nabla f(x^{k})\|^{2} in the expansion of xkx2\|x^{k}-x^{*}\|^{2}, as instead is usual for SGD with decreasing stepsizes. Instead, the error comes from the inner product term γkf(xk),xkx\gamma_{k}\langle\nabla f(x^{k}),x^{k}-x^{*}\rangle. Hence, the error is proportional to γk\gamma_{k}, and not γ2\gamma^{2}. As a result, the usual Robbins-Monro conditions do not yield convergence. A similar problem is discussed for AdaGrad in .

In the first version of this paper, the proof contained a small errorThis was in Eq. (87), where we incorrectly bounded (21ck)\left(2-\frac{1}{c_{k}}\right) with the constant 11. This, of course, cannot be done since the term fSk(xk)fSk(x)f_{\mathcal{S}_{k}}(x^{k})-f_{\mathcal{S}_{k}}(x^{*}) is not positive. However, it’s expectation is positive. Since (21ck)\left(2-\frac{1}{c_{k}}\right) is a constant, we avoid bounding it in the early stages and instead bound it after taking the expectation at the end of the proof. The result is unchanged.. This is now fixed, the result is unchanged.

Multiplying by γk\gamma_{k} and rearranging terms we get the fundamental inequality

Using the definition of DecSPS and convexity we get

Let us divide everything by γk>0\gamma_{k}>0.

In step (94), we are able to collect D2D^{2} because (1γk+11γk)0\left(\frac{1}{\gamma_{k+1}}-\frac{1}{\gamma_{k}}\right)\geq 0. This is guaranteed by the new SPS definition (DecSPS), along with the fact that ckc_{k} is increasing. Note that one could not perform this step under the original SPS update rule of .

We conclude by using Jensen’s inequality as follows:

where σ^B2\hat{\sigma}^{2}_{B} is as defined in (3). ∎

Note that, in the convergence rate, the second term does not depend on γb\gamma_{b} while the first does. This is different from the original SPS result , and due to the different proof technique: specifically, we divide by γk\gamma_{k} early in the proof — and not at the end. To point to the exact source of this difference, we invite the reader to inspect Equation 24 in the appendixhttp://proceedings.mlr.press/v130/loizou21a/loizou21a-supp.pdf of : the last term there is proportional to γb/α\gamma_{b}/\alpha, where α\alpha is a lower bound on the SPS and γb\gamma_{b} is an upper bound. In our proof approach, these terms — which bound the same quantity — effectively cancel out (because we divide by γk\gamma_{k} earlier in the proof), at the price of having D2D^{2} in the first term.

E.3 Proof of Prop. 1

We need the following lemma. An illustration of the result can be found in Fig. 6.

Let zk+1=Akzk+εkz^{k+1}=A_{k}z^{k}+\varepsilon_{k} with Ak=(1a/k+1)A_{k}=(1-a/\sqrt{k+1}) and εk=b/k+1\varepsilon_{k}=b/\sqrt{k+1}. If z0>0z^{0}>0, 0<a10<a\leq 1, b>0b>0, then zkmax{z0,b/a}z^{k}\leq\max\{z^{0},b/a\} for all k0k\geq 0.

Simple to prove by induction. The base case is trivial, since z0max{z0,b/a}z^{0}\leq\max\{z^{0},b/a\}. Let us now assume the proposition holds true for zkz^{k} (that is, zkmax{z0,b/a}z^{k}\leq\max\{z^{0},b/a\}) , we want to show it holds true for k+1k+1. We have

If b/a=max{z0,b/a}b/a=\max\{z^{0},b/a\}, then we get, by induction

Else, if z0=max{z0,b/a}z^{0}=\max\{z^{0},b/a\}, then by induction

where the last inequality holds because az0b>0az^{0}-b>0 and aa is positive. This completes the proof. ∎

For y=xy=x^{*} and x=xkx=x^{k}, this implies

Since γk>0\gamma_{k}>0, we can substitute this inequality in Equation (106) and get

where we used the inequality min{12ckLmax,c0γbck}γkc0γbck\min\left\{\frac{1}{2c_{k}L_{\max}},\frac{c_{0}\gamma_{b}}{c_{k}}\right\}\leq\gamma_{k}\leq\frac{c_{0}\gamma_{b}}{c_{k}} (Lemma 1).

Now we seek an upper bound for the contraction factor. Under ck=k+1c_{k}=\sqrt{k+1}, using again Lemma 1 we have, since c0=1c_{0}=1,

Now have all ingredients to bound the iterates: the result follows from Lemma 5 using a=min{μmin2Lmax,μminγb}a=\min\left\{\frac{\mu_{\min}}{2L_{\max}},\mu_{\min}\gamma_{b}\right\} and b=2c0γbσ^B,max2b=2c_{0}\gamma_{b}\hat{\sigma}^{2}_{B,\max}. So, we get

Appendix F Convergence of stochastic subgradient method with DecSPS-NS in the non-smooth setting

In this subsection we consider the DecSPS-NS stepsize in the non-smooth setting:

where gSk(xk)g_{\mathcal{S}_{k}}(x^{k}) is the stochastic subgradient using batch size Sk\mathcal{S}_{k} at iteration kk, and we set c1=c0c_{-1}=c_{0} and γ1=γb\gamma_{-1}=\gamma_{b} to get

First, note that γk\gamma_{k} is trivially non-increasing since γkck1γk1/ck\gamma_{k}\leq c_{k-1}\gamma_{k-1}/c_{k}. Next, we prove the bounds on γk\gamma_{k}.

Without loss of generality, we can work with the simplified stepsize

Similarly to the base case, we can write:

F.2 Proof of Thm. 5

Let us consider the DecSPS stepsize in the non-smooth setting. Using convexity and the gradient bound we get

where the last line follows from definition of subgradient and Lemma 6.

Using the same exact steps as Thm 3, and using the fact that γk\gamma_{k} is decreasing, we arrive at the equation

We conclude by taking the expectation and using Jensen’s inequality. ∎

In the setting of Thm. 5, if ck=c0k+1c_{k}=c_{0}\sqrt{k+1} we have

The bound in Cor. 3 does not depend on σB2\sigma^{2}_{B}, while the one in Cor. 2 does. This is because the proof is different, and does not rely on bounding squared gradients with function suboptimalities (one cannot, if smoothness does not hold). Similarly, usual bounds for non-smooth optimization do not depend on subgradient variance but instead on GG .

Appendix G Further experimental results

A1A dataset (standard normalization) from , consisting in 16051605 datapoints in 123 dimensions. We consider again B=20B=20 but a substantial regularization with λ=0.01\lambda=0.01.

Breast Cancer dataset (standard normalization) , consisting in 569569 datapoints in 39 dimensions. We consider a small batch size B=5B=5 with strong regularization λ=0.1\lambda=0.1.

All experiments reported below are repeated 5 times. Shown is mean and 2 standard deviations.

Comparison with SGD.

In addition to Figure 4 (A1A dataset), in Figure 8 we provide comparison of DecSPS with SGD with stepsize γ0/k+1\gamma_{0}/\sqrt{k+1} for the Synthetic and Breast Cancer datasets. From the results, it is clear that DecSPS with standard parameters c0=1,γb=10c_{0}=1,\gamma_{b}=10 (see discussion in main paper and paragraph above) is comparable if not faster than vanilla SGD with decreasing stepsize.

Comparison with Adagrad-Norm.

In addition to Figure 4 (Breast Cancer dataset), in Figures 10&11&9 we provide comparison of DecSPS with AdaGrad-Norm for the Synthetic and A1A datasets. AdaGrad-norm at each iteration updates the scalar bk+12=bk2+fSk(xk)2b_{k+1}^{2}=b_{k}^{2}+\|\nabla f_{\mathcal{S}_{k}}(x_{k})\|^{2} and then selects the next step as xk+1=xkηbk+1fi(xk)x_{k+1}=x_{k}-\frac{\eta}{b_{k+1}}\nabla f_{i}(x_{k}). Hence, it has tuning parameters b0b_{0} and η\eta; b0=0.1b_{0}=0.1 is recommended in (see their Figure 3). Using this value for b0b_{0} we show in Figure 9 that the performance of DecSPS is competitive against a well-tuned value of the AdaGrad-norm stepsize η\eta. In Figure 10&11 we show the effect of tuning b0b_{0} on the synthetic dataset: no major improvement is observed.

Comparison with Adam and AMSgrad.

We provide a comparison with Adam and AMSgrad . For both methods, we set the momentum parameter to zero (a.k.a RMSprop) for a fair comparison with DecSPS. For β:=β2\beta:=\beta_{2}, the parameter that controls the moving average of the second moments, we select the value 0.990.99 since we found that the standard 0.9990.999 leads to problematic (exploding) stepsizes. Findings are pretty similar for both the A1A (Figure 12) and Breast Cancer (Figure 13) datasets: when compared to DecSPS with the usual parameters, fine-tuned Adam with fixed stepsize can reach the same performance after a few tens of thousand iterations — however, it is much slower at the beginning of training. While deriving convergence guarantees for Adam is problematic , AMSgrad with stepsize η/k+1\eta/\sqrt{k+1} enjoys a convergence guarantee similar to Adagrad and Adagrad-Norm. This is reflected in the empirical convergence: fine-tuned AMSgrad is able to match the convergence of DecSPS with the usual parameters motivated at the beginning of this section. Yet, we recall that the convergence guarantees of AMSgrad require the iterates to live in a bounded domain, an assumption which is not needed in our DecSPS (see § 5.2).

Performance under light regularization.

If the problem at hand does not have strong curvature information, e.g. there is very light regularization, then additional tuning of the DecSPS parameters is required. Figure 14 shows that it is possible to retrieve the performance of SGD also with light regularization parameters (1e4,1e61e-4,1e-6) under additional tuning of c0c_{0} and γb\gamma_{b}.