The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli

Introduction

Large scale machine learning relies almost exclusively on stochastic optimization methods (Bottou and Bousquet, 2007), which include stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants Duchi et al. (2011); Johnson and Zhang (2013). In this work, we restrict our attention to the SGD algorithm where we are concerned with the behavior of the final iterate (i.e. the last point when we terminate the algorithm). A majority of (minimax optimal) theoretical results for SGD focus on polynomially decaying stepsizes (Dekel et al., 2012; Rakhlin et al., 2012; Lacoste-Julien et al., 2012; Bubeck, 2014) (or constant stepsizes (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016) for the case of least squares regression) coupled with iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992) to achieve minimax optimal rates of convergence. However, practical SGD implementations typically return the final iterate of a stochastic gradient procedure. This line of work in theory (based on iterate averaging) and its discrepancy with regards to practice leads to the question with regards to the behavior of SGD’s final iterate. Indeed, this question has motivated several efforts in stochastic convex optimization literature as elaborated below.

Non-Smooth Stochastic Optimization: The work of Shamir (2012) raised the question with regards to the behavior of SGD’s final iterate for non-smooth stochastic optimization (with/without strong convexity). The work of Shamir and Zhang (2012) answered this question, indicating that SGD’s final iterate with polynomially decaying stepsizes achieves near minimax rates (up to log\log factors) in an anytime (i.e. in a limiting) sense (when number of iterations SGD is run for is not known in advance). Under specific choices of step size sequences, Shamir and Zhang (2012)’s result on SGD’s final iterate is tight owing to the recent work of Harvey et al. (2018). More recently Jain et al. (2019) presented an approach indicating that a more nuanced stepsize sequence serves to achieve minimax rates (up to constant factors) for the non-smooth stochastic optimization setting when the end time TT is known in advance.

Least Squares Regression (LSR): In contrast to the non-smooth setting, the state of our understanding of SGD’s final iterate for smooth stochastic convex optimization, or, say, the streaming least squares regression setting is far less mature - this gap motivates our paper’s contributions. In particular, this paper studies SGD’s final iterate behavior under various stepsize choices for least squares regression (with and without strong convexity). The use of SGD’s final iterate for the least mean squares objective has featured in several efforts (Widrow and Hoff, 1960; Proakis, 1974; Widrow and Stearns, 1985; Roy and Shynk, 1990), but these results do not achieve minimax rates of convergence, which leads to the following question:

“ Can polynomially decaying stepsizes (known to achieve minimax rates when coupled with iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992)) offer minimax optimal rates on SGD’s final iterate when optimizing the streaming least squares regression objective? If not, is there any other family of stepsizes that can guarantee minimax rates on the final iterate of stochastic gradient descent? ”

This paper presents progress on answering the above question - refer to contributions below for more details. Note that the oracle model employed by this work (to quantify SGD’s final iterate behavior) has featured in a string of recent results that present a non-asymptotic understanding of SGD for least squares regression, with the caveat being that these results crucially rely on iterate averaging with constant stepsize sequences (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016, 2017b, 2017a; Neu and Rosasco, 2018).

This work establishes upper and lower bounds on the behavior of SGD’s final iterate, as run with polynomially decaying stepsizes as well as step decay schedules which tends to cut the stepsize by a constant factor after every constant number of epochs (see algorithm 1), by considering the streaming least squares regression problem (with and without strong convexity). Our main result indicates that step decay schedules offer significant improvements in achieving near minimax rates over polynomially decaying stepsizes in the known horizon case (when the end time TT is known in advance). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional synthetic least squares objective. Table 1 provides a summary. Finally, we present results that indicate the subtle (yet significant) differences between the known time horizon case and the anytime (i.e. the limiting) behavior of SGD’s final iterate (see below). Note that proofs of our main claims can be found in the appendix.

Sub-optimality of polynomially decaying stepsizes: For the strongly convex least squares case, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e. with ηt1/tα\eta_{t}\propto 1/t^{\alpha}, with α[0.5,1]\alpha\in[0.5,1]) is off the minimax rate dσ2/Td\sigma^{2}/T by a factor of the condition number of the problem. For the non-strongly convex case of least squares, we show that any polynomially decaying stepsize can achieve a rate no better than dσ2/Td\sigma^{2}/\sqrt{T} (up to log\log factors), while the minimax rate is dσ2/Td\sigma^{2}/T.

Near-optimality of the step-decay scheme: Given a fixed end time TT, the step-decay scheme (algorithm 1) presents a final iterate that is off the statistical minimax rate by just a log(T)\log(T) factor when optimizing the strongly convex and non-strongly convex least squares regression This dependence can be improved to log\log of the condition number of the problem (for the strongly convex case) using a more refined stepsize decay scheme., thus indicating vast improvements over polynomially decaying stepsize schedules. We note here that our Theorem 2 for the non-strongly case offers a rate on the initial error (i.e., the bias term) that is off the best known rate (Bach and Moulines, 2013) (that employs iterate averaging) by a dimension factor. That said, Algorithm 1 is rather straightforward and employs the knowledge of just an initial learning rate and number of iterations for its implementation.

SGD has to query bad iterates infinitely often: For the case of optimizing strongly convex least squares regression, this work shows that any stochastic gradient procedure (in a lim sup\limsup sense) must query sub-optimal iterates (off by nearly a condition number) infinitely often.

Complementary to our theoretical results for the stochastic linear regression, we evaluate the empirical performance of different learning rate schemes when training a residual network on the cifar-10 dataset and observe that the continuous variant of step decay schemes (i.e. an exponential decay) indeed compares favorably to polynomially decaying stepsizes.

While the upper bounds established in this paper (section 3.2) merit extensions towards broader smooth convex functions (with/without strong convexity), the lower bounds established in sections 3.1, 3.3 present implications towards classes of smooth stochastic convex optimization. Even in terms of upper bounds, note that there are fewer results on non-asymptotic behavior of SGD (beyond least squares) when working in the oracle model considered in this work (see below). Bach and Moulines (2011, 2013); Bach (2014); Needell et al. (2016) are exceptions, yet they do not achieve minimax rates on appropriate problem classes; Frostig et al. (2015) does not work in standard stochastic first order oracle model (Nemirovsky and Yudin, 1983; Agarwal et al., 2012), so their work is not directly comparable to examine extensions towards broader function classes.

As a final note, this paper’s result on the sub-optimality of standard polynomially decaying stepsizes for classes of smooth and strongly convex optimization doesn’t contradict the (minimax) optimality results in stochastic approximation (Polyak and Juditsky, 1992). Iterate averaging coupled with polynomially decaying learning rates (or constant learning rates for least squares (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016)) does achieve minimax rates (Ruppert, 1988; Polyak and Juditsky, 1992). However, as mentioned previously, this work deals with SGD’s final iterate behavior (i.e. without iterate averaging), since this bears more relevance towards practice.

Related work: Robbins and Monro (1951) introduced the stochastic approximation problem and Stochastic Gradient Descent (SGD). They present conditions on stepsize schemes satisfied by asymptotically convergent algorithms: these schemes are referred to as “convergent” stepsize sequences. Ruppert (1988); Polyak and Juditsky (1992) proved the asymptotic optimality of iterate averaged SGD with larger stepsize sequences. In terms of oracle models and notions of optimality, there exists two lines of thought (see also Jain et al. (2017b)):

Towards statistically optimal estimation procedures: The goal of this line of thought is to match the excess risk of the statistically optimal estimator (Anbar, 1971; Kushner and Clark, 1978; Polyak and Juditsky, 1992; Lehmann and Casella, 1998) on every problem instance. Several efforts consider SGD in this oracle (Bach and Moulines, 2011; Bach, 2014; Dieuleveut and Bach, 2015; Frostig et al., 2015; Needell et al., 2016) presenting non-asymptotic results, often with iterate averaging. With regards to least squares, Bach and Moulines (2013); Défossez and Bach (2015); Frostig et al. (2015); Jain et al. (2016, 2017b); Neu and Rosasco (2018) use constant step-size SGD with iterate averaging to achieve minimax rates (on a per-problem basis) in this oracle model. SGD’s final iterate behavior for least squares has featured in several efforts in the signal processing/controls literature (Widrow and Hoff, 1960; Nagumo and Noda, 1967; Proakis, 1974; Widrow and Stearns, 1985; Roy and Shynk, 1990; Sharma et al., 1998), without achieving minimax rates. This paper works in this oracle model and analyzes SGD’s final iterate behavior with various stepsize choices.

Towards optimality under bounded noise assumptions: The other line of thought presents algorithms with access to stochastic gradients satisfying bounded noise assumptions, aiming to match lower bounds provided in Nemirovsky and Yudin (1983); Raginsky and Rakhlin (2011); Agarwal et al. (2012). Asymptotic properties of “convergent” stepsize schemes have been studied in great detail (Kushner and Clark, 1978; Benveniste et al., 1990; Ljung et al., 1992; Bharath and Borkar, 1999; Kushner and Yin, 2003; Lai, 2003; Borkar, 2008). Dekel et al. (2012); Lacoste-Julien et al. (2012); Rakhlin et al. (2012); Ghadimi and Lan (2012, 2013b); Hazan and Kale (2014); Bubeck (2014); Dieuleveut et al. (2016) use iterate averaged SGD to achieve minimax rates for various problem classes non-asymptotically. Allen-Zhu (2018) present an alternative approach towards minimizing the gradient norm with access to stochastic gradients. As noted, Shamir and Zhang (2012) achieves anytime optimal rates (upto a logT\log{T} factor) with the final iterate of an SGD procedure, and this is shown to be tight with the recent work of Harvey et al. (2018). Jain et al. (2019) achieve minimax rates on the final iterate using a nuanced stepsize scheme when the number of iterations is fixed in advance.

Geometrically Decaying Stepsize Schedules date to Goffin (1977). Davis and Drusvyatskiy (2019) employ the stepdecay schedule to prove high-probability guarantees for SGD with strongly convex objectives. In stochastic optimization, several other works, including Ghadimi and Lan (2013a); Hazan and Kale (2014); Aybat et al. (2019); Kulunchakov and Mairal (2019) consider doubling argument based approaches, where the epoch length is doubled everytime the stepsizes are halved. The step decay schedule is employed to yield faster rates of convergence under certain growth (and related) conditions both in convex (Xu et al., 2016) and non-convex settings (Yang et al., 2018; Davis et al., 2019).

Paper organization: Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.

Problem Setup

Notation: We present the setup and associated notation in this section. We represent scalars with normal font a,b,La,b,L etc., vectors with boldface lowercase characters a,b\mathbf{a},\mathbf{b} etc. and matrices with boldface uppercase characters A,B\mathbf{A},\mathbf{B} etc. We represent positive semidefinite (PSD) ordering between two matrices using \succeq. The symbol \gtrsim represents that the inequality holds for some universal constant.

We consider here the minimization of the following expected square loss objective:

where, ϵ\epsilon is the noise on the example pair (x,y)D(\mathbf{x},y)\sim\mathcal{D} and w\mathbf{w}^{*} is a minimizer of the objective f(w)f(\mathbf{w}). Given an initial iterate w0\mathbf{w}_{0} and stepsize sequence {ηt}\{\eta_{t}\}, our stochastic gradient update is:

We assume that the noise ϵ=yw,x    (x,y)D\epsilon=y-\langle\mathbf{w}^{*},\mathbf{x}\rangle\ \ \forall\ \ (\mathbf{x},y)\sim\mathcal{D} satisfies the following condition:

Next, assume that covariates x\mathbf{x} satisfy the following fourth moment inequality:

This assumption is satisfied, say, when the norm of the covariates supx2<R2\sup\left\|\mathbf{x}\right\|^{2}<R^{2}, but is true more generally. Finally, note that both 3 and 4 are general and are used in recent works (Bach and Moulines, 2013; Jain et al., 2016) that present a sharp analysis of SGD for streaming least squares problem. Next, we denote by

The rate of (1+o(1))σ2d/t\left(1+o(1)\right)\cdot{\sigma^{2}d}/{t} is achieved using iterate averaged SGD (Ruppert, 1988; Polyak and Juditsky, 1992) with constant stepsizes (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016). This rate of σ2d/t\sigma^{2}d/t is called the statistical minimax rate.

Main results

Sections 3.1, 3.2 consider the fixed time horizon setting; the former presents the significant sub-optimality of polynomially decaying stepsizes on SGD’s final iterate behavior, the latter section presenting the near-optimality of SGD’s final iterate. Section 3.3 presents negative results on SGD’s final iterate behavior (irrespective of stepsizes employed), in the anytime (i.e. limiting) sense.

This section begins by showing that there exist problem instances where polynomially decaying stepsizes considered stochastic approximation theory (Robbins and Monro, 1951; Polyak and Juditsky, 1992) i.e., those of the form ab+tα\frac{a}{b+t^{\alpha}}, for any choice of a,b>0a,b>0 and α[0.5,1]\alpha\in[0.5,1] are significantly suboptimal (by a factor of the condition number of the problem, or by T\sqrt{T} in the smooth case) compared to the statistical minimax rate (Kushner and Clark, 1978).

Under assumptions 3, 4, there exists a class of problem instances where the following lower bounds on excess risk hold on SGD’s final iterate with polynomially decaying stepsizes when given access to the oracle as written in equation 2.

Strongly convex case: Suppose μ>0\mu>0. For any condition number κ\kappa, there exists a least squares problem instance with initial suboptimality f(w0)f(w)σ2df(\mathbf{w}_{0})-f(\mathbf{w}^{*})\leq\sigma^{2}d such that, for any Tκ43T\geq\kappa^{\frac{4}{3}}, and for all a,b0a,b\geq 0 and 0.5α10.5\leq\alpha\leq 1, and for the learning rate scheme ηt=ab+tα\eta_{t}=\frac{a}{b+t^{\alpha}}, we have

Smooth case: For any fixed T>1T>1, there exists a least squares problem instance such that, for all a,b0a,b\geq 0 and 0.5α10.5\leq\alpha\leq 1, and for the learning rate scheme ηt=ab+tα\eta_{t}=\frac{a}{b+t^{\alpha}}, we have

For both cases (with/without strong convexity), the minimax rate is σ2d/T\sigma^{2}d/T. In the strongly convex case, SGD’s final iterate with polynomially decaying stepsizes pays a suboptimality factor of Ω(κ)\Omega(\kappa), whereas, in the smooth case, SGD’s final iterate pays a suboptimality factor of Ω(TlogT)\Omega\left(\frac{\sqrt{T}}{\log T}\right).

2 Near optimality of Step Decay schemes

Given the knowledge of an end time TT when the algorithm is terminated, this section presents the step decay schedule (Algorithm 1), which offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a log(T)\log(T) factor).

Suppose we are given access to the stochastic gradient oracle 2 satisfying Assumptions 3 and 4. Running Algorithm 1 with an initial stepsize of η1=1/(2R2)\eta_{1}=1/(2R^{2}) allows the algorithm to achieve the following excess risk guarantees.

Strongly convex case: Suppose μ>0\mu>0. We have:

While theorem 2 presents significant improvements over polynomial decay schemes, as mentioned in the contributions, the above result presents a worse rate on the initial error (by a dimension factor) in the smooth case (i.e. non-strongly convex case), compared to the best known result (Bach and Moulines, 2013), which relies heavily on iterate averaging to remove this factor. It is an open question with regards to whether this factor can actually be improved or not. Furthermore, comparing the initial error dependence between the lower bound for the smooth case (Theorem 1) with the upper bound for the step decay scheme, we believe that the dependence on the smoothness LL should be improved to one on the R2R^{2}.

In terms of the variance, however, note that the polynomial decay schemes, are plagued by a polynomial dependence on the condition number κ\kappa (for the strongly convex case), and are off the minimax rate by a T\sqrt{T} factor (for the smooth case). The step decay schedule, on the other hand, is off the minimax rate (Ruppert, 1988; Polyak and Juditsky, 1992; Van der Vaart, 2000) by only a log(T)\log(T) factor. It is worth noting that Algorithm 1 admits an efficient implementation in that it requires the knowledge only of R2R^{2} (similar to iterate averaging results (Bach and Moulines, 2013; Jain et al., 2016)) and the end time TT. Finally, note that this logT\log T factor can be improved to a logκ\log\kappa factor for the strongly convex case by using an additional polynomial decay scheme before switching to the Step Decay scheme.

Suppose we have access to the stochastic gradient oracle 2 satisfying Assumptions 3 and 4. Let μ>0\mu>0 and κ2\kappa\geq 2. For any problem and fixed time horizon T/logT>5κT/\log T>5\kappa, there exists a learning rate scheme that achieves

In order to have improved the dependence on the variance from log(T)\log(T) (in theorem 2) to log(κ)\log(\kappa) (in proposition 3), we require access to the strong convexity parameter μ=λmin(H)\mu=\lambda_{\textrm{min}}(\mathbf{H}) in addition to R2R^{2} and knowledge of the end time TT. This parallels results known for general strongly convex setting (Rakhlin et al., 2012; Lacoste-Julien et al., 2012; Shamir and Zhang, 2012; Bubeck, 2014; Jain et al., 2019).

As a final remark, note that this section’s results (on step decay schemes) assumed the knowledge of a fixed time horizon TT. In contrast, most results SGD’s averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?

3 SGD queries bad points infinitely often

This section shows that obtaining near minimax rates with the final iterate is not possible without knowledge of the time horizon TT. Concretely, we show that irrespective of the learning rates employed (be it polynomially decaying or step-decay), SGD requires to query a point with sub-optimality Ω(κ/logκ)σ2d/T\Omega(\kappa/\log\kappa)\cdot\sigma^{2}d/T for infinitely many time steps TT.

Suppose we have access to a stochastic gradient oracle 2 satisfying Assumption 3, 4. There exists a universal constant C>0C>0, and a problem instance such that SGD algorithm with any ηt1/2R2\eta_{t}\leq 1/2R^{2} for all ttLearning rate more than 2/R22/R^{2} will make the algorithm diverge., we have

The bad points guaranteed to exist by Theorem 4 are not rare. We show that such points occur at least once in O(κlogκ)\mathcal{O}\left(\frac{\kappa}{\log\kappa}\right) iterations. Refer to Theorem 16 in appendix D in the appendix.

Experimental Results

We present experimental validation on the suitability of the Step-decay schedule (or more precisely, its continuous counterpart, which is the exponentially decaying schedule), and compare its with the polynomially decaying stepsize schedules. In particular, we consider the use of: ηt=η01+bt\eta_{t}=\frac{\eta_{0}}{1+b\cdot t} (5) ηt=η01+bt\eta_{t}=\frac{\eta_{0}}{1+b\sqrt{t}} (6) \vspace1.3emηt=η0exp(bt).\vspace*{1.3em}\eta_{t}=\eta_{0}\cdot\exp{(-b\cdot t)}. (7) Where, we perform a systematic grid search on the parameters η0\eta_{0} and bb. In the section below, we consider a real world non-convex optimization problem of training a residual network on the cifar-10 dataset, with an aim to illustrate the practical implications of the results described in the paper. Refer to Appendix E for more details.

Our experiments are based on grid searching for the best learning rate decay scheme on the parametric family of learning rate schemes described above (5),(6),(7); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset) and with models trained on the remaining 4500045000 samples. For presenting the final numbers in the plots/tables, we employ the best hyperparameters from the validation stage and train it on the entire 50,00050,000 samples and average results run with 1010 different random seeds. The parameters for grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing grid search parameter lies in the interior of our grid search.

How does the step decay scheme compare with the polynomially decaying stepsizes? Figure 2 and Table 2 present a comparison of the performance of the three schemes (5)-(7). These results demonstrate that the exponential scheme convicingly outperforms the polynomial step-size schemes.

Does suffix iterate averaging improve over final iterate’s behavior for polynomially decaying stepsizes? Towards answering this question, firstly, we consider the best performing values of equation 5 and 6, and then, average iterates of the algorithm starting from 5,10,20,40,80,85,90,95,995,10,20,40,80,85,90,95,99 epochs when training the model for a total of 100100 epochs. While such iterate averaging (and their suffix) variants have strong theoretical support for (stochastic) convex optimization (Ruppert, 1988; Polyak and Juditsky, 1992; Rakhlin et al., 2012; Bubeck, 2014; Jain et al., 2016), their impact on non-convex optimization is largely debatable. Nevertheless, this experiments’s results (figure 3) indicates that suffix averaging tends to hurt the algorithm’s generalization behavior (which is unsurprising given the non-convex nature of the objective). Note that, figure 3 serves to indicate that averaging the final few (5\leq 5) epochs tends to offer nearly the same result as the final iterate’s behavior, indicating that the gains of using suffix iterate averaging are relatively limited for several such settings.

Does our result on “knowing” the time horizon (for step-decay schedule) present implications towards hyper-parameter search methods that work based on results from truncated runs? Towards answering this question, consider the figure 4 and Tables 3 and 4, which present a comparison of the performance of three exponential decay schemes each of which is tuned to achieve the best performance at 3333, 6666 and 100100 epochs respectively. The key point to note is that best performing hyperparameters at 3333 and 6666 epochs are not the best performing at 100100 epochs (which is made stark from the perspective of the validation error - refer to table 4). This demonstrates that hyper parameter selection methods that tend to discard hyper-parameters which don’t perform well at earlier stages of the optimization (i.e. based on comparing results on truncated runs), which, for e.g., is indeed the case with hyperband (Li et al., 2017), will benefit from a round of rethinking.

Conclusions and Discussion

The main contribution of this work shows that the behavior of SGD’s final iterate for least squares regression is much more nuanced than what has been indicated by prior efforts that have primarily considered non-smooth stochastic convex optimization. The results of this paper point out the striking limitations of polynomially decaying stepsizes on SGD’s final iterate, as well as sheds light on the effectiveness of starkly different schemes based on a Step Decay schedule. Somewhat coincidentally, practical implementations for certain classes of stochastic optimization do return the final iterate of SGD with step decay schedule - this connection does merit an understanding through future work.

Rong Ge acknowledges funding from NSF CCF-17046561704656, NSF CCF-18451711845171 (CAREER), Sloan Fellowship and Google Faculty Research Award. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, NSF Award 1740551, and ONR award N0001400014-1818-11-22472247. Rahul Kidambi acknowledges funding from NSF Award 17408221740822.

References

Appendix A Preliminaries

Before presenting the lemmas establishing the behavior of SGD under various learning rate schemes, we introduce some notation. We recount that the SGD update rule denoted through:

We then write out the expression for the stochastic gradient f^(wt1)\widehat{\nabla f}(\mathbf{w}_{t-1}).

where, given the stochastic gradient corresponding to an example (xt,yt)D(\mathbf{x}_{t},y_{t})\sim\mathcal{D}, with yt=w,xt+ϵty_{t}=\left\langle\mathbf{w}^{*},\mathbf{x}_{t}\right\rangle+\epsilon_{t}, the above stochastic gradient expression naturally follows. Now, in order to analyze the contraction properties of the SGD update rule, we require the following notation:

[For e.g. Appendix A.2.2 from Jain et al. (2016)] Bias-Variance tradeoff: Running SGD for TT-steps starting from w0\mathbf{w}_{0} and a stepsize sequence {ηt}t=1T\{\eta_{t}\}_{t=1}^{T} presents a final iterate wT\mathbf{w}_{T} whose excess risk is upper-bounded as:

One can view the contribution of the above two terms as ones stemming from SGD’s updates, which can be written as:

For more details, refer to Défossez and Bach (2015).

Now, in order to bound the total error, note that the original stochastic process associated with SGD’s updates can be decoupled into two (simpler) processes, one being the noiseless process (which corresponds to reducing the dependence on the initial error, and is termed “bias”), i.e., where, the recurrence evolves as:

The second recursion corresponds to the dependence on the noise (termed as variance), wherein, the process is initiated at the solution, i.e. w0var=w\mathbf{w}_{0}^{\text{var}}=\mathbf{w}^{*} and is driven by the noise ntn_{t}. The update for this process corresponds to:

We represent by BtB_{t} the covariance of the ttht^{\text{th}} iterate of the bias process, i.e.,

Firstly, note that this naturally implies that the sequence of covariances VτV_{\tau}, τ=1,,T\tau=1,\cdots,T initialized at (say), the solution, i.e., V0=0\mathbf{V}_{0}=0 naturally grows to its steady state covariance, i.e.,

See lemma 3 of Jain et al. (2017a) for more details. Furthermore, what naturally follows in relating VtV_{t} to Vt1V_{t-1} is:

Running SGD with a (constant) stepsize sequence η<1/R2\eta<1/R^{2} achieves the following steady-state covariance:

Suppose η=1/2R2\eta=1/{2R^{2}}, and V0=ησ21ηR2I=2ησ2IV_{0}=\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}=2\eta\sigma^{2}\mathbf{I}. For any sequence of learning rates ηtη=1/2R2  t{1,,t}\eta_{t}\leq\eta=1/{2R^{2}}\ \forall\ t\in\{1,\cdots,t\}, then,

We will prove the lemma using an inductive argument. The base case, i.e. t=0t=0 follows from the problem statement. Note also that for SGD, V0=0V_{0}=0 implying the statement naturally follows. If, say, VtV_{t} satisfies the equation above, from equation 10, we have the following covariance for Vt+1V_{t+1}:

(Reduction from Multiplicative noise oracle) Let VtV_{t} be the (expected) covariance of the variance error. Then, the recursion that connects Vt+1V_{t+1} to VtV_{t} can be expressed as:

From equation 10, we already know that the evolution of the co-variance of the variance error follows:

Where the steps follow from lemma 7, and owing from the fact that ηtη=1/2R2    t\eta_{t}\leq\eta=1/2R^{2}\ \ \forall\ \ t. ∎

Note: Basically, one could analyze an auxiliary process driven by noise with variance off by a factor of two and convert the analysis into one involving exact (deterministic) gradients.

[Bias decay - strongly convex case] Let the minimal eigenvalue of the Hessian μ=λmin(H)>0\mu=\lambda_{\textrm{min}}(\mathbf{H})>0. Consider the bias recursion as in equation 8 with the stepsize set as η=1/(2R2)\eta=1/(2R^{2}). Then,

The proof follows through straight forward computations:

[Reduction of the bias recursion with multiplicative noise to one resembling the variance recursion] Consider the bias recursion that evolves as

Then, the following recursion holds γt1/R2\forall\gamma_{t}\leq 1/R^{2}:

The result follows owing to the following computations:

with the last inequality holding true if the squared distance to the optimum doesn’t grow as a part of the recursion. We prove that this indeed is the case below:

Recursively applying the above argument yields the desired result. ∎

Note: This result implies that the bias error (in the smooth non-strongly convex case of the least squares regression with multiplicative noise) can be bounded by employing a similar lemma as that of the variance, where one can look at the quantity R2w0w22R^{2}\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|^{2}_{2} as the analog of the variance σ2\sigma^{2} that drives the process.

[Lower bounds on the additive noise oracle imply ones for the multiplicative noise oracle] Under the assumption that the covariance of noise Σ=σ2H\Sigma=\sigma^{2}\mathbf{H}, the following statement holds. Let VtV_{t} be the (expected) covariance of the variance error. Then, the recursion that connects Vt+1V_{t+1} to VtV_{t} can be expressed as:

Let us consider firstly, the setting of (bounded) additive noise. Here, we have:

Then, updates leading upto time t+1t+1 can be written as:

This implies the covariance of the variance error is:

Now, let us consider the statement of the lemma:

Appendix B Proofs of results in Section 3.1

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

The following lower bounds hold on the final iterate of a Stochastic Gradient procedure with access to the above stochastic gradients when using polynomially decaying stepsizes.

Strongly convex case: Suppose μ>0\mu>0. For any condition number κ\kappa, there exists a problem instance with initial suboptimality f(w0)f(w)σ2df(\mathbf{w}_{0})-f(\mathbf{w}^{*})\leq\sigma^{2}d such that, for any Tκ43T\geq\kappa^{\frac{4}{3}}, and for all a,b0a,b\geq 0 and 0.5α10.5\leq\alpha\leq 1, and for the learning rate scheme ηt=ab+tα\eta_{t}=\frac{a}{b+t^{\alpha}}, we have

Smooth case: For any fixed T>1T>1, there exists a problem instance such that, for all a,b0a,b\geq 0 and 0.5α10.5\leq\alpha\leq 1, and for the learning rate scheme ηt=ab+tα\eta_{t}=\frac{a}{b+t^{\alpha}}, we have

We consider a recursion for vt(i)v^{\left(i\right)}_{t} with eigenvalue λi\lambda_{i} (κ\kappa or 1{1}{}). By the design of the algorithm, we know

Let s(η,λ)=λσ2η21(1ηλ)2s(\eta,\lambda)=\frac{\lambda\sigma^{2}\eta^{2}}{1-(1-\eta\lambda)^{2}} be the solution to the stationary point equation x=(1ηλ)2+λσ2η2x=(1-\eta\lambda)^{2}+\lambda\sigma^{2}\eta^{2}. Intuitively if we keep using the same learning rate η\eta, then vt(i)v^{\left(i\right)}_{t} is going to converge to s(η,λi)s(\eta,\lambda_{i}). Also note that s(η,λ)σ2η/2s(\eta,\lambda)\approx\sigma^{2}\eta/2 when ηλ1\eta\lambda\ll 1.

We first prove the following claim showing that eventually the variance in direction ii is going to be at least s(ηT,λi)s(\eta_{T},\lambda_{i}).

Suppose s(ηt,λi)v0(i)s(\eta_{t},\lambda_{i})\leq v^{\left(i\right)}_{0}, then vt(i)s(ηt,λi)v^{\left(i\right)}_{t}\geq s(\eta_{t},\lambda_{i}).

In this form, it is easy to see that the iteration is a contraction towards s(ηt,λi)s(\eta_{t},\lambda_{i}). Further, vt+1(i)s(ηt,λi)v^{\left(i\right)}_{t+1}-s(\eta_{t},\lambda_{i}) and vt(i)s(ηt,λi)v^{\left(i\right)}_{t}-s(\eta_{t},\lambda_{i}) have the same sign. In particular, let t0t_{0} be the first time such that s(ηt,λi)v0(i)s(\eta_{t},\lambda_{i})\leq v^{\left(i\right)}_{0} (note that ηt\eta_{t} is monotone and so is s(ηt,λi)s(\eta_{t},\lambda_{i})), it is easy to see that vt(i)v0(i)v^{\left(i\right)}_{t}\geq v^{\left(i\right)}_{0} when tt0t\leq t_{0}. Therefore we know vt0(i)s(ηt0,λi)v^{\left(i\right)}_{t_{0}}\geq s(\eta_{t_{0}},\lambda_{i}), by the recursion this implies vt0+1(i)s(ηt0,λi)s(ηt0+1,λi)v^{\left(i\right)}_{t_{0}+1}\geq s(\eta_{t_{0}},\lambda_{i})\geq s(\eta_{t_{0}+1},\lambda_{i}). The claim then follows from a simple induction. ∎

If s(ηT,λi)v0(i)s(\eta_{T},\lambda_{i})\geq v^{\left(i\right)}_{0} for i=1i=1 or i=di=d then the error is at least σ2d/2κσ2d/T\sigma^{2}d/2\geq\kappa\sigma^{2}d/T and we are done. Therefore we must have s(ηT,κ)v0(1)=3σ2/κs(\eta_{T},\kappa)\leq v^{\left(1\right)}_{0}=3\sigma^{2}/\kappa, and by Claim 1 we know vT(1)s(ηT,κ)σ2ηT/2v^{\left(1\right)}_{T}\geq s(\eta_{T},\kappa)\geq\sigma^{2}\eta_{T}/2. The function value is at least

In the second case, b<Tαb<T^{\alpha}. Since 116TηT=ab+Tαa2Tα\frac{1}{16T}\geq\eta_{T}=\frac{a}{b+T^{\alpha}}\geq\frac{a}{2T^{\alpha}}, we have a18Tα1a\leq\frac{1}{8}T^{\alpha-1}. The sum of learning rates satisfy

We consider a recursion for vt(i)v^{\left(i\right)}_{t} with eigenvalue λi\lambda_{i} (1 or 1κ\frac{1}{\kappa}). By the design of the algorithm, we know

Let s(η,λ)=λσ2η21(1ηλ)2s(\eta,\lambda)=\frac{\lambda\sigma^{2}\eta^{2}}{1-(1-\eta\lambda)^{2}} be the solution to the stationary point equation x=(1ηλ)2+λσ2η2x=(1-\eta\lambda)^{2}+\lambda\sigma^{2}\eta^{2}. Intuitively if we keep using the same learning rate η\eta, then vt(i)v^{\left(i\right)}_{t} is going to converge to s(η,λi)s(\eta,\lambda_{i}). Also note that s(η,λ)σ2η/2s(\eta,\lambda)\approx\sigma^{2}\eta/2 when ηλ1\eta\lambda\ll 1.

If s(ηT,λi)v0(i)s(\eta_{T},\lambda_{i})\geq v^{\left(i\right)}_{0} for i=1i=1 or i=di=d then the error is at least σ2d/2κκσ2d/T\sigma^{2}d/2\kappa\geq\kappa\sigma^{2}d/T and we are done. Therefore we must have s(ηT,κ)v0(1)=3σ2/κs(\eta_{T},\kappa)\leq v^{\left(1\right)}_{0}=3\sigma^{2}/\kappa, and by Claim 1 we know vT(1)s(ηT,κ)σ2ηT/2v^{\left(1\right)}_{T}\geq s(\eta_{T},\kappa)\geq\sigma^{2}\eta_{T}/2. The function value is at least

In the second case, b<Tαb<T^{\alpha}. Since κ16TlogTηT=ab+Tαa2Tα\frac{\kappa}{16T\log T}\geq\eta_{T}=\frac{a}{b+T^{\alpha}}\geq\frac{a}{2T^{\alpha}}, we have a18logTκTα1a\leq\frac{1}{8\log T}\kappa T^{\alpha-1}. The sum of learning rates satisfy

Here the second inequality uses the fact that Tα1iαi1T^{\alpha-1}i^{-\alpha}\leq i^{-1}. Similarly, we also know

The proof of theorem 1 follows straightforwardly when combining the result of lemma 11 and theorem 12. ∎

Appendix C Proofs of results in Section 3.2

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

Running Algorithm 1 with an initial stepsize of η1=1/R2\eta_{1}=1/R^{2}, starting from the solution, i.e. w0=w\mathbf{w}_{0}=\mathbf{w}^{*} allows the algorithm to obtain the following dependence on the variance error:

Plugging (12) and (13) into (11), we obtain

The function suboptimality can now be bounded as

Smooth case: The result follows by instantiating σ^2\hat{\sigma}^{2} in theorem 13 with 2σ22\sigma^{2} (lemma 8) and R2w0w22R^{2}\left\|\mathbf{w}_{0}-\mathbf{w}^{*}\right\|_{2}^{2} (lemma 10) and using the lemma 5 to obtain the result.

Strongly convex case: As with the smooth case, the result relies on instantiating theorem 13 with 2σ22\sigma^{2} (lemma 8) and using lemma 9 and then appealing to lemma 5. ∎

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

There exists a stepsize scheme with which, by starting at the solution (i.e. w0=w\mathbf{w}_{0}=\mathbf{w}^{*}) the algorithm obtains the following dependence on the variance error, under the assumption that μ>0\mu>0 and κ2\kappa\geq 2.

Recall the variance in the kthk^{\textrm{th}} coordinate can be upper bounded by

We will consider the first T/3T/3 steps. The guarantee that we will prove for these iterations is: for any tAt\leq A, vt(k)(1λ(k)/R2)2tv0(k)+σ2R2v^{\left(k\right)}_{t}\leq(1-\lambda^{(k)}/R^{2})^{2t}v^{\left(k\right)}_{0}+\frac{\sigma^{2}}{R^{2}}.

This can be proved easily by induction. Clearly this is true when t=0t=0. Suppose it is true for t1t-1, let’s consider step tt. By recursion of vt(k)v^{\left(k\right)}_{t} we know

Here the second step uses induction hypothesis and the third step uses the fact that (1x)2+x1(1-x)^{2}+x\leq 1 when xx\in. In particular, since (1λ(k)/R2)2T/3(11/κ)2T/3(11/κ)3κlogT=1/T3(1-\lambda^{(k)}/R^{2})^{2T/3}\leq(1-1/\kappa)^{2T/3}\leq(1-1/\kappa)^{3\kappa\log T}=1/T^{3}, we know at the end of the first phase, vA(k)v0(k)/T3+σ2R2v^{\left(k\right)}_{A}\leq v^{\left(k\right)}_{0}/T^{3}+\frac{\sigma^{2}}{R^{2}}.

In the second T/3T/3 steps, the guarantee would be: for any tT/3t\leq T/3, vA+t(k)v0(k)/T3+2ηA+tσ2v^{\left(k\right)}_{A+t}\leq v^{\left(k\right)}_{0}/T^{3}+2\eta_{A+t}\sigma^{2}.

We will again prove this by induction. The base case (t=0t=0) follows immediately from the guarantee for the first part. Suppose this is true for A+t1A+t-1, let us consider A+tA+t, again by recursion we know

Here the last line uses the fact that 2ηA+t1(112μηA+t1)2ηA+tσ22\eta_{A+t-1}(1-\frac{1}{2}\mu\eta_{A+t-1})\leq 2\eta_{A+t}\sigma^{2}, which is easy to verify by our choice of η\eta. Therefore, at the end of the second part, we have vB(k)v0(k)/T3+2σ2μ(κ+T/6)v^{\left(k\right)}_{B}\leq v^{\left(k\right)}_{0}/T^{3}+\frac{2\sigma^{2}}{\mu(\kappa+T/6)}.

The proof of the proposition works similar to the proof of the strongly convex case of theorem 2, wherein, we combine the result of proposition 14 with lemma 9 and lemma 5 to obtain the result. ∎

Appendix D Proofs of results in Section 3.3

All of our counter-examples in this section are going to be the same simple function. Let the inputs xx be such that only a single co-ordinate be active on each example. We refer to this case as the “discrete” case. Furthermore, let each co-ordinate be a Gaussian with mean and variance for the first d/2d/2 directions being dκ/3d\kappa/3 and the final d/2d/2 directions being 11. Furthermore, consider the noise to be additive (and independent of x\mathbf{x}) with mean zero. This indicates that R2=κR^{2}=\kappa for this problem.

Intuitively, we will show that in order to have a small error in the first eigendirection (with eigenvalue κ\kappa), one need to set a small learning rate ηt\eta_{t} which would be too small to achieve a small error in the second eigendirection (with eigenvalue 11). As a useful tool, we will decompose the variance in the two directions corresponding to κ\kappa eigenvalue and 11 eigenvalue respectively as follows:

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

There exists a universal constant C>0C>0, and a problem instance, such that for SGD algorithm with any ηt1/2κ\eta_{t}\leq 1/2\kappa for all ttLearning rate more than 2/κ2/\kappa will make the algorithm diverge., we have

Fix τ=κ/Clog(κ+1)\tau=\kappa/C\log(\kappa+1) where CC is a universal constant that we choose later. We need to exhibit that the lim sup\limsup is larger than τ\tau. For simplicity we will also round κ\kappa up to the nearest integer.

Note that such a number always exists because all the step sizes are at most 2/κ2/\kappa. We will also let Δi\Delta_{i} be TiTi1T_{i}-T_{i-1}. Firstly, from (15) and (16), we see that tηt=\sum_{t}\eta_{t}=\infty. Otherwise, the bias will never decay to zero. If f(wTi1+Δi)f(w)>τσ2dTi1+Δif(\mathbf{w}_{T_{i-1}+\Delta_{i}})-f(\mathbf{w}^{*})>\frac{\tau\sigma^{2}d}{T_{i-1}+\Delta_{i}} for some i=1,,κi=1,\cdots,\kappa, we are done. If not, we obtain the following relations:

Here the second inequality is based on (15). We will use C1C_{1} to denote exp(3)\exp(3). Similarly, we have

Repeating this argument, we can show that

We will use i=1i=1 in particular, which specializes to

Using the above inequality, we can lower bound the sum of Δj\Delta_{j} as

where we used (17) in the last step. Rearranging, we obtain

If we choose a large enough CC (e.g., 3C13C_{1}), the right hand side is at least exp((C/C1)log(κ+1)3)κκ\frac{\exp\left((C/C_{1})\log(\kappa+1)-3\right)}{\kappa}\geq\kappa.

Theorem 4 follows as a straightforward consequence of Theorem 15 and lemma 11. ∎

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

To prove Theorem 17, we rely on the following key lemma, which says if a query point wT\mathbf{w}_{T} is bad (in the sense that it has expected value more than 10τσ2d/T10\tau\sigma^{2}d/T), then it takes at least Ω(T/τ)\Omega(T/\tau) steps to bring the error back down.

Since f(wT)f(w)C1τσ2d/Tf(\mathbf{w}_{T})-f(\mathbf{w}^{*})\geq C_{1}\tau\sigma^{2}d/T and f(wT)=d2(κ(wT(1)(w)(1))2+(wT(2)(w)(2))2)f(\mathbf{w}_{T})=\frac{d}{2}\left(\kappa\left(\mathbf{w}_{T}^{(1)}-\left(\mathbf{w}^{*}\right)^{(1)}\right)^{2}+\left(\mathbf{w}_{T}^{(2)}-\left(\mathbf{w}^{*}\right)^{(2)}\right)^{2}\right), we know either (wT(1)(w)(1))2C1τσ2/2κT\left(\mathbf{w}_{T}^{(1)}-\left(\mathbf{w}^{*}\right)^{(1)}\right)^{2}\geq C_{1}\tau\sigma^{2}/2\kappa T or (wT(2)(w)(2))2C1τσ2/2T\left(\mathbf{w}_{T}^{(2)}-\left(\mathbf{w}^{*}\right)^{(2)}\right)^{2}\geq C_{1}\tau\sigma^{2}/2T. Either way, we have a coordinate ii with eigenvalue λi\lambda_{i} (κ\kappa or 11) such that (wT(i)(w)(i))2C1τσ2/(2Tλi\left(\mathbf{w}_{T}^{(i)}-\left(\mathbf{w}^{*}\right)^{(i)}\right)^{2}\geq C_{1}\tau\sigma^{2}/(2T\lambda_{i}).

Similar as before, choose Δ\Delta to be the first point such that

If S2C2τ/(λi2T)S^{2}\leq C_{2}\tau/(\lambda_{i}^{2}T) (where C2C_{2} is a large enough universal constant chosen later), then by Cauchy-Schwartz we know

Therefore ΔT/C2τ\Delta\geq T/C_{2}\tau, and we are done.

If S2>C2τ/(λi2T)S^{2}>C_{2}\tau/(\lambda_{i}^{2}T), by Equation (15) and (16) we know

Theorem 17 is an immediate corollary of Theorem 15 and Lemma 18. ∎

Theorem 16 is an immediate corollary of Theorem 17 and lemma 11∎

Appendix E Details of experimental setup

As mentioned in the main paper, we consider four condition numbers namely κ{50,100,200,400}\kappa\in\{50,100,200,400\}. We run all experiments for a total of κmax2=4002=160000\kappa_{\max}^{2}=400^{2}=160000 iterations. The two eigenvalues of the Hessian are 11 and 1/κ1/\kappa respectively and the noise level σ2=1\sigma^{2}=1 and we average our results with five random seeds. All our grid search results for the polynomially decaying learning rates are conducted on a 8×88\times 8 grid of learning rates ×\times decay factor and whenever a best run lands at the edge of the grid, the grid is extended so that we have the best run in the interior of the grid search. For the step decay schedules, note that we fix the learning rate (details below), and vary only the decay factor.

For the O(1/t)O(1/t) learning rate, we search for decay parameter over 88-points log-spaced between {1/(200κ),5000/κ}\{1/(200\kappa),5000/\kappa\}. The starting learning rate is searched over 88 points logarithmically spaced between {1/κ,5}\{1/\kappa,5\}.

For the O(1/t)O(1/\sqrt{t}) learning rate, the decay parameter is searched over 88 logarithmically spaced points between {1/(2500κ),100/κ}\{1/(2500\kappa),100/\kappa\}. The starting learning rate is searched between {1/(10κ),5}\{1/(10\kappa),5\} with 88 logarithmically spaced points.

For the step decay schedule experiments, we kept the initial learning rate to be 0.10.1 and swept over when to decay in multiples of T/logTT/\log{T}, i.e., vary some parameter c{0.25,0.5,0.75,1.0,1.25,1.5,2,4}c\in\{0.25,0.5,0.75,1.0,1.25,1.5,2,4\} where the learning rate decays by a factor of 22 every cT/logTc\cdot T/log{T} steps. We found that the values chosen in most experiments were very close to 11, i.e., they were either 11 or 1.251.25 or some very rare cases, 1.51.5.

With regards to the suffix iterate averaging, we used a constant stepsize of 0.10.1 and averaged iterates over the final half of the iterations.

E.2 Non-Convex experiments on cifar-10 dataset with a 44-layer residual net

With regards to learning rates, we consider 1010-values geometrically spaced as {1,0.6,,0.01}\{1,0.6,\cdots,0.01\}. To set the decay factor for any of the schemes such as 5,6, and 7, we use the following rule. Suppose we have a desired learning rate that we wish to use towards the end of the optimization (say, something that is 100100 times lower than the starting learning rate, which is a reasonable estimate of what is typically employed in practice), this can be used to obtain a decay factor for the corresponding decay scheme. In our case, we found it advantageous to use an additively spaced grid for the learning rate γt\gamma_{t}, i.e., one which is searched over a range {0.0001,0.0002,,0.0009,0.001,,0.009}\{0.0001,0.0002,\cdots,0.0009,0.001,\cdots,0.009\} at the 80th80^{th} epoch, and cap off the minimum possible learning rate to be used to be 0.00010.0001 to ensure that there is progress made by the optimization routine. For any of the experiments that yield the best performing gridsearch parameter that falls at the edge of the grid, we extend the grid to ensure that the finally chosen hyperparameter lies in the interior of the grid. All our gridsearches are run such that we separate a tenth of the training dataset as a validation set and train on the remaining 9/10th9/10^{th} dataset. Once the best grid search parameter is chosen, we train on the entire training dataset and evaluate on the test dataset and present the result of the final model (instead of choosing the best possible model found during the course of optimization).