The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli

Introduction

Large scale machine learning relies almost exclusively on stochastic optimization methods (Bottou and Bousquet, 2007), which include stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants Duchi et al. (2011); Johnson and Zhang (2013). In this work, we restrict our attention to the SGD algorithm where we are concerned with the behavior of the final iterate (i.e. the last point when we terminate the algorithm). A majority of (minimax optimal) theoretical results for SGD focus on polynomially decaying stepsizes (Dekel et al., 2012; Rakhlin et al., 2012; Lacoste-Julien et al., 2012; Bubeck, 2014) (or constant stepsizes (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016) for the case of least squares regression) coupled with iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992) to achieve minimax optimal rates of convergence. However, practical SGD implementations typically return the final iterate of a stochastic gradient procedure. This line of work in theory (based on iterate averaging) and its discrepancy with regards to practice leads to the question with regards to the behavior of SGD’s final iterate. Indeed, this question has motivated several efforts in stochastic convex optimization literature as elaborated below.

Non-Smooth Stochastic Optimization: The work of Shamir (2012) raised the question with regards to the behavior of SGD’s final iterate for non-smooth stochastic optimization (with/without strong convexity). The work of Shamir and Zhang (2012) answered this question, indicating that SGD’s final iterate with polynomially decaying stepsizes achieves near minimax rates (up to $\log$ factors) in an anytime (i.e. in a limiting) sense (when number of iterations SGD is run for is not known in advance). Under specific choices of step size sequences, Shamir and Zhang (2012)’s result on SGD’s final iterate is tight owing to the recent work of Harvey et al. (2018). More recently Jain et al. (2019) presented an approach indicating that a more nuanced stepsize sequence serves to achieve minimax rates (up to constant factors) for the non-smooth stochastic optimization setting when the end time $T$ is known in advance.

Least Squares Regression (LSR): In contrast to the non-smooth setting, the state of our understanding of SGD’s final iterate for smooth stochastic convex optimization, or, say, the streaming least squares regression setting is far less mature $-$ this gap motivates our paper’s contributions. In particular, this paper studies SGD’s final iterate behavior under various stepsize choices for least squares regression (with and without strong convexity). The use of SGD’s final iterate for the least mean squares objective has featured in several efforts (Widrow and Hoff, 1960; Proakis, 1974; Widrow and Stearns, 1985; Roy and Shynk, 1990), but these results do not achieve minimax rates of convergence, which leads to the following question:

“ Can polynomially decaying stepsizes (known to achieve minimax rates when coupled with iterate averaging (Ruppert, 1988; Polyak and Juditsky, 1992)) offer minimax optimal rates on SGD’s final iterate when optimizing the streaming least squares regression objective? If not, is there any other family of stepsizes that can guarantee minimax rates on the final iterate of stochastic gradient descent? ”

This paper presents progress on answering the above question $-$ refer to contributions below for more details. Note that the oracle model employed by this work (to quantify SGD’s final iterate behavior) has featured in a string of recent results that present a non-asymptotic understanding of SGD for least squares regression, with the caveat being that these results crucially rely on iterate averaging with constant stepsize sequences (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016, 2017b, 2017a; Neu and Rosasco, 2018).

This work establishes upper and lower bounds on the behavior of SGD’s final iterate, as run with polynomially decaying stepsizes as well as step decay schedules which tends to cut the stepsize by a constant factor after every constant number of epochs (see algorithm 1), by considering the streaming least squares regression problem (with and without strong convexity). Our main result indicates that step decay schedules offer significant improvements in achieving near minimax rates over polynomially decaying stepsizes in the known horizon case (when the end time $T$ is known in advance). Figure 1 illustrates that this difference is evident (empirically) even when optimizing a two-dimensional synthetic least squares objective. Table 1 provides a summary. Finally, we present results that indicate the subtle (yet significant) differences between the known time horizon case and the anytime (i.e. the limiting) behavior of SGD’s final iterate (see below). Note that proofs of our main claims can be found in the appendix.

Sub-optimality of polynomially decaying stepsizes: For the strongly convex least squares case, this work shows that the final iterate of a polynomially decaying stepsize scheme (i.e. with $\eta_{t}\propto 1/t^{\alpha}$ , with $\alpha\in[0.5,1]$ ) is off the minimax rate $d\sigma^{2}/T$ by a factor of the condition number of the problem. For the non-strongly convex case of least squares, we show that any polynomially decaying stepsize can achieve a rate no better than $d\sigma^{2}/\sqrt{T}$ (up to $\log$ factors), while the minimax rate is $d\sigma^{2}/T$ .

Near-optimality of the step-decay scheme: Given a fixed end time $T$ , the step-decay scheme (algorithm 1) presents a final iterate that is off the statistical minimax rate by just a $\log(T)$ factor when optimizing the strongly convex and non-strongly convex least squares regression This dependence can be improved to $\log$ of the condition number of the problem (for the strongly convex case) using a more refined stepsize decay scheme., thus indicating vast improvements over polynomially decaying stepsize schedules. We note here that our Theorem 2 for the non-strongly case offers a rate on the initial error (i.e., the bias term) that is off the best known rate (Bach and Moulines, 2013) (that employs iterate averaging) by a dimension factor. That said, Algorithm 1 is rather straightforward and employs the knowledge of just an initial learning rate and number of iterations for its implementation.

SGD has to query bad iterates infinitely often: For the case of optimizing strongly convex least squares regression, this work shows that any stochastic gradient procedure (in a $\limsup$ sense) must query sub-optimal iterates (off by nearly a condition number) infinitely often.

Complementary to our theoretical results for the stochastic linear regression, we evaluate the empirical performance of different learning rate schemes when training a residual network on the cifar-10 dataset and observe that the continuous variant of step decay schemes (i.e. an exponential decay) indeed compares favorably to polynomially decaying stepsizes.

While the upper bounds established in this paper (section 3.2) merit extensions towards broader smooth convex functions (with/without strong convexity), the lower bounds established in sections 3.1, 3.3 present implications towards classes of smooth stochastic convex optimization. Even in terms of upper bounds, note that there are fewer results on non-asymptotic behavior of SGD (beyond least squares) when working in the oracle model considered in this work (see below). Bach and Moulines (2011, 2013); Bach (2014); Needell et al. (2016) are exceptions, yet they do not achieve minimax rates on appropriate problem classes; Frostig et al. (2015) does not work in standard stochastic first order oracle model (Nemirovsky and Yudin, 1983; Agarwal et al., 2012), so their work is not directly comparable to examine extensions towards broader function classes.

As a final note, this paper’s result on the sub-optimality of standard polynomially decaying stepsizes for classes of smooth and strongly convex optimization doesn’t contradict the (minimax) optimality results in stochastic approximation (Polyak and Juditsky, 1992). Iterate averaging coupled with polynomially decaying learning rates (or constant learning rates for least squares (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016)) does achieve minimax rates (Ruppert, 1988; Polyak and Juditsky, 1992). However, as mentioned previously, this work deals with SGD’s final iterate behavior (i.e. without iterate averaging), since this bears more relevance towards practice.

Related work: Robbins and Monro (1951) introduced the stochastic approximation problem and Stochastic Gradient Descent (SGD). They present conditions on stepsize schemes satisfied by asymptotically convergent algorithms: these schemes are referred to as “convergent” stepsize sequences. Ruppert (1988); Polyak and Juditsky (1992) proved the asymptotic optimality of iterate averaged SGD with larger stepsize sequences. In terms of oracle models and notions of optimality, there exists two lines of thought (see also Jain et al. (2017b)):

Towards statistically optimal estimation procedures: The goal of this line of thought is to match the excess risk of the statistically optimal estimator (Anbar, 1971; Kushner and Clark, 1978; Polyak and Juditsky, 1992; Lehmann and Casella, 1998) on every problem instance. Several efforts consider SGD in this oracle (Bach and Moulines, 2011; Bach, 2014; Dieuleveut and Bach, 2015; Frostig et al., 2015; Needell et al., 2016) presenting non-asymptotic results, often with iterate averaging. With regards to least squares, Bach and Moulines (2013); Défossez and Bach (2015); Frostig et al. (2015); Jain et al. (2016, 2017b); Neu and Rosasco (2018) use constant step-size SGD with iterate averaging to achieve minimax rates (on a per-problem basis) in this oracle model. SGD’s final iterate behavior for least squares has featured in several efforts in the signal processing/controls literature (Widrow and Hoff, 1960; Nagumo and Noda, 1967; Proakis, 1974; Widrow and Stearns, 1985; Roy and Shynk, 1990; Sharma et al., 1998), without achieving minimax rates. This paper works in this oracle model and analyzes SGD’s final iterate behavior with various stepsize choices.

Towards optimality under bounded noise assumptions: The other line of thought presents algorithms with access to stochastic gradients satisfying bounded noise assumptions, aiming to match lower bounds provided in Nemirovsky and Yudin (1983); Raginsky and Rakhlin (2011); Agarwal et al. (2012). Asymptotic properties of “convergent” stepsize schemes have been studied in great detail (Kushner and Clark, 1978; Benveniste et al., 1990; Ljung et al., 1992; Bharath and Borkar, 1999; Kushner and Yin, 2003; Lai, 2003; Borkar, 2008). Dekel et al. (2012); Lacoste-Julien et al. (2012); Rakhlin et al. (2012); Ghadimi and Lan (2012, 2013b); Hazan and Kale (2014); Bubeck (2014); Dieuleveut et al. (2016) use iterate averaged SGD to achieve minimax rates for various problem classes non-asymptotically. Allen-Zhu (2018) present an alternative approach towards minimizing the gradient norm with access to stochastic gradients. As noted, Shamir and Zhang (2012) achieves anytime optimal rates (upto a $\log{T}$ factor) with the final iterate of an SGD procedure, and this is shown to be tight with the recent work of Harvey et al. (2018). Jain et al. (2019) achieve minimax rates on the final iterate using a nuanced stepsize scheme when the number of iterations is fixed in advance.

Geometrically Decaying Stepsize Schedules date to Goffin (1977). Davis and Drusvyatskiy (2019) employ the stepdecay schedule to prove high-probability guarantees for SGD with strongly convex objectives. In stochastic optimization, several other works, including Ghadimi and Lan (2013a); Hazan and Kale (2014); Aybat et al. (2019); Kulunchakov and Mairal (2019) consider doubling argument based approaches, where the epoch length is doubled everytime the stepsizes are halved. The step decay schedule is employed to yield faster rates of convergence under certain growth (and related) conditions both in convex (Xu et al., 2016) and non-convex settings (Yang et al., 2018; Davis et al., 2019).

Paper organization: Section 2 describes notation and problem setup. Section 3 presents our results on the sub-optimality of polynomial decay schemes and the near optimality of the step decay scheme. Section 3.3 presents results on the anytime behavior of SGD (i.e. the asymptotic/infinite horizon case). Section 4 presents experimental results and Section 5 presents conclusions.

Problem Setup

Notation: We present the setup and associated notation in this section. We represent scalars with normal font $a,b,L$ etc., vectors with boldface lowercase characters $\mathbf{a},\mathbf{b}$ etc. and matrices with boldface uppercase characters $\mathbf{A},\mathbf{B}$ etc. We represent positive semidefinite (PSD) ordering between two matrices using $\succeq$ . The symbol $\gtrsim$ represents that the inequality holds for some universal constant.

We consider here the minimization of the following expected square loss objective:

where, $\epsilon$ is the noise on the example pair $(\mathbf{x},y)\sim\mathcal{D}$ and $\mathbf{w}^{*}$ is a minimizer of the objective $f(\mathbf{w})$ . Given an initial iterate $\mathbf{w}_{0}$ and stepsize sequence $\{\eta_{t}\}$ , our stochastic gradient update is:

We assume that the noise $\epsilon=y-\langle\mathbf{w}^{*},\mathbf{x}\rangle\ \ \forall\ \ (\mathbf{x},y)\sim\mathcal{D}$ satisfies the following condition:

Next, assume that covariates $\mathbf{x}$ satisfy the following fourth moment inequality:

This assumption is satisfied, say, when the norm of the covariates $\sup\left\|\mathbf{x}\right\|^{2}<R^{2}$ , but is true more generally. Finally, note that both 3 and 4 are general and are used in recent works (Bach and Moulines, 2013; Jain et al., 2016) that present a sharp analysis of SGD for streaming least squares problem. Next, we denote by

The rate of $\left(1+o(1)\right)\cdot{\sigma^{2}d}/{t}$ is achieved using iterate averaged SGD (Ruppert, 1988; Polyak and Juditsky, 1992) with constant stepsizes (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2016). This rate of $\sigma^{2}d/t$ is called the statistical minimax rate.

Main results

Sections 3.1, 3.2 consider the fixed time horizon setting; the former presents the significant sub-optimality of polynomially decaying stepsizes on SGD’s final iterate behavior, the latter section presenting the near-optimality of SGD’s final iterate. Section 3.3 presents negative results on SGD’s final iterate behavior (irrespective of stepsizes employed), in the anytime (i.e. limiting) sense.

This section begins by showing that there exist problem instances where polynomially decaying stepsizes considered stochastic approximation theory (Robbins and Monro, 1951; Polyak and Juditsky, 1992) i.e., those of the form $\frac{a}{b+t^{\alpha}}$ , for any choice of $a,b>0$ and $\alpha\in[0.5,1]$ are significantly suboptimal (by a factor of the condition number of the problem, or by $\sqrt{T}$ in the smooth case) compared to the statistical minimax rate (Kushner and Clark, 1978).

Under assumptions 3, 4, there exists a class of problem instances where the following lower bounds on excess risk hold on SGD’s final iterate with polynomially decaying stepsizes when given access to the oracle as written in equation 2.

Strongly convex case: Suppose $\mu>0$ . For any condition number $\kappa$ , there exists a least squares problem instance with initial suboptimality $f(\mathbf{w}_{0})-f(\mathbf{w}^{*})\leq\sigma^{2}d$ such that, for any $T\geq\kappa^{\frac{4}{3}}$ , and for all $a,b\geq 0$ and $0.5\leq\alpha\leq 1$ , and for the learning rate scheme $\eta_{t}=\frac{a}{b+t^{\alpha}}$ , we have

Smooth case: For any fixed $T>1$ , there exists a least squares problem instance such that, for all $a,b\geq 0$ and $0.5\leq\alpha\leq 1$ , and for the learning rate scheme $\eta_{t}=\frac{a}{b+t^{\alpha}}$ , we have

For both cases (with/without strong convexity), the minimax rate is $\sigma^{2}d/T$ . In the strongly convex case, SGD’s final iterate with polynomially decaying stepsizes pays a suboptimality factor of $\Omega(\kappa)$ , whereas, in the smooth case, SGD’s final iterate pays a suboptimality factor of $\Omega\left(\frac{\sqrt{T}}{\log T}\right)$ .

2 Near optimality of Step Decay schemes

Given the knowledge of an end time $T$ when the algorithm is terminated, this section presents the step decay schedule (Algorithm 1), which offers significant improvements over standard polynomially decaying stepsize schemes, and obtains near minimax rates (off by only a $\log(T)$ factor).

Suppose we are given access to the stochastic gradient oracle 2 satisfying Assumptions 3 and 4. Running Algorithm 1 with an initial stepsize of $\eta_{1}=1/(2R^{2})$ allows the algorithm to achieve the following excess risk guarantees.

Strongly convex case: Suppose $\mu>0$ . We have:

While theorem 2 presents significant improvements over polynomial decay schemes, as mentioned in the contributions, the above result presents a worse rate on the initial error (by a dimension factor) in the smooth case (i.e. non-strongly convex case), compared to the best known result (Bach and Moulines, 2013), which relies heavily on iterate averaging to remove this factor. It is an open question with regards to whether this factor can actually be improved or not. Furthermore, comparing the initial error dependence between the lower bound for the smooth case (Theorem 1) with the upper bound for the step decay scheme, we believe that the dependence on the smoothness $L$ should be improved to one on the $R^{2}$ .

In terms of the variance, however, note that the polynomial decay schemes, are plagued by a polynomial dependence on the condition number $\kappa$ (for the strongly convex case), and are off the minimax rate by a $\sqrt{T}$ factor (for the smooth case). The step decay schedule, on the other hand, is off the minimax rate (Ruppert, 1988; Polyak and Juditsky, 1992; Van der Vaart, 2000) by only a $\log(T)$ factor. It is worth noting that Algorithm 1 admits an efficient implementation in that it requires the knowledge only of $R^{2}$ (similar to iterate averaging results (Bach and Moulines, 2013; Jain et al., 2016)) and the end time $T$ . Finally, note that this $\log T$ factor can be improved to a $\log\kappa$ factor for the strongly convex case by using an additional polynomial decay scheme before switching to the Step Decay scheme.

Suppose we have access to the stochastic gradient oracle 2 satisfying Assumptions 3 and 4. Let $\mu>0$ and $\kappa\geq 2$ . For any problem and fixed time horizon $T/\log T>5\kappa$ , there exists a learning rate scheme that achieves

In order to have improved the dependence on the variance from $\log(T)$ (in theorem 2) to $\log(\kappa)$ (in proposition 3), we require access to the strong convexity parameter $\mu=\lambda_{\textrm{min}}(\mathbf{H})$ in addition to $R^{2}$ and knowledge of the end time $T$ . This parallels results known for general strongly convex setting (Rakhlin et al., 2012; Lacoste-Julien et al., 2012; Shamir and Zhang, 2012; Bubeck, 2014; Jain et al., 2019).

As a final remark, note that this section’s results (on step decay schemes) assumed the knowledge of a fixed time horizon $T$ . In contrast, most results SGD’s averaged iterate obtain anytime (i.e., limiting/infinite horizon) guarantees. Can we hope to achieve such guarantees with the final iterate?

3 SGD queries bad points infinitely often

This section shows that obtaining near minimax rates with the final iterate is not possible without knowledge of the time horizon $T$ . Concretely, we show that irrespective of the learning rates employed (be it polynomially decaying or step-decay), SGD requires to query a point with sub-optimality $\Omega(\kappa/\log\kappa)\cdot\sigma^{2}d/T$ for infinitely many time steps $T$ .

Suppose we have access to a stochastic gradient oracle 2 satisfying Assumption 3, 4. There exists a universal constant $C>0$ , and a problem instance such that SGD algorithm with any $\eta_{t}\leq 1/2R^{2}$ for all $t$ Learning rate more than $2/R^{2}$ will make the algorithm diverge., we have

The bad points guaranteed to exist by Theorem 4 are not rare. We show that such points occur at least once in $\mathcal{O}\left(\frac{\kappa}{\log\kappa}\right)$ iterations. Refer to Theorem 16 in appendix D in the appendix.

Experimental Results

We present experimental validation on the suitability of the Step-decay schedule (or more precisely, its continuous counterpart, which is the exponentially decaying schedule), and compare its with the polynomially decaying stepsize schedules. In particular, we consider the use of: $\eta_{t}=\frac{\eta_{0}}{1+b\cdot t}$ (5) $\eta_{t}=\frac{\eta_{0}}{1+b\sqrt{t}}$ (6) $\vspace*{1.3em}\eta_{t}=\eta_{0}\cdot\exp{(-b\cdot t)}.$ (7) Where, we perform a systematic grid search on the parameters $\eta_{0}$ and $b$ . In the section below, we consider a real world non-convex optimization problem of training a residual network on the cifar-10 dataset, with an aim to illustrate the practical implications of the results described in the paper. Refer to Appendix E for more details.

Our experiments are based on grid searching for the best learning rate decay scheme on the parametric family of learning rate schemes described above (5),(6),(7); all grid searches are performed on a separate validation set (obtained by setting aside one-tenth of the training dataset) and with models trained on the remaining $45000$ samples. For presenting the final numbers in the plots/tables, we employ the best hyperparameters from the validation stage and train it on the entire $50,000$ samples and average results run with $10$ different random seeds. The parameters for grid searches and other details are presented in Appendix E. Furthermore, we always extend the grid so that the best performing grid search parameter lies in the interior of our grid search.

How does the step decay scheme compare with the polynomially decaying stepsizes? Figure 2 and Table 2 present a comparison of the performance of the three schemes (5)-(7). These results demonstrate that the exponential scheme convicingly outperforms the polynomial step-size schemes.

Does suffix iterate averaging improve over final iterate’s behavior for polynomially decaying stepsizes? Towards answering this question, firstly, we consider the best performing values of equation 5 and 6, and then, average iterates of the algorithm starting from $5,10,20,40,80,85,90,95,99$ epochs when training the model for a total of $100$ epochs. While such iterate averaging (and their suffix) variants have strong theoretical support for (stochastic) convex optimization (Ruppert, 1988; Polyak and Juditsky, 1992; Rakhlin et al., 2012; Bubeck, 2014; Jain et al., 2016), their impact on non-convex optimization is largely debatable. Nevertheless, this experiments’s results (figure 3) indicates that suffix averaging tends to hurt the algorithm’s generalization behavior (which is unsurprising given the non-convex nature of the objective). Note that, figure 3 serves to indicate that averaging the final few ( $\leq 5$ ) epochs tends to offer nearly the same result as the final iterate’s behavior, indicating that the gains of using suffix iterate averaging are relatively limited for several such settings.

Does our result on “knowing” the time horizon (for step-decay schedule) present implications towards hyper-parameter search methods that work based on results from truncated runs? Towards answering this question, consider the figure 4 and Tables 3 and 4, which present a comparison of the performance of three exponential decay schemes each of which is tuned to achieve the best performance at $33$ , $66$ and $100$ epochs respectively. The key point to note is that best performing hyperparameters at $33$ and $66$ epochs are not the best performing at $100$ epochs (which is made stark from the perspective of the validation error - refer to table 4). This demonstrates that hyper parameter selection methods that tend to discard hyper-parameters which don’t perform well at earlier stages of the optimization (i.e. based on comparing results on truncated runs), which, for e.g., is indeed the case with hyperband (Li et al., 2017), will benefit from a round of rethinking.

Conclusions and Discussion

The main contribution of this work shows that the behavior of SGD’s final iterate for least squares regression is much more nuanced than what has been indicated by prior efforts that have primarily considered non-smooth stochastic convex optimization. The results of this paper point out the striking limitations of polynomially decaying stepsizes on SGD’s final iterate, as well as sheds light on the effectiveness of starkly different schemes based on a Step Decay schedule. Somewhat coincidentally, practical implementations for certain classes of stochastic optimization do return the final iterate of SGD with step decay schedule $-$ this connection does merit an understanding through future work.

Rong Ge acknowledges funding from NSF CCF- $1704656$ , NSF CCF- $1845171$ (CAREER), Sloan Fellowship and Google Faculty Research Award. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery, NSF Award 1740551, and ONR award N $00014$ - $18$ - $1$ - $2247$ . Rahul Kidambi acknowledges funding from NSF Award $1740822$ .

References

Appendix A Preliminaries

Before presenting the lemmas establishing the behavior of SGD under various learning rate schemes, we introduce some notation. We recount that the SGD update rule denoted through:

We then write out the expression for the stochastic gradient $\widehat{\nabla f}(\mathbf{w}_{t-1})$ .

where, given the stochastic gradient corresponding to an example $(\mathbf{x}_{t},y_{t})\sim\mathcal{D}$ , with $y_{t}=\left\langle\mathbf{w}^{*},\mathbf{x}_{t}\right\rangle+\epsilon_{t}$ , the above stochastic gradient expression naturally follows. Now, in order to analyze the contraction properties of the SGD update rule, we require the following notation:

[For e.g. Appendix A.2.2 from Jain et al. (2016)] Bias-Variance tradeoff: Running SGD for $T-$ steps starting from $\mathbf{w}_{0}$ and a stepsize sequence $\{\eta_{t}\}_{t=1}^{T}$ presents a final iterate $\mathbf{w}_{T}$ whose excess risk is upper-bounded as:

One can view the contribution of the above two terms as ones stemming from SGD’s updates, which can be written as:

For more details, refer to Défossez and Bach (2015).

Now, in order to bound the total error, note that the original stochastic process associated with SGD’s updates can be decoupled into two (simpler) processes, one being the noiseless process (which corresponds to reducing the dependence on the initial error, and is termed “bias”), i.e., where, the recurrence evolves as:

The second recursion corresponds to the dependence on the noise (termed as variance), wherein, the process is initiated at the solution, i.e. $\mathbf{w}_{0}^{\text{var}}=\mathbf{w}^{*}$ and is driven by the noise $n_{t}$ . The update for this process corresponds to:

We represent by $B_{t}$ the covariance of the $t^{\text{th}}$ iterate of the bias process, i.e.,

Firstly, note that this naturally implies that the sequence of covariances $V_{\tau}$ , $\tau=1,\cdots,T$ initialized at (say), the solution, i.e., $\mathbf{V}_{0}=0$ naturally grows to its steady state covariance, i.e.,

See lemma 3 of Jain et al. (2017a) for more details. Furthermore, what naturally follows in relating $V_{t}$ to $V_{t-1}$ is:

Running SGD with a (constant) stepsize sequence $\eta<1/R^{2}$ achieves the following steady-state covariance:

Suppose $\eta=1/{2R^{2}}$ , and $V_{0}=\frac{\eta\sigma^{2}}{1-\eta R^{2}}\mathbf{I}=2\eta\sigma^{2}\mathbf{I}$ . For any sequence of learning rates $\eta_{t}\leq\eta=1/{2R^{2}}\ \forall\ t\in\{1,\cdots,t\}$ , then,

We will prove the lemma using an inductive argument. The base case, i.e. $t=0$ follows from the problem statement. Note also that for SGD, $V_{0}=0$ implying the statement naturally follows. If, say, $V_{t}$ satisfies the equation above, from equation 10, we have the following covariance for $V_{t+1}$ :

(Reduction from Multiplicative noise oracle) Let $V_{t}$ be the (expected) covariance of the variance error. Then, the recursion that connects $V_{t+1}$ to $V_{t}$ can be expressed as:

From equation 10, we already know that the evolution of the co-variance of the variance error follows:

Where the steps follow from lemma 7, and owing from the fact that $\eta_{t}\leq\eta=1/2R^{2}\ \ \forall\ \ t$ . ∎

Note: Basically, one could analyze an auxiliary process driven by noise with variance off by a factor of two and convert the analysis into one involving exact (deterministic) gradients.

[Bias decay - strongly convex case] Let the minimal eigenvalue of the Hessian $\mu=\lambda_{\textrm{min}}(\mathbf{H})>0$ . Consider the bias recursion as in equation 8 with the stepsize set as $\eta=1/(2R^{2})$ . Then,

The proof follows through straight forward computations:

[Reduction of the bias recursion with multiplicative noise to one resembling the variance recursion] Consider the bias recursion that evolves as

Then, the following recursion holds $\forall\gamma_{t}\leq 1/R^{2}$ :

The result follows owing to the following computations:

with the last inequality holding true if the squared distance to the optimum doesn’t grow as a part of the recursion. We prove that this indeed is the case below:

Recursively applying the above argument yields the desired result. ∎

Note: This result implies that the bias error (in the smooth non-strongly convex case of the least squares regression with multiplicative noise) can be bounded by employing a similar lemma as that of the variance, where one can look at the quantity $R^{2}\cdot\|\mathbf{w}_{0}-\mathbf{w}^{*}\|^{2}_{2}$ as the analog of the variance $\sigma^{2}$ that drives the process.

[Lower bounds on the additive noise oracle imply ones for the multiplicative noise oracle] Under the assumption that the covariance of noise $\Sigma=\sigma^{2}\mathbf{H}$ , the following statement holds. Let $V_{t}$ be the (expected) covariance of the variance error. Then, the recursion that connects $V_{t+1}$ to $V_{t}$ can be expressed as:

Let us consider firstly, the setting of (bounded) additive noise. Here, we have:

Then, updates leading upto time $t+1$ can be written as:

This implies the covariance of the variance error is:

Now, let us consider the statement of the lemma:

Appendix B Proofs of results in Section 3.1

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

The following lower bounds hold on the final iterate of a Stochastic Gradient procedure with access to the above stochastic gradients when using polynomially decaying stepsizes.

Strongly convex case: Suppose $\mu>0$ . For any condition number $\kappa$ , there exists a problem instance with initial suboptimality $f(\mathbf{w}_{0})-f(\mathbf{w}^{*})\leq\sigma^{2}d$ such that, for any $T\geq\kappa^{\frac{4}{3}}$ , and for all $a,b\geq 0$ and $0.5\leq\alpha\leq 1$ , and for the learning rate scheme $\eta_{t}=\frac{a}{b+t^{\alpha}}$ , we have

Smooth case: For any fixed $T>1$ , there exists a problem instance such that, for all $a,b\geq 0$ and $0.5\leq\alpha\leq 1$ , and for the learning rate scheme $\eta_{t}=\frac{a}{b+t^{\alpha}}$ , we have

We consider a recursion for $v^{\left(i\right)}_{t}$ with eigenvalue $\lambda_{i}$ ( $\kappa$ or ${1}{}$ ). By the design of the algorithm, we know

Let $s(\eta,\lambda)=\frac{\lambda\sigma^{2}\eta^{2}}{1-(1-\eta\lambda)^{2}}$ be the solution to the stationary point equation $x=(1-\eta\lambda)^{2}+\lambda\sigma^{2}\eta^{2}$ . Intuitively if we keep using the same learning rate $\eta$ , then $v^{\left(i\right)}_{t}$ is going to converge to $s(\eta,\lambda_{i})$ . Also note that $s(\eta,\lambda)\approx\sigma^{2}\eta/2$ when $\eta\lambda\ll 1$ .

We first prove the following claim showing that eventually the variance in direction $i$ is going to be at least $s(\eta_{T},\lambda_{i})$ .

Suppose $s(\eta_{t},\lambda_{i})\leq v^{\left(i\right)}_{0}$ , then $v^{\left(i\right)}_{t}\geq s(\eta_{t},\lambda_{i})$ .

In this form, it is easy to see that the iteration is a contraction towards $s(\eta_{t},\lambda_{i})$ . Further, $v^{\left(i\right)}_{t+1}-s(\eta_{t},\lambda_{i})$ and $v^{\left(i\right)}_{t}-s(\eta_{t},\lambda_{i})$ have the same sign. In particular, let $t_{0}$ be the first time such that $s(\eta_{t},\lambda_{i})\leq v^{\left(i\right)}_{0}$ (note that $\eta_{t}$ is monotone and so is $s(\eta_{t},\lambda_{i})$ ), it is easy to see that $v^{\left(i\right)}_{t}\geq v^{\left(i\right)}_{0}$ when $t\leq t_{0}$ . Therefore we know $v^{\left(i\right)}_{t_{0}}\geq s(\eta_{t_{0}},\lambda_{i})$ , by the recursion this implies $v^{\left(i\right)}_{t_{0}+1}\geq s(\eta_{t_{0}},\lambda_{i})\geq s(\eta_{t_{0}+1},\lambda_{i})$ . The claim then follows from a simple induction. ∎

If $s(\eta_{T},\lambda_{i})\geq v^{\left(i\right)}_{0}$ for $i=1$ or $i=d$ then the error is at least $\sigma^{2}d/2\geq\kappa\sigma^{2}d/T$ and we are done. Therefore we must have $s(\eta_{T},\kappa)\leq v^{\left(1\right)}_{0}=3\sigma^{2}/\kappa$ , and by Claim 1 we know $v^{\left(1\right)}_{T}\geq s(\eta_{T},\kappa)\geq\sigma^{2}\eta_{T}/2$ . The function value is at least

In the second case, $b<T^{\alpha}$ . Since $\frac{1}{16T}\geq\eta_{T}=\frac{a}{b+T^{\alpha}}\geq\frac{a}{2T^{\alpha}}$ , we have $a\leq\frac{1}{8}T^{\alpha-1}$ . The sum of learning rates satisfy

We consider a recursion for $v^{\left(i\right)}_{t}$ with eigenvalue $\lambda_{i}$ (1 or $\frac{1}{\kappa}$ ). By the design of the algorithm, we know

If $s(\eta_{T},\lambda_{i})\geq v^{\left(i\right)}_{0}$ for $i=1$ or $i=d$ then the error is at least $\sigma^{2}d/2\kappa\geq\kappa\sigma^{2}d/T$ and we are done. Therefore we must have $s(\eta_{T},\kappa)\leq v^{\left(1\right)}_{0}=3\sigma^{2}/\kappa$ , and by Claim 1 we know $v^{\left(1\right)}_{T}\geq s(\eta_{T},\kappa)\geq\sigma^{2}\eta_{T}/2$ . The function value is at least

In the second case, $b<T^{\alpha}$ . Since $\frac{\kappa}{16T\log T}\geq\eta_{T}=\frac{a}{b+T^{\alpha}}\geq\frac{a}{2T^{\alpha}}$ , we have $a\leq\frac{1}{8\log T}\kappa T^{\alpha-1}$ . The sum of learning rates satisfy

Here the second inequality uses the fact that $T^{\alpha-1}i^{-\alpha}\leq i^{-1}$ . Similarly, we also know

The proof of theorem 1 follows straightforwardly when combining the result of lemma 11 and theorem 12. ∎

Appendix C Proofs of results in Section 3.2

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

Running Algorithm 1 with an initial stepsize of $\eta_{1}=1/R^{2}$ , starting from the solution, i.e. $\mathbf{w}_{0}=\mathbf{w}^{*}$ allows the algorithm to obtain the following dependence on the variance error:

Plugging (12) and (13) into (11), we obtain

The function suboptimality can now be bounded as

Smooth case: The result follows by instantiating $\hat{\sigma}^{2}$ in theorem 13 with $2\sigma^{2}$ (lemma 8) and $R^{2}\left\|\mathbf{w}_{0}-\mathbf{w}^{*}\right\|_{2}^{2}$ (lemma 10) and using the lemma 5 to obtain the result.

Strongly convex case: As with the smooth case, the result relies on instantiating theorem 13 with $2\sigma^{2}$ (lemma 8) and using lemma 9 and then appealing to lemma 5. ∎

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

There exists a stepsize scheme with which, by starting at the solution (i.e. $\mathbf{w}_{0}=\mathbf{w}^{*}$ ) the algorithm obtains the following dependence on the variance error, under the assumption that $\mu>0$ and $\kappa\geq 2$ .

Recall the variance in the $k^{\textrm{th}}$ coordinate can be upper bounded by

We will consider the first $T/3$ steps. The guarantee that we will prove for these iterations is: for any $t\leq A$ , $v^{\left(k\right)}_{t}\leq(1-\lambda^{(k)}/R^{2})^{2t}v^{\left(k\right)}_{0}+\frac{\sigma^{2}}{R^{2}}$ .

This can be proved easily by induction. Clearly this is true when $t=0$ . Suppose it is true for $t-1$ , let’s consider step $t$ . By recursion of $v^{\left(k\right)}_{t}$ we know

Here the second step uses induction hypothesis and the third step uses the fact that $(1-x)^{2}+x\leq 1$ when $x\in$ . In particular, since $(1-\lambda^{(k)}/R^{2})^{2T/3}\leq(1-1/\kappa)^{2T/3}\leq(1-1/\kappa)^{3\kappa\log T}=1/T^{3}$ , we know at the end of the first phase, $v^{\left(k\right)}_{A}\leq v^{\left(k\right)}_{0}/T^{3}+\frac{\sigma^{2}}{R^{2}}$ .

In the second $T/3$ steps, the guarantee would be: for any $t\leq T/3$ , $v^{\left(k\right)}_{A+t}\leq v^{\left(k\right)}_{0}/T^{3}+2\eta_{A+t}\sigma^{2}$ .

We will again prove this by induction. The base case ( $t=0$ ) follows immediately from the guarantee for the first part. Suppose this is true for $A+t-1$ , let us consider $A+t$ , again by recursion we know

Here the last line uses the fact that $2\eta_{A+t-1}(1-\frac{1}{2}\mu\eta_{A+t-1})\leq 2\eta_{A+t}\sigma^{2}$ , which is easy to verify by our choice of $\eta$ . Therefore, at the end of the second part, we have $v^{\left(k\right)}_{B}\leq v^{\left(k\right)}_{0}/T^{3}+\frac{2\sigma^{2}}{\mu(\kappa+T/6)}$ .

The proof of the proposition works similar to the proof of the strongly convex case of theorem 2, wherein, we combine the result of proposition 14 with lemma 9 and lemma 5 to obtain the result. ∎

Appendix D Proofs of results in Section 3.3

All of our counter-examples in this section are going to be the same simple function. Let the inputs $x$ be such that only a single co-ordinate be active on each example. We refer to this case as the “discrete” case. Furthermore, let each co-ordinate be a Gaussian with mean and variance for the first $d/2$ directions being $d\kappa/3$ and the final $d/2$ directions being $1$ . Furthermore, consider the noise to be additive (and independent of $\mathbf{x}$ ) with mean zero. This indicates that $R^{2}=\kappa$ for this problem.

Intuitively, we will show that in order to have a small error in the first eigendirection (with eigenvalue $\kappa$ ), one need to set a small learning rate $\eta_{t}$ which would be too small to achieve a small error in the second eigendirection (with eigenvalue $1$ ). As a useful tool, we will decompose the variance in the two directions corresponding to $\kappa$ eigenvalue and $1$ eigenvalue respectively as follows:

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

There exists a universal constant $C>0$ , and a problem instance, such that for SGD algorithm with any $\eta_{t}\leq 1/2\kappa$ for all $t$ Learning rate more than $2/\kappa$ will make the algorithm diverge., we have

Fix $\tau=\kappa/C\log(\kappa+1)$ where $C$ is a universal constant that we choose later. We need to exhibit that the $\limsup$ is larger than $\tau$ . For simplicity we will also round $\kappa$ up to the nearest integer.

Note that such a number always exists because all the step sizes are at most $2/\kappa$ . We will also let $\Delta_{i}$ be $T_{i}-T_{i-1}$ . Firstly, from (15) and (16), we see that $\sum_{t}\eta_{t}=\infty$ . Otherwise, the bias will never decay to zero. If $f(\mathbf{w}_{T_{i-1}+\Delta_{i}})-f(\mathbf{w}^{*})>\frac{\tau\sigma^{2}d}{T_{i-1}+\Delta_{i}}$ for some $i=1,\cdots,\kappa$ , we are done. If not, we obtain the following relations:

Here the second inequality is based on (15). We will use $C_{1}$ to denote $\exp(3)$ . Similarly, we have

Repeating this argument, we can show that

We will use $i=1$ in particular, which specializes to

Using the above inequality, we can lower bound the sum of $\Delta_{j}$ as

where we used (17) in the last step. Rearranging, we obtain

If we choose a large enough $C$ (e.g., $3C_{1}$ ), the right hand side is at least $\frac{\exp\left((C/C_{1})\log(\kappa+1)-3\right)}{\kappa}\geq\kappa$ .

Theorem 4 follows as a straightforward consequence of Theorem 15 and lemma 11. ∎

Consider the additive noise oracle setting, where, we have access to stochastic gradients satisfying:

To prove Theorem 17, we rely on the following key lemma, which says if a query point $\mathbf{w}_{T}$ is bad (in the sense that it has expected value more than $10\tau\sigma^{2}d/T$ ), then it takes at least $\Omega(T/\tau)$ steps to bring the error back down.

Since $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\geq C_{1}\tau\sigma^{2}d/T$ and $f(\mathbf{w}_{T})=\frac{d}{2}\left(\kappa\left(\mathbf{w}_{T}^{(1)}-\left(\mathbf{w}^{*}\right)^{(1)}\right)^{2}+\left(\mathbf{w}_{T}^{(2)}-\left(\mathbf{w}^{*}\right)^{(2)}\right)^{2}\right)$ , we know either $\left(\mathbf{w}_{T}^{(1)}-\left(\mathbf{w}^{*}\right)^{(1)}\right)^{2}\geq C_{1}\tau\sigma^{2}/2\kappa T$ or $\left(\mathbf{w}_{T}^{(2)}-\left(\mathbf{w}^{*}\right)^{(2)}\right)^{2}\geq C_{1}\tau\sigma^{2}/2T$ . Either way, we have a coordinate $i$ with eigenvalue $\lambda_{i}$ ( $\kappa$ or $1$ ) such that $\left(\mathbf{w}_{T}^{(i)}-\left(\mathbf{w}^{*}\right)^{(i)}\right)^{2}\geq C_{1}\tau\sigma^{2}/(2T\lambda_{i}$ ).

Similar as before, choose $\Delta$ to be the first point such that

If $S^{2}\leq C_{2}\tau/(\lambda_{i}^{2}T)$ (where $C_{2}$ is a large enough universal constant chosen later), then by Cauchy-Schwartz we know

Therefore $\Delta\geq T/C_{2}\tau$ , and we are done.

If $S^{2}>C_{2}\tau/(\lambda_{i}^{2}T)$ , by Equation (15) and (16) we know

Theorem 17 is an immediate corollary of Theorem 15 and Lemma 18. ∎

Theorem 16 is an immediate corollary of Theorem 17 and lemma 11∎

Appendix E Details of experimental setup

As mentioned in the main paper, we consider four condition numbers namely $\kappa\in\{50,100,200,400\}$ . We run all experiments for a total of $\kappa_{\max}^{2}=400^{2}=160000$ iterations. The two eigenvalues of the Hessian are $1$ and $1/\kappa$ respectively and the noise level $\sigma^{2}=1$ and we average our results with five random seeds. All our grid search results for the polynomially decaying learning rates are conducted on a $8\times 8$ grid of learning rates $\times$ decay factor and whenever a best run lands at the edge of the grid, the grid is extended so that we have the best run in the interior of the grid search. For the step decay schedules, note that we fix the learning rate (details below), and vary only the decay factor.

For the $O(1/t)$ learning rate, we search for decay parameter over $8-$ points log-spaced between $\{1/(200\kappa),5000/\kappa\}$ . The starting learning rate is searched over $8$ points logarithmically spaced between $\{1/\kappa,5\}$ .

For the $O(1/\sqrt{t})$ learning rate, the decay parameter is searched over $8$ logarithmically spaced points between $\{1/(2500\kappa),100/\kappa\}$ . The starting learning rate is searched between $\{1/(10\kappa),5\}$ with $8$ logarithmically spaced points.

For the step decay schedule experiments, we kept the initial learning rate to be $0.1$ and swept over when to decay in multiples of $T/\log{T}$ , i.e., vary some parameter $c\in\{0.25,0.5,0.75,1.0,1.25,1.5,2,4\}$ where the learning rate decays by a factor of $2$ every $c\cdot T/log{T}$ steps. We found that the values chosen in most experiments were very close to $1$ , i.e., they were either $1$ or $1.25$ or some very rare cases, $1.5$ .

With regards to the suffix iterate averaging, we used a constant stepsize of $0.1$ and averaged iterates over the final half of the iterations.

E.2 Non-Convex experiments on cifar-10 dataset with a 44-layer residual net

With regards to learning rates, we consider $10-$ values geometrically spaced as $\{1,0.6,\cdots,0.01\}$ . To set the decay factor for any of the schemes such as 5,6, and 7, we use the following rule. Suppose we have a desired learning rate that we wish to use towards the end of the optimization (say, something that is $100$ times lower than the starting learning rate, which is a reasonable estimate of what is typically employed in practice), this can be used to obtain a decay factor for the corresponding decay scheme. In our case, we found it advantageous to use an additively spaced grid for the learning rate $\gamma_{t}$ , i.e., one which is searched over a range $\{0.0001,0.0002,\cdots,0.0009,0.001,\cdots,0.009\}$ at the $80^{th}$ epoch, and cap off the minimum possible learning rate to be used to be $0.0001$ to ensure that there is progress made by the optimization routine. For any of the experiments that yield the best performing gridsearch parameter that falls at the edge of the grid, we extend the grid to ensure that the finally chosen hyperparameter lies in the interior of the grid. All our gridsearches are run such that we separate a tenth of the training dataset as a validation set and train on the remaining $9/10^{th}$ dataset. Once the best grid search parameter is chosen, we train on the entire training dataset and evaluate on the test dataset and present the result of the final model (instead of choosing the best possible model found during the course of optimization).