Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

Eduard Gorbunov, Marina Danilova, Alexander Gasnikov

Introduction

In this paper we focus on the following problem

However, there are a lot of important applications where the noise distribution in the stochastic gradient is significantly heavy-tailed . For such problems SGD is often less robust and shows poor performance in practice. Furthermore, existing results for the convergence with high-probability for SGD are also much worse in the presence of heavy-tailed noise than its “light-tailed counterparts”. In this case, rates of the convergence in expectation can be insufficient to describe the behavior of the method.

To illustrate this phenomenon we consider a simple example of stochastic optimization problem and apply SGD with constant stepsize to solve it. After that, we present a natural and simple way to resolve the issue of SGD based on the clipping of stochastic gradients. However, we need to introduce some important notations and definitions before we start to discuss this example.

i.e. we have an access to the unbiased estimator of f(x)\nabla f(x) with uniformly bounded by σ2\sigma^{2} variance where σ\sigma is some non-negative number. These assumptions on the stochastic gradient are standard in the stochastic optimization literature . Below we introduce one of the most important definitions in this paper.

Such distributions are often called sub-Gaussian ones (see and references therein). One can show (see Lemma 2 from ) that this definition is equivalent to

2 Simple Motivational Example: Convergence in Expectation and Clipping

In this section we consider SGD xk+1=xkγf(xk,ξk)x^{k+1}=x^{k}-\gamma\nabla f(x^{k},\xi^{k}) applied to solve the problem (1) with f(x,ξ)=\nicefracx222+ξ,xf(x,\xi)=\nicefrac{{\|x\|_{2}^{2}}}{{2}}+\langle\xi,x\rangle, where ξ\xi is a random vector with zero mean and the variance by σ2\sigma^{2} (see the details in Section H.1). The state-of-the-art theory (e.g. ) says that convergence properties in expectation of SGD in this case depend only on the stepsize γ\gamma, condition number of ff, initial suboptimality f(x0)f(x)f(x^{0})-f(x^{*}) and the variance σ\sigma, but does not depend on distribution of ξ\xi. However, the trajectory of SGD significantly depends on the distribution of ξ\xi. To illustrate this we consider 33 different distributions of ξ\xi with the same σ\sigma, i.e., Gaussian distribution, Weibull distribution and Burr Type XII distribution with proper shifts and scales to get needed mean and variance for ξ\xi (see the details in Section H.1). For each distribution, we run SGD several times from the same starting point, the same stepsize γ\gamma, and the same batchsize, see typical runs in Figure 1.

This simple example shows that SGD in all 33 cases rapidly reaches a neighborhood of the solution and then starts to oscillate there. However, these oscillations are significantly larger for the second and the third cases where stochastic gradients are heavy-tailed. Unfortunately, guarantees for the convergence in expectation cannot express this phenomenon, since in expectation the convergence guarantees for all 33 cases are identical.

Moreover, in practice, e.g., in training big machine learning models, it is often used only a couple runs of SGD or another stochastic method. The training process can take hours or even days, so, it is extremely important to obtain good accuracy of the solution with high probability. However, as our simple example shows, SGD fails to converge robustly if the noise in stochastic gradients is heavy-tailed which was also noticed for several real-world problems like training AlexNet on CIFAR10 (see ) and training an attention model via BERT (see ).

Clearly, since the distributions of stochastic gradients in the second and the third cases are heavy tailed the probability of sampling too large ξ\xi (in terms of the norm) and, as a consequence, too large f(x,ξ)\nabla f(x,\xi) is high even if we are close to the solution. Once the current point xkx^{k} is not too far from the solution and SGD gets a stochastic gradient with too large norm the method jumps far from the solution. Therefore, we see large oscillations. Since the reason of such oscillations is large norm of stochastic gradient it is natural to clip it, i.e., update xk+1x^{k+1} according to xk+1=xkγmin{1,\nicefracλf(xk,ξk)2}f(xk,ξk).x^{k+1}=x^{k}-\gamma\min\{1,\nicefrac{{\lambda}}{{\|\nabla f(x^{k},\xi^{k})\|_{2}}}\}\nabla f(x^{k},\xi^{k}). The obtained method is known in literature as clipped-SGD (see and references therein). Among the good properties of clipped-SGD we emphasize its robustness to the heavy-tailed noise in stochastic gradients (see also ). In our tests, trajectories of clipped-SGD oscillate not significantly even for heavy-tailed distributions, and clipping does not spoil the rate of convergence. These two factors make clipped-SGD preferable than SGD when we deal with heavy-tailed distributed stochastic gradients (see further discussion in Section B.2).

3 Related Work

In the light-tailed case high-probability complexity bounds and complexity bounds in expectation for SGD and AC-SA differ only in logarithmical factors of \nicefrac1β\nicefrac{{1}}{{\beta}}, see the details in Table 1. Such bounds were obtained in for SGD in the convex case and then were extended to the μ\mu-strongly convex case in for modification of SGD called Stochastic Intermediate Gradient Method (SIGM). Finally, optimal complexities were derived in for the method called AC-SA in the convex case and for Multi-Staged AC-SA (MS-AC-SA) in the strongly convex case.

Without light tails assumption the most straightforward results lead to O(\nicefrac1β2)O(\nicefrac{{1}}{{\beta^{2}}}) and O(\nicefrac1β)O(\nicefrac{{1}}{{\beta}}) dependency on β\beta in the complexity bounds. Such bounds can be obtained from the complexity bounds for the convergence in expectation via Markov’s inequality. However, for small β\beta these bounds become unacceptably poor. Classical results reduce these dependence to O(ln(β1))O(\ln(\beta^{-1})) but they have worse dependence on ε\varepsilon than corresponding results relying on light tails assumption.

For a long time the following question was open: is it possible to design stochastic methods having the same or comparable complexity bounds as in the light-tailed case but without light tails assumption on stochastic gradients? In and the authors give a positive answer to this question but only partially. Let us discuss the results from these papers in detail.

In Nazin et al. develop a new algorithm called Robust Stochastic Mirror Descent (RSMD) which is based on a special truncation of stochastic gradients and derive complexity guarantees similar to SGD in the convex case but without light assumption, see Table 1. This technique is very similar to gradient clipping. Moreover, in authors consider also composite problems with non-smooth composite term. However, in the optimization problem is defined on some compact convex set XX with diameter Θ=max{xy2x,yX}<\Theta=\max\{\|x-y\|_{2}\mid x,y\in X\}<\infty and the analysis depends substantially on the boundedness of XX. Using special restarts technique together with iterative squeezing of the set XX Nazin et al. extend their method to the μ\mu-strongly convex case, see Table 2. Finally, in the discussion section of authors formulate the following question: is it possible to develop such accelerated stochastic methods that have the same or comparable complexity bounds as in the light-tailed case but do not require stochastic gradients to be light-tailed?

In the strongly convex case the positive answer to this question was given by Davis et al. where authors propose a new method called proxBoost that is based on robust distance estimation and proximal point method , see Table 2. However, this approach requires solving an auxiliary optimization problem at each iteration that can lead to poor performance in practice.

In our paper we close the gap in theory, i.e., we provide a positive answer to the following question: Is it possible to develop such an accelerated stochastic method that have the same or comparable complexity bound as for AC-SA in the convex case but do not require stochastic gradients to be light-tailed?

4 Our Contributions

One of the main contributions of our paper is a new method called Clipped Stochastic Similar Triangles Method (clipped-SSTM). For the case when the objective function ff is convex and LL-smooth we derive the following complexity bound without light tails assumption on the stochastic gradients: O(max{\nicefracLR02ε,\nicefracσ2R02ε2}ln(\nicefracLR02εβ)).O(\max\{\sqrt{\nicefrac{{LR_{0}^{2}}}{{\varepsilon}}},\nicefrac{{\sigma^{2}R_{0}^{2}}}{{\varepsilon^{2}}}\}\ln(\nicefrac{{LR_{0}^{2}}}{{\varepsilon\beta)}}). This bound outperforms all known bounds for this setting (see Table 1) and up to the difference in logarithmical factors recovers the complexity bound of AC-SA derived under light tails assumption. That is, in this paper we close the gap in theory theory of smooth convex stochastic optimization with heavy-tailed noise. Moreover, unlike in , we do not assume boundedness of the set where the optimization problem is defined, which makes our analysis more complicated. We also study different batchsize policies for clipped-SSTM.

Using restarts technique we extend clipped-SSTM to the μ\mu-strongly convex objectives and obtain a new method called Restarted clipped-SSTM (R-clipped-SSTM). For this method we prove the following complexity bound (again, without light tails assumption on the stochastic gradients): O(max{\nicefracLμln(\nicefracμR2ε),\nicefracσ2με}ln(\nicefracLμβln(\nicefracμR2ε))).O(\max\{\sqrt{\nicefrac{{L}}{{\mu}}}\ln(\nicefrac{{\mu R^{2}}}{{\varepsilon}}),\nicefrac{{\sigma^{2}}}{{\mu\varepsilon}}\}\ln(\nicefrac{{L}}{{\mu\beta}}\ln(\nicefrac{{\mu R^{2}}}{{\varepsilon}}))). Our bound outperforms the state-of-the-art result from in terms of the dependence on lnLμ\ln\frac{L}{\mu}, see Table 2 for the details.

We prove the first high-probability complexity guarantees for clipped-SGD in convex and strongly convex cases without light tails assumption on the stochastic gradients, see Tables 1 and 2. The complexity we prove for clipped-SGD in the convex case is comparable with corresponding bound for SGD derived under light tails assumption. In the μ\mu-strongly convex case we derive a new complexity bound for the restarted version of clipped-SGD (R-clipped-SGD) which is comparable with its “light-tailed counterpart”.

We conduct several numerical experiments with the proposed methods in order to justify the theory we develop. In particular, we show that clipped-SSTM can outperform SGD and clipped-SGD in practice even without using large batchsizes. Moreover, in our experiments we illustrate how clipping makes the convergence of SGD and SSTM more robust and reduces their oscillations.

5 Paper Organization

The remaining part of the paper is organized as follows. In Section 2 we present clipped-SSTM together with the main complexity result in the convex case that we prove for this method. Then, we present the first high-probability complexity bounds for clipped-SGD for for the convex problems. In Section 4 we provide our numerical experiments justifying our theoretical results. Finally, in Section 5 we provide some concluding remarks and discuss the limitations and possible extensions of the results developed in the paper. Due to the space limitations, we put the exact formulations of all theorems, results for the strongly convex problems and the full proofs in the Appendix (see Sections F and G), together with auxiliary and technical results and additional experiments (see Section H). Moreover, in Section F.1.2 we present a sketch of the proof of the main convergence result for clipped-SSTM and explain the intuition behind it.

Accelerated SGD with Clipping

In our method we use a clipped stochastic gradient that is defined in the following way:

where f(x,ξ)=1mi=1mf(x,ξi)\nabla f(x,\boldsymbol{\xi})=\frac{1}{m}\sum_{i=1}^{m}\nabla f(x,\xi_{i}) is a mini-batched version of f(x)\nabla f(x). That is, in order to compute clip(f(x,ξ),λ)\text{clip}(\nabla f(x,\boldsymbol{\xi}),\lambda) one needs to get mm i.i.d. samples f(x,ξ1),,f(x,ξm)\nabla f(x,\xi_{1}),\ldots,\nabla f(x,\xi_{m}), compute its average and then project the result f(x,ξ)\nabla f(x,\boldsymbol{\xi}) on the Euclidean ball with radius λ\lambda and center at the origin. Next theorem summarizes the main convergence result for clipped-SSTM.

Assume that function ff is convex and LL-smooth. Then for all β(0,1)\beta\in(0,1) and N1N\geq 1 such that ln(\nicefrac4Nβ)2\ln(\nicefrac{{4N}}{{\beta}})\geq 2 we have that after NN iterations of clipped-SSTM with mk=Θ(max{1,\nicefracσ2αk+12Nln(\nicefracNβ)R02})m_{k}=\Theta\left(\max\left\{1,\nicefrac{{\sigma^{2}\alpha_{k+1}^{2}N\ln(\nicefrac{{N}}{{\beta}})}}{{R_{0}^{2}}}\right\}\right), B=Θ(\nicefracR0ln(\nicefracNβ))B=\Theta(\nicefrac{{R_{0}}}{{\ln(\nicefrac{{N}}{{\beta}})}}) and a=Θ(ln2(\nicefracNβ))a=\Theta(\ln^{2}(\nicefrac{{N}}{{\beta}})) that f(yN)f(x)=O(\nicefracaLR02N2)f(y^{N})-f(x^{*})=O(\nicefrac{{aLR_{0}^{2}}}{{N^{2}}}) holds with probability at least 1β1-\beta where R0=x0x2R_{0}=\|x^{0}-x^{*}\|_{2}. In other words, if we choose aa to be equal to the maximum from (27), then the method achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(\nicefracLR02εln(\nicefracLR02εβ))O(\sqrt{\nicefrac{{LR_{0}^{2}}}{{\varepsilon}}}\ln(\nicefrac{{LR_{0}^{2}}}{{\varepsilon\beta}})) iterations and requires O(max{\nicefracLR02ε,\nicefracσ2R02ε2}ln(\nicefracLR02εβ))O(\max\{\sqrt{\nicefrac{{LR_{0}^{2}}}{{\varepsilon}}},\nicefrac{{\sigma^{2}R_{0}^{2}}}{{\varepsilon^{2}}}\}\ln(\nicefrac{{LR_{0}^{2}}}{{\varepsilon\beta}})) oracle calls.

The theorem says that for any β(0,1)\beta\in(0,1) clipped-SSTM converges to ε\varepsilon-solution with probability at least 1β1-\beta and requires exactly the same number of stochastic first-order oracle calls (up to the difference in constant and logarithmical factors) as optimal stochastic methods like AC-SA or Stochastic Similar Triangles Method . However, our method achieves this rate under less restrictive assumption. Indeed, Theorem 2.1 holds even in the case when the stochastic gradient f(x,ξ)\nabla f(x,\xi) satisfies only (2) and can have heavy-tailed distribution. In contrast, all existing results that establish (30) and that are known in the literature hold only in the light-tails case, see Section 1.3.1.

Finally, when σ2\sigma^{2} is big then Theorem 2.1 says that at iteration kk clipped-SGD requires large batchsizes mkk2Nm_{k}\sim k^{2}N (see (26)) which is proportional to ε\nicefrac32\varepsilon^{-\nicefrac{{3}}{{2}}} for last iterates. It can make the cost of one iteration extremely high, therefore, we also consider different stepsize policies that remove this drawback in Section F.1.1. In particular, the following result shows that clipped-SSTM achieves the same oracle complexity even with constant batchsizes mkm_{k} when stepsize parameter aa is chosen properly.

Let the assumptions of Theorem F.1 hold and a=Θ(max{1,ln2(\nicefracNβ),\nicefracln\nicefracNβσN\nicefrac32LR0})a=\Theta\left(\max\{1,\ln^{2}(\nicefrac{{N}}{{\beta}}),\nicefrac{{\sqrt{\ln\nicefrac{{N}}{{\beta}}}\sigma N^{\nicefrac{{3}}{{2}}}}}{{LR_{0}}}\}\right). Then mk=O(1)m_{k}=O(1) and clipped-SSTM achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(max{\nicefracLR02ε,\nicefracσ2R02ε2}ln(\nicefrac(LR02+σR0)εβ))O(\max\{\sqrt{\nicefrac{{LR_{0}^{2}}}{{\varepsilon}}},\nicefrac{{\sigma^{2}R_{0}^{2}}}{{\varepsilon^{2}}}\}\ln(\nicefrac{{(LR_{0}^{2}+\sigma R_{0})}}{{\varepsilon\beta}})) iterations/oracle calls.

SGD with Clipping

In this section we present our complexity results for clipped-SGD (see Algorithm 2) in the convex case.

Next theorem summarizes the main convergence result for clipped-SGD in this case.

Assume that function ff is convex and LL-smooth. Then for all β(0,1)\beta\in(0,1) and N1N\geq 1 such that ln(\nicefrac4Nβ)2\ln(\nicefrac{{4N}}{{\beta}})\geq 2 we have that after NN iterations of clipped-SGD with λ=Θ(LR0)\lambda=\Theta(LR_{0}) and mk=m=Θ(max{1,\nicefracNσ2R02L2ln(\nicefracNβ)})m_{k}=m=\Theta(\max\{1,\nicefrac{{N\sigma^{2}}}{{R_{0}^{2}L^{2}\ln(\nicefrac{{N}}{{\beta}})}}\}) where R0=x0x2R_{0}=\|x^{0}-x^{*}\|_{2} and stepsize γ=\nicefrac180Lln(\nicefrac4Nβ)\gamma=\nicefrac{{1}}{{80L\ln(\nicefrac{{4N}}{{\beta}})}} that f(xˉN)f(x)=O(\nicefracLR02ln(\nicefrac4Nβ)N)f(\bar{x}^{N})-f(x^{*})=O(\nicefrac{{LR_{0}^{2}\ln(\nicefrac{{4N}}{{\beta}})}}{{N}}) with probability at least 1β1-\beta where xˉN=1Nk=0N1xk\bar{x}^{N}=\frac{1}{N}\sum_{k=0}^{N-1}x^{k}. In other words, the method achieves f(xˉN)f(x)εf(\bar{x}^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(\nicefracLR02εln(\nicefracLR02εβ))O\left(\nicefrac{{LR_{0}^{2}}}{{\varepsilon}}\ln(\nicefrac{{LR_{0}^{2}}}{{\varepsilon\beta}})\right) iterations and requires O(max{\nicefracLR02ε,\nicefracσ2R02ε2}ln(\nicefracLR02εβ))O(\max\{\nicefrac{{LR_{0}^{2}}}{{\varepsilon}},\nicefrac{{\sigma^{2}R_{0}^{2}}}{{\varepsilon^{2}}}\}\ln(\nicefrac{{LR_{0}^{2}}}{{\varepsilon\beta}})) oracle calls.

To the best of our knowledge, it is the first result for clipped-SGD establishing non-trivial complexity guarantees for the convergence with high probability. Up to the difference in logarithmical factors our bound recovers the complexity bound for SGD which was obtained under light tails assumption and the complexity bound for RSMD. However, unlike in , we do not assume that the optimization problem is defined on the bounded set. The proof technique is similar to one we use to prove Theorem F.1. One can find the full proof in Section G.3.1.

Numerical Experiments

We have testedOne can find the code here: https://github.com/eduardgorbunov/accelerated_clipping. clipped-SSTM and clipped-SGD on the logistic regression problem, the datasets were taken from LIBSVM library . To implement methods we use Python 3.7 and standard libraries. One can find additional experiments and details in Section H.2.

First of all, using standard solvers from scipy library we find good enough approximation of the solution of the problem for each dataset. For simplicity, we denote this approximation by xx^{*}. Then, we numerically study the distribution of fi(x)2\|\nabla f_{i}(x^{*})\|_{2} and plot corresponding histograms for each dataset, see Figure 2.

These histograms hint that near the solution for heart dataset tails of stochastic gradients are not heavy and the norm of the noise can be well-approximated by Gaussian distribution, whereas for diabetes and australian we see the presense of outliers that makes the distribution heavy-tailed.

Next, let us consider numerical results for SGD and SSTM with and without clipping applied to solve logistic regression problem on these 33 datasets, see Figures 3- 5.

For all methods we used constant batchsizes mm, stepsizes and clipping levels were tuned, see Section H.2 for the details. In our experiments we also consider clipped-SGD with periodically decreasing clipping level λ\lambda (d-clipped-SGD in Figures), i.e. the method starts with some initial clipping level λ0\lambda_{0} and after every ll epochs or, equivalently, after every \nicefracrlm\lceil\nicefrac{{rl}}{{m}}\rceil iterations the clipping level is multiplied by some constant α(0,1)\alpha\in(0,1).

Let us discuss the obtained numerical results. First of all, d-clipped-SGD stabilizes the oscillations of SGD even if the initial clipping level was high. In contrast, clipped-SGD with too large clipping level λ\lambda behaves similarly to SGD. Secondly, we emphasize that due to the fact that we used small bathcsizes SSTM has very large oscillations in comparison to SGD. Actually, fast error/noise accumulation is a typical drawback of accelerated SGD with small batchsizes . Moreover, deterministic accelerated and momentum-based methods often have non-monotone behavior (see and references therein). However, to some extent clipped-SSTM suffers from the first drawback less than SSTM and has comparable convergence rate with SSTM. Finally, in our experiments on heart and australian datasets clipped-SSTM converges faster than SGD and clipped-SGD and oscillates little, while on diabetes dataset it also converges faster than SGD, but oscillates more if parameter BB is not fine-tuned.

We also want to mention that the behavior of SGD on heart and diabetes datasets correlates with the insights from Section 1.2 and our numerical study of the distribution of fi(x)2\|\nabla f_{i}(x^{*})\|_{2}. Indeed, for heart dataset SGD has little oscillations since the distribution of fi(xk)f(xk)2\|\nabla f_{i}(x^{k})-\nabla f(x^{k})\|_{2}, where xkx^{k} is the last iterate, is well concentrated near its mean and can be approximated by Gaussian distribution (see the details in Section H.2). In contrast, Figure 4 shows that SGD oscillates more than in the previous example. One can explain such behavior using Figure 2 showing that the distribution of f(x)2\|\nabla f(x^{*})\|_{2} has heavier tails than for heart dataset.

However, we do not see any oscillations of SGD for australian dataset despite the fact that according to Figure 2 the distribution of fi(x)2\|\nabla f_{i}(x^{*})\|_{2} in this case has heavier tails than in previous examples. Actually, there is no contradiction and in this case it simply means that SGD does not get close to the solution in terms of functional value, despite the fact that we used γ=\nicefrac1L\gamma=\nicefrac{{1}}{{L}}. In Section H.2 we present the results of different tests where we tried to use bigger stepsize γ\gamma in order to reach oscillation region faster and show that in fact in that region SGD oscillates significantly more, but clipping fixes this issue without spoiling the convergence rate.

Discussion

In this paper we close the gap in the theory of high-probability complexity bounds for stochastic optimization with heavy-tailed noise. In particular, we propose a new accelerated stochastic method — clipped-SSTM — and prove the first accelerated high-probability complexity bounds for smooth convex stochastic optimization without light-tails assumption. Moreover, we extend our results to the strongly convex case and prove new complexity bounds outperforming the state-of-the-art results. Finally, we derive first high-probability complexity bounds for the popular method called clipped-SGD in convex and strongly convex cases and conduct a numerical study of the considered methods.

Broader Impact

Our contribution is primarily theoretical. Therefore, a broader impact discussion is not applicable.

Acknowledgments and Disclosure of Funding

The research of E. Gorbunov and A. Gasnikov was partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03. The research of Marina Danilova was funded by RFBR, project number 20-31-90073.

Appendix Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

Next, we introduce some standard definitions.

It is well-known that LL-smoothness implies (see )

Since in this paper we focus only on smooth optimization problems we introduce strong convexity in the following way.

Throughout the paper, we use xx^{*} to denote any solution of problem (1) assuming its existence. By the complexity of stochastic first-order method we always mean the total number of stochastic first-order oracle calls that the method needs in order to produce such a point x^\hat{x} that f(x^)f(x)εf(\hat{x})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta for some ε>0\varepsilon>0 and β(0,1)\beta\in(0,1). Finally, in the complexity bounds we often use R0R_{0} to denote x0x2\|x^{0}-x^{*}\|_{2} where x0x^{0} is the starting point of the method.

Appendix B Related Work: Additional Details

The first terms in maximums above correspond to the Central Limit Theorem regime, while the second terms correspond to the heavy-tailed regime, see Section D.2. These bounds show that heavy tailed distributions of the stochastic gradients significantly spoil complexity bounds of SGD and restarted-SGD when the confidence level β\beta is small enough.

B.2 Related Work on Gradient Clipping

Appendix C Basic Facts

In this section we enumerate for convenience basic facts that we use many times in our proofs.

Appendix D Auxiliary Results

D.2 About the Sum of i.i.d. Random Variables with Heavy Tails

where N1N\gg 1 and Φ(x)=12πxexp(y2/2)dy\Phi(x)=\frac{1}{2\pi}\int_{-\infty}^{x}\exp\left(-y^{2}/2\right)dy. Since

This simple observation can play a significant role in deriving complexity results for non-smooth convex optimization under the assumption that stochastic gradients are heavy-tailed, see for the details.

Appendix E Technical Results

Consider two sequences of non-negative numbers {αk}k0\{\alpha_{k}\}_{k\geq 0} and {Ak}k0\{A_{k}\}_{k\geq 0} such that

Using (k+1)(k+4)(k+2)2(k+1)(k+4)\geq(k+2)^{2} together with the inequality above we derive (21). ∎

Appendix F Accelerated SGD with Clipping: Exact Formulations and Missing Proofs

In this section we provide exact formulations of all the results that we have for clipped-SSTM and R-clipped-SSTM together with the full proofs.

Recall that in order to compute clip(f(x,ξ),λ)\text{clip}(\nabla f(x,\boldsymbol{\xi}),\lambda) one needs to get mm i.i.d. samples f(x,ξ1),,f(x,ξm)\nabla f(x,\xi_{1}),\ldots,\nabla f(x,\xi_{m}), compute its average

and then project the result f(x,ξ)\nabla f(x,\boldsymbol{\xi}) on the Euclidean ball with radius λ\lambda and center at the origin. We also notice that

Next theorem summarizes the main convergence result for clipped-SSTM.

Assume that function ff is convex and LL-smooth. Then for all β(0,1)\beta\in(0,1) and N1N\geq 1 such that

we have that after NN iterations of clipped-SSTM with

In other words, if we choose aa to be equal to the maximum from (27), then the method achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(LR02εlnLR02εβ)O\left(\sqrt{\frac{LR_{0}^{2}}{\varepsilon}}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

One can easily notice that multiplicative constant factors in formulas for mkm_{k} and aa are too big and seem to be impractical, but in practice one can tune these constants to get good enough performance. That is, big constants in (26) and (27) are needed only in our analysis in order to get bound (30).

Finally, when σ2\sigma^{2} is big then Theorem F.1 says that at iteration kk clipped-SGD requires large batchsizes mkk2Nm_{k}\sim k^{2}N (see (26)) which is proportional to ε\nicefrac32\varepsilon^{-\nicefrac{{3}}{{2}}} for last iterates. It can make the cost of one iteration extremely high, therefore, we consider different stepsize policies that remove this drawback.

(Medium batchsize). If NN and β\beta are such that Nln4NβN\ln\frac{4N}{\beta} is bigger than the maximum from (27), then for a=Nln4Nβa=N\ln\frac{4N}{\beta} we have

and the method achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(LR02εlnLR02εβ)O\left(\frac{LR_{0}^{2}}{\varepsilon}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

(Constant batchsize). If NN and β\beta are such that a0N\nicefrac32ln4Nβa_{0}N^{\nicefrac{{3}}{{2}}}\sqrt{\ln\frac{4N}{\beta}} is bigger than the maximum from (27) for some positive constant a0a_{0}, then for a=a0N\nicefrac32ln4Nβa=a_{0}N^{\nicefrac{{3}}{{2}}}\sqrt{\ln\frac{4N}{\beta}} we have

and the method achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(a02L2R04ε2lna0LR02εβ)O\left(\frac{a_{0}^{2}L^{2}R_{0}^{4}}{\varepsilon^{2}}\ln\frac{a_{0}LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

Finally, if a0=σLR0a_{0}=\frac{\sigma}{LR_{0}}, then mk=O(1)m_{k}=O(1) for k=0,1,,Nk=0,1,\ldots,N and clipped-SSTM finds ε\varepsilon-solution with probability at least 1β1-\beta after O(σ2R02ε2lnσR0εβ)O\left(\frac{\sigma^{2}R_{0}^{2}}{\varepsilon^{2}}\ln\frac{\sigma R_{0}}{\varepsilon\beta}\right) iterations and requires O(1)O(1) oracle calls per iteration.

In the second case the corollary establishes ε2ln(ε1β1)\varepsilon^{-2}\ln(\varepsilon^{-1}\beta^{-1}) rate for clipped-SSTM with constant batchsizes, i.e. mk=O(1)m_{k}=O(1) for all kk. The ability of clipped-SSTM to converge with constant batchsizes makes it more practical and applicable for wider class of problems where it can be very expensive to compute large batchsizes, e.g. training deep neural networks. Moreover, when σ\sigma is not too small, i.e. σ2Lε\sigma^{2}\geq L\varepsilon, this rate is optimal (up to logarithmical factors) and also recovers the rate of RSMD.

and mkm_{k} as in (26), we get mk=O(1)m_{k}=O(1) for k=0,1,,Nk=0,1,\ldots,N and derive the following result.

Let the assumptions of Theorem F.1 hold, aa is chosen as in (35) and mkm_{k} is computed via (26). Then clipped-SSTM achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

We start with the following lemma that is pretty standard in the analysis of Stochastic Similar Triangles Method, e.g. see the proof of Theorem 1 from .

That is, if z=xz=x^{*}, then the result above gives a preliminary upper bound for AN(f(yN)f(x))A_{N}(f(y^{N})-f(x^{*})). The first and the second terms in the r.h.s. of (36) come from the analysis of Similar Triangles Method and three last terms have a stochastic nature. In particular, they explicitly depend on differences θk+1=~f(xk+1,ξk)f(xk+1)\theta_{k+1}=\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k})-\nabla f(x^{k+1}) between clipped mini-batched stochastic gradients and full gradients at xk+1x^{k+1}, so, if ~f(xk+1,ξk)=f(xk+1)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k})=\nabla f(x^{k+1}) with probability 11, then we easily get needed convergence rate. However, we are interested in the more general case and, as a consequence, to continue the proof, we need to find a good enough upper bound for the last three terms from (36). In other words, we need to show that choosing parameters aa, mkm_{k} and λk+1\lambda_{k+1} properly we can upper bound these terms by something that coincides with z0x22\|z^{0}-x^{*}\|_{2}^{2} up to numerical multiplicative constant. The proof of convergence result for RSMD from where authors provide upper bound for similar sums hints that Bernstein’s inequality (see Lemma D.1) applied to estimate these terms can help us to reach our goal. In order to apply Bernstein’s inequality one should derive tight bounds for such characteristics of ~f(xk+1,ξk)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k}) as upper bounds for the magnitude, bias, variance and distortion and the next lemma provides us with this.

For all k0k\geq 0 the following inequality holds:

Moreover, if f(xk+1)2λk+12\|\nabla f(x^{k+1})\|_{2}\leq\frac{\lambda_{k+1}}{2} for some k0k\geq 0, then for this kk we have:

Clearly, clipping introduces a bias in ~f(xk+1,ξk)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k}) which influences the convergence of the method. Hence, the clipping level λk+1\lambda_{k+1} should be chosen in a very accurate way. Below we informally describe what does it mean and present the sketch of the remaining part of the proof.

Imagine the ideal situation: f(xk+1,ξk)=f(xk+1)\nabla f(x^{k+1},\boldsymbol{\xi}^{k})=\nabla f(x^{k+1}) with probability 11 for all kk, i.e. we have an access to the full gradients at points xk+1x^{k+1}. Then it is natural to choose λk+1\lambda_{k+1} in such a way that clip(f(xk+1),λk+1)=f(xk+1)\text{clip}(\nabla f(x^{k+1}),\lambda^{k+1})=\nabla f(x^{k+1}) in order to recover Similar Triangles Method (STM) that converges with optimal rate in the deterministic case. In other words, one can pick λk+1\lambda_{k+1} such that f(xk+1)2λk+1\|\nabla f(x^{k+1})\|_{2}\leq\lambda_{k+1} and get an optimal method. Since we know that in this case the method should converge with O(\nicefrac1k2)O(\nicefrac{{1}}{{k^{2}}}) rate in terms of f(xk)f(x)f(x^{k})-f(x^{*}) one can expect that the gradient’s norm decays with O(\nicefrac1k)O(\nicefrac{{1}}{{k}}) rate, so, one can choose λk+1\lambda_{k+1} to be proportional to \nicefrac1k\nicefrac{{1}}{{k}}. It is exactly what we do when we define λk+1\lambda_{k+1} as \nicefracBαk+1\nicefrac{{B}}{{\alpha_{k+1}}}.

The ideal case described above gives a good insight on how to choose λk+1\lambda_{k+1} in the general case and can be described as follows: if we want to prevent our gradient estimator ~f(xk+1,ξk)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k}) from large deviations from f(xk+1)\nabla f(x^{k+1}) with high probability, then it is needed to choose λk+1\lambda_{k+1} such that f(xk)2cλk+1\|\nabla f(x^{k})\|_{2}\leq c\lambda_{k+1} with high probability where c<1c<1 is some positive number. This choice guarantees that with high probability clipped mini-batched gradient ~f(xk+1,ξk)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k}) cannot deviates from f(xk+1)\nabla f(x^{k+1}) significantly and, as a consequence, the convergence rate of clipped-SSTM in terms of the number of iterations needed to achieve the desired accuracy of the solution with high probability becomes similar to the convergence rate of STM up to some logarithmical factors depending on the confidence level.

In particular, we choose λk+1\lambda_{k+1} such that f(xk+1)2\nicefracλk+12\|\nabla f(x^{k+1})\|_{2}\leq\nicefrac{{\lambda_{k+1}}}{{2}} with high probability. Moreover, we derive this relation by induction via refined estimation of the three last terms from the r.h.s. of (36) that is based on the new variant of advanced recurrences technique from . The main trick there is in showing by induction that sequence zkx2\|z^{k}-x^{*}\|_{2} is bounded by some constant multiplied by x0x2\|x^{0}-x^{*}\|_{2} and in deriving f(xk+1)2\nicefracλk+12\|\nabla f(x^{k+1})\|_{2}\leq\nicefrac{{\lambda_{k+1}}}{{2}} simultaneously for all k=0,1,,Nk=0,1,\ldots,N. With such bounds and Lemma F.5 in hand, it is possible to apply Bernstein’s inequality to three sums from the r.h.s. of (36) since all summands are bounded with high probability. After applying Bernstein’s inequality we adjust parameters αk+1\alpha_{k+1} and mkm_{k} in such a way that after rearranging the terms in the obtained upper bounds we get that r.h.s. in (36) (with z=xz=x^{*}) is smaller than x0x22\|x^{0}-x^{*}\|_{2}^{2} up to some multiplicative numerical constant. This finishes the proof.

To conclude, the key tools in our analysis are Bernstein’s inequality (see Lemma D.1) and advanced recurrences technique that helps us to show boundedness of zNx2\|z^{N}-x^{*}\|_{2} and f(xk+1)2\nicefracλk+12\|\nabla f(x^{k+1})\|_{2}\leq\nicefrac{{\lambda_{k+1}}}{{2}} with high probability. We provide detailed proofs of presented result in the Appendix (see Section F.3).

F.2 Strongly Convex Case

In this section we assume additionally that f(x)f(x) is μ\mu-strongly convex. For this case we modify Algorithm 1 and propose a new method called Restarted Clipped Similar Triangles Method (R-clipped-SSTM), see Algorithm 3.

At each iteration R-clipped-SSTM runs clipped-SSTM for N0N_{0} iterations from the current point x^k\hat{x}^{k} and use its output as next iterate x^k+1\hat{x}^{k+1}. In literature this approach is known as the restarts technique . Choosing N0N_{0} and parameters mkm_{k}, aa and BB in a proper way one can get an accelerated method for strongly convex objectives. Theorem below states the main convergence result for R-clipped-SSTM.

Assume that ff is μ\mu-strongly convex and LL-smooth. If we choose β(0,1)\beta\in(0,1), τ\tau and N01N_{0}\geq 1 such that

where R=2(f(x0)f(x))μR=\sqrt{\frac{2(f(x^{0})-f(x^{*}))}{\mu}} and C=5C=\sqrt{5}, then we have that after τ\tau runs of clipped-SSTM in R-clipped-SSTM the inequality

holds with probability at least 1β1-\beta. That is, if we choose aa to be equal to the maximum from (45) and N0C18aLμN_{0}\leq C_{1}\sqrt{\frac{8aL}{\mu}} with some numerical constant C1CC_{1}\geq C, then the method achieves f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

In other words, R-clipped-SSTM has the same convergence rate as optimal stochastic methods for strongly convex problems like Multi-Staged AC-SA (MS-AC-SA) or Stochastic Similar Triangles Method for strongly convex problems (SSTM_sc) . Moreover, in Theorem F.6 we do not assume that stochastic gradients are sampled from sub-Gaussian distribution while corresponding results for MS-AC-SA and SSTM_sc are substantially based on the light tails assumption. Our bound outperforms the state-of-the-art result from in terms of the dependence on lnLμ\ln\frac{L}{\mu}. It is worth to mention here that using special restarts technique Nazin et al. generalize their method (RSMD) for the strongly convex case, but since RSMD is not accelerated their approach gives only non-accelerated convergence rate.

We also emphasize that big numerical factors in formulas for mktm_{k}^{t} and aa are needed only in our analysis and in practice they can be tuned. However, when σ2\sigma^{2} is big bathsizes mktm_{k}^{t} become of the order k2ε1k^{2}\varepsilon^{-1}. It can make the cost of one iteration extremely high, therefore, as for clipped-SSTM we consider a different stepsize policy removing this drawback.

Let the assumptions of Theorem F.6 hold. Assume that conditions (42), (43), (44) and (45) are satisfied for

Then after τ=ln(\nicefracμR22ε)\tau=\lceil\ln(\nicefrac{{\mu R^{2}}}{{2\varepsilon}})\rceil runs of clipped-SSTM in R-clipped-SSTM the method achieves f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta. Moreover, the total number of iterations of clipped-SSTM equals

with mkt=O(1)m_{k}^{t}=O(1) for all k=0,1,,N01k=0,1,\ldots,N_{0}-1, t=0,1,,τ1t=0,1,\ldots,\tau-1.

When σ2\sigma^{2} is big the obtained bound is comparable with bounds for restarted-RSMD and proxBoost, see Table 2.

F.3 Proofs

Since Ak+1aLαk+12A_{k+1}\geq aL\alpha_{k+1}^{2} (see Lemma E.1) and a1a\geq 1 we can continue our derivations:

By definition of xk+1x^{k+1} we have xk+1=Akyk+αk+1zkAk+1x^{k+1}=\frac{A_{k}y^{k}+\alpha_{k+1}z^{k}}{A_{k+1}} which implies

since Ak+1=Ak+αk+1A_{k+1}=A_{k}+\alpha_{k+1}. Putting all together we derive that

where in the last inequality we use the convexity of ff. Taking into account A0=α0=0A_{0}=\alpha_{0}=0 and AN=k=0N1αk+1A_{N}=\sum_{k=0}^{N-1}\alpha_{k+1} we sum up these inequalities for k=0,,N1k=0,\ldots,N-1 and get

Proof of (39). In order to prove this bound we introduce following indicator random variables:

From the assumptions of the lemma, we have that f(xk+1)2λk+12\|\nabla f(x^{k+1})\|_{2}\leq\frac{\lambda_{k+1}}{2} which implies

The introduced notation helps us to rewrite ~f(xk+1,ξk)\widetilde{\nabla}f(x^{k+1},\boldsymbol{\xi}^{k}) in the following way:

We use this representation to obtain the following inequality:

Next, we derive an upper bound for the expectation of ηk\eta_{k} using Markov’s inequality:

Proof of (41). To derive (41) we use (40):

holds for all N0N\geq 0. Taking into account that f(yN)f(x)0f(y^{N})-f(x^{*})\geq 0 for all yNy^{N} and using new notation Rk=defzkx2R_{k}\stackrel{{\scriptstyle\text{def}}}{{=}}\|z^{k}-x^{*}\|_{2}, R~0=R0\widetilde{R}_{0}=R_{0}, R~k+1=max{R~k,Rk+1}\widetilde{R}_{k+1}=\max\{\widetilde{R}_{k},R_{k+1}\} we derive that for all k0k\geq 0

First of all, we notice that for each k0k\geq 0 iterates xk+1,zk,ykx^{k+1},z^{k},y^{k} lie in the ball BR~k(x)B_{\widetilde{R}_{k}}(x^{*}). We prove it using induction. Since y0=z0=x0y^{0}=z^{0}=x^{0}, R~0=R0=z0x2\widetilde{R}_{0}=R_{0}=\|z^{0}-x^{*}\|_{2} and x1=A0y0+α1z0A1=z0x^{1}=\frac{A_{0}y^{0}+\alpha_{1}z^{0}}{A_{1}}=z^{0} we have that x1,z0,y0BR~0(x)x^{1},z^{0},y^{0}\in B_{\widetilde{R}_{0}}(x^{*}). Next, assume that xl,zl1,yl1BR~l1(x)x^{l},z^{l-1},y^{l-1}\in B_{\widetilde{R}_{l-1}}(x^{*}) for some l1l\geq 1. By definitions of RlR_{l} and R~l\widetilde{R}_{l} we have that zlBRl(x)BR~l(x)z^{l}\in B_{R_{l}}(x^{*})\subseteq B_{\widetilde{R}_{l}}(x^{*}). Since yly^{l} is a convex combination of yl1BR~l1(x)BR~l(x)y^{l-1}\in B_{\widetilde{R}_{l-1}}(x^{*})\subseteq B_{\widetilde{R}_{l}}(x^{*}), zlBR~l(x)z^{l}\in B_{\widetilde{R}_{l}}(x^{*}) and BR~l(x)B_{\widetilde{R}_{l}}(x^{*}) is a convex set we conclude that ylBR~l(x)y^{l}\in B_{\widetilde{R}_{l}}(x^{*}). Finally, since xl+1x^{l+1} is a convex combination of yly^{l} and zlz^{l} we have that xl+1x^{l+1} lies in BR~l(x)B_{\widetilde{R}_{l}}(x^{*}) as well.

The rest of the proof is based on the refined analysis of inequality (64). In particular, via induction we prove that for all k=0,1,,Nk=0,1,\ldots,N with probability at least 1kβN1-\frac{k\beta}{N} the following statement holds: inequalities

hold for t=0,1,,T1t=0,1,\ldots,T-1. Then, inequalities

hold for t=1,,T1t=1,\ldots,T-1 where the last inequality follows from (t+2)2t(t+3)(1+2)21(1+3)=94\frac{(t+2)^{2}}{t(t+3)}\leq\frac{(1+2)^{2}}{1(1+3)}=\frac{9}{4}. Taking aa such that

we obtain that probability event ET1E_{T-1} implies

for t=0,,T1t=0,\ldots,T-1. Since B=CR08ln4NβB=\frac{CR_{0}}{8\ln\frac{4N}{\beta}} we have to choose such aa that

w.r.t. a\sqrt{a} we get that aa should satisfy

Having inequalities (67) in hand we show in the rest of the proof that (65) holds for t=Tt=T with big enough probability. First of all, we introduce new random variables:

for l=0,1,T1l=0,1,\ldots T-1. Note that these random variables are bounded with probability 11, i.e. with probability 11 we have

Secondly, we use the introduced notation and get that ET1E_{T-1} implies

Finally, we do some preliminaries in order to apply Bernstein’s inequality (see Lemma D.1) and obtain that ET1E_{T-1} implies

It remains to provide tight upper bounds for ①, ②, ③, ④ and ⑤, i.e. in the remaining part of the proof we show that ¬+­+®+¯+°δC2R02\text{\char 172}+\text{\char 173}+\text{\char 174}+\text{\char 175}+\text{\char 176}\leq\delta C^{2}R_{0}^{2} for some δ<1\delta<1.

Secondly, these summands are bounded with probability 11:

In other words, sequence {αl+1θl+1u,2ηl+2αl+1ζl}l0\left\{\alpha_{l+1}\left\langle\theta_{l+1}^{u},2\eta_{l}+2\alpha_{l+1}\zeta_{l}\right\rangle\right\}_{l\geq 0} is bounded martingale difference sequence with bounded conditional variances {σl2}l0\{\sigma_{l}^{2}\}_{l\geq 0}. Therefore, we can apply Bernstein’s inequality, i.e. we apply Lemma D.1 with Xl=αl+1θl+1u,2ηl+2αl+1ζlX_{l}=\alpha_{l+1}\left\langle\theta_{l+1}^{u},2\eta_{l}+2\alpha_{l+1}\zeta_{l}\right\rangle, c=33C2R0264ln4Nβc=\frac{33C^{2}R_{0}^{2}}{64\ln\frac{4N}{\beta}} and F=c2ln4Nβ18F=\frac{c^{2}\ln\frac{4N}{\beta}}{18} and get that for all b>0b>0

or, equivalently, with probability at least 12exp(b22F+\nicefrac2cb3)1-2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)

The choice of FF will be clarified further, let us now choose bb in such a way that 2exp(b22F+\nicefrac2cb3)=β2N2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)=\frac{\beta}{2N}. This implies that bb is the positive root of the quadratic equation

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

where the last inequality follows from c=33C2R0264ln4Nβc=\frac{33C^{2}R_{0}^{2}}{64\ln\frac{4N}{\beta}} and simple arithmetic.

Upper bound for ②. First of all, we notice that probability event ET1E_{T-1} implies

Upper bound for ③. We derive the upper bound for ③ using the same technique as for ①. First of all, we notice that the summands in ③ are conditionally independent:

Secondly, the summands are bounded with probability 11:

or, equivalently, with probability at least 12exp(b22F1+\nicefrac2c1b3)1-2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)

As in our derivations of the upper bound for ① we choose such bb that 2exp(b22F1+\nicefrac2c1b3)=β2N2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)=\frac{\beta}{2N}, i.e.

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

Upper bound for ④. The probability event ET1E_{T-1} implies

Upper bound for ⑤. Again, we use corollaries of probability event ET1E_{T-1}:

Now we summarize all bound that we have: probability event ET1E_{T-1} implies

Taking into account these inequalities we get that probability event ET1E¬E®E_{T-1}\cap E_{\text{\char 172}}\cap E_{\text{\char 174}} implies

That is, by definition of ETE_{T} and ET1E_{T-1} we have proved that

Since AN=N(N+3)4aLA_{N}=\frac{N(N+3)}{4aL} (see Lemma E.1) we get that with probability at least 1β1-\beta

In other words, clipped-SSTM with a=max{1,16ln4NβC,36(2ln4Nβ+4ln24Nβ+2ln4Nβ)2}=36(2ln4Nβ+4ln24Nβ+2ln4Nβ)2a=\max\left\{1,\frac{16\ln\frac{4N}{\beta}}{C},36\left(2\ln\frac{4N}{\beta}+\sqrt{4\ln^{2}\frac{4N}{\beta}+2\ln\frac{4N}{\beta}}\right)^{2}\right\}=36\left(2\ln\frac{4N}{\beta}+\sqrt{4\ln^{2}\frac{4N}{\beta}+2\ln\frac{4N}{\beta}}\right)^{2} achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(LR02εlnLR02εβ)O\left(\sqrt{\frac{LR_{0}^{2}}{\varepsilon}}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

Theorem F.1 implies that with probability at least 1β1-\beta

αk+1=k+22aL\alpha_{k+1}=\frac{k+2}{2aL} and batchsizes mkm_{k} are chosen according to (26):

We consider two different options for aa.

If Nln4NβN\ln\frac{4N}{\beta} is bigger than a^\hat{a}, then we take a=Nln4Nβa=N\ln\frac{4N}{\beta} which implies that

That is, if ε\varepsilon is small enough to satisfy LR02εlnLR02εβC1ln2LR02εβ\frac{LR_{0}^{2}}{\varepsilon}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\geq C_{1}\ln^{2}\frac{LR_{0}^{2}}{\varepsilon\beta} for some constant C1C_{1}, then due to (80) we have that after

of clipped-SSTM we obtain such point yNy^{N} that with probability at least 1β1-\beta inequality f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon holds and the method requires

If a0N\nicefrac32ln4Nβa_{0}N^{\nicefrac{{3}}{{2}}}\sqrt{\ln\frac{4N}{\beta}} is bigger than a^\hat{a} for some a0>0a_{0}>0, then we take a=a0N\nicefrac32ln4Nβa=a_{0}N^{\nicefrac{{3}}{{2}}}\sqrt{\ln\frac{4N}{\beta}} which implies that

That is, if ε\varepsilon is small enough to satisfy a03L3R06ε3(lnLR02εβ)\nicefrac32C2ln2LR02εβ\frac{a_{0}^{3}L^{3}R_{0}^{6}}{\varepsilon^{3}}\left(\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right)^{\nicefrac{{3}}{{2}}}\geq C_{2}\ln^{2}\frac{LR_{0}^{2}}{\varepsilon\beta} for some constant C2C_{2}, then due to (81) we have that after

of clipped-SSTM we obtain such point yNy^{N} that with probability at least 1β1-\beta inequality f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon holds and the method requires

stochastic first-order oracle calls. Finally, if all assumptions on NN, β\beta and ε\varepsilon hold for a0=σLR0a_{0}=\frac{\sigma}{LR_{0}}, then for all k=0,1,,N1k=0,1,\ldots,N-1

i.e. one iteration of clipped-SSTM requires O(1)O(1) oracle calls, and f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

Since aσN\nicefrac32LR0a\geq\frac{\sigma N^{\nicefrac{{3}}{{2}}}}{LR_{0}} we have that mk=O(1)m_{k}=O(1). Next, there are two possible situations.

If a=aa=a^{\prime}, then we are in the settings of Theorem F.1. This means that clipped-SSTM achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

If a=σN\nicefrac32LR0ln4Nβa=\frac{\sigma N^{\nicefrac{{3}}{{2}}}}{LR_{0}}\sqrt{\ln\frac{4N}{\beta}}, then we are in the settings of Corollary F.2 which implies that clipped-SSTM achieves f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

Finally, we combine these two cases and obtain that with a=max{a,σN\nicefrac32LR0ln4Nβ}a=\max\left\{a^{\prime},\frac{\sigma N^{\nicefrac{{3}}{{2}}}}{LR_{0}}\sqrt{\ln\frac{4N}{\beta}}\right\} clipped-SSTM guarantees f(yN)f(x)εf(y^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

First of all, consider behavior of clipped-SSTM during the first run in R-clipped-SSTM. We notice that the proof of Theorem F.1 will be valid if we substitute R0R_{0} everywhere by its upper bound RR. From μ\mu-strong convexity of ff we have

therefore, one can choose R=2μ(f(x0)f(x))R=\sqrt{\frac{2}{\mu}\left(f(x^{0})-f(x^{*})\right)}. It implies that after N0N_{0} iterations of clipped-SSTM we have

with probability at least 1βτ1-\frac{\beta}{\tau}, hence with the same probability f(yN0)f(x)12(f(x0)f(x))f(y^{N_{0}})-f(x^{*})\leq\frac{1}{2}(f(x^{0})-f(x^{*})) since N0C8aLμN_{0}\geq C\sqrt{\frac{8aL}{\mu}}. In other words, with probability at least 1βτ1-\frac{\beta}{\tau}

Then, by induction one can show that for arbitrary k{0,1,,τ1}k\in\{0,1,\ldots,\tau-1\} the inequality

holds with probability at least 1βτ1-\frac{\beta}{\tau}. Therefore, these inequalities hold simultaneously with probability at least 1β1-\beta. Using this we derive that inequality

holds with probability 1β\geq 1-\beta. That is, after τ=log2μR22ε\tau=\left\lceil\log_{2}\frac{\mu R^{2}}{2\varepsilon}\right\rceil restarts R-clipped-SSTM generates such a point x^τ\hat{x}^{\tau} that f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta. Moreover, if aa equals the maximum from (45) and N0C18aLμN_{0}\leq C_{1}\sqrt{\frac{8aL}{\mu}} with some numerical constant C1CC_{1}\geq C, then a(lnN0τβ)2a\sim\left(\ln\frac{N_{0}\tau}{\beta}\right)^{2}, the total number of iterations of clipped-SSTM equals

and the overall number of stochastic first-order oracle calls is

Similarly to the proof of Theorem F.6 (see the previous subsection) we derive that under assumptions of the corollary after τ=log2μR22ε\tau=\left\lceil\log_{2}\frac{\mu R^{2}}{2\varepsilon}\right\rceil restarts R-clipped-SSTM generates such a point x^τ\hat{x}^{\tau} that f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta. Moreover, aa and N0N_{0} satisfy the following system of inequalities

Then, for all k=0,1,,N01k=0,1,\ldots,N_{0}-1 and t=0,1,,τ1t=0,1,\ldots,\tau-1 batchsizes satisfy

i.e. the algorithm requires O(1)O(1) oracle calls per iteration. Finally, the total number of iterations is

Appendix G SGD with Clipping: Exact Formulations and Missing Proofs

In this section we provide exact formulations of all the results that we have for clipped-SGD and R-clipped-SGD together with the full proofs.

Assume that function ff is convex and LL-smooth. Then for all β(0,1)\beta\in(0,1) and N1N\geq 1 such that

we have that after NN iterations of clipped-SGD with

where R0=x0x2R_{0}=\|x^{0}-x^{*}\|_{2} and stepsize

where xˉN=1Nk=0N1xk\bar{x}^{N}=\frac{1}{N}\sum_{k=0}^{N-1}x^{k} and

In other words, the method achieves f(xˉN)f(x)εf(\bar{x}^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(LR02εlnLR02εβ)O\left(\frac{LR_{0}^{2}}{\varepsilon}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

To the best of our knowledge, it is the first result for clipped-SGD establishing non-trivial complexity guarantees for the convergence with high probability. One can find the full proof in Section G.3.1.

G.2 Strongly Convex Case

Next, we consider the situation when ff is additionally μ\mu-strongly convex and propose a restarted version of clipped-SGD (R-clipped-SGD), see Algorithm 4.

For this method we prove the following result.

Assume that ff is μ\mu-strongly convex and LL-smooth. If we choose β(0,1)\beta\in(0,1), τ\tau and N01N_{0}\geq 1 such that

where R=2(f(x0)f(x))μR=\sqrt{\frac{2(f(x^{0})-f(x^{*}))}{\mu}} and C=2C=\sqrt{2}, then we have that after τ\tau runs of clipped-SGD in R-clipped-SGD the inequality

holds with probability at least 1β1-\beta. That is, if we choose N0ln4N0τβC1Lμ\frac{N_{0}}{\ln\frac{4N_{0}\tau}{\beta}}\leq\frac{C_{1}L}{\mu} with some numerical constant C1320C2C_{1}\geq 320C^{2}, then the method achieves f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

This theorem implies that R-clipped-SGD has the same complexity as the restarted version of RSMD from up to the difference in logarithmical factors. We notice that the main difference between our result and one from is that we do not need to assume that the optimization problem is considered on the bounded set.

However, in order to get (94) R-clipped-SGD requires to know strong convexity parameter μ\mu. In order to remove this drawback we analyse clipped-SGD for the strongly convex case and get the following result.

Assume that function ff is μ\mu-strongly convex and LL-smooth. Then for all β(0,1)\beta\in(0,1) and N1N\geq 1 such that

we have that after NN iterations of clipped-SGD with

where r0=f(x0)f(x)r_{0}=f(x^{0})-f(x^{*}) and stepsize

In other words, the method achieves f(xN)f(x)εf(x^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(Lμln(r0ε)ln(Lμβlnr0ε))O\left(\frac{L}{\mu}\ln\left(\frac{r_{0}}{\varepsilon}\right)\ln\left(\frac{L}{\mu\beta}\ln\frac{r_{0}}{\varepsilon}\right)\right) iterations and requires

Unfortunately, our approach leads to worse complexity bound than we have for R-clipped-SGD: in the second term of the maximum in (99) we get an extra factor \nicefracLμ\nicefrac{{L}}{{\mu}} that can be large. Nevertheless, to the best of our knowledge it is the first non-trivial complexity result for clipped-SGD that guarantees convergence with high probability. One can find the full proof of Theorem G.3 in Section G.3.3.

G.3 Proofs

Since f(x)f(x) is convex and LL-smooth, we get the following inequality:

where θk=~f(xk,ξk)f(xk)\theta_{k}=\widetilde{\nabla}f(x^{k},\boldsymbol{\xi}^{k})-\nabla f(x^{k}) and the last inequality follows from the convexity of ff. Using notation Rk=defxkx2R_{k}\stackrel{{\scriptstyle\text{def}}}{{=}}\|x^{k}-x^{*}\|_{2} we derive that for all k0k\geq 0

Let us define A=(2γ4γ2L)A=\left(2\gamma-4\gamma^{2}L\right), then

Summing up these inequalities for k=0,,N1k=0,\dots,N-1 we obtain

Noticing that for xˉN=1Nk=0N1xk\bar{x}^{N}=\frac{1}{N}\sum\limits_{k=0}^{N-1}x^{k} Jensen’s inequality gives f(xˉN)=f(1Nk=0N1xk)1Nk=0N1f(xk)f(\bar{x}^{N})=f\left(\frac{1}{N}\sum\limits_{k=0}^{N-1}x^{k}\right)\leq\frac{1}{N}\sum\limits_{k=0}^{N-1}f(x^{k}) we have

Taking into account that f(xˉN)f(x)0f(\bar{x}^{N})-f(x^{*})\geq 0 and changing the indices we get that for all k0k\geq 0

The remaining part of the proof is based on the analysis of inequality (101). In particular, via induction we prove that for all k=0,1,,Nk=0,1,\ldots,N with probability at least 1kβN1-\frac{k\beta}{N} the following statement holds: inequalities

hold for t=0,1,,T1t=0,1,\ldots,T-1. Since ff is LL-smooth, we have that probability event ET1E_{T-1} implies

for t=0,,T1t=0,\ldots,T-1, where the clipping level is defined as

Having inequalities (103) in hand we show in the rest of the proof that (102) holds for t=Tt=T with big enough probability. First of all, we introduce new random variables:

for l=0,1,T1l=0,1,\ldots T-1. Note that these random variables are bounded with probability 11, i.e. with probability 11 we have

Secondly, we use the introduced notation and get that ET1E_{T-1} implies

Finally, we do some preliminaries in order to apply Bernstein’s inequality (see Lemma D.1) and obtain that ET1E_{T-1} implies

It remains to provide tight upper bounds for ①, ②, ③, ④ and ⑤, i.e. in the remaining part of the proof we show that ¬+­+®+¯+°δC2R02\text{\char 172}+\text{\char 173}+\text{\char 174}+\text{\char 175}+\text{\char 176}\leq\delta C^{2}R_{0}^{2} for some δ<1\delta<1.

Secondly, these summands are bounded with probability 11:

In other words, sequence {2γθlu,ηl}l0\left\{2\gamma\left\langle\theta_{l}^{u},\eta_{l}\right\rangle\right\}_{l\geq 0} is a bounded martingale difference sequence with bounded conditional variances {σl2}l0\{\sigma_{l}^{2}\}_{l\geq 0}. Therefore, we can apply Bernstein’s inequality, i.e. we apply Lemma D.1 with Xl=2γθlu,ηlX_{l}=2\gamma\left\langle\theta_{l}^{u},\eta_{l}\right\rangle, c=8γ(CR0)2Lc=8\gamma(CR_{0})^{2}L and F=c2ln4Nβ6F=\frac{c^{2}\ln\frac{4N}{\beta}}{6} and get that for all b>0b>0

or, equivalently, with probability at least 12exp(b22F+\nicefrac2cb3)1-2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)

The choice of FF will be clarified further, let us now choose bb in such a way that 2exp(b22F+\nicefrac2cb3)=β2N2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)=\frac{\beta}{2N}. This implies that bb is the positive root of the quadratic equation

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

where the last inequality follows from c=8γ(CR0)2Lc=8\gamma(CR_{0})^{2}L and simple arithmetic.

Upper bound for ②. First of all, we notice that probability event ET1E_{T-1} implies

Upper bound for ③. We derive the upper bound for ③ using the same technique as for ①. First of all, we notice that the summands in ③ are conditionally independent:

Secondly, the summands are bounded with probability 11:

or, equivalently, with probability at least 12exp(b22F1+\nicefrac2c1b3)1-2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)

As in our derivations of the upper bound for ① we choose such bb that 2exp(b22F1+\nicefrac2c1b3)=β2N2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)=\frac{\beta}{2N}, i.e.

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

Upper bound for ④. The probability event ET1E_{T-1} implies

Upper bound for ⑤. Again, we use corollaries of probability event ET1E_{T-1}:

Now we summarize all bound that we have: probability event ET1E_{T-1} implies

Taking into account these inequalities and our assumptions on mm and γ\gamma (see (85) and (86)) we get that probability event ET1E¬E®E_{T-1}\cap E_{\text{\char 172}}\cap E_{\text{\char 174}} implies

That is, by definition of ETE_{T} and ET1E_{T-1} we have proved that

Since A=2γ(12γL)A=2\gamma\left(1-2\gamma L\right) and 1γL121-\gamma L\geq\frac{1}{2} we get that with probability at least 1β1-\beta

In other words, clipped-SGD achieves f(xˉN)f(x)εf(\bar{x}^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after O(LR02εlnLR02εβ)O\left(\frac{LR_{0}^{2}}{\varepsilon}\ln\frac{LR_{0}^{2}}{\varepsilon\beta}\right) iterations and requires

First of all, consider behavior of clipped-SGD during the first run in R-clipped-SGD. We notice that the proof of Theorem G.1 will be valid if we substitute R0R_{0} everywhere by its upper bound RR. From μ\mu-strong convexity of ff we have

therefore, one can choose R=2μ(f(x0)f(x))R=\sqrt{\frac{2}{\mu}\left(f(x^{0})-f(x^{*})\right)}. It implies that after N0N_{0} iterations of clipped-SGD we have

with probability at least 1βτ1-\frac{\beta}{\tau}, hence with the same probability f(xˉN0)f(x)12(f(x0)f(x))f(\bar{x}^{N_{0}})-f(x^{*})\leq\frac{1}{2}(f(x^{0})-f(x^{*})) since N0ln4N0τβ320C2Lμ\frac{N_{0}}{\ln\frac{4N_{0}\tau}{\beta}}\geq\frac{320C^{2}L}{\mu}. In other words, with probability at least 1βτ1-\frac{\beta}{\tau}

Then, by induction one can show that for arbitrary k{0,1,,τ1}k\in\{0,1,\ldots,\tau-1\} the inequality

holds with probability at least 1βτ1-\frac{\beta}{\tau}. Therefore, these inequalities hold simultaneously with probability at least 1β1-\beta. Using this we derive that inequality

holds with probability 1β\geq 1-\beta. That is, after τ=log2μR22ε\tau=\left\lceil\log_{2}\frac{\mu R^{2}}{2\varepsilon}\right\rceil restarts R-clipped-SGD generates such point x^τ\hat{x}^{\tau} that f(x^τ)f(x)εf(\hat{x}^{\tau})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta. Moreover, if N0ln4N0τβC1Lμ\frac{N_{0}}{\ln\frac{4N_{0}\tau}{\beta}}\leq\frac{C_{1}L}{\mu} with some numerical constant C1320C2C_{1}\geq 320C^{2}, then the total number of iterations of clipped-SGD equals

and the overall number of stochastic first-order oracle calls is

where in the last inequality we use 1γL121-\gamma L\geq\frac{1}{2}. Next, μ\mu-strong convexity of ff implies f(xk)222μ(f(xk)f(x))\|\nabla f(x^{k})\|_{2}^{2}\geq 2\mu(f(x^{k})-f(x^{*})) and

for all N0N\geq 0. Using notation rk=deff(xk)f(x)r_{k}\stackrel{{\scriptstyle\text{def}}}{{=}}f(x^{k})-f(x^{*}) we rewrite this inequality in the following form:

The rest of the proof is based on the refined analysis of inequality (115). In particular, via induction we prove that for all k=0,1,,Nk=0,1,\ldots,N with probability at least 1kβN1-\frac{k\beta}{N} the following statement holds: inequalities

hold for t=0,1,,T1t=0,1,\ldots,T-1. Since ff is LL-smooth, we have that probability event ET1E_{T-1} implies

Having inequalities (118) in hand we show in the rest of the proof that (116) holds for t=Tt=T with big enough probability. First of all, we introduce new random variables:

for l=0,1,T1l=0,1,\ldots T-1. Note that these random variables are bounded with probability 11, i.e. with probability 11 we have

Secondly, we use the introduced notation and get that ET1E_{T-1} implies

Finally, we do some preliminaries in order to apply Bernstein’s inequality (see Lemma D.1) and obtain that ET1E_{T-1} implies

It remains to provide tight upper bounds for ①, ②, ③, ④ and ⑤, i.e. in the remaining part of the proof we show that ¬+­+®+¯+°(1γμ)Tr0\text{\char 172}+\text{\char 173}+\text{\char 174}+\text{\char 175}+\text{\char 176}\leq(1-\gamma\mu)^{T}r_{0}.

Secondly, these summands are bounded with probability 11:

In other words, sequence {γ(1γμ)T1lθlu,ζl}l0\left\{\gamma(1-\gamma\mu)^{T-1-l}\left\langle\theta_{l}^{u},\zeta_{l}\right\rangle\right\}_{l\geq 0} is a bounded martingale difference sequence with bounded conditional variances {σl2}l0\{\sigma_{l}^{2}\}_{l\geq 0}. Therefore, we can apply Bernstein’s inequality, i.e. we apply Lemma D.1 with Xl=γ(1γμ)T1lθlu,ζlX_{l}=\gamma(1-\gamma\mu)^{T-1-l}\left\langle\theta_{l}^{u},\zeta_{l}\right\rangle, c=16γLr0(1γμ)T1c=16\gamma Lr_{0}(1-\gamma\mu)^{T-1} and F=c2ln4Nβ6F=\frac{c^{2}\ln\frac{4N}{\beta}}{6} and get that for all b>0b>0

or, equivalently, with probability at least 12exp(b22F+\nicefrac2cb3)1-2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)

The choice of FF will be clarified further, let us now choose bb in such a way that 2exp(b22F+\nicefrac2cb3)=β2N2\exp\left(-\frac{b^{2}}{2F+\nicefrac{{2cb}}{{3}}}\right)=\frac{\beta}{2N}. This implies that bb is the positive root of the quadratic equation

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

where the last inequality follows from c=16γLr0(1γμ)T1c=16\gamma Lr_{0}(1-\gamma\mu)^{T-1} and simple arithmetic.

Upper bound for ②. First of all, we notice that probability event ET1E_{T-1} implies

Upper bound for ③. We derive the upper bound for ③ using the same technique as for ①. First of all, we notice that the summands in ③ are conditionally independent:

Secondly, the summands are bounded with probability 11:

or, equivalently, with probability at least 12exp(b22F1+\nicefrac2c1b3)1-2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)

As in our derivations of the upper bound for ① we choose such bb that 2exp(b22F1+\nicefrac2c1b3)=β2N2\exp\left(-\frac{b^{2}}{2F_{1}+\nicefrac{{2c_{1}b}}{{3}}}\right)=\frac{\beta}{2N}, i.e.

That is, with probability at least 1β2N1-\frac{\beta}{2N}

Next, we notice that probability event ET1E_{T-1} implies that

Upper bound for ④. The probability event ET1E_{T-1} implies

Upper bound for ⑤. Again, we use corollaries of probability event ET1E_{T-1}:

Now we summarize all bounds that we have: probability event ET1E_{T-1} implies

Taking into account these inequalities and our assumptions on mkm_{k} and γ\gamma (see (96) and (97)) we get that probability event ET1E¬E®E_{T-1}\cap E_{\text{\char 172}}\cap E_{\text{\char 174}} implies

That is, by definition of ETE_{T} and ET1E_{T-1} we have proved that

As a result, we get that with probability at least 1β1-\beta

In other words, clipped-SGD achieves f(xN)f(x)εf(x^{N})-f(x^{*})\leq\varepsilon with probability at least 1β1-\beta after

iterations, where r0=f(x0)f(x)r_{0}=f(x^{0})-f(x^{*}) and requires

Appendix H Extra Experiments

In this section we provide a detailed description of experiments from Section 1.2 together with additional experiments. In these experiments we consider the following problem:

That is, for given kk the r.h.s. of the formula above depends only on the stepsize γ\gamma, initial suboptimality f(x0)f(x)f(x^{0})-f(x^{*}) and the variance σ\sigma.

We emphasize that the obtained bound and the convergence in expectation itself does not imply non-trivial upper bound for f(xk)f(x)f(x^{k})-f(x^{*}) with high-probability without additional assumptions on the distribution of random vector ξ\xi. In fact, the trajectory of SGD significantly depends on the distribution of ξ\xi. To illustrate this we consider 33 different distributions of ξ\xi with the same σ\sigma.

In the first case we consider ξ\xi from standard normal distribution, i.e. ξ\xi is a Gaussian random vector with zero mean and covariance matrix II. Clearly, in this situation σ2=n\sigma^{2}=n.

Next, we consider a random vector ξ\xi with i.i.d. components having Weibull distribution . The cumulative distribution function (CDF) for Weibull distribution with parameters c>0c>0 and α>0\alpha>0 is

There are explicit formulas for mean and variance for Weibull distribution:

where Γ\Gamma denotes the gamma function. Having these formulas one can easily shift and scale the distribution in order to get a random variable with zero mean and the variance equal 11.

Finally, we consider a random vector ξ\xi with i.i.d. components having Burr Type XII distribution having the following cumulative distribution function

where c>0c>0 and d>0d>0 are the positive parameters. There are explicit formulas for mean and variance for Burr distribution:

where the rr-th moment (if exists) is defined as follows :

For all experiments we considered the dimension n=100n=100, the stepsize γ=0.001\gamma=0.001 and for clipped-SGD we set λ=100\lambda=100. The result of 1010 independent runs of SGD and clipped-SGD are presented in Figures 6-10. These numerical tests show that for Weibull and Burr Type XII distributions SGD have significantly larger oscillations than for Gaussian distribution in all 1010 tests. In contrast, clipped-SGD behaves much more robust in all 33 cases during all 1010 runs without significant oscillations.

H.2 Additional Details and Experiments with Logistic Regression

In this section, we provide additional details of the experiments presented in Section 4 together with extra numerical results. In particular, we consider the logistic regression problem:

We notice that in all experiments that we did with logistic regression the initial suboptimality f(x0)f(x)f(x^{0})-f(x^{*}) was of order 1010. Moreover, as it was mentioned in the main part of the paper the parameters for the methods were tuned. One can find parameters that we used in the experiments from Section 4 in Table 4.

Next, we provide our numerical study of the distribution of fi(xk)f(xk)2\|\nabla f_{i}(x^{k})-\nabla f(x^{k})\|_{2}, where xkx^{k} is the last iterate produced by SGD in experiments presented in Section 4, see Figure 11.

As we mentioned in the main part of the paper these histograms are very similar to ones presented in Figure 2, so, the insights that we got from Figure 2 are right. However, in our experiments with australian dataset SGD with the stepsize γ=\nicefrac1L\gamma=\nicefrac{{1}}{{L}} did not reach needed suboptimality in order to oscillate.

Therefore, we run SGD along with its clipped variants with the same batchsize m=50m=50 for bigger number of epochs and also tuned their parameters. One can find the results of these runs in Figure 12.

We see that SGD with this stepsize achieves better suboptimality but it also oscillates significantly more. In contrast, clipped-SGD and d-clipped-SGD do not have significant oscillations and converge with the same rate as SGD. Moreover, clipped-SSTM shows slightly better performance in this case. Finally, we numerically studied the distribution of fi(xk)f(xk)2\|\nabla f_{i}(x^{k})-\nabla f(x^{k})\|_{2}, where xkx^{k} is the last iterate produced by SGD, see Figure 13.

These histograms imply that the noise in stochastic gradients is heavy-tailed and explain an unstable behavior of SGD in this case.

Finally, we conducted experiments on larger datasets: a9a and w8a. The results of our numerical test are reported on Figures 14 and 15. We notice that SSTM with given stepsize and batchsize suffers from noise accumulation, while clipped-SSTM does not have this drawback and shows comparable performance with SGD on a9a and much better performance on w8a.

Figure 15 shows the gradient’s noise distributions for both datasets. While the distribution of stochastic gradients at the optimum for a9a have sub-Gaussian-like distribution, for w8a they have heavy-tailed distribution.