On the convergence properties of a $K$-step averaging stochastic gradient descent algorithm for nonconvex optimization

Fan Zhou, Guojing Cong

Introduction

Parallel and distributed processing have been adopted for stochastic optimization to solve large-scale machine learning problems. Efficient parallelization is critical to accelerating long running deep-learning applications. Derived from stochastic gradient descent (SGD), parallel solvers such as synchronous SGD (e.g., Zinkevich et al. (2010) Dekel et al. (2012)) and Asynchronous SGD (ASGD) (e.g., see Dean et al. (2012); Recht et al. (2011)), have been proposed.

Beginning with the seminal paper Robbins and Monro (1951), the convergence properties of SGD and its variants have been extensively studied for the past 50 years (e.g. see Robbins and Siegmund (1971); Bottou (1998); Nemirovski et al. (2009); Shamir and Zhang (2013); Ghadimi and Lan (2013)). The asymptotic optimal convergence rate of SGD was proved by Chung (1954) and Sacks (1958) to be O(1/N)O(1/N) with twice continuously differentiable and strongly convex objectives. NN is the number of samples processed. The iteration complexity is O(1/N)O(1/\sqrt{N}) for general convex (see Nemirovski et al. (2009)) and nonconvex (see Ghadimi and Lan (2013)) problems. Regarding parallel variants of SGD, Dekel et al. (2012) extend these results to the setting of synchronous SGD with PP learners and show that it has convergence rate of O(1/NP)O(1/\sqrt{NP}) for non-convex objectives, with NN being the number of samples processed by each learner. Hogwild! is a lockfree implementation of ASGD, and Recht et al. (2011) prove its convergence for strongly convex problems with theoretical linear speedup over SGD. Downpour is another ASGD implementation with resilience against machine failures Dean et al. (2012). Lian et al. (2015) show that as long as the gradient staleness is bounded by the number of learners, ASGD converges for nonconvex problems. Due to its asynchronous nature that reduces communication cost, ASGD receives much attention in many recent studies.

Although ASGD has the same asymptotic convergence rate as SGD when the staleness of gradient update is bounded, the learning rate assumed for proving ASGD convergence are usually too small for practical purposes. It is also difficult for an ASGD implementation to control the staleness in gradient updates as it is influenced by the relative processing speed of learners and their positions in the communication network. Furthermore, the parameter server presents performance challenges on platforms with many GPUs. On such platforms, a single parameter server oftentimes does not serve the aggregation requests fast enough. A sharded server alleviates the aggregation bottleneck but introduces inconsistencies for parameters distributed on multiple shards. Communication between the parameter server (typically on CPUs) and the learners (on GPUs) is likely to remain a bottleneck in future systems.

We adopt a distributed, bulk-synchronous SGD algorithm that allows delayed gradient aggregation to effectively minimize the communication overhead. We call this algorithm K-step average SGD (K-AVG). Instead of using a parameter server, the learners in K-AVG compute the average of their copies of parameters at regular intervals through global reduction. Rather than relying on asynchrony that reduces communication overhead but has adverse impact on practical convergence, the communication interval KK is a parameter in K-AVG. The communication time is amortized among the data samples processed within each interval. On current and emerging computer platforms that support high bandwidth direct communication among GPUs (e.g., GPU-direct), global reduction does not involve CPUs and avoids multiple costly copies through the software layers. Similar averaging approaches have been proposed in the literature, see Hazan and Kale (2014); Johnson and Zhang (2013); Smith et al. (2016); Zhang et al. (2016); Loshchilov and Hutter (2016); Chen et al. (2016); Wang et al. (2017). However, their convergence behavior is not well understood analytically for nonconvex objectives, and it is unclear how they compare with ASGD approaches. This is the part where our major contribution goes to.

We study the convergence behavior of K-AVG and the impact of the number of processors PP on convergence. As the abundance of data is critical to the success of most machine learning tasks, training employs increasingly more learners. We show that K-AVG scales better than ASGD with PP and K-AVG allows larger stepsizes than ASGD for the same PP. We also analyze the impact of KK on convergence. Since with K=1K=1 K-AVG becomes hard-sync SGD, and with larger KK K-AVG can stimulate averaging after an epoch or many epochs, our analysis can be applied to many existing variants of synchronous SGD. Finding the optimal length of delay KoptK_{opt} for convergence is of high importance to practitioners. Contrary to popular belief, KoptK_{opt} is oftentimes not 11 and can be very large for many applications. Thus K-AVG is a good fit for large-scale distributed training as communication may not need to be very frequent for optimal convergence. Our analysis of convergence of K-AVG provides guidelines practitioners to explicitly balance the decrease of communication time and the increase of iterations through an appropriately chosen KK.

Using an image recognition benchmark, we demonstrate the nice convergence properties of K-AVG in comparison to two popular ASGD implementations: Downpour Dean et al. (2012) and EAMSGD Zhang et al. (2015). In EAMSGD, global gradient aggregation among learners simulates an elastic force that links the parameters they compute with a center variable stored by the parameter server. In both Downpour and EAMSGD, updates to the central parameter server can also have a K-step delay. On our target platform, when KK is small, K-AVG significantly reduces the communication time in comparison to Downpour and EAMSGD while achieving similar training and test accuracies. The training time reduction is up to 50%. When KK is large, K-AVG achieves much better training and test accuracies than Downpour and EAMSGD after the same amount of data samples are processed. For example, with 128 GPUs, K-AVG is up to about 77 and 22-66 times faster than Downpour and EAMSGD respectively, and achieves significantly better accuracy.

This rest of the paper is organized as follows: In section 2, we introduce the standard assumptions in optimization theory needed to analyze SGD methods and frequently used notations throughout the paper; In section 3, we formally introduce the K-AVG algorithm, and prove its standard convergence results with fixed and diminishing stepsize. Based on the convergence result, we analyze the scalability of K-AVG and investigate the optimal choice of KK; In section 4, we present our experimental results to validate our analysis.

Preliminaries and notations

BnB_{n}, Bˉ\bar{B}, or BB denotes the size of mini-batch for nn-th update;

γn\gamma_{n}, γˉ\bar{\gamma}, or γ\gamma denotes the step size for nn-th update;

ξk,sj\xi^{j}_{k,s} denotes the i.i.d. realizations of a random variable ξ\xi generated by the algorithm on different processors and in different iterations, especially, j=1,...,Nj=1,...,N, k=1,...,Kk=1,...,K, and s=1,...,Bs=1,...,B.

We study the following optimization problem:

This assumption is essential to convergence analysis of our algorithm as well as most gradient based ones. Under such an assumption, the gradient of FF serves as a good indicator for how far to move to decrease FF.

The sequence of iterates {wj}\{\mathbf{w}_{j}\} is contained in an open set over which FF is bounded below by a scalar FF^{*}.

Assumption 2 requires that objective function to be bounded from below, which guarantees the problem we study is well defined.

For any fixed parameter w\mathbf{w}, the stochastic gradient F(w;ξ)\nabla F(\mathbf{w};\xi) is an unbiased estimator of the true gradient corresponding to the parameter w\mathbf{w}, namely,

One should notice that the unbiasedness assumption here can be replaced by a weaker version which is called the First Limit Assumption see Bottou et al. (2016) that can still be applied to our analysis. For simplicity, we just assume that the stochastic gradient is an unbiased estimator of the true one.

Assumption 4 characterizes the variance (second order moments) of the stochastic gradients.

Since all results in this paper are based on the four assumptions above, we present them with no further mention in the following literature.

Main results

We present the distributed K-AVG algorithm as follows:

Note that traditional parallel SGD algorithm is equivalent to K-AVG with K=1K=1, as synchronization is required after each local update. However, K-AVG relaxes this requirement and allows for KK individual updates before synchronization. Thus, K-AVG is a more general synchronous algorithm that contains parallel SGD. Surprisingly, as we show both analytically (section 3.4) and experimentally (section 4.2), more frequent synchronization does not always result in faster convergence for nonconvex objectives.

In the following theorem, we prove an upper bound on the expected average squared gradient norms, which serve as a metric to measure the convergence rate for nonconvex objectives.

(Nonconvex objective, fixed stepsize, and fixed batch size) Suppose that Algorithm 1 is run with a fixed stepsize γn=γˉ\gamma_{n}=\bar{\gamma}, a fixed batch size Bn=BˉB_{n}=\bar{B} satisfying

The proof of Theorem 3.1 can be found in section 6.1. An immediate observation from (3.2) is that the expected average squared gradient norms of FF converges to some nonzero constant as NN\rightarrow\infty. The first term 2(F(w~1)F)N(K1+δ)γˉ\frac{2(F(\widetilde{\mathbf{w}}_{1})-F^{*})}{N(K-1+\delta)\bar{\gamma}} reflects the distance from initial weight to the solution. It eventually goes to zero as the number of iterations goes to infinity. The second term \frac{LK\bar{\gamma}M}{\bar{B}(K-1+\delta)}\Big{(}\frac{K}{P}+\frac{L(2K-1)(K-1)\bar{\gamma}}{6}\Big{)} is not affected by the iteration number. Compared with sequential SGD (see section 4.3 in Bottou et al. (2016)), this term is scaled by the batch size 1/Bˉ1/\bar{B}, and 1/P1/P or L(2K1)(K1)γˉ6\frac{L(2K-1)(K-1)\bar{\gamma}}{6}, which means larger batch size and smaller stepsize, more learners or more frequent averaging tend to reduce this term. Mini-batch method as a variance reduction technique explains the appearance of Bˉ\bar{B}. Parallelization of this algorithm contributes to the scaling factor 1/P1/P. However, when PP is large enough to make L(2K1)(K1)γˉ6\frac{L(2K-1)(K-1)\bar{\gamma}}{6} dominates K/PK/P, the effect of parallelization is not ideal as one may expect.

The rates of convergence (also referred as iteration complexity) after NN step updates are established in the following corollary which originates from Theorem 3.1.

The proof of Corollary 3.1 can be found in section 6.3. Condition (3.3) and bound (3.4) imply when the number of updates NN is large enough, K-AVG eventually achieves a similar rate of convergence as classical SGD method for nonconvex objectives. Indeed, the rate of convergence of classical SGD methods is N1/2N^{-1/2} after NN samples processed. Note that NN updates in K-AVG means that NKBPN*K*B*P samples have been processed. Taking a closer look at bound in (3.4), the right hand side is of the order O((NBP)1/2)O((N*B*P)^{-1/2}). K-AVG loses a factor of 1/K1/\sqrt{K} as a result of communication saving. However, this doesn’t mean that the smaller KK the better. Since there is an extra multiplicative factor 4KK1+δ\frac{4K}{K-1+\delta} which is monotone decreasing with respect to KK. We will have a more detailed discussion on the choice of KK in section 3.4.

Deploying diminishing stepsizes and/or dynamic batch sizes makes the expected average squared gradient norms converge to zero for non-convex optimization. In the following theorem, we establish the convergence result under such conditions.

(Nonconvex objective, diminishing step size, and growing batch size) Suppose that Algorithm 1 is run with diminishing step size γj\gamma_{j}, and growing batch size BjB_{j} satisfying

with some constant 0<δ<10<\delta<1. Then the weighted average squared gradient norms satisfies

The proof of Theorem 3.2 can be found in section 6.2. As we can see, by adopting a diminishing sequence of stepsizes instead of a fixed one, the expected average squared gradient norms of K-AVG converges to instead of a nonzero constant.

2 K-AVG allows for larger stepsize than ASGD

Compared with the classical stepsize schedule for both sequential SGD (proposed by Robbins and Monro (1951)) and ASGD:

the stepsize schedule proposed in (3.7) turns out to allow larger choices of γj\gamma_{j}. On one hand, j=1γj3<\sum_{j=1}^{\infty}\gamma^{3}_{j}<\infty itself is a much more relaxed constrain compared with j=1γj2<\sum_{j=1}^{\infty}\gamma^{2}_{j}<\infty. On the other hand, as a byproduct of parallelization, when PP is large, j=1γj2/BjP\sum_{j=1}^{\infty}\gamma^{2}_{j}/B_{j}P also allows larger choice of γj\gamma_{j}. Intuitively, averaging acts as a variance reduction and leads to relaxation of larger stepsize constrain. In our experiments (section 4.1), larger stepsizes work well in K-AVG but can result in divergence in popular ASGD implementations.

3 Scalability comparison of K-AVG against ASGD

We analyze the bound on expected average squared gradient norms in (3.2) to show K-AVG algorithm scales better with PP than ASGD. We first establish the following theorem on the scalability of K-AVG.

Under the condition of Theorem 3.1, K-AVG scales better than ASGD.

In ASGD, one key parameter is the maximum staleness, generally assumed to be upper bounded by the number of processors, i.e. PP in K-AVG. With fixed stepsize, the expected average gradient norms is (see also Lian et al. (2015), Theorem 3) is

where C0C_{0} and C1C_{1} are constants independent of PP.

PP serves as a scaling factor in the denominator, which implies K-AVG scales better than asynchronization. ∎

The comparison of practical scalability between K-AVG and ASGD implementations is shown in section 4.1.

4 Optimal K𝐾K for convergence is not always 111

Unlike convex optimization problems where all learners converge to the same optimum, different learners may converge to different local optimums in nonconvex case. As a consequence, the frequency of averaging for nonconvex problems may be different from that of convex cases intuitively. Zhang et al. (2016) expressed the same concern, their experimental results showed that periodic averaging tends to improve the solution quality. Contrary to popular belief that more frequent averaging i.e. smaller KK speeds up convergence, we show that the optimal frequency KK for convergence is not always 11. We consider the case that the amount of samples processed NKN*K is constant, which means that the computational time remains as a constant given a fixed number of processors. If every other parameter stays the same, larger KK means longer delay and fewer updates of global parameter w~n\widetilde{\mathbf{w}}_{n}. The following theorem discusses the impact and optimal choice of KK in K-AVG under such an assumption.

Let S=NKS=N*K be a constant. Suppose that Algorithm 1 is run with a fixed stepsize γn=γˉ\gamma_{n}=\bar{\gamma}, a fixed batch size Bn=BˉB_{n}=\bar{B} satisfying

Then the optimal choice of KK is greater than 11.

Under the assumption S=NKS=N*K, we can rewrite the bound (3.2) as

To minimize the right hand side of (3.10), it is equivalent to solve the following integer program

which can be very hard. Meanwhile, one should notice that KK^{*} depends on some unknown quantities such as LL, MM and (F(w~1)F)(F(\widetilde{\mathbf{w}}_{1})-F^{*}). Instead, we investigate the monotonicity of B(K)B(K). It is easy to check that \Big{(}\alpha+\beta K+\eta(2K-1)(K-1)\Big{)} is monotone increasing for all K1K\geq 1, and \Big{(}\frac{K}{K-1+\delta}\Big{)} is monotone decreasing for K1K\geq 1. Thus, there exists a unique KK^{*}. Then a sufficient condition for K>1K^{*}>1 is that B(2)<B(1)B(2)<B(1), which implies

The meaning of Theorem 3.4 is that it indicates when it comes to nonconvex optimization, more frequent averaging is not necessary. The sufficient condition (3.9) implies that larger value of \big{(}F(\widetilde{\mathbf{w}}_{1})-F^{*}\big{)} requires larger KK thus longer delay to decrease the bound in (3.10). The intuition is that if the initial weight is too far away from FF^{*}, then less frequent synchronizations can lead to faster convergence rate. Less frequent averaging implies higher variance in general. It is quite reasonable to think that if it is still far away from the solution, a stochastic gradient direction with larger variance may be preferred.

As we have already mentioned in the proof, optimal value KK^{*} depends on quantities such as LL, MM, and (F(w~1)F)(F(\widetilde{\mathbf{w}}_{1})-F^{*}) which are unknown to us in practice. Therefore, to obtain a concrete KK^{*} in practice is not so realistic. Note that when K>1K^{*}>1 doesn’t necessarily mean that KK^{*} is very close to 11. In our experiments (see section 4), KK^{*} can be as large as 1616 or 3232 in some situation.

Experiments

We conduct experiments to validate our analysis on the scalability of K-AVG vs. ASGD implementations, the optimal delay in averaging ( optimal value of KK), the convergence comparison with the sequential algorithm, i.e., SGD, and the comparison of the learning rates allowed with ASGD.

In our application gradient descent is implemented with Torch, and the communication is implemented using CUDA-aware openMPI 2.0 through the mpiT library. All implementations use the cuDNN library for forward propagation and backward propagation. Our experiments are done on a cluster of 32 Minsky nodes interconnected with Infiniband. Each node is an IBM S822LC system containing 2 Power8 CPUs with 10 cores each, and 4 NVIDIA Tesla P100 GPUs.

We experiment with the CIFAR-10 Krizhevsky and Hinton (2009) data set using the vgg and nin models. CIFAR-10 contains 50,00050,000 training images and 10,00010,000 test images, each associated with 11 out of 1010 possible labels.

Theorem 3.3 shows that the convergence bound of K-AVG is not affected much by the scaling of PP but rather by KK. On the contrary, the convergence bound of ASGD increases linearly with PP. Thus we expect poorer convergence behavior of ASGD implementations at large PP in comparison with K-AVG.

Figures 2 and 2 compare the performance of K-AVG with two ASGD implementations, Downpour and EAMSGD, for two neural networks, vgg Simonyan and Zisserman (2014) and nin Lin et al. (2013), respectively. We use P=8,16,32,64P=8,16,32,64 and 128128 learners, and show the test accuracies. All implementations use the same initial learning rate (γ0=1\gamma_{0}=1) and learning rate adaptation schedule (reduce γ\gamma by half after 50 epochs). The batch size is fixed at Bˉ=16\bar{B}=16. We run for 600 epochs with K=16K=16 for K-AVG.

In both figures, K-AVG always achieves better test accuracy than Downpour and EAMSGD. The test accuracies for Downpour and EAMSGD decreases as PP increases (the effect is more pronounced for vgg). When PP reaches 128, the accuracies of Downpour and EAMSGD both degrade to around 10%, i.e, random guess. The ASGD implementations do not converge with γ0=1\gamma_{0}=1 at P=128P=128.

When we set the initial learning rate γ0=0.1\gamma_{0}=0.1, Downpour achieves around 80% and 87% test accuracies for vgg and nin, respectively; EAMSGD still does not converge.

K-AVG achieves better test accuracy than ASGD implementations, and in our experiment it is also faster. We measure wall-clock times for all implementations after 600 epochs. Naturally for K-AVG, KK impacts the ratio of communication vs. computation. We still use K=16K=16.

Figures 4 and 4 show the speedups of K-AVG over Downpour and EAMSGD, with vgg and nin, respectively. In Figure 4, when P=8P=8, the ASGD implementations are slightly faster than K-AVG. As PP increases, the speedup increases. When P=128P=128, the speedups are around 2.5 and 2.6 over Downpour and EAMSGD, respectively. Similar behavior is observed in Figure 4 for nin.

2 The optimal delay in averaging for K-AVG

For K-AVG, KK regulates its behavior. From the execution time perspective, larger KK results in fewer communications to process a given number of data samples. From the convergence perspective, smaller KK reduces the variances among learners. In the extreme case where K=1K=1, K-AVG is equivalent to synchronous parallelization of SGD. People tend to think that smaller KK results in faster convergence in terms of number of data samples processed. As discussed in Section 3.4, there are scenarios where KoptK_{opt} is not 1.

We evaluate the convergence behavior of K-AVG with different KK values for vgg and nin. We experiment with K=1,2,4,16,32K=1,2,4,16,32, and 6464. Figures 6 and 6 show the test accuracies achieved after 600 epochs for P=8,16,32,64P=8,16,32,64, and 128128, for vgg and nin, respectively. Again we use the initial γ0\gamma_{0}=1, and after every 50 epochs, γ\gamma is reduced by half. The batch size is fixed as Bˉ=32\bar{B}=32.

In Figure 6, strikingly, none of the experiments show the optimal value of KK for K-AVG is 1. KoptK_{opt} ranges from 32 (when P=8P=8) farthest away from 1, to 2 (when P=64P=64), closest to 1. In this set of experiments, as PP increases, KoptK_{opt} tends to decrease. Also with smaller PP, K-AVG is more forgiving in terms of the choices of KK. For example, when P=8P=8, test accuracies for different KK are similar. With larger PP, however, choosing a KK that is too large has severe punishing consequences. For example, when P=128P=128, Kopt=4K_{opt}=4, and the test accuracy degrades rapidly with the increase of KK beyond 4.

In Figure 6, almost all experiments show KoptK_{opt}=1. The exception is with p=8p=8, and the accuracy is slightly higher (by 0.27%) at K=8K=8 than K=1K=1. Again we see for small pp the choices of KK is not critical, while for large pp the degradation in accuracy is rapid with the increase of KK beyond KoptK_{opt}.

Since in our experiments the same hyper-parameters are used for vgg and nin, it is likely that the Lipschitz constant LL largely determines the differences in KoptK_{opt} between the two cases.

3 Convergence comparison with SGD

We compare the performance of K-AVG against the sequential implementation, that is, SGD. We evaluate the test accuracies achieved and the wall clock time used.

Figure 8 shows the accuracy gap between K-AVG and SGD. With 88 and 1616 learners, K-AVG is slightly worse than sequential SGD for vgg but better than nin. K-AVG and SGD achieve comparable accuracies with 32 learners. As the number of learners reaches 64 and 128, significant accuracy degradation, up to 8.8% is observed for vgg. Interestingly, the accuracy degradation for nin is still within 1.3% with 128 learners.

Figure 8 shows the epoch time speedup of K-AVG against SGD with 8, 16, 32, 64, and 128 learners. With 8 learners, the speedups for vgg and nin are 4.9, and 4.8, respectively. With 128 learners, the speedups for vgg and nin are 56.3 and 58.3, respectively. Since twice as many epochs are run for K-AVG in comparison to SGD, the wallclock time speedups are half of these numbers. This means linear speedup is achieved.

Conclusion

In this paper, we adopt and analyze K-AVG for solving large scale machine learning problems with nonconvex objectives. We establish the convergence results of K-AVG under both fixed and diminishing stepsize, and show that with a properly chosen sequence of stepsizes, K-AVG achieves similar convergence rate consistent with its sequential counterpart. We show that K-AVG scales better than ASGD with properly chosen KK when PP is large. K-AVG allows larger stepsizes that still guarantees convergence while ASGD may fail to converge. Contrary to popular belief, we show that the length of delay to average learning parameters among parallel learners is not necessarily to be 11. Although, a proper choice of KoptK_{opt} remains unknown, we analytically explain the dependence of KoptK_{opt} on other parameters which we hope can serve as a users’ guide in practical implementations.

Proofs

We denote w~α\widetilde{\mathbf{w}}_{\alpha} as the α\alpha-th global update in K-AVG, and denote wα+tj\mathbf{w}^{j}_{\alpha+t} as tt-th local update on processor jj. The (α+1)(\alpha+1)-th global average can be written as

according to our algorithm, the random variables ξt,sj\xi_{t,s}^{j} are i.i.d. for all t=0,...,K1t=0,...,K-1, s=1,...,Bs=1,...,B, and j=1,...,Pj=1,...,P.

Under the assumption that a constant stepsize is implemented within each inner parallel step, i.e. γt=γ\gamma_{t}=\gamma, for t=0,...,K1t=0,...,K-1, one can immediately simplify the above inequality as

The goal here is to investigate the expectation of F(w~α+1)F(w~α)F(\widetilde{\mathbf{w}}_{\alpha+1})-F(\widetilde{\mathbf{w}}_{\alpha}) over all random variables ξt,sj\xi_{t,s}^{j}. Under the unbiased estimation Assumption 3, by taking the overall expectation we can immediately get

We can therefore get rid of the summation over jj as well in (6.4). By taking the overall expectation on both sides of (6.4) and (6.5), we have

We are going to bound (6.6) and (6.7) respectively. Obviously (we treat w~α\widetilde{\mathbf{w}}_{\alpha} as a constant vector at this moment),

where we used the fact that w~αj=w~α\widetilde{\mathbf{w}}_{\alpha}^{j}=\widetilde{\mathbf{w}}_{\alpha}, for j=1,...,Pj=1,...,P for the last term and the assumption that gradient F\nabla F is Lipschitz continuous. Note that

where the first inequality is due to Cauchy-Schwartz inequality, the last equity is due to Assumption 3 and the last inequality is due to Assumption 4. We plug the above inequality back into (6.8) and get

On the other hand, to bound (6.7), we can apply the similar analysis,

Combine the results in (6.9) and (6.10), we have

Under the condition 1L2γ2(K+1)(K2)2+LγK1\geq\frac{L^{2}\gamma^{2}(K+1)(K-2)}{2}+L\gamma K, the second term on the right hand side can be discarded. This condition also implies that

Then, together with the condition 1δL2γ21-\delta\geq L^{2}\gamma^{2} for some 0<δ<10<\delta<1, we have

If we allow both batch size and step size to change after each averaging step, by taking the summation we have

Combining both, we can immediately get the following bound

If we employ a constant step size and batch size, we get the bound on the expected average squared gradient norms of FF as following:

2 Proof of Theorem 3.2

If we use a diminishing step size γj\gamma_{j} and growing batch size BjB_{j}, Thus, from (6.12), we divide both sides with j=1Nγj\sum_{j=1}^{N}\gamma_{j}, we have

If the following restriction of step size is satisfied,

3 Proof of Corollary 3.1

At first, we assume that K/P>L(2K1)(K1)γˉ/6K/P>L(2K-1)(K-1)\bar{\gamma}/6. Then we can rewrite the bound (3.2) as

By taking f=0f^{\prime}=0, one immediately get

By plugging in the value of γˉ1\bar{\gamma}_{1}, K/P>L(2K1)(K1)γˉ/6K/P>L(2K-1)(K-1)\bar{\gamma}/6 implies that

At last, we need to check condition (3.1) is hold. It is sufficient to set

This completes the proof of Corollary 3.1. ∎

References