On the convergence properties of a $K$-step averaging stochastic gradient descent algorithm for nonconvex optimization
Fan Zhou, Guojing Cong
Introduction
Parallel and distributed processing have been adopted for stochastic optimization to solve large-scale machine learning problems. Efficient parallelization is critical to accelerating long running deep-learning applications. Derived from stochastic gradient descent (SGD), parallel solvers such as synchronous SGD (e.g., Zinkevich et al. (2010) Dekel et al. (2012)) and Asynchronous SGD (ASGD) (e.g., see Dean et al. (2012); Recht et al. (2011)), have been proposed.
Beginning with the seminal paper Robbins and Monro (1951), the convergence properties of SGD and its variants have been extensively studied for the past 50 years (e.g. see Robbins and Siegmund (1971); Bottou (1998); Nemirovski et al. (2009); Shamir and Zhang (2013); Ghadimi and Lan (2013)). The asymptotic optimal convergence rate of SGD was proved by Chung (1954) and Sacks (1958) to be with twice continuously differentiable and strongly convex objectives. is the number of samples processed. The iteration complexity is for general convex (see Nemirovski et al. (2009)) and nonconvex (see Ghadimi and Lan (2013)) problems. Regarding parallel variants of SGD, Dekel et al. (2012) extend these results to the setting of synchronous SGD with learners and show that it has convergence rate of for non-convex objectives, with being the number of samples processed by each learner. Hogwild! is a lockfree implementation of ASGD, and Recht et al. (2011) prove its convergence for strongly convex problems with theoretical linear speedup over SGD. Downpour is another ASGD implementation with resilience against machine failures Dean et al. (2012). Lian et al. (2015) show that as long as the gradient staleness is bounded by the number of learners, ASGD converges for nonconvex problems. Due to its asynchronous nature that reduces communication cost, ASGD receives much attention in many recent studies.
Although ASGD has the same asymptotic convergence rate as SGD when the staleness of gradient update is bounded, the learning rate assumed for proving ASGD convergence are usually too small for practical purposes. It is also difficult for an ASGD implementation to control the staleness in gradient updates as it is influenced by the relative processing speed of learners and their positions in the communication network. Furthermore, the parameter server presents performance challenges on platforms with many GPUs. On such platforms, a single parameter server oftentimes does not serve the aggregation requests fast enough. A sharded server alleviates the aggregation bottleneck but introduces inconsistencies for parameters distributed on multiple shards. Communication between the parameter server (typically on CPUs) and the learners (on GPUs) is likely to remain a bottleneck in future systems.
We adopt a distributed, bulk-synchronous SGD algorithm that allows delayed gradient aggregation to effectively minimize the communication overhead. We call this algorithm K-step average SGD (K-AVG). Instead of using a parameter server, the learners in K-AVG compute the average of their copies of parameters at regular intervals through global reduction. Rather than relying on asynchrony that reduces communication overhead but has adverse impact on practical convergence, the communication interval is a parameter in K-AVG. The communication time is amortized among the data samples processed within each interval. On current and emerging computer platforms that support high bandwidth direct communication among GPUs (e.g., GPU-direct), global reduction does not involve CPUs and avoids multiple costly copies through the software layers. Similar averaging approaches have been proposed in the literature, see Hazan and Kale (2014); Johnson and Zhang (2013); Smith et al. (2016); Zhang et al. (2016); Loshchilov and Hutter (2016); Chen et al. (2016); Wang et al. (2017). However, their convergence behavior is not well understood analytically for nonconvex objectives, and it is unclear how they compare with ASGD approaches. This is the part where our major contribution goes to.
We study the convergence behavior of K-AVG and the impact of the number of processors on convergence. As the abundance of data is critical to the success of most machine learning tasks, training employs increasingly more learners. We show that K-AVG scales better than ASGD with and K-AVG allows larger stepsizes than ASGD for the same . We also analyze the impact of on convergence. Since with K-AVG becomes hard-sync SGD, and with larger K-AVG can stimulate averaging after an epoch or many epochs, our analysis can be applied to many existing variants of synchronous SGD. Finding the optimal length of delay for convergence is of high importance to practitioners. Contrary to popular belief, is oftentimes not and can be very large for many applications. Thus K-AVG is a good fit for large-scale distributed training as communication may not need to be very frequent for optimal convergence. Our analysis of convergence of K-AVG provides guidelines practitioners to explicitly balance the decrease of communication time and the increase of iterations through an appropriately chosen .
Using an image recognition benchmark, we demonstrate the nice convergence properties of K-AVG in comparison to two popular ASGD implementations: Downpour Dean et al. (2012) and EAMSGD Zhang et al. (2015). In EAMSGD, global gradient aggregation among learners simulates an elastic force that links the parameters they compute with a center variable stored by the parameter server. In both Downpour and EAMSGD, updates to the central parameter server can also have a K-step delay. On our target platform, when is small, K-AVG significantly reduces the communication time in comparison to Downpour and EAMSGD while achieving similar training and test accuracies. The training time reduction is up to 50%. When is large, K-AVG achieves much better training and test accuracies than Downpour and EAMSGD after the same amount of data samples are processed. For example, with 128 GPUs, K-AVG is up to about and - times faster than Downpour and EAMSGD respectively, and achieves significantly better accuracy.
This rest of the paper is organized as follows: In section 2, we introduce the standard assumptions in optimization theory needed to analyze SGD methods and frequently used notations throughout the paper; In section 3, we formally introduce the K-AVG algorithm, and prove its standard convergence results with fixed and diminishing stepsize. Based on the convergence result, we analyze the scalability of K-AVG and investigate the optimal choice of ; In section 4, we present our experimental results to validate our analysis.
Preliminaries and notations
, , or denotes the size of mini-batch for -th update;
, , or denotes the step size for -th update;
denotes the i.i.d. realizations of a random variable generated by the algorithm on different processors and in different iterations, especially, , , and .
We study the following optimization problem:
This assumption is essential to convergence analysis of our algorithm as well as most gradient based ones. Under such an assumption, the gradient of serves as a good indicator for how far to move to decrease .
The sequence of iterates is contained in an open set over which is bounded below by a scalar .
Assumption 2 requires that objective function to be bounded from below, which guarantees the problem we study is well defined.
For any fixed parameter , the stochastic gradient is an unbiased estimator of the true gradient corresponding to the parameter , namely,
One should notice that the unbiasedness assumption here can be replaced by a weaker version which is called the First Limit Assumption see Bottou et al. (2016) that can still be applied to our analysis. For simplicity, we just assume that the stochastic gradient is an unbiased estimator of the true one.
Assumption 4 characterizes the variance (second order moments) of the stochastic gradients.
Since all results in this paper are based on the four assumptions above, we present them with no further mention in the following literature.
Main results
We present the distributed K-AVG algorithm as follows:
Note that traditional parallel SGD algorithm is equivalent to K-AVG with , as synchronization is required after each local update. However, K-AVG relaxes this requirement and allows for individual updates before synchronization. Thus, K-AVG is a more general synchronous algorithm that contains parallel SGD. Surprisingly, as we show both analytically (section 3.4) and experimentally (section 4.2), more frequent synchronization does not always result in faster convergence for nonconvex objectives.
In the following theorem, we prove an upper bound on the expected average squared gradient norms, which serve as a metric to measure the convergence rate for nonconvex objectives.
(Nonconvex objective, fixed stepsize, and fixed batch size) Suppose that Algorithm 1 is run with a fixed stepsize , a fixed batch size satisfying
The proof of Theorem 3.1 can be found in section 6.1. An immediate observation from (3.2) is that the expected average squared gradient norms of converges to some nonzero constant as . The first term reflects the distance from initial weight to the solution. It eventually goes to zero as the number of iterations goes to infinity. The second term \frac{LK\bar{\gamma}M}{\bar{B}(K-1+\delta)}\Big{(}\frac{K}{P}+\frac{L(2K-1)(K-1)\bar{\gamma}}{6}\Big{)} is not affected by the iteration number. Compared with sequential SGD (see section 4.3 in Bottou et al. (2016)), this term is scaled by the batch size , and or , which means larger batch size and smaller stepsize, more learners or more frequent averaging tend to reduce this term. Mini-batch method as a variance reduction technique explains the appearance of . Parallelization of this algorithm contributes to the scaling factor . However, when is large enough to make dominates , the effect of parallelization is not ideal as one may expect.
The rates of convergence (also referred as iteration complexity) after step updates are established in the following corollary which originates from Theorem 3.1.
The proof of Corollary 3.1 can be found in section 6.3. Condition (3.3) and bound (3.4) imply when the number of updates is large enough, K-AVG eventually achieves a similar rate of convergence as classical SGD method for nonconvex objectives. Indeed, the rate of convergence of classical SGD methods is after samples processed. Note that updates in K-AVG means that samples have been processed. Taking a closer look at bound in (3.4), the right hand side is of the order . K-AVG loses a factor of as a result of communication saving. However, this doesn’t mean that the smaller the better. Since there is an extra multiplicative factor which is monotone decreasing with respect to . We will have a more detailed discussion on the choice of in section 3.4.
Deploying diminishing stepsizes and/or dynamic batch sizes makes the expected average squared gradient norms converge to zero for non-convex optimization. In the following theorem, we establish the convergence result under such conditions.
(Nonconvex objective, diminishing step size, and growing batch size) Suppose that Algorithm 1 is run with diminishing step size , and growing batch size satisfying
with some constant . Then the weighted average squared gradient norms satisfies
The proof of Theorem 3.2 can be found in section 6.2. As we can see, by adopting a diminishing sequence of stepsizes instead of a fixed one, the expected average squared gradient norms of K-AVG converges to instead of a nonzero constant.
2 K-AVG allows for larger stepsize than ASGD
Compared with the classical stepsize schedule for both sequential SGD (proposed by Robbins and Monro (1951)) and ASGD:
the stepsize schedule proposed in (3.7) turns out to allow larger choices of . On one hand, itself is a much more relaxed constrain compared with . On the other hand, as a byproduct of parallelization, when is large, also allows larger choice of . Intuitively, averaging acts as a variance reduction and leads to relaxation of larger stepsize constrain. In our experiments (section 4.1), larger stepsizes work well in K-AVG but can result in divergence in popular ASGD implementations.
3 Scalability comparison of K-AVG against ASGD
We analyze the bound on expected average squared gradient norms in (3.2) to show K-AVG algorithm scales better with than ASGD. We first establish the following theorem on the scalability of K-AVG.
Under the condition of Theorem 3.1, K-AVG scales better than ASGD.
In ASGD, one key parameter is the maximum staleness, generally assumed to be upper bounded by the number of processors, i.e. in K-AVG. With fixed stepsize, the expected average gradient norms is (see also Lian et al. (2015), Theorem 3) is
where and are constants independent of .
serves as a scaling factor in the denominator, which implies K-AVG scales better than asynchronization. ∎
The comparison of practical scalability between K-AVG and ASGD implementations is shown in section 4.1.
4 Optimal K𝐾K for convergence is not always 111
Unlike convex optimization problems where all learners converge to the same optimum, different learners may converge to different local optimums in nonconvex case. As a consequence, the frequency of averaging for nonconvex problems may be different from that of convex cases intuitively. Zhang et al. (2016) expressed the same concern, their experimental results showed that periodic averaging tends to improve the solution quality. Contrary to popular belief that more frequent averaging i.e. smaller speeds up convergence, we show that the optimal frequency for convergence is not always . We consider the case that the amount of samples processed is constant, which means that the computational time remains as a constant given a fixed number of processors. If every other parameter stays the same, larger means longer delay and fewer updates of global parameter . The following theorem discusses the impact and optimal choice of in K-AVG under such an assumption.
Let be a constant. Suppose that Algorithm 1 is run with a fixed stepsize , a fixed batch size satisfying
Then the optimal choice of is greater than .
Under the assumption , we can rewrite the bound (3.2) as
To minimize the right hand side of (3.10), it is equivalent to solve the following integer program
which can be very hard. Meanwhile, one should notice that depends on some unknown quantities such as , and . Instead, we investigate the monotonicity of . It is easy to check that \Big{(}\alpha+\beta K+\eta(2K-1)(K-1)\Big{)} is monotone increasing for all , and \Big{(}\frac{K}{K-1+\delta}\Big{)} is monotone decreasing for . Thus, there exists a unique . Then a sufficient condition for is that , which implies
The meaning of Theorem 3.4 is that it indicates when it comes to nonconvex optimization, more frequent averaging is not necessary. The sufficient condition (3.9) implies that larger value of \big{(}F(\widetilde{\mathbf{w}}_{1})-F^{*}\big{)} requires larger thus longer delay to decrease the bound in (3.10). The intuition is that if the initial weight is too far away from , then less frequent synchronizations can lead to faster convergence rate. Less frequent averaging implies higher variance in general. It is quite reasonable to think that if it is still far away from the solution, a stochastic gradient direction with larger variance may be preferred.
As we have already mentioned in the proof, optimal value depends on quantities such as , , and which are unknown to us in practice. Therefore, to obtain a concrete in practice is not so realistic. Note that when doesn’t necessarily mean that is very close to . In our experiments (see section 4), can be as large as or in some situation.
Experiments
We conduct experiments to validate our analysis on the scalability of K-AVG vs. ASGD implementations, the optimal delay in averaging ( optimal value of ), the convergence comparison with the sequential algorithm, i.e., SGD, and the comparison of the learning rates allowed with ASGD.
In our application gradient descent is implemented with Torch, and the communication is implemented using CUDA-aware openMPI 2.0 through the mpiT library. All implementations use the cuDNN library for forward propagation and backward propagation. Our experiments are done on a cluster of 32 Minsky nodes interconnected with Infiniband. Each node is an IBM S822LC system containing 2 Power8 CPUs with 10 cores each, and 4 NVIDIA Tesla P100 GPUs.
We experiment with the CIFAR-10 Krizhevsky and Hinton (2009) data set using the vgg and nin models. CIFAR-10 contains training images and test images, each associated with out of possible labels.
Theorem 3.3 shows that the convergence bound of K-AVG is not affected much by the scaling of but rather by . On the contrary, the convergence bound of ASGD increases linearly with . Thus we expect poorer convergence behavior of ASGD implementations at large in comparison with K-AVG.
Figures 2 and 2 compare the performance of K-AVG with two ASGD implementations, Downpour and EAMSGD, for two neural networks, vgg Simonyan and Zisserman (2014) and nin Lin et al. (2013), respectively. We use and learners, and show the test accuracies. All implementations use the same initial learning rate () and learning rate adaptation schedule (reduce by half after 50 epochs). The batch size is fixed at . We run for 600 epochs with for K-AVG.
In both figures, K-AVG always achieves better test accuracy than Downpour and EAMSGD. The test accuracies for Downpour and EAMSGD decreases as increases (the effect is more pronounced for vgg). When reaches 128, the accuracies of Downpour and EAMSGD both degrade to around 10%, i.e, random guess. The ASGD implementations do not converge with at .
When we set the initial learning rate , Downpour achieves around 80% and 87% test accuracies for vgg and nin, respectively; EAMSGD still does not converge.
K-AVG achieves better test accuracy than ASGD implementations, and in our experiment it is also faster. We measure wall-clock times for all implementations after 600 epochs. Naturally for K-AVG, impacts the ratio of communication vs. computation. We still use .
Figures 4 and 4 show the speedups of K-AVG over Downpour and EAMSGD, with vgg and nin, respectively. In Figure 4, when , the ASGD implementations are slightly faster than K-AVG. As increases, the speedup increases. When , the speedups are around 2.5 and 2.6 over Downpour and EAMSGD, respectively. Similar behavior is observed in Figure 4 for nin.
2 The optimal delay in averaging for K-AVG
For K-AVG, regulates its behavior. From the execution time perspective, larger results in fewer communications to process a given number of data samples. From the convergence perspective, smaller reduces the variances among learners. In the extreme case where , K-AVG is equivalent to synchronous parallelization of SGD. People tend to think that smaller results in faster convergence in terms of number of data samples processed. As discussed in Section 3.4, there are scenarios where is not 1.
We evaluate the convergence behavior of K-AVG with different values for vgg and nin. We experiment with , and . Figures 6 and 6 show the test accuracies achieved after 600 epochs for , and , for vgg and nin, respectively. Again we use the initial =1, and after every 50 epochs, is reduced by half. The batch size is fixed as .
In Figure 6, strikingly, none of the experiments show the optimal value of for K-AVG is 1. ranges from 32 (when ) farthest away from 1, to 2 (when ), closest to 1. In this set of experiments, as increases, tends to decrease. Also with smaller , K-AVG is more forgiving in terms of the choices of . For example, when , test accuracies for different are similar. With larger , however, choosing a that is too large has severe punishing consequences. For example, when , , and the test accuracy degrades rapidly with the increase of beyond 4.
In Figure 6, almost all experiments show =1. The exception is with , and the accuracy is slightly higher (by 0.27%) at than . Again we see for small the choices of is not critical, while for large the degradation in accuracy is rapid with the increase of beyond .
Since in our experiments the same hyper-parameters are used for vgg and nin, it is likely that the Lipschitz constant largely determines the differences in between the two cases.
3 Convergence comparison with SGD
We compare the performance of K-AVG against the sequential implementation, that is, SGD. We evaluate the test accuracies achieved and the wall clock time used.
Figure 8 shows the accuracy gap between K-AVG and SGD. With and learners, K-AVG is slightly worse than sequential SGD for vgg but better than nin. K-AVG and SGD achieve comparable accuracies with 32 learners. As the number of learners reaches 64 and 128, significant accuracy degradation, up to 8.8% is observed for vgg. Interestingly, the accuracy degradation for nin is still within 1.3% with 128 learners.
Figure 8 shows the epoch time speedup of K-AVG against SGD with 8, 16, 32, 64, and 128 learners. With 8 learners, the speedups for vgg and nin are 4.9, and 4.8, respectively. With 128 learners, the speedups for vgg and nin are 56.3 and 58.3, respectively. Since twice as many epochs are run for K-AVG in comparison to SGD, the wallclock time speedups are half of these numbers. This means linear speedup is achieved.
Conclusion
In this paper, we adopt and analyze K-AVG for solving large scale machine learning problems with nonconvex objectives. We establish the convergence results of K-AVG under both fixed and diminishing stepsize, and show that with a properly chosen sequence of stepsizes, K-AVG achieves similar convergence rate consistent with its sequential counterpart. We show that K-AVG scales better than ASGD with properly chosen when is large. K-AVG allows larger stepsizes that still guarantees convergence while ASGD may fail to converge. Contrary to popular belief, we show that the length of delay to average learning parameters among parallel learners is not necessarily to be . Although, a proper choice of remains unknown, we analytically explain the dependence of on other parameters which we hope can serve as a users’ guide in practical implementations.
Proofs
We denote as the -th global update in K-AVG, and denote as -th local update on processor . The -th global average can be written as
according to our algorithm, the random variables are i.i.d. for all , , and .
Under the assumption that a constant stepsize is implemented within each inner parallel step, i.e. , for , one can immediately simplify the above inequality as
The goal here is to investigate the expectation of over all random variables . Under the unbiased estimation Assumption 3, by taking the overall expectation we can immediately get
We can therefore get rid of the summation over as well in (6.4). By taking the overall expectation on both sides of (6.4) and (6.5), we have
We are going to bound (6.6) and (6.7) respectively. Obviously (we treat as a constant vector at this moment),
where we used the fact that , for for the last term and the assumption that gradient is Lipschitz continuous. Note that
where the first inequality is due to Cauchy-Schwartz inequality, the last equity is due to Assumption 3 and the last inequality is due to Assumption 4. We plug the above inequality back into (6.8) and get
On the other hand, to bound (6.7), we can apply the similar analysis,
Combine the results in (6.9) and (6.10), we have
Under the condition , the second term on the right hand side can be discarded. This condition also implies that
Then, together with the condition for some , we have
If we allow both batch size and step size to change after each averaging step, by taking the summation we have
Combining both, we can immediately get the following bound
If we employ a constant step size and batch size, we get the bound on the expected average squared gradient norms of as following:
2 Proof of Theorem 3.2
If we use a diminishing step size and growing batch size , Thus, from (6.12), we divide both sides with , we have
If the following restriction of step size is satisfied,
3 Proof of Corollary 3.1
At first, we assume that . Then we can rewrite the bound (3.2) as
By taking , one immediately get
By plugging in the value of , implies that
At last, we need to check condition (3.1) is hold. It is sufficient to set
This completes the proof of Corollary 3.1. ∎