Local SGD Converges Fast and Communicates Little
Sebastian U. Stich
Introduction
Stochastic Gradient Descent (SGD) consists of iterations of the form
In this work we follow an orthogonal approach, still with the goal to increase the compute to communication ratio: Instead of increasing the mini-batch size, we reduce the communication frequency. Rather than keeping the sequences on different machines in sync, we allow them to evolve locally on each machine, independent from each other, and only average the sequences once in a while (local SGD). Such strategies have been explored widely in the literature, under various names.
An extreme instance of this concept is one-shot SGD where the local sequences are only exchanged once, after the local runs have converged. Zhang et al. show statistical convergence (see also ), but the analysis restricts the algorithm to at most one pass over the data, which is in general not enough for the training error to converge. More practical are schemes that perform more frequent averaging of the parallel sequences, as e.g. for perceptron training (iterative parameter mixing), see also , for the training of deep neural networks (model averaging) or in federated learning .
The question of how often communication rounds need to be initiated has eluded a concise theoretical answer so far. Whilst there is practical evidence, the theory does not even resolve the question whether averaging helps when optimizing convex functions. Concretely, whether running local SGD on workers is times faster than running just a single instance of SGD on one worker.On convex functions, the average of the local solutions can of course only decrease the objective value, but convexity does not imply that the averaged point is times better.
We fill this gap in the literature and provide a concise convergence analysis of local SGD. We show that averaging helps. Frequent synchronization of local sequences increases the convergence rate by a factor of , i.e. a linear speedup can be attained. Thus, local SGD is as efficient as parallel mini-batch SGD in terms of computation, but the communication cost can be drastically reduced.
Our proof is simple and straightforward, and we imagine that—with slight modifications of the proof—the technique can also be used to analyze other variants of SGD that evolve sequences on different worker that are not perfectly synchronized. Although we do not yet provide convergence guarantees for the non-convex setting, we feel that the positive results presented here will spark further investigation of local SGD for this important application (see e.g. ).
2 Related Work
A parallel line of work reduces the communication cost by compressing the stochastic gradients before communication. For instance, by limiting the number of bits in the floating point representation , or random quantization . The ZipML framework applies this technique also to the data . Sparsification methods reduce the number of non-zero entries in the stochastic gradient . A very aggressive—and promising—sparsification method is to keep only very few coordinates of the stochastic gradient by considering only the coordinates with the largest magnitudes .
Allowing asynchronous updates provides an alternative solution to disguise the communication overhead to a certain amount , though alternative strategies might be better when high accuracy is desired . The analysis of Agarwal & Duchi shows that asynchronous SGD on convex functions can tolerated delays up to , which is identical to the maximal length of the local sequences in local SGD. Asynchronous SGD converges also for larger delays (see also ) but without linear speedup, a similar statement holds for local SGD (see discussion in Section 3). The current frameworks for the analysis of asynchronous SGD do not cover local SGD. A fundamental difference is that asynchronous SGD maintains a (almost) synchronized sequence and gradients are computed with respect this unique sequence (but just applied with delays), whereas each worker in local SGD evolves a different sequence and computes gradient with respect those iterates.
For the training of deep neural networks, Bijral et al. discuss a stochastic averaging schedule whereas Zhang et al. study local SGD with more frequent communication at the beginning of the optimization process. The elastic averaging technique is different to local SGD, as it uses the average of the iterates only to guide the local sequences but does not perform a hard reset after averaging. Among the first theoretical studies of local SGD in the non-convex setting are that did not establish a speedup, in contrast to two more recent analyses . Yu et al. show linear speedup of local SGD on non-convex functions for , which is more restrictive than the constraint on in the convex setting. Lin et al. study empirically hierarchical variants of local SGD.
Local SGD with averaging in every step, i.e. , is identical to mini-batch SGD. Dekel et al. show that batch sizes , for are asymptotically optimal for mini-batch SGD, however they also note that this asymptotic bound might be crude for practical purposes. Similar considerations might also apply to the asymptotic upper bounds on the communication frequency derived here. Local SGD with averaging only at the end, i.e. , is identical to one-shot SGD. Jain et al. give concise speedup results in terms of bias and variance for one-shot SGD with constant stepsizes for the optimization of quadratic least squares problems. In contrast, our upper bounds become loose when and our results do not cover one-shot SGD.
Recently, Woodworth et al. provided a lower bound for parallel stochastic optimization (in the convex setting, and not for strongly convex functions as considered here). The bound is not known to be tight for local SGD.
3 Outline
We formally introduce local SGD in Section 2 and sketch the convergence proof in Section 3. In Section 4 show numerical results to illustrate the result. We analyze asynchronous local SGD in Section 5. The proof of the technical results, further discussion about the experimental setup and implementation guidelines are deferred to the appendix.
Local SGD
The algorithm local SGD (depicted in Algorithm 1) generates in parallel sequences of iterates, . Here denotes the level of parallelization, i.e. the number of distinct parallel sequences and the number of steps (i.e. the total number of stochastic gradient evaluations is ). Let with denote a set of synchronization indices. Then local SGD evolves the sequences in the following way:
where indices and denotes a sequence of stepsizes. If then the synchronization of the sequences is performed every iteration. In this case, (4) amounts to parallel or mini-batch SGD with mini-batch size .For the ease of presentation, we assume here that each worker in local SGD only processes a mini-batch of size . This can be done without loss of generality, as we discuss later in Remark 2.4. On the other extreme, if , the synchronization only happens at the end, which is known as one-shot averaging.
In order to measure the longest interval between subsequent synchronization steps, we introduce the gap of a set of integers.
The gap of a set of integers, for , is defined as .
Before jumping to the convergence result, we first discuss an important observation.
Towards local SGD.
For local SGD such a simple argument is elusive. For instance, just capitalizing the convexity of the objective function is not enough: this will show that the averaged iterate of independent SGD sequences converges at rate \mathcal{O}\bigl{(}\frac{\sigma^{2}}{T}\bigr{)}, i.e. no speedup can be shown in this way.
This indicates that one has to show that local SGD decreases the variance instead, similar as in parallel SGD. Suppose the different sequences evolve close to each other. Then it is reasonable to assume that averaging the stochastic gradients for all can still yield a reduction in the variance by a factor of —similar as in parallel SGD. Indeed, we will make this statement precise in the proof below.
2 Convergence Result and Discussion
where , for and .
We were not especially careful to optimize the constants (and the lower order terms) in (5), so we now state the asymptotic result.
Let be as defined as in Theorem 2.2, for parameter . Then
So far, we assumed that each worker only computes a single stochastic gradient. In mini-batch local SGD, each worker computes a mini-batch of size in each iteration. This reduces the variance by a factor of , and thus Theorem (2.2) gives the convergence rate of mini-batch local SGD when is replaced by .
We now state some consequences of equation (6). For the ease of the exposition we omit the dependency on , , and below, but depict the dependency on the local mini-batch size .
For large enough and assuming , the very first term is dominating in (6) and local SGD converges at rate . That is, local SGD achieves a linear speedup in both, the number of workers and the mini-batch size .
It needs to hold to get the linear speedup. This yields a reduction of the number of communication rounds by a factor compared to parallel mini-batch SGD without hurting the convergence rate.
We have not optimized the result for extreme settings of , , or . For instance, we do not recover convergence for the one-shot averaging, i.e. the setting (though convergence for , but at a lower rate).
Zhang et al. empirically observe that more frequent communication at the beginning of the optimization can help to get faster time-to-accuracy (see also ). Indeed, when the number of total iterations is not known beforehand (as it e.g. depends on the target accuracy, cf. (6) and also Section 4 below), then increasing the communication frequency seems to be a good strategy to keep the communication low, why still respecting the constraint for all .
Proof Outline
We now give the outline of the proof. The proofs of the lemmas are given in Appendix A.
Inspired by the perturbed iterate framework of we first define a virtual sequence in the following way:
where the sequences for are the same as in (4). Notice that this sequence never has to be computed explicitly, it is just a tool that we use in the analysis. Further notice that for whenever . Especially, when , then for every . It will be useful to define
Now the proof proceeds as follows: we show (i) that the virtual sequence almost behaves like mini-batch SGD with batch size (Lemma 3.1 and 3.2), and (ii) the true iterates do not deviate much from the virtual sequence (Lemma 3.3). These are the main ingredients in the proof. To obtain the rate we exploit a technical lemma from .
Let and for be defined as in (4) and (7) and let be -smooth and -strongly convex and . Then
Bounding the variance.
Bounding the deviation.
If and sequence of decreasing positive stepsizes satisfying for all , then
Optimal Averaging.
Similar as in we define a suitable averaging scheme for the iterates to get the optimal convergence rate. In contrast to that use linearly increasing weights, we use quadratically increasing weights, as for instance .
Let , , , be sequences satisfying
for and constants , , , . Then
for and .
This is a reformulation of Lemma 3.3 in . ∎
Proof of Theorem 2.2.
Numerical Illustration
In this section we show some numerical experiments to illustrate the results of Theorem 2.2.
When Algorithm 1 is implemented in a distributed setting, there are two components that determine the wall-clock time: (i) the total number of gradient computations, , and (ii) the total time spend for communication. In each communication round vectors need to be exchanged, and there will be communication rounds. Typically, the communication is more expensive than a single gradient computation. We will denote this ratio by a factor (in practice, can be 10–100, or even larger on slow networks). The parameter depends on the desired accuracy , and according to (6) we roughly have . Thus, the theoretical speedup of local SGD on machines compared to SGD on one machine (, ) is
Theoretical.
Examining (13), we see that (i) increasing can reduce negative scaling effects due to parallelization (second bracket in the denominator of (13)), and (ii) local SGD only shows linear scaling for (i.e. large enough, in agreement with the theory). In Figure 3 we depict , once for in Figure 2(b), and for positive in Figure 2(a) under the assumption . We see that for the largest values of give the best speedup, however, when only a few epochs need to be performed, then the optimal values of change with the number of workers . We also see that for a small number of workers is never optimal. If is unknown, then these observations seem to indicate that the technique from , i.e. adaptively increasing over time seems to be a good strategy to get the best choice of when the time horizon is unknown.
Experimental.
Conclusion.
The restriction on imposed by theory is not severe for . Thus, for training that either requires many passes over the data or that is performed only on a small cluster, large values of are advisable. However, for smaller (few passes over the data), the dependency shows significantly in the experiment. This has to be taken into account when deploying the algorithm on a massively parallel system, for instance through the technique mentioned in .
Asynchronous Local SGD
In this section we present asynchronous local SGD that does not require that the local sequences are synchronized. This does not only reduce communication bottlenecks, but by using load-balancing techniques the algorithm can optimally be tuned to heterogeneous settings (slower workers do less computation between synchronization, and faster workers do more). We will discuss this in more detail in Section C.
Asynchronous local SGD generates in parallel sequences of iterates, . Similar as in Section 2 we introduce sets of synchronization indices, with for . Note that the sets do not have to be equal for different workers. Each worker evolves locally a sequence in the following way:
where denotes the state of the aggregated variable at the time when worker reads the aggregated variable. To be precise, we use the notation
where denotes all updates that have been written at the time the read takes place. The sets are indexed by iteration , worker that initiates the read and . Thus denotes all updates of the local sequence , that have been reported back to the server at the time worker reads (in iteration ). This notation is necessary, as we don’t necessarily have for . We have for , as updates are not overwritten. When we cast synchronized local SGD in this notation, then it holds for all , as all the writes and reads are synchronized.
Let , , and be as in Theorem 5.1 and sequences for generated according to (14) with for and for stepsizes with shift parameter for delay . If for all , , then
where , for and .
Hence, for large enough and , asynchronous local SGD converges with rate O\bigl{(}\frac{G^{2}}{KT}\bigr{)}, the same rate as synchronous local SGD.
Conclusion
We prove convergence of synchronous and asynchronous local SGD and are the first to show that local SGD (for nontrivial values of ) attains theoretically linear speedup on strongly convex functions when parallelized among workers. We show that local SGD saves up to a factor of in global communication rounds compared to mini-batch SGD, while still converging at the same rate in terms of total stochastic gradient computations.
Deriving more concise convergence rates for local SGD could be an interesting future direction that could deepen our understanding of the scheme. For instance one could aim for a more fine grained analysis in terms of bias and variance terms (similar as e.g. in ), relaxing the assumptions (here we relied on the bounded gradient assumption), or investigating the data dependence (e.g. by considering data-depentent measures like e.g. gradient diversity ). There are also no apparent reasons that would limit the extension of the theory to non-convex objective functions; Lemma 3.3 does neither use the smoothness nor the strong convexity assumption, so this can be applied in the non-convex setting as well. We feel that the positive results shown here can motivate and spark further research on non-convex problems. Indeed, very recent work analyzes local SGD for non-convex optimization problems and shows convergence of SGD to a stationary point, though the restrictions on are stronger than here.
Acknowledgments
The author thanks Jean-Baptiste Cordonnier, Tao Lin and Kumar Kshitij Patel for spotting various typos in the first versions of this manuscript, as well as Martin Jaggi for his support.
References
Appendix A Missing Proofs for Synchronized Local SGD
In this section we provide the proofs for the three lemmas that were introduced in Section 3.
where we used the inequality in (21). By -smoothness,
To estimate the last term in (22) we use , for . This gives
where we have again used (23) in the last inequality. By applying these three estimates to (22) we get
For it holds \bigl{(}\eta_{t}L-\frac{1}{2}\bigr{)}\leq-\frac{1}{4}. By convexity of for :
Finally, we can plug (30) back into (18). By taking expectation we get
By definition of and we have
where we used for independent random variables. ∎
Appendix B Missing Proof for Asynchronous Local SGD
In this Section we prove Theorem 5.1. The proof follows closely the proof presented in Section 3. We again introduce the virtual sequence
as before. By the property for we know that all workers will have written their updates when the algorithm terminates. This assumption is not very critical and could be relaxed, but it facilitates the (already quite heavy) notation in the proof.
Observe, that Lemmas 3.1 and 3.2 hold for the virtual sequence . Hence, all we need is a refined version of Lemma 3.3 that bounds how far the local sequences can deviate from the virtual average.
If and , s.t. for all , , and sequence of decreasing positive stepsizes satisfying for all , then
Here we use the notation for , such that is also defined for .
As there exists for every a , , such that . Let and observe . Let . As for all , , it holds
for each . In other words, all updates up to iteration have been written to the aggregated sequence.
Now, using , and , we conclude (as in (36))
and similarly, as ,
Finally, as , we can conclude
Now the proof of Theorem 5.1 follows immediately.
As in the proof of Theorem 2.2 we rely on Lemma 3.4 to derive the convergence rate. Again, we have , , and (Lemma B.1). It is easy to see that the stepsizes satisfy the condition of Lemma B.1, as clearly , as . ∎
Appendix C Comments on Implementation Issues
In Theorem 5 we do not prove convergence of the sequences of the iterates, but only convergence of a weighted average of all iterates. In practice, the last iterate might often be sufficient, but we like to remark that the weighted average of the iterates can easily be tracked on the fly with an auxiliary sequence , , without storing all intermediate iterates, see Table 1 for some examples.
C.2 Asynchronous Local SGD
As for synchronous local SGD, the weighted averages of the iterates (if needed), can be tracked on each worker locally by a recursive formula as explained above.
A more important aspect that we do not have discussed yet, is that Algorithm 2 allows for an easy procedure to balance the load in heterogeneous settings. In our notation, we have always associated the local sequences with a specific worker . However, the computation of the sequences does not need to be tied to a specific worker. Thus, a fast worker that has advanced his local sequence too much already, can start computing updates for another sequence , if worker is lagged behind. This was not possible in the synchronous model, as there all communications had to happen in sync. We demonstrate this principle in Table 2 below for two workers. Note that also the running averages can still be maintained.
Appendix D Details on Experiments
For each run, we initialize and measure the number of iterationsNote, that besides the randomness involved the stochastic gradient computations, the averaging steps of synchronous local SGD are deterministic. Hence, these results (convergence in terms if numbers of iterations) can be reproduced by just simulating local SGD by using virtual workers (which we did for large number of ). For completeness, we report that all experiments were run on an an Ubuntu 16.04 machine with a 24 cores processor Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz. (and number of stochastic gradient evaluations) to reach the target accuracy . As we prove convergence only for a special weighted sum of the iterates in Theorem 2.2 and not for standard criteria (last iterate or uniform average), we evaluate the function value for different weighted averages , and consider the accuracy reached when one of the averages satisfies , with (numerically determined). The precise formulas for the averages that we used are given in Table 1.
In Figures 4 and 5 we give additional results for mini-batch sizes .