On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization
Hao Yu, Rong Jin, Sen Yang
Introduction
Consider distributed non-convex optimization scenarios where workers jointly solve the following consensus optimization problem:
Stochastic Gradient Descent (SGD) (and its momentum variants) have been the dominating methodology for solving machine learning problem. For large-scale distributed machine learning problems, such as training deep neural networks, a parallel version of SGD, known as parallel mini-batch SGD, is widely adopted (Dean et al., 2012; Dekel et al., 2012; Li et al., 2014). With parallel workers, parallel mini-batch SGD can solve problem (1) with convergence, which is times fasterIf an algorithm has convergence, then it takes iterations to attain an accurate solution. Similarly, if another algorithm has convergence, then it takes iterations, which is times smaller than , to attain an accurate solution. In this sense, the second algorithm is times faster than the first one. than the convergence attained by SGD over a single node (Ghadimi & Lan, 2013; Lian et al., 2015). Obviously, such linear speedup with respect to (w.r.t.) number of workers is desired in distributed training as it implies perfect computation scalability by increasing the number of used workers. However, such linear speedup is difficult to harvest in practice because the classical parallel mini-batch SGD requires all workers to synchronize local gradients or models at every iteration such that inter-node communication cost becomes the performance bottleneck. To eliminate potential communication bottlenecks, such as high latency or low bandwidths, in computing infrastructures, many distributed SGD variants have been recently proposed. For example, (Lian et al., 2017; Jiang et al., 2017; Assran et al., 2018) consider decentralized parallel SGD where global gradient aggregations used in the classical parallel mini-batch SGD are replaced with local aggregations between neighbors. To reduce the communication cost at each iteration, compression or sparsification based parallel SGD are considered in (Seide et al., 2014; Strom, 2015; Aji & Heafield, 2017; Wen et al., 2017; Alistarh et al., 2017). Recently, (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) prove that certain parallel SGD variants that strategically skip communication rounds can achieve the fast convergence with significantly less communication rounds. See Table 1 for a summary on recent works studying distributed SGD with convergence and reduced communication complexity.
It is worth noting that existing convergence analyses on distributed methods for solving problem (1) focus on parallel SGD without momentum. In practice, momentum SGD is more commonly used to train deep neural networks since it often can converge faster and generalize better (Krizhevsky et al., 2012; Sutskever et al., 2013; Yan et al., 2018). For example, momentum SGD is suggested for training ResNet for image classification tasks to obtain the best test accuracy (He et al., 2016). See Figure 1 for the comparison of test accuracy between training with momentum and training without momentum.As observed in Figure 1, the final test accuracy of momentum SGD is roughly better than that of vanilla SGD (without momentum) over CIFAR10. In practice, people usually use momentum SGD to train ResNet in both single GPU and multiple GPU scenarios. Some practitioners conjecture that it is possible to avoid the performance degradation of vanilla SGD if its hyper-parameters are well tuned. However, even if this conjecture is true, the hyper-parameter tuning can be extremely time-consuming. In fact, while previous works (Lian et al., 2017; Stich, 2018; Yu et al., 2018; Jiang & Agrawal, 2018) prove that distributed vanilla SGD (without momentum) can train deep neural networks with convergence using significantly fewer communications rounds, most of their experiments use momentum SGD rather than vanilla SGD (to achieve the targeted test accuracy). In addition, previous empirical works (Chen & Huo, 2016; Lin et al., 2018) on distributed training for deep neural networks explicitly observe that momentum is necessary to improve the convergence and test accuracy. In this perspective, there is a huge gap between the current practices, i.e., using momentum SGD rather than vanilla SGD in distributed training for deep neural networks, and existing theoretical analyses such as (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) studying the convergence rate and communication complexity of SGD without momentum. Momentum methods are originally proposed by Polyak in (Polyak, 1964) to minimize deterministic strongly convex quadratic functions. Its convergence rate for deterministic convex optimization, which is not necessarily strongly convex, is established in (Ghadimi et al., 2014). For non-convex optimization, the convergence for deterministic non-convex optimization is proven in (Zavriev & Kostyuk, 1993) and the convergence rate for stochastic non-convex optimization is recently shown in (Yan et al., 2018). However, to our best knowledge, it remains as an open question: “Whether any distributed momentum SGD can achieve the same convergence, i.e., linear speedup w.r.t. number of workers, with reduced communication complexity as SGD (without momentum) methods in (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) ?”
Our Contributions: In this paper, we considers parallel restarted SGD with momentum (described in Algorithm 1), which can be viewed as the momentum extension of parallel restarted SGD, also referred as local SGD, considered in (Stich, 2018; Yu et al., 2018). Such a method is also faithful to the common practice “model averaging with momentum” used by practitioners for training deep neural networks in (Chen & Huo, 2016; McMahan et al., 2017; Su et al., 2018; Lin et al., 2018). We show that parallel restarted SGD with momentum can solve problem (1) with convergence, i.e., achieving linear speedup w.r.t. number of workers. Moreover, to achieve the fast convergence, iterations of our algorithm only requires communication rounds when all workers can access identical data sets; or communication rounds when workers access non-identical data sets. To our knowledge, this is the first time that a distributed momentum SGD method for non-convex stochastic optimization is proven to possess the same linear speedup property (with communication reduction) as distributed SGD (without momentum) in (Lian et al., 2017; Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018).
Recall that momentum SGD degrades to vanilla SGD when its momentum coefficient is set . Algorithm 1 with degrades to the parallel SGD methods (without momentum) considered in (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018). Our communication complexity results for parallel SGD without momentum cases also improve the state-of-the-art. As shown in Table 1, the number of required communication rounds shown in this paper is the fewest in both identical training data set case and non-identical data set case. In particular, this paper relaxes the bounded gradient moment assumption used in (Yu et al., 2018) to a milder bounded variance assumption and reduces the communication complexity for the case with identical training data. Our analysis applies to the distributed training scenarios where workers access non-identical training sets, e.g., training with sharded data or federated learning, that can not be handled in (Wang & Joshi, 2018a).
This paper further considers momentum SGD with decentralized communication and proves that it possess the same linear speedup property as its vanilla SGD (without momentum) counterpart considered in (Lian et al., 2017).
Parallel Restarted SGD with Momentum
Following the convention in (distributed) stochastic optimization, we assume each worker can independently evaluate unbiased stochastic gradients using its own local variable . Throughout this paper, we have the following standard assumption:
Smoothness: Each function in problem (1) is smooth with modulus .
Bounded variances: There exist and such that
Note that in Assumption 1 quantifies the variance of stochastic gradients at local worker, and quantifies the deviations between each worker’s local objective function . For distributed training of neural networks, if all workers can access the same data set, then . Assumption 1 was previously used in (Lian et al., 2017; Wen et al., 2017; Alistarh et al., 2017; Jiang & Agrawal, 2018). The bounded variance assumption in Assumption 1 is milder than the bounded second moment assumption used in (Stich, 2018; Yu et al., 2018).
Recall that if function is smooth with modulus , then we have the key property that where represents the inner product of two vectors. Another two useful properties implied by Assumption 1 are summarized in the next two lemmas.
Consider problem (1) under Assumption 1. For any points , if we define ,then we have
2 Parallel Restarted SGD with Momentum
Consider applying the distributed algorithm described in Algorithm 1 to solve problem (1) with workers. In the literature, there are two momentum methods for SGD, i.e., Polyak’s momentum method (also known as the heavy ball method) and Nesterov’s momentum method. Algorithm 1 can use either of them for local worker updates by providing two options: “Option I” is Polyak’s momentum; “Option II” is Nesterov’s momentum.The current description in (4) and (5) is identical to the default implementations of momentum methods in PyTorch’s SGD optimizer. Polyak’s and Nesterov’s momentum have many other equivalent representations. These equivalent representations will be further discussed later. We also note that the update steps described in (4) or (5) can be locally performed at each worker in parallel.
The synchronization step (6) can be interpreted as restarting momentum SGD with new initial point and momentum buffer , i.e., resetting and . In this perspective, we call Algorithm 1 “parallel restarted SGD with momentum”. Note that if we choose in Algorithm 1, then it degrades to the “parallel restarted SGD”, also known as “local SGD” or “SGD with periodic averaging”, in (Stich, 2018; Yu et al., 2018; Lin et al., 2018; Wang & Joshi, 2018a; Zhou & Cong, 2018; Jiang & Agrawal, 2018).
Since inter-node communication is only needed by Algorithm 1 to calculate global averages and in (6) and happens only once every iterations, the total number of inter-communication rounds involved in iterations of Algorithm 1 is given by . If we use all-reduce operations to compute the global averages, the per-round communication is relatively cheap and does not involve a centralized parameter server (Goyal et al., 2017).
Algorithm 1 uses to denote local solution sequences generated at each worker . For each iteration , we define
as the averages of local solutions across all nodes. Our performance analysis will be performed regarding the aggregated version defined in (7) so that “consensus” is no longer our concern for problem (1). Performing convergence analysis for the node-average version has been an important technique used in previous works on distributed consensus optimization (Lian et al., 2017; Mania et al., 2017; Stich, 2018; Yu et al., 2018; Jiang & Agrawal, 2018).
3 Performance Analysis
This subsection analyzes the convergence rates of Algorithm 1 with Polyak’s and Nesterov’s momentum, respectively.
We first consider Algorithm 1 with Option I given by (4), i.e., Polyak’s momentum. Polyak’s momentum SGD, also known as the heavy ball method, is the classical momentum SGD used for training deep neural networks and often provides fast convergence and good generalization (Krizhevsky et al., 2012; Sutskever et al., 2013; Yan et al., 2018). If we let , then (4) in Algorithm 1 can be equivalently written as the following single variable version
where the last term is often called Polyak’s momentum term.
It’s interesting to note that if we define an auxiliary sequence
which is the node average sequence of local buffer variables from Algorithm 1 with Option I, then we have
where is defined in (7).
An important observation is that if in (10) is replaced with , then (10) is exactly a standard single-node SGD momentum method with momentum buffer and solution . That is, under Algorithm 1, workers essentially jointly update and with momentum SGD using an inaccurate stochastic gradient . However, since (6) periodically (every iterations) synchronizes all local variables and , our intuition is by synchronizing frequently enough the inaccuracy of the used gradient counterpart in (10) shall not damage the convergence too much. The next theorem summarizes the convergence rate of Algorithm 1 and characterizes the effect of synchronization interval in its convergence.
Our Algorithm 1 is different from a common heuristic model averaging strategy for momentum SGD suggested in (Wang & Joshi, 2018b) and in Microsoft’s CNTK framework (Seide & Agarwal, 2016). The strategy in (Seide & Agarwal, 2016; Wang & Joshi, 2018b) let each worker run local momentum SGD in parallel, and periodically reset momentum buffer variables to zero and average local models. However, there is no convergence analysis for this strategy. In contrast, this paper shall provide rigorous convergence analysis for our Algorithm 1. Our experiment in Supplement 6.7.2 furthers shows that Algorithm 1 has faster convergence than the strategy in (Seide & Agarwal, 2016; Wang & Joshi, 2018b) and yields better test accuracy when used to train ResNet for CIFAR10.
Consider problem (1) under Assumption 1. If we choose and in Algorithm 1 with Option I, then for all , we have
where is the sequence defined in (7); and and are constants defined in Assumption 1; and is the minimum value of problem (1).
The next corollary summarizes that Algorithm 1 with Option I using workers can solve problem (1) with the fast convergence, i.e., achieving the linear speedup (w.r.t. number of workers). Note that dominates in Corollary 1 when is sufficiently large.
Consider problem (1) under Assumption 1. If we choose and in Algorithm 1 with Option I, then for any , we have
This corollary follows because the selection of and satisfies the conditions in Theorem 1. ∎
The next corollary summarizes that the desired linear speedup can be attained with communication skipping, i.e., using in Algorithm 1. In particular, to achieve the linear speedup, iterations in Algorithm 1 with Option I only require communication rounds when , i.e., workers access the common training data set; or only require communication rounds when , i.e., workers access non-identical training data sets.
Case : If we choose and in Algorithm 1 with Option I, then for any , we have
By using , the total number of inter-node communication rounds involved in Algorithm 1 is .
Case : If we choose and in Algorithm 1 with Option I, then for any , we have
By using , the total number of inter-node communication rounds involved in Algorithm 1 is .
This corollary follows simply because the selection of and satisfies the conditions in Theorem 1. ∎
Note that Theorem 1 and Corollary 2 hold for any . Recall that Algorithm 1 with degrades to parallel restarted SGD (without momentum). For parallel SGD (without momentum), the communication complexity implied by Corollary 2 improves the state-of-the-art as summarized in Table 1.
Now consider using Nesterov’s momentum described in Option II in Algorithm 1. Option II given by (5) introduces extra auxiliary variables and uses them in the update of local solutions . It is not difficultThe derivation on the equivalence is in Supplement 6.4. to show that (5) yields the same solution sequences as
with . The only difference between (5) and (14) is the adoption of different cache variables.
Equation (14) is more widely used to describe SGD with Nesterov’s momentum in the literature (Nesterov, 2004). However, by writing Nesterov’s momentum SGD as (5), an important observation is that momentum buffer variables in Polyak’s and Nesterov’s momentum methods evolve according to the same dynamic (with stochastic gradients sampled at different points). By adapting the convergence rate analysis for Polyak’s momentum, Theorem 2 (formally proven in Supplement 6.5) summarizes the convergence for Nesterov’s momentum.
Consider problem (1) under Assumption 1. If we choose and in Algorithm 1 with Option II, then for all , we have
where is the sequence defined in (7); and and are constants defined in Assumption 1; and is the minimum value of problem (1).
By comparing Theorem 1 and Theorem 2, we conclude that the order of convergence rate for both options in Algorithm 1 (with slightly different choices of learning rate ) have the same dependence on and . It is then immediate that Algorithm 1 with Option II can achieve the linear speedup with the same (reduced) communication complexity as summarized in Corollary 2 for Option I.
Extension: Momentum SGD with Decentralized Communication
where is the -th largest eigenvalue of .
Compared with PR-SGD-Momentum described in Algorithm 1, Algorithm 2 uses a local aggregation step (18) at every iteration. While Algorithm 2 can not skip the aggregation steps as Algorithm 1 does and hence involves times more communication rounds, the per-step communication is more flexible since each node only computes neighborhood averages rather than global averages. As a consequence, Algorithm 2 is more suitable to distributed machine learning with mobiles where the network topology is heterogeneous and diverse. By extending our analysis for Algorithm 1, the next theorem proves the linear speedup of Algorithm 2.
Consider problem (1) under Assumptions 1 and 2. If we choose in Algorithm 2, then for all , we have
where defined in (7) are the averages of local solutions across workers; and and are constants defined in Assumptions 1-2.
This next corollary further summarizes that using in Algorithm 2 can achieve convergence with the linear speedup.
Consider problem (1) under Assumptions 1 and 2. If we choose in Algorithm 2, then for all , we have
where defined in (7) are the averages of local solutions across workers.
Note that Algorithm 2 with degrades to the decentralized SGD without momentum considered in (Lian et al., 2017), where the linear speedup of decentralized SGD is first shown. Recently, many variants of decentralized SGD without momentum with linear speedup have been developed for distributed deep learning (Tang et al., 2018; Wang & Joshi, 2018a; Assran et al., 2018). To our knowledge, this is the first time that decentralized momentum SGD is shown to possess the same linear speedup.
Experiments
In this section, we validate our theory with experiments on training ResNet (He et al., 2016) for the image classification tasks over CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009). Our experiments are conducted on a machine with NVIDIA P100 GPUs. The local batch size at each GPU is . The learning rate is initialized to and is divided by when all GPUs jointly access and epochs.Such decaying learning rates are used to achieve good test accuracy by practitioners (He et al., 2016). This deviates from the constant learning rates used in the theory to establish the linear speedup. On one hand, it is possible follow the analysis techniques in (Yu et al., 2018) to extend this paper’s theory to establish a similar linear speedup. On the other hand, our supplement 6.7.1 reports extra experiments to verify the linear speedup of Algorithm 1 using constant learning rates faithful to the theory. The momentum coefficient , i.e., in Algorithm 1, is set to for both Polyak’s and Nesterov’s momentum. All algorithms are implemented using PyTorch . To study how communication skipping affects the convergence of Algorithm 1, we run Algorithm 1 with and compare the test accuracy convergence between Algorithm 1 and the classical parallel mini-batch SGD with momentum, which uses an inter-node communication step at every iteration and can be interpreted as Algorithm 1 with . Figure 2 plots the convergence of training loss and test accuracy in terms of the number of epochs that are jointly accessed by all used GPUs. That is, if the -axis value is , then each GPU access epoch of training data. The same convention is followed by other figures for multiple GPU training in this paper. By plotting the convergence in terms of the number of epochs that are jointly accessed by all GPUs, we can verify the convergence for Algorithm 1 proven in this paper. To verify the benefit of skipping communication in our Algorithm 1, Figure 3 plots the convergence of training loss and test accuracy in terms of the wall clock time. Since Algorithm 1 with skip communication steps, it is much faster than the classical parallel momentum SGD when measured by the wall clock time. More numerical experiments on training ResNet over ImageNet, comparisons with a model averaging strategy suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b), and Algorithm 1 with Nesterov’s momentum are available in Supplement 6.7.
Conclusion
This paper considers parallel restarted SGD with momentum and prove that it can achieve convergence with or communication rounds depending whether each node accesses identical objective functions or not. We further show that distributed momentum SGD with decentralized communication can achieve convergence.
References
Supplement
The lemma follows simply because unbiased stochastic gradients are independently sampled. The formal proof is as follows:
2 Proof of Lemma 2
This lemma follows from simple algebraic manipulations as follows:
where (a) follows from the basic inequality for any vectors and ; (b) follows because by the smoothness of in Assumption 1 and , where the first inequality follows from the convexity of and Jensen’s inequality and the second inequality follows from the smoothness of each .
3 Proof of Theorem 1
To prove this theorem, we first introduce an auxiliary sequence given by
where defined in (7) are the node averages of local solutions from Algorithm 1 with Option I. A similar auxiliary sequence has been used for the convergence analysis of standard momentum methods (single node without restarting) in (Ghadimi et al., 2014) and (Yan et al., 2018).
Before the main proof of Theorem 1, we further introduce several useful lemmas.
Consider the sequence defined in (21). Algorithm 1 with Option I ensures that for all , we have
To prove , we consider and separately.
where (a) follows from (21); (b) follows from (10) by noting that .
where (a) follows from (21); and (b) and (c) follow from (10).
Let defined in (7) be the node averages of local solutions from Algorithm 1 with Option I. Let be defined in (21). For all , Algorithm 1 with Option I ensures that
Recall that . Recursively applying the first equation in (10) for times yields:
where (a) follows from (21) and (b) follows from the second equation in (10).
Define . For all , we have
where (a) follows from the convexity of and Jensen’s inequality.
Fix . Note that . Summing (27) over yields
where (a) follows by noting that . ∎
Consider problem (1) under Assumption 1. Let defined in (7) be the node averages of local solutions from Algorithm 1 with Option I. If and are chosen to satisfy , then for all , we have
For any , by the second equation in (4), we have
Summing over and noting that yields
Using an argument similar to the above, we can further show
where (a) follows from the basic inequality .
Now we develop the respective upper bounds of the two terms on the right side of (33). We note that
Recall that for each , the index in the above equation is the largest integer such that , and . Summing (36) over that are not a multiple of and noting that if is a multiple of yields
Collecting common terms and dividing both sides by yields
3.2 Main Proof of Theorem 1
With the above lemmas, we are ready to present the main proof of Theorem 1.
Fix . By the smoothness of function (in Assumption 1), we have
where (a) follows by applying the basic inequality with and ; and (b) follows from the smoothness of function .
Applying the basic identity with and yields
where (a) follows because , where the first inequality follows from the convexity of and Jensen’s inequality and the second inequality follows from the smoothness of each .
Dividing both sides by and rearranging terms yields
where (a) follows because is chosen to ensure .
4 On the equivalence between (5) and (14)
In this subsection, we show both and yield the same solution sequences assume they are initialized at the same . It is easy to verify that (5) and (14) yield the same by noting that and . Next, we show they yield the same .
Substituting the second equation of (5) into the third equation of (5) yields
Fix . Substituting the first equation of (5) into the above equation yields
where (a) follows by noting that from (48).
Now if we define , which is the first equation in (14), then the above equation can be written as
which is exactly the second equation in (14).
Thus, we can conclude that (5) and (14) yield the same solution sequences (with different auxiliary/buffer sequences.)
5 Proof of Theorem 2
To establish the convergence rate for Algorithm 1 with Nesterov’s momentum, we introduce a different auxiliary sequence given by
where defined in (7) are the node averages of local solutions from Algorithm 1 with Option II.
Using the notation in (9) and with and from Algorithm 1 with Option II, we shall have
The main proof shall follow similar steps as in our main proof for Polyak’s momentum in Section 6.3.2 as long as we prove the counterpart of Lemmas 3-5 for Algorithm 1 with Option II.
While the sequence introduced in (53) is different from defined in (21) for Polyak’s momentum, the next lemma shows that for Algorithm 1 with Option II is equal to for Algorithm 1 with Option I.
Consider the sequence defined in (53). Algorithm 1 with Option II ensures that for all , we have
The case can be shown directly by combining (53) and (54) with . To prove the case , we note that
where (a) follows from (53); (b) follows from the third equation in (54); and (c) follows by substituting and , which is implied by combining the first and second equations in (54). ∎
Let defined in (7) be the node averages of local solutions from Algorithm 1 with Option II. Let be defined in (53). For all , Algorithm 1 with Option II ensures that
Recall that . Recursively applying the first equation in (54) for times yields:
where (a) follows from (53); (b) follows from the third equation in (54); and (c) follows from the second equation in (54).
Define . For all , we have
where (a) follows from the convexity of and Jensen’s inequality.
Fix . Recall that . Summing (60) over yields
where (a) follows by noting that . ∎
Consider problem (1) under Assumption 1. Let defined in (7) be the node averages of local solutions from Algorithm 1 with Option II. If and are chosen to satisfy , then for all , we have
By the second equation in (5), this further implies
For any , by the third equation in (5), we have
Summing over and noting that yields
Using a argument similar to the above, we can further show
where (a) follows from the basic inequality .
Now we develop the respective upper bounds of the two terms on the right side of (67). We note that
Recall that for each , the index in the above equation is the largest integer such that , and . Summing (70) over that are not a multiple of and noting that if is a multiple of yields
Collecting common terms and dividing both sides by yields
5.2 Main Proof of Theorem 2
It is easy to realize that Lemmas 6, 7 and 8 developed above are the respective counterparts for Lemmas 3, 4 and 5. The main proof of Theorem 2 follows similar steps as the proof of Theorem 1 in Section 6.3 with the minor changes that Lemmas 3, 4 and 5 should be replaced by Lemmas 6, 7 and 8, respectively.
Fix . By the smoothness of function (in Assumption 1), we have
where (a) follows by applying the basic inequality with and ; and (b) follows from the smoothness of function .
Applying the basic identity with and yields
where (a) follows because , where the first inequality follows from the convexity of and Jensen’s inequality and the second inequality follows from the smoothness of each .
Dividing both sides by and rearranging terms yields
where (a) follows because is chosen to ensure .
6 Proof of Theorem 3
This section provides the complete proof for Algorithm 2 with Option I (Polyak’s momentum). An extension to Algorithm 2 with Option II can be similarly done as our extension in Section 6.5 for Algorithm 1 from Option I to Option II.
Let and (using the forms in (9) and (7), respectively) be the node averages of local variables and from Algorithm 2 with Option I. It is not difficult to show that and satisfiy the following fact.
Let and be node averages of local variables from Algorithm 2 with Option I. For all , we have
By the definition of , we have
where (a) follows by substituting the first equation in (18); (b) follows by recalling that for doubly stochastic matrix ; (c) follows by substituting the first equation in (16); and (d) follows from the definition of .
where (a) follows from the definition of ; (b) follows by substituting the second equation in (18); (c) follows byrecalling that for doubly stochastic matrix ; and (d) follows by using the definition of and equation (83).
It is remarkable that Fact 1 implies that even if we decentralized local averaging in Algorithm 2, the yielded global averages and follow the same dynamics as the global averages in Algorithm 1.
Let be node averages of local variables from Algorithm 2 with Option I. We again define the auxiliary sequence via
Since (82) and (86) are respectively identical to (10) and (21) for Algorithm 1 with Option I, the following two lemmas can be proven using exactly the same steps in the proofs for Lemmas 3 and 4.
Consider the sequence defined in (86). Algorithm 2 with Option I ensures that for all , we have
Let defined in (7) be the global node averages of local solutions from Algorithm 2 with Option I. Let be defined in (86). For all , Algorithm 2 with Option I ensures that
is an matrix where all entries are and is the identity matrix for which the dimensions are obvious in its context.
Due to the asymmetry in computing local averages for different nodes in (18), it is easier to provide the desired counterpart of Lemma 5 by studying the equivalent matrix form . The technique of introducing matrix forms to analyze the convergence rate has been previously used in (Lian et al., 2017; Tang et al., 2018; Wang & Joshi, 2018a) for the analysis of decentralized SGD without momentum.
The following useful facts are simple in matrix analysis (Horn & Johnson, 1985). Similar facts are used in (Lian et al., 2017; Wang & Joshi, 2018a).
Let be arbitrary real square matrices. It follows that .
Recall by the definition of Frobenius norm, we have where denote the trace operator of a square matrix. It follows that
where (a) follows from the simple fact that for any two square matrices. ∎
Let be defined in (90). For any symmetric doubly stochastic matrix under Assumption 2, we have
For any integer , we have where is the constant in Assumption 2 and denotes the spectrum norm for a matrix.
Recall that is a symmetric matrix. Let be the eigenvalue decomposition for where and is a unitary matrix where each column is an eigenvector of . Since is the eigenvector corresponding to eigenvalue for matrix , thus the first column of is . By (90), we further have an eigenvalue decomposition of given by where . Thus, we have
where (a) follows from part (1) of this fact.
Note that where . By Assumption 2, we have .
Consider problem (1) under Assumptions 1 and 2. Let defined in (7) be the node averages of local solutions from Algorithm 2 with Option I. If is chosen to satisfy , then for all , we have
Recall the definitions of and in (89), for all , we have
where (a) follows from the first equation in (18) and (b) follows from the first equation in (16).
Recursively applying the above equation for times yields
where (a) follows from .
Recall the definitions of and in (89), for all , we have
where (a) follows from the second equation in (16); (b) follows from the first equation in (16); and (c) follows from (91).
Multiplying both sides by with defined in (90) yields
where (a) follows because by Fact 3.
For all , recursively applying the above equation for times (using when needed) yields
where (a) follows because due to all columns of are identical; (b) follows by substituting (92); (c) follows because by Fact 3.
where (a) follows from the basic inequality .
Now we develop the respective upper bounds of the two terms on the right side of (96). We note that
where (a) follows from Fact 2 ; (b) follows because , for any two matrices and by Fact 3; (c) follows from the basic identity for any two real number ; (d) follows by noting the symmetry between and in the expression from last line; (e) follows because .
where (a) follows by applying the basic inequality for matrices of the same dimension with ; (b) follows because where the inequality step follows from the smoothness of each by Assumption 1, where the second step follows from Assumption 1, .
Substituting (97) and (101) into (96) yields
Summing over and noting that yields
where (a) follows after simplifying the double summations (by swapping the order of two summations and collecting coefficients for common terms)
Rearranging terms and dividing both sides by yields
Finally, this lemma follows by recalling that . ∎
To simplify the lengthy coefficient in Lemma 11, we have the following trivial corollary.
Consider problem (1) under Assumptions 1 and 2. Let defined in (7) be the node averages of local solutions from Algorithm 2 with Option I. If , then for all , we have
It follows because under our choice of . ∎
Now, we are ready to present the main proof of Theorem 3. Since Lemmas 9 and 10 under Algorithm 2 are respectively identical to Lemmas 3 and 4, repeating the steps from (38) to (43) in Section 6.3.2 yields:
Summing over yields
7 More Experiments
Recall our theory, e.g., Corollary 2, establishes the linear speedup property of Algorithm 1 (with ) using constant learning rates. We now verify our theory by training ResNet56 over CIFAR10 using learning rates faithful to our theory. We run SGD with Polyak’s momentum using momentum coefficient , batch size and constant learning rate . We also run Algorithm 1 with Option I and over GPUs. The local batch size at each GPU is and the learning rate is set to . Figure 4(a) plots the convergence of training loss in terms of the number of epochs that are jointly accessed by all used GPUs. Figure 4(a) verifies the linear speedup property of our Algorithm 1 since each GPU in Algorithm 1 only uses of the number of epochs, which are used by the single worker momentum SGD, to converge to the same loss value. A more straightforward verification is given in Figure 4(b) where the x-axis is the wall clock time.
7.2 Comparisons between Algorithm 1 and Existing Model Averaging with Momentum SGD
Our Algorithm 1 is quite similar to a common practice, known as “model averaging”, used for training deep neural networks with multiple workers. When the deep neural works are trained with momentum SGD, Microsoft’s CNTK framework (Seide & Agarwal, 2016) and recent work (Wang & Joshi, 2018b) suggest that each worker should additionally clear its local momentum buffer by reseting it to when they average their local models. In contrast, our Algorithm 1 averages both local momentum buffers and local models. There is no convergence guarantees shown in the literature for the model averaging strategy suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b). We run both Algorithm 1 and the model averaging with Polyak’s momentum SGD suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b), referred as “model averaging with cleared momentum” to train ResNet56 over CIFAR10. In the experiment, both Algorithm 1 and “model averaging with cleared momentum” perform one synchronization step, e.g., (6) in Algorithm 1, every iterations. Figure 5 shows that Algorithm 1 converges slightly faster. More impressively, the test accuracy attained by Algorithm 1 is roughly better.
7.3 Experiments with ImageNet
We further verify the performance of Algorithm 1 when used to train ResNet50 over ImageNet (Deng et al., 2009), which is an image classification task significantly harder than CIFAR10. We run Algorithm 1 with and the classical parallel SGD with momentum on a machine with NVIDIA P100 GPUs. The local batch size at each GPU is . The learning rate is initialized to and is divided by when all GPUs jointly access and epochs. The momentum coefficient is set to . Figure 6 plots the convergence of training loss and test accuracy in terms of the number of iteration steps. This figure again verifies that Algorithm 1 has the same convergence rate as the classical parallel momentum SGD. To verify the benefit of skipping communication, Figure 7 plots the convergence of training loss and test accuracy in terms of the wall clock time.
7.4 Experiments on Algorithm 1 with Option II
Figures 8 and 9 verify the convergence of Algorithm 1 with Option II. Their experiment configuration is identical to Figures 2 and 3 except that Polyak’s momentum in all algorithms is replaced with Nesterov’s momentum. While Figures 8 and 9 verify the convergence rate analysis proven in Theorem 2, our preliminary experiments seem to suggest Nesterov’s momentum is less robust to communication skipping since even a small leads to performance degradation of test accuracy.