On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Hao Yu, Rong Jin, Sen Yang

Introduction

Consider distributed non-convex optimization scenarios where $N$ workers jointly solve the following consensus optimization problem:

Stochastic Gradient Descent (SGD) (and its momentum variants) have been the dominating methodology for solving machine learning problem. For large-scale distributed machine learning problems, such as training deep neural networks, a parallel version of SGD, known as parallel mini-batch SGD, is widely adopted (Dean et al., 2012; Dekel et al., 2012; Li et al., 2014). With $N$ parallel workers, parallel mini-batch SGD can solve problem (1) with $O(1/\sqrt{NT})$ convergence, which is $N$ times fasterIf an algorithm has $O(1/\sqrt{T})$ convergence, then it takes $1/\epsilon^{2}$ iterations to attain an $O(\epsilon)$ accurate solution. Similarly, if another algorithm has $O(1/\sqrt{NT})$ convergence, then it takes $1/(N\epsilon^{2})$ iterations, which is $N$ times smaller than $1/\epsilon^{2}$ , to attain an $O(\epsilon)$ accurate solution. In this sense, the second algorithm is $N$ times faster than the first one. than the $O(1/\sqrt{T})$ convergence attained by SGD over a single node (Ghadimi & Lan, 2013; Lian et al., 2015). Obviously, such linear speedup with respect to (w.r.t.) number of workers is desired in distributed training as it implies perfect computation scalability by increasing the number of used workers. However, such linear speedup is difficult to harvest in practice because the classical parallel mini-batch SGD requires all workers to synchronize local gradients or models at every iteration such that inter-node communication cost becomes the performance bottleneck. To eliminate potential communication bottlenecks, such as high latency or low bandwidths, in computing infrastructures, many distributed SGD variants have been recently proposed. For example, (Lian et al., 2017; Jiang et al., 2017; Assran et al., 2018) consider decentralized parallel SGD where global gradient aggregations used in the classical parallel mini-batch SGD are replaced with local aggregations between neighbors. To reduce the communication cost at each iteration, compression or sparsification based parallel SGD are considered in (Seide et al., 2014; Strom, 2015; Aji & Heafield, 2017; Wen et al., 2017; Alistarh et al., 2017). Recently, (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) prove that certain parallel SGD variants that strategically skip communication rounds can achieve the fast $O(1/\sqrt{NT})$ convergence with significantly less communication rounds. See Table 1 for a summary on recent works studying distributed SGD with $O(1/\sqrt{NT})$ convergence and reduced communication complexity.

It is worth noting that existing convergence analyses on distributed methods for solving problem (1) focus on parallel SGD without momentum. In practice, momentum SGD is more commonly used to train deep neural networks since it often can converge faster and generalize better (Krizhevsky et al., 2012; Sutskever et al., 2013; Yan et al., 2018). For example, momentum SGD is suggested for training ResNet for image classification tasks to obtain the best test accuracy (He et al., 2016). See Figure 1 for the comparison of test accuracy between training with momentum and training without momentum.As observed in Figure 1, the final test accuracy of momentum SGD is roughly $2.5\%$ better than that of vanilla SGD (without momentum) over CIFAR10. In practice, people usually use momentum SGD to train ResNet in both single GPU and multiple GPU scenarios. Some practitioners conjecture that it is possible to avoid the performance degradation of vanilla SGD if its hyper-parameters are well tuned. However, even if this conjecture is true, the hyper-parameter tuning can be extremely time-consuming. In fact, while previous works (Lian et al., 2017; Stich, 2018; Yu et al., 2018; Jiang & Agrawal, 2018) prove that distributed vanilla SGD (without momentum) can train deep neural networks with $O(1/\sqrt{NT})$ convergence using significantly fewer communications rounds, most of their experiments use momentum SGD rather than vanilla SGD (to achieve the targeted test accuracy). In addition, previous empirical works (Chen & Huo, 2016; Lin et al., 2018) on distributed training for deep neural networks explicitly observe that momentum is necessary to improve the convergence and test accuracy. In this perspective, there is a huge gap between the current practices, i.e., using momentum SGD rather than vanilla SGD in distributed training for deep neural networks, and existing theoretical analyses such as (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) studying the convergence rate and communication complexity of SGD without momentum. Momentum methods are originally proposed by Polyak in (Polyak, 1964) to minimize deterministic strongly convex quadratic functions. Its convergence rate for deterministic convex optimization, which is not necessarily strongly convex, is established in (Ghadimi et al., 2014). For non-convex optimization, the convergence for deterministic non-convex optimization is proven in (Zavriev & Kostyuk, 1993) and the $O(1/\sqrt{T})$ convergence rate for stochastic non-convex optimization is recently shown in (Yan et al., 2018). However, to our best knowledge, it remains as an open question: “Whether any distributed momentum SGD can achieve the same $O(1/\sqrt{NT})$ convergence, i.e., linear speedup w.r.t. number of workers, with reduced communication complexity as SGD (without momentum) methods in (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018) ?”

Our Contributions: In this paper, we considers parallel restarted SGD with momentum (described in Algorithm 1), which can be viewed as the momentum extension of parallel restarted SGD, also referred as local SGD, considered in (Stich, 2018; Yu et al., 2018). Such a method is also faithful to the common practice “model averaging with momentum” used by practitioners for training deep neural networks in (Chen & Huo, 2016; McMahan et al., 2017; Su et al., 2018; Lin et al., 2018). We show that parallel restarted SGD with momentum can solve problem (1) with $O(1/\sqrt{NT})$ convergence, i.e., achieving linear speedup w.r.t. number of workers. Moreover, to achieve the fast $O(1/\sqrt{NT})$ convergence, $T$ iterations of our algorithm only requires $O(N^{3/2}T^{1/2})$ communication rounds when all workers can access identical data sets; or $O(N^{3/4}T^{3/4})$ communication rounds when workers access non-identical data sets. To our knowledge, this is the first time that a distributed momentum SGD method for non-convex stochastic optimization is proven to possess the same linear speedup property (with communication reduction) as distributed SGD (without momentum) in (Lian et al., 2017; Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018).

Recall that momentum SGD degrades to vanilla SGD when its momentum coefficient is set . Algorithm 1 with $\beta=0$ degrades to the parallel SGD methods (without momentum) considered in (Yu et al., 2018; Wang & Joshi, 2018a; Jiang & Agrawal, 2018). Our communication complexity results for parallel SGD without momentum cases also improve the state-of-the-art. As shown in Table 1, the number of required communication rounds shown in this paper is the fewest in both identical training data set case and non-identical data set case. In particular, this paper relaxes the bounded gradient moment assumption used in (Yu et al., 2018) to a milder bounded variance assumption and reduces the communication complexity for the case with identical training data. Our analysis applies to the distributed training scenarios where workers access non-identical training sets, e.g., training with sharded data or federated learning, that can not be handled in (Wang & Joshi, 2018a).

This paper further considers momentum SGD with decentralized communication and proves that it possess the same linear speedup property as its vanilla SGD (without momentum) counterpart considered in (Lian et al., 2017).

Parallel Restarted SGD with Momentum

Following the convention in (distributed) stochastic optimization, we assume each worker can independently evaluate unbiased stochastic gradients $\nabla F_{i}({\mathbf{x}}_{i};\xi_{i}),\xi_{i}\sim\mathcal{D}_{i}$ using its own local variable ${\mathbf{x}}_{i}$ . Throughout this paper, we have the following standard assumption:

Smoothness: Each function $f_{i}({\mathbf{x}})$ in problem (1) is smooth with modulus $L$ .

Bounded variances: There exist $\sigma>0$ and $\kappa>0$ such that

Note that $\sigma^{2}$ in Assumption 1 quantifies the variance of stochastic gradients at local worker, and $\kappa^{2}$ quantifies the deviations between each worker’s local objective function $f_{i}({\mathbf{x}})$ . For distributed training of neural networks, if all workers can access the same data set, then $\kappa=0$ . Assumption 1 was previously used in (Lian et al., 2017; Wen et al., 2017; Alistarh et al., 2017; Jiang & Agrawal, 2018). The bounded variance assumption in Assumption 1 is milder than the bounded second moment assumption used in (Stich, 2018; Yu et al., 2018).

Recall that if function $f(\cdot)$ is smooth with modulus $L$ , then we have the key property that $f({\mathbf{x}})\leq f({\mathbf{y}})+\langle\nabla f({\mathbf{y}}),{\mathbf{x}}-{\mathbf{y}}\rangle+\frac{L}{2}\|{{\mathbf{y}}-{\mathbf{x}}}\|^{2},\forall{\mathbf{x}},{\mathbf{y}}$ where $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors. Another two useful properties implied by Assumption 1 are summarized in the next two lemmas.

Consider problem (1) under Assumption 1. For any $N$ points ${\mathbf{x}}_{i},i\in\{1,2,\ldots,N\}$ , if we define $\bar{{\mathbf{x}}}=\frac{1}{N}\sum_{i=1}^{N}{\mathbf{x}}_{i}$ ,then we have

2 Parallel Restarted SGD with Momentum

Consider applying the distributed algorithm described in Algorithm 1 to solve problem (1) with $N$ workers. In the literature, there are two momentum methods for SGD, i.e., Polyak’s momentum method (also known as the heavy ball method) and Nesterov’s momentum method. Algorithm 1 can use either of them for local worker updates by providing two options: “Option I” is Polyak’s momentum; “Option II” is Nesterov’s momentum.The current description in (4) and (5) is identical to the default implementations of momentum methods in PyTorch’s SGD optimizer. Polyak’s and Nesterov’s momentum have many other equivalent representations. These equivalent representations will be further discussed later. We also note that the update steps described in (4) or (5) can be locally performed at each worker in parallel.

The synchronization step (6) can be interpreted as restarting momentum SGD with new initial point $\hat{{\mathbf{x}}}$ and momentum buffer $\hat{{\mathbf{u}}}$ , i.e., resetting ${\mathbf{u}}_{i}^{(t)}=\hat{{\mathbf{u}}}$ and ${\mathbf{x}}_{i}^{(t)}=\hat{{\mathbf{x}}}$ . In this perspective, we call Algorithm 1 “parallel restarted SGD with momentum”. Note that if we choose $\beta=0$ in Algorithm 1, then it degrades to the “parallel restarted SGD”, also known as “local SGD” or “SGD with periodic averaging”, in (Stich, 2018; Yu et al., 2018; Lin et al., 2018; Wang & Joshi, 2018a; Zhou & Cong, 2018; Jiang & Agrawal, 2018).

Since inter-node communication is only needed by Algorithm 1 to calculate global averages $\hat{{\mathbf{x}}}$ and $\hat{{\mathbf{u}}}$ in (6) and happens only once every $I$ iterations, the total number of inter-communication rounds involved in $T$ iterations of Algorithm 1 is given by $T/I$ . If we use all-reduce operations to compute the global averages, the per-round communication is relatively cheap and does not involve a centralized parameter server (Goyal et al., 2017).

Algorithm 1 uses $\{{\mathbf{x}}_{i}^{(t)}\}_{t\geq 0}$ to denote local solution sequences generated at each worker $i$ . For each iteration $t$ , we define

as the averages of local solutions ${\mathbf{x}}_{i}^{(t)}$ across all $N$ nodes. Our performance analysis will be performed regarding the aggregated version $\bar{{\mathbf{x}}}^{(t)}$ defined in (7) so that “consensus” is no longer our concern for problem (1). Performing convergence analysis for the node-average version $\bar{{\mathbf{x}}}^{(t)}$ has been an important technique used in previous works on distributed consensus optimization (Lian et al., 2017; Mania et al., 2017; Stich, 2018; Yu et al., 2018; Jiang & Agrawal, 2018).

3 Performance Analysis

This subsection analyzes the convergence rates of Algorithm 1 with Polyak’s and Nesterov’s momentum, respectively.

We first consider Algorithm 1 with Option I given by (4), i.e., Polyak’s momentum. Polyak’s momentum SGD, also known as the heavy ball method, is the classical momentum SGD used for training deep neural networks and often provides fast convergence and good generalization (Krizhevsky et al., 2012; Sutskever et al., 2013; Yan et al., 2018). If we let ${\mathbf{x}}_{i}^{(-1)}={\mathbf{x}}_{i}^{(0)}$ , then (4) in Algorithm 1 can be equivalently written as the following single variable version

where the last term $\beta[{\mathbf{x}}_{i}^{(t-1)}-{\mathbf{x}}_{i}^{(t-2)}]$ is often called Polyak’s momentum term.

It’s interesting to note that if we define an auxiliary sequence

which is the node average sequence of local buffer variables ${\mathbf{u}}_{i}^{(t)}$ from Algorithm 1 with Option I, then we have

where $\bar{{\mathbf{x}}}^{(t)}$ is defined in (7).

An important observation is that if $\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(t-1)}$ in (10) is replaced with $\nabla f(\bar{{\mathbf{x}}}^{(t-1)})$ , then (10) is exactly a standard single-node SGD momentum method with momentum buffer $\bar{{\mathbf{u}}}^{(t)}$ and solution $\bar{{\mathbf{x}}}^{(t)}$ . That is, under Algorithm 1, $N$ workers essentially jointly update $\bar{{\mathbf{u}}}^{(t)}$ and $\bar{{\mathbf{x}}}^{(t)}$ with momentum SGD using an inaccurate stochastic gradient $\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(t-1)}$ . However, since (6) periodically (every $I$ iterations) synchronizes all local variables ${\mathbf{u}}_{i}^{(t)}$ and ${\mathbf{x}}_{i}^{(t)}$ , our intuition is by synchronizing frequently enough the inaccuracy of the used gradient counterpart in (10) shall not damage the convergence too much. The next theorem summarizes the convergence rate of Algorithm 1 and characterizes the effect of synchronization interval $I$ in its convergence.

Our Algorithm 1 is different from a common heuristic model averaging strategy for momentum SGD suggested in (Wang & Joshi, 2018b) and in Microsoft’s CNTK framework (Seide & Agarwal, 2016). The strategy in (Seide & Agarwal, 2016; Wang & Joshi, 2018b) let each worker run local momentum SGD in parallel, and periodically reset momentum buffer variables to zero and average local models. However, there is no convergence analysis for this strategy. In contrast, this paper shall provide rigorous convergence analysis for our Algorithm 1. Our experiment in Supplement 6.7.2 furthers shows that Algorithm 1 has faster convergence than the strategy in (Seide & Agarwal, 2016; Wang & Joshi, 2018b) and yields better test accuracy when used to train ResNet for CIFAR10.

Consider problem (1) under Assumption 1. If we choose $\gamma\leq\frac{(1-\beta)^{2}}{(1+\beta)L}$ and $I\leq\frac{1-\beta}{6L\gamma}$ in Algorithm 1 with Option I, then for all $T\geq 1$ , we have

where $\{\bar{\mathbf{x}}^{(t)}\}_{t\geq 0}$ is the sequence defined in (7); and $\sigma$ and $\kappa$ are constants defined in Assumption 1; and $f^{\ast}$ is the minimum value of problem (1).

The next corollary summarizes that Algorithm 1 with Option I using $N$ workers can solve problem (1) with the fast $O(\frac{1}{\sqrt{NT}})$ convergence, i.e., achieving the linear speedup (w.r.t. number of workers). Note that $O(\frac{1}{\sqrt{NT}})$ dominates $O(\frac{N}{T})$ in Corollary 1 when $T$ is sufficiently large.

Consider problem (1) under Assumption 1. If we choose $\gamma=\frac{\sqrt{N}}{\sqrt{T}}$ and $I=1$ in Algorithm 1 with Option I, then for any $T\geq\frac{36L^{2}N}{(1-\beta)^{2}}$ , we have

This corollary follows because the selection of $\gamma$ and $I$ satisfies the conditions in Theorem 1. ∎

The next corollary summarizes that the desired linear speedup can be attained with communication skipping, i.e., using $I>1$ in Algorithm 1. In particular, to achieve the linear speedup, $T$ iterations in Algorithm 1 with Option I only require $O(N^{3/2}T^{1/2})$ communication rounds when $\kappa=0$ , i.e., workers access the common training data set; or only require $O(N^{3/4}T^{3/4})$ communication rounds when $\kappa\neq 0$ , i.e., workers access non-identical training data sets.

Case $\kappa=0$ : If we choose $\gamma=\frac{\sqrt{N}}{\sqrt{T}}$ and $I\leq\frac{1-\beta}{6L}\frac{\sqrt{T}}{N^{3/2}}=O(\frac{T^{1/2}}{N^{3/2}})$ in Algorithm 1 with Option I, then for any $T\geq\frac{(1+\beta)^{2}L^{2}}{(1-\beta)^{4}}N$ , we have

By using $I=O(\frac{T^{1/2}}{N^{3/2}})$ , the total number of inter-node communication rounds involved in Algorithm 1 is $O(N^{3/2}T^{1/2})$ .

Case $\kappa\neq 0$ : If we choose $\gamma=\frac{\sqrt{N}}{\sqrt{T}}$ and $I\leq\frac{1-\beta}{6L}\frac{T^{1/4}}{N^{3/4}}=O(\frac{T^{1/4}}{N^{3/4}})$ in Algorithm 1 with Option I, then for any $T\geq\frac{(1+\beta)^{2}L^{2}}{(1-\beta)^{4}}N$ , we have

By using $I=O(\frac{T^{1/4}}{N^{3/4}})$ , the total number of inter-node communication rounds involved in Algorithm 1 is $O(N^{3/4}T^{3/4})$ .

This corollary follows simply because the selection of $\gamma$ and $I$ satisfies the conditions in Theorem 1. ∎

Note that Theorem 1 and Corollary 2 hold for any $0\leq\beta<1$ . Recall that Algorithm 1 with $\beta=0$ degrades to parallel restarted SGD (without momentum). For parallel SGD (without momentum), the communication complexity implied by Corollary 2 improves the state-of-the-art as summarized in Table 1.

Now consider using Nesterov’s momentum described in Option II in Algorithm 1. Option II given by (5) introduces extra auxiliary variables $\bar{{\mathbf{v}}}_{i}^{(t)}$ and uses them in the update of local solutions ${\mathbf{x}}_{i}^{(t)}$ . It is not difficultThe derivation on the equivalence is in Supplement 6.4. to show that (5) yields the same solution sequences $\{{\mathbf{x}}_{i}^{(t)}\}_{t\geq 0}$ as

with $\mathbf{y}_{i}^{(0)}\overset{\Delta}{=}{\mathbf{0}},\forall i$ . The only difference between (5) and (14) is the adoption of different cache variables.

Equation (14) is more widely used to describe SGD with Nesterov’s momentum in the literature (Nesterov, 2004). However, by writing Nesterov’s momentum SGD as (5), an important observation is that momentum buffer variables ${\mathbf{u}}_{i}^{(t)}$ in Polyak’s and Nesterov’s momentum methods evolve according to the same dynamic (with stochastic gradients sampled at different points). By adapting the convergence rate analysis for Polyak’s momentum, Theorem 2 (formally proven in Supplement 6.5) summarizes the convergence for Nesterov’s momentum.

Consider problem (1) under Assumption 1. If we choose $\gamma\leq\frac{(1-\beta)^{2}}{L(1+\beta^{3})}$ and $I\leq\frac{1-\beta}{6L\gamma}$ in Algorithm 1 with Option II, then for all $T>0$ , we have

where $\{\bar{\mathbf{x}}^{(t)}\}_{t\geq 0}$ is the sequence defined in (7); and $\sigma$ and $\kappa$ are constants defined in Assumption 1; and $f^{\ast}$ is the minimum value of problem (1).

By comparing Theorem 1 and Theorem 2, we conclude that the order of convergence rate for both options in Algorithm 1 (with slightly different choices of learning rate $\gamma$ ) have the same dependence on $\gamma,N$ and $I$ . It is then immediate that Algorithm 1 with Option II can achieve the linear speedup with the same (reduced) communication complexity as summarized in Corollary 2 for Option I.

Extension: Momentum SGD with Decentralized Communication

where $\lambda_{i}({\mathbf{W}})$ is the $i$ -th largest eigenvalue of ${\mathbf{W}}$ .

Compared with PR-SGD-Momentum described in Algorithm 1, Algorithm 2 uses a local aggregation step (18) at every iteration. While Algorithm 2 can not skip the aggregation steps as Algorithm 1 does and hence involves $I$ times more communication rounds, the per-step communication is more flexible since each node only computes neighborhood averages rather than global averages. As a consequence, Algorithm 2 is more suitable to distributed machine learning with mobiles where the network topology is heterogeneous and diverse. By extending our analysis for Algorithm 1, the next theorem proves the linear speedup of Algorithm 2.

Consider problem (1) under Assumptions 1 and 2. If we choose $\gamma\leq\min\{\frac{(1-\beta)^{2}(1-\sqrt{\rho})^{2}}{6L},\frac{(1-\beta)(1-\sqrt{\rho})}{4L}\}$ in Algorithm 2, then for all $T\geq 1$ , we have

where $\{\bar{\mathbf{x}}^{(t)}\}_{t\geq 0}$ defined in (7) are the averages of local solutions across $N$ workers; and $\sigma,\kappa$ and $\rho$ are constants defined in Assumptions 1-2.

This next corollary further summarizes that using $\gamma=O(\frac{\sqrt{N}}{\sqrt{T}})$ in Algorithm 2 can achieve $O(\frac{1}{\sqrt{NT}})$ convergence with the linear speedup.

Consider problem (1) under Assumptions 1 and 2. If we choose $\gamma=\frac{\sqrt{N}}{\sqrt{T}}$ in Algorithm 2, then for all $T\geq\max\{\frac{36NL^{2}}{(1-\beta)^{4}(1-\sqrt{\rho})^{4}},\frac{32NL^{2}}{(1-\beta)^{2}(1-\sqrt{\rho})^{2}}\}$ , we have

where $\{\bar{\mathbf{x}}^{(t)}\}_{t\geq 0}$ defined in (7) are the averages of local solutions across $N$ workers.

Note that Algorithm 2 with $\beta=0$ degrades to the decentralized SGD without momentum considered in (Lian et al., 2017), where the linear speedup of decentralized SGD is first shown. Recently, many variants of decentralized SGD without momentum with linear speedup have been developed for distributed deep learning (Tang et al., 2018; Wang & Joshi, 2018a; Assran et al., 2018). To our knowledge, this is the first time that decentralized momentum SGD is shown to possess the same linear speedup.

Experiments

In this section, we validate our theory with experiments on training ResNet (He et al., 2016) for the image classification tasks over CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009). Our experiments are conducted on a machine with $8$ NVIDIA P100 GPUs. The local batch size at each GPU is $64$ . The learning rate is initialized to $0.3$ and is divided by $10$ when all GPUs jointly access $80$ and $120$ epochs.Such decaying learning rates are used to achieve good test accuracy by practitioners (He et al., 2016). This deviates from the constant learning rates used in the theory to establish the linear speedup. On one hand, it is possible follow the analysis techniques in (Yu et al., 2018) to extend this paper’s theory to establish a similar linear speedup. On the other hand, our supplement 6.7.1 reports extra experiments to verify the linear speedup of Algorithm 1 using constant learning rates faithful to the theory. The momentum coefficient , i.e., $\beta$ in Algorithm 1, is set to $0.9$ for both Polyak’s and Nesterov’s momentum. All algorithms are implemented using PyTorch $0.4$ . To study how communication skipping affects the convergence of Algorithm 1, we run Algorithm 1 with $I\in\{4,8,16,32\}$ and compare the test accuracy convergence between Algorithm 1 and the classical parallel mini-batch SGD with momentum, which uses an inter-node communication step at every iteration and can be interpreted as Algorithm 1 with $I=1$ . Figure 2 plots the convergence of training loss and test accuracy in terms of the number of epochs that are jointly accessed by all used GPUs. That is, if the $x$ -axis value is $8$ , then each GPU access $1$ epoch of training data. The same convention is followed by other figures for multiple GPU training in this paper. By plotting the convergence in terms of the number of epochs that are jointly accessed by all GPUs, we can verify the $O(\frac{1}{\sqrt{NT}})$ convergence for Algorithm 1 proven in this paper. To verify the benefit of skipping communication in our Algorithm 1, Figure 3 plots the convergence of training loss and test accuracy in terms of the wall clock time. Since Algorithm 1 with $I>0$ skip communication steps, it is much faster than the classical parallel momentum SGD when measured by the wall clock time. More numerical experiments on training ResNet over ImageNet, comparisons with a model averaging strategy suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b), and Algorithm 1 with Nesterov’s momentum are available in Supplement 6.7.

Conclusion

This paper considers parallel restarted SGD with momentum and prove that it can achieve $O(1/\sqrt{NT})$ convergence with $O(N^{3/2}T^{1/2})$ or $O(N^{3/4}T^{3/4})$ communication rounds depending whether each node accesses identical objective functions $f_{i}({\mathbf{x}})$ or not. We further show that distributed momentum SGD with decentralized communication can achieve $O(1/\sqrt{NT})$ convergence.

References

Supplement

The lemma follows simply because unbiased stochastic gradients ${\mathbf{g}}_{i}$ are independently sampled. The formal proof is as follows:

2 Proof of Lemma 2

This lemma follows from simple algebraic manipulations as follows:

where (a) follows from the basic inequality $\|{{\mathbf{a}}_{1}+{\mathbf{a}}_{2}+{\mathbf{a}}_{3}}\|^{2}\leq 3\|{{\mathbf{a}}_{1}}\|^{2}+3\|{{\mathbf{a}}_{2}}\|^{2}+3\|{{\mathbf{a}}_{3}}\|^{2}$ for any vectors ${\mathbf{a}}_{1},{\mathbf{a}}_{2}$ and ${\mathbf{a}}_{3}$ ; (b) follows because $\|\nabla f_{i}({\mathbf{x}}_{i})-\nabla f_{i}(\bar{{\mathbf{x}}})\|\leq L\|{\mathbf{x}}_{i}-\bar{{\mathbf{x}}}\|$ by the smoothness of $f_{i}(\cdot)$ in Assumption 1 and $\|\nabla f(\bar{{\mathbf{x}}})-\frac{1}{N}\sum_{j=1}^{N}\nabla f_{j}({\mathbf{x}}_{j})\|^{2}=\|\frac{1}{N}\sum_{j=1}^{N}\nabla f_{j}(\bar{{\mathbf{x}}})-\frac{1}{N}\sum_{j=1}^{N}\nabla f_{j}({\mathbf{x}}_{j})\|^{2}\leq\frac{1}{N}\sum_{j=1}^{N}\|\nabla f_{j}(\bar{{\mathbf{x}}})-\nabla f_{j}({\mathbf{x}}_{j})\|^{2}\leq\frac{1}{N}\sum_{j=1}^{N}L^{2}\|\bar{{\mathbf{x}}}-{\mathbf{x}}_{j}\|^{2}$ , where the first inequality follows from the convexity of $\|\cdot\|^{2}$ and Jensen’s inequality and the second inequality follows from the smoothness of each $f_{j}(\cdot)$ .

3 Proof of Theorem 1

To prove this theorem, we first introduce an auxiliary sequence $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ given by

where $\bar{{\mathbf{x}}}^{(t)}$ defined in (7) are the node averages of local solutions from Algorithm 1 with Option I. A similar auxiliary sequence $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ has been used for the convergence analysis of standard momentum methods (single node without restarting) in (Ghadimi et al., 2014) and (Yan et al., 2018).

Before the main proof of Theorem 1, we further introduce several useful lemmas.

Consider the sequence $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ defined in (21). Algorithm 1 with Option I ensures that for all $t\geq 0$ , we have

To prove $\bar{{\mathbf{z}}}^{(t+1)}-\bar{{\mathbf{z}}}^{(t)}=-\frac{\gamma}{1-\beta}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(t)}$ , we consider $t=0$ and $t\geq 1$ separately.

where (a) follows from (21); (b) follows from (10) by noting that $\bar{{\mathbf{u}}}^{(0)}={\mathbf{0}}$ .

where (a) follows from (21); and (b) and (c) follow from (10).

Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 1 with Option I. Let $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ be defined in (21). For all $T\geq 1$ , Algorithm 1 with Option I ensures that

Recall that $\bar{{\mathbf{u}}}^{(0)}={\mathbf{0}}$ . Recursively applying the first equation in (10) for $t$ times yields:

where (a) follows from (21) and (b) follows from the second equation in (10).

Define $s_{t}\overset{\Delta}{=}\sum_{\tau=0}^{t-1}\beta^{t-1-\tau}=\frac{1-\beta^{t}}{1-\beta}$ . For all $t\geq 1$ , we have

where (a) follows from the convexity of $\|\cdot\|^{2}$ and Jensen’s inequality.

Fix $T\geq 1$ . Note that $\bar{{\mathbf{z}}}^{(0)}-\bar{{\mathbf{x}}}^{(0)}={\mathbf{0}}$ . Summing (27) over $t\in\{1,2,\ldots,T-1\}$ yields

where (a) follows by noting that $\sum_{l=\tau+1}^{T-1}\beta^{l-1-\tau}=\frac{1-\beta^{T-\tau-1}}{1-\beta}\leq\frac{1}{1-\beta}$ . ∎

Consider problem (1) under Assumption 1. Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 1 with Option I. If $\gamma$ and $I$ are chosen to satisfy $\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}<1$ , then for all $T\geq 1$ , we have

For any $\tau\in\{t_{0}+1,\ldots,t\}$ , by the second equation in (4), we have

Summing over $\tau\in\{t_{0}+1,\ldots,t\}$ and noting that ${\mathbf{x}}_{i}^{(t_{0})}=\hat{{\mathbf{x}}}$ yields

Using an argument similar to the above, we can further show

where (a) follows from the basic inequality $\|{{\mathbf{a}}_{1}+{\mathbf{a}}_{2}}\|^{2}\leq 2\|{{\mathbf{a}}_{1}}\|^{2}+2\|{{\mathbf{a}}_{2}}\|^{2}$ .

Now we develop the respective upper bounds of the two terms on the right side of (33). We note that

Recall that for each $t$ , the index $t_{0}$ in the above equation is the largest integer such that $t_{0}<t$ , $t_{0}~{}\text{mod}~{}I=0$ and $t-t_{0}<I$ . Summing (36) over $t\in\{0,1,\ldots,T\}$ that are not a multiple of $I$ and noting that $\bar{{\mathbf{x}}}^{(t)}-{\mathbf{x}}_{i}^{(t)}={\mathbf{0}}$ if $t$ is a multiple of $I$ yields

Collecting common terms and dividing both sides by $1-\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}$ yields

3.2 Main Proof of Theorem 1

With the above lemmas, we are ready to present the main proof of Theorem 1.

Fix $t\geq 0$ . By the smoothness of function $f(\cdot)$ (in Assumption 1), we have

where (a) follows by applying the basic inequality $\langle{\mathbf{a}},{\mathbf{b}}\rangle\leq\frac{1}{2}\|{{\mathbf{a}}}\|^{2}+\frac{1}{2}\|{{\mathbf{b}}}\|^{2}$ with ${\mathbf{a}}=-\frac{\sqrt{1-\beta}}{\sqrt{\beta L}}[\nabla f(\bar{{\mathbf{z}}}^{(t)})-\nabla f(\bar{{\mathbf{x}}}^{(t)})]$ and ${\mathbf{b}}=\frac{\gamma\sqrt{\beta L}}{(1-\beta)^{3/2}}\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})$ ; and (b) follows from the smoothness of function $f(\cdot)$ .

Applying the basic identity $\langle{\mathbf{a}},{\mathbf{b}}\rangle=\frac{1}{2}[\|{{\mathbf{a}}}\|^{2}+\|{{\mathbf{b}}}\|^{2}-\|{{\mathbf{a}}-{\mathbf{b}}}\|^{2}]$ with ${\mathbf{a}}=\nabla f(\bar{{\mathbf{x}}}^{(t)})$ and ${\mathbf{b}}=\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})$ yields

where (a) follows because $\|\nabla f(\bar{{\mathbf{x}}}^{(t)})-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})\|^{2}=\|\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\bar{{\mathbf{x}}}^{(t)})-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})\|^{2}\leq\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_{i}(\bar{{\mathbf{x}}}^{(t)})-\nabla f_{i}({\mathbf{x}}_{i}^{(t)})\|^{2}\leq\frac{1}{N}\sum_{i=1}^{N}L^{2}\|\bar{{\mathbf{x}}}^{(t)}-{\mathbf{x}}_{i}^{(t)}\|^{2}$ , where the first inequality follows from the convexity of $\|\cdot\|^{2}$ and Jensen’s inequality and the second inequality follows from the smoothness of each $f_{i}(\cdot)$ .

Dividing both sides by $\frac{\gamma}{2(1-\beta)}$ and rearranging terms yields

where (a) follows because $I\leq\frac{1-\beta}{6L\gamma}$ is chosen to ensure $\frac{1}{1-\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}}\leq\frac{3}{2}$ .

4 On the equivalence between (5) and (14)

In this subsection, we show both and yield the same solution sequences $\{{\mathbf{x}}_{i}^{(t)}\}_{t\geq 0}$ assume they are initialized at the same ${\mathbf{x}}_{i}^{(0)}$ . It is easy to verify that (5) and (14) yield the same ${\mathbf{x}}_{i}^{(1)}$ by noting that ${\mathbf{y}}_{i}^{(0)}=0$ and ${\mathbf{u}}_{i}^{(0)}={\mathbf{0}}$ . Next, we show they yield the same $\{{\mathbf{x}}_{i}^{(0)}\}_{t\geq 2}$ .

Substituting the second equation of (5) into the third equation of (5) yields

Fix $t\geq 2$ . Substituting the first equation of (5) into the above equation yields

where (a) follows by noting that $-\gamma\beta^{2}{\mathbf{u}}_{i}^{(t-1)}=\gamma\beta{\mathbf{g}}_{i}^{(t-2)}+\beta[{\mathbf{x}}_{i}^{(t-1)}-{\mathbf{x}}_{i}^{(t-2)}],\forall t\geq 2$ from (48).

Now if we define ${\mathbf{y}}_{i}^{(t)}\overset{\Delta}{=}{\mathbf{x}}_{i}^{(t-1)}-\gamma{\mathbf{g}}_{i}^{(t-1)}$ , which is the first equation in (14), then the above equation can be written as

which is exactly the second equation in (14).

Thus, we can conclude that (5) and (14) yield the same solution sequences $\{{\mathbf{x}}_{i}^{(0)}\}_{t\geq 0}$ (with different auxiliary/buffer sequences.)

5 Proof of Theorem 2

To establish the convergence rate for Algorithm 1 with Nesterov’s momentum, we introduce a different auxiliary sequence $\{\bar{{\mathbf{y}}}^{(t)}\}_{t\geq 0}$ given by

where $\bar{{\mathbf{x}}}^{(t)}$ defined in (7) are the node averages of local solutions from Algorithm 1 with Option II.

Using the notation $\bar{{\mathbf{u}}}^{(t)}\overset{\Delta}{=}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{u}}_{i}^{(t)}$ in (9) and $\bar{{\mathbf{v}}}^{(t)}\overset{\Delta}{=}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{v}}_{i}^{(t)}$ with ${\mathbf{u}}_{i}^{(t)}$ and ${\mathbf{v}}_{i}^{(t)}$ from Algorithm 1 with Option II, we shall have

The main proof shall follow similar steps as in our main proof for Polyak’s momentum in Section 6.3.2 as long as we prove the counterpart of Lemmas 3-5 for Algorithm 1 with Option II.

While the sequence $\{\bar{{\mathbf{y}}}^{(t)}\}_{t\geq 0}$ introduced in (53) is different from $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ defined in (21) for Polyak’s momentum, the next lemma shows that $\bar{{\mathbf{y}}}^{(t+1)}-\bar{{\mathbf{y}}}^{(t)}$ for Algorithm 1 with Option II is equal to $\bar{{\mathbf{z}}}^{(t+1)}-\bar{{\mathbf{z}}}^{(t)}$ for Algorithm 1 with Option I.

Consider the sequence $\{\bar{{\mathbf{y}}}^{(t)}\}_{t\geq 0}$ defined in (53). Algorithm 1 with Option II ensures that for all $t\geq 0$ , we have

The case $\bar{{\mathbf{y}}}^{(1)}-\bar{{\mathbf{y}}}^{(0)}=-\frac{\gamma}{1-\beta}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(0)}$ can be shown directly by combining (53) and (54) with $t=0$ . To prove the case $t\geq 1$ , we note that

where (a) follows from (53); (b) follows from the third equation in (54); and (c) follows by substituting $\bar{{\mathbf{v}}}^{(t)}=\beta\bar{{\mathbf{u}}}^{(t)}+\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(t-1)}$ and $\bar{{\mathbf{v}}}^{(t+1)}=\beta^{2}\bar{{\mathbf{u}}}^{(t)}+(1+\beta)\frac{1}{N}\sum_{i=1}^{N}{\mathbf{g}}_{i}^{(t)}$ , which is implied by combining the first and second equations in (54). ∎

Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 1 with Option II. Let $\{\bar{{\mathbf{y}}}^{(t)}\}_{t\geq 0}$ be defined in (53). For all $T\geq 1$ , Algorithm 1 with Option II ensures that

Recall that $\bar{{\mathbf{u}}}^{(0)}={\mathbf{0}}$ . Recursively applying the first equation in (54) for $t$ times yields:

where (a) follows from (53); (b) follows from the third equation in (54); and (c) follows from the second equation in (54).

Define $s_{t}\overset{\Delta}{=}\sum_{\tau=0}^{t-1}\beta^{t-1-\tau}=\frac{1-\beta^{t}}{1-\beta}$ . For all $t\geq 1$ , we have

where (a) follows from the convexity of $\|\cdot\|^{2}$ and Jensen’s inequality.

Fix $T\geq 1$ . Recall that $\bar{{\mathbf{y}}}^{(0)}-\bar{{\mathbf{x}}}^{(0)}={\mathbf{0}}$ . Summing (60) over $t\in\{1,2,\ldots,T-1\}$ yields

where (a) follows by noting that $\sum_{l=\tau+1}^{T-1}\beta^{l-1-\tau}=\frac{1-\beta^{T-\tau-1}}{1-\beta}\leq\frac{1}{1-\beta}$ . ∎

Consider problem (1) under Assumption 1. Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 1 with Option II. If $\gamma$ and $I$ are chosen to satisfy $\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}<1$ , then for all $T\geq 1$ , we have

By the second equation in (5), this further implies

For any $\tau\in\{t_{0}+1,\ldots,t\}$ , by the third equation in (5), we have

Summing over $\tau\in\{t_{0}+1,\ldots,t\}$ and noting that ${\mathbf{x}}_{i}^{(t_{0})}=\hat{{\mathbf{x}}}$ yields

Using a argument similar to the above, we can further show

where (a) follows from the basic inequality $\|{{\mathbf{a}}_{1}+{\mathbf{a}}_{2}}\|^{2}\leq 2\|{{\mathbf{a}}_{1}}\|^{2}+2\|{{\mathbf{a}}_{2}}\|^{2}$ .

Now we develop the respective upper bounds of the two terms on the right side of (67). We note that

Recall that for each $t$ , the index $t_{0}$ in the above equation is the largest integer such that $t_{0}<t$ , $t_{0}~{}\text{mod}~{}I=0$ and $t-t_{0}<I$ . Summing (70) over $t\in\{0,1,\ldots,T\}$ that are not a multiple of $I$ and noting that $\bar{{\mathbf{x}}}^{(t)}-{\mathbf{x}}_{i}^{(t)}={\mathbf{0}}$ if $t$ is a multiple of $I$ yields

Collecting common terms and dividing both sides by $1-\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}$ yields

5.2 Main Proof of Theorem 2

It is easy to realize that Lemmas 6, 7 and 8 developed above are the respective counterparts for Lemmas 3, 4 and 5. The main proof of Theorem 2 follows similar steps as the proof of Theorem 1 in Section 6.3 with the minor changes that Lemmas 3, 4 and 5 should be replaced by Lemmas 6, 7 and 8, respectively.

Fix $t\geq 0$ . By the smoothness of function $f(\cdot)$ (in Assumption 1), we have

where (a) follows by applying the basic inequality $\langle{\mathbf{a}},{\mathbf{b}}\rangle\leq\frac{1}{2}\|{{\mathbf{a}}}\|^{2}+\frac{1}{2}\|{{\mathbf{b}}}\|^{2}$ with ${\mathbf{a}}=-\frac{\sqrt{1-\beta}}{\sqrt{L}\beta^{3/2}}[\nabla f(\bar{{\mathbf{y}}}^{(t)})-\nabla f(\bar{{\mathbf{x}}}^{(t)})]$ and ${\mathbf{b}}=\frac{\gamma\sqrt{L}\beta^{3/2}}{(1-\beta)^{3/2}}\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})$ ; and (b) follows from the smoothness of function $f(\cdot)$ .

where (a) follows because $\|\nabla f(\bar{{\mathbf{x}}}^{(t)})-\frac{1}{N}\sum_{j=1}^{N}\nabla f_{j}({\mathbf{x}}_{j}^{(t)})\|^{2}=\|\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\bar{{\mathbf{x}}}^{(t)})-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}({\mathbf{x}}_{i}^{(t)})\|^{2}\leq\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_{i}(\bar{{\mathbf{x}}}^{(t)})-\nabla f_{i}({\mathbf{x}}_{i}^{(t)})\|^{2}\leq\frac{1}{N}\sum_{i=1}^{N}L^{2}\|\bar{{\mathbf{x}}}^{(t)}-{\mathbf{x}}_{i}^{(t)}\|^{2}$ , where the first inequality follows from the convexity of $\|\cdot\|^{2}$ and Jensen’s inequality and the second inequality follows from the smoothness of each $f_{i}(\cdot)$ .

Dividing both sides by $\frac{\gamma}{2(1-\beta)}$ and rearranging terms yields

where (a) follows because $I\leq\frac{1-\beta}{6L\gamma}$ is chosen to ensure $\frac{1}{1-\frac{12L^{2}\gamma^{2}I^{2}}{(1-\beta)^{2}}}\leq\frac{3}{2}$ .

6 Proof of Theorem 3

This section provides the complete proof for Algorithm 2 with Option I (Polyak’s momentum). An extension to Algorithm 2 with Option II can be similarly done as our extension in Section 6.5 for Algorithm 1 from Option I to Option II.

Let $\bar{{\mathbf{u}}}^{(t)}$ and $\bar{{\mathbf{x}}}^{(t)}$ (using the forms in (9) and (7), respectively) be the $N$ node averages of local variables ${\mathbf{u}}_{i}^{(t)}$ and ${\mathbf{x}}_{i}^{(t)}$ from Algorithm 2 with Option I. It is not difficult to show that $\bar{{\mathbf{u}}}^{(t)}$ and $\bar{{\mathbf{x}}}^{(t)}$ satisfiy the following fact.

Let $\bar{{\mathbf{u}}}^{(t)}\overset{\Delta}{=}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{u}}_{i}^{(t)}$ and $\bar{{\mathbf{x}}}^{(t)}\overset{\Delta}{=}\frac{1}{N}\sum_{i=1}^{N}{\mathbf{x}}_{i}^{(t)}$ be node averages of local variables from Algorithm 2 with Option I. For all $t\geq 1$ , we have

By the definition of $\bar{{\mathbf{u}}}^{(t)}$ , we have

where (a) follows by substituting the first equation in (18); (b) follows by recalling that $\sum_{j=1}^{N}W_{ji}=1$ for doubly stochastic matrix ${\mathbf{W}}$ ; (c) follows by substituting the first equation in (16); and (d) follows from the definition of $\bar{{\mathbf{u}}}^{(t)}$ .

where (a) follows from the definition of $\bar{{\mathbf{x}}}^{(t)}$ ; (b) follows by substituting the second equation in (18); (c) follows byrecalling that $\sum_{j=1}^{N}W_{ji}=1$ for doubly stochastic matrix ${\mathbf{W}}$ ; and (d) follows by using the definition of $\bar{{\mathbf{x}}}^{(t)}$ and equation (83).

It is remarkable that Fact 1 implies that even if we decentralized local averaging in Algorithm 2, the yielded global averages $\{\bar{{\mathbf{u}}}^{(t)}\}_{t\geq 0}$ and $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ follow the same dynamics as the global averages in Algorithm 1.

Let $\bar{{\mathbf{x}}}^{(t)}=\frac{1}{N}\sum_{i=1}^{N}{\mathbf{x}}_{i}^{(t)}$ be node averages of local variables ${\mathbf{x}}_{i}^{(t)}$ from Algorithm 2 with Option I. We again define the auxiliary sequence $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ via

Since (82) and (86) are respectively identical to (10) and (21) for Algorithm 1 with Option I, the following two lemmas can be proven using exactly the same steps in the proofs for Lemmas 3 and 4.

Consider the sequence $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ defined in (86). Algorithm 2 with Option I ensures that for all $t\geq 0$ , we have

Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the global node averages of local solutions from Algorithm 2 with Option I. Let $\{\bar{{\mathbf{z}}}^{(t)}\}_{t\geq 0}$ be defined in (86). For all $T\geq 1$ , Algorithm 2 with Option I ensures that

is an $N\times N$ matrix where all entries are $1/N$ and ${\mathbf{I}}$ is the identity matrix for which the dimensions are obvious in its context.

Due to the asymmetry in computing local averages for different nodes in (18), it is easier to provide the desired counterpart of Lemma 5 by studying the equivalent matrix form $\|{{\mathbf{X}}^{(t)}({\mathbf{I}}-{\mathbf{Q}})}\|_{F}^{2}$ . The technique of introducing matrix forms to analyze the convergence rate has been previously used in (Lian et al., 2017; Tang et al., 2018; Wang & Joshi, 2018a) for the analysis of decentralized SGD without momentum.

The following useful facts are simple in matrix analysis (Horn & Johnson, 1985). Similar facts are used in (Lian et al., 2017; Wang & Joshi, 2018a).

Let ${\mathbf{A}}_{i},i\in\{1,2,\ldots,n\}$ be $n$ arbitrary real square matrices. It follows that $\|{\sum_{i=1}^{n}{\mathbf{A}}_{i}}\|_{F}^{2}\leq\sum_{i=1}^{n}\sum_{j=1}^{n}\|{{\mathbf{A}}_{i}}\|_{F}\|{{\mathbf{A}}_{i}}\|_{F}$ .

Recall by the definition of Frobenius norm, we have $\|{{\mathbf{A}}}\|_{F}^{2}=\text{tr}({\mathbf{A}}^{\mkern-1.5mu\mathsf{T}}{\mathbf{A}})$ where $\text{tr}(\cdot)$ denote the trace operator of a square matrix. It follows that

where (a) follows from the simple fact that $|{\text{tr}({\mathbf{A}}^{\mkern-1.5mu\mathsf{T}}{\mathbf{B}})}|\leq\|{{\mathbf{A}}}\|_{F}\|{{\mathbf{B}}}\|_{F}$ for any two square matrices. ∎

Let ${\mathbf{Q}}$ be defined in (90). For any symmetric doubly stochastic matrix ${\mathbf{W}}$ under Assumption 2, we have

${\mathbf{Q}}{\mathbf{W}}={\mathbf{W}}{\mathbf{Q}}$

$({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}={\mathbf{W}}({\mathbf{I}}-{\mathbf{Q}})$

For any integer $k\geq 1$ , we have $\|{({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}^{k}}\|\leq\rho^{k/2}$ where $\rho<1$ is the constant in Assumption 2 and $\|{\cdot}\|$ denotes the spectrum norm for a matrix.

Recall that ${\mathbf{W}}$ is a symmetric matrix. Let ${\mathbf{W}}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Lambda}}{\mathbf{P}}$ be the eigenvalue decomposition for ${\mathbf{W}}$ where ${\boldsymbol{\Lambda}}=\text{Diag}\{\lambda_{1}({\mathbf{W}}),\ldots,\lambda_{N}({\mathbf{W}})\}$ and ${\mathbf{P}}$ is a unitary matrix where each column is an eigenvector of ${\mathbf{W}}$ . Since $\frac{1}{\sqrt{N}}{\mathbf{1}}_{N}$ is the eigenvector corresponding to eigenvalue $\lambda_{1}({\mathbf{W}})=1$ for matrix ${\mathbf{W}}$ , thus the first column of ${\mathbf{P}}$ is $\frac{1}{\sqrt{N}}{\mathbf{1}}_{N}$ . By (90), we further have an eigenvalue decomposition of ${\mathbf{Q}}$ given by ${\mathbf{Q}}={\mathbf{P}}{\boldsymbol{\Gamma}}{\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}$ where ${\boldsymbol{\Gamma}}=\text{Diag}\{1,0,\ldots,0\}$ . Thus, we have

${\mathbf{Q}}{\mathbf{W}}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Gamma}}{\mathbf{P}}{\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Lambda}}{\mathbf{P}}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Gamma}}{\boldsymbol{\Lambda}}{\mathbf{P}}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Lambda}}{\boldsymbol{\Gamma}}{\mathbf{P}}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Lambda}}{\mathbf{P}}{\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Gamma}}{\mathbf{P}}={\mathbf{W}}{\mathbf{Q}}$

$({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}={\mathbf{W}}-{\mathbf{Q}}{\mathbf{W}}\overset{(a)}{=}{\mathbf{W}}-{\mathbf{W}}{\mathbf{Q}}={\mathbf{W}}({\mathbf{I}}-{\mathbf{Q}})$ where (a) follows from part (1) of this fact.

Note that $({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}^{k}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}({\mathbf{I}}-{\boldsymbol{\Gamma}}){\mathbf{P}}({\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}{\boldsymbol{\Lambda}}{\mathbf{P}})^{k}={\mathbf{P}}^{\mkern-1.5mu\mathsf{T}}({\mathbf{I}}-{\boldsymbol{\Gamma}}){\boldsymbol{\Lambda}}^{k}{\mathbf{P}}$ where $({\mathbf{I}}-{\boldsymbol{\Gamma}}){\boldsymbol{\Lambda}}^{k}=\text{Diag}\{0,(\lambda_{2}({\mathbf{W}}))^{k},\ldots,(\lambda_{N}({\mathbf{W}}))^{k}\}$ . By Assumption 2, we have $\max\{|{(\lambda_{2}({\mathbf{W}}))^{k}}|,|{(\lambda_{N}({\mathbf{W}}))^{k}}|\}\leq(\sqrt{\rho})^{k}$ .

Consider problem (1) under Assumptions 1 and 2. Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 2 with Option I. If $\gamma$ is chosen to satisfy $\frac{16L^{2}\gamma^{2}}{(1-\beta)^{2}(1-\sqrt{\rho})^{2}}<1$ , then for all $T\geq 1$ , we have

Recall the definitions of ${\mathbf{U}}^{(t)}$ and $\widetilde{{\mathbf{U}}}^{(t)}$ in (89), for all $t\geq 1$ , we have

where (a) follows from the first equation in (18) and (b) follows from the first equation in (16).

Recursively applying the above equation for $t$ times yields

where (a) follows from ${\mathbf{U}}^{(0)}={\mathbf{0}}$ .

Recall the definitions of ${\mathbf{X}}^{(t)}$ and $\widetilde{{\mathbf{U}}}^{(t)}$ in (89), for all $t\geq 1$ , we have

where (a) follows from the second equation in (16); (b) follows from the first equation in (16); and (c) follows from (91).

Multiplying both sides by ${\mathbf{I}}-{\mathbf{Q}}$ with ${\mathbf{Q}}$ defined in (90) yields

where (a) follows because ${\mathbf{W}}({\mathbf{I}}-{\mathbf{Q}})=({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}$ by Fact 3.

For all $t\geq 1$ , recursively applying the above equation for $t$ times (using ${\mathbf{W}}({\mathbf{I}}-{\mathbf{Q}})=({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}$ when needed) yields

where (a) follows because ${\mathbf{X}}^{(0)}({\mathbf{I}}-{\mathbf{Q}})={\mathbf{0}}$ due to all columns of ${\mathbf{X}}^{(0)}$ are identical; (b) follows by substituting (92); (c) follows because ${\mathbf{W}}({\mathbf{I}}-{\mathbf{Q}})=({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}$ by Fact 3.

where (a) follows from the basic inequality $\|{{\mathbf{A}}_{1}+{\mathbf{A}}_{2}}\|_{F}^{2}\leq 2\|{{\mathbf{A}}_{1}}\|_{F}^{2}+2\|{{\mathbf{A}}_{2}}\|_{F}^{2}$ .

Now we develop the respective upper bounds of the two terms on the right side of (96). We note that

where (a) follows from Fact 2 ; (b) follows because $|{\frac{1-\beta^{t-\tau}}{1-\beta}}|\leq\frac{1}{1-\beta}$ , $\|{{\mathbf{A}}{\mathbf{B}}}\|_{F}\leq\|{{\mathbf{A}}}\|_{F}\|{{\mathbf{B}}}\|$ for any two matrices ${\mathbf{A}},{\mathbf{B}}$ and $\|{({\mathbf{I}}-{\mathbf{Q}}){\mathbf{W}}^{t-1-\tau}}\|\leq\rho^{(t-1-\tau)/2}$ by Fact 3; (c) follows from the basic identity $xy\leq\frac{1}{2}x^{2}+\frac{1}{2}y^{2}$ for any two real number $x,y$ ; (d) follows by noting the symmetry between $\tau$ and $\tau^{\prime}$ in the expression from last line; (e) follows because $\sum_{\tau^{\prime}=0}^{t-1}\rho^{t-1-(\tau+\tau^{\prime})/2}\leq\frac{\rho^{(t-1-\tau)/2}}{1-\sqrt{\rho}}$ .

where (a) follows by applying the basic inequality $\|{\sum_{i=1}^{n}{\mathbf{A}}_{i}}\|_{F}^{2}\leq n\sum_{i=1}^{n}\|{{\mathbf{A}}_{i}}\|_{F}^{2}$ for matrices of the same dimension with $n=4$ ; (b) follows because $\|{{\mathbf{H}}^{(\tau)}-{\mathbf{J}}^{(\tau)}}\|_{F}^{2}=\sum_{i=1}^{N}\|{\nabla f_{i}({\mathbf{x}}_{i}^{(\tau)})-\nabla f_{i}(\bar{{\mathbf{x}}}^{(\tau)})}\|^{2}\leq L^{2}\sum_{i=1}^{N}\|{{\mathbf{x}}_{i}^{(\tau)}-\bar{{\mathbf{x}}}^{(\tau)}}\|^{2}=L^{2}\|{{\mathbf{X}}^{(\tau)}({\mathbf{I}}-{\mathbf{Q}})}\|_{F}^{2}$ where the inequality step follows from the smoothness of each $f_{i}(\cdot)$ by Assumption 1, $\|{{\mathbf{J}}^{(\tau)}-{\mathbf{J}}^{(\tau)}{\mathbf{Q}}}\|_{F}^{2}=\sum_{i=1}^{N}\|{\nabla f_{i}(\bar{{\mathbf{x}}}^{(\tau)})-\nabla f(\bar{{\mathbf{x}}}^{(\tau)})}\|^{2}\leq N\kappa^{2}$ where the second step follows from Assumption 1, $\|{({\mathbf{J}}^{(\tau)}-{\mathbf{H}}^{(\tau)}){\mathbf{Q}}}\|_{F}^{2}\leq\|{{\mathbf{H}}^{(\tau)}-{\mathbf{J}}^{(\tau)}}\|_{F}^{2}$ .

Substituting (97) and (101) into (96) yields

Summing over $t\in\{1,2,\ldots,T-1\}$ and noting that $\|{{\mathbf{X}}^{(0)}({\mathbf{I}}-{\mathbf{Q}})}\|_{F}^{2}=0$ yields

where (a) follows after simplifying the double summations (by swapping the order of two summations and collecting coefficients for common terms)

Rearranging terms and dividing both sides by $N(1-\frac{16L^{2}\gamma^{2}}{(1-\beta)^{2}(1-\sqrt{\rho})^{2}})$ yields

Finally, this lemma follows by recalling that $\|{{\mathbf{X}}^{(t)}({\mathbf{I}}-{\mathbf{Q}})}\|_{F}^{2}=\sum_{i=1}^{N}\|{\bar{{\mathbf{x}}}^{(t)}-{\mathbf{x}}_{i}^{(t)}}\|^{2}$ . ∎

To simplify the lengthy coefficient $\frac{1}{1-\frac{16L^{2}\gamma^{2}}{(1-\beta)^{2}(1-\sqrt{\rho})^{2}}}$ in Lemma 11, we have the following trivial corollary.

Consider problem (1) under Assumptions 1 and 2. Let $\{\bar{{\mathbf{x}}}^{(t)}\}_{t\geq 0}$ defined in (7) be the node averages of local solutions from Algorithm 2 with Option I. If $\gamma\leq\frac{(1-\beta)(1-\sqrt{\rho})}{4\sqrt{2}L}$ , then for all $T\geq 1$ , we have

It follows because $\frac{1}{1-\frac{16L^{2}\gamma^{2}}{(1-\beta)^{2}(1-\sqrt{\rho})^{2}}}<2$ under our choice of $\gamma$ . ∎

Now, we are ready to present the main proof of Theorem 3. Since Lemmas 9 and 10 under Algorithm 2 are respectively identical to Lemmas 3 and 4, repeating the steps from (38) to (43) in Section 6.3.2 yields:

Summing over $t\in\{0,1,\ldots,T-1\}$ yields

7 More Experiments

Recall our theory, e.g., Corollary 2, establishes the linear speedup property of Algorithm 1 (with $I>0$ ) using constant learning rates. We now verify our theory by training ResNet56 over CIFAR10 using learning rates faithful to our theory. We run SGD with Polyak’s momentum using momentum coefficient $0.9$ , batch size $64$ and constant learning rate $0.01$ . We also run Algorithm 1 with Option I and $I=4$ over $N\in\{2,4,8\}$ GPUs. The local batch size at each GPU is $64$ and the learning rate is set to $0.01\sqrt{N}$ . Figure 4(a) plots the convergence of training loss in terms of the number of epochs that are jointly accessed by all used GPUs. Figure 4(a) verifies the linear speedup property of our Algorithm 1 since each GPU in Algorithm 1 only uses $1/N$ of the number of epochs, which are used by the single worker momentum SGD, to converge to the same loss value. A more straightforward verification is given in Figure 4(b) where the x-axis is the wall clock time.

7.2 Comparisons between Algorithm 1 and Existing Model Averaging with Momentum SGD

Our Algorithm 1 is quite similar to a common practice, known as “model averaging”, used for training deep neural networks with multiple workers. When the deep neural works are trained with momentum SGD, Microsoft’s CNTK framework (Seide & Agarwal, 2016) and recent work (Wang & Joshi, 2018b) suggest that each worker should additionally clear its local momentum buffer by reseting it to when they average their local models. In contrast, our Algorithm 1 averages both local momentum buffers and local models. There is no convergence guarantees shown in the literature for the model averaging strategy suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b). We run both Algorithm 1 and the model averaging with Polyak’s momentum SGD suggested in (Seide & Agarwal, 2016; Wang & Joshi, 2018b), referred as “model averaging with cleared momentum” to train ResNet56 over CIFAR10. In the experiment, both Algorithm 1 and “model averaging with cleared momentum” perform one synchronization step, e.g., (6) in Algorithm 1, every $I=16$ iterations. Figure 5 shows that Algorithm 1 converges slightly faster. More impressively, the test accuracy attained by Algorithm 1 is roughly $1\%$ better.

7.3 Experiments with ImageNet

We further verify the performance of Algorithm 1 when used to train ResNet50 over ImageNet (Deng et al., 2009), which is an image classification task significantly harder than CIFAR10. We run Algorithm 1 with $I\in\{4,8,16\}$ and the classical parallel SGD with momentum on a machine with $8$ NVIDIA P100 GPUs. The local batch size at each GPU is $128$ . The learning rate is initialized to $0.2$ and is divided by $10$ when all GPUs jointly access $30$ and $60$ epochs. The momentum coefficient is set to $0.9$ . Figure 6 plots the convergence of training loss and test accuracy in terms of the number of iteration steps. This figure again verifies that Algorithm 1 has the same convergence rate as the classical parallel momentum SGD. To verify the benefit of skipping communication, Figure 7 plots the convergence of training loss and test accuracy in terms of the wall clock time.

7.4 Experiments on Algorithm 1 with Option II

Figures 8 and 9 verify the convergence of Algorithm 1 with Option II. Their experiment configuration is identical to Figures 2 and 3 except that Polyak’s momentum in all algorithms is replaced with Nesterov’s momentum. While Figures 8 and 9 verify the convergence rate analysis proven in Theorem 2, our preliminary experiments seem to suggest Nesterov’s momentum is less robust to communication skipping since even a small $I$ leads to performance degradation of test accuracy.