D$^2$: Decentralized Training over Decentralized Data

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, Ji Liu

Introduction

Training machine learning models in a decentralized way has attracted intensive interests recently Lian et al. (2017a); Yuan et al. (2016); Colin et al. (2016). In the decentralized setting, there is a set of workers, each of which collects data from different data sources. Instead of sending all of their data to a centralized place, these workers only communicate with their neighbors. The goal is to get a model that is the same as if all data are collected in a centralized place. Decentralized learning algorithm is important in scenarios in which centralized communication is expensive or not possible, or the underlying communication network has high latency.

For decentralized learning to provide benefit, each user should provides data that is somehow unique, i.e., the variance of data collected from different workers are large. However, many recent theoretical results Lian et al. (2017a, b); Nedic and Ozdaglar (2009); Yuan et al. (2016) all assume a bounded data variance across workers — when data hosted on different workers are very different, these approach could converge slowly, both empirically and theoretically. In this paper, we aim at bringing this discrepancy between the current theoretical understanding and the requirements from some practical scenarios.

In this paper, we present D2, a novel decentralized learning algorithm designed to be robust under high data variance. The structure and technique of D2 is built upon standard decentralized parallel stochastic gradient descent (D-PSGD), but benefits from an additional variance reduction component. In the D2 algorithm, each worker stores the stochastic gradient and its local model in last iterate and linearly combines them with the current stochastic gradient and local model. It results in an improved convergence rate over D-PSGD by eliminating the data variation among workers. In particular, the convergence rate is improved from $O\left({\sigma\over\sqrt{nT}}+{(n\zeta^{2})^{\frac{1}{3}}\over T^{2/3}}\right)$ to $O\left({\sigma\over\sqrt{nT}}\right)$ where $\zeta^{2}$ is the data variation among all workers, $\sigma^{2}$ is the data variance within each worker, $n$ is the number of workers, and $T$ is the number of iterations. We empirically show $D^{2}$ can significantly outperform D-PSGD by training an image classification model where each worker has access to only the data of a limited set of labels.

Throughout this paper, we consider the following decentralized optimization:

where $n$ is the number of workers and $\mathcal{D}_{i}$ is the local data distribution for worker $i$ . All workers are connected to form a connected graph. Each worker can only exchange information with its neighbors.

Throughout this paper, we use following notations and definitions:

$\|\cdot\|_{F}$ denotes the Frobenius norm of matrices.

$\nabla f(\cdot)$ denotes the gradient of a function $f$ .

$f^{*}$ denotes the optimal solution of (1).

$\lambda_{i}(\cdot)$ denotes the $i$ th largest eigenvalue of a matrix.

$\bm{x}^{(i)}$ denotes the local model of worker $i$ .

$\nabla F_{i}(\bm{x}^{(i)};\xi^{(i)})$ denotes a local stochastic gradient of worker $i$ .

In order to organize the algorithm more clearly, here we define the concatenation of all local variables, stochastic gradients, and their average respectively:

where $\xi$ is the collection of randomly sampled data from all workers

Organization

This paper is organized as follows: Section 2 reviews related work about the proposed approach; Section 3 introduces the state-of-the-art decentralized stochastic gradient descent method and its convergence rate; Section 4 introduces the proposed algorithm and its intuition why it can improves the state-of-the-art approach; and Section 5; Section 6 validates the proposed approaches via empirical study; and Section 7 concludes this paper.

Related work

In this section, we review the stochastic gradient descent algorithm and its decentralized variants, decentralized algorithms, and previous variance reduction technologies in this section.

The SGD approahces (Ghadimi and Lan, 2013; Moulines and Bach, 2011; Nemirovski et al., 2009) is quite powerful for solving large-scale machine learning problems. It achieves a convergence rate of $O\left({1}/\sqrt{T}\right)$ . As an implementation of SGD, the Centralized Parallel Stochastic Gradient Descent (C-PSGD), has been widely used in parallel computation. In C-PSGD, a central worker, whose job is to perform the variable updates, is connected to many leaf workers that are used to compute stochastic gradients in parallel. C-PSGD has been applied to many deep learning frameworks, such as such as CNTK (Seide and Agarwal, 2016), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2016). The convergence rate of C-PSGD is $O\left(\frac{1}{\sqrt{nT}}\right)$ , which shows it can achieve linear speedup with regards to the number of leaf workers.

Decentralized algorithms

Centralized algorithms requires a central server to communicate with all other workers (Suresh et al., 2017). In contrast, decentralized algorithms can work on any connected network and only rely on the information exchange between neighbor workers (Kashyap et al., 2007; Lavaei and Murray, 2012; Nedic et al., 2009).

Decentralized algorithms are especially useful under a network with limited bandwidth or high latency. It is more favorable when data privacy is sensitive. These advantages have led to successful applications. The decentralized approach for multi-task reinforcement learning was studied in Omidshafiei et al. (2017); Mhamdi et al. (2017). In Colin et al. (2016), a dual based decentralized algorithm was proposed to solve the pairwise function optimization. Shi et al. (2014) and Mokhtari and Ribeiro (2015) analyzed the decentralized version of the ADMM optimization algorithm. An information theoretic approach was used to analyze decentralization in Dobbe et al. (2017). The decentralized version of (sub-)gradient descent was studied in Nedic and Ozdaglar (2009); Yuan et al. (2016). Its $O(1/\sqrt{T})$ convergence requires a diminishing stepsize or a constant stepsize that depends on the total number of iterations. This phenomenon happens because of the variance between the data in different workers, which we call “outer variance” to differentiate it from the variance in SGD. Recently, there are several deterministic decentralized optimization algorithms that allows a constant stepsize. For example, EXTRA Shi et al. (2015a) is the first modification of decentralized gradient descent that converges under a constant stepsize. Later this algorithm is extended for problems with the sum of smooth and nonsmooth functions at each node Shi et al. (2015b). However, the stepsize depends on both the Lipschitz constant of the differentiable function and the network structure. NIDS is the first algorithm that has a constant network independent stepsize Li et al. (2017). This algorithm was simultaneously proposed by Yuan et al. (2017) for the smooth case only using a different approach. For directed networks, the algorithm DIGing is proposed in Nedić et al. (2017), where two exchanges are needed in each iteration. To Prof. Yan: could you write couple of sentences to summarize these papers.

Decentralized parallel stochastic gradient descent (D-PSGD)

The D-PSGD algorithm (Nedic and Ozdaglar, 2009; Ram et al., 2010a, b) requires each worker to compute a stochastic gradient and exchange its local model with neighbors. In Duchi et al. (2012), a dual averaging based method is proposed for solving the constrained decentralized SGD optimization. In Yuan et al. (2016), the convergence rate for D-PSGD was analyzed when the gradient is assumed to be bounded. In Lan et al. (2017), a decentralized primal-dual type method was proposed with a computational complexity of $O\left({n}/{\epsilon^{2}}\right)$ for general convex objectives. Lian et al. (2017a) proved that D-PSGD can admits linear speedup w.r.t. number of workers with a similar convergence rate like C-PSGD.

Variance reduction technology

There have been many methods developed for reducing the variance in SGD, including SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014), SAG (Schmidt et al., 2017), MISO (Mairal, 2015), and mS2GD (Konečnỳ et al., 2016). However, most of these technologies are just designed for the centralized approaches. The DSA algorithm (Mokhtari and Ribeiro, 2016) applies the variance reduction similar to SAGA on strongly convex decentralized optimization problems and proved a linear convergence rate. However, the speedup property is unclear and a table of all stochastic gradients need to be stoblack.

Preliminary: decentralized stochastic gradient descent

The decentralized stochastic gradient descent (Lian et al., 2017a; Zhang et al., 2017; Shahrampour and Jadbabaie, 2017) allows each worker (say worker $i$ ) maintaining its own local variable $\bm{x}^{(i)}$ . During each iteration (say, iteration $t$ ), each worker performs the following steps:

Take weighted average with its local variable and neighbors’ local variables:

where $W_{ij}$ is the $(i,j)$ element of the matrix $W$ , $W_{ij}=0$ means worker $i$ and worker $j$ are not connected.

Perform one stochastic gradient descent step

where $\xi^{(i)}_{t}$ represents the data sampled in worker $i$ at the iteration $t$ following the distribution $\mathcal{D}_{i}$ .

From a global point of view, the update rule D-PSGD algorithm can be viewed as

It admits the following rate shown in Theorem 1.

Under certain assumptions, the output of D-PSGD admits the following inequality

where $\rho$ reflects the property of the network, $D_{1}$ and $D_{2}$ are defined to be

and $\sigma$ and $\varsigma$ measure the variation within each worker and among all workers respectively

Choosing the optimal steplength $\gamma=\frac{1}{L+\sigma\sqrt{\frac{K}{n}}+{\color[rgb]{0,0,0}n^{\frac{1}{3}}}\zeta^{\frac{2}{3}}T^{\frac{1}{3}}}$ we have the following convergence rate:

The proposed D2 algorithm can improve the convergence rate by removing the dependence to the global bound of outer variance $\zeta$ .

In D2 algorithm, each worker repeats the following updating rule (say, at iteration $t$ ) for worker $i$

Compute a local stochastic gradient $\nabla F(\bm{x}_{t}^{(i)};\xi_{t}^{(i)})$ by sampling $\xi_{t}^{(i)}$ from distribution $\mathcal{D}^{(i)}$ ;

Update the local model $\bm{x}_{t+\frac{1}{2}}^{(i)}\leftarrow 2\bm{x}_{t}^{(i)}-\bm{x}_{t-1}^{(i)}-\gamma\nabla F_{i}\left(\bm{x}^{(i)}_{t};\xi^{(i)}_{t}\right)+\gamma\nabla F_{i}\left(\bm{x}^{(i)}_{t-1};\xi^{(i)}_{t-1}\right)$ using the local models and stochastic gradients in both the $t$ th iteration and the $(t-1)$ th iteration.

When the synchronization barrier is met, exchange $\bm{x}_{t+\frac{1}{2}}^{(i)}$ with neighbors:

From a global point of view, the update rule of D2 can be viewed as:

The complete algorithm is summarized in Algorithm 1.

To understand the intuition of D2, let us consider the mean value $\overline{X}_{t}$ , which gets updated just like the standard stochastic gradient descent:

Why D2 improves the D-PSGD?

Acute reviewers may notice that the D-PSGD algorithm also essentially updates in the form of stochastic gradient descent in (4). Then why D2 can improve D-PSGD?

Assume that $X_{t}$ has achieved the optimum $X^{*}:=x^{*}{\bf 1}^{\top}$ with all local models equal to the optimum $x^{*}$ to (1). Then for D-PSGD, the next update will be

Next we apply a similar analysis for D2 by assuming that both $X_{t-1}$ and $X_{t}$ have reached the optimal solution $X^{*}$ . The next update for $D^{2}$ will be:

Theoretical guarantee

This section provides the theoretical guarantee for the proposed D2 algorithm. We first give the assumptions requiblack below.

Throughout this paper, we make the following commonly used assumptions:

Lipschitzian gradient: All function $f_{i}(\cdot)$ ’s are with $L$ -Lipschitzian gradients.

Bounded variance: Assume bounded variance of stochastic gradient within each worker

Symmetric confusion matrix: The confusion matrix $W$ is symmetric and satisfies $W\bm{1}=\bm{1}$ .

We assume $\lambda<1$ and $\lambda_{n}>-\frac{1}{3}$ .

Initialization: W.l.o.g., assume all local variables are initialized by zero, that is, $X_{0}=0$ .

Given Assumption 1, we have following convergence guarantee for $D^{2}$ :

Choose the steplength $\gamma$ in Algorithm 1 to be a constant satisfying $1-24C_{2}\gamma^{2}L^{2}>0$ . Under Assumption 1, we have the following convergence rate for Algorithm 1:

By appropriately specifying the step length $\gamma$ we reach the following corollary:

Choose the step length $\gamma$ in Algorithm 1 to be $\gamma=\frac{1}{8\sqrt{C_{2}}L+6\sqrt{C_{1}}L+\sigma\sqrt{\frac{T}{n}}}$ , where $C_{1}$ and $C_{2}$ are defined in Theorem 2. Under Assumption 1, the following convergence rate holds

where $\zeta_{0}$ is defined in Theorem 2 and we treat $f(0)-f^{*}$ , $L$ , $\lambda_{n}$ , and $\lambda$ as constants.

Note that we can obtain even better constants by choosing different parameters and applying tighter inequalities, however, the main result of this corollary is to show the order of the convergence. We highlight a few key observations from our theoretical results in the following.

Setting $\sigma=0$ and $\zeta_{0}=0$ , which blackuces the VR-SGD to a normal GD algorithm, we shall see that the convergence rate becomes $O\left(\frac{1}{T}\right)$ , which is exactly the rate of GD.

Since the leading term of the convergence rate is $O\left(\frac{1}{\sqrt{nT}}\right)$ , which is consistent with the convergence rate of C-PSGD, this indicates that we would achieve a linear speed up with respect to the number of nodes.

In NIDS (Li and Yan, 2017), the term depends on $\zeta_{0}$ in the convergence rate is $O\left(\frac{\zeta^{2}_{0}}{T}\right)$ . While the corresponding term in $D^{2}$ is $O\left(\frac{\zeta_{0}^{2}}{T+\sigma^{2}T^{2}}\right)$ , which indicates when our algorithm is consistent with NIDS because in NIDS $\sigma$ is consideblack to be 0.

When compablack to D-PSGD, the convergence rate of $D^{2}$ only depends on $\zeta_{0}$ , and the corresponding decaying rate is $\frac{\zeta_{0}}{T^{2}}$ . Whereas in D-PSGD (Lian et al., 2017a), we need to assume an upper bound for the global variance between different nodes’ dataset, and its influence can be compablack to $\sigma^{2}$ , the inner variance of each node itself. This means we can always achieve a much better convergence rate than D-PSGD.

Experiments

We evaluate the effectiveness of D2 by comparing it with both centralized and decentralized SGD algorithms.

TransferLearning: We test the case that each worker has access to a local pre-trained neural network as feature extractor, and we want to train a logistic regression model among all these workers. In our experiment, we select the first 16 classes of ImageNet and use InceptionV4 as the feature extractor to extract 2048 features for each image. We conduct data augmentation and generate a blurblack version for each image. In total this datasaet contains 16 $\times$ 1300 $\times$ 2 images.

LeNet: We test the case that all workers collaboratively train a neural network model. We train a LeNet on the CIFAR10 dataset. In total this dataset contains 50,000 images of size 32 $\times$ 32.

One caveat of training more recent neural networks is that modern architectures often have a batch normalization layer, which inherently assumes that the data distribution is uniform across different batches, which is not the case that we are interested in. In principle, we could also flow the batch information through the network in a decentralized way; however, we leave this as future work.

By default, each worker only has exclusive access to a subset of classes. For TransferLearning, we use 16 workers and each worker has access to one class; for LeNet, we use 5 workers and each worker has access to two classes. For comparison, we also consider a case when the datasets is first shuffled and then uniformly partitioned among all the workers, we call this the shuffled case, and the default one the unshuffled case. We use a ring topology for both experiments.

Parameter Tuning. For TransferLearning, we use constant learning rates and tune it from {0.01, 0.025, 0.05, 0.075, 0.1}. For LeNet, we use constant learning rate 0.05 which is tuned from {0.5, 0.1, 0.05, 0.01} for centralized algorithms and batch size 128 on each worker.

Metrics. In this paper, we mainly focus on the convergence rate of different algorithms instead of the wall clock speed. This is because the implementation of D2 is a minor change over the standard D-PSGD algorithm, and thus they has almost the same speed to finish one epoch of training, and both are no slower than the centralized algorithm. When the network has high latency, if a decentralized algorithm ( $D^{2}$ or D-PSGD) converges with a similar speed as the centralized algorithm, it can be up to one order of magnitude faster Lian et al. (2017a). However, the convergence rate depending on the “outer variance” is different for both algorithms.

2 Unshuffled Case

We are mostly interested in the unshuffled case, in which the data variation across workers is maximized. Figure 1 shows the result. In the unshuffled case, we see that the D-PSGD algorithm convergences slower than the centralized case. This is consistent with the original D-PSGD paper (Lian et al., 2017a). On the other hand, D2 converges much faster than D-PSGD, and achieves almost the same loss as the centralized algorithm. For the LeNet case, each worker only has access to data of assigned two labels, which means the data variation is very large. The D-PSGD does not converge with the the given learning rate 0.05.We can tune the learning rate 50x smaller for D-PSGD to converge in this case, but doing so will make D-PSGD stuck at the starting point for quite a long time.

3 Shuffled Case

As a sanity check, Figure 2 shows the result of three different algorithms on the shuffled data. In this case, the data variation of among workers is small (in expectation, they are drawn from the same distribution). We see that, all strategies have similar convergence rate. This validate that the D2 algorithm is more effective for larger data variation between different workers.

Conclusion

In this paper, we propose a decentralized algorithm, namely, D2 algorithm. D2 algorithm integrates the D-PSGD algorithm with the variance reduction technology, by which we improves the convergence rate of D-PSGD. The variance reduction technology used in this paper is different from the commonly used ones such as SVRG and SAGA, that are designed for centralized approaches. Experiments validate the advantage of D2 over D-PSGD — D2 converges with a rate that is similar to centralized SGD while D-PSGD does not converge to the a solution with a similar quality when the data variance is large. While being robust to large data variance among workers, the same performance benefit of D-PSGD over the centralized strategy still holds for D2.