QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic

Introduction

In this paper, we focus on parallel SGD methods, which have received considerable attention recently due to their high scalability . Specifically, we consider a setting where a large dataset is partitioned among $K$ processors, which collectively minimize a function $f$ . Each processor maintains a local copy of the parameter vector $\bm{x}_{t}$ ; in each iteration, it obtains a new stochastic gradient update (corresponding to its local data). Processors then broadcast their gradient updates to their peers, and aggregate the gradients to compute the new iterate $\bm{x}_{t+1}$ .

In most current implementations of parallel SGD, in each iteration, each processor must communicate its entire gradient update to all other processors. If the gradient vector is dense, each processor will need to send and receive $n$ floating-point numbers per iteration to/from each peer to communicate the gradients and maintain the parameter vector $\bm{x}$ . In practical applications, communicating the gradients in each iteration has been observed to be a significant performance bottleneck .

One popular way to reduce this cost has been to perform lossy compression of the gradients . A simple implementation is to simply reduce precision of the representation, which has been shown to converge under convexity and sparsity assumptions . A more drastic quantization technique is 1BitSGD , which reduces each component of the gradient to just its sign (one bit), scaled by the average over the coordinates of $\widetilde{g}$ , accumulating errors locally. 1BitSGD was experimentally observed to preserve convergence , under certain conditions; thanks to the reduction in communication, it enabled state-of-the-art scaling of deep neural networks (DNNs) for acoustic modelling . However, it is currently not known if 1BitSGD provides any guarantees, even under strong assumptions, and it is not clear if higher compression is achievable.

Our focus is understanding the trade-offs between the communication cost of data-parallel SGD, and its convergence guarantees. We propose a family of algorithms allowing for lossy compression of gradients called Quantized SGD (QSGD), by which processors can trade-off the number of bits communicated per iteration with the variance added to the process.

QSGD is built on two algorithmic ideas. The first is an intuitive stochastic quantization scheme: given the gradient vector at a processor, we quantize each component by randomized rounding to a discrete set of values, in a principled way which preserves the statistical properties of the original. The second step is an efficient lossless code for quantized gradients, which exploits their statistical properties to generate efficient encodings. Our analysis gives tight bounds on the precision-variance trade-off induced by QSGD.

At one extreme of this trade-off, we can guarantee that each processor transmits at most $\sqrt{n}(\log n+O(1))$ expected bits per iteration, while increasing variance by at most a $\sqrt{n}$ multiplicative factor. At the other extreme, we show that each processor can transmit $\leq 2.8n+32$ bits per iteration in expectation, while increasing variance by a only a factor of $2$ . In particular, in the latter regime, compared to full precision SGD, we use $\approx 2.8n$ bits of communication per iteration as opposed to $32n$ bits, and guarantee at most $2\times$ more iterations, leading to bandwidth savings of $\approx 5.7\times$ .

QSGD is fairly general: it can also be shown to converge, under assumptions, to local minima for non-convex objectives, as well as under asynchronous iterations. One non-trivial extension we develop is a stochastic variance-reduced variant of QSGD, called QSVRG, which has exponential convergence rate.

One key question is whether QSGD’s compression-variance trade-off is inherent: for instance, does any algorithm guaranteeing at most constant variance blowup need to transmit $\Omega(n)$ bits per iteration? The answer is positive: improving asymptotically upon this trade-off would break the communication complexity lower bound of distributed mean estimation (see [44, Proposition 2] and ).

Experiments.

The crucial question is whether, in practice, QSGD can reduce communication cost by enough to offset the overhead of any additional iterations to convergence. The answer is yes. We explore the practicality of QSGD on a variety of state-of-the-art datasets and machine learning models: we examine its performance in training networks for image classification tasks (AlexNet, Inception, ResNet, and VGG) on the ImageNet and CIFAR-10 datasets, as well as on LSTMs for speech recognition. We implement QSGD in Microsoft CNTK .

Experiments show that all these models can significantly benefit from reduced communication when doing multi-GPU training, with virtually no accuracy loss, and under standard parameters. For example, when training AlexNet on 16 GPUs with standard parameters, the reduction in communication time is $4\times$ , and the reduction in training to the network’s top accuracy is $2.5\times$ . When training an LSTM on two GPUs, the reduction in communication time is $6.8\times$ , while the reduction in training time to the same target accuracy is $2.7\times$ . Further, even computationally-heavy architectures such as Inception and ResNet can benefit from the reduction in communication: on 16GPUs, QSGD reduces the end-to-end convergence time of ResNet152 by approximately $2\times$ . Networks trained with QSGD can converge to virtually the same accuracy as full-precision variants, and that gradient quantization may even slightly improve accuracy in some settings.

Related Work.

One line of related research studies the communication complexity of convex optimization. In particular, studied two-processor convex minimization in the same model, provided a lower bound of $\Omega(n(\log n+\log(1/\epsilon)))$ bits on the communication cost of $n$ -dimensional convex problems, and proposed a non-stochastic algorithm for strongly convex problems, whose communication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic gradient methods. Recent work focused on round complexity lower bounds on the number of communication rounds necessary for convex learning.

Buckwild! was the first to consider the convergence guarantees of low-precision SGD. It gave upper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity, and gradient sparsity, and showed significant speedup when solving convex problems on CPUs. QSGD refines these results by focusing on the trade-off between communication and convergence. We view quantization as an independent source of variance for SGD, which allows us to employ standard convergence results . The main differences from Buckwild! are that 1) we focus on the variance-precision trade-off; 2) our results apply to the quantized non-convex case; 3) we validate the practicality of our scheme on neural network training on GPUs. Concurrent work proposes TernGrad , which starts from a similar stochastic quantization, but focuses on the case where individual gradient components can have only three possible values. They show that significant speedups can be achieved on TensorFlow , while maintaining accuracy within a few percentage points relative to full precision. The main differences to our work are: 1) our implementation guarantees convergence under standard assumptions; 2) we strive to provide a black-box compression technique, with no additional hyperparameters to tune; 3) experimentally, QSGD maintains the same accuracy within the same target number of epochs; for this, we allow gradients to have larger bit width; 4) our experiments focus on the single-machine multi-GPU case.

We note that QSGD can be applied to solve the distributed mean estimation problem with an optimal error-communication trade-off in some regimes. In contrast to the elegant random rotation solution presented in , QSGD employs quantization and Elias coding. Our use case is different from the federated learning application of , and has the advantage of being more efficient to compute on a GPU.

There is an extremely rich area studying algorithms and systems for efficient distributed large-scale learning, e.g. . Significant interest has recently been dedicated to quantized frameworks, both for inference, e.g., and training . In this context, proposed 1BitSGD, a heuristic for compressing gradients in SGD, inspired by delta-sigma modulation . It is implemented in Microsoft CNTK, and has a cost of $n$ bits and two floats per iteration. Variants of it were shown to perform well on large-scale Amazon datasets by . Compared to 1BitSGD, QSGD can achieve asymptotically higher compression, provably converges under standard assumptions, and shows superior practical performance in some cases.

Preliminaries

SGD has many variants, with different preconditions and guarantees. Our techniques are rather portable, and can usually be applied in a black-box fashion on top of SGD. For conciseness, we will focus on a basic SGD setup. The following assumptions are standard; see e.g. .

Given access to stochastic gradients, and a starting point $\bm{x}_{0}$ , SGD builds iterates $\bm{x}_{t}$ given by Equation (1), projected onto $\mathcal{X}$ , where $(\eta_{t})_{t\geq 0}$ is a sequence of step sizes. In this setting, one can show:

A modification to the SGD scheme presented above often observed in practice is a technique known as minibatching. In minibatched SGD, updates are of the form $\bm{x}_{t+1}=\Pi_{\mathcal{X}}(\bm{x}_{t}-\eta_{t}\widetilde{G}_{t}(\bm{x}_{t}))$ , where $\widetilde{G}_{t}(\bm{x}_{t})=\frac{1}{m}\sum_{i=1}^{m}\widetilde{g}_{t,i}$ , and where each $\widetilde{g}_{t,i}$ is an independent stochastic gradient for $f$ at $\bm{x}_{t}$ . It is not hard to see that if $\widetilde{g}_{t,i}$ are stochastic gradients with variance bound $\sigma^{2}$ , then the $\widetilde{G}_{t}$ is a stochastic gradient with variance bound $\sigma^{2}/m$ . By inspection of Theorem 2.1, as long as the first term in (2) dominates, minibatched SGD requires $1/m$ fewer iterations to converge.

Data-Parallel SGD.

We consider synchronous data-parallel SGD, modelling real-world multi-GPU systems, and focus on the communication cost of SGD in this setting. We have a set of $K$ processors $p_{1},p_{2},\ldots,p_{K}$ who proceed in synchronous steps, and communicate using point-to-point messages. Each processor maintains a local copy of a vector $\bm{x}$ of dimension $n$ , representing the current estimate of the minimizer, and has access to private, independent stochastic gradients for $f$ .

Let $\mathcal{X},f,L,\bm{x}_{0}$ , and $R$ be as in Theorem 2.1. Fix $\epsilon>0$ . Suppose we run parallel SGD on $K$ processors, each with access to independent stochastic gradients with second moment bound $B$ , with step size $\eta_{t}=1/(L+\sqrt{K}/\gamma)$ , where $\gamma$ is as in Theorem 2.1. Then if

In most reasonable regimes, the first term of the max in (3) will dominate the number of iterations necessary. Specifically, the number of iterations will depend linearly on the second moment bound $B$ .

Quantized Stochastic Gradient Descent (QSGD)

We now consider a general, parametrizable lossy-compression scheme for stochastic gradient vectors. The quantization function is denoted with $Q_{s}(\bm{v})$ , where $s\geq 1$ is a tuning parameter, corresponding to the number of quantization levels we implement. Intuitively, we define $s$ uniformly distributed levels between and $1$ , to which each value is quantized in a way which preserves the value in expectation, and introduces minimal variance. Please see Figure 1.

Efficient Coding of Gradients.

Observe that for any vector $\bm{v}$ , the output of $Q_{s}(\bm{v})$ is naturally expressible by a tuple $(\|\bm{v}\|_{2},\bm{\sigma},\bm{\zeta})$ , where $\bm{\sigma}$ is the vector of signs of the $\bm{v}_{i}$ ’s and $\bm{\zeta}$ is the vector of integer values $s\cdot\xi_{i}(\bm{v},s)$ . The key idea behind the coding scheme is that not all integer values $s\cdot\xi_{i}(\bm{v},s)$ can be equally likely: in particular, larger integers are less frequent. We will exploit this via a specialized Elias integer encoding .

Sparse Regime.

For the case $s=1$ , i.e., quantization levels , $1$ , and $-1$ , the gradient density is $O(\sqrt{n})$ , while the second-moment blowup is $\leq\sqrt{n}$ . Intuitively, this means that we will employ $O(\sqrt{n}\log n)$ bits per iteration, while the convergence time is increased by $O(\sqrt{n})$ .

Dense Regime.

The variance blowup is minimized to at most 2 for $s=\sqrt{n}$ quantization levels; in this case, we devise a more efficient encoding which yields an order of magnitude shorter codes compared to the full-precision variant. The proof of this statement is not entirely obvious, as it exploits both the statistical properties of the quantization and the guarantees of the Elias coding.

Let $f,\bm{x}$ , and $\widetilde{g}(\bm{x})$ be as in Theorem 3.2. There is an encoding scheme for $Q_{\sqrt{n}}(\widetilde{g}(\bm{x}))$ which in expectation has length at most $2.8n+32.$

2 QSGD Guarantees

Putting the bounds on the communication and variance given above with the guarantees for SGD algorithms on smooth, convex functions yield the following results:

QSGD is quite portable, and can be applied to almost any stochastic gradient method. For illustration, we can use quantization along with to get communication-efficient non-convex SGD.

3 Quantized Variance-Reduced SGD

Assume we are given $K$ processors, and a parameter $m>0$ , where each processor $i$ has access to functions $\{f_{im/K},\ldots,f_{(i+1)m/K-1}\}$ . The goal is to approximately minimize $f=\frac{1}{m}\sum_{i=1}^{m}f_{i}$ . For processor $i$ , let $h_{i}=\frac{1}{m}\sum_{j=im/K}^{(i+1)m/K-1}f_{i}$ be the portion of $f$ that it knows, so that $f=\sum_{i=1}^{K}h_{i}$ .

A natural question is whether we can apply stochastic quantization to reduce communication for parallel SVRG. Upon inspection, we notice that the resulting update will break standard SVRG. We resolve this technical issue, proving one can quantize SVRG updates using our techniques and still obtain the same convergence bounds.

Let $\widetilde{Q}(\bm{v})=Q(\bm{v},\sqrt{n})$ , where $Q(\bm{v},s)$ is defined as in Section 3.1. Given arbitrary starting point $\bm{x}_{0}$ , we let $\bm{y}^{(1)}=\bm{x}_{0}$ . At the beginning of epoch $p$ , each processor broadcasts $\nabla h_{i}(\bm{y}^{(p)})$ , that is, the unquantized full gradient, from which the processors each aggregate $\nabla f(\bm{y}^{(p)})=\sum_{i=1}^{m}\nabla h_{i}(\bm{y}^{(p)})$ . Within each epoch, for each iteration $t=1,\ldots,T$ , and for each processor $i=1,\ldots,K$ , we let $j^{(p)}_{i,t}$ be a uniformly random integer from $[m]$ completely independent from everything else. Then, in iteration $t$ in epoch $p$ , processor $i$ broadcasts the update vector $\bm{u}^{(p)}_{t,i}=\widetilde{Q}\left(\nabla f_{j^{(p)}_{i,t}}(\bm{x}^{(p)}_{t})-\nabla f_{j^{(p)}_{i,t}}(\bm{y}^{(p)})+\nabla f(\bm{y}^{(p)})\right).$

Each processor then computes the total update $\bm{u}^{(p)}_{t}=\frac{1}{K}\sum_{i=1}^{K}\bm{u}_{t,i}$ , and sets $\bm{x}^{(p)}_{t+1}=\bm{x}^{(p)}_{t}-\eta\bm{u}^{(p)}_{t}$ . At the end of epoch $p$ , each processor sets $\bm{y}^{(p+1)}=\frac{1}{T}\sum_{t=1}^{T}\bm{x}^{(p)}_{t}$ . We can prove the following.

Discussion.

Gradient Descent.

The full version of the paper contains an application of QSGD to gradient descent. Roughly, in this case, QSGD can simply truncate the gradient to its top components, sorted by magnitude.

QSGD Variants

Our experiments will stretch the theory, as we use deep networks, with non-convex objectives. (We have also tested QSGD for convex objectives. Results closely follow the theory, and are therefore omitted.) Our implementations will depart from the previous algorithm description as follows.

First, we notice that the we can control the variance the quantization by quantizing into buckets of a fixed size $d$ . If we view each gradient as a one-dimensional vector $\bm{v}$ , reshaping tensors if necessary, a bucket will be defined as a set of $d$ consecutive vector values. (E.g. the $i$ th bucket is the sub-vector $v[(i-1)d+1:i\cdot d]$ .) We will quantize each bucket independently, using QSGD. Setting $d=1$ corresponds to no quantization (vanilla SGD), and $d=n$ corresponds to full quantization, as described in the previous section. It is easy to see that, using bucketing, the guarantees from Lemma 3.1 will be expressed in terms of $d$ , as opposed to the full dimension $n$ . This provides a knob by which we can control variance, at the cost of storing an extra scaling factor on every $d$ bucket values. As an example, if we use a bucket size of $512$ , and $4$ bits, the variance increase due to quantization will be upper bounded by only $\sqrt{512}/2^{4}\simeq 1.41$ . This provides a theoretical justification for the similar convergence rates we observe in practice.

The second difference from the theory is that we will scale by the maximum value of the vector (as opposed to the $2$ -norm). Intuitively, normalizing by the max preserves more values, and has slightly higher accuracy for the same number of iterations. Both methods have the same baseline bandwidth reduction because of lower bit width (e.g. 32 bits to 2 bits per dimension), but normalizing by the max no longer provides any sparsity guarantees. We note that this does not affect our bounds in the regime where we use $\Theta(\sqrt{n})$ quantization levels per component, as we employ no sparsity in that case. (However, we note that in practice max normalization also generates non-trivial sparsity.)

Experiments

We performed experiments on Amazon EC2 p2.16xlarge instances, with 16 NVIDIA K80 GPUs. Instances have GPUDirect peer-to-peer communication, but do not currently support NVIDIA NCCL extensions. We have implemented QSGD on GPUs using the Microsoft Cognitive Toolkit (CNTK) . This package provides efficient (MPI-based) GPU-to-GPU communication, and implements an optimized version of 1bit-SGD . Our code is released as open-source .

We execute two types of tasks: image classification on ILSVRC 2015 (ImageNet) , CIFAR-10 , and MNIST , and speech recognition on the CMU AN4 dataset . For vision, we experimented with AlexNet , VGG , ResNet , and Inception with Batch Normalization deep networks. For speech, we trained an LSTM network . See Table 1 for details.

Protocol.

Our methodology emphasizes zero error tolerance, in the sense that we always aim to preserve the accuracy of the networks trained. We used standard sizes for the networks, with hyper-parameters optimized for the 32bit precision variant. (Unless otherwise stated, we use the default networks and hyper-parameters optimized for full-precision CNTK 2.0.) We increased batch size when necessary to balance communication and computation for larger GPU counts, but never past the point where we lose accuracy. We employed double buffering to perform communication and quantization concurrently with the computation. Quantization usually benefits from lowering learning rates; yet, we always run the 32bit learning rate, and decrease bucket size to reduce variance. We will not quantize small gradient matrices ( $<10$ K elements), since the computational cost of quantizing them significantly exceeds the reduction in communication. However, in all experiments, more than $99\%$ of all parameters are transmitted in quantized form. We reshape matrices to fit bucket sizes, so that no receptive field is split across two buckets.

Communication vs. Computation.

In the first set of experiments, we examine the ratio between computation and communication costs during training, for increased parallelism. The image classification networks are trained on ImageNet, while LSTM is trained on AN4. We examine the cost breakdown for these networks over a pass over the dataset (epoch). Figure 2 gives the results for various networks for image classification. The variance of epoch times is practically negligible (<1%), hence we omit confidence intervals.

Figure 2 leads to some interesting observations. First, based on the ratio of communication to computation, we can roughly split networks into communication-intensive (AlexNet, VGG, LSTM), and computation-intensive (Inception, ResNet). For both network types, the relative impact of communication increases significantly as we increase the number of GPUs. Examining the breakdown for the 32-bit version, all networks could significantly benefit from reduced communication. For example, for AlexNet on 16 GPUs with batch size 1024, more than $80\%$ of training time is spent on communication, whereas for LSTM on 2 GPUs with batch size 256, the ratio is $71\%$ . (These ratios can be slightly changed by increasing batch size, but this can decrease accuracy, see e.g. .)

Next, we examine the impact of QSGD on communication and overall training time. (Communication time includes time spent compressing and uncompressing gradients.) We measured QSGD with 2-bit quantization and 128 bucket size, and 4-bit and 8-bit quantization with 512 bucket size. The results for these two variants are similar, since the different bucket sizes mean that the 4bit version only sends $77\%$ more data than the 2-bit version (but $\sim 8\times$ less than 32-bit). These bucket sizes are chosen to ensure good convergence, but are not carefully tuned.

On 16GPU AlexNet with batch size 1024, 4-bit QSGD reduces communication time by $4\times$ , and overall epoch time by $2.5\times$ . On LSTM, it reduces communication time by $6.8\times$ , and overall epoch time by $2.7\times$ . Runtime improvements are non-trivial for all architectures we considered.

Accuracy.

We now examine how QSGD influences accuracy and convergence rate. We ran AlexNet and ResNet to full convergence on ImageNet, LSTM on AN4, ResNet110 on CIFAR-10, as well as a two-layer perceptron on MNIST. Results are given in Figure 5, and exact numbers are given in Table 1. QSGD tests are performed on an 8GPU setup, and are compared against the best known full-precision accuracy of the networks. In general, we notice that 4bit or 8bit gradient quantization is sufficient to recover or even slightly improve full accuracy, while ensuring non-trivial speedup. Across all our experiments, 8-bit gradients with $512$ bucket size have been sufficient to recover or improve upon the full-precision accuracy. Our results are consistent with recent work noting benefits of adding noise to gradients when training deep networks. Thus, quantization can be seen as a source of zero-mean noise, which happens to render communication more efficient. At the same time, we note that more aggressive quantization can hurt accuracy. In particular, 4-bit QSGD with $8192$ bucket size (not shown) loses $0.57\%$ for top-5 accuracy, and $0.68\%$ for top-1, versus full precision on AlexNet when trained for the same number of epochs. Also, QSGD with 2-bit and 64 bucket size has gap $1.73\%$ for top-1, and $1.18\%$ for top-1.

One issue we examined in more detail is which layers are more sensitive to quantization. It appears that quantizing convolutional layers too aggressively (e.g., 2-bit precision) can lead to accuracy loss if trained for the same period of time as the full precision variant. However, increasing precision to 4-bit or 8-bit recovers accuracy. This finding suggests that modern architectures for vision tasks, such as ResNet or Inception, which are almost entirely convolutional, may benefit less from quantization than recurrent deep networks such as LSTMs.

Additional Experiments.

The full version of the paper contains additional experiments, including a full comparison with 1BitSGD. In brief, QSGD outperforms or matches the performance and final accuracy of 1BitSGD for the networks and parameter values we consider.

Conclusions and Future Work

We have presented QSGD, a family of SGD algorithms which allow a smooth trade off between the amount of communication per iteration and the running time. Experiments suggest that QSGD is highly competitive with the full-precision variant on a variety of tasks. There are a number of optimizations we did not explore. The most significant is leveraging the sparsity created by QSGD. Current implementations of MPI do not provide support for sparse types, but we plan to explore such support in future work. Further, we plan to examine the potential of QSGD in larger-scale applications, such as super-computing. On the theoretical side, it is interesting to consider applications of quantization beyond SGD.

The full version of this paper contains complete proofs, as well as additional applications.

Acknowledgments

The authors would like to thank Martin Jaggi, Ce Zhang, Frank Seide and the CNTK team for their support during the development of this project, as well as the anonymous NIPS reviewers for their careful consideration and excellent suggestions. Dan Alistarh was supported by a Swiss National Fund Ambizione Fellowship. Jerry Li was supported by the NSF CAREER Award CCF-1453261, CCF-1565235, a Google Faculty Research Award, and an NSF Graduate Research Fellowship. This work was developed in part while Dan Alistarh, Jerri Li and Milan Vojnovic were with Microsoft Research Cambridge, UK.

References

Roadmap of the Appendix

Appendix A Proof of Lemmas and Theorems

The first claim obviously holds. We thus turn our attention to the second claim of the lemma. We first note the following bound:

In this section, we describe a scheme for coding $Q_{s}$ and provide an upper bound for the expected number of information bits that it uses, which gives the bound in Theorem 3.2.

To encode a single coordinate, we utilize a lossless encoding scheme for positive integers known as recursive Elias coding or Elias omega coding.

The following are well-known properties of the recursive Elias code which are not too hard to prove.

Moreover, the decoding can be done without previously knowing a bound on the size of $k$ .

This lemma together with Lemma 3.1 suffices to prove Theorem 3.2.

where (a) follows from Jensen’s inequality. ∎

We can bound the number of information bits needed for our coding scheme in terms of the number of non-zeroes of our vector.

First, the float $A$ takes $F$ bits to communicate. Let us now consider the rest of the string. We break up the string into a couple of parts. First, there is the subsequence $S_{1}$ dedicated to pointing to the next nonzero coordinate of $\bm{z}$ . Second, there is the subsequence $S_{2}$ dedicated to communicating the sign and $c(\bm{z}_{i})$ for each nonzero coordinate $i$ . While these two sets of bits are not consecutive within the string, it is clear that they partition the remaining bits in the string. We bound the length of these two substrings separately.

We now bound the length of $S_{2}$ . Per non-zero coordinate of $\bm{z}$ , we need to communicate a sign (which takes one bit), and $c(s\bm{z}_{i})$ . Thus by Lemma A.3, we have that

Putting together (6) and (7) yields the desired conclusion. ∎

We first need the following technical lemma about the number of nonzeros of $Q(\bm{v},s)$ that we have in expectation.

Let $\bm{u}=\bm{v}/\|\bm{v}\|_{2}$ . Let $I(\bm{u})$ denote the set of coordinates $i$ of $\bm{u}$ so that $u_{i}\leq 1/s$ . Since

we must have that $s^{2}\geq n-|I(\bm{u})|$ . Moreover, for each $i\in I(\bm{u})$ , we have that $Q_{i}(\bm{v},s)$ is nonzero with probability $\bm{u}_{i}$ , and zero otherwise. Hence

Let $Q(\bm{v},s)=(\|\bm{v}\|_{2},\bm{\sigma},\bm{\zeta})$ , and let $\bm{u}=\bm{v}/\|\bm{v}\|_{2}$ . Observe that we always have that

It is a straightforward verification that the function $f(x)=x\log\left(\frac{C}{x}\right)$ is concave for all $C>0$ . Moreover, it is increasing up until $x=C/2$ , and decreasing afterwards. Hence, by Jensen’s inequality, Lemma A.5, and the assumption that $s^{2}+\sqrt{n}\leq n/2$ , we have that

Simplifying yields the expression in the Lemma.

For the case of the quantized SGD scheme that requires $\Theta(n)$ bits per iteration, we can improve the constant factor in the bit length bound in Theorem A.2 by using a different encoding of $Q(\bm{v},s)$ . This corresponds to the regime where $s=\sqrt{n}$ , i.e., where the quantized update is not expected to be sparse. In this case, there is no advantage gained by transmitting the location of the next nonzero, since generally that will simply be the next coordinate of the vector. Therefore, we may as well simply transmit the value of each coordinate in sequence.

It is not hard to see that this is equivalent to the bound stated in Theorem 3.3.

where (a) follows from basic properties of logarithms and (b) follows from the concavity of the function $x\mapsto\log(1+x)$ and Jensen’s inequality. Simplifying yields the desired statement. ∎

As in the proof of Lemma A.2, let $Q(\bm{v},s)=(\|\bm{v}\|_{2},\bm{\sigma},\bm{\zeta})$ , and let $\bm{u}=\bm{v}/\|\bm{v}\|_{2}$ . By Lemma A.7, we have

where (a) follows from Jensen’s inequality, and (b) follows from the proof of Lemma 3.1.

Appendix B Quantized SVRG

One common setting in which SGD sees application in machine learning is when $f$ can be naturally expressed as a sum of smooth functions. Formally, we assume that $f(\bm{x})=\frac{1}{m}\sum_{i=1}^{m}f_{i}(\bm{x})$ . When $f$ can be expressed as a sum of smooth functions, this lends itself naturally to SGD. This is because a natural stochastic gradient for $f$ in this setting is, on input $\bm{x}$ , to sample a uniformly random index $i$ , and output $\nabla f_{i}(\bm{x})$ . We will also impose somewhat stronger assumptions on $f$ and $f_{1},f_{2},\ldots,f_{m}$ , namely, that $f$ is strongly convex, and that each $f_{i}$ is convex and smooth.

Note that it is well-known that even if we impose these stronger assumptions on $f$ and $f_{1},f_{2},\ldots,f_{m}$ , then by only applying SGD one still cannot achieve exponential convergence rates, i.e. error rates which improve as $\exp(-T)$ at iteration $T$ . (Such a rate is known in the optimization literature as linear convergence.) However, an epoch-based modification of SGD, known as stochastic variance reduced gradient descent (SVRG) , is able to give such rates in this specific setting. We describe the method below, following the presentation of Bubeck .

Background on SVRG.

With this iterative scheme, we have the following guarantee:

Quantized SVRG.

In parallel SVRG, we are given $K$ processors, each processor $i$ having access to $f_{im/K},\ldots,f_{(i+1)m/K-1}$ . The goal is the same as before: to approximately minimize $f=\frac{1}{m}\sum_{i=1}^{m}f_{i}$ . For processor $i$ , let $h_{i}=\frac{1}{m}\sum_{j=im/K}^{(i+1)m/K-1}f_{i}$ be the portion of $f$ that it knows, so that $f=\sum_{i=1}^{K}h_{i}$ .

A natural question is whether we can apply randomized quantization to reduce communication for parallel SVRG. Whenever one applies our quantization functions to the gradient updates in SVRG, the resulting update is no longer an update of the form used in SVRG, and hence the analysis for SVRG does not immediately give any results in black-box fashion. Instead, we prove that despite this technical issue, one can quantize SVRG updates using our techniques and still obtain the same convergence bounds.

Let $\widetilde{Q}(\bm{v})=Q(\bm{v},\sqrt{n})$ , where $Q(\bm{v},s)$ is defined as in Section 3.1. Our quantized SVRG updates are as follows. Given arbitrary starting point $\bm{x}_{0}$ , we let $\bm{y}^{(1)}=\bm{x}_{0}$ . At the beginning of epoch $p$ , each processor broadcasts

from which the processors collectively form $H_{p}=\sum_{i=1}^{m}H_{p,i}$ without additional communication. Within each epoch, for each iteration $t=1,\ldots,T$ , and for each processor $i=1,\ldots,K$ , we let $j^{(p)}_{i,t}$ be a uniformly random integer from $[m]$ completely independent from everything else. Then, in iteration $t$ in epoch $p$ , processor $i$ broadcasts the update vector

Each processor then computes the total update for that iteration $\bm{u}^{(p)}_{t}=\frac{1}{K}\sum_{i=1}^{K}\bm{u}_{t,i}$ , and sets $\bm{x}^{(p)}_{t+1}=\bm{x}^{(p)}_{t}-\eta\bm{u}^{(p)}_{t}$ . At the end of epoch $p$ , each processor sets $\bm{y}^{(p+1)}=\frac{1}{T}\sum_{t=1}^{T}\bm{x}^{(p)}_{t}$ .

Our main theorem is that this algorithm still converges, and is communication efficient:

Moreover, QSVRG with $P$ epochs and $T$ iterations per epoch requires $\leq P(F+2.8n)(T+1)$ bits of communication per processor.

By Theorem A.6, each processor transmits at most $F+2.8n$ bits per iteration, and then an additional $F+2.8n$ bits per epoch to communicate the $H_{p,i}$ . Thus the claimed communication bound follows trivially.

We now turn our attention to correctness. As with the case of quantized SGD, it is not hard to see that the parallel updates are equivalent to minibatched updates, and serve only to decrease the variance of the random gradient estimate. Hence, as before, for simplicity of presentation, we will consider the effect of quantization on convergence rates on a single processor. In this case, the updates can be written down somewhat more simply. Namely, in iteration $t$ of epoch $p$ , we have that

where $j^{(p)}_{t}$ is a random index of $[m]$ , and $\widetilde{Q}^{(p)}_{t}$ and $\widetilde{Q}^{(p)}$ are all different, independent instances of $\widetilde{Q}$ .

This clearly suffices to show the theorem. Because we only deal with a fixed epoch, for simplicity of notation, we shall proceed to drop the dependence on $p$ in the notation. For $t=1,\ldots,T$ , let $\bm{v}_{t}=\widetilde{Q}_{t}\left(\nabla f_{j_{t}}(\bm{x}_{t})-\nabla f_{j_{t}}(\bm{y})-\widetilde{Q}^{(p)}(\nabla f(\bm{y}))\right)$ be the update in iteration $t$ . It suffices to show the following two equations:

where $C$ is some universal constant. That the first equation is true follows from the unbiasedness of $\widetilde{Q}$ . We now show the second. We have:

Plugging these bounds into proof structure in yields the proof of G.1, as claimed. ∎

Why does naive quantization not achieve this rate?

Our analysis shows that quantized SVRG achieves the communication efficient rate, using roughly 2.8 times as many bits per iteration, and roughly $C/2=8$ times as many iterations. This may beg the question why naive quantization schemes (say, quantizing down to 16 or 32 bits) fails. At a high level, this is because any such quantization can inherently only achieve up to constant error, since the stochastic gradients are always biased by a (small) constant. To circumvent this, one may quantize down to $O(\log 1/\epsilon)$ bits, however, this only matches the upper bound given by , and is off from the optimal rate (which we achieve) by a logarithmic factor.

Appendix C Quantization for Non-convex SGD

As stated previously, our techniques are portable, and apply easily to a variety of settings where SGD is applied. As a demonstration of this, we show here how we may use quantization on top of recent results which show that SGD converges to local minima when applied on smooth, non-convex functions.

Throughout this paper, our theory only considers the case when $f$ is a convex function. In many interesting applications such as neural network training, however, the objective is non-convex, where much less is known. However, there has been an interesting line of recent work which shows that SGD at least always provably converges to a local minima, when $f$ is smooth. For instance, by applying Theorem 2.1 in , we immediately obtain the following convergence result for quantized SGD. Let $Q_{s}$ be the quantization function defined in Section 3.1. Here we will only state the convergence bound; the communication complexity per iteration is the same as in 3.1.

Observe that the only difference in the assumptions in from what we generally assume is that they assume a variance bound on the stochastic gradients, whereas we prefer a second moment bound. Hence our result applies immediately to their setting.

Another recent result demonstrates local convergence for SGD for smooth non-convex functions in asynchronous settings. The formulas there are more complicated, so for simplicity we will not reproduce them here. However, it is not hard to see that quantization affects the convergence bounds there in a manner which is parallel to Theorem 3.2.

Appendix D Asynchronous QSGD

We consider an asynchronous parameter-server model , modelled identically as in [29, Section 3]. In brief, the system consists of a star-shaped network, with a central parameter server, communicating with worker nodes, which exchange information with the master independently and simultaneously. Asynchrony consists of the fact that competing updates might be applied by the master to the shared parameter (but workers always get a consistent version of the parameter).

In this context, the following follows from [29, Theorem 1]:

We then have the following ergodic convergence rate for the iteration of QSGD with quantization function $Q_{s}$ . Let $\gamma=\sum_{k=1}^{K}\gamma_{k}$ , and $\sigma_{s}=(1+\min(n/s^{2},\sqrt{n}/s))\sigma$ . Then:

Appendix E Experiments

We now empirically validate our approach on data-parallel GPU training of deep neural networks.

We performed experiments on Amazon EC2 p2.16xlarge instances, using up to 16 NVIDIA K80 GPUs. Instances have GPUDirect peer-to-peer communication, but do not currently support NVIDIA NCCL extensions. We have implemented QSGD on GPUs using the Microsoft Cognitive Toolkit (CNTK) . This package provides efficient (MPI-based) GPU-to-GPU communication, and implements an optimized version of 1bit-SGD . Our code is released both as open-source and as a docker instance.

We do not quantize small gradient matrices in QSGD, since the computational cost of quantizing small matrices significantly exceeds the reduction in communication from quantization. However, in all experiments, at least $99\%$ of all parameters are transmitted in quantized form. If required, we reshape matrices to fit bucket sizes.

We execute two types of tasks: image classification on the ILSVRC (ImageNet) , CIFAR-10 , and MNIST datasets, and speech recognition on the CMU AN4 dataset . For vision, we experimented with AlexNet , VGG , ResNet , and Inception with Batch Normalization deep networks. For speech, we trained an LSTM network . See Table 1.

We used standard sizes for the networks, with hyper-parameters optimized for the 32bit precision variant.Unless otherwise stated, we use the default networks and hyper-parameters available in the open-source CNTK 2.0. Full details for networks and experiments are given in the additional material. We increased batch size when necessary to balance communication and computation for larger GPU counts, and we employed double buffering to perform communication and quantization concurrently with the computation. Quantization usually benefits from lowering learning rates; yet, we always run the 32bit learning rate, and decrease bucket size to reduce variance if needed.

Communication vs. Computation.

In the first set of experiments, we examine the ratio between computation and communication costs during training, for increased parallelism. The image classification networks are trained on ImageNet, while LSTM is trained on AN4. We examine the cost breakdown for these networks over a pass over the dataset (epoch). Figure 2 gives image classification results. The variance of epoch times is practically negligible.

The data leads to some interesting observations. First, based on the ratio of communication to computation, we can roughly split networks into communication-intensive (AlexNet, VGG, LSTM), and computation-intensive (Inception, ResNet). For both network types, the relative impact of communication increases significantly as we increase the number of GPUs. Examining the breakdown for the 32-bit version, all networks could significantly benefit from reduced communication. For example, for AlexNet on 16 GPUs with batch size 1024, more than $80\%$ of training time is spent on communication, whereas for LSTM on 2 GPUs with batch size 256, the proportion is $71\%$ communication.These ratios can be improved by increasing batch size. However, increasing batch size further hurts convergence and decreases accuracy, see also e.g. .

Next, we examine the impact of QSGD on communication and overall training time. (For QSGD, communication time includes time spent compressing and uncompressing gradients.) We measured QSGD with 2-bit quantization and 64 bucket size, and 4-bit quantization and 8192 bucket size. The results for these two variants are similar, since the different bucket sizes mean that the 4bit version only sends $77\%$ more data than the 2-bit version (but $\sim 8\times$ less than 32-bit). These bucket sizes are chosen to ensure good convergence, but are not carefully tuned.

Accuracy.

We now examine how QSGD influences accuracy and convergence rate. We ran AlexNet to full convergence on ImageNet, LSTM on AN4, ResNet110 on CIFAR-10, as well as a two-layer perceptron on MNIST. Results are given in Figure 5.

On ImageNet using AlexNet, 4-bit QSGD with $8192$ bucket size converges to $59.22\%$ top-1 error, and $81.63\%$ top-5 error. The gap from the 32bit version is $0.57\%$ for top-5, and $0.68\%$ for top-1 . QSGD with 2-bit and 64 bucket size has gap $1.73\%$ for top-1, and $1.18\%$ for top-1. We note that we did not tune bucket size, number of bits used, number of epochs or learning rate for this experiment.

On AN4 using LSTMs, 2-bit QSGD has similar convergence rate and the same accuracy as 32bit. It is able to converge $3\times$ faster to the target accuracy with respect to full precision, thanks to reduced communication overheads. The 4-bit variant has the same convergence and accuracy, but is slightly slower than 2-bit (by less than $10\%$ ).

On CIFAR-10, 2-bit QSGD applied to ResNet-110 drops about $1.22\%$ top-1 accuracy points. However, 4-bit QSGD converges to the same accuracy as the original, whereas 8-bit QSGD improves accuracy by $0.33\%$ . We observe a similar result on MNIST, where 2-bit QSGD with buckets equal to the size of hidden layers improves accuracy by $0.5\%$ . These results are consistent with recent work noting benefits of added noise in training deep networks. Linear models on e.g. MNIST do not show such improvements.

One issue we examined in more detail is which layers are more sensitive to quantization. It appears that quantizing convolutional layers too aggressively (e.g., 2-bit precision) can lead to accuracy loss if not trained further. However, increasing precision to 4-bit or 8-bit recovers accuracy. This finding suggests that modern architectures for vision tasks, such as ResNet or Inception, which are almost entirely convolutional, may benefit less from quantization than recurrent deep networks such as LSTMs.

Comparison with 1BitSGD.

We have also compared against the 1BitSGD algorithm of . Before discussing results, it is important to note some design choices made in the CNTK implementation of 1BitSGD. For objects without dynamic dimensions, the first tensor dimension is the “row" while the rest are flattened onto “columns." At the same time, 1BitSGD always quantizes per column. In practice, this implies that quantization is often applied to a column of very small dimension ( $1$ – $3$ ), especially in the case of networks with many convolutions. This has the advantage of having extremely low variance, but does not yield any communication benefits. In fact, it can hurt performance due to the cost of quantization. (By contrast, we reshape to quantize on large dimensions.)

Given this artefact, 1BitSGD is slower than even the 32bit version on heavily convolutional networks such as ResNet and Inception. However, 1BitSGD matches the performance of 2-bit and 4-bit QSGD on AlexNet, VGG, and LSTMs within $10\%$ . In general, 1BitSGD attains very good accuracy (on par with 32bit), probably since the more delicate convolutional layers are not quantized. QSGD has the advantage of being able to perform quantization on the fly, without error accumulation: this saves memory, since we do not need to allocate an additional model copy.

Appendix F Quantized Gradient Descent: Description and Analysis

In this section, we consider the effect of lossy compression on standard (non-stochastic) gradient descent. Since this procedure is not data-parallel, we will first have to modify the blueprint for the iterative procedure, as described in Algorithm 2. In particular, we assume that, instead of directly applying the gradient to the iterate $\bm{x}_{t+1}$ , the procedure first quantizes the gradient, before applying it. This setting models a scenario where the model and the computation are performed by different machines, and we wish to reduce the communication cost of the gradient updates.

We now give a quantization function tailored for gradient descent, prove convergence of gradient descent with quantization, and then finally bound the length of the encoding.

Further, define $Q(\bm{v})$ to be the vector

Practically, we preserve the sign for each index in $I(\bm{v})$ , the 2-norm of $\bm{v}$ , and cancel out all remaining components of $\bm{v}$ .

Convergence Bound.

We begin by proving some properties of our quantization function. We have the following:

$\|Q(\bm{v})\|^{2}\leq\sqrt{n}\|\bm{v}\|^{2}$ .

For the first claim, observe that $\bm{v}^{T}Q(\bm{v})=\|\bm{v}\|\sum_{i\in I(\bm{v})}|\bm{v}_{i}|\geq\|\bm{v}\|^{2}$ .

We now prove the second claim. Let $\bm{v}=(v_{1},\ldots,v_{n})$ , and without loss of generality, assume that $|v_{i}|\geq|v_{i+1}|$ for all $i=1,\ldots,n-1$ , so that the coordinates are in decreasing order. Then $I(\bm{v})=\{1,\ldots,D\}$ for some $D$ .

We show that if $D\geq\sqrt{n}$ then $\sum_{i=1}^{D}|v_{i}|\geq\|\bm{v}\|$ , which shows that $|I(\bm{v})|\leq\sqrt{n}$ . Indeed, we have that

and so we see that if $D=\sqrt{n}$ , we must have $\left(\sum_{i=1}^{D}|v_{i}|\right)^{2}\geq\|\bm{v}\|^{2}$ , as claimed.

For the third claim, observe that $\|Q(\bm{v})\|^{2}=\|\bm{v}\|^{2}\cdot|I(\bm{v})|$ ; thus the claim follows from the previous upper bound on the cardinality of $I(\bm{v})$ . ∎

To establish convergence of the quantized method, we prove the following theorem.

We first establish the following two properties:

The first property follows directly from the definitions of strong convexity and smoothness. We now show the second property. If $\bm{x}=\bm{x}^{\ast}$ the property trivially holds so assume that this does not happen. By Lemma F.1, we have

With all this in place, we can now complete the proof of the theorem. Fix $t\geq 0$ . By applying the lemma, we have:

Moreover, observe that, from standard properties of smooth functions , we have

Thus, we obtain the following chain of inequalities:

where (a) follows from the convexity of $f$ , (b) follows from Equation 12 and the Cauchy-Schwarz inequality, (c) follows from the $L$ -smoothness of $f$ , (d) follows from Lemma F.1, and (e) follows from Equation 13. By our choice of $\eta$ , we know that the RHS of Equation 14 is negative. Hence, by Lemma F.3 and the definition of $\eta$ , we have

Letting $\delta_{t}=f(\bm{x}_{t})-f(\bm{x}^{\ast})$ , and observing that $f(\bm{x}_{t+1})-f(\bm{x}_{t})=\delta_{t+1}-\delta_{t}$ , we see that Equation 14 is equivalent to the statement that

Encoding Length.

Appendix G Quantized SVRG