Sparsified SGD with Memory

Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi

Introduction

Stochastic Gradient Descent (SGD) and variants thereof (e.g. ) are among the most popular optimization algorithms in machine- and deep-learning . SGD consists of iterations of the form

In this paper we give a concise convergence rate analysis for SGD with memory and $k$ -compression operatorsSee Definition 2.1., such as (but not limited to) top- $k$ sparsification. Our analysis also supports ultra-sparsification operators for which $k<1$ , i.e. where less than one coordinate of the stochastic gradient is applied on average in (1). We not only provide the first convergence result of this method, but the result also shows that the method converges at the same rate as vanilla SGD.

There are several ways to reduce the communication in SGD. For instance by simply increasing the amount of computation before communication, i.e. by using large mini-batches (see e.g. ), or by designing communication-efficient schemes . These approaches are a bit orthogonal to the methods we consider in this paper, which focus on quantization or sparsification of the gradient.

Several papers consider approaches that limit the number of bits to represent floating point numbers . Recent work proposes adaptive tuning of the compression ratio . Unbiased quantization operators not only limit the number of bits, but quantize the stochastic gradients in such a way that they are still unbiased estimators of the gradient . The ZipML framework also applies this technique to the data . Sparsification methods reduce the number of non-zero entries in the stochastic gradient .

A very aggressive sparsification method is to keep only very few coordinates of the stochastic gradient by considering only the coordinates with the largest magnitudes . In contrast to the unbiased schemes it is clear that such methods can only work by using some kind of error accumulation or feedback procedure, similar to the one the we have already discussed , as otherwise certain coordinates could simply never be updated. However, in certain applications no feedback mechanism is needed . Also more elaborate sparsification schemes have been introduced .

Asynchronous updates provide an alternative solution to disguise the communication overhead to a certain amount . However, those methods usually rely on a sparsity assumption on the updates , which is not realistic e.g. in deep learning. We like to advocate that combining gradient sparsification with those asynchronous schemes seems to be a promising approach, as it combines the best of both worlds. Other scenarios that could profit from sparsification are heterogeneous systems or specialized hardware, e.g. accelerators .

Convergence proofs for SGD typically rely on averaging the iterates , though convergence of the last iterate can also be proven . For our convergence proof we rely on averaging techniques that give more weight to more recent iterates , as well as the perturbed iterate framework from Mania et al. and techniques from .

Simultaneous to our work, at NIPS 2018 propose related schemes. Whilst Tang et al. only consider unbiased stochastic compression schemes, Alistarh et al. study biased top- $k$ sparsification. Their scheme also uses a memory vector to compensate for the errors, but their analysis suffers from a slowdown proportional to $k$ , which we can avoid here. Another simultaneous analysis of Wu et al. at ICML 2018 is restricted to unbiased gradient compression. This scheme also critically relies on an error compensation technique, but in contrast to our work the analysis is restricted to quadratic functions and the scheme introduces two additional hyperparameters that control the feedback mechanism.

2 Contributions

We introduce the method formally in Section 2 and show a sketch of the convergence proof in Section 3. In Section 4 we include a few numerical experiments for illustrative purposes. The experiments highlight that top- $k$ sparsification yields a very effective compression method and does not hurt convergence. We also report results for a parallel multi-core implementation of SGD with memory that show that the algorithm scales as well as asynchronous SGD and drastically decreases the communication cost without sacrificing the rate of convergence. We like to stress that the effectiveness of SGD variants with sparsification techniques has already been demonstrated in practice .

Although we do not yet provide convergence guarantees for parallel and asynchronous variants of the scheme, this is the main application of this method. For instance, we like to highlight that asynchronous SGD schemes could profit from the gradient sparsification. To demonstrate this use-case, we include in Section 4 a set of experiments for a multi-core implementation.

SGD with Memory

In this section we present the sparsified SGD algorithm with memory. First we introduce sparsification and quantization operators which allow us to drastically reduce the communication cost in comparison with vanilla SGD.

We consider compression operators that satisfy the following contraction property:

The contraction property is sufficient to obtain all mathematical results that are derived in this paper. However, note that (4) does not imply that $\operatorname{comp}(\mathbf{x})$ is a necessarily sparse vector. Also dense vectors can satisfy (4). One of the main goals of this work is to derive communication efficient schemes, thus we are particularly interested in operators that also ensure that $\operatorname{comp}(\mathbf{x})$ can be encoded much more efficiently than the original $\mathbf{x}$ .

The following two operators are examples of $k$ -contraction operators with the additional property of being $k$ -sparse vectors:

where $\pi$ is a permutation of $[d]$ such that $(\left\lvert\mathbf{x}\right\rvert)_{\pi(i)}\geq(\left\lvert\mathbf{x}\right\rvert)_{\pi(i+1)}$ for $i=1,\dots,d-1$ . We abbreviate $\operatorname{rand}_{k}(\mathbf{x})$ whenever the second argument is chosen uniformly at random, $\omega\sim_{\rm u.a.r.}\Omega_{k}$ .

It is easy to see that both operators satisfy Definition 2.1 of being a $k$ -contraction. For completeness the proof is included in Appendix A.1.

We like to highlight that many other operators do satisfy Definition 2.1, not only the two examples given in Definition 2.2. As a notable variant is to pick a random coordinate of a vector with probability $\frac{k}{d}$ , for $0<k\leq 1$ , property (4) holds even if $k<1$ . I.e. it suffices to transmit on average less than one coordinate per iteration (this would then correspond to a mini-batch update).

2 Variance Blow-up for Unbiased Updates

Before introducing SGD with memory we first discuss a motivating example. Consider the following variant of SGD, where $(d-k)$ random coordinates of the stochastic gradient are dropped:

It is well known that by using mini-batches the variance of the gradient estimator can be reduced. If we consider in (6) the estimator $\mathbf{g}_{t}:=\frac{d}{k}\cdot\operatorname{rand}_{k}\bigl{(}\frac{1}{\tau}\sum_{i\in\mathcal{I}_{\tau}}\nabla f_{i}(\mathbf{x}_{t})\bigr{)}$ for $\tau=\lceil\frac{k}{d}\rceil$ , and $\mathcal{I}_{\tau}\sim_{\rm u.a.r.}\binom{[n]}{k}$ instead, we have

This shows that, when using mini-batches of appropriate size, the sparsification of the gradient does not hurt convergence. However, by increasing the mini-batch size, we increase the computation by a factor of $\frac{d}{k}$ .

3 SGD with Memory: Algorithm and Convergence Results

where $i_{t}\sim_{\rm u.a.r.}[n]$ , $\mathbf{m}_{0}:=0$ and $\{\eta_{t}\}_{t\geq 0}$ denotes a sequence of stepsizes. The pseudocode is given in Algorithm 1. Note that the gradients get multiplied with the stepsize $\eta_{t}$ at the timestep $t$ when they put into memory, and not when they are (partially) retrieved from the memory.

We state the precise convergence result for Algorithm 1 in Theorem 2.4 below. In Remark 2.6 we give a simplified statement in big- $\mathcal{O}$ notation for a specific choice of the stepsizes $\eta_{t}$ .

where $\bar{\mathbf{x}}_{T}=\frac{1}{S_{T}}\sum_{t=0}^{T-1}w_{t}\mathbf{x}_{t}$ , for $w_{t}=(a+t)^{2}$ , and $S_{T}=\sum_{t=0}^{T-1}w_{t}\geq\frac{1}{3}T^{3}$ .

Theorem 2.4 says that for any shift $a>1$ there is a parameter $\alpha(a)>4$ such that (9) holds. However, for the choice $a=\mathcal{O}(1)$ one has to set $\alpha$ such that $\frac{\alpha}{\alpha-4}=\Omega(\frac{d}{k})$ and the last term in (9) will be of order $\mathcal{O}(\frac{d^{3}}{k^{3}T^{2}})$ , thus requiring $T=\Omega(\frac{d^{1.5}}{k^{1.5}})$ steps to yield convergence. For $\alpha\geq 5$ we have $\frac{\alpha}{\alpha-4}=\mathcal{O}(1)$ and the last term is only of order $\mathcal{O}(\frac{d^{2}}{k^{2}T^{2}})$ instead. However, this requires typically a large shift. Observe $\frac{(\alpha+1)\frac{d}{k}+\rho}{\rho+1}\leq 1+(\alpha+1)\frac{d}{k}\leq(\alpha+2)\frac{d}{k}$ , i.e. setting $a=(\alpha+2)\frac{d}{k}$ is enough. We like to stress that in general it is not advisable to set $a\gg(\alpha+2)\frac{d}{k}$ as the first two terms in (9) depend on $a$ . In practice, it often suffices to set $a=\frac{d}{k}$ , as we will discuss in Section 4.

As discussed in Remark 2.5 above, setting $\alpha=5$ and $a=(\alpha+2)\frac{d}{k}$ is feasible. With this choice, equation (9) simplifies to

Proof Outline

We now give an outline of the proof. The proofs of the lemmas are given in Appendix A.2.

where the sequences $\{\mathbf{x}_{t}\}_{t\geq 0},\{\eta_{t}\}_{t\geq 0}$ and $\{i_{t}\}_{t\geq 0}$ are the same as in (8). Notice that

Bounding the memory.

Optimal averaging.

Similar as discussed in we have to define a suitable averaging scheme for the iterates $\{\mathbf{x}_{t}\}_{t\geq 0}$ to get the optimal convergence rate. In contrast to that use linearly increasing weights, we use quadratically increasing weights, as for instance .

Let $\{a_{t}\}_{t\geq 0}$ , $a_{t}\geq 0$ , $\{e_{t}\}_{t\geq 0}$ , $e_{t}\geq 0$ , be sequences satisfying

for $\eta_{t}=\frac{8}{\mu(a+t)}$ and constants $A,B\geq 0$ , $\mu>0$ , $a>1$ . Then

for $w_{t}=(a+t)^{2}$ and $S_{T}:=\sum_{t=0}^{T-1}w_{t}=\frac{T}{6}\left(2T^{2}+6aT-3T+6a^{2}-6a+1\right)\geq\frac{1}{3}T^{3}$ .

Proof of Theorem 2.4.

Experiments

We present numerical experiments to illustrate the excellent convergence properties and communication efficiency of Mem-SGD. As the usefulness of SGD with sparsification techniques has already been shown in practical applications we focus here on a few particular aspects. First, we verify the impact of the initial learning rate that did come up in the statement of Theorem 2.4. We then compare our method with QSGD which decreases the communication cost in SGD by using random quantization operators, but without memory. Finally, we show the performance of the parallel SGD depicted in Algorithm 2 in a multi-core setting with shared memory and compare the speed-up to asynchronous SGD.

Datasets.

We consider a dense dataset, epsilon , as well as a sparse dataset, RCV1 where we train on the larger test set. Statistics on the datasets are listed in Table 1 below:

Implementation.

We use Python3 and the numpy library . Our code is open-source and publicly available at github.com/epfml/sparsifiedSGD. We emphasize that our high level implementation is not optimized for speed per iteration but for readability and simplicity. We only report convergence per iteration and relative speedups, but not wall-clock time because unequal efforts have been made to speed up the different implementations. Plots additionally show the baseline computed with the standard optimizer LogisticSGD of scikit-learn . Experiments were run on an Ubuntu 18.04 machine with a 24 cores processor Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz.

2 Verifying the Theory

We study the convergence of the method using the stepsizes $\eta_{t}=\gamma/(\lambda(t+a))$ and hyperparameters $\gamma$ and $a$ set as in Table 2. We compute the final estimate $\bar{\mathbf{x}}$ as a weighted average of all iterates $\mathbf{x}_{t}$ with weights $w_{t}=(t+a)^{2}$ as indicated by Theorem 2.4. The results are depicted in Figure 2. We use $k\in\{1,2,3\}$ for epsilon and $k\in\{10,20,30\}$ for RCV1 to increase the difference with large number of features. The $\operatorname{top}_{k}$ variant consistently outperforms $\operatorname{rand}_{k}$ and sometimes outperforms vanilla SGD, which is surprising and might come from feature characteristics of the datasets. We also evaluate the impact of the delay $a$ in the learning rate: setting it to $1$ instead of order $\mathcal{O}(d/k)$ dramatically hurts the memory and requires time to recover from the high initial learning rate (labeled “without delay” in Figure 2).

We experimentally verified the convergence properties of Mem-SGD for different sparsification operators and stepsizes but we want to further evaluate its fundamental benefits in terms of sparsity enforcement and reduction of the communication bottleneck. The gain in communication cost of SGD with memory is very high for dense datasets—using the $\operatorname{top}_{1}$ strategy on epsilon dataset improves the amount of communication by $10^{3}$ compared to SGD. For the sparse dataset, SGD can readily use the given sparsity of the gradients. Nevertheless, the improvement for $\operatorname{top}_{10}$ on RCV1 is of approximately an order of magnitude.

3 Comparison with QSGD

Now we compare Mem-SGD with the QSGD compression scheme which reduces communication cost by random quantization. The accuracy (and the compression ratio) in QSGD is controlled by a parameter $s$ , corresponding to the number of quantization levels. Ideally, we would like to set the quantization precision in QSGD such that the number of bits transmitted by QSGD and Mem-SGD are identical and compare their convergence properties. However, even for the lowest precision, QSGD needs to send the sign and index of $\mathcal{O}(\sqrt{d})$ coordinates. It is therefore not possible to reach the compression level of sparsification operators such as top- $k$ or random- $k$ , that only transmit a constant number of bits per iteration (up to logarithmic factors).Encoding the indices of the top- $k$ or random- $k$ elements can be done with additional $\mathcal{O}(k\log d)$ bits. Note that $\log d\leq 32\leq\sqrt{d}$ for both our examples. Hence, we did not enforce this condition and resorted to pick reasonable levels of quantization in QSGD ( $s=2^{b}$ with $b\in\{2,4,8\}$ ). Note that $b$ -bits stands for the number of bits used to encode $s=2^{b}$ levels but the number of bits transmitted in QSGD can be reduced using Elias coding. As a fair comparison in practice, we chose a standard learning rate $\gamma_{0}/(1+\gamma_{0}\lambda t)^{-1}$ , tuned the hyperparameter $\gamma_{0}$ on a subset of each dataset (see Appendix B). Figure 3 shows that Mem-SGD with $\operatorname{top}_{1}$ on epsilon and RCV1 converges as fast as QSGD in term of iterations for 8 and 4-bits. As shown in the bottom of Figure 3, we are transmitting two orders of magnitude fewer bits with the $\operatorname{top}_{1}$ sparsifier concluding that sparsification offers a much more aggressive and performant strategy than quantization.

4 Multicore experiment

We implement a parallelized version of Mem-SGD, as depicted in Algorithm 2. The enforced sparsity allows us to do the update in shared memory using a lock-free mechanism as in . For this experiment we evaluate the final iterate $\mathbf{x}_{T}$ instead of the weighted average $\bar{\mathbf{x}}_{T}$ above, and use the learning rate $\eta_{t}\equiv(1+t)^{-1}$ .

Figure 4 shows the speed-up obtained when increasing the number of cores. We see that both sparsified SGD and vanilla SGD have a linear speed-up, the slopes are dependent of the implementation details. But we observe that Parallel-Mem-SGD with a reasonable sparsification parameter $k$ does not suffer of having multiple independent memories. The experiment is run on a single machine with a 24 core processor, hence no inter-node communication is used. The main advantage of our method—overcoming the communication bottleneck— would be even more visible in a multi-node setup. In this asynchronous setup, SGD with memory computes gradients on stale iterates that differ only by a few coordinates. It encounters fewer inconsistent read/write operations than lock free asynchronous SGD and exhibits better scaling properties on the RCV1 dataset. The $\operatorname{top}_{k}$ operator performs better than $\operatorname{rand}_{k}$ in the sequential setup, but this is not the case in the parallel setup.

Conclusion

We provide the first concise convergence analysis of sparsified SGD . This extremely communication-efficient variant of SGD enforces sparsity of the applied updates by only updating a constant number of coordinates in every iteration. This way, the method overcomes the communication bottleneck of SGD, while still enjoying the same convergence rate in terms of stochastic gradient computations.

Our experiments verify the drastic reduction in communication cost by demonstrating that Mem-SGD requires one to two orders of magnitude less bits to be communicated than QSGD while converging to the same accuracy. The experiments show an advantage for the top- $k$ sparsification over random sparsification in the serial setting, but not in the multi-core shared memory implementation. There, both schemes are on par, and show better scaling than a simple shared memory implementation that just writes the unquantized updates in a lock-free asynchronous fashion (like Hogwild! ).

The theoretical insights to Mem-SGD that were developed here should facilitate the analysis of the same scheme in the parallel (as developped in ) and the distributed setting. It has already been shown in practice that gradient sparsification can be efficiently applied to bandwidth memory limited systems such as multi-GPU training for neural networks . By delivering sparsity no matter if the original gradients were sparse or not, our scheme is not only communication efficient, but becomes more eligible for asynchronous implementations as well. While those were so far limited by strict sparsity assumptions (as e.g. in ), our approach might make such methods much more widely applicable.

Acknowledgments

We would like to thank Dan Alistarh for insightful discussions in the early stages of this project and Frederik Künstner for his useful comments on the various drafts of this manuscript. We acknowledge funding from SNSF grant 200021_175796, Microsoft Research JRC project ‘Coltrain’, as well as a Google Focused Research Award.

References

Appendix A Proofs

Let $\eta_{t}=\frac{1}{c+t}$ , for $c\geq 1$ . Then $\eta_{t}^{2}\left(1-\frac{2}{c}\right)\leq\eta_{t+1}^{2}$ .

A.2 Proof of the Main Theorem

and with $\left\lVert\mathbf{a}+\mathbf{b}\right\rVert^{2}\leq 2\left\lVert\mathbf{a}\right\rVert^{2}+2\left\lVert\mathbf{b}\right\rVert^{2}$ we further have

Putting these two estimates together, we can bound (23) as follows:

First, observe that by Lemma A.1 and $\left\lVert\mathbf{a}+\mathbf{b}\right\rVert^{2}\leq(1+\gamma)\left\lVert\mathbf{a}\right\rVert^{2}+(1+\gamma^{-1})\left\lVert\mathbf{b}\right\rVert^{2}$ for $\gamma>0$ we have

On the other hand, from $\left\lVert\sum_{i=1}^{s}\mathbf{a}_{i}\right\rVert^{2}\leq s\sum_{i=1}^{s}\left\lVert\mathbf{a}_{i}\right\rVert^{2}$ we also have

Now the claim follows from Lemma A.3 just below with $A=\frac{8G^{2}}{\mu}$ . ∎

Let $A\geq 0$ , $d\geq k\geq 1$ , $\{h_{t}\}_{t\geq 0}$ , $h_{t}\geq 0$ be a sequence satisfying

for a sequence $\eta_{t}=\frac{1}{a+t}$ with $a\geq\frac{(\alpha+1)\frac{d}{k}+\rho+1}{\rho+1}>1$ , for $\alpha>4$ , $\rho:=\frac{4\alpha}{(\alpha-4)(\alpha+1)^{2}}$ . Then

Let $t_{0}=\max\{\lceil\alpha\frac{d}{k}-a\rceil,0\}$ , i.e. $\eta_{t_{0}}\leq\frac{k}{\alpha d}$ . (Note that for any $a\geq\alpha\frac{k}{d}$ it holds $t_{0}=0$ .) Suppose the claim holds for $t\leq t_{0}$ . Observe,

for $t\geq t_{0}$ . This follows from Lemma A.2 with $c=\frac{\alpha d}{k}$ . By induction,

where we used $t\geq t_{0}$ (and the observation just above) for the last inequality.

Small t𝑡t.

Assume $t_{0}\geq 1$ , otherwise the claim follows from the part above. We have

using $\frac{d}{k}\geq 1$ . Observe $t_{0}\leq\alpha\frac{d}{k}-a+1\leq(\alpha+1)\frac{d}{k}-a$ . For $t\leq(\alpha+1)\frac{d}{k}-a$ we have

by the condition on $a$ . Hence, by combining these observations,

We now multiply equation (15) with $\frac{w_{t}}{\eta_{t}}$ , which yields

and by recursively substituting $a_{t}\frac{w_{t-1}}{\eta_{t-1}}$ we get

We will now derive upper bounds for the terms on the right hand side. We have

Let $S_{T}:=\sum_{t=0}^{T-1}w_{t}=\frac{T}{6}\left(2T^{2}+6aT-3T+6a^{2}-6a+1\right)$ . Observe

Appendix B Experiments

To produce a fair comparison between Mem-SGD and QSGD , we fix the learning rate to $\gamma_{0}/(1+\gamma_{0}\lambda t)^{-1}$ and run a grid search on the $\gamma_{0}$ hyperparameter (individually for each method). The results are displayed in Figure 5.

QSGD communicated bits.

The number of bits needed by QSGD with $s$ quantization levels to communicate the gradient at each iteration is $\min\bigl{\{}(\lceil\log_{2}(s)\rceil+1)\cdot d,\,3s(s+\sqrt{d})+32\bigr{\}}$ where $d$ is the size of the gradient. The first expression corresponds to the naïve encoding (i.e. index/value pairs), the second expression corresponds to the estimates of the more evolved Elias encoding (see e.g. [3, Theorem 3.2]). For the sparse dataset RCV1-test, we additionally assume that QSGD is aware of the sparsity of the gradients ( $d\approx 71\ll 47^{\prime}236$ ) and send only the quantized non zero coordinates with their indexes. In a nutshell, we chose the best communication pattern for QSGD to conduct a fair comparison with Mem-SGD.