signSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Anima Anandkumar

Introduction

Deep neural networks have learnt to solve numerous natural human tasks (LeCun et al., 2015; Schmidhuber, 2015). Training these large-scale models can take days or even weeks. The learning process can be accelerated by distributing training over multiple processors—either GPUs linked within a single machine, or even multiple machines linked together. Communication between workers is typically handled using a parameter-server framework (Li et al., 2014), which involves repeatedly communicating the gradients of every parameter in the model. This can still be time-intensive for large-scale neural networks. The communication cost can be reduced if gradients are compressed before being transmitted. In this paper, we analyse the theory of robust schemes for gradient compression.

An elegant form of gradient compression is just to take the sign of each coordinate of the stochastic gradient vector, which we call signSGD. The algorithm is as simple as throwing away the exponent and mantissa of a 32-bit floating point number. Sign-based methods have been studied at least since the days of Rprop (Riedmiller & Braun, 1993). This algorithm inspired many popular optimisers—like RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). But researchers were interested in Rprop and variants because of their robust and fast convergence, and not their potential for gradient compression.

We then analyse signSGD in the distributed setting where the parameter server aggregates gradient signs of the workers by a majority vote. Thus we allow worker-server communication to be 1-bit compressed in both directions. We prove that the theoretical speedup matches that of distributed SGD, under natural assumptions that are validated by experiments.

We also extend our theoretical framework to the Signum optimiser—which takes the sign of the momentum. Our theory suggests that momentum may be useful for controlling a tradeoff between bias and variance in the estimate of the stochastic gradient. On the practical side, we show that Signum easily scales to large Imagenet models, and provided the learning rate and weight decay are tuned, all other hyperparameter settings—such as momentum, weight initisialiser, learning rate schedules and data augmentation—may be lifted from an SGD implementation.

Related Work

Distributed machine learning: From the information theoretic angle, Suresh et al. (2017) study the communication limits of estimating the mean of a general quantity known about only through samples collected from MM workers. In contrast, we focus exclusively on communication of gradients for optimisation, which allows us to exploit the fact that we do not care about incorrectly communicating small gradients in our theory. Still our work has connections with information theory. When the parameter server aggregates gradients by majority vote, it is effectively performing maximum likelihood decoding of a repetition encoding of the true gradient sign that is supplied by the M workers.

As for existing gradient compression schemes, Seide et al. (2014) and Strom (2015) demonstrated empirically that 1-bit quantisation can still give good performance whilst dramatically reducing gradient communication costs in distributed systems. Alistarh et al. (2017) and Wen et al. (2017) provide schemes with theoretical guarantees by using random number generators to ensure that the compressed gradient is still an unbiased approximation to the true gradient. Whilst unbiasedness allows these schemes to bootstrap SGD theory, it unfortunately comes at the cost of hugely inflated variance, and this variance explosionFor the version of QSGD with 1-bit compression, the variance explosion is by a factor of d\sqrt{d}, where dd is the number of weights. It is common to have d>108d>10^{8} in modern deep networks. basically renders the SGD-style bounds vacuous in the face of the empirical success of these algorithms. The situation only gets worse when the parameter server must aggregate and send back the received gradients, since merely summing up quantised updates reduces the quantisation efficiency. We compare the schemes in Table 1—notice how the existing schemes pick up log factors in the transmission from parameter-server back to workers. Our proposed approach is different, in that we directly employ the sign gradient which is biased. This avoids the randomisation needed for constructing an unbiased quantised estimate, avoids the problem of variance exploding in the theoretical bounds, and even enables 1-bit compression in both directions between parameter-server and workers, at no theoretical loss compared to distributed SGD.

To date there has been no convincing theory of the {Rprop, RMSprop, Adam} family of algorithms, known as ‘adaptive gradient methods’. Indeed Reddi et al. (2018) point out problems in the original convergence proof of Adam, even in the convex setting. Since signSGD belongs to this same family of algorithms, we expect that our theoretical analysis should be relevant for all algorithms in the family. In a parallel work, Balles & Hennig (2017) explore the connection between signSGD and Adam in greater detail, though their theory is more restricted and lives in the convex world, and they do not analyse Signum as we do but employ it on heuristic grounds.

Optimisation: much of classic optimisation theory focuses on convex problems, where local information in the gradient tells you global information about the direction to the minimum. Whilst elegant, this theory is less relevant for modern problems in deep learning which are non-convex. In non-convex optimisation, finding the global minimum is generally intractable. Theorists usually settle for measuring some restricted notion of success, such as rate of convergence to stationary points (Ghadimi & Lan, 2013; Allen-Zhu, 2017a) or local minima (Nesterov & Polyak, 2006). Though Dauphin et al. (2014) suggest saddle points should abound in neural network error landscapes, practitioners report not finding this a problem in practice (Goodfellow et al., 2015) and therefore a theory of convergence to stationary points is useful and informative.

Experimental benchmarks: throughout the paper we will make use of the CIFAR-10 (Krizhevsky, 2009) and Imagenet (Russakovsky et al., 2015) datasets. As for neural network architectures, we train Resnet-20 (He et al., 2016a) on CIFAR-10, and Resnet-50 v2 (He et al., 2016b) on Imagenet.

Convergence Analysis of signSGD

We begin our analysis of sign stochastic gradient descent in the non-convex setting. The standard assumptions of the stochastic optimisation literature are nicely summarised by Allen-Zhu (2017b). We will use more fine-grained assumptions. signSGD can exploit this additional structure, much as Adagrad (Duchi et al., 2011) exploits sparsity. We emphasise that these fine-grained assumptions do not lose anything over typical SGD assumptions, since our assumptions can be obtained from SGD assumptions and vice versa.

For all xx and some constant ff^{*}, we have objective value f(x)f.f(x)\geq f^{*}.

This assumption is standard and necessary for guaranteed convergence to a stationary point.

The next two assumptions will naturally encode notions of heterogeneous curvature and gradient noise.

Let g(x)g(x) denote the gradient of the objective f(.)f(.) evaluated at point xx. Then x,y\forall x,y we require that for some non-negative constant L:=[L1,...,Ld]\vec{L}:=[L_{1},...,L_{d}]

For twice differentiable ff, this implies that diag(L)Hdiag(L)-\text{diag}(\vec{L})\prec H\prec\text{diag}(\vec{L}). This is related to the slightly weaker coordinate-wise Lipschitz condition used in the block coordinate descent literature (Richtárik & Takáč, 2014).

Lastly, we assume that we have access to the following stochastic gradient oracle:

for a vector of non-negative constants σ:=[σ1,..,σd]\vec{\sigma}:=[\sigma_{1},..,\sigma_{d}].

Assumptions 2 and 3 are different from the assumptions typically used for analysing the convergence properties of SGD (Nesterov, 2013; Ghadimi & Lan, 2013), but they are natural to the geometry induced by algorithms with signed updates such as signSGD and Signum.

which is the standard assumption of Lipschitz smoothness.

Assumption 3 is more fined-grained than the standard stochastic gradient oracle assumption used for SGD analysis. But again, the standard variance bound is recovered by defining σ2:=σ22\sigma^{2}:=||\vec{\sigma}||_{2}^{2}. Then Assumption 3 implies that

which is the standard assumption of bounded total variance.

Under these assumptions, we have the following result:

To pass the stochasticity through the non-linear sign operation in a controlled fashion, we need to prove the key statement that at the kthk^{th} step for the ithi^{th} gradient component

This formalises the intuition that the probability of the sign of a component of the stochastic gradient being incorrect should be controlled by the signal-to-noise ratio of that component. When a component’s gradient is large, the probability of making a mistake is low, and one expects to make good progress. When the gradient is small compared to the noise, the probability of making mistakes can be high, but due to the large batch size this only happens when we are already close to a stationary point.

The large batch size in the theorem may seem unusual, but large batch training actually presents a systems advantage (Goyal et al., 2017) since it can be parallelised. The number of gradient calls NN is the important quantity to measure convergence, but large batch training achieves NN gradient calls in only O(N)(\sqrt{N}) iterations whereas small batch training needs O(N)O(N) iterations. Fewer iterations also means fewer rounds of communication in the distributed setting. Convergence guarantees can be extended to the small batch case under the additional assumption of unimodal symmetric gradient noise using Lemma D.1 in the supplementary, but we leave this for future work. Experiments in this paper were indeed conducted in the small batch regime.

Remember that under our assumptions, SGD-style assumptions hold with Lipschitz constant L:=LL:=\|\vec{L}\|_{\infty} and total variance bound σ2:=σ22\sigma^{2}:=\|\vec{\sigma}\|_{2}^{2}. Using our notion of density we can translate our constants into the language of SGD:

where we have assumed ϕ(g)\phi(g) to be a lower bound on the gradient density over the entire space. Using that (x+y)22(x2+y2)(x+y)^{2}\leq 2(x^{2}+y^{2}) and changing variables in the bound, we reach the following result for signSGD

whereas, for comparison, a typical SGD bound (proved in Supplementary C) is

The bounds are very similar, except for most notably the appearance of ratios of densities R1R_{1} and R2R_{2}, defined as

Naïvely comparing the bounds suggests breaking into cases:

R11R_{1}\gg 1 and R21R_{2}\gg 1. This means that both the curvature and the stochasticity are much denser than the typical gradient and the comparison suggests SGD is better suited than signSGD.

NOT[R11][R_{1}\gg 1] and NOT[R21][R_{2}\gg 1]. This means that neither curvature nor stochasticity are much denser than the gradient, and the comparison suggests that signSGD may converge as fast or faster than SGD, and also get the benefits of gradient compression.

neither of the above holds, for example R11R_{1}\ll 1 and R21R_{2}\gg 1. Then the comparison is indeterminate about whether signSGD or SGD is more suitable.

Let’s briefly provide some intuition to understand how it’s possible that signSGD could outperform SGD. Imagine a scenario where the gradients are dense but there is a sparse set of extremely noisy components. Then the dynamics of SGD will be dominated by this noise, and (unless the learning rate is reduced a lot) SGD will effectively perform a random walk along these noisy components, paying less attention to the gradient signal. signSGD however will treat all components equally, so it will scale down the sparse noise and scale up the dense gradients comparatively, and thus make good progress. See Figure A.2 in the supplementary for a simple example of this.

Still we must be careful when comparing upper bounds, and interpreting the dependence on curvature density is more subtle than noise density. This is because the SGD bound proved in Supplementary C is slacker under situations of sparse curvature than dense curvature. That is to say that SGD, like signSGD, benefits under situations of sparse curvature but this is not reflected in the SGD bound. The potentially slack step in SGD’s analysis is in switching from LiL_{i} to L\|\vec{L}\|_{\infty}. Because of this it is safer to interpret the curvature comparison as telling us a regime where signSGD is expected to lose out to SGD (rather than vice versa). This happens when R11R_{1}\gg 1 and gradients are sparser than curvature. Intuitively, in this case signSGD will push many components in highly curved directions even though these components had small gradient, and this can be undesirable.

To summarise, our theory suggests that when gradients are dense, signSGD should be more robust to large stochasticity on a sparse set of coordinates. When gradients are sparse, SGD should be more robust to dense curvature and noise. In practice for deep networks, we find that signSGD converges about as fast as SGD. That would suggest that we are either in regime (II) or (III) above. But what is the real situation for the error landscape of deep neural networks?

To measure gradient and noise densities in practice, we use Welford’s algorithm (Welford, 1962; Knuth, 1997) to compute the true gradient gg and its stochasticity vector σ\vec{\sigma} at every epoch of training for a Resnet-20 model on CIFAR-10. Welford’s algorithm is numerically stable and only takes a single pass through the data to compute the vectorial mean and variance. Therefore if we train a network for 160 epochs, we make an additional 160 passes through the data to evaluate these gradient statistics. Results are plotted in Figure 3.1. Notice that the gradient density and noise density are of the same order throughout training, and this indeed puts us in regime (II) or (III) as predicted by our theory.

In Figure A.2 of the supplementary, we present preliminary evidence that this finding generalises, by showing that gradients are dense across a range of datasets and network architectures. We have not devised an efficient means to measure curvature densities, which we leave for future work.

Majority Rule: the Power of Democracy in the Multi-Worker Setting

In the most common form of distributed training, workers (such as GPUs) each evaluate gradients on their own split of the data, and send the results up to a parameter-server. The parameter server aggregates the results and transmits them back to each worker (Li et al., 2014).

Up until this point in the paper, we have only analysed signSGD where the update is of the form

To get the benefits of compression we want the mthm^{th} worker to send the sign of the gradient evaluated only on its portion of the data. This suggests an update of the form

This scheme is good since what gets sent to the parameter will be 1-bit compressed. But what gets sent back almost certainly will not. Could we hope for a scheme where all communication is 1-bit compressed?

This is called majority vote, since each worker is essentially voting with its belief about the sign of the true gradient. The parameter server counts the votes, and sends its 1-bit decision back to every worker.

The machinery of Theorem 3.1 is enough to establish convergence for the (good) scheme, but majority vote is more elegant and more communication efficient, therefore we focus on this scheme from here on.

In Theorem 4.1 we first establish the general convergence rate of majority vote, followed by a regime where majority vote enjoys a variance reduction from σ1\|\vec{\sigma}\|_{1} to σ1/M\|\vec{\sigma}\|_{1}/\sqrt{M}.

Remark: Part (a) of the theorem does not describe a speedup over just using a single machine, and that might hint that all those extra M1M-1 workers are a waste in this setting. This is not the case. From the proof sketch above, it should be clear that part (a) is an extremely conservative statement. In particular, we expect all regions of training where the signal-to-noise ratio of the stochastic gradient satisfies S>1S>1 to enjoy a significant speedup due to variance reduction. It’s just that since we don’t get the speedup when S<1S<1, it’s hard to express this in a compact bound.

To sketch a proof for part (b), note that a sign bit from each worker is a Bernoulli trial—call its failure probability qq. We can get a tight control of qq by a convenient tail bound owing to Gauss (1823) that holds under conditions of unimodal symmetry. Then the sum of bits received by the parameter server is a binomial random variable, and we can use Cantelli’s inequality to bound its tail. This turns out to be enough to get tight enough control on the error probability of the majority vote decoder to prove the theorem.

Remark 1: assuming that the stochastic gradient of each worker is approximately symmetric and unimodal is very reasonable. In particular for increasing mini-batch size it will be an ever-better approximation by the central limit theorem. Figure 4.1 plots histograms of real stochastic gradients for neural networks. Even at batch-size 256 the stochastic gradient for an Imagenet model already looks Gaussian.

Remark 2: if you delve into the proof of Theorem 4.1 and graph all of the inequalities, you will notice that some of them are uniformly slack. This suggests that the assumptions of symmetry and unimodality can actually be relaxed to only hold approximately. This raises the possibility of proving a relaxed form of Gauss’ inequality and using a third moment bound in the Berry-Esseen theorem to derive a minimal batch size for which the majority vote scheme is guaranteed to work by the central limit theorem. We leave this for future work.

Distributions like these are a problem because it means that adding more workers will actually drive up the error probability rather than driving it down. The beauty of the central limit theorem is that even for such a skewed and bimodal distribution, the mean of just a few tens of samples will already start to look Gaussian.

Extending the Theory to Signum

Momentum is a popular trick used by neural network practitioners that can, in our experience, speed up the training of deep neural networks and improve the robustness of algorithms to other hyperparameter settings. Instead of taking steps according to the gradient, momentum algorithms take steps according to a running average of recent gradients.

Existing theoretical analyses of momentum often rely on the absence of gradient stochasticity (e.g. Jin et al. (2017)) or convexity (e.g. Goh (2017)) to show that momentum’s asymptotic convergence rate can beat gradient descent.

It is easy to incorporate momentum into signSGD, merely by taking the sign of the momentum We call the resulting algorithm Signum and present the algorithmic step formally in Algorithm 1.2. Signum fits into our theoretical framework, and we prove its convergence rate in Theorem 5.1.

Remark 1: switching optimisers after a warmup period is in fact commonly done by practitioners (Akiba et al., 2017).

Remark 2: the theory suggests that momentum can be used to control a bias-variance tradeoff in the quality of stochastic gradient estimates. Sending β1\beta\rightarrow 1 kills the variance term in σ1\|\vec{\sigma}\|_{1} due to averaging gradients over a longer time horizon. But averaging in stale gradients induces bias due to curvature of f(x)f(x), and this blows up the δL1\delta\|\vec{L}\|_{1} term.

Remark 3: for generality, we state this theorem with a tunable learning rate δ\delta. For variety, we give this theorem in any-time form with a growing batch size and decaying learning rate. This comes at the cost of log\log factors appearing.

We benchmark Signum on Imagenet (Figure 5.1) and CIFAR-10 (Figure A.3 of supplementary). The full results of a giant hyperparameter grid search for the CIFAR-10 experiments are also given in the supplementary. Signum’s performance rivals Adam’s in all experiments.

Discussion

Gradient compression schemes like TernGrad (Wen et al., 2017) quantise gradients into three levels {0,±1}\{0,\pm 1\}. This is desirable when the ternary quantisation is sparse, since it can allow further compression. Our scheme of majority vote should easily be compatible with a ternary quantisation—in both directions of communication. This can be cast as “majority vote with abstention”. The scheme is as follows: workers send their vote to the parameter server, unless they are very unsure about the sign of the true gradient in which case they send zero. The parameter-server counts the votes, and if quorum is not reached (i.e. too many workers disagreed or abstained) the parameter-server sends back zero. This extended algorithm should readily fit into our theory.

In Section 2 we pointed out that signSGD and Signum are closely related to Adam. In all our experiments we find that Signum and Adam have very similar performance, although both lose out to SGD by about 2% test accuracy on Imagenet. Wilson et al. (2017) observed that Adam tends to generalise slightly worse than SGD. Though it is still unclear why this is the case, perhaps it could be because we don’t know how to properly regularise such methods. Whilst we found that neither standard weight decay nor the suggestion of Loshchilov & Hutter (2017) completely closed our Imagenet test set gap with SGD, it is possible that some other regularisation scheme might. One idea, suggested by our theory, is that signSGD could be squashing down noise levels. There is some evidence (Smith & Le, 2018) that a certain level of noise can be good for generalisation, biasing the optimiser towards wider valleys in the objective function. Perhaps, then, adding Gaussian noise to the Signum update might help it generalise better. This can be achieved in a communication efficient manner in the distributed setting by sharing a random seed with each worker, and then generating the same noise on each worker.

Finally, in Section 3 we discuss some geometric implications of our theory, and provide an efficient and robust experimental means of measuring one aspect—the ratio between noise and gradient density—through the Welford algorithm. We believe that since this density ratio is easy to measure, it may be useful to help guide those doing architecture search, to find network architectures which are amenable to fast training through gradient compression schemes.

Conclusion

Our work touches upon interesting aspects of the geometry of high-dimensional error surfaces, which we wish to explore in future work. But the next step for us will be to reach out to members of the distributed systems community to help benchmark the majority vote algorithm which shows such great theoretical promise for 1-bit compression in both directions between parameter-server and workers.

Acknowledgments

The authors are grateful to the anonymous reviewers for their helpful comments, as well as Jiawei Zhao, Michael Tschannen, Julian Salazar, Tan Nguyen, Fanny Yang, Mu Li, Aston Zhang and Zack Lipton for useful discussions. Thanks to Ryan Tibshirani for pointing out the connection to steepest descent.

KA is supported in part by NSF Career Award CCF-1254106 and Air Force FA9550-15-1-0221. AA is supported in part by Microsoft Faculty Fellowship, Google Faculty Research Award, Adobe Grant, NSF Career Award CCF-1254106, and AFOSR YIP FA9550-15-1-0221.

References

Appendix A Further experimental results

Appendix B Proving the convergence rate of signSGD

First take Assumption 2, plug in the step from Algorithm 1.1, and decompose the improvement to expose the stochasticity-induced error:

Next we find the expected improvement at time k+1k+1 conditioned on the previous iterate.

So the expected improvement crucially depends on the probability that each component of the sign vector is correct, which is intuitively controlled by the relative scale of the gradient to the noise. To make this rigorous, first relax the probability, then use Markov’s inequality followed by Jensen’s inequality:

σk,i\sigma_{k,i} refers to the variance of the kthk^{th} stochastic gradient estimate, computed over a mini-batch of size nkn_{k}. Therefore, by Assumption 3, we have that σk,iσi/nk\sigma_{k,i}\leq\sigma_{i}/\sqrt{n_{k}}.

We now substitute these results and our learning rate and mini-batch settings into the expected improvement:

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

We can rearrange this inequality to yield the rate:

Since we are growing our mini-batch size, it will take N=O(K2)N=O(K^{2}) gradient calls to reach step KK. Substitute this in, square the result, and we are done. ∎

Appendix C Large and small batch SGD

For comparison with signSGD theory, here we present non-convex convergence rates for SGD. These are classical results and we are not sure of the earliest reference.

We noticeably get exactly the same rate for large and small batch SGD when measuring convergence in terms of number of stochastic gradient calls. Although the rates are the same for a given number of gradient calls NN, the large batch setting is preferred (in theory) since it achieves these NN gradient calls in only N\sqrt{N} iterations, whereas the small batch setting requires NN iterations. Fewer iterations in the large batch case implies a smaller wall-clock time to reach a given accuracy (assuming the large batch can be parallelised) as well as fewer rounds of communication in the distributed setting. These systems benefits of large batch learning have been observed by practitioners (Goyal et al., 2017).

Take Assumption 2 and plug in the algorithmic step.

Next we find the expected improvement at time k+1k+1 conditioned on the previous iterate.

σk2\sigma_{k}^{2} refers to the variance of the kthk^{th} stochastic gradient estimate, computed over a mini-batch of size nkn_{k}. Therefore, by Assumption 3, we have that σk2σ2/nk\sigma_{k}^{2}\leq\sigma^{2}/n_{k}.

First let’s substitute in the (large batch) hyperparameters.

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

We can rearrange this inequality to yield the rate:

Since we are growing our mini-batch size, it will take N=O(K2)N=O(K^{2}) gradient calls to reach step KK. Substitute this in and we are done for the (large batch) case.

Now we need to show that the same result holds for the (small batch) case. Following the initial steps of the large batch proof, we get

This time σk2=σ2\sigma_{k}^{2}=\sigma^{2}. Substituting this and our learning rate and mini-batch settings into the expected improvement:

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

We can rearrange this inequality to yield the rate:

It will take N=O(K)N=O(K) gradient calls to reach step KK. Substitute this in and we are done. ∎

Appendix D Proving the convergence rate of distributed signSGD with majority vote

See 4.1 Before we introduce the unimodal symmetric assumption, let’s first address the claim that M-worker majority vote is at least as good as single-worker signSGD as in Theorem 3.1 only using Assumptions 1 to 3.

Recall that a crucial step in Theorem 3.1 is showing that

for component ii of the stochastic gradient with variance bound σi\sigma_{i}.

then we are done, since the machinery of Theorem 1 can then be directly applied.

Define the signal-to-noise ratio of a component of the stochastic gradient as S:=gi/σiS:=|g_{i}|/\sigma_{i}. Note that when S1S\leq 1 then ()(\star) is trivially satisfied, so we need only consider the case that S>1S>1. SS should really be labeled SiS_{i} but we abuse notation.

Without loss of generality, assume that gig_{i} is negative, and thus using Assumption 3 and Cantelli’s inequality (Cantelli, 1928) we get that for the failure probability qq of a single worker

For S>1S>1 then we have failure probability q<12q<\frac{1}{2}. If the failure probability of a single worker is smaller than 12\frac{1}{2} then the server is essentially receiving a repetition code RMR_{M} of the true gradient sign. Majority vote is the maximum likelihood decoder of the repetition code, and of course decreases the probability of error—see e.g. (MacKay, 2002). Therefore in all regimes of SS we have that

That’s all well and good, but what we’d really like to show is that using MM workers provides a speedup by reducing the variance. Is

Well in the regime where S1S\gg 1 such a speedup is very reasonable since q12q\ll\frac{1}{2} by Cantelli, and the repetition code actually supplies exponential reduction in failure rate. But we need to exclude very skewed or bimodal distributions where q>12q>\frac{1}{2} and adding more voting workers will not help. That brings us naturally to the following lemma:

which is in all cases less than 12\frac{1}{2}.

Recall Gauss’ inequality for unimodal random variable X with mode ν\nu and expected squared deviation from the mode τ2\tau^{2} (Gauss, 1823; Pukelsheim, 1994):

By the symmetry assumption, the mode is equal to the mean, so we replace mean μ=ν\mu=\nu and variance σ2=τ2\sigma^{2}=\tau^{2}.

Without loss of generality assume that gig_{i} is negative. Then applying symmetry followed by Gauss, the failure probability for the sign bit satisfies:

We now have everything we need to prove part (b) of Theorem 4.1.

If we can show ()(\dagger) we’ll be done, since the machinery of Theorem 3.1 follows through with σ\sigma replaced everywhere by σM\frac{\sigma}{\sqrt{M}}. Note that the important quantity appearing in ()(\star) is

Let ZZ count the number of workers with correct sign bit. To ensure that

ZZ must be larger than M2\frac{M}{2}. But ZZ is the sum of MM independent Bernoulli trials, and is therefore binomial with success probability pp and failure probability qq to be determined. Therefore we have reduced proving ()(\dagger) to showing that

where ZZ is the number of successes of a binomial random variable b(M,p)b(M,p) and SS is our signal-to-noise ratio S:=giσiS:=\frac{|g_{i}|}{\sigma_{i}}.

Let’s start by getting a bound on the success probability pp (or equivalently failure probability qq) of a single Bernoulli trial.

By Lemma D.1, which critically relies on unimodal symmetric gradient noise, the failure probability for the sign bit of a single worker satisfies:

Now we have an analytical handle on random variable ZZ, we may proceed to show (\mathsection)(\mathsection). There are a number of different inequalities that we can use to bound the tail of a binomial random variable, but Cantelli’s inequality will be good enough for our purposes.

Let Zˉ:=MZ\bar{Z}:=M-Z denote the number of failures. Zˉ\bar{Z} is binomial with mean μZˉ=Mq\mu_{\bar{Z}}=Mq and variance σZˉ2=Mpq\sigma^{2}_{\bar{Z}}=Mpq. Then using Cantelli we get

Now using the fact that 11+x212x\frac{1}{1+x^{2}}\leq\frac{1}{2x} we get

To finish, we need only show that 14ϵ21\sqrt{\frac{1}{4\epsilon^{2}}-1} is smaller than 2S\frac{2}{S}, or equivalently that its square is smaller than 4S2\frac{4}{S^{2}}. Well plugging in our bound on ϵ\epsilon we get that

So we have shown both cases, which proves (\mathsection)(\mathsection) from which we get ()(\dagger) and we are done. ∎

Appendix E General recipes for the convergence of approximate sign gradient methods

Now we generalize the arguments in the proof of signSGD and prove a master lemma that provides a general recipe for analyzing the approximate sign gradient method. This allows us to handle momentum and the majority voting schemes, hence proving Theorem 5.1 and Theorem 4.1.

where the expectation is taken over the all random variables, and the rate ξ(k)\xi(k) obeys that ξ(k)0\xi(k)\rightarrow 0 as kk\rightarrow\infty and then we have

In particular, if δk=δ/k\delta_{k}=\delta/\sqrt{k} and ξ(k)=κ/k\xi(k)=\kappa/\sqrt{k}, for some problem dependent constant κ\kappa, then we have

Note that gkg_{k} becomes fixed when we condition on xkx_{k}. Further take expectation over xkx_{k}, and apply (2). We get:

Rearrange the terms and sum over (3) for k=C,C+1,...,K1k=C,C+1,...,K-1.

and the proof is complete by noting that the minimum is smaller than the average in the LHS. ∎

To use the above Lemma for analyzing Signum and the Majority Voting scheme, it suffices to check condition (2) for each algorithm.

One possible way to establish (2) is show that vkv_{k} is a good approximation of the gradient gkg_{k} in expected absolute value.

Condition on xkx_{k} and apply the above inequality to every i=1,...,di=1,...,d to what is inside the expectation of (2), we have

Note that the final \leq uses Markov’s inequality and constant gk[i]|g_{k}[i]| cancels out.

The proof is complete by taking expectation on both sides and apply (4). ∎

Note that the proof of this lemma uses Markov’s inequality in the same way information-theoretical lower bounds are often proved in statistics — reducing estimation to testing.

Another handy feature of the result is that we do not require the approximation to hold for every possible xkx_{k}. It is okay that for some xkx_{k}, the approximation is much worse as long as those xkx_{k} appears with small probability according to the algorithm. This feature enables us to analyze momentum and hence proving the convergence for Signum.

Appendix F Analysis for Signum

Recall our definition of the key random variables used in Signum.

For any k<k<\infty and fixed weight <α1,...,αk<-\infty<\alpha_{1},...,\alpha_{k}<\infty, l=1kαlZl\sum_{l=1}^{k}\alpha_{l}Z_{l} is a Martingale. In particular,

We simply check the definition of a Martingale. Denote Yk:=l=1kαlZlY_{k}:=\sum_{l=1}^{k}\alpha_{l}Z_{l}. First, we have that

Second, again using the law of total probability,

This completes the proof that it is indeed a Martingale. We now make use of the properties of Martingale difference sequences to establish a variance bound on the Martingale.

The consequence of this lemma is that we are able to treat Z1,...,ZkZ_{1},...,Z_{k} as if they are independent, even though they are not—clearly ZlZ_{l} is dependent on Z1,...,Zl1Z_{1},...,Z_{l-1} through xlx_{l}.

For each i[d]i\in[d] we will use the following non-standard “bias-variance” decomposition.

We will first bound ()(**) and then deal with ()(*).

Combining, and again using our condition that kCk\geq C, we get

We now turn to bounding ()(*) — the “bias” term.

Let v:=sign(g(x+ϵs)g(x))v:=\text{sign}(g(x+\epsilon s)-g(x)), H:=[t=01H(x+tϵs)dt]H:=\left[\int_{t=0}^{1}H(x+t\epsilon s)dt\right] and moreover, use H+H_{+} to denote the psd part of HH and HH_{-} to denote the nsd part of HH. Namely, H=H+HH=H_{+}-H_{-}.

Note that assumption 2 implies the semidefinite ordering

and thus max{sTH+s,sTHs}i=1dLi=L1\max\{s^{T}H_{+}s,s^{T}H_{-}s\}\leq\sum_{i=1}^{d}L_{i}=\|\vec{L}\|_{1} for all s{1,1}ds\in\{-1,1\}^{d}.

The proof is complete by observing that both vv and ss are sign vectors in (8). ∎

Using the above lemma and the fact that our update rules are always following some sign vectors with learning rate smaller than δ\delta, we have

Substitute (6) into (5), sum both sides over ii and then further plug in (10) we get the statement in the lemma. ∎

The proof of Theorem 5.1 now follows in a straightforward manner. Note that Lemma F.2 only kicks in after a warmup period of CC iterations, with CC as specified in Theorem 5.1. In theory it does not matter what you do during this warmup period, provided you accumulate the momentum as normal and take steps according to the prescribed learning rate and mini-batch schedules. One option is to just stay put and not update the parameter vector for the first CC iterations. This is wasteful since no progress will be made on the objective. A better option in practice is to take steps using the sign of the stochastic gradient (i.e. do signSGD) instead of Signum during the warmup period.