cpSGD: Communication-efficient and differentially-private distributed SGD
Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan Mcmahan
Introduction
Distributed stochastic gradient descent (SGD) is a basic building block of modern machine learning . In the typical scenario of synchronous distributed learning, in every round, each client obtains a copy of a global model which it updates based on its local data. The updates (usually in the form of gradients) are sent to a parameter server, where they are averaged and used to update the global model. Alternatively, without a central server, each client saves a global model, broadcasts the gradient to all other clients, and updates its model with the aggregated gradient.
Often, the communication cost of sending the gradient becomes the bottleneck . To address this issue, several recent works have focused on reducing the communication cost of distributed learning algorithms via gradient quantization and sparsification . These algorithms have been shown to improve communication cost and hence communication time in distributed learning. This is especially effective in the federated learning setting where clients are mobile devices with expensive up-link communication cost .
While communication is a key concern in client based distributed machine learning, an equally important consideration is that of protecting the privacy of participating clients and their sensitive information. Providing rigorous privacy guarantees for machine learning applications has been an area of active recent interest . Differentially private gradient descent algorithms in particular were studied in the work of . A direct application of these mechanisms in distributed settings leads to algorithms with high communication costs. The key focus of our paper is to analyze mechanisms that achieve rigorous privacy guarantees as well as have communication efficiency.
2 Communication efficiency
where each resides at the client. For example, ’s are weights of a neural network and is the loss of the network on data located on client . Let be the initial value. At round , the server transmits to all the clients and asks a random set of (batch size / lot size) clients to transmit their local gradient estimates . Let be the subset of clients.
for some suitable choice of . Other optimization algorithms such as momentum, Adagrad, or Adam can also be used instead of the SGD step above.
Naively for the above protocol, each of the clients needs to transmit reals, typically using bits is the per-coordinate quantization accuracy. To represent a dimensional vector to an constant accuracy in Euclidean distance, each coordinate is usually quantized to an accuracy of ..
This communication cost can be prohibitive, e.g., for a medium size PennTreeBank language model , the number of parameters and hence total cost is MB (assuming 32 bit float), which is too large to be sent from a mobile phone to the server at every round.
Motivated by the need for communication efficient protocols, various quantization algorithms have been proposed to reduce the communication cost . In these protocols, the clients quantize the gradient by a function and send an efficient representation of instead of its actual local gradient . The server computes the gradient as
and updates as before. Specifically, proposes a quantization algorithm which reduces the requirement of full (or floating point) arithmetic precision to a bit or few bits per value on average. There are many subsequent works e.g., see and in particular showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD for convex functions. If the expected communication cost at every round is bounded by , then the total communication cost of the modified gradient descent is at most
All the previous papers relate the error in gradient compression to SGD convergence. We first state one such result for completeness for non-convex functions and prove it in Appendix A. Similar (and stronger) results can be obtained for (strongly) convex functions using results in and .
Let be -smooth and . Let satisfy . Let be a quantization scheme, and then after rounds
The above result relates the convergence of distributed SGD for non-convex functions to the worst-case mean square error (MSE) and bias in gradient mean estimates in Equation (2). Thus smaller the mean square error in gradient estimation, better convergence. Hence, we focus on the problem of distributed mean estimation (DME), where the goal is to estimate the mean of a set of vectors.
3 Differential privacy
While the above schemes reduce the communication cost, it is unclear what (if any) privacy guarantees they offer. We study privacy from the lens of differential privacy (DP). The notion of differential privacy provides a strong notion of individual privacy while permitting useful data analysis in machine learning tasks. We refer the reader to for a survey. Informally, for the output to be differentially private, the estimated model should be indistinguishable whether a particular client’s data was taken into consideration or not. We define this formally in Section 2.
In the context of client based distributed learning, we are interested in the privacy of the gradients aggregated from clients; differential privacy for the average gradients implies privacy for the resulting model since DP is preserved by post-processing.
The standard approach is to let the server add the noise to the averaged gradients (e.g., see and references within). However, the above only works under a restrictive assumption that the clients can trust the server. Our goal is to also minimize the need for clients to trust the central aggregator, and hence we propose the following model:
Clients add their share of the noise to their gradients before transmission. Aggregation of gradients at the server results in an estimate with noise equal to the sum of the noise added at each client.
This approach improves over server-controlled noise addition in several scenarios:
Clients do not trust the server: Even in the scenario when the server is not trustworthy, the above scheme can be implemented via cryptographically secure aggregation schemes , which ensures that the only information about the individual users the server learns is what can be inferred from the sum. Hence, differential privacy of the aggregate now ensures that the parameter server does not learn any individual user information. This will encourage clients to participate in the protocol even if they do not fully trust the server. We note that while secure aggregation schemes add to the communication cost (e.g., adds for levels of quantization), our proposed communication benefits still hold. For example, if , a 4-bit quantization protocol would reduce communication cost by 67% compared to the 32 bit representation.
Server is negligent, but not malicious: the server may "forget" to add noise, but is not malicious and not interested in learning characteristics of individual users. However, if the server releases the learned model to public, it needs to be differentially-private.
A natural way to extend the results of is to let individual users add Gaussian noise to their gradients before transmission. Since the sum of Gaussians is Gaussian itself, differential privacy results follow. However, the transmitted values now are real numbers and the benefits of gradient compression are lost. Further, secure aggregation protocols require discrete inputs. To resolve these issues, we propose that the clients add noise drawn from an appropriately parameterized Binomial distribution. We refer to this as the Binomial mechanism. Since Binomial random variables are discrete, they can be transmitted efficiently. Furthermore, the choice of the Binomial is convenient in the distributed setting because sum of Binomials is also binomially distributed i.e., if
Hence the total noise post aggregation can be analyzed easily, which is convenient for the distributed settingAnother choice is the Poisson distribution. Different from Poisson, the Binomial distribution has bounded support and has an easily analyzable communication complexity which is always bounded.. Binomial mechanism can be of independent interest in other applications with discrete output as well. Furthermore, unlike Gaussian it avoids floating point representation issues.
4 Summary of our results
Binomial mechanism: We first study Binomial mechanism as a generic mechanism to release discrete valued data. Previous analysis of the Binomial mechanism (where you add noise ) was due to , who analyzed the -dimensional case for and showed that to achieve differential privacy, needs to be . We improve the analysis in the following ways:
-dimensions. We extend the analysis of -dimensional Binomial mechanism to dimensions. Unlike the Gaussian distribution, Binomial is not rotation invariant making the analysis more involved. The key fact utilized in this analysis is that Binomial distribution is locally rotation-invariant around the mean.
Improvement. We improve the previous result and show that suffices for small , implying that the Binomial and Gaussian mechanism perform identically as . We note that while this is a constant improvement , it is crucial in making differential privacy practical.
Differentially-private distributed mean estimation (DME): A direct application of Gaussian mechanism requires reals and hence bits of communication. This can be prohibitive in practice. However, a direct application of quantization and Binomial mechanism has high communication cost. We show that random rotation together with the notion of high probability sensitivity can significantly improve communication.
In particular, for , we provide an algorithm achieving equal privacy and error as that of the Gaussian mechanism with communication
per round of distributed SGD. Hence when , the number of bits is .
The rest of the paper is organized as follows. In Section 2, we review the notion of differential privacy and state our results for the Binomial mechanism. Motivated by the fact that the convergence of SGD can be reduced to the error in gradient estimate computation per-round, we formally describe the problem of DME in Section 4 and state our results in Section 5.
In Section 5.2, we provide and analyze the implementation of the binomial mechanism in conjunction with quantization in the context of DME. The main idea is for each client to add noise drawn from an appropriately parameterized Binomial distribution to each quantized value before sending to the server. The server further subtracts the bias introduced by the noise to achieve an unbiased mean estimator. We further show in Section 5.3 that the rotation procedure proposed in which reduces the MSE is helpful in reducing the additional error due to differential privacy.
Differential privacy
We start by defining the notion of differential privacy. Formally, given a set of data sets provided with a notion of neighboring data sets and a query function , a mechanism to release the answer of the query, is defined to be differentially private if for any measurable subset and two neighboring data sets ,
The canonical mechanism to achieve differential privacy is the Gaussian mechanism :
is differentially private All logs are to base unless otherwise stated. and the error is bounded by
2 High probability sensitivity
Given a set of integers and values , we call a randomized function , sensitive, if for any two neighboring data sets , there exist coupled random variables such that the marginal distributions of are identical to that of and and
We show the following result for high-probability sensitivity and the proof is provided in Appendix C.
Let be an differentially private mechanism for sensitivity and let be a sensitive function. Then the composed mechanism is differentially private.
Binomial Mechanism
where for each coordinate , and independent. One dimensional binomial mechanism was introduced by for the case when . We analyze the mechanism for the general -dimensional case and for any . This analysis is involved as the Binomial mechanism is not rotation invariant. By carefully exploiting the local rotation invariant structure near the mean, we show that:
For any , parameters and sensitivity bounds such that
the Binomial mechanism is differentially private for
where , and are defined in (17), (12), and (16) respectively, and for , , , and . The error of the mechanism is
The proof is given in Appendix B. We make some remarks regarding the design and the guarantee for the Binomial Mechanism. Note that the privacy guarantee for the Binomial mechanism depends on all three sensitivity parameters as opposed to the Gaussian mechanism which only depends on . The and terms can be seen as the added complexity due to discretization.
Secondly setting (i.e. providing no scale to the noise) in the expression (7), it can be readily seen that the terms involving and scale differently with respect to the variance of the noise. This motivates the use of the accompanying quantization scale in the mechanism. Indeed it is possible that the resolution of the integer that is provided by the Binomial noise could potentially be too large for the problem leading to worse guarantees. In this setting, the quantization parameter helps normalize the noise correctly. Further, it can be seen as long as the variance of the random variable is fixed, increasing and decreasing makes the Binomial mechanism closer to the Gaussian mechanism. Formally, if we let and , then using the Cauchy-Schwartz inequality, the guarantee (7) can be rewritten as
The variance of the Binomial distribution is and the leading term in matches exactly the term in Gaussian mechanism. Furthermore, if is , then this mechanism is very similar to the Gaussian mechanism. This result agrees with the Berry-Esseen type Central limit theorems for the convergence of one dimensional Binomial distribution to the Gaussian distribution.
In Figure 2, we plot the error vs for Gaussian and Binomial mechanism. Observe that as scale is reduced, error vs privacy trade-off for Binomial mechanism approaches that of Gaussian mechanism.
Distributed mean estimation (DME)
We have related the SGD convergence rate to the MSE in approximating the gradient at each step in Corollary 1. Eq. (1) relates the communication cost of SGD to the communication cost of estimating gradient means. Advanced composition theorem (Thm. 3.5 ) or moments accounting can be used to relate the privacy guarantee of SGD to that of gradient mean estimate at each instance . We also note that in SGD, we often sample the clients, standard privacy amplification results via sampling , can be used to get tighter bounds in this case.
Therefore, akin to , in the rest of the paper we just focus on the MSE and privacy guarantees of DME. The results for synchronous distributed GD follow from Corollary 1 (convergence), advanced composition theorem (privacy), and Eq. (1) (communication).
at a central server. For gradient descent at each round , is set to . DME is a fundamental building block for many distributed learning algorithms including distributed PCA/clustering .
Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server at the same time. We allow the use of both private and public randomness. Private randomness refers to random values generated by each client separately, and public randomness refers to a sequence of random values that are shared among all partiesPublic randomness can be emulated by the server communicating a random seed.
The total number of bits transmitted by all clients with the protocol is
Let the estimated mean be . For a protocol , the MSE of the estimate is
We note that bounds on , translates to bounds on gradients estimates in Eq. (2) and result in convergence guarantees via Corollary 1.
2 Differential privacy
To state the privacy results for DME, we define the notion of data sets and neighbors as follows. A dataset is a collection of vectors . The notion of neighboring data sets typically corresponds to those differing only on the information of one user, i.e. are neighbors if they differ in one vector.
Note that this notion of neighbors for DME in the context of distributed gradient descent translates to two data sets
being neighbors if they differ in one function and corresponds to guaranteeing privacy for individual client’s data. The bound translates to assuming , ensured via gradient clipping.
Results for distributed mean estimation (DME)
In this section we describe our algorithms, the associated MSE, and the privacy guarantees in the context of DME. First, we first establish a baseline by stating the results for implementing the Gaussian mechanism by adding Gaussian noise on each client vector.
In the Gaussian mechanism, each client sends vector
Under the Gaussian mechanism, the mean estimate is unbiased and communication cost is reals. Moreover, for any and , it is differentially private for
We remark that real numbers can be quantized to bits with insignificant effect to privacyFollows by observing that quantizing all values to accuracy ensures minimum loss in privacy. In practice this is often implemented using 32 bits of quantization via float representation.. However this is asymptotic and can be prohibitive in practice , where we have a small fixed communication budget and is of the order of millions. A natural way to reduce communication cost is via quantization, where each client quantizes s before transmitting. However how privacy guarantees degrade due to quantization of the Gaussian mechanism is hard to analyze particularly under aggregation. Instead we propose to use the Binomial mechanism which we describe next.
2 Stochastic k𝑘k-level quantization + Binomial mechanism
We now define the mechanism based on -bit stochastic quantization proposed in composed with the Binomial mechanism. It will be parameterized by quantities .
First, the server sends to all the clients, with the hope that for all , . The clients then clip each coordinate of their vectors to the range . For every integer in the range , let represent a bin (one for each ), i.e.
The algorithm quantizes each coordinate into one of the bins stochastically and adds scaled Binomial noise. Formally client computes the following quantities for every
where is such that and . The client sends to the server. The server now estimates by
If , , then
and will be an unbiased estimate of the mean.
With no prior information on , the natural choice is to set . With this value of we characterize the MSE, sensitivity, and communication complexity of the Binomial mechanism below. To characterize the sensitivity of , we need few definitions. For scalars , let
where . We note that quantities are omitted from the LHS of the equations for the ease of notation. Combined with Theorem 1, this yields the privacy guarantees for the binomial mechanism.
If , then the mean estimate is unbiased and
Furthermore if , then for any , is differentially private where is given by Theorem 1 with sensitivity parameters (Eq. (11)). Furthermore,
We provide the proof in Appendix D. For , we bound the communication cost as follows.
There exists an implementation of , which achieves the same privacy and error as the full precision Gaussian mechanism with a total communication complexity of
Therefore our results provide precise non-asymptotic and asymptotic guarantees on the total communication with respect to k. The communication cost of the above algorithm is bits per coordinate per client, which can be prohibitive. In the next section we show that these bounds can be improved via rotation.
3 Error reduction via randomized rotation
As seen in Corollary 2, if has error and privacy same as that of the Gaussian mechanism, it has high communication cost. The proof reveals that this is due to the error being proportional to . Therefore MSE reduces when is small, e.g., when is uniform on the unit sphere, is (whp) . showed that the same effect can be observed by randomly rotating the vectors before quantization. Here we show that random rotation reduces the leading term as well as improves the privacy guarantee.
and runs the protocol on . The server then obtains the mean estimate in the rotated space using the protocol and then multiplies by to obtain the coordinates in the original basis, i.e.,
Due to the fact that can be huge in practice, we need orthogonal matrices that permits fast matrix-vector products. Naive matrices that support fast multiplication such as block-diagonal matrices often result in high values of . Similar to , we propose to use a special type of orthogonal matrix , where is a random diagonal matrix with i.i.d. Rademacher entries ( with probability ) and is a Walsh-Hadamard matrix . The Walsh-Hadamard matrix of dimension for is given by the recursive formula,
Applying both rotation and its inverse takes time and space (with an in-place algorithm).
The next theorem provides the MSE guarantees for .
For any , let , then
and the bias is . Further if then is differentially private where is given by Theorem 1 with sensitivity parameters (Eq. (11)). Furthermore,
The following corollary bounds the communication cost for when .
There exists an implementation of , that achieves the same error and privacy of the full precision Gaussian mechanism with a total communication complexity:
Hence if , then has the same privacy and utilities as the Gaussian mechanism, but with just communication cost.
Discussion
We trained a three-layer model (60 hidden nodes each with ReLU activation) on the infinite MNIST dataset with 25M data points and 25M clients. At each step 10,000 clients send their data to the server. This setting is close to real-world settings of federated learning where there are hundreds of millions of users. The results are in Figure 2. Note that the models achieve different levels of accuracy depending on communication cost and privacy parameter . We note that we trained the model with exactly one epoch, so each sample was used at most once in training. In this setting, the per batch and the overall are the same.
There are several interesting future directions. On the theoretical side, it is not clear if our analysis of Binomial mechanism is tight. Furthermore, it is interesting to have better privacy accounting for Binomial mechanism via a moments accountant. On the practical side, we plan to explore the effects of neural network topology, over-parametrization, and optimization algorithms on the accuracy of the privately learned models.
Acknowledgements
The authors would like to thank Keith Bonawitz, Vitaly Feldman, Jakub Konečný, Ben Kreuter, Ilya Mironov, and Kunal Talwar for their valuable suggestions and inputs.
References
Appendix A Proof of biased SGD
where the last inequality uses the fact that . Rearranging the above inequality and summing over all we get that
Appendix B Binomial Mechanism - Proof of Theorem 1
To prove Theorem 1, we need few auxiliary lemmas. We first state two inequalities which we use through-out the proof.
Let be a symmetric function of independent random variables . Let be an i.i.d. copy of , then
We use the above two results in the next two lemmas.
where the inequality follows from considering the two cases when can be positive or negative. ∎
Let be real numbers. Let independently such that . Let be the event that for some , such that . Then for any , with probability conditioned on ,
Since and for any , ,
where the first inequality follows from the fact that and for any , . Hence we can set . We now bound the variance:
Let , then
where the inequality uses the fact that for any positive , . Observe that and . We use the following three inequalities, to bound the expectation of the term above. Similar results apply for as . Since and are negatively correlated,
Combining the above results and simplifying the terms, we get that the expectation of the required quantity is bounded by
and the lemma follows by Bernstein’s inequality. ∎
Firstly note that it is sufficient to consider the differential privacy of the quantity where is a Binomial random variable. Note that since is defined to be for some integer the output remains integral. Further note that in this setting the norm sensitivity scales . The above reduction shows that the scale can be considered to be 1 in the rest of the proof.
Consider any two neighboring data sets and let . Note that showing the differential privacy of the Binomial mechanism is equivalent to showing the following. Let be a vector such that then for any vector we have that
To show the above we will first define a set such that
Define as follows: if and only if,
We will first show that the probability of this event is large.
The first condition follows from Bernstein’s inequality with probability . For the second condition, observe that is a function of independent random variables. A direct application of Bernstein’s inequality yields that Equation (14) holds with probability . The third condition follows from the first condition as and . Applying Lemma 6 with being event that and , yields that the fourth equation holds with probability at least . Hence, by the union bound,
We now prove the ratio of probabilities. For any ,
where the inequality follows from Lemma 5. Since , applying Equations (13), (14), (15), together with the fact that (by the assumptions in the theorem) yields the following bound on the exponent.
where is defined in Equation (12) and
Appendix C High probability sensitivity Proof
To show differential privacy we need to show that for any two neighboring data sets and ,
Given any two neighboring data sets let represent the joint distribution of the coupled random variables guaranteed by Definition 1. Now for any we have that
In the above follow from the fact that is a coupling, follows from the condition (5) guaranteed by the coupling and follows from the differential privacy guarantee of the mechanism . ∎
Appendix D Application of Binomial Mechanism to Distributed Mean Estimation - Proof of Theorem 3
We refer the readers to the definition of the protocol (Section 5.2) and in particular the definitions of the random variables and the estimator given in equations (9) and (10) respectively.
The communication complexity follows immediately by noting that the protocol only transmits integers in the range and therefore only needs bits. We now prove the bound on the Mean Square Error of the protocol and then prove the sensitivity guarantee. Mean Square Error
For every , given two neighboring data sets and (where for ) we have that the protocol is -sensitive (c.f. Definition 1) where satisfy the following equations.
Further we note that the protocol is a composition of the binomial mechanism and the protocol . A direct application of Theorem 1 and Lemma 2 gives us that the mechanism is differentially private for any and satisfying the below conditions. we choose as in the application of Lemma 2 Note that the conditions required by Theorem 1 can be verified from the given conditions in Theorem 3. ∎
To this end we recall the definition of the random variables . Given and we associate to every integer in a bin defined as
Further given a number , let be the integer such that . We can now define the random variable
Now define the random variables and similarly . To provide high probability sensitivity bounds in accordance with Lemma 2, we need to define a coupling between the random variables and . To do the above we will define a coupling between the random variables and . The coupled random variables will be sampled as follows.
The defined coupling will have two cases. Define the set . We first consider the case when . In this case we sample a random variable uniformly at random and define the random variables
Additionally wlog consider (the roles of and can be reversed in the following definitions otherwise) and define the auxiliary variables
Otherwise if or equivalently , we sample the bins independently and the random variables are defined as
Additionally wlog consider (the roles of and can be reversed in the following definitions otherwise) and define the auxiliary variables
In this case define and note that
With these definitions, it can be seen that the marginal distributions of are equal to the marginal distributions of , respectively. Further note that since for all we have that w.p. 1 for all . Therefore
where the inequality follows from (21) and (22). We wish to bound the RHS above. To that end consider the following claim which follows from the definitions.
A direct application of Bernstein’s Inequality gives us that with probability at least
where the RHS uses Claim 1 for bounding expectation and variance.
Therefore combining (25) and (26), we get that
The proof is finished using a union bound.
Appendix E Quantization with Rotation
Differential Privacy Given any two neighboring data sets we define a set of good rotations as follows
where is the set of orthonormal matrices. The following lemma follows from . We note that similar analysis holds for uniformly sampled over real domain and we refer the reader to for details.
Let represent the random output of the protocol on respectively and let be any subset of the output range of . Given let be given by Theorem 1 with sensitivity parameters . Given a set of vectors and a rotation matrix define .
follows from differential privacy guarantee for from Theorem 3 and noting that in the integral. Hence offers differential-privacy.
follows from the argument above and follows from the MSE guarantee in Theorem 3 and by noting that the rotation is in .