cpSGD: Communication-efficient and differentially-private distributed SGD

Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan Mcmahan

Introduction

Distributed stochastic gradient descent (SGD) is a basic building block of modern machine learning . In the typical scenario of synchronous distributed learning, in every round, each client obtains a copy of a global model which it updates based on its local data. The updates (usually in the form of gradients) are sent to a parameter server, where they are averaged and used to update the global model. Alternatively, without a central server, each client saves a global model, broadcasts the gradient to all other clients, and updates its model with the aggregated gradient.

Often, the communication cost of sending the gradient becomes the bottleneck . To address this issue, several recent works have focused on reducing the communication cost of distributed learning algorithms via gradient quantization and sparsification . These algorithms have been shown to improve communication cost and hence communication time in distributed learning. This is especially effective in the federated learning setting where clients are mobile devices with expensive up-link communication cost .

While communication is a key concern in client based distributed machine learning, an equally important consideration is that of protecting the privacy of participating clients and their sensitive information. Providing rigorous privacy guarantees for machine learning applications has been an area of active recent interest . Differentially private gradient descent algorithms in particular were studied in the work of . A direct application of these mechanisms in distributed settings leads to algorithms with high communication costs. The key focus of our paper is to analyze mechanisms that achieve rigorous privacy guarantees as well as have communication efficiency.

2 Communication efficiency

where each fif_{i} resides at the ithi^{th} client. For example, ww’s are weights of a neural network and fi(w)f_{i}(w) is the loss of the network on data located on client ii. Let w0w^{0} be the initial value. At round tt, the server transmits wtw^{t} to all the clients and asks a random set of nn (batch size / lot size) clients to transmit their local gradient estimates git(wt){g}^{t}_{i}(w^{t}). Let SS be the subset of clients.

for some suitable choice of γ\gamma. Other optimization algorithms such as momentum, Adagrad, or Adam can also be used instead of the SGD step above.

Naively for the above protocol, each of the nn clients needs to transmit dd reals, typically using O(dlog1/η)\mathcal{O}(d\cdot\log 1/\eta) bitsη\eta is the per-coordinate quantization accuracy. To represent a dd dimensional vector XX to an constant accuracy in Euclidean distance, each coordinate is usually quantized to an accuracy of η=1/d\eta=1/\sqrt{d}..

This communication cost can be prohibitive, e.g., for a medium size PennTreeBank language model , the number of parameters d>10 milliond>10\text{ million} and hence total cost is 38\sim 38MB (assuming 32 bit float), which is too large to be sent from a mobile phone to the server at every round.

Motivated by the need for communication efficient protocols, various quantization algorithms have been proposed to reduce the communication cost . In these protocols, the clients quantize the gradient by a function qq and send an efficient representation of q(git(wt))q({g}^{t}_{i}(w^{t})) instead of its actual local gradient git(wt){g}^{t}_{i}(w^{t}). The server computes the gradient as

and updates wtw^{t} as before. Specifically, proposes a quantization algorithm which reduces the requirement of full (or floating point) arithmetic precision to a bit or few bits per value on average. There are many subsequent works e.g., see and in particular showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD for convex functions. If the expected communication cost at every round tt is bounded by cc, then the total communication cost of the modified gradient descent is at most

All the previous papers relate the error in gradient compression to SGD convergence. We first state one such result for completeness for non-convex functions and prove it in Appendix A. Similar (and stronger) results can be obtained for (strongly) convex functions using results in and .

Let FF be LL-smooth and x  F(x)2D\forall x\;\|\nabla F(x)\|_{2}\leq D. Let w0w^{0} satisfy F(w0)F(w)DFF(w^{0})-F(w^{*})\leq D_{F}. Let qq be a quantization scheme, and γmin{L1,2DF(σLT)1},\gamma\triangleq\min\left\{L^{-1},\sqrt{2D_{F}}(\sigma\sqrt{LT})^{-1}\right\}, then after TT rounds

The above result relates the convergence of distributed SGD for non-convex functions to the worst-case mean square error (MSE) and bias in gradient mean estimates in Equation (2). Thus smaller the mean square error in gradient estimation, better convergence. Hence, we focus on the problem of distributed mean estimation (DME), where the goal is to estimate the mean of a set of vectors.

3 Differential privacy

While the above schemes reduce the communication cost, it is unclear what (if any) privacy guarantees they offer. We study privacy from the lens of differential privacy (DP). The notion of differential privacy provides a strong notion of individual privacy while permitting useful data analysis in machine learning tasks. We refer the reader to for a survey. Informally, for the output to be differentially private, the estimated model should be indistinguishable whether a particular client’s data was taken into consideration or not. We define this formally in Section 2.

In the context of client based distributed learning, we are interested in the privacy of the gradients aggregated from clients; differential privacy for the average gradients implies privacy for the resulting model since DP is preserved by post-processing.

The standard approach is to let the server add the noise to the averaged gradients (e.g., see and references within). However, the above only works under a restrictive assumption that the clients can trust the server. Our goal is to also minimize the need for clients to trust the central aggregator, and hence we propose the following model:

Clients add their share of the noise to their gradients gitg^{t}_{i} before transmission. Aggregation of gradients at the server results in an estimate with noise equal to the sum of the noise added at each client.

This approach improves over server-controlled noise addition in several scenarios:

Clients do not trust the server: Even in the scenario when the server is not trustworthy, the above scheme can be implemented via cryptographically secure aggregation schemes , which ensures that the only information about the individual users the server learns is what can be inferred from the sum. Hence, differential privacy of the aggregate now ensures that the parameter server does not learn any individual user information. This will encourage clients to participate in the protocol even if they do not fully trust the server. We note that while secure aggregation schemes add to the communication cost (e.g., adds log2(kn)\log_{2}(k\cdot n) for kk levels of quantization), our proposed communication benefits still hold. For example, if n=1024n=1024, a 4-bit quantization protocol would reduce communication cost by 67% compared to the 32 bit representation.

Server is negligent, but not malicious: the server may "forget" to add noise, but is not malicious and not interested in learning characteristics of individual users. However, if the server releases the learned model to public, it needs to be differentially-private.

A natural way to extend the results of is to let individual users add Gaussian noise to their gradients before transmission. Since the sum of Gaussians is Gaussian itself, differential privacy results follow. However, the transmitted values now are real numbers and the benefits of gradient compression are lost. Further, secure aggregation protocols require discrete inputs. To resolve these issues, we propose that the clients add noise drawn from an appropriately parameterized Binomial distribution. We refer to this as the Binomial mechanism. Since Binomial random variables are discrete, they can be transmitted efficiently. Furthermore, the choice of the Binomial is convenient in the distributed setting because sum of Binomials is also binomially distributed i.e., if

Hence the total noise post aggregation can be analyzed easily, which is convenient for the distributed settingAnother choice is the Poisson distribution. Different from Poisson, the Binomial distribution has bounded support and has an easily analyzable communication complexity which is always bounded.. Binomial mechanism can be of independent interest in other applications with discrete output as well. Furthermore, unlike Gaussian it avoids floating point representation issues.

4 Summary of our results

Binomial mechanism: We first study Binomial mechanism as a generic mechanism to release discrete valued data. Previous analysis of the Binomial mechanism (where you add noise Bin(N,p)\text{Bin}(N,p)) was due to , who analyzed the 11-dimensional case for p=1/2p=1/2 and showed that to achieve (ε,δ)(\varepsilon,\delta) differential privacy, NN needs to be 64log(2/δ)/ε2\geq 64\log(2/\delta)/\varepsilon^{2}. We improve the analysis in the following ways:

dd-dimensions. We extend the analysis of 11-dimensional Binomial mechanism to dd dimensions. Unlike the Gaussian distribution, Binomial is not rotation invariant making the analysis more involved. The key fact utilized in this analysis is that Binomial distribution is locally rotation-invariant around the mean.

Improvement. We improve the previous result and show that N8log(2/δ)/ε2N\geq 8\log(2/\delta)/\varepsilon^{2} suffices for small ε\varepsilon, implying that the Binomial and Gaussian mechanism perform identically as ε0\varepsilon\to 0. We note that while this is a constant improvement , it is crucial in making differential privacy practical.

Differentially-private distributed mean estimation (DME): A direct application of Gaussian mechanism requires ndn\cdot d reals and hence ndlog(nd)n\cdot d\cdot\log(nd) bits of communication. This can be prohibitive in practice. However, a direct application of quantization and Binomial mechanism has high communication cost. We show that random rotation together with the notion of high probability sensitivity can significantly improve communication.

In particular, for ε=O(1)\varepsilon=O(1), we provide an algorithm achieving equal privacy and error as that of the Gaussian mechanism with communication

per round of distributed SGD. Hence when dnd\approx n, the number of bits is ndlog(log(nd)/δ)n\cdot d\cdot\log(\log(nd)/\delta).

The rest of the paper is organized as follows. In Section 2, we review the notion of differential privacy and state our results for the Binomial mechanism. Motivated by the fact that the convergence of SGD can be reduced to the error in gradient estimate computation per-round, we formally describe the problem of DME in Section 4 and state our results in Section 5.

In Section 5.2, we provide and analyze the implementation of the binomial mechanism in conjunction with quantization in the context of DME. The main idea is for each client to add noise drawn from an appropriately parameterized Binomial distribution to each quantized value before sending to the server. The server further subtracts the bias introduced by the noise to achieve an unbiased mean estimator. We further show in Section 5.3 that the rotation procedure proposed in which reduces the MSE is helpful in reducing the additional error due to differential privacy.

Differential privacy

We start by defining the notion of differential privacy. Formally, given a set of data sets D{\mathcal{D}} provided with a notion of neighboring data sets NDD×D\mathcal{N}_{{\mathcal{D}}}\subset{\mathcal{D}}\times{\mathcal{D}} and a query function f:DXf:{\mathcal{D}}\rightarrow\mathcal{X}, a mechanism M:XO{\mathcal{M}}:\mathcal{X}\rightarrow{\mathcal{O}} to release the answer of the query, is defined to be (ε,δ)(\varepsilon,\delta) differentially private if for any measurable subset SOS\subseteq{\mathcal{O}} and two neighboring data sets (D1,D2)ND(D_{1},D_{2})\in\mathcal{N}_{{\mathcal{D}}},

The canonical mechanism to achieve (ε,δ)(\varepsilon,\delta) differential privacy is the Gaussian mechanism Mgσ{\mathcal{M}}_{g}^{\sigma} :

Mgσ{\mathcal{M}}_{g}^{\sigma} is (Δ2σ2log1.25δ,δ)(\frac{\Delta_{2}}{\sigma}\sqrt{2\log\frac{1.25}{\delta}},\delta) differentially private All logs are to base ee unless otherwise stated. and the error is bounded by dσ2.d\cdot\sigma^{2}.

2 High probability sensitivity

Given a set of integers QQ and values ΔQ,δ\Delta_{Q},\delta, we call a randomized function f:DXf:\mathcal{D}\rightarrow\mathcal{X}, (ΔQ,δ)(\Delta_{Q},\delta) sensitive, if for any two neighboring data sets D1,D2NDD_{1},D_{2}\in\mathcal{N}_{\mathcal{D}}, there exist coupled random variables X1,X2XX_{1},X_{2}\in\mathcal{X} such that the marginal distributions of X1,X2X_{1},X_{2} are identical to that of f(D1){f}(D_{1}) and f(D2){f}(D_{2}) and

We show the following result for high-probability sensitivity and the proof is provided in Appendix C.

Let M:XO\mathcal{M}:\mathcal{X}\rightarrow\mathcal{O} be an (ε,δ)(\varepsilon,\delta) differentially private mechanism for sensitivity ΔQ\Delta_{Q} and let f:DX{f}:\mathcal{D}\rightarrow\mathcal{X} be a (ΔQ,δ)(\Delta_{Q},\delta^{\prime}) sensitive function. Then the composed mechanism M(f(D)){\mathcal{M}}(f(D)) is (ε,δ+δ)(\varepsilon,\delta+\delta^{\prime}) differentially private.

Binomial Mechanism

where for each coordinate ii, ZiBin(N,p)Z_{i}\sim\text{Bin}(N,p) and independent. One dimensional binomial mechanism was introduced by for the case when p=1/2p=1/2. We analyze the mechanism for the general dd-dimensional case and for any pp. This analysis is involved as the Binomial mechanism is not rotation invariant. By carefully exploiting the local rotation invariant structure near the mean, we show that:

For any δ\delta, parameters N,p,sN,p,s and sensitivity bounds Δ1,Δ2,Δ\Delta_{1},\Delta_{2},\Delta_{\infty} such that

the Binomial mechanism is (ε,δ)(\varepsilon,\delta) differentially private for

where bpb_{p}, cp,c_{p}, and dpd_{p} are defined in (17), (12), and (16) respectively, and for p=1/2p=1/2, bp=1/3b_{p}=1/3, cp=5/2c_{p}=5/2, and dp=2/3d_{p}=2/3. The error of the mechanism is

The proof is given in Appendix B. We make some remarks regarding the design and the guarantee for the Binomial Mechanism. Note that the privacy guarantee for the Binomial mechanism depends on all three sensitivity parameters Δ2,Δ,Δ1\Delta_{2},\Delta_{\infty},\Delta_{1} as opposed to the Gaussian mechanism which only depends on Δ2\Delta_{2}. The Δ1\Delta_{1} and Δ\Delta_{\infty} terms can be seen as the added complexity due to discretization.

Secondly setting s=1s=1 (i.e. providing no scale to the noise) in the expression (7), it can be readily seen that the terms involving Δ1\Delta_{1} and Δ2\Delta_{2} scale differently with respect to the variance of the noise. This motivates the use of the accompanying quantization scale ss in the mechanism. Indeed it is possible that the resolution of the integer that is provided by the Binomial noise could potentially be too large for the problem leading to worse guarantees. In this setting, the quantization parameter ss helps normalize the noise correctly. Further, it can be seen as long as the variance of the random variable sZs\cdot Z is fixed, increasing Np(1p)Np(1-p) and decreasing ss makes the Binomial mechanism closer to the Gaussian mechanism. Formally, if we let σ=sNp(1p)\sigma=s\sqrt{Np(1-p)} and sσ/(cd)s\leq\sigma/(c\sqrt{d}), then using the Cauchy-Schwartz inequality, the ε\varepsilon guarantee (7) can be rewritten as

The variance of the Binomial distribution is Np(1p)Np(1-p) and the leading term in ε\varepsilon matches exactly the ε\varepsilon term in Gaussian mechanism. Furthermore, if ss is o(1/d)o(1/\sqrt{d}), then this mechanism is very similar to the Gaussian mechanism. This result agrees with the Berry-Esseen type Central limit theorems for the convergence of one dimensional Binomial distribution to the Gaussian distribution.

In Figure 2, we plot the error vs ε\varepsilon for Gaussian and Binomial mechanism. Observe that as scale is reduced, error vs privacy trade-off for Binomial mechanism approaches that of Gaussian mechanism.

Distributed mean estimation (DME)

We have related the SGD convergence rate to the MSE in approximating the gradient at each step in Corollary 1. Eq. (1) relates the communication cost of SGD to the communication cost of estimating gradient means. Advanced composition theorem (Thm. 3.5 ) or moments accounting can be used to relate the privacy guarantee of SGD to that of gradient mean estimate at each instance tt. We also note that in SGD, we often sample the clients, standard privacy amplification results via sampling , can be used to get tighter bounds in this case.

Therefore, akin to , in the rest of the paper we just focus on the MSE and privacy guarantees of DME. The results for synchronous distributed GD follow from Corollary 1 (convergence), advanced composition theorem (privacy), and Eq. (1) (communication).

at a central server. For gradient descent at each round tt, XiX_{i} is set to gitg^{t}_{i}. DME is a fundamental building block for many distributed learning algorithms including distributed PCA/clustering .

Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server at the same time. We allow the use of both private and public randomness. Private randomness refers to random values generated by each client separately, and public randomness refers to a sequence of random values that are shared among all partiesPublic randomness can be emulated by the server communicating a random seed.

The total number of bits transmitted by all clients with the protocol π\pi is

Let the estimated mean be Xˉ^\hat{\bar{X}}. For a protocol π\pi, the MSE of the estimate is

We note that bounds on E((π,X1n)\mathcal{E}((\pi,X^{n}_{1}), translates to bounds on gradients estimates in Eq. (2) and result in convergence guarantees via Corollary 1.

2 Differential privacy

To state the privacy results for DME, we define the notion of data sets and neighbors as follows. A dataset is a collection of vectors X={X1,Xn}X=\{X_{1},\ldots X_{n}\}. The notion of neighboring data sets typically corresponds to those differing only on the information of one user, i.e. X,XiX,X_{\otimes i} are neighbors if they differ in one vector.

Note that this notion of neighbors for DME in the context of distributed gradient descent translates to two data sets

being neighbors if they differ in one function fif_{i} and corresponds to guaranteeing privacy for individual client’s data. The bound Xi2D\mathopen{\|}X_{i}\mathclose{\|}_{2}\leq D translates to assuming gitD\|g^{t}_{i}\|\leq D, ensured via gradient clipping.

Results for distributed mean estimation (DME)

In this section we describe our algorithms, the associated MSE, and the privacy guarantees in the context of DME. First, we first establish a baseline by stating the results for implementing the Gaussian mechanism by adding Gaussian noise on each client vector.

In the Gaussian mechanism, each client sends vector

Under the Gaussian mechanism, the mean estimate is unbiased and communication cost is ndn\cdot d reals. Moreover, for any δ\delta and σ2Dn2log1.25/δ\sigma\geq\frac{2D}{\sqrt{n}}\cdot\sqrt{2\log{1.25}/{\delta}}, it is (ε,δ)(\varepsilon,\delta) differentially private for

We remark that real numbers can be quantized to O(logdn/εδ)\mathcal{O}(\log dn/\varepsilon\delta) bits with insignificant effect to privacyFollows by observing that quantizing all values to 1/poly(n,d,1/ε,log1/δ)1/\text{poly}(n,d,1/\varepsilon,\log 1/\delta) accuracy ensures minimum loss in privacy. In practice this is often implemented using 32 bits of quantization via float representation.. However this is asymptotic and can be prohibitive in practice , where we have a small fixed communication budget and dd is of the order of millions. A natural way to reduce communication cost is via quantization, where each client quantizes YiY_{i}s before transmitting. However how privacy guarantees degrade due to quantization of the Gaussian mechanism is hard to analyze particularly under aggregation. Instead we propose to use the Binomial mechanism which we describe next.

2 Stochastic k𝑘k-level quantization + Binomial mechanism

We now define the mechanism πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)) based on kk-bit stochastic quantization πsk\pi_{sk} proposed in composed with the Binomial mechanism. It will be parameterized by 33 quantities k,m,pk,m,p.

First, the server sends XmaxX^{\max} to all the clients, with the hope that for all i,ji,j, XmaxXi(j)Xmax-X^{\max}\leq X_{i}(j)\leq X^{\max}. The clients then clip each coordinate of their vectors to the range [Xmax,Xmax][-X^{\max},X^{\max}]. For every integer rr in the range [0,k)[0,k), let B(r)B(r)represent a bin (one for each rr), i.e.

The algorithm quantizes each coordinate into one of the bins stochastically and adds scaled Binomial noise. Formally client ii computes the following quantities for every jj

where rr is such that Xi(j)[B(r),B(r+1)]X_{i}(j)\in[B(r),B(r+1)] and Ti(j)Bin(m,p)T_{i}(j)\sim\text{Bin}(m,p). The client sends YiY_{i} to the server. The server now estimates Xˉ\bar{X} by

If j\forall j, Xi(j)[Xmax,Xmax]X_{i}(j)\in[-X^{\max},X^{\max}], then

and Xˉ^πsk(Bin(m,p))\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))} will be an unbiased estimate of the mean.

With no prior information on XmaxX^{\max}, the natural choice is to set Xmax=DX^{\max}=D. With this value of XmaxX^{\max} we characterize the MSE, sensitivity, and communication complexity of the Binomial mechanism below. To characterize the sensitivity of Xˉ^πsk(Bin(m,p))\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))}, we need few definitions. For scalars D,XmaxD,X^{\max}, let

where q=Xmax/(k1)q=X^{\max}/(k-1). We note that quantities k,δ,ε,dk,\delta,\varepsilon,d are omitted from the LHS of the equations for the ease of notation. Combined with Theorem 1, this yields the privacy guarantees for the binomial mechanism.

If Xmax=DX^{\max}=D, then the mean estimate is unbiased and

Furthermore if mnp(1p)max(23log(10d/δ),2Δ(D,Xmax))mnp(1-p)\geq\max\left(23\log(10d/\delta),2\Delta_{\infty}(D,X^{max})\right), then for any δ\delta, Xˉ^πsk(Bin(m,p))\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))} is (ε,2δ)(\varepsilon,2\delta) differentially private where ε\varepsilon is given by Theorem 1 with sensitivity parameters {Δ1(Xmax,D),Δ2(Xmax,D),Δ(Xmax,D)}\{\Delta_{1}(X^{max},D),\Delta_{2}(X^{max},D),\Delta_{\infty}(X^{max},D)\} (Eq. (11)). Furthermore,

We provide the proof in Appendix D. For ε1\varepsilon\leq 1, we bound the communication cost as follows.

There exists an implementation of πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)), which achieves the same privacy and error as the full precision Gaussian mechanism with a total communication complexity of

Therefore our results provide precise non-asymptotic and asymptotic guarantees on the total communication with respect to k. The communication cost of the above algorithm is Ω(logd)\Omega(\log d) bits per coordinate per client, which can be prohibitive. In the next section we show that these bounds can be improved via rotation.

3 Error reduction via randomized rotation

As seen in Corollary 2, if πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)) has error and privacy same as that of the Gaussian mechanism, it has high communication cost. The proof reveals that this is due to the error being proportional to O(d(Xmax)2/n)O({d}(X^{\max})^{2}/{n}). Therefore MSE reduces when XmaxX^{\max} is small, e.g., when XiX_{i} is uniform on the unit sphere, XmaxX^{\max} is O((logd)/d)\mathcal{O}\left(\sqrt{(\log d)/d}\right) (whp) . showed that the same effect can be observed by randomly rotating the vectors before quantization. Here we show that random rotation reduces the leading term as well as improves the privacy guarantee.

and runs the protocol on X1,X2,XnX^{\prime}_{1},X^{\prime}_{2},\ldots X^{\prime}_{n}. The server then obtains the mean estimate Xˉ^\hat{\bar{X^{\prime}}} in the rotated space using the protocol π\pi and then multiplies by R1R^{-1} to obtain the coordinates in the original basis, i.e.,

Due to the fact that dd can be huge in practice, we need orthogonal matrices that permits fast matrix-vector products. Naive matrices that support fast multiplication such as block-diagonal matrices often result in high values of Xi2\|X^{\prime}_{i}\|_{\infty}^{2}. Similar to , we propose to use a special type of orthogonal matrix R=1dHAR=\frac{1}{\sqrt{d}}HA, where AA is a random diagonal matrix with i.i.d. Rademacher entries (±1\pm 1 with probability 0.50.5) and HH is a Walsh-Hadamard matrix . The Walsh-Hadamard matrix of dimension 2m2^{m} for mNm\in\mathcal{N} is given by the recursive formula,

Applying both rotation and its inverse takes O(dlogd)\mathcal{O}(d\log d) time and O(1)\mathcal{O}(1) space (with an in-place algorithm).

The next theorem provides the MSE guarantees for Rot(πsk(Bin(m,p)),HA)\text{Rot}(\pi_{sk}(\text{Bin}(m,p)),HA).

For any δ\delta, let Xmax=2Dlog(2nd/δ)dX^{\max}=2D\sqrt{\frac{\log(2nd/\delta)}{d}}, then

and the bias is 2Dδ\leq 2D\delta. Further if mnp(1p)max(23log(10d/δ),2Δ(D,Xmax)),mnp(1-p)\geq\max\left(23\log(10d/\delta),2\Delta_{\infty}(D,X^{max})\right), then Xˉ^(Rot(πsk(Bin(m,p))))\hat{\bar{X}}(\text{Rot}(\pi_{sk}(\text{Bin}(m,p)))) is (ε,3δ)(\varepsilon,3\delta) differentially private where ε\varepsilon is given by Theorem 1 with sensitivity parameters {Δ1(Xmax,D),Δ2(Xmax,D),Δ(Xmax,D)}\{\Delta_{1}(X^{max},D),\Delta_{2}(X^{max},D),\Delta_{\infty}(X^{max},D)\} (Eq. (11)). Furthermore,

The following corollary bounds the communication cost for Rot(πsk(Bin(m,p)),HA)\text{Rot}(\pi_{sk}(\text{Bin}(m,p)),HA) when ε1\varepsilon\leq 1.

There exists an implementation of Rot(πsk(Bin(m,p)),HA)\text{Rot}(\pi_{sk}(\text{Bin}(m,p)),HA), that achieves the same error and privacy of the full precision Gaussian mechanism with a total communication complexity:

Hence if d=O(nε2)d=\mathcal{O}(n\varepsilon^{2}), then Rot(πsk(Bin(m,p)),HA)\text{Rot}(\pi_{sk}(\text{Bin}(m,p)),HA) has the same privacy and utilities as the Gaussian mechanism, but with just O(ndloglog(nd/δε))\mathcal{O}(nd\log\log(nd/\delta\varepsilon)) communication cost.

Discussion

We trained a three-layer model (60 hidden nodes each with ReLU activation) on the infinite MNIST dataset with 25M data points and 25M clients. At each step 10,000 clients send their data to the server. This setting is close to real-world settings of federated learning where there are hundreds of millions of users. The results are in Figure 2. Note that the models achieve different levels of accuracy depending on communication cost and privacy parameter ε\varepsilon. We note that we trained the model with exactly one epoch, so each sample was used at most once in training. In this setting, the per batch ε\varepsilon and the overall ε\varepsilon are the same.

There are several interesting future directions. On the theoretical side, it is not clear if our analysis of Binomial mechanism is tight. Furthermore, it is interesting to have better privacy accounting for Binomial mechanism via a moments accountant. On the practical side, we plan to explore the effects of neural network topology, over-parametrization, and optimization algorithms on the accuracy of the privately learned models.

Acknowledgements

The authors would like to thank Keith Bonawitz, Vitaly Feldman, Jakub Konečný, Ben Kreuter, Ilya Mironov, and Kunal Talwar for their valuable suggestions and inputs.

References

Appendix A Proof of biased SGD

where the last inequality uses the fact that γL1\gamma L\leq 1. Rearranging the above inequality and summing over all tt we get that

Appendix B Binomial Mechanism - Proof of Theorem 1

To prove Theorem 1, we need few auxiliary lemmas. We first state two inequalities which we use through-out the proof.

Let ff be a symmetric function of nn independent random variables X1,X2,XnX_{1},X_{2},\ldots X_{n}. Let X1X^{\prime}_{1} be an i.i.d. copy of X1X_{1}, then

We use the above two results in the next two lemmas.

where the inequality follows from considering the two cases when tt can be positive or negative. ∎

Let t1,t2,tdt_{1},t_{2},\ldots t_{d} be dd real numbers. Let viBin(N,p)v_{i}\sim\text{Bin}(N,p) independently such that Np(1p)39Np(1-p)\geq 39. Let AA be the event that viNpβ\mathopen{\|}v_{i}-Np\mathclose{\|}_{\infty}\leq\beta for some β\beta, such that βNmin(p,1p)/3\beta\leq N\min(p,1-p)/3. Then for any δ\delta, with probability 1δ\geq 1-\delta conditioned on AA,

Since βNmin(p,1p)/3\beta\leq N\min(p,1-p)/3 and for any z1/3z\geq-1/3, log(1+z)z1.95z2/3|\log(1+z)-z|\leq 1.95z^{2}/3,

where the first inequality follows from the fact that βNmin(p,1p)/3\beta\leq N\min(p,1-p)/3 and for any z1/3z\geq-1/3, log(1+z)z2z2/3|\log(1+z)-z|\leq 2z^{2}/3. Hence we can set M=23(β+1)2(p2+(1p))2N2p2(1p)2M=\frac{2}{3}\frac{(\beta+1)^{2}(p^{2}+(1-p))^{2}}{N^{2}p^{2}(1-p)^{2}}. We now bound the variance:

Let w=jjnXi(j)w=\sum^{n}_{j^{\prime}\neq j}X_{i}(j^{\prime}), then

where the inequality uses the fact that for any positive xx, xx2/2logxxx-x^{2}/2\leq\log x\leq x. Observe that wBin(n1,p)w\sim\text{Bin}(n-1,p) and N1wBin(n1,1p)N-1-w\sim\text{Bin}(n-1,1-p). We use the following three inequalities, to bound the expectation of the term above. Similar results apply for NwN-w as N1wBin(n1,1p)N-1-w\sim\text{Bin}(n-1,1-p). Since 1/w1/w and 1/(Nw)1/(N-w) are negatively correlated,

Combining the above results and simplifying the terms, we get that the expectation of the required quantity is bounded by

and the lemma follows by Bernstein’s inequality. ∎

Firstly note that it is sufficient to consider the differential privacy of the quantity f(D)s+Z\frac{f(D)}{s}+Z where ZZ is a Binomial random variable. Note that since ss is defined to be 1/j1/j for some integer jj the output f(D)/sf(D)/s remains integral. Further note that in this setting the lql_{q} norm sensitivity scales Δq/s\Delta_{q}/s. The above reduction shows that the scale ss can be considered to be 1 in the rest of the proof.

Consider any two neighboring data sets D1,D2D_{1},D_{2} and let Δf(D2)f(D1)\Delta\triangleq f(D_{2})-f(D_{1}). Note that showing the (ε,δ)(\varepsilon,\delta) differential privacy of the Binomial mechanism is equivalent to showing the following. Let TT be a vector such that T(j)Bin(N,p)T(j)\sim\text{Bin}(N,p) then for any vector v[N]dv\in[N]^{d} we have that

To show the above we will first define a set VV such that

Define VV as follows: vVv\in V if and only if,

We will first show that the probability of this event is large.

The first condition follows from Bernstein’s inequality with probability 1δ/10\geq 1-\delta/10. For the second condition, observe that Δ(sNp)\Delta\cdot(s-Np) is a function of NdNd independent random variables. A direct application of Bernstein’s inequality yields that Equation (14) holds with probability 1δ/1.25\geq 1-\delta/1.25. The third condition follows from the first condition as ΔNpβ\mathopen{\|}\Delta\mathclose{\|}_{\infty}\leq Np-\beta and Np(1p)/3βNp(1-p)/3\geq\beta. Applying Lemma 6 with AA being event that vNpβ\mathopen{\|}v-Np\mathclose{\|}_{\infty}\leq\beta and δ=δ/10\delta=\delta/10, yields that the fourth equation holds with probability at least 1δ/101-\delta/10. Hence, by the union bound,

We now prove the ratio of probabilities. For any vv,

where the inequality follows from Lemma 5. Since vVv\in V, applying Equations (13), (14), (15), together with the fact that β2.5Np(1p)log(20d/δ)\beta\leq\sqrt{2.5Np(1-p)\log(20d/\delta)} (by the assumptions in the theorem) yields the following bound on the exponent.

where cpc_{p} is defined in Equation (12) and

Appendix C High probability sensitivity Proof

To show (ε,δ+δ)(\varepsilon,\delta+\delta^{\prime}) differential privacy we need to show that for any two neighboring data sets D1,D2D_{1},D_{2} and OOO\subseteq\mathcal{O},

Given any two neighboring data sets D1,D2D_{1},D_{2} let PrΔQ,δ(X1,X2)\Pr_{\Delta_{Q},\delta}(X_{1},X_{2}) represent the joint distribution of the coupled random variables X1,X2X_{1},X_{2} guaranteed by Definition 1. Now for any OOO\in\mathcal{O} we have that

In the above (a),(d)(a),(d) follow from the fact that PrΔq,δ\Pr_{\Delta_{q},\delta} is a coupling, (b)(b) follows from the condition (5) guaranteed by the coupling and (c)(c) follows from the (ε,δ)(\varepsilon,\delta) differential privacy guarantee of the mechanism M\mathcal{M}. ∎

Appendix D Application of Binomial Mechanism to Distributed Mean Estimation - Proof of Theorem 3

We refer the readers to the definition of the protocol (Section 5.2) and in particular the definitions of the random variables Ui,Ti,U_{i},T_{i}, and the estimator Xˉ^πsk(Bin(m,p))\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))} given in equations (9) and (10) respectively.

The communication complexity follows immediately by noting that the protocol only transmits integers in the range [0,k+m)[0,k+m) and therefore only needs log(k+m)\log(k+m) bits. We now prove the bound on the Mean Square Error of the protocol and then prove the sensitivity guarantee. Mean Square Error

For every δ\delta, given two neighboring data sets X{X1Xn}X\triangleq\{X_{1}\ldots X_{n}\} and Xn{X1Xn}X_{\otimes n}\triangleq\{X^{\prime}_{1}\ldots X^{\prime}_{n}\} (where Xi=XiX^{\prime}_{i}=X_{i} for i[1,n1]i\in[1,n-1]) we have that the protocol πsk\pi_{sk} is ({Δ1,Δ2,Δ},δ)(\{\Delta_{1},\Delta_{2},\Delta_{\infty}\},\delta)-sensitive (c.f. Definition 1) where Δ1,Δ2,Δ\Delta_{1},\Delta_{2},\Delta_{\infty} satisfy the following equations.

Further we note that the protocol πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)) is a composition of the binomial mechanism and the protocol πsk\pi_{sk}. A direct application of Theorem 1 and Lemma 2 gives us that the mechanism πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)) is (ε,2δ)(\varepsilon,2\delta) differentially private for any δ(0,1)\delta\in(0,1) and ε\varepsilon satisfying the below conditions. we choose δ,δ\delta,\delta^{\prime} as δ\delta in the application of Lemma 2 Note that the conditions required by Theorem 1 can be verified from the given conditions in Theorem 3. ∎

To this end we recall the definition of the random variables Ui(j)U_{i}(j). Given XmaxX^{\max} and XminX^{\min} we associate to every integer rr in [0,k)[0,k) a bin B(r)B(r) defined as

Further given a number X[Xmax,Xmax]X\in[-X^{\max},X^{\max}], let r(X)r(X) be the integer such that X[B(r(X)),B(r(X)+1)]X\in[B(r(X)),B(r(X)+1)]. We can now define the random variable

Now define the random variables UiX(j)U(Xi(j))U^{X}_{i}(j)\triangleq U(X_{i}(j)) and similarly UiXn(j)U(Xi(j))U^{X_{\otimes n}}_{i}(j)\triangleq U(X^{\prime}_{i}(j)). To provide high probability sensitivity bounds in accordance with Lemma 2, we need to define a coupling between the random variables iUiX\sum_{i}U^{X}_{i} and iUiXn\sum_{i}U^{X_{\otimes n}}_{i}. To do the above we will define a coupling between the random variables UiX(j)U^{X}_{i}(j) and UiXn(j)U^{X_{\otimes n}}_{i}(j). The coupled random variables will be sampled as follows.

The defined coupling will have two cases. Define the set S={(i,j)r(Xi(j))=r(Xi(j))}S=\{(i,j)|r(X_{i}(j))=r(X^{\prime}_{i}(j))\}. We first consider the case when (i,j)S(i,j)\in S. In this case we sample a random variable αij\alpha_{ij}\in uniformly at random and define the random variables

Additionally wlog consider Xi>XiX_{i}>X^{\prime}_{i} (the roles of ii and ii^{\prime} can be reversed in the following definitions otherwise) and define the auxiliary variables

Otherwise if (i,j)S(i,j)\notin S or equivalently r(Xi(j))r(Xi(j))r(X_{i}(j))\neq r(X^{\prime}_{i}(j)), we sample the bins independently and the random variables are defined as

Additionally wlog consider Xi>XiX_{i}>X^{\prime}_{i} (the roles of ii and ii^{\prime} can be reversed in the following definitions otherwise) and define the auxiliary variables

In this case define Li,jr(Xi(j))r(Xi(j))+1+Zi(j)L_{i,j}\triangleq r(X_{i}(j))-r(X^{\prime}_{i}(j))+1+Z_{i}(j) and note that

With these definitions, it can be seen that the marginal distributions of Yi(j),Yin(j)Y_{i}(j),Y^{\otimes n}_{i}(j) are equal to the marginal distributions of UiX(j)U^{X}_{i}(j), UiXn(j)U^{X_{\otimes n}}_{i}(j) respectively. Further note that since Xi=XiX^{\prime}_{i}=X_{i} for all i[1,n1]i\in[1,n-1] we have that Yi=YinY_{i}=Y^{\otimes n}_{i} w.p. 1 for all i[1,n1]i\in[1,n-1]. Therefore

where the inequality follows from (21) and (22). We wish to bound the RHS above. To that end consider the following claim which follows from the definitions.

A direct application of Bernstein’s Inequality gives us that with probability at least 1δ/21-\delta/2

where the RHS uses Claim 1 for bounding expectation and variance.

Therefore combining (25) and (26), we get that

The proof is finished using a union bound.

Appendix E Quantization with Rotation

Differential Privacy Given any two neighboring data sets X={X1,Xn},Xn={X1,Xn}X=\{X_{1},\ldots X_{n}\},X_{\otimes n}=\{X_{1},\ldots X^{\prime}_{n}\} we define a set of good rotations UgoodU_{\text{good}} as follows

where UU is the set of d×dd\times d orthonormal matrices. The following lemma follows from . We note that similar analysis holds for uniformly sampled RR over real domain and we refer the reader to for details.

Let Rot(π,HA)(X),Rot(π,HA)(Xn)\text{Rot}(\pi,HA)(X),\text{Rot}(\pi,HA)(X_{\otimes n}) represent the random output of the protocol Rot(π,HA)\text{Rot}(\pi,HA) on X,XnX,X_{\otimes n} respectively and let SS be any subset of the output range of Rot(π,HA)\text{Rot}(\pi,HA). Given δ\delta let ε\varepsilon be given by Theorem 1 with sensitivity parameters {Δ1(Xmax,D),Δ2(Xmax,D),Δ(Xmax,D)}\{\Delta_{1}(X^{max},D),\Delta_{2}(X^{max},D),\Delta_{\infty}(X^{max},D)\}. Given a set of vectors VV and a rotation matrix RR define RV={RvvV}R\cdot V=\{Rv|v\in V\}.

aa follows from (ε,2δ)(\varepsilon,2\delta) differential privacy guarantee for πsk(Bin(m,p))\pi_{sk}(\text{Bin}(m,p)) from Theorem 3 and noting that RUgoodR\in U_{good} in the integral. Hence Rot(πsk(Bin(m,p)))\text{Rot}(\pi_{sk}(\text{Bin}(m,p))) offers (ε,3δ)(\varepsilon,3\delta) differential-privacy.

aa follows from the argument above and bb follows from the MSE guarantee in Theorem 3 and by noting that the rotation is in UgoodU_{good}.