Distributed Mean Estimation with Limited Communication
Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan
Introduction
This basic estimation problem is used as a subroutine in several learning and optimization tasks where data is distributed across several clients. For example, in Lloyd’s algorithm for k-means clustering, if data is distributed across several clients, the server needs to compute the means of all clusters in each update step. Similarly, for PCA, if data samples are distributed across several clients, then for the power-iteration method, the server needs to average the output of all clients in each step.
Recently, algorithms involving distributed mean estimation have been used extensively in training large-scale neural networks and other statistical models . In a typical scenario of synchronized distributed learning, each client obtains a copy of a global model. The clients then update the model independently based on their local data. The updates (usually in the form of gradients) are then sent to a server, where they are averaged and used to update the global model. A critical step in all of the above algorithms is to estimate the mean of a set of vectors as in Eq. (1).
One of the main bottlenecks in distributed algorithms is the communication cost. This has spurred a line of work focusing on communication cost in learning . The communication cost can be prohibitive for modern applications, where each client can be a low-power and low-bandwidth device such as a mobile phone . Given such a wide set of applications, we study the basic problem of achieving the optimal minimax rate in distributed mean estimation with limited communication.
We note that our model and results differ from previous works on mean estimation in two ways: previous works assume that the data is generated i.i.d. according to some distribution; we do not make any distribution assumptions on data. Secondly, the objective in prior works is to estimate the mean of the underlying statistical model; our goal is to estimate the empirical mean of the data.
2 Model
Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server and they can transmit at the same time. In any independent communication protocol, each client transmits a function of (say ), and a central server estimates the mean by some function of . Let be any such protocol and let be the expected number of transmitted bits by the -th client during protocol , where throughout the paper, expectation is over the randomness in protocol .
The total number of bits transmitted by all clients with the protocol is
Let the estimated mean be . For a protocol , the MSE of the estimate is
We allow the use of both private and public randomness. Private randomness refers to random values that are generated by each machine separately, and public randomness refers to a sequence of random values that are shared among all partiesIn the absence of public randomness, the server can communicate a random seed that can be used by clients to emulate public randomness..
Let denote the set of all protocols with communication cost at most . The minimax MSE is
3 Results and discussion
We first analyze the MSE for three algorithms, when , i.e., each client sends a constant number of bits per dimension.
Stochastic uniform quantization. In Section 2.1, as a warm-up we first show that a naive stochastic binary quantization algorithm (denoted by ) achieves an MSE of
A natural way to decrease the error is to increase the number of levels of quantization. If we use levels of quantization, in Theorem 2, we show that the error decreases as
In order to reduce the communication cost, we propose two approaches.
Stochastic rotated quantization: We show that preprocessing the data by a random rotation reduces the mean squared error. Specifically, in Theorem 3, we show that this new scheme (denoted by ) achieves an MSE of
Variable length coding: Our second approach uses the same quantization as but encodes levels via variable length coding. Instead of using bits per dimension, we show that using variable length encoding such as arithmetic coding to compress the data reduces the communication cost significantly. In particular, in Theorem 4 we show that there is a scheme (denoted by ) such that
and . Hence, setting in Eqs. 2 and 3 yields
and with bits of communication i.e., constant number of bits per dimension per client. Of the three protocols, has the best MSE for a given communication cost. Note that uses quantization levels but still uses bits per dimension per client for all .
Theoretically, while variable length coding has better guarantees, stochastic rotated quantization has several practical advantages: it uses fixed length coding and hence can be combined with encryption schemes for privacy preserving secure aggregation . It can also provide lower quantization error in some scenarios due to better constants (see Section 7 for details).
Concurrent to this work, showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD. Recently, showed that can be improved further by optimizing the choice of stochastic quantization boundaries. However, their results depend on the number of bits necessary to represent a float, whereas ours do not.
3.2 Minimax MSE
In the above protocols, all of the clients transmit the data. We augment these protocols with a sampling procedure, where only a random fraction of clients transmit data. We show that a combination of -level quantization, variable length coding, and sampling can be used to achieve information theoretically optimal MSE for a given communication cost. In particular, combining Corollary 1 and Theorem 5 yields our minimax result:
There exists a universal constant such that for communication cost and ,
This result shows that the product of communication cost and MSE scales linearly in the number of dimensions.
The rest of the paper is organized as follows. We first analyze the stochastic uniform quantization technique in Section 2. In Section 3, we propose the stochastic rotated quantization technique, and in Section 4 we analyze arithmetic coding. In Section 5, we combine the above algorithm with a sampling technique and state the upper bound on the minimax risk, and in Section 6 we state the matching minimax lower bounds. Finally, in Section 7 we discuss some practical considerations and apply these algorithms on distributed power iteration and Lloyd’s algorithm.
Stochastic uniform quantization
For a vector , let and similarly let . In the stochastic binary quantization protocol , for each client , the quantized value for each coordinate is generated independently with private randomness as
We first bound the communication cost of this protocol.
Instead of sending vectors , clients transmit two real values and (to a desired precision) and a bit vector such that if and otherwise. Hence each client transmits bits, where is the number of bits to transmit the real value to a desired precision.
Let be the maximum norm of the underlying vectors. To bound , observe that using bits, one can represent a number between and to an error of . Thus using bits one can represent the minimum and maximum to an additive error of . This error in transmitting minimum and maximum of the vector does not affect our calculations and we ignore it for simplicity. We note that in practice, each dimension of is often stored as a 32 bit or 64 bit float, and should be set as either 32 or 64. In this case, using an even larger does not further reduce the error. ∎
We now compute the estimation error of this protocol.
where the last equality follows by observing that , , are independent zero mean random variables. The proof follows by observing that for every ,
Lemma 2 implies the following upper bound.
The proof follows by Lemma 2 observing that
We also show that the above bound is tight:
There exists a set of vectors such that
For every , let be defined as follows. , , and for all , . For every , and . Substituting these bounds in the conclusion of Lemma 2 (which is an equality) yields the theorem. ∎
Therefore, the simple algorithm proposed in this section gives MSE times the average norm. Such an error is too large for real-world use. For example, in the application of neural networks , can be on the order of millions, yet can be much smaller than that. In such cases, the MSE is even larger than the norm of the vector.
2 Stochastic k𝑘k-level quantization
A natural generalization of binary quantization is -level quantization. Let be a positive integer larger than . We propose a -level stochastic quantization scheme to quantize each coordinate. Recall that for a vector , and . For every integer in the range , let
where satisfies . A natural choice for would be .We will show in Section 4, however, a higher value of and variable length coding has better guarantees. The algorithm quantizes each coordinate into one of s stochastically. In , for the -th datapoint and -th coordinate, if ,
As before, the communication complexity of this protocol is bounded. The proof is similar to that of Lemma 1 and hence omitted.
The mean squared loss can be bounded as follows.
If , then for any , the protocol satisfies,
We conclude this section by noting that satisfies the conditions for the above theorem by Eq. (4).
Stochastic rotated quantization
We show that the algorithm of the previous section can be significantly improved by a new protocol. The motivation comes from the fact that the MSE of stochastic binary quantization and stochastic -level quantization is (the proof of Lemma 3 and Theorem 2 with ). Therefore the MSE is smaller when and are close. For example, when is generated uniformly on the unit sphere, with high probability, is . In such case, is instead of .
In this section, we show that even without any assumptions on the distribution of the data, we can “reduce” with a structured random rotation, yielding an error. We call the method stochastic rotated quantization and denote it by .
The communication cost is same as and is given by Lemma 5. We now bound the MSE.
For any , is at most
where and for every , let .
where the last inequality follows Eq. (5) and the value of . follows from the fact that the rotation does not change the norm of the vector, and follows from the tower law of expectation. The lemma follows from observing that
To obtain strong bounds, we need to find an orthogonal matrix that achieves low and . In addition, due to the fact that can be huge in practice, we need a type of orthogonal matrix that permits fast matrix-vector products. Naive orthogonal matrices that support fast multiplication such as block-diagonal matrices often result in high values of and . Motivated by recent works of structured matrices , we propose to use a special type of orthogonal matrix , where is a random diagonal matrix with i.i.d. Rademacher entries ( with probability ). is a Walsh-Hadamard matrix . The Walsh-Hadamard matrix of dimension for is given by the recursive formula,
Let , where is a diagonal matrix with independent Radamacher random variables. For every and every sequence ,
Combining the above two lemmas yields the main result.
For any , protocol satisfies,
Variable length coding
Instead of preprocessing the data via a rotation matrix as in , in this section we propose to use a variable length coding strategy to minimize the number of bits.
Consider the stochastic -level quantization technique. A natural way of transmitting is sending the bin number for each coordinate, thus the total number of bits the algorithm sends per transmitted coordinate would be . This naive implementation is sub-optimal. Instead, we propose to further encode the transmitted values using universal compression schemes . We first encode , the number of times each quantized value has appeared, and then use arithmetic or Huffman coding corresponding to the distribution We denote this scheme by . Since we quantize vectors the same way in and , the MSE of is also given by Theorem 2. We now bound the communication cost.
Let . There exists an implementation of such that is at most
Once we have compressed the ’s, we use arithmetic coding corresponding to the distribution to compress and transmit bin values for each coordinate. The total number of bits arithmetic coding uses is
Let and . Note that if belongs to bin , and is the number of coordinates quantized into bin . Hence is the scaled norm-square of , i.e.,
where the . Taking expectations on both sides and using the fact that the are independent zero mean random variables over a range of , we get
We now bound in terms of . Let and . Note that
where the first inequality follows from the positivity of KL-divergence. Choosing , yields and hence,
Combining the above set of equations, we get that the expected communication is bounded by
where follows from Equation (7) and follows from Jensen’s inequality. The last inequality follows from Equation (6). ∎
Thus if , the communication complexity is and the MSE is .
Communication MSE trade-off
In the above protocols, all the clients transmit and hence the communication cost scales linearly with . Instead, we show that any of the above protocols can be combined by client sampling to obtain trade-offs between the MSE and the communication cost. Note that similar analysis also holds for sampling the coordinates.
Let be a protocol where the mean estimate is of the form:
All three protocols we have discussed are of this form. Let be the protocol where each client participates independently with probability . The server estimates by
where s are defined in the previous section and is the set of clients that transmitted.
For any set of vectors and protocol of the form Equation (8), its sampled version satisfies
The proof of communication cost follows from Lemma 5 and the fact that in expectation, clients transmit. We now bound the MSE. Let be the set of clients that transmit. The error is
Furthermore, the second term can be bounded as
where the last equality follows from the assumption that ’s mean estimate is of the form (8). follows from the fact that are independent zero mean random variables. ∎
Combining the above lemma with Theorems 2 and 4, and choosing results in the following.
For every , there exists a protocol such that and
Lower bounds
The lower bound relies on the lower bounds on distributed statistical estimation due to .
There exists a set of distributions supported on such that if any centralized server wishes to estimate the mean of the underlying unknown distribution, then for any independent protocol
where is the communication cost of the protocol, is the mean of , and is a positive constant.
Let be the constant in Lemma 9. For every and ,
Given samples from the underlying distribution where each sample belongs to , it is easy to see that
where is the empirical mean of the observed samples. Let be the set of distributions in Lemma 9. Hence for any protocol there exists a distribution such that
follows from the fact that . follows from Lemma 9 and follows from the fact that and . ∎
Corollary 1 and Theorem 5 yield Theorem 1. We note that the above lower bound holds only for communication cost . Extending the results for larger values of remains an open problem.
At a first glance it may appear that combining structured random matrix and variable length encoding may improve the result asymptotically, and therefore violates the lower bound. However, this is not true. Observe that variable length coding and stochastic rotated quantization use different aspects of the data: the variable length coding uses the fact that bins with large values of index are less frequent. Hence, we can use fewer bits to encode frequent bins and thus improve communication. In this scheme bin-width () is . Rotated quantization uses the fact that rotation makes the min and max closer to each other and hence we can make bins with smaller width. In such a case, all the bins become “more or less equally likely” and hence variable length coding does not help. In this scheme bin-width () is , which is much smaller than bin-width for variable length coding. Hence variable length coding and random rotation cannot be used simultaneously.
Practical considerations and applications
Based on the theoretical analysis, the variable-length coding method provides the lowest quantization error asymptotically when using a constant number of bits. Stochastic rotated quantization in practice may be preferred due to the (hidden) constant factors of the variable length code methods. For example, considering quantizing a single vector $2-2N(0,1)N(100,1)$. As shown in Figure 1, the rotated stochastic quantization has the best performance. The improvement is especially significant for low bit rate cases.
We demonstrate two applications in the rest of this section. The experiments are performed on the MNIST () and CIFAR () datasets.
Distributed Lloyd’s algorithm. In the distributed Lloyd’s (k-means) algorithm, each client has access to a subset of data points. In each iteration, the server broadcasts the cluster centers to all the clients. Each client updates the centers based on its local data, and sends the centers back to the server. The server then updates the centers by computing the weighted average of the centers sent from all clients. In the quantized setting, the client compresses the new centers before sending to the server. This saves the uplink communication cost, which is often the bottleneck of distributed learningIn this setting, the downlink is a broadcast, and therefore its cost can be reduced by a factor of without quantization, where is the number of clients.. We set both the number of centers and number of clients to 10. Figure 2 shows the result.
Distributed power iteration. Power iteration is a widely used method to compute the top eigenvector of a matrix. In the distributed setting, each client has access to a subset of data. In each iteration, the server broadcasts the current estimate of the eigenvector to all clients. Each client then updates the eigenvector based on one power iteration on its local data, and sends the updated eigenvector back to the server. The server updates the eigenvector by computing the average of the eigenvectors sent by all clients. Similar to the above distributed Lloyd’s algorithm, in the quantized setting, the client compresses the estimated eigenvector before sending to the server. Figure 3 shows the result. The dataset is distributed over clients.
For both of these applications, variable-length coding achieves the lowest quantization error in most of the settings. Furthermore, for low-bit rate, stochastic rotated quantization is competitive with variable-length coding.
Acknowledgments
We thank Jayadev Acharya, Keith Bonawitz, Dan Holtmann-Rice, Jakub Konecny, Tengyu Ma, and Xiang Wu for helpful comments and discussions.
References
Appendix A Proof of Lemma 7
The equality follows from the symmetry in . To prove the upper bound, observe that
Let be the diagonal entry of . To bound the first term observe that is a function of independent random variables . Changing changes the by at most . Hence, applying Efron-Stein variance bound yields
To bound the second term, observe that for every ,
Note that . Since the ’s are Radamacher random variables and for all , the distributions of is same for all . Hence by Jensen’s inequality,
Since ,
where follows from the fact that the ’s are independent and follows from the fact that for any . Hence,