Distributed Mean Estimation with Limited Communication

Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan

Introduction

This basic estimation problem is used as a subroutine in several learning and optimization tasks where data is distributed across several clients. For example, in Lloyd’s algorithm for k-means clustering, if data is distributed across several clients, the server needs to compute the means of all clusters in each update step. Similarly, for PCA, if data samples are distributed across several clients, then for the power-iteration method, the server needs to average the output of all clients in each step.

Recently, algorithms involving distributed mean estimation have been used extensively in training large-scale neural networks and other statistical models . In a typical scenario of synchronized distributed learning, each client obtains a copy of a global model. The clients then update the model independently based on their local data. The updates (usually in the form of gradients) are then sent to a server, where they are averaged and used to update the global model. A critical step in all of the above algorithms is to estimate the mean of a set of vectors as in Eq. (1).

One of the main bottlenecks in distributed algorithms is the communication cost. This has spurred a line of work focusing on communication cost in learning . The communication cost can be prohibitive for modern applications, where each client can be a low-power and low-bandwidth device such as a mobile phone . Given such a wide set of applications, we study the basic problem of achieving the optimal minimax rate in distributed mean estimation with limited communication.

We note that our model and results differ from previous works on mean estimation in two ways: previous works assume that the data is generated i.i.d. according to some distribution; we do not make any distribution assumptions on data. Secondly, the objective in prior works is to estimate the mean of the underlying statistical model; our goal is to estimate the empirical mean of the data.

2 Model

Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server and they can transmit at the same time. In any independent communication protocol, each client transmits a function of $X_{i}$ (say $f(X_{i})$ ), and a central server estimates the mean by some function of $f(X_{1}),f(X_{2}),\ldots,f(X_{n})$ . Let $\pi$ be any such protocol and let $\mathcal{C}_{i}(\pi,X_{i})$ be the expected number of transmitted bits by the $i$ -th client during protocol $\pi$ , where throughout the paper, expectation is over the randomness in protocol $\pi$ .

The total number of bits transmitted by all clients with the protocol $\pi$ is

Let the estimated mean be $\hat{\bar{X}}$ . For a protocol $\pi$ , the MSE of the estimate is

We allow the use of both private and public randomness. Private randomness refers to random values that are generated by each machine separately, and public randomness refers to a sequence of random values that are shared among all partiesIn the absence of public randomness, the server can communicate a random seed that can be used by clients to emulate public randomness..

Let $\Pi(c)$ denote the set of all protocols with communication cost at most $c$ . The minimax MSE is

3 Results and discussion

We first analyze the MSE $\mathcal{E}(\pi,X^{n})$ for three algorithms, when $\mathcal{C}(\pi,X^{n})=\Theta(nd)$ , i.e., each client sends a constant number of bits per dimension.

Stochastic uniform quantization. In Section 2.1, as a warm-up we first show that a naive stochastic binary quantization algorithm (denoted by $\pi_{sb}$ ) achieves an MSE of

A natural way to decrease the error is to increase the number of levels of quantization. If we use $k$ levels of quantization, in Theorem 2, we show that the error decreases as

In order to reduce the communication cost, we propose two approaches.

Stochastic rotated quantization: We show that preprocessing the data by a random rotation reduces the mean squared error. Specifically, in Theorem 3, we show that this new scheme (denoted by $\pi_{srk}$ ) achieves an MSE of

Variable length coding: Our second approach uses the same quantization as $\pi_{sk}$ but encodes levels via variable length coding. Instead of using $\lceil\log_{2}k\rceil$ bits per dimension, we show that using variable length encoding such as arithmetic coding to compress the data reduces the communication cost significantly. In particular, in Theorem 4 we show that there is a scheme (denoted by $\pi_{svk}$ ) such that

and $\mathcal{E}(\pi_{svk},X^{n})=\mathcal{E}(\pi_{sk},X^{n})$ . Hence, setting $k=\sqrt{d}$ in Eqs. 2 and 3 yields

and with $\Theta(nd)$ bits of communication i.e., constant number of bits per dimension per client. Of the three protocols, $\pi_{svk}$ has the best MSE for a given communication cost. Note that $\pi_{svk}$ uses $k$ quantization levels but still uses $\mathcal{O}(1)$ bits per dimension per client for all $k\leq\sqrt{d}$ .

Theoretically, while variable length coding has better guarantees, stochastic rotated quantization has several practical advantages: it uses fixed length coding and hence can be combined with encryption schemes for privacy preserving secure aggregation . It can also provide lower quantization error in some scenarios due to better constants (see Section 7 for details).

Concurrent to this work, showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD. Recently, showed that $\pi_{sb}$ can be improved further by optimizing the choice of stochastic quantization boundaries. However, their results depend on the number of bits necessary to represent a float, whereas ours do not.

3.2 Minimax MSE

In the above protocols, all of the clients transmit the data. We augment these protocols with a sampling procedure, where only a random fraction of clients transmit data. We show that a combination of $k$ -level quantization, variable length coding, and sampling can be used to achieve information theoretically optimal MSE for a given communication cost. In particular, combining Corollary 1 and Theorem 5 yields our minimax result:

There exists a universal constant $t<1$ such that for communication cost $c\leq ndt$ and $n\geq 1/t$ ,

This result shows that the product of communication cost and MSE scales linearly in the number of dimensions.

The rest of the paper is organized as follows. We first analyze the stochastic uniform quantization technique in Section 2. In Section 3, we propose the stochastic rotated quantization technique, and in Section 4 we analyze arithmetic coding. In Section 5, we combine the above algorithm with a sampling technique and state the upper bound on the minimax risk, and in Section 6 we state the matching minimax lower bounds. Finally, in Section 7 we discuss some practical considerations and apply these algorithms on distributed power iteration and Lloyd’s algorithm.

Stochastic uniform quantization

For a vector $X_{i}$ , let $X^{\max}_{i}=\max_{1\leq j\leq d}X_{i}(j)$ and similarly let $X^{\min}_{i}=\min_{1\leq j\leq d}X_{i}(j)$ . In the stochastic binary quantization protocol $\pi_{sb}$ , for each client $i$ , the quantized value for each coordinate $j$ is generated independently with private randomness as

We first bound the communication cost of this protocol.

Instead of sending vectors $Y_{i}$ , clients transmit two real values $X^{\max}_{i}$ and $X^{\min}_{i}$ (to a desired precision) and a bit vector $Y^{\prime}_{i}$ such that $Y^{\prime}_{i}(j)=1$ if $Y_{i}=X^{\max}_{i}$ and otherwise. Hence each client transmits $d+2r$ bits, where $r$ is the number of bits to transmit the real value to a desired precision.

Let $N$ be the maximum norm of the underlying vectors. To bound $r$ , observe that using $r$ bits, one can represent a number between $-N$ and $N$ to an error of $N/2^{r-1}$ . Thus using $3\log_{2}(dn)+1$ bits one can represent the minimum and maximum to an additive error of $N/(nd)^{3}$ . This error in transmitting minimum and maximum of the vector does not affect our calculations and we ignore it for simplicity. We note that in practice, each dimension of $X_{i}$ is often stored as a 32 bit or 64 bit float, and $r$ should be set as either 32 or 64. In this case, using an even larger $r$ does not further reduce the error. ∎

We now compute the estimation error of this protocol.

where the last equality follows by observing that $Y_{i}-X_{i}$ , $\forall i$ , are independent zero mean random variables. The proof follows by observing that for every $i$ ,

Lemma 2 implies the following upper bound.

The proof follows by Lemma 2 observing that $\forall j$

We also show that the above bound is tight:

There exists a set of vectors $X^{n}$ such that

For every $i$ , let $X_{i}$ be defined as follows. $X_{i}(1)=1/\sqrt{2}$ , $X_{i}(2)=-1/\sqrt{2}$ , and for all $j>2$ , $X_{i}(j)=0$ . For every $i$ , $X^{\max}_{i}=\frac{1}{\sqrt{2}}$ and $X^{\min}_{i}=-\frac{1}{\sqrt{2}}$ . Substituting these bounds in the conclusion of Lemma 2 (which is an equality) yields the theorem. ∎

Therefore, the simple algorithm proposed in this section gives MSE $\Theta(d/n)$ times the average norm. Such an error is too large for real-world use. For example, in the application of neural networks , $d$ can be on the order of millions, yet $n$ can be much smaller than that. In such cases, the MSE is even larger than the norm of the vector.

2 Stochastic k𝑘k-level quantization

A natural generalization of binary quantization is $k$ -level quantization. Let $k$ be a positive integer larger than $2$ . We propose a $k$ -level stochastic quantization scheme $\pi_{sk}$ to quantize each coordinate. Recall that for a vector $X_{i}$ , $X^{\max}_{i}=\max_{1\leq j\leq d}X_{i}(j)$ and $X^{\min}_{i}=\min_{1\leq j\leq d}X_{i}(j)$ . For every integer $r$ in the range $[0,k)$ , let

where $s_{i}$ satisfies $X^{\min}_{i}+s_{i}\geq X^{\max}_{i}$ . A natural choice for $s_{i}$ would be $X^{\max}_{i}-X^{\min}_{i}$ .We will show in Section 4, however, a higher value of $s_{i}$ and variable length coding has better guarantees. The algorithm quantizes each coordinate into one of $B_{i}(r)$ s stochastically. In $\pi_{sk}$ , for the $i$ -th datapoint and $j$ -th coordinate, if $X_{i}(j)\in[B_{i}(r),B_{i}({r+1}))$ ,

As before, the communication complexity of this protocol is bounded. The proof is similar to that of Lemma 1 and hence omitted.

The mean squared loss can be bounded as follows.

If $X^{\max}_{i}-X^{\min}_{i}\leq s_{i}\leq\sqrt{2}\left\lvert\left\lvert X_{i}\right\rvert\right\rvert_{2}\,\,\forall i$ , then for any $X^{n}$ , the $\pi_{sk}$ protocol satisfies,

We conclude this section by noting that $s_{i}=X^{\max}_{i}-X^{\min}_{i}$ satisfies the conditions for the above theorem by Eq. (4).

Stochastic rotated quantization

We show that the algorithm of the previous section can be significantly improved by a new protocol. The motivation comes from the fact that the MSE of stochastic binary quantization and stochastic $k$ -level quantization is $O(\frac{d}{n}(X_{i}^{\max}-X_{i}^{\min})^{2})$ (the proof of Lemma 3 and Theorem 2 with $s_{i}=X_{i}^{\max}-X_{i}^{\min}$ ). Therefore the MSE is smaller when $X_{i}^{\max}$ and $X_{i}^{\min}$ are close. For example, when $X_{i}$ is generated uniformly on the unit sphere, with high probability, $X_{i}^{\max}-X_{i}^{\min}$ is $\mathcal{O}\left(\sqrt{\frac{\log d}{d}}\right)$ . In such case, $\mathcal{E}(\pi_{sk},X^{n})$ is $\mathcal{O}(\frac{\log d}{n})$ instead of $\mathcal{O}(\frac{d}{n})$ .

In this section, we show that even without any assumptions on the distribution of the data, we can “reduce” $X_{i}^{\max}-X_{i}^{\min}$ with a structured random rotation, yielding an $\mathcal{O}(\frac{\log d}{n})$ error. We call the method stochastic rotated quantization and denote it by $\pi_{srk}$ .

The communication cost is same as $\pi_{sk}$ and is given by Lemma 5. We now bound the MSE.

For any $X^{n}$ , $\mathcal{E}(\pi_{srk}(R),X^{n})$ is at most

where $Z_{i}=RX_{i}$ and for every $i$ , let $s_{i}=Z^{\max}_{i}-Z^{\min}_{i}$ .

where the last inequality follows Eq. (5) and the value of $s_{i}$ . $(a)$ follows from the fact that the rotation does not change the norm of the vector, and $(b)$ follows from the tower law of expectation. The lemma follows from observing that

To obtain strong bounds, we need to find an orthogonal matrix $R$ that achieves low $(Z^{\max}_{i})^{2}$ and $(Z^{\min}_{i})^{2}$ . In addition, due to the fact that $d$ can be huge in practice, we need a type of orthogonal matrix that permits fast matrix-vector products. Naive orthogonal matrices that support fast multiplication such as block-diagonal matrices often result in high values of $(Z^{\max}_{i})^{2}$ and $(Z^{\min}_{i})^{2}$ . Motivated by recent works of structured matrices , we propose to use a special type of orthogonal matrix $R=HD$ , where $D$ is a random diagonal matrix with i.i.d. Rademacher entries ( $\pm 1$ with probability $0.5$ ). $H$ is a Walsh-Hadamard matrix . The Walsh-Hadamard matrix of dimension $2^{m}$ for $m\in\mathcal{N}$ is given by the recursive formula,

Let $R=HD$ , where $D$ is a diagonal matrix with independent Radamacher random variables. For every $i$ and every sequence $X^{n}$ ,

Combining the above two lemmas yields the main result.

For any $X^{n}$ , $\pi_{srk}(HD)$ protocol satisfies,

Variable length coding

Instead of preprocessing the data via a rotation matrix as in $\pi_{srk}$ , in this section we propose to use a variable length coding strategy to minimize the number of bits.

Consider the stochastic $k$ -level quantization technique. A natural way of transmitting $Y_{i}$ is sending the bin number for each coordinate, thus the total number of bits the algorithm sends per transmitted coordinate would be $d\lceil\log_{2}k\rceil$ . This naive implementation is sub-optimal. Instead, we propose to further encode the transmitted values using universal compression schemes . We first encode $h_{r}$ , the number of times each quantized value $r$ has appeared, and then use arithmetic or Huffman coding corresponding to the distribution $p_{r}=\frac{h_{r}}{d}.$ We denote this scheme by $\pi_{svk}$ . Since we quantize vectors the same way in $\pi_{sk}$ and $\pi_{svk}$ , the MSE of $\pi_{svk}$ is also given by Theorem 2. We now bound the communication cost.

Let $s_{i}=\sqrt{2}\left\lvert\left\lvert X_{i}\right\rvert\right\rvert$ . There exists an implementation of $\pi_{svk}$ such that $\mathcal{C}(\pi_{svk},X^{n})$ is at most

Once we have compressed the $h_{r}$ ’s, we use arithmetic coding corresponding to the distribution $p_{r}=h_{r}/d$ to compress and transmit bin values for each coordinate. The total number of bits arithmetic coding uses is

Let $a=(k-1)X^{\min}_{i}$ and $b=s_{i}$ . Note that if $Y_{i}(j)$ belongs to bin $r$ , $(a+br)^{2}=(k-1)^{2}Y^{2}_{i}(j)$ and $h_{r}$ is the number of coordinates quantized into bin $r$ . Hence $\sum_{r}h_{r}(a+br)^{2}$ is the scaled norm-square of $Y_{i}$ , i.e.,

where the $\alpha(j)=Y_{i}(j)-X_{i}(j)$ . Taking expectations on both sides and using the fact that the $\alpha(j)$ are independent zero mean random variables over a range of $s_{i}/(k-1)$ , we get

We now bound $\sum_{r}\frac{h_{r}}{d}\log_{2}\frac{d}{h_{r}}$ in terms of $\sum_{r}h_{r}(a+br)^{2}$ . Let $p_{r}=h_{r}/d$ and $\beta=\sum^{k-1}_{r=0}1/((a+br)^{2}+\delta)$ . Note that

where the first inequality follows from the positivity of KL-divergence. Choosing $\delta=s^{2}_{i}$ , yields $\beta\leq 4/s^{2}_{i}$ and hence,

Combining the above set of equations, we get that the expected communication is bounded by

where $(a)$ follows from Equation (7) and $(b)$ follows from Jensen’s inequality. The last inequality follows from Equation (6). ∎

Thus if $k=\sqrt{d}+1$ , the communication complexity is $\mathcal{O}(nd)$ and the MSE is $\mathcal{O}(1/n)$ .

Communication MSE trade-off

In the above protocols, all the clients transmit and hence the communication cost scales linearly with $n$ . Instead, we show that any of the above protocols can be combined by client sampling to obtain trade-offs between the MSE and the communication cost. Note that similar analysis also holds for sampling the coordinates.

Let $\pi$ be a protocol where the mean estimate is of the form:

All three protocols we have discussed are of this form. Let $\pi_{p}$ be the protocol where each client participates independently with probability $p$ . The server estimates $\bar{X}$ by

where $Y_{i}$ s are defined in the previous section and $S$ is the set of clients that transmitted.

For any set of vectors $X^{n}$ and protocol $\pi$ of the form Equation (8), its sampled version $\pi_{p}$ satisfies

The proof of communication cost follows from Lemma 5 and the fact that in expectation, $np$ clients transmit. We now bound the MSE. Let $S$ be the set of clients that transmit. The error $\mathcal{E}(\pi_{p},X^{n})$ is

Furthermore, the second term can be bounded as

where the last equality follows from the assumption that $\pi$ ’s mean estimate is of the form (8). $(a)$ follows from the fact that $R^{-1}Y_{i}-X_{i}$ are independent zero mean random variables. ∎

Combining the above lemma with Theorems 2 and 4, and choosing $k=\sqrt{d}+1$ results in the following.

For every $c\leq nd(2+\log_{2}(7/4))$ , there exists a protocol $\pi$ such that $\mathcal{C}(\pi,S^{d})\leq c$ and

Lower bounds

The lower bound relies on the lower bounds on distributed statistical estimation due to .

There exists a set of distributions $\mathcal{P}_{d}$ supported on $\left[-\frac{1}{\sqrt{d}},\frac{1}{\sqrt{d}}\right]^{d}$ such that if any centralized server wishes to estimate the mean of the underlying unknown distribution, then for any independent protocol $\pi$

where $\mathcal{C}(\pi)$ is the communication cost of the protocol, $\theta(p_{d})$ is the mean of $p_{d}\in\mathcal{P}_{d}$ , and $t$ is a positive constant.

Let $t$ be the constant in Lemma 9. For every $c\leq ndt/4$ and $n\geq 4/t$ ,

Given $n$ samples from the underlying distribution where each sample belongs to $S^{d}$ , it is easy to see that

where $\hat{\theta}(p_{d})$ is the empirical mean of the observed samples. Let $\mathcal{P}_{d}$ be the set of distributions in Lemma 9. Hence for any protocol $\pi$ there exists a distribution $p_{d}$ such that

$(a)$ follows from the fact that $2(a-b)^{2}+2(b-c)^{2}\geq(a-c)^{2}$ . $(b)$ follows from Lemma 9 and $(c)$ follows from the fact that $\mathcal{C}(\pi,S^{d})\leq ndt/4$ and $n\geq 4/t$ . ∎

Corollary 1 and Theorem 5 yield Theorem 1. We note that the above lower bound holds only for communication cost $c<\mathcal{O}(nd)$ . Extending the results for larger values of $c$ remains an open problem.

At a first glance it may appear that combining structured random matrix and variable length encoding may improve the result asymptotically, and therefore violates the lower bound. However, this is not true. Observe that variable length coding $\pi_{svk}$ and stochastic rotated quantization $\pi_{srk}$ use different aspects of the data: the variable length coding uses the fact that bins with large values of index $r$ are less frequent. Hence, we can use fewer bits to encode frequent bins and thus improve communication. In this scheme bin-width ( $s_{i}/(k-1)$ ) is $\sqrt{2}||X_{i}||_{2}/(k-1)$ . Rotated quantization uses the fact that rotation makes the min and max closer to each other and hence we can make bins with smaller width. In such a case, all the bins become “more or less equally likely” and hence variable length coding does not help. In this scheme bin-width ( $s_{i}/(k-1)$ ) is $(Z^{\max}_{i}-Z^{\min}_{i})/(k-1)\approx||X_{i}||_{2}(\log d)/(kd)$ , which is much smaller than bin-width for variable length coding. Hence variable length coding and random rotation cannot be used simultaneously.

Practical considerations and applications

Based on the theoretical analysis, the variable-length coding method provides the lowest quantization error asymptotically when using a constant number of bits. Stochastic rotated quantization in practice may be preferred due to the (hidden) constant factors of the variable length code methods. For example, considering quantizing a single vector $ $, stochastic rotated quantization can use 1 bit per dimension and achieves zero error: the rotated vector has only two values ,$ 2 $or ,$ -2 $. We further note that the rotated quantization is preferred when applied on “unbalanced” data, due to the fact that the rotation can correct the unbalancedness. We demonstrate this by generating a dataset where the value of the last feature dimension entry is much larger than others. We generate 1000 datapoints each with 256 dimensions. The first 255 dimensions are generated i.i.d. from$ N(0,1) $, and the last dimension is generated from$ N(100,1)$. As shown in Figure 1, the rotated stochastic quantization has the best performance. The improvement is especially significant for low bit rate cases.

We demonstrate two applications in the rest of this section. The experiments are performed on the MNIST ( $d=1024$ ) and CIFAR ( $d=512$ ) datasets.

Distributed Lloyd’s algorithm. In the distributed Lloyd’s (k-means) algorithm, each client has access to a subset of data points. In each iteration, the server broadcasts the cluster centers to all the clients. Each client updates the centers based on its local data, and sends the centers back to the server. The server then updates the centers by computing the weighted average of the centers sent from all clients. In the quantized setting, the client compresses the new centers before sending to the server. This saves the uplink communication cost, which is often the bottleneck of distributed learningIn this setting, the downlink is a broadcast, and therefore its cost can be reduced by a factor of $O(n/\log n)$ without quantization, where $n$ is the number of clients.. We set both the number of centers and number of clients to 10. Figure 2 shows the result.

Distributed power iteration. Power iteration is a widely used method to compute the top eigenvector of a matrix. In the distributed setting, each client has access to a subset of data. In each iteration, the server broadcasts the current estimate of the eigenvector to all clients. Each client then updates the eigenvector based on one power iteration on its local data, and sends the updated eigenvector back to the server. The server updates the eigenvector by computing the average of the eigenvectors sent by all clients. Similar to the above distributed Lloyd’s algorithm, in the quantized setting, the client compresses the estimated eigenvector before sending to the server. Figure 3 shows the result. The dataset is distributed over $100$ clients.

For both of these applications, variable-length coding achieves the lowest quantization error in most of the settings. Furthermore, for low-bit rate, stochastic rotated quantization is competitive with variable-length coding.

Acknowledgments

We thank Jayadev Acharya, Keith Bonawitz, Dan Holtmann-Rice, Jakub Konecny, Tengyu Ma, and Xiang Wu for helpful comments and discussions.

References

Appendix A Proof of Lemma 7

The equality follows from the symmetry in $HD$ . To prove the upper bound, observe that

Let $D(j)$ be the $j^{\text{th}}$ diagonal entry of $D$ . To bound the first term observe that $Z^{\max}_{i}$ is a function of $d$ independent random variables $D(1),D(2),\ldots D(d)$ . Changing $D(j)$ changes the $Z^{\max}_{i}$ by at most $\frac{2X_{i}(j)}{\sqrt{d}}$ . Hence, applying Efron-Stein variance bound yields

To bound the second term, observe that for every $\beta>0$ ,

Note that $Z_{i}(k)=\frac{1}{\sqrt{d}}\sum^{d}_{j=1}D(j)H(k,j)X_{i}(j)$ . Since the $D(j)$ ’s are Radamacher random variables and $|H(k,j)|=1$ for all $k,j$ , the distributions of $Z_{i}(k)$ is same for all $k$ . Hence by Jensen’s inequality,

Since $Z_{i}(1)=\frac{1}{\sqrt{d}}\sum^{d}_{j=1}D(j)X_{i}(j)$ ,

where $(a)$ follows from the fact that the $D(i)$ ’s are independent and $(b)$ follows from the fact that $e^{a}+e^{-a}\leq 2e^{a^{2}/2}$ for any $a$ . Hence,