Randomized Distributed Mean Estimation: Accuracy vs Communication

Jakub Konečný, Peter Richtárik

Introduction

In particular, we consider a star network topology with a single server at the centre and $n$ nodes connected to it. All nodes send an encoded (possibly via a lossy randomized transformation) version of their vector to the server, after which the server performs a decoding operation to estimate the true mean $X \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} X_i$.

The purpose of the encoding operation is to compress the vector so as to save on communication cost, which is typically the bottleneck in practical applications.

To better illustrate the setup, consider the naive approach in which all nodes send the vectors without performing any encoding operation, followed by the application of a simple averaging decoder by the server. This results in zero estimation error at the expense of maximum communication cost of $ndr$ bits, where $r$ is the number of bits needed to communicate a single floating point entry/coordinate of $X_i$.

The distributed mean estimation problem was recently studied in a statistical framework where it is assumed that the vectors $X_i$ are independent and identically distributed samples from some specific underlying distribution. In such a setup, the goal is to estimate the true mean of the underlying distribution. These works formulate lower and upper bounds on the communication cost needed to achieve the minimax optimal estimation error.

In contrast, we do not make any statistical assumptions on the source of the vectors, and study the trade-off between expected communication costs and mean square error of the estimate. Arguably, this setup is a more robust and accurate model of the distributed mean estimation problems arising as subproblems in applications such as reduce-all operations within algorithms for distributed and federated optimization. In these applications, the averaging operations need to be done repeatedly throughout the iterations of a master learning/optimization algorithm, and the vectors $\{X_i\}$ correspond to updates to a global model/variable. The vectors evolve throughout the iterative process in a complicated pattern, typically approaching zero as the master algorithm converges to optimality. Hence, their statistical properties change, and fixed statistical assumptions are not satisfied in practice.

For instance, when training a deep neural network model in a distributed environment, the vector $X_i$ corresponds to a stochastic gradient based on a minibatch of data stored on node $i$. In this setup we do not have any useful prior statistical knowledge about the high-dimensional vectors to be aggregated. It has recently been observed that when communication cost is high, which is typically the case for commodity clusters, and even more so in a federated optimization framework, it can be very useful to sacrifice estimation accuracy in favor of reduced communication.

While the above already improves upon the state of the art, the improved results are in fact obtained for a suboptimal choice of the parameters of our method (constant probabilities $p_{ij}$, and node centers fixed to the mean $\mu_i$). One can decrease the MSE further by optimizing over the probabilities and/or node centers (see Section 6). However, apart from a very low communication cost regime in which we have a closed form expression for the optimal probabilities, the problem needs to be solved numerically, and hence we do not have expressions for how much improvement is possible. We illustrate the effect of fixed and optimal probabilities on the trade-off between communication cost and MSE experimentally on a few selected datasets in Section 6 (see Figure 1).

1.2 Outline

In Section 2 we formalize the concepts of encoding and decoding protocols. In Section 3 we describe a parametric family of randomized (and unbiased) encoding protocols and give a simple formula for the mean squared error. Subsequently, in Section 4 we formalize the notion of communication cost, and describe several communication protocols, which are optimal under different circumstances. We give simple instantiations of our protocol in Section 5, illustrating the trade-off between communication costs and accuracy. In Section 6 we address the question of the optimal choice of parameters of our protocol. Finally, in Section 7 we comment on possible extensions that we leave to future work.

Three Protocols

The objective of this work is to study the trade-off between the (expected) number of bits that need to be communicated, and the accuracy of $Y$ as an estimate of $X$.

In this work we focus on encoders which are unbiased, in the following sense.

We say that encoder $\alpha$ is unbiased if ${\bf E}_{\alpha}\left[\alpha(X_i)\right]=X_i$ for all $i=1,2,\dots,n$. We say that it is independent if $\alpha(X_i)$ is independent of $\alpha(X_j)$ for all $i\neq j$.

A trivial example of an encoding protocol is the identity function: $\alpha(X_i)=X_i$. It is both unbiased and independent. However, this encoder does not lead to any savings in communication.

We now formalize the notion of accuracy of estimating $X$ via $Y$. Since $Y$ can be random, the notion of accuracy will naturally be probabilistic.

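The mean squared error of protocol $(\alpha,\gamma)$ is the quantity

$$MSE_{\alpha,\gamma}(X_1,\dots,X_n) \overset{\text{def}}{=} {\bf E}_{\alpha,\gamma}\left[\|Y-X\|^2\right].$$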

To illustrate the above concept, we now give a few examples:

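If $\gamma$ is the averaging function, i.e., $\gamma(Y_1,\dots,Y_n)=\frac{1}{n}\sum_{i=1}^{n}Y_i$, then

$$MSE_{\alpha,\gamma}(X_1,\dots,X_n) = \frac{1}{n^2}{\bf E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(\alpha(X_i)-X_i\right)\right\|^2\right].$$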

The next example generalizes the identity encoder and averaging decoder.

and hence the MSE of $(\alpha,\gamma)$ is zero.

We shall now prove a simple result for unbiased and independent encoders used in subsequent sections.

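If the encoder $\alpha$ is unbiased and independent, and $\gamma$ is the averaging decoder, then

$$MSE_{\alpha,\gamma}(X_1,\dots,X_n) = \frac{1}{n^2}\sum_{i=1}^{n}{\bf E}_{\alpha}\left[\|Y_i-X_i\|^2\right]. \qquad (3)$$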

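Note that ${\bf E}_{\alpha}\left[Y_i\right]=X_i$ for all $i$. We have

$$MSE_{\alpha,\gamma}(X_1,\dots,X_n) = \frac{1}{n^2}{\bf E}_{\alpha}\left[\left\|\sum_{i=1}^{n}(Y_i-X_i)\right\|^2\right] \overset{(*)}{=} \frac{1}{n^2}{\bf E}_{\alpha}\left[\left\|\sum_{i=1}^{n}\left(Y_i-{\bf E}_{\alpha}[Y_i]\right)\right\|^2\right] \overset{(**)}{=} \frac{1}{n^2}\sum_{i=1}^{n}{\bf E}_{\alpha}\left[\left\|Y_i-{\bf E}_{\alpha}[Y_i]\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}{\bf E}_{\alpha}\left[\|Y_i-X_i\|^2\right],$$

where (*) follows from unbiasedness and (**) from independence. ∎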

One may wish to define the encoder as a combination of two or more separate encoders: $\alpha(X_i)=\alpha_2(\alpha_1(X_i))$. See [10] for an example where $\alpha_1$ is a random rotation and $\alpha_2$ is binary quantization.

A Family of Randomized Encoding Protocols

We define the support of $\alpha$ on node $i$ to be the set $S_i\overset{\text{def}}{=}\{j\;:\;Y_i(j)\neq\mu_i\}$. We now define two parametric families of randomized encoding protocols. The first results in $S_i$ of random size, the second has $S_i$ of a fixed size.

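With each pair $(i,j)$ we associate a parameter $0<p_{ij}\leq 1$, representing a probability. The collection of parameters $\{p_{ij},\mu_i\}$ defines an encoding protocol $\alpha$ as follows:

$$Y_i(j) = \begin{cases} \frac{X_i(j)}{p_{ij}} - \frac{1-p_{ij}}{p_{ij}}\mu_i & \text{with probability } p_{ij}, \\ \mu_i & \text{with probability } 1-p_{ij}. \end{cases} \qquad (1)$$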

Enforcing the probabilities to be positive, as opposed to nonnegative, leads to vastly simplified notation in what follows. However, it is more natural to allow $p_{ij}$ to be zero, in which case we have $Y_i(j)=\mu_i$ with probability 1. This raises issues such as potential lack of unbiasedness, which can be resolved, but only at the expense of a larger-than-reasonable notational overload.
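The encoder just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the encoder keeps coordinate $j$ with probability $p_{ij}$ and rescales it to $X_i(j)/p_{ij} - \frac{1-p_{ij}}{p_{ij}}\mu_i$, which is the rescaling that makes the output unbiased, and otherwise emits $\mu_i$. The function name `encode_variable_support` and the sample data are hypothetical.

```python
import random

def encode_variable_support(x, p, mu, rng):
    """Sketch of a randomized encoder with random-size support (one node).

    x   : list of floats, the node's vector X_i
    p   : list of probabilities p_ij, each in (0, 1]
    mu  : node center mu_i
    rng : a random.Random instance (source of randomness)
    """
    y = []
    for xj, pj in zip(x, p):
        if rng.random() < pj:
            # Kept coordinate: rescaled so that E[Y_i(j)] = X_i(j) (assumed form).
            y.append(xj / pj - (1.0 - pj) / pj * mu)
        else:
            # Dropped coordinate: replaced by the node center.
            y.append(mu)
    return y

# Hypothetical example data.
x = [1.0, -2.0, 0.5, 3.0]
mu = sum(x) / len(x)
y = encode_variable_support(x, [0.5] * len(x), mu, random.Random(0))
```

With $p_{ij}=1$ for all $j$ the sketch returns the input unchanged, which matches the identity encoder discussed earlier.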

In the rest of this section, let $\gamma$ be the averaging decoder (Example 2). Since $\gamma$ is fixed and deterministic, we shall for simplicity write ${\bf E}_{\alpha}\left[\cdot\right]$ instead of ${\bf E}_{\alpha,\gamma}\left[\cdot\right]$. Similarly, we shall write $MSE_{\alpha}(\cdot)$ instead of $MSE_{\alpha,\gamma}(\cdot)$.

We now prove two lemmas describing properties of the encoding protocol $\alpha$. Lemma 3.1 states that the protocol yields an unbiased estimate of the average $X$ and Lemma 3.2 provides the expected mean square error of the estimate.

The encoder $\alpha$ defined in (1) is unbiased. That is, ${\bf E}_{\alpha}\left[\alpha(X_i)\right]=X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: ${\bf E}_{\alpha}\left[Y\right]=X$.

Due to linearity of expectation, it is enough to show that ${\bf E}_{\alpha}\left[Y(j)\right]=X(j)$ for all $j$. Since $Y(j)=\frac{1}{n}\sum_{i=1}^{n}Y_i(j)$ and $X(j)=\frac{1}{n}\sum_{i=1}^{n}X_i(j)$, it suffices to show that ${\bf E}_{\alpha}\left[Y_i(j)\right]=X_i(j)$:

$${\bf E}_{\alpha}\left[Y_i(j)\right] = p_{ij}\left(\frac{X_i(j)}{p_{ij}} - \frac{1-p_{ij}}{p_{ij}}\mu_i\right) + (1-p_{ij})\mu_i = X_i(j). \qquad ∎$$

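Let $\alpha=\alpha(p_{ij},\mu_i)$ be the encoder defined in (1). Then

$$MSE_{\alpha}(X_1,\dots,X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\left(\frac{1}{p_{ij}}-1\right)\left(X_i(j)-\mu_i\right)^2. \qquad (2)$$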

It suffices to substitute the above into (3). ∎
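As a sanity check on the lemma, the following sketch compares a closed-form expression with an exact expectation computed by enumerating every keep/drop outcome on a tiny instance. It assumes the MSE takes the form $\frac{1}{n^2}\sum_{i,j}\left(\frac{1}{p_{ij}}-1\right)(X_i(j)-\mu_i)^2$ and that kept coordinates are rescaled as $X_i(j)/p_{ij}-\frac{1-p_{ij}}{p_{ij}}\mu_i$; the function names and data are hypothetical.

```python
from itertools import product

def mse_formula(X, P, mu):
    """Assumed closed form: (1/n^2) * sum_{i,j} (1/p_ij - 1) * (X_i(j) - mu_i)^2."""
    n = len(X)
    total = 0.0
    for xi, pi, mi in zip(X, P, mu):
        for xij, pij in zip(xi, pi):
            total += (1.0 / pij - 1.0) * (xij - mi) ** 2
    return total / n ** 2

def mse_exact(X, P, mu):
    """E||Y - X||^2 with the averaging decoder, by enumerating all 2^(n*d) outcomes."""
    n, d = len(X), len(X[0])
    true_mean = [sum(X[i][j] for i in range(n)) / n for j in range(d)]
    cells = [(i, j) for i in range(n) for j in range(d)]
    expected = 0.0
    for keep in product([True, False], repeat=len(cells)):
        prob = 1.0
        Y = [[0.0] * d for _ in range(n)]
        for (i, j), kept in zip(cells, keep):
            p = P[i][j]
            prob *= p if kept else 1.0 - p
            Y[i][j] = X[i][j] / p - (1.0 - p) / p * mu[i] if kept else mu[i]
        estimate = [sum(Y[i][j] for i in range(n)) / n for j in range(d)]
        expected += prob * sum((e - m) ** 2 for e, m in zip(estimate, true_mean))
    return expected

# Hypothetical tiny instance: n = 2 nodes, d = 2 coordinates.
X = [[1.0, 4.0], [-2.0, 0.5]]
P = [[0.5, 0.25], [0.8, 0.4]]
mu = [2.5, -0.75]
```

The enumeration is only feasible for very small $n$ and $d$; it is meant as a check of the algebra, not as a practical tool.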

3.2 Encoding Protocol with Fixed-size Support

Here we propose an alternative encoding protocol, one with deterministic support size. As we shall see later, this results in deterministic communication cost.

Let $\sigma_k(d)$ denote the set of all subsets of $\{1,2,\dots,d\}$ containing $k$ elements. The protocol $\alpha$ with a single integer parameter $k$ then works as follows: First, each node $i$ samples $\mathcal{D}_i\in\sigma_k(d)$ uniformly at random, and then sets

$$Y_i(j) = \begin{cases} \frac{d\,X_i(j)}{k} - \frac{d-k}{k}\mu_i & \text{if } j\in\mathcal{D}_i, \\ \mu_i & \text{otherwise.} \end{cases} \qquad (4)$$

Note that, by design, the size of the support of $Y_i$ is always $k$, i.e., $|S_i|=k$. Naturally, we can expect this protocol to perform practically the same as the protocol (1) with $p_{ij}=k/d$, for all $i,j$. Lemma 3.4 indeed suggests this is the case. While this protocol admits a more efficient communication protocol (as we shall see in Section 4), protocol (1) enjoys a larger parameter space, ultimately leading to better MSE. We comment on this tradeoff in subsequent sections.
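A minimal sketch of the fixed-size-support encoder follows. It assumes the kept coordinates are rescaled as $\frac{d}{k}X_i(j)-\frac{d-k}{k}\mu_i$, i.e. the analogue of protocol (1) with $p_{ij}=k/d$; the helper `expected_encoding` averages over all $k$-subsets to check unbiasedness exactly on small inputs. Both function names are hypothetical.

```python
import random
from itertools import combinations

def encode_fixed_support(x, k, mu, rng):
    """Keep a uniformly random k-subset of coordinates (rescaled), set the rest to mu."""
    d = len(x)
    chosen = set(rng.sample(range(d), k))  # uniform over all k-subsets
    return [d * x[j] / k - (d - k) / k * mu if j in chosen else mu
            for j in range(d)]

def expected_encoding(x, k, mu):
    """Exact E[Y_i], obtained by averaging the encoding over every k-subset."""
    d = len(x)
    subsets = list(combinations(range(d), k))
    acc = [0.0] * d
    for subset in subsets:
        for j in range(d):
            acc[j] += d * x[j] / k - (d - k) / k * mu if j in subset else mu
    return [a / len(subsets) for a in acc]
```

Because each coordinate lands in the sampled subset with probability $k/d$, the averaged encoding equals the input vector, while the number of non-center entries is always $k$ (provided no coordinate coincides with $\mu_i$).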

As for the data-dependent protocol, we prove basic properties. The proofs are similar to those of Lemmas 3.1 and 3.2 and we defer them to Appendix A.

The encoder $\alpha$ defined in (4) is unbiased. That is, ${\bf E}_{\alpha}\left[\alpha(X_i)\right]=X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: ${\bf E}_{\alpha}\left[Y\right]=X$.

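Let $\alpha=\alpha(k)$ be the encoder defined in (4). Then

$$MSE_{\alpha}(X_1,\dots,X_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\frac{d-k}{k}\left(X_i(j)-\mu_i\right)^2.$$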

Communication Protocols

Having defined the encoding protocols $\alpha$, we need to specify the way the encoded vectors $Y_i=\alpha(X_i)$, for $i=1,2,\dots,n$, are communicated to the server. Given a specific communication protocol $\beta$, we write $\beta(Y_i)$ to denote the (expected) number of bits that are communicated by node $i$ to the server. Since $Y_i=\alpha(X_i)$ is in general not deterministic, $\beta(Y_i)$ can be a random variable.

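The communication cost of communication protocol $\beta$ under randomized encoding $\alpha$ is the total expected number of bits transmitted to the server:

$$C_{\alpha,\beta}(X_1,\dots,X_n) \overset{\text{def}}{=} {\bf E}_{\alpha}\left[\sum_{i=1}^{n}\beta(\alpha(X_i))\right].$$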

Given $Y_i$, a good communication protocol is able to encode $Y_i=\alpha(X_i)$ using only a few bits. Let $r$ denote the number of bits used to represent a floating point number. Let $\bar{r}$ be the number of bits representing $\mu_i$.

In the rest of this section we describe several communication protocols $\beta$ and calculate their communication cost.

Represent $Y_i=\alpha(X_i)$ as $d$ floating point numbers. Then for all encoding protocols $\alpha$ and all $i$ we have $\beta(\alpha(X_i))=dr$, whence $C_{\alpha,\beta}=ndr$.

4.2 Varying-length

We use a single variable-size field for every element of the vector $Y_i$. The first bit indicates whether the value equals $\mu_i$ or not. If it does, the field ends; if not, the next $r$ bits represent the value of $Y_i(j)$. In addition, we need to communicate $\mu_i$, which takes $\bar{r}$ bits. (The distinction between $r$ and $\bar{r}$ is made because $\mu_i$ can be chosen to be data independent, such as $0$, in which case we do not have to communicate anything, i.e., $\bar{r}=0$.) We thus have

$$\beta(\alpha(X_i)) = \bar{r} + \sum_{j=1}^{d}\left(1 + r\cdot 1_{(Y_i(j)\neq\mu_i)}\right),$$

where $1_e$ is the indicator function of event $e$. The expected number of bits communicated is given by

In the special case when $p_{ij}=p>0$ for all $i,j$, we get

4.3 Sparse Communication Protocol for Encoder (1)

We can represent $Y_i$ as a sparse vector; that is, a list of pairs $(j,Y_i(j))$ for which $Y_i(j)\neq\mu_i$. The number of bits to represent each pair is $\lceil\log(d)\rceil+r$. Any index not found in the list will be interpreted by the server as having value $\mu_i$. Additionally, we have to communicate the value of $\mu_i$ to the server, which takes $\bar{r}$ bits. We assume that the value $d$, the size of the vectors, is known to the server. Hence,

$$\beta(\alpha(X_i)) = \bar{r} + \sum_{j=1}^{d} 1_{(Y_i(j)\neq\mu_i)}\left(\lceil\log d\rceil + r\right).$$

Summing over $i$ and taking expectations, the communication cost is given by

In the special case when $p_{ij}=p>0$ for all $i,j$, we get

A practical improvement upon this could be to (without loss of generality) assume that the pairs $(j,Y_i(j))$ are ordered by $j$, i.e., we have $\{(j_s,Y_i(j_s))\}_{s=1}^{k}$ for some $k$ and $j_1<j_2<\dots<j_k$. Further, let us denote $j_0=0$. We can then use a variant of variable-length quantity to represent the set $\{(j_s-j_{s-1},Y_i(j_s))\}_{s=1}^{k}$. With careful design one can hope to reduce the $\log(d)$ factor in the average case. Nevertheless, this does not improve the worst case analysis we focus on in this paper, and hence we do not delve deeper into this.
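One possible realization of the gap encoding mentioned above is sketched below. It is only an illustration of the idea: the paper does not specify a concrete format, so the choice of a LEB128-style variable-length integer (7 payload bits per byte, high bit as continuation flag) and the function names are assumptions. Only the index gaps $j_s-j_{s-1}$ are encoded here; the values $Y_i(j_s)$ would be sent alongside.

```python
def encode_varint(value):
    """Encode a non-negative integer using 7 payload bits per byte (LEB128-style)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte of this integer
            return bytes(out)

def encode_indices(indices):
    """Delta-encode sorted 1-based indices j_1 < j_2 < ... using j_0 = 0."""
    out = bytearray()
    prev = 0
    for j in indices:
        out += encode_varint(j - prev)
        prev = j
    return bytes(out)

def decode_indices(data):
    """Invert encode_indices: rebuild the sorted index list from the gap stream."""
    indices, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            indices.append(prev)
            value, shift = 0, 0
    return indices
```

Small gaps cost a single byte regardless of $d$, which is where the average-case saving over a fixed $\lceil\log d\rceil$ bits per index would come from; in the worst case the encoding is no shorter.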

4.4 Sparse Communication Protocol for Encoder (4)

We now describe a sparse communication protocol designed for the fixed-size-support encoder defined in (4). Note that subset selection can be compressed in the form of a random seed, letting us avoid the $\log(d)$ factor in (8). This protocol applies to the encoder defined in (4), and also to (1) with uniform probabilities $p_{ij}$.

In particular, we can represent $Y_i$ as a sparse vector containing the list of the values for which $Y_i(j)\neq\mu_i$, ordered by $j$. Additionally, we need to communicate the value $\mu_i$ (using $\bar{r}$ bits) and a random seed (using $\bar{r}_s$ bits), which can be used to reconstruct the indices $j$ corresponding to the communicated values. Note that for any fixed $k$ defining protocol (4), we have $|S_i|=k$. Hence, communication cost is deterministic:

$$C_{\alpha,\beta} = \sum_{i=1}^{n}\left(\bar{r}+\bar{r}_s+kr\right) = n(\bar{r}+\bar{r}_s)+nkr.$$

In the case of the variable-size-support encoding protocol (1) with $p_{ij}=p>0$ for all $i,j$, the sparse communication protocol described here yields expected communication cost

$$C_{\alpha,\beta} = n(\bar{r}_s+\bar{r})+ndpr. \qquad (10)$$
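The seed-based idea can be sketched as follows. This is an illustrative assumption about how such a scheme could be realized, not the authors' code: node and server are assumed to share the same pseudo-random generator (here Python's `random.Random`), so that the server can regenerate the sampled index set from the seed. The rescaling $\frac{d}{k}X_i(j)-\frac{d-k}{k}\mu_i$ for kept coordinates and all function names are likewise assumptions.

```python
import random

def node_encode(x, k, mu, seed):
    """Node side: sample k indices from the seed and send only seed, mu and k values."""
    d = len(x)
    chosen = sorted(random.Random(seed).sample(range(d), k))
    values = [d * x[j] / k - (d - k) / k * mu for j in chosen]  # ordered by j
    return seed, mu, values  # the indices themselves are not transmitted

def server_decode(message, d):
    """Server side: regenerate the index set from the seed and rebuild the dense Y_i."""
    seed, mu, values = message
    chosen = sorted(random.Random(seed).sample(range(d), len(values)))
    y = [mu] * d
    for j, v in zip(chosen, values):
        y[j] = v
    return y

def cost_bits(n, k, r, r_bar, r_seed):
    """Deterministic total cost: each node sends mu, the seed and k floats."""
    return n * (r_bar + r_seed + k * r)
```

For example, `cost_bits(16, 32, 16, 16, 32)` evaluates to 8960 bits for 16 nodes each sending 32 sixteen-bit values. In a real deployment the two sides would need a generator whose output is guaranteed identical across machines and library versions.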

4.5 Binary

If the elements of $Y_i$ take only two different values, $Y_i^{min}$ or $Y_i^{max}$, we can use a binary communication protocol. That is, for each node $i$, we communicate the values of $Y_i^{min}$ and $Y_i^{max}$ (using $2r$ bits), followed by a single bit per element of the array indicating whether $Y_i^{max}$ or $Y_i^{min}$ should be used. The resulting (deterministic) communication cost is

$$C_{\alpha,\beta} = \sum_{i=1}^{n}(2r+d) = n(2r+d). \qquad (11)$$

4.6 Discussion

In the above, we have presented several communication protocols of different complexity. However, it is not possible to claim that any one of them is the most efficient. Which communication protocol is best depends on the specifics of the encoding protocol used. Consider the extreme case of encoding protocol (1) with $p_{ij}=1$ for all $i,j$. The naive communication protocol is clearly the most efficient, as all other protocols need to send some additional information.

However, in the interesting case when we consider a small communication budget, the sparse communication protocols are the most efficient. Therefore, in the following sections, we focus primarily on optimizing the performance using these protocols.
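The comparison above can be made concrete with a small calculator. The sketch assumes uniform probabilities $p_{ij}=p$, that every coordinate differs from $\mu_i$ (so a coordinate is transmitted with probability exactly $p$), and that $\lceil\log d\rceil$ is a base-2 logarithm; the default bit widths and the function name are hypothetical.

```python
import math

def expected_costs(n, d, p, r=32, r_bar=32, r_seed=32):
    """Expected total bits for several communication protocols under uniform p."""
    index_bits = math.ceil(math.log2(d))
    return {
        # every coordinate sent as a float
        "naive": n * d * r,
        # one flag bit per coordinate, plus r bits when the value is not mu
        "varying_length": n * r_bar + n * d * (1 + p * r),
        # (index, value) pairs for the coordinates that are not mu
        "sparse": n * r_bar + n * d * p * (index_bits + r),
        # values only; indices are recovered from a shared random seed
        "sparse_seed": n * (r_bar + r_seed) + n * d * p * r,
    }

dense = expected_costs(n=16, d=512, p=1.0)
sparse = expected_costs(n=16, d=512, p=0.01)
```

Under these assumptions `min(dense, key=dense.get)` selects the naive protocol, while for the small-budget setting `min(sparse, key=sparse.get)` selects the seed-based sparse protocol, in line with the discussion.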

Examples

In this section, we highlight several instantiations of our protocols, recovering existing techniques and formulating novel ones. We comment on the resulting trade-offs between communication cost and estimation error.

We start by recovering an existing method, which turns every element of the vectors $X_i$ into a particular binary representation.

If we set the parameters of protocol (1) as $\mu_i=X_i^{min}$ and $p_{ij}=\frac{X_i(j)-X_i^{min}}{\Delta_i}$, where $\Delta_i\overset{\text{def}}{=}X_i^{max}-X_i^{min}$ (assume, for simplicity, that $\Delta_i\neq 0$), we exactly recover the quantization algorithm proposed in [10]:

$$Y_i(j) = \begin{cases} X_i^{max} & \text{with probability } \frac{X_i(j)-X_i^{min}}{\Delta_i}, \\ X_i^{min} & \text{with probability } \frac{X_i^{max}-X_i(j)}{\Delta_i}. \end{cases}$$

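Using the formula (2) for the encoding protocol $\alpha$, we get

$$MSE_{\alpha} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\left(X_i^{max}-X_i(j)\right)\left(X_i(j)-X_i^{min}\right) \leq \frac{d}{2n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2.$$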

This exactly recovers the MSE bound established in [10, Theorem 1]. Using the binary communication protocol yields a communication cost of $1$ bit per element of $X_i$, plus two real-valued scalars (11).
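The binary quantization instance can be sketched directly. The code below is an illustration under the parameter choice stated above ($\mu_i=X_i^{min}$, $p_{ij}=(X_i(j)-X_i^{min})/\Delta_i$): with that choice a kept coordinate is mapped to $X_i^{max}$ and a dropped one to $X_i^{min}$. Coordinates equal to the minimum have probability zero and are always rounded down, and a constant vector is returned unchanged; the function names are hypothetical.

```python
import random

def quantize_binary(x, rng):
    """Stochastically round each coordinate of x to min(x) or max(x), without bias."""
    lo, hi = min(x), max(x)
    delta = hi - lo
    if delta == 0:
        return list(x)  # constant vector: nothing to quantize
    y = []
    for xj in x:
        p_up = (xj - lo) / delta  # probability of rounding up to hi
        y.append(hi if rng.random() < p_up else lo)
    return y

def quantization_mse(X):
    """(1/n^2) * sum_{i,j} (X_i^max - X_i(j)) * (X_i(j) - X_i^min)."""
    n = len(X)
    return sum((max(x) - xj) * (xj - min(x)) for x in X for xj in x) / n ** 2
```

For a single node holding `[0.0, 1.0, 0.5]`, only the middle value contributes to the error, giving `quantization_mse([[0.0, 1.0, 0.5]])` equal to 0.25.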

If we use the above protocol jointly with randomized linear encoder and decoder (see Example 3), where the linear transform is the randomized Hadamard transform, we recover the method described in [10, Section 3], which yields improved $MSE_{\alpha}=\frac{2\log d+2}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2$ and can be implemented in $\mathcal{O}(d\log d)$ time.

5.2 Sparse Communication Protocols

Now we move to comparing the communication costs and estimation error of various instantiations of the encoding protocols, utilizing the deterministic sparse communication protocol and uniform probabilities.

For the remainder of this section, let us only consider instantiations of our protocol where $p_{ij}=p>0$ for all $i,j$, and assume that the node centers are set to the vector averages, i.e., $\mu_i=\frac{1}{d}\sum_{j=1}^{d}X_i(j)$. Denote $R=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_i(j)-\mu_i)^2$. For simplicity, we also assume that $|S|=nd$, which is what we can in general expect without any prior knowledge about the vectors $X_i$.

The properties of the following examples follow from Equations (2) and (10). When considering the communication costs of the protocols, keep in mind that the trivial benchmark is $C_{\alpha,\beta}=ndr$, which is achieved by simply sending the vectors unmodified. Communication cost of $C_{\alpha,\beta}=nd$ corresponds to the interesting special case when we use (on average) one bit per element of each $X_i$.

In this case, the encoding protocol is lossless, which ensures $MSE=0$. Note that in this case, we could get rid of the $n(\bar{r}_s+\bar{r})$ term by using the naive communication protocol.

This protocol order-wise matches the $MSE$ of the method in Remark 3. However, as long as $d>2^r$, this protocol attains this error with smaller communication cost. In particular, this is in expectation less than a single bit per element of $X_i$. Finally, note that the factor $R$ is always smaller than or equal to the factor $\frac{1}{n}\sum_{i=1}^{n}\|X_i\|^2$ appearing in Remark 3.

This protocol communicates in expectation a single bit per element of $X_i$ (plus additional $\bar{r}_s+\bar{r}$ bits per client), while attaining a bound on $MSE$ of $\mathcal{O}(r/n)$. To the best of our knowledge, this is the first method to attain this bound without additional assumptions.

If we choose $p=\frac{d-\bar{r}_s-\bar{r}}{dr}$, we get

This alternative protocol attains in expectation exactly a single bit per element of $X_i$, with (a slightly more complicated) $\mathcal{O}(r/n)$ bound on $MSE$.

This protocol attains the MSE of the protocol in Example 4 while at the same time communicating on average significantly less than a single bit per element of $X_i$.

Using the deterministic sparse protocol, there is an obvious lower bound on the communication cost: $n(\bar{r}_s+\bar{r})$. We can bypass this threshold by using the sparse protocol, with a data-independent choice of $\mu_i$, such as $0$, setting $\bar{r}=0$. By setting $p=\epsilon/\left(d(\lceil\log d\rceil+r)\right)$, we get arbitrarily small expected communication cost of $C_{\alpha,\beta}=\epsilon$, at the cost of exploding estimation error $MSE_{\alpha,\gamma}=\mathcal{O}(1/(\epsilon n))$.

Note that all of the above examples have random communication costs. What we present is the expected communication cost of the protocols. All the above examples can be modified to use the encoding protocol with fixed-size support defined in (4) with the parameter $k$ set to the value of $pd$ for the corresponding $p$ used above, to get the same results. The only practical difference is that the communication cost will be deterministic for each node, which can be useful for certain applications.
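The trade-off in these examples can be tabulated with a short helper. The sketch assumes uniform probabilities $p$, node centers equal to the vector averages, the seed-based sparse protocol with expected cost $n(\bar{r}_s+\bar{r})+ndpr$, and the uniform-$p$ specialization of the MSE formula, $MSE=\left(\frac{1}{p}-1\right)\frac{R}{n}$. The function name and the default bit widths are hypothetical.

```python
def uniform_tradeoff(n, d, p, R, r=16, r_bar=16, r_seed=32):
    """Return (expected communication cost in bits, MSE) for uniform probability p."""
    cost = n * (r_seed + r_bar) + n * d * p * r
    mse = (1.0 / p - 1.0) * R / n
    return cost, mse

# Choosing p = (d - r_seed - r_bar) / (d * r) spends exactly one bit per element.
n, d, r, r_bar, r_seed = 16, 512, 16, 16, 32
p_one_bit = (d - r_seed - r_bar) / (d * r)
cost, mse = uniform_tradeoff(n, d, p_one_bit, R=1.0, r=r, r_bar=r_bar, r_seed=r_seed)
bits_per_element = cost / (n * d)
```

With these numbers `bits_per_element` comes out as 1, and setting `p = 1.0` gives zero MSE at a cost slightly above the naive $ndr$ because of the per-node overhead.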

Optimal Encoders

Here we consider $(\alpha,\beta,\gamma)$, where $\alpha=\alpha(p_{ij},\mu_i)$ is the encoder defined in (1), $\beta$ is the associated sparse communication protocol, and $\gamma$ is the averaging decoder. Recall from Lemma 3.2 and (8) that the mean square error and communication cost are given by:

Having these closed-form formulae as functions of the parameters $\{p_{ij},\mu_i\}$, we can now ask questions such as:

Given a communication budget, which encoding protocol has the smallest mean squared error?

Given a bound on the mean squared error, which encoder suffers the minimal communication cost?

Let us now address the first question; the second question can be handled in a similar fashion. In particular, consider the optimization problem

where $B>0$ represents a bound on the part of the total communication cost in (13) which depends on the choice of the probabilities $p_{ij}$.

Note that while the constraints in (15) are convex (they are linear), the objective is not jointly convex in $\{p_{ij},\mu_i\}$. However, the objective is convex in $\{p_{ij}\}$ and convex in $\{\mu_i\}$. This suggests a simple alternating minimization heuristic for solving the above problem:

Fix the probabilities and optimize over the node centers,

Fix the node centers and optimize over probabilities.

These two steps are repeated until a suitable convergence criterion is reached. Note that the first step has a closed form solution. Indeed, the problem decomposes across the node centers to $n$ univariate unconstrained convex quadratic minimization problems, and the solution is given by

The second step does not have a closed form solution in general; we provide an analysis of this step in Section 6.1.
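The first step of the alternating scheme can be sketched as follows. Assuming the objective is $\frac{1}{n^2}\sum_{i,j}\left(\frac{1}{p_{ij}}-1\right)(X_i(j)-\mu_i)^2$, setting its derivative with respect to $\mu_i$ to zero gives a weighted average of the coordinates of $X_i$ with weights $w_{ij}=\frac{1}{p_{ij}}-1$. This derivation and the function names are assumptions made for illustration; the second step (optimizing the probabilities) is not covered here.

```python
def objective(X, P, mu):
    """Assumed MSE objective: (1/n^2) * sum_{i,j} (1/p_ij - 1) * (X_i(j) - mu_i)^2."""
    n = len(X)
    return sum((1.0 / p - 1.0) * (x - m) ** 2
               for xi, pi, m in zip(X, P, mu)
               for x, p in zip(xi, pi)) / n ** 2

def optimal_centers(X, P):
    """Step 1: for fixed probabilities, minimize the objective over each node center."""
    centers = []
    for xi, pi in zip(X, P):
        w = [1.0 / p - 1.0 for p in pi]
        total = sum(w)
        if total == 0.0:
            # All p_ij = 1: the objective does not depend on mu_i; use the plain mean.
            centers.append(sum(xi) / len(xi))
        else:
            centers.append(sum(wj * xj for wj, xj in zip(w, xi)) / total)
    return centers
```

With uniform probabilities all weights coincide, so the optimal center reduces to the plain coordinate average used in the earlier examples.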

Note that the upper bound $\sum_{i,j}(X_i(j)-\mu_i)^2/p_{ij}$ on the objective is jointly convex in $\{p_{ij},\mu_i\}$. We may therefore instead optimize this upper bound by a suitable convex optimization algorithm.

An alternative and more practical model to (15) is to choose per-node budgets $B_1,\dots,B_n$ and require $\sum_j p_{ij}\leq B_i$ for all $i$. The problem becomes separable across the nodes, and can therefore be solved by each node independently. If we set $B=\sum_i B_i$, the optimal solution obtained this way will lead to MSE which is lower bounded by the MSE obtained through (15).

Let the node centers $\mu_i$ be fixed. Problem (15) (or, equivalently, step 2 of the alternating minimization method described above) then takes the form

Let $S=\{(i,j)\;:\;X_i(j)\neq\mu_i\}$. Notice that as long as $B\geq|S|$, the optimal solution is to set $p_{ij}=1$ for all $(i,j)\in S$ and $p_{ij}=0$ for all $(i,j)\notin S$. (We interpret $0/0$ as $0$ and do not worry about infeasibility. These issues can be properly formalized by allowing $p_{ij}$ to be zero in the encoding protocol and in (6.1). However, handling this singular situation requires a notational overload which we are not willing to pay.) In such a case, we have $MSE_{\alpha,\gamma}=0$. Hence, we can without loss of generality assume that $B\leq|S|$.

While we are not able to derive a closed-form solution to this problem, we can formulate upper and lower bounds on the optimal estimation error, given a bound on the communication cost formulated via BB.

Consider problem (6.1) and fix any $B\leq|S|$. Using the sparse communication protocol $\beta$, the optimal encoding protocol $\alpha$ has communication complexity

and the mean squared error satisfies the bounds

where $R=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_i(j)-\mu_i)^2=\frac{1}{n}\sum_{i=1}^{n}\|X_i-\mu_i 1\|^2$. Let $a_{ij}=|X_i(j)-\mu_i|$ and $W=\sum_{i,j}a_{ij}$. If, moreover, $B\leq\sum_{(i,j)\in S}a_{ij}/\max_{(i,j)\in S}a_{ij}$ (which is true, for instance, in the ultra-low communication regime with $B\leq 1$), then

Setting $p_{ij}=B/|S|$ for all $(i,j)\in S$ leads to a feasible solution of (6.1). In view of (13), one then has

where $R=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d}(X_i(j)-\mu_i)^2=\frac{1}{n}\sum_{i=1}^{n}\|X_i-\mu_i 1\|^2$.

If we relax the problem by removing the constraints $p_{ij}\leq 1$, the optimal solution satisfies $a_{ij}/p_{ij}=\theta>0$ for all $(i,j)\in S$. At optimality the bound involving $B$ must be tight, which leads to $\sum_{(i,j)\in S}a_{ij}/\theta=B$, whence $\theta=\tfrac{1}{B}\sum_{(i,j)\in S}a_{ij}$. So, $p_{ij}=a_{ij}B/\sum_{(i,j)\in S}a_{ij}$. The optimal MSE therefore satisfies the lower bound

where $W\overset{\text{def}}{=}\sum_{(i,j)\in S}a_{ij}\geq\left(\sum_{(i,j)\in S}a_{ij}^2\right)^{1/2}=(nR)^{1/2}$. Therefore, $MSE_{\alpha,\gamma}\geq\left(\frac{1}{B}-1\right)\frac{R}{n}$. If $B\leq\sum_{(i,j)\in S}a_{ij}/\max_{(i,j)\in S}a_{ij}$, then $p_{ij}\leq 1$ for all $(i,j)\in S$, and hence we have optimality. (Also note that, by the Cauchy-Schwarz inequality, $W^2\leq nR|S|$.) ∎
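The low-budget regime of the proof can be illustrated numerically. The sketch below assumes every coordinate differs from its node center (so $S$ contains all pairs), uses the proportional rule $p_{ij}=a_{ij}B/W$ from the proof, and compares it against uniform probabilities $p_{ij}=B/|S|$ under the assumed MSE formula $\frac{1}{n^2}\sum_{i,j}\left(\frac{1}{p_{ij}}-1\right)a_{ij}^2$. Function names and data are hypothetical.

```python
def mse(X, P, mu):
    """Assumed MSE formula: (1/n^2) * sum_{i,j} (1/p_ij - 1) * (X_i(j) - mu_i)^2."""
    n = len(X)
    return sum((1.0 / p - 1.0) * (x - m) ** 2
               for xi, pi, m in zip(X, P, mu)
               for x, p in zip(xi, pi)) / n ** 2

def optimal_probabilities(X, mu, B):
    """Proportional rule p_ij = a_ij * B / W, valid only while B <= W / max a_ij."""
    a = [[abs(x - m) for x in xi] for xi, m in zip(X, mu)]
    if min(min(ai) for ai in a) == 0.0:
        raise ValueError("sketch assumes X_i(j) != mu_i for all (i, j)")
    W = sum(sum(ai) for ai in a)
    a_max = max(max(ai) for ai in a)
    if B > W / a_max:
        raise ValueError("closed form requires B <= W / max a_ij")
    return [[aij * B / W for aij in ai] for ai in a]

def uniform_probabilities(X, B):
    """Uniform rule p_ij = B / (n * d), i.e. B / |S| when S contains all pairs."""
    n, d = len(X), len(X[0])
    return [[B / (n * d)] * d for _ in range(n)]

# Hypothetical instance with node centers set to the coordinate averages.
X = [[1.0, 4.0, -1.0], [2.0, 0.0, -3.0]]
mu = [sum(x) / len(x) for x in X]
B = 2.0
P_opt = optimal_probabilities(X, mu, B)
P_uni = uniform_probabilities(X, B)
```

On this instance the proportional rule gives a smaller MSE than the uniform rule for the same budget, and its value agrees with $\frac{W^2}{n^2B}-\frac{R}{n}$, which follows from substituting $p_{ij}=a_{ij}B/W$ into the assumed formula.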

6.2 Trade-off Curves

To illustrate the trade-offs between communication cost and estimation error (MSE) achievable by the protocols discussed in this section, we present simple numerical examples in Figure 1, on three synthetic data sets with $n=16$ and $d=512$. We choose an array of values for $B$, directly bounding the communication cost via (18), and evaluate the $MSE$ (2) for three encoding protocols (we use the sparse communication protocol and averaging decoder). All these protocols have the same communication cost, and only differ in the selection of the parameters $p_{ij}$ and $\mu_i$. In particular, we consider

uniform probabilities $p_{ij}=p>0$ with average node centers $\mu_i=\frac{1}{d}\sum_{j=1}^{d}X_i(j)$ (blue dashed line),

optimal probabilities $p_{ij}$ with average node centers $\mu_i=\frac{1}{d}\sum_{j=1}^{d}X_i(j)$ (green dotted line), and

optimal probabilities with optimal node centers, obtained via the alternating minimization approach described above (red solid line).

In order to put a scale on the horizontal axis, we assumed that $r=16$. Note that, in practice, one would choose $r$ to be as small as possible without adversely affecting the application utilizing our distributed mean estimation method. The three plots represent $X_i$ with entries drawn in an i.i.d. fashion from Gaussian ($\mathcal{N}(0,1)$), Laplace ($\mathcal{L}(0,1)$) and chi-squared ($\chi^2(2)$) distributions, respectively. As we can see, in the case of non-symmetric distributions, it is not necessarily optimal to set the node centers to averages.

As expected, for fixed node centers, optimizing over probabilities results in improved performance, across the entire trade-off curve. That is, the curve shifts downwards. In the first two plots based on data from symmetric distributions (Gaussian and Laplace), the average node centers are nearly optimal, which explains why the red solid and green dotted lines coalesce. This can be also established formally. In the third plot, based on the non-symmetric chi-squared data, optimizing over node centers leads to further improvement, which gets more pronounced with increased communication budget. It is possible to generate data where the difference between any pair of the three trade-off curves becomes arbitrarily large.

Finally, the black cross represents performance of the quantization protocol from Example 4. This approach appears as a single point in the trade-off space due to lack of any parameters to be fine-tuned.

Further Considerations

In this section we outline further ideas worth consideration. However, we leave a detailed analysis to future work.

We can generalize the binary encoding protocol (1) to a $k$-ary protocol. To illustrate the concept without unnecessary notation overload, we present only the ternary (i.e., $k=3$) case.

Let the collection of parameters $\{p'_{ij},p''_{ij},\bar{X}'_i,\bar{X}''_i\}$ define an encoding protocol $\alpha$ as follows:

It is straightforward to generalize Lemmas 3.1 and 3.2 to this case. We omit the proofs for brevity.

The encoder $\alpha$ defined in (21) is unbiased. That is, ${\bf E}_{\alpha}\left[\alpha(X_i)\right]=X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: ${\bf E}_{\alpha}\left[Y\right]=X$.

Let $\alpha=\alpha\left(p'_{ij},p''_{ij},\bar{X}'_i,\bar{X}''_i\right)$ be the protocol defined in (21). Then

We expect the $k$-ary protocol to lead to better (lower) MSE bounds, but at the expense of an increase in communication cost. Whether or not the trade-off offered by $k>2$ is better than that for the $k=2$ case investigated in this paper is an interesting question to consider.

7.2 Preprocessing via Random Rotations

Following the idea proposed in [10], one can explore an encoding protocol $\alpha_Q$ which arises as the composition of a random rotation, $Q$, applied to $X_i$ for all $i$, followed by the protocol $\alpha$ described in Section 3. Letting $Z_i=QX_i$ and $Z=\frac{1}{n}\sum_i Z_i$, we thus have

With this protocol we associate the decoder $\gamma(Y_1,\dots,Y_n)=\frac{1}{n}\sum_{i=1}^{n}Q^{-1}Y_i$.

This approach is motivated by the following observation: a random rotation can be identified by a single random seed, which is easy to communicate to the server without the need to communicate all floating point entries defining $Q$. So, a random rotation pre-processing step implies only a minor communication overhead. However, if the preprocessing step helps to dramatically reduce the MSE, we get an improvement. Note that the inner expectation above is the formula for MSE of our basic encoding-decoding protocol, given that the data is $Z_i=QX_i$ instead of $\{X_i\}$. The outer expectation is over $Q$. Hence, we would like to find a mapping $Q$ which tends to transform the data $\{X_i\}$ into new data $\{Z_i\}$ with better MSE, in expectation.

where $\bar{x}=\tfrac{1}{d}\sum_j x(j)$ and $1$ is the vector of all ones. Further, for simplicity assume that $p_{ij}=p$ for all $i,j$. Then using Lemma 3.2, we get

It is interesting to investigate whether choosing $Q$ as a random rotation, rather than identity (which is the implicit choice made in previous sections), leads to improvement in MSE, i.e., whether we can in some well-defined sense obtain an inequality of the type

This is the case for the quantization protocol proposed in [10], which arises as a special case of our more general protocol. This is because the quantization protocol is suboptimal within our family of encoders. Indeed, as we have shown, with a different choice of the parameters we can obtain results which improve, in theory, on the rotation + quantization approach. This suggests that by combining an appropriately chosen rotation pre-processing step with our optimal encoder, it may be possible to achieve further improvements in MSE for any fixed communication budget. Finding suitable random rotations $Q$ requires a careful study which we leave to future research.
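To make the preprocessing step concrete, the sketch below uses a randomized Hadamard transform, $Q=\frac{1}{\sqrt{d}}HD$ with $D$ a diagonal matrix of random signs, as a stand-in for the random rotation. This choice is an assumption for illustration (the text leaves the choice of $Q$ open); it is orthogonal, is determined by a single seed, and can be applied in $\mathcal{O}(d\log d)$ time when $d$ is a power of two. All function names are hypothetical.

```python
import math
import random

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    return v

def random_signs(d, seed):
    """Diagonal of D: d random signs, reproducible from the seed."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]

def rotate(x, signs):
    """Z = Q x with Q = H D / sqrt(d)."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * t for t in fwht([s * xj for s, xj in zip(signs, x)])]

def rotate_back(z, signs):
    """Q^{-1} z = D H z / sqrt(d), using that H is symmetric and H H = d I."""
    scale = 1.0 / math.sqrt(len(z))
    return [s * scale * t for s, t in zip(signs, fwht(z))]
```

A node would apply `rotate` before encoding, and the server would apply `rotate_back` to each decoded vector before averaging (or once to the average, by linearity). Since the transform preserves Euclidean norms, any change in MSE comes from how it redistributes the coordinates around the node center.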

References

Appendix A Additional Proofs

In this section we provide proofs of Lemmas 3.3 and 3.4, describing properties of the encoding protocol $\alpha$ defined in (4). For completeness, we also repeat the statements.

The encoder $\alpha$ defined in (4) is unbiased. That is, ${\bf E}_{\alpha}\left[\alpha(X_i)\right]=X_i$ for all $i$. As a result, $Y$ is an unbiased estimate of the true average: ${\bf E}_{\alpha}\left[Y\right]=X$.

Since $Y(j)=\frac{1}{n}\sum_{i=1}^{n}Y_i(j)$ and $X(j)=\frac{1}{n}\sum_{i=1}^{n}X_i(j)$, it suffices to show that ${\bf E}_{\alpha}\left[Y_i(j)\right]=X_i(j)$:

Let $\alpha=\alpha(k)$ be the encoder defined in (4). Then

It suffices to substitute the above into (22). ∎