Distributed Mean Estimation with Limited Communication

Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan

Introduction

This basic estimation problem is used as a subroutine in several learning and optimization tasks where data is distributed across several clients. For example, in Lloyd’s algorithm for k-means clustering, if data is distributed across several clients, the server needs to compute the means of all clusters in each update step. Similarly, for PCA, if data samples are distributed across several clients, then for the power-iteration method, the server needs to average the output of all clients in each step.

Recently, algorithms involving distributed mean estimation have been used extensively in training large-scale neural networks and other statistical models . In a typical scenario of synchronized distributed learning, each client obtains a copy of a global model. The clients then update the model independently based on their local data. The updates (usually in the form of gradients) are then sent to a server, where they are averaged and used to update the global model. A critical step in all of the above algorithms is to estimate the mean of a set of vectors as in Eq. (1).

One of the main bottlenecks in distributed algorithms is the communication cost. This has spurred a line of work focusing on communication cost in learning . The communication cost can be prohibitive for modern applications, where each client can be a low-power and low-bandwidth device such as a mobile phone . Given such a wide set of applications, we study the basic problem of achieving the optimal minimax rate in distributed mean estimation with limited communication.

We note that our model and results differ from previous works on mean estimation in two ways: previous works assume that the data is generated i.i.d. according to some distribution; we do not make any distribution assumptions on data. Secondly, the objective in prior works is to estimate the mean of the underlying statistical model; our goal is to estimate the empirical mean of the data.

2 Model

Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server and they can transmit at the same time. In any independent communication protocol, each client transmits a function of XiX_{i} (say f(Xi)f(X_{i})), and a central server estimates the mean by some function of f(X1),f(X2),,f(Xn)f(X_{1}),f(X_{2}),\ldots,f(X_{n}). Let π\pi be any such protocol and let Ci(π,Xi)\mathcal{C}_{i}(\pi,X_{i}) be the expected number of transmitted bits by the ii-th client during protocol π\pi, where throughout the paper, expectation is over the randomness in protocol π\pi.

The total number of bits transmitted by all clients with the protocol π\pi is

Let the estimated mean be Xˉ^\hat{\bar{X}}. For a protocol π\pi, the MSE of the estimate is

We allow the use of both private and public randomness. Private randomness refers to random values that are generated by each machine separately, and public randomness refers to a sequence of random values that are shared among all partiesIn the absence of public randomness, the server can communicate a random seed that can be used by clients to emulate public randomness..

Let Π(c)\Pi(c) denote the set of all protocols with communication cost at most cc. The minimax MSE is

3 Results and discussion

We first analyze the MSE E(π,Xn)\mathcal{E}(\pi,X^{n}) for three algorithms, when C(π,Xn)=Θ(nd)\mathcal{C}(\pi,X^{n})=\Theta(nd), i.e., each client sends a constant number of bits per dimension.

Stochastic uniform quantization. In Section 2.1, as a warm-up we first show that a naive stochastic binary quantization algorithm (denoted by πsb\pi_{sb}) achieves an MSE of

A natural way to decrease the error is to increase the number of levels of quantization. If we use kk levels of quantization, in Theorem 2, we show that the error decreases as

In order to reduce the communication cost, we propose two approaches.

Stochastic rotated quantization: We show that preprocessing the data by a random rotation reduces the mean squared error. Specifically, in Theorem 3, we show that this new scheme (denoted by πsrk\pi_{srk}) achieves an MSE of

Variable length coding: Our second approach uses the same quantization as πsk\pi_{sk} but encodes levels via variable length coding. Instead of using log2k\lceil\log_{2}k\rceil bits per dimension, we show that using variable length encoding such as arithmetic coding to compress the data reduces the communication cost significantly. In particular, in Theorem 4 we show that there is a scheme (denoted by πsvk\pi_{svk}) such that

and E(πsvk,Xn)=E(πsk,Xn)\mathcal{E}(\pi_{svk},X^{n})=\mathcal{E}(\pi_{sk},X^{n}). Hence, setting k=dk=\sqrt{d} in Eqs. 2 and 3 yields

and with Θ(nd)\Theta(nd) bits of communication i.e., constant number of bits per dimension per client. Of the three protocols, πsvk\pi_{svk} has the best MSE for a given communication cost. Note that πsvk\pi_{svk} uses kk quantization levels but still uses O(1)\mathcal{O}(1) bits per dimension per client for all kdk\leq\sqrt{d}.

Theoretically, while variable length coding has better guarantees, stochastic rotated quantization has several practical advantages: it uses fixed length coding and hence can be combined with encryption schemes for privacy preserving secure aggregation . It can also provide lower quantization error in some scenarios due to better constants (see Section 7 for details).

Concurrent to this work, showed that stochastic quantization and Elias coding can be used to obtain communication-optimal SGD. Recently, showed that πsb\pi_{sb} can be improved further by optimizing the choice of stochastic quantization boundaries. However, their results depend on the number of bits necessary to represent a float, whereas ours do not.

3.2 Minimax MSE

In the above protocols, all of the clients transmit the data. We augment these protocols with a sampling procedure, where only a random fraction of clients transmit data. We show that a combination of kk-level quantization, variable length coding, and sampling can be used to achieve information theoretically optimal MSE for a given communication cost. In particular, combining Corollary 1 and Theorem 5 yields our minimax result:

There exists a universal constant t<1t<1 such that for communication cost cndtc\leq ndt and n1/tn\geq 1/t,

This result shows that the product of communication cost and MSE scales linearly in the number of dimensions.

The rest of the paper is organized as follows. We first analyze the stochastic uniform quantization technique in Section 2. In Section 3, we propose the stochastic rotated quantization technique, and in Section 4 we analyze arithmetic coding. In Section 5, we combine the above algorithm with a sampling technique and state the upper bound on the minimax risk, and in Section 6 we state the matching minimax lower bounds. Finally, in Section 7 we discuss some practical considerations and apply these algorithms on distributed power iteration and Lloyd’s algorithm.

Stochastic uniform quantization

For a vector XiX_{i}, let Ximax=max1jdXi(j)X^{\max}_{i}=\max_{1\leq j\leq d}X_{i}(j) and similarly let Ximin=min1jdXi(j)X^{\min}_{i}=\min_{1\leq j\leq d}X_{i}(j). In the stochastic binary quantization protocol πsb\pi_{sb}, for each client ii, the quantized value for each coordinate jj is generated independently with private randomness as

We first bound the communication cost of this protocol.

Instead of sending vectors YiY_{i}, clients transmit two real values XimaxX^{\max}_{i} and XiminX^{\min}_{i} (to a desired precision) and a bit vector YiY^{\prime}_{i} such that Yi(j)=1Y^{\prime}_{i}(j)=1 if Yi=XimaxY_{i}=X^{\max}_{i} and otherwise. Hence each client transmits d+2rd+2r bits, where rr is the number of bits to transmit the real value to a desired precision.

Let NN be the maximum norm of the underlying vectors. To bound rr, observe that using rr bits, one can represent a number between N-N and NN to an error of N/2r1N/2^{r-1}. Thus using 3log2(dn)+13\log_{2}(dn)+1 bits one can represent the minimum and maximum to an additive error of N/(nd)3N/(nd)^{3}. This error in transmitting minimum and maximum of the vector does not affect our calculations and we ignore it for simplicity. We note that in practice, each dimension of XiX_{i} is often stored as a 32 bit or 64 bit float, and rr should be set as either 32 or 64. In this case, using an even larger rr does not further reduce the error. ∎

We now compute the estimation error of this protocol.

where the last equality follows by observing that YiXiY_{i}-X_{i}, i\forall i, are independent zero mean random variables. The proof follows by observing that for every ii,

Lemma 2 implies the following upper bound.

The proof follows by Lemma 2 observing that j\forall j

We also show that the above bound is tight:

There exists a set of vectors XnX^{n} such that

For every ii, let XiX_{i} be defined as follows. Xi(1)=1/2X_{i}(1)=1/\sqrt{2}, Xi(2)=1/2X_{i}(2)=-1/\sqrt{2}, and for all j>2j>2, Xi(j)=0X_{i}(j)=0. For every ii, Ximax=12X^{\max}_{i}=\frac{1}{\sqrt{2}} and Ximin=12X^{\min}_{i}=-\frac{1}{\sqrt{2}}. Substituting these bounds in the conclusion of Lemma 2 (which is an equality) yields the theorem. ∎

Therefore, the simple algorithm proposed in this section gives MSE Θ(d/n)\Theta(d/n) times the average norm. Such an error is too large for real-world use. For example, in the application of neural networks , dd can be on the order of millions, yet nn can be much smaller than that. In such cases, the MSE is even larger than the norm of the vector.

2 Stochastic k𝑘k-level quantization

A natural generalization of binary quantization is kk-level quantization. Let kk be a positive integer larger than 22. We propose a kk-level stochastic quantization scheme πsk\pi_{sk} to quantize each coordinate. Recall that for a vector XiX_{i}, Ximax=max1jdXi(j)X^{\max}_{i}=\max_{1\leq j\leq d}X_{i}(j) and Ximin=min1jdXi(j)X^{\min}_{i}=\min_{1\leq j\leq d}X_{i}(j). For every integer rr in the range [0,k)[0,k), let

where sis_{i} satisfies Ximin+siXimaxX^{\min}_{i}+s_{i}\geq X^{\max}_{i}. A natural choice for sis_{i} would be XimaxXiminX^{\max}_{i}-X^{\min}_{i}.We will show in Section 4, however, a higher value of sis_{i} and variable length coding has better guarantees. The algorithm quantizes each coordinate into one of Bi(r)B_{i}(r)s stochastically. In πsk\pi_{sk}, for the ii-th datapoint and jj-th coordinate, if Xi(j)[Bi(r),Bi(r+1))X_{i}(j)\in[B_{i}(r),B_{i}({r+1})),

As before, the communication complexity of this protocol is bounded. The proof is similar to that of Lemma 1 and hence omitted.

The mean squared loss can be bounded as follows.

If XimaxXiminsi2Xi2iX^{\max}_{i}-X^{\min}_{i}\leq s_{i}\leq\sqrt{2}\left\lvert\left\lvert X_{i}\right\rvert\right\rvert_{2}\,\,\forall i, then for any XnX^{n}, the πsk\pi_{sk} protocol satisfies,

We conclude this section by noting that si=XimaxXimins_{i}=X^{\max}_{i}-X^{\min}_{i} satisfies the conditions for the above theorem by Eq. (4).

Stochastic rotated quantization

We show that the algorithm of the previous section can be significantly improved by a new protocol. The motivation comes from the fact that the MSE of stochastic binary quantization and stochastic kk-level quantization is O(dn(XimaxXimin)2)O(\frac{d}{n}(X_{i}^{\max}-X_{i}^{\min})^{2}) (the proof of Lemma 3 and Theorem 2 with si=XimaxXimins_{i}=X_{i}^{\max}-X_{i}^{\min}). Therefore the MSE is smaller when XimaxX_{i}^{\max} and XiminX_{i}^{\min} are close. For example, when XiX_{i} is generated uniformly on the unit sphere, with high probability, XimaxXiminX_{i}^{\max}-X_{i}^{\min} is O(logdd)\mathcal{O}\left(\sqrt{\frac{\log d}{d}}\right) . In such case, E(πsk,Xn)\mathcal{E}(\pi_{sk},X^{n}) is O(logdn)\mathcal{O}(\frac{\log d}{n}) instead of O(dn)\mathcal{O}(\frac{d}{n}).

In this section, we show that even without any assumptions on the distribution of the data, we can “reduce” XimaxXiminX_{i}^{\max}-X_{i}^{\min} with a structured random rotation, yielding an O(logdn)\mathcal{O}(\frac{\log d}{n}) error. We call the method stochastic rotated quantization and denote it by πsrk\pi_{srk}.

The communication cost is same as πsk\pi_{sk} and is given by Lemma 5. We now bound the MSE.

For any XnX^{n}, E(πsrk(R),Xn)\mathcal{E}(\pi_{srk}(R),X^{n}) is at most

where Zi=RXiZ_{i}=RX_{i} and for every ii, let si=ZimaxZimins_{i}=Z^{\max}_{i}-Z^{\min}_{i}.

where the last inequality follows Eq. (5) and the value of sis_{i}. (a)(a) follows from the fact that the rotation does not change the norm of the vector, and (b)(b) follows from the tower law of expectation. The lemma follows from observing that

To obtain strong bounds, we need to find an orthogonal matrix RR that achieves low (Zimax)2(Z^{\max}_{i})^{2} and (Zimin)2(Z^{\min}_{i})^{2}. In addition, due to the fact that dd can be huge in practice, we need a type of orthogonal matrix that permits fast matrix-vector products. Naive orthogonal matrices that support fast multiplication such as block-diagonal matrices often result in high values of (Zimax)2(Z^{\max}_{i})^{2} and (Zimin)2(Z^{\min}_{i})^{2}. Motivated by recent works of structured matrices , we propose to use a special type of orthogonal matrix R=HDR=HD, where DD is a random diagonal matrix with i.i.d. Rademacher entries (±1\pm 1 with probability 0.50.5). HH is a Walsh-Hadamard matrix . The Walsh-Hadamard matrix of dimension 2m2^{m} for mNm\in\mathcal{N} is given by the recursive formula,

Let R=HDR=HD, where DD is a diagonal matrix with independent Radamacher random variables. For every ii and every sequence XnX^{n},

Combining the above two lemmas yields the main result.

For any XnX^{n}, πsrk(HD)\pi_{srk}(HD) protocol satisfies,

Variable length coding

Instead of preprocessing the data via a rotation matrix as in πsrk\pi_{srk}, in this section we propose to use a variable length coding strategy to minimize the number of bits.

Consider the stochastic kk-level quantization technique. A natural way of transmitting YiY_{i} is sending the bin number for each coordinate, thus the total number of bits the algorithm sends per transmitted coordinate would be dlog2kd\lceil\log_{2}k\rceil. This naive implementation is sub-optimal. Instead, we propose to further encode the transmitted values using universal compression schemes . We first encode hrh_{r}, the number of times each quantized value rr has appeared, and then use arithmetic or Huffman coding corresponding to the distribution pr=hrd.p_{r}=\frac{h_{r}}{d}. We denote this scheme by πsvk\pi_{svk}. Since we quantize vectors the same way in πsk\pi_{sk} and πsvk\pi_{svk}, the MSE of πsvk\pi_{svk} is also given by Theorem 2. We now bound the communication cost.

Let si=2Xis_{i}=\sqrt{2}\left\lvert\left\lvert X_{i}\right\rvert\right\rvert. There exists an implementation of πsvk\pi_{svk} such that C(πsvk,Xn)\mathcal{C}(\pi_{svk},X^{n}) is at most

Once we have compressed the hrh_{r}’s, we use arithmetic coding corresponding to the distribution pr=hr/dp_{r}=h_{r}/d to compress and transmit bin values for each coordinate. The total number of bits arithmetic coding uses is

Let a=(k1)Ximina=(k-1)X^{\min}_{i} and b=sib=s_{i}. Note that if Yi(j)Y_{i}(j) belongs to bin rr, (a+br)2=(k1)2Yi2(j)(a+br)^{2}=(k-1)^{2}Y^{2}_{i}(j) and hrh_{r} is the number of coordinates quantized into bin rr. Hence rhr(a+br)2\sum_{r}h_{r}(a+br)^{2} is the scaled norm-square of YiY_{i}, i.e.,

where the α(j)=Yi(j)Xi(j)\alpha(j)=Y_{i}(j)-X_{i}(j). Taking expectations on both sides and using the fact that the α(j)\alpha(j) are independent zero mean random variables over a range of si/(k1)s_{i}/(k-1), we get

We now bound rhrdlog2dhr\sum_{r}\frac{h_{r}}{d}\log_{2}\frac{d}{h_{r}} in terms of rhr(a+br)2\sum_{r}h_{r}(a+br)^{2}. Let pr=hr/dp_{r}=h_{r}/d and β=r=0k11/((a+br)2+δ)\beta=\sum^{k-1}_{r=0}1/((a+br)^{2}+\delta). Note that

where the first inequality follows from the positivity of KL-divergence. Choosing δ=si2\delta=s^{2}_{i}, yields β4/si2\beta\leq 4/s^{2}_{i} and hence,

Combining the above set of equations, we get that the expected communication is bounded by

where (a)(a) follows from Equation (7) and (b)(b) follows from Jensen’s inequality. The last inequality follows from Equation (6). ∎

Thus if k=d+1k=\sqrt{d}+1, the communication complexity is O(nd)\mathcal{O}(nd) and the MSE is O(1/n)\mathcal{O}(1/n).

Communication MSE trade-off

In the above protocols, all the clients transmit and hence the communication cost scales linearly with nn. Instead, we show that any of the above protocols can be combined by client sampling to obtain trade-offs between the MSE and the communication cost. Note that similar analysis also holds for sampling the coordinates.

Let π\pi be a protocol where the mean estimate is of the form:

All three protocols we have discussed are of this form. Let πp\pi_{p} be the protocol where each client participates independently with probability pp. The server estimates Xˉ\bar{X} by

where YiY_{i}s are defined in the previous section and SS is the set of clients that transmitted.

For any set of vectors XnX^{n} and protocol π\pi of the form Equation (8), its sampled version πp\pi_{p} satisfies

The proof of communication cost follows from Lemma 5 and the fact that in expectation, npnp clients transmit. We now bound the MSE. Let SS be the set of clients that transmit. The error E(πp,Xn)\mathcal{E}(\pi_{p},X^{n}) is

Furthermore, the second term can be bounded as

where the last equality follows from the assumption that π\pi’s mean estimate is of the form (8). (a)(a) follows from the fact that R1YiXiR^{-1}Y_{i}-X_{i} are independent zero mean random variables. ∎

Combining the above lemma with Theorems 2 and 4, and choosing k=d+1k=\sqrt{d}+1 results in the following.

For every cnd(2+log2(7/4))c\leq nd(2+\log_{2}(7/4)), there exists a protocol π\pi such that C(π,Sd)c\mathcal{C}(\pi,S^{d})\leq c and

Lower bounds

The lower bound relies on the lower bounds on distributed statistical estimation due to .

There exists a set of distributions Pd\mathcal{P}_{d} supported on [1d,1d]d\left[-\frac{1}{\sqrt{d}},\frac{1}{\sqrt{d}}\right]^{d} such that if any centralized server wishes to estimate the mean of the underlying unknown distribution, then for any independent protocol π\pi

where C(π)\mathcal{C}(\pi) is the communication cost of the protocol, θ(pd)\theta(p_{d}) is the mean of pdPdp_{d}\in\mathcal{P}_{d}, and tt is a positive constant.

Let tt be the constant in Lemma 9. For every cndt/4c\leq ndt/4 and n4/tn\geq 4/t,

Given nn samples from the underlying distribution where each sample belongs to SdS^{d}, it is easy to see that

where θ^(pd)\hat{\theta}(p_{d}) is the empirical mean of the observed samples. Let Pd\mathcal{P}_{d} be the set of distributions in Lemma 9. Hence for any protocol π\pi there exists a distribution pdp_{d} such that

(a)(a) follows from the fact that 2(ab)2+2(bc)2(ac)22(a-b)^{2}+2(b-c)^{2}\geq(a-c)^{2}. (b)(b) follows from Lemma 9 and (c)(c) follows from the fact that C(π,Sd)ndt/4\mathcal{C}(\pi,S^{d})\leq ndt/4 and n4/tn\geq 4/t. ∎

Corollary 1 and Theorem 5 yield Theorem 1. We note that the above lower bound holds only for communication cost c<O(nd)c<\mathcal{O}(nd). Extending the results for larger values of cc remains an open problem.

At a first glance it may appear that combining structured random matrix and variable length encoding may improve the result asymptotically, and therefore violates the lower bound. However, this is not true. Observe that variable length coding πsvk\pi_{svk} and stochastic rotated quantization πsrk\pi_{srk} use different aspects of the data: the variable length coding uses the fact that bins with large values of index rr are less frequent. Hence, we can use fewer bits to encode frequent bins and thus improve communication. In this scheme bin-width (si/(k1)s_{i}/(k-1)) is 2Xi2/(k1)\sqrt{2}||X_{i}||_{2}/(k-1). Rotated quantization uses the fact that rotation makes the min and max closer to each other and hence we can make bins with smaller width. In such a case, all the bins become “more or less equally likely” and hence variable length coding does not help. In this scheme bin-width (si/(k1)s_{i}/(k-1)) is (ZimaxZimin)/(k1)Xi2(logd)/(kd)(Z^{\max}_{i}-Z^{\min}_{i})/(k-1)\approx||X_{i}||_{2}(\log d)/(kd), which is much smaller than bin-width for variable length coding. Hence variable length coding and random rotation cannot be used simultaneously.

Practical considerations and applications

Based on the theoretical analysis, the variable-length coding method provides the lowest quantization error asymptotically when using a constant number of bits. Stochastic rotated quantization in practice may be preferred due to the (hidden) constant factors of the variable length code methods. For example, considering quantizing a single vector $,stochasticrotatedquantizationcanuse1bitperdimensionandachieveszeroerror:therotatedvectorhasonlytwovalues,, stochastic rotated quantization can use 1 bit per dimension and achieves zero error: the rotated vector has only two values ,2or,or ,-2.Wefurthernotethattherotatedquantizationispreferredwhenappliedonunbalanceddata,duetothefactthattherotationcancorrecttheunbalancedness.Wedemonstratethisbygeneratingadatasetwherethevalueofthelastfeaturedimensionentryismuchlargerthanothers.Wegenerate1000datapointseachwith256dimensions.Thefirst255dimensionsaregeneratedi.i.d.from. We further note that the rotated quantization is preferred when applied on “unbalanced” data, due to the fact that the rotation can correct the unbalancedness. We demonstrate this by generating a dataset where the value of the last feature dimension entry is much larger than others. We generate 1000 datapoints each with 256 dimensions. The first 255 dimensions are generated i.i.d. fromN(0,1),andthelastdimensionisgeneratedfrom, and the last dimension is generated fromN(100,1)$. As shown in Figure 1, the rotated stochastic quantization has the best performance. The improvement is especially significant for low bit rate cases.

We demonstrate two applications in the rest of this section. The experiments are performed on the MNIST (d=1024d=1024) and CIFAR (d=512d=512) datasets.

Distributed Lloyd’s algorithm. In the distributed Lloyd’s (k-means) algorithm, each client has access to a subset of data points. In each iteration, the server broadcasts the cluster centers to all the clients. Each client updates the centers based on its local data, and sends the centers back to the server. The server then updates the centers by computing the weighted average of the centers sent from all clients. In the quantized setting, the client compresses the new centers before sending to the server. This saves the uplink communication cost, which is often the bottleneck of distributed learningIn this setting, the downlink is a broadcast, and therefore its cost can be reduced by a factor of O(n/logn)O(n/\log n) without quantization, where nn is the number of clients.. We set both the number of centers and number of clients to 10. Figure 2 shows the result.

Distributed power iteration. Power iteration is a widely used method to compute the top eigenvector of a matrix. In the distributed setting, each client has access to a subset of data. In each iteration, the server broadcasts the current estimate of the eigenvector to all clients. Each client then updates the eigenvector based on one power iteration on its local data, and sends the updated eigenvector back to the server. The server updates the eigenvector by computing the average of the eigenvectors sent by all clients. Similar to the above distributed Lloyd’s algorithm, in the quantized setting, the client compresses the estimated eigenvector before sending to the server. Figure 3 shows the result. The dataset is distributed over 100100 clients.

For both of these applications, variable-length coding achieves the lowest quantization error in most of the settings. Furthermore, for low-bit rate, stochastic rotated quantization is competitive with variable-length coding.

Acknowledgments

We thank Jayadev Acharya, Keith Bonawitz, Dan Holtmann-Rice, Jakub Konecny, Tengyu Ma, and Xiang Wu for helpful comments and discussions.

References

Appendix A Proof of Lemma 7

The equality follows from the symmetry in HDHD. To prove the upper bound, observe that

Let D(j)D(j) be the jthj^{\text{th}} diagonal entry of DD. To bound the first term observe that ZimaxZ^{\max}_{i} is a function of dd independent random variables D(1),D(2),D(d)D(1),D(2),\ldots D(d). Changing D(j)D(j) changes the ZimaxZ^{\max}_{i} by at most 2Xi(j)d\frac{2X_{i}(j)}{\sqrt{d}}. Hence, applying Efron-Stein variance bound yields

To bound the second term, observe that for every β>0\beta>0,

Note that Zi(k)=1dj=1dD(j)H(k,j)Xi(j)Z_{i}(k)=\frac{1}{\sqrt{d}}\sum^{d}_{j=1}D(j)H(k,j)X_{i}(j). Since the D(j)D(j)’s are Radamacher random variables and H(k,j)=1|H(k,j)|=1 for all k,jk,j, the distributions of Zi(k)Z_{i}(k) is same for all kk. Hence by Jensen’s inequality,

Since Zi(1)=1dj=1dD(j)Xi(j)Z_{i}(1)=\frac{1}{\sqrt{d}}\sum^{d}_{j=1}D(j)X_{i}(j),

where (a)(a) follows from the fact that the D(i)D(i)’s are independent and (b)(b) follows from the fact that ea+ea2ea2/2e^{a}+e^{-a}\leq 2e^{a^{2}/2} for any aa. Hence,