Communication-Computation Efficient Gradient Coding

Min Ye, Emmanuel Abbe

I introduction

Distributed computation plays a key role in the computational challenges faced by machine learning for large data sets . This requires overcoming a few obstacles: First the straggler effect, i.e., slow workers that hamper the computation time. Second, the communication cost; gradients in deep learning typically consist nowadays in millions of real-valued components, and the transmission of these high-dimensional vectors can amortize the savings of the computation time in large-scale distributed systems . This has driven researchers to use in particular gradient sparsification and gradient quantization to reduce communication cost .

More recently, coding theory has found its way into distributed computing , following the path of exporting coding techniques to distributed storage , caching and queuing . A few works have also initiated the use of coding techniques in distributed learning . Of particular interest to us is , which introduces coding techniques to mitigate the effect of stragglers in gradient computation. While this is a central task in machine learning, does not take into account the communication cost which is important in such applications as mentioned above.

This paper takes a global view on the running time of distributed learning tasks by considering the three parameters, namely, computation load, straggler tolerance and communication cost. We identify a three-fold fundamental tradeoff between these parameters in order to efficiently compute gradients (and more generally summations of vectors), exploiting distributivity both across data subsets and vector components. The tradeoff reads

where nn is the number of workers, kk is the number of data subsets, dd is the number of data subsets assigned to each worker, ss is the number of stragglers, and mm is the communication reduction factor. This generalizes the results in that correspond to m=1m=1. Note that one cannot derive (1) from the results of , and we will explain this in more detail below.

We further give an explicit code construction based on recursive polynomials that achieves the derived tradeoff. The key steps in our coding scheme are as follows: In order to reduce the dimension of transmitted vector for each worker, we first partition the coordinates of the gradient vector into mm groups of equal size. Then we design two matrices BB and VV, where the (ns)×n(n-s)\times n matrix VV has the property that any (ns)×(ns)(n-s)\times(n-s) submatrix is invertible. This property corresponds to the requirement that our coding scheme can tolerate any ss stragglers, and it can be easily satisfied by setting VV to be a (non-square) Vandermonde matrix. Furthermore, the (mn)×(ns)(mn)\times(n-s) matrix BB satisfies the following two property: (1) the last mm columns of BB consisting of nn identity matrices of size m×mm\times m; (2) for every j[n]j\in[n], the product of the iith row of BB and the jjth column of VV must be for a specific set of values of ii, and the cardinality of this set is (nd)m(n-d)m. The first property of BB guarantees the recovery of the sum gradient vector, and the second property ensures that each worker is assigned at most dd data subsets. We make use of the natural connection between the Vandermonde structure and polynomials to construct our matrix BB recursively: More precisely, we can view each row of BB as coefficients of some polynomial, and the product of BB and VV simply consists of the evaluations of these polynomials at certain points. We can then define these polynomials by specifying their roots so that the two properties of BB are satisfied. We also mention that the conditions in our construction are more restrictive than those in and : In our setting, the conditions in only require that the last mm columns of BB contain at most nn nonzero entries, and no requirements are imposed on the positions of these nonzero entries; as mentioned above, only deal with the special case of m=1m=1 and do not allow for dimensionality reduction of the gradient vectors. Due to these more relaxed conditions, the constructions in and do not have the recursive polynomial structure, which is the main technical novelty in our paper.

We further take numerical stability issue into consideration, and characterize an achievable region of the triple (d,s,m)(d,s,m) under a given upper bound κ\kappa of condition numbers of all the operations in the gradient reconstruction phase. We also present another coding scheme based on random matrices to achieve this region.

We support our theoretical findings by implementing our scheme on Amazon EC2 clusters using Python with mpi4py package. Experimental results show that the proposed scheme reduces the running time by 32%32\% compared to uncoded schemes and by 23%23\% compared to prior work , while maintaining the same generalization error on the Amazon Employee Access dataset from Kaggle, which was also used in for state-of-the-art experiments.

Slow workers (processors) called “stragglers” can hamper the computation time as the taskmaster needs to wait for all workers to complete their processing. Recent literature proposes adding redundancy in computation tasks of each worker so that the taskmaster can compute the final result using outputs from only a subset of workers and ignore the stragglers. The most popular ways to introduce redundancy in computation are based on either replication schemes or coding theoretic techniques . Lee et al. initialized the study of using erasure-correcting codes to mitigate straggler effects for linear machine learning tasks such as linear regression and matrix multiplication. Subsequently, Dutta et al. proposed new efficient coding schemes to calculate convolutions and the product of a matrix and a long vector , Yu et al. introduced optimal coding schemes to compute high-dimensional matrix multiplication and Fourier Transform , and Yang et al. developed coding methods for parallel iterative linear solver . Tandon et al. further used coding theoretic methods to avoid stragglers in nonlinear learning tasks. More specifically, presented an optimal trade-off between the computation load and straggler tolerance (the number of tolerable stragglers) in synchronous gradient descent for any loss function. Several code constructions achieving this trade-off were given in . Li et al. considered distributed gradient descent under a probabilistic model and proposed the Batched Coupon’s Collector scheme to alleviate straggler effect under this model. At the same time, the schemes in are designed to combat stragglers for the worst-case scenario. While most research focused on recovering the exact results in the presence of stragglers, suggested allowing some small deviations from the exact gradient in each iteration of the gradient descent and showed that one can obtain a good approximation of the original solution by using coding theoretic methods. Very recently, Zhu et al. proposed a sequential approximation method for distributed learning in the presence of stragglers, and their method is also based on erasure-correcting codes.

As mentioned above, high network communication cost for synchronizing gradients and parameters is also a well-known bottleneck of distributed learning. In particular for deep learning, gradient vectors typically consist of millions of real numbers, and for large-scale distributed systems, transmissions of high-dimensional gradient vectors might even amortize the savings of computation time . The most widely used methods to reduce communication cost in the literature are based on gradient sparsification and gradient quantization .

In this paper we directly incorporate the communication cost into the framework of reducing running time for gradient computation, in addition to computation load and straggler tolerance. In particular, we take advantage of distributing the computations over subsets of vector components in addition to subsets of data samples. The advantages of our coding scheme over the uncoded schemes and the schemes in are demonstrated by both experimental results and numerical analysis in Sections V and VI. We also strengthen the numerical analyses by studying the behavior of the running time using probabilistic models for the computation and communication times, obtaining improvements that are consistent with the outcome of the Amazon experiments (see Section VI). Our results apply to both batch gradient descent and mini-batch stochastic gradient descent (SGD), which is the most popular algorithm in large-scale distributed learning. Moreover, our coding theoretic method is orthogonal to the gradient sparsification and gradient quantization methods . In other words, our method can be used on top of the latter ones.

As a final remark, also studied the trade-off between computation and communication in distributed learning, but the problem setup in is different from our work in nature. We study distributed gradient descent while focused on MapReduce framework. The communication in our problem is from all the worker nodes to one master node, while the communication in is from all workers to all workers, and there is no master node in . This difference in problem setup leads to completely different results and techniques.

II Problem formulation and main results

where g(t):=L(D;β(t))=i=1NL(xi,yi;β(t))g^{(t)}:=\nabla L(D;\beta^{(t)})=\sum_{i=1}^{N}\nabla L(x_{i},y_{i};\beta^{(t)}) is the gradient of the loss at the current estimate of the parameters and hh is a gradient-based optimizer. As in , we assume that there are nn workers W1,W2,,WnW_{1},W_{2},\dots,W_{n}, and that the original dataset DD is partitioned into kk subsets of equal size, denoted as D1,D2,,DkD_{1},D_{2},\dots,D_{k}. Define the partial gradient vector of DiD_{i} as gi(t):=(x,y)DiL(x,y;β(t))g_{i}^{(t)}:=\sum_{(x,y)\in D_{i}}\nabla L(x,y;\beta^{(t)}). Clearly g(t)=g1(t)+g2(t)++gk(t)g^{(t)}=g_{1}^{(t)}+g_{2}^{(t)}+\dots+g_{k}^{(t)}. Suppose that each worker is assigned dd data subsets, and there are ss stragglers, i.e., we only wait for the results from the first nsn-s workers. For i=1,2,,ni=1,2,\dots,n, we write the datasets assigned to worker WiW_{i} as {Di1,Di2,,Did}\{D_{i_{1}},D_{i_{2}},\dots,D_{i_{d}}\}. Each worker computes its partial gradient vectors gi1(t),gi2(t),,gid(t)g_{i_{1}}^{(t)},g_{i_{2}}^{(t)},\dots,g_{i_{d}}^{(t)} and returns fi(gi1(t),gi2(t),,gid(t))f_{i}(g_{i_{1}}^{(t)},g_{i_{2}}^{(t)},\dots,g_{i_{d}}^{(t)}), a prespecified function of these partial gradients. In order to update the parameters according to (2), we require that the sum gradient vector g(t)g^{(t)} can be recovered from the results of the first nsn-s workers no matter who the ss stragglers will be. Due to complexity consideration, we would further like fi,i[n]f_{i},i\in[n] to be linear functions. Lee et al. showed that this is possible if and only if

Let us write each partial gradient vector as gi=(gi(0),gi(1),,gi(l1))g_{i}=(g_{i}(0),g_{i}(1),\dots,g_{i}(l-1)) for i=1,2,,ki=1,2,\dots,k. We will show that when sdkn1s\leq\frac{d}{k}n-1, each worker only needs to transmit a vectorAssume that k(dn)k|(dn) and (dkns)l(\frac{d}{k}n-s)|l. of dimension l/(dkns)l/(\frac{d}{k}n-s). In other words, we can reduce the communication cost by a factor of dkns\frac{d}{k}n-s.

Roughly speaking, showed the following two-dimensional tradeoff: if we assign more computation load at each worker, then we can tolerate more stragglers. In this paper we will show a three-dimensional tradeoff between computation load at each worker, straggler tolerance and the communication cost: for a fixed computation load, we can reduce the communication cost by waiting for results from more workers. Fig. 1 uses a toy example to illustrate this tradeoff as well as the basic idea of how to reduce the communication cost. In Fig. 1 the gradient vector has dimension l=2l=2, and it is clear that this idea extends to gradient vectors of any dimension (by padding a zero when ll is odd). To quantify the tradeoff, we introduce the following definition.

Given nn and kk, we say that a triple of nonnegative integers (d,s,m)(d,s,m) satisfying that 1dk1\leq d\leq k and m1m\geq 1 is achievableThroughout we assume that mlm|l. Since ll is typically very large and mm is relatively small, the condition mlm|l can always be satisfied by padding a few zeroes at the end of the gradient vectors. if there is a distributed synchronous gradient descent scheme such that

Each worker is assigned dd data subsets.

where i1,i2,,idi_{1},i_{2},\dots,i_{d} are the indices of datasets assigned to worker WiW_{i}.

f1,f2,,fnf_{1},f_{2},\dots,f_{n} are linear functions. In other words, fi(gi1,gi2,,gid)f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}) is a linear combination of the coordinates of the partial gradient vectors gi1,gi2,,gidg_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}.

For readers’ convenience, we list the main notation in Table I. Next we state the main theorem of this paper.

Let k,nk,n be positive integers. A triple (d,s,m)(d,s,m) is achievable if and only if

The converse proof is given in Appendix A, and the achievability scheme is given in Section III.

Note that the special case m=1m=1 in Theorem 1 is the same as the case considered in . We also remark that although (4) looks very similar to Theorem 1 in , their coding scheme can not be used to achieve (4) with equality when m>1m>1. In Appendix B, we discuss the differences between our work and in detail. In particular, we show that the constraint in our problem is stronger than that in .

Notice that the computation load at each worker is known by dk\frac{d}{k}, not the value of kk itself. We are interested in achieving the optimal computation load in (4), and the value of kk does not matter. Therefore we will assume that k=nk=n for the remainder of this paper (except in Appendix A and Appendix B).

Under this assumption, (4) has the following simple form

In Fig. 2 we take n=k=5,d=3,l=2n=k=5,d=3,l=2, and show the implementation for two different choices of the pair (s,m)(s,m). The communication cost of Fig. 2(b) is half of that of Fig. 2(a), but the system in Fig. 2(b) can only tolerate one straggler while the system in Fig. 2(a) can tolerate two stragglers. Table II below shows how to calculate the sum gradient vector in Fig. 2(b) when there is one straggler. In the table we abbreviate fi(gi1,gi2,,gid)f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}) in (3) as fif_{i}, i.e., fif_{i} is the transmitted vector of WiW_{i}.

In the proof of Theorem 1, we use Vandermonde matrices and assume that all the computations have infinite precision, which is not possible in real world applications. According to our experimental results, the stability issue of Vandermonde matrices can be ignored up to n=20n=20, which covers the regime considered in most related works . However, beyond that we need to design numerically stable coding schemes and give up the optimal trade-off (5) between d,sd,s and mm. In this section we find an achievable region for which the condition numbers of all operations in the gradient reconstruction phase are upper bounded by a given value κ\kappa, so that the numerical stability can be guaranteed. To that end, for any three given integers n>n1>n2n>n_{1}>n_{2}, we define a function γ(n,n1,n2,κ)\gamma(n,n_{1},n_{2},\kappa) to be the smallest integer n3n_{3} such that there is an n1×nn_{1}\times n matrix VV satisfying the following two properties:

n3n1n_{3}\geq n_{1}. For every subset F[n]\mathcal{F}\subseteq[n] with cardinality F=n3|\mathcal{F}|=n_{3}, the condition number of VFVFTV_{\mathcal{F}}V_{\mathcal{F}}^{T} is no larger than κ\kappa, where VFV_{\mathcal{F}} is the submatrix of VV consisting of columns whose indices are in the set F\mathcal{F}.

Let V[1:n2]V_{[1:n_{2}]} be the submatrix of VV consisting of the first n2n_{2} rows of VV. We require that every n2×n2n_{2}\times n_{2} submatrix consisting of circulant consecutive“circulant consecutive” means that the indices nn and 11 are considered consecutive. A more detailed explanation is given in Section IV. columns of V[1:n2]V_{[1:n_{2}]} is invertible.

Note that property 1) is similar to the restricted isometry property (RIP) property in compressed sensing . The only difference is that in compressed sensing n3<n1n_{3}<n_{1} while here we require n3n1n_{3}\geq n_{1}. We point out two obvious properties of this function: (1) for a fixed triple (n,n1,n2)(n,n_{1},n_{2}), the function γ(n,n1,n2,κ)\gamma(n,n_{1},n_{2},\kappa) decreases with κ\kappa; (2) when κ\kappa is large enough, γ(n,n1,n2,κ)=n1\gamma(n,n_{1},n_{2},\kappa)=n_{1}. We now state the theorem in the case n=kn=k to simplify the notation (see Remark 1):

Let κ\kappa be the upper bound on the condition number of all the operations in the gradient reconstruction phase. A triple (d,sκ,m)(d,s_{\kappa},m) is achievable if

The proof of this theorem is given in Section IV. As discussed above, when κ\kappa is large enough, i.e., when the stability constraint is loose, we have γ(n,nd+m,nd,κ)=nd+m\gamma(n,n-d+m,n-d,\kappa)=n-d+m, and (6) becomes sκdms_{\kappa}\leq d-m, which is the same as (5). Moreover, since γ(n,nd+m,nd,κ)\gamma(n,n-d+m,n-d,\kappa) decreases with κ\kappa, sκs_{\kappa} increases with κ\kappa. Namely, we can tolerate more stragglers if we allow less numerical stability. In our experiments we find that by setting VV to be Gaussian random matrix, we can achieve sκ=dms_{\kappa}=d-m with numerically stable coding scheme for n30n\leq 30, which improves upon the coding scheme based on Vandermonde matrices.

By choosing VV as a Gaussian random matrix and using the classical bounds on eigenvalues of large Wishart matricesA Wishart matrix is a matrix of form AATAA^{T}, where AA is a Gaussian random matrix. together with the union bound, we can obtain an upper bound of γ(n,n1,n2,κ)\gamma(n,n_{1},n_{2},\kappa). Let us introduce some more definitions to state the upper bound. Given two integers n>n1n>n_{1}, define the function

where HH is the entropy function H(q):=qlnq(1q)ln(1q)H(q):=-q\ln q-(1-q)\ln(1-q) defined for 0<q<10<q<1. It is easy to verify that when n1/n>1/2n_{1}/n>1/2, fn,n1(x)f_{n,n_{1}}(x) strictly decreases with xx. Following the same stepsThere are two differences between the settings in our paper and : First, we have one more condition that certain n2×n2n_{2}\times n_{2} submatrices of V[1:n2]V_{[1:n_{2}]} must be invertible, but this is satisfied with probability 11 for Gaussian random matrices, so this extra condition makes no difference to the proof, and the bound (7) does not depend on n2n_{2}. Second, in our paper we require n3n1n_{3}\geq n_{1} while in n3<n1n_{3}<n_{1}, but this difference can also be resolved by a trivial modification of the proof in . as in the proof of [32, Lemma 3.1], we can show that when n1/n>1/2n_{1}/n>1/2 and nn is large,

Let κ\kappa be the upper bound on the condition number of all the operations in the gradient reconstruction phase. When (dm)/n<1/2(d-m)/n<1/2, \kappa>\Big{(}\frac{1+\sqrt{n_{1}/n}}{1-\sqrt{n_{1}/n}}\Big{)}^{2}, and nn is large enough, a triple (d,sκ,m)(d,s_{\kappa},m) is achievable if

III Coding Scheme

In this section, we present a coding scheme achieving (5) with equality, i.e., the parameters in our scheme satisfy d=s+md=s+m. First we introduce two binary operations \oplus and \ominus over the set [n][n]. For a,b[n]a,b\in[n], define

In our scheme, each worker WiW_{i} is assigned with dd data subsets Di,Di1,Di2,,Di(d1)D_{i},D_{i\oplus 1},D_{i\oplus 2},\dots,D_{i\oplus(d-1)}. This is equivalent to say that each data subset DiD_{i} is assigned to dd workers Wi,Wi1,Wi2,,Wi(d1)W_{i},W_{i\ominus 1},W_{i\ominus 2},\dots,W_{i\ominus(d-1)}.

Let θ1,θ2,,θn\theta_{1},\theta_{2},\dots,\theta_{n} be nn distinct real numbers. Define nn polynomials pi,i[n]p_{i},i\in[n],

Before proceeding further, let us explain the meaning of θi\theta_{i} and pip_{i}. Each θi\theta_{i} is associated with the worker WiW_{i}, and each pip_{i} is associated with the dataset DiD_{i}. In our scheme, pj(θi)0p_{j}(\theta_{i})\neq 0 means that worker WiW_{i} needs the value of gjg_{j} to calculate fi(gi1,gi2,,gid)f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}), and therefore DjD_{j} is assigned to WiW_{i}. On the other hand, pj(θi)=0p_{j}(\theta_{i})=0 means that DjD_{j} is not assigned to WiW_{i}. By (8), we can see that each dataset DiD_{i} is NOT assigned to Wi1,Wi2,,Wi(nd)W_{i\oplus 1},W_{i\oplus 2},\dots,W_{i\oplus(n-d)}.

Next we construct an (mn)×(ns)(mn)\times(n-s) matrix B=(bij)B=(b_{ij}) from the polynomials pi,i[n]p_{i},i\in[n] defined in (8). Let pi,j,j=0,1,,ns1p_{i,j},j=0,1,\dots,n-s-1 be the coefficients of the polynomial pip_{i}, i.e.,

Since deg(pi)=nd\deg(p_{i})=n-d and d=s+ms+1d=s+m\geq s+1, we have pi,nd=1p_{i,n-d}=1 and pi,nd+1=pi,nd+2==pi,ns1=0p_{i,n-d+1}=p_{i,n-d+2}=\dots=p_{i,n-s-1}=0. For every i[n]i\in[n], we define mm polynomials pi(1),pi(2),,pi(m)p_{i}^{(1)},p_{i}^{(2)},\dots,p_{i}^{(m)} recursively:

where pi,j(u),j=0,1,,ns1p_{i,j}^{(u)},j=0,1,\dots,n-s-1 are the coefficients of pi(u)p_{i}^{(u)}, i.e., pi(u)(x)=j=0ns1pi,j(u)xjp_{i}^{(u)}(x)=\sum_{j=0}^{n-s-1}p_{i,j}^{(u)}x^{j}. Clearly, pi(u)p_{i}^{(u)} is a polynomial of degree deg(pi(u))=nd+u1\deg(p_{i}^{(u)})=n-d+u-1, and its leading coefficient is 11, i.e.,

It is also clear that pipi(u)p_{i}|p_{i}^{(u)} for all u[m]u\in[m] and all i[n]i\in[n], so we have

By a simple induction on uu, one can further see that

We can now specify the entries of BB as follows:

Moreover, according to (10) and (12), the submatrix B[(nd+1):(ns)]B_{[(n-d+1):(n-s)]} consisting of the last mm columns of BB is

where ImI_{m} is the m×mm\times m identity matrix, and there are nn identity matrix on the right-hand side of (15).

Recall that we assume mlm|l throughout the paper. For every v=0,1,,l/m1v=0,1,\dots,l/m-1 and j[n]j\in[n], define an mm-dimensional vector

For every v=0,1,,l/m1v=0,1,\dots,l/m-1, define an (mn)(mn)-dimensional vector

where the second equality follows from (11).

Now we are ready to define the transmitted vector fi(gi,gi1,,gi(d1))f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}) for each worker Wi,i[n]W_{i},i\in[n]:

By (17), the value of fi(gi,gi1,,gi(d1))f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}) indeed only depends on the values of gi,gi1,,gi(d1)g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}.

Without loss of generality let us assume that F={1,2,,ns}\mathcal{F}=\{1,2,\dots,n-s\}. Define the following (ns)×(ns)(n-s)\times(n-s) matrix

According to (18), from {fi(gi,gi1,,gi(d1)):iF}\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\} we can obtain the values of

Since AA is invertible, we can calculate zvBend+uz_{v}Be_{n-d+u} for all 0vl/m10\leq v\leq l/m-1 and all u[m]u\in[m] from the vectors in (21) by multiplying A1end+uA^{-1}e_{n-d+u} to the right. By (19),

Therefore we conclude that the sum vector g1+g2++gng_{1}+g_{2}+\dots+g_{n} can be calculated from {fi(gi,gi1,,gi(d1)):iF}\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\} whenever F=ns|\mathcal{F}|=n-s. Thus we have shown that our coding scheme satisfies all three conditions in Definition 1, and this completes the proof of the achievability part of Theorem 1.

III-B Efficient implementation of our coding scheme

To implement our coding scheme, an important step is to calculate the product

in order to obtain the transmitted vectors in (18). According to (14), this product can be easily calculated by the recursive procedure (9). Notice that in this recursive procedure we need to know the values of pi,nd1(u1)p_{i,n-d-1}^{(u-1)} for u=2,3,,mu=2,3,\dots,m, and by (13) we have b(i1)m+u,nd=pi,nd1(u)b_{(i-1)m+u,n-d}=p_{i,n-d-1}^{(u)}. Therefore in our implementation we need to calculate (at least some of) the entries of the matrix BB. The entries of BB are specified in (13), and they are calculated recursively according to (9) from the coefficients of the polynomials pi,i[n]p_{i},i\in[n]. While the recursive procedure in (9) might seem complicated, Algorithm 1 below describes an efficient way to calculate BB from the coefficients of pi,i[n]p_{i},i\in[n] defined in (8).

Finally, we remark that the examples in Fig. 2(a) and Fig. 2(b) are both obtained by setting θ1=2,θ2=1,θ3=0,θ4=1,θ5=2\theta_{1}=-2,\theta_{2}=-1,\theta_{3}=0,\theta_{4}=1,\theta_{5}=2 in our coding scheme.

In Section III-A we have shown that our coding scheme works for any set of nn distinct real numbers {θ1,θ2,,θn}\{\theta_{1},\theta_{2},\dots,\theta_{n}\}. However, in the proof we assume that the computation has infinite precision, which is not possible in real world application. Stability aspects need to be considered for the inversion of Vandermonde matrices of form (20) when the master node reconstructs the full gradient vector from partial gradient vectors returned by the first (ns)(n-s) worker nodes. It is well known that the accuracy of matrix inversion depends heavily on the condition number of the matrix. Therefore we need to find a set of {θ1,θ2,,θn}\{\theta_{1},\theta_{2},\dots,\theta_{n}\} such that every (ns)×(ns)(n-s)\times(n-s) submatrix of the following matrix VV has low condition number.

In our implementation in Section V, we choose

Note that numerical instability of our coding scheme is NOT due to the introduction of the communication cost reduction factor mm. In fact, in the authors presented coding schemes to achieve (5) for the special case of m=1m=1, and the schemes in both paper also involve inversion of Vandermonde matrices, so they also suffer from numerical instability. Moreover, the schemes in both paper set θ1,θ2,,θn\theta_{1},\theta_{2},\dots,\theta_{n} to be roots of unity. Such a choice does not resolve the numerical instability issue either: it is shown in that in the worst case the condition number of (ns)×(ns)(n-s)\times(n-s) submatrices of VV grows exponentially fast in nn when θ1,θ2,,θn\theta_{1},\theta_{2},\dots,\theta_{n} are roots of unity.

IV Proof of Theorem 2

Let s:=dms:=d-m and let sκ=nγ(n,ns,nd,κ).s_{\kappa}=n-\gamma(n,n-s,n-d,\kappa). Then by definition of the function γ\gamma, there is an (ns)×n(n-s)\times n matrix VV such that

nsκnsn-s_{\kappa}\geq n-s. For every subset F[n]\mathcal{F}\subseteq[n] with cardinality F=nsκ|\mathcal{F}|=n-s_{\kappa}, the condition number of VFVFTV_{\mathcal{F}}V_{\mathcal{F}}^{T} is no larger than κ\kappa, where VFV_{\mathcal{F}} is the submatrix of VV consisting of columns whose indices are in the set F\mathcal{F}.

Define nn submatrices of VV with size (nd)×(nd)(n-d)\times(n-d) as follows: For i=1,2,,d+1i=1,2,\dots,d+1, define SiS_{i} to be the submatrix of VV corresponding to the row indices {1,2,,nd}\{1,2,\dots,n-d\} and the column indices {i,i+1,,i+nd1}\{i,i+1,\dots,i+n-d-1\}. For i=d+2,d+3,,ni=d+2,d+3,\dots,n, define SiS_{i} to be the submatrix of VV corresponding to the row indices {1,2,,nd}\{1,2,\dots,n-d\} and the column indices {i,i+1,,n,1,2,,id1}\{i,i+1,\dots,n,1,2,\dots,i-d-1\}. The matrix SiS_{i} is invertible for all i[n]i\in[n].

We further define another nn submatrices of VV with size m×(nd)m\times(n-d) as follows: For i=1,2,,d+1i=1,2,\dots,d+1, define RiR_{i} to be the submatrix of VV corresponding to the row indices {nd+1,nd+2,,ns}\{n-d+1,n-d+2,\dots,n-s\} and the column indices {i,i+1,,i+nd1}\{i,i+1,\dots,i+n-d-1\}. For i=d+2,d+3,,ni=d+2,d+3,\dots,n, define RiR_{i} to be the submatrix of VV corresponding to the row indices {nd+1,nd+2,,ns}\{n-d+1,n-d+2,\dots,n-s\} and the column indices {i,i+1,,n,1,2,,id1}\{i,i+1,\dots,n,1,2,\dots,i-d-1\}.

To prove Theorem 2, we only need to find an (mn)×(ns)(mn)\times(n-s) matrix BB satisfying the following two conditions:

The product of the iith row of BB and the jjth column of VV is for all j[n]j\in[n] and all i[mn]{(mj)mod(mn)+1,(mj+1)mod(mn)+1,(mj+2)mod(mn)+1,,(mj+dm1)mod(mn)+1}i\in[mn]\setminus\{(mj)\text{mod}(mn)+1,(mj+1)\text{mod}(mn)+1,(mj+2)\text{mod}(mn)+1,\dots,(mj+dm-1)\text{mod}(mn)+1\};

Since Equation (15) already specifies the last mm columns of matrix BB, we only need to design the first (nd)(n-d) columns. For i[n]i\in[n], we write BiB_{i} the m×(nd)m\times(n-d) submatrix of BB corresponding to row indices {(i1)m+1,(i1)m+2,,im}\{(i-1)m+1,(i-1)m+2,\dots,im\} and column indices {1,2,,nd}\{1,2,\dots,n-d\}. Now the condition 1) above is equivalent to

Since SiS_{i} is invertible for all i[n]i\in[n], the equation above is equivalent to

It is easy to see that we can set Bi:=RiSi1B_{i}:=-R_{i}S_{i}^{-1} for all i[n]i\in[n] to satisfy this constraint. As a result, the matrix BB in our coding scheme is

where the matrices Ri,Si,i[n]R_{i},S_{i},i\in[n] are defined above as certain submatrices of the matrix VV.

Denote ViV_{i} as the iith column of VV. Recall the definition of zv,v=0,1,,l/m1z_{v},v=0,1,\dots,l/m-1 in (16). Now we are ready to define the transmitted vector fi(gi,gi1,,gi(d1))f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}) for each worker Wi,i[n]W_{i},i\in[n]:

By (24), the value of fi(gi,gi1,,gi(d1))f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}) indeed only depends on the values of gi,gi1,,gi(d1)g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}.

Similarly to the coding scheme in Section III-A, we can calculate zvBend+uz_{v}Be_{n-d+u} for all 0vl/m10\leq v\leq l/m-1 and all u[m]u\in[m] from the vectors in (26) by multiplying VFT(VFVFT)1end+uV_{\mathcal{F}}^{T}(V_{\mathcal{F}}V_{\mathcal{F}}^{T})^{-1}e_{n-d+u} to the right. Therefore we conclude that the sum vector g1+g2++gng_{1}+g_{2}+\dots+g_{n} can be calculated from {fi(gi,gi1,,gi(d1)):iF}\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\} whenever F=nsκ|\mathcal{F}|=n-s_{\kappa}. Thus we have shown that our coding scheme satisfies all three conditions in Definition 1. Moreover, the only matrix inversions in the gradient reconstruction phase is calculating (VFVFT)1(V_{\mathcal{F}}V_{\mathcal{F}}^{T})^{-1}. By definition of the matrix VV, the condition numbers of {VFVFT:F[n],F=nsκ}\{V_{\mathcal{F}}V_{\mathcal{F}}^{T}:\mathcal{F}\subseteq[n],|\mathcal{F}|=n-s_{\kappa}\} are all upper bounded by κ\kappa. This completes the proof of Theorem 2.

As a final remark, we do not impose any stability constraints on the calculation of Si1S_{i}^{-1} when constructing the matrix BB, which is also a matrix inversion. This is because the construction of BB is only one-time, so we can afford to use high-precision calculation to compensate for possibly large condition number in the construction of matrix BB.

In Section III we set VV to be a (non-square) Vandermonde matrix. However, it is well known that Vandermonde matrices are badly ill-conditioned , so one way to alleviate the numerical instability is to use random matrices instead of Vandermonde matrices. For instance, we can choose VV in (22) to be a Gaussian random matrix and design the matrix BB as described above. According to our experimental results, using random matrices allows our scheme to be numerically stable for all n30n\leq 30 and all possible values of d,sd,s and mm.

V Experiments on Amazon EC2 clusters

In this section, we use our proposed gradient coding scheme to train a logistic regression model on the Amazon Employee Access dataset from Kagglehttps://www.kaggle.com/c/amazon-employee-access-challenge, and we compare the running time and Generalization AUCAUC is short for area under the ROC-curve. The Generalization AUC can be efficiently calculated using the “sklearn.metrics.auc” function in Python. between our method and baseline approaches. More specifically, we compare our scheme against: (1) the naive scheme, where the data is uniformly divided among all workers without replication and the master node waits for all workers to send their results before updating model parameters in each iteration, and (2) the coding schemes in , i.e., the special case of m=1m=1 in our scheme. Note that in the authors implemented their methods (which is the special case of m=1m=1 in this paper) to train the same model over the same dataset.

We used Python with mpi4py package to implement our gradient coding schemes proposed in Section III-A, where θ1,θ2,,θn\theta_{1},\theta_{2},\dots,\theta_{n} are specified in (23). We used t2.micro instances on Amazon EC2 as worker nodes and a single c3.8xlarge instance as the master node.

As a common preprocessing step, we converted the categorical features in the Amazon Employee Access dataset to binary features by one-hot encoding, which can be easily realized in Python. After one-hot encoding with interaction terms, the dimension of parameters in our model is l=343474l=343474. For all three schemes (our proposed scheme, the schemes in and the naive scheme), we used N=26220N=26220 training samples and adopted Nesterov’s Accelerated Gradient (NAG) descent [37, Section 3.7] to train the model. These experiments were run on n=10,15,20n=10,15,20 worker nodes.

In Fig. 3, we compare average running time per iteration for different schemes. For coding schemes proposed in , i.e., coding schemes corresponding to m=1m=1 in our paper, we choose the optimal value of ss such that it has the smallest running time among all possible choices of (m=1,s)(m=1,s). For coding schemes proposed in this paper, i.e., schemes with m>1m>1, we choose two pairs of (m,s)(m,s) with the smallest running time among all possible choices. We can see that for all three choices of nn, our scheme outperforms the schemes in by at least 23%23\% and outperforms the naive scheme by at least 32%32\%. We then plot generalization AUC vs. running time for these choices of (m,s)(m,s) in Fig. 4. The curves corresponding to m>1m>1 are always on the left side of the curves corresponding to m=1m=1 and the naive scheme, which means that our schemes achieve the target generalization error much faster than the other two schemes.

VI Analysis of the total computation and communication time

In this section we analyze the total runtime of our coding scheme for different choices of the design parameters (d,s,m)(d,s,m) based on a probabilistic model. Our analysis reveals the optimal choice of parameters under some special cases and also sheds light on how to choose (d,s,m)(d,s,m) in general. Following the probabilistic model of runtime in , we assume that both computation time and communication time have shifted exponential distribution, which is the sum of a constant and an exponential random variable. We also assume that for each worker, the computation time is proportional to dd, the number of assigned data subsets, and the communication time is proportional to the dimension of transmitted vector. This assumption is in accordance to the observation in the experiments of . The total runtime is the sum of the computation time and the communication time.Since the total number of samples NN in large-scale machine learning tasks is of order hundreds of millions, we have NnN\gg n in our problem. The computation time is of order Θ(Nl)\Theta(Nl) while the reconstruction time is of order O(nl)O(nl). Therefore we can ignore the reconstruction phase at the master node when estimating the total runtime.

Formally speaking, for i,j[n]i,j\in[n], let Ti,j(1)T_{i,j}^{(1)} be the computation time of data subset DjD_{j} for worker WiW_{i}. Similarly, for i[n]i\in[n], let Ti(2)T_{i}^{(2)} be the communication time for worker WiW_{i} to send a vector of dimension ll. We make the following assumption:

For i[n]i\in[n], Ti,1(1)=Ti,2(1)==Ti,n(1)=Ti(1)T_{i,1}^{(1)}=T_{i,2}^{(1)}=\dots=T_{i,n}^{(1)}=T_{i}^{(1)}, where the random variables Ti(1),i[n]T_{i}^{(1)},i\in[n] are i.i.d. with distribution

The communication time for worker WiW_{i} to send a vector of dimension ll^{\prime} is (l/l)Ti(2)(l^{\prime}/l)T_{i}^{(2)}, where the random variables Ti(2),i[n]T_{i}^{(2)},i\in[n] are i.i.d. with distribution

The random variables Ti(1),i[n]T_{i}^{(1)},i\in[n] and Ti(2),i[n]T_{i}^{(2)},i\in[n] are mutually independent.

Here t1t_{1} and t2t_{2} are the minimum computation and communication time of a worker in perfect conditions, respectively; λ1\lambda_{1} and λ2\lambda_{2} depict the straggling behavior in the computation and communication process, respectively. It is clear that smaller λ1\lambda_{1} means the distribution of the computation time has a heavy tail and more likely to cause delay. Similarly, smaller λ2\lambda_{2} means that the communication process is more likely to be the bottleneck.

Under the assumptions above, for a triple (d,s,m)(d,s,m), the computation time of worker WiW_{i} is dTi(1)dT_{i}^{(1)}, which is the sum of the constant dt1dt_{1} and an exponential random variable with distribution Exp(λ1/d)\operatorname{Exp}(\lambda_{1}/d), and the communication time of worker WiW_{i} is 1mTi(2)\frac{1}{m}T_{i}^{(2)}, which is the sum of the constant t2/mt_{2}/m and an exponential random variable with distribution Exp(mλ2)\operatorname{Exp}(m\lambda_{2}). Therefore, the total runtime for each worker WiW_{i} is the sum of dt1+t2/mdt_{1}+t_{2}/m and a random variable Ti(d,m)T_{i}^{(d,m)} with distribution(27) gives the expression when λ1/dmλ2\lambda_{1}/d\neq m\lambda_{2}. When λ1/d=mλ2\lambda_{1}/d=m\lambda_{2}, Ti(d,m)T_{i}^{(d,m)} is an Erlang random variable with parameters 22 and mλ2m\lambda_{2}.

Since Ti(d,m),i[n]T_{i}^{(d,m)},i\in[n] are i.i.d. and we only need to wait for the first nsn-s workers to return their results, the total runtime of the whole task is

where the random variable Td,s,mT_{d,s,m} has distribution

Here Td,s,mT_{d,s,m} is the (ns)(n-s)th order statistics of the distribution (27). Since (5) depicts the optimal tradeoff between d,sd,s and mm, we should choose these three parameters to achieve (5) with equality in order to minimize TtotT_{\operatorname{tot}}. In other words, we should always set s=dms=d-m.

To understand how the choice of (d,s,m)(d,s,m) affects the total runtime, let us first consider two extreme cases.

Computation time is dominant: Assume that λ1λ2\lambda_{1}\ll\lambda_{2} and t1t2t_{1}\gg t_{2}, so that we can ignore the communication time. Obviously we should set m=1m=1 and therefore s=d1s=d-1. In this case, Td,d1,1T_{d,d-1,1} is the (nd+1)(n-d+1)th order statistics of nn i.i.d exponential random variables with distribution Exp(λ1/d)\operatorname{Exp}(\lambda_{1}/d). Consequently,

The proof of this proposition is given in Appendix C.

Communication time is dominant: Assume that λ1λ2\lambda_{1}\gg\lambda_{2} and t1t2t_{1}\ll t_{2}, so that we can ignore the computation time. Obviously we should set d=nd=n and therefore s=nms=n-m. In this case, Tn,nm,mT_{n,n-m,m} is the mmth order statistics of nn i.i.d exponential random variables with distribution Exp(mλ2)\operatorname{Exp}(m\lambda_{2}). Consequently,

For a fixed value of nn, if t21λ2t_{2}\gg\frac{1}{\lambda_{2}}, then the optimal choice is m=nm=n. On the other hand, if t21λ2t_{2}\ll\frac{1}{\lambda_{2}}, then the optimal choice is m=1m=1.

The optimal ratio α\alpha between the communication cost reduction factor mm and the number of workers nn is the unique root of the following equation

Note that for any given positive λ2\lambda_{2} and t2t_{2}, the equation above has a unique root in the open interval (0,1)(0,1). The proof of this proposition is given in Appendix D

When computation time and communication time are comparable, accurate analysis of (29) becomes more difficult. Here we use a numerical example to illustrate the advantages of our new proposal. According to (28) and (29), when λ1/dmλ2\lambda_{1}/d\neq m\lambda_{2},

We can see that d=4,m=3d=4,m=3 is the optimal choice, whose total runtime is 21.369721.3697. The runtime for uncoded scheme (d=m=1d=m=1) is 36.113836.1138, and the best achievable runtime for the coding schemes in is 24.106324.1063 (d=8,m=1). Therefore our coding scheme outperforms the uncoded scheme by 41%41\% and outperforms the schemes in by 11%11\%.

Next we investigate how the optimal triple (d,s,m)(d,s,m) varies with the values of λ1,λ2,t1,t2\lambda_{1},\lambda_{2},t_{1},t_{2}. First we fix λ1,t1\lambda_{1},t_{1}, and let λ2,t2\lambda_{2},t_{2} vary. In the following table we take n=k=10,λ1=0.6n=k=10,\lambda_{1}=0.6 and t1=1.5t_{1}=1.5. The optimal triple (d,s,m)(d,s,m) for different values of λ2\lambda_{2} and t2t_{2} is recorded in the table.

We can see that mm typically increases with t2t_{2}. At the same time, dd decreases when we increase the value of λ2\lambda_{2}.

In the following table we fix λ2,t2\lambda_{2},t_{2}, and let λ1,t1\lambda_{1},t_{1} vary. More specifically, we take n=k=10,λ2=0.1n=k=10,\lambda_{2}=0.1 and t2=6t_{2}=6. The optimal triple (d,s,m)(d,s,m) for different values of λ1\lambda_{1} and t1t_{1} is recorded in the table.

We can see that for a fixed λ1\lambda_{1}, ss decreases with t1t_{1}.

Appendix A Converse proof of Theorem 1

Assume that (d,s,m)(d,s,m) is achievable, and let us prove (4). We first prove the following claim:

For every i[k]i\in[k], data subset DiD_{i} must be assigned to at least s+ms+m workers.

The proof goes by contradiction. Suppose for some j[k]j\in[k], DjD_{j} is assigned to a<s+ma<s+m workers. Without loss of generality we assume these aa workers are W1,W2,,WaW_{1},W_{2},\dots,W_{a}. Now suppose that W1,W2,,WsW_{1},W_{2},\dots,W_{s} are the ss stragglers. According to Definition 1, we should be able to calculate g1+g2++gkg_{1}+g_{2}+\dots+g_{k} from fi(gi1,gi2,,gid),i{s+1,s+2,,n}f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}),i\in\{s+1,s+2,\dots,n\}.

Observe that we can calculate the ll-dimensional vector gjg_{j} from the following set of vectors {gi:i[k]{j}}{g1+g2++gk}\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{g_{1}+g_{2}+\dots+g_{k}\}. Therefore gjg_{j} can also be calculated from {gi:i[k]{j}}{fi(gi1,gi2,,gid):s+1in}\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq n\}. Since DjD_{j} is only assigned to the first aa workers, the values of {fi(gi1,gi2,,gid):a+1in}\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):a+1\leq i\leq n\} are determined by {gi:i[k]{j}}\{g_{i}:i\in[k]\setminus\{j\}\}. As a result, gjg_{j} can also be calculated from {gi:i[k]{j}}{fi(gi1,gi2,,gid):s+1ia}\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\}. Since fi,i[n]f_{i},i\in[n] are all linear functions (see condition 3 of Definition 1), we further deduce that {fi(gi1,gi2,,gid):s+1ia}\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\} must contain at least ll linear combinations of the coordinates of gjg_{j}. On the other hand, each fi(gi1,gi2,,gid)f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}) is a vector of dimension l/ml/m, so {fi(gi1,gi2,,gid):s+1ia}\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\} contains at most lm(as)\frac{l}{m}(a-s) linear combinations of the coordinates of gjg_{j}. As a result, we conclude that lm(as)l\frac{l}{m}(a-s)\geq l, i.e., as+ma\geq s+m, which gives a contradiction. This completes the proof of Claim 1. ∎

By Claim 1, each data subset is assigned to at least s+ms+m workers, so in total there are at least k(s+m)k(s+m) data subsets (counting with repetitions) assigned to all nn workers. Therefore each worker is assigned with at least kn(s+m)\frac{k}{n}(s+m) data subsets. Thus we have dkn(s+m)d\geq\frac{k}{n}(s+m). This completes the proof of (4).

Appendix B Differences between our results and the results in [14]

Below we state the main result of . Note that we change their notation to comply with ours.

We first show that this theorem gives a coding scheme to achieve (4) with equality for the special case m=1m=1, i.e., the case considered in . We set m=1m=1 and k=kk^{\prime}=k in Theorem 3, where kk is the number of data subsets in our problem. Moreover, we set a1a_{1} to be the all one vector. In our gradient coding problem, we want to calculate \sum_{i=1}^{k}g_{i}=a_{i}[\begin{array}[]{cccc}g_{1}&g_{2}&\dots&g_{k}\end{array}]^{T}. (Recall that g1,g2,,gkg_{1},g_{2},\dots,g_{k} are column vectors of dimension ll.) We denote the iith row of QQ in Theorem 3 as qiq_{i}. We claim that each worker WiW_{i} only needs to send the ll-dimensional vector q_{i}[\begin{array}[]{cccc}g_{1}&g_{2}&\dots&g_{k}\end{array}]^{T} to the master node. Indeed, Theorem 3 indicates that any (ns)(n-s) rows of QQ suffice to generate a1a_{1}, so this coding scheme can tolerate any ss stragglers. Moreover, since the number of nonzero entries in each qiq_{i} is at most kn(s+1)\frac{k}{n}(s+1), each worker only needs to be assigned with at most kn(s+1)\frac{k}{n}(s+1) data subsets. Therefore d=kn(s+1)d=\frac{k}{n}(s+1), achieving (4) with equality for the case m=1m=1.

Next we argue that for m>1m>1, Theorem 3 cannot give coding schemes achieving (4) with equality. For this case, in order to use Theorem 3 for gradient coding, one needs to set k=kmk^{\prime}=km. Moreover, for u[m]u\in[m], we should set aua_{u} to be the uuth row of the m×(mn)m\times(mn) matrix [\begin{array}[]{cccc}I_{m}&I_{m}&\dots&I_{m}\end{array}]. For every v=0,1,,l/m1v=0,1,\dots,l/m-1 and j[k]j\in[k], define an mm-dimensional vector

For every v=0,1,,l/m1v=0,1,\dots,l/m-1, define an (mk)(mk)-dimensional vector

Notice that the coordinates of the sum vector g1+g2++gng_{1}+g_{2}+\dots+g_{n} form the following set:

Let each worker WiW_{i} return the following (l/m)(l/m)-dimensional vector [\begin{array}[]{cccc}q_{i}z_{0}^{T}&q_{i}z_{1}^{T}&\dots&q_{i}z_{l/m-1}^{T}\end{array}]. Since any (ns)(n-s) rows of QQ suffice to generate a1,a2,,ama_{1},a_{2},\dots,a_{m}, one can calculate the elements in the set (32) from the returned results of any (ns)(n-s) workers and therefore recover the sum vector g1+g2++gng_{1}+g_{2}+\dots+g_{n}. By Theorem 3, each qiq_{i} has at most mkn(s+m)\frac{mk}{n}(s+m) nonzero entries. Now let us explain how the nonzero entries of qiq_{i} correspond to the data subsets assigned to worker WiW_{i}, which is the reason why Theorem 3 fails to give a gradient coding algorithm for m>1m>1. Let us further write qi=(qi,1,qi,2,,qi,mk)q_{i}=(q_{i,1},q_{i,2},\dots,q_{i,mk}). By definition of zvz_{v}, for any u[m]u\in[m] and j[k]j\in[k], if qi,(j1)m+u0q_{i,(j-1)m+u}\neq 0, then worker WiW_{i} needs the value of gj(vm+u1)g_{j}(vm+u-1) to calculate qizvTq_{i}z_{v}^{T}, i.e., data subset DjD_{j} should be assigned to WiW_{i}. Thus we conclude that the number of data subsets assigned to WiW_{i} is equal to

In order to achieve (4), we need the quantity in (33) to be no larger than kn(s+m)\frac{k}{n}(s+m) for all i[n]i\in[n]. If this is the case, then each qiq_{i} has at most mkn(s+m)\frac{mk}{n}(s+m) nonzero entries, which is the condition in Theorem 3. However, the condition in Theorem 3 does not imply that the quantity in (33) is at most kn(s+m)\frac{k}{n}(s+m). Thus the constraint in our problem is stronger than the constraint in Theorem 3, so the coding scheme in does not apply to our problem for the case m>1m>1.

Appendix C Proof of Proposition 1

The proposition follows immediately once we show that the optimal value of dd (we denote it dd^{*}) can only be 11 or nn. We prove by contradiction. Suppose that 1<d<n1<d^{*}<n, then by (30) we have

which implies that i=dn1i<i=d+1n1i\sum_{i=d^{*}}^{n}\frac{1}{i}<\sum_{i=d^{*}+1}^{n}\frac{1}{i}, but this is impossible. Therefore we conclude that dd^{*} can only be 11 or nn, and a simple comparison between these two gives the result in Proposition 1.

Appendix D Proof of Proposition 2

According to (31), we want to minimize the following function

Clearly h1(α)>0h_{1}^{\prime}(\alpha)>0 for all 0<α<10<\alpha<1. Since h1(0)=t2<0h_{1}(0)=-t_{2}<0 and h1(1)=+h_{1}(1^{-})=+\infty, the equation h1(α)=0h_{1}(\alpha)=0 has a unique solution α\alpha^{*} in the open interval (0,1)(0,1). Moreover, since h1(α)>h1(α)=0h_{1}(\alpha)>h_{1}(\alpha^{*})=0 for all α>α\alpha>\alpha^{*} and h1(α)<h1(α)=0h_{1}(\alpha)<h_{1}(\alpha^{*})=0 for all α<α\alpha<\alpha^{*}, we also have h(α)>0h^{\prime}(\alpha)>0 for all α>α\alpha>\alpha^{*} and h(α)<0h^{\prime}(\alpha)<0 for all α<α\alpha<\alpha^{*}. Consequently, α\alpha^{*} minimizes h(α)h(\alpha), and this completes the proof of Proposition 2.

References