Communication-Computation Efficient Gradient Coding

Min Ye, Emmanuel Abbe

I introduction

Distributed computation plays a key role in the computational challenges faced by machine learning for large data sets . This requires overcoming a few obstacles: First the straggler effect, i.e., slow workers that hamper the computation time. Second, the communication cost; gradients in deep learning typically consist nowadays in millions of real-valued components, and the transmission of these high-dimensional vectors can amortize the savings of the computation time in large-scale distributed systems . This has driven researchers to use in particular gradient sparsification and gradient quantization to reduce communication cost .

More recently, coding theory has found its way into distributed computing , following the path of exporting coding techniques to distributed storage , caching and queuing . A few works have also initiated the use of coding techniques in distributed learning . Of particular interest to us is , which introduces coding techniques to mitigate the effect of stragglers in gradient computation. While this is a central task in machine learning, does not take into account the communication cost which is important in such applications as mentioned above.

This paper takes a global view on the running time of distributed learning tasks by considering the three parameters, namely, computation load, straggler tolerance and communication cost. We identify a three-fold fundamental tradeoff between these parameters in order to efficiently compute gradients (and more generally summations of vectors), exploiting distributivity both across data subsets and vector components. The tradeoff reads

where $n$ is the number of workers, $k$ is the number of data subsets, $d$ is the number of data subsets assigned to each worker, $s$ is the number of stragglers, and $m$ is the communication reduction factor. This generalizes the results in that correspond to $m=1$ . Note that one cannot derive (1) from the results of , and we will explain this in more detail below.

We further give an explicit code construction based on recursive polynomials that achieves the derived tradeoff. The key steps in our coding scheme are as follows: In order to reduce the dimension of transmitted vector for each worker, we first partition the coordinates of the gradient vector into $m$ groups of equal size. Then we design two matrices $B$ and $V$ , where the $(n-s)\times n$ matrix $V$ has the property that any $(n-s)\times(n-s)$ submatrix is invertible. This property corresponds to the requirement that our coding scheme can tolerate any $s$ stragglers, and it can be easily satisfied by setting $V$ to be a (non-square) Vandermonde matrix. Furthermore, the $(mn)\times(n-s)$ matrix $B$ satisfies the following two property: (1) the last $m$ columns of $B$ consisting of $n$ identity matrices of size $m\times m$ ; (2) for every $j\in[n]$ , the product of the $i$ th row of $B$ and the $j$ th column of $V$ must be for a specific set of values of $i$ , and the cardinality of this set is $(n-d)m$ . The first property of $B$ guarantees the recovery of the sum gradient vector, and the second property ensures that each worker is assigned at most $d$ data subsets. We make use of the natural connection between the Vandermonde structure and polynomials to construct our matrix $B$ recursively: More precisely, we can view each row of $B$ as coefficients of some polynomial, and the product of $B$ and $V$ simply consists of the evaluations of these polynomials at certain points. We can then define these polynomials by specifying their roots so that the two properties of $B$ are satisfied. We also mention that the conditions in our construction are more restrictive than those in and : In our setting, the conditions in only require that the last $m$ columns of $B$ contain at most $n$ nonzero entries, and no requirements are imposed on the positions of these nonzero entries; as mentioned above, only deal with the special case of $m=1$ and do not allow for dimensionality reduction of the gradient vectors. Due to these more relaxed conditions, the constructions in and do not have the recursive polynomial structure, which is the main technical novelty in our paper.

We further take numerical stability issue into consideration, and characterize an achievable region of the triple $(d,s,m)$ under a given upper bound $\kappa$ of condition numbers of all the operations in the gradient reconstruction phase. We also present another coding scheme based on random matrices to achieve this region.

We support our theoretical findings by implementing our scheme on Amazon EC2 clusters using Python with mpi4py package. Experimental results show that the proposed scheme reduces the running time by $32\%$ compared to uncoded schemes and by $23\%$ compared to prior work , while maintaining the same generalization error on the Amazon Employee Access dataset from Kaggle, which was also used in for state-of-the-art experiments.

Slow workers (processors) called “stragglers” can hamper the computation time as the taskmaster needs to wait for all workers to complete their processing. Recent literature proposes adding redundancy in computation tasks of each worker so that the taskmaster can compute the final result using outputs from only a subset of workers and ignore the stragglers. The most popular ways to introduce redundancy in computation are based on either replication schemes or coding theoretic techniques . Lee et al. initialized the study of using erasure-correcting codes to mitigate straggler effects for linear machine learning tasks such as linear regression and matrix multiplication. Subsequently, Dutta et al. proposed new efficient coding schemes to calculate convolutions and the product of a matrix and a long vector , Yu et al. introduced optimal coding schemes to compute high-dimensional matrix multiplication and Fourier Transform , and Yang et al. developed coding methods for parallel iterative linear solver . Tandon et al. further used coding theoretic methods to avoid stragglers in nonlinear learning tasks. More specifically, presented an optimal trade-off between the computation load and straggler tolerance (the number of tolerable stragglers) in synchronous gradient descent for any loss function. Several code constructions achieving this trade-off were given in . Li et al. considered distributed gradient descent under a probabilistic model and proposed the Batched Coupon’s Collector scheme to alleviate straggler effect under this model. At the same time, the schemes in are designed to combat stragglers for the worst-case scenario. While most research focused on recovering the exact results in the presence of stragglers, suggested allowing some small deviations from the exact gradient in each iteration of the gradient descent and showed that one can obtain a good approximation of the original solution by using coding theoretic methods. Very recently, Zhu et al. proposed a sequential approximation method for distributed learning in the presence of stragglers, and their method is also based on erasure-correcting codes.

As mentioned above, high network communication cost for synchronizing gradients and parameters is also a well-known bottleneck of distributed learning. In particular for deep learning, gradient vectors typically consist of millions of real numbers, and for large-scale distributed systems, transmissions of high-dimensional gradient vectors might even amortize the savings of computation time . The most widely used methods to reduce communication cost in the literature are based on gradient sparsification and gradient quantization .

In this paper we directly incorporate the communication cost into the framework of reducing running time for gradient computation, in addition to computation load and straggler tolerance. In particular, we take advantage of distributing the computations over subsets of vector components in addition to subsets of data samples. The advantages of our coding scheme over the uncoded schemes and the schemes in are demonstrated by both experimental results and numerical analysis in Sections V and VI. We also strengthen the numerical analyses by studying the behavior of the running time using probabilistic models for the computation and communication times, obtaining improvements that are consistent with the outcome of the Amazon experiments (see Section VI). Our results apply to both batch gradient descent and mini-batch stochastic gradient descent (SGD), which is the most popular algorithm in large-scale distributed learning. Moreover, our coding theoretic method is orthogonal to the gradient sparsification and gradient quantization methods . In other words, our method can be used on top of the latter ones.

As a final remark, also studied the trade-off between computation and communication in distributed learning, but the problem setup in is different from our work in nature. We study distributed gradient descent while focused on MapReduce framework. The communication in our problem is from all the worker nodes to one master node, while the communication in is from all workers to all workers, and there is no master node in . This difference in problem setup leads to completely different results and techniques.

II Problem formulation and main results

where $g^{(t)}:=\nabla L(D;\beta^{(t)})=\sum_{i=1}^{N}\nabla L(x_{i},y_{i};\beta^{(t)})$ is the gradient of the loss at the current estimate of the parameters and $h$ is a gradient-based optimizer. As in , we assume that there are $n$ workers $W_{1},W_{2},\dots,W_{n}$ , and that the original dataset $D$ is partitioned into $k$ subsets of equal size, denoted as $D_{1},D_{2},\dots,D_{k}$ . Define the partial gradient vector of $D_{i}$ as $g_{i}^{(t)}:=\sum_{(x,y)\in D_{i}}\nabla L(x,y;\beta^{(t)})$ . Clearly $g^{(t)}=g_{1}^{(t)}+g_{2}^{(t)}+\dots+g_{k}^{(t)}$ . Suppose that each worker is assigned $d$ data subsets, and there are $s$ stragglers, i.e., we only wait for the results from the first $n-s$ workers. For $i=1,2,\dots,n$ , we write the datasets assigned to worker $W_{i}$ as $\{D_{i_{1}},D_{i_{2}},\dots,D_{i_{d}}\}$ . Each worker computes its partial gradient vectors $g_{i_{1}}^{(t)},g_{i_{2}}^{(t)},\dots,g_{i_{d}}^{(t)}$ and returns $f_{i}(g_{i_{1}}^{(t)},g_{i_{2}}^{(t)},\dots,g_{i_{d}}^{(t)})$ , a prespecified function of these partial gradients. In order to update the parameters according to (2), we require that the sum gradient vector $g^{(t)}$ can be recovered from the results of the first $n-s$ workers no matter who the $s$ stragglers will be. Due to complexity consideration, we would further like $f_{i},i\in[n]$ to be linear functions. Lee et al. showed that this is possible if and only if

Let us write each partial gradient vector as $g_{i}=(g_{i}(0),g_{i}(1),\dots,g_{i}(l-1))$ for $i=1,2,\dots,k$ . We will show that when $s\leq\frac{d}{k}n-1$ , each worker only needs to transmit a vectorAssume that $k|(dn)$ and $(\frac{d}{k}n-s)|l$ . of dimension $l/(\frac{d}{k}n-s)$ . In other words, we can reduce the communication cost by a factor of $\frac{d}{k}n-s$ .

Roughly speaking, showed the following two-dimensional tradeoff: if we assign more computation load at each worker, then we can tolerate more stragglers. In this paper we will show a three-dimensional tradeoff between computation load at each worker, straggler tolerance and the communication cost: for a fixed computation load, we can reduce the communication cost by waiting for results from more workers. Fig. 1 uses a toy example to illustrate this tradeoff as well as the basic idea of how to reduce the communication cost. In Fig. 1 the gradient vector has dimension $l=2$ , and it is clear that this idea extends to gradient vectors of any dimension (by padding a zero when $l$ is odd). To quantify the tradeoff, we introduce the following definition.

Given $n$ and $k$ , we say that a triple of nonnegative integers $(d,s,m)$ satisfying that $1\leq d\leq k$ and $m\geq 1$ is achievableThroughout we assume that $m|l$ . Since $l$ is typically very large and $m$ is relatively small, the condition $m|l$ can always be satisfied by padding a few zeroes at the end of the gradient vectors. if there is a distributed synchronous gradient descent scheme such that

Each worker is assigned $d$ data subsets.

where $i_{1},i_{2},\dots,i_{d}$ are the indices of datasets assigned to worker $W_{i}$ .

$f_{1},f_{2},\dots,f_{n}$ are linear functions. In other words, $f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}})$ is a linear combination of the coordinates of the partial gradient vectors $g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}$ .

For readers’ convenience, we list the main notation in Table I. Next we state the main theorem of this paper.

Let $k,n$ be positive integers. A triple $(d,s,m)$ is achievable if and only if

The converse proof is given in Appendix A, and the achievability scheme is given in Section III.

Note that the special case $m=1$ in Theorem 1 is the same as the case considered in . We also remark that although (4) looks very similar to Theorem 1 in , their coding scheme can not be used to achieve (4) with equality when $m>1$ . In Appendix B, we discuss the differences between our work and in detail. In particular, we show that the constraint in our problem is stronger than that in .

Notice that the computation load at each worker is known by $\frac{d}{k}$ , not the value of $k$ itself. We are interested in achieving the optimal computation load in (4), and the value of $k$ does not matter. Therefore we will assume that $k=n$ for the remainder of this paper (except in Appendix A and Appendix B).

Under this assumption, (4) has the following simple form

In Fig. 2 we take $n=k=5,d=3,l=2$ , and show the implementation for two different choices of the pair $(s,m)$ . The communication cost of Fig. 2(b) is half of that of Fig. 2(a), but the system in Fig. 2(b) can only tolerate one straggler while the system in Fig. 2(a) can tolerate two stragglers. Table II below shows how to calculate the sum gradient vector in Fig. 2(b) when there is one straggler. In the table we abbreviate $f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}})$ in (3) as $f_{i}$ , i.e., $f_{i}$ is the transmitted vector of $W_{i}$ .

In the proof of Theorem 1, we use Vandermonde matrices and assume that all the computations have infinite precision, which is not possible in real world applications. According to our experimental results, the stability issue of Vandermonde matrices can be ignored up to $n=20$ , which covers the regime considered in most related works . However, beyond that we need to design numerically stable coding schemes and give up the optimal trade-off (5) between $d,s$ and $m$ . In this section we find an achievable region for which the condition numbers of all operations in the gradient reconstruction phase are upper bounded by a given value $\kappa$ , so that the numerical stability can be guaranteed. To that end, for any three given integers $n>n_{1}>n_{2}$ , we define a function $\gamma(n,n_{1},n_{2},\kappa)$ to be the smallest integer $n_{3}$ such that there is an $n_{1}\times n$ matrix $V$ satisfying the following two properties:

$n_{3}\geq n_{1}$ . For every subset $\mathcal{F}\subseteq[n]$ with cardinality $|\mathcal{F}|=n_{3}$ , the condition number of $V_{\mathcal{F}}V_{\mathcal{F}}^{T}$ is no larger than $\kappa$ , where $V_{\mathcal{F}}$ is the submatrix of $V$ consisting of columns whose indices are in the set $\mathcal{F}$ .

Let $V_{[1:n_{2}]}$ be the submatrix of $V$ consisting of the first $n_{2}$ rows of $V$ . We require that every $n_{2}\times n_{2}$ submatrix consisting of circulant consecutive“circulant consecutive” means that the indices $n$ and $1$ are considered consecutive. A more detailed explanation is given in Section IV. columns of $V_{[1:n_{2}]}$ is invertible.

Note that property 1) is similar to the restricted isometry property (RIP) property in compressed sensing . The only difference is that in compressed sensing $n_{3}<n_{1}$ while here we require $n_{3}\geq n_{1}$ . We point out two obvious properties of this function: (1) for a fixed triple $(n,n_{1},n_{2})$ , the function $\gamma(n,n_{1},n_{2},\kappa)$ decreases with $\kappa$ ; (2) when $\kappa$ is large enough, $\gamma(n,n_{1},n_{2},\kappa)=n_{1}$ . We now state the theorem in the case $n=k$ to simplify the notation (see Remark 1):

Let $\kappa$ be the upper bound on the condition number of all the operations in the gradient reconstruction phase. A triple $(d,s_{\kappa},m)$ is achievable if

The proof of this theorem is given in Section IV. As discussed above, when $\kappa$ is large enough, i.e., when the stability constraint is loose, we have $\gamma(n,n-d+m,n-d,\kappa)=n-d+m$ , and (6) becomes $s_{\kappa}\leq d-m$ , which is the same as (5). Moreover, since $\gamma(n,n-d+m,n-d,\kappa)$ decreases with $\kappa$ , $s_{\kappa}$ increases with $\kappa$ . Namely, we can tolerate more stragglers if we allow less numerical stability. In our experiments we find that by setting $V$ to be Gaussian random matrix, we can achieve $s_{\kappa}=d-m$ with numerically stable coding scheme for $n\leq 30$ , which improves upon the coding scheme based on Vandermonde matrices.

By choosing $V$ as a Gaussian random matrix and using the classical bounds on eigenvalues of large Wishart matricesA Wishart matrix is a matrix of form $AA^{T}$ , where $A$ is a Gaussian random matrix. together with the union bound, we can obtain an upper bound of $\gamma(n,n_{1},n_{2},\kappa)$ . Let us introduce some more definitions to state the upper bound. Given two integers $n>n_{1}$ , define the function

where $H$ is the entropy function $H(q):=-q\ln q-(1-q)\ln(1-q)$ defined for $0<q<1$ . It is easy to verify that when $n_{1}/n>1/2$ , $f_{n,n_{1}}(x)$ strictly decreases with $x$ . Following the same stepsThere are two differences between the settings in our paper and : First, we have one more condition that certain $n_{2}\times n_{2}$ submatrices of $V_{[1:n_{2}]}$ must be invertible, but this is satisfied with probability $1$ for Gaussian random matrices, so this extra condition makes no difference to the proof, and the bound (7) does not depend on $n_{2}$ . Second, in our paper we require $n_{3}\geq n_{1}$ while in $n_{3}<n_{1}$ , but this difference can also be resolved by a trivial modification of the proof in . as in the proof of [32, Lemma 3.1], we can show that when $n_{1}/n>1/2$ and $n$ is large,

Let $\kappa$ be the upper bound on the condition number of all the operations in the gradient reconstruction phase. When $(d-m)/n<1/2$ , $\kappa>\Big{(}\frac{1+\sqrt{n_{1}/n}}{1-\sqrt{n_{1}/n}}\Big{)}^{2}$ , and $n$ is large enough, a triple $(d,s_{\kappa},m)$ is achievable if

III Coding Scheme

In this section, we present a coding scheme achieving (5) with equality, i.e., the parameters in our scheme satisfy $d=s+m$ . First we introduce two binary operations $\oplus$ and $\ominus$ over the set $[n]$ . For $a,b\in[n]$ , define

In our scheme, each worker $W_{i}$ is assigned with $d$ data subsets $D_{i},D_{i\oplus 1},D_{i\oplus 2},\dots,D_{i\oplus(d-1)}$ . This is equivalent to say that each data subset $D_{i}$ is assigned to $d$ workers $W_{i},W_{i\ominus 1},W_{i\ominus 2},\dots,W_{i\ominus(d-1)}$ .

Let $\theta_{1},\theta_{2},\dots,\theta_{n}$ be $n$ distinct real numbers. Define $n$ polynomials $p_{i},i\in[n]$ ,

Before proceeding further, let us explain the meaning of $\theta_{i}$ and $p_{i}$ . Each $\theta_{i}$ is associated with the worker $W_{i}$ , and each $p_{i}$ is associated with the dataset $D_{i}$ . In our scheme, $p_{j}(\theta_{i})\neq 0$ means that worker $W_{i}$ needs the value of $g_{j}$ to calculate $f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}})$ , and therefore $D_{j}$ is assigned to $W_{i}$ . On the other hand, $p_{j}(\theta_{i})=0$ means that $D_{j}$ is not assigned to $W_{i}$ . By (8), we can see that each dataset $D_{i}$ is NOT assigned to $W_{i\oplus 1},W_{i\oplus 2},\dots,W_{i\oplus(n-d)}$ .

Next we construct an $(mn)\times(n-s)$ matrix $B=(b_{ij})$ from the polynomials $p_{i},i\in[n]$ defined in (8). Let $p_{i,j},j=0,1,\dots,n-s-1$ be the coefficients of the polynomial $p_{i}$ , i.e.,

Since $\deg(p_{i})=n-d$ and $d=s+m\geq s+1$ , we have $p_{i,n-d}=1$ and $p_{i,n-d+1}=p_{i,n-d+2}=\dots=p_{i,n-s-1}=0$ . For every $i\in[n]$ , we define $m$ polynomials $p_{i}^{(1)},p_{i}^{(2)},\dots,p_{i}^{(m)}$ recursively:

where $p_{i,j}^{(u)},j=0,1,\dots,n-s-1$ are the coefficients of $p_{i}^{(u)}$ , i.e., $p_{i}^{(u)}(x)=\sum_{j=0}^{n-s-1}p_{i,j}^{(u)}x^{j}$ . Clearly, $p_{i}^{(u)}$ is a polynomial of degree $\deg(p_{i}^{(u)})=n-d+u-1$ , and its leading coefficient is $1$ , i.e.,

It is also clear that $p_{i}|p_{i}^{(u)}$ for all $u\in[m]$ and all $i\in[n]$ , so we have

By a simple induction on $u$ , one can further see that

We can now specify the entries of $B$ as follows:

Moreover, according to (10) and (12), the submatrix $B_{[(n-d+1):(n-s)]}$ consisting of the last $m$ columns of $B$ is

where $I_{m}$ is the $m\times m$ identity matrix, and there are $n$ identity matrix on the right-hand side of (15).

Recall that we assume $m|l$ throughout the paper. For every $v=0,1,\dots,l/m-1$ and $j\in[n]$ , define an $m$ -dimensional vector

For every $v=0,1,\dots,l/m-1$ , define an $(mn)$ -dimensional vector

where the second equality follows from (11).

Now we are ready to define the transmitted vector $f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)})$ for each worker $W_{i},i\in[n]$ :

By (17), the value of $f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)})$ indeed only depends on the values of $g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}$ .

Without loss of generality let us assume that $\mathcal{F}=\{1,2,\dots,n-s\}$ . Define the following $(n-s)\times(n-s)$ matrix

According to (18), from $\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\}$ we can obtain the values of

Since $A$ is invertible, we can calculate $z_{v}Be_{n-d+u}$ for all $0\leq v\leq l/m-1$ and all $u\in[m]$ from the vectors in (21) by multiplying $A^{-1}e_{n-d+u}$ to the right. By (19),

Therefore we conclude that the sum vector $g_{1}+g_{2}+\dots+g_{n}$ can be calculated from $\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\}$ whenever $|\mathcal{F}|=n-s$ . Thus we have shown that our coding scheme satisfies all three conditions in Definition 1, and this completes the proof of the achievability part of Theorem 1.

III-B Efficient implementation of our coding scheme

To implement our coding scheme, an important step is to calculate the product

in order to obtain the transmitted vectors in (18). According to (14), this product can be easily calculated by the recursive procedure (9). Notice that in this recursive procedure we need to know the values of $p_{i,n-d-1}^{(u-1)}$ for $u=2,3,\dots,m$ , and by (13) we have $b_{(i-1)m+u,n-d}=p_{i,n-d-1}^{(u)}$ . Therefore in our implementation we need to calculate (at least some of) the entries of the matrix $B$ . The entries of $B$ are specified in (13), and they are calculated recursively according to (9) from the coefficients of the polynomials $p_{i},i\in[n]$ . While the recursive procedure in (9) might seem complicated, Algorithm 1 below describes an efficient way to calculate $B$ from the coefficients of $p_{i},i\in[n]$ defined in (8).

Finally, we remark that the examples in Fig. 2(a) and Fig. 2(b) are both obtained by setting $\theta_{1}=-2,\theta_{2}=-1,\theta_{3}=0,\theta_{4}=1,\theta_{5}=2$ in our coding scheme.

In Section III-A we have shown that our coding scheme works for any set of $n$ distinct real numbers $\{\theta_{1},\theta_{2},\dots,\theta_{n}\}$ . However, in the proof we assume that the computation has infinite precision, which is not possible in real world application. Stability aspects need to be considered for the inversion of Vandermonde matrices of form (20) when the master node reconstructs the full gradient vector from partial gradient vectors returned by the first $(n-s)$ worker nodes. It is well known that the accuracy of matrix inversion depends heavily on the condition number of the matrix. Therefore we need to find a set of $\{\theta_{1},\theta_{2},\dots,\theta_{n}\}$ such that every $(n-s)\times(n-s)$ submatrix of the following matrix $V$ has low condition number.

In our implementation in Section V, we choose

Note that numerical instability of our coding scheme is NOT due to the introduction of the communication cost reduction factor $m$ . In fact, in the authors presented coding schemes to achieve (5) for the special case of $m=1$ , and the schemes in both paper also involve inversion of Vandermonde matrices, so they also suffer from numerical instability. Moreover, the schemes in both paper set $\theta_{1},\theta_{2},\dots,\theta_{n}$ to be roots of unity. Such a choice does not resolve the numerical instability issue either: it is shown in that in the worst case the condition number of $(n-s)\times(n-s)$ submatrices of $V$ grows exponentially fast in $n$ when $\theta_{1},\theta_{2},\dots,\theta_{n}$ are roots of unity.

IV Proof of Theorem 2

Let $s:=d-m$ and let $s_{\kappa}=n-\gamma(n,n-s,n-d,\kappa).$ Then by definition of the function $\gamma$ , there is an $(n-s)\times n$ matrix $V$ such that

$n-s_{\kappa}\geq n-s$ . For every subset $\mathcal{F}\subseteq[n]$ with cardinality $|\mathcal{F}|=n-s_{\kappa}$ , the condition number of $V_{\mathcal{F}}V_{\mathcal{F}}^{T}$ is no larger than $\kappa$ , where $V_{\mathcal{F}}$ is the submatrix of $V$ consisting of columns whose indices are in the set $\mathcal{F}$ .

Define $n$ submatrices of $V$ with size $(n-d)\times(n-d)$ as follows: For $i=1,2,\dots,d+1$ , define $S_{i}$ to be the submatrix of $V$ corresponding to the row indices $\{1,2,\dots,n-d\}$ and the column indices $\{i,i+1,\dots,i+n-d-1\}$ . For $i=d+2,d+3,\dots,n$ , define $S_{i}$ to be the submatrix of $V$ corresponding to the row indices $\{1,2,\dots,n-d\}$ and the column indices $\{i,i+1,\dots,n,1,2,\dots,i-d-1\}$ . The matrix $S_{i}$ is invertible for all $i\in[n]$ .

We further define another $n$ submatrices of $V$ with size $m\times(n-d)$ as follows: For $i=1,2,\dots,d+1$ , define $R_{i}$ to be the submatrix of $V$ corresponding to the row indices $\{n-d+1,n-d+2,\dots,n-s\}$ and the column indices $\{i,i+1,\dots,i+n-d-1\}$ . For $i=d+2,d+3,\dots,n$ , define $R_{i}$ to be the submatrix of $V$ corresponding to the row indices $\{n-d+1,n-d+2,\dots,n-s\}$ and the column indices $\{i,i+1,\dots,n,1,2,\dots,i-d-1\}$ .

To prove Theorem 2, we only need to find an $(mn)\times(n-s)$ matrix $B$ satisfying the following two conditions:

The product of the $i$ th row of $B$ and the $j$ th column of $V$ is for all $j\in[n]$ and all $i\in[mn]\setminus\{(mj)\text{mod}(mn)+1,(mj+1)\text{mod}(mn)+1,(mj+2)\text{mod}(mn)+1,\dots,(mj+dm-1)\text{mod}(mn)+1\}$ ;

Since Equation (15) already specifies the last $m$ columns of matrix $B$ , we only need to design the first $(n-d)$ columns. For $i\in[n]$ , we write $B_{i}$ the $m\times(n-d)$ submatrix of $B$ corresponding to row indices $\{(i-1)m+1,(i-1)m+2,\dots,im\}$ and column indices $\{1,2,\dots,n-d\}$ . Now the condition 1) above is equivalent to

Since $S_{i}$ is invertible for all $i\in[n]$ , the equation above is equivalent to

It is easy to see that we can set $B_{i}:=-R_{i}S_{i}^{-1}$ for all $i\in[n]$ to satisfy this constraint. As a result, the matrix $B$ in our coding scheme is

where the matrices $R_{i},S_{i},i\in[n]$ are defined above as certain submatrices of the matrix $V$ .

Denote $V_{i}$ as the $i$ th column of $V$ . Recall the definition of $z_{v},v=0,1,\dots,l/m-1$ in (16). Now we are ready to define the transmitted vector $f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)})$ for each worker $W_{i},i\in[n]$ :

By (24), the value of $f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)})$ indeed only depends on the values of $g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}$ .

Similarly to the coding scheme in Section III-A, we can calculate $z_{v}Be_{n-d+u}$ for all $0\leq v\leq l/m-1$ and all $u\in[m]$ from the vectors in (26) by multiplying $V_{\mathcal{F}}^{T}(V_{\mathcal{F}}V_{\mathcal{F}}^{T})^{-1}e_{n-d+u}$ to the right. Therefore we conclude that the sum vector $g_{1}+g_{2}+\dots+g_{n}$ can be calculated from $\{f_{i}(g_{i},g_{i\oplus 1},\dots,g_{i\oplus(d-1)}):i\in\mathcal{F}\}$ whenever $|\mathcal{F}|=n-s_{\kappa}$ . Thus we have shown that our coding scheme satisfies all three conditions in Definition 1. Moreover, the only matrix inversions in the gradient reconstruction phase is calculating $(V_{\mathcal{F}}V_{\mathcal{F}}^{T})^{-1}$ . By definition of the matrix $V$ , the condition numbers of $\{V_{\mathcal{F}}V_{\mathcal{F}}^{T}:\mathcal{F}\subseteq[n],|\mathcal{F}|=n-s_{\kappa}\}$ are all upper bounded by $\kappa$ . This completes the proof of Theorem 2.

As a final remark, we do not impose any stability constraints on the calculation of $S_{i}^{-1}$ when constructing the matrix $B$ , which is also a matrix inversion. This is because the construction of $B$ is only one-time, so we can afford to use high-precision calculation to compensate for possibly large condition number in the construction of matrix $B$ .

In Section III we set $V$ to be a (non-square) Vandermonde matrix. However, it is well known that Vandermonde matrices are badly ill-conditioned , so one way to alleviate the numerical instability is to use random matrices instead of Vandermonde matrices. For instance, we can choose $V$ in (22) to be a Gaussian random matrix and design the matrix $B$ as described above. According to our experimental results, using random matrices allows our scheme to be numerically stable for all $n\leq 30$ and all possible values of $d,s$ and $m$ .

V Experiments on Amazon EC2 clusters

In this section, we use our proposed gradient coding scheme to train a logistic regression model on the Amazon Employee Access dataset from Kagglehttps://www.kaggle.com/c/amazon-employee-access-challenge, and we compare the running time and Generalization AUCAUC is short for area under the ROC-curve. The Generalization AUC can be efficiently calculated using the “sklearn.metrics.auc” function in Python. between our method and baseline approaches. More specifically, we compare our scheme against: (1) the naive scheme, where the data is uniformly divided among all workers without replication and the master node waits for all workers to send their results before updating model parameters in each iteration, and (2) the coding schemes in , i.e., the special case of $m=1$ in our scheme. Note that in the authors implemented their methods (which is the special case of $m=1$ in this paper) to train the same model over the same dataset.

We used Python with mpi4py package to implement our gradient coding schemes proposed in Section III-A, where $\theta_{1},\theta_{2},\dots,\theta_{n}$ are specified in (23). We used t2.micro instances on Amazon EC2 as worker nodes and a single c3.8xlarge instance as the master node.

As a common preprocessing step, we converted the categorical features in the Amazon Employee Access dataset to binary features by one-hot encoding, which can be easily realized in Python. After one-hot encoding with interaction terms, the dimension of parameters in our model is $l=343474$ . For all three schemes (our proposed scheme, the schemes in and the naive scheme), we used $N=26220$ training samples and adopted Nesterov’s Accelerated Gradient (NAG) descent [37, Section 3.7] to train the model. These experiments were run on $n=10,15,20$ worker nodes.

In Fig. 3, we compare average running time per iteration for different schemes. For coding schemes proposed in , i.e., coding schemes corresponding to $m=1$ in our paper, we choose the optimal value of $s$ such that it has the smallest running time among all possible choices of $(m=1,s)$ . For coding schemes proposed in this paper, i.e., schemes with $m>1$ , we choose two pairs of $(m,s)$ with the smallest running time among all possible choices. We can see that for all three choices of $n$ , our scheme outperforms the schemes in by at least $23\%$ and outperforms the naive scheme by at least $32\%$ . We then plot generalization AUC vs. running time for these choices of $(m,s)$ in Fig. 4. The curves corresponding to $m>1$ are always on the left side of the curves corresponding to $m=1$ and the naive scheme, which means that our schemes achieve the target generalization error much faster than the other two schemes.

VI Analysis of the total computation and communication time

In this section we analyze the total runtime of our coding scheme for different choices of the design parameters $(d,s,m)$ based on a probabilistic model. Our analysis reveals the optimal choice of parameters under some special cases and also sheds light on how to choose $(d,s,m)$ in general. Following the probabilistic model of runtime in , we assume that both computation time and communication time have shifted exponential distribution, which is the sum of a constant and an exponential random variable. We also assume that for each worker, the computation time is proportional to $d$ , the number of assigned data subsets, and the communication time is proportional to the dimension of transmitted vector. This assumption is in accordance to the observation in the experiments of . The total runtime is the sum of the computation time and the communication time.Since the total number of samples $N$ in large-scale machine learning tasks is of order hundreds of millions, we have $N\gg n$ in our problem. The computation time is of order $\Theta(Nl)$ while the reconstruction time is of order $O(nl)$ . Therefore we can ignore the reconstruction phase at the master node when estimating the total runtime.

Formally speaking, for $i,j\in[n]$ , let $T_{i,j}^{(1)}$ be the computation time of data subset $D_{j}$ for worker $W_{i}$ . Similarly, for $i\in[n]$ , let $T_{i}^{(2)}$ be the communication time for worker $W_{i}$ to send a vector of dimension $l$ . We make the following assumption:

For $i\in[n]$ , $T_{i,1}^{(1)}=T_{i,2}^{(1)}=\dots=T_{i,n}^{(1)}=T_{i}^{(1)}$ , where the random variables $T_{i}^{(1)},i\in[n]$ are i.i.d. with distribution

The communication time for worker $W_{i}$ to send a vector of dimension $l^{\prime}$ is $(l^{\prime}/l)T_{i}^{(2)}$ , where the random variables $T_{i}^{(2)},i\in[n]$ are i.i.d. with distribution

The random variables $T_{i}^{(1)},i\in[n]$ and $T_{i}^{(2)},i\in[n]$ are mutually independent.

Here $t_{1}$ and $t_{2}$ are the minimum computation and communication time of a worker in perfect conditions, respectively; $\lambda_{1}$ and $\lambda_{2}$ depict the straggling behavior in the computation and communication process, respectively. It is clear that smaller $\lambda_{1}$ means the distribution of the computation time has a heavy tail and more likely to cause delay. Similarly, smaller $\lambda_{2}$ means that the communication process is more likely to be the bottleneck.

Under the assumptions above, for a triple $(d,s,m)$ , the computation time of worker $W_{i}$ is $dT_{i}^{(1)}$ , which is the sum of the constant $dt_{1}$ and an exponential random variable with distribution $\operatorname{Exp}(\lambda_{1}/d)$ , and the communication time of worker $W_{i}$ is $\frac{1}{m}T_{i}^{(2)}$ , which is the sum of the constant $t_{2}/m$ and an exponential random variable with distribution $\operatorname{Exp}(m\lambda_{2})$ . Therefore, the total runtime for each worker $W_{i}$ is the sum of $dt_{1}+t_{2}/m$ and a random variable $T_{i}^{(d,m)}$ with distribution(27) gives the expression when $\lambda_{1}/d\neq m\lambda_{2}$ . When $\lambda_{1}/d=m\lambda_{2}$ , $T_{i}^{(d,m)}$ is an Erlang random variable with parameters $2$ and $m\lambda_{2}$ .

Since $T_{i}^{(d,m)},i\in[n]$ are i.i.d. and we only need to wait for the first $n-s$ workers to return their results, the total runtime of the whole task is

where the random variable $T_{d,s,m}$ has distribution

Here $T_{d,s,m}$ is the $(n-s)$ th order statistics of the distribution (27). Since (5) depicts the optimal tradeoff between $d,s$ and $m$ , we should choose these three parameters to achieve (5) with equality in order to minimize $T_{\operatorname{tot}}$ . In other words, we should always set $s=d-m$ .

To understand how the choice of $(d,s,m)$ affects the total runtime, let us first consider two extreme cases.

Computation time is dominant: Assume that $\lambda_{1}\ll\lambda_{2}$ and $t_{1}\gg t_{2}$ , so that we can ignore the communication time. Obviously we should set $m=1$ and therefore $s=d-1$ . In this case, $T_{d,d-1,1}$ is the $(n-d+1)$ th order statistics of $n$ i.i.d exponential random variables with distribution $\operatorname{Exp}(\lambda_{1}/d)$ . Consequently,

The proof of this proposition is given in Appendix C.

Communication time is dominant: Assume that $\lambda_{1}\gg\lambda_{2}$ and $t_{1}\ll t_{2}$ , so that we can ignore the computation time. Obviously we should set $d=n$ and therefore $s=n-m$ . In this case, $T_{n,n-m,m}$ is the $m$ th order statistics of $n$ i.i.d exponential random variables with distribution $\operatorname{Exp}(m\lambda_{2})$ . Consequently,

For a fixed value of $n$ , if $t_{2}\gg\frac{1}{\lambda_{2}}$ , then the optimal choice is $m=n$ . On the other hand, if $t_{2}\ll\frac{1}{\lambda_{2}}$ , then the optimal choice is $m=1$ .

The optimal ratio $\alpha$ between the communication cost reduction factor $m$ and the number of workers $n$ is the unique root of the following equation

Note that for any given positive $\lambda_{2}$ and $t_{2}$ , the equation above has a unique root in the open interval $(0,1)$ . The proof of this proposition is given in Appendix D

When computation time and communication time are comparable, accurate analysis of (29) becomes more difficult. Here we use a numerical example to illustrate the advantages of our new proposal. According to (28) and (29), when $\lambda_{1}/d\neq m\lambda_{2}$ ,

We can see that $d=4,m=3$ is the optimal choice, whose total runtime is $21.3697$ . The runtime for uncoded scheme ( $d=m=1$ ) is $36.1138$ , and the best achievable runtime for the coding schemes in is $24.1063$ (d=8,m=1). Therefore our coding scheme outperforms the uncoded scheme by $41\%$ and outperforms the schemes in by $11\%$ .

Next we investigate how the optimal triple $(d,s,m)$ varies with the values of $\lambda_{1},\lambda_{2},t_{1},t_{2}$ . First we fix $\lambda_{1},t_{1}$ , and let $\lambda_{2},t_{2}$ vary. In the following table we take $n=k=10,\lambda_{1}=0.6$ and $t_{1}=1.5$ . The optimal triple $(d,s,m)$ for different values of $\lambda_{2}$ and $t_{2}$ is recorded in the table.

We can see that $m$ typically increases with $t_{2}$ . At the same time, $d$ decreases when we increase the value of $\lambda_{2}$ .

In the following table we fix $\lambda_{2},t_{2}$ , and let $\lambda_{1},t_{1}$ vary. More specifically, we take $n=k=10,\lambda_{2}=0.1$ and $t_{2}=6$ . The optimal triple $(d,s,m)$ for different values of $\lambda_{1}$ and $t_{1}$ is recorded in the table.

We can see that for a fixed $\lambda_{1}$ , $s$ decreases with $t_{1}$ .

Appendix A Converse proof of Theorem 1

Assume that $(d,s,m)$ is achievable, and let us prove (4). We first prove the following claim:

For every $i\in[k]$ , data subset $D_{i}$ must be assigned to at least $s+m$ workers.

The proof goes by contradiction. Suppose for some $j\in[k]$ , $D_{j}$ is assigned to $a<s+m$ workers. Without loss of generality we assume these $a$ workers are $W_{1},W_{2},\dots,W_{a}$ . Now suppose that $W_{1},W_{2},\dots,W_{s}$ are the $s$ stragglers. According to Definition 1, we should be able to calculate $g_{1}+g_{2}+\dots+g_{k}$ from $f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}),i\in\{s+1,s+2,\dots,n\}$ .

Observe that we can calculate the $l$ -dimensional vector $g_{j}$ from the following set of vectors $\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{g_{1}+g_{2}+\dots+g_{k}\}$ . Therefore $g_{j}$ can also be calculated from $\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq n\}$ . Since $D_{j}$ is only assigned to the first $a$ workers, the values of $\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):a+1\leq i\leq n\}$ are determined by $\{g_{i}:i\in[k]\setminus\{j\}\}$ . As a result, $g_{j}$ can also be calculated from $\{g_{i}:i\in[k]\setminus\{j\}\}\cup\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\}$ . Since $f_{i},i\in[n]$ are all linear functions (see condition 3 of Definition 1), we further deduce that $\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\}$ must contain at least $l$ linear combinations of the coordinates of $g_{j}$ . On the other hand, each $f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}})$ is a vector of dimension $l/m$ , so $\{f_{i}(g_{i_{1}},g_{i_{2}},\dots,g_{i_{d}}):s+1\leq i\leq a\}$ contains at most $\frac{l}{m}(a-s)$ linear combinations of the coordinates of $g_{j}$ . As a result, we conclude that $\frac{l}{m}(a-s)\geq l$ , i.e., $a\geq s+m$ , which gives a contradiction. This completes the proof of Claim 1. ∎

By Claim 1, each data subset is assigned to at least $s+m$ workers, so in total there are at least $k(s+m)$ data subsets (counting with repetitions) assigned to all $n$ workers. Therefore each worker is assigned with at least $\frac{k}{n}(s+m)$ data subsets. Thus we have $d\geq\frac{k}{n}(s+m)$ . This completes the proof of (4).

Appendix B Differences between our results and the results in [14]

Below we state the main result of . Note that we change their notation to comply with ours.

We first show that this theorem gives a coding scheme to achieve (4) with equality for the special case $m=1$ , i.e., the case considered in . We set $m=1$ and $k^{\prime}=k$ in Theorem 3, where $k$ is the number of data subsets in our problem. Moreover, we set $a_{1}$ to be the all one vector. In our gradient coding problem, we want to calculate $\sum_{i=1}^{k}g_{i}=a_{i}[\begin{array}[]{cccc}g_{1}&g_{2}&\dots&g_{k}\end{array}]^{T}$ . (Recall that $g_{1},g_{2},\dots,g_{k}$ are column vectors of dimension $l$ .) We denote the $i$ th row of $Q$ in Theorem 3 as $q_{i}$ . We claim that each worker $W_{i}$ only needs to send the $l$ -dimensional vector $q_{i}[\begin{array}[]{cccc}g_{1}&g_{2}&\dots&g_{k}\end{array}]^{T}$ to the master node. Indeed, Theorem 3 indicates that any $(n-s)$ rows of $Q$ suffice to generate $a_{1}$ , so this coding scheme can tolerate any $s$ stragglers. Moreover, since the number of nonzero entries in each $q_{i}$ is at most $\frac{k}{n}(s+1)$ , each worker only needs to be assigned with at most $\frac{k}{n}(s+1)$ data subsets. Therefore $d=\frac{k}{n}(s+1)$ , achieving (4) with equality for the case $m=1$ .

Next we argue that for $m>1$ , Theorem 3 cannot give coding schemes achieving (4) with equality. For this case, in order to use Theorem 3 for gradient coding, one needs to set $k^{\prime}=km$ . Moreover, for $u\in[m]$ , we should set $a_{u}$ to be the $u$ th row of the $m\times(mn)$ matrix $[\begin{array}[]{cccc}I_{m}&I_{m}&\dots&I_{m}\end{array}]$ . For every $v=0,1,\dots,l/m-1$ and $j\in[k]$ , define an $m$ -dimensional vector

For every $v=0,1,\dots,l/m-1$ , define an $(mk)$ -dimensional vector

Notice that the coordinates of the sum vector $g_{1}+g_{2}+\dots+g_{n}$ form the following set:

Let each worker $W_{i}$ return the following $(l/m)$ -dimensional vector $[\begin{array}[]{cccc}q_{i}z_{0}^{T}&q_{i}z_{1}^{T}&\dots&q_{i}z_{l/m-1}^{T}\end{array}]$ . Since any $(n-s)$ rows of $Q$ suffice to generate $a_{1},a_{2},\dots,a_{m}$ , one can calculate the elements in the set (32) from the returned results of any $(n-s)$ workers and therefore recover the sum vector $g_{1}+g_{2}+\dots+g_{n}$ . By Theorem 3, each $q_{i}$ has at most $\frac{mk}{n}(s+m)$ nonzero entries. Now let us explain how the nonzero entries of $q_{i}$ correspond to the data subsets assigned to worker $W_{i}$ , which is the reason why Theorem 3 fails to give a gradient coding algorithm for $m>1$ . Let us further write $q_{i}=(q_{i,1},q_{i,2},\dots,q_{i,mk})$ . By definition of $z_{v}$ , for any $u\in[m]$ and $j\in[k]$ , if $q_{i,(j-1)m+u}\neq 0$ , then worker $W_{i}$ needs the value of $g_{j}(vm+u-1)$ to calculate $q_{i}z_{v}^{T}$ , i.e., data subset $D_{j}$ should be assigned to $W_{i}$ . Thus we conclude that the number of data subsets assigned to $W_{i}$ is equal to

In order to achieve (4), we need the quantity in (33) to be no larger than $\frac{k}{n}(s+m)$ for all $i\in[n]$ . If this is the case, then each $q_{i}$ has at most $\frac{mk}{n}(s+m)$ nonzero entries, which is the condition in Theorem 3. However, the condition in Theorem 3 does not imply that the quantity in (33) is at most $\frac{k}{n}(s+m)$ . Thus the constraint in our problem is stronger than the constraint in Theorem 3, so the coding scheme in does not apply to our problem for the case $m>1$ .

Appendix C Proof of Proposition 1

The proposition follows immediately once we show that the optimal value of $d$ (we denote it $d^{*}$ ) can only be $1$ or $n$ . We prove by contradiction. Suppose that $1<d^{*}<n$ , then by (30) we have

which implies that $\sum_{i=d^{*}}^{n}\frac{1}{i}<\sum_{i=d^{*}+1}^{n}\frac{1}{i}$ , but this is impossible. Therefore we conclude that $d^{*}$ can only be $1$ or $n$ , and a simple comparison between these two gives the result in Proposition 1.

Appendix D Proof of Proposition 2

According to (31), we want to minimize the following function

Clearly $h_{1}^{\prime}(\alpha)>0$ for all $0<\alpha<1$ . Since $h_{1}(0)=-t_{2}<0$ and $h_{1}(1^{-})=+\infty$ , the equation $h_{1}(\alpha)=0$ has a unique solution $\alpha^{*}$ in the open interval $(0,1)$ . Moreover, since $h_{1}(\alpha)>h_{1}(\alpha^{*})=0$ for all $\alpha>\alpha^{*}$ and $h_{1}(\alpha)<h_{1}(\alpha^{*})=0$ for all $\alpha<\alpha^{*}$ , we also have $h^{\prime}(\alpha)>0$ for all $\alpha>\alpha^{*}$ and $h^{\prime}(\alpha)<0$ for all $\alpha<\alpha^{*}$ . Consequently, $\alpha^{*}$ minimizes $h(\alpha)$ , and this completes the proof of Proposition 2.