Communication Complexity of Distributed Convex Learning and Optimization

Yossi Arjevani, Ohad Shamir

Introduction

The main challenge in solving such problems is that communication between the different machines is usually slow and constrained, at least compared to the speed of local processing. On the other hand, the datasets involved in distributed learning are usually large and high-dimensional. Therefore, machines cannot simply communicate their entire data to each other, and the question is how well can we solve problems such as Eq. (1) using as little communication as possible.

As datasets continue to increase in size, and parallel computing platforms becoming more and more common (from multiple cores on a single CPU to large-scale and geographically distributed computing grids), distributed learning and optimization methods have been the focus of much research in recent years, with just a few examples including . Most of this work studied algorithms for this problem, which provide upper bounds on the required time and communication complexity.

In this paper, we take the opposite direction, and study what are the fundamental performance limitations in solving Eq. (1), under several different sets of assumptions. We identify cases where existing algorithms are already optimal (at least in the worst-case), as well as cases where room for further improvement is still possible.

Since a major constraint in distributed learning is communication, we focus on studying the amount of communication required to optimize Eq. (1) up to some desired accuracy $\epsilon$ . More precisely, we consider the number of communication rounds that are required, where in each communication round the machines can generally broadcast to each other information linear in the problem’s dimension $d$ (e.g. a point in $\mathcal{W}$ or a gradient). This applies to virtually all algorithms for large-scale learning we are aware of, where sending vectors and gradients is feasible, but computing and sending larger objects, such as Hessians ( $d\times d$ matrices) is not.

Our results pertain to several possible settings (see Sec. 2 for precise definitions). First, we distinguish between the local functions being merely convex or strongly-convex, and whether they are smooth or not. These distinctions are standard in studying optimization algorithms for learning, and capture important properties such as the regularization and the type of loss function used. Second, we distinguish between a setting where the local functions are related – e.g., because they reflect statistical similarities in the data residing at different machines – and a setting where no relationship is assumed. For example, in the extreme case where data was split uniformly at random between machines, one can show that quantities such as the values, gradients and Hessians of the local functions differ only by $\delta=\mathcal{O}(1/\sqrt{n})$ , where $n$ is the sample size per machine, due to concentration of measure effects. Such similarities can be used to speed up the optimization/learning process, as was done in e.g. . Both the $\delta$ -related and the unrelated setting can be considered in a unified way, by letting $\delta$ be a parameter and studying the attainable lower bounds as a function of $\delta$ . Our results can be summarized as follows:

First, we define a mild structural assumption on the algorithm (which is satisfied by reasonable approaches we are aware of), which allows us to provide the lower bounds described below on the number of communication rounds required to reach a given suboptimality $\epsilon$ .

When the local functions can be unrelated, we prove a lower bound of $\Omega(\sqrt{1/\lambda}\log(1/\epsilon))$ for smooth and $\lambda$ -strongly convex functions, and $\Omega(\sqrt{1/\epsilon})$ for smooth convex functions. These lower bounds are matched by a straightforward distributed implementation of accelerated gradient descent. In particular, the results imply that many communication rounds may be required to get a high-accuracy solution, and moreover, that no algorithm satisfying our structural assumption would be better, even if we endow the local machines with unbounded computational power. For non-smooth functions, we show a lower bound of $\Omega(\sqrt{1/\lambda\epsilon})$ for $\lambda$ -strongly convex functions, and $\Omega(1/\epsilon)$ for general convex functions. Although we leave a full derivation to future work, it seems these lower bounds can be matched in our framework by an algorithm combining acceleration and Moreau proximal smoothing of the local functions.

When the local functions are related (as quantified by the parameter $\delta$ ), we prove a communication round lower bound of $\Omega(\sqrt{\delta/\lambda}\log(1/\epsilon))$ for smooth and $\lambda$ -strongly convex functions. For quadratics, this bound is matched by (up to constants and logarithmic factors) by the recently-proposed DISCO algorithm . However, getting an optimal algorithm for general strongly convex and smooth functions in the $\delta$ -related setting, let alone for non-smooth or non-strongly convex functions, remains open.

We also study the attainable performance without posing any structural assumptions on the algorithm, but in the more restricted case where only a single round of communication is allowed. We prove that in a broad regime, the performance of any distributed algorithm may be no better than a ‘trivial’ algorithm which returns the minimizer of one of the local functions, as long as the number of bits communicated is less than $\Omega(d^{2})$ . Therefore, in our setting, no communication-efficient 1-round distributed algorithm can provide non-trivial performance in the worst case.

There have been several previous works which considered lower bounds in the context of distributed learning and optimization, but to the best of our knowledge, none of them provide a similar type of results. Perhaps the most closely-related paper is , which studied the communication complexity of distributed optimization, and showed that $\Omega(d\log(1/\epsilon))$ bits of communication are necessary between the machines, for $d$ -dimensional convex problems. However, in our setting this does not lead to any non-trivial lower bound on the number of communication rounds (indeed, just specifying a $d$ -dimensional vector up to accuracy $\epsilon$ required $\mathcal{O}(d\log(1/\epsilon))$ bits). More recently, considered lower bounds for certain types of distributed learning problems, but not convex ones in an agnostic distribution-free framework. In the context of lower bounds for one-round algorithms, the results of imply that $\Omega(d^{2})$ bits of communication are required to solve linear regression in one round of communication. However, that paper assumes a different model than ours, where the function to be optimized is not split among the machines as in Eq. (1), where each $F_{i}$ is convex. Moreover, issues such as strong convexity and smoothness are not considered. proves an impossibility result for a one-round distributed learning scheme, even when the local functions are not merely related, but actually result from splitting data uniformly at random between machines. On the flip side, that result is for a particular algorithm, and doesn’t apply to any possible method.

Finally, we emphasize that distributed learning and optimization can be studied under many settings, including ones different than those studied here. For example, one can consider distributed learning on a stream of i.i.d. data , or settings where the computing architecture is different, e.g. where the machines have a shared memory, or the function to be optimized is not split as in Eq. (1). Studying lower bounds in such settings is an interesting topic for future work.

Notation and Framework

The only vector and matrix norms used in this paper are the Euclidean norm and the spectral norm, respectively. $\mathbf{e}_{j}$ denotes the $j$ -th standard unit vector. We let $\nabla G(\mathbf{w})$ and $\nabla^{2}G(\mathbf{w})$ denote the gradient and Hessians of a function $G$ at $\mathbf{w}$ , if they exist. $G$ is smooth (with parameter $L$ ) if it is differentiable and the gradient is $L$ -Lipschitz. In particular, if $\mathbf{w}^{*}=\arg\min_{\mathbf{w}\in\mathcal{W}}G(\mathbf{w})$ , then $G(\mathbf{w})-G(\mathbf{w}^{*})\leq\frac{L}{2}\left\|\mathbf{w}-\mathbf{w}^{*}\right\|^{2}$ . $G$ is strongly convex (with parameter $\lambda$ ) if for any $\mathbf{w},\mathbf{w}^{\prime}\in\mathcal{W}$ , $G(\mathbf{w}^{\prime})\geq G(\mathbf{w})+\langle\mathbf{g},\mathbf{w}^{\prime}-\mathbf{w}\rangle+\frac{\lambda}{2}\left\|\mathbf{w}^{\prime}-\mathbf{w}\right\|^{2}$ where $\mathbf{g}\in\partial G(\mathbf{w}^{\prime})$ is a subgradient of $G$ at $\mathbf{w}$ . In particular, if $\mathbf{w}^{*}=\arg\min_{\mathbf{w}\in\mathcal{W}}G(\mathbf{w})$ , then $G(\mathbf{w})-G(\mathbf{w}^{*})\geq\frac{\lambda}{2}\left\|\mathbf{w}-\mathbf{w}^{*}\right\|^{2}$ . Any convex function is also strongly-convex with $\lambda=0$ . A special case of smooth convex functions are quadratics, where $G(\mathbf{w})=\mathbf{w}^{\top}A\mathbf{w}+\mathbf{b}^{\top}\mathbf{w}+c$ for some positive semidefinite matrix $A$ , vector $\mathbf{b}$ and scalar $c$ . In this case, $\lambda$ and $L$ correspond to the smallest and largest eigenvalues of $A$ .

We model the distributed learning algorithm as an iterative process, where in each round the machines may perform some local computations, followed by a communication round where each machine broadcasts a message to all other machines. We make no assumptions on the computational complexity of the local computations. After all communication rounds are completed, a designated machine provides the algorithm’s output (possibly after additional local computation).

In this model, our main question is the following: How many rounds of communication are necessary in order to solve problems such as Eq. (1) to some given accuracy $\epsilon$ ?

As discussed in the introduction, we first need to distinguish between different assumptions on the possible relation between the local functions. One natural situation is when no significant relationship can be assumed, for instance when the data is arbitrarily split or is gathered by each machine from statistically dissimilar sources. We denote this as the unrelated setting. However, this assumption is often unnecessarily pessimistic. Often the data allocation process is more random, or we can assume that the different data sources for each machine have statistical similarities (to give a simple example, consider learning from users’ activity across a geographically distributed computing grid, each servicing its own local population). We will capture such similarities, in the context of quadratic functions, using the following definition:

are $\delta$ -related, if for any $i,j\in\{1\ldots k\}$ , it holds that

We end this section with an important remark. In this paper, we prove lower bounds for the $\delta$ -related setting, which includes as a special case the commonly-studied setting of randomly partitioned data (in which case $\delta=\mathcal{O}(1/\sqrt{n})$ ). However, our bounds do not apply for random partitioning, since they use $\delta$ -related constructions which do not correspond to randomly partitioned data. In fact, very recent work has cleverly shown that for randomly partitioned data, and for certain reasonable regimes of strong convexity and smoothness, it is actually possible to get better performance than what is indicated by our lower bounds. However, this encouraging result crucially relies on the random partition property, and in parameter regimes which limit how much each data point needs to be “touched”, hence preserving key statistical independence properties. We suspect that it may be difficult to improve on our lower bounds under substantially weaker assumptions.

Lower Bounds Using a Structural Assumption

In this section, we present lower bounds on the number of communication rounds, where we impose a certain mild structural assumption on the operations performed by the algorithm. Roughly speaking, our lower bounds pertain to a very large class of algorithms, which are based on linear operations involving points, gradients, and vector products with local Hessians and their inverses, as well as solving local optimization problems involving such quantities. At each communication round, the machines can share any of the vectors they have computed so far. Formally, we consider algorithms which satisfy the assumption stated below. For convenience, we state it for smooth functions (which are differentiable) and discuss the case of non-smooth functions in Sec. 3.2.

for some $\gamma,\nu\geq 0$ such that $\gamma+\nu>0$ . After every communication round, let $W_{j}:=\cup_{i=1}^{m}W_{i}$ for all $j$ . The algorithm’s final output (provided by the designated machine $j$ ) is a point in the span of $W_{j}$ .

This assumption requires several remarks:

Note that $W_{j}$ is not an explicit part of the algorithm: It simply includes all points computed by machine $j$ so far, or communicated to it by other machines, and is used to define the set of new points which the machine is allowed to compute.

The assumption bears some resemblance – but is far weaker – than standard assumptions used to provide lower bounds for iterative optimization algorithms. For example, a common assumption (see ) is that each computed point $\mathbf{w}$ must lie in the span of the previous gradients. This corresponds to a special case of Assumption 1, where $\gamma=1,\nu=0$ , and the span is only over gradients of previously computed points. Moreover, it also allows (for instance) exact optimization of each local function, which is a subroutine in some distributed algorithms (e.g. ), by setting $\gamma=0,\nu=1$ and computing a point $\mathbf{w}$ satisfying $\gamma\mathbf{w}+\nu\nabla F_{j}(\mathbf{w})=\mathbf{0}$ . By allowing the span to include previous gradients, we also incorporate algorithms which perform optimization of the local function plus terms involving previous gradients and points, such as , as well as algorithms which rely on local Hessian information and preconditioning, such as . In summary, the assumption is satisfied by most techniques for black-box convex optimization that we are aware of. Finally, we emphasize that we do not restrict the number or computational complexity of the operations performed between communication rounds.

The requirement that $\gamma,\nu\geq 0$ is to exclude algorithms which solve non-convex local optimization problems of the form $\min_{\mathbf{w}}F_{j}(\mathbf{w})+\gamma\left\|\mathbf{w}\right\|^{2}$ with $\gamma<0$ , which are unreasonable in practice and can sometimes break our lower bounds.

The assumption that $W_{j}$ is initially $\{\mathbf{0}\}$ (namely, that the algorithm starts from the origin) is purely for convenience, and our results can be easily adapted to any other starting point by shifting all functions accordingly.

The techniques we employ in this section are inspired by lower bounds on the iteration complexity of first-order methods for standard (non-distributed) optimization (see for example ). These are based on the construction of ‘hard’ functions, where each gradient (or subgradient) computation can only provide a small improvement in the objective value. In our setting, the dynamics are roughly similar, but the necessity of many gradient computations is replaced by many communication rounds. This is achieved by constructing suitable local functions, where at any time point no individual machine can ‘progress’ on its own, without information from other machines.

We begin by presenting a lower bound when the local functions $F_{i}$ are strongly-convex and smooth:

if $\lambda>0$ , and at least $\sqrt{\frac{3\delta}{32\epsilon}}\left\|\mathbf{w}^{*}\right\|-2$ if $\lambda=0$ .

The assumption of $m$ being even is purely for technical convenience, and can be discarded at the cost of making the proof slightly more complex. Also, note that $m$ does not appear explicitly in the bound, but may appear implicitly, via $\delta$ (for example, in a statistical setting $\delta$ may depend on the number of data points per machine, and may be larger if the same dataset is divided to more machines).

Let us contrast our lower bound with some existing algorithms and guarantees in the literature. First, regardless of whether the local functions are similar or not, we can always simulate any gradient-based method designed for a single machine, by iteratively computing gradients of the local functions, and performing a communication round to compute their average. Clearly, this will be a gradient of the objective function $F(\cdot)=\frac{1}{m}\sum_{i=1}^{m}F_{i}(\cdot)$ , which can be fed into any gradient-based method such as gradient descent or accelerated gradient descent . The resulting number of required communication rounds is then equal to the number of iterations. In particular, using accelerated gradient descent for smooth and $\lambda$ -strongly convex functions yields a round complexity of $\mathcal{O}(\sqrt{1/\lambda}\log(\left\|\mathbf{w}^{*}\right\|^{2}/\epsilon))$ , and $\mathcal{O}(\left\|\mathbf{w}^{*}\right\|\sqrt{1/\epsilon})$ for smooth convex functions. This matches our lower bound (up to constants and log factors) when the local functions are unrelated ( $\delta=\Omega(1)$ ).

The full proof of Thm. 1 appears in the supplementary material, and is based on the following idea: For simplicity, suppose we have two machines, with local functions $F_{1},F_{2}$ defined as follows,

It is easy to verify that for $\delta,\lambda\leq 1$ , both $F_{1}(\mathbf{w})$ and $F_{2}(\mathbf{w})$ are $1$ -smooth and $\lambda$ -strongly convex, as well as $\delta$ -related. Moreover, the optimum of their average is a point $\mathbf{w}^{*}$ with non-zero entries at all coordinates. However, since each local functions has a block-diagonal quadratic term, it can be shown that for any algorithm satisfying Assumption 1, after $T$ communication rounds, the points computed by the two machines can only have the first $T+1$ coordinates non-zero. No machine will be able to further ‘progress’ on its own, and cause additional coordinates to become non-zero, without another communication round. This leads to a lower bound on the optimization error which depends on $T$ , resulting in the theorem statement after a few computations.

2 Non-smooth Local Functions

Remaining in the framework of algorithms satisfying Assumption 1, we now turn to discuss the situation where the local functions are not necessarily smooth or differentiable. For simplicity, our formal results here will be in the unrelated setting, and we only informally discuss their extension to a $\delta$ -related setting (in a sense relevant to non-smooth functions). Formally defining $\delta$ -related non-smooth functions is possible but not altogether trivial, and is therefore left to future work.

We adapt Assumption 1 to the non-smooth case, by allowing gradients to be replaced by arbitrary subgradients at the same points. Namely, we replace Eq. (2) by the requirement that for some $\mathbf{g}\in\partial F_{j}(\mathbf{w})$ , and $\gamma,\nu\geq 0,\gamma+\nu>0$ ,

The lower bound for this setting is stated in the following theorem.

As in Thm. 1, we note that the assumption of even $m$ is for technical convenience.

This theorem, together with Thm. 1, implies that both strong convexity and smoothness are necessary for the number of communication rounds to scale logarithmically with the required accuracy $\epsilon$ . We emphasize that this is true even if we allow the machines unbounded computational power, to perform arbitrarily many operations satisfying Assumption 1. Moreover, a preliminary analysis indicates that performing accelerated gradient descent on smoothed versions of the local functions (using Moreau proximal smoothing, e.g. ), can match these lower bounds up to log factorsRoughly speaking, for any $\gamma>0$ , this smoothing creates a $\frac{1}{\gamma}$ -smooth function which is $\gamma$ -close to the original function. Plugging these into the guarantees of accelerated gradient descent and tuning $\gamma$ yields our lower bounds. Note that, in order to execute this algorithm each machine must be sufficiently powerful to obtain the gradient of the Moreau envelope of its local function, which is indeed the case in our framework.. We leave a full formal derivation (which has some subtleties) to future work.

The full proof of Thm. 2 appears in the supplementary material. The proof idea relies on the following construction: Assume that we fix the number of communication rounds to be $T$ , and (for simplicity) that $T$ is even and the number of machines is $2$ . Then we use local functions of the form

where $b$ is a suitably chosen parameter. It is easy to verify that both local functions are $\lambda$ -strongly convex and $(1+\lambda)$ -Lipschitz continuous over the unit Euclidean ball. Similar to the smooth case, we argue that after $T$ communication rounds, the resulting points $\mathbf{w}$ computed by machine $1$ will be non-zero only on the first $T+1$ coordinates, and the points $\mathbf{w}$ computed by machine $2$ will be non-zero only on the first $T$ coordinates. As in the smooth case, these functions allow us to ’control’ the progress of any algorithm which satisfies Assumption 1.

Finally, although the result is in the unrelated setting, it is straightforward to have a similar construction in a ‘ $\delta$ -related’ setting, by multiplying $F_{1}$ and $F_{2}$ by $\delta$ . The resulting two functions have their gradients and subgradients at most $\delta$ -different from each other, and the construction above leads to a lower bound of $\Omega(\delta/\epsilon)$ for convex Lipschitz functions, and $\Omega(\delta\sqrt{1/\lambda\epsilon})$ for $\lambda$ -strongly convex Lipschitz functions. In terms of upper bounds, we are actually unaware of any relevant algorithm in the literature adapted to such a setting, and the question of attainable performance here remains wide open.

One Round of Communication

In this section, we study what lower bounds are attainable without any kind of structural assumption (such as Assumption 1). This is a more challenging setting, and the result we present will be limited to algorithms using a single round of communication round. We note that this still captures a realistic non-interactive distributed computing scenario, where we want each machine to broadcast a single message, and a designated machine is then required to produce an output. In the context of distributed optimization, a natural example is a one-shot averaging algorithm, where each machine optimizes its own local data, and the resulting points are averaged (e.g. ).

Intuitively, with only a single round of communication, getting an arbitrarily small error $\epsilon$ may be infeasible. The following theorem establishes a lower bound on the attainable error, depending on the strong convexity parameter $\lambda$ and the similarity measure $\delta$ between the local functions, and compares this with a ‘trivial’ zero-communication algorithm, which just returns the optimum of a single local function:

The point $\hat{\mathbf{w}}$ returned by the algorithm satisfies

in expectation over the algorithm’s randomness.

The theorem shows that unless the communication budget is extremely large (quadratic in the dimension), there are functions which cannot be optimized to non-trivial accuracy in one round of communication, in the sense that the same accuracy (up to a universal constant) can be obtained with a ‘trivial’ solution where we just return the optimum of a single local function. This complements an earlier result in , which showed that a particular one-round algorithm is no better than returning the optimum of a local function, under the stronger assumption that the local functions are not merely $\delta$ -related, but are actually the average loss over some randomly partitioned data.

The full proof appears in the supplementary material, but we sketch the main ideas below. As before, focusing on the case of two machines, and assuming machine $2$ is responsible for providing the output, we use

where $M$ is essentially a randomly chosen $\{-1,+1\}$ -valued $d\times d$ symmetric matrix with spectral norm at most $c\sqrt{d}$ , and $c$ is a suitable constant. These functions can be shown to be $\delta$ -related as well as $\lambda$ -strongly convex. Moreover, the optimum of $F(\mathbf{w})=\frac{1}{2}(F_{1}(\mathbf{w})+F_{2}(\mathbf{w}))$ equals

Thus, we see that the optimal point $\mathbf{w}^{*}$ depends on the $j$ -th column of $M$ . Intuitively, the machines need to approximate this column, and this is the source of hardness in this setting: Machine $1$ knows $M$ but not $j$ , yet needs to communicate to machine $2$ enough information to construct its $j$ -th column. However, given a communication budget much smaller than the size of $M$ (which is $d^{2}$ ), it is difficult to convey enough information on the $j$ -th column without knowing what $j$ is. Carefully formalizing this intuition, and using some information-theoretic tools, allows us to prove the first part of Thm. 3. Proving the second part of Thm. 3 is straightforward, using a few computations.

Summary and Open Questions

In this paper, we studied lower bounds on the number of communication rounds needed to solve distributed convex learning and optimization problems, under several different settings. Our results indicate that when the local functions are unrelated, then regardless of the local machines’ computational power, many communication rounds may be necessary (scaling polynomially with $1/\epsilon$ or $1/\lambda$ ), and that the worst-case optimal algorithm (at least for smooth functions) is just a straightforward distributed implementation of accelerated gradient descent. When the functions are related, we show that the optimal performance is achieved by the algorithm of for quadratic and strongly convex functions, but designing optimal algorithms for more general functions remains open. Beside these results, which required a certain mild structural assumption on the algorithm employed, we also provided an assumption-free lower bound for one-round algorithms, which implies that even for strongly convex quadratic functions, such algorithms can sometimes only provide trivial performance.

Besides the question of designing optimal algorithms for the remaining settings, several additional questions remain open. First, it would be interesting to get assumption-free lower bounds for algorithms with multiple rounds of communication. Second, our work focused on communication complexity, but in practice the computational complexity of the local computations is no less important. Thus, it would be interesting to understand what is the attainable performance with simple, runtime-efficient algorithms. Finally, it would be interesting to study lower bounds for other distributed learning and optimization scenarios.

This research is supported in part by an FP7 Marie Curie CIG grant, the Intel ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank Nati Srebro for several helpful discussions and insights.

References

Appendix A Proofs

The proof of the theorem is based on splitting the machines into two sub-groups of the same size, each of which is assigned with a finite dimensional restriction of $F_{1}$ and $F_{2}$ (see Eq. (3)), and tracing the maximal number of non-zero coordinates for vectors in $W_{j}$ , the set of feasible points.

Recall that $F_{i}$ are defined as follows:

Note that $[F_{i}]_{d}(\mathbf{w})$ and $[F]_{d}(\mathbf{w})$ produce the same values as $F_{i}(\mathbf{w})$ and $F(\mathbf{w})$ do for vectors such that $\mathbf{w}_{i}=0$ for all $i\geq d$ . Similarly, we define the $d\times d$ leading principal submatrix of $A_{i}$ by $[A_{i}]_{d}$ .

We assign half of the machines with $[F_{1}]_{d}$ , and the other half with $[F_{2}]_{d}$ . To prove the theorem, we need the following lemma, which formalizes the intuition described in the main paper. Let

Suppose all the sets of feasible points satisfy $W_{j}\subseteq E_{T,d}$ for some $T\leq d-1$ , then under assumption 1, right after the next communication round we have $W_{j}\subseteq E_{T+1,d}$ .

Recall that by Assumption 1, each machine can compute new points $\mathbf{w}$ that satisfy the following for some $\gamma,\nu\geq 0$ such that $\gamma+\nu>0$ :

We now analyze the state of the sets of feasible points prior to the next communication round. Assume that $T$ is an odd number, i.e., assume $T=2k+1$ for some $k=0,1,\dots$ . The proof for the case where $T$ is even follows similar lines. Note that for any $\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in W_{j}$ , we have

For any viable diagonal matrix D. Therefore, since $W_{j}\subseteq E_{2k+1,d}$ , we have that the first point generated by machines which hold $[F_{1}]_{d}(\mathbf{w})$ must satisfy

for $\gamma,\nu$ as stated in the assumption. That is,

Since $[A_{1}]_{d}$ is positive semidefinite, it holds that $H$ is invertible. Also, $[A_{1}]_{d},H$ and $H^{-1}$ admit the same partitions into $1\times 1$ and $2\times 2$ blocks on the diagonal, thus $H^{-1}E_{2k+1,d}\subseteq E_{2k+1,d}$ , yielding $\mathbf{w}\in E_{2k+1,d}$ . Inductively extending the latter argument shows that, in the absence of any communication rounds, all the machines whose local function is $[F_{1}]_{d}(\mathbf{w})$ are ‘stuck’ in $E_{2k+1,d}$ . As for machines which contain $[F_{2}]_{d}(\mathbf{w})$ , we have that for all $\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in W_{j}$

For any viable diagonal matrix D. Therefore, the first generated point by these machines must satisfy,

Similarly to the previous case this implies that $\mathbf{w}\in E_{2k+2,d}$ . It is now left to show that these machines cannot make further progress beyond $E_{2k+2,d}$ without communicating. To see this, note that for all $\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in E_{2k+2,d}$ we have,

This means that all the points which are generated subsequently also lie in $E_{2k+2,d}$ , i.e., without communicating , machines whose local function is $[F_{2}]_{d}(\mathbf{w})$ are stuck in $E_{2k+2,d}$ . Finally, executing a communication round updates all the sets of feasible points to be $W_{j}:=E_{2k+2,d}$ . ∎

The following is a direct consequence of a recursive application of Lemma 1.

Under assumption 1, after $T\leq d-1$ communication rounds we have

By first-order optimality condition for smooth convex functions, we have

The optimal solution can be now realized as a geometric sequence $(\zeta^{k})_{k=1}^{\infty}$ for some $\zeta$ as follows: By Eq. (5), we must have

Let $\mathbf{w}_{T}$ be some point which was obtained after $T\leq d-2$ communication rounds. To bound the sub-optimality of $\mathbf{w}_{T}$ from below, observe that

where the last equality follows from Corollary (1), according to which all the coordinates of $\mathbf{w}_{T}$ , except for the first $T+1\leq d-1$ , must vanish. To bound the $A$ term, note that

The fact that $F(\mathbf{w})$ is $\lambda$ -strongly convex implies

To bound the $B$ term from below, note that since $F$ is 1-smooth we have

Combining both lower bounds for the terms $A$ and $B$ , we get for any $T\leq d-2$

Picking $d$ sufficiently large, and considering how large the number of communication rounds $T$ must be to make this lower bound less than $\epsilon$ , we get

It is worth mentioning that by computing the exact minimizers of $[F_{i}]_{d}$ one may derive a lower bound such that the choice of $d$ does not depend on the parameters of the problem, except for the number of communication rounds. Nevertheless, such analysis requires a more involved reasoning which we find unnecessary for stating our results.

For the non-strongly convex case, where $\lambda=0$ , using Corollary 1 and a similar analysis (virtually identical to the proof of Theorem 2.1.7 in ), we have that if $T\leq\frac{1}{2}(d-1)$ , then

Therefore, to obtain an $\epsilon$ -suboptimal solution for this case, we must have at least

communication rounds, for sufficiently small $\epsilon$ .

A.2 Proof of Thm. 2

We construct two types of local functions, and provide one of them to $m/2$ of the machines, and the other function to the other $m/2$ machines, in some arbitrary order. In this case, the average function is simply the average of the two types of local functions.

We will first prove the theorem statement in the strongly convex case, where $\lambda>0$ is given, and then explain how to extract from it the result in the non strongly convex case.

Fix a natural number $k$ and some $b\in[0,1/\sqrt{k}]$ , to be specified later. We define the following local function over the unit ball:

otherwise. Being a sum of convex functions, both local functions are convex, and in fact $\lambda$ -strongly convex due to the $\frac{\lambda}{2}\left\|\mathbf{w}\right\|^{2}$ term. Furthermore, both function are $(1+\lambda)$ -Lipschitz continuous over the unit Euclidean ball. To see this, let $\partial(\cdot)$ denote the subdifferential operator and note that

Assume for a moment that $\lambda=0$ , then by the linearity of the sub-differential operator that

which shows that, for $\lambda=0$ , both functions are 1-Lipschitz. For $\lambda>0$ , note that $\frac{\lambda}{2}\left\|\mathbf{w}\right\|^{2}$ is $\lambda$ -Lipschitz over the unit ball and $\lambda$ -strongly convex. Therefore, using the linearity of the sub-differential operator again, we see that both $F_{i}$ are $(1+\lambda)$ -Lipschitz and $\lambda$ -strongly convex functions over the unit ball.

Similar to the smooth case, the following lemma shows that, no matter how the subgradients are chosen, at each iteration at most one non-zero coordinate may be gained.

Suppose all the sets of feasible points satisfy $W_{j}\subseteq E_{T,d}$ for some $T\leq d-1$ . Then under assumption 1, right after the next communication round we have $W_{j}\subseteq E_{T+1,d}$ .

Recall that by Assumption 1 (modified for the non-differentiable case), each machine can compute new points $\mathbf{w}$ that satisfy the following for some $\gamma,\nu\geq 0$ such that $\gamma+\nu>0$ :

For any viable diagonal matrix D. Therefore, we have that the first point generated by machines which hold $F_{1,k}$ must satisfy

for $\gamma,\nu$ as stated in Assumption (1). Note that, if $\nu=0$ (which by assumption means that $\gamma>0$ ) then clearly $\mathbf{w}\in E_{2p+1,d}$ . As for $\nu\neq 0$ , suppose by contradiction that $\mathbf{w}\notin E_{2p+1,d}$ . That is, assume that there exists some $j>2p+1$ such that $w[j]\neq 0$ . First, if the absolute value terms in $F_{1,k}$ do not involve $w[j]$ , e.g., when $j=d$ and $d$ is even, we have $\mathbf{g}_{1,k}(\mathbf{w})[j]=\lambda w[j]$ . In this case, by Eq. (9) we have

and since $\nu\lambda>0$ , this implies that $w[j]=0$ – a contradiction! Thus, it remains to consider the cases of either odd $d$ or $j\neq d$ . In both of these cases, $w[j]$ appears in one of the absolute value terms in $F_{1,k}$ , either as $|w[j-1]-w[j]|$ or $|w[j]-w[j+1]|$ (depending on whether $j$ is odd or even).

Let $l>p$ be such that either $2l=j$ or $2l+1=j$ , depending on the parity of $j$ . We note that any valid subgradient must satisfy

for some $\alpha\in$ , such that if $w[2l]-w[2l+1]\neq 0$ then

where $\operatorname{sgn}()$ is the sign function. Rearranging terms in Eq. (9) and using the facts that coordinates $2l,2l+1$ are always zero in $E_{2p+1,d}$ , as well as $\gamma+\nu\lambda\geq\nu\lambda>0$ , we get

contradicting Eq. (10). Hence, we must have $w[2l]-w[2l+1]=0$ , in which case Eq. (12) implies $\alpha=1/2$ . Thus, by Eq. (11)

which contradicts the assumption that either $w[j]$ (and hence $w[2l]$ or $w[2l+1]$ ) is not zero. Thus, we have shown that $\mathbf{w}\in E_{2p+1,d}$ , for the first point generated by machines holding $F_{1,k}$ . Repeating the argument, we get that any point generated by those machines, in the absence of any communication rounds, is ‘stuck’ in $E_{2p+1,d}$ .

We now turn to prove the claim for machines whose local function is $F_{2,k}$ , using an almost identical argument, which we provide below for completeness. For these functions, we assume that initially $W_{j}\subseteq E_{2p+1,d}$ , and will show any additional points computed locally by the machines must be in $E_{2p+2,d}$ . We begin by noting that for any $\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}$ in $E_{2p+2,d}$ (and in particular $E_{2p+1,d}$ ), it holds that

For any viable diagonal matrix D. Therefore, we have that the first point generated by machines which hold $F_{2,k}$ must satisfy

for $\gamma,\nu$ as stated in the assumption. Note that, if $\nu=0$ then clearly $\mathbf{w}\in E_{2p+2,d}$ . As for $\nu\neq 0$ , suppose by contradiction that $\mathbf{w}\notin E_{2p+2,d}$ . That is, assume that there exists some $j>2p+2$ such that $w[j]\neq 0$ . First, if the absolute value terms in $F_{2,k}$ do not involve $w[j]$ , e.g., when $j=d$ and $d$ is odd, we have $\mathbf{g}_{2,k}(\mathbf{w})[j]=\lambda w[j]$ . In this case, by Eq. (13) we have

and since $\nu\lambda>0$ , this implies that $w[j]=0$ – a contradiction! Thus, it remains to consider the cases of either even $d$ or $j\neq d$ . In both of these cases, $w[j]$ appears in one of the absolute value terms in $F_{2,k}$ , either as $|w[j-1]-w[j]|$ or $|w[j]-w[j+1]|$ (depending on whether $j$ is odd or even).

Let $l>p$ be such that either $2l+1=j$ or $2l+2=j$ , depending on the parity of $j$ . We note that any valid subgradient must satisfy Any valid subgradient must satisfy

for some $\alpha\in$ , such that if $w[2l+1]-w[2l+2]\neq 0$ then

Rearranging terms in Eq. (13) and using the fact that $\gamma+\nu\lambda\geq\nu\lambda>0$ , we get

a contradiction to Eq. (14). Hence, we must have $w[2l+1]-w[2l+2]=0$ , in which case Eq. (16) implies $\alpha=1/2$ . Thus, by Eq. (15)

which contradicts our assumption that $w[j]$ (and hence either $w[2l+1]$ or $w[2l+2]$ is not zero). Thus, we have shown that $\mathbf{w}\in E_{2p+2,d}$ . As before, repeating the argument together with the assumption that $W_{j}\subseteq E_{2p+2,d}$ shows that, in the absence of any communication rounds, all the machines whose local function is $F_{2,k}$ are ’stuck’ in $E_{2p+2,d}$ . Therefore, before the next communication round, $W_{j}\subseteq E_{2p+2,d}$ for all machines $j$ holding $F_{2,j}$ . Moreover, as shown earlier, $W_{j}\subseteq E_{2p+1,d}$ for all machines holding $F_{1,k}$ . Therefore, after the next communication round, $W_{j}\subseteq E_{2p+2,d}$ for any machine $j$ . ∎

Repeatedly applying Lemma 2, we get the following corollary:

Under assumption 1, after $T\leq d-1$ communication rounds we have

With this corollary in hand, we now turn to establish the main result, namely, bounding from below the optimality of points in $W_{j}$ after $T$ communication rounds. Choosing the dimension $d$ such that $T\leq d-2$ , we employ the local functions defined in Eq. (8) with $k=T+2$ . In which case, the average function is

The key ingredient in deriving the lower bound is Corollary (2), according to which after $T$ communication rounds, all but the first $T+1$ coordinates must be zero, in particular $w[T+2]=0$ . Using this and the triangle inequality, we have

for all $\mathbf{w}$ in $W_{j}$ . Therefore, we can lower bound the objective value of the algorithm’s output by

On the flip side, the minimal value of $F(\mathbf{w})$ over the unit Euclidean ball can be upper bounded by $F(\mathbf{w}_{b})$ for some $b\leq\frac{1}{\sqrt{T+2}}$ , where

Assuming $T\geq\frac{1}{2\lambda}-2$ (so that $\lambda\geq\frac{1}{2(T+2)}$ ), we take

(note again that $\mathbf{w}_{b}$ is indeed in the unit ball for this regime of $\lambda$ and $T$ ). In this case, the minimal $w$ in Eq. (17) is $\frac{1}{2\lambda(T+2)\sqrt{2(T+2)}}$ , so we get a suboptimality lower bound of

This bound holds in particular for any $T\geq\left\lceil\frac{1}{2\lambda}-2\right\rceil$ . If the number of communication rounds $T$ is less than $\left\lceil\frac{1}{2\lambda}-2\right\rceil$ , then clearly we cannot do better than with $\left\lceil\frac{1}{2\lambda}-2\right\rceil$ communication rounds. Therefore, for any number of communication rounds $T$ , the suboptimality is at least

Therefore, for any $\epsilon\in\left(0,\frac{1}{16\lambda\left(\left\lceil\frac{1}{2\lambda}-2\right\rceil+2\right)^{2}}\right]$ , we would need at least $T\geq\sqrt{\frac{1}{16\lambda\epsilon}}-2$ communication rounds to get an $\epsilon$ -suboptimal solution. This implies the theorem statement for $\lambda$ -strongly convex functions.

Finally, we treat the case where the local functions are not required to be strongly convex. In this setting, for proving a lower bound, we can use the same construction as in Eq. (8), where we are free to choose any $\lambda$ . In particular, let us choose $\lambda=\frac{1}{2(T+2)}$ , and apply the lower bound derived above (note that in this case the condition $T\geq\frac{1}{2\lambda}-2$ trivially holds). Plugging in it into (A.2), we establish that for any number of communication rounds $T$ , the suboptimality is at least

Considering how large $T$ must be to make this smaller than some $\epsilon$ , we get that $T$ must be at least $\frac{1}{8\epsilon}-2$ .

A.3 Proof of Thm. 3

As usual, we construct two functions $F_{1},F_{2}$ , and provide $F_{1}$ to $m/2$ of the machines, and $F_{2}$ to the other $m/2$ machines, in some arbitrary order, such that the machine designated to provide the output receives $F_{2}$ . Note that the average function $F$ is simply $\frac{1}{2}(F_{1}(\mathbf{w})+F_{2}(\mathbf{w}))$ .

Let $c$ be a certain positive numerical constant (whose value corresponds to $c$ in Lemma 6 below). Given some symmetric $M\in\{-1,+1\}^{d\times d}$ , where $\left\|M\right\|\leq c\sqrt{d}$ , and $j\in\{\lceil d/2\rceil,\ldots,d\}$ , define

The following lemma establishes that the functions satisfy the strong convexity, smoothness and relatedness requirements of the theorem. The proof also establishes that the inverse in the definition of $F_{1}$ indeed exists.

$F_{1}$ and $F_{2}$ are $\lambda$ strongly-convex, $9\lambda$ smooth, and $\delta$ -related.

The Hessian of $F_{2}$ is $3\lambda I$ , which implies that $F_{2}$ is $3\lambda$ smooth and strongly convex (and in particular, $\lambda$ -strongly convex). As to $F_{1}$ , note that since $\left\|M\right\|\leq c\sqrt{d}$ , then

The fact that the spectral radius and spectral norm of symmetric matrices coincide implies that the eigenvalues of the matrix $I+\frac{1}{2c\sqrt{d}}M$ lie between $1-\frac{1}{2}=\frac{1}{2}$ and $1+\frac{1}{2}=\frac{3}{2}$ . Thus, all the eigenvalues are strictly positive, hence the matrix is indeed invertible as in the definition of $F_{1}$ . Moreover, the eigenvalues of the inverse lie in $\left[\frac{1}{3/2},\frac{1}{1/2}\right]=\left[\frac{2}{3},2\right]$ , and therefore those of $3\lambda\left(\left(I+\frac{1}{2c\sqrt{d}}M\right)^{-1}-\frac{1}{2}I\right)$ lie in $\left[3\lambda\left(\frac{2}{3}-\frac{1}{2}\right),3\lambda\left(2-\frac{1}{2}\right)\right]=\left[\frac{\lambda}{2},\frac{9\lambda}{2}\right]$ . Thus, the spectrum of the Hessian of $F_{1}$ lie in $\left[\lambda,9\lambda\right]$ , which implies that $F_{1}$ is $\lambda$ -strongly convex and $9\lambda$ smooth.

To show $\delta$ -relatedness, the only non-trivial part is upper-bounding the norm of the difference of the quadratic terms, which equals the following:

Since $\left\|M\right\|\leq c\sqrt{d}$ , the eigenvalues of $\left(I+\frac{1}{2c\sqrt{d}}M\right)^{-1}-I$ lie between $\frac{1}{1+1/2}-1=-\frac{1}{3}$ and $\frac{1}{1-1/2}-1=1$ , which implies that Eq. (19) can be upper bounded by $3\lambda\leq\delta$ . ∎

The next lemma proves the second part of the theorem, namely an upper bound on the suboptimality of any local function optimizer.

The optimum of any quadratic and strongly-convex function $\mathbf{w}^{\top}A\mathbf{w}+\mathbf{b}^{\top}\mathbf{w}+c$ equals $\frac{1}{2}A^{-1}\mathbf{b}$ . Therefore, if $\mathbf{w}^{*}$ is the optimizer of $F$ , and we denote the parameters of $F$ and $F_{j}$ by $A,\mathbf{b},c$ and $A_{j},\mathbf{b}_{j},c_{j}$ respectively, then

By definition of $F_{1},F_{2}$ and the average function $F$ , this is at most

In Lemma 3, we showed that $F_{1},F_{2}$ are $\lambda$ -strongly convex and $9\lambda$ smooth, which implies that the eigenvalues of $A_{j}$ as well as $A$ lie in $\left[\frac{\lambda}{2},\frac{9\lambda}{2}\right]$ . Therefore, the eigenvalues of $A_{j}^{-1}$ and $A^{-1}$ lie in $\left[\frac{2}{9\lambda},\frac{2}{\lambda}\right]$ , so $\left\|A^{-1}\right\|\leq\frac{2}{\lambda}$ and $\left\|A_{j}^{-1}-A^{-1}\right\|\leq\frac{2}{\lambda}$ . Substituting this back into Eq. (20), we get

Finally, since $F$ is $9\lambda$ -smooth, and its minimizer is $\mathbf{w}^{*}$ ,

which equals $81\delta^{2}/8\lambda$ as required. ∎

We now turn to derive the lower bound in the theorem statement. As discussed earlier, the intuition is that the optimal point $\mathbf{w}^{*}$ is a function of the $j$ -th column of $M$ , so the machines holding $F_{1}$ must broadcast enough information on $M$ to the designated machine producing the algorithm’s output (the machine, by construction, holds $F_{2}$ , and hence knows $j$ but not $M$ ). As long as the communication budget is smaller than the size of $M$ , this will be difficult to achieve. This intuition is formalized in the following lemma, which is based on information-theoretic tools:

For any dimension $d\geq c$ (where $c$ is the same constant as in Lemma 6 and the definition of $F_{1}$ ), and for any (possibly randomized) 1-round algorithm using at most $d^{2}/128$ bits of communication, there exists a valid choice of $M,j$ for the functions $F_{1},F_{2}$ defined above, such that the vector $\hat{\mathbf{w}}$ returned by the algorithm satisfies

where the expectation is over the algorithm’s randomness, and $c^{\prime}$ is a positive numerical constant.

Using the lemma and the $\lambda$ -strong convexity of $F_{1},F_{2}$ (and hence their average $F$ ),

By definition of $\mathbf{w}^{*}$ , we have that the $j$ -th column of $M$ , designated as $M_{j}$ , satisfies

Given the predictor $\hat{\mathbf{w}}$ returned by the algorithm, define

This can be thought of as the algorithm’s ‘estimate’ of the $j$ -th column of $M$ , based on the returned predictor.

Define $[w]=\min\{1,\max\{-1,w\}\}$ as the clipping operation of a scalar $w$ to $[-1,+1]$ , and for a vector $\mathbf{w}=(w_{1},\ldots,w_{d})$ , define $[\mathbf{w}]=([w_{1}],[w_{2}],\ldots,[w_{d}])$ . By the expressions for $M_{j},\hat{M}_{j}$ above, we have

Below, we will prove that if $M$ (in the definition of $F_{1}$ ) is chosen uniformly at random from all $\{-1,+1\}$ -valued $d\times d$ symmetric matrices, and $j$ (in the definition of $F_{2}$ ) is chosen uniformly at random from $\{\lceil d/2\rceil,\ldots,d\}$ , then for any deterministic algorithm,

Let us first show how this can be used to prove the lemma. To do so, we will need the following lemma on the concentration of the spectral norm of random symmetric matrices.

There exist positive numerical constants $c,c^{\prime}$ , such that if $M$ is a $d\times d$ symmetric matrix, where each entry $M_{j,i},j\geq i$ is chosen independently and uniformly from $\{-1,+1\}$ , and $d\geq c$ , then $\Pr(\left\|M\right\|>c\sqrt{d})\leq c\exp(-c^{\prime}d)$ .

First, we note that the expectation in Eq. (22) is over all symmetric $\{-1,+1\}$ -valued matrices, including those whose spectral norm may be larger than $c\sqrt{d}$ . However, by Lemma 6, $\Pr(\left\|M\right\|>c\sqrt{d})\leq c\exp(-c^{\prime}d)$ for some absolute constant $c^{\prime}$ . Letting $E$ be the event that $\left\|M\right\|>c\sqrt{d}$ , and noting that $\left\|[\mathbf{w}]\right\|^{2}\leq d$ for any vector $\mathbf{w}$ , we have

which is at least $d/16$ for any $d$ larger than some constant. Combining with Eq. (21), we get

This inequality implies that for any deterministic algorithm, in expectation over the random draw of $j$ and a $\{-1,+1\}$ -valued matrix $M$ with spectral norm at most $c\sqrt{d}$ , $\left\|\hat{\mathbf{w}}-\mathbf{w}^{*}\right\|^{2}$ will be at least $c^{\prime}\left(\frac{\delta}{\lambda}\right)^{2}$ for some suitable constant $c^{\prime}$ . By Yao’s minimax principle, this implies that for any (possibly randomized) algorithm, there will be some deterministic choice of $M,j$ such that $\left\|M\right\|\leq c\sqrt{d}$ , and for which

(in expectation over the algorithm’s randomness), yielding the lemma’s statement.

It now remains to prove Eq. (22), assuming $j$ is chosen uniformly at random from $\{\lceil d/2\rceil,\ldots,d\}$ , and $M$ is chosen at random (i.e. each entry at or above the main diagonal is chosen independently and uniformly from $\{-1,+1\}$ ). Roughly speaking, the proof idea is to reduce this to an upper bound on how much information the machines holding $M$ can send on $M$ ’s entries (and more particularly, on the entries in the upper-right quadrant of $M$ ). Since this quadrant is composed of $\Theta(d^{2})$ random variables, and the machines can send much less than $d^{2}$ bits, this information is necessarily restricted.

Let $\Pr(\cdot)$ denote probability with respect to the random choice of $M,j$ , and let ${\Pr}_{j}(\cdot)$ denote probability conditioned on the choice of $j$ . Recalling that any entry $M_{j,i}$ in the $j$ -th column has values in $\{-1,+1\}$ , it follows that either $M_{j,i}$ has the same sign as $\hat{M}_{j,i}$ , or that $([M_{j,i}-\hat{M}_{j,i}])^{2}$ is at least $1$ . Therefore, we have the following:

Let $S$ be the vector of bits broadcasted by the machines holding $F_{1}$ , and received by the machine designated with providing the output (recalling that it only holds $F_{2}$ ). Note that conditioned on $S$ and $j$ , the algorithm’s output (and hence $\hat{M}_{j,i}$ ) is independent of $M$ . Therefore, we have

Since $S$ is sent by the machines holding $F_{1}$ (and not $F_{2}$ ), it is independent of $j$ . Therefore, we can write the above as

where $j$ in the conditioning is a fixed index. Using Pinsker’s inequality, we can upper bound the above by

where $p$ is the probability distribution of $S$ , and $D_{kl}$ is the Kullback-Leibler divergence. By the elementary inequality $\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)}$ for all non-negative $a,b$ , we can upper bound the above by

Recalling that this is an upper bound on $\left|{\Pr}_{j}\left(\hat{M}_{j,i}\geq 0|M_{j,i}<0\right)-{\Pr}_{j}\left(\hat{M}_{j,i}\geq 0|M_{j,i}>0\right)\right|$ , we have

where the last step is by Jensen’s inequality (i.e. the average of square roots is upper bounded by the square root of the average). The expression in the square root equals the average mutual information between a random variable $S$ (composed of at most $d^{2}/128$ bits), and $\lceil d/2\rceil\left(1+\lfloor d/2\rfloor\right)$ binary random variables $M_{j,i}$ , where $i\in\{1,\ldots,\lceil d/2\rceil\},j\in\{\lceil d/2\rceil,\ldots,d\}$ , which are all independent by construction. By Lemma 6 in , it is at most $(d^{2}/128)/\left(\lceil d/2\rceil\left(1+\lfloor d/2\rfloor\right)\right)\leq 1/32$ , so we have

Recalling this is an upper bound on Eq. (24), which is the second term in Eq. (23), we get that