Communication Complexity of Distributed Convex Learning and Optimization

Yossi Arjevani, Ohad Shamir

Introduction

The main challenge in solving such problems is that communication between the different machines is usually slow and constrained, at least compared to the speed of local processing. On the other hand, the datasets involved in distributed learning are usually large and high-dimensional. Therefore, machines cannot simply communicate their entire data to each other, and the question is how well can we solve problems such as Eq. (1) using as little communication as possible.

As datasets continue to increase in size, and parallel computing platforms becoming more and more common (from multiple cores on a single CPU to large-scale and geographically distributed computing grids), distributed learning and optimization methods have been the focus of much research in recent years, with just a few examples including . Most of this work studied algorithms for this problem, which provide upper bounds on the required time and communication complexity.

In this paper, we take the opposite direction, and study what are the fundamental performance limitations in solving Eq. (1), under several different sets of assumptions. We identify cases where existing algorithms are already optimal (at least in the worst-case), as well as cases where room for further improvement is still possible.

Since a major constraint in distributed learning is communication, we focus on studying the amount of communication required to optimize Eq. (1) up to some desired accuracy ϵ\epsilon. More precisely, we consider the number of communication rounds that are required, where in each communication round the machines can generally broadcast to each other information linear in the problem’s dimension dd (e.g. a point in W\mathcal{W} or a gradient). This applies to virtually all algorithms for large-scale learning we are aware of, where sending vectors and gradients is feasible, but computing and sending larger objects, such as Hessians (d×dd\times d matrices) is not.

Our results pertain to several possible settings (see Sec. 2 for precise definitions). First, we distinguish between the local functions being merely convex or strongly-convex, and whether they are smooth or not. These distinctions are standard in studying optimization algorithms for learning, and capture important properties such as the regularization and the type of loss function used. Second, we distinguish between a setting where the local functions are related – e.g., because they reflect statistical similarities in the data residing at different machines – and a setting where no relationship is assumed. For example, in the extreme case where data was split uniformly at random between machines, one can show that quantities such as the values, gradients and Hessians of the local functions differ only by δ=O(1/n)\delta=\mathcal{O}(1/\sqrt{n}), where nn is the sample size per machine, due to concentration of measure effects. Such similarities can be used to speed up the optimization/learning process, as was done in e.g. . Both the δ\delta-related and the unrelated setting can be considered in a unified way, by letting δ\delta be a parameter and studying the attainable lower bounds as a function of δ\delta. Our results can be summarized as follows:

First, we define a mild structural assumption on the algorithm (which is satisfied by reasonable approaches we are aware of), which allows us to provide the lower bounds described below on the number of communication rounds required to reach a given suboptimality ϵ\epsilon.

When the local functions can be unrelated, we prove a lower bound of Ω(1/λlog(1/ϵ))\Omega(\sqrt{1/\lambda}\log(1/\epsilon)) for smooth and λ\lambda-strongly convex functions, and Ω(1/ϵ)\Omega(\sqrt{1/\epsilon}) for smooth convex functions. These lower bounds are matched by a straightforward distributed implementation of accelerated gradient descent. In particular, the results imply that many communication rounds may be required to get a high-accuracy solution, and moreover, that no algorithm satisfying our structural assumption would be better, even if we endow the local machines with unbounded computational power. For non-smooth functions, we show a lower bound of Ω(1/λϵ)\Omega(\sqrt{1/\lambda\epsilon}) for λ\lambda-strongly convex functions, and Ω(1/ϵ)\Omega(1/\epsilon) for general convex functions. Although we leave a full derivation to future work, it seems these lower bounds can be matched in our framework by an algorithm combining acceleration and Moreau proximal smoothing of the local functions.

When the local functions are related (as quantified by the parameter δ\delta), we prove a communication round lower bound of Ω(δ/λlog(1/ϵ))\Omega(\sqrt{\delta/\lambda}\log(1/\epsilon)) for smooth and λ\lambda-strongly convex functions. For quadratics, this bound is matched by (up to constants and logarithmic factors) by the recently-proposed DISCO algorithm . However, getting an optimal algorithm for general strongly convex and smooth functions in the δ\delta-related setting, let alone for non-smooth or non-strongly convex functions, remains open.

We also study the attainable performance without posing any structural assumptions on the algorithm, but in the more restricted case where only a single round of communication is allowed. We prove that in a broad regime, the performance of any distributed algorithm may be no better than a ‘trivial’ algorithm which returns the minimizer of one of the local functions, as long as the number of bits communicated is less than Ω(d2)\Omega(d^{2}). Therefore, in our setting, no communication-efficient 1-round distributed algorithm can provide non-trivial performance in the worst case.

There have been several previous works which considered lower bounds in the context of distributed learning and optimization, but to the best of our knowledge, none of them provide a similar type of results. Perhaps the most closely-related paper is , which studied the communication complexity of distributed optimization, and showed that Ω(dlog(1/ϵ))\Omega(d\log(1/\epsilon)) bits of communication are necessary between the machines, for dd-dimensional convex problems. However, in our setting this does not lead to any non-trivial lower bound on the number of communication rounds (indeed, just specifying a dd-dimensional vector up to accuracy ϵ\epsilon required O(dlog(1/ϵ))\mathcal{O}(d\log(1/\epsilon)) bits). More recently, considered lower bounds for certain types of distributed learning problems, but not convex ones in an agnostic distribution-free framework. In the context of lower bounds for one-round algorithms, the results of imply that Ω(d2)\Omega(d^{2}) bits of communication are required to solve linear regression in one round of communication. However, that paper assumes a different model than ours, where the function to be optimized is not split among the machines as in Eq. (1), where each FiF_{i} is convex. Moreover, issues such as strong convexity and smoothness are not considered. proves an impossibility result for a one-round distributed learning scheme, even when the local functions are not merely related, but actually result from splitting data uniformly at random between machines. On the flip side, that result is for a particular algorithm, and doesn’t apply to any possible method.

Finally, we emphasize that distributed learning and optimization can be studied under many settings, including ones different than those studied here. For example, one can consider distributed learning on a stream of i.i.d. data , or settings where the computing architecture is different, e.g. where the machines have a shared memory, or the function to be optimized is not split as in Eq. (1). Studying lower bounds in such settings is an interesting topic for future work.

Notation and Framework

The only vector and matrix norms used in this paper are the Euclidean norm and the spectral norm, respectively. ej\mathbf{e}_{j} denotes the jj-th standard unit vector. We let G(w)\nabla G(\mathbf{w}) and 2G(w)\nabla^{2}G(\mathbf{w}) denote the gradient and Hessians of a function GG at w\mathbf{w}, if they exist. GG is smooth (with parameter LL) if it is differentiable and the gradient is LL-Lipschitz. In particular, if w=argminwWG(w)\mathbf{w}^{*}=\arg\min_{\mathbf{w}\in\mathcal{W}}G(\mathbf{w}), then G(w)G(w)L2ww2G(\mathbf{w})-G(\mathbf{w}^{*})\leq\frac{L}{2}\left\|\mathbf{w}-\mathbf{w}^{*}\right\|^{2}. GG is strongly convex (with parameter λ\lambda) if for any w,wW\mathbf{w},\mathbf{w}^{\prime}\in\mathcal{W}, G(w)G(w)+g,ww+λ2ww2G(\mathbf{w}^{\prime})\geq G(\mathbf{w})+\langle\mathbf{g},\mathbf{w}^{\prime}-\mathbf{w}\rangle+\frac{\lambda}{2}\left\|\mathbf{w}^{\prime}-\mathbf{w}\right\|^{2} where gG(w)\mathbf{g}\in\partial G(\mathbf{w}^{\prime}) is a subgradient of GG at w\mathbf{w}. In particular, if w=argminwWG(w)\mathbf{w}^{*}=\arg\min_{\mathbf{w}\in\mathcal{W}}G(\mathbf{w}), then G(w)G(w)λ2ww2G(\mathbf{w})-G(\mathbf{w}^{*})\geq\frac{\lambda}{2}\left\|\mathbf{w}-\mathbf{w}^{*}\right\|^{2}. Any convex function is also strongly-convex with λ=0\lambda=0. A special case of smooth convex functions are quadratics, where G(w)=wAw+bw+cG(\mathbf{w})=\mathbf{w}^{\top}A\mathbf{w}+\mathbf{b}^{\top}\mathbf{w}+c for some positive semidefinite matrix AA, vector b\mathbf{b} and scalar cc. In this case, λ\lambda and LL correspond to the smallest and largest eigenvalues of AA.

We model the distributed learning algorithm as an iterative process, where in each round the machines may perform some local computations, followed by a communication round where each machine broadcasts a message to all other machines. We make no assumptions on the computational complexity of the local computations. After all communication rounds are completed, a designated machine provides the algorithm’s output (possibly after additional local computation).

In this model, our main question is the following: How many rounds of communication are necessary in order to solve problems such as Eq. (1) to some given accuracy ϵ\epsilon?

As discussed in the introduction, we first need to distinguish between different assumptions on the possible relation between the local functions. One natural situation is when no significant relationship can be assumed, for instance when the data is arbitrarily split or is gathered by each machine from statistically dissimilar sources. We denote this as the unrelated setting. However, this assumption is often unnecessarily pessimistic. Often the data allocation process is more random, or we can assume that the different data sources for each machine have statistical similarities (to give a simple example, consider learning from users’ activity across a geographically distributed computing grid, each servicing its own local population). We will capture such similarities, in the context of quadratic functions, using the following definition:

are δ\delta-related, if for any i,j{1k}i,j\in\{1\ldots k\}, it holds that

We end this section with an important remark. In this paper, we prove lower bounds for the δ\delta-related setting, which includes as a special case the commonly-studied setting of randomly partitioned data (in which case δ=O(1/n)\delta=\mathcal{O}(1/\sqrt{n})). However, our bounds do not apply for random partitioning, since they use δ\delta-related constructions which do not correspond to randomly partitioned data. In fact, very recent work has cleverly shown that for randomly partitioned data, and for certain reasonable regimes of strong convexity and smoothness, it is actually possible to get better performance than what is indicated by our lower bounds. However, this encouraging result crucially relies on the random partition property, and in parameter regimes which limit how much each data point needs to be “touched”, hence preserving key statistical independence properties. We suspect that it may be difficult to improve on our lower bounds under substantially weaker assumptions.

Lower Bounds Using a Structural Assumption

In this section, we present lower bounds on the number of communication rounds, where we impose a certain mild structural assumption on the operations performed by the algorithm. Roughly speaking, our lower bounds pertain to a very large class of algorithms, which are based on linear operations involving points, gradients, and vector products with local Hessians and their inverses, as well as solving local optimization problems involving such quantities. At each communication round, the machines can share any of the vectors they have computed so far. Formally, we consider algorithms which satisfy the assumption stated below. For convenience, we state it for smooth functions (which are differentiable) and discuss the case of non-smooth functions in Sec. 3.2.

for some γ,ν0\gamma,\nu\geq 0 such that γ+ν>0\gamma+\nu>0. After every communication round, let Wj:=i=1mWiW_{j}:=\cup_{i=1}^{m}W_{i} for all jj. The algorithm’s final output (provided by the designated machine jj) is a point in the span of WjW_{j}.

This assumption requires several remarks:

Note that WjW_{j} is not an explicit part of the algorithm: It simply includes all points computed by machine jj so far, or communicated to it by other machines, and is used to define the set of new points which the machine is allowed to compute.

The assumption bears some resemblance – but is far weaker – than standard assumptions used to provide lower bounds for iterative optimization algorithms. For example, a common assumption (see ) is that each computed point w\mathbf{w} must lie in the span of the previous gradients. This corresponds to a special case of Assumption 1, where γ=1,ν=0\gamma=1,\nu=0, and the span is only over gradients of previously computed points. Moreover, it also allows (for instance) exact optimization of each local function, which is a subroutine in some distributed algorithms (e.g. ), by setting γ=0,ν=1\gamma=0,\nu=1 and computing a point w\mathbf{w} satisfying γw+νFj(w)=0\gamma\mathbf{w}+\nu\nabla F_{j}(\mathbf{w})=\mathbf{0}. By allowing the span to include previous gradients, we also incorporate algorithms which perform optimization of the local function plus terms involving previous gradients and points, such as , as well as algorithms which rely on local Hessian information and preconditioning, such as . In summary, the assumption is satisfied by most techniques for black-box convex optimization that we are aware of. Finally, we emphasize that we do not restrict the number or computational complexity of the operations performed between communication rounds.

The requirement that γ,ν0\gamma,\nu\geq 0 is to exclude algorithms which solve non-convex local optimization problems of the form minwFj(w)+γw2\min_{\mathbf{w}}F_{j}(\mathbf{w})+\gamma\left\|\mathbf{w}\right\|^{2} with γ<0\gamma<0, which are unreasonable in practice and can sometimes break our lower bounds.

The assumption that WjW_{j} is initially {0}\{\mathbf{0}\} (namely, that the algorithm starts from the origin) is purely for convenience, and our results can be easily adapted to any other starting point by shifting all functions accordingly.

The techniques we employ in this section are inspired by lower bounds on the iteration complexity of first-order methods for standard (non-distributed) optimization (see for example ). These are based on the construction of ‘hard’ functions, where each gradient (or subgradient) computation can only provide a small improvement in the objective value. In our setting, the dynamics are roughly similar, but the necessity of many gradient computations is replaced by many communication rounds. This is achieved by constructing suitable local functions, where at any time point no individual machine can ‘progress’ on its own, without information from other machines.

We begin by presenting a lower bound when the local functions FiF_{i} are strongly-convex and smooth:

if λ>0\lambda>0, and at least 3δ32ϵw2\sqrt{\frac{3\delta}{32\epsilon}}\left\|\mathbf{w}^{*}\right\|-2 if λ=0\lambda=0.

The assumption of mm being even is purely for technical convenience, and can be discarded at the cost of making the proof slightly more complex. Also, note that mm does not appear explicitly in the bound, but may appear implicitly, via δ\delta (for example, in a statistical setting δ\delta may depend on the number of data points per machine, and may be larger if the same dataset is divided to more machines).

Let us contrast our lower bound with some existing algorithms and guarantees in the literature. First, regardless of whether the local functions are similar or not, we can always simulate any gradient-based method designed for a single machine, by iteratively computing gradients of the local functions, and performing a communication round to compute their average. Clearly, this will be a gradient of the objective function F()=1mi=1mFi()F(\cdot)=\frac{1}{m}\sum_{i=1}^{m}F_{i}(\cdot), which can be fed into any gradient-based method such as gradient descent or accelerated gradient descent . The resulting number of required communication rounds is then equal to the number of iterations. In particular, using accelerated gradient descent for smooth and λ\lambda-strongly convex functions yields a round complexity of O(1/λlog(w2/ϵ))\mathcal{O}(\sqrt{1/\lambda}\log(\left\|\mathbf{w}^{*}\right\|^{2}/\epsilon)), and O(w1/ϵ)\mathcal{O}(\left\|\mathbf{w}^{*}\right\|\sqrt{1/\epsilon}) for smooth convex functions. This matches our lower bound (up to constants and log factors) when the local functions are unrelated (δ=Ω(1)\delta=\Omega(1)).

The full proof of Thm. 1 appears in the supplementary material, and is based on the following idea: For simplicity, suppose we have two machines, with local functions F1,F2F_{1},F_{2} defined as follows,

It is easy to verify that for δ,λ1\delta,\lambda\leq 1, both F1(w)F_{1}(\mathbf{w}) and F2(w)F_{2}(\mathbf{w}) are 11-smooth and λ\lambda-strongly convex, as well as δ\delta-related. Moreover, the optimum of their average is a point w\mathbf{w}^{*} with non-zero entries at all coordinates. However, since each local functions has a block-diagonal quadratic term, it can be shown that for any algorithm satisfying Assumption 1, after TT communication rounds, the points computed by the two machines can only have the first T+1T+1 coordinates non-zero. No machine will be able to further ‘progress’ on its own, and cause additional coordinates to become non-zero, without another communication round. This leads to a lower bound on the optimization error which depends on TT, resulting in the theorem statement after a few computations.

2 Non-smooth Local Functions

Remaining in the framework of algorithms satisfying Assumption 1, we now turn to discuss the situation where the local functions are not necessarily smooth or differentiable. For simplicity, our formal results here will be in the unrelated setting, and we only informally discuss their extension to a δ\delta-related setting (in a sense relevant to non-smooth functions). Formally defining δ\delta-related non-smooth functions is possible but not altogether trivial, and is therefore left to future work.

We adapt Assumption 1 to the non-smooth case, by allowing gradients to be replaced by arbitrary subgradients at the same points. Namely, we replace Eq. (2) by the requirement that for some gFj(w)\mathbf{g}\in\partial F_{j}(\mathbf{w}), and γ,ν0,γ+ν>0\gamma,\nu\geq 0,\gamma+\nu>0,

The lower bound for this setting is stated in the following theorem.

As in Thm. 1, we note that the assumption of even mm is for technical convenience.

This theorem, together with Thm. 1, implies that both strong convexity and smoothness are necessary for the number of communication rounds to scale logarithmically with the required accuracy ϵ\epsilon. We emphasize that this is true even if we allow the machines unbounded computational power, to perform arbitrarily many operations satisfying Assumption 1. Moreover, a preliminary analysis indicates that performing accelerated gradient descent on smoothed versions of the local functions (using Moreau proximal smoothing, e.g. ), can match these lower bounds up to log factorsRoughly speaking, for any γ>0\gamma>0, this smoothing creates a 1γ\frac{1}{\gamma}-smooth function which is γ\gamma-close to the original function. Plugging these into the guarantees of accelerated gradient descent and tuning γ\gamma yields our lower bounds. Note that, in order to execute this algorithm each machine must be sufficiently powerful to obtain the gradient of the Moreau envelope of its local function, which is indeed the case in our framework.. We leave a full formal derivation (which has some subtleties) to future work.

The full proof of Thm. 2 appears in the supplementary material. The proof idea relies on the following construction: Assume that we fix the number of communication rounds to be TT, and (for simplicity) that TT is even and the number of machines is 22. Then we use local functions of the form

where bb is a suitably chosen parameter. It is easy to verify that both local functions are λ\lambda-strongly convex and (1+λ)(1+\lambda)-Lipschitz continuous over the unit Euclidean ball. Similar to the smooth case, we argue that after TT communication rounds, the resulting points w\mathbf{w} computed by machine 11 will be non-zero only on the first T+1T+1 coordinates, and the points w\mathbf{w} computed by machine 22 will be non-zero only on the first TT coordinates. As in the smooth case, these functions allow us to ’control’ the progress of any algorithm which satisfies Assumption 1.

Finally, although the result is in the unrelated setting, it is straightforward to have a similar construction in a ‘δ\delta-related’ setting, by multiplying F1F_{1} and F2F_{2} by δ\delta. The resulting two functions have their gradients and subgradients at most δ\delta-different from each other, and the construction above leads to a lower bound of Ω(δ/ϵ)\Omega(\delta/\epsilon) for convex Lipschitz functions, and Ω(δ1/λϵ)\Omega(\delta\sqrt{1/\lambda\epsilon}) for λ\lambda-strongly convex Lipschitz functions. In terms of upper bounds, we are actually unaware of any relevant algorithm in the literature adapted to such a setting, and the question of attainable performance here remains wide open.

One Round of Communication

In this section, we study what lower bounds are attainable without any kind of structural assumption (such as Assumption 1). This is a more challenging setting, and the result we present will be limited to algorithms using a single round of communication round. We note that this still captures a realistic non-interactive distributed computing scenario, where we want each machine to broadcast a single message, and a designated machine is then required to produce an output. In the context of distributed optimization, a natural example is a one-shot averaging algorithm, where each machine optimizes its own local data, and the resulting points are averaged (e.g. ).

Intuitively, with only a single round of communication, getting an arbitrarily small error ϵ\epsilon may be infeasible. The following theorem establishes a lower bound on the attainable error, depending on the strong convexity parameter λ\lambda and the similarity measure δ\delta between the local functions, and compares this with a ‘trivial’ zero-communication algorithm, which just returns the optimum of a single local function:

The point w^\hat{\mathbf{w}} returned by the algorithm satisfies

in expectation over the algorithm’s randomness.

The theorem shows that unless the communication budget is extremely large (quadratic in the dimension), there are functions which cannot be optimized to non-trivial accuracy in one round of communication, in the sense that the same accuracy (up to a universal constant) can be obtained with a ‘trivial’ solution where we just return the optimum of a single local function. This complements an earlier result in , which showed that a particular one-round algorithm is no better than returning the optimum of a local function, under the stronger assumption that the local functions are not merely δ\delta-related, but are actually the average loss over some randomly partitioned data.

The full proof appears in the supplementary material, but we sketch the main ideas below. As before, focusing on the case of two machines, and assuming machine 22 is responsible for providing the output, we use

where MM is essentially a randomly chosen {1,+1}\{-1,+1\}-valued d×dd\times d symmetric matrix with spectral norm at most cdc\sqrt{d}, and cc is a suitable constant. These functions can be shown to be δ\delta-related as well as λ\lambda-strongly convex. Moreover, the optimum of F(w)=12(F1(w)+F2(w))F(\mathbf{w})=\frac{1}{2}(F_{1}(\mathbf{w})+F_{2}(\mathbf{w})) equals

Thus, we see that the optimal point w\mathbf{w}^{*} depends on the jj-th column of MM. Intuitively, the machines need to approximate this column, and this is the source of hardness in this setting: Machine 11 knows MM but not jj, yet needs to communicate to machine 22 enough information to construct its jj-th column. However, given a communication budget much smaller than the size of MM (which is d2d^{2}), it is difficult to convey enough information on the jj-th column without knowing what jj is. Carefully formalizing this intuition, and using some information-theoretic tools, allows us to prove the first part of Thm. 3. Proving the second part of Thm. 3 is straightforward, using a few computations.

Summary and Open Questions

In this paper, we studied lower bounds on the number of communication rounds needed to solve distributed convex learning and optimization problems, under several different settings. Our results indicate that when the local functions are unrelated, then regardless of the local machines’ computational power, many communication rounds may be necessary (scaling polynomially with 1/ϵ1/\epsilon or 1/λ1/\lambda), and that the worst-case optimal algorithm (at least for smooth functions) is just a straightforward distributed implementation of accelerated gradient descent. When the functions are related, we show that the optimal performance is achieved by the algorithm of for quadratic and strongly convex functions, but designing optimal algorithms for more general functions remains open. Beside these results, which required a certain mild structural assumption on the algorithm employed, we also provided an assumption-free lower bound for one-round algorithms, which implies that even for strongly convex quadratic functions, such algorithms can sometimes only provide trivial performance.

Besides the question of designing optimal algorithms for the remaining settings, several additional questions remain open. First, it would be interesting to get assumption-free lower bounds for algorithms with multiple rounds of communication. Second, our work focused on communication complexity, but in practice the computational complexity of the local computations is no less important. Thus, it would be interesting to understand what is the attainable performance with simple, runtime-efficient algorithms. Finally, it would be interesting to study lower bounds for other distributed learning and optimization scenarios.

This research is supported in part by an FP7 Marie Curie CIG grant, the Intel ICRI-CI Institute, and Israel Science Foundation grant 425/13. We thank Nati Srebro for several helpful discussions and insights.

References

Appendix A Proofs

The proof of the theorem is based on splitting the machines into two sub-groups of the same size, each of which is assigned with a finite dimensional restriction of F1F_{1} and F2F_{2} (see Eq. (3)), and tracing the maximal number of non-zero coordinates for vectors in WjW_{j}, the set of feasible points.

Recall that FiF_{i} are defined as follows:

Note that [Fi]d(w)[F_{i}]_{d}(\mathbf{w}) and [F]d(w)[F]_{d}(\mathbf{w}) produce the same values as Fi(w)F_{i}(\mathbf{w}) and F(w)F(\mathbf{w}) do for vectors such that wi=0\mathbf{w}_{i}=0 for all idi\geq d. Similarly, we define the d×dd\times d leading principal submatrix of AiA_{i} by [Ai]d[A_{i}]_{d}.

We assign half of the machines with [F1]d[F_{1}]_{d}, and the other half with [F2]d[F_{2}]_{d}. To prove the theorem, we need the following lemma, which formalizes the intuition described in the main paper. Let

Suppose all the sets of feasible points satisfy WjET,dW_{j}\subseteq E_{T,d} for some Td1T\leq d-1, then under assumption 1, right after the next communication round we have WjET+1,dW_{j}\subseteq E_{T+1,d}.

Recall that by Assumption 1, each machine can compute new points w\mathbf{w} that satisfy the following for some γ,ν0\gamma,\nu\geq 0 such that γ+ν>0\gamma+\nu>0:

We now analyze the state of the sets of feasible points prior to the next communication round. Assume that TT is an odd number, i.e., assume T=2k+1T=2k+1 for some k=0,1,k=0,1,\dots. The proof for the case where TT is even follows similar lines. Note that for any w,wWj\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in W_{j}, we have

For any viable diagonal matrix D. Therefore, since WjE2k+1,dW_{j}\subseteq E_{2k+1,d}, we have that the first point generated by machines which hold [F1]d(w)[F_{1}]_{d}(\mathbf{w}) must satisfy

for γ,ν\gamma,\nu as stated in the assumption. That is,

Since [A1]d[A_{1}]_{d} is positive semidefinite, it holds that HH is invertible. Also, [A1]d,H[A_{1}]_{d},H and H1H^{-1} admit the same partitions into 1×11\times 1 and 2×22\times 2 blocks on the diagonal, thus H1E2k+1,dE2k+1,dH^{-1}E_{2k+1,d}\subseteq E_{2k+1,d}, yielding wE2k+1,d\mathbf{w}\in E_{2k+1,d}. Inductively extending the latter argument shows that, in the absence of any communication rounds, all the machines whose local function is [F1]d(w)[F_{1}]_{d}(\mathbf{w}) are ‘stuck’ in E2k+1,dE_{2k+1,d}. As for machines which contain [F2]d(w)[F_{2}]_{d}(\mathbf{w}), we have that for all w,wWj\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in W_{j}

For any viable diagonal matrix D. Therefore, the first generated point by these machines must satisfy,

Similarly to the previous case this implies that wE2k+2,d\mathbf{w}\in E_{2k+2,d}. It is now left to show that these machines cannot make further progress beyond E2k+2,dE_{2k+2,d} without communicating. To see this, note that for all w,wE2k+2,d\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime}\in E_{2k+2,d} we have,

This means that all the points which are generated subsequently also lie in E2k+2,dE_{2k+2,d}, i.e., without communicating , machines whose local function is [F2]d(w)[F_{2}]_{d}(\mathbf{w}) are stuck in E2k+2,dE_{2k+2,d}. Finally, executing a communication round updates all the sets of feasible points to be Wj:=E2k+2,dW_{j}:=E_{2k+2,d}. ∎

The following is a direct consequence of a recursive application of Lemma 1.

Under assumption 1, after Td1T\leq d-1 communication rounds we have

By first-order optimality condition for smooth convex functions, we have

The optimal solution can be now realized as a geometric sequence (ζk)k=1(\zeta^{k})_{k=1}^{\infty} for some ζ\zeta as follows: By Eq. (5), we must have

Let wT\mathbf{w}_{T} be some point which was obtained after Td2T\leq d-2 communication rounds. To bound the sub-optimality of wT\mathbf{w}_{T} from below, observe that

where the last equality follows from Corollary (1), according to which all the coordinates of wT\mathbf{w}_{T}, except for the first T+1d1T+1\leq d-1, must vanish. To bound the AA term, note that

The fact that F(w)F(\mathbf{w}) is λ\lambda-strongly convex implies

To bound the BB term from below, note that since FF is 1-smooth we have

Combining both lower bounds for the terms AA and BB, we get for any Td2T\leq d-2

Picking dd sufficiently large, and considering how large the number of communication rounds TT must be to make this lower bound less than ϵ\epsilon, we get

It is worth mentioning that by computing the exact minimizers of [Fi]d[F_{i}]_{d} one may derive a lower bound such that the choice of dd does not depend on the parameters of the problem, except for the number of communication rounds. Nevertheless, such analysis requires a more involved reasoning which we find unnecessary for stating our results.

For the non-strongly convex case, where λ=0\lambda=0, using Corollary 1 and a similar analysis (virtually identical to the proof of Theorem 2.1.7 in ), we have that if T12(d1)T\leq\frac{1}{2}(d-1), then

Therefore, to obtain an ϵ\epsilon-suboptimal solution for this case, we must have at least

communication rounds, for sufficiently small ϵ\epsilon.

A.2 Proof of Thm. 2

We construct two types of local functions, and provide one of them to m/2m/2 of the machines, and the other function to the other m/2m/2 machines, in some arbitrary order. In this case, the average function is simply the average of the two types of local functions.

We will first prove the theorem statement in the strongly convex case, where λ>0\lambda>0 is given, and then explain how to extract from it the result in the non strongly convex case.

Fix a natural number kk and some b[0,1/k]b\in[0,1/\sqrt{k}], to be specified later. We define the following local function over the unit ball:

otherwise. Being a sum of convex functions, both local functions are convex, and in fact λ\lambda-strongly convex due to the λ2w2\frac{\lambda}{2}\left\|\mathbf{w}\right\|^{2} term. Furthermore, both function are (1+λ)(1+\lambda)-Lipschitz continuous over the unit Euclidean ball. To see this, let ()\partial(\cdot) denote the subdifferential operator and note that

Assume for a moment that λ=0\lambda=0, then by the linearity of the sub-differential operator that

which shows that, for λ=0\lambda=0, both functions are 1-Lipschitz. For λ>0\lambda>0, note that λ2w2\frac{\lambda}{2}\left\|\mathbf{w}\right\|^{2} is λ\lambda-Lipschitz over the unit ball and λ\lambda-strongly convex. Therefore, using the linearity of the sub-differential operator again, we see that both FiF_{i} are (1+λ)(1+\lambda)-Lipschitz and λ\lambda-strongly convex functions over the unit ball.

Similar to the smooth case, the following lemma shows that, no matter how the subgradients are chosen, at each iteration at most one non-zero coordinate may be gained.

Suppose all the sets of feasible points satisfy WjET,dW_{j}\subseteq E_{T,d} for some Td1T\leq d-1. Then under assumption 1, right after the next communication round we have WjET+1,dW_{j}\subseteq E_{T+1,d}.

Recall that by Assumption 1 (modified for the non-differentiable case), each machine can compute new points w\mathbf{w} that satisfy the following for some γ,ν0\gamma,\nu\geq 0 such that γ+ν>0\gamma+\nu>0:

For any viable diagonal matrix D. Therefore, we have that the first point generated by machines which hold F1,kF_{1,k} must satisfy

for γ,ν\gamma,\nu as stated in Assumption (1). Note that, if ν=0\nu=0 (which by assumption means that γ>0\gamma>0) then clearly wE2p+1,d\mathbf{w}\in E_{2p+1,d}. As for ν0\nu\neq 0, suppose by contradiction that wE2p+1,d\mathbf{w}\notin E_{2p+1,d}. That is, assume that there exists some j>2p+1j>2p+1 such that w[j]0w[j]\neq 0. First, if the absolute value terms in F1,kF_{1,k} do not involve w[j]w[j], e.g., when j=dj=d and dd is even, we have g1,k(w)[j]=λw[j]\mathbf{g}_{1,k}(\mathbf{w})[j]=\lambda w[j]. In this case, by Eq. (9) we have

and since νλ>0\nu\lambda>0, this implies that w[j]=0w[j]=0 – a contradiction! Thus, it remains to consider the cases of either odd dd or jdj\neq d. In both of these cases, w[j]w[j] appears in one of the absolute value terms in F1,kF_{1,k}, either as w[j1]w[j]|w[j-1]-w[j]| or w[j]w[j+1]|w[j]-w[j+1]| (depending on whether jj is odd or even).

Let l>pl>p be such that either 2l=j2l=j or 2l+1=j2l+1=j, depending on the parity of jj. We note that any valid subgradient must satisfy

for some α\alpha\in, such that if w[2l]w[2l+1]0w[2l]-w[2l+1]\neq 0 then

where sgn()\operatorname{sgn}() is the sign function. Rearranging terms in Eq. (9) and using the facts that coordinates 2l,2l+12l,2l+1 are always zero in E2p+1,dE_{2p+1,d}, as well as γ+νλνλ>0\gamma+\nu\lambda\geq\nu\lambda>0, we get

contradicting Eq. (10). Hence, we must have w[2l]w[2l+1]=0w[2l]-w[2l+1]=0, in which case Eq. (12) implies α=1/2\alpha=1/2. Thus, by Eq. (11)

which contradicts the assumption that either w[j]w[j] (and hence w[2l]w[2l] or w[2l+1]w[2l+1]) is not zero. Thus, we have shown that wE2p+1,d\mathbf{w}\in E_{2p+1,d}, for the first point generated by machines holding F1,kF_{1,k}. Repeating the argument, we get that any point generated by those machines, in the absence of any communication rounds, is ‘stuck’ in E2p+1,dE_{2p+1,d}.

We now turn to prove the claim for machines whose local function is F2,kF_{2,k}, using an almost identical argument, which we provide below for completeness. For these functions, we assume that initially WjE2p+1,dW_{j}\subseteq E_{2p+1,d}, and will show any additional points computed locally by the machines must be in E2p+2,dE_{2p+2,d}. We begin by noting that for any w,w\mathbf{w}^{\prime},\mathbf{w}^{\prime\prime} in E2p+2,dE_{2p+2,d} (and in particular E2p+1,dE_{2p+1,d}), it holds that

For any viable diagonal matrix D. Therefore, we have that the first point generated by machines which hold F2,kF_{2,k} must satisfy

for γ,ν\gamma,\nu as stated in the assumption. Note that, if ν=0\nu=0 then clearly wE2p+2,d\mathbf{w}\in E_{2p+2,d}. As for ν0\nu\neq 0, suppose by contradiction that wE2p+2,d\mathbf{w}\notin E_{2p+2,d}. That is, assume that there exists some j>2p+2j>2p+2 such that w[j]0w[j]\neq 0. First, if the absolute value terms in F2,kF_{2,k} do not involve w[j]w[j], e.g., when j=dj=d and dd is odd, we have g2,k(w)[j]=λw[j]\mathbf{g}_{2,k}(\mathbf{w})[j]=\lambda w[j]. In this case, by Eq. (13) we have

and since νλ>0\nu\lambda>0, this implies that w[j]=0w[j]=0 – a contradiction! Thus, it remains to consider the cases of either even dd or jdj\neq d. In both of these cases, w[j]w[j] appears in one of the absolute value terms in F2,kF_{2,k}, either as w[j1]w[j]|w[j-1]-w[j]| or w[j]w[j+1]|w[j]-w[j+1]| (depending on whether jj is odd or even).

Let l>pl>p be such that either 2l+1=j2l+1=j or 2l+2=j2l+2=j, depending on the parity of jj. We note that any valid subgradient must satisfy Any valid subgradient must satisfy

for some α\alpha\in, such that if w[2l+1]w[2l+2]0w[2l+1]-w[2l+2]\neq 0 then

Rearranging terms in Eq. (13) and using the fact that γ+νλνλ>0\gamma+\nu\lambda\geq\nu\lambda>0, we get

a contradiction to Eq. (14). Hence, we must have w[2l+1]w[2l+2]=0w[2l+1]-w[2l+2]=0, in which case Eq. (16) implies α=1/2\alpha=1/2. Thus, by Eq. (15)

which contradicts our assumption that w[j]w[j] (and hence either w[2l+1]w[2l+1] or w[2l+2]w[2l+2] is not zero). Thus, we have shown that wE2p+2,d\mathbf{w}\in E_{2p+2,d}. As before, repeating the argument together with the assumption that WjE2p+2,dW_{j}\subseteq E_{2p+2,d} shows that, in the absence of any communication rounds, all the machines whose local function is F2,kF_{2,k} are ’stuck’ in E2p+2,dE_{2p+2,d}. Therefore, before the next communication round, WjE2p+2,dW_{j}\subseteq E_{2p+2,d} for all machines jj holding F2,jF_{2,j}. Moreover, as shown earlier, WjE2p+1,dW_{j}\subseteq E_{2p+1,d} for all machines holding F1,kF_{1,k}. Therefore, after the next communication round, WjE2p+2,dW_{j}\subseteq E_{2p+2,d} for any machine jj. ∎

Repeatedly applying Lemma 2, we get the following corollary:

Under assumption 1, after Td1T\leq d-1 communication rounds we have

With this corollary in hand, we now turn to establish the main result, namely, bounding from below the optimality of points in WjW_{j} after TT communication rounds. Choosing the dimension dd such that Td2T\leq d-2, we employ the local functions defined in Eq. (8) with k=T+2k=T+2. In which case, the average function is

The key ingredient in deriving the lower bound is Corollary (2), according to which after TT communication rounds, all but the first T+1T+1 coordinates must be zero, in particular w[T+2]=0w[T+2]=0. Using this and the triangle inequality, we have

for all w\mathbf{w} in WjW_{j}. Therefore, we can lower bound the objective value of the algorithm’s output by

On the flip side, the minimal value of F(w)F(\mathbf{w}) over the unit Euclidean ball can be upper bounded by F(wb)F(\mathbf{w}_{b}) for some b1T+2b\leq\frac{1}{\sqrt{T+2}}, where

Assuming T12λ2T\geq\frac{1}{2\lambda}-2 (so that λ12(T+2)\lambda\geq\frac{1}{2(T+2)}), we take

(note again that wb\mathbf{w}_{b} is indeed in the unit ball for this regime of λ\lambda and TT). In this case, the minimal ww in Eq. (17) is 12λ(T+2)2(T+2)\frac{1}{2\lambda(T+2)\sqrt{2(T+2)}}, so we get a suboptimality lower bound of

This bound holds in particular for any T12λ2T\geq\left\lceil\frac{1}{2\lambda}-2\right\rceil. If the number of communication rounds TT is less than 12λ2\left\lceil\frac{1}{2\lambda}-2\right\rceil, then clearly we cannot do better than with 12λ2\left\lceil\frac{1}{2\lambda}-2\right\rceil communication rounds. Therefore, for any number of communication rounds TT, the suboptimality is at least

Therefore, for any ϵ(0,116λ(12λ2+2)2]\epsilon\in\left(0,\frac{1}{16\lambda\left(\left\lceil\frac{1}{2\lambda}-2\right\rceil+2\right)^{2}}\right], we would need at least T116λϵ2T\geq\sqrt{\frac{1}{16\lambda\epsilon}}-2 communication rounds to get an ϵ\epsilon-suboptimal solution. This implies the theorem statement for λ\lambda-strongly convex functions.

Finally, we treat the case where the local functions are not required to be strongly convex. In this setting, for proving a lower bound, we can use the same construction as in Eq. (8), where we are free to choose any λ\lambda. In particular, let us choose λ=12(T+2)\lambda=\frac{1}{2(T+2)}, and apply the lower bound derived above (note that in this case the condition T12λ2T\geq\frac{1}{2\lambda}-2 trivially holds). Plugging in it into (A.2), we establish that for any number of communication rounds TT, the suboptimality is at least

Considering how large TT must be to make this smaller than some ϵ\epsilon, we get that TT must be at least 18ϵ2\frac{1}{8\epsilon}-2.

A.3 Proof of Thm. 3

As usual, we construct two functions F1,F2F_{1},F_{2}, and provide F1F_{1} to m/2m/2 of the machines, and F2F_{2} to the other m/2m/2 machines, in some arbitrary order, such that the machine designated to provide the output receives F2F_{2}. Note that the average function FF is simply 12(F1(w)+F2(w))\frac{1}{2}(F_{1}(\mathbf{w})+F_{2}(\mathbf{w})).

Let cc be a certain positive numerical constant (whose value corresponds to cc in Lemma 6 below). Given some symmetric M{1,+1}d×dM\in\{-1,+1\}^{d\times d}, where Mcd\left\|M\right\|\leq c\sqrt{d}, and j{d/2,,d}j\in\{\lceil d/2\rceil,\ldots,d\}, define

The following lemma establishes that the functions satisfy the strong convexity, smoothness and relatedness requirements of the theorem. The proof also establishes that the inverse in the definition of F1F_{1} indeed exists.

F1F_{1} and F2F_{2} are λ\lambda strongly-convex, 9λ9\lambda smooth, and δ\delta-related.

The Hessian of F2F_{2} is 3λI3\lambda I, which implies that F2F_{2} is 3λ3\lambda smooth and strongly convex (and in particular, λ\lambda-strongly convex). As to F1F_{1}, note that since Mcd\left\|M\right\|\leq c\sqrt{d}, then

The fact that the spectral radius and spectral norm of symmetric matrices coincide implies that the eigenvalues of the matrix I+12cdMI+\frac{1}{2c\sqrt{d}}M lie between 112=121-\frac{1}{2}=\frac{1}{2} and 1+12=321+\frac{1}{2}=\frac{3}{2}. Thus, all the eigenvalues are strictly positive, hence the matrix is indeed invertible as in the definition of F1F_{1}. Moreover, the eigenvalues of the inverse lie in [13/2,11/2]=[23,2]\left[\frac{1}{3/2},\frac{1}{1/2}\right]=\left[\frac{2}{3},2\right], and therefore those of 3λ((I+12cdM)112I)3\lambda\left(\left(I+\frac{1}{2c\sqrt{d}}M\right)^{-1}-\frac{1}{2}I\right) lie in [3λ(2312),3λ(212)]=[λ2,9λ2]\left[3\lambda\left(\frac{2}{3}-\frac{1}{2}\right),3\lambda\left(2-\frac{1}{2}\right)\right]=\left[\frac{\lambda}{2},\frac{9\lambda}{2}\right]. Thus, the spectrum of the Hessian of F1F_{1} lie in [λ,9λ]\left[\lambda,9\lambda\right], which implies that F1F_{1} is λ\lambda-strongly convex and 9λ9\lambda smooth.

To show δ\delta-relatedness, the only non-trivial part is upper-bounding the norm of the difference of the quadratic terms, which equals the following:

Since Mcd\left\|M\right\|\leq c\sqrt{d}, the eigenvalues of (I+12cdM)1I\left(I+\frac{1}{2c\sqrt{d}}M\right)^{-1}-I lie between 11+1/21=13\frac{1}{1+1/2}-1=-\frac{1}{3} and 111/21=1\frac{1}{1-1/2}-1=1, which implies that Eq. (19) can be upper bounded by 3λδ3\lambda\leq\delta. ∎

The next lemma proves the second part of the theorem, namely an upper bound on the suboptimality of any local function optimizer.

The optimum of any quadratic and strongly-convex function wAw+bw+c\mathbf{w}^{\top}A\mathbf{w}+\mathbf{b}^{\top}\mathbf{w}+c equals 12A1b\frac{1}{2}A^{-1}\mathbf{b}. Therefore, if w\mathbf{w}^{*} is the optimizer of FF, and we denote the parameters of FF and FjF_{j} by A,b,cA,\mathbf{b},c and Aj,bj,cjA_{j},\mathbf{b}_{j},c_{j} respectively, then

By definition of F1,F2F_{1},F_{2} and the average function FF, this is at most

In Lemma 3, we showed that F1,F2F_{1},F_{2} are λ\lambda-strongly convex and 9λ9\lambda smooth, which implies that the eigenvalues of AjA_{j} as well as AA lie in [λ2,9λ2]\left[\frac{\lambda}{2},\frac{9\lambda}{2}\right]. Therefore, the eigenvalues of Aj1A_{j}^{-1} and A1A^{-1} lie in [29λ,2λ]\left[\frac{2}{9\lambda},\frac{2}{\lambda}\right], so A12λ\left\|A^{-1}\right\|\leq\frac{2}{\lambda} and Aj1A12λ\left\|A_{j}^{-1}-A^{-1}\right\|\leq\frac{2}{\lambda}. Substituting this back into Eq. (20), we get

Finally, since FF is 9λ9\lambda-smooth, and its minimizer is w\mathbf{w}^{*},

which equals 81δ2/8λ81\delta^{2}/8\lambda as required. ∎

We now turn to derive the lower bound in the theorem statement. As discussed earlier, the intuition is that the optimal point w\mathbf{w}^{*} is a function of the jj-th column of MM, so the machines holding F1F_{1} must broadcast enough information on MM to the designated machine producing the algorithm’s output (the machine, by construction, holds F2F_{2}, and hence knows jj but not MM). As long as the communication budget is smaller than the size of MM, this will be difficult to achieve. This intuition is formalized in the following lemma, which is based on information-theoretic tools:

For any dimension dcd\geq c (where cc is the same constant as in Lemma 6 and the definition of F1F_{1}), and for any (possibly randomized) 1-round algorithm using at most d2/128d^{2}/128 bits of communication, there exists a valid choice of M,jM,j for the functions F1,F2F_{1},F_{2} defined above, such that the vector w^\hat{\mathbf{w}} returned by the algorithm satisfies

where the expectation is over the algorithm’s randomness, and cc^{\prime} is a positive numerical constant.

Using the lemma and the λ\lambda-strong convexity of F1,F2F_{1},F_{2} (and hence their average FF),

By definition of w\mathbf{w}^{*}, we have that the jj-th column of MM, designated as MjM_{j}, satisfies

Given the predictor w^\hat{\mathbf{w}} returned by the algorithm, define

This can be thought of as the algorithm’s ‘estimate’ of the jj-th column of MM, based on the returned predictor.

Define [w]=min{1,max{1,w}}[w]=\min\{1,\max\{-1,w\}\} as the clipping operation of a scalar ww to [1,+1][-1,+1], and for a vector w=(w1,,wd)\mathbf{w}=(w_{1},\ldots,w_{d}), define [w]=([w1],[w2],,[wd])[\mathbf{w}]=([w_{1}],[w_{2}],\ldots,[w_{d}]). By the expressions for Mj,M^jM_{j},\hat{M}_{j} above, we have

Below, we will prove that if MM (in the definition of F1F_{1}) is chosen uniformly at random from all {1,+1}\{-1,+1\}-valued d×dd\times d symmetric matrices, and jj (in the definition of F2F_{2}) is chosen uniformly at random from {d/2,,d}\{\lceil d/2\rceil,\ldots,d\}, then for any deterministic algorithm,

Let us first show how this can be used to prove the lemma. To do so, we will need the following lemma on the concentration of the spectral norm of random symmetric matrices.

There exist positive numerical constants c,cc,c^{\prime}, such that if MM is a d×dd\times d symmetric matrix, where each entry Mj,i,jiM_{j,i},j\geq i is chosen independently and uniformly from {1,+1}\{-1,+1\}, and dcd\geq c, then Pr(M>cd)cexp(cd)\Pr(\left\|M\right\|>c\sqrt{d})\leq c\exp(-c^{\prime}d).

First, we note that the expectation in Eq. (22) is over all symmetric {1,+1}\{-1,+1\}-valued matrices, including those whose spectral norm may be larger than cdc\sqrt{d}. However, by Lemma 6, Pr(M>cd)cexp(cd)\Pr(\left\|M\right\|>c\sqrt{d})\leq c\exp(-c^{\prime}d) for some absolute constant cc^{\prime}. Letting EE be the event that M>cd\left\|M\right\|>c\sqrt{d}, and noting that [w]2d\left\|[\mathbf{w}]\right\|^{2}\leq d for any vector w\mathbf{w}, we have

which is at least d/16d/16 for any dd larger than some constant. Combining with Eq. (21), we get

This inequality implies that for any deterministic algorithm, in expectation over the random draw of jj and a {1,+1}\{-1,+1\}-valued matrix MM with spectral norm at most cdc\sqrt{d}, w^w2\left\|\hat{\mathbf{w}}-\mathbf{w}^{*}\right\|^{2} will be at least c(δλ)2c^{\prime}\left(\frac{\delta}{\lambda}\right)^{2} for some suitable constant cc^{\prime}. By Yao’s minimax principle, this implies that for any (possibly randomized) algorithm, there will be some deterministic choice of M,jM,j such that Mcd\left\|M\right\|\leq c\sqrt{d}, and for which

(in expectation over the algorithm’s randomness), yielding the lemma’s statement.

It now remains to prove Eq. (22), assuming jj is chosen uniformly at random from {d/2,,d}\{\lceil d/2\rceil,\ldots,d\}, and MM is chosen at random (i.e. each entry at or above the main diagonal is chosen independently and uniformly from {1,+1}\{-1,+1\}). Roughly speaking, the proof idea is to reduce this to an upper bound on how much information the machines holding MM can send on MM’s entries (and more particularly, on the entries in the upper-right quadrant of MM). Since this quadrant is composed of Θ(d2)\Theta(d^{2}) random variables, and the machines can send much less than d2d^{2} bits, this information is necessarily restricted.

Let Pr()\Pr(\cdot) denote probability with respect to the random choice of M,jM,j, and let Prj(){\Pr}_{j}(\cdot) denote probability conditioned on the choice of jj. Recalling that any entry Mj,iM_{j,i} in the jj-th column has values in {1,+1}\{-1,+1\}, it follows that either Mj,iM_{j,i} has the same sign as M^j,i\hat{M}_{j,i}, or that ([Mj,iM^j,i])2([M_{j,i}-\hat{M}_{j,i}])^{2} is at least 11. Therefore, we have the following:

Let SS be the vector of bits broadcasted by the machines holding F1F_{1}, and received by the machine designated with providing the output (recalling that it only holds F2F_{2}). Note that conditioned on SS and jj, the algorithm’s output (and hence M^j,i\hat{M}_{j,i}) is independent of MM. Therefore, we have

Since SS is sent by the machines holding F1F_{1} (and not F2F_{2}), it is independent of jj. Therefore, we can write the above as

where jj in the conditioning is a fixed index. Using Pinsker’s inequality, we can upper bound the above by

where pp is the probability distribution of SS, and DklD_{kl} is the Kullback-Leibler divergence. By the elementary inequality a+b2(a+b)\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} for all non-negative a,ba,b, we can upper bound the above by

Recalling that this is an upper bound on Prj(M^j,i0Mj,i<0)Prj(M^j,i0Mj,i>0)\left|{\Pr}_{j}\left(\hat{M}_{j,i}\geq 0|M_{j,i}<0\right)-{\Pr}_{j}\left(\hat{M}_{j,i}\geq 0|M_{j,i}>0\right)\right|, we have

where the last step is by Jensen’s inequality (i.e. the average of square roots is upper bounded by the square root of the average). The expression in the square root equals the average mutual information between a random variable SS (composed of at most d2/128d^{2}/128 bits), and d/2(1+d/2)\lceil d/2\rceil\left(1+\lfloor d/2\rfloor\right) binary random variables Mj,iM_{j,i}, where i{1,,d/2},j{d/2,,d}i\in\{1,\ldots,\lceil d/2\rceil\},j\in\{\lceil d/2\rceil,\ldots,d\}, which are all independent by construction. By Lemma 6 in , it is at most (d2/128)/(d/2(1+d/2))1/32(d^{2}/128)/\left(\lceil d/2\rceil\left(1+\lfloor d/2\rfloor\right)\right)\leq 1/32, so we have

Recalling this is an upper bound on Eq. (24), which is the second term in Eq. (23), we get that