Optimal algorithms for smooth and strongly convex distributed optimization in networks

Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, Laurent Massoulié

Introduction

Given the numerous applications of distributed optimization in machine learning, many algorithms have recently emerged, that allow the minimization of objective functions ff defined as the average 1ni=1nfi\frac{1}{n}\sum_{i=1}^{n}f_{i} of functions fif_{i} which are respectively accessible by separate nodes in a network . These algorithms typically alternate local incremental improvement steps (such as gradient steps) with communication steps between nodes in the network, and come with a variety of convergence rates (see for example ).

Two main regimes have been looked at: (a) centralized where communications are precisely scheduled and (b) decentralized where communications may not exhibit a precise schedule. In this paper, we consider these two regimes for objective functions which are smooth and strongly-convex and for which algorithms are linearly (exponentially) convergent. The main contribution of this paper is to propose new and matching upper and lower bounds of complexity for this class of distributed problems.

The optimal complexity bounds depend on natural quantities in optimization and network theory. Indeed, (a) for a single machine the optimal number of gradient steps to optimize a function is proportional to the square root of the condition number , and (b) for mean estimation, the optimal number of communication steps is proportional to the diameter of the network in centralized problems or to the square root of the eigengap of the Laplacian matrix in decentralized problems . As shown in Section 3, our lower complexity bounds happen to be combinations of the two contributions above.

These lower complexity bounds are attained by two separate algorithms. In the centralized case, the trivial distribution of Nesterov’s accelerated gradient attains this rate, while in the decentralized case, as shown in Section 4, the rate is achieved by a dual algorithm. We compare favorably our new optimal algorithms to existing work in Section 5.

Related work. Decentralized optimization has been extensively studied and early methods such as decentralized gradient descent or decentralized dual averaging exhibited sublinear convergence rates. More recently, a number of methods with provable linear convergence rates were developed, including EXTRA , augmented Lagrangians , and more recent approaches . The most popular of such approaches is the distributed alternating direction method of multipliers (D-ADMM) and has led to a large number of variations and extensions. In a different direction, second order methods were also investigated . However, to the best of our knowledge, the field still lacks a coherent theoretical understanding of the optimal convergence rates and its dependency on the characteristics of the communication network. In several related fields, complexity lower bounds were recently investigated, including the sequential optimization of a sum of functions , distributed optimization in flat (i.e. totally connected) networks , or distributed stochastic optimization .

Distributed optimization setting

in a distributed setting. More specifically, we assume that:

Each computing unit can compute first-order characteristics, such as the gradient of its own function or its Fenchel conjugate. By renormalization of the time axis, and without loss of generality, we assume that this computation is performed in one unit of time.

These actions may be performed asynchronously and in parallel, and each node ii possesses a local version of the parameter, which we refer to as θi\theta_{i}. Moreover, we assume that each function fif_{i} is α\alpha-strongly convex and β\beta-smooth, and we denote by κl=βα1\kappa_{l}=\frac{\beta}{\alpha}\geq 1 the local condition number. We also denote by αg\alpha_{g}, βg\beta_{g} and κg\kappa_{g}, respectively, the strong convexity, smoothness and condition number of the average (global) function fˉ\bar{f}. Note that we always have κgκl\kappa_{g}\leq\kappa_{l}, while the opposite inequality is, in general, not true (take for example f1(θ)=\mathds1{θ<0}θ2f_{1}(\theta)=\mathds{1}\{\theta<0\}\theta^{2} and f2(θ)=\mathds1{θ>0}θ2f_{2}(\theta)=\mathds{1}\{\theta>0\}\theta^{2} for which κl=+\kappa_{l}=+\infty and κg=1\kappa_{g}=1). However, the two quantities are close (resp. equal) when the local functions are similar (resp. equal) to one another.

2 Decentralized communication

A large body of literature considers a decentralized approach to distributed optimization based on the gossip algorithm . In such a case, communication is represented as a matrix multiplication with a matrix WW verifying the following constraints:

The kernel of WW is the set of constant vectors: Ker(W)=Span(\mathds1){\rm Ker}(W)={\rm Span}(\mathds{1}), where \mathds1=(1,...,1)\mathds{1}=(1,...,1)^{\top},

WW is defined on the edges of the network: Wij0W_{ij}\neq 0 only if i=ji=j or (i,j)E(i,j)\in\mathcal{E}.

The third condition will ensure that the gossip step converges to the average of all the vectors shared between the nodes. We will denote the matrix WW as the gossip matrix, since each communication step will be represented using it. Note that a simple choice for the gossip matrix is the Laplacian matrix L=DAL=D-A, where AA is the adjacency matrix of the network and D=\mathop{\rm diag}\big{(}{\sum_{i}A_{ij}}\big{)}. However, in the presence of large degree nodes, weighted Laplacian matrices are usually a better choice, and the problem of optimizing these weights is known as the fastest distributed consensus averaging problem and is investigated by .

We will denote by λ1(W)λn(W)=0\lambda_{1}(W)\geq\cdots\geq\lambda_{n}(W)=0 the spectrum of the gossip matrix WW, and its (normalized) eigengap the ratio γ(W)=λn1(W)/λ1(W)\gamma(W)=\lambda_{n-1}(W)/\lambda_{1}(W) between the second smallest and the largest eigenvalue. Equivalently, this is the inverse of the condition number of WW projected on the space orthogonal to the constant vector \mathds1\mathds{1}. This quantity will be the main parameter describing the connectivity of the communication network in Section 3.3 and Section 4.

Optimal convergence rates

In this section, we prove oracle complexity lower bounds for distributed optimization in two settings: strongly convex and smooth functions for centralized (i.e. master/slave) and decentralized algorithms based on a gossip matrix WW.

In the first setting, we show that distributing accelerated gradient descent matches the optimal convergence rate, while, in the second setting, the algorithm proposed in Section 4 is shown to be optimal. Note that we will use the notation g(ε)=Ω(f(ε))g(\varepsilon)=\Omega(f(\varepsilon)) for C>0\mbox s.t. ε>0,g(ε)Cf(ε)\exists C>0\mbox{\ s.t.\ }\forall\varepsilon>0,g(\varepsilon)\geq Cf(\varepsilon), and will, for simplicity, omit the additive terms that do not depend on the precision ε\varepsilon in Corollary 1 and Corollary 2.

The lower bounds provided hereafter depend on a new notion of black-box optimization procedures for the problem in Eq. (1), where we consider distributed algorithms verifying the following constraints:

Local computation: each node ii can, at time tt, compute the gradient of its local function fi(θ)\nabla f_{i}(\theta) or its Fenchel conjugate fi(θ)\nabla f^{*}_{i}(\theta) for a value θMi,t\theta\in\mathcal{M}_{i,t} in the node’s internal memory, that is, for all i{1,...,n}i\in\{1,...,n\},

Local communication: each node ii can, at time tt, share a value to all or part of its neighbors, that is, for all i{1,...,n}i\in\{1,...,n\},

Output value: each node ii must, at time tt, specify one vector in its memory as local output of the algorithm, that is, for all i{1,...,n}i\in\{1,...,n\},

Hence, a black-box procedure will return nn output values—one for each node of the network—and our analysis will focus on ensuring that all local output values are converging to the optimal parameter of Eq. (1). Moreover, we will say that a black-box procedure uses a gossip matrix WW if the local communication is achieved by multiplication of a vector with WW. For simplicity, we assume that all nodes start with the simple internal memory Mi,0={0}\mathcal{M}_{i,0}=\{0\}. Note that communications and local computations may be performed in parallel and asynchronously.

2 Centralized algorithms

In this section, we show that, for any black-box optimization procedure, at least Ω(κgln(1/ε))\Omega(\sqrt{\kappa_{g}}\ln(1/\varepsilon)) gradient steps and Ω(Δκgln(1/ε))\Omega(\Delta\sqrt{\kappa_{g}}\ln(1/\varepsilon)) communication steps are necessary to achieve a precision ε>0\varepsilon>0, where κg\kappa_{g} is the global condition number and Δ\Delta is the diameter of the network. These lower bounds extend the communication complexity lower bounds for totally connected communication networks of , and are natural since at least Ω(κgln(1/ε))\Omega(\sqrt{\kappa_{g}}\ln(1/\varepsilon)) steps are necessary to solve a strongly convex and smooth problem up to a fixed precision, and at least Δ\Delta communication steps are required to transmit a message between any given pair of nodes.

The proof of Theorem 1 relies on splitting the function used by Nesterov to prove oracle complexities for strongly convex and smooth optimization on two nodes at distance Δ\Delta. One can show that most dimensions of the parameters θi,t\theta_{i,t} will remain zero, and local gradient computations may only increase the number of non-zero dimensions by one. Finally, at least Δ\Delta communication rounds are necessary in-between every gradient computation, in order to share information between the two nodes. The detailed proof is available as supplementary material.

For any graph of diameter Δ\Delta and any black-box procedure, there exists functions fif_{i} such that the time to reach a precision ε>0\varepsilon>0 is lower bounded by

This optimal convergence rate is achieved by distributing Nesterov’s accelerated gradient descent on the global function. Computing the gradient of fˉ\bar{f} is performed by sending all the local gradients fi\nabla f_{i} to a single node (denoted as master node) in Δ\Delta communication steps (which may involve several simultaneous messages), and then returning the new parameter θt+1\theta_{t+1} to every node in the network (which requires another Δ\Delta communication steps). In practice, summing the gradients can be distributed by computing a spanning tree (with the root as master node), and asking for each node to perform the sum of its children’s gradients before sending it to its parent. Standard methods as described by can be used for performing this parallelization of gradient computations.

This algorithm has three limitations: first, the algorithm is not robust to machine failures, and the central role played by the master node also means that a failure of this particular machine may completely freeze the procedure. Second, and more generally, the algorithm requires precomputing a spanning tree, and is thus not suited to time-varying graphs, in which the connectivity between the nodes may change through time (e.g. in peer-to-peer networks). Finally, the algorithm requires every node to complete its gradient computation before aggregating them on the master node, and the efficiency of the algorithm thus depends on the slowest of all machines. Hence, in the presence of non-uniform latency of the local computations, or the slow down of a specific machine due to a hardware failure, the algorithm will suffer a significant drop in performance.

3 Decentralized algorithms

The gossip algorithm is a standard method for averaging values across a network when its connectivity may vary through time. This approach was shown to be robust against machine failures, non-uniform latencies and asynchronous or time-varying graphs, and a large body of literature extended this algorithm to distributed optimization .

The convergence analysis of decentralized algorithms usually relies on the spectrum of the gossip matrix WW used for communicating values in the network, and more specifically on the ratio between the second smallest and the largest eigenvalue of WW, denoted γ\gamma. In this section, we show that, with respect to this quantity and κl\kappa_{l}, reaching a precision ε\varepsilon requires at least Ω(κlln(1/ε))\Omega(\sqrt{\kappa_{l}}\ln(1/\varepsilon)) gradient steps and Ω(κlγln(1/ε))\Omega\left(\sqrt{\frac{\kappa_{l}}{\gamma}}\ln(1/\varepsilon)\right) communication steps, by exhibiting a gossip matrix such that a corresponding lower bound exists.

where κl=β/α\kappa_{l}=\beta/\alpha is the local condition number.

The proof of Theorem 2 relies on the same technique as that of Theorem 1, except that we now split the two functions on a subset of a linear graph. These networks have the appreciable property that Δ1/γ\Delta\approx 1/\sqrt{\gamma}, and we can thus use a slightly extended version of Theorem 1 to derive the desired result. The complete proof is available as supplementary material.

For any γ>0\gamma>0, there exists a gossip matrix WW of eigengap γ\gamma and α\alpha-strongly convex, β\beta-smooth functions such that, with κl=β/α\kappa_{l}=\beta/\alpha, for any black-box procedure using WW the time to reach a precision ε>0\varepsilon>0 is lower bounded by

We will see in the next section that this lower bound is met for a novel decentralized algorithm called multi-step dual accelerated (MSDA) and based on the dual formulation of the optimization problem. Note that these results provide optimal convergence rates with respect to κl\kappa_{l} and γ\gamma, but do not imply that γ\gamma is the right quantity to consider on general graphs. The quantity 1/γ1/\sqrt{\gamma} may indeed be very large compared to Δ\Delta, for example for star networks, for which Δ=2\Delta=2 and 1/γ=n1/\sqrt{\gamma}=\sqrt{n}. However, on many simple networks, the diameter Δ\Delta and the eigengap of the Laplacian matrix are tightly connected, and Δ1/γ\Delta\approx 1/\sqrt{\gamma}. For example, for linear graphs, Δ=n1\Delta=n-1 and 1/γ2n/π1/\sqrt{\gamma}\approx 2n/\pi, for totally connected networks, Δ=1\Delta=1 and 1/γ=11/\sqrt{\gamma}=1, and for regular networks, 1/γΔ22ln2n1/\sqrt{\gamma}\geq\frac{\Delta}{2\sqrt{2}\ln_{2}{n}} . Finally, note that the case of totally connected networks corresponds to a previous complexity lower bound on communications proven by , and is equivalent to our result for centralized algorithms with Δ=1\Delta=1.

Optimal decentralized algorithms

In this section, we present a simple framework for solving the optimization problem in Eq. (1) in a decentralized setting, from which we will derive several variants, including a synchronized algorithm whose convergence rate matches the lower bound in Corollary 2 . Note that the naive approach of distributing each (accelerated) gradient step by gossiping does not lead to a linear convergence rate, as the number of gossip steps has to increase with the number of iterations to ensure the linear rate is preserved. We begin with the simplest form of the algorithm, before extending it to more advanced scenarios.

A standard approach for solving Eq. (1) (see ) consists in rewriting the optimization problem as

Furthermore, the equality constraint θ1==θn\theta_{1}=\cdots=\theta_{n} is equivalent to ΘW=0\Theta\sqrt{W}=0, where Θ=(θ1,,θn)\Theta=(\theta_{1},\ldots,\theta_{n}) and WW is a gossip matrix verifying the assumptions described in Section 2. Note that, since WW is positive semi-definite, W\sqrt{W} exists and is defined as W=VΣ1/2V\sqrt{W}=V^{\top}\Sigma^{1/2}V, where W=VΣVW=V^{\top}\Sigma V is the singular value decomposition of WW. The equality ΘW=0\Theta\sqrt{W}=0 implies that each row of Θ\Theta is constant (since Ker(W)=Span(\mathds1){\rm Ker}(\sqrt{W})={\rm Span}(\mathds{1})), and is thus equivalent to θ1==θn\theta_{1}=\cdots=\theta_{n}. This leads to the following primal version of the optimization problem:

where F(Θ)=i=1nfi(θi)F(\Theta)=\sum_{i=1}^{n}f_{i}(\theta_{i}). Since Eq. (11) is a convex problem, it is equivalent to its dual optimization problem:

The optimization problem in Eq. (12) is unconstrained and convex, and can thus be solved using a variety of convex optimization techniques. The proposed single-step dual accelerated (SSDA) algorithm described in Alg. (1) uses Nesterov’s accelerated gradient descent, and can be thought of as an accelerated version of the distributed augmented Lagrangian method of for ρ=0\rho=0. The algorithm is derived by noting that a gradient step of size η>0\eta>0 for Eq. (12) is

and the change of variable yt=λtWy_{t}=\lambda_{t}\sqrt{W} leads to

This equation can be interpreted as gossiping the gradients of the local conjugate functions fi(yi,t)\nabla f_{i}^{*}(y_{i,t}), since F(yt)ij=fj(yj,t)i\nabla F^{*}(y_{t})_{ij}=\nabla f_{j}^{*}(y_{j,t})_{i}.

The iterative scheme in Alg. (1) converges to Θ=θ\mathds1\Theta={\theta^{*}}\mathds{1}^{\top} where θ\theta^{*} is the solution of Eq. (1). Furthermore, the time needed for this algorithm to reach any given precision ε>0\varepsilon>0 is

This theorem relies on proving that the condition number of the dual objective function is upper bounded by κlγ\frac{\kappa_{l}}{\gamma}, and noting that the convergence rate for accelerated gradient descent depends on the square root of the condition number (see, e.g., ). A detailed proof is available as supplementary material.

2 Multi-Step Dual Accelerated method

The main problem of Alg. (1) is that it always performs the same number of gradient and gossip steps. When communication is cheap compared to local computations (τ1\tau\ll 1), it would be preferable to perform more gossip steps than gradient steps in order to propagate the local gradients further than the local neighborhoods of each node. This can be achieved by replacing WW by PK(W)P_{K}(W) in Alg. (1), where PKP_{K} is a polynomial of degree at most KK. If PK(W)P_{K}(W) is itself a gossip matrix, then the analysis of the previous section can be applied and the convergence rate of the resulting algorithm depends on the eigengap of PK(W)P_{K}(W). Maximizing this quantity for a fixed KK leads to a common acceleration scheme known as Chebyshev acceleration and the choice

where c2=1+γ1γc_{2}=\frac{1+\gamma}{1-\gamma} and TKT_{K} are the Chebyshev polynomials defined as T0(x)=1T_{0}(x)=1, T1(x)=xT_{1}(x)=x, and, for all k1k\geq 1,

Finally, verifying that this particular choice of PK(W)P_{K}(W) is indeed a gossip matrix, and taking K=1γK=\lfloor\frac{1}{\sqrt{\gamma}}\rfloor leads to Alg. (2) with an optimal convergence rate with respect to γ\gamma and κl\kappa_{l}.

The iterative scheme in Alg. (2) converges to Θ=θ\mathds1\Theta={\theta^{*}}\mathds{1}^{\top} where θ\theta^{*} is the solution of Eq. (1). Furthermore, the time needed for this algorithm to reach any given precision ε>0\varepsilon>0 is

The proof of Theorem 4 relies on standard properties of Chebyshev polynomials that imply that, for the particular choice of K=1γK=\lfloor\frac{1}{\sqrt{\gamma}}\rfloor, we have 1γ(PK(W))2\frac{1}{\sqrt{\gamma(P_{K}(W))}}\leq 2. Hence, Theorem 3 applied to the gossip matrix W=PK(W)W^{\prime}=P_{K}(W) gives the desired convergence rate. The complete proof is available as supplementary material.

3 Discussion and further developments

We now discuss several extensions to the proposed algorithms.

Local vs. global condition number: MSDA and SSDA depend on the worst strong convexity of the local functions miniαi\min_{i}\alpha_{i}, which may be very small. A simple trick can be used to depend on the average strong convexity. Using the proxy functions gi(θ)=fi(θ)(αiαˉ)θ22g_{i}(\theta)=f_{i}(\theta)-(\alpha_{i}-\bar{\alpha})\|\theta\|_{2}^{2} instead of fif_{i}, where αˉ=1niαi\bar{\alpha}=\frac{1}{n}\sum_{i}\alpha_{i} is the average strong convexity, will improve the local condition number from κl=maxiβiminiαi\kappa_{l}=\frac{\max_{i}\beta_{i}}{\min_{i}\alpha_{i}} to

Several algorithms, including EXTRA and DIGing , have convergence rates that depend on the strong convexity of the global function αg\alpha_{g}. However, their convergence rates are not optimal, and it is still an open question to know if a rate close to O(κg(1+τγ)ln(1/ε))O\left(\sqrt{\kappa_{g}}(1+\frac{\tau}{\sqrt{\gamma}})\ln(1/\varepsilon)\right) can be achieved with a decentralized algorithm.

Asynchronous setting: Accelerated stochastic gradient descent such as SVRG or SAGA can be used on the dual problem in Eq. (12) instead of accelerated gradient descent, in order to obtain an asynchronous algorithm with a linear convergence rate. The details and exact convergence rate of such an approach are left as future work.

Experiments

In this section, we compare our new algorithms, single-step dual accelerated (SSDA) descent and multi-step dual accelerated (MSDA) descent, to standard distributed optimization algorithms in two settings: least-squares regression and classification by logistic regression. Note that these experiments on simple generated datasets are made to assess the differences between existing state-of-the-art algorithms and the ones provided in Section 4, and do not address the practical implementation details nor the efficiency of the compared algorithms on real-world distributed platforms. The effect of latency, machine failures or variable communication time is thus left for future work.

We compare SSDA and MSDA to four state-of-the-art distributed algorithms that achieve linear convergence rates: distributed ADMM (D-ADMM) , EXTRA , a recent approach named DIGing , and the distributed version of accelerated gradient descent (DAGD) described in Section 3.2 and shown to be optimal among centralized algorithms. When available in the literature, we used the optimal parameters for each algorithm (see Theorem 2 by for D-ADMM and Remark 3 by for EXTRA). For the DIGing algorithm, the parameters provided by are very conservative, and lead to a very slow convergence. We thus manually optimized the parameter for this algorithm. The experiments are simulated using a generated dataset consisting of 10,00010,000 samples randomly distributed to the nodes of a network of size 100100. In order to assess the effect of the connectivity of the network, we ran each experiment on two networks: one 10×1010\times 10 grid and an Erdös-Rényi random network with parameter p=6100p=\frac{6}{100} (i.e. of average degree 66). The quality metric used in this section is be the maximum approximation error among the nodes of the network

where θ\theta^{*} is the optimal parameter of the optimization problem in Eq. (1).

2 Least-squares regression

The regularized least-squares regression problem consists in solving the optimization problem

Figure 1 and Figure 2 show the performance of the compared algorithms on two networks: a 10×1010\times 10 grid graph and an Erdös-Rényi random graph of average degree 66. All algorithms are linearly convergent, although their convergence rates scale on several orders of magnitude. In all experiments, the centralized optimal algorithm DAGD has the best convergence rate, while MSDA has the best convergence rate among decentralized methods. When the communication time is smaller than the computation time (τ1\tau\gg 1), performing several communication rounds per gradient iteration will improve the efficiency of the algorithm and MSDA substantially outperforms SSDA.

3 Logistic classification

The logistic classification problem consists in solving the optimization problem

Figure 3 and Figure 4 show the performance of the compared algorithms for logistic classification on two networks: a 10×1010\times 10 grid graph and an Erdös-Rényi random graph of average degree 66. As for least-squares regression, all algorithms are linearly convergent, and their convergence rates scale on several orders of magnitude. In this case, the centralized optimal algorithm DAGD is outperformed by MSDA, although the two convergence rates are relatively similar. Again, when the communication time is smaller than the computation time (τ1\tau\gg 1), performing several communication rounds per gradient iteration will improve the efficiency of the algorithm and MSDA substantially outperforms SSDA. Note that, in Figure 4(a), D-ADMM requires 383383 iterations to reach the same error obtained after only 1010 iterations of SSDA, demonstrating a substantial improvement over state-of-the-art methods.

Conclusion

In this paper, we derived optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications in a network. For the decentralized setting, we introduced the multi-step dual accelerated (MSDA) algorithm with a provable optimal linear convergence rate, and showed its high efficiency compared to other state-of-the-art methods, including distributed ADMM and EXTRA. The simplicity of the approach makes the algorithm extremely flexible, and allows for future extensions, including time-varying networks and an analysis for non-strongly-convex functions. Finally, extending our complexity lower bounds to time delays, variable computational speeds of local systems, or machine failures would be a notable addition to this work.

Appendix A Detailed proofs

If AdcA_{d}^{c}\neq\emptyset, then for any t0t\geq 0 and any black-box procedure one has, for all i{1,...,n}i\in\{1,...,n\},

Furthermore, by definition of ki,tk_{i,t}, one has θi,k=0\theta_{i,k}=0 for all k>ki,tk>k_{i,t}, and thus

and, since fˉA\bar{f}^{A} is α\alpha-strongly convex,

Finally, the solution of the global problem minθfˉA(θ)\min_{\theta}\bar{f}^{A}(\theta) is θk=(βαβ+α)k\theta^{*}_{k}=\left(\frac{\sqrt{\beta}-\sqrt{\alpha}}{\sqrt{\beta}+\sqrt{\alpha}}\right)^{k}. Combining this result with Eqs. (27), (28) and (29) leads to the desired inequality. ∎

Using the previous lemma with d=Δd=\Delta the diameter of G\mathcal{G} and A={v}A=\{v\} one of the pair of nodes at distance Δ\Delta returns the desired result. ∎

Let γn=1cos(πn)1+cos(πn)\gamma_{n}=\frac{1-\cos(\frac{\pi}{n})}{1+\cos(\frac{\pi}{n})} be a decreasing sequence of positive numbers. Since γ2=1\gamma_{2}=1 and limnγn=0\lim_{n}\gamma_{n}=0, there exists n2n\geq 2 such that γnγ>γn+1\gamma_{n}\geq\gamma>\gamma_{n+1}. The cases n=2n=2 and n3n\geq 3 are treated separately. If n3n\geq 3, let G\mathcal{G} be the linear graph of size nn ordered from node v1v_{1} to vnv_{n}, and weighted with wi,i+1=1a\mathds1{i=1}w_{i,i+1}=1-a\mathds{1}\{i=1\}. Then, if A={v1,...,vn/32}A=\{v_{1},...,v_{\lceil n/32\rceil}\} and d=(11/16)n1d=(1-1/16)n-1, we have AdcA|A_{d}^{c}|\geq|A| and Lemma 1 implies:

A simple calculation gives κl=1+(κg1)n2A\kappa_{l}=1+(\kappa_{g}-1)\frac{n}{2|A|}, and thus κgκl/16\kappa_{g}\geq\kappa_{l}/16. Finally, if we take WaW_{a} as the Laplacian of the weighted graph G\mathcal{G}, a simple calculation gives that, if a=0a=0, γ(Wa)=γn\gamma(W_{a})=\gamma_{n} and, if a=1a=1, the network is disconnected and γ(Wa)=0\gamma(W_{a})=0. Thus, by continuity of the eigenvalues of a matrix, there exists a value aa\in such that γ(Wa)=γ\gamma(W_{a})=\gamma. Finally, by definition of nn, one has γ>γn+12(n+1)2\gamma>\gamma_{n+1}\geq\frac{2}{(n+1)^{2}}, and d1516(2γ1)115γd\geq\frac{15}{16}(\sqrt{\frac{2}{\gamma}}-1)-1\geq\frac{1}{5\sqrt{\gamma}} when γγ3=13\gamma\leq\gamma_{3}=\frac{1}{3}.

For the case n=2n=2, we consider the totally connected network of 33 nodes, reweight only the edge (v1,v3)(v_{1},v_{3}) by aa\in, and let WaW_{a} be its Laplacian matrix. If a=1a=1, then the network is totally connected and γ(Wa)=1\gamma(W_{a})=1. If, on the contrary, a=0a=0, then the network is a linear graph and γ(Wa)=γ3\gamma(W_{a})=\gamma_{3}. Thus, there exists a value aa\in such that γ(Wa)=γ\gamma(W_{a})=\gamma, and applying Lemma 1 with A={v1}A=\{v_{1}\} and d=1d=1 returns the desired result, since then κg2κl/3\kappa_{g}\geq 2\kappa_{l}/3 and d=113γd=1\geq\frac{1}{\sqrt{3\gamma}}. ∎

A.2 Convergence rates of SSDA and MSDA

Each step of the algorithm can be decomposed in first computing gradients, and then communicating these gradients across all neighborhoods. Thus, one step takes a time 1+τ1+\tau. Moreover, the Hessian of the dual function F(λW)F^{*}(\lambda\sqrt{W}) is

where \otimes is the Kronecker product and IdI_{d} is the identity matrix of size dd. Also, note that, in Alg.(2), the current values xtx_{t} and yty_{t} are always in the image of WId\sqrt{W}\otimes I_{d} (i.e. the set of matrices xx such that x\mathds1=0x^{\top}\mathds{1}=0). The condition number (in the image of WId\sqrt{W}\otimes I_{d}) can thus be upper bounded by κlγ\frac{\kappa_{l}}{\gamma}, and Nesterov’s acceleration requires κlγ\sqrt{\frac{\kappa_{l}}{\gamma}} steps to achieve any given precision . ∎

First, since PK(W)P_{K}(W) is a gossip matrix, Theorem 3 implies the convergence of Alg.(3). In order to simplify the analysis, we multiply WW by 2(1+γ)λ1(W)\frac{2}{(1+\gamma)\lambda_{1}(W)}, so that the resulting gossip matrix has a spectrum in [1c21,1+c21][1-c_{2}^{-1},1+c_{2}^{-1}]. Applying Theorem 6.2 in with α=1c21\alpha=1-c_{2}^{-1}, β=1+c21\beta=1+c_{2}^{-1} and γ=0\gamma=0 implies that the minimum

is attained by PK(x)=1TK(c2(1x))TK(c2)P_{K}(x)=1-\frac{T_{K}(c_{2}(1-x))}{T_{K}(c_{2})}. Finally, Corollary 6.3 of leads to

where c1=1γ1+γc_{1}=\frac{1-\sqrt{\gamma}}{1+\sqrt{\gamma}}, and taking K=1γK=\lfloor\frac{1}{\sqrt{\gamma}}\rfloor implies

The time required to reach an ε>0\varepsilon>0 precision using Alg.(3) is thus upper bounded by O((1+Kτ)κlγ(PK(W))ln(1/ε))=O(κl(1+1γτ)ln(1/ε))O\left((1+K\tau)\sqrt{\frac{\kappa_{l}}{\gamma(P_{K}(W))}}\ln(1/\varepsilon)\right)=O\left(\sqrt{\kappa_{l}}(1+\frac{1}{\sqrt{\gamma}}\tau)\ln(1/\varepsilon)\right). ∎

Appendix B Composite problems for machine learning

To maximize the dual problem, we can use (accelerated) proximal gradient, with the updates:

In order to compute the convergence rate of such an algorithm, if we assume that:

the largest singular value of each BiB_{i} is less than MM,

then we simply need to compute the condition number of the quadratic function

With the choice \rho^{2}=\frac{1}{\lambda_{\max}(W)}\big{(}\frac{c}{\mu}+M^{2}), it is lower bounded by \big{(}1+\mu\frac{M^{2}}{c}\big{)}\frac{4}{\gamma}, which is a natural upper bound on κl/γ\kappa_{l}/\gamma. Thus this essentially leads to the same convergence rate than the non-composite case with the Nesterov and Chebyshev accelerations, i.e. κl/γ\sqrt{\kappa_{l}/\gamma}.

The bound on the conditional number may be shown through the two inequalities:

References