A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates

Zhi Li, Wei Shi, Ming Yan

I Introduction

This paper focuses on the following decentralized optimization problem:

Specific problems of form (1) that require a decentralized computing architecture have appeared in various areas including networked multi-vehicle coordination, distributed information processing and decision making in sensor networks, as well as distributed estimation and learning. Some examples include distributed average consensus , distributed spectrum sensing , information control , power systems control , statistical inference and learning . In general, decentralized optimization fits the scenarios where the data is collected and/or stored in a distributed network, a fusion center is either inapplicable or unaffordable, and/or computing is required to be performed in a distributed but collaborative manner by multiple agents or by network designers.

The study on distributed algorithms dates back to the early 1980s . Since then, due to the emergence of large-scale networks, decentralized (optimization) algorithms, as a special type of distributed algorithms for solving problem (1), have received attention. Many efforts have been made on star networks with one master agent and multiple slave agents . This scheme is “centralized” due to the use of a “master” agent. It may suffer a single point of failure and may violate the privacy requirement in certain applications. In this paper, we focus on solving (1) in a decentralized fashion, where no “master” agent is used.

Incremental algorithms can solve (1) without the need of a “master” agent and it is based on a directed ring network. To handle general (possibly time-varying) networks, the distributed sub-gradient algorithm was proposed in . This algorithm and its variants are intuitive and simple but usually slow due to the diminishing step-size that is needed to obtain a consensual and optimal solution, even if the objective functions are differentiable and strongly convex. With a fixed step-size, these distributed methods can be fast, but they only converge to a neighborhood of the solution set which depends on the step-size. This phenomenon creates an exactness-speed dilemma .

A class of distributed approaches that bypass this dilemma is based on introducing the Lagrangian dual. The resulting algorithms include distributed dual decomposition and decentralized alternating direction method of multipliers (ADMM) . The decentralized ADMM and its proximal-gradient variant can employ a fixed step-size to achieve $O(1/k)$ rate under general convexity assumptions . Under the strong convexity assumption, the decentralized ADMM has linear convergence for time-invariant undirected graphs . There exist some other distributed methods that do not (explicitly) use dual variables but can still converge to an optimal consensual solution with fixed step-sizes. In particular, works in employ multi-consensus inner loops, Nesterov’s acceleration, and/or the adapt-then-combine (ATC) strategy. Under the assumption that the objectives have bounded and Lipschitz continuous gradientsThis means that the nonsmooth terms $r_{i}$ ’s are absent. Such assumption is much stronger than the one used for achieving the $O(1/k^{2})$ rate in Nesterov’s optimal gradient method ., the algorithm proposed in has $O\left(\ln(k)/k^{2}\right)$ rate. References use a difference structure to cancel the steady state error in decentralized gradient descent , thereby developing the algorithm EXTRA and its proximal-gradient variant PG-EXTRA. It converges at an $O(1/k)$ rate when the objective function in (1) is convex and has a linear convergence rate when the objective function is strongly convex and $r_{i}(x)=0$ for all $i$ .

A number of recent works employed the so-called gradient tracking to conquer different issues . To be specific, works relax the step-size rule to allow uncoordinated step-sizes across agents. Paper solves non-convex optimization problems. Paper aims at achieving geometric convergence over time-varying graphs. Work improves the convergence rate over EXTRA, and its formulation is the same as that in .

Another topic of interest is decentralized optimization over directed graphs , which is beyond the scope of this paper.

I-B Proposed Algorithm and its Advantages

If all local variables are identical, i.e., $x_{1}=\dots=x_{n}$ , we say that ${\mathbf{x}}$ is consensual. In addition, we define

We have $f({\mathbf{x}})=s({\mathbf{x}})+r({\mathbf{x}})$ . The gradient of $s$ at ${\mathbf{x}}$ is given in the same way as ${\mathbf{x}}$ in (3) by

𝐼𝐖2\widetilde{{\mathbf{W}}}=\frac{I+{\mathbf{W}}}{2} here. By making a simple modification over PG-EXTRA , our proposed algorithm brings a big improvement in the speed and the dependency of convergence over networks. To better expose this simple modification, let us compare a special case of our proposed algorithm with EXTRA for the smooth case, i.e., $r({\mathbf{x}})=0$ .

A large and network independent step-size for NIDS: All works mentioned above either employ pessimistic step-sizes or have network dependent upper bounds on step-sizes. Furthermore, the step-sizes for the strongly convex case are more conservative. For example, the step-size used to achieve linear convergence rates for EXTRA in is in the order of $O(\mu/L^{2})$ , where $\mu$ and $L$ are the strong convexity constant of $s({\mathbf{x}})$ and Lipschitz constant of $\nabla s({\mathbf{x}})$ , respectively. As a contrast, the centralized gradient descent can choose a step-size in the order of $O(1/L)$ . The upper bound of step-size for EXTRA was recently improved to $(5+3\lambda_{n}({\mathbf{W}}))/(4L)$ in . We can choose $1/(2L)$ for any ${\mathbf{W}}$ satisfying Assumption 1. Another example of employing a constant step-size in distributed optimization is DIGing . Although its ATC variant in was shown to converge faster than DIGing, the step-size is still very conservative compared to $O(1/L)$ . We will show that the step-size of NIDS can have the same upper bound $2/L$ as that of the centralized gradient descent. The achievable step-sizes of NIDS for $o(1/k)$ rate in the general convex case and the linear convergence rate in the strongly convex case are both in the order of $O(1/L)$ . Furthermore, NIDS allows each agent to have an individual step-size. Each agent $i$ can choose a step-size $\alpha_{i}$ that is as large as $2/L_{i}$ on any connected network, where $L_{i}$ is the Lipschitz constant of $\nabla s_{i}(x)$ and $L_{i}\leq L$ for all $i$ ( $L=\max_{i}L_{i}$ ). Apart from the step-sizes, to run NIDS, a common/public parameter $c$ is needed for the construction of $\widetilde{{\mathbf{W}}}$ (see (8) for the algorithm and $\widetilde{{\mathbf{W}}}$ ). This parameter $c$ can be chosen without any knowledge of the network (or the mixing matrix ${\mathbf{W}}$ ). For example, $c={1}/({2\max_{i}\alpha_{i}})$ . Table I provides an overview of algorithmic configurations for EXTRA and NIDS. NIDS works as long as each agent can estimate its local functional parameter: No agent needs any other global information including the number of agents in the whole network except the largest step-size, if it is not the same for all agents.

In the line of research of optimization over heterogeneous networks, after the initial work regarding uncoordinated step sizes, references and introduce and analyze a diffusion strategy with corrections that achieves an exact linear convergence with uncoordinated step sizes. This exact diffusion algorithm is still related to the Lagrangian method but can be considered as having incorporated a CTA structure. The CTA strategy can, similar to the ATC strategy, improve the convergence speed of certain consensus optimization algorithms (see [41, Remark 3] and [46, Section II.C]). However, the analysis in , though allowing step size mismatch across the network, does not take into consideration the heterogeneity of agents’ functional conditions. Furthermore, their upper bound for the step-size is in the order of $O(\mu/L^{2})$ .

Sublinear convergence rate for the general case: Under the general convexity assumption, we show that NIDS has a convergence rate of $o(1/k)$ , which is slightly better than the $O(1/k)$ rate of PG-EXTRA. Because the step-size of NIDS does not depend on the network topology and is much larger than that of PG-EXTRA, NIDS can be much faster than PG-EXTRA, as shown in the numerical experiments.

Linear convergence rate for the strongly convex case: Let us first define “scalability”. When the iterate $x^{k}$ of an algorithm converges to the optimal solution $x^{*}$ linearly, i.e., $\|x^{k}-x^{*}\|^{2}=O((1-1/S)^{k})$ with some positive constant $S$ , we say that the algorithm needs to run $O(S)\log(1/\epsilon)$ iterations to reach $\epsilon$ -accuracy. So we call $O(S)$ the scalability of the algorithm.

For the case where the non-smooth terms are absent and the functions $\{s_{i}\}_{i=1}^{n}$ are strongly convex, we show that NIDS achieves a linear convergence rate whose dependencies on the functions $\{s_{i}\}_{i=1}^{n}$ and the network topology are decoupled. To be specific, to reach $\epsilon$ -accuracy, the number of iterations needed for NIDS is

where $\lambda_{i}({\mathbf{W}})$ is the $i$ th largest eigenvalue of ${\mathbf{W}}$ . Both $\frac{L}{\mu}$ and $\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}$ are typical in the literature of optimization and average consensus, respectively. The value $\frac{L}{\mu}$ , also called the condition number of the objective function, is aligned with the scalability of the standard gradient descent . The value $\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}$ is understood as the condition numberWhen we choose ${\mathbf{W}}={\mathbf{I}}-\tau\textbf{\L}$ where Ł is the Laplacian of the underlying graph and $\tau$ is a positive tunable constant, we have $\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}=\frac{\lambda_{1}(\textbf{\L})}{\lambda_{n-1}(\textbf{\L})}$ which is the finite condition number of Ł. Note that $\lambda_{n}(\textbf{\L})=0$ . of the network and aligned with the scalability of the simplest linear iterations for distributed averaging .

Separating the condition numbers of the objective function and the network provides a way to determine the bottleneck of NIDS for a specific problem and a given network. Therefore, the system designer might be able to smartly apply preconditioning on $\{s_{i}\}_{i=1}^{n}$ or improve the connectivity of the network to cost-effectively obtain a better convergence.

Summary and comparison of state-of-the-art algorithms: We list the properties of a few relevant algorithms in Table II. We let $\sigma:=\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}$ . This quantity is directly affected by the network topology and how the matrix ${\mathbf{W}}$ is defined, thus is also related to the consensus ability of a network. When the network is fully connected (a complete graph), we can choose ${\mathbf{W}}$ so that $\lambda_{2}({\mathbf{W}})=\lambda_{n}({\mathbf{W}})=0$ and thus $\sigma=1$ (the best case); in general $\sigma\geq 1$ since $0<1-\lambda_{2}({\mathbf{W}})\leq 1-\lambda_{n}({\mathbf{W}})<2$ ; in the worst case, we have $\sigma\leq\frac{1}{1-\lambda_{2}({\mathbf{W}})}=O(n^{2})$ [4, Section 2.3]. We keep $\sigma$ in the bounds/rates of involved algorithms for a fair comparison instead of focusing on the worst case that often gives pessimistic/conservative results. We omit “ $O(\cdot)$ ” in “Bounds of step-sizes” and “Scalabilities” for brevity and only compare the effect of functional properties ( $\mu$ and $L$ ) and network properties ( $\sigma$ and/or $n$ ). Before talking into details, let us clarify a few points. In EXTRA, $\mu_{g}$ is a quantity that is associated with the strong convexity of the original function $\bar{f}(x)$ , so it covers a larger class of problems; In DIGing, $\bar{\mu}$ is the mean value of the strong convexity constants of local objectives; In Acc-DNGD, the step-size for the convex case contains $k$ , the current number of iterations. Thus it represents a diminishing step-size sequence; In Optimal , the total number of iterations $K$ is used to determine the step-size for the convex case. In addition, they apply to problems in which the objectives are dual friendly (see for its definition). Note some types of objectives are suitable for gradient update, some are suitable for dual gradient update (dual friendly), and some are suitable for proximal update.

Finding the algorithm with the lowest per-iteration cost depends on the problem (functions). Apparently, our bounds on step-sizes and the corresponding scalability/rate are better than those given in EXTRA and Harness (see Table II). When $\sigma$ is close to $1$ (the graph is well connected), the step-size bound and scalability given in DIGing are the same as NIDS. However, when $\sigma$ is large, their result becomes rather conservative. Acc-DNGD and Optimal have improved the scalability/rate of gradient-based distributed optimization by employing Nesterov’s acceleration technique on primal and dual problems, respectively. For the convex case, our rate is worse than theirs because our algorithm does not employ Nesterov’s acceleration. For the primal distributed gradient method after acceleration , the scalability in $\sigma$ is still worse than our result. Algorithm Optimal achieves the optimal scalability/rate for distributed optimization. However, as we have mentioned above, their algorithms are dual based thus apply to a different class of problems. In addition, NIDS supports proximable non-smooth functions and uncoordinated step-sizes while these have not been considered in Acc-DNGD and Optimal. To sum up, we have reached the best possible performance of first-order algorithms for distributed optimization without acceleration. Further improving the performance by incorporating Nesterov’s techniques to our algorithm will be a future direction.

Finally, we note that, references , appearing simultaneously with this work, also proposed (6b) to enlarge the step-size and use column stochastic matrices rather than symmetric doubly stochastic matrices. However, their algorithm only works for smooth problems, and their analysis seems to be restrictive and requires twice differentiability and strong convexity of $\{s_{i}\}_{i=1}^{n}$ . The stepsize is also in the order of $\mu/L^{2}$ .

I-C Future Works

The capability of our algorithm using purely locally determined parameters increases its potential to be extended to dynamic networks with a time-varying number of nodes. Given such flexibility, we may use similar schemes to solve the decentralized empirical risk minimization problems. Furthermore, it also enhances the privacy of the agents through allowing each agent to perform their own optimization procedure without negotiation on any parameter.

By using Nesterov’s acceleration technique, reference shows that the scalability of a new average consensus protocol can be improved to $O(n)$ ; when the nonsmooth terms $r_{i}$ ’s are absent, reference shows that the scalability of a new dual based accelerated distributed gradient method can be improved to $O(\sqrt{\sigma L/\mu})$ . One of our future work is exploring the convergence rates/scalability of the Nesterov’s accelerated version of our algorithm.

I-D Paper Organization

The rest of this paper is organized as follows. To facilitate the description of the technical ideas, the algorithms, and the analysis, we introduce additional notation in Subsection I-E. The intuition for the network-independent step-size is provided in Section II. In Section III, we introduce our algorithm NIDS and discuss its relation to some other existing algorithms. In Section IV, we first show that NIDS can be understood as an iterative algorithm for seeking a fixed point. Following this, we establish that NIDS converges at an $o(1/k)$ rate for the general convex case and a linear rate for the strongly convex case. Then, numerical simulations are given in Section V to corroborate our theoretical claims. Final remarks are given in Section VI.

I-E Notation

II Intuition for Network-independent step-size

In this section, we provide an intuition for the network-independent step-size for NIDS with only the differentiable function $s$ . The decentralized optimization problem is equivalent to

where $({\mathbf{I}}-{\mathbf{W}})^{1/2}$ is the square root of ${\mathbf{I}}-{\mathbf{W}}$ , and the constraint is the same as the consensual condition with the mixing matrix ${\mathbf{W}}$ given in Assumption 1. Denote $\textbf{{\L}}=({\mathbf{I}}-{\mathbf{W}})^{1/2}$ . The corresponding optimality condition with the introduction of the dual variable ${\mathbf{p}}$ is

EXTRA is equivalent to the Condat-Vu primal-dual algorithm , and it can be further explained as a forward-backward splitting applied to the equation, i.e.,

It is equivalent to EXTRA after ${\mathbf{p}}$ is eliminated. In this case, the new metric is a full matrix, and therefore, the upper bound of the step-size $\alpha$ depends on the matrix Ł. To be more specific,

which gives $\alpha\leq 2(1+\lambda_{\min}({\mathbf{W}}))/L$ . A larger and optimal upper bound for the step-size of EXTRA is shown in (See Table I), and it still depends on ${\mathbf{W}}$ . However, we choose a block diagonal metric and have

which is equivalent to NIDS after ${\mathbf{p}}$ is eliminated. Because the new metric is block diagonal, and the nonexpansiveness of the forward step depends on the function only, i.e., $\alpha\leq 2/L$ .

III Proposed Algorithm NIDS

In this section, we describe our proposed NIDS in Algorithm 1 for solving (1) in more details and explain the connections to other related methods.

The mixing matrix satisfies the following assumption, which comes from .

(Decentralized property). If $i\neq j$ and $(i,j)\notin{\mathcal{E}}$ , then $w_{ij}=0$ ;

(Symmetry). ${\mathbf{W}}={\mathbf{W}}^{T}$ ;

(Null space property). $\mathbf{Null}({\mathbf{I}}-{\mathbf{W}})=\mathbf{span}({\mathbf{1}}_{n\times 1})$ ;

(Spectral property). $2{\mathbf{I}}\succcurlyeq{\mathbf{W}}+{\mathbf{I}}\succ\mathbf{0}_{n\times n}$ .

Assumption 1 implies that the eigenvalues of ${\mathbf{W}}$ lie in $(-1,1]$ and the multiplicity of eigenvalue $1$ is one, i.e., $1=\lambda_{1}({\mathbf{W}})>\lambda_{2}({\mathbf{W}})\geq\cdots\geq\lambda_{n}({\mathbf{W}})>-1$ . Item 3 of Assumption 1 shows that $({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}=\mathbf{0}$ and the orthogonal complement of $\mathbf{span}({\mathbf{1}}_{n\times 1})$ is the row space of ${\mathbf{I}}-{\mathbf{W}}$ , which is also the column space of ${\mathbf{I}}-{\mathbf{W}}$ because of the symmetry of ${\mathbf{W}}$ .

The functions $\{s_{i}\}_{i=1}^{n}$ and $\{r_{i}\}_{i=1}^{n}$ satisfy the following assumption.

Functions $\{s_{i}(x)\}_{i=1}^{n}$ and $\{r_{i}(x)\}_{i=1}^{n}$ are lower semi-continuous proper convex, and $\{s_{i}(x)\}_{i=1}^{n}$ have Lipschitz continuous gradients with constants $\{L_{i}\}_{i=1}^{n}$ , respectively. Thus, we have

where $\widetilde{\mathbf{W}}={\mathbf{I}}-c\Lambda({\mathbf{I}}-{\mathbf{W}})$ and $c$ is chosen such that $\Lambda^{-1/2}\widetilde{\mathbf{W}}\Lambda^{1/2}={\mathbf{I}}-c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}$ . This condition shows that the upper bound of the parameter $c$ depends on ${\mathbf{W}}$ and $\Lambda$ . When the information about ${\mathbf{W}}$ is not given, we can just let $c=1/(2\max_{i}\alpha_{i})$ because $\lambda_{n}({\mathbf{W}})>-1$ . To set such a parameter, a preprocessing step is needed to obtain the maximum. However, since the maximum can be easily computed in a connected network in no more than $n-1$ rounds of communication wherein each node repeatedly takes maximum of the values from neighbors, the cost of this preprocessing is essentially negligible compared to the worst-case running time of our optimization protocol.

If all agents choose the same step-size, i.e., $\Lambda=\alpha{\mathbf{I}}$ , and we let $c=1/(2\alpha)$ , (8) becomes

The only difference between NIDS and PG-EXTRA is that the mixing operation is further applied to the successive difference of the gradients $-\alpha\nabla s({\mathbf{x}}^{k})+\alpha\nabla s({\mathbf{x}}^{k-1})$ in NIDS.

When there is no function $r({\mathbf{x}})$ , (8) becomes

and it further reduces to (6b) when $\Lambda=\alpha{\mathbf{I}}$ and $c=1/(2\alpha)$ . Note that, though (6b) appears in , its convergence still needs a small step-size that also depends on the network topology and the strong convexity constant. In Theorem 1 of , the upper bound for the step-size is also $O(\mu/L^{2})$ , which is the same as that of PG-EXTRA.

IV Convergence Analysis of NIDS

In order to show the convergence of NIDS, we also need the following assumption.

To simplify the analysis, we introduce a new sequence $\{{\mathbf{d}}^{k}\}_{k\geq 0}$ which is defined as

Using the sequence $\{{\mathbf{x}}^{k}\}_{k\geq 0}$ , we obtain a recursive (update) relation for $\{{\mathbf{d}}^{k}\}_{k\geq 0}$ :

where the second equality comes from the update of ${\mathbf{z}}^{k+1}$ in (8a) and the last one holds because of the definition of ${\mathbf{d}}^{k}$ in (11). Therefore, the iteration (8) is equivalent to, with the update order $({\mathbf{x}},{\mathbf{d}},{\mathbf{z}})$ ,

in the sense that both (8) and (12) generate the same $\{{\mathbf{x}}^{k},{\mathbf{z}}^{k}\}_{k>0}$ sequence.

Because ${\mathbf{x}}^{k}$ is determined by ${\mathbf{z}}^{k}$ only and can be eliminated from the iteration, iteration (12) is essentially an operator for $({\mathbf{d}},{\mathbf{z}})$ . Note that we have ${\mathbf{d}}^{1}={\Lambda^{-1}}({\mathbf{x}}^{0}-{\mathbf{z}}^{1})-\nabla s\left({\mathbf{x}}^{0}\right)=\mathbf{0}$ from Algorithm 1. Therefore, from the update of ${\mathbf{d}}^{k+1}$ in (12b), ${\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ for all $k$ . In fact, any ${\mathbf{z}}^{1}$ such that ${\mathbf{d}}^{1}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ works for NIDS. The following two lemmas show the relation between fixed points of (12) and optimal solutions of (1). The proofs for all lemmas and propositions are included in the supplemental material.

$({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ is a fixed point of (12) if and only if there exists a subgradient ${\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*})$ such that ${\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*}$ and

${\mathbf{x}}^{*}$ is consensual with $x^{*}_{1}=x^{*}_{2}=\cdots=x^{*}_{n}=x^{*}$ being an optimal solution of problem (1) if and only if there exists ${\mathbf{p}}^{*}$ and a subgradient ${\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*})$ such that:

In addition, $({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*})$ is a fixed point of iteration (12).

Lemma 2 shows that we can find a fixed point of iteration (12) to obtain an optimal solution of problem (1). It also tells us that we need ${\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ to get an optimal solution of problem (1). Therefore, we need ${\mathbf{d}}^{1}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ .

Let ${\mathbf{M}}=c^{-1}({\mathbf{I}}-{\mathbf{W}})^{\dagger}-\Lambda$ with ${\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}$ . Then $\|\cdot\|_{{\mathbf{M}}}$ is a norm defined for ${\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ .

The following lemma compares the distance to a fixed point of (12) for two consecutive iterates.

Let $({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ be a fixed point of iteration (12) with ${\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ . The update $({\mathbf{d}}^{k},{\mathbf{z}}^{k})\Rightarrow({\mathbf{d}}^{k+1},{\mathbf{z}}^{k+1})$ in (12) satisfies

From the update of ${\mathbf{z}}^{k+1}$ in (12c), we have

where the second equality comes from (12b), (14b), and ${\mathbf{d}}^{k+1}-{\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ . From (12a), we have that

The inequality and the second equality comes from (17) and (16), respectively. The first and fourth equalities hold because of the update of ${\mathbf{z}}^{k+1}$ in (12c). Using $2\langle{\mathbf{a}},{\mathbf{b}}\rangle=\|{\mathbf{a}}+{\mathbf{b}}\|^{2}-\|{\mathbf{a}}\|^{2}-\|{\mathbf{b}}\|^{2}$ and rearranging the previous inequality give us that

As explained in Section II, NIDS is equivalent to the primal-dual algorithm applied to problem

where $\iota(\cdot)$ is the indicator function, which return 0 for $\mathbf{0}$ and $+\infty$ otherwise, with the metric matrix being

We apply [56, Theorem 1] and obtain the following sublinear convergence result.

Furthermore, $({\mathbf{d}}^{k},{\mathbf{z}}^{k})$ converges to a fixed point $(\bar{\mathbf{d}},\bar{\mathbf{z}})$ of iteration (12) and $\bar{\mathbf{d}}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ , if ${\mathbf{I}}\succ c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ .

Note the convergence in Theorem 1 is shown in ${\mathbf{z}}$ and ${\mathbf{d}}$ . We will show the convergence in terms of (14). Recall that

where ${\mathbf{q}}^{k}\in\partial r({\mathbf{x}}^{k})$ . Therefore, $\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|_{\Lambda^{-1}}^{2}\rightarrow 0$ implies the convergence in terms of (14a).

where the second equality comes from ${\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ . Thus $\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|_{\Lambda^{-1}}^{2}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|^{2}_{\mathbf{M}}\rightarrow 0$ implies the convergence in terms of (14b).

IV-B Linear convergence for special cases

In this subsection, we provide the linear convergence rate for the case when $r({\mathbf{x}})=0$ , i.e., ${\mathbf{z}}^{k}={\mathbf{x}}^{k}$ in NIDS.

If $\{s_{i}(x)\}_{i=1}^{n}$ are strongly convex with parameters $\{\mu_{i}\}_{i=1}^{n}$ , then

The first inequality comes from ${\mathbf{x}}^{k+1}={\mathbf{x}}^{k}-\Lambda\nabla s\left({\mathbf{x}}^{k}\right)-{\mathbf{d}}^{k+1}$ , $\Lambda\nabla s\left({\mathbf{x}}^{*}\right)+{\mathbf{d}}^{*}=\mathbf{0}$ , (7), and (20). Combing (23) and (32), we have

Let $\rho$ defined as (21), then we have (22). ∎

The condition ${\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ implies that $c\leq\lambda_{n-1}(\Lambda^{-1/2}({\mathbf{I}}-{\mathbf{W}})^{\dagger}\Lambda^{-1/2})$ .

If the agents across the whole network use an identical stepsize $\alpha$ , that is, ${\Lambda}=\alpha{\mathbf{I}}$ , then

A concise but informative expression of the rate $\rho=\max\left(1-{\min_{i}\mu_{i}\over\max_{i}L_{i}},{\lambda_{2}({\mathbf{W}})-\lambda_{n}({\mathbf{W}})\over 1-\lambda_{n}({\mathbf{W}})}\right)$ can be obtained when we specifically choose $\alpha={1\over\max_{i}L_{i}}$ and $c={1\over(1-\lambda_{n}({\mathbf{W}}))\alpha}$ . When $\lambda_{n}({\mathbf{W}})$ is not given, we choose $c={1/(2\alpha)}$ and obtain the scalability $\max\left({\max_{i}L_{i}\over\min_{i}\mu_{i}},{2\over 1-\lambda_{2}({\mathbf{W}})}\right)$ . In this case, the network impact and the functional impact are decoupled.

If we let ${\Lambda}={\mathbf{L}}^{-1}$ and $c=\lambda_{n-1}({\mathbf{L}}^{-1/2}({\mathbf{I}}-{\mathbf{W}})^{\dagger}{\mathbf{L}}^{-1/2})$ , then the rate becomes

When $\lambda_{n}({\mathbf{W}})$ is not given, we choose $c=1/(2\max_{i}\alpha_{i})=\min_{i}L_{i}/2$ and obtain the scalability $\max\left(\max_{i}\frac{L_{i}}{\mu_{i}},\frac{\max_{i}L_{i}}{\min_{i}L_{i}}\cdot\frac{2}{1-\lambda_{2}({\mathbf{W}})}\right)$ . In this case, the networking impact is coupled with the function factors, i.e., the smoothness heterogeneity $\frac{\max_{i}L_{i}}{\min_{i}L_{i}}$ is multiplied on the networking impact. While the other number depends on the functional condition numbers $\frac{L_{i}}{\mu_{i}}$ ’s only.

Theorem 2 separates the dependence of the linear convergence rate on the functions and the network structure. In our current scheme, all the agents perform information exchange and the proximal-gradient step once in each iteration. If the proximal-gradient step is expensive, this explicit rate formula can help us to decide whether the so-called multi-step consensus can help reducing the computational time.

For the sake of simplicity, let us assume for this moment that all the agents have the same strong convexity constant $\mu$ and gradient Lipschitz continuity constant $L$ . Suppose that the “ $t$ -step consensus” technique is employed, i.e., the mixing matrix ${\mathbf{W}}$ in our algorithm is replaced by ${\mathbf{W}}^{t}$ , where $t$ is a positive integer. Then to reach $\epsilon$ -accuracy, the number of iterations needed is

When $L/\mu=1$ and step sizes are chosen as $\Lambda={\mathbf{L}}^{-1}$ , it says that we should let $t\rightarrow+\infty$ if the graph is not a complete graph. Such theoretical result is correct in intuition since in this case, the centralized gradient descent only needs one step to reach optimal and the bottleneck in decentralized optimization is the network.

Suppose $t_{\max}$ is a reasonable upper bound on $t$ , which is set by the system designer. It is difficult to explicitly find an optimal $t$ . But with the above analysis as an evidence, we suggest that one choose $t=\min\left([\log_{\lambda_{2}({\mathbf{W}})}(1-\frac{\mu}{L})],t_{\max}\right)$ if $1-\frac{\mu}{L}>\lambda_{2}({\mathbf{W}})$ ; otherwise $t=1$ . Here $[\cdot]$ gives the nearest integer.

If the bottleneck is on the functions, we can introduce a mapping $x=By$ and change the unknown variable from $x$ to $y$ . E.g., if the function $s_{i}(x)$ is a composition of a convex function with a linear mapping, replacing $x$ using $y$ changes the linear mapping and the condition number $L/\mu$ of the function. When $B$ is diagonal, it is similar to the column normalization in machine learning applications. There are other possible ways for reducing the condition number of the functions. It is out of the scope of this work, and we leave this as future work.

V Numerical Experiments

In this section, we compare the performance of NIDS with several state-of-the-art algorithms for decentralized optimization. These methods are

The DIGing-ATC . For reference, the DIGing-ATC updates are provided as follows:

The accelerated distributed Nesterov gradient descent (Acc-DNGD-SC in );

The (dual friendly) optimal algorithm (OA) for distributed optimization (equation (7) in ).

Note there are two rounds of communication in each iteration of DIGing-ATC and Acc-DNGD-SC while there is only one round in that of EXTRA/NIDS/OA. For all the experiments, we first compute the exact solution ${\mathbf{x}}^{*}$ for (1) using the centralized (proximal) gradient descent. All networks are randomly generated with connectivity ratio $\tau$ , where $\tau$ is defined as the number of actual edges divided by the total number of possible edges ${{n(n-1)}\over 2}$ . We will report the specific $\tau$ used in each test. The mixing matrix ${\mathbf{W}}$ is always chosen with the Metropolis rule (see and [36, Section 2.4]).

The experiments are carried in Matlab R2016b running on a laptop with Intel i7 CPU @ 2.60HZ, 16.0 GB of RAM, and Windows 10 operating system. The source codes for reproducing the numerical results can be accessed at https://github.com/mingyan08/NIDS.

In order to ensure that each function ${1\over 2}\|{\mathbf{M}}_{i}x-y_{i}\|^{2}$ is strongly convex, we choose $m_{i}=60$ and $p=50$ and set the number of nodes $n=40$ . For the first experiment, we choose ${\mathbf{M}}_{i}$ such that the Lipshchitz constant of $\nabla s_{i}$ satisfies $L_{i}=1$ and the strongly convex constant $\mu_{i}=0.5$ for all $i$ . Based on Remark 4, we choose $\alpha={1/(\max_{i}L_{i})}=1$ and $c={1/(1-\lambda_{n}({\mathbf{W}}))}$ for NIDS. In addition, we choose $c={1/2}$ such that $\widetilde{{\mathbf{W}}}={{\mathbf{I}}+{\mathbf{W}}\over 2}$ , which gives the same as that for EXTRA.

The comparison of these methods (NIDS with $c={1/((1-\lambda_{n}({\mathbf{W}}))\alpha)}$ , NIDS with $c=1/2$ , EXTRA, DIGing-ATC, Acc-DNGD-SC, and OA) is shown in Fig. 1 for two different networks with connectivity ratios $\tau=0.35$ (top) and $\tau=0.45$ (bottom), respectively. It shows better performance of NIDS in both choices of $c$ (corresponding to known ${\mathbf{W}}$ and unknown ${\mathbf{W}}$ ) than that of other algorithms. NIDS with $c={1/((1-\lambda_{n}({\mathbf{W}}))\alpha)}$ always takes less than half the number of iterations used by EXTRA to reach the same accuracy. In our experiment, DIGing-ATC appears to be sensitive to networks. Under a better connected network (see Fig. 1 bottom), DIGing-ATC can catch up with NIDS with $c={1/(2\alpha)}$ . The theoretical step-size of Acc-DNGD-SC is too small due to a very small constant in the bound in , and the convergence of Acc-DNGD-DC under such theoretical step-size in our test is slow and uncompetitive. Thus we have carefully tuned its step-size. With the hand-optimized step-size, Acc-DNGD-SC can achieve a comparable performance as NIDS with $c={1/(2\alpha)}$ . In the plots, we observe that OA is fast in terms of the number of iterations. However, in this case, the per-iteration cost of OA is relatively high since it requires solving a system of linear equations at each iteration (though factorization tricks may be used to save some computational time).

Next, we demonstrate the effort of uncoordinated/adaptive step-size. We construct the function with $\mu_{i}=0.02$ and $L_{i}=1$ for each node $i$ . Then we change the $L_{i}$ values by multiplying the function by a constant.

We use the same mixing matrix for the following two experiments.

We change half nodes. We randomly pick an even number node and multiply its function by $4$ . For remaining even number nodes, we multiply their functions by a random integer 2 or 3.

We change a quarter nodes. We randomly pick an node not in ${\mathcal{U}}=\{4,8,16,\dots,40\}$ and multiply its function by 10. Then for other nodes in ${\mathcal{U}}$ , we multiply their functions by a random integer between 2 and 9.

We compare NIDS with adaptive step-size ( $1/{L_{i}}$ for node $i$ ) and NIDS with same step-size $1/{\max_{i}L_{i}}$ in Fig. 2. We let $c=1/(2\max_{i}\alpha_{i})$ , so no network information is needed. As shown in Fig. 2, NIDS with adaptive step-size converges faster than same step-size.

V-B The case with nonsmooth function r(𝐱)𝑟𝐱r({\mathbf{x}})

where the connectivity ratio of the network $\tau=0.1$ . We normalize the problem to make sure that the Lipschitz constant satisfies $L_{i}=1$ for each node, we choose $m_{i}=3$ and $p=200$ and set the number of nodes $n=40$ .

Fig. 3 shows that a larger step-size in NIDS leads to faster convergence. With step-size $1$ , NIDS and PG-EXTRA converge at the same speed. But if we keep increasing the step-size, PG-EXTRA will diverge with step-size $1.4$ while the step-size of NIDS can be increased to $1.9$ maintaining convergence at a faster speed.

V-C An application in classification for healthcare data

VI Conclusion

We proposed a novel decentralized consensus algorithm NIDS, whose step-size does not depend on the network structure. In NIDS, the step-size depends only on the objective function, and it can be as large as $2/L$ , where $L$ is the Lipschitz constant of the gradient of the smooth function. We showed that NIDS converges at the $o(1/k)$ rate for the general convex case and at a linear rate for the strongly convex case. For the strongly convex case, we separated the linear convergence rate’s dependence on the objective function and the network. The separated convergence rates match the typical rates for the general gradient descent and the consensus averaging. Furthermore, every agent in the network can choose its own step-size independently by its own objective function. Numerical experiments validated the theoretical results and demonstrated better performance of NIDS over state-of-the-art algorithms. Because the step-size of NIDS does not depend on the network structure, there are many possible future extensions. One extension is to apply NIDS on dynamic networks where nodes can join and drop off.

Acknowledgement

We thank the anonymous reviewers for helpful comments and suggestions to improve the clarity of this paper.

References

“ $\Rightarrow$ ” If $({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ is a fixed point of (12), we have

where the two equalities come from (12b) and (12c), respectively. Combining (12c) and (12a) gives

where ${\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*})$ .

“ $\Leftarrow$ ” In order to show that $({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ is a fixed point of iteration (12), we just need to verify that $({\mathbf{d}}^{k+1},{\mathbf{z}}^{k+1})=({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ if $({\mathbf{d}}^{k},{\mathbf{z}}^{k})=({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ . From (12a), we have ${\mathbf{x}}^{k}={\mathbf{x}}^{*}$ , then

Therefore, $({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ is a fixed point of iteration (12). ∎

VI-B Proof of Lemma 2

In addition, $({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*})$ is a fixed point of iteration (12).

“ $\Rightarrow$ ” Because ${\mathbf{x}}^{*}={\mathbf{1}}_{n\times 1}(x^{*})^{\top}$ , we have $({\mathbf{I}}-{\mathbf{W}}){\mathbf{x}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}(x^{*})^{\top}=\mathbf{0}_{n\times 1}(x^{*})^{\top}=\mathbf{0}$ . The fact that $x^{*}$ is an optimal solution of problem (1) means there exists ${\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*})$ such that $(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=\mathbf{0}$ . That is to say all columns of $\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*}$ are orthogonal to ${\mathbf{1}}_{n\times 1}$ . Therefore, Remark 1 shows the existence of ${\mathbf{p}}^{*}$ such that $({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*}+\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*}=\mathbf{0}$ .

“ $\Leftarrow$ ” Equation (34b) shows that ${\mathbf{x}}^{*}$ is consensual because of item 3 of Assumption 1, i.e., ${\mathbf{x}}^{*}={\mathbf{1}}_{n\times 1}(x^{*})^{\top}$ for some $x^{*}$ . From (34a), we have $\mathbf{0}=(({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*}+\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=({\mathbf{p}}^{*})^{\top}({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}+(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}$ . Thus, $\mathbf{0}\in\sum_{i=1}^{n}(\nabla s_{i}(x^{*})+\partial r_{i}(x^{*}))$ because ${\mathbf{x}}^{*}$ is consensual. This completes the proof for the equivalence.

Lemma 1 shows that $({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*})$ is a fixed point of iteration (12). ∎

VI-C Proof of Lemma 3

Letting ${\mathbf{x}}={\mathbf{A}}{\mathbf{y}}$ , we have $\|{\mathbf{x}}\|^{2}=\langle{\mathbf{U}}\Sigma{\mathbf{U}}^{\top}{\mathbf{y}},{\mathbf{U}}\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\rangle=\langle\Sigma{\mathbf{U}}^{\top}{\mathbf{y}},\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\rangle=\|\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\|^{2}$ . In addition,

which means that $\|\cdot\|^{2}_{{\mathbf{A}}^{\dagger}}=\langle\cdot,{\mathbf{A}}^{\dagger}\cdot\rangle$ is a norm for ${\mathbf{range}}({\mathbf{A}})$ . ∎

VI-D Proof of Proposition 1

Let ${\mathbf{M}}=c^{-1}({\mathbf{I}}-{\mathbf{W}})^{\dagger}-\Lambda$ with $\Lambda$ being symmetric positive definite and ${\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}$ . Then $\|\cdot\|_{{\mathbf{M}}}$ is a norm defined for ${\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ .

We apply Lemma 3 on $\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ and obtain the result. ∎

VI-E Proof of Theorem 1

Before proving Theorem 1, we present two lemmas. The first lemma shows that the distance to a fixed point of (12) is decreasing, and the second one show the distance between two iterations is decreasing.

Let $({\mathbf{d}}^{*},{\mathbf{z}}^{*})$ be a fixed point of (12) and ${\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ . For the sequence $({\mathbf{d}}^{k},{\mathbf{z}}^{k})$ generated from NIDS in (12) with ${\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ , we have

Let $({\mathbf{d}}^{k},{\mathbf{z}}^{k})$ be the sequence generated from NIDS in (12) with $\alpha_{i}<2/L_{i}$ for all $i$ and ${\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ . Then the sequence $\left\{\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|^{2}_{\Lambda^{-1}}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|_{{\mathbf{M}}}^{2}\right\}_{k\geq 0}$ is monotonically nonincreasing.

Similar to the proof for Lemma 4, we can show that

Subtracting (LABEL:for:nonincreasing2) from (LABEL:for:nonincreasing1) on both sides, we have

where the inequality comes from the Cauchy-Schwarz inequality. Then, the previous inequality, together with (39) and the Cauchy-Schwarz inequality, gives

The three inequalities hold because of (39), (40), and the Cauchy-Schwarz inequality, respectively. The first and third equalities come from (12c). Rearranging the previous inequality, we obtain

where the second and last inequalities come from (7) and $\Lambda<2{\mathbf{L}}^{-1}$ , respectively. It completes the proof. ∎

Lemma 5 shows that $\left\{\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|^{2}_{\Lambda^{-1}}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|_{{\mathbf{M}}}^{2}\right\}_{k\geq 0}$ is monotonically nonincreasing. Summing up (4) from $1$ to $k$ , we have

When ${\mathbf{I}}\succ c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}$ , inequality (4) shows that the sequence $({\mathbf{d}}^{k},{\mathbf{z}}^{k})$ is bounded, and there exists a convergent subsequence $({\mathbf{d}}^{k_{i}},{\mathbf{z}}^{k_{i}})$ such that $({\mathbf{d}}^{k_{i}},{\mathbf{z}}^{k_{i}})\rightarrow(\bar{\mathbf{d}},\bar{\mathbf{z}})$ . Then (41) gives the convergence of $({\mathbf{d}}^{k_{i}+1},{\mathbf{z}}^{k_{i}+1})$ . More specifically, $({\mathbf{d}}^{k_{i}+1},{\mathbf{z}}^{k_{i}+1})\rightarrow(\bar{\mathbf{d}},\bar{\mathbf{z}})$ . Therefore $(\bar{\mathbf{d}},\bar{\mathbf{z}})$ is a fixed point of iteration (12). In addition, because ${\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ for all $k$ , we have $\bar{\mathbf{d}}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}})$ . Finally Lemma 4 implies the convergence of $({\mathbf{d}}^{k},{\mathbf{z}}^{k})$ to $(\bar{\mathbf{d}},\bar{\mathbf{z}})$ . ∎