A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates

Zhi Li, Wei Shi, Ming Yan

I Introduction

This paper focuses on the following decentralized optimization problem:

Specific problems of form (1) that require a decentralized computing architecture have appeared in various areas including networked multi-vehicle coordination, distributed information processing and decision making in sensor networks, as well as distributed estimation and learning. Some examples include distributed average consensus , distributed spectrum sensing , information control , power systems control , statistical inference and learning . In general, decentralized optimization fits the scenarios where the data is collected and/or stored in a distributed network, a fusion center is either inapplicable or unaffordable, and/or computing is required to be performed in a distributed but collaborative manner by multiple agents or by network designers.

The study on distributed algorithms dates back to the early 1980s . Since then, due to the emergence of large-scale networks, decentralized (optimization) algorithms, as a special type of distributed algorithms for solving problem (1), have received attention. Many efforts have been made on star networks with one master agent and multiple slave agents . This scheme is “centralized” due to the use of a “master” agent. It may suffer a single point of failure and may violate the privacy requirement in certain applications. In this paper, we focus on solving (1) in a decentralized fashion, where no “master” agent is used.

Incremental algorithms can solve (1) without the need of a “master” agent and it is based on a directed ring network. To handle general (possibly time-varying) networks, the distributed sub-gradient algorithm was proposed in . This algorithm and its variants are intuitive and simple but usually slow due to the diminishing step-size that is needed to obtain a consensual and optimal solution, even if the objective functions are differentiable and strongly convex. With a fixed step-size, these distributed methods can be fast, but they only converge to a neighborhood of the solution set which depends on the step-size. This phenomenon creates an exactness-speed dilemma .

A class of distributed approaches that bypass this dilemma is based on introducing the Lagrangian dual. The resulting algorithms include distributed dual decomposition and decentralized alternating direction method of multipliers (ADMM) . The decentralized ADMM and its proximal-gradient variant can employ a fixed step-size to achieve O(1/k)O(1/k) rate under general convexity assumptions . Under the strong convexity assumption, the decentralized ADMM has linear convergence for time-invariant undirected graphs . There exist some other distributed methods that do not (explicitly) use dual variables but can still converge to an optimal consensual solution with fixed step-sizes. In particular, works in employ multi-consensus inner loops, Nesterov’s acceleration, and/or the adapt-then-combine (ATC) strategy. Under the assumption that the objectives have bounded and Lipschitz continuous gradientsThis means that the nonsmooth terms rir_{i}’s are absent. Such assumption is much stronger than the one used for achieving the O(1/k2)O(1/k^{2}) rate in Nesterov’s optimal gradient method ., the algorithm proposed in has O(ln(k)/k2)O\left(\ln(k)/k^{2}\right) rate. References use a difference structure to cancel the steady state error in decentralized gradient descent , thereby developing the algorithm EXTRA and its proximal-gradient variant PG-EXTRA. It converges at an O(1/k)O(1/k) rate when the objective function in (1) is convex and has a linear convergence rate when the objective function is strongly convex and ri(x)=0r_{i}(x)=0 for all ii.

A number of recent works employed the so-called gradient tracking to conquer different issues . To be specific, works relax the step-size rule to allow uncoordinated step-sizes across agents. Paper solves non-convex optimization problems. Paper aims at achieving geometric convergence over time-varying graphs. Work improves the convergence rate over EXTRA, and its formulation is the same as that in .

Another topic of interest is decentralized optimization over directed graphs , which is beyond the scope of this paper.

I-B Proposed Algorithm and its Advantages

If all local variables are identical, i.e., x1==xnx_{1}=\dots=x_{n}, we say that x{\mathbf{x}} is consensual. In addition, we define

We have f(x)=s(x)+r(x)f({\mathbf{x}})=s({\mathbf{x}})+r({\mathbf{x}}). The gradient of ss at x{\mathbf{x}} is given in the same way as x{\mathbf{x}} in (3) by

𝐼𝐖2\widetilde{{\mathbf{W}}}=\frac{I+{\mathbf{W}}}{2} here. By making a simple modification over PG-EXTRA , our proposed algorithm brings a big improvement in the speed and the dependency of convergence over networks. To better expose this simple modification, let us compare a special case of our proposed algorithm with EXTRA for the smooth case, i.e., r(x)=0r({\mathbf{x}})=0.

A large and network independent step-size for NIDS: All works mentioned above either employ pessimistic step-sizes or have network dependent upper bounds on step-sizes. Furthermore, the step-sizes for the strongly convex case are more conservative. For example, the step-size used to achieve linear convergence rates for EXTRA in is in the order of O(μ/L2)O(\mu/L^{2}), where μ\mu and LL are the strong convexity constant of s(x)s({\mathbf{x}}) and Lipschitz constant of s(x)\nabla s({\mathbf{x}}), respectively. As a contrast, the centralized gradient descent can choose a step-size in the order of O(1/L)O(1/L). The upper bound of step-size for EXTRA was recently improved to (5+3λn(W))/(4L)(5+3\lambda_{n}({\mathbf{W}}))/(4L) in . We can choose 1/(2L)1/(2L) for any W{\mathbf{W}} satisfying Assumption 1. Another example of employing a constant step-size in distributed optimization is DIGing . Although its ATC variant in was shown to converge faster than DIGing, the step-size is still very conservative compared to O(1/L)O(1/L). We will show that the step-size of NIDS can have the same upper bound 2/L2/L as that of the centralized gradient descent. The achievable step-sizes of NIDS for o(1/k)o(1/k) rate in the general convex case and the linear convergence rate in the strongly convex case are both in the order of O(1/L)O(1/L). Furthermore, NIDS allows each agent to have an individual step-size. Each agent ii can choose a step-size αi\alpha_{i} that is as large as 2/Li2/L_{i} on any connected network, where LiL_{i} is the Lipschitz constant of si(x)\nabla s_{i}(x) and LiLL_{i}\leq L for all ii (L=maxiLiL=\max_{i}L_{i}). Apart from the step-sizes, to run NIDS, a common/public parameter cc is needed for the construction of W~\widetilde{{\mathbf{W}}} (see (8) for the algorithm and W~\widetilde{{\mathbf{W}}}). This parameter cc can be chosen without any knowledge of the network (or the mixing matrix W{\mathbf{W}}). For example, c=1/(2maxiαi)c={1}/({2\max_{i}\alpha_{i}}). Table I provides an overview of algorithmic configurations for EXTRA and NIDS. NIDS works as long as each agent can estimate its local functional parameter: No agent needs any other global information including the number of agents in the whole network except the largest step-size, if it is not the same for all agents.

In the line of research of optimization over heterogeneous networks, after the initial work regarding uncoordinated step sizes, references and introduce and analyze a diffusion strategy with corrections that achieves an exact linear convergence with uncoordinated step sizes. This exact diffusion algorithm is still related to the Lagrangian method but can be considered as having incorporated a CTA structure. The CTA strategy can, similar to the ATC strategy, improve the convergence speed of certain consensus optimization algorithms (see [41, Remark 3] and [46, Section II.C]). However, the analysis in , though allowing step size mismatch across the network, does not take into consideration the heterogeneity of agents’ functional conditions. Furthermore, their upper bound for the step-size is in the order of O(μ/L2)O(\mu/L^{2}).

Sublinear convergence rate for the general case: Under the general convexity assumption, we show that NIDS has a convergence rate of o(1/k)o(1/k), which is slightly better than the O(1/k)O(1/k) rate of PG-EXTRA. Because the step-size of NIDS does not depend on the network topology and is much larger than that of PG-EXTRA, NIDS can be much faster than PG-EXTRA, as shown in the numerical experiments.

Linear convergence rate for the strongly convex case: Let us first define “scalability”. When the iterate xkx^{k} of an algorithm converges to the optimal solution xx^{*} linearly, i.e., xkx2=O((11/S)k)\|x^{k}-x^{*}\|^{2}=O((1-1/S)^{k}) with some positive constant SS, we say that the algorithm needs to run O(S)log(1/ϵ)O(S)\log(1/\epsilon) iterations to reach ϵ\epsilon-accuracy. So we call O(S)O(S) the scalability of the algorithm.

For the case where the non-smooth terms are absent and the functions {si}i=1n\{s_{i}\}_{i=1}^{n} are strongly convex, we show that NIDS achieves a linear convergence rate whose dependencies on the functions {si}i=1n\{s_{i}\}_{i=1}^{n} and the network topology are decoupled. To be specific, to reach ϵ\epsilon-accuracy, the number of iterations needed for NIDS is

where λi(W)\lambda_{i}({\mathbf{W}}) is the iith largest eigenvalue of W{\mathbf{W}}. Both Lμ\frac{L}{\mu} and 1λn(W)1λ2(W)\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})} are typical in the literature of optimization and average consensus, respectively. The value Lμ\frac{L}{\mu}, also called the condition number of the objective function, is aligned with the scalability of the standard gradient descent . The value 1λn(W)1λ2(W)\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})} is understood as the condition numberWhen we choose W=Iτ\L{\mathbf{W}}={\mathbf{I}}-\tau\textbf{\L} where Ł is the Laplacian of the underlying graph and τ\tau is a positive tunable constant, we have 1λn(W)1λ2(W)=λ1(\L)λn1(\L)\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}=\frac{\lambda_{1}(\textbf{\L})}{\lambda_{n-1}(\textbf{\L})} which is the finite condition number of Ł. Note that λn(\L)=0\lambda_{n}(\textbf{\L})=0. of the network and aligned with the scalability of the simplest linear iterations for distributed averaging .

Separating the condition numbers of the objective function and the network provides a way to determine the bottleneck of NIDS for a specific problem and a given network. Therefore, the system designer might be able to smartly apply preconditioning on {si}i=1n\{s_{i}\}_{i=1}^{n} or improve the connectivity of the network to cost-effectively obtain a better convergence.

Summary and comparison of state-of-the-art algorithms: We list the properties of a few relevant algorithms in Table II. We let σ:=1λn(W)1λ2(W)\sigma:=\frac{1-\lambda_{n}({\mathbf{W}})}{1-\lambda_{2}({\mathbf{W}})}. This quantity is directly affected by the network topology and how the matrix W{\mathbf{W}} is defined, thus is also related to the consensus ability of a network. When the network is fully connected (a complete graph), we can choose W{\mathbf{W}} so that λ2(W)=λn(W)=0\lambda_{2}({\mathbf{W}})=\lambda_{n}({\mathbf{W}})=0 and thus σ=1\sigma=1 (the best case); in general σ1\sigma\geq 1 since 0<1λ2(W)1λn(W)<20<1-\lambda_{2}({\mathbf{W}})\leq 1-\lambda_{n}({\mathbf{W}})<2; in the worst case, we have σ11λ2(W)=O(n2)\sigma\leq\frac{1}{1-\lambda_{2}({\mathbf{W}})}=O(n^{2}) [4, Section 2.3]. We keep σ\sigma in the bounds/rates of involved algorithms for a fair comparison instead of focusing on the worst case that often gives pessimistic/conservative results. We omit “O()O(\cdot)” in “Bounds of step-sizes” and “Scalabilities” for brevity and only compare the effect of functional properties (μ\mu and LL) and network properties (σ\sigma and/or nn). Before talking into details, let us clarify a few points. In EXTRA, μg\mu_{g} is a quantity that is associated with the strong convexity of the original function fˉ(x)\bar{f}(x), so it covers a larger class of problems; In DIGing, μˉ\bar{\mu} is the mean value of the strong convexity constants of local objectives; In Acc-DNGD, the step-size for the convex case contains kk, the current number of iterations. Thus it represents a diminishing step-size sequence; In Optimal , the total number of iterations KK is used to determine the step-size for the convex case. In addition, they apply to problems in which the objectives are dual friendly (see for its definition). Note some types of objectives are suitable for gradient update, some are suitable for dual gradient update (dual friendly), and some are suitable for proximal update.

Finding the algorithm with the lowest per-iteration cost depends on the problem (functions). Apparently, our bounds on step-sizes and the corresponding scalability/rate are better than those given in EXTRA and Harness (see Table II). When σ\sigma is close to 11 (the graph is well connected), the step-size bound and scalability given in DIGing are the same as NIDS. However, when σ\sigma is large, their result becomes rather conservative. Acc-DNGD and Optimal have improved the scalability/rate of gradient-based distributed optimization by employing Nesterov’s acceleration technique on primal and dual problems, respectively. For the convex case, our rate is worse than theirs because our algorithm does not employ Nesterov’s acceleration. For the primal distributed gradient method after acceleration , the scalability in σ\sigma is still worse than our result. Algorithm Optimal achieves the optimal scalability/rate for distributed optimization. However, as we have mentioned above, their algorithms are dual based thus apply to a different class of problems. In addition, NIDS supports proximable non-smooth functions and uncoordinated step-sizes while these have not been considered in Acc-DNGD and Optimal. To sum up, we have reached the best possible performance of first-order algorithms for distributed optimization without acceleration. Further improving the performance by incorporating Nesterov’s techniques to our algorithm will be a future direction.

Finally, we note that, references , appearing simultaneously with this work, also proposed (6b) to enlarge the step-size and use column stochastic matrices rather than symmetric doubly stochastic matrices. However, their algorithm only works for smooth problems, and their analysis seems to be restrictive and requires twice differentiability and strong convexity of {si}i=1n\{s_{i}\}_{i=1}^{n}. The stepsize is also in the order of μ/L2\mu/L^{2} .

I-C Future Works

The capability of our algorithm using purely locally determined parameters increases its potential to be extended to dynamic networks with a time-varying number of nodes. Given such flexibility, we may use similar schemes to solve the decentralized empirical risk minimization problems. Furthermore, it also enhances the privacy of the agents through allowing each agent to perform their own optimization procedure without negotiation on any parameter.

By using Nesterov’s acceleration technique, reference shows that the scalability of a new average consensus protocol can be improved to O(n)O(n); when the nonsmooth terms rir_{i}’s are absent, reference shows that the scalability of a new dual based accelerated distributed gradient method can be improved to O(σL/μ)O(\sqrt{\sigma L/\mu}). One of our future work is exploring the convergence rates/scalability of the Nesterov’s accelerated version of our algorithm.

I-D Paper Organization

The rest of this paper is organized as follows. To facilitate the description of the technical ideas, the algorithms, and the analysis, we introduce additional notation in Subsection I-E. The intuition for the network-independent step-size is provided in Section II. In Section III, we introduce our algorithm NIDS and discuss its relation to some other existing algorithms. In Section IV, we first show that NIDS can be understood as an iterative algorithm for seeking a fixed point. Following this, we establish that NIDS converges at an o(1/k)o(1/k) rate for the general convex case and a linear rate for the strongly convex case. Then, numerical simulations are given in Section V to corroborate our theoretical claims. Final remarks are given in Section VI.

I-E Notation

II Intuition for Network-independent step-size

In this section, we provide an intuition for the network-independent step-size for NIDS with only the differentiable function ss. The decentralized optimization problem is equivalent to

where (IW)1/2({\mathbf{I}}-{\mathbf{W}})^{1/2} is the square root of IW{\mathbf{I}}-{\mathbf{W}}, and the constraint is the same as the consensual condition with the mixing matrix W{\mathbf{W}} given in Assumption 1. Denote \L=(IW)1/2\textbf{{\L}}=({\mathbf{I}}-{\mathbf{W}})^{1/2}. The corresponding optimality condition with the introduction of the dual variable p{\mathbf{p}} is

EXTRA is equivalent to the Condat-Vu primal-dual algorithm , and it can be further explained as a forward-backward splitting applied to the equation, i.e.,

It is equivalent to EXTRA after p{\mathbf{p}} is eliminated. In this case, the new metric is a full matrix, and therefore, the upper bound of the step-size α\alpha depends on the matrix Ł. To be more specific,

which gives α2(1+λmin(W))/L\alpha\leq 2(1+\lambda_{\min}({\mathbf{W}}))/L. A larger and optimal upper bound for the step-size of EXTRA is shown in (See Table I), and it still depends on W{\mathbf{W}}. However, we choose a block diagonal metric and have

which is equivalent to NIDS after p{\mathbf{p}} is eliminated. Because the new metric is block diagonal, and the nonexpansiveness of the forward step depends on the function only, i.e., α2/L\alpha\leq 2/L.

III Proposed Algorithm NIDS

In this section, we describe our proposed NIDS in Algorithm 1 for solving (1) in more details and explain the connections to other related methods.

The mixing matrix satisfies the following assumption, which comes from .

(Decentralized property). If iji\neq j and (i,j)E(i,j)\notin{\mathcal{E}}, then wij=0w_{ij}=0;

(Symmetry). W=WT{\mathbf{W}}={\mathbf{W}}^{T};

(Null space property). Null(IW)=span(1n×1)\mathbf{Null}({\mathbf{I}}-{\mathbf{W}})=\mathbf{span}({\mathbf{1}}_{n\times 1});

(Spectral property). 2IW+I0n×n2{\mathbf{I}}\succcurlyeq{\mathbf{W}}+{\mathbf{I}}\succ\mathbf{0}_{n\times n}.

Assumption 1 implies that the eigenvalues of W{\mathbf{W}} lie in (1,1](-1,1] and the multiplicity of eigenvalue 11 is one, i.e., 1=λ1(W)>λ2(W)λn(W)>11=\lambda_{1}({\mathbf{W}})>\lambda_{2}({\mathbf{W}})\geq\cdots\geq\lambda_{n}({\mathbf{W}})>-1. Item 3 of Assumption 1 shows that (IW)1n×1=0({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}=\mathbf{0} and the orthogonal complement of span(1n×1)\mathbf{span}({\mathbf{1}}_{n\times 1}) is the row space of IW{\mathbf{I}}-{\mathbf{W}}, which is also the column space of IW{\mathbf{I}}-{\mathbf{W}} because of the symmetry of W{\mathbf{W}}.

The functions {si}i=1n\{s_{i}\}_{i=1}^{n} and {ri}i=1n\{r_{i}\}_{i=1}^{n} satisfy the following assumption.

Functions {si(x)}i=1n\{s_{i}(x)\}_{i=1}^{n} and {ri(x)}i=1n\{r_{i}(x)\}_{i=1}^{n} are lower semi-continuous proper convex, and {si(x)}i=1n\{s_{i}(x)\}_{i=1}^{n} have Lipschitz continuous gradients with constants {Li}i=1n\{L_{i}\}_{i=1}^{n}, respectively. Thus, we have

where W~=IcΛ(IW)\widetilde{\mathbf{W}}={\mathbf{I}}-c\Lambda({\mathbf{I}}-{\mathbf{W}}) and cc is chosen such that Λ1/2W~Λ1/2=IcΛ1/2(IW)Λ1/20\Lambda^{-1/2}\widetilde{\mathbf{W}}\Lambda^{1/2}={\mathbf{I}}-c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}. This condition shows that the upper bound of the parameter cc depends on W{\mathbf{W}} and Λ\Lambda. When the information about W{\mathbf{W}} is not given, we can just let c=1/(2maxiαi)c=1/(2\max_{i}\alpha_{i}) because λn(W)>1\lambda_{n}({\mathbf{W}})>-1. To set such a parameter, a preprocessing step is needed to obtain the maximum. However, since the maximum can be easily computed in a connected network in no more than n1n-1 rounds of communication wherein each node repeatedly takes maximum of the values from neighbors, the cost of this preprocessing is essentially negligible compared to the worst-case running time of our optimization protocol.

If all agents choose the same step-size, i.e., Λ=αI\Lambda=\alpha{\mathbf{I}}, and we let c=1/(2α)c=1/(2\alpha), (8) becomes

The only difference between NIDS and PG-EXTRA is that the mixing operation is further applied to the successive difference of the gradients αs(xk)+αs(xk1)-\alpha\nabla s({\mathbf{x}}^{k})+\alpha\nabla s({\mathbf{x}}^{k-1}) in NIDS.

When there is no function r(x)r({\mathbf{x}}), (8) becomes

and it further reduces to (6b) when Λ=αI\Lambda=\alpha{\mathbf{I}} and c=1/(2α)c=1/(2\alpha). Note that, though (6b) appears in , its convergence still needs a small step-size that also depends on the network topology and the strong convexity constant. In Theorem 1 of , the upper bound for the step-size is also O(μ/L2)O(\mu/L^{2}), which is the same as that of PG-EXTRA.

IV Convergence Analysis of NIDS

In order to show the convergence of NIDS, we also need the following assumption.

To simplify the analysis, we introduce a new sequence {dk}k0\{{\mathbf{d}}^{k}\}_{k\geq 0} which is defined as

Using the sequence {xk}k0\{{\mathbf{x}}^{k}\}_{k\geq 0}, we obtain a recursive (update) relation for {dk}k0\{{\mathbf{d}}^{k}\}_{k\geq 0}:

where the second equality comes from the update of zk+1{\mathbf{z}}^{k+1} in (8a) and the last one holds because of the definition of dk{\mathbf{d}}^{k} in (11). Therefore, the iteration (8) is equivalent to, with the update order (x,d,z)({\mathbf{x}},{\mathbf{d}},{\mathbf{z}}),

in the sense that both (8) and (12) generate the same {xk,zk}k>0\{{\mathbf{x}}^{k},{\mathbf{z}}^{k}\}_{k>0} sequence.

Because xk{\mathbf{x}}^{k} is determined by zk{\mathbf{z}}^{k} only and can be eliminated from the iteration, iteration (12) is essentially an operator for (d,z)({\mathbf{d}},{\mathbf{z}}). Note that we have d1=Λ1(x0z1)s(x0)=0{\mathbf{d}}^{1}={\Lambda^{-1}}({\mathbf{x}}^{0}-{\mathbf{z}}^{1})-\nabla s\left({\mathbf{x}}^{0}\right)=\mathbf{0} from Algorithm 1. Therefore, from the update of dk+1{\mathbf{d}}^{k+1} in (12b), dkrange(IW){\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}) for all kk. In fact, any z1{\mathbf{z}}^{1} such that d1range(IW){\mathbf{d}}^{1}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}) works for NIDS. The following two lemmas show the relation between fixed points of (12) and optimal solutions of (1). The proofs for all lemmas and propositions are included in the supplemental material.

(d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) is a fixed point of (12) if and only if there exists a subgradient qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}) such that z=x+Λq{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*} and

x{\mathbf{x}}^{*} is consensual with x1=x2==xn=xx^{*}_{1}=x^{*}_{2}=\cdots=x^{*}_{n}=x^{*} being an optimal solution of problem (1) if and only if there exists p{\mathbf{p}}^{*} and a subgradient qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}) such that:

In addition, (d=(IW)p,z=x+Λq)({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*}) is a fixed point of iteration (12).

Lemma 2 shows that we can find a fixed point of iteration (12) to obtain an optimal solution of problem (1). It also tells us that we need drange(IW){\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}) to get an optimal solution of problem (1). Therefore, we need d1range(IW){\mathbf{d}}^{1}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}).

Let M=c1(IW)Λ{\mathbf{M}}=c^{-1}({\mathbf{I}}-{\mathbf{W}})^{\dagger}-\Lambda with IcΛ1/2(IW)Λ1/20{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}. Then M\|\cdot\|_{{\mathbf{M}}} is a norm defined for range(IW){\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}).

The following lemma compares the distance to a fixed point of (12) for two consecutive iterates.

Let (d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) be a fixed point of iteration (12) with drange(IW){\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}). The update (dk,zk)(dk+1,zk+1)({\mathbf{d}}^{k},{\mathbf{z}}^{k})\Rightarrow({\mathbf{d}}^{k+1},{\mathbf{z}}^{k+1}) in (12) satisfies

From the update of zk+1{\mathbf{z}}^{k+1} in (12c), we have

where the second equality comes from (12b), (14b), and dk+1drange(IW){\mathbf{d}}^{k+1}-{\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}). From (12a), we have that

The inequality and the second equality comes from (17) and (16), respectively. The first and fourth equalities hold because of the update of zk+1{\mathbf{z}}^{k+1} in (12c). Using 2a,b=a+b2a2b22\langle{\mathbf{a}},{\mathbf{b}}\rangle=\|{\mathbf{a}}+{\mathbf{b}}\|^{2}-\|{\mathbf{a}}\|^{2}-\|{\mathbf{b}}\|^{2} and rearranging the previous inequality give us that

As explained in Section II, NIDS is equivalent to the primal-dual algorithm applied to problem

where ι()\iota(\cdot) is the indicator function, which return 0 for 0\mathbf{0} and ++\infty otherwise, with the metric matrix being

We apply [56, Theorem 1] and obtain the following sublinear convergence result.

Let (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) be the sequence generated from NIDS in (12) with αi<2/Li\alpha_{i}<2/L_{i} for all ii and IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}. We have

Furthermore, (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) converges to a fixed point (dˉ,zˉ)(\bar{\mathbf{d}},\bar{\mathbf{z}}) of iteration (12) and dˉrange(IW)\bar{\mathbf{d}}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}), if IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succ c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}.

Note the convergence in Theorem 1 is shown in z{\mathbf{z}} and d{\mathbf{d}}. We will show the convergence in terms of (14). Recall that

where qkr(xk){\mathbf{q}}^{k}\in\partial r({\mathbf{x}}^{k}). Therefore, zk+1zkΛ120\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|_{\Lambda^{-1}}^{2}\rightarrow 0 implies the convergence in terms of (14a).

where the second equality comes from dk+1dkrange(IW){\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}). Thus zk+1zkΛ12+dk+1dkM20\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|_{\Lambda^{-1}}^{2}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|^{2}_{\mathbf{M}}\rightarrow 0 implies the convergence in terms of (14b).

IV-B Linear convergence for special cases

In this subsection, we provide the linear convergence rate for the case when r(x)=0r({\mathbf{x}})=0, i.e., zk=xk{\mathbf{z}}^{k}={\mathbf{x}}^{k} in NIDS.

If {si(x)}i=1n\{s_{i}(x)\}_{i=1}^{n} are strongly convex with parameters {μi}i=1n\{\mu_{i}\}_{i=1}^{n}, then

The first inequality comes from xk+1=xkΛs(xk)dk+1{\mathbf{x}}^{k+1}={\mathbf{x}}^{k}-\Lambda\nabla s\left({\mathbf{x}}^{k}\right)-{\mathbf{d}}^{k+1}, Λs(x)+d=0\Lambda\nabla s\left({\mathbf{x}}^{*}\right)+{\mathbf{d}}^{*}=\mathbf{0}, (7), and (20). Combing (23) and (32), we have

Let ρ\rho defined as (21), then we have (22). ∎

The condition IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2} implies that cλn1(Λ1/2(IW)Λ1/2)c\leq\lambda_{n-1}(\Lambda^{-1/2}({\mathbf{I}}-{\mathbf{W}})^{\dagger}\Lambda^{-1/2}).

If the agents across the whole network use an identical stepsize α\alpha, that is, Λ=αI{\Lambda}=\alpha{\mathbf{I}}, then

A concise but informative expression of the rate ρ=max(1miniμimaxiLi,λ2(W)λn(W)1λn(W))\rho=\max\left(1-{\min_{i}\mu_{i}\over\max_{i}L_{i}},{\lambda_{2}({\mathbf{W}})-\lambda_{n}({\mathbf{W}})\over 1-\lambda_{n}({\mathbf{W}})}\right) can be obtained when we specifically choose α=1maxiLi\alpha={1\over\max_{i}L_{i}} and c=1(1λn(W))αc={1\over(1-\lambda_{n}({\mathbf{W}}))\alpha}. When λn(W)\lambda_{n}({\mathbf{W}}) is not given, we choose c=1/(2α)c={1/(2\alpha)} and obtain the scalability max(maxiLiminiμi,21λ2(W))\max\left({\max_{i}L_{i}\over\min_{i}\mu_{i}},{2\over 1-\lambda_{2}({\mathbf{W}})}\right). In this case, the network impact and the functional impact are decoupled.

If we let Λ=L1{\Lambda}={\mathbf{L}}^{-1} and c=λn1(L1/2(IW)L1/2)c=\lambda_{n-1}({\mathbf{L}}^{-1/2}({\mathbf{I}}-{\mathbf{W}})^{\dagger}{\mathbf{L}}^{-1/2}), then the rate becomes

When λn(W)\lambda_{n}({\mathbf{W}}) is not given, we choose c=1/(2maxiαi)=miniLi/2c=1/(2\max_{i}\alpha_{i})=\min_{i}L_{i}/2 and obtain the scalability max(maxiLiμi,maxiLiminiLi21λ2(W))\max\left(\max_{i}\frac{L_{i}}{\mu_{i}},\frac{\max_{i}L_{i}}{\min_{i}L_{i}}\cdot\frac{2}{1-\lambda_{2}({\mathbf{W}})}\right). In this case, the networking impact is coupled with the function factors, i.e., the smoothness heterogeneity maxiLiminiLi\frac{\max_{i}L_{i}}{\min_{i}L_{i}} is multiplied on the networking impact. While the other number depends on the functional condition numbers Liμi\frac{L_{i}}{\mu_{i}}’s only.

Theorem 2 separates the dependence of the linear convergence rate on the functions and the network structure. In our current scheme, all the agents perform information exchange and the proximal-gradient step once in each iteration. If the proximal-gradient step is expensive, this explicit rate formula can help us to decide whether the so-called multi-step consensus can help reducing the computational time.

For the sake of simplicity, let us assume for this moment that all the agents have the same strong convexity constant μ\mu and gradient Lipschitz continuity constant LL. Suppose that the “tt-step consensus” technique is employed, i.e., the mixing matrix W{\mathbf{W}} in our algorithm is replaced by Wt{\mathbf{W}}^{t}, where tt is a positive integer. Then to reach ϵ\epsilon-accuracy, the number of iterations needed is

When L/μ=1L/\mu=1 and step sizes are chosen as Λ=L1\Lambda={\mathbf{L}}^{-1}, it says that we should let t+t\rightarrow+\infty if the graph is not a complete graph. Such theoretical result is correct in intuition since in this case, the centralized gradient descent only needs one step to reach optimal and the bottleneck in decentralized optimization is the network.

Suppose tmaxt_{\max} is a reasonable upper bound on tt, which is set by the system designer. It is difficult to explicitly find an optimal tt. But with the above analysis as an evidence, we suggest that one choose t=min([logλ2(W)(1μL)],tmax)t=\min\left([\log_{\lambda_{2}({\mathbf{W}})}(1-\frac{\mu}{L})],t_{\max}\right) if 1μL>λ2(W)1-\frac{\mu}{L}>\lambda_{2}({\mathbf{W}}); otherwise t=1t=1. Here [][\cdot] gives the nearest integer.

If the bottleneck is on the functions, we can introduce a mapping x=Byx=By and change the unknown variable from xx to yy. E.g., if the function si(x)s_{i}(x) is a composition of a convex function with a linear mapping, replacing xx using yy changes the linear mapping and the condition number L/μL/\mu of the function. When BB is diagonal, it is similar to the column normalization in machine learning applications. There are other possible ways for reducing the condition number of the functions. It is out of the scope of this work, and we leave this as future work.

V Numerical Experiments

In this section, we compare the performance of NIDS with several state-of-the-art algorithms for decentralized optimization. These methods are

The DIGing-ATC . For reference, the DIGing-ATC updates are provided as follows:

The accelerated distributed Nesterov gradient descent (Acc-DNGD-SC in );

The (dual friendly) optimal algorithm (OA) for distributed optimization (equation (7) in ).

Note there are two rounds of communication in each iteration of DIGing-ATC and Acc-DNGD-SC while there is only one round in that of EXTRA/NIDS/OA. For all the experiments, we first compute the exact solution x{\mathbf{x}}^{*} for (1) using the centralized (proximal) gradient descent. All networks are randomly generated with connectivity ratio τ\tau, where τ\tau is defined as the number of actual edges divided by the total number of possible edges n(n1)2{{n(n-1)}\over 2}. We will report the specific τ\tau used in each test. The mixing matrix W{\mathbf{W}} is always chosen with the Metropolis rule (see and [36, Section 2.4]).

The experiments are carried in Matlab R2016b running on a laptop with Intel i7 CPU @ 2.60HZ, 16.0 GB of RAM, and Windows 10 operating system. The source codes for reproducing the numerical results can be accessed at https://github.com/mingyan08/NIDS.

In order to ensure that each function 12Mixyi2{1\over 2}\|{\mathbf{M}}_{i}x-y_{i}\|^{2} is strongly convex, we choose mi=60m_{i}=60 and p=50p=50 and set the number of nodes n=40n=40. For the first experiment, we choose Mi{\mathbf{M}}_{i} such that the Lipshchitz constant of si\nabla s_{i} satisfies Li=1L_{i}=1 and the strongly convex constant μi=0.5\mu_{i}=0.5 for all ii. Based on Remark 4, we choose α=1/(maxiLi)=1\alpha={1/(\max_{i}L_{i})}=1 and c=1/(1λn(W))c={1/(1-\lambda_{n}({\mathbf{W}}))} for NIDS. In addition, we choose c=1/2c={1/2} such that W~=I+W2\widetilde{{\mathbf{W}}}={{\mathbf{I}}+{\mathbf{W}}\over 2}, which gives the same as that for EXTRA.

The comparison of these methods (NIDS with c=1/((1λn(W))α)c={1/((1-\lambda_{n}({\mathbf{W}}))\alpha)}, NIDS with c=1/2c=1/2, EXTRA, DIGing-ATC, Acc-DNGD-SC, and OA) is shown in Fig. 1 for two different networks with connectivity ratios τ=0.35\tau=0.35 (top) and τ=0.45\tau=0.45 (bottom), respectively. It shows better performance of NIDS in both choices of cc (corresponding to known W{\mathbf{W}} and unknown W{\mathbf{W}}) than that of other algorithms. NIDS with c=1/((1λn(W))α)c={1/((1-\lambda_{n}({\mathbf{W}}))\alpha)} always takes less than half the number of iterations used by EXTRA to reach the same accuracy. In our experiment, DIGing-ATC appears to be sensitive to networks. Under a better connected network (see Fig. 1 bottom), DIGing-ATC can catch up with NIDS with c=1/(2α)c={1/(2\alpha)}. The theoretical step-size of Acc-DNGD-SC is too small due to a very small constant in the bound in , and the convergence of Acc-DNGD-DC under such theoretical step-size in our test is slow and uncompetitive. Thus we have carefully tuned its step-size. With the hand-optimized step-size, Acc-DNGD-SC can achieve a comparable performance as NIDS with c=1/(2α)c={1/(2\alpha)}. In the plots, we observe that OA is fast in terms of the number of iterations. However, in this case, the per-iteration cost of OA is relatively high since it requires solving a system of linear equations at each iteration (though factorization tricks may be used to save some computational time).

Next, we demonstrate the effort of uncoordinated/adaptive step-size. We construct the function with μi=0.02\mu_{i}=0.02 and Li=1L_{i}=1 for each node ii. Then we change the LiL_{i} values by multiplying the function by a constant.

We use the same mixing matrix for the following two experiments.

We change half nodes. We randomly pick an even number node and multiply its function by 44. For remaining even number nodes, we multiply their functions by a random integer 2 or 3.

We change a quarter nodes. We randomly pick an node not in U={4,8,16,,40}{\mathcal{U}}=\{4,8,16,\dots,40\} and multiply its function by 10. Then for other nodes in U{\mathcal{U}}, we multiply their functions by a random integer between 2 and 9.

We compare NIDS with adaptive step-size (1/Li1/{L_{i}} for node ii) and NIDS with same step-size 1/maxiLi1/{\max_{i}L_{i}} in Fig. 2. We let c=1/(2maxiαi)c=1/(2\max_{i}\alpha_{i}), so no network information is needed. As shown in Fig. 2, NIDS with adaptive step-size converges faster than same step-size.

V-B The case with nonsmooth function r​(𝐱)𝑟𝐱r({\mathbf{x}})

where the connectivity ratio of the network τ=0.1\tau=0.1. We normalize the problem to make sure that the Lipschitz constant satisfies Li=1L_{i}=1 for each node, we choose mi=3m_{i}=3 and p=200p=200 and set the number of nodes n=40n=40.

Fig. 3 shows that a larger step-size in NIDS leads to faster convergence. With step-size 11, NIDS and PG-EXTRA converge at the same speed. But if we keep increasing the step-size, PG-EXTRA will diverge with step-size 1.41.4 while the step-size of NIDS can be increased to 1.91.9 maintaining convergence at a faster speed.

V-C An application in classification for healthcare data

VI Conclusion

We proposed a novel decentralized consensus algorithm NIDS, whose step-size does not depend on the network structure. In NIDS, the step-size depends only on the objective function, and it can be as large as 2/L2/L, where LL is the Lipschitz constant of the gradient of the smooth function. We showed that NIDS converges at the o(1/k)o(1/k) rate for the general convex case and at a linear rate for the strongly convex case. For the strongly convex case, we separated the linear convergence rate’s dependence on the objective function and the network. The separated convergence rates match the typical rates for the general gradient descent and the consensus averaging. Furthermore, every agent in the network can choose its own step-size independently by its own objective function. Numerical experiments validated the theoretical results and demonstrated better performance of NIDS over state-of-the-art algorithms. Because the step-size of NIDS does not depend on the network structure, there are many possible future extensions. One extension is to apply NIDS on dynamic networks where nodes can join and drop off.

Acknowledgement

We thank the anonymous reviewers for helpful comments and suggestions to improve the clarity of this paper.

References

(d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) is a fixed point of (12) if and only if there exists a subgradient qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}) such that z=x+Λq{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*} and

\Rightarrow” If (d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) is a fixed point of (12), we have

where the two equalities come from (12b) and (12c), respectively. Combining (12c) and (12a) gives

where qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}).

\Leftarrow” In order to show that (d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) is a fixed point of iteration (12), we just need to verify that (dk+1,zk+1)=(d,z)({\mathbf{d}}^{k+1},{\mathbf{z}}^{k+1})=({\mathbf{d}}^{*},{\mathbf{z}}^{*}) if (dk,zk)=(d,z)({\mathbf{d}}^{k},{\mathbf{z}}^{k})=({\mathbf{d}}^{*},{\mathbf{z}}^{*}). From (12a), we have xk=x{\mathbf{x}}^{k}={\mathbf{x}}^{*}, then

Therefore, (d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) is a fixed point of iteration (12). ∎

VI-B Proof of Lemma 2

x{\mathbf{x}}^{*} is consensual with x1=x2==xn=xx^{*}_{1}=x^{*}_{2}=\cdots=x^{*}_{n}=x^{*} being an optimal solution of problem (1) if and only if there exists p{\mathbf{p}}^{*} and a subgradient qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}) such that:

In addition, (d=(IW)p,z=x+Λq)({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*}) is a fixed point of iteration (12).

\Rightarrow” Because x=1n×1(x){\mathbf{x}}^{*}={\mathbf{1}}_{n\times 1}(x^{*})^{\top}, we have (IW)x=(IW)1n×1(x)=0n×1(x)=0({\mathbf{I}}-{\mathbf{W}}){\mathbf{x}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}(x^{*})^{\top}=\mathbf{0}_{n\times 1}(x^{*})^{\top}=\mathbf{0}. The fact that xx^{*} is an optimal solution of problem (1) means there exists qr(x){\mathbf{q}}^{*}\in\partial r({\mathbf{x}}^{*}) such that (s(x)+q)1n×1=0(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=\mathbf{0}. That is to say all columns of s(x)+q\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*} are orthogonal to 1n×1{\mathbf{1}}_{n\times 1}. Therefore, Remark 1 shows the existence of p{\mathbf{p}}^{*} such that (IW)p+s(x)+q=0({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*}+\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*}=\mathbf{0}.

\Leftarrow” Equation (34b) shows that x{\mathbf{x}}^{*} is consensual because of item 3 of Assumption 1, i.e., x=1n×1(x){\mathbf{x}}^{*}={\mathbf{1}}_{n\times 1}(x^{*})^{\top} for some xx^{*}. From (34a), we have 0=((IW)p+s(x)+q)1n×1=(p)(IW)1n×1+(s(x)+q)1n×1=(s(x)+q)1n×1\mathbf{0}=(({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*}+\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=({\mathbf{p}}^{*})^{\top}({\mathbf{I}}-{\mathbf{W}}){\mathbf{1}}_{n\times 1}+(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}=(\nabla s({\mathbf{x}}^{*})+{\mathbf{q}}^{*})^{\top}{\mathbf{1}}_{n\times 1}. Thus, 0i=1n(si(x)+ri(x))\mathbf{0}\in\sum_{i=1}^{n}(\nabla s_{i}(x^{*})+\partial r_{i}(x^{*})) because x{\mathbf{x}}^{*} is consensual. This completes the proof for the equivalence.

Lemma 1 shows that (d=(IW)p,z=x+Λq)({\mathbf{d}}^{*}=({\mathbf{I}}-{\mathbf{W}}){\mathbf{p}}^{*},{\mathbf{z}}^{*}={\mathbf{x}}^{*}+\Lambda{\mathbf{q}}^{*}) is a fixed point of iteration (12). ∎

VI-C Proof of Lemma 3

Letting x=Ay{\mathbf{x}}={\mathbf{A}}{\mathbf{y}}, we have x2=UΣUy,UΣUy=ΣUy,ΣUy=ΣUy2\|{\mathbf{x}}\|^{2}=\langle{\mathbf{U}}\Sigma{\mathbf{U}}^{\top}{\mathbf{y}},{\mathbf{U}}\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\rangle=\langle\Sigma{\mathbf{U}}^{\top}{\mathbf{y}},\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\rangle=\|\Sigma{\mathbf{U}}^{\top}{\mathbf{y}}\|^{2}. In addition,

which means that A2=,A\|\cdot\|^{2}_{{\mathbf{A}}^{\dagger}}=\langle\cdot,{\mathbf{A}}^{\dagger}\cdot\rangle is a norm for range(A){\mathbf{range}}({\mathbf{A}}). ∎

VI-D Proof of Proposition 1

Let M=c1(IW)Λ{\mathbf{M}}=c^{-1}({\mathbf{I}}-{\mathbf{W}})^{\dagger}-\Lambda with Λ\Lambda being symmetric positive definite and IcΛ1/2(IW)Λ1/20{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}\succcurlyeq\mathbf{0}. Then M\|\cdot\|_{{\mathbf{M}}} is a norm defined for range(IW){\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}).

We apply Lemma 3 on Λ1/2(IW)Λ1/2\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2} and obtain the result. ∎

VI-E Proof of Theorem 1

Before proving Theorem 1, we present two lemmas. The first lemma shows that the distance to a fixed point of (12) is decreasing, and the second one show the distance between two iterations is decreasing.

Let (d,z)({\mathbf{d}}^{*},{\mathbf{z}}^{*}) be a fixed point of (12) and drange(IW){\mathbf{d}}^{*}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}). For the sequence (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) generated from NIDS in (12) with IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}, we have

Let (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) be the sequence generated from NIDS in (12) with αi<2/Li\alpha_{i}<2/L_{i} for all ii and IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}. Then the sequence {zk+1zkΛ12+dk+1dkM2}k0\left\{\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|^{2}_{\Lambda^{-1}}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|_{{\mathbf{M}}}^{2}\right\}_{k\geq 0} is monotonically nonincreasing.

Similar to the proof for Lemma 4, we can show that

Subtracting (LABEL:for:nonincreasing2) from (LABEL:for:nonincreasing1) on both sides, we have

where the inequality comes from the Cauchy-Schwarz inequality. Then, the previous inequality, together with (39) and the Cauchy-Schwarz inequality, gives

The three inequalities hold because of (39), (40), and the Cauchy-Schwarz inequality, respectively. The first and third equalities come from (12c). Rearranging the previous inequality, we obtain

where the second and last inequalities come from (7) and Λ<2L1\Lambda<2{\mathbf{L}}^{-1}, respectively. It completes the proof. ∎

Let (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) be the sequence generated from NIDS in (12) with αi<2/Li\alpha_{i}<2/L_{i} for all ii and IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succcurlyeq c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}. We have

Furthermore, (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) converges to a fixed point (dˉ,zˉ)(\bar{\mathbf{d}},\bar{\mathbf{z}}) of iteration (12) and dˉrange(IW)\bar{\mathbf{d}}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}), if IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succ c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}.

Lemma 5 shows that {zk+1zkΛ12+dk+1dkM2}k0\left\{\|{\mathbf{z}}^{k+1}-{\mathbf{z}}^{k}\|^{2}_{\Lambda^{-1}}+\|{\mathbf{d}}^{k+1}-{\mathbf{d}}^{k}\|_{{\mathbf{M}}}^{2}\right\}_{k\geq 0} is monotonically nonincreasing. Summing up (4) from 11 to kk, we have

When IcΛ1/2(IW)Λ1/2{\mathbf{I}}\succ c\Lambda^{1/2}({\mathbf{I}}-{\mathbf{W}})\Lambda^{1/2}, inequality (4) shows that the sequence (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) is bounded, and there exists a convergent subsequence (dki,zki)({\mathbf{d}}^{k_{i}},{\mathbf{z}}^{k_{i}}) such that (dki,zki)(dˉ,zˉ)({\mathbf{d}}^{k_{i}},{\mathbf{z}}^{k_{i}})\rightarrow(\bar{\mathbf{d}},\bar{\mathbf{z}}). Then (41) gives the convergence of (dki+1,zki+1)({\mathbf{d}}^{k_{i}+1},{\mathbf{z}}^{k_{i}+1}). More specifically, (dki+1,zki+1)(dˉ,zˉ)({\mathbf{d}}^{k_{i}+1},{\mathbf{z}}^{k_{i}+1})\rightarrow(\bar{\mathbf{d}},\bar{\mathbf{z}}). Therefore (dˉ,zˉ)(\bar{\mathbf{d}},\bar{\mathbf{z}}) is a fixed point of iteration (12). In addition, because dkrange(IW){\mathbf{d}}^{k}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}) for all kk, we have dˉrange(IW)\bar{\mathbf{d}}\in{\mathbf{range}}({\mathbf{I}}-{\mathbf{W}}). Finally Lemma 4 implies the convergence of (dk,zk)({\mathbf{d}}^{k},{\mathbf{z}}^{k}) to (dˉ,zˉ)(\bar{\mathbf{d}},\bar{\mathbf{z}}). ∎