Convergence Rate of Distributed ADMM over Networks

Ali Makhdoumi, Asuman Ozdaglar

I Introduction

Many of today’s optimization problems in data science (including statistics, machine learning, and data mining) include an abundance of data, which cannot be handled by a single processor alone. This necessitates distributing data among multiple processors and processing it in a decentralized manner based on the available local information. The applications in machine learning along with other applications in distributed data processing where information is inherently distributed among many processors (see e.g. distributed sensor networks , coordination and flow control problems ) have spearheaded a large literature on distributed multiagent optimization.

In this paper, we focus on the following optimization problem:

Least-Absolute Shrinkage and Selection Operator (LASSO):

Suppose our distributed computing system consists of nn machines each with k=M/nk=M/n data points (without loss of generality suppose MM is divisible by nn, otherwise one of the machines have the remainder of data points). For all i=1,,ni=1,\dots,n, we define a function based on the available data to machine ii as

I-B Related Works and Contributions

Much of this literature builds on the seminal works , which proposed gradient methods that can parallelize computations across multiple processors. A number of recent papers proposed subgradient type methods or a dual averaging method to design distributed optimization algorithms.

An alternative approach is to use Alternating Direction Method of Multipliers (ADMM) type methods which for separable problems leads to decoupled computations (see e.g. and for comprehensive tutorials on ADMM). ADMM has been studied extensively in the 80’s . More recently, it has found applications in a variety of distributed settings in machine learning such as model fitting, resource allocation, and classification (see e.g. ).

In this paper we present a new distributed ADMM algorithm for solving problem (1) over a network. Our algorithm relies on a novel node-based reformulation of (1) and leads to an ADMM algorithm that uses dual variables with dimension given by the number of nodes in the network. This results in a significant reduction in the number of variables stored and communicated with respect to edge-based ADMM algorithm presented in the literature (see ). Our main contribution is a unified convergence rate analysis for this algorithm that applies to both the case when the local objective functions are convex and also the case when the local objective functions are strongly convex with Lipschitz continuous gradients. In particular, our analysis shows that when the local objective functions are convex (with no further assumptions), then the objective function at the ergodic average of the estimates generated by our algorithm converges with rate O(1T)O(\frac{1}{T}). Moreover, when the local objective functions are strongly convex with Lipschitz continuous gradients we show that the iterates converges linearly, i.e., the iterates converge to an ϵ\epsilon-neighborhood of the optimal solution after O(κflog(1ϵ))O(\sqrt{\kappa_{f}}\log(\frac{1}{\epsilon})) steps, where κf\kappa_{f} is the condition number defined as L/νL/\nu, where LL is the maximum Lipschitz gradient parameter and ν\nu is the minimum strong convexity constant of the local objective functions. This matches the best known iteration complexity and condition number dependence for the centralized ADMM (see e.g. ). Our convergence rate estimates also highlight a novel dependence on the network structure as well as the communication weights. In particular, for communication weights that are governed by the Laplacian of the graph, we establish a novel iteration complexity O(κfdmax4dmina2(G)log(1ϵ))O\left(\sqrt{\kappa_{f}}\sqrt{\frac{d^{4}_{\text{max}}}{d_{\text{min}}a^{2}(G)}}\log\left(\frac{1}{\epsilon}\right)\right), where dmind_{\text{min}} is the minimum degree, dmaxd_{\text{max}} is the maximum degree, and a(G)a(G) is the algebraic connectivity of the network. Finally, we illustrate the performance of our algorithm with numerical examples.

Our paper is most closely related to , which studied edge-based ADMM algorithms for solving (1). In , the authors consider convex local objective functions and provide an O(1T)O(\frac{1}{T}) convergence rate. The more recent paper assumes strongly convex local objective functions with Lipschitz gradients and show a linear convergence rate through a completely different analysis. This analysis does not extend to the node-based ADMM algorithm under these assumptions. In contrast, our paper provides a unified convergence rate analysis for both cases for the node-based distributed ADMM algorithm. Our paper is also related to that study the basic centralized ADMM where the goal is to miminize sum of two functions with a linearly coupled constraint. Our work is also related to the literature on the converge of operator splitting schemes, such as Douglas-Rachford splitting and relaxed Peaceman-Rachford .

I-C Outline

The organization of paper is as follows. In Section II we give the problem formulation and propose a novel distributed ADMM algorithm. In Section III, we show some preliminary results that helps us to show the main results. In Section IV we show the sub-linear convergence rate. In Section V we show the linear convergence rate of our algorithm. Finally, in Section VI we show the effect of network on the convergence rate and provide numerical results that illustrate the performance of our algorithm, which leads to concluding remarks in Section VII. All the omitted proofs are presented in the appendix.

II Framework

Consider a network represented by a connected graph G=(V,E)G=(V,E) where V={1,,n}V=\{1,\dots,n\} is the set of agents and EE is the set of edges. For any ii, let N(i)N(i) denote its set of neighbors including agent ii itself, i.e., N(i)={j  (i,j)E}{i}N(i)=\{j~{}|~{}(i,j)\in E\}\cup\{i\}, and let did_{i} denote the degree of agent ii, i.e., N(i)=di+1|N(i)|=d_{i}+1. We let dmax=maxiVdid_{\text{max}}=\max_{i\in V}d_{i} and dmin=miniVdid_{\text{min}}=\min_{i\in V}d_{i}.

The goal of the agents is to collectively solve optimization problem (1), where fif_{i} is a function known only to agent ii. In order to solve optimization problem (1), we introduce a variable xix_{i} for each ii and write the objective function of problem (1) as i=1nfi(xi)\sum_{i=1}^{n}f_{i}(x_{i}), so that the objective function is decoupled across the agents. The constraint that all the xix_{i}’s are equal can be imposed using the following matrix.

Let PP be a n×nn\times n matrix whose entries satisfy the following property: For any i=1,,ni=1,\dots,n, Pij=0P_{ij}=0 for jN(i)j\notin N(i). We refer to PP as the communication matrix.

The communication matrix PP satisfies null(P)=span{1}\text{null}(P)=\text{span}\{\mathbf{1}\}, where 1\mathbf{1} is a n×1n\times 1 vector with all entries equal to one and null(P)\text{null}(P) denotes the null-space of the matrix PP.

If Pij<0P_{ij}<0 for all jN(i){i}j\in N(i)\setminus\{i\}, summation of each row of PP is zero, and the graph is connected, then Assumption 1 holds. As a particular case, the Laplacian matrix of the graph given by Pij=1P_{ij}=-1 when jN(i){i}j\in N(i)\setminus\{i\} and zero otherwise, and Pii=diP_{ii}=d_{i} is a communication matrix that satisfies Assumption 1.

We next show that the constraint that all xix_{i}’s are equal can be enforced by the linear constraint Ax=0A\mathbf{x}=0, where x=(x1,,xn)\mathbf{x}=(x_{1},\dots,x_{n}) where each xix_{i} is a sub-vector of dimension dd and AA is a dn×dndn\times dn matrix defined as the Kronecker product between communication matrix PP and IdI_{d}, i.e., A=PIdA=P\otimes I_{d}.

Under Assumption 1, the constraint Ax=0A\mathbf{x}=0 guarantees that xi=xjx_{i}=x_{j} for all i,jVi,j\in V.

Using Lemma 1, under Assumption 1, we can reformulate optimization problem (1) as

where F(x)=i=1nfi(xi)F(\mathbf{x})=\sum_{i=1}^{n}f_{i}(x_{i}).

The optimal solution set of problem (3) is non-empty. We let x\mathbf{x}^{*} denote an optimal solution of the problem (3).

II-B Multiagent Distributed ADMM

We now use ADMM algorithm (see e.g. ). ADMM algorithm generates primal-dual sequences {xj(t)},{zij(t)}\{x_{j}(t)\},\{z_{ij}(t)\}, and {λij(t)}\{\lambda_{ij}(t)\} which at iteration t+1t+1 are updated as follows:

For any j=1,,nj=1,\dots,n, we update xjx_{j} as

For any i=1,,ni=1,\dots,n, we update the vector zi=[zij]jN(i)\mathbf{z}_{i}=[z_{ij}]_{j\in N(i)} as

where Zi={zi  jN(i)zij=0}Z_{i}=\{\mathbf{z}_{i}\ |\ \sum_{j\in N(i)}z_{ij}=0\}.

For i=1,,ni=1,\dots,n and jN(i)j\in N(i) we update λij\lambda_{ij} as

One can implement this algorithm in a distributed manner, where node ii maintains variables λij(t)\lambda_{ij}(t) and zij(t)z_{ij}(t) for all jN(i)j\in N(i) (). However, using the inherent symmetries in the problem, we can significantly reduce the number of variables that each node requires to maintain from O(E)O(|E|) to O(V)O(|V|).

We first show that for all tt, ii, and jN(i)j\in N(i), we have λij(t)=pi(t)\lambda_{ij}(t)=p_{i}(t). This reduction shows that the algorithm need not maintain dual variables λij(t)\lambda_{ij}(t) for each ii and its neighbors jj, but instead can operate with the lower dimensional node-based dual variable pi(t)p_{i}(t). The dual variable pi(t)p_{i}(t) can be updated using primal variables xj(t)x_{j}(t) for all jN(i)j\in N(i). The second observation is that zij(t)=Aijxj(t)yi(t)z_{ij}(t)=A_{ij}x_{j}(t)-y_{i}(t), where yi(t)=1di+1([A]i)x(t)y_{i}(t)=\frac{1}{d_{i}+1}\left({[A]^{i}}\right)^{\prime}\mathbf{x}(t) and ([A]i)=(Pi1,,Pin)Id\left({[A]^{i}}\right)^{\prime}=(P_{i1},\dots,P_{in})\otimes I_{d}. This reduction shows that the algorithm need not maintain primal variables zij(t)z_{ij}(t) for each ii and its neighbors jj, but instead can operate with the lower dimensional node-based primal variables yi(t)y_{i}(t), where yi(t)y_{i}(t) is node ii’s estimate of the primal variable (obtained as the average of primal variables of his own neighbors). The aforementioned reductions are shown in the following proposition.

The sequence {xi(t)}t=0\{x_{i}(t)\}_{t=0}^{\infty} for i=1,,ni=1,\dots,n generated by implementing the steps presented in Algorithm (1) is the same as the sequence generated by the ADMM algorithm.

The steps of the algorithm can be implemented in a distributed way, meaning that each node first updates her estimates based on the information received from her neighboring nodes and then broadcasts her updated estimates to her neighboring nodes. Each node ii maintains local variables xi(t)x_{i}(t), yi(t)y_{i}(t), and pi(t)p_{i}(t) and updates these variables using communication with its neighbors as follows:

At the end of iteration tt, each node ii sends out pi(t)p_{i}(t) and yi(t)y_{i}(t) to all of its neighbors and then each node such as jj uses yi(t)y_{i}(t) and pi(t)p_{i}(t) of all iN(j)i\in N(j) to update xj(t+1)x_{j}(t+1) as in step 1.

Each node jj sends out xj(t+1)x_{j}(t+1) to all of its neighbors and then each node such as ii computes yi(t+1)y_{i}(t+1) as in step 2.

Each node ii updates pi(t+1)p_{i}(t+1) as in step 3.

Using this algorithm agent ii need to store only three variables, xi(t)x_{i}(t), yi(t)y_{i}(t), and pi(t)p_{i}(t) and update them at each iteration. Also, each agent need to communicate only with (broadcast her estimates to) its neighbors. Therefore, the overall storage requirement is 3V3|V| and the overall communication requirement at each iteration is E|E|.

III Preliminary Results

In this section, we present the preliminary results that we will use to establish our convergence rate. we define

The update of Algorithm 1 can be written as

for some h(x(t+1))F(x(t+1))h(\mathbf{x}(t+1))\in\partial F(\mathbf{x}(t+1)).

Lemma 2 shows x(t+1)\mathbf{x}(t+1) can be written as a perturbed linear combination of {x(s)}s=0t\{\mathbf{x}(s)\}_{s=0}^{t} with the perturbation being the term 1cM1h(x(t+1))-\frac{1}{c}M^{-1}h(\mathbf{x}(t+1)). The intuition behind the convergence rate analysis is that the linear term that relates x(t+1)\mathbf{x}(t+1) to x(0),,x(t)\mathbf{x}(0),\dots,\mathbf{x}(t) guarantees that the sequence x(t)\mathbf{x}(t) converges to a consensus point where xi(t)=xj(t)x_{i}(t)=x_{j}(t) for all i,jVi,j\in V; and the perturbation term 1cM1h(x(t+1))-\frac{1}{c}M^{-1}h(\mathbf{x}(t+1)) guarantees that the converging point minimizes the objective function F(x)=i=1nfi(xi)F(\mathbf{x})=\sum_{i=1}^{n}f_{i}(x_{i}).

IV Sub-linear Rate of Convergence

In this section, we show the sublinear rate of convergence. We define two auxiliary sequences that we will use in proving the convergence rates. Since AD1AA^{\prime}D^{-1}A is positive semidefinite (see Lemma 6 in the appendix), we can define Q=(AD1A)1/2Q=(A^{\prime}D^{-1}A)^{1/2}. In other words, we let Q=VΣ1/2VQ=V\Sigma^{1/2}V^{\prime}, where AD1A=VΣVA^{\prime}D^{-1}A=V\Sigma V^{\prime} is the singular value decomposition of the symmetric matrix AD1AA^{\prime}D^{-1}A. We define the auxiliary sequences

Next, we show a proposition that bounds the function value at each iteration.

where q=(rx)\mathbf{q}^{*}=\begin{pmatrix}\mathbf{r}^{*}\\ \mathbf{x}^{*}\end{pmatrix}.

In order to obtain O(1/T)O(1/T) convergence rate, we consider the performance of the algorithm at the ergodic vector defined as x^(T)=(x^1(T),,x^n(T))\hat{\mathbf{x}}(T)=(\hat{x}_{1}(T),\dots,\hat{x}_{n}(T)), where

for any i=1,,ni=1,\dots,n. Note that each agent ii can construct this vector by simple recursive time-averaging of its estimate xi(t)x_{i}(t). Let (x,r^)(\mathbf{x}^{*},\hat{\mathbf{r}}) be a primal-dual optimal solution of

Since null(Q)=null(P)\text{null}(Q)=\text{null}(P), under Assumption (1), the optimal primal solution of this problem is the same as of the original problem (3) Next, we show both objective function and feasibility violation converges with rate O(1T)O(\frac{1}{T}) to the optimal value.

This theorem shows that the objective function at the ergodic average of the sequence of estimates generated by Algorithm 1 converges with rate O(1T)O(\frac{1}{T}) to the optimal solution. We next characterize the network effect on the performance guarantee.

For any TT, starting form x(0)=0\mathbf{x}(0)=0, we have

V Linear Rate of Convergence

In order to show the linear rate of convergence, we adopt the following standard assumptions.

For any i=1,,ni=1,\dots,n, the function fif_{i} is differentiable and has Lipschitz continuous gradient, i.e.,

for some Lfi0L_{f_{i}}\geq 0. The function fif_{i} is also strongly convex with parameter νfi>0\nu_{f_{i}}>0, i.e., fi(x)νfi2x22f_{i}(x)-\frac{\nu_{f_{i}}}{2}||x||_{2}^{2} is convex.

We let ν=min1inνfi\nu=\min_{1\leq i\leq n}\nu_{f_{i}} and L=max1inLfiL=\max_{1\leq i\leq n}L_{f_{i}}, and define the condition number of F(x)F(\mathbf{x}) (or the condition number of problem (3)) as κf=Lν\kappa_{f}=\frac{L}{\nu}. Note that when the functions are differentiable, we have

Assumption 3 results in the following standard inequalities for the aggregate function F(x)F(\mathbf{x}).

Under Assumption (3) we show that the sequence generated by Algorithm (1) converges linearly to the optimal solution (which is unique under these assumptions). The idea is to use strong convexity and Lipschitz gradient property of F(x)F(\mathbf{x}) in order to show that the GG-norm of sequence q(t)q\mathbf{q}(t)-\mathbf{q}^{*} contracts at each iteration, providing a linear rate.

Suppose Assumptions 1, 2, and 3 hold. For any value of the penalty parameter c>0c>0 and β(0,1)\beta\in(0,1), the sequence generated by Algorithm 1 {x(t)}t=1\{\mathbf{x}(t)\}_{t=1}^{\infty} satisfies

The rate of convergence in Theorem 3 holds for any choice of penalty parameter c>0c>0. In other words, for any choice of c>0c>0, the convergence rate is linear. We now optimize the rate of convergence over all choices of cc and provide an explicit convergence rate estimate that highlights dependence on the condition number of the problem.

Suppose Assumptions 1, 2, and 3 hold. Let {x(t)}t=1\{\mathbf{x}(t)\}_{t=1}^{\infty} be the sequence generated by Algorithm 1. There exist c>0c>0 for which we have

VI Network Effects

We can choose communication matrix PP (and the corresponding matrix AA) in the Algorithm 1 to be any matrix that satisfies Assumption (1). One natural choice for the matrix PP is the Laplacian of the graph which leads to having Aij=Aji=1A_{ij}=A_{ji}=-1 for all jN(i){i}j\in N(i)\setminus\{i\} and Aii=diA_{ii}=d_{i}. Using Laplacian as the communication matrix we can now capture the effect of network structure in the convergence rate.

The following proposition explicitly show the networks dependence in the bounds provided in Theorem 2.

For any TT, starting form x(0)=0\mathbf{x}(0)=0 and using standard Laplacian as the communication matrix, we have

where UU is a bound on the subgradients of the function FF at x\mathbf{x}^{*}, i.e., vU\|\mathbf{v}\|\leq U for all vF(x)\mathbf{v}\in\partial F(\mathbf{x}^{*}) and a(G)a(G) is the algebraic connectivity of the graph.

Therefore, highly connected graphs with larger algebraic connectivity has a faster convergence rate (see e.g. , for an overview of the results on algebraic connectivity).

VI-B Network Effect in Linear Rate

The following proposition explicitly show the networks dependence in the bound provided in Theorem 4.

Suppose Assumptions 1, 2, and 3 hold. Using standard Laplacian as the communication matrix, in order to reach an ϵ\epsilon-optimal solution O(κfdmax4dmina2(G)log(1ϵ))O\left(\sqrt{\kappa_{f}}\sqrt{\frac{d^{4}_{\text{max}}}{d_{\text{min}}a^{2}(G)}}\log\left(\frac{1}{\epsilon}\right)\right) iterations suffice.

Both of our guaranteed rates for sub-linear and linear rates depends on three parameters dmaxd_{\text{max}}, dmind_{\text{min}} and a(G)a(G). The convergence rate is faster for larger dmind_{\text{min}} and smaller dmaxd_{\text{max}}. Finally, the convergence rate is faster for larger algebraic connectivity a(G)a(G).

VII conclusion

We proposed a novel distributed algorithm based on Alternating Direction Method of Multipliers (ADMM) to minimize the sum of locally known convex functions. We first showed that ADMM can be implemented by only keeping track of some node-based variables. We then showed that our algorithm reaches ϵ\epsilon-optimal solution in O(1ϵ)O\left(\frac{1}{\epsilon}\right) number of iterations for convex functions and in (κflog(1ϵ))\left(\sqrt{\kappa_{f}}\log\left(\frac{1}{\epsilon}\right)\right) iterations for strongly convex and Lipschitz functions. Our analysis shows that the performance of our algorithm depends on the algebraic connectivity of the graph, the minimum degree of the nodes, and the maximum degree of the nodes. Finally, we illustrated the performance of our algorithm with numerical examples.

Appendix

Consider the kk-th coordinate of all xix_{i}’s and form a n×1n\times 1 vector xk\mathbf{x}^{k}. From Ax=0A\mathbf{x}=0 and the fact that A=PIA=P\otimes I, we obtain that Pxk=0P\mathbf{x}^{k}=0. This shows that xknull(P)\mathbf{x}^{k}\in\text{null}(P). Using Assumption 1, we have xkspan({1})\mathbf{x}^{k}\in\text{span}(\{\mathbf{1}\}), which guarantees that all entries of xk\mathbf{x}^{k} are equal. Similarly, for any k=1,,dk=1,\dots,d, the kk-th entries of all xix_{i}’s are equal. This leads to xi=xjx_{i}=x_{j} for all i,jVi,j\in V.

VII-B Proof of Lemma 2

Using the first step of Algorithm (1) for ii, we can write xi(t+1)x_{i}(t+1) as

where hi(x(t+1))=fi(xi(t+1))h_{i}(\mathbf{x}(t+1))=f^{\prime}_{i}(x_{i}(t+1)) for differentiable functions and hi(x(t+1))fi(xi(t+1))h_{i}(\mathbf{x}(t+1))\in\partial f_{i}(x_{i}(t+1)) in general. We next use second and third steps of Algorithm (1) to write pi(t)p_{i}(t) and yi(t)y_{i}(t) in terms of (x(0),,x(t))(\mathbf{x}(0),\dots,\mathbf{x}(t)). Using the update for jj, we have

Moreover, we can write the term jN(i)Ajiyj(t)\sum_{j\in N(i)}A_{ji}y_{j}(t) based on the sequence (x(0),,x(t))(\mathbf{x}(0),\dots,\mathbf{x}(t)) as follows

Substituting (VII-B) and (VII-B) in (VII-B), we can write the update of xi(t+1)x_{i}(t+1) in terms of the sequence (x(0),,x(t))(\mathbf{x}(0),\dots,\mathbf{x}(t)), which then can compactly be written as

where h(x(t+1))=F(x(t+1))h(\mathbf{x}(t+1))=\nabla F(\mathbf{x}(t+1)) if the functions are differentiable and h(x(t+1))F(x(t+1))h(\mathbf{x}(t+1))\in\partial F(\mathbf{x}(t+1)) in general. Left multiplying by 1cM1\frac{1}{c}M^{-1}, completes the proof.

VII-C Proof of Proposition 2

We first show a lemma that we will use in the proof of this proposition. The following lemma shows the relation between the auxiliary sequence r(t)\mathbf{r}(t) and the primal sequence x(t)\mathbf{x}(t).

Suppose Assumptions 1 and 2 hold. The sequence {x(t),r(t)}t=0\{\mathbf{x}(t),\mathbf{r}(t)\}_{t=0}^{\infty} satisfies

for some h(x(t+1))F(x(t+1))h(\mathbf{x}(t+1))\in\partial F(\mathbf{x}(t+1)).

We subtract (AD1A)x(t+1)(A^{\prime}D^{-1}A)\mathbf{x}(t+1) from both sides and rearrange the terms to obtain

Back to the proof of Proposition 2: Using Lemma 7 and Lemma 3 part (c), we have that

VII-D Proof of Theorem 1

Taking summation of the relation given in Proposition 2 from t=0t=0 up to t=Tt=T, we obtain that

Using convexity of the functions and Jensen’s inequality, we obtain

Next, we will bound the term r^Qx^(T){\hat{\mathbf{r}}}^{\prime}Q\hat{\mathbf{x}}(T). We add the term cr^Qx^(T)c{\hat{\mathbf{r}}}^{\prime}Q\hat{\mathbf{x}}(T) to both sides of (12) to obtain

Again, using Proposition 2 to bound the right-hand side of (14), we obtain

Using (15) to bound the right-hand side of (13), and then combining the result with (12), we obtain

We next bound the feasibility violation. Using Proposition 2 with r=r^+Qx^(T)Qx^(T)\mathbf{r}=\hat{\mathbf{r}}+\frac{Q\hat{\mathbf{x}}(T)}{||Q\hat{\mathbf{x}}(T)||}, we have

Since (x,r^)(\mathbf{x}^{*},\hat{\mathbf{r}}) is a primal-dual optimal solution, using saddle point inequality, we have that

Combining the two previous relations, we obtain

we can further bound Qx^(T)||Q\hat{\mathbf{x}}(T)|| as

VII-E Proof of Theorem 2

We first show a lemma that bounds the norm of dual optimal solution of (16).

Next, we use this lemma to analyze the network effect. Using Theorem 1 with zero initial condition, we have

Using xMAD1A2λMx22||\mathbf{x}^{*}||_{M-A^{\prime}D^{-1}A}^{2}\leq\lambda_{M}||\mathbf{x}^{*}||_{2}^{2} completes the proof.

VII-F Proof of Lemma 3

For any ii, since fi(x)νfi2x2f_{i}(x)-\frac{\nu_{f_{i}}}{2}||x||^{2} is convex, (fi(x)νfi2x2)\nabla(f_{i}(x)-\frac{\nu_{f_{i}}}{2}||x||^{2}) is monotone and we have

Let ϕx(y)=fi(y)<fi(x),y>\phi_{x}(y)=f_{i}(y)-\left<{\nabla}f_{i}(x),y\right>. Note that ϕx(y)\phi_{x}(y) has Lipschitz gradient with parameter LfiL_{f_{i}}. Moreover, we have that minyϕx(y)=ϕx(x)\min_{y}\phi_{x}(y)=\phi_{x}(x), since

is zero for y=xy=x. Therefore, using the previous relation, we have that

We add the two preceding relations to obtain

for any i=1,,ni=1,\dots,n. Combining this relation for all ii’s completes the proof. Finally, we prove part (c). Let h=(h1,,hn)h=(h_{1}^{\prime},\dots,h_{n}^{\prime})^{\prime}. By definition of subgradient, for any iVi\in V we have

Taking summation of this inequality for all i=1,,ni=1,\dots,n shows that

VII-G Proof of Theorem 3

We first show two lemmas that we use in the proof of this theorem. The first lemma shows that both matrices MAD1AM-A^{\prime}D^{-1}A and AD1AA^{\prime}D^{-1}A are positive semidefinite and the second lemma shows a relation between the sequences q(t)\mathbf{q}(t), r(t)\mathbf{r}(t), and x(t)\mathbf{x}(t) same as the one shown in Lemma 7.

The matrices MAD1AM-A^{\prime}D^{-1}A and AD1AA^{\prime}D^{-1}A are positive semidefinite.

Both matrices are clearly symmetric. We first show MAD1AM-A^{\prime}D^{-1}A is positive semidefinite. By definition, we have

where we used the fact that j=1nAlj=0\sum_{j=1}^{n}A_{lj}=0, for any ll. Therefore, by Greshgorin Circle Theorem, for any eigen value μ\mu of MAD1AM-A^{\prime}D^{-1}A, for some ii we have

where we used the fact that dl1d_{l}\geq 1 that evidently holds. We next show that AD1AA^{\prime}D^{-1}A is positive semidefinite. We have

Since [AD1A]iiji[AD1A]ij[A^{\prime}D^{-1}A]_{ii}\geq\sum_{j\neq i}\left.|[A^{\prime}D^{-1}A]_{ij}\right.|, similarly, by Greshgorin Circle Theorem, the matrix AD1AA^{\prime}D^{-1}A is positive semidefinite. ∎

Suppose Assumptions 1 and 2 hold. For differentiable functions, the sequence {x(t),r(t)}t=0\{\mathbf{x}(t),\mathbf{r}(t)\}_{t=0}^{\infty} satisfies

for some r\mathbf{r}^{*} that satisfies Qr+1cF(x)=0Q\mathbf{r}^{*}+\frac{1}{c}\nabla F(\mathbf{x}^{*})=0. Moreover, r\mathbf{r}^{*} belongs to the column span of QQ.

Using Lemma 2 for differentiable functions we have

We subtract (AD1A)x(t+1)(A^{\prime}D^{-1}A)\mathbf{x}(t+1) from both sides and rearrange the terms to obtain

Back to the proof of Theorem 3: Note that since MAD1AM-A^{\prime}D^{-1}A is positive semidefinite,

is a semi-inner product.This means it satisfies conjugate symmetry, linearity and semipositive-definiteness (instead of positive-definiteness). We first show that for a δ\delta given by the statement of theorem, we have

Again, using Lemma (3) and Lemma (7), we have

Using (VII-G) and (VII-G), for any β(0,1)\beta\in(0,1), we have

Comparing this relation with (18), it remains to show

Using Lemma (6), in order to show this inequality it suffices to show

Since both r(t+1)\mathbf{r}(t+1) and r\mathbf{r}^{*} are orthogonal to 1\mathbf{1} and null(Q)=span({1})\text{null}(Q)=\text{span}(\{\mathbf{1}\}), using Lemma (7), we obtain

Comparing (VII-G) and (VII-G), it suffices to have

This shows that (18) holds. Using (18) along with (VII-G) completes the proof.

VII-H Proof of Theorem 4

The largest possible δ\delta that satisfies the constraint given in Theorem 1 by maximizing over β(0,1)\beta\in(0,1) is the solution of

VII-I Proof of Proposition 3

where a(G)a(G) is the algebraic connectivity of the graph which is the smallest non-zero eigenvalue of the Laplacian matrix. Moreover, we have that

Plugging these two bounds in Theorem 2 an using the fact that dmin1d_{\text{min}}\geq 1 completes the proof.

VII-J Proof of Proposition 4

Using Theorem 4, for large enough κf\kappa_{f}(small enough δ\delta^{*}) we have 1log(1+δ)1δ\frac{1}{\log(1+\delta^{*})}\geq\frac{1}{\delta^{*}}, in order to have x(t)x2ϵ||\mathbf{x}(t)-\mathbf{x}^{*}||_{2}\leq\epsilon, we need to have

Using these two bounds along with λmax(A)2dmax\lambda_{\text{max}}(A)\leq 2d_{\text{max}}, we obtain

Plugging this bound into (VII-J) completes the proof.

References