Linear Convergence of Primal-Dual Gradient Methods and their Performance in Distributed Optimization

Sulaiman A. Alghunaim, Ali H. Sayed

I Introduction

Consider the constrained optimization problem:

is the augmented Lagrangian of problem (1), λ\lambda is a dual variable, and ρ0\rho\geq 0 is the augmented Lagrangian penalty parameter. Note that for ρ=0\rho=0, L0(w,λ)L_{0}(w,\lambda) becomes the classical Lagrangian of problem (1). If a point (w,λ)(w^{\star},\lambda^{\star}) exists that solves (2), then ww^{\star} is an optimal solution to the constrained problem when strong duality holds, which is the case under our assumptions . A classical algorithm that solves (2) is the primal-dual (PD) gradient algorithm (4). In this algorithm, Jρ(w){\nabla}J_{\rho}(w) denotes the gradient of Jρ(w)=J(w)+ρ2Bwb2J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2} evaluated at ww and (μw,μλ)(\mu_{w},\mu_{\lambda}) are positive step-sizes (learning rates) chosen by the designer. The updates in (4) are primal-descent dual-ascent steps applied to (3) and it subsume the classical Lagrangian implementation when ρ=0\rho=0 and the augmented Lagrangian implementation when ρ>0\rho>0. Note that the updates in (4) are incremental since the dual update (4b) uses the most recent primal variable wiw_{i} and not wi1w_{i-1}. If the dual update uses the previous primal iterate wi1w_{i-1}, then we refer to the update as non-incremental.

This work provides a concise proof that establishes the linear convergence of recursion (4) and studies its relation to the non-incremental implementation. We also study the effect of the penalty term ρ2Bwb2{\rho\over 2}\|Bw-b\|^{2} on the performance of multi-agent consensus optimization algorithms. Algorithms of the form (4) have been applied in various applications including wireless systems , power systems , reinforcement learning , and network utility maximization .

There exists a large body of literature on primal-dual saddle-point algorithms – see and the references therein, including the seminal work , which proposed recursions of the type (4) and established their convergence. These works focus on proving convergence to an optimal solution without providing convergence rates, provide sub-linear convergence rates (e.g., 1i{1\over i} where ii is the iteration index), or show linear convergence from a starting point that is sufficiently close to a solution (local convergence). Some other works examined global linear convergence under different settings.

The works focuses on continuous versions of the primal-dual gradient dynamics and establish linear convergence for augmented Lagrangian implementations (i.e., they require the presence of the augmented Lagrangian term ρ/2Bwb2\rho/2\|Bw-b\|^{2}, where ρ\rho is strictly positive). They also require BB to have full row rank. Similarly, the work establishes linear convergence for continuous primal-dual gradient dynamics for full row rank BB, but it does not require the presence of the augmented Lagrangian term. Moreover, it was shown in that if the continuous dynamics is discretized using Euler discretization, then the discrete version converges linearly under small enough step sizes. However, no upper bound is given on the step-sizes. Moreover, Euler discretization uses identical step-sizes for the primal and dual updates (i.e., μw=μλ\mu_{w}=\mu_{\lambda}) and results in a non-incremental primal-dual dynamics. Therefore, the results in are not directly applicable to the discrete incremental implementation (4) and do not provide clear bounds on the step-sizes.

We remark that linear convergence for various monotone operator methods have been established albeit under other conditions that are not satisfied in our setup. For example, the linear convergence results in and [18, Proposition 25.9] for forward-backward splitting methods would require the saddle-point problem (2) to be both strongly-convex with respect to ww and strongly-concave with respect to λ\lambda. This holds for example for problems with Lagrangian L(w,λ)=J(w)+λTBwg(λ)L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda) where J(w)J(w) and g(λ)g(\lambda) are both strongly-convex functions. Similarly, the conditions used in require the saddle-point problem (2) to be strongly-convex with respect to ww and strongly-concave with respect to λ\lambda. In our setup, Lρ(w,λ)L_{\rho}(w,\lambda) is not strongly-concave with respect to λ\lambda.

The work showed that for saddle point problems with L(w,λ)=J(w)+λTBwg(λ)L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda), linear convergence is possible without requiring the Lagrangian to be both strongly-convex and strongly-concave. In particular, it established linear convergence when the primal function J(w)J(w) is smooth and convex, the dual function g(λ)-g(\lambda) is smooth and strongly-concave, and the additional assumption that BB is a full column rank matrix. Unlike the current work, the algorithm analyzed in is non-incremental; moreover, particular fixed step-sizes are needed to establish linear convergence – [22, Theorem 3.1].

Now, in the distributed optimization literature, various incremental primal-dual gradient algorithms have been proposed to solve multi-agent consensus optimization problems – see and references therein, which are mostly based on AL formulations. They have been shown to achieve linear convergence under strong-convexity even though the consensus constraint matrix is not full rank. However, the analysis techniques used to establish the convergence of these methods either depend on the particular consensus constraint matrix and/or require the AL term to be strictly positive. Unlike these works, our analysis does not require ρ\rho to be strictly positive. Moreover, due to our unified Lagrangian and AL framework, we clarify the effect of the AL penalty term on the performance of these types of distributed algorithms. Note that the work studied non-incremental primal-dual methods with identical step-sizes for quadratic distributed optimization. It was found in that unlike AL methods, Lagrangian methods suffer from stability issues when the individual costs are not strongly-convex. Unlike , we study the affect of the AL penalty on the convergence rate of distributed algorithms.

I-B Contribution

Given the above, this work has two main contributions: I) Through an original proof, we establish the linear convergence of the incremental implementation (4). Moreover, we show how the non-incremental implementation is related to the incremental one and establish its linear convergence while providing explicit upper bounds on the step-sizes. Our proof technique does not require the AL parameter to be strictly positive nor do we require BB to have full row rank. II) We show the effect of the AL penalty term on the performance of distributed multi-agent optimization algorithms. Depending on the condition number of the agents’ costs, we provide scenarios where the AL term is beneficial and other scenarios where it is not beneficial.

II Auxiliary Results

This section gives the auxiliary results leading to the main convergence result. We start with the following condition on the cost function.

(Cost function): It is assumed that a unique solution ww^{\star} exists for problem (1) and the cost function J(w)J(w) is convex. It is also assumed that J(w)J(w) is δ\delta-smooth, consequently, Jρ(w)=J(w)+ρ2Bwb2J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2} is δρ\delta_{\rho}-smooth with δρ=δ+ρσmax2(B)\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}(B). Moreover, the cost Jρ(w)J_{\rho}(w) is νρ\nu_{\rho}-strongly-convex with respect to ww^{\star}, namely,

The scalars satisfy 0<νρδρ0<\nu_{\rho}\leq\delta_{\rho} for any ρ0\rho\geq 0. \Box

If J(w)J(w) is ν\nu-strongly-convex, then ww^{\star} is unique [1, Example 5.4] and condition (5) will be satisfied with νρ=ν\nu_{\rho}=\nu. We remark that condition (5) does not necessarily imply that J(w)J(w) is strongly-convex w.r.t. ww^{\star} unless ρ=0\rho=0. This condition is used instead of typical strong-convexity to be consistent with the conditions used to study the effect of the augmented Lagrangian term on the performance of distributed algorithms in Section V. \Box

It is known that a pair (w,λ)(w^{\star},\lambda^{\star}) is an optimal solution to (2) if, and only if, it satisfies the optimality conditions :

From (6a) and uniqueness of ww^{\star}, λ\lambda^{\star} will be unique if BB has full row rank. In general λ\lambda^{\star} is not necessarily unique. Motivated by , we will characterize a particular dual solution that we later show convergence to. For that result and later analysis, we need the following result.

The result follows since xTUrΣrUrTx=λx2x^{\mathsf{T}}U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}x=\|\lambda_{x}\|^{2}. The inequality follows since σ2(B)\underline{\sigma}^{2}(B) is the smallest eigenvalue (or diagonal entry) of Σr\Sigma_{r} – see [1, Appendix A.5.2]. ∎

(Particular dual λb\lambda^{\star}_{b}): There exists a unique optimal dual variable, denoted by λb\lambda^{\star}_{b}, lying in the range space of BB.

The argument is motivated by . Any solution λ\lambda^{\star} of the linear system of equations given in (6a) can be decomposed into two parts λ=λb+λn\lambda^{\star}=\lambda^{\star}_{b}+\lambda_{n}^{\star}, where λbRange(B)\lambda^{\star}_{b}\in{\rm Range}(B) and λnNull(BT)\lambda_{n}^{\star}\in{\rm Null}(B^{\mathsf{T}}) – see . Therefore, if (w,λ)(w^{\star},\lambda^{\star}) satisfies (6), then (w,λb)(w^{\star},\lambda^{\star}_{b}) also satisfies (6). We now show λb\lambda^{\star}_{b} is unique by contradiction. Assume we have two distinct dual solutions λb1=Bx1\lambda^{\star}_{b_{1}}=Bx_{1} and λb2=Bx2\lambda^{\star}_{b_{2}}=Bx_{2} lying in the range space of BB. Then, substituting into (6a) and subtracting, we get BTB(x1x2)=0B^{\mathsf{T}}B(x_{1}-x_{2})=0. It follows that B(x1x2)2=0\|B(x_{1}-x_{2})\|^{2}=0 and, consequently, B(x1x2)=0B(x_{1}-x_{2})=0. This means that λb1=Bx1=Bx2=λb2\lambda^{\star}_{b_{1}}=Bx_{1}=Bx_{2}=\lambda^{\star}_{b_{2}}, which is a contradiction. ∎

Note that if λi1\lambda_{i-1} belongs to the range space of BB (i.e., λi1=Bx\lambda_{i-1}=Bx for some xx) or λi1=0\lambda_{i-1}=0, then from b=Bwb=Bw^{\star} and (4b) we know that \lambda_{i}=\lambda_{i-1}+\mu_{\lambda}(Bw_{i}-b)=B\big{(}x+\mu_{\lambda}(w_{i}-w^{\star})\big{)} will remain in the range space of BB. Thus, {λi}i0\{\lambda_{i}\}_{i\geq 0} will always remain in the range space of BB if λ1\lambda_{-1} belongs to the range space of BB or λ1=0\lambda_{-1}=0. This observation will allow us to utilize the bound (8) to establish linear convergence to the particular saddle-point (w,λb)(w^{\star},\lambda^{\star}_{b}) without requiring a rank condition on the matrix BB.

III LINEAR CONVERGENCE RESULT

We are now ready to establish our main result. Let w~i  =Δ  wiw\widetilde{w}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;w_{i}-w^{\star} and λ~i  =Δ  λiλb\widetilde{\lambda}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda_{i}-\lambda^{\star}_{b} denote the primal and dual errors, respectively.

(Linear convergence): Let Assumption 1 holds and assume the step-sizes are positive and satisfy:

If λ1=0\lambda_{-1}=0, then algorithm (4) converges linearly to the particular saddle-point (w,λb)(w^{\star},\lambda^{\star}_{b}), namely, it holds that

where cλ>0c_{\lambda}>0, cw=1μwμλσmax2(B)>0c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0, and

Subtracting ww^{\star} and λb\lambda^{\star}_{b} from both sides of (4) and using the optimality conditions (6) we get the coupled error recursion:

Squaring both sides of (11a) and (11b) we get

Using the bound Bw~i2σmax2(B)w~i2\|B\widetilde{w}_{i}\|^{2}\leq\sigma^{2}_{\max}(B)\|\widetilde{w}_{i}\|^{2}, multiplying equation (13) by cλ  =Δ  μw/μλc_{\lambda}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\mu_{w}/\mu_{\lambda} and adding to (12) gives:

where cw  =Δ  1μwμλσmax2(B)c_{w}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B). Note that from Lemma 2, λb\lambda^{\star}_{b} lies in the range space of BB. Moreover, since λ1=0\lambda_{-1}=0, then we know that λ~i\widetilde{\lambda}_{i} will always lie in the range space of BB. Thus, from (7) it holds that BTλ~i12σ2(B)λ~i12\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\geq\underline{\sigma}^{2}(B)\|\widetilde{\lambda}_{i-1}\|^{2}. Using this bound in (14), we get:

Since Jρ(w)J_{\rho}(w) is δρ\delta_{\rho}-smooth, it holds that [31, Theorem 2.1.5]:

for μ<2/δρ\mu<2/\delta_{\rho}. This follows directly by expanding the square and using the bounds (5) and (16). Let γ1=1μwνρ(1μwδρ)\gamma_{1}=1-\mu_{w}\nu_{\rho}(1-\mu_{w}\delta_{\rho}). Since cw=1μwμλσmax2(B)c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B), it holds that:

where the last step we used the fact that the second term is non-positive under the conditions μw<1δρ\mu_{w}<{1\over\delta_{\rho}} and μλνρ/σmax2(B)\mu_{\lambda}\leq\nu_{\rho}/\sigma^{2}_{\max}(B). We conclude that equation (10) holds by using the previous two equations in (15). Note that for positive step-sizes it holds that cλ=μwμλ>0c_{\lambda}={\mu_{w}\over\mu_{\lambda}}>0. Moreover, cw=1μwμλσmax2(B)>0c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0 and 0<1μwμλσ2(B)<10<1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)<1 if μwμλ<1σmax2(B)\mu_{w}\mu_{\lambda}<{1\over\sigma^{2}_{\max}(B)}. This condition is satisfied under condition (9) because under these conditions we have

where the last inequality hold because νρδρ\nu_{\rho}\leq\delta_{\rho}. ∎

Theorem 1 shows that under conditions (9), the incremental algorithm (4) converges linearly. We will show how to utilize this result to establish the linear convergence of the classical non-incremental (Arrow-Hurwicz) method .

IV Non-incremental PD Gradient Method

Consider the non-incremental update (Arrow-Hurwicz):

where Jη(w)  =Δ  J(w)+η2Bwb2J_{\eta}(w)\;\stackrel{{\scriptstyle\Delta}}{{=}}\;J(w)+{\eta\over 2}\|Bw-b\|^{2} and η0\eta\geq 0. Different from (4), recursion (19b) uses wi1w_{i-1} in the dual update instead of wiw_{i}. We will see that these two different implementations are equivalent for particular choices of η\eta and ρ\rho.

(Equivalence of (4) and (19b)) The primal iterates of the non-incremental recursion (19b) are equivalent to the primal iterates of the incremental recursion (4) if η=ρ+μλ\eta=\rho+\mu_{\lambda} and λ1=λ1μλ(Bw1b)\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b).

Let η=ρ+μλ\eta=\rho+\mu_{\lambda}. It holds that Jη(w)=Jρ(w)+μλ2Bwb2J_{\eta}(w)=J_{\rho}(w)+{\mu_{\lambda}\over 2}\|Bw-b\|^{2} so that Jη(w)=Jρ(w)+μλBT(Bwb){\nabla}J_{\eta}(w)={\nabla}J_{\rho}(w)+\mu_{\lambda}B^{\mathsf{T}}(Bw-b). Thus, for η=ρ+μλ\eta=\rho+\mu_{\lambda} step (19a) can be rewritten as:

where we introduced the change of variable λi  =Δ  λi+μλ(Bwib)\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b). Adding μλ(Bwib)\mu_{\lambda}(Bw_{i}-b) to both sides of (19b) and using λi  =Δ  λi+μλ(Bwib)\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b), we can directly rewrite (19b) as in (4b). Thus, the primal iterates of recursion (19b) are equivalent to the primal iterates of recursion (4) if λ1=λ1μλ(Bw1b)\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b). ∎

Lemma 3 implies that the non-incremental implementation (19b) is an instance of the incremental implementation with ρ=ημλ\rho=\eta-\mu_{\lambda}. Recall that in algorithm (4) we assume that ρ0\rho\geq 0. Therefore, if η=ρ+μλμλ\eta=\rho+\mu_{\lambda}\geq\mu_{\lambda}, the linear convergence of (19b) follows from Theorem 1 with ρ=ημλ0\rho=\eta-\mu_{\lambda}\geq 0. The case 0η<μλ0\leq\eta<\mu_{\lambda} implies that ρ=ημλ<0\rho=\eta-\mu_{\lambda}<0. This case can also be analyzed using the exact same technique as in Theorem 1. To show that, it suffices to consider the classical case η=0\eta=0.

(Non-Incremental η=0\eta=0) If the cost J(w)J(w) is δ\delta-smooth and ν\nu-strongly-convex and the step-sizes satisfy:

Then, recursion (19b) with η=0\eta=0 converges linearly to the optimal saddle-point if λ1=0\lambda^{\prime}_{-1}=0.

By relating recursion (19b) to (4), we are able to establish its linear convergence and provide explicit upper bounds on the step-sizes as well. The works and also established the linear convergence of the non-incremental recursion (19b) with η=0\eta=0. However, these works do not provide explicit upper bounds on the step-sizes or require particular fixed step-sizes to establish their result .

Assume b=0b=0 and consider the forward-backward gradient algorithm :

By using a change of variable trick, the analysis of (22b) directly follows from Theorem 1 with ρ=μλ\rho=\mu_{\lambda}. In particular, by adding and subtracting μwμλBTBwi1\mu_{w}\mu_{\lambda}B^{\mathsf{T}}Bw_{i-1} to the R.H.S. of (22a), letting λi  =Δ  λiμλBwi\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}-\mu_{\lambda}Bw_{i}, and rearranging (22b), recursion (22b) can be equivalently written as recursion (4) (b=0b=0) with ρ=μλ\rho=\mu_{\lambda}. \Box

V Application: Distributed Optimization

In this section, we study the benefit of the AL penalty term for distributed consensus optimization problems.

Consider a network of KK agents that are connected through some network and interested in the following problem:

The network is static, undirected, and the matrix AA is assumed to be primitive, i.e., there exists some integer p>0p>0 such that all entries of ApA^{p} are positive. We also assume AA to be symmetric, and doubly stochastic. \Box

Then, it holds that BW=0{\mathcal{B}}{\scriptstyle{\mathcal{W}}}=0 if, and only, if wk=ws  k,sw_{k}=w_{s}\ \forall\ k,s – see . Thus, problem (23) is equivalent to the following constrained problem:

A direct application of (4) to problem (26) gives:

where Jρ(W)  =Δ  J(W)+ρ2BW2{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})\;\stackrel{{\scriptstyle\Delta}}{{=}}\;{\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2} with ρ0\rho\geq 0. Recursion (27) is not distributed yet because B{\mathcal{B}} need not have the network structure. However, this can be easily handled by a change of variable. Let Yi=Bλi{\scriptstyle{\mathcal{Y}}}_{i}={\mathcal{B}}\lambda_{i} and multiply (27b) by B{\mathcal{B}} gives:

Since B2=(IKA)IM{\mathcal{B}}^{2}=(I_{K}-A)\otimes I_{M} has the network structure, then the kk-th block of B2Wi=col{uk,i}k=1K{\mathcal{B}}^{2}{\scriptstyle{\mathcal{W}}}_{i}={\rm col}\{u_{k,i}\}_{k=1}^{K} has the distributed form uk,i=wk,isNkaskws,iu_{k,i}=w_{k,i}-\sum_{s\in{\mathcal{N}}_{k}}a_{sk}w_{s,i} where Nk{\mathcal{N}}_{k} denotes the neighbors of agent kk, including agent kk. Therefore, recursion (28) is distributed and agent kk can locally update its corresponding kk-th blocks in Wi{\scriptstyle{\mathcal{W}}}_{i} and Yi{\scriptstyle{\mathcal{Y}}}_{i}.

Before we establish convergence of recursion (28) and show the influence of the AL penalty term on its performance, we show how the derivation of recursions (27) and (28) are related to some state of the art algorithms.

Note that the saddle point interpretation of EXTRA appeared in the work . If we choose μw=μ\mu_{w}=\mu, μλ=12μ\mu_{\lambda}={1\over 2\mu}, and ρ=12μ\rho={1\over 2\mu} in algorithm (27) we get:

where Aˉ  =Δ  I12B2=12(I+A)\bar{{\mathcal{A}}}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;I-{1\over 2}{\mathcal{B}}^{2}={1\over 2}(I+{\mathcal{A}}) and A=AIM{\mathcal{A}}=A\otimes I_{M}. By eliminating the dual-variable (see, e.g., ), the above algorithm can be shown to be equivalent to the EXTRA algorithm in , which requires communicating the primal variable once per iteration.

V-A2 Exact diffusion [26]

which differs from EXTRA (29) in the primal update where the gradient is also multiplied by Aˉ\bar{{\mathcal{A}}}. By eliminating the dual-variable, the above algorithm can be shown to be equivalent to the exact-diffusion algorithm from . Different from a traditional gradient primal-descent (29a) that was used to derive EXTRA, exact diffusion uses incremental gradient descent steps – see for details. Exact diffusion enjoys wider step-size μ\mu stability range and better convergence performance compared to EXTRA – see .

It is worth mentioning that if we consider the penalized unconstrained problem \min_{{\scriptstyle{\scalebox{0.5}{\mbox{\displaystyle\mathcal{W}}}}}}\ {\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2} and apply two incremental gradient descent steps for the two terms in the penalized cost with step-size μ\mu and ρ=1μ\rho={1\over\mu}, we arrive at:

which is the diffusion algorithm . The bias that arises from solving the penalized problem, rather than the original problem, can be corrected by employing exact diffusion .

V-A3 DIGing [25]

If we choose a different penalty function Jρ(W)=J(W)+ρ2WIA22{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{I-{\mathcal{A}}^{2}}^{2} and set B2B{\mathcal{B}}^{2}\leftarrow{\mathcal{B}} in algorithm (27) with μw=μ\mu_{w}=\mu, μλ=1μ\mu_{\lambda}={1\over\mu}, and ρ=1μ\rho={1\over\mu} we get:

By eliminating the dual variable, this algorithm can be shown to be equivalent to DIGing – see [25, Section 2.2]. We see that the main difference from the EXTRA derivation is in the choice of the constraint and penalty matrices.

V-A4 Linearized ADMM [23]

where d,c>0d,c>0. The matrix L{\mathcal{L}} is the oriented Laplacian matrix chosen such that the kk-th block of LWi{\mathcal{L}}{\scriptstyle{\mathcal{W}}}_{i} is equal to sNkwk,iws,i\sum_{s\in{\mathcal{N}}_{k}}w_{k,i}-w_{s,i}. Recursion (33) is equivalent to (28) with B2{\mathcal{B}}^{2} replaced by L{\mathcal{L}}, μλ=ρ=c\mu_{\lambda}=\rho=c, and μw=1/d\mu_{w}=1/d.

Based on the previous derivations, one can rewrite problem (26) more generally as

where C{\mathcal{C}} and Cˉ\bar{{\mathcal{C}}} are general consensus matrices satisfying CW=0{\mathcal{C}}{\scriptstyle{\mathcal{W}}}=0 if, and only, if CˉW=0\bar{{\mathcal{C}}}{\scriptstyle{\mathcal{W}}}=0 if, and only, if w1==wKw_{1}=\cdots=w_{K}. Various algorithms can be derived by proper choices of C{\mathcal{C}} and Cˉ\bar{{\mathcal{C}}} and using more general primal-dual algorithms. For works focusing on unifying distributed algorithms, we refer interested readers to . \Box

We notice that most state-of-the-art algorithms are based on augmented Lagrangian formulations (i.e., they require ρ\rho to be strictly positive). However, it is unclear whether the AL term is always beneficial. Unlike previous works, we reveal the influence of AL penalty term on convergence rate of distributed algorithms compared to the classical Lagrangian case (ρ=0\rho=0). \Box

V-B AL Penalty Term Influence

To reveal the influence of the AL penalty parameter on the performance of distributed algorithms, we study the linear convergence properties of (28a)–(28b). To do that, we let (W,λb)({\scriptstyle{\mathcal{W}}}^{\star},\lambda^{\star}_{b}) be the point satisfying the optimality conditions of problem (26) where λb\lambda^{\star}_{b} lies in the range space of B{\mathcal{B}}. First, we recall the following result from [34, Proposition 3.6].

Let ρ>0\rho>0. If each cost Jk(w)J_{k}(w) is convex and δ\delta-smooth, and the aggregate cost 1Kk=1KJk(w){1\over K}\sum_{k=1}^{K}J_{k}(w) is βˉ\bar{\beta}-strongly convex, then the penalized augmented cost J(W)+ρ2WB22{\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{{\mathcal{B}}^{2}}^{2} is νρ\nu_{\rho}-strongly-convex with respect to W{\scriptstyle{\mathcal{W}}}^{\star} where

and νρβˉ\nu_{\rho}\rightarrow\bar{\beta} as ρ\rho\rightarrow\infty. \Box

Assume that each cost Jk(w)J_{k}(w) is convex and δ\delta-smooth and let Assumption (2) hold. Then, the following result holds:

If ρ>0\rho>0, the aggregate cost 1Kk=1KJk(w){1\over K}\sum_{k=1}^{K}J_{k}(w) is βˉ\bar{\beta}-strongly convex, and μw<1δρ,μλνρσmax2(B)\mu_{w}<{1\over\delta_{\rho}},\quad\mu_{\lambda}\leq{\nu_{\rho}\over\sigma^{2}_{\max}({\mathcal{B}})}, then recursion (28a)–(28b) with Y1=0{\scriptstyle{\mathcal{Y}}}_{-1}=0 converges linearly and the convergence rate is upper bounded by:

where δρ=δ+ρσmax2(B)\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}({\mathcal{B}}) and νρ>0\nu_{\rho}>0 is defined in (35).

If ρ=0\rho=0, each cost Jk(w)J_{k}(w) is βk\beta_{k}-strongly-convex, and μw<1δ,μλν0σmax2(B)\mu_{w}<{1\over\delta},\quad\mu_{\lambda}\leq{\nu_{0}\over\sigma^{2}_{\max}({\mathcal{B}})}, then recursion (28a)–(28b) with Y1=0{\scriptstyle{\mathcal{Y}}}_{-1}=0 converges linearly and the convergence rate is upper bounded by:

From the step-size conditions in Corollary 2, the convergence rates γL\gamma_{L} and γAL\gamma_{AL} have the form

for some 0<c<10<c<1 where κL  =Δ  δ/ν0\kappa_{L}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta/\nu_{0} and κAL  =Δ  δρ/νρ\kappa_{AL}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta_{\rho}/\nu_{\rho} are the condition numbers of J(W){\mathcal{J}}({\scriptstyle{\mathcal{W}}}) and Jρ(W){\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}}). Note that νρβˉ\nu_{\rho}\approx\bar{\beta} (for large enough ρ\rho). If the condition number of the aggregate cost is much smaller than the condition number of the individual costs (e.g., βˉ>>minkβk\bar{\beta}>>\min_{k}\beta_{k}), then the AL method will have faster convergence rate since κAL<κL\kappa_{AL}<\kappa_{L}, consequently γAL<γL<1\gamma_{AL}<\gamma_{L}<1. However, when the individual costs are well conditioned (e.g, βkβˉ\beta_{k}\approx\bar{\beta}), then κALκL\kappa_{AL}\approx\kappa_{L} and the AL penalty term is not that beneficial. Moreover, for large ρ\rho we can have κAL>κL\kappa_{AL}>\kappa_{L}; hence γL<γAL\gamma_{L}<\gamma_{AL} and AL term slows down the convergence rate.

VI Simulation

The matrix RkR_{k} is a randomly generated diagonal matrix with integer diagonal entries, each chosen between $.Inthiscase,each. In this case, eachJ_{k}(w)iswellconditionedbecauseis well conditioned because8/6isnotverylarge.TheresultforthisscenarioisshownontheleftplotofFig.1.Inallresults,PDdistributedrefersto(28)(withis not very large. The result for this scenario is shown on the left plot of Fig. 1. In all results, PD distributed refers to (28) (with\rho=0)andALPDdistributedrefersto(28)with) and AL PD distributed refers to (28) with\rho>0,EXTRAalgorithmfrom,andexactdiffusionfrom.Thestepsizesaremanuallychosentogetthebestpossibleconvergencerateforeachalgorithm.Wenoticethatforthiscase,increasing, EXTRA algorithm from , and exact diffusion from . The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We notice that for this case, increasing\rhodecreasestheperformancecomparedtothecasedecreases the performance compared to the case\rho=0.Inthisscenario,wedonotseeanyadvantagesofALmethodscomparedtotheLagrangianmethod(. In this scenario, we do not see any advantages of AL methods compared to the Lagrangian method (\rho=0)duetothereasonsmentionedintheprevioussection.NotethatEXTRA(29)andexactdiffusion(30)convergesslowersincetheyrequire) due to the reasons mentioned in the previous section. Note that EXTRA (29) and exact diffusion (30) converges slower since they require\rho=1/2\mu,whichcannotbetweakedindependentlyfromthestepsize, which cannot be tweaked independently from the step-size\mu$.

We now construct RkR_{k} so that the local costs become ill-conditioned. To do that, we let RkR_{k} to be a diagonal matrix where the (k,k)(k,k)-th diagonal entry for each agent (Rk(k,k)R_{k}(k,k)) are chosen randomly between $andtheotherdiagonalentriesarechosenuniformlybetweenand the other diagonal entries are chosen uniformly between(0,1).Inthiscase,theratioofthelargestdiagonalentryandthesmallestcanbeverylargemakingeach. In this case, the ratio of the largest diagonal entry and the smallest can be very large making eachJ_{k}(w)illconditioned.However,theaggregatecostill-conditioned. However, the aggregate cost{1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w)isbetterconditionedcomparedtotheindividualcosts.Thisisbecausefromourconstruction,theconditionnumberofis better conditioned compared to the individual costs. This is because from our construction, the condition number ofR=\sum_{k=1}^{K}R_{k}issmallerthantheconditionnumberofis smaller than the condition number ofR_{k}$. The left plot of Fig. 2 shows the result for this case. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. In this case, we see that the Lagrangian method performs poorly compared to AL PD method, EXTRA, and Exact diffusion.

We now consider the case where the individual costs Jk(w)J_{k}(w) are non-convex but the aggregate cost k=1KJk(w)\sum_{k=1}^{K}J_{k}(w) is strongly-convex. To do that, we let RkR_{k} be a diagonal matrix with the (k,k)(k,k)-th diagonal entry for each agent, Rk(k,k)R_{k}(k,k), chosen randomly between $,theentries, the entriesR_{k}(k-1,k-1)=-{R_{k-1}(k-1,k-1)/2}forallfor allk\geq 2.Inthiscase,theindividualcosts. In this case, the individual costs\{J_{k}(w)\}_{k\geq 2}arenonconvexsincetheyhavenegativediagonalentries.However,theaggregatecostare non-convex since they have negative diagonal entries. However, the aggregate cost{1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w)isstronglyconvexsinceis strongly convex sinceR=\sum_{k=1}^{K}R_{k}ispositivedefinitefromconstruction.TheresultofthissetupisshownintherightplotofFig.2.Thestepsizesaremanuallychosentogetthebestpossibleconvergencerateforeachalgorithm.WeseethattheALbasedmethodsstillconvergelinearly.However,thePDdistributedmethoddivergesevenundersmallstepsizes.Thisisbecausethecostis positive-definite from construction. The result of this set-up is shown in the right plot of Fig. 2. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We see that the AL based methods still converge linearly. However, the PD distributed method diverges even under small step-sizes. This is because the cost{\mathcal{J}}({\scriptstyle{\mathcal{W}}})=\sum_{k=1}^{K}(w_{k}^{\mathsf{T}}R_{k}w_{k}+r_{k}^{\mathsf{T}}w_{k})isnonconvexsincetheHessianis non-convex since the Hessian{\nabla}^{2}{\mathcal{J}}({\scriptstyle{\mathcal{W}}})={\rm blkdiag}\{R_{k}\}_{k=1}^{K}isindefinite.Incontrast,thecostis indefinite. In contrast, the cost{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})isstronglyconvexforlargeis strongly-convex for large\rho$.

VII Concluding Remarks

In this work, we studied the linear convergence of the classical incremental primal-dual gradient algorithm (4). We provided an original proof that is applicable to both the Lagrangian and augmented Lagrangian implementations. Moreover, we proved the linear convergence of the non-incremental implementation (19b) by relating it to the incremental one. Finally, we studied algorithm (4) in distributed multi-agent optimization problems. The effect of the AL term on the performance of distributed algorithms is illustrated in theory and validated by means of simulation.

References

Appendix A Proof of Corollary 1

For η=0\eta=0, we know from Lemma 3 that recursion (19b) is equivalent to the incremental implementation with ρ=η=μλ\rho=\eta=-\mu_{\lambda}, namely,

where J(w)=Jμλ(w)=J(w)μλ2Bwb2J^{\prime}(w)=J_{-\mu_{\lambda}}(w)=J(w)-{\mu_{\lambda}\over 2}\|Bw-b\|^{2}. The above recursion is exactly (4) with ρ=0\rho=0 and cost J(w)J^{\prime}(w) instead of J(w)J(w). Therefore, its analysis follows from Theorem 1 as long as J(w)J^{\prime}(w) is δ\delta^{\prime}-smooth and ν\nu^{\prime}-strongly-convex for some δν>0\delta^{\prime}\geq\nu^{\prime}>0. It holds that:

where the first inequality holds from Cauchy-Schwartz and B(w1w2)2σmin2(B)w1w22\|B(w_{1}-w_{2})\|^{2}\geq\sigma^{2}_{\min}(B)\|w_{1}-w_{2}\|^{2}. The last inequality holds since J(w)J(w) is δ\delta-smooth. The above inequality is equivalent to the cost Jμλ(w)J_{-\mu_{\lambda}}(w) being δ=δμλσmin2(B)\delta^{\prime}=\delta-\mu_{\lambda}\sigma^{2}_{\min}(B) smooth – see [31, Theorem 2.1.5]. Moreover, from strong-convexity condition (5), it also holds that

Hence, the cost J(w)=Jμλ(w)J^{\prime}(w)=J_{-\mu_{\lambda}}(w) is ν=νμλσmax2(B)>0\nu^{\prime}=\nu-\mu_{\lambda}\sigma^{2}_{\max}(B)>0 strongly-convex if μλ<ν/σmax2(B)\mu_{\lambda}<\nu/\sigma^{2}_{\max}(B). By replacing δ\delta and ν\nu with δ\delta^{\prime} and ν\nu^{\prime} in (9) and setting ρ=0\rho=0 we get conditions (21).

Appendix B Proof of Corollary 2