Linear Convergence of Primal-Dual Gradient Methods and their Performance in Distributed Optimization

Sulaiman A. Alghunaim, Ali H. Sayed

I Introduction

Consider the constrained optimization problem:

is the augmented Lagrangian of problem (1), $\lambda$ is a dual variable, and $\rho\geq 0$ is the augmented Lagrangian penalty parameter. Note that for $\rho=0$ , $L_{0}(w,\lambda)$ becomes the classical Lagrangian of problem (1). If a point $(w^{\star},\lambda^{\star})$ exists that solves (2), then $w^{\star}$ is an optimal solution to the constrained problem when strong duality holds, which is the case under our assumptions . A classical algorithm that solves (2) is the primal-dual (PD) gradient algorithm (4). In this algorithm, ${\nabla}J_{\rho}(w)$ denotes the gradient of $J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2}$ evaluated at $w$ and $(\mu_{w},\mu_{\lambda})$ are positive step-sizes (learning rates) chosen by the designer. The updates in (4) are primal-descent dual-ascent steps applied to (3) and it subsume the classical Lagrangian implementation when $\rho=0$ and the augmented Lagrangian implementation when $\rho>0$ . Note that the updates in (4) are incremental since the dual update (4b) uses the most recent primal variable $w_{i}$ and not $w_{i-1}$ . If the dual update uses the previous primal iterate $w_{i-1}$ , then we refer to the update as non-incremental.

This work provides a concise proof that establishes the linear convergence of recursion (4) and studies its relation to the non-incremental implementation. We also study the effect of the penalty term ${\rho\over 2}\|Bw-b\|^{2}$ on the performance of multi-agent consensus optimization algorithms. Algorithms of the form (4) have been applied in various applications including wireless systems , power systems , reinforcement learning , and network utility maximization .

There exists a large body of literature on primal-dual saddle-point algorithms – see and the references therein, including the seminal work , which proposed recursions of the type (4) and established their convergence. These works focus on proving convergence to an optimal solution without providing convergence rates, provide sub-linear convergence rates (e.g., ${1\over i}$ where $i$ is the iteration index), or show linear convergence from a starting point that is sufficiently close to a solution (local convergence). Some other works examined global linear convergence under different settings.

The works focuses on continuous versions of the primal-dual gradient dynamics and establish linear convergence for augmented Lagrangian implementations (i.e., they require the presence of the augmented Lagrangian term $\rho/2\|Bw-b\|^{2}$ , where $\rho$ is strictly positive). They also require $B$ to have full row rank. Similarly, the work establishes linear convergence for continuous primal-dual gradient dynamics for full row rank $B$ , but it does not require the presence of the augmented Lagrangian term. Moreover, it was shown in that if the continuous dynamics is discretized using Euler discretization, then the discrete version converges linearly under small enough step sizes. However, no upper bound is given on the step-sizes. Moreover, Euler discretization uses identical step-sizes for the primal and dual updates (i.e., $\mu_{w}=\mu_{\lambda}$ ) and results in a non-incremental primal-dual dynamics. Therefore, the results in are not directly applicable to the discrete incremental implementation (4) and do not provide clear bounds on the step-sizes.

We remark that linear convergence for various monotone operator methods have been established albeit under other conditions that are not satisfied in our setup. For example, the linear convergence results in and [18, Proposition 25.9] for forward-backward splitting methods would require the saddle-point problem (2) to be both strongly-convex with respect to $w$ and strongly-concave with respect to $\lambda$ . This holds for example for problems with Lagrangian $L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda)$ where $J(w)$ and $g(\lambda)$ are both strongly-convex functions. Similarly, the conditions used in require the saddle-point problem (2) to be strongly-convex with respect to $w$ and strongly-concave with respect to $\lambda$ . In our setup, $L_{\rho}(w,\lambda)$ is not strongly-concave with respect to $\lambda$ .

The work showed that for saddle point problems with $L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda)$ , linear convergence is possible without requiring the Lagrangian to be both strongly-convex and strongly-concave. In particular, it established linear convergence when the primal function $J(w)$ is smooth and convex, the dual function $-g(\lambda)$ is smooth and strongly-concave, and the additional assumption that $B$ is a full column rank matrix. Unlike the current work, the algorithm analyzed in is non-incremental; moreover, particular fixed step-sizes are needed to establish linear convergence – [22, Theorem 3.1].

Now, in the distributed optimization literature, various incremental primal-dual gradient algorithms have been proposed to solve multi-agent consensus optimization problems – see and references therein, which are mostly based on AL formulations. They have been shown to achieve linear convergence under strong-convexity even though the consensus constraint matrix is not full rank. However, the analysis techniques used to establish the convergence of these methods either depend on the particular consensus constraint matrix and/or require the AL term to be strictly positive. Unlike these works, our analysis does not require $\rho$ to be strictly positive. Moreover, due to our unified Lagrangian and AL framework, we clarify the effect of the AL penalty term on the performance of these types of distributed algorithms. Note that the work studied non-incremental primal-dual methods with identical step-sizes for quadratic distributed optimization. It was found in that unlike AL methods, Lagrangian methods suffer from stability issues when the individual costs are not strongly-convex. Unlike , we study the affect of the AL penalty on the convergence rate of distributed algorithms.

I-B Contribution

Given the above, this work has two main contributions: I) Through an original proof, we establish the linear convergence of the incremental implementation (4). Moreover, we show how the non-incremental implementation is related to the incremental one and establish its linear convergence while providing explicit upper bounds on the step-sizes. Our proof technique does not require the AL parameter to be strictly positive nor do we require $B$ to have full row rank. II) We show the effect of the AL penalty term on the performance of distributed multi-agent optimization algorithms. Depending on the condition number of the agents’ costs, we provide scenarios where the AL term is beneficial and other scenarios where it is not beneficial.

II Auxiliary Results

This section gives the auxiliary results leading to the main convergence result. We start with the following condition on the cost function.

(Cost function): It is assumed that a unique solution $w^{\star}$ exists for problem (1) and the cost function $J(w)$ is convex. It is also assumed that $J(w)$ is $\delta$ -smooth, consequently, $J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2}$ is $\delta_{\rho}$ -smooth with $\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}(B)$ . Moreover, the cost $J_{\rho}(w)$ is $\nu_{\rho}$ -strongly-convex with respect to $w^{\star}$ , namely,

The scalars satisfy $0<\nu_{\rho}\leq\delta_{\rho}$ for any $\rho\geq 0$ . $\Box$

If $J(w)$ is $\nu$ -strongly-convex, then $w^{\star}$ is unique [1, Example 5.4] and condition (5) will be satisfied with $\nu_{\rho}=\nu$ . We remark that condition (5) does not necessarily imply that $J(w)$ is strongly-convex w.r.t. $w^{\star}$ unless $\rho=0$ . This condition is used instead of typical strong-convexity to be consistent with the conditions used to study the effect of the augmented Lagrangian term on the performance of distributed algorithms in Section V. $\Box$

It is known that a pair $(w^{\star},\lambda^{\star})$ is an optimal solution to (2) if, and only if, it satisfies the optimality conditions :

From (6a) and uniqueness of $w^{\star}$ , $\lambda^{\star}$ will be unique if $B$ has full row rank. In general $\lambda^{\star}$ is not necessarily unique. Motivated by , we will characterize a particular dual solution that we later show convergence to. For that result and later analysis, we need the following result.

The result follows since $x^{\mathsf{T}}U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}x=\|\lambda_{x}\|^{2}$ . The inequality follows since $\underline{\sigma}^{2}(B)$ is the smallest eigenvalue (or diagonal entry) of $\Sigma_{r}$ – see [1, Appendix A.5.2]. ∎

(Particular dual $\lambda^{\star}_{b}$ ): There exists a unique optimal dual variable, denoted by $\lambda^{\star}_{b}$ , lying in the range space of $B$ .

The argument is motivated by . Any solution $\lambda^{\star}$ of the linear system of equations given in (6a) can be decomposed into two parts $\lambda^{\star}=\lambda^{\star}_{b}+\lambda_{n}^{\star}$ , where $\lambda^{\star}_{b}\in{\rm Range}(B)$ and $\lambda_{n}^{\star}\in{\rm Null}(B^{\mathsf{T}})$ – see . Therefore, if $(w^{\star},\lambda^{\star})$ satisfies (6), then $(w^{\star},\lambda^{\star}_{b})$ also satisfies (6). We now show $\lambda^{\star}_{b}$ is unique by contradiction. Assume we have two distinct dual solutions $\lambda^{\star}_{b_{1}}=Bx_{1}$ and $\lambda^{\star}_{b_{2}}=Bx_{2}$ lying in the range space of $B$ . Then, substituting into (6a) and subtracting, we get $B^{\mathsf{T}}B(x_{1}-x_{2})=0$ . It follows that $\|B(x_{1}-x_{2})\|^{2}=0$ and, consequently, $B(x_{1}-x_{2})=0$ . This means that $\lambda^{\star}_{b_{1}}=Bx_{1}=Bx_{2}=\lambda^{\star}_{b_{2}}$ , which is a contradiction. ∎

Note that if $\lambda_{i-1}$ belongs to the range space of $B$ (i.e., $\lambda_{i-1}=Bx$ for some $x$ ) or $\lambda_{i-1}=0$ , then from $b=Bw^{\star}$ and (4b) we know that $\lambda_{i}=\lambda_{i-1}+\mu_{\lambda}(Bw_{i}-b)=B\big{(}x+\mu_{\lambda}(w_{i}-w^{\star})\big{)}$ will remain in the range space of $B$ . Thus, $\{\lambda_{i}\}_{i\geq 0}$ will always remain in the range space of $B$ if $\lambda_{-1}$ belongs to the range space of $B$ or $\lambda_{-1}=0$ . This observation will allow us to utilize the bound (8) to establish linear convergence to the particular saddle-point $(w^{\star},\lambda^{\star}_{b})$ without requiring a rank condition on the matrix $B$ .

III LINEAR CONVERGENCE RESULT

We are now ready to establish our main result. Let $\widetilde{w}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;w_{i}-w^{\star}$ and $\widetilde{\lambda}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda_{i}-\lambda^{\star}_{b}$ denote the primal and dual errors, respectively.

(Linear convergence): Let Assumption 1 holds and assume the step-sizes are positive and satisfy:

If $\lambda_{-1}=0$ , then algorithm (4) converges linearly to the particular saddle-point $(w^{\star},\lambda^{\star}_{b})$ , namely, it holds that

where $c_{\lambda}>0$ , $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ , and

Subtracting $w^{\star}$ and $\lambda^{\star}_{b}$ from both sides of (4) and using the optimality conditions (6) we get the coupled error recursion:

Squaring both sides of (11a) and (11b) we get

Using the bound $\|B\widetilde{w}_{i}\|^{2}\leq\sigma^{2}_{\max}(B)\|\widetilde{w}_{i}\|^{2}$ , multiplying equation (13) by $c_{\lambda}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\mu_{w}/\mu_{\lambda}$ and adding to (12) gives:

where $c_{w}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)$ . Note that from Lemma 2, $\lambda^{\star}_{b}$ lies in the range space of $B$ . Moreover, since $\lambda_{-1}=0$ , then we know that $\widetilde{\lambda}_{i}$ will always lie in the range space of $B$ . Thus, from (7) it holds that $\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\geq\underline{\sigma}^{2}(B)\|\widetilde{\lambda}_{i-1}\|^{2}$ . Using this bound in (14), we get:

Since $J_{\rho}(w)$ is $\delta_{\rho}$ -smooth, it holds that [31, Theorem 2.1.5]:

for $\mu<2/\delta_{\rho}$ . This follows directly by expanding the square and using the bounds (5) and (16). Let $\gamma_{1}=1-\mu_{w}\nu_{\rho}(1-\mu_{w}\delta_{\rho})$ . Since $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)$ , it holds that:

where the last step we used the fact that the second term is non-positive under the conditions $\mu_{w}<{1\over\delta_{\rho}}$ and $\mu_{\lambda}\leq\nu_{\rho}/\sigma^{2}_{\max}(B)$ . We conclude that equation (10) holds by using the previous two equations in (15). Note that for positive step-sizes it holds that $c_{\lambda}={\mu_{w}\over\mu_{\lambda}}>0$ . Moreover, $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ and $0<1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)<1$ if $\mu_{w}\mu_{\lambda}<{1\over\sigma^{2}_{\max}(B)}$ . This condition is satisfied under condition (9) because under these conditions we have

where the last inequality hold because $\nu_{\rho}\leq\delta_{\rho}$ . ∎

Theorem 1 shows that under conditions (9), the incremental algorithm (4) converges linearly. We will show how to utilize this result to establish the linear convergence of the classical non-incremental (Arrow-Hurwicz) method .

IV Non-incremental PD Gradient Method

Consider the non-incremental update (Arrow-Hurwicz):

where $J_{\eta}(w)\;\stackrel{{\scriptstyle\Delta}}{{=}}\;J(w)+{\eta\over 2}\|Bw-b\|^{2}$ and $\eta\geq 0$ . Different from (4), recursion (19b) uses $w_{i-1}$ in the dual update instead of $w_{i}$ . We will see that these two different implementations are equivalent for particular choices of $\eta$ and $\rho$ .

(Equivalence of (4) and (19b)) The primal iterates of the non-incremental recursion (19b) are equivalent to the primal iterates of the incremental recursion (4) if $\eta=\rho+\mu_{\lambda}$ and $\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b)$ .

Let $\eta=\rho+\mu_{\lambda}$ . It holds that $J_{\eta}(w)=J_{\rho}(w)+{\mu_{\lambda}\over 2}\|Bw-b\|^{2}$ so that ${\nabla}J_{\eta}(w)={\nabla}J_{\rho}(w)+\mu_{\lambda}B^{\mathsf{T}}(Bw-b)$ . Thus, for $\eta=\rho+\mu_{\lambda}$ step (19a) can be rewritten as:

where we introduced the change of variable $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b)$ . Adding $\mu_{\lambda}(Bw_{i}-b)$ to both sides of (19b) and using $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b)$ , we can directly rewrite (19b) as in (4b). Thus, the primal iterates of recursion (19b) are equivalent to the primal iterates of recursion (4) if $\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b)$ . ∎

Lemma 3 implies that the non-incremental implementation (19b) is an instance of the incremental implementation with $\rho=\eta-\mu_{\lambda}$ . Recall that in algorithm (4) we assume that $\rho\geq 0$ . Therefore, if $\eta=\rho+\mu_{\lambda}\geq\mu_{\lambda}$ , the linear convergence of (19b) follows from Theorem 1 with $\rho=\eta-\mu_{\lambda}\geq 0$ . The case $0\leq\eta<\mu_{\lambda}$ implies that $\rho=\eta-\mu_{\lambda}<0$ . This case can also be analyzed using the exact same technique as in Theorem 1. To show that, it suffices to consider the classical case $\eta=0$ .

(Non-Incremental $\eta=0$ ) If the cost $J(w)$ is $\delta$ -smooth and $\nu$ -strongly-convex and the step-sizes satisfy:

Then, recursion (19b) with $\eta=0$ converges linearly to the optimal saddle-point if $\lambda^{\prime}_{-1}=0$ .

By relating recursion (19b) to (4), we are able to establish its linear convergence and provide explicit upper bounds on the step-sizes as well. The works and also established the linear convergence of the non-incremental recursion (19b) with $\eta=0$ . However, these works do not provide explicit upper bounds on the step-sizes or require particular fixed step-sizes to establish their result .

Assume $b=0$ and consider the forward-backward gradient algorithm :

By using a change of variable trick, the analysis of (22b) directly follows from Theorem 1 with $\rho=\mu_{\lambda}$ . In particular, by adding and subtracting $\mu_{w}\mu_{\lambda}B^{\mathsf{T}}Bw_{i-1}$ to the R.H.S. of (22a), letting $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}-\mu_{\lambda}Bw_{i}$ , and rearranging (22b), recursion (22b) can be equivalently written as recursion (4) ( $b=0$ ) with $\rho=\mu_{\lambda}$ . $\Box$

V Application: Distributed Optimization

In this section, we study the benefit of the AL penalty term for distributed consensus optimization problems.

Consider a network of $K$ agents that are connected through some network and interested in the following problem:

The network is static, undirected, and the matrix $A$ is assumed to be primitive, i.e., there exists some integer $p>0$ such that all entries of $A^{p}$ are positive. We also assume $A$ to be symmetric, and doubly stochastic. $\Box$

Then, it holds that ${\mathcal{B}}{\scriptstyle{\mathcal{W}}}=0$ if, and only, if $w_{k}=w_{s}\ \forall\ k,s$ – see . Thus, problem (23) is equivalent to the following constrained problem:

A direct application of (4) to problem (26) gives:

where ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})\;\stackrel{{\scriptstyle\Delta}}{{=}}\;{\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2}$ with $\rho\geq 0$ . Recursion (27) is not distributed yet because ${\mathcal{B}}$ need not have the network structure. However, this can be easily handled by a change of variable. Let ${\scriptstyle{\mathcal{Y}}}_{i}={\mathcal{B}}\lambda_{i}$ and multiply (27b) by ${\mathcal{B}}$ gives:

Since ${\mathcal{B}}^{2}=(I_{K}-A)\otimes I_{M}$ has the network structure, then the $k$ -th block of ${\mathcal{B}}^{2}{\scriptstyle{\mathcal{W}}}_{i}={\rm col}\{u_{k,i}\}_{k=1}^{K}$ has the distributed form $u_{k,i}=w_{k,i}-\sum_{s\in{\mathcal{N}}_{k}}a_{sk}w_{s,i}$ where ${\mathcal{N}}_{k}$ denotes the neighbors of agent $k$ , including agent $k$ . Therefore, recursion (28) is distributed and agent $k$ can locally update its corresponding $k$ -th blocks in ${\scriptstyle{\mathcal{W}}}_{i}$ and ${\scriptstyle{\mathcal{Y}}}_{i}$ .

Before we establish convergence of recursion (28) and show the influence of the AL penalty term on its performance, we show how the derivation of recursions (27) and (28) are related to some state of the art algorithms.

Note that the saddle point interpretation of EXTRA appeared in the work . If we choose $\mu_{w}=\mu$ , $\mu_{\lambda}={1\over 2\mu}$ , and $\rho={1\over 2\mu}$ in algorithm (27) we get:

where $\bar{{\mathcal{A}}}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;I-{1\over 2}{\mathcal{B}}^{2}={1\over 2}(I+{\mathcal{A}})$ and ${\mathcal{A}}=A\otimes I_{M}$ . By eliminating the dual-variable (see, e.g., ), the above algorithm can be shown to be equivalent to the EXTRA algorithm in , which requires communicating the primal variable once per iteration.

V-A2 Exact diffusion [26]

which differs from EXTRA (29) in the primal update where the gradient is also multiplied by $\bar{{\mathcal{A}}}$ . By eliminating the dual-variable, the above algorithm can be shown to be equivalent to the exact-diffusion algorithm from . Different from a traditional gradient primal-descent (29a) that was used to derive EXTRA, exact diffusion uses incremental gradient descent steps – see for details. Exact diffusion enjoys wider step-size $\mu$ stability range and better convergence performance compared to EXTRA – see .

It is worth mentioning that if we consider the penalized unconstrained problem $\min_{{\scriptstyle{\scalebox{0.5}{\mbox{$ \displaystyle\mathcal{W} $}}}}}\ {\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2}$ and apply two incremental gradient descent steps for the two terms in the penalized cost with step-size $\mu$ and $\rho={1\over\mu}$ , we arrive at:

which is the diffusion algorithm . The bias that arises from solving the penalized problem, rather than the original problem, can be corrected by employing exact diffusion .

V-A3 DIGing [25]

If we choose a different penalty function ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{I-{\mathcal{A}}^{2}}^{2}$ and set ${\mathcal{B}}^{2}\leftarrow{\mathcal{B}}$ in algorithm (27) with $\mu_{w}=\mu$ , $\mu_{\lambda}={1\over\mu}$ , and $\rho={1\over\mu}$ we get:

By eliminating the dual variable, this algorithm can be shown to be equivalent to DIGing – see [25, Section 2.2]. We see that the main difference from the EXTRA derivation is in the choice of the constraint and penalty matrices.

V-A4 Linearized ADMM [23]

where $d,c>0$ . The matrix ${\mathcal{L}}$ is the oriented Laplacian matrix chosen such that the $k$ -th block of ${\mathcal{L}}{\scriptstyle{\mathcal{W}}}_{i}$ is equal to $\sum_{s\in{\mathcal{N}}_{k}}w_{k,i}-w_{s,i}$ . Recursion (33) is equivalent to (28) with ${\mathcal{B}}^{2}$ replaced by ${\mathcal{L}}$ , $\mu_{\lambda}=\rho=c$ , and $\mu_{w}=1/d$ .

Based on the previous derivations, one can rewrite problem (26) more generally as

where ${\mathcal{C}}$ and $\bar{{\mathcal{C}}}$ are general consensus matrices satisfying ${\mathcal{C}}{\scriptstyle{\mathcal{W}}}=0$ if, and only, if $\bar{{\mathcal{C}}}{\scriptstyle{\mathcal{W}}}=0$ if, and only, if $w_{1}=\cdots=w_{K}$ . Various algorithms can be derived by proper choices of ${\mathcal{C}}$ and $\bar{{\mathcal{C}}}$ and using more general primal-dual algorithms. For works focusing on unifying distributed algorithms, we refer interested readers to . $\Box$

We notice that most state-of-the-art algorithms are based on augmented Lagrangian formulations (i.e., they require $\rho$ to be strictly positive). However, it is unclear whether the AL term is always beneficial. Unlike previous works, we reveal the influence of AL penalty term on convergence rate of distributed algorithms compared to the classical Lagrangian case ( $\rho=0$ ). $\Box$

V-B AL Penalty Term Influence

To reveal the influence of the AL penalty parameter on the performance of distributed algorithms, we study the linear convergence properties of (28a)–(28b). To do that, we let $({\scriptstyle{\mathcal{W}}}^{\star},\lambda^{\star}_{b})$ be the point satisfying the optimality conditions of problem (26) where $\lambda^{\star}_{b}$ lies in the range space of ${\mathcal{B}}$ . First, we recall the following result from [34, Proposition 3.6].

Let $\rho>0$ . If each cost $J_{k}(w)$ is convex and $\delta$ -smooth, and the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ is $\bar{\beta}$ -strongly convex, then the penalized augmented cost ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{{\mathcal{B}}^{2}}^{2}$ is $\nu_{\rho}$ -strongly-convex with respect to ${\scriptstyle{\mathcal{W}}}^{\star}$ where

and $\nu_{\rho}\rightarrow\bar{\beta}$ as $\rho\rightarrow\infty$ . $\Box$

Assume that each cost $J_{k}(w)$ is convex and $\delta$ -smooth and let Assumption (2) hold. Then, the following result holds:

If $\rho>0$ , the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ is $\bar{\beta}$ -strongly convex, and $\mu_{w}<{1\over\delta_{\rho}},\quad\mu_{\lambda}\leq{\nu_{\rho}\over\sigma^{2}_{\max}({\mathcal{B}})}$ , then recursion (28a)–(28b) with ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ converges linearly and the convergence rate is upper bounded by:

where $\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}({\mathcal{B}})$ and $\nu_{\rho}>0$ is defined in (35).

If $\rho=0$ , each cost $J_{k}(w)$ is $\beta_{k}$ -strongly-convex, and $\mu_{w}<{1\over\delta},\quad\mu_{\lambda}\leq{\nu_{0}\over\sigma^{2}_{\max}({\mathcal{B}})}$ , then recursion (28a)–(28b) with ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ converges linearly and the convergence rate is upper bounded by:

From the step-size conditions in Corollary 2, the convergence rates $\gamma_{L}$ and $\gamma_{AL}$ have the form

for some $0<c<1$ where $\kappa_{L}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta/\nu_{0}$ and $\kappa_{AL}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta_{\rho}/\nu_{\rho}$ are the condition numbers of ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})$ and ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})$ . Note that $\nu_{\rho}\approx\bar{\beta}$ (for large enough $\rho$ ). If the condition number of the aggregate cost is much smaller than the condition number of the individual costs (e.g., $\bar{\beta}>>\min_{k}\beta_{k}$ ), then the AL method will have faster convergence rate since $\kappa_{AL}<\kappa_{L}$ , consequently $\gamma_{AL}<\gamma_{L}<1$ . However, when the individual costs are well conditioned (e.g, $\beta_{k}\approx\bar{\beta}$ ), then $\kappa_{AL}\approx\kappa_{L}$ and the AL penalty term is not that beneficial. Moreover, for large $\rho$ we can have $\kappa_{AL}>\kappa_{L}$ ; hence $\gamma_{L}<\gamma_{AL}$ and AL term slows down the convergence rate.

VI Simulation

The matrix $R_{k}$ is a randomly generated diagonal matrix with integer diagonal entries, each chosen between $ $. In this case, each$ J_{k}(w) $is well conditioned because$ 8/6 $is not very large. The result for this scenario is shown on the left plot of Fig. 1. In all results, PD distributed refers to (28) (with$ \rho=0 $) and AL PD distributed refers to (28) with$ \rho>0 $, EXTRA algorithm from , and exact diffusion from . The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We notice that for this case, increasing$ \rho $decreases the performance compared to the case$ \rho=0 $. In this scenario, we do not see any advantages of AL methods compared to the Lagrangian method ($ \rho=0 $) due to the reasons mentioned in the previous section. Note that EXTRA (29) and exact diffusion (30) converges slower since they require$ \rho=1/2\mu $, which cannot be tweaked independently from the step-size$ \mu$.

We now construct $R_{k}$ so that the local costs become ill-conditioned. To do that, we let $R_{k}$ to be a diagonal matrix where the $(k,k)$ -th diagonal entry for each agent ( $R_{k}(k,k)$ ) are chosen randomly between $ $and the other diagonal entries are chosen uniformly between$ (0,1) $. In this case, the ratio of the largest diagonal entry and the smallest can be very large making each$ J_{k}(w) $ill-conditioned. However, the aggregate cost$ {1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w) $is better conditioned compared to the individual costs. This is because from our construction, the condition number of$ R=\sum_{k=1}^{K}R_{k} $is smaller than the condition number of$ R_{k}$. The left plot of Fig. 2 shows the result for this case. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. In this case, we see that the Lagrangian method performs poorly compared to AL PD method, EXTRA, and Exact diffusion.

We now consider the case where the individual costs $J_{k}(w)$ are non-convex but the aggregate cost $\sum_{k=1}^{K}J_{k}(w)$ is strongly-convex. To do that, we let $R_{k}$ be a diagonal matrix with the $(k,k)$ -th diagonal entry for each agent, $R_{k}(k,k)$ , chosen randomly between $ $, the entries$ R_{k}(k-1,k-1)=-{R_{k-1}(k-1,k-1)/2} $for all$ k\geq 2 $. In this case, the individual costs$ \{J_{k}(w)\}_{k\geq 2} $are non-convex since they have negative diagonal entries. However, the aggregate cost$ {1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w) $is strongly convex since$ R=\sum_{k=1}^{K}R_{k} $is positive-definite from construction. The result of this set-up is shown in the right plot of Fig. 2. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We see that the AL based methods still converge linearly. However, the PD distributed method diverges even under small step-sizes. This is because the cost$ {\mathcal{J}}({\scriptstyle{\mathcal{W}}})=\sum_{k=1}^{K}(w_{k}^{\mathsf{T}}R_{k}w_{k}+r_{k}^{\mathsf{T}}w_{k}) $is non-convex since the Hessian$ {\nabla}^{2}{\mathcal{J}}({\scriptstyle{\mathcal{W}}})={\rm blkdiag}\{R_{k}\}_{k=1}^{K} $is indefinite. In contrast, the cost$ {\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}}) $is strongly-convex for large$ \rho$.

VII Concluding Remarks

In this work, we studied the linear convergence of the classical incremental primal-dual gradient algorithm (4). We provided an original proof that is applicable to both the Lagrangian and augmented Lagrangian implementations. Moreover, we proved the linear convergence of the non-incremental implementation (19b) by relating it to the incremental one. Finally, we studied algorithm (4) in distributed multi-agent optimization problems. The effect of the AL term on the performance of distributed algorithms is illustrated in theory and validated by means of simulation.

References

Appendix A Proof of Corollary 1

For $\eta=0$ , we know from Lemma 3 that recursion (19b) is equivalent to the incremental implementation with $\rho=\eta=-\mu_{\lambda}$ , namely,

where $J^{\prime}(w)=J_{-\mu_{\lambda}}(w)=J(w)-{\mu_{\lambda}\over 2}\|Bw-b\|^{2}$ . The above recursion is exactly (4) with $\rho=0$ and cost $J^{\prime}(w)$ instead of $J(w)$ . Therefore, its analysis follows from Theorem 1 as long as $J^{\prime}(w)$ is $\delta^{\prime}-$ smooth and $\nu^{\prime}$ -strongly-convex for some $\delta^{\prime}\geq\nu^{\prime}>0$ . It holds that:

where the first inequality holds from Cauchy-Schwartz and $\|B(w_{1}-w_{2})\|^{2}\geq\sigma^{2}_{\min}(B)\|w_{1}-w_{2}\|^{2}$ . The last inequality holds since $J(w)$ is $\delta$ -smooth. The above inequality is equivalent to the cost $J_{-\mu_{\lambda}}(w)$ being $\delta^{\prime}=\delta-\mu_{\lambda}\sigma^{2}_{\min}(B)$ smooth – see [31, Theorem 2.1.5]. Moreover, from strong-convexity condition (5), it also holds that

Hence, the cost $J^{\prime}(w)=J_{-\mu_{\lambda}}(w)$ is $\nu^{\prime}=\nu-\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ strongly-convex if $\mu_{\lambda}<\nu/\sigma^{2}_{\max}(B)$ . By replacing $\delta$ and $\nu$ with $\delta^{\prime}$ and $\nu^{\prime}$ in (9) and setting $\rho=0$ we get conditions (21).