On the Linear Convergence of the Alternating Direction Method of Multipliers

Mingyi Hong, Zhi-Quan Luo

Introduction

Consider the problem of minimizing a separable nonsmooth convex function subject to linear equality constraints:

where each fkf_{k} is a nonsmooth convex function (possibly with extended values), x=(x1T,...,xKT)Tnx=(x_{1}^{T},...,x_{K}^{T})^{T}\in\Re^{n} is a partition of the optimization variable xx, X=k=1KXkX=\prod_{k=1}^{K}X_{k} is the feasible set for xx, and E=(E1,E2,...,EK)m×nE=(E_{1},E_{2},...,E_{K})\in\Re^{m\times n} is an appropriate partition of matrix EE (consistent with the partition of xx) and qmq\in\Re^{m} is a vector. Notice that the model (1.1) can easily accommodate general linear inequality constraints ExqEx\geq q by adding one extra block. In particular, we can introduce a slack variable xK+10x_{K+1}\geq 0 and rewrite the inequality constraint as Exxk+1=qEx-x_{k+1}=q. The constraint xK+10x_{K+1}\geq 0 can be enforced by adding a new convex component function fK+1(xK+1)=i+m(xK+1)f_{K+1}(x_{K+1})=i_{\Re^{m}_{+}}(x_{K+1}) to the objective function f(x)f(x), where i+m(xK+1)i_{\Re^{m}_{+}}(x_{K+1}) is the indicator function for the nonnegative orthant +m\Re^{m}_{+}

In this way, the inequality constrained problem with KK blocks is reformulated as an equivalent equality constrained convex minimization problem with K+1K+1 blocks.

where λ>0\lambda>0 is a penalty parameter. Clearly, this is a structured convex optimization problem of the form (1.1) with K=2K=2. If the variable xx is further constrained to be nonnegative, then the corresponding compressive sensing problem can be formulated as a three block (K=3K=3) convex separable optimization problem (1.1) by introducing a slack variable. Similarly, in the stable version of robust principal component analysis (PCA) , we are given an observation matrix Mm×nM\in\Re^{m\times n} which is a noise-corrupted sum of a low rank matrix LL and a sparse matrix SS. The goal is recover LL and SS by solving the following nonsmooth convex optimization problem

while the coupling linear constraint is given L+S+Z=ML+S+Z=M. In image processing applications where the low rank matrix LL is additionally constrained to be nonnegative, then the above problem can be reformulated as

where CC is a slack matrix variable of the same size as LL, and i+mn()i_{\Re^{mn}_{+}}(\cdot) is the indicator function for the nonnegative orthant +mn\Re_{+}^{mn}. In this case, the stable robust PCA problem is again in the form of (1.1). In particular, it has 4 block variables (L,S,Z,C)(L,S,Z,C) and the first three convex functions are the same as in (1.2), while the fourth convex function is given by f4(C)=i+mn(C)f_{4}(C)=i_{\Re^{mn}_{+}}(C). The coupling linear constraints are L+S+Z=M, LC=0L+S+Z=M,\ L-C=0. Other applications of the form (1.1) include the latent variable Gaussian graphical model selection problem, see .

A popular approach to solving the separable convex optimization problem (1.1) is to attach a Lagrange multiplier vector yy to the linear constraints Ex=qEx=q and add a quadratic penalty, thus obtaining an augmented Lagrangian function of the form

where ρ0\rho\geq 0 is a constant. The augmented dual function is given by

and the dual problem (equivalent to (1.1) under mild conditions) is

Moreover, if ρ>0\rho>0, then ExEx is constant over the set of minimizers of (1.4) (see Lemma 2.1 in Section 2). This implies that the dual function d(y)d(y) is differentiable with

where x(y)x(y) is a minimizer of (1.4). Given the differentiability of d(y)d(y), it is natural to consider the following dual ascent method to solve the primal problem (1.1)

where α>0\alpha>0 is a suitably chosen stepsize. Such a dual ascent strategy is well suited for structured convex optimization problems that are amenable to decomposition. For example, if the objective function ff is separable (i.e., of the form given in (1.1)) and if we select ρ=0\rho=0, then the minimization in (1.4) decomposes into KK independent minimizations whose solutions frequently can be obtained in a simple form. In addition, the iterations can be implemented in a manner that exploits the sparsity structure of the problem and, in certain network cases, achieve a high degree of parallelism. Popular choices for the ascent methods include (single) coordinate ascent (see ), gradient ascent (see ) and gradient projection . (See for additional references.)

For large scale optimization problems, it is numerically advantageous to select ρ>0\rho>0. Unfortunately, this also introduces variable coupling in the augmented Lagrangian (1.3), which makes the exact minimization step in (1.4) no longer decomposable across variable blocks even if ff has a separable structure. In this case, it is more economical to minimize (1.4) inexactly by updating the components of xx cyclically via the coordinate descent method. In particular, we can apply the Gauss-Seidel strategy to inexactly minimize (1.4), and then update the multiplier yy using an approximate optimal solution of (1.4) in a manner similar to (1.6). The resulting algorithm is called the Alternating Direction Method of Multipliers (ADMM) and is summarized as follows (see ). In the general context of sums of monotone operators, the work of describes a large family of splitting methods for K3K\geq 3 blocks which, when applied to the dual, result in similar but not identical methods to the ADMM algorithm (1.7) below.

Alternating Direction Method of Multipliers (ADMM) At each iteration r1r\geq 1, we first update the primal variable blocks in the Gauss-Seidel fashion and then update the dual multiplier using the updated primal variables: \left\{\begin{array}[]{l}\displaystyle x_{k}^{r+1}={\rm arg}\!\min_{x_{k}\in X_{k}}L(x_{1}^{r+1},...,x^{r+1}_{k-1},x_{k},x^{r}_{k+1},...,x^{r}_{K};y^{r}),\ k=1,2,...,K,\\[10.0pt] \displaystyle y^{r+1}=y^{r}+\alpha(q-Ex^{r+1})=y^{r}+\alpha\left(q-\sum_{k=1}^{K}E_{k}x_{k}^{r+1}\right),\end{array}\right. (1.7) where α>0\alpha>0 is the step size for the dual update.

Notice that if there is only one block (K=1K=1), then the ADMM reduces to the standard augmented Lagrangian method of multipliers for which the global convergence is well understood (see e.g., ). In particular, it is known that, under mild assumptions on the problem, this type of dual gradient ascent methods generate a sequence of iterates whose limit points must be optimal solutions of the original problem (see ). For the special case of ordinary network flow problems, it is further known that an associated sequence of dual iterates converges to an optimal solution of the dual (see ). The rate of convergence of dual ascent methods has been studied in the reference which showed that, under mild assumptions on the problem, the distance to the optimal dual solution set from any ymy\in\Re^{m} near the set is bounded above by the dual optimality ‘residual’ d(y)\|\nabla d(y)\|. By using this bound, it can be shown that a number of ascent methods, including coordinate ascent methods and a gradient projection method, converge at least linearly when applied to solve the dual problem (see ; also see for related analysis). (Throughout this paper, by ‘linear convergence’ we mean root–linear convergence (denoted by R-linear convergence) in the sense of Ortega and Rheinboldt .)

When there are two blocks (K=2K=2), the convergence of the ADMM was studied in the context of Douglas-Rachford splitting method for finding a zero of the sum of two maximal monotone operators. It is known that in this case every limit point of the iterates is an optimal solution of the problem. The recent work of have shown that, under some additional assumptions, the objective values generated by the ADMM algorithm and its accelerated version (which performs some additional line search steps for the dual update) converge at a rate of O(1/r)O(1/r) and O(1/r2)O(1/r^{2}) respectively. Moreover, if the objective function f(x)f(x) is strongly convex and the constraint matrix EE is row independent, then the ADMM is known to converge linearly to the unique minimizer of (1.1) . [One notable exception to the strong convexity requirement is in the special case of linear programming for which the ADMM is linearly convergent .] More recent convergence rate analysis of the ADMM still requires at least one of the component functions (f1f_{1} or f2f_{2}) to be strongly convex and have a Lipschitz continuous gradient. Under these and additional rank conditions on the constraint matrix EE, some linear convergence rate results can be obtained for a subset of primal and dual variables in the ADMM algorithm (or its variant); see . However, when there are more than two blocks involved (K3K\geq 3), the convergence (or the rate of convergence) of the ADMM method is unknown, and this has been a key open question for several decades. The recent work describes a list of novel applications of the ADMM with K3K\geq 3 and motivates strongly for the need to analyze the convergence of the ADMM in the multi-block case. The recent monograph contains more details of the history, convergence analysis and applications of the ADMM and related methods.

A main contribution of this paper is to establish the global (linear) convergence of the ADMM method for a class of convex objective functions involving any number of blocks (KK is arbitrary). The key requirement for the global (linear) convergence is the satisfaction of a certain error bound condition that is similar to that used in the analysis of . This error bound estimates the distance from an iterate to the optimal solution set in terms of a certain proximity residual. The class of objective functions that are known to satisfy this error bound condition include many of the compressive sensing applications, such as LASSO , Group LASSO or Sparse Group LASSO .

Technical Preliminaries

Let ff be a closed proper convex function in n\Re^{n}, let EE be an m×nm\times n matrix, let qq be a vector in m\Re^{m}. Let dom f{\rm dom}\ f denote the effective domain of ff and let int(dom f)\hbox{int}(\hbox{dom }f) denote the interior of dom f{\rm dom}\ f. We make the following standing assumptions regarding ff:

The global minimum of (1.1) is attained and so is its dual optimal value. The intersection Xint(dom f){xEx=q}X\cap\hbox{int}(\hbox{dom }f)\cap\{x\mid Ex=q\} is nonempty.

f=f1(x1)+f2(x2)++fK(xK)f=f_{1}(x_{1})+f_{2}(x_{2})+\cdots+f_{K}(x_{K}), with each fkf_{k} further decomposable as

where gkg_{k} and hkh_{k} are both convex and continuous over their domains, and AkA_{k}’s are some given matrices (not necessarily full column rank, and can be zero).

Each gkg_{k} is strictly convex and continuously differentiable on int(dom gk)\hbox{int}(\hbox{dom }g_{k}) with a uniform Lipschitz continuous gradient

Each hkh_{k} satisfies either one of the following conditions

The epigraph of hk(xk)h_{k}(x_{k}) is a polyhedral set.

hk(xk)=λkxk1+JwJxk,J2h_{k}(x_{k})=\lambda_{k}\|x_{k}\|_{1}+\sum_{J}w_{J}\|x_{k,J}\|_{2}, where xk=(,xk,J,)x_{k}=(\cdots,x_{k,J},\cdots) is a partition of xkx_{k} with JJ being the partition index.

Each hk(xk)h_{k}(x_{k}) is the sum of the functions described in the previous two items.

For any fixed and finite yy and ξ\xi, khk(xk)\sum_{k}h_{k}(x_{k}) is finite for all x{x:L(x;y)ξ}Xx\in\{x:L(x;y)\leq\xi\}\cap X.

Each submatrix EkE_{k} has full column rank.

The feasible sets XkX_{k}, k=1,,Kk=1,\cdots,K are compact polyhedral sets.

We have the following remarks regarding to the assumptions made.

Each fkf_{k} may only contain convex function hkh_{k}. That is, the strongly convex part gkg_{k} can be absent. Also, since the matrices AkA_{k}’s are not required to have full column rank, the overall objective function f(x)f(x) is not necessarily strongly convex. In fact, under Assumption A, the optimization problem (1.1) can still have multiple primal or dual optimal solutions. This makes the convergence (and rate of convergence) analysis of ADMM difficult.

Assumption (e)(e) does allow hk()h_{k}(\cdot) to be an indicator function, as in this case the set {xkkhk(xk)=}X\{x_{k}\mid\sum_{k}h_{k}(x_{k})=\infty\}\cap X is not a subset of {x:L(x;y)ξ}X\{x:L(x;y)\leq\xi\}\cap X for any given yy and ξ\xi.

Linear term of the form bk,xk\langle b_{k},x_{k}\rangle is already included in hkh_{k}, as its ephigraph is polyhedral. Moreover, from the assumption that XkX_{k} is polyhedral, the feasibility constraint xkXkx_{k}\in X_{k} can be absorbed into hkh_{k} by adding to it an indicator function iXk(xk){i}_{X_{k}}(x_{k}). To simplify notations, we will not explicitly write xkXkx_{k}\in X_{k} in the ADMM update (1.7) from now on.

Assumption (f) is made to ensure that the subproblems for each xkx_{k} is strongly convex. This assumption will be relaxed later when the subproblems are solved inexactly; see Section 4.1.

Assumption (g) requires the feasible set of the variables to be compact, which is needed to ensure that certain error bounds of the primal and dual problems of (1.1) hold. This assumption is usually satisfied in practical applications (e.g. the consensus problems) whenever a priori knowledge on the variable domain is available. This assumption can be further relaxed; see the discussion at the end of Section 3.

Under Assumption A, both the primal optimum and the dual optimum values of (1.1) are attained and are equal (i.e., the strong duality holds for (1.1)) so that

where dd^{*} is the optimal value of the dual of (1.1).

Under Assumption A, there may still be multiple optimal solutions for both the primal problem (1.1) and its dual problem. We first claim that the dual functional

is differentiable everywhere. Let X(y)X(y) denote the set of optimal solutions for (2.1).

For any ymy\in\Re^{m}, both ExEx and AkxkA_{k}x_{k}, k=1,2,...,Kk=1,2,...,K, are constant over X(y)X(y). Moreover, the dual function d(y)d(y) is differentiable everywhere and

where x(y)X(y)x(y)\in X(y) is any minimizer of (2.1).

Proof. Fix ymy\in\Re^{m}. We first show that ExEx is invariant over X(y)X(y). Suppose the contrary, so that there exist two optimal solutions xx and xx^{\prime} from X(y)X(y) with the property that ExExEx\neq Ex^{\prime}. Then, we have

Due to the convexity of L(x;y)L(x;y) with respect to the variable xx, the solution set X(y)X(y) must be convex, implying xˉ=(x+x)/2X(y){\bar{x}}=(x+x^{\prime})/2\in X(y). By the convexity of f(x)f(x), we have

Moreover, by the strict convexity of 2\|\cdot\|^{2} and the assumption ExExEx\neq Ex^{\prime}, we have

Multiplying this inequality by ρ/2\rho/2 and adding it to the previous inequality yields

This contradicts the definition d(y)=minxL(x;y)d(y)=\min_{x}L(x;y). Thus, ExEx is invariant over X(y)X(y). Notice that d(y)d(y) is a concave function and its subdifferential is given by

Since Ex(y)Ex(y) is invariant over X(y)X(y), the subdifferential d(y)\partial d(y) is a singleton. By Danskin’s Theorem, this implies that d(y)d(y) is differentiable and the gradient is given by d(y)=qEx(y)\nabla d(y)=q-Ex(y), for any x(y)X(y)x(y)\in X(y).

A similar argument (and using the strict convexity of gkg_{k}) shows that AkxkA_{k}x_{k} is also invariant over X(y)X(y). The proof is complete. Q.E.D.

By using Lemma 2.1, we show below a Lipschitz continuity property of d(y)\nabla d(y), for yy over any level set of dd.

Fix any scalar ηf\eta\leq f^{*} and let U={ ym  d(y)η }{\cal U}=\{\ y\in\Re^{m}\ |\ d(y)\geq\eta\ \}. Then there holds

Proof. Fix any yy and yy^{\prime} in U{\cal U}. Let x=x(y)x=x(y) and x=x(y)x^{\prime}=x(y^{\prime}) be two minimizers of L(x;y)L(x;y) and L(x;y)L(x;y^{\prime}) respectively. By convexity, we have

where zz and zz^{\prime} are some subgradient vectors in the subdifferential f(x)\partial f(x) and f(x)\partial f(x^{\prime}) respectively. Thus, we have

Upon rearranging terms and using the convexity property

Thus, ρE(xx)yy\rho\|E(x^{\prime}-x)\|\leq\|y^{\prime}-y\| which together with d(y)d(y)=E(xx)\nabla d(y^{\prime})-\nabla d(y)=E(x-x^{\prime}) (cf. Lemma 2.1) yields

To show the linear convergence of the ADMM method, we need certain local error bounds around the optimal solution set X(y)X(y) as well as around the dual optimal solution set YY^{*}. To describe these local error bounds, we first define the notion of a proximity operator. Let h:\mboxdom(h)h:\mbox{dom\,}(h)\mapsto\Re be a (possibly nonsmooth) convex function. For every x\mboxdom(h)x\in\mbox{dom\,}(h), the proximity operator of hh is defined as

Notice that if h(x)h(x) is the indicator function of a closed convex set XX, then

so the proximity operator is a generalization of the projection operator. In particular, it is known that the proximity operator satisfies the nonexpansiveness property:

The proximity operator can be used to characterize the optimality condition for a nonsmooth convex optimization problem. Suppose a convex function ff is decomposed as f(x)=g(Ax)+h(x)f(x)=g(Ax)+h(x) where gg is strongly convex and differentiable, hh is a convex (possibly nonsmooth) function, then we can define the proximal gradient of ff with respect to hh as

For the Lagrangian minimization problem (2.1) and under Assumption A, the work of suggests that the size of the proximal gradient

can be used to upper bound the distance to the optimal solution set X(y)X(y) of (2.1). Here

represent the nonsmooth and the smooth parts of f(x)f(x) respectively.

In our analysis of ADMM, we will also need an error bound for the dual function d(y)d(y). Notice that a ymy\in\Re^{m} solves (1.5) if and only if yy satisfies the system of nonlinear equations

This suggests that the norm of the ‘residual’ d(y)\|\nabla d(y)\| may be a good estimate of how close yy is from solving (1.5). The next lemma says if the nonsmooth part of fkf_{k} takes certain forms, then the distance to the primal and dual optimal solution sets can indeed be bounded.

If in addition XX is a polyhedral set, then there exists a positive scalar τ\tau and δ\delta such that the following error bound holds

Similarly, if assumption A-(g) also holds, then for any scalar ζ\zeta, there exist positive scalars δ\delta and τ\tau such that

Moreover, in both cases the constant τ\tau is independent of the choice of yy and xx.

Another new ingredient in Lemma 2.3 is the additional claim that the constants δ\delta, τ\tau are both independent of the choice of yy. This property follows directly from a similar property of Hoffman’s error bound (on which the error bounds of are based) for a feasible linear system P:={xAxb}P:=\{x\mid Ax\leq b\}:

where τ\tau is independent of bb. In fact, a careful checking of the proofs of shows that the corresponding error constants δ\delta and τ\tau for the augmented Lagrangian function L(x;y)L(x;y) can be indeed made independent of yy. We omit the proof of the first part of Lemma 2.3 for space consideration.

Under Assumption A(f), the augmented Lagrangian function L(x;y)L(x;y) (cf. (1.3)) is strongly convex with respect to each subvector xkx_{k}. As a result, each alternating minimization iteration of ADMM (1.7)

has a unique optimal solution. Thus the sequence of iterates {xr}\{x^{r}\} of the ADMM are well defined. The following lemma shows that the alternating minimization of the Lagrangian function gives a sufficient descent of the Lagrangian function value.

Suppose Assumptions A(b) and A(f) hold. Then fix any index rr, we have

where the constant γ>0\gamma>0 is independent of rr and yry^{r}.

Proof. By assumptions A(b) and A(f) , the augmented Lagrangian function

is strongly convex in each variable xkx_{k} and has a uniform modulus ρλmin(EkTEk)>0\rho\lambda_{\min}(E_{k}^{T}E_{k})>0. Here, the notation λmin()\lambda_{\min}(\cdot) denotes the smallest eigenvalue of a symmetric matrix. This implies that, for each kk, that

for all xx, where xˉk\bar{x}_{k} is the minimizer of minxkL(x;y)\min_{x_{k}}L(x;y) (when all other variables {xj}jk\{x_{j}\}_{j\neq k} are fixed).

Fix any index rr. For each k{1,...,K}k\in\{1,...,K\}, by ADMM (1.7), xkr+1x^{r+1}_{k} is the minimizer of L(x1r+1,...,xk1r+1,xk,xk+1r,xk+2r,...,xKr;yr)L(x^{r+1}_{1},...,x^{r+1}_{k-1},x_{k},x^{r}_{k+1},x^{r}_{k+2},...,x^{r}_{K};y^{r}). It follows from (2.7)

is independent of rr and yry^{r}. Summing this over kk, we obtain the sufficient decrease condition

Suppose assumptions A(b)—A(c) hold. Let {xr}\{x^{r}\} be generated by the ADMM algorithm (1.7). Then there exists some constant σ>0\sigma>0 ((independent of yry^{r})) such that

Proof. Fix any r1r\geq 1 and any 1kK1\leq k\leq K. According to the ADMM procedure (1.7), the variable xkx_{k} is updated as follows

The corresponding optimality condition can be written as

This further implies that the entire proximal gradient vector can be bounded by xr+1xr\|x^{r+1}-x^{r}\|:

Setting σ=(c+1)K\sigma=(c+1)\sqrt{K} (which is independent of yry^{r}) completes the proof. Q.E.D.

Linear Convergence of ADMM

Let dd^{*} denote the dual optimal value and {xr,yr}\{x^{r},y^{r}\} be the sequence generated by the ADMM method (1.7). Due to assumption A(a), dd^{*} also equals to the primal optimal value. Further we denote

which represents the gap from dual optimality at the rr-th iteration. The primal gap to optimality at iteration rr is defined as

Clearly, we have both Δdr0\Delta_{d}^{r}\geq 0 and Δpr0\Delta_{p}^{r}\geq 0 for all rr. To establish the linear convergence of ADMM, we need several lemmas to estimate the sizes of the primal and dual optimality gaps as well as their respective decrease.

Let X(yr)X(y^{r}) denote the set of optimal solutions for the following optimization problem

We first bound the sizes of the dual and primal optimality gaps.

Suppose assumptions A(a)—A(e) and A(g) hold. Then for any scalar δ>0\delta>0, there exists a positive scalar τ\tau^{\prime} such that

for any yrmy^{r}\in\Re^{m} with d(yr)δ\|\nabla d(y^{r})\|\leq\delta. Moreover, there exist positive scalars ζ\zeta and ζ\zeta^{{}^{\prime}} ((independent of yr)y^{r}) such that

where the second inequality follows from Lemma 2.2. Recall from the second part in Lemma 2.3 that there exists some τ\tau such that

Combining the above two inequalities yields

where τ=τ2/ρ\tau^{\prime}=\tau^{2}/\rho is a constant. This establishes the bound on the size of dual gap (3.3).

It remains to prove the bound on the primal gap (3.4). For notational simplicity, let us separate the smooth and nonsmooth part of the augmented Lagrangian as follows

Let xkr+1x^{r+1}_{k} denote the kk-th subvector of the primal vector xr+1x^{r+1}. From the way that the variables are updated (2.10), we have

where the gradient vector xkLˉ({xjkr+1},{xjr}j>k;yr)\nabla_{x_{k}}\bar{L}\left(\{x^{r+1}_{j\leq k}\},\{x^{r}_{j}\}_{j>k};y^{r}\right) can be explicitly expressed as

and the error vector ekre^{r}_{k} is defined by

Note that we can bound the norm of ekre^{r}_{k} as follows

where the constant c>0c>0 is independent of yry^{r}, and can take the same value as in (2.11).

Using (3.5), and by the definition of the proximity operator, we have the following

Summing over all k=1,,Kk=1,\cdots,K, we obtain

Using the above results, we can bound Δpr\Delta^{r}_{p} by

We then bound the decrease of the dual optimality gap.

Proof. The reduction of the optimality gap in the dual space can be bounded as follows:

where the last equality follows from the update of the dual variable yr1y^{r-1}, and the fact that xˉr1\bar{x}^{r-1} minimizes L(,yr1)L(\cdot,y^{r-1}). Q.E.D.

Lemma 3.2 implies that if qExrq-Ex^{r} is close to the true dual gradient d(yr)=qExˉr\nabla d(y^{r})=q-E{\bar{x}}^{r}, then the dual optimal gap is reduced after each ADMM iteration. However, since ADMM updates the primal variable by only one Gauss-Seidel sweep, the primal iterate xrx^{r} is not necessarily close the minimizer xˉr{\bar{x}}^{r} of L(x;yr)L(x;y^{r}). Thus, unlike the method of multipliers (for which xr=xˉrx^{r}={\bar{x}}^{r} for all rr), there is no guarantee that the dual optimality gap Δdr\Delta_{d}^{r} is indeed reduced after each iteration of ADMM.

Next we proceed to bound the decrease in the primal gap Δpr\Delta_{p}^{r}.

Suppose assumptions A(b) and A(f) hold. Then for each r1r\geq 1, we have

for some γ\gamma independent of yry^{r}.

By the update rule of yry^{r} (cf. (1.7)), we have

Recall from Lemma 2.4 that the alternating minimization of the Lagrangian function gives a sufficient descent. In particular, we have

for some γ>0\gamma>0 that is independent of rr and yry^{r}. Therefore, we have

Hence, we have the following bound on the reduction of primal optimality gap

where the last step is due to Lemma 3.2. Q.E.D.

Notice that when α=0\alpha=0 (i.e., no dual update in the ADMM algorithm), Lemma 3.3 reduces to the sufficient decrease estimate (2.6) in Lemma 2.4. When α>0\alpha>0, the primal optimality gap is not necessarily reduced after each ADMM iteration due to the positive term αExrq2\alpha\|Ex^{r}-q\|^{2} in (3.11). Thus, in general, we cannot guarantee a consistent decrease of either the dual optimality gap Δdr\Delta_{d}^{r} or the primal optimality gap Δpr\Delta_{p}^{r}. However, somewhat surprisingly, the sum of the primal and dual optimality gaps decreases for all rr, as long as the dual step size α\alpha is sufficiently small. This is used to establish the linear convergence of ADMM method.

Suppose the conditions in Assumption A hold. Then the sequence of iterates {(xr,yr)}\{(x^{r},y^{r})\} generated by the ADMM algorithm (1.7) converges linearly to an optimal primal-dual solution for (1.1), provided the stepsize α\alpha is sufficiently small. Moreover, the sequence of feasibility violation {Exrq}\{\|Ex^{r}-q\|\} also converges linearly.

Proof. We show by induction that the sum of optimality gaps Δdr+Δpr\Delta^{r}_{d}+\Delta^{r}_{p} is reduced after each ADMM iteration, as long as the stepsize α\alpha is chosen sufficiently small. For any r1r\geq 1, we denote

By induction, suppose Δdr1+Δpr1Δd0+Δp0\Delta^{r-1}_{d}+\Delta^{r-1}_{p}\leq\Delta^{0}_{d}+\Delta^{0}_{p} for some r1r\geq 1. Recall that each XkX_{k} is compact and that the indicator function iXk(xk)i_{X_{k}}(x_{k}) is included in hk(xk)h_{k}(x_{k}) (see the discussion after Assumption A), it follows that xrXx^{r}\in X, implying the boundedness of xrx^{r}. Thus, we obtain from Lemma 2.3 that

for some τ>0\tau>0 (independent of yry^{r}). To prove Theorem 3.1, we combine the two estimates (3.10) and (3.11) to obtain

Now we invoke (3.13) and Lemma 2.5 to lower bound xr+1xr\|x^{r+1}-x^{r}\|:

Substituting this bound into (3.14) yields

Thus, if we choose the stepsize α\alpha sufficiently small so that

which completes the induction. Moreover, the induction argument shows that if the stepsize α\alpha satisfies the condition (3.17), then the descent condition (3.16) holds for all r1r\geq 1.

We now show that the sum of optimality gaps Δdr+Δpr\Delta^{r}_{d}+\Delta^{r}_{p} in fact contracts geometrically after a finite number of ADMM iterations. By (3.19), for any δ>0\delta>0, there must exist a finite integer rˉ>0\bar{r}>0 such that for all rrˉr\geq\bar{r}, d(yr)δ\|\nabla d(y^{r})\|\leq\delta. Since Δdr\Delta_{d}^{r}, Δpr\Delta^{r}_{p} are nonnegative and bounded from above (see (3.18)), it follows that d(yr)d(y^{r}) is bounded from below by a constant ζ\zeta independent of rr. Applying the second part of Lemma 2.3, we have that for all rrˉr\geq\bar{r}, the dual error bound dist(yr,Y)τd(yr){\rm dist}(y^{r},Y^{*})\leq\tau\|\nabla d(y^{r})\| holds true.

Therefore, it follows from Lemma 3.1 that we have the following cost-to-go estimate

for some τ>0\tau^{\prime}>0 and for all rrˉr\geq\bar{r}.

Moreover, we can use Lemma 3.1 to bound xr+1xr2\|x^{r+1}-x^{r}\|^{2} from below by Δpr\Delta^{r}_{p}. In particular, we have from (3.15) and Lemma 3.1 that

Substituting this bound and (3.20) into (3.16), and assuming that α>0\alpha>0 satisfies (3.17), we obtain

Since α>0\alpha>0 is chosen small enough such that (3.17) holds, we have

This shows that the sequence {Δpr+Δdr}rrˉ\{\Delta_{p}^{r}+\Delta_{d}^{r}\}_{r\geq\bar{r}} converges to zero Q-linearlyA sequence {xr}\{x^{r}\} is said to converge QQ-linearly to some xˉ\bar{x} if xr+1xˉ/xrxˉμ\|x^{r+1}-\bar{x}\|/\|x^{r}-\bar{x}\|\leq\mu for all rr, where μ(0,1)\mu\in(0,1) is some constant. A sequence {xr}\{x^{r}\} is said to converge to xˉ\bar{x} RR-linearly if xrxˉcμr\|x^{r}-\bar{x}\|\leq c\mu^{r} for all rr and for some c>0c>0. . As a result, we conclude that {Δpr+Δdr}\{\Delta_{p}^{r}+\Delta_{d}^{r}\} and hence both Δpr\Delta_{p}^{r} and Δdr\Delta^{r}_{d} globally converge to zero R-linearlyTo see that such R-linear convergence is in fact global, note that rˉ>0\bar{r}>0 is finite, and Δpr+Δdr\Delta_{p}^{r}+\Delta_{d}^{r} is Q-linearly convergent for rrˉr\geq\bar{r}. Then one can always find an appropriate constant cc such that Δpr+Δdrc(1+λ)r\Delta_{p}^{r}+\Delta_{d}^{r}\leq c(1+\lambda)^{-r} for all r=1,2,.r=1,2,\ldots..

We next show that the dual sequence {yr}\{y^{r}\} is also R-linearly convergent. To this end, notice that the inequalities (3.15) and (3.16) imply

Then by (3.21), we see that both xrxˉr0\|x^{r}-{\bar{x}}^{r}\|\to 0 and Exˉrq0\|E{\bar{x}}^{r}-q\|\to 0 R-linearly. This implies that Exrq0Ex^{r}-q\to 0 R-linearly and d(yr)0\nabla d(y^{r})\to 0 R-linearly. Using the fact that dist(yr,Y)τd(yr){\rm dist}(y^{r},Y^{*})\leq\tau\|\nabla d(y^{r})\|, we conclude that yry^{r} converges R-linearly to an optimal dual solution.

We now argue that the primal iterates {xr}\{x^{r}\} converge to an optimal solution of (1.1). By the inequality (3.16), we can further conclude that

R-linearly. Notice that the R-linear convergence of xr+1xr20\|x^{r+1}-x^{r}\|^{2}\to 0 implies that xr+1xr0\|x^{r+1}-x^{r}\|\to 0 R-linearly. This further shows that xrxx^{r}\to x^{\infty} R-linearly for some xx^{\infty}. Denote the limit of dual sequence {yr}\{y^{r}\} by yy^{\infty}. By the preceding argument, we know yy^{\infty} is a dual optimal solution of (1.1). To show that xx^{\infty} is a primal optimal solution of (1.1), it suffices to prove that xX(y)x^{\infty}\in X(y^{\infty}). Using (3.15), and the fact that xrxˉr0\|x^{r}-\bar{x}^{r}\|\to 0, we have

Since xˉrX(yr)\bar{x}^{r}\in X(y^{r}), we have L(xˉr,yr)L(x,yr)L(\bar{x}^{r},y^{r})\leq L(x,y^{r}) for all xXx\in X. Passing limit, we obtain L(x,y)L(x,y)L({x}^{\infty},{y}^{\infty})\leq L(x,{y}^{\infty}) for all xXx\in X, that is, xX(y){x}^{\infty}\in X({y}^{\infty}). It then follows that the sequence {xr}\{x^{r}\} converges R-linearly to a primal optimal solution. Q.E.D.

The following corollary relaxes the compactness assumption of the feasible set XX (Assumption A-(g) in Theorem 3.1), and replaces it by the boundedness of the primal-dual iterates.

Suppose assumptions A(a)—A(f) hold, and that XX is a polyhedral set. Further assume that either one of the following two assumptions holds true

The sequence of dual iterates {yr}\{y^{r}\} lies in a compact set, and that the set {xL(x,y)ζ}\{x\mid L(x,y)\leq\zeta\} is compact for any finite yy and ζ\zeta.

The sequence of primal-dual iterates {(xr,yr)}\{(x^{r},y^{r})\} lies in a compact set.

Then if α\alpha is chosen sufficiently small ((cf. (3.17))), all the conclusions stated in Theorem 3.1 still hold true. Moreover, the sequence of function values {f(xr)}\{f(x^{r})\} also converges linearly.

Again we show by induction that the sum of optimality gaps Δdr+Δpr\Delta^{r}_{d}+\Delta^{r}_{p} is reduced after each ADMM iteration, as long as the stepsize α\alpha is chosen sufficiently small. For any r1r\geq 1, by induction we suppose Δdr1+Δpr1Δd0+Δp0\Delta^{r-1}_{d}+\Delta^{r-1}_{p}\leq\Delta^{0}_{d}+\Delta^{0}_{p}. Then xrδ<\|x^{r}\|\leq\delta<\infty, and by the first part of Lemma 2.3, the primal error bound holds. Then we can carry out exactly the same analysis as the proof of Theorem 3.1 to arrive at the same conclusion.

linearly, it follows that f(xr+1)d0f(x^{r+1})-d^{*}\to 0 R-linearly (recall that yry^{r} is now in a compact set). The proof is complete.

Similarly, when the second assumption is true, then the error bound condition in Lemma 2.3 again holds, and by using the same argument above we arrive at the desired conclusion. Q.E.D.

As a remark, we point out that the proof of Corollary 3.1 also shows that the same linear convergence of f(xr)df(x^{r})\to d^{*} also holds under the assumptions in Theorem 3.1.

which can be equivalently written as a KK-block problem

where xkx_{k}, aa and bb are now nn-dimensional vectors, and Ekm×nE_{k}\in\Re^{m\times n} is some matrix with full column rank.

Furthermore, the boundedness assumption in Corollary 3.1 is satisfied by many examples of the two-block ADMM described in . This is because when K=2K=2, α/β(0,12(1+5))\alpha/\beta\in(0,\frac{1}{2}(1+\sqrt{5})) and EkE_{k}’s are full column rank, it is known that both the primal and dual iterates generated by the two-block ADMM algorithm indeed lie in a bounded set . Therefore the second assumption made in the Corollary 3.1 holds true, hence we only require assumptions A(a)–A(f), which are in fact quite mild. For example, they are met by the following instance of the consensus problem (see [7, Section 7] for introduction of the consensus problem)

where w>0w>0 is some constant. Thus, the two block ADMM algorithm converges linearly for (3.25) regardless of the rank of AA. Note that when AA is full column rank, the objective is strongly convex. Consequently the error bound condition in Lemma 2.3 holds true globally, and the coefficient τ\tau can be at least greater than λmin(ATA)\lambda_{\rm min}(A^{T}A).

Variants of ADMM

The convergence analysis of Section 3 can be extended to some variants of the ADMM. We briefly describe two of them below.

In the original ADMM (1.7), each block xkx_{k} is updated by solving a convex optimization subproblem exactly. For large scale problems, this subproblem may not be easy to solve unless the matrix EkE_{k} is unitary (i.e., EkTEk=IE^{T}_{k}E_{k}=I) in which case the variables in xkx_{k} can be further decoupled (assuming fkf_{k} is separable). If the matrix EkE_{k} is not unitary, we can still employ a simple proximal gradient step to inexactly minimize L(x1r+1,...,xk1r+1,xk,xk+1r,...,xKr)L(x^{r+1}_{1},...,x^{r+1}_{k-1},x_{k},x^{r}_{k+1},...,x^{r}_{K}). More specifically, we update each block of xkx_{k} according to the following procedure

in which the smooth part of the objective function in the kk-th subproblem, namely,

is linearized locally at xkrx^{r}_{k}, and a proximal term β2xkxkr2\frac{\beta}{2}\|x_{k}-x_{k}^{r}\|^{2} is added. Here, β>0\beta>0 is a positive constant. With this change, updating xkx_{k} is easy when hkh_{k} (the nonsmooth part of fkf_{k}) is separable. For example, this is the case for compressive sensing applications where hk(xk)=xk1h_{k}(x_{k})=\|x_{k}\|_{1}, and the resulting subproblem admits a closed form solution given by the component-wise soft thresholding (also known as the shrinkage operator). We note that the proximal ADMM algorithm described here is slightly more general than the proximal ADMM algorithm seen in the literature, in which only the penalization term \frac{\rho}{2}\Big{\|}E_{k}x_{k}+\sum_{j<k}E_{j}x_{j}^{r+1}+\sum_{j>k}E_{j}x_{j}^{r}-q\Big{\|}^{2} is linearized locally at xkrx^{r}_{k}; see e.g., .

We claim that Theorem 3.1 holds for the proximal ADMM algorithm without requiring assumption A-(f) (the full rankness of EkE_{k}’s). Indeed, to establish the (linear) convergence of the proximal ADMM (4.1), we can follow the same proof steps as that for Theorem 3.1, with the only changes being in the proof of Lemmas 2.4-2.5 and Lemma 3.1. We first show that Lemma 2.4 holds without assumption A(f). Clearly subproblem (4.1) is now strongly convex without the full column rank assumption of EkE_{k}’s made in A(f). In the following, we will show that as long as β\beta is large enough, there is a sufficient descent:

This property can be seen by bounding the smooth part of L(x1r+1,...,xk1r+1,xk,xk+1r,...,xKr)L(x^{r+1}_{1},...,x^{r+1}_{k-1},x_{k},x^{r}_{k+1},...,x^{r}_{K}), which is given by

with the Taylor expansion at xkrx^{r}_{k}:

is the Lipschitz constant of Lˉk(){\bar{L}}_{k}(\cdot) and LL is the Lipschitz constant of gk()\nabla g_{k}(\cdot). Making the above inequality more explicit yields

provided the regularization parameter β\beta satisfies

In the above derivation of (4.4), the first step is due to (4.3), while the second inequality follows from the definition of xkr+1x^{r+1}_{k} (cf. (4.1)). Summing (4.4) over all kk yields the desired estimate of sufficient descent (4.2).

To verify that Lemma 2.5 still holds for the proximal ADMM algorithm, we note from the corresponding optimality condition for (4.1)

Using this relation in place of (2.10) and following the same proof steps, we can easily prove that the bound (2.9) in Lemma 2.5 can be extended to the proximal ADMM algorithm. Thus, the convergence results in Theorem 3.1 remain true for the proximal ADMM algorithm (4.1).

It remains to verify that Lemma 3.1 still holds true. In fact the first part of Lemma 3.1 can be shown to be independent of the iterates, thus it trivially holds true for the proximal ADMM algorithm. To show that the second part of Lemma 3.1 is true, note that the optimality condition of the proximal ADMM algorithm implies that

where in this case ekre^{r}_{k} is given as

It is then straightforward to show that the norm of ekre^{r}_{k} can be bounded by cxrxr+1c^{{}^{\prime}}\|x^{r}-x^{r+1}\| for some constant c>0c^{{}^{\prime}}>0. The rest of the proof follows the same steps as in Lemma 3.1.

2 Jacobi Update

Another popular variant of the ADMM algorithm is to use a Jacobi iteration (instead of a Gauss-Seidel iteration) to update the primal variable blocks {xk}\{x_{k}\}. In particular, the ADMM iteration (1.7) is modified as follows:

The convergence for this direct Jacobi scheme is unclear, as the augmented Lagrangian function may not decrease after each Jacobi update. In the following, we consider a modified Jacobi scheme with an explicit stepsize control. Specifically, let us introduce an intermediate variable w=(w1T,,wKT)Tnw=(w^{T}_{1},\cdots,w^{T}_{K})^{T}\in\Re^{n}. The modified Jacobi update is given as follows:

where a stepsize of 1/K1/K is used in the update of each variable block.

With this modification, we claim that Lemmas 2.4-2.5 and Lemma 3.1 still hold. In particular, Lemma 2.4 can be argued as follows. The strong convexity of L(x;y)L(x;y) with respect to the variable block xkx_{k} implies that

where the first inequality comes from the convexity of the augmented Lagrangian function.

From the update rule (4.7) we have K(xkr+1xkr)=(wkr+1xkr),K(x^{r+1}_{k}-x^{r}_{k})=(w^{r+1}_{k}-x^{r}_{k}), which combined with the previous inequality yields

The proof of Lemma 2.5 also requires only minor modifications. In particular, we have the following optimality condition for (4.5)

Similar to the proof of Lemma 2.5, we have

Utilizing the relationship K(xkr+1xkr)=(wkr+1xkr)K(x^{r+1}_{k}-x^{r}_{k})=(w^{r+1}_{k}-x^{r}_{k}), we can establish Lemma 2.5 by following similar proof steps (which we omit due to space reason).

Lemma 3.1 can be shown as follows. We first express wkr+1w_{k}^{r+1} as

Again by using the relationship K(xkr+1xkr)=(wkr+1xkr)K(x^{r+1}_{k}-x^{r}_{k})=(w^{r+1}_{k}-x^{r}_{k}), we can bound the norm of ekre^{r}_{k} by cxr+1xrc^{{}^{\prime}}\|x^{r+1}-x^{r}\|, for some c>0c^{{}^{\prime}}>0. The remaining proof steps are similar to those in Lemma 3.1.

Since Lemmas 2.4-2.5 and Lemma 3.1 hold for the Jacobi version of the ADMM algorithm with a step size control, we conclude that the convergence results of Theorem 3.1 remain true in this case.

Concluding Remarks

In this paper we have established the convergence and the rate of convergence of the classical ADMM algorithm when the number of variable blocks are more than two and in the absence of strong convexity. Our analysis is a departure of the conventional analysis of ADMM algorithm which relies on a contraction argument involving a weighted (semi-)norm of (xrx,yry)(x^{r}-x^{*},y^{r}-y^{*}), see . In our analysis, we require neither the strong convexity of the objective function nor the row independence assumption of the constrained matrix EE. Instead, we use a local error bound to show that when the stepsize of dual update is made sufficiently small, the sum of the primal and the dual optimality gaps decreases after each ADMM iteration, although separately they may individually increase. An interesting issue for further research is to identify good practical stepsize rules for dual update. While (3.17) does suggest a dual stepsize rule in terms of error bound constants, it may be too conservative and cumbersome to compute unless the objective function is strongly convex. One possibility may be to use an adaptive dual stepsize rule to guarantee the decrease of the sum of the primal and dual optimality gaps.

Acknowledgement: The authors are grateful to Xiangfeng Wang and Dr. Min Tao of Nanjing University for their constructive comments.

Appendix

The augmented Lagrangian dual function can be expressed as

Let (x,y)(x^{*},y^{*}) denote a primal and dual optimal solution pair. Let XX^{*} and YY^{*} denote the primal and dual optimal solution set. The he following properties will be useful in our subsequent analysis.

There exist positive scalars σg\sigma_{g}, LgL_{g} such that  x(y),x(y)X\forall~{}x(y),x(y^{\prime})\in X

ATg(Ax(y))ATg(Ax(y)),x(y)x(y)σgAx(y)Ax(y)2\langle A^{T}\nabla g(Ax(y^{\prime}))-A^{T}\nabla g(Ax(y)),x(y^{\prime})-x(y)\rangle\geq\sigma_{g}\|Ax(y^{\prime})-Ax(y)\|^{2}.

g(Ax(y))g(Ax(y))ATg(Ax(y)),x(y)x(y)σg2Ax(y)Ax(y)2.g(Ax(y^{\prime}))-g(Ax(y))-\langle A^{T}\nabla g(Ax(y)),x(y^{\prime})-x(y)\rangle\geq\frac{\sigma_{g}}{2}\|Ax(y^{\prime})-Ax(y)\|^{2}.

ATg(Ax(y))ATg(Ax(y))LgAx(y)Ax(y).\|A^{T}\nabla g(Ax(y^{\prime}))-A^{T}\nabla g(Ax(y))\|\leq L_{g}\|Ax(y^{\prime})-Ax(y)\|.

All a-1)–a-3) are true for p()p(\cdot) as well, with some constants σp\sigma_{p} and LpL_{p}.

d(y)=qEx(y)\nabla d(y)=q-Ex(y), and d(y)d(y)1ρyy.\|\nabla d(y^{\prime})-\nabla d(y)\|\leq\frac{1}{\rho}\|y^{\prime}-y\|.

Part (a) is true due to the assumed Lipchitz continuity and strong convexity of the function g()g(\cdot). Part (b) is from the Lipchitz continuity and strong convexity of the quadratic penalization p()p(\cdot). Part (c) has been shown in Lemma 2.1 and Lemma 2.2.

To proceed, let us rewrite the primal problem equivalently as

It is easy to verify by using the optimality condition for problem (6.2) that

We can take e=0e=0, and use the fact that x(y)Xx(y^{*})\in X^{*}, we see that (x,s,y,λ)M(ETp(Ex)+ATg(Ax),0)(x,s,y,\lambda)\in\mathcal{M}(E^{T}\nabla p(Ex)+A^{T}\nabla g(Ax),0) if and only if xXx\in X^{*} and yYy\in Y^{*}.

The following result states a well-known local upper Lipschitzian continuity property for the polyhedral multifunction M\mathcal{M}; see .

There exists a positive scalar θ\theta that depends on A,E,Cx,CsA,E,C_{x},C_{s} only, such that for each (dˉ,eˉ)(\bar{d},\bar{e}) there is a positive scalar δ\delta^{\prime} satisfying

The following is the main result for this appendix. Note that the scalar τ\tau in the claim is independent the choice of yy, xx, ss, and is independent on the coefficients of the linear term ss.

Suppose all the assumptions in Assumption A are satisfied. Then there exits positive scalars δ\delta, τ\tau such that for all yUy\in\cal U and d(y)δ\|\nabla d(y)\|\leq\delta, there holds dist(y,Y)τd(y){\rm dist}(y,Y^{*})\leq\tau\|\nabla d(y)\|.

which is polyhedral and thus is closed. Then we can pass limit to it and conclude (cf. Proposition 6.1)

Then we let e=d(y)e=\nabla d(y), and suppose eδ\|e\|\leq\delta. From the previous argument we have

where the first inequality comes from the strong convexity of g()g(\cdot) and p()p(\cdot); the last equality is from (6.6) and (6.10). Moreover, we have

where in the first equality we have used the fact that CsTλCsTλ=0C^{T}_{s}\lambda-C^{T}_{s}\lambda^{*}=0; see (6.7) (6.11); in the third equality and in the last inequality we have used the complementary conditions (6.13) and (6.9). As a result, we have

where the last step is due to d(y)=Ex(y)q\nabla d(y)=Ex(y)-q and d(y)=Exq=0\nabla d(y^{*})=Ex^{*}-q=0. Finally we have from Proposition 6.1

We see that the above inequality is quadratic in (x(y),s,y,λ)(x,s,y,λ)/e\|(x(y),s,y,\lambda)-(x^{*},s^{*},y^{*},\lambda^{*})\|/\|e\|, so we have

for some scalar τ\tau depending on θ\theta, LgL_{g}, LpL_{p}, σg\sigma_{g}, σp\sigma_{p}. It is worth noting that τ\tau does not depend on the choice of the coefficients of the linear term ss. We conclude dist(y,Y)τd(y){\rm dist}(y,Y^{*})\leq\tau\|\nabla d(y)\|. Q.E.D.

References