Fast Multiple Splitting Algorithms for Convex Optimization

Donald Goldfarb, Shiqian Ma

Introduction

Problem (1.1) is closely related to the following inclusion problem:

where T1,,TKT_{1},\ldots,T_{K} are set-valued maximal monotone operators. The goal of problem (1.2) is to find a zero of the sum of KK maximal monotone operators. Note that the optimality conditions for (1.1) are

hence, these conditions can be satisfied by solving a problem of the form (1.2).

In the extensive literature on splitting and ADM algorithms, the case K=2K=2 predominates. The algorithms for solving (1.2) when K=2K=2 are usually based on operator splitting techniques. Important operator splitting algorithms include the Douglas-Rachford , Peaceman-Rachford , double-backward and forward-backward class of algorithms. Alternating direction methods (ADM) within an augmented Lagrangian framework for solving (1.1) are optimization analogs/variants of the Douglas-Rachford and Peaceman-Rachford splitting methods. These algorithms have been studied extensively for the case of K=2K=2, and were first proposed in the 1970s for solving optimization problems arising from numerical PDE problems . We refer to and the references therein for more information on splitting and ADM algorithms for the case of K=2K=2.

Although there is an extensive literature on operator splitting methods, very few convergence results have been published on methods for finding a zero of a sum of more than two maximal monotone operators. The principal exceptions, are the Jacobi-like method of Spingarn and more recently, the general projective splitting methods of Eckstein and Svaiter . The algorithm addressed in first reduces problem (1.2) to the sum of two maximal monotone operators by defining new subspaces and operators, and then applies a Douglas-Rachford splitting algorithm to solve the new problem. The projective splitting methods in do not reduce problem (1.2) to the case K=2K=2. Instead, by using the concept of an extended solution set, it is shown in that solving (1.2) is equivalent to finding a point in the extended solution set, and a separator-projection algorithm is given to do this.

Global convergence results for variable splitting ADMs and operator splitting algorithms for the case of K=2K=2 have been proved under various assumptions. However, except for the fairly recently proposed gradient methods in and related iterative shrinkage/thresholding algorithms in and the alternating linearization methods in , complexity bounds for these methods had not been established. These complexity results are extensions of the seminal results of Nesterov , who first showed that certain first-order methods that he proposed could obtain an ϵ\epsilon-optimal solution of a smooth convex programming problem in O(1/ϵ)O(1/\sqrt{\epsilon}) iterations. Moreover, he showed that his methods were optimal in the sense that this iteration complexity was the best that could be obtained using only first-order information. Nesterov’s optimal gradient methods are accelerated gradient methods that use a combination of previous points to compute the new point at each iteration. By combining these methods with smoothing techniques, optimal complexity results were obtained for solving nonsmooth problems in .

In this paper, we propose two classes of multiple variable-splitting algorithms based on alternating direction and alternating linearization techniques that can solve problem (1.1) for general K(K2)K(K\geq 2) and we present complexity results for them. (Note that the complexity results in are only for problem (1.1) when K=2K=2). The algorithms in the first class can be viewed as alternating linearization methods in the sense that at each iteration these algorithms perform KK minimizations of an approximation to the original objective function FF by keeping one of the functions fi(x)f_{i}(x) unchanged and linearizing the other K1K-1 functions. An alternating linearization method for minimizing the sum of two convex functions was studied by Kiwiel et al.. However, our algorithms differ greatly from the one in in the way that the proximal terms are chosen. Moreover, our algorithms are more general as they can solve general problems with K(K2)K(K\geq 2) functions. Furthermore, we prove that the iteration complexity of this class of splitting algorithms is O(1/ϵ)O(1/\epsilon) for an ϵ\epsilon-optimal solution. To the best of our knowledge, this is the first complexity result of this type for splitting/alternating direction type algorithms. The algorithms in our second class are accelerated versions of the algorithms in our first class and have O(1/ϵ)O(1/\sqrt{\epsilon}) iteration complexities. This class of splitting algorithms is also new as are the complexity results.

Our new algorithms have, in addition, several practical advantages. First, they are all parallelizable. Thus, although at each iteration we solve KK subproblems, the CPU time required should be approximately equal to the time required to solve the most difficult of the subproblems if we have KK processors that can work in parallel. Second, since every function fif_{i} is minimized once at each iteration, it is likely that our algorithms will need fewer iterations to converge than operator splitting algorithms such as FPC ,TVCMRI , ISTA and FISTA . The numerical results in for the case of K=2K=2 support this conclusion.

The rest of this paper is organized as follows. In Section 2 we propose a class of splitting algorithms based on alternating direction and alternating linearization methods for solving (1.1) and prove that they require O(1/ϵ)O(1/\epsilon) iterations to obtain an ϵ\epsilon-optimal solution. In Section 3 we propose accelerated splitting algorithms for solving (1.1) and prove they have O(1/ϵ)O(1/\sqrt{\epsilon}) complexities. We discuss how to apply our algorithms for solving nonsmooth problems by using smoothing techniques in Section 4. Numerical results are presented in Section 5. Finally, we summarize our results in Section 6.

A class of multiple splitting algorithms

By introducing new variables, i.e., splitting variable xx into KK different variables, problem (1.1) can be rewritten as:

In Sections 2 and 3, we focus on splitting and ADM algorithms for solving (2.3) and their complexity results.

We make the following assumptions throughout Sections 2 and 3.

where L(fi)L(f_{i}) is the Lipschitz constant.

Problem (1.1) is solvable, i.e., X:=argminF.X_{*}:=\arg\min F\neq\emptyset.

We define the term ϵ\epsilon-optimal as follows.

Suppose xx^{*} is an optimal solution to the following problem

xCx\in\mathcal{C} is called an ϵ\epsilon-optimal solution to (2.4) if f(x)f(x)ϵf(x)-f(x^{*})\leq\epsilon holds.

The following notation is adopted throughout Sections 2 and 3.

where μ\mu is a penalty parameter. We use Qi(v1,,vi1,u,vi+1,,vK)Q_{i}(v^{1},\ldots,v^{i-1},u,v^{i+1},\ldots,v^{K}) to denote the following approximation to the function F(u)F(u):

i.e., QiQ_{i} is an approximation to the function FF, where the ii-th function fif_{i} is unchanged but the other functions are approximated by a linear term plus a proximal term. We use pi(v1,,vi1,vi+1,,vK)p_{i}(v^{1},\ldots,v^{i-1},v^{i+1},\ldots,v^{K}) to denote the minimizer of Qi(v1,,vi1,u,vi+1,,vK)Q_{i}(v^{1},\ldots,v^{i-1},u,v^{i+1},\ldots,v^{K}) with respect to uu, i.e.,

With the above notation, we have the following lemma which follows from a fundamental property of a smooth function in the class C1,1C^{1,1}; see e.g., .

The following key lemma is crucial for the proofs of our complexity results. Our proofs of this lemma and most of the results that follow in this and the remaining sections of the paper closely follow proofs given in for related lemmas and theorems.

where p:=pi(v1,,vi1,vi+1,,vK)p:=p_{i}(v^{1},\ldots,v^{i-1},v^{i+1},\ldots,v^{K}).

where the second inequality is due to the convexity of the functions fj,j=1,,Kf_{j},j=1,\ldots,K and the last equality is from the first-order optimality conditions for problem (2.5), i.e.,

One natural choice of D(k)D^{(k)} is to take all of its components equal to 1/K1/K. In this case, all w(k)i,i=1,,Kw_{(k)}^{i},i=1,\ldots,K are equal to i=1Kx(k)i/K\sum_{i=1}^{K}x_{(k)}^{i}/K, i.e., the average of the current KK iterates.

At iteration kk, Algorithm 1 computes KK points x(k)i,i=1,,Kx_{(k)}^{i},i=1,\ldots,K by solving KK subproblems. For many problems in practice, these KK subproblems are expected to be very easy to solve. Another advantage of the algorithm is that it is parallelizable since given w(k)i,i=1,,Kw_{(k)}^{i},i=1,\ldots,K, the KK subproblems in Algorithm 1 can be solved simultaneously. Algorithm (1) can be viewed as an alternating linearization method since at each iteration, KK subproblems are solved, and each subproblem corresponds to minimizing a function involving linear approximations to some of the functions. Although Algorithm 1 assumes the Lipschitz constants are known, and hence that μ\mu is known, this assumption can be relaxed by using the backtracking technique in to estimate μ\mu at each iteration.

We prove in the following that the number of iterations needed by Algorithm 1 to obtain an ϵ\epsilon-optimal solution is O(1/ϵ)O(1/\epsilon).

Suppose xx^{*} is an optimal solution to problem (2.3). For any choice of μ1/max1iKL(fi)\mu\leq 1/\max_{1\leq i\leq K}{L(f_{i})}, the sequence {x(k)i,w(k)i}i=1K\{x_{(k)}^{i},w_{(k)}^{i}\}_{i=1}^{K} generated by Algorithm 1 satisfies:

Thus, the sequence {mini=1,,KF(x(k)i)}\{\min_{i=1,\ldots,K}F(x_{(k)}^{i})\} produced by Algorithm 1 converges to F(x)F(x^{*}). Moreover, if μβ/maxi{L(fi)}\mu\geq\beta/\max_{i}\{L(f_{i})\} where 0<β10<\beta\leq 1, the number of iterations needed to obtain an ϵ\epsilon-optimal solution is at most C/ϵ\lceil C/\epsilon\rceil, where C=(K1)maxi{L(fi)}x0x22βC=\frac{(K-1)\max_{i}\{L(f_{i})\}\|x_{0}-x^{*}\|^{2}}{2\beta}.

In (2.6), by letting u=x,vj=w(n)i,j=1,,K,jiu=x^{*},v^{j}=w_{(n)}^{i},j=1,\ldots,K,j\neq i, we have p=xn+1ip=x_{n+1}^{i} and

Using the definition of w(n)iw_{(n)}^{i} in Algorithm 1, we have

where the second and the last equalities are due to the fact that D(n+1)D^{(n+1)} is a doubly stochastic matrix and the inequality is due to the convexity of the function 2.\|\cdot\|^{2}.

Thus by summing (2.11) over i=1,,Ki=1,\ldots,K we obtain

where the last inequality is due to (2.12).

Summing (2.13) over n=0,1,,k1n=0,1,\ldots,k-1, and using the fact that w(0)i=x0,i=1,,Kw_{(0)}^{i}=x_{0},i=1,\ldots,K, yields

In (2.6), by letting u=vj=w(n)i,j=1,,K,jiu=v^{j}=w_{(n)}^{i},j=1,\ldots,K,j\neq i, we get p=x(n+1)ip=x_{(n+1)}^{i} and

From the way we compute w(n)i,i=1,,K,w_{(n)}^{i},i=1,\ldots,K, and the facts that FF is convex and D(n)D^{(n)} is a doubly stochastic matrix, we get

Now summing (2.15) over i=1,,Ki=1,\ldots,K and using (2.16), we obtain

This shows that the sums i=1KF(x(n)i)\sum_{i=1}^{K}F(x_{(n)}^{i}) are non-increasing as nn increases. Hence,

Finally, combining (2.14) and (2.18) yields

It then follows that mini=1,,KF(x(k)i)F(x)ϵ,\min_{i=1,\ldots,K}F(x_{(k)}^{i})-F(x^{*})\leq\epsilon, if kC/ϵk\geq\lceil C/\epsilon\rceil, where C=(K1)maxi{L(fi)}x0x22βC=\frac{(K-1)\max_{i}\{L(f_{i})\}\|x_{0}-x^{*}\|^{2}}{2\beta}, and hence that for any kC/ϵk\geq\lceil C/\epsilon\rceil, x(k):=argmin{x(k)iF(x(k)i),i=1,,K}x_{(k)}:=\arg\min\{x_{(k)}^{i}|F(x_{(k)}^{i}),i=1,\ldots,K\} is an ϵ\epsilon-optimal solution. ∎

If in the original problem (1.1), xx is subject to a convex constraint xCx\in\mathcal{C}, where C\mathcal{C} is a convex set, we can impose this constraint in every subproblem in MSA and obtain the same complexity result. The only changes in the proof are in Lemma 5. If there is a constraint xCx\in\mathcal{C}, then (2.6) and (2.7) hold for any uCu\in\mathcal{C} and the last equality in (2.7) becomes a “\geq” inequality due to the fact that the optimality conditions (2.8) become

Unfortunately, this extension is not very practical, since for it to be useful, adding the constraint in every subproblem would most likely make most of these subproblems difficult to solve.

A class of fast multiple splitting algorithms

To establish the O(1/ϵ)O(1/\sqrt{\epsilon}) iteration complexity of FaMSA, we need the following lemma.

Suppose xx^{*} is an optimal solution to problem (2.3). For any choice of μmax1iKL(fi)\mu\leq\max_{1\leq i\leq K}L(f_{i}), the sequence {x(k)i,w(k)i,w^(k)i}i=1K\{x_{(k)}^{i},w_{(k)}^{i},\hat{w}_{(k)}^{i}\}_{i=1}^{K} generated by Algorithm 2 satisfies:

where vk:=i=1KF(x(k)i)KF(x)v_{k}:=\sum_{i=1}^{K}F(x_{(k)}^{i})-KF(x^{*}) and uki:=tkx(k)i(tk1)w^(k1)ix,i=1,,K.u_{k}^{i}:=t_{k}x_{(k)}^{i}-(t_{k}-1)\hat{w}_{(k-1)}^{i}-x^{*},i=1,\ldots,K.

In (2.6), by letting u=w^(k)i,vj=w(k+1)i,j=1,,K,jiu=\hat{w}_{(k)}^{i},v^{j}=w_{(k+1)}^{i},j=1,\ldots,K,j\neq i, we get p=x(k+1)ip=x_{(k+1)}^{i} and

Summing (3.2) over i=1,,Ki=1,\ldots,K, and using the facts that FF is convex and D(k)D^{(k)} is a doubly stochastic matrix, we get

In (2.6), by letting u=x,vj=w(k+1)iu=x^{*},v^{j}=w_{(k+1)}^{i}, we get p=x(k+1)ip=x_{(k+1)}^{i} and

Summing (3.4) over i=1,,Ki=1,\ldots,K we obtain

Now multiplying (3.3) by tk2t_{k}^{2} and (3.5) by tk+1t_{k+1}, adding the resulting two inequalities, using the relation tk2=tk+1(tk+11)t_{k}^{2}=t_{k+1}(t_{k+1}-1), and the identity (2.9), we get

From the way we compute w(k+1)iw_{(k+1)}^{i} in Algorithm 2, i.e.,

Thus, from (3.6) and the definition of ukiu_{k}^{i} it follows that

Before proving our main complexity theorem to Algorithm 2, we note that the sequence {tk}\{t_{k}\} generated by Algorithm 2 clearly satisfies tk+1tk+12,t_{k+1}\geq t_{k}+\frac{1}{2}, and hence tk(k+1)/2t_{k}\geq(k+1)/2 for all k1k\geq 1 since t1=1t_{1}=1.

Suppose xx^{*} is an optimal solution to problem (2.3). For any choice of μmax1iKL(fi)\mu\leq\max_{1\leq i\leq K}L(f_{i}), the sequence {x(k)i,w(k)i,w^(k)i}i=1K\{x_{(k)}^{i},w_{(k)}^{i},\hat{w}_{(k)}^{i}\}_{i=1}^{K} generated by Algorithm 2 satisfies:

Thus, the sequence {mini=1,,KF(x(k)i)}\{\min_{i=1,\ldots,K}F(x_{(k)}^{i})\} produced by Algorithm 2 converges to F(x)F(x^{*}). Moreover, if μβ/maxi{L(fi)}\mu\geq\beta/\max_{i}\{L(f_{i})\} where 0<β10<\beta\leq 1, the number of iterations needed to obtain an ϵ\epsilon-optimal solution is at most C/ϵ\lfloor\sqrt{C/\epsilon}\rfloor, where C=2(K1)maxi{L(fi)}x0x2/βC=2(K-1)\max_{i}\{L(f_{i})\}\|x_{0}-x^{*}\|^{2}/\beta.

where the first inequality is due to tk(k+1)/2t_{k}\geq(k+1)/2, the first equality is from the facts that t1=1t_{1}=1 and u1i=x(1)ixu_{1}^{i}=x_{(1)}^{i}-x^{*}, the third inequality is from letting k=0k=0 in (3.5) and the last equality is due to w(1)i=x0,i=1,,K.w_{(1)}^{i}=x_{0},i=1,\ldots,K.

Thus, from vk=i=1KF(x(k)i)KF(x)v_{k}=\sum_{i=1}^{K}F(x_{(k)}^{i})-KF(x^{*}) we get

Moreover, it follows that if C/(k+1)2ϵC/(k+1)^{2}\leq\epsilon, i.e., kC/ϵk\geq\lfloor\sqrt{C/\epsilon}\rfloor, then mini=1,,KF(x(k)i)F(x)ϵ,\min_{i=1,\ldots,K}F(x_{(k)}^{i})-F(x^{*})\leq\epsilon, where C=2(K1)maxi{L(fi)}x0x2/βC=2(K-1)\max_{i}\{L(f_{i})\}\|x_{0}-x^{*}\|^{2}/\beta. This implies that for any kC/ϵk\geq\lfloor\sqrt{C/\epsilon}\rfloor, x(k):=argmin{x(k)iF(x(k)i),i=1,,K}x_{(k)}:=\arg\min\{x_{(k)}^{i}|F(x_{(k)}^{i}),i=1,\ldots,K\} is an ϵ\epsilon-optimal solution. ∎

Although we have assumed that the Lipschitz constants L(fi)L(f_{i}) are known, and hence that μ\mu is chosen in Algorithm 2 to be smaller than 1/max1iK{L(fi)}1/\max_{1\leq i\leq K}\{L(f_{i})\}, this can be relaxed by using the backtracking technique in that chooses a μ\mu at each iteration that is smaller than the μ\mu used at the previous iteration and for which F(p)Qi(w(k)i,,w(k)i,p,w(k)i,,w(k)i)F(p)\leq Q_{i}(w_{(k)}^{i},\ldots,w_{(k)}^{i},p,w_{(k)}^{i},\ldots,w_{(k)}^{i}) for all ii.

In this section, we present a variant of the fast multiple splitting algorithm (Algorithm 2) that is much more efficient and requires much less memory than Algorithm 2 for problems in which KK is large. This variant uses D(k):=1/KeeD^{(k)}:=1/Kee^{\top}, where ee is the nn-dimensional vector with all ones, and replaces x(k)ix_{(k)}^{i} in the last line of Algorithm 2 by w^(k)i\hat{w}_{(k)}^{i}; i.e., in the last line of Algorithm 2, we compute w(k+1)iw_{(k+1)}^{i} for i=1,,Ki=1,\ldots,K by the formula:

It is easy to see that in this variant, the w^(k)i,i=1,,K\hat{w}_{(k)}^{i},i=1,\ldots,K are all the same and the w(k+1)i,i=1,,Kw_{(k+1)}^{i},i=1,\ldots,K are all the same. We call this variant FaMSA-s, where s refers to the fact that this variant computes a “single” vector w^k\hat{w}^{k} and a single vector w(k+1)w_{(k+1)} at the kk-th iteration. It is given below as Algorithm 3.

It is easy to verify that the following analog of Lemma 8 applies to Algorithm FaMSA-s.

Suppose xx^{*} is an optimal solution to problem (2.3). The sequence {w(k),w^(k)}\{w_{(k)},\hat{w}_{(k)}\} generated by Algorithm FaMSA-s satisfies:

where vk:=K(F(w^(k))F(x))v_{k}:=K(F(\hat{w}_{(k)})-F(x^{*})) and uk:=tkw^(k)(tk1)w^(k1)x.u_{k}:=t_{k}\hat{w}_{(k)}-(t_{k}-1)\hat{w}_{(k-1)}-x^{*}.

The proof is very similar to the proof of Lemma 8; hence, we leave it to the reader. The main difference is that instead of using the inequality i=1KF(w^(k)i)i=1KF(x(k)i)\sum_{i=1}^{K}F(\hat{w}_{(k)}^{i})\leq\sum_{i=1}^{K}F(x_{(k)}^{i}) to replace the sum involving w^(k)i\hat{w}_{(k)}^{i}, we use the fact that KF(w^k+1)i=1KF(x(k+1)i)KF(\hat{w}_{k+1})\leq\sum_{i=1}^{K}F(x_{(k+1)}^{i}) to replace the sum involving x(k+1)ix_{(k+1)}^{i} in the proof. ∎

From Lemma 11, Theorem 9 with w^(k)i\hat{w}_{(k)}^{i} and w(k)iw_{(k)}^{i}, respectively, for i=1,,Ki=1,\ldots,K replaced by w^(k)\hat{w}_{(k)} and w(k)w_{(k)} follows immediately for FaMSA-s.

Multiple splitting algorithms for nonsmooth problems

where ρ\rho is a positive smoothness parameter. It can be shown that fρ(x)f_{\rho}(x) is well defined and is in the class of C1,1C^{1,1} and its gradient is Lipschitz continuous with constant Lρ=1ρσL_{\rho}=\frac{1}{\rho\sigma} (see Theorem 1 in ). Also, it is easy to show that the following relations hold for f(x)f(x) and fρ(x)f_{\rho}(x):

For nonsmooth problems in imaging, data analysis, and machine learning, etc. with regularization terms that involve total variation and the nuclear norm, we can use similar smoothing techniques to smooth these nonsmooth functions, and then apply our multiple splitting algorithms to solve them.

Numerical experiments

We present some preliminary numerical experiments in this section. Specifically, we apply our MSA and FaMSA algorithms to solve the Fermat-Weber problem and a total variation and wavelet based image deblurring problem. All numerical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4GHZ CPU and 6GB of RAM.

The Fermat-Weber (F-W) problem can be cast as:

where ρ>0\rho>0 is a smoothness parameter. The gradient of fiρf_{i}^{\rho}, fiρ(x)=yi,\nabla f_{i}^{\rho}(x)=y_{i}^{*}, where yiy_{i}^{*} is the optimal solution to the optimization problem in (5.2). It is easy to show that yi=xcimax{ρ,xci}.y_{i}^{*}=\frac{x-c^{i}}{\max\{\rho,\|x-c^{i}\|\}}. Moreover, fiρ(x)\nabla f_{i}^{\rho}(x) is Lipschitz continuous with constant L(fiρ)=1/ρL(f_{i}^{\rho})=1/\rho. Now we can apply MSA, FaMSA and FaMSA-s to solve

The ii-th subproblem in all of these algorithms corresponds to solving the following problem:

It is easy to check that the optimal solution to problem (5.4) is given by

If we choose the doubly stochastic matrix D(k)D^{(k)} to be D(k):=1/KeeD^{(k)}:=1/Kee^{\top} in MSA as we do in FaMSA-s, all w(k)iw_{(k)}^{i}’s are the same in MSA as they are in FaMSA-s. Hence, computing x(k)ix_{(k)}^{i}, for i=1,,Ki=1,\ldots,K in both algorithms can be done efficiently as follows.

We compared the performance of MSA and FaMSA-s with the classical gradient method (Grad) and Nesterov’s accelerated gradient method (Nest) for solving (5.3). The classical gradient method for solving (5.3) with step size τ>0\tau>0 is:

The variant of Nesterov’s accelerated gradient method that we used is the following:

was less than 10610^{-6}. We tested the performance of these four solvers for different choices of τ\tau, which is the step size for Grad and Nest. Note that since the wiw^{i}’s are the same in MSA with D(k)=1KeeD^{(k)}=\frac{1}{K}ee^{\top} for all kk and in FaMSA-s, these two methods can be viewed as linearization methods in which the single function j=1,jiKfj(x)\sum_{j=1,j\neq i}^{K}f_{j}(x) is linearized at the point ww with only one proximal term K12μxw\frac{K-1}{2\mu}\|x-w\| in the ii-th subproblem. So the step size for MSA and FaMSA-s is μ/(K1)\mu/(K-1). Hence, the parameter μ\mu for MSA and FaMSA-s was set to μ=τ(K1)\mu=\tau(K-1) in our numerical tests.

Our results are presented in Table 1. The CPU times reported are in seconds. These results show that for the F-W problem, our implementations of MSA and FaMSA-s take roughly between two and three times as much time to solve each problem as taken by Grad and Nest, respectively. This is not surprising since it is clear that the computation of each set of KK vectors z(k)iz_{(k)}^{i} and x(k)ix_{(k)}^{i} for i=1,,Ki=1,\ldots,K in (5.9) is roughly comparable to a single computation of the gradient, i.e., the KK gradients of fiρ(x)f_{i}^{\rho}(x), for i=1,,Ki=1,\ldots,K. Moreover, for the simple F-W objective function, not much is gained by minimizing only one out of the KK individual functions fiρ(x)f_{i}^{\rho}(x), i=1,,Ki=1,\ldots,K, when KK is large as it is in our tests. Note that the number of iterations required by MSA and Grad were exactly the same on our set of test problems. When KK is of a moderate size and the individual functions are more complicated, MSA should require fewer iterations than Grad.

The purpose of this set of tests was not to demonstrate any advantage that our algorithms might have over gradient methods. Rather, they were performed to validate our algorithms and show that the accelerated variants like algorithm Nest can reduce the number of iterations required to solve problems of the form (1.1). This is quite clear from the results reported in Table 1. We further note that FaMSA-s often takes one to three fewer iterations than Nest. Note that for some problems, the multiple splitting algorithm took only one iteration to converge. The reason was that for these problems, the number of points was much larger than the dimension of the space. Therefore, the points were very compact and fairly uniformly distributed around the initial point; hence that point was quite likely to be very close to the optimal solution.

2 An image deblurring problem

In this section, we report the results of applying our multiple splitting algorithms to a benchmark total variation and wavelet-based image deblurring problem from . In this problem, the original image is the well-known Cameraman image of size 256×256256\times 256 and the observed image is obtained after imposing a uniform blur of size 9×99\times 9 (denoted by the operator AA) and Gaussian noise (generated by the function randn in MATLAB with a seed of 0 and a standard deviation of 0.560.56). Since the vector of coefficients of the wavelet transform of the image is sparse in this problem and the total variation norm of the image is expected to be small, one can try to reconstruct the image xx from the observed image bb by solving the problem:

Thus, the smooth version of problem (5.11) was:

However, when we applied our multiple splitting algorithms to (5.12), we actually performed the following computation on the kk-th iteration:

which is a standard TV-denoising problem. In our tests, we perform 10 iterations of the algorithm proposed by Chambolle in to approximately solve this problem. The second subproblem in (5.17) can be reduced to:

It is easy to check that the solution of (5.18) is given by:

Solving this linear system is easy since the operator AA has a special structure and thus (AA+2/μI)(A^{\top}A+2/\mu I) can be inverted efficiently.

In our tests, we set α=0.001\alpha=0.001, β=0.035\beta=0.035 and used smoothing parameters δ=σ=104\delta=\sigma=10^{-4}. The initial points were all set equal to . We compared the performance of MSA, FaMSA, FaMSA-s and Grad for different μ\mu and step sizes τ\tau. In these comparisons, we simply terminated the codes after 500 iterations. The objective function value and the improvement signal noise ratio (ISNR) at different iterations are reported in Table 2. The ISNR is defined as ISNR:=10log10xb2xxˉ2ISNR:=10\log_{10}\frac{\|x-b\|^{2}}{\|x-\bar{x}\|^{2}}, where xx is the reconstructed image and xˉ\bar{x} is the true image. As we did for F-W problem, we always used μ=τ(K1)\mu=\tau(K-1) and since there were three functions in this problem, we used μ=2τ\mu=2\tau. For large μ\mu, we did not report the results for all of the iterations since the comparisons are quite clear from the selected iterations. See Figure 1 for additional and more complete comparisons. We make the following observations from Table 2. For μ=0.1\mu=0.1, FaMSA-s achieved the best objective function value in about 200 iterations and 152 CPU seconds. The best ISNR was also achieved by FaMSA-s, in about 300 iterations and 227 seconds. MSA and Grad were not able to obtain an acceptable solution in 500 iterations. In fact, they were only able to reduce the objective function to twice the near-optimal value of 3.86×1043.86\times 10^{4} achieved by FaMSA-s. For μ=0.5\mu=0.5, FaMSA-s achieved the best objective function value and ISNR in 100 iterations and 76 seconds and 125 iterations and 94 seconds, respectively. Again, MSA and Grad did not achieve acceptable results even after 500 iterations. For μ=1\mu=1, MSA achieved the best objective function value, 3.73×1043.73\times 10^{4}, after 500 iterations and 349 CPU seconds, while the best ISNR was achieved by FaMSA-s in 80 iterations and 61 seconds. Also, the best objective function value achieved by FaMSA-s was at the 60-th iteration after only 47 CPU seconds. We also note that for μ=0.1,0.5\mu=0.1,0.5 and 11, MSA was always better than Grad and FaMSA-s was always slightly better than FaMSA. Another observation was that MSA always decreased the objective function value for μ=0.1,0.5\mu=0.1,0.5 and 11, while FaMSA and FaMSA-s always achieved near-optimal results in a relatively small number of iterations and then started getting worse. However, in practice, one would always terminate FaMSA and FaMSA-s once the objective function value started increasing. For μ=5\mu=5, MSA gave very good results while the other three solvers diverged immediately. Specifically, the best objective function value 3.73×1043.73\times 10^{4} was achieved by MSA in 120 iterations and 80 CPU seconds, and the best ISNR was achieved by MSA in 200 iterations and 132 CPU seconds. Thus, based on these observations, we conclude that FaMSA-s attains a nearly optimal solution very quickly for small μ\mu while MSA is more stable for large μ\mu.

We also plotted some figures to graphically illustrate the performance of these solvers. Figures (a), (b) and (c) in Figure 1 plot the objective function value versus the iteration number for μ=0.1,0.5\mu=0.1,0.5 and 11, respectively. Figures (d), (e) and (f) in Figure 1 plot ISNR versus the iteration number for μ=0.1,0.5\mu=0.1,0.5 and 11. We did not plot graphs for μ=5\mu=5, since FaMSA, FaMSA-s and Grad diverged from the very first iteration. From Figure 1 we can see the comparisons clearly. Basically, these figures show that FaMSA and FaMSA-s achieve a nearly optimal solution very quickly. We can also see from (b), (c), (e) and (f) that FaMSA-s is always slightly better than FaMSA and MSA is always better than Grad.

We also tested setting D(k)D^{(k)} to the identity matrix in MSA and FaMSA, but this choice, as expected, did not give as good results.

To see how MSA performed for the deblurring problem (5.12), we show the original (a), blurred (b) and reconstructed (c) cameraman images in Figure 2. The reconstructed image (c) is the one that was obtained by applying MSA with μ=5\mu=5 after 200 iterations. The ISNR of the reconstructed image is 5.3182. From Figure 2 we see that MSA was able to recover the blurred image very well.

Conclusions

In this paper, we proposed two classes of multiple splitting algorithms based on alternating directions and optimal gradient techniques for minimizing the sum of KK convex functions. Complexity bounds on the number of iterations required to obtain an ϵ\epsilon-optimal solution for these algorithms were derived. Our algorithms are all parallelizable, which is attractive for practical applications involving large-scale optimization problems.

Acknowledgement

We would like to thank the anonymous referee for making several very helpful suggestions.

References