Fast Multiple Splitting Algorithms for Convex Optimization
Donald Goldfarb, Shiqian Ma
Introduction
Problem (1.1) is closely related to the following inclusion problem:
where are set-valued maximal monotone operators. The goal of problem (1.2) is to find a zero of the sum of maximal monotone operators. Note that the optimality conditions for (1.1) are
hence, these conditions can be satisfied by solving a problem of the form (1.2).
In the extensive literature on splitting and ADM algorithms, the case predominates. The algorithms for solving (1.2) when are usually based on operator splitting techniques. Important operator splitting algorithms include the Douglas-Rachford , Peaceman-Rachford , double-backward and forward-backward class of algorithms. Alternating direction methods (ADM) within an augmented Lagrangian framework for solving (1.1) are optimization analogs/variants of the Douglas-Rachford and Peaceman-Rachford splitting methods. These algorithms have been studied extensively for the case of , and were first proposed in the 1970s for solving optimization problems arising from numerical PDE problems . We refer to and the references therein for more information on splitting and ADM algorithms for the case of .
Although there is an extensive literature on operator splitting methods, very few convergence results have been published on methods for finding a zero of a sum of more than two maximal monotone operators. The principal exceptions, are the Jacobi-like method of Spingarn and more recently, the general projective splitting methods of Eckstein and Svaiter . The algorithm addressed in first reduces problem (1.2) to the sum of two maximal monotone operators by defining new subspaces and operators, and then applies a Douglas-Rachford splitting algorithm to solve the new problem. The projective splitting methods in do not reduce problem (1.2) to the case . Instead, by using the concept of an extended solution set, it is shown in that solving (1.2) is equivalent to finding a point in the extended solution set, and a separator-projection algorithm is given to do this.
Global convergence results for variable splitting ADMs and operator splitting algorithms for the case of have been proved under various assumptions. However, except for the fairly recently proposed gradient methods in and related iterative shrinkage/thresholding algorithms in and the alternating linearization methods in , complexity bounds for these methods had not been established. These complexity results are extensions of the seminal results of Nesterov , who first showed that certain first-order methods that he proposed could obtain an -optimal solution of a smooth convex programming problem in iterations. Moreover, he showed that his methods were optimal in the sense that this iteration complexity was the best that could be obtained using only first-order information. Nesterov’s optimal gradient methods are accelerated gradient methods that use a combination of previous points to compute the new point at each iteration. By combining these methods with smoothing techniques, optimal complexity results were obtained for solving nonsmooth problems in .
In this paper, we propose two classes of multiple variable-splitting algorithms based on alternating direction and alternating linearization techniques that can solve problem (1.1) for general and we present complexity results for them. (Note that the complexity results in are only for problem (1.1) when ). The algorithms in the first class can be viewed as alternating linearization methods in the sense that at each iteration these algorithms perform minimizations of an approximation to the original objective function by keeping one of the functions unchanged and linearizing the other functions. An alternating linearization method for minimizing the sum of two convex functions was studied by Kiwiel et al.. However, our algorithms differ greatly from the one in in the way that the proximal terms are chosen. Moreover, our algorithms are more general as they can solve general problems with functions. Furthermore, we prove that the iteration complexity of this class of splitting algorithms is for an -optimal solution. To the best of our knowledge, this is the first complexity result of this type for splitting/alternating direction type algorithms. The algorithms in our second class are accelerated versions of the algorithms in our first class and have iteration complexities. This class of splitting algorithms is also new as are the complexity results.
Our new algorithms have, in addition, several practical advantages. First, they are all parallelizable. Thus, although at each iteration we solve subproblems, the CPU time required should be approximately equal to the time required to solve the most difficult of the subproblems if we have processors that can work in parallel. Second, since every function is minimized once at each iteration, it is likely that our algorithms will need fewer iterations to converge than operator splitting algorithms such as FPC ,TVCMRI , ISTA and FISTA . The numerical results in for the case of support this conclusion.
The rest of this paper is organized as follows. In Section 2 we propose a class of splitting algorithms based on alternating direction and alternating linearization methods for solving (1.1) and prove that they require iterations to obtain an -optimal solution. In Section 3 we propose accelerated splitting algorithms for solving (1.1) and prove they have complexities. We discuss how to apply our algorithms for solving nonsmooth problems by using smoothing techniques in Section 4. Numerical results are presented in Section 5. Finally, we summarize our results in Section 6.
A class of multiple splitting algorithms
By introducing new variables, i.e., splitting variable into different variables, problem (1.1) can be rewritten as:
In Sections 2 and 3, we focus on splitting and ADM algorithms for solving (2.3) and their complexity results.
We make the following assumptions throughout Sections 2 and 3.
where is the Lipschitz constant.
Problem (1.1) is solvable, i.e.,
We define the term -optimal as follows.
Suppose is an optimal solution to the following problem
is called an -optimal solution to (2.4) if holds.
The following notation is adopted throughout Sections 2 and 3.
where is a penalty parameter. We use to denote the following approximation to the function :
i.e., is an approximation to the function , where the -th function is unchanged but the other functions are approximated by a linear term plus a proximal term. We use to denote the minimizer of with respect to , i.e.,
With the above notation, we have the following lemma which follows from a fundamental property of a smooth function in the class ; see e.g., .
The following key lemma is crucial for the proofs of our complexity results. Our proofs of this lemma and most of the results that follow in this and the remaining sections of the paper closely follow proofs given in for related lemmas and theorems.
where .
where the second inequality is due to the convexity of the functions and the last equality is from the first-order optimality conditions for problem (2.5), i.e.,
One natural choice of is to take all of its components equal to . In this case, all are equal to , i.e., the average of the current iterates.
At iteration , Algorithm 1 computes points by solving subproblems. For many problems in practice, these subproblems are expected to be very easy to solve. Another advantage of the algorithm is that it is parallelizable since given , the subproblems in Algorithm 1 can be solved simultaneously. Algorithm (1) can be viewed as an alternating linearization method since at each iteration, subproblems are solved, and each subproblem corresponds to minimizing a function involving linear approximations to some of the functions. Although Algorithm 1 assumes the Lipschitz constants are known, and hence that is known, this assumption can be relaxed by using the backtracking technique in to estimate at each iteration.
We prove in the following that the number of iterations needed by Algorithm 1 to obtain an -optimal solution is .
Suppose is an optimal solution to problem (2.3). For any choice of , the sequence generated by Algorithm 1 satisfies:
Thus, the sequence produced by Algorithm 1 converges to . Moreover, if where , the number of iterations needed to obtain an -optimal solution is at most , where .
In (2.6), by letting , we have and
Using the definition of in Algorithm 1, we have
where the second and the last equalities are due to the fact that is a doubly stochastic matrix and the inequality is due to the convexity of the function
Thus by summing (2.11) over we obtain
where the last inequality is due to (2.12).
Summing (2.13) over , and using the fact that , yields
In (2.6), by letting , we get and
From the way we compute and the facts that is convex and is a doubly stochastic matrix, we get
Now summing (2.15) over and using (2.16), we obtain
This shows that the sums are non-increasing as increases. Hence,
Finally, combining (2.14) and (2.18) yields
It then follows that if , where , and hence that for any , is an -optimal solution. ∎
If in the original problem (1.1), is subject to a convex constraint , where is a convex set, we can impose this constraint in every subproblem in MSA and obtain the same complexity result. The only changes in the proof are in Lemma 5. If there is a constraint , then (2.6) and (2.7) hold for any and the last equality in (2.7) becomes a “” inequality due to the fact that the optimality conditions (2.8) become
Unfortunately, this extension is not very practical, since for it to be useful, adding the constraint in every subproblem would most likely make most of these subproblems difficult to solve.
A class of fast multiple splitting algorithms
To establish the iteration complexity of FaMSA, we need the following lemma.
Suppose is an optimal solution to problem (2.3). For any choice of , the sequence generated by Algorithm 2 satisfies:
where and
In (2.6), by letting , we get and
Summing (3.2) over , and using the facts that is convex and is a doubly stochastic matrix, we get
In (2.6), by letting , we get and
Summing (3.4) over we obtain
Now multiplying (3.3) by and (3.5) by , adding the resulting two inequalities, using the relation , and the identity (2.9), we get
From the way we compute in Algorithm 2, i.e.,
Thus, from (3.6) and the definition of it follows that
Before proving our main complexity theorem to Algorithm 2, we note that the sequence generated by Algorithm 2 clearly satisfies and hence for all since .
Suppose is an optimal solution to problem (2.3). For any choice of , the sequence generated by Algorithm 2 satisfies:
Thus, the sequence produced by Algorithm 2 converges to . Moreover, if where , the number of iterations needed to obtain an -optimal solution is at most , where .
where the first inequality is due to , the first equality is from the facts that and , the third inequality is from letting in (3.5) and the last equality is due to
Thus, from we get
Moreover, it follows that if , i.e., , then where . This implies that for any , is an -optimal solution. ∎
Although we have assumed that the Lipschitz constants are known, and hence that is chosen in Algorithm 2 to be smaller than , this can be relaxed by using the backtracking technique in that chooses a at each iteration that is smaller than the used at the previous iteration and for which for all .
In this section, we present a variant of the fast multiple splitting algorithm (Algorithm 2) that is much more efficient and requires much less memory than Algorithm 2 for problems in which is large. This variant uses , where is the -dimensional vector with all ones, and replaces in the last line of Algorithm 2 by ; i.e., in the last line of Algorithm 2, we compute for by the formula:
It is easy to see that in this variant, the are all the same and the are all the same. We call this variant FaMSA-s, where s refers to the fact that this variant computes a “single” vector and a single vector at the -th iteration. It is given below as Algorithm 3.
It is easy to verify that the following analog of Lemma 8 applies to Algorithm FaMSA-s.
Suppose is an optimal solution to problem (2.3). The sequence generated by Algorithm FaMSA-s satisfies:
where and
The proof is very similar to the proof of Lemma 8; hence, we leave it to the reader. The main difference is that instead of using the inequality to replace the sum involving , we use the fact that to replace the sum involving in the proof. ∎
From Lemma 11, Theorem 9 with and , respectively, for replaced by and follows immediately for FaMSA-s.
Multiple splitting algorithms for nonsmooth problems
where is a positive smoothness parameter. It can be shown that is well defined and is in the class of and its gradient is Lipschitz continuous with constant (see Theorem 1 in ). Also, it is easy to show that the following relations hold for and :
For nonsmooth problems in imaging, data analysis, and machine learning, etc. with regularization terms that involve total variation and the nuclear norm, we can use similar smoothing techniques to smooth these nonsmooth functions, and then apply our multiple splitting algorithms to solve them.
Numerical experiments
We present some preliminary numerical experiments in this section. Specifically, we apply our MSA and FaMSA algorithms to solve the Fermat-Weber problem and a total variation and wavelet based image deblurring problem. All numerical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4GHZ CPU and 6GB of RAM.
The Fermat-Weber (F-W) problem can be cast as:
where is a smoothness parameter. The gradient of , where is the optimal solution to the optimization problem in (5.2). It is easy to show that Moreover, is Lipschitz continuous with constant . Now we can apply MSA, FaMSA and FaMSA-s to solve
The -th subproblem in all of these algorithms corresponds to solving the following problem:
It is easy to check that the optimal solution to problem (5.4) is given by
If we choose the doubly stochastic matrix to be in MSA as we do in FaMSA-s, all ’s are the same in MSA as they are in FaMSA-s. Hence, computing , for in both algorithms can be done efficiently as follows.
We compared the performance of MSA and FaMSA-s with the classical gradient method (Grad) and Nesterov’s accelerated gradient method (Nest) for solving (5.3). The classical gradient method for solving (5.3) with step size is:
The variant of Nesterov’s accelerated gradient method that we used is the following:
was less than . We tested the performance of these four solvers for different choices of , which is the step size for Grad and Nest. Note that since the ’s are the same in MSA with for all and in FaMSA-s, these two methods can be viewed as linearization methods in which the single function is linearized at the point with only one proximal term in the -th subproblem. So the step size for MSA and FaMSA-s is . Hence, the parameter for MSA and FaMSA-s was set to in our numerical tests.
Our results are presented in Table 1. The CPU times reported are in seconds. These results show that for the F-W problem, our implementations of MSA and FaMSA-s take roughly between two and three times as much time to solve each problem as taken by Grad and Nest, respectively. This is not surprising since it is clear that the computation of each set of vectors and for in (5.9) is roughly comparable to a single computation of the gradient, i.e., the gradients of , for . Moreover, for the simple F-W objective function, not much is gained by minimizing only one out of the individual functions , , when is large as it is in our tests. Note that the number of iterations required by MSA and Grad were exactly the same on our set of test problems. When is of a moderate size and the individual functions are more complicated, MSA should require fewer iterations than Grad.
The purpose of this set of tests was not to demonstrate any advantage that our algorithms might have over gradient methods. Rather, they were performed to validate our algorithms and show that the accelerated variants like algorithm Nest can reduce the number of iterations required to solve problems of the form (1.1). This is quite clear from the results reported in Table 1. We further note that FaMSA-s often takes one to three fewer iterations than Nest. Note that for some problems, the multiple splitting algorithm took only one iteration to converge. The reason was that for these problems, the number of points was much larger than the dimension of the space. Therefore, the points were very compact and fairly uniformly distributed around the initial point; hence that point was quite likely to be very close to the optimal solution.
2 An image deblurring problem
In this section, we report the results of applying our multiple splitting algorithms to a benchmark total variation and wavelet-based image deblurring problem from . In this problem, the original image is the well-known Cameraman image of size and the observed image is obtained after imposing a uniform blur of size (denoted by the operator ) and Gaussian noise (generated by the function randn in MATLAB with a seed of 0 and a standard deviation of ). Since the vector of coefficients of the wavelet transform of the image is sparse in this problem and the total variation norm of the image is expected to be small, one can try to reconstruct the image from the observed image by solving the problem:
Thus, the smooth version of problem (5.11) was:
However, when we applied our multiple splitting algorithms to (5.12), we actually performed the following computation on the -th iteration:
which is a standard TV-denoising problem. In our tests, we perform 10 iterations of the algorithm proposed by Chambolle in to approximately solve this problem. The second subproblem in (5.17) can be reduced to:
It is easy to check that the solution of (5.18) is given by:
Solving this linear system is easy since the operator has a special structure and thus can be inverted efficiently.
In our tests, we set , and used smoothing parameters . The initial points were all set equal to . We compared the performance of MSA, FaMSA, FaMSA-s and Grad for different and step sizes . In these comparisons, we simply terminated the codes after 500 iterations. The objective function value and the improvement signal noise ratio (ISNR) at different iterations are reported in Table 2. The ISNR is defined as , where is the reconstructed image and is the true image. As we did for F-W problem, we always used and since there were three functions in this problem, we used . For large , we did not report the results for all of the iterations since the comparisons are quite clear from the selected iterations. See Figure 1 for additional and more complete comparisons. We make the following observations from Table 2. For , FaMSA-s achieved the best objective function value in about 200 iterations and 152 CPU seconds. The best ISNR was also achieved by FaMSA-s, in about 300 iterations and 227 seconds. MSA and Grad were not able to obtain an acceptable solution in 500 iterations. In fact, they were only able to reduce the objective function to twice the near-optimal value of achieved by FaMSA-s. For , FaMSA-s achieved the best objective function value and ISNR in 100 iterations and 76 seconds and 125 iterations and 94 seconds, respectively. Again, MSA and Grad did not achieve acceptable results even after 500 iterations. For , MSA achieved the best objective function value, , after 500 iterations and 349 CPU seconds, while the best ISNR was achieved by FaMSA-s in 80 iterations and 61 seconds. Also, the best objective function value achieved by FaMSA-s was at the 60-th iteration after only 47 CPU seconds. We also note that for and , MSA was always better than Grad and FaMSA-s was always slightly better than FaMSA. Another observation was that MSA always decreased the objective function value for and , while FaMSA and FaMSA-s always achieved near-optimal results in a relatively small number of iterations and then started getting worse. However, in practice, one would always terminate FaMSA and FaMSA-s once the objective function value started increasing. For , MSA gave very good results while the other three solvers diverged immediately. Specifically, the best objective function value was achieved by MSA in 120 iterations and 80 CPU seconds, and the best ISNR was achieved by MSA in 200 iterations and 132 CPU seconds. Thus, based on these observations, we conclude that FaMSA-s attains a nearly optimal solution very quickly for small while MSA is more stable for large .
We also plotted some figures to graphically illustrate the performance of these solvers. Figures (a), (b) and (c) in Figure 1 plot the objective function value versus the iteration number for and , respectively. Figures (d), (e) and (f) in Figure 1 plot ISNR versus the iteration number for and . We did not plot graphs for , since FaMSA, FaMSA-s and Grad diverged from the very first iteration. From Figure 1 we can see the comparisons clearly. Basically, these figures show that FaMSA and FaMSA-s achieve a nearly optimal solution very quickly. We can also see from (b), (c), (e) and (f) that FaMSA-s is always slightly better than FaMSA and MSA is always better than Grad.
We also tested setting to the identity matrix in MSA and FaMSA, but this choice, as expected, did not give as good results.
To see how MSA performed for the deblurring problem (5.12), we show the original (a), blurred (b) and reconstructed (c) cameraman images in Figure 2. The reconstructed image (c) is the one that was obtained by applying MSA with after 200 iterations. The ISNR of the reconstructed image is 5.3182. From Figure 2 we see that MSA was able to recover the blurred image very well.
Conclusions
In this paper, we proposed two classes of multiple splitting algorithms based on alternating directions and optimal gradient techniques for minimizing the sum of convex functions. Complexity bounds on the number of iterations required to obtain an -optimal solution for these algorithms were derived. Our algorithms are all parallelizable, which is attractive for practical applications involving large-scale optimization problems.
Acknowledgement
We would like to thank the anonymous referee for making several very helpful suggestions.