A Unified Algorithmic Framework for Block-Structured Optimization Involving Big Data
Mingyi Hong, Meisam Razaviyayn, Zhi-Quan Luo, Jong-Shi Pang
I Introduction
With advances in sensor, communication and storage technologies, data acquisition is more ubiquitous than any time in the past. This has made available big data sets in many areas of engineering, biological, social, and physical sciences. While the proper modeling and analysis of such data sets can yield valuable information for inference, estimation, tracking, learning and decision-making, their size and complexity present great challenges in algorithm design and implementation.
Due to its central role in big data analytics, large-scale optimization has recently attracted significant attention not only from the optimization community, but also from the machine learning, statistics as well as the signal processing communities. For example, emerging problems in image processing, social network analysis and computational biology can easily exceed millions or billions of variables, and significant research is underway to enable fast accurate solution of these problems . Also, problems related to the design and provision of large scale smart infrastructures such as wireless communication networks require real-time efficient resource allocation decisions to ensure optimal network performance. Traditional general purpose optimization tools are inadequate for these problems due to the complexity of the model, the heterogeneity of the data, and most importantly the sheer data size . Modern large-scale optimization algorithms, especially those that are capable of exploiting problem structures, dealing with distributed, time varying and incomplete data sets, utilizing massively parallel computing and storage infrastructures, have become the workhorse in the big data era.
To be efficient for big data applications, optimization algorithms must have certain properties:
Each of their computational steps must be simple and easy to perform;
The intermediate results are easily stored;
They can be implemented in a distributed and/or parallel manner so as to exploit the modern multi-core and cluster computing architectures;
A high-quality solution can be found using a small number of iterations.
These requirements preclude the use of high-order information about the problem (i.e., the Hessian matrix of the objective), which is usually too expensive to obtain, even for modest-sized problems.
I-B The Block Coordinate Decent Method
A very popular family of optimization algorithms that satisfies most of the aforementioned properties is the block coordinate descent (BCD) method, sometimes also known as the alternating minimization/maximization (AM) algorithm. The basic steps of the BCD are simple: i) partition the entire optimization variables into small blocks, and ii) optimize one block variable (or few blocks of variables) at each iteration, while holding the remaining variables fixed. More specifically, consider the following block-structured optimization problem
where we have defined ; for the rest of variables , let . We refer the readers to Fig. 1 for a graphical illustration of the algorithm.
The BCD algorithm is intuitively appealing and very simple to implement, yet it is extremely powerful. It enjoys tremendous popularity in a wide range of applications from signal processing, communications, to machine learning. Representative examples include image deblurring , statistical learning , wireless communications , and so on. In recent years, there is a surge of renewed interests to extend and generalize the BCD type of algorithms due to its applications in modern big data optimization problems. Theoretically, the BCD algorithm and its variants have been significantly generalized to accommodate various coordinate update rules , and have been made suitable for implementation on modern parallel processing architecture . It can handle a wide range of nonsmooth nonconvex cost functions . Practically, it has been used in emerging large-scale signal processing and machine learning applications, such as compressive sensing/sparse signal recovery , matrix completion , tensor decomposition , to name just a few. A recent survey of this algorithm can be found in .
I-C The Block Successive Upper Bound Minimization Method
In this article, we introduce a unifying framework, called the Block Successive Upper bound Minimization (BSUM) method, which generalizes the BCD type of algorithms . Simply put, the BSUM framework includes algorithms that successively optimize certain upper-bounds or surrogate functions of the original objectives, possibly in a block by block manner. The BSUM framework significantly expands the application domain of the traditional BCD algorithms. For example, it covers many classical statistics and machine learning algorithms such as the Expectation Maximization (EM) method , the Concave-Convex Procedure (CCCP) and the multiplicative Nonnegative Matrix Factorization (NMF) . It also includes as special cases many well-known signal processing algorithms such as the family of interference pricing algorithms and the weighted minimum mean square error (WMMSE) algorithms for interference management in large-scale wireless systems.
The generality and flexibility of the BSUM offers an ideal platform to explore optimization algorithms for big data. Through the lens of the BSUM, one obtains not only a thorough understanding of the variety of algorithms and applications that are being covered, but more importantly the principle for designing a good algorithm suitable for big data. To this end, this article will first provide a concise overview of a few key theoretical characterizations of the algorithms that fall in the BSUM framework, BCD included. Our emphasis will be given to providing intuitive understanding as to when and where the BSUM framework should (or should not) work, and how its performance can be characterized. The second part of this article offers a detailed account of many existing large-scale optimization algorithms that fall in the BSUM framework, together with a few big data related applications that are of significant interests to the signal processing community. The last part of the article outlines some interesting extensions of the BSUM that further help expand its application domains. Throughout this article, special emphasis will be given to computational issues arising from big data optimization problems such as algorithm design for parallel and distributed computation, algorithm implementation on multi-core computing platforms and distributed storage of the data. In particular, we will discuss how these issues can be addressed in the BSUM framework by providing theoretical insights and examples from real practical problems.
II BSUM and Its Theoretical Properties
The complete description of the BSUM is given in Table I. We also refer the readers to Fig. 2 for a graphical comparison of the iterates generated by BSUM and BCD for a two dimensional problem. It should be clear at this point that when and no approximation is used (i.e., and at each iteration a single coordinate is selected), then the BSUM reduces to the classical cyclic BCD method. In Table II we also present several index selection rules that are covered by the BSUM framework. For simplicity of presentation, we will use the classical cyclic index selection rule where in the remainder of this article unless otherwise noted.
Next we introduce the precise definition of the approximation function. The main idea is that for each , the approximation should be an upper bound of the original objective function at the point of (hence the BSUM name of the framework). Please see Fig. 3 for an illustration of the upper bound minimization process. Intuitively, picking an upper bound approximation function is reasonable because optimizing it at least should guarantee some descent of the original objective ; see Fig. 3(c).
Using this definition, we make the following assumptions on the ’s.
Intuitively, Assumptions (A1) and (A2) imply that the approximation function is a global upper bound of ; while the assumption (A3) guarantees that the first order behaviors of the objective function and the approximation function are the same at the point of approximation (cf. Fig. 3). In Table III, we provide the readers with a few commonly used upper bounds that satisfy Assumption A; see also for additional examples. More discussion will be given in subsequent sections on how these bounds are used in practice.
It is worth mentioning that for a popular subclass of problem (1), Assumption A can be further simplified; see the following example [21, Proposition 2].
Consider the following special form of problem (1):
𝑟1r+1. It is clear from the figure that after solving the BSUM subproblem (5), , that is, the objective function is strictly decreased. Now that we have seen the main steps of the algorithm, next we describe its theoretical properties. We address the following questions related to the convergence of the BSUM: When does the BSUM converge? How fast does it converge? When does the BSUM fail and why? The answers to these questions will be instrumental in understanding the framework as well as evaluating the performance of various related algorithms.
II-B When does the BSUM converge?
In order to discuss the convergence property of the algorithm, we first investigate the optimality conditions for problem (1), which characterize the set of solutions that we would like our algorithm to reach. To this end, we introduce two related notions, one is called the stationary solution and the other is the coordinatewise minimum solution; see Table IV for their precise definitions. Roughly speaking, the coordinatewise minimum is the point where no single block , has a better direction to move to, while at a stationary point the entire vector cannot move to a better direction. Further, a stationary solution must be a coordinatewise minimum, but the reverse direction is generally not true; see the example next.
This example confirms that the coordinatewise minimum can be much inferior to the stationary solution. Therefore in the subsequent discussion we will mainly focus on finding the stationary solutions rather than the coordinatewise minimum. An immediate question is that: whether one can easily distinguish these two types of solutions, or for that matter, when does a coordinatewise minimum become a stationary solution? Let us define a regular point as a point that satisfies the following statement: if is coordinatewise minimum, then it is a stationary solution. It turns out that for a large and popular subclass of problem (1) expressed below in (7) where the nonsmooth function is separable across the blocks, all feasible points are regular.
The main convergence result for BSUM method is given below, which is adapted from [21, Theorem 2]
Suppose the cyclic coordinate selection rule is chosen, i.e., . Let be a sequence generated by the BSUM algorithm. Suppose Assumption A holds, and that each is regular. Then the following is true:
Suppose that the function is quasi-convex in for . Furthermore, assume that the subproblem (5) has a unique solution for any point . Then every limit point of is a stationary point of (1).
Suppose the level set is compact. Furthermore, assume that the subproblem (5) has a unique solution for any point , for at least blocks. Then converge to the set of stationary points, i.e.,
where and is the set of stationary points.
Here the first part of the result deals with the possibility of an unbounded sequence, whereas the second part assumes that the sequence lies in a compact set, therefore it has a slightly stronger claim. Note that Theorem 1 can be easily extended to all the coordinate selection rules given in Table II, and for the randomized version the convergence is with probability 1.
The readers should pay special attention to the conditions set forth by Theorem 1. First it says that the upper bound function needs to be picked carefully to satisfy both Assumption A and the requirement that at least subproblems (5) have a unique solution. However, when the objective function is convex, the uniqueness requirement of the per-block solution can be dropped; see . Also Theorem 1 requires the problem to be well-defined so that coordinatewise optimal solutions are equivalent to the stationary solution (which precludes objective function in Example 2). In the next subsection we will demonstrate, through a couple of concrete examples, that relaxing some of these conditions indeed leads to the divergence of the BSUM.
II-C When does BSUM fail?
In this subsection, we provide a few examples in which BSUM fails to converge to any stationary solutions. The examples in this section serve as reminders that in practice, in order to avoid those pitfalls, extreme care must be exercised when designing large-scale optimization algorithms.
Our first example comes from Example 2. It demonstrates the consequence of lacking the regularity condition.
Consider the following unconstrained convex optimization problem , where and . Clearly by defining , the objective function can be rewritten as , which is not regular at point (cf. Example 2). It follows that by setting (no approximation is used), and letting , the BSUM algorithm will not be able to move forward for either or , thus it will be stuck at the non-interesting point , far away from the only stationary solution .
The next example shows that BSUM fails to converge if the feasible set is no longer a Cartesian product of feasible sets , a fact that we have taken for granted so far.
Consider the following simple quadratic problem:
The optimal objective value is . Consider the BSUM algorithm with an arbitrary approximation function satisfying Assumption A, but initiated at the point . Then the BSUM method will be stuck at this non-interesting point without making any progress, because it is not possible to change a single block without violating the coupling constraint.
Our next example shows that BSUM could diverge if the approximation function violates Assumption A.
The optimal objective value is , with . Consider using the BSUM algorithm with a linear approximation function, which violates Assumption (A2). More specifically, for a given feasible tuple , the ’s subproblem becomes
Clearly the optimal solution is either or . The same happens when solving the subproblem for . Therefore the algorithm will never reach the optimal solution . Further, if the feasible sets of and are unbounded, then the BSUM can diverge.
Our last example is taken from Powell . It shows that without the uniqueness assumption of each subproblem (5), the BSUM algorithm is not necessarily convergent.
Consider the following unconstrained problem
where the notation means . In this case, fixing and optimizing over yields the following solution
Similar solution can be obtained for and as well. Suppose we set for all (no approximation is used), and letting for some . Then it can be shown the applying the cyclic version of the BSUM algorithm, the iterates will be cycling around six points , , , , , , and none of these six points is a stationary solution of the original problem. The reason for such pathological behavior is that, in any one of the six limit points above, there are at least two subproblems that have multiple optimal solutions. For example, at and fixing (resp. ), the optimal solution for (resp. ) is any element in the interval ; cf. (10).
A natural question at this point is, can we make the BSUM work for these examples? The answer is affirmative, but how this can be done requires a case by case study. To handle the first two examples (i.e., Example 3 – 4), a generalized version of BSUM is needed, which will be discussed in section V. For the third example, one can simply pick a better upper bound to guarantee convergence. For example, if we pick the proximal upper bound (cf. Table III): , then each subproblem will have a unique solution, and the algorithm will not be trapped by the non-interesting solutions given in Example 6. Notice that the use of inhibits the move of from its current position . Thus, the main message here is that when optimizing each block, it is beneficial, at least theoretically, to be less greedy so that a conservative step is taken towards reducing the objective. The extent of the “conservativeness” for the per block update is then naturally characterized by the chosen upper bounds. Quite interestingly, the difficulty with the non-unique subproblem solution can also be resolved by using randomization. Formally, we have the following corollary to Theorem 1.
Suppose the level set is compact. Then under the randomized block selection rule, the iterates generated by the BSUM algorithm converge to the set of stationary points almost surely, i.e.,
II-D How fast does the BSUM converge?
Now that we have examined the convergence of the BSUM, let us proceed next to characterize the convergence speed of the algorithm. No doubt that this is an important issue, especially so for big data optimization – the sheer size of the data and limited computational resource makes it difficult to optimize a problem to global optimality. Consequently, we are generally satisfied with high-quality suboptimal solutions, and are mostly concerned with how quickly such solutions can be obtained.
Recently, extensive research efforts have been devoted to analyzing the convergence rate for various BSUM-type algorithms, most of which use randomized coordinate selection rules and/or quadratic upper bound functions (cf. Table III) to solve convex problems; for example see . Since it is not possible to go over all the details of these different variations of BSUM, here we present in a high level a fairly general result that covers a large family of upper bound functions satisfying Assumption A, as well as all coordinate selection rules outlined in Table II.
To this end, let us make the following additional assumptions.
, where is a smooth convex function with Lipschitz continuous gradient, i.e. there exists a constant such that Further both and ’s are convex functions.
The level set is compact.
Each upper bound function is strongly convex with respect to .
An -optimal solution is defined as an , where is the globally optimal objective value of problem (7). Suppose both Assumptions A and B are satisfied. Then it is shown in that BSUM takes at most iterations to find an -optimal solution, for all coordinate rules specified in Table II, where is a constant only related to the description of the problem. Such type of convergence rate is known as sublinear convergence. Here the main message is that under Assumptions A and B, the algorithm generally converges sublinearly in the order of . Further, for different special forms of BSUM, the constant in front of can be significantly refined so that it is independent of problem dimension; see . It is also interesting to note here that when the objective is either strongly convex, or convex but with certain special structure, the BSUM achieves linear rate of convergence. That is, BSUM takes at most iterations to find an -optimal solution, which is much faster than the sublinear rate; see e.g., for the related discussions.
Finally, we briefly mention that it is possible to relax certain conditions in Assumption B to obtain refined rates. For example, shows that dropping the per-block strong convexity assumption in (B3) still achieves an sublinear convergence. In it is shown that when removing the convexity Assumption (B1), it is also possible to characterize the convergence rate to stationary solutions. In it is shown that when there are two blocks of variables, the cyclic version of the BSUM can be accelerated to achieve an improved complexity. In a few recent works , it is shown that when randomized block selection and the quadratic upper bound are used, it is possible to accelerate the BSUM with any blocks.
III Algorithms Covered by the BSUM Framework
Now that we have a good understanding of the theoretical properties of the family of BSUM algorithms, we demonstrate in this section the wide applicability of the BSUM framework by revealing its connection to a few well-known algorithms for high dimensional massive data analysis. For each of the examples presented below, we pay special attention to the design of the upper bound functions.
The first algorithm that the BSUM covers is obviously the classic BCD described in Section I-B. Here the upper bound function is simply the original objective itself, i.e., . We would like to mention that all the convergence and rate of convergence analysis of BSUM carries over to the BCD method. Specifically, the result in Section II-D implies that the BCD method (with coordinate update rules specified in Table II) converges in a sublinear manner whenever Assumptions (B1) and (B2) are satisfied. This result by itself is interesting, as there has been limited theoretical analysis for general form of BCD algorithm when applied to problems satisfying Assumptions (B1) and (B2).
III-B The Convex-Concave Procedure (CCCP)
Consider the following unconstrained nonconvex problem, known as the difference of convex (DC) program: where both and are convex functions. The well-known CCCP algorithm generates a sequence by solving
where . Clearly, the linear upper bound in Table III is used here, therefore CCCP is a special case of the BSUM algorithm with a single block of variables. Furthermore, the updates can also be done in a block coordinate manner.
III-C The majorization-minimization (MM) algorithm
The MM algorithm which has been widely used in statistics , can be viewed as the single block version of BSUM. Consider the problem of where is a smooth function. The basic idea of the MM algorithm is to first construct a “majorization” function for the original objective , such that
Then the algorithm generates a sequence of iterates by successively minimizing . Clearly this algorithm represents a slight generalization of the CCCP discussed in the previous subsection, but nevertheless still falls in the framework of BSUM.
As another concrete example of the MM algorithm, let us consider the celebrated expectation maximization (EM) algorithm . Let be a vector observation used for estimating the parameter . The maximum likelihood estimate of is given as (where denotes the conditional probability of given )
Furthermore, it is not hard to see that , therefore, both conditions in (11) are satisfied. Similarly as in the previous case, one can extend the EM to a block coordinate version; see for detailed discussion.
III-D The Proximal Point Algorithm (PPA)
The classical PPA (see, e.g., [50, Section 3.4.3]) obtains a solution of the problem by solving the following equivalent problem
where is a convex function, is a closed convex set, and is a coefficient. Clearly problem (13) is strongly convex in both and so long as is convex (but not jointly strongly convex in ). This problem can be solved by performing the following two steps alternatingly
Equivalently, the PPA algorithm can be viewed as successively minimizing the single block version of the proximal upper bound given in Table III. Note that for a problem with a single block of variables, if is coordinatewise minimum, then it must be a global minimum solution. Therefore every feasible solution is regular, and the convergence of PPA is covered by BSUM.
Further, the PPA can be generalized to solve the multi-block problem (1), where is convex in each of its block components, but not necessarily strictly convex. Directly applying BCD may fail to find a stationary solution for this problem, as the per-block subproblems can contain multiple solutions (cf. Example 6). Alternatively, we can consider an alternating PPA , which successively solves the following subproblem
Clearly this algorithm is a special form of BSUM with strongly convex proximal upper bound (cf. Table III). It follows that each subproblem has a unique optimal solution, and by Theorem 1 it must converge to a stationary solution.
III-E The Forward-Backward Splitting (FBS) Algorithm
The FBS algorithm (also known as the proximal splitting algorithm, see, e.g., and the references therein) for nonsmooth optimization solves the composite problem (6) with a single block of variables (i.e., ), where is convex and lower semicontinuous; is convex and has Lipschitz continuous gradient, i.e., , and for some .
Define the proximity operator as
which is the quadratic upper bound in Table III, with {\mbox{\boldmath\Phi}}_{1}:=\frac{1}{\beta}\mathbf{I}. It is easy to see that the iteration (15) is equivalent to the following iteration therefore it again falls under the BSUM framework.
Similar to the previous example, we can generalize the FBS algorithm to solve multiple block problems of the form (7). The resulting algorithm, sometimes also known as the block coordinate proximal gradient (BCPG) method, has received significant attention recently due to its efficiency for solving certain big data optimization problem such as LASSO . We refer the readers to for recent developments and applications for BCPG.
Here we make a special note that by appealing to the general convergence rate result in Section II-D, the BCPG method with any coordinate selection rules in Table II gives a sublinear convergence rate, when it is used to solve problem (7) that satisfies Assumption B.
III-F The Nonnegative Matrix Factorization (NMF) Algorithm
Here means the th component of matrix . Below we show that when the iterates are well-defined (i.e., and ), the NMF iteration (18a) – (18b) is also covered by BSUM .
Let and represent the th column of and , respectively. Then at a given iterate , the subproblem for optimizing is given by
Define the upper bound function as
where {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i}) is a diagonal matrix given by
Clearly, {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i})\succ 0, and it is easy to show that {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i})\succ(W^{r})^{T}W^{r}, where is the Hessian of the objective of (19) . This implies that is the quadratic upper bound given in Table III of . Further, one can check that the subproblem that minimizes has a unique solution, given by (18a). Similar analysis can be established for the -block update rule as well. Therefore we conclude that the iterates (18a) – (18b) are a special case of BSUM. Finally we note that it is also possible to use different upper bound functions to derive more efficient update rules for the NMF problem (17); for example see where both the concave upper bound and the Jensen’s upper bound (cf. Table III) are used.
III-G The Iterative Reweighted Least Squares (IRLS) Method
The IRLS method is a popular algorithm used for solving big data problems such as sparse recovery . Consider the following problem
where is some small constant and denotes the smooth part of the objective. The IRLS algorithm solves problem (21) by performing the following iteration
Define the following function for :
It is clear that , so Assumption (A1) is satisfied. To verify Assumption (A2), we apply the arithmetic-geometric inequality, and have
Then according to Example 1, Assumption (A3) is automatically true, therefore we have verified that defined in (22) is indeed an upper bound function for the smooth function . It follows that the IRLS algorithm corresponds to a single-block BSUM algorithm. Notice that using the BSUM framework we can easily generalize the IRLS to the multi-block scenario.
IV Applications of the BSUM Framework
In this subsection, we briefly review a few applications of the BSUM framework in wireless communication, bioinformatics, signal processing, and machine learning.
When linear beamformers are employed at the transmitters and receivers, the transmitted signal and the estimated received data stream can be respectively written as
A crucial task in modern wireless networks is to design the transmit and receive beamformers and in order to maximize a given utility of the system. Here, for simplicity of presentation, we consider the sum rate utility function as our objective. Therefore, our goal is to solve the following optimization problem:
where is the total power budget of user and , which is the communication rate of user , is given by
Problem (23) is nonconvex and known to be NP-hard . Using the well-known relation between the signal to interference plus noise ratio (SINR) and the mean square error (MSE) value, one can rewrite (23) as :
where is the MSE value and is given by
Since the function is concave, it is upper bounded by its first order approximation (i.e., the linear upper bound in Table III). Therefore, we can define the function
where is the optimization variable and denotes the beamformer at iteration . It is not hard to see that the approximation function in (25) is a valid upper-bound in the BSUM framework and at each iteration , this choice of approximation function leads to a quadratic programming problem which has closed form solutions. The resulting algorithm, dubbed weighted minimization of mean square error (WMMSE), converges to a stationary point of the problem and in practice it typically converges in a few iterations even for large size problems .
The reader is referred to for more details of the algorithm and its extensions to various beamformer design scenarios and different utility functions. It is also worth noting that many other interesting transceiver design algorithms also fall into the BSUM framework; see for more details.
IV-B Bioinformatics and Signal Processing
Here we briefly outline two interesting big data applications of the BSUM framework in bioinformatics and signal processing.
An essential step in the analysis of modern high throughput sequencing technologies of biological data is to estimate the abundance level of each transcript in the experiment. Mathematically, this problem can be stated as follows. Consider transcript sequences with the corresponding abundance levels such that . Let be noisy sequencing reads originated from the transcript sequences, where each read , , is originated from only one of the transcript sequences . Given the observed reads, the likelihood of the abundance levels can be written as
where can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., . Therefore, given , the maximum likelihood estimation of the abundance levels can be stated as
As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a local tight upper-bound of the objective function. In particular, the eXpress software solves the following optimization problem at the -th iteration of the algorithm:
Using Jensen’s inequality, it is not hard to check that (27) is a valid upper-bound of (26) in the BSUM framework. Moreover, (27) has a closed form solution obtained by
which makes the algorithm computationally efficient at each step. In practice, the above algorithm for abundance estimation converges in a few iterations. Moreover, this algorithm is perfectly suitable for distributed storage and multi-core machines. In particular, since the number of reads is much larger than the number of sequences , one can store the reads in different processing units. Hence at each iteration , the processing unit , can compute the local value
where is the set of reads stored at processor with . Then, all processors update their global abundance estimate through the consensus procedure
For a very recent application of BSUM algorithm in gene RNA-seq abundance estimation, the readers are referred to .
IV-B2 Tensor decomposition
In general, finding the CP decomposition of a given tensor is NP-hard . A well-known algorithm for finding the CP decomposition is the alternating least squares (ALS) algorithm proposed in . This algorithm is in essence the BCD algorithm on the following optimization problem
In the ALS algorithm, we consider three blocks of variables: , , and . At each iteration of the algorithm, two blocks are held fixed and only one block is updated by solving (28). The block selection rule is cyclic and therefore one needs the uniqueness of the minimizer assumption at each iteration for theoretical convergence guarantee. Clearly, this assumption does not hold in general in (28) and therefore, theoretically, convergence is not always guaranteed. In addition, another well-known drawback of the ALS algorithm is the “swamp” effect where the objective remains almost constant for many iterations and then starts decreasing again. It has been observed in the literature that the employment of proximal upper-bound (see Table III) could help reducing the swamp effect . It is also suggested in that decreasing the proximal coefficient ( in Table III) during the ALS algorithm could further improve the performance of the algorithm. Notice that these modifications in the algorithm makes the algorithm a special case of BSUM framework. Consequently, its theoretical convergence is also guaranteed by Theorem 1.
Figure 4 compares the performance of the naive ALS algorithm with the one using proximal upper-bound. As can be seen from the figures the proximal ALS algorithm has less swamp effect as compared to naive ALS method. The reader is referred to for more details of the algorithm and to for the application of BSUM and CP decomposition in gene expression and brain imaging.
IV-C Machine learning: sparse dictionary learning and sparse linear discriminant analysis
where the sets and are given based on the prior knowledge on the data. The function measures the goodness-of-fit of the model. For example, a popular choice of the function and the set leads to the following optimization problem
where the first term in the objective keeps our estimated signals close to the training set and the second term forces the representation to be sparse. One popular approach in the dictionary learning algorithm is to alternatingly update the dictionary and the coefficients . However, naively updating these two variables to its global optimum requires solving a sparse recovery problem at each iteration which is costly for big size problems. Motivated by the idea of inexact steps in BSUM framework, one can iteratively replace the objective by a locally tight upper-bound which is easier to minimize at each iteration and hence leading to computationally cheaper steps in the algorithm. It is not hard to see that utilizing the quadratic upper-bound in Table III with diagonal matrices {\mbox{\boldmath\Phi}}_{i} leads to closed form updates at each step . Unlike many existing algorithms in the literature , the resulting algorithm is guaranteed to converge to the set of stationary solutions theoretically as the result of Theorem 1.
Figure 6 and Table V show the performance of the resulting algorithm for dictionary learning in an image denoising problem. The denoising is performed on the Lena image corrupted by additive Gaussian noise with various variances . As can be seen from Table V, the proposed algorithm results in larger PSNR values than the K-SVD method when the noise level is large. Moreover, the proposed algorithm contains less visual artifacts. Furthermore, each step of the proposed algorithm is in closed form and it is computationally favorable, while each step of the K-SVD method requires an inner iterative method.
IV-C2 Sparse linear discriminant analysis
where is observations mean in class . Similarly, the standard between class covariance estimate is given by
Unfortunately, when the number of features is large relative to , the matrix is rank deficient and therefore problem (29) is ill-posed. In order to resolve this issue and to have a small generalization error, suggests to regularize the optimization problem with a convex penalty function ; and solve
Clearly, this optimization problem is non-convex. As suggested in , one can linearize the first part of the objective in (30) iteratively to obtain a tight upper-bound of the objective. It is not hard to see that the algorithm used in is BSUM with the linear upper-bound given in Table III.
V Extensions
In this section, we discuss extensions and generalizations of the BSUM framework in various settings.
Consider the following stochastic optimization problem
where is a closed convex set and is a random variable modeling the uncertainty in our optimization problem. A standard classical approach for solving (31) is the sample average approximation (SAA) method; see and the references therein. At iteration of SAA method, given a new realization of the random variable , SAA method generates a new iterate by solving a problem with the sample average as its objective, where are independent identically-distributed realizations of the random variable .
A major drawback of the SAA method is that each of its iteration can be computationally very expensive. The computational inefficiency arises from either the non-convexity of the objective, or not having closed form solutions at each iteration.
Motivated by the BSUM framework, authors of suggest to use an inexact version of SAA method, in which a sequence of upper bounds of the objective are minimized. In particular, at each iteration , the optimization variable is updated by
where is an upper-bound of the function around the point . The approximation function is assumed to be in the form of BSUM approximation. The resulting algorithm, named stochastic successive upper bound minimization (SSUM), is guaranteed to converge to the set of stationary solutions almost surely; see for more details. Further, it is shown to be capable of dealing with various practical problems in signal processing and machine learning. For example, as we will see shortly, the authors in apply SSUM framework to cope with uncertainties in channel estimation for wireless beamformer design problem. As another example, the online sparse dictionary leaning algorithm proposed in is a special case of SSUM.
As an example of the SSUM method, consider the wireless transceiver design problem discussed in Section IV-A where the channel coefficients are not exactly known. In this scenario, we can consider the channel coefficients as random variables and solve the following stochastic optimization problem
which is the stochastic counterpart of the optimization problem (23). Utilizing the upper bound (25) in SSUM algorithm leads to the stochastic WMMSE algorithm . Figure 7 illustrates the numerical performance of the SSUM methods as compared with SAA. At each iteration of SAA procedure, one should solve a nonconvex optimization problem. Two different methods are considered: the gradient descent method with random initialization and the WMMSE algorithm which is known to converge in few iterations for this problem. As can be seen from the figure, the running time of the SAA algorithm is much longer than that of the SSUM.
V-B Coupling constraints
So far in this article, we have assumed that the constraints in the optimization problem is separable and convex. In other words, the constraint set in (1) is of the form with each being convex. A natural extension of the BSUM framework is to modify it in order to deal with coupling and nonconvex constraints.
Consider the following convex problem with linear coupling constraints
At each iteration of the ADMM method, either a primal block variable is updated according to
or the dual Lagrange multiplier is updated according to the gradient ascent rule
where is the dual step-size at iteration . The update orders for the primal and dual variables could be either cyclic or randomized.
Similar to the BSUM framework, one can replace the augmented Lagrangian function with its tight upper-bound at iteration where
with being a locally tight approximation of the function around the point satisfying Assumption A. The resulting algorithm, named Block Successive Upper Bound Minimization Method of Multipliers (BSUMM) , is guaranteed to converge to the global optimal of (33) under some regularity assumptions . For extensions to non-convex problems, please see the recent work . There are a few other interesting techniques to deal with linearly coupling constraint. For example references propose to randomly pick 2 blocks of variables to update at each iteration, and reference propose new algorithms based on minimizing the augmented Lagrangian function. We refer the readers to these papers for more related works in this direction.
Let us illustrate an application of BSUMM to a multi-commodity routing problem, which arises in the design of next generation cloud-based communication networks . Consider a connected wireline network which is controlled by Network Controllers (NCs) as illustrated in Fig. 8. Let denote the set of network nodes, which is partitioned into subsets, i.e., , , . The set of directed links that connect nodes of is denoted as , where denotes the directed link from node to node . Each NC controls and the links connecting these nodes, i.e., (cf. Fig. 8). The network is called the subnetwork .
Our objective is to transport data flows over the network, with each flow being routed from a source node to the sink node . We use to denote flow ’s rate, and use to denote its rate on link . We also assume that a master node exists which controls the data flow rates . The central NC controls the subnetwork , consisting of the master node and the links connecting different subnetworks, i.e., .
Let us consider two types of network constraints, as we list below.
Link capacity constraints. Assume each link has a fixed capacity denoted as . The total flow rate on link is constrained by
where is the all-one vector and .
Flow conservation constraints. For any node and data flow , the total incoming flow should be equal to the total outgoing flow:
where and denote the set of links going into and coming out of a node respectively; if , otherwise .
To provide fairness to the users, let us maximize the minimum rate of all data flows. The problem can be formulated as the following linear program (LP)
where and . Obviously, one can use off-the-shelf optimization packages such as Gurobi to solve the LP (36), but this is only viable in a centralized setting where all the flows are managed by a single controller.
To enable distributed/parallel network management across the NCs, we need to allow each NC to independently optimize the variables belonging to the subnetwork . However, this task is difficult because the optimization variables of problem (36) is coupled (indeed each flow rate appears in exactly two flow conservation constraints). To address this problem, we introduce a few sets of new variables to decouple the flow conservation constraints across different subnetworks (we refer the readers to for the detailed reformulation). The reformulated problem is given by
where and and are some feasible sets, and are the block-variables. By applying the BSUMM, we can obtain a parallel/distributed algorithm. A few remarks about the implementation of this algorithm are in order.
The replication of link/flow variables for links across different subnetworks allows each subnetwork to be considered separately and independently. This feature makes the BSUMM subproblems solvable in parallel. The requirement of the replicated variables being the same as the original variables is enforced by the linear coupling constraints, and they can be satisfied asymptotically as the BSUMM algorithm converges.
The subproblems of the proposed BSUMM-based algorithm can be solved by each NC very efficiently. For example the update of can be performed by each NC in closed-form; the update of can be performed by running the well-known RELAX code .
A careful implementation of the BSUMM allows the NCs to act asynchronously, in the sense that they do not need to coordinate with each other for computation. Such asynchronous implementation has the potential of greatly improving the computational efficiency.
We illustrate the BSUMM implementation over a mesh wireline network with nodes, which is randomly partitioned into subnetworks with directed links within these subnetworks and directed links connecting the subnetworks. The capacities for the links within (resp. between) the subnetworks are uniformly distributed in MBits/s (resp. MBits/s). All simulation results are averaged over randomly selected data flow pairs and link capacity.
To demonstrate the benefit of parallelization, we also utilize a high performance computing (HPC) cluster, and make each computing node to be a NC. We compare a few different algorithms, listed below:
Solving a large-scale LP by Gurobi , a centralized solver;
The synchronous BSUMM algorithm with NCs, with the computation done by either a single or by distributed computing cores;
The asynchronous BSUMM with NCs, with the computation done in distributed computing cores.
Note that the asynchrony in the network arises naturally from the per-node computational delay and network communication delay. In the following table we demonstrate the performance of various algorithms when . Clearly the asynchronous BSUMM with a small number of NCs outperforms all the rest of the algorithms.
The numerical results suggest that appropriate network decomposition and asynchronous implementation are both critical for the fast convergence of BSUMM. In practice, we observe that the network should be decomposed following a few guidelines: i) the computation burden across the subnetworks is well balanced; ii) the subroutine within the network can achieve its maximum efficiency; iii) the total number of replicated auxiliary variables is small.
V-B2 Nonconvex constraints
The BSUM idea can be straightforwardly extended to a nonconvex constraint scenario for single block optimization problems. To proceed, let us consider the optimization problem
As illustrated in Figure 9, the iterative approximation of the constraints is a restriction of the constraints and hence the iterates remain feasible during the algorithm. If, in addition, some constraint qualification conditions are satisfied, the resulting algorithm is guaranteed to converge to the set of stationary solutions of (37); see [38, Theorem 1] for detailed conditions and analysis.
V-C Parallel version and extensions to game theory
With the recent advances in multicore and cluster computational platforms, it is desirable to design “parallel” algorithms for multi-block optimization problem where multiple cores update the block variables in parallel to optimize the objective function. A naive parallel extension of the BCD approach for solving (1) is to update all blocks (or a subset of them) in parallel by solving
Unfortunately this naive extension of the BCD algorithm does not converge in general and might result in zig-zag/oscillating or divergent behavior. As an example, consider the problem
Clearly, this problem is convex with bounded feasible set and its optimal value is zero. However, the above naive parallel extension of the algorithm leads to the following iteration path:
which is clearly not convergent. This is caused by aggressive steps used in the algorithm. To make the algorithm convergent, it is then necessary to employ small enough and controlled steps. Furthermore, in the case of nonconvex objective function in (1), the approximation functions could be again used to obtain computationally efficient update rules. The resulting algorithm, dubbed parallel successive convex approximation (PSCA), is summarized in Table VII. To see the convergence analysis of this algorithm and other related algorithms such as the Flexible Parallel Algorithm (FPA), the reader is referred to and the references therein.
Notice that PSCA can be viewed as a way of solving a multi-agent optimization problem where multiple agents/users try to optimize a common objective by updating their own variable iteratively. Furthermore, it can be used in a game theoretic setting where each player in the game utilizes the best response strategy by optimizing a locally tight upper-bound of its own utility function. This algorithm is guaranteed to converge for some particular class of games under some regularity assumptions on the players utility functions. The convergence analysis presented in is based on certain contraction approach as well as monotone convergence for potential games.
Figure 10 illustrates the behavior of cyclic and randomized parallel PSCA method as compared with their serial counterparts (i.e., the “Cyclic BCD” and the “Randomized BCD” ) applied to the LASSO problem. The performance of the “PSCA” method is also illustrated for different number of processors and various block selection rules. As can be seen from the figures, parallelizaiton can result in more efficient algorithm; however, the convergence speed does not grow linearly with the number of processing cores. Moreover, increasing the number of processors beyond certain point results in slower convergence, which can be attributed to the increased communication overhead among the nodes.
It is also worth noting that the parallel update rule is very useful in dealing with distributed data sets. Consider solving the LASSO problem with the following objective: Assume we have processing cores each having their own memory. Let us partition the matrix and vector into blocks: . If PSCA is implemented in a way that each core is only responsible for updating block , then at each iteration , core ’s problem of interest can be written as
where . Notice that the value of can be calculated by letting each node compute the value of and broadcast it to other nodes. Hence under this architecture, each node does not need to know the complete matrix and only local information is enough for a distributed implementation of the PSCA method.
V-D Practical Considerations
There are a number of factors that we need to consider when using the BSUM framework.
The first consideration is about the choice of the upper bound. What is a good bound for a given application? The answer is generally problem dependent, as we have already seen in a few examples. The general guideline is that a good upper bound should be able to ensure algorithm convergence, best exploit the problem structure and make the subproblems easily solvable (preferably in closed-form). For example a simple proximal upper bound is not likely to perform well for the transceiver design problem discussed in Section IV-A, as the resulting subproblems will not decompose over the variables.
The second consideration is about the choice of the block update rules. As we have seen in Section II-D, different update rules lead to quite distinct convergence behaviour. For convex problems, deterministic rules such as the cyclic rule promise the worst-case convergence rates, while the randomized rule ensures convergence rate in either averaged or high probability sense. Further, there is barely any theoretical rate analysis for nonconvex problems, regardless of the block selection rules. Therefore the best strategy in practice is to perform extensive numerical study and pick the best rule for the application at hand. For example, researchers have found that MBI rule is effective for certain tensor decomposition problem ; the cyclic rule can be superior to the randomized rule for certain LASSO problem, and certain G-So rule can outperform the randomized rule .
The third consideration is about the choice of the parallelization schemes. There has been extensive research on parallelizing various special cases and variations of the BSUM type algorithm, see and the references therein. These algorithms differ in a number of implementation details and in their applicability. For example, most of the implementations use randomized block selection rules to pick the variable blocks, while and additionally use certain variations of the G-So rule. The majority of the schemes only work for convex problems, with the exception of which work for general nonsmooth and nonconvex problems in the form of (7). When assessing whether a given problem is suitable for parallelization, it is important to know that oftentimes the number of blocks that can be updated in parallel is data dependent. For example when solving LASSO problems, it is shown in and that the degree of parallelization is dependent on the maximum eigenvalue of certain submatrices of the data matrix. Some recent result shows that for certain randomized coordinate descent method, such dependency could be mild. For solving general convex nonsmooth problems, shows that the stepsize should be carefully selected based on both the “separability” of the problem (or the sparsity of the data matrix) as well as the degree of parallelization. If the application at hand does not satisfy these conditions, the alternatives usually are: 1) exploit the problem structure and pick a good upper bound, so that the subproblems are decomposable, leading to parallel and stepsize-free updates; see for example the NMF problem in Section III-F and the WMMSE algorithm in Section IV-A; 2) to use the diminishing stepsizes for updating the blocks, see for example .
VI Issues and Open Research Problems
This article presents a comprehensive algorithmic framework called BSUM for block-structured large-scale optimization. The main strength of the BSUM framework is its strong theoretical convergence guarantee and its flexibility. As demonstrated in this article, the BSUM framework covers a number of well-known but seemingly unrelated algorithms as well as their new extensions. Moreover, it is amenable to a number of different data models as well as to parallel implementation on modern multi-core computing platforms.
To close, we briefly highlight a couple of issues and open research topics related to the BSUM framework.
Communication delay and overhead in parallel implementations: As discussed in subsection V-C, the convergence speed of the parallel version of BSUM framework does not increase linearly with the number of computational nodes. In fact, after some point, increasing the number of computational nodes can lead to slower convergence speed. As mentioned before, this is due to the delay caused by communication among the nodes. This observation gives rise to two important research questions: First, given the maximum allowable number of computation nodes and the communication overhead of the nodes, what is the optimum choice of the number of cores for solving a given optimization problem? Answering this question requires computation/communication tradeoff analysis of the proposed optimization approach. Second, can the BSUM framework be extended and implemented in a (semi-)asynchronous manner? If this is possible, then the communication overhead can be reduced significantly since the nodes are not required to wait for each other before updating the variables, making the algorithm lock-free. For recent efforts on this research direction see .
Nonlinear coupling constraints: As we observe in Section V, the BSUM framework can also be used in the presence of linear coupling or nonconvex decoupled constraints. How can the BSUM framework be generalized to problems with nonlinear coupling constraints? More precisely, can the BSUM framework with block-wise update rules be applied to the optimization problem of the following form?
Example 4 shows that the naive extension of the BCD approach fails to find the optimal solution even in the convex setting. A popular approach to tackle the above problem is to place the constraints in the objective using Lagrange multipliers and update the multipliers iteratively. However, this approach typically leads to double-loop algorithms and requires subgradient steps in the dual space which is known to be slow.