A Unified Algorithmic Framework for Block-Structured Optimization Involving Big Data

Mingyi Hong, Meisam Razaviyayn, Zhi-Quan Luo, Jong-Shi Pang

I Introduction

With advances in sensor, communication and storage technologies, data acquisition is more ubiquitous than any time in the past. This has made available big data sets in many areas of engineering, biological, social, and physical sciences. While the proper modeling and analysis of such data sets can yield valuable information for inference, estimation, tracking, learning and decision-making, their size and complexity present great challenges in algorithm design and implementation.

Due to its central role in big data analytics, large-scale optimization has recently attracted significant attention not only from the optimization community, but also from the machine learning, statistics as well as the signal processing communities. For example, emerging problems in image processing, social network analysis and computational biology can easily exceed millions or billions of variables, and significant research is underway to enable fast accurate solution of these problems . Also, problems related to the design and provision of large scale smart infrastructures such as wireless communication networks require real-time efficient resource allocation decisions to ensure optimal network performance. Traditional general purpose optimization tools are inadequate for these problems due to the complexity of the model, the heterogeneity of the data, and most importantly the sheer data size . Modern large-scale optimization algorithms, especially those that are capable of exploiting problem structures, dealing with distributed, time varying and incomplete data sets, utilizing massively parallel computing and storage infrastructures, have become the workhorse in the big data era.

To be efficient for big data applications, optimization algorithms must have certain properties:

Each of their computational steps must be simple and easy to perform;

The intermediate results are easily stored;

They can be implemented in a distributed and/or parallel manner so as to exploit the modern multi-core and cluster computing architectures;

A high-quality solution can be found using a small number of iterations.

These requirements preclude the use of high-order information about the problem (i.e., the Hessian matrix of the objective), which is usually too expensive to obtain, even for modest-sized problems.

I-B The Block Coordinate Decent Method

A very popular family of optimization algorithms that satisfies most of the aforementioned properties is the block coordinate descent (BCD) method, sometimes also known as the alternating minimization/maximization (AM) algorithm. The basic steps of the BCD are simple: i) partition the entire optimization variables into small blocks, and ii) optimize one block variable (or few blocks of variables) at each iteration, while holding the remaining variables fixed. More specifically, consider the following block-structured optimization problem

where we have defined xir1:=(x1r1,,xi1r1,xi+1r1,,xnr1)x^{r-1}_{-i}:=(x^{r-1}_{1},\cdots,x^{r-1}_{i-1},x^{r-1}_{i+1},\cdots,x^{r-1}_{n}); for the rest of variables jij\neq i, let xjr=xjr1x^{r}_{j}=x^{r-1}_{j}. We refer the readers to Fig. 1 for a graphical illustration of the algorithm.

The BCD algorithm is intuitively appealing and very simple to implement, yet it is extremely powerful. It enjoys tremendous popularity in a wide range of applications from signal processing, communications, to machine learning. Representative examples include image deblurring , statistical learning , wireless communications , and so on. In recent years, there is a surge of renewed interests to extend and generalize the BCD type of algorithms due to its applications in modern big data optimization problems. Theoretically, the BCD algorithm and its variants have been significantly generalized to accommodate various coordinate update rules , and have been made suitable for implementation on modern parallel processing architecture . It can handle a wide range of nonsmooth nonconvex cost functions . Practically, it has been used in emerging large-scale signal processing and machine learning applications, such as compressive sensing/sparse signal recovery , matrix completion , tensor decomposition , to name just a few. A recent survey of this algorithm can be found in .

I-C The Block Successive Upper Bound Minimization Method

In this article, we introduce a unifying framework, called the Block Successive Upper bound Minimization (BSUM) method, which generalizes the BCD type of algorithms . Simply put, the BSUM framework includes algorithms that successively optimize certain upper-bounds or surrogate functions of the original objectives, possibly in a block by block manner. The BSUM framework significantly expands the application domain of the traditional BCD algorithms. For example, it covers many classical statistics and machine learning algorithms such as the Expectation Maximization (EM) method , the Concave-Convex Procedure (CCCP) and the multiplicative Nonnegative Matrix Factorization (NMF) . It also includes as special cases many well-known signal processing algorithms such as the family of interference pricing algorithms and the weighted minimum mean square error (WMMSE) algorithms for interference management in large-scale wireless systems.

The generality and flexibility of the BSUM offers an ideal platform to explore optimization algorithms for big data. Through the lens of the BSUM, one obtains not only a thorough understanding of the variety of algorithms and applications that are being covered, but more importantly the principle for designing a good algorithm suitable for big data. To this end, this article will first provide a concise overview of a few key theoretical characterizations of the algorithms that fall in the BSUM framework, BCD included. Our emphasis will be given to providing intuitive understanding as to when and where the BSUM framework should (or should not) work, and how its performance can be characterized. The second part of this article offers a detailed account of many existing large-scale optimization algorithms that fall in the BSUM framework, together with a few big data related applications that are of significant interests to the signal processing community. The last part of the article outlines some interesting extensions of the BSUM that further help expand its application domains. Throughout this article, special emphasis will be given to computational issues arising from big data optimization problems such as algorithm design for parallel and distributed computation, algorithm implementation on multi-core computing platforms and distributed storage of the data. In particular, we will discuss how these issues can be addressed in the BSUM framework by providing theoretical insights and examples from real practical problems.

II BSUM and Its Theoretical Properties

The complete description of the BSUM is given in Table I. We also refer the readers to Fig. 2 for a graphical comparison of the iterates generated by BSUM and BCD for a two dimensional problem. It should be clear at this point that when Ir={(r  mod  n)+1}\mathcal{I}^{r}=\{(r\;{\rm mod}\;n)+1\} and no approximation is used (i.e., ui(xi,z)=f(xi,zi)u_{i}(x_{i},z)=f(x_{i},z_{-i}) and at each iteration a single coordinate is selected), then the BSUM reduces to the classical cyclic BCD method. In Table II we also present several index selection rules that are covered by the BSUM framework. For simplicity of presentation, we will use the classical cyclic index selection rule where Ir={(r  mod  n)+1}\mathcal{I}^{r}=\{(r\;{\rm mod}\;n)+1\} in the remainder of this article unless otherwise noted.

Next we introduce the precise definition of the approximation function. The main idea is that for each ii, the approximation ui(xi,xr)u_{i}(x_{i},x^{r}) should be an upper bound of the original objective function at the point of xrx^{r} (hence the BSUM name of the framework). Please see Fig. 3 for an illustration of the upper bound minimization process. Intuitively, picking an upper bound approximation function is reasonable because optimizing it at least should guarantee some descent of the original objective ff; see Fig. 3(c).

Using this definition, we make the following assumptions on the uiu_{i}’s.

Intuitively, Assumptions (A1) and (A2) imply that the approximation function is a global upper bound of f(x)f(x); while the assumption (A3) guarantees that the first order behaviors of the objective function and the approximation function are the same at the point of approximation (cf. Fig. 3). In Table III, we provide the readers with a few commonly used upper bounds that satisfy Assumption A; see also for additional examples. More discussion will be given in subsequent sections on how these bounds are used in practice.

It is worth mentioning that for a popular subclass of problem (1), Assumption A can be further simplified; see the following example [21, Proposition 2].

Consider the following special form of problem (1):

𝑟1r+1. It is clear from the figure that after solving the BSUM subproblem (5), f(xir+1,xir)<f(xir,xir)f(x^{r+1}_{i},x^{r}_{-i})<f(x^{r}_{i},x^{r}_{-i}), that is, the objective function is strictly decreased. Now that we have seen the main steps of the algorithm, next we describe its theoretical properties. We address the following questions related to the convergence of the BSUM: When does the BSUM converge? How fast does it converge? When does the BSUM fail and why? The answers to these questions will be instrumental in understanding the framework as well as evaluating the performance of various related algorithms.

II-B When does the BSUM converge?

In order to discuss the convergence property of the algorithm, we first investigate the optimality conditions for problem (1), which characterize the set of solutions that we would like our algorithm to reach. To this end, we introduce two related notions, one is called the stationary solution and the other is the coordinatewise minimum solution; see Table IV for their precise definitions. Roughly speaking, the coordinatewise minimum x^\hat{x} is the point where no single block x^i\hat{x}_{i}, i=1,,ni=1,\cdots,n has a better direction to move to, while at a stationary point xx^{*} the entire vector cannot move to a better direction. Further, a stationary solution must be a coordinatewise minimum, but the reverse direction is generally not true; see the example next.

This example confirms that the coordinatewise minimum can be much inferior to the stationary solution. Therefore in the subsequent discussion we will mainly focus on finding the stationary solutions rather than the coordinatewise minimum. An immediate question is that: whether one can easily distinguish these two types of solutions, or for that matter, when does a coordinatewise minimum become a stationary solution? Let us define a regular point xXx\in\mathcal{X} as a point that satisfies the following statement: if xx is coordinatewise minimum, then it is a stationary solution. It turns out that for a large and popular subclass of problem (1) expressed below in (7) where the nonsmooth function is separable across the blocks, all feasible points xXx\in\mathcal{X} are regular.

The main convergence result for BSUM method is given below, which is adapted from [21, Theorem 2]

Suppose the cyclic coordinate selection rule is chosen, i.e., Ir={(r  mod  n)+1}\mathcal{I}^{r}=\{(r\;{\rm mod}\;n)+1\}. Let {xr}r=1\{x^{r}\}_{r=1}^{\infty} be a sequence generated by the BSUM algorithm. Suppose Assumption A holds, and that each xrx^{r} is regular. Then the following is true:

Suppose that the function ui(xi,y)u_{i}(x_{i},y) is quasi-convex in xix_{i} for i=1,,ni=1,\ldots,n. Furthermore, assume that the subproblem (5) has a unique solution for any point xr1Xx^{r-1}\in\mathcal{X}. Then every limit point xx^{*} of {xr}\{x^{r}\} is a stationary point of (1).

Suppose the level set X0={xf(x)f(x0)}\mathcal{X}^{0}=\{x\mid f(x)\leq f(x^{0})\} is compact. Furthermore, assume that the subproblem (5) has a unique solution for any point xr1Xx^{r-1}\in\mathcal{X}, r1r\geq 1 for at least n1n-1 blocks. Then {xr}\{x^{r}\} converge to the set of stationary points, i.e.,

where d(x,X)minxXxxd(x,\mathcal{X}^{*})\triangleq\min_{x^{*}\in\mathcal{X}^{*}}\|x-x^{*}\| and X\mathcal{X}^{*} is the set of stationary points.

Here the first part of the result deals with the possibility of an unbounded sequence, whereas the second part assumes that the sequence lies in a compact set, therefore it has a slightly stronger claim. Note that Theorem 1 can be easily extended to all the coordinate selection rules given in Table II, and for the randomized version the convergence is with probability 1.

The readers should pay special attention to the conditions set forth by Theorem 1. First it says that the upper bound function needs to be picked carefully to satisfy both Assumption A and the requirement that at least n1n-1 subproblems (5) have a unique solution. However, when the objective function f(x)f(x) is convex, the uniqueness requirement of the per-block solution can be dropped; see . Also Theorem 1 requires the problem to be well-defined so that coordinatewise optimal solutions are equivalent to the stationary solution (which precludes objective function Ax1\|Ax\|_{1} in Example 2). In the next subsection we will demonstrate, through a couple of concrete examples, that relaxing some of these conditions indeed leads to the divergence of the BSUM.

II-C When does BSUM fail?

In this subsection, we provide a few examples in which BSUM fails to converge to any stationary solutions. The examples in this section serve as reminders that in practice, in order to avoid those pitfalls, extreme care must be exercised when designing large-scale optimization algorithms.

Our first example comes from Example 2. It demonstrates the consequence of lacking the regularity condition.

Consider the following unconstrained convex optimization problem minx  i=12Aixi1\min_{x}\;\left\|\sum_{i=1}^{2}A_{i}x_{i}\right\|_{1}, where A1=[3  2]TA_{1}=[3\;2]^{T} and A2=[4  1]TA_{2}=[4\;1]^{T}. Clearly by defining A=[3  4;2  1]A=[3\;4;2\;1], the objective function can be rewritten as Ax1\|Ax\|_{1}, which is not regular at point (x1,x2)=(4,3)(x_{1},x_{2})=(-4,3) (cf. Example 2). It follows that by setting ui(xi,z)=i=12Aixi1u_{i}(x_{i},z)=\left\|\sum_{i=1}^{2}A_{i}x_{i}\right\|_{1} (no approximation is used), and letting (x10,x20)=(4,3)(x^{0}_{1},x^{0}_{2})=(-4,3), the BSUM algorithm will not be able to move forward for either x1x_{1} or x2x_{2}, thus it will be stuck at the non-interesting point (4,3)(-4,3), far away from the only stationary solution (0,0)(0,0).

The next example shows that BSUM fails to converge if the feasible set X\mathcal{X} is no longer a Cartesian product of feasible sets X1,,Xn\mathcal{X}_{1},\cdots,\mathcal{X}_{n}, a fact that we have taken for granted so far.

Consider the following simple quadratic problem:

The optimal objective value is 22. Consider the BSUM algorithm with an arbitrary approximation function satisfying Assumption A, but initiated at the point (x10,x20)=(0,2)(x^{0}_{1},x^{0}_{2})=(0,2). Then the BSUM method will be stuck at this non-interesting point without making any progress, because it is not possible to change a single block without violating the coupling constraint.

Our next example shows that BSUM could diverge if the approximation function uiu_{i} violates Assumption A.

The optimal objective value is , with (x1,x2)=(0,0)(x^{*}_{1},x^{*}_{2})=(0,0). Consider using the BSUM algorithm with a linear approximation function, which violates Assumption (A2). More specifically, for a given feasible tuple (x1,x2)=(x^1,x^2)(x_{1},x_{2})=(\widehat{x}_{1},\widehat{x}_{2}), the x1x_{1}’s subproblem becomes

Clearly the optimal solution is either x1=1x_{1}=-1 or x1=1x_{1}=1. The same happens when solving the subproblem for x2x_{2}. Therefore the algorithm will never reach the optimal solution (x1,x2)=(0,0)(x^{*}_{1},x^{*}_{2})=(0,0). Further, if the feasible sets of x1x_{1} and x2x_{2} are unbounded, then the BSUM can diverge.

Our last example is taken from Powell . It shows that without the uniqueness assumption of each subproblem (5), the BSUM algorithm is not necessarily convergent.

Consider the following unconstrained problem

where the notation (z)+2(z)^{2}_{+} means (max{0,z})2(\max\{0,z\})^{2}. In this case, fixing (x2,x3)(x_{2},x_{3}) and optimizing over x1x_{1} yields the following solution

Similar solution can be obtained for x2x_{2} and x3x_{3} as well. Suppose we set ui(xi,z)=f(x)u_{i}(x_{i},z)=f(x) for all ii (no approximation is used), and letting (x10,x20,x30)=(1ϵ,1+12ϵ,114ϵ)(x^{0}_{1},x^{0}_{2},x^{0}_{3})=(-1-\epsilon,1+\frac{1}{2}\epsilon,-1-\frac{1}{4}\epsilon) for some ϵ>0\epsilon>0. Then it can be shown the applying the cyclic version of the BSUM algorithm, the iterates will be cycling around six points (1,1,1)(1,1,-1), (1,1,1)(1,-1,-1), (1,1,1)(1,-1,1), (1,1,1)(-1,-1,1), (1,1,1)(-1,1,1), (1,1,1)(-1,1,-1), and none of these six points is a stationary solution of the original problem. The reason for such pathological behavior is that, in any one of the six limit points above, there are at least two subproblems that have multiple optimal solutions. For example, at (1,1,1)(1,1,-1) and fixing x2,x3x_{2},x_{3} (resp. x1,x3x_{1},x_{3}), the optimal solution for x1x_{1} (resp. x2x_{2}) is any element in the interval [1,  1][-1,\;1]; cf. (10).

A natural question at this point is, can we make the BSUM work for these examples? The answer is affirmative, but how this can be done requires a case by case study. To handle the first two examples (i.e., Example 3 – 4), a generalized version of BSUM is needed, which will be discussed in section V. For the third example, one can simply pick a better upper bound to guarantee convergence. For example, if we pick the proximal upper bound (cf. Table III): ui(xi,z)=f(x)+γ2xizi2u_{i}(x_{i},z)=f(x)+\frac{\gamma}{2}\|x_{i}-z_{i}\|^{2}, then each subproblem will have a unique solution, and the algorithm will not be trapped by the non-interesting solutions given in Example 6. Notice that the use of γ2xizi2\frac{\gamma}{2}\|x_{i}-z_{i}\|^{2} inhibits the move of xix_{i} from its current position ziz_{i}. Thus, the main message here is that when optimizing each block, it is beneficial, at least theoretically, to be less greedy so that a conservative step is taken towards reducing the objective. The extent of the “conservativeness” for the per block update is then naturally characterized by the chosen upper bounds. Quite interestingly, the difficulty with the non-unique subproblem solution can also be resolved by using randomization. Formally, we have the following corollary to Theorem 1.

Suppose the level set X0={xf(x)f(x0)}\mathcal{X}^{0}=\{x\mid f(x)\leq f(x^{0})\} is compact. Then under the randomized block selection rule, the iterates generated by the BSUM algorithm converge to the set of stationary points almost surely, i.e.,

II-D How fast does the BSUM converge?

Now that we have examined the convergence of the BSUM, let us proceed next to characterize the convergence speed of the algorithm. No doubt that this is an important issue, especially so for big data optimization – the sheer size of the data and limited computational resource makes it difficult to optimize a problem to global optimality. Consequently, we are generally satisfied with high-quality suboptimal solutions, and are mostly concerned with how quickly such solutions can be obtained.

Recently, extensive research efforts have been devoted to analyzing the convergence rate for various BSUM-type algorithms, most of which use randomized coordinate selection rules and/or quadratic upper bound functions (cf. Table III) to solve convex problems; for example see . Since it is not possible to go over all the details of these different variations of BSUM, here we present in a high level a fairly general result that covers a large family of upper bound functions satisfying Assumption A, as well as all coordinate selection rules outlined in Table II.

To this end, let us make the following additional assumptions.

f(x):=g(x)+i=1nhi(xi)f(x):=g(x)+\sum_{i=1}^{n}h_{i}(x_{i}), where g(x)g(x) is a smooth convex function with Lipschitz continuous gradient, i.e. there exists a constant LL such that g(x)g(y)Lxy,   x,yX.\|\nabla g(x)-\nabla g(y)\|\leq L\|x-y\|,\;\forall~{}x,y\in\mathcal{X}. Further both gg and hih_{i}’s are convex functions.

The level set {xf(x)f(x0),xX}\{x\mid f(x)\leq f(x^{0}),x\in\mathcal{X}\} is compact.

Each upper bound function ui(xi,z)u_{i}(x_{i},z) is strongly convex with respect to xix_{i}.

An ϵ\epsilon-optimal solution xϵXx^{\epsilon}\in\mathcal{X} is defined as an xϵ{xxX,f(x)f(x)ϵ}x^{\epsilon}\in\{x\mid x\in\mathcal{X},f(x)-f(x^{*})\leq\epsilon\}, where f(x)f(x^{*}) is the globally optimal objective value of problem (7). Suppose both Assumptions A and B are satisfied. Then it is shown in that BSUM takes at most c/ϵ{c}/{\epsilon} iterations to find an ϵ\epsilon-optimal solution, for all coordinate rules specified in Table II, where c>0c>0 is a constant only related to the description of the problem. Such type of convergence rate is known as sublinear convergence. Here the main message is that under Assumptions A and B, the algorithm generally converges sublinearly in the order of 1/ϵ1/\epsilon. Further, for different special forms of BSUM, the constant cc in front of 1/ϵ1/\epsilon can be significantly refined so that it is independent of problem dimension; see . It is also interesting to note here that when the objective ff is either strongly convex, or convex but with certain special structure, the BSUM achieves linear rate of convergence. That is, BSUM takes at most O(log(c/ϵ))\mathcal{O}(\log(c/\epsilon)) iterations to find an ϵ\epsilon-optimal solution, which is much faster than the sublinear rate; see e.g., for the related discussions.

Finally, we briefly mention that it is possible to relax certain conditions in Assumption B to obtain refined rates. For example, shows that dropping the per-block strong convexity assumption in (B3) still achieves an O(1/ϵ)\mathcal{O}({1}/{\epsilon}) sublinear convergence. In it is shown that when removing the convexity Assumption (B1), it is also possible to characterize the convergence rate to stationary solutions. In it is shown that when there are two blocks of variables, the cyclic version of the BSUM can be accelerated to achieve an improved O(1/ϵ)\mathcal{O}(1/\sqrt{\epsilon}) complexity. In a few recent works , it is shown that when randomized block selection and the quadratic upper bound are used, it is possible to accelerate the BSUM with any n>2n>2 blocks.

III Algorithms Covered by the BSUM Framework

Now that we have a good understanding of the theoretical properties of the family of BSUM algorithms, we demonstrate in this section the wide applicability of the BSUM framework by revealing its connection to a few well-known algorithms for high dimensional massive data analysis. For each of the examples presented below, we pay special attention to the design of the upper bound functions.

The first algorithm that the BSUM covers is obviously the classic BCD described in Section I-B. Here the upper bound function is simply the original objective itself, i.e., ui(xi,z):=f(xi,zi),   xiXi,zX, iu_{i}(x_{i},z):=f(x_{i},z_{-i}),\;\forall~{}x_{i}\in\mathcal{X}_{i},z\in\mathcal{X},\forall~{}i. We would like to mention that all the convergence and rate of convergence analysis of BSUM carries over to the BCD method. Specifically, the result in Section II-D implies that the BCD method (with coordinate update rules specified in Table II) converges in a sublinear manner whenever Assumptions (B1) and (B2) are satisfied. This result by itself is interesting, as there has been limited theoretical analysis for general form of BCD algorithm when applied to problems satisfying Assumptions (B1) and (B2).

III-B The Convex-Concave Procedure (CCCP)

Consider the following unconstrained nonconvex problem, known as the difference of convex (DC) program: min  f(x):=g1(x)g2(x)\min\;f(x):=g_{1}(x)-g_{2}(x) where both g1g_{1} and g2g_{2} are convex functions. The well-known CCCP algorithm generates a sequence {xr}\{x^{r}\} by solving

where u(x,xr):=g1(x)xxr,g2(xr)g2(xr)u(x,x^{r}):=g_{1}(x)-\langle x-x^{r},\nabla g_{2}(x^{r})\rangle-g_{2}(x^{r}). Clearly, the linear upper bound in Table III is used here, therefore CCCP is a special case of the BSUM algorithm with a single block of variables. Furthermore, the updates can also be done in a block coordinate manner.

III-C The majorization-minimization (MM) algorithm

The MM algorithm which has been widely used in statistics , can be viewed as the single block version of BSUM. Consider the problem of minxXf(x)\min_{x\in\mathcal{X}}f(x) where f(x)f(x) is a smooth function. The basic idea of the MM algorithm is to first construct a “majorization” function u(x,z)u(x,z) for the original objective f(z)f(z), such that

Then the algorithm generates a sequence of iterates by successively minimizing u(x,xr)u(x,x^{r}). Clearly this algorithm represents a slight generalization of the CCCP discussed in the previous subsection, but nevertheless still falls in the framework of BSUM.

As another concrete example of the MM algorithm, let us consider the celebrated expectation maximization (EM) algorithm . Let ww be a vector observation used for estimating the parameter xx. The maximum likelihood estimate of xx is given as (where p(wx)p(w|x) denotes the conditional probability of ww given xx)

Furthermore, it is not hard to see that u(xr,xr)=lnp(wxr)u(x^{r},x^{r})=-\ln p(w|x^{r}), therefore, both conditions in (11) are satisfied. Similarly as in the previous case, one can extend the EM to a block coordinate version; see for detailed discussion.

III-D The Proximal Point Algorithm (PPA)

The classical PPA (see, e.g., [50, Section 3.4.3]) obtains a solution of the problem minxXf(x)\min_{x\in\mathcal{X}}f(x) by solving the following equivalent problem

where f()f(\cdot) is a convex function, X\mathcal{X} is a closed convex set, and γ>0\gamma>0 is a coefficient. Clearly problem (13) is strongly convex in both xx and yy so long as f(x)f(x) is convex (but not jointly strongly convex in (x,y)(x,y)). This problem can be solved by performing the following two steps alternatingly

Equivalently, the PPA algorithm can be viewed as successively minimizing the single block version of the proximal upper bound u(x;xr)u(x;x^{r}) given in Table III. Note that for a problem with a single block of variables, if xXx\in\mathcal{X} is coordinatewise minimum, then it must be a global minimum solution. Therefore every feasible solution xXx\in\mathcal{X} is regular, and the convergence of PPA is covered by BSUM.

Further, the PPA can be generalized to solve the multi-block problem (1), where f()f(\cdot) is convex in each of its block components, but not necessarily strictly convex. Directly applying BCD may fail to find a stationary solution for this problem, as the per-block subproblems can contain multiple solutions (cf. Example 6). Alternatively, we can consider an alternating PPA , which successively solves the following subproblem

Clearly this algorithm is a special form of BSUM with strongly convex proximal upper bound (cf. Table III). It follows that each subproblem has a unique optimal solution, and by Theorem 1 it must converge to a stationary solution.

III-E The Forward-Backward Splitting (FBS) Algorithm

The FBS algorithm (also known as the proximal splitting algorithm, see, e.g., and the references therein) for nonsmooth optimization solves the composite problem (6) with a single block of variables (i.e., n=1n=1), where hh is convex and lower semicontinuous; gg is convex and has Lipschitz continuous gradient, i.e., g(x)g(y)Lxy\|\nabla g(x)-\nabla g(y)\|\leq L\|x-y\|,  x,yX\forall~{}x,y\in\mathcal{X} and for some L>0L>0.

Define the proximity operator proxh:XX{\rm prox}_{h}:\mathcal{X}\to\mathcal{X} as

which is the quadratic upper bound in Table III, with {\mbox{\boldmath\Phi}}_{1}:=\frac{1}{\beta}\mathbf{I}. It is easy to see that the iteration (15) is equivalent to the following iteration xr+1=argminxXu(x,xr),x^{r+1}=\arg\min_{x\in\mathcal{X}}u(x,x^{r}), therefore it again falls under the BSUM framework.

Similar to the previous example, we can generalize the FBS algorithm to solve multiple block problems of the form (7). The resulting algorithm, sometimes also known as the block coordinate proximal gradient (BCPG) method, has received significant attention recently due to its efficiency for solving certain big data optimization problem such as LASSO . We refer the readers to for recent developments and applications for BCPG.

Here we make a special note that by appealing to the general convergence rate result in Section II-D, the BCPG method with any coordinate selection rules in Table II gives a sublinear convergence rate, when it is used to solve problem (7) that satisfies Assumption B.

III-F The Nonnegative Matrix Factorization (NMF) Algorithm

Here [Hr+1]j,i[H^{r+1}]_{j,i} means the (j,i)(j,i)th component of matrix Hr+1H^{r+1}. Below we show that when the iterates are well-defined (i.e., [Wijr]>0[W^{r}_{ij}]>0 and [Hijr]>0[H^{r}_{ij}]>0), the NMF iteration (18a) – (18b) is also covered by BSUM .

Let HiH_{i} and ViV_{i} represent the iith column of HH and VV, respectively. Then at a given iterate {Wr,Hr}\{W^{r},H^{r}\}, the subproblem for optimizing HiH_{i} is given by

Define the upper bound function ui(Hi,{Wr,Hr})u_{i}(H_{i},\{W^{r},H^{r}\}) as

where {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i}) is a diagonal matrix given by

Clearly, {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i})\succ 0, and it is easy to show that {\mbox{\boldmath\Phi}}_{i}(W^{r},H^{r}_{i})\succ(W^{r})^{T}W^{r}, where (Wr)TWr(W^{r})^{T}W^{r} is the Hessian of the objective of (19) . This implies that ui(Hi,{Wr,Hr})u_{i}(H_{i},\{W^{r},H^{r}\}) is the quadratic upper bound given in Table III of f(Hi,{Wr,Hr})f(H_{i},\{W^{r},H^{r}\}). Further, one can check that the subproblem that minimizes ui(Hi,{Wr,Hr})u_{i}(H_{i},\{W^{r},H^{r}\}) has a unique solution, given by (18a). Similar analysis can be established for the WW-block update rule as well. Therefore we conclude that the iterates (18a) – (18b) are a special case of BSUM. Finally we note that it is also possible to use different upper bound functions to derive more efficient update rules for the NMF problem (17); for example see where both the concave upper bound and the Jensen’s upper bound (cf. Table III) are used.

III-G The Iterative Reweighted Least Squares (IRLS) Method

The IRLS method is a popular algorithm used for solving big data problems such as sparse recovery . Consider the following problem

where η\eta is some small constant and g(x)g(x) denotes the smooth part of the objective. The IRLS algorithm solves problem (21) by performing the following iteration

Define the following function for g(x)g(x):

It is clear that g(xr)=u(xr,xr)g(x^{r})=u(x^{r},x^{r}), so Assumption (A1) is satisfied. To verify Assumption (A2), we apply the arithmetic-geometric inequality, and have

Then according to Example 1, Assumption (A3) is automatically true, therefore we have verified that u(x,xr)u(x,x^{r}) defined in (22) is indeed an upper bound function for the smooth function g(x)g(x). It follows that the IRLS algorithm corresponds to a single-block BSUM algorithm. Notice that using the BSUM framework we can easily generalize the IRLS to the multi-block scenario.

IV Applications of the BSUM Framework

In this subsection, we briefly review a few applications of the BSUM framework in wireless communication, bioinformatics, signal processing, and machine learning.

When linear beamformers are employed at the transmitters and receivers, the transmitted signal and the estimated received data stream can be respectively written as

A crucial task in modern wireless networks is to design the transmit and receive beamformers vk\mathbf{v}_{k} and uk\mathbf{u}_{k} in order to maximize a given utility of the system. Here, for simplicity of presentation, we consider the sum rate utility function as our objective. Therefore, our goal is to solve the following optimization problem:

where PkP_{k} is the total power budget of user kk and Rk(u,v)R_{k}(\mathbf{u},\mathbf{v}), which is the communication rate of user kk, is given by

Problem (23) is nonconvex and known to be NP-hard . Using the well-known relation between the signal to interference plus noise ratio (SINR) and the mean square error (MSE) value, one can rewrite (23) as :

where ek(u,v)e_{k}(\mathbf{u},\mathbf{v}) is the MSE value and is given by

Since the log()\log(\cdot) function is concave, it is upper bounded by its first order approximation (i.e., the linear upper bound in Table III). Therefore, we can define the function

where x(u,v)\mathbf{x}\triangleq(\mathbf{u},\mathbf{v}) is the optimization variable and xr(ur,vr)\mathbf{x}^{r}\triangleq(\mathbf{u}^{r},\mathbf{v}^{r}) denotes the beamformer at iteration rr. It is not hard to see that the approximation function in (25) is a valid upper-bound in the BSUM framework and at each iteration rr, this choice of approximation function leads to a quadratic programming problem which has closed form solutions. The resulting algorithm, dubbed weighted minimization of mean square error (WMMSE), converges to a stationary point of the problem and in practice it typically converges in a few iterations even for large size problems .

The reader is referred to for more details of the algorithm and its extensions to various beamformer design scenarios and different utility functions. It is also worth noting that many other interesting transceiver design algorithms also fall into the BSUM framework; see for more details.

IV-B Bioinformatics and Signal Processing

Here we briefly outline two interesting big data applications of the BSUM framework in bioinformatics and signal processing.

An essential step in the analysis of modern high throughput sequencing technologies of biological data is to estimate the abundance level of each transcript in the experiment. Mathematically, this problem can be stated as follows. Consider MM transcript sequences s1,,sM{A,C,G,T}Ls_{1},\ldots,s_{M}\in\{A,C,G,T\}^{L} with the corresponding abundance levels ρ1,,ρM\rho_{1},\ldots,\rho_{M} such that m=1Mρm=1\sum_{m=1}^{M}\rho_{m}=1. Let R1,,RNR_{1},\ldots,R_{N} be noisy sequencing reads originated from the transcript sequences, where each read RnR_{n}, n=1,,Nn=1,\ldots,N, is originated from only one of the transcript sequences s1,,sMs_{1},\ldots,s_{M}. Given the observed reads, the likelihood of the abundance levels ρ1,,ρM\rho_{1},\ldots,\rho_{M} can be written as

where αnmPr(Rnread  Rn  from  sequence  sm)\alpha_{nm}\triangleq{\rm Pr}\left(R_{n}\mid{\rm read\;}R_{n}{\;\rm from\;sequence\;}s_{m}\right) can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., . Therefore, given {αnm}n,m\{\alpha_{nm}\}_{n,m}, the maximum likelihood estimation of the abundance levels can be stated as

As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a local tight upper-bound of the objective function. In particular, the eXpress software solves the following optimization problem at the rr-th iteration of the algorithm:

Using Jensen’s inequality, it is not hard to check that (27) is a valid upper-bound of (26) in the BSUM framework. Moreover, (27) has a closed form solution obtained by

which makes the algorithm computationally efficient at each step. In practice, the above algorithm for abundance estimation converges in a few iterations. Moreover, this algorithm is perfectly suitable for distributed storage and multi-core machines. In particular, since the number of reads NN is much larger than the number of sequences MM, one can store the reads R1,,RNR_{1},\ldots,R_{N} in npn_{p} different processing units. Hence at each iteration rr, the processing unit p,p=1,,npp,p=1,\ldots,n_{p}, can compute the local value

where Np\mathcal{N}_{p} is the set of reads stored at processor pp with p=1npNp={1,2,,N}\cup_{p=1}^{n_{p}}\mathcal{N}_{p}=\{1,2,\ldots,N\}. Then, all processors update their global abundance estimate through the consensus procedure

For a very recent application of BSUM algorithm in gene RNA-seq abundance estimation, the readers are referred to .

IV-B2 Tensor decomposition

In general, finding the CP decomposition of a given tensor is NP-hard . A well-known algorithm for finding the CP decomposition is the alternating least squares (ALS) algorithm proposed in . This algorithm is in essence the BCD algorithm on the following optimization problem

In the ALS algorithm, we consider three blocks of variables: {ar}r=1R\{\mathbf{a}_{r}\}_{r=1}^{R}, {br}r=1R\{\mathbf{b}_{r}\}_{r=1}^{R}, and {cr}r=1R\{\mathbf{c}_{r}\}_{r=1}^{R}. At each iteration of the algorithm, two blocks are held fixed and only one block is updated by solving (28). The block selection rule is cyclic and therefore one needs the uniqueness of the minimizer assumption at each iteration for theoretical convergence guarantee. Clearly, this assumption does not hold in general in (28) and therefore, theoretically, convergence is not always guaranteed. In addition, another well-known drawback of the ALS algorithm is the “swamp” effect where the objective remains almost constant for many iterations and then starts decreasing again. It has been observed in the literature that the employment of proximal upper-bound (see Table III) could help reducing the swamp effect . It is also suggested in that decreasing the proximal coefficient (γ\gamma in Table III) during the ALS algorithm could further improve the performance of the algorithm. Notice that these modifications in the algorithm makes the algorithm a special case of BSUM framework. Consequently, its theoretical convergence is also guaranteed by Theorem 1.

Figure 4 compares the performance of the naive ALS algorithm with the one using proximal upper-bound. As can be seen from the figures the proximal ALS algorithm has less swamp effect as compared to naive ALS method. The reader is referred to for more details of the algorithm and to for the application of BSUM and CP decomposition in gene expression and brain imaging.

IV-C Machine learning: sparse dictionary learning and sparse linear discriminant analysis

where the sets X\mathcal{X} and A\mathcal{A} are given based on the prior knowledge on the data. The function d(,,)d(\cdot,\cdot,\cdot) measures the goodness-of-fit of the model. For example, a popular choice of the function d(,,)d(\cdot,\cdot,\cdot) and the set A\mathcal{A} leads to the following optimization problem

where the first term in the objective keeps our estimated signals close to the training set and the second term forces the representation to be sparse. One popular approach in the dictionary learning algorithm is to alternatingly update the dictionary A\mathbf{A} and the coefficients X\mathbf{X} . However, naively updating these two variables to its global optimum requires solving a sparse recovery problem at each iteration which is costly for big size problems. Motivated by the idea of inexact steps in BSUM framework, one can iteratively replace the objective by a locally tight upper-bound which is easier to minimize at each iteration and hence leading to computationally cheaper steps in the algorithm. It is not hard to see that utilizing the quadratic upper-bound in Table III with diagonal matrices {\mbox{\boldmath\Phi}}_{i} leads to closed form updates at each step . Unlike many existing algorithms in the literature , the resulting algorithm is guaranteed to converge to the set of stationary solutions theoretically as the result of Theorem 1.

Figure 6 and Table V show the performance of the resulting algorithm for dictionary learning in an image denoising problem. The denoising is performed on the Lena image corrupted by additive Gaussian noise with various variances σ2\sigma^{2}. As can be seen from Table V, the proposed algorithm results in larger PSNR values than the K-SVD method when the noise level is large. Moreover, the proposed algorithm contains less visual artifacts. Furthermore, each step of the proposed algorithm is in closed form and it is computationally favorable, while each step of the K-SVD method requires an inner iterative method.

IV-C2 Sparse linear discriminant analysis

where μ^k=1NiCkxi\widehat{\mu}_{k}=\frac{1}{N}\sum_{i\in\mathcal{C}_{k}}\mathbf{x}_{i} is observations mean in class Ck\mathcal{C}_{k}. Similarly, the standard between class covariance estimate is given by

Unfortunately, when the number of features is large relative to NN, the matrix Σ^w\widehat{\Sigma}_{w} is rank deficient and therefore problem (29) is ill-posed. In order to resolve this issue and to have a small generalization error, suggests to regularize the optimization problem with a convex penalty function P()P(\cdot); and solve

Clearly, this optimization problem is non-convex. As suggested in , one can linearize the first part of the objective in (30) iteratively to obtain a tight upper-bound of the objective. It is not hard to see that the algorithm used in is BSUM with the linear upper-bound given in Table III.

V Extensions

In this section, we discuss extensions and generalizations of the BSUM framework in various settings.

Consider the following stochastic optimization problem

where X\mathcal{X} is a closed convex set and ξ\xi is a random variable modeling the uncertainty in our optimization problem. A standard classical approach for solving (31) is the sample average approximation (SAA) method; see and the references therein. At iteration rr of SAA method, given a new realization ξr\xi^{r} of the random variable ξ\xi, SAA method generates a new iterate xrx^{r} by solving a problem with the sample average 1ri=1rg(x,ξi)\frac{1}{r}\sum_{i=1}^{r}g(x,\xi^{i}) as its objective, where ξ1,ξ2,,ξr\xi^{1},\xi^{2},\ldots,\xi^{r} are independent identically-distributed realizations of the random variable ξ\xi.

A major drawback of the SAA method is that each of its iteration can be computationally very expensive. The computational inefficiency arises from either the non-convexity of the objective, or not having closed form solutions at each iteration.

Motivated by the BSUM framework, authors of suggest to use an inexact version of SAA method, in which a sequence of upper bounds of the objective are minimized. In particular, at each iteration rr, the optimization variable is updated by

where g^(,xi1,ξi)\widehat{g}(\cdot,x^{i-1},\xi^{i}) is an upper-bound of the function g(,ξi)g(\cdot,\xi^{i}) around the point xi1x^{i-1}. The approximation function g^(,xi1,ξi)\widehat{g}(\cdot,x^{i-1},\xi^{i}) is assumed to be in the form of BSUM approximation. The resulting algorithm, named stochastic successive upper bound minimization (SSUM), is guaranteed to converge to the set of stationary solutions almost surely; see for more details. Further, it is shown to be capable of dealing with various practical problems in signal processing and machine learning. For example, as we will see shortly, the authors in apply SSUM framework to cope with uncertainties in channel estimation for wireless beamformer design problem. As another example, the online sparse dictionary leaning algorithm proposed in is a special case of SSUM.

As an example of the SSUM method, consider the wireless transceiver design problem discussed in Section IV-A where the channel coefficients {Hij}i,j\{\mathbf{H}_{ij}\}_{i,j} are not exactly known. In this scenario, we can consider the channel coefficients as random variables and solve the following stochastic optimization problem

which is the stochastic counterpart of the optimization problem (23). Utilizing the upper bound (25) in SSUM algorithm leads to the stochastic WMMSE algorithm . Figure 7 illustrates the numerical performance of the SSUM methods as compared with SAA. At each iteration of SAA procedure, one should solve a nonconvex optimization problem. Two different methods are considered: the gradient descent method with random initialization and the WMMSE algorithm which is known to converge in few iterations for this problem. As can be seen from the figure, the running time of the SAA algorithm is much longer than that of the SSUM.

V-B Coupling constraints

So far in this article, we have assumed that the constraints in the optimization problem is separable and convex. In other words, the constraint set X\mathcal{X} in (1) is of the form X=X1×Xn\mathcal{X}=\mathcal{X}_{1}\times\ldots\mathcal{X}_{n} with each Xi\mathcal{X}_{i} being convex. A natural extension of the BSUM framework is to modify it in order to deal with coupling and nonconvex constraints.

Consider the following convex problem with linear coupling constraints

At each iteration of the ADMM method, either a primal block variable xix_{i} is updated according to

or the dual Lagrange multiplier λ\lambda is updated according to the gradient ascent rule

where αr\alpha^{r} is the dual step-size at iteration rr. The update orders for the primal and dual variables could be either cyclic or randomized.

Similar to the BSUM framework, one can replace the augmented Lagrangian function L(,xir;λr)L(\cdot,x_{-i}^{r};\lambda^{r}) with its tight upper-bound L~i(,xr;λr)\widetilde{L}_{i}(\cdot,x^{r};\lambda^{r}) at iteration r+1r+1 where

with ui(,xr)u_{i}(\cdot,x^{r}) being a locally tight approximation of the function f(,xir)f(\cdot,x_{-i}^{r}) around the point xirx_{i}^{r} satisfying Assumption A. The resulting algorithm, named Block Successive Upper Bound Minimization Method of Multipliers (BSUMM) , is guaranteed to converge to the global optimal of (33) under some regularity assumptions . For extensions to non-convex problems, please see the recent work . There are a few other interesting techniques to deal with linearly coupling constraint. For example references propose to randomly pick 2 blocks of variables to update at each iteration, and reference propose new algorithms based on minimizing the augmented Lagrangian function. We refer the readers to these papers for more related works in this direction.

Let us illustrate an application of BSUMM to a multi-commodity routing problem, which arises in the design of next generation cloud-based communication networks . Consider a connected wireline network N=(V,L)\mathcal{N}=(\mathcal{V},\mathcal{L}) which is controlled by K+1K+1 Network Controllers (NCs) as illustrated in Fig. 8. Let V\mathcal{V} denote the set of network nodes, which is partitioned into KK subsets, i.e., V=i=1KVi\mathcal{V}=\cup_{i=1}^{K}\mathcal{V}^{i}, ViVj=\mathcal{V}^{i}\cap\mathcal{V}^{j}=\emptyset,   ij\forall\;i\neq j. The set of directed links that connect nodes of V\mathcal{V} is denoted as L{l=(sl,dl)    sl,dlV}\mathcal{L}\triangleq\{l=(s_{l},d_{l})\mid\;\forall\;s_{l},d_{l}\in\mathcal{V}\}, where l=(sl,dl)l=(s_{l},d_{l}) denotes the directed link from node sls_{l} to node dld_{l}. Each NC ii controls Vi\mathcal{V}^{i} and the links connecting these nodes, i.e., Li{l=(sl,dl)L  sl,dlVi}\mathcal{L}^{i}\triangleq\{l=(s_{l},d_{l})\in\mathcal{L}\mid\forall\;s_{l},d_{l}\in\mathcal{V}^{i}\} (cf. Fig. 8). The network Ni(Vi,Li)\mathcal{N}^{i}\triangleq(\mathcal{V}^{i},\mathcal{L}^{i}) is called the subnetwork ii.

Our objective is to transport MM data flows over the network, with each flow mm being routed from a source node smVs_{m}\in\mathcal{V} to the sink node dmVd_{m}\in\mathcal{V}. We use rm0r_{m}\geq 0 to denote flow mm’s rate, and use fl,m0f_{l,m}\geq 0 to denote its rate on link lLl\in\mathcal{L}. We also assume that a master node exists which controls the data flow rates {rm}m=1M\{r_{m}\}_{m=1}^{M}. The central NC controls the subnetwork N0\mathcal{N}^{0}, consisting of the master node and the links connecting different subnetworks, i.e., L0=ijLij0\mathcal{L}^{0}=\cup_{i\neq j}\mathcal{L}_{ij}^{0}.

Let us consider two types of network constraints, as we list below.

Link capacity constraints. Assume each link lLl\in\mathcal{L} has a fixed capacity denoted as ClC_{l}. The total flow rate on link ll is constrained by

where 1{\bf 1} is the all-one vector and fl[fl,1,,fl,M]T\mathbf{f}_{l}\triangleq[f_{l,1},\ldots,f_{l,M}]^{T}.

Flow conservation constraints. For any node vVv\in\mathcal{V} and data flow mm, the total incoming flow should be equal to the total outgoing flow:

where In(v){lL  dl=v}{\rm In}(v)\triangleq\{l\in\mathcal{L}\mid\;d_{l}=v\} and Out(v){lL  sl=v}{\rm Out}(v)\triangleq\{l\in\mathcal{L}\mid\;s_{l}=v\} denote the set of links going into and coming out of a node vv respectively; 1v=x=11_{v=x}=1 if v=xv=x, otherwise 1v=x=01_{v=x}=0.

To provide fairness to the users, let us maximize the minimum rate of all data flows. The problem can be formulated as the following linear program (LP)

where f{fllL}\mathbf{f}\triangleq\{\mathbf{f}_{l}\mid l\in\mathcal{L}\} and r{rmin,rmm=1M}\mathbf{r}\triangleq\{r_{\min},r_{m}\mid m=1\sim M\}. Obviously, one can use off-the-shelf optimization packages such as Gurobi to solve the LP (36), but this is only viable in a centralized setting where all the flows are managed by a single controller.

To enable distributed/parallel network management across the NCs, we need to allow each NC ii to independently optimize the variables belonging to the subnetwork Ni\mathcal{N}^{i}. However, this task is difficult because the optimization variables of problem (36) is coupled (indeed each flow rate fmf_{m} appears in exactly two flow conservation constraints). To address this problem, we introduce a few sets of new variables to decouple the flow conservation constraints across different subnetworks (we refer the readers to for the detailed reformulation). The reformulated problem \eqrefMainProb\eqref{MainProb} is given by

where X0\mathcal{X}_{0} and Xi1\mathcal{X}_{i1} and Xi2\mathcal{X}_{i2} are some feasible sets, and {rmin,x02,x01,xi1,xi2,xi3}\{r_{\min},\mathbf{x}_{02},\mathbf{x}_{01},\mathbf{x}_{i1},\mathbf{x}_{i2},\mathbf{x}_{i3}\} are the block-variables. By applying the BSUMM, we can obtain a parallel/distributed algorithm. A few remarks about the implementation of this algorithm are in order.

The replication of link/flow variables for links across different subnetworks allows each subnetwork to be considered separately and independently. This feature makes the BSUMM subproblems solvable in parallel. The requirement of the replicated variables being the same as the original variables is enforced by the linear coupling constraints, and they can be satisfied asymptotically as the BSUMM algorithm converges.

The subproblems of the proposed BSUMM-based algorithm can be solved by each NC very efficiently. For example the update of {rmin,x02}\{{r_{\min}},\mathbf{x}_{02}\} can be performed by each NC in closed-form; the update of {xi,x01i}\{{\mathbf{x}_{i}},\mathbf{x}_{01}^{i}\} can be performed by running the well-known RELAX code .

A careful implementation of the BSUMM allows the NCs to act asynchronously, in the sense that they do not need to coordinate with each other for computation. Such asynchronous implementation has the potential of greatly improving the computational efficiency.

We illustrate the BSUMM implementation over a mesh wireline network with 126126 nodes, which is randomly partitioned into 99 subnetworks with 306306 directed links within these subnetworks and 100100 directed links connecting the subnetworks. The capacities for the links within (resp. between) the subnetworks are uniformly distributed in MBits/s (resp. MBits/s). All simulation results are averaged over 200200 randomly selected data flow pairs and link capacity.

To demonstrate the benefit of parallelization, we also utilize a high performance computing (HPC) cluster, and make each computing node to be a NC. We compare a few different algorithms, listed below:

Solving a large-scale LP by Gurobi , a centralized solver;

The synchronous BSUMM algorithm with K=10K=10 NCs, with the computation done by either a single or by 1010 distributed computing cores;

The asynchronous BSUMM with K=10K=10 NCs, with the computation done in 1010 distributed computing cores.

Note that the asynchrony in the network arises naturally from the per-node computational delay and network communication delay. In the following table we demonstrate the performance of various algorithms when M=200M=200. Clearly the asynchronous BSUMM with a small number of NCs outperforms all the rest of the algorithms.

The numerical results suggest that appropriate network decomposition and asynchronous implementation are both critical for the fast convergence of BSUMM. In practice, we observe that the network should be decomposed following a few guidelines: i) the computation burden across the subnetworks is well balanced; ii) the subroutine within the network can achieve its maximum efficiency; iii) the total number of replicated auxiliary variables is small.

V-B2 Nonconvex constraints

The BSUM idea can be straightforwardly extended to a nonconvex constraint scenario for single block optimization problems. To proceed, let us consider the optimization problem

As illustrated in Figure 9, the iterative approximation of the constraints is a restriction of the constraints and hence the iterates remain feasible during the algorithm. If, in addition, some constraint qualification conditions are satisfied, the resulting algorithm is guaranteed to converge to the set of stationary solutions of (37); see [38, Theorem 1] for detailed conditions and analysis.

V-C Parallel version and extensions to game theory

With the recent advances in multicore and cluster computational platforms, it is desirable to design “parallel” algorithms for multi-block optimization problem where multiple cores update the block variables in parallel to optimize the objective function. A naive parallel extension of the BCD approach for solving (1) is to update all blocks (or a subset of them) in parallel by solving

Unfortunately this naive extension of the BCD algorithm does not converge in general and might result in zig-zag/oscillating or divergent behavior. As an example, consider the problem

Clearly, this problem is convex with bounded feasible set and its optimal value is zero. However, the above naive parallel extension of the algorithm leads to the following iteration path:

which is clearly not convergent. This is caused by aggressive steps used in the algorithm. To make the algorithm convergent, it is then necessary to employ small enough and controlled steps. Furthermore, in the case of nonconvex objective function f()f(\cdot) in (1), the approximation functions could be again used to obtain computationally efficient update rules. The resulting algorithm, dubbed parallel successive convex approximation (PSCA), is summarized in Table VII. To see the convergence analysis of this algorithm and other related algorithms such as the Flexible Parallel Algorithm (FPA), the reader is referred to and the references therein.

Notice that PSCA can be viewed as a way of solving a multi-agent optimization problem where multiple agents/users try to optimize a common objective by updating their own variable iteratively. Furthermore, it can be used in a game theoretic setting where each player in the game utilizes the best response strategy by optimizing a locally tight upper-bound of its own utility function. This algorithm is guaranteed to converge for some particular class of games under some regularity assumptions on the players utility functions. The convergence analysis presented in is based on certain contraction approach as well as monotone convergence for potential games.

Figure 10 illustrates the behavior of cyclic and randomized parallel PSCA method as compared with their serial counterparts (i.e., the “Cyclic BCD” and the “Randomized BCD” ) applied to the LASSO problem. The performance of the “PSCA” method is also illustrated for different number of processors and various block selection rules. As can be seen from the figures, parallelizaiton can result in more efficient algorithm; however, the convergence speed does not grow linearly with the number of processing cores. Moreover, increasing the number of processors beyond certain point results in slower convergence, which can be attributed to the increased communication overhead among the nodes.

It is also worth noting that the parallel update rule is very useful in dealing with distributed data sets. Consider solving the LASSO problem with the following objective: Axb2+λx1.\|Ax-b\|^{2}+\lambda\|x\|_{1}. Assume we have qq processing cores each having their own memory. Let us partition the matrix AA and vector xx into qq blocks: A=[A1,,Aq]andx=[x1T,,xqT]TA=[A_{1},\ldots,A_{q}]\quad{\rm and}\quad x=[x^{T}_{1},\cdots,x^{T}_{q}]^{T}. If PSCA is implemented in a way that each core jj is only responsible for updating block jj, then at each iteration r+1r+1, core jj’s problem of interest can be written as

where bjbijAixirb^{\prime}_{j}\triangleq b-\sum_{i\neq j}A_{i}x_{i}^{r}. Notice that the value of bjb^{\prime}_{j} can be calculated by letting each node ii compute the value of AixiA_{i}x_{i} and broadcast it to other nodes. Hence under this architecture, each node does not need to know the complete matrix AA and only local information is enough for a distributed implementation of the PSCA method.

V-D Practical Considerations

There are a number of factors that we need to consider when using the BSUM framework.

The first consideration is about the choice of the upper bound. What is a good bound for a given application? The answer is generally problem dependent, as we have already seen in a few examples. The general guideline is that a good upper bound should be able to ensure algorithm convergence, best exploit the problem structure and make the subproblems easily solvable (preferably in closed-form). For example a simple proximal upper bound is not likely to perform well for the transceiver design problem discussed in Section IV-A, as the resulting subproblems will not decompose over the variables.

The second consideration is about the choice of the block update rules. As we have seen in Section II-D, different update rules lead to quite distinct convergence behaviour. For convex problems, deterministic rules such as the cyclic rule promise the worst-case convergence rates, while the randomized rule ensures convergence rate in either averaged or high probability sense. Further, there is barely any theoretical rate analysis for nonconvex problems, regardless of the block selection rules. Therefore the best strategy in practice is to perform extensive numerical study and pick the best rule for the application at hand. For example, researchers have found that MBI rule is effective for certain tensor decomposition problem ; the cyclic rule can be superior to the randomized rule for certain LASSO problem, and certain G-So rule can outperform the randomized rule .

The third consideration is about the choice of the parallelization schemes. There has been extensive research on parallelizing various special cases and variations of the BSUM type algorithm, see and the references therein. These algorithms differ in a number of implementation details and in their applicability. For example, most of the implementations use randomized block selection rules to pick the variable blocks, while and additionally use certain variations of the G-So rule. The majority of the schemes only work for convex problems, with the exception of which work for general nonsmooth and nonconvex problems in the form of (7). When assessing whether a given problem is suitable for parallelization, it is important to know that oftentimes the number of blocks that can be updated in parallel is data dependent. For example when solving LASSO problems, it is shown in and that the degree of parallelization is dependent on the maximum eigenvalue of certain submatrices of the data matrix. Some recent result shows that for certain randomized coordinate descent method, such dependency could be mild. For solving general convex nonsmooth problems, shows that the stepsize should be carefully selected based on both the “separability” of the problem (or the sparsity of the data matrix) as well as the degree of parallelization. If the application at hand does not satisfy these conditions, the alternatives usually are: 1) exploit the problem structure and pick a good upper bound, so that the subproblems are decomposable, leading to parallel and stepsize-free updates; see for example the NMF problem in Section III-F and the WMMSE algorithm in Section IV-A; 2) to use the diminishing stepsizes for updating the blocks, see for example .

VI Issues and Open Research Problems

This article presents a comprehensive algorithmic framework called BSUM for block-structured large-scale optimization. The main strength of the BSUM framework is its strong theoretical convergence guarantee and its flexibility. As demonstrated in this article, the BSUM framework covers a number of well-known but seemingly unrelated algorithms as well as their new extensions. Moreover, it is amenable to a number of different data models as well as to parallel implementation on modern multi-core computing platforms.

To close, we briefly highlight a couple of issues and open research topics related to the BSUM framework.

Communication delay and overhead in parallel implementations: As discussed in subsection V-C, the convergence speed of the parallel version of BSUM framework does not increase linearly with the number of computational nodes. In fact, after some point, increasing the number of computational nodes can lead to slower convergence speed. As mentioned before, this is due to the delay caused by communication among the nodes. This observation gives rise to two important research questions: First, given the maximum allowable number of computation nodes and the communication overhead of the nodes, what is the optimum choice of the number of cores for solving a given optimization problem? Answering this question requires computation/communication tradeoff analysis of the proposed optimization approach. Second, can the BSUM framework be extended and implemented in a (semi-)asynchronous manner? If this is possible, then the communication overhead can be reduced significantly since the nodes are not required to wait for each other before updating the variables, making the algorithm lock-free. For recent efforts on this research direction see .

Nonlinear coupling constraints: As we observe in Section V, the BSUM framework can also be used in the presence of linear coupling or nonconvex decoupled constraints. How can the BSUM framework be generalized to problems with nonlinear coupling constraints? More precisely, can the BSUM framework with block-wise update rules be applied to the optimization problem of the following form?

Example 4 shows that the naive extension of the BCD approach fails to find the optimal solution even in the convex setting. A popular approach to tackle the above problem is to place the constraints in the objective using Lagrange multipliers and update the multipliers iteratively. However, this approach typically leads to double-loop algorithms and requires subgradient steps in the dual space which is known to be slow.

References