Faster and Simpler Width-Independent Parallel Algorithms for Positive Semidefinite Programming
Richard Peng, Kanat Tangwongsan, Peng Zhang
Introduction
Semidefinite programming (SDP), alongside linear programming (LP), is an important tool in approximation algorithms, optimization, and discrete mathematics. In the context of approximation algorithms alone, it has emerged as a key technique which underlies a number of impressive results that substantially improve the approximation ratios. To solve a semidefinite program, algorithms from the linear programming literature such as Ellipsoid or interior-point algorithms can be applied to derive near exact solutions. But they are often costly. As a result, finding efficient approximations to such problems is a critical step in making them more practical.
From a parallel algorithms standpoint, both LPs and SDPs are P-complete even to approximate to any constant accuracy, suggesting that it is unlikely that they have a polylogarithmic depth algorithm. For linear programs, however, the special case of positive linear programs, first studied by Luby and Nisan , has an algorithm that finds a -approximate solution in iterations. This weaker approximation guarantee is still sufficient for approximation algorithms (e.g., solutions to vertex cover and set cover via randomized rounding), spurring interest in studying these problems in both sequential and parallel contexts (see, e.g., ).
The importance of problems such as MaxCut and Sparsest Cut has led to the identification and study of positive SDPs. Our work is motivated by a result by Jain and Yao that gave the first positive SDP algorithm whose work and depth are independent of the width parameter (commonly known as width-independent algorithms). As there are substantial differences in the analysis in this version compared to the conference version , we address the relation between our paper and other works in Section 1.1.
We present a simple algorithm that offers guarantees similar to but has less work-depth complexity. Each iteration of our algorithm involves only simple matrix operations and computing the trace of the product of a matrix exponential and a positive semidefinite matrix. The input consists of an accuracy parameter and a positive semidefinite program (PSDP) in the following standard primal form:
where the matrices are -by- symmetric positive semidefinite matrices, denotes the pointwise dot product between matrices (see Section 2), and the scalars are nonnegative reals. This is a subclass of SDPs where the matrices and scalars are “positive” in their respective settings. We also assume, as is standard, that the SDP has strong duality. Our main result is as follows:
Given a primal positive SDP involving matrices with constraints and an accuracy parameter , there is an algorithm approxPSDP that produces a -approximation in iterations, where each iteration involves computing matrix sums and a special primitive that computes in the case when and are both positive semidefinite.
The theorem quantifies the cost of our algorithm in terms of the number of iterations. The work and depth bounds implied by this theorem vary with the format of the input and how the matrix exponential is computed in each iteration. As we will discuss in Section 4, with input given in a suitable form, our algorithm runs in nearly-linear work and polylogarithmic depth.
The first definition of positive SDPs was due to Klein and Lu , who used it to characterize the MaxCut SDP. The MaxCut SDP can be viewed as a direct generalization of positive (packing) LPs. More recent work defines the notion of positive packing SDPs , which captures problems such as MaxCut, sparse PCA, and coloring; and the notion of covering SDPs , which captures the ARV relaxation of Sparsest Cut among others. Works this area tend to focus on developing fast sequential algorithms for finding a -approximation, leading to a series of sequential algorithms (e.g., ). The iteration count for these algorithms, however, depends on the so-called “width” parameter of the input program or some parameter of the spectrum of the input program. In some instances, the width parameter can be as large as , making it a bottleneck in the depth of direct parallelization. This is even more so in the case of the Sparsest Cut SDP even though the problem has been labeled as a covering SDP .
Over the past few years, width-independent algorithms for positive SDPs have received much attention. Jain and Yao gave the first result in this direction: a polylog depth algorithm based on the first width independent linear programming algoirthm by Luby and Nisan . Their algorithm is based on updating the primal matrix. This leads to an intricate analysis based on carefully analyzing the eigenspaces of a particular matrix before and after each update. It takes iterations, each of which involves computing spectral decompositions using least work.
Our algorithm follows a different approach. It updates the dual program in a way motivated by the width-independent positive LP algorithm by Young . Concurrently, Jain and Yao gave a similar algorithm for positive SDPs. Their algorithm solves a class of SDPs which contains both packing and diagonal covering constraints. Since matrix packing conditions between diagonal matrices are equivalent to point-wise conditions of the diagonal entries, these constraints are closer to a generalization of positive covering LP constraints. We believe that removing this restriction on diagonal packing matrices would greatly widen the class of problems included in this class of SDPs and discuss possibilities in this direction in Section 5.
The convergence analyses of both of these routines go through the matrix multiplicative weights update framework for solving semidefinite programs . However, a crucial algebraic piece is missing for adapting the iteration count bound from Young’s algorithm. Unlike scalar exponentials, matrix exponentials is not a monotonic function under standard notions of matrix ordering such as the Loewner partial order.
Most recently, Allen-Zhu et al. gave the first rigorous adaptation of Young’s algorithm to positive SDPs . Motivated by this result, as well as its precursor in positive LPs , we complete the analysis of our original algorithm. Specifically, we show that the dual generation scheme in these routines provide good bounds on the iteration count. This leads to a different analysis of Young’s algorithm that omits phases. Our modified analysis is for a simplified pseudocode of the algorithm from that removes these phases. However, the phase-based version can be analyzed similarly.
Compared to the algorithm by Allen-Zhu et al. , our analysis uses more elementary linear algebraic techniques, and is closer to multiplicative weights update schemes. We believe the dynamic bucketing method for obtaining better dependencies on is also applicable to our analysis. Finally, as many recent results on faster positive LP / SDP algorithms take optimization based views, we believe our approach is also of independent interest.
2 Overview
We derive our algorithm by generalizing Young’s algoirthm . In place of the “soft max” function for bounding the maximum of a set of linear constraints, we use matrix exponential and the matrix multiplicative weights update (MMWU) mechanism. Moreover, besides standard operations on (sparse) matrix, the only other primitive needed is the matrix dot product , where and are positive semidefinite.
Intuitions. For intuition about packing SDPs and the matrix multiplicative weights update method for finding approximate solutions, a useful analogy of the decision problem is that of packing a (fractional) amount of ellipses into the unit ball. Figure 1 provides an example involving matrices (ellipses) in dimensions. Note that and are axis-aligned; their sum is also axis-aligned in this case. In fact, positive linear programs in the broader context corresponds exactly to the restriction of all ellipsoids being axis-aligned. In this setting, the algorithm of can be viewed as creating a penalty function by weighting the length of the axises using an exponential function. Then, ellipsoids with sufficiently small penalty subject to this function have their weights increased.
However, once we allow general ellipsoids such as , the resulting sum will no longer be axis-aligned. In this setting, a natural extension is to take the exponential of the semimajor axises of the resulting ellipsoid instead, leading to the matrix multiplicative weights scheme. Our analysis then focuses on showing that Young’s algorithm still has a width-independent iteration count in this setting.
Work and Depth. We now discuss the work and depth bounds of our algorithm. The main cost of each iteration of our algorithm comes from computing the dot product between a matrix exponential and a PSD matrix. Like in the sequential setting , we need to compute for each iteration the product , where is some PSD matrix. The cost of our algorithm therefore depends on how the input is specified. When the input is given prefactored—that is, the -by- matrices ’s are given as and the matrix is given, then Theorem 4.1 can be used to compute matrix exponential in work and depth, where is the number of nonzero entries across ’s and . This is because the matrix that we exponentiate has , as shown in Lemma 3.5. Therefore, as a corollary to the main theorem, we have the following cost bounds:
The algorithm approxPSDP has in work and depth.
If, however, the input program is not given in this form, we can add a preprocessing step that factors each into since is positive semidefinite. In general, this preprocessing requires at most work and depth using standard parallel QR factorization . Furthermore, these matrices often have certain structure that makes them easier to factor. Similarly, we can factor and invert with the same cost bound, and can do better if it also has specialized structure.
Background and Notation
where are the eigenvectors of with eigenvalues respectively. We will use , to represent the eigenvalues of in decreasing order and also use to denote . Notice that positive semidefiniteness induces a partial ordering on matrices. We write if .
The trace of a matrix , denoted , is the sum of the matrix’s diagonal entries: . Alternatively, the trace of a matrix can be expressed as the sum of its eigenvalues, so . Furthermore, we define
It follows that is positive semidefinite if and only if for all PSD .
where, again, is the eigenvector corresponding to the eigenvalue . It is not difficult to check that for , this definition coincides with .
Our algorithm relies on a matrix multiplicative weights (MMW) algorithm, which can be summarized as follows. For a fixed and , we play a “game” a number of times, where in iteration , the following steps are performed:
Produce a “probability” matrix ;
Incur a gain matrix ; and
Like in the standard setting of multiplicative weights algorithms, the gain matrix is chosen by an external party, possibly adversarially. In our algorithm, the gain matrix is chosen to reflect the step we make in the iteration. Arora and Kale shows that the MMW algorithm has the following guarantees (restated for our setting):
For , if ’s are all PSD and , then after iterations,
2 Reduction to Bounded Decision Version
Our algorithm works with normalized primal/dual programs shown in Figure 2.
By using binary search and appropriately scaling the input program, such an SDP can be approximated using the following decision problem:
or a PSD matrix (a primal solution) such that
This reduction can also ensure bounded trace on all ’s. The following lemma summarizes key properties of such a reduction:
For , a positive packing semidefinite program can be approximated to a relative error to using calls to the -decision problem. Furthermore, each supplied to the decision problem has .
The reduction is common to most positive linear and semidefinite program solvers ; we briefly sketch the idea for completeness.
(Sektch) First, transform to the normalized form by “dividing through” by (see Appendix A). Since is within a factor of of the maximum eigenvalue of , gives a value that is within a factor of the optimum. Therefore, the optimization version can be solved by binary searching on the objective a total of at most iterations.
In each of these decision instances, we can rescale ’s so that the threshold in question is . In this setting, the total sum of with is at most . By not using these ’s, the optimum changes by an additive value of less than .
Solving Positive SDPs
In this section, we describe a parallel algorithm for solving the decision version of positive packing SDPs, inspired by Young’s algorithm for positive LPs. The following theorem presents the guarantees of our algorithm.
Let . There is an algorithm decisionPSDP that given a positive SDP, solves the -decision problem in iterations.
Presented in Algorithm 3.1 is an algorithm that we will show to satisfy the theorem.
The algorithm is a multiplicative weights update routine, which proceeds in several rounds. The starting solution is . This solution is chosen to be small so that , hence respecting the dual constraint. Each subsequent update is a multiple of the current solution, so this is also chosen to ensure that these updates make rapid progress.
In each iteration following that, the algorithm computes
Our presentation follows the multiplicative weights update framework from Arora-Kale and Kale . Several intermediate variables are helpful for further discussion. Define the cumulative sum corresponding to as
This allows us to write the exponential as . We will also specifically define the probability matrix by which we use to pick the update coordinates:
This matrix is easier to work with because it has trace :
Using this “probability” matrix , the algorithm identifies which coordinates to update. These are the coordinates ’s for which their contributions with respect to the ’s are still small—. Each of these coordinates will be incremented by , where . Therefore, in terms of , we have
The main loop in Algorithm 3.1 terminates when or the number of iterations exceeds a preset threshold . For a desired accuracy parameter , we set to so that the additive term from Theorem 2.1 can be absorbed into the relative error. This additive term comes from the starting point .
The choice of may seem mysterious at this point; it is chosen to prevent us from taking a step that is too big from the current solution while still making substantial progress. It ensures that
The update has small width, aka. , and
We cannot overshoot by much when exiting from the while loop, .
Our analysis of decisionPSDP and in turn our proof of Theorem 3.1 hinge on two complementary processess: show that after steps, is indeed a feasible primal solution, or that if the the algorithm terminates because , then is a feasible dual solution. In the latter case, we have
For this to be a dual solution, we still need to show that it satisfies .
Let be the final iteration count (i.e., the final ). To meet the requirement above, we only need to show that . We prove the following spectrum bound:
We prove this lemma by induction on the iteration number, resorting to properties of the MMW algorithm (Theorem 2.1), which relates the final spectral values to the “gain” derived at each intermediate step.
. In other words, .
Now for any iteration , we can rewrite as
since both sums yield positive semidefinite matrices and the of the first sum is at most by Claim 3.3.
Proof of Lemma 3.2: We will prove (3.5) by (strong) induction on . The base case of is true by Claim 3.3. For a given , if we inductively assume that for all , then for each ,
This makes Theorem 2.1 applicable, which gives
which allows us to conclude that , as desired.
It remains to show the claims about the algorithm utilized in the above proof.
Our choice of guarantees that for all ,
Summing across gives the desired bound.
Now every , though can be empty, has the property that , so
It remains to examine the case where the algorithm returns a primal solution: Equation (3.3) gives
Furthermore, this satisfies the primal constraints:
If —i.e, the algorithm exits the while-loop because it reaches iterations—then for all , .
Assume for a contradiction that there is an such that . This means
Let be the iterations in which the -th coordinate of is incremented. By Markov’s inequality, the number of such iterations is bounded by . But then, every time the -th coordinate changes, it increases by a factor of , so
Because , the algorithm exits the while-loop with . Therefore, for ,
as by Lemma 2.2. This is a contradiction to , which proves the lemma.
Proof of Theorem 3.1: The algorithm terminates after at most iterations. Notice that may be empty in some iterations but this does not harm the algorithm nor the proof. It is standard to check that . If we do run this many iterations, then Lemma 3.6 gives that we terminate with a primal solution.
Otherwise, Lemma 3.2 gives that at any point in the algorithm, the solution vector satisfies . Together with Equation (3.4), we know that any returned and
Thus, is indeed a dual solution with value at least . Replacing with then meets the requirements of the decision problem.
Matrix Exponential Evaluation
We describe a fast algorithm for computing the matrix dot product of a positive semidefinite matrix and the matrix exponential of another positive semidefinite matrix.
There is an algorithm bigDotExp that when given a -by- matrix with non-zero entries, , and -by- matrices in factorized form where the total number of nonzeros across all is ; computes approximations to all in depth and work.
The idea behind Theorem 4.1 is to approximate the matrix exponential using a low-degree polynomial because evaluating matrix exponentials exactly is costly. For this, we will apply the following lemma, reproduced from Lemma 6 in :
If is a PSD matrix such that , then the operator
The given factorization of each allows us to write as the -norm of a vector:
By Lemma 4.2, it suffices to evaluate where is an approximation to . To further reduce the work, we can apply the Johnson-Lindenstrauss transformation to reduce the length of the vectors to ; specifically, we find a Gaussian matrix and evaluate
Since only has rows, we can compute using evaluations of . The work/depth bounds follow from doing each of the evaluations of , where denotes the -th column of , and matrix-vector multiplies involving in parallel.
Conclusion
We presented a simple NC parallel algorithm for packing SDPs that requires iterations, where each iteration involves only simple matrix operations and computing the trace of the product of a matrix exponential and a positive semidefinite matrix. When a positive SDP is given in a factorized form, we showed how the dot product with matrix exponential can be implemented in nearly-linear work, leading to an algorithm with work, where is the number of constraint matrices, is the dimension of these matrices, and is the total number of nonzero entries in the factorization.
Compared to the situation with positive LPs, the classification of positive SDPs is much richer because packing/covering constraints can take many forms, either as matrices (e.g. for packing, for covering) or as dot products between matrices (e.g. for packing, for covering). The positive SDPs studied in and our paper should be compared with the closely related notion of covering SDPs studied by Iyengar et al ; however, among the applications they examine, only the beamforming SDP relaxation discussed in Section 2.2 of falls completely within the framework of packing SDPs as defined in 2.10. Problems such as MaxCut and SparsestCut require additional matrix-based packing constraints. We believe extending these algorithms to solve mixed packing/covering SDPs is an interesting direction for future work.
Acknowledgments
This work is partially supported by the National Science Foundation under grant numbers CCF-1018463, CCF-1018188, and CCF-1016799 and by generous gifts from IBM, Intel, and Microsoft. Richard Peng was partly supported by a Microsoft Research PhD. Fellowship.
We thank the SPAA reviewers, as well as Guy Blelloch and Gary Miller for suggestions that helped improve this paper. While making this revision, we benefitted greatly from discussions with Jon Kelner and Di Wang.
References
Appendix A Normalized Positive SDPs
This is the same transformation that Jain and Yao presented ; we only present it here for easy reference.
Consider the primal program in (1.4). It suffices to show that it can be transformed into the following program without changing the optimal value:
We can make the following assumptions without loss of generality: First, for all because if were , we could have thrown it away. Second, all ’s are the support of , or otherwise we know that the corresponding dual variable must be set to and therefore can be removed right away. Therefore, we will treat as having a full-rank, allowing us to define
It is not hard to verify that the normalized program (A.4) has the same optimal value as the original SDP (1.4).
Note that if we’re given factorization of into , then can also be factorized as:
Furthermore, it can be checked that the dual of the normalized program is the same as the dual in Equation 2.10.