Randomized Rounding for the Largest Simplex Problem
Aleksandar Nikolov
Introduction
The -MVS problem can be easily reduced to a a problem about subdeterminants of positive semidefinite matrices (see Lemma 5 for the reduction). For an matrix , let be the submatrix with rows indexed by and columns indexed by . In the maximum -subdeterminant problem (-MSD) we are given an positive semidefinite matrix of rank , and the goal is find a set of cardinality so that is maximized. The -MVS problem in dimensions can be reduced to solving instances of the -MSD problem for matrices of rank , and the reduction is approximation preserving.
The -MSD problem was also independently studied in the context of low-rank approximations. The optimal row-rank approximation of a matrix is well understood, and for both the operator and the Frobenius norm (and in fact any unitarily invariant matrix norm) is given by the the projection of the rows and columns of onto the top singular vectors. However, an approximation in terms of a submatrix of often has a better explanatory value. For example, if is a matrix in which each row is a data point, and if is well-approximated by its projection onto the span of of its columns, we can argue that, at least intuitively, these columns represent important features in the data. Goreinov and Tyrtyshnikov [GT01] gave one formalization of this intuition, which we cite next.
Let be an matrix and let be an optimal solution to the -MSD problem for . Then, for we have
where is the -st largest singular value of .
Hereditary discrepancy has an important application to rounding problems: roughly speaking, if the hereditary discrepancy of the matrix is low, then any solution to a linear system can be rounded to an integral vector without introducing a lot of error. In this sense, hereditary discrepancy generalizes total unimodularity, which is equivalent to [GH62]. The following theorem, proved by Lovász, Spencer, and Vesztergombi, makes this connection precise.
In fact the theorem holds with hereditary discrepancy defined in terms of any norm.
The following theorem shows that gives nearly-tight upper and lower bound on . While not explicitly stated in this form, the theorem can be proved by modifying the arguments in [LSV86, Mat11] in a straightforward way.
There exists a constant such that for any matrix ,
If for each we have a factor approximation for -MSD, then we get a factor -approximation to , and, therefore, a factor approximation to .
On the algorithmic side, the best known approximation for -MSD is for a constant , proved by Di Summa et al. [DEFM14]. They show that this approximation is achieved by a classical algorithm by Khachiyan [Kha95], for which they give a new analysis. This also implies a factor approximation for -MVS (see Lemma 5). For , the best approximation known is of the form for a constant : algorithms with this guarantee were given by Packer [Pac04] for -MVS and by Çivril and Magdon-Ismail [ÇM13] for -MSD.
2 Our Contribution
In this paper we design deterministic polynomial time approximation algorithms for the -MSD, and, therefore, also the -MVS, problems. Our main result is the following theorem.
There exists a deterministic polynomial time algorithm which approximates the -MSD problem within a factor of . This implies that there also exists a deterministic polynomial time algorithm which approximates the -MVS problem within a factor of
This is the first approximation algorithm for -MSD and -MVS with an approximation factor of the form , which matches the known hardness results up to the constant in the exponent. It is natural to conjecture that it is -hard to approximate -MSD within a factor for any . We leave this as an open problem.
Theorem 4 implies a factor approximation to for any matrix , and, therefore, a factor approximation to . The latter result also follows from the techniques of the author and Talwar [NT14]. However, our result gives the first constant-factor approximation to a natural variant of the determinant lower bound. It is an interesting open problem to extend this result to the determinant lower bound for , which is equal to
We also use our techniques to give an elementary and short proof of a variant of the restricted invertibility principle of Bourgain and Tzafriri [BT87].
3 Techniques
Our strategy for approximating -MSD when is similar, but the analysis becomes more complicated. For motivation, let us consider the case, in which we simply need to compute the largest diagonal entry of the input matrix , or, working with the columns of the square root of , we need to compute the index such that has the largest squared Euclidean norm. Of course, this problem can be solved trivially in linear time by enumerating over the , but it is instructive to solve it using an approach similar to the one we used for -MSD. Consider the smallest enclosing ball problem for : minimize subject to being contained in a Euclidean ball of radius centered at . It is clear that the optimal is equal to the norm of the longest . The dual of the smallest enclosing ball problem is the problem of maximizing over probability vectors (a much more general version of this fact is proved in Theorem 15). This latter problem is our convex relaxation of -MSD. While this is a natural relaxation that we could have arrived at directly, without going through the smallest enclosing ball problem, our approach pays off when generalizing to the case , in which it is not clear how to come up directly with a natural convex relaxation of -MSD. The randomized rounding algorithm applied to the relaxation samples an index from the distribution determined by an optimal vector ; the expected squared length of is , i.e. exactly the objective value of the relaxation.
We follow a similar strategy for general . We define a minimization problem over ellipsoids centered at 0 that contain . The objective of the problem is to minimize the volume of the largest -dimensional section of the containing ellipsoid. It is not hard to show that this problem gives an upper bound on -MSD (Lemma 13). The main technical challenge is to derive the dual of this optimization problem and to analyze the natural randomized rounding algorithm applied to it. An important difference from the case is that the objective of the ellipsoid optimization problem is no longer differentiable, which complicates the analysis of the dual. When , the objective of the dual “splits” into two terms, one that resembles the case and another that resembles the case. To relate the expected value of the output of the rounding algorithm to this more complicated objective we use the theory of Schur-concave functions applied to the elementary symmetric polynomials.
We derandomize our algorithms using the method of conditional expectations. This approach and the use of the elementary symmetric polynomials to relate the eigenvalues of a matrix to its entries are inspired by the volume-sampling algorithms of Deshpande and Rademacher [DR10]. These sampling techniques together with the Schur concavity of ratios of elementary symmetric polynomials were used previously in the work of Guruswami and Sinop [GS12] on low rank matrix approximations.
Preliminaries
We use the notation for an integer . With we denote the set of size subsets of the set .
Lemma 5 implies that a factor approximation algorithm for -MSD implies a factor approximation algorithm for -MVS. For this reason, for the rest of the paper we will focus our attention on the -MSD problem.
2 Convex Analysis and Optimization
Consider an optimization problem in the following general form:
For any which is feasible for (1)–(2), and any , . This fact is known as weak duality. The Lagrange dual problem is defined as
Strong duality holds when the optimal value of (3) equals the optimal value of (1)–(2). Slater’s condition is a commonly used sufficient condition for strong duality. We state it next.
Assume in the problem (1)–(2) are convex functions over their respective domains, and for some , are affine functions. Let there be a point in the relative interior of the domains of , so that for and for . Then the optimal value of (1)–(2) equals the optimal value of (3), and the value of (3) is achieved if it is finite.
For more information on convex programming and duality, we refer the reader to the books by Boyd and Vandenberghe [BV04] and Rockafellar [Roc70].
3 Ellipsoids and John’s Theorem
This program corresponds to finding the minimum volume ellipsoid centered at 0 that contains . It is a convex minimization problem over the open domain with affine constraints, and, therefore, satisfies Slater’s condition. The dual problem to (4)–(6) is
Up to scaling of the variables , this is the -optimal design problem. For a proof of the duality, see [BV04, Sect. 5.1.6, 5.2.4, 7.5.2]; it also follows from the the more general Theorem 15. Since it is the dual of a convex minimization problem, (7)–(9) is a convex maximization problem. Then the following variant of John’s theorem is a direct consequence of strong duality for the program (4)–(6) (which is implied by Slater’s condition):
The optimal value of (4)–(6) is equal to the optimal value of (7)–(9).
4 Properties of Determinants
First we recall the classical Binet-Cauchy formula for the determinant of a matrix product. For any matrix , , we have
Let be the degree elementary symmetric polynomial, i.e.
Let be an symmetric matrix with eigenvalues . It is well-known that and . In fact a similar identity involving the entries of and its eigenvalues holds for all :
This fact is also classical and can be proved by expressing each coefficient of the characteristic polynomial of in two different ways: as a sum of subdeterminants, and as a symmetric polynomial of its roots.
5 Schur Convexity
We use the following classical fact about the Schur-concavity of elementary symmetric functions, proved by Schur.
The elementary symmetric polynomial of degree , , is Schur-concave.
The Full-Dimensional Case
In this section we discuss the special case . We treat this case separately because it is a natural problem in itself, and the technical details of our algorithm are simpler, while illustrating some of the key ideas of our approach.
We first prove a simple lemma which is essential to our analysis.
Let be a matrix with column vectors . Let give the probabilities for a distribution on , i.e. for all and . Let be a random multiset of elements, each sampled independently with replacement from according to the distribution determined by . Then
where is a diagonal matrix with the values on the main diagonal. The right hand side is equal to by the Binet-Cauchy formula (10). Since , this finishes the proof. ∎
We present our approximation algorithm for -MSD as Algorithm 1. The main approximation guarantee of the algorithm is given in Theorem 10.
Let the random multiset be the output of Algorithm 1 for input and an -optimal . Then
Observe first that for any of size . Then by Lemma 9, with , we have
The first inequality above follows by applying the AM-GM inequality to the eigenvalues of , and the last inequality is implied by the constraints (5). What we have shown is equivalent to the intuitive geometric fact that the volume of the largest simplex with one vertex at contained in the convex hull of is at most the volume of the largest simplex with one vertex at contained in the Löwner ellipsoid of (or in fact any ellipsoid containing these points).
as desired. The asymptotic estimate is a restatement of Stirling’s approximation of . ∎
Since (7)–(9) is a convex optimization problem, we can use the ellipsoid method to to compute an -optimal solution in time polynomial in [GLS81]. Khachiyan [Kha96] showed how to compute a -optimal solution to (7)–(9) (i.e. a multiplicative approximation to \det\Bigl{(}\sum_{i}c_{i}v_{i}v_{i}^{T}\Bigr{)}) using a polynomial in number of real value operations. Using either method with Algorithm 1, we get an approximation factor of in time polynomial in , , and .
In Section 5 we show how to derandomize Algorithm 1 using the method of conditional expectations.
The General Case
A natural first attempt to extend Algorithm 1 to general is to simply sample , rather than , coordinates from the distribution induced by an optimal solution to (7)–(9). A straightforward extension of the analysis in Section 3 shows that this algorithm achieves approximation factor , which is for but approaches for smaller . In order to achieve approximation for all , we generalize the Löwner ellipsoid problem. The rounding algorithm remains essentially the same, but the details of the analysis become more complicated.
It is not hard to see that for an ellipsoid is proportional to the product of the lengths of the longest major axes of . We use this observation to formulate the problem of finding -Löwner ellipsoid as a convex program. First we need to define the appropriate function on the space of positive definite matrices.
To show that is convex and continuous, and to characterize its subdifferentials, we will use a general result of Lewis, extending classical work by von Neumann on unitarily invariant matrix norms. Below we state a slightly specialized case of his result.
For a set , let us use the notation for the -dimensional indicator vector of , i.e. the -th coordinate of is if and otherwise. Let us define the convex polytope . This is the basis polytope of the rank uniform matroid. We can now prove the convexity of and characterize its subdifferentials.
The function is convex and continuous on the space of positive definite matrices. Moreover, for any with eigenvalues
the subdifferential of at is
Because is symmetric, Lemma 11 implies that in order to show that is convex and continuous, we only need to show that is convex and continuous. Because the function is monotone decreasing in , we can write as
Since , and each is differentiable, we have
The gradient is given by for and otherwise. We have
This implies the desired characterization of . ∎
Geometrically, the lemma captures the following fact. Let be an ellipsoid centered at and containing . Then the volume of the largest -dimensional simplex in the convex hull of with one vertex at is at most the volume of the largest -dimensional simplex in with one vertex at . Moreover, the latter simplex is contained in the -dimensional subspace whose intersection with has the largest volume. In the formal proof below we use a linear algebraic argument.
Let be the eigenvalues of , and let be the eigenvalues of . By the Cauchy interlace theorem, for any , and, therefore,
On the other hand, by applying the AM-GM inequality to the eigenvalues of the matrix , we get
The final inequality above follows from the constraints (14). Combining the inequalities gives the desired bound. ∎
2 Duality for j𝑗j-Löwner Ellipsoids
As mentioned above, the program (13)–(15) that we used to capture -Löwner ellipsoids is convex and satisfies Slater’s condition. Therefore, it admits a dual characterization, which we will use in our algorithm. In this section we derive the dual characterization using the Lagrange dual function.
Before we introduce the dual, or even properly define its objective function, we need to prove a technical lemma.
Let be non-negative reals, and let be a positive integer. There exists a unique integer , , such that
Define . If holds, then (16) is satisfied for , and we are done. So let us assume that . Then , and the first inequality in (16) is satisfied for . If the second inequality is also satisfied we are done, so let us assume that , which implies the first inequality in (16) for . Continuing in this manner, we see that if the inequalities (16) are not satisfied for any , then we must have . But the second inequality for , i.e. is always satisfied because all the are non-negative, so we have that if (16) does not hold for any , then it must hold for . This finishes the proof of existence.
This completes the proof of uniqueness. ∎
We now introduce a function which will be used in formulating a dual characterization of (13)–(15).
We will prove that the dual of (13)–(15) is equivalent to the following optimization problem:
The program (17)–(19) is a convex optimization problem, and its optimal value is equal to the optimal value of (13)–(15).
Observe that when , , so that (13)–(15) in this case reduces to (7)–(9). I.e. Theorem 15 generalizes Lemma 7.
To prove Theorem 15, we need two additional technical lemmas. The first one is well-know and follows from more general results characterizing the facets of the basis polytope of a matroid [Sch03].
For any and ,
The next lemma is the key technical ingredient in the proof of Theorem 15.
Let be a matrix of rank at least . Then there exists a matrix such that , and .
Let be the rank of , and let be its eigenvalues. Let be an orthonormal matrix such that for . Assume that is a positive integer strictly smaller than such that and define . A unique such choice of exists by Lemma 14. Moreover, since has rank at least , , which also implies . Therefore, the following vector is well-defined for any :
Let us set . By Lemma 12, to prove that , it suffices to show that . This inclusion follows from Lemma 16 because, by the choice of and , for all , and .
The equality follows by a calculation. By the choice of and , . Therefore, the smallest eigenvalues of are , and we have
Let us define to be the domain for the constraints (14) and the objective function (13). This makes the constraint implicit. The optimization problem is convex by Lemma 12. Is is also always feasible: for example, if , then is a feasible solution. Slater’s condition is therefore satisfied and strong duality holds. To prove the theorem, it suffices to show that the dual problem to (13)–(15) is equivalent to (17)–(19).
The Lagrange dual function for (13)–(15) is
A matrix achieves the minimum above if and only if , which, by the additivity of subgradients, is equivalent to . Define . Consider first the case in which has rank less than . Let be a parameter, and let be an orthogonal projection matrix onto the nullspace of . Consider the matrix . The sum remains bounded for all , while goes to as . Therefore in this case.
Next we consider the case in which has rank at least . Then, by Lemma 17, there exists a such that , and, therefore, this achieves . From Lemma 12 and , it follows that . Also using the fact that, by Lemma 17, , we have
To finish the proof we show that any that maximizes the right hand side above satisfies , and, therefore, the optimal value of the dual problem, , is equal to the optimal value of (17)–(19). Let us fix some arbitrary such that for all and , and consider the function , defined over positive real numbers . It will be enough to show that the unique maximizer of is . Since is a restriction of a convex function, it is also convex, and it is enough to show that is the unique solution of . Let be the eigenvalues of , and let be the unique integer with which is computed, i.e. is such that . The eigenvalues of are , so the condition is clearly satisfied, and, by Lemma 14, this choice of is unique. Therefore, \Gamma_{j}\Bigl{(}\sum_{i=1}^{n}{c_{i}v_{i}v_{i}^{T}}\Bigr{)} and \Gamma_{j}\Bigl{(}\sum_{i=1}^{n}{tc_{i}v_{i}v_{i}^{T}}\Bigr{)} are computed with the same . By the basic properties of logarithms, \Gamma_{j}\Bigl{(}t\sum_{i=1}^{n}{c_{i}v_{i}v_{i}^{T}}\Bigr{)}=\Gamma_{j}\Bigl{(}\sum_{i=1}^{n}{c_{i}v_{i}v_{i}^{T}}\Bigr{)}+j\ln t, and, therefore,
The derivative vanishes only at , which implies that is maximized at . This proves the claim and finishes the proof of the theorem. ∎
3 The Rounding Algorithm
Our rounding algorithm, shown as Algorithm 2, is nearly identical to Algorithm 1, except for using probability weights proportional to an optimal solution of (17)–(19). The approximation guarantee for the algorithm is given by the following theorem.
Let the random multiset be the output of Algorithm 1 for input and -optimal . Then
Let us define for , and . If some element in repeats, then . On the other hand, each set can be sampled in different ways, one for each ordering of its elements. We can then write the expectation of as
The asymptotic estimate is again just Sterling’s approximation to . This completes the proof of the theorem. ∎
Since (17)–(19) is a convex optimization problem, we can use the ellipsoid method to to compute an -optimal solution in time polynomial in [GLS81]. Together with Algorithm 2, we get an approximation factor of in time polynomial in , , and . It is also conceivable that the barycentric coordinate descent method of Khachiyan [Kha96] can be extended to solve (17)–(19).
In Section 5 we show how to derandomize Algorithm 2 using the method of conditional expectations.
Derandomizing the Algorithms
In Theorems 10 and 18 we only proved our approximation guarantees in expectation. A priori, this does not give a useful bound on the probability that the set output by Algorithm 1 or 2 is close to optimal. However, it is not hard to derandomize the algorithms using the method of conditional expectation. The deterministic algorithm is presented as Algorithm 3.
By the method of conditional expectation [AS08], and Theorem 18, it is enough to show that
where the expectation is over the distribution on the output of Algorithm 2. Expanding the expectation on the right hand side, we have
The second equality follows from the “base times height” formula for the determinant. The penultimate equality follows from (11). ∎
To implement Algorithm 3, we need to be able to evaluate the elementary symmetric polynomial . This can be done in polynomial time by expanding the characteristic polynomial of the matrix : is equal to times the coefficient of the term of degree .
Restricted Invertibility Principles
The celebrated Restricted Invertibility Principle (RIP) of Bourgain and Tzafriri [BT87] is a powerful generalization of the simple fact in linear algebra that a matrix of rank has an invertible submatrix with at least columns. The RIP shows that if the “robust rank” of a matrix is large, i.e. the matrix has many large singular values, then it contains a proportionally large submatrix which is well-invertible, i.e. its inverse is bounded in operator norm. The RIP has had many applications in Banach space theory, asymptotic convex geometry, statistics, and recently in discrepancy theory and private data analysis.
The version of the RIP above was proved by Spielman and Srivastava for . However, essentially the same proof shows the slight generalization formulated above.
Next we state our version of the RIP for determinants.
Moreover, such a set can be computed in deterministic polynomial time.
where , is the vector of eigenvalues of the matrix , and the final equality follows by (11).
Let us identify with a matrix in the natural way. The matrix
Note that the entries of (i.e. the eigenvalues of ) are equal to the squared singular values of , and, therefore, and . To complete the proof, it remains to show that . Because is Schur concave, it is enough to show that \lambda\prec\bar{\lambda}\stackrel{{\scriptstyle def}}{{=}}\Bigl{(}\frac{\|\lambda\|_{1}}{j},\ldots,\frac{\|\lambda\|_{1}}{j},0,\ldots,0\Bigr{)}, where has non-zero coordinates. Indeed, for any , by the choice of ,
For , , with equality for . Therefore, and . This finishes the proof of the main claim.
A set such that satisfies the conclusion of the theorem can be computed in deterministic polynomial time via the method of conditional expectations, as in the proof of Theorem 19. ∎
Theorem 21 is incomparable with Theorem 20. On one hand, the conclusion of Theorem 20 is of a qualitatively stronger type: it implies a lower bound on the smallest singular value of the matrix . On the other hand, the set in Theorem 21 can be as large as the (floor function of the) robust rank , while this is in general not possible in Theorem 20.
Conclusion
We have given a polynomial time deterministic algorithm that approximates the -MSD problem by a factor of , and, therefore, the -MVS problem by a factor of . Our algorithms use randomized rounding with a generalization of the -optimal design problem. The analysis relies on convex duality, Schur convexity, and elementary properties of determinants.
We conjecture that approximating the -MSD problem within a factor of is -hard for any . As an easier problem, it will be interesting to construct an input for which the -Löwner ellipsoid approximates -MSD no better than a factor of , or to give a better analysis.
We also leave open the problem of computing a constant factor approximation to the determinant lower bound on hereditary discrepancy. Finally, it will be interesting to generalize Khachiyan’s barycentric coordinate descent algorithm for the -optimal design problem to the dual of the -Löwner ellipsoid problem.
Acknowledgements
I thank Kunal Talwar and Nikhil Bansal for useful discussions.