Compressed Sensing: How sharp is the Restricted Isometry Property
Jeffrey D. Blanchard, Coralia Cartis, Jared Tanner
Introduction
Questions of this form must have been around for millennia. Consider this puzzle: “A counterfeit coin is hidden in a batch of otherwise similar coins; it is distinguished from the others by its slightly heavier weight. How many balance weighings are needed to find the counterfeit?” Abstractly, this concerns the special case where is an unknown singleton and the nonzero value is nonnegative; the balance is abstractly the same as an inner product which gives weight +1 to the coefficients placed in the “right” pan and -1 to the coefficients placed in the “left” pan. Many people quickly find that roughly measurements suffice to find the position and value of the nonzero, each time putting half the remaining coins in one pan, half in the other, and discarding from further consideration the coins that turn out on the light side. Lighthearted as puzzles can sometimes seem, they can lead to serious applications.
During World War Two, efficient screening of large groups of soldiers for certain infections was based on the principle of group testing, in which blood from many soldiers is combined in a single tube and tested for presence of an infectious agent. If an infection is found, one studies that group and by dyadic subdivision eventually isolates the infecteds .
More advanced mathematics can do much better than such common-sense ideas. Those with a physical bent may quickly see that, if is prime, again assuming a singleton and a nonnegative , it will be enough, in fact, to make only 2 inner products, with respectively a sine and cosine of frequency ; the phase of the corresponding complex Fourier coefficient immediately reveals the position of the nonzero. Note here that, for large , we are doing dramatically better than common-sense (2 measurements rather than ).
Advanced mathematics is better than the common-sense approach in another way: common-sense uses adaptive measurements, where the next measurement vector is selected after viewing all previous measurements. In the advanced approach, adaptivity is unnecessary: one simply makes 2 measurements defined a priori and later combines the two to reconstruct.
Compressed Sensing (CS) embodies the advanced approach: it designs a special matrix of size , measures via , giving measurements of the vector in parallel, and reconstructs from using computationally efficient and stable algorithms. The key point is that can be taken much smaller than , and much closer to . For example, if is known to be -sparse and nonnegative suffices and if is only known to be -sparse, roughly will suffice, if is small .
Since the release of the seminal CS papers in 2004, , a great deal of excitement has been generated in signal processing and applied mathematics research, with hundreds of papers on the theory, applications, and extensions of compressed sensing (more than 400 of these are collected at Rice’s online Compressive Sensing Resources archive dsp.rice.edu/cs). Many applications have been proposed, including magnetic resonance imaging , radar , and single-pixel cameras to name a few. In the MRI applications, it has been reported that diagnostic quality images can be obtained in the recording time using CS approaches, . For a recent review of CS see the special issue containing and for a review of sparse approximation see .
which is the convex relaxation of the computationally intractable decoder, , seeking the sparsest solution in agreement with the measurements
Following the usual convention in the CS community, counts the number of nonzero entries in . Many other encoder/decoder pairs are also being actively studied, with new alternatives being proposed regularly; see Section 3.
To quantify the sparsity/undersampling trade off, we adopt a proportional-growth asymptotic, in which we consider sequences of triples where all elements grow large in a coordinated way, and for some constants . This defines a two-dimensional phase space in for asymptotic analysis.
A sequence of problem sizes is said to grow proportionally if, for , and as .
Ultimately, we want to determine, as precisely as possible, which subset of this phase space corresponds to successful recovery and which subset corresponds to unsuccessful recovery. This is the phase-transition framework advocated by Donoho et al ; see Section 3 for a precise definition. By translating the sufficient RIP conditions into the proportional-growth asymptotic, we find lower bounds on the phase-transition for in . An answer to this question plays the role of an undersampling theorem: to what degree can we undersample a signal and still be able to reconstruct it?
to shed some light on the behavior of the RIP constants of a matrix ensemble with as much precision as possible;
to introduce a reader new to this topic to the type of large deviation analysis calculations often encountered in CS and applicable to many areas faced with combinatorial challenges.
We conclude with a discussion of some other important and related topics not addressed in the current paper. We briefly discuss comparisons of results when noise is present in the measurements or the signal is not perfectly -sparse, average case analysis versus the theoretical worst case analysis presented here, and the potential to improve the phase transition curves through improved analysis or improved bounds.
Bounds on RIP for Gaussian Random Matrices
The asymmetric way that the expected eigenvalues and deviate from 1 suggests that the symmetric treatment used by the traditional RIP is missing an important part of the picture. We generalize the RIP to an asymmetric form and derive the sharpest recovery conditions implied by the RIP.
For a matrix of size , the asymmetric RIP constants and are defined as:
(A similar change in the definition of the RIP constants was used independently by Foucart and Lai in , motivated by different concerns.)
Although both the smallest and largest singular values of affect the stability of the reconstruction algorithms, the smaller eigenvalue is dominant for compressed sensing in that it allows distinguishing between sparse vectors from their measurement by . In fact, it is often incorrectly stated that is a necessary condition to ensure that there are no two -sparse vectors, say and , with the same measurements ; the actual necessary condition is .
We see from (4) and (5) that and with . A standard large deviation analysis of bounds on the probability density functions of and allows us to establish upper bounds of and which are exponentially unlikely to be exceeded.
Let be a matrix of size drawn from the Gaussian ensemble and consider the proportional-growth asymptotic ( and as ). Let denote the usual Shannon Entropy with base logarithms, and let
Define and as the solution to (8) and (9), respectively:
Define and as
To facilitate ease of calculating and , web forms for their calculation are available at ecos.maths.ed.ac.uk.
In the proportional growth asymptotic, the probability that and bound the random variables and , respectively, tends to 1 as . In statistical terminology, the coverage probability of the upper confidence bounds and tends to one as . In fact, all probabilities presented in this manuscript converge to their limit “exponentially in ”; that is, the probability for finite approaches its limit as grows with discrepancy bounded by a constant multiple of for some fixed .
Fix . Under the proportional-growth asymptotic, Definition 2, sample each matrix from the Gaussian ensemble. Then
Extensive empirical estimates of and show that the bounds and are rather sharp; in fact, they are no more than twice the actual upper bounds on and , see Figure 4 and Table 1, and are much closer for the region applicable for CS decoders, . The empirically observed lower bounds on and are calculated through the following process. The number of rows, , is fixed at one of the values in Table 1. For each , 47 values of are selected so that ranges from to . For each a matrix of size is drawn from and either the algorithm from or is applied to determine support sets of size which are candidates for the support sets that maximize or . The largest or smallest eigenvalue of each resulting submatrix is calculated and recorded. The above process is repeated for some number of matrices, see the caption of Table 1, and the maximum value recorded. The empirical calculation of RIP constants are lower bounds on the true RIP constants as the support sets calculated by and may not be the support sets which maximize the RIP constants.
In order to prove Theorem 5, this section employs a type of large deviation technique often encountered in CS and applicable in fact, to many areas faced with combinatorial challenges.
We first establish some useful lemmas concerning the extreme eigenvalues of Wishart matrices. The matrix generates different Wishart matrices . Exponential bounds on the tail probabilities of the largest and smallest eigenvalues of such Wishart matrices can be combined with exponential bounds on to control the chance of large deviations using the union bound. This large deviation analysis technique is characteristic of proofs in compressed sensing. By using the exact probability density functions on the tail behavior of the extreme eigenvalues of Wishart matrices the overestimation of the union bound is dramatically reduced. We focus on the slightly more technical results for the bound on the most extreme of the largest eigenvalues, , and prove these statements in full detail. Corresponding results for are stated with their similar proofs omitted.
The probability density function, , for the largest eigenvalue of the Wishart matrix was determined by Edelman in . For our analysis, a simplified upper bound suffices.
Let be a matrix of size whose entries are drawn i.i.d from . Let denote the probability density function for the largest eigenvalue of the Wishart matrix of size . Then satisfies:
For our purposes, it is sufficient to have a precise characterization of ’s exponential (with respect to ) behavior.
where is a polynomial in .
Let be as defined in (11) and let . To extract the exponential behavior of we write where
Clearly, and can be subsumed as part of . To simplify , we apply the second of Binet’s log gamma formulas [52, Sec. 12.32], namely where is a convergent, improper integral. With representing the constant and integral from Binet’s formula we then have
As it can be absorbed into and we have
To bound , we must simultaneously account for all Wishart matrices derived from . Using a union bound this amounts to studying the exponential behavior of . In the proportional-growth asymptotic this can be determined by characterizing , which from Lemma 7 is given by
Recall that is the usual Shannon Entropy with base logarithms.
Equipped with Lemma 7 and (13), Proposition 8 establishes as an upper bound on in the proportional-growth asymptotic.
Throughout this proof and are fixed, and we focus our attention on , often abbreviating in (13) as . We first verify that (9) has a unique solution. Since
is strictly decreasing on and is strictly concave. Combined with
and , there is a unique solution to (9), namely .
Select and let be such that , . First, we write the probability statement in terms of :
To bound the integral in (14) in terms of we write in terms of , , and as where
Since , the quantity is strictly decreasing in on . Therefore we have
Therefore, combining (14) and (15) we obtain
with the last inequality following from the strict concavity of . Since is strictly bounded away from zero and , we arrive at, for any
By the definition of in Definition 1, for ; combined with Proposition 8 for as
exponentially in , and taking a minimum over the compact set we arrive at the desired result. ∎
A similar approach leads to corresponding results for . Edelman also determined the probability density function, , for the smallest eigenvalue of the Wishart matrix . Here again, a simplified upper bound suffices:
Let be a matrix of size whose entries are drawn i.i.d. from . Let denote the probability density function for the smallest eigenvalue of the Wishart matrix of size . Then satisfies:
With Lemma 10, we establish a bound on the asymptotic behavior of the distribution of the smallest eigenvalue of Wishart matrix of size .
where is a polynomial in .
With Lemma 11, the large deviation analysis yields
Similar to the proof of Proposition 8, Lemma 11 and (19) are used to establish as an upper bound on in the proportional-growth asymptotic.
Let , and be a matrix of size whose entries are drawn i.i.d. from . Define where is the solution to (8). Then for any , in the proportional-growth asymptotic
The bound is strictly increasing in for any , and as a consequence no tighter bound can be achieved by minimizing over matrices of larger size as was done in Proposition 9.
RIP Undersampling Theorems
As stated at the end of the introduction, one of the central aims of this article is to advocate a unifying framework for the comparison of results in compressed sensing. Currently there is no general agreement in the compressed sensing community on such a framework, making it difficult to compare results obtained by different methods of analysis or to identify when new results are improvements over existing ones. Donoho has put forth the phase transition framework borrowed from the statistical mechanics literature and used successfully in a similar context by the combinatorial optimization community, see . This framework has been successfully employed in compressed sensing by Donoho et al, .
the oversampling rate of making measurements as opposed to the optimal oracle rate of making measurements when the oracle knows the support of :
For each value of there is a largest value of which guarantees successful recovery of .
We now formalize the phase transition framework described above.
The event StrongEquiv(alg) denotes the following property of an matrix : for every -sparse vector , the algorithm “alg” exactly recovers from the corresponding measurements .
For most compressed sensing algorithms and for a broad class of matrices, under the proportional-growth asymptotic there is a strictly positive function alg defining a region of the phase space which ensures successful recovery of every -sparse vector . This function, alg), is called the Strong phase transition function .
Consider the proportional-growth asymptotic with parameters . Draw the corresponding matrices from the Gaussian ensemble and fix . Suppose that we are given a function ;alg) with the property that, whenever alg), alg as . We say that ;alg) bounds a region of strong equivalence.
For any matrix of size with RIP constants and , for . Define
To translate this result into the phase transition framework for matrices from the Gaussian ensemble, we employ the RIP bounds (10) to the asymmetric RIP constants and . It turns out that naively inserting these bounds into (20) yields a bound on , see Lemma 18, and provides a simple way to obtain a bound on the region of strong equivalence.
and as the solution to .
The function is displayed as the red curve in Figure 5.
Fix . Consider the proportional-growth asymptotic with parameters . Draw the corresponding matrices from the Gaussian ensemble. Then
Theorem 5 and the form of imply a similar bound to the above with a modified dependence on . For any , with and , the probability, on the draw of from the Gaussian ensemble, that
is satisfied converges to one exponentially with . Since is non-decreasing in and is strictly increasing in for any and , it follows that the right-hand side of (23) can be bounded by the right-hand side of (22) for any fixed satisfying , by setting
(The upper bound on is imposed so that the second argument of and , , is in the admissible range of .) That the bound (22) is satisfied for all sufficiently small, and that the right hand side of (22) is strictly increasing in establishes that (22) is in fact satisfied probability on the draw of that converges to one exponentially in for any . ∎
and as the solution to .
The function is displayed as the blue curve in Figure 5.
Versions of Theorems 19 and 21 exist for finite values of , , but in each case the recoverability conditions rapidly approach the stated asymptotic limiting functions as grow; we do not further complicate the discussion with their rates of convergence.
3 Further Considerations
In this article, we have considered only the case of noiseless measurements, Regions of Strong Equivalence, and a particular result obtained via an eigenvalue analysis and the RIP. We briefly discuss some additional considerations for the phase transition framework.
3.2 Regions of Weak Equivalence and Average Case Performance
In many applications, it may not be imperative that the decoder is able to reconstruct every -sparse vector. Instead, one may be willing to lose a small fraction of all possible -sparse signals. This is the behavior observed when a decoder is tested on -sparse vectors whose support sets are drawn uniformly at random. Large scale empirical testing of CoSaMP, Subspace Pursuit and Iterated Hard Thresholding were compiled by Donoho and Maleki . Most sparse approximation algorithms do not have a theoretical average case analysis. The polytope analysis of Donoho and Tanner allows for analytical arguments providing a Region of Weak Equivalence where recovery is guaranteed for all but a but a small fraction of -sparse signals. An average case variant of the RIP is being developed, see .
3.3 Improving the RIP Phase Transition
Alternatively, the region of strong equivalence might be increased by improving the bounds on the RIP constants, . If the method of analysis remained the same, we can explore the effects of improved bounds by examining the statements with empirically observed lower bounds on RIP constants for Gaussian matrices. As detailed in Table 1, the current bounds from Thm. 5 are no more than twice the empirical RIP constants. Replacing the RIP constants with empirically observed lower bounds of the RIP constants (for ) in gives us a function , see Figure 6, which is an upper bound on the region of strong equivalence implied by Thm. 15; this improvement is no more than 2.5 times for .
Foucart and Lai bounded the discrepancy between and any satisfying (25) in terms of the discrepancy between and its best sparse approximation,
Given , for any matrix of size with and with RIP constants and and defined as (20), if
then a solution of (26) approximates any satisfying (25) within the bounds
with and functions of , , and .
The parameter in Theorem 22 is a free parameter from the method of proof, and should be selected so as to maximize the region where (28) and/or other conditions are satisfied. For brevity we do not state the formulae for and as functions of , but only state them in Theorem 23 in terms of their bounds for Gaussian random matrices as .
Although the solution of (26), , has unknown sparsity, Theorem 22 ensures that if there is a solution of (25), , which can be well approximated by a sparse vector, i.e. if is small, then if (28) is satisfied the discrepancy between and will be similarly small. For instance, if the sparsest solution of (26), , is sparse, then (30) implies that ; moreover, if then will be sparse and satisfy (in the case this result is summarized as Theorem 15). Substituting bounds on the asymmetric RIP constants and from Theorem 5 we arrive at a quantitative version of Theorem 22 for Gaussian random matrices.
Given , for any , as with and , if where is the maximum over of the solutions, , of with defined as in (21), there is overwhelming probability on the draw of with Gaussian i.i.d. entries that a solution of (26) approximates any satisfying (25) within the bounds
The multiplicative “stability factors” are defined as:
with .
Acknowledgements. The authors would like to thank the editor and the referees for their useful suggestions that have greatly improved the manuscript.