Enhancing Sparsity by Reweighted L1 Minimization

Emmanuel J. Candes, Michael B. Wakin, Stephen P. Boyd

Introduction

In many instances, however, the object we wish to recover is known to be structured in the sense that it is sparse or compressible. This means that the unknown object depends upon a smaller number of unknown parameters. In a biological experiment, one could measure changes of expression in 30,000 genes and expect at most a couple hundred genes with a different expression level. In signal processing, one could sample or sense signals which are known to be sparse (or approximately so) when expressed in the correct basis. This premise radically changes the problem, making the search for solutions feasible since the simplest solution now tends to be the right one.

A common alternative is to consider the convex problem

In Section 3, we present a series of experiments demonstrating the superior performance and broad applicability of this algorithm, not only for recovery of sparse signals, but also pertaining to compressible signals, noisy measurements, error correction, and image processing. This section doubles as a brief tour of the applications of Compressive Sensing. In Section 4, we demonstrate the promise of this method for efficient data acquisition. Finally, we conclude in Section 5 with a final discussion of related work and future directions.

For the sake of illustration, consider the simple 3-D example in Figure 1, where $x_{0}=[0~{}1~{}0]^{T}$ and

2 An iterative algorithm

The question remains of how a valid set of weights may be obtained without first knowing $x_{0}$ . As Figure 1 shows, there may exist a range of favorable weighting matrices $W$ for each fixed $x_{0}$ , which suggests the possibility of constructing a favorable set of weights based solely on an approximation $x$ to $x_{0}$ or on other side information about the vector magnitudes.

We propose a simple iterative algorithm that alternates between estimating $x_{0}$ and redefining the weights. The algorithm is as follows:

Update the weights: for each $i=1,\ldots,n$ ,

Using an iterative algorithm to construct the weights $(w_{i})$ tends to allow for successively better estimation of the nonzero coefficient locations. Even though the early iterations may find inaccurate signal estimates, the largest signal coefficients are most likely to be identified as nonzero. Once these locations are identified, their influence is downweighted in order to allow more sensitivity for identifying the remaining small but nonzero signal coefficients.

3 Analytical justification

The iterative reweighted algorithm falls in the general class of MM algorithms, see and references therein. In a nutshell, MM algorithms are more general than EM algorithms, and work by iteratively minimizing a simple surrogate function majorizing a given objective function. To establish this connection, consider the problem

The equivalence means that if $x^{\star}$ is a solution to (7), then $(x^{\star},|x^{\star}|)$ is a solution to (8). And conversely, if $(x^{\star},u^{\star})$ is a solution to (8), then $x^{\star}$ is a solution to (7).

where ${\cal C}$ is a convex set. In (8), the function $g$ is concave and, therefore, below its tangent. Thus, one can improve on a guess $v$ at the solution by minimizing a linearization of $g$ around $v$ . This simple observation yields the following MM algorithm: starting with $v^{(0)}\in{\cal C}$ , inductively define

Each iterate is now the solution to a convex optimization problem. In the case (8) of interest, this gives

One now recognizes our iterative algorithm.

4 Variations

One could imagine a variety of possible reweighting functions in place of (6). We have experimented with alternatives, including a binary (large/small) setting of $w_{i}$ depending on the current guess. Though such alternatives occasionally provide superior reconstruction of sparse signals, we have found the rule (6) to perform well in a variety of experiments and applications.

Alternatively, one can attempt to minimize a concave function other than the log-sum penalty. For instance, we may consider

5 Historical progression

Numerical experiments

Figure 4(b) shows the performance, with a fixed value of $\epsilon=0.1$ , of the reweighting algorithm for various numbers of reweighted iterations. We see that much of the benefit comes from the first few reweighting iterations, and so the added computational cost for improved signal recovery is quite moderate.

2 Sparse and compressible signal recovery with adaptive choice of ϵitalic-ϵ\epsilon

Let $(|x|_{(i)})$ denote a reordering of $(|x_{i}|)$ in decreasing order of magnitude. Set

3 Recovery from noisy measurements

The parameter $\delta$ is adjusted so that the true vector $x_{0}$ be feasible (resp. feasible with high probability) for (9) in the case where $z$ is deterministic (resp. stochastic).

4 Statistical estimation

Again $\delta$ is a parameter making sure that the true unknown vector is feasible with high probability.

that is, by regressing $y$ onto the subset of columns indexed by $I$ .

We repeat the above experiment across 5000 trials. Figure 8 shows a histogram of the ratio $\rho^{2}$ between the squared error loss of some estimate $x$ and the ideal squared error

for both the unweighted and reweighted Gauss-Dantzig estimators. (The results are also summarized in Table 1.) For an interpretation of the denominator, the ideal squared error $\sum\min(x_{0,i}^{2},\sigma^{2})$ is roughly the mean-squared error one could achieve if one had available an oracle supplying perfect information about which coordinates of $x_{0}$ are nonzero, and which are actually worth estimating. We see again a significant reduction in reconstruction error; the median value of $\rho^{2}$ decreases from 2.43 to 1.21. As pointed out, a primary reason for this improvement comes from a more accurate identification of significant coefficients: on average the unweighted Gauss-Dantzig estimator includes 3.2 “false positives,” while the reweighted Gauss-Dantzig estimator includes only 0.5. Both algorithms correctly include all 8 nonzero positions in a large majority of trials.

5 Error correction

upon receiving the corrupted codeword $y=\Phi x_{0}+e$ ; here, $e$ is the unknown but sparse corruption pattern. The conclusion of is then that the solution to this program recovers $x_{0}$ exactly provided that the fraction of errors is not too large. Continuing on our theme, one can also enhance the performance of this error-correction strategy, further increasing the number of corrupted entries that can be overcome.

Select a vector of length $n=128$ with elements drawn from a zero-mean unit-variance Gaussian distribution, and sample an $m\times n$ coding matrix $\Phi$ with i.i.d. Gaussian entries yielding the codeword $\Phi x$ . For this experiment, $m=4n=512$ , and $k$ random entries of the codeword are corrupted with a sign flip. For the recovery, we simply use a reweighted version of (11). Our algorithm is as follows:

We set $\epsilon$ to be some factor $\beta$ times the standard deviation of the corrupted codeword $y$ . We run 100 trials for several values of $\beta$ and of the size $k$ of the corruption pattern.

6 Total variation minimization for sparse image gradients

where $(Dx)_{i,j}$ is the 2-dimensional vector of forward differences $(Dx)_{i,j}=(x_{i+1,j}-x_{i,j},x_{i,j+1}-x_{i,j})$ . Because many natural images have a sparse or nearly sparse gradient, it makes sense to search for the reconstruction with minimal TV norm, i.e.,

see , for example. It turns out that this problem can be recast as a second-order cone program , and thus solved efficiently.

We adapt (14) by minimizing a sequence of weighted TV norms as follows:

Solve the weighted TV minimization problem

Update the weights; for each $(i,j)$ , $1\leq i,j\leq n-1$ ,

Naturally, this iterative algorithm corresponds to minimizing a sequence of linearizations of the log-sum function $\sum_{1\leq i,j\leq n-1}\log(\|(Dx)_{i,j}\|+\epsilon)$ around the previous signal estimate.

To show how this can enhance the performance of the recovery, consider the following experiment. Our test image is the Shepp-Logan phantom of size $n=256\times 256$ (see Figure 10(a)). The pixels take values between and $1$ , and the image has a nonzero gradient at 2184 pixels. We measure $y$ by sampling the discrete Fourier transform of the phantom along $10$ pseudo-radial lines (see Figure 10(b)). That is, $y=\Phi x_{0}$ , where $\Phi$ represents a subset of the Fourier coefficients of $x_{0}$ . In total, we take $m=2521$ real-valued measurements.

In many problems, a signal may assume sparsity in a possibly overcomplete representation. To make things concrete, suppose we are given a dictionary $\Psi$ of waveforms $(\psi_{j})_{j\in J}$ (the columns of $\Psi$ ) which allows representing any signal as $x=\Psi\alpha$ . The representation $\alpha$ is deemed sparse when the vector of coefficients $\alpha$ has comparably few significant terms. In some applications, it may be natural to choose $\Psi$ as an orthonormal basis but in others, a sparse representation of the signal $x$ may only become possible when $\Psi$ is a redundant dictionary; that is, it has more columns than rows. A good example is provided by an audio signal which often is sparsely represented as a superposition of waveforms of the general shape $\sigma^{-1/2}g((t-t_{0})/\sigma)e^{i\omega t}$ , where $t_{0}$ , $\omega$ , and $\sigma$ are discrete shift, modulation and scale parameters.

In this setting, the common approach for sparsity-based recovery from linear measurements goes by the name of Basis Pursuit and is of the form

Our first example is motivated by our own research focused on advancing devices for analog-to-digital conversion of high-bandwidth signals. To cut a long story short, standard analog-to-digital converter (ADC) technology implements the usual quantized Shannon representation; that is, the signal is uniformly sampled at or above the Nyquist rate. The hardware brick wall is that conventional analog-to-digital conversion technology is currently limited to sample rates on the order of 1GHz, and hardware implementations of high precision Shannon-based conversion at substantially higher rates seem out of sight for decades to come. This is where the theory of compressive sensing becomes relevant.

Whereas it may not be possible to digitize an analog signal at a very high rate rate, it may be quite possible to change its polarity at a high rate. The idea is then to multiply the signal by a pseudo-random sequence of plus and minus ones, integrate the product over time windows, and digitize the integral at the end of each time interval. This is a parallel architecture and one has several of these random multiplier-integrator pairs running in parallel using distinct or event nearly independent pseudo-random sign sequences.

To show the promise of this approach, we take $x_{0}$ to be a 1-D signal of length $n=512$ which is a superposition of two modulated pulses (see Figure 11(a)). From this signal, we collect $m=30$ measurements using an $m\times n$ matrix $\Phi$ populated with i.i.d. Bernoulli $\pm 1$ entries. This is an unreasonably small amount of data corresponding to an undersampling factor exceeding 17. For reconstruction we consider a time-frequency Gabor dictionary that consists of a variety of sine waves modulated by Gaussian windows, with different locations and scales. Overall the dictionary is approximately $43\times$ overcomplete and does not contain the two pulses that comprise $x_{0}$ .

2 Frequency sampling of biomedical images

Compressed sensing can help reduce the scan time in Magnetic Resonance Imaging (MRI) and offer sharper images of living tissues. This is especially important because time consuming MRI scans have traditionally limited the use of this sensing modality in important applications. Simply put, faster imaging here means novel applications. In MR, one collects information about an object by measuring its Fourier coefficients and faster acquisition here means fewer measurements.

We mimic an MR experiment by taking our unknown image $x_{0}$ to be the $n=256\times 256=65536$ pixel MR angiogram image shown in Figure 12(a). We sample the image along 80 lines in the Fourier domain (see Figure 12(b)), effectively taking $m=18737$ real-valued measurements $y=\Phi x_{0}$ . In plain terms, we undersample by a factor of about 3.

Figure 12(c) shows the minimum energy reconstruction which solves

Discussion

for $i=1,2,\dots,n$ . For nonzero signal coefficients, it is shown that each step of FOCUSS is equivalent to a step of the modified Newton’s method for minimizing the function

subject to $y=\Phi x$ . As the iterations proceed, it is suggested to identify those coefficients apparently converging to zero, remove them from subsequent iterations, and constrain them instead to be identically zero.

2 Future directions

As shown in Section 2, when there is a sparse solution and the reweighted algorithm finds it, convergence may occur in just very few steps. It would be of interest to understand this phenomenon more precisely.

What are smart and robust rules for selecting the parameter $\epsilon$ ? That is, rules that would automatically adapt to the dynamic range and the sparsity of the object under study as to ensure reliable performance across a broad array of signals. Of interest are ways of updating $\epsilon$ as the algorithm progresses towards a solution. Of course, $\epsilon$ does not need to be uniform across all coordinates.

We mentioned the use of other functionals and reweighting rules. How do they compare?

Finally, any result quantifying the improvement of the reweighted algorithm for special classes of sparse or nearly sparse signals would be significant.

Acknowledgments

E. C. was partially supported by a National Science Foundation grant CCF-515362, by the 2006 Waterman Award (NSF) and by a grant from DARPA. This work was performed while M. W. was an NSF Postdoctoral Fellow (NSF DMS-0603606) in the Department of Applied and Computational Mathematics at Caltech. S. B. was partially supported by NSF award 0529426, NASA award NNX07AEIIA, and AFOSR awards FA9550-06-1-0514 and FA9550-06-1-0312. We would like to thank Nathaniel Braun and Peter Stobbe for fruitful discussions about this project. Parts of this work were presented at the Fourth IEEE International Symposium on Biomedical Imaging (ISBI ‘07) held April 12–15, 2007 and at the Von Neumann Symposium on Sparse Representation and High-Dimensional Geometry held July 8–12, 2007. Related work was first developed as lecture notes for the course EE364b: Convex Optimization II, given at Stanford Winter quarter 2006-07 .