Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising
David Donoho, Iain Johnstone, Andrea Montanari
Introduction
In the noiseless compressed sensing problem, we are given a collection of linear measurements of an unknown vector :
Here the measurement matrix is by , , and the -vector is the object we wish to recover. Both and are known, while is unknown and we seek to recover an approximation to .
Since , the equations are underdetermined. It seems hopeless to recover in general, but in compressed sensing one also assumes that the object is sparse in the appropriate sense. Suppose that the object is known to be -sparse, i.e. to have nonzero entries. If the problem dimensions are large, many recovery algorithms exhibit the phenomenon of phase transition.
Explicitly, let and denote the sparsity and undersampling parameters, respectively. Hence defines a phase space for the different kinds of limiting situations we may encounter as grow large. For a variety of algorithms and Gaussian matrices with iid entries, one finds that this phase space can be partitioned into two phases: “success” and “failure”. Namely, for a given algorithm and given sparsity fraction , there exists a critical fraction such that if the sampling rate is larger than the critical value, , then the algorithm is successful in recovering the underlying object with high probabilityThroughout the paper, we will write that an event holds with high probability (w.h.p.) if its probability converges to in the large system limit with and fixed., while if the algorithm is unsuccessful, also with high probability. In particular, means that it is indeed possible to undersample and still recover the unknown signal. In fact shows precisely the limits of allowable undersampling. By now a large amount of empirical and theoretical knowledge has been compiled about the phase transitions exhibited by different algorithms: we refer the reader to [DT10b, Don06, DT09a, XH10, BCT11, Sto10, KWT09, DMM09, DMM11, Wai09]. In a parallel line of work, a number of sufficient conditions under which undersampling is possible using deterministic matrices have been studied, see e.g. [CT05, BRT09, BGI+08, Can06].
It is however fair to say that the research focused so far on ‘unstructured’ notions of sparsity whereby simply counts the number on non-zero entries in . (We refer to Section 1.7 for an overview of related literature.) On the other hand, applications naturally lead to ‘structured’ notions of sparsity. This paper applies an algorithm framework - Approximate Message Passing (AMP) - to construct specific algorithms applicable to a variety of compressed sensing settings, including block and structured sparsity, convex and nonconvex penalization, and develops a single unifying formula that, specialized to each instance, gives the actual phase transition that we observe in practice. To give a preview of our results, we first recall some facts about statistical decision theory and AMP reconstruction. For the sake of illustration, the classical case of simple sparse vectors will be used as a running example.
Throughout this paper, we will consider estimation of unknown structured signals from a minimax point of view. Various notions of structures can be formalized by considering a family of probability measures over . One such probability measure will be denoted by and a signal with distribution will often be denoted as .
The family will typically include degenerate distributions, i.e. point masses for some .
The case of simple sparse vectors corresponds to the family
where is the set of Borel probability measures on and, as usual, denotes the number of nonzero entries of the vector .
As exemplified in this case is often indexed by a sparsity parameter , with corresponding to the number of non-zero entries. We will sometimes use the notation to indicate this dependency, also beyond the last example. Two further common properties that will always hold unless otherwise stated are the following.
Nestedness. If then .
Scale invariance. If then any scaled version of (defined by letting for some ) is also in .
2 Denoising and minimax MSE
The denoising problem requires to reconstruct a signal from observations whereby is a noise vector of known variance. (Here and below denoted the identity matrix in dimensions.) A denoiser is a mapping
that returns an estimate of when applied to observations . The denoiser is parametrized by the noise scale and additional tuning parameters . Often denoisers have the property and are hence called ‘shrinkers’. We will often have , i.e. the denoiser depends on a single non-negative parameter, but more complex choices of the parameter space fit in the formalism as well.
Following the minimax formulation in the previous section, we evaluate denoisers on signals , for specific class of distributions . Because of the scale invariance property of , it is sufficient to consider scale invariant denoisers:
Hence we omit the last argument when this is . We evaluate a denoiser through its minimax mean square error (MSE) per coordinate
where expectation is taken with respect to and , . In words, we tune the denoiser optimally to control the (per-coordinate) mean square error for typical signals from even the most unfavorable choice within our class .
We say that a denoiser is separable if, for , we have
A well studied denoiser is coordinatewise soft-thresholding, that we will denote by . This is a separable denoiser with a unique parameter (the threshold). On each coordinate this acts as
Soft thresholding is well suited for sparse signals from the class defined in Eq. (1.2). It turns out that the resulting minimax MSE can be characterized in terms of a scalar estimation problem, namely for all , . Explicitly, all these quantities are given by
where expectation is taken with respect to and independent of . We refer to [DJ94, DMM09, DMM11] for an explicit characterization of this quantity (a summary being provided in Section 2). In particular can be explicitly evaluated.
In several other examples has been explicitly evaluated (see [DMM09], Supplementary Information).
In this paper we will give several other calculations of , for signal structures and denoisers going considerably beyond these examples.
3 Compressed sensing and AMP reconstruction
Consider now the noiseless compressed sensing problem, i.e. the problem of recovering a signal from linear observations , cf. Eq. (1.1). The key intuition is that this can be done exploiting the structure of , sparsity being a special example. Approximate Message Passing (AMP) is an iterative scheme that allows to exploit richer types of structure in a flexible way. Given a denoiser that is well suited for reconstructing from observations , the AMP framework turns it into a scheme for solving the compressed sensing problem.
The AMP iteration starts from , and proceeds for iterations , , …by maintaining a current reconstruction and a current working residual , and adjusting these iteratively. At iteration , it forms a vector of current pseudo-data and the next iteration’s estimate is obtained by applying to the current pseudo-data:
Here is a scalar determined by
The rationale for this specific choice of is discussed in [DMM09, DMM10, BM11a]: a justification goes betond the scope of the present paper. The parameter is can be interpreted as the noise standard deviation for the pseudo-data . This can be estimated from or as explained in Appendix G.
Conceptually, AMP constructs an artificial denoising problem at each iteration and solves it using the denoising defined by . In other words, it solves a compressed sensing problem by successive denoising. For the purpose of this paper, this description should be sufficient, save for two remarks.
First, the specifics of the construction are absolutely crucial for the results of this paper. These are embedded in the specification of the scale factors and .
Second, the above algorithm framework was originally proposed in [DMM09, DMM10] in the case of a separable denoiser , i.e. a denoiser acting independently on each coordinate. In that paper the algorithm was derived by constructing a proper belief propagation message passing algorithm, and then obtaining the above algorithm as a first-order approximation. Specific separable denoisers corresponded to different choices of the prior in belief propagation.
A central point of this paper is that the form of the algorithm (1.7), (1.8), (1.9) is really more general and can be used in settings outside the original definition.
4 Phase transition for AMP
A recurring property of AMP algorithms is that they undergo a phase transition. When the undersampling ratio decreases below a certain threshold (that depends on the signal class and the denoiser ), the algorithm behavior changes from being successful most of the times, to failing most of the times. In order to formalize this notion, we introduce the following terminology.
We say that AMP succeeds with high probability for the signal class , and denoising procedure if there exist a choice of the tuning parameter such that the following happens. For each , there exists a function with such that, for any ,
Here probability is taken with respect to and the sensing matrix . Further, the limit is taken with .
Viceversa we say that AMP fails with high probability for the signal class , and denoising procedure if for any the following happens. There exists and a sequence such that, for all
Our main result is the following general relation between denoising and compressed sensing.
Phase Transition Formula for AMP. Consider compressed sensing reconstruction over the signal class , using AMP with the denoiser . Denote by the asymptotic minimax MSE per coordinate using denoiser .
Then AMP succeeds with high probability if
Viceversa AMP fails with high probability for .
Let be the class of signals with at most non-zero entries (in expectation) and consider AMP with soft thresholding . Then the above formula states that reconstruction will succeed if and fail for . This result was first proved in [DMM09] to follow from state evolution. State evolution was subsequently established as a rigorous tool in [BM11a].
The same paper [DMM09] studied AMP with positive soft thresholding and showed that it succeeds for , AMP with capping, proving that it succeeds for .
Appendix A spells out how these existing results fall under the aegis of Eq. (1.13).
Comparison to phase diagrams. In prior literature on phase transitions in compressed sensing, [DT10b, Don06, DT09a, BCT11, DMM09, DMM11], the authors considered a different phase diagram, based on variables and . The relation makes for a 1-1 relationship between the diagrams, so all information in the two diagrams can be presented in either format.
5 This Paper
Our aim in this paper is to show that formula (1.13) is correct in settings extending far beyond the three cases just mentioned in Example 1.5. We lay out several denoising problems, and in each one verify the general formula. This requires in each case calculating the minimax MSE for a problem of statistical decision theory; implementing AMP for compressed sensing with the given denoising family; and verifying empirically that the phase transition does indeed occur at the precise sparsity/undersampling tradeoff indicated by the formula.
In particular, we consider the following denoising tasks, and corresponding compressed sensing problems.
Again we consider the class od sparse vectors but instead of soft-thrresholding, we use the firm shrinkage denoiser . This is again a separable denoiser with two tuning parameters with . It acts on each coordinate by setting for , for and interpolating linearly.
Denoting by the associated asymptotic minimax MSE, we will show that strictly. By verifying the general formula, we show that the phase transition curve for optimally-tuned AMP firm shrinkage is slightly better than the phase transition for optimally tuned AMP soft shrinkage.
For the same class of sparse vectors . we consider the separable denoiser applies coordinatewise shrinkage using a minimax shrinkage. In other words implicitly we are optimizing the mean square error over . We calculate the minimax MSE function , and show that strictly. By verifying the general formula we show that the phase transition curve for AMP minimax shrinkage is slightly better than the phase transition for both AMP soft or firm shrinkage.
Here we consider the class of block sparse vectors (see Section 3 for a formal definition). We use two block-separable denoisers: either block soft thresholding (for block length , the -variate nonlinearity obeys ) or block James-Stein denoiser. We will compute the minimax MSE function , and bound the minimax MSE function . We will verify that the phase transition curve for optimally-tuned AMP with block-separable denoisers follows the general formula.
Notice that, as demonstrated numerically in [DMM09], and proved in [BM11b] in the case of Gaussian sensing matrices, soft-thresholding AMP reconstruction coincides with LASSO reconstruction (in the large system limit). By the above results, firm-shrinkage AMP and minimax AMP both outperform LASSO reconstruction. Correspondingly, it can be argued that blocksoft thresholding AMP coincides asymptotically with group LASSO, and hence James-Stein AMP outperforms the latter.
In all of the above examples, the denoisers are coordinatewise or at least blockwise separable. We next consider examples where the denoiser has more subtle structure. We find that formula (1.13) applies more generally.
We consider the class of vectors that are monotone with at most points of increase. As denoiser, we use the least-squares projection onto the cone of monotone increasing functions.
We consider the class of vectors that have at most points of change. The denoiser minimizes the residual sum of squares penalized by times the total variation of the signal.
In these cases, evaluating the asymptotic minimax MSE is more challenging than for separable denoisers and simpler classes of signals. Nevertheless, we will show that it can be done quite explicitly. We find well-defined phase transitions for AMP reconstruction, precisely at the location predicted by the general formula (1.13).
6 Contributions
We list eight contributions, beginning with the two most obvious:
A formula for phase transitions of AMP algorithms. We confirm that formula (1.13) accurately describes the sparsity-undersampling tradeoff under which AMP algorithms successfully recover a sparse structured signal from underdetermined measurements. We prove that this relation follows from the state evolution formalism.
As demonstrated in [DMM11] and proved in [BM11a] in the case of the LASSO, there exists a correspondence between convex optimization methods and specific AMP algorithms. We will show that this correspondence is considerably more general. This provides a unified approach which yields sharp phase transition predictions in numerous cases.
Limited benefit of nonconvex penalization for ordinary sparsity. Within the class of scalar separable AMP algorithms, the best achievable phase transition is obtained by the minimax shrinker. Unfortunately the improvement in the transition is relatively small.
Calculation of the minimax MSE of monotone regression and total variation denoising. We are not aware of any previous work computing the minimax MSE of these denoising procedures under the condition of -sparse first differences. We prove here a characterization for each of these cases and show that it agrees with the phase transition of both AMP and convex optimization algorithms.
A conjectures flow naturally from this work:
State Evolution accurately describes the behavior of a wide range of AMP algorithms, for large system sizes . State evolution is a formalism that allows to characterize the asymptotic behavior of AMP as the number of dimension tend to infinity [DMM09]. We show in Section 6 that the general relation (1.13) can be proved by assuming state evolution to hold.
In the case of separable denoisers, under suitable regularity conditions, the correctness of state evolution as a description of AMP is proved by [BM11a]. Since formula (1.13) is apparently successful beyond the separable case, it is natural to conjecture that state evolution applies much more generally than to the cases proven so far.
Our study supports the general conclusion that AMP provides a general tool in compressed sensing, that is applicable beyond simple sparse signal. If one knows that a certain shrinker is appropriate for denoising a certain type of signal, then the corresponding AMP algorithm provides an efficient reconstruction method for the associate compressed sensing problem. The denoising minimax MSE then maps to the sparsity undersampling tradeoff.
An interesting research direction is the study of the noisy linear model , whereby is a noise vector (e.g. ). In analogy [DMM11], we expect reconstruction to be stable with respect to noise for and instable for .
7 Related literature
Approximate message passing algorithms for compressed sensing reconstruction were introduced in [DMM09]. They were largely motivated by the connection with message passing algorithms in iterative decoding systems [RU08], and with mean field methods in statistical physics [MM09] (in particular the cavity method and TAP equations). We refer to [DMM11] for a discussion of these connections.
The original AMP framework [DMM09, DMM10] included iterations of the form defined in Eqs. (1.7), (1.8), (1.9) whereby the denoiser is separable. While this covers the and shrinkage rules studied in this paper, it did not include the various non-separable denoisers we discuss below, namely the block, monotone and total variation denoisers. Further, in [DMM09], the phase transition behavior was validated numerically only for , and denoisers, that are in correspondence with well-studied convex optimization problems. The extension to a noisy linear model , with a vector of iid random entries was carried out in [DMM11]. We also refer to [Mon12] for an overview of this work.
Several papers investigate generalizations of the original framework put forward in [DMM09]. The paper [BM11a] defines a general class of approximate message passing algorithms for which the state evolution was proved to be correct. This include in particular all separable Lipschitz-continuous denoisers. Generalizations of this result were proved in [BLM12, JM12]. Notice that all the separable denoisers treated in this papers are Lipschitz continuous with the exception of hard thresholding. While the last case is not covered by [BM11a], we expect state evolution to hold for hard thresholding AMP as well, by a suitable approximation argument.
Rangan [Ran11] introduces a class of generalized approximate message passing (G-AMP) algorithms that cope with –roughly– two extensions of the basic noisy linear model. First, the noisy measurement vector can be a non-linear (random) function of the noiseless measurement . Second, each of the ‘coordinates’ of can itself be a –low dimensional– vector. Interesting applications of this framework were developed in [KGR11, Sch11]. Let us notice that G-AMP does not cover any of the non-separable cases treated here (even the block sparse example), and hence provides a generalization in an ‘orthogonal’ direction.
In a parallel line of work, Schniter applied AMP to a number of examples in which the signal has a structured prior [Sch10, SSS10, SPS10]. Inference with respect to the prior is carried out using belief propagation, and this is combined with AMP to compute a posteriori estimates. This type of application fits within the class of problems studied here, by choosing the denoiser in Eq. (1.8) be given by the appropriate conditional expectation with respect to the signal prior. Note however that the general scheme provided by Eqs. (1.7), (1.8), (1.9) encompasses cases in which the denoiser is not the Bayes estimator for a specific prior.
A special case of known prior is the one in which is distributed according to the (known) product measure (i.e. the coordinates of are iid with known distribution ). The fundamental limits for compressed sensing reconstruction were established in [WV10]. The natural AMP algorithm uses in this case a posterior expectation denoiser [DMM10]. It was proved in [DJM11] that, for suitable sensing matrices with heteroscedastic entries, this approach achieves the fundamental limits of [WV10] (this approach was put forward in [KMS+12] on the basis of a statistical physics argument). This case fits within the general philosophy of the present paper whereby the class consists of a single distribution, namely . However, we prefer not treating this example in the present paper because it is a degenerate case, and the fact that is not scale invariant leads to some technical differences. We refer instead to [DJM11].
Maleki, Anitori, Yang and Baraniuk [MAYB11] used methods analogous to the ones developed here to study phase transitions for compressed sensing with complex vectors. This is closely related to the block-separable setting considered in Section 3 (there is however some difference in the structure of the sensing matrix).
Structured sparsity models are studied from a different point of view in [BCDH10, CHDB08, CICB10]. Those works focus on deriving sparsity models that capture a variety applications, and of convex relaxations that promote the relevant sparsity patters. Reconstruction guarantees are proved under suitable ‘isometry’ assumptions on the sensing matrix.
Closer to our approach is a recent series of papers [CRPW11, RRN11, RRN11], considering general classes of structured signals under random measurements. Let us emphasize two important differences with respect to our work. First, these papers only deal with convex reconstruction methods, while we shall analyze several approaches that are not derived from convex optimization and demonstrate improvements. Second, they establish reconstruction guarantees using concentration-of-measure arguments, while we propose exact asymptotics (essentially based on weak convergence), which enables us to unveil the key relation (1.13) between denoising and the compressed sensing phase transition.
Scalar-separable denoisers
In this section we study scalar-separable denoisers, cf. Eq. (1.6), that further satisfy the scaling relation (1.3). Unless stated otherwise, we will assume that signals belong to the simple sparsity class introduced in Eq. (1.2), to be denoted as .
As mentioned in the previous section, the computation of the minimax MSE is greatly simplified for separable denoisers. We state and prove the following elementary result in greater generality than necessary for this section. (In particular is here a general family of probability distributions.)
Let be any family of probability distributions satisfies the following conditions: If , then defining ( times), we have ; Viceversa, if , then letting denote the -th marginal of , we have .
Then, for any separable denoiser , and for any ,
Fix and define, for ,
The lemma then follows immediately if we prove that, for any , . In order to prove the last statement, first notice that, by property :
The proof is finished by property , since
This lemma reduces the problem to solving a minimax scalar estimation problem. This problem was characterized before for soft thresholding , positive soft thresholding , and also hard thresholding [DDS92, DJ94]. Plots of the minimax soft threshold and the minimax MSE are available in [DMM09, DMM11]. Such plots also appear later in this paper as baselines for comparison of interesting new families, namely Firm and Minimax shrinkage.
2 Firm shrinkage
Soft and hard thresholding can be recovered as limiting cases:
Lemma 2.1 yields the following formula for the minimax MSE of firm shrinkage:
Figure 1 and Table 1 show the minimax MSE for firm shrinkage as resulting from this calculation. The figure also shows similar results for soft and hard thresholding, for comparative purposes. Over the range presented, the minimax MSE for firm thresholding is strictly smaller than the MSE for hard or soft thresholding. Namely, over this range of ,
This validates the criticisms of soft thresholding, which is often said to shrink large values too heavilyNote, however, that the use of hard thresholding instead of soft thresholding leads to a larger worst case mean square error..
Figure 2 shows the minimax thresholds. At least for we see clearly that , so firm thresholding is preferred over the limiting cases of hard and soft thresholdingThese are numerical results. It is an open question whether, for the minimax firm threshold have parameter reducing it to soft thresholding.. Figure 3 shows the corresponding minimax denoisers for specific values of . Finally, Figure 4 plots the minimax value of as a function of (corresponding to the minimax probability distribution ).
3 Minimax shrinkage
The previous example showed that a parametric family of shrinkers can improve on soft thresholding, and hence improve the predicted phase transition according to (1.13). The ultimate improvement one could make in this direction is to use the globally minimax nonlinear shrinker. This is the separable denoiser that is minimax not within some parametric family, such as the soft thresholding or the broader firm thresholding family, but minimax over all measurable nonlinearities . While this notion might appear somewhat abstract, it can be in fact implemented in practice as illustrated in Figure 3, that present plots of the more familiar denoisers (hard, soft, and firm) together with the minimax denoiser.
Formally, let be the set of all measurable functions , for such a , set . The minimax MSE over this class is
The calculation of this quantity uses a variety of ideas from minimax decision theory, developed through several papers [Bic81, CS81, BC83, DDS92, Joh94b, Joh94a, DJ94]: details are given in Appendix B. A key point of this computation is the characterization of the minimax nonlinearity as the minimal MSE Bayes rule (that is, the conditional expectation) for the so-called least-favorable prior. The least-favorable prior is the solution of Mallows’ classical Fisher information problem [Mal78], for which we compute numerical upper and lower bounds that coincide within the stated precision.
Table 1 and Figure 1 present numerical values associated with the solution of the minimax problem. As expected, , i.e. optimizing over all nonlinearities yields a smaller mean square error than soft or firm thresholding. On the other hand, Table 1 shows that the improvements are typically of size 0.01 or smaller over the range . For very small it was pointed out in [DMM09] that [DJ94] implies
In the limit of extreme sparsity, there is nothing to be gained by completely general nonlinearities over soft thresholding. The improvement is non-vanishing, but moderate for non-vanishing.
4 Empirical phase transition behavior
The research hypothesis driving this paper is that Eq. (1.13) describes the phase transition of AMP algorithms. In order to be completely explicit, we need to check the following predictions, for each nonlinearity of interest
There exists a curve such that for the corresponding AMP algorithm will typically succeed in reconstructing the unknown signal , and for the algorithm will typically fail.
The curve is related to the corresponding scalar denoising problem by .
We now test this hypothesis for the firm and globally minimax nonlinearities .
Our experiment was conducted along the same lines as [MD10, DT09b, DMM09, DMM11, BT10]. We considered a range of problem sizes and a range of sparsity parameters , and a grid of values surrounding the predicted phase transition . We ran Monte Carlo reconstructions at each parameter combination. We declared “success” when the relative mean-squared error was below :
We used iterations of AMPIn most cases the mentioned convergence criterion is reached after a much smaller number of iterations (roughly 20) . We repeated a subset of our simulations with different requirements on and different number of iterations, without significant changes in the threshold location. This point is further justified in Appendix C. In the interest of reproducibility, a suite of Java classes for carrying out these and other simulations in the paper is made available as [DJM12].
We proceeded to analyse the outcomes of these numerical simulations as follows, see also Appendix H (a similar analysis was already carried out in in [DMM09, DT09b]). The simulations generated a data set, containing, for each algorithm and each fixed , a list of values and empirical success fractions . The success fractions observed at were indeed typically better than and at were typically worse than .
To quantify this tendency, we fit a logistic regression
where was computed analytically using ideas mentioned earlier. The choice of the model (2.4) is motivated by the observation that the success probability increases rapidly around the phase transition, and by the common statistical use of logistic models. Also, similar models have been proved to be asymptotically correct in analogous phase transition phenomena [DM08].
For each set of data corresponding to given and each non-linearity, we estimate and from the logit fit, leading to values . Using these quantities, we estimate the phase transition location as the value at which the probability of success is . Using Eq. (2.4) this corresponds to , i.e. . We are therefore led to define the offset between the empirical phase transition and the prediction as
In order to check the general relation provided by Eq. (1.13) we need to show that tends to zero as gets large, to within the statistical uncertainty. In Table 2 we report our results on the empirical phase transition, confirming that indeed the offset is small and decreasing with .
A few additional remarks on these data are of interest:
We calculated formal confidence intervals for , indicating the tight control we have of the correct value.
As in earlier studies [DT09b], we expect that tend at a rate that is inversely proportional to a power of . Namely
for some . Our data supports this relationship, with . See Appendix H.
Denoting by the fitted slope coefficient at dimension , evidence that is increasing with larger indicates that a sharpening of the phase transition is indeed occurring. Appendix H shows that is consistent with our data.
We refer to Appendices G and H for further details.
Block-separable denoisers
We now turn to the case of block-structured sparsity, first introducing some notational conventions. We partition the vector into blocks each of size . Denoting by the -th block, we hence write
A block-separable denoiser is a mapping that decomposes according to the above partition:
where, with an abuse of notation, we use the same symbol to denote the single-block denoiser . The last equation replaces Eq. (1.6) which correspond to the simple separable case. The above form applies to noise with variance . For general variance we adopt again the scaling relation (1.3).
We will apply these denoiser to signals from the block-sparse class defined as follows for , , ,
In words, this is the class of (random) vectors that have (in expectation) at most blocks different from . For simplicity, we will write for the case,
The same simplifications described in Section 2.1 applies, with obvious modifications, to the present context.
Let be any family of probability distributions satisfies the following conditions: If , then defining ( times), we have ; Viceversa, if , then letting denote the marginal of the -th block under , we have .
Then, for any block-separable denoiser , and for any multiple of
The proof is omitted since it is an immediate generalization of the one of Lemma 3.1. The class to be studied in the rest of this section clearly satisfy the assumption of this lemma.
Block-soft thresholding is the nonlinear shrinker defined by letting, for , and ,
where . The case reduces to traditional soft thresholding of Example 1.2. More generally, shrinks its argument to if and moves it by an amount towards the origin otherwise. It can also be regarded as the solution of a penalized least squares problem, namely
Block thresholding has previously been considered by Hall, Kerkyacharian and Picard [HKP98] and by Cai [Cai99] although in specific ‘wavelet’ applications.
where expectation is taken with respect to independent of . Notice that the condition simply amount to saying that is a probability measure on with . The calculation of can be reduced to a calculus problem. We state the results below deferring calculations to Appendix D.
Let bye a chi-square random variable with degrees of freedom and define the functions as follows
The minimax risk of block soft thresholding over the class is given by
This is a parametric expression for . The parameter corresponds to the minimax threshold .
In Figure 5 we present graphs of as a function of . It is immediate to prove the following structural properties: (the upper bound follows from taking ); is monotone increasing and concave (monotonicity is a consequence of for , and concavity follows since any measure in can be written as convex combination of measures in and in ); as ; as . (Recall that we are considering the MSE per coordinate.) Associated with the minimax problem is also an optimal threshold value , that we plot in Figure 6.
A particularly interesting case is the one of large blocks. As the minimax MSE has a well defined, and particularly explicit limit.
Further, when properly normalized, the minimax threshold converges with increasing block size:
2 Block James-Stein
In order to approach oracle MSE for large , we propose to use the positive-part James-Stein shrinkage estimator [JS10]. This is again a block-separable denoiser that acts as follows on a block :
Analogously to block soft thresholding, this estimator shrinks to blocks with small norms. On the other hand, its bias vanishes as . Using once more Lemma 3.1 we have (notice that in this case there is no tuning parameter)
Remarkably, the limiting behavior of this denoiser is ideal, and noticeably better than block soft thresholding, as shown by Figure 7 and formally in the next lemma.
Let denote the minimax MSE for over the class of -block sparse sequences. For any , we have:
Consider temporarily the case where the observation is with , and nonrandom and known. A simple calculation shows that the optimal linear estimator of the form , , is given by
This estimator uses information about (which could only be supplied by an oracle) to choose the constant as a function of . Note in particular that the risk of this estimator is
Applying (3.6), and keeping in mind that, for , , we have
The oracle inequality [DJ95, Theorem 5] shows that for , and for every vector , if , then
Combined with the previous display this proves the Lemma. ∎
The argument in the proof leads in fact to a convenient expression for . With the notations introduced there, we have
Now is known to be minimax for the unconstrained problem of estimating a non-sparse vector , i.e. yielding
Therefore computing the minimax MSE for reduces to computing the single quantity , that can be estimated through numerical integration. A good approximation for large is provided by the following formula with (cf. Appendix I.2.1). Hence we have
In the next section will use this formula (neglecting terms) in comparing the general prediction of Eq. (1.13) with the empirical results for the James-Stein AMP algorithm. Numerical integration reveals that this formula is accurate enough for such comparison.
3 Empirical phase transition behavior
We now turn to the compressed sensing reconstruction problem whereby the block-sparse vector is reconstructed from observed data using the AMP algorithm. We want to test the hypothesis that Eq. (1.13) describes the phase transition of the two block shrinkage AMP algorithms, corresponding to the block soft thresholding, and block James-Stein.
We conducted a set of experiments similar to those described in Section 2.4 We constructed block-sparse signals at different undersampling and sparsity levels and ran tests of block thresholding AMP. More precisely, we used the update equations (1.7) to (1.9) with (block soft thresholding AMP) or (James-Stein AMP).
It is a straightforward calculus exercise to compute an explicit expression for the memory term . For block soft thresholding AMP we get
Our results show that the curve correctly separates two phases of performance: below this curve success in AMP recovery is atypical and above it is typical. Similarly, the curve correctly describes the phase transition for block James-Stein shrinkage. The empirical results are presented in Figure 8 (for block soft thresholding) and Figure 9 (for block James-Stein). We refer to Appendix G for further details.
Monotone regression
In this section and the next, we show that, quite surprisingly, the formula (1.13) can be applied also to some highly nontrivial non-separable denoisers.
In this section we consider vectors that are monotone, and mostly constant. Let denote the cone of nondecreasing sequences:
We then define the class of mostly constant non-decreasing vectors
Since vectors from this class are –in general– not sparse, we will occasionally refer to the parameter as to the ‘simplicity’ parameter.
For this problem we will consider the denoiser , that solves the monotone regression problem
In other words, is the (Euclidean) projection on the cone of monotone sequences. This denoiser is highly non-separable, as one can understand most clearly by studying the standard pool-adjacent-violators algorithm for implementing it (see [BC90] for a recent reference).
In order to apply formula (1.13), we need to calculate , which requires in particular determining the least favorable distribution and proving that the limit of the minimax MSE exists. We present here the main ideas, deferring details to Appendix F.
It is convenient to introduce the risk at :
where expectation is taken with respect to . It is further useful to introduce a specific notation for the risk at , namely
The risk of monotone regression satisfies the following properties
The function is monotone increasing for .
Let be the set of increase points of . Denoting them by , , let for (with, by convention, , ). Then, for any ,
For a non-empty closed convex , we let denote the Euclidean projector to , i.e. . Further, for , .
Note that it is sufficient to show that, letting D(\mu;z)\equiv\big{\|}\eta^{mono}(\mu+z)-\mu\big{\|}_{2}^{2}, is monotone increasing in . By continuity of the projection operator, it follows that is continuous. Further, notice that is a cone obtained as the intersection of half-spaces:
Let be the separating hyperplane for and, for , define
with, by convention . Since , we have that is continuous and piecewise linear and equal one of the projectors , that we will denote by for . It is therefore sufficient to show that, defining for ,
the function is monotone increasing for .
Let where each is a contiguous segment (in the sense of point ), and . Further, for , let . Then, for any ,
which is clearly increasing in . ∎
The proof of part is deferred to Appendix F.
The last Lemma shows that the least favorable signal is constant on positions of the interval and has large (going to infinity) jumps at the remaining increase points. The resulting risk only depends on the distribution of the lengths of the intervals over which is constant.
The next Lemma provides some useful insight on the behavior of the risk at . This is crucial since it determines the minimax risk though Eq. (4.4).
The monotone regression risk at zero, defined through Eq. (4.3) satisfies and, for any
The proof of this Lemma can be found in Appendix F.2.
For moderate values of , can be computed numerically through Monte Carlo simulations. Figure 10 presents the results of such a simulation. It appears that as suggesting that the last lemma is loose by a logarithmic factor.
We can finally establish our main result on the minimax MSE of monotone regression over the class . Remarkably, we are able to characterize the least favorable distribution .
The asymptotic minimax MSE of monotone regression
Finally, for any , , the following distribution has risk larger that for all large enough. A signal has at all increase points for some large enough, and the lengths of intervals between increase points have distribution achieving the max in Eq. (4.6).
Hence is immediately upper bounded by the right hand side of Eq. (4.6). The matching lower bound is obtained by evaluating the above expressions for the distribution .
Finally Eq. (4.1) follows by using Lemma 4.2 in Eq. (4.6). ∎
The resulting curve is presented in Fig. 10.
2 Empirical phase transition behavior
We next consider the compressed sensing problem. We programmed the AMP iteration (1.7)-(1.9), with the monotone regression denoiser. The denoiser itself was implemented using the standard pool adjacent violators algorithm.
It is a simple exercise to obtain an explicit formula for the memory term . As in Lemma 4.1, let denote the number of increase points in the signal . We then have
We will refer to this specific version of AMP as to monoreg AMP.
In the present case we evaluated success probability using the following (Hamming-like) distance
and declared a success when . In Figure 11 we used and , but very similar results are obtained with other values of the parameters. The rationale for using instead of the normalized mean square error lies in the structure of the signals . Since the least favorable is monotone with large jumps, its norm is very large,concentrated at endpoints, and depends strongly on . This leads to subtle normalization issues across different .
The agrement between the empirical phase transition and the general prediction in Fig. 11 is satisfactory and improves with the signal’s length.
Total variation minimization
In this section we consider vectors that are mostly constant, with a few change points. In order to model this problem, we introduce the class of probability distributions
Again is a ‘simplicity’ parameter. Note that this class is quite similar to the class studied in the previous section, the ‘only’ difference being that change points can be either points of increase or points of decrease.
A convenient denoiser for this setting is the total variation penalized least-squares [ROF92], also called fused LASSO [TSR+05], that we will denote by . This depends on and, for and noise variance , it returns
An extensive literature is devoted to solving this denoising problem, see for example [VO96]. For general noise variance , the above expression is generalized through the usual scaling relationship (1.3).
Much of the analysis in this section is analogous to the one of monotone regression. We will therefore present several arguments in synthetic form to limit redundancy.
In this section we outline the computation of the asymptotic minimax MSE of the total variation denoiser over the class , to be denoted by .
We start by defining a generalization of the problem (5.1). For , we let
We define the risk at as
where . We denote the risk at by
Notice that, by symmetry, and . We then have the following analogous of Lemma 4.1.
The risk of total variation regression satisfies the following properties
The function is monotone increasing for .
Let be the set of change points of . Denoting them by , , let for (with, by convention, , ). Further, for , let for , , . For we let .
The argument in part is essentially the same as in part of Lemma 4.1 and we will therefore omit it.
For proving part , we will prove that, letting , the function is increasing for . First notice that the stationarity condition for the minimum in Eq. (5.1) reads
with the convention that . Let and be defined as in part of the statement. Then, summing Eq. (5.6) over , we get
where Hence is piecewise affine with components indexed by . Within each component, we have with defined as per Eq. (5.8).
Since is continuous (and hence is), it is sufficient to prov that, letting
the function is monotone increasing for . Using Eq. (5.8) we obtain
where denotes the average of vector over . It follows that is increasing as claimed. ∎
The risk at , , can be computed numerically for moderate values of . Notice that the cases , , and are only relevant for the boundary intervals and and turn out to be immaterial for the asymptotic minimax risk. Thanks to symmetries, the only relevant cases are and . The results of a numerical computation for these quantities is shown in Figure 12. These calculations suggest , which is indeed consistent with intuition as boundary conditions induce a larger bias. Also, it is easy to prove that as (as , converges to a constant vector).
Using the last Lemma, and proceeding as in the proof of Theorem 4.1, it is immediate to obtain a characterization of the minimax MSE of the total variation denoiser. For technical reasons, we need to introduce the class of vectors in with distance at most between changepoints.
The asymptotic minimax MSE of total variation denoiser
We omit this proof since it is an immediate generalization of the one of Theorem 4.1. Notice that is monotone increasing in and hence admit a limit as . We expect that
This limit can be evaluated numerically and in Figure 12 we plot the resulting minimax risk.
Notice that, by properly modifying Eq. (5.9), one obtains the minimax risk over subsets of with constrained change point distributions. For instance, we can consider the case in which the lengths between change points are distributed as for uniformly random change points, and increase/decrease points are alternating. We then get
We consequently define the random changepoint minimax risk as
This curve is plotted in Figure 12 for comparison.
2 Empirical phase transition behavior
We implemented the the AMP iteration (1.7)-(1.9), using the total variation denoiser . For the latter, we used the software package tvdip (in the Matlab implementation) or the projected Newton method [VO96] (in the Java implementation).
For be denote the number of constant segments in or, equivalently, the number of change points in , plus one. We then have the following expression for the memory term in Eq. (1.7)-(1.9):
We will refer to this specific version of AMP as to TV-AMP.
We carried out two types of experiments. In the first class of experiments we considered signals with distances between change points distributed as
This is the same distribution as if each position is independently an increase point or a decrease point, each with probability . The predicted phase transition curve is given by Eq. (5.10) and the minimax value of is used in the AMP implementation. The simulations results are presented in Figure 13 for , and show good agreement between predictions and observations. In this case we used the Hamming metric (4.8), because the norm of the typical signal scales super-linearly in .
Characterization of the phase transition using state evolution
In this section we prove the basic relation (1.13) between minimax mean square error in denoising and the phase-transition boundary in the sparsity-undersampling plane. Our proof assumes that the state evolution formalism developed in [DMM09, DMM10, DMM11] holds, in the precise terms stated below. This formalism was established rigorously for separable denoisers (under additional regularity assumptions) in [BM11a].
A crucial observation for state evolution is that the mean squared error of the AMP reconstruction at iteration is practically non-random for large system sizes , and has a well-defined limit as . In particular, the limit
exists almost surely (here we assume , while ). Moreover, the evolution of with increasing is dictated by a formula which is explicitly computable, and defined below. We will use the term state evolution to refer both to the mapping and to the sequence with appropriate initial condition. State evolution allows to determine whether AMP recovers the signal correctly, by simply checking whether as (in which case the MSE vanishes asymptotically) or not. The latter problem does in turn reduce to a problem in real analysis.
The papers [DMM09, DMM10, DMM11] developed the state evolution framework for separable denoisers and verified its predictions numerically for three specific examples (namely for the shrinkers Soft, SoftPos, and Cap). However, the heuristic argument presented in those papers was much more general. Indeed, [BM11a] proved that state evolution holds, in a precise asymptotic sense, for Gaussian measurement matrices with iid entries and generic separable denoisers, under mild regularity assumptions. A generalization to non-gaussian entries was subsequently proved in [BLM12].
Here we generalize this approach to non-separable denoisers . This framework covers all the shrinkers discussed in Sections 2 to 5, and yields a formal proof of the main formula (1.13), under the assumption that indeed state evolution is correct in this broader context. We will throughout assume the scaling relation (1.3).
In the next sections we will first introduce some basic notations and facts about state evolution. Then we will prove the phase transition expression (1.13) by establishing first a lower bound and then a matching upper bound, both given by the minimax MSE.
The next definition provides the suitable generalization of the state evolution mapping to the present setting.
For given , and a sequence of probability distributions over , define the state evolution mapping by
whenever the limit on the right-hand side exists. Here, as before, and are independent vectors, , . In other words, is the per-coordinate MSE of denoiser at noise level .
The fixed points of the mapping play of course a crucial role in the analysis of state evolution.
The highest fixed point of the mapping is defined as .
The importance and applicability this notion is underscored by the next two observations. Here and below we say that a function is starshaped if is decreasing.
Suppose that and any one of these three conditions holds:
is an increasing function of , and the initial condition of state evolution satisfies ;
is an increasing starshaped function of .
Then state evolution converges to the highest fixed point:
Further, if and is a starshaped function of , then .
The proof is a standard calculus exercise; we omit it.
The function is starshaped for all of the following choices of the denoiser :
The proof of this Lemma is deferred to Appendix I.
2 State Evolution Phase Transition
Consider a collection of probability distributions over , indexed by as per Section 1.1 (these do not need to be simple sparse signals). For a sequence of probability distributions we write if for all and the limit on the Eq. (6.1) exists for each . Letting , we consider the minimax value:
For , define the state evolution phase transition as
Note that is monotone decreasing as a function of , by definition of the state evolution mapping , cf. Eq. (6.1). It follows that for and for . Further, by nestedness, it is monotone increasing as a function of , which implies that is monotone increasing. The rationale for this definition is that, for and under any of the assumptions of Lemma 6.1, state evolution predicts that AMP will correctly recover the signal .
Let be a nested, scale-invariant collection of probability distributions, and assume that the shrinker obeys the scaling relation (1.3). Define the minimax MSE as per Eq. (1.5). Then
In order to prove this result, we shall first establish a more general fact. Given a sequence of distributions and, for any , we let
The rationale for this definition is clear. Under the assumption that state evolution holds, for a signal sampled from distribution , AMP (with tuning parameter ) is guaranteed to reconstruct if and only if .
the normalized MSE for denoising with worst case case signal-to-noise ratio. With these definitions we have the following.
For any sequence of probability measures , and any , we have
We are now in position to prove Theorem 6.1.
Throughout the proof we drop the argument , since this is kept constant.
Define . We claim that . Indeed choose . Then by definition there exists such that, for all , there exists with . Hence, for all , there exists with . This implies that, for all , , i.e. . Proceeding in the same way, it is immediate to prove that, for any , we have . Hence, we conclude that as claimed.
To conclude the proof, we note that, by Lemma 6.3, we have
where the last equality follows from the scale invariant property of . The last quantity is nothing but . ∎
3 Non-convergence of state evolution
Lemmas 6.1.(b) and 6.2 immediately imply that the answer is positive for soft thresholding, positive soft thresholding, block soft thresholding, James-Stein shrinkage, monotone regression and total variation denoising. It turns out that the answer is positive also for firm thresholding and the global minimax denoiser. In Appendix E we describe the argument for these cases.
Phase transitions for other algorithms
Formula (1.13) connects an algorithmic property – phase transitions of AMP recovery algorithms – with a property from statistical decision theory – minimax mean squared error in denoising.
We want to explore a further connection, relating the behavior of convex optimization with that of certain AMP algorithms. As proved in [BM11b], in the large system limit AMP with soft thresholding denoiser effectively computes the solution to
for an appropriately calibrated .
More generally, consider a generalized reconstruction method of the form
where is a convex penalization. To this reconstruction problem, we can associate an AMP algorithm, by using the denoiser in Eq. (1.7), (1.8), (1.9), whereby
(we also let ). In other words is the proximal operator of the penalization . We will refer to this algorithm as to AMP-.
We then have the following general correspondence, which follows immediately by writing the stationarity condition of problem and the fixed points of AMP- (see [Mon12]).
Any fixed point of AMP- with fixed point parameters , , corresponds to a stationary point of with given by
In particular, if the regularizer is convex, then fixed points correspond to minimizers.
The fixed points of AMP with positive soft thresholding denoiser are solutions of
where denotes the standard scalar product over , and is the all-one vector.
The fixed points of AMP with capping denoiser effectively are solutions of
For noiseless compressed sensing reconstruction, the correspondence involves the limit of the above problem, that is
Phase transitions for such convex programs were characterized in [Don06, DT09a, DT10a] for the three examples mentioned above, using methods from combinatorial geometry. Thus, whenever AMP converges with high probability, formula (1.13) connects fundamental problems of minimax decision theory to fundamental problems in high dimensional combinatorial geometry.
We wil next verify numerically this correspondence beyond the three classical examples (again focusing on noiseless measurements). In the block-structured case, we can compare block-soft AMP to the following convex optimization problem:
Analogously, we can compare monoreg AMP to the following convex optimization problem:
where , . Figure 16 verifies that the phase transition occurs around the location predicted by (1.13), just as with monoreg AMP.
Finally, we can compare TV AMP to the following convex optimization problem:
where again . Figure 17 verifies that the phase transition occurs at the location predicted by (1.13), just as with TV AMP.
where (here we redefined to absorb the factor ).
It turns out that a similar interpretation can be given for any scalar denoiser (and hence for any separable denoiser as well). In particular, the minimax and firm thresholding rules and are optimizers of penalization schemes of the same form as above, with non-convex.
We can construct the penalty corresponding to a denoiser by observing that at the solution . Defining the residual , and noting that , we obtain that and are related through the change of variables
A similar analysis can be carried out for block separable denoisers that are covariant under rotation, i.e. if satisfies for any rotation . We already mentioned that the block soft denoiser can be written as
An implied penalty can also be derived for block James-Stein shrinkage . Due to the covariance under rotation, the corresponding penalty only depends on the modulus of . Figure 19 shows the implied penalty as a function of . Namely, the penalty is such that, for ,
coincides with the positive-part James-Stein estimator. The implicit penalization is again nonconvex.
Appendix A Classical cases
The general formula (1.13) was already validated in [DMM09] for the following three classical cases:
Simple sparse signals from the class (cf. Eq. (1.2)) and soft thresholding denoiser.
Non-negative sparse signals with soft positive thresholding denoiser, cf. Example 1.3.
Box constrained signals with capping denoiser, cf. Example 1.3.
Analytical expressions for the phase transition curves were computed using state evolution in the Online Supplement to [DMM09]. We review the results here since they provide a useful stepping stone for understanding more complicate cases (cf., e.g.. Section 6).
Notice that the last of the examples above (box-constrained signals) is not scale invariant, according to the general definition of Section 1.1. For non scale-invariant classes, the definitions (1.4) and (1.5) are generalized by taking the supremum over the noise covariance as well. Namely, for a generic class , we let
It is easy to check that this definition coincides with the earlier one for scale-invariant classes. We will write instead of for box constrained signals with capping denoiser, i.e. case above. The results for case is further elucidated by comparing it with the following scale invariant problem:
We consider sparse non-negative signals , modeled through the class , and the simple positive part denoiser . We denote corresponding minimax risk by .
The minimax risk for examples , , can be computed explicitly. Indeed in both cases we have
Here and are independent random variables. The reduction second equality follows from th remark that the extremal distributions in and are mixtures of two point distributions. This remark reduces the calculation of the minimax risk to a simple calculation [DMM09], whose results are summarized below.
The minimax risk for the problems stated above is
(The first two are parametric expressions in , which is the optimal threshold at the given sparsity level.)
Also, the AMP phase transition for the noiseless reconstruction problem is given in all of these cases by . In other words AMP succeeds with high probability of and fails with high probability if .
As mentioned above, the calculation of the minimax risk is a calculus exercise, and follows the same lines as in [DMM09]. This coincide with the AMP threshold by the general analysis of Section 6 for cases , , . For the non-scale invariant case, we refer, once more, to [DMM09].
Notice that problem and have the same minimax risk. This identity mirrors a result in [DT10a] that characterizes the phase transition threshold for reconstructing from noiseless linear measurements , with or . If a simple feasibility linear program is used (namely, find any with ), then the undersampling threshold for both problems is given by .
Appendix B Calculation of minimax MSE
In this Appendix we describe the calculation of the global minimax risk over the class , as defined per Eq. (2.3). In particular, we will explain how the values in Table 1 and Figure 1 for have been computed.
Here is he posterior mean estimator for prior , is the Gaussian measure , , denotes the convolution of measures, and denotes the Fisher information. For a probability measure , with density with respect to the Lebesgue measure this is defined as
Further, if is convex and weakly compact, the set of probability measures is also convex and weakly compact. If follows from [Hub64, Theorem 4] that the in Eq. (B.1) is achieved at a unique point . Hence
where we defined . The unique minimizer is known as theleast favorable distribution. The minimax optimal denoiser (achieving the over in Eq. (B.1)) is the posterior expectation with respect to the prior .
Bickel and Collins [BC83, Theorem 1], prove that, under suitable assumption on the class , the least favorable distribution is a mixture of point masses
where , and the sequence has no accumulation points except, possibly, at . As mentioned above, the minimax denoiser is the posterior expectation associate to the prior . By Tweedie’s formula, this takes the form
where is the so-called score function
and is the density of .
Focusing now specifically on the class . This case is covered by the general theory of [BC83], and corresponds to their example . Without loss of generality we can assume that the are monotone increasing, with , and that , with . A conjecture of Mallows [Mal78] states that in fact we may take
In other words, the conjectured least favorable distribution has the form of a two-sided geometric distribution on a scaled copy of . While the conjecture has not been proved, Mallows [Mal78] provided an argument (based on an analogous problem in robust estimation [Hub64]) that suggesting that indeed it captures the correct tail behavior.
For estimating the minimax risk numerically, we chose a large parameter and assume a generalized “Mallows form” for . More precisely, we assume an equispaced grid, and geometrically decaying weights. This is a little more general than what Mallows proposed, having 3 total degrees of freedom (spacing, total weight and rate of decay), rather than two. For , we allow the parameters and to vary freely. In this way we obtained a parametric family of probability distributions, with parameter , with
The quantity can be estimated numerically, and provides an upper bound on . We used up to and checked that the resulting is insensitive to this choice. Notice that choice of the Mallows form for is immaterial for two reasons: As a consequence [BC83] and of the weak continuity of Fisher information, should be insensitive to the tail behavior of the distribution ; We are only using it to derive an upper bound.
In order to get a lower bound on , we use Huber’s minimax theorem [Hub64, HR09], which implies that, for any differentiable in measure,
where in the supremum ranges over all densities of probability measures with .
Let denote the probability measure corresponding to the optimum of the parametric optimization (B.3). Denote by denote the corresponding score function
where . This corresponds to a denoiser . Let denote a maximizing density for . By Huber’s theory, this can be chosen to be two-point mixture where is chosen to achieve the worst case value on the right side of (B.4) and set .
We have the bounds . Numerically, we compute integrals and extrema over fine grids with at least samples per unit of range, getting not and but instead numerical approximations and . Table 3 presents some information about numerical approximation results, which may help the reader assess its accuracy for small values of . Some minimizing distributions obtained in this way are shown in Figure 20; the mass points are displayed in Figure 21.
Our numerical results, showing that allows us to infer that the Mallows form is approximately correct. The denoiser that we actually apply in our estimation and compressed sensing experiments is:
Appendix C Convergence properties of AMP
Throughout the paper we checked convergence of AMP by imposing a threshold on the reconstruction accuracy and the number of iterations. For instance, in the case of separable denoisers, cf. Section 2.4, we declared the reconstruction successful if
for a certain choice of and . In particular, the results presented correspond to and .
It is natural to ask how to choose and , and whether different choices of and would lead to significantly different estimates of the phase transition boundary. It turns out that the empirical phase transition is fairly insensitive to these choices for the cases considered here, as soon as is sufficiently large and . This insensitivity is related to the convergence properties of AMP. Indeed both theory and empirical evidence [DMM09, DMM11] indicate exponential convergence. Namely, for all , there exist dimension independent constants , such that, with high probability,
On the other hand, for , , with high probability.
Figure 22 presents data that confirm this behavior (further numerical evidence can be found in [DMM09]). The data refer to simple sparse signals with , and soft thresholding denoising. The curves correspond to several values of close to the predicted phase transition location . Notice the clear exponential decay of the error for and a large constant mean square error for .
If the phase transition has to be determined with relative accuracy , this suggests the rule of thumb and . We verified that these conditions are satisfied by our choices of and .
Appendix D Calculation of minimax MSE for block soft thresholding
The argument is analogous to the one for the scalar case (corresponding to ) treated in Appendix A. For and , define the risk at as
where and . Since the two point mixtures are the extremal distributions in , we have
By the definition of chi-square distribution, we have
It follows from the invariance of the distribution of under rotations that only depends on through its norm . Further, as proved in Appendix I.1 is increasing in , and
At this point the problem is reduced to a calculus exercise.
D.2 Proof of Lemma 3.3
In this appendix we consider the asymptotics for large block size . It is easy to show that the minimax threshold level must be of order . By a compactness argument, we can assume for some to be determined. Define the risk as in Eq. (D.1) (note that this depends implicitely on ) and the normalized risk as . We claim that
Assuming these claim to hold, we have, by Eq. (D.2),
Calculus shows that the minimum is achieved at , whence
which coincides with the statement of Lemma 3.3.
In order to complete the proof, we have to prove claims (D.3) and (D.4). The second one is immediate because of Lemma I.1 that implies indeed .
The limit (D.3) follows instead from the central limit theorem. Indeed, let denote a central chi-squared with degrees of freedom. Its square root is the norm of a standard Gaussian random vector in dimension , and concentrates around . Indeed by the central limit theorem as . Therefore, we have
and the latter converges to by dominated convergence.
Appendix E Non-convergence of state evolution
We begin by developing a lower bound that holds for all the denoisers studied in this paper. For notational simplicity, we consider in fact separable denoisers, but the result is easily see to hold in general, provided that the signal class is scale-invariant.
Let denote the mixture where can be either finite or infinite. Define the risk at as
where is independent of .
We say that the risk function is super-quadratic on if, for any ,
The next result shows that superquadratic behavior of the risk function implies non convergence of state evolution for signal distribution .
(State Evolution Non-Convergence) Fix any , and assume there exists such that: The risk function is superquadratic on , , and .
Let . Then there is such that
Define and assume . This implies , and, since is superquadratic by assumption,
For firm and minimax denoisers , we took a fine grid of and at each fixed evaluated on a fine grid of , checking the inequality (E.1). In the case of firm thresholding we used the minimax threshold values . We further used the least favorable , . Sample results are presented in Figures 23 and 24.
These computations show that the risk function is superquadratic on .
Appendix F Monotone regression
where we recall that is the discrete derivative. (Of course this problem does not provide an algorithm since it requires to know , but here we are interested in it only for analysis purposes.)
As , all the constraints for which become irrelevant. We are naturally led to defining the following localized monotone regression problem
(Here and below omit the dependence of , on .) Let denote the solution of with data . The above discussion implies that, for , we have the following limit in probability:
In words, the risk ‘at infinity’ of monotone regression is simply given by the local risk.
In order to conclude the proof, it is sufficient to show that is given by the right-hand side of Eq. (4.4). It is easy to check that the problem separates into independent optimization problem for each . Namely, for , can be found by solving the following smaller problem
where and, if the segment is a singleton, the constraint disappears. Let be the solution of this problem. Then,
F.2 Proof of Lemma 4.2
Throughout the proof, we denote by the solution of the monotone regression problem
Clearly since in this case there is no monotonicity constraint and the solution of the regression problem is simply . In order to prove Eq. (4.5), let (i.e. an increase point: ) and define by letting , and by letting . Then, , for all small enough. Hence, we must have , and for all small enough. Expanding to linear order in , we conclude that, for all :
Further, if is the all vector, for all . Minimizing with respect to , we get
By virtue of Eq. (F.1), and using the fact that is monotone, we have
It is then easy to check that, letting denote the complement of ,
Indeed define, for , with, by convention, if the set is empty. Then, by definition of ,
where the first inequality follows by definition of the events , , … and the second since . The inequality follows essentially by the same argument. By union bound we therefore obtain
We can therefore bound the risk as follows
On the other hand it is easy to see that, defining , we necessarily have for all , whence
where the last inequality holds for . Using this bound in conjunction with Eq. (F.6), we finally get
Appendix G Further computational details
We record here a few details that have been omitted from the main text.
The AMP iteration, cf. Eqs. (1.8), (1.8), (1.9), requires to estimate the variance of the effective observation at time . According to the the general theory of state evolution [DMM09, BM11a], the empirical distribution of the coordinates of is asymptotically Gaussian with mean and variance . This motivates the following estimator, first proposed in [DMM09]:
where is the normal distribution function. This is known as the % pseudo-variance in robust estimator and has the advantage of being insensitive to a small fraction of large outliers.
Computations were done in partly using Matlab, and partly through a Java program . In the spirit of reproducible research, a suite of Java classes that allow to repeat our simulations is available through an open code repository [DJM12].
The plots of minimax risk were obtained by evaluating numerically the expression in the main text. For separable and block-separable denoisers (with the exception of the global minimax denoiser ), the integrals can be expressed in terms of the Gaussian distribution function or incomplete beta functions. For the global minimax denoiser, integration was performed numerically using the standard MAtlab routines.
Evaluation of the minimax risk required searching the least favorable distribution among two point mixtures of the form , in the separable case and in the block separable one. Optimization over was performed by brute force search over a grid, with recursive refinement of the grid.
For the non-separable cases, the procedure for approximating the minimax MSE was explained Section 4 and 5.
Appendix H Finite-N𝑁N scaling and error analysis
The empirical phase transitions observed in this paper admit further analysis, to verify whether the following expected behavior take place, namely: the offsets tend towards zero with increasing ; the steepness of the phase transition increases with increasing .
As described in Section 2.4, at each fixed value of the sparsity parameter, we gathered data at several different values of , and obtained the empirical phase transition parameter , recorded as offset from prediction, so that means that the 50% success location fitted to the -fixed, -varying dataset is exactly at the predicted location . Our analysis gave not only the empirical phase transition location, but also its formal standard error . (Here we make explicit the dependence of on the specific denoiser.)
We fit a linear model to the dataset including all the phase transition results for soft and firm thresholding. We considered several exponents that might be describing the decay of the offset with increasing :
Table 4 shows that provides an adequate description of the offsets, with an exceeding . A plot of raw ’s versus the predictions of model (H.1) is given in Figure 25.
H.2 Transitions sharpen
In addition to an empirical phase transition parameter we also fitted an empirical steepness parameter , according to the logistic model:
where is the empirical success probability for the -th experiment.
We expect to grow with increasing , corresponding to increasingly abrupt transitions from complete failure at to complete success at . In order to test this behavior, we fitted a linear model to the values of computed for multiple values of , , and denoisers . We considered a range of powers that might be describing the growth of the steepness with increasing :
Table 5 shows that provides an adequate description of the steepnesses, with an exceeding . A plot of raw ’s versus the predictions of model (H.2) is given in Figure 26.
Appendix I Further properties of the risk function
In this appendix we prove several useful properties of the risk function of the block soft and James-Stein denoisers. Throughout this section, we define the risk at as
The argument will be droppend or replaced by the threshold level whenever clear from the context. Since we only consider denoisers that are equivariant under rotation, depends on the vector only through its norm . With a slight abuse of notation, we will use to denote the norm as well. In other words, the reader can assume .
In this section we consider the block soft denoiser . We will write for .
For block soft thresholding the risk function has these properties:
Let be the density function of . This satisfies
Applying the first identity, integrating by parts, canceling terms, and then using the second identity, we obtain
For the upper bound (I.3), use the Poisson mixture representation with and an identity for the central density family to obtain
and so to conclude that . Hence the second term in (I.6) is bounded by , whence follows the conclusion . Property (I.4) is immediate from (I.2) and (I.3) and the large- limit of .
To obtain the bound in (I.5), write the risk function using the unbiased risk formula as
I.2 Positive-part James-Stein denoiser
As in the previous section, we set and . HEre we consider the James-Stein denoiser defined by
and we will write .
We again let , with . We have the noncentral chi-squared distribution with noncentrality . Stein’s unbiased estimate of risk is
We will first develop an approximation of the risk at that was used in Sections (3.2) and (3.3).
Let denote the density of a central chi-squared with degrees of freedom. We then have the density satisfies
Using the identity and letting , we can rewrite the last expression as
By a standard tightness argument, this implies that that
An Edgeworth series leads to the expansion . Indeed, one can integrate the expression (I.8) numerically, and the numerical values are consistent with for large .
I.2.2 Monotonicity of R(μ)𝑅𝜇R(\mu).
We use the variation-diminishing version of total positivity, developed in [BJM81].
The non-central family is strictly variation diminishing of all orders
For , let and denote the number of sign changes and strict sign changes of , and let denote the sign of (assuming that , the more general definition being given in [BJM81]). Further define the function by
where is the density of the noncentral chi-square with degrees of freedom and noncentrality . By the SVR property we have that that and that if then necessarily . In particular this implies that, if is strictly increasing, then is strictly increasing as well. Indeed this follows by letting for and
If is strictly increasing, then for all , whence for all , with whenever . This in turns implies that is increasing.
We now verify that the risk of is monotone increasing in . Let
and define using Eq. (I.9). Note that is strictly increasing and hence is increasing as well by the above argument. But is Stein’s unbiased risk estimator and hence , which implies the claim.
I.3 Proof of Lemma 6.2
Since any probability distribution is written as a convex combination of point masses, it is sufficient to prove the claim for . In this case, using the scaling relation (1.3), we have
with the risk function. Therefore the state evolution mapping is starshaped for all distributions if and only if is monotone increasing.
The monotonicity of the risk function was proved in [DMM09] for soft thresholding and positive soft thresholding. It is proved in Section 4 and 5 for monotone regression and total variation denoising. Finally, it is proved in Section I.1 and I.2 for block soft and James-Stein denoisers.