Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising

David Donoho, Iain Johnstone, Andrea Montanari

Introduction

In the noiseless compressed sensing problem, we are given a collection of linear measurements of an unknown vector x0x_{0}:

Here the measurement matrix AA is nn by NN, n<Nn<N, and the NN-vector x0x_{0} is the object we wish to recover. Both yy and AA are known, while x0x_{0} is unknown and we seek to recover an approximation to x0x_{0}.

Since n<Nn<N, the equations are underdetermined. It seems hopeless to recover x0x_{0} in general, but in compressed sensing one also assumes that the object is sparse in the appropriate sense. Suppose that the object is known to be kk-sparse, i.e. to have kk nonzero entries. If the problem dimensions (k,n,N)(k,n,N) are large, many recovery algorithms exhibit the phenomenon of phase transition.

Explicitly, let ε=k/N{\varepsilon}=k/N and δ=n/N\delta=n/N denote the sparsity and undersampling parameters, respectively. Hence (ε,δ)2({\varepsilon},\delta)\in^{2} defines a phase space for the different kinds of limiting situations we may encounter as (k,n,N)(k,n,N) grow large. For a variety of algorithms and Gaussian matrices AA with iid entries, one finds that this phase space can be partitioned into two phases: “success” and “failure”. Namely, for a given algorithm A{\cal A} and given sparsity fraction ε{\varepsilon}, there exists a critical fraction δ(εA)\delta({\varepsilon}|{\cal A}) such that if the sampling rate δ\delta is larger than the critical value, δ>δ(εA)\delta>\delta({\varepsilon}|{\cal A}), then the algorithm is successful in recovering the underlying object x0x_{0} with high probabilityThroughout the paper, we will write that an event holds with high probability (w.h.p.) if its probability converges to 11 in the large system limit N,nN,n\to\infty with δ=n/N\delta=n/N and ε=k/N{\varepsilon}=k/N fixed., while if δ<δ(εA)\delta<\delta({\varepsilon}|{\cal A}) the algorithm is unsuccessful, also with high probability. In particular, δ(εA)<1\delta({\varepsilon}|{\cal A})<1 means that it is indeed possible to undersample and still recover the unknown signal. In fact δ(εA)\delta({\varepsilon}|{\cal A}) shows precisely the limits of allowable undersampling. By now a large amount of empirical and theoretical knowledge has been compiled about the phase transitions exhibited by different algorithms: we refer the reader to [DT10b, Don06, DT09a, XH10, BCT11, Sto10, KWT09, DMM09, DMM11, Wai09]. In a parallel line of work, a number of sufficient conditions under which undersampling is possible using deterministic matrices have been studied, see e.g. [CT05, BRT09, BGI+08, Can06].

It is however fair to say that the research focused so far on ‘unstructured’ notions of sparsity whereby kk simply counts the number on non-zero entries in x0x_{0}. (We refer to Section 1.7 for an overview of related literature.) On the other hand, applications naturally lead to ‘structured’ notions of sparsity. This paper applies an algorithm framework - Approximate Message Passing (AMP) - to construct specific algorithms applicable to a variety of compressed sensing settings, including block and structured sparsity, convex and nonconvex penalization, and develops a single unifying formula that, specialized to each instance, gives the actual phase transition that we observe in practice. To give a preview of our results, we first recall some facts about statistical decision theory and AMP reconstruction. For the sake of illustration, the classical case of simple sparse vectors will be used as a running example.

Throughout this paper, we will consider estimation of unknown structured signals xRNx\in{\bf R}^{N} from a minimax point of view. Various notions of structures can be formalized by considering a family FN{\cal F}_{N} of probability measures over RN{\bf R}^{N}. One such probability measure will be denoted by νNFN\nu_{N}\in{\cal F}_{N} and a signal with distribution νN\nu_{N} will often be denoted as XνN{\bf X}\sim\nu_{N}.

The family FN{\cal F}_{N} will typically include degenerate distributions, i.e. point masses νN=δx0\nu_{N}=\delta_{x_{0}} for some x0RNx_{0}\in{\bf R}^{N}.

The case of simple sparse vectors corresponds to the family

where P(RN){\cal P}({\bf R}^{N}) is the set of Borel probability measures on RN{\bf R}^{N} and, as usual, v0\|v\|_{0} denotes the number of nonzero entries of the vector vv. \blacksquare

As exemplified in this case FN{\cal F}_{N} is often indexed by a sparsity parameter ε{\varepsilon}, with NεN{\varepsilon} corresponding to the number of non-zero entries. We will sometimes use the notation FN,ε{\cal F}_{N,{\varepsilon}} to indicate this dependency, also beyond the last example. Two further common properties that will always hold unless otherwise stated are the following.

Nestedness. If ε1ε2{\varepsilon}_{1}\leq{\varepsilon}_{2} then FN,ε1FN,ε2{\cal F}_{N,{\varepsilon}_{1}}\subseteq{\cal F}_{N,{\varepsilon}_{2}}.

Scale invariance. If νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}} then any scaled version of νN\nu_{N} (defined by letting νNa(B)=ν(aB)\nu^{a}_{N}(B)=\nu(aB) for some a>0a>0) is also in FN,ε{\cal F}_{N,{\varepsilon}}.

2 Denoising and minimax MSE

The denoising problem requires to reconstruct a signal xRNx\in{\bf R}^{N} from observations Y=x+Z{\bf Y}=x+{\bf Z} whereby ZN(0,σ2IN×N){\bf Z}\sim{\sf N}(0,\sigma^{2}{\bf I}_{N\times N}) is a noise vector of known variance. (Here and below Im×m{\bf I}_{m\times m} denoted the identity matrix in mm dimensions.) A denoiser is a mapping

that returns an estimate of xx when applied to observations y=Yy={\bf Y}. The denoiser is parametrized by the noise scale σ\sigma and additional tuning parameters τΘ\tau\in\Theta. Often denoisers have the property η(y;τ,σ)2y2\|\eta(y;\tau,\sigma)\|_{2}\leq\|y\|_{2} and are hence called ‘shrinkers’. We will often have Θ=R+\Theta={\bf R}_{+}, i.e. the denoiser depends on a single non-negative parameter, but more complex choices of the parameter space Θ\Theta fit in the formalism as well.

Following the minimax formulation in the previous section, we evaluate denoisers on signals XνNFN,ε{\bf X}\sim\nu_{N}\in{\cal F}_{N,{\varepsilon}}, for specific class of distributions FN,ε{\cal F}_{N,{\varepsilon}}. Because of the scale invariance property of FN,ε{\cal F}_{N,{\varepsilon}}, it is sufficient to consider scale invariant denoisers:

Hence we omit the last argument when this is σ=1\sigma=1. We evaluate a denoiser η\eta through its minimax mean square error (MSE) per coordinate

where expectation is taken with respect to XνN{\bf X}\sim\nu_{N} and Y=X+Z{\bf Y}={\bf X}+{\bf Z}, ZN(0,IN×N){\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}). In words, we tune the denoiser optimally to control the (per-coordinate) mean square error for typical signals from even the most unfavorable choice within our class FN,ε{\cal F}_{N,{\varepsilon}}.

We say that a denoiser is separable if, for v=(v1,,vN)RNv=(v_{1},\dots,v_{N})\in{\bf R}^{N}, we have

A well studied denoiser is coordinatewise soft-thresholding, that we will denote by ηsoft\eta^{soft}. This is a separable denoiser with a unique parameter τΘ=R+\tau\in\Theta={\bf R}_{+} (the threshold). On each coordinate yRy\in{\bf R} this acts as

Soft thresholding is well suited for sparse signals from the class FN,ε{\cal F}_{N,{\varepsilon}} defined in Eq. (1.2). It turns out that the resulting minimax MSE M(FN,εη)M({\cal F}_{N,{\varepsilon}}|\eta) can be characterized in terms of a scalar estimation problem, namely for all NN, M(εηsoft)=M(FN,εηsoft)=M(F1,εηsoft)M({\varepsilon}|\eta^{soft})=M({\cal F}_{N,{\varepsilon}}|\eta^{soft})=M({\cal F}_{1,{\varepsilon}}|\eta^{soft}). Explicitly, all these quantities are given by

where expectation is taken with respect to XνX\sim\nu and ZN(0,1)Z\sim{\sf N}(0,1) independent of XX. We refer to [DJ94, DMM09, DMM11] for an explicit characterization of this quantity (a summary being provided in Section 2). In particular M(εSoft)M({\varepsilon}|{\rm Soft}) can be explicitly evaluated. \blacksquare

In several other examples M(FN,εη)M({\cal F}_{N,{\varepsilon}}|\eta) has been explicitly evaluated (see [DMM09], Supplementary Information).

In this paper we will give several other calculations of M(FN,εη)M({\cal F}_{N,{\varepsilon}}|\eta), for signal structures and denoisers going considerably beyond these examples.

3 Compressed sensing and AMP reconstruction

Consider now the noiseless compressed sensing problem, i.e. the problem of recovering a signal x0RNx_{0}\in{\bf R}^{N} from n<Nn<N linear observations y=Ax0y=Ax_{0}, cf. Eq. (1.1). The key intuition is that this can be done exploiting the structure of x0x_{0}, sparsity being a special example. Approximate Message Passing (AMP) is an iterative scheme that allows to exploit richer types of structure in a flexible way. Given a denoiser η(;τ,σ):RNRN\eta(\,\cdot\,;\tau,\sigma):{\bf R}^{N}\to{\bf R}^{N} that is well suited for reconstructing x0x_{0} from observations x0+Zx_{0}+{\bf Z}, the AMP framework turns it into a scheme for solving the compressed sensing problem.

The AMP iteration starts from x0=0x^{0}=0, and proceeds for iterations t=1t=1, 22, …by maintaining a current reconstruction xtRNx^{t}\in{\bf R}^{N} and a current working residual ztRnz^{t}\in{\bf R}^{n}, and adjusting these iteratively. At iteration tt, it forms a vector of current pseudo-data yt=xt+ATzty^{t}=x^{t}+A^{{\sf T}}z^{t} and the next iteration’s estimate is obtained by applying η\eta to the current pseudo-data:

Here bt{\sf b}_{t} is a scalar determined by

The rationale for this specific choice of bt{\sf b}_{t} is discussed in [DMM09, DMM10, BM11a]: a justification goes betond the scope of the present paper. The parameter σt\sigma_{t} is can be interpreted as the noise standard deviation for the pseudo-data yty^{t}. This can be estimated from yty^{t} or ztz^{t} as explained in Appendix G.

Conceptually, AMP constructs an artificial denoising problem at each iteration and solves it using the denoising defined by η\eta. In other words, it solves a compressed sensing problem by successive denoising. For the purpose of this paper, this description should be sufficient, save for two remarks.

First, the specifics of the construction are absolutely crucial for the results of this paper. These are embedded in the specification of the scale factors bt{\sf b}_{t} and σt\sigma_{t}.

Second, the above algorithm framework was originally proposed in [DMM09, DMM10] in the case of a separable denoiser η\eta, i.e. a denoiser acting independently on each coordinate. In that paper the algorithm was derived by constructing a proper belief propagation message passing algorithm, and then obtaining the above algorithm as a first-order approximation. Specific separable denoisers corresponded to different choices of the prior in belief propagation.

A central point of this paper is that the form of the algorithm (1.7), (1.8), (1.9) is really more general and can be used in settings outside the original definition.

4 Phase transition for AMP

A recurring property of AMP algorithms is that they undergo a phase transition. When the undersampling ratio δ\delta decreases below a certain threshold (that depends on the signal class Fε,N{\cal F}_{{\varepsilon},N} and the denoiser η\eta), the algorithm behavior changes from being successful most of the times, to failing most of the times. In order to formalize this notion, we introduce the following terminology.

We say that AMP succeeds with high probability for the signal class FN,ε{\cal F}_{N,{\varepsilon}}, and denoising procedure η\eta if there exist a choice of the tuning parameter τΘ\tau\in\Theta such that the following happens. For each ξ>0\xi>0, there exists a function o(t)o(t) with limto(t)=0\lim_{t\to\infty}o(t)=0 such that, for any νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}},

Here probability is taken with respect to x0=XνNx_{0}={\bf X}\sim\nu_{N} and the sensing matrix AA. Further, the limit NN\to\infty is taken with n/Nδn/N\to\delta.

Viceversa we say that AMP fails with high probability for the signal class FN,ε{\cal F}_{N,{\varepsilon}}, and denoising procedure η\eta if for any τΘ\tau\in\Theta the following happens. There exists ξ>0\xi>0 and a sequence νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}} such that, for all t0t\geq 0

Our main result is the following general relation between denoising and compressed sensing.

Phase Transition Formula for AMP. Consider compressed sensing reconstruction over the signal class FN,ε{\cal F}_{N,{\varepsilon}}, using AMP with the denoiser η\eta. Denote by M(εη)M({\varepsilon}|\eta) the asymptotic minimax MSE per coordinate using denoiser η\eta.

Then AMP succeeds with high probability if

Viceversa AMP fails with high probability for δ<M(εη)\delta<M({\varepsilon}|\eta).

Let FN,ε{\cal F}_{N,{\varepsilon}} be the class of signals with at most NεN{\varepsilon} non-zero entries (in expectation) and consider AMP with soft thresholding ηsoft(;τ)\eta^{soft}(\,\cdot\,;\tau). Then the above formula states that reconstruction will succeed if δ>M(εSoft)\delta>M({\varepsilon}|{\rm Soft}) and fail for δ>M(εSoft)\delta>M({\varepsilon}|{\rm Soft}). This result was first proved in [DMM09] to follow from state evolution. State evolution was subsequently established as a rigorous tool in [BM11a].

The same paper [DMM09] studied AMP with positive soft thresholding and showed that it succeeds for δ>M(εSoftPos)\delta>M({\varepsilon}|{\rm SoftPos}), AMP with capping, proving that it succeeds for δ>M(εCap)\delta>M({\varepsilon}|{\rm Cap}).

Appendix A spells out how these existing results fall under the aegis of Eq. (1.13). \blacksquare

Comparison to (ρ,δ)(\rho,\delta) phase diagrams. In prior literature on phase transitions in compressed sensing, [DT10b, Don06, DT09a, BCT11, DMM09, DMM11], the authors considered a different phase diagram, based on variables δ\delta and ρ=ε/δ\rho={\varepsilon}/\delta. The relation ε=ρδ{\varepsilon}=\rho\delta makes for a 1-1 relationship between the diagrams, so all information in the two diagrams can be presented in either format.

5 This Paper

Our aim in this paper is to show that formula (1.13) is correct in settings extending far beyond the three cases just mentioned in Example 1.5. We lay out several denoising problems, and in each one verify the general formula. This requires in each case (a)(a) calculating the minimax MSE for a problem of statistical decision theory; (b)(b) implementing AMP for compressed sensing with the given denoising family; and (c)(c) verifying empirically that the phase transition does indeed occur at the precise sparsity/undersampling tradeoff indicated by the formula.

In particular, we consider the following denoising tasks, and corresponding compressed sensing problems.

Again we consider the class od sparse vectors FN,ε{\cal F}_{N,{\varepsilon}} but instead of soft-thrresholding, we use the firm shrinkage denoiser ηfirm(;τ)\eta^{firm}(\,\cdot\,;\tau). This is again a separable denoiser with two tuning parameters τ=(τ1,τ2)\tau=(\tau_{1},\tau_{2}) with τΘ{(τ1,τ2):  0τ1<τ2<}\tau\in\Theta\equiv\{(\tau_{1},\tau_{2}):\;0\leq\tau_{1}<\tau_{2}<\infty\}. It acts on each coordinate by setting ηfirm(y;τ)=0\eta^{firm}(y;\tau)=0 for y<τ1|y|<\tau_{1}, ηfirm(y;τ)=y\eta^{firm}(y;\tau)=y for y>τ2|y|>\tau_{2} and interpolating linearly.

Denoting by M(εFirm)M({\varepsilon}|{\rm Firm}) the associated asymptotic minimax MSE, we will show that M(εFirm)<M(εSoft)M({\varepsilon}|{\rm Firm})<M({\varepsilon}|{\rm Soft}) strictly. By verifying the general formula, we show that the phase transition curve for optimally-tuned AMP firm shrinkage is slightly better than the phase transition for optimally tuned AMP soft shrinkage.

For the same class of sparse vectors FN,ε{\cal F}_{N,{\varepsilon}}. we consider the separable denoiser η\eta applies coordinatewise shrinkage using a minimax shrinkage. In other words implicitly we are optimizing the mean square error over Θ{\mboxallscalarnonlinearities}\Theta\equiv\{\mbox{ all scalar nonlinearities }\}. We calculate the minimax MSE function M(εMinimax)M({\varepsilon}|{\rm Minimax}), and show that M(εMinimax)<M(εFirm)M({\varepsilon}|{\rm Minimax})<M({\varepsilon}|{\rm Firm}) strictly. By verifying the general formula we show that the phase transition curve for AMP minimax shrinkage is slightly better than the phase transition for both AMP soft or firm shrinkage.

Here we consider the class of block sparse vectors FN,ε,B{\cal F}_{N,{\varepsilon},B} (see Section 3 for a formal definition). We use two block-separable denoisers: either block soft thresholding (for block length BB, the BB-variate nonlinearity obeys ηB,λ(y)=y(1y2/λ)+\eta_{B,\lambda}(y)=y\cdot(1-\|y\|_{2}/\lambda)_{+}) or block James-Stein denoiser. We will compute the minimax MSE function MB(εBlockSoft)M_{B}({\varepsilon}|{\rm BlockSoft}), and bound the minimax MSE function MB(εJamesStein)M_{B}({\varepsilon}|{\rm JamesStein}). We will verify that the phase transition curve for optimally-tuned AMP with block-separable denoisers follows the general formula.

Notice that, as demonstrated numerically in [DMM09], and proved in [BM11b] in the case of Gaussian sensing matrices, soft-thresholding AMP reconstruction coincides with LASSO reconstruction (in the large system limit). By the above results, firm-shrinkage AMP and minimax AMP both outperform LASSO reconstruction. Correspondingly, it can be argued that blocksoft thresholding AMP coincides asymptotically with group LASSO, and hence James-Stein AMP outperforms the latter.

In all of the above examples, the denoisers are coordinatewise or at least blockwise separable. We next consider examples where the denoiser has more subtle structure. We find that formula (1.13) applies more generally.

We consider the class FN,ε,mono{\cal F}_{N,{\varepsilon},{\rm mono}} of vectors that are monotone with at most NεN{\varepsilon} points of increase. As denoiser, we use the least-squares projection η\eta onto the cone of monotone increasing functions.

We consider the class FN,ε,TV{\cal F}_{N,{\varepsilon},TV} of vectors that have at most NεN{\varepsilon} points of change. The denoiser η\eta minimizes the residual sum of squares penalized by τ\tau times the total variation of the signal.

In these cases, evaluating the asymptotic minimax MSE is more challenging than for separable denoisers and simpler classes of signals. Nevertheless, we will show that it can be done quite explicitly. We find well-defined phase transitions for AMP reconstruction, precisely at the location predicted by the general formula (1.13).

6 Contributions

We list eight contributions, beginning with the two most obvious:

A formula for phase transitions of AMP algorithms. We confirm that formula (1.13) accurately describes the sparsity-undersampling tradeoff under which AMP algorithms successfully recover a sparse structured signal from underdetermined measurements. We prove that this relation follows from the state evolution formalism.

As demonstrated in [DMM11] and proved in [BM11a] in the case of the LASSO, there exists a correspondence between convex optimization methods and specific AMP algorithms. We will show that this correspondence is considerably more general. This provides a unified approach which yields sharp phase transition predictions in numerous cases.

Limited benefit of nonconvex penalization for ordinary sparsity. Within the class of scalar separable AMP algorithms, the best achievable phase transition is obtained by the minimax shrinker. Unfortunately the improvement in the transition is relatively small.

Calculation of the minimax MSE of monotone regression and total variation denoising. We are not aware of any previous work computing the minimax MSE of these denoising procedures under the condition of ε{\varepsilon}-sparse first differences. We prove here a characterization for each of these cases and show that it agrees with the phase transition of both AMP and convex optimization algorithms.

A conjectures flow naturally from this work:

State Evolution accurately describes the behavior of a wide range of AMP algorithms, for large system sizes NN. State evolution is a formalism that allows to characterize the asymptotic behavior of AMP as the number of dimension tend to infinity [DMM09]. We show in Section 6 that the general relation (1.13) can be proved by assuming state evolution to hold.

In the case of separable denoisers, under suitable regularity conditions, the correctness of state evolution as a description of AMP is proved by [BM11a]. Since formula (1.13) is apparently successful beyond the separable case, it is natural to conjecture that state evolution applies much more generally than to the cases proven so far.

Our study supports the general conclusion that AMP provides a general tool in compressed sensing, that is applicable beyond simple sparse signal. If one knows that a certain shrinker is appropriate for denoising a certain type of signal, then the corresponding AMP algorithm provides an efficient reconstruction method for the associate compressed sensing problem. The denoising minimax MSE then maps to the sparsity undersampling tradeoff.

An interesting research direction is the study of the noisy linear model y=Ax0+wy=Ax_{0}+w, whereby ww is a noise vector (e.g. wN(0,σ2Im×m)w\sim{\sf N}(0,\sigma^{2}{\bf I}_{m\times m})). In analogy [DMM11], we expect reconstruction to be stable with respect to noise for δ<M(εη)\delta<M({\varepsilon}|\eta) and instable for δ>M(εη)\delta>M({\varepsilon}|\eta).

7 Related literature

Approximate message passing algorithms for compressed sensing reconstruction were introduced in [DMM09]. They were largely motivated by the connection with message passing algorithms in iterative decoding systems [RU08], and with mean field methods in statistical physics [MM09] (in particular the cavity method and TAP equations). We refer to [DMM11] for a discussion of these connections.

The original AMP framework [DMM09, DMM10] included iterations of the form defined in Eqs. (1.7), (1.8), (1.9) whereby the denoiser is separable. While this covers the Firm{\rm Firm} and Minimax{\rm Minimax} shrinkage rules studied in this paper, it did not include the various non-separable denoisers we discuss below, namely the block, monotone and total variation denoisers. Further, in [DMM09], the phase transition behavior was validated numerically only for Soft{\rm Soft}, SoftPos{\rm SoftPos} and Cap{\rm Cap} denoisers, that are in correspondence with well-studied convex optimization problems. The extension to a noisy linear model y=Ax0+wy=Ax_{0}+w, with wRnw\in{\bf R}^{n} a vector of iid random entries was carried out in [DMM11]. We also refer to [Mon12] for an overview of this work.

Several papers investigate generalizations of the original framework put forward in [DMM09]. The paper [BM11a] defines a general class of approximate message passing algorithms for which the state evolution was proved to be correct. This include in particular all separable Lipschitz-continuous denoisers. Generalizations of this result were proved in [BLM12, JM12]. Notice that all the separable denoisers treated in this papers are Lipschitz continuous with the exception of hard thresholding. While the last case is not covered by [BM11a], we expect state evolution to hold for hard thresholding AMP as well, by a suitable approximation argument.

Rangan [Ran11] introduces a class of generalized approximate message passing (G-AMP) algorithms that cope with –roughly– two extensions of the basic noisy linear model. First, the noisy measurement vector yy can be a non-linear (random) function of the noiseless measurement Ax0Ax_{0}. Second, each of the ‘coordinates’ of x0x_{0} can itself be a –low dimensional– vector. Interesting applications of this framework were developed in [KGR11, Sch11]. Let us notice that G-AMP does not cover any of the non-separable cases treated here (even the block sparse example), and hence provides a generalization in an ‘orthogonal’ direction.

In a parallel line of work, Schniter applied AMP to a number of examples in which the signal x0x_{0} has a structured prior [Sch10, SSS10, SPS10]. Inference with respect to the prior is carried out using belief propagation, and this is combined with AMP to compute a posteriori estimates. This type of application fits within the class of problems studied here, by choosing the denoiser ηt\eta_{t} in Eq. (1.8) be given by the appropriate conditional expectation with respect to the signal prior. Note however that the general scheme provided by Eqs. (1.7), (1.8), (1.9) encompasses cases in which the denoiser is not the Bayes estimator for a specific prior.

A special case of known prior is the one in which x0=XνNx_{0}={\bf X}\sim\nu_{N} is distributed according to the (known) product measure νN=ν××ν\nu_{N}=\nu\times\cdots\times\nu (i.e. the coordinates of X{\bf X} are iid with known distribution ν\nu). The fundamental limits for compressed sensing reconstruction were established in [WV10]. The natural AMP algorithm uses in this case a posterior expectation denoiser [DMM10]. It was proved in [DJM11] that, for suitable sensing matrices with heteroscedastic entries, this approach achieves the fundamental limits of [WV10] (this approach was put forward in [KMS+12] on the basis of a statistical physics argument). This case fits within the general philosophy of the present paper whereby the class FN,ε{\cal F}_{N,{\varepsilon}} consists of a single distribution, namely FN,ε={νN}{\cal F}_{N,{\varepsilon}}=\{\nu_{N}\}. However, we prefer not treating this example in the present paper because it is a degenerate case, and the fact that FN,ε{\cal F}_{N,{\varepsilon}} is not scale invariant leads to some technical differences. We refer instead to [DJM11].

Maleki, Anitori, Yang and Baraniuk [MAYB11] used methods analogous to the ones developed here to study phase transitions for compressed sensing with complex vectors. This is closely related to the block-separable setting considered in Section 3 (there is however some difference in the structure of the sensing matrix).

Structured sparsity models are studied from a different point of view in [BCDH10, CHDB08, CICB10]. Those works focus on deriving sparsity models that capture a variety applications, and of convex relaxations that promote the relevant sparsity patters. Reconstruction guarantees are proved under suitable ‘isometry’ assumptions on the sensing matrix.

Closer to our approach is a recent series of papers [CRPW11, RRN11, RRN11], considering general classes of structured signals under random measurements. Let us emphasize two important differences with respect to our work. First, these papers only deal with convex reconstruction methods, while we shall analyze several approaches that are not derived from convex optimization and demonstrate improvements. Second, they establish reconstruction guarantees using concentration-of-measure arguments, while we propose exact asymptotics (essentially based on weak convergence), which enables us to unveil the key relation (1.13) between denoising and the compressed sensing phase transition.

Scalar-separable denoisers

In this section we study scalar-separable denoisers, cf. Eq. (1.6), that further satisfy the scaling relation (1.3). Unless stated otherwise, we will assume that signals belong to the simple sparsity class introduced in Eq. (1.2), to be denoted as FN,ε{\cal F}_{N,{\varepsilon}}.

As mentioned in the previous section, the computation of the minimax MSE is greatly simplified for separable denoisers. We state and prove the following elementary result in greater generality than necessary for this section. (In particular FN,ε{\cal F}_{N,{\varepsilon}} is here a general family of probability distributions.)

Let FN,εP(RN){\cal F}_{N,{\varepsilon}}\subseteq{\cal P}({\bf R}^{N}) be any family of probability distributions satisfies the following conditions: (i)(i) If ν1F1,ε\nu_{1}\in{\cal F}_{1,{\varepsilon}}, then defining νNν1××ν1\nu_{N}\equiv\nu_{1}\times\cdots\times\nu_{1} (NN times), we have νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}}; (ii)(ii) Viceversa, if νNFN\nu_{N}\in{\cal F}_{N}, then letting νN,i\nu_{N,i} denote the ii-th marginal of νN\nu_{N}, we have νNN1i=1NνN,iF1,ε\overline{\nu}_{N}\equiv N^{-1}\sum_{i=1}^{N}\nu_{N,i}\in{\cal F}_{1,{\varepsilon}}.

Then, for any separable denoiser η\eta, and for any NN,

Fix τΘ\tau\in\Theta and define, for ZN(0,IN×N){\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}),

The lemma then follows immediately if we prove that, for any NN, M(FN,εη,τ)=M(F1,εη,τ)M({\cal F}_{N,{\varepsilon}}|\eta,\tau)=M({\cal F}_{1,{\varepsilon}}|\eta,\tau). In order to prove the last statement, first notice that, by property (i)(i):

The proof is finished by property (ii)(ii), since

This lemma reduces the problem to solving a minimax scalar estimation problem. This problem was characterized before for soft thresholding η=ηsoft\eta=\eta^{soft}, positive soft thresholding η=ηsoftpos\eta=\eta^{softpos}, and also hard thresholding η(y;τ)=y1{y>τ}\eta(y;\tau)=y1_{\{|y|>\tau\}} [DDS92, DJ94]. Plots of the minimax soft threshold τ(ε)\tau^{*}({\varepsilon}) and the minimax MSE are available in [DMM09, DMM11]. Such plots also appear later in this paper as baselines for comparison of interesting new families, namely Firm and Minimax shrinkage.

2 Firm shrinkage

Soft and hard thresholding can be recovered as limiting cases:

Lemma 2.1 yields the following formula for the minimax MSE of firm shrinkage:

Figure 1 and Table 1 show the minimax MSE for firm shrinkage as resulting from this calculation. The figure also shows similar results for soft and hard thresholding, for comparative purposes. Over the range presented, the minimax MSE for firm thresholding is strictly smaller than the MSE for hard or soft thresholding. Namely, over this range of ε{\varepsilon},

This validates the criticisms of soft thresholding, which is often said to shrink large values too heavilyNote, however, that the use of hard thresholding instead of soft thresholding leads to a larger worst case mean square error..

Figure 2 shows the minimax thresholds. At least for ε<1/3{\varepsilon}<1/3 we see clearly that τ1(ε)<τ2(ε)<\tau_{1}^{*}({\varepsilon})<\tau_{2}^{*}({\varepsilon})<\infty, so firm thresholding is preferred over the limiting cases of hard and soft thresholdingThese are numerical results. It is an open question whether, for ε>1/3{\varepsilon}>1/3 the minimax firm threshold have parameter τ2(ε)=\tau_{2}({\varepsilon})=\infty reducing it to soft thresholding.. Figure 3 shows the corresponding minimax denoisers for specific values of ε{\varepsilon}. Finally, Figure 4 plots the minimax value of μ\mu as a function of ε{\varepsilon} (corresponding to the minimax probability distribution νε,μ\nu_{{\varepsilon},\mu}).

3 Minimax shrinkage

The previous example showed that a parametric family of shrinkers can improve on soft thresholding, and hence improve the predicted phase transition according to (1.13). The ultimate improvement one could make in this direction is to use the globally minimax nonlinear shrinker. This is the separable denoiser η\eta that is minimax not within some parametric family, such as the soft thresholding or the broader firm thresholding family, but minimax over all measurable nonlinearities η:RR\eta:{\bf R}\to{\bf R}. While this notion might appear somewhat abstract, it can be in fact implemented in practice as illustrated in Figure 3, that present plots of the more familiar denoisers (hard, soft, and firm) together with the minimax denoiser.

Formally, let ΘL(R)\Theta\equiv{\cal L}({\bf R}) be the set of all measurable functions τ:RR\tau:{\bf R}\to{\bf R}, for such a τ\tau, set ηall(x;τ)τ(x)\eta^{all}(x;\tau)\equiv\tau(x). The minimax MSE over this class is

The calculation of this quantity uses a variety of ideas from minimax decision theory, developed through several papers [Bic81, CS81, BC83, DDS92, Joh94b, Joh94a, DJ94]: details are given in Appendix B. A key point of this computation is the characterization of the minimax nonlinearity as the minimal MSE Bayes rule (that is, the conditional expectation) for the so-called least-favorable prior. The least-favorable prior is the solution of Mallows’ classical Fisher information problem [Mal78], for which we compute numerical upper and lower bounds that coincide within the stated precision.

Table 1 and Figure 1 present numerical values associated with the solution of the minimax problem. As expected, M(εMinimax)<M(εFirm)M(εSoft)M({\varepsilon}|{\rm Minimax})<M({\varepsilon}|{\rm Firm})\leq M({\varepsilon}|{\rm Soft}), i.e. optimizing over all nonlinearities yields a smaller mean square error than soft or firm thresholding. On the other hand, Table 1 shows that the improvements are typically of size 0.01 or smaller over the range ε(0.01,0.25){\varepsilon}\in(0.01,0.25). For very small ε{\varepsilon} it was pointed out in [DMM09] that [DJ94] implies

In the limit of extreme sparsity, there is nothing to be gained by completely general nonlinearities over soft thresholding. The improvement is non-vanishing, but moderate for ε{\varepsilon} non-vanishing.

4 Empirical phase transition behavior

The research hypothesis driving this paper is that Eq. (1.13) describes the phase transition of AMP algorithms. In order to be completely explicit, we need to check the following predictions, for each nonlinearity η\eta of interest

There exists a curve εδ(εη){\varepsilon}\mapsto\delta({\varepsilon}|\eta) such that for δ>δ(εη)\delta>\delta({\varepsilon}|\eta) the corresponding AMP algorithm will typically succeed in reconstructing the unknown signal x0x_{0}, and for δ<δ(εη)\delta<\delta({\varepsilon}|\eta) the algorithm will typically fail.

The curve is related to the corresponding scalar denoising problem by δ(εη)=M(εη)\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta).

We now test this hypothesis for the firm and globally minimax nonlinearities η{ηfirm,ηall}\eta\in\{\eta^{firm},\eta^{all}\}.

Our experiment was conducted along the same lines as [MD10, DT09b, DMM09, DMM11, BT10]. We considered a range of problem sizes N{1000,2000,4000}N\in\{1000,2000,4000\} and a range of sparsity parameters ε{0.01,0.02,0.05,.10,0.15,0.20,0.25}{\varepsilon}\in\{0.01,0.02,0.05,.10,0.15,0.20,0.25\}, and a grid of δ\delta values surrounding the predicted phase transition δ(εη)\delta({\varepsilon}|\eta). We ran Nsample=1000N_{\rm sample}=1000 Monte Carlo reconstructions at each parameter combination. We declared “success” when the relative mean-squared error was below 1%1\%:

We used t=300t=300 iterations of AMPIn most cases the mentioned convergence criterion is reached after a much smaller number of iterations (roughly 20) . We repeated a subset of our simulations with different requirements on x^tx022/x022\|\widehat{x}^{t}-x_{0}\|_{2}^{2}/\|x_{0}\|_{2}^{2} and different number of iterations, without significant changes in the threshold location. This point is further justified in Appendix C. In the interest of reproducibility, a suite of Java classes for carrying out these and other simulations in the paper is made available as [DJM12].

We proceeded to analyse the outcomes of these numerical simulations as follows, see also Appendix H (a similar analysis was already carried out in in [DMM09, DT09b]). The simulations generated a data set, containing, for each algorithm and each fixed ε{\varepsilon}, a list of values δi\delta_{i} and empirical success fractions p^i\widehat{p}_{i}. The success fractions observed at δ>M(εη)\delta>M({\varepsilon}|\eta) were indeed typically better than 50%50\% and at δ<M(εη)\delta<M({\varepsilon}|\eta) were typically worse than 50%50\%.

To quantify this tendency, we fit a logistic regression

where δ(εη)=M(εη)\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta) was computed analytically using ideas mentioned earlier. The choice of the model (2.4) is motivated by the observation that the success probability increases rapidly around the phase transition, and by the common statistical use of logistic models. Also, similar models have been proved to be asymptotically correct in analogous phase transition phenomena [DM08].

For each set of data corresponding to given (N,ε)(N,{\varepsilon}) and each non-linearity, we estimate α\alpha and β\beta from the logit fit, leading to values α^,β^\widehat{\alpha},\widehat{\beta}. Using these quantities, we estimate the phase transition location as the value at which the probability p^\widehat{p} of success is 50%50\%. Using Eq. (2.4) this corresponds to α^+β^(δδ(εη))=0\widehat{\alpha}+\widehat{\beta}(\delta-\delta({\varepsilon}|\eta))=0, i.e. δ=δ(εη)(α^/β)\delta=\delta({\varepsilon}|\eta)-(\widehat{\alpha}/\beta). We are therefore led to define the offset between the empirical phase transition and the prediction δ(εη)=M(εη)\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta) as

In order to check the general relation provided by Eq. (1.13) we need to show that PT^(N,ε)\widehat{{\rm PT}}(N,{\varepsilon}) tends to zero as NN gets large, to within the statistical uncertainty. In Table 2 we report our results on the empirical phase transition, confirming that indeed the offset is small and decreasing with NN.

A few additional remarks on these data are of interest:

We calculated formal 95%95\% confidence intervals for PT^\widehat{{\rm PT}}, indicating the tight control we have of the correct value.

As in earlier studies [DT09b], we expect that PT^(N,ε)\widehat{{\rm PT}}(N,{\varepsilon}) tend at a rate that is inversely proportional to a power of NN. Namely

for some γ(0,1]\gamma\in(0,1]. Our data supports this relationship, with γ1/3\gamma\approx 1/3. See Appendix H.

Denoting by β^N\widehat{\beta}_{N} the fitted slope coefficient at dimension NN, evidence that β^N\widehat{\beta}_{N} is increasing with larger NN indicates that a sharpening of the phase transition is indeed occurring. Appendix H shows that β^NN\widehat{\beta}_{N}\sim\sqrt{N} is consistent with our data.

We refer to Appendices G and H for further details.

Block-separable denoisers

We now turn to the case of block-structured sparsity, first introducing some notational conventions. We partition the vector x=(x1,x2,,xN)x=(x_{1},x_{2},\dots,x_{N}) into MM blocks each of size BB. Denoting by blockm(x)=(x(m1)B+1,,xmB)block_{m}(x)=(x_{(m-1)B+1},\dots,x_{mB}) the mm-th block, we hence write

A block-separable denoiser is a mapping η(;τ):RNRN\eta(\,\cdot\,;\tau):{\bf R}^{N}\to{\bf R}^{N} that decomposes according to the above partition:

where, with an abuse of notation, we use the same symbol to denote the single-block denoiser η(;τ):RBRB\eta(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}. The last equation replaces Eq. (1.6) which correspond to the simple separable case. The above form applies to noise with variance σ2=1\sigma^{2}=1. For general variance we adopt again the scaling relation (1.3).

We will apply these denoiser to signals from the block-sparse class FN,ε,B{\cal F}_{N,{\varepsilon},B} defined as follows for ε{\varepsilon}\in, BNB\in{\bf N}, MN/BM\equiv N/B,

In words, this is the class of (random) vectors X{\bf X} that have (in expectation) at most MεM{\varepsilon} blocks different from . For simplicity, we will write Fε,B{\cal F}_{{\varepsilon},B} for the M=1M=1 case, FB,ε,B{\cal F}_{B,{\varepsilon},B}

The same simplifications described in Section 2.1 applies, with obvious modifications, to the present context.

Let FN,εP(RN){\cal F}_{N,{\varepsilon}}\subseteq{\cal P}({\bf R}^{N}) be any family of probability distributions satisfies the following conditions: (i)(i) If νBFB,ε\nu_{B}\in{\cal F}_{B,{\varepsilon}}, then defining νNνB××νB\nu_{N}\equiv\nu_{B}\times\cdots\times\nu_{B} (M=N/BM=N/B times), we have νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}}; (ii)(ii) Viceversa, if νNFN\nu_{N}\in{\cal F}_{N}, then letting νN,i\nu_{N,i} denote the marginal of the ii-th block under νN\nu_{N}, we have νNM1i=1MνN,iFB,ε\overline{\nu}_{N}\equiv M^{-1}\sum_{i=1}^{M}\nu_{N,i}\in{\cal F}_{B,{\varepsilon}}.

Then, for any block-separable denoiser η\eta, and for any NN multiple of BB

The proof is omitted since it is an immediate generalization of the one of Lemma 3.1. The class FN,ε,B{\cal F}_{N,{\varepsilon},B} to be studied in the rest of this section clearly satisfy the assumption of this lemma.

Block-soft thresholding ηsoft(;τ):RBRB\eta^{soft}(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B} is the nonlinear shrinker defined by letting, for yRBy\in{\bf R}^{B}, and τR+\tau\in{\bf R}_{+},

where (z)+max(z,0)(z)_{+}\equiv\max(z,0). The case B=1B=1 reduces to traditional soft thresholding of Example 1.2. More generally, ηsoft(y;τ)\eta^{soft}(y;\tau) shrinks its argument yy to if y2τ\|y\|_{2}\leq\tau and moves it by an amount τ\tau towards the origin otherwise. It can also be regarded as the solution of a penalized least squares problem, namely

Block thresholding has previously been considered by Hall, Kerkyacharian and Picard [HKP98] and by Cai [Cai99] although in specific ‘wavelet’ applications.

where expectation is taken with respect to XνX\sim\nu independent of ZN(0,IB×B){\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B}). Notice that the condition νFε,B\nu\in{\cal F}_{{\varepsilon},B} simply amount to saying that ν\nu is a probability measure on RB{\bf R}^{B} with ν({0})1ε\nu(\{0\})\geq 1-{\varepsilon}. The calculation of MB(εBlockSoft)M_{B}({\varepsilon}|{\rm BlockSoft}) can be reduced to a calculus problem. We state the results below deferring calculations to Appendix D.

Let XBX_{B} bye a chi-square random variable with BB degrees of freedom and define the functions g,h:R+Rg,h:{\bf R}_{+}\to{\bf R} as follows

The minimax risk of block soft thresholding over the class FN,ε,B{\cal F}_{N,{\varepsilon},B} is given by

This is a parametric expression for τ[0,)\tau\in[0,\infty). The parameter corresponds to the minimax threshold τ\tau.

In Figure 5 we present graphs of M(ε)=MB(εBlockSoft)M({\varepsilon})=M_{B}({\varepsilon}|{\rm BlockSoft}) as a function of ε{\varepsilon}. It is immediate to prove the following structural properties: (i)(i) 0M(ε)10\leq M({\varepsilon})\leq 1 (the upper bound follows from taking τ=0\tau=0); (ii)(ii) M(ε)M({\varepsilon}) is monotone increasing and concave (monotonicity is a consequence of FB,εFB,ε{\cal F}_{B,{\varepsilon}}\subseteq{\cal F}_{B,{\varepsilon}^{\prime}} for εε{\varepsilon}\leq{\varepsilon}^{\prime}, and concavity follows since any measure in Fqε1+(1q)ε2,B{\cal F}_{q{\varepsilon}_{1}+(1-q){\varepsilon}_{2},B} can be written as convex combination of measures in Fε1,B{\cal F}_{{\varepsilon}_{1},B} and in Fε2,B{\cal F}_{{\varepsilon}_{2},B}); (iii)(iii) M(ε)0M({\varepsilon})\rightarrow 0 as ε0{\varepsilon}\rightarrow 0; M(ε)1M({\varepsilon})\rightarrow 1 as ε1{\varepsilon}\rightarrow 1. (Recall that we are considering the MSE per coordinate.) Associated with the minimax problem is also an optimal threshold value τ(εB)\tau^{*}({\varepsilon}|B), that we plot in Figure 6.

A particularly interesting case is the one of large blocks. As BB\rightarrow\infty the minimax MSE has a well defined, and particularly explicit limit.

Further, when properly normalized, the minimax threshold converges with increasing block size:

2 Block James-Stein

In order to approach oracle MSE for large BB, we propose to use the positive-part James-Stein shrinkage estimator [JS10]. This is again a block-separable denoiser that acts as follows on a block yRBy\in{\bf R}^{B}:

Analogously to block soft thresholding, this estimator shrinks to blocks with small norms. On the other hand, its bias vanishes as y2\|y\|_{2}\to\infty. Using once more Lemma 3.1 we have (notice that in this case there is no tuning parameter)

Remarkably, the limiting BB\rightarrow\infty behavior of this denoiser is ideal, and noticeably better than block soft thresholding, as shown by Figure 7 and formally in the next lemma.

Let MB(εJamesStein)M_{B}({\varepsilon}|{\rm JamesStein}) denote the minimax MSE for ηJS\eta^{JS} over the class of ε{\varepsilon}-block sparse sequences. For any B>2B>2, we have:

Consider temporarily the case where the observation is Y=μ+Z{\bf Y}=\mu+{\bf Z} with ZN(0,IB×B){\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B}), and μRB\mu\in{\bf R}^{B} nonrandom and known. A simple calculation shows that the optimal linear estimator of the form η(y)=cy\eta(y)=cy, cRc\in{\bf R}, is given by

This estimator uses information about μ2\|\mu\|_{2} (which could only be supplied by an oracle) to choose the constant cc as a function of μ2\|\mu\|_{2}. Note in particular that the risk of this estimator is

Applying (3.6), and keeping in mind that, for νFε,B\nu\in{\cal F}_{{\varepsilon},B}, ν({X=0})(1ε)\nu(\{X=0\})\geq(1-{\varepsilon}), we have

The oracle inequality [DJ95, Theorem 5] shows that for B>2B>2, and for every vector μRB\mu\in{\bf R}^{B}, if YN(μ,IB×B)Y\sim{\sf N}(\mu,{\bf I}_{B\times B}), then

Combined with the previous display this proves the Lemma. ∎

The argument in the proof leads in fact to a convenient expression for MB(εJamesStein)M_{B}({\varepsilon}|{\rm JamesStein}). With the notations introduced there, we have

Now ηJS\eta^{JS} is known to be minimax for the unconstrained problem of estimating a non-sparse vector μ\mu, i.e. supμRBR(μ;ηJS)=B\sup_{\mu\in{\bf R}^{B}}R(\mu;\eta^{JS})=B yielding

Therefore computing the minimax MSE for ηJS\eta^{JS} reduces to computing the single quantity R(0;ηJS)R(0;\eta^{JS}), that can be estimated through numerical integration. A good approximation for large BB is provided by the following formula R(0;ηJS)=B1+κB3/2+O(B2)R(0;\eta^{JS})=B^{-1}+\kappa B^{-3/2}+O(B^{-2}) with κ0.752\kappa\approx 0.752 (cf. Appendix I.2.1). Hence we have

In the next section will use this formula (neglecting O(B2)O(B^{-2}) terms) in comparing the general prediction of Eq. (1.13) with the empirical results for the James-Stein AMP algorithm. Numerical integration reveals that this formula is accurate enough for such comparison.

3 Empirical phase transition behavior

We now turn to the compressed sensing reconstruction problem whereby the block-sparse vector x0x_{0} is reconstructed from observed data y=Ax0y=Ax_{0} using the AMP algorithm. We want to test the hypothesis that Eq. (1.13) describes the phase transition of the two block shrinkage AMP algorithms, corresponding to the block soft thresholding, and block James-Stein.

We conducted a set of experiments similar to those described in Section 2.4 We constructed block-sparse signals at different undersampling and sparsity levels and ran tests of block thresholding AMP. More precisely, we used the update equations (1.7) to (1.9) with η=ηsoft\eta=\eta^{soft} (block soft thresholding AMP) or η=ηJS\eta=\eta^{JS} (James-Stein AMP).

It is a straightforward calculus exercise to compute an explicit expression for the memory term bt{\sf b}_{t}. For block soft thresholding AMP we get

Our results show that the curve δ=MB(εBlockSoft)\delta=M_{B}({\varepsilon}|{\rm BlockSoft}) correctly separates two phases of performance: below this curve success in AMP recovery is atypical and above it is typical. Similarly, the curve δ=MB(εJamesStein)\delta=M_{B}({\varepsilon}|{\rm JamesStein}) correctly describes the phase transition for block James-Stein shrinkage. The empirical results are presented in Figure 8 (for block soft thresholding) and Figure 9 (for block James-Stein). We refer to Appendix G for further details.

Monotone regression

In this section and the next, we show that, quite surprisingly, the formula (1.13) can be applied also to some highly nontrivial non-separable denoisers.

In this section we consider vectors that are monotone, and mostly constant. Let M{\cal M} denote the cone of nondecreasing sequences:

We then define the class of mostly constant non-decreasing vectors

Since vectors from this class are –in general– not sparse, we will occasionally refer to the parameter ε{\varepsilon} as to the ‘simplicity’ parameter.

For this problem we will consider the denoiser ηmono:RNRN\eta^{mono}:{\bf R}^{N}\to{\bf R}^{N}, that solves the monotone regression problem

In other words, ηmono\eta^{mono} is the (Euclidean) projection on the cone of monotone sequences. This denoiser is highly non-separable, as one can understand most clearly by studying the standard pool-adjacent-violators algorithm for implementing it (see [BC90] for a recent reference).

In order to apply formula (1.13), we need to calculate M(εMonoReg)M({\varepsilon}|{\rm MonoReg}), which requires in particular determining the least favorable distribution νNFε,N,mono\nu_{N}\in{\cal F}_{{\varepsilon},N,{\rm mono}} and proving that the limit NN\to\infty of the minimax MSE exists. We present here the main ideas, deferring details to Appendix F.

It is convenient to introduce the risk at μMN\mu\in{\cal M}_{N}:

where expectation is taken with respect to ZN(0,IN×N){\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}). It is further useful to introduce a specific notation for the risk at , namely

The risk of monotone regression satisfies the following properties

The function tRN(tμ)t\mapsto R_{N}(t\,\mu) is monotone increasing for tR+t\in{\bf R}_{+}.

Let I+(μ){i[N1]:  μi<μi+1}I_{+}(\mu)\equiv\{i\in[N-1]:\;\mu_{i}<\mu_{i+1}\} be the set of increase points of μ\mu. Denoting them by I+(μ){i1,i2,,iK(μ)}I_{+}(\mu)\equiv\{i_{1},i_{2},\dots,i_{K(\mu)}\}, ikik+1i_{k}\leq i_{k+1}, let Jk{ik+1,ik+2,,ik+1}J_{k}\equiv\{i_{k}+1,i_{k}+2,\dots,i_{k+1}\} for k{0,,K(μ)}k\in\{0,\dots,K(\mu)\} (with, by convention, i0=0i_{0}=0, iK(μ)+1=Ni_{K(\mu)+1}=N). Then, for any μMN\mu\in{\cal M}_{N},

For a non-empty closed convex SRN{\cal S}\subseteq{\bf R}^{N}, we let PS:RNRN{\sf P}_{\cal S}:{\bf R}^{N}\to{\bf R}^{N} denote the Euclidean projector to S{\cal S}, i.e. PS(y)\mboxargminxSxy2{\sf P}_{\cal S}(y)\equiv\mbox{argmin}_{x\in S}\|x-y\|_{2}. Further, for vRNv\in{\bf R}^{N}, S+v{x+v:  xS}{\cal S}+v\equiv\{x+v\,:\;x\in{\cal S}\}.

Note that it is sufficient to show that, letting D(\mu;z)\equiv\big{\|}\eta^{mono}(\mu+z)-\mu\big{\|}_{2}^{2}, tD(tμ;z)t\mapsto D(t\mu;z) is monotone increasing in tR+t\in{\bf R}_{+}. By continuity of the projection operator, it follows that μD(μ;z)\mu\mapsto D(\mu;z) is continuous. Further, notice that MN{\cal M}_{N} is a cone obtained as the intersection of N1N-1 half-spaces:

Let Vi={xRN:  xi=xi+1}{\cal V}_{i}=\{x\in{\bf R}^{N}:\;x_{i}=x_{i+1}\} be the separating hyperplane for Hi{\cal H}_{i} and, for B[N1]B\subseteq[N-1], define

with, by convention VRN{\cal V}_{\emptyset}\equiv{\bf R}^{N}. Since ηmono=PMN\eta^{mono}={\sf P}_{{\cal M}_{N}}, we have that ηmono\eta^{mono} is continuous and piecewise linear and equal one of the projectors PVB{\sf P}_{{\cal V}_{B}}, that we will denote by PB{\sf P}_{B} for B[N1]B\subseteq[N-1]. It is therefore sufficient to show that, defining for B[N1]B\subseteq[N-1],

the function tDB(tμ,z)t\mapsto D_{B}(t\mu,z) is monotone increasing for tR+t\in{\bf R}_{+}.

Let Bk=1KBkB\equiv\cup_{k=1}^{K}B_{k} where each BkB_{k} is a contiguous segment (in the sense of point (b)(b)), and Bk{i[N]:iBk(i1)Bk}\overline{B}_{k}\equiv\{i\in[N]:i\in B_{k}\vee(i-1)\in B_{k}\}. Further, for xRNx\in{\bf R}^{N}, let xSS1iSxi\overline{x}_{S}\equiv|S|^{-1}\sum_{i\in S}x_{i}. Then, for any xRNx\in{\bf R}^{N},

which is clearly increasing in tR+t\in{\bf R}_{+}. ∎

The proof of part (b)(b) is deferred to Appendix F.

The last Lemma shows that the least favorable signal μ\mu is constant on N(1ε)N(1-{\varepsilon}) positions of the interval {1,2,,N}\{1,2,\dots,N\} and has large (going to infinity) jumps at the remaining NεN{\varepsilon} increase points. The resulting risk only depends on the distribution of the lengths of the intervals over which μ\mu is constant.

The next Lemma provides some useful insight on the behavior of the risk at . This is crucial since it determines the minimax risk though Eq. (4.4).

The monotone regression risk at zero, defined through Eq. (4.3) satisfies r(1)=1r(1)=1 and, for any N10N\geq 10

The proof of this Lemma can be found in Appendix F.2.

For moderate values of NN, r(N)r(N) can be computed numerically through Monte Carlo simulations. Figure 10 presents the results of such a simulation. It appears that r(N)=Θ(logN)r(N)=\Theta(\log N) as NN\to\infty suggesting that the last lemma is loose by a logarithmic factor.

We can finally establish our main result on the minimax MSE of monotone regression over the class FN,ε,mono{\cal F}_{N,{\varepsilon},{\rm mono}}. Remarkably, we are able to characterize the least favorable distribution νNFN,ε,mono\nu_{N}\in{\cal F}_{N,{\varepsilon},{\rm mono}}.

The asymptotic minimax MSE of monotone regression

Finally, for any ξ>0\xi>0, ε(0,1){\varepsilon}\in(0,1), the following distribution νN(ε,ξ)FN,ε,mono\nu_{N}^{({\varepsilon},\xi)}\in{\cal F}_{N,{\varepsilon},{\rm mono}} has risk larger that M(εMonoReg)ξM({\varepsilon}|{\rm MonoReg})-\xi for all NN large enough. A signal XνN(ε,ξ){\bf X}\sim\nu^{({\varepsilon},\xi)}_{N} has Xi+1Xi=Δ>0X_{i+1}-X_{i}=\Delta>0 at all increase points i[N1]i\in[N-1] for some Δ=Δ(ξ)\Delta=\Delta(\xi) large enough, and the lengths of intervals between increase points have distribution π\pi achieving the max in Eq. (4.6).

Hence M(εMonoReg)M({\varepsilon}|{\rm MonoReg}) is immediately upper bounded by the right hand side of Eq. (4.6). The matching lower bound is obtained by evaluating the above expressions for the distribution νN(ε,ξ)\nu_{N}^{({\varepsilon},\xi)}.

Finally Eq. (4.1) follows by using Lemma 4.2 in Eq. (4.6). ∎

The resulting curve M(εMonoReg)M({\varepsilon}|{\rm MonoReg}) is presented in Fig. 10.

2 Empirical phase transition behavior

We next consider the compressed sensing problem. We programmed the AMP iteration (1.7)-(1.9), with η(;τ,σ)=ηmono()\eta(\,\cdot\,;\tau,\sigma)=\eta^{mono}(\,\cdot\,) the monotone regression denoiser. The denoiser itself was implemented using the standard pool adjacent violators algorithm.

It is a simple exercise to obtain an explicit formula for the memory term bt{\sf b}_{t}. As in Lemma 4.1, let K(μ)K(\mu) denote the number of increase points in the signal μRN\mu\in{\bf R}^{N}. We then have

We will refer to this specific version of AMP as to monoreg AMP.

In the present case we evaluated success probability using the following (Hamming-like) distance

and declared a success when H(xt,x0)βH(x^{t},x_{0})\leq\beta. In Figure 11 we used t=300t=300 and α=β=0.01\alpha=\beta=0.01, but very similar results are obtained with other values of the parameters. The rationale for using Hα(xt,x0)H_{\alpha}(x^{t},x_{0}) instead of the normalized mean square error lies in the structure of the signals x0x_{0}. Since the least favorable x0x_{0} is monotone with large jumps, its norm is very large,concentrated at endpoints, and depends strongly on NN. This leads to subtle normalization issues across different NN.

The agrement between the empirical phase transition and the general prediction δ=M(εMonoReg)\delta=M({\varepsilon}|{\rm MonoReg}) in Fig. 11 is satisfactory and improves with the signal’s length.

Total variation minimization

In this section we consider vectors xRNx\in{\bf R}^{N} that are mostly constant, with a few change points. In order to model this problem, we introduce the class of probability distributions

Again ε(0,1){\varepsilon}\in(0,1) is a ‘simplicity’ parameter. Note that this class is quite similar to the class FN,ε,mono{\cal F}_{N,{\varepsilon},{\rm mono}} studied in the previous section, the ‘only’ difference being that change points can be either points of increase or points of decrease.

A convenient denoiser for this setting is the total variation penalized least-squares [ROF92], also called fused LASSO [TSR+05], that we will denote by ηtv(;τ):RNRN\eta^{tv}(\,\cdot\,;\tau):{\bf R}^{N}\to{\bf R}^{N}. This depends on τR+\tau\in{\bf R}_{+} and, for yRNy\in{\bf R}^{N} and noise variance σ2=\sigma^{2}=, it returns

An extensive literature is devoted to solving this denoising problem, see for example [VO96]. For general noise variance σ2\sigma^{2}, the above expression is generalized through the usual scaling relationship (1.3).

Much of the analysis in this section is analogous to the one of monotone regression. We will therefore present several arguments in synthetic form to limit redundancy.

In this section we outline the computation of the asymptotic minimax MSE of the total variation denoiser over the class FN,ε,TV{\cal F}_{N,{\varepsilon},TV}, to be denoted by M(εTV)M({\varepsilon}|TV).

We start by defining a generalization of the problem (5.1). For s=(s1,s2){+1,1}2s=(s_{1},s_{2})\in\{+1,-1\}^{2}, we let

We define the risk at μRN\mu\in{\bf R}^{N} as

where ZN(0,IN×N){\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}). We denote the risk at by

Notice that, by symmetry, rs(N;τ)=rs(N;τ)r_{s}(N;\tau)=r_{-s}(N;\tau) and rs1,s2(N;τ)=rs2,s1(N;τ)r_{s_{1},s_{2}}(N;\tau)=r_{s_{2},s_{1}}(N;\tau). We then have the following analogous of Lemma 4.1.

The risk of total variation regression satisfies the following properties

The function tRN(tμ;τ)t\mapsto R_{N}(t\,\mu;\tau) is monotone increasing for tR+t\in{\bf R}_{+}.

Let I(μ){i[N1]:  μiμi+1}I_{\neq}(\mu)\equiv\{i\in[N-1]:\;\mu_{i}\neq\mu_{i+1}\} be the set of change points of μ\mu. Denoting them by I(μ){i1,i2,,iK(μ)}I_{\neq}(\mu)\equiv\{i_{1},i_{2},\dots,i_{K(\mu)}\}, ikik+1i_{k}\leq i_{k+1}, let Jk{ik+1,ik+2,,ik+1}J_{k}\equiv\{i_{k}+1,i_{k}+2,\dots,i_{k+1}\} for k{0,,K}k\in\{0,\dots,K\} (with, by convention, i0=0i_{0}=0, iK+1=Ni_{K+1}=N). Further, for K1K\geq 1, let s(k)=[sgn(μikμik+1),sgn(μik+1+1μik+1)]s(k)=[{\rm sgn}(\mu_{i_{k}}-\mu_{i_{k}+1}),{\rm sgn}(\mu_{i_{k+1}+1}-\mu_{i_{k+1}})] for k{1,,K1}k\in\{1,\dots,K-1\}, s(0)=[0,sgn(μi1+1μi1)]s(0)=[0,{\rm sgn}(\mu_{i_{1}+1}-\mu_{i_{1}})], s(K)=[0,sgn(μikμik+1)]s(K)=[0,{\rm sgn}(\mu_{i_{k}}-\mu_{i_{k}+1})]. For K=0K=0 we let s(0)=(0,0)s(0)=(0,0).

The argument in part (b)(b) is essentially the same as in part (b)(b) of Lemma 4.1 and we will therefore omit it.

For proving part (a)(a), we will prove that, letting D(μ;z)ηtv(μ+z;τ)μ22D(\mu;z)\equiv\|\eta^{tv}(\mu+z;\tau)-\mu\|_{2}^{2}, the function tD(tμ;z)t\mapsto D(t\mu;z) is increasing for tR+t\in{\bf R}_{+}. First notice that the stationarity condition for the minimum in Eq. (5.1) reads

with the convention that v0=vN=0v_{0}=v_{N}=0. Let I=I(x)I_{\neq}=I_{\neq}(x) and Jk,s(k)J_{k},s(k) be defined as in part (b)(b) of the statement. Then, summing Eq. (5.6) over iJki\in J_{k}, we get

where s(k)=s1(k)+s2(k)\overline{s}(k)=s_{1}(k)+s_{2}(k) Hence ηtv(;τ)\eta^{tv}(\,\cdot\,;\tau) is piecewise affine with components indexed by J={Jk,s(k)}k[K]{\cal J}=\{J_{k},s(k)\}_{k\in[K]}. Within each component, we have ηtv(y;τ)=FJ(y)\eta^{tv}(y;\tau)=F_{{\cal J}}(y) with FJF_{{\cal J}} defined as per Eq. (5.8).

Since yηtv(y;τ)y\mapsto\eta^{tv}(y;\tau) is continuous (and hence tD(tμ;z)t\mapsto D(t\mu;z) is), it is sufficient to prov that, letting

the function tDJ(μ;z)t\mapsto D_{{\cal J}}(\mu;\,z) is monotone increasing for tR+t\in{\bf R}_{+}. Using Eq. (5.8) we obtain

where xJk\overline{x}_{J_{k}} denotes the average of vector xx over JkJ_{k}. It follows that tDJ(μ;z)t\mapsto D_{{\cal J}}(\mu;\,z) is increasing as claimed. ∎

The risk at , rs(N;τ)r_{s}(N;\tau), can be computed numerically for moderate values of NN. Notice that the cases r00(N;τ)r_{00}(N;\tau), r0±(N;τ)r_{0\pm}(N;\tau), and r±0(N;τ)r_{\pm 0}(N;\tau) are only relevant for the boundary intervals J0(μ)J_{0}(\mu) and JK(μ)(μ)J_{K(\mu)}(\mu) and turn out to be immaterial for the asymptotic minimax risk. Thanks to symmetries, the only relevant cases are r++(N;τ)r_{++}(N;\tau) and r+(N;τ)r_{+-}(N;\tau). The results of a numerical computation for these quantities is shown in Figure 12. These calculations suggest r++(N;τ)r+(N;τ)r_{++}(N;\tau)\geq r_{+-}(N;\tau), which is indeed consistent with intuition as boundary conditions ++++ induce a larger bias. Also, it is easy to prove that r+(N;τ)0r_{+-}(N;\tau)\to 0 as τ\tau\to\infty (as τ\tau\to\infty, ηtv(y;τ)\eta^{tv}(y;\tau) converges to a constant vector).

Using the last Lemma, and proceeding as in the proof of Theorem 4.1, it is immediate to obtain a characterization of the minimax MSE of the total variation denoiser. For technical reasons, we need to introduce the class FN,ε,TV(L){\cal F}_{N,{\varepsilon},TV}(L) of vectors in FN,ε,TV{\cal F}_{N,{\varepsilon},TV} with distance at most LL between changepoints.

The asymptotic minimax MSE of total variation denoiser

We omit this proof since it is an immediate generalization of the one of Theorem 4.1. Notice that ML(ε)M_{L}({\varepsilon}) is monotone increasing in LL and hence admit a limit as LL\to\infty. We expect that

This limit can be evaluated numerically and in Figure 12 we plot the resulting minimax risk.

Notice that, by properly modifying Eq. (5.9), one obtains the minimax risk over subsets of FN,ε{\cal F}_{N,{\varepsilon}} with constrained change point distributions. For instance, we can consider the case in which the lengths between change points are distributed as for uniformly random change points, and increase/decrease points are alternating. We then get

We consequently define the random changepoint minimax risk as

This curve is plotted in Figure 12 for comparison.

2 Empirical phase transition behavior

We implemented the the AMP iteration (1.7)-(1.9), using the total variation denoiser η(;τ,σ)=ηtv(;τ,σ)\eta(\,\cdot\,;\tau,\sigma)=\eta^{tv}(\,\cdot\,;\tau,\sigma). For the latter, we used the software package tvdip (in the Matlab implementation) or the projected Newton method [VO96] (in the Java implementation).

For xRNx\in{\bf R}^{N} be K0(x)K_{0}(x) denote the number of constant segments in xx or, equivalently, the number of change points in xx, plus one. We then have the following expression for the memory term in Eq. (1.7)-(1.9):

We will refer to this specific version of AMP as to TV-AMP.

We carried out two types of experiments. In the first class of experiments we considered signals xx with distances between change points distributed as

This is the same distribution as if each position is independently an increase point or a decrease point, each with probability ε/2{\varepsilon}/2. The predicted phase transition curve is given by Eq. (5.10) and the minimax value of τ\tau is used in the AMP implementation. The simulations results are presented in Figure 13 for N=200N=200, 500500 and show good agreement between predictions and observations. In this case we used the Hamming metric (4.8), because the norm of the typical signal x022\|x_{0}\|_{2}^{2} scales super-linearly in NN.

Characterization of the phase transition using state evolution

In this section we prove the basic relation (1.13) between minimax mean square error in denoising and the phase-transition boundary in the sparsity-undersampling plane. Our proof assumes that the state evolution formalism developed in [DMM09, DMM10, DMM11] holds, in the precise terms stated below. This formalism was established rigorously for separable denoisers (under additional regularity assumptions) in [BM11a].

A crucial observation for state evolution is that the mean squared error of the AMP reconstruction xtx^{t} at iteration tt is practically non-random for large system sizes NN, and has a well-defined limit as NN\to\infty. In particular, the limit

exists almost surely (here we assume nn\to\infty, while n/Nδn/N\to\delta). Moreover, the evolution of mtm_{t} with increasing tt is dictated by a formula mt+1=Ψ(mt)m_{t+1}=\Psi(m_{t}) which is explicitly computable, and defined below. We will use the term state evolution to refer both to the mapping mΨ(m)m\mapsto\Psi(m) and to the sequence {mt}t0\{m_{t}\}_{t\geq 0} with appropriate initial condition. State evolution allows to determine whether AMP recovers the signal x0x_{0} correctly, by simply checking whether mt0m_{t}\rightarrow 0 as tt\rightarrow\infty (in which case the MSE vanishes asymptotically) or not. The latter problem does in turn reduce to a problem in real analysis.

The papers [DMM09, DMM10, DMM11] developed the state evolution framework for separable denoisers and verified its predictions numerically for three specific examples (namely for the shrinkers Soft, SoftPos, and Cap). However, the heuristic argument presented in those papers was much more general. Indeed, [BM11a] proved that state evolution holds, in a precise asymptotic sense, for Gaussian measurement matrices AA with iid entries and generic separable denoisers, under mild regularity assumptions. A generalization to non-gaussian entries was subsequently proved in [BLM12].

Here we generalize this approach to non-separable denoisers η(;τ,σ):RNRN\eta(\,\cdot\,;\tau,\sigma):{\bf R}^{N}\mapsto{\bf R}^{N}. This framework covers all the shrinkers discussed in Sections 2 to 5, and yields a formal proof of the main formula (1.13), under the assumption that indeed state evolution is correct in this broader context. We will throughout assume the scaling relation (1.3).

In the next sections we will first introduce some basic notations and facts about state evolution. Then we will prove the phase transition expression (1.13) by establishing first a lower bound and then a matching upper bound, both given by the minimax MSE.

The next definition provides the suitable generalization of the state evolution mapping to the present setting.

For given δ,τ0\delta,\tau\geq 0, and ν={νN}NN\nu=\{\nu_{N}\}_{N\in{\bf N}} a sequence of probability distributions over RN{\bf R}^{N}, define the state evolution mapping Ψ(;δ,τ,ν):RR\Psi(\,\cdot\,;\delta,\tau,\nu):{\bf R}\mapsto{\bf R} by

whenever the limit on the right-hand side exists. Here, as before, X{\bf X} and Z{\bf Z} are independent vectors, XνN{\bf X}\sim\nu_{N}, ZN(0,IN×N){\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}). In other words, Ψ(m;δ,τ,ν)\Psi(m;\delta,\tau,\nu) is the per-coordinate MSE of denoiser η\eta at noise level σ2m/δ\sigma^{2}\equiv m/\delta.

The fixed points of the mapping mΨ(m)m\mapsto\Psi(m) play of course a crucial role in the analysis of state evolution.

The highest fixed point of the mapping Ψ()=Ψ(;δ,τ,ν)\Psi(\,\cdot\,)=\Psi(\,\cdot\,;\delta,\tau,\nu) is defined as HFP(Ψ)sup{m:Ψ(m)m}{\rm HFP}(\Psi)\equiv\sup\{m:\Psi(m)\geq m\}.

The importance and applicability this notion is underscored by the next two observations. Here and below we say that a function f:R+Rf:{\bf R}_{+}\to{\bf R} is starshaped if xf(x)/xx\mapsto f(x)/x is decreasing.

Suppose that m0>0m_{0}>0 and any one of these three conditions holds:

Ψ(m)\Psi(m) is an increasing function of mm, and the initial condition of state evolution satisfies m0HFP(Ψ)m_{0}\geq{\rm HFP}(\Psi);

Ψ(m)\Psi(m) is an increasing starshaped function of mm.

Then state evolution converges to the highest fixed point:

Further, if HFP(Ψ)>0{\rm HFP}(\Psi)>0 and Ψ(m)\Psi(m) is a starshaped function of mm, then liminftmt>0\lim\inf_{t\rightarrow\infty}m_{t}>0.

The proof is a standard calculus exercise; we omit it.

The function mΨ(m)m\mapsto\Psi(m) is starshaped for all of the following choices of the denoiser η\eta:

The proof of this Lemma is deferred to Appendix I.

2 State Evolution Phase Transition

Consider a collection FN,ε{\cal F}_{N,{\varepsilon}} of probability distributions over RN{\bf R}^{N}, indexed by ε{\varepsilon}\in as per Section 1.1 (these do not need to be simple sparse signals). For a sequence of probability distributions ν={νN}NN\nu=\{\nu_{N}\}_{N\in{\bf N}} we write νFε\nu\in{\cal F}_{{\varepsilon}} if νNFN,ε\nu_{N}\in{\cal F}_{N,{\varepsilon}} for all NN and the limit on the Eq. (6.1) exists for each mR+m\in{\bf R}_{+}. Letting HFP(δ,τ,ν)=HFP(Ψ(    ;δ,τ,ν)){\rm HFP}(\delta,\tau,\nu)={\rm HFP}(\Psi(\;\cdot\;;\delta,\tau,\nu)), we consider the minimax value:

For ε{\varepsilon}\in, define the state evolution phase transition as

Note that HFP(ε,δ){\rm HFP}^{*}({\varepsilon},\delta) is monotone decreasing as a function of δ\delta, by definition of the state evolution mapping Ψ\Psi, cf. Eq. (6.1). It follows that HFP(ε,δ)=0{\rm HFP}^{*}({\varepsilon},\delta)=0 for δ>δSE(ε)\delta>\delta_{SE}({\varepsilon}) and HFP(ε,δ)>0{\rm HFP}^{*}({\varepsilon},\delta)>0 for δ<δSE(ε)\delta<\delta_{SE}({\varepsilon}). Further, by nestedness, it is monotone increasing as a function of ε{\varepsilon}, which implies that εδSE(ε){\varepsilon}\mapsto\delta_{SE}({\varepsilon}) is monotone increasing. The rationale for this definition is that, for δ>δSE(ε)\delta>\delta_{SE}({\varepsilon}) and under any of the assumptions of Lemma 6.1, state evolution predicts that AMP will correctly recover the signal x0x_{0}.

Let FN,ε{\cal F}_{N,{\varepsilon}} be a nested, scale-invariant collection of probability distributions, and assume that the shrinker η(;τ,σ)\eta(\,\cdot\,;\tau,\sigma) obeys the scaling relation (1.3). Define the minimax MSE M(εη)M({\varepsilon}|\eta) as per Eq. (1.5). Then

In order to prove this result, we shall first establish a more general fact. Given a sequence of distributions ν={νN}NN\nu=\{\nu_{N}\}_{N\in{\bf N}} and, for any τΘ\tau\in\Theta, we let

The rationale for this definition is clear. Under the assumption that state evolution holds, for a signal x0x_{0} sampled from distribution νN\nu_{N}, AMP (with tuning parameter τ\tau) is guaranteed to reconstruct x0x_{0} if and only if δ>δSE(τ,νη)\delta>\delta_{SE}(\tau,\nu|\eta).

the normalized MSE for denoising with worst case case signal-to-noise ratio. With these definitions we have the following.

For any sequence of probability measures ν={νN}N1\nu=\{\nu_{N}\}_{N\geq 1}, and any τΘ\tau\in\Theta, we have

We are now in position to prove Theorem 6.1.

Throughout the proof we drop the argument η\eta, since this is kept constant.

Define δ(ε)infτΘsupνFεδ(τ,ν)\delta_{*}({\varepsilon})\equiv\inf_{\tau\in\Theta}\sup_{\nu\in{\cal F}_{{\varepsilon}}}\delta(\tau,\nu). We claim that δ(ε)=δSE(ε)\delta_{*}({\varepsilon})=\delta_{SE}({\varepsilon}). Indeed choose δ[0,δSE(ε))\delta\in[0,\delta_{SE}({\varepsilon})). Then by definition there exists m>0m>0 such that, for all τΘ\tau\in\Theta, there exists νFε\nu\in{\cal F}_{{\varepsilon}} with HFP(δ,τ,ν)>m{\rm HFP}(\delta,\tau,\nu)>m. Hence, for all τΘ\tau\in\Theta, there exists νFε\nu\in{\cal F}_{{\varepsilon}} with δ(τ,ν)δ\delta(\tau,\nu)\geq\delta. This implies that, for all τΘ\tau\in\Theta, supνFεδ(τ,ν)>δ\sup_{\nu\in{\cal F}_{{\varepsilon}}}\delta(\tau,\nu)>\delta, i.e. δ(ε)δ\delta_{*}({\varepsilon})\geq\delta. Proceeding in the same way, it is immediate to prove that, for any δ[0,δ(ε))\delta\in[0,\delta_{*}({\varepsilon})), we have δSE(ε)>δ\delta_{SE}({\varepsilon})>\delta. Hence, we conclude that δSE(ε)=δ(ε)\delta_{SE}({\varepsilon})=\delta_{*}({\varepsilon}) as claimed.

To conclude the proof, we note that, by Lemma 6.3, we have

where the last equality follows from the scale invariant property of FN,ε{\cal F}_{N,{\varepsilon}}. The last quantity is nothing but M(ε)M({\varepsilon}). ∎

3 Non-convergence of state evolution

Lemmas 6.1.(b) and 6.2 immediately imply that the answer is positive for soft thresholding, positive soft thresholding, block soft thresholding, James-Stein shrinkage, monotone regression and total variation denoising. It turns out that the answer is positive also for firm thresholding and the global minimax denoiser. In Appendix E we describe the argument for these cases.

Phase transitions for other algorithms

Formula (1.13) connects an algorithmic property – phase transitions of AMP recovery algorithms – with a property from statistical decision theory – minimax mean squared error in denoising.

We want to explore a further connection, relating the behavior of convex optimization with that of certain AMP algorithms. As proved in [BM11b], in the large system limit AMP with soft thresholding denoiser effectively computes the solution to

for an appropriately calibrated λ=λ(τ)\lambda=\lambda(\tau).

More generally, consider a generalized reconstruction method of the form

where J:RNRJ:{\bf R}^{N}\to{\bf R} is a convex penalization. To this reconstruction problem, we can associate an AMP algorithm, by using the denoiser ηJ(;τ)\eta^{J}(\,\cdot\,;\tau) in Eq. (1.7), (1.8), (1.9), whereby

(we also let ηJ(;τ,σ)=ηJ(;τσ)\eta^{J}(\,\cdot\,;\tau,\sigma)=\eta^{J}(\,\cdot\,;\tau\cdot\sigma)). In other words ηJ\eta^{J} is the proximal operator of the penalization J()J(\,\cdot\,). We will refer to this algorithm as to AMP-JJ.

We then have the following general correspondence, which follows immediately by writing the stationarity condition of problem PJ=PJ(λ)P_{J}=P_{J}(\lambda) and the fixed points of AMP-JJ (see [Mon12]).

Any fixed point xx^{\infty} of AMP-JJ with fixed point parameters τ\tau_{\infty}, b{\sf b}_{\infty}, σ\sigma_{\infty} corresponds to a stationary point of PJ(λ)P_{J}(\lambda) with λ\lambda given by

In particular, if the regularizer JJ is convex, then fixed points correspond to minimizers.

The fixed points of AMP with positive soft thresholding denoiser are solutions of

where ,\langle\,\cdot\,,\,\cdot\,\rangle denotes the standard scalar product over RN{\bf R}^{N}, and 11 is the all-one vector.

The fixed points of AMP with capping denoiser effectively are solutions of

For noiseless compressed sensing reconstruction, the correspondence involves the limit λ0\lambda\to 0 of the above problem, that is

Phase transitions for such convex programs were characterized in [Don06, DT09a, DT10a] for the three examples mentioned above, using methods from combinatorial geometry. Thus, whenever AMP converges with high probability, formula (1.13) connects fundamental problems of minimax decision theory to fundamental problems in high dimensional combinatorial geometry.

We wil next verify numerically this correspondence beyond the three classical examples (again focusing on noiseless measurements). In the block-structured case, we can compare block-soft AMP to the following convex optimization problem:

Analogously, we can compare monoreg AMP to the following convex optimization problem:

where Δx=(Δx1,,ΔxN)\Delta x=(\Delta x_{1},\dots,\Delta x_{N}), Δxi=(xi+1xi)\Delta x_{i}=(x_{i+1}-x_{i}). Figure 16 verifies that the phase transition occurs around the location predicted by (1.13), just as with monoreg AMP.

Finally, we can compare TV AMP to the following convex optimization problem:

where again xTVi=1N1Δxi\|x\|_{TV}\equiv\sum_{i=1}^{N-1}|\Delta x_{i}|. Figure 17 verifies that the phase transition occurs at the location predicted by (1.13), just as with TV AMP.

where J(x)=λxJ(x)=\lambda|x| (here we redefined JJ to absorb the factor τ\tau).

It turns out that a similar interpretation can be given for any scalar denoiser (and hence for any separable denoiser as well). In particular, the minimax and firm thresholding rules ηall\eta^{all} and ηfirm(;τ1,τ2)\eta^{firm}(\,\cdot\,;\tau_{1},\tau_{2}) are optimizers of penalization schemes of the same form as above, with J(x)J(x) non-convex.

We can construct the penalty J()J(\,\cdot\,) corresponding to a denoiser η()\eta(\,\cdot\,) by observing that x+J(x)=yx+J^{\prime}(x)=y at the solution x=η(y)x=\eta(y). Defining the residual Δ(y)=yη(y)\Delta(y)=y-\eta(y), and noting that η(y)+Δ(y)=y\eta(y)+\Delta(y)=y, we obtain that Δ(y)\Delta(y) and J(x)J^{\prime}(x) are related through the change of variables

A similar analysis can be carried out for block separable denoisers that are covariant under rotation, i.e. if η(;τ):RBRB\eta(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B} satisfies η(Rx;τ)=Rη(x;τ)\eta(Rx;\tau)=R\eta(x;\tau) for any rotation RR. We already mentioned that the block soft denoiser can be written as

An implied penalty can also be derived for block James-Stein shrinkage ηJS\eta^{JS}. Due to the covariance under rotation, the corresponding penalty only depends on the modulus of xRBx\in{\bf R}^{B}. Figure 19 shows the implied penalty J(sB)J(s|B) as a function of s=x2s=\|x\|_{2}. Namely, the penalty J(B):R+RJ(\,\cdot\,|B):{\bf R}_{+}\to{\bf R} is such that, for yRBy\in{\bf R}^{B},

coincides with the positive-part James-Stein estimator. The implicit penalization is again nonconvex.

Appendix A Classical cases

The general formula (1.13) was already validated in [DMM09] for the following three classical cases:

Simple sparse signals from the class FN,ε{\cal F}_{N,{\varepsilon}} (cf. Eq. (1.2)) and soft thresholding denoiser.

Non-negative sparse signals with soft positive thresholding denoiser, cf. Example 1.3.

Box constrained signals with capping denoiser, cf. Example 1.3.

Analytical expressions for the phase transition curves were computed using state evolution in the Online Supplement to [DMM09]. We review the results here since they provide a useful stepping stone for understanding more complicate cases (cf., e.g.. Section 6).

Notice that the last of the examples above (box-constrained signals) is not scale invariant, according to the general definition of Section 1.1. For non scale-invariant classes, the definitions (1.4) and (1.5) are generalized by taking the supremum over the noise covariance as well. Namely, for a generic class FN,ε{\cal F}_{N,{\varepsilon}}, we let

It is easy to check that this definition coincides with the earlier one for scale-invariant classes. We will write M(εCap)M({\varepsilon}|{\rm Cap}) instead of M(εηcap)M({\varepsilon}|\eta^{cap}) for box constrained signals with capping denoiser, i.e. case (iii)(iii) above. The results for case (iii)(iii) is further elucidated by comparing it with the following scale invariant problem:

We consider sparse non-negative signals x00x_{0}\geq 0, modeled through the class FN,ε,+{\cal F}_{N,{\varepsilon},+}, and the simple positive part denoiser η+(y)max(y,0)\eta^{+}(y)\equiv\max(y,0). We denote corresponding minimax risk by M(εPos)M({\varepsilon}|{\rm Pos}).

The minimax risk for examples (i)(i), (ii)(ii), (iv)(iv) can be computed explicitly. Indeed in both cases we have

Here XνX\sim\nu and ZN(0,1)Z\sim{\sf N}(0,1) are independent random variables. The reduction second equality follows from th remark that the extremal distributions in F1,ε{\cal F}_{1,{\varepsilon}} and F1,ε,+{\cal F}_{1,{\varepsilon},+} are mixtures of two point distributions. This remark reduces the calculation of the minimax risk to a simple calculation [DMM09], whose results are summarized below.

The minimax risk for the problems stated above is

(The first two are parametric expressions in τ\tau, which is the optimal threshold at the given sparsity level.)

Also, the AMP phase transition for the noiseless reconstruction problem is given in all of these cases by δ=M(εη)\delta=M({\varepsilon}|\eta). In other words AMP succeeds with high probability of δ>M(εη)\delta>M({\varepsilon}|\eta) and fails with high probability if δ<M(εη)\delta<M({\varepsilon}|\eta).

As mentioned above, the calculation of the minimax risk is a calculus exercise, and follows the same lines as in [DMM09]. This coincide with the AMP threshold by the general analysis of Section 6 for cases (i)(i), (ii)(ii), (iv)(iv). For the non-scale invariant case, we refer, once more, to [DMM09].

Notice that problem (iii)(iii) and (iv)(iv) have the same minimax risk. This identity mirrors a result in [DT10a] that characterizes the phase transition threshold for reconstructing x0SNRNx_{0}\in{\cal S}_{N}\subseteq{\bf R}^{N} from noiseless linear measurements y=Ax0y=Ax_{0}, with SN=N{\cal S}_{N}=^{N} or SN=R+N{\cal S}_{N}={\bf R}_{+}^{N}. If a simple feasibility linear program is used (namely, find any xSNx\in{\cal S}_{N} with y=Axy=Ax), then the undersampling threshold for both problems is given by δ=(1+ε)/2\delta=(1+{\varepsilon})/2.

Appendix B Calculation of minimax MSE

In this Appendix we describe the calculation of the global minimax risk over the class FF1,ε{\cal F}\equiv{\cal F}_{1,{\varepsilon}}, as defined per Eq. (2.3). In particular, we will explain how the values in Table 1 and Figure 1 for M(εMinimax)M({\varepsilon}|{\rm Minimax}) have been computed.

Here ην\eta^{\nu} is he posterior mean estimator for prior ν\nu, γ\gamma is the Gaussian measure γ(dx)=ϕ(x)dx\gamma({\rm d}x)=\phi(x){\rm d}x, ϕ(x)=ex2/2/2π\phi(x)=e^{-x^{2}/2}/\sqrt{2\pi}, γν\gamma\star\nu denotes the convolution of measures, and II denotes the Fisher information. For a probability measure νf(dx)=f(x)dx\nu_{f}({\rm d}x)=f(x){\rm d}x, with density ff with respect to the Lebesgue measure this is defined as

Further, if F{\cal F} is convex and weakly compact, the set of probability measures {γν:  νF}\{\gamma\star\nu:\;\nu\in{\cal F}\} is also convex and weakly compact. If follows from [Hub64, Theorem 4] that the inf\inf in Eq. (B.1) is achieved at a unique point ν=ν\nu=\nu^{*}. Hence

where we defined I=I(ε)=I(γν)I^{*}=I^{*}({\varepsilon})=I(\gamma\star\nu^{*}). The unique minimizer ν\nu^{*} is known as theleast favorable distribution. The minimax optimal denoiser (achieving the inf\inf over η\eta in Eq. (B.1)) is the posterior expectation with respect to the prior ν\nu^{*}.

Bickel and Collins [BC83, Theorem 1], prove that, under suitable assumption on the class F{\cal F}, the least favorable distribution is a mixture of point masses

where iαi=1\sum_{i}\alpha_{i}=1, αi>0\alpha_{i}>0 and the sequence {μi}iZ\{\mu_{i}\}_{i\in{\bf Z}} has no accumulation points except, possibly, at ±\pm\infty. As mentioned above, the minimax denoiser is the posterior expectation associate to the prior ν\nu^{*}. By Tweedie’s formula, this takes the form

where ψ\psi^{*} is the so-called score function

and ff^{*} is the density of νγ\nu^{*}\star\gamma.

Focusing now specifically on the class F=F1,ε{\cal F}={\cal F}_{1,{\varepsilon}}. This case is covered by the general theory of [BC83], and corresponds to their example (ii)(ii). Without loss of generality we can assume that the μi\mu_{i} are monotone increasing, with μi=μi\mu_{-i}=\mu_{i}, and that μ0=0\mu_{0}=0, with α0=(1ε)\alpha_{0}=(1-{\varepsilon}). A conjecture of Mallows [Mal78] states that in fact we may take

In other words, the conjectured least favorable distribution has the form of a two-sided geometric distribution on a scaled copy of Z{\bf Z}. While the conjecture has not been proved, Mallows [Mal78] provided an argument (based on an analogous problem in robust estimation [Hub64]) that suggesting that indeed it captures the correct tail behavior.

For estimating the minimax risk numerically, we chose a large parameter KK and assume a generalized “Mallows form” for i>K|i|>K. More precisely, we assume an equispaced grid, and geometrically decaying weights. This is a little more general than what Mallows proposed, having 3 total degrees of freedom (spacing, total weight and rate of decay), rather than two. For KiK-K\leq i\leq K, we allow the parameters αi\alpha_{i} and μi\mu_{i} to vary freely. In this way we obtained a parametric family νθ\nu_{\theta} of probability distributions, with parameter θ=((αi)i[K],(μi)i[K],c0,c1,λ)\theta=((\alpha_{i})_{i\in[K]},(\mu_{i})_{i\in[K]},c_{0},c_{1},\lambda), with

The quantity I+=I+(ε)I^{+}=I^{+}({\varepsilon}) can be estimated numerically, and provides an upper bound on II^{*}. We used up to K=50K=50 and checked that the resulting I+I^{+} is insensitive to this choice. Notice that choice of the Mallows form for i>Ki>K is immaterial for two reasons: (i)(i) As a consequence [BC83] and of the weak continuity of Fisher information, I+I^{+} should be insensitive to the tail behavior of the distribution FF; (ii)(ii) We are only using it to derive an upper bound.

In order to get a lower bound on II^{*}, we use Huber’s minimax theorem [Hub64, HR09], which implies that, for any ψ:RR\psi:{\bf R}\to{\bf R} differentiable in measure,

where in the supremum gg ranges over all densities of probability measures γν\gamma\star\nu with νF1,ε\nu\in{\cal F}_{1,{\varepsilon}}.

Let ν+\nu^{+} denote the probability measure corresponding to the optimum of the parametric optimization (B.3). Denote by ψ+\psi^{+} denote the corresponding score function

where f+=ν+γf^{+}=\nu^{+}\star\gamma. This corresponds to a denoiser η+(y)=yψ+(y)\eta^{+}(y)=y-\psi^{+}(y). Let g+g^{+} denote a maximizing density gg for J(ψ+,g)J(\psi^{+},g). By Huber’s theory, this can be chosen to be two-point mixture (1ε)δ0+εδμ+(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu^{+}} where μ+\mu^{+} is chosen to achieve the worst case value on the right side of (B.4) and set I=I(ε)=J(ψ+,g+)I_{-}=I_{-}({\varepsilon})=J(\psi^{+},g^{+}).

We have the bounds III+I_{-}\leq I^{*}\leq I^{+}. Numerically, we compute integrals and extrema over fine grids with at least 100100 samples per unit of range, getting not I+I^{+} and II_{-} but instead numerical approximations I~+\widetilde{I}^{+} and I~\widetilde{I}_{-}. Table 3 presents some information about numerical approximation results, which may help the reader assess its accuracy for small values of KK. Some minimizing distributions obtained in this way are shown in Figure 20; the mass points (μi)(\mu_{i}) are displayed in Figure 21.

Our numerical results, showing that I~I~+\widetilde{I}_{-}\approx\widetilde{I}_{+} allows us to infer that the Mallows form is approximately correct. The denoiser that we actually apply in our estimation and compressed sensing experiments is:

Appendix C Convergence properties of AMP

Throughout the paper we checked convergence of AMP by imposing a threshold on the reconstruction accuracy and the number of iterations. For instance, in the case of separable denoisers, cf. Section 2.4, we declared the reconstruction successful if

for a certain choice of tt and γ>0\gamma>0. In particular, the results presented correspond to γ=0.01\gamma=0.01 and t=300t=300.

It is natural to ask how to choose γ\gamma and tt, and whether different choices of γ\gamma and tt would lead to significantly different estimates of the phase transition boundary. It turns out that the empirical phase transition is fairly insensitive to these choices for the cases considered here, as soon as t100t\gtrsim 100 is sufficiently large and γ0.05\gamma\lesssim 0.05. This insensitivity is related to the convergence properties of AMP. Indeed both theory and empirical evidence [DMM09, DMM11] indicate exponential convergence. Namely, for all δ>M(εη)\delta>M({\varepsilon}|\eta), there exist dimension independent constants C=C(δ)C=C(\delta), b=b(δ)>0b=b(\delta)>0 such that, with high probability,

On the other hand, for δ<M(εη)\delta<M({\varepsilon}|\eta), x^tx022c(δ)n\|\widehat{x}^{t}-x_{0}\|_{2}^{2}\geq c(\delta)n, with high probability.

Figure 22 presents data that confirm this behavior (further numerical evidence can be found in [DMM09]). The data refer to simple sparse signals with ε=0.05{\varepsilon}=0.05, and soft thresholding denoising. The curves correspond to several values of δ\delta close to the predicted phase transition location δ=M(εη)0.0239\delta=M({\varepsilon}|\eta)\approx 0.0239. Notice the clear exponential decay of the error for δ>M(εη)\delta>M({\varepsilon}|\eta) and a large constant mean square error for δ<M(εη)\delta<M({\varepsilon}|\eta).

If the phase transition has to be determined with relative accuracy Δ\Delta, this suggests the rule of thumb exp{b(δ+Δ)t}γ\exp\{-b(\delta_{*}+\Delta)t\}\leq\gamma and c(δΔ)γc(\delta_{*}-\Delta)\geq\gamma. We verified that these conditions are satisfied by our choices of tt and γ\gamma.

Appendix D Calculation of minimax MSE for block soft thresholding

The argument is analogous to the one for the scalar case (corresponding to B=1B=1) treated in Appendix A. For μRB\mu\in{\bf R}^{B} and τR+\tau\in{\bf R}_{+}, define the risk at μ\mu as

where Xν{\bf X}\sim\nu and ZN(0,IB×B){\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B}). Since the two point mixtures are the extremal distributions in Fε,B{\cal F}_{{\varepsilon},B}, we have

By the definition of chi-square distribution, we have

It follows from the invariance of the distribution of Z{\bf Z} under rotations that R(μ;τ)R(\mu;\tau) only depends on μ\mu through its norm μ\|\mu\|. Further, as proved in Appendix I.1 R(μ;τ)R(\mu;\tau) is increasing in μ\|\mu\|, and

At this point the problem is reduced to a calculus exercise.

D.2 Proof of Lemma 3.3

In this appendix we consider the asymptotics for large block size BB\to\infty. It is easy to show that the minimax threshold level τ\tau must be of order B\sqrt{B}. By a compactness argument, we can assume τ=cB\tau=c\,\sqrt{B} for some cc to be determined. Define the risk as in Eq. (D.1) (note that this depends implicitely on BB) and the normalized risk as R~B(μ;τ)=R(μ;τ)/B\widetilde{R}_{B}(\mu;\tau)=R(\mu;\tau)/B. We claim that

Assuming these claim to hold, we have, by Eq. (D.2),

Calculus shows that the minimum is achieved at c=1εc^{*}=1-{\varepsilon}, whence

which coincides with the statement of Lemma 3.3.

In order to complete the proof, we have to prove claims (D.3) and (D.4). The second one is immediate because of Lemma I.1 that implies indeed R~B(;cB)=1+c2\widetilde{R}_{B}(\infty;c\sqrt{B})=1+c^{2}.

The limit (D.3) follows instead from the central limit theorem. Indeed, let XBX_{B} denote a central chi-squared with BB degrees of freedom. Its square root is the norm of a standard Gaussian random vector in dimension BB, and concentrates around B\sqrt{B}. Indeed by the central limit theorem 2(XBB)DN(0,1)\sqrt{2}(\sqrt{X_{B}}-\sqrt{B})\Rightarrow_{D}{\sf N}(0,1) as BB\rightarrow\infty. Therefore, we have

and the latter converges to (1c)+2(1-c)_{+}^{2} by dominated convergence.

Appendix E Non-convergence of state evolution

We begin by developing a lower bound that holds for all the denoisers η\eta studied in this paper. For notational simplicity, we consider in fact separable denoisers, but the result is easily see to hold in general, provided that the signal class is scale-invariant.

Let νε,μ\nu_{{\varepsilon},\mu} denote the mixture (1ε)δ0+εδμ(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu} where μR+{}\mu\in{\bf R}_{+}\cup\{\infty\} can be either finite or infinite. Define the risk at μ\mu as

where Xνε,μX\sim\nu_{{\varepsilon},\mu} is independent of ZN(0,1)Z\sim{\sf N}(0,1).

We say that the risk function RR is super-quadratic on [0,μ)[0,\mu_{*}) if, for any μ[0,μ)\mu\in[0,\mu_{*}),

The next result shows that superquadratic behavior of the risk function implies non convergence of state evolution for signal distribution νε,μ\nu_{{\varepsilon},\mu}.

(State Evolution Non-Convergence) Fix any τΘ\tau\in\Theta, and assume there exists μ0>0\mu_{0}>0 such that: (i)(i) The risk function R(μ;τ)R(\mu;\tau) is superquadratic on [0,μ)[0,\mu_{*}), (ii)(ii) R(μ)δR(\mu_{*})\geq\delta, and (iii)(iii) δε\delta\geq{\varepsilon}.

Let Ψ(m)=Ψ(m;δ,τ,νε,μ)\Psi(m)=\Psi(m;\delta,\tau,\nu_{{\varepsilon},\mu}). Then there is mfp>0m_{\rm fp}>0 such that

Define mfp=(μ/μ)2{m_{\rm fp}}=(\mu/\mu_{*})^{2} and assume mmfpm\geq{m_{\rm fp}}. This implies μ/mμ\mu/\sqrt{m}\leq\mu_{*}, and, since RR is superquadratic by assumption,

For firm and minimax denoisers η{ηfirm,ηall}\eta\in\{\eta^{firm},\eta^{all}\}, we took a fine grid of ε{\varepsilon} and at each fixed ε{\varepsilon} evaluated R(μ;τ)R(\mu;\tau) on a fine grid of μ\mu, checking the inequality (E.1). In the case of firm thresholding we used the minimax threshold values τ=τ(ε)\tau=\tau^{*}({\varepsilon}). We further used the least favorable μ\mu, μ=μ(ε)\mu=\mu^{*}({\varepsilon}). Sample results are presented in Figures 23 and 24.

These computations show that the risk function R(μ;τ(ε))R(\mu;\tau^{*}({\varepsilon})) is superquadratic on (0,μ(ε))(0,\mu^{*}({\varepsilon})).

Appendix F Monotone regression

where we recall that Δvi=vi+1vi\Delta v_{i}=v_{i+1}-v_{i} is the discrete derivative. (Of course this problem does not provide an algorithm since it requires to know Δμi\Delta\mu_{i}, but here we are interested in it only for analysis purposes.)

As tt\to\infty, all the constraints ΔvitΔμi\Delta v_{i}\geq-t\Delta\mu_{i} for which Δμi>0\Delta\mu_{i}>0 become irrelevant. We are naturally led to defining the following localized monotone regression problem

(Here and below omit the dependence of I0I_{0}, I+I_{+} on μ\mu.) Let ηlmono(z;I0)\eta^{lmono}(z;I_{0}) denote the solution of (Qlmono)(Q_{lmono}) with data z,I0z,I_{0}. The above discussion implies that, for z=ZN(0,IN×N)z={\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N}), we have the following limit in probability:

In words, the risk ‘at infinity’ of monotone regression is simply given by the local risk.

In order to conclude the proof, it is sufficient to show that Rloc(I0)R^{loc}(I_{0}) is given by the right-hand side of Eq. (4.4). It is easy to check that the problem (Qlmono)(Q_{lmono}) separates into independent optimization problem for each JkJ_{k}. Namely, for iJki\in J_{k}, viv_{i} can be found by solving the following smaller problem

where vJk=(vik+1,,vik+1)v_{J_{k}}=(v_{i_{k}+1},\dots,v_{i_{k+1}}) and, if the segment JkJ_{k} is a singleton, the constraint disappears. Let vJk(zJk)v_{J_{k}}(z_{J_{k}}) be the solution of this problem. Then,

F.2 Proof of Lemma 4.2

Throughout the proof, we denote by x=x(Z)=ηmono(z=Z)x=x({\bf Z})=\eta^{mono}(z={\bf Z}) the solution of the monotone regression problem

Clearly r(1)=1r(1)=1 since in this case there is no monotonicity constraint and the solution of the regression problem is simply x=zx=z. In order to prove Eq. (4.5), let kI+(x)k\in I_{+}(x) (i.e. an increase point: xk<xk+1x_{k}<x_{k+1}) and define r(k)RNr^{(k)}\in{\bf R}^{N} by letting ri(k)=1{i>k}r^{(k)}_{i}=1_{\{i>k\}}, and l(k)RNl^{(k)}\in{\bf R}^{N} by letting li(k)=1{ik}l^{(k)}_{i}=1_{\{i\leq k\}}. Then, x+ξr(k)MNx+\xi\,r^{(k)}\in{\cal M}_{N}, x+ξl(k)MNx+\xi\,l^{(k)}\in{\cal M}_{N} for all ξ\xi small enough. Hence, we must have x+ξr(k)z22xz22\|x+\xi\,r^{(k)}-z\|_{2}^{2}\geq\|x-z\|_{2}^{2}, and x+ξl(k)z22xz22\|x+\xi\,l^{(k)}-z\|_{2}^{2}\geq\|x-z\|_{2}^{2} for all ξ\xi small enough. Expanding to linear order in ξ\xi, we conclude that, for all kI+(x)k\in I_{+}(x):

Further, if r(0)r^{(0)} is the all 11 vector, x+ξr(0)MNx+\xi\,r^{(0)}\in{\cal M}_{N} for all ξR\xi\in{\bf R}. Minimizing with respect to ξ\xi, we get

By virtue of Eq. (F.1), and using the fact that xx is monotone, we have

It is then easy to check that, letting Ec{\cal E}^{c} denote the complement of E{\cal E},

Indeed define, for i[N]i\in[N], k(i)max{k:  k<i,  kI+(x)}k(i)\equiv\max\{k:\;k<i,\;k\in I_{+}(x)\} with, by convention, k(i)=0k(i)=0 if the set {k<i,  kI+(x)}\{k<i,\;k\in I_{+}(x)\} is empty. Then, by definition of I+(x)I_{+}(x),

where the first inequality follows by definition of the events E{\cal E}_{\emptyset}, E1{\cal E}_{1}, …EN1{\cal E}_{N-1} and the second since ik(i)+1i\geq k(i)+1. The inequality xi(6logN)/ix_{i}\geq-\sqrt{(6\log N)/i} follows essentially by the same argument. By union bound we therefore obtain

We can therefore bound the risk as follows

On the other hand it is easy to see that, defining Zmax=maxi[N]ZiZ_{\max}=\max_{i\in[N]}|Z_{i}|, we necessarily have xi(Z)Zmax|x_{i}({\bf Z})|\leq Z_{\max} for all i[N]i\in[N], whence

where the last inequality holds for N10N\geq 10. Using this bound in conjunction with Eq. (F.6), we finally get

Appendix G Further computational details

We record here a few details that have been omitted from the main text.

The AMP iteration, cf. Eqs. (1.8), (1.8), (1.9), requires to estimate the variance σt2\sigma_{t}^{2} of the effective observation at time tt. According to the the general theory of state evolution [DMM09, BM11a], the empirical distribution of the coordinates of ztz^{t} is asymptotically Gaussian with mean and variance σt2\sigma_{t}^{2}. This motivates the following estimator, first proposed in [DMM09]:

where Φ(z)\Phi(z) is the normal distribution function. This is known as the 2525% pseudo-variance in robust estimator and has the advantage of being insensitive to a small fraction of large outliers.

Computations were done in partly using Matlab, and partly through a Java program . In the spirit of reproducible research, a suite of Java classes that allow to repeat our simulations is available through an open code repository [DJM12].

The plots of minimax risk were obtained by evaluating numerically the expression in the main text. For separable and block-separable denoisers (with the exception of the global minimax denoiser ηall\eta^{all}), the integrals can be expressed in terms of the Gaussian distribution function or incomplete beta functions. For the global minimax denoiser, integration was performed numerically using the standard MAtlab routines.

Evaluation of the minimax risk required searching the least favorable distribution among two point mixtures of the form (1ε)δ0+δμ(1-{\varepsilon})\delta_{0}+\delta_{\mu}, in the separable case and (ε)δ0+εδμe1(-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu\,e_{1}} in the block separable one. Optimization over μR+\mu\in{\bf R}_{+} was performed by brute force search over a grid, with recursive refinement of the grid.

For the non-separable cases, the procedure for approximating the minimax MSE was explained Section 4 and 5.

Appendix H Finite-N𝑁N scaling and error analysis

The empirical phase transitions observed in this paper admit further analysis, to verify whether the following expected behavior take place, namely: (a)(a) the offsets tend towards zero with increasing NN; (b)(b) the steepness of the phase transition increases with increasing NN.

As described in Section 2.4, at each fixed value ε{\varepsilon} of the sparsity parameter, we gathered data at several different values of δ\delta, and obtained the empirical phase transition parameter PT^(N,ε,η)\widehat{{\rm PT}}(N,{\varepsilon},\eta), recorded as offset from prediction, so that PT^(N,ε,η)=0\widehat{{\rm PT}}(N,{\varepsilon},\eta)=0 means that the 50% success location fitted to the ε{\varepsilon}-fixed, δ\delta-varying dataset is exactly at the predicted location M(εη)M({\varepsilon}|\eta). Our analysis gave not only the empirical phase transition location, but also its formal standard error SE(PT^)SE(\widehat{{\rm PT}}). (Here we make explicit the dependence of PT^\widehat{{\rm PT}} on the specific denoiser.)

We fit a linear model to the dataset including all the phase transition results for soft and firm thresholding. We considered several exponents γ\gamma that might be describing the decay of the offset with increasing NN:

Table 4 shows that γ=1/3\gamma=1/3 provides an adequate description of the offsets, with an R2R^{2} exceeding 0.9950.995. A plot of raw PT^\widehat{{\rm PT}}’s versus the predictions of model (H.1) is given in Figure 25.

H.2 Transitions sharpen

In addition to an empirical phase transition parameter PT^(N,ε,η)\widehat{{\rm PT}}(N,{\varepsilon},\eta) we also fitted an empirical steepness parameter β^(N,ε,η)\widehat{\beta}(N,{\varepsilon},\eta), according to the logistic model:

where p^i\widehat{p}_{i} is the empirical success probability for the ii-th experiment.

We expect β^(N,ε,η)\widehat{\beta}(N,{\varepsilon},\eta) to grow with increasing NN, corresponding to increasingly abrupt transitions from complete failure at δPT^(N,ε,η)\delta\ll\widehat{{\rm PT}}(N,{\varepsilon},\eta) to complete success at δPT^(N,ε,η)\delta\gg\widehat{{\rm PT}}(N,{\varepsilon},\eta). In order to test this behavior, we fitted a linear model to the values of β^\widehat{\beta} computed for multiple values of NN, ε{\varepsilon}, and denoisers η\eta. We considered a range of powers γ~\widetilde{\gamma} that might be describing the growth of the steepness with increasing NN:

Table 5 shows that γ=1/2\gamma=1/2 provides an adequate description of the steepnesses, with an R2R^{2} exceeding 0.9990.999. A plot of raw β^\widehat{\beta}’s versus the predictions of model (H.2) is given in Figure 26.

Appendix I Further properties of the risk function

In this appendix we prove several useful properties of the risk function of the block soft and James-Stein denoisers. Throughout this section, we define the risk at μ\mu as

The argument η\eta will be droppend or replaced by the threshold level τ\tau whenever clear from the context. Since we only consider denoisers that are equivariant under rotation, R(μ;η)R(\mu;\eta) depends on the vector μ\mu only through its norm μ2\|\mu\|_{2}. With a slight abuse of notation, we will use μ\mu to denote the norm as well. In other words, the reader can assume μ=μe1\mu=\mu\,e_{1}.

In this section we consider the block soft denoiser ηsoft(;τ):RBRB\eta^{soft}(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}. We will write R(μ;τ)R(\mu;\tau) for R(μ;ηsoft(;τ))R(\mu;\eta^{soft}(\,\cdot\,;\tau)).

For block soft thresholding the risk function μR(μτ)\mu\mapsto R(\mu\tau) has these properties:

Let fξ,d(w)f_{\xi,d}(w) be the density function of S2χd2(ξ)S^{2}\sim\chi_{d}^{2}(\xi). This satisfies

Applying the first identity, integrating by parts, canceling terms, and then using the second identity, we obtain

For the upper bound (I.3), use the Poisson mixture representation fξ,d+2(w)=j=0pξ/2(j)fd+2+2j(w)f_{\xi,d+2}(w)=\sum_{j=0}^{\infty}p_{\xi/2}(j)f_{d+2+2j}(w) with pλ(j)λjeλ/j!p_{\lambda}(j)\equiv\lambda^{j}e^{-\lambda}/j! and an identity for the central χ2\chi^{2} density family to obtain

and so to conclude that fξ,d+2(w)(w/d)fξ,d(w)f_{\xi,d+2}(w)\leq(w/d)f_{\xi,d}(w). Hence the second term in (I.6) is bounded by (11/τ)τ2fξ,d(w)dw(1-1/\tau)\int_{\tau^{2}}^{\infty}f_{\xi,d}(w)\,{\rm d}w, whence follows the conclusion (R/μ2)1(\partial R/\partial\mu^{2})\leq 1. Property (I.4) is immediate from (I.2) and (I.3) and the large-μ\mu limit of RR.

To obtain the bound in (I.5), write the risk function using the unbiased risk formula as

I.2 Positive-part James-Stein denoiser

As in the previous section, we set ξμ2\xi\equiv\mu^{2} and d=Bd=B. HEre we consider the James-Stein denoiser ηJS:RdRd\eta^{JS}:{\bf R}^{d}\to{\bf R}^{d} defined by

and we will write R(μ)=R(μ;ηJS)R(\mu)=R(\mu;\eta^{JS}).

We again let S2=μ+Z22S^{2}=\|\mu+{\bf Z}\|^{2}_{2}, with ZN(0,Id×d){\bf Z}\sim{\sf N}(0,{\bf I}_{d\times d}). We have the noncentral chi-squared distribution Sχd2(ξ)S\sim\chi_{d}^{2}(\xi) with noncentrality ξ\xi. Stein’s unbiased estimate of risk is

We will first develop an approximation of the risk at that was used in Sections (3.2) and (3.3).

Let fd(w)f_{d}(w) denote the density of a central chi-squared with dd degrees of freedom. We then have the density satisfies

Using the identity wfd2(w)=(d2)fd(w)wf_{d-2}(w)=(d-2)f_{d}(w) and letting D=d2D=d-2, we can rewrite the last expression as

By a standard tightness argument, this implies that that

An Edgeworth series leads to the expansion R(0)=1+R1d1/2+Θ(d2)R(0)=1+R_{1}\,d^{-1/2}+\Theta(d^{-2}). Indeed, one can integrate the expression (I.8) numerically, and the numerical values are consistent with R(0)1+0.752/dR(0)\approx 1+0.752/\sqrt{d} for large dd.

I.2.2 Monotonicity of R​(μ)𝑅𝜇R(\mu).

We use the variation-diminishing version of total positivity, developed in [BJM81].

The non-central χ2\chi^{2} family is strictly variation diminishing of all orders

For g:[0,)Rg:[0,\infty)\to{\bf R}, let S(g)S^{-}(g) and S+(g)S^{+}(g) denote the number of sign changes and strict sign changes of gg, and let IS(g)IS(g) denote the sign of g(0)g(0) (assuming that g(0)0g(0)\neq 0, the more general definition being given in [BJM81]). Further define the function γ:[0,)R\gamma:[0,\infty)\mapsto{\bf R} by

where fd,ξ()f_{d,\xi}(\,\cdot\,) is the density of the noncentral chi-square with dd degrees of freedom and noncentrality ξ\xi. By the SVR property we have that that S+(γ)S(g)S^{+}(\gamma)\leq S^{-}(g) and that if S+(γ)=S(g)S^{+}(\gamma)=S^{-}(g) then necessarily IS(γ)=IS(g)IS(\gamma)=IS(g). In particular this implies that, if gg is strictly increasing, then γ\gamma is strictly increasing as well. Indeed this follows by letting ga(w)=g(w)ag_{a}(w)=g(w)-a for aRa\in{\bf R} and

If gg is strictly increasing, then S(ga)1S^{-}(g_{a})\leq 1 for all aRa\in{\bf R}, whence S+(γa)1S^{+}(\gamma_{a})\leq 1 for all aa, with IS(γ)=IS(g)IS(\gamma)=IS(g) whenever S+(γa)=1S^{+}(\gamma_{a})=1. This in turns implies that γ\gamma is increasing.

We now verify that the risk R(μ)R(\mu) of ηJS\eta^{JS} is monotone increasing in ξ=μ2[0,)\xi=\|\mu\|^{2}\in[0,\infty). Let

and define γ(ξ)\gamma(\xi) using Eq. (I.9). Note that gg is strictly increasing and hence ξγ(ξ)\xi\mapsto\gamma(\xi) is increasing as well by the above argument. But U(S)U(S) is Stein’s unbiased risk estimator and hence R(μ)=γ(ξ=mu2)R(\mu)=\gamma(\xi=\|mu\|^{2}), which implies the claim.

I.3 Proof of Lemma 6.2

Since any probability distribution is written as a convex combination of point masses, it is sufficient to prove the claim for ν=δμ\nu=\delta_{\mu}. In this case, using the scaling relation (1.3), we have

with R()R(\,\cdot\,) the risk function. Therefore the state evolution mapping is starshaped for all distributions ν\nu if and only if μR(μ)\mu\mapsto R(\mu) is monotone increasing.

The monotonicity of the risk function was proved in [DMM09] for soft thresholding and positive soft thresholding. It is proved in Section 4 and 5 for monotone regression and total variation denoising. Finally, it is proved in Section I.1 and I.2 for block soft and James-Stein denoisers.

References