Accurate Prediction of Phase Transitions in Compressed Sensing via a Connection to Minimax Denoising

David Donoho, Iain Johnstone, Andrea Montanari

Introduction

In the noiseless compressed sensing problem, we are given a collection of linear measurements of an unknown vector $x_{0}$ :

$$y=Ax_{0}\,. \tag{1.1}$$

Here the measurement matrix $A$ is $n$ by $N$ , $n<N$ , and the $N$ -vector $x_{0}$ is the object we wish to recover. Both $y$ and $A$ are known, while $x_{0}$ is unknown and we seek to recover an approximation to $x_{0}$ .

Since $n<N$ , the equations are underdetermined. It seems hopeless to recover $x_{0}$ in general, but in compressed sensing one also assumes that the object is sparse in the appropriate sense. Suppose that the object is known to be $k$ -sparse, i.e. to have $k$ nonzero entries. If the problem dimensions $(k,n,N)$ are large, many recovery algorithms exhibit the phenomenon of phase transition.

Explicitly, let ${\varepsilon}=k/N$ and $\delta=n/N$ denote the sparsity and undersampling parameters, respectively. Hence $({\varepsilon},\delta)\in[0,1]^{2}$ defines a phase space for the different kinds of limiting situations we may encounter as $(k,n,N)$ grow large. For a variety of algorithms and Gaussian matrices $A$ with iid entries, one finds that this phase space can be partitioned into two phases: “success” and “failure”. Namely, for a given algorithm ${\cal A}$ and given sparsity fraction ${\varepsilon}$ , there exists a critical fraction $\delta({\varepsilon}|{\cal A})$ such that if the sampling rate $\delta$ is larger than the critical value, $\delta>\delta({\varepsilon}|{\cal A})$ , then the algorithm is successful in recovering the underlying object $x_{0}$ with high probability, while if $\delta<\delta({\varepsilon}|{\cal A})$ the algorithm is unsuccessful, also with high probability. (Throughout the paper, we will write that an event holds with high probability (w.h.p.) if its probability converges to $1$ in the large system limit $N,n\to\infty$ with $\delta=n/N$ and ${\varepsilon}=k/N$ fixed.) In particular, $\delta({\varepsilon}|{\cal A})<1$ means that it is indeed possible to undersample and still recover the unknown signal. In fact $\delta({\varepsilon}|{\cal A})$ shows precisely the limits of allowable undersampling. By now a large amount of empirical and theoretical knowledge has been compiled about the phase transitions exhibited by different algorithms: we refer the reader to [DT10b, Don06, DT09a, XH10, BCT11, Sto10, KWT09, DMM09, DMM11, Wai09]. In a parallel line of work, a number of sufficient conditions under which undersampling is possible using deterministic matrices have been studied, see e.g. [CT05, BRT09, BGI+08, Can06].

It is however fair to say that research has focused so far on ‘unstructured’ notions of sparsity, whereby $k$ simply counts the number of non-zero entries in $x_{0}$ . (We refer to Section 1.7 for an overview of related literature.) On the other hand, applications naturally lead to ‘structured’ notions of sparsity. This paper applies an algorithmic framework, Approximate Message Passing (AMP), to construct specific algorithms applicable to a variety of compressed sensing settings, including block and structured sparsity, convex and nonconvex penalization, and develops a single unifying formula that, specialized to each instance, gives the actual phase transition that we observe in practice. To give a preview of our results, we first recall some facts about statistical decision theory and AMP reconstruction. For the sake of illustration, the classical case of simple sparse vectors will be used as a running example.

Throughout this paper, we will consider estimation of unknown structured signals $x\in{\bf R}^{N}$ from a minimax point of view. Various notions of structures can be formalized by considering a family ${\cal F}_{N}$ of probability measures over ${\bf R}^{N}$ . One such probability measure will be denoted by $\nu_{N}\in{\cal F}_{N}$ and a signal with distribution $\nu_{N}$ will often be denoted as ${\bf X}\sim\nu_{N}$ .

The family ${\cal F}_{N}$ will typically include degenerate distributions, i.e. point masses $\nu_{N}=\delta_{x_{0}}$ for some $x_{0}\in{\bf R}^{N}$ .

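The case of simple sparse vectors corresponds to the family

$${\cal F}_{N,{\varepsilon}}\equiv\big\{\nu_{N}\in{\cal P}({\bf R}^{N}):\;{\mathbb E}_{\nu_{N}}\{\|{\bf X}\|_{0}\}\leq N{\varepsilon}\big\}\,, \tag{1.2}$$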

where ${\cal P}({\bf R}^{N})$ is the set of Borel probability measures on ${\bf R}^{N}$ and, as usual, $\|v\|_{0}$ denotes the number of nonzero entries of the vector $v$ . $\blacksquare$

As exemplified in this case ${\cal F}_{N}$ is often indexed by a sparsity parameter ${\varepsilon}$ , with $N{\varepsilon}$ corresponding to the number of non-zero entries. We will sometimes use the notation ${\cal F}_{N,{\varepsilon}}$ to indicate this dependency, also beyond the last example. Two further common properties that will always hold unless otherwise stated are the following.

Nestedness. If ${\varepsilon}_{1}\leq{\varepsilon}_{2}$ then ${\cal F}_{N,{\varepsilon}_{1}}\subseteq{\cal F}_{N,{\varepsilon}_{2}}$ .

Scale invariance. If $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ then any scaled version of $\nu_{N}$ (defined by letting $\nu^{a}_{N}(B)=\nu_{N}(aB)$ for some $a>0$ ) is also in ${\cal F}_{N,{\varepsilon}}$ .

2 Denoising and minimax MSE

The denoising problem requires reconstructing a signal $x\in{\bf R}^{N}$ from observations ${\bf Y}=x+{\bf Z}$ whereby ${\bf Z}\sim{\sf N}(0,\sigma^{2}{\bf I}_{N\times N})$ is a noise vector of known variance. (Here and below ${\bf I}_{m\times m}$ denotes the identity matrix in $m$ dimensions.) A denoiser is a mapping

$$\eta(\,\cdot\,;\tau,\sigma):{\bf R}^{N}\to{\bf R}^{N}$$

that returns an estimate of $x$ when applied to observations $y={\bf Y}$ . The denoiser is parametrized by the noise scale $\sigma$ and additional tuning parameters $\tau\in\Theta$ . Often denoisers have the property $\|\eta(y;\tau,\sigma)\|_{2}\leq\|y\|_{2}$ and are hence called ‘shrinkers’. We will often have $\Theta={\bf R}_{+}$ , i.e. the denoiser depends on a single non-negative parameter, but more complex choices of the parameter space $\Theta$ fit in the formalism as well.

Following the minimax formulation in the previous section, we evaluate denoisers on signals ${\bf X}\sim\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ , for a specific class of distributions ${\cal F}_{N,{\varepsilon}}$ . Because of the scale invariance property of ${\cal F}_{N,{\varepsilon}}$ , it is sufficient to consider scale invariant denoisers:

$$\eta(y;\tau,\sigma)=\sigma\,\eta\Big(\frac{y}{\sigma};\tau,1\Big)\,. \tag{1.3}$$

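Hence we omit the last argument when $\sigma=1$ . We evaluate a denoiser $\eta$ through its minimax mean square error (MSE) per coordinate

$$M({\cal F}_{N,{\varepsilon}}|\eta)=\inf_{\tau\in\Theta}\;\sup_{\nu_{N}\in{\cal F}_{N,{\varepsilon}}}\frac{1}{N}\,{\mathbb E}\big\{\|{\bf X}-\eta({\bf Y};\tau)\|_{2}^{2}\big\}\,,$$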

where expectation is taken with respect to ${\bf X}\sim\nu_{N}$ and ${\bf Y}={\bf X}+{\bf Z}$ , ${\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ . In words, we tune the denoiser optimally to control the (per-coordinate) mean square error for typical signals from even the most unfavorable choice within our class ${\cal F}_{N,{\varepsilon}}$ .

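We say that a denoiser is separable if, for $v=(v_{1},\dots,v_{N})\in{\bf R}^{N}$ , we have

$$\eta(v;\tau)=\big(\eta(v_{1};\tau),\dots,\eta(v_{N};\tau)\big)\,, \tag{1.6}$$

where, with a slight abuse of notation, $\eta(\,\cdot\,;\tau)$ on the right-hand side denotes a scalar function acting on a single coordinate.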

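A well studied denoiser is coordinatewise soft-thresholding, which we will denote by $\eta^{soft}$ . This is a separable denoiser with a single parameter $\tau\in\Theta={\bf R}_{+}$ (the threshold). On each coordinate $y\in{\bf R}$ this acts as

$$\eta^{soft}(y;\tau)=\begin{cases}y-\tau&\text{if }y>\tau,\\ 0&\text{if }-\tau\leq y\leq\tau,\\ y+\tau&\text{if }y<-\tau.\end{cases}$$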

Soft thresholding is well suited for sparse signals from the class ${\cal F}_{N,{\varepsilon}}$ defined in Eq. (1.2). It turns out that the resulting minimax MSE $M({\cal F}_{N,{\varepsilon}}|\eta)$ can be characterized in terms of a scalar estimation problem, namely for all $N$ , $M({\varepsilon}|\eta^{soft})=M({\cal F}_{N,{\varepsilon}}|\eta^{soft})=M({\cal F}_{1,{\varepsilon}}|\eta^{soft})$ . Explicitly, all these quantities are given by

$$M({\varepsilon}|{\rm Soft})=\inf_{\tau\in{\bf R}_{+}}\;\sup_{\nu\in{\cal F}_{1,{\varepsilon}}}{\mathbb E}\big\{[X-\eta^{soft}(X+Z;\tau)]^{2}\big\}\,,$$

where expectation is taken with respect to $X\sim\nu$ and $Z\sim{\sf N}(0,1)$ independent of $X$ . We refer to [DJ94, DMM09, DMM11] for an explicit characterization of this quantity (a summary being provided in Section 2). In particular $M({\varepsilon}|{\rm Soft})$ can be explicitly evaluated. $\blacksquare$
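As a numerical illustration of this scalar problem, the following is a minimal sketch that evaluates the minimax MSE of soft thresholding. It relies on an assumption not spelled out in this section but standard in the cited literature: the worst case over ${\cal F}_{1,{\varepsilon}}$ is approached by placing mass $1-{\varepsilon}$ at $0$ and mass ${\varepsilon}$ at $\pm\infty$, where the risk tends to $1+\tau^{2}$. All function names are hypothetical.

```python
import math


def soft_threshold(y, tau):
    """Scalar soft thresholding: move y toward zero by tau, clipping at zero."""
    if y > tau:
        return y - tau
    if y < -tau:
        return y + tau
    return 0.0


def std_normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def worst_case_risk_soft(eps, tau):
    """Worst-case scalar MSE of soft thresholding at threshold tau.

    Assumption: mass (1 - eps) sits at 0 and mass eps escapes to +/- infinity,
    where the risk of soft thresholding tends to 1 + tau^2.
    """
    # E[eta(Z; tau)^2] for Z ~ N(0, 1): the risk when the true value is 0.
    risk_at_zero = 2.0 * ((1.0 + tau * tau) * std_normal_cdf(-tau)
                          - tau * std_normal_pdf(tau))
    return eps * (1.0 + tau * tau) + (1.0 - eps) * risk_at_zero


def minimax_mse_soft(eps, tau_max=10.0, tol=1e-8):
    """Minimize the worst-case risk over tau in [0, tau_max].

    The objective is convex in tau, so golden-section search applies.
    Returns the pair (minimax risk, minimizing threshold).
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, tau_max
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    fa, fb = worst_case_risk_soft(eps, a), worst_case_risk_soft(eps, b)
    while hi - lo > tol:
        if fa < fb:
            hi, b, fb = b, a, fa
            a = hi - inv_phi * (hi - lo)
            fa = worst_case_risk_soft(eps, a)
        else:
            lo, a, fa = a, b, fb
            b = lo + inv_phi * (hi - lo)
            fb = worst_case_risk_soft(eps, b)
    tau = 0.5 * (lo + hi)
    return worst_case_risk_soft(eps, tau), tau
```

For instance, `minimax_mse_soft(0.1)` returns a risk of roughly 0.33, attained at a threshold slightly above 1.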

In several other examples $M({\cal F}_{N,{\varepsilon}}|\eta)$ has been explicitly evaluated (see [DMM09], Supplementary Information).

In this paper we will give several other calculations of $M({\cal F}_{N,{\varepsilon}}|\eta)$ , for signal structures and denoisers going considerably beyond these examples.

3 Compressed sensing and AMP reconstruction

Consider now the noiseless compressed sensing problem, i.e. the problem of recovering a signal $x_{0}\in{\bf R}^{N}$ from $n<N$ linear observations $y=Ax_{0}$ , cf. Eq. (1.1). The key intuition is that this can be done by exploiting the structure of $x_{0}$ , sparsity being a special example. Approximate Message Passing (AMP) is an iterative scheme that makes it possible to exploit richer types of structure in a flexible way. Given a denoiser $\eta(\,\cdot\,;\tau,\sigma):{\bf R}^{N}\to{\bf R}^{N}$ that is well suited for reconstructing $x_{0}$ from observations $x_{0}+{\bf Z}$ , the AMP framework turns it into a scheme for solving the compressed sensing problem.

The AMP iteration starts from $x^{0}=0$ , and proceeds for iterations $t=1$ , $2$ , … by maintaining a current reconstruction $x^{t}\in{\bf R}^{N}$ and a current working residual $z^{t}\in{\bf R}^{n}$ , and adjusting these iteratively. At iteration $t$ , it forms a vector of current pseudo-data $y^{t}=x^{t}+A^{{\sf T}}z^{t}$ and the next iteration’s estimate is obtained by applying $\eta$ to the current pseudo-data:

$$y^{t}=x^{t}+A^{{\sf T}}z^{t}\,, \tag{1.7}$$
$$x^{t+1}=\eta(y^{t};\tau,\sigma_{t})\,, \tag{1.8}$$
$$z^{t+1}=y-Ax^{t+1}+{\sf b}_{t}\,z^{t}\,. \tag{1.9}$$

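Here ${\sf b}_{t}$ is a scalar determined by

$${\sf b}_{t}=\frac{1}{n}\,{\rm div}\,\eta(y^{t};\tau,\sigma_{t})\equiv\frac{1}{n}\sum_{i=1}^{N}\frac{\partial\eta_{i}}{\partial y_{i}}(y^{t};\tau,\sigma_{t})\,.$$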

The rationale for this specific choice of ${\sf b}_{t}$ is discussed in [DMM09, DMM10, BM11a]: a justification goes beyond the scope of the present paper. The parameter $\sigma_{t}$ can be interpreted as the noise standard deviation for the pseudo-data $y^{t}$ . This can be estimated from $y^{t}$ or $z^{t}$ as explained in Appendix G.
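To make the iteration concrete, here is a minimal sketch of AMP with soft thresholding in plain Python. Several choices are assumptions made for illustration only and are not taken from this paper: the residual is initialized as $z^{0}=y$, the noise level is estimated as $\sigma_{t}=\|z^{t}\|_{2}/\sqrt{n}$ (one common choice; the paper's own estimator is in Appendix G), the tuning parameter $\tau=1.5$ is hand-picked rather than minimax, and $A$ has iid ${\sf N}(0,1/n)$ entries. The names `amp_soft` and `demo` are hypothetical.

```python
import math
import random


def soft_threshold_vec(v, theta):
    """Coordinatewise soft thresholding at level theta."""
    return [math.copysign(max(abs(vi) - theta, 0.0), vi) for vi in v]


def amp_soft(A, y, tau, iters=50):
    """AMP with a soft-thresholding denoiser (illustrative sketch).

    A is a list of n rows of length N, y has length n.
    """
    n, N = len(A), len(A[0])
    x = [0.0] * N   # x^0 = 0
    z = list(y)     # assumed initialization z^0 = y
    for _ in range(iters):
        # Pseudo-data y^t = x^t + A^T z^t.
        pseudo = [x[j] + sum(A[i][j] * z[i] for i in range(n)) for j in range(N)]
        # Assumed estimate of the pseudo-data noise level sigma_t.
        sigma = math.sqrt(sum(zi * zi for zi in z) / n)
        # Scale-invariant denoiser: soft threshold at tau * sigma_t.
        x = soft_threshold_vec(pseudo, tau * sigma)
        # b_t = (1/n) div eta = (number of nonzero coordinates) / n.
        b = sum(1 for xj in x if xj != 0.0) / n
        Ax = [sum(A[i][j] * x[j] for j in range(N)) for i in range(n)]
        z = [y[i] - Ax[i] + b * z[i] for i in range(n)]
    return x


def demo(N=200, n=100, k=10, tau=1.5, seed=0):
    """Relative squared error of AMP on one random k-sparse instance."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(N)] for _ in range(n)]
    x0 = [0.0] * N
    for j in rng.sample(range(N), k):
        x0[j] = rng.gauss(0.0, 1.0)
    y = [sum(A[i][j] * x0[j] for j in range(N)) for i in range(n)]
    x_hat = amp_soft(A, y, tau)
    err = sum((a - b) ** 2 for a, b in zip(x_hat, x0))
    return err / sum(b * b for b in x0)
```

The default instance in `demo` has $\delta=0.5$ and ${\varepsilon}=0.05$, which lies well inside the success region for soft thresholding, so the returned relative error is expected to be small.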

Conceptually, AMP constructs an artificial denoising problem at each iteration and solves it using the denoising defined by $\eta$ . In other words, it solves a compressed sensing problem by successive denoising. For the purpose of this paper, this description should be sufficient, save for two remarks.

First, the specifics of the construction are absolutely crucial for the results of this paper. These are embedded in the specification of the scale factors ${\sf b}_{t}$ and $\sigma_{t}$ .

Second, the above algorithm framework was originally proposed in [DMM09, DMM10] in the case of a separable denoiser $\eta$ , i.e. a denoiser acting independently on each coordinate. In those papers the algorithm was derived by constructing a proper belief propagation message passing algorithm, and then obtaining the above algorithm as a first-order approximation. Specific separable denoisers corresponded to different choices of the prior in belief propagation.

A central point of this paper is that the form of the algorithm (1.7), (1.8), (1.9) is really more general and can be used in settings outside the original definition.

4 Phase transition for AMP

A recurring property of AMP algorithms is that they undergo a phase transition. When the undersampling ratio $\delta$ decreases below a certain threshold (that depends on the signal class ${\cal F}_{N,{\varepsilon}}$ and the denoiser $\eta$ ), the algorithm behavior changes from being successful most of the time to failing most of the time. In order to formalize this notion, we introduce the following terminology.

We say that AMP succeeds with high probability for the signal class ${\cal F}_{N,{\varepsilon}}$ and denoising procedure $\eta$ if there exists a choice of the tuning parameter $\tau\in\Theta$ such that the following happens. For each $\xi>0$ , there exists a function $o(t)$ with $\lim_{t\to\infty}o(t)=0$ such that, for any $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ ,

Here probability is taken with respect to $x_{0}={\bf X}\sim\nu_{N}$ and the sensing matrix $A$ . Further, the limit $N\to\infty$ is taken with $n/N\to\delta$ .

Conversely, we say that AMP fails with high probability for the signal class ${\cal F}_{N,{\varepsilon}}$ and denoising procedure $\eta$ if for any $\tau\in\Theta$ the following happens. There exists $\xi>0$ and a sequence $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ such that, for all $t\geq 0$

Our main result is the following general relation between denoising and compressed sensing.

Phase Transition Formula for AMP. Consider compressed sensing reconstruction over the signal class ${\cal F}_{N,{\varepsilon}}$ , using AMP with the denoiser $\eta$ . Denote by $M({\varepsilon}|\eta)$ the asymptotic minimax MSE per coordinate using denoiser $\eta$ .

Then AMP succeeds with high probability if

$$\delta>M({\varepsilon}|\eta)\,. \tag{1.13}$$

Conversely, AMP fails with high probability for $\delta<M({\varepsilon}|\eta)$ .

Let ${\cal F}_{N,{\varepsilon}}$ be the class of signals with at most $N{\varepsilon}$ non-zero entries (in expectation) and consider AMP with soft thresholding $\eta^{soft}(\,\cdot\,;\tau)$ . Then the above formula states that reconstruction will succeed if $\delta>M({\varepsilon}|{\rm Soft})$ and fail for $\delta<M({\varepsilon}|{\rm Soft})$ . This result was first shown in [DMM09] to follow from state evolution. State evolution was subsequently established as a rigorous tool in [BM11a].

The same paper [DMM09] studied AMP with positive soft thresholding and showed that it succeeds for $\delta>M({\varepsilon}|{\rm SoftPos})$ , as well as AMP with capping, proving that it succeeds for $\delta>M({\varepsilon}|{\rm Cap})$ .

Appendix A spells out how these existing results fall under the aegis of Eq. (1.13). $\blacksquare$

Comparison to $(\rho,\delta)$ phase diagrams. In prior literature on phase transitions in compressed sensing, [DT10b, Don06, DT09a, BCT11, DMM09, DMM11], the authors considered a different phase diagram, based on variables $\delta$ and $\rho={\varepsilon}/\delta$ . The relation ${\varepsilon}=\rho\delta$ makes for a 1-1 relationship between the diagrams, so all information in the two diagrams can be presented in either format.
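The change of coordinates can be illustrated with a short sketch that maps the predicted transition $\delta=M({\varepsilon}|{\rm Soft})$ into the $(\rho,\delta)$ plane by solving $M(\rho\delta|{\rm Soft})=\delta$ for $\rho$. The closed-form worst-case risk of soft thresholding used here is an assumption taken from the standard literature rather than from this section, and the function names are hypothetical.

```python
import math


def minimax_mse_soft(eps):
    """Minimax MSE of scalar soft thresholding, by a grid search over tau.

    Assumes the worst-case risk eps*(1+tau^2) + (1-eps)*E[eta(Z;tau)^2].
    """
    def risk(tau):
        pdf = math.exp(-0.5 * tau * tau) / math.sqrt(2.0 * math.pi)
        upper_tail = 0.5 * math.erfc(tau / math.sqrt(2.0))  # P(Z > tau)
        at_zero = 2.0 * ((1.0 + tau * tau) * upper_tail - tau * pdf)
        return eps * (1.0 + tau * tau) + (1.0 - eps) * at_zero
    return min(risk(0.001 * i) for i in range(6001))


def rho_to_eps(rho, delta):
    """Map (rho, delta) coordinates to the sparsity fraction eps = rho * delta."""
    return rho * delta


def eps_to_rho(eps, delta):
    """Inverse map rho = eps / delta."""
    return eps / delta


def critical_rho(delta, tol=1e-6):
    """Bisection for the rho at which M(rho * delta | Soft) = delta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if minimax_mse_soft(rho_to_eps(mid, delta)) < delta:
            lo = mid   # predicted success: the transition lies at larger rho
        else:
            hi = mid   # predicted failure: the transition lies at smaller rho
    return 0.5 * (lo + hi)
```

Bisection is valid here because the minimax MSE is increasing in ${\varepsilon}$ and strictly exceeds ${\varepsilon}$, so the predicted outcome flips exactly once as $\rho$ ranges over $(0,1)$.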

5 This Paper

Our aim in this paper is to show that formula (1.13) is correct in settings extending far beyond the three cases just mentioned in Example 1.5. We lay out several denoising problems, and in each one verify the general formula. This requires in each case $(a)$ calculating the minimax MSE for a problem of statistical decision theory; $(b)$ implementing AMP for compressed sensing with the given denoising family; and $(c)$ verifying empirically that the phase transition does indeed occur at the precise sparsity/undersampling tradeoff indicated by the formula.

In particular, we consider the following denoising tasks, and corresponding compressed sensing problems.

Again we consider the class of sparse vectors ${\cal F}_{N,{\varepsilon}}$ but instead of soft-thresholding, we use the firm shrinkage denoiser $\eta^{firm}(\,\cdot\,;\tau)$ . This is again a separable denoiser, now with two tuning parameters $\tau=(\tau_{1},\tau_{2})$ , where $\tau\in\Theta\equiv\{(\tau_{1},\tau_{2}):\;0\leq\tau_{1}<\tau_{2}<\infty\}$ . It acts on each coordinate by setting $\eta^{firm}(y;\tau)=0$ for $|y|<\tau_{1}$ , $\eta^{firm}(y;\tau)=y$ for $|y|>\tau_{2}$ and interpolating linearly in between.

Denoting by $M({\varepsilon}|{\rm Firm})$ the associated asymptotic minimax MSE, we will show that $M({\varepsilon}|{\rm Firm})<M({\varepsilon}|{\rm Soft})$ strictly. By verifying the general formula, we show that the phase transition curve for optimally-tuned AMP firm shrinkage is slightly better than the phase transition for optimally tuned AMP soft shrinkage.

For the same class of sparse vectors ${\cal F}_{N,{\varepsilon}}$ , we consider the separable denoiser $\eta$ that applies the minimax shrinker coordinatewise. In other words, we are implicitly optimizing the mean square error over $\Theta\equiv\{\mbox{ all scalar nonlinearities }\}$ . We calculate the minimax MSE function $M({\varepsilon}|{\rm Minimax})$ , and show that $M({\varepsilon}|{\rm Minimax})<M({\varepsilon}|{\rm Firm})$ strictly. By verifying the general formula we show that the phase transition curve for AMP minimax shrinkage is slightly better than the phase transition for both AMP soft and firm shrinkage.

Here we consider the class of block sparse vectors ${\cal F}_{N,{\varepsilon},B}$ (see Section 3 for a formal definition). We use two block-separable denoisers: either block soft thresholding (for block length $B$ , the $B$ -variate nonlinearity obeys $\eta_{B,\lambda}(y)=y\cdot(1-\lambda/\|y\|_{2})_{+}$ ) or the block James-Stein denoiser. We will compute the minimax MSE function $M_{B}({\varepsilon}|{\rm BlockSoft})$ , and bound the minimax MSE function $M_{B}({\varepsilon}|{\rm JamesStein})$ . We will verify that the phase transition curve for optimally-tuned AMP with block-separable denoisers follows the general formula.
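As an illustration of a block-separable denoiser, the sketch below applies block soft thresholding in its standard form $y\cdot(1-\lambda/\|y\|_{2})_{+}$ to consecutive blocks of length $B$. The partition into consecutive blocks and the function name are assumptions made for this example.

```python
import math


def block_soft_threshold(y, lam, B):
    """Apply block soft thresholding to consecutive blocks of length B.

    Each block v is mapped to v * max(1 - lam / ||v||_2, 0), so blocks with
    Euclidean norm at most lam are set to zero and the others are shrunk.
    """
    if B <= 0 or len(y) % B != 0:
        raise ValueError("len(y) must be a positive multiple of B")
    out = []
    for start in range(0, len(y), B):
        block = y[start:start + B]
        norm = math.sqrt(sum(v * v for v in block))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0.0 else 0.0
        out.extend(scale * v for v in block)
    return out
```

With $B=1$ this reduces to ordinary scalar soft thresholding at threshold $\lambda$.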

Notice that, as demonstrated numerically in [DMM09], and proved in [BM11b] in the case of Gaussian sensing matrices, soft-thresholding AMP reconstruction coincides with LASSO reconstruction (in the large system limit). By the above results, firm-shrinkage AMP and minimax AMP both outperform LASSO reconstruction. Correspondingly, it can be argued that block soft thresholding AMP coincides asymptotically with group LASSO, and hence James-Stein AMP outperforms the latter.

In all of the above examples, the denoisers are coordinatewise or at least blockwise separable. We next consider examples where the denoiser has more subtle structure. We find that formula (1.13) applies more generally.

We consider the class ${\cal F}_{N,{\varepsilon},{\rm mono}}$ of vectors that are monotone with at most $N{\varepsilon}$ points of increase. As denoiser, we use the least-squares projection $\eta$ onto the cone of monotone increasing functions.

We consider the class ${\cal F}_{N,{\varepsilon},TV}$ of vectors that have at most $N{\varepsilon}$ points of change. The denoiser $\eta$ minimizes the residual sum of squares penalized by $\tau$ times the total variation of the signal.

In these cases, evaluating the asymptotic minimax MSE is more challenging than for separable denoisers and simpler classes of signals. Nevertheless, we will show that it can be done quite explicitly. We find well-defined phase transitions for AMP reconstruction, precisely at the location predicted by the general formula (1.13).
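For the monotone case, the least-squares projection onto nondecreasing sequences can be computed with the classical pool-adjacent-violators algorithm. The sketch below is one standard way to compute this projection, offered as an illustration rather than as the implementation used for the experiments; the function name is hypothetical.

```python
def project_monotone(y):
    """Least-squares projection of y onto nondecreasing sequences.

    Pool-adjacent-violators: scan left to right, and whenever the mean of the
    newest block falls below the mean of the block before it, merge the two.
    """
    blocks = []  # each block is stored as [sum of values, number of values]
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)  # every entry of a block takes the block mean
    return out
```

For example, the input `[1.0, 3.0, 2.0, 4.0]` violates monotonicity only in its middle pair, which is replaced by its average, giving `[1.0, 2.5, 2.5, 4.0]`.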

6 Contributions

We list eight contributions, beginning with the two most obvious:

A formula for phase transitions of AMP algorithms. We confirm that formula (1.13) accurately describes the sparsity-undersampling tradeoff under which AMP algorithms successfully recover a sparse structured signal from underdetermined measurements. We prove that this relation follows from the state evolution formalism.

As demonstrated in [DMM11] and proved in [BM11a] in the case of the LASSO, there exists a correspondence between convex optimization methods and specific AMP algorithms. We will show that this correspondence is considerably more general. This provides a unified approach which yields sharp phase transition predictions in numerous cases.

Limited benefit of nonconvex penalization for ordinary sparsity. Within the class of scalar separable AMP algorithms, the best achievable phase transition is obtained by the minimax shrinker. Unfortunately the improvement in the transition is relatively small.

Calculation of the minimax MSE of monotone regression and total variation denoising. We are not aware of any previous work computing the minimax MSE of these denoising procedures under the condition of ${\varepsilon}$ -sparse first differences. We prove here a characterization for each of these cases and show that it agrees with the phase transition of both AMP and convex optimization algorithms.

Some conjectures flow naturally from this work:

State Evolution accurately describes the behavior of a wide range of AMP algorithms, for large system sizes $N$ . State evolution is a formalism that makes it possible to characterize the asymptotic behavior of AMP as the number of dimensions tends to infinity [DMM09]. We show in Section 6 that the general relation (1.13) can be proved by assuming state evolution to hold.

In the case of separable denoisers, under suitable regularity conditions, the correctness of state evolution as a description of AMP is proved by [BM11a]. Since formula (1.13) is apparently successful beyond the separable case, it is natural to conjecture that state evolution applies much more generally than to the cases proven so far.

Our study supports the general conclusion that AMP provides a general tool in compressed sensing that is applicable beyond simple sparse signals. If one knows that a certain shrinker is appropriate for denoising a certain type of signal, then the corresponding AMP algorithm provides an efficient reconstruction method for the associated compressed sensing problem. The denoising minimax MSE then maps to the sparsity-undersampling tradeoff.

An interesting research direction is the study of the noisy linear model $y=Ax_{0}+w$ , whereby $w$ is a noise vector (e.g. $w\sim{\sf N}(0,\sigma^{2}{\bf I}_{n\times n})$ ). In analogy with [DMM11], we expect reconstruction to be stable with respect to noise for $\delta>M({\varepsilon}|\eta)$ and unstable for $\delta<M({\varepsilon}|\eta)$ .

7 Related literature

Approximate message passing algorithms for compressed sensing reconstruction were introduced in [DMM09]. They were largely motivated by the connection with message passing algorithms in iterative decoding systems [RU08], and with mean field methods in statistical physics [MM09] (in particular the cavity method and TAP equations). We refer to [DMM11] for a discussion of these connections.

The original AMP framework [DMM09, DMM10] included iterations of the form defined in Eqs. (1.7), (1.8), (1.9) whereby the denoiser is separable. While this covers the ${\rm Firm}$ and ${\rm Minimax}$ shrinkage rules studied in this paper, it did not include the various non-separable denoisers we discuss below, namely the block, monotone and total variation denoisers. Further, in [DMM09], the phase transition behavior was validated numerically only for ${\rm Soft}$ , ${\rm SoftPos}$ and ${\rm Cap}$ denoisers, that are in correspondence with well-studied convex optimization problems. The extension to a noisy linear model $y=Ax_{0}+w$ , with $w\in{\bf R}^{n}$ a vector of iid random entries was carried out in [DMM11]. We also refer to [Mon12] for an overview of this work.

Several papers investigate generalizations of the original framework put forward in [DMM09]. The paper [BM11a] defines a general class of approximate message passing algorithms for which state evolution was proved to be correct. This includes in particular all separable Lipschitz-continuous denoisers. Generalizations of this result were proved in [BLM12, JM12]. Notice that all the separable denoisers treated in this paper are Lipschitz continuous with the exception of hard thresholding. While the last case is not covered by [BM11a], we expect state evolution to hold for hard thresholding AMP as well, by a suitable approximation argument.

Rangan [Ran11] introduces a class of generalized approximate message passing (G-AMP) algorithms that cope with –roughly– two extensions of the basic noisy linear model. First, the noisy measurement vector $y$ can be a non-linear (random) function of the noiseless measurement $Ax_{0}$ . Second, each of the ‘coordinates’ of $x_{0}$ can itself be a –low dimensional– vector. Interesting applications of this framework were developed in [KGR11, Sch11]. Let us notice that G-AMP does not cover any of the non-separable cases treated here (even the block sparse example), and hence provides a generalization in an ‘orthogonal’ direction.

In a parallel line of work, Schniter applied AMP to a number of examples in which the signal $x_{0}$ has a structured prior [Sch10, SSS10, SPS10]. Inference with respect to the prior is carried out using belief propagation, and this is combined with AMP to compute a posteriori estimates. This type of application fits within the class of problems studied here, by choosing the denoiser $\eta_{t}$ in Eq. (1.8) to be given by the appropriate conditional expectation with respect to the signal prior. Note however that the general scheme provided by Eqs. (1.7), (1.8), (1.9) encompasses cases in which the denoiser is not the Bayes estimator for a specific prior.

A special case of known prior is the one in which $x_{0}={\bf X}\sim\nu_{N}$ is distributed according to the (known) product measure $\nu_{N}=\nu\times\cdots\times\nu$ (i.e. the coordinates of ${\bf X}$ are iid with known distribution $\nu$ ). The fundamental limits for compressed sensing reconstruction were established in [WV10]. The natural AMP algorithm uses in this case a posterior expectation denoiser [DMM10]. It was proved in [DJM11] that, for suitable sensing matrices with heteroscedastic entries, this approach achieves the fundamental limits of [WV10] (this approach was put forward in [KMS+12] on the basis of a statistical physics argument). This case fits within the general philosophy of the present paper whereby the class ${\cal F}_{N,{\varepsilon}}$ consists of a single distribution, namely ${\cal F}_{N,{\varepsilon}}=\{\nu_{N}\}$ . However, we prefer not treating this example in the present paper because it is a degenerate case, and the fact that ${\cal F}_{N,{\varepsilon}}$ is not scale invariant leads to some technical differences. We refer instead to [DJM11].

Maleki, Anitori, Yang and Baraniuk [MAYB11] used methods analogous to the ones developed here to study phase transitions for compressed sensing with complex vectors. This is closely related to the block-separable setting considered in Section 3 (there is however some difference in the structure of the sensing matrix).

Structured sparsity models are studied from a different point of view in [BCDH10, CHDB08, CICB10]. Those works focus on deriving sparsity models that capture a variety of applications, and on convex relaxations that promote the relevant sparsity patterns. Reconstruction guarantees are proved under suitable ‘isometry’ assumptions on the sensing matrix.

Closer to our approach is a recent series of papers [CRPW11, RRN11, RRN11], considering general classes of structured signals under random measurements. Let us emphasize two important differences with respect to our work. First, these papers only deal with convex reconstruction methods, while we shall analyze several approaches that are not derived from convex optimization and demonstrate improvements. Second, they establish reconstruction guarantees using concentration-of-measure arguments, while we propose exact asymptotics (essentially based on weak convergence), which enables us to unveil the key relation (1.13) between denoising and the compressed sensing phase transition.

Scalar-separable denoisers

In this section we study scalar-separable denoisers, cf. Eq. (1.6), that further satisfy the scaling relation (1.3). Unless stated otherwise, we will assume that signals belong to the simple sparsity class introduced in Eq. (1.2), to be denoted as ${\cal F}_{N,{\varepsilon}}$ .

As mentioned in the previous section, the computation of the minimax MSE is greatly simplified for separable denoisers. We state and prove the following elementary result in greater generality than necessary for this section. (In particular ${\cal F}_{N,{\varepsilon}}$ is here a general family of probability distributions.)

Let ${\cal F}_{N,{\varepsilon}}\subseteq{\cal P}({\bf R}^{N})$ be any family of probability distributions satisfying the following conditions: $(i)$ If $\nu_{1}\in{\cal F}_{1,{\varepsilon}}$ , then defining $\nu_{N}\equiv\nu_{1}\times\cdots\times\nu_{1}$ ( $N$ times), we have $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ ; $(ii)$ Conversely, if $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ , then letting $\nu_{N,i}$ denote the $i$ -th marginal of $\nu_{N}$ , we have $\overline{\nu}_{N}\equiv N^{-1}\sum_{i=1}^{N}\nu_{N,i}\in{\cal F}_{1,{\varepsilon}}$ .

Then, for any separable denoiser $\eta$ , and for any $N$ ,

$$M({\cal F}_{N,{\varepsilon}}|\eta)=M({\cal F}_{1,{\varepsilon}}|\eta)\,.$$

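Fix $\tau\in\Theta$ and define, for ${\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ ,

$$M({\cal F}_{N,{\varepsilon}}|\eta,\tau)\equiv\sup_{\nu_{N}\in{\cal F}_{N,{\varepsilon}}}\frac{1}{N}\,{\mathbb E}\big\{\|{\bf X}-\eta({\bf X}+{\bf Z};\tau)\|_{2}^{2}\big\}\,.$$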

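The lemma then follows immediately if we prove that, for any $N$ , $M({\cal F}_{N,{\varepsilon}}|\eta,\tau)=M({\cal F}_{1,{\varepsilon}}|\eta,\tau)$ . In order to prove the last statement, first notice that, by property $(i)$ :

$$M({\cal F}_{N,{\varepsilon}}|\eta,\tau)\geq\sup_{\nu_{1}\in{\cal F}_{1,{\varepsilon}}}\frac{1}{N}\,{\mathbb E}_{\nu_{1}\times\cdots\times\nu_{1}}\big\{\|{\bf X}-\eta({\bf X}+{\bf Z};\tau)\|_{2}^{2}\big\}=\sup_{\nu_{1}\in{\cal F}_{1,{\varepsilon}}}{\mathbb E}_{\nu_{1}}\big\{[X_{1}-\eta(X_{1}+Z_{1};\tau)]^{2}\big\}=M({\cal F}_{1,{\varepsilon}}|\eta,\tau)\,.$$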

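The proof is finished by property $(ii)$ , since, for any $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ ,

$$\frac{1}{N}\,{\mathbb E}_{\nu_{N}}\big\{\|{\bf X}-\eta({\bf X}+{\bf Z};\tau)\|_{2}^{2}\big\}=\frac{1}{N}\sum_{i=1}^{N}{\mathbb E}_{\nu_{N,i}}\big\{[X_{i}-\eta(X_{i}+Z_{i};\tau)]^{2}\big\}={\mathbb E}_{\overline{\nu}_{N}}\big\{[X-\eta(X+Z;\tau)]^{2}\big\}\leq M({\cal F}_{1,{\varepsilon}}|\eta,\tau)\,.$$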

This lemma reduces the problem to solving a minimax scalar estimation problem. This problem was characterized before for soft thresholding $\eta=\eta^{soft}$ , positive soft thresholding $\eta=\eta^{softpos}$ , and also hard thresholding $\eta(y;\tau)=y1_{\{|y|>\tau\}}$ [DDS92, DJ94]. Plots of the minimax soft threshold $\tau^{*}({\varepsilon})$ and the minimax MSE are available in [DMM09, DMM11]. Such plots also appear later in this paper as baselines for comparison of interesting new families, namely Firm and Minimax shrinkage.

2 Firm shrinkage

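Soft and hard thresholding can be recovered as limiting cases:

$$\eta^{soft}(y;\tau_{1})=\lim_{\tau_{2}\to\infty}\eta^{firm}(y;(\tau_{1},\tau_{2}))\,,\qquad\eta^{hard}(y;\tau_{1})=\lim_{\tau_{2}\downarrow\tau_{1}}\eta^{firm}(y;(\tau_{1},\tau_{2}))\,.$$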

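Lemma 2.1 yields the following formula for the minimax MSE of firm shrinkage:

$$M({\varepsilon}|{\rm Firm})=\inf_{0\leq\tau_{1}<\tau_{2}<\infty}\;\sup_{\nu\in{\cal F}_{1,{\varepsilon}}}{\mathbb E}\big\{[X-\eta^{firm}(X+Z;(\tau_{1},\tau_{2}))]^{2}\big\}\,.$$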

Figure 1 and Table 1 show the minimax MSE for firm shrinkage as resulting from this calculation. The figure also shows similar results for soft and hard thresholding, for comparative purposes. Over the range presented, the minimax MSE for firm thresholding is strictly smaller than the MSE for hard or soft thresholding. Namely, over this range of ${\varepsilon}$ ,

$$M({\varepsilon}|{\rm Firm})<\min\big\{M({\varepsilon}|{\rm Soft}),\,M({\varepsilon}|{\rm Hard})\big\}\,.$$

This validates the criticisms of soft thresholding, which is often said to shrink large values too heavily. (Note, however, that the use of hard thresholding instead of soft thresholding leads to a larger worst case mean square error.)

Figure 2 shows the minimax thresholds. At least for ${\varepsilon}<1/3$ we see clearly that $\tau_{1}^{*}({\varepsilon})<\tau_{2}^{*}({\varepsilon})<\infty$ , so firm thresholding is preferred over the limiting cases of hard and soft thresholding. (These are numerical results. It is an open question whether, for ${\varepsilon}>1/3$ , the minimax firm threshold has parameter $\tau_{2}({\varepsilon})=\infty$ , reducing it to soft thresholding.) Figure 3 shows the corresponding minimax denoisers for specific values of ${\varepsilon}$ . Finally, Figure 4 plots the minimax value of $\mu$ as a function of ${\varepsilon}$ (corresponding to the minimax probability distribution $\nu_{{\varepsilon},\mu}$ ).
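The firm shrinkage rule itself is simple to state in code. The sketch below follows the verbal description given in the introduction (zero below $\tau_{1}$, identity above $\tau_{2}$, linear interpolation in between) and can be used to check the two limiting cases numerically; the function name is hypothetical.

```python
import math


def firm_shrink(y, tau1, tau2):
    """Scalar firm shrinkage with thresholds 0 <= tau1 < tau2.

    Returns 0 for |y| <= tau1, y for |y| >= tau2, and interpolates linearly
    (continuously at both thresholds) in between.
    """
    if not 0.0 <= tau1 < tau2:
        raise ValueError("need 0 <= tau1 < tau2")
    a = abs(y)
    if a <= tau1:
        return 0.0
    if a >= tau2:
        return y
    return math.copysign(tau2 * (a - tau1) / (tau2 - tau1), y)
```

Taking `tau2` very large makes the slope of the interpolating segment approach 1, which recovers soft thresholding at `tau1`; taking `tau2` just above `tau1` makes the segment nearly vertical, which recovers hard thresholding.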

3 Minimax shrinkage

The previous example showed that a parametric family of shrinkers can improve on soft thresholding, and hence improve the predicted phase transition according to (1.13). The ultimate improvement one could make in this direction is to use the globally minimax nonlinear shrinker. This is the separable denoiser $\eta$ that is minimax not within some parametric family, such as the soft thresholding or the broader firm thresholding family, but minimax over all measurable nonlinearities $\eta:{\bf R}\to{\bf R}$ . While this notion might appear somewhat abstract, it can in fact be implemented in practice, as illustrated in Figure 3, which presents plots of the more familiar denoisers (hard, soft, and firm) together with the minimax denoiser.

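Formally, let $\Theta\equiv{\cal L}({\bf R})$ be the set of all measurable functions $\tau:{\bf R}\to{\bf R}$ ; for such a $\tau$ , set $\eta^{all}(x;\tau)\equiv\tau(x)$ . The minimax MSE over this class is

$$M({\varepsilon}|{\rm Minimax})=\inf_{\tau\in{\cal L}({\bf R})}\;\sup_{\nu\in{\cal F}_{1,{\varepsilon}}}{\mathbb E}\big\{[X-\tau(X+Z)]^{2}\big\}\,.$$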

The calculation of this quantity uses a variety of ideas from minimax decision theory, developed through several papers [Bic81, CS81, BC83, DDS92, Joh94b, Joh94a, DJ94]: details are given in Appendix B. A key point of this computation is the characterization of the minimax nonlinearity as the minimal MSE Bayes rule (that is, the conditional expectation) for the so-called least-favorable prior. The least-favorable prior is the solution of Mallows’ classical Fisher information problem [Mal78], for which we compute numerical upper and lower bounds that coincide within the stated precision.
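To illustrate what a Bayes rule of this kind looks like, the sketch below computes the conditional expectation ${\mathbb E}[X\,|\,X+Z=y]$ for a symmetric three-point prior $(1-{\varepsilon})\delta_{0}+\frac{{\varepsilon}}{2}(\delta_{\mu}+\delta_{-\mu})$ with $Z\sim{\sf N}(0,1)$. This prior is an assumption chosen only because it makes the posterior mean explicit; it is not claimed to be the least-favorable prior, and the function name is hypothetical.

```python
import math


def bayes_shrinker_three_point(y, eps, mu):
    """Posterior mean of X given X + Z = y, with Z ~ N(0, 1) and
    X ~ (1 - eps) * delta_0 + (eps / 2) * (delta_mu + delta_{-mu}).

    Requires 0 < eps < 1. Works with log-weights to avoid overflow.
    """
    # Log of prior mass times Gaussian likelihood, dropping the common
    # factor exp(-y^2 / 2) that cancels in the ratio.
    log_w0 = math.log(1.0 - eps)
    log_wp = math.log(eps / 2.0) + mu * y - 0.5 * mu * mu
    log_wm = math.log(eps / 2.0) - mu * y - 0.5 * mu * mu
    shift = max(log_w0, log_wp, log_wm)
    w0 = math.exp(log_w0 - shift)
    wp = math.exp(log_wp - shift)
    wm = math.exp(log_wm - shift)
    return mu * (wp - wm) / (w0 + wp + wm)
```

The resulting nonlinearity is odd, vanishes at $y=0$, stays close to zero for small $|y|$ and saturates at $\pm\mu$ for large $|y|$.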

Table 1 and Figure 1 present numerical values associated with the solution of the minimax problem. As expected, $M({\varepsilon}|{\rm Minimax})<M({\varepsilon}|{\rm Firm})\leq M({\varepsilon}|{\rm Soft})$ , i.e. optimizing over all nonlinearities yields a smaller mean square error than soft or firm thresholding. On the other hand, Table 1 shows that the improvements are typically of size 0.01 or smaller over the range ${\varepsilon}\in(0.01,0.25)$ . For very small ${\varepsilon}$ it was pointed out in [DMM09] that [DJ94] implies

$$\lim_{{\varepsilon}\to 0}\frac{M({\varepsilon}|{\rm Minimax})}{M({\varepsilon}|{\rm Soft})}=1\,.$$

In the limit of extreme sparsity, there is nothing to be gained by completely general nonlinearities over soft thresholding. For ${\varepsilon}$ bounded away from zero, the improvement is non-vanishing but moderate.

2.4 Empirical phase transition behavior

The research hypothesis driving this paper is that Eq. (1.13) describes the phase transition of AMP algorithms. In order to be completely explicit, we need to check the following predictions, for each nonlinearity $\eta$ of interest:

There exists a curve ${\varepsilon}\mapsto\delta({\varepsilon}|\eta)$ such that for $\delta>\delta({\varepsilon}|\eta)$ the corresponding AMP algorithm will typically succeed in reconstructing the unknown signal $x_{0}$ , and for $\delta<\delta({\varepsilon}|\eta)$ the algorithm will typically fail.

The curve is related to the corresponding scalar denoising problem by $\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta)$ .

We now test this hypothesis for the firm and globally minimax nonlinearities $\eta\in\{\eta^{firm},\eta^{all}\}$ .

Our experiment was conducted along the same lines as [MD10, DT09b, DMM09, DMM11, BT10]. We considered a range of problem sizes $N\in\{1000,2000,4000\}$ , a range of sparsity parameters ${\varepsilon}\in\{0.01,0.02,0.05,0.10,0.15,0.20,0.25\}$ , and a grid of $\delta$ values surrounding the predicted phase transition $\delta({\varepsilon}|\eta)$ . We ran $N_{\rm sample}=1000$ Monte Carlo reconstructions at each parameter combination. We declared “success” when the relative mean-squared error was below $1\%$ :

$$\frac{\|\widehat{x}^{t}-x_{0}\|_{2}^{2}}{\|x_{0}\|_{2}^{2}}<0.01\,.$$

We used $t=300$ iterations of AMP (in most cases the mentioned convergence criterion is reached after a much smaller number of iterations, roughly 20). We repeated a subset of our simulations with different requirements on $\|\widehat{x}^{t}-x_{0}\|_{2}^{2}/\|x_{0}\|_{2}^{2}$ and different numbers of iterations, without significant changes in the threshold location. This point is further justified in Appendix C. In the interest of reproducibility, a suite of Java classes for carrying out these and other simulations in the paper is made available as [DJM12].

We analysed the outcomes of these numerical simulations as follows; see also Appendix H (a similar analysis was already carried out in [DMM09, DT09b]). The simulations generated a data set containing, for each algorithm and each fixed ${\varepsilon}$ , a list of values $\delta_{i}$ and empirical success fractions $\widehat{p}_{i}$ . The success fractions observed at $\delta>M({\varepsilon}|\eta)$ were indeed typically better than $50\%$ and at $\delta<M({\varepsilon}|\eta)$ were typically worse than $50\%$ .

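To quantify this tendency, we fit a logistic regression for the success probability $p_{i}$ at $\delta_{i}$ :

$$\log\frac{p_{i}}{1-p_{i}}=\alpha+\beta\,\big(\delta_{i}-\delta({\varepsilon}|\eta)\big)\,,\qquad(2.4)$$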

where $\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta)$ was computed analytically using ideas mentioned earlier. The choice of the model (2.4) is motivated by the observation that the success probability increases rapidly around the phase transition, and by the common statistical use of logistic models. Also, similar models have been proved to be asymptotically correct in analogous phase transition phenomena [DM08].

For each set of data corresponding to given $(N,{\varepsilon})$ and each non-linearity, we estimate $\alpha$ and $\beta$ from the logit fit, leading to values $\widehat{\alpha},\widehat{\beta}$ . Using these quantities, we estimate the phase transition location as the value at which the probability $\widehat{p}$ of success is $50\%$ . Using Eq. (2.4) this corresponds to $\widehat{\alpha}+\widehat{\beta}(\delta-\delta({\varepsilon}|\eta))=0$ , i.e. $\delta=\delta({\varepsilon}|\eta)-(\widehat{\alpha}/\widehat{\beta})$ . We are therefore led to define the offset between the empirical phase transition and the prediction $\delta({\varepsilon}|\eta)=M({\varepsilon}|\eta)$ as

$$\widehat{{\rm PT}}(N,{\varepsilon})\equiv-\frac{\widehat{\alpha}}{\widehat{\beta}}\,.$$

In order to check the general relation provided by Eq. (1.13) we need to show that $\widehat{{\rm PT}}(N,{\varepsilon})$ tends to zero as $N$ gets large, to within the statistical uncertainty. In Table 2 we report our results on the empirical phase transition, confirming that indeed the offset is small and decreasing with $N$ .
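The fitting step described above can be sketched in a few lines. The following is a minimal illustration, assuming the logit model $\alpha+\beta(\delta_{i}-\delta({\varepsilon}|\eta))$ and a plain Newton iteration on the binomial likelihood; the function names `fit_logistic` and `pt_offset` are hypothetical and are not taken from the Java suite [DJM12].

```python
import math

def fit_logistic(deltas, fractions, trials, delta_pred, iters=50):
    """Fit logit(p_i) = alpha + beta * (delta_i - delta_pred) by Newton's method.

    deltas     -- sampled undersampling ratios delta_i
    fractions  -- empirical success fractions p_hat_i
    trials     -- number of Monte Carlo runs behind each fraction
    delta_pred -- predicted transition delta(eps | eta)
    """
    xs = [d - delta_pred for d in deltas]
    alpha, beta = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, p_hat in zip(xs, fractions):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            w = trials * p * (1.0 - p)      # binomial weight
            r = trials * (p_hat - p)        # score contribution
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return alpha, beta

def pt_offset(alpha_hat, beta_hat):
    """Offset of the empirical 50% point from the prediction: -alpha/beta."""
    return -alpha_hat / beta_hat
```

A negative offset means the empirical $50\%$ point lies below the predicted $\delta({\varepsilon}|\eta)$ ; a full analysis would also attach confidence intervals, which this sketch omits.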

A few additional remarks on these data are of interest:

We calculated formal $95\%$ confidence intervals for $\widehat{{\rm PT}}$ , indicating the tight control we have of the correct value.

As in earlier studies [DT09b], we expect that $\widehat{{\rm PT}}(N,{\varepsilon})$ tends to zero at a rate that is inversely proportional to a power of $N$ . Namely

for some $\gamma\in(0,1]$ . Our data support this relationship, with $\gamma\approx 1/3$ . See Appendix H.

Denoting by $\widehat{\beta}_{N}$ the fitted slope coefficient at dimension $N$ , evidence that $\widehat{\beta}_{N}$ is increasing with larger $N$ indicates that a sharpening of the phase transition is indeed occurring. Appendix H shows that $\widehat{\beta}_{N}\sim\sqrt{N}$ is consistent with our data.

We refer to Appendices G and H for further details.
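The decay exponent mentioned in the remarks above can be estimated by a least-squares line in log-log coordinates. The sketch below assumes the offsets follow $|\widehat{{\rm PT}}(N,{\varepsilon})|\approx c\,N^{-\gamma}$ ; the helper name `fit_power_law` and the constant $c$ are illustrative assumptions, and no data from Table 2 are used.

```python
import math

def fit_power_law(sizes, offsets):
    """Least-squares fit of log|offset| = log(c) - gamma * log(N).

    Returns (c, gamma). Offsets must be nonzero.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(abs(o)) for o in offsets]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```

With only three problem sizes the fit has a single residual degree of freedom, so the estimate of $\gamma$ should be read as indicative only.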

Block-separable denoisers

We now turn to the case of block-structured sparsity, first introducing some notational conventions. We partition the vector $x=(x_{1},x_{2},\dots,x_{N})$ into $M$ blocks each of size $B$ . Denoting by $block_{m}(x)=(x_{(m-1)B+1},\dots,x_{mB})$ the $m$ -th block, we hence write

$$x=\big(block_{1}(x),block_{2}(x),\dots,block_{M}(x)\big)\,.$$

A block-separable denoiser is a mapping $\eta(\,\cdot\,;\tau):{\bf R}^{N}\to{\bf R}^{N}$ that decomposes according to the above partition:

$$\eta(x;\tau)=\big(\eta(block_{1}(x);\tau),\dots,\eta(block_{M}(x);\tau)\big)\,,$$

where, with an abuse of notation, we use the same symbol to denote the single-block denoiser $\eta(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}$ . The last equation replaces Eq. (1.6), which corresponds to the simple separable case. The above form applies to noise with variance $\sigma^{2}=1$ . For general variance we adopt again the scaling relation (1.3).

We will apply these denoisers to signals from the block-sparse class ${\cal F}_{N,{\varepsilon},B}$ defined as follows for ${\varepsilon}\in[0,1]$ , $B\in{\bf N}$ , $M\equiv N/B$ ,

In words, this is the class of (random) vectors ${\bf X}$ that have (in expectation) at most $M{\varepsilon}$ blocks different from $0$ . For simplicity, we will write ${\cal F}_{{\varepsilon},B}$ for the $M=1$ case, ${\cal F}_{B,{\varepsilon},B}$ .

The same simplifications described in Section 2.1 apply, with obvious modifications, to the present context.

Let ${\cal F}_{N,{\varepsilon}}\subseteq{\cal P}({\bf R}^{N})$ be any family of probability distributions satisfying the following conditions: $(i)$ If $\nu_{B}\in{\cal F}_{B,{\varepsilon}}$ , then defining $\nu_{N}\equiv\nu_{B}\times\cdots\times\nu_{B}$ ( $M=N/B$ times), we have $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ ; $(ii)$ Vice versa, if $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ , then letting $\nu_{N,i}$ denote the marginal of the $i$ -th block under $\nu_{N}$ , we have $\overline{\nu}_{N}\equiv M^{-1}\sum_{i=1}^{M}\nu_{N,i}\in{\cal F}_{B,{\varepsilon}}$ .

Then, for any block-separable denoiser $\eta$ , and for any $N$ multiple of $B$

The proof is omitted since it is an immediate generalization of the one of Lemma 3.1. The class ${\cal F}_{N,{\varepsilon},B}$ to be studied in the rest of this section clearly satisfies the assumptions of this lemma.

Block-soft thresholding $\eta^{soft}(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}$ is the nonlinear shrinker defined by letting, for $y\in{\bf R}^{B}$ , and $\tau\in{\bf R}_{+}$ ,

$$\eta^{soft}(y;\tau)=\Big(1-\frac{\tau}{\|y\|_{2}}\Big)_{+}\,y\,,$$

where $(z)_{+}\equiv\max(z,0)$ . The case $B=1$ reduces to traditional soft thresholding of Example 1.2. More generally, $\eta^{soft}(y;\tau)$ shrinks its argument $y$ to $0$ if $\|y\|_{2}\leq\tau$ and moves it by an amount $\tau$ towards the origin otherwise. It can also be regarded as the solution of a penalized least squares problem, namely

$$\eta^{soft}(y;\tau)=\arg\min_{x\in{\bf R}^{B}}\Big\{\frac{1}{2}\|y-x\|_{2}^{2}+\tau\|x\|_{2}\Big\}\,.$$

Block thresholding has previously been considered by Hall, Kerkyacharian and Picard [HKP98] and by Cai [Cai99] although in specific ‘wavelet’ applications.
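As an illustration of the definition above, the following minimal sketch implements block soft thresholding for unit noise variance, together with a generic wrapper that applies any single-block rule block by block. The names `block_soft` and `block_separable` are introduced here for illustration only.

```python
import math

def block_soft(y, tau):
    """Block soft thresholding of one block y in R^B at threshold tau >= 0."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= tau:
        # The whole block is set to zero.
        return [0.0] * len(y)
    scale = 1.0 - tau / norm  # moves y by tau towards the origin
    return [scale * v for v in y]

def block_separable(x, B, denoise_block):
    """Apply a single-block denoiser to each of the N/B consecutive blocks of x."""
    if len(x) % B != 0:
        raise ValueError("len(x) must be a multiple of the block size B")
    out = []
    for start in range(0, len(x), B):
        out.extend(denoise_block(x[start:start + B]))
    return out
```

For example, `block_separable(x, 2, lambda b: block_soft(b, 1.0))` thresholds a vector in blocks of size two; with `B = 1` the rule coincides with scalar soft thresholding.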

where expectation is taken with respect to $X\sim\nu$ independent of ${\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B})$ . Notice that the condition $\nu\in{\cal F}_{{\varepsilon},B}$ simply amounts to saying that $\nu$ is a probability measure on ${\bf R}^{B}$ with $\nu(\{0\})\geq 1-{\varepsilon}$ . The calculation of $M_{B}({\varepsilon}|{\rm BlockSoft})$ can be reduced to a calculus problem. We state the results below, deferring calculations to Appendix D.

Let $X_{B}$ be a chi-square random variable with $B$ degrees of freedom and define the functions $g,h:{\bf R}_{+}\to{\bf R}$ as follows

The minimax risk of block soft thresholding over the class ${\cal F}_{N,{\varepsilon},B}$ is given by

This is a parametric expression for $\tau\in[0,\infty)$ . The parameter corresponds to the minimax threshold $\tau$ .

In Figure 5 we present graphs of $M({\varepsilon})=M_{B}({\varepsilon}|{\rm BlockSoft})$ as a function of ${\varepsilon}$ . It is immediate to prove the following structural properties: $(i)$ $0\leq M({\varepsilon})\leq 1$ (the upper bound follows from taking $\tau=0$ ); $(ii)$ $M({\varepsilon})$ is monotone increasing and concave (monotonicity is a consequence of ${\cal F}_{B,{\varepsilon}}\subseteq{\cal F}_{B,{\varepsilon}^{\prime}}$ for ${\varepsilon}\leq{\varepsilon}^{\prime}$ , and concavity follows since any measure in ${\cal F}_{q{\varepsilon}_{1}+(1-q){\varepsilon}_{2},B}$ can be written as a convex combination of measures in ${\cal F}_{{\varepsilon}_{1},B}$ and in ${\cal F}_{{\varepsilon}_{2},B}$ ); $(iii)$ $M({\varepsilon})\rightarrow 0$ as ${\varepsilon}\rightarrow 0$ ; $M({\varepsilon})\rightarrow 1$ as ${\varepsilon}\rightarrow 1$ . (Recall that we are considering the MSE per coordinate.) Associated with the minimax problem is also an optimal threshold value $\tau^{*}({\varepsilon}|B)$ , which we plot in Figure 6.
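The curve $M_{B}({\varepsilon}|{\rm BlockSoft})$ can also be evaluated by direct numerical optimization rather than through the parametric expression. The sketch below rests on one assumption that should be checked against Appendix D: for a fixed threshold $\tau$ , the worst-case risk of a nonzero block equals $B+\tau^{2}$ (approached as the block norm diverges), so the per-coordinate worst-case MSE is $[(1-{\varepsilon})\,{\mathbb E}(\|{\bf Z}\|_{2}-\tau)_{+}^{2}+{\varepsilon}(B+\tau^{2})]/B$ . The expectation is computed by Simpson's rule and the minimization over $\tau$ by golden-section search; all function names are illustrative.

```python
import math

def chi_pdf(r, B):
    """Density at r >= 0 of the norm of a standard Gaussian vector in R^B."""
    if r == 0.0:
        return math.sqrt(2.0 / math.pi) if B == 1 else 0.0
    log_f = ((1.0 - B / 2.0) * math.log(2.0) + (B - 1) * math.log(r)
             - 0.5 * r * r - math.lgamma(B / 2.0))
    return math.exp(log_f)

def risk_at_zero(tau, B, n=2000):
    """E (||Z||_2 - tau)_+^2 for Z ~ N(0, I_B), by Simpson's rule (n even)."""
    upper = max(tau, math.sqrt(B)) + 10.0
    h = (upper - tau) / n
    total = 0.0
    for i in range(n + 1):
        r = tau + i * h
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * (r - tau) ** 2 * chi_pdf(r, B)
    return total * h / 3.0

def minimax_block_soft(eps, B, iters=60):
    """Return (minimax per-coordinate MSE, minimizing tau) under the stated assumption."""
    def worst_case(tau):
        return ((1.0 - eps) * risk_at_zero(tau, B) + eps * (B + tau * tau)) / B

    lo, hi = 0.0, math.sqrt(B) + 5.0
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = hi - g * (hi - lo), lo + g * (hi - lo)
    fa, fb = worst_case(a), worst_case(b)
    for _ in range(iters):
        if fa < fb:           # minimum lies in [lo, b]
            hi, b, fb = b, a, fa
            a = hi - g * (hi - lo)
            fa = worst_case(a)
        else:                 # minimum lies in [a, hi]
            lo, a, fa = a, b, fb
            b = lo + g * (hi - lo)
            fb = worst_case(b)
    tau = 0.5 * (lo + hi)
    return worst_case(tau), tau
```

Golden-section search is adequate here because the objective is a sum of two convex functions of $\tau$ and is therefore unimodal.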

A particularly interesting case is the one of large blocks. As $B\rightarrow\infty$ the minimax MSE has a well defined, and particularly explicit limit.

Further, when properly normalized, the minimax threshold converges with increasing block size:

3.2 Block James-Stein

In order to approach oracle MSE for large $B$ , we propose to use the positive-part James-Stein shrinkage estimator [JS10]. This is again a block-separable denoiser that acts as follows on a block $y\in{\bf R}^{B}$ :

$$\eta^{JS}(y)=\Big(1-\frac{B-2}{\|y\|_{2}^{2}}\Big)_{+}\,y\,.$$

Analogously to block soft thresholding, this estimator shrinks to $0$ blocks with small norms. On the other hand, its bias vanishes as $\|y\|_{2}\to\infty$ . Using once more Lemma 3.1 we have (notice that in this case there is no tuning parameter)

Remarkably, the limiting $B\rightarrow\infty$ behavior of this denoiser is ideal, and noticeably better than block soft thresholding, as shown by Figure 7 and formally in the next lemma.
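For concreteness, a minimal sketch of the positive-part James-Stein rule on a single block is given below. It assumes unit noise variance and the classical shrinkage constant $B-2$ ; the function name `james_stein_plus` is illustrative.

```python
def james_stein_plus(y):
    """Positive-part James-Stein shrinkage of one block y in R^B (unit noise variance)."""
    B = len(y)
    norm_sq = sum(v * v for v in y)
    if norm_sq == 0.0:
        return [0.0] * B
    # Shrinkage factor (1 - (B - 2) / ||y||^2)_+ ; no tuning parameter is involved.
    shrink = max(0.0, 1.0 - (B - 2) / norm_sq)
    return [shrink * v for v in y]
```

Blocks with $\|y\|_{2}^{2}\leq B-2$ are mapped to zero, while for large $\|y\|_{2}$ the factor tends to one, which is the vanishing-bias property noted above.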

Let $M_{B}({\varepsilon}|{\rm JamesStein})$ denote the minimax MSE for $\eta^{JS}$ over the class of ${\varepsilon}$ -block sparse sequences. For any $B>2$ , we have:

Consider temporarily the case where the observation is ${\bf Y}=\mu+{\bf Z}$ with ${\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B})$ , and $\mu\in{\bf R}^{B}$ nonrandom and known. A simple calculation shows that the optimal linear estimator of the form $\eta(y)=cy$ , $c\in{\bf R}$ , is given by

This estimator uses information about $\|\mu\|_{2}$ (which could only be supplied by an oracle) to choose the constant $c$ as a function of $\|\mu\|_{2}$ . Note in particular that the risk of this estimator is

Applying (3.6), and keeping in mind that, for $\nu\in{\cal F}_{{\varepsilon},B}$ , $\nu(\{X=0\})\geq(1-{\varepsilon})$ , we have

The oracle inequality [DJ95, Theorem 5] shows that for $B>2$ , and for every vector $\mu\in{\bf R}^{B}$ , if $Y\sim{\sf N}(\mu,{\bf I}_{B\times B})$ , then

Combined with the previous display this proves the Lemma. ∎
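The oracle linear estimator used in the proof can be recovered by a one-line computation. The following is a sketch based only on the model ${\bf Y}=\mu+{\bf Z}$ stated there; the symbol $c^{*}$ is introduced here for illustration. Since ${\bf Z}$ has mean zero and ${\mathbb E}\|{\bf Z}\|_{2}^{2}=B$ ,

$${\mathbb E}\|c\,{\bf Y}-\mu\|_{2}^{2}=(1-c)^{2}\|\mu\|_{2}^{2}+c^{2}B\,,$$

which is a convex quadratic in $c$ , minimized at

$$c^{*}=\frac{\|\mu\|_{2}^{2}}{\|\mu\|_{2}^{2}+B}\,,\qquad{\mathbb E}\|c^{*}{\bf Y}-\mu\|_{2}^{2}=\frac{B\,\|\mu\|_{2}^{2}}{\|\mu\|_{2}^{2}+B}\leq\min\big(B,\|\mu\|_{2}^{2}\big)\,.$$

In particular the oracle risk vanishes at $\mu=0$ and tends to $B$ as $\|\mu\|_{2}\to\infty$ , which is what makes it a useful benchmark for block-sparse signals.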

The argument in the proof leads in fact to a convenient expression for $M_{B}({\varepsilon}|{\rm JamesStein})$ . With the notations introduced there, we have

Now $\eta^{JS}$ is known to be minimax for the unconstrained problem of estimating a non-sparse vector $\mu$ , i.e. $\sup_{\mu\in{\bf R}^{B}}R(\mu;\eta^{JS})=B$ yielding

Therefore computing the minimax MSE for $\eta^{JS}$ reduces to computing the single quantity $R(0;\eta^{JS})$ , that can be estimated through numerical integration. A good approximation for large $B$ is provided by the following formula $R(0;\eta^{JS})=B^{-1}+\kappa B^{-3/2}+O(B^{-2})$ with $\kappa\approx 0.752$ (cf. Appendix I.2.1). Hence we have

In the next section we will use this formula (neglecting $O(B^{-2})$ terms) in comparing the general prediction of Eq. (1.13) with the empirical results for the James-Stein AMP algorithm. Numerical integration reveals that this formula is accurate enough for such comparison.
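The numerical integration mentioned above can be sketched as follows. The code assumes the classical form of the positive-part James-Stein rule with constant $B-2$ , for which the per-coordinate risk at zero is ${\mathbb E}[(X_{B}-(B-2))_{+}^{2}/X_{B}]/B$ with $X_{B}$ chi-square with $B$ degrees of freedom; the integration range and grid size are illustrative choices.

```python
import math

def chi2_pdf(x, B):
    """Density of a chi-square variable with B degrees of freedom at x > 0."""
    log_f = ((B / 2.0 - 1.0) * math.log(x) - x / 2.0
             - (B / 2.0) * math.log(2.0) - math.lgamma(B / 2.0))
    return math.exp(log_f)

def js_risk_at_zero(B, n=4000):
    """Per-coordinate risk of positive-part James-Stein at mu = 0, for B > 2.

    Integrates (x - (B - 2))^2 / x against the chi-square density over
    x >= B - 2 (where the shrinkage factor is positive) with Simpson's rule.
    """
    lo = B - 2.0
    hi = B + 12.0 * math.sqrt(2.0 * B) + 50.0  # far in the upper tail
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * (x - lo) ** 2 / x * chi2_pdf(x, B)
    return total * h / 3.0 / B
```

Comparing `js_risk_at_zero(B)` with $B^{-1}+\kappa B^{-3/2}$ for a few block sizes gives a quick check of the large- $B$ approximation.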

3.3 Empirical phase transition behavior

We now turn to the compressed sensing reconstruction problem whereby the block-sparse vector $x_{0}$ is reconstructed from observed data $y=Ax_{0}$ using the AMP algorithm. We want to test the hypothesis that Eq. (1.13) describes the phase transition of the two block shrinkage AMP algorithms, corresponding to the block soft thresholding, and block James-Stein.

We conducted a set of experiments similar to those described in Section 2.4. We constructed block-sparse signals at different undersampling and sparsity levels and ran tests of block thresholding AMP. More precisely, we used the update equations (1.7) to (1.9) with $\eta=\eta^{soft}$ (block soft thresholding AMP) or $\eta=\eta^{JS}$ (James-Stein AMP).

It is a straightforward calculus exercise to compute an explicit expression for the memory term ${\sf b}_{t}$ . For block soft thresholding AMP we get

Our results show that the curve $\delta=M_{B}({\varepsilon}|{\rm BlockSoft})$ correctly separates two phases of performance: below this curve success in AMP recovery is atypical and above it is typical. Similarly, the curve $\delta=M_{B}({\varepsilon}|{\rm JamesStein})$ correctly describes the phase transition for block James-Stein shrinkage. The empirical results are presented in Figure 8 (for block soft thresholding) and Figure 9 (for block James-Stein). We refer to Appendix G for further details.

Monotone regression

In this section and the next, we show that, quite surprisingly, the formula (1.13) can be applied also to some highly nontrivial non-separable denoisers.

In this section we consider vectors that are monotone, and mostly constant. Let ${\cal M}_{N}$ denote the cone of nondecreasing sequences:

$${\cal M}_{N}\equiv\big\{x\in{\bf R}^{N}:\;x_{1}\leq x_{2}\leq\dots\leq x_{N}\big\}\,.$$

We then define the class of mostly constant non-decreasing vectors

Since vectors from this class are, in general, not sparse, we will occasionally refer to the parameter ${\varepsilon}$ as the ‘simplicity’ parameter.

For this problem we will consider the denoiser $\eta^{mono}:{\bf R}^{N}\to{\bf R}^{N}$ , that solves the monotone regression problem

$$\eta^{mono}(y)\equiv\arg\min_{x\in{\cal M}_{N}}\|y-x\|_{2}^{2}\,.$$

In other words, $\eta^{mono}$ is the (Euclidean) projection on the cone of monotone sequences. This denoiser is highly non-separable, as one can understand most clearly by studying the standard pool-adjacent-violators algorithm for implementing it (see [BC90] for a recent reference).
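The non-separable nature of this denoiser is visible in a short implementation of pool adjacent violators. The following is a minimal sketch with unit weights; the function name `monotone_regression` is illustrative.

```python
def monotone_regression(y):
    """Euclidean projection of y onto nondecreasing sequences (pool adjacent violators)."""
    blocks = []  # each entry is [sum of pooled values, number of pooled values]
    for v in y:
        blocks.append([float(v), 1])
        # Pool while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)  # every pooled position receives the block mean
    return out
```

Each output coordinate is the average of the input over a data-dependent interval, so changing a single input entry can alter many output entries.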

In order to apply formula (1.13), we need to calculate $M({\varepsilon}|{\rm MonoReg})$ , which requires in particular determining the least favorable distribution $\nu_{N}\in{\cal F}_{N,{\varepsilon},{\rm mono}}$ and proving that the limit $N\to\infty$ of the minimax MSE exists. We present here the main ideas, deferring details to Appendix F.

It is convenient to introduce the risk at $\mu\in{\cal M}_{N}$ :

where expectation is taken with respect to ${\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ . It is further useful to introduce a specific notation for the risk at $0$ , namely

The risk of monotone regression satisfies the following properties:

$(a)$ The function $t\mapsto R_{N}(t\,\mu)$ is monotone increasing for $t\in{\bf R}_{+}$ .

$(b)$ Let $I_{+}(\mu)\equiv\{i\in[N-1]:\;\mu_{i}<\mu_{i+1}\}$ be the set of increase points of $\mu$ . Denoting them by $I_{+}(\mu)\equiv\{i_{1},i_{2},\dots,i_{K(\mu)}\}$ , $i_{k}\leq i_{k+1}$ , let $J_{k}\equiv\{i_{k}+1,i_{k}+2,\dots,i_{k+1}\}$ for $k\in\{0,\dots,K(\mu)\}$ (with, by convention, $i_{0}=0$ , $i_{K(\mu)+1}=N$ ). Then, for any $\mu\in{\cal M}_{N}$ ,

For a non-empty closed convex ${\cal S}\subseteq{\bf R}^{N}$ , we let ${\sf P}_{\cal S}:{\bf R}^{N}\to{\bf R}^{N}$ denote the Euclidean projector to ${\cal S}$ , i.e. ${\sf P}_{\cal S}(y)\equiv\mbox{argmin}_{x\in{\cal S}}\|x-y\|_{2}$ . Further, for $v\in{\bf R}^{N}$ , ${\cal S}+v\equiv\{x+v\,:\;x\in{\cal S}\}$ .

Note that it is sufficient to show that, letting $D(\mu;z)\equiv\big{\|}\eta^{mono}(\mu+z)-\mu\big{\|}_{2}^{2}$ , $t\mapsto D(t\mu;z)$ is monotone increasing in $t\in{\bf R}_{+}$ . By continuity of the projection operator, it follows that $\mu\mapsto D(\mu;z)$ is continuous. Further, notice that ${\cal M}_{N}$ is a cone obtained as the intersection of $N-1$ half-spaces:

Let ${\cal V}_{i}=\{x\in{\bf R}^{N}:\;x_{i}=x_{i+1}\}$ be the separating hyperplane for ${\cal H}_{i}$ and, for $B\subseteq[N-1]$ , define

with, by convention ${\cal V}_{\emptyset}\equiv{\bf R}^{N}$ . Since $\eta^{mono}={\sf P}_{{\cal M}_{N}}$ , we have that $\eta^{mono}$ is continuous and piecewise linear, and on each piece it is equal to one of the projectors ${\sf P}_{{\cal V}_{B}}$ , which we will denote by ${\sf P}_{B}$ for $B\subseteq[N-1]$ . It is therefore sufficient to show that, defining for $B\subseteq[N-1]$ ,

the function $t\mapsto D_{B}(t\mu,z)$ is monotone increasing for $t\in{\bf R}_{+}$ .

Let $B\equiv\cup_{k=1}^{K}B_{k}$ where each $B_{k}$ is a contiguous segment (in the sense of point $(b)$ ), and $\overline{B}_{k}\equiv\{i\in[N]:i\in B_{k}\vee(i-1)\in B_{k}\}$ . Further, for $x\in{\bf R}^{N}$ , let $\overline{x}_{S}\equiv|S|^{-1}\sum_{i\in S}x_{i}$ . Then, for any $x\in{\bf R}^{N}$ ,

which is clearly increasing in $t\in{\bf R}_{+}$ . ∎

The proof of part $(b)$ is deferred to Appendix F.

The last Lemma shows that the least favorable signal $\mu$ is constant on $N(1-{\varepsilon})$ positions of the interval $\{1,2,\dots,N\}$ and has large (going to infinity) jumps at the remaining $N{\varepsilon}$ increase points. The resulting risk only depends on the distribution of the lengths of the intervals over which $\mu$ is constant.

The next Lemma provides some useful insight on the behavior of the risk at $0$ . This is crucial since it determines the minimax risk through Eq. (4.4).

The monotone regression risk at zero, defined through Eq. (4.3) satisfies $r(1)=1$ and, for any $N\geq 10$

The proof of this Lemma can be found in Appendix F.2.

For moderate values of $N$ , $r(N)$ can be computed numerically through Monte Carlo simulations. Figure 10 presents the results of such a simulation. It appears that $r(N)=\Theta(\log N)$ as $N\to\infty$ suggesting that the last lemma is loose by a logarithmic factor.
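A Monte Carlo estimate of this kind can be sketched as follows. The code assumes that $r(N)$ denotes the total (not per-coordinate) risk ${\mathbb E}\|\eta^{mono}({\bf Z})\|_{2}^{2}$ , which is consistent with $r(1)=1$ ; the number of trials, the seed, and the function names are illustrative and do not reproduce the setup behind Figure 10.

```python
import random

def _pava(y):
    """Projection onto nondecreasing sequences by pool adjacent violators."""
    blocks = []  # [sum, count] per pooled block
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def mono_risk_at_zero(N, trials=2000, seed=0):
    """Monte Carlo estimate of E || eta_mono(Z) ||_2^2 for Z ~ N(0, I_N)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(N)]
        total += sum(v * v for v in _pava(z))
    return total / trials
```

Plotting `mono_risk_at_zero(N)` against $\log N$ for a range of $N$ is one way to examine the growth rate suggested above.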

We can finally establish our main result on the minimax MSE of monotone regression over the class ${\cal F}_{N,{\varepsilon},{\rm mono}}$ . Remarkably, we are able to characterize the least favorable distribution $\nu_{N}\in{\cal F}_{N,{\varepsilon},{\rm mono}}$ .

The asymptotic minimax MSE of monotone regression

Finally, for any $\xi>0$ , ${\varepsilon}\in(0,1)$ , the following distribution $\nu_{N}^{({\varepsilon},\xi)}\in{\cal F}_{N,{\varepsilon},{\rm mono}}$ has risk larger than $M({\varepsilon}|{\rm MonoReg})-\xi$ for all $N$ large enough. A signal ${\bf X}\sim\nu^{({\varepsilon},\xi)}_{N}$ has $X_{i+1}-X_{i}=\Delta>0$ at all increase points $i\in[N-1]$ for some $\Delta=\Delta(\xi)$ large enough, and the lengths of intervals between increase points have distribution $\pi$ achieving the max in Eq. (4.6).

Hence $M({\varepsilon}|{\rm MonoReg})$ is immediately upper bounded by the right hand side of Eq. (4.6). The matching lower bound is obtained by evaluating the above expressions for the distribution $\nu_{N}^{({\varepsilon},\xi)}$ .

Finally Eq. (4.1) follows by using Lemma 4.2 in Eq. (4.6). ∎

The resulting curve $M({\varepsilon}|{\rm MonoReg})$ is presented in Fig. 10.

4.2 Empirical phase transition behavior

We next consider the compressed sensing problem. We programmed the AMP iteration (1.7)-(1.9), with $\eta(\,\cdot\,;\tau,\sigma)=\eta^{mono}(\,\cdot\,)$ the monotone regression denoiser. The denoiser itself was implemented using the standard pool adjacent violators algorithm.

It is a simple exercise to obtain an explicit formula for the memory term ${\sf b}_{t}$ . As in Lemma 4.1, let $K(\mu)$ denote the number of increase points in the signal $\mu\in{\bf R}^{N}$ . We then have

We will refer to this specific version of AMP as monoreg AMP.

In the present case we evaluated success probability using the following (Hamming-like) distance

and declared a success when $H_{\alpha}(x^{t},x_{0})\leq\beta$ . In Figure 11 we used $t=300$ and $\alpha=\beta=0.01$ , but very similar results are obtained with other values of the parameters. The rationale for using $H_{\alpha}(x^{t},x_{0})$ instead of the normalized mean square error lies in the structure of the signals $x_{0}$ . Since the least favorable $x_{0}$ is monotone with large jumps, its norm is very large, concentrated at endpoints, and depends strongly on $N$ . This leads to subtle normalization issues across different $N$ .

The agreement between the empirical phase transition and the general prediction $\delta=M({\varepsilon}|{\rm MonoReg})$ in Fig. 11 is satisfactory and improves with the signal’s length.

Total variation minimization

In this section we consider vectors $x\in{\bf R}^{N}$ that are mostly constant, with a few change points. In order to model this problem, we introduce the class of probability distributions

Again ${\varepsilon}\in(0,1)$ is a ‘simplicity’ parameter. Note that this class is quite similar to the class ${\cal F}_{N,{\varepsilon},{\rm mono}}$ studied in the previous section, the ‘only’ difference being that change points can be either points of increase or points of decrease.

A convenient denoiser for this setting is the total variation penalized least-squares [ROF92], also called fused LASSO [TSR+05], that we will denote by $\eta^{tv}(\,\cdot\,;\tau):{\bf R}^{N}\to{\bf R}^{N}$ . This depends on $\tau\in{\bf R}_{+}$ and, for $y\in{\bf R}^{N}$ and noise variance $\sigma^{2}=1$ , it returns

An extensive literature is devoted to solving this denoising problem, see for example [VO96]. For general noise variance $\sigma^{2}$ , the above expression is generalized through the usual scaling relationship (1.3).
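As a small self-contained illustration, the sketch below solves a one-dimensional total variation denoising problem by projected gradient on the dual. Two assumptions are made: the objective is taken to be $\frac{1}{2}\|y-x\|_{2}^{2}+\tau\sum_{i=1}^{N-1}|x_{i+1}-x_{i}|$ , and the solver is a simple substitute chosen for brevity, not the projected Newton method of [VO96] nor the tvdip package. The function name `tv_denoise` is illustrative.

```python
def tv_denoise(y, tau, iters=2000):
    """1-D total variation denoising by projected gradient on the dual.

    The primal solution is x = y - D^T u, where (D x)_i = x_{i+1} - x_i and
    the dual variable u is constrained to |u_i| <= tau.
    """
    n = len(y)
    if n < 2:
        return [float(v) for v in y]
    u = [0.0] * (n - 1)

    def primal(u_vec):
        x = []
        for j in range(n):
            left = u_vec[j - 1] if j > 0 else 0.0
            right = u_vec[j] if j < n - 1 else 0.0
            x.append(y[j] - left + right)  # (D^T u)_j = u_{j-1} - u_j
        return x

    for _ in range(iters):
        x = primal(u)
        for i in range(n - 1):
            # Step size 1/4 is below 1 / lambda_max(D D^T); then clip to [-tau, tau].
            step = u[i] + 0.25 * (x[i + 1] - x[i])
            u[i] = max(-tau, min(tau, step))
    return primal(u)
```

This first-order scheme converges slowly for long signals, so it is meant for small examples and as a reference against faster solvers.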

Much of the analysis in this section is analogous to the one of monotone regression. We will therefore present several arguments in synthetic form to limit redundancy.

In this section we outline the computation of the asymptotic minimax MSE of the total variation denoiser over the class ${\cal F}_{N,{\varepsilon},TV}$ , to be denoted by $M({\varepsilon}|TV)$ .

We start by defining a generalization of the problem (5.1). For $s=(s_{1},s_{2})\in\{+1,-1\}^{2}$ , we let

We define the risk at $\mu\in{\bf R}^{N}$ as

where ${\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ . We denote the risk at $0$ by

Notice that, by symmetry, $r_{s}(N;\tau)=r_{-s}(N;\tau)$ and $r_{s_{1},s_{2}}(N;\tau)=r_{s_{2},s_{1}}(N;\tau)$ . We then have the following analogue of Lemma 4.1.

The risk of total variation regression satisfies the following properties:

$(a)$ The function $t\mapsto R_{N}(t\,\mu;\tau)$ is monotone increasing for $t\in{\bf R}_{+}$ .

$(b)$ Let $I_{\neq}(\mu)\equiv\{i\in[N-1]:\;\mu_{i}\neq\mu_{i+1}\}$ be the set of change points of $\mu$ . Denoting them by $I_{\neq}(\mu)\equiv\{i_{1},i_{2},\dots,i_{K(\mu)}\}$ , $i_{k}\leq i_{k+1}$ , let $J_{k}\equiv\{i_{k}+1,i_{k}+2,\dots,i_{k+1}\}$ for $k\in\{0,\dots,K\}$ (with, by convention, $i_{0}=0$ , $i_{K+1}=N$ ). Further, for $K\geq 1$ , let $s(k)=[{\rm sgn}(\mu_{i_{k}}-\mu_{i_{k}+1}),{\rm sgn}(\mu_{i_{k+1}+1}-\mu_{i_{k+1}})]$ for $k\in\{1,\dots,K-1\}$ , $s(0)=[0,{\rm sgn}(\mu_{i_{1}+1}-\mu_{i_{1}})]$ , $s(K)=[0,{\rm sgn}(\mu_{i_{k}}-\mu_{i_{k}+1})]$ . For $K=0$ we let $s(0)=(0,0)$ .

The argument in part $(b)$ is essentially the same as in part $(b)$ of Lemma 4.1 and we will therefore omit it.

For proving part $(a)$ , we will prove that, letting $D(\mu;z)\equiv\|\eta^{tv}(\mu+z;\tau)-\mu\|_{2}^{2}$ , the function $t\mapsto D(t\mu;z)$ is increasing for $t\in{\bf R}_{+}$ . First notice that the stationarity condition for the minimum in Eq. (5.1) reads

with the convention that $v_{0}=v_{N}=0$ . Let $I_{\neq}=I_{\neq}(x)$ and $J_{k},s(k)$ be defined as in part $(b)$ of the statement. Then, summing Eq. (5.6) over $i\in J_{k}$ , we get

where $\overline{s}(k)=s_{1}(k)+s_{2}(k)$ . Hence $\eta^{tv}(\,\cdot\,;\tau)$ is piecewise affine with components indexed by ${\cal J}=\{J_{k},s(k)\}_{k\in[K]}$ . Within each component, we have $\eta^{tv}(y;\tau)=F_{{\cal J}}(y)$ with $F_{{\cal J}}$ defined as per Eq. (5.8).

Since $y\mapsto\eta^{tv}(y;\tau)$ is continuous (and hence $t\mapsto D(t\mu;z)$ is), it is sufficient to prove that, letting

the function $t\mapsto D_{{\cal J}}(t\mu;\,z)$ is monotone increasing for $t\in{\bf R}_{+}$ . Using Eq. (5.8) we obtain

where $\overline{x}_{J_{k}}$ denotes the average of vector $x$ over $J_{k}$ . It follows that $t\mapsto D_{{\cal J}}(t\mu;\,z)$ is increasing as claimed. ∎

The risk at $0$ , $r_{s}(N;\tau)$ , can be computed numerically for moderate values of $N$ . Notice that the cases $r_{00}(N;\tau)$ , $r_{0\pm}(N;\tau)$ , and $r_{\pm 0}(N;\tau)$ are only relevant for the boundary intervals $J_{0}(\mu)$ and $J_{K(\mu)}(\mu)$ and turn out to be immaterial for the asymptotic minimax risk. Thanks to symmetries, the only relevant cases are $r_{++}(N;\tau)$ and $r_{+-}(N;\tau)$ . The results of a numerical computation for these quantities are shown in Figure 12. These calculations suggest $r_{++}(N;\tau)\geq r_{+-}(N;\tau)$ , which is indeed consistent with intuition as boundary conditions $++$ induce a larger bias. Also, it is easy to prove that $r_{+-}(N;\tau)\to 0$ as $\tau\to\infty$ (as $\tau\to\infty$ , $\eta^{tv}(y;\tau)$ converges to a constant vector).

Using the last Lemma, and proceeding as in the proof of Theorem 4.1, it is immediate to obtain a characterization of the minimax MSE of the total variation denoiser. For technical reasons, we need to introduce the class ${\cal F}_{N,{\varepsilon},TV}(L)$ of vectors in ${\cal F}_{N,{\varepsilon},TV}$ with distance at most $L$ between changepoints.

The asymptotic minimax MSE of total variation denoiser

We omit this proof since it is an immediate generalization of the one of Theorem 4.1. Notice that $M_{L}({\varepsilon})$ is monotone increasing in $L$ and hence admits a limit as $L\to\infty$ . We expect that

This limit can be evaluated numerically and in Figure 12 we plot the resulting minimax risk.

Notice that, by properly modifying Eq. (5.9), one obtains the minimax risk over subsets of ${\cal F}_{N,{\varepsilon}}$ with constrained change point distributions. For instance, we can consider the case in which the lengths between change points are distributed as for uniformly random change points, and increase/decrease points are alternating. We then get

We consequently define the random changepoint minimax risk as

This curve is plotted in Figure 12 for comparison.

5.2 Empirical phase transition behavior

We implemented the AMP iteration (1.7)-(1.9), using the total variation denoiser $\eta(\,\cdot\,;\tau,\sigma)=\eta^{tv}(\,\cdot\,;\tau,\sigma)$ . For the latter, we used the software package tvdip (in the Matlab implementation) or the projected Newton method [VO96] (in the Java implementation).

For $x\in{\bf R}^{N}$ , let $K_{0}(x)$ denote the number of constant segments in $x$ or, equivalently, the number of change points in $x$ , plus one. We then have the following expression for the memory term in Eq. (1.7)-(1.9):

We will refer to this specific version of AMP as TV-AMP.

We carried out two types of experiments. In the first class of experiments we considered signals $x$ with distances between change points distributed as

This is the same distribution as if each position is independently an increase point or a decrease point, each with probability ${\varepsilon}/2$ . The predicted phase transition curve is given by Eq. (5.10) and the minimax value of $\tau$ is used in the AMP implementation. The simulation results are presented in Figure 13 for $N=200$ , $500$ and show good agreement between predictions and observations. In this case we used the Hamming metric (4.8), because the norm of the typical signal $\|x_{0}\|_{2}^{2}$ scales super-linearly in $N$ .

Characterization of the phase transition using state evolution

In this section we prove the basic relation (1.13) between minimax mean square error in denoising and the phase-transition boundary in the sparsity-undersampling plane. Our proof assumes that the state evolution formalism developed in [DMM09, DMM10, DMM11] holds, in the precise terms stated below. This formalism was established rigorously for separable denoisers (under additional regularity assumptions) in [BM11a].

A crucial observation for state evolution is that the mean squared error of the AMP reconstruction $x^{t}$ at iteration $t$ is practically non-random for large system sizes $N$ , and has a well-defined limit as $N\to\infty$ . In particular, the limit

exists almost surely (here we assume $n\to\infty$ , while $n/N\to\delta$ ). Moreover, the evolution of $m_{t}$ with increasing $t$ is dictated by a formula $m_{t+1}=\Psi(m_{t})$ which is explicitly computable, and defined below. We will use the term state evolution to refer both to the mapping $m\mapsto\Psi(m)$ and to the sequence $\{m_{t}\}_{t\geq 0}$ with appropriate initial condition. State evolution allows one to determine whether AMP recovers the signal $x_{0}$ correctly, by simply checking whether $m_{t}\rightarrow 0$ as $t\rightarrow\infty$ (in which case the MSE vanishes asymptotically) or not. The latter problem does in turn reduce to a problem in real analysis.

The papers [DMM09, DMM10, DMM11] developed the state evolution framework for separable denoisers and verified its predictions numerically for three specific examples (namely for the shrinkers Soft, SoftPos, and Cap). However, the heuristic argument presented in those papers was much more general. Indeed, [BM11a] proved that state evolution holds, in a precise asymptotic sense, for Gaussian measurement matrices $A$ with iid entries and generic separable denoisers, under mild regularity assumptions. A generalization to non-Gaussian entries was subsequently proved in [BLM12].

Here we generalize this approach to non-separable denoisers $\eta(\,\cdot\,;\tau,\sigma):{\bf R}^{N}\mapsto{\bf R}^{N}$ . This framework covers all the shrinkers discussed in Sections 2 to 5, and yields a formal proof of the main formula (1.13), under the assumption that indeed state evolution is correct in this broader context. We will throughout assume the scaling relation (1.3).

In the next sections we will first introduce some basic notations and facts about state evolution. Then we will prove the phase transition expression (1.13) by establishing first a lower bound and then a matching upper bound, both given by the minimax MSE.

The next definition provides the suitable generalization of the state evolution mapping to the present setting.

For given $\delta,\tau\geq 0$ , and $\nu=\{\nu_{N}\}_{N\in{\bf N}}$ a sequence of probability distributions over ${\bf R}^{N}$ , define the state evolution mapping $\Psi(\,\cdot\,;\delta,\tau,\nu):{\bf R}\mapsto{\bf R}$ by

whenever the limit on the right-hand side exists. Here, as before, ${\bf X}$ and ${\bf Z}$ are independent vectors, ${\bf X}\sim\nu_{N}$ , ${\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ . In other words, $\Psi(m;\delta,\tau,\nu)$ is the per-coordinate MSE of denoiser $\eta$ at noise level $\sigma^{2}\equiv m/\delta$ .
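To make the recursion $m_{t+1}=\Psi(m_{t})$ concrete, the sketch below iterates it in the simplest separable case: scalar soft thresholding with threshold $\tau\sigma$ at noise level $\sigma^{2}=m/\delta$ , and a two-point prior placing mass ${\varepsilon}$ at a value $\mu$ and mass $1-{\varepsilon}$ at zero. The prior, the initial condition $m_{0}={\mathbb E}X^{2}$ , the numerical parameters, and all function names are illustrative assumptions; the closed-form risk of soft thresholding under unit Gaussian noise is standard.

```python
import math

def _phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    """Standard Gaussian distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def soft_risk(mu, tau):
    """MSE of soft thresholding at threshold tau, for mean mu and unit noise variance."""
    return (1.0 + tau * tau
            + (mu * mu - 1.0 - tau * tau) * (_Phi(tau - mu) - _Phi(-tau - mu))
            - (tau - mu) * _phi(tau + mu)
            - (tau + mu) * _phi(tau - mu))

def psi(m, delta, tau, eps, mu):
    """State evolution map: per-coordinate MSE at noise level sigma^2 = m / delta."""
    if m <= 0.0:
        return 0.0
    sigma = math.sqrt(m / delta)
    # Scaling relation: threshold tau * sigma, signal rescaled to mu / sigma.
    return sigma * sigma * ((1.0 - eps) * soft_risk(0.0, tau)
                            + eps * soft_risk(mu / sigma, tau))

def state_evolution(delta, tau, eps, mu, iters=200, tol=1e-12):
    """Iterate m_{t+1} = Psi(m_t) from m_0 = E X^2; stop early once m_t < tol."""
    m = eps * mu * mu
    for _ in range(iters):
        m = psi(m, delta, tau, eps, mu)
        if m < tol:
            break
    return m
```

With ${\varepsilon}=0.1$ and $\tau=1.5$ , running `state_evolution` at a large and at a small value of $\delta$ shows the two regimes: the sequence either decays geometrically to zero or settles at a strictly positive fixed point.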

The fixed points of the mapping $m\mapsto\Psi(m)$ play of course a crucial role in the analysis of state evolution.

The highest fixed point of the mapping $\Psi(\,\cdot\,)=\Psi(\,\cdot\,;\delta,\tau,\nu)$ is defined as ${\rm HFP}(\Psi)\equiv\sup\{m:\Psi(m)\geq m\}$ .

The importance and applicability of this notion are underscored by the next two observations. Here and below we say that a function $f:{\bf R}_{+}\to{\bf R}$ is starshaped if $x\mapsto f(x)/x$ is decreasing.

Suppose that $m_{0}>0$ and any one of these three conditions holds:

$\Psi(m)$ is an increasing function of $m$ , and the initial condition of state evolution satisfies $m_{0}\geq{\rm HFP}(\Psi)$ ;

$\Psi(m)$ is an increasing starshaped function of $m$ .

Then state evolution converges to the highest fixed point:

$$\lim_{t\to\infty}m_{t}={\rm HFP}(\Psi)\,.$$

Further, if ${\rm HFP}(\Psi)>0$ and $\Psi(m)$ is a starshaped function of $m$ , then $\liminf_{t\rightarrow\infty}m_{t}>0$ .

The proof is a standard calculus exercise; we omit it.
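The behavior described in the lemma can be illustrated numerically in the simplest separable case. The following is a minimal sketch, not the code used in the paper: it assumes the scalar soft thresholding denoiser with threshold $\tau\sigma$, the two-point prior $(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu}$, and the standard closed-form expression for the soft thresholding risk. The function names (`soft_risk`, `psi`, `run_state_evolution`) and the initial condition $m_{0}={\varepsilon}\mu^{2}$ are illustrative choices.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def soft_risk(mu, tau):
    """E[(soft(mu + Z; tau) - mu)^2] for Z ~ N(0, 1), in closed form."""
    return (1.0 + tau * tau
            + (mu * mu - tau * tau - 1.0) * (Phi(tau - mu) - Phi(-tau - mu))
            - (tau - mu) * phi(tau + mu)
            - (tau + mu) * phi(tau - mu))


def psi(m, delta, tau, eps, mu):
    """State evolution map for the prior (1 - eps) delta_0 + eps delta_mu."""
    if m <= 0.0:
        return 0.0
    sigma = math.sqrt(m / delta)  # noise level: sigma^2 = m / delta
    # By the scaling relation, the MSE at noise level sigma equals sigma^2
    # times the unit-noise risk at the rescaled signal mu / sigma.
    return sigma * sigma * ((1.0 - eps) * soft_risk(0.0, tau)
                            + eps * soft_risk(mu / sigma, tau))


def run_state_evolution(delta, tau, eps, mu, iters=200):
    """Iterate m_{t+1} = Psi(m_t), starting from m_0 = E[X^2] = eps * mu^2."""
    m = eps * mu * mu
    for _ in range(iters):
        m = psi(m, delta, tau, eps, mu)
    return m
```

With ${\varepsilon}=0.05$, $\tau=1.4$ and $\mu=1$, calling `run_state_evolution(0.25, 1.4, 0.05, 1.0)` drives $m_{t}$ to zero, whereas `run_state_evolution(0.15, 1.4, 0.05, 1.0)` settles at a strictly positive fixed point, in line with the two regimes discussed in the text.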

The function $m\mapsto\Psi(m)$ is starshaped for all of the following choices of the denoiser $\eta$ :

The proof of this Lemma is deferred to Appendix I.

2 State Evolution Phase Transition

Consider a collection ${\cal F}_{N,{\varepsilon}}$ of probability distributions over ${\bf R}^{N}$ , indexed by ${\varepsilon}\in[0,1]$ as per Section 1.1 (these do not need to be simple sparse signals). For a sequence of probability distributions $\nu=\{\nu_{N}\}_{N\in{\bf N}}$ we write $\nu\in{\cal F}_{{\varepsilon}}$ if $\nu_{N}\in{\cal F}_{N,{\varepsilon}}$ for all $N$ and the limit in Eq. (6.1) exists for each $m\in{\bf R}_{+}$ . Letting ${\rm HFP}(\delta,\tau,\nu)={\rm HFP}(\Psi(\;\cdot\;;\delta,\tau,\nu))$ , we consider the minimax value:

For ${\varepsilon}\in[0,1]$ , define the state evolution phase transition as

Note that ${\rm HFP}^{*}({\varepsilon},\delta)$ is monotone decreasing as a function of $\delta$ , by definition of the state evolution mapping $\Psi$ , cf. Eq. (6.1). It follows that ${\rm HFP}^{*}({\varepsilon},\delta)=0$ for $\delta>\delta_{SE}({\varepsilon})$ and ${\rm HFP}^{*}({\varepsilon},\delta)>0$ for $\delta<\delta_{SE}({\varepsilon})$ . Further, by nestedness, it is monotone increasing as a function of ${\varepsilon}$ , which implies that ${\varepsilon}\mapsto\delta_{SE}({\varepsilon})$ is monotone increasing. The rationale for this definition is that, for $\delta>\delta_{SE}({\varepsilon})$ and under any of the assumptions of Lemma 6.1, state evolution predicts that AMP will correctly recover the signal $x_{0}$ .

Let ${\cal F}_{N,{\varepsilon}}$ be a nested, scale-invariant collection of probability distributions, and assume that the shrinker $\eta(\,\cdot\,;\tau,\sigma)$ obeys the scaling relation (1.3). Define the minimax MSE $M({\varepsilon}|\eta)$ as per Eq. (1.5). Then

$$\delta_{SE}({\varepsilon}|\eta)=M({\varepsilon}|\eta)\,.$$

In order to prove this result, we shall first establish a more general fact. Given a sequence of distributions $\nu=\{\nu_{N}\}_{N\in{\bf N}}$ and, for any $\tau\in\Theta$ , we let

The rationale for this definition is clear. Under the assumption that state evolution holds, for a signal $x_{0}$ sampled from distribution $\nu_{N}$ , AMP (with tuning parameter $\tau$ ) is guaranteed to reconstruct $x_{0}$ if and only if $\delta>\delta_{SE}(\tau,\nu|\eta)$ .

the normalized MSE for denoising with worst case signal-to-noise ratio. With these definitions we have the following.

For any sequence of probability measures $\nu=\{\nu_{N}\}_{N\geq 1}$ , and any $\tau\in\Theta$ , we have

We are now in position to prove Theorem 6.1.

Throughout the proof we drop the argument $\eta$ , since this is kept constant.

Define $\delta_{*}({\varepsilon})\equiv\inf_{\tau\in\Theta}\sup_{\nu\in{\cal F}_{{\varepsilon}}}\delta(\tau,\nu)$ . We claim that $\delta_{*}({\varepsilon})=\delta_{SE}({\varepsilon})$ . Indeed choose $\delta\in[0,\delta_{SE}({\varepsilon}))$ . Then by definition there exists $m>0$ such that, for all $\tau\in\Theta$ , there exists $\nu\in{\cal F}_{{\varepsilon}}$ with ${\rm HFP}(\delta,\tau,\nu)>m$ . Hence, for all $\tau\in\Theta$ , there exists $\nu\in{\cal F}_{{\varepsilon}}$ with $\delta(\tau,\nu)\geq\delta$ . This implies that, for all $\tau\in\Theta$ , $\sup_{\nu\in{\cal F}_{{\varepsilon}}}\delta(\tau,\nu)>\delta$ , i.e. $\delta_{*}({\varepsilon})\geq\delta$ . Proceeding in the same way, it is immediate to prove that, for any $\delta\in[0,\delta_{*}({\varepsilon}))$ , we have $\delta_{SE}({\varepsilon})>\delta$ . Hence, we conclude that $\delta_{SE}({\varepsilon})=\delta_{*}({\varepsilon})$ as claimed.

To conclude the proof, we note that, by Lemma 6.3, we have

where the last equality follows from the scale invariant property of ${\cal F}_{N,{\varepsilon}}$ . The last quantity is nothing but $M({\varepsilon})$ . ∎

3 Non-convergence of state evolution

Lemmas 6.1.(b) and 6.2 immediately imply that the answer is positive for soft thresholding, positive soft thresholding, block soft thresholding, James-Stein shrinkage, monotone regression and total variation denoising. It turns out that the answer is positive also for firm thresholding and the global minimax denoiser. In Appendix E we describe the argument for these cases.

Phase transitions for other algorithms

Formula (1.13) connects an algorithmic property – phase transitions of AMP recovery algorithms – with a property from statistical decision theory – minimax mean squared error in denoising.

We want to explore a further connection, relating the behavior of convex optimization with that of certain AMP algorithms. As proved in [BM11b], in the large system limit AMP with soft thresholding denoiser effectively computes the solution to

for an appropriately calibrated $\lambda=\lambda(\tau)$ .

More generally, consider a generalized reconstruction method of the form

where $J:{\bf R}^{N}\to{\bf R}$ is a convex penalization. To this reconstruction problem, we can associate an AMP algorithm, by using the denoiser $\eta^{J}(\,\cdot\,;\tau)$ in Eq. (1.7), (1.8), (1.9), whereby

(we also let $\eta^{J}(\,\cdot\,;\tau,\sigma)=\eta^{J}(\,\cdot\,;\tau\cdot\sigma)$ ). In other words $\eta^{J}$ is the proximal operator of the penalization $J(\,\cdot\,)$ . We will refer to this algorithm as AMP- $J$ .
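As a concrete instance of the proximal-operator interpretation, the sketch below checks numerically that, for the scalar penalty $J(x)=|x|$, the minimizer of $\frac{1}{2}(y-x)^{2}+\theta J(x)$ coincides with soft thresholding at level $\theta$. The normalization of the objective is an assumption made for illustration (the displayed definition of $\eta^{J}$ is not reproduced here), and the helper names `soft_threshold` and `prox_by_grid` are hypothetical.

```python
def soft_threshold(y, theta):
    """Scalar soft thresholding at level theta >= 0."""
    if y > theta:
        return y - theta
    if y < -theta:
        return y + theta
    return 0.0


def prox_by_grid(y, theta, penalty, half_width=10.0, steps=200001):
    """Brute-force argmin over x of 0.5 * (y - x)^2 + theta * penalty(x).

    The search runs over a uniform grid on [-half_width, half_width], so the
    result is accurate only up to the grid spacing.
    """
    best_x, best_val = 0.0, float("inf")
    for i in range(steps):
        x = -half_width + 2.0 * half_width * i / (steps - 1)
        val = 0.5 * (y - x) ** 2 + theta * penalty(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x
```

For example, `prox_by_grid(2.5, 1.0, abs)` agrees with `soft_threshold(2.5, 1.0)` up to the grid resolution. The same brute-force routine can be applied to other scalar penalties, convex or not, although for non-convex $J$ the minimizer need not be unique.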

We then have the following general correspondence, which follows immediately by writing the stationarity condition of problem $P_{J}=P_{J}(\lambda)$ and the fixed points of AMP- $J$ (see [Mon12]).

Any fixed point $x^{\infty}$ of AMP- $J$ with fixed point parameters $\tau_{\infty}$ , ${\sf b}_{\infty}$ , $\sigma_{\infty}$ corresponds to a stationary point of $P_{J}(\lambda)$ with $\lambda$ given by

In particular, if the regularizer $J$ is convex, then fixed points correspond to minimizers.

The fixed points of AMP with positive soft thresholding denoiser are solutions of

where $\langle\,\cdot\,,\,\cdot\,\rangle$ denotes the standard scalar product over ${\bf R}^{N}$ , and $1$ is the all-one vector.

The fixed points of AMP with capping denoiser effectively are solutions of

For noiseless compressed sensing reconstruction, the correspondence involves the limit $\lambda\to 0$ of the above problem, that is

Phase transitions for such convex programs were characterized in [Don06, DT09a, DT10a] for the three examples mentioned above, using methods from combinatorial geometry. Thus, whenever AMP converges with high probability, formula (1.13) connects fundamental problems of minimax decision theory to fundamental problems in high dimensional combinatorial geometry.

We will next verify this correspondence numerically beyond the three classical examples (again focusing on noiseless measurements). In the block-structured case, we can compare block-soft AMP to the following convex optimization problem:

Analogously, we can compare monoreg AMP to the following convex optimization problem:

where $\Delta x=(\Delta x_{1},\dots,\Delta x_{N})$ , $\Delta x_{i}=(x_{i+1}-x_{i})$ . Figure 16 verifies that the phase transition occurs around the location predicted by (1.13), just as with monoreg AMP.

Finally, we can compare TV AMP to the following convex optimization problem:

where again $\|x\|_{TV}\equiv\sum_{i=1}^{N-1}|\Delta x_{i}|$ . Figure 17 verifies that the phase transition occurs at the location predicted by (1.13), just as with TV AMP.

where $J(x)=\lambda|x|$ (here we redefined $J$ to absorb the factor $\tau$ ).

It turns out that a similar interpretation can be given for any scalar denoiser (and hence for any separable denoiser as well). In particular, the minimax and firm thresholding rules $\eta^{all}$ and $\eta^{firm}(\,\cdot\,;\tau_{1},\tau_{2})$ are optimizers of penalization schemes of the same form as above, with $J(x)$ non-convex.

We can construct the penalty $J(\,\cdot\,)$ corresponding to a denoiser $\eta(\,\cdot\,)$ by observing that $x+J^{\prime}(x)=y$ at the solution $x=\eta(y)$ . Defining the residual $\Delta(y)=y-\eta(y)$ , and noting that $\eta(y)+\Delta(y)=y$ , we obtain that $\Delta(y)$ and $J^{\prime}(x)$ are related through the change of variables

A similar analysis can be carried out for block separable denoisers that are covariant under rotation, i.e. if $\eta(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}$ satisfies $\eta(Rx;\tau)=R\eta(x;\tau)$ for any rotation $R$ . We already mentioned that the block soft denoiser can be written as

An implied penalty can also be derived for block James-Stein shrinkage $\eta^{JS}$ . Due to the covariance under rotation, the corresponding penalty only depends on the modulus of $x\in{\bf R}^{B}$ . Figure 19 shows the implied penalty $J(s|B)$ as a function of $s=\|x\|_{2}$ . Namely, the penalty $J(\,\cdot\,|B):{\bf R}_{+}\to{\bf R}$ is such that, for $y\in{\bf R}^{B}$ ,

coincides with the positive-part James-Stein estimator. The implicit penalization is again nonconvex.
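The construction of an implied penalty from a scalar denoiser, namely $J^{\prime}(x)=y-\eta(y)$ evaluated at $x=\eta(y)$, can be sketched numerically as follows. This is an illustration under stated assumptions: firm thresholding is taken to be the usual piecewise-linear rule (zero below $\tau_{1}$, identity above $\tau_{2}$, linear in between), which may differ in parametrization from the definition used in the main text, and the function names are hypothetical.

```python
import math


def soft(y, tau):
    """Scalar soft thresholding."""
    return math.copysign(max(abs(y) - tau, 0.0), y)


def firm(y, tau1, tau2):
    """Assumed firm thresholding rule with 0 < tau1 < tau2."""
    a = abs(y)
    if a <= tau1:
        return 0.0
    if a <= tau2:
        return math.copysign(tau2 * (a - tau1) / (tau2 - tau1), y)
    return y


def implied_penalty_derivative(eta, y_max=6.0, steps=6001):
    """Return pairs (x, J'(x)) with x = eta(y) > 0 and J'(x) = y - eta(y)."""
    pairs = []
    for i in range(steps):
        y = y_max * i / (steps - 1)
        x = eta(y)
        if x > 0.0:
            pairs.append((x, y - x))
    return pairs
```

For soft thresholding the recovered derivative is constant and equal to $\tau$, i.e. $J(x)=\tau|x|$. For the firm rule assumed here it equals $\tau_{1}(1-x/\tau_{2})_{+}$ for $x>0$, which is decreasing in $x$, so the implied penalty is concave on $(0,\tau_{2})$ and hence non-convex.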

Appendix A Classical cases

The general formula (1.13) was already validated in [DMM09] for the following three classical cases:

Simple sparse signals from the class ${\cal F}_{N,{\varepsilon}}$ (cf. Eq. (1.2)) and soft thresholding denoiser.

Non-negative sparse signals with soft positive thresholding denoiser, cf. Example 1.3.

Box constrained signals with capping denoiser, cf. Example 1.3.

Analytical expressions for the phase transition curves were computed using state evolution in the Online Supplement to [DMM09]. We review the results here since they provide a useful stepping stone for understanding more complicated cases (cf., e.g., Section 6).

Notice that the last of the examples above (box-constrained signals) is not scale invariant, according to the general definition of Section 1.1. For non scale-invariant classes, the definitions (1.4) and (1.5) are generalized by taking the supremum over the noise covariance as well. Namely, for a generic class ${\cal F}_{N,{\varepsilon}}$ , we let

It is easy to check that this definition coincides with the earlier one for scale-invariant classes. We will write $M({\varepsilon}|{\rm Cap})$ instead of $M({\varepsilon}|\eta^{cap})$ for box constrained signals with capping denoiser, i.e. case $(iii)$ above. The result for case $(iii)$ is further elucidated by comparing it with the following scale invariant problem:

We consider sparse non-negative signals $x_{0}\geq 0$ , modeled through the class ${\cal F}_{N,{\varepsilon},+}$ , and the simple positive part denoiser $\eta^{+}(y)\equiv\max(y,0)$ . We denote the corresponding minimax risk by $M({\varepsilon}|{\rm Pos})$ .

The minimax risk for examples $(i)$ , $(ii)$ , $(iv)$ can be computed explicitly. Indeed in both cases we have

Here $X\sim\nu$ and $Z\sim{\sf N}(0,1)$ are independent random variables. The second equality follows from the remark that the extremal distributions in ${\cal F}_{1,{\varepsilon}}$ and ${\cal F}_{1,{\varepsilon},+}$ are mixtures of two point distributions. This remark reduces the calculation of the minimax risk to a simple calculation [DMM09], whose results are summarized below.

The minimax risk for the problems stated above is

(The first two are parametric expressions in $\tau$ , which is the optimal threshold at the given sparsity level.)

Also, the AMP phase transition for the noiseless reconstruction problem is given in all of these cases by $\delta=M({\varepsilon}|\eta)$ . In other words AMP succeeds with high probability if $\delta>M({\varepsilon}|\eta)$ and fails with high probability if $\delta<M({\varepsilon}|\eta)$ .
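For case $(i)$ the minimax risk can be evaluated in a few lines. The sketch below assumes the standard reduction for soft thresholding: the worst case over two-point mixtures is attained as $\mu\to\infty$, so that the risk at threshold $\tau$ is $(1-{\varepsilon})\,r(0;\tau)+{\varepsilon}(1+\tau^{2})$ with $r(0;\tau)=2[(1+\tau^{2})\Phi(-\tau)-\tau\phi(\tau)]$, and the minimization over $\tau$ is done by a plain grid search. The function names are illustrative, and the grid resolution limits the accuracy.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def worst_case_soft_risk(eps, tau):
    """Risk of soft thresholding at tau under the least favorable mixture."""
    risk_at_zero = 2.0 * ((1.0 + tau * tau) * Phi(-tau) - tau * phi(tau))
    risk_at_infinity = 1.0 + tau * tau
    return (1.0 - eps) * risk_at_zero + eps * risk_at_infinity


def minimax_soft(eps, tau_max=6.0, steps=60001):
    """Grid search over tau; returns (minimax risk, minimizing threshold)."""
    best_val, best_tau = float("inf"), 0.0
    for i in range(steps):
        tau = tau_max * i / (steps - 1)
        val = worst_case_soft_risk(eps, tau)
        if val < best_val:
            best_val, best_tau = val, tau
    return best_val, best_tau
```

Under these assumptions, `minimax_soft(0.05)` returns a risk of approximately $0.204$ at a threshold close to $\tau=1.4$, which is the predicted phase transition location for ${\varepsilon}=0.05$.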

As mentioned above, the calculation of the minimax risk is a calculus exercise, and follows the same lines as in [DMM09]. This coincides with the AMP threshold by the general analysis of Section 6 for cases $(i)$ , $(ii)$ , $(iv)$ . For the non-scale invariant case, we refer, once more, to [DMM09].

Notice that problems $(iii)$ and $(iv)$ have the same minimax risk. This identity mirrors a result in [DT10a] that characterizes the phase transition threshold for reconstructing $x_{0}\in{\cal S}_{N}\subseteq{\bf R}^{N}$ from noiseless linear measurements $y=Ax_{0}$ , with ${\cal S}_{N}=[0,1]^{N}$ or ${\cal S}_{N}={\bf R}_{+}^{N}$ . If a simple feasibility linear program is used (namely, find any $x\in{\cal S}_{N}$ with $y=Ax$ ), then the undersampling threshold for both problems is given by $\delta=(1+{\varepsilon})/2$ .

Appendix B Calculation of minimax MSE

In this Appendix we describe the calculation of the global minimax risk over the class ${\cal F}\equiv{\cal F}_{1,{\varepsilon}}$ , as defined per Eq. (2.3). In particular, we will explain how the values in Table 1 and Figure 1 for $M({\varepsilon}|{\rm Minimax})$ have been computed.

Here $\eta^{\nu}$ is the posterior mean estimator for prior $\nu$ , $\gamma$ is the Gaussian measure $\gamma({\rm d}x)=\phi(x){\rm d}x$ , $\phi(x)=e^{-x^{2}/2}/\sqrt{2\pi}$ , $\gamma\star\nu$ denotes the convolution of measures, and $I$ denotes the Fisher information. For a probability measure $\nu_{f}({\rm d}x)=f(x){\rm d}x$ , with density $f$ with respect to the Lebesgue measure, this is defined as

Further, if ${\cal F}$ is convex and weakly compact, the set of probability measures $\{\gamma\star\nu:\;\nu\in{\cal F}\}$ is also convex and weakly compact. It follows from [Hub64, Theorem 4] that the $\inf$ in Eq. (B.1) is achieved at a unique point $\nu=\nu^{*}$ . Hence

where we defined $I^{*}=I^{*}({\varepsilon})=I(\gamma\star\nu^{*})$ . The unique minimizer $\nu^{*}$ is known as the least favorable distribution. The minimax optimal denoiser (achieving the $\inf$ over $\eta$ in Eq. (B.1)) is the posterior expectation with respect to the prior $\nu^{*}$ .

Bickel and Collins [BC83, Theorem 1] prove that, under suitable assumptions on the class ${\cal F}$ , the least favorable distribution is a mixture of point masses

where $\sum_{i}\alpha_{i}=1$ , $\alpha_{i}>0$ and the sequence $\{\mu_{i}\}_{i\in{\bf Z}}$ has no accumulation points except, possibly, at $\pm\infty$ . As mentioned above, the minimax denoiser is the posterior expectation associated with the prior $\nu^{*}$ . By Tweedie’s formula, this takes the form

where $\psi^{*}$ is the so-called score function

and $f^{*}$ is the density of $\nu^{*}\star\gamma$ .

We now focus specifically on the class ${\cal F}={\cal F}_{1,{\varepsilon}}$ . This case is covered by the general theory of [BC83], and corresponds to their example $(ii)$ . Without loss of generality we can assume that the $\mu_{i}$ are monotone increasing, with $\mu_{-i}=\mu_{i}$ , and that $\mu_{0}=0$ , with $\alpha_{0}=(1-{\varepsilon})$ . A conjecture of Mallows [Mal78] states that in fact we may take

In other words, the conjectured least favorable distribution has the form of a two-sided geometric distribution on a scaled copy of ${\bf Z}$ . While the conjecture has not been proved, Mallows [Mal78] provided an argument (based on an analogous problem in robust estimation [Hub64]) suggesting that it indeed captures the correct tail behavior.

For estimating the minimax risk numerically, we choose a large parameter $K$ and assume a generalized “Mallows form” for $|i|>K$ . More precisely, we assume an equispaced grid, and geometrically decaying weights. This is a little more general than what Mallows proposed, having 3 total degrees of freedom (spacing, total weight and rate of decay), rather than two. For $-K\leq i\leq K$ , we allow the parameters $\alpha_{i}$ and $\mu_{i}$ to vary freely. In this way we obtain a parametric family $\nu_{\theta}$ of probability distributions, with parameter $\theta=((\alpha_{i})_{i\in[K]},(\mu_{i})_{i\in[K]},c_{0},c_{1},\lambda)$ , with

The quantity $I^{+}=I^{+}({\varepsilon})$ can be estimated numerically, and provides an upper bound on $I^{*}$ . We used up to $K=50$ and checked that the resulting $I^{+}$ is insensitive to this choice. Notice that the choice of the Mallows form for $|i|>K$ is immaterial for two reasons: $(i)$ As a consequence of [BC83] and of the weak continuity of Fisher information, $I^{+}$ should be insensitive to the tail behavior of the distribution $F$ ; $(ii)$ We are only using it to derive an upper bound.

In order to get a lower bound on $I^{*}$ , we use Huber’s minimax theorem [Hub64, HR09], which implies that, for any $\psi:{\bf R}\to{\bf R}$ differentiable in measure,

where in the supremum $g$ ranges over all densities of probability measures $\gamma\star\nu$ with $\nu\in{\cal F}_{1,{\varepsilon}}$ .

Let $\nu^{+}$ denote the probability measure corresponding to the optimum of the parametric optimization (B.3). Denote by $\psi^{+}$ the corresponding score function

where $f^{+}=\nu^{+}\star\gamma$ . This corresponds to a denoiser $\eta^{+}(y)=y-\psi^{+}(y)$ . Let $g^{+}$ denote a maximizing density $g$ for $J(\psi^{+},g)$ . By Huber’s theory, this can be chosen to be a two-point mixture $(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu^{+}}$ where $\mu^{+}$ is chosen to achieve the worst case value on the right side of (B.4), and we set $I_{-}=I_{-}({\varepsilon})=J(\psi^{+},g^{+})$ .

We have the bounds $I_{-}\leq I^{*}\leq I^{+}$ . Numerically, we compute integrals and extrema over fine grids with at least $100$ samples per unit of range, getting not $I^{+}$ and $I_{-}$ but instead numerical approximations $\widetilde{I}^{+}$ and $\widetilde{I}_{-}$ . Table 3 presents some information about numerical approximation results, which may help the reader assess its accuracy for small values of $K$ . Some minimizing distributions obtained in this way are shown in Figure 20; the mass points $(\mu_{i})$ are displayed in Figure 21.

Our numerical results, showing that $\widetilde{I}_{-}\approx\widetilde{I}^{+}$ , allow us to infer that the Mallows form is approximately correct. The denoiser that we actually apply in our estimation and compressed sensing experiments is:

Appendix C Convergence properties of AMP

Throughout the paper we checked convergence of AMP by imposing a threshold on the reconstruction accuracy and the number of iterations. For instance, in the case of separable denoisers, cf. Section 2.4, we declared the reconstruction successful if

for a certain choice of $t$ and $\gamma>0$ . In particular, the results presented correspond to $\gamma=0.01$ and $t=300$ .

It is natural to ask how to choose $\gamma$ and $t$ , and whether different choices of $\gamma$ and $t$ would lead to significantly different estimates of the phase transition boundary. It turns out that the empirical phase transition is fairly insensitive to these choices for the cases considered here, as soon as $t\gtrsim 100$ is sufficiently large and $\gamma\lesssim 0.05$ . This insensitivity is related to the convergence properties of AMP. Indeed both theory and empirical evidence [DMM09, DMM11] indicate exponential convergence. Namely, for all $\delta>M({\varepsilon}|\eta)$ , there exist dimension independent constants $C=C(\delta)$ , $b=b(\delta)>0$ such that, with high probability,

On the other hand, for $\delta<M({\varepsilon}|\eta)$ , $\|\widehat{x}^{t}-x_{0}\|_{2}^{2}\geq c(\delta)n$ , with high probability.

Figure 22 presents data that confirm this behavior (further numerical evidence can be found in [DMM09]). The data refer to simple sparse signals with ${\varepsilon}=0.05$ , and soft thresholding denoising. The curves correspond to several values of $\delta$ close to the predicted phase transition location $\delta=M({\varepsilon}|\eta)\approx 0.2039$ . Notice the clear exponential decay of the error for $\delta>M({\varepsilon}|\eta)$ and a large constant mean square error for $\delta<M({\varepsilon}|\eta)$ .

If the phase transition has to be determined with relative accuracy $\Delta$ , this suggests the rule of thumb $\exp\{-b(\delta_{*}+\Delta)t\}\leq\gamma$ and $c(\delta_{*}-\Delta)\geq\gamma$ . We verified that these conditions are satisfied by our choices of $t$ and $\gamma$ .

Appendix D Calculation of minimax MSE for block soft thresholding

The argument is analogous to the one for the scalar case (corresponding to $B=1$ ) treated in Appendix A. For $\mu\in{\bf R}^{B}$ and $\tau\in{\bf R}_{+}$ , define the risk at $\mu$ as

where ${\bf X}\sim\nu$ and ${\bf Z}\sim{\sf N}(0,{\bf I}_{B\times B})$ . Since the two point mixtures are the extremal distributions in ${\cal F}_{{\varepsilon},B}$ , we have

By the definition of chi-square distribution, we have

It follows from the invariance of the distribution of ${\bf Z}$ under rotations that $R(\mu;\tau)$ only depends on $\mu$ through its norm $\|\mu\|$ . Further, as proved in Appendix I.1, $R(\mu;\tau)$ is increasing in $\|\mu\|$ , and

At this point the problem is reduced to a calculus exercise.

D.2 Proof of Lemma 3.3

In this appendix we consider the asymptotics for large block size $B\to\infty$ . It is easy to show that the minimax threshold level $\tau$ must be of order $\sqrt{B}$ . By a compactness argument, we can assume $\tau=c\,\sqrt{B}$ for some $c$ to be determined. Define the risk as in Eq. (D.1) (note that this depends implicitly on $B$ ) and the normalized risk as $\widetilde{R}_{B}(\mu;\tau)=R(\mu;\tau)/B$ . We claim that

Assuming these claims hold, we have, by Eq. (D.2),

Calculus shows that the minimum is achieved at $c^{*}=1-{\varepsilon}$ , whence

which coincides with the statement of Lemma 3.3.
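The final minimization over $c$ is easy to check numerically. The sketch below assumes that, after inserting the two limits $\widetilde{R}_{B}(0;c\sqrt{B})\to(1-c)_{+}^{2}$ and $\widetilde{R}_{B}(\infty;c\sqrt{B})=1+c^{2}$ , the limiting objective is $(1-{\varepsilon})(1-c)_{+}^{2}+{\varepsilon}(1+c^{2})$ ; the function names and the grid search are illustrative only.

```python
def limiting_objective(c, eps):
    """Assumed large-B limit of the normalized worst-case risk at tau = c sqrt(B)."""
    return (1.0 - eps) * max(1.0 - c, 0.0) ** 2 + eps * (1.0 + c * c)


def minimize_over_c(eps, c_max=3.0, steps=30001):
    """Grid search over c in [0, c_max]; returns (minimizer, minimum value)."""
    best_c, best_val = 0.0, float("inf")
    for i in range(steps):
        c = c_max * i / (steps - 1)
        val = limiting_objective(c, eps)
        if val < best_val:
            best_c, best_val = c, val
    return best_c, best_val
```

Under this assumption the grid minimizer agrees with $c^{*}=1-{\varepsilon}$ , and the minimum value equals ${\varepsilon}(2-{\varepsilon})$ , as obtained by substituting $c^{*}$ into the objective.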

In order to complete the proof, we have to prove claims (D.3) and (D.4). The second one is immediate because Lemma I.1 implies $\widetilde{R}_{B}(\infty;c\sqrt{B})=1+c^{2}$ .

The limit (D.3) follows instead from the central limit theorem. Indeed, let $X_{B}$ denote a central chi-squared with $B$ degrees of freedom. Its square root is the norm of a standard Gaussian random vector in dimension $B$ , and concentrates around $\sqrt{B}$ . Indeed by the central limit theorem $\sqrt{2}(\sqrt{X_{B}}-\sqrt{B})\Rightarrow_{D}{\sf N}(0,1)$ as $B\rightarrow\infty$ . Therefore, we have

and the latter converges to $(1-c)_{+}^{2}$ by dominated convergence.
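The concentration argument can be checked by simulation. The following Monte Carlo sketch estimates the normalized risk at $\mu=0$ , which for block soft thresholding is ${\mathbb E}\{(\|{\bf Z}\|_{2}-\tau)_{+}^{2}\}/B$ with $\tau=c\sqrt{B}$ , by sampling $\|{\bf Z}\|_{2}^{2}$ directly as a chi-squared variable with $B$ degrees of freedom (a Gamma variable with shape $B/2$ and scale $2$ ). The function name, sample size and seed are arbitrary illustrative choices.

```python
import math
import random


def normalized_risk_at_zero(B, c, samples=4000, seed=0):
    """Monte Carlo estimate of E[(||Z|| - c sqrt(B))_+^2] / B for Z ~ N(0, I_B)."""
    rng = random.Random(seed)
    tau = c * math.sqrt(B)
    total = 0.0
    for _ in range(samples):
        chi2 = rng.gammavariate(B / 2.0, 2.0)  # chi-squared with B degrees of freedom
        excess = max(math.sqrt(chi2) - tau, 0.0)
        total += excess * excess
    return total / (samples * B)
```

For large $B$ the estimate should be close to $(1-c)_{+}^{2}$ : for instance, with $B=10^{4}$ it is close to $0.25$ for $c=0.5$ and essentially zero for $c=1.5$ .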

Appendix E Non-convergence of state evolution

We begin by developing a lower bound that holds for all the denoisers $\eta$ studied in this paper. For notational simplicity, we consider in fact separable denoisers, but the result is easily seen to hold in general, provided that the signal class is scale-invariant.

Let $\nu_{{\varepsilon},\mu}$ denote the mixture $(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu}$ where $\mu\in{\bf R}_{+}\cup\{\infty\}$ can be either finite or infinite. Define the risk at $\mu$ as

where $X\sim\nu_{{\varepsilon},\mu}$ is independent of $Z\sim{\sf N}(0,1)$ .

We say that the risk function $R$ is super-quadratic on $[0,\mu_{*})$ if, for any $\mu\in[0,\mu_{*})$ ,

The next result shows that superquadratic behavior of the risk function implies non convergence of state evolution for signal distribution $\nu_{{\varepsilon},\mu}$ .

(State Evolution Non-Convergence) Fix any $\tau\in\Theta$ , and assume there exists $\mu_{*}>0$ such that: $(i)$ The risk function $R(\mu;\tau)$ is superquadratic on $[0,\mu_{*})$ , $(ii)$ $R(\mu_{*})\geq\delta$ , and $(iii)$ $\delta\geq{\varepsilon}$ .

Let $\Psi(m)=\Psi(m;\delta,\tau,\nu_{{\varepsilon},\mu})$ . Then there is $m_{\rm fp}>0$ such that

Define ${m_{\rm fp}}=(\mu/\mu_{*})^{2}$ and assume $m\geq{m_{\rm fp}}$ . This implies $\mu/\sqrt{m}\leq\mu_{*}$ , and, since $R$ is superquadratic by assumption,

For firm and minimax denoisers $\eta\in\{\eta^{firm},\eta^{all}\}$ , we took a fine grid of ${\varepsilon}$ and at each fixed ${\varepsilon}$ evaluated $R(\mu;\tau)$ on a fine grid of $\mu$ , checking the inequality (E.1). In the case of firm thresholding we used the minimax threshold values $\tau=\tau^{*}({\varepsilon})$ . We further used the least favorable $\mu$ , $\mu=\mu^{*}({\varepsilon})$ . Sample results are presented in Figures 23 and 24.

These computations show that the risk function $R(\mu;\tau^{*}({\varepsilon}))$ is superquadratic on $(0,\mu^{*}({\varepsilon}))$ .

Appendix F Monotone regression

where we recall that $\Delta v_{i}=v_{i+1}-v_{i}$ is the discrete derivative. (Of course this problem does not provide an algorithm since it requires knowledge of $\Delta\mu_{i}$ , but here we are interested in it only for analysis purposes.)

As $t\to\infty$ , all the constraints $\Delta v_{i}\geq-t\Delta\mu_{i}$ for which $\Delta\mu_{i}>0$ become irrelevant. We are naturally led to defining the following localized monotone regression problem

(Here and below we omit the dependence of $I_{0}$ , $I_{+}$ on $\mu$ .) Let $\eta^{lmono}(z;I_{0})$ denote the solution of $(Q_{lmono})$ with data $z,I_{0}$ . The above discussion implies that, for $z={\bf Z}\sim{\sf N}(0,{\bf I}_{N\times N})$ , we have the following limit in probability:

In words, the risk ‘at infinity’ of monotone regression is simply given by the local risk.

In order to conclude the proof, it is sufficient to show that $R^{loc}(I_{0})$ is given by the right-hand side of Eq. (4.4). It is easy to check that the problem $(Q_{lmono})$ separates into independent optimization problems, one for each $J_{k}$ . Namely, for $i\in J_{k}$ , $v_{i}$ can be found by solving the following smaller problem

where $v_{J_{k}}=(v_{i_{k}+1},\dots,v_{i_{k+1}})$ and, if the segment $J_{k}$ is a singleton, the constraint disappears. Let $v_{J_{k}}(z_{J_{k}})$ be the solution of this problem. Then,

F.2 Proof of Lemma 4.2

Throughout the proof, we denote by $x=x({\bf Z})=\eta^{mono}(z={\bf Z})$ the solution of the monotone regression problem

Clearly $r(1)=1$ since in this case there is no monotonicity constraint and the solution of the regression problem is simply $x=z$ . In order to prove Eq. (4.5), let $k\in I_{+}(x)$ (i.e. an increase point: $x_{k}<x_{k+1}$ ) and define $r^{(k)}\in{\bf R}^{N}$ by letting $r^{(k)}_{i}=1_{\{i>k\}}$ , and $l^{(k)}\in{\bf R}^{N}$ by letting $l^{(k)}_{i}=1_{\{i\leq k\}}$ . Then, $x+\xi\,r^{(k)}\in{\cal M}_{N}$ , $x+\xi\,l^{(k)}\in{\cal M}_{N}$ for all $\xi$ small enough. Hence, we must have $\|x+\xi\,r^{(k)}-z\|_{2}^{2}\geq\|x-z\|_{2}^{2}$ , and $\|x+\xi\,l^{(k)}-z\|_{2}^{2}\geq\|x-z\|_{2}^{2}$ for all $\xi$ small enough. Expanding to linear order in $\xi$ , we conclude that, for all $k\in I_{+}(x)$ :

Further, if $r^{(0)}$ is the all $1$ vector, $x+\xi\,r^{(0)}\in{\cal M}_{N}$ for all $\xi\in{\bf R}$ . Minimizing with respect to $\xi$ , we get

By virtue of Eq. (F.1), and using the fact that $x$ is monotone, we have

It is then easy to check that, letting ${\cal E}^{c}$ denote the complement of ${\cal E}$ ,

Indeed define, for $i\in[N]$ , $k(i)\equiv\max\{k:\;k<i,\;k\in I_{+}(x)\}$ with, by convention, $k(i)=0$ if the set $\{k<i,\;k\in I_{+}(x)\}$ is empty. Then, by definition of $I_{+}(x)$ ,

where the first inequality follows by definition of the events ${\cal E}_{\emptyset}$ , ${\cal E}_{1}$ , … ${\cal E}_{N-1}$ and the second since $i\geq k(i)+1$ . The inequality $x_{i}\geq-\sqrt{(6\log N)/i}$ follows essentially by the same argument. By union bound we therefore obtain

We can therefore bound the risk as follows

On the other hand it is easy to see that, defining $Z_{\max}=\max_{i\in[N]}|Z_{i}|$ , we necessarily have $|x_{i}({\bf Z})|\leq Z_{\max}$ for all $i\in[N]$ , whence

where the last inequality holds for $N\geq 10$ . Using this bound in conjunction with Eq. (F.6), we finally get

Appendix G Further computational details

We record here a few details that have been omitted from the main text.

The AMP iteration, cf. Eqs. (1.7), (1.8), (1.9), requires estimating the variance $\sigma_{t}^{2}$ of the effective observation at time $t$ . According to the general theory of state evolution [DMM09, BM11a], the empirical distribution of the coordinates of $z^{t}$ is asymptotically Gaussian with mean zero and variance $\sigma_{t}^{2}$ . This motivates the following estimator, first proposed in [DMM09]:

where $\Phi(z)$ is the normal distribution function. This is known as the $25$ % pseudo-variance in robust estimation and has the advantage of being insensitive to a small fraction of large outliers.
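A quantile-based scale estimate of this kind can be sketched as follows. The exact formula used in the paper is not reproduced here; the sketch assumes one common form, namely the median of $|z_{i}|$ divided by the $75$ % quantile $\Phi^{-1}(0.75)$ of the standard normal distribution, and the function name `sigma_hat` is hypothetical.

```python
import statistics


def sigma_hat(z):
    """Robust scale estimate: median of |z_i| over the 75% standard normal quantile.

    For exactly Gaussian, mean-zero data with standard deviation sigma, the
    median of |z_i| is sigma * Phi^{-1}(0.75), so the ratio estimates sigma.
    """
    q75 = statistics.NormalDist().inv_cdf(0.75)
    return statistics.median(abs(v) for v in z) / q75
```

On Gaussian data contaminated by a small fraction of very large entries, this estimate stays close to the standard deviation of the Gaussian part, while the ordinary sample standard deviation is dominated by the outliers.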

Computations were done partly using Matlab, and partly through a Java program. In the spirit of reproducible research, a suite of Java classes that allows our simulations to be repeated is available through an open code repository [DJM12].

The plots of minimax risk were obtained by evaluating numerically the expression in the main text. For separable and block-separable denoisers (with the exception of the global minimax denoiser $\eta^{all}$ ), the integrals can be expressed in terms of the Gaussian distribution function or incomplete beta functions. For the global minimax denoiser, integration was performed numerically using the standard Matlab routines.

Evaluation of the minimax risk required searching the least favorable distribution among two point mixtures of the form $(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu}$ , in the separable case and $(1-{\varepsilon})\delta_{0}+{\varepsilon}\delta_{\mu\,e_{1}}$ in the block separable one. Optimization over $\mu\in{\bf R}_{+}$ was performed by brute force search over a grid, with recursive refinement of the grid.
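A grid search with recursive refinement can be sketched as follows. This is a generic illustration rather than the routine used for the paper: the function name `refine_argmax`, the number of grid points and the number of refinement rounds are hypothetical choices, and the objective `f` stands for the risk as a function of $\mu$ .

```python
def refine_argmax(f, lo, hi, points=21, rounds=8):
    """Maximize f over [lo, hi] by repeated grid search.

    Each round evaluates f on a uniform grid, then shrinks the search
    interval to one grid step on either side of the best grid point.
    """
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = max(grid, key=f)
        lo, hi = max(lo, best - step), min(hi, best + step)
    return best, f(best)
```

The refinement only explores a neighborhood of the best point on the initial grid, so for a multimodal objective the initial grid must be fine enough to land near the global maximizer.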

For the non-separable cases, the procedure for approximating the minimax MSE was explained in Sections 4 and 5.

Appendix H Finite- $N$ scaling and error analysis

The empirical phase transitions observed in this paper admit further analysis, to verify whether the following expected behavior takes place, namely: $(a)$ the offsets tend towards zero with increasing $N$ ; $(b)$ the steepness of the phase transition increases with increasing $N$ .

As described in Section 2.4, at each fixed value ${\varepsilon}$ of the sparsity parameter, we gathered data at several different values of $\delta$ , and obtained the empirical phase transition parameter $\widehat{{\rm PT}}(N,{\varepsilon},\eta)$ , recorded as offset from prediction, so that $\widehat{{\rm PT}}(N,{\varepsilon},\eta)=0$ means that the 50% success location fitted to the ${\varepsilon}$ -fixed, $\delta$ -varying dataset is exactly at the predicted location $M({\varepsilon}|\eta)$ . Our analysis gave not only the empirical phase transition location, but also its formal standard error $SE(\widehat{{\rm PT}})$ . (Here we make explicit the dependence of $\widehat{{\rm PT}}$ on the specific denoiser.)

We fit a linear model to the dataset including all the phase transition results for soft and firm thresholding. We considered several exponents $\gamma$ that might be describing the decay of the offset with increasing $N$ :

Table 4 shows that $\gamma=1/3$ provides an adequate description of the offsets, with an $R^{2}$ exceeding $0.995$ . A plot of raw $\widehat{{\rm PT}}$ ’s versus the predictions of model (H.1) is given in Figure 25.

H.2 Transitions sharpen

In addition to an empirical phase transition parameter $\widehat{{\rm PT}}(N,{\varepsilon},\eta)$ we also fitted an empirical steepness parameter $\widehat{\beta}(N,{\varepsilon},\eta)$ , according to the logistic model:

where $\widehat{p}_{i}$ is the empirical success probability for the $i$ -th experiment.
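To make the role of the two fitted parameters concrete, the sketch below fits a model of the assumed form ${\rm logit}(\widehat{p}_{i})=\beta\,(\delta_{i}-{\rm PT})$ by ordinary least squares on the empirical logits. This is a simplified stand-in: the fit in the paper may well be a proper (weighted) logistic regression, and the function name `fit_transition` and the clipping constant are hypothetical.

```python
import math


def fit_transition(deltas, p_hat, clip=1e-6):
    """Least-squares fit of logit(p_i) = beta * (delta_i - pt); returns (pt, beta)."""
    logits = []
    for p in p_hat:
        p = min(max(p, clip), 1.0 - clip)  # avoid infinite logits at p = 0 or 1
        logits.append(math.log(p / (1.0 - p)))
    n = len(deltas)
    mean_d = sum(deltas) / n
    mean_l = sum(logits) / n
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(deltas, logits))
    beta = sxy / sxx                     # steepness: slope of the logit line
    intercept = mean_l - beta * mean_d
    return -intercept / beta, beta       # pt is where the fitted success rate is 50%
```

On noiseless synthetic data generated from a logistic curve the routine recovers the location and steepness exactly. On real success frequencies, clipping and the unweighted fit make it only a rough approximation of a maximum likelihood fit.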

We expect $\widehat{\beta}(N,{\varepsilon},\eta)$ to grow with increasing $N$ , corresponding to increasingly abrupt transitions from complete failure at $\delta\ll\widehat{{\rm PT}}(N,{\varepsilon},\eta)$ to complete success at $\delta\gg\widehat{{\rm PT}}(N,{\varepsilon},\eta)$ . In order to test this behavior, we fitted a linear model to the values of $\widehat{\beta}$ computed for multiple values of $N$ , ${\varepsilon}$ , and denoisers $\eta$ . We considered a range of powers $\widetilde{\gamma}$ that might be describing the growth of the steepness with increasing $N$ :

Table 5 shows that $\widetilde{\gamma}=1/2$ provides an adequate description of the steepnesses, with an $R^{2}$ exceeding $0.999$ . A plot of raw $\widehat{\beta}$ ’s versus the predictions of model (H.2) is given in Figure 26.

Appendix I Further properties of the risk function

In this appendix we prove several useful properties of the risk function of the block soft and James-Stein denoisers. Throughout this section, we define the risk at $\mu$ as

The argument $\eta$ will be dropped or replaced by the threshold level $\tau$ whenever clear from the context. Since we only consider denoisers that are equivariant under rotation, $R(\mu;\eta)$ depends on the vector $\mu$ only through its norm $\|\mu\|_{2}$. With a slight abuse of notation, we will use $\mu$ to denote the norm as well. In other words, the reader can assume $\mu=\mu\,e_{1}$.

In this section we consider the block soft denoiser $\eta^{soft}(\,\cdot\,;\tau):{\bf R}^{B}\to{\bf R}^{B}$ . We will write $R(\mu;\tau)$ for $R(\mu;\eta^{soft}(\,\cdot\,;\tau))$ .

For block soft thresholding the risk function $\mu\mapsto R(\mu;\tau)$ has these properties:

Let $f_{\xi,d}(w)$ be the density function of $S^{2}\sim\chi_{d}^{2}(\xi)$ . This satisfies

Applying the first identity, integrating by parts, canceling terms, and then using the second identity, we obtain

For the upper bound (I.3), use the Poisson mixture representation $f_{\xi,d+2}(w)=\sum_{j=0}^{\infty}p_{\xi/2}(j)f_{d+2+2j}(w)$ with $p_{\lambda}(j)\equiv\lambda^{j}e^{-\lambda}/j!$ and an identity for the central $\chi^{2}$ density family to obtain

and so to conclude that $f_{\xi,d+2}(w)\leq(w/d)f_{\xi,d}(w)$ . Hence the second term in (I.6) is bounded by $(1-1/\tau)\int_{\tau^{2}}^{\infty}f_{\xi,d}(w)\,{\rm d}w$ , whence follows the conclusion $(\partial R/\partial\mu^{2})\leq 1$ . Property (I.4) is immediate from (I.2) and (I.3) and the large- $\mu$ limit of $R$ .
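The pointwise bound $f_{\xi,d+2}(w)\leq(w/d)f_{\xi,d}(w)$ can be spot-checked numerically. The following is a minimal sketch in plain Python and not part of the original argument. It evaluates the noncentral density through the Poisson mixture representation, truncated to a finite number of terms (an approximation), and the helper names and the grid of test points are illustrative choices.

```python
import math


def central_chi2_pdf(w, k):
    """Density of a central chi-squared with k degrees of freedom at w > 0."""
    half = k / 2.0
    return math.exp((half - 1.0) * math.log(w) - w / 2.0
                    - half * math.log(2.0) - math.lgamma(half))


def noncentral_chi2_pdf(w, d, xi, terms=200):
    """Truncated Poisson mixture: sum_j p_{xi/2}(j) * f_{d+2j}(w)."""
    lam = xi / 2.0
    if lam == 0.0:
        return central_chi2_pdf(w, d)
    total = 0.0
    for j in range(terms):
        log_p = j * math.log(lam) - lam - math.lgamma(j + 1.0)
        total += math.exp(log_p) * central_chi2_pdf(w, d + 2 * j)
    return total


def bound_holds(d, xi, w, rel_tol=1e-9):
    """Check f_{xi,d+2}(w) <= (w/d) f_{xi,d}(w) up to a small rounding tolerance."""
    lhs = noncentral_chi2_pdf(w, d + 2, xi)
    rhs = (w / d) * noncentral_chi2_pdf(w, d, xi)
    return lhs <= rhs * (1.0 + rel_tol)


checks = [bound_holds(d, xi, w)
          for d in (2, 5, 10)
          for xi in (0.0, 1.0, 10.0, 40.0)
          for w in (0.5, 2.0, 10.0, 30.0, 80.0)]
all_ok = all(checks)
```

At $\xi=0$ the bound holds with equality, which is why the comparison allows a small relative tolerance.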

To obtain the bound in (I.5), write the risk function using the unbiased risk formula as

I.2 Positive-part James-Stein denoiser

As in the previous section, we set $\xi\equiv\mu^{2}$ and $d=B$. Here we consider the James-Stein denoiser $\eta^{JS}:{\bf R}^{d}\to{\bf R}^{d}$ defined by

and we will write $R(\mu)=R(\mu;\eta^{JS})$ .

We again let $S^{2}=\|\mu+{\bf Z}\|^{2}_{2}$, with ${\bf Z}\sim{\sf N}(0,{\bf I}_{d\times d})$, so that $S^{2}\sim\chi_{d}^{2}(\xi)$ is noncentral chi-squared with $d$ degrees of freedom and noncentrality $\xi$. Stein’s unbiased estimate of risk is

We will first develop an approximation of the risk at $\mu=0$ that was used in Sections 3.2 and 3.3.

Let $f_{d}(w)$ denote the density of a central chi-squared with $d$ degrees of freedom. We then have

Using the identity $wf_{d-2}(w)=(d-2)f_{d}(w)$ and letting $D=d-2$ , we can rewrite the last expression as

By a standard tightness argument, this implies that

An Edgeworth series leads to the expansion $R(0)=1+R_{1}\,d^{-1/2}+\Theta(d^{-2})$ . Indeed, one can integrate the expression (I.8) numerically, and the numerical values are consistent with $R(0)\approx 1+0.752/\sqrt{d}$ for large $d$ .
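The numerical check of the large-$d$ behavior can be sketched as follows. This is a minimal illustration under assumptions that should be compared against (I.8): it takes the positive-part denoiser $\eta^{JS}(y)=(1-(d-2)/\|y\|^{2})_{+}\,y$ and the unnormalized risk $R(0)={\mathbb E}\|\eta^{JS}({\bf Z})\|^{2}$, which together with the identity $wf_{d-2}(w)=(d-2)f_{d}(w)$ give $R(0)=D^{-1}\int_{D}^{\infty}(w-D)^{2}f_{D}(w)\,{\rm d}w$ with $D=d-2$. The integral is evaluated by Simpson's rule on a truncated range, and the function names are hypothetical.

```python
import math


def central_chi2_pdf(w, k):
    """Density of a central chi-squared with k degrees of freedom at w > 0."""
    half = k / 2.0
    return math.exp((half - 1.0) * math.log(w) - w / 2.0
                    - half * math.log(2.0) - math.lgamma(half))


def js_risk_at_zero(d, steps=20000):
    """(1/D) * integral_D^inf (w - D)^2 f_D(w) dw with D = d - 2, by Simpson's rule.

    Requires d >= 3 and an even number of steps. The upper limit is truncated
    far in the tail, where the integrand is negligible.
    """
    if d < 3 or steps % 2:
        raise ValueError("need d >= 3 and an even number of steps")
    D = d - 2
    upper = D + 40.0 * math.sqrt(2.0 * D) + 40.0
    h = (upper - D) / steps
    total = 0.0
    for i in range(steps + 1):
        w = D + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * (w - D) ** 2 * central_chi2_pdf(w, D)
    return total * h / 3.0 / D


risk_100 = js_risk_at_zero(100)
approx_100 = 1.0 + 0.752 / math.sqrt(100)
```

For $d=4$ the integral has the closed form $4/e$, which provides an independent check of the quadrature.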

I.2.2 Monotonicity of $R(\mu)$

We use the variation-diminishing version of total positivity, developed in [BJM81].

The non-central $\chi^{2}$ family is strictly variation diminishing of all orders.

For $g:[0,\infty)\to{\bf R}$, let $S^{-}(g)$ and $S^{+}(g)$ denote the number of sign changes and strict sign changes of $g$, and let $IS(g)$ denote the sign of $g(0)$ (assuming that $g(0)\neq 0$, the more general definition being given in [BJM81]). Further define the function $\gamma:[0,\infty)\to{\bf R}$ by

where $f_{d,\xi}(\,\cdot\,)$ is the density of the noncentral chi-square with $d$ degrees of freedom and noncentrality $\xi$. By the SVR property we have that $S^{+}(\gamma)\leq S^{-}(g)$ and that if $S^{+}(\gamma)=S^{-}(g)$ then necessarily $IS(\gamma)=IS(g)$. In particular this implies that, if $g$ is strictly increasing, then $\gamma$ is strictly increasing as well. Indeed this follows by letting $g_{a}(w)=g(w)-a$ for $a\in{\bf R}$ and

If $g$ is strictly increasing, then $S^{-}(g_{a})\leq 1$ for all $a\in{\bf R}$, whence $S^{+}(\gamma_{a})\leq 1$ for all $a$, with $IS(\gamma_{a})=IS(g_{a})$ whenever $S^{+}(\gamma_{a})=1$. This in turn implies that $\gamma$ is increasing.

We now verify that the risk $R(\mu)$ of $\eta^{JS}$ is monotone increasing in $\xi=\|\mu\|^{2}\in[0,\infty)$ . Let

and define $\gamma(\xi)$ using Eq. (I.9). Note that $g$ is strictly increasing and hence $\xi\mapsto\gamma(\xi)$ is increasing as well by the above argument. But $U(S)$ is Stein’s unbiased risk estimator and hence $R(\mu)=\gamma(\xi=\|\mu\|^{2})$, which implies the claim.
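The monotonicity claim can be illustrated numerically. The sketch below is a minimal check under stated assumptions, not a substitute for the proof. It assumes the positive-part denoiser $\eta^{JS}(y)=(1-(d-2)/\|y\|^{2})_{+}\,y$, for which Stein's unbiased risk estimate, as a function of $w=S^{2}$, is $w-d$ for $w<d-2$ and $d-(d-2)^{2}/w$ otherwise. It then computes $\gamma(\xi)$ by combining the Poisson mixture form of the noncentral $\chi^{2}$ density (truncated to finitely many terms) with Simpson quadrature. All function names are hypothetical.

```python
import math


def central_chi2_pdf(w, k):
    """Density of a central chi-squared with k >= 3 degrees of freedom."""
    if w <= 0.0:
        return 0.0
    half = k / 2.0
    return math.exp((half - 1.0) * math.log(w) - w / 2.0
                    - half * math.log(2.0) - math.lgamma(half))


def simpson(func, lo, hi, steps=2000):
    """Composite Simpson rule with an even number of steps."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0


def central_means(d, terms=80):
    """E[g(W)] for W central chi-squared with d + 2j degrees of freedom, j < terms.

    g is the assumed SURE function; the integral is split at its jump, w = d - 2.
    """
    D = d - 2
    means = []
    for j in range(terms):
        k = d + 2 * j
        upper = k + 40.0 * math.sqrt(2.0 * k) + 40.0
        low = simpson(lambda w: (w - d) * central_chi2_pdf(w, k), 0.0, D)
        high = simpson(lambda w: (d - D * D / w) * central_chi2_pdf(w, k), D, upper)
        means.append(low + high)
    return means


def poisson_pmf(j, lam):
    if lam == 0.0:
        return 1.0 if j == 0 else 0.0
    return math.exp(j * math.log(lam) - lam - math.lgamma(j + 1.0))


def js_risk(xi, means):
    """gamma(xi) = sum_j p_{xi/2}(j) * E[g(chi2_{d+2j})], truncated series."""
    return sum(poisson_pmf(j, xi / 2.0) * m for j, m in enumerate(means))


d = 4
means = central_means(d)
xis = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
risks = [js_risk(xi, means) for xi in xis]
is_increasing = all(a < b for a, b in zip(risks, risks[1:]))
```

The central means are computed once and reused for every $\xi$, since only the Poisson weights depend on the noncentrality.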

I.3 Proof of Lemma 6.2

Since any probability distribution can be written as a convex combination of point masses, it is sufficient to prove the claim for $\nu=\delta_{\mu}$. In this case, using the scaling relation (1.3), we have

with $R(\,\cdot\,)$ the risk function. Therefore the state evolution mapping is starshaped for all distributions $\nu$ if and only if $\mu\mapsto R(\mu)$ is monotone increasing.

The monotonicity of the risk function was proved in [DMM09] for soft thresholding and positive soft thresholding. It is proved in Sections 4 and 5 for monotone regression and total variation denoising. Finally, it is proved in Sections I.1 and I.2 for block soft and James-Stein denoisers.
