The Noise-Sensitivity Phase Transition in Compressed Sensing

David L. Donoho, Arian Maleki, Andrea Montanari

Introduction

Consider the noisy underdetermined system of linear equations:

A very popular approach estimates x0x^{0} via the solution x1,λx^{1,\lambda} of the following convex optimization problem

This is the λ=0\lambda=0 limit of (1.2): its solution obeys x^1,0=limλ0x1,λ\hat{x}^{1,0}=\lim_{\lambda\to 0}x^{1,\lambda}.

We define below a so-called formal MSE (fMSE{\rm fMSE}), and evaluate the (minimax, formal) noise sensitivity:

Our main theoretical result is the formula

Quantity (1.5) is the payoff of a traditional two-person zero sum game, in which the undersampling and sparsity are fixed in advance, the researcher plays against Nature, Nature picks both a noise level and a signal distribution, and the researcher picks a penalization level, in knowledge of Nature’s choices. It is traditional in analyzing such games to identify the least-favorable strategy of Nature (who maximizes payout from the researcher), and the optimal strategy for the researcher (who wants to minimize payout). We are able to identify both and give explicit formulas for the so-called saddlepoint strategy, where Nature plays the least-favorable strategy against the researcher and the researcher minimizes the consequent damage. In Proposition 3.1 below we give formulas for this pair of strategies. The phase-transition structure evident in (1.7) is saying that above the curve ρ\mboxMSE\rho_{\mbox{\rm\tiny MSE}}, Nature has available unboundedly good strategies, to which the researcher has no effective response.

2 Structure of the Formalism

3 Empirical Validation

We use the word formalism for the machinery underlying our derivations because it is not (yet) a rigorously-proven method which is known to give correct results under established regularity conditions. In this sense our method has similarities to the replica and cavity methods of statistical physics, famously useful tools without rigorous general justification.

Our theoretical results are validated here by computational experiments which show that the predictions of our formulas are accurate, and, even more importantly, that the underlying formal structure leading to our predictions – least-favorable objects, game-theoretic saddlepoints of the MSE payoff function, maximin tuning of λ\lambda, unboundedness of the noise sensitivity above phase transition– can all be observed experimentally. Because our formalism makes so many different kinds of predictions about quantities with clear operational significance and about their dynamical evolution in the AMP algorithm, it is quite different than some other formalisms, such as the replica method, in which many fewer checkable predictions are made. In particular, as demonstrated in [DMM09a], the present formalism describes precisely the evolution of an actual low complexity algorithm.

We focused in this paper on measurement matrices AA with Gaussian iid entries. It was recently proved that the state evolution formalism at the core of our analysis is indeed asymptotically correct for Gaussian matrices AA [BM10]. We believe that similar results hold for matrices AA with uniformly bounded iid entries with zero mean and variance 1/n1/n. However our results should extend to a broader universality class including matrices with iid entries with same mean and variance, under an appropriate light tail condition. It is an outstanding mathematical challenge to prove that such predictions are indeed correct for a broader universality class of estimation problems.

As discussed in Section 7, an alternative route also from statistical physics, using the replica method has been recently used to investigate similar questions. We will argue that the present framework which makes predictions about actual dynamical behavior of algorithms, is computationally verifiable in great detail, whereas the replica method itself applies to no constructive algorithm and makes comparatively many fewer predictions.

Minimax MSE of Soft Thresholding

We briefly recall notions from, e.g., [DJHS92, DJ94] and then generalize them. We wish to recover an NN vector x0=(x0(i):1iN)x^{0}=(x^{0}(i):1\leq i\leq N) which is observed in Gaussian white noise

with z0(i)N(0,σ2)z^{0}(i)\sim{\sf N}(0,\sigma^{2}) independent and identically distributed. This can be regarded as special case of the compressed sensing model (1.1), whereby n=Nn=N and A=IA=I is the identity matrix – i.e. there is no underdetermined system of equations. We assume that x0x^{0} is sparse. It makes sense to consider soft thresholding

where the soft threshold function (with threshold level θ\theta) is defined by

In words, the estimator (2) ‘shrinks’ the observations yy towards the origin by a multiple τ\tau of the noise level σ\sigma.

We define the soft thresholding mean square error by

Here expectation is with respect to independent random variables ZN(0,1)Z\sim{\sf N}(0,1) and XνX\sim\nu.

It is important to allow general σ\sigma in calculations below. However, note to the scale invariance

where νa\nu^{a} is the probability distribution obtained by rescaling ν\nu: νa(S)=ν({x:axS})\nu^{a}(S)=\nu(\{x:\,a\,x\in S\}). It follows that all calculations can be made in the σ=1\sigma=1 setting and results rescaled to obtain final answers. Below, when we deal with σ=1\sigma=1, we will suppress the σ\sigma argument, and simply write mse(ν,τ)mse(1;ν,τ){\sf mse}(\nu,\tau)\equiv{\sf mse}(1;\nu,\tau)

The minimax threshold MSE was defined in [DJHS92, DJ94] by

and is explicitly computable at finite ε{\varepsilon}.

A peculiar aspect of the results in [DJ94] requires us to generalize their results somewhat. For a given, fixed τ>0\tau>0, the worst case MSE obeys

We will need approximations which place no mass at \infty. We say distribution νε,α\nu_{{\varepsilon},\alpha} is α\alpha-least-favorable for η(;τ)\eta(\,\cdot\,;\tau) if it is the least-dispersed distribution in Fε{\cal F}_{\varepsilon} achieving a fraction (1α)(1-\alpha) of the worst case risk for η(;τ)\eta(\,\cdot\,;\tau), i.e. if both (i)(i)

and (ii)(ii) ν\nu has the smallest second moment for which (i)(i) is true. The least favorable distribution νε,α\nu_{{\varepsilon},\alpha} has the form of a three-point mixture

Here μ±(ε,α)\mu^{\pm}({\varepsilon},\alpha) is an explicitly computable function, see below, and for α>0\alpha>0 fixed we have

Note in particular the relatively weak role played by α\alpha. This shows that although the precise least-favorable situation places mass at infinity, in fact, an approximately least-favorable situation is already achieved much closer to the origin.

Main Results

The notation of the last section allows us to state our main results.

(Large-System Limit). A sequence of problem size parameters n,Nn,N will be said to grow proportionally if both n,Nn,N\rightarrow\infty while n/Nδ(0,1)n/N\rightarrow\delta\in(0,1).

Consider a sequence of random variables (Wn,N)(W_{n,N}), where n,Nn,N grow proportionally. Suppose that Wn,NW_{n,N} converges in probability to a deterministic quantity WW_{\infty}, which may depend on δ>0\delta>0. Then we say that Wn,NW_{n,N} has large-system limit WW_{\infty}, denoted

(Large-System Framework). We denote by LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu) a sequence of problem instances (y,A,x0)n,N(y,A,x^{0})_{n,N} as per Eq. (1.1) indexed by problem sizes n,Nn,N growing proportionally: n/Nδn/N\rightarrow\delta. In each instance, the entries of the n×Nn\times N matrix AA are Gaussian iid N(0,1/n){\sf N}(0,1/n), the entries of z0z^{0} are Gaussian iid N(0,σ2){\sf N}(0,\sigma^{2}) and the entries of x0x^{0} are iid ν\nu.

For the sake of concreteness we focus here on problem sequences whereby the matrix AA has iid Gaussian entries. An obvious generalization of this setting would be to assume that the entries are iid with mean and variance 1/n1/n. We expect our result to hold for a broad set of distributions in this class.

In order to match the kk-sparsity condition underlying (1.1) we consider the standard framework only for νFδρ\nu\in{\cal F}_{\delta\rho}.

(Observable). Let x^\hat{x} denote the output of a reconstruction algorithm on problem instance (y,A,x0)(y,A,x^{0}). An observable JJ is a function J(y,A,x0,x^)J(y,A,x^{0},\hat{x}) of the tuple (y,A,x0,x^)(y,A,x^{0},\hat{x}).

In an abuse of notation, the realized values Jn,N=J(y,A,x0,x^)J_{n,N}=J(y,A,x^{0},\hat{x}) in this framework will also be called observables. An example is the observed per-coordinate MSE:

The MSE{\rm MSE} depends explicitly on x0x^{0} and implicitly on yy and AA (through the reconstruction algorithm). Unless specified, we shall assume that the reconstruction algorithm solves the LASSO problem (1.2), and hence x^1,λ=x^\hat{x}^{1,\lambda}=\hat{x}. Further in the following we will drop the dependence of the observable on the arguments y,A,x0,x^y,A,x^{0},\hat{x}, and the problem dimensions n,Nn,N, when clear from context.

(Formalism). A formalism is a procedure that assigns a purported large-system limit \scFormal(J){\sc\rm Formal}(J) to an observable JJ in the LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu). This limit in general depends on δ\delta, ρ\rho, σ2\sigma^{2}, and νFδρ\nu\in{\cal F}_{\delta\rho}: \scFormal(J)=\scFormal(J;δ,ρ,σ,ν){\sc\rm Formal}(J)={\sc\rm Formal}(J;\delta,\rho,\sigma,\nu).

(Validation). A formalism is theoretically validated by proving that, in the standard asymptotic framework, we have

for a class J{\cal J} of observables to which the formalism applies, and for a range of LSF(δ,ρ,σ2,ν){\rm LSF}(\delta,\rho,\sigma^{2},\nu).

A formalism is empirically validated by showing that, for problem instances (y,A,x0)(y,A,x^{0}) realized from LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu) with large NN we have

for a collection of observables JJJ\in{\cal J} and a range of asymptotic framework parameters (δ,ρ,σ,ν)(\delta,\rho,\sigma,\nu); here the approximation \approx should be evaluated by usual standards of empirical science.

Obviously, theoretical validation is stronger than empirical validation, but careful empirical validation is still validation. We do not attempt here to theoretically validate this formalism in any generality; see [BM10] results in this direction. Instead we view the formalism as calculating predictions of empirical results. We have compared these predictions with empirical results and found a persuasive level of agreement. For example, our formalism has been used to predict the MSE of reconstructions by (1.2), and actual empirical results match the predictions, i.e.:

2 Results of the Formalism

The behavior of formal mean square error changes dramatically at the following phase boundary.

For each δ\delta\in, let ρ\mboxMSE(δ)\rho_{\mbox{\rm\tiny MSE}}(\delta) be the value of ρ\rho solving

It is well known that M±(ε)M^{\pm}({\varepsilon}) is monotone increasing and concave in ε{\varepsilon}, with M±(0)=0M^{\pm}(0)=0 and M±(1)=1M^{\pm}(1)=1. As a consequence, ρ\mboxMSE\rho_{\mbox{\rm\tiny MSE}} is also a monotone increasing function of δ\delta, in fact ρ\mboxMSE(δ)0\rho_{\mbox{\rm\tiny MSE}}(\delta)\rightarrow 0 as δ0\delta\rightarrow 0 and ρ\mboxMSE(δ)1\rho_{\mbox{\rm\tiny MSE}}(\delta)\rightarrow 1 as δ1\delta\rightarrow 1. An explicit expression for the curve (δ,ρ\mboxMSE(δ))(\delta,\rho_{\mbox{\rm\tiny MSE}}(\delta)) is provided in Appendix A.

Results of Formalism. The formalism developed below yields the following conclusions.

1.a In the region ρ<ρ\mboxMSE(δ)\rho<\rho_{\mbox{\rm\tiny MSE}}(\delta), the minimax formal noise sensitivity obeys the formula

In particular, MM^{*} is finite throughout this region.

1.b With σ2\sigma^{2} the noise level in (1.1), define the formal noise-plus interference level, fNPI=fNPI(τ;δ,ρ,σ,ν){\rm fNPI}={\rm fNPI}(\tau;\delta,\rho,\sigma,\nu)

and its minimax value NPI(δ,ρ;σ)σ2(1+M(δ,ρ)/δ){\rm NPI}^{*}(\delta,\rho;\sigma)\equiv\sigma^{2}\cdot(1+M^{*}(\delta,\rho)/\delta). For α>0\alpha>0, define

1.c The formally maximin penalty parameter obeys

where EqDR(){\rm EqDR}(\,\cdots\,) is the asymptotic detection rate, i.e. the asymptotic fraction of coordinates that are estimated to be nonzero. (An explicit expression for this quantity is given in Section 4.5.)

In particular with this ν\nu-adaptive choice of penalty parameter, the formal MSE{\rm MSE} of x^1,λ\hat{x}^{1,\lambda} does not exceed Mσ2M^{*}\cdot\sigma^{2}.

2 In the region ρ>ρ\mboxMSE(δ)\rho>\rho_{\mbox{\rm\tiny MSE}}(\delta), the formal noise sensitivity is infinite. Throughout this phase, for each fixed number M<M<\infty, there exists α>0\alpha>0 such that the probability distribution νFδρ\nu\in{\cal F}_{\delta\rho} placing its nonzeros at ±μ(δ,ρ,α)\pm\mu^{*}(\delta,\rho,\alpha), yields formal MSE{\rm MSE} larger than MM.

We explain the formalism and derive these results in Section 4 below.

3 Interpretation of the Predictions

Figure 1 displays the noise sensitivity; above the phase transition boundary ρ=ρ\mboxMSE(δ)\rho=\rho_{\mbox{\rm\tiny MSE}}(\delta), it is infinite. The different contour lines show positions in the δ,ρ\delta,\rho plane where a given noise sensitivity is achieved. As one might expect, the sensitivity blows up rather dramatically as we approach the phase boundary.

Figure 4 displays the least-favorable coefficient amplitude μ(δ,ρ,α=0.02)\mu^{*}(\delta,\rho,\alpha=0.02). Notice that μ(δ,ρ,α)\mu^{*}(\delta,\rho,\alpha) diverges as the phase boundary is approached. Indeed beyond the phase boundary arbitrarily large MSE can be produced by choosing μ\mu large enough.

Figure 5 displays the value of the optimal penalization parameter amplitude λ=λ(νδ,ρ;δ,ρ,σ=1)\lambda^{*}=\lambda^{*}(\nu_{\delta,\rho}^{*};\delta,\rho,\sigma=1). Note that the parameter tends to zero as we approach phase transition.

For these figures, the region above phase transition is not decorated, because the values there are infinite or not defined.

4 Comparison to other phase transitions

In view of the importance of the phase boundary for Proposition 3.1, we note the following:

5 Validating the Predictions

Proposition 3.1 makes predictions for the behavior of solutions to (1.2). It will be validated empirically, by showing that such solutions behave as predicted.

In particular, simulation evidence will be presented to show that in the phase where noise sensitivity is finite:

Running (1.2) for data (y,A)(y,A) generated from vectors x0x_{0} with coordinates with distribution ν\nu which is nearly least-favorable results in an empirical MSE approximately equal to M(δ,ρ)σ2M^{*}(\delta,\rho)\cdot\sigma^{2}.

Running (1.2) for data (y,A)(y,A) generated from vectors x0x_{0} with coordinates with distribution ν\nu which is far from least-favorable results in empirical MSE noticeably smaller than M(δ,ρ)σ2M^{*}(\delta,\rho)\cdot\sigma^{2}.

Running (1.2) with a suboptimal penalty parameter λ\lambda results in empirical MSE noticeably greater than M(δ,ρ)σ2M^{*}(\delta,\rho)\cdot\sigma^{2}.

Second, in the phase where formal MSE is infinite:

Running (1.2) on vectors x0x_{0} generated by formally least-favorable results in an empirical MSE which is very large.

Evidence for all these claims will be given below.

The formalism

We now consider a reconstruction approach seemingly very different from (P2,λ,1P_{2,\lambda,1}). This algorithm, called first-order approximate message passing (AMP) algorithm proceeds iteratively, starting at x^0=0\hat{x}^{0}=0 and producing the estimate x^t\hat{x}^{t} of x0x^{0} at iteration tt according to the iteration:

2 Formal MSE, and its evolution

Let npi(m;σ,δ)σ2+m/δ{\sf npi}(m;\sigma,\delta)\equiv\sigma^{2}+m/\delta. We define the MSE map Ψ\Psi through

where the function mse(;ν,τ){\sf mse}(\,\cdot\,;\nu,\tau) is the soft thresholding mean square error already introduced in Eq. (2.5). It describes the MSE of soft thresholding in a problem where the noise level is npi\sqrt{{\sf npi}}. A heuristic explanation of the meaning and origin of npi{\sf npi} will be given below.

State Evolution. The state is a 5-tuple (m;δ,σ,τ,ν)(m;\delta,\sigma,\tau,\nu). State evolution is the evolution of the state by the rule

As the parameters (δ,σ,τ,ν)(\delta,\sigma,\tau,\nu) remain fixed during evolution, we usually omit mention of them and think of state evolution simply as the iterated application of Ψ\Psi:

Stable Fixed Point. The Highest Fixed Point of the continuous function Ψ\Psi is

The stability coefficient of the continuously differentiable function Ψ\Psi is

We say that HFP(Ψ){\rm HFP}(\Psi) is a stable fixed point if 0SC(Ψ)<10\leq{\rm SC}(\Psi)<1.

To illustrate this, Figure 6 shows the MSE map and fixed points in three cases.

In what follows we denote by μ2(ν)=x2dν\mu_{2}(\nu)=\int x^{2}{\rm d}\nu the second-moment of the distribution ν\nu.

Let Ψ()=Ψ(,δ,σ,τ,ν)\Psi(\,\cdot\,)=\Psi(\,\cdot\,,\delta,\sigma,\tau,\nu), and assume either σ2>0\sigma^{2}>0 or μ2(ν)>0\mu_{2}(\nu)>0. Then the sequence of iterates mtm_{t} defined by mt+1=Ψ(mt)m_{t+1}=\Psi(m_{t}) starting from m0=μ2(ν)m_{0}=\mu_{2}(\nu) converges monotonically to HFP(Ψ){\rm HFP}(\Psi):

Further, if σ>0\sigma>0 then HFP(Ψ)(0,){\rm HFP}(\Psi)\in(0,\infty) is the unique fixed point.

Suppose further that the stability coefficient satisfies 0<SC(Ψ)<10<{\rm SC}(\Psi)<1. Then there exists a constant A(ν,Ψ){\cal A}(\nu,\Psi) such that

Finally, if μ2(ν)HFP(Ψ)\mu_{2}(\nu)\geq{\rm HFP}(\Psi) then the sequence {mt}\{m_{t}\} is monotonically decreasing to μ2(ν)\mu_{2}(\nu) with

In short, barring the trivial case x0=0x^{0}=0, z0=0z^{0}=0 (no signal, no noise), state evolution converges to the highest fixed point. If the stability coefficient is smaller than 11, convergence is exponentially fast.

This Lemma is an immediate consequence of the fact that mΨ(m)m\mapsto\Psi(m) is a concave non-decreasing function, with Ψ(0)>0\Psi(0)>0 as long as σ>0\sigma>0 and Ψ(0)=0\Psi(0)=0 for σ=0\sigma=0.

Indeed in [DMM09b] the authors showed that at noise level σ=0\sigma=0, the MSE map mΨ(m;δ,σ,ν,τ)m\to\Psi(m;\delta,\sigma,\nu,\tau) is concave as a function of mm. We have the identity

relating the noise-level MSE map to the noise-level σ\sigma MSE map. From this it follows that Ψ\Psi is concave for σ>0\sigma>0 as well. Also, [DMM09b] shows that Ψ(m=0;δ,σ=0,ν,τ)=0\Psi(m=0;\delta,\sigma=0,\nu,\tau)=0 and dΨdm(m=0;δ,σ=0,ν,τ)>0\frac{{\rm d}\Psi}{{\rm d}m}(m=0;\delta,\sigma=0,\nu,\tau)>0, whence Ψ(m=0;δ,σ,ν,τ)>0\Psi(m=0;\delta,\sigma,\nu,\tau)>0 for any positive noise level σ\sigma. ∎

In the same paper [DMM09b], the authors derived the least-favorable stability coefficient in the noiseless case σ=0\sigma=0:

They showed that, for M±(δ,ρ)<δM^{\pm}(\delta,\rho)<\delta the only fixed point is at m=0m=0 and has stability coefficient

Hence, it follows that SC(δ,ρ,σ=0)<1{\rm SC}^{*}(\delta,\rho,\sigma=0)<1 throughout the region ρ<ρ\mboxMSE(δ)\rho<\rho_{\mbox{\rm\tiny MSE}}(\delta).

Concavity of the noise level 0 MSE map implies

We therefore conclude that throughout the region ρ<ρ\mboxMSE(δ)\rho<\rho_{\mbox{\rm\tiny MSE}}(\delta) For this reason, that region can also be called the stability phase, not only the stability coefficient is smaller than 11, SC(Ψ)<1{\rm SC}(\Psi)<1, but that it can be bounded away from 11 uniformly in the signal distribution ν\nu.

Throughout the region ρ<ρ\mboxMSE(δ)\rho<\rho_{\mbox{\rm\tiny MSE}}(\delta), 0<δ<10<\delta<1, for every νFδρ\nu\in{\cal F}_{\delta\rho}, we have SC(Ψ)SC(δ,ρ)<1{\rm SC}(\Psi)\leq{\rm SC}^{*}(\delta,\rho)<1.

Outside the stability region, for each large mm, we can find measures ν\nu obeying the sparsity constraint νFδρ\nu\in{\cal F}_{\delta\rho} for which state evolution converges to a fixed point suffering equilibrium MSE>m{\rm MSE}>m. The construction in section 4.5 shows that HFP(Ψ)>μ2(ν)>m{\rm HFP}(\Psi)>\mu_{2}(\nu)>m. Figure 7 shows the MSE map and the state evolution in three cases which may be compared to 6. In the first case, ρ\rho is well below ρ\mboxMSE\rho_{\mbox{\rm\tiny MSE}} and the fixed point is well below μ2(ν)\mu_{2}(\nu). In the second case, ρ\rho is slightly below ρ\mboxMSE\rho_{\mbox{\rm\tiny MSE}} and the fixed point is close to μ2(ν)\mu_{2}(\nu). In the third case, ρ\rho is above ρ\mboxMSE\rho_{\mbox{\rm\tiny MSE}} and the fixed point, lies above μ2(ν)\mu_{2}(\nu).

μ2(ν)\mu_{2}(\nu) is the MSE one suffers by ‘doing nothing’: setting threshold λ=\lambda=\infty and taking x^=0\hat{x}=0. When HFP(Ψ)>μ2(ν){\rm HFP}(\Psi)>\mu_{2}(\nu), one iteration of thresholding makes things worse, not better. In words, the phase boundary is exactly the place below which we are sure that, if μ2(ν)\mu_{2}(\nu) is large, a single iteration of thresholding gives an estimate x^1\hat{x}^{1} that is better than the starting point x^0\hat{x}^{0}. Above the phase boundary, even a single iteration of thresholding may be a catastrophically bad thing to do.

(Equilibrium States and State-Conditional Expectations)

where npi=npi(m;σ,δ){\sf npi}={\sf npi}(m;\sigma,\delta) and XνX\sim\nu, ZN(0,1)Z\sim{\sf N}(0,1) are independent random variables.

Suppose we are given (δ,σ,ν,τ)(\delta,\sigma,\nu,\tau), and a fixed point mm^{*}, m=HFP(Ψ)m^{*}={\rm HFP}(\Psi) with Ψ=Ψ(;δ,σ,ν,τ)\Psi=\Psi(\,\cdot\,;\delta,\sigma,\nu,\tau). The tuple S=(m;δ,σ,ν)S^{*}=(m^{*};\delta,\sigma,\nu) is called the equilibrium state of state evolution. The expectation in the equilibrium state is E(ζS){\cal E}(\zeta|S^{*}).

Let SS^{*} denote the equilibrium state reached by state evolution in a given situation (δ,σ,ν,τ)(\delta,\sigma,\nu,\tau). The state evolution formalism assigns the purported limit value

Validity of the state evolution formalism for AMPT entails that, for a sequence of problem instances (y,A,x0)(y,A,x^{0}) drawn from LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu), the large-system limit for observable Jn,NζJ^{\zeta}_{n,N} is simply the expectation in the equilibrium state:

The class J{\cal J} of observables representable by the form JζJ^{\zeta} is quite rich, by choosing ζ(u,v,w)\zeta(u,v,w) appropriately. Table 1 gives examples of well-known observables and the ζ\zeta which will generate them.

Formal values for other interesting observables can in principle be obtained by combining such simple ones. For example, the False Discovery rate FDR is the ratio FDeR//DR and so the ratio of two elementary observables of the kind for which the formalism is defined. We assign it the purported limit value

Below we list a certain number of observables for which the formalism was checked empirically and that play an important role in characterizing the fixed point estimates.

Calculation of Formal Operating Characteristics of AMPT(τ){\rm AMPT}(\tau) by State Evolution

Given δ,σ,ν,τ\delta,\sigma,\nu,\tau, identify the fixed point HFP(Ψ(;δ,σ,ν,τ){\rm HFP}(\Psi(\,\cdot\,;\delta,\sigma,\nu,\tau). Calculate the following quantities

Equilibrium Noise Plus Interference Level

Equilibrium Mean Squared Residual. Let Y=X+npiZY_{\infty}=X+\sqrt{{\sf npi}_{\infty}}\,Z for XνX\sim\nu and ZN(0,1)Z\sim{\sf N}(0,1) are independent. Then

3 AMPT - LASSO Calibration

Of course at this point the reader is entitled to feel that the introduction of AMPT is a massive digression. The relevance of AMPT is indicated by the following conclusion from [DMM10b]:

In the large system limit, the operating characteristics of AMPT(τ){\rm AMPT}(\tau) are equivalent to those of LASSO(λ)(\lambda) under an appropriate calibration τλ\tau\leftrightarrow\lambda.

By calibration, we mean a rescaling that maps results on one problem into results on the other problem. The notion is explained at greater length in [DMM10b]. The correct mapping can be guessed from the following remarks:

LASSO(λ){\rm LASSO}(\lambda): no residual exceeds λ\lambda: AT(yAx^1,λ)λ\|A^{T}(y-A\hat{x}^{1,\lambda})\|_{\infty}\leq\lambda. Further

AMPT(τ){\rm AMPT}(\tau): At a fixed point x^\hat{x}^{\infty}, zz^{\infty}, no working residual exceeds the equilibrium threshold θ\theta_{\infty}: ATzθ\|A^{T}z^{\infty}\|_{\infty}\leq\theta_{\infty}. Further

Define df=#{i:x^i0}df=\#\{i:\hat{x}^{\infty}_{i}\neq 0\}. Further notice that at the AMPT fixed point (1df/n)z=yATx^(1-{\rm df}/n)z^{\infty}=y-A^{T}\hat{x}^{\infty}. We can summarize these remarks in the following statement

Solutions x^1,λ\hat{x}^{1,\lambda} of LASSO(λ){\rm LASSO}(\lambda) (i.e. optima of the problem (1.2)) are in correspondence with fixed points (x^,z)(\hat{x}^{\infty},z^{\infty}) of the AMPT(τ){\rm AMPT}(\tau) under the bijection x^=x^1,λ\hat{x}^{\infty}=\hat{x}^{1,\lambda}, z=(yATx^1,λ)/(1df/n)z^{\infty}=(y-A^{T}\hat{x}^{1,\lambda})/(1-{\rm df}/n), provided the threshold parameters are in the following relation

In other words, if we have a fixed point of AMPT(τ){\rm AMPT}(\tau) we can choose λ\lambda in such a way that this is also an optimum of LASSO(λ){\rm LASSO}(\lambda). Viceversa, any optimum of LASSO(λ){\rm LASSO}(\lambda) can be realized as a fixed point of AMPT(τ){\rm AMPT}(\tau): notice in fact that the relation (4.6) is invertible whenever df<n{\rm df}<n.

This simple rule gives a calibration relationship between τ\tau and λ\lambda, i.e. a one-one correspondence between τ\tau and λ\lambda that renders the two apparently different reconstruction procedures equivalent, provided the iteration AMPT(τ){\rm AMPT}(\tau) converges rapidly to its fixed point. Our empirical results confirm that this is indeed what happens for typical large system frameworks LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu).

The next lemma characterizes the equilibrium calibration relation between AMP and LASSO.

Let EqDR(τ)=EqDR(τ;δ,ρ,ν,σ){\rm EqDR}(\tau)={\rm EqDR}(\tau;\delta,\rho,\nu,\sigma) denote the equilibrium detection rate obtained from state evolution when the tuning parameter of AMPT is τ\tau. Define τ0(δ,ρ,ν,σ)>0\tau^{0}(\delta,\rho,\nu,\sigma)>0, so that EqDR(τ)δ{\rm EqDR}(\tau)\leq\delta when τ>τ0\tau>\tau^{0}. For each λ0\lambda\geq 0, there is a unique value τ(λ)[τ0,)\tau(\lambda)\in[\tau_{0},\infty) such that

We can restate Finding 4.1 in the following more convenient form.

For each λ[0,)\lambda\in[0,\infty) we find that AMPT(τ(λ)){\rm AMPT}(\tau(\lambda)) and LASSO(λ){\rm LASSO}(\lambda) have statistically equivalent observables. In particular the MSE{\rm MSE}, MAE{\rm MAE}, MSR{\rm MSR}, DR{\rm DR}, have the same distributions.

4 Derivation of Proposition 3.1

Consider the following Minimax Problem for AMPT(τ){\rm AMPT}(\tau). With fMSE(τ;δ,ρ,σ,ν){\rm fMSE}(\tau;\delta,\rho,\sigma,\nu) denoting the equilibrium formal MSE for AMPT(τ)AMPT(\tau) for the framework LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu), fix σ=1\sigma=1 and define

We will first show that this definition obeys the formula just like the one in Proposition 3.1, given for MM^{*}. Later we show that M=MM^{\flat}=M^{*}.

Figure 8 depicts the behavior of τ\tau^{*} in the (δ,ρ)(\delta,\rho) plane.

Consider νFδρ\nu\in{\cal F}_{\delta\rho} and σ2=1\sigma^{2}=1 and set τ(δ,ρ)=τ±(δρ)\tau^{*}(\delta,\rho)=\tau^{\pm}(\delta\rho) as in the statement. Let for short Ψ(m;ν)=Ψ(m,δ,σ=1,τ,ν)=mse(npi(m,1,δ);ν,τ)\Psi(m;\nu)=\Psi(m,\delta,\sigma=1,\tau^{*},\nu)={\sf mse}({\sf npi}(m,1,\delta);\nu,\tau^{*}), cf. Eq. (4.4). Then m=HFP(Ψ)m^{*}={\rm HFP}(\Psi) obeys, by definition of fixed point,

where we used the fact that τ(δ,ρ)=τ±(δρ)\tau^{*}(\delta,\rho)=\tau^{\pm}(\delta\rho). Hence

The function mmnpi(m;δ,1)m\mapsto\frac{m}{{\sf npi}(m;\delta,1)} is one-to-one strictly increasing on the interval [0,δ)[0,\delta). Thus, provided that 1M±(δρ)/δ>01-M^{\pm}(\delta\rho)/\delta>0, i.e. ρ<ρ\mboxMSE\rho<\rho_{\mbox{\rm\tiny MSE}}, we have

As this inequality applies to any HFP{\rm HFP} produced by our formalism, in particular the largest one consistent with νFδρ\nu\in{\cal F}_{\delta\rho}, we have

We now develop the reverse inequality. To do so, we make a specific choice ν\overline{\nu} of ν\nu. Fix α>0\alpha>0 small. Now for ε=δρ{\varepsilon}=\delta\rho, define ξ=μ±(ε,α)NPI\xi=\mu^{\pm}({\varepsilon},\alpha)\cdot\sqrt{{\rm NPI}^{*}}, where NPI=1+M/δ{\rm NPI}^{*}=1+M^{\flat}/\delta (with M=M±(δρ)/(1M±(δρ)/δ)M^{\flat}=M^{\pm}(\delta\rho)/(1-M^{\pm}(\delta\rho)/\delta) as in the thesis). Let ν=(1ε)δ0+(ε/2)δξ+(ε/2)δξ\overline{\nu}=(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\,\delta_{-\xi}+({\varepsilon}/2)\delta_{\xi}. Denote by m=m(ν)m^{*}=m^{*}(\overline{\nu}) the highest fixed point corresponding to the signal distribution ν\overline{\nu}. Using once again scale invariance, we have

Note that mse(m;(1ε)δ0+(ε/2)δx+(ε/2)δx,τ){\sf mse}(m;(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\delta_{-x}+({\varepsilon}/2)\delta_{x},\tau) is monotone increasing in x|x|. Recall that νε,α=(1ε)δ0+(ε/2)δμ±(ε,α)+(ε/2)δμ±(ε,α)\nu_{{\varepsilon},\alpha}=(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\delta_{-\mu^{\pm}({\varepsilon},\alpha)}+({\varepsilon}/2)\delta_{\mu^{\pm}({\varepsilon},\alpha)} is α\alpha-least favorable for the minimax problem (2.7). Consequently,

Using the scale-invariance relation, Eq. (4.11), we conclude that

Again, in the region ρ<ρ\mboxMSE(δ)\rho<\rho_{\mbox{\rm\tiny MSE}}(\delta), the function mmnpi(m;δ,1)m\mapsto\frac{m}{{\sf npi}(m;\delta,1)} is one-to-one and monotone and therefore so

We now explain how this result about AMPT leads to our claim for the behavior of the LASSO estimator x^1,λ\hat{x}^{1,\lambda}. By a scale invariance the quantity (1.5) can be rewritten as a fixed-scale σ=1\sigma=1 property:

where we introduced explicit reference to the algorithm used, and dropped the irrelevant arguments. We will analogously write fMSE(ν,τAMPT){\rm fMSE}(\nu,\tau|{\rm AMPT}) for the AMPT(τ)(\tau) MSE.

Assume the validity of our calibration relation i.e. the equivalence of formal operating characteristics of AMPT(τ){\rm AMPT}(\tau) and LASSO(λ(τ)){\rm LASSO}(\lambda(\tau)). Then

Also, for λ\lambda^{*} as defined in Proposition 3.1,

In words, λ\lambda^{*} is the maximin penalization and the maximin MSE of LASSOis precisely given by the formula (4.8).

Taking the validity of our calibration relationship τλ(τ)\tau\leftrightarrow\lambda(\tau) as given, we must have

Our definition of λ\lambda^{*} in Proposition 3.1 is simply the calibration relation applied to the minimax AMPT threshold τ\tau^{*}, i.e. λ=λ(τ)\lambda^{*}=\lambda(\tau^{*}). Hence assuming the validity of our calibration relation, we have:

Display (4.4) shows that all these equalities are equal to M(δ,ρ)M^{\flat}(\delta,\rho). ∎

The proof of Proposition 3.1, points 1a1a, 1b1b, 1c1c follows immediately from the above.

5 Formal MSE above Phase Transition

We now make an explicit construction showing that noise sensitivity is unbounded above PT.

We first consider the AMPT algorithm above PT. Fix δ\delta, ρ\rho with ρ>ρ\mboxMSE(δ)\rho>\rho_{\mbox{\rm\tiny MSE}}(\delta) and set ε=δρ{\varepsilon}=\delta\rho.

In this section we focus on 3 point distributions with mass at equal to 1ε1-{\varepsilon}. With an abuse of notation we let mse(μ,τ){\sf mse}(\mu,\tau) denote the MSE of scalar soft thresholding for amplitude of the non-zeros equal to μ\mu, and noise variance equal to 11. In formulas, mse(μ,τ)mse(1;(1ε)δ0+(ε/2)δμ+(ε/2)δμ,τ){\sf mse}(\mu,\tau)\equiv{\sf mse}(1;(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\delta_{\mu}+({\varepsilon}/2)\delta_{-\mu},\tau), and

Consider values of the AMPT threshold τ\tau such that mse(0,τ)<δ{\sf mse}(0,\tau)<\delta; this will be possible for all τ\tau sufficiently large. Pick a number γ(0,1)\gamma\in(0,1) obeying

Let M±(ε,τ)=supμmse(μ,τ)M^{\pm}({\varepsilon},\tau)=\sup_{\mu}{\sf mse}(\mu,\tau) denote the worst case risk of η(;τ)\eta(\,\cdot\,;\tau) over the class Fε{\cal F}_{\varepsilon}. Let μ±(ε,α,τ)\mu^{\pm}({\varepsilon},\alpha,\tau) denote the α\alpha-least-favorable μ\mu for threshold τ\tau:

Define α=1γδ/M±(ε,τ)\alpha^{*}=1-\gamma\delta/M^{\pm}({\varepsilon},\tau), and note that α(0,1)\alpha^{*}\in(0,1) by earlier assumptions. Let μ=μ±(α,τ,ε)\mu^{*}=\mu^{\pm}(\alpha^{*},\tau,{\varepsilon}). A straightforward calculation along the lines of the previous section yields.

For the measure ν=(1ε)δ0+(ε/2)δμ+(ε/2)δμ\nu=(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\delta_{\mu^{*}}+({\varepsilon}/2)\delta_{-\mu^{*}}, the formal MSE{\rm MSE} and formal NPI{\rm NPI} are given by

Assumption (4.13) permits us to choose γ\gamma very close to 1. Hence the above formulas show explicitly that MSE is unbounded above phase transition.

What do the formulas say about x^1,λ\hat{x}^{1,\lambda} above PT? The τ\tau’s which can be associated to λ\lambda obey

where EqDR(ν,τ)=EqDR(τ;δ,ρ,ν,σ){\rm EqDR}(\nu,\tau)={\rm EqDR}(\tau;\delta,\rho,\nu,\sigma) is the equilibrium detection rate for a signal with distribution ν\nu. Equivalently, they are those τ\tau where the equilibrium discovery number is nn or smaller.

the parameter λ0\lambda\geq 0 defined by the calibration relation

One can check that, for each λ0\lambda\geq 0, for each phase space point above phase transition, the above construction allows to construct a measure μ\mu with ε=δρ{\varepsilon}=\delta\rho mass on nonzeros and with arbitrarily high formal MSE. This completes the derivation of part 2 of Proposition 3.1.

Empirical Validation

So far our discussion explains how state evolution calculations are carried out so others might reproduce them. The actual ‘science contribution’ of our paper comes in showing that these calculations describe the actual behavior of solutions to (1.2). We check these calculations in two ways: first, to show that individual MSE predictions are accurate, and second, to show that the mathematical structures (least-favorable, minimax saddlepoint, maximin threshold) that lead to our predictions are visible in empirical results.

Let fMSE(λ;δ,ρ,σ,ν){\rm fMSE}(\lambda;\delta,\rho,\sigma,\nu) denote the formal MSE we assign to x^1,λ\hat{x}^{1,\lambda} for problem instances from LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu). Let eMSE(λ)n,N{\rm eMSE}(\lambda)_{n,N} denote the empirical MSE of the LASSO estimator x^1,λ\hat{x}^{1,\lambda} in a problem instance drawn from LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu) at a given problem size n,Nn,N. In claiming that the noise sensitivity of x^1,λ\hat{x}^{1,\lambda} is bounded above by M(δ,ρ)M^{*}(\delta,\rho), we are saying that in empirical trials, the ratio eMSE/σ2{\rm eMSE}/\sigma^{2} will not be larger than MM^{*} with statistical significance. We now present empirical evidence for this claim.

We first consider the accuracy of theoretical predictions at the nearly-least-favorable signals generated by νδ,ρ,α=(1ε)δ0+(ε/2)δμ(δ,ρ,α)+(ε/2)δμ(δ,ρ,α)\nu_{\delta,\rho,\alpha}=(1-{\varepsilon})\delta_{0}+({\varepsilon}/2)\delta_{-\mu^{*}(\delta,\rho,\alpha)}+({\varepsilon}/2)\delta_{\mu^{*}(\delta,\rho,\alpha)} defined by Part 2.b2.b of Proposition 3.1. If the empirical ratio eMSE/σ2{\rm eMSE}/\sigma^{2} is substantially above the theoretical bound M(δ,ρ)M^{*}(\delta,\rho), according to standards of statistical significance, we have falsified the proposition.

We consider parameter points δ{0.10,0.25,0.50}\delta\in\{0.10,0.25,0.50\} and ρ{12ρ\mboxMSE,34ρ\mboxMSE,910ρ\mboxMSE,1920ρ\mboxMSE}\rho\in\{\frac{1}{2}\cdot\rho_{\mbox{\rm\tiny MSE}},\frac{3}{4}\cdot\rho_{\mbox{\rm\tiny MSE}},\frac{9}{10}\cdot\rho_{\mbox{\rm\tiny MSE}},\frac{19}{20}\cdot\rho_{\mbox{\rm\tiny MSE}}\}. The predictions of the SE formalism are detailed in Table 2.

Results at N=1500𝑁1500N=1500

To test these predictions, we generate in each situation R=200R=200 random realizations of size N=1500N=1500 from LSF(δ,ρ,σ,ν){\rm LSF}(\delta,\rho,\sigma,\nu) with the parameters shown in Table 2 and run the LARS/LASSO solver to find the solution x^1,λ\hat{x}^{1,\lambda}. Table 3 shows the empirical average MSE in 200200 trials at each tested situation.

Except at δ=0.10\delta=0.10 the mismatch between empirical and theoretical a few to several percent - reasonable given the sample size R=200R=200. At δ=0.10\delta=0.10, ρ=0.180\rho=0.180 – close to phase transition – there is a mismatch needing attention. (In fact, at each level of δ\delta the most serious mismatch is at the value of ρ\rho closest to phase transition. This can be attributed partially to the blowup of the quantity being measured as we approach phase transition.) We will pursue this mismatch below.

We also ran trials at δ{0.15,0.20,0.30,0.35,0.40,0.45}\delta\in\{0.15,0.20,0.30,0.35,0.40,0.45\}. These cases exhibited the same patterns seen above, with adequate fit except at small δ\delta, especially near phase transition. We omit the data here.

In all our trials, we measured numerous observables – not only the MSE. The trend in mismatch between theory and observation in such observables was comparable to that seen for MSE. In [DMM09b, DMM10b], the reader can find discussion and presentation of evidence for other observables.

Results at N=4000𝑁4000N=4000

Statistics of random sampling dictate that there always be some measure of disagreement between empirical averages and expectations. When the expectations are taken in the large-system limit, as ours are, there are additional small-NN effects that appear separate from random sampling effects. However, both sorts of effects should visibly decline with increasing NN.

Table 4 presents results for N=4000N=4000; we expect the discrepancies to shrink when the experiments are run at larger value of NN. We study the same ρ\rho and δ\delta that were studied for N=1500N=1500, and see that the mismatches in our MSE’s have grown smaller with NN.

Results at N=8000𝑁8000N=8000

Small values of δ\delta have the largest discrepancy specially when ρ\rho is chosen very close to the phase transition curve. To show that this discrepancy shrinks as we increase the value of NN, we do a similar experiment for δ=0.10\delta=0.10 but this time with N=8000N=8000. Table 5 summarizes the results of this simulation and shows better agreement between the formal predictions and empirical results.

The alert reader will no doubt have noticed that the discrepancy between theoretical predictions and empirical results is in many cases quite a bit larger in magnitude than the size of the the formal standard errors reported in the above tables. We emphasize that the theoretical predictions are formal limits for the NN\rightarrow\infty case, while empirical results take place at finite NN. In both statistics and statistical physics it is quite common for mismatches between finite-NN results and NN-large to occur as either O(N1/2)O(N^{-1/2}) (eg Normal approximation to the Poisson) or O(N1)O(N^{-1}) effects (eg Normal approximation to fair coin tossing). Analogously, we might anticipate that mismatches in this setting of order NαN^{-\alpha} with α\alpha either 1/21/2 or 11. Figure 9 presents empirical and theoretical results taken from the cases N=1500N=1500, 40004000, and 80008000 and displays them on a common graph, with yy-axis a mean-squared error (empirical or theoretical) and on the xx axis the inverse system size 1/N1/N. The case 1/N=01/N=0 presents the formal large-system limit predicted by our calculations and the other cases 1/N>01/N>0 present empirical results described in the tables above. As can be seen, the discrepancy between formal MSE and empirical MSE tends to zero linearly with 1/N1/N. (A similar plot with 1/N1/\sqrt{N} on the xx-axis would not be so convincing.)

The formal and empirical MSE{\rm MSE}’s at the quasi saddlepoint (ν,λ)(\nu^{*},\lambda^{*}) show statistical agreement at the cases studied, in the sense that either the MSE{\rm MSE}’s are consistent with standard statistical sampling formulas, or, where they were not consistent at N=1500N=1500, fresh data at N=4000N=4000 and N=8000N=8000 showed marked reductions in the anomalies confirming that the anomalies decline with increasing NN.

1.2 Existence of Game-Theoretic Saddlepoint in eMSE

Underlying our derivations of minimax formal MSE is a game-theoretic saddlepoint structure, illustrated in Figure 10. The loss function MSE has the following structure around the quasi saddlepoint (ν,λ)(\nu^{*},\lambda^{*}): any variation of μ\mu to lower values, will cause a reduction in loss, while a variation of λ\lambda to other values will cause an increase in loss.

1.3 Other penalization gives larger MSE

If our formalism is correct in deriving optimal penalization for x^1,λ\hat{x}^{1,\lambda}, we will see that changes of the penalization away from λ\lambda^{*} will cause MSE to increase. We consider the same situations as earlier, but now vary λ\lambda away from the minimax value, while holding the other aspects of the problem fixed. In the Appendix, Tables 7 and 8 presents numerical values of the empirical MSE obtained. Note the agreement of formal MSE, in which a saddlepoint is rigorously proven, and empirical MSE, which represents actual LARS/LASSO reconstructions. Also in this case we used R=200R=200 Monte Carlo replications.

To visualize the information in those tables, we refer to Figure 11.

1.4 MSE with more favorable measures is smaller

In our formalism, fixing λ=λ\lambda=\lambda^{*}, and varying μ\mu to smaller values will cause a reduction in formal MSE. Namely, if instead of μ(δ,ρ,0.01)\mu^{*}(\delta,\rho,0.01) we used μ(δ,ρ,α)\mu^{*}(\delta,\rho,\alpha) for α\alpha significantly larger than 0.010.01, we would see a significant reduction in MSE, by an amount matching the predicted amount.

where the scaling uses fNPI(ν0){\rm fNPI}(\nu_{0}). In particular, for μ=μ(δ,ρ,α)=μ±(δρ,α)NPI(δ,ρ)\mu=\mu^{*}(\delta,\rho,\alpha)=\mu^{\pm}(\delta\cdot\rho,\alpha)\sqrt{{\rm NPI}^{*}(\delta,\rho)}, the three point mixture: νδ,ρ,α\nu_{\delta,\rho,\alpha} has

and we ought to be able to see this. Table 9 shows results of simulations at N=1500N=1500. The theoretical MSE drops as we move away from the nearly least favorable μ\mu in the direction of smaller μ\mu, and the empirical MSE responds similarly.

The empirical data exhibit the saddlepoint structures predicted by the SE formalism.

1.5 MSE of Mixtures

The SE formalism contains a basic mathematical structure which allows one to infer that behavior at one saddlepoint determines the global minimax value: behavior under taking convex combinations (mixtures) of measures ν\nu.

Let mse(ν,λ){\sf mse}(\nu,\lambda) denote the ‘risk’ (MSE) of scalar soft thresholding as in Section 2. For such scalar thresholding, we have the affine relation

This ‘quasi-affinity’ relation allows to extend the saddlepoint structure from 3 point mixtures to more general measures.

To check this, we consider two near-least-favorable measures, ν0=νδ,ρ,0.02\nu_{0}=\nu_{\delta,\rho,0.02} and ν1=νδ,ρ,0.50\nu_{1}=\nu_{\delta,\rho,0.50}. and generate a range of cases ν(α)=(1α)ν0+αν1\nu^{(\alpha)}=(1-\alpha)\nu_{0}+\alpha\nu_{1} by varying alpha. When α∉{0,1}\alpha\not\in\{0,1\} this is a 5 point mixture rather than one of the 3-point mixtures we have been studying. Figure 12 displays the convexity bound (5.1), and the behavior of the formal MSE of this 5 point mixture. For comparison it also presents the formal MSE of the 3 point mixture having its mass at the weighted mean (1α)μ(δ,ρ,0.02)+αμ(δ,ρ,0.50)(1-\alpha)\mu(\delta,\rho,0.02)+\alpha\mu(\delta,\rho,0.50). Evidently, the 5 point mixture typically has smaller MSE than the comparable 3-point mixture, and it always is below the convexity bound.

The empirical MSE obeys the mixture inequalities predicted by the SE formalism.

2 Above Phase Transition

We conducted an empirical study of the formulas derived in Section 4.5. At δ=0.25\delta=0.25 we chose ρ=0.401\rho=0.401 - well above phase transition - and selected a range of τ\tau and γ\gamma values allowed by our formalism. For each pair γ,τ\gamma,\tau, we generated R=200R=200 Monte Carlo realizations and obtained LASSO solutions with the given penalization parameter λ\lambda. The results are described in Table 6. The match between formal MSE and empirical MSE is acceptable.

Running x^1,λ\hat{x}^{1,\lambda} at the 33-point mixtures defined for the regime above phase transition in Lemma 4.6 yields empirical MSE consistent with the formulas of that Lemma.

This validates the unboundedness of MSE of LASSO above phase transition.

Extensions

A completely parallel treatment can be given for the case where x00x^{0}\geq 0. In that setting, we use the positivity-constrained soft-threshold

and consider the corresponding positive-constrained thresholding minimax MSE [DJHS92]

We define the minimax, formal noise sensitivity:

here νFρδ+\nu\in{\cal F}^{+}_{\rho\delta} is the marginal distribution of x0x_{0}. Let ρ\mboxMSE+(δ)\rho_{\mbox{\rm\tiny MSE}}^{+}(\delta) denote the solution of

In complete analogy to (1.7) we have the formula:

2 Other Classes of Matrices

We focused here on matrices AA with Gaussian iid entries.

We believe that similar results to those obtained here hold for matrices AA with uniformly bounded iid entries with zero mean and variance 1/n1/n. In fact, we believe our results should extend to a broader universality class including matrices with iid entries with same mean and variance, under an appropriate light tail condition.

Relations with Statistical Physics and Information Theory

This section outlines the relations of the approach advocated here with ideas in information theory (in particular, with the theory of sparse graph codes), graphical models and statistical physics (more precisely spin glass theory). We will not discuss such relations in full mathematical detail, but only stress some important points that might be useful for researchers in each of those fields.

Message passing algorithms, and most notably belief propagation, have been intensively investigated in coding theory and communications, in particular because of their success in decoding sparse graph codes [RU08]. Belief propagation is defined whenever the a posteriori joint distribution of the variables to be inferred xx conditional on the observations yy can be written as a graphical model. In the present case this is easily done, provided the a priori probability distribution of the signal x=(x1,,xN)x=(x_{1},\dots,x_{N}) takes a ν=ν1×ν2×νN\nu=\nu_{1}\times\nu_{2}\dots\times\nu_{N}. The posterior is then

Graphical models of this type were (implicitly or explicitly) considered in the context of multiuser detection [Kab03, NS05, MPT06, MT06]. The underlying factor graph [KFL01] is the complete bipartite graph over NN variable nodes and nn factor nodes.

Applying belief propagation to such a model incurs two obvious difficulties: the graph is dense (and hence the complexity per iteration scales at least like n3n^{3}, and in fact worse), and the alphabet is continuous (and hence messages are not finitely representable). As discussed in [DMM10a], AMP solves these problem. From the information theoretical perspective, the term +zt1dft/n+z^{t-1}{\rm df}_{t}/n in Eq. (4.1) corresponds to ‘subtracting intrinsic information’.

In coding theory, message passing algorithms are analyzed through density evolution [RU08]. The common justification for density evolution is that the underlying graph is random and sparse, and hence converges locally to a tree in the large system limit. In the case of trees density evolution is exact, hence it is asymptotically exact for sparse random graphs. Such an easy justification is not available in the cases of dense graphs treated here and a deeper mathematical analysis is required. In [BM10], this analysis was carried out in the case of Gaussian matrices AA. It remains a challenge to generalize such analysis beyond the case of Gaussian matrices AA.

Having outlined the relation with belief propagation and coding, it is important to clarify a key point. In the context of sparse graph coding, belief propagation performances and MAP (maximum a posteriori probability) performances do not generally coincide even asymptotically (although they are intimately related [MMU04, MMRU09]). In the present paper we instead conjecture that AMP and LASSO have asymptotically equal MSE under appropriate calibration. This is due to the fact that that the state evolution recursion mtmt+1=Ψ(mt)m_{t}\mapsto m_{t+1}=\Psi(m_{t}) has only one stable fixed point.

2 Statistical physics

There is a well studied connection between statistical physics techniques and message passing algorithms [MM09]. In particular, the sum-product algorithm corresponds to the Bethe-Peierls approximation in statistical physics, and its fixed points are stationary points of the Bethe free energy. In the context of spin glass theory, the Bethe-Peierls approximation is also referred to as the ‘replica symmetric cavityWhen this terminology is used in statistical physics, the emphasis is rather on properties of random instances. method’.

The Bethe-Peierls approximation postulates a set of non-linear equations on quantities that correspond to the belief propagation messages, and allow to compute posterior marginals under the distribution (7.1). In the special cases of spin glasses on the complete graph (the celebrated Sherrington-Kirkpatrick model), these equations reduce to the so-called TAP equations, named after Thouless, Anderson and Palmer who first used them [TAP77].

The original TAP equations where a set of non-linear equations for local magnetizations (i.e. expectations of a single variable). Thouless, Anderson and Palmer first recognized that naive mean field is not accurate enough in the spin glass model, and corrected it by adding the so called Onsager reaction term that is analogous to the term +zt1dft/n+z^{t-1}{\rm df}_{t}/n in Eq. (4.1). More than 30 years after the original paper, a complete mathematical justification of the TAP equations remains an open problem in spin glass theory, although important partial results exist [Tal03]. While the connection between belief propagation and Bethe-Peierls approximation stimulated a considerable amount of research [YFW05], the algorithmic uses of TAP equations have received only sparse attention. Remarkable exceptions include [OW01, Kab03, NS05].

3 State evolution and replica calculations

Within statistical mechanics, the typical properties of probability measures of the form (7.1) are studied using the replica method or the cavity method [MM09]. These can be described as non-rigorous but mathematically sophisticated techniques. Despite intense efforts and some spectacular progresses [Tal03], even a precise statement of the assumptions implicit in such techniques is missing, in a general setting.

The fixed points of state evolution describe the output of the corresponding AMP, when the latter is run for a sufficiently large number of iterations (independent of the dimensions n,Nn,N). It is well known, within statistical mechanics [MM09], that the fixed point equations do indeed coincide with the equations obtained form the replica method (in its replica-symmetric form).

During the last few months, several papers investigated compressed sensing problems using the replica method [RFG09, KWT09, GBS09]. In view of the discussion above, it is not surprising that these results can be recovered from the state evolution formalism put forward in [DMM09a]. Let us mention that the latter has several advantages over the replica method:

It is more concrete, and its assumptions can be checked quantitatively through simulations;

It is intimately related to efficient message passing algorithms;

It actually allows to predict the performances of these algorithms (including for instance precise convergence time estimates);

It actually leads to rigorous statements, at least in the case of Gaussian sensing matrices.

Appendix A Some explicit formulae

This appendix contain some formulae and analytical derivation omitted from the main text.

The phase boundary curve admits the parametric expression

This is simply obtained from Eq. (2.8). If we call Gε(τ)G_{\varepsilon}(\tau) the function on the right hand side, then the parametric expression given here follows from δ=Gε(τ)\delta=G_{{\varepsilon}}(\tau) and Gε(τ)=0G^{\prime}_{{\varepsilon}}(\tau)=0 (which are equivalent to δ=M±(ε)\delta=M^{\pm}({\varepsilon})).

Appendix B Tables

This appendix contains table of empirical results supporting our claims.

References