Universality in polytope phase transitions and message passing algorithms

Mohsen Bayati, Marc Lelarge, Andrea Montanari

Introduction and main results

The present paper is concerned with the asymptotic distribution of xtx^{t} as NN\to\infty with tt fixed, and establishes the following results:

As NN\to\infty, the finite-dimensional marginals of the distribution of xtx^{t} are asymptotically insensitive to the distribution of the entries of AijA_{ij}.

The entries of xtx^{t} are asymptotically Gaussian with zero mean, and variance that can be explicitly computed through a one-dimensional recursion that we will refer to as state evolution.

As an application, we use state evolution to prove universality of a phase transition on polytope geometry, with connections to compressed sensing. This solves, for a broad class of random matrices with independent entries, a conjecture put forward by Donoho and Tanner Donoho2005b , DonohoTannerUniversality .

In order to illustrate the usefulness of the first two technical results, we will start the presentation of our results from the third one.

Before stating our results, it is useful to comment on the special form of the iteration (1), and in particular on the role of the memory term btf(xt1;t1)\mathsf{b}_{t}f(x^{t-1};t-1) (which is inspired from the so-called “Onsager correction” in statistical physics thouless1977solution , mezard1987spin , bolthausen2012iterative ). The function of this term is to cancel, to leading order, the effect of correlations between xit+1x_{i}^{t+1} and {xis\dvtxst}\{x^{s}_{i}\dvtx s\leq t\}. This cancelation is particularly transparent in our proof technique, whereby xitx^{t}_{i} is expressed as a sum of monomials in AjkA_{jk}, with 1j,kn1\leq j,k\leq n, and is indexed by labeled trees. The memory term effectively cancels the contribution of “one-step reversing” trees.

Without such memory term, the properties of the resulting iteration change crucially. In particular, it is no longer true that xitx^{t}_{i} is approximately Gaussian as NN\to\infty; see Section 2 for further clarification on this point.

A polytope QQ is said to be centrosymmetric if xQx\in Q implies xQ-x\in Q. Following Donoho2005b , Donoho2005a we say that such a polytope is kk-neighborly if the condition below holds: {longlist}[(I)]

It was shown in a series of papers Donoho2005a , Donoho2005b , DoTa05a , DoTa05b , DoTa08 that polytope neighborliness has tight connections with the geometric properties of random point clouds, and with sparsity-seeking methods to solve underdetermined systems of linear equations. The latter are in turn central in a number of applied domains, including model selection for data analysis and compressed sensing. For the reader’s convenience, these connections will be briefly reviewed in Section 6.

If the sequence Q\mathcal{Q} is random, we say that Q\mathcal{Q} has weak neighborliness ρ\rho (in probability) if the above limits hold in probability.

In other words, a sequence of polytopes {Qn}n0\{Q^{n}\}_{n\geq 0} has weak neighborliness ρ\rho, if for large nn the mm-dimensional polytope QnQ^{n} has close to the maximum possible number of kk faces, for all k<mρ(1ξ)k<m\rho(1-\xi).

Note that previously the neighborliness of a polytope was defined to be the largest integer kk satisfying condition (I). However, in our definition, weak neighborliness refers to the fraction k/nk/n. This is due to the fact that weak neighborliness is defined in the limit nn\to\infty.

The existence of weakly neighborly polytope sequences is clear when m(n)=nm(n)=n since in this case we can take Qn=CnQ^{n}=C^{n} with ρ=1\rho=1, but the existence is highly nontrivial when mm is only a fraction of nn.

Then, the sequence of polytopes {A(n)Cn}n0\{A(n)C^{n}\}_{n\geq 0} has weak neighborliness ρ(δ)\rho_{*}(\delta) in probability.

A characterization of the curve δρ(δ)\delta\mapsto\rho_{*}(\delta) was provided in Donoho2005b , but we omit it here since a more explicit expression will be given below.

Motivated by applications to data analysis and signal processing, Donoho and Tanner DonohoTannerUniversality carried out extensive numerical simulations for random polytopes of the form A(n)CnA(n)C^{n} for several choices of the distribution of A(n)A(n). They formulated a universality hypothesis according to which the conclusion of Theorem 1 holds for a far broader class of random matrices. The results of their numerical simulations were consistent with this hypothesis.

Here we establish the first rigorous result indicating universality of polytope neighborliness for a broad class of random matrices. Define the curve (δ,ρ(δ))(\delta,\rho_{*}(\delta)), δ(0,1)\delta\in(0,1), parametrically by letting, for α(0,)\alpha\in(0,\infty),

Here we extend the scope of Theorem 1 from Gaussian matrices to matrices with independent sub-GaussianSee equation (7) for the definition of sub-Gaussian random variables. entries (not necessarily identically distributed).

Then the sequence of polytopes {A(n)Cn}n0\{A(n)C^{n}\}_{n\geq 0} has weak neighborliness ρ(δ)\rho_{*}(\delta) in probability.

It is likely that this theorem can be improved in two directions. First, a milder tail condition than sub-Gaussianity is probably sufficient. Second, we are assuming that the distribution of AijA_{ij} has an arbitrarily small Gaussian component. This is not necessary for the upper bound on neighborliness, and appears to be an artifact of the proof of the lower bound.

The proof of Theorem 2 is provided in Section 6. By comparison, the most closely related result toward universality is by Adamczak, Litvak, Pajor, and Tomczak-Jaegermann Adamczak . For a class of matrices A(n)A(n) with i.i.d. columns, these authors prove that A(n)CnA(n)C^{n} has neighborliness scaling linearly with nn. This, however, does not suggest that a limit weak neighborliness exists, and is universal, as established instead in Theorem 2.

At the other extreme, universality of compressed sensing phase transitions can be conjectured from the results of the nonrigorous replica method KabashimaTanaka , RanganFletcherGoyal .

2 Universality of iterative algorithms

We will consider here and below a setting that is somewhat more general than the one described by equation (1). Following the terminology of DMM09 , we will refer to such an iteration as to the approximate message passing (AMP) iteration/algorithm.

There are several motivations for considering such a generalization. On one hand, it is necessary for the application to high-dimensional polytope geometry presented in the previous section. The reader might have noticed that the random matrix in Theorem 2 is rectangular. This is a different setting from that of iteration (1), whereby the random matrix AA is square and symmetric. The generalization to xVq,Nx\in\mathcal{V}_{q,N} introduced here, with AA square and symmetric, covers the case of rectangular matrices as well through a suitable reduction. In a nutshell, given a rectangular random matrix AA^{\prime}, the reduction consists of constructing a symmetric matrix that has AA^{\prime} as submatrix; cf. Section 5 for details.

Additional motivations for the generalization introduced here come from the application of AMP algorithms to a variety of problems in signal processing. For instance the authors of krzakala2012statistical , donoho2013information study compressed sensing reconstruction for “spatially coupled” sensing matrices. These are random matrices with independent but not identically distributed entries. As already discussed in donoho2013information , javanmard2013state for the case of Gaussian entries, a rigorous analysis of this algorithm requires generalizing the setting of (1).

Several other applications require a generalization of iteration (1), including the analysis of generalized AMP algorithms RanganGAMP , AMP reconstruction of block-sparse signals donoho2013accurate , the analysis of phase retrieval algorithms schniter2012compressive and so on. All of these applications can be treated within the setting introduced here, although our rigorous analysis requires the use of polynomial nonlinearities.

A brief sketch of some proof ideas for the “scalar” case of equation (1) can be found in Section 2.

An AMP instance is a triple (A,F,x0)(A,\mathcal{F},x^{0}) where: {longlist}[(3)]

x0Vq,Nx^{0}\in\mathcal{V}_{q,N} is an initial condition. Given F={fk\dvtxk[N]}\mathcal{F}=\{f^{k}\dvtx k\in[N]\}, we define f(;t)\dvtxVq,NVq,Nf(\cdot;t)\dvtx\mathcal{V}_{q,N}\to\mathcal{V}_{q,N} that maps vv to v=f(v;t)v^{\prime}=f(v;t), and is given by vi=fi(vi;t)\mathbf{v}^{\prime}_{i}=f^{i}(\mathbf{v}_{i};t) for all i[N]i\in[N].

The approximate message passing orbit corresponding to the instance (A,F,x0)(A,\mathcal{F},x^{0}) is the sequence of vectors {xt}t0\{x^{t}\}_{t\geq 0}, xtVq,Nx^{t}\in\mathcal{V}_{q,N} defined as follows, for t0t\geq 0,

Here Bt\dvtxVq,NVq,N\mathsf{B}_{t}\dvtx\mathcal{V}_{q,N}\to\mathcal{V}_{q,N} is the linear operator that maps vv to v=Btvv^{\prime}=\mathsf{B}_{t}v, and is defined by

The above definition can also be summarized by the following expression for the evolution of a single coordinate under AMP:

Recall that a centered random variable XX is sub-Gaussian with scale factor σ2\sigma^{2} if, for all λ>0\lambda>0, we have

Let {(A(N),FN,x0,N)}N1\{(A(N),\mathcal{F}_{N},x^{0,N})\}_{N\geq 1} be a sequence of AMP instances indexed by the dimension NN, with A(N)A(N) a random matrix and x0,Nx^{0,N} a random vector. We say that the sequence is (C,d)(C,d)-regular (or, for short, regular) polynomial sequence if: {longlist}[(3)]

For each NN, the functions fi(;t)f^{i}(\cdot;t) in FN\mathcal{F}_{N} [possibly random, as long as they are independent from A(N)A(N), x0,Nx^{0,N}] are polynomials with maximum degree dd and coefficients bounded by CC.

For each NN, A(N)A(N) and x0,Nx^{0,N} are independent. Further, we havei=1Nexp{xi0,N22/C}NC\sum_{i=1}^{N}\exp\{\|\mathbf{x}_{i}^{0,N}\|_{2}^{2}/C\}\leq NC with probability converging to one as NN\to\infty.

We state now our universality result for the algorithm (4).

3 State evolution

Theorem 3 establishes that the behavior of the sequence {xt}t0\{x^{t}\}_{t\geq 0} is, in the high-dimensional limit, insensitive to the distribution of the entries of the random matrix AA. In order to characterize this limit, we need to make some assumption on the collection of functions FN\mathcal{F}_{N}. In particular, we need to relate the functions FN\mathcal{F}_{N} to the functions FN\mathcal{F}_{N^{\prime}} in order to have a high-dimensional (NN\to\infty) limit.

the function fif^{i} only depends on the partition index of i[N]i\in[N], and on the value of Y(i)Y(i);

the distribution of Y(i)Y(i) only depends on the partition index of i[N]i\in[N];

the fractional size CaN/N|C^{N}_{a}|/N is NN-independent for large NN.

There are a few points to make precise, and this is done in the definition below.

for each a[k]a\in[k], we have limNCaN/N=ca(0,1)\lim_{N\to\infty}|C^{N}_{a}|/N=c_{a}\in(0,1);

for each N0N\geq 0, each a[k]a\in[k] and each iCaNi\in C_{a}^{N}, we have fi(x,t)=g(x,Y(i),a,t)f^{i}(\mathbf{x},t)=g(\mathbf{x},Y(i),a,t) where Y(1),,Y(N)Y(1),\dots,Y(N) are independent random variables with Y(i)PaY(i)\sim P_{a} whenever iCaNi\in C^{N}_{a} for some a[k]a\in[k];

With a slight abuse of notation, we will sometimes denote a converging sequence by {(A(N),g,x0,N)}N0\{(A(N),g,x^{0,N})\}_{N\geq 0}. We use capital letters to denote the Y(i)Y(i)’s to emphasize that they are random and do not change across iterations.

for all a[k]a\in[k]. Here YaPaY_{a}\sim P_{a}, ZatN(0,Σat)Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a}) and YaY_{a} and ZatZ^{t}_{a} are independent.

where ZaN(0,Σat)Z_{a}\sim\mathsf{N}(0,\Sigma_{a}^{t}) is independent of YaPaY_{a}\sim P_{a}.

We conclude by mentioning that, following DMM09 , generalizations of algorithm (4) were studied by several groups SchniterTurbo , RanganGAMP , MalekiComplex , for a number of applications. Universality results analogous to the one proved here are expected to hold for such generalizations as well.

4 Outline of the paper

The paper is organized as follows. Before delving into the details of the analysis, Section 2 provides an informal discussion of the main proof ideas for the case q=1q=1. After some preliminary facts and notations in Section 3, Section 4 considers the AMP iteration (4) and proves Theorems 3 and 4. In order to achieve our goal, we introduce two different iterations whose analysis provides useful intermediate steps. We also prove a generalization of Theorem 4 to estimate functions of messages at two distinct times ψ(xit,xis,Y(i))\psi(\mathbf{x}_{i}^{t},\mathbf{x}_{i}^{s},Y(i)).

Section 5 proves a generalization of Theorem 4 to the case of rectangular (nonsymmetric) matrices AA. This is achieved by effectively embedding the rectangular matrix, into a larger symmetric matrix and applying our results for symmetric matrices.

The generalization to rectangular matrices is finally used in Section 6 to prove our result on the universality of polytope neighborliness, Theorem 2. This is done via a correspondence with compressed sensing reconstruction established in Donoho2005b , and a sharp analysis of an AMP iteration that solves this reconstruction problem.

Universality of iterative algorithms: Sketch of main ideas

Let us focus on, say, coordinate 11 of xˉt\bar{x}^{t}. An explicit calculation yields (recall that, by convention, Aii=0A_{ii}=0)

Consider first t=1t=1. Under our assumptions, xˉ11\bar{x}^{1}_{1} is a sum of i.i.d. random variables with mean and variance 1/N1/N. By the central limit theorem, it converges in distribution to a standard Gaussian random variable, as predicted by Theorem 4.

Consider next t=2t=2. In equation (15) we decomposed the sum over {i,j}\{i,j\} in a sum over terms with j=1j=1, and a sum over terms with j1j\neq 1. The first sum converges almost surely to 11 by the law of large numbers. It is easy to see that the second sum has expectation equal to zero and variance equal to (N1)/N(N-1)/N that converges to 11. Indeed, a slightly more complicated calculation shows that it converges to a standard Gaussian. Overall, xˉ12\mathbf{\bar{x}}^{2}_{1} converges in distribution to a Gaussian with mean 11 and variance 11, unlike what is predicted by Theorem 4 for x12\mathbf{x}^{2}_{1}. (Theorem 4 always predicts xit\mathbf{x}^{t}_{i} to have asymptotically zero mean.)

Notice that the terms in the sum (15) are indexed by an ordered triple (1,i,j)(1,i,j) with i,j[N]i,j\in[N], 1i1\neq i, iji\neq j. We can identify such a triple with a length 22 rooted (directed acyclic) path with vertices labeled by 11 (the root), ii, jj: ji1j\to i\to 1. The terms that lead to a nonzero mean are those corresponding to j=1j=1, that is, with a one-step reversal in the order in which they visit labels of [N][N]. These are paths of the form 1i11\to i\to 1.

Consider now adding back the memory term btf(xt1;t1)\mathsf{b}_{t}f(x^{t-1};t-1). It is easy to check that, in the present case [namely f(x;t)=xf(x;t)=x], equation (1) reduces to xt+1=Axtxt1x^{t+1}=Ax^{t}-x^{t-1} and, in particular

Comparing with equation (15), we see that the memory term asymptotically cancels the effect of one-step reversing paths. The same analysis can be developed, with additional labor, to subsequent iterations. At each tt, the memory term cancels the effect of one-step reversing paths, and the residual terms match the prediction of Theorem 4.

The proof follows a similar argument for a general polynomial f(x)f(x). As in the linear case, each coordinate xitx^{t}_{i} can be expressed as a sum of monomials in the independent random variables (Aij)i<j(A_{ij})_{i<j}. The main difference is that now these monomials are indexed by rooted trees instead of rooted paths with vertex labels in [N][N]. To see this, consider the special case

Then, a direct calculation of iteration (1) yields

The actual proof of Theorems 3 and 4 in Section 4 follows the same intuition as above, but of course, requires several technical steps: {longlist}[(6)]

We prove that, for our purposes, the distribution of the random variable xit\mathbf{x}^{t}_{i} is accurately approximated by the distribution of zit\mathbf{z}^{t}_{i}; see Proposition 3.

We prove, under the same assumptions as in our universality result, Theorem 3, the distribution of zit\mathbf{z}^{t}_{i} is insensitive to the distribution of the matrix entries (Aij)i<j(A_{ij})_{i<j}; cf. Proposition 1. This is done by the moment method. Any moment of zit\mathbf{z}^{t}_{i} is written as the expectation of a polynomial in the (Aij)i<j(A_{ij})_{i<j}. We show that the only terms that matter are the ones in which each AijA_{ij} appears with degree at most two. Hence the expectation only depends on the first two moments of the matrix entries, which are fixed by assumption.

Together with the previous point, this immediately implies Theorem 3.

In Proposition 2, we prove that the distribution of zit\mathbf{z}_{i}^{t} (and hence xit\mathbf{x}_{i}^{t}) is, for our purposes, accurately approximated by the distribution of yit\mathbf{y}^{t}_{i}.

Finally we exploit the fact that a fresh matrix AA is sampled at each iteration to prove that state evolution holds for yit\mathbf{y}^{t}_{i}; cf. Proposition 4.

By the previous point, this implies Theorem 4.

In the next section we introduce some basic facts and notation. We will implement the above strategy in Section 4.

Notations and basic simplifications

In our proof of Theorem 4 we will make use of the following simplification that lightens somewhat the notation.

For proving Theorem 4, it is sufficient to consider the case in which g\dvtx(x,Y,a,t)g(x,Y,a,t)g\dvtx(\mathbf{x},Y,a,t)\mapsto g(\mathbf{x},Y,a,t) is independent of YY.

The basic idea of the construction is to enlarge qq in such a way to keep track of the value of Y(i)Y(i) in the a subset of the coordinates of xit\mathbf{x}_{i}^{t}.

Proofs of Theorems 3 and 4

In this section we consider the AMP iteration (4), and prove Theorem 3 and Theorem 4, and indeed generalize the latter.

with Σ^at\widehat{\Sigma}^{t}_{a} defined per equation (10). For any s,t1s,t\geq 1, we set recursively

Let {(A(N),FN,x0,N)}N1\{(A(N),\mathcal{F}_{N},x^{0,N})\}_{N\geq 1} be a polynomial and converging sequence of instances and denote by {xt}t0\{x^{t}\}_{t\geq 0} the corresponding AMP orbit.

where (Zat,Zas)N(0,Σat,s)(Z^{t}_{a},Z^{s}_{a})\sim\mathsf{N}(0,\Sigma^{t,s}_{a}) is independent of YaPaY_{a}\sim P_{a}.

The rest of this section is organized as follows. In Section 4.1 we introduce two new iterations that are useful intermediary steps for our analysis. We show that the corresponding variables admit representations as sums over trees in Section 4.2 and use them to prove basic properties of these recursions in Sections 4.3, 4.4 and 4.5. Theorems 3 and 5 are then proved in Sections 4.6, 4.7. Because of equation (23), Theorem 4 follows as a special case of Theorem 5. Indeed, we will show that both statements are equivalent through a reduction argument. Depending on the application, Theorem 5 might be a more convenient formulation of the state evolution and will be used in Section 5.

Our first result establishes universality of the moments of zijt\mathbf{z}^{t}_{i\to j} for polynomial sequences of instances.

The proof of this proposition is provided in Section 4.3.

In this statement and in the rest of this section, KK is always understood as a function of d,t,q,m,Cd,t,q,m,C which may vary from line to line, but which is independent of NN.

The asymptotic analysis of yty^{t} is particularly simple because an independent random matrix AtA^{t} is used at each iteration. In particular, it is easy to establish state evolution for yty^{t}. Our next result shows that yty^{t} provides a good approximation for ztz^{t}.

The proof of this proposition is provided in Section 4.4.

Let (A(N),FN,x0,N)N1(A(N),\mathcal{F}_{N},x^{0,N})_{N\geq 1} be a (C,d)(C,d)-regular polynomial sequence of instances. Denote by {xt}t0\{x^{t}\}_{t\geq 0} the corresponding AMP sequence and by {zt}t0\{z^{t}\}_{t\geq 0} the sequence defined by (25) while iterating (24). Then for any t1t\geq 1 and m(1),,m(q)0m(1),\dots,m(q)\geq 0, there exists KK independent of NN such that, for any i[N]i\in[N],

The proof of this proposition is provided in Section 4.5.

2 Tree representation

We denote by Tt\overline{\mathcal{T}}^{t} the set of labeled trees TT with tt generations as above. We let TtTt\mathcal{T}^{t}\subseteq\overline{\mathcal{T}}^{t} denote the subset of such trees that satisfy the following additional condition: {longlist}[(1)]

Let (A(N),FN,x0,N)N1(A(N),\mathcal{F}_{N},x^{0,N})_{N\geq 1} be a polynomial sequence of AMP instances. Denote by zit\mathbf{z}^{t}_{i} the orbit defined by (25) while iterating (24) with matrix AA. Then

We first prove (30) by induction on tt. For t=1t=1 we have, by definition,

To prove the induction, we start with equation (24), which yields

3 Proof of Proposition 1

We are now in position to prove Proposition 1.

For notational simplicity, we consider the case m(r)=mm(r)=m, and m(s)=0m(s)=0 for all s[q]rs\in[q]\setminus r. Thanks to Lemma 1, we have

obtained by taking λ=Ns/C\lambda=\sqrt{Ns/C}.

since in the product on the right-hand side of (4.3), there are at most μ/2\mu/2 terms different from one.

Hence the condition 1Ni=1Nexp(xi0,N22/C)C\frac{1}{N}\sum_{i=1}^{N}\exp(\|\mathbf{x}^{0,N}_{i}\|_{2}^{2}/C)\leq C ensures that for any p2p\geq 2,

which completes the proof Proposition 1. Here we used the fact that all values μ,q\mu,q and {Ck}k=0md\{C_{k}\}_{k=0}^{md} are independent of NN.

We end this section by showing that the term S1(A)S_{1}(A) can be further reduced. This result will be useful in the sequel, and we state it as the following lemma.

4 Proof of Proposition 2

The proof follows the same approach as for Proposition 1. For notational simplicity, we consider the case m(r)=mm(r)=m and m(s)=0m(s)=0 for all s[q]rs\in[q]\setminus r. The general case follows by the same argument. For y\mathbf{y}, we are using a different matrix at each iteration and we need to define a new weight associated to trees TTtT\in\mathcal{T}^{t} as follows:

5 Proof of Proposition 3

As in the proof of Proposition 1, we will rely on a representation of xit(r)x^{t}_{i}(r) based on labeled trees defined as in Section 4.2. In the present case, it is, however, more convenient to work with trees from which marks have been removed, that is, we identify any two trees in which the vertex marks are different, but the types are the same. Notice that equations (30), (31) imply

where Γ(T,c,t)\Gamma^{\prime}(T,\mathbf{c},t) is obtained by summing Γ(T,c,t)\Gamma(T,\mathbf{c},t) over all trees TT that coincide up to marks. In the following, with a slight abuse of notation, we will write Γ(T,c,t)\Gamma(T,\mathbf{c},t) instead of Γ(T,c,t)\Gamma^{\prime}(T,\mathbf{c},t).

Under the same assumptions as in Proposition 3, we have

for some Γ~(T,t,r)\widetilde{\Gamma}(T,t,r) which is bounded uniformly as Γ~(T,t,r)K(d,C,t)|\widetilde{\Gamma}(T,t,r)|\leq K(d,C,t).

Following the same argument as in Lemma 1, we first prove by induction on tt that we can find Γ~(T,t,r)\widetilde{\Gamma}(T,t,r) such that

It remains to prove that Γ~(T,t,r)\widetilde{\Gamma}(T,t,r) agrees with the expression in Lemma 1, [cf. equations (45), (46)], for TUit(r)T\in\mathcal{U}^{t}_{i}(r) and is zero for trees in UtBit\overline{\mathcal{U}}^{t}\setminus\mathcal{B}^{t}_{i}. The proof of this fact will proceed by induction on tt. The cases t=0,1t=0,1 are clear since Bit=\mathcal{B}^{t}_{i}=\varnothing. For t1t\geq 1, we define

where the last sum contains a finite number of nonzero terms.

We now show that each contribution on the right-hand side [except zit+1(r)z^{t+1}_{i}(r)] can be written as a sum of terms A(T)Γ~(T,t+1,r,x0)A(T)\widetilde{\Gamma}(T,t+1,r,x^{0}) over trees TBit+1T\in\mathcal{B}^{t+1}_{i} that we construct explicitly.

We now treat the terms in the first line. Again, we have

6 Proof of Theorem 3

7 Proof of Theorem 5

An important simplification is provided by the following.

It is sufficient to prove Theorem 5 for t=st=s. (Hence, Theorem 4 implies Theorem 5.)

The new sequence of initial conditions is constructed as follows: {longlist}[(3)]

The partitions CaNC^{N}_{a}, a[k]a\in[k] and matrices A(N)A(N) are kept unchanged.

As a consequence of this construction, uit=xit\mathbf{u}^{t}_{i}=\mathbf{x}^{t}_{i} for all i[N]i\in[N], t1t\geq 1, and vit=xith\mathbf{v}^{t}_{i}=\mathbf{x}^{t-h}_{i} for all th+1t\geq h+1. This completes the reduction.

As a consequence of this remark, it is sufficient to prove Theorem 4, and by Remark 1, we can limit ourselves to the case in which g\dvtx(x,Y,a,t)g(x,Y,a,t)g\dvtx(\mathbf{x},Y,a,t)\mapsto g(\mathbf{x},Y,a,t) does not depend on YY, and hence this argument will be dropped. We begin by considering the expectation of moments of xit\mathbf{x}_{i}^{t}.

where ZatN(0,Σat)Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a}).

By Propositions 2 and 3, we need only to prove the statement for the AMP orbit yty^{t}. We will indeed prove by induction on tt that for any iCaNi\in C^{N}_{a} and any jij\neq i,

For t1t\geq 1, let Ft\mathfrak{F}_{t} be the σ\sigma-algebra generated by A0,,At1A^{0},\dots,A^{t-1}. We will show, using the central limit theorem, that the random vector (yijt+1(1),,yijt+1(q))(y_{i\to j}^{t+1}(1),\dots,y_{i\to j}^{t+1}(q)) given Ft\mathfrak{F}_{t} converge in distribution to a centered Gaussian random vector. More precisely, by (27) and the induction hypothesis, the following limit holds in probability:

In the base case t=0t=0 the same conclusion holds because

where the second identity holds by assumption.

Next consider the induction claim, equation (55). Recall the representation introduced in Section 4.4,

Using this representation of yijt\mathbf{y}^{t}_{i\to j}, ykjt\mathbf{y}^{t}_{k\to j} it is easy to show that, for iki\neq k, i,kCaNi,k\in C_{a}^{N}

for some function ε(N)0{\varepsilon}(N)\to 0 as NN\to\infty ar m,C,d,t\mathbf{m},C,d,t fixed. Indeed, the above expectations can be represented as sums over m=m(1)+m(2)++m(q)m=m(1)+m(2)+\cdots+m(q) trees T1,,TmTijtT_{1},\dots,T_{m}\in\mathcal{T}^{t}_{i\to j} and mm trees T1,,TmTkjtT^{\prime}_{1},\dots,T^{\prime}_{m}\in\mathcal{T}^{t}_{k\to j}. Let G\mathbf{G} be the simple graph obtained by identifying vertices of the same type in T1,,Tm,T1,,\breakTmT_{1},\dots,T_{m},T^{\prime}_{1},\dots,\break T^{\prime}_{m}.

On the other hand, the number of such terms is bounded by KNμ2KN^{\mu-2} (because the type has to be assigned to μ\mu vertices, but 22 of these are fixed to ii and kk), and hence the overall contribution of these terms vanishes as well.

Equation (55) follows for iteration t+1t+1 by applying Chebyshev’s inequality to the sequence

for i,jNi,j\leq N: Aij=AijA^{\prime}_{ij}=A_{ij} and when i>Ni>N or j>Nj>N define AijN(0,1/N)A_{ij}^{\prime}\sim\mathsf{N}(0,1/N) independently;

g(x,b,t)=g(x,b,t)g^{\prime}(\mathbf{x},b,t^{\prime})=g(\mathbf{x},b,t^{\prime}) for b[k]b\in[k], tt1t^{\prime}\leq t-1; g(x,b,t)=0g^{\prime}(\mathbf{x},b,t)=0 for b[k]ab\in[k]\setminus a; g1(x,a,t)=φ(x)g^{\prime}_{1}(\mathbf{x},a,t)=\varphi(\mathbf{x}), gr(x,a,t)=0g^{\prime}_{r}(\mathbf{x},a,t)=0, for r2r\geq 2; g(x,k+1,t)=0g^{\prime}(\mathbf{x},k+1,t^{\prime})=0 for all tt^{\prime}.

The definition of g(x,a,t)g^{\prime}(\mathbf{x},a,t^{\prime}) for t>tt^{\prime}>t is irrelevant for our purposes. Since g(x,k+1,t)=0g^{\prime}(\mathbf{x},k+1,t^{\prime})=0 for all tt^{\prime}, the orbit (xit\dvtxiN,tt)(\mathbf{x}_{i}^{t^{\prime}}\dvtx i\leq N,t^{\prime}\leq t) is not affected by the new variables. Further, by the general AMP equation (6), we have, for iCk+1Ni\in C_{k+1}^{N},

Notice that the {Aij}jCaN\{A_{ij}\}_{j\in C_{a}^{N}} in this equation are independent of xjt\mathbf{x}_{j}^{t}. Hence

On the other hand, using Proposition 4 (once for iteration t+1t+1 and iCk+1Ni\in C_{k+1}^{N}, and another time for iteration tt and iCaNi\in C_{a}^{N}) we get

where ZatN(0,Σat)Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a}). Comparing these equations with equation (60) we conclude that

Taking φ(x)=xk\varphi(\mathbf{x})=\mathbf{x}^{\mathbf{k}}, we obtain equation (57) for m\mathbf{m} even. In order to establish equation (57) for general m\mathbf{m} we take, for instance, φ(x)=1+εxm\varphi(\mathbf{x})=1+{\varepsilon}\mathbf{x}^{\mathbf{m}} and use the fact that the limit must vanish for all ε{\varepsilon}.

Proof of Theorem 5 By Remark 1 and Remark 2, we reduced ourselves to the case t=st=s, and Y(i)=0Y(i)=0 [equivalently, Y(i)Y(i), is absent].

Proposition 4 shows the convergence of expected the moments of μaN\mu^{N}_{a} to moments that determine the Gaussian distribution. Proposition 5 combined with Chebyshev inequality implies

in probability. The proof follows using the relation between convergence in probability and convergence almost sure along subsequences, together with the moment method.

Nonsymmetric matrices

In this section we consider a slightly different setting that turns out to be a special case of the one introduced in Section 1.3.

A converging sequence of (polynomial) bipartite AMP instances {(A(n),f,h,x0,n)}n1\{(A(n),f,h,x^{0,n})\}_{n\geq 1} is defined by giving for each nn: {longlist}[(3)]

Throughout this section, we will refer to nonbipartite AMP instances as per Definition 5, as to symmetric instances. With these ingredients, we define the AMP orbit as follows.

The approximate message passing orbit corresponding to the bipartite instance (A,f,h,x0)(A,f,h,x^{0}) is the sequence of vectors {xt,zt}t0\{x^{t},z^{t}\}_{t\geq 0}, xtVq,nx^{t}\in\mathcal{V}_{q,n} ztVq,mz^{t}\in\mathcal{V}_{q,m} defined as follows, for t0t\geq 0,

where f()f(\cdot), h()h(\cdot) are applied componentwise; see below for an explicit formulation. Here Bt\dvtxVq,mVq,m\mathsf{B}_{t}\dvtx\mathcal{V}_{q,m}\to\mathcal{V}_{q,m} is the linear operator that maps vv to v=Btvv^{\prime}=\mathsf{B}_{t}v, and for any j[m]j\in[m] satisfies,

Analogously Dt\dvtxVq,nVq,n\mathsf{D}_{t}\dvtx\mathcal{V}_{q,n}\to\mathcal{V}_{q,n} is the linear operator defined by letting, for v=Dtvv^{\prime}=\mathsf{D}_{t}v, and any j[n]j\in[n],

For the sake of clarity, it is useful to rewrite the iteration (67), (68) explicitly, by components

We will state and prove a state evolution result that is analogous to Theorem 5 for the present case. Since the proof is by reduction to the symmetric case, the same argument also implies a universality statement of the type of Theorem 3. However, we will not state explicitly any universality statement in this case. We begin by introducing the appropriate state evolution recursion. In analogy with equation (10), we introduce two sequences of positive semidefinite matrices {Σt}t0\{\Sigma^{t}\}_{t\geq 0}, {Ξt}t0\{\Xi^{t}\}_{t\geq 0} by letting Ξ0\Xi^{0} be given as per equation (66) and defining, for all t1t\geq 1,

We also define a two-times recursion analogous to equations (21), (4). Namely, we introduce the boundary condition

with Ξt\Xi^{t} defined per equations (5), (5). For any s,t1s,t\geq 1, we set recursively

(Recall that [u,v][u,v] denotes the column vector obtained by concatenating uu and vv.)

Let {(A(n),f,h,x0,n)}n1\{(A(n),f,h,x^{0,n})\}_{n\geq 1} be a polynomial and converging sequence of bipartite AMP instances, and denote by {xt,zt}t0\{x^{t},z^{t}\}_{t\geq 0} the corresponding AMP orbit.

where (Xt,Xs)N(0,Σt,s)(X^{t},X^{s})\sim\mathsf{N}(0,\Sigma^{t,s}) is independent of YQY\sim Q, and (Zt,Zs)\breakN(0,Ξt,s)(Z^{t},Z^{s})\sim\break\mathsf{N}(0,\Xi^{t,s}) is independent of WPW\sim P.

The proof follows by constructing a suitable polynomial and converging sequence of symmetric instances, recognizing that a suitable subset of the resulting orbit corresponds to the orbit {xt,zt}\{x^{t},z^{t}\} of interest, and applying Theorem 5.

We partition the index set in k=2k=2 subsets: [N]=C1NC2N[N]=C^{N}_{1}\cup C_{2}^{N}, with C1N={1,,m}C^{N}_{1}=\{1,\dots,m\} and C2N={m+1,,m+n}C^{N}_{2}=\{m+1,\dots,m+n\}. In particular c1=δ/(1+δ)c_{1}=\delta/(1+\delta) and c2=1/(1+δ)c_{2}=1/(1+\delta).

The symmetric random matrix AA^{\prime} is given by

In particular W11=W22=0W_{11}=W_{22}=0 and W12=W21=(1+δ)/δW_{12}=W_{21}=(1+\delta)/\delta.

The vertex labels are Ys(i)=W(i)Y_{s}(i)=W(i) for imi\leq m and Ys(i)=Y(im)Y_{s}(i)=Y(i-m) for i>mi>m. In particular, these are independent random variables with distribution Ys(i)P1=QY_{s}(i)\sim P_{1}=Q if iC1Ni\in C_{1}^{N} and Ys(i)P2=PY_{s}(i)\sim P_{2}=P if iC2Ni\in C_{2}^{N}.

The initial condition is given by xs,i0,N=0\mathbf{x}_{s,i}^{0,N}=0 for iC1Ni\in C_{1}^{N} and xs,i0,N=xim0,n\mathbf{x}_{s,i}^{0,N}=\mathbf{x}^{0,n}_{i-m} for iC2Ni\in C_{2}^{N}.

The definition of g(x,Y,a=1,2t+1)g(\mathbf{x},Y,a=1,2t+1) and g(x,Y,a=2,2t)g(\mathbf{x},Y,a=2,2t) is irrelevant for our purposes. The proof is concluded by recognizing that, for all t0t\geq 0,

We finish this section with a lemma that establishes continuity of the AMP trajectories with respect to Gaussian perturbations of the matrix AA. This fact will be used in the next section. (Notice that an analogous lemma holds by the same argument for converging, nonbipartite, instances.)

Enumerating the edges in TT as (u1,v1),,(uk,vk)(u_{1},v_{1}),\dots,(u_{k},v_{k}) the quantity in parenthesis reads

Proof of universality of polytope neighborliness

In this section we prove Theorem 2, deferring several technical steps to the Appendices.

Notice that these matrices differ by a factor 1/m1/\sqrt{m} from the matrices in the statement of Theorem 2. Since neighborliness is invariant under scale transformations, this change is immaterial.

This is indeed a rephrasing of Theorem 2 in Donoho2005a .

In view of this result, Theorem 2 follows from the following result on compressed sensing with random sensing matrices.

Consider either of the following two cases: {longlist}[(II)]

The matrix A(n)A(n) has i.i.d. entries, and {x0(n)}n1\{x_{0}(n)\}_{n\geq 1} is any fixed sequence of vectors with limnx0(n)0/m(n)=ρ\lim_{n\to\infty}\|x_{0}(n)\|_{0}/m(n)=\rho.

The rest of this section is devoted to the proof of Theorem 8. Indeed, as shown below, this immediately implies Theorem 2.

Using Markov’s inequality, equations (87), (88) coincide (resp.) with assumptions (1) and (2) in Theorem 7. The claim follows by applying this theorem.

For c(0,1)c\in(0,1), let S(c){i[n]\dvtxvi1c}S(c)\equiv\{i\in[n]\dvtx|v_{i}|\geq 1-c\}. Then, for any S[n]S^{\prime}\subseteq[n], Sc1n|S^{\prime}|\leq c_{1}n, the minimum singular value of AS(c1)SA_{S(c_{1})\cup S^{\prime}} satisfies σmin(AS(c1)S)c2\sigma_{\min}(A_{S(c_{1})\cup S^{\prime}})\geq c_{2}.

The proof of this lemma is deferred to Appendix B.

The proof of Theorem 8 consists of two parts. For ρ>ρ(δ)\rho>\rho_{*}(\delta), we shall exhibit a vector xx with x1<x01\|x\|_{1}<\|x_{0}\|_{1} and y=Axy=Ax. For ρ<ρ(δ)\rho<\rho_{*}(\delta) we will show that assumptions of Lemma 5 hold. In particular, we will construct a subgradient vv as per assumption 5. In both tasks, we will use an iterative message passing algorithm analogous to the one in Section 5. The algorithm is defined by the following recursion initialized with x0=0x^{0}=0:

where η(u;θ)=sign(u)(uθ)+\eta(u;\theta)=\operatorname{sign}(u)(u-\theta)_{+}, α\alpha is a nonnegative constant and bt\mathsf{b}_{t} is a diagonal matrix whose precise definition is immaterial here and will be given in the proof of Proposition 6 below. Notice two important differences with respect to the treatment in Section 5:

The iteration in equations (89), (90) does not take immediately the form in equations (67), (68). For instance the nonlinear mapping η(;ασt)\eta(\cdot;\alpha\sigma_{t}) is applied after multiplication by ATA^{\mathsf{T}}. This mismatch can be resolved by a simple change of variables.

The nonlinear mapping η(;ασt)\eta(\cdot;\alpha\sigma_{t}) is not a polynomial. This point will be addressed by constructing suitable polynomial approximations of η\eta.

We refer to Appendix A for further details.

For t0t\geq 0, σt\sigma_{t} is defined by the one-dimensional recursion

There exists a diagonal matrix bt=bt(n)\mathsf{b}_{t}=\mathsf{b}_{t}(n) such that

We also need a generalization of the last proposition for functions of the estimates xtx^{t}, xsx^{s} at two distinct iteration numbers tst\neq s. To this objective, we introduce the generalization of the state evolution equation (91). Namely, we define {Rs,t}s,t0\{R_{s,t}\}_{s,t\geq 0} recursively for all s,t0s,t\geq 0 by letting

with ZtN(0,Rt,t)Z_{t}\sim\mathsf{N}(0,R_{t,t}). This uniquely determines the doubly infinite array {Rt,s}t,s0\{R_{t,s}\}_{t,s\geq 0}. Notice in particular that Rt,t=σt2R_{t,t}=\sigma_{t}^{2} for all t0t\geq 0. (This is easily checked by induction over tt.)

The proof of this proposition is in Appendix A.

Finally, we need some analytical estimates on the recursions (91) and (98). Part of these estimates were already proved in DMM09 , NSPT , BayatiMontanariLASSO , but we reproduce them here for the reader’s convenience. Proofs of the others are provided in Appendix C.

If ρ<ρ(δ)\rho<\rho_{*}(\delta), then: {longlist}[(a3)]

There exists α1(ε,δ)\alpha_{1}({\varepsilon},\delta), α2(ε,δ)\alpha_{2}({\varepsilon},\delta), α(ε)\alpha_{*}({\varepsilon}) with 0<α1(ε,δ)<α(ε)<α2(ε,δ)<0<\alpha_{1}({\varepsilon},\delta)<\alpha_{*}({\varepsilon})<\alpha_{2}({\varepsilon},\allowbreak\delta)<\infty, and ω(ε,δ)(0,1)\omega_{*}({\varepsilon},\delta)\in(0,1) such that the following happens. For each α(α1,α2)\alpha\in(\alpha_{1},\alpha_{2}), σt2=Bωt(1+ot(1))\sigma_{t}^{2}=B\omega^{t}(1+o_{t}(1)) as tt\to\infty, with ω(0,1)\omega\in(0,1).

Further, for each ω[ω(ε,δ),1)\omega\in[\omega_{*}({\varepsilon},\delta),1) there exists α(α1,α]\alpha_{-}\in(\alpha_{1},\alpha_{*}] and α+[α,α2)\alpha_{+}\in[\alpha_{*},\alpha_{2}) (distinct as long as ω>ω\omega>\omega_{*}) such that, letting α{α,α+}\alpha\in\{\alpha_{-},\alpha_{+}\}, σt2=Bωt(1+ot(1))\sigma_{t}^{2}=B\omega^{t}(1+o_{t}(1)).

Finally, for all α[α,α2)\alpha\in[\alpha_{*},\alpha_{2}), we have ε+2(1ε)Φ(α)<δ{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha)<\delta.

For any α[α(ε),α2(ε,δ))\alpha\in[\alpha_{*}({\varepsilon}),\alpha_{2}({\varepsilon},\delta)), we have limtRt,t1/(σtσt1)=1\lim_{t\to\infty}R_{t,t-1}/(\sigma_{t}\sigma_{t-1})=1.

For any α>αmin(δ)\alpha>\alpha_{\min}(\delta), we have limtσt2=σ2>0\lim_{t\to\infty}\sigma_{t}^{2}=\sigma_{*}^{2}>0 and, for αα0\alpha\geq\alpha_{0}, limt[Rt,t2Rt,t1+Rt1,t1]=0\lim_{t\to\infty}[R_{t,t}-2R_{t,t-1}+R_{t-1,t-1}]=0.

We are now in position to prove Theorem 8. For greater convenience of the reader, we distinguish the cases ρ<ρ(δ)\rho<\rho_{*}(\delta) and ρ>ρ(δ)\rho>\rho_{*}(\delta). Before considering these cases, we will establish some common simplifications.

Summarizing this discussion, in order to prove the Theorem both in case (I) and case (II), it will be sufficient to do so for the following setting:

In the proof of Theorem 8, we can assume the vectors x0(n)x_{0}(n) to be random with i.i.d. entries with common distribution pX=(1ε)δ0+εγp_{X}=(1-{\varepsilon})\delta_{0}+{\varepsilon}\gamma.

Notice that, by definition of the function η(;)\eta(\cdot;\cdot) we have xit1(ATzt1)ix^itθt1|x^{t-1}_{i}-(A^{\mathsf{T}}z^{t-1})_{i}-\widehat{x}^{t}_{i}|\leq\theta_{t-1}, and hence vtx01v^{t}\in\partial\|x_{0}\|_{1}. We can write

This part of the proof is completed by showing that there exists h(t)h(t) with limth(t)=0\lim_{t\to\infty}h(t)=0 such that, for each tt, with high probability we have ξt22/n(1ω)2/α2+h(t)\|\xi^{t}\|_{2}^{2}/n\leq(1-\sqrt{\omega})^{2}/\alpha^{2}+h(t), βt22/nh(t)\|\beta^{t}\|_{2}^{2}/n\leq h(t) and ζt22/nh(t)\|\zeta^{t}\|_{2}^{2}/n\leq h(t). Indeed, if this is true, we can then choose tt sufficiently large and α(α(ε),α2(ε,δ))\alpha\in(\alpha_{*}({\varepsilon}),\alpha_{2}({\varepsilon},\delta)) so that ξt+βt+ζt22\|\xi^{t}+\beta^{t}+\zeta^{t}\|_{2}^{2} is small enough as to satisfy the condition 5 of Lemma 5.

First consider ξt\xi^{t}. Applying Proposition 7 to ψ(x,y1,y2)=(y1y2)2\psi(x,y_{1},y_{2})=(y_{1}-y_{2})^{2}, we have, in probability

Here the last equality follows from the fact that σt2/σt12ω\sigma^{2}_{t}/\sigma^{2}_{t-1}\to\omega by Lemma 6(a1) and Rt,t1/(σtσt1)1R_{t,t-1}/(\sigma_{t}\sigma_{t-1})\to 1 by Lemma 6(a2). This implies the claim for ξt\xi^{t}.

Next, consider βt\beta^{t}. By Proposition 6(2),

and hence βt22/nh(t)\|\beta^{t}\|_{2}^{2}/n\leq h(t) with high probability for any h(t)>0h(t)>0.

Finally consider ζt\zeta^{t}, and define R(y;θ)=yη(y;θ)R(y;\theta)=y-\eta(y;\theta). We have

Notice that this apparently requires applying Proposition 6 to the function ψ(x,y)=[sign(x)R(y;θ)]21x0\psi(x,y)=[\operatorname{sign}(x)-R(y;\theta)]^{2}\mathbf{1}_{x\neq 0} which is non-Lipschitz in xx. However, we can define a Lipschitz approximation, with parameter r>0r>0,

Notice that ψr\psi_{r} is bounded and Lipschitz continuous. We further haveψr(x,y)ψ(x,y)41(x0;xr)|\psi_{r}(x,y)-\psi(x,y)|\leq 4\mathbf{1}(x\neq 0;|x|\leq r), whence

The last inequality holds almost surely by the law of large numbers using γ([r,r])<2r\gamma([-r,r])<2r. Analogously,

Hence the claim (110) follows by applying Proposition 6(3) to ψr(x,y)\psi_{r}(x,y), using equations (6.2), (6.2), and letting r0r\to 0.

We conclude by noting that the right-hand side of equation (110) converges to as tt\to\infty by dominated convergence, since σt0\sigma_{t}\to 0. Therefore,

This completes our proof of assumption 5 of Lemma 5.

We finally consider hypothesis 5. Let St(c)S_{t}(c) be defined as there, for the subgradient vtv^{t}, namely

We claim that the following two claims hold for some t0t_{*}\geq 0 independent of nn:

There exists c1,c^2>0c_{1},\hat{c}_{2}>0 (independent of ν\nu) such that for all S[n]S^{\prime}\subseteq[n], S2c1n|S^{\prime}|\leq 2c_{1}n, the minimum singular value of AS~t(2c1)SA_{\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime}}, satisfies σmin(AS~t(2c1)S)c^2ν\sigma_{\min}(A_{\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime}})\geq\hat{c}_{2}\nu with probability converging to 11 as nn\to\infty.

where o1(t,ν)o_{1}(t_{*},\nu) vanishes as ν0\nu\to 0 at tt_{*}, c1c_{1}, c2c_{2} fixed, and o2(t,ν;n1)o_{2}(t_{*},\nu;n^{-1}) vanishes as n10n^{-1}\to 0 at tt_{*}, ν\nu, c1c_{1}, c2c_{2} fixed.

These claims immediately imply that hypothesis 5 of Lemma 5 holds with probability converging to one as nn\to\infty. Indeed, if Snc1|S^{\prime}|\leq nc_{1}, then (by Claim 2) St(c1)SS~t(2c1)SS_{t}(c_{1})\cup S^{\prime}\subseteq\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime\prime} where S2nc1|S^{\prime\prime}|\leq 2nc_{1} with probability larger than 1o1(t;ν)o2(t,ν;n1)1-o_{1}(t_{*};\nu)-o_{2}(t_{*},\nu;n^{-1}). By Claim 1, we hence have σmin(ASt(c1)S)c2c^2ν\sigma_{\min}(A_{S_{t}(c_{1})\cup S^{\prime}})\geq c_{2}\equiv\hat{c}_{2}\nu. The thesis follows since ν\nu can be chosen as small as we want. [Notice that once tt_{*} is fixed to satisfy these claims, we can still choose ttt\geq t_{*} arbitrarily to satisfy hypothesis 5 of Lemma 5, as per the argument above.]

In order to prove Claim 1, above first notice that, for any b0b\geq 0

where in the last line H(c)H(c) denotes the binary entropy of bb, and we used (nnc)exp{nH(c)}{{n}\choose{nc}}\leq\exp\{nH(c)\}. We want to show that tt_{*}, bb, c1c_{1}, c2c_{2}, ν\nu can be chosen so that both contributions vanish as nn\to\infty.

with a1=a1((b+2c1)/δ)a_{1}=a_{1}((b+2c_{1})/\delta) bounded as long as (b+2c1)/δ<1(b+2c_{1})/\delta<1. We can therefore select c^2=1/(2a1)\hat{c}_{2}=1/(2a_{1}) and select c1c_{1} small enough so that H(2c1)(1/2)(δb2c1)log2H(2c_{1})\leq(1/2)(\delta-b-2c_{1})\log 2. This ensures that the first term in equation (6.2) vanishes as nn\to\infty.

We are left with the task of selecting b(0,δ)b\in(0,\delta), t0t_{*}\geq 0, so that the second term vanishes as well, since then we can take c1(0,(δb)/2)c_{1}\in(0,(\delta-b)/2). To this hand notice that by Proposition 6 (and using the fact that X+σt1ZX+\sigma_{t-1}Z has a density) we have, in probability,

and further, since σt0\sigma_{t}\to 0 as tt\to\infty [cf. Lemma 6(a1)] and θt=ασt\theta_{t}=\alpha\sigma_{t}, we have

On the other hand, by Lemma 6(a1), and since α[α,α2)\alpha\in[\alpha_{*},\alpha_{2}), we have ε+2(1ε)Φ(α)<δ{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha)<\delta. Hence there exist b0(0,δ)b_{0}\in(0,\delta) and c1>0c_{1}>0 so that for all tt_{*} large enough St(3c1)nb0|S_{t_{*}}(3c_{1})|\leq nb_{0} with high probability. Taking b(b0,δ)b\in(b_{0},\delta) and using Markov’s inequality (with t=t1t_{*}^{\prime}=t_{*}-1)

where the last inequality follows from Proposition 6(4). These terms can be made arbitrarily small by choosing ν\nu small and nn large enough.

In order to complete the proof, we need to show that Claim 2 holds for eventually larger tt_{*}. First notice that, applying again Proposition 6(4), we get

By Proposition 7, and using the fact that the vector (X+Zt,X+Zt)(X+Z_{t_{*}},X+Z_{t}) has a density, we have, in probability,

where, by Lemma 6(a3), h(t)h(t_{*}) vanishes as tt_{*}\to\infty. Given any c1>0c_{1}>0, we can therefore choose tt_{*} so that, with high probability St(c1)St(c1)nc1/2|S_{t}(c_{1})\setminus S_{t_{*}}(c_{1})|\leq nc_{1}/2. Combining with equation (6.2), we obtain the desired claim.

Fix a small number h>0h>0. By Lemma 6(b), there exists Δ=Δ(δ,ε)>0\Delta=\Delta(\delta,{\varepsilon})>0 independent of hh, such that, for α=α0(δ,pX)\alpha=\alpha_{0}(\delta,p_{X}) and tt large enough,

as well as σt122σ2\sigma_{t-1}^{2}\leq 2\sigma_{*}^{2}. By Propositions 6, 7 (and noting that X+σtZX+\sigma_{t}Z has a distribution that is absolutely continuous with respect to Lebesgue measure), we have, with high probability,

Namely equation (119) implies (122), equation (120) implies (123), equation (121) implies (124) and the assumption σt122σ2\sigma_{t-1}^{2}\leq 2\sigma_{*}^{2} implies (125).

Using equation (92) together with the above, we get

Appendix A Proofs of Propositions 6 and 7

In this Appendix we prove Propositions 6 and 7 by a suitable application of Theorem 6. Before passing to these proofs, we establish a corollary of Theorem 6 that allows us to control iterations of the form (89), (90), with η(;)\eta(\cdot;\cdot) replaced by a general polynomial.

Let x0=(ID1)x0x^{0}=(I-D^{-1})x_{0} (notice that DD is invertible with high probability) and define iteratively

We next introduce the corresponding state evolution recursion. Namely, we define {R~s,t}s,t0\{\widetilde{R}_{s,t}\}_{s,t\geq 0} recursively for all s,t0s,t\geq 0 by letting

with ZtN(0,R~t,t)Z_{t}\sim\mathsf{N}(0,\widetilde{R}_{t,t}). This uniquely determines the doubly infinite array{R~t,s}t,s0\{\widetilde{R}_{t,s}\}_{t,s\geq 0}.

which coincides with equation (1) after a redefinition of the function ψ\psi.

A.2 Proofs of Propositions 6 and 7

We will start by proving Proposition 6. Since Proposition 7 follows from the same construction, we will only point to the necessary modifications. Before presenting the proof, we recall a basic result in weighted polynomial approximation (here stated for a specific case); see, for example, Lubinsky .

Proof of Propositions 6 Since the proposition holds as nn\to\infty at tt fixed, we shall assume throughout that t{0,1,,tmax}t\in\{0,1,\dots,t_{\max}\} for some fixed arbitrarily large tmaxt_{\max}.

We claim that, for each β,tmax>0\beta,t_{\max}>0, we can construct an orbit {xβ,t,zβ,t}t0\{x^{\beta,t},z^{\beta,t}\}_{t\geq 0} obeying equations (A.1), (132) for suitable functions ηt=ηt(β)\eta_{t}=\eta_{t}^{(\beta)} such that the following holds (with a slight abuse of notation we will drop the parameter β\beta from xβ,t,zβ,tx^{\beta,t},z^{\beta,t}). For all 0ttmax0\leq t\leq t_{\max}, and all functions ψ\psi as in the statement, we have zt=yAxt+btzt1z^{t}=y-Ax^{t}+\mathsf{b}_{t}z^{t-1} by construction. Further, in probability,

In order to prove equations (148) to (150) we proceed as follows. It is easy to check that σt>0\sigma_{t}>0 for all tt; cf. equation (91). We use Theorem 9 to construct polynomials ηt\eta_{t} such that

Given polynomials as defined by (154), we define {xt,zt}t0\{x^{t},z^{t}\}_{t\geq 0} as per equations (A.1), (132), with the initial condition given there. Equation (150) follows immediately from Corollary 1 for ξ\xi sufficiently small. Equation (149) also follows from the same corollary, by taking

and then using once again equation (154) on the resulting expression.

Finally, consider equation (148). For economy of notation, we write

Using the tree representation in Section 4.2, it is not hard to prove that the expectation on the right-hand side is bounded as follows:

Consider, for instance, the first case, pp, qq, rr, ss distinct. Using Lemma 3, each of φp\varphi_{p}, φq\varphi_{q}, φr\varphi_{r}, φs\varphi_{s} can be represented as a sum over trees with root type respectively at pp, qq, rr, ss. The weight of these trees is as in Lemma 3, times the prefactor (Aip2m1)(Ais2m1)(A_{ip}^{2}-m^{-1})\cdots(A_{is}^{2}-m^{-1}). Let μ\mu be the total number of edges in these trees, plus 88 (two for each of the additional factors). Then any nonvanishing contribution is of order nμ/2n^{-\mu/2}. Let G\mathbf{G} be the graph obtained by identifying the vertices of the same type in these trees, and e(G)e(\mathbf{G}) the number of its edges. Since each edge in G\mathbf{G} must be covered at least twice by the trees to get a nonzero expectation, and the edges in (i,p),,(i,s)(i,p),\dots,(i,s) at least once, we have 2e(G)+4μ2e(\mathbf{G})+4\leq\mu. The number of vertices in G\mathbf{G} is at most e(G)+1e(\mathbf{G})+1 (note that G\mathbf{G} is connected because it includes type ii connected to p,q,r,sp,q,r,s). Of these vertices all but 55 (whose type is ii, pp, qq, rr, ss) can take an arbitrary type, yielding a combinatorial factor of order ne(G)+15nμ/26n^{e(\mathbf{G})+1-5}\leq n^{\mu/2-6}. Hence the sum over trees is of order nμ/2nμ/26=n6n^{-\mu/2}n^{\mu/2-6}=n^{-6} as claimed.

Since by standard concentration bounds maxi[n]Dii\max_{i\in[n]}D_{ii}, mini[n]Dii1\min_{i\in[n]}D_{ii}\to 1, we obtain, in probability,

where, in the last step, we applied Corollary 1 to the polynomials ηt1\eta^{\prime}_{t-1}, and XpXX\sim p_{X}, ZN(0,1)Z\sim\mathsf{N}(0,1) are independent. We are left with the task of showing that, by taking ξ\xi small enough in equation (154), we can ensure that

Indeed integrating by parts with respect to ZZ the above difference can be written as (for KK a finite constant that can depend on tt and change from line to line)

The claim follows by noting that, as argued above σt1σ~t1Kξ|\sigma_{t-1}-\widetilde{\sigma}_{t-1}|\leq K^{\prime}\xi.

Proof of Claim 3 To be definite we will focus on equation (162).

We finally sketch the proof of Proposition 7.

Proof of Proposition 7 The sequence {xt,zt}t0\{x^{t},z^{t}\}_{t\geq 0} is constructed as in the previous statement. The proof hence follow by using Corollary 1, and taking ξ\xi small enough in equation (154), since we can ensure that R~t,sRt,sβ|\widetilde{R}_{t,s}-R_{t,s}|\leq\beta^{\prime} for any β>0\beta^{\prime}>0 and any t,stmaxt,s\leq t_{\max} (as shown above for the case t=st=s).

Appendix B Proof of Lemma 5

Throughout the proof we denote by C1C_{1}, C2C_{2}, C3C_{3} etc., positive constants that depend uniquely on c1,,c3c_{1},\dots,c_{3}.

and denote by x^\widehat{x} any minimizer. Further, let vv be a subgradient as in the statement, and define, for some c(0,1)c\in(0,1),

Also, let S(c)=[n]S(c)\overline{S}(c)=[n]\setminus S(c) be the complement of this set. Notice that, by definition of subgradient, we have vi=sign(x0,i)v_{i}=\operatorname{sign}(x_{0,i}) for all iSi\in S and v0,i1|v_{0,i}|\leq 1 for all in S[n]S\overline{S}\equiv[n]\setminus S. This implies that SS(c)S\subseteq S(c).

Since S(c)S\overline{S}(c)\subseteq\overline{S}, we have x0,S(c)=0x_{0,\overline{S}(c)}=0 and hence

On the other hand, vS(c)v_{S(c)} is in the subgradient of xS(c)1\|x_{S(c)}\|_{1} at xS(c)=x0,S(c)x_{S(c)}=x_{0,S(c)}. Hence R10R_{1}\geq 0. It follows that equation (169) implies x^1x01+v,(x^x0)+cx^S(c)1\|\widehat{x}\|_{1}\geq\|x_{0}\|_{1}+\langle v,(\widehat{x}-x_{0})\rangle+c\|\widehat{x}_{\overline{S}(c)}\|_{1}. Since x^\widehat{x} is a minimizer, we thus get

where in the last step we used Cauchy–Schwarz together with assumption 5. Hereafter we let rx^x0r\equiv\widehat{x}-x_{0}.

Fix c=c1c=c_{1}. Since S(c)S\overline{S}(c)\subseteq\overline{S}, we have rS(c)=x^S(c)r_{\overline{S}(c)}=\widehat{x}_{\overline{S}(c)}, and using equation (173), we conclude that there exists C14/c13C_{1}\leq 4/c_{1}^{3} such that

On the other hand, by definition Ar=0Ar=0, and hence AS+rS++AS+rS+=0A_{S_{+}}r_{S_{+}}+A_{\overline{S}_{+}}r_{\overline{S}_{+}}=0. Since S(c)S\overline{S}(c)\subseteq\overline{S}, we have SS(c)S+S\subseteq S(c)\subseteq S_{+}. Further S+S(c)=S1S_{+}\setminus S(c)=S_{1}, whence S+S(c)nc=nc1|S_{+}\setminus S(c)|\leq nc=nc_{1}. By assumption 5, we have σmin(AS+)c2\sigma_{\min}(A_{S_{+}})\geq c_{2} and therefore,

Combining this with equation (175), we deduce that r2C2εr2\|r\|_{2}\leq C_{2}{\varepsilon}\|r\|_{2} for some C2=C2(c1,c2,c3)C_{2}=C_{2}(c_{1},c_{2},c_{3}), which in turns implies r=0r=0 provided that C2ε<1C_{2}{\varepsilon}<1. The claim hence follows for ε0=1/[2C2(c1,c2,c3)]{\varepsilon}_{0}=1/[2C_{2}(c_{1},c_{2},c_{3})].

Appendix C Asymptotic analysis of state evolution: Proof of Lemma 6

where expectation is taken with respect to the independent random variables XpXX\sim p_{X} and ZN(0,1)Z\sim\mathsf{N}(0,1). When necessary, we will indicate the dependency on pXp_{X} by F(σ2,θ;pX)\mathsf{F}(\sigma^{2},\theta;p_{X}). With this notation the state evolution recursion reads σt+12=F(σt2,ασt)\sigma^{2}_{t+1}=\mathsf{F}(\sigma^{2}_{t},\alpha\sigma_{t}). The following properties of the function F\mathsf{F} were proved in DMM09 ; see also BayatiMontanariLASSO , Appendix A, for a more explicit treatment.

For any α>0\alpha>0, the mapping σ2F(σ2,ασ)\sigma^{2}\mapsto\mathsf{F}(\sigma^{2},\alpha\sigma) is monotone increasing and concave with F(0,0)=0\mathsf{F}(0,0)=0 and

The first two derivatives of αGε(α)\alpha\mapsto G_{{\varepsilon}}(\alpha) will be used in the proof

By inspection of equation (180), Gε(α)>0G^{\prime\prime}_{{\varepsilon}}(\alpha)>0 for all α>0\alpha>0, hence Gε(α)G_{{\varepsilon}}(\alpha) is strictly convex. Further, from equation (179), we have Gε(0)=4(1ε)ϕ(0)<0G^{\prime}_{{\varepsilon}}(0)=-4(1-{\varepsilon})\phi(0)<0 and Gε(α)=2αε+Oα(1)>0G^{\prime}_{{\varepsilon}}(\alpha)=2\alpha{\varepsilon}+O_{\alpha}(1)>0 as α\alpha\to\infty. Hence αGε(α)\alpha\mapsto G_{{\varepsilon}}(\alpha) has a unique minimum α(ε)(0,)\alpha_{*}({\varepsilon})\in(0,\infty).

Finally, equation (181) follows immediately by using the conditionGε(α)=0G_{{\varepsilon}}^{\prime}(\alpha_{*})=0 in expression (C).

In our proof it is more convenient to use the coordinates (δ,ε)(\delta,{\varepsilon}) instead of (ρ,δ)(\rho,\delta). In terms of the latter, the phase boundary (2), (3) reads

Notice that the use of the symbol α(ε)\alpha_{*}({\varepsilon}) in the last equations is not an abuse of notation. Indeed comparing equation (183) with (179) we conclude that α(ε)\alpha_{*}({\varepsilon}) is indeed the unique solution of Gε(α)=0G^{\prime}_{{\varepsilon}}(\alpha)=0. Further, comparing equation (182) with equation (C) we obtain the following.

Let (δ,ρ(δ))(\delta,\rho_{*}(\delta)) be the phase boundary defined by equations (2), (3). Then, for ρ,δ\rho,\delta\in, ρ>ρ(δ)\rho>\rho_{*}(\delta) if and only if, for ε(0,1){\varepsilon}\in(0,1), δ(ε,1)\delta\in({\varepsilon},1)

Vice versa ρ<ρ(δ)\rho<\rho_{*}(\delta) if and only if δ>δ(ε)\delta>\delta_{*}({\varepsilon}).

Proof of Lemma 6(a1) We set α=α(ε)argminα0Gε(α)\alpha=\alpha_{*}({\varepsilon})\equiv\arg\min_{\alpha\geq 0}G_{{\varepsilon}}(\alpha). Hence we have, by Lemma 7, and Lemma 9,

Finally by Lemma 8 Gε(α)ε+2(1ε)Φ(α)<δG_{\varepsilon}(\alpha_{*})\equiv{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha_{*})<\delta. Since αε+2(1ε)Φ(α)\alpha\mapsto{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha) is decreasing in α\alpha, the last claim follows.

In the proof of part (a2) we will make use of the following analytical result.

Then Fα,ε\mathcal{F}_{\alpha,{\varepsilon}} is increasing and convex on $withwith\mathcal{F}_{\alpha,{\varepsilon}}(1)=1andand\mathcal{F}^{\prime}_{\alpha,{\varepsilon}}(1)<1.Inparticular,. In particular,\mathcal{F}_{\alpha,{\varepsilon}}(Q)>Qforallfor all\in[0,1)$

which is strictly positive and increasing in QQ. Hence QFα,ε(Q)Q\mapsto\mathcal{F}_{\alpha,{\varepsilon}}(Q) is increasing and strictly convex. Finally, since η(y;α)=1(yα)\eta^{\prime}(y;\alpha)=\mathbf{1}(|y|\geq\alpha), we have

where the last equality follows again by Lemma 8. This completes the proof.

We are now in position to prove part (a2) of Lemma 6.

Proof of Lemma 6(a2) Throughout the proof we fix α(α(ε,δ),α2(ε,δ))\alpha\in(\alpha_{*}({\varepsilon},\delta),\allowbreak\alpha_{2}({\varepsilon},\delta)). Let the sequence {σt2}t0\{\sigma_{t}^{2}\}_{t\geq 0} be given as per the state evolution equation (91). Define QtRt,t1/(σtσt1)Q_{t}\equiv R_{t,t-1}/(\sigma_{t}\sigma_{t-1}). By Proposition 7, QtQ_{t} is the covariance of two Gaussian random variables of variance 11. Hence Qt1|Q_{t}|\leq 1. Using equation (98) we further have

(Notice that this expectation depends on the distribution of XX_{\infty} only through ε{\varepsilon}, because of the symmetry properties of the function η\eta.)

Further, by the proof of part (a1), as tt\to\infty we have σt20\sigma^{2}_{t}\to 0 and

Comparing equations (186) and (194) we conclude that, for any QQ\in,

Further the convergence is uniform, since the functions Ft\mathcal{F}_{t} are uniformly Lipschitz; see proof of Lemma 10 above.

Consider now the sequence {Qt}t0\{Q_{t}\}_{t\geq 0}, and let Q=liminftQtQ_{*}=\lim\inf_{t\to\infty}Q_{t}. Since QtQ_{t}\in for all tt, we have QQ_{*}\in as well. We claim that in fact Q=1Q_{*}=1 and therefore limtQt=1\lim_{t\to\infty}Q_{t}=1, which implies the thesis.

where, in the last step, we used the fact that Fα,ε()\mathcal{F}_{\alpha,{\varepsilon}}(\cdot) is monotone increasing. Since Fα,ε(q)>q\mathcal{F}_{\alpha,{\varepsilon}}(q)>q for all q[0,1)q\in[0,1) by Lemma 10, we conclude that Q=1Q_{*}=1.

Before proving (a3) of Lemma 6, we establish one more technical result.

By triangular inequality, the left-hand side of equation (11) can be upper bounded as D1+D2D_{1}+D_{2} whereby

Here XX and XX_{\infty} are coupled in such a way that X=0X=0 if and only if X=0X_{\infty}=0 and the two variables have the same sign in the other case. We focus on bounding D1D_{1} since D2D_{2} can be treated along the same lines. Letting R(x;θ)=η(x;θ)xR(x;\theta)=\eta(x;\theta)-x, we have

where in the last line we used Stein’s lemma to integrate over Z2Z_{2}, and RR^{\prime} denotes derivative with respect to the first argument. Once again the two terms are treated along the same lines, and we will only consider D1,aD_{1,a}. We have

The other term in equation (C.1) is bounded by the same argument. This concludes the proof of equation (11).

The proof of equation (201) follows from equation (11) if we notice that

The last lemma has a useful consequence that we will exploit in the ensuing proof of Lemma 6(a3).

Let Fα,ε(Q)\mathcal{F}_{\alpha,{\varepsilon}}(Q) be defined as per equation (186) and Ft(Q)\mathcal{F}_{t}(Q) defined as per equation (194) with pXp_{X}, α\alpha, ε{\varepsilon} satisfying the conditions of Lemma 6(a3). Then there exists a constants B,B,b>0B,B^{\prime},b>0 depending on pXp_{X} such that

The second inequality follows from the first one using Lemma 6(a1). Using equation (201), we have

The proof of the corollary is obtained by noting that σt=Θ(σt1)\sigma_{t}=\Theta(\sigma_{t-1}) and applying equation (11) to the expectation in equation (194).

Proof of Lemma 6(a3) Define, as in the proof of part (a2), QtRt,t1/(σtσt1)Q_{t}\equiv R_{t,t-1}/(\sigma_{t}\sigma_{t-1}), and recall that

By Corollary 2, and Lemma 10, it follows that Qt1Aω2tQ_{t}\geq 1-A\overline{\omega}^{2t} for some constants A>0A>0, ω(0,1)\overline{\omega}\in(0,1). Indeed,

and the claim follows by noting that Fα,ε(1)(0,1)\mathcal{F}_{\alpha,{\varepsilon}}^{\prime}(1)\in(0,1) by Lemma 10.

Next consider the quantity in equation (102). We have

Proof of Lemma 6(b1), (b2) First notice that, with the definitions given in the previous section

Notice that the right-hand side is equal to 2/δ2/\delta for α=0\alpha=0, monotonically decreasing in α\alpha and vanishing as α\alpha\to\infty. Hence there exists αmin(ε,δ)\alpha_{\min}({\varepsilon},\delta) such that the right-hand side is smaller than 11 if and only if α>αmin(ε,δ)\alpha>\alpha_{\min}({\varepsilon},\delta). Further, σ2F(σ2,ασ)\sigma^{2}\mapsto\mathsf{F}(\sigma^{2},\alpha\sigma) is concave with F(0,0)=0\mathsf{F}(0,0)=0 and first derivative larger than 11 at σ2=0\sigma^{2}=0; cf. Lemma 7. It follows that for α>αmin(ε,δ)\alpha>\alpha_{\min}({\varepsilon},\delta) there exists a unique σ(δ,pX)\sigma_{*}(\delta,p_{X}) such that F(σ2,ασ)>σ2\mathsf{F}(\sigma^{2},\alpha\sigma)>\sigma^{2} for all σ(0,σ)\sigma\in(0,\sigma_{*}) and F(σ2,ασ)<σ2\mathsf{F}(\sigma^{2},\alpha\sigma)<\sigma^{2} for σ(σ,)\sigma\in(\sigma_{*},\infty). It follows that σt2σ\sigma^{2}_{t}\to\sigma_{*} for any σ020\sigma_{0}^{2}\neq 0. This proves the first part of claim (b1).

Notice that by the definition of αmin\alpha_{\min} given above, we have

In order to prove the second statement in (b1), we proceed analogously to part (a2), and define QtRt/(σtσt1)Q_{t}\equiv R_{t}/(\sigma_{t}\sigma_{t-1}). This sequence satisfies the recursion (193) with Ft\mathcal{F}_{t} defined as per equation (194). As tt\to\infty we have σtσ\sigma_{t}\to\sigma_{*} and hence Ft\mathcal{F}_{t} converges uniformly to a limit that we denote by an abuse of notation Fα,δ,pX\mathcal{F}_{\alpha,\delta,p_{X}}, where

Proceeding as in the proof of Lemma 10, we conclude that QFα,δ,pX(Q)Q\mapsto\mathcal{F}_{\alpha,\delta,p_{X}}(Q) is increasing and convex on $.Further[for. Further [forZ\sim\mathsf{N}(0,1)$]

Finally, for αα0(δ,pX)\alpha\geq\alpha_{0}(\delta,p_{X}),

and therefore Fα,δ,pX(Q)>Q\mathcal{F}_{\alpha,\delta,p_{X}}(Q)>Q for all Q[0,1)Q\in[0,1). Hence, proceeding again as in the proof of part (a2) we conclude that limtQt=1\lim_{t\to\infty}Q_{t}=1 and therefore limtRt,t1=σ2\lim_{t\to\infty}R_{t,t-1}=\sigma_{*}^{2} as claimed.

where F(σ2,θ)\mathsf{F}(\sigma^{2},\theta) is defined as per equation (176). Using these expressions in equation (211) we conclude that

In particular, one can check from equations (212), (213) that a stationary pointIndeed this is the unique saddle point of the function (θ1,σ2)E(θ,σ2)(\theta^{-1},\sigma^{2})\mapsto\mathcal{E}(\theta,\sigma^{2}) as it can be proved by the general minimax theorem. is given by setting σ=σ(δ,pX)\sigma=\sigma_{*}(\delta,p_{X}) and θ=θ(δ,pX)α0(δ,pX)σ(δ,pX)\theta=\theta_{*}(\delta,p_{X})\equiv\alpha_{0}(\delta,p_{X})\sigma_{*}(\delta,p_{X}).

Define E(σ2)=E(σ2,α0(δ,pX)σ)E(\sigma^{2})=\mathcal{E}(\sigma^{2},\alpha_{0}(\delta,p_{X})\sigma). Using again equations (212), (213) we get

Appendix D Reference results

The following calculus fact is used in the main text.

For all s,x>0s,x>0 we have xs(se)sexx^{s}\leq(\frac{s}{e})^{s}e^{x}.

Since f(x)=ln(x)f(x)=\ln(x) for x>0x>0 is concave, when xsx\geq s, then

This is equivalent to (x/s)sexs(x/s)^{s}\leq e^{x-s} which proves the result. The case of x<sx<s is proved similarly.

We also use an estimate on the minimum singular value of perturbed rectangular matrices, which was proved in Cucker , Theorem 1.1.

Acknowledgments

Andrea Montanari is grateful to Amir Dembo, David Donoho and Van Vu for stimulating conversations.

References