Universality in polytope phase transitions and message passing algorithms

Mohsen Bayati, Marc Lelarge, Andrea Montanari

Introduction and main results

The present paper is concerned with the asymptotic distribution of $x^{t}$ as $N\to\infty$ with $t$ fixed, and establishes the following results:

As $N\to\infty$ , the finite-dimensional marginals of the distribution of $x^{t}$ are asymptotically insensitive to the distribution of the entries of $A_{ij}$ .

The entries of $x^{t}$ are asymptotically Gaussian with zero mean, and variance that can be explicitly computed through a one-dimensional recursion that we will refer to as state evolution.

As an application, we use state evolution to prove universality of a phase transition on polytope geometry, with connections to compressed sensing. This solves, for a broad class of random matrices with independent entries, a conjecture put forward by Donoho and Tanner Donoho2005b , DonohoTannerUniversality .

In order to illustrate the usefulness of the first two technical results, we will start the presentation of our results from the third one.

Before stating our results, it is useful to comment on the special form of the iteration (1), and in particular on the role of the memory term $\mathsf{b}_{t}f(x^{t-1};t-1)$ (which is inspired from the so-called “Onsager correction” in statistical physics thouless1977solution , mezard1987spin , bolthausen2012iterative ). The function of this term is to cancel, to leading order, the effect of correlations between $x_{i}^{t+1}$ and $\{x^{s}_{i}\dvtx s\leq t\}$ . This cancelation is particularly transparent in our proof technique, whereby $x^{t}_{i}$ is expressed as a sum of monomials in $A_{jk}$ , with $1\leq j,k\leq n$ , and is indexed by labeled trees. The memory term effectively cancels the contribution of “one-step reversing” trees.

Without such memory term, the properties of the resulting iteration change crucially. In particular, it is no longer true that $x^{t}_{i}$ is approximately Gaussian as $N\to\infty$ ; see Section 2 for further clarification on this point.

A polytope $Q$ is said to be centrosymmetric if $x\in Q$ implies $-x\in Q$ . Following Donoho2005b , Donoho2005a we say that such a polytope is $k$ -neighborly if the condition below holds: {longlist}[(I)]

It was shown in a series of papers Donoho2005a , Donoho2005b , DoTa05a , DoTa05b , DoTa08 that polytope neighborliness has tight connections with the geometric properties of random point clouds, and with sparsity-seeking methods to solve underdetermined systems of linear equations. The latter are in turn central in a number of applied domains, including model selection for data analysis and compressed sensing. For the reader’s convenience, these connections will be briefly reviewed in Section 6.

If the sequence $\mathcal{Q}$ is random, we say that $\mathcal{Q}$ has weak neighborliness $\rho$ (in probability) if the above limits hold in probability.

In other words, a sequence of polytopes $\{Q^{n}\}_{n\geq 0}$ has weak neighborliness $\rho$ , if for large $n$ the $m$ -dimensional polytope $Q^{n}$ has close to the maximum possible number of $k$ faces, for all $k<m\rho(1-\xi)$ .

Note that previously the neighborliness of a polytope was defined to be the largest integer $k$ satisfying condition (I). However, in our definition, weak neighborliness refers to the fraction $k/n$ . This is due to the fact that weak neighborliness is defined in the limit $n\to\infty$ .

The existence of weakly neighborly polytope sequences is clear when $m(n)=n$ since in this case we can take $Q^{n}=C^{n}$ with $\rho=1$ , but the existence is highly nontrivial when $m$ is only a fraction of $n$ .

Then, the sequence of polytopes $\{A(n)C^{n}\}_{n\geq 0}$ has weak neighborliness $\rho_{*}(\delta)$ in probability.

A characterization of the curve $\delta\mapsto\rho_{*}(\delta)$ was provided in Donoho2005b , but we omit it here since a more explicit expression will be given below.

Motivated by applications to data analysis and signal processing, Donoho and Tanner DonohoTannerUniversality carried out extensive numerical simulations for random polytopes of the form $A(n)C^{n}$ for several choices of the distribution of $A(n)$ . They formulated a universality hypothesis according to which the conclusion of Theorem 1 holds for a far broader class of random matrices. The results of their numerical simulations were consistent with this hypothesis.

Here we establish the first rigorous result indicating universality of polytope neighborliness for a broad class of random matrices. Define the curve $(\delta,\rho_{*}(\delta))$ , $\delta\in(0,1)$ , parametrically by letting, for $\alpha\in(0,\infty)$ ,

Here we extend the scope of Theorem 1 from Gaussian matrices to matrices with independent sub-GaussianSee equation (7) for the definition of sub-Gaussian random variables. entries (not necessarily identically distributed).

Then the sequence of polytopes $\{A(n)C^{n}\}_{n\geq 0}$ has weak neighborliness $\rho_{*}(\delta)$ in probability.

It is likely that this theorem can be improved in two directions. First, a milder tail condition than sub-Gaussianity is probably sufficient. Second, we are assuming that the distribution of $A_{ij}$ has an arbitrarily small Gaussian component. This is not necessary for the upper bound on neighborliness, and appears to be an artifact of the proof of the lower bound.

The proof of Theorem 2 is provided in Section 6. By comparison, the most closely related result toward universality is by Adamczak, Litvak, Pajor, and Tomczak-Jaegermann Adamczak . For a class of matrices $A(n)$ with i.i.d. columns, these authors prove that $A(n)C^{n}$ has neighborliness scaling linearly with $n$ . This, however, does not suggest that a limit weak neighborliness exists, and is universal, as established instead in Theorem 2.

At the other extreme, universality of compressed sensing phase transitions can be conjectured from the results of the nonrigorous replica method KabashimaTanaka , RanganFletcherGoyal .

2 Universality of iterative algorithms

We will consider here and below a setting that is somewhat more general than the one described by equation (1). Following the terminology of DMM09 , we will refer to such an iteration as to the approximate message passing (AMP) iteration/algorithm.

There are several motivations for considering such a generalization. On one hand, it is necessary for the application to high-dimensional polytope geometry presented in the previous section. The reader might have noticed that the random matrix in Theorem 2 is rectangular. This is a different setting from that of iteration (1), whereby the random matrix $A$ is square and symmetric. The generalization to $x\in\mathcal{V}_{q,N}$ introduced here, with $A$ square and symmetric, covers the case of rectangular matrices as well through a suitable reduction. In a nutshell, given a rectangular random matrix $A^{\prime}$ , the reduction consists of constructing a symmetric matrix that has $A^{\prime}$ as submatrix; cf. Section 5 for details.

Additional motivations for the generalization introduced here come from the application of AMP algorithms to a variety of problems in signal processing. For instance the authors of krzakala2012statistical , donoho2013information study compressed sensing reconstruction for “spatially coupled” sensing matrices. These are random matrices with independent but not identically distributed entries. As already discussed in donoho2013information , javanmard2013state for the case of Gaussian entries, a rigorous analysis of this algorithm requires generalizing the setting of (1).

Several other applications require a generalization of iteration (1), including the analysis of generalized AMP algorithms RanganGAMP , AMP reconstruction of block-sparse signals donoho2013accurate , the analysis of phase retrieval algorithms schniter2012compressive and so on. All of these applications can be treated within the setting introduced here, although our rigorous analysis requires the use of polynomial nonlinearities.

A brief sketch of some proof ideas for the “scalar” case of equation (1) can be found in Section 2.

An AMP instance is a triple $(A,\mathcal{F},x^{0})$ where: {longlist}[(3)]

$x^{0}\in\mathcal{V}_{q,N}$ is an initial condition. Given $\mathcal{F}=\{f^{k}\dvtx k\in[N]\}$ , we define $f(\cdot;t)\dvtx\mathcal{V}_{q,N}\to\mathcal{V}_{q,N}$ that maps $v$ to $v^{\prime}=f(v;t)$ , and is given by $\mathbf{v}^{\prime}_{i}=f^{i}(\mathbf{v}_{i};t)$ for all $i\in[N]$ .

The approximate message passing orbit corresponding to the instance $(A,\mathcal{F},x^{0})$ is the sequence of vectors $\{x^{t}\}_{t\geq 0}$ , $x^{t}\in\mathcal{V}_{q,N}$ defined as follows, for $t\geq 0$ ,

Here $\mathsf{B}_{t}\dvtx\mathcal{V}_{q,N}\to\mathcal{V}_{q,N}$ is the linear operator that maps $v$ to $v^{\prime}=\mathsf{B}_{t}v$ , and is defined by

The above definition can also be summarized by the following expression for the evolution of a single coordinate under AMP:

Recall that a centered random variable $X$ is sub-Gaussian with scale factor $\sigma^{2}$ if, for all $\lambda>0$ , we have

Let $\{(A(N),\mathcal{F}_{N},x^{0,N})\}_{N\geq 1}$ be a sequence of AMP instances indexed by the dimension $N$ , with $A(N)$ a random matrix and $x^{0,N}$ a random vector. We say that the sequence is $(C,d)$ -regular (or, for short, regular) polynomial sequence if: {longlist}[(3)]

For each $N$ , the functions $f^{i}(\cdot;t)$ in $\mathcal{F}_{N}$ [possibly random, as long as they are independent from $A(N)$ , $x^{0,N}$ ] are polynomials with maximum degree $d$ and coefficients bounded by $C$ .

For each $N$ , $A(N)$ and $x^{0,N}$ are independent. Further, we have $\sum_{i=1}^{N}\exp\{\|\mathbf{x}_{i}^{0,N}\|_{2}^{2}/C\}\leq NC$ with probability converging to one as $N\to\infty$ .

We state now our universality result for the algorithm (4).

3 State evolution

Theorem 3 establishes that the behavior of the sequence $\{x^{t}\}_{t\geq 0}$ is, in the high-dimensional limit, insensitive to the distribution of the entries of the random matrix $A$ . In order to characterize this limit, we need to make some assumption on the collection of functions $\mathcal{F}_{N}$ . In particular, we need to relate the functions $\mathcal{F}_{N}$ to the functions $\mathcal{F}_{N^{\prime}}$ in order to have a high-dimensional ( $N\to\infty$ ) limit.

the function $f^{i}$ only depends on the partition index of $i\in[N]$ , and on the value of $Y(i)$ ;

the distribution of $Y(i)$ only depends on the partition index of $i\in[N]$ ;

the fractional size $|C^{N}_{a}|/N$ is $N$ -independent for large $N$ .

There are a few points to make precise, and this is done in the definition below.

for each $a\in[k]$ , we have $\lim_{N\to\infty}|C^{N}_{a}|/N=c_{a}\in(0,1)$ ;

for each $N\geq 0$ , each $a\in[k]$ and each $i\in C_{a}^{N}$ , we have $f^{i}(\mathbf{x},t)=g(\mathbf{x},Y(i),a,t)$ where $Y(1),\dots,Y(N)$ are independent random variables with $Y(i)\sim P_{a}$ whenever $i\in C^{N}_{a}$ for some $a\in[k]$ ;

With a slight abuse of notation, we will sometimes denote a converging sequence by $\{(A(N),g,x^{0,N})\}_{N\geq 0}$ . We use capital letters to denote the $Y(i)$ ’s to emphasize that they are random and do not change across iterations.

for all $a\in[k]$ . Here $Y_{a}\sim P_{a}$ , $Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a})$ and $Y_{a}$ and $Z^{t}_{a}$ are independent.

where $Z_{a}\sim\mathsf{N}(0,\Sigma_{a}^{t})$ is independent of $Y_{a}\sim P_{a}$ .

We conclude by mentioning that, following DMM09 , generalizations of algorithm (4) were studied by several groups SchniterTurbo , RanganGAMP , MalekiComplex , for a number of applications. Universality results analogous to the one proved here are expected to hold for such generalizations as well.

4 Outline of the paper

The paper is organized as follows. Before delving into the details of the analysis, Section 2 provides an informal discussion of the main proof ideas for the case $q=1$ . After some preliminary facts and notations in Section 3, Section 4 considers the AMP iteration (4) and proves Theorems 3 and 4. In order to achieve our goal, we introduce two different iterations whose analysis provides useful intermediate steps. We also prove a generalization of Theorem 4 to estimate functions of messages at two distinct times $\psi(\mathbf{x}_{i}^{t},\mathbf{x}_{i}^{s},Y(i))$ .

Section 5 proves a generalization of Theorem 4 to the case of rectangular (nonsymmetric) matrices $A$ . This is achieved by effectively embedding the rectangular matrix, into a larger symmetric matrix and applying our results for symmetric matrices.

The generalization to rectangular matrices is finally used in Section 6 to prove our result on the universality of polytope neighborliness, Theorem 2. This is done via a correspondence with compressed sensing reconstruction established in Donoho2005b , and a sharp analysis of an AMP iteration that solves this reconstruction problem.

Universality of iterative algorithms: Sketch of main ideas

Let us focus on, say, coordinate $1$ of $\bar{x}^{t}$ . An explicit calculation yields (recall that, by convention, $A_{ii}=0$ )

Consider first $t=1$ . Under our assumptions, $\bar{x}^{1}_{1}$ is a sum of i.i.d. random variables with mean and variance $1/N$ . By the central limit theorem, it converges in distribution to a standard Gaussian random variable, as predicted by Theorem 4.

Consider next $t=2$ . In equation (15) we decomposed the sum over $\{i,j\}$ in a sum over terms with $j=1$ , and a sum over terms with $j\neq 1$ . The first sum converges almost surely to $1$ by the law of large numbers. It is easy to see that the second sum has expectation equal to zero and variance equal to $(N-1)/N$ that converges to $1$ . Indeed, a slightly more complicated calculation shows that it converges to a standard Gaussian. Overall, $\mathbf{\bar{x}}^{2}_{1}$ converges in distribution to a Gaussian with mean $1$ and variance $1$ , unlike what is predicted by Theorem 4 for $\mathbf{x}^{2}_{1}$ . (Theorem 4 always predicts $\mathbf{x}^{t}_{i}$ to have asymptotically zero mean.)

Notice that the terms in the sum (15) are indexed by an ordered triple $(1,i,j)$ with $i,j\in[N]$ , $1\neq i$ , $i\neq j$ . We can identify such a triple with a length $2$ rooted (directed acyclic) path with vertices labeled by $1$ (the root), $i$ , $j$ : $j\to i\to 1$ . The terms that lead to a nonzero mean are those corresponding to $j=1$ , that is, with a one-step reversal in the order in which they visit labels of $[N]$ . These are paths of the form $1\to i\to 1$ .

Consider now adding back the memory term $\mathsf{b}_{t}f(x^{t-1};t-1)$ . It is easy to check that, in the present case [namely $f(x;t)=x$ ], equation (1) reduces to $x^{t+1}=Ax^{t}-x^{t-1}$ and, in particular

Comparing with equation (15), we see that the memory term asymptotically cancels the effect of one-step reversing paths. The same analysis can be developed, with additional labor, to subsequent iterations. At each $t$ , the memory term cancels the effect of one-step reversing paths, and the residual terms match the prediction of Theorem 4.

The proof follows a similar argument for a general polynomial $f(x)$ . As in the linear case, each coordinate $x^{t}_{i}$ can be expressed as a sum of monomials in the independent random variables $(A_{ij})_{i<j}$ . The main difference is that now these monomials are indexed by rooted trees instead of rooted paths with vertex labels in $[N]$ . To see this, consider the special case

Then, a direct calculation of iteration (1) yields

The actual proof of Theorems 3 and 4 in Section 4 follows the same intuition as above, but of course, requires several technical steps: {longlist}[(6)]

We prove that, for our purposes, the distribution of the random variable $\mathbf{x}^{t}_{i}$ is accurately approximated by the distribution of $\mathbf{z}^{t}_{i}$ ; see Proposition 3.

We prove, under the same assumptions as in our universality result, Theorem 3, the distribution of $\mathbf{z}^{t}_{i}$ is insensitive to the distribution of the matrix entries $(A_{ij})_{i<j}$ ; cf. Proposition 1. This is done by the moment method. Any moment of $\mathbf{z}^{t}_{i}$ is written as the expectation of a polynomial in the $(A_{ij})_{i<j}$ . We show that the only terms that matter are the ones in which each $A_{ij}$ appears with degree at most two. Hence the expectation only depends on the first two moments of the matrix entries, which are fixed by assumption.

Together with the previous point, this immediately implies Theorem 3.

In Proposition 2, we prove that the distribution of $\mathbf{z}_{i}^{t}$ (and hence $\mathbf{x}_{i}^{t}$ ) is, for our purposes, accurately approximated by the distribution of $\mathbf{y}^{t}_{i}$ .

Finally we exploit the fact that a fresh matrix $A$ is sampled at each iteration to prove that state evolution holds for $\mathbf{y}^{t}_{i}$ ; cf. Proposition 4.

By the previous point, this implies Theorem 4.

In the next section we introduce some basic facts and notation. We will implement the above strategy in Section 4.

Notations and basic simplifications

In our proof of Theorem 4 we will make use of the following simplification that lightens somewhat the notation.

For proving Theorem 4, it is sufficient to consider the case in which $g\dvtx(\mathbf{x},Y,a,t)\mapsto g(\mathbf{x},Y,a,t)$ is independent of $Y$ .

The basic idea of the construction is to enlarge $q$ in such a way to keep track of the value of $Y(i)$ in the a subset of the coordinates of $\mathbf{x}_{i}^{t}$ .

Proofs of Theorems 3 and 4

In this section we consider the AMP iteration (4), and prove Theorem 3 and Theorem 4, and indeed generalize the latter.

with $\widehat{\Sigma}^{t}_{a}$ defined per equation (10). For any $s,t\geq 1$ , we set recursively

Let $\{(A(N),\mathcal{F}_{N},x^{0,N})\}_{N\geq 1}$ be a polynomial and converging sequence of instances and denote by $\{x^{t}\}_{t\geq 0}$ the corresponding AMP orbit.

where $(Z^{t}_{a},Z^{s}_{a})\sim\mathsf{N}(0,\Sigma^{t,s}_{a})$ is independent of $Y_{a}\sim P_{a}$ .

The rest of this section is organized as follows. In Section 4.1 we introduce two new iterations that are useful intermediary steps for our analysis. We show that the corresponding variables admit representations as sums over trees in Section 4.2 and use them to prove basic properties of these recursions in Sections 4.3, 4.4 and 4.5. Theorems 3 and 5 are then proved in Sections 4.6, 4.7. Because of equation (23), Theorem 4 follows as a special case of Theorem 5. Indeed, we will show that both statements are equivalent through a reduction argument. Depending on the application, Theorem 5 might be a more convenient formulation of the state evolution and will be used in Section 5.

Our first result establishes universality of the moments of $\mathbf{z}^{t}_{i\to j}$ for polynomial sequences of instances.

The proof of this proposition is provided in Section 4.3.

In this statement and in the rest of this section, $K$ is always understood as a function of $d,t,q,m,C$ which may vary from line to line, but which is independent of $N$ .

The asymptotic analysis of $y^{t}$ is particularly simple because an independent random matrix $A^{t}$ is used at each iteration. In particular, it is easy to establish state evolution for $y^{t}$ . Our next result shows that $y^{t}$ provides a good approximation for $z^{t}$ .

The proof of this proposition is provided in Section 4.4.

Let $(A(N),\mathcal{F}_{N},x^{0,N})_{N\geq 1}$ be a $(C,d)$ -regular polynomial sequence of instances. Denote by $\{x^{t}\}_{t\geq 0}$ the corresponding AMP sequence and by $\{z^{t}\}_{t\geq 0}$ the sequence defined by (25) while iterating (24). Then for any $t\geq 1$ and $m(1),\dots,m(q)\geq 0$ , there exists $K$ independent of $N$ such that, for any $i\in[N]$ ,

The proof of this proposition is provided in Section 4.5.

2 Tree representation

We denote by $\overline{\mathcal{T}}^{t}$ the set of labeled trees $T$ with $t$ generations as above. We let $\mathcal{T}^{t}\subseteq\overline{\mathcal{T}}^{t}$ denote the subset of such trees that satisfy the following additional condition: {longlist}[(1)]

Let $(A(N),\mathcal{F}_{N},x^{0,N})_{N\geq 1}$ be a polynomial sequence of AMP instances. Denote by $\mathbf{z}^{t}_{i}$ the orbit defined by (25) while iterating (24) with matrix $A$ . Then

We first prove (30) by induction on $t$ . For $t=1$ we have, by definition,

To prove the induction, we start with equation (24), which yields

3 Proof of Proposition 1

We are now in position to prove Proposition 1.

For notational simplicity, we consider the case $m(r)=m$ , and $m(s)=0$ for all $s\in[q]\setminus r$ . Thanks to Lemma 1, we have

obtained by taking $\lambda=\sqrt{Ns/C}$ .

since in the product on the right-hand side of (4.3), there are at most $\mu/2$ terms different from one.

Hence the condition $\frac{1}{N}\sum_{i=1}^{N}\exp(\|\mathbf{x}^{0,N}_{i}\|_{2}^{2}/C)\leq C$ ensures that for any $p\geq 2$ ,

which completes the proof Proposition 1. Here we used the fact that all values $\mu,q$ and $\{C_{k}\}_{k=0}^{md}$ are independent of $N$ .

We end this section by showing that the term $S_{1}(A)$ can be further reduced. This result will be useful in the sequel, and we state it as the following lemma.

4 Proof of Proposition 2

The proof follows the same approach as for Proposition 1. For notational simplicity, we consider the case $m(r)=m$ and $m(s)=0$ for all $s\in[q]\setminus r$ . The general case follows by the same argument. For $\mathbf{y}$ , we are using a different matrix at each iteration and we need to define a new weight associated to trees $T\in\mathcal{T}^{t}$ as follows:

5 Proof of Proposition 3

As in the proof of Proposition 1, we will rely on a representation of $x^{t}_{i}(r)$ based on labeled trees defined as in Section 4.2. In the present case, it is, however, more convenient to work with trees from which marks have been removed, that is, we identify any two trees in which the vertex marks are different, but the types are the same. Notice that equations (30), (31) imply

where $\Gamma^{\prime}(T,\mathbf{c},t)$ is obtained by summing $\Gamma(T,\mathbf{c},t)$ over all trees $T$ that coincide up to marks. In the following, with a slight abuse of notation, we will write $\Gamma(T,\mathbf{c},t)$ instead of $\Gamma^{\prime}(T,\mathbf{c},t)$ .

Under the same assumptions as in Proposition 3, we have

for some $\widetilde{\Gamma}(T,t,r)$ which is bounded uniformly as $|\widetilde{\Gamma}(T,t,r)|\leq K(d,C,t)$ .

Following the same argument as in Lemma 1, we first prove by induction on $t$ that we can find $\widetilde{\Gamma}(T,t,r)$ such that

It remains to prove that $\widetilde{\Gamma}(T,t,r)$ agrees with the expression in Lemma 1, [cf. equations (45), (46)], for $T\in\mathcal{U}^{t}_{i}(r)$ and is zero for trees in $\overline{\mathcal{U}}^{t}\setminus\mathcal{B}^{t}_{i}$ . The proof of this fact will proceed by induction on $t$ . The cases $t=0,1$ are clear since $\mathcal{B}^{t}_{i}=\varnothing$ . For $t\geq 1$ , we define

where the last sum contains a finite number of nonzero terms.

We now show that each contribution on the right-hand side [except $z^{t+1}_{i}(r)$ ] can be written as a sum of terms $A(T)\widetilde{\Gamma}(T,t+1,r,x^{0})$ over trees $T\in\mathcal{B}^{t+1}_{i}$ that we construct explicitly.

We now treat the terms in the first line. Again, we have

6 Proof of Theorem 3

7 Proof of Theorem 5

An important simplification is provided by the following.

It is sufficient to prove Theorem 5 for $t=s$ . (Hence, Theorem 4 implies Theorem 5.)

The new sequence of initial conditions is constructed as follows: {longlist}[(3)]

The partitions $C^{N}_{a}$ , $a\in[k]$ and matrices $A(N)$ are kept unchanged.

As a consequence of this construction, $\mathbf{u}^{t}_{i}=\mathbf{x}^{t}_{i}$ for all $i\in[N]$ , $t\geq 1$ , and $\mathbf{v}^{t}_{i}=\mathbf{x}^{t-h}_{i}$ for all $t\geq h+1$ . This completes the reduction.

As a consequence of this remark, it is sufficient to prove Theorem 4, and by Remark 1, we can limit ourselves to the case in which $g\dvtx(\mathbf{x},Y,a,t)\mapsto g(\mathbf{x},Y,a,t)$ does not depend on $Y$ , and hence this argument will be dropped. We begin by considering the expectation of moments of $\mathbf{x}_{i}^{t}$ .

where $Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a})$ .

By Propositions 2 and 3, we need only to prove the statement for the AMP orbit $y^{t}$ . We will indeed prove by induction on $t$ that for any $i\in C^{N}_{a}$ and any $j\neq i$ ,

For $t\geq 1$ , let $\mathfrak{F}_{t}$ be the $\sigma$ -algebra generated by $A^{0},\dots,A^{t-1}$ . We will show, using the central limit theorem, that the random vector $(y_{i\to j}^{t+1}(1),\dots,y_{i\to j}^{t+1}(q))$ given $\mathfrak{F}_{t}$ converge in distribution to a centered Gaussian random vector. More precisely, by (27) and the induction hypothesis, the following limit holds in probability:

In the base case $t=0$ the same conclusion holds because

where the second identity holds by assumption.

Next consider the induction claim, equation (55). Recall the representation introduced in Section 4.4,

Using this representation of $\mathbf{y}^{t}_{i\to j}$ , $\mathbf{y}^{t}_{k\to j}$ it is easy to show that, for $i\neq k$ , $i,k\in C_{a}^{N}$

for some function ${\varepsilon}(N)\to 0$ as $N\to\infty$ ar $\mathbf{m},C,d,t$ fixed. Indeed, the above expectations can be represented as sums over $m=m(1)+m(2)+\cdots+m(q)$ trees $T_{1},\dots,T_{m}\in\mathcal{T}^{t}_{i\to j}$ and $m$ trees $T^{\prime}_{1},\dots,T^{\prime}_{m}\in\mathcal{T}^{t}_{k\to j}$ . Let $\mathbf{G}$ be the simple graph obtained by identifying vertices of the same type in $T_{1},\dots,T_{m},T^{\prime}_{1},\dots,\break T^{\prime}_{m}$ .

On the other hand, the number of such terms is bounded by $KN^{\mu-2}$ (because the type has to be assigned to $\mu$ vertices, but $2$ of these are fixed to $i$ and $k$ ), and hence the overall contribution of these terms vanishes as well.

Equation (55) follows for iteration $t+1$ by applying Chebyshev’s inequality to the sequence

for $i,j\leq N$ : $A^{\prime}_{ij}=A_{ij}$ and when $i>N$ or $j>N$ define $A_{ij}^{\prime}\sim\mathsf{N}(0,1/N)$ independently;

$g^{\prime}(\mathbf{x},b,t^{\prime})=g(\mathbf{x},b,t^{\prime})$ for $b\in[k]$ , $t^{\prime}\leq t-1$ ; $g^{\prime}(\mathbf{x},b,t)=0$ for $b\in[k]\setminus a$ ; $g^{\prime}_{1}(\mathbf{x},a,t)=\varphi(\mathbf{x})$ , $g^{\prime}_{r}(\mathbf{x},a,t)=0$ , for $r\geq 2$ ; $g^{\prime}(\mathbf{x},k+1,t^{\prime})=0$ for all $t^{\prime}$ .

The definition of $g^{\prime}(\mathbf{x},a,t^{\prime})$ for $t^{\prime}>t$ is irrelevant for our purposes. Since $g^{\prime}(\mathbf{x},k+1,t^{\prime})=0$ for all $t^{\prime}$ , the orbit $(\mathbf{x}_{i}^{t^{\prime}}\dvtx i\leq N,t^{\prime}\leq t)$ is not affected by the new variables. Further, by the general AMP equation (6), we have, for $i\in C_{k+1}^{N}$ ,

Notice that the $\{A_{ij}\}_{j\in C_{a}^{N}}$ in this equation are independent of $\mathbf{x}_{j}^{t}$ . Hence

On the other hand, using Proposition 4 (once for iteration $t+1$ and $i\in C_{k+1}^{N}$ , and another time for iteration $t$ and $i\in C_{a}^{N}$ ) we get

where $Z^{t}_{a}\sim\mathsf{N}(0,\Sigma^{t}_{a})$ . Comparing these equations with equation (60) we conclude that

Taking $\varphi(\mathbf{x})=\mathbf{x}^{\mathbf{k}}$ , we obtain equation (57) for $\mathbf{m}$ even. In order to establish equation (57) for general $\mathbf{m}$ we take, for instance, $\varphi(\mathbf{x})=1+{\varepsilon}\mathbf{x}^{\mathbf{m}}$ and use the fact that the limit must vanish for all ${\varepsilon}$ .

Proof of Theorem 5 By Remark 1 and Remark 2, we reduced ourselves to the case $t=s$ , and $Y(i)=0$ [equivalently, $Y(i)$ , is absent].

Proposition 4 shows the convergence of expected the moments of $\mu^{N}_{a}$ to moments that determine the Gaussian distribution. Proposition 5 combined with Chebyshev inequality implies

in probability. The proof follows using the relation between convergence in probability and convergence almost sure along subsequences, together with the moment method.

Nonsymmetric matrices

In this section we consider a slightly different setting that turns out to be a special case of the one introduced in Section 1.3.

A converging sequence of (polynomial) bipartite AMP instances $\{(A(n),f,h,x^{0,n})\}_{n\geq 1}$ is defined by giving for each $n$ : {longlist}[(3)]

Throughout this section, we will refer to nonbipartite AMP instances as per Definition 5, as to symmetric instances. With these ingredients, we define the AMP orbit as follows.

The approximate message passing orbit corresponding to the bipartite instance $(A,f,h,x^{0})$ is the sequence of vectors $\{x^{t},z^{t}\}_{t\geq 0}$ , $x^{t}\in\mathcal{V}_{q,n}$ $z^{t}\in\mathcal{V}_{q,m}$ defined as follows, for $t\geq 0$ ,

where $f(\cdot)$ , $h(\cdot)$ are applied componentwise; see below for an explicit formulation. Here $\mathsf{B}_{t}\dvtx\mathcal{V}_{q,m}\to\mathcal{V}_{q,m}$ is the linear operator that maps $v$ to $v^{\prime}=\mathsf{B}_{t}v$ , and for any $j\in[m]$ satisfies,

Analogously $\mathsf{D}_{t}\dvtx\mathcal{V}_{q,n}\to\mathcal{V}_{q,n}$ is the linear operator defined by letting, for $v^{\prime}=\mathsf{D}_{t}v$ , and any $j\in[n]$ ,

For the sake of clarity, it is useful to rewrite the iteration (67), (68) explicitly, by components

We will state and prove a state evolution result that is analogous to Theorem 5 for the present case. Since the proof is by reduction to the symmetric case, the same argument also implies a universality statement of the type of Theorem 3. However, we will not state explicitly any universality statement in this case. We begin by introducing the appropriate state evolution recursion. In analogy with equation (10), we introduce two sequences of positive semidefinite matrices $\{\Sigma^{t}\}_{t\geq 0}$ , $\{\Xi^{t}\}_{t\geq 0}$ by letting $\Xi^{0}$ be given as per equation (66) and defining, for all $t\geq 1$ ,

We also define a two-times recursion analogous to equations (21), (4). Namely, we introduce the boundary condition

with $\Xi^{t}$ defined per equations (5), (5). For any $s,t\geq 1$ , we set recursively

(Recall that $[u,v]$ denotes the column vector obtained by concatenating $u$ and $v$ .)

Let $\{(A(n),f,h,x^{0,n})\}_{n\geq 1}$ be a polynomial and converging sequence of bipartite AMP instances, and denote by $\{x^{t},z^{t}\}_{t\geq 0}$ the corresponding AMP orbit.

where $(X^{t},X^{s})\sim\mathsf{N}(0,\Sigma^{t,s})$ is independent of $Y\sim Q$ , and $(Z^{t},Z^{s})\sim\break\mathsf{N}(0,\Xi^{t,s})$ is independent of $W\sim P$ .

The proof follows by constructing a suitable polynomial and converging sequence of symmetric instances, recognizing that a suitable subset of the resulting orbit corresponds to the orbit $\{x^{t},z^{t}\}$ of interest, and applying Theorem 5.

We partition the index set in $k=2$ subsets: $[N]=C^{N}_{1}\cup C_{2}^{N}$ , with $C^{N}_{1}=\{1,\dots,m\}$ and $C^{N}_{2}=\{m+1,\dots,m+n\}$ . In particular $c_{1}=\delta/(1+\delta)$ and $c_{2}=1/(1+\delta)$ .

The symmetric random matrix $A^{\prime}$ is given by

In particular $W_{11}=W_{22}=0$ and $W_{12}=W_{21}=(1+\delta)/\delta$ .

The vertex labels are $Y_{s}(i)=W(i)$ for $i\leq m$ and $Y_{s}(i)=Y(i-m)$ for $i>m$ . In particular, these are independent random variables with distribution $Y_{s}(i)\sim P_{1}=Q$ if $i\in C_{1}^{N}$ and $Y_{s}(i)\sim P_{2}=P$ if $i\in C_{2}^{N}$ .

The initial condition is given by $\mathbf{x}_{s,i}^{0,N}=0$ for $i\in C_{1}^{N}$ and $\mathbf{x}_{s,i}^{0,N}=\mathbf{x}^{0,n}_{i-m}$ for $i\in C_{2}^{N}$ .

The definition of $g(\mathbf{x},Y,a=1,2t+1)$ and $g(\mathbf{x},Y,a=2,2t)$ is irrelevant for our purposes. The proof is concluded by recognizing that, for all $t\geq 0$ ,

We finish this section with a lemma that establishes continuity of the AMP trajectories with respect to Gaussian perturbations of the matrix $A$ . This fact will be used in the next section. (Notice that an analogous lemma holds by the same argument for converging, nonbipartite, instances.)

Enumerating the edges in $T$ as $(u_{1},v_{1}),\dots,(u_{k},v_{k})$ the quantity in parenthesis reads

Proof of universality of polytope neighborliness

In this section we prove Theorem 2, deferring several technical steps to the Appendices.

Notice that these matrices differ by a factor $1/\sqrt{m}$ from the matrices in the statement of Theorem 2. Since neighborliness is invariant under scale transformations, this change is immaterial.

This is indeed a rephrasing of Theorem 2 in Donoho2005a .

In view of this result, Theorem 2 follows from the following result on compressed sensing with random sensing matrices.

Consider either of the following two cases: {longlist}[(II)]

The matrix $A(n)$ has i.i.d. entries, and $\{x_{0}(n)\}_{n\geq 1}$ is any fixed sequence of vectors with $\lim_{n\to\infty}\|x_{0}(n)\|_{0}/m(n)=\rho$ .

The rest of this section is devoted to the proof of Theorem 8. Indeed, as shown below, this immediately implies Theorem 2.

Using Markov’s inequality, equations (87), (88) coincide (resp.) with assumptions (1) and (2) in Theorem 7. The claim follows by applying this theorem.

For $c\in(0,1)$ , let $S(c)\equiv\{i\in[n]\dvtx|v_{i}|\geq 1-c\}$ . Then, for any $S^{\prime}\subseteq[n]$ , $|S^{\prime}|\leq c_{1}n$ , the minimum singular value of $A_{S(c_{1})\cup S^{\prime}}$ satisfies $\sigma_{\min}(A_{S(c_{1})\cup S^{\prime}})\geq c_{2}$ .

The proof of this lemma is deferred to Appendix B.

The proof of Theorem 8 consists of two parts. For $\rho>\rho_{*}(\delta)$ , we shall exhibit a vector $x$ with $\|x\|_{1}<\|x_{0}\|_{1}$ and $y=Ax$ . For $\rho<\rho_{*}(\delta)$ we will show that assumptions of Lemma 5 hold. In particular, we will construct a subgradient $v$ as per assumption 5. In both tasks, we will use an iterative message passing algorithm analogous to the one in Section 5. The algorithm is defined by the following recursion initialized with $x^{0}=0$ :

where $\eta(u;\theta)=\operatorname{sign}(u)(u-\theta)_{+}$ , $\alpha$ is a nonnegative constant and $\mathsf{b}_{t}$ is a diagonal matrix whose precise definition is immaterial here and will be given in the proof of Proposition 6 below. Notice two important differences with respect to the treatment in Section 5:

The iteration in equations (89), (90) does not take immediately the form in equations (67), (68). For instance the nonlinear mapping $\eta(\cdot;\alpha\sigma_{t})$ is applied after multiplication by $A^{\mathsf{T}}$ . This mismatch can be resolved by a simple change of variables.

The nonlinear mapping $\eta(\cdot;\alpha\sigma_{t})$ is not a polynomial. This point will be addressed by constructing suitable polynomial approximations of $\eta$ .

We refer to Appendix A for further details.

For $t\geq 0$ , $\sigma_{t}$ is defined by the one-dimensional recursion

There exists a diagonal matrix $\mathsf{b}_{t}=\mathsf{b}_{t}(n)$ such that

We also need a generalization of the last proposition for functions of the estimates $x^{t}$ , $x^{s}$ at two distinct iteration numbers $t\neq s$ . To this objective, we introduce the generalization of the state evolution equation (91). Namely, we define $\{R_{s,t}\}_{s,t\geq 0}$ recursively for all $s,t\geq 0$ by letting

with $Z_{t}\sim\mathsf{N}(0,R_{t,t})$ . This uniquely determines the doubly infinite array $\{R_{t,s}\}_{t,s\geq 0}$ . Notice in particular that $R_{t,t}=\sigma_{t}^{2}$ for all $t\geq 0$ . (This is easily checked by induction over $t$ .)

The proof of this proposition is in Appendix A.

Finally, we need some analytical estimates on the recursions (91) and (98). Part of these estimates were already proved in DMM09 , NSPT , BayatiMontanariLASSO , but we reproduce them here for the reader’s convenience. Proofs of the others are provided in Appendix C.

If $\rho<\rho_{*}(\delta)$ , then: {longlist}[(a3)]

There exists $\alpha_{1}({\varepsilon},\delta)$ , $\alpha_{2}({\varepsilon},\delta)$ , $\alpha_{*}({\varepsilon})$ with $0<\alpha_{1}({\varepsilon},\delta)<\alpha_{*}({\varepsilon})<\alpha_{2}({\varepsilon},\allowbreak\delta)<\infty$ , and $\omega_{*}({\varepsilon},\delta)\in(0,1)$ such that the following happens. For each $\alpha\in(\alpha_{1},\alpha_{2})$ , $\sigma_{t}^{2}=B\omega^{t}(1+o_{t}(1))$ as $t\to\infty$ , with $\omega\in(0,1)$ .

Further, for each $\omega\in[\omega_{*}({\varepsilon},\delta),1)$ there exists $\alpha_{-}\in(\alpha_{1},\alpha_{*}]$ and $\alpha_{+}\in[\alpha_{*},\alpha_{2})$ (distinct as long as $\omega>\omega_{*}$ ) such that, letting $\alpha\in\{\alpha_{-},\alpha_{+}\}$ , $\sigma_{t}^{2}=B\omega^{t}(1+o_{t}(1))$ .

Finally, for all $\alpha\in[\alpha_{*},\alpha_{2})$ , we have ${\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha)<\delta$ .

For any $\alpha\in[\alpha_{*}({\varepsilon}),\alpha_{2}({\varepsilon},\delta))$ , we have $\lim_{t\to\infty}R_{t,t-1}/(\sigma_{t}\sigma_{t-1})=1$ .

For any $\alpha>\alpha_{\min}(\delta)$ , we have $\lim_{t\to\infty}\sigma_{t}^{2}=\sigma_{*}^{2}>0$ and, for $\alpha\geq\alpha_{0}$ , $\lim_{t\to\infty}[R_{t,t}-2R_{t,t-1}+R_{t-1,t-1}]=0$ .

We are now in position to prove Theorem 8. For greater convenience of the reader, we distinguish the cases $\rho<\rho_{*}(\delta)$ and $\rho>\rho_{*}(\delta)$ . Before considering these cases, we will establish some common simplifications.

Summarizing this discussion, in order to prove the Theorem both in case (I) and case (II), it will be sufficient to do so for the following setting:

In the proof of Theorem 8, we can assume the vectors $x_{0}(n)$ to be random with i.i.d. entries with common distribution $p_{X}=(1-{\varepsilon})\delta_{0}+{\varepsilon}\gamma$ .

Notice that, by definition of the function $\eta(\cdot;\cdot)$ we have $|x^{t-1}_{i}-(A^{\mathsf{T}}z^{t-1})_{i}-\widehat{x}^{t}_{i}|\leq\theta_{t-1}$ , and hence $v^{t}\in\partial\|x_{0}\|_{1}$ . We can write

This part of the proof is completed by showing that there exists $h(t)$ with $\lim_{t\to\infty}h(t)=0$ such that, for each $t$ , with high probability we have $\|\xi^{t}\|_{2}^{2}/n\leq(1-\sqrt{\omega})^{2}/\alpha^{2}+h(t)$ , $\|\beta^{t}\|_{2}^{2}/n\leq h(t)$ and $\|\zeta^{t}\|_{2}^{2}/n\leq h(t)$ . Indeed, if this is true, we can then choose $t$ sufficiently large and $\alpha\in(\alpha_{*}({\varepsilon}),\alpha_{2}({\varepsilon},\delta))$ so that $\|\xi^{t}+\beta^{t}+\zeta^{t}\|_{2}^{2}$ is small enough as to satisfy the condition 5 of Lemma 5.

First consider $\xi^{t}$ . Applying Proposition 7 to $\psi(x,y_{1},y_{2})=(y_{1}-y_{2})^{2}$ , we have, in probability

Here the last equality follows from the fact that $\sigma^{2}_{t}/\sigma^{2}_{t-1}\to\omega$ by Lemma 6(a1) and $R_{t,t-1}/(\sigma_{t}\sigma_{t-1})\to 1$ by Lemma 6(a2). This implies the claim for $\xi^{t}$ .

Next, consider $\beta^{t}$ . By Proposition 6(2),

and hence $\|\beta^{t}\|_{2}^{2}/n\leq h(t)$ with high probability for any $h(t)>0$ .

Finally consider $\zeta^{t}$ , and define $R(y;\theta)=y-\eta(y;\theta)$ . We have

Notice that this apparently requires applying Proposition 6 to the function $\psi(x,y)=[\operatorname{sign}(x)-R(y;\theta)]^{2}\mathbf{1}_{x\neq 0}$ which is non-Lipschitz in $x$ . However, we can define a Lipschitz approximation, with parameter $r>0$ ,

Notice that $\psi_{r}$ is bounded and Lipschitz continuous. We further have $|\psi_{r}(x,y)-\psi(x,y)|\leq 4\mathbf{1}(x\neq 0;|x|\leq r)$ , whence

The last inequality holds almost surely by the law of large numbers using $\gamma([-r,r])<2r$ . Analogously,

Hence the claim (110) follows by applying Proposition 6(3) to $\psi_{r}(x,y)$ , using equations (6.2), (6.2), and letting $r\to 0$ .

We conclude by noting that the right-hand side of equation (110) converges to as $t\to\infty$ by dominated convergence, since $\sigma_{t}\to 0$ . Therefore,

This completes our proof of assumption 5 of Lemma 5.

We finally consider hypothesis 5. Let $S_{t}(c)$ be defined as there, for the subgradient $v^{t}$ , namely

We claim that the following two claims hold for some $t_{*}\geq 0$ independent of $n$ :

There exists $c_{1},\hat{c}_{2}>0$ (independent of $\nu$ ) such that for all $S^{\prime}\subseteq[n]$ , $|S^{\prime}|\leq 2c_{1}n$ , the minimum singular value of $A_{\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime}}$ , satisfies $\sigma_{\min}(A_{\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime}})\geq\hat{c}_{2}\nu$ with probability converging to $1$ as $n\to\infty$ .

where $o_{1}(t_{*},\nu)$ vanishes as $\nu\to 0$ at $t_{*}$ , $c_{1}$ , $c_{2}$ fixed, and $o_{2}(t_{*},\nu;n^{-1})$ vanishes as $n^{-1}\to 0$ at $t_{*}$ , $\nu$ , $c_{1}$ , $c_{2}$ fixed.

These claims immediately imply that hypothesis 5 of Lemma 5 holds with probability converging to one as $n\to\infty$ . Indeed, if $|S^{\prime}|\leq nc_{1}$ , then (by Claim 2) $S_{t}(c_{1})\cup S^{\prime}\subseteq\widetilde{S}_{t_{*}}(2c_{1})\cup S^{\prime\prime}$ where $|S^{\prime\prime}|\leq 2nc_{1}$ with probability larger than $1-o_{1}(t_{*};\nu)-o_{2}(t_{*},\nu;n^{-1})$ . By Claim 1, we hence have $\sigma_{\min}(A_{S_{t}(c_{1})\cup S^{\prime}})\geq c_{2}\equiv\hat{c}_{2}\nu$ . The thesis follows since $\nu$ can be chosen as small as we want. [Notice that once $t_{*}$ is fixed to satisfy these claims, we can still choose $t\geq t_{*}$ arbitrarily to satisfy hypothesis 5 of Lemma 5, as per the argument above.]

In order to prove Claim 1, above first notice that, for any $b\geq 0$

where in the last line $H(c)$ denotes the binary entropy of $b$ , and we used ${{n}\choose{nc}}\leq\exp\{nH(c)\}$ . We want to show that $t_{*}$ , $b$ , $c_{1}$ , $c_{2}$ , $\nu$ can be chosen so that both contributions vanish as $n\to\infty$ .

with $a_{1}=a_{1}((b+2c_{1})/\delta)$ bounded as long as $(b+2c_{1})/\delta<1$ . We can therefore select $\hat{c}_{2}=1/(2a_{1})$ and select $c_{1}$ small enough so that $H(2c_{1})\leq(1/2)(\delta-b-2c_{1})\log 2$ . This ensures that the first term in equation (6.2) vanishes as $n\to\infty$ .

We are left with the task of selecting $b\in(0,\delta)$ , $t_{*}\geq 0$ , so that the second term vanishes as well, since then we can take $c_{1}\in(0,(\delta-b)/2)$ . To this hand notice that by Proposition 6 (and using the fact that $X+\sigma_{t-1}Z$ has a density) we have, in probability,

and further, since $\sigma_{t}\to 0$ as $t\to\infty$ [cf. Lemma 6(a1)] and $\theta_{t}=\alpha\sigma_{t}$ , we have

On the other hand, by Lemma 6(a1), and since $\alpha\in[\alpha_{*},\alpha_{2})$ , we have ${\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha)<\delta$ . Hence there exist $b_{0}\in(0,\delta)$ and $c_{1}>0$ so that for all $t_{*}$ large enough $|S_{t_{*}}(3c_{1})|\leq nb_{0}$ with high probability. Taking $b\in(b_{0},\delta)$ and using Markov’s inequality (with $t_{*}^{\prime}=t_{*}-1$ )

where the last inequality follows from Proposition 6(4). These terms can be made arbitrarily small by choosing $\nu$ small and $n$ large enough.

In order to complete the proof, we need to show that Claim 2 holds for eventually larger $t_{*}$ . First notice that, applying again Proposition 6(4), we get

By Proposition 7, and using the fact that the vector $(X+Z_{t_{*}},X+Z_{t})$ has a density, we have, in probability,

where, by Lemma 6(a3), $h(t_{*})$ vanishes as $t_{*}\to\infty$ . Given any $c_{1}>0$ , we can therefore choose $t_{*}$ so that, with high probability $|S_{t}(c_{1})\setminus S_{t_{*}}(c_{1})|\leq nc_{1}/2$ . Combining with equation (6.2), we obtain the desired claim.

Fix a small number $h>0$ . By Lemma 6(b), there exists $\Delta=\Delta(\delta,{\varepsilon})>0$ independent of $h$ , such that, for $\alpha=\alpha_{0}(\delta,p_{X})$ and $t$ large enough,

as well as $\sigma_{t-1}^{2}\leq 2\sigma_{*}^{2}$ . By Propositions 6, 7 (and noting that $X+\sigma_{t}Z$ has a distribution that is absolutely continuous with respect to Lebesgue measure), we have, with high probability,

Namely equation (119) implies (122), equation (120) implies (123), equation (121) implies (124) and the assumption $\sigma_{t-1}^{2}\leq 2\sigma_{*}^{2}$ implies (125).

Using equation (92) together with the above, we get

Appendix A Proofs of Propositions 6 and 7

In this Appendix we prove Propositions 6 and 7 by a suitable application of Theorem 6. Before passing to these proofs, we establish a corollary of Theorem 6 that allows us to control iterations of the form (89), (90), with $\eta(\cdot;\cdot)$ replaced by a general polynomial.

Let $x^{0}=(I-D^{-1})x_{0}$ (notice that $D$ is invertible with high probability) and define iteratively

We next introduce the corresponding state evolution recursion. Namely, we define $\{\widetilde{R}_{s,t}\}_{s,t\geq 0}$ recursively for all $s,t\geq 0$ by letting

with $Z_{t}\sim\mathsf{N}(0,\widetilde{R}_{t,t})$ . This uniquely determines the doubly infinite array $\{\widetilde{R}_{t,s}\}_{t,s\geq 0}$ .

which coincides with equation (1) after a redefinition of the function $\psi$ .

A.2 Proofs of Propositions 6 and 7

We will start by proving Proposition 6. Since Proposition 7 follows from the same construction, we will only point to the necessary modifications. Before presenting the proof, we recall a basic result in weighted polynomial approximation (here stated for a specific case); see, for example, Lubinsky .

Proof of Propositions 6 Since the proposition holds as $n\to\infty$ at $t$ fixed, we shall assume throughout that $t\in\{0,1,\dots,t_{\max}\}$ for some fixed arbitrarily large $t_{\max}$ .

We claim that, for each $\beta,t_{\max}>0$ , we can construct an orbit $\{x^{\beta,t},z^{\beta,t}\}_{t\geq 0}$ obeying equations (A.1), (132) for suitable functions $\eta_{t}=\eta_{t}^{(\beta)}$ such that the following holds (with a slight abuse of notation we will drop the parameter $\beta$ from $x^{\beta,t},z^{\beta,t}$ ). For all $0\leq t\leq t_{\max}$ , and all functions $\psi$ as in the statement, we have $z^{t}=y-Ax^{t}+\mathsf{b}_{t}z^{t-1}$ by construction. Further, in probability,

In order to prove equations (148) to (150) we proceed as follows. It is easy to check that $\sigma_{t}>0$ for all $t$ ; cf. equation (91). We use Theorem 9 to construct polynomials $\eta_{t}$ such that

Given polynomials as defined by (154), we define $\{x^{t},z^{t}\}_{t\geq 0}$ as per equations (A.1), (132), with the initial condition given there. Equation (150) follows immediately from Corollary 1 for $\xi$ sufficiently small. Equation (149) also follows from the same corollary, by taking

and then using once again equation (154) on the resulting expression.

Finally, consider equation (148). For economy of notation, we write

Using the tree representation in Section 4.2, it is not hard to prove that the expectation on the right-hand side is bounded as follows:

Consider, for instance, the first case, $p$ , $q$ , $r$ , $s$ distinct. Using Lemma 3, each of $\varphi_{p}$ , $\varphi_{q}$ , $\varphi_{r}$ , $\varphi_{s}$ can be represented as a sum over trees with root type respectively at $p$ , $q$ , $r$ , $s$ . The weight of these trees is as in Lemma 3, times the prefactor $(A_{ip}^{2}-m^{-1})\cdots(A_{is}^{2}-m^{-1})$ . Let $\mu$ be the total number of edges in these trees, plus $8$ (two for each of the additional factors). Then any nonvanishing contribution is of order $n^{-\mu/2}$ . Let $\mathbf{G}$ be the graph obtained by identifying the vertices of the same type in these trees, and $e(\mathbf{G})$ the number of its edges. Since each edge in $\mathbf{G}$ must be covered at least twice by the trees to get a nonzero expectation, and the edges in $(i,p),\dots,(i,s)$ at least once, we have $2e(\mathbf{G})+4\leq\mu$ . The number of vertices in $\mathbf{G}$ is at most $e(\mathbf{G})+1$ (note that $\mathbf{G}$ is connected because it includes type $i$ connected to $p,q,r,s$ ). Of these vertices all but $5$ (whose type is $i$ , $p$ , $q$ , $r$ , $s$ ) can take an arbitrary type, yielding a combinatorial factor of order $n^{e(\mathbf{G})+1-5}\leq n^{\mu/2-6}$ . Hence the sum over trees is of order $n^{-\mu/2}n^{\mu/2-6}=n^{-6}$ as claimed.

Since by standard concentration bounds $\max_{i\in[n]}D_{ii}$ , $\min_{i\in[n]}D_{ii}\to 1$ , we obtain, in probability,

where, in the last step, we applied Corollary 1 to the polynomials $\eta^{\prime}_{t-1}$ , and $X\sim p_{X}$ , $Z\sim\mathsf{N}(0,1)$ are independent. We are left with the task of showing that, by taking $\xi$ small enough in equation (154), we can ensure that

Indeed integrating by parts with respect to $Z$ the above difference can be written as (for $K$ a finite constant that can depend on $t$ and change from line to line)

The claim follows by noting that, as argued above $|\sigma_{t-1}-\widetilde{\sigma}_{t-1}|\leq K^{\prime}\xi$ .

Proof of Claim 3 To be definite we will focus on equation (162).

We finally sketch the proof of Proposition 7.

Proof of Proposition 7 The sequence $\{x^{t},z^{t}\}_{t\geq 0}$ is constructed as in the previous statement. The proof hence follow by using Corollary 1, and taking $\xi$ small enough in equation (154), since we can ensure that $|\widetilde{R}_{t,s}-R_{t,s}|\leq\beta^{\prime}$ for any $\beta^{\prime}>0$ and any $t,s\leq t_{\max}$ (as shown above for the case $t=s$ ).

Appendix B Proof of Lemma 5

Throughout the proof we denote by $C_{1}$ , $C_{2}$ , $C_{3}$ etc., positive constants that depend uniquely on $c_{1},\dots,c_{3}$ .

and denote by $\widehat{x}$ any minimizer. Further, let $v$ be a subgradient as in the statement, and define, for some $c\in(0,1)$ ,

Also, let $\overline{S}(c)=[n]\setminus S(c)$ be the complement of this set. Notice that, by definition of subgradient, we have $v_{i}=\operatorname{sign}(x_{0,i})$ for all $i\in S$ and $|v_{0,i}|\leq 1$ for all in $\overline{S}\equiv[n]\setminus S$ . This implies that $S\subseteq S(c)$ .

Since $\overline{S}(c)\subseteq\overline{S}$ , we have $x_{0,\overline{S}(c)}=0$ and hence

On the other hand, $v_{S(c)}$ is in the subgradient of $\|x_{S(c)}\|_{1}$ at $x_{S(c)}=x_{0,S(c)}$ . Hence $R_{1}\geq 0$ . It follows that equation (169) implies $\|\widehat{x}\|_{1}\geq\|x_{0}\|_{1}+\langle v,(\widehat{x}-x_{0})\rangle+c\|\widehat{x}_{\overline{S}(c)}\|_{1}$ . Since $\widehat{x}$ is a minimizer, we thus get

where in the last step we used Cauchy–Schwarz together with assumption 5. Hereafter we let $r\equiv\widehat{x}-x_{0}$ .

Fix $c=c_{1}$ . Since $\overline{S}(c)\subseteq\overline{S}$ , we have $r_{\overline{S}(c)}=\widehat{x}_{\overline{S}(c)}$ , and using equation (173), we conclude that there exists $C_{1}\leq 4/c_{1}^{3}$ such that

On the other hand, by definition $Ar=0$ , and hence $A_{S_{+}}r_{S_{+}}+A_{\overline{S}_{+}}r_{\overline{S}_{+}}=0$ . Since $\overline{S}(c)\subseteq\overline{S}$ , we have $S\subseteq S(c)\subseteq S_{+}$ . Further $S_{+}\setminus S(c)=S_{1}$ , whence $|S_{+}\setminus S(c)|\leq nc=nc_{1}$ . By assumption 5, we have $\sigma_{\min}(A_{S_{+}})\geq c_{2}$ and therefore,

Combining this with equation (175), we deduce that $\|r\|_{2}\leq C_{2}{\varepsilon}\|r\|_{2}$ for some $C_{2}=C_{2}(c_{1},c_{2},c_{3})$ , which in turns implies $r=0$ provided that $C_{2}{\varepsilon}<1$ . The claim hence follows for ${\varepsilon}_{0}=1/[2C_{2}(c_{1},c_{2},c_{3})]$ .

Appendix C Asymptotic analysis of state evolution: Proof of Lemma 6

where expectation is taken with respect to the independent random variables $X\sim p_{X}$ and $Z\sim\mathsf{N}(0,1)$ . When necessary, we will indicate the dependency on $p_{X}$ by $\mathsf{F}(\sigma^{2},\theta;p_{X})$ . With this notation the state evolution recursion reads $\sigma^{2}_{t+1}=\mathsf{F}(\sigma^{2}_{t},\alpha\sigma_{t})$ . The following properties of the function $\mathsf{F}$ were proved in DMM09 ; see also BayatiMontanariLASSO , Appendix A, for a more explicit treatment.

For any $\alpha>0$ , the mapping $\sigma^{2}\mapsto\mathsf{F}(\sigma^{2},\alpha\sigma)$ is monotone increasing and concave with $\mathsf{F}(0,0)=0$ and

The first two derivatives of $\alpha\mapsto G_{{\varepsilon}}(\alpha)$ will be used in the proof

By inspection of equation (180), $G^{\prime\prime}_{{\varepsilon}}(\alpha)>0$ for all $\alpha>0$ , hence $G_{{\varepsilon}}(\alpha)$ is strictly convex. Further, from equation (179), we have $G^{\prime}_{{\varepsilon}}(0)=-4(1-{\varepsilon})\phi(0)<0$ and $G^{\prime}_{{\varepsilon}}(\alpha)=2\alpha{\varepsilon}+O_{\alpha}(1)>0$ as $\alpha\to\infty$ . Hence $\alpha\mapsto G_{{\varepsilon}}(\alpha)$ has a unique minimum $\alpha_{*}({\varepsilon})\in(0,\infty)$ .

Finally, equation (181) follows immediately by using the condition $G_{{\varepsilon}}^{\prime}(\alpha_{*})=0$ in expression (C).

In our proof it is more convenient to use the coordinates $(\delta,{\varepsilon})$ instead of $(\rho,\delta)$ . In terms of the latter, the phase boundary (2), (3) reads

Notice that the use of the symbol $\alpha_{*}({\varepsilon})$ in the last equations is not an abuse of notation. Indeed comparing equation (183) with (179) we conclude that $\alpha_{*}({\varepsilon})$ is indeed the unique solution of $G^{\prime}_{{\varepsilon}}(\alpha)=0$ . Further, comparing equation (182) with equation (C) we obtain the following.

Let $(\delta,\rho_{*}(\delta))$ be the phase boundary defined by equations (2), (3). Then, for $\rho,\delta\in$ , $\rho>\rho_{*}(\delta)$ if and only if, for ${\varepsilon}\in(0,1)$ , $\delta\in({\varepsilon},1)$

Vice versa $\rho<\rho_{*}(\delta)$ if and only if $\delta>\delta_{*}({\varepsilon})$ .

Proof of Lemma 6(a1) We set $\alpha=\alpha_{*}({\varepsilon})\equiv\arg\min_{\alpha\geq 0}G_{{\varepsilon}}(\alpha)$ . Hence we have, by Lemma 7, and Lemma 9,

Finally by Lemma 8 $G_{\varepsilon}(\alpha_{*})\equiv{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha_{*})<\delta$ . Since $\alpha\mapsto{\varepsilon}+2(1-{\varepsilon})\Phi(-\alpha)$ is decreasing in $\alpha$ , the last claim follows.

In the proof of part (a2) we will make use of the following analytical result.

Then $\mathcal{F}_{\alpha,{\varepsilon}}$ is increasing and convex on $ $with$ \mathcal{F}_{\alpha,{\varepsilon}}(1)=1 $and$ \mathcal{F}^{\prime}_{\alpha,{\varepsilon}}(1)<1 $. In particular,$ \mathcal{F}_{\alpha,{\varepsilon}}(Q)>Q $for all$ \in[0,1)$

which is strictly positive and increasing in $Q$ . Hence $Q\mapsto\mathcal{F}_{\alpha,{\varepsilon}}(Q)$ is increasing and strictly convex. Finally, since $\eta^{\prime}(y;\alpha)=\mathbf{1}(|y|\geq\alpha)$ , we have

where the last equality follows again by Lemma 8. This completes the proof.

We are now in position to prove part (a2) of Lemma 6.

Proof of Lemma 6(a2) Throughout the proof we fix $\alpha\in(\alpha_{*}({\varepsilon},\delta),\allowbreak\alpha_{2}({\varepsilon},\delta))$ . Let the sequence $\{\sigma_{t}^{2}\}_{t\geq 0}$ be given as per the state evolution equation (91). Define $Q_{t}\equiv R_{t,t-1}/(\sigma_{t}\sigma_{t-1})$ . By Proposition 7, $Q_{t}$ is the covariance of two Gaussian random variables of variance $1$ . Hence $|Q_{t}|\leq 1$ . Using equation (98) we further have

(Notice that this expectation depends on the distribution of $X_{\infty}$ only through ${\varepsilon}$ , because of the symmetry properties of the function $\eta$ .)

Further, by the proof of part (a1), as $t\to\infty$ we have $\sigma^{2}_{t}\to 0$ and

Comparing equations (186) and (194) we conclude that, for any $Q\in$ ,

Further the convergence is uniform, since the functions $\mathcal{F}_{t}$ are uniformly Lipschitz; see proof of Lemma 10 above.

Consider now the sequence $\{Q_{t}\}_{t\geq 0}$ , and let $Q_{*}=\lim\inf_{t\to\infty}Q_{t}$ . Since $Q_{t}\in$ for all $t$ , we have $Q_{*}\in$ as well. We claim that in fact $Q_{*}=1$ and therefore $\lim_{t\to\infty}Q_{t}=1$ , which implies the thesis.

where, in the last step, we used the fact that $\mathcal{F}_{\alpha,{\varepsilon}}(\cdot)$ is monotone increasing. Since $\mathcal{F}_{\alpha,{\varepsilon}}(q)>q$ for all $q\in[0,1)$ by Lemma 10, we conclude that $Q_{*}=1$ .

Before proving (a3) of Lemma 6, we establish one more technical result.

By triangular inequality, the left-hand side of equation (11) can be upper bounded as $D_{1}+D_{2}$ whereby

Here $X$ and $X_{\infty}$ are coupled in such a way that $X=0$ if and only if $X_{\infty}=0$ and the two variables have the same sign in the other case. We focus on bounding $D_{1}$ since $D_{2}$ can be treated along the same lines. Letting $R(x;\theta)=\eta(x;\theta)-x$ , we have

where in the last line we used Stein’s lemma to integrate over $Z_{2}$ , and $R^{\prime}$ denotes derivative with respect to the first argument. Once again the two terms are treated along the same lines, and we will only consider $D_{1,a}$ . We have

The other term in equation (C.1) is bounded by the same argument. This concludes the proof of equation (11).

The proof of equation (201) follows from equation (11) if we notice that

The last lemma has a useful consequence that we will exploit in the ensuing proof of Lemma 6(a3).

Let $\mathcal{F}_{\alpha,{\varepsilon}}(Q)$ be defined as per equation (186) and $\mathcal{F}_{t}(Q)$ defined as per equation (194) with $p_{X}$ , $\alpha$ , ${\varepsilon}$ satisfying the conditions of Lemma 6(a3). Then there exists a constants $B,B^{\prime},b>0$ depending on $p_{X}$ such that

The second inequality follows from the first one using Lemma 6(a1). Using equation (201), we have

The proof of the corollary is obtained by noting that $\sigma_{t}=\Theta(\sigma_{t-1})$ and applying equation (11) to the expectation in equation (194).

Proof of Lemma 6(a3) Define, as in the proof of part (a2), $Q_{t}\equiv R_{t,t-1}/(\sigma_{t}\sigma_{t-1})$ , and recall that

By Corollary 2, and Lemma 10, it follows that $Q_{t}\geq 1-A\overline{\omega}^{2t}$ for some constants $A>0$ , $\overline{\omega}\in(0,1)$ . Indeed,

and the claim follows by noting that $\mathcal{F}_{\alpha,{\varepsilon}}^{\prime}(1)\in(0,1)$ by Lemma 10.

Next consider the quantity in equation (102). We have

Proof of Lemma 6(b1), (b2) First notice that, with the definitions given in the previous section

Notice that the right-hand side is equal to $2/\delta$ for $\alpha=0$ , monotonically decreasing in $\alpha$ and vanishing as $\alpha\to\infty$ . Hence there exists $\alpha_{\min}({\varepsilon},\delta)$ such that the right-hand side is smaller than $1$ if and only if $\alpha>\alpha_{\min}({\varepsilon},\delta)$ . Further, $\sigma^{2}\mapsto\mathsf{F}(\sigma^{2},\alpha\sigma)$ is concave with $\mathsf{F}(0,0)=0$ and first derivative larger than $1$ at $\sigma^{2}=0$ ; cf. Lemma 7. It follows that for $\alpha>\alpha_{\min}({\varepsilon},\delta)$ there exists a unique $\sigma_{*}(\delta,p_{X})$ such that $\mathsf{F}(\sigma^{2},\alpha\sigma)>\sigma^{2}$ for all $\sigma\in(0,\sigma_{*})$ and $\mathsf{F}(\sigma^{2},\alpha\sigma)<\sigma^{2}$ for $\sigma\in(\sigma_{*},\infty)$ . It follows that $\sigma^{2}_{t}\to\sigma_{*}$ for any $\sigma_{0}^{2}\neq 0$ . This proves the first part of claim (b1).

Notice that by the definition of $\alpha_{\min}$ given above, we have

In order to prove the second statement in (b1), we proceed analogously to part (a2), and define $Q_{t}\equiv R_{t}/(\sigma_{t}\sigma_{t-1})$ . This sequence satisfies the recursion (193) with $\mathcal{F}_{t}$ defined as per equation (194). As $t\to\infty$ we have $\sigma_{t}\to\sigma_{*}$ and hence $\mathcal{F}_{t}$ converges uniformly to a limit that we denote by an abuse of notation $\mathcal{F}_{\alpha,\delta,p_{X}}$ , where

Proceeding as in the proof of Lemma 10, we conclude that $Q\mapsto\mathcal{F}_{\alpha,\delta,p_{X}}(Q)$ is increasing and convex on $ $. Further [for$ Z\sim\mathsf{N}(0,1)$]

Finally, for $\alpha\geq\alpha_{0}(\delta,p_{X})$ ,

and therefore $\mathcal{F}_{\alpha,\delta,p_{X}}(Q)>Q$ for all $Q\in[0,1)$ . Hence, proceeding again as in the proof of part (a2) we conclude that $\lim_{t\to\infty}Q_{t}=1$ and therefore $\lim_{t\to\infty}R_{t,t-1}=\sigma_{*}^{2}$ as claimed.

where $\mathsf{F}(\sigma^{2},\theta)$ is defined as per equation (176). Using these expressions in equation (211) we conclude that

In particular, one can check from equations (212), (213) that a stationary pointIndeed this is the unique saddle point of the function $(\theta^{-1},\sigma^{2})\mapsto\mathcal{E}(\theta,\sigma^{2})$ as it can be proved by the general minimax theorem. is given by setting $\sigma=\sigma_{*}(\delta,p_{X})$ and $\theta=\theta_{*}(\delta,p_{X})\equiv\alpha_{0}(\delta,p_{X})\sigma_{*}(\delta,p_{X})$ .

Define $E(\sigma^{2})=\mathcal{E}(\sigma^{2},\alpha_{0}(\delta,p_{X})\sigma)$ . Using again equations (212), (213) we get

Appendix D Reference results

The following calculus fact is used in the main text.

For all $s,x>0$ we have $x^{s}\leq(\frac{s}{e})^{s}e^{x}$ .

Since $f(x)=\ln(x)$ for $x>0$ is concave, when $x\geq s$ , then

This is equivalent to $(x/s)^{s}\leq e^{x-s}$ which proves the result. The case of $x<s$ is proved similarly.

We also use an estimate on the minimum singular value of perturbed rectangular matrices, which was proved in Cucker , Theorem 1.1.

Acknowledgments

Andrea Montanari is grateful to Amir Dembo, David Donoho and Van Vu for stimulating conversations.