Universality in polytope phase transitions and message passing algorithms
Mohsen Bayati, Marc Lelarge, Andrea Montanari
Introduction and main results
The present paper is concerned with the asymptotic distribution of as with fixed, and establishes the following results:
As , the finite-dimensional marginals of the distribution of are asymptotically insensitive to the distribution of the entries of .
The entries of are asymptotically Gaussian with zero mean, and variance that can be explicitly computed through a one-dimensional recursion that we will refer to as state evolution.
As an application, we use state evolution to prove universality of a phase transition on polytope geometry, with connections to compressed sensing. This solves, for a broad class of random matrices with independent entries, a conjecture put forward by Donoho and Tanner Donoho2005b , DonohoTannerUniversality .
In order to illustrate the usefulness of the first two technical results, we will start the presentation of our results from the third one.
Before stating our results, it is useful to comment on the special form of the iteration (1), and in particular on the role of the memory term (which is inspired from the so-called “Onsager correction” in statistical physics thouless1977solution , mezard1987spin , bolthausen2012iterative ). The function of this term is to cancel, to leading order, the effect of correlations between and . This cancelation is particularly transparent in our proof technique, whereby is expressed as a sum of monomials in , with , and is indexed by labeled trees. The memory term effectively cancels the contribution of “one-step reversing” trees.
Without such memory term, the properties of the resulting iteration change crucially. In particular, it is no longer true that is approximately Gaussian as ; see Section 2 for further clarification on this point.
A polytope is said to be centrosymmetric if implies . Following Donoho2005b , Donoho2005a we say that such a polytope is -neighborly if the condition below holds: {longlist}[(I)]
It was shown in a series of papers Donoho2005a , Donoho2005b , DoTa05a , DoTa05b , DoTa08 that polytope neighborliness has tight connections with the geometric properties of random point clouds, and with sparsity-seeking methods to solve underdetermined systems of linear equations. The latter are in turn central in a number of applied domains, including model selection for data analysis and compressed sensing. For the reader’s convenience, these connections will be briefly reviewed in Section 6.
If the sequence is random, we say that has weak neighborliness (in probability) if the above limits hold in probability.
In other words, a sequence of polytopes has weak neighborliness , if for large the -dimensional polytope has close to the maximum possible number of faces, for all .
Note that previously the neighborliness of a polytope was defined to be the largest integer satisfying condition (I). However, in our definition, weak neighborliness refers to the fraction . This is due to the fact that weak neighborliness is defined in the limit .
The existence of weakly neighborly polytope sequences is clear when since in this case we can take with , but the existence is highly nontrivial when is only a fraction of .
Then, the sequence of polytopes has weak neighborliness in probability.
A characterization of the curve was provided in Donoho2005b , but we omit it here since a more explicit expression will be given below.
Motivated by applications to data analysis and signal processing, Donoho and Tanner DonohoTannerUniversality carried out extensive numerical simulations for random polytopes of the form for several choices of the distribution of . They formulated a universality hypothesis according to which the conclusion of Theorem 1 holds for a far broader class of random matrices. The results of their numerical simulations were consistent with this hypothesis.
Here we establish the first rigorous result indicating universality of polytope neighborliness for a broad class of random matrices. Define the curve , , parametrically by letting, for ,
Here we extend the scope of Theorem 1 from Gaussian matrices to matrices with independent sub-GaussianSee equation (7) for the definition of sub-Gaussian random variables. entries (not necessarily identically distributed).
Then the sequence of polytopes has weak neighborliness in probability.
It is likely that this theorem can be improved in two directions. First, a milder tail condition than sub-Gaussianity is probably sufficient. Second, we are assuming that the distribution of has an arbitrarily small Gaussian component. This is not necessary for the upper bound on neighborliness, and appears to be an artifact of the proof of the lower bound.
The proof of Theorem 2 is provided in Section 6. By comparison, the most closely related result toward universality is by Adamczak, Litvak, Pajor, and Tomczak-Jaegermann Adamczak . For a class of matrices with i.i.d. columns, these authors prove that has neighborliness scaling linearly with . This, however, does not suggest that a limit weak neighborliness exists, and is universal, as established instead in Theorem 2.
At the other extreme, universality of compressed sensing phase transitions can be conjectured from the results of the nonrigorous replica method KabashimaTanaka , RanganFletcherGoyal .
2 Universality of iterative algorithms
We will consider here and below a setting that is somewhat more general than the one described by equation (1). Following the terminology of DMM09 , we will refer to such an iteration as to the approximate message passing (AMP) iteration/algorithm.
There are several motivations for considering such a generalization. On one hand, it is necessary for the application to high-dimensional polytope geometry presented in the previous section. The reader might have noticed that the random matrix in Theorem 2 is rectangular. This is a different setting from that of iteration (1), whereby the random matrix is square and symmetric. The generalization to introduced here, with square and symmetric, covers the case of rectangular matrices as well through a suitable reduction. In a nutshell, given a rectangular random matrix , the reduction consists of constructing a symmetric matrix that has as submatrix; cf. Section 5 for details.
Additional motivations for the generalization introduced here come from the application of AMP algorithms to a variety of problems in signal processing. For instance the authors of krzakala2012statistical , donoho2013information study compressed sensing reconstruction for “spatially coupled” sensing matrices. These are random matrices with independent but not identically distributed entries. As already discussed in donoho2013information , javanmard2013state for the case of Gaussian entries, a rigorous analysis of this algorithm requires generalizing the setting of (1).
Several other applications require a generalization of iteration (1), including the analysis of generalized AMP algorithms RanganGAMP , AMP reconstruction of block-sparse signals donoho2013accurate , the analysis of phase retrieval algorithms schniter2012compressive and so on. All of these applications can be treated within the setting introduced here, although our rigorous analysis requires the use of polynomial nonlinearities.
A brief sketch of some proof ideas for the “scalar” case of equation (1) can be found in Section 2.
An AMP instance is a triple where: {longlist}[(3)]
is an initial condition. Given , we define that maps to , and is given by for all .
The approximate message passing orbit corresponding to the instance is the sequence of vectors , defined as follows, for ,
Here is the linear operator that maps to , and is defined by
The above definition can also be summarized by the following expression for the evolution of a single coordinate under AMP:
Recall that a centered random variable is sub-Gaussian with scale factor if, for all , we have
Let be a sequence of AMP instances indexed by the dimension , with a random matrix and a random vector. We say that the sequence is -regular (or, for short, regular) polynomial sequence if: {longlist}[(3)]
For each , the functions in [possibly random, as long as they are independent from , ] are polynomials with maximum degree and coefficients bounded by .
For each , and are independent. Further, we have with probability converging to one as .
We state now our universality result for the algorithm (4).
3 State evolution
Theorem 3 establishes that the behavior of the sequence is, in the high-dimensional limit, insensitive to the distribution of the entries of the random matrix . In order to characterize this limit, we need to make some assumption on the collection of functions . In particular, we need to relate the functions to the functions in order to have a high-dimensional () limit.
the function only depends on the partition index of , and on the value of ;
the distribution of only depends on the partition index of ;
the fractional size is -independent for large .
There are a few points to make precise, and this is done in the definition below.
for each , we have ;
for each , each and each , we have where are independent random variables with whenever for some ;
With a slight abuse of notation, we will sometimes denote a converging sequence by . We use capital letters to denote the ’s to emphasize that they are random and do not change across iterations.
for all . Here , and and are independent.
where is independent of .
We conclude by mentioning that, following DMM09 , generalizations of algorithm (4) were studied by several groups SchniterTurbo , RanganGAMP , MalekiComplex , for a number of applications. Universality results analogous to the one proved here are expected to hold for such generalizations as well.
4 Outline of the paper
The paper is organized as follows. Before delving into the details of the analysis, Section 2 provides an informal discussion of the main proof ideas for the case . After some preliminary facts and notations in Section 3, Section 4 considers the AMP iteration (4) and proves Theorems 3 and 4. In order to achieve our goal, we introduce two different iterations whose analysis provides useful intermediate steps. We also prove a generalization of Theorem 4 to estimate functions of messages at two distinct times .
Section 5 proves a generalization of Theorem 4 to the case of rectangular (nonsymmetric) matrices . This is achieved by effectively embedding the rectangular matrix, into a larger symmetric matrix and applying our results for symmetric matrices.
The generalization to rectangular matrices is finally used in Section 6 to prove our result on the universality of polytope neighborliness, Theorem 2. This is done via a correspondence with compressed sensing reconstruction established in Donoho2005b , and a sharp analysis of an AMP iteration that solves this reconstruction problem.
Universality of iterative algorithms: Sketch of main ideas
Let us focus on, say, coordinate of . An explicit calculation yields (recall that, by convention, )
Consider first . Under our assumptions, is a sum of i.i.d. random variables with mean and variance . By the central limit theorem, it converges in distribution to a standard Gaussian random variable, as predicted by Theorem 4.
Consider next . In equation (15) we decomposed the sum over in a sum over terms with , and a sum over terms with . The first sum converges almost surely to by the law of large numbers. It is easy to see that the second sum has expectation equal to zero and variance equal to that converges to . Indeed, a slightly more complicated calculation shows that it converges to a standard Gaussian. Overall, converges in distribution to a Gaussian with mean and variance , unlike what is predicted by Theorem 4 for . (Theorem 4 always predicts to have asymptotically zero mean.)
Notice that the terms in the sum (15) are indexed by an ordered triple with , , . We can identify such a triple with a length rooted (directed acyclic) path with vertices labeled by (the root), , : . The terms that lead to a nonzero mean are those corresponding to , that is, with a one-step reversal in the order in which they visit labels of . These are paths of the form .
Consider now adding back the memory term . It is easy to check that, in the present case [namely ], equation (1) reduces to and, in particular
Comparing with equation (15), we see that the memory term asymptotically cancels the effect of one-step reversing paths. The same analysis can be developed, with additional labor, to subsequent iterations. At each , the memory term cancels the effect of one-step reversing paths, and the residual terms match the prediction of Theorem 4.
The proof follows a similar argument for a general polynomial . As in the linear case, each coordinate can be expressed as a sum of monomials in the independent random variables . The main difference is that now these monomials are indexed by rooted trees instead of rooted paths with vertex labels in . To see this, consider the special case
Then, a direct calculation of iteration (1) yields
The actual proof of Theorems 3 and 4 in Section 4 follows the same intuition as above, but of course, requires several technical steps: {longlist}[(6)]
We prove that, for our purposes, the distribution of the random variable is accurately approximated by the distribution of ; see Proposition 3.
We prove, under the same assumptions as in our universality result, Theorem 3, the distribution of is insensitive to the distribution of the matrix entries ; cf. Proposition 1. This is done by the moment method. Any moment of is written as the expectation of a polynomial in the . We show that the only terms that matter are the ones in which each appears with degree at most two. Hence the expectation only depends on the first two moments of the matrix entries, which are fixed by assumption.
Together with the previous point, this immediately implies Theorem 3.
In Proposition 2, we prove that the distribution of (and hence ) is, for our purposes, accurately approximated by the distribution of .
Finally we exploit the fact that a fresh matrix is sampled at each iteration to prove that state evolution holds for ; cf. Proposition 4.
By the previous point, this implies Theorem 4.
In the next section we introduce some basic facts and notation. We will implement the above strategy in Section 4.
Notations and basic simplifications
In our proof of Theorem 4 we will make use of the following simplification that lightens somewhat the notation.
For proving Theorem 4, it is sufficient to consider the case in which is independent of .
The basic idea of the construction is to enlarge in such a way to keep track of the value of in the a subset of the coordinates of .
Proofs of Theorems 3 and 4
In this section we consider the AMP iteration (4), and prove Theorem 3 and Theorem 4, and indeed generalize the latter.
with defined per equation (10). For any , we set recursively
Let be a polynomial and converging sequence of instances and denote by the corresponding AMP orbit.
where is independent of .
The rest of this section is organized as follows. In Section 4.1 we introduce two new iterations that are useful intermediary steps for our analysis. We show that the corresponding variables admit representations as sums over trees in Section 4.2 and use them to prove basic properties of these recursions in Sections 4.3, 4.4 and 4.5. Theorems 3 and 5 are then proved in Sections 4.6, 4.7. Because of equation (23), Theorem 4 follows as a special case of Theorem 5. Indeed, we will show that both statements are equivalent through a reduction argument. Depending on the application, Theorem 5 might be a more convenient formulation of the state evolution and will be used in Section 5.
Our first result establishes universality of the moments of for polynomial sequences of instances.
The proof of this proposition is provided in Section 4.3.
In this statement and in the rest of this section, is always understood as a function of which may vary from line to line, but which is independent of .
The asymptotic analysis of is particularly simple because an independent random matrix is used at each iteration. In particular, it is easy to establish state evolution for . Our next result shows that provides a good approximation for .
The proof of this proposition is provided in Section 4.4.
Let be a -regular polynomial sequence of instances. Denote by the corresponding AMP sequence and by the sequence defined by (25) while iterating (24). Then for any and , there exists independent of such that, for any ,
The proof of this proposition is provided in Section 4.5.
2 Tree representation
We denote by the set of labeled trees with generations as above. We let denote the subset of such trees that satisfy the following additional condition: {longlist}[(1)]
Let be a polynomial sequence of AMP instances. Denote by the orbit defined by (25) while iterating (24) with matrix . Then
We first prove (30) by induction on . For we have, by definition,
To prove the induction, we start with equation (24), which yields
3 Proof of Proposition 1
We are now in position to prove Proposition 1.
For notational simplicity, we consider the case , and for all . Thanks to Lemma 1, we have
obtained by taking .
since in the product on the right-hand side of (4.3), there are at most terms different from one.
Hence the condition ensures that for any ,
which completes the proof Proposition 1. Here we used the fact that all values and are independent of .
We end this section by showing that the term can be further reduced. This result will be useful in the sequel, and we state it as the following lemma.
4 Proof of Proposition 2
The proof follows the same approach as for Proposition 1. For notational simplicity, we consider the case and for all . The general case follows by the same argument. For , we are using a different matrix at each iteration and we need to define a new weight associated to trees as follows:
5 Proof of Proposition 3
As in the proof of Proposition 1, we will rely on a representation of based on labeled trees defined as in Section 4.2. In the present case, it is, however, more convenient to work with trees from which marks have been removed, that is, we identify any two trees in which the vertex marks are different, but the types are the same. Notice that equations (30), (31) imply
where is obtained by summing over all trees that coincide up to marks. In the following, with a slight abuse of notation, we will write instead of .
Under the same assumptions as in Proposition 3, we have
for some which is bounded uniformly as .
Following the same argument as in Lemma 1, we first prove by induction on that we can find such that
It remains to prove that agrees with the expression in Lemma 1, [cf. equations (45), (46)], for and is zero for trees in . The proof of this fact will proceed by induction on . The cases are clear since . For , we define
where the last sum contains a finite number of nonzero terms.
We now show that each contribution on the right-hand side [except ] can be written as a sum of terms over trees that we construct explicitly.
We now treat the terms in the first line. Again, we have
6 Proof of Theorem 3
7 Proof of Theorem 5
An important simplification is provided by the following.
It is sufficient to prove Theorem 5 for . (Hence, Theorem 4 implies Theorem 5.)
The new sequence of initial conditions is constructed as follows: {longlist}[(3)]
The partitions , and matrices are kept unchanged.
As a consequence of this construction, for all , , and for all . This completes the reduction.
As a consequence of this remark, it is sufficient to prove Theorem 4, and by Remark 1, we can limit ourselves to the case in which does not depend on , and hence this argument will be dropped. We begin by considering the expectation of moments of .
where .
By Propositions 2 and 3, we need only to prove the statement for the AMP orbit . We will indeed prove by induction on that for any and any ,
For , let be the -algebra generated by . We will show, using the central limit theorem, that the random vector given converge in distribution to a centered Gaussian random vector. More precisely, by (27) and the induction hypothesis, the following limit holds in probability:
In the base case the same conclusion holds because
where the second identity holds by assumption.
Next consider the induction claim, equation (55). Recall the representation introduced in Section 4.4,
Using this representation of , it is easy to show that, for ,
for some function as ar fixed. Indeed, the above expectations can be represented as sums over trees and trees . Let be the simple graph obtained by identifying vertices of the same type in .
On the other hand, the number of such terms is bounded by (because the type has to be assigned to vertices, but of these are fixed to and ), and hence the overall contribution of these terms vanishes as well.
Equation (55) follows for iteration by applying Chebyshev’s inequality to the sequence
for : and when or define independently;
for , ; for ; , , for ; for all .
The definition of for is irrelevant for our purposes. Since for all , the orbit is not affected by the new variables. Further, by the general AMP equation (6), we have, for ,
Notice that the in this equation are independent of . Hence
On the other hand, using Proposition 4 (once for iteration and , and another time for iteration and ) we get
where . Comparing these equations with equation (60) we conclude that
Taking , we obtain equation (57) for even. In order to establish equation (57) for general we take, for instance, and use the fact that the limit must vanish for all .
Proof of Theorem 5 By Remark 1 and Remark 2, we reduced ourselves to the case , and [equivalently, , is absent].
Proposition 4 shows the convergence of expected the moments of to moments that determine the Gaussian distribution. Proposition 5 combined with Chebyshev inequality implies
in probability. The proof follows using the relation between convergence in probability and convergence almost sure along subsequences, together with the moment method.
Nonsymmetric matrices
In this section we consider a slightly different setting that turns out to be a special case of the one introduced in Section 1.3.
A converging sequence of (polynomial) bipartite AMP instances is defined by giving for each : {longlist}[(3)]
Throughout this section, we will refer to nonbipartite AMP instances as per Definition 5, as to symmetric instances. With these ingredients, we define the AMP orbit as follows.
The approximate message passing orbit corresponding to the bipartite instance is the sequence of vectors , defined as follows, for ,
where , are applied componentwise; see below for an explicit formulation. Here is the linear operator that maps to , and for any satisfies,
Analogously is the linear operator defined by letting, for , and any ,
For the sake of clarity, it is useful to rewrite the iteration (67), (68) explicitly, by components
We will state and prove a state evolution result that is analogous to Theorem 5 for the present case. Since the proof is by reduction to the symmetric case, the same argument also implies a universality statement of the type of Theorem 3. However, we will not state explicitly any universality statement in this case. We begin by introducing the appropriate state evolution recursion. In analogy with equation (10), we introduce two sequences of positive semidefinite matrices , by letting be given as per equation (66) and defining, for all ,
We also define a two-times recursion analogous to equations (21), (4). Namely, we introduce the boundary condition
with defined per equations (5), (5). For any , we set recursively
(Recall that denotes the column vector obtained by concatenating and .)
Let be a polynomial and converging sequence of bipartite AMP instances, and denote by the corresponding AMP orbit.
where is independent of , and is independent of .
The proof follows by constructing a suitable polynomial and converging sequence of symmetric instances, recognizing that a suitable subset of the resulting orbit corresponds to the orbit of interest, and applying Theorem 5.
We partition the index set in subsets: , with and . In particular and .
The symmetric random matrix is given by
In particular and .
The vertex labels are for and for . In particular, these are independent random variables with distribution if and if .
The initial condition is given by for and for .
The definition of and is irrelevant for our purposes. The proof is concluded by recognizing that, for all ,
We finish this section with a lemma that establishes continuity of the AMP trajectories with respect to Gaussian perturbations of the matrix . This fact will be used in the next section. (Notice that an analogous lemma holds by the same argument for converging, nonbipartite, instances.)
Enumerating the edges in as the quantity in parenthesis reads
Proof of universality of polytope neighborliness
In this section we prove Theorem 2, deferring several technical steps to the Appendices.
Notice that these matrices differ by a factor from the matrices in the statement of Theorem 2. Since neighborliness is invariant under scale transformations, this change is immaterial.
This is indeed a rephrasing of Theorem 2 in Donoho2005a .
In view of this result, Theorem 2 follows from the following result on compressed sensing with random sensing matrices.
Consider either of the following two cases: {longlist}[(II)]
The matrix has i.i.d. entries, and is any fixed sequence of vectors with .
The rest of this section is devoted to the proof of Theorem 8. Indeed, as shown below, this immediately implies Theorem 2.
Using Markov’s inequality, equations (87), (88) coincide (resp.) with assumptions (1) and (2) in Theorem 7. The claim follows by applying this theorem.
For , let . Then, for any , , the minimum singular value of satisfies .
The proof of this lemma is deferred to Appendix B.
The proof of Theorem 8 consists of two parts. For , we shall exhibit a vector with and . For we will show that assumptions of Lemma 5 hold. In particular, we will construct a subgradient as per assumption 5. In both tasks, we will use an iterative message passing algorithm analogous to the one in Section 5. The algorithm is defined by the following recursion initialized with :
where , is a nonnegative constant and is a diagonal matrix whose precise definition is immaterial here and will be given in the proof of Proposition 6 below. Notice two important differences with respect to the treatment in Section 5:
The iteration in equations (89), (90) does not take immediately the form in equations (67), (68). For instance the nonlinear mapping is applied after multiplication by . This mismatch can be resolved by a simple change of variables.
The nonlinear mapping is not a polynomial. This point will be addressed by constructing suitable polynomial approximations of .
We refer to Appendix A for further details.
For , is defined by the one-dimensional recursion
There exists a diagonal matrix such that
We also need a generalization of the last proposition for functions of the estimates , at two distinct iteration numbers . To this objective, we introduce the generalization of the state evolution equation (91). Namely, we define recursively for all by letting
with . This uniquely determines the doubly infinite array . Notice in particular that for all . (This is easily checked by induction over .)
The proof of this proposition is in Appendix A.
Finally, we need some analytical estimates on the recursions (91) and (98). Part of these estimates were already proved in DMM09 , NSPT , BayatiMontanariLASSO , but we reproduce them here for the reader’s convenience. Proofs of the others are provided in Appendix C.
If , then: {longlist}[(a3)]
There exists , , with , and such that the following happens. For each , as , with .
Further, for each there exists and (distinct as long as ) such that, letting , .
Finally, for all , we have .
For any , we have .
For any , we have and, for , .
We are now in position to prove Theorem 8. For greater convenience of the reader, we distinguish the cases and . Before considering these cases, we will establish some common simplifications.
Summarizing this discussion, in order to prove the Theorem both in case (I) and case (II), it will be sufficient to do so for the following setting:
In the proof of Theorem 8, we can assume the vectors to be random with i.i.d. entries with common distribution .
Notice that, by definition of the function we have , and hence . We can write
This part of the proof is completed by showing that there exists with such that, for each , with high probability we have , and . Indeed, if this is true, we can then choose sufficiently large and so that is small enough as to satisfy the condition 5 of Lemma 5.
First consider . Applying Proposition 7 to , we have, in probability
Here the last equality follows from the fact that by Lemma 6(a1) and by Lemma 6(a2). This implies the claim for .
Next, consider . By Proposition 6(2),
and hence with high probability for any .
Finally consider , and define . We have
Notice that this apparently requires applying Proposition 6 to the function which is non-Lipschitz in . However, we can define a Lipschitz approximation, with parameter ,
Notice that is bounded and Lipschitz continuous. We further have, whence
The last inequality holds almost surely by the law of large numbers using . Analogously,
Hence the claim (110) follows by applying Proposition 6(3) to , using equations (6.2), (6.2), and letting .
We conclude by noting that the right-hand side of equation (110) converges to as by dominated convergence, since . Therefore,
This completes our proof of assumption 5 of Lemma 5.
We finally consider hypothesis 5. Let be defined as there, for the subgradient , namely
We claim that the following two claims hold for some independent of :
There exists (independent of ) such that for all , , the minimum singular value of , satisfies with probability converging to as .
where vanishes as at , , fixed, and vanishes as at , , , fixed.
These claims immediately imply that hypothesis 5 of Lemma 5 holds with probability converging to one as . Indeed, if , then (by Claim 2) where with probability larger than . By Claim 1, we hence have . The thesis follows since can be chosen as small as we want. [Notice that once is fixed to satisfy these claims, we can still choose arbitrarily to satisfy hypothesis 5 of Lemma 5, as per the argument above.]
In order to prove Claim 1, above first notice that, for any
where in the last line denotes the binary entropy of , and we used . We want to show that , , , , can be chosen so that both contributions vanish as .
with bounded as long as . We can therefore select and select small enough so that . This ensures that the first term in equation (6.2) vanishes as .
We are left with the task of selecting , , so that the second term vanishes as well, since then we can take . To this hand notice that by Proposition 6 (and using the fact that has a density) we have, in probability,
and further, since as [cf. Lemma 6(a1)] and , we have
On the other hand, by Lemma 6(a1), and since , we have . Hence there exist and so that for all large enough with high probability. Taking and using Markov’s inequality (with )
where the last inequality follows from Proposition 6(4). These terms can be made arbitrarily small by choosing small and large enough.
In order to complete the proof, we need to show that Claim 2 holds for eventually larger . First notice that, applying again Proposition 6(4), we get
By Proposition 7, and using the fact that the vector has a density, we have, in probability,
where, by Lemma 6(a3), vanishes as . Given any , we can therefore choose so that, with high probability . Combining with equation (6.2), we obtain the desired claim.
Fix a small number . By Lemma 6(b), there exists independent of , such that, for and large enough,
as well as . By Propositions 6, 7 (and noting that has a distribution that is absolutely continuous with respect to Lebesgue measure), we have, with high probability,
Namely equation (119) implies (122), equation (120) implies (123), equation (121) implies (124) and the assumption implies (125).
Using equation (92) together with the above, we get
Appendix A Proofs of Propositions 6 and 7
In this Appendix we prove Propositions 6 and 7 by a suitable application of Theorem 6. Before passing to these proofs, we establish a corollary of Theorem 6 that allows us to control iterations of the form (89), (90), with replaced by a general polynomial.
Let (notice that is invertible with high probability) and define iteratively
We next introduce the corresponding state evolution recursion. Namely, we define recursively for all by letting
with . This uniquely determines the doubly infinite array.
which coincides with equation (1) after a redefinition of the function .
A.2 Proofs of Propositions 6 and 7
We will start by proving Proposition 6. Since Proposition 7 follows from the same construction, we will only point to the necessary modifications. Before presenting the proof, we recall a basic result in weighted polynomial approximation (here stated for a specific case); see, for example, Lubinsky .
Proof of Propositions 6 Since the proposition holds as at fixed, we shall assume throughout that for some fixed arbitrarily large .
We claim that, for each , we can construct an orbit obeying equations (A.1), (132) for suitable functions such that the following holds (with a slight abuse of notation we will drop the parameter from ). For all , and all functions as in the statement, we have by construction. Further, in probability,
In order to prove equations (148) to (150) we proceed as follows. It is easy to check that for all ; cf. equation (91). We use Theorem 9 to construct polynomials such that
Given polynomials as defined by (154), we define as per equations (A.1), (132), with the initial condition given there. Equation (150) follows immediately from Corollary 1 for sufficiently small. Equation (149) also follows from the same corollary, by taking
and then using once again equation (154) on the resulting expression.
Finally, consider equation (148). For economy of notation, we write
Using the tree representation in Section 4.2, it is not hard to prove that the expectation on the right-hand side is bounded as follows:
Consider, for instance, the first case, , , , distinct. Using Lemma 3, each of , , , can be represented as a sum over trees with root type respectively at , , , . The weight of these trees is as in Lemma 3, times the prefactor . Let be the total number of edges in these trees, plus (two for each of the additional factors). Then any nonvanishing contribution is of order . Let be the graph obtained by identifying the vertices of the same type in these trees, and the number of its edges. Since each edge in must be covered at least twice by the trees to get a nonzero expectation, and the edges in at least once, we have . The number of vertices in is at most (note that is connected because it includes type connected to ). Of these vertices all but (whose type is , , , , ) can take an arbitrary type, yielding a combinatorial factor of order . Hence the sum over trees is of order as claimed.
Since by standard concentration bounds , , we obtain, in probability,
where, in the last step, we applied Corollary 1 to the polynomials , and , are independent. We are left with the task of showing that, by taking small enough in equation (154), we can ensure that
Indeed integrating by parts with respect to the above difference can be written as (for a finite constant that can depend on and change from line to line)
The claim follows by noting that, as argued above .
Proof of Claim 3 To be definite we will focus on equation (162).
We finally sketch the proof of Proposition 7.
Proof of Proposition 7 The sequence is constructed as in the previous statement. The proof hence follow by using Corollary 1, and taking small enough in equation (154), since we can ensure that for any and any (as shown above for the case ).
Appendix B Proof of Lemma 5
Throughout the proof we denote by , , etc., positive constants that depend uniquely on .
and denote by any minimizer. Further, let be a subgradient as in the statement, and define, for some ,
Also, let be the complement of this set. Notice that, by definition of subgradient, we have for all and for all in . This implies that .
Since , we have and hence
On the other hand, is in the subgradient of at . Hence . It follows that equation (169) implies . Since is a minimizer, we thus get
where in the last step we used Cauchy–Schwarz together with assumption 5. Hereafter we let .
Fix . Since , we have , and using equation (173), we conclude that there exists such that
On the other hand, by definition , and hence . Since , we have . Further , whence . By assumption 5, we have and therefore,
Combining this with equation (175), we deduce that for some , which in turns implies provided that . The claim hence follows for .
Appendix C Asymptotic analysis of state evolution: Proof of Lemma 6
where expectation is taken with respect to the independent random variables and . When necessary, we will indicate the dependency on by . With this notation the state evolution recursion reads . The following properties of the function were proved in DMM09 ; see also BayatiMontanariLASSO , Appendix A, for a more explicit treatment.
For any , the mapping is monotone increasing and concave with and
The first two derivatives of will be used in the proof
By inspection of equation (180), for all , hence is strictly convex. Further, from equation (179), we have and as . Hence has a unique minimum .
Finally, equation (181) follows immediately by using the condition in expression (C).
In our proof it is more convenient to use the coordinates instead of . In terms of the latter, the phase boundary (2), (3) reads
Notice that the use of the symbol in the last equations is not an abuse of notation. Indeed comparing equation (183) with (179) we conclude that is indeed the unique solution of . Further, comparing equation (182) with equation (C) we obtain the following.
Let be the phase boundary defined by equations (2), (3). Then, for , if and only if, for ,
Vice versa if and only if .
Proof of Lemma 6(a1) We set . Hence we have, by Lemma 7, and Lemma 9,
Finally by Lemma 8 . Since is decreasing in , the last claim follows.
In the proof of part (a2) we will make use of the following analytical result.
Then is increasing and convex on $\mathcal{F}_{\alpha,{\varepsilon}}(1)=1\mathcal{F}^{\prime}_{\alpha,{\varepsilon}}(1)<1\mathcal{F}_{\alpha,{\varepsilon}}(Q)>Q\in[0,1)$
which is strictly positive and increasing in . Hence is increasing and strictly convex. Finally, since , we have
where the last equality follows again by Lemma 8. This completes the proof.
We are now in position to prove part (a2) of Lemma 6.
Proof of Lemma 6(a2) Throughout the proof we fix . Let the sequence be given as per the state evolution equation (91). Define . By Proposition 7, is the covariance of two Gaussian random variables of variance . Hence . Using equation (98) we further have
(Notice that this expectation depends on the distribution of only through , because of the symmetry properties of the function .)
Further, by the proof of part (a1), as we have and
Comparing equations (186) and (194) we conclude that, for any ,
Further the convergence is uniform, since the functions are uniformly Lipschitz; see proof of Lemma 10 above.
Consider now the sequence , and let . Since for all , we have as well. We claim that in fact and therefore , which implies the thesis.
where, in the last step, we used the fact that is monotone increasing. Since for all by Lemma 10, we conclude that .
Before proving (a3) of Lemma 6, we establish one more technical result.
By triangular inequality, the left-hand side of equation (11) can be upper bounded as whereby
Here and are coupled in such a way that if and only if and the two variables have the same sign in the other case. We focus on bounding since can be treated along the same lines. Letting , we have
where in the last line we used Stein’s lemma to integrate over , and denotes derivative with respect to the first argument. Once again the two terms are treated along the same lines, and we will only consider . We have
The other term in equation (C.1) is bounded by the same argument. This concludes the proof of equation (11).
The proof of equation (201) follows from equation (11) if we notice that
The last lemma has a useful consequence that we will exploit in the ensuing proof of Lemma 6(a3).
Let be defined as per equation (186) and defined as per equation (194) with , , satisfying the conditions of Lemma 6(a3). Then there exists a constants depending on such that
The second inequality follows from the first one using Lemma 6(a1). Using equation (201), we have
The proof of the corollary is obtained by noting that and applying equation (11) to the expectation in equation (194).
Proof of Lemma 6(a3) Define, as in the proof of part (a2), , and recall that
By Corollary 2, and Lemma 10, it follows that for some constants , . Indeed,
and the claim follows by noting that by Lemma 10.
Next consider the quantity in equation (102). We have
Proof of Lemma 6(b1), (b2) First notice that, with the definitions given in the previous section
Notice that the right-hand side is equal to for , monotonically decreasing in and vanishing as . Hence there exists such that the right-hand side is smaller than if and only if . Further, is concave with and first derivative larger than at ; cf. Lemma 7. It follows that for there exists a unique such that for all and for . It follows that for any . This proves the first part of claim (b1).
Notice that by the definition of given above, we have
In order to prove the second statement in (b1), we proceed analogously to part (a2), and define . This sequence satisfies the recursion (193) with defined as per equation (194). As we have and hence converges uniformly to a limit that we denote by an abuse of notation , where
Proceeding as in the proof of Lemma 10, we conclude that is increasing and convex on $Z\sim\mathsf{N}(0,1)$]
Finally, for ,
and therefore for all . Hence, proceeding again as in the proof of part (a2) we conclude that and therefore as claimed.
where is defined as per equation (176). Using these expressions in equation (211) we conclude that
In particular, one can check from equations (212), (213) that a stationary pointIndeed this is the unique saddle point of the function as it can be proved by the general minimax theorem. is given by setting and .
Define . Using again equations (212), (213) we get
Appendix D Reference results
The following calculus fact is used in the main text.
For all we have .
Since for is concave, when , then
This is equivalent to which proves the result. The case of is proved similarly.
We also use an estimate on the minimum singular value of perturbed rectangular matrices, which was proved in Cucker , Theorem 1.1.
Acknowledgments
Andrea Montanari is grateful to Amir Dembo, David Donoho and Van Vu for stimulating conversations.