The LASSO risk for gaussian matrices
Mohsen Bayati, Andrea Montanari
Introduction
with . The original signal is estimated by
A large and rapidly growing literature is devoted to developing fast algorithms for solving the optimization problem (1.3) and characterizing the performances and optimality of the estimator . We refer to Section 1.3 for an unavoidably incomplete overview.
Despite such substantial effort, and many remarkable achievements, our understanding of (1.3) is not even comparable to the one we have of more classical topics in statistics and estimation theory. For instance, the best bound on the mean squared error () of the estimator (1.3), i.e. on the quantity , was proved by Candes, Romberg and Tao [CRT06] (who in fact did not consider the LASSO but a related optimization problem). Their result estimates the mean squared error only up to an unknown numerical multiplicative factor. Work by Candes and Tao [CT07] on the analogous Dantzig selector, upper bounds the mean squared error up to a factor , under somewhat different assumptions.
The objective of this paper is to complement this type of ‘rough but robust’ bounds by proving asymptotically exact expressions for the mean square error. Our asymptotic result holds almost surely for sequences of random matrices with fixed aspect ratio and independent gaussian entries. While this setting is admittedly specific, the careful study of such matrix ensembles has a long tradition both in statistics and communications theory and has spurred many insights [Joh06, Tel99]. Further, we carried out simulations on real data matrices with continuous entries (gene expression data) and binary feature matrices (hospital medical records). The results appear to be quite encouraging.
Although our rigorous results are asymptotic in the problem dimensions, numerical simulations have shown that they are accurate already on problems with a few hundreds of variables. Further, they seem to enjoy a remarkable universality property and to hold for a fairly broad family of matrices [DMM10]. Both these phenomena are analogous to ones in random matrix theory, where delicate asymptotic properties of gaussian ensembles were subsequently proved to hold for much broader classes of random matrices. Also, asymptotic statements in random matrix theory have been replaced over time by concrete probability bounds in finite dimensions. Of course the optimization problem (1.2) is not immediately related to spectral properties of the random matrix . As a consequence, universality and non-asymptotic results in random matrix theory cannot be directly exported to the present problem. Nevertheless, we expect such developments to be foreseeable.
Our proofs are based on the analysis of an efficient iterative algorithm first proposed by [DMM09], and called AMP, for approximate message passing. The algorithm is inspired by belief-propagation on graphical models; although the resulting iteration is significantly simpler (and scales linearly in the number of nodes). Extensive simulations [DMM10] showed that, in a number of settings, AMP performances are statistically indistinguishable to the ones of LASSO, while its complexity is essentially as low as the one of the simplest greedy algorithms.
The proof technique just described is new. Earlier literature analyzes the convex optimization problem (1.3) –or similar problems– by a clever construction of an approximate optimum, or of a dual witness. Such constructions are largely explicit. Here instead we prove an asymptotically exact characterization of a rather non-trivial iterative algorithm. The algorithm is then proved to converge to the exact optimum.
As already mentioned, we will consider sequences of instances of increasing sizes, along which the LASSO behavior has a non-trivial limit.
Let us stress that our proof only applies to a subclass of converging sequences, namely for gaussian measurement matrices . The notion of converging sequences is however important since it defines a class of problem instances to which the ideas developed below might be generalizable. Also, while the measurement matrices will be random, the signal , and noise vectors will be deterministic.
For a converging sequence of instances, and an arbitrary sequence of thresholds (independent of ), the asymptotic behavior of the recursion (1.8) can be characterized as follows.
where is independent of . Notice that the function depends implicitly on the law . We will see later that the quantity has the same distribution as . In other words, is the MSE of the estimator for .
The next proposition that was conjectured in [DMM09] and proved in [BM11] shows that the behavior of AMP can be tracked by the above one dimensional recursion. We often refer to this prediction by state evolution.
where is independent of .
In order to establish the connection with the LASSO, a specific policy has to be chosen for the thresholds . Throughout this paper we will take with is fixed. In other words, the sequence is given by the recursion
Let us finally discuss why there should be any relation at all between the AMP algorithm (1.8) and the solution of the LASSO. Assume that , and that is a fixed point of the corresponding AMP iteration. Let . Then the fixed point condition reads
Notice that if and only if there exists such that (here denotes the subgradient of the function ). It follows that the fixed point condition can be rewritten as
Comparing with the stationarity condition for the LASSO cost function (1.2) we obtain the following.
Any fixed point of the AMP iteration with is a minimizer of the LASSO cost function with
2 Main result
Before stating our results, we have to describe a calibration mapping between and that was introduced in [DMM10]. This mapping is necessary since in the analysis of AMP plays the role of . In other words, it can be viewed as regularization parameter and controls sparsity of AMP estimates. In particular, we will show that there exist a one-to-one (monotone) function between values of and .
Let us start by stating some convenient properties of the state evolution recursion.
Let be the unique non-negative solution of the equation
with the standard gaussian density and .
For any , , the fixed point equation admits a unique solution. Denoting by this solution, we have . Further the convergence takes place for any initial condition and is monotone. Finally at .
For greater convenience of the reader, a proof of this statement is provided in Appendix A.1.
We then define the function on , by
This function defines a correspondence (calibration) between the threshold and the regularization parameter . It should be intuitively clear that larger corresponds to larger thresholds and hence larger since both cases yield smaller estimates of . The specific choice in Eq. (1.18) is motivated by Lemma 1.2.
In the following we will need to invert this function. We thus define in such a way that
The next result implies that the set on the right-hand side is non-empty and therefore the function is well defined.
The function is continuous on the interval with and .
Therefore the function satisfying Eq. (1.19) exists.
A proof of this statement is provided in Section A.2. We will denote by the image of the function . Notice that the definition of is a priori not unique. We will see that uniqueness follows from our main theorem.
Examples of the mappings , and are presented in Figures 1, 2, and 3 respectively.
2.2 Main results
where is independent of , and .
Let us emphasize oonce more that the vectors , are deterministic in this statement, and ‘almost surely’ is understood with respect to the choice of .
As a corollary, using function we obtain:
Assume the hypothesis of Theorem 1.5. Let be the LASSO estimator for instance . Then, almost surely
where is independent of , and .
As a second corollary of Theorem 1.5, the function is indeed uniquely defined.
For any there exists a unique such that (with the function defined as in Eq. (1.18).
Hence the function is continuous non-decreasing with .
The proof of this corollary (which uses Theorem 1.5) is provided in Appendix A.3.
We prove Theorem 1.5 by proving the following result in Section 3.
Assume the hypotheses of Theorem 1.5. Let be the LASSO estimator for instance , and denote by the sequence of estimates produced by AMP. Then
Let us emphasize that the statement of Theorem 1.8 requires taking the limit of infinite dimensions before the limit of an infinite number of iterations . In this sense it is (informally speaking) a statement about the high-dimensional limit behavior, for a large-but-finite number of iterations. Although this is not a common setting within mathematical optimization, we think that it is particularly compelling from a compressed sensing point of view. It implies that, for any finite tolerance , there exists a finite number of iterations such that for any fixed , AMP has mean squared error at most larger than the LASSO, with high probability as . Further, closer analysis of the state evolution recursion [DMM09, DMM10] implies that for some constant independent of the dimension, and the signal , provided the under-sampling ratio is larger than a phase transition value . Notice that taking the high dimensional point of view yields us a considerably faster convergence than the optimum rate at fixed dimension, namely [BT09].
3 Related work
The LASSO was introduced in [Tib96, CD95]. Several papers provide performance guarantees for the LASSO or similar convex optimization methods [CRT06, CT07], by proving upper bounds on the resulting mean squared error. These works assume an appropriate ‘isometry’ condition to hold for . While such condition hold with high probability for some random matrices, it is often difficult to verify them explicitly. Further, it is only applicable to very sparse vectors . These restrictions are intrinsic to the worst-case point of view developed in [CRT06, CT07].
Guarantees have been proved for correct support recovery in [ZY06], under an appropriate ‘incoherence’ assumption on . While support recovery is an interesting conceptualization for some applications (e.g. model selection), the metric considered in the present paper (mean squared error) provides complementary information and is quite standard in many different fields.
Closer to the spirit of this paper [RFG09] derived expressions for the mean squared error under the same model considered here. Similar results were presented recently in [KWT09, GBS09]. These papers argue that a sharp asymptotic characterization of the LASSO risk can provide valuable guidance in practical applications. For instance, it can be used to evaluate competing optimization methods on large scale applications, or to tune the regularization parameter .
Unfortunately, these results were non-rigorous and were obtained through the famously powerful ‘replica method’ from statistical physics [MM09].
Let us emphasize that the present paper offers two advantages over these recent developments: It is completely rigorous, thus putting on a firmer basis this line of research; It is algorithmic in that the LASSO mean squared error is shown to be equivalent to the one achieved by a low-complexity message passing algorithm.
Numerical illustrations
Further, our result is asymptotic, while and one might wonder how accurate it is for instances of moderate dimensions.
We obtained the optimum estimator using CVX, a package for specifying and solving convex programs [GB10] and OWLQN, a package for solving large-scale versions of LASSO [AG07]. We used several values of between and and equal to , , , and . The aspect ratio of matrices was fixed in all cases to . For each case, the point was plotted and the results are shown in the figures. Continuous lines corresponds to the asymptotic prediction by Corollary 1.6, namely .
The agreement is remarkably good already for of the order of a few hundreds, and deviations are consistent with statistical fluctuations.
The four figures correspond to measurement matrices :
Figure 4: Data consist of measurements of expression level of genes.From this matrix we took sub-matrices of aspect ratio for each . The entries were continuous variables. We standardized all columns of to have mean 0 and variance 1.
Figure 5: From a data set of patient records we extracted binary features describing demographic information, medical history, lab results, medications etc. The - matrix was sparse (with only non-zero entries). Similar to , for each , the sub-matrices with aspect ratio were selected and standardized.
Figure 6: Random gaussian matrices with aspect ratio and iid entries (as in Theorem 1.5);
Figure 7: Random matrices with aspect ratio . Each entry is independently equal to or with equal probability.
Notice the behavior appears to be essentially indistinguishable. Also the asymptotic prediction has a minimum as a function of . The location of this minimum can be used to select the regularization parameter. Further empirical analysis is presented in [BBM11].
A structural property and proof of the main results
The rest of the paper is devoted to the proof of Theorem 1.8. Section 3.2 proves a structural property that is the key tool in this proof. Section 3.3 uses this property together with a few lemmas to prove Theorem 1.8
The proof of Theorem 1.5 follows immediately from Theorem 1.8.
For any , we have, by the pseudo-Lipschitz property of ,
where the second inequality follows by Cauchy-Schwarz. Next we take the limit followed by . The first term vanishes by Theorem 1.8. For the second term, note that remains bounded since is a converging sequence. The two terms and also remain bounded in this limit because of state evolution (as proved in Lemma 3.2 below).
where we used Theorem 1.1 and Proposition 1.3. ∎
For a matrix we denote its minimum and maximum singular values by , respectively. We also denote the minimum non-zero singular value of by .
2 A structural property of the LASSO cost function
One main challenge in the proof of Theorem 1.5 lies in the fact that the function is not –in general– strictly convex. Hence there can be, in principle, vectors of cost very close to the optimum and nevertheless far from the optimum.
The following Lemma provides conditions under which this does not happen.
There exists a function such that the following happens.
There exists with ;
Let , and . Then, for any , , we have ;
The maximum singular value of is bounded: .
Then . Further for any , as .
Further, if , the same conclusion holds under assumptions 1, 2, 3, 5.
Throughout the proof we denote functions of the constants and of such that as (we shall omit the dependence of on ).
Let . We have
where follows from hypothesis (2), from the fact that since which gives
and follows from the definition of .
Using hypothesis (1) and (3), we get by Cauchy-Schwarz
Each of the three terms on the left-hand side is non-negative. The third one is trivial. The first one is non-negative since
and each is either equal to (when ) or equal to otherwise. The second term in (3.3) is also non-negative since and since by definition of subgradient. Therefore,
Since , we have
In the case , the proof is concluded. In the case , we need to prove an analogous bound for . From Eq. (3.4) together with , we get
Where (3.9) follows immediately from definition of and . Now, notice that . From Eq. (3.8) and definition of it follows that
To conclude the proof, it is sufficient to prove an analogous bound for with . Since , we have by hypothesis (4) that . By Eq. (3.9) we have . Therefore
In the last step we used triangular inequality together with the fact that (by assumption (5)) and (by construction). Using , we get
This finishes the proof when . Note that if this assumption does not hold then we can take and . Hence, the result follows as a special case of above. ∎
3 Proof of Theorem 1.8
The proof is based on a series of Lemmas that are used to check the assumptions of Lemma 3.1
Under the conditions of Theorem 1.5, assume and . Denote by the estimator and by the sequence of AMP estimates. Then there is a constant such that for all , almost surely
The second Lemma implies that the estimates of AMP are approximate minima, in the sense that the cost function admits a small subgradient at , when is large. The proof is deferred to Section 5.2.
Under the conditions of Theorem 1.5, for all there exists a subgradient of at point such that almost surely,
The next lemma implies that sub-matrices of constructed using the first iterations of the AMP algorithm are non-singular (more precisely, have singular values bounded away from ). The proof can be found in Section 5.3.
Let be measurable on the -algebra generated by and and assume for some . Then there exists (independent of ) and (depending on and ) such that
eventually almost surely as .
We will apply this lemma to a specific choice of the set . Namely, defining
for . Our last lemma shows that this sequence of sets ‘converges’ in the following sense. The proof can be found in Section 5.4.
Fix and let the sequence be defined as in Eq. (3.17) above. For any there exists such that, for all fixed, we have
eventually almost surely as .
The above two lemmas imply the following.
There exist constants , , and such that, for any ,
eventually almost surely as .
First notice that, for any fixed , the set is measurable on . Indeed by Eq. (1.8) contains as well, and hence it contains which is a linear combination of , . Finally is obviously a measurable function of .
Using Lemma F.3(b) the empirical distribution of converges weakly to for independent of . (Following the notation of [BM11], we let .) Therefore, for any constant we have almost surely
The last equality follows from the weak convergence of the empirical distribution of (from Lemma F.3(b), which takes the same form as Theorem 1.8), together with the absolute continuity of the distribution of .
eventually almost surely as , for all fixed larger than some .
For any we can apply Lemma 3.4 for some , . Fix and let be fixed as well. Let (with defined as per Lemma 3.5). Take . Obviously is non-increasing. Then we have, by Lemma 3.4
where both events hold eventually almost surely as . The claim follows with and . ∎
We are now in position to prove Theorem 1.8.
We apply Lemma 3.1 to , the AMP estimate and the distance from the optimum. The thesis follows by checking conditions 1–5. Namely we need to show that there exists constants and, for each some exists such that 1–5 hold eventually almost surely as .
Condition 2 is immediate since minimizes .
Condition 3 follows from Lemma 3.3 with arbitrarily small for large enough.
Condition 4. Notice that this condition only needs to be verified for .
Take as defined in Eq. (3.16). Using the definition (1.8), it is easy to check that if and otherwise. In other words as required. Further by inspection of the proof of Lemma 3.3, it follows that , with the subgradient bounded in that lemma (cf. Eq. (5.3)). The condition then holds by Proposition 3.6.
Condition 5 follows from standard limit theorems on the singular values of Wishart matrices (cf. Theorem F.2). ∎
State evolution estimates
This section contains a reminder of the state-evolution method developed in [BM11]. For greater convenience of the reader, we also restate two lemmas from [BM11] (namely, Lemmas F.3 and F.3) in appendix F.3. We will use these two Lemmas throughout our analysis.
We also state some extensions of those results that will be proved in the appendices.
AMP, cf. Eq. (1.8) is a special case of the general iterative procedure given by Eq. (3.1) of [BM11]. This takes the general form
where , (both derivatives are with respect to the first argument).
and the initial condition is .
Regarding as column vectors, the equations for and can be written in matrix form as:
or in short and .
Following [BM11], we define as the -algebra generated by , , , and . The conditional distribution of the random matrix given the -algebra , is given by
Here , , and , are orthogonal projector onto column spaces of and respectively.
Before proceeding, it is convenient to introduce the notation
to denote the coefficient of in Eq. (1.8). Using and Lemma F.3(b) (proved in [BM11]) we get, almost surely,
Notice that the function is discontinuous and therefore Lemma F.3(b) does not apply immediately. On the other hand, this implies that the empirical distribution of converges weakly to the distribution of . The claim follows from the fact that has a density, together with the standard properties of weak convergence.
2 Some consequences and generalizations
We begin with a simple calculation, that will be useful.
If are the AMP residuals, then, almost surely,
Using representation (4.5) and Lemma F.3(b)(c), we get
Next, we need to generalize state evolution to compute large system limits for functions of , , with . To this purpose, we define the covariances recursively by
with independent of . This determines by the above recursion for all and for all .
With these definition, we have the following generalization of Theorem 1.1.
Clearly this result reduces to Theorem 1.1 in the case by noting that . The general proof can be found in Appendix B.
The following lemma implies that, asymptotically for large , the AMP estimates converge.
Under the condition of Theorem 1.5, the estimates and residuals of AMP almost surely satisfy
Proofs of auxiliary lemmas
In order to bound the norm of , we use state evolution, Theorem 1.1, for the function ,
for and independent of . The expectation on the right hand side is bounded and hence is bounded.
The last bound holds almost surely as , using standard asymptotic estimate on the singular values of random matrices (cf. Theorem F.2) implying that has a bounded limit almost surely, together with the fact that is a converging sequence.
Now, decompose as where and (the orthogonal complement of ). Since, belongs to the random subspace with dimension , Kashin theorem (cf. Theorem F.1) implies that there exists a positive constant such that
Hence, by using triangle inequality and Cauchy-Schwarz, we get
By definition of cost function we have . Further, limit theorems for the eigenvalues of Wishart matrices (cf. Theorem F.2) imply that there exists a constant such that asymptotically almost surely . Therefore (denoting by bounded constants), we have
The claim follows by using the Eq. (5.1) to bound and using to bound the last term.
2 Proof of Lemma 3.3
First note that equation of AMP implies
Therefore, the vector where
is a valid subgradient of at . On the other hand, . We finally get
It is straightforward to see from Eqs. (LABEL:eq:x(t)=eta_interpreted) and (5.3) that . Hence,
By Lemma 4.3, and the fact that is almost surely bounded as (cf. Theorem F.2), we deduce that the two terms and converge to when and then . For the third term, using state evolution (see Lemma 4.1), we obtain . Finally, using the calibration relation Eq. (1.18), we get
3 Proof of Lemma 3.4
The proof uses the representation (4.9), together with the expression (4.10) for the conditional expectation. Apart from the matrices , , , introduced there, we will also use
We state below a somewhat more convenient description.
It is clearly sufficient to prove that, for , , , we have
The first identity is an easy consequence of the fact that , while the second one follows immediately from . ∎
The following fact (see Appendix D for a proof) will be used several times.
For any there exists such that, for , eventually almost surely as ,
Given the above remarks, we will immediately see that Lemma 3.4 is implied by the following statement.
Let be given such that , for some . Then there exists (independent of ) and (depending on and ) such that
eventually almost surely as . (With .)
In the next section we will show that this lemma implies Lemma 3.4. We will then prove the lemma just stated.
By Borel-Cantelli, it is sufficient to show that, for measurable on and there exist and , such that
for all large enough. Conditioning on and using the union bound, this probability can be estimated as
where is the binary entropy function. The union bound calculation indeed proceeds as follows
where . Now, fix in such a way that (with defined as per Lemma 5.3). Further choose . The above probability is then upper bounded by
Finally, applying Lemma 5.3 and using Lemma 5.1 to estimate , we get, for all large enough,
3.2 Proof of Lemma 5.3
We begin with the following Pythagorean inequality.
Let be given such that , for some . Recall that and consider the event
Here the notation refers to the usual scalar product of vectors and of the same dimension. Assuming that the claim holds, we have indeed
(Notice that in Proposition E.1 is stated for the equivalent case of a random sub-space of fixed dimension , and a subspace of dimension scaling linearly with the ambient one.) ∎
Let be given such that , for some . Then there exists constant , such that the event
Let be the linear space . Of course the dimension of is at most . Then we have (for all vectors with )
Finally a simple bound to control the norm of .
There exists a constant such that, defining the event,
we have that holds eventually almost surely as .
The bound is proved analogously. ∎
By Lemma 5.6 we can assume that event holds, for some function (without loss of generality ). We will let be the event
Let us assume first that , whence
where the last inequality uses . Therefore, using Lemma 5.4, we get
Next we assume . Due to Lemma 5.4 and 5.5 we can assume that events and hold. Therefore
4 Proof of Lemma 3.5
The key step consists in establishing the following result, which will be instrumental in the proof of Lemma 4.3 as well (and whose proof is deferred to Appendix C.1).
Assume and let be defined by the recursion (4.13) with initial condition (4.14). Then there exists constants , such that for all
It is also useful to prove the following fact.
For any and , the matrix is strictly positive definite.
almost surely. Hence, . Thus the result follows from Lemma 5.2. ∎
It is then relatively easy to deduce the following.
Assume and let be defined by the recursion (4.13) with initial condition (4.14). Then there exists constants , such that for all
By triangular inequality and Eq. (5.11), we have
which, together with Eq. (5.14) proves our claim. ∎
We are now in position to prove Lemma 3.5.
We will show that, under the assumptions of the Lemma, almost surely, which implies our claim. Indeed, by Theorem 4.2 we have
Let . By Proposition 1.3, for any and all large enough we have for . Then
where the last inequality follows by Lemma 5.9. By taking we finally get (for some constant ) , which implies our claim. ∎
Acknowledgement
It is a pleasure to thank David Donoho and Arian Maleki for many stimulating exchanges. We are also indebted with José Bento who collaborated in preparing Figures 4 to 7.
An earlier version of this paper stated some auxiliary lemmas in terms of convergence in probability. We rectified this to convergence almost sure as for the main theorems (with virtually no change in the proofs). We are grateful to Edgar Dobriban and Weijie Su for pointing out this inconsistency.
This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.
Appendix A Properties of the state evolution recursion
It is a straightforward calculus exercise to compute the partial derivatives
From these formulae we obtain the total derivative
with the inequality being strict whenever , . It follows that is concave, and strictly concave provided and is not identically .
which is strictly positive for all . To see this, let , and notice that , and .
Since is concave, and strictly increasing for large enough, it also follows that it is increasing everywhere.
Notice that is strictly decreasing with . Hence, for , we have for small enough and for large enough. Therefore the fixed point equation admits at least one solution. It follows from the concavity of that the solution is unique and that the sequence of iterates converge to .
A.2 Proof of Proposition 1.4
As a first step, we claim that is continuously differentiable on . Indeed this is defined as the unique solution of
Since is continuously differentiable and (the second inequality being a consequence of concavity plus , both shown in the proof of Proposition 1.3), the claim follows from the implicit function theorem applied to the mapping .
Next notice that as . Indeed, introducing the notation , we have, again by concavity,
i.e. . Now , while as (shown in the proof of Proposition 1.3), whence the claim follows.
Next consider the function defined by
Notice that . Since is continuously differentiable, it follows that is continuously differentiable as well.
Using the characterization of in Eq. (1.17) (and the well known inequality valid for all ), it is immediate to show that . Therefore
A.3 Proof of Corollary 1.7
By Proposition 1.4, it is sufficient to prove that, for any there exists a unique such that . Assume by contradiction that there are two distinct such values , .
Notice that in this case, the function is not defined uniquely and we can apply Theorem 1.5 to both choices and . Using the test function we deduce that
Since the left hand side does not depend on the choice of , it follows that .
Next apply Theorem 1.5 to the function . We get
Appendix B Proof of Theorem 4.2
First note that using representation (4.2) we have . Furthermore, using Lemma F.3(b) we have almost surely
For we have using Lemma F.3(b) almost surely
Induction hypothesis: Assume that for all and ,
Then we prove Eq. (B.2) for (case is similar). First assume and in which using Lemma F.3(c) we have almost surely
Similarly, for the case and , using Lemma F.3(b)(c) we have almost surely
using the induction hypothesis. Hence the result follows.
Appendix C Proof of Lemma 4.3
The proof of Lemma 4.3 relies on Lemma 5.7 which we will prove in the first subsection.
Before proving Lemma 5.7, we state and prove the following property of gaussian random variables.
for the indicator function of . Since the Ornstein-Uhlenbeck process is reversible with respect to the standard gaussian measure , we have
It is convenient to change coordinates and define
Next we will show that by induction on that the stronger inequality holds for all . We have indeed
The initial condition implied by Eq. (4.14) is
We will consider the above iteration for arbitrary initialization (satisfying ) and will show the following three facts:
Fact . As , . Further the convergence is monotone.
Fact . If and , then for all and .
Fact . The jacobian of at has spectral radius .
By simple compactness arguments, Facts and imply as . (Notice that remains bounded since and by the convergence of .) Fact implies that convergence is exponentially fast.
Proof of Fact . Notice that evolves independently by , with the state evolution mapping introduced in Eq. (1.9). It follows from Proposition 1.3 that monotonically for any initial condition. Since , the same happens for .
where , . In particular, by Lemma C.1, is strictly increasing (notice that the covariance of and is which is decreasing in ). Further
Hence, since using Eq. (1.18) we have . Finally, by Lemma C.1, is decreasing in . It follows that as claimed.
Proof of Fact . From the definition of , we have the following expression for the Jacobian
where with an abuse of notation we let . Computing the eigenvalues of the above matrix, we get
Since as proved above, and as per Proposition 1.3, the claim follows. ∎
C.2 Lemma 5.7 implies Lemma 4.3
Using representations (4.4) and (4.3) (i.e., and ) and Lemma F.3(c) we obtain,
where the last equality uses . Therefore, it is sufficient to prove the thesis for . By state evolution, Theorem 4.2, we have
The first term vanishes as because by Proposition 1.3. The second term instead vanishes since , by Lemma 5.7.
Appendix D Proof of Lemma 5.2
First note that the upper bound on is trivial since using representations (4.7), (4.8), , and Lemma F.3(c)(d) all entries of the matrix are bounded as and the matrix has fixed dimensions. Hence, we only focus on the lower-bound for .
The result for and follows directly from Lemma F.3(g) and Lemma 8 of [BM11].
For and the proof is by induction on .
For we have and . Using Lemma F.3(b)(c) we obtain almost surely
Induction hypothesis: Assume that for all there exist positive constants and such that as
First write and denote its first coordinates with . Next, we consider the conditional distribution . Using Eqs. (4.9) and (4.10) we obtain (since )
Hence, conditional on we have, almost surely
From Lemma F.3(g) we know that is larger than a positive constant . Hence, from representation (D.3) and induction hypothesis (D.1)
To simplify the notation let . Now if then
which proves the result. Otherwise, we obtain the inequality
Appendix E A concentration estimate
The following proposition follows from standard concentration-of-measure arguments.
Let be a -net in , i.e. a subset of vectors such that, for any , there exists such that . It follows from a standard counting argument [Led01] that there exists an -net of size . Define
Since is Lipschitz with modulus , we have
which is smaller than for all large enough. ∎
Appendix F Useful reference material
In this appendix we collect a few known results that are used several times in our proof. We also provide some pointers to the literature.
In our proof we make use of the following well-known result of Kashin in the theory of diameters of smooth functions [Kas77].
For any positive number there exist a universal constant such that for any , with probability at least , for a uniformly random subspace of dimension ,
F.2 Singular values of random matrices
We will repeatedly make use of limit behavior of extreme singular values of random matrices. A very general result was proved in [BY93] (see also [BS05]).
We will also use the following fact that follows from the standard singular value decomposition
F.3 Two Lemmas from [BM11]
Our proof uses the results of [BM11]. We state copy here the crucial technical lemma in that paper. Notations refer to the general algorithm in Eq. (4.1). General state evolution defines quantities and via
where and are independent of
where and are two zero-mean gaussian vectors independent of , , with .
For all the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):
Here denotes derivative with respect to the first coordinate of .
For all and the following limits exist, and there exist strictly positive constants and (independent of , ) such that almost surely
It is also useful to recall some simple properties of gaussian random matrices.