The LASSO risk for gaussian matrices

Mohsen Bayati, Andrea Montanari

Introduction

with λ>0\lambda>0. The original signal is estimated by

A large and rapidly growing literature is devoted to developing fast algorithms for solving the optimization problem (1.3) and characterizing the performances and optimality of the estimator x^\widehat{x}. We refer to Section 1.3 for an unavoidably incomplete overview.

Despite such substantial effort, and many remarkable achievements, our understanding of (1.3) is not even comparable to the one we have of more classical topics in statistics and estimation theory. For instance, the best bound on the mean squared error (MSE{\rm MSE}) of the estimator (1.3), i.e. on the quantity N1x^x02N^{-1}\|\widehat{x}-x_{0}\|^{2}, was proved by Candes, Romberg and Tao [CRT06] (who in fact did not consider the LASSO but a related optimization problem). Their result estimates the mean squared error only up to an unknown numerical multiplicative factor. Work by Candes and Tao [CT07] on the analogous Dantzig selector, upper bounds the mean squared error up to a factor ClogNC\log N, under somewhat different assumptions.

The objective of this paper is to complement this type of ‘rough but robust’ bounds by proving asymptotically exact expressions for the mean square error. Our asymptotic result holds almost surely for sequences of random matrices AA with fixed aspect ratio and independent gaussian entries. While this setting is admittedly specific, the careful study of such matrix ensembles has a long tradition both in statistics and communications theory and has spurred many insights [Joh06, Tel99]. Further, we carried out simulations on real data matrices with continuous entries (gene expression data) and binary feature matrices (hospital medical records). The results appear to be quite encouraging.

Although our rigorous results are asymptotic in the problem dimensions, numerical simulations have shown that they are accurate already on problems with a few hundreds of variables. Further, they seem to enjoy a remarkable universality property and to hold for a fairly broad family of matrices [DMM10]. Both these phenomena are analogous to ones in random matrix theory, where delicate asymptotic properties of gaussian ensembles were subsequently proved to hold for much broader classes of random matrices. Also, asymptotic statements in random matrix theory have been replaced over time by concrete probability bounds in finite dimensions. Of course the optimization problem (1.2) is not immediately related to spectral properties of the random matrix AA. As a consequence, universality and non-asymptotic results in random matrix theory cannot be directly exported to the present problem. Nevertheless, we expect such developments to be foreseeable.

Our proofs are based on the analysis of an efficient iterative algorithm first proposed by [DMM09], and called AMP, for approximate message passing. The algorithm is inspired by belief-propagation on graphical models; although the resulting iteration is significantly simpler (and scales linearly in the number of nodes). Extensive simulations [DMM10] showed that, in a number of settings, AMP performances are statistically indistinguishable to the ones of LASSO, while its complexity is essentially as low as the one of the simplest greedy algorithms.

The proof technique just described is new. Earlier literature analyzes the convex optimization problem (1.3) –or similar problems– by a clever construction of an approximate optimum, or of a dual witness. Such constructions are largely explicit. Here instead we prove an asymptotically exact characterization of a rather non-trivial iterative algorithm. The algorithm is then proved to converge to the exact optimum.

As already mentioned, we will consider sequences of instances of increasing sizes, along which the LASSO behavior has a non-trivial limit.

Let us stress that our proof only applies to a subclass of converging sequences, namely for gaussian measurement matrices A(N)A(N). The notion of converging sequences is however important since it defines a class of problem instances to which the ideas developed below might be generalizable. Also, while the measurement matrices A(N)A(N) will be random, the signal x0(N)x_{0}(N), and noise vectors w(N)w(N) will be deterministic.

For a converging sequence of instances, and an arbitrary sequence of thresholds {θt}t0\{\theta_{t}\}_{t\geq 0} (independent of NN), the asymptotic behavior of the recursion (1.8) can be characterized as follows.

where ZN(0,1)Z\sim{\sf N}(0,1) is independent of X0X_{0}. Notice that the function F{\sf F} depends implicitly on the law pX0p_{X_{0}}. We will see later that the quantity Azt+xtA^{*}z^{t}+x^{t} has the same distribution as X0+τtZX_{0}+\tau_{t}Z. In other words, τt2\tau_{t}^{2} is the MSE of the estimator Azt+xtA^{*}z^{t}+x^{t} for x0x_{0}.

The next proposition that was conjectured in [DMM09] and proved in [BM11] shows that the behavior of AMP can be tracked by the above one dimensional recursion. We often refer to this prediction by state evolution.

where ZN(0,1)Z\sim{\sf N}(0,1) is independent of X0pX0X_{0}\sim p_{X_{0}}.

In order to establish the connection with the LASSO, a specific policy has to be chosen for the thresholds {θt}t0\{\theta_{t}\}_{t\geq 0}. Throughout this paper we will take θt=ατt\theta_{t}=\alpha\tau_{t} with α\alpha is fixed. In other words, the sequence {τt}t0\{\tau_{t}\}_{t\geq 0} is given by the recursion

Let us finally discuss why there should be any relation at all between the AMP algorithm (1.8) and the solution of the LASSO. Assume that θtθ\theta_{t}\to\theta, and that (x,z)(x,z) is a fixed point of the corresponding AMP iteration. Let ω=δ1η(x+Az;θ)\omega=\delta^{-1}\langle\eta^{\prime}(x+A^{*}z;\theta)\rangle. Then the fixed point condition reads

Notice that x=η(r;θ)x=\eta(r;\theta) if and only if there exists v(x)x1v(x)\in\partial\|x\|_{1} such that x+θv(x)=rx+\theta v(x)=r (here f\partial f denotes the subgradient of the function ff). It follows that the fixed point condition can be rewritten as

Comparing with the stationarity condition for the LASSO cost function (1.2) we obtain the following.

Any fixed point xt=xx^{t}=x of the AMP iteration with θt=θ\theta_{t}=\theta is a minimizer of the LASSO cost function with

2 Main result

Before stating our results, we have to describe a calibration mapping between α\alpha and λ\lambda that was introduced in [DMM10]. This mapping is necessary since in the analysis of AMP α\alpha plays the role of λ\lambda. In other words, it can be viewed as regularization parameter and controls sparsity of AMP estimates. In particular, we will show that there exist a one-to-one (monotone) function between values of α\alpha and λ\lambda.

Let us start by stating some convenient properties of the state evolution recursion.

Let αmin=αmin(δ)\alpha_{\rm min}=\alpha_{\rm min}(\delta) be the unique non-negative solution of the equation

with ϕ(z)ez2/2/2π\phi(z)\equiv e^{-z^{2}/2}/\sqrt{2\pi} the standard gaussian density and Φ(z)zϕ(x)dx\Phi(z)\equiv\int_{-\infty}^{z}\phi(x)\,{\rm d}x.

For any σ2>0\sigma^{2}>0, α>αmin(δ)\alpha>\alpha_{\rm min}(\delta), the fixed point equation τ2=F(τ2,ατ)\tau^{2}={\sf F}(\tau^{2},\alpha\tau) admits a unique solution. Denoting by τ=τ(α)\tau_{*}=\tau_{*}(\alpha) this solution, we have limtτt=τ(α)\lim_{t\to\infty}\tau_{t}=\tau_{*}(\alpha). Further the convergence takes place for any initial condition and is monotone. Finally dFdτ2(τ2,ατ)<1\left|\frac{{\rm d}{\sf F}}{{\rm d}\tau^{2}}(\tau^{2},\alpha\tau)\right|<1 at τ=τ\tau=\tau_{*}.

For greater convenience of the reader, a proof of this statement is provided in Appendix A.1.

We then define the function αλ(α)\alpha\mapsto\lambda(\alpha) on (αmin(δ),)(\alpha_{\rm min}(\delta),\infty), by

This function defines a correspondence (calibration) between the threshold ατ\alpha\tau_{*} and the regularization parameter λ\lambda. It should be intuitively clear that larger λ\lambda corresponds to larger thresholds and hence larger α\alpha since both cases yield smaller estimates of x0x_{0}. The specific choice in Eq. (1.18) is motivated by Lemma 1.2.

In the following we will need to invert this function. We thus define α:(0,)(αmin,)\alpha:(0,\infty)\to(\alpha_{\rm min},\infty) in such a way that

The next result implies that the set on the right-hand side is non-empty and therefore the function λα(λ)\lambda\mapsto\alpha(\lambda) is well defined.

The function αλ(α)\alpha\mapsto\lambda(\alpha) is continuous on the interval (αmin,)(\alpha_{\rm min},\infty) with λ(αmin+)=\lambda(\alpha_{\rm min}+)=-\infty and limαλ(α)=\lim_{\alpha\to\infty}\lambda(\alpha)=\infty.

Therefore the function λα(λ)\lambda\mapsto\alpha(\lambda) satisfying Eq. (1.19) exists.

A proof of this statement is provided in Section A.2. We will denote by A=α((0,)){\cal A}=\alpha((0,\infty)) the image of the function α\alpha. Notice that the definition of α\alpha is a priori not unique. We will see that uniqueness follows from our main theorem.

Examples of the mappings τ2F(τ2,ατ)\tau^{2}\mapsto{\sf F}(\tau^{2},\alpha\tau), ατ(α)\alpha\mapsto\tau_{*}(\alpha) and αλ(α)\alpha\mapsto\lambda(\alpha) are presented in Figures 1, 2, and 3 respectively.

2.2 Main results

where ZN(0,1)Z\sim{\sf N}(0,1) is independent of X0pX0X_{0}\sim p_{X_{0}}, τ=τ(α(λ))\tau_{*}=\tau_{*}(\alpha(\lambda)) and θ=α(λ)τ(α(λ))\theta_{*}=\alpha(\lambda)\tau_{*}(\alpha(\lambda)).

Let us emphasize oonce more that the vectors x0(N)x_{0}(N), w(N)w(N) are deterministic in this statement, and ‘almost surely’ is understood with respect to the choice of A(N)A(N).

As a corollary, using function ψ(a,b)(ab)2\psi(a,b)\equiv(a-b)^{2} we obtain:

Assume the hypothesis of Theorem 1.5. Let x^(λ;N)\widehat{x}(\lambda;N) be the LASSO estimator for instance (x0(N),w(N),A(N))(x_{0}(N),w(N),A(N)). Then, almost surely

where ZN(0,1)Z\sim{\sf N}(0,1) is independent of X0pX0X_{0}\sim p_{X_{0}}, τ=τ(α(λ))\tau_{*}=\tau_{*}(\alpha(\lambda)) and θ=α(λ)τ(α(λ))\theta_{*}=\alpha(\lambda)\tau_{*}(\alpha(\lambda)).

As a second corollary of Theorem 1.5, the function λα(λ)\lambda\mapsto\alpha(\lambda) is indeed uniquely defined.

For any λ,σ2>0\lambda,\sigma^{2}>0 there exists a unique α>αmin\alpha>\alpha_{\rm min} such that λ(α)=λ\lambda(\alpha)=\lambda (with the function αλ(α)\alpha\to\lambda(\alpha) defined as in Eq. (1.18).

Hence the function λα(λ)\lambda\mapsto\alpha(\lambda) is continuous non-decreasing with α((0,))A=(α0,)\alpha((0,\infty))\equiv{\cal A}=(\alpha_{0},\infty).

The proof of this corollary (which uses Theorem 1.5) is provided in Appendix A.3.

We prove Theorem 1.5 by proving the following result in Section 3.

Assume the hypotheses of Theorem 1.5. Let x^(λ;N)\widehat{x}(\lambda;N) be the LASSO estimator for instance (x0(N),w(N),A(N))(x_{0}(N),w(N),A(N)), and denote by {xt(N)}t0\{x^{t}(N)\}_{t\geq 0} the sequence of estimates produced by AMP. Then

Let us emphasize that the statement of Theorem 1.8 requires taking the limit of infinite dimensions NN\to\infty before the limit of an infinite number of iterations tt\to\infty. In this sense it is (informally speaking) a statement about the high-dimensional limit behavior, for a large-but-finite number of iterations. Although this is not a common setting within mathematical optimization, we think that it is particularly compelling from a compressed sensing point of view. It implies that, for any finite tolerance ε>0\varepsilon>0, there exists a finite number of iterations t(ε)t_{*}(\varepsilon) such that for any fixed tt(ε)t\geq t_{*}(\varepsilon), AMP has mean squared error at most ε\varepsilon larger than the LASSO, with high probability as NN\to\infty. Further, closer analysis of the state evolution recursion [DMM09, DMM10] implies that t(ε)Clog(1/ε)t_{*}(\varepsilon)\leq C\,\log(1/\varepsilon) for some constant CC independent of the dimension, and the signal x0x_{0}, provided the under-sampling ratio δ\delta is larger than a phase transition value δc\delta_{c}. Notice that taking the high dimensional point of view yields us a considerably faster convergence than the optimum rate at fixed dimension, namely t(ε)C/εt_{*}(\varepsilon)\leq C/\sqrt{\varepsilon} [BT09].

3 Related work

The LASSO was introduced in [Tib96, CD95]. Several papers provide performance guarantees for the LASSO or similar convex optimization methods [CRT06, CT07], by proving upper bounds on the resulting mean squared error. These works assume an appropriate ‘isometry’ condition to hold for AA. While such condition hold with high probability for some random matrices, it is often difficult to verify them explicitly. Further, it is only applicable to very sparse vectors x0x_{0}. These restrictions are intrinsic to the worst-case point of view developed in [CRT06, CT07].

Guarantees have been proved for correct support recovery in [ZY06], under an appropriate ‘incoherence’ assumption on AA. While support recovery is an interesting conceptualization for some applications (e.g. model selection), the metric considered in the present paper (mean squared error) provides complementary information and is quite standard in many different fields.

Closer to the spirit of this paper [RFG09] derived expressions for the mean squared error under the same model considered here. Similar results were presented recently in [KWT09, GBS09]. These papers argue that a sharp asymptotic characterization of the LASSO risk can provide valuable guidance in practical applications. For instance, it can be used to evaluate competing optimization methods on large scale applications, or to tune the regularization parameter λ\lambda.

Unfortunately, these results were non-rigorous and were obtained through the famously powerful ‘replica method’ from statistical physics [MM09].

Let us emphasize that the present paper offers two advantages over these recent developments: (i)(i) It is completely rigorous, thus putting on a firmer basis this line of research; (ii)(ii) It is algorithmic in that the LASSO mean squared error is shown to be equivalent to the one achieved by a low-complexity message passing algorithm.

Numerical illustrations

Further, our result is asymptotic, while and one might wonder how accurate it is for instances of moderate dimensions.

We obtained the optimum estimator x^\widehat{x} using CVX, a package for specifying and solving convex programs [GB10] and OWLQN, a package for solving large-scale versions of LASSO [AG07]. We used several values of λ\lambda between and 22 and NN equal to 200200, 500500, 10001000, and 20002000. The aspect ratio of matrices was fixed in all cases to δ=0.64\delta=0.64. For each case, the point (λ,MSE)(\lambda,{\rm MSE}) was plotted and the results are shown in the figures. Continuous lines corresponds to the asymptotic prediction by Corollary 1.6, namely δ(τ2σ2)\delta(\tau_{*}^{2}-\sigma^{2}).

The agreement is remarkably good already for N,nN,n of the order of a few hundreds, and deviations are consistent with statistical fluctuations.

The four figures correspond to measurement matrices AA:

Figure 4: Data consist of 22532253 measurements of expression level of 70777077 genes.From this matrix we took sub-matrices AA of aspect ratio δ\delta for each NN. The entries were continuous variables. We standardized all columns of AA to have mean 0 and variance 1.

Figure 5: From a data set of 19321932 patient records we extracted 48334833 binary features describing demographic information, medical history, lab results, medications etc. The -11 matrix was sparse (with only 3.1%3.1\% non-zero entries). Similar to (i)(i), for each NN, the sub-matrices AA with aspect ratio δ\delta were selected and standardized.

Figure 6: Random gaussian matrices with aspect ratio δ\delta and iid N(0,1/n){\sf N}(0,1/n) entries (as in Theorem 1.5);

Figure 7: Random ±1\pm 1 matrices with aspect ratio δ\delta. Each entry is independently equal to +1/n+1/\sqrt{n} or 1/n-1/\sqrt{n} with equal probability.

Notice the behavior appears to be essentially indistinguishable. Also the asymptotic prediction has a minimum as a function of λ\lambda. The location of this minimum can be used to select the regularization parameter. Further empirical analysis is presented in [BBM11].

A structural property and proof of the main results

The rest of the paper is devoted to the proof of Theorem 1.8. Section 3.2 proves a structural property that is the key tool in this proof. Section 3.3 uses this property together with a few lemmas to prove Theorem 1.8

The proof of Theorem 1.5 follows immediately from Theorem 1.8.

For any t0t\geq 0, we have, by the pseudo-Lipschitz property of ψ\psi,

where the second inequality follows by Cauchy-Schwarz. Next we take the limit NN\to\infty followed by tt\to\infty. The first term vanishes by Theorem 1.8. For the second term, note that x022/N\|x_{0}\|_{2}^{2}/N remains bounded since (x0,w,A)(x_{0},w,A) is a converging sequence. The two terms xt+122/N\|x^{t+1}\|_{2}^{2}/N and x^22/N\|\widehat{x}\|_{2}^{2}/N also remain bounded in this limit because of state evolution (as proved in Lemma 3.2 below).

where we used Theorem 1.1 and Proposition 1.3. ∎

For a matrix MM we denote its minimum and maximum singular values by σmin(M)\sigma_{\rm min}(M), σmax(M)\sigma_{\rm max}(M) respectively. We also denote the minimum non-zero singular value of MM by σ^min(M)\hat{\sigma}_{\rm min}(M).

2 A structural property of the LASSO cost function

One main challenge in the proof of Theorem 1.5 lies in the fact that the function xCA,y(x)x\mapsto{\cal C}_{A,y}(x) is not –in general– strictly convex. Hence there can be, in principle, vectors xx of cost very close to the optimum and nevertheless far from the optimum.

The following Lemma provides conditions under which this does not happen.

There exists a function ξ(ε,c1,,c5)\xi(\varepsilon,c_{1},\dots,c_{5}) such that the following happens.

There exists \mboxsg(C,x)C(x)\mbox{\rm sg}({\cal C},x)\in\partial{\cal C}(x) with \mboxsg(C,x)2Nε\|\mbox{\rm sg}({\cal C},x)\|_{2}\leq\sqrt{N}\,\varepsilon;

Let v(1/λ)[A(yAx)+\mboxsg(C,x)]x1v\equiv(1/\lambda)[A^{*}(y-Ax)+\mbox{\rm sg}({\cal C},x)]\in\partial\|x\|_{1}, and S(c2){i[N]:  vi1c2}S(c_{2})\equiv\{i\in[N]:\;|v_{i}|\geq 1-c_{2}\}. Then, for any S[N]S^{\prime}\subseteq[N], Sc3N|S^{\prime}|\leq c_{3}N, we have σmin(AS(c2)S)c4\sigma_{\rm min}(A_{S(c_{2})\cup S^{\prime}})\geq c_{4};

The maximum singular value of AA is bounded: σmax(A)2c5\sigma_{\rm max}(A)^{2}\leq c_{5}.

Then r2Nξ(ε,c1,,c5)\|r\|_{2}\leq\sqrt{N}\,\xi(\varepsilon,c_{1},\dots,c_{5}). Further for any c1,,c5>0c_{1},\dots,c_{5}>0, ξ(ε,c1,,c5)0\xi(\varepsilon,c_{1},\dots,c_{5})\to 0 as ε0\varepsilon\to 0.

Further, if ker(A)={0}\ker(A)=\{0\}, the same conclusion holds under assumptions 1, 2, 3, 5.

Throughout the proof we denote ξ1,ξ2,\xi_{1},\xi_{2},\dots functions of the constants c1,,c5>0c_{1},\dots,c_{5}>0 and of ε\varepsilon such that ξi(ε)0\xi_{i}(\varepsilon)\to 0 as ε0\varepsilon\to 0 (we shall omit the dependence of ξi\xi_{i} on ε\varepsilon).

Let S=supp(x)[N]S={\rm supp}(x)\subseteq[N]. We have

where (a)(a) follows from hypothesis (2), (c)(c) from the fact that vS=\mboxsign(xS)v_{S}=\mbox{\rm sign}(x_{S}) since vx1v\in\partial\|x\|_{1} which gives

and (d)(d) follows from the definition of (v)(v).

Using hypothesis (1) and (3), we get by Cauchy-Schwarz

Each of the three terms on the left-hand side is non-negative. The third one is trivial. The first one is non-negative since

and each (xi+ri)[\mboxsign(xi+ri)\mboxsign(xi)](x_{i}+r_{i})\left[\mbox{\rm sign}(x_{i}+r_{i})-\mbox{\rm sign}(x_{i})\right] is either equal to (when \mboxsign(xi)=\mboxsign(xi+ri)\mbox{\rm sign}(x_{i})=\mbox{\rm sign}(x_{i}+r_{i})) or equal to 2xi+ri2|x_{i}+r_{i}| otherwise. The second term in (3.3) is also non-negative since riviri=ri[1vi\mboxsign(ri)]|r_{i}|-v_{i}r_{i}=|r_{i}|[1-v_{i}\mbox{\rm sign}(r_{i})] and 1vi\mboxsign(ri)1\geq v_{i}\,\mbox{\rm sign}(r_{i}) since vi1|v_{i}|\leq 1 by definition of subgradient. Therefore,

Since Ar22(c42/4)r22\|A_{\perp}r^{\perp}\|_{2}^{2}\geq(c_{4}^{2}/4)\|r^{\perp}\|^{2}_{2}, we have

In the case V={0}V_{\parallel}=\{0\}, the proof is concluded. In the case V{0}V_{\parallel}\neq\{0\}, we need to prove an analogous bound for rr^{\parallel}. From Eq. (3.4) together with rS1NrS2Nr2(2N/c4)ξ1(ε)\|r^{\perp}_{\overline{S}}\|_{1}\leq\sqrt{N}\|r^{\perp}_{\overline{S}}\|_{2}\leq\sqrt{N}\|r^{\perp}\|_{2}\leq(2N/c_{4})\sqrt{\xi_{1}(\varepsilon)}, we get

Where (3.9) follows immediately from definition of AA_{\perp} and rr^{\parallel}. Now, notice that S(c2)S\overline{S}(c_{2})\subseteq\overline{S}. From Eq. (3.8) and definition of S(c2)S(c_{2}) it follows that

To conclude the proof, it is sufficient to prove an analogous bound for rS+22\|r^{\parallel}_{S_{+}}\|_{2}^{2} with S+=[N]S+=S(c2)S1S_{+}=[N]\setminus\overline{S}_{+}=S(c_{2})\cup S_{1}. Since S1Nc3|S_{1}|\leq Nc_{3}, we have by hypothesis (4) that σmin(AS+)c4\sigma_{\rm min}(A_{S_{+}})\geq c_{4}. By Eq. (3.9) we have Ar=Ar=AS+rS++AS+rS+A_{\parallel}r^{\parallel}=Ar^{\parallel}=A_{S_{+}}r^{\parallel}_{S_{+}}+A_{\overline{S}_{+}}r^{\parallel}_{\overline{S}_{+}}. Therefore

In the last step we used triangular inequality together with the fact that σmax(AS+)2c5\sigma_{\rm max}(A_{\overline{S}_{+}})^{2}\leq c_{5} (by assumption (5)) and σmax(A)c4/2\sigma_{\rm max}(A_{\parallel})\leq c_{4}/2 (by construction). Using r22=rS+22+rS+22\|r^{\parallel}\|^{2}_{2}=\|r^{\parallel}_{S_{+}}\|_{2}^{2}+\|r^{\parallel}_{\overline{S}_{+}}\|_{2}^{2}, we get

This finishes the proof when S(c2)Nc3/2|\overline{S}(c_{2})|\geq Nc_{3}/2. Note that if this assumption does not hold then we can take S+=\overline{S}_{+}=\emptyset and S+=[N]S_{+}=[N]. Hence, the result follows as a special case of above. ∎

3 Proof of Theorem 1.8

The proof is based on a series of Lemmas that are used to check the assumptions of Lemma 3.1

Under the conditions of Theorem 1.5, assume λ>0\lambda>0 and α=α(λ)\alpha=\alpha(\lambda). Denote by x^(λ;N)\widehat{x}(\lambda;N) the LASSO{\rm LASSO} estimator and by {xt(N)}\{x^{t}(N)\} the sequence of AMP estimates. Then there is a constant B{\sf B} such that for all t0t\geq 0, almost surely

The second Lemma implies that the estimates of AMP are approximate minima, in the sense that the cost function C{\cal C} admits a small subgradient at xtx^{t}, when tt is large. The proof is deferred to Section 5.2.

Under the conditions of Theorem 1.5, for all tt there exists a subgradient \mboxsg(C,xt)\mbox{\rm sg}({\cal C},x^{t}) of C{\cal C} at point xtx^{t} such that almost surely,

The next lemma implies that sub-matrices of AA constructed using the first tt iterations of the AMP algorithm are non-singular (more precisely, have singular values bounded away from ). The proof can be found in Section 5.3.

Let S[N]S\subseteq[N] be measurable on the σ\sigma-algebra St\mathfrak{S}_{t} generated by {z0,,zt1}\{z^{0},\dots,z^{t-1}\} and {x0+Az0,,xt1+Azt1}\{x^{0}+A^{*}z^{0},\dots,x^{t-1}+A^{*}z^{t-1}\} and assume SN(δc)|S|\leq N(\delta-c) for some c>0c>0. Then there exists a1=a1(c)>0a_{1}=a_{1}(c)>0 (independent of tt) and a2=a2(c,t)>0a_{2}=a_{2}(c,t)>0 (depending on tt and cc) such that

eventually almost surely as NN\to\infty.

We will apply this lemma to a specific choice of the set SS. Namely, defining

for γ(0,1)\gamma\in(0,1). Our last lemma shows that this sequence of sets St(γ)S_{t}(\gamma) ‘converges’ in the following sense. The proof can be found in Section 5.4.

Fix γ(0,1)\gamma\in(0,1) and let the sequence {St(γ)}t0\{S_{t}(\gamma)\}_{t\geq 0} be defined as in Eq. (3.17) above. For any ξ>0\xi>0 there exists t=t(ξ,γ)<t_{*}=t_{*}(\xi,\gamma)<\infty such that, for all t2t1tt_{2}\geq t_{1}\geq t_{*} fixed, we have

eventually almost surely as NN\to\infty.

The above two lemmas imply the following.

There exist constants γ1(0,1)\gamma_{1}\in(0,1), γ2\gamma_{2}, γ3>0\gamma_{3}>0 and tmin<t_{\rm min}<\infty such that, for any ttmint\geq t_{\rm min},

eventually almost surely as NN\to\infty.

First notice that, for any fixed γ\gamma, the set St(γ)S_{t}(\gamma) is measurable on St\mathfrak{S}_{t}. Indeed by Eq. (1.8) St\mathfrak{S}_{t} contains {x0,,xt}\{x^{0},\dots,x^{t}\} as well, and hence it contains vtv^{t} which is a linear combination of xt1+Azt1x^{t-1}+A^{*}z^{t-1}, xtx^{t}. Finally St(γ)S_{t}(\gamma) is obviously a measurable function of vtv^{t}.

Using Lemma F.3(b) the empirical distribution of (x0Azt1xt1,x0)(x_{0}-A^{*}z^{t-1}-x^{t-1},x_{0}) converges weakly to (τt1Z,X0)(\tau_{t-1}Z,X_{0}) for ZN(0,1)Z\sim{\sf N}(0,1) independent of X0pX0X_{0}\sim p_{X_{0}}. (Following the notation of [BM11], we let ht=x0Azt1xt1h^{t}=x_{0}-A^{*}z^{t-1}-x^{t-1}.) Therefore, for any constant γ\gamma we have almost surely

The last equality follows from the weak convergence of the empirical distribution of {(hi,x0,i)}i[N]\{(h_{i},x_{0,i})\}_{i\in[N]} (from Lemma F.3(b), which takes the same form as Theorem 1.8), together with the absolute continuity of the distribution of X0+τt1Zη(X0+τt1Z,θt1)|X_{0}+\tau_{t-1}Z-\eta(X_{0}+\tau_{t-1}Z,\theta_{t-1})|.

eventually almost surely as NN\to\infty, for all fixed tt larger than some tmin,1(c)t_{{\rm min},1}(c).

For any ttmin,1(c)t\geq t_{{\rm min},1}(c) we can apply Lemma 3.4 for some a1(c)a_{1}(c), a2(c,t)>0a_{2}(c,t)>0. Fix c>0c>0 and let a1=a1(c)a_{1}=a_{1}(c) be fixed as well. Let tmin=max(tmin,1,t(a1/2,γ1))t_{\rm min}=\max(t_{{\rm min},1},t_{*}(a_{1}/2,\gamma_{1})) (with t()t_{*}(\,\cdot\,) defined as per Lemma 3.5). Take a2=a2(c,tmin)a_{2}=a_{2}(c,t_{\rm min}). Obviously ta2(c,t)t\mapsto a_{2}(c,t) is non-increasing. Then we have, by Lemma 3.4

where both events hold eventually almost surely as NN\to\infty. The claim follows with γ2=a1(c)/2\gamma_{2}=a_{1}(c)/2 and γ3=a2(c,tmin)\gamma_{3}=a_{2}(c,t_{\rm min}). ∎

We are now in position to prove Theorem 1.8.

We apply Lemma 3.1 to x=xtx=x^{t}, the AMP estimate and r=x^xtr=\widehat{x}-x^{t} the distance from the LASSO{\rm LASSO} optimum. The thesis follows by checking conditions 1–5. Namely we need to show that there exists constants c1,,c5>0c_{1},\dots,c_{5}>0 and, for each ε>0\varepsilon>0 some t=t(ε)t=t(\varepsilon) exists such that 1–5 hold eventually almost surely as NN\to\infty.

Condition 2 is immediate since x+r=x^x+r=\widehat{x} minimizes C(){\cal C}(\,\cdot\,).

Condition 3 follows from Lemma 3.3 with ε\varepsilon arbitrarily small for tt large enough.

Condition 4. Notice that this condition only needs to be verified for δ<1\delta<1.

Take v=vtv=v^{t} as defined in Eq. (3.16). Using the definition (1.8), it is easy to check that vit1|v_{i}^{t}|\leq 1 if xit=0x_{i}^{t}=0 and vit=\mboxsign(xit)v_{i}^{t}=\mbox{\rm sign}(x^{t}_{i}) otherwise. In other words vtx1v^{t}\in\partial\|x\|_{1} as required. Further by inspection of the proof of Lemma 3.3, it follows that vt=(1/λ)[A(yAxt)+\mboxsg(C,xt)]v^{t}=(1/\lambda)[A^{*}(y-Ax^{t})+\mbox{\rm sg}({\cal C},x^{t})], with \mboxsg(C,xt)\mbox{\rm sg}({\cal C},x^{t}) the subgradient bounded in that lemma (cf. Eq. (5.3)). The condition then holds by Proposition 3.6.

Condition 5 follows from standard limit theorems on the singular values of Wishart matrices (cf. Theorem F.2). ∎

State evolution estimates

This section contains a reminder of the state-evolution method developed in [BM11]. For greater convenience of the reader, we also restate two lemmas from [BM11] (namely, Lemmas F.3 and F.3) in appendix F.3. We will use these two Lemmas throughout our analysis.

We also state some extensions of those results that will be proved in the appendices.

AMP, cf. Eq. (1.8) is a special case of the general iterative procedure given by Eq. (3.1) of [BM11]. This takes the general form

where ξt=g(bt,w)\xi_{t}=\langle g^{\prime}(b^{t},w)\rangle, λt=1δft(ht,x0)\lambda_{t}=\frac{1}{\delta}\langle f^{\prime}_{t}(h^{t},x_{0})\rangle (both derivatives are with respect to the first argument).

and the initial condition is q0=x0q^{0}=-x_{0}.

Regarding ht,bth^{t},b^{t} as column vectors, the equations for b0,,bt1b^{0},\ldots,b^{t-1} and h1,,hth^{1},\ldots,h^{t} can be written in matrix form as:

or in short Yt=AQtY_{t}=AQ_{t} and Xt=AMtX_{t}=A^{*}M_{t}.

Following [BM11], we define St\mathfrak{S}_{t} as the σ\sigma-algebra generated by b0,,bt1b^{0},\ldots,b^{t-1}, m0,,mt1m^{0},\ldots,m^{t-1}, h1,,hth^{1},\ldots,h^{t}, and q0,,qtq^{0},\ldots,q^{t}. The conditional distribution of the random matrix AA given the σ\sigma-algebra St{\mathfrak{S}_{t}}, is given by

Here PMt=IPMtP_{M_{t}}^{\perp}=I-P_{M_{t}}, PQt=IPQtP_{Q_{t}}^{\perp}=I-P_{Q_{t}}, and PQtP_{Q_{t}}, PMtP_{M_{t}} are orthogonal projector onto column spaces of QtQ_{t} and MtM_{t} respectively.

Before proceeding, it is convenient to introduce the notation

to denote the coefficient of zt1z^{t-1} in Eq. (1.8). Using ht=x0Azt1xt1h^{t}=x_{0}-A^{*}z^{t-1}-x^{t-1} and Lemma F.3(b) (proved in [BM11]) we get, almost surely,

Notice that the function η(;θt1)\eta^{\prime}(\,\cdot\,;\theta_{t-1}) is discontinuous and therefore Lemma F.3(b) does not apply immediately. On the other hand, this implies that the empirical distribution of {(Azit1+xit1,x0,i)}1iN\{(A^{*}z^{t-1}_{i}+x^{t-1}_{i},x_{0,i})\}_{1\leq i\leq N} converges weakly to the distribution of (X0+τt1Z,X0)(X_{0}+\tau_{t-1}Z,X_{0}). The claim follows from the fact that X0+τt1ZX_{0}+\tau_{t-1}Z has a density, together with the standard properties of weak convergence.

2 Some consequences and generalizations

We begin with a simple calculation, that will be useful.

If {zt}t0\{z^{t}\}_{t\geq 0} are the AMP residuals, then, almost surely,

Using representation (4.5) and Lemma F.3(b)(c), we get

Next, we need to generalize state evolution to compute large system limits for functions of xtx^{t}, xsx^{s}, with tst\neq s. To this purpose, we define the covariances {Rs,t}s,t0\{{\sf R}_{s,t}\}_{s,t\geq 0} recursively by

with ZtN(0,Rt,t)Z_{t}\sim{\sf N}(0,{\sf R}_{t,t}) independent of X0X_{0}. This determines by the above recursion Rt,s{\sf R}_{t,s} for all t0t\geq 0 and for all s0s\geq 0.

With these definition, we have the following generalization of Theorem 1.1.

Clearly this result reduces to Theorem 1.1 in the case s=ts=t by noting that Rt,t=τt2{\sf R}_{t,t}=\tau^{2}_{t}. The general proof can be found in Appendix B.

The following lemma implies that, asymptotically for large NN, the AMP estimates converge.

Under the condition of Theorem 1.5, the estimates {xt}t0\{x^{t}\}_{t\geq 0} and residuals {zt}t0\{z^{t}\}_{t\geq 0} of AMP almost surely satisfy

Proofs of auxiliary lemmas

In order to bound the norm of xtx^{t}, we use state evolution, Theorem 1.1, for the function ψ(a,b)=a2\psi(a,b)=a^{2},

for ZN(0,1)Z\sim{\sf N}(0,1) and independent of X0pX0X_{0}\sim p_{X_{0}}. The expectation on the right hand side is bounded and hence limtlimNxt,xt\lim_{t\to\infty}\lim_{N\to\infty}\langle x^{t},x^{t}\rangle is bounded.

The last bound holds almost surely as NN\to\infty, using standard asymptotic estimate on the singular values of random matrices (cf. Theorem F.2) implying that σmax(A)\sigma_{\max}(A) has a bounded limit almost surely, together with the fact that (x0,w,A)(x_{0},w,A) is a converging sequence.

Now, decompose x^\widehat{x} as x^=x^+x^\widehat{x}=\widehat{x}_{\parallel}+\widehat{x}_{\perp} where x^ker(A)\widehat{x}_{\parallel}\in\ker(A) and x^ker(A)\widehat{x}_{\perp}\in\ker(A)^{\perp} (the orthogonal complement of ker(A)\ker(A)). Since, x^\widehat{x}_{\parallel} belongs to the random subspace ker(A)\ker(A) with dimension Nn=N(1δ)N-n=N(1-\delta), Kashin theorem (cf. Theorem F.1) implies that there exists a positive constant c1=c1(δ)c_{1}=c_{1}(\delta) such that

Hence, by using triangle inequality and Cauchy-Schwarz, we get

By definition of cost function we have x^1λ1C(x^)\|\widehat{x}\|_{1}\leq\lambda^{-1}{\cal C}(\widehat{x}). Further, limit theorems for the eigenvalues of Wishart matrices (cf. Theorem F.2) imply that there exists a constant c=c(δ)c=c(\delta) such that asymptotically almost surely x^2cAx^2\|\widehat{x}_{\perp}\|^{2}\leq c\,\|A\widehat{x}_{\perp}\|^{2}. Therefore (denoting by ci: i=2,3,4c_{i}:~{}i=2,3,4 bounded constants), we have

The claim follows by using the Eq. (5.1) to bound C(x^)/N{\cal C}(\widehat{x})/N and using Ax0+w2σmax(A)2x02+w22NB1\|Ax_{0}+w\|^{2}\leq\sigma_{\max}(A)^{2}\|x_{0}\|^{2}+\|w\|^{2}\leq 2N{\sf B}_{1} to bound the last term. \Box

2 Proof of Lemma 3.3

First note that equation xt=η(Azt1+xt1;θt1)x^{t}=\eta(A^{*}z^{t-1}+x^{t-1};\theta_{t-1}) of AMP implies

Therefore, the vector \mboxsg(C,xt)λstA(yAxt)\mbox{\rm sg}({\cal C},x^{t})\equiv\lambda\,s^{t}-A^{*}(y-Ax^{t}) where

is a valid subgradient of C{\cal C} at xtx^{t}. On the other hand, yAxt=ztωtzt1y-Ax^{t}=z^{t}-\omega_{t}z^{t-1}. We finally get

It is straightforward to see from Eqs. (LABEL:eq:x(t)=eta_interpreted) and (5.3) that (I)=λ(xt1xt)(I)=\lambda(x^{t-1}-x^{t}). Hence,

By Lemma 4.3, and the fact that σmax(A)\sigma_{\max}(A) is almost surely bounded as NN\to\infty (cf. Theorem F.2), we deduce that the two terms λxtxt1/(θt1N)\lambda\|x^{t}-x^{t-1}\|/(\theta_{t-1}\sqrt{N}) and σmax(A)ztzt12/N\sigma_{\max}(A)\|z^{t}-z^{t-1}\|^{2}/\sqrt{N} converge to when NN\to\infty and then tt\to\infty. For the third term, using state evolution (see Lemma 4.1), we obtain limNzt12/N<\lim_{N\to\infty}\|z^{t-1}\|^{2}/N<\infty. Finally, using the calibration relation Eq. (1.18), we get

3 Proof of Lemma 3.4

The proof uses the representation (4.9), together with the expression (4.10) for the conditional expectation. Apart from the matrices YtY_{t}, QtQ_{t}, XtX_{t}, MtM_{t} introduced there, we will also use

We state below a somewhat more convenient description.

It is clearly sufficient to prove that, for v=v+vv=v_{\parallel}+v_{\perp}, PQv=vP_{Q}v_{\parallel}=v_{\parallel}, PQv=vP_{Q}^{\perp}v_{\perp}=v_{\perp}, we have

The first identity is an easy consequence of the fact that XQ=MAQ=MYX^{*}Q=M^{*}AQ=M^{*}Y, while the second one follows immediately from Qv=0Q^{*}v_{\perp}=0. ∎

The following fact (see Appendix D for a proof) will be used several times.

For any tt there exists c>0c>0 such that, for R{QQ;MM;XX;YY}R\in\{Q^{*}Q;\,M^{*}M;\,X^{*}X;\,Y^{*}Y\}, eventually almost surely as NN\to\infty,

Given the above remarks, we will immediately see that Lemma 3.4 is implied by the following statement.

Let S[N]S\subseteq[N] be given such that SN(δγ)|S|\leq N(\delta-\gamma), for some γ>0\gamma>0. Then there exists α1=α1(γ)>0\alpha_{1}=\alpha_{1}(\gamma)>0 (independent of tt) and α2=α2(γ,t)>0\alpha_{2}=\alpha_{2}(\gamma,t)>0 (depending on tt and γ\gamma) such that

eventually almost surely as NN\to\infty. (With Ev=Y(QQ)1QPQv+M(MM)1XPQvEv=Y(Q^{*}Q)^{-1}Q^{*}P_{Q}v+M(M^{*}M)^{-1}X^{*}P_{Q}^{\perp}v.)

In the next section we will show that this lemma implies Lemma 3.4. We will then prove the lemma just stated.

By Borel-Cantelli, it is sufficient to show that, for SS measurable on St\mathfrak{S}_{t} and SN(δc)|S|\leq N(\delta-c) there exist a1=a1(c)>0a_{1}=a_{1}(c)>0 and a2=a2(c,t)>0a_{2}=a_{2}(c,t)>0, such that

for all NN large enough. Conditioning on St\mathfrak{S}_{t} and using the union bound, this probability can be estimated as

where h(p)=plogp(1p)log(1p)h(p)=-p\log p-(1-p)\log(1-p) is the binary entropy function. The union bound calculation indeed proceeds as follows

where XS=minv=1,supp(v)SSAv{\sf X}_{S^{\prime}}=\min_{\|v\|=1,{\rm supp}(v)\subseteq S\cup S^{\prime}}\|Av\|. Now, fix a1<c/2a_{1}<c/2 in such a way that h(a1)α1(c/2)/2h(a_{1})\leq\alpha_{1}(c/2)/2 (with α1\alpha_{1} defined as per Lemma 5.3). Further choose a2=α2(c/2,t)/2a_{2}=\alpha_{2}(c/2,t)/2. The above probability is then upper bounded by

Finally, applying Lemma 5.3 and using Lemma 5.1 to estimate AvAv, we get, for all NN large enough,

3.2 Proof of Lemma 5.3

We begin with the following Pythagorean inequality.

Let S[N]S\subseteq[N] be given such that SN(δγ)|S|\leq N(\delta-\gamma), for some γ>0\gamma>0. Recall that Ev=Y(QQ)1QPQv+M(MM)1XPQvEv=Y(Q^{*}Q)^{-1}Q^{*}P_{Q}v+M(M^{*}M)^{-1}X^{*}P_{Q}^{\perp}v and consider the event

Here the notation (u,v)(u,v) refers to the usual scalar product uvu^{*}v of vectors uu and vv of the same dimension. Assuming that the claim holds, we have indeed

(Notice that in Proposition E.1 is stated for the equivalent case of a random sub-space of fixed dimension dd, and a subspace of dimension scaling linearly with the ambient one.) ∎

Let S[N]S\subseteq[N] be given such that SN(δγ)|S|\leq N(\delta-\gamma), for some γ>0\gamma>0. Then there exists constant c1=c1(γ)c_{1}=c_{1}(\gamma), c2=c2(γ)c_{2}=c_{2}(\gamma) such that the event

Let VV be the linear space V=im(PQPS)V={\rm im}(P_{Q}^{\perp}P_{S}). Of course the dimension of VV is at most N(δγ)N(\delta-\gamma). Then we have (for all vectors with supp(v)S{\rm supp}(v)\subseteq S)

Finally a simple bound to control the norm of EvEv.

There exists a constant c=c(t)>0c=c(t)>0 such that, defining the event,

we have that E3{\cal E}_{3} holds eventually almost surely as NN\to\infty.

The bound EPQvc(t)1PQv\|EP_{Q}^{\perp}v\|\leq c(t)^{-1}\|P^{\perp}_{Q}v\| is proved analogously. ∎

By Lemma 5.6 we can assume that event E3{\cal E}_{3} holds, for some function c=c(t)c=c(t) (without loss of generality c<1/2c<1/2). We will let E{\cal E} be the event

Let us assume first that PQvc2/10\|P_{Q}^{\perp}v\|\leq c^{2}/10, whence

where the last inequality uses PQv=1PQv21/2\|P_{Q}v\|=\sqrt{1-\|P_{Q}^{\perp}v\|^{2}}\geq 1/2. Therefore, using Lemma 5.4, we get

Next we assume PQvc2/10\|P_{Q}^{\perp}v\|\geq c^{2}/10. Due to Lemma 5.4 and 5.5 we can assume that events E1{\cal E}_{1} and E2{\cal E}_{2} hold. Therefore

4 Proof of Lemma 3.5

The key step consists in establishing the following result, which will be instrumental in the proof of Lemma 4.3 as well (and whose proof is deferred to Appendix C.1).

Assume α>αmin(δ)\alpha>\alpha_{\rm min}(\delta) and let {Rs,t}\{{\sf R}_{s,t}\} be defined by the recursion (4.13) with initial condition (4.14). Then there exists constants B1{\sf B}_{1}, r1>0{\sf r}_{1}>0 such that for all t0t\geq 0

It is also useful to prove the following fact.

For any α>0\alpha>0 and T0T\geq 0, the T×TT\times T matrix RT+1{Rs,t}0s,t<TR_{T+1}\equiv\{{\sf R}_{s,t}\}_{0\leq s,t<T} is strictly positive definite.

almost surely. Hence, RT+1=a.s.δlimN(MT+1MT+1/N)R_{T+1}\stackrel{{\scriptstyle{\rm a.s.}}}{{=}}\delta\lim_{N\to\infty}(M_{T+1}^{*}M_{T+1}/N). Thus the result follows from Lemma 5.2. ∎

It is then relatively easy to deduce the following.

Assume α>αmin(δ)\alpha>\alpha_{\rm min}(\delta) and let {Rs,t}\{{\sf R}_{s,t}\} be defined by the recursion (4.13) with initial condition (4.14). Then there exists constants B2{\sf B}_{2}, r2>0{\sf r}_{2}>0 such that for all t1,t2t0t_{1},t_{2}\geq t\geq 0

By triangular inequality and Eq. (5.11), we have

which, together with Eq. (5.14) proves our claim. ∎

We are now in position to prove Lemma 3.5.

We will show that, under the assumptions of the Lemma, limNSt2(γ)St1(γ)/Nξ\lim_{N\to\infty}|S_{t_{2}}(\gamma)\setminus S_{t_{1}}(\gamma)|/N\leq\xi almost surely, which implies our claim. Indeed, by Theorem 4.2 we have

Let a(1γ)ατa\equiv(1-\gamma)\alpha\tau_{*}. By Proposition 1.3, for any ε>0\varepsilon>0 and all tt_{*} large enough we have (1γ)θti1aε|(1-\gamma)\theta_{t_{i}-1}-a|\leq\varepsilon for i{1,2}i\in\{1,2\}. Then

where the last inequality follows by Lemma 5.9. By taking ε=er2t/3\varepsilon=e^{-{\sf r}_{2}\,t_{*}/3} we finally get (for some constant CC) Pt1,t2Cer2tP_{t_{1},t_{2}}\leq C\,e^{-{\sf r}_{2}t_{*}}, which implies our claim. ∎

Acknowledgement

It is a pleasure to thank David Donoho and Arian Maleki for many stimulating exchanges. We are also indebted with José Bento who collaborated in preparing Figures 4 to 7.

An earlier version of this paper stated some auxiliary lemmas in terms of convergence in probability. We rectified this to convergence almost sure as for the main theorems (with virtually no change in the proofs). We are grateful to Edgar Dobriban and Weijie Su for pointing out this inconsistency.

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.

Appendix A Properties of the state evolution recursion

It is a straightforward calculus exercise to compute the partial derivatives

From these formulae we obtain the total derivative

with the inequality being strict whenever α>0\alpha>0, u0u\neq 0. It follows that τ2F(τ2,ατ)\tau^{2}\mapsto{\sf F}(\tau^{2},\alpha\tau) is concave, and strictly concave provided α>0\alpha>0 and X0X_{0} is not identically .

which is strictly positive for all α0\alpha\geq 0. To see this, let f(α)(1+α2)Φ(α)αϕ(α)f(\alpha)\equiv(1+\alpha^{2})\Phi(-\alpha)-\alpha\,\phi(\alpha), and notice that f(α)=2αΦ(α)2ϕ(α)<0f^{\prime}(\alpha)=2\alpha\Phi(-\alpha)-2\phi(\alpha)<0, and f()=0f(\infty)=0.

Since τ2F(τ2,ατ)\tau^{2}\mapsto{\sf F}(\tau^{2},\alpha\tau) is concave, and strictly increasing for τ2\tau^{2} large enough, it also follows that it is increasing everywhere.

Notice that αf(α)\alpha\mapsto f(\alpha) is strictly decreasing with f(0)=1/2f(0)=1/2. Hence, for α>αmin(δ)\alpha>\alpha_{\rm min}(\delta), we have F(τ2,ατ)>τ2{\sf F}(\tau^{2},\alpha\tau)>\tau^{2} for τ2\tau^{2} small enough and F(τ2,ατ)<τ2{\sf F}(\tau^{2},\alpha\tau)<\tau^{2} for τ2\tau^{2} large enough. Therefore the fixed point equation admits at least one solution. It follows from the concavity of τ2F(τ2,ατ)\tau^{2}\mapsto{\sf F}(\tau^{2},\alpha\tau) that the solution is unique and that the sequence of iterates τt2\tau_{t}^{2} converge to τ\tau_{*}. \Box

A.2 Proof of Proposition 1.4

As a first step, we claim that ατ2(α)\alpha\mapsto\tau_{*}^{2}(\alpha) is continuously differentiable on (0,)(0,\infty). Indeed this is defined as the unique solution of

Since (τ2,α)F(τ2,ατ)(\tau^{2},\alpha)\mapsto{\sf F}(\tau^{2}_{*},\alpha\tau_{*}) is continuously differentiable and 0dFdτ2(τ2,ατ)<10\leq\frac{{\rm d}{\sf F}}{{\rm d}\tau^{2}}(\tau^{2}_{*},\alpha\tau_{*})<1 (the second inequality being a consequence of concavity plus limτ2dFdτ2(τ2,ατ)<1\lim_{\tau^{2}\to\infty}\frac{{\rm d}{\sf F}}{{\rm d}\tau^{2}}(\tau^{2},\alpha\tau)<1, both shown in the proof of Proposition 1.3), the claim follows from the implicit function theorem applied to the mapping (τ2,α)[τ2F(τ2,α)](\tau^{2},\alpha)\mapsto[\tau^{2}-F(\tau^{2},\alpha)].

Next notice that τ2(α)+\tau_{*}^{2}(\alpha)\to+\infty as ααmin(δ)\alpha\downarrow\alpha_{\rm min}(\delta). Indeed, introducing the notation Flimτ2dFdτ2(τ2,ατ){\sf F}^{\prime}_{\infty}\equiv\lim_{\tau^{2}\to\infty}\frac{{\rm d}{\sf F}}{{\rm d}\tau^{2}}(\tau^{2},\alpha\tau), we have, again by concavity,

i.e. τ2F(0,0)/(1F)\tau_{*}^{2}\geq{\sf F}(0,0)/(1-{\sf F}^{\prime}_{\infty}). Now F(0,0)σ2{\sf F}(0,0)\geq\sigma^{2}, while F1{\sf F}^{\prime}_{\infty}\uparrow 1 as ααmin(δ)\alpha\downarrow\alpha_{\rm min}(\delta) (shown in the proof of Proposition 1.3), whence the claim follows.

Next consider the function (α,τ2)g(α,τ2)(\alpha,\tau^{2})\mapsto g(\alpha,\tau^{2}) defined by

Notice that λ(α)=g(α,τ2(α))\lambda(\alpha)=g(\alpha,{\tau_{*}}^{2}(\alpha)). Since gg is continuously differentiable, it follows that αλ(α)\alpha\mapsto\lambda(\alpha) is continuously differentiable as well.

Using the characterization of αmin\alpha_{\rm min} in Eq. (1.17) (and the well known inequality αΦ(α)ϕ(α)\alpha\Phi(-\alpha)\leq\phi(\alpha) valid for all α>0\alpha>0), it is immediate to show that l<0l_{*}<0. Therefore

A.3 Proof of Corollary 1.7

By Proposition 1.4, it is sufficient to prove that, for any λ>0\lambda>0 there exists a unique α>αmin\alpha>\alpha_{\rm min} such that λ(α)=λ\lambda(\alpha)=\lambda. Assume by contradiction that there are two distinct such values α1\alpha_{1}, α2\alpha_{2}.

Notice that in this case, the function α(λ)\alpha(\lambda) is not defined uniquely and we can apply Theorem 1.5 to both choices α(λ)=α1\alpha(\lambda)=\alpha_{1} and α(λ)=α2\alpha(\lambda)=\alpha_{2}. Using the test function ψ(x,y)=(xy)2\psi(x,y)=(x-y)^{2} we deduce that

Since the left hand side does not depend on the choice of α\alpha, it follows that τ(α1)=τ(α2)\tau_{*}(\alpha_{1})=\tau_{*}(\alpha_{2}).

Next apply Theorem 1.5 to the function ψ(x,y)=x\psi(x,y)=|x|. We get

Appendix B Proof of Theorem 4.2

First note that using representation (4.2) we have xt+Azt=x0ht+1x^{t}+A^{*}z^{t}=x_{0}-h^{t+1}. Furthermore, using Lemma F.3(b) we have almost surely

For s=t=0s=t=0 we have using Lemma F.3(b) almost surely

Induction hypothesis: Assume that for all sks\leq k and tkt\leq k,

Then we prove Eq. (B.2) for t=k+1t=k+1 (case s=k+1s=k+1 is similar). First assume s=0s=0 and t=k+1t=k+1 in which using Lemma F.3(c) we have almost surely

Similarly, for the case t=k+1t=k+1 and s>0s>0, using Lemma F.3(b)(c) we have almost surely

using the induction hypothesis. Hence the result follows.

Appendix C Proof of Lemma 4.3

The proof of Lemma 4.3 relies on Lemma 5.7 which we will prove in the first subsection.

Before proving Lemma 5.7, we state and prove the following property of gaussian random variables.

for ff the indicator function of II. Since the Ornstein-Uhlenbeck process is reversible with respect to the standard gaussian measure μG\mu_{\rm G}, we have

It is convenient to change coordinates and define

Next we will show that by induction on tt that the stronger inequality yt,3<(yt,1+yt,2)y_{t,3}<(y_{t,1}+y_{t,2}) holds for all tt. We have indeed

The initial condition implied by Eq. (4.14) is

We will consider the above iteration for arbitrary initialization y0y_{0} (satisfying y0,3<y0,1+y0,2y_{0,3}<y_{0,1}+y_{0,2}) and will show the following three facts:

Fact (i)(i). As tt\to\infty, yt,1,yt,2τ2y_{t,1},y_{t,2}\to\tau_{*}^{2}. Further the convergence is monotone.

Fact (ii)(ii). If y0,1=y0,2=τ2y_{0,1}=y_{0,2}=\tau_{*}^{2} and y0,32τ2y_{0,3}\leq 2\tau_{*}^{2}, then yt,1=yt,2=τ2y_{t,1}=y_{t,2}=\tau_{*}^{2} for all tt and yt,30y_{t,3}\to 0.

Fact (iii)(iii). The jacobian J=JG(y)J=J_{{\sf G}}(y_{*}) of G{\sf G} at y=(τ2,τ2,0)y_{*}=(\tau_{*}^{2},\tau_{*}^{2},0) has spectral radius σ(J)<1\sigma(J)<1.

By simple compactness arguments, Facts (i)(i) and (ii)(ii) imply ytyy_{t}\to y_{*} as tt\to\infty. (Notice that yt,3y_{t,3} remains bounded since yt,3(yt,1+yt,2)y_{t,3}\leq(y_{t,1}+y_{t,2}) and by the convergence of yt,1,yt,2y_{t,1},y_{t,2}.) Fact (iii)(iii) implies that convergence is exponentially fast.

Proof of Fact (i)(i). Notice that yt,2y_{t,2} evolves independently by yt+1,2=G2(yt)=F(y2,t,αy2,t)y_{t+1,2}={\sf G}_{2}(y_{t})={\sf F}(y_{2,t},\alpha\sqrt{y_{2,t}}), with F(,){\sf F}(\,\cdot\,,\,\cdot\,) the state evolution mapping introduced in Eq. (1.9). It follows from Proposition 1.3 that yt,2τ2y_{t,2}\to\tau_{*}^{2} monotonically for any initial condition. Since yt+1,1=yt,2y_{t+1,1}=y_{t,2}, the same happens for yt,1y_{t,1}.

where Zt1=τ2x2/4Z+(x/2)WZ_{t-1}=\sqrt{\tau_{*}^{2}-x^{2}/4}Z+(x/2)W, Zt=τ2x2/4Z(x/2)WZ_{t}=\sqrt{\tau_{*}^{2}-x^{2}/4}Z-(x/2)W. In particular, by Lemma C.1, xG(x)x\mapsto{\sf G}_{*}(x) is strictly increasing (notice that the covariance of Zt1Z_{t-1} and ZtZ_{t} is τ2(x/2)\tau_{*}^{2}-(x/2) which is decreasing in xx). Further

Hence, since λ>0\lambda>0 using Eq. (1.18) we have G(0)<1{\sf G}^{\prime}(0)<1. Finally, by Lemma C.1, xG(x)x\mapsto{\sf G}^{\prime}(x) is decreasing in [0,2τ)[0,2\tau_{*}). It follows that yt,3G(0)ty0,30y_{t,3}\leq{\sf G}^{\prime}(0)^{t}y_{0,3}\to 0 as claimed.

Proof of Fact (iii)(iii). From the definition of G{\sf G}, we have the following expression for the Jacobian

where with an abuse of notation we let F(τ2)dτ2dτ2F(τ2,ατ)τ2=τ2{\sf F}^{\prime}(\tau_{*}^{2})\equiv\left.\frac{{\rm d}\phantom{\tau^{2}}}{{\rm d}\tau^{2}}{\sf F}(\tau^{2},\alpha\tau)\right|_{\tau^{2}=\tau^{2}_{*}}. Computing the eigenvalues of the above matrix, we get

Since G(0)<1{\sf G}_{*}^{\prime}(0)<1 as proved above, and F(τ2)<1{\sf F}(\tau_{*}^{2})<1 as per Proposition 1.3, the claim follows. ∎

C.2 Lemma 5.7 implies Lemma 4.3

Using representations (4.4) and (4.3) (i.e., bt=wztb^{t}=w-z^{t} and qt=x0xtq^{t}=x_{0}-x^{t}) and Lemma F.3(c) we obtain,

where the last equality uses qt=xtx0q^{t}=x^{t}-x_{0}. Therefore, it is sufficient to prove the thesis for xt+1xt2\|x^{t+1}-x^{t}\|_{2}. By state evolution, Theorem 4.2, we have

The first term vanishes as tt\to\infty because θt=ατtατ\theta_{t}=\alpha\tau_{t}\to\alpha\tau_{*} by Proposition 1.3. The second term instead vanishes since Rt,tτ{\sf R}_{t,t}\to\tau_{*}, Rt,t1τ{\sf R}_{t,t-1}\to\tau_{*} by Lemma 5.7.

Appendix D Proof of Lemma 5.2

First note that the upper bound on λmax(R/N)\lambda_{\max}(R/N) is trivial since using representations (4.7), (4.8), qt=ft(ht,x0)q^{t}=f_{t}(h^{t},x_{0}), mt=gt(bt,w)m^{t}=g_{t}(b^{t},w) and Lemma F.3(c)(d) all entries of the matrix R/NR/N are bounded as NN\to\infty and the matrix has fixed dimensions. Hence, we only focus on the lower-bound for λmin(R/N)\lambda_{\min}(R/N).

The result for R=MMR=M^{*}M and R=QQR=Q^{*}Q follows directly from Lemma F.3(g) and Lemma 8 of [BM11].

For R=YYR=Y^{*}Y and R=XXR=X^{*}X the proof is by induction on tt.

For t=1t=1 we have Yt=b0Y_{t}=b^{0} and Xt=h1+ξ0q0=h1x0X_{t}=h^{1}+\xi_{0}q^{0}=h^{1}-x_{0}. Using Lemma F.3(b)(c) we obtain almost surely

Induction hypothesis: Assume that for all tkt\leq k there exist positive constants cX(t)c_{X}(t) and cY(t)c_{Y}(t) such that as NN\to\infty

First write at=(a1,,at)\vec{a}_{t}=(a_{1},\ldots,a_{t}) and denote its first t1t-1 coordinates with at1\vec{a}_{t-1}. Next, we consider the conditional distribution ASt1A|_{\mathfrak{S}_{t-1}}. Using Eqs. (4.9) and (4.10) we obtain (since Yt=AQtY_{t}=AQ_{t})

Hence, conditional on St1\mathfrak{S}_{t-1} we have, almost surely

From Lemma F.3(g) we know that limNqt1,qt1\lim_{N\to\infty}\langle q^{t-1}_{\perp},q^{t-1}_{\perp}\rangle is larger than a positive constant ςt\varsigma_{t}. Hence, from representation (D.3) and induction hypothesis (D.1)

To simplify the notation let ctlimNN1/2bt1+λt1mt2c^{\prime}_{t}\equiv\lim_{N\to\infty}N^{-1/2}\|b^{t-1}+\lambda_{t-1}m^{t-2}\|. Now if ctatcY(t1)at1/2c^{\prime}_{t}|a_{t}|\leq\sqrt{c_{Y}(t-1)}\|\vec{a}_{t-1}\|/2 then

which proves the result. Otherwise, we obtain the inequality

Appendix E A concentration estimate

The following proposition follows from standard concentration-of-measure arguments.

Let Nd(ε/2)N_{d}(\varepsilon/2) be a (ε/2)(\varepsilon/2)-net in SdS_{d}, i.e. a subset of vectors {u1,,uM}Sd\{u^{1},\dots,u^{M}\}\in S^{d} such that, for any uSdu\in S^{d}, there exists i{1,,M}i\in\{1,\dots,M\} such that uuiε/2\|u-u^{i}\|\leq\varepsilon/2. It follows from a standard counting argument [Led01] that there exists an (ε/2)(\varepsilon/2)-net of size Nd(ε/2)M(100/ε)d|N_{d}(\varepsilon/2)|\equiv M\leq(100/\varepsilon)^{d}. Define

Since uPλQuu\mapsto P_{\lambda}Qu is Lipschitz with modulus 11, we have

which is smaller than emc(ε)/2e^{-mc(\varepsilon)/2} for all mm large enough. ∎

Appendix F Useful reference material

In this appendix we collect a few known results that are used several times in our proof. We also provide some pointers to the literature.

In our proof we make use of the following well-known result of Kashin in the theory of diameters of smooth functions [Kas77].

For any positive number υ\upsilon there exist a universal constant cυc_{\upsilon} such that for any n1n\geq 1, with probability at least 12n1-2^{-n}, for a uniformly random subspace Vn,υV_{n,\upsilon} of dimension n(1υ)\lfloor n(1-\upsilon)\rfloor,

F.2 Singular values of random matrices

We will repeatedly make use of limit behavior of extreme singular values of random matrices. A very general result was proved in [BY93] (see also [BS05]).

We will also use the following fact that follows from the standard singular value decomposition

F.3 Two Lemmas from [BM11]

Our proof uses the results of [BM11]. We state copy here the crucial technical lemma in that paper. Notations refer to the general algorithm in Eq. (4.1). General state evolution defines quantities {τt2}t0\{\tau_{t}^{2}\}_{t\geq 0} and {σt2}t0\{\sigma_{t}^{2}\}_{t\geq 0} via

where WpWW\sim p_{W} and X0pX0X_{0}\sim p_{X_{0}} are independent of ZN(0,1)Z\sim{\sf N}(0,1)

where (Z0,,Zt)(Z_{0},\ldots,Z_{t}) and (Z^0,,Z^t)(\hat{Z}_{0},\ldots,\hat{Z}_{t}) are two zero-mean gaussian vectors independent of X0X_{0}, WW, with Zi,Z^iN(0,1)Z_{i},\hat{Z}_{i}\sim{\sf N}(0,1).

For all 0r,st0\leq r,s\leq t the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

Here φ\varphi^{\prime} denotes derivative with respect to the first coordinate of φ\varphi.

For all 0rt0\leq r\leq t and 0st10\leq s\leq t-1 the following limits exist, and there exist strictly positive constants ρr\rho_{r} and ςs\varsigma_{s} (independent of NN, nn) such that almost surely

It is also useful to recall some simple properties of gaussian random matrices.

References