The dynamics of message passing on dense graphs, with applications to compressed sensing

Mohsen Bayati, Andrea Montanari

Introduction and main results

for an appropriate sequence of non-linear functions {ηt}t0\{\eta_{t}\}_{t\geq 0}. (Here by convention any variable with negative index is assumed to be .) The algorithm succeeds if xtx^{t} converges to a good approximation of x0x_{0} (cf. [DMM09] for details).

Three findings were presented in [DMM09]:

For a large class of random matrices AA, the behavior of AMP algorithm is accurately described by a formalism called ‘state evolution’ (SE). Extensive numerical experiments tested this claim on gaussian, Radamacher, and partial Fourier matrices;

The sparsity-undersampling tradeoff of AMP as derived from SE coincides, for an appropriate choice of the functions ηt\eta_{t}, with the one of convex optimization approaches. Let us stress that standard convex optimization algorithms do not scale to large applications (e.g. to image processing), while the computational complexity of AMP is as low as the one of the simplest greedy algorithms;

As a byproduct of (1)(1) and (2)(2), SE allows to re-derive reconstruction phase boundaries earlier determined via random polytope geometry (see in particular [DT05, DT09] and references therein).

These findings were based on heuristic arguments and numerical simulations. In this paper we provide the first rigorous support to finding (1)(1), by proving that SE holds in the large system limit, for random sensing matrices AA with gaussian entries. Implications on points (2)(2) and (3)(3) will be reported in a forthcoming paper.

Interestingly, state evolution gives access to sharp predictions that cannot be derived from random polytope geometry. A prominent example is the noise sensitivity of LASSO, which is investigated in [DMM10c].

Note that AMP is an approximation to the following message passing algorithm. For all i,j[N]i,j\in[N] and a,b[n]a,b\in[n] (here and below [N]{1,2,,N}[N]\equiv\{1,2,\ldots,N\}) start with messages xja0=0x_{j\to a}^{0}=0 and proceed by

As argued in [DMM10a], AMP accurately approximates message passing in the large system limit. We refer to appendix A for an heuristic argument justifying the AMP update rules (1.1) starting from the algorithm (1.2). While this derivation is not necessary for the proofs of this paper, it can help the reader familiar with message passing algorithms to develop the correct intuition.

An important tool for the analysis of message passing algorithms is provided by density evolution [RU08]. Density evolution is known to hold asymptotically for sequences of sparse graphs that are locally tree-like. The factor graph underlying the algorithm (1.2) is dense: indeed it is the complete bipartite graph. State evolution can be regarded (in a very precise sense) as the analogue of density evolution for dense graphs.

For the sake of concreteness, we will focus in this Section on the algorithm (1.1), and will keep to the compressed sensing language. Nevertheless our analysis applies to a much larger family of message passing algorithms on dense graphs, for instance the multi-user detection algorithm studied in [Kab03, NS05, MT06]. Applications to such algorithms are discussed in Section 2. Section 3 describes an even more general formulation, as well as the proof of our theorems. Finally, Section 4 describes a generalization to the case of symmetric matrices AA that is directly related to the work of Erwin Bolthausen [Bol09] .

It is important to mention that the algorithms (1.1) and (1.2) are completely different from gaussian belief propagation (BP). The gaussian assumption refers indeed to the distribution of the matrix entries, not to the variables to be inferred. More generally, none of the existing rigorous results for BP seem to be applicable here.

It is remarkable that density evolution (in its special incarnation, SE) holds for dense graphs. This is at odds with the standard argument used for justifying density evolution so far: ‘density evolution works because the graph is locally tree-like.’ To the best of our knowledge, the approach developed here is the first one that overcomes the limitations of the standard argument (a discussion of earlier literature is provided in Section 1.4).

We begin with some missing definitions for algorithm (1.1). We assume

with X0pX0X_{0}\sim p_{X_{0}} and ZN(0,1)Z\sim{\sf N}(0,1) independent from X0X_{0}. We will use the term state evolution to refer both to the recursion (1.4) (or its more general version introduced in Section 3.2) and to the sequence {τt}t0\{\tau_{t}\}_{t\geq 0} that it defines.

ϕ\phi is locally Lipschitz, that is for any M>0M>0 there exist a constant LM,m<L_{M,m}<\infty such that for all x,y[M,M]mx,y\in[-M,M]^{m},

Further, LM,mc[1+(Mm)k1]L_{M,m}\leq c[1+(M\sqrt{m})^{k-1}] for some constant cc.

In the following we shall use generically LL for Lipschitz constants entering bounds of this type. It is understood (and will not be mentioned explicitly) that the constant must be properly adjusted at various passages.

with X0pX0X_{0}\sim p_{X_{0}} and ZN(0,1)Z\sim{\sf N}(0,1) independent.

Up to a trivial change of variables, this is a formalization of the findings of [DMM09] (cf. in particular Eqs. (7), (8) and Finding 2 in that paper).

As an immediate consequence of the above theorem we have the following decoupling principle implying that a typical (finite) subset of the coordinates of xtx^{t} are asymptotically independent.

For the proof of this corollary we refer to Section 3.10.

2 Universality

Our proof technique heavily relies on the assumption that A(N)A(N) is gaussian. Nevertheless, we expect the convergence expressed in Theorem 1 to be a fairly general result. In particular, we expect it to hold for matrices with i.i.d. entries with zero mean and variance 1/n1/n, under a suitable moment condition. This type of universality is quite common in random matrix theory, and several arguments suggest that it should hold in the present case. For instance, it is possible to prove that state evolution holds for this broader class of random matrices when ηt()\eta_{t}(\,\cdot\,) is affine. Also, the heuristic argument discussed in the next section is clearly insensitive to the details of distribution of the entries.

Numerical evidence presented in [DMM09] (we refer in particular to the online supplement) suggests that state evolution might hold for an even broader class of matrices. Determining the domain of such an universality class is an outstanding open problem.

3 State evolution: the basic intuition

The state evolution recursion has a simple heuristic description, that is useful to present here since it clarifies the difficulties involved in the proof. In particular, this description brings up the key role played by the last term in the update equation for ztz^{t}, that we will call the ‘Onsager term’, following [DMM09].

Consider again the recursion (1.1), but introduce the following three modifications: (i)(i) Replace the random matrix AA with a new independent copy A(t)A(t) at each iteration tt; (ii)(ii) Correspondingly replace the observation vector yy with yt=A(t)x0+wy^{t}=A(t)x_{0}+w; (iii)(iii) Eliminate the last term in the update equation for ztz^{t}. We thus get the following dynamics:

where A(0),A(1),A(2),A(0),A(1),A(2),\dots are i.i.d. matrices of dimensions n×Nn\times N with i.i.d. entries Aij(t)N(0,1/n)A_{ij}(t)\sim{\sf N}(0,1/n). (Notice that, unlike in the rest of the paper, we use here the argument of AA to denote the iteration number, and not the matrix dimensions.)

This recursion is most conveniently written by eliminating ztz^{t}:

Using the central limit theorem, it is easy to show that each entry of B(t)B(t) is approximately normal, with zero mean and variance 1/n1/n. Further, distinct entries are approximately pairwise independent. Therefore, if we let τ^t2=limNxtx02/N\widehat{\tau}_{t}^{2}=\lim_{N\to\infty}\|x^{t}-x_{0}\|^{2}/N, we obtain that B(t)(xtx0)B(t)(x^{t}-x_{0}) converges to a vector with i.i.d. normal entries with mean and variance Nτ^t2/n=τ^t2/δN\widehat{\tau}_{t}^{2}/n=\widehat{\tau}_{t}^{2}/\delta. Notice that this is true because A(t)A(t) is independent of {A(s)}1st1\{A(s)\}_{1\leq s\leq t-1} and, in particular, of (xtx0)(x^{t}-x^{0}).

Conditional on ww, A(t)wA(t)^{*}w is a vector of i.i.d. normal entries with mean and variance (1/n)w2(1/n)\|w\|^{2} which converges by the law of large numbers to σ2\sigma^{2}. A slightly longer exercise shows that these entries are approximately independent from the ones of B(t)(xtx0)B(t)(x^{t}-x_{0}). Summarizing, each entry of the vector in the argument of ηt\eta_{t} in Eq. (1.10) converges to X0+τtZX_{0}+\tau_{t}Z with ZN(0,1)Z\sim{\sf N}(0,1) independent of X0X_{0}, and

On the other hand, by Eq. (1.10), each entry of xt+1x0x^{t+1}-x_{0} converges to ηt(X0+τtZ)X0\eta_{t}(X_{0}+\tau_{t}\,Z)-X_{0}, and therefore

Using together Eq. (1.11) and (1.12) we finally obtain the state evolution recursion, Eq. (1.4).

We conclude that state evolution would hold if the matrix AA was drawn independently from the same gaussian distribution at each iteration. In the case of interest, AA does not change across iterations, and the above argument falls apart because xtx^{t} and AA are dependent. This dependency is non-negligible even in the large system limit NN\to\infty. This point can be clarified by considering the iteration

with a matrix AA constant across iterations. This iteration is the basis of several algorithms in compressed sensing, most notably the so-called ‘iterative soft thresholding’ [DDM04]. Such algorithms have been the object of great interest because of the high computational cost of standard convex optimization methods in large scale applications.

Numerical studies of iterative soft thresholding [DM09, DMM09] show that its behavior is dramatically different from the one in Eq. (1.1) and in particular state evolution does not hold for the iterative soft thresholding iteration (1.13), even in the large system limit.

This is not a surprise: the correlations between AA and xtx^{t} simply cannot be neglected. On the other hand, adding the Onsager term leads to an asymptotic cancelation of these correlations. As a consequence, state evolution holds for the AMP iteration (1.1) despite the fact that the matrix is kept constant.

4 Related literature

As mentioned, the standard argument for justifying density evolution relies on the locally-tree like structure of the underlying graph. This argument was developed and systematically exploited for the analysis of low-density parity-check (LDPC) codes under iterative decoding [RU08]. In this context, density evolution provides an exact tool for computing asymptotic thresholds of code ensembles based on sparse graph constructions. Optimization of these thresholds has been a major design principle in LDPC codes.

The locally tree-like property is a special case of local weak convergence. Local weak convergence of graph sequences was first defined and studied in probability theory by Benjamini and Schramm [BS96], and then greatly developed by David Aldous [AS03], in particular to study the so called ‘random assignment problem’ [Ald01]. Loosely speaking, local weak convergence allows to treat sequences of graphs of increasing size, such that the neighborhood of a node converges to a well defined limit object.

The random assignment problem is defined as a distribution of random instances of the assignment problem on complete bipartite graphs. In particular, such graphs are not locally tree-like. Nevertheless, they admit a rather simple local weak limit (called the PWIT), which is a tree. The basic reason is that only a sparse subgraph of the complete bipartite graph is relevant for the minimum cost assignment, namely the one of edges with small cost. One concrete way to derive density evolution in this case is indeed to eliminate all the edges of cost larger than –say– Δn/n\Delta_{n}/n with Δn\Delta_{n} diverging slowly with the graph size nn. The resulting graph is sparse and one can apply standard arguments (cf. [MM09] for an outline of this argument). A more sophisticated argument was presented in [SS09] which nevertheless uses the existence of a non-trivial local weak limit, and the fact that only a sparse subgraph is relevant (Lemma 4.1 in [SS09]).

This reduction to a sparse graph, and hence to a limit tree, is impossible in the class of algorithms studied in our paper: the algorithm iteration cannot be approximated by an iteration on a sparse graph (at least not on an instance-by-instance basis). This corresponds to the fact that no (simple) local weak limit exists in our case. The underlying graph is the complete bipartite graph with vertex sets [N]{1,2,,N}[N]\equiv\{1,2,\ldots,N\} and [n]{1,2,,n}[n]\equiv\{1,2,\ldots,n\}, and edge-weights AaiA_{ai} for all (a,i)[n]×[N](a,i)\in[n]\times[N]. If we choose a node i[N]i\in[N] as root, its depth-11 neighborhood consists of [n][n] node, each carrying a weight of order 1/n1/\sqrt{n}. Even this small neighborhood has no simple local weak limit.

This difference is analogous to the difference between mean-field spin glasses (e.g. the Sherrington-Kirkpatrick model) and the random assignment problem [Tal10, Chapter 7]. As a consequence, our proof does not rely on local weak convergence, and has to deal directly with the intricacies of graphs with many short cycles.

The theorem proved in this paper is not only relevant for [DMM09] but for a larger context as well. First of all, following the work by Tanaka [Tan02], hundreds of papers have been published in information theory using the replica method to study multi-user detection problems. In its replica-symmetric version, the replica method typically predicts the system performances through the solution of a system of non-linear equations, which coincide with the fixed point equations for state evolution. The present result provides a rigorous foundation to that line of work, along with the analysis of a concrete algorithm that achieves those performance. Further, [GV05] insisted on the role of a ‘decoupling principle’ that emerges from the replica method, and on the insight it provides. Corollary 1 indeed proves a specific form of this decoupling principle.

A more recent line of works uses the replica method to study typical performances of compressed sensing methods. Although non-rigorous and limited to asymptotic statements, the replica method has the advantage of providing sharp predictions. Standard techniques instead predict performances up to undetermined multiplicative constants. The determination of these constants can be of guidance for practical applications. This motivated several groups to publish results based on the replica method [RFG09, KWT09, GBS09]. The present paper provides a rigorous foundation to this work as well.

Examples

In this section we discuss in greater detail some of the applications of Theorem 1 to specific problems. To be definite, it is convenient to keep in mind a specific observable for applying Theorem 1. If we choose the test function ψ(x,y)=(xy)2\psi(x,y)=(x-y)^{2}, we get almost surely

As a warm-up example consider the case in which the a priori distribution of x0x_{0} is gaussian, namely its entries are i.i.d. N(0,v2){\sf N}(0,v^{2}). It is a consequence of state evolution that the optimal AMP algorithm makes use of linear scalar estimators

Clearly, such functions are Lipschitz continuous, for any λt\lambda_{t} finite. The AMP algorithm (1.1) becomes

Theorem 1 also shows that the empirical distribution of {(Azt+xt)ix0,i}i[N]\{(A^{*}z^{t}+x^{t})_{i}-x_{0,i}\}_{i\in[N]} is asymptotically gaussian with mean and variance τt2\tau_{t}^{2}. Hence, the optimal choice of λt\lambda_{t} is

Notice that this also minimizes the right hand side of Eq. (2.5). Under this choice, the recursion (2.5) yields

The right hand side is a concave function of τt2\tau_{t}^{2}, and is easy to show that τtτ\tau_{t}\to\tau_{\infty} exponentially fast, where, for c=(1δ)/δc=(1-\delta)/\delta,

The mean square error of the resulting algorithm is estimated via Eq. (2.1). In particular, under the optimal choice of λt\lambda_{t}, the latter converges to (τ2σ2)δ(\tau_{\infty}^{2}-\sigma^{2})\delta with τ\tau_{\infty} given as above, thus yielding

We recall that the asymptotic mean square error of optimal (MMSE) linear estimation was computed by Tse-Hanly and Verdú-Shamai in the case of random matrices AA with i.i.d. entries [TH99, VS99]. The motivation came from the analysis of multiuser receivers. The resulting MSE coincides with the value predicted in Eq. (2.9), thus showing that –in the linear case– the AMP algorithm is asymptotically equivalent to the MMSE estimator.

Notice that the computation of the MMSE in [TH99, VS99] relied heavily on the Marcenko-Pastur law for the limit spectral law of Wishart matrices [MP67]. Conversely, any calculation of the MMSE as a function of the noise variance σ2\sigma^{2} gives access to the asymptotic Stieltjis transform of the spectral measure of AA. This suggests that state evolution is a non-trivial result already in the case of linear ηt()\eta_{t}(\,\cdot\,), since it can be used to derive the Marcenko-Pastur law in random matrix theory.

2 Compressed sensing via soft thresholding

(Indeed Theorem 1 accommodates for a more general behavior, since p^x0(N)\hat{p}_{x_{0}(N)} is only required to converge weakly.)

In [DMM09], the authors proposed an algorithm of the form (1.1) with ηt(x)=η(x;θt)\eta_{t}(x)=\eta(x;\theta_{t}) a sequence of soft-threshold functions

The function xη(x;θ)x\mapsto\eta(x;\theta) is non-linear but nevertheless it is Lipschitz continuous. Therefore Theorem 1 applies to this case, and allows to predict the asymptotic mean square error using Eqs. (1.4) and (1.6).

This choice of the nonlinearity ηt\eta_{t} is close to the optimal in minimax sense. Indeed, a substantial literature (see e.g. [DJ94b, DJ94a]) studies the problem of estimating the scalar X0X_{0} from the noisy observation

with ZN(0,s2)Z\sim{\sf N}(0,s^{2}). For an appropriate choice of the threshold θ=θ(ε,s)\theta=\theta(\varepsilon,s), and ε0\varepsilon\downarrow 0 (very sparse sources), the soft thresholding estimator was proved to be minimax optimal, i.e. to achieve the minimum worst-case MSE over the class (2.10). State evolution allows to deduce that the choice (2.14) yields the best algorithm of the form (1.1) for estimating sparse vectors, over the worst-case vector x0x_{0} [DMM09].

It is argued in [DMM09, DMM10c], and proved in [BM10] in the case of gaussian matrices, that the asymptotic MSE of AMP coincides with the one of a popular convex optimization estimation technique, known as the LASSO. The above argument is suggestive of a possible way to prove minimax optimality of the LASSO.

Finally, state evolution provides a systematic way of improving the choice of the non-linearities ηt\eta_{t} when the class of signal changes. The basic idea is to choose the function ηt\eta_{t} that minimizes the right-hand side of Eq. (1.4) in minimax sense. This corresponds to constructing minimax MMSE estimators for the scalar problem (2.15). For instance, the limit case in which the distribution of X0X_{0} is known, the MMSE estimator is simply conditional expectation, which leads to the choice

with ZN(0,1)Z\sim{\sf N}(0,1). In other words, the very choice of the non-linearities is dictated by the gaussian convergence phenomenon described in Theorem 1.

3 Multi-User Detection

The model (1.3) is used to describe the input-output relation in code division multiple access (CDMA) channel. The matrix AA contains the users’ signatures. A frequently used setting for theoretical analysis consists in taking the large system limit with n/Nδn/N\to\delta giving the spreading factor, and in assuming that the signatures (and hence AA) have i.i.d. components. The entries x0,ix_{0,i} belong to the signal constellation used by the system. For the sake of simplicity, we consider the case of antipodal signaling, i.e. x0,i{+1,1}x_{0,i}\in\{+1,-1\} uniformly at random. Other signal constellations can also be treated applying our Theorem 1. The hypothesis that x0x_{0} is independent from AA is also standard in this context and justified by the remark that the transmitted information is independent from the signatures. Further, the source-channel separation theorem naturally leads to the uniform distribution.

The rationale for this choice is that it gives the conditional expectation of a uniformly random signal X0{+1,1}X_{0}\in\{+1,-1\}, given the observation X0+τtZ=xX_{0}+\tau_{t}\,Z=x for ZN(0,1)Z\sim{\sf N}(0,1) gaussian noise. This is therefore a special case of the rule (2.16) and by the argument given there, it achieves minimal mean-square error within the class of algorithms (1.1).

This state evolution recursion was proved in [MT06] for properly chosen sparse signature matrices AA. Theorem 1 provides the first generalization to the more relevant case of dense signatures.

As mentioned in Section 1.4, Tanaka used the replica method to compute the asymptotic performance of a MMSE receiver. The expressions obtained through this method correspond to a fixed point of the recursion (2.19). It was further proved in [MT06] that, whenever the fixed point is unique, this prediction is asymptotically correct. For such values of the parameters, we deduce that the AMP algorithm is asymptotically equivalent to the MMSE receiver.

Let us point out that, in a practical setting, it might be inconvenient to estimate the noise variance and/or to change the function ηt\eta_{t} across iterations. Several authors (see for instance [Tan05]) used the function

State evolution can be applied in this case as well (for any finite β\beta) and reads

On the other hand the case β\beta\to\infty is not covered by our Theorem 1, since it corresponds to the discontinuous function ηt(x)=sign(x)\eta_{t}(x)={\rm sign}(x).

Proof

The proof is based on a conditioning technique developed by Erwin Bolthausen for the analysis of the so-called TAP equations in spin glass theory [Bol09]. Related ideas can also be found in [Don06].

In the next section, we provide a high-level description of the conditioning technique, by using a simpler type of recursion as reference. We will then introduce some new notations and state and prove a more general result than Theorem 1.

Now consider the ttht^{\rm th} iteration (i.e., ht+1=Gmtλtmt1h^{t+1}=Gm^{t}-\lambda_{t}m^{t-1}). The problem in repeating the above argument is that GG and f(mt)f(m^{t}) are dependent. For instance f(mt)f(m^{t}) might a priori align with the minimum eigenvector of GG. More generally the problem is that GG is not independent from the σ\sigma-algebra St\mathfrak{S}_{t} generated by {h0,h1,,ht}\{h^{0},h^{1},\dots,h^{t}\}.

The key idea in the conditioning technique is to avoid computing the conditional distribution of mtm^{t} given GG. We instead compute the conditional distribution of GG given St\mathfrak{S}_{t}.

The next important remark is that St\mathfrak{S}_{t} contains {m0,m1,,mt}\{m^{0},m^{1},\dots,m^{t}\} as well. Conditioning on St\mathfrak{S}_{t} is therefore equivalent to conditioning on the event

which is in turn equivalent to making a set of linear observations of GG.

At this point, the assumption that GG is gaussian plays a crucial role. The conditional distribution of a gaussian random variable GG given linear observations is the same as its conditional expectation plus the projection of an independent gaussian. In formulae:

We refer to the actual proof for a calculation of the various terms involved.

Each of the above terms can be written explicitly as a function of the observed values {m0,m1,,mt}\{m^{0},m^{1},\dots,m^{t}\} and of the new gaussian random variables GnewG^{\rm new}. The first term Gnew(Ptmt)G^{\rm new}({\rm P}^{t}_{\perp}m^{t}) is clearly gaussian. The other terms are not. In order to control them, we will proceed by induction over tt and use an appropriate strong law of large numbers for triangular arrays. The key phenomenon is that the only non-gaussian term that does not vanish in the large system limit cancels with the term λtmt1-\lambda_{t}m^{t-1} in recursion (3.1), thus implying the claimed gaussianity of ht+1h^{t+1}.

2 A general result

We describe now a more general recursion than in Eq. (1.1). In the next section we show that the AMP algorithm (1.1) can be regarded as a special case of the recursion defined here.

where ξt=gt(bt,w)\xi_{t}=\langle g^{\prime}_{t}(b^{t},w)\rangle, λt=1δft(ht,x0)\lambda_{t}=\frac{1}{\delta}\langle f^{\prime}_{t}(h^{t},x^{0})\rangle (both derivatives are with respect to the first argument), and we recall that –by definition– m1=0m^{-1}=0.

exists, is positive and finite, for a sequence of initial conditions of increasing dimensions. State evolution defines quantities {τt2}t0\{\tau_{t}^{2}\}_{t\geq 0} and {σt2}t0\{\sigma_{t}^{2}\}_{t\geq 0} via

where WpWW\sim p_{W} and X0pX0X_{0}\sim p_{X_{0}} are independent of ZN(0,1)Z\sim{\sf N}(0,1). Further, recall the notion of pseudo-Lipschitz function for k>1k>1 from Section 1.1. We have the following general result.

where X0pX0X_{0}\sim p_{X_{0}} and WpWW\sim p_{W} are independent of ZN(0,1)Z\sim{\sf N}(0,1), and σt\sigma_{t}, τt\tau_{t} are determined by recursion (3.5).

3 Corollary of Theorem 2: AMP and Theorem 1

As already mentioned, the AMP algorithm (1.1) is a special case of recursion (3.3). The reduction is obtained by defining

The functions ftf_{t} and gtg_{t} are given by

and the initial condition is q0=x0q^{0}=-x_{0}.

Although the recursions (1.1) and (3.3) are equivalent mathematically, only the former can be used as an algorithm. Indeed the recursion (3.3) tracks the difference of the current estimates xtx^{t} from x0x_{0}, and is initialized using x0x^{0} itself. The recursion (3.3) is only relevant for mathematical analysis.

Due to symmetry, for each tt, all coordinates of the vector hth^{t} have the same distribution (similarly for btb^{t}, qtq^{t} and mtm^{t}).

4 Proof of Theorem 1

and τ02=σ2+σ02\tau_{0}^{2}=\sigma^{2}+\sigma_{0}^{2}. Also, by definition, xt+1=ηt(Abt+xt)=ηt(x0ht+1)x^{t+1}=\eta_{t}(A^{*}b^{t}+x^{t})=\eta_{t}(x_{0}-h^{t+1}). Therefore, applying Theorem 2 to the function (hit,x0,i)ψ(ηt1(x0,ihit),x0,i)(h^{t}_{i},x_{0,i})\mapsto\psi(\eta_{t-1}(x_{0,i}-h^{t}_{i}),x_{0,i}) we obtain almost surely

5 Definitions and notations

When the update equation for ht+1h^{t+1} in (3.3) is used, all values of b0,,btb^{0},\ldots,b^{t}, m0,,mtm^{0},\ldots,m^{t}, h1,,hth^{1},\ldots,h^{t} and q0,,qtq^{0},\ldots,q^{t} have been previously calculated. Hence, we can consider the distribution of ht+1h^{t+1} conditioned on all these known variables and also conditioned on x0x_{0} and ww. In particular, define St1,t2\mathfrak{S}_{t_{1},t_{2}} to be the σ\sigma-algebra generated by b0,,bt11b^{0},\ldots,b^{t_{1}-1}, m0,,mt11m^{0},\ldots,m^{t_{1}-1}, h1,,ht2h^{1},\ldots,h^{t_{2}}, q0,,qt2q^{0},\ldots,q^{t_{2}} and x0x_{0} and ww. The basic idea of the proof is to compute the conditional distributions btSt,tb^{t}|_{\mathfrak{S}_{t,t}} and ht+1St+1,th^{t+1}|_{\mathfrak{S}_{t+1,t}}. This is done by characterizing the conditional distribution of the matrix AA given this filtration.

Regarding hth^{t} and btb^{t} as column vectors, the equations for b0,,bt1b^{0},\ldots,b^{t-1} and h1,,hth^{1},\ldots,h^{t} can be written in matrix form as:

or in short Xt=AMtX_{t}=A^{*}M_{t} and Yt=AQtY_{t}=AQ_{t} . Here and below we use vertical lines to indicate columns of a matrix, i.e. [a1a2ak][a_{1}|a_{2}|\dots|a_{k}] is the matrix with columns a1,,aka_{1},\dots,a_{k}.

We will show in Section 3.9 (cf. Corollary 2) that for any fixed tt as NN goes to infinity the quantities βi\beta_{i}’s and αj\alpha_{j}’s have a finite limit.

6 Main technical Lemma

We prove the following more general result.

where (Z0,,Zt)(Z_{0},\ldots,Z_{t}) and (Z^0,,Z^t)(\hat{Z}_{0},\ldots,\hat{Z}_{t}) are two zero-mean gaussian vectors independent of X0X_{0}, WW, with Zi,Z^iN(0,1)Z_{i},\hat{Z}_{i}\sim{\sf N}(0,1).

For all 0r,st0\leq r,s\leq t the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

Here φ\varphi^{\prime} denotes derivative with respect to the first coordinate of φ\varphi.

For all 0rt0\leq r\leq t and 0st10\leq s\leq t-1 the following limits exist, and there exist strictly positive constants ρr\rho_{r} and ςs\varsigma_{s} (independent of NN, nn) such that almost surely

Equations (3.20) and (3.21) have the form of Stein’s lemma [Ste72] (cf. Lemma 4 in Section 3.8).

Assuming Lemma 1 is correct Theorem 2 easily follows. To be more precise, Theorem 2 is a obtained by applying Lemma 1(b) to functions ϕh(y0,,yt,x0,i)=ψ(yt,x0,i)\phi_{h}(y_{0},\ldots,y_{t},x_{0,i})=\psi(y_{t},x_{0,i}) and ϕb(y0,,yt,wi)=ψ(yt,wi)\phi_{b}(y_{0},\ldots,y_{t},w_{i})=\psi(y_{t},w_{i}).

The rest of Section 3 focuses on proof of Lemma 1.

7 Useful probability facts

Before embarking in the actual proof, it is convenient to summarize a few facts that will be used repeatedly.

We will use the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].

Theorem 2.1 in [HT97] is stronger than what we state here. In Appendix B we show how Theorem 3 follows from it.

Next, we present an algebraic inequality that will be used in conjunction with Theorem 3. Its proof is provided in Appendix C

Let u1,,unu_{1},\ldots,u_{n} be a sequence of non-negative numbers. Then for all ε>0\varepsilon>0 the following holds

Next, we present a standard property of Gaussian matrices without proof.

We will apply the following law of large numbers to the sequence {x0(N),w(N)}N\{x_{0}(N),w(N)\}_{N}. Its proof can be found in Appendix D.1.

Next lemma is on reak convergence of Lipschitz functions and its proof is in Appendix D.2.

It is useful to remember a standard formula for the conditional variance of gaussian random variables.

Let (Z1,,Zt)(Z_{1},\ldots,Z_{t}) be a normal random vector with zero mean, and assume that the covariance matrix of (Z1,,Zt1)(Z_{1},\ldots,Z_{t-1}) (denoted by CC) is invertible. Then

An immediate consequence is the following fact, proven in Appendix D.3.

It is also convenient to recall some linear algebra facts. The first one is proved in Appendix D.4.

The second one is just a direct consequence of the fact that the mapping Sλmin(S)S\mapsto\lambda_{\rm min}(S) is continuous at any matrix SS that is invertible.

Let {Sn}n1\{S_{n}\}_{n\geq 1} be a sequence of t×tt\times t matrices such that liminfnλmin(Sn)>c\lim\inf_{n\to\infty}\lambda_{\min}(S_{n})>c for a positive constant cc. Also assume that limnSn=S\lim_{n\to\infty}S_{n}=S_{\infty} where the limit is element-wise. Then, λmin(S)c\lambda_{\min}(S_{\infty})\geq c.

8 Conditional distributions

In order to calculate btSt,tb^{t}|_{\mathfrak{S}_{t,t}} and ht+1St+1,th^{t+1}|_{\mathfrak{S}_{t+1,t}} we will characterize the conditional distributions ASt,tA|_{\mathfrak{S}_{t,t}} and ASt+1,tA|_{\mathfrak{S}_{t+1,t}}.

For (t1,t2)=(t,t)(t_{1},t_{2})=(t,t) or (t1,t2)=(t+1,t)(t_{1},t_{2})=(t+1,t), the conditional distribution of the random matrix AA given the σ\sigma-algebra St1,t2{\mathfrak{S}_{t_{1},t_{2}}}, satisfies

Here PMt2=IPMt2P_{M_{t_{2}}}^{\perp}={\sf I}-P_{M_{t_{2}}}, PQt1=IPQt1P_{Q_{t_{1}}}^{\perp}={\sf I}-P_{Q_{t_{1}}}, and PQt1P_{Q_{t_{1}}}, PMt2P_{M_{t_{2}}} are orthogonal projector onto column spaces of Qt1Q_{t_{1}} and Mt2M_{t_{2}} respectively.

Write mt=mt+mtm^{t}=m^{t}_{\parallel}+m^{t}_{\perp}. Using (3.30) and the fact that Mtmt=0M_{t}^{*}m^{t}_{\perp}=0, we obtain

Similarly, writing qt=qt+qtq^{t}=q^{t}_{\parallel}+q^{t}_{\perp}, qt=Qtβq^{t}_{\parallel}=Q_{t}\vec{\beta}, and using XtQt=MtAQt=MtYtX_{t}^{*}Q_{t}=M_{t}^{*}AQ_{t}=M_{t}^{*}Y_{t}, Qtqt=0Q_{t}^{*}q^{t}_{\perp}=0 we obtain (3.32). ∎

where AF\|A\|_{F} denotes the Frobenius norm of matrix AA. We use Lagrange multipliers method to obtain this minimum. Consider the Lagrangian

FF=F\mathcal{F}\circ\mathcal{F}=\mathcal{F}.

F(A)V={AAQt1=0,AMt2=0}\mathcal{F}(A)\in V=\{A|AQ_{t_{1}}=0,A^{*}M_{t_{2}}=0\}.

F\mathcal{F} is symmetric. That is for all matrices A,BA,B: (F(A),B)=(A,F(B))(\mathcal{F}(A),B)=(A,\mathcal{F}(B)).

(b) is correct since by definition of F(A)Q=PMAPQQ=0\mathcal{F}(A)Q=P_{M}^{\perp}AP_{Q}^{\perp}Q=0 and similarly F(A)M=0\mathcal{F}(A)^{*}M=0.

and each of the last three term vanishes either because AQ=0AQ=0 or because AM=0A^{*}M=0.

9 Proof of Lemma 1

The proof is by induction on tt. Let Ht+1\mathcal{H}_{t+1} be the property that (3.14), (3.16), (3.18), (3.20), (3.22), (3.24), and (3.25) hold. Similarly, let Bt\mathcal{B}_{t} be the property that (3.15), (3.17), (3.19), (3.21), (3.23) and (3.26) hold. The inductive proof consists of the following four main steps.

If Br\mathcal{B}_{r}, Hs\mathcal{H}_{s} hold for all r<tr<t and sts\leq t then Bt\mathcal{B}_{t} holds.

If Br\mathcal{B}_{r}, Hs\mathcal{H}_{s} hold for all rtr\leq t and sts\leq t then Ht+1\mathcal{H}_{t+1} holds.

For each of these steps we will have to prove several properties that we will denote by (a)(a), (b)(b), (c)(c), (d)(d), (e)(e) and (g)(g) according to their appearance in Lemma 1. For H\mathcal{H} we also need to prove a property (f)(f).

It is immediate to check that our claims become trivial if xft(x,X0)x\mapsto f_{t}(x,X_{0}) is constant (i.e. independent of xx) almost surely (with respect to X0pX0X_{0}\sim p_{X_{0}}), or if xgt(x,W)x\mapsto g_{t}(x,W) is constant almost surely (with respect to WpWW\sim p_{W}). We will therefore assume that neither of these degenerate cases hold.

S0,0\mathfrak{S}_{0,0} is generated by x0x_{0}, q0q^{0} and ww. Also q0=q0q^{0}=q^{0}_{\perp} since Q0Q_{0} is an empty matrix. Hence

The last inequality uses assumption on empirical moments of ww. Therefore, we can apply Theorem 3 to get

This case is trivial since there is not msm^{s} with st1=1s\leq t-1=-1.

Note that h1=Am0ξ0q0h^{1}=A^{*}m^{0}-\xi_{0}q^{0}.

S1,0\mathfrak{S}_{1,0} is generated by x0x_{0}, q0q^{0}, ww, b0b^{0} and m0m^{0}. Also m0=m0m^{0}=m^{0}_{\perp} since M0M_{0} is an empty matrix. Applying Lemma 11 we have

But using B0(d)\mathcal{B}_{0}(d) for φ=g0\varphi=g_{0}

Also B0(b)\mathcal{B}_{0}(b), applied to the function ϕb(x,w)=g0(x,w)2\phi_{b}(x,w)=g_{0}(x,w)^{2} gives

where the last estimate follows from Lemma 3(c) and (3.35). Finally,

Using (3.36), (3.35), and Lemma 3, we get

First note that, conditioning on S1,0\mathfrak{S}_{1,0},

Using (3.36) and Lemma 3(a) we have almost surely

Hence, we only need to show 1Ni=1Nai2k2<\frac{1}{N}\sum_{i=1}^{N}\|a_{i}\|^{2k-2}<\infty and 1Ni=1Nci2k2<\frac{1}{N}\sum_{i=1}^{N}\|c_{i}\|^{2k-2}<\infty as NN\to\infty. But

which is bounded using part (e)(e) and the original assumption on x0x_{0}. Similarly, using 1Ni=1Nqi02k2<\frac{1}{N}\sum_{i=1}^{N}\|q_{i}^{0}\|^{2k-2}<\infty, we obtain 1Ni=1Nci2k2<\frac{1}{N}\sum_{i=1}^{N}\|c_{i}\|^{2k-2}<\infty.

The last equality used B0(c)\mathcal{B}_{0}(c).

Since t=0t=0, and q0=q0q^{0}=q^{0}_{\perp} then the result follows from (3.4) and that σ02>0\sigma_{0}^{2}>0.

This part is analogous to step 1 albeit more complex.

Note that using induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b) for ϕb(bir,bis,wi)=gr(bir,wi)gs(bis,wi)\phi_{b}(b^{r}_{i},b^{s}_{i},w_{i})=g_{r}(b^{r}_{i},w_{i})g_{s}(b^{s}_{i},w_{i}), 0r,st10\leq r,s\leq t-1 we have almost surely

But using induction hypothesis, we have limnmr,mr>ςr>0\lim_{n\to\infty}\langle m^{r}_{\perp},m^{r}_{\perp}\rangle>\varsigma_{r}>0 for all r<t1r<t-1. So using Lemma 9, for large enough nn the smallest eigenvalue of matrix Mt1Mt1/nM_{t-1}^{*}M_{t-1}/n is larger than a positive constant cc^{\prime} that is independent of nn. Hence, by Lemma 10 its inverse converges to an invertible limit. Thus, Eqs. (3.37) and (3.38) lead to

Now the result follows from Lemma 8 provided that we show for gaussian random variables σ0Z^0,,σt1Z^t1\sigma_{0}\hat{Z}_{0},\ldots,\sigma_{t-1}\hat{Z}_{t-1}, all conditional variances Var[σrZ^rσ0Z^0,,σr1Z^r1]{\rm Var}[\sigma_{r}\hat{Z}_{r}|\,\sigma_{0}\hat{Z}_{0},\ldots,\sigma_{r-1}\hat{Z}_{r-1}] are strictly positive for r=0,,t1r=0,\ldots,t-1. To prove the latter first using the induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b), we have for all 0rt10\leq r\leq t-1

Similar as above we used the fact that for large enough nn the matrix BrBr/nB_{r}^{*}B_{r}/n has a smallest eigenvalue greater than a positive constant to obtain the limit of its inverse. On the other hand using induction hypothesis Br(c)\mathcal{B}_{r}(c) we have almost surely

And now by induction hypothesis Hr(g)\mathcal{H}_{r}(g) we have limNqr,qr>ρr\lim_{N\to\infty}\langle q^{r}_{\perp},q^{r}_{\perp}\rangle>\rho_{r}. Hence the result follows.

have finite limits as nn and NN converge to \infty.

We can apply Lemma 9 to obtain that for large enough nn the smallest eigenvalue of MtMt/nM_{t}^{*}M_{t}/n is larger than a positive constant cc^{\prime}. Hence by Lemma 10 its inverse has a finite limit. Similarly, we can apply induction hypothesis Ht(g)\mathcal{H}_{t}(g) and Lemmas 9 and 10 to the matrix QtQt/NQ_{t}^{*}Q_{t}/N. ∎

Recall definition of YtY_{t} and XtX_{t} from Section 3.5.

where Ξt=diag(ξ0,,ξt1)\Xi_{t}=\textrm{diag}(\xi_{0},\ldots,\xi_{t-1}), Ht=[h1ht]H_{t}=[h^{1}|\cdots|h^{t}], Bt=[b0bt1]B_{t}=[b^{0}|\cdots|b^{t-1}], and Λt=diag(λ0,,λt1)\Lambda_{t}=\textrm{diag}(\lambda_{0},\ldots,\lambda_{t-1}).

Recall that mt=Mtαm^{t}_{\parallel}=M_{t}\vec{\alpha} and qt=Qtβq^{t}_{\parallel}=Q_{t}\vec{\beta}. On the other hand Yt+1mt=Bt+1mtY_{t+1}^{*}m_{\perp}^{t}=B_{t+1}^{*}m_{\perp}^{t} because Mtmt=0M_{t}^{*}m_{\perp}^{t}=0. Similarly, Xtqt=HtqtX_{t}^{*}q_{\perp}^{t}=H_{t}^{*}q_{\perp}^{t}. Hence we need to show

To simplify the notation denote the matrix MtMt/nM_{t}^{*}M_{t}/n by GG. Therefore,

But using the induction hypothesis Ht(d)\mathcal{H}_{t}(d) for φ=f1,,ft\varphi=f_{1},\dots,f_{t}, and Ht(f)\mathcal{H}_{t}(f), the term hr,qts=0t1βsqs/δ\langle h^{r},q^{t}-\sum_{s=0}^{t-1}\beta_{s}q^{s}\rangle/\delta is almost surely equal to the limit of hr,htλts=0t1βshr,hsλs\langle h^{r},h^{t}\rangle\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}\langle h^{r},h^{s}\rangle\lambda_{s}. This can be modified, using the induction hypothesis Ht(c)\mathcal{H}_{t}(c), to mr1,mt1λts=0t1βsmr1,ms1λs\langle m^{r-1},m^{t-1}\rangle\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}\langle m^{r-1},m^{s-1}\rangle\lambda_{s} almost surely, which can be written as Gr,tλts=0t1βsGr,sλsG_{r,t}\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}G_{r,s}\lambda_{s}. Hence,

Notice that the above series of equalities hold because GG has, almost surely, a non-singular limit as NN\to\infty as shown in point (g)(g) above.

Equation (3.42) is proved analogously, using ξt=g(bt,w)\xi_{t}=\langle g^{\prime}(b^{t},w)\rangle. ∎

The proof of Eq. (3.15) follows immediately since the last lemma yields

Note that, using Lemma 3(c), as n,Nn,N\to\infty,

For r,s<tr,s<t we can use the induction hypothesis. For s=t,r<ts=t,r<t, we can apply Lemma 14 to btb^{t} (proved above), thus obtaining

Note that, by induction hypothesis Bt1(d)\mathcal{B}_{t-1}(d) applied to φ=gt1\varphi=g_{t-1}, and using the bound Bt1(e)\mathcal{B}_{t-1}(e) to control bi,br\langle b^{i},b^{r}\rangle, we deduce that each term mi,br\langle m^{i},b^{r}\rangle has a finite limit. Thus,

where the last estimate uses the induction hypothesis Bt1(c)\mathcal{B}_{t-1}(c) and Ht(c)\mathcal{H}_{t}(c) which imply, almost surely, for some constant cc, PMtbr,PMtbrbr,br<c\langle P_{M_{t}}^{\perp}b^{r},P_{M_{t}}^{\perp}b^{r}\rangle\leq\langle b^{r},b^{r}\rangle<c and qt,qtqt,qt<c\langle q_{\perp}^{t},q_{\perp}^{t}\rangle\leq\langle q^{t},q^{t}\rangle<c for all NN large enough. Finally, using the induction hypothesis Bt1(c)\mathcal{B}_{t-1}(c) for each term of the form bi,br\langle b^{i},b^{r}\rangle (noting that i,rt1i,r\leq t-1) and Corollary 2 we have

The last line uses the definition of βi\beta_{i} and qtqrq_{\perp}^{t}\perp q^{r}.

For the case of r=s=tr=s=t, similarly, we have

The contribution of other terms is o(1)o(1) because:

i=0t1βibi,Mtot(1)=o(1)\langle\sum_{i=0}^{t-1}\beta_{i}b^{i},M_{t}\vec{o}_{t}(1)\rangle=o(1), using Corollary 2 and induction hypothesis Bt1(d)\mathcal{B}_{t-1}(d) for φ=gj\varphi=g_{j}.

The arguments at the last two points are completely analogous to the one carried out in the case s=ts=t, r<tr<t above.

Hence, from the induction hypothesis Bt1(c)\mathcal{B}_{t-1}(c),

Conditioning on St,t\mathfrak{S}_{t,t} and using Lemma 14 (proved at point (a)(a) above), almost surely,

that follows from the Lipschitz assumption on grg_{r}. Thus,

which has a finite limit almost surely, using the induction hypothesis Bt1(e)\mathcal{B}_{t-1}(e) and the assumption on ww.

where Z0,,Zt1Z_{0},\ldots,Z_{t-1} are iid with distribution N(0,1){\sf N}(0,1). and c0,,crc_{0},\dots,c_{r} are allmost surely bounded for all NN large enough. Therefore, almost surely,

Now each term is finite using the same argument as in Eq. (3.44).

Therefore, using Cauchy-Schwartz inequality twice, we have

Hence for any fixed tt, (3.45) vanishes almost surely when nn goes to \infty.

Now given, b0,,bt1b^{0},\ldots,b^{t-1}, consider the random variables

Hence we can use induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b) and Corollary 2 for

where ZZ is an independent N(0,1){\sf N}(0,1) random variable to show

Note that r=0t1βrσrZr+γtZ\sum_{r=0}^{t-1}{\beta_{r}}\sigma_{r}Z_{r}+\gamma_{t}\,Z is gaussian. All that we need, is to show that the variance of this gaussian is σt2\sigma_{t}^{2}. But using a combination of (3.46) and (3.47) for the pseudo-Lipschitz function ϕb(y0,,yt,wi)=yt2\phi_{b}(y_{0},\ldots,y_{t},w_{i})=y_{t}^{2},

𝑡1\mathcal{H}_{t+1} Due to symmetry, proof of this step is very similar to the proof of step 3 and we present only some differences.

This part is very similar to the one of Bt(g)\mathcal{B}_{t}(g).

To prove Eq. (3.14) we use Lemma 14(a) as for Bt(a)\mathcal{B}_{t}(a) to obtain

Now, using Lemma 3(c), as n,Nn,N\to\infty,

For r,s<tr,s<t we can use induction hypothesis. For s=t,r<ts=t,r<t, very similar to the proof of Bt(a)\mathcal{B}_{t}(a),

Now, by induction hypothesis Ht(d)\mathcal{H}_{t}(d), for φ=f\varphi=f, each term qi,hr+1\langle q^{i},h^{r+1}\rangle has a finite limit. Thus,

Where the last line uses the definition of αi\alpha_{i} and mtmrm_{\perp}^{t}\perp m^{r}.

This part is very similar to Bt(e)\mathcal{B}_{t}(e).

Using Ht(a)\mathcal{H}_{t}(a) and Lemma 3(a) we have almost surely

But this limit is almost surely, using the induction hypothesis Hr(e)\mathcal{H}_{r}(e) for r<tr<t and Bt(c)\mathcal{B}_{t}(c).

where ZZ is an independent N(0,1){\sf N}(0,1) random variable, to show

Note that r=0t1αrτrZr+n1/2mtZ\sum_{r=0}^{t-1}{\alpha_{r}}\tau_{r}Z_{r}+n^{-1/2}\|m_{\perp}^{t}\|Z is gaussian. All that we need, is to show that the variance of this gaussian is τt2\tau_{t}^{2}. But using combination of (3.49) and (3.50) for the pseudo-Lipschitz function ϕh(y0,,yt,x0,i)=yt2\phi_{h}(y_{0},\ldots,y_{t},x_{0,i})=y_{t}^{2},

On the other hand in part (c) we proved limNht+1,ht+1=a.s.limNgt(bt,w),gt(bt,w)\lim_{N\to\infty}\langle h^{t+1},h^{t+1}\rangle\stackrel{{\scriptstyle{\rm a.s.}}}{{=}}\lim_{N\to\infty}\langle g_{t}(b^{t},w),g_{t}(b^{t},w)\rangle.

10 Proof of Corollary 1

Symmetric Case

where λt=f(ht)\lambda_{t}=\langle f^{\prime}(h^{t})\rangle. Now let τ12=limNm1,m1\tau_{1}^{2}=\lim_{N\to\infty}\langle m_{1},m_{1}\rangle, and define recursively for t1t\geq 1,

This theorem was proved by Bolthausen in the case f(x)=tanh(βx+h)f(x)=\tanh(\beta x+h) and m1,m1=τ2\langle m^{1},m^{1}\rangle=\tau_{*}^{2}, for τ2\tau_{*}^{2} the fixed point of the recursion (4.2). The general proof is very similar to the one of Theorem 2, and exploits the same conditioning trick. We omit it to avoid repetitions.

When we are calculating ht+1h^{t+1}, all values h1,,hth^{1},\ldots,h^{t} and hence m1,,mtm^{1},\ldots,m^{t} are known to us. Denote the σ\sigma-algebra generated by all of these random variables by Ut\mathfrak{U}_{t}. Moreover, use the following compact formulation for (4.1).

The analogue of Lemma 1 is the following.

where Z1,,ZtZ_{1},\ldots,Z_{t} have N(0,1){\sf N}(0,1) distribution.

For all 1r,st1\leq r,s\leq t the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all 1r,st1\leq r,s\leq t, and for any Lipschitz continuous function φ\varphi, the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all 0rt0\leq r\leq t the following limit exists and there are positive constants ρr\rho_{r} (independent of NN) such that almost surely

Acknowledgement

We are deeply indebted with Erwin Bolthausen, who first presented the conditioning technique during the EURANDOM workshop on ‘Order, disorder and double disorder’ in September 2009. We are also grateful to Dave Donoho and Arian Maleki, for a number of insightful exchanges.

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.

Appendix A AMP algorithm: An heuristic derivation

In this appendix we present an heuristic derivation of the AMP iteration (1.1) starting from the standard message passing formulation (1.2). Let us stress that such derivation is not relevant for the proof of our Theorem 1. Our objective is to help the reader develop an intuitive understanding of the AMP iteration. For further discussion of the connection with belief propagation we refer to [DMM10a, DMM10b].

Let us rewrite the message passing iteration for greater convenience of the reader

Notice that on the right-hand side of both equations the messages appears in sums of Θ(N)\Theta(N) terms. Consider for instance the messages {zait}i[N]\{z_{a\to i}^{t}\}_{i\in[N]} for a fixed node a[n]a\in[n]. These depend on i[N]i\in[N] only because the excluded term changes. It is therefore natural to guess that zait=zat+O(N1/2)z^{t}_{a\to i}=z^{t}_{a}+O(N^{-1/2}) and xiat=xit+O(n1/2)x^{t}_{i\to a}=x^{t}_{i}+O(n^{-1/2}), where zatz^{t}_{a} only depends on the index aa (and not on ii), and xitx^{t}_{i} only depends on ii (and not on aa).

A naïve approximation would consist in neglecting the O(N1/2)O(N^{-1/2}) correction but this turns out to produce a non-vanishing error in the large-NN limit. We instead set

We will now drop the terms that are negligible without writing explicitly the error terms. First of all notice that single terms of the type AaiδzaitA_{ai}{\delta z}_{a\to i}^{t} are of order 1/N1/N and can be safely neglected. Indeed δzai=O(N1/2){\delta z}_{a\to i}=O(N^{-1/2}) by our anzatz, and Aai=O(N1/2)A_{ai}=O(N^{-1/2}) by definition. We get

We next expand the second equation to linear order in δxiat{\delta x}_{i\to a}^{t} and δzait{\delta z}_{a\to i}^{t}:

Notice that the last term on the right hand side of the first equation is the only one dependent on ii, and we can therefore identify this term with δzait{\delta z}_{a\to i}^{t}. We obtain the decomposition

Analogously for the second equation we get

Substituting Eq. (A.4) in Eq. (A.5) to eliminate δzbit{\delta z}_{b\to i}^{t} we get

and using the normalization of AA, we get b[n]Abi21\sum_{b\in[n]}A_{bi}^{2}\to 1, whence

Analogously substituting Eq. (A.6) in (A.3), we get

Again, using the law of large numbers and the normalization of AA, we get

whence substituting in (A.9), we obtain the second equation in (1.1). This finishes our derivation.

Appendix B Strong law of large number for triangular arrays

The last inequality uses ϱ<1\varrho<1 which leads to 2k(1ϱ/2)>12k(1-\varrho/2)>1. Hence, condition (2.4) of [HT97] satisfies as well. Therefore n1i=1nXn,in^{-1}\sum_{i=1}^{n}X_{n,i} converges to almost surely.

Appendix C Proof of Lemma 2

Define f(β)1βlog(i=1neβlogui)f(\beta)\equiv\frac{1}{\beta}\log\left(\sum_{i=1}^{n}e^{\beta\log u_{i}}\right). Lemma 2 is equivalent to show that f(1+ε)f(1)f(1+\varepsilon)\leq f(1). We prove that ff is a decreasing function for all β>0\beta>0. Note that

where H(p)H(p) is the entropy of a probability distribution on {1,,n}\{1,\ldots,n\} with pi=eβloguis=1neβlogusp_{i}=\frac{e^{\beta\log u_{i}}}{\sum_{s=1}^{n}e^{\beta\log u_{s}}} and is always non-negative. This finishes the proof.

Appendix D Proof of probability and linear algebra lemmas

In this Appendix we provide proofs of two probability lemmas stated in Section 3.7.

On the other hand, since ψ\psi is pseudo-Lipschitz with order kk we have ψ(x)L(1+xk)|\psi(x)|\leq L(1+|x|^{k}) for x1|x|\geq 1. Therefore for large enough BB,

D.2 Proof of Lemma 6

Letting Zn(ω)F(Xn(ω),Yn(ω))Z_{n}(\omega)\equiv F^{\prime}(X_{n}(\omega),Y_{n}(\omega)) (if (Xn(ω),Yn(ω))∉CF(X_{n}(\omega),Y_{n}(\omega))\not\in{\cal C}_{F} set Zn(ω)=0Z_{n}(\omega)=0) and Z(ω)F(X(ω),Y(ω))Z(\omega)\equiv F^{\prime}(X(\omega),Y(\omega)), we thus proved that

D.3 Proof of Lemma 8

Let us denote by QQ the covariance of the gaussian vector Z1,,ZtZ_{1},\ldots,Z_{t}. The set of matrices QQ satisfying the constraints with constants c1,,ctc_{1},\dots,c_{t}, KK is compact. Hence if the thesis does not hold, there must exist a specific covariance matrix satisfying these constrains, and such that

If t=max{i{1,,t}:ai0}t_{*}=\max\{i\in\{1,\dots,t\}\,:\,a^{\prime}_{i}\neq 0\}, this implies

which contradicts the assumption Var(ZtZ1,,Zt1)>0{\rm Var}(Z_{t_{*}}|Z_{1},\dots,Z_{t_{*}-1})>0.

D.4 Proof of Lemma 9

We will prove the thesis by induction over tt. The case t=1t=1 is trivial, and assume that the claim is true up for any (t1)(t-1) vectors v1,,vt1v_{1},\dots,v_{t-1}, with constant ct1c^{\prime}_{t-1}. Without loss of generality, we will assume vi2/nK\|v_{i}\|^{2}/n\leq K for some constant KK independent of nn (increasing the norm of the viv_{i}’s increases λmin(C)\lambda_{\rm min}(C)).

Defining uiu_{i} to be the columns of UU, Gram-Schmidt orthonormalization prescribes

Which implies Aii=viPi1(vi)1(cn)1/2A_{ii}=\|v_{i}-P_{i-1}(v_{i})\|^{-1}\leq(cn)^{-1/2} and

It follows that σmax(A)cn1/2\sigma_{\rm max}(A)\leq c^{\prime\prime\prime}n^{-1/2} (with cc^{\prime\prime\prime} depending on nn) whence the thesis follows by Eq. (D.4).

References