The dynamics of message passing on dense graphs, with applications to compressed sensing

Mohsen Bayati, Andrea Montanari

Introduction and main results

for an appropriate sequence of non-linear functions $\{\eta_{t}\}_{t\geq 0}$ . (Here by convention any variable with negative index is assumed to be .) The algorithm succeeds if $x^{t}$ converges to a good approximation of $x_{0}$ (cf. [DMM09] for details).

Three findings were presented in [DMM09]:

For a large class of random matrices $A$ , the behavior of AMP algorithm is accurately described by a formalism called ‘state evolution’ (SE). Extensive numerical experiments tested this claim on gaussian, Radamacher, and partial Fourier matrices;

The sparsity-undersampling tradeoff of AMP as derived from SE coincides, for an appropriate choice of the functions $\eta_{t}$ , with the one of convex optimization approaches. Let us stress that standard convex optimization algorithms do not scale to large applications (e.g. to image processing), while the computational complexity of AMP is as low as the one of the simplest greedy algorithms;

As a byproduct of $(1)$ and $(2)$ , SE allows to re-derive reconstruction phase boundaries earlier determined via random polytope geometry (see in particular [DT05, DT09] and references therein).

These findings were based on heuristic arguments and numerical simulations. In this paper we provide the first rigorous support to finding $(1)$ , by proving that SE holds in the large system limit, for random sensing matrices $A$ with gaussian entries. Implications on points $(2)$ and $(3)$ will be reported in a forthcoming paper.

Interestingly, state evolution gives access to sharp predictions that cannot be derived from random polytope geometry. A prominent example is the noise sensitivity of LASSO, which is investigated in [DMM10c].

Note that AMP is an approximation to the following message passing algorithm. For all $i,j\in[N]$ and $a,b\in[n]$ (here and below $[N]\equiv\{1,2,\ldots,N\}$ ) start with messages $x_{j\to a}^{0}=0$ and proceed by

As argued in [DMM10a], AMP accurately approximates message passing in the large system limit. We refer to appendix A for an heuristic argument justifying the AMP update rules (1.1) starting from the algorithm (1.2). While this derivation is not necessary for the proofs of this paper, it can help the reader familiar with message passing algorithms to develop the correct intuition.

An important tool for the analysis of message passing algorithms is provided by density evolution [RU08]. Density evolution is known to hold asymptotically for sequences of sparse graphs that are locally tree-like. The factor graph underlying the algorithm (1.2) is dense: indeed it is the complete bipartite graph. State evolution can be regarded (in a very precise sense) as the analogue of density evolution for dense graphs.

For the sake of concreteness, we will focus in this Section on the algorithm (1.1), and will keep to the compressed sensing language. Nevertheless our analysis applies to a much larger family of message passing algorithms on dense graphs, for instance the multi-user detection algorithm studied in [Kab03, NS05, MT06]. Applications to such algorithms are discussed in Section 2. Section 3 describes an even more general formulation, as well as the proof of our theorems. Finally, Section 4 describes a generalization to the case of symmetric matrices $A$ that is directly related to the work of Erwin Bolthausen [Bol09] .

It is important to mention that the algorithms (1.1) and (1.2) are completely different from gaussian belief propagation (BP). The gaussian assumption refers indeed to the distribution of the matrix entries, not to the variables to be inferred. More generally, none of the existing rigorous results for BP seem to be applicable here.

It is remarkable that density evolution (in its special incarnation, SE) holds for dense graphs. This is at odds with the standard argument used for justifying density evolution so far: ‘density evolution works because the graph is locally tree-like.’ To the best of our knowledge, the approach developed here is the first one that overcomes the limitations of the standard argument (a discussion of earlier literature is provided in Section 1.4).

We begin with some missing definitions for algorithm (1.1). We assume

with $X_{0}\sim p_{X_{0}}$ and $Z\sim{\sf N}(0,1)$ independent from $X_{0}$ . We will use the term state evolution to refer both to the recursion (1.4) (or its more general version introduced in Section 3.2) and to the sequence $\{\tau_{t}\}_{t\geq 0}$ that it defines.

$\phi$ is locally Lipschitz, that is for any $M>0$ there exist a constant $L_{M,m}<\infty$ such that for all $x,y\in[-M,M]^{m}$ ,

Further, $L_{M,m}\leq c[1+(M\sqrt{m})^{k-1}]$ for some constant $c$ .

In the following we shall use generically $L$ for Lipschitz constants entering bounds of this type. It is understood (and will not be mentioned explicitly) that the constant must be properly adjusted at various passages.

with $X_{0}\sim p_{X_{0}}$ and $Z\sim{\sf N}(0,1)$ independent.

Up to a trivial change of variables, this is a formalization of the findings of [DMM09] (cf. in particular Eqs. (7), (8) and Finding 2 in that paper).

As an immediate consequence of the above theorem we have the following decoupling principle implying that a typical (finite) subset of the coordinates of $x^{t}$ are asymptotically independent.

For the proof of this corollary we refer to Section 3.10.

2 Universality

Our proof technique heavily relies on the assumption that $A(N)$ is gaussian. Nevertheless, we expect the convergence expressed in Theorem 1 to be a fairly general result. In particular, we expect it to hold for matrices with i.i.d. entries with zero mean and variance $1/n$ , under a suitable moment condition. This type of universality is quite common in random matrix theory, and several arguments suggest that it should hold in the present case. For instance, it is possible to prove that state evolution holds for this broader class of random matrices when $\eta_{t}(\,\cdot\,)$ is affine. Also, the heuristic argument discussed in the next section is clearly insensitive to the details of distribution of the entries.

Numerical evidence presented in [DMM09] (we refer in particular to the online supplement) suggests that state evolution might hold for an even broader class of matrices. Determining the domain of such an universality class is an outstanding open problem.

3 State evolution: the basic intuition

The state evolution recursion has a simple heuristic description, that is useful to present here since it clarifies the difficulties involved in the proof. In particular, this description brings up the key role played by the last term in the update equation for $z^{t}$ , that we will call the ‘Onsager term’, following [DMM09].

Consider again the recursion (1.1), but introduce the following three modifications: $(i)$ Replace the random matrix $A$ with a new independent copy $A(t)$ at each iteration $t$ ; $(ii)$ Correspondingly replace the observation vector $y$ with $y^{t}=A(t)x_{0}+w$ ; $(iii)$ Eliminate the last term in the update equation for $z^{t}$ . We thus get the following dynamics:

where $A(0),A(1),A(2),\dots$ are i.i.d. matrices of dimensions $n\times N$ with i.i.d. entries $A_{ij}(t)\sim{\sf N}(0,1/n)$ . (Notice that, unlike in the rest of the paper, we use here the argument of $A$ to denote the iteration number, and not the matrix dimensions.)

This recursion is most conveniently written by eliminating $z^{t}$ :

Using the central limit theorem, it is easy to show that each entry of $B(t)$ is approximately normal, with zero mean and variance $1/n$ . Further, distinct entries are approximately pairwise independent. Therefore, if we let $\widehat{\tau}_{t}^{2}=\lim_{N\to\infty}\|x^{t}-x_{0}\|^{2}/N$ , we obtain that $B(t)(x^{t}-x_{0})$ converges to a vector with i.i.d. normal entries with mean and variance $N\widehat{\tau}_{t}^{2}/n=\widehat{\tau}_{t}^{2}/\delta$ . Notice that this is true because $A(t)$ is independent of $\{A(s)\}_{1\leq s\leq t-1}$ and, in particular, of $(x^{t}-x^{0})$ .

Conditional on $w$ , $A(t)^{*}w$ is a vector of i.i.d. normal entries with mean and variance $(1/n)\|w\|^{2}$ which converges by the law of large numbers to $\sigma^{2}$ . A slightly longer exercise shows that these entries are approximately independent from the ones of $B(t)(x^{t}-x_{0})$ . Summarizing, each entry of the vector in the argument of $\eta_{t}$ in Eq. (1.10) converges to $X_{0}+\tau_{t}Z$ with $Z\sim{\sf N}(0,1)$ independent of $X_{0}$ , and

On the other hand, by Eq. (1.10), each entry of $x^{t+1}-x_{0}$ converges to $\eta_{t}(X_{0}+\tau_{t}\,Z)-X_{0}$ , and therefore

Using together Eq. (1.11) and (1.12) we finally obtain the state evolution recursion, Eq. (1.4).

We conclude that state evolution would hold if the matrix $A$ was drawn independently from the same gaussian distribution at each iteration. In the case of interest, $A$ does not change across iterations, and the above argument falls apart because $x^{t}$ and $A$ are dependent. This dependency is non-negligible even in the large system limit $N\to\infty$ . This point can be clarified by considering the iteration

with a matrix $A$ constant across iterations. This iteration is the basis of several algorithms in compressed sensing, most notably the so-called ‘iterative soft thresholding’ [DDM04]. Such algorithms have been the object of great interest because of the high computational cost of standard convex optimization methods in large scale applications.

Numerical studies of iterative soft thresholding [DM09, DMM09] show that its behavior is dramatically different from the one in Eq. (1.1) and in particular state evolution does not hold for the iterative soft thresholding iteration (1.13), even in the large system limit.

This is not a surprise: the correlations between $A$ and $x^{t}$ simply cannot be neglected. On the other hand, adding the Onsager term leads to an asymptotic cancelation of these correlations. As a consequence, state evolution holds for the AMP iteration (1.1) despite the fact that the matrix is kept constant.

4 Related literature

As mentioned, the standard argument for justifying density evolution relies on the locally-tree like structure of the underlying graph. This argument was developed and systematically exploited for the analysis of low-density parity-check (LDPC) codes under iterative decoding [RU08]. In this context, density evolution provides an exact tool for computing asymptotic thresholds of code ensembles based on sparse graph constructions. Optimization of these thresholds has been a major design principle in LDPC codes.

The locally tree-like property is a special case of local weak convergence. Local weak convergence of graph sequences was first defined and studied in probability theory by Benjamini and Schramm [BS96], and then greatly developed by David Aldous [AS03], in particular to study the so called ‘random assignment problem’ [Ald01]. Loosely speaking, local weak convergence allows to treat sequences of graphs of increasing size, such that the neighborhood of a node converges to a well defined limit object.

The random assignment problem is defined as a distribution of random instances of the assignment problem on complete bipartite graphs. In particular, such graphs are not locally tree-like. Nevertheless, they admit a rather simple local weak limit (called the PWIT), which is a tree. The basic reason is that only a sparse subgraph of the complete bipartite graph is relevant for the minimum cost assignment, namely the one of edges with small cost. One concrete way to derive density evolution in this case is indeed to eliminate all the edges of cost larger than –say– $\Delta_{n}/n$ with $\Delta_{n}$ diverging slowly with the graph size $n$ . The resulting graph is sparse and one can apply standard arguments (cf. [MM09] for an outline of this argument). A more sophisticated argument was presented in [SS09] which nevertheless uses the existence of a non-trivial local weak limit, and the fact that only a sparse subgraph is relevant (Lemma 4.1 in [SS09]).

This reduction to a sparse graph, and hence to a limit tree, is impossible in the class of algorithms studied in our paper: the algorithm iteration cannot be approximated by an iteration on a sparse graph (at least not on an instance-by-instance basis). This corresponds to the fact that no (simple) local weak limit exists in our case. The underlying graph is the complete bipartite graph with vertex sets $[N]\equiv\{1,2,\ldots,N\}$ and $[n]\equiv\{1,2,\ldots,n\}$ , and edge-weights $A_{ai}$ for all $(a,i)\in[n]\times[N]$ . If we choose a node $i\in[N]$ as root, its depth- $1$ neighborhood consists of $[n]$ node, each carrying a weight of order $1/\sqrt{n}$ . Even this small neighborhood has no simple local weak limit.

This difference is analogous to the difference between mean-field spin glasses (e.g. the Sherrington-Kirkpatrick model) and the random assignment problem [Tal10, Chapter 7]. As a consequence, our proof does not rely on local weak convergence, and has to deal directly with the intricacies of graphs with many short cycles.

The theorem proved in this paper is not only relevant for [DMM09] but for a larger context as well. First of all, following the work by Tanaka [Tan02], hundreds of papers have been published in information theory using the replica method to study multi-user detection problems. In its replica-symmetric version, the replica method typically predicts the system performances through the solution of a system of non-linear equations, which coincide with the fixed point equations for state evolution. The present result provides a rigorous foundation to that line of work, along with the analysis of a concrete algorithm that achieves those performance. Further, [GV05] insisted on the role of a ‘decoupling principle’ that emerges from the replica method, and on the insight it provides. Corollary 1 indeed proves a specific form of this decoupling principle.

A more recent line of works uses the replica method to study typical performances of compressed sensing methods. Although non-rigorous and limited to asymptotic statements, the replica method has the advantage of providing sharp predictions. Standard techniques instead predict performances up to undetermined multiplicative constants. The determination of these constants can be of guidance for practical applications. This motivated several groups to publish results based on the replica method [RFG09, KWT09, GBS09]. The present paper provides a rigorous foundation to this work as well.

Examples

In this section we discuss in greater detail some of the applications of Theorem 1 to specific problems. To be definite, it is convenient to keep in mind a specific observable for applying Theorem 1. If we choose the test function $\psi(x,y)=(x-y)^{2}$ , we get almost surely

As a warm-up example consider the case in which the a priori distribution of $x_{0}$ is gaussian, namely its entries are i.i.d. ${\sf N}(0,v^{2})$ . It is a consequence of state evolution that the optimal AMP algorithm makes use of linear scalar estimators

Clearly, such functions are Lipschitz continuous, for any $\lambda_{t}$ finite. The AMP algorithm (1.1) becomes

Theorem 1 also shows that the empirical distribution of $\{(A^{*}z^{t}+x^{t})_{i}-x_{0,i}\}_{i\in[N]}$ is asymptotically gaussian with mean and variance $\tau_{t}^{2}$ . Hence, the optimal choice of $\lambda_{t}$ is

Notice that this also minimizes the right hand side of Eq. (2.5). Under this choice, the recursion (2.5) yields

The right hand side is a concave function of $\tau_{t}^{2}$ , and is easy to show that $\tau_{t}\to\tau_{\infty}$ exponentially fast, where, for $c=(1-\delta)/\delta$ ,

The mean square error of the resulting algorithm is estimated via Eq. (2.1). In particular, under the optimal choice of $\lambda_{t}$ , the latter converges to $(\tau_{\infty}^{2}-\sigma^{2})\delta$ with $\tau_{\infty}$ given as above, thus yielding

We recall that the asymptotic mean square error of optimal (MMSE) linear estimation was computed by Tse-Hanly and Verdú-Shamai in the case of random matrices $A$ with i.i.d. entries [TH99, VS99]. The motivation came from the analysis of multiuser receivers. The resulting MSE coincides with the value predicted in Eq. (2.9), thus showing that –in the linear case– the AMP algorithm is asymptotically equivalent to the MMSE estimator.

Notice that the computation of the MMSE in [TH99, VS99] relied heavily on the Marcenko-Pastur law for the limit spectral law of Wishart matrices [MP67]. Conversely, any calculation of the MMSE as a function of the noise variance $\sigma^{2}$ gives access to the asymptotic Stieltjis transform of the spectral measure of $A$ . This suggests that state evolution is a non-trivial result already in the case of linear $\eta_{t}(\,\cdot\,)$ , since it can be used to derive the Marcenko-Pastur law in random matrix theory.

2 Compressed sensing via soft thresholding

(Indeed Theorem 1 accommodates for a more general behavior, since $\hat{p}_{x_{0}(N)}$ is only required to converge weakly.)

In [DMM09], the authors proposed an algorithm of the form (1.1) with $\eta_{t}(x)=\eta(x;\theta_{t})$ a sequence of soft-threshold functions

The function $x\mapsto\eta(x;\theta)$ is non-linear but nevertheless it is Lipschitz continuous. Therefore Theorem 1 applies to this case, and allows to predict the asymptotic mean square error using Eqs. (1.4) and (1.6).

This choice of the nonlinearity $\eta_{t}$ is close to the optimal in minimax sense. Indeed, a substantial literature (see e.g. [DJ94b, DJ94a]) studies the problem of estimating the scalar $X_{0}$ from the noisy observation

with $Z\sim{\sf N}(0,s^{2})$ . For an appropriate choice of the threshold $\theta=\theta(\varepsilon,s)$ , and $\varepsilon\downarrow 0$ (very sparse sources), the soft thresholding estimator was proved to be minimax optimal, i.e. to achieve the minimum worst-case MSE over the class (2.10). State evolution allows to deduce that the choice (2.14) yields the best algorithm of the form (1.1) for estimating sparse vectors, over the worst-case vector $x_{0}$ [DMM09].

It is argued in [DMM09, DMM10c], and proved in [BM10] in the case of gaussian matrices, that the asymptotic MSE of AMP coincides with the one of a popular convex optimization estimation technique, known as the LASSO. The above argument is suggestive of a possible way to prove minimax optimality of the LASSO.

Finally, state evolution provides a systematic way of improving the choice of the non-linearities $\eta_{t}$ when the class of signal changes. The basic idea is to choose the function $\eta_{t}$ that minimizes the right-hand side of Eq. (1.4) in minimax sense. This corresponds to constructing minimax MMSE estimators for the scalar problem (2.15). For instance, the limit case in which the distribution of $X_{0}$ is known, the MMSE estimator is simply conditional expectation, which leads to the choice

with $Z\sim{\sf N}(0,1)$ . In other words, the very choice of the non-linearities is dictated by the gaussian convergence phenomenon described in Theorem 1.

3 Multi-User Detection

The model (1.3) is used to describe the input-output relation in code division multiple access (CDMA) channel. The matrix $A$ contains the users’ signatures. A frequently used setting for theoretical analysis consists in taking the large system limit with $n/N\to\delta$ giving the spreading factor, and in assuming that the signatures (and hence $A$ ) have i.i.d. components. The entries $x_{0,i}$ belong to the signal constellation used by the system. For the sake of simplicity, we consider the case of antipodal signaling, i.e. $x_{0,i}\in\{+1,-1\}$ uniformly at random. Other signal constellations can also be treated applying our Theorem 1. The hypothesis that $x_{0}$ is independent from $A$ is also standard in this context and justified by the remark that the transmitted information is independent from the signatures. Further, the source-channel separation theorem naturally leads to the uniform distribution.

The rationale for this choice is that it gives the conditional expectation of a uniformly random signal $X_{0}\in\{+1,-1\}$ , given the observation $X_{0}+\tau_{t}\,Z=x$ for $Z\sim{\sf N}(0,1)$ gaussian noise. This is therefore a special case of the rule (2.16) and by the argument given there, it achieves minimal mean-square error within the class of algorithms (1.1).

This state evolution recursion was proved in [MT06] for properly chosen sparse signature matrices $A$ . Theorem 1 provides the first generalization to the more relevant case of dense signatures.

As mentioned in Section 1.4, Tanaka used the replica method to compute the asymptotic performance of a MMSE receiver. The expressions obtained through this method correspond to a fixed point of the recursion (2.19). It was further proved in [MT06] that, whenever the fixed point is unique, this prediction is asymptotically correct. For such values of the parameters, we deduce that the AMP algorithm is asymptotically equivalent to the MMSE receiver.

Let us point out that, in a practical setting, it might be inconvenient to estimate the noise variance and/or to change the function $\eta_{t}$ across iterations. Several authors (see for instance [Tan05]) used the function

State evolution can be applied in this case as well (for any finite $\beta$ ) and reads

On the other hand the case $\beta\to\infty$ is not covered by our Theorem 1, since it corresponds to the discontinuous function $\eta_{t}(x)={\rm sign}(x)$ .

Proof

The proof is based on a conditioning technique developed by Erwin Bolthausen for the analysis of the so-called TAP equations in spin glass theory [Bol09]. Related ideas can also be found in [Don06].

In the next section, we provide a high-level description of the conditioning technique, by using a simpler type of recursion as reference. We will then introduce some new notations and state and prove a more general result than Theorem 1.

Now consider the $t^{\rm th}$ iteration (i.e., $h^{t+1}=Gm^{t}-\lambda_{t}m^{t-1}$ ). The problem in repeating the above argument is that $G$ and $f(m^{t})$ are dependent. For instance $f(m^{t})$ might a priori align with the minimum eigenvector of $G$ . More generally the problem is that $G$ is not independent from the $\sigma$ -algebra $\mathfrak{S}_{t}$ generated by $\{h^{0},h^{1},\dots,h^{t}\}$ .

The key idea in the conditioning technique is to avoid computing the conditional distribution of $m^{t}$ given $G$ . We instead compute the conditional distribution of $G$ given $\mathfrak{S}_{t}$ .

The next important remark is that $\mathfrak{S}_{t}$ contains $\{m^{0},m^{1},\dots,m^{t}\}$ as well. Conditioning on $\mathfrak{S}_{t}$ is therefore equivalent to conditioning on the event

which is in turn equivalent to making a set of linear observations of $G$ .

At this point, the assumption that $G$ is gaussian plays a crucial role. The conditional distribution of a gaussian random variable $G$ given linear observations is the same as its conditional expectation plus the projection of an independent gaussian. In formulae:

We refer to the actual proof for a calculation of the various terms involved.

Each of the above terms can be written explicitly as a function of the observed values $\{m^{0},m^{1},\dots,m^{t}\}$ and of the new gaussian random variables $G^{\rm new}$ . The first term $G^{\rm new}({\rm P}^{t}_{\perp}m^{t})$ is clearly gaussian. The other terms are not. In order to control them, we will proceed by induction over $t$ and use an appropriate strong law of large numbers for triangular arrays. The key phenomenon is that the only non-gaussian term that does not vanish in the large system limit cancels with the term $-\lambda_{t}m^{t-1}$ in recursion (3.1), thus implying the claimed gaussianity of $h^{t+1}$ .

2 A general result

We describe now a more general recursion than in Eq. (1.1). In the next section we show that the AMP algorithm (1.1) can be regarded as a special case of the recursion defined here.

where $\xi_{t}=\langle g^{\prime}_{t}(b^{t},w)\rangle$ , $\lambda_{t}=\frac{1}{\delta}\langle f^{\prime}_{t}(h^{t},x^{0})\rangle$ (both derivatives are with respect to the first argument), and we recall that –by definition– $m^{-1}=0$ .

exists, is positive and finite, for a sequence of initial conditions of increasing dimensions. State evolution defines quantities $\{\tau_{t}^{2}\}_{t\geq 0}$ and $\{\sigma_{t}^{2}\}_{t\geq 0}$ via

where $W\sim p_{W}$ and $X_{0}\sim p_{X_{0}}$ are independent of $Z\sim{\sf N}(0,1)$ . Further, recall the notion of pseudo-Lipschitz function for $k>1$ from Section 1.1. We have the following general result.

where $X_{0}\sim p_{X_{0}}$ and $W\sim p_{W}$ are independent of $Z\sim{\sf N}(0,1)$ , and $\sigma_{t}$ , $\tau_{t}$ are determined by recursion (3.5).

3 Corollary of Theorem 2: AMP and Theorem 1

As already mentioned, the AMP algorithm (1.1) is a special case of recursion (3.3). The reduction is obtained by defining

The functions $f_{t}$ and $g_{t}$ are given by

and the initial condition is $q^{0}=-x_{0}$ .

Although the recursions (1.1) and (3.3) are equivalent mathematically, only the former can be used as an algorithm. Indeed the recursion (3.3) tracks the difference of the current estimates $x^{t}$ from $x_{0}$ , and is initialized using $x^{0}$ itself. The recursion (3.3) is only relevant for mathematical analysis.

Due to symmetry, for each $t$ , all coordinates of the vector $h^{t}$ have the same distribution (similarly for $b^{t}$ , $q^{t}$ and $m^{t}$ ).

4 Proof of Theorem 1

and $\tau_{0}^{2}=\sigma^{2}+\sigma_{0}^{2}$ . Also, by definition, $x^{t+1}=\eta_{t}(A^{*}b^{t}+x^{t})=\eta_{t}(x_{0}-h^{t+1})$ . Therefore, applying Theorem 2 to the function $(h^{t}_{i},x_{0,i})\mapsto\psi(\eta_{t-1}(x_{0,i}-h^{t}_{i}),x_{0,i})$ we obtain almost surely

5 Definitions and notations

When the update equation for $h^{t+1}$ in (3.3) is used, all values of $b^{0},\ldots,b^{t}$ , $m^{0},\ldots,m^{t}$ , $h^{1},\ldots,h^{t}$ and $q^{0},\ldots,q^{t}$ have been previously calculated. Hence, we can consider the distribution of $h^{t+1}$ conditioned on all these known variables and also conditioned on $x_{0}$ and $w$ . In particular, define $\mathfrak{S}_{t_{1},t_{2}}$ to be the $\sigma$ -algebra generated by $b^{0},\ldots,b^{t_{1}-1}$ , $m^{0},\ldots,m^{t_{1}-1}$ , $h^{1},\ldots,h^{t_{2}}$ , $q^{0},\ldots,q^{t_{2}}$ and $x_{0}$ and $w$ . The basic idea of the proof is to compute the conditional distributions $b^{t}|_{\mathfrak{S}_{t,t}}$ and $h^{t+1}|_{\mathfrak{S}_{t+1,t}}$ . This is done by characterizing the conditional distribution of the matrix $A$ given this filtration.

Regarding $h^{t}$ and $b^{t}$ as column vectors, the equations for $b^{0},\ldots,b^{t-1}$ and $h^{1},\ldots,h^{t}$ can be written in matrix form as:

or in short $X_{t}=A^{*}M_{t}$ and $Y_{t}=AQ_{t}$ . Here and below we use vertical lines to indicate columns of a matrix, i.e. $[a_{1}|a_{2}|\dots|a_{k}]$ is the matrix with columns $a_{1},\dots,a_{k}$ .

We will show in Section 3.9 (cf. Corollary 2) that for any fixed $t$ as $N$ goes to infinity the quantities $\beta_{i}$ ’s and $\alpha_{j}$ ’s have a finite limit.

6 Main technical Lemma

We prove the following more general result.

where $(Z_{0},\ldots,Z_{t})$ and $(\hat{Z}_{0},\ldots,\hat{Z}_{t})$ are two zero-mean gaussian vectors independent of $X_{0}$ , $W$ , with $Z_{i},\hat{Z}_{i}\sim{\sf N}(0,1)$ .

For all $0\leq r,s\leq t$ the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

Here $\varphi^{\prime}$ denotes derivative with respect to the first coordinate of $\varphi$ .

For all $0\leq r\leq t$ and $0\leq s\leq t-1$ the following limits exist, and there exist strictly positive constants $\rho_{r}$ and $\varsigma_{s}$ (independent of $N$ , $n$ ) such that almost surely

Equations (3.20) and (3.21) have the form of Stein’s lemma [Ste72] (cf. Lemma 4 in Section 3.8).

Assuming Lemma 1 is correct Theorem 2 easily follows. To be more precise, Theorem 2 is a obtained by applying Lemma 1(b) to functions $\phi_{h}(y_{0},\ldots,y_{t},x_{0,i})=\psi(y_{t},x_{0,i})$ and $\phi_{b}(y_{0},\ldots,y_{t},w_{i})=\psi(y_{t},w_{i})$ .

The rest of Section 3 focuses on proof of Lemma 1.

7 Useful probability facts

Before embarking in the actual proof, it is convenient to summarize a few facts that will be used repeatedly.

We will use the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].

Theorem 2.1 in [HT97] is stronger than what we state here. In Appendix B we show how Theorem 3 follows from it.

Next, we present an algebraic inequality that will be used in conjunction with Theorem 3. Its proof is provided in Appendix C

Let $u_{1},\ldots,u_{n}$ be a sequence of non-negative numbers. Then for all $\varepsilon>0$ the following holds

Next, we present a standard property of Gaussian matrices without proof.

We will apply the following law of large numbers to the sequence $\{x_{0}(N),w(N)\}_{N}$ . Its proof can be found in Appendix D.1.

Next lemma is on reak convergence of Lipschitz functions and its proof is in Appendix D.2.

It is useful to remember a standard formula for the conditional variance of gaussian random variables.

Let $(Z_{1},\ldots,Z_{t})$ be a normal random vector with zero mean, and assume that the covariance matrix of $(Z_{1},\ldots,Z_{t-1})$ (denoted by $C$ ) is invertible. Then

An immediate consequence is the following fact, proven in Appendix D.3.

It is also convenient to recall some linear algebra facts. The first one is proved in Appendix D.4.

The second one is just a direct consequence of the fact that the mapping $S\mapsto\lambda_{\rm min}(S)$ is continuous at any matrix $S$ that is invertible.

Let $\{S_{n}\}_{n\geq 1}$ be a sequence of $t\times t$ matrices such that $\lim\inf_{n\to\infty}\lambda_{\min}(S_{n})>c$ for a positive constant $c$ . Also assume that $\lim_{n\to\infty}S_{n}=S_{\infty}$ where the limit is element-wise. Then, $\lambda_{\min}(S_{\infty})\geq c$ .

8 Conditional distributions

In order to calculate $b^{t}|_{\mathfrak{S}_{t,t}}$ and $h^{t+1}|_{\mathfrak{S}_{t+1,t}}$ we will characterize the conditional distributions $A|_{\mathfrak{S}_{t,t}}$ and $A|_{\mathfrak{S}_{t+1,t}}$ .

For $(t_{1},t_{2})=(t,t)$ or $(t_{1},t_{2})=(t+1,t)$ , the conditional distribution of the random matrix $A$ given the $\sigma$ -algebra ${\mathfrak{S}_{t_{1},t_{2}}}$ , satisfies

Here $P_{M_{t_{2}}}^{\perp}={\sf I}-P_{M_{t_{2}}}$ , $P_{Q_{t_{1}}}^{\perp}={\sf I}-P_{Q_{t_{1}}}$ , and $P_{Q_{t_{1}}}$ , $P_{M_{t_{2}}}$ are orthogonal projector onto column spaces of $Q_{t_{1}}$ and $M_{t_{2}}$ respectively.

Write $m^{t}=m^{t}_{\parallel}+m^{t}_{\perp}$ . Using (3.30) and the fact that $M_{t}^{*}m^{t}_{\perp}=0$ , we obtain

Similarly, writing $q^{t}=q^{t}_{\parallel}+q^{t}_{\perp}$ , $q^{t}_{\parallel}=Q_{t}\vec{\beta}$ , and using $X_{t}^{*}Q_{t}=M_{t}^{*}AQ_{t}=M_{t}^{*}Y_{t}$ , $Q_{t}^{*}q^{t}_{\perp}=0$ we obtain (3.32). ∎

where $\|A\|_{F}$ denotes the Frobenius norm of matrix $A$ . We use Lagrange multipliers method to obtain this minimum. Consider the Lagrangian

$\mathcal{F}\circ\mathcal{F}=\mathcal{F}$ .

$\mathcal{F}(A)\in V=\{A|AQ_{t_{1}}=0,A^{*}M_{t_{2}}=0\}$ .

$\mathcal{F}$ is symmetric. That is for all matrices $A,B$ : $(\mathcal{F}(A),B)=(A,\mathcal{F}(B))$ .

(b) is correct since by definition of $\mathcal{F}(A)Q=P_{M}^{\perp}AP_{Q}^{\perp}Q=0$ and similarly $\mathcal{F}(A)^{*}M=0$ .

and each of the last three term vanishes either because $AQ=0$ or because $A^{*}M=0$ .

9 Proof of Lemma 1

The proof is by induction on $t$ . Let $\mathcal{H}_{t+1}$ be the property that (3.14), (3.16), (3.18), (3.20), (3.22), (3.24), and (3.25) hold. Similarly, let $\mathcal{B}_{t}$ be the property that (3.15), (3.17), (3.19), (3.21), (3.23) and (3.26) hold. The inductive proof consists of the following four main steps.

If $\mathcal{B}_{r}$ , $\mathcal{H}_{s}$ hold for all $r<t$ and $s\leq t$ then $\mathcal{B}_{t}$ holds.

If $\mathcal{B}_{r}$ , $\mathcal{H}_{s}$ hold for all $r\leq t$ and $s\leq t$ then $\mathcal{H}_{t+1}$ holds.

For each of these steps we will have to prove several properties that we will denote by $(a)$ , $(b)$ , $(c)$ , $(d)$ , $(e)$ and $(g)$ according to their appearance in Lemma 1. For $\mathcal{H}$ we also need to prove a property $(f)$ .

It is immediate to check that our claims become trivial if $x\mapsto f_{t}(x,X_{0})$ is constant (i.e. independent of $x$ ) almost surely (with respect to $X_{0}\sim p_{X_{0}}$ ), or if $x\mapsto g_{t}(x,W)$ is constant almost surely (with respect to $W\sim p_{W}$ ). We will therefore assume that neither of these degenerate cases hold.

$\mathfrak{S}_{0,0}$ is generated by $x_{0}$ , $q^{0}$ and $w$ . Also $q^{0}=q^{0}_{\perp}$ since $Q_{0}$ is an empty matrix. Hence

The last inequality uses assumption on empirical moments of $w$ . Therefore, we can apply Theorem 3 to get

This case is trivial since there is not $m^{s}$ with $s\leq t-1=-1$ .

Note that $h^{1}=A^{*}m^{0}-\xi_{0}q^{0}$ .

$\mathfrak{S}_{1,0}$ is generated by $x_{0}$ , $q^{0}$ , $w$ , $b^{0}$ and $m^{0}$ . Also $m^{0}=m^{0}_{\perp}$ since $M_{0}$ is an empty matrix. Applying Lemma 11 we have

But using $\mathcal{B}_{0}(d)$ for $\varphi=g_{0}$

Also $\mathcal{B}_{0}(b)$ , applied to the function $\phi_{b}(x,w)=g_{0}(x,w)^{2}$ gives

where the last estimate follows from Lemma 3(c) and (3.35). Finally,

Using (3.36), (3.35), and Lemma 3, we get

First note that, conditioning on $\mathfrak{S}_{1,0}$ ,

Using (3.36) and Lemma 3(a) we have almost surely

Hence, we only need to show $\frac{1}{N}\sum_{i=1}^{N}\|a_{i}\|^{2k-2}<\infty$ and $\frac{1}{N}\sum_{i=1}^{N}\|c_{i}\|^{2k-2}<\infty$ as $N\to\infty$ . But

which is bounded using part $(e)$ and the original assumption on $x_{0}$ . Similarly, using $\frac{1}{N}\sum_{i=1}^{N}\|q_{i}^{0}\|^{2k-2}<\infty$ , we obtain $\frac{1}{N}\sum_{i=1}^{N}\|c_{i}\|^{2k-2}<\infty$ .

The last equality used $\mathcal{B}_{0}(c)$ .

Since $t=0$ , and $q^{0}=q^{0}_{\perp}$ then the result follows from (3.4) and that $\sigma_{0}^{2}>0$ .

This part is analogous to step 1 albeit more complex.

Note that using induction hypothesis $\mathcal{B}_{t-1}(b)$ for $\phi_{b}(b^{r}_{i},b^{s}_{i},w_{i})=g_{r}(b^{r}_{i},w_{i})g_{s}(b^{s}_{i},w_{i})$ , $0\leq r,s\leq t-1$ we have almost surely

But using induction hypothesis, we have $\lim_{n\to\infty}\langle m^{r}_{\perp},m^{r}_{\perp}\rangle>\varsigma_{r}>0$ for all $r<t-1$ . So using Lemma 9, for large enough $n$ the smallest eigenvalue of matrix $M_{t-1}^{*}M_{t-1}/n$ is larger than a positive constant $c^{\prime}$ that is independent of $n$ . Hence, by Lemma 10 its inverse converges to an invertible limit. Thus, Eqs. (3.37) and (3.38) lead to

Now the result follows from Lemma 8 provided that we show for gaussian random variables $\sigma_{0}\hat{Z}_{0},\ldots,\sigma_{t-1}\hat{Z}_{t-1}$ , all conditional variances ${\rm Var}[\sigma_{r}\hat{Z}_{r}|\,\sigma_{0}\hat{Z}_{0},\ldots,\sigma_{r-1}\hat{Z}_{r-1}]$ are strictly positive for $r=0,\ldots,t-1$ . To prove the latter first using the induction hypothesis $\mathcal{B}_{t-1}(b)$ , we have for all $0\leq r\leq t-1$

Similar as above we used the fact that for large enough $n$ the matrix $B_{r}^{*}B_{r}/n$ has a smallest eigenvalue greater than a positive constant to obtain the limit of its inverse. On the other hand using induction hypothesis $\mathcal{B}_{r}(c)$ we have almost surely

And now by induction hypothesis $\mathcal{H}_{r}(g)$ we have $\lim_{N\to\infty}\langle q^{r}_{\perp},q^{r}_{\perp}\rangle>\rho_{r}$ . Hence the result follows.

have finite limits as $n$ and $N$ converge to $\infty$ .

We can apply Lemma 9 to obtain that for large enough $n$ the smallest eigenvalue of $M_{t}^{*}M_{t}/n$ is larger than a positive constant $c^{\prime}$ . Hence by Lemma 10 its inverse has a finite limit. Similarly, we can apply induction hypothesis $\mathcal{H}_{t}(g)$ and Lemmas 9 and 10 to the matrix $Q_{t}^{*}Q_{t}/N$ . ∎

Recall definition of $Y_{t}$ and $X_{t}$ from Section 3.5.

where $\Xi_{t}=\textrm{diag}(\xi_{0},\ldots,\xi_{t-1})$ , $H_{t}=[h^{1}|\cdots|h^{t}]$ , $B_{t}=[b^{0}|\cdots|b^{t-1}]$ , and $\Lambda_{t}=\textrm{diag}(\lambda_{0},\ldots,\lambda_{t-1})$ .

Recall that $m^{t}_{\parallel}=M_{t}\vec{\alpha}$ and $q^{t}_{\parallel}=Q_{t}\vec{\beta}$ . On the other hand $Y_{t+1}^{*}m_{\perp}^{t}=B_{t+1}^{*}m_{\perp}^{t}$ because $M_{t}^{*}m_{\perp}^{t}=0$ . Similarly, $X_{t}^{*}q_{\perp}^{t}=H_{t}^{*}q_{\perp}^{t}$ . Hence we need to show

To simplify the notation denote the matrix $M_{t}^{*}M_{t}/n$ by $G$ . Therefore,

But using the induction hypothesis $\mathcal{H}_{t}(d)$ for $\varphi=f_{1},\dots,f_{t}$ , and $\mathcal{H}_{t}(f)$ , the term $\langle h^{r},q^{t}-\sum_{s=0}^{t-1}\beta_{s}q^{s}\rangle/\delta$ is almost surely equal to the limit of $\langle h^{r},h^{t}\rangle\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}\langle h^{r},h^{s}\rangle\lambda_{s}$ . This can be modified, using the induction hypothesis $\mathcal{H}_{t}(c)$ , to $\langle m^{r-1},m^{t-1}\rangle\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}\langle m^{r-1},m^{s-1}\rangle\lambda_{s}$ almost surely, which can be written as $G_{r,t}\lambda_{t}-\sum_{s=0}^{t-1}\beta_{s}G_{r,s}\lambda_{s}$ . Hence,

Notice that the above series of equalities hold because $G$ has, almost surely, a non-singular limit as $N\to\infty$ as shown in point $(g)$ above.

Equation (3.42) is proved analogously, using $\xi_{t}=\langle g^{\prime}(b^{t},w)\rangle$ . ∎

The proof of Eq. (3.15) follows immediately since the last lemma yields

Note that, using Lemma 3(c), as $n,N\to\infty$ ,

For $r,s<t$ we can use the induction hypothesis. For $s=t,r<t$ , we can apply Lemma 14 to $b^{t}$ (proved above), thus obtaining

Note that, by induction hypothesis $\mathcal{B}_{t-1}(d)$ applied to $\varphi=g_{t-1}$ , and using the bound $\mathcal{B}_{t-1}(e)$ to control $\langle b^{i},b^{r}\rangle$ , we deduce that each term $\langle m^{i},b^{r}\rangle$ has a finite limit. Thus,

where the last estimate uses the induction hypothesis $\mathcal{B}_{t-1}(c)$ and $\mathcal{H}_{t}(c)$ which imply, almost surely, for some constant $c$ , $\langle P_{M_{t}}^{\perp}b^{r},P_{M_{t}}^{\perp}b^{r}\rangle\leq\langle b^{r},b^{r}\rangle<c$ and $\langle q_{\perp}^{t},q_{\perp}^{t}\rangle\leq\langle q^{t},q^{t}\rangle<c$ for all $N$ large enough. Finally, using the induction hypothesis $\mathcal{B}_{t-1}(c)$ for each term of the form $\langle b^{i},b^{r}\rangle$ (noting that $i,r\leq t-1$ ) and Corollary 2 we have

The last line uses the definition of $\beta_{i}$ and $q_{\perp}^{t}\perp q^{r}$ .

For the case of $r=s=t$ , similarly, we have

The contribution of other terms is $o(1)$ because:

$\langle\sum_{i=0}^{t-1}\beta_{i}b^{i},M_{t}\vec{o}_{t}(1)\rangle=o(1)$ , using Corollary 2 and induction hypothesis $\mathcal{B}_{t-1}(d)$ for $\varphi=g_{j}$ .

The arguments at the last two points are completely analogous to the one carried out in the case $s=t$ , $r<t$ above.

Hence, from the induction hypothesis $\mathcal{B}_{t-1}(c)$ ,

Conditioning on $\mathfrak{S}_{t,t}$ and using Lemma 14 (proved at point $(a)$ above), almost surely,

that follows from the Lipschitz assumption on $g_{r}$ . Thus,

which has a finite limit almost surely, using the induction hypothesis $\mathcal{B}_{t-1}(e)$ and the assumption on $w$ .

where $Z_{0},\ldots,Z_{t-1}$ are iid with distribution ${\sf N}(0,1)$ . and $c_{0},\dots,c_{r}$ are allmost surely bounded for all $N$ large enough. Therefore, almost surely,

Now each term is finite using the same argument as in Eq. (3.44).

Therefore, using Cauchy-Schwartz inequality twice, we have

Hence for any fixed $t$ , (3.45) vanishes almost surely when $n$ goes to $\infty$ .

Now given, $b^{0},\ldots,b^{t-1}$ , consider the random variables

Hence we can use induction hypothesis $\mathcal{B}_{t-1}(b)$ and Corollary 2 for

where $Z$ is an independent ${\sf N}(0,1)$ random variable to show

Note that $\sum_{r=0}^{t-1}{\beta_{r}}\sigma_{r}Z_{r}+\gamma_{t}\,Z$ is gaussian. All that we need, is to show that the variance of this gaussian is $\sigma_{t}^{2}$ . But using a combination of (3.46) and (3.47) for the pseudo-Lipschitz function $\phi_{b}(y_{0},\ldots,y_{t},w_{i})=y_{t}^{2}$ ,

𝑡1\mathcal{H}_{t+1} Due to symmetry, proof of this step is very similar to the proof of step 3 and we present only some differences.

This part is very similar to the one of $\mathcal{B}_{t}(g)$ .

To prove Eq. (3.14) we use Lemma 14(a) as for $\mathcal{B}_{t}(a)$ to obtain

Now, using Lemma 3(c), as $n,N\to\infty$ ,

For $r,s<t$ we can use induction hypothesis. For $s=t,r<t$ , very similar to the proof of $\mathcal{B}_{t}(a)$ ,

Now, by induction hypothesis $\mathcal{H}_{t}(d)$ , for $\varphi=f$ , each term $\langle q^{i},h^{r+1}\rangle$ has a finite limit. Thus,

Where the last line uses the definition of $\alpha_{i}$ and $m_{\perp}^{t}\perp m^{r}$ .

This part is very similar to $\mathcal{B}_{t}(e)$ .

Using $\mathcal{H}_{t}(a)$ and Lemma 3(a) we have almost surely

But this limit is almost surely, using the induction hypothesis $\mathcal{H}_{r}(e)$ for $r<t$ and $\mathcal{B}_{t}(c)$ .

where $Z$ is an independent ${\sf N}(0,1)$ random variable, to show

Note that $\sum_{r=0}^{t-1}{\alpha_{r}}\tau_{r}Z_{r}+n^{-1/2}\|m_{\perp}^{t}\|Z$ is gaussian. All that we need, is to show that the variance of this gaussian is $\tau_{t}^{2}$ . But using combination of (3.49) and (3.50) for the pseudo-Lipschitz function $\phi_{h}(y_{0},\ldots,y_{t},x_{0,i})=y_{t}^{2}$ ,

On the other hand in part (c) we proved $\lim_{N\to\infty}\langle h^{t+1},h^{t+1}\rangle\stackrel{{\scriptstyle{\rm a.s.}}}{{=}}\lim_{N\to\infty}\langle g_{t}(b^{t},w),g_{t}(b^{t},w)\rangle$ .

10 Proof of Corollary 1

Symmetric Case

where $\lambda_{t}=\langle f^{\prime}(h^{t})\rangle$ . Now let $\tau_{1}^{2}=\lim_{N\to\infty}\langle m_{1},m_{1}\rangle$ , and define recursively for $t\geq 1$ ,

This theorem was proved by Bolthausen in the case $f(x)=\tanh(\beta x+h)$ and $\langle m^{1},m^{1}\rangle=\tau_{*}^{2}$ , for $\tau_{*}^{2}$ the fixed point of the recursion (4.2). The general proof is very similar to the one of Theorem 2, and exploits the same conditioning trick. We omit it to avoid repetitions.

When we are calculating $h^{t+1}$ , all values $h^{1},\ldots,h^{t}$ and hence $m^{1},\ldots,m^{t}$ are known to us. Denote the $\sigma$ -algebra generated by all of these random variables by $\mathfrak{U}_{t}$ . Moreover, use the following compact formulation for (4.1).

The analogue of Lemma 1 is the following.

where $Z_{1},\ldots,Z_{t}$ have ${\sf N}(0,1)$ distribution.

For all $1\leq r,s\leq t$ the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all $1\leq r,s\leq t$ , and for any Lipschitz continuous function $\varphi$ , the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all $0\leq r\leq t$ the following limit exists and there are positive constants $\rho_{r}$ (independent of $N$ ) such that almost surely

Acknowledgement

We are deeply indebted with Erwin Bolthausen, who first presented the conditioning technique during the EURANDOM workshop on ‘Order, disorder and double disorder’ in September 2009. We are also grateful to Dave Donoho and Arian Maleki, for a number of insightful exchanges.

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.

Appendix A AMP algorithm: An heuristic derivation

In this appendix we present an heuristic derivation of the AMP iteration (1.1) starting from the standard message passing formulation (1.2). Let us stress that such derivation is not relevant for the proof of our Theorem 1. Our objective is to help the reader develop an intuitive understanding of the AMP iteration. For further discussion of the connection with belief propagation we refer to [DMM10a, DMM10b].

Let us rewrite the message passing iteration for greater convenience of the reader

Notice that on the right-hand side of both equations the messages appears in sums of $\Theta(N)$ terms. Consider for instance the messages $\{z_{a\to i}^{t}\}_{i\in[N]}$ for a fixed node $a\in[n]$ . These depend on $i\in[N]$ only because the excluded term changes. It is therefore natural to guess that $z^{t}_{a\to i}=z^{t}_{a}+O(N^{-1/2})$ and $x^{t}_{i\to a}=x^{t}_{i}+O(n^{-1/2})$ , where $z^{t}_{a}$ only depends on the index $a$ (and not on $i$ ), and $x^{t}_{i}$ only depends on $i$ (and not on $a$ ).

A naïve approximation would consist in neglecting the $O(N^{-1/2})$ correction but this turns out to produce a non-vanishing error in the large- $N$ limit. We instead set

We will now drop the terms that are negligible without writing explicitly the error terms. First of all notice that single terms of the type $A_{ai}{\delta z}_{a\to i}^{t}$ are of order $1/N$ and can be safely neglected. Indeed ${\delta z}_{a\to i}=O(N^{-1/2})$ by our anzatz, and $A_{ai}=O(N^{-1/2})$ by definition. We get

We next expand the second equation to linear order in ${\delta x}_{i\to a}^{t}$ and ${\delta z}_{a\to i}^{t}$ :

Notice that the last term on the right hand side of the first equation is the only one dependent on $i$ , and we can therefore identify this term with ${\delta z}_{a\to i}^{t}$ . We obtain the decomposition

Analogously for the second equation we get

Substituting Eq. (A.4) in Eq. (A.5) to eliminate ${\delta z}_{b\to i}^{t}$ we get

and using the normalization of $A$ , we get $\sum_{b\in[n]}A_{bi}^{2}\to 1$ , whence

Analogously substituting Eq. (A.6) in (A.3), we get

Again, using the law of large numbers and the normalization of $A$ , we get

whence substituting in (A.9), we obtain the second equation in (1.1). This finishes our derivation.

Appendix B Strong law of large number for triangular arrays

The last inequality uses $\varrho<1$ which leads to $2k(1-\varrho/2)>1$ . Hence, condition (2.4) of [HT97] satisfies as well. Therefore $n^{-1}\sum_{i=1}^{n}X_{n,i}$ converges to almost surely.

Appendix C Proof of Lemma 2

Define $f(\beta)\equiv\frac{1}{\beta}\log\left(\sum_{i=1}^{n}e^{\beta\log u_{i}}\right)$ . Lemma 2 is equivalent to show that $f(1+\varepsilon)\leq f(1)$ . We prove that $f$ is a decreasing function for all $\beta>0$ . Note that

where $H(p)$ is the entropy of a probability distribution on $\{1,\ldots,n\}$ with $p_{i}=\frac{e^{\beta\log u_{i}}}{\sum_{s=1}^{n}e^{\beta\log u_{s}}}$ and is always non-negative. This finishes the proof.

Appendix D Proof of probability and linear algebra lemmas

In this Appendix we provide proofs of two probability lemmas stated in Section 3.7.

On the other hand, since $\psi$ is pseudo-Lipschitz with order $k$ we have $|\psi(x)|\leq L(1+|x|^{k})$ for $|x|\geq 1$ . Therefore for large enough $B$ ,

D.2 Proof of Lemma 6

Letting $Z_{n}(\omega)\equiv F^{\prime}(X_{n}(\omega),Y_{n}(\omega))$ (if $(X_{n}(\omega),Y_{n}(\omega))\not\in{\cal C}_{F}$ set $Z_{n}(\omega)=0$ ) and $Z(\omega)\equiv F^{\prime}(X(\omega),Y(\omega))$ , we thus proved that

D.3 Proof of Lemma 8

Let us denote by $Q$ the covariance of the gaussian vector $Z_{1},\ldots,Z_{t}$ . The set of matrices $Q$ satisfying the constraints with constants $c_{1},\dots,c_{t}$ , $K$ is compact. Hence if the thesis does not hold, there must exist a specific covariance matrix satisfying these constrains, and such that

If $t_{*}=\max\{i\in\{1,\dots,t\}\,:\,a^{\prime}_{i}\neq 0\}$ , this implies

which contradicts the assumption ${\rm Var}(Z_{t_{*}}|Z_{1},\dots,Z_{t_{*}-1})>0$ .

D.4 Proof of Lemma 9

We will prove the thesis by induction over $t$ . The case $t=1$ is trivial, and assume that the claim is true up for any $(t-1)$ vectors $v_{1},\dots,v_{t-1}$ , with constant $c^{\prime}_{t-1}$ . Without loss of generality, we will assume $\|v_{i}\|^{2}/n\leq K$ for some constant $K$ independent of $n$ (increasing the norm of the $v_{i}$ ’s increases $\lambda_{\rm min}(C)$ ).

Defining $u_{i}$ to be the columns of $U$ , Gram-Schmidt orthonormalization prescribes

Which implies $A_{ii}=\|v_{i}-P_{i-1}(v_{i})\|^{-1}\leq(cn)^{-1/2}$ and

It follows that $\sigma_{\rm max}(A)\leq c^{\prime\prime\prime}n^{-1/2}$ (with $c^{\prime\prime\prime}$ depending on $n$ ) whence the thesis follows by Eq. (D.4).