State Evolution for General Approximate Message Passing Algorithms, with Applications to Spatial Coupling

Adel Javanmard, Andrea Montanari

Introduction

Approximate message passing (AMP) algorithms [DMM09] apply ideas from graphical models (belief propagation [Pea88]) and statistical physics (mean field or TAP equations [MPV87, MM09]) to statistical estimation. In particular AMP applies to problems that do not admit a sparse graphical model description. An AMP algorithm takes the form

In this paper we focus on Gaussian matrices and consider a different type of generalization that was motivated by the following recent developments.

In information theory parlance, the vector (Ax)(Ax) is passed through a memoryless channel with transition probability p()p(\,\cdot\,|\,\cdot\,). From a statistics point of view, this corresponds to estimation of a generalized linear model [NW72, MN89]. The linear model (3) is recovered as the special case in which the channel is Gaussian or –more generally– the noise is purely additive. Rangan conjectured that suitable state evolution equations hold for G-AMP algorithms as well, without however providing a formal proof.

Bean, Bickel, El Karoui and Yu [BBEKY12] recently considered the problem of estimating the unknown vector xx in the linear model (3) using robust regression. They developed exact asymptotic expressions for the risk that are analogous to the one proved in [BM12] for the Lasso. The results of [BBEKY12] are, on the other hand, based on an heuristic derivation.

The proof in [BM12] was based on the state evolution analysis of a suitable AMP algorithm whose fixed points coincide with the Lasso optima. This is suggestive of a possible approach for proving the results of [BBEKY12]: define a suitable AMP algorithm for solving the robust regression problem, and analyze it through state evolution. Indeed a comparison of the formulae in [BBEKY12] with the state evolution formulae in [Ran11] appears encouraging.

In this paper we establish a rigorous generalization of state evolution that covers all of the above developments. Applications to generalized AMP are already discussed in [Ran11], and applications to spatially coupled sensing matrices can be found in [DJM11b] and Section 3. Finally, applications to robust regression are left for future study.

Remarkably, all of the above applications can be derived by treating the following generalization of the iteration (1), (2). (A formal definition is given in the next section.)

Our proof uses the technique of [BM11], which in turns build on an idea first introduced by Bolthausen [Bol12]. A convenient simplification with respect to [BM11] consists in studying a recursion in which the rectangular matrix AA is replaced by a symmetric matrix, and the algorithm state is described by a single vector.

In section 2 we put forward formal definitions and state our main result for the case of symmetric matrices. In section 3 we show how the case of rectangular matrices can be reduced to the symmetric one. We also show how our result applies to the case of compressed sensing reconstruction with spatially coupled matrices. Finally, we prove our main result in Section 4.

Main result

A symmetric AMP instance is a triple (A,F,x0)(A,{\cal F},x^{0}) where:

x0Vq,Nx^{0}\in{\cal V}_{q,N} is an initial condition.

Given F={fk:k[N]}{\cal F}=\{f^{k}:k\in[N]\}, we define f(;t):Vq,NVq,Nf(\,\cdot\,;t):{\cal V}_{q,N}\to{\cal V}_{q,N} by letting v=f(v;t)v^{\prime}=f(v;t) be given by vi=fi(vi;t)\mathbf{v}^{\prime}_{i}=f^{i}(\mathbf{v}_{i};t) for all i[N]i\in[N].

The approximate message passing orbit corresponding to the instance (A,F,x0)(A,{\cal F},x^{0}) is the sequence of vectors {xt}t0\{x^{t}\}_{t\geq 0}, xtVq,Nx^{t}\in{\cal V}_{q,N} defined as follows, for t0t\geq 0,

Here Bt:Vq,NVq,N{\sf B}_{t}:{\cal V}_{q,N}\to{\cal V}_{q,N} is the linear operator defined by letting, for v=Btvv^{\prime}={\sf B}_{t}v,

In order to establish the behavior of the sequence {xt}t0\{x^{t}\}_{t\geq 0} in the high dimensional limit, we need to consider a sequence of AMP instances {A(N),FN,x0,N}N0\{A(N),{\cal F}_{N},x^{0,N}\}_{N\geq 0} indexed by the dimension NN.

For each a[q]a\in[q], we have limNCaN/N=ca(0,1)\lim_{N\to\infty}|C^{N}_{a}|/N=c_{a}\in(0,1).

For each N0N\geq 0, each a[q]a\in[q] and each iCaNi\in C_{a}^{N}, we have fi(x,t)=g(x,yi,a,t)f^{i}(\mathbf{x},t)=g(\mathbf{x},\mathbf{y}_{i},a,t). Further, the empirical distribution of {yi}iCaN\{\mathbf{y}_{i}\}_{i\in C^{N}_{a}}, denoted by P^a\hat{P}_{a}, converges weakly to PaP_{a}.

An apparent generalization of the above definition would require the partition to be C1NC2NCqN=[N]C^{N}_{1}\cup C^{N}_{2}\cup\dots\cup C^{N}_{q^{\prime}}=[N], while xtVq,Nx^{t}\in{\cal V}_{q,N}, with qqq\neq q^{\prime}. It is easy to see that there is no loss of generality in assuming q=qq=q^{\prime} as we do in our definition. Indeed the case q<qq^{\prime}<q can be reduced to our setting by refining the partition arbitrarily, and q>qq^{\prime}>q by adding dummy coordinates to to the variables xi\mathbf{x}_{i}.

The function fi(,)f^{i}(\,\cdot\,,\,\cdot\,) depends implicitly on yi\mathbf{y}_{i}. However, the yi\mathbf{y}_{i}’s do not change across iterations and so we do not show this dependence explicitly in our notation.

for all a[q]a\in[q]. Here YaPaY_{a}\sim P_{a}, ZatN(0,Σt)Z_{a}^{t}\sim{\sf N}\left(0,\Sigma^{t}\right) and YaY_{a} and ZatZ_{a}^{t} are independent.

where ZatN(0,Σt)Z_{a}^{t}\sim{\sf N}(0,\Sigma^{t}) is independent of YaPaY_{a}\sim P_{a}.

AMP for rectangular and spatially-coupled matrices

In this section we develop two applications of our main theorem:

We show that AMP iterations with AA a rectangular matrix, see e.g. Eqs. (1), (2), can be recast in the form of an iteration with a symmetric matrix AA and are therefore covered by Theorem 1. This construction is provided in Section 3.5 (below Proposition 5).

We apply the general Theorem 1 to AMP reconstruction in compressed sensing with spatially coupled matrices. In [DJM11b], it was proved that, conditionally to a state evolution lemma, this approach achieves the information-theoretic limits of compressed sensing set forth in [WV10]. Here we show that our main result Theorem 1 implies the state evolution lemma (Lemma 4.14.1 in [DJM11b]).

We will let RLr|{\sf R}|\equiv L_{r} and CLc|{\sf C}|\equiv L_{c} denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by n=n0Lcn=n_{0}L_{c} and m=m0Lrm=m_{0}L_{r}.

In order to describe a random matrix AM(W,m0,n0)A\sim{\cal M}(W,m_{0},n_{0}) from this ensemble, partition the column and row indices of AA in –respectively– LcL_{c} and LrL_{r} groups of equal size. Explicitly

Further, if iRri\in R_{r} or jCsj\in C_{s} we will write, respectively, r=g(i)r={\sf g}(i) or s=g(j)s={\sf g}(j). In other words g(){\sf g}(\,\cdot\,) is the operator determining the group index of a given row or column.

With this notation we have the following concise definition of the ensemble.

A random sensing matrix AA is distributed according to the ensemble M(W,m0,n0){\cal M}(W,m_{0},n_{0}) (and we write AM(W,m0,n0)A\sim{\cal M}(W,m_{0},n_{0})) if the entries {Aij,    i[m],j[n]}\{A_{ij},\;\;i\in[m],j\in[n]\} are independent Gaussian random variables with

See Fig. 1 for a schematic of matrix AA. Note that the ensemble M(W,m0,n0){\cal M}(W,m_{0},n_{0}) includes, as special case, rectangular non-symmetric matrices with i.i.d. entries.

2 AMP for compressed sensing reconstruction

3 State evolution

For all t0t\geq 0, aRa\in{\sf R}, and iCi\in{\sf C}, let

Here and below, mmse(s){\sf mmse}(s) denotes the minimum mean square error in estimating XpXX\sim p_{X} from a noisy observation in Gaussian noise, at signal-to-noise ratio ss. Formally,

In the constructions for the matrix QtQ^{t}, the nonlinearities ηt\eta_{t}, and the vector bt{\sf b}^{t}, we use the fact that the state evolution sequence can be precomputed.

The nonlinearity ηt\eta_{t} is chosen as follows:

where ηt,i\eta_{t,i} is the conditional expectation estimator for XpXX\sim p_{X} in gaussian noise:

Finally, in order to define the vector bit{\sf b}^{t}_{i}, let us introduce the quantity (with ηt,i\eta^{\prime}_{t,i} denoting the derivative of viηt,i(vi)v_{i}\mapsto\eta_{t,i}(v_{i}))

The vector bt{\sf b}^{t} is then defined by

where we defined Qi,jt=Q~r,utQ^{t}_{i,j}=\widetilde{Q}^{t}_{r,u} for iRri\in R_{r}, jCuj\in C_{u}.

The following Lemma (Lemma 4.14.1 in [DJM11b]) claims that the state evolution (17) allows an exact asymptotic analysis of AMP algorithm (14)- (15) in the limit of a large number of dimensions.

5 Proof of Lemma 1

We show that Lemma 1 follows from Theorem 1. Consider the following change of variables:

Consider the following approximate message passing orbit with vectors {vt,ut}t0\{v^{t},u^{t}\}_{t\geq 0}, vtVq,nv^{t}\in{\cal V}_{q,n}, utVq,mu^{t}\in{\cal V}_{q,m}:

for given yVq,ny\in{\cal V}_{q,n} and wVq,mw\in{\cal V}_{q,m}. Here Bt:Vq,mVq,m{\sf B}_{t}:{\cal V}_{q,m}\to{\cal V}_{q,m} is the linear operator defined by letting, for z=Btzz^{\prime}={\sf B}_{t}z, and any i[m]i\in[m],

Analogously Dt:Vq,nVq,n{\sf D}_{t}:{\cal V}_{q,n}\to{\cal V}_{q,n} is the linear operator defined by letting, for z=Dtzz^{\prime}={\sf D}_{t}z, and any j[n]j\in[n],

Assume that y=(y1,,yn)y=(\mathbf{y}_{1},\dotsc,\mathbf{y}_{n}), w=(w1,,wm)w=(\mathbf{w}_{1},\dotsc,\mathbf{w}_{m}), and v1=(v11,,vn1)v^{1}=(\mathbf{v}^{1}_{1},\dotsc,\mathbf{v}^{1}_{n}) are given by

We refer to Section 3.5.1 for the proof of Proposition 5.

We proceed by constructing a suitable converging sequence of symmetric AMP instances, recognizing that a subset of the resulting orbit corresponds to the orbit {vt,ut}\{v^{t},u^{t}\} of interest. The converging symmetric AMP instances (As(N),g,xs0)(A_{s}(N),g,x_{s}^{0}) are defined as:

The instances has dimensions N=m+nN=m+n and q=Lr+Lcq={L_{r}}+{L_{c}}.

The initial condition is given by xs0=(xs,10,,xs,N0)Vq,Nx_{s}^{0}=(\mathbf{x}^{0}_{s,1},\cdots,\mathbf{x}^{0}_{s,N})\in{\cal V}_{q,N}, where xs,i0=0\mathbf{x}^{0}_{s,i}=0 for imi\leq m and xs,i0=vim1\mathbf{x}^{0}_{s,i}=\mathbf{v}^{1}_{i-m} for m<im+nm<i\leq m+n.

Now, it is easy to see that, for all t0t\geq 0,

Now we are ready to prove Lemma 1 by applying Theorem 1.

Here (a)(a) follows from Eq. (37) and the definition of ys,j\mathbf{y}_{s,j} (note that j=jmj^{\prime}=j-m); (b)(b) follows from the fact a=g(j)a={\sf g}(j) and Proposition 5.

Applying Theorem 1, we have almost surely

with XpXX\sim p_{X} and ZN(0,Σaa2t)Z\sim{\sf N}(0,\Sigma^{2t}_{aa}). Therefore, to complete the proof we need to show that

By definition of function gg (see Eq.s (32)- (35)), it is easy to see that Eq. (9) reduces to:

Here a=aLra=a^{\prime}-{L_{r}}, XpXX\sim p_{X} and ZatN(0,Σaa2t)Z_{a}^{t}\sim{\sf N}(0,\Sigma^{2t}_{aa}). Also,

We prove relation (40) using induction on tt. The induction basis (t=0t=0) is trivial. Suppose that the claim holds for t1t-1. Then,

This proves the induction claim for tt. Combining (38),(39) and (40), Lemma 1 follows.

We prove the result by induction on tt. For t=0t=0, the claim follows from our definition. Suppose that the claim holds for t1t-1, we prove that for tt.

Writing Eq. (28) for coordinate ii, we have

Restricting to coordinate g(i){\sf g}(i), we get

Here, we have used the fact that e(vkt,yk,g(k),t)e(\mathbf{v}^{t}_{k},\mathbf{y}_{k},{\sf g}(k),t) does not depend on vk,lt\mathbf{v}^{t}_{k,l} for lg(k)l\neq{\sf g}(k).

where we used the induction hypothesis in the last step. Furthermore,

where we used the induction hypothesis in the second equality. The last equality follows from the definition of bit{\sf b}^{t}_{i} (see Eq. (22));

where the second equality follows from (27). This proves the induction claim for uit(g(i))\mathbf{u}_{i}^{t}({\sf g}(i)).

Next we prove the claim for vjt+1(g(j))\mathbf{v}^{t+1}_{j}({\sf g}(j)). Writing Eq. (29) for coordinate jj, we have

Restricting to coordinate g(j){\sf g}(j), we get

Here, we have used the fact that h(ult,wl,g(l),t)h(\mathbf{u}^{t}_{l},\mathbf{w}_{l},{\sf g}(l),t) does not depend on ul,kt\mathbf{u}^{t}_{l,k} for kg(l)k\neq{\sf g}(l).

where the second equality follows from (26). This proves the induction claim for vit+1(g(i))\mathbf{v}_{i}^{t+1}({\sf g}(i)).

Proof of Theorem 1

Letting mt=f(xt;t)m^{t}=f(x^{t};t) for t0t\geq 0, Eq. (5), becomes

This is initialized with m1=0m^{-1}=0 and m0=m0,NVq,Nm^{0}=m^{0,N}\in{\cal V}_{q,N}, a sequence of deterministic vectors in Vq,N{\cal V}_{q,N}, with limsupNN1i=1Nmi02k2<\lim\sup_{N\to\infty}N^{-1}\sum_{i=1}^{N}\|\mathbf{m}^{0}_{i}\|^{2k-2}<\infty. Also recall that the vectors y=(y1,,yN)Vq,Ny=(\mathbf{y}_{1},\dotsc,\mathbf{y}_{N})\in{\cal V}_{q,N} are a fixed sequence indexed by NN, with converging empirical distributions.

The idea of the proof is to study the stochastic process {x0,x1,,xt,}\{x^{0},x^{1},\dots,x^{t},\dots\} taking values in Vq,N{\cal V}_{q,N} without conditioning on the matrix AA. Instead, for each tt, we will compute the conditional distribution of xt+1x^{t+1} given x0,,xtx^{0},\ldots,x^{t}, and hence m0,,mtm^{0},\ldots,m^{t}. More precisely, let St\mathfrak{S}_{t} be the σ\sigma-algebra generated by these variables. We will compute the conditional distributions xt+1Stx^{t+1}|_{\mathfrak{S}_{t}}, by characterizing the conditional distribution of the matrix AA given this filtration.

We therefore have Btmt1=mt1BtT{\sf B}_{t}m_{t-1}=m_{t-1}{\bf B}_{t}^{{\sf T}} and the equations for x1,,xtx^{1},\ldots,x^{t} can be written in matrix form as:

In short Yt1=AMt1Y_{t-1}=AM_{t-1}. Here and below we use [QP][Q|P] to denote the matrix obtained by concatenating QQ and PP horizontally.

2 Main technical Lemma

This condition is useful to rule out trivial degeneracies.

where (Za1,,Zat+1)(Z_{a}^{1},\dots,Z_{a}^{t+1}) is a Gaussian vector independent of YaPaY_{a}\sim P_{a} and, for each ii, ZaiN(0,Σi)Z_{a}^{i}\sim{\sf N}(0,\Sigma^{i})

For all 1r,st,a[q]1\leq r,s\leq t,a\in[q] the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all 0rt0\leq r\leq t the following limit exists and there are positive constants ρr\rho_{r} (independent of NN) such that almost surely

First assume that the sequence of functions FN{\cal F}_{N} is non-trivial. Theorem 1 follows readily from Lemma 2. More specifically, Theorem 1 is obtained by applying Lemma 2(b)(b) to functions ϕ(xi1,,xit)=ψ(xit,yi)\phi(\mathbf{x}^{1}_{i},\dotsc,\mathbf{x}^{t}_{i})=\psi(\mathbf{x}^{t}_{i},\mathbf{y}_{i}).

The resulting sequence of instances is then non-trivial and state evolution applies. Call Σt(ε)\Sigma^{t}({\varepsilon}) the resulting state evolution sequence, and denote by xt(ϵ)x^{t}(\epsilon) the corresponding orbit. Applying Theorem 1, we have

with Zat(ϵ)N(0,Σt(ϵ))Z^{t}_{a}(\epsilon)\sim{\sf N}(0,\Sigma^{t}(\epsilon)). In order to prove the same theorem for the orbit {xt}t0\{x^{t}\}_{t\geq 0}, we need to show the following two facts:

Let aN(ϵ)=1Ni=1Nψ(xit(ϵ),yi)a_{N}(\epsilon)=\frac{1}{N}\sum_{i=1}^{N}\psi(\mathbf{x}^{t}_{i}(\epsilon),\mathbf{y}_{i}). Then aN(ϵ)aN(0)Cϵ|a_{N}(\epsilon)-a_{N}(0)|\leq C\epsilon, with constant CC being independent of NN.

where the last step follows from (ii)(ii) and Eq. (66). Therefore, taking the limit of both sides as ϵ0\epsilon\to 0,

where the last step follows from (i)(i). This proves Theorem 1 for {xt}t0\{x^{t}\}_{t\geq 0}.

It remains to prove facts (i)(i)-(ii)(ii). The claim in (i)(i) follows readily by applying dominated convergence theorem and noting that ψ(,)\psi(\cdot,\cdot) is Lipschitz continuous.

3 Proof of Lemma 2

The proof is by induction on tt. Let Bt\mathcal{B}_{t} be the property that (59), (60), (61), (63), (64), and (65) hold.

S0\mathfrak{S}_{0} is generated by yy, x0x^{0} and m0m^{0}. Also m0=m0m^{0}=m^{0}_{\perp} since M1M_{-1} is an empty matrix. Hence

Let A^=ACaN\hat{A}=A_{C^{N}_{a}} be the submatrix formed by the rows in CANC^{N}_{A}. Using Lemma 4(c), conditioned on m0m^{0},

where the last step follows by applying B0(b)\mathcal{B}_{0}(b) to the functions ϕ(xi1,yi)=xi1(l)[φa(xi1,yi)]k\phi(\mathbf{x}^{1}_{i},\mathbf{y}_{i})=\mathbf{x}^{1}_{i}(l)[\varphi^{a}(\mathbf{x}_{i}^{1},\mathbf{y}_{i})]_{k}, for all l,k[q]l,k\in[q]. Furthermore, using Lemma 5,

As proved in part (c)(c), limNxCaN1,xCaN1=Σ1\lim_{N\to\infty}\langle x^{1}_{C^{N}_{a}},x^{1}_{C^{N}_{a}}\rangle=\Sigma^{1}. Also, by part (b)(b), the empirical distribution of {(xi1,yi)}iCaN\{(\mathbf{x}_{i}^{1},\mathbf{y}_{i})\}_{i\in C^{N}_{a}} converges weakly to the distribution of (Za,Ya)(Z_{a},Y_{a}), and consequently we get

This proves Eq. (62). To prove Eq. (63), notice that

where the last step holds since a[q]ca=1\sum_{a\in[q]}c_{a}=1. Further,

Combining Eqs. (68), (69), (70) and Eq. (62), we get the desired result.

for a constant cpc_{p}. Therefore, by Theorem 2, we get

Since t=0t=0 and m0=m0m^{0}=m^{0}_{\perp}, the result follows from limNm0,m0=Σ1\lim_{N\to\infty}\langle m^{0},m^{0}\rangle=\Sigma^{1} and that Σ1=b[q]cbΣ^b00\Sigma^{1}=\sum_{b\in[q]}c_{b}\widehat{\Sigma}^{0}_{b}\succ 0.

Suppose that Bt1\mathcal{B}_{t-1} holds. We prove Bt\mathcal{B}_{t}.

has a finite limit as NN\to\infty by the induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b). Furthermore, mit=g(xit,yi,a,t)\mathbf{m}_{i}^{t}=g(\mathbf{x}^{t}_{i},\mathbf{y}_{i},a,t), for iCaNi\in C^{N}_{a}. By induction hypothesis Bt1(a)\mathcal{B}_{t-1}(a), it is sufficient to show that there exists ρ>0\rho>0 depending on tt such that,

Using Bt1(e)\mathcal{B}_{t-1}(e), we can choose UU large enough to ensure that there exists at least N/2N/2 values of the indices i[N]i\in[N] such that r=0t1αr1TxirU\|\sum_{r=0}^{t-1}\alpha_{r-1}^{{\sf T}}\mathbf{x}^{r}_{i}\|\leq U. Note that UU and therefore ε{\varepsilon} depend on tt but do not depend on NN. The lower bound (71) follows then by taking ρ=ε/4\rho={\varepsilon}/4.

Using Mt1Tmt=0M_{t-1}^{\sf T}m^{t}_{\perp}=0 and Yt1=AMt1Y_{t-1}=AM_{t-1}, it is immediate to see that

Moreover, Yt1Tmt=Xt1TmtY_{t-1}^{\sf T}m^{t}_{\perp}=X_{t-1}^{\sf T}m^{t}_{\perp} because Mt2Tmt=0M_{t-2}^{\sf T}m^{t}_{\perp}=0. Recalling mt=Mt1αm^{t}_{\parallel}=M_{t-1}\alpha we need to show

To simplify the notation denote the matrix Mt1TMt1/NM_{t-1}^{\sf T}M_{t-1}/N by GG. Therefore,

But using the induction hypothesis Bt1(d)\mathcal{B}_{t-1}(d) for φ=f(;1),,f(;t)\varphi=f(\cdot;1),\dotsc,f(\cdot;t), the term xr,mts=0t1msαs\langle x^{r},m^{t}-\sum_{s=0}^{t-1}m^{s}\alpha_{s}\rangle is almost surely equal to the limit of xr,xtBtTs=0t1xr,xsBsTαs\langle x^{r},x^{t}\rangle{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}\langle x^{r},x^{s}\rangle{\bf B}_{s}^{\sf T}\alpha_{s}. This can be modified, using the induction hypothesis Bt1(c)\mathcal{B}_{t-1}(c), to mr1,mt1BtTs=0t1mr1,ms1BsTαs\langle m^{r-1},m^{t-1}\rangle{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}\langle m^{r-1},m^{s-1}\rangle{\bf B}_{s}^{\sf T}\alpha_{s} almost surely, which can be written as G(r),(t)BtTs=0t1G(r),(s)BsTαsG_{(r),(t)}{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}G_{(r),(s)}{\bf B}_{s}^{\sf T}\alpha_{s}. Hence,

Notice that in the above equalities we used the fact that GG has, almost surely, a non-singular limit as NN\to\infty which was discussed in part (f)(f). ∎

The proof of Eq. (59) follows immediately since the last lemma yields

Note that, using Lemma 4(d), as NN\to\infty,

For r,s<tr,s<t we can use induction hypothesis. For r=t,s<tr=t,s<t,

Now, by induction hypothesis Bt1(d)\mathcal{B}_{t-1}(d), for φ(v,u)=g(v,u,a,i)\varphi(\mathbf{v},\mathbf{u})=g(\mathbf{v},\mathbf{u},a,i), each term mCaNi,xCaNs+1\langle m_{C^{N}_{a}}^{i},x_{C^{N}_{a}}^{s+1}\rangle has a finite limit. Thus,

where the last line uses the definition of αi\alpha_{i} and mtmsm_{\perp}^{t}\perp m^{s}.

This part follows by a very similar argument to the one in the proof of Lemma 11 (Step Bt(e)\mathcal{B}_{t}(e)) in [BM11].

Therefore, using Cauchy-Schwartz inequality twice, we have

Hence for any fixed tt, (73) vanishes almost surely when NN goes to \infty.

Now given, x1,,xt\mathbf{x}^{1},\ldots,\mathbf{x}^{t}, consider the random variables

Hence we can use induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b) for

with ZN(0,Iq×q)Z\sim{\sf N}(0,{\rm I}_{q\times q}) independent of xir+1\mathbf{x}^{r+1}_{i}, rt1r\leq t-1, to show

On the other hand as proved in part (c)(c),

By induction hypothesis Bt1(b)\mathcal{B}_{t-1}(b) for the pseudo-Lipschitz functions

for gaussian vectors Zar+1 N(0,Σr+1)Z_{a}^{r+1}~{}\sim{\sf N}(0,\Sigma^{r+1}), Zas+1 N(0,Σs+1)Z_{a}^{s+1}~{}\sim{\sf N}(0,\Sigma^{s+1}). Using Lemma 5, we have almost surely,

By another application of part (b)(b) for ϕ(xi1,,xit+1,yi)=xir+1(l)xis+1(k)\phi(\mathbf{x}_{i}^{1},\ldots,\mathbf{x}_{i}^{t+1},\mathbf{y}_{i})=\mathbf{x}_{i}^{r+1}(l)\mathbf{x}_{i}^{s+1}(k) for all l,k[q]l,k\in[q],

Eq. (63) follows from Eq. (62) exactly by the same argument as in B0(d)\mathcal{B}_{0}(d).

Appendix A Reference probability results

In this appendix, we summarize a few probability facts that are repeatedly used in the proof of Lemma 2. We start by the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].

Next, we present a standard property of Gaussian matrices without proof. This is a generalization of [BM11, Lemma 2].

The following law of large numbers is a generalization of [BM11, Lemma 4] and can be proved in a very similar manner.

References