State Evolution for General Approximate Message Passing Algorithms, with Applications to Spatial Coupling

Adel Javanmard, Andrea Montanari

Introduction

Approximate message passing (AMP) algorithms [DMM09] apply ideas from graphical models (belief propagation [Pea88]) and statistical physics (mean field or TAP equations [MPV87, MM09]) to statistical estimation. In particular AMP applies to problems that do not admit a sparse graphical model description. An AMP algorithm takes the form

In this paper we focus on Gaussian matrices and consider a different type of generalization that was motivated by the following recent developments.

In information theory parlance, the vector $(Ax)$ is passed through a memoryless channel with transition probability $p(\,\cdot\,|\,\cdot\,)$ . From a statistics point of view, this corresponds to estimation of a generalized linear model [NW72, MN89]. The linear model (3) is recovered as the special case in which the channel is Gaussian or, more generally, the noise is purely additive. Rangan conjectured that suitable state evolution equations hold for G-AMP algorithms as well, but did not provide a formal proof.

Bean, Bickel, El Karoui and Yu [BBEKY12] recently considered the problem of estimating the unknown vector $x$ in the linear model (3) using robust regression. They developed exact asymptotic expressions for the risk that are analogous to the one proved in [BM12] for the Lasso. The results of [BBEKY12] are, on the other hand, based on a heuristic derivation.

The proof in [BM12] was based on the state evolution analysis of a suitable AMP algorithm whose fixed points coincide with the Lasso optima. This is suggestive of a possible approach for proving the results of [BBEKY12]: define a suitable AMP algorithm for solving the robust regression problem, and analyze it through state evolution. Indeed a comparison of the formulae in [BBEKY12] with the state evolution formulae in [Ran11] appears encouraging.

In this paper we establish a rigorous generalization of state evolution that covers all of the above developments. Applications to generalized AMP are already discussed in [Ran11], and applications to spatially coupled sensing matrices can be found in [DJM11b] and Section 3. Finally, applications to robust regression are left for future study.

Remarkably, all of the above applications can be derived by treating the following generalization of the iteration (1), (2). (A formal definition is given in the next section.)

Our proof uses the technique of [BM11], which in turn builds on an idea first introduced by Bolthausen [Bol12]. A convenient simplification with respect to [BM11] consists in studying a recursion in which the rectangular matrix $A$ is replaced by a symmetric matrix, and the algorithm state is described by a single vector.

In Section 2 we put forward formal definitions and state our main result for the case of symmetric matrices. In Section 3 we show how the case of rectangular matrices can be reduced to the symmetric one. We also show how our result applies to the case of compressed sensing reconstruction with spatially coupled matrices. Finally, we prove our main result in Section 4.

Main result

A symmetric AMP instance is a triple $(A,{\cal F},x^{0})$ where:

$x^{0}\in{\cal V}_{q,N}$ is an initial condition.

Given ${\cal F}=\{f^{k}:k\in[N]\}$ , we define $f(\,\cdot\,;t):{\cal V}_{q,N}\to{\cal V}_{q,N}$ by letting $v^{\prime}=f(v;t)$ be given by $\mathbf{v}^{\prime}_{i}=f^{i}(\mathbf{v}_{i};t)$ for all $i\in[N]$ .
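The row-wise action of $f(\,\cdot\,;t)$ can be illustrated with a minimal sketch. It assumes that an element of ${\cal V}_{q,N}$ is stored as an $N\times q$ array whose $i$-th row is $\mathbf{v}_{i}\in\mathbb{R}^{q}$; the helper name `apply_separable` and the example family of functions are hypothetical and serve only as illustration.

```python
import numpy as np

def apply_separable(fs, v, t):
    """Return v' with rows v'_i = f^i(v_i; t).

    fs : list of N callables, fs[i](row, t) -> array of shape (q,)
    v  : array of shape (N, q); row i plays the role of v_i in R^q
    """
    v = np.asarray(v, dtype=float)
    return np.stack([fs[i](v[i], t) for i in range(v.shape[0])])

# Hypothetical family: f^i scales row i by (i + 1) and shifts it by t.
N, q = 3, 2
fs = [lambda row, t, c=i + 1: c * row + t for i in range(N)]
v = np.ones((N, q))
v_next = apply_separable(fs, v, t=0)
```

Each row is transformed by its own function, so coordinates of different rows never interact inside $f$; all coupling between rows enters through the matrix $A$.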

The approximate message passing orbit corresponding to the instance $(A,{\cal F},x^{0})$ is the sequence of vectors $\{x^{t}\}_{t\geq 0}$ , $x^{t}\in{\cal V}_{q,N}$ defined as follows, for $t\geq 0$ ,

Here ${\sf B}_{t}:{\cal V}_{q,N}\to{\cal V}_{q,N}$ is the linear operator defined by letting, for $v^{\prime}={\sf B}_{t}v$ ,

In order to establish the behavior of the sequence $\{x^{t}\}_{t\geq 0}$ in the high dimensional limit, we need to consider a sequence of AMP instances $\{A(N),{\cal F}_{N},x^{0,N}\}_{N\geq 0}$ indexed by the dimension $N$ .

For each $a\in[q]$ , we have $\lim_{N\to\infty}|C^{N}_{a}|/N=c_{a}\in(0,1)$ .

For each $N\geq 0$ , each $a\in[q]$ and each $i\in C_{a}^{N}$ , we have $f^{i}(\mathbf{x},t)=g(\mathbf{x},\mathbf{y}_{i},a,t)$ . Further, the empirical distribution of $\{\mathbf{y}_{i}\}_{i\in C^{N}_{a}}$ , denoted by $\hat{P}_{a}$ , converges weakly to $P_{a}$ .

An apparent generalization of the above definition would require the partition to be $C^{N}_{1}\cup C^{N}_{2}\cup\dots\cup C^{N}_{q^{\prime}}=[N]$ , while $x^{t}\in{\cal V}_{q,N}$ , with $q\neq q^{\prime}$ . It is easy to see that there is no loss of generality in assuming $q=q^{\prime}$ as we do in our definition. Indeed the case $q^{\prime}<q$ can be reduced to our setting by refining the partition arbitrarily, and $q^{\prime}>q$ by adding dummy coordinates to the variables $\mathbf{x}_{i}$ .

The function $f^{i}(\,\cdot\,,\,\cdot\,)$ depends implicitly on $\mathbf{y}_{i}$ . However, the $\mathbf{y}_{i}$ ’s do not change across iterations and so we do not show this dependence explicitly in our notation.

for all $a\in[q]$ . Here $Y_{a}\sim P_{a}$ , $Z_{a}^{t}\sim{\sf N}\left(0,\Sigma^{t}\right)$ and $Y_{a}$ and $Z_{a}^{t}$ are independent.

where $Z_{a}^{t}\sim{\sf N}(0,\Sigma^{t})$ is independent of $Y_{a}\sim P_{a}$ .
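For intuition, the scalar case can be simulated directly. The sketch below is not the general construction of this section: it assumes $q=1$, no side information $\mathbf{y}_{i}$, a single nonlinearity $f=\tanh$ for all coordinates, a matrix $A=G+G^{\sf T}$ with $G_{ij}\sim{\sf N}(0,1/(2N))$, and the recursion $x^{t+1}=Af(x^{t})-b_{t}f(x^{t-1})$ with $b_{t}=N^{-1}\sum_{i}f^{\prime}(x^{t}_{i})$. Under these assumptions the state evolution reduces to the scalar recursion $\tau_{t+1}^{2}=\mathbb{E}\{f(\tau_{t}Z)^{2}\}$ with $Z\sim{\sf N}(0,1)$. All function names and numerical values are illustrative choices.

```python
import numpy as np

def symmetric_amp(A, f, f_prime, x0, n_iter):
    """Run x^{t+1} = A f(x^t) - b_t f(x^{t-1}), with b_t = mean(f'(x^t))."""
    xs = [x0]
    m_prev = np.zeros_like(x0)          # m^{-1} = 0
    x = x0
    for _ in range(n_iter):
        m = f(x)                        # m^t = f(x^t)
        b = np.mean(f_prime(x))         # scalar Onsager coefficient (q = 1)
        x = A @ m - b * m_prev
        m_prev = m
        xs.append(x)
    return xs

def state_evolution(f, x0, n_iter, n_nodes=80):
    """tau_1^2 = mean(f(x^0)^2), tau_{t+1}^2 = E[f(tau_t Z)^2], Z ~ N(0, 1)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to the N(0,1) density
    taus2 = [np.mean(f(x0) ** 2)]
    for _ in range(n_iter - 1):
        tau = np.sqrt(taus2[-1])
        taus2.append(np.sum(weights * f(tau * nodes) ** 2))
    return taus2

rng = np.random.default_rng(0)
N, n_iter = 2000, 4                      # illustrative sizes
G = rng.normal(scale=np.sqrt(1.0 / (2 * N)), size=(N, N))
A = G + G.T                              # symmetric, off-diagonal variance 1/N
f = np.tanh
f_prime = lambda x: 1.0 - np.tanh(x) ** 2
x0 = np.ones(N)                          # deterministic initial condition

xs = symmetric_amp(A, f, f_prime, x0, n_iter)
empirical = [np.mean(x ** 2) for x in xs[1:]]    # (1/N) ||x^t||^2, t = 1..n_iter
predicted = state_evolution(f, x0, n_iter)       # tau_t^2,        t = 1..n_iter
```

At this matrix size the two lists should agree up to fluctuations of order $N^{-1/2}$. Dropping the memory term $b_{t}f(x^{t-1})$ from the recursion is a simple way to see that the agreement relies on it.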

AMP for rectangular and spatially-coupled matrices

In this section we develop two applications of our main theorem:

We show that AMP iterations with $A$ a rectangular matrix, see e.g. Eqs. (1), (2), can be recast in the form of an iteration with a symmetric matrix $A$ and are therefore covered by Theorem 1. This construction is provided in Section 3.5 (below Proposition 5).

We apply the general Theorem 1 to AMP reconstruction in compressed sensing with spatially coupled matrices. In [DJM11b], it was proved that, conditional on a state evolution lemma, this approach achieves the information-theoretic limits of compressed sensing set forth in [WV10]. Here we show that our main result Theorem 1 implies the state evolution lemma (Lemma $4.1$ in [DJM11b]).

We will let $|{\sf R}|\equiv L_{r}$ and $|{\sf C}|\equiv L_{c}$ denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by $n=n_{0}L_{c}$ and $m=m_{0}L_{r}$ .

In order to describe a random matrix $A\sim{\cal M}(W,m_{0},n_{0})$ from this ensemble, partition the column and row indices of $A$ into, respectively, $L_{c}$ and $L_{r}$ groups of equal size. Explicitly

Further, if $i\in R_{r}$ or $j\in C_{s}$ we will write, respectively, $r={\sf g}(i)$ or $s={\sf g}(j)$ . In other words ${\sf g}(\,\cdot\,)$ is the operator determining the group index of a given row or column.
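The group-index operator ${\sf g}(\,\cdot\,)$ can be sketched as follows. The sketch assumes that the groups are consecutive blocks of indices (the first $n_{0}$ columns form the first group, and so on) and that indices are 1-based; the function names `group_index` and `groups` and the sizes used are hypothetical.

```python
def group_index(k, group_size):
    """Group of the 1-based index k when indices are split into consecutive
    blocks of `group_size` elements: {1..size} -> 1, {size+1..2*size} -> 2, ..."""
    if k < 1:
        raise ValueError("indices are 1-based")
    return (k - 1) // group_size + 1

def groups(num_groups, group_size):
    """Return the partition as a dict: group label -> list of 1-based indices."""
    return {s: list(range((s - 1) * group_size + 1, s * group_size + 1))
            for s in range(1, num_groups + 1)}

# Hypothetical sizes: L_c = 3 column groups of n_0 = 4 columns each (n = 12).
n0, Lc = 4, 3
C = groups(Lc, n0)
```

The same two helpers apply to rows with `group_size` set to $m_{0}$ and `num_groups` set to $L_{r}$.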

With this notation we have the following concise definition of the ensemble.

A random sensing matrix $A$ is distributed according to the ensemble ${\cal M}(W,m_{0},n_{0})$ (and we write $A\sim{\cal M}(W,m_{0},n_{0})$ ) if the entries $\{A_{ij},\;\;i\in[m],j\in[n]\}$ are independent Gaussian random variables with

See Fig. 1 for a schematic of matrix $A$ . Note that the ensemble ${\cal M}(W,m_{0},n_{0})$ includes, as a special case, rectangular non-symmetric matrices with i.i.d. entries.
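A matrix with this block structure can be sampled as in the sketch below. It assumes that the entries are centered and that the variance of $A_{ij}$ equals $W_{{\sf g}(i),{\sf g}(j)}/m_{0}$; this normalization is an assumption of the sketch, as are the function name `sample_block_gaussian` and the particular profile $W$.

```python
import numpy as np

def sample_block_gaussian(W, m0, n0, rng):
    """Draw A (m x n, m = m0*L_r, n = n0*L_c) with independent Gaussian entries
    whose variance is constant on blocks: Var(A_ij) = W[g(i), g(j)] / m0.
    The 1/m0 scaling is an assumption of this sketch."""
    W = np.asarray(W, dtype=float)
    variance = np.kron(W, np.ones((m0, n0))) / m0   # expand W to entry level
    return rng.normal(size=variance.shape) * np.sqrt(variance)

rng = np.random.default_rng(1)
W = np.array([[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]])      # hypothetical L_r = 2, L_c = 3 profile
A = sample_block_gaussian(W, m0=50, n0=40, rng=rng)
```

Taking $W$ to be a $1\times 1$ matrix gives a rectangular matrix with i.i.d. entries, which is the special case mentioned above.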

3.2 AMP for compressed sensing reconstruction

3.3 State evolution

For all $t\geq 0$ , $a\in{\sf R}$ , and $i\in{\sf C}$ , let

Here and below, ${\sf mmse}(s)$ denotes the minimum mean square error in estimating $X\sim p_{X}$ from a noisy observation in Gaussian noise, at signal-to-noise ratio $s$ . Formally,

In the constructions for the matrix $Q^{t}$ , the nonlinearities $\eta_{t}$ , and the vector ${\sf b}^{t}$ , we use the fact that the state evolution sequence can be precomputed.
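Since the state evolution sequence is precomputed, the function ${\sf mmse}(\,\cdot\,)$ must be evaluated numerically for the prior at hand. The sketch below illustrates this for a case where the answer is known in closed form. It assumes the observation model $Y=\sqrt{s}\,X+Z$ with $Z\sim{\sf N}(0,1)$ independent of $X$, and a standard Gaussian prior $p_{X}={\sf N}(0,1)$, for which $\mathbb{E}[X|Y]=\sqrt{s}\,Y/(1+s)$ and ${\sf mmse}(s)=1/(1+s)$. The Gaussian prior, the function names and the sample size are illustrative assumptions, not the setting of the compressed sensing application.

```python
import numpy as np

def mmse_gaussian_prior(s):
    """Closed form for X ~ N(0, 1) observed as Y = sqrt(s) X + Z: 1 / (1 + s)."""
    return 1.0 / (1.0 + s)

def mmse_monte_carlo(s, posterior_mean, sample_prior, n_samples, rng):
    """Estimate E[(X - E[X | Y])^2] for Y = sqrt(s) X + Z by simulation."""
    x = sample_prior(n_samples, rng)
    y = np.sqrt(s) * x + rng.normal(size=n_samples)
    return np.mean((x - posterior_mean(y, s)) ** 2)

rng = np.random.default_rng(2)
s = 3.0
estimate = mmse_monte_carlo(
    s,
    posterior_mean=lambda y, s: np.sqrt(s) * y / (1.0 + s),  # E[X | Y], N(0,1) prior
    sample_prior=lambda n, rng: rng.normal(size=n),
    n_samples=200_000,
    rng=rng,
)
```

For a non-Gaussian $p_{X}$ the same Monte Carlo routine applies once `posterior_mean` and `sample_prior` are replaced by the conditional expectation and sampler for that prior.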

The nonlinearity $\eta_{t}$ is chosen as follows:

where $\eta_{t,i}$ is the conditional expectation estimator for $X\sim p_{X}$ in Gaussian noise:

Finally, in order to define the vector ${\sf b}^{t}_{i}$ , let us introduce the quantity (with $\eta^{\prime}_{t,i}$ denoting the derivative of $v_{i}\mapsto\eta_{t,i}(v_{i})$ )

The vector ${\sf b}^{t}$ is then defined by

where we defined $Q^{t}_{i,j}=\widetilde{Q}^{t}_{r,u}$ for $i\in R_{r}$ , $j\in C_{u}$ .

The following lemma (Lemma $4.1$ in [DJM11b]) states that the state evolution (17) allows an exact asymptotic analysis of the AMP algorithm (14)-(15) in the limit of a large number of dimensions.

3.5 Proof of Lemma 1

We show that Lemma 1 follows from Theorem 1. Consider the following change of variables:

Consider the following approximate message passing orbit with vectors $\{v^{t},u^{t}\}_{t\geq 0}$ , $v^{t}\in{\cal V}_{q,n}$ , $u^{t}\in{\cal V}_{q,m}$ :

for given $y\in{\cal V}_{q,n}$ and $w\in{\cal V}_{q,m}$ . Here ${\sf B}_{t}:{\cal V}_{q,m}\to{\cal V}_{q,m}$ is the linear operator defined by letting, for $z^{\prime}={\sf B}_{t}z$ , and any $i\in[m]$ ,

Analogously ${\sf D}_{t}:{\cal V}_{q,n}\to{\cal V}_{q,n}$ is the linear operator defined by letting, for $z^{\prime}={\sf D}_{t}z$ , and any $j\in[n]$ ,

Assume that $y=(\mathbf{y}_{1},\dotsc,\mathbf{y}_{n})$ , $w=(\mathbf{w}_{1},\dotsc,\mathbf{w}_{m})$ , and $v^{1}=(\mathbf{v}^{1}_{1},\dotsc,\mathbf{v}^{1}_{n})$ are given by

We refer to Section 3.5.1 for the proof of Proposition 5.

We proceed by constructing a suitable converging sequence of symmetric AMP instances, recognizing that a subset of the resulting orbit corresponds to the orbit $\{v^{t},u^{t}\}$ of interest. The converging symmetric AMP instances $(A_{s}(N),g,x_{s}^{0})$ are defined as:

The instance has dimensions $N=m+n$ and $q={L_{r}}+{L_{c}}$ .

The initial condition is given by $x_{s}^{0}=(\mathbf{x}^{0}_{s,1},\cdots,\mathbf{x}^{0}_{s,N})\in{\cal V}_{q,N}$ , where $\mathbf{x}^{0}_{s,i}=0$ for $i\leq m$ and $\mathbf{x}^{0}_{s,i}=\mathbf{v}^{1}_{i-m}$ for $m<i\leq m+n$ .
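The reduction from a rectangular to a symmetric matrix can be illustrated with a small sketch. It assumes that the symmetric matrix has the block form with $A$ in the upper-right block, $A^{\sf T}$ in the lower-left block and zeros on the diagonal blocks, with any normalization constant omitted; this block form, the helper names and the sizes are assumptions of the sketch. The initial condition follows the rule stated above: zero rows for $i\leq m$, then the rows of $v^{1}$.

```python
import numpy as np

def symmetrize(A):
    """Embed an m x n matrix A in the (m+n) x (m+n) symmetric block matrix
    [[0, A], [A^T, 0]]. Any normalization constant is omitted (assumption)."""
    m, n = A.shape
    A_s = np.zeros((m + n, m + n))
    A_s[:m, m:] = A
    A_s[m:, :m] = A.T
    return A_s

def initial_condition(v1, m):
    """Stack m zero rows on top of v^1, matching x_{s,i}^0 = 0 for i <= m and
    x_{s,i}^0 = v^1_{i-m} for m < i <= m + n."""
    n, q = v1.shape
    return np.vstack([np.zeros((m, q)), v1])

rng = np.random.default_rng(3)
m, n, q = 4, 6, 2                      # hypothetical small sizes
A = rng.normal(size=(m, n))
v1 = rng.normal(size=(n, q))
A_s = symmetrize(A)
x0 = initial_condition(v1, m)
```

With this embedding, multiplying by `A_s` maps the last $n$ rows to the first $m$ rows through $A$ and the first $m$ rows to the last $n$ rows through $A^{\sf T}$, which is how a single symmetric iteration can carry the two alternating updates of the rectangular one.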

Now, it is easy to see that, for all $t\geq 0$ ,

Now we are ready to prove Lemma 1 by applying Theorem 1.

Here $(a)$ follows from Eq. (37) and the definition of $\mathbf{y}_{s,j}$ (note that $j^{\prime}=j-m$ ); $(b)$ follows from the fact $a={\sf g}(j)$ and Proposition 5.

Applying Theorem 1, we have almost surely

with $X\sim p_{X}$ and $Z\sim{\sf N}(0,\Sigma^{2t}_{aa})$ . Therefore, to complete the proof we need to show that

By definition of function $g$ (see Eqs. (32)-(35)), it is easy to see that Eq. (9) reduces to:

Here $a=a^{\prime}-{L_{r}}$ , $X\sim p_{X}$ and $Z_{a}^{t}\sim{\sf N}(0,\Sigma^{2t}_{aa})$ . Also,

We prove relation (40) using induction on $t$ . The induction basis ( $t=0$ ) is trivial. Suppose that the claim holds for $t-1$ . Then,

This proves the induction claim for $t$ . Combining (38), (39) and (40), Lemma 1 follows.

We prove the result by induction on $t$ . For $t=0$ , the claim follows from our definition. Suppose that the claim holds for $t-1$ ; we prove it for $t$ .

Writing Eq. (28) for coordinate $i$ , we have

Restricting to coordinate ${\sf g}(i)$ , we get

Here, we have used the fact that $e(\mathbf{v}^{t}_{k},\mathbf{y}_{k},{\sf g}(k),t)$ does not depend on $\mathbf{v}^{t}_{k,l}$ for $l\neq{\sf g}(k)$ .

where we used the induction hypothesis in the last step. Furthermore,

where we used the induction hypothesis in the second equality. The last equality follows from the definition of ${\sf b}^{t}_{i}$ (see Eq. (22));

where the second equality follows from (27). This proves the induction claim for $\mathbf{u}_{i}^{t}({\sf g}(i))$ .

Next we prove the claim for $\mathbf{v}^{t+1}_{j}({\sf g}(j))$ . Writing Eq. (29) for coordinate $j$ , we have

Restricting to coordinate ${\sf g}(j)$ , we get

Here, we have used the fact that $h(\mathbf{u}^{t}_{l},\mathbf{w}_{l},{\sf g}(l),t)$ does not depend on $\mathbf{u}^{t}_{l,k}$ for $k\neq{\sf g}(l)$ .

where the second equality follows from (26). This proves the induction claim for $\mathbf{v}_{j}^{t+1}({\sf g}(j))$ .

Proof of Theorem 1

Letting $m^{t}=f(x^{t};t)$ for $t\geq 0$ , Eq. (5) becomes

This is initialized with $m^{-1}=0$ and $m^{0}=m^{0,N}\in{\cal V}_{q,N}$ , a sequence of deterministic vectors in ${\cal V}_{q,N}$ , with $\limsup_{N\to\infty}N^{-1}\sum_{i=1}^{N}\|\mathbf{m}^{0}_{i}\|^{2k-2}<\infty$ . Also recall that the vectors $y=(\mathbf{y}_{1},\dotsc,\mathbf{y}_{N})\in{\cal V}_{q,N}$ are a fixed sequence indexed by $N$ , with converging empirical distributions.

The idea of the proof is to study the stochastic process $\{x^{0},x^{1},\dots,x^{t},\dots\}$ taking values in ${\cal V}_{q,N}$ without conditioning on the matrix $A$ . Instead, for each $t$ , we will compute the conditional distribution of $x^{t+1}$ given $x^{0},\ldots,x^{t}$ , and hence $m^{0},\ldots,m^{t}$ . More precisely, let $\mathfrak{S}_{t}$ be the $\sigma$ -algebra generated by these variables. We will compute the conditional distributions $x^{t+1}|_{\mathfrak{S}_{t}}$ , by characterizing the conditional distribution of the matrix $A$ given this filtration.

We therefore have ${\sf B}_{t}m_{t-1}=m_{t-1}{\bf B}_{t}^{{\sf T}}$ and the equations for $x^{1},\ldots,x^{t}$ can be written in matrix form as:

In short $Y_{t-1}=AM_{t-1}$ . Here and below we use $[Q|P]$ to denote the matrix obtained by concatenating $Q$ and $P$ horizontally.

4.2 Main technical Lemma

This condition is useful to rule out trivial degeneracies.

where $(Z_{a}^{1},\dots,Z_{a}^{t+1})$ is a Gaussian vector independent of $Y_{a}\sim P_{a}$ and, for each $i$ , $Z_{a}^{i}\sim{\sf N}(0,\Sigma^{i})$

For all $1\leq r,s\leq t,a\in[q]$ the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):

For all $0\leq r\leq t$ the following limit exists and there are positive constants $\rho_{r}$ (independent of $N$ ) such that almost surely

First assume that the sequence of functions ${\cal F}_{N}$ is non-trivial. Theorem 1 follows readily from Lemma 2. More specifically, Theorem 1 is obtained by applying Lemma 2 $(b)$ to functions $\phi(\mathbf{x}^{1}_{i},\dotsc,\mathbf{x}^{t}_{i})=\psi(\mathbf{x}^{t}_{i},\mathbf{y}_{i})$ .

The resulting sequence of instances is then non-trivial and state evolution applies. Call $\Sigma^{t}(\epsilon)$ the resulting state evolution sequence, and denote by $x^{t}(\epsilon)$ the corresponding orbit. Applying Theorem 1, we have

with $Z^{t}_{a}(\epsilon)\sim{\sf N}(0,\Sigma^{t}(\epsilon))$ . In order to prove the same theorem for the orbit $\{x^{t}\}_{t\geq 0}$ , we need to show the following two facts:

Let $a_{N}(\epsilon)=\frac{1}{N}\sum_{i=1}^{N}\psi(\mathbf{x}^{t}_{i}(\epsilon),\mathbf{y}_{i})$ . Then $|a_{N}(\epsilon)-a_{N}(0)|\leq C\epsilon$ , with constant $C$ being independent of $N$ .

where the last step follows from $(ii)$ and Eq. (66). Therefore, taking the limit of both sides as $\epsilon\to 0$ ,

where the last step follows from $(i)$ . This proves Theorem 1 for $\{x^{t}\}_{t\geq 0}$ .

It remains to prove facts $(i)$ - $(ii)$ . The claim in $(i)$ follows readily by applying the dominated convergence theorem and noting that $\psi(\cdot,\cdot)$ is Lipschitz continuous.

4.3 Proof of Lemma 2

The proof is by induction on $t$ . Let $\mathcal{B}_{t}$ be the property that (59), (60), (61), (63), (64), and (65) hold.

$\mathfrak{S}_{0}$ is generated by $y$ , $x^{0}$ and $m^{0}$ . Also $m^{0}=m^{0}_{\perp}$ since $M_{-1}$ is an empty matrix. Hence

Let $\hat{A}=A_{C^{N}_{a}}$ be the submatrix formed by the rows in $C^{N}_{a}$ . Using Lemma 4(c), conditioned on $m^{0}$ ,

where the last step follows by applying $\mathcal{B}_{0}(b)$ to the functions $\phi(\mathbf{x}^{1}_{i},\mathbf{y}_{i})=\mathbf{x}^{1}_{i}(l)[\varphi^{a}(\mathbf{x}_{i}^{1},\mathbf{y}_{i})]_{k}$ , for all $l,k\in[q]$ . Furthermore, using Lemma 5,

As proved in part $(c)$ , $\lim_{N\to\infty}\langle x^{1}_{C^{N}_{a}},x^{1}_{C^{N}_{a}}\rangle=\Sigma^{1}$ . Also, by part $(b)$ , the empirical distribution of $\{(\mathbf{x}_{i}^{1},\mathbf{y}_{i})\}_{i\in C^{N}_{a}}$ converges weakly to the distribution of $(Z_{a},Y_{a})$ , and consequently we get

This proves Eq. (62). To prove Eq. (63), notice that

where the last step holds since $\sum_{a\in[q]}c_{a}=1$ . Further,

Combining Eqs. (68), (69), (70) and Eq. (62), we get the desired result.

for a constant $c_{p}$ . Therefore, by Theorem 2, we get

Since $t=0$ and $m^{0}=m^{0}_{\perp}$ , the result follows from $\lim_{N\to\infty}\langle m^{0},m^{0}\rangle=\Sigma^{1}$ and the fact that $\Sigma^{1}=\sum_{b\in[q]}c_{b}\widehat{\Sigma}^{0}_{b}\succ 0$ .

Suppose that $\mathcal{B}_{t-1}$ holds. We prove $\mathcal{B}_{t}$ .

has a finite limit as $N\to\infty$ by the induction hypothesis $\mathcal{B}_{t-1}(b)$ . Furthermore, $\mathbf{m}_{i}^{t}=g(\mathbf{x}^{t}_{i},\mathbf{y}_{i},a,t)$ , for $i\in C^{N}_{a}$ . By induction hypothesis $\mathcal{B}_{t-1}(a)$ , it is sufficient to show that there exists $\rho>0$ depending on $t$ such that,

Using $\mathcal{B}_{t-1}(e)$ , we can choose $U$ large enough to ensure that there exist at least $N/2$ values of the indices $i\in[N]$ such that $\|\sum_{r=0}^{t-1}\alpha_{r-1}^{{\sf T}}\mathbf{x}^{r}_{i}\|\leq U$ . Note that $U$ and therefore ${\varepsilon}$ depend on $t$ but do not depend on $N$ . The lower bound (71) follows then by taking $\rho={\varepsilon}/4$ .

Using $M_{t-1}^{\sf T}m^{t}_{\perp}=0$ and $Y_{t-1}=AM_{t-1}$ , it is immediate to see that

Moreover, $Y_{t-1}^{\sf T}m^{t}_{\perp}=X_{t-1}^{\sf T}m^{t}_{\perp}$ because $M_{t-2}^{\sf T}m^{t}_{\perp}=0$ . Recalling $m^{t}_{\parallel}=M_{t-1}\alpha$ we need to show

To simplify the notation denote the matrix $M_{t-1}^{\sf T}M_{t-1}/N$ by $G$ . Therefore,

But using the induction hypothesis $\mathcal{B}_{t-1}(d)$ for $\varphi=f(\cdot;1),\dotsc,f(\cdot;t)$ , the term $\langle x^{r},m^{t}-\sum_{s=0}^{t-1}m^{s}\alpha_{s}\rangle$ is almost surely equal to the limit of $\langle x^{r},x^{t}\rangle{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}\langle x^{r},x^{s}\rangle{\bf B}_{s}^{\sf T}\alpha_{s}$ . This can be modified, using the induction hypothesis $\mathcal{B}_{t-1}(c)$ , to $\langle m^{r-1},m^{t-1}\rangle{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}\langle m^{r-1},m^{s-1}\rangle{\bf B}_{s}^{\sf T}\alpha_{s}$ almost surely, which can be written as $G_{(r),(t)}{\bf B}_{t}^{\sf T}-\sum_{s=0}^{t-1}G_{(r),(s)}{\bf B}_{s}^{\sf T}\alpha_{s}$ . Hence,

Notice that in the above equalities we used the fact that $G$ has, almost surely, a non-singular limit as $N\to\infty$ which was discussed in part $(f)$ . ∎
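The decomposition $m^{t}=m^{t}_{\parallel}+m^{t}_{\perp}$ used in the argument above, with $m^{t}_{\parallel}=M_{t-1}\alpha$ and $M_{t-1}^{\sf T}m^{t}_{\perp}=0$, is an ordinary least-squares projection onto the column span of $M_{t-1}$. The sketch below illustrates it in the scalar case $q=1$, where $M_{t-1}$ is an $N\times t$ matrix; the random test data, the sizes and the function name `split_parallel_perp` are hypothetical.

```python
import numpy as np

def split_parallel_perp(M, m_t):
    """Decompose m_t = M @ alpha + m_perp with M.T @ m_perp = 0 (least squares)."""
    alpha, *_ = np.linalg.lstsq(M, m_t, rcond=None)
    m_par = M @ alpha
    return alpha, m_par, m_t - m_par

rng = np.random.default_rng(4)
N, t = 200, 3                          # hypothetical sizes, scalar case q = 1
M = rng.normal(size=(N, t))            # columns play the role of m^0, ..., m^{t-1}
m_t = rng.normal(size=N)
alpha, m_par, m_perp = split_parallel_perp(M, m_t)
G = M.T @ M / N                        # Gram matrix; alpha = G^{-1} (M^T m_t / N)
```

The coefficient vector can equivalently be written as $\alpha=G^{-1}M_{t-1}^{\sf T}m^{t}/N$, which is well defined exactly when $G$ is non-singular; this is why the non-degeneracy of the limit of $G$ matters in the proof.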

The proof of Eq. (59) follows immediately since the last lemma yields

Note that, using Lemma 4(d), as $N\to\infty$ ,

For $r,s<t$ we can use the induction hypothesis. For $r=t,s<t$ ,

Now, by induction hypothesis $\mathcal{B}_{t-1}(d)$ , for $\varphi(\mathbf{v},\mathbf{u})=g(\mathbf{v},\mathbf{u},a,i)$ , each term $\langle m_{C^{N}_{a}}^{i},x_{C^{N}_{a}}^{s+1}\rangle$ has a finite limit. Thus,

where the last line uses the definition of $\alpha_{i}$ and $m_{\perp}^{t}\perp m^{s}$ .

This part follows by a very similar argument to the one in the proof of Lemma $1$ (Step $\mathcal{B}_{t}(e)$ ) in [BM11].

Therefore, using the Cauchy-Schwarz inequality twice, we have

Hence for any fixed $t$ , (73) vanishes almost surely when $N$ goes to $\infty$ .

Now, given $\mathbf{x}^{1},\ldots,\mathbf{x}^{t}$ , consider the random variables

Hence we can use induction hypothesis $\mathcal{B}_{t-1}(b)$ for

with $Z\sim{\sf N}(0,{\rm I}_{q\times q})$ independent of $\mathbf{x}^{r+1}_{i}$ , $r\leq t-1$ , to show

On the other hand as proved in part $(c)$ ,

By induction hypothesis $\mathcal{B}_{t-1}(b)$ for the pseudo-Lipschitz functions

for Gaussian vectors $Z_{a}^{r+1}\sim{\sf N}(0,\Sigma^{r+1})$ , $Z_{a}^{s+1}\sim{\sf N}(0,\Sigma^{s+1})$ . Using Lemma 5, we have almost surely,

By another application of part $(b)$ for $\phi(\mathbf{x}_{i}^{1},\ldots,\mathbf{x}_{i}^{t+1},\mathbf{y}_{i})=\mathbf{x}_{i}^{r+1}(l)\mathbf{x}_{i}^{s+1}(k)$ for all $l,k\in[q]$ ,

Eq. (63) follows from Eq. (62) exactly by the same argument as in $\mathcal{B}_{0}(d)$ .

Appendix A Reference probability results

In this appendix, we summarize a few probability facts that are repeatedly used in the proof of Lemma 2. We start with the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].

Next, we present a standard property of Gaussian matrices without proof. This is a generalization of [BM11, Lemma 2].

The following law of large numbers is a generalization of [BM11, Lemma 4] and can be proved in a very similar manner.